• Part 1. For each document, build a vector of numerical entries as follows:
    1. We assume that the dictionary for the entire collection contains d terms.
    2. We assume that the total number of documents in the collection is n.
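    3. The entry for term i = 1, …, d in document j = 1, …, n is computed as the product of three factors, L_ij · G_i · N_j, where
      • L_ij is the number of times term i appears in document j,
      • G_i = n / n_i, where n_i is the number of documents containing term i,
      • N_j is the normalizing factor, chosen so that the entries of document j sum to 1, i.e. N_j · (∑_i L_ij G_i) = 1, or equivalently N_j = 1 / ∑_i L_ij G_i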
        (for details see "New term weighting formulas for the vector space method in information retrieval", Chisholm and Kolda,
        Report ORNL/TM-13756, Computer Science and Mathematics Division, Oak Ridge National Laboratory).

  • Part 2. The vectors from Part 1 are very sparse, that is, the vast majority of their entries are 0. To save memory, store them in the Harwell-Boeing sparse matrix format.
  • Part 3. Report the size of both matrices: the dense one from Part 1 and the sparse one from Part 2.
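
The three parts can be sketched end to end on a toy collection. This is only an illustration under stated assumptions, not a reference solution: the function names (weight_matrix, to_ccs, storage_counts) are hypothetical, tokenization is plain lowercasing and whitespace splitting, and the sparse layout shown is the in-memory compressed column form (values, row indices, column pointers) that a Harwell-Boeing file stores. The file itself also has header lines and uses 1-based indices, which this sketch does not write.

```python
from collections import Counter


def weight_matrix(docs):
    """Return (terms, A), where A[i][j] = L_ij * G_i * N_j is a dense d x n matrix."""
    # Assumption: lowercasing + whitespace splitting stands in for real preprocessing.
    counts = [Counter(doc.lower().split()) for doc in docs]  # L_ij, one Counter per document j
    terms = sorted(set().union(*counts))                     # dictionary of d terms
    index = {t: i for i, t in enumerate(terms)}
    n = len(docs)
    n_i = Counter(t for c in counts for t in c)              # documents containing term i
    A = [[0.0] * n for _ in terms]
    for j, c in enumerate(counts):
        raw = {t: L * n / n_i[t] for t, L in c.items()}      # L_ij * G_i
        total = sum(raw.values())                            # equals 1 / N_j
        for t, w in raw.items():
            A[index[t]][j] = w / total
    return terms, A


def to_ccs(A):
    """Convert a dense matrix to compressed column storage (0-based indices)."""
    d = len(A)
    n = len(A[0]) if A else 0
    values, row_idx, col_ptr = [], [], [0]
    for j in range(n):
        for i in range(d):
            if A[i][j] != 0.0:
                values.append(A[i][j])
                row_idx.append(i)
        col_ptr.append(len(values))  # where column j + 1 starts
    return values, row_idx, col_ptr


def storage_counts(A, ccs):
    """Count the numbers stored by the dense and the sparse representation."""
    values, row_idx, col_ptr = ccs
    dense = len(A) * (len(A[0]) if A else 0)
    sparse = len(values) + len(row_idx) + len(col_ptr)
    return dense, sparse


docs = ["apple banana apple", "banana cherry", "cherry cherry date"]
terms, A = weight_matrix(docs)
ccs = to_ccs(A)
dense, sparse = storage_counts(A, ccs)
print("dense entries stored:", dense)
print("sparse numbers stored:", sparse)
```

On a collection this small the sparse layout does not save anything, because the index and pointer arrays cost more than the zeros they replace. The saving appears only when the number of nonzero entries is far below d · n, which is the situation Part 2 describes.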