Is there an established workflow for collecting and processing
multiple document collections? That is, if I run N documents
through SparseVectorsFromSequenceFiles and a month later have another 50K
documents I'd like to add to the same corpus, what is the standard
way of doing this?

Are people re-processing the entire corpus, including the new files? I
haven't seen any code/classes in the Mahout vectorizer package for adding
new documents to an existing dictionary and its TF-IDF vectors.

-- 

Thanks,
John C
