Is there an established workflow for collecting and processing
multiple document collections? That is, if I run N documents
through SparseVectorsFromSequenceFiles and a month later have another 50K
documents I'd like to add to the same corpus, what is the standard
way of doing this?

Are people re-processing the entire corpus, including the new files? I
haven't seen any code/classes in the Mahout vectorizer package for adding
new documents to an existing dictionary and its TF-IDF vectors.

-- 

Thanks,
John C
