SparseVectorsFromSequenceFiles (SVFSF) really is designed for one-shot processing. The issues arise from the corpus-wide statistics it computes: n-gram detection, frequency cutoffs, and the like are all going to be problems with piecewise conversion.
If all you use it for is tokenizing, then there isn't a problem.

If you are interested in a more incremental architecture, I expect it would be best to:

a) switch to a more incremental sort of dictionary so that new tokens can be added easily
b) not use Strings so much in the tokenization (could result in substantial speedups)
c) define an intermediate format for token and n-gram counts
d) write code that supports combination of sub-corpora

Would you like to work on such a thing with us?

The other very interesting option would be to simply create Lucene indices as your document repository format. These would satisfy requirements (a) through (d) quite easily.

On Fri, May 24, 2013 at 12:39 PM, John Conwell <[email protected]> wrote:

> Is there a workflow figured out for how to handle collecting and processing
> multiple document collections? Meaning, if I run N documents
> through SparseVectorsFromSequenceFiles and a month later have another 50K
> documents I'd like to add to the same corpus, what is the standard workflow
> for doing this?
>
> Are people re-processing the entire corpus, including the new files? I haven't
> seen any code/classes in the Mahout vectorizer package for adding new
> documents to the dictionary and tfidf vectors.
>
> --
>
> Thanks,
> John C
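To give a feel for what step (d) involves, here is a minimal sketch of combining token/n-gram counts from two sub-corpora, assuming counts are kept as plain String-to-count maps. The class and method names here are hypothetical for illustration, not existing Mahout API:

```java
import java.util.HashMap;
import java.util.Map;

public class CountMerge {

    // Combine two sub-corpus count maps by summing counts for tokens
    // (or n-grams) that appear in both. This is associative, so any
    // number of sub-corpora can be folded together pairwise.
    public static Map<String, Long> mergeCounts(Map<String, Long> a,
                                                Map<String, Long> b) {
        Map<String, Long> merged = new HashMap<String, Long>(a);
        for (Map.Entry<String, Long> e : b.entrySet()) {
            Long prev = merged.get(e.getKey());
            merged.put(e.getKey(),
                       prev == null ? e.getValue() : prev + e.getValue());
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> batch1 = new HashMap<String, Long>();
        batch1.put("hadoop", 5L);
        batch1.put("mahout", 3L);

        Map<String, Long> batch2 = new HashMap<String, Long>();
        batch2.put("mahout", 2L);
        batch2.put("lucene", 7L);

        Map<String, Long> merged = mergeCounts(batch1, batch2);
        System.out.println(merged.get("mahout")); // 3 + 2 = 5
    }
}
```

Once combined counts exist, the global frequency cutoffs and tf-idf weights can be recomputed over the merged totals without re-tokenizing the old documents.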
