This is a feature that would be useful for what I am doing. I would like to help out on this.
________________________________
From: John Conwell <[email protected]>
To: user <[email protected]>
Sent: Tuesday, May 28, 2013 3:55 PM
Subject: Re: Multiple document collections using SparseVectorsFromSequenceFiles

Hey Ted,

I'm gonna have to support corpus appending (with full vector generation) anyway, so sure, I'd love to help work on the design (and code, of course) of this feature with you guys. You're right, we'll definitely have to change several data structures to allow incremental updates, as well as decouple the logical components of SVFSF.

So what are the first steps?

On Fri, May 24, 2013 at 5:25 PM, Ted Dunning <[email protected]> wrote:

> SVFSF really is designed for a one-shot sort of processing.
>
> The issues arise with all of the corpus frequency cutoffs and such. N-gram
> detection, frequency cutoffs, and so on are all going to be problems with
> piecewise conversion.
>
> If all you use it for is tokenizing, then there isn't a problem.
>
> If you are interested in a more incremental architecture, I expect that it
> would be best to:
>
> a) switch to a more incremental sort of dictionary so that new tokens can
> be added easily
>
> b) not use Strings so much in the tokenization (could result in substantial
> speedups)
>
> c) define an intermediate format for token and n-gram counts
>
> d) write code that supports combination of sub-corpora.
>
> Would you like to work on such a thing with us?
>
> The other very interesting option would be to simply create Lucene indices
> as your document repository format. These would satisfy requirements (a)
> through (d) quite easily.
>
> On Fri, May 24, 2013 at 12:39 PM, John Conwell <[email protected]> wrote:
>
> > Is there a workflow figured out for how to handle collecting and
> > processing multiple document collections? Meaning if I run N documents
> > through SparseVectorsFromSequenceFiles and a month later have another 50K
> > documents I'd like to add to the same corpus, what is the standard
> > workflow for doing this?
> >
> > Are people re-processing the entire corpus, including new files? I
> > haven't seen any code/classes in the mahout vectorizer package for adding
> > new documents to the dictionary and tf-idf vectors.
> >
> > --
> > Thanks,
> > John C

--
Thanks,
John C
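To make Ted's points (a), (c), and (d) concrete, here is a minimal, hypothetical sketch (not Mahout API; all class and method names are invented for illustration): an append-friendly dictionary that hands out stable integer ids so a new batch of documents can introduce tokens without re-vectorizing the existing corpus, and a merge step that combines per-sub-corpus token counts (a trivial stand-in for the intermediate count format of point (c)) into a single corpus total.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of point (a): a dictionary that can grow incrementally.
 * Existing tokens keep their ids, so vectors produced from earlier
 * batches stay valid when a new batch is appended. Hypothetical code,
 * not part of Mahout's vectorizer package.
 */
class IncrementalDictionary {
    private final Map<String, Integer> ids = new HashMap<>();

    /** Return the existing id for a token, or assign the next free id. */
    int idOf(String token) {
        // Reading size() inside computeIfAbsent is safe: the mapping
        // function runs before the new entry is inserted.
        return ids.computeIfAbsent(token, t -> ids.size());
    }

    int size() {
        return ids.size();
    }
}

/**
 * Sketch of point (d): combining sub-corpora. Each batch emits its own
 * token -> count map (point (c)'s intermediate format, reduced here to
 * the simplest possible shape); merging is then just summing counts.
 */
class CorpusMerge {
    static Map<String, Long> merge(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> out = new HashMap<>(a);
        b.forEach((token, count) -> out.merge(token, count, Long::sum));
        return out;
    }

    public static void main(String[] args) {
        IncrementalDictionary dict = new IncrementalDictionary();
        int first = dict.idOf("hadoop"); // new token gets a fresh id
        int again = dict.idOf("hadoop"); // same token keeps its id
        System.out.println(first == again);

        Map<String, Long> batch1 = new HashMap<>();
        batch1.put("hadoop", 3L);
        Map<String, Long> batch2 = new HashMap<>();
        batch2.put("hadoop", 2L);
        batch2.put("mahout", 1L);
        // Counts from the month-later batch fold into the existing totals.
        System.out.println(CorpusMerge.merge(batch1, batch2).get("hadoop"));
    }
}
```

With stable ids and summable counts, the corpus-wide frequency cutoffs and n-gram statistics that Ted mentions can be recomputed from the merged counts alone, instead of re-tokenizing every document. A real design would also need the dictionary persisted between runs, which this sketch omits.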
