Ioan- when you understand them, please explain them here: https://cwiki.apache.org/confluence/display/MAHOUT/Data+Formats
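
For anyone reading along, here is a rough sketch of the general idea behind the "encoded vectors" mentioned in the quoted thread, i.e. the hashing trick: every token is hashed straight into a slot of a fixed-size vector, so no dictionary is needed and the vector length does not depend on the corpus. This is only an illustration in Python under stated assumptions; the function name `encode` and the `size` and `probes` parameters are made up for the example and are not Mahout's encoder API.

```python
import hashlib


def encode(tokens, size=16, probes=2):
    """Hash tokens into a fixed-size count vector (hypothetical helper)."""
    vec = [0.0] * size
    for tok in tokens:
        # Several probes per token soften the effect of hash collisions.
        for p in range(probes):
            digest = hashlib.md5(f"{p}:{tok}".encode("utf-8")).digest()
            idx = int.from_bytes(digest[:4], "big") % size
            vec[idx] += 1.0
    return vec


# A four-word message and a 150-word message both map to 16 slots.
short_vec = encode("please explain encoded vectors".split())
long_vec = encode(("mahout clustering email " * 50).split())
print(len(short_vec), len(long_vec))
```

The trade-off is that distinct words can collide in the same slot, which is why the result can be harder to interpret than a dictionary-based vector.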

On Mon, Nov 21, 2011 at 12:57 AM, Ioan Eugen Stan <[email protected]> wrote:

>> I'll try in the next few days to track down the numbers from running the
>> stuff in my recent IBM article:
>> http://www.ibm.com/developerworks/java/library/j-mahout-scaling/.
>> Or, you can go run them yourself!
>
> I think posting some reference data for the jobs will be great. I will
> have something to compare to when I have something done. In the mean time I
> will try to do a quick and dirty implementation working and see how things
> move and post my findings. This could take a while as I depend on some
> modifications.
>
>> Otherwise, I don't know that we have any formula just yet. I suspect
>> that once you reach a certain number of documents, your dictionary will
>> stop growing, more or less. Then, it is just a question of how many
>> vectors you have and the sparseness. This probably could be guessed at by
>> looking at what the average number of words are in your email collection.
>> Naturally, attachments may skew this if you are including them.
>
> I also suspect that things will be asymptotically after a certain number
> of documents, remains to see where that threshold is.
>
>> That has been my experience, too. Seq2Sparse is often the long part. I
>> suspect one could get it done a lot faster in Lucene.
>> SequenceFilesFromDirectory is also slow, but that is inherently sequential.
>
> I will be able to use a map reduce job to create vectors or just create
> them as an indexing step so I hope this step will not count when
> considering the effective clustering time.
>
>> I haven't explored yet what it would mean to use Encoded vectors in
>> Clustering, but perhaps I can call Ted to the front of the class and see if
>> he has thoughts on whether that even makes sense, as that would give you a
>> fixed size Vector.
>>
>> -Grant
>
> I don't know about encoded vectors yet, I hope to get some more info on
> them from Mahout in Action. If they do what I think they do, I will
> definitely try them, and probably complain on the list (Ted) if I can't
> interpret them right :).
>
> Thanks for the reply,
>
> --
> Ioan Eugen Stan

--
Lance Norskog
[email protected]
