I'll try in the next few days to track down the numbers from running the stuff 
in my recent IBM article: 
http://www.ibm.com/developerworks/java/library/j-mahout-scaling/.  Or, you can 
go run them yourself!

I think posting some reference data for the jobs would be great. I will have something to compare against once I have something done. In the meantime I will try to get a quick and dirty implementation working, see how things go, and post my findings. This could take a while, as I depend on some modifications.

Otherwise, I don't know that we have any formula just yet.  I suspect that once 
you reach a certain number of documents, your dictionary will stop growing, 
more or less.  Then, it is just a question of how many vectors you have and the 
sparseness.  This probably could be guessed at by looking at what the average 
number of words is in your email collection.  Naturally, attachments may skew 
this if you are including them.
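
A rough way to check both guesses on a sample of the mail is sketched below. It is only a back-of-the-envelope aid, not anything Mahout ships: the function names are made up for illustration, tokenization is a plain whitespace split, and the 12 bytes per non-zero entry (a 4-byte index plus an 8-byte double) is an assumption, not a measured figure for Mahout's vector classes.

```python
def vocabulary_growth(docs):
    """Return the dictionary size after each document has been added."""
    seen = set()
    sizes = []
    for doc in docs:
        # Naive tokenization: lowercase and split on whitespace.
        seen.update(doc.lower().split())
        sizes.append(len(seen))
    return sizes


def estimate_sparse_bytes(num_docs, avg_distinct_terms, bytes_per_entry=12):
    """Rough payload size of num_docs sparse vectors.

    bytes_per_entry is an assumption (4-byte index + 8-byte value);
    real in-memory overhead will be higher.
    """
    return num_docs * avg_distinct_terms * bytes_per_entry


if __name__ == "__main__":
    sample = ["meeting moved to friday", "friday meeting cancelled", "meeting moved"]
    print(vocabulary_growth(sample))
    print(estimate_sparse_bytes(1_000_000, 50))
```

If the sizes returned by vocabulary_growth flatten out on a sample, that is a hint the dictionary has mostly stopped growing, and the second function then gives an order-of-magnitude figure for the vectors themselves.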

I also suspect that things will behave asymptotically after a certain number of documents; it remains to be seen where that threshold is.

That has been my experience, too.  Seq2Sparse is often the long part.  I 
suspect one could get it done a lot faster in Lucene.  
SequenceFilesFromDirectory is also slow, but that is inherently sequential.

I will be able to use a MapReduce job to create the vectors, or just create them as an indexing step, so I hope this step will not count toward the effective clustering time.

I haven't explored yet what it would mean to use Encoded vectors in Clustering, 
but perhaps I can call Ted to the front of the class and see if he has thoughts 
on whether that even makes sense, as that would give you a fixed size Vector.
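
For anyone unfamiliar with the idea, the general hashing trick behind a fixed-size vector can be sketched as follows. This is an illustration of the technique only, not Mahout's encoder code: the function name, the choice of MD5 as the hash, and the default size of 16 are all assumptions made for the example.

```python
import hashlib


def hashed_vector(tokens, size=16):
    """Map a list of tokens to a fixed-length count vector via hashing."""
    vec = [0.0] * size
    for tok in tokens:
        # Use a stable hash so the same token always lands in the same slot.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "big") % size
        vec[idx] += 1.0
    return vec


if __name__ == "__main__":
    short = hashed_vector("meeting moved".split())
    longer = hashed_vector("meeting moved to friday because of the holiday".split())
    # Both vectors have the same length regardless of document length.
    print(len(short), len(longer))
```

No dictionary is needed, so the vector length does not depend on the collection, at the cost of distinct words sometimes colliding in the same slot.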

-Grant

I don't know about encoded vectors yet; I hope to get some more info on them from Mahout in Action. If they do what I think they do, I will definitely try them, and probably complain on the list (Ted) if I can't interpret them right :).

Thanks for the reply,

--
Ioan Eugen Stan
