Re: clustering hardware requirements

Ioan Eugen Stan Mon, 21 Nov 2011 00:58:22 -0800

I'll try in the next few days to track down the numbers from running the stuff 
in my recent IBM article: 
http://www.ibm.com/developerworks/java/library/j-mahout-scaling/.  Or, you can 
go run them yourself!

I think posting some reference data for the jobs would be great. I will have something to compare against once I have something done. In the meantime I will try to get a quick and dirty implementation working, see how things move, and post my findings. This could take a while, as I depend on some modifications.

Otherwise, I don't know that we have any formula just yet.  I suspect that once 
you reach a certain number of documents, your dictionary will stop growing, 
more or less.  Then, it is just a question of how many vectors you have and the 
sparseness.  This probably could be guessed at by looking at what the average 
number of words is in your email collection.  Naturally, attachments may skew 
this if you are including them.

I also suspect that things will level off asymptotically after a certain number of documents; it remains to be seen where that threshold is.
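
To make the "number of vectors times sparseness" idea concrete, here is a back-of-the-envelope sketch in Python. It is not a Mahout formula. The function names are made up for illustration, and the 12 bytes per non-zero entry (a 4-byte int index plus an 8-byte double weight) is an assumption that ignores object and hash-map overhead, so treat the result as a lower bound.

```python
def estimate_sparse_vector_bytes(num_docs, avg_distinct_terms_per_doc,
                                 bytes_per_entry=12):
    """Rough lower bound on the size of num_docs sparse vectors.

    Assumption: each non-zero entry costs bytes_per_entry bytes
    (4-byte index + 8-byte weight); real overhead will be higher.
    """
    return num_docs * avg_distinct_terms_per_doc * bytes_per_entry


def sparseness(avg_distinct_terms_per_doc, dictionary_size):
    """Fraction of dictionary slots that are zero in an average vector."""
    return 1.0 - avg_distinct_terms_per_doc / dictionary_size


if __name__ == "__main__":
    # Hypothetical numbers: 1M emails, ~100 distinct words each,
    # and a dictionary that has stopped growing at 50k terms.
    total = estimate_sparse_vector_bytes(1_000_000, 100)
    print("approx. GB:", total / 1e9)
    print("sparseness:", sparseness(100, 50_000))
```

The average number of distinct words per email can be measured on a small sample of the collection, which fits the point above that the dictionary size matters less once it stops growing.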

That has been my experience, too.  Seq2Sparse is often the long part.  I 
suspect one could get it done a lot faster in Lucene.  
SequenceFilesFromDirectory is also slow, but that is inherently sequential.

I will be able to use a MapReduce job to create the vectors, or just create them as an indexing step, so I hope this step will not count when considering the effective clustering time.

I haven't explored yet what it would mean to use Encoded vectors in Clustering, 
but perhaps I can call Ted to the front of the class and see if he has thoughts 
on whether that even makes sense, as that would give you a fixed size Vector.

-Grant

I don't know about encoded vectors yet; I hope to get some more info on them from Mahout in Action. If they do what I think they do, I will definitely try them, and probably complain on the list (Ted) if I can't interpret them right :).
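
For reference, my current understanding of the idea, as a minimal Python sketch of feature hashing. This is not Mahout's encoder API: the `encode` function, the use of MD5, and the dimension of 16 are all assumptions chosen for illustration. The point is only that every document maps to a vector of the same fixed size, with no dictionary needed.

```python
import hashlib


def encode(tokens, dim=16):
    """Hash each token into one of dim slots and count occurrences.

    Illustrative only: a stable hash (MD5 here) picks the slot, so the
    vector length is fixed regardless of vocabulary size. Distinct
    tokens may collide in the same slot.
    """
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "big") % dim
        vec[idx] += 1.0
    return vec


if __name__ == "__main__":
    short_mail = "hello ted".split()
    long_mail = ("clustering hardware requirements " * 50).split()
    # Both vectors have the same length, whatever the input size.
    print(len(encode(short_mail)), len(encode(long_mail)))
```

The trade-off is that collisions make individual slots hard to map back to words, which is probably where the interpretation trouble would come from.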


Thanks for the reply,

--
Ioan Eugen Stan
