On Nov 16, 2011, at 9:39 PM, Ioan Eugen Stan wrote:

> Hello,
> 
> I have to figure out how much hardware is required to do clustering
> for my company on about 10+ million user accounts, each with 100-5000
> documents. The documents will be indexed so vector creation will be
> done at indexing.
> Is there any formula to approximate the size of the vectors based on
> the index size? I'm looking for rough estimates (how much extra disk
> space should I consider?).

I'll try in the next few days to track down the numbers from the runs in my 
recent IBM article: 
http://www.ibm.com/developerworks/java/library/j-mahout-scaling/.  Or, you can 
run them yourself!  

Otherwise, I don't know that we have any formula just yet.  I suspect that once 
you reach a certain number of documents, your dictionary will stop growing, 
more or less.  Then, it is just a question of how many vectors you have and the 
sparseness.  You could probably estimate this from the average number of words 
per message in your email collection.  Naturally, attachments may skew this if 
you are including them.
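
To make that concrete, here is a back-of-the-envelope sketch in Python.  It is 
not a Mahout formula.  The 12 bytes per non-zero entry (a 4-byte index plus an 
8-byte double), the 64 bytes of per-vector overhead, and the 150 unique terms 
per document are all assumptions for illustration; measure a sample of your own 
data and plug in the real numbers.  Compression is ignored.

```python
def estimate_vector_bytes(num_docs, avg_unique_terms_per_doc,
                          bytes_per_entry=12, per_vector_overhead=64):
    """Rough uncompressed size of a collection of sparse term vectors.

    bytes_per_entry and per_vector_overhead are assumed values,
    not figures taken from any particular storage format.
    """
    per_vector = avg_unique_terms_per_doc * bytes_per_entry + per_vector_overhead
    return num_docs * per_vector


users = 10_000_000  # from the question: 10+ million accounts
for docs_per_user in (100, 5000):  # from the question: 100-5000 documents each
    num_docs = users * docs_per_user
    # 150 unique terms per document is a placeholder; measure your own corpus.
    size = estimate_vector_bytes(num_docs, avg_unique_terms_per_doc=150)
    print(f"{docs_per_user} docs/user: ~{size / 1e12:.1f} TB")
```

The dictionary itself is left out because, if it does stop growing, it becomes a 
small constant next to the vectors.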

> 
> Which are the most time consuming tasks?  From my experience with
> clustering, the index/vector creation part is the most time consuming,
> with clustering itself second. Does anyone have some data on how
> much time a clustering job takes?

That has been my experience, too.  Seq2Sparse is often the long part.  I 
suspect one could get it done a lot faster in Lucene.  
SequenceFilesFromDirectory is also slow, but that is inherently sequential.

I haven't explored yet what it would mean to use Encoded vectors in Clustering, 
but perhaps I can call Ted to the front of the class and see if he has thoughts 
on whether that even makes sense, as that would give you a fixed size Vector.
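
For anyone unfamiliar with the idea, here is a generic sketch of the hashing 
trick in Python.  It does not use Mahout's own encoder classes, and the MD5 
hash, the 1024-slot size, and the function name are assumptions chosen only for 
illustration.  Each token is hashed into one of a fixed number of slots, so 
every document yields a vector of the same length no matter how large the 
vocabulary gets.

```python
import hashlib


def hashed_vector(tokens, size=1024):
    """Map a list of tokens to a fixed-length count vector via hashing.

    Hypothetical helper for illustration; collisions between tokens
    are possible and simply add into the same slot.
    """
    vec = [0.0] * size
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        # Use the first 4 bytes of the digest as a stable integer hash.
        idx = int.from_bytes(digest[:4], "big") % size
        vec[idx] += 1.0
    return vec


short_doc = hashed_vector("the quick brown fox".split())
long_doc = hashed_vector(
    "a much longer message about clustering mail with many more words".split()
)
# Both vectors have the same length regardless of document size.
print(len(short_doc), len(long_doc))
```

The trade-off is that no dictionary is needed, but the vector can no longer be 
mapped back to individual terms exactly, which may matter when interpreting 
clusters.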

-Grant
