Here are some numbers using http://aws.amazon.com/datasets/7791434387204566 
running locally:

Raw content size:
9.2 GB, 48K "items" -- note, most of the files are GZipped

It took 15 minutes to convert all of these to sequence files on a single-CPU 
i7 machine (4 cores with hyper-threading, 3.4 GHz) with 16 GB of RAM.

After converting to sequence files:
40 GB, 659 items.

Encoded Vectors (see build-asf-email.sh): cardinality = 5000: 11 GB for 1,300 
items.  This took 83 minutes to convert.

Splitting into test and train sets for SGD took 9 minutes.  I had to kill the 
SGD job because of CPU temperature issues on my machine that I still need to 
track down (SGD really cranks on the CPU and something is messed up on my 
machine).

For clustering, converting to sequence files took about the same time.

The job to convert to vectors took a while (it scrolled out of my window).  The 
resulting tfidf-vecs were 7.8 GB.

Dictionary:

 82865442 2011-11-21 17:46 dictionary.file-0*
 83269191 2011-11-21 17:46 dictionary.file-1*
 10963133 2011-11-21 17:46 dictionary.file-2*

Freq files:

 37160153 2011-11-21 22:35 frequency.file-0*
 37160173 2011-11-21 22:35 frequency.file-1*
 37160173 2011-11-21 22:35 frequency.file-2*
 31407713 2011-11-21 22:35 frequency.file-3*

Total dir size for seq2sparse:  du -s seq2sparse/
30923564        seq2sparse/
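
To make the dictionary and frequency listings easier to compare with the GB figures, here is a small sketch that totals them. It assumes the first column in each listing is a size in bytes (as in `ls -l` output); that is an assumption, not something stated in the listing itself. The helper name `to_mib` is made up for this example.

```python
# Sizes copied from the dictionary and frequency listings above.
# Assumption: the first column is a file size in bytes.
dictionary_bytes = [82865442, 83269191, 10963133]
frequency_bytes = [37160153, 37160173, 37160173, 31407713]


def to_mib(n_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 * 1024)


dictionary_total = sum(dictionary_bytes)
frequency_total = sum(frequency_bytes)

print(f"dictionary: {dictionary_total} bytes ({to_mib(dictionary_total):.1f} MiB)")
print(f"frequency:  {frequency_total} bytes ({to_mib(frequency_total):.1f} MiB)")
```

Under that assumption the dictionary and frequency files together come to a few hundred MiB, so they are a small part of the seq2sparse directory next to the 7.8 GB of tfidf-vecs.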

More as they become available.

HTH,
Grant

On Nov 21, 2011, at 3:57 AM, Ioan Eugen Stan wrote:

>> I'll try in the next few days to track down the numbers from running the 
>> stuff in my recent IBM article: 
>> http://www.ibm.com/developerworks/java/library/j-mahout-scaling/.  Or, you 
>> can go run them yourself!
> 
> I think posting some reference data for the jobs would be great. I will have 
> something to compare against when I have something done. In the meantime I 
> will try to get a quick and dirty implementation working, see how things 
> move, and post my findings. This could take a while as I depend on some 
> modifications.
> 
>> Otherwise, I don't know that we have any formula just yet.  I suspect that 
>> once you reach a certain number of documents, your dictionary will stop 
>> growing, more or less.  Then, it is just a question of how many vectors you 
>> have and the sparseness.  This probably could be guessed at by looking at 
>> what the average number of words are in your email collection.  Naturally, 
>> attachments may skew this if you are including them.
> 
> I also suspect that things will level off asymptotically after a certain 
> number of documents; it remains to be seen where that threshold is.
> 
>> That has been my experience, too.  Seq2Sparse is often the long part.  I 
>> suspect one could get it done a lot faster in Lucene.  
>> SequenceFilesFromDirectory is also slow, but that is inherently sequential.
> 
> I will be able to use a map reduce job to create vectors or just create them 
> as an indexing step so I hope this step will not count when considering the 
> effective clustering time.
> 
>> I haven't explored yet what it would mean to use Encoded vectors in 
>> Clustering, but perhaps I can call Ted to the front of the class and see if 
>> he has thoughts on whether that even makes sense, as that would give you a 
>> fixed size Vector.
>> 
>> -Grant
> 
> I don't know about encoded vectors yet; I hope to get some more info on them 
> from Mahout in Action. If they do what I think they do, I will definitely try 
> them, and probably complain on the list (Ted) if I can't interpret them right 
> :).
> 
> Thanks for the reply,
> 
> --
> Ioan Eugen Stan
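
On the sizing question in the quoted text (once the dictionary stops growing, size depends on how many vectors you have and how sparse they are), a back-of-the-envelope sketch might look like the following. It is not a formula from Mahout and does not model Mahout's on-disk vector format. The function name, the 12 bytes per entry (a 4-byte index plus an 8-byte weight), and the example collection are all assumptions for illustration.

```python
def estimate_sparse_vectors_bytes(num_docs, avg_distinct_terms, bytes_per_entry=12):
    """Rough size of num_docs sparse vectors, in bytes.

    bytes_per_entry is an assumption: a 4-byte int index plus an 8-byte
    double weight, ignoring per-vector headers, keys, and compression.
    """
    return num_docs * avg_distinct_terms * bytes_per_entry


# Hypothetical collection: 1,000,000 messages averaging 150 distinct terms.
estimate = estimate_sparse_vectors_bytes(1_000_000, 150)
print(f"{estimate / 1024**3:.2f} GiB")
```

The average number of distinct terms per message is the number to measure on a sample of your own mail; attachments, if included, would push it up.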

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com


