Here are some numbers from running locally against http://aws.amazon.com/datasets/7791434387204566:

Raw content size: 9.2 GB, 48K "items" -- note, most of the files are GZipped

It took 15 minutes to convert all of these to sequence files on an i7 single
CPU w/ 4 cores and hyper-threading. 3.4 GHz machine with 16 GB of RAM

After converting to sequence files: 40 GB, 659 items.

Encoded Vectors (see build-asf-email.sh): cardinality = 5000: 11 GB for 1,300
items. This took 83 minutes to convert.

Splitting into test and train took 9 minutes for SGD.

I had to kill the SGD job due to some issues I'm having on my machine w/ CPU
temperature (SGD really cranks on the CPU and something is messed up on my
machine) that I need to track down.

For clustering, about the same time for converting to sequence files.

The job to convert to vectors took a while (it scrolled out of my window). The
resulting tfidf-vecs were 7.8 GB.

Dictionary:
82865442 2011-11-21 17:46 dictionary.file-0*
83269191 2011-11-21 17:46 dictionary.file-1*
10963133 2011-11-21 17:46 dictionary.file-2*

Freq files:
37160153 2011-11-21 22:35 frequency.file-0*
37160173 2011-11-21 22:35 frequency.file-1*
37160173 2011-11-21 22:35 frequency.file-2*
31407713 2011-11-21 22:35 frequency.file-3*

Total dir size for seq2sparse:
du -s seq2sparse/
30923564 seq2sparse/

More as they become available.

HTH,
Grant

On Nov 21, 2011, at 3:57 AM, Ioan Eugen Stan wrote:

>> I'll try in the next few days to track down the numbers from running the
>> stuff in my recent IBM article:
>> http://www.ibm.com/developerworks/java/library/j-mahout-scaling/. Or, you
>> can go run them yourself!
>
> I think posting some reference data for the jobs will be great. I will have
> something to compare to when I have something done. In the mean time I will
> try to do a quick and dirty implementation working and see how things move
> and post my findings. This could take a while as I depend on some
> modifications.
>
>> Otherwise, I don't know that we have any formula just yet. I suspect that
>> once you reach a certain number of documents, your dictionary will stop
>> growing, more or less. Then, it is just a question of how many vectors you
>> have and the sparseness. This probably could be guessed at by looking at
>> what the average number of words are in your email collection. Naturally,
>> attachments may skew this if you are including them.
>
> I also suspect that things will be asymptotically after a certain number of
> documents, remains to see where that threshold is.
>
>> That has been my experience, too. Seq2Sparse is often the long part. I
>> suspect one could get it done a lot faster in Lucene.
>> SequenceFilesFromDirectory is also slow, but that is inherently sequential.
>
> I will be able to use a map reduce job to create vectors or just create them
> as an indexing step so I hope this step will not count when considering the
> effective clustering time.
>
>> I haven't explored yet what it would mean to use Encoded vectors in
>> Clustering, but perhaps I can call Ted to the front of the class and see if
>> he has thoughts on whether that even makes sense, as that would give you a
>> fixed size Vector.
>>
>> -Grant
>
> I don't know about encoded vectors yet, I hope to get some more info on them
> from Mahout in Action. If they do what I think they do, I will definitely try
> them, and probably complain on the list (Ted) if I can't interpret them right
> :).
>
> Thanks for the reply,
>
> --
> Ioan Eugen Stan

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
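
The guess in the quoted thread, that once the dictionary stops growing the size comes down to how many vectors you have and how sparse they are, can be turned into a rough back-of-the-envelope estimate. The sketch below is only that: the function names, the 12 bytes per non-zero entry (a 4-byte index plus an 8-byte double), and the example inputs are all assumptions made for illustration, not Mahout's actual on-disk format and not measurements from the runs above. The only value taken from the thread is the cardinality of 5000.

```python
def estimate_sparse_vectors_bytes(num_docs, avg_unique_terms_per_doc,
                                  bytes_per_entry=12, per_vector_overhead=0):
    """Rough size of num_docs sparse vectors, ignoring compression.

    bytes_per_entry=12 assumes a 4-byte index plus an 8-byte double per
    non-zero term; the real serialized size may differ.
    """
    if num_docs < 0 or avg_unique_terms_per_doc < 0:
        raise ValueError("counts must be non-negative")
    per_vector = avg_unique_terms_per_doc * bytes_per_entry + per_vector_overhead
    return num_docs * per_vector


def estimate_dense_vectors_bytes(num_docs, cardinality, bytes_per_value=8):
    """Worst case for fixed-size vectors: every slot stored as a double."""
    if num_docs < 0 or cardinality < 0:
        raise ValueError("counts must be non-negative")
    return num_docs * cardinality * bytes_per_value


if __name__ == "__main__":
    # Hypothetical inputs: document count and average unique terms per
    # document are made up; only cardinality=5000 comes from the thread.
    docs = 1_000_000
    sparse = estimate_sparse_vectors_bytes(docs, avg_unique_terms_per_doc=150)
    dense = estimate_dense_vectors_bytes(docs, cardinality=5000)
    print(f"sparse: {sparse / 1e9:.1f} GB, dense worst case: {dense / 1e9:.1f} GB")
```

Plugging in the average number of unique terms per message from your own collection (with or without attachments) should show how sensitive the total is to sparseness compared with the document count.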
