It depends on what you do. For example, you may find it difficult to
run SVD on a single-node machine even for a 2 GB matrix input. Some of
the work is CPU bound; other work may be iteration bound (ALS, for
instance) but still worth a try if you can figure out good learning
and regularization rates. And some algorithms (ALS, again) you may find
actually perform much better in a BSP environment than in MapReduce
altogether.

So it depends on the problem, but as a general rule of thumb: if you
can't solve your problem on a single node in an hour (depending on your
requirements), that's probably when you might want to start trying a
machine-cluster solution.

I'd always suggest prototyping in R first, if not for the volume's
sake then just to make sure the approach makes sense for your data. If
it looks like it takes forever at your size, then you may want to
start looking elsewhere.
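For what it's worth, the "prototype first" step doesn't need anything
fancy. Here's a toy single-node k-means sketch (plain Python as an
illustrative stand-in for R; the synthetic blob data and the spread-out
initialization are my assumptions, not anything from the thread) that
shows the kind of sanity check I mean:

```python
import random

def kmeans(points, k, iters=15):
    """Minimal Lloyd's algorithm on a list of 2-D points."""
    # Deterministic init for the sketch: picks spread evenly through the data.
    centroids = [points[len(points) * i // k] for i in range(k)]
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared distance).
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2 +
                                  (p[1] - centroids[c][1]) ** 2)
            groups[i].append(p)
        # Move each centroid to the mean of its group (keep it if empty).
        centroids = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
            if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

# Two well-separated synthetic blobs; a sane prototype should recover them.
random.seed(0)
pts = ([(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(200)] +
       [(random.gauss(5, 0.5), random.gauss(5, 0.5)) for _ in range(200)])
print(sorted(kmeans(pts, 2)))  # centroids land near (0, 0) and (5, 5)
```

If a quick run like this (or its R equivalent) on a sample of your data
doesn't produce sensible clusters, scaling it out won't fix that.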

On Sat, Oct 20, 2012 at 1:25 PM, Eric Link <[email protected]> wrote:
> We are looking at using mahout in our organization.  We have a need to do 
> statistical analysis and do clustering and make recommendations.  What is the 
> 'sweet spot' for doing this with mahout?  Meaning, what types of data sets 
> and data volumes are the best fit for using a tool like mahout, versus doing 
> things, say,  in a sql database.  I hear big data doesn't really start until 
> you have terabytes and petabytes of data, so I'm not sure the data sets I 
> have are worthy!    Thanks for any thoughts on the proper fit for a tool like 
> mahout.    - Eric
>
>
>
> On Oct 20, 2012, at 2:44 PM, Matt Molek <[email protected]> wrote:
>
>> First off, thank you everyone for your help so far. This mailing list
>> has been a great help getting me up and running with Mahout.
>>
>> Right now, I'm clustering a set of ~3M documents into 300 clusters.
>> Then I'm using clusterpp to split the documents up into directories
>> containing the vectors belonging to each cluster. After I perform the
>> clustering, clusterdump shows that each cluster has between ~800 and
>> ~200,000 documents. This isn't a great spread, but the point is that
>> none of the clusters are empty.
>>
>> Here are my commands:
>>
>> bin/mahout kmeans -i ssvd2/USigma -c initial-centroids -o pca-clusters
>> -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure -cd 0.05
>> -k 300 -x 15 -cl -ow
>>
>> bin/mahout clusterdump -i pca-clusters/clusters-11-final -o clusterdump.txt
>>
>> bin/mahout clusterpp -i pca-clusters -o bottom
>>
>>
>> Since none of my clusters are empty, I would expect clusterpp to
>> create 300 directories in "bottom", one for each cluster. Instead,
>> only 147 directories are created. The other 153 outputs are just empty
>> part-r-* files sitting in the "bottom" directory.
>>
>> I haven't found too much information when searching on this issue but
>> I did come across one mailing list post from a while back:
>> http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/%[email protected]%3E
>>
>> In that discussion someone said, "If that is the only thing that is
>> contained in the part-r-* file [it had no vectors], then the reducer
>> responsible to write to that part-r-* file did not receive any input
>> records to write to it. This happens because the program uses the
>> default hash partitioner which sometimes maps records belonging to
>> different clusters to a same reducer; thus leaving some reducers
>> without any input records."
>>
>> So if that's correct, is that what's happening to me? Half of my
>> clusters are being sent to the overlapping reducers? That seems like a
>> big issue, making clusterpp pretty much useless for my purposes. I
>> can't have documents randomly being sent to the wrong cluster's
>> directory, especially not 50+% of them.
>>
>> One final detail: I'm not sure if this matters, but the clusters
>> output by kmeans are not numbered 1 to 300. They have an odd-looking,
>> nonsequential numbering scheme. The first 5 clusters are:
>> VL-3740844
>> VL-3741044
>> VL-3741140
>> VL-3741161
>> VL-3741235
>>
>> I haven't done much with kmeans before, so I wasn't sure if this was
>> an unexpected behavior or not.
>
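For what it's worth, the hash-partitioner effect described in the
quoted thread is easy to reproduce outside Hadoop. A rough sketch in
Python (the ids are hypothetical, and Python's hash() stands in for
Java's hashCode(), so the exact counts will differ from a real Hadoop
run, but the collision behavior is the same):

```python
# Why the default hash partitioner can leave reducers empty: with N
# cluster ids routed to R reducers via hash(id) % R, some reducers
# receive several clusters and others receive none at all.

def default_partition(key, num_reducers):
    # Mirrors Hadoop's HashPartitioner: key.hashCode() % numReducers.
    return hash(key) % num_reducers

# Hypothetical nonsequential ids in the style of the VL-* ids above.
cluster_ids = ["VL-%d" % (3740844 + 97 * i) for i in range(300)]
num_reducers = 300

reducers_hit = {default_partition(cid, num_reducers) for cid in cluster_ids}
print("%d of %d reducers received records" % (len(reducers_hit), num_reducers))
# Fewer reducers than clusters get input, so the remaining part-r-*
# files come out empty, even though no cluster itself is empty.
```

Note this means some reducers handle records from multiple clusters, not
that documents end up in the wrong cluster's output; the symptom is the
missing directories and the empty part files.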
