When dealing with Streaming KMeans, it would be helpful for troubleshooting 
purposes if you could provide the values of k (number of clusters), km (= k * ln(n)), 
and n (number of datapoints).

Try setting -Xmx to a higher heap size and run the sequential version again.
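
If you are launching through the stock bin/mahout script, a minimal sketch of what raising the heap can look like (this assumes the launcher honors MAHOUT_HEAPSIZE, which it interprets in MB):

# raise the heap of the client JVM that runs the sequential code path;
# bin/mahout turns MAHOUT_HEAPSIZE (in MB) into -Xmx for that JVM
export MAHOUT_HEAPSIZE=8192
# then rerun streamingkmeans with -xm sequential as before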

I have seen OOM errors happen during the Reduce phase while running the MR 
version; my reduce heap size was set to 2GB and I was trying to cluster about 
2M datapoints, each of cardinality 100 (that is, after running them through SSVD-PCA).
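
If you hit the same thing in MR mode, the reduce-side child heap can usually be raised per job through Hadoop's generic -D options; a sketch (the property name assumes classic MRv1, and the other options are placeholders mirroring the command later in this thread):

# raise only the reducer child JVM heap for this job; mapred.child.java.opts
# is the coarser MRv1 fallback that covers both map and reduce tasks
mahout streamingkmeans -D mapred.reduce.child.java.opts=-Xmx4g \
    -i input -o output -ow -k 10000 -km 145087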

Speaking from my experience, either the Reducer fails with OOM errors or it is 
stuck forever at 76% (and raises alarms with Operations because it is not making 
any progress).


How big is your dataset, and how long did the map phase take to complete? 



On Tuesday, March 18, 2014 12:54 AM, fx MA XIAOJUN <xiaojun...@fujixerox.co.jp> 
wrote:
 
Since mahout streamingkmeans reportedly has no problems in sequential mode, 
I would like to try sequential mode.
However, a "java.lang.OutOfMemoryError" occurs.

Where do I set the JVM heap size for sequential mode?
Is it the same as for mapreduce mode?




-----Original Message-----
From: fx MA XIAOJUN [mailto:xiaojun...@fujixerox.co.jp] 
Sent: Tuesday, March 18, 2014 10:50 AM
To: Suneel Marthi; user@mahout.apache.org
Subject: RE: reduce is too slow in StreamingKmeans

Thank you for your extremely quick reply.

>> What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you 
>> mean Streaming KMeans here?
I want to try the -rskm option in streaming kmeans, but in Mahout 0.8, setting 
-rskm to true causes errors.
I heard that the bug has been fixed in 0.9, so I upgraded from 0.8 to 0.9.


The Hadoop I installed is CDH5 MRv1, corresponding to Hadoop 0.20, not Hadoop 
2.x (YARN).
CDH5 MRv1 ships a compatible version of Mahout (mahout-0.8+cdh5.0.0b2+28) 
compiled by Cloudera.
So I uninstalled mahout-0.8+cdh5.0.0b2+28 and installed the Apache Mahout 0.9 
distribution.
It turned out that "Mahout kmeans" runs very well on mapreduce.
However, "Mahout
 streamingkmeans" runs properly in sequential mode, but fails in mapreduce mode.

If this were a problem of incompatibility between Hadoop and Mahout, I don't 
think "mahout kmeans" would run properly.

Is mahout 0.9 compatible with Hadoop 0.20?





-----Original Message-----
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com] 
Sent: Monday, March 17, 2014 6:21 PM
To: fx MA XIAOJUN; user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans





On Monday, March 17, 2014 3:43 AM, fx MA XIAOJUN <xiaojun...@fujixerox.co.jp> 
wrote:

Thank you for your quick reply.

As to -km, I thought it was log10 instead of ln. I was wrong...
This time I set -km 140000 and ran mahout streamingkmeans again (CDH 5.0 MRv1, 
Mahout 0.8). The maps ran faster than before, but the reduce was still stuck at 
76% forever.

>> This has been my experience too, with both 0.8 and 0.9. 

So I uninstalled Mahout 0.8 and installed Mahout 0.9 in order to use the -rskm 
option.

Mahout kmeans can be executed properly, so I think the installation of Mahout 
0.9 was successful.

>> What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you 
>> mean Streaming KMeans here?

However, when executing mahout streamingkmeans, I got the error below.
The Hadoop I installed is CDH5-beta1 MapReduce version 1 (MRv1).
----------------------------------------------------------------------------------------
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
--------------------------------------------------------------------------------------------


It seems like you are trying to execute on Hadoop 2 while Mahout 0.9 has been 
built with the Hadoop 1.x profile, hence the error you are seeing.
If you would like to test on Hadoop 2, work off of the present trunk and build 
the code with the Hadoop 2 profile like below:

mvn clean install -Dhadoop2.profile=<hadoop 2.x version>

Please give that a try.





-----Original Message-----
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

Streaming KMeans runs with a single reducer that runs Ball KMeans, hence the 
slow performance you have been experiencing. 

How did you come up with -km 63000?

Given that you would like 10,000 clusters (= k) and have 2,000,000 datapoints 
(= n), km = k * ln(n) = 10000 * ln(2 * 10^6) = 145087 (rounded to the nearest 
integer), and that should be the value of -km in your case.
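
As a quick sanity check of that arithmetic (awk's log() is the natural logarithm, and %.0f rounds to the nearest integer):

awk 'BEGIN { k = 10000; n = 2000000; printf "km = %.0f\n", k * log(n) }'
# prints: km = 145087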

Not sure if that's going to fix your reduce being stuck at 76% forever, but it's 
definitely worth a try.

If you would like to go with the -rskm option, please upgrade to Mahout 0.9.  I 
still think there's an issue with the -rskm option in Mahout 0.9 and today's trunk 
when executing in MR mode, but it definitely works in the non-MR (-xm 
sequential) mode in 0.9.
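
For reference, a sketch of the sequential invocation being suggested here, reusing the input/output paths and cluster counts from the command further down in this thread (add -rskm in whichever form your Mahout version accepts):

mahout streamingkmeans -i input -o output -ow -k 10000 -km 145087 -xm sequential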











On Monday, February 17, 2014 9:05 PM, Sylvia Ma <xiaojun...@fujixerox.co.jp> 
wrote:

I am using Mahout 0.8 embedded in CDH 5.0.0, provided by Cloudera, and have found 
that the reduce phase of mahout streamingkmeans is extremely slow.

For example:
With a dataset of 2,000,000 objects and 128 variables, I would like to get 10,000 
clusters.

The command executed is the following.
mahout streamingkmeans -i input -o output -ow -k 10000 -km 63000

I have 15 maps, which all completed in 4 hours.
However, the reduce has taken over 100 hours and is still stuck at 76%.

I have tuned the Hadoop performance settings as follows. 
map task jvm = 3g
reduce task jvm = 10g
io.sort.mb = 512
io.sort.factor = 50
mapred.reduce.parallel.copies = 10
mapred.inmem.merge.threshold = 0 
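
For reference, the same tuning can usually also be passed per job as generic -D options on the submission; a sketch, with property names assuming classic MRv1 and the values copied from above:

# per-job equivalents of the cluster-level settings listed above
mahout streamingkmeans \
    -D mapred.map.child.java.opts=-Xmx3g \
    -D mapred.reduce.child.java.opts=-Xmx10g \
    -D io.sort.mb=512 \
    -D io.sort.factor=50 \
    -D mapred.reduce.parallel.copies=10 \
    -D mapred.inmem.merge.threshold=0 \
    -i input -o output -ow -k 10000 -km 63000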

I tried to assign enough memory, but the reduce is still very, very slow.


Why does it take so much time in the reduce phase?
And what can I do to speed up the job?

I wonder if it would help to set -rskm to true.
The -rskm option has a bug in Mahout 0.8, so I cannot give it a try... 




Yours Sincerely,
Sylvia Ma
