Hi Everyone,
I see this is an active list, so I'm trying again to reach out for help using
Mahout on HDFS.
I am a research scientist at eBay, working on Big Data analysis for
e-commerce. I've been trying to run Mahout on my data for quite some time now:
running it locally is no problem, but I'm having trouble running it on HDFS.
I hope you have some leads on the following (I've accumulated quite a few
unresolved issues):
1. I'm trying again to see if anyone has an answer to this:
I've been running mahout kmeans successfully on HDFS; however, if I run
mahout kmeans without the -cl flag, the clusteredPoints directory is not
created.
Whenever I add '-cl' to my call, I get the error "There is no queue
named default", even though I do specify a queue via -Dmapred.job.queue.name.
I do not get this error if I omit -cl; the job runs just fine (though it
does not create the clusteredPoints directory).
Does anyone have an idea why this happens?
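For reference, this is the shape of the call that triggers the error (the paths, queue name, and parameter values here are placeholders, not my real ones):

```shell
# Works, but produces no clusteredPoints directory:
mahout kmeans -Dmapred.job.queue.name=myqueue \
  -i inputSeq.dat -o outputPath -c outputSeeds -k 4000 -x 3

# Fails with "There is no queue named default":
mahout kmeans -Dmapred.job.queue.name=myqueue \
  -i inputSeq.dat -o outputPath -c outputSeeds -k 4000 -x 3 -cl
```

The only difference between the two calls is the trailing -cl.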
2. My Mahout clustering processes seem to run very slowly (several good
hours on just ~1M items), and I'm wondering whether anything needs to be
changed in my settings/configuration (and how).
I'm running on large clusters and could potentially use thousands of
nodes. However, my Mahout processes (kmeans/canopy) only ever use at most 5
mappers (I tried several data sets).
I tried to set the number of mappers with something like
-Dmapred.map.tasks=100, but this didn't seem to have any effect; it still
uses <=5 mappers.
Is there a different way to set the number of mappers/reducers for a
Mahout process?
Or is there another configuration issue I should consider?
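In case the flag placement matters, this is roughly how I'm passing the -D options (before the job-specific options, as GenericOptionsParser expects; all values below are placeholders):

```shell
mahout kmeans -Dmapred.job.queue.name=myqueue -Dmapred.map.tasks=100 \
  -i inputSeq.dat -o outputPath -c outputSeeds -k 4000 -x 3
```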
3. When running Mahout canopy clustering, the jobs consistently fail with
out-of-memory errors such as:
attempt_201306241658_137502_m_000001_1: Exception in thread "Thread for syncLogs" java.lang.OutOfMemoryError: Java heap space
and finally:
Exception in thread "main" java.lang.InterruptedException: Canopy Job failed processing whateverfilename.dat
even though the file does exist.
I tried to increase the map/reduce memory with
-Dmapred.child.java.opts=-Xmx4g, but it still fails:
13/07/22 01:56:09 INFO mapred.JobClient: Job Counters
13/07/22 01:56:09 INFO mapred.JobClient:   SLOTS_MILLIS_MAPS=23121
13/07/22 01:56:09 INFO mapred.JobClient:   Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/22 01:56:09 INFO mapred.JobClient:   Total time spent by all maps waiting after reserving slots (ms)=0
13/07/22 01:56:09 INFO mapred.JobClient:   Launched map tasks=13
13/07/22 01:56:09 INFO mapred.JobClient:   SLOTS_MILLIS_REDUCES=0
13/07/22 01:56:09 INFO mapred.JobClient:   Failed map tasks=1
Exception in thread "main" java.lang.InterruptedException: Canopy Job failed processing whateverfilename.dat
    at org.apache.mahout.clustering.canopy.CanopyDriver.buildClustersMR(CanopyDriver.java:363)
    at org.apache.mahout.clustering.canopy.CanopyDriver.buildClusters(CanopyDriver.java:248)
    at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:155)
    at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
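For completeness, the canopy call itself looks roughly like this (paths, thresholds, and the distance measure below are placeholders, not my real values):

```shell
mahout canopy -Dmapred.job.queue.name=myqueue \
  -Dmapred.child.java.opts=-Xmx4g \
  -i whateverfilename.dat -o canopyOutput \
  -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
  -t1 500 -t2 250
```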
4. One of the first major problems I encountered was that a Mahout jar we
created that uses KMeansDriver (and that runs great on my local machine) did
not even initiate a job on the Hadoop cluster. It seemed to be running in
parallel, but in fact it was running only on the local node. Has this happened
to anyone? If so, what is the fix? (I ended up dropping it and calling
mahout step by step from the command line, but I'd be happy to know if there is
a fix for this.)
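One thing I've since wondered about is whether the launch style matters. My understanding is that only the `hadoop jar` form is guaranteed to pick up the cluster's configuration, while a plain `java -jar` can silently fall back to Hadoop's local job runner. A sketch (jar and class names are placeholders):

```shell
# Picks up the cluster's configuration (e.g. mapred-site.xml) from the
# Hadoop classpath, so jobs go to the real JobTracker:
hadoop jar our-kmeans-job.jar com.example.OurKMeansJob ...

# Uses only whatever configuration is bundled with the jar; with the
# stock defaults that means mapred.job.tracker=local, i.e. a local run:
java -jar our-kmeans-job.jar ...
```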
Any ideas/input on any of these issues would be greatly appreciated.
Thanks!
Galit.
-----Original Message-----
From: Fuhrmann Alpert, Galit
Sent: Wednesday, July 17, 2013 12:43 PM
To: [email protected]; 'Suneel Marthi'
Subject: RE: mahout kmeans not generating clusteredPoint dir?
Thanks Suneel.
I tried adding this flag (though I think the clusteredPoints directory was
supposed to be created by default?).
Either way, for some reason whenever I add '-cl' (I tried it on several
data sets), I get the following error:
"There is no queue named default"
(even though I do specify a queue via -Dmapred.job.queue.name=...).
I don't get this error otherwise.
Has anyone encountered this error before?
Is there some sort of configuration I'm missing?
Thanks,
Galit.
-----Original Message-----
From: Suneel Marthi [mailto:[email protected]]
Sent: Wednesday, July 10, 2013 5:30 PM
To: [email protected]
Subject: Re: mahout kmeans not generating clusteredPoint dir?
It's been a while since I last worked with this, but I believe you are missing
the clustering option '-cl'.
Give that a try.
________________________________
From: "Fuhrmann Alpert, Galit" <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Wednesday, July 10, 2013 5:17 AM
Subject: mahout kmeans not generating clusteredPoint dir?
Hello,
I ran mahout kmeans (using random seeds) on a Hadoop cluster. It ran
successfully and created a directory containing clusters-*, the last of which
was clusters-3-final.
However, it did not create clusteredPoints, or at least I cannot find it
under the same directory (or anywhere else).
My call was:
mahout kmeans -k 4000 -i inputSeq.dat -o outputPath --maxIter 3 --clusters outputSeeds
Was there an extra argument I needed to specify in order for it to generate the
clusteredPoints?
(BTW, I also can't see the outputSeeds. Was it created for the seeds and then
deleted?)
According to Mahout in Action:
The k-means clustering implementation creates two types of directories in the
output folder. The clusters-* directories are formed at the end of each
iteration: the clusters-0 directory is generated after the first iteration,
clusters-1 after the second iteration, and so on. These directories contain
information about the clusters: centroid, standard deviation, and so on. The
clusteredPoints directory, on the other hand, contains the final mapping from
cluster ID to document ID. This data is generated from the output of the last
MapReduce operation.
The directory listing of the output folder looks something like this:
$ ls -l reuters-kmeans-clusters
drwxr-xr-x 4 user 5000 136 Feb 1 18:56 clusters-0
drwxr-xr-x 4 user 5000 136 Feb 1 18:56 clusters-1
drwxr-xr-x 4 user 5000 136 Feb 1 18:56 clusters-2
...
drwxr-xr-x 4 user 5000 136 Feb 1 18:59 clusteredPoint
Again, my call did not generate the clusteredPoints directory.
I would appreciate your help.
Thanks a lot,
Galit.