Hi Everyone,
I see this is an active list, so I'm trying again to reach out for help using
Mahout on HDFS.
I am a research scientist at eBay, working on Big Data analysis for
e-commerce. I've been trying to run Mahout on my data for quite some time now:
running it locally is no problem, but I'm having trouble running it on HDFS.
I hope you have some leads on the following (I've accumulated quite a few
unresolved issues):
1. I'm trying again to see if anyone has an answer to this:
I've been running mahout kmeans successfully on HDFS; however, if I run
mahout kmeans without the -cl flag, the clusteredPoints directory is not
created.
Whenever I add '-cl' to my call, I get the error "There is no queue
named default", even though I do specify a queue via -Dmapred.job.queue.name.
I do not get this error if I omit -cl; the job runs just fine (though it
does not create the clusteredPoints directory).
Does anyone have an idea why this happens?
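For reference, this is the shape of the call that triggers the error (the paths, queue name, and parameter values here are placeholders, not my real ones):

```shell
# Works, but produces no clusteredPoints directory:
mahout kmeans -Dmapred.job.queue.name=myqueue \
  -i inputSeq.dat -o outputPath -c outputSeeds -k 4000 -x 3

# Fails with "There is no queue named default":
mahout kmeans -Dmapred.job.queue.name=myqueue \
  -i inputSeq.dat -o outputPath -c outputSeeds -k 4000 -x 3 -cl
```

The only difference between the two calls is the trailing -cl.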
2. My Mahout clustering processes seem to run very slowly (several good
hours on just ~1M items), and I'm wondering whether anything needs to be
changed in my settings/configuration (and how).
I'm running on large clusters and could potentially use thousands of
nodes. However, my Mahout processes (kmeans/canopy) only ever use at most 5
mappers (I tried several data sets).
I tried to set the number of mappers with something like
-Dmapred.map.tasks=100, but this didn't seem to have any effect; it still
uses <=5 mappers.
Is there a different way to set the number of mappers/reducers for a
Mahout process?
Or is there another configuration issue I should consider?
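In case the flag placement matters, this is roughly how I'm passing the -D options (before the job-specific options, as GenericOptionsParser expects; all values below are placeholders):

```shell
mahout kmeans -Dmapred.job.queue.name=myqueue -Dmapred.map.tasks=100 \
  -i inputSeq.dat -o outputPath -c outputSeeds -k 4000 -x 3
```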
3. When running Mahout canopy clustering, the jobs consistently fail with
out-of-memory errors such as:
attempt_201306241658_137502_m_000001_1: Exception in thread "Thread for syncLogs" java.lang.OutOfMemoryError: Java heap space
and finally:
Exception in thread "main" java.lang.InterruptedException: Canopy Job failed processing whateverfilename.dat
even though the file does exist.
I tried to increase the map/reduce memory with
-Dmapred.child.java.opts=-Xmx4g, but it still fails:
13/07/22 01:56:09 INFO mapred.JobClient: Job Counters
13/07/22 01:56:09 INFO mapred.JobClient:   SLOTS_MILLIS_MAPS=23121
13/07/22 01:56:09 INFO mapred.JobClient:   Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/22 01:56:09 INFO mapred.JobClient:   Total time spent by all maps waiting after reserving slots (ms)=0
13/07/22 01:56:09 INFO mapred.JobClient:   Launched map tasks=13
13/07/22 01:56:09 INFO mapred.JobClient:   SLOTS_MILLIS_REDUCES=0
13/07/22 01:56:09 INFO mapred.JobClient:   Failed map tasks=1
Exception in thread "main" java.lang.InterruptedException: Canopy Job failed processing whateverfilename.dat
    at org.apache.mahout.clustering.canopy.CanopyDriver.buildClustersMR(CanopyDriver.java:363)
    at org.apache.mahout.clustering.canopy.CanopyDriver.buildClusters(CanopyDriver.java:248)
    at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:155)
    at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:117)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.canopy.CanopyDriver.main(CanopyDriver.java:64)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
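For completeness, the canopy call itself looks roughly like this (paths, thresholds, and the distance measure below are placeholders, not my real values):

```shell
mahout canopy -Dmapred.job.queue.name=myqueue \
  -Dmapred.child.java.opts=-Xmx4g \
  -i whateverfilename.dat -o canopyOutput \
  -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
  -t1 500 -t2 250
```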
4. One of the first major problems I encountered was that a Mahout jar we
created that uses KMeansDriver (and that runs great on my local machine) did
not even initiate a job on the Hadoop cluster. It seemed to be running in
parallel, but in fact it was running only on the local node. Has this happened
to anyone? If so, what is the fix? (I ended up dropping it and calling
mahout step by step from the command line, but I'd be happy to know if there is
a fix for this.)
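One thing I've since wondered about is whether the launch style matters. My understanding is that only the `hadoop jar` form is guaranteed to pick up the cluster's configuration, while a plain `java -jar` can silently fall back to Hadoop's local job runner. A sketch (jar and class names are placeholders):

```shell
# Picks up the cluster's configuration (e.g. mapred-site.xml) from the
# Hadoop classpath, so jobs go to the real JobTracker:
hadoop jar our-kmeans-job.jar com.example.OurKMeansJob ...

# Uses only whatever configuration is bundled with the jar; with the
# stock defaults that means mapred.job.tracker=local, i.e. a local run:
java -jar our-kmeans-job.jar ...
```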
Any ideas/input on any of these issues would be greatly appreciated.
Thanks!
Galit.
-----Original Message-----
From: Fuhrmann Alpert, Galit
Sent: Wednesday, July 17, 2013 12:43 PM
To: [email protected]; 'Suneel Marthi'
Subject: RE: mahout kmeans not generating clusteredPoint dir?
Thanks Suneel.
I tried adding this flag (though I think the clusteredPoints directory was
supposed to be created by default?).
Either way, for some reason whenever I add '-cl' (I tried it on several
data sets), I get the following error:
"There is no queue named default"
(even though I do specify a queue via -Dmapred.job.queue.name=...).
I don't get this error otherwise.
Has anyone encountered this error before?
Is there some sort of configuration I'm missing?
Thanks,
Galit.
-----Original Message-----
From: Suneel Marthi [mailto:[email protected]]
Sent: Wednesday, July 10, 2013 5:30 PM
To: [email protected]
Subject: Re: mahout kmeans not generating clusteredPoint dir?
It's been a while since I last worked with this, but I believe you are missing
the clustering option '-cl'.
Give that a try.
________________________________
From: "Fuhrmann Alpert, Galit" <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Wednesday, July 10, 2013 5:17 AM
Subject: mahout kmeans not generating clusteredPoint dir?
Hello,
I ran mahout kmeans (using random seeds) on a Hadoop cluster. It ran
successfully and created a directory containing clusters-*, the last of which
was clusters-3-final.
However, it did not create clusteredPoints, or at least I cannot find it
under the same directory (or anywhere else).
My call was:
mahout kmeans -k 4000 -i inputSeq.dat -o outputPath --maxIter 3 --clusters outputSeeds
Was there an extra argument I needed to specify in order for it to generate the
clusteredPoints?
(BTW, I also can't see the outputSeeds. Was it created for the seeds and then
deleted?)
According to Mahout in Action:
The k-means clustering implementation creates two types of directories in the
output folder. The clusters-* directories are formed at the end of each
iteration: the clusters-0 directory is generated after the first iteration,
clusters-1 after the second iteration, and so on. These directories contain
information about the clusters: centroid, standard deviation, and so on. The
clusteredPoints directory, on the other hand, contains the final mapping from
cluster ID to document ID. This data is generated from the output of the last
MapReduce operation.
The directory listing of the output folder looks something like this:
$ ls -l reuters-kmeans-clusters
drwxr-xr-x 4 user 5000 136 Feb 1 18:56 clusters-0
drwxr-xr-x 4 user 5000 136 Feb 1 18:56 clusters-1
drwxr-xr-x 4 user 5000 136 Feb 1 18:56 clusters-2
...
drwxr-xr-x 4 user 5000 136 Feb 1 18:59 clusteredPoint
Again, my call did not generate the clusteredPoints directory.
I would appreciate your help.
Thanks a lot,
Galit.