Hi Rafal,
Thanks for your feedback.
How do I actually start Mahout? I'm finding this frustrating. I'm trying to
just run the command:
hadoop jar <path to mahout-core-0.8-job.jar>
but it does not work as I expect (it gives me an error that no main class was
specified). The only way I've found that I can run Mahout commands is through
the MahoutDriver class. Is there a way to list available commands/classes?
I've tried getting my scripts up and running on Mahout 0.8, but I'm running
into a problem preparing the input vectors. I've placed my raw text files under
/raw in HDFS, and the folder contains a ham and a spam subfolder. I construct
the input vectors using:
call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seqdirectory -i raw -o raw-seq -ow
call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seq2sparse -i raw-seq -o raw-vectors -lnorm -nv -wt tfidf
The call to seq2sparse runs two successful MapReduce jobs but fails on the
third, which cannot find the file 'dictionary.file-0':
13/08/26 10:44:08 INFO input.FileInputFormat: Total input paths to process : 21
13/08/26 10:44:09 INFO mapred.JobClient: Running job: job_201308231143_0018
13/08/26 10:44:10 INFO mapred.JobClient: map 0% reduce 0%
13/08/26 10:44:19 INFO mapred.JobClient: Task Id : attempt_201308231143_0018_m_000022_0, Status : FAILED
Error initializing attempt_201308231143_0018_m_000022_0:
java.io.FileNotFoundException: asv://[email protected]/user/hdp/raw-vectors/dictionary.file-0 : No such file or directory.
    at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.getFileStatus(NativeAzureFileSystem.java:960)
    at org.apache.hadoop.filecache.TaskDistributedCacheManager.setupCache(TaskDistributedCacheManager.java:179)
    at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1223)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1135)
    at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1214)
    at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1129)
    at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2443)
    at java.lang.Thread.run(Thread.java:722)
Notice that Mahout is looking under /user/hdp/raw-vectors, but the file is in my
user directory (which is /user/admin/raw-vectors in HDInsight). This looks like
a bug to me. Can I fix this, or is there a way to avoid using the dictionary
file? Can I just use the dense vectors for training the Naïve Bayes model?
I tried manually copying the file from /user/admin to /user/hdp and re-running
the seq2sparse command, but it complains that the input files are out of date,
so that workaround did not help.
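The mismatch described above is consistent with how Hadoop treats relative paths: they are resolved against the home directory of whichever user runs the code, /user/<user>. The following is a simplified Python model of that rule, not Hadoop's implementation. resolve_hdfs_path is a hypothetical helper, and the idea that the failing task runs as hdp is only an assumption read off the path in the stack trace.

```python
import posixpath


def resolve_hdfs_path(path: str, user: str) -> str:
    """Simplified model: relative paths resolve against /user/<user>.

    Hypothetical helper for illustration; not part of Hadoop or Mahout.
    """
    if path.startswith("/"):
        # Absolute paths do not depend on the user at all.
        return posixpath.normpath(path)
    return posixpath.normpath(posixpath.join("/user", user, path))


# Relative path, as produced by "-o raw-vectors".
relative = "raw-vectors/dictionary.file-0"
as_admin = resolve_hdfs_path(relative, "admin")
as_hdp = resolve_hdfs_path(relative, "hdp")
print(as_admin)  # /user/admin/raw-vectors/dictionary.file-0
print(as_hdp)    # /user/hdp/raw-vectors/dictionary.file-0

# Absolute path: both users see the same location.
absolute = "/user/admin/raw-vectors/dictionary.file-0"
same_for_both = resolve_hdfs_path(absolute, "admin") == resolve_hdfs_path(absolute, "hdp")
print(same_for_both)  # True
```

If this is what happens here, passing absolute paths (for example -i /user/admin/raw-seq -o /user/admin/raw-vectors) might sidestep the mismatch. I have not verified that on HDInsight.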
Thanks,
Simon
-----Original Message-----
From: Rafal Lukawiecki [mailto:[email protected]]
Sent: 23 August 2013 19:45
To: <[email protected]>
Subject: Re: Trying to use Mahout to make predictions based on log files
Simon,
Could you share what parameters you have passed to run this job?
On another note, the samples provided with the HDInsight Azure preview are
somewhat incomplete: they have missing files and incorrectly named
directories, and they don't work too well. Also, Mahout 0.5 had a number of
issues of its own.
Regardless of how your current issue is resolved, I suggest that you download
mahout-distribution-0.8.zip from http://www.apache.org/dyn/closer.cgi/mahout/
and unzip it somewhere on your cluster after connecting to your HDInsight
instance over RDP. You can then invoke mahout-core-0.8-job.jar by specifying
its full path from the Hadoop prompt, or use the web-based HDInsight console to
create a job and browse for the locally downloaded copy of
mahout-core-0.8-job.jar. The only difference is where you keep your data: the
console requires you to have it on ASV, an Azure blob, while if you run the
jobs from the prompt via RDP you can just use hadoop fs -copyFromLocal to place
it on "HDFS" (in quotes, because it will end up on the ASV blob anyway).
Rafal
--
Rafal Lukawiecki
Strategic Consultant and Director
Project Botticelli Ltd
On 22 Aug 2013, at 13:56, Simon Ejsing <[email protected]> wrote:
Hi,
I'm new to using Mahout, and I'm trying to use it to make predictions on a
series of log files. I'm running it in a Windows Azure HDInsight cluster
(Hadoop-based). I'm using Mahout 0.5 as that is what I could get to work with
the samples (I'm fine with upgrading to 0.8 if I can get the samples to work).
I'm following the same idea as the spam classification example found
here<http://searchhub.org/2011/05/04/an-introductory-how-to-build-a-spam-filter-server-with-mahout/>
using Naïve Bayes (which I can make work without problems), but when I try to
use my own data (which is obviously not emails), I end up with a prediction
model that characterizes everything as unknown. I can see that the computed
normalizing factors are NaN:
13/08/22 12:13:57 INFO bayes.BayesDriver: Calculating the weight Normalisation factor for each class...
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: Sigma_k for Each Label
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: {spam=NaN, ham=NaN}
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: Sigma_kSigma_j for each Label and for each Features
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: NaN
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: Vocabulary Count
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: 182316.0
But I'm not sure what that means, or why it happens. Could this be related to
my input documents? The spam filter is based on emails of roughly a couple of
KB each, whereas my inputs are log files of roughly a couple of MB each. Also,
the training is done on a small dataset of only 100-120 samples (I'm working on
gathering more data to run on a larger sample).
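A general floating-point note on the NaN values in the log above: under IEEE 754 arithmetic, any sum that includes a NaN is itself NaN, so a single bad per-feature weight is enough to turn a whole per-label total into NaN. The sketch below illustrates only that property. It is not Mahout's code, the feature names and weights are made up, and it says nothing about where the NaN originates in this particular run.

```python
import math


def find_nan_features(weights_by_feature):
    """Return the names of features whose weight is NaN (hypothetical helper)."""
    return [name for name, w in weights_by_feature.items() if math.isnan(w)]


# Made-up per-feature weights for one label; one of them is NaN.
weights = {"error": 0.7, "timeout": float("nan"), "login": 1.2}

# One NaN term poisons the whole sum.
total = sum(weights.values())
print(math.isnan(total))           # True

# Locating the offending entries is a simple filter.
print(find_nan_features(weights))  # ['timeout']
```

When debugging, this suggests looking for the first intermediate value that is NaN rather than at the final totals.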
Attached is the script I use to train and test the model as well as the output
from executing the script on the cluster.
Any help is appreciated!
-Simon Ejsing
<stderr.txt>