Simon,

I'm glad to hear it works for you. To get to the command line, open a Remote 
Desktop Connection to your cluster. Once there, you should find a convenient 
"Hadoop Command Prompt" shortcut.
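
For example, from that prompt you can invoke the Mahout driver directly. The 
install path below is only an illustration (adjust it to wherever you unzipped 
the distribution); invoking the driver class this way is how the rest of this 
thread runs Mahout jobs:

```shell
# Illustrative path - point this at your own unzipped Mahout distribution.
hadoop jar C:\mahout-distribution-0.8\mahout-core-0.8-job.jar org.apache.mahout.driver.MahoutDriver
```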

Rafal
--
Rafal Lukawiecki
Pardon brevity, mobile device.

On 26 Aug 2013, at 14:04, "Simon Ejsing" <[email protected]> wrote:

> Okay, I managed to solve the main issue: use absolute paths instead of 
> relative paths and everything works like a charm. I'd still like to hear 
> from you regarding running Mahout from the command line!
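> 
> For anyone hitting the same thing, here is a sketch of what worked for me 
> (the /user/admin paths are from my cluster; substitute your own HDFS user 
> directory):
> 
> ```shell
> rem Same commands as before, but with absolute HDFS paths (illustrative):
> call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seqdirectory -i /user/admin/raw -o /user/admin/raw-seq -ow
> call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seq2sparse -i /user/admin/raw-seq -o /user/admin/raw-vectors -lnorm -nv -wt tfidf
> ```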
> 
> -----Original Message-----
> From: Simon Ejsing [mailto:[email protected]] 
> Sent: 26. august 2013 13:04
> To: [email protected]
> Subject: RE: Trying to use Mahout to make predictions based on log files
> 
> Hi Rafal,
> 
> Thanks for your feedback.
> 
> How do I actually start Mahout? I'm finding this frustrating. I'm trying to 
> just run the command:
>        hadoop jar <path to mahout-core-0.8-job.jar>
> 
> but it does not work as I expect (it gives me an error that no main class was 
> specified). The only way I've found that I can run Mahout commands is through 
> the MahoutDriver class. Is there a way to list available commands/classes?
> 
> I've tried getting my scripts up and running on Mahout 0.8, but I'm getting 
> into a problem preparing the input vectors. I've placed my raw text files 
> under /raw in HDFS and the folder contains a ham and a spam subfolder. When I 
> try to construct the input vectors using:
> 
>        call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar 
> org.apache.mahout.driver.MahoutDriver seqdirectory -i raw -o raw-seq -ow
>        call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar 
> org.apache.mahout.driver.MahoutDriver seq2sparse -i raw-seq -o raw-vectors 
> -lnorm -nv -wt tfidf
> 
> The call to seq2sparse runs two successful Mapreduce jobs but fails on the 
> third trying to find the file 'dictionary.file-0':
>          13/08/26 10:44:08 INFO input.FileInputFormat: Total input paths to 
> process : 21
>          13/08/26 10:44:09 INFO mapred.JobClient: Running job: 
> job_201308231143_0018
>          13/08/26 10:44:10 INFO mapred.JobClient:  map 0% reduce 0%
>          13/08/26 10:44:19 INFO mapred.JobClient: Task Id : 
> attempt_201308231143_0018_m_000022_0, Status : FAILED
>          Error initializing attempt_201308231143_0018_m_000022_0:
>          java.io.FileNotFoundException: 
> asv://[email protected]/user/hdp/raw-vectors/dictionary.file-0
>          : No such file or directory.
>                  at 
> org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.getFileStatus(NativeAzureFileSystem.java:960)
>                  at 
> org.apache.hadoop.filecache.TaskDistributedCacheManager.setupCache(TaskDistributedCacheManager.java:179)
>                  at 
> org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1223)
>                  at java.security.AccessController.doPrivileged(Native Method)
>                  at javax.security.auth.Subject.doAs(Subject.java:415)
>                  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1135)
>                  at 
> org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1214)
>                  at 
> org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1129)
>                  at 
> org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2443)
>                  at java.lang.Thread.run(Thread.java:722)
> 
> Notice that Mahout is looking under /user/hdp/raw-vectors, but the file is in 
> my user directory (which is /user/admin/raw-vectors in HDInsight). This looks 
> like a bug to me. Can I fix this, or is there a way to avoid using the 
> dictionary file? Can I just use the dense vectors for training the Naïve 
> Bayes model?
> 
> I tried manually copying the file from /user/admin to /user/hdp and 
> re-running the seq2sparse command, but it complained that the input files 
> were out of date, so that work-around did not work.
> 
> Thanks,
> Simon
> 
> -----Original Message-----
> From: Rafal Lukawiecki [mailto:[email protected]]
> Sent: 23. august 2013 19:45
> To: <[email protected]>
> Subject: Re: Trying to use Mahout to make predictions based on log files
> 
> Simon,
> 
> Could you share what parameters you have passed to run this job?
> 
> On another note, the samples provided with the HDInsight Azure preview are a 
> little incomplete: they have missing files and incorrectly named 
> directories, and they don't work too well. Also, Mahout 0.5 had a number of 
> issues of its own.
> 
> Regardless of the resolution of your current issue, I suggest that you 
> download mahout-distribution-0.8.zip from 
> http://www.apache.org/dyn/closer.cgi/mahout/, unzip it somewhere on your 
> cluster using RDP into your HDInsight instance, and invoke 
> mahout-core-0.8-job.jar by specifying its full path from the Hadoop prompt, 
> or use the web-based HDInsight console to create a job, and browse for the 
> locally downloaded copy of mahout-core-0.8-job.jar. The only difference is 
> where you keep your data: the console requires you to have it on ASV, an 
> Azure blob, while if you run the jobs from the prompt via RDP you can just 
> use hadoop fs -copyFromLocal to place it on "HDFS" (in quotes, because it 
> will end up on the ASV blob anyway).
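> 
> For example (the local path is illustrative; the target is your HDFS user 
> directory):
> 
> ```shell
> hadoop fs -copyFromLocal C:\data\raw /user/admin/raw
> ```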
> 
> Rafal
> 
> --
> 
> Rafal Lukawiecki
> 
> Strategic Consultant and Director
> 
> Project Botticelli Ltd
> 
> On 22 Aug 2013, at 13:56, Simon Ejsing <[email protected]> wrote:
> 
> Hi,
> 
> I'm new to Mahout, and I'm trying to use it to make predictions on a series 
> of log files. I'm running it on a Windows Azure HDInsight cluster (Hadoop 
> based). I'm using Mahout 0.5, as that is what I could get to work with the 
> samples (I'm fine with upgrading to 0.8 if I can get the samples to work).
> 
> I'm following the same idea as the spam classification example found here:
> http://searchhub.org/2011/05/04/an-introductory-how-to-build-a-spam-filter-server-with-mahout/
> using Naïve Bayes (which I can make work without problems), but when I try 
> to use my own data (which is obviously not emails), I end up with a 
> prediction model that classifies everything as unknown. I can see that the 
> computed normalizing factors are NaN:
> 
> 13/08/22 12:13:57 INFO bayes.BayesDriver: Calculating the weight 
> Normalisation factor for each class...
> 13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: Sigma_k for Each 
> Label
> 13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: {spam=NaN, ham=NaN}
> 13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: Sigma_kSigma_j for 
> each Label and for each Features
> 13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: NaN
> 13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: Vocabulary Count
> 13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: 182316.0
> 
> I'm not sure what that means or why it happens. Could this be related to my 
> input documents? The spam filter is based on emails of roughly a couple of 
> KB in size, whereas my inputs are log files of roughly a couple of MB. Also, 
> the training is done on a small dataset of only 100-120 samples (I'm 
> working on gathering more data to run on a larger sample).
> 
> Attached is the script I use to train and test the model as well as the 
> output from executing the script on the cluster.
> 
> Any help is appreciated!
> 
> -Simon Ejsing
> <stderr.txt>
> 
> 
