Simon, I'm glad to hear it works for you. To get to the command line, you need to open the Remote Desktop Connection to your cluster. Once there, there should be a convenient "Hadoop Command Prompt" shortcut.

Rafal

--
Rafal Lukawiecki
Pardon brevity, mobile device.
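For reference, a minimal sketch of driving Mahout 0.8 via the MahoutDriver class from that Hadoop Command Prompt, written in the same batch style used later in this thread. The unzip location C:\apps\mahout-distribution-0.8 is only an assumed example path, and the listing behaviour noted in the comments should be verified on the cluster:

    set MahoutDir=C:\apps\mahout-distribution-0.8
    REM With no program name after the driver class, MahoutDriver should print the
    REM list of valid program names it knows about (seqdirectory, seq2sparse, and so on).
    call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver
    REM A specific job is selected by putting its short name and options after the driver class:
    call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seqdirectory -i <input> -o <output> -ow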
On 26 Aug 2013, at 14:04, "Simon Ejsing" <[email protected]> wrote:

> Okay, I managed to solve the main issue. Use absolute paths instead of
> relative paths and everything works like a charm... Still would like to hear
> from you regarding running Mahout from the command line!
>
> -----Original Message-----
> From: Simon Ejsing [mailto:[email protected]]
> Sent: 26. august 2013 13:04
> To: [email protected]
> Subject: RE: Trying to use Mahout to make predictions based on log files
>
> Hi Rafal,
>
> Thanks for your feedback.
>
> How do I actually start Mahout? I'm finding this frustrating. I'm trying to
> just run the command:
>
>     hadoop jar <path to mahout-core-0.8-job.jar>
>
> but it does not work as I expect (it gives me an error that no main class was
> specified). The only way I've found to run Mahout commands is through the
> MahoutDriver class. Is there a way to list the available commands/classes?
>
> I've tried getting my scripts up and running on Mahout 0.8, but I'm running
> into a problem preparing the input vectors. I've placed my raw text files
> under /raw in HDFS, and the folder contains a ham and a spam subfolder. When I
> try to construct the input vectors using:
>
>     call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seqdirectory -i raw -o raw-seq -ow
>     call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seq2sparse -i raw-seq -o raw-vectors -lnorm -nv -wt tfidf
>
> the call to seq2sparse runs two successful MapReduce jobs but fails on the
> third while trying to find the file 'dictionary.file-0':
>
>     13/08/26 10:44:08 INFO input.FileInputFormat: Total input paths to process : 21
>     13/08/26 10:44:09 INFO mapred.JobClient: Running job: job_201308231143_0018
>     13/08/26 10:44:10 INFO mapred.JobClient: map 0% reduce 0%
>     13/08/26 10:44:19 INFO mapred.JobClient: Task Id : attempt_201308231143_0018_m_000022_0, Status : FAILED
>     Error initializing attempt_201308231143_0018_m_000022_0:
>     java.io.FileNotFoundException: asv://[email protected]/user/hdp/raw-vectors/dictionary.file-0 : No such file or directory.
>         at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.getFileStatus(NativeAzureFileSystem.java:960)
>         at org.apache.hadoop.filecache.TaskDistributedCacheManager.setupCache(TaskDistributedCacheManager.java:179)
>         at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1223)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1135)
>         at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1214)
>         at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1129)
>         at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2443)
>         at java.lang.Thread.run(Thread.java:722)
>
> Notice that Mahout is looking under /user/hdp/raw-vectors but the file is in
> my user directory (which is /user/admin/raw-vectors in HDInsight). This looks
> like a bug to me. Can I fix this, or is there a way to avoid using the
> dictionary file? Can I just use the dense vectors for training the Naïve
> Bayes model?
>
> I tried manually copying the file from /user/admin to /user/hdp and re-running
> the seq2sparse command, but it complained that it detected that the input files
> were out of date, so that workaround did not work.
>
> Thanks,
> Simon
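The absolute-path workaround Simon mentions above can be sketched as follows. The locations under /user/admin (assumed here to be the submitting user's home directory on this HDInsight cluster) and the %MahoutDir% install path are placeholders; substitute wherever the data and the unzipped distribution actually live:

    set MahoutDir=C:\apps\mahout-distribution-0.8
    REM Convert the raw ham/spam text files into SequenceFiles, using absolute HDFS paths:
    call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seqdirectory -i /user/admin/raw -o /user/admin/raw-seq -ow
    REM Build tf-idf vectors; with absolute paths the dictionary.file-0 side file should be
    REM looked up under /user/admin rather than under the job user's /user/hdp home directory:
    call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver seq2sparse -i /user/admin/raw-seq -o /user/admin/raw-vectors -lnorm -nv -wt tfidf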
> -----Original Message-----
> From: Rafal Lukawiecki [mailto:[email protected]]
> Sent: 23. august 2013 19:45
> To: <[email protected]>
> Subject: Re: Trying to use Mahout to make predictions based on log files
>
> Simon,
>
> Could you share what parameters you have passed to run this job?
>
> On another note, the samples that have been provided with the HDInsight Azure
> preview are a little incomplete, have missing files and incorrectly named
> directories, and they don't work too well. Also, Mahout 0.5 had a number of
> issues of its own.
>
> Regardless of the resolution of your current issue, I suggest that you
> download mahout-distribution-0.8.zip from
> http://www.apache.org/dyn/closer.cgi/mahout/, unzip it somewhere on your
> cluster using RDP into your HDInsight instance, and invoke
> mahout-core-0.8-job.jar by specifying its full path from the Hadoop prompt,
> or use the web-based HDInsight console to create a job and browse for the
> locally downloaded copy of mahout-core-0.8-job.jar. The only difference is
> where you keep your data: the console requires you to have it on ASV, an
> Azure blob, while if you run the jobs from the prompt via RDP you can just
> use hadoop fs -copyFromLocal to place it on "HDFS" (in quotes, because it
> will end up on the ASV blob anyway).
>
> Rafal
>
> --
> Rafal Lukawiecki
> Strategic Consultant and Director
> Project Botticelli Ltd
>
> On 22 Aug 2013, at 13:56, Simon Ejsing <[email protected]> wrote:
>
> Hi,
>
> I'm new to using Mahout, and I'm trying to use it to make predictions on a
> series of log files. I'm running it in a Windows Azure HDInsight cluster
> (Hadoop based). I'm using Mahout 0.5, as that is what I could get to work with
> the samples (I'm fine with upgrading to 0.8 if I can get the samples to work).
>
> I'm following the same idea as the spam classification example found here
> <http://searchhub.org/2011/05/04/an-introductory-how-to-build-a-spam-filter-server-with-mahout/>,
> using Naïve Bayes (which I can make work without problems), but when I try
> to use my own data (which is obviously not emails), I end up with a
> prediction model that characterizes everything as unknown. I can see that the
> computed normalizing factors are NaN:
>
>     13/08/22 12:13:57 INFO bayes.BayesDriver: Calculating the weight Normalisation factor for each class...
>     13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: Sigma_k for Each Label
>     13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: {spam=NaN, ham=NaN}
>     13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: Sigma_kSigma_j for each Label and for each Features
>     13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: NaN
>     13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: Vocabulary Count
>     13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: 182316.0
>
> But I'm not sure what that means, or why that is. Could this be related to my
> input documents? The spam filter is based on emails roughly a couple of KB in
> size, whereas my inputs are a series of log files of roughly a couple of MB in
> size. Also, the training is done on a small dataset of only 100-120 samples
> (I'm working on gathering more data to run on a larger sample).
>
> Attached is the script I use to train and test the model, as well as the
> output from executing the script on the cluster.
>
> Any help is appreciated!
>
> -Simon Ejsing
> <stderr.txt>
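Tying Rafal's suggestions together, a rough sketch of the Mahout 0.8 route: copy the local data onto the cluster's ASV-backed storage with hadoop fs -copyFromLocal, build vectors as in the sketch above, then train and test a Naïve Bayes model with the 0.8 trainnb and testnb drivers. The local path C:\data\raw, the HDFS locations, and the flag names (taken from the Mahout 0.8 Naïve Bayes example workflow) are assumptions to verify on the cluster:

    set MahoutDir=C:\apps\mahout-distribution-0.8
    REM Copy the local training data onto "HDFS" (backed by the ASV blob on HDInsight):
    call hadoop fs -copyFromLocal C:\data\raw /user/admin/raw
    REM After seqdirectory and seq2sparse, train a Naive Bayes model from the tf-idf
    REM vectors, extracting the ham/spam labels from the vector keys (-el):
    call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver trainnb -i /user/admin/raw-vectors/tfidf-vectors -el -li /user/admin/labelindex -o /user/admin/model -ow
    REM Score vectors against the trained model and write the classification results:
    call hadoop jar %MahoutDir%\mahout-examples-0.8-job.jar org.apache.mahout.driver.MahoutDriver testnb -i /user/admin/raw-vectors/tfidf-vectors -m /user/admin/model -l /user/admin/labelindex -o /user/admin/predictions -ow

In practice the vectors would be split into separate training and test sets (Mahout 0.8 ships a split program for this) rather than testing against the training data itself.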
