Re: Trying to use Mahout to make predictions based on log files

Rafal Lukawiecki Fri, 23 Aug 2013 10:46:15 -0700

Simon,

Could you share what parameters you have passed to run this job?


On another note, the samples provided with the HDInsight Azure preview are a 
little bit incomplete, have missing files and incorrectly named directories, 
and they don't work too well. Also, Mahout 0.5 had a number of issues of its 
own.

Regardless of how your current issue is resolved, I suggest that you download 
mahout-distribution-0.8.zip from http://www.apache.org/dyn/closer.cgi/mahout/ 
and, using RDP into your HDInsight instance, unzip it somewhere on your 
cluster. You can then invoke mahout-core-0.8-job.jar by specifying its full 
path from the Hadoop prompt, or use the web-based HDInsight console to create 
a job and browse for the locally downloaded copy of mahout-core-0.8-job.jar. 
The only difference is where you keep your data: the console requires you to 
have it on ASV, an Azure blob, while if you run the jobs from the prompt via 
RDP you can just use hadoop fs -copyFromLocal to place it on "HDFS" (in 
quotes, because it will end up on the ASV blob anyway).
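
As a rough sketch of the prompt route, something like the following might 
do. The jar location, the local log directory and the target path are all 
placeholders I made up, so substitute your own. It is written as POSIX shell 
for illustration; at the Windows Hadoop prompt the two hadoop commands 
themselves should be the same, just with Windows paths. The main class named 
here is the one the bin/mahout launcher script passes to hadoop jar; run 
without a program name, it should list the programs it knows about.

```shell
#!/bin/sh
# Placeholder locations (assumptions, not taken from any real cluster).
MAHOUT_JAR="${MAHOUT_JAR:-mahout-distribution-0.8/mahout-core-0.8-job.jar}"
LOCAL_LOGS="${LOCAL_LOGS:-logs}"
DFS_INPUT="${DFS_INPUT:-/user/hypothetical/logs}"

# Print each command; execute it only when a hadoop client is on the PATH.
run() {
  echo "+ $*"
  if command -v hadoop >/dev/null 2>&1; then
    "$@"
  fi
}

# Copy the local log files into the cluster's default file system.
run hadoop fs -copyFromLocal "$LOCAL_LOGS" "$DFS_INPUT"

# Start the Mahout driver from the downloaded job jar.
run hadoop jar "$MAHOUT_JAR" org.apache.mahout.driver.MahoutDriver
```

On a machine without Hadoop installed this only prints the two commands, 
which makes it easy to check the paths before running it on the cluster.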

Rafal

--

Rafal Lukawiecki

Strategic Consultant and Director

Project Botticelli Ltd

On 22 Aug 2013, at 13:56, Simon Ejsing wrote:

Hi,

I’m new to using Mahout, and I’m trying to use it to make predictions on a 
series of log files. I’m running it in a Windows Azure HDInsight cluster 
(Hadoop-based). I’m using Mahout 0.5 as that is what I could get to work with 
the samples (I’m fine with upgrading to 0.8 if I can get the samples to work).

I’m following the same idea as the spam classification example found here 
<http://searchhub.org/2011/05/04/an-introductory-how-to-build-a-spam-filter-server-with-mahout/> 
using Naïve Bayes (which I can make work without problems), but when I try to 
use my own data (which is obviously not emails), I end up with a prediction 
model that characterizes everything as unknown. I can see that the computed 
normalizing factors are NaN:

13/08/22 12:13:57 INFO bayes.BayesDriver: Calculating the weight Normalisation factor for each class...
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: Sigma_k for Each Label
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: {spam=NaN, ham=NaN}
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: Sigma_kSigma_j for each Label and for each Features
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: NaN
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: Vocabulary Count
13/08/22 12:13:57 INFO bayes.BayesThetaNormalizerDriver: 182316.0

But I’m not sure what that means or why it happens. Could this be related to 
my input documents? The spam filter is based on emails of roughly a couple of 
KB in size, whereas my inputs are a series of log files of roughly a couple of 
MB each. Also, the training is done on a small dataset of only 100-120 samples 
(I’m working on gathering more data to run on a larger sample).

Attached is the script I use to train and test the model as well as the output 
from executing the script on the cluster.

Any help is appreciated!

-Simon Ejsing
<stderr.txt>
