Hi,
I'm trying to get a grip on the Mahout command-line options, and I keep
getting caught out by either a gross misunderstanding or Java errors. Any
help would be greatly appreciated.
I've created some hand-built data which I expect to be noisy, but I still
hoped to run it through my workflow before improving the data quality.
"id","brace","target"
000040045,0194,1
000006445,0149,1
000033554,0013,1
...
My understanding is that my workflow should be as follows:
1: Use "trainAdaptiveLogistic" with scored data to create a model (here
called PC.model)
2: Use "validateAdaptiveLogistic" to test how good the model is on a
holdout data set which has also been scored
3: Use "runAdaptiveLogistic" on some unscored data (i.e. no third column)
to find out new things
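To make the three inputs concrete, here is a minimal Python sketch of how the scored file could be split into a training set and a holdout set, and how the "target" column could be stripped to produce the unscored input for step 3. The function names, the holdout fraction and the seed are illustrative assumptions, not anything Mahout requires.

```python
import csv
import io
import random


def split_scored_rows(rows, holdout_fraction=0.2, seed=42):
    """Shuffle a copy of the scored rows and return (train, holdout)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def drop_column(header, rows, name):
    """Return (header, rows) with the named column removed."""
    idx = header.index(name)
    new_header = header[:idx] + header[idx + 1:]
    new_rows = [row[:idx] + row[idx + 1:] for row in rows]
    return new_header, new_rows


def to_csv_text(header, rows):
    """Serialise a header and rows as CSV text with Unix line endings."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # The three sample rows from the data shown above; ids stay strings
    # so that leading zeros are preserved.
    header = ["id", "brace", "target"]
    rows = [
        ["000040045", "0194", "1"],
        ["000006445", "0149", "1"],
        ["000033554", "0013", "1"],
    ]
    train, holdout = split_scored_rows(rows, holdout_fraction=0.34)
    run_header, run_rows = drop_column(header, rows, "target")
    print(to_csv_text(header, train))         # would become ./PCtrain
    print(to_csv_text(header, holdout))       # would become ./PCtest
    print(to_csv_text(run_header, run_rows))  # would become ./PCrun
```

In practice the unscored file would of course hold new records rather than the training rows with the target removed; the sample rows are reused here only to keep the sketch self-contained.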
Firstly: is that a valid workflow?
runAdaptiveLogistic appears to expect scored data as well; at least, it
fails if I give it only unscored data (i.e. the "target" column is absent).
If not, how do I productionise a model?
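One possible stopgap, untested against Mahout and resting on the assumption that the run step only needs a "target" column to be present in the header, would be to pad the unscored file with a placeholder target before passing it to runAdaptiveLogistic. The sketch below shows that padding in Python; the function name and the placeholder value "1" (chosen because it occurs in the training data) are illustrative assumptions.

```python
import csv
import io


def add_placeholder_target(csv_text, target_name="target", placeholder="1"):
    """Append a constant placeholder column to unscored CSV text.

    Assumption: the consumer only needs the column to exist; the
    placeholder carries no information about the true class.
    """
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")

    header = next(reader)
    if target_name in header:
        raise ValueError("input already has a %r column" % target_name)
    writer.writerow(header + [target_name])

    for row in reader:
        if row:  # skip blank lines
            writer.writerow(row + [placeholder])
    return out.getvalue()


if __name__ == "__main__":
    unscored = '"id","brace"\n000040045,0194\n000006445,0149\n'
    print(add_placeholder_target(unscored))
```

Whether the predictions written by the run step are really independent of the placeholder is exactly the open question here, so any output produced this way would need checking against known cases.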
(Note: I got the flow to work, at least with scored data for all three
steps, with mahout-0.7 and mahout-0.8, but as I thought the "run" step
should work differently I also tried mahout-0.9. There, the second step
fails as well.)
[cloudera@localhost ]$ mahout trainAdaptiveLogistic \
--passes 100 \
--input ./PCtrain \
--features 50 \
--output ./PC.model \
--target target \
--categories 2 \
--predictors brace \
--types t
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using
/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/hadoop/bin/hadoop
and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB:
/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/mahout/mahout-examples-0.8-cdh5.0.0-job.jar
14/05/26 13:56:19 WARN driver.MahoutDriver: No
trainAdaptiveLogistic.props found on classpath, will use command-line
arguments only
50
target ~
0.000000000 0.051644057 0.000000000 0.000000000
0.000000000 0.023763329 0.000000000 0.000000000
-0.054034312 -0.000000000 0.000000000 0.021475032
0.028820276 0.000000000 0.033145160 0.000000000
0.000000000 0.000000000 0.000000000 -0.000000000
0.000000000 0.000000000 0.000000000 0.000000000
0.000000000 0.000000000 0.051755156 0.000000000
-0.000000000 -0.000000001 0.000000000 -0.053815953
0.030166157 0.000000000 0.000000000 -0.073127179
0.000000000 -0.000000000 0.000000000 0.000000000
-0.000000000 0.000000000 0.000000000 -0.108047988
0.000000000 0.000000000 0.000000000 0.000000000
0.000000000 -0.000000000
14/05/26 13:56:36 INFO driver.MahoutDriver: Program took 17784 ms
(Minutes: 0.2964)
[cloudera@localhost]$ mahout validateAdaptiveLogistic \
--input ./PCtest \
--model ./PC.model \
--auc \
--confusion
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using
/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/hadoop/bin/hadoop
and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB:
/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/mahout/mahout-examples-0.8-cdh5.0.0-job.jar
14/05/26 13:56:53 WARN driver.MahoutDriver: No
validateAdaptiveLogistic.props found on classpath, will use command-line
arguments only
Log-likelihood:Min=-0.78, Max=-0.61, Mean=-0.68, Median=-0.69
AUC = 0.65
=======================================================
Confusion Matrix
-------------------------------------------------------
a b <--Classified as
182 0 | 182 a = 1
0 18 | 18 b = 2
Entropy Matrix: [[-0.7, -0.4], [-0.7, -0.3]]
14/05/26 13:56:54 INFO driver.MahoutDriver: Program took 1125 ms
(Minutes: 0.018766666666666668)
[cloudera@localhost]$ mahout runAdaptiveLogistic \
--input ./PCrun \
--model ./PC.model \
--idcolumn id \
--output ./PC.out
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using
/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/hadoop/bin/hadoop
and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB:
/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/mahout/mahout-examples-0.8-cdh5.0.0-job.jar
14/05/26 13:57:09 WARN driver.MahoutDriver: No runAdaptiveLogistic.props
found on classpath, will use command-line arguments only
Exception in thread "main" java.lang.NullPointerException
at
org.apache.mahout.classifier.sgd.CsvRecordFactory.firstLine(CsvRecordFactory.java:176)
at
org.apache.mahout.classifier.sgd.RunAdaptiveLogistic.mainToOutput(RunAdaptiveLogistic.java:83)
at
org.apache.mahout.classifier.sgd.RunAdaptiveLogistic.main(RunAdaptiveLogistic.java:54)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
[cloudera@localhost]$ mahout runAdaptiveLogistic \
--input ./PCtest \
--model ./PC.model \
--idcolumn id \
--output ./PC.out
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using
/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/hadoop/bin/hadoop
and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB:
/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/mahout/mahout-examples-0.8-cdh5.0.0-job.jar
14/05/26 13:57:35 WARN driver.MahoutDriver: No runAdaptiveLogistic.props
found on classpath, will use command-line arguments only
100 records processed
200 records processed
200 records processed totally.
14/05/26 13:57:36 INFO driver.MahoutDriver: Program took 943 ms
(Minutes: 0.015716666666666667)
Thanks,
Duncan.