Thank you so much Andrew! Running with mahout 0.10.0 solved the problem. Much appreciated! -Erdem
On Mon, Apr 27, 2015 at 8:15 PM, Andrew Palumbo <ap....@outlook.com> wrote:

It looks like you have a Mahout 0.9 install trying to run the Mahout 0.10.0
Naive Bayes script. The command-line options changed slightly for the
Mahout 0.10.0 MapReduce trainnb.

> mahout-examples-0.9-cdh5.3.0-job.jar
> 15/04/27 16:41:27 WARN

Sent from my Verizon Wireless 4G LTE smartphone

-------- Original message --------
From: Erdem Sahin <es2...@nyu.edu>
Date: 04/27/2015 7:58 PM (GMT-05:00)
To: user@mahout.apache.org
Subject: trainnb labelindex not found error - help requested

Hi Mahout users,

I'm trying to run the classify-20newsgroups.sh script
<https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh>
and it fails with a FileNotFoundException when it gets to the "trainnb"
command. All prior steps run successfully. I'm running either algorithm 1
or algorithm 2.

I have modified the script slightly so that it reads my input data instead
of the canonical data set. I've created a "wifidata" folder on the local FS
with the following structure:

wifidata/havewifi
wifidata/nowifi

Within havewifi and nowifi there are files with text file names and text
content. These eventually get copied to HDFS.

I'm not clear whether the "labelindex" file, which cannot be found, is
supposed to be created by trainnb or by a prior step.
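(For context on the labelindex question: in the Mahout 0.9 version of this
script, trainnb was invoked with an extra -el flag, which extracts the label
index from the input vectors and writes it to the -li path, so labelindex is
created by trainnb itself rather than by a prior step. A minimal sketch of
that 0.9-style call, reusing the paths from the script below; treat the
exact flag set as an assumption if your install differs:)

  # Sketch of the Mahout 0.9-style trainnb call. The -el (extract labels)
  # flag tells trainnb to build the label index from the input vectors and
  # write it to the -li path, instead of expecting the file to exist.
  /usr/bin/mahout trainnb \
    -i ${WORK_DIR}/20news-train-vectors -el \
    -o ${WORK_DIR}/model \
    -li ${WORK_DIR}/labelindex \
    -ow $c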
Please see the details of the modified script and the error below. Any help
would be appreciated.

Thanks and best regards,
Erdem Sahin

Script:

if [ "$1" = "--help" ] || [ "$1" = "--?" ]; then
  echo "This script runs SGD and Bayes classifiers over the classic 20 News Groups."
  exit
fi

SCRIPT_PATH=${0%/*}
if [ "$0" != "$SCRIPT_PATH" ] && [ "$SCRIPT_PATH" != "" ]; then
  cd $SCRIPT_PATH
fi
START_PATH=`pwd`

# Set commands for dfs
source ${START_PATH}/set-dfs-commands.sh

WORK_DIR=/tmp/mahout-work-${USER}
algorithm=( cnaivebayes-MapReduce naivebayes-MapReduce cnaivebayes-Spark naivebayes-Spark sgd clean)
if [ -n "$1" ]; then
  choice=$1
else
  echo "Please select a number to choose the corresponding task to run"
  echo "1. ${algorithm[0]}"
  echo "2. ${algorithm[1]}"
  echo "3. ${algorithm[2]}"
  echo "4. ${algorithm[3]}"
  echo "5. ${algorithm[4]}"
  echo "6. ${algorithm[5]}-- cleans up the work area in $WORK_DIR"
  read -p "Enter your choice : " choice
fi

echo "ok. You chose $choice and we'll use ${algorithm[$choice-1]}"
alg=${algorithm[$choice-1]}

# Spark specific check and work
if [ "x$alg" == "xnaivebayes-Spark" -o "x$alg" == "xcnaivebayes-Spark" ]; then
  if [ "$MASTER" == "" ] ; then
    echo "Please set your MASTER env variable to point to your Spark Master URL. exiting..."
    exit 1
  fi
  if [ "$MAHOUT_LOCAL" != "" ] ; then
    echo "Options 3 and 4 can not run in MAHOUT_LOCAL mode. exiting..."
    exit 1
  fi
fi

#echo $START_PATH
cd $START_PATH
cd ../..

set -e

if ( [ "x$alg" == "xnaivebayes-MapReduce" ] || [ "x$alg" == "xcnaivebayes-MapReduce" ] || [ "x$alg" == "xnaivebayes-Spark" ] || [ "x$alg" == "xcnaivebayes-Spark" ] ); then
  c=""

  if [ "x$alg" == "xcnaivebayes-MapReduce" -o "x$alg" == "xnaivebayes-Spark" ]; then
    c=" -c"
  fi

  set -x
  echo "Preparing 20newsgroups data"
  rm -rf ${WORK_DIR}/20news-all
  mkdir ${WORK_DIR}/20news-all
  cp -R $START_PATH/wifidata/* ${WORK_DIR}/20news-all

  echo "Copying 20newsgroups data to HDFS"
  set +e
  $DFSRM ${WORK_DIR}/20news-all
  $DFS -mkdir ${WORK_DIR}
  $DFS -mkdir ${WORK_DIR}/20news-all
  set -e
  if [ $HVERSION -eq "1" ] ; then
    echo "Copying 20newsgroups data to Hadoop 1 HDFS"
    $DFS -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all
  elif [ $HVERSION -eq "2" ] ; then
    echo "Copying 20newsgroups data to Hadoop 2 HDFS"
    $DFS -put ${WORK_DIR}/20news-all ${WORK_DIR}/
  fi

  echo "Creating sequence files from 20newsgroups data"
  /usr/bin/mahout seqdirectory \
    -i ${WORK_DIR}/20news-all \
    -o ${WORK_DIR}/20news-seq -ow

  echo "Converting sequence files to vectors"
  /usr/bin/mahout seq2sparse \
    -i ${WORK_DIR}/20news-seq \
    -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf

  echo "Creating training and holdout set with a random 80-20 split of the generated vector dataset"
  /usr/bin/mahout split \
    -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
    --trainingOutput ${WORK_DIR}/20news-train-vectors \
    --testOutput ${WORK_DIR}/20news-test-vectors \
    --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

  if [ "x$alg" == "xnaivebayes-MapReduce" -o "x$alg" == "xcnaivebayes-MapReduce" ]; then

    echo "Training Naive Bayes model"
    /usr/bin/mahout trainnb \
      -i ${WORK_DIR}/20news-train-vectors \
      -o ${WORK_DIR}/model \
      -li ${WORK_DIR}/labelindex \
      -ow $c

    echo "Self testing on training set"
    /usr/bin/mahout testnb \
      -i ${WORK_DIR}/20news-train-vectors \
      -m ${WORK_DIR}/model \
      -l ${WORK_DIR}/labelindex \
      -ow -o ${WORK_DIR}/20news-testing $c

    echo "Testing on holdout set"
    /usr/bin/mahout testnb \
      -i ${WORK_DIR}/20news-test-vectors \
      -m ${WORK_DIR}/model \
      -l ${WORK_DIR}/labelindex \
      -ow -o ${WORK_DIR}/20news-testing $c

  elif [ "x$alg" == "xnaivebayes-Spark" -o "x$alg" == "xcnaivebayes-Spark" ]; then

    echo "Training Naive Bayes model"
    /usr/bin/mahout spark-trainnb \
      -i ${WORK_DIR}/20news-train-vectors \
      -o ${WORK_DIR}/spark-model $c -ow -ma $MASTER

    echo "Self testing on training set"
    /usr/bin/mahout spark-testnb \
      -i ${WORK_DIR}/20news-train-vectors \
      -m ${WORK_DIR}/spark-model $c -ma $MASTER

    echo "Testing on holdout set"
    /usr/bin/mahout spark-testnb \
      -i ${WORK_DIR}/20news-test-vectors \
      -m ${WORK_DIR}/spark-model $c -ma $MASTER

  fi
elif [ "x$alg" == "xsgd" ]; then
  if [ ! -e "/tmp/news-group.model" ]; then
    echo "Training on ${WORK_DIR}/20news-bydate/20news-bydate-train/"
    /usr/bin/mahout org.apache.mahout.classifier.sgd.TrainNewsGroups ${WORK_DIR}/20news-bydate/20news-bydate-train/
  fi
  echo "Testing on ${WORK_DIR}/20news-bydate/20news-bydate-test/ with model: /tmp/news-group.model"
  /usr/bin/mahout org.apache.mahout.classifier.sgd.TestNewsGroups --input ${WORK_DIR}/20news-bydate/20news-bydate-test/ --model /tmp/news-group.model
elif [ "x$alg" == "xclean" ]; then
  rm -rf $WORK_DIR
  rm -rf /tmp/news-group.model
  $DFSRM $WORK_DIR
fi
# Remove the work directory
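(A quick way to confirm Andrew's diagnosis: the MAHOUT-JOB line in the log
below shows which job jar the /usr/bin/mahout wrapper resolved. A minimal
check, assuming the CDH-style layout that appears in the log:)

  # The version embedded in the job jar name tells you which Mahout the
  # wrapper actually runs; here it is the CDH-bundled 0.9, not 0.10.0.
  ls /usr/lib/mahout/mahout-examples-*-job.jar
  # -> /usr/lib/mahout/mahout-examples-0.9-cdh5.3.0-job.jar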
-e "/tmp/news-group.model" ]; then > echo "Training on ${WORK_DIR}/20news-bydate/20news-bydate-train/" > /usr/bin/mahout org.apache.mahout.classifier.sgd.TrainNewsGroups > ${WORK_DIR}/20news-bydate/20news-bydate-train/ > fi > echo "Testing on ${WORK_DIR}/20news-bydate/20news-bydate-test/ with > model: /tmp/news-group.model" > /usr/bin/mahout org.apache.mahout.classifier.sgd.TestNewsGroups --input > ${WORK_DIR}/20news-bydate/20news-bydate-test/ --model /tmp/news-group.model > elif [ "x$alg" == "xclean" ]; then > rm -rf $WORK_DIR > rm -rf /tmp/news-group.model > $DFSRM $WORK_DIR > fi > # Remove the work directory > # > > Error mesage: > > $ /usr/bin/mahout trainnb \ > > -i ${WORK_DIR}/20news-train-vectors \ > > -o ${WORK_DIR}/model \ > > -li ${WORK_DIR}/labelindex \ > > -ow > MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath. > Running on hadoop, using /usr/lib/hadoop/bin/hadoop and > HADOOP_CONF_DIR=/etc/hadoop/conf > MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.9-cdh5.3.0-job.jar > 15/04/27 16:41:27 WARN driver.MahoutDriver: No trainnb.props found on > classpath, will use command-line arguments only > 15/04/27 16:41:28 INFO common.AbstractJob: Command line arguments: > {--alphaI=[1.0], --endPhase=[2147483647], > --input=[/tmp/mahout-work-cloudera/20news-train-vectors], > --labelIndex=[/tmp/mahout-work-cloudera/labelindex], > --output=[/tmp/mahout-work-cloudera/model], --overwrite=null, > --startPhase=[0], --tempDir=[temp]} > 15/04/27 16:41:36 INFO common.HadoopUtil: Deleting temp > 15/04/27 16:41:36 INFO Configuration.deprecation: mapred.input.dir is > deprecated. Instead, use mapreduce.input.fileinputformat.inputdir > 15/04/27 16:41:36 INFO Configuration.deprecation: > mapred.compress.map.output is deprecated. Instead, use > mapreduce.map.output.compress > 15/04/27 16:41:36 INFO Configuration.deprecation: mapred.output.dir is > deprecated. 
15/04/27 16:41:36 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/04/27 16:41:44 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/cloudera/.staging/job_1430097605337_0028
15/04/27 16:41:44 WARN security.UserGroupInformation: PriviledgedActionException as:cloudera (auth:SIMPLE) cause:java.io.FileNotFoundException: File does not exist: /tmp/mahout-work-cloudera/labelindex
Exception in thread "main" java.io.FileNotFoundException: File does not exist: /tmp/mahout-work-cloudera/labelindex
        at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1093)
        at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1085)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1085)
        at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:288)
        at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:224)
        at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestamps(ClientDistributedCacheManager.java:93)
        at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestampsAndCacheVisibilities(ClientDistributedCacheManager.java:57)
        at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:267)
        at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:388)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:481)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
        at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:114)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.main(TrainNaiveBayesJob.java:64)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
        at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:153)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
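(As confirmed at the top of the thread, running against Mahout 0.10.0
resolves this. A minimal sketch of pointing the shell at a standalone
0.10.0 distribution instead of the CDH-bundled 0.9; the install path here
is a hypothetical example:)

  # Hypothetical location of an unpacked mahout-distribution-0.10.0;
  # adjust to wherever the 0.10.0 release actually lives.
  export MAHOUT_HOME=/opt/mahout-distribution-0.10.0
  export PATH=$MAHOUT_HOME/bin:$PATH
  # The MAHOUT-JOB banner should now report a 0.10.0 job jar instead of
  # mahout-examples-0.9-cdh5.3.0-job.jar before the job runs.
  mahout trainnb \
    -i ${WORK_DIR}/20news-train-vectors \
    -o ${WORK_DIR}/model \
    -li ${WORK_DIR}/labelindex \
    -ow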