I've been trying to run Mahout Hadoop tasks from Apache Pig, and I'm
having problems with pieces of Mahout that are seemingly invisible to
Hadoop's ProgramDriver, even though they're accessible from the
bin/hadoop command line.
First, I had some quick success (hacky and kludgy, but it worked)
encapsulating a call to CollocDriver within a Pig macro, so I thought
I'd try another piece of Mahout.
When I tried to move on and wrap
org.apache.mahout.text.SequenceFilesFromDirectory / seqdirectory in
Pig, I couldn't find any variation on the same trick that works.
Worse, printUsage from Hadoop's ProgramDriver prints out a strange
subset of Mahout's Hadoop programs; a list that doesn't include
seqdirectory. In other words, Hadoop (via the API, via Pig) seems to
deny that seqdirectory is available as a program, even while
bin/hadoop on the command line lets me run that same program
successfully. I'll get to the point shortly, but first a brief
digression for context.
Backstory --- the guts of what I'm attempting with Pig macros:
DEFINE collocations (SEQDIR, IGNORE) RETURNS sorted_concepts {
  DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
  raw_concepts = MAPREDUCE '../../core/target/mahout-core-0.6-SNAPSHOT-job.jar'
    STORE IGNORE INTO 'migtest/dummy-input'
    LOAD 'migtest/collocations_output/ngrams/part-r-*'
      USING SequenceFileLoader AS (phrase: chararray, score: float)
    `org.apache.mahout.driver.MahoutDriver
     org.apache.mahout.vectorizer.collocations.llr.CollocDriver
     --input $SEQDIR --output migtest/collocations_output
     --analyzerName org.apache.mahout.vectorizer.DefaultAnalyzer
     --maxNGramSize 2 --preprocess --overwrite`;
  $sorted_concepts = ORDER raw_concepts BY score DESC;
};
The ugly detail there is that Pig's MAPREDUCE instruction requires a
Pig relation name to be passed in, even though in this case the
sequence files are already stored in HDFS. But it works. With a few
more bits and pieces (which I'll blog, but the basic hack is in
https://gist.github.com/1192831 ) you can invoke Mahout collocations
from Pig like this, given a seqdir in HDFS:
reuters_phrases = collocations('/user/danbri/migtest/reuters-out-seqdir', IGNORE);
political_phrases = FILTER reuters_phrases
  BY phrase MATCHES '.*(president|government|election).*' AND score > (float)10;
So how can I extend this to also drive seqdirectory from Pig macros,
and eventually build more interesting pipelines?
I tried a lot of variations on this:
DEFINE seqdirectory (TXTDIR, SEQDIR, IGNORE) RETURNS s {
  DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
  $s = MAPREDUCE '../../core/target/mahout-core-0.6-SNAPSHOT-job.jar'
    STORE IGNORE INTO 'migtest/dummy-input'
    LOAD '$SEQDIR/*'
      USING SequenceFileLoader
    `org.apache.mahout.driver.MahoutDriver
     org.apache.mahout.text.SequenceFilesFromDirectory
     --input $TXTDIR --output $SEQDIR --overwrite`;
};
... and the details don't matter here, except to say that none of
these variations worked, even while invoking the same program with
plain command-line Hadoop did work. I tried using the short name, the
long name, etc. (BTW, 'ted' here is some ted.com data, not Ted D.):
hadoop --config /Users/danbri/working/hadoop/hadoop-0.20.2/conf jar \
  /Users/danbri/working/mahout/trunk/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar \
  org.apache.mahout.text.SequenceFilesFromDirectory \
  --input ted/txt/ --output ted/tmp/ --overwrite
...this works fine, reading from and writing to HDFS.
So, my point: when the Pig/API-based invocation fails (long name,
short name, whatever), what I see is this *truncated* list of Mahout
programs:
Unknown program 'org.apache.mahout.text.SequenceFilesFromDirectory' chosen.
Valid program names are:
baumwelch: : Baum-Welch algorithm for unsupervised HMM training
canopy: : Canopy clustering
cleansvd: : Cleanup and verification of SVD output
dirichlet: : Dirichlet Clustering
eigencuts: : Eigencuts spectral clustering
evaluateFactorization: : compute RMSE of a rating matrix factorization against probes in memory
evaluateFactorizationParallel: : compute RMSE of a rating matrix factorization against probes
fkmeans: : Fuzzy K-means clustering
fpg: : Frequent Pattern Growth
hmmpredict: : Generate random sequence of observations by given HMM
itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
kmeans: : K-means clustering
lda: : Latent Dirchlet Allocation
matrixmult: : Take the product of two matrices
meanshift: : Mean Shift clustering
parallelALS: : ALS-WR factorization of a rating matrix
predictFromFactorization: : predict preferences from a factorization of a rating matrix
recommenditembased: : Compute recommendations using item-based collaborative filtering
rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
seq2sparse: : Sparse Vector generation from Text sequence files
spectralkmeans: : Spectral k-means clustering
splitDataset: : split a rating dataset into training and probe parts
ssvd: : Stochastic SVD
svd: : Lanczos Singular Value Decomposition
testclassifier: : Test Bayes Classifier
trainclassifier: : Train Bayes Classifier
transpose: : Take the transpose of a matrix
vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
viterbi: : Viterbi decoding of hidden states from given output states sequence
This is substantially shorter than the list I get from running the
bin/mahout utility, and it doesn't include 'seqdirectory', for
example, even though that does seem to be a Hadoop program and I've
successfully run it as such.
These seem to be missing (are they all Hadoop-capable?):
arff.vector, cat, clusterdump, ldatopics, lucene.vector,
prepare20newsgroups, rowid, runAdaptiveLogistic, runlogistic,
seqdirectory, seqdumper, seqwiki, trainAdaptiveLogistic,
trainlogistic, validateAdaptiveLogistic, vectordump,
wikipediaDataSetCreator, wikipediaXMLSplitter
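Since bin/mahout and the Pig-side invocation disagree, one thing worth
checking is whether more than one copy of Mahout's driver list (the
classpath resource driver.classes.default.props, discussed below) is
visible to the JVM that Pig launches. A small diagnostic sketch (my
own code, not part of Mahout; the resource name is the only thing
taken from Mahout):

```java
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

public class FindDriverProps {
    // List every copy of a named resource visible on the classpath,
    // in the order the classloader searches them.
    static List<URL> findAll(String resource) throws Exception {
        List<URL> urls = new ArrayList<>();
        Enumeration<URL> e =
            Thread.currentThread().getContextClassLoader().getResources(resource);
        while (e.hasMoreElements()) {
            urls.add(e.nextElement());
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        // If more than one URL prints, the first copy shadows the rest;
        // a stale or partial first copy would explain a truncated list.
        for (URL u : findAll("driver.classes.default.props")) {
            System.out.println(u);
        }
    }
}
```

Run with the same classpath Pig uses; if two URLs print, the first one
wins, which would produce exactly this kind of shortened program list.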
The relevant Hadoop code seems to be
http://www.docjar.com/html/api/org/apache/hadoop/util/ProgramDriver.java.html
which is invoked from
core/src/main/java/org/apache/mahout/driver/MahoutDriver.java, and it
gets the list from the classpath via
loadProperties("driver.classes.default.props"). Looking at that file
in the Mahout job jar with

jar -xvf mahout-examples-0.6-SNAPSHOT-job.jar driver.classes.default.props

...it seems to have the full (47-item) list, including seqdirectory:

org.apache.mahout.text.SequenceFilesFromDirectory = seqdirectory :
Generate sequence files (of Text) from a directory
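For what it's worth, the short-name lookup that MahoutDriver does on
top of that file can be sketched roughly like this (a simplified
reconstruction, not the actual Mahout source; the two-entry PROPS
excerpt is illustrative, with the seqdirectory line copied from the
jar above and the kmeans line assumed):

```java
import java.io.StringReader;
import java.util.Properties;
import java.util.TreeMap;

public class DriverLookup {
    // Hypothetical two-entry excerpt of driver.classes.default.props;
    // the real file in the job jar has ~47 entries.
    static final String PROPS =
        "org.apache.mahout.text.SequenceFilesFromDirectory = seqdirectory : Generate sequence files (of Text) from a directory\n"
      + "org.apache.mahout.clustering.kmeans.KMeansDriver = kmeans : K-means clustering\n";

    // Mimic the driver: map each short name to its fully qualified class.
    static TreeMap<String, String> shortNames() throws Exception {
        Properties p = new Properties();
        p.load(new StringReader(PROPS));
        TreeMap<String, String> byShortName = new TreeMap<>();
        for (String className : p.stringPropertyNames()) {
            // Value format is "shortname : description".
            String[] parts = p.getProperty(className).split(":", 2);
            byShortName.put(parts[0].trim(), className);
        }
        return byShortName;
    }

    public static void main(String[] args) throws Exception {
        // Any name absent from the copy the classloader actually served
        // behaves exactly like the "Unknown program" error above.
        System.out.println(
            shortNames().getOrDefault("seqdirectory", "Unknown program"));
    }
}
```

So if the jar's own props file has seqdirectory, but the driver
rejects it, it looks like the driver must be loading a different copy.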
At this point I've hit a wall and the weekend's winding up, so it
seemed time to write this up and ask for clues.
Q: Is there some logic to the shorter list of Mahout Hadoop programs?
What distinguishes them from the longer list?
Thanks for any pointers,
Dan
ps. if anyone follows up re the Pig aspect, please change the Subject: line accordingly