I've been trying to run Mahout Hadoop tasks from Apache Pig, and I'm
having problems with pieces of Mahout that are seemingly invisible to
Hadoop's ProgramDriver, even though they're accessible from the
bin/hadoop command line.
First, I had some quick success (hacky and kludgy, but it worked)
encapsulating a call to CollocDriver within a Pig macro, so I thought
I'd try another piece of Mahout.
When I tried to move on and wrap
org.apache.mahout.text.SequenceFilesFromDirectory / seqdirectory in
Pig, I couldn't find any variation on the same trick that works.
Worse, printUsage from Hadoop's ProgramDriver prints out a strange
subset of Mahout's Hadoop programs; a list that doesn't include
seqdirectory. In other words, Hadoop (via the API, via Pig) seems to
deny that seqdirectory is available as a program, even while
bin/hadoop on the command line lets me run that same program
successfully. I'll get to the point shortly, but first a brief
digression for context.
Backstory --- the guts of what I'm attempting with Pig macros:
DEFINE collocations (SEQDIR, IGNORE) RETURNS sorted_concepts {
  DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
  raw_concepts = MAPREDUCE '../../core/target/mahout-core-0.6-SNAPSHOT-job.jar'
    STORE IGNORE INTO 'migtest/dummy-input'
    LOAD 'migtest/collocations_output/ngrams/part-r-*'
      USING SequenceFileLoader AS (phrase: chararray, score: float)
    `org.apache.mahout.driver.MahoutDriver
     org.apache.mahout.vectorizer.collocations.llr.CollocDriver
     --input $SEQDIR --output migtest/collocations_output
     --analyzerName org.apache.mahout.vectorizer.DefaultAnalyzer
     --maxNGramSize 2 --preprocess --overwrite`;
  $sorted_concepts = ORDER raw_concepts BY score DESC;
};
The ugly detail there is that Pig's MAPREDUCE instruction requires a
Pig relation name to be passed in, even though in this case the
sequence files are already stored in HDFS. But it works. With a few
more bits and pieces (which I'll blog, but the basic hack is in
https://gist.github.com/1192831 ) you can invoke Mahout collocations
from Pig like this, given a seqdir in HDFS:
reuters_phrases = collocations('/user/danbri/migtest/reuters-out-seqdir', IGNORE);
political_phrases = FILTER reuters_phrases
  BY phrase MATCHES '.*(president|government|election).*' AND score > (float)10;
So how can I extend this to also drive seqdirectory from Pig macros,
and eventually build more interesting pipelines?
I tried a lot of variations on this:
DEFINE seqdirectory (TXTDIR, SEQDIR, IGNORE) RETURNS s {
  DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
  $s = MAPREDUCE '../../core/target/mahout-core-0.6-SNAPSHOT-job.jar'
    STORE IGNORE INTO 'migtest/dummy-input'
    LOAD '$SEQDIR/*'
      USING SequenceFileLoader
    `org.apache.mahout.driver.MahoutDriver
     org.apache.mahout.text.SequenceFilesFromDirectory
     --input $TXTDIR --output $SEQDIR --overwrite`;
};
... and the details don't matter here, except to say that none of
these variations worked, even while invoking the same program with
plain command-line Hadoop did work. I tried using the short name, the
long name, etc. (BTW, 'ted' here is some ted.com data, not Ted D.):
hadoop --config /Users/danbri/working/hadoop/hadoop-0.20.2/conf jar \
  /Users/danbri/working/mahout/trunk/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar \
  org.apache.mahout.text.SequenceFilesFromDirectory \
  --input ted/txt/ --output ted/tmp/ --overwrite
...this works fine, reading from and writing to HDFS.
So, my point: when the Pig/API-based invocation fails (long name,
short name, whatever), what I see is this *truncated* list of Mahout
programs:
Unknown program 'org.apache.mahout.text.SequenceFilesFromDirectory' chosen.
Valid program names are:
baumwelch: : Baum-Welch algorithm for unsupervised HMM training
canopy: : Canopy clustering
cleansvd: : Cleanup and verification of SVD output
dirichlet: : Dirichlet Clustering
eigencuts: : Eigencuts spectral clustering
evaluateFactorization: : compute RMSE of a rating matrix factorization against probes in memory
evaluateFactorizationParallel: : compute RMSE of a rating matrix factorization against probes
fkmeans: : Fuzzy K-means clustering
fpg: : Frequent Pattern Growth
hmmpredict: : Generate random sequence of observations by given HMM
itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
kmeans: : K-means clustering
lda: : Latent Dirchlet Allocation
matrixmult: : Take the product of two matrices
meanshift: : Mean Shift clustering
parallelALS: : ALS-WR factorization of a rating matrix
predictFromFactorization: : predict preferences from a factorization of a rating matrix
recommenditembased: : Compute recommendations using item-based collaborative filtering
rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
seq2sparse: : Sparse Vector generation from Text sequence files
spectralkmeans: : Spectral k-means clustering
splitDataset: : split a rating dataset into training and probe parts
ssvd: : Stochastic SVD
svd: : Lanczos Singular Value Decomposition
testclassifier: : Test Bayes Classifier
trainclassifier: : Train Bayes Classifier
transpose: : Take the transpose of a matrix
vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
viterbi: : Viterbi decoding of hidden states from given output states sequence
This is substantially shorter than the list I get from running the
bin/mahout utility, and it doesn't include 'seqdirectory', for
example, even though that does seem to be a Hadoop program and I've
successfully run it as such.
These seem to be missing (are they all Hadoop-capable?):
arff.vector, cat, clusterdump, ldatopics, lucene.vector,
prepare20newsgroups, rowid, runAdaptiveLogistic, runlogistic,
seqdirectory, seqdumper, seqwiki, trainAdaptiveLogistic,
trainlogistic, validateAdaptiveLogistic, vectordump,
wikipediaDataSetCreator, wikipediaXMLSplitter
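Since bin/mahout and the Pig-side invocation disagree, one thing worth
checking is whether more than one copy of Mahout's driver list (the
classpath resource driver.classes.default.props, discussed below) is
visible to the JVM that Pig launches. A small diagnostic sketch (my
own code, not part of Mahout; the resource name is the only thing
taken from Mahout):

```java
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

public class FindDriverProps {
    // List every copy of a named resource visible on the classpath,
    // in the order the classloader searches them.
    static List<URL> findAll(String resource) throws Exception {
        List<URL> urls = new ArrayList<>();
        Enumeration<URL> e =
            Thread.currentThread().getContextClassLoader().getResources(resource);
        while (e.hasMoreElements()) {
            urls.add(e.nextElement());
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        // If more than one URL prints, the first copy shadows the rest;
        // a stale or partial first copy would explain a truncated list.
        for (URL u : findAll("driver.classes.default.props")) {
            System.out.println(u);
        }
    }
}
```

Run with the same classpath Pig uses; if two URLs print, the first one
wins, which would produce exactly this kind of shortened program list.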
The relevant Hadoop code seems to be
http://www.docjar.com/html/api/org/apache/hadoop/util/ProgramDriver.java.html
which is invoked from
core/src/main/java/org/apache/mahout/driver/MahoutDriver.java, and it
gets the list from the classpath via
loadProperties("driver.classes.default.props"). Looking at that file
in the Mahout job jar with

jar -xvf mahout-examples-0.6-SNAPSHOT-job.jar driver.classes.default.props

...it seems to have the full (47-item) list, including seqdirectory:

org.apache.mahout.text.SequenceFilesFromDirectory = seqdirectory :
Generate sequence files (of Text) from a directory
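For what it's worth, the short-name lookup that MahoutDriver does on
top of that file can be sketched roughly like this (a simplified
reconstruction, not the actual Mahout source; the two-entry PROPS
excerpt is illustrative, with the seqdirectory line copied from the
jar above and the kmeans line assumed):

```java
import java.io.StringReader;
import java.util.Properties;
import java.util.TreeMap;

public class DriverLookup {
    // Hypothetical two-entry excerpt of driver.classes.default.props;
    // the real file in the job jar has ~47 entries.
    static final String PROPS =
        "org.apache.mahout.text.SequenceFilesFromDirectory = seqdirectory : Generate sequence files (of Text) from a directory\n"
      + "org.apache.mahout.clustering.kmeans.KMeansDriver = kmeans : K-means clustering\n";

    // Mimic the driver: map each short name to its fully qualified class.
    static TreeMap<String, String> shortNames() throws Exception {
        Properties p = new Properties();
        p.load(new StringReader(PROPS));
        TreeMap<String, String> byShortName = new TreeMap<>();
        for (String className : p.stringPropertyNames()) {
            // Value format is "shortname : description".
            String[] parts = p.getProperty(className).split(":", 2);
            byShortName.put(parts[0].trim(), className);
        }
        return byShortName;
    }

    public static void main(String[] args) throws Exception {
        // Any name absent from the copy the classloader actually served
        // behaves exactly like the "Unknown program" error above.
        System.out.println(
            shortNames().getOrDefault("seqdirectory", "Unknown program"));
    }
}
```

So if the jar's own props file has seqdirectory, but the driver
rejects it, it looks like the driver must be loading a different copy.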
At this point I've hit a wall and the weekend's winding up, so it
seemed time to write this up and ask for clues.
Q: Is there some logic to the shorter list of Mahout Hadoop programs?
What distinguishes them from the longer list?
Thanks for any pointers,
Dan
ps. if anyone follows up re the Pig aspect, please change the Subject: line accordingly