I'm trying to run the simple 20-newsgroups example to train a Mahout
classifier using Pig, and I'm unsure what role elephant-bird plays.
First, after struggling to get elephant-bird to build, storing the model
to a SequenceFile didn't work for me. Then I noticed PigModelStorage, used
that instead, and it works just fine. Here is my script (with comments
removed for brevity):
-- Train:
register '.../target/pig-vector-1.0-jar-with-dependencies.jar';
define train org.apache.mahout.pig.LogisticRegression(
    'iterations=5, inMemory=true, features=100000, categories=alt.atheism comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt soc.religion.christian talk.religion.misc');
docs = load '20news-bydate-train/*/*'
    using org.apache.mahout.pig.MessageLoader()
    as (newsgroup, id:int, subject, body);
define encodeVector org.apache.mahout.pig.encoders.EncodeVector(
    '100000', 'subject+body',
    'group:word, article:numeric, subject:text, body:text');
vectors = foreach docs generate newsgroup, encodeVector(*) as v;
grouped = group vectors all;
model = foreach grouped generate 1 as key, train(vectors) as model;
store model into 'pv-tmp/news_model2'
    using org.apache.mahout.pig.PigModelStorage();
-- Eval:
define evaluate org.apache.mahout.pig.LogisticRegressionEval(
    'sequence=pv-tmp/news_model2/part-r-00000, key=1');
test = load '20news-bydate-test/*/*'
    using org.apache.mahout.pig.MessageLoader()
    as (newsgroup, id:int, subject, body);
testvecs = foreach test generate newsgroup, encodeVector(*) as v;
describe testvecs;
evalvecs = foreach testvecs generate evaluate(v);
dump evalvecs;
----
So my main question is: what does the elephant-bird model storage do
that PigModelStorage doesn't?
Cheers,
Tim