On Fri, May 11, 2012 at 11:38 AM, Timothy Potter <[email protected]> wrote:

> I'm trying to run the simple 20-newsgroups example to train a Mahout
> classifier using Pig and am unsure about the elephant-bird stuff.
>
> First, after battling with getting a build of elephant-bird,


Why did you have to build it? Aren't the jars available via Maven?


> the store to
> SequenceFile didn't work for me. Then I saw the PigModelStorage and just
> used that and it works just fine. Here is my script (with comments removed
> for brevity):
>
> -- Train:
>
> register '.../target/pig-vector-1.0-jar-with-dependencies.jar';
>
> define train org.apache.mahout.pig.LogisticRegression('iterations=5,
> inMemory=true, features=100000, categories=alt.atheism
> comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns
> comp.graphics comp.windows.x rec.sport.baseball sci.med
> talk.politics.mideast comp.os.ms-windows.misc misc.forsale rec.sport.hockey
> sci.space talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt
> soc.religion.christian talk.religion.misc');
>
> docs = load '20news-bydate-train/*/*' using
> org.apache.mahout.pig.MessageLoader()
>    as (newsgroup, id:int, subject, body);
>
> define encodeVector org.apache.mahout.pig.encoders.EncodeVector('100000',
> 'subject+body', 'group:word, article:numeric, subject:text, body:text');
> vectors = foreach docs generate newsgroup, encodeVector(*) as v;
>
> grouped = group vectors all;
>
> model = foreach grouped generate 1 as key, train(vectors) as model;
>
> store model into 'pv-tmp/news_model2' using
> org.apache.mahout.pig.PigModelStorage();
>
>
> -- Eval:
>
> define evaluate org.apache.mahout.pig.LogisticRegressionEval(
> 'sequence=pv-tmp/news_model2/part-r-00000, key=1');
> test = load '20news-bydate-test/*/*' using
> org.apache.mahout.pig.MessageLoader()
>    as (newsgroup, id:int, subject, body);
> testvecs = foreach test generate newsgroup, encodeVector(*) as v;
> describe testvecs;
> evalvecs = foreach testvecs generate evaluate(v);
>
> dump evalvecs;
>
> ----
>
> So my main question is what does the elephant-bird model storage stuff do
> that PigModelStorage doesn't?
>

SequenceFileStorage produces data in a format that many of the other
Mahout utilities can read (they typically assume SequenceFiles of Text,
IntWritable, and/or VectorWritable).
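
As a rough sketch of what that looks like in Pig, assuming the elephant-bird
jars are on hand (the jar path and the output path below are placeholders,
and using newsgroup/id as the key/value pair is only for illustration, not
what the training script needs):

```pig
-- Placeholder path: point this at wherever your elephant-bird jars live.
register '.../elephant-bird-pig.jar';

docs = load '20news-bydate-train/*/*' using
    org.apache.mahout.pig.MessageLoader()
    as (newsgroup, id:int, subject, body);

-- SequenceFileStorage writes (key, value) pairs, so project exactly two fields.
pairs = foreach docs generate (chararray) newsgroup as key, id as value;

-- First converter handles the key (Text), second the value (IntWritable).
-- 'pv-tmp/news_ids_seq' is a hypothetical output directory.
store pairs into 'pv-tmp/news_ids_seq' using
    com.twitter.elephantbird.pig.store.SequenceFileStorage(
        '-c com.twitter.elephantbird.pig.util.TextConverter',
        '-c com.twitter.elephantbird.pig.util.IntWritableConverter');
```

The output is then an ordinary SequenceFile of Text/IntWritable that other
Hadoop or Mahout tools can open. Writing vectors the same way would need a
converter that emits VectorWritable in place of the IntWritable one.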


>
> Cheers,
> Tim
>



-- 

  -jake
