Thank you for the advice. However, I do not have Hadoop installed on my machine. Running the jobs locally with Mahout gives me heap size errors, as can be seen at http://en.wikipedia.org/wiki/User:Bloodysnowrocker/Hadoop. I could only run recommendations locally; clustering and creating vectors were not possible.
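If the local runs are dying on the default JVM heap, one thing that may help is raising it before launching. This is only a minimal sketch, assuming the stock bin/mahout launcher (which, as far as I know, reads the MAHOUT_LOCAL and MAHOUT_HEAPSIZE environment variables); the job and paths are placeholders:

    # assumption: the stock bin/mahout script honours MAHOUT_LOCAL and MAHOUT_HEAPSIZE
    export MAHOUT_LOCAL=true         # force a local (non-Hadoop) run
    export MAHOUT_HEAPSIZE=4096      # heap for the launched JVM, in MB
    bin/mahout seq2sparse -i <input-seqfiles> -o <output-vectors>   # hypothetical local job

Even with a larger heap, clustering or vectorizing a sizeable corpus on a single machine may still not fit in memory, which is where EMR or a local Hadoop install comes back into the picture.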
Do you suggest I should use the EMR GUI to submit my jobs, or should I just install Hadoop on my machine or on EC2 and perform my tasks there?

________________________________
From: Ted Dunning <[email protected]>
To: [email protected]; hellen maziku <[email protected]>
Sent: Wednesday, December 12, 2012 10:56 AM
Subject: Re: Creating vectors from lucene index on EMR via the CLI

I would still recommend that you switch to using the mahout programs directly to submit jobs. Those programs really have an assumption baked in that they will be submitting the jobs themselves.

The EMR commands that you are using take responsibility for creating the environment that you need for job submission, but you are probably not getting the command line arguments through to the Mahout program in good order. As is typical with shell-script-based utilities, determining how to get those across correctly is probably somewhat difficult.

On Wed, Dec 12, 2012 at 7:58 AM, hellen maziku <[email protected]> wrote:

> Hi Ted,
> If I am running it as a single step, then how come I can add more steps to
> it? Currently there are 6 steps. Every time I get the errors, I just add
> another step to the same job ID. So I don't understand.
>
> Also, the command to create the job flow is
>
> ./elastic-mapreduce --create --alive --log-uri s3n://mahout-output/logs/ --name dict_vectorize
>
> Doesn't that mean that keep-alive is set?
>
> ________________________________
> From: Ted Dunning <[email protected]>
> To: [email protected]; hellen maziku <[email protected]>
> Sent: Wednesday, December 12, 2012 9:48 AM
> Subject: Re: Creating vectors from lucene index on EMR via the CLI
>
> You are trying to run this job as a single step in an EMR flow. Mahout's
> command line programs assume that you are running against a live cluster
> that will hang around (since many Mahout steps involve more than one
> map-reduce).
>
> It would probably be best to separate the creation of the cluster (with the
> keep-alive flag set) from the execution of the Mahout jobs, with a
> subsequent explicit tear-down of the cluster.
>
> On Wed, Dec 12, 2012 at 3:55 AM, hellen maziku <[email protected]> wrote:
>
> > Hi,
> > I installed Mahout and Solr.
> >
> > I created an index from dictionary.txt using the command below:
> >
> > curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "[email protected]"
> >
> > To create the vectors from my index, I needed the
> > org.apache.mahout.utils.vectors.lucene.Driver class. I could not locate
> > this class in mahout-core-0.7-job.jar; I could only locate it in
> > mahout-examples-0.7-job.jar, so I uploaded mahout-examples-0.7-job.jar
> > to an S3 bucket.
> >
> > I also uploaded the dictionary index to a separate S3 bucket. I created
> > another bucket with two folders to store my dictOut and vectors.
> >
> > I created a job flow on the CLI:
> >
> > ./elastic-mapreduce --create --alive --log-uri s3n://mahout-output/logs/ --name dict_vectorize
> >
> > I added the step to vectorize my index using the following command:
> >
> > ./elastic-mapreduce -j j-2NSJRI6N9EQJ4 --jar s3n://mahout-bucket/jars/mahout-examples-0.7-job.jar --main-class org.apache.mahout.utils.vectors.lucene.Driver --arg --dir s3n://mahout-input/input1/index/ --arg --field doc1 --arg --dictOut s3n://mahout-output/solr-dict-out/dict.txt --arg --output s3n://mahout-output/solr-vect-out/vectors
> >
> > But in the logs I get the following error:
> >
> > 2012-12-12 09:37:17,883 ERROR org.apache.mahout.utils.vectors.lucene.Driver (main): Exception
> > org.apache.commons.cli2.OptionException: Missing value(s) --dir
> >     at org.apache.commons.cli2.option.ArgumentImpl.validate(ArgumentImpl.java:241)
> >     at org.apache.commons.cli2.option.ParentImpl.validate(ParentImpl.java:124)
> >     at org.apache.commons.cli2.option.DefaultOption.validate(DefaultOption.java:176)
> >     at org.apache.commons.cli2.option.GroupImpl.validate(GroupImpl.java:265)
> >     at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:104)
> >     at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:197)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >     at java.lang.reflect.Method.invoke(Method.java:597)
> >     at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
> >
> > What am I doing wrong?
> >
> > Another question: what is the correct value of the --field argument? Is it
> > doc1 (the id) or dictionary (from the filename dictionary.txt)? I am asking
> > this because when I issue the query with q=doc1 on Solr I get no
> > results, but when I issue the query with q=dictionary, I see my content.
> >
> > Thank you so much for your help. I am a newbie, so please excuse my being too
> > verbose.
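If, as Ted suggests above, the option values are not making it through to the Driver intact, one thing to try is giving each option and each value its own --arg. This is only a sketch, on the assumption that the elastic-mapreduce client forwards exactly one token per --arg:

    ./elastic-mapreduce -j j-2NSJRI6N9EQJ4 \
      --jar s3n://mahout-bucket/jars/mahout-examples-0.7-job.jar \
      --main-class org.apache.mahout.utils.vectors.lucene.Driver \
      --arg --dir     --arg s3n://mahout-input/input1/index/ \
      --arg --field   --arg doc1 \
      --arg --dictOut --arg s3n://mahout-output/solr-dict-out/dict.txt \
      --arg --output  --arg s3n://mahout-output/solr-vect-out/vectors

If the same "Missing value(s) --dir" error still appears after that, the problem is probably not the argument quoting.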
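Separating cluster creation from job execution, as recommended in the quoted reply, might look roughly like the following. It is only a sketch: the --ssh login, the availability of a Mahout 0.7 installation on the master node, and the paths are assumptions rather than anything confirmed in this thread.

    # 1. create a long-lived (keep-alive) cluster
    ./elastic-mapreduce --create --alive --log-uri s3n://mahout-output/logs/ --name dict_vectorize

    # 2. log in to the master node of the job flow returned by step 1
    ./elastic-mapreduce -j <jobflow-id> --ssh

    #    on the master node (assumes Mahout 0.7 is installed there and the
    #    Lucene index has been copied over):
    bin/mahout lucene.vector --dir <path-to-lucene-index> --field <field-name> \
        --dictOut <path>/dict.txt --output <path>/vectors

    # 3. tear the cluster down explicitly once the jobs have finished
    ./elastic-mapreduce -j <jobflow-id> --terminate

With this layout the Mahout program itself submits the map-reduce jobs to the live cluster, which is the assumption Ted describes above.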
