I think I have no choice but to do that. I am able to ssh to the EMR cluster, but how do I run my Mahout job from there? I do not know how to proceed. Also, how can I mount my input files?
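
For example, after logging in to the master node, would something like the following be the right idea? I am guessing at the key file and the master's public DNS (from the EMR console), and I am not sure whether the Driver wants the index on HDFS or on the local disk:

ssh -i my-key.pem hadoop@<master-public-dns>

# on the master, assuming I have copied mahout-examples-0.7-job.jar there:
hadoop distcp s3n://mahout-input/input1/index/ /index/
hadoop jar mahout-examples-0.7-job.jar \
  org.apache.mahout.utils.vectors.lucene.Driver \
  --dir /index --field dictionary \
  --dictOut /tmp/dict.txt --output /vectors

And is copying from S3 like that the right way to get at my input files, rather than mounting anything?
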
________________________________
From: Ted Dunning <[email protected]>
To: [email protected]; hellen maziku <[email protected]>
Sent: Wednesday, December 12, 2012 2:20 PM
Subject: Re: Creating vectors from lucene index on EMR via the CLI

You can ssh to the EMR cluster if you like.

On Wed, Dec 12, 2012 at 9:38 AM, hellen maziku <[email protected]> wrote:

> Thank you for the advice. But I do not have Hadoop installed on my
> machine. Running the jobs locally with Mahout gives me heap-size errors.
> I could only do recommendations locally; clustering and creating vectors
> were not possible.
>
> Do you suggest I use the EMR GUI to submit my jobs, or should I just
> install Hadoop on my machine or on EC2 and perform my tasks there?
>
> ________________________________
> From: Ted Dunning <[email protected]>
> To: [email protected]; hellen maziku <[email protected]>
> Sent: Wednesday, December 12, 2012 10:56 AM
> Subject: Re: Creating vectors from lucene index on EMR via the CLI
>
> I would still recommend that you switch to using the Mahout programs
> directly to submit jobs. Those programs really have an assumption baked
> in that they will be submitting the jobs themselves. The EMR commands
> that you are using take responsibility for creating the environment that
> you need for job submission, but you are probably not getting the
> command-line arguments through to the Mahout program in good order. As
> is typical with shell-script-based utilities, determining how to get
> those across correctly is probably somewhat difficult.
>
> On Wed, Dec 12, 2012 at 7:58 AM, hellen maziku <[email protected]> wrote:
>
>> Hi Ted,
>> If I am running it as a single step, then how come I can add more steps
>> to it? Currently there are 6 steps. Every time I get the errors, I just
>> add another step to the same job ID, so I don't understand.
>>
>> Also, the command to create the job flow is
>>
>> ./elastic-mapreduce --create --alive --log-uri \
>>   s3n://mahout-output/logs/ --name dict_vectorize
>>
>> Doesn't that mean that keep-alive is set?
>>
>> ________________________________
>> From: Ted Dunning <[email protected]>
>> To: [email protected]; hellen maziku <[email protected]>
>> Sent: Wednesday, December 12, 2012 9:48 AM
>> Subject: Re: Creating vectors from lucene index on EMR via the CLI
>>
>> You are trying to run this job as a single step in an EMR flow.
>> Mahout's command-line programs assume that you are running against a
>> live cluster that will hang around (since many Mahout steps involve
>> more than one map-reduce).
>>
>> It would probably be best to separate the creation of the cluster (with
>> the keep-alive flag set) from the execution of the Mahout jobs, with a
>> subsequent explicit tear-down of the cluster.
>>
>> On Wed, Dec 12, 2012 at 3:55 AM, hellen maziku <[email protected]> wrote:
>>
>>> Hi,
>>> I installed Mahout and Solr.
>>>
>>> I created an index from dictionary.txt using the command below:
>>>
>>> curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" \
>>>   -F "[email protected]"
>>>
>>> To create the vectors from my index, I needed the
>>> org.apache.mahout.utils.vectors.lucene.Driver class. I could not
>>> locate this class in mahout-core-0.7-job.jar; I could only find it in
>>> mahout-examples-0.7-job.jar, so I uploaded mahout-examples-0.7-job.jar
>>> to an S3 bucket.
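>>>
>>> (To check which jar actually contains the class, I listed the jar
>>> contents with something like:
>>>
>>> jar tf mahout-examples-0.7-job.jar | grep vectors/lucene/Driver
>>>
>>> which shows org/apache/mahout/utils/vectors/lucene/Driver.class in the
>>> examples jar only.)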
>>>
>>> I also uploaded the dictionary index to a separate S3 bucket, and I
>>> created another bucket with two folders to store my dictOut and
>>> vectors.
>>>
>>> I created a job flow on the CLI:
>>>
>>> ./elastic-mapreduce --create --alive --log-uri \
>>>   s3n://mahout-output/logs/ --name dict_vectorize
>>>
>>> I added the step to vectorize my index using the following command:
>>>
>>> ./elastic-mapreduce -j j-2NSJRI6N9EQJ4 --jar \
>>>   s3n://mahout-bucket/jars/mahout-examples-0.7-job.jar \
>>>   --main-class org.apache.mahout.utils.vectors.lucene.Driver \
>>>   --arg --dir s3n://mahout-input/input1/index/ \
>>>   --arg --field doc1 \
>>>   --arg --dictOut s3n://mahout-output/solr-dict-out/dict.txt \
>>>   --arg --output s3n://mahout-output/solr-vect-out/vectors
>>>
>>> But in the logs I get the following error:
>>>
>>> 2012-12-12 09:37:17,883 ERROR
>>> org.apache.mahout.utils.vectors.lucene.Driver (main): Exception
>>> org.apache.commons.cli2.OptionException: Missing value(s) --dir
>>>     at org.apache.commons.cli2.option.ArgumentImpl.validate(ArgumentImpl.java:241)
>>>     at org.apache.commons.cli2.option.ParentImpl.validate(ParentImpl.java:124)
>>>     at org.apache.commons.cli2.option.DefaultOption.validate(DefaultOption.java:176)
>>>     at org.apache.commons.cli2.option.GroupImpl.validate(GroupImpl.java:265)
>>>     at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:104)
>>>     at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:197)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
>>>
>>> What am I doing wrong?
>>>
>>> Another question: what is the correct value of the --field argument?
>>> Is it doc1 (the id) or dictionary (from the filename dictionary.txt)?
>>> I am asking because when I issue the query q=doc1 on Solr I get no
>>> results, but when I issue the query q=dictionary, I see my content.
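>>>
>>> One more guess about the error above: could it be how the arguments
>>> are passed through to the Driver? I was not sure whether each token
>>> needs its own --arg, for example (just a guess at the elastic-mapreduce
>>> syntax):
>>>
>>> ./elastic-mapreduce -j j-2NSJRI6N9EQJ4 --jar \
>>>   s3n://mahout-bucket/jars/mahout-examples-0.7-job.jar \
>>>   --main-class org.apache.mahout.utils.vectors.lucene.Driver \
>>>   --arg --dir --arg s3n://mahout-input/input1/index/ \
>>>   --arg --field --arg doc1 \
>>>   --arg --dictOut --arg s3n://mahout-output/solr-dict-out/dict.txt \
>>>   --arg --output --arg s3n://mahout-output/solr-vect-out/vectors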
>>>
>>> Thank you so much for your help. I am a newbie, so please excuse my
>>> being so verbose.