I think I have no choice but to do that. I am able to ssh to the EMR cluster, but how do I run my Mahout job from there? I do not know how to proceed. Also, how can I mount my input files?
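
For example, after logging in to the master node, would something like the following be the right idea? I am guessing at the key file and the master's public DNS (from the EMR console), and I am not sure whether the Driver wants the index on HDFS or on the local disk:

ssh -i my-key.pem hadoop@<master-public-dns>

# on the master, assuming I have copied mahout-examples-0.7-job.jar there:
hadoop distcp s3n://mahout-input/input1/index/ /index/
hadoop jar mahout-examples-0.7-job.jar \
  org.apache.mahout.utils.vectors.lucene.Driver \
  --dir /index --field dictionary \
  --dictOut /tmp/dict.txt --output /vectors

And is copying from S3 like that the right way to get at my input files, rather than mounting anything?
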
________________________________
From: Ted Dunning <[email protected]>
To: [email protected]; hellen maziku <[email protected]>
Sent: Wednesday, December 12, 2012 2:20 PM
Subject: Re: Creating vectors from lucene index on EMR via the CLI

You can ssh to the EMR cluster if you like.

On Wed, Dec 12, 2012 at 9:38 AM, hellen maziku <[email protected]> wrote:

> Thank you for the advice. But I do not have Hadoop installed on my
> machine. Running the jobs locally with Mahout gives me heap-size errors.
> I could only do recommendations locally; clustering and creating vectors
> were not possible.
>
> Do you suggest I use the EMR GUI to submit my jobs, or should I just
> install Hadoop on my machine or on EC2 and perform my tasks there?
>
> ________________________________
> From: Ted Dunning <[email protected]>
> To: [email protected]; hellen maziku <[email protected]>
> Sent: Wednesday, December 12, 2012 10:56 AM
> Subject: Re: Creating vectors from lucene index on EMR via the CLI
>
> I would still recommend that you switch to using the Mahout programs
> directly to submit jobs. Those programs really have an assumption baked
> in that they will be submitting the jobs themselves. The EMR commands
> that you are using take responsibility for creating the environment that
> you need for job submission, but you are probably not getting the
> command-line arguments through to the Mahout program in good order. As
> is typical with shell-script-based utilities, determining how to get
> those across correctly is probably somewhat difficult.
>
> On Wed, Dec 12, 2012 at 7:58 AM, hellen maziku <[email protected]> wrote:
>
>> Hi Ted,
>> If I am running it as a single step, then how come I can add more steps
>> to it? Currently there are 6 steps. Every time I get the errors, I just
>> add another step to the same job ID, so I don't understand.
>>
>> Also, the command to create the job flow is
>>
>> ./elastic-mapreduce --create --alive --log-uri \
>>   s3n://mahout-output/logs/ --name dict_vectorize
>>
>> Doesn't that mean that keep-alive is set?
>>
>> ________________________________
>> From: Ted Dunning <[email protected]>
>> To: [email protected]; hellen maziku <[email protected]>
>> Sent: Wednesday, December 12, 2012 9:48 AM
>> Subject: Re: Creating vectors from lucene index on EMR via the CLI
>>
>> You are trying to run this job as a single step in an EMR flow.
>> Mahout's command-line programs assume that you are running against a
>> live cluster that will hang around (since many Mahout steps involve
>> more than one map-reduce).
>>
>> It would probably be best to separate the creation of the cluster (with
>> the keep-alive flag set) from the execution of the Mahout jobs, with a
>> subsequent explicit tear-down of the cluster.
>>
>> On Wed, Dec 12, 2012 at 3:55 AM, hellen maziku <[email protected]> wrote:
>>
>>> Hi,
>>> I installed Mahout and Solr.
>>>
>>> I created an index from dictionary.txt using the command below:
>>>
>>> curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" \
>>>   -F "[email protected]"
>>>
>>> To create the vectors from my index, I needed the
>>> org.apache.mahout.utils.vectors.lucene.Driver class. I could not
>>> locate this class in mahout-core-0.7-job.jar; I could only find it in
>>> mahout-examples-0.7-job.jar, so I uploaded mahout-examples-0.7-job.jar
>>> to an S3 bucket.
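>>>
>>> (To check which jar actually contains the class, I listed the jar
>>> contents with something like:
>>>
>>> jar tf mahout-examples-0.7-job.jar | grep vectors/lucene/Driver
>>>
>>> which shows org/apache/mahout/utils/vectors/lucene/Driver.class in the
>>> examples jar only.)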
>>>
>>> I also uploaded the dictionary index to a separate S3 bucket, and I
>>> created another bucket with two folders to store my dictOut and
>>> vectors.
>>>
>>> I created a job flow on the CLI:
>>>
>>> ./elastic-mapreduce --create --alive --log-uri \
>>>   s3n://mahout-output/logs/ --name dict_vectorize
>>>
>>> I added the step to vectorize my index using the following command:
>>>
>>> ./elastic-mapreduce -j j-2NSJRI6N9EQJ4 --jar \
>>>   s3n://mahout-bucket/jars/mahout-examples-0.7-job.jar \
>>>   --main-class org.apache.mahout.utils.vectors.lucene.Driver \
>>>   --arg --dir s3n://mahout-input/input1/index/ \
>>>   --arg --field doc1 \
>>>   --arg --dictOut s3n://mahout-output/solr-dict-out/dict.txt \
>>>   --arg --output s3n://mahout-output/solr-vect-out/vectors
>>>
>>> But in the logs I get the following error:
>>>
>>> 2012-12-12 09:37:17,883 ERROR
>>> org.apache.mahout.utils.vectors.lucene.Driver (main): Exception
>>> org.apache.commons.cli2.OptionException: Missing value(s) --dir
>>>     at org.apache.commons.cli2.option.ArgumentImpl.validate(ArgumentImpl.java:241)
>>>     at org.apache.commons.cli2.option.ParentImpl.validate(ParentImpl.java:124)
>>>     at org.apache.commons.cli2.option.DefaultOption.validate(DefaultOption.java:176)
>>>     at org.apache.commons.cli2.option.GroupImpl.validate(GroupImpl.java:265)
>>>     at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:104)
>>>     at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:197)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
>>>
>>> What am I doing wrong?
>>>
>>> Another question: what is the correct value of the --field argument?
>>> Is it doc1 (the id) or dictionary (from the filename dictionary.txt)?
>>> I am asking because when I issue the query q=doc1 on Solr I get no
>>> results, but when I issue the query q=dictionary, I see my content.
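>>>
>>> One more guess about the error above: could it be how the arguments
>>> are passed through to the Driver? I was not sure whether each token
>>> needs its own --arg, for example (just a guess at the elastic-mapreduce
>>> syntax):
>>>
>>> ./elastic-mapreduce -j j-2NSJRI6N9EQJ4 --jar \
>>>   s3n://mahout-bucket/jars/mahout-examples-0.7-job.jar \
>>>   --main-class org.apache.mahout.utils.vectors.lucene.Driver \
>>>   --arg --dir --arg s3n://mahout-input/input1/index/ \
>>>   --arg --field --arg doc1 \
>>>   --arg --dictOut --arg s3n://mahout-output/solr-dict-out/dict.txt \
>>>   --arg --output --arg s3n://mahout-output/solr-vect-out/vectors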
>>>
>>> Thank you so much for your help. I am a newbie, so please excuse my
>>> being so verbose.