Thank you for the advice. However, I do not have Hadoop installed on my machine. Running the jobs locally with Mahout gives me heap size errors, as can be seen at http://en.wikipedia.org/wiki/User:Bloodysnowrocker/Hadoop. I could only run recommendations locally; clustering and creating vectors were not possible.
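If the local runs are dying on the default JVM heap, one thing that may help is raising it before launching. This is only a minimal sketch, assuming the stock bin/mahout launcher (which, as far as I know, reads the MAHOUT_LOCAL and MAHOUT_HEAPSIZE environment variables); the job and paths are placeholders:

    # assumption: the stock bin/mahout script honours MAHOUT_LOCAL and MAHOUT_HEAPSIZE
    export MAHOUT_LOCAL=true         # force a local (non-Hadoop) run
    export MAHOUT_HEAPSIZE=4096      # heap for the launched JVM, in MB
    bin/mahout seq2sparse -i <input-seqfiles> -o <output-vectors>   # hypothetical local job

Even with a larger heap, clustering or vectorizing a sizeable corpus on a single machine may still not fit in memory, which is where EMR or a local Hadoop install comes back into the picture.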
Do you suggest I should use the EMR GUI to submit my jobs, or should I just install Hadoop on my machine or on EC2 and perform my tasks there?

________________________________
From: Ted Dunning <[email protected]>
To: [email protected]; hellen maziku <[email protected]>
Sent: Wednesday, December 12, 2012 10:56 AM
Subject: Re: Creating vectors from lucene index on EMR via the CLI

I would still recommend that you switch to using the mahout programs directly to submit jobs. Those programs really have an assumption baked in that they will be submitting the jobs themselves.

The EMR commands that you are using take responsibility for creating the environment that you need for job submission, but you are probably not getting the command line arguments through to the Mahout program in good order. As is typical with shell-script-based utilities, determining how to get those across correctly is probably somewhat difficult.

On Wed, Dec 12, 2012 at 7:58 AM, hellen maziku <[email protected]> wrote:

> Hi Ted,
> If I am running it as a single step, then how come I can add more steps to
> it? Currently there are 6 steps. Every time I get the errors, I just add
> another step to the same job ID. So I don't understand.
>
> Also, the command to create the job flow is
>
> ./elastic-mapreduce --create --alive --log-uri s3n://mahout-output/logs/ --name dict_vectorize
>
> Doesn't that mean that keep-alive is set?
>
> ________________________________
> From: Ted Dunning <[email protected]>
> To: [email protected]; hellen maziku <[email protected]>
> Sent: Wednesday, December 12, 2012 9:48 AM
> Subject: Re: Creating vectors from lucene index on EMR via the CLI
>
> You are trying to run this job as a single step in an EMR flow. Mahout's
> command line programs assume that you are running against a live cluster
> that will hang around (since many Mahout steps involve more than one
> map-reduce).
>
> It would probably be best to separate the creation of the cluster (with the
> keep-alive flag set) from the execution of the Mahout jobs, with a
> subsequent explicit tear-down of the cluster.
>
> On Wed, Dec 12, 2012 at 3:55 AM, hellen maziku <[email protected]> wrote:
>
> > Hi,
> > I installed Mahout and Solr.
> >
> > I created an index from dictionary.txt using the command below:
> >
> > curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "[email protected]"
> >
> > To create the vectors from my index, I needed the
> > org.apache.mahout.utils.vectors.lucene.Driver class. I could not locate
> > this class in mahout-core-0.7-job.jar; I could only locate it in
> > mahout-examples-0.7-job.jar, so I uploaded mahout-examples-0.7-job.jar
> > to an S3 bucket.
> >
> > I also uploaded the dictionary index to a separate S3 bucket. I created
> > another bucket with two folders to store my dictOut and vectors.
> >
> > I created a job flow on the CLI:
> >
> > ./elastic-mapreduce --create --alive --log-uri s3n://mahout-output/logs/ --name dict_vectorize
> >
> > I added the step to vectorize my index using the following command:
> >
> > ./elastic-mapreduce -j j-2NSJRI6N9EQJ4 --jar s3n://mahout-bucket/jars/mahout-examples-0.7-job.jar --main-class org.apache.mahout.utils.vectors.lucene.Driver --arg --dir s3n://mahout-input/input1/index/ --arg --field doc1 --arg --dictOut s3n://mahout-output/solr-dict-out/dict.txt --arg --output s3n://mahout-output/solr-vect-out/vectors
> >
> > But in the logs I get the following error:
> >
> > 2012-12-12 09:37:17,883 ERROR org.apache.mahout.utils.vectors.lucene.Driver (main): Exception
> > org.apache.commons.cli2.OptionException: Missing value(s) --dir
> >     at org.apache.commons.cli2.option.ArgumentImpl.validate(ArgumentImpl.java:241)
> >     at org.apache.commons.cli2.option.ParentImpl.validate(ParentImpl.java:124)
> >     at org.apache.commons.cli2.option.DefaultOption.validate(DefaultOption.java:176)
> >     at org.apache.commons.cli2.option.GroupImpl.validate(GroupImpl.java:265)
> >     at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:104)
> >     at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:197)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >     at java.lang.reflect.Method.invoke(Method.java:597)
> >     at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
> >
> > What am I doing wrong?
> >
> > Another question: what is the correct value of the --field argument? Is it
> > doc1 (the id) or dictionary (from the filename dictionary.txt)? I am asking
> > this because when I issue the query with q=doc1 on Solr I get no
> > results, but when I issue the query with q=dictionary, I see my content.
> >
> > Thank you so much for your help. I am a newbie, so please excuse my being too
> > verbose.
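If, as Ted suggests above, the option values are not making it through to the Driver intact, one thing to try is giving each option and each value its own --arg. This is only a sketch, on the assumption that the elastic-mapreduce client forwards exactly one token per --arg:

    ./elastic-mapreduce -j j-2NSJRI6N9EQJ4 \
      --jar s3n://mahout-bucket/jars/mahout-examples-0.7-job.jar \
      --main-class org.apache.mahout.utils.vectors.lucene.Driver \
      --arg --dir     --arg s3n://mahout-input/input1/index/ \
      --arg --field   --arg doc1 \
      --arg --dictOut --arg s3n://mahout-output/solr-dict-out/dict.txt \
      --arg --output  --arg s3n://mahout-output/solr-vect-out/vectors

If the same "Missing value(s) --dir" error still appears after that, the problem is probably not the argument quoting.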
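Separating cluster creation from job execution, as recommended in the quoted reply, might look roughly like the following. It is only a sketch: the --ssh login, the availability of a Mahout 0.7 installation on the master node, and the paths are assumptions rather than anything confirmed in this thread.

    # 1. create a long-lived (keep-alive) cluster
    ./elastic-mapreduce --create --alive --log-uri s3n://mahout-output/logs/ --name dict_vectorize

    # 2. log in to the master node of the job flow returned by step 1
    ./elastic-mapreduce -j <jobflow-id> --ssh

    #    on the master node (assumes Mahout 0.7 is installed there and the
    #    Lucene index has been copied over):
    bin/mahout lucene.vector --dir <path-to-lucene-index> --field <field-name> \
        --dictOut <path>/dict.txt --output <path>/vectors

    # 3. tear the cluster down explicitly once the jobs have finished
    ./elastic-mapreduce -j <jobflow-id> --terminate

With this layout the Mahout program itself submits the map-reduce jobs to the live cluster, which is the assumption Ted describes above.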
