I’m using the crawl script that you had linked earlier.

-- 
Manikandan Saravanan
Architect - Technology
TheSocialPeople
On 4 February 2014 at 7:43:49 pm, Manikandan Saravanan ([email protected]) wrote:

Okay, the crawl runs well for the most part. I’m running the crawl script as

    bin/crawl urls/seed.txt TestCrawl http://xxx.xxx.xxx.xxx:8983/solr/ 2

and it’s giving me this after the parse job:

    Exception in thread "main" java.lang.IllegalArgumentException: usage: (-crawlId <id>)
        at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:117)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.DbUpdaterJob.main(DbUpdaterJob.java:123)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:160)

What is wrong?

-- 
Manikandan Saravanan
Architect - Technology
TheSocialPeople

On 4 February 2014 at 3:11:36 pm, Lewis John Mcgibbney ([email protected]) wrote:

https://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script

On Tue, Feb 4, 2014 at 7:04 AM, Manikandan Saravanan <[email protected]> wrote:

> How do I run the crawl script on hadoop?
> --
> Manikandan Saravanan
> Architect - Technology
> TheSocialPeople <http://thesocialpeople.net>
>
> On 4 February 2014 at 1:28:39 am, Lewis John Mcgibbney ([email protected]) wrote:
>
> Hi Manikandan,
>
> On Mon, Feb 3, 2014 at 3:45 PM, <[email protected]> wrote:
>
> > And then, I'm running this:
> > $HADOOP_HOME/bin/hadoop jar /usr/local/nutch/nutch.job
> > org.apache.nutch.crawl.Crawler dmoz -dir /user/hduser/crawl -depth 3 -topN 5000
> >
>
> You're using the Crawler class. This is not advised at all and is now
> deprecated. There is no point in downloading the crawl script if you are
> going to use the Crawler class. I would suggest you use the crawl script.
>
> > org.apache.gora.memory.store.MemStore as the Gora storage class.
> >
>
> Please don't use MemStore. Its implementation in Gora 0.3 is not thread
> safe and is only used for trivial tests. Please see the 2.x tutorial on the
> Nutch wiki for details of how to configure the supported Gora persistent
> data stores.
>
> Once you've used the crawl script and configured your Nutch deployment job
> file, please get back to us with your results.
> Remember that you will always need to regenerate your Nutch job file if you
> make configuration changes to your Nutch deployment.
> hth
> Thanks
>
> -- *Lewis*
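
The reminder above about regenerating the job file after configuration changes can be checked mechanically. The following is only a minimal sketch, not part of Nutch: `job_is_stale` is a hypothetical helper name, and the configuration directory in the usage comment is an assumption (only the job file path `/usr/local/nutch/nutch.job` appears in the thread). It relies on nothing beyond POSIX `find -newer` to compare modification times.

```shell
#!/bin/sh
# Hypothetical helper (not a Nutch command): succeeds (exit 0) when the job
# file is missing or older than at least one file in the conf directory.
job_is_stale() {
    job_file=$1
    conf_dir=$2

    # A missing job file always counts as stale.
    [ -f "$job_file" ] || return 0

    # find -newer lists files modified more recently than the job file;
    # any output at all means the job file predates a configuration change.
    [ -n "$(find "$conf_dir" -type f -newer "$job_file" -print 2>/dev/null | head -n 1)" ]
}

# Usage (the conf path is an assumed location, adjust to your deployment):
#   if job_is_stale /usr/local/nutch/nutch.job /usr/local/nutch/conf; then
#       echo "Configuration changed: regenerate the job file before crawling."
#   fi
```

A check like this only compares timestamps, so it cannot tell whether a change actually affects the crawl. It is meant as a prompt to rebuild, not as a substitute for doing so.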

