Okay, the crawl runs well for the most part.
I’m running the crawl script as:
bin/crawl urls/seed.txt TestCrawl http://xxx.xxx.xxx.xxx:8983/solr/ 2
And it’s giving me this:
Exception in thread "main" java.lang.IllegalArgumentException: usage: (-crawlId <id>)
    at org.apache.nutch.crawl.DbUpdaterJob.run(DbUpdaterJob.java:117)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.DbUpdaterJob.main(DbUpdaterJob.java:123)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
This happens after the parse job. What is wrong?
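
For reference, the invocation above can be spelled out with each positional argument named. This is only a sketch: the variable names are illustrative and not part of the Nutch script, and the order (seed location, crawl ID, Solr URL, number of rounds) is assumed from how the command was typed.

```shell
#!/bin/sh
# Sketch only: these variable names are illustrative, not defined by Nutch.
# Assumed positional order: seed location, crawl ID, Solr URL, number of rounds.
SEED_PATH="urls/seed.txt"
CRAWL_ID="TestCrawl"
SOLR_URL="http://xxx.xxx.xxx.xxx:8983/solr/"   # host redacted, as in the message
ROUNDS=2

# Assemble the command line so every argument can be checked before running it.
CRAWL_CMD="bin/crawl $SEED_PATH $CRAWL_ID $SOLR_URL $ROUNDS"

# Print the command for inspection instead of executing it.
echo "$CRAWL_CMD"
```

Printing the command first makes it easy to confirm that the crawl ID (TestCrawl here) sits in the expected position before the script is run.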
--
Manikandan Saravanan
Architect - Technology
TheSocialPeople
On 4 February 2014 at 3:11:36 pm, Lewis John Mcgibbney
([email protected]) wrote:
https://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script
On Tue, Feb 4, 2014 at 7:04 AM, Manikandan Saravanan <
[email protected]> wrote:
> How do I run the crawl script on hadoop?
> --
> Manikandan Saravanan
> Architect - Technology
> TheSocialPeople <http://thesocialpeople.net>
>
> On 4 February 2014 at 1:28:39 am, Lewis John Mcgibbney
> ([email protected]) wrote:
>
> Hi Manikandan,
>
> On Mon, Feb 3, 2014 at 3:45 PM, <[email protected]>
> wrote:
>
> > And then, I'm running this:
> > $HADOOP_HOME/bin/hadoop jar /usr/local/nutch/nutch.job
> > org.apache.nutch.crawl.Crawler dmoz -dir /user/hduser/crawl -depth 3
> > -topN 5000
> >
>
> You're using the Crawler class. This is not advised at all, and the class is
> now deprecated. There is no point in downloading the crawl script if you are
> going to use the Crawler class. I would suggest you use the crawl script.
>
>
> >
> > org.apache.gora.memory.store.MemStore as the Gora storage class.
> >
>
> Please don't use MemStore. Its implementation in Gora 0.3 is not thread-safe
> and is only used for trivial tests. Please see the 2.x tutorial on the
> Nutch wiki for details of how to configure the supported Gora persistent
> data stores.
>
>
> Once you've used the crawl script and configured your Nutch deployment job
> file, please get back to us with your results.
> Remember, you will always need to regenerate your Nutch job file if you make
> configuration changes to your Nutch deployment.
> hth
> Thanks
>
>
--
*Lewis*