The urls dir is not specified in the command: bin/nutch crawl -solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 10000
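For example, assuming the seed list sits in a local directory named 'urls' (the directory name here is only an illustration), that directory is passed as the first argument:

bin/nutch crawl urls -solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 10000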
在 2013-02-18 09:53:33,"Lewis John Mcgibbney" <[email protected]> 写道: >Wherever your url directory is kept > >On Sunday, February 17, 2013, 高睿 <[email protected]> wrote: >> Hi, >> >> What do you mean the same directory? '/tmp' or '${NUTCH_HOME}'? >> >> >> >> >> >> >> >> >> At 2013-02-18 00:45:00,"Lewis John Mcgibbney" <[email protected]> >wrote: >>>Hi, >>>Please make sure you have no temp files in the same directory and try >again >>>Please either use the crawl script which is provided with nutch or >>>alternatively build your own script. >>> >>> >>>On Sunday, February 17, 2013, 高睿 <[email protected]> wrote: >>>> Hi, >>>> Additional, the nutch version is 2.1. And I have an ParserFilter to >purge >>>outlinks of parse object. (by code: parse.setOutlinks(new Outlink[] {});) >>>> >>>> When I specify '-depth 1', the url is only crawled once, and If I >specify >>>'-depth 3', the url is crawled 3 times. >>>> Is this expected behavior? Should I use command 'crawl' to do all works >>>in one go? >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> At 2013-02-17 22:11:22,"高睿" <[email protected]> wrote: >>>>>Hi, >>>>> >>>>>There's only 1 url in table 'webpage'. I run command: bin/nutch crawl >>>-solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN >>>10000, then I find the url is crawled twice. >>>>> >>>>>Here's the log: >>>>> 55 2013-02-17 20:45:00,965 INFO fetcher.FetcherJob - fetching >>>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm >>>>> 84 2013-02-17 20:45:11,021 INFO parse.ParserJob - Parsing >>>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm >>>>>215 2013-02-17 20:45:38,922 INFO fetcher.FetcherJob - fetching >>>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm >>>>>244 2013-02-17 20:45:46,031 INFO parse.ParserJob - Parsing >>>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm >>>>> >>>>>Do you know how to fix this? >>>>>Besides, when I run the command again. The same log is written in >>>hadoop.log. I don't know why the configuration 'db.fetch.interval.default' >>>in nutch-site.xml doesn't take effect. >>>>> >>>>>Thanks. >>>>> >>>>>Regards, >>>>>Rui >>>> >>> >>>-- >>>*Lewis* >> > >-- >*Lewis*
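For reference, the outlink-purging ParserFilter described in the thread can be written against the Nutch 2.x ParseFilter extension point. The class below is a minimal sketch, not Rui's actual plugin: the class name PurgeOutlinksFilter is made up here, and the filter signature and imports are assumed to match Nutch 2.1.

import java.util.Collection;
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseFilter;
import org.apache.nutch.storage.WebPage;
import org.w3c.dom.DocumentFragment;

/**
 * Hypothetical plugin that drops every outlink the parser extracted,
 * so the crawl never discovers urls beyond the seeds.
 */
public class PurgeOutlinksFilter implements ParseFilter {

  private Configuration conf;

  @Override
  public Parse filter(String url, WebPage page, Parse parse,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    // Same call quoted in the thread: replace all outlinks with an empty array.
    parse.setOutlinks(new Outlink[] {});
    return parse;
  }

  @Override
  public Collection<WebPage.Field> getFields() {
    // This filter reads no extra WebPage fields.
    return Collections.emptySet();
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}

With all outlinks purged, a depth greater than 1 has nothing new to discover, so the repeated fetches of the single seed in the log above come from the extra generate/fetch cycles re-selecting the same url, not from new links.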

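As for the last question in the thread: db.fetch.interval.default is the number of seconds before a fetched page becomes due for re-fetching (the stock default is 2592000, i.e. 30 days). A minimal override in conf/nutch-site.xml, shown here with an illustrative one-week value, would look like:

<property>
  <name>db.fetch.interval.default</name>
  <value>604800</value>
  <description>Number of seconds between re-fetches of a page (here 7 days).</description>
</property>

One common reason such a change seems to have no effect: in Nutch 2.x the runtime is built from conf/, so after editing nutch-site.xml the runtime must be regenerated (ant runtime) before the new value is picked up.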
