Wherever your url directory is kept.

On Sunday, February 17, 2013, 高睿 <[email protected]> wrote:
> Hi,
>
> What do you mean by 'the same directory'? '/tmp' or '${NUTCH_HOME}'?
>
> At 2013-02-18 00:45:00, "Lewis John Mcgibbney" <[email protected]> wrote:
>> Hi,
>> Please make sure you have no temp files in the same directory and try again.
>> Please either use the crawl script which is provided with Nutch, or
>> alternatively build your own script.
>>
>> On Sunday, February 17, 2013, 高睿 <[email protected]> wrote:
>>> Hi,
>>> Additionally, the Nutch version is 2.1, and I have a ParserFilter to purge
>>> the outlinks of the parse object (by code: parse.setOutlinks(new Outlink[] {});).
>>>
>>> When I specify '-depth 1', the url is only crawled once, and if I specify
>>> '-depth 3', the url is crawled 3 times.
>>> Is this expected behavior? Should I use the 'crawl' command to do all the
>>> work in one go?
>>>
>>> At 2013-02-17 22:11:22, "高睿" <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> There's only 1 url in the 'webpage' table. I run the command: bin/nutch
>>>> crawl -solr http://localhost:8080/solr/collection2 -threads 10 -depth 2
>>>> -topN 10000, and then I find the url is crawled twice.
>>>>
>>>> Here's the log:
>>>> 2013-02-17 20:45:00,965 INFO fetcher.FetcherJob - fetching http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>> 2013-02-17 20:45:11,021 INFO parse.ParserJob - Parsing http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>> 2013-02-17 20:45:38,922 INFO fetcher.FetcherJob - fetching http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>> 2013-02-17 20:45:46,031 INFO parse.ParserJob - Parsing http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>>
>>>> Do you know how to fix this?
>>>> Besides, when I run the command again, the same log is written to
>>>> hadoop.log. I don't know why the configuration 'db.fetch.interval.default'
>>>> in nutch-site.xml doesn't take effect.
>>>>
>>>> Thanks.
>>>>
>>>> Regards,
>>>> Rui
>>
>> --
>> *Lewis*
--
*Lewis*
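
On Rui's ParserFilter: Nutch 2.x exposes this hook as the ParseFilter
extension point. Below is a minimal sketch of a filter that discards every
outlink, along the lines of what Rui describes; the class name is
hypothetical and this assumes the Nutch 2.x ParseFilter interface, so treat
it as an illustration rather than Rui's actual plugin.

    import java.util.Collection;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.Outlink;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseFilter;
    import org.apache.nutch.storage.WebPage;
    import org.w3c.dom.DocumentFragment;

    /** Hypothetical parse filter that discards every extracted outlink. */
    public class NoOutlinksParseFilter implements ParseFilter {

      private Configuration conf;

      @Override
      public Parse filter(String url, WebPage page, Parse parse,
          HTMLMetaTags metaTags, DocumentFragment doc) {
        // Replace the extracted outlinks with an empty array so the
        // db update step has no new links to add to the webpage table.
        parse.setOutlinks(new Outlink[] {});
        return parse;
      }

      @Override
      public Collection<WebPage.Field> getFields() {
        return null; // no extra storage fields needed
      }

      @Override
      public void setConf(Configuration conf) {
        this.conf = conf;
      }

      @Override
      public Configuration getConf() {
        return conf;
      }
    }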
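
On doing "all the work in one go": instead of the one-shot 'bin/nutch crawl'
command, the same cycle can be driven step by step, which makes it easier to
see which phase re-selects the URL. A sketch for Nutch 2.x follows; exact
flags vary between 2.x releases, so treat it as an outline rather than
copy-paste.

    # seed the webpage table from a directory of seed files, e.g. urls/
    bin/nutch inject urls

    # one crawl round:
    bin/nutch generate -topN 10000   # mark a batch of due URLs for fetching
    bin/nutch fetch -all             # fetch the generated batch
    bin/nutch parse -all             # parse the fetched pages
    bin/nutch updatedb               # write new fetch times/outlinks back

    # index into Solr
    bin/nutch solrindex http://localhost:8080/solr/collection2 -all

Repeating the generate/fetch/parse/updatedb block is what '-depth N' does
internally; a URL whose fetch time was pushed into the future by updatedb
should not be generated again in the next round.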
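
On 'db.fetch.interval.default': for reference, this is how the interval is
normally overridden in conf/nutch-site.xml (2592000 seconds, i.e. 30 days,
is the value shipped in nutch-default.xml):

    <property>
      <name>db.fetch.interval.default</name>
      <value>2592000</value>
      <description>The default number of seconds between re-fetches
      of a page (here 30 days).</description>
    </property>

Note that the interval only takes effect once updatedb has written a new
fetch time for the page; generate then skips any page whose fetch time is
still in the future.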

