You're doing nothing wrong; this is just a debug entry. curTime is simply the 
current time, and fetchTime is the time in the future after which the record 
must be fetched again. The fetch time is controlled by your fetch scheduler; 
see the API docs for AbstractFetchSchedule.
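For reference, both values in the log line are epoch milliseconds. Decoding them shows the record's next fetch time lies roughly 45 days in the future, so the Generator correctly skips it:

```python
from datetime import datetime, timezone

# Timestamps copied from the log in this thread (epoch milliseconds).
fetch_time_ms = 1359626286623
cur_time_ms = 1355738313780

fetch_dt = datetime.fromtimestamp(fetch_time_ms / 1000, tz=timezone.utc)
cur_dt = datetime.fromtimestamp(cur_time_ms / 1000, tz=timezone.utc)
days_ahead = (fetch_time_ms - cur_time_ms) / 1000 / 86400

print(cur_dt.strftime("%Y-%m-%d %H:%M:%S"))    # 2012-12-17 09:58:33 (UTC)
print(fetch_dt.strftime("%Y-%m-%d %H:%M:%S"))  # 2013-01-31 09:58:06 (UTC)
print(round(days_ahead))                       # 45
```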

I assume http://www.lequipe.fr/Football is already fetched. Check whether this 
is the case using the readdb tool.
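Assuming the Nutch 1.x layout and the paths shown in the log below (adjust both to your installation), a readdb call along these lines dumps the CrawlDb record for a single URL, including its status and next fetch time:

```shell
# Print the CrawlDb entry for one URL; the crawldb path here is taken
# from the log in this thread and is only an assumption about your layout.
bin/nutch readdb /opt/project/current/crawl_project/nutch/crawl/1300/crawldb \
    -url http://www.lequipe.fr/Football/
```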
 
-----Original message-----
> From:Jan Philippe Wimmer <[email protected]>
> Sent: Mon 17-Dec-2012 13:42
> To: [email protected]
> Subject: Re: shouldFetch rejected
> 
> Ahh, but I crawl other URLs with the same settings and there it works. 
> What am I doing wrong? What is the correct setting? Which settings are 
> responsible for fetchTime being ahead of curTime?
> On 17.12.2012 13:40, Markus Jelsma wrote:
> > Hi - curTime does not exceed fetchTime, thus the record is not eligible for 
> > fetch.
> >   
> >   
> > -----Original message-----
> >> From:Jan Philippe Wimmer <[email protected]>
> >> Sent: Mon 17-Dec-2012 13:31
> >> To: [email protected]
> >> Subject: Re: shouldFetch rejected
> >>
> >> Hi again.
> >>
> >> I still have this issue. I start with a completely new crawl directory
> >> structure and get the following error:
> >>
> >> -shouldFetch rejected 'http://www.lequipe.fr/Football/',
> >> fetchTime=1359626286623, curTime=1355738313780
> >>
> >> Full-Log:
> >> crawl started in: /opt/project/current/crawl_project/nutch/crawl/1300
> >> rootUrlDir = /opt/project/current/crawl_project/nutch/urls/url_1300
> >> threads = 20
> >> depth = 3
> >> solrUrl=http://192.168.1.144:8983/solr/
> >> topN = 400
> >> Injector: starting at 2012-12-17 10:57:36
> >> Injector: crawlDb:
> >> /opt/project/current/crawl_project/nutch/crawl/1300/crawldb
> >> Injector: urlDir: /opt/project/current/crawl_project/nutch/urls/url_1300
> >> Injector: Converting injected urls to crawl db entries.
> >> Injector: Merging injected urls into crawl db.
> >> Injector: finished at 2012-12-17 10:57:51, elapsed: 00:00:14
> >> Generator: starting at 2012-12-17 10:57:51
> >> Generator: Selecting best-scoring urls due for fetch.
> >> Generator: filtering: true
> >> Generator: normalizing: true
> >> Generator: topN: 400
> >> Generator: jobtracker is 'local', generating exactly one partition.
> >> Generator: Partitioning selected urls for politeness.
> >> Generator: segment:
> >> /opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759
> >> Generator: finished at 2012-12-17 10:58:06, elapsed: 00:00:15
> >> Fetcher: Your 'http.agent.name' value should be listed first in
> >> 'http.robots.agents' property.
> >> Fetcher: starting at 2012-12-17 10:58:06
> >> Fetcher: segment:
> >> /opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759
> >> Using queue mode : byHost
> >> Fetcher: threads: 20
> >> Fetcher: time-out divisor: 2
> >> QueueFeeder finished: total 1 records + hit by time limit :0
> >> Using queue mode : byHost
> >> Using queue mode : byHost
> >> fetching http://www.lequipe.fr/Football/
> >> -finishing thread FetcherThread, activeThreads=1
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=1
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=1
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=1
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=1
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=1
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=1
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=1
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=1
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=1
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=1
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=1
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=1
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=1
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=1
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=1
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=1
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=1
> >> Using queue mode : byHost
> >> -finishing thread FetcherThread, activeThreads=1
> >> Fetcher: throughput threshold: -1
> >> Fetcher: throughput threshold retries: 5
> >> -finishing thread FetcherThread, activeThreads=0
> >> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> >> -activeThreads=0
> >> Fetcher: finished at 2012-12-17 10:58:13, elapsed: 00:00:07
> >> ParseSegment: starting at 2012-12-17 10:58:13
> >> ParseSegment: segment:
> >> /opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759
> >> ParseSegment: finished at 2012-12-17 10:58:20, elapsed: 00:00:07
> >> CrawlDb update: starting at 2012-12-17 10:58:20
> >> CrawlDb update: db:
> >> /opt/project/current/crawl_project/nutch/crawl/1300/crawldb
> >> CrawlDb update: segments:
> >> [/opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759]
> >> CrawlDb update: additions allowed: true
> >> CrawlDb update: URL normalizing: true
> >> CrawlDb update: URL filtering: true
> >> CrawlDb update: 404 purging: false
> >> CrawlDb update: Merging segment data into db.
> >> CrawlDb update: finished at 2012-12-17 10:58:33, elapsed: 00:00:13
> >> Generator: starting at 2012-12-17 10:58:33
> >> Generator: Selecting best-scoring urls due for fetch.
> >> Generator: filtering: true
> >> Generator: normalizing: true
> >> Generator: topN: 400
> >> Generator: jobtracker is 'local', generating exactly one partition.
> >> -shouldFetch rejected 'http://www.lequipe.fr/Football/',
> >> fetchTime=1359626286623, curTime=1355738313780
> >> Generator: 0 records selected for fetching, exiting ...
> >> Stopping at depth=1 - no more URLs to fetch.
> >> LinkDb: starting at 2012-12-17 10:58:40
> >> LinkDb: linkdb: /opt/project/current/crawl_project/nutch/crawl/1300/linkdb
> >> LinkDb: URL normalize: true
> >> LinkDb: URL filter: true
> >> LinkDb: internal links will be ignored.
> >> LinkDb: adding segment:
> >> file:/opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759
> >> LinkDb: finished at 2012-12-17 10:58:47, elapsed: 00:00:07
> >> SolrIndexer: starting at 2012-12-17 10:58:47
> >> SolrIndexer: deleting gone documents: false
> >> SolrIndexer: URL filtering: false
> >> SolrIndexer: URL normalizing: false
> >> SolrIndexer: finished at 2012-12-17 10:59:09, elapsed: 00:00:22
> >> SolrDeleteDuplicates: starting at 2012-12-17 10:59:09
> >> SolrDeleteDuplicates: Solr url: http://192.168.1.144:8983/solr/
> >> SolrDeleteDuplicates: finished at 2012-12-17 10:59:47, elapsed: 00:00:37
> >>
> >> On 25.11.2012 21:02, Sebastian Nagel wrote:
> >>>> But, i create a complete new crawl dir for every crawl.
> >>> Then all should work as expected.
> >>>
> >>>> why the crawler sets a "page to fetch" to rejected. Because obviously
> >>>> the crawler never saw this page before (because I deleted all the old 
> >>>> crawl dirs).
> >>>> In the crawl log I see many pages to fetch, but in the end all of them 
> >>>> are rejected
> >>> Are you sure they aren't fetched at all? This debug output in the 
> >>> Generator mapper is also shown for URLs fetched in previous cycles. You 
> >>> should check the complete log for the "rejected" URLs.
> >>>
> >>>
> >>> On 11/24/2012 04:46 PM, Jan Philippe Wimmer wrote:
> >>>> Hey Sebastian! Thanks for your answer.
> >>>>
> >>>> But I create a completely new crawl dir for every crawl. In other 
> >>>> words, I only have the crawl data of the currently running crawl 
> >>>> process. When I recrawl a URL set, I delete the old crawl dir and 
> >>>> create a new one. At the end of every crawl I index it to Solr, so I 
> >>>> keep all crawled content in the index. I don't need any Nutch crawl 
> >>>> dirs, because I want to crawl all relevant pages in every crawl 
> >>>> process, again and again.
> >>>>
> >>>> I totally don't understand why the crawler sets a "page to fetch" to 
> >>>> rejected, because obviously the crawler never saw this page before (I 
> >>>> deleted all the old crawl dirs). In the crawl log I see many pages to 
> >>>> fetch, but in the end all of them are rejected. Any ideas?
> >>>>
> >>>> On 24.11.2012 16:36, Sebastian Nagel wrote:
> >>>>>> I want my crawler to crawl the complete page without setting up 
> >>>>>> schedulers at all. Every crawl
> >>>>>> process should crawl every page again without having set up wait 
> >>>>>> intervals.
> >>>>> That's quite easy: remove all data and launch the crawl again.
> >>>>> - Nutch 1.x: remove crawldb, segments, and linkdb
> >>>>> - 2.x: drop 'webpage' (or similar, depending on the chosen data store)
> >>>>>
> >>>>> On 11/24/2012 12:17 PM, Jan Philippe Wimmer wrote:
> >>>>>> Hi there,
> >>>>>>
> >>>>>> How can I avoid the following error:
> >>>>>> -shouldFetch rejected 'http://www.page.com/shop', 
> >>>>>> fetchTime=1356347311285, curTime=1353755337755
> >>>>>>
> >>>>>> I want my crawler to crawl the complete page without setting up 
> >>>>>> schedulers at all. Every crawl
> >>>>>> process should crawl every page again without having set up wait 
> >>>>>> intervals.
> >>>>>>
> >>>>>> Any solutions?
> >>
> 
> 
