Hi again.

I still have that issue. I start with a completely new crawl directory structure and get the following error:

-shouldFetch rejected 'http://www.lequipe.fr/Football/', fetchTime=1359626286623, curTime=1355738313780

Full-Log:
crawl started in: /opt/project/current/crawl_project/nutch/crawl/1300
rootUrlDir = /opt/project/current/crawl_project/nutch/urls/url_1300
threads = 20
depth = 3
solrUrl=http://192.168.1.144:8983/solr/
topN = 400
Injector: starting at 2012-12-17 10:57:36
Injector: crawlDb: /opt/project/current/crawl_project/nutch/crawl/1300/crawldb
Injector: urlDir: /opt/project/current/crawl_project/nutch/urls/url_1300
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-12-17 10:57:51, elapsed: 00:00:14
Generator: starting at 2012-12-17 10:57:51
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 400
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: /opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759
Generator: finished at 2012-12-17 10:58:06, elapsed: 00:00:15
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-12-17 10:58:06
Fetcher: segment: /opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759
Using queue mode : byHost
Fetcher: threads: 20
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.lequipe.fr/Football/
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-12-17 10:58:13, elapsed: 00:00:07
ParseSegment: starting at 2012-12-17 10:58:13
ParseSegment: segment: /opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759
ParseSegment: finished at 2012-12-17 10:58:20, elapsed: 00:00:07
CrawlDb update: starting at 2012-12-17 10:58:20
CrawlDb update: db: /opt/project/current/crawl_project/nutch/crawl/1300/crawldb
CrawlDb update: segments: [/opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-12-17 10:58:33, elapsed: 00:00:13
Generator: starting at 2012-12-17 10:58:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 400
Generator: jobtracker is 'local', generating exactly one partition.
-shouldFetch rejected 'http://www.lequipe.fr/Football/', fetchTime=1359626286623, curTime=1355738313780
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2012-12-17 10:58:40
LinkDb: linkdb: /opt/project/current/crawl_project/nutch/crawl/1300/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759
LinkDb: finished at 2012-12-17 10:58:47, elapsed: 00:00:07
SolrIndexer: starting at 2012-12-17 10:58:47
SolrIndexer: deleting gone documents: false
SolrIndexer: URL filtering: false
SolrIndexer: URL normalizing: false
SolrIndexer: finished at 2012-12-17 10:59:09, elapsed: 00:00:22
SolrDeleteDuplicates: starting at 2012-12-17 10:59:09
SolrDeleteDuplicates: Solr url: http://192.168.1.144:8983/solr/
SolrDeleteDuplicates: finished at 2012-12-17 10:59:47, elapsed: 00:00:37
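
For reference, the two numbers in the rejected line are plain epoch milliseconds: curTime matches the
second Generator run on 2012-12-17, while fetchTime decodes to a date about 45 days later. In other
words, the seed URL was fetched in the first cycle (see "fetching http://www.lequipe.fr/Football/"
above) and its next fetch is simply not yet due in the second cycle. A minimal decoding sketch,
plain Java, nothing Nutch-specific, with the values copied from the log:

import java.util.Date;

public class DecodeFetchTimes {
    public static void main(String[] args) {
        long fetchTime = 1359626286623L; // next scheduled fetch time stored in the crawldb
        long curTime   = 1355738313780L; // "now" as seen by the second Generator run

        System.out.println("curTime   = " + new Date(curTime));   // 2012-12-17, matches the Generator start in the log
        System.out.println("fetchTime = " + new Date(fetchTime)); // 2013-01-31, about 45 days later
        System.out.println("days until due: " + (fetchTime - curTime) / 86400000L); // prints 44, i.e. just under 45 days
    }
}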

On 11/25/2012 09:02 PM, Sebastian Nagel wrote:
But I create a completely new crawl dir for every crawl.
Then all should work as expected.

Why does the crawler set a "page to fetch" to rejected? Because obviously the crawler
never saw this page before (I deleted all the old crawl dirs). In the crawl log I see
many pages to fetch, but at the end all of them are rejected.
Are you sure they aren't fetched at all? This debug log output in the Generator mapper
is also shown for URLs fetched in previous cycles. You should check the complete
log for the "rejected" URLs.
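
That is, the "rejected" line is not an error by itself: the Generator skips every crawldb entry
whose scheduled fetch time still lies in the future, which is exactly the state a URL is in after
it was fetched and updatedb wrote its next fetch time. A simplified sketch of that decision; the
names and the maxInterval value are illustrative, not the actual Nutch FetchSchedule API:

public class FetchDueCheck {

    static boolean isDueForFetch(long fetchTime, long curTime, long maxIntervalMs) {
        if (fetchTime - curTime > maxIntervalMs) {
            // scheduled unreasonably far ahead: force it to be due again
            return true;
        }
        // otherwise only generate the URL once its scheduled fetch time has passed
        return fetchTime <= curTime;
    }

    public static void main(String[] args) {
        long fetchTime   = 1359626286623L;         // from the rejected log line
        long curTime     = 1355738313780L;
        long maxInterval = 90L * 24 * 3600 * 1000; // think of db.fetch.interval.max (90 days)
        System.out.println(isDueForFetch(fetchTime, curTime, maxInterval)); // false -> "-shouldFetch rejected"
    }
}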


On 11/24/2012 04:46 PM, Jan Philippe Wimmer wrote:
Hey Sebastian! Thanks for your answer.

But I create a completely new crawl dir for every crawl. In other words, I only have the
crawl data of the current, running crawl process. When I recrawl a URL set, I delete the
old crawl dir and create a new one. At the end of every crawl I index it to Solr, so I keep
all crawled content in the index. I don't need any Nutch crawl dirs, because I want to crawl
all relevant pages in every crawl process, again and again.

I totally don't understand why the crawler sets a "page to fetch" to rejected, because
obviously the crawler never saw this page before (I deleted all the old crawl dirs). In the
crawl log I see many pages to fetch, but at the end all of them are rejected. Any ideas?

On 11/24/2012 04:36 PM, Sebastian Nagel wrote:
I want my crawler to crawl the complete page without setting up schedulers at all. Every crawl
process should crawl every page again without having set up wait intervals.
That's quite easy: remove all data and launch the crawl again.
- Nutch 1.x : remove crawldb, segments, and linkdb (sketched below)
- 2.x : drop 'webpage' (or similar, depending on the chosen data store)
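
For Nutch 1.x that simply means wiping those three sub-directories between runs. A small helper
as a sketch; the layout and base path are the ones from the log above, adjust to your setup:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

public class ResetCrawlDirs {
    public static void main(String[] args) throws IOException {
        // Base path taken from the log above; adjust to your environment.
        Path crawlDir = Paths.get("/opt/project/current/crawl_project/nutch/crawl/1300");
        for (String sub : new String[] {"crawldb", "segments", "linkdb"}) {
            Path dir = crawlDir.resolve(sub);
            if (!Files.exists(dir)) {
                continue;
            }
            // Delete files before their parent directories.
            try (Stream<Path> walk = Files.walk(dir)) {
                walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                    try {
                        Files.delete(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            }
        }
    }
}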

On 11/24/2012 12:17 PM, Jan Philippe Wimmer wrote:
Hi there,

How can I avoid the following error?
-shouldFetch rejected 'http://www.page.com/shop', fetchTime=1356347311285, curTime=1353755337755

I want my crawler to crawl the complete page without setting up schedulers at all. Every crawl
process should crawl every page again without having set up wait intervals.

Any solutions?
