Hello everybody, I am trying to crawl a few websites from my seed.txt with the new bin/crawl script in Nutch 2.1. The problem is that every time I run the script, it does not fetch or parse anything (no urls); every URL is skipped with the message "Skipping <url>; different batch id (<some batch id>)".
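If it helps to narrow it down, I suppose the batch mark that a skipped row actually carries can be dumped from the webpage table (just a rough sketch, assuming readdb's -url option in 2.1 works the way I think it does):

  # dump the stored row for one of the skipped seed urls
  bin/nutch readdb -url http://www.galloromeinsmuseum.be/
  # among the markers there should be generate/fetch marks (_gnmrk_ / _ftcmrk_,
  # if I remember the marker names correctly) holding the batch id that the
  # fetcher and parser compare against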
Here is some output from the log:

  Start old crawling linked TV:
  InjectorJob: starting
  InjectorJob: urlDir: /opt/ir/nutch/urls
  InjectorJob: finished

It looks like the injection of urls was OK...

  Sun Jun 30 22:18:07 CEST 2013 : Iteration 1 of 5
  Generating batchId
  Generating a new fetchlist
  GeneratorJob: Selecting best-scoring urls due for fetch.
  GeneratorJob: starting
  GeneratorJob: filtering: false
  GeneratorJob: topN: 50000
  GeneratorJob: done
  GeneratorJob: generated batch id: 1372623488-1201848586
  InjectorJob: starting
  InjectorJob: urlDir: /opt/ir/nutch/urls
  InjectorJob: finished
  Fetching :
  FetcherJob: starting
  FetcherJob: batchId: 1372623487-26323
  Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
  FetcherJob: threads: 50
  FetcherJob: parsing: false
  FetcherJob: resuming: false
  FetcherJob : timelimit set for : 1372624103280
  Using queue mode : byHost
  Fetcher: threads: 50
  QueueFeeder finished: total 0 records. Hit by time limit :0
  -finishing thread FetcherThread0, activeThreads=0
  -finishing thread FetcherThread1, activeThreads=0
  -finishing thread FetcherThread2, activeThreads=0
  -finishing thread FetcherThread3, activeThreads=0
  -finishing thread FetcherThread4, activeThreads=0

... and it continues like this up to FetcherThread48 ...

  Fetcher: throughput threshold: -1
  -finishing thread FetcherThread49, activeThreads=0
  -finishing thread FetcherThread36, activeThreads=0
  Fetcher: throughput threshold sequence: 5
  0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
  -activeThreads=0
  FetcherJob: done
  Parsing :
  ParserJob: starting
  ParserJob: resuming: false
  ParserJob: forced reparse: false
  ParserJob: batchId: 1372623487-26323
  Skipping http://www.brugge.be/internet/en/musea/bruggemuseum/stadhuis/index.htm; different batch id (1372590913-1016555835)
  Skipping http://www.galloromeinsmuseum.be/; different batch id (1372590913-1016555835)
  Skipping http://www.museumdrguislain.be/; different batch id (1372590913-1016555835)
  Skipping http://www.muzee.be/; different batch id (1372590913-1016555835)
  Skipping http://musea.sint-niklaas.be/; different batch id (1372590913-1016555835)
  Skipping http://www.the-athenaeum.org/; different batch id (1372590913-1016555835)
  Skipping http://the-athenaeum.org/; different batch id (1372590913-1016555835)
  Skipping http://viaf.org/; different batch id (1372590913-1016555835)

... and it keeps skipping more urls from my seed. Yes, from the seed, because my seed.txt contains exactly these:

  http://www.brugge.be/internet/en/musea/bruggemuseum/stadhuis/index.htm
  http://www.galloromeinsmuseum.be/
  http://www.museumdrguislain.be/
  etc.

The run then finishes with:

  ParserJob: success
  CrawlDB update
  DbUpdaterJob: starting
  Limit reached, skipping further inlinks for de.ard.www:http/
  Limit reached, skipping further inlinks for de.rbb-online.mediathek:http/
  Limit reached, skipping further inlinks for de.rbb-online.www:http/
  DbUpdaterJob: done

Do you know where the problem is, please? I have read here http://wiki.apache.org/nutch/ErrorMessagesInNutch2#Nutch_logging_shows_Skipping_http:.2F.2FmyurlForParsing.com.3B_different_batch_id_.28null.29 about a second inject: "Null values are possible, too, think about these steps: inject -> generate -> inject -> fetch. The second inject will leave entries in the db without fetchmarks seen by the fetcher later." But that seems to apply to "different batch id (null)", and that is not my case...
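What also strikes me in the log above is that the generator reports "generated batch id: 1372623488-1201848586" while the fetcher runs with "batchId: 1372623487-26323", so the fetcher and parser seem to look for a batch that was never generated. To check whether the problem is only in how bin/crawl passes the id around, I thought about running the phases by hand with one explicit batch id (again just a sketch, assuming the bin/nutch subcommands accept a batch id the way the crawl script seems to pass one):

  # hypothetical manual cycle with one consistent batch id
  BATCH_ID=`date +%s`-$RANDOM
  bin/nutch inject /opt/ir/nutch/urls
  bin/nutch generate -topN 50000 -batchId $BATCH_ID
  bin/nutch fetch $BATCH_ID -threads 50
  bin/nutch parse $BATCH_ID
  bin/nutch updatedb

  # or, to take the batch id out of the equation entirely,
  # fetch and parse everything that is due:
  bin/nutch fetch -all -threads 50
  bin/nutch parse -all

If the seed URLs are fetched and parsed that way, I would guess the crawl script is handing the fetcher/parser a different id than the one the generator stamped on the rows.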
JB

