Hi Chris,

Please see the documentation I put up on the wiki for this phenomenon:
http://wiki.apache.org/nutch/ErrorMessagesInNutch2#Nutch_logging_shows_Skipping_http:.2F.2FmyurlForParsing.com.3B_different_batch_id_.28null.29

Also, please search the mailing list for a recent discussion on the topic.

Finally, I logged an issue in Jira to improve logging for this scenario. I don't agree with logging Marks as opposed to identifying the batchIds to which those Marks should belong. If we know the batchId(s), then we can at least attempt to generate Marks for the specific WebPage. (I use the term "generate" loosely here, not to refer specifically to the GeneratorJob, although that is one of the tools that generates Marks.) There is still a bit of work to be done here, as this has come up several times and is not quite fixed.

It just occurred to me that, as of now, ALL of this logging has been silenced to DEBUG level... I am not sure that this is useful enough for obtaining metrics on how many URLs are skipped because various Marks are absent.

https://issues.apache.org/jira/browse/NUTCH-1567

On Monday, May 20, 2013, Christopher Gross <[email protected]> wrote:
> I'm attempting to get a crawl working using scripts, but I've been getting
> a "Skipping <url>; different batch id (null)" error and then nothing new in
> Solr. So I've reverted back to trying out the "crawl" for the nutch script:
>
> ./nutch crawl ../urls/ -solr "http://localhost/nutchsolr" -threads 5 -depth 3 -topN 100
>
> urls has the "seed.txt" file with some sites. It definitely is able to get
> pages (finding other hostnames in the lists scrolling through the screen),
> but then it is still skipping with the "batch id (null)" message for
> everything it finds.
>
> Any guidance/advice would be appreciated.
>
> Thanks!
>
> -- Chris
>

--
*Lewis*

