Re: nutch crawl

Lewis John Mcgibbney Mon, 20 May 2013 09:37:39 -0700

Hi Chris,

Please see the documentation I put up on the wiki for this phenomenon:

http://wiki.apache.org/nutch/ErrorMessagesInNutch2#Nutch_logging_shows_Skipping_http:.2F.2FmyurlForParsing.com.3B_different_batch_id_.28null.29

Also, please search the mailing list for a recent discussion on the topic.

Finally, I logged an issue in Jira to improve logging for this scenario. I
don't agree with the logging of Marks as opposed to the identification of
the batchIds to which those Marks should belong.
If we know the batchId(s) then we can at least attempt to generate Marks for
the specific WebPage. (I use the term generate not to refer specifically to
the GeneratorJob, although that is one of the tools that generates Marks.)

Right now there is a bit of work to be done here as this has come up
several times and is still not quite fixed.

It just occurred to me that as of now ALL of this logging has been silenced
to DEBUG level... I am not sure that this is useful enough for obtaining
metrics on how many URLs are skipped due to various Marks being absent.
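
In the meantime, here is a rough sketch (not part of Nutch) of how one might
pull such a count out of a log file. It assumes the message text matches what
Chris quoted ("Skipping <url>; different batch id (null)") and that DEBUG
logging is enabled so the lines are actually written. The function name
count_skips and the default path logs/hadoop.log are hypothetical.

```python
import re
import sys
from collections import Counter

# Pattern taken from the message quoted in this thread:
#   Skipping <url>; different batch id (null)
# Assumption: the URL contains no whitespace.
SKIP_RE = re.compile(r"Skipping (\S+); different batch id \(([^)]*)\)")


def count_skips(lines):
    """Return (Counter keyed by reported batch id, list of skipped URLs)."""
    by_batch = Counter()
    urls = []
    for line in lines:
        match = SKIP_RE.search(line)
        if match:
            urls.append(match.group(1))
            by_batch[match.group(2)] += 1
    return by_batch, urls


if __name__ == "__main__":
    # Hypothetical default path; pass the real log file as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "logs/hadoop.log"
    with open(path, encoding="utf-8", errors="replace") as handle:
        by_batch, urls = count_skips(handle)
    print("skipped URLs:", len(urls))
    for batch_id, count in by_batch.most_common():
        print(batch_id, count)
```

This only counts what is logged, so it says nothing about which Mark was
missing; that is the part the Jira issue is meant to address.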

https://issues.apache.org/jira/browse/NUTCH-1567

On Monday, May 20, 2013, Christopher Gross <[email protected]> wrote:
> I'm attempting to get a crawl working using scripts, but I've been getting
> a "Skipping <url>; different batch id (null)" error and then nothing new in
> Solr.  So I've reverted back to trying out the "crawl" for the nutch script:
>
> ./nutch crawl ../urls/ -solr "http://localhost/nutchsolr" -threads 5 -depth 3 -topN 100
>
> urls has the "seed.txt" file with some sites.  It definitely is able to get
> pages (finding other hostnames in the lists scrolling through the screen),
> but then it is still skipping with the "batch id (null)" message for
> everything it finds.
>
> Any guidance/advice would be appreciated.
>
> Thanks!
>
> -- Chris
>

-- 
*Lewis*
