Hi Manish,

On Sat, Jan 9, 2016 at 1:05 AM, <[email protected]> wrote:

> Hi, I saw below article  for debugging nutch in eclipse but looks like it
> just debug parse phase and skip Fetching phase
> $NUTCH_HOME/bin/nutch parsechecker http://myurl.com/
>

This is not strictly true. In the ParserChecker tool the fetching is done
directly by the protocol implementation, as opposed to the Fetcher itself.
This can be seen in the following code
https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/parse/ParserChecker.java#L133-L136
You can set a breakpoint somewhere in those lines and step into the
protocol fetching. This does not, however, let you debug the actual
Fetcher in Nutch.


> How to debug fetch phase ?
>
>
You also have the option of running Generate outside of the debugger and
then passing the generated crawldb and segment(s) as arguments to the
Fetcher debug session. It is important to note a couple of things here
though.
a) The Fetcher session will time out by default. That is, Nutch will
automatically end the Fetcher debug session if threads are spin-waiting
with no activity. You would need to disable this timeout in your
nutch-site.xml.
b) You can only fetch a segment once (unless you use the -addDays
parameter), meaning that you may find yourself generating many
segments/crawldbs. This can be a bit annoying if you need to repeat it
many times.
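
To make point b) less painful, something like the following might help. It
is only a rough sketch, assuming a Nutch 1.x layout: the function name
new_segment and the NUTCH override variable are my own inventions for
illustration, and -topN 10 is an arbitrary small value to keep the debug
run short. It generates one fresh segment outside the debugger and prints
its path.

```shell
# Hypothetical helper (not part of Nutch): generate one fresh segment
# outside the debugger and print the path of the newest segment.
# Usage: new_segment <crawldb> <segments_dir>
new_segment() {
  # NUTCH is an assumed override; by default use the stock launcher script.
  "${NUTCH:-$NUTCH_HOME/bin/nutch}" generate "$1" "$2" -topN 10 >&2 || return 1
  # Segment directories are named by timestamp, so the last one in
  # sorted order is the one that was just generated.
  ls -d "$2"/* | sort | tail -n 1
}
```

For example, new_segment crawl/crawldb crawl/segments (hypothetical paths)
should print the new segment directory. You can then paste that path into
the program arguments of an Eclipse debug configuration whose main class is
org.apache.nutch.fetcher.Fetcher, and rerun the helper whenever you need a
fresh segment.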

hth
