Hi Manish,

On Sat, Jan 9, 2016 at 1:05 AM, <[email protected]> wrote:
> Hi, I saw below article for debugging nutch in eclipse but looks like it
> just debug parse phase and skip Fetching phase
>
> $NUTCH_HOME/bin/nutch parsechecker http://myurl.com/

This is not strictly true. In the ParserChecker tool, fetching is performed directly by the protocol implementation, as opposed to the Fetcher itself. This can be seen in the following code:

https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/parse/ParserChecker.java#L133-L136

You can set a breakpoint somewhere in those lines and step into the protocol fetching. This does not, however, let you debug the actual Fetcher in Nutch.

> How to debug fetch phase ?

You also have the option of running Generate outside of the debugger and then passing the generated crawldb and segment(s) as arguments to the Fetcher debug session. It is important to note a couple of things here, though.

a) The Fetcher session will time out by default. That is to say, Nutch will automatically end the Fetcher debug session if threads are spin-waiting with no activity. You would need to disable this timeout in your nutch-site.xml.

b) You can only fetch a segment once (unless you use the -addDays parameter), meaning that you may find yourself generating many segments/crawldb. This can be a bit annoying if you need to repeat it many times.

hth
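P.S. Since you may end up generating segments repeatedly, a small helper to pick the newest one can save some typing. This is only a rough sketch: the function name latest_segment and the crawl/crawldb and crawl/segments paths are my own placeholders, and it relies on Generate naming each segment directory with a timestamp so that a plain sort puts the newest last.

```shell
# Hypothetical helper: print the newest segment directory under "$1".
# Assumes segment directories have timestamp names (e.g. 20160109010500),
# so lexical order matches chronological order.
latest_segment() {
  for d in "$1"/*/; do
    # Skip the unexpanded glob when the directory is empty or missing.
    [ -d "$d" ] && printf '%s\n' "${d%/}"
  done | sort | tail -n 1
}

# Intended use (paths are placeholders, adjust to your crawl layout):
#   1. Outside the debugger:
#        $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments
#   2. Find the segment that Generate just created:
#        latest_segment crawl/segments
#   3. Use the printed path as the program argument of your
#      org.apache.nutch.fetcher.Fetcher debug configuration.
```

Each time the Fetcher has consumed a segment, repeat steps 1 to 3 to get a fresh one.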

