Hello all,
I want to write my own Nutch plugin that implements HtmlParseFilter,
but I have run into some issues.
Prereqs:
- I have a Hadoop cluster with 5 nodes. Node #1 is the NameNode; nodes
#3-#5 are the DataNodes.
- I compile Nutch via ant to get the nutchXXX.job (my plugin compiles
OK, and all changes in nutch-site.xml and plugins.xml are made).
- I run the nutch.job on the NameNode (#1): hadoop jar nutch.job -params
*First issue:*
I cannot see the logs. My plugin just logs its arguments:
    public ParseResult filter(Content content, ParseResult parseResult,
        HTMLMetaTags metaTags, DocumentFragment doc) {
      LOG.info("CleanParseFilterImpl: ");
      LOG.info("content : " + content);
      LOG.info("parseResult : " + parseResult);
      LOG.info("metaTags : " + metaTags);
      LOG.info("doc : " + doc);
      return parseResult;
    }
I've edited the hadoop executable to comment out the line that sets the root logger:
#HADOOP_OPTS="$HADOOP_OPTS -Dhadoop.root.logger=${HADOOP_ROOT_LOGGER:-INFO,console}"
and added these loggers to Hadoop's log4j.properties:
#special logging requirements for some commandline tools
log4j.logger.org.apache.nutch.crawl.Crawl=ALL,console
log4j.logger.org.apache.nutch.crawl.Injector=ALL,console
log4j.logger.org.apache.nutch.crawl.Generator=ALL,console
log4j.logger.org.apache.nutch.fetcher.Fetcher=ALL,console
log4j.logger.org.apache.nutch.parse.ParseSegment=ALL,console
log4j.logger.org.apache.nutch.crawl.CrawlDbReader=ALL,console
log4j.logger.org.apache.nutch.crawl.CrawlDbMerger=ALL,console
log4j.logger.org.apache.nutch.crawl.LinkDbReader=ALL,console
log4j.logger.org.apache.nutch.segment.SegmentReader=ALL,console
log4j.logger.org.apache.nutch.segment.SegmentMerger=ALL,console
log4j.logger.org.apache.nutch.crawl.CrawlDb=ALL,console
log4j.logger.org.apache.nutch.crawl.LinkDb=ALL,console
log4j.logger.org.apache.nutch.crawl.LinkDbMerger=ALL,console
log4j.logger.org.apache.nutch.indexer.IndexingJob=ALL,console
log4j.logger.org.apache.nutch.indexer.solr.SolrIndexer=ALL,console
log4j.logger.org.apache.nutch.indexer.solr.SolrWriter=ALL,console
log4j.logger.org.apache.nutch.indexer.solr.SolrDeleteDuplicates=ALL,console
log4j.logger.org.apache.nutch.indexer.solr.SolrClean=ALL,console
log4j.logger.org.apache.nutch.scoring.webgraph.WebGraph=ALL,console
log4j.logger.org.apache.nutch.scoring.webgraph.LinkRank=ALL,console
log4j.logger.org.apache.nutch.scoring.webgraph.Loops=ALL,console
log4j.logger.org.apache.nutch.scoring.webgraph.ScoreUpdater=ALL,console
log4j.logger.org.apache.nutch.parse.ParserChecker=ALL,console
log4j.logger.org.apache.nutch.indexer.IndexingFiltersChecker=ALL,console
log4j.logger.org.apache.nutch.tools.FreeGenerator=ALL,console
log4j.logger.org.apache.nutch.util.domain.DomainStatistics=ALL,console
log4j.logger.org.apache.nutch.tools.CrawlDBScanner=ALL,console
log4j.logger.org.apache.nutch.parse.clean.CleanParseFilterImpl=ALL,console
log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN
log4j.logger.org.apache.nutch.parse.ParseUtil=ALL,console
log4j.logger.org.apache.nutch=ALL,console
From the debugger I can see that
LoggerFactory.getLogger("org.apache.nutch.parse.ParseUtil").isTraceEnabled()
and
LoggerFactory.getLogger("org.apache.nutch.parse.clean.CleanParseFilterImpl").isTraceEnabled()
both return "true", but I still don't see any logs from my plugin or from
ParseUtil.
Where are the logs? I expected to see them in the console output while
running "hadoop jar nutch.job". Maybe that code is executing on a
DataNode?
*Second issue:*
I cannot debug the code where I want to. For example, I can debug Crawl.java
at:
    fetcher.fetch(segs[0], threads); // fetch it
    if (!Fetcher.isParsing(job)) {
      parseSegment.parse(segs[0]); // parse it, if needed
    }
There I see that "!Fetcher.isParsing(job)" is "true", and I can step into
parseSegment.parse().
But I cannot debug the map() method of ParseSegment (where the
ParseUtil.parse() logic executes). Why can't I debug that? Maybe that code
is executing on a DataNode?
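To test the DataNode guess, one thing I could try is reporting which host and JVM a piece of code runs in. Below is a minimal sketch using only JDK classes. The class name WhereAmI and the method describe() are hypothetical helpers of mine, not part of Nutch or Hadoop, and the exact format of the JVM name depends on the JVM implementation:

```java
import java.lang.management.ManagementFactory;
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper (not a Nutch or Hadoop class): describes the
// host and JVM in which the calling code is currently running.
public class WhereAmI {

    /** Returns a string of the form "host=<hostname> jvm=<jvm name>". */
    public static String describe() {
        String host;
        try {
            host = InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            // Fall back to a placeholder if the local host cannot be resolved.
            host = "unknown";
        }
        // The JVM name is implementation-dependent; on common JVMs it
        // looks like "<pid>@<hostname>", but that is not guaranteed.
        String jvm = ManagementFactory.getRuntimeMXBean().getName();
        return "host=" + host + " jvm=" + jvm;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Calling LOG.info(WhereAmI.describe()) from both Crawl.java and filter() would show whether the two run in the same JVM. If the values differ, the filter would be running in a separate task JVM, which might explain why neither the console output nor my debugger reaches it.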
Please help me understand how logging and debugging work when Nutch is run as
a Hadoop jar.
--
Best regards,
Ivan Kozlov
E-mail: [email protected]
"Imagination is more important than knowledge." Albert Einstein