Nutch 1.7 HTMLParseFilter plugin dev

Ivan Kozlov Mon, 16 Sep 2013 23:30:46 -0700

Hello all,

I want to write my own Nutch plugin which will extend HTMLParseFilter,
but I have run into some issues.


Prereqs:

   - I have a Hadoop cluster with 5 nodes. Node #1 is the Namenode, nodes
   #3-#5 are the Datanodes.
   - I compile Nutch via ant to get the nutchXXX.job (my plugin compiles
   OK, and all changes in nutch-site and plugins.xml are made).
   - I run the nutch.job on the Namenode (#1): hadoop jar nutch.job -params.

*First issue:*

I cannot see the logs. My plugin just logs its arguments:

public ParseResult filter(Content content, ParseResult parseResult,
        HTMLMetaTags metaTags, DocumentFragment doc) {
    LOG.info("CleanParseFilterImpl: ");
    LOG.info("content : " + content);
    LOG.info("parseResult : " + parseResult);
    LOG.info("metaTags : " + metaTags);
    LOG.info("doc : " + doc);
    return parseResult;
}

I've changed the hadoop executable to disable the root logger:

#HADOOP_OPTS="$HADOOP_OPTS -Dhadoop.root.logger=${HADOOP_ROOT_LOGGER:-INFO,console}"

and added the loggers to Hadoop's log4j.properties:

#special logging requirements for some commandline tools
log4j.logger.org.apache.nutch.crawl.Crawl=ALL,console
log4j.logger.org.apache.nutch.crawl.Injector=ALL,console
log4j.logger.org.apache.nutch.crawl.Generator=ALL,console
log4j.logger.org.apache.nutch.fetcher.Fetcher=ALL,console
log4j.logger.org.apache.nutch.parse.ParseSegment=ALL,console
log4j.logger.org.apache.nutch.crawl.CrawlDbReader=ALL,console
log4j.logger.org.apache.nutch.crawl.CrawlDbMerger=ALL,console
log4j.logger.org.apache.nutch.crawl.LinkDbReader=ALL,console
log4j.logger.org.apache.nutch.segment.SegmentReader=ALL,console
log4j.logger.org.apache.nutch.segment.SegmentMerger=ALL,console
log4j.logger.org.apache.nutch.crawl.CrawlDb=ALL,console
log4j.logger.org.apache.nutch.crawl.LinkDb=ALL,console
log4j.logger.org.apache.nutch.crawl.LinkDbMerger=ALL,console
log4j.logger.org.apache.nutch.indexer.IndexingJob=ALL,console
log4j.logger.org.apache.nutch.indexer.solr.SolrIndexer=ALL,console
log4j.logger.org.apache.nutch.indexer.solr.SolrWriter=ALL,console
log4j.logger.org.apache.nutch.indexer.solr.SolrDeleteDuplicates=ALL,console
log4j.logger.org.apache.nutch.indexer.solr.SolrClean=ALL,console
log4j.logger.org.apache.nutch.scoring.webgraph.WebGraph=ALL,console
log4j.logger.org.apache.nutch.scoring.webgraph.LinkRank=ALL,console
log4j.logger.org.apache.nutch.scoring.webgraph.Loops=ALL,console
log4j.logger.org.apache.nutch.scoring.webgraph.ScoreUpdater=ALL,console
log4j.logger.org.apache.nutch.parse.ParserChecker=ALL,console
log4j.logger.org.apache.nutch.indexer.IndexingFiltersChecker=ALL,console
log4j.logger.org.apache.nutch.tools.FreeGenerator=ALL,console
log4j.logger.org.apache.nutch.util.domain.DomainStatistics=ALL,console
log4j.logger.org.apache.nutch.tools.CrawlDBScanner=ALL,console
log4j.logger.org.apache.nutch.parse.clean.CleanParseFilterImpl=ALL,console
log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN
log4j.logger.org.apache.nutch.parse.ParseUtil=ALL,console
log4j.logger.org.apache.nutch=ALL,console

From the debugger I can see that

LoggerFactory.getLogger("org.apache.nutch.parse.ParseUtil").isTraceEnabled()
and
LoggerFactory.getLogger("org.apache.nutch.parse.clean.CleanParseFilterImpl").isTraceEnabled()

are both "true", but I still don't see any logs from my plugin or from
ParseUtil...

Where are the logs? I expected to see them in the console output while
running "hadoop jar nutch.job". Maybe that code is executing on a
DataNode?
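
One way to check that guess would be to put the host name into each log
line. This is only a minimal sketch using plain JDK calls; HostProbe and
hostTag() are hypothetical names for illustration, not part of Nutch or
Hadoop:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class HostProbe {

    /**
     * Returns a short "host=<name>" tag to append to log messages.
     * Falls back to "host=unknown" if the local host name cannot be resolved.
     */
    public static String hostTag() {
        try {
            return "host=" + InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return "host=unknown";
        }
    }

    public static void main(String[] args) {
        // Stand-alone check: prints the tag for the machine this runs on.
        System.out.println("CleanParseFilterImpl running on " + hostTag());
    }
}
```

Calling something like LOG.info("filter called, " + HostProbe.hostTag())
from filter() would then show which node wrote the line, wherever that
log output ends up.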

*Second issue:*

I cannot debug the code in the places where I want. E.g. I can debug
Crawl.java at

fetcher.fetch(segs[0], threads);  // fetch it
if (!Fetcher.isParsing(job)) {
  parseSegment.parse(segs[0]);    // parse it, if needed
}

And I see that "!Fetcher.isParsing(job)" is "true", so I step into
parseSegment.parse().

But I cannot debug the map() method of ParseSegment (where the
ParseUtil.parse() logic executes). Why can't I debug that? Maybe that code
is executing on a DataNode?

Please help me understand how logging works when running Nutch as a
hadoop jar, and how to debug it.

-- 
Best regards,
Ivan Kozlov
E-mail: [email protected]

  "Imagination is more important than knowledge." Albert Einstein
