Hello,

I have been trying to write a custom parser, but I am running into what looks,
from hadoop.log, like a configuration issue. Any insight into what might be
wrong below would be appreciated:

plugin.xml:

<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="food"
   name="Parser."
   version="1.0.0"
   provider-name="amrut">

   <runtime>
      <library name="food.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="com.amrut.parser.TDRParser"
              name="TDR"
              point="org.apache.nutch.parse.Parser">
      <implementation id="TDRParser"
         class="com.amrut.parser.TDRParser">
        <parameter name="contentType" value="application/xhtml+xml"/>
        <parameter name="contentType" value="text/html"/>
      </implementation>
   </extension>
</plugin>

build.xml:
<?xml version="1.0"?>
<project name="food" default="jar-core">
  <import file="../build-plugin.xml"/>
</project>

nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>



<configuration>
<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>

<property>
  <name>plugin.includes</name>
  <value>*food*|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

</configuration>
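
One way to sanity-check a plugin.includes value outside Nutch is a small standalone program like the sketch below. It assumes the value is compiled as a Java regular expression and that each plugin id must match it in full; the class name, method name, and sample values are hypothetical and are not taken from Nutch itself.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class PluginIncludesCheck {

    // Returns true if the whole plugin id matches the expression.
    // Assumption: full-match semantics (Matcher.matches), not a substring search.
    static boolean isIncluded(String includes, String pluginId) {
        return Pattern.compile(includes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample value; paste the real plugin.includes value here.
        String includes = "food|protocol-http|urlfilter-regex|parse-(html|tika)";
        String[] pluginIds = {"food", "parse-html", "index-basic"};
        try {
            for (String id : pluginIds) {
                System.out.println(id + " -> " + isIncluded(includes, id));
            }
        } catch (PatternSyntaxException e) {
            // Reached when the expression is not a valid Java regular expression.
            System.out.println("Invalid expression: " + e.getDescription());
        }
    }
}
```

Note that java.util.regex rejects an expression that starts with a bare `*`, because there is nothing for the quantifier to repeat; a wildcard match on either side of a name is written as `.*name.*`.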

I have mapped the content type to my custom parser in parse-plugins.xml:
<mimeType name="application/xhtml+xml">
   <plugin id="food" />
</mimeType>

From the hadoop.log, I can see my parser registered:
2011-07-18 00:01:05,556 INFO  plugin.PluginRepository - Plugins: looking in: /Users/Amrut/apachenutch/runtime/local/plugins
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository - Registered Plugins:
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
*2011-07-18 00:01:05,809 INFO  plugin.PluginRepository -         Food Parser. (food)*
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository - Registered Extension-Points:
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2011-07-18 00:01:05,810 INFO  plugin.PluginRepository -         Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2011-07-18 00:01:05,810 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2011-07-18 00:01:05,810 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2011-07-18 00:01:05,810 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2011-07-18 00:01:05,810 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2011-07-18 00:01:05,810 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)


Am I missing something? I get this in the hadoop.log:
2011-07-18 00:01:28,551 WARN  parse.ParserFactory - ParserFactory: Plugin: food mapped to contentType application/xhtml+xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml

Thanks for the help.
Amrut Budihal.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Configuration-issue-Custom-parser-not-being-recognised-tp3179811p3179811.html
Sent from the Nutch - User mailing list archive at Nabble.com.
