Hello,
I have been trying to write a custom parser, but I am running into what looks,
from hadoop.log, like a configuration issue. Any insight into what might be
wrong in the files below would be appreciated.
plugin.xml:
<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="food"
   name="Parser."
   version="1.0.0"
   provider-name="amrut">
   <runtime>
      <library name="food.jar">
         <export name="*"/>
      </library>
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
   <extension id="com.amrut.parser.TDRParser"
              name="TDR"
              point="org.apache.nutch.parse.Parser">
      <implementation id="TDRParser"
                      class="com.amrut.parser.TDRParser">
         <parameter name="contentType" value="application/xhtml+xml"/>
         <parameter name="contentType" value="text/html"/>
      </implementation>
   </extension>
</plugin>
build.xml:
<?xml version="1.0"?>
<project name="food" default="jar-core">
   <import file="../build-plugin.xml"/>
</project>
nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
   <property>
      <name>http.agent.name</name>
      <value>My Nutch Spider</value>
   </property>
   <property>
      <name>plugin.includes</name>
      <value>*food*|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
      <description>Regular expression naming plugin directory names to
      include. Any plugin not matching this expression is excluded.
      In any case you need at least include the nutch-extensionpoints plugin. By
      default Nutch includes crawling just HTML and plain text via HTTP,
      and basic indexing and search plugins. In order to use HTTPS please enable
      protocol-httpclient, but be aware of possible intermittent problems with the
      underlying commons-httpclient library.
      </description>
   </property>
</configuration>
I have also mapped the content type to my custom parser in parse-plugins.xml:
<mimeType name="application/xhtml+xml">
   <plugin id="food" />
</mimeType>
From hadoop.log, I can see that my parser is registered:
2011-07-18 00:01:05,556 INFO plugin.PluginRepository - Plugins: looking in: /Users/Amrut/apachenutch/runtime/local/plugins
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - Registered Plugins:
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
*2011-07-18 00:01:05,809 INFO plugin.PluginRepository - Food Parser. (food)*
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - Registered Extension-Points:
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2011-07-18 00:01:05,810 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2011-07-18 00:01:05,810 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2011-07-18 00:01:05,810 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2011-07-18 00:01:05,810 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2011-07-18 00:01:05,810 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2011-07-18 00:01:05,810 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
Am I missing something? I get this in the hadoop.log:
2011-07-18 00:01:28,551 WARN parse.ParserFactory - ParserFactory: Plugin: food mapped to contentType application/xhtml+xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
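One thing that can be checked outside of Nutch is whether the plugin.includes
value above is usable as a regular expression at all. The following is only a
minimal sketch: it assumes the value is compiled with java.util.regex.Pattern
and has to match the whole plugin id (food), which I have not confirmed against
the Nutch source. The class and method names are made up for this illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper, not part of Nutch.
class PluginIncludesCheck {

    /** Returns true if the value compiles as a java.util.regex pattern. */
    static boolean isValidRegex(String value) {
        try {
            Pattern.compile(value);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    /**
     * Returns true if the value compiles and matches the whole plugin id.
     * ASSUMPTION: a full match against the plugin id is what decides inclusion.
     */
    static boolean includes(String value, String pluginId) {
        return isValidRegex(value)
                && Pattern.compile(value).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // The value exactly as it appears in nutch-site.xml above.
        String asPosted = "*food*|protocol-http|urlfilter-regex|parse-(html|tika)"
                + "|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)";
        // The same value with the plugin id written literally, for comparison.
        String literalId = asPosted.replace("*food*", "food");

        System.out.println("as posted, valid regex:  " + isValidRegex(asPosted));
        System.out.println("as posted, includes food: " + includes(asPosted, "food"));
        System.out.println("literal id, includes food: " + includes(literalId, "food"));
    }
}
```

In java.util.regex a leading * has nothing to repeat, so Pattern.compile rejects
the value as posted, while the variant with the literal id food compiles and
matches. Whether this is what triggers the warning depends on how Nutch reads
the property, so it is a thing to rule out rather than a confirmed cause.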
Thanks for the help.
Amrut Budihal.