Re: Configuration issue: Custom parser not being recognised.
Found the issue! The extension id defined in plugin.xml didn't match the id that parse-plugins.xml uses for the plugin listed inside the mimeType name="application/xhtml+xml" tag. I.e., the parts marked with asterisks below (bold in the original message) should match.

plugin.xml:

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="food" name="Food Parser." version="1.0.0" provider-name="amrut">
  <runtime>
    <library name="food.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <extension id="com.amrut.parser.TDRParser" name="TDR Parser" point="org.apache.nutch.parse.Parser">
    *<implementation id="com.amrut.parser.TDRParser" class="com.amrut.parser.TDRParser">
      <parameter name="contentType" value="application/xhtml+xml"/>
    </implementation>*
  </extension>
</plugin>

parse-plugins.xml:

<?xml version="1.0" encoding="UTF-8"?>
<parse-plugins>
  <mimeType name="application/xhtml+xml">
    *<plugin id="food" />*
  </mimeType>
  <aliases>
    *<alias name="food" extension-id="com.amrut.parser.TDRParser" />*
    <alias name="parse-tika" extension-id="org.apache.nutch.parse.tika.TikaParser" />
    <alias name="parse-ext" extension-id="ExtParser" />
    <alias name="parse-html" extension-id="org.apache.nutch.parse.html.HtmlParser" />
    <alias name="parse-js" extension-id="JSParser" />
    <alias name="feed" extension-id="org.apache.nutch.parse.feed.FeedParser" />
    <alias name="parse-swf" extension-id="org.apache.nutch.parse.swf.SWFParser" />
    <alias name="parse-zip" extension-id="org.apache.nutch.parse.zip.ZipParser" />
  </aliases>
</parse-plugins>

--
View this message in context: http://lucene.472066.n3.nabble.com/Configuration-issue-Custom-parser-not-being-recognised-tp3179819p3190290.html
Sent from the Nutch - User mailing list archive at Nabble.com.
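For anyone hitting the same warning, the check described in this reply can be sketched as a small script. This is only an illustration, not part of Nutch: the function name unresolved_mappings is made up, the XML is trimmed from the files in this thread, and it assumes (based on the fix above) that the alias extension-id in parse-plugins.xml is compared against the implementation id in plugin.xml.

```python
import xml.etree.ElementTree as ET

# Trimmed plugin.xml from this thread; only the implementation id varies.
PLUGIN_XML_TEMPLATE = """<plugin id="food" name="Food Parser." version="1.0.0" provider-name="amrut">
  <extension id="com.amrut.parser.TDRParser" name="TDR Parser"
             point="org.apache.nutch.parse.Parser">
    <implementation id="{impl_id}" class="com.amrut.parser.TDRParser">
      <parameter name="contentType" value="application/xhtml+xml"/>
    </implementation>
  </extension>
</plugin>"""

# Implementation id as first posted, and as posted in the fix.
ORIGINAL_PLUGIN_XML = PLUGIN_XML_TEMPLATE.format(impl_id="TDRParser")
FIXED_PLUGIN_XML = PLUGIN_XML_TEMPLATE.format(impl_id="com.amrut.parser.TDRParser")

# Trimmed parse-plugins.xml from the fix.
PARSE_PLUGINS_XML = """<parse-plugins>
  <mimeType name="application/xhtml+xml">
    <plugin id="food" />
  </mimeType>
  <aliases>
    <alias name="food" extension-id="com.amrut.parser.TDRParser" />
    <alias name="parse-html" extension-id="org.apache.nutch.parse.html.HtmlParser" />
  </aliases>
</parse-plugins>"""


def unresolved_mappings(plugin_xml, parse_plugins_xml):
    """List (mimeType, plugin id, alias extension-id) entries that do not
    resolve to an implementation id declared in plugin_xml.

    Hypothetical helper; the matching rule is an assumption taken from
    the fix in this thread, not from the Nutch source.
    """
    plugin = ET.fromstring(plugin_xml)
    plugin_id = plugin.get("id")
    impl_ids = {impl.get("id") for impl in plugin.iter("implementation")}

    parse_plugins = ET.fromstring(parse_plugins_xml)
    aliases = {a.get("name"): a.get("extension-id")
               for a in parse_plugins.iter("alias")}

    problems = []
    for mime in parse_plugins.iter("mimeType"):
        for mapped in mime.iter("plugin"):
            if mapped.get("id") != plugin_id:
                continue  # mapping belongs to some other plugin
            ext_id = aliases.get(plugin_id)
            if ext_id not in impl_ids:
                problems.append((mime.get("name"), plugin_id, ext_id))
    return problems


if __name__ == "__main__":
    print("original:", unresolved_mappings(ORIGINAL_PLUGIN_XML, PARSE_PLUGINS_XML))
    print("fixed:   ", unresolved_mappings(FIXED_PLUGIN_XML, PARSE_PLUGINS_XML))
```

With the implementation id from the first post (TDRParser) the mapping is reported as unresolved; with the id from the fix the list comes back empty.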
Configuration issue: Custom parser not being recognised.
Hello,

I have been trying to write a custom parser, but I am running into what looks, from the hadoop.log, like a configuration issue. Any insight into what might be wrong below would be appreciated.

plugin.xml:

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="food" name="Parser." version="1.0.0" provider-name="amrut">
  <runtime>
    <library name="food.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <extension id="com.amrut.parser.TDRParser" name="TDR" point="org.apache.nutch.parse.Parser">
    <implementation id="TDRParser" class="com.amrut.parser.TDRParser">
      <parameter name="contentType" value="application/xhtml+xml"/>
      <parameter name="contentType" value="text/html"/>
    </implementation>
  </extension>
</plugin>

build.xml:

<?xml version="1.0"?>
<project name="food" default="jar-core">
  <import file="../build-plugin.xml"/>
</project>

nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>*food*|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please enable
    protocol-httpclient, but be aware of possible intermittent problems with the
    underlying commons-httpclient library.
    </description>
  </property>
</configuration>

I have added the contentType mapping to point to the custom parser in parse-plugins.xml:

<mimeType name="application/xhtml+xml">
  <plugin id="food" />
</mimeType>

From the hadoop.log, I can see my parser registered:

2011-07-18 00:01:05,556 INFO plugin.PluginRepository - Plugins: looking in: /Users/Amrut/apachenutch/runtime/local/plugins
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - Registered Plugins:
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
*2011-07-18 00:01:05,809 INFO plugin.PluginRepository - Food Parser. (food)*
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - Registered Extension-Points:
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2011-07-18 00:01:05,809 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2011-07-18 00:01:05,810 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2011-07-18 00:01:05,810 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2011-07-18 00:01:05,810 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2011-07-18 00:01:05,810 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2011-07-18 00:01:05,810 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2011-07-18 00:01:05,810 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)

Am I missing something? I get this in the hadoop.log:

2011-07-18 00:01:28,551 WARN parse.ParserFactory - ParserFactory: Plugin: food mapped to contentType application/xhtml+xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml

Thanks for the help.
Amrut Budihal.

--
View this message in context: http://lucene.472066.n3.nabble.com/Configuration-issue-Custom-parser-not-being-recognised-tp3179811p3179811.html
Sent from the Nutch - User mailing list archive at Nabble.com.
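Since the warning points at plugin.includes, it can help to test that expression on its own. The sketch below is a rough check only, under two assumptions that do not come from the thread: the asterisks around "food" in the posted value are taken to be bold markers from the archive rather than regex characters, and the whole plugin id is taken to have to match the expression (approximated here with re.fullmatch). The helper name is_included is made up for the example.

```python
import re

# plugin.includes from the post, with the asterisks around "food" dropped
# (assumed to be bold markers, not part of the configured value).
PLUGIN_INCLUDES = (
    "food|protocol-http|urlfilter-regex|parse-(html|tika)|"
    "index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the entire plugin id matches the includes expression.

    Hypothetical helper; whole-string matching is an assumption about
    how the expression is applied.
    """
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    for plugin_id in ("food", "parse-html", "parse-js"):
        print(plugin_id, is_included(plugin_id))
```

Under these assumptions "food" passes the expression, which would suggest looking beyond plugin.includes for the cause of the warning.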