Re: Configuration issue: Custom parser not being recognised.

2011-07-21 Thread amrutbudi...@gmail.com
Found the issue! The extension id defined in plugin.xml didn't match the id
that parse-plugins.xml uses for the plugin listed under the
mimeType name="application/xhtml+xml" tag.

I.e., the parts highlighted in bold below (marked with asterisks) should match.
plugin.xml:
<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="food"
   name="Food Parser."
   version="1.0.0"
   provider-name="amrut">

   <runtime>
      <library name="food.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="com.amrut.parser.TDRParser"
              name="TDR Parser"
              point="org.apache.nutch.parse.Parser">

*
      <implementation id="com.amrut.parser.TDRParser"
                      class="com.amrut.parser.TDRParser">
         <parameter name="contentType" value="application/xhtml+xml"/>
      </implementation>
*
   </extension>
</plugin>

parse-plugins.xml:

<?xml version="1.0" encoding="UTF-8"?>
<parse-plugins>

   <mimeType name="application/xhtml+xml">
*     <plugin id="food" />*
   </mimeType>

   <aliases>
*     <alias name="food"
             extension-id="com.amrut.parser.TDRParser" />*
      <alias name="parse-tika"
             extension-id="org.apache.nutch.parse.tika.TikaParser" />
      <alias name="parse-ext" extension-id="ExtParser" />
      <alias name="parse-html"
             extension-id="org.apache.nutch.parse.html.HtmlParser" />
      <alias name="parse-js" extension-id="JSParser" />
      <alias name="feed"
             extension-id="org.apache.nutch.parse.feed.FeedParser" />
      <alias name="parse-swf"
             extension-id="org.apache.nutch.parse.swf.SWFParser" />
      <alias name="parse-zip"
             extension-id="org.apache.nutch.parse.zip.ZipParser" />
   </aliases>
</parse-plugins>
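
The matching rule described above can be checked mechanically. The following is
only a rough sketch in Python, not part of Nutch: the function names are made
up for illustration, the XML strings are trimmed copies of the two files, and
it assumes that a plugin id without an alias is used as the extension id
itself and that the id being compared is the one on the implementation element.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the fixed plugin.xml (only the parts that matter here).
PLUGIN_XML = """<plugin id="food" name="Food Parser." version="1.0.0">
  <extension id="com.amrut.parser.TDRParser" name="TDR Parser"
             point="org.apache.nutch.parse.Parser">
    <implementation id="com.amrut.parser.TDRParser"
                    class="com.amrut.parser.TDRParser">
      <parameter name="contentType" value="application/xhtml+xml"/>
    </implementation>
  </extension>
</plugin>"""

# Trimmed copy of the fixed parse-plugins.xml.
PARSE_PLUGINS_XML = """<parse-plugins>
  <mimeType name="application/xhtml+xml">
    <plugin id="food"/>
  </mimeType>
  <aliases>
    <alias name="food" extension-id="com.amrut.parser.TDRParser"/>
  </aliases>
</parse-plugins>"""

# Same mapping, but without the alias for "food" (hypothetical broken variant).
PARSE_PLUGINS_NO_ALIAS = """<parse-plugins>
  <mimeType name="application/xhtml+xml">
    <plugin id="food"/>
  </mimeType>
  <aliases/>
</parse-plugins>"""


def implementation_ids(plugin_xml):
    """Return the set of implementation ids declared in a plugin.xml string."""
    root = ET.fromstring(plugin_xml)
    return {impl.get("id") for impl in root.iter("implementation")}


def unresolved_mappings(parse_plugins_xml, known_ids):
    """Return (mimeType, plugin id, extension id) triples not found in known_ids."""
    root = ET.fromstring(parse_plugins_xml)
    aliases = {a.get("name"): a.get("extension-id") for a in root.iter("alias")}
    problems = []
    for mime in root.iter("mimeType"):
        for plugin in mime.iter("plugin"):
            plugin_id = plugin.get("id")
            # Assumption: with no alias, the plugin id doubles as the extension id.
            ext_id = aliases.get(plugin_id, plugin_id)
            if ext_id not in known_ids:
                problems.append((mime.get("name"), plugin_id, ext_id))
    return problems


ids = implementation_ids(PLUGIN_XML)
print(unresolved_mappings(PARSE_PLUGINS_XML, ids))       # prints []
print(unresolved_mappings(PARSE_PLUGINS_NO_ALIAS, ids))  # one unresolved mapping
```

An empty list means every plugin listed under a mimeType resolves to an
implementation id that plugin.xml actually declares.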

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Configuration-issue-Custom-parser-not-being-recognised-tp3179819p3190290.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Configuration issue: Custom parser not being recognised.

2011-07-18 Thread amrutbudi...@gmail.com
Hello,

I have been trying to write a custom parser, but I am running into what looks,
from hadoop.log, like a configuration issue. Any insight into what might be
wrong in the configuration below?

plugin.xml:

<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="food"
   name="Parser."
   version="1.0.0"
   provider-name="amrut">

   <runtime>
      <library name="food.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="com.amrut.parser.TDRParser"
              name="TDR"
              point="org.apache.nutch.parse.Parser">
      <implementation id="TDRParser"
                      class="com.amrut.parser.TDRParser">
         <parameter name="contentType" value="application/xhtml+xml"/>
         <parameter name="contentType" value="text/html"/>
      </implementation>
   </extension>
</plugin>

build.xml:
<?xml version="1.0"?>
<project name="food" default="jar-core">
  <import file="../build-plugin.xml"/>
</project>

nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>

<property>
  <name>plugin.includes</name>
  <value>*food*|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

</configuration>

I have added a mapping from the content type to my custom parser in
parse-plugins.xml:
<mimeType name="application/xhtml+xml">
   <plugin id="food" />
</mimeType>

From the hadoop.log, I can see my parser registered:
2011-07-18 00:01:05,556 INFO  plugin.PluginRepository - Plugins: looking in: /Users/Amrut/apachenutch/runtime/local/plugins
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository - Registered Plugins:
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository - Basic Indexing Filter (index-basic)
*2011-07-18 00:01:05,809 INFO  plugin.PluginRepository - Food Parser. (food)*
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository - HTTP Framework (lib-http)
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository - Registered Extension-Points:
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2011-07-18 00:01:05,809 INFO  plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2011-07-18 00:01:05,810 INFO  plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2011-07-18 00:01:05,810 INFO  plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2011-07-18 00:01:05,810 INFO  plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2011-07-18 00:01:05,810 INFO  plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2011-07-18 00:01:05,810 INFO  plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2011-07-18 00:01:05,810 INFO  plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)


Am I missing something? I get this in the hadoop.log:
2011-07-18 00:01:28,551 WARN  parse.ParserFactory - ParserFactory: Plugin: food mapped to contentType application/xhtml+xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml

Thanks for the help.
Amrut Budihal.
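
The warning quoted above points at plugin.includes, yet the log lists food
among the registered plugins. A quick way to test the expression itself is
sketched below in Python. It assumes the expression has to match the whole
plugin id (an assumption about the matching semantics, not something stated in
this thread), and it drops the asterisks around food on the assumption that
they are only bold markers from the mail archive.

```python
import re

# plugin.includes value from nutch-site.xml above, without the bold markers.
PLUGIN_INCLUDES = (
    "food|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
    "|scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the whole plugin id matches the plugin.includes expression."""
    # Assumption: the expression must cover the entire id, not just a substring.
    return re.fullmatch(includes, plugin_id) is not None


for plugin_id in ("food", "parse-html", "parse-swf"):
    print(plugin_id, is_included(plugin_id))
# food True
# parse-html True
# parse-swf False
```

Under these assumptions the id food passes the filter, which suggests the
cause of the warning lies somewhere other than plugin.includes.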






