On 23 July 2010 10:08, Torsten Krah <[email protected]> wrote:
> Hi,
>
> I need to extend / overwrite the TikaParser and wrote my own plugin:
>
> plugin.xml does have this:
>
> [...]
>
> <requires>
>   <import plugin="nutch-extensionpoints"/>
>   <import plugin="lib-nekohtml"/>
>   <import plugin="parse-tika"/>
> </requires>
>
> <extension point="org.apache.nutch.parse.Parser"
>            id="my.nutch.plugins.parse"
>            name="TSITikaParser">
>
>   <implementation id="my.nutch.plugins.parse.Parser"
>                   class="my.nutch.plugins.parse.TSITikaParser">
>     <parameter name="contentType" value="*"/>
>   </implementation>
>
> </extension>
>
> [...]
>
> parse-plugins.xml:
>
> [...]
>
> <alias name="parse-tika"
>        extension-id="my.nutch.plugins.parse.Parser" />
>
> [...]
>
>
> The log output does read:
>
> 2010-07-23 10:16:37,071 DEBUG parse.ParseUtil - Parsing [http://localhost/test.pdf] with [my.nutch.plugins.parse.tsitikapar...@d6089a5]
> 2010-07-23 10:16:37,072 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/pdf
> 2010-07-23 10:16:37,076 WARN fetcher.Fetcher - Error parsing: http://localhost/test.pdf: failed(2,0): Can't retrieve Tika parser for mime-type application/pdf
>
>
> Why does Tika not find its parsers?
>

You have only declared an alias in parse-plugins.xml, with no association to
a mime-type, so your parser has not been made a 'default' parser. You can do
that either by specifying

<parameter name="contentType" value="*"/>

in the plugin.xml file of your plugin, or by specifying

<mimeType name="*">
  <plugin id="parse-tika" />
</mimeType>

in parse-plugins.xml.

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com

