Hi,

I need to extend / override the TikaParser, so I wrote my own plugin.

plugin.xml contains this:

[...]

   <requires>
      <import plugin="nutch-extensionpoints"/>
      <import plugin="lib-nekohtml"/>
      <import plugin="parse-tika"/>
   </requires>

   <extension point="org.apache.nutch.parse.Parser"
              id="my.nutch.plugins.parse"
              name="TSITikaParser">

      <implementation id="my.nutch.plugins.parse.Parser"
                      class="my.nutch.plugins.parse.TSITikaParser">
         <parameter name="contentType" value="*"/>
      </implementation>

   </extension>

[...]

parse-plugins.xml:

[...]

<alias name="parse-tika" 
                extension-id="my.nutch.plugins.parse.Parser" />

[...]
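
For orientation, here is a minimal sketch of how that alias relates to the rest of parse-plugins.xml, assuming the stock Nutch 1.x layout of that file. Only the alias line comes from the excerpt above; the application/pdf mapping is an illustrative assumption and is not copied from the elided parts of the file.

```xml
<parse-plugins>

  <!-- Assumed mapping (not from the original file):
       route PDF content to the plugin id "parse-tika" -->
  <mimeType name="application/pdf">
    <plugin id="parse-tika" />
  </mimeType>

  <aliases>
    <!-- From the excerpt above: the plugin id "parse-tika"
         now points at the custom extension id -->
    <alias name="parse-tika"
           extension-id="my.nutch.plugins.parse.Parser" />
  </aliases>

</parse-plugins>
```

With a mapping like this, the plugin id named in the mimeType entry is looked up in the aliases section to find the extension id, which would be consistent with ParseUtil selecting the custom parser in the log output.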

The log output reads:

2010-07-23 10:16:37,071 DEBUG parse.ParseUtil - Parsing [http://localhost/test.pdf] with [my.nutch.plugins.parse.tsitikapar...@d6089a5]
2010-07-23 10:16:37,072 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/pdf
2010-07-23 10:16:37,076 WARN  fetcher.Fetcher - Error parsing: http://localhost/test.pdf: failed(2,0): Can't retrieve Tika parser for mime-type application/pdf


Why doesn't Tika find its parsers?
TSITikaParser.java is a 1:1 copy of TikaParser.java from the parse-tika
plugin (Nutch branch 1.1) with no changes yet, so I expected this to work,
but it does not. Is there some magic I've missed? Maybe some class-loading
peculiarity, or something else that is required to make Tika work with Nutch?

Do I have to configure anything else to get Tika working, or is this not
supported at all?


thx 

Torsten


-- 
Please do not send me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.de.html

"Really, I'm not out to destroy Microsoft. That will just be a
completely unintentional side effect."
        -- Linus Torvalds
