Hi, I need to extend/override the TikaParser, so I wrote my own plugin.
My plugin.xml contains this:
[...]
<requires>
  <import plugin="nutch-extensionpoints"/>
  <import plugin="lib-nekohtml"/>
  <import plugin="parse-tika"/>
</requires>
<extension point="org.apache.nutch.parse.Parser"
           id="my.nutch.plugins.parse"
           name="TSITikaParser">
  <implementation id="my.nutch.plugins.parse.Parser"
                  class="my.nutch.plugins.parse.TSITikaParser">
    <parameter name="contentType" value="*"/>
  </implementation>
</extension>
[...]
parse-plugins.xml:
[...]
<alias name="parse-tika"
       extension-id="my.nutch.plugins.parse.Parser" />
[...]
The log output reads:
2010-07-23 10:16:37,071 DEBUG parse.ParseUtil - Parsing [http://localhost/test.pdf] with [my.nutch.plugins.parse.tsitikapar...@d6089a5]
2010-07-23 10:16:37,072 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/pdf
2010-07-23 10:16:37,076 WARN fetcher.Fetcher - Error parsing: http://localhost/test.pdf: failed(2,0): Can't retrieve Tika parser for mime-type application/pdf
Why does Tika not find its parsers?
TSITikaParser.java is a 1:1 copy of TikaParser.java from the parse-tika
plugin (Nutch branch 1.1) without any changes yet, so I thought this should
work, but it does not. Is there any magic I have missed? Maybe some
class-loading specifics, or something else that is important to make Tika
work with Nutch?
Do I have to configure anything else to get Tika working, or is this not
supported at all?
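To make the class-loading question concrete: as far as I understand it, a Nutch plugin only sees those libraries of an imported plugin that the imported plugin marks as exported in its own descriptor. The fragment below is only a sketch of what such a runtime section can look like. The jar names are illustrative assumptions and are not copied from the real parse-tika plugin.xml, and I have not confirmed that this is the cause of the error.

```xml
<!-- Hypothetical <runtime> section of an imported plugin's plugin.xml. -->
<!-- Jar names are placeholders, not the actual parse-tika contents. -->
<runtime>
  <!-- Exported: classes in this jar are visible to plugins that import this one -->
  <library name="parse-tika.jar">
    <export name="*"/>
  </library>
  <!-- No <export>: this jar stays private to the plugin that declares it -->
  <library name="tika-parsers.jar"/>
</runtime>
```

If the jars holding the actual Tika parser implementations are not exported, my copied class might be loaded by a class loader that cannot see them, which would fit the "Can't retrieve Tika parser" message.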
thx
Torsten
--
Please do not send me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.de.html
"Really, I'm not out to destroy Microsoft. That will just be a
completely unintentional side effect."
-- Linus Torvalds

