Re: Customize Tika Parser - How to access nutch Content object or is it possible to stack Parsers

Torsten Krah Fri, 23 Jul 2010 03:49:49 -0700

On Friday, 23 July 2010, at 11:12:28, Julien Nioche wrote:
> Hi Torsten,
> 
> You can implement your own HTMLParseFilter. These are called by the Tika
> parser (or the old HTML one) once it has parsed the content and converted
> it into an XHTML DOM object. You can then navigate the DOM object to do
> whatever you want and create additional metadata. The difference from
> implementing a Parser (as opposed to an HTMLParseFilter) is that you
> don't have to convert the original format into text yourself, and you
> get an XHTML representation regardless of the original format, at
> least if you use the Tika parser.
> 
> HTH
> 
> Julien
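
To make the DOM navigation described above concrete, here is a minimal
sketch in plain Java that uses only the JDK's org.w3c.dom classes. It is
not Nutch code: the class name DomMetadataSketch and the helpers extract
and parseXhtml are hypothetical, and the plugin wiring (the filter
interface, plugin.xml) is left out. It assumes a filter would run a walk
like this over the DOM node it is handed and copy the results into the
parse metadata.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical, standalone sketch: no Nutch or Tika classes are used here.
class DomMetadataSketch {

    // Walks the subtree under root and collects <title> text and <a href> targets.
    static Map<String, List<String>> extract(Node root) {
        Map<String, List<String>> meta = new LinkedHashMap<>();
        walk(root, meta);
        return meta;
    }

    private static void walk(Node node, Map<String, List<String>> meta) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            String name = el.getNodeName().toLowerCase(Locale.ROOT);
            if (name.equals("title")) {
                meta.computeIfAbsent("title", k -> new ArrayList<>())
                    .add(el.getTextContent().trim());
            } else if (name.equals("a") && el.hasAttribute("href")) {
                meta.computeIfAbsent("link", k -> new ArrayList<>())
                    .add(el.getAttribute("href"));
            }
        }
        // Recurse into children in document order.
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), meta);
        }
    }

    // Test helper only: builds a DOM from a well-formed XHTML string.
    static Document parseXhtml(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseXhtml(
            "<html><head><title>Report</title></head>"
            + "<body><a href=\"http://example.org/\">x</a></body></html>");
        System.out.println(extract(doc));
    }
}
```

The walk never looks at the original file format. If the parser really
hands over XHTML for every input type, as described above, the same
traversal would apply to the DOM produced from non-HTML documents too.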



For HTML this is OK and already works.
But for non-HTML content (PDF, DOC, etc.) I did not find any filter API like
the HTML one (e.g. a BinaryParseFilter or something similar).
How can I take a filter-like approach there?

thx

Torsten

-- 
Please do not send me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.de.html

"Really, I'm not out to destroy Microsoft. That will just be a
completely unintentional side effect."
        -- Linus Torvalds
