Re: Customize Tika Parser - How to access nutch Content object or is it possible to stack Parsers

Julien Nioche Fri, 23 Jul 2010 02:13:00 -0700

Hi Torsten,

You can implement your own HtmlParseFilter. These are called by the Tika
parser (or the old HTML one) once it has parsed the content and converted it
into an XHTML DOM object. You can then navigate the DOM to do whatever you
want and create additional metadata. The difference from implementing a
Parser (as opposed to an HtmlParseFilter) is that you don't have to convert
from the original format into text, and you get an XHTML representation
regardless of the original format, at least if you use the Tika parser.
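
To illustrate the DOM navigation step, here is a minimal stand-alone sketch
that uses only the JDK's org.w3c.dom API. In Nutch 1.x the filter interface
is, as far as I know, org.apache.nutch.parse.HtmlParseFilter, whose filter()
method is handed the Content object together with the ParseResult and a
DocumentFragment; please check the Javadoc of your version for the exact
signature. The class name DomMetadataSketch, the helper names and the sample
XHTML below are made up for the example, and parseToFragment() only stands in
for the fragment the parser would pass to your filter.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomMetadataSketch {

    // Walk the DOM recursively and record every <meta name="..." content="...">
    // pair. Inside a real filter, these pairs could be copied into the parse
    // metadata instead of a plain Map.
    public static void collectMeta(Node node, Map<String, String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "meta".equalsIgnoreCase(node.getNodeName())) {
            Element e = (Element) node;
            String name = e.getAttribute("name"); // "" when the attribute is absent
            if (!name.isEmpty()) {
                out.put(name, e.getAttribute("content"));
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectMeta(c, out);
        }
    }

    // Hypothetical helper: builds a DocumentFragment from an XHTML string so
    // the sketch can run without Nutch on the classpath.
    public static DocumentFragment parseToFragment(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        DocumentFragment fragment = doc.createDocumentFragment();
        fragment.appendChild(doc.getDocumentElement()); // moves <html> into the fragment
        return fragment;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head>"
                + "<meta name=\"author\" content=\"Torsten\"/>"
                + "<meta name=\"keywords\" content=\"nutch,tika\"/>"
                + "</head><body><p>Hello</p></body></html>";
        Map<String, String> meta = new LinkedHashMap<>();
        collectMeta(parseToFragment(xhtml), meta);
        System.out.println(meta);
    }
}
```

Because the filter runs after the parser, the same traversal should work for
whatever format Tika has converted to XHTML, and anything that is missing from
the DOM could still be looked up on the Content object passed to the filter.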


HTH

Julien


> Hi, using the standard Nutch parsers, I am able to get access to the
>
> org.apache.nutch.protocol.Content
>
> object to get some data to index from the original URI if it is not already
> found in the Metadata object.
> Using Nutch 1.1 I want to use the Tika parsers and wonder if this can be
> done; the API does not look like it allows it.
> So maybe I am missing the glue where I can do such things, maybe via my own
> Tika parser (where would I register it with Nutch?).
> Or is it possible to stack parsers, e.g. let Tika do its "standard" work and
> after that let the next Nutch parser run to do this stuff?
>
> Any hints appreciated.
>
> thx
>
> Torsten
>
> --
> Please do not send me Word or PowerPoint attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.de.html
>
> "Really, I'm not out to destroy Microsoft. That will just be a
> completely unintentional side effect."
>        -- Linus Torvalds
>



-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
