Hi Torsten,

You can implement your own HTMLParseFilter. These are called by the Tika
parser (or the old HTML one) once it has parsed the content and converted it
into an XHTML DOM object. You can then walk the DOM to extract whatever you
need and add it as extra metadata. Compared with implementing a Parser, an
HTMLParseFilter saves you from converting the original format into text
yourself, and you get an XHTML representation regardless of the original
format, at least if you use the Tika parser.
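
To give an idea of the DOM-walking part, here is a rough sketch using only the standard org.w3c.dom API. It is not Nutch code: the class name DomWalkSketch and the helpers collectLinks/extractLinks are made up for illustration, and the XHTML is parsed from a string here, whereas inside a filter you would be handed the DOM node by the parser. Collecting link targets is just an example of "additional metadata".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-alone sketch; in Nutch the same walk would run inside your filter.
public class DomWalkSketch {

    /** Recursively collects the non-empty href attribute of every <a> element under node. */
    static void collectLinks(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())) {
            String href = ((Element) node).getAttribute("href");
            if (!href.isEmpty()) {
                out.add(href);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectLinks(children.item(i), out);
        }
    }

    /** Parses a well-formed XHTML string and returns the link targets found in it. */
    static List<String> extractLinks(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        List<String> links = new ArrayList<>();
        collectLinks(doc, links);
        return links;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>See <a href=\"http://example.org/a\">A</a>"
                + " and <a href=\"http://example.org/b\">B</a></p></body></html>";
        System.out.println(extractLinks(xhtml));
    }
}
```

In a real filter you would store the collected values in the parse metadata instead of printing them.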

HTH

Julien


> Hi, using the standard Nutch parsers, I am able to get access to
>
> org.apache.nutch.protocol.Content
>
> to get some data to index from the original URI if it is not already
> found in the Metadata object.
> Using Nutch 1.1 I want to use the Tika parsers and wonder if this can be
> done; the API does not look like it allows for it.
> So maybe I am missing the glue where I can do such things, maybe via my own
> Tika parser (where do I register it with Nutch?).
> Or is it possible to stack parsers, e.g. let Tika do its "standard" work
> and after that let the next Nutch parser run to do this stuff?
>
> Any hints appreciated.
>
> thx
>
> Torsten
>
> --
> Please do not send me Word or PowerPoint attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.de.html
>
> "Really, I'm not out to destroy Microsoft. That will just be a
> completely unintentional side effect."
>        -- Linus Torvalds
>



-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
