Hi Torsten,

You can implement your own HtmlParseFilter. These filters are called by the Tika parser (or the old HTML one) once it has parsed the content and converted it into an XHTML DOM object. You can then navigate the DOM to do whatever you need and create additional metadata. The advantage over implementing a Parser (as opposed to an HtmlParseFilter) is that you don't have to convert the original format into text yourself, and you get an XHTML representation regardless of the original format, at least if you use the Tika parser.
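To make the DOM navigation step concrete, here is a minimal sketch using only the JDK's org.w3c.dom classes. It is not taken from Nutch itself: the class name DomMetadataSketch, the helper names, and the "title" / "meta.<name>" keys are all made up for illustration. In Nutch 1.x, as far as I recall, the filter interface is org.apache.nutch.parse.HtmlParseFilter and its filter(...) method is handed the Content, the ParseResult, and the DOM as a DocumentFragment; please check the Javadoc of your version before relying on that.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; the names are illustrative and not part of Nutch.
class DomMetadataSketch {

    // Walk the DOM depth-first and copy <title> and <meta name=... content=...>
    // values into the map. A real filter would add them to the parse metadata.
    static void collect(Node node, Map<String, String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            String name = el.getNodeName().toLowerCase(Locale.ROOT);
            if ("title".equals(name)) {
                out.put("title", el.getTextContent().trim());
            } else if ("meta".equals(name) && el.hasAttribute("name")) {
                // "meta." prefix is an assumption made for this sketch only.
                out.put("meta." + el.getAttribute("name"), el.getAttribute("content"));
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    // Entry point: this is roughly the work a filter would do on the DOM it receives.
    static Map<String, String> extract(DocumentFragment doc) {
        Map<String, String> out = new LinkedHashMap<>();
        collect(doc, out);
        return out;
    }

    // Helper for trying the sketch outside Nutch: parse a well-formed XHTML
    // string and wrap its root element in a DocumentFragment.
    static DocumentFragment toFragment(String xhtml) {
        try {
            Document d = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            DocumentFragment frag = d.createDocumentFragment();
            frag.appendChild(d.getDocumentElement()); // moves the root into the fragment
            return frag;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XHTML", e);
        }
    }
}
```

Inside an actual filter you would call something like extract(doc) on the DocumentFragment you are given and then copy the entries into the parse metadata; I believe parseResult.get(content.getUrl()).getData().getParseMeta().add(key, value) is the usual route, but treat that as an assumption to verify. Since the Content object is passed to the filter as well, you should still be able to reach the original data you mentioned.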
HTH

Julien

> Hi,
>
> using the standard Nutch parsers, I am able to get access to the
> org.apache.nutch.protocol.Content
> to get some data to index from the original URI if it is not already
> found in the Metadata object.
>
> Using Nutch 1.1 I want to use the Tika parsers and wonder if this can be
> done; the API does not look like it allows this.
>
> So maybe I am missing the glue where I can do such things, maybe via my own
> Tika parser (where do I register it with Nutch?).
> Or is it possible to stack parsers, e.g. let Tika do its "standard" work
> and after that let the next Nutch parser run to do this stuff?
>
> Any hints appreciated.
>
> thx
>
> Torsten
>
> --
> Please do not send me Word or PowerPoint attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.de.html
>
> "Really, I'm not out to destroy Microsoft. That will just be a
> completely unintentional side effect."
> -- Linus Torvalds

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com

