On Friday, 23 July 2010, at 11:12:28, Julien Nioche wrote:
> Hi Torsten,
>
> You can implement your own HTMLParseFilter. These are called by the Tika
> parser (or the old HTML one) once it has parsed the content and converted
> it into an XHTML DOM object. You can then navigate the DOM object to do
> whatever you want to do and create additional metadata. The difference
> with implementing a Parser (as opposed to an HTMLParseFilter) is that you
> don't have to convert from the original format into text and you are
> getting an XHTML representation regardless of the original format - at
> least if you use the Tika parser.
>
> HTH
>
> Julien

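To illustrate the kind of DOM navigation described above, here is a minimal sketch. It deliberately does not use the Nutch filter interface itself (its exact signature is not shown in this thread); it only uses the JDK's own XML parser and org.w3c.dom classes. The class name DomMetadataSketch, the helper names, and the choice of collecting h1 headings as "additional metadata" are assumptions made purely for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-in for the body of a parse filter: walk an XHTML DOM
// and pull out values that could be stored as extra metadata.
public class DomMetadataSketch {

    // Recursively collect the trimmed text of every element named `tag`.
    static void collect(Node node, String tag, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && tag.equalsIgnoreCase(node.getNodeName())) {
            out.add(node.getTextContent().trim());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), tag, out);
        }
    }

    // Parse a well-formed XHTML string and return the text of all h1 elements.
    static List<String> headings(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        List<String> out = new ArrayList<>();
        collect(doc.getDocumentElement(), "h1", out);
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><h1>Title</h1><p>Body text</p></body></html>";
        System.out.println(headings(xhtml)); // prints [Title]
    }
}
```

In a real filter the DOM would be handed in by the parser rather than built from a string, and the collected values would be written into the parse metadata instead of being printed.
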
For HTML this is OK and already works. But for non-HTML content (PDF, DOC,
etc.) I did not find any filter API like the HTML one (e.g. a
BinaryParseFilter or something similar). How can I do this for those
formats, using a filter-like approach?

Thanks,
Torsten

-- 
Please do not send me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.de.html

"Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect." -- Linus Torvalds