Hi,

On Sun, Sep 20, 2009 at 10:22 PM, jakobitsch juergen <tschyr...@yahoo.com> wrote:
> 1. when will tika switch to apache's pdf box (is it still not mature enough?)
As soon as the PDFBox 0.8.0 release is officially out and available from the
central Maven repository. I expect this to happen within a week, so Tika 0.5
will be based on Apache PDFBox.

> 2. is it possible to skip html tags with tika (say i don't want to have
> <script> or <style> contents in my resulting plain text

Yes. That's actually what the HTML parser in Tika is programmed to do by
default. See the DISCARD_ELEMENTS set in
org.apache.tika.parser.html.HTMLParser.

> 3. are there any plan for outputing the result into RDF (currently i'm using
> aperture), but i would be more than happy to switch to an apache project
> and i'm also willing to contribute on that one.

We've had discussions about using XMP for expressing and handling extracted
document metadata. So far we haven't reached a clear consensus and not much
work has been done on this yet, but contributions are of course welcome.

BR,

Jukka Zitting
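
On question 2, the idea of a discard set can be illustrated with a minimal sketch. This is not Tika's actual code: it uses only the JDK's SAX parser, assumes well-formed XHTML input, and the class name DiscardElementsSketch and the method extractText are hypothetical names chosen for the example. The handler drops all character data that appears inside any element listed in its own DISCARD_ELEMENTS set.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustration only: a stand-alone handler, not the Tika implementation.
class DiscardElementsSketch {

    // Assumed stand-in for the DISCARD_ELEMENTS set mentioned in the reply.
    private static final Set<String> DISCARD_ELEMENTS = Set.of("SCRIPT", "STYLE");

    // Returns the text of a well-formed XHTML string, minus discarded elements.
    static String extractText(String xhtml) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            // Greater than zero while the parser is inside a discarded element.
            private int discardDepth = 0;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (discardDepth > 0
                        || DISCARD_ELEMENTS.contains(qName.toUpperCase(Locale.ROOT))) {
                    discardDepth++;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (discardDepth > 0) {
                    discardDepth--;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Keep character data only when outside every discarded element.
                if (discardDepth == 0) {
                    text.append(ch, start, length);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><style>p { color: red; }</style></head>"
                + "<body><p>Hello</p><script>alert('x');</script><p>World</p></body></html>";
        System.out.println(extractText(page));
    }
}
```

A depth counter is used rather than a boolean flag so that markup nested inside a discarded element stays suppressed until the outermost discarded element closes.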