Re: rdf output

Jukka Zitting Sun, 20 Sep 2009 14:27:19 -0700

Hi,

On Sun, Sep 20, 2009 at 10:22 PM, jakobitsch juergen
<tschyr...@yahoo.com> wrote:
> 1. when will tika switch to apache's pdf box (is it still not mature enough?)


As soon as the PDFBox 0.8.0 release is officially out and available
from the central Maven repository. I expect this to happen within a
week, so Tika 0.5 will be based on Apache PDFBox.

> 2. is it possible to skip html tags with tika (say i don't want to have
> <script> or <style> contents in my resulting plain text)

Yes. That's actually what the HTML parser in Tika does by default.
See the DISCARD_ELEMENTS set in
org.apache.tika.parser.html.HtmlParser.
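
To illustrate the idea, here is a minimal sketch of how a discard set
can be applied while extracting text. It is not Tika's actual code: the
class name DiscardingTextExtractor and the contents of its
DISCARD_ELEMENTS set are my own assumptions, and it uses the JDK's SAX
parser, so it only handles well-formed XHTML input.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not taken from the Tika sources.
public class DiscardingTextExtractor extends DefaultHandler {

    // Assumed discard set: elements whose text content is dropped.
    private static final Set<String> DISCARD_ELEMENTS =
            new HashSet<String>(Arrays.asList("script", "style"));

    private final StringBuilder text = new StringBuilder();

    // Greater than zero while we are inside a discarded element.
    private int discardDepth = 0;

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        // Count every element opened inside a discarded one, so that
        // the matching end tags bring the depth back to zero.
        if (discardDepth > 0
                || DISCARD_ELEMENTS.contains(qName.toLowerCase(Locale.ROOT))) {
            discardDepth++;
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (discardDepth > 0) {
            discardDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep character data only outside discarded elements.
        if (discardDepth == 0) {
            text.append(ch, start, length);
        }
    }

    // Parses well-formed XHTML and returns the text that was kept.
    public static String extractText(String xhtml) throws Exception {
        DiscardingTextExtractor handler = new DiscardingTextExtractor();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><style>p { color: red; }</style></head>"
                + "<body><p>Hello</p><script>alert('x');</script></body></html>";
        System.out.println(extractText(page));
    }
}
```

The depth counter rather than a simple flag is there so that markup
nested inside a discarded element is skipped as well.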

> 3. are there any plan for outputing the result into RDF (currently i'm
> using aperture), but i would be more than happy to switch to an apache
> project and i'm also willing to contribute on that one.

We've had discussions about using XMP, which is itself built on RDF,
for expressing and handling extracted document metadata. So far we
haven't reached a clear consensus and not much work has been done on
this yet, but contributions are of course welcome.

BR,

Jukka Zitting
