Sergey,
Good to hear from you. I'm sorry for not responding sooner.
First, a note on streaming and Tika. If I understand correctly, the goal from
the very beginning of Tika was full streaming processing. Unfortunately, for
some file formats we have to read the entire file before we can parse it, so
streaming is somewhat of an illusion. Also, for some files, metadata can't be
extracted until after some of the contents are extracted, which means that in
some cases you'll get more metadata in the Metadata object than you'll get in
our XHTML.
I've dabbled in StAX, and at one point I found it easier to work with than
SAX, so I have some sympathy.
Given that everything in Tika is SAX-based, I worry that the benefit isn't
worth the effort of converting parsers to StAX.
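For anyone following the thread, here is a minimal sketch of the push vs. pull
difference, using only the JDK's own SAX and StAX APIs and nothing from Tika.
The class and method names (PushVsPull, textViaSax, textViaStax) are made up
for illustration, and the input is a hand-written XHTML snippet, not real
parser output.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Tika.
public class PushVsPull {

    // Push model (SAX): the parser drives and calls back into our handler.
    static String textViaSax(String xml) throws Exception {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return out.toString();
    }

    // Pull model (StAX): the caller drives and asks for the next event.
    static String textViaStax(String xml) throws Exception {
        StringBuilder out = new StringBuilder();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    out.append(reader.getText());
                }
            }
        } finally {
            reader.close();
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Hello</p><p>Tika</p></body></html>";
        System.out.println(textViaSax(xhtml));
        System.out.println(textViaStax(xhtml));
    }
}
```

Both methods collect the same text; the difference is who owns the loop. With
SAX the parser owns it, which is what makes it awkward to plug into a pipeline
that wants to pull events at its own pace.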
What particulars about our SAX handlers make them not conducive to streaming
in your case? Is there anything we can change with less effort than moving to
StAX that would help?
I'm not against you experimenting with a new PDFParser, but overall, it feels
like it would be quite a bit of work.
If you want to work on new handlers, how about the rewriteable ones we need
for Tika 2.0? :)
Cheers,
Tim
-----Original Message-----
From: Sergey Beryozkin [mailto:[email protected]]
Sent: Wednesday, April 5, 2017 7:22 AM
To: [email protected]
Subject: Re: Streaming and Tika
Hi All
Would it make sense to consider doing something like this for a single format,
e.g. PDF, or whichever other one may be the most 'capable' of reporting its
events in a pull-like fashion?
Tom, others, what do you think?
Cheers, Sergey
On 10/11/16 12:14, Sergey Beryozkin wrote:
> Hi All
>
> I've been looking at how to integrate Tika in some of the streaming
> pipelines, and I'm finding it difficult to set up with the
> callback-based SAX mechanism.
>
> Does it make sense to consider adding a StAX-like Parser API?
>
> So far the only reference to StAX I've seen is
> https://issues.apache.org/jira/browse/TIKA-1321
>
> Cheers, Sergey