Hi Tim,

Thanks, no problem at all.

The other day I thought of experimenting with integrating Tika into Apache Beam pipelines, where the input data is pulled from its source regularly. That is why I thought Tika would need to provide a pull-like parser interface for such an integration to succeed.

I agree that simply attempting to convert Tika parsers to use StAX or similar is not realistic, but perhaps a POC around the XHTML parser could be attempted. That said, it probably does not make much sense, as it would not work for all (or even most mainstream) Tika parsers anyway...
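
To illustrate the gap between the two models, here is a minimal sketch of one way to put a pull-style front on callback-based SAX events without rewriting any parser: run the SAX parse on a background thread and hand events to the consumer through a bounded queue. It uses only the JDK's own SAX parser so that it is self-contained. The class name `SaxPullBridge` and the `start:`/`end:`/`text:` event strings are made up for this example and are not part of Tika.

```java
import java.io.StringReader;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not a Tika class): exposes SAX callbacks as a pull-style iterator.
final class SaxPullBridge implements Iterator<String> {
    // Sentinel marking the end of the event stream; real events always carry a prefix.
    private static final String END = "<<END>>";

    // Bounded queue: the producer blocks when the consumer falls behind.
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
    private String next;

    SaxPullBridge(String xml) {
        Thread producer = new Thread(() -> {
            try {
                // The push side: the SAX parser drives the handler callbacks.
                SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String local, String qName, Attributes atts) {
                            emit("start:" + qName);
                        }

                        @Override
                        public void endElement(String uri, String local, String qName) {
                            emit("end:" + qName);
                        }

                        @Override
                        public void characters(char[] ch, int start, int length) {
                            String text = new String(ch, start, length).trim();
                            if (!text.isEmpty()) {
                                emit("text:" + text);
                            }
                        }
                    });
            } catch (Exception e) {
                emit("error:" + e.getMessage());
            } finally {
                emit(END);
            }
        });
        producer.setDaemon(true); // do not keep the JVM alive if the consumer stops early
        producer.start();
    }

    private void emit(String event) {
        try {
            queue.put(event);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // The pull side: the consumer asks for the next event when it is ready.
    @Override
    public boolean hasNext() {
        if (next == null) {
            try {
                next = queue.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                next = END;
            }
        }
        return !END.equals(next);
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String event = next;
        next = null;
        return event;
    }
}
```

A consumer would loop with `hasNext()`/`next()` over, say, `new SaxPullBridge("<a><b>hi</b></a>")`. With Tika, the producer thread would presumably call `Parser.parse(stream, handler, metadata, context)` with a similar queue-filling `ContentHandler` in place of the JDK parser; that part is an assumption and is not shown. The cost is one extra thread per document, and a consumer that abandons iteration leaves the producer blocked on the queue, so this is a workaround rather than a true pull parser.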

Thanks,
Sergey

On 05/04/17 14:51, Allison, Timothy B. wrote:

Sergey,

Good to hear from you. I'm sorry for not responding sooner.

First a note on streaming and Tika. If I understand correctly, from the very beginning of Tika the goal was for full streaming processing. Unfortunately, for some file formats, we have to read the entire file before we can parse it, so streaming is somewhat of an illusion. Also, for some files, metadata can't be extracted until after some of the contents are extracted which means that in some cases you'll get more metadata in the Metadata object than you'll get in our xhtml.

I've dabbled in StAX, and at one point, I found it easier to work with than SAX so I have some sympathy. Given that everything in Tika is SAX based, I worry that the benefit isn't worth the effort of converting parsers to StAX.

What particulars about our SAX handlers make them not conducive to streaming in your case? Is there anything we can change with less effort than moving to StAX that would help?

I'm not against you experimenting with a new PDFParser, but overall, it feels like it would be quite a bit of work. If you want to work on new handlers, how about the rewriteable ones we need for Tika 2.0? :)

Cheers,

Tim

-----Original Message-----
From: Sergey Beryozkin [mailto:[email protected]]
Sent: Wednesday, April 5, 2017 7:22 AM
To: [email protected]
Subject: Re: Streaming and Tika

Hi All

Would it make sense to consider doing something like this for a single format, ex, PDF, or other one which may be the most 'capable' of reporting its events in a pull like fashion?

Tom, others, what do you think?

Cheers, Sergey

On 10/11/16 12:14, Sergey Beryozkin wrote:

Hi All

I've been looking at how to integrate Tika in some of the streaming pipelines, and I'm finding it difficult to set up with the callback-based SAX mechanism.

Does it make sense to consider starting adding StAX-like Parser API?

So far the only reference to Stax I've seen is
https://issues.apache.org/jira/browse/TIKA-1321

Cheers, Sergey
