
Re: PDF with embedded attachments and Tika 2.0 modularity

2016-09-16 Thread Sergey Beryozkin

Hi Bob, Thanks for all the info, much appreciated. I agree it makes sense to start with the multimedia bundle to increase the format coverage. I'll keep experimenting. The demo is quite basic as far as Tika is concerned - I've only tried PDF/ODT/ODP files, and it looks like they are really simple

Re: PDF with embedded attachments and Tika 2.0 modularity

2016-09-16 Thread Bob Paulin

Hi Sergey, On 9/15/2016 3:33 PM, Sergey Beryozkin wrote: Hi Bob, Tim, All, On 15/09/16 18:06, Bob Paulin wrote: Hi Sergey, I definitely get the challenges. In fact recently we merged the PDF module into the Multimedia module due to the tight coupling around the TesseractOCR[1] [2]. We

Re: PDF with embedded attachments and Tika 2.0 modularity

2016-09-15 Thread Bob Paulin

Hi Sergey, I definitely get the challenges. In fact recently we merged the PDF module into the Multimedia module due to the tight coupling around the TesseractOCR[1] [2]. We could look into separating the PDF parser out again but I'm a bit short on a simple way to do it with TesseractOCR in

RE: PDF with embedded attachments and Tika 2.0 modularity

2016-09-15 Thread Allison, Timothy B.

Sergey, your point is well taken. Yes, you'd need most parsers, but you can _probably_ live without advanced or scientific (sorry, Chris!). I'd be hesitant to change the structure much. We should definitely document this well, though! -Original Message- From: Sergey Beryozkin