Hi Bob
Thanks for all the info, much appreciated. I agree it makes sense to
start with the multimedia bundle to increase the format coverage.
I'll keep experimenting. The demo is quite basic as far as Tika is
concerned - I've only tried PDF/ODT/ODP files and it looks like they are
really simple
Hi Sergey,
On 9/15/2016 3:33 PM, Sergey Beryozkin wrote:
Hi Bob, Tim, All,
On 15/09/16 18:06, Bob Paulin wrote:
Hi Sergey,
I definitely get the challenges. In fact recently we merged the PDF
module into the Multimedia module due to the tight coupling around the
TesseractOCR[1] [2]. We could look into separating the PDF parser out
again but I'm a bit short on a simple way to do it with TesseractOCR in
Sergey, your point is well taken.
Yes, you'd need most parsers, but you can _probably_ live without the
advanced or scientific ones (sorry, Chris!).
I'd be hesitant to change the structure much. We should definitely document
this well, though!
-----Original Message-----
From: Sergey Beryozkin