Fwd: Tika parser code for region extraction

Tim Allison Thu, 31 May 2018 12:14:22 -0700

Forwarded from Tanya:



I have been trying to change the PDF parser code in Tika so that I can
specify a region of the page to be extracted.  I came across the
following code, which has some ideas on how to do this:

https://github.com/asitang/tika_pdf_celgene/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java

I have been able to modify the Tika 1.18 source code (PDF2XHTML and
AbstractPDF2XHTML), using the idea from the above code, so that it runs and
extracts text from a specified region.  However, there are some issues.
Mainly, if the PDF has only one page, nothing is extracted, and if the
language is anything but English, again nothing is extracted.  I have spent
many hours trying to figure it out, but I can't.  There is no error or
exception; the output is simply empty.  I have tried many variations of
Google searches but, surprisingly, have not found anything except the above
code on how to extract a specified region of a PDF page with the Tika
parser.  I would have thought this is a common problem, as I know even the
pdftotext utility has an option for passing in an area.

I was wondering if you have dealt with this issue or are aware of any
enhancements to the Tika parser that help with specifying a rectangular
region from which to extract text.
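
To make the request concrete, the behaviour I am after amounts to roughly
the following.  This is only a sketch under assumptions, not Tika or PDFBox
code: the Chunk class and the extract method are hypothetical names made up
for illustration, and the coordinates assume an origin at the top-left of
the page with y increasing downwards.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class RegionFilterSketch {

    /** Hypothetical stand-in for a positioned piece of text; not a Tika or PDFBox type. */
    public static final class Chunk {
        final String text;
        final double x;
        final double y;

        public Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Joins, in input order, the text of every chunk whose anchor point
     * lies inside the given region. Chunks outside the region are skipped.
     */
    public static String extract(List<Chunk> chunks, Rectangle2D region) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (region.contains(c.x, c.y)) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(c.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up page content, purely for illustration.
        List<Chunk> page = new ArrayList<>();
        page.add(new Chunk("Header", 50, 20));
        page.add(new Chunk("Invoice", 50, 100));
        page.add(new Chunk("Total", 300, 120));
        page.add(new Chunk("Footer", 50, 760));

        // Region given as x, y, width, height.
        Rectangle2D region = new Rectangle2D.Double(0, 80, 400, 100);
        System.out.println(extract(page, region)); // prints "Invoice Total"
    }
}
```

In the real parser the positions would have to come from the PDF text
extraction itself, and the same rectangle would presumably be applied to
every page, including the first and only page of a one-page document.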

I really appreciate any help you can provide.

Many thanks,

Tanya
