Re: Tika parser code for region extraction

Tanya Roosta Thu, 31 May 2018 09:07:54 -0700
>
> Hi,
> I have been trying to change the PDFParser code in Tika so that I
> can specify a region on the page to be extracted.  I came across the
> following code, which has some ideas on how to do this:
>
> https://github.com/asitang/tika_pdf_celgene/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
>
> I have been able to modify the Tika 1.18 source code (the PDF2XHTML and
> AbstractPDF2XHTML files), using the ideas in the above code, so that it
> runs and extracts text from a specified rectangular region.  However,
> there are two issues: if the PDF has only one page, nothing is extracted,
> and if the language is anything but English, again nothing is extracted.
> I have spent many hours trying to figure it out, but I can't.  There is no
> error or exception; the text is simply not extracted.
>
> I have searched Google as well, but surprisingly have not found anything
> except the above code on how to use a specified region of the PDF page
> with the Tika parser.  I would have thought this is a common problem, as
> even the pdftotext utility has an option for passing in an area.  There
> are posts where people discuss how to use PDFTextStripperByArea for a
> standalone solution, but nothing related to the Tika parser being able to
> extract by a user-specified region.
>
> I was wondering if anyone has dealt with this issue or is aware of any
> enhancements to the Tika parser that help with specifying a rectangular
> region from which to extract text.
>
> Thanks,
>
> Tanya
>
>
>
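For reference, the basic idea behind extracting by a rectangular region can
be sketched without any PDF library: keep only the characters whose position
falls inside the rectangle.  This is only a minimal illustration under my own
assumptions.  The RegionFilter and Glyph names are hypothetical and are not
part of Tika or PDFBox, the coordinates are made up, and the sketch does not
call PDFTextStripperByArea or the modified PDF2XHTML code discussed above.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these types are not part of Tika or PDFBox.
final class RegionFilter {

    /** One piece of text plus the page coordinates of its anchor point. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Concatenates, in input order, the glyphs whose anchor point lies inside the region. */
    static String extract(List<Glyph> glyphs, Rectangle2D region) {
        StringBuilder out = new StringBuilder();
        for (Glyph g : glyphs) {
            if (region.contains(g.x, g.y)) {
                out.append(g.text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> page = new ArrayList<>();
        page.add(new Glyph("H", 10, 10));
        page.add(new Glyph("i", 16, 10));
        page.add(new Glyph("!", 300, 400)); // lies outside the region below

        // Region given as x, y, width, height (assumed units and origin).
        Rectangle2D region = new Rectangle2D.Double(0, 0, 100, 50);
        System.out.println(extract(page, region));
    }
}
```

The filter itself only compares coordinates, so the rectangle has to be
expressed in the same coordinate system and units as the glyph positions.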