Hah, Asitang works on my team. I've cc'd him.

On May 31, 2018, at 9:20 AM, Tim Allison wrote:

Hi Tanya,
  I think you'll have better luck with straight PDFBox.  Tika is meant to be a 
more general kind of extractor.

 I'm sure you've seen it, but this might help:

https://stackoverflow.com/questions/40101748/extracting-text-from-an-area-with-pdfbox

The class org.apache.pdfbox.examples.util.PrintTextLocations in PDFBox's 
examples module might be useful as well.
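
To make the region idea concrete, here is a minimal sketch of the filtering
step in plain Java. It is not PDFBox or Tika code: RegionFilter,
PositionedText and textInRegion are hypothetical names for illustration, and
the positions are made-up numbers. In PDFBox 2.x the real work would be done
by PDFTextStripperByArea (addRegion, extractRegions, getTextForRegion), which
is what the Stack Overflow answer above uses. The sketch only shows the
underlying idea of keeping text whose position falls inside a rectangle.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of PDFBox or Tika.
class RegionFilter {

    // Stand-in for a piece of text with a position on the page.
    static final class PositionedText {
        final String text;
        final double x;
        final double y;

        PositionedText(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Concatenate, in input order, the text of every item whose (x, y)
    // point lies inside the given rectangle.
    static String textInRegion(List<PositionedText> items, Rectangle2D region) {
        StringBuilder sb = new StringBuilder();
        for (PositionedText item : items) {
            if (region.contains(item.x, item.y)) {
                sb.append(item.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<PositionedText> page = new ArrayList<>();
        page.add(new PositionedText("Header", 50, 20));   // above the region
        page.add(new PositionedText("Body", 50, 200));    // inside the region

        // Region with corner (0, 100), 300 wide and 300 high.
        Rectangle2D region = new Rectangle2D.Double(0, 100, 300, 300);
        System.out.println(textInRegion(page, region));   // prints Body
    }
}
```

If the rectangle and the text positions use different coordinate conventions,
a filter like this silently matches nothing, which would look the same as "no
error, but no text".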

Also, on the question of why some text is extracted and other text is not: it 
may come down to the underlying PDF and how the text is stored (or not stored) 
in your files.  It might not be the fault of your code at all!

Best,

           Tim

On Thu, May 31, 2018 at 3:14 PM, Tim Allison wrote:
Forwarded from Tanya:



I have been trying to change the pdfparser code in Tika parser so that I can 
specify a region on the page to be extracted.  I came across the following code 
which has some ideas on how to do this:

https://github.com/asitang/tika_pdf_celgene/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java

I have been able to modify Tika-1.18 source code (PDF2XHTML and 
AbstractPDF2XHTML) using the above code idea to run and extract by specified 
region.  However, there are some issues: if the PDF has only one page, it 
fails to extract anything, and if the language is anything but English, again 
nothing gets extracted.  I have spent many hours trying to figure it out, but I 
can't.  There is no error or exception; the text simply is not extracted.  I 
have tried many variations of Google searches, but surprisingly have not found 
anything except the above code on how to use a specified region of the PDF page 
with the Tika parser.  I would have thought this is a common problem, as I know 
even the pdftotext utility has an option for passing in an area.

I was wondering if you have dealt with this issue or are aware of any 
enhancements to the Tika parser that help with specifying a rectangular region 
from which to extract text.

I really appreciate any help you can provide.

Many thanks,

Tanya


