Hi Tanya, I think you'll have better luck with straight PDFBox. Tika is meant to be a more general kind of extractor.
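
To make the idea concrete, here is a minimal, library-free sketch in Java of what region extraction boils down to: keep only the pieces of text whose position falls inside a rectangle. PDFBox's PDFTextStripperByArea class applies the same kind of test to the text positions it reads from a page. The Glyph class, the textInRegion method, and the coordinates below are hypothetical stand-ins for illustration only; they are not PDFBox or Tika API, and the sketch assumes y is measured from the top of the page.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class RegionFilterSketch {

    /** Hypothetical stand-in for one positioned piece of text (not a PDFBox type). */
    static final class Glyph {
        final String text;
        final double x; // distance from the left edge of the page, in points
        final double y; // distance from the top edge of the page, in points (assumption)

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Concatenates, in list order, the text of every glyph whose position lies inside the region. */
    static String textInRegion(List<Glyph> glyphs, Rectangle2D region) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            if (region.contains(g.x, g.y)) {
                sb.append(g.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> page = new ArrayList<>();
        page.add(new Glyph("H", 72, 100));
        page.add(new Glyph("i", 80, 100));
        page.add(new Glyph("!", 300, 700)); // lies outside the region below

        // Region with its top-left corner at (50, 80), 100 points wide and 40 points high.
        Rectangle2D region = new Rectangle2D.Double(50, 80, 100, 40);
        System.out.println(textInRegion(page, region)); // prints "Hi"
    }
}
```

If a region selects nothing at all, printing the positions of the text first (as PrintTextLocations does) and comparing them against the rectangle is a quick way to see whether the region or the PDF itself is the problem.
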
I'm sure you've seen it, but this might help:
https://stackoverflow.com/questions/40101748/extracting-text-from-an-area-with-pdfbox

The class org.apache.pdfbox.examples.util.PrintTextLocations in PDFBox's examples module might be useful as well.

Also, on the question of why some text is extracted and other text is not: it may come down to the underlying PDF and how the text is (or is not) stored in your files. It might not be the fault of your code at all!

Best,
Tim

On Thu, May 31, 2018 at 3:14 PM, Tim Allison <[email protected]> wrote:
> Forwarded from Tanya:
>
> I have been trying to change the pdfparser code in Tika parser so that I
> can specify a region on the page to be extracted. I came across the
> following code which has some ideas on how to do this:
>
> https://github.com/asitang/tika_pdf_celgene/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
>
> I have been able to modify the Tika-1.18 source code (PDF2XHTML and
> AbstractPDF2XHTML) using the above code idea to run and extract by
> specified region. However, there are some issues: mainly, if the PDF is one
> page, it fails to extract anything, and if the language is anything but
> English, again nothing gets extracted. I have spent many hours trying to
> figure it out, but I can't. There is no error or exception, just that no
> text is extracted. I have tried many variations on Google search, but
> surprisingly have not found anything except the above code on how to use a
> specified region on the PDF page with Tika parser. I would have thought
> this is a common problem, as I know even the pdftotext utility has an option
> for an area to be passed in.
>
> I was wondering if you have dealt with this issue or are aware of any
> enhancements to Tika parser that help with specifying the rectangular
> region to extract text.
>
> I really appreciate any help you can provide.
>
> Many thanks,
>
> Tanya
