I defer to Tilman, but I get the same bad spacing out of Foxit Reader. I tried sorting by position, with no luck. You might be able to tweak spacingTolerance or averageCharTolerance for this one file to get better results, but I don't think there is anything systematic or automatic that can be done. OCR might work better?
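To give a feel for why those two tolerances matter, here is a rough, self-contained sketch of the kind of gap test a text stripper applies between neighbouring glyphs. It is a simplified model for illustration, not PDFBox's actual code: the class `SpacingToleranceSketch`, the `Glyph` type, the `layout` helper and all the widths are made up, and the only assumption borrowed from PDFBox is that a space is emitted when the next glyph starts beyond the smaller of `spaceWidth * spacingTolerance` and `averageCharWidth * averageCharTolerance`. In PDFBox itself the corresponding knobs are `PDFTextStripper.setSpacingTolerance(float)` and `setAverageCharTolerance(float)`.

```java
public class SpacingToleranceSketch {

    /** One glyph on a line: character, left edge and width (made-up units). */
    static final class Glyph {
        final char ch;
        final float x;
        final float width;

        Glyph(char ch, float x, float width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /** Hypothetical helper: lay out a word with a fixed gap after every glyph. */
    static Glyph[] layout(String word, float glyphWidth, float gap) {
        Glyph[] glyphs = new Glyph[word.length()];
        float x = 0f;
        for (int i = 0; i < word.length(); i++) {
            glyphs[i] = new Glyph(word.charAt(i), x, glyphWidth);
            x += glyphWidth + gap;
        }
        return glyphs;
    }

    /**
     * Simplified word-break heuristic: emit a space when the next glyph starts
     * further right than the end of the previous glyph plus the smaller of the
     * two tolerance-scaled distances.
     */
    static String extract(Glyph[] glyphs, float spaceWidth,
                          float spacingTolerance, float averageCharTolerance) {
        StringBuilder out = new StringBuilder();
        float widthSum = 0f;
        float endOfLast = 0f;
        for (int i = 0; i < glyphs.length; i++) {
            Glyph g = glyphs[i];
            widthSum += g.width;
            float averageCharWidth = widthSum / (i + 1);
            if (i > 0) {
                float deltaSpace = spaceWidth * spacingTolerance;
                float deltaChar = averageCharWidth * averageCharTolerance;
                float expectedStartOfNextWord = endOfLast + Math.min(deltaSpace, deltaChar);
                if (g.x > expectedStartOfNextWord) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            endOfLast = g.x + g.width;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Letter-spaced text: every glyph is 5 wide and followed by a gap of 2.
        Glyph[] glyphs = layout("banks", 5f, 2f);
        // Tolerances of 0.5 and 0.3: the gap exceeds the threshold.
        System.out.println(extract(glyphs, 3f, 0.5f, 0.3f));
        // Larger tolerances: the same gap no longer counts as a space.
        System.out.println(extract(glyphs, 3f, 1.0f, 1.0f));
    }
}
```

The catch is the one already mentioned: if the gap between letters is about as wide as the gap between words, any tolerance large enough to glue the letters back together will also swallow the real word breaks, so values that help this one file are unlikely to be safe as a general default.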
On Tue, Jul 7, 2020 at 10:58 AM Eric Pugh <[email protected]> wrote:

> One of my PDFs has an electronic text that when extracted has white space
> between each character. So instead of “commercial”, “banks”, I get:
>
> c o m m e r c i a l b a n k s
>
> I’m attaching the extract file and the original PDF. Other files extract
> just fine.
>
> I thought the section on extra whitespace might be helpful:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066,
> and then tried out PDFbox:
>
> java -jar pdfbox-app-2.0.20.jar ExtractText ./files/1634473.pdf
>
> Where I get the same result as Tika. Setting maybe?
>
> Eric
>
> _______________________
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467
> | http://www.opensourceconnections.com | My Free/Busy
> <http://tinyurl.com/eric-cal>
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
