Re: Getting white space between characters in PDF extraction.

Tim Allison Tue, 07 Jul 2020 13:52:21 -0700

I defer to Tilman...

I get the same bad spacing out of Foxit Reader, so the problem looks to be
in the PDF itself rather than in Tika or PDFBox.  I tried sortByPosition,
with no luck.  You might be able to tweak spacingTolerance or
averageCharTolerance for this one file to get better results, but I
don't think there is anything systematic/automatic that can be done.  OCR
might work better?
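
Fixing the spacing may not be automatable, but spotting affected files so
they can be retried with other tolerances or routed to OCR probably is.
Below is a rough sketch of that idea, not anything Tika or PDFBox provides:
the class name, method names, and the 0.8 threshold are all hypothetical,
and it assumes the extracted text is already available as a String.

```java
public class SpacedTextCheck {

    // Fraction of whitespace-separated tokens that are exactly one character long.
    public static double singleCharTokenRatio(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] tokens = trimmed.split("\\s+");
        int single = 0;
        for (String token : tokens) {
            // Count code points so a surrogate pair is treated as one character.
            if (token.codePointCount(0, token.length()) == 1) {
                single++;
            }
        }
        return (double) single / tokens.length;
    }

    // The threshold is an assumption; tune it against files known to be good and bad.
    public static boolean looksLetterSpaced(String text, double threshold) {
        return singleCharTokenRatio(text) >= threshold;
    }

    public static void main(String[] args) {
        String bad = "c o m m e r c i a l b a n k s";
        String good = "commercial banks";
        System.out.println(looksLetterSpaced(bad, 0.8));
        System.out.println(looksLetterSpaced(good, 0.8));
    }
}
```

This only detects the symptom.  It cannot restore the original word
boundaries, since the real spaces are indistinguishable from the spurious
ones once the text has been extracted.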


On Tue, Jul 7, 2020 at 10:58 AM Eric Pugh <[email protected]>
wrote:

> One of my PDFs has electronic text that, when extracted, has white space
> between each character.  So instead of “commercial”, “banks”, I get:
> c o m m e r c i a l b a n k s
>
> I’m attaching the extract file and the original PDF.  Other files extract
> just fine.
>
>
> I thought the section on extra whitespace might be helpful:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066,
> and then tried out PDFBox:
>
> java -jar pdfbox-app-2.0.20.jar ExtractText ./files/1634473.pdf
>
> There I get the same result as with Tika.  Is there a setting I should try?
>
> Eric
>
>
> _______________________
> *Eric Pugh **| *Founder & CEO | OpenSource Connections, LLC | 434.466.1467
> | http://www.opensourceconnections.com | My Free/Busy
> <http://tinyurl.com/eric-cal>
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless
> of whether attachments are marked as such.
>
>
