Re: Getting white space between characters in PDF extraction.

Tilman Hausherr Tue, 07 Jul 2020 21:00:50 -0700

Nothing to add, you said it all :-)

Tilman


On 07.07.2020 at 22:51, Tim Allison wrote:

I defer to Tilman...

I get the same bad spacing out of Foxit Reader. I tried sort by position, with no luck. You might be able to twiddle with the spacingTolerance or the averageCharTolerance on this one file to get better results, but I don't think there is anything systematic/automatic that can be done. OCR might be better?
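
Since tuning the tolerances is a per-file exercise, one thing that can be automated is spotting the affected files so they can be routed to OCR or to a hand-tuned extraction. What follows is only a rough sketch of that idea, not anything shipped in Tika or PDFBox: the class name, the method names and the 0.8 threshold are all assumptions made up for illustration. It measures the share of whitespace-separated tokens that are a single character long, which is close to 1.0 for output like "c o m m e r c i a l b a n k s".

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Tika or PDFBox. */
final class SpacedTextDetector {

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Assumed cut-off, chosen for illustration only; tune it on your own corpus.
    static final double DEFAULT_THRESHOLD = 0.8;

    /** Returns the fraction of whitespace-separated tokens that are exactly one character long. */
    static double singleCharTokenRatio(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] tokens = WHITESPACE.split(trimmed);
        int singleCharTokens = 0;
        for (String token : tokens) {
            // Count code points so that a surrogate pair is treated as one character.
            if (token.codePointCount(0, token.length()) == 1) {
                singleCharTokens++;
            }
        }
        return (double) singleCharTokens / tokens.length;
    }

    /** Flags text in which nearly every token is a single character. */
    static boolean looksCharacterSpaced(String text) {
        return singleCharTokenRatio(text) >= DEFAULT_THRESHOLD;
    }

    public static void main(String[] args) {
        // The symptom reported in this thread.
        System.out.println(looksCharacterSpaced("c o m m e r c i a l b a n k s"));
        // Ordinary extracted text for comparison.
        System.out.println(looksCharacterSpaced("commercial banks extract just fine"));
    }
}
```

Very short strings can trip the check by accident, so in practice it would make sense to apply it per page or per document rather than per line.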

On Tue, Jul 7, 2020 at 10:58 AM Eric Pugh <[email protected]> wrote:


    One of my PDFs has an electronic text that when extracted has
    white space between each character.  So instead of “commercial”,
    “banks”, I get: c o m m e r c i a l b a n k s

    I’m attaching the extract file and the original PDF. Other files
    extract just fine.

    I thought the section on extra whitespace might be helpful:
    https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066,
    and then tried out PDFbox:

    java -jar pdfbox-app-2.0.20.jar ExtractText ./files/1634473.pdf

    Where I get the same result as Tika.  Setting maybe?

    Eric


    _______________________
    Eric Pugh | Founder & CEO | OpenSource Connections, LLC |
    434.466.1467 | http://www.opensourceconnections.com | My Free/Busy
    <http://tinyurl.com/eric-cal>
    Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
    <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
    This e-mail and all contents, including attachments, is considered
    to be Company Confidential unless explicitly stated otherwise,
    regardless of whether attachments are marked as such.
