Nothing to add, Tim said it all :-)

Tilman

On 07.07.2020 at 22:51, Tim Allison wrote:
I defer to Tilman...

I get the same bad spacing out of Foxit Reader.  I tried sort by position, with no luck.  You might be able to twiddle with the spacingTolerance or the averageCharTolerance on this one file to get better results, but I don't think there is anything systematic/automatic that can be done.  OCR might be better?
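To illustrate why a tolerance setting can produce or remove the per-character spaces, here is a toy model of the decision a text stripper has to make between two neighbouring glyphs. It is a simplified sketch under stated assumptions, not PDFBox's actual code: the class `SpacingToleranceSketch`, its methods, and all the numbers are hypothetical. In PDFBox itself the corresponding knobs are `PDFTextStripper.setSpacingTolerance(float)` and `PDFTextStripper.setAverageCharTolerance(float)`.

```java
public class SpacingToleranceSketch {

    /**
     * Joins glyphs into a string, inserting a space wherever the horizontal
     * gap before a glyph reaches spaceWidth * spacingTolerance.
     * xs[i] is the left edge of glyph i and widths[i] is its width
     * (arbitrary units; this is an assumed, simplified rule).
     */
    static String join(char[] chars, float[] xs, float[] widths,
                       float spaceWidth, float spacingTolerance) {
        float threshold = spaceWidth * spacingTolerance;
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < chars.length; i++) {
            if (i > 0) {
                // Distance between the end of the previous glyph and the start of this one.
                float gap = xs[i] - (xs[i - 1] + widths[i - 1]);
                if (gap >= threshold) {
                    out.append(' ');
                }
            }
            out.append(chars[i]);
        }
        return out.toString();
    }

    /**
     * Lays out text with a fixed glyph width and a fixed extra gap between
     * glyphs (wide letter-spacing), then joins it with the rule above.
     */
    static String extract(String text, float glyphWidth, float letterSpacing,
                          float spaceWidth, float spacingTolerance) {
        char[] chars = text.toCharArray();
        float[] xs = new float[chars.length];
        float[] widths = new float[chars.length];
        for (int i = 0; i < chars.length; i++) {
            xs[i] = i * (glyphWidth + letterSpacing);
            widths[i] = glyphWidth;
        }
        return join(chars, xs, widths, spaceWidth, spacingTolerance);
    }

    public static void main(String[] args) {
        // Gap of 3 units against a threshold of 5 * 0.5 = 2.5: every gap counts as a space.
        System.out.println(extract("banks", 5f, 3f, 5f, 0.5f)); // prints "b a n k s"
        // Same layout against a threshold of about 4.0: no gap counts as a space.
        System.out.println(extract("banks", 5f, 3f, 5f, 0.8f)); // prints "banks"
    }
}
```

In this model a larger tolerance removes the spurious spaces, but it would equally swallow real word gaps of similar width, which is one reason tuning may only help on a single file rather than systematically.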

On Tue, Jul 7, 2020 at 10:58 AM Eric Pugh wrote:

    One of my PDFs has electronic text that, when extracted, has
    whitespace between each character.  So instead of “commercial”,
    “banks”, I get: c o m m e r c i a l b a n k s

    I’m attaching the extract file and the original PDF. Other files
    extract just fine.

    I thought the section on extra whitespace might be helpful:
    https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066,
    and then tried out PDFBox:

    java -jar pdfbox-app-2.0.20.jar ExtractText ./files/1634473.pdf

    That gives the same result as Tika.  Is there a setting I'm missing?

    Eric


    _______________________
    Eric Pugh | Founder & CEO | OpenSource Connections, LLC |
    434.466.1467 | http://www.opensourceconnections.com | My Free/Busy
    <http://tinyurl.com/eric-cal>
    Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
    <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
    This e-mail and all contents, including attachments, is considered
    to be Company Confidential unless explicitly stated otherwise,
    regardless of whether attachments are marked as such.

