Nothing to add, you said it all :-)
Tilman
On 07.07.2020 at 22:51, Tim Allison wrote:
I defer to Tilman...
I get the same bad spacing out of Foxit Reader. I tried sort by
position, with no luck. You might be able to twiddle with the
spacingTolerance or the averageCharTolerance on this one file to get
better results, but I don't think there is anything
systematic/automatic that can be done. OCR might be better?
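
To illustrate why those two tolerances matter, here is a minimal sketch of the kind of gap test a text stripper applies between neighbouring glyphs. It is a simplified model written for this thread, not PDFBox's actual code: the class, the methods `layout` and `join`, and all the numbers are hypothetical. In PDFBox itself the corresponding knobs are `PDFTextStripper.setSpacingTolerance(float)` and `PDFTextStripper.setAverageCharTolerance(float)`.

```java
class SpacingToleranceSketch {

    /** Hypothetical helper: left edge of each glyph when all glyphs share one width and one gap. */
    static float[] layout(int count, float width, float gap) {
        float[] xs = new float[count];
        for (int i = 0; i < count; i++) {
            xs[i] = i * (width + gap);
        }
        return xs;
    }

    /**
     * Joins the characters of text, inserting a space whenever a glyph starts
     * further right than the end of the previous glyph plus a threshold.
     * Assumed model: threshold = min(spaceWidth * spacingTolerance,
     * averageGlyphWidth * averageCharTolerance).
     */
    static String join(String text, float[] xs, float width, float spaceWidth,
                       float spacingTolerance, float averageCharTolerance) {
        StringBuilder out = new StringBuilder();
        float endOfLast = 0f;
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) {
                float deltaSpace = spaceWidth * spacingTolerance;
                // Every glyph has the same width here, so the average is just width.
                float deltaChar = width * averageCharTolerance;
                float expectedNextWordX = endOfLast + Math.min(deltaSpace, deltaChar);
                if (xs[i] > expectedNextWordX) {
                    out.append(' ');
                }
            }
            out.append(text.charAt(i));
            endOfLast = xs[i] + width;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String word = "commercial";
        float width = 5f;        // made-up glyph width
        float spaceWidth = 2.5f; // made-up width of the font's space character

        // Normal letter spacing, tolerances 0.5 and 0.3.
        System.out.println(join(word, layout(word.length(), width, 0.5f), width, spaceWidth, 0.5f, 0.3f));
        // Widely tracked letters, same tolerances: every gap looks like a word break.
        System.out.println(join(word, layout(word.length(), width, 3f), width, spaceWidth, 0.5f, 0.3f));
        // Same tracked letters, both tolerances raised.
        System.out.println(join(word, layout(word.length(), width, 3f), width, spaceWidth, 2.0f, 1.0f));
    }
}
```

In this toy model the tracked word comes out with a space after every letter until both tolerances are raised. The catch is that larger tolerances can swallow genuine word gaps as well, which is why this is per-file tuning rather than a general fix.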
On Tue, Jul 7, 2020 at 10:58 AM Eric Pugh wrote:
One of my PDFs has electronic text that, when extracted, has
whitespace between every character. So instead of “commercial”,
“banks”, I get: c o m m e r c i a l b a n k s
I’m attaching the extract file and the original PDF. Other files
extract just fine.
I thought the section on extra whitespace might be helpful:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066,
and then tried out PDFBox:
java -jar pdfbox-app-2.0.20.jar ExtractText ./files/1634473.pdf
where I get the same result as Tika. Is there a setting I'm missing, maybe?
Eric