Re: Getting white space between characters in PDF extraction.

Tim Allison Tue, 07 Jul 2020 14:11:22 -0700

https://fraser.stlouisfed.org/title/statements-speeches-andrew-f-brimmer-463/financial-innovation-monetary-management-united-states-10364/fulltext


LOL...

But notice that Google is likely running OCR on this...look at the
snippets:
https://www.google.com/search?sxsrf=ALeKk02ITUhzxSZzvMyDn9FRuHrgdScZ8A%3A1594155831823&ei=N-MEX4jtMcnt-gTr-JOgAw&q=%22financial+innovation+and+monetary+management+in+the+united+states%22&oq=%22financial+innovation+and+monetary+management+in+the+united+states%22&gs_lcp=CgZwc3ktYWIQAzoECAAQRzoECCMQJzoLCAAQsQMQgwEQkQI6BQgAEJECOggIABCxAxCDAToFCAAQsQM6AggAOgQIABBDOgcIIxDqAhAnOgcIABCxAxBDOggIABCxAxCRAjoGCAAQFhAeOggIIRAWEB0QHjoFCCEQoAFQmr4BWKK1AmDCtgJoA3ABeACAAc8BiAHWT5IBBjAuNjkuMZgBAKABAaoBB2d3cy13aXqwAQo&sclient=psy-ab&ved=0ahUKEwjI5MmghbzqAhXJtp4KHWv8BDQQ4dUDCAw&uact=5

You can certainly run tika-eval on the text extracted from that file to
automatically categorize it as "likely junk" based on its high
out-of-vocabulary (OOV) score, and then run OCR on it.

See
https://github.com/tballison/share/blob/main/slides/activate19/Activate2019_tika_tallison_20190911.pptx
for examples of this.
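
As a rough illustration of the "likely junk" idea above, here is a minimal
sketch of an out-of-vocabulary check. It is not tika-eval's implementation:
the class name, the tiny word list, and the 0.5 cutoff are all assumptions
made up for this example, and tika-eval uses its own per-language word
lists and statistics.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class OovCheck {
    // Hypothetical toy vocabulary, for illustration only.
    static final Set<String> VOCAB = new HashSet<>(Arrays.asList(
            "the", "of", "and", "in", "to", "commercial", "banks",
            "financial", "innovation", "monetary", "management",
            "united", "states"));

    // Hypothetical cutoff: more than half the tokens unknown means "likely junk".
    static final double JUNK_THRESHOLD = 0.5;

    // Fraction of alphabetic tokens that are not in VOCAB (0.0 for empty input).
    public static double oovRatio(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        int total = 0;
        int oov = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            total++;
            if (!VOCAB.contains(token)) {
                oov++;
            }
        }
        return total == 0 ? 0.0 : (double) oov / total;
    }

    // True when the text should be routed to OCR under this toy heuristic.
    public static boolean likelyJunk(String text) {
        return oovRatio(text) > JUNK_THRESHOLD;
    }

    public static void main(String[] args) {
        // Spaced-out extraction, as reported in this thread.
        System.out.println(likelyJunk("c o m m e r c i a l b a n k s"));
        // Cleanly extracted text.
        System.out.println(likelyJunk("commercial banks in the united states"));
    }
}
```

Text extracted with a space between every character turns into a stream of
single-letter tokens, none of which match the word list, so the ratio goes
to 1.0 and the file would be flagged for OCR. Cleanly extracted text stays
near 0.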

On Tue, Jul 7, 2020 at 4:51 PM Tim Allison <[email protected]> wrote:

> I defer to Tilman...
>
> I get the same bad spacing out of Foxit Reader.  I tried sort by position,
> with no luck.  You might be able to twiddle with the spacingTolerance or
> the averageCharTolerance on this one file to get better results, but I
> don't think there is anything systematic/automatic that can be done.  OCR
> might be better?
>
> On Tue, Jul 7, 2020 at 10:58 AM Eric Pugh <[email protected]>
> wrote:
>
>> One of my PDFs has electronic text that, when extracted, has white space
>> between each character.  So instead of “commercial”, “banks”, I get: c o m
>> m e r c i a l b a n k s
>>
>> I’m attaching the extract file and the original PDF.  Other files extract
>> just fine.
>>
>>
>> I thought the section on extra whitespace might be helpful:
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066,
>> and then tried out PDFBox:
>>
>> java -jar pdfbox-app-2.0.20.jar ExtractText ./files/1634473.pdf
>>
>> where I get the same result as Tika.  Is there a setting I should change?
>>
>> Eric
>>
>>
>> _______________________
>> *Eric Pugh **| *Founder & CEO | OpenSource Connections, LLC | 434.466.1467
>> | http://www.opensourceconnections.com | My Free/Busy
>> <http://tinyurl.com/eric-cal>
>> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
>> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>> This e-mail and all contents, including attachments, is considered to be
>> Company Confidential unless explicitly stated otherwise, regardless
>> of whether attachments are marked as such.
>>
>>
