https://fraser.stlouisfed.org/title/statements-speeches-andrew-f-brimmer-463/financial-innovation-monetary-management-united-states-10364/fulltext
LOL... But notice that Google is likely running OCR on this...look at the snippets:
https://www.google.com/search?sxsrf=ALeKk02ITUhzxSZzvMyDn9FRuHrgdScZ8A%3A1594155831823&ei=N-MEX4jtMcnt-gTr-JOgAw&q=%22financial+innovation+and+monetary+management+in+the+united+states%22&oq=%22financial+innovation+and+monetary+management+in+the+united+states%22&gs_lcp=CgZwc3ktYWIQAzoECAAQRzoECCMQJzoLCAAQsQMQgwEQkQI6BQgAEJECOggIABCxAxCDAToFCAAQsQM6AggAOgQIABBDOgcIIxDqAhAnOgcIABCxAxBDOggIABCxAxCRAjoGCAAQFhAeOggIIRAWEB0QHjoFCCEQoAFQmr4BWKK1AmDCtgJoA3ABeACAAc8BiAHWT5IBBjAuNjkuMZgBAKABAaoBB2d3cy13aXqwAQo&sclient=psy-ab&ved=0ahUKEwjI5MmghbzqAhXJtp4KHWv8BDQQ4dUDCAw&uact=5

You can certainly use tika-eval on the extracted text from that file to
automatically categorize it as "likely junk" by the high out of vocabulary
score, and then run OCR on it. See
https://github.com/tballison/share/blob/main/slides/activate19/Activate2019_tika_tallison_20190911.pptx
for examples of this.

On Tue, Jul 7, 2020 at 4:51 PM Tim Allison <[email protected]> wrote:

> I defer to Tilman...
>
> I get the same bad spacing out of Foxit Reader. I tried sort by position,
> with no luck. You might be able to twiddle with the spacingTolerance or
> the averageCharTolerance on this one file to get better results, but I
> don't think there is anything systematic/automatic that can be done. OCR
> might be better?
>
> On Tue, Jul 7, 2020 at 10:58 AM Eric Pugh <[email protected]>
> wrote:
>
>> One of my PDFs has an electronic text that when extracted has white space
>> between each character. So instead of “commercial”, “banks”, I get: c o m
>> m e r c i a l b a n k s
>>
>> I’m attaching the extract file and the original PDF. Other files extract
>> just fine.
>>
>>
>> I thought the section on extra whitespace might be helpful:
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066,
>> and then tried out PDFBox:
>>
>> java -jar pdfbox-app-2.0.20.jar ExtractText ./files/1634473.pdf
>>
>> Where I get the same result as Tika. Is there a setting, maybe?
>>
>> Eric
>>
>>
>> _______________________
>> *Eric Pugh **| *Founder & CEO | OpenSource Connections, LLC | 434.466.1467
>> | http://www.opensourceconnections.com | My Free/Busy
>> <http://tinyurl.com/eric-cal>
>> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
>> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>> This e-mail and all contents, including attachments, is considered to be
>> Company Confidential unless explicitly stated otherwise, regardless
>> of whether attachments are marked as such.
>>
>>
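
For the "flag it as likely junk, then OCR it" idea in this thread, a rough pre-filter could look something like the sketch below. To be clear, this is not tika-eval's out-of-vocabulary calculation; it is a much simpler stand-in heuristic that only measures how many alphabetic tokens are a single character long, which is the symptom described above ("c o m m e r c i a l b a n k s"). The function names and the 0.5 threshold are assumptions made up for illustration and would need tuning on real extracts.

```python
import re


def single_char_token_ratio(text: str) -> float:
    """Return the fraction of alphabetic tokens that are one character long."""
    # [^\W\d_]+ matches runs of letters only (no digits, no underscores).
    tokens = re.findall(r"[^\W\d_]+", text)
    if not tokens:
        return 0.0
    single = sum(1 for token in tokens if len(token) == 1)
    return single / len(tokens)


def looks_letter_spaced(text: str, threshold: float = 0.5) -> bool:
    """Guess whether extracted text has a space between every character.

    The threshold is an assumed starting point, not a calibrated value.
    """
    return single_char_token_ratio(text) >= threshold


if __name__ == "__main__":
    bad = "c o m m e r c i a l b a n k s"
    good = "commercial banks hold a large share of deposits"
    for sample in (bad, good):
        flag = "send to OCR" if looks_letter_spaced(sample) else "keep extract"
        print(f"{single_char_token_ratio(sample):.3f}  {flag}  {sample!r}")
```

Files that trip the check could then be routed to OCR, while everything else keeps the normal text extraction.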
