Re: [PLUG] PDF-1.5 docs not searchable
Rich,

I did verify last night, and the free version of Master PDF Editor does include OCR. I installed it via the AUR. It works very well on the text, but it also tries to convert the pictures to text, with some humorous results.

Jason

- Sent from my pocket computing telecommunications device. All typos and poor communications will be blamed on the autocarrot function of said device.

On Sun, Jul 25, 2021, 5:53 AM Rich Shepard wrote:

> On Sun, 25 Jul 2021, Jason Barnett wrote:
>
> > I believe you mentioned Master PDF editor. I believe it has OCR built-in,
> > or allows it as a plugin. If needed, a good OCR tool is Tesseract and is
> > likely in your distro's repository.
> > https://en.wikipedia.org/wiki/Tesseract_(software)
>
> Jason,
>
> Thank you. The most recent doc I viewed (that prompted my post) has
> multiple images per page; it's not all text. While I don't remember the
> few others that I could not search, it's likely that they, too, had many
> images embedded within the text. So I assume they were all scanned (or
> produced by an equivalent process).
>
> I used to get scanned documents (such as permit copies) from clients and
> had no reason to run them through an OCR, but I'll keep that in mind for
> the future.
>
> Germane to MasterPDFEditor, I expect that its OCR capabilities are in the
> paid version, not the free one. And, yes, Tesseract is in the SBo repo.
>
> Stay well,
>
> Rich
Re: [PLUG] PDF-1.5 docs not searchable
On Sun, 25 Jul 2021, Jason Barnett wrote:

> I believe you mentioned Master PDF editor. I believe it has OCR built-in,
> or allows it as a plugin. If needed, a good OCR tool is Tesseract and is
> likely in your distro's repository.
> https://en.wikipedia.org/wiki/Tesseract_(software)

Jason,

Thank you. The most recent doc I viewed (that prompted my post) has multiple images per page; it's not all text. While I don't remember the few others that I could not search, it's likely that they, too, had many images embedded within the text. So I assume they were all scanned (or produced by an equivalent process).

I used to get scanned documents (such as permit copies) from clients and had no reason to run them through an OCR, but I'll keep that in mind for the future.

Germane to MasterPDFEditor, I expect that its OCR capabilities are in the paid version, not the free one. And, yes, Tesseract is in the SBo repo.

Stay well,

Rich
Re: [PLUG] PDF-1.5 docs not searchable
Rich,

I believe you mentioned Master PDF editor. I believe it has OCR built-in, or allows it as a plugin. If needed, a good OCR tool is Tesseract and is likely in your distro's repository.
https://en.wikipedia.org/wiki/Tesseract_(software)

Jason

On Sat, Jul 24, 2021 at 11:08 AM Tomas Kuchta wrote:

> They are not searchable because they do not contain text to search.
> Typically, they contain only images.
>
> The way I deal with it: I OCR the image, generate a text document, and
> place that text in a layer under the image in the output PDF.
>
> Having the text under the image layer preserves the original look of the
> PDF while allowing for search and select.
>
> I have seen PDFs with text over the image, obscuring it, as well as
> various attempts at making the text over the image invisible.
>
> Of course, OCR is not perfect, and neither is keeping the text in the
> exact position under the image. It mostly works for text, not so much for
> data extraction from tables, etc.
>
> I do not believe that there is an OK-ish free SW solution to this. I use
> commercial SW to do that. It works, but I cannot publicly recommend it due
> to their nasty commercial behavior: no respect for privacy, and no sale,
> just licensing with built-in obsolescence.
>
> Tomas
>
> On Fri, Jul 23, 2021, 16:18 Rich Shepard wrote:
>
> > I've encountered a few PDF-1.5 docs that are not searchable using xpdf,
> > mupdf, okular, or MasterPDFEditor. Perhaps they're scanned and I don't
> > know how to determine if they are.
> >
> > My web searches found nothing relevant; my search terms might be
> > inefficient.
> >
> > Has anyone else experienced this?
> >
> > Rich
Re: [PLUG] PDF-1.5 docs not searchable
They are not searchable because they do not contain text to search. Typically, they contain only images.

The way I deal with it: I OCR the image, generate a text document, and place that text in a layer under the image in the output PDF.

Having the text under the image layer preserves the original look of the PDF while allowing for search and select.

I have seen PDFs with text over the image, obscuring it, as well as various attempts at making the text over the image invisible.

Of course, OCR is not perfect, and neither is keeping the text in the exact position under the image. It mostly works for text, not so much for data extraction from tables, etc.

I do not believe that there is an OK-ish free SW solution to this. I use commercial SW to do that. It works, but I cannot publicly recommend it due to their nasty commercial behavior: no respect for privacy, and no sale, just licensing with built-in obsolescence.

Tomas

On Fri, Jul 23, 2021, 16:18 Rich Shepard wrote:

> I've encountered a few PDF-1.5 docs that are not searchable using xpdf,
> mupdf, okular, or MasterPDFEditor. Perhaps they're scanned and I don't
> know how to determine if they are.
>
> My web searches found nothing relevant; my search terms might be
> inefficient.
>
> Has anyone else experienced this?
>
> Rich
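The workflow described above (OCR each page image, then keep the image and attach the recognized text as a hidden layer) can be roughed out with free tools. This is only a minimal sketch under stated assumptions, not the commercial software mentioned: it assumes `pdftoppm` and `pdfunite` (from poppler) and `tesseract` are installed, and the function names `tesseract_commands` and `make_searchable` are made up for illustration. As far as I know, Tesseract's `pdf` output keeps the page image and adds the text as an invisible layer rather than literally underneath, which gives the same search-and-select effect.

```python
import glob
import os
import shutil
import subprocess
import tempfile


def tesseract_commands(page_images, workdir, lang="eng"):
    """Return (command, expected_pdf) pairs, one per page image.

    "tesseract IMAGE OUTPUTBASE -l LANG pdf" writes OUTPUTBASE.pdf.
    """
    pairs = []
    for image in page_images:
        stem = os.path.splitext(os.path.basename(image))[0]
        base = os.path.join(workdir, stem)
        pairs.append((["tesseract", image, base, "-l", lang, "pdf"], base + ".pdf"))
    return pairs


def make_searchable(input_pdf, output_pdf, dpi=300):
    """Rasterize each page, OCR it, and merge the per-page PDFs (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as workdir:
        prefix = os.path.join(workdir, "page")
        # pdftoppm writes page-1.png, page-2.png, ... (zero-padded for long documents)
        subprocess.run(["pdftoppm", "-r", str(dpi), "-png", input_pdf, prefix], check=True)
        images = sorted(glob.glob(prefix + "-*.png"))
        page_pdfs = []
        for command, page_pdf in tesseract_commands(images, workdir):
            subprocess.run(command, check=True)
            page_pdfs.append(page_pdf)
        if len(page_pdfs) == 1:
            shutil.copyfile(page_pdfs[0], output_pdf)
        else:
            subprocess.run(["pdfunite", *page_pdfs, output_pdf], check=True)
```

Calling `make_searchable("scan.pdf", "scan-ocr.pdf")` (both file names are placeholders) should leave the original untouched and write a searchable copy. Expect the caveats already noted: recognition errors, and poor results on tables.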
Re: [PLUG] PDF-1.5 docs not searchable
On Fri, 23 Jul 2021, Randy Bush wrote:

> i hit this often. i just ocr them with pdfscanner on a mac. i would
> guess/hope there are lower cost tools.

randy,

Thanks.

Rich
Re: [PLUG] PDF-1.5 docs not searchable
> I've encountered a few PDF-1.5 docs that are not searchable using
> xpdf, mupdf, okular, or MasterPDFEditor. Perhaps they're scanned
> and I don't know how to determine if they are.

i hit this often. i just ocr them with pdfscanner on a mac. i would guess/hope there are lower cost tools.

randy
[PLUG] PDF-1.5 docs not searchable
I've encountered a few PDF-1.5 docs that are not searchable using xpdf, mupdf, okular, or MasterPDFEditor. Perhaps they're scanned and I don't know how to determine if they are.

My web searches found nothing relevant; my search terms might be inefficient.

Has anyone else experienced this?

Rich
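On the question of how to determine whether a PDF is scanned: one rough check is to extract its text and see whether anything comes back. The sketch below assumes poppler's `pdftotext` is installed; the function names and the 20-character threshold are arbitrary choices for illustration, not an established rule. A file that yields almost no text is probably image-only.

```python
import shutil
import subprocess


def has_text_layer(extracted_text, min_chars=20):
    """True if the extracted text has at least min_chars non-whitespace characters.

    pdftotext separates pages with form feeds, which split() treats as whitespace.
    """
    visible = "".join(extracted_text.split())
    return len(visible) >= min_chars


def extract_text(pdf_path):
    """Run "pdftotext FILE -" and return what it writes to stdout."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; it ships with poppler")
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def probably_scanned(pdf_path):
    """Guess that a PDF is image-only when almost no text can be extracted."""
    return not has_text_layer(extract_text(pdf_path))
```

A mixed document (some real text plus scanned pages) will pass this check while still having pages that cannot be searched, so a per-page variant may be needed for those.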