Re: [PLUG] PDF-1.5 docs not searchable

2021-07-25 Thread Jason Barnett
Rich, I verified last night that the free version of Master PDF Editor
does include OCR. I installed it via the AUR. It works very well on the
text, but it also tries to convert the pictures to text, with some
humorous results.

Jason

- Sent from my pocket computing telecommunications device.  All typos and
poor communications will be blamed on the autocarrot function of said
device.

On Sun, Jul 25, 2021, 5:53 AM Rich Shepard wrote:

> On Sun, 25 Jul 2021, Jason Barnett wrote:
>
> > I believe you mentioned Master PDF editor. I believe it has OCR built-in,
> > or allows it as a plugin. If needed, a good OCR tool is Tesseract and is
> > likely in your distro's repository.
> > https://en.wikipedia.org/wiki/Tesseract_(software)
>
> Jason,
>
> Thank you. The most recent doc I viewed (that prompted my post) has multiple
> images per page; it's not all text. While I don't remember the few others
> that I could not search it's likely that they, too, had many images embedded
> within the text. So I assume they were all scanned (or produced by an
> equivalent process).
>
> I used to get scanned documents (such as permit copies) from clients and had
> no reason to run them through an OCR, but I'll keep that in mind for the
> future.
>
> Germane to MasterPDFEditor, I expect that its OCR capabilities are in the
> paid version, not the free one. And, yes, Tesseract is in the SBo repo.
>
> Stay well,
>
> Rich
>


Re: [PLUG] PDF-1.5 docs not searchable

2021-07-25 Thread Rich Shepard

On Sun, 25 Jul 2021, Jason Barnett wrote:


> I believe you mentioned Master PDF editor. I believe it has OCR built-in,
> or allows it as a plugin. If needed, a good OCR tool is Tesseract and is
> likely in your distro's repository.
> https://en.wikipedia.org/wiki/Tesseract_(software)


Jason,

Thank you. The most recent doc I viewed (that prompted my post) has multiple
images per page; it's not all text. While I don't remember the few others
that I could not search it's likely that they, too, had many images embedded
within the text. So I assume they were all scanned (or produced by an
equivalent process).

I used to get scanned documents (such as permit copies) from clients and had
no reason to run them through an OCR, but I'll keep that in mind for the
future.

Germane to MasterPDFEditor, I expect that its OCR capabilities are in the
paid version, not the free one. And, yes, Tesseract is in the SBo repo.

Stay well,

Rich


Re: [PLUG] PDF-1.5 docs not searchable

2021-07-25 Thread Jason Barnett
Rich,
I believe you mentioned Master PDF editor. I believe it has OCR built-in,
or allows it as a plugin. If needed, a good OCR tool is Tesseract and is
likely in your distro's repository.
https://en.wikipedia.org/wiki/Tesseract_(software)

Jason

On Sat, Jul 24, 2021 at 11:08 AM Tomas Kuchta wrote:

> They are not searchable because they do not contain text to search.
> Typically, they contain only images.
>
> The way I deal with it: I OCR the image, generate a text document, and place
> that text into a layer under the image in the output PDF.
>
> Having the text under the image layer preserves the original look of the
> PDF while allowing for search and select.
>
> I have seen PDFs with text over the image, obscuring it, as well as various
> attempts at making the text over the image invisible.
>
> Of course, OCR is not perfect, and neither is keeping the text in the exact
> position under the image. It mostly works for text, not so much for data
> extraction from tables, etc.
>
> I do not believe that there is an OK-ish free SW solution to this. I use
> commercial SW to do that. It works, but I cannot publicly recommend it due
> to their nasty commercial behavior: no respect for privacy, no sale, just
> licensing with built-in obsolescence.
>
> Tomas
>
> On Fri, Jul 23, 2021, 16:18 Rich Shepard wrote:
>
> > I've encountered a few PDF-1.5 docs that are not searchable using xpdf,
> > mupdf, okular, or MasterPDFEditor. Perhaps they're scanned and I don't know
> > how to determine if they are.
> >
> > My web searches found nothing relevant; my search terms might be
> > inefficient.
> >
> > Has anyone else experienced this?
> >
> > Rich
> >
> >
> >
>


Re: [PLUG] PDF-1.5 docs not searchable

2021-07-24 Thread Tomas Kuchta
They are not searchable because they do not contain text to search.
Typically, they contain only images.

The way I deal with it: I OCR the image, generate a text document, and place
that text into a layer under the image in the output PDF.

Having the text under the image layer preserves the original look of the
PDF while allowing for search and select.

I have seen PDFs with text over the image, obscuring it, as well as various
attempts at making the text over the image invisible.

Of course, OCR is not perfect, and neither is keeping the text in the exact
position under the image. It mostly works for text, not so much for data
extraction from tables, etc.

I do not believe that there is an OK-ish free SW solution to this. I use
commercial SW to do that. It works, but I cannot publicly recommend it due
to their nasty commercial behavior: no respect for privacy, no sale, just
licensing with built-in obsolescence.

Tomas
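
[Editor's sketch, not part of Tomas's message.] The OCR-plus-text-layer
workflow described above can be roughly approximated with Tesseract, which
is mentioned elsewhere in this thread. The sketch assumes the tesseract
binary is on PATH and that each page is already available as an image file
(a scanned PDF would first need its pages extracted, for example with
poppler's pdftoppm). The function names and file names are hypothetical.
Tesseract's "pdf" output keeps the page image and adds the recognized text
as an invisible layer, which may not match the exact layering Tomas's
commercial tool produces.

```python
import shutil
import subprocess


def tesseract_pdf_command(image_path, output_base, lang="eng"):
    """Build the Tesseract call; the trailing "pdf" config requests PDF output."""
    return ["tesseract", image_path, output_base, "-l", lang, "pdf"]


def make_searchable_pdf(image_path, output_base, lang="eng"):
    """OCR one page image and return the path of the searchable PDF."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    subprocess.run(tesseract_pdf_command(image_path, output_base, lang), check=True)
    # Tesseract appends the extension to the output base name itself.
    return output_base + ".pdf"


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(make_searchable_pdf("page-1.png", "page-1"))
```

As noted in the message above, OCR quality varies, so the result is worth
spot-checking by searching for a few words known to be on the page.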

On Fri, Jul 23, 2021, 16:18 Rich Shepard wrote:

> I've encountered a few PDF-1.5 docs that are not searchable using xpdf,
> mupdf, okular, or MasterPDFEditor. Perhaps they're scanned and I don't know
> how to determine if they are.
>
> My web searches found nothing relevant; my search terms might be
> inefficient.
>
> Has anyone else experienced this?
>
> Rich
>
>
>


Re: [PLUG] PDF-1.5 docs not searchable

2021-07-23 Thread Rich Shepard

On Fri, 23 Jul 2021, Randy Bush wrote:


> i hit this often. i just ocr them with pdfscanner on a mac. i would
> guess/hope there are lower cost tools.


randy,

Thanks.

Rich


Re: [PLUG] PDF-1.5 docs not searchable

2021-07-23 Thread Randy Bush
> I've encountered a few PDF-1.5 docs that are not searchable using
> xpdf, mupdf, okular, or MasterPDFEditor. Perhaps they're scanned
> and I don't know how to determine if they are.

i hit this often.  i just ocr them with pdfscanner on a mac.  i would
guess/hope there are lower cost tools.

randy


[PLUG] PDF-1.5 docs not searchable

2021-07-23 Thread Rich Shepard

I've encountered a few PDF-1.5 docs that are not searchable using xpdf,
mupdf, okular, or MasterPDFEditor. Perhaps they're scanned and I don't know
how to determine if they are.

My web searches found nothing relevant; my search terms might be
inefficient.

Has anyone else experienced this?

Rich
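
[Editor's sketch, not part of Rich's message.] One rough way to tell
whether a PDF is image-only is to try extracting its text and see whether
anything comes back. The sketch below assumes poppler's pdftotext is
installed and on PATH; the function names and the 20-character threshold
are arbitrary illustrative choices, not an established rule.

```python
import shutil
import subprocess
import sys


def has_text_layer(extracted_text, min_chars=20):
    """True if the extracted text holds at least min_chars non-whitespace characters."""
    visible = "".join(extracted_text.split())  # drops spaces, newlines, form feeds
    return len(visible) >= min_chars


def extract_text(pdf_path):
    """Run pdftotext and return whatever text it finds ("-" sends it to stdout)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler) not found on PATH")
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    for path in sys.argv[1:]:
        if has_text_layer(extract_text(path)):
            print(f"{path}: has a text layer")
        else:
            print(f"{path}: looks image-only (probably scanned)")
```

A file that yields little or no text is a likely candidate for OCR. A file
that mixes real text with scanned pages can still pass this check, so it is
only a first-pass heuristic.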