One way to manage this is to produce page-image-based PDF files that
incorporate a text overlay layer.

That way, when a user opens such a PDF for viewing, they see the original
scans (color, deformations, noise and all), while at the same time they can
select and copy/paste the same content as text, thanks to the embedded text
overlay. The same goes for screen readers: the invisible text overlay is
what gets read aloud by your computer (if you have software with these
capabilities installed and set up).

So the good news is: you're on the right track.

The other bit is: this PDF processing stuff, or whatever similar process
you may find useful (e.g. closed captions for video-based inputs), is
entirely outside the remit of tesseract.

Tesseract is the component meant to hand you raw OCR results, i.e. the raw
machine-produced text, given a suitably **preprocessed** input image (like
the Gaussian-thresholded one you showed as your second image). You are then
supposed to take that output (raw text, plus possibly the pixel coordinates
where tesseract located each bit of text in the image) and apply whichever
postprocessing you deem fit. That may include text cleanup through spell
checking or other, more sophisticated means, and maybe feeding the result
to a tool which combines this data with the original scans to produce such
a PDF.
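
To make the "raw text plus pixel coordinates" output concrete, here is a
minimal sketch that reads tesseract's TSV output format and keeps only the
recognized words with their bounding boxes. The sample data is invented for
illustration, and the function name `words_with_boxes` and the `min_conf`
cutoff are assumptions of this sketch, not part of tesseract.

```python
import csv
import io

# Hypothetical sample in the TSV column layout tesseract writes;
# the values below are made up for illustration.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t96.5\tHello\n"
    "5\t1\t1\t1\t1\t2\t104\t92\t70\t24\t41.0\tw0rld\n"
)


def words_with_boxes(tsv_text, min_conf=0.0):
    """Return (text, (left, top, width, height), conf) for each word row."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row["text"] or "").strip()
        # Level 5 rows are words; lower levels describe the page, blocks,
        # paragraphs and lines and carry no text of their own.
        if row["level"] != "5" or not text:
            continue
        conf = float(row["conf"])
        if conf < min_conf:
            continue
        box = (
            int(row["left"]),
            int(row["top"]),
            int(row["width"]),
            int(row["height"]),
        )
        words.append((text, box, conf))
    return words
```

Low-confidence words (such as the second one in the sample) are the obvious
candidates for the spell-checking or other cleanup step, and the boxes are
what a PDF-building tool needs to position the invisible text.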

Tesseract has some features to produce a PDF and similar output, but don't
get confused by this: tesseract's "core business", so to speak, is
transforming **preprocessed** image inputs into raw text + pixel
coordinates.
Tesseract only offers *some* input image preprocessing, image thresholding
and various output file format options to give users a basic starting
point for ease of use, but these options are, in my mind at least, only
there as a "minimum viable product demo" so you'll be able to get
something reasonably believable quickly, before you go and do the rest of
the tough stuff. ;-)
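
As one possible use of those built-in output options, the sketch below
builds a tesseract command line that OCRs the *preprocessed* image and
writes a PDF holding only the invisible text layer, via the `textonly_pdf`
config variable. The file names and helper names are hypothetical, and
merging that text layer with a PDF of the original color scan is left to a
separate tool; this is a sketch under those assumptions, not a complete
pipeline.

```python
import subprocess


def textonly_pdf_command(preprocessed_image, output_base):
    """Build the argument list; nothing is executed here."""
    # CLI shape: tesseract <image> <outputbase> [options] [configfile]
    return [
        "tesseract",
        preprocessed_image,
        output_base,
        "-c", "textonly_pdf=1",  # emit the text layer without the page image
        "pdf",                   # config file selecting PDF output
    ]


def run_ocr(preprocessed_image, output_base):
    """Run tesseract (must be installed and on PATH); return the PDF path."""
    subprocess.run(
        textonly_pdf_command(preprocessed_image, output_base), check=True
    )
    return output_base + ".pdf"
```

For the overlay to line up, the preprocessed image presumably needs the same
pixel dimensions and resolution as the original scan, so any cropping or
deskewing would have to be applied to both.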

Regrettably, I don't have a clearer, more actionable answer for you: while
there are (open source) solutions for this type of process already out
there, none work to my satisfaction, so consider this an "ongoing research
effort", a.k.a. you'll have more to do and find out yourself.

One option *may* be combining tesseract with MuPDF (from the makers of
Ghostscript). While it's not satisfactory yet for **my purposes**, it may
bring you that much closer to your goal: bleeding-edge MuPDF (think git
repository master branch, not software releases) incorporates code to load
a PDF (e.g. one produced by your book scanner hardware + software, hence
carrying only page images), feed those page images to a linked-in
tesseract library, and then place the tesseract text output in the output
PDF file.
While it is slow going here due to circumstances, that's the workflow I
hope to adjust and augment to suit my needs, and maybe it can already
serve yours. (One limitation is that the image preprocessing is not
user-scriptable. YMMV.)

There are other toolchains out there already doing this (some Python-based
stuff, e.g.), so be sure to check around.


The key takeaway: think of tesseract as one tool in a whole chain of tools
necessary to get what you want. Then keep in mind that the core capability
of tesseract is taking b&w thresholded images (BLACK TEXT on WHITE
BACKGROUND, mind you!  ;-)  ) and producing raw text from that; everything
else you need before and after that step should be custom-tailored to your
specific needs and quality levels using additional tooling and processes.
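
As a small illustration of the "black text on white background" point, here
is a sketch of a polarity check you could run on a thresholded image before
handing it to tesseract. It assumes the image is already binarized into
rows of 0 (black) and 255 (white) values, and it relies on the heuristic
that the background covers more area than the ink, which will not hold for
every page. The function name is made up for this example.

```python
def ensure_black_on_white(pixels):
    """Return a copy of a binarized image with dark text on a light page.

    pixels: list of rows, each value 0 (black) or 255 (white).
    """
    total = sum(len(row) for row in pixels)
    dark = sum(1 for row in pixels for value in row if value < 128)
    # Heuristic assumption: background pixels outnumber text pixels, so a
    # mostly dark image is taken to be white-on-black and gets inverted.
    if total and dark > total / 2:
        return [[255 - value for value in row] for row in pixels]
    return [list(row) for row in pixels]
```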

HTH.



On Sat, Apr 23, 2022, 05:19 Andrew M. <[email protected]> wrote:

> I'm using the latest version of Tesseract (5.0), and I'm trying to
> determine whether or not I can insert some preprocessing steps that will
> -not- affect the form of the final image.
>
> For example, I might start out with an image such as this
> <https://i.stack.imgur.com/XWJ7F.jpg>.
>
> There are different levels of shadow/brightness, so I might use adaptive
> Gaussian thresholding to avoid shadows during binarization
> <https://i.stack.imgur.com/fzCQS.jpg>.
>
> I will now run this through tesseract, with the hope of creating an OCR'd
> PDF in the end. However, I want the image that the end user (and I) see to
> be the full-color, original image, with the text from the transformed image
> underlaid
>
> Is there a way to manage this? Or am I completely missing the point here?
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/bf9dea45-554b-4076-8946-603ca7176090n%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/bf9dea45-554b-4076-8946-603ca7176090n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
