Hi,

On 14/04/2021 13:52, Sharp Subbu wrote:
> Dear friends,
> 
> Kindly guide/help us to find solution for the below point:
> =============================
> How to reduce the size of a OCRed pdf file using Tesseract OCR APIs.
> ===============================

I'm not sure exactly what use case you have in mind (OS, etc.), but I
have a suggestion, as I dealt with this in the recent past.

I developed something similar to the Foxit/LuraTech "PDF compression"
in Python, and it is entirely open source. It uses the Tesseract hOCR
result files. This can lead to 3-15x compression ratios (sometimes more,
depending on the image formats that you use).

It converts images to JPEG2000 for best compression (but slower loading
times) and also attempts to create a "foreground", "background" and
"mask" image (Mixed Raster Content [0]), which can significantly improve
compression. It inserts a text layer just like Tesseract does (the code
is a port of Tesseract's C++).
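
To illustrate the Mixed Raster Content idea, here is a minimal sketch in
plain Python. It is not the code from archive-pdf-tools: the function
names (mrc_split, mrc_compose) and the fixed threshold of 128 are
assumptions made for this example, and a real tool would likely use a
smarter binarisation step. It only shows how a page can be split into a
mask plus two smooth layers that recombine into the original.

```python
def mrc_split(gray, threshold=128):
    """Split a grayscale page (list of rows of 0-255 ints) into three layers.

    Hypothetical helper for illustration; the fixed threshold is an assumption.
    """
    # Mask: True where the pixel is dark enough to count as "ink".
    mask = [[px < threshold for px in row] for row in gray]

    ink = [px for row in gray for px in row if px < threshold]
    paper = [px for row in gray for px in row if px >= threshold]
    # Fill values keep each layer smooth where the other layer "wins",
    # which is what lets the layers compress well on their own.
    ink_fill = sum(ink) // len(ink) if ink else 0
    paper_fill = sum(paper) // len(paper) if paper else 255

    foreground = [
        [px if m else ink_fill for px, m in zip(row, mrow)]
        for row, mrow in zip(gray, mask)
    ]
    background = [
        [paper_fill if m else px for px, m in zip(row, mrow)]
        for row, mrow in zip(gray, mask)
    ]
    return mask, foreground, background


def mrc_compose(mask, foreground, background):
    """Rebuild the page: take foreground where the mask is set, else background."""
    return [
        [f if m else b for m, f, b in zip(mrow, frow, brow)]
        for mrow, frow, brow in zip(mask, foreground, background)
    ]


# Tiny synthetic page: light paper (240) with a dark 2x2 blob of "text" (20).
page = [[240] * 6 for _ in range(4)]
for y in (1, 2):
    for x in (2, 3):
        page[y][x] = 20

mask, fg, bg = mrc_split(page)
restored = mrc_compose(mask, fg, bg)
```

In this sketch the mask is a 1-bit image, and the foreground and
background end up nearly uniform, so each layer can be compressed
separately and more aggressively than the mixed original.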

Here is some info [1], and here is the source code [2].

There is an "openjpeg-wip" branch that can use OpenJPEG instead of Kakadu
for image compression.

Example usage to create a PDF from a set of images:

    recode_pdf --from-imagestack 'images/*.jp2' \
        --hocr-file combined_tesseract_results.html \
        -o out.pdf -v --use-openjpeg -m 2

There is also the --from-pdf option instead of --from-imagestack, but
that has only seen light testing.

You can combine the hOCR result files using hocr-combine-stream [3].
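
For context, the overall pipeline can be sketched as a short Python
script that only builds and prints the command lines. The page file
names and the "hocr" output directory are hypothetical, and the helper
names are made up for this example. The Tesseract call uses its standard
"hocr" config. The recode_pdf flags are copied from the example in this
mail, so --use-openjpeg assumes the "openjpeg-wip" branch. The arguments
for hocr-combine-stream are not shown in this mail, so that step is only
marked with a comment.

```python
import shlex
from pathlib import Path


def tesseract_hocr_cmd(image, out_dir):
    """Command to OCR one page image into <out_dir>/<stem>.hocr."""
    out_base = Path(out_dir) / Path(image).stem
    # "hocr" is the Tesseract config that selects hOCR output.
    return ["tesseract", str(image), str(out_base), "hocr"]


def recode_pdf_cmd(image_glob, hocr_file, out_pdf):
    """Command mirroring the recode_pdf example from this mail."""
    return [
        "recode_pdf",
        "--from-imagestack", image_glob,
        "--hocr-file", hocr_file,
        "-o", out_pdf,
        "-v", "--use-openjpeg", "-m", "2",
    ]


# Hypothetical page images, named for illustration only.
pages = ["images/page_0001.jp2", "images/page_0002.jp2"]

ocr_cmds = [tesseract_hocr_cmd(p, "hocr") for p in pages]
# Between these two steps, the per-page hOCR files would be merged into
# combined_tesseract_results.html with hocr-combine-stream [3].
pdf_cmd = recode_pdf_cmd(
    "images/*.jp2", "combined_tesseract_results.html", "out.pdf"
)

for cmd in ocr_cmds + [pdf_cmd]:
    # shlex.join quotes the glob so the shell passes it through unexpanded.
    print(shlex.join(cmd))
```

Printing the commands rather than running them keeps the sketch free of
side effects; passing each list to subprocess.run would execute it,
assuming the tools are installed.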

If this suits your use case, I'd be happy to help/assist here or off
list (the same offer extends to others reading this list). There aren't
many users of the software yet. If you have an example PDF that you can
send me, I'd be happy to try to send you a compressed PDF back.

Cheers,
Merlijn


[0] https://en.wikipedia.org/wiki/Mixed_raster_content
[1] https://archive.org/~merlijn/projects/archive-pdf-tools/index.html
[2] https://git.archive.org/merlijn/archive-pdf-tools
[3] https://git.archive.org/merlijn/archive-hocr-tools/-/blob/master/bin/hocr-combine-stream
