Re: [tesseract-ocr] How to reduce the size of a OCRed pdf file using Tesseract OCR APIs.

Merlijn B.W. Wajer Mon, 19 Apr 2021 16:32:08 -0700

Hi,

On 20/04/2021 01:04, Sharp Subbu wrote:
> Dear Merlijn,
> 
> Kindly reply to my previous mail.


I will reply tomorrow -- off-list so that we don't bother others on this
list.

Regards,
Merlijn

> Thanks and Regards,
> Subramanyam
> 
> On Saturday, April 17, 2021 at 11:32:31 PM UTC+5:30 Sharp Subbu wrote:
> 
>> Dear Merlijn,
>>
>> Thank you for your efforts in providing the files "sample-out.pdf" and
>> "sample-out2.pdf".
>> I have checked these files. The file sample-out2.pdf is highly compressed
>> with good quality.
>> Kindly share the source code or the tools that you used to generate
>> the file "sample-out2.pdf".
>> Kindly let us know whether I can use your source code or tools on a
>> Windows 10 PC.
>>
>> Thanks and Regards,
>> Subramanyam
>> On Wednesday, April 14, 2021 at 10:21:14 PM UTC+5:30 Merlijn Wajer wrote:
>>
>>> Hi,
>>>
>>> On 14/04/2021 15:26, Sharp Subbu wrote:
>>>> Dear Merlijn,
>>>>
>>>> Thank you very much for your reply.
>>>> We are doing a feasibility study on using Tesseract OCR features in our
>>>> project on Windows 10 English 32/64-bit OS.
>>>> As part of this study, I am trying to find out whether it is possible to
>>>> compress / reduce the size of the PDF file created by Tesseract OCR
>>>> (command line: tesseract input.tif outputFile pdf).
>>>> To answer this question, I checked the Tesseract forums and the
>>>> Tesseract APIs. I did not find any related information. Hence, I
>>>> posted the same question in the Tesseract Google group.
>>>> Regarding this, I received a nice reply from you. Thank you very much for
>>>> that.
>>>> Firstly, please clarify whether the Tesseract OCR API supports reducing /
>>>> compressing the OCRed PDF file. Is this support present in the Tesseract
>>>> OCR source code or not?
>>>>
>>>> Kindly find the attached sample PDF file "Sample.pdf" for your
>>>> reference.
>>>> Kindly compress it and send the compressed PDF file.
>>>
>>> Please see attached 'sample-out.pdf' (compression ratio ~3.5), and
>>> 'sample-out2.pdf' (compression ratio ~7.5).
>>>
>>> I also generated one other PDF which illustrates how the compression
>>> works, rather than being actually compressed, but that file is not
>>> attached due to file size reasons (~460KB). You can find that file (plus
>>> the other two files) here: https://archive.org/~merlijn/tmp/mrc-pdf/
>>>
>>> Please see the commands I used:
>>>
>>> 1. Extract image from PDF (not necessary if you start with an image,
>>> even better to start with the image and not have Tesseract generate the 
>>> PDF)
>>>
>>>> $ pdfimages -all /tmp/Sample.pdf /tmp/sample
>>>
>>> 2. OCR:
>>>
>>>> $ tesseract /tmp/sample-000.jpg - hocr > /tmp/sample-000.hocr.html
>>>
>>> 3. Create PDF (dpi taken from JPEG image):
>>>
>>>> $ PATH=$PATH:/home/merlijn/archive/pdf/bin recode_pdf \
>>>>     --from-imagestack /tmp/sample-000.jpg \
>>>>     --hocr-file /tmp/sample-000.hocr.html \
>>>>     -o /tmp/sample-out.pdf -m 2 -v --dpi 200
>>>> [...]
>>>> Processed 1 pages at 1.80 seconds/page
>>>> Compression ratio: 3.677732
>>>
>>> 4. Create more compressed PDF (ditto for dpi):
>>>
>>>> $ PATH=$PATH:/home/merlijn/archive/pdf/bin recode_pdf \
>>>>     --from-imagestack /tmp/sample-000.jpg \
>>>>     --hocr-file /tmp/sample-000.hocr.html \
>>>>     -o /tmp/sample-out2.pdf -m 2 --dpi 200 --bg-downsample 3 -v
>>>> [...]
>>>> Processed 1 pages at 1.59 seconds/page
>>>> Compression ratio: 7.581879
>>>
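The four steps above could be chained from a script. What follows is only a minimal sketch, assuming pdfimages, tesseract and recode_pdf are all on PATH and that the first image in the PDF is a JPEG (so pdfimages names it <stem>-000.jpg). The helper names build_commands and run_pipeline are hypothetical; only flags that appear in the commands quoted above are used.

```python
import subprocess


def build_commands(pdf, stem, out_pdf, dpi=200, bg_downsample=None):
    """Return (commands, hocr_path) for the extract, OCR and recode steps.

    Hypothetical helper: the flags are copied from the commands quoted
    in this thread; nothing else is assumed about the tools.
    """
    image = f"{stem}-000.jpg"        # assumption: first extracted image is a JPEG
    hocr = f"{stem}-000.hocr.html"
    recode = [
        "recode_pdf", "--from-imagestack", image,
        "--hocr-file", hocr, "-o", out_pdf,
        "-m", "2", "-v", "--dpi", str(dpi),
    ]
    if bg_downsample is not None:
        recode += ["--bg-downsample", str(bg_downsample)]
    commands = {
        "extract": ["pdfimages", "-all", pdf, stem],
        "ocr": ["tesseract", image, "-", "hocr"],  # hOCR goes to stdout
        "recode": recode,
    }
    return commands, hocr


def run_pipeline(pdf, stem, out_pdf, **options):
    """Run the three commands in order, stopping at the first failure."""
    commands, hocr = build_commands(pdf, stem, out_pdf, **options)
    subprocess.run(commands["extract"], check=True)
    with open(hocr, "wb") as fh:
        # Equivalent of the shell redirection "> sample-000.hocr.html".
        subprocess.run(commands["ocr"], check=True, stdout=fh)
    subprocess.run(commands["recode"], check=True)
```

Calling run_pipeline("/tmp/Sample.pdf", "/tmp/sample", "/tmp/sample-out2.pdf", bg_downsample=3) would correspond to steps 1, 2 and 4. If you start from an image rather than a PDF, the extract step can simply be skipped.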
>>> Size comparison:
>>>
>>>> $ ls -lsh /tmp/sample-out*.pdf /tmp/Sample.pdf
>>>> 40K -rw-r--r-- 1 merlijn merlijn 38K Apr 14 18:08 /tmp/sample-out2.pdf
>>>> 80K -rw-r--r-- 1 merlijn merlijn 79K Apr 14 18:11 /tmp/sample-out.pdf
>>>> 296K -rw-r--r-- 1 merlijn merlijn 293K Apr 14 17:56 /tmp/Sample.pdf
>>>
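As a rough cross-check, the ratios can be recomputed from the sizes in the listing above. This is a small sketch using the rounded sizes that ls printed, so the results only approximate the figures reported by recode_pdf, which presumably measures against a different baseline (an assumption, not something stated in this thread). The function name compression_ratio is illustrative.

```python
def compression_ratio(original_size, compressed_size):
    """Return how many times smaller the compressed file is."""
    if compressed_size <= 0:
        raise ValueError("compressed_size must be positive")
    return original_size / compressed_size


# Sizes in KiB, as shown by ls in the listing above.
ORIGINAL_KIB = 293
for name, size_kib in [("sample-out.pdf", 79), ("sample-out2.pdf", 38)]:
    ratio = compression_ratio(ORIGINAL_KIB, size_kib)
    print(f"{name}: about {ratio:.1f}x smaller than Sample.pdf")
```

With these inputs the ratios come out at roughly 3.7 and 7.7, close to the 3.68 and 7.58 that the tool printed.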
>>> Note that:
>>>
>>> 1. Compression is not lossless, but text should nevertheless be quite 
>>> sharp.
>>> 2. I ran Tesseract on the image in the PDF, so for your purposes you
>>> might want to instead generate hOCR from the image and let 'recode_pdf'
>>> make the PDF for you (it's pretty much the same code that Tesseract uses).
>>> 3. The mask in the PDF is encoded with 'ccitt' and not 'jbig2', which
>>> would give you slightly better compression still. (This is a bug in
>>> mupdf which will be fixed in the next mupdf release, I have a patched
>>> version somewhere, but not at hand)
>>> 4. I have only tested this on Linux.
>>> 5. The above run uses Kakadu for JPEG2000 compression, but you could
>>> also use Grok [0] or OpenJPEG [1] (OpenJPEG already works as per my
>>> previous email).
>>>
>>> Finally, for some reason it looks like the PDFs created with the recode
>>> tool actually look better than the sample you sent me -- I think that is
>>> because yours suffers from JPEG artifacts, which get mostly cancelled
>>> out by the mask technique that MRC employs.
>>>
>>> If this looks like something you might want to use, we could talk
>>> off-list about how to make it work on Windows, to not bother the list
>>> with details not relevant to Tesseract.
>>>
>>> Cheers,
>>> Merlijn
>>>
>>> [0] https://github.com/GrokImageCompression/grok/
>>> [1] https://www.openjpeg.org
>>>
>>>> Thank you very much for your nice help.
>>>> Subramanyam
>>>>
>>>>
>>>> On Wednesday, April 14, 2021 at 6:27:43 PM UTC+5:30 Merlijn Wajer wrote:
>>>>
>>>>> Hi, 
>>>>>
>>>>> On 14/04/2021 13:52, Sharp Subbu wrote: 
>>>>>> Dear friends, 
>>>>>>
>>>>>> Kindly guide/help us to find solution for the below point: 
>>>>>> ============================= 
>>>>>> How to reduce the size of a OCRed pdf file using Tesseract OCR APIs. 
>>>>>> =============================== 
>>>>>
>>>>> Not sure exactly what use case you have in mind (OS, etc), but I have a
>>>>> suggestion, as I dealt with this in the recent past.
>>>>>
>>>>> I developed something similar to the foxit/luratech "PDF compression",
>>>>> in Python, and it is entirely open source. It uses the Tesseract hOCR
>>>>> result files. This can lead to 3-15x compression ratios (sometimes more,
>>>>> depending on the image formats that you use).
>>>>>
>>>>> It converts images to JPEG2000 for best compression (but slower loading
>>>>> times) and also attempts to create a "foreground", "background" and
>>>>> "mask" image (Mixed Raster Content [0]), which can significantly improve
>>>>> compression. It inserts a text layer just like Tesseract does (the code
>>>>> is a port of Tesseract's C++).
>>>>>
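To make the foreground/background/mask idea concrete, here is a deliberately simplified sketch. It is not the algorithm used by archive-pdf-tools: it assumes a plain global threshold on a grayscale image given as rows of 0-255 integers, and the names mrc_split and mrc_merge are hypothetical. The point is only that the mask records which layer each pixel comes from, so both layers can be filled with smooth values elsewhere and compressed aggressively.

```python
def _mean(values, default):
    """Integer mean of a list, or default when the list is empty."""
    return sum(values) // len(values) if values else default


def mrc_split(gray, threshold=128):
    """Split an image into (mask, foreground, background) layers.

    mask[y][x] is True where the pixel is dark (treated as text).
    Pixels a layer does not own are filled with that layer's mean,
    which keeps the layer smooth and therefore cheap to compress.
    """
    mask = [[px < threshold for px in row] for row in gray]
    dark = [px for row in gray for px in row if px < threshold]
    light = [px for row in gray for px in row if px >= threshold]
    fg_fill, bg_fill = _mean(dark, 0), _mean(light, 255)
    fg = [[px if m else fg_fill for px, m in zip(row, mrow)]
          for row, mrow in zip(gray, mask)]
    bg = [[bg_fill if m else px for px, m in zip(row, mrow)]
          for row, mrow in zip(gray, mask)]
    return mask, fg, bg


def mrc_merge(mask, fg, bg):
    """Recombine the layers: take fg where the mask is set, else bg."""
    return [[f if m else b for m, f, b in zip(mrow, frow, brow)]
            for mrow, frow, brow in zip(mask, fg, bg)]
```

Without any lossy step, mrc_merge(*mrc_split(image)) returns the original image. In a real MRC PDF the savings come from compressing the two colour layers lossily (and possibly at lower resolution) while the bilevel mask keeps text edges sharp.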
>>>>> Here is some info [1], and here is the source code [2]. 
>>>>>
>>>>> There is an "openjpeg-wip" branch that can use OpenJPEG instead of Kakadu
>>>>> for image compression.
>>>>>
>>>>> Example usage to create a PDF from a set of images: 
>>>>>
>>>>> recode_pdf --from-imagestack 'images/*.jp2' --hocr-file 
>>>>> combined_tesseract_results.html -o out.pdf -v --use-openjpeg -m 2 
>>>>>
>>>>> There is also the --from-pdf option instead of --from-imagestack, but 
>>>>> that has only seen light testing. 
>>>>>
>>>>> You can combine the hOCR result files using hocr-combine-stream [3] 
>>>>>
>>>>> If this suits your use case, I'd be happy to help/assist here or off
>>>>> list. There aren't many users of the software yet (the same offer
>>>>> extends to others reading this list). If you have an example PDF that
>>>>> you can send me, I'd be happy to try to send you a compressed PDF back.
>>>>>
>>>>> Cheers, 
>>>>> Merlijn 
>>>>>
>>>>>
>>>>> [0] https://en.wikipedia.org/wiki/Mixed_raster_content 
>>>>> [1] https://archive.org/~merlijn/projects/archive-pdf-tools/index.html 
>>>>> [2] https://git.archive.org/merlijn/archive-pdf-tools 
>>>>> [3] https://git.archive.org/merlijn/archive-hocr-tools/-/blob/master/bin/hocr-combine-stream
>>>>>
>>>>
>>>
>>>
> 

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/170d5a81-3bbe-6603-ef1f-e44105b6d263%40archive.org.
