Hi,

On 20/04/2021 01:04, Sharp Subbu wrote:
> Dear Merlijn,
>
> Kindly reply to my previous mail.

I will reply tomorrow -- off-list so that we don't bother others on this
list.

Regards,
Merlijn

> Thanks and Regards,
> Subramanyam
>
> On Saturday, April 17, 2021 at 11:32:31 PM UTC+5:30 Sharp Subbu wrote:
>
>> Dear Merlijn,
>>
>> Thank you for your efforts in providing the files "sample-out.pdf" and
>> "sample-out2.pdf".
>> I have checked these files. The file sample-out2.pdf is highly compressed
>> with good quality.
>> Kindly share the source code or the tools that you used to generate
>> the file "sample-out2.pdf".
>> Kindly let us know whether I can use your source code or tools on a
>> Windows 10 PC.
>>
>> Thanks and Regards,
>> Subramanyam
>> On Wednesday, April 14, 2021 at 10:21:14 PM UTC+5:30 Merlijn Wajer wrote:
>>
>>> Hi,
>>>
>>> On 14/04/2021 15:26, Sharp Subbu wrote:
>>>> Dear Merlijn,
>>>>
>>>> Thank you very much for your reply.
>>>> We are doing a feasibility study on using Tesseract OCR features in our
>>>> project on Windows 10 English 32/64-bit OS.
>>>> As part of this study, I am trying to find out whether it is possible to
>>>> compress / reduce the size of the pdf file created by Tesseract OCR
>>>> (command line: Tesseract input.tif outputFile pdf).
>>>> To find an answer to this question, I have checked the Tesseract forums
>>>> and Tesseract APIs. I did not find any related information. Hence, I have
>>>> posted the same question in the Tesseract Google forums.
>>>> Regarding this, I received a nice reply from you. Thank you very much for
>>>> that.
>>>> Firstly, please clarify whether the Tesseract OCR API supports reducing /
>>>> compressing the OCRed pdf file. Is this support present or not in the
>>>> Tesseract OCR source code?
>>>>
>>>> Kindly find the attached sample pdf file "Sample.pdf" for your
>>>> reference.
>>>> Kindly compress it and send the compressed pdf file.
>>>
>>> Please see attached 'sample-out.pdf' (compression ratio ~3.5), and
>>> 'sample-out2.pdf' (compression ratio ~7.5).
>>>
>>> I also generated one other PDF which illustrates how the compression
>>> works, rather than being actually compressed, but that file is not
>>> attached due to file size reasons (~460KB). You can find that file (plus
>>> the other two files) here: https://archive.org/~merlijn/tmp/mrc-pdf/
>>>
>>> Please see the commands I used:
>>>
>>> 1. Extract image from PDF (not necessary if you start with an image,
>>> even better to start with the image and not have Tesseract generate the
>>> PDF)
>>>
>>>> $ pdfimages -all /tmp/Sample.pdf /tmp/sample
>>>
>>> 2. OCR:
>>>
>>>> $ tesseract /tmp/sample-000.jpg - hocr > /tmp/sample-000.hocr.html
>>>
>>> 3. Create PDF (dpi taken from JPEG image):
>>>
>>>> $ PATH=$PATH:/home/merlijn/archive/pdf/bin recode_pdf --from-imagestack /tmp/sample-000.jpg --hocr-file /tmp/sample-000.hocr.html -o /tmp/sample-out.pdf -m 2 -v --dpi 200
>>>> [...]
>>>> Processed 1 pages at 1.80 seconds/page
>>>> Compression ratio: 3.677732
>>>
>>> 4. Create more compressed PDF (ditto for dpi):
>>>
>>>> $ PATH=$PATH:/home/merlijn/archive/pdf/bin recode_pdf --from-imagestack /tmp/sample-000.jpg --hocr-file /tmp/sample-000.hocr.html -o /tmp/sample-out2.pdf -m 2 --dpi 200 --bg-downsample 3 -v
>>>> [...]
>>>> Processed 1 pages at 1.59 seconds/page
>>>> Compression ratio: 7.581879
>>>
>>> Size comparison:
>>>
>>>> $ ls -lsh /tmp/sample-out*.pdf /tmp/Sample.pdf
>>>> 40K -rw-r--r-- 1 merlijn merlijn 38K Apr 14 18:08 /tmp/sample-out2.pdf
>>>> 80K -rw-r--r-- 1 merlijn merlijn 79K Apr 14 18:11 /tmp/sample-out.pdf
>>>> 296K -rw-r--r-- 1 merlijn merlijn 293K Apr 14 17:56 /tmp/Sample.pdf
>>>
>>> Note that:
>>>
>>> 1. Compression is not lossless, but text should nevertheless be quite
>>> sharp.
>>> 2. I run Tesseract on the image in the PDF, so for your purposes you
>>> might want to instead generate hOCR from the image and let 'recode_pdf'
>>> make the PDF for you (it's pretty much the same code that Tesseract uses).
>>> 3. The mask in the PDF is encoded with 'ccitt' and not 'jbig2', which
>>> would give you slightly better compression still. (This is a bug in
>>> mupdf which will be fixed in the next mupdf release, I have a patched
>>> version somewhere, but not at hand)
>>> 4. I have only tested this on Linux.
>>> 5. The above run uses Kakadu for JPEG2000 compression, but you could
>>> also use Grok [0] or OpenJPEG [1] (OpenJPEG already works as per my
>>> previous email).
>>>
>>> Finally, for some reason it looks like the PDFs created with the recode
>>> tool actually look better than the sample you sent me -- I think that is
>>> because yours suffers from JPEG artifacts which get mostly cancelled
>>> out by the mask technique that MRC employs.
>>>
>>> If this looks like something you might want to use, we could talk
>>> off-list about how to make it work on Windows, to not bother the list
>>> with details not relevant to Tesseract.
>>>
>>> Cheers,
>>> Merlijn
>>>
>>> [0] https://github.com/GrokImageCompression/grok/
>>> [1] https://www.openjpeg.org
>>>
>>>> Thank you very much for your nice help.
>>>> Subramanyam
>>>>
>>>>
>>>> On Wednesday, April 14, 2021 at 6:27:43 PM UTC+5:30 Merlijn Wajer wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> On 14/04/2021 13:52, Sharp Subbu wrote:
>>>>>> Dear friends,
>>>>>>
>>>>>> Kindly guide/help us to find a solution for the below point:
>>>>>> =============================
>>>>>> How to reduce the size of an OCRed pdf file using Tesseract OCR APIs.
>>>>>> ===============================
>>>>>
>>>>> Not sure exactly what use case you have in mind (OS, etc), but I have a
>>>>> suggestion, as I dealt with this in the recent past.
>>>>>
>>>>> I developed something similar to the foxit/luratech "PDF compression",
>>>>> in Python and it is entirely open source. It uses the Tesseract hOCR
>>>>> result files. This can lead to 3-15x compression ratios (sometimes more,
>>>>> depending on the image formats that you use).
>>>>>
>>>>> It converts images to JPEG2000 for best compression (but slower loading
>>>>> times) and also attempts to create a "foreground", "background" and
>>>>> "mask" image (Mixed Raster Content [0]), which can significantly improve
>>>>> compression. It inserts a text layer just like Tesseract does (the code
>>>>> is a port of Tesseract's C++).
>>>>>
>>>>> Here is some info [1], and here is the source code [2].
>>>>>
>>>>> There is an "openjpeg-wip" branch that can use OpenJPEG instead of Kakadu
>>>>> for image compression.
>>>>>
>>>>> Example usage to create a PDF from a set of images:
>>>>>
>>>>> recode_pdf --from-imagestack 'images/*.jp2' --hocr-file
>>>>> combined_tesseract_results.html -o out.pdf -v --use-openjpeg -m 2
>>>>>
>>>>> There is also the --from-pdf option instead of --from-imagestack, but
>>>>> that has only seen light testing.
>>>>>
>>>>> You can combine the hOCR result files using hocr-combine-stream [3]
>>>>>
>>>>> If this suits your use case, I'd be happy to help/assist here or off
>>>>> list. There aren't many users of the software yet (the same offer
>>>>> extends to others reading this list). If you have an example PDF that
>>>>> you can send me, I'd be happy to try to send you a compressed PDF back.
>>>>>
>>>>> Cheers,
>>>>> Merlijn
>>>>>
>>>>>
>>>>> [0] https://en.wikipedia.org/wiki/Mixed_raster_content
>>>>> [1] https://archive.org/~merlijn/projects/archive-pdf-tools/index.html
>>>>> [2] https://git.archive.org/merlijn/archive-pdf-tools
>>>>> [3] https://git.archive.org/merlijn/archive-hocr-tools/-/blob/master/bin/hocr-combine-stream
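
For anyone who wants to script the steps quoted in this thread (extract the image, run Tesseract to get hOCR, then run recode_pdf), here is a minimal Python sketch. It only builds the argument lists; the flags are copied from the commands in the thread and nothing else is implied about the tools. The function names `pipeline_commands` and `run_pipeline` are hypothetical, the intermediate file name `sample-000.jpg` is simply what pdfimages produced for the sample in the thread (other PDFs may yield other names or formats), and `recode_pdf` is assumed to be on PATH. As noted in the thread, this has only been tried on Linux.

```python
import subprocess


def pipeline_commands(pdf, workdir, dpi=200, bg_downsample=None):
    """Build the three command lines from the thread (hypothetical helper).

    Returns (extract, ocr, recode) as argv lists. Nothing is executed here.
    """
    # File names mirror the ones in the thread; they are assumptions for
    # any other input PDF.
    prefix = f"{workdir}/sample"
    image = f"{workdir}/sample-000.jpg"
    hocr = f"{workdir}/sample-000.hocr.html"
    out = f"{workdir}/sample-out.pdf"

    # Step 1: pull the embedded image out of the PDF.
    extract = ["pdfimages", "-all", pdf, prefix]
    # Step 2: OCR to hOCR; "-" sends the result to stdout.
    ocr = ["tesseract", image, "-", "hocr"]
    # Steps 3/4: build the compressed PDF from image + hOCR.
    recode = [
        "recode_pdf", "--from-imagestack", image,
        "--hocr-file", hocr, "-o", out,
        "-m", "2", "--dpi", str(dpi), "-v",
    ]
    if bg_downsample is not None:
        # Step 4 in the thread added this flag for the smaller file.
        recode += ["--bg-downsample", str(bg_downsample)]
    return extract, ocr, recode


def run_pipeline(pdf, workdir, **kwargs):
    """Run the commands in order (requires the tools to be installed)."""
    extract, ocr, recode = pipeline_commands(pdf, workdir, **kwargs)
    subprocess.run(extract, check=True)
    # Redirect tesseract's stdout into the hOCR file, as the shell ">" did.
    with open(f"{workdir}/sample-000.hocr.html", "wb") as fh:
        subprocess.run(ocr, stdout=fh, check=True)
    subprocess.run(recode, check=True)
```

Calling `pipeline_commands("/tmp/Sample.pdf", "/tmp", bg_downsample=3)` gives the argument lists for the more compressed variant; leaving `bg_downsample` unset corresponds to the first recode_pdf run in the thread.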

