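You could also test with the following Ghostscript options (grayscale TIFF
with LZW compression), adding your own output and input file names: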

gswin32c -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -sCompression=lzw -r300
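
The options above do not name an output or input file, so here is a minimal sketch of how they might slot into the two-step pipeline from the quoted message. The function name render_and_ocr, the RUN dry-run hook, and the output base name are hypothetical; gs is assumed to be on PATH (gswin32c on Windows). Note that tiffgray discards color, so this only fits if a grayscale page image is acceptable.

```shell
#!/bin/sh
# Sketch: render a PDF to an LZW-compressed grayscale TIFF, then OCR it.
# Assumes Ghostscript ("gs") and tesseract 3.04 (which takes "-psm") are installed.
render_and_ocr() {
    in_pdf=$1    # input PDF, e.g. sample.pdf
    base=$2      # hypothetical output base name, e.g. sample-gray
    # RUN is a hypothetical hook: set RUN=echo to print the commands
    # instead of running them. When unset, the commands run directly.
    ${RUN:-} gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -sCompression=lzw -r300 \
        -sOutputFile="$base.tiff" "$in_pdf" &&
    ${RUN:-} tesseract "$base.tiff" "$base" -l fra -psm 1 pdf
}
```

Running "render_and_ocr sample.pdf sample-gray" would then be expected to leave sample-gray.pdf next to the input, whose size can be compared against the 147k result.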



ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Nov 6, 2014 at 2:13 PM, Sébastien Cuendet <sebastien.cuen...@gmail.com> wrote:

> Hello everyone,
>
> [I apologize for double-posting, but it seems that my first post was not
> published]
>
> The setup of my (web) app is the following: I receive user-uploaded PDF
> files, run OCR on them, and show users the OCRed PDF. Since everything is
> online, minimizing the size of the resulting PDF file is key to reducing
> loading and wait times for the user.
>
> The file I receive from the user is sample.pdf (attached to this post). I
> use tesseract 3.04 and do the following:
>
> gs -r300 -sDEVICE=tiff24nc -dBATCH -dNOPAUSE -sOutputFile=sample.tiff sample.pdf
> tesseract sample.tiff sample-tess -l fra -psm 1 pdf
>
>
> The result of the OCR is good, but the generated PDF is now about 2.5 times
> the size of the original:
> - size of original pdf file: 60k
> - size of final pdf: 147k
>
> So I ask you, how can I reduce the size of the generated PDF while keeping
> the OCR result?
>
> One obvious solution is to reduce the resolution when generating the tiff,
> but I don’t want to do that as it may affect the OCR result.
>
> The second thing I tried was to reduce the PDF size post-tesseract, using
> ghostscript:
>
> gs -o sample-down-300.pdf -sDEVICE=pdfwrite \
>   -dDownsampleColorImages=true -dDownsampleGrayImages=true \
>   -dDownsampleMonoImages=true \
>   -dColorImageResolution=300 -dGrayImageResolution=300 \
>   -dMonoImageResolution=300 \
>   -dColorImageDownsampleThreshold=1.0 -dGrayImageDownsampleThreshold=1.5 \
>   -dMonoImageDownsampleThreshold=1.0 \
>   sample-tess.pdf
>
>
> This helps a bit: the generated file is only 101k, so about 1.7 times the
> original. I could live with that, but it also seems to affect the OCR
> result. For example, the white space between ‘RESTAURANT’ and ‘PIZZERIA’
> (second line) is now missing.
>
> Another (simpler) option with ghostscript, using the /ebook preset,
> results in a 43k file with somewhat lower image quality and the same
> problem of missing white space:
>
> gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
>   -dNOPAUSE -dBATCH -dQUIET -sOutputFile=sample-ebook.pdf sample-tess.pdf
>
>
> The lower quality of the PDF is fine, but again, I don’t really want to
> compromise on the OCR.
>
> I’ve done other tests using PNG and JPEG input, but the OCR results always
> get worse (if only slightly) and the resulting PDF is not smaller. For
> example, with PNG:
>
> convert -density 300 sample.pdf -transparent white sample.png
> tesseract sample.png sample-tess-png -l fra -psm 1 pdf
>
> With this, the total (55.50) is missing from the OCR output and the final
> pdf size is 149k.
>
>
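
To put the sizes reported above side by side, here is a small sketch that prints each result relative to the 60k original. The row labels and the function name size_ratios are made up for illustration; the numbers are the ones given in the message.

```shell
#!/bin/sh
# Sizes in KB as reported in the quoted message; 60 is the original sample.pdf.
# Labels in the first column are hypothetical names for each attempt.
size_ratios() {
    awk -v orig=60 '{ printf "%-20s %4dk  %.2fx\n", $1, $2, $2 / orig }' <<'EOF'
tesseract-tiff24nc 147
gs-downsample-300 101
gs-ebook 43
tesseract-png 149
EOF
}
size_ratios
```

Only the /ebook run ends up below the original size, and that is one of the two runs where the missing white space was reported.
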
> So to summarize, here are my questions:
>
>    - Can someone explain why reducing the size of the PDF using
>    ghostscript affects the OCR result? I thought the text layer and the image
>    layer were independent…
>    - Are there options that one can give to tesseract to reduce the
>    quality of the images when it generates the PDF?
>    - I read that other solutions like ABBYY OCR use Mixed Rasterized
>    Content (MRC) to reduce the file size. Does tesseract do that already? If
>    not, are there some open source or proprietary CLI tools that do that,
>    which I could use to reduce the tesseract generated PDF file?
>
>
> Again, I’m OK compromising on the quality of the PDF images (although I
> would like to keep the colors, ideally) as long as the user can search text
> and select it to copy/paste from the PDF.
>
> Any help would be greatly appreciated!
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/4d7bb1e2-917f-4f24-a226-e03ccac3d9cc%40googlegroups.com
> For more options, visit https://groups.google.com/d/optout.
>

