You could also test with a grayscale, LZW-compressed TIFF as the input to tesseract:

gswin32c -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -sCompression=lzw -r300
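As a rough sketch of the full round trip with those options (the function name ocr_gray, the DRY_RUN switch and the "-gray" output names are made up for illustration, and the tesseract options are copied from your message): note that tiffgray drops the colors, and I have not measured whether the final PDF actually comes out smaller for your sample.

```shell
#!/bin/sh
# Sketch only. Assumed, not from the thread: the function name ocr_gray,
# the DRY_RUN switch, and the "-gray" output file names.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# ocr_gray INPUT.pdf
# Rasterize to a grayscale, LZW-compressed TIFF at 300 dpi, then OCR it
# into a searchable PDF with the same tesseract options as the original post.
ocr_gray() {
    in=$1
    base=${in%.pdf}
    # Set GS=gswin32c to use the 32-bit Windows console build of Ghostscript.
    run "${GS:-gs}" -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -sCompression=lzw \
        -r300 -sOutputFile="$base-gray.tiff" "$in" &&
    run tesseract "$base-gray.tiff" "$base-tess-gray" -l fra -psm 1 pdf
}
```

After sourcing the file, `DRY_RUN=1; ocr_gray sample.pdf` only prints the two commands; without DRY_RUN it runs them and should leave sample-tess-gray.pdf next to the input, which you can then compare in size with your 147K result.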
ShreeDevi
____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com

On Thu, Nov 6, 2014 at 2:13 PM, Sébastien Cuendet <sebastien.cuen...@gmail.com> wrote:

> Hello everyone,
>
> [I apologize for double-posting, but it seems that my first post was not
> published]
>
> The setup of my (web) app is the following: I get user-uploaded PDF files,
> I run OCR on them and show them the OCRed PDF. Since everything is online,
> minimizing the size of the resulting PDF file is key to reducing loading
> and wait time for the user.
>
> The file I receive from the user is sample.pdf (attached to this post). I
> use tesseract 3.04 and do the following:
>
> gs -r300 -sDEVICE=tiff24nc -dBATCH -dNOPAUSE -sOutputFile=sample.tiff sample.pdf
> tesseract sample.tiff sample-tess -l fra -psm 1 pdf
>
> The result of the OCR is good, but the generated PDF is now about 2.5
> times the size of the original:
> - size of original pdf file: 60k
> - size of final pdf: 147K
>
> So I ask you, how can I reduce the size of the generated PDF while keeping
> the OCR result?
>
> One obvious solution is to reduce the resolution when generating the tiff,
> but I don’t want to do that as it may affect the OCR result.
>
> The second thing I tried was to reduce the PDF size post-tesseract, using
> ghostscript:
>
> gs -o sample-down-300.pdf -sDEVICE=pdfwrite \
>   -dDownsampleColorImages=true -dDownsampleGrayImages=true -dDownsampleMonoImages=true \
>   -dColorImageResolution=300 -dGrayImageResolution=300 -dMonoImageResolution=300 \
>   -dColorImageDownsampleThreshold=1.0 -dGrayImageDownsampleThreshold=1.5 \
>   -dMonoImageDownsampleThreshold=1.0 sample-tess.pdf
>
> This helps a bit: the generated file is only 101K, so about 1.5 times the
> original. I could live with that, but it also seems to affect the OCR
> result. For example, the white space between ‘RESTAURANT’ and ‘PIZZERIA’
> (second line) is now missing.
>
> Another (simpler) option with ghostscript, using the ebook setting,
> results in a 43k file with somewhat lower quality in the PDF and the same
> problem of the missing white spaces:
>
> gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE \
>   -dBATCH -dQUIET -sOutputFile=sample-ebook.pdf sample-tess.pdf
>
> The lower quality of the PDF is fine, but again, I don’t really want to
> compromise on the OCR.
>
> I’ve done other tests using PNG and JPEG, but the OCR results always get
> worse (even if only slightly) and the resulting PDF is not smaller. For example,
> with PNG:
>
> convert -density 300 sample.pdf -transparent white sample.png
> tesseract sample.png sample-tess-png -l fra -psm 1 pdf
>
> The total (55.50) is missing and the final pdf size is 149k.
>
> So to summarize, here are my questions:
>
> - Can someone explain why reducing the size of the PDF using
>   ghostscript affects the OCR result? I thought the text layer and the image
>   layer were independent…
> - Are there options that one can give to tesseract to reduce the
>   quality of the images when it generates the PDF?
> - I read that other solutions like ABBYY OCR use Mixed Raster
>   Content (MRC) to reduce the file size. Does tesseract do that already? If
>   not, are there some open source or proprietary CLI tools that do that,
>   which I could use to reduce the tesseract generated PDF file?
>
> Again, I’m OK compromising on the quality of the PDF images (although I
> would like to keep the colors, ideally) as long as the user can search text
> and select it to copy/paste from the PDF.
>
> Any help would be greatly appreciated!
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/4d7bb1e2-917f-4f24-a226-e03ccac3d9cc%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.