Thanks for your answer.

I did try that as well, but I lose the colors and the final tesseract file 
is still 142k (only 5k less than the one with colors), so not much of an 
improvement.
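
For anyone comparing sizes the same way, here is a minimal sketch of a 
helper that prints how many times larger one file is than another. The 
name size_ratio is hypothetical (it is not part of tesseract or 
ghostscript), and the sketch assumes a POSIX shell with wc, tr and awk:

```shell
# Hypothetical helper, not part of tesseract or ghostscript.
# Usage: size_ratio ORIGINAL RESULT
# Prints RESULT's size divided by ORIGINAL's size, to two decimals.
size_ratio() {
    # wc -c pads its output on some systems, so strip the spaces.
    orig_bytes=$(wc -c < "$1" | tr -d ' ')
    new_bytes=$(wc -c < "$2" | tr -d ' ')
    # Guard against dividing by zero when the original file is empty.
    if [ "$orig_bytes" -eq 0 ]; then
        echo "size_ratio: $1 is empty" >&2
        return 1
    fi
    awk -v o="$orig_bytes" -v n="$new_bytes" 'BEGIN { printf "%.2f\n", n / o }'
}
```

Called as "size_ratio sample.pdf sample-tess.pdf", it should print a value 
near 2.5 for the files described in the original post.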

On Thursday, November 6, 2014 11:11:54 AM UTC+1, shree wrote:
>
> You could also test with
>
> gswin32c -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -sCompression=lzw -r300   
>
>
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Thu, Nov 6, 2014 at 2:13 PM, Sébastien Cuendet <[email protected]> wrote:
>
>> Hello everyone,
>>
>> [I apologize for double-posting, but it seems that my first post was not 
>> published]
>>
>> The setup of my (web) app is the following: I receive PDF files 
>> uploaded by users, run OCR on them, and show the users the OCRed PDF. 
>> Since everything is online, minimizing the size of the resulting PDF 
>> file is key to reducing loading and wait times for the user.
>>
>> The file I receive from the user is sample.pdf (attached to this post). I 
>> use tesseract 3.04 and do the following:
>>
>> gs -r300 -sDEVICE=tiff24nc -dBATCH -dNOPAUSE -sOutputFile=sample.tiff sample.pdf
>> tesseract sample.tiff sample-tess -l fra -psm 1 pdf
>>
>>
>> The result of the OCR is good, but the generated PDF is now about 2.5 
>> times the size of the original:
>> - size of original PDF file: 60k
>> - size of final PDF: 147k
>>
>> So I ask you, how can I reduce the size of the generated PDF while 
>> keeping the OCR result?
>>
>> One obvious solution is to reduce the resolution when generating the 
>> tiff, but I don’t want to do that as it may affect the OCR result. 
>>
>> The second thing I tried was to reduce the PDF size post-tesseract, using 
>> ghostscript:
>>
>> gs -o sample-down-300.pdf -sDEVICE=pdfwrite \
>>    -dDownsampleColorImages=true -dDownsampleGrayImages=true \
>>    -dDownsampleMonoImages=true \
>>    -dColorImageResolution=300 -dGrayImageResolution=300 \
>>    -dMonoImageResolution=300 \
>>    -dColorImageDownsampleThreshold=1.0 \
>>    -dGrayImageDownsampleThreshold=1.5 \
>>    -dMonoImageDownsampleThreshold=1.0 \
>>    sample-tess.pdf
>>
>>
>> This helps a bit: the generated file is only 101k, so about 1.7 times 
>> the original. I could live with that, but it also seems to affect the 
>> OCR result. For example, the white space between ‘RESTAURANT’ and 
>> ‘PIZZERIA’ (second line) is now missing.
>>
>> Another (simpler) option with ghostscript, using the /ebook preset, 
>> results in a 43k file with somewhat lower image quality in the PDF and 
>> the same problem of missing white spaces:
>>
>> gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dBATCH -dQUIET -sOutputFile=sample-ebook.pdf sample-tess.pdf
>>
>>
>> The lower image quality of the PDF is fine, but again, I don’t really 
>> want to compromise on the OCR.
>>
>> I’ve done other tests using PNG and JPEG images, but the OCR results 
>> always get worse (even if only slightly) and the resulting PDF is not 
>> smaller. For example, with PNG:
>>
>> convert -density 300 sample.pdf -transparent white sample.png
>> tesseract sample.png sample-tess-png -l fra -psm 1 pdf
>>
>> In the OCR output the total (55.50) is missing, and the final PDF size is 149k.
>>
>>
>> So to summarize, here are my questions:
>>
>>    - Can someone explain why reducing the size of the PDF using 
>>    ghostscript affects the OCR result? I thought the text layer and the 
>>    image layer were independent…
>>    - Are there options that one can give to tesseract to reduce the 
>>    quality of the images when it generates the PDF?
>>    - I read that other solutions like ABBYY OCR use Mixed Raster 
>>    Content (MRC) to reduce the file size. Does tesseract do that already? 
>>    If not, are there open source or proprietary CLI tools that do, which 
>>    I could use to shrink the tesseract-generated PDF file?
>>    
>>
>> Again, I’m OK compromising on the quality of the PDF images (although I 
>> would like to keep the colors, ideally) as long as the user can search text 
>> and select it to copy/paste from the PDF.
>>
>> Any help would be greatly appreciated!
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/4d7bb1e2-917f-4f24-a226-e03ccac3d9cc%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

