I agree with you.  It's just an example of one of the files we've come 
across while processing clients' files (we're in eDiscovery/forensics).  We 
come across all kinds of images and try to OCR and index whatever we can so 
it can be searched later.

I think it comes down to what's been suggested: if we want the best OCR 
results, we need to preprocess as much as possible and give tesseract good 
resolution so that it can do its job effectively.  It doesn't look like 
there is much more I need to do on the tesseract side.
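
For what it's worth, here is a rough sketch of the metadata-driven part of 
that preprocessing: picking an up-scale factor from the DPI a file reports, 
aiming at the 300 DPI suggested in this thread.  The function names, the 
4x cap and the "no metadata means no scaling" policy are my own assumptions, 
not anything tesseract prescribes, and as noted below the reported DPI 
alone is not a reliable signal, so treat this as a starting point only.

```python
from typing import Optional, Tuple

TARGET_DPI = 300  # target suggested earlier in this thread


def upscale_factor(reported_dpi: Optional[float],
                   target_dpi: float = TARGET_DPI,
                   max_factor: float = 4.0) -> float:
    """Return the factor by which to enlarge an image before OCR.

    Assumptions (hypothetical policy, not from tesseract):
    - missing or non-positive DPI metadata means "leave the image alone";
    - images are never shrunk;
    - enlargement is capped at max_factor to avoid huge bitmaps.
    """
    if not reported_dpi or reported_dpi <= 0:
        return 1.0
    factor = target_dpi / reported_dpi
    return min(max(factor, 1.0), max_factor)


def scaled_size(size: Tuple[int, int], factor: float) -> Tuple[int, int]:
    """Compute the new (width, height) in pixels for a given factor."""
    width, height = size
    return (round(width * factor), round(height * factor))


if __name__ == "__main__":
    factor = upscale_factor(96)            # a 96 DPI image, as in this thread
    print(factor, scaled_size((800, 600), factor))
```

The actual resampling would be done with whatever imaging library is 
already in the pipeline, and the result saved in a lossless format before 
it is handed to tesseract.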

On Friday, December 14, 2012 12:22:46 PM UTC-5, sventech wrote:
>
> JPEG is a bad idea for text data. If you must use it then pre-process it, 
> but it generally does not preserve a clean character outline. It is 
> designed for photographs. PNG or TIFF, but beware that TIFF is just a 
> wrapper, so sometimes it has a JPEG inside. You need a lossless 
> pixel-focused format.
> --Sven
>
>
> On Fri, Dec 14, 2012 at 9:10 AM, occorled <[email protected]> wrote:
>
>> Thank you, I will do that for b.jpg.
>>
>> But like I said, both of those images have the same DPI value in the 
>> file, yet a.tiff OCRs perfectly and b.jpg is horrible.  So I'm not sure 
>> which algorithm I would use at runtime to decide whether to up-scale 
>> an image or not.  It seems you can't simply rely on the EXIF data.  Not 
>> sure what the best approach is...
>>
>>
>>
>> On Thursday, December 13, 2012 8:32:04 PM UTC-5, Quan Nguyen wrote:
>>>
>>> Width and height are image dimensions but are incorrectly labeled as 
>>> resolution in some applications. Since your images are 96 DPI, tripling 
>>> their resolution should work better.
>>>
>>> On Wednesday, December 12, 2012 8:26:51 AM UTC-6, occorled wrote:
>>>>
>>>> I was always confused about DPI when it comes to images (versus 
>>>> print).  I thought, it's all about (w x h) resolution, not DPI, right?  I 
>>>> found this page to be informative (and funny) 
>>>> http://www.dpiphoto.eu/dpi.htm.
>>>>
>>>> So basically, I simply scale the image larger right?  Perhaps double or 
>>>> triple the resolution of "b.jpg", right?
>>>>
>>>> On Tuesday, December 11, 2012 10:12:05 PM UTC-5, Quan Nguyen wrote:
>>>>>
>>>>> Rescaling to 300 DPI will produce much better results for the images.
>>>>>
>> -- 
>> You received this message because you are subscribed to the Google
>> Groups "tesseract-ocr" group.
>> To post to this group, send email to [email protected]
>> To unsubscribe from this group, send email to
>> [email protected]
>> For more options, visit this group at
>> http://groups.google.com/group/tesseract-ocr?hl=en
>>
>
>
>
> -- 
> ``All that is gold does not glitter,
>   not all those who wander are lost;
> the old that is strong does not wither,
>   deep roots are not reached by the frost.
> From the ashes a fire shall be woken,
>   a light from the shadows shall spring;
> renewed shall be blade that was broken,
>   the crownless again shall be king.”
>  
