Re: How to get decent results?

Michael Moore Thu, 08 Jan 2009 11:00:35 -0800

On Thu, Jan 8, 2009 at 11:30 AM, Darren Govoni <[email protected]> wrote:
>
> Hey Michael,
>   I really appreciate the tips. I'm developing an automated batch
> ocr'ing system and there won't be a lot of human cleanup time involved
> (100,000's of images).
>
>  However, it might be possible to do some contrast or resolution type
> enhancements using convert/imagemagick? You think? I'm somewhat new to
> this, so learning the ropes still.


The more uniform and cleaner your input images are, the easier it will
be to batch it.

Here's what I use in my batch script:
convert -monochrome +compress  Input.jpg output.tif

The -monochrome seems to do a pretty good job of getting rid of grey
stray marks on the input (like folds in pages, light pencil lines) and
results in a smaller file.
The +compress tells it not to compress the tif file. You can also use
-compress lzw and some other options. I'm pretty sure you can't use
-compress jpeg with the -monochrome option.
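
A minimal sketch of how the convert call above could be wrapped in a
batch loop. The function names (tif_name, batch_convert), the directory
argument and the *.jpg pattern are assumptions for illustration, not the
actual batch script:

```shell
#!/bin/sh
# Sketch only: assumes ImageMagick's convert is on PATH and that the
# inputs are *.jpg files inside one directory.

# Derive the output name: scans/page01.jpg -> scans/page01.tif
tif_name() {
    printf '%s.tif\n' "${1%.jpg}"
}

# Convert every JPG in a directory to an uncompressed monochrome TIFF,
# using the same options as the single-file command above.
batch_convert() {
    dir=$1
    for img in "$dir"/*.jpg; do
        [ -e "$img" ] || continue    # the glob matched nothing
        convert -monochrome +compress "$img" "$(tif_name "$img")" ||
            echo "failed: $img" >&2  # report the file and keep going
    done
}
```

After sourcing the file, batch_convert /path/to/scans would process one
directory. Failures are only reported on stderr so that one bad image
does not stop a long run.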

Your Spanish text.jpg image looks like a highly compressed JPG. If
you can get an uncompressed image to start with, you'll probably be
better off already.

> The system won't know the quality of the images it will try to OCR, so
> to reduce human checking, I'm looking for ways to convert standard
> images into the best possible format for tesseract (which I'm new to).

You may be able to get users to upload better files if you require
PNGs or something like that. Most scanner software can save as PNG. I
believe that tesseract only takes TIFF files, and I've had the best luck
with monochrome files.
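
One way to combine the two steps, as a hedged sketch: accept whatever
format was uploaded, convert it to a monochrome TIFF if it is not a TIFF
already, then run tesseract on it. The function names (needs_conversion,
ocr_image) and the default language code are assumptions for
illustration; the convert and tesseract invocations are the ones quoted
in this thread.

```shell
#!/bin/sh
# Sketch only: assumes ImageMagick's convert and tesseract are on PATH.

# Succeeds when the file is not already a .tif/.tiff.
needs_conversion() {
    case "$1" in
        *.tif|*.tiff|*.TIF|*.TIFF) return 1 ;;
        *) return 0 ;;
    esac
}

# OCR one image. tesseract writes its text to <name without extension>.txt.
# Existing TIFFs are passed through untouched, so they are NOT forced to
# monochrome here.
ocr_image() {
    src=$1
    lang=${2:-spa}          # language pack; spa is just the example used here
    base=${src%.*}
    tif=$src
    if needs_conversion "$src"; then
        tif=$base.tif
        convert -monochrome +compress "$src" "$tif" || return 1
    fi
    tesseract "$tif" "$base" -l "$lang"
}
```

For example, ocr_image upload.png spa would create upload.tif and then
upload.txt.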

> I assume tesseract outputs UTF-8/Unicode compatible encodings for
> foreign language?

Sorry, I'm pretty new to tesseract too and I'm not sure about that.

> Cheers!
> Darren
>
> On Thu, 2009-01-08 at 10:35 -0700, Michael Moore wrote:
>> On Thu, Jan 8, 2009 at 10:07 AM, Darren Govoni <[email protected]> wrote:
>> >
>> > Hi,
>> >  I read tesseract supports a variety of languages. I converted a Spanish
>> > text JPG to TIFF and ran tesseract with the Spanish language pack and
>> > the output text was not even close. Here is the image link:
>> >
>> > http://www.libertas.hu/slike/Spanish%20text.jpg
>> >
>> > Here was the output .txt (gibberish):
>> >
>> > _ físbřts-q:!:
>> > J4Lçar'tr.r1S.r:t:
>> > cærrtraø. CSS.
>> > 22C} (puerta
>> > ğrcupietaria)
>> > Er: |::•ler'•c CC
>> > guæ- En ve
>> > (SCI a 7O êj
>> > Scruõs, Eu'1 1
>> > la psàtirna 1::
>> > trate: pcvr ÇE
>> > elgc czlæ irug
>> > prirrtære pdă
>> > era la plants
>> > erreglscics
>> > \/Istæ sobre
>> >
>> >
>> > What is the trick to getting correct results?
>> >
>> > Thank you.
>> > Darren
>>
>> Your image has a lot of grey in it. It would be more helpful to see
>> the TIFF file you used than the JPG.
>>
>> I used the Gimp's 'levels' tool to enhance the image, then changed the
>> image to a 1 bit palette and saved it as a TIFF.
>> You can see the resulting image here:
>> http://stuporglue.org/downloads/spanish.tif
>>
>> The command I used was 'tesseract spanish.tif sp -l spa' and I have
>> tesseract 2.0.3 installed.
>>
>> I Habitsclonas da Ann (AnEls
>> Aparrmsnts; plano en Cøior del
>> centro, G5, 61): Prijsko. 7. D 321~
>> 220 (peru es ram encontrar u la
>> propletarxa) y 09B503·28E (móvil).
>> En pleno corazón de la ciudad anti-
>> gua. En verano. de 444 a 518 Kn
>> (60 a 70 É) la noche para dos per
>> Sonas. En una Casa adornada por
>> la pátina de los siglos. Muy buen
>> trato por parte de Ana, quien habla
>> algo de inglés. DOS estudios en IE
>> primera planta y un apartamento
>> en la planta baja. impecables y bian
>> arreglados, con ducha y Cocina.
>> Vista sobre las Callejas del barrio.
>>
>> I think if you do some image cleanup before processing them with
>> tesseract you will get much better results.



-- 
Michael Moore
-------------------------
Share your families' genealogy and family history books. It's easy and
free : http://bookscanned.com

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to 
[email protected]
For more options, visit this group at 
http://groups.google.com/group/tesseract-ocr?hl=en
-~----------~----~----~----~------~----~------~--~---
