Hey Michael,
   I really appreciate the tips. I'm developing an automated batch
OCR system, and there won't be much human cleanup time available
(hundreds of thousands of images).

However, might it be possible to do some contrast or resolution
enhancements using ImageMagick's convert? What do you think? I'm
somewhat new to this, so I'm still learning the ropes.

The system won't know the quality of the images it will try to OCR, so
to reduce human checking, I'm looking for ways to convert standard
images into the best possible format for tesseract (which I'm new to).
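
To make the idea concrete, here is a rough sketch of the kind of batch
step I have in mind. It is only a guess at a workable pipeline, not a
tested recipe: the function names are made up for illustration, the 50%
threshold is an arbitrary starting point that would need tuning per
image set, and it assumes ImageMagick's convert and tesseract are both
on the PATH and that the .txt output can be read as UTF-8.

```python
import subprocess
from pathlib import Path


def build_convert_cmd(src, dst, threshold_pct=50):
    """Build an ImageMagick command that makes a black-and-white TIFF.

    threshold_pct is a hypothetical tuning knob; 50 is only a guess.
    """
    if not 0 <= threshold_pct <= 100:
        raise ValueError("threshold_pct must be between 0 and 100")
    return [
        "convert", str(src),
        "-colorspace", "Gray",              # drop colour information
        "-normalize",                       # stretch the contrast range
        "-threshold", f"{threshold_pct}%",  # force pixels to black or white
        str(dst),
    ]


def build_tesseract_cmd(tif, out_base, lang="spa"):
    """Build the tesseract call; tesseract appends .txt to out_base."""
    return ["tesseract", str(tif), str(out_base), "-l", lang]


def ocr_image(src, workdir, lang="spa"):
    """Clean up one image, OCR it, and return the recognised text."""
    src = Path(src)
    workdir = Path(workdir)
    tif = workdir / (src.stem + ".tif")
    out_base = workdir / src.stem
    subprocess.run(build_convert_cmd(src, tif), check=True)
    subprocess.run(build_tesseract_cmd(tif, out_base, lang), check=True)
    # Assumption: the output file is UTF-8 encoded.
    return Path(str(out_base) + ".txt").read_text(encoding="utf-8")
```

Something like ocr_image("Spanish text.jpg", "/tmp/ocr") would then be
called once per file. Keeping the command building separate from the
subprocess calls should make it easy to try other cleanup options
without touching the batch loop.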

I assume tesseract outputs UTF-8/Unicode-compatible encodings for
foreign languages?

Cheers!
Darren

On Thu, 2009-01-08 at 10:35 -0700, Michael Moore wrote:
> On Thu, Jan 8, 2009 at 10:07 AM, Darren Govoni <[email protected]> wrote:
> >
> > Hi,
> >  I read tesseract supports a variety of languages. I converted a Spanish
> > text JPG to TIFF and ran tesseract with the Spanish language pack, and
> > the output text was not even close. Here is the image link:
> >
> > http://www.libertas.hu/slike/Spanish%20text.jpg
> >
> > Here was the output .txt (gibberish):
> >
> > _ físbřts-q:!:
> > J4Lçar'tr.r1S.r:t:
> > cærrtraø. CSS.
> > 22C} (puerta
> > ğrcupietaria)
> > Er: |::•ler'•c CC
> > guæ- En ve
> > (SCI a 7O êj
> > Scruõs, Eu'1 1
> > la psàtirna 1::
> > trate: pcvr ÇE
> > elgc czlæ irug
> > prirrtære pdă
> > era la plants
> > erreglscics
> > \/Istæ sobre
> >
> >
> > What is the trick to getting correct results?
> >
> > Thank you.
> > Darren
> 
> Your image has a lot of grey in it. It would be more helpful to see
> the TIFF file you used than the JPG.
> 
> I used the Gimp's 'levels' tool to enhance the image, then changed the
> image to a 1 bit palette and saved it as a TIFF.
> You can see the resulting image here:
> http://stuporglue.org/downloads/spanish.tif
> 
> The command I used was 'tesseract spanish.tif sp -l spa' and I have
> tesseract 2.0.3 installed.
> 
> I Habitsclonas da Ann (AnEls
> Aparrmsnts; plano en Cøior del
> centro, G5, 61): Prijsko. 7. D 321~
> 220 (peru es ram encontrar u la
> propletarxa) y 09B­503·28E (móvil).
> En pleno corazón de la ciudad anti-
> gua. En verano. de 444 a 518 Kn
> (60 a 70 É) la noche para dos per­
> Sonas. En una Casa adornada por
> la pátina de los siglos. Muy buen
> trato por parte de Ana, quien habla
> algo de inglés. DOS estudios en IE
> primera planta y un apartamento
> en la planta baja. impecables y bian
> arreglados, con ducha y Cocina.
> Vista sobre las Callejas del barrio.
> 
> I think if you do some image cleanup before processing them with
> tesseract you will get much better results.
> 
> 

