The output is indeed UTF-8.

Ray
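
For the batch setup discussed in the quoted message below, here is a minimal sketch of how the two quoted commands (the convert call and the tesseract call) might be chained per image, with the result read back as UTF-8. The function names, the output directory layout and the default language are assumptions made for illustration; the argument order is copied from the quoted commands, and it assumes tesseract writes its text to the output base name plus ".txt".

```python
from pathlib import Path
import subprocess


def build_commands(src: Path, out_dir: Path, lang: str = "spa"):
    """Return (convert_cmd, tesseract_cmd, text_path) for one input image.

    Hypothetical helper: it only builds the argument lists, nothing is run.
    """
    tif = out_dir / (src.stem + ".tif")
    base = out_dir / src.stem  # output base name passed to tesseract
    # Same options as the quoted batch script: 1-bit output, uncompressed TIFF.
    convert_cmd = ["convert", "-monochrome", "+compress", str(src), str(tif)]
    # Same shape as the quoted call: tesseract <image> <output base> -l <lang>
    tesseract_cmd = ["tesseract", str(tif), str(base), "-l", lang]
    text_path = base.with_name(base.name + ".txt")
    return convert_cmd, tesseract_cmd, text_path


def ocr_image(src: Path, out_dir: Path, lang: str = "spa") -> str:
    """Clean up one image, OCR it, and return the recognized text."""
    convert_cmd, tesseract_cmd, text_path = build_commands(src, out_dir, lang)
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if either tool fails.
    subprocess.run(convert_cmd, check=True)
    subprocess.run(tesseract_cmd, check=True)
    # The output is UTF-8, so decode it explicitly rather than relying on
    # the platform default encoding.
    return text_path.read_text(encoding="utf-8")
```

Calling ocr_image(Path("Spanish text.jpg"), Path("out")) would then run both tools, provided convert and tesseract are on the PATH. Passing the arguments as lists keeps file names with spaces intact without any shell quoting.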

On Thu, Jan 8, 2009 at 11:00 AM, Michael Moore <[email protected]> wrote:

> On Thu, Jan 8, 2009 at 11:30 AM, Darren Govoni <[email protected]>
> wrote:
> >
> > Hey Michael,
> >   I really appreciate the tips. I'm developing an automated batch
> > ocr'ing system and there won't be a lot of human cleanup time involved
> > (100,000's of images).
> >
> >  However, it might be possible to do some contrast or resolution type
> > enhancements using convert/imagemagick? You think? I'm somewhat new to
> > this, so learning the ropes still.
>
> The more uniform and cleaner your input images are, the easier it will
> be to batch it.
>
> Here's what I use in my batch script
> convert -monochrome +compress  Input.jpg output.tif
>
> The -monochrome seems to do a pretty good job of getting rid of grey
> stray marks on the input (like folds in pages, light pencil lines) and
> results in a smaller file.
> The +compress tells it not to compress the tif file. You can also use
> -compress lzw and some other options. I'm pretty sure you can't use
> -compress jpeg with the -monochrome option.
>
> Your Spanish text.jpg image looks like it's a highly compressed JPG. If
> you can get an uncompressed image to start with, you'll probably be
> better off already.
>
> > The system won't know the quality of the images it will try to OCR, so
> > to reduce human checking, I'm looking for ways to convert standard
> > images into the best possible format for tesseract (which I'm new to).
>
> You may be able to get users to upload better files if you require
> pngs or something like that. Most scanner software can save as png. I
> believe that tesseract only takes tif files and I've had the best luck
> with monochrome files.
>
> > I assume tesseract outputs UTF-8/Unicode compatible encodings for
> > foreign language?
>
> Sorry, I'm pretty new to tesseract too and I'm not sure about that.
>
> > Cheers!
> > Darren
> >
> > On Thu, 2009-01-08 at 10:35 -0700, Michael Moore wrote:
> >> On Thu, Jan 8, 2009 at 10:07 AM, Darren Govoni <[email protected]> wrote:
> >> >
> >> > Hi,
> >> >  I read tesseract supports a variety of languages. I converted a Spanish
> >> > text JPG to TIFF and ran tesseract with the Spanish language pack and
> >> > the output text was not even close. Here is the image link:
> >> >
> >> > http://www.libertas.hu/slike/Spanish%20text.jpg
> >> >
> >> > Here was the output .txt (gibberish):
> >> >
> >> > _ físbřts-q:!:
> >> > J4Lçar'tr.r1S.r:t:
> >> > cærrtraø. CSS.
> >> > 22C} (puerta
> >> > ğrcupietaria)
> >> > Er: |::•ler'•c CC
> >> > guæ- En ve
> >> > (SCI a 7O êj
> >> > Scruõs, Eu'1 1
> >> > la psàtirna 1::
> >> > trate: pcvr ÇE
> >> > elgc czlæ irug
> >> > prirrtære pdă
> >> > era la plants
> >> > erreglscics
> >> > \/Istæ sobre
> >> >
> >> >
> >> > What is the trick to getting correct results?
> >> >
> >> > Thank you.
> >> > Darren
> >>
> >> Your image has a lot of grey in it. It would be more helpful to see
> >> the TIFF file you used than the JPG.
> >>
> >> I used the Gimp's 'levels' tool to enhance the image, then changed the
> >> image to a 1 bit palette and saved it as a TIFF.
> >> You can see the resulting image here:
> >> http://stuporglue.org/downloads/spanish.tif
> >>
> >> The command I used was 'tesseract spanish.tif sp -l spa' and I have
> >> tesseract 2.0.3 installed.
> >>
> >> I Habitsclonas da Ann (AnEls
> >> Aparrmsnts; plano en Cøior del
> >> centro, G5, 61): Prijsko. 7. D 321~
> >> 220 (peru es ram encontrar u la
> >> propletarxa) y 09B­503·28E (móvil).
> >> En pleno corazón de la ciudad anti-
> >> gua. En verano. de 444 a 518 Kn
> >> (60 a 70 É) la noche para dos per­
> >> Sonas. En una Casa adornada por
> >> la pátina de los siglos. Muy buen
> >> trato por parte de Ana, quien habla
> >> algo de inglés. DOS estudios en IE
> >> primera planta y un apartamento
> >> en la planta baja. impecables y bian
> >> arreglados, con ducha y Cocina.
> >> Vista sobre las Callejas del barrio.
> >>
> >> I think if you do some image cleanup before processing them with
> >> tesseract you will get much better results.
>
> --
> Michael Moore
> -------------------------
> Share your families' genealogy and family history books. It's easy and
> free : http://bookscanned.com