The output is indeed UTF-8.

Ray.

On Thu, Jan 8, 2009 at 11:00 AM, Michael Moore <[email protected]> wrote:
> On Thu, Jan 8, 2009 at 11:30 AM, Darren Govoni <[email protected]> wrote:
> >
> > Hey Michael,
> > I really appreciate the tips. I'm developing an automated batch
> > ocr'ing system and there won't be a lot of human cleanup time involved
> > (100,000's of images).
> >
> > However, it might be possible to do some contrast or resolution type
> > enhancements using convert/imagemagick? You think? I'm somewhat new to
> > this, so learning the ropes still.
>
> The more uniform and cleaner your input images are, the easier it will
> be to batch it.
>
> Here's what I use in my batch script:
> convert -monochrome +compress Input.jpg output.tif
>
> The -monochrome seems to do a pretty good job of getting rid of grey
> stray marks on the input (like folds in pages, light pencil lines) and
> results in a smaller file.
> The +compress tells it not to compress the tif file. You can also use
> -compress lzw and some other options. I'm pretty sure you can't use
> -compress jpeg with the -monochrome option.
>
> Your Spanish text.jpg image looks like it's a highly compressed JPG. If
> you can get an uncompressed image to start with you'll probably be
> better off already.
>
> > The system won't know the quality of the images it will try to OCR, so
> > to reduce human checking, I'm looking for ways to convert standard
> > images into the best possible format for tesseract (which I'm new to).
>
> You may be able to get users to upload better files if you require
> pngs or something like that. Most scanner software can save as png. I
> believe that tesseract only takes tif files and I've had the best luck
> with monochrome files.
>
> > I assume tesseract outputs UTF-8/Unicode compatible encodings for
> > foreign language?
>
> Sorry, I'm pretty new to tesseract too and I'm not sure about that.
>
> > Cheers!
> > Darren
> >
> > On Thu, 2009-01-08 at 10:35 -0700, Michael Moore wrote:
> >> On Thu, Jan 8, 2009 at 10:07 AM, Darren Govoni <[email protected]> wrote:
> >> >
> >> > Hi,
> >> > I read tesseract supports a variety of languages. I converted a Spanish
> >> > text JPG to TIFF and ran tesseract with the Spanish language pack and
> >> > the output text was not even close. Here is the image link:
> >> >
> >> > http://www.libertas.hu/slike/Spanish%20text.jpg
> >> >
> >> > Here was the output .txt (gibberish):
> >> >
> >> > _ físbřts-q:!:
> >> > J4Lçar'tr.r1S.r:t:
> >> > cærrtraø. CSS.
> >> > 22C} (puerta
> >> > ğrcupietaria)
> >> > Er: |::•ler'•c CC
> >> > guæ- En ve
> >> > (SCI a 7O êj
> >> > Scruõs, Eu'1 1
> >> > la psàtirna 1::
> >> > trate: pcvr ÇE
> >> > elgc czlæ irug
> >> > prirrtære pdă
> >> > era la plants
> >> > erreglscics
> >> > \/Istæ sobre
> >> >
> >> > What is the trick to getting correct results?
> >> >
> >> > Thank you.
> >> > Darren
> >>
> >> Your image has a lot of grey in it. It would be more helpful to see
> >> the TIFF file you used than the JPG.
> >>
> >> I used the Gimp's 'levels' tool to enhance the image, then changed the
> >> image to a 1 bit palette and saved it as a TIFF.
> >> You can see the resulting image here:
> >> http://stuporglue.org/downloads/spanish.tif
> >>
> >> The command I used was 'tesseract spanish.tif sp -l spa' and I have
> >> tesseract 2.0.3 installed.
> >>
> >> I Habitsclonas da Ann (AnEls
> >> Aparrmsnts; plano en Cøior del
> >> centro, G5, 61): Prijsko. 7. D 321~
> >> 220 (peru es ram encontrar u la
> >> propletarxa) y 09B503·28E (móvil).
> >> En pleno corazón de la ciudad anti-
> >> gua. En verano. de 444 a 518 Kn
> >> (60 a 70 É) la noche para dos per
> >> Sonas. En una Casa adornada por
> >> la pátina de los siglos. Muy buen
> >> trato por parte de Ana, quien habla
> >> algo de inglés. DOS estudios en IE
> >> primera planta y un apartamento
> >> en la planta baja.
> >> impecables y bian
> >> arreglados, con ducha y Cocina.
> >> Vista sobre las Callejas del barrio.
> >>
> >> I think if you do some image cleanup before processing them with
> >> tesseract you will get much better results.
>
> --
> Michael Moore
> -------------------------
> Share your families' genealogy and family history books. It's easy and
> free : http://bookscanned.com
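
For anyone batching this, the two steps discussed in the thread (convert to an uncompressed monochrome TIFF, then run tesseract with a language code) could be sketched as the shell functions below. This is only a sketch under assumptions: the function names `run` and `ocr_dir`, the `DRY_RUN` switch, the `.jpg`-only glob, and the directory layout are all hypothetical and not from anyone's actual script. The `convert` options and the `-l spa` flag are the ones quoted above.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# ocr_dir INPUT_DIR OUTPUT_DIR [LANG]
# Converts every .jpg in INPUT_DIR to a monochrome, uncompressed TIFF in
# OUTPUT_DIR and runs tesseract on each TIFF.
# LANG defaults to "spa", the language code used earlier in the thread.
ocr_dir() {
    in_dir=$1
    out_dir=$2
    lang=${3:-spa}

    mkdir -p "$out_dir" || return 1

    for src in "$in_dir"/*.jpg; do
        # If nothing matched, the glob stays literal; skip it.
        [ -e "$src" ] || continue

        base=$(basename "$src" .jpg)
        tif="$out_dir/$base.tif"

        # Same options as the convert command quoted in the thread.
        run convert -monochrome +compress "$src" "$tif" || return 1

        # tesseract takes an output base name and appends .txt itself.
        run tesseract "$tif" "$out_dir/$base" -l "$lang" || return 1
    done
}
```

With these functions sourced, something like `DRY_RUN=1 ocr_dir ./scans ./ocr spa` would print the commands for a quick check before running the real batch with `ocr_dir ./scans ./ocr spa`. The image quality still matters more than the script, so it is probably worth spot-checking a few TIFFs by eye first.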

