Thanks for your reply! I created said spa.user-words file in tesseract-ocr/tessdata but it didn't help. Maybe I'm doing something wrong...
However, I tested changing the language specification in the tesseract invocation...

Before, it was:

tesseract -l eng+spa tmp.tif stdout

Now, it's:

tesseract -l spa tmp.tif stdout

For some reason, this solved my issue. I'm a bit perplexed... Why did changing the -l flag fix it?

On Sun, 2017-11-12 at 09:52 -0800, Dan9er wrote:
> Try making a file named spa.user-words in tesseract-ocr/tessdata with this
> line in it:
> C.V.
>
> This will tell tesseract that this is a special word that it should also
> look for. You can also add more words on each line in order of the
> frequency they appear in your context. This feature was added so you can
> add your context-specific words to Tesseract's dictionary without having to
> retrain it.
>
> On Friday, November 10, 2017 at 12:50:36 PM UTC-5, i wrote:
> >
> > Hopefully this is clearer than my previous mail...
> >
> > My command-line invocation is as follows...
> >
> > convert -density 600 mailinglist01.pdf tmp.tif
> > tesseract -l eng+spa tmp.tif stdout
> >
> > I'm attaching the "mailinglist01.pdf" file...
> >
> > I'm using data files downloaded from this section of the wiki...
> >
> > https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-304305
> >
> > The text generated by tesseract contains the string
> > "417575 5.1 COMUNICACION, S.A. DE CV."
> >
> > This is incorrect, as it should say
> > "417575 5.1 COMUNICACION, S.A. DE C.V."
> > It's missing a period between the "C" and the "V".
> >
> > A quick tally tells me that the above command-line sequence triggers
> > this error 24 times...
> >
> > Can anyone think of any Tesseract tweaks that would fix this?
> >
> > OTOH it's easy to fix this with text processing, after a Tesseract
> > invocation. Do people usually fix these kinds of things with search
> > and replace?
> >
> > These are the software versions...
> >
> > ~% convert --version
> > Version: ImageMagick 6.7.7-10 2014-08-28 Q16 http://www.imagemagick.org
> > Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
> > Features: OpenMP
> >
> > ~% tesseract --version
> > tesseract 3.05.01
> > leptonica-1.74.4
> > libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 :
> > libtiff 4.0.3 : zlib 1.2.8
> >
> >
> > On Thu, 2017-11-09 at 10:09 -0800, i wrote:
> > > Hey!
> > >
> > > It's my first time using Tesseract. Apologies if my questions are
> > > off-topic.
> > >
> > > This is the tesseract version:
> > >
> > > tesseract 3.05.01
> > > leptonica-1.74.4
> > > libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 :
> > > libtiff 4.0.3 : zlib 1.2.8
> > >
> > > A recurrent error in the generated text concerns the string "C.V."
> > > This string is often not being read / parsed / recognized correctly...
> > >
> > > Quite often, the generated text will contain the incorrect "CV." string
> > > instead of the correct "C.V." string.
> > >
> > > I'm attaching a sample PDF.
> > > FWIW the complete phrase is "S.A. DE C.V.", which is a common "type
> > > of business entity" in Spanish-speaking geographies...
> > >
> > > Would anyone have any suggestions for fixing this?
> > >
> >
> > --
> > Ivan Monroy
> > Desarrollador en Tecnologías para la Transparencia
> >
> > Datos - https://quienesquien.wiki - @QuienQuienWiki
> > PODER - http://projectpoder.org - @projectPODER
> > email - [email protected]
> > PGP --- 4EB8 DBD8 12DF 4CE2 D942 5FE6 CFB3 B835 BF0D 6582
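For reference, the search-and-replace fallback mentioned in the quoted mail could look roughly like the sketch below. It is only an illustration, not something Tesseract provides: fix_cv is a hypothetical helper name, and the sed pattern assumes the error always appears as the exact string "S.A. DE CV." seen in the sample output.

```shell
# Hypothetical helper (not part of Tesseract): restores the period that
# the OCR step dropped, turning "S.A. DE CV." into "S.A. DE C.V.".
# Text that is already correct is left untouched, because the pattern
# requires "CV." with no period after the "C".
fix_cv() {
    sed 's/S\.A\. DE CV\./S.A. DE C.V./g'
}

# Intended use, assuming tmp.tif was produced by the convert step:
#   tesseract -l spa tmp.tif stdout | fix_cv

# Quick check against the string from the sample output:
printf '%s\n' '417575 5.1 COMUNICACION, S.A. DE CV.' | fix_cv
# prints: 417575 5.1 COMUNICACION, S.A. DE C.V.
```

Anchoring the pattern on the full "S.A. DE" prefix is a deliberate choice here: a bare "CV." replacement might also rewrite unrelated words that happen to end in "CV.".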

