Try creating a file named spa.user-words in tesseract-ocr/tessdata
containing this line:
C.V.

This tells Tesseract that "C.V." is a special word it should also look
for. You can add more words, one per line, in order of how frequently
they appear in your documents. This feature was added so you can add
context-specific words to Tesseract's dictionary without having to
retrain it.
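
If you would rather script the file creation, here is a minimal sketch in
Python. The function name write_user_words and its parameters are my own
invention for illustration, not part of Tesseract; it only writes the
word list, one word per line, to <lang>.user-words in whatever tessdata
directory you pass in. The demo writes to a temporary directory so it is
safe to run as-is.

```python
import tempfile
from pathlib import Path


def write_user_words(tessdata_dir, lang, words):
    """Write `words`, one per line, to <tessdata_dir>/<lang>.user-words.

    Hypothetical helper: the name and signature are illustrative only.
    List the words most frequent first.
    """
    path = Path(tessdata_dir) / f"{lang}.user-words"
    path.write_text("\n".join(words) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    # Demo against a throwaway directory; point tessdata_dir at your real
    # tessdata directory (e.g. tesseract-ocr/tessdata) when using it.
    with tempfile.TemporaryDirectory() as tmp:
        created = write_user_words(tmp, "spa", ["C.V."])
        print(created.name)
        print(created.read_text(encoding="utf-8"), end="")
```

For your case the call would be write_user_words("tesseract-ocr/tessdata",
"spa", ["C.V."]), assuming that is where your tessdata directory lives.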

On Friday, November 10, 2017 at 12:50:36 PM UTC-5, i wrote:
>
> Hopefully this is clearer than my previous mail... 
>
> My commandline invocation is as follows... 
>
> convert -density 600 mailinglist01.pdf tmp.tif 
> tesseract -l eng+spa tmp.tif stdout 
>
> I'm attaching the "mailinglist01.pdf" file... 
>
> I'm using data files downloaded from this section of the wiki... 
>
> https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-304305
>  
>
> The text generated by tesseract contains the string 
> "417575 5.1 COMUNICACION, S.A. DE CV." 
>
> This is incorrect, as it should say 
> "417575 5.1 COMUNICACION, S.A. DE C.V." 
> It's missing a period between the "C" and the "V" 
>
> A quick tally tells me that the above commandline sequence triggers 
> this error 24 times... 
>
> Can anyone think of any Tesseract tweaks that would fix this? 
>
> OTOH it's easy to fix this with text processing, after a Tesseract 
> invocation. Do people usually fix this type of thing with search 
> and replace? 
>
> These are the software versions... 
>
> ~% convert --version 
> Version: ImageMagick 6.7.7-10 2014-08-28 Q16 http://www.imagemagick.org 
> Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC 
> Features: OpenMP 
>
> ~% tesseract --version 
> tesseract 3.05.01 
>  leptonica-1.74.4 
>    libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : 
> libtiff 4.0.3 : zlib 1.2.8 
>     
>
> On Thu, 2017-11-09 at 10:09 -0800, i wrote: 
> > Hey! 
> > 
> > It's my first time using Tesseract. Apologies if my questions are 
> > off-topic. 
> > 
> > This is the tesseract version: 
> > 
> > tesseract 3.05.01 
> >  leptonica-1.74.4 
> >   libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : 
> >  libtiff 4.0.3 : zlib 1.2.8 
> > 
> > A recurrent error in the generated text concerns the string "C.V." 
> > This string is often not being read / parsed / recognized correctly... 
> > 
> > Quite often, the generated text will contain the incorrect "CV." string 
> > instead of the correct "C.V." string. 
> > 
> > I'm attaching a sample PDF. 
> > FWIW the complete phrase is "S.A. DE C.V.", which is a common "type 
> > of business entity" in Spanish-speaking geographies... 
> > 
> > Would anyone have any suggestions for fixing this? 
> > 
>
> -- 
> ------------------------------ 
>
>
>
>
> Ivan Monroy 
> Desarrollador en Tecnologías para la Transparencia 
>
> Datos - https://quienesquien.wiki - @QuienQuienWiki 
> PODER - http://projectpoder.org - @projectPODER 
> email - [email protected] 
> PGP --- 4EB8 DBD8 12DF 4CE2 D942  5FE6 CFB3 B835 BF0D 6582 
>
>
>
>
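
On the search-and-replace question in the quoted message: if the
dictionary entry does not catch every case, a post-processing pass is a
reasonable fallback. A minimal sketch, assuming the error always appears
as "CV." directly after "S.A. DE" (the function name fix_cv is
hypothetical, and the pattern is only as good as that assumption):

```python
import re

# Match "CV." only when it directly follows "S.A. DE", so that unrelated
# occurrences of "CV" elsewhere in the text are left untouched.
_SA_DE_CV = re.compile(r"(S\.A\.\s+DE\s+)CV\.")


def fix_cv(text):
    """Restore the missing period in "S.A. DE CV." (hypothetical helper)."""
    return _SA_DE_CV.sub(r"\1C.V.", text)


if __name__ == "__main__":
    print(fix_cv("417575 5.1 COMUNICACION, S.A. DE CV."))
```

Anchoring the pattern on "S.A. DE" keeps the replacement from touching
strings that are already correct or that contain "CV." in another context.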
