Thanks for your reply!

I created said spa.user-words file in tesseract-ocr/tessdata but it
didn't help. Maybe I'm doing something wrong...
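
A minimal sketch of that setup, for anyone who wants to reproduce it. The helper name write_user_words and the example tessdata path are assumptions, not something from this thread. In 3.0x the file may only be read when user words are enabled explicitly (for example via the --user-words option), so that is worth checking before concluding the word list has no effect.

```shell
# Hypothetical helper: write one word per line into LANG.user-words
# inside the given tessdata directory.
# Usage: write_user_words TESSDATA_DIR LANG WORD...
write_user_words() {
    dir=$1
    lang=$2
    shift 2
    printf '%s\n' "$@" > "$dir/$lang.user-words"
}

# Example invocation (the path is an assumption; adjust to the local install):
#   write_user_words /usr/share/tesseract-ocr/tessdata spa 'C.V.' 'S.A.'
# Possible follow-up, assuming the installed build accepts --user-words:
#   tesseract tmp.tif stdout -l spa --user-words /usr/share/tesseract-ocr/tessdata/spa.user-words
```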

However, I tested changes to the language specification of the
tesseract invocation...

Before, it was:
tesseract -l eng+spa tmp.tif stdout

Now, it's:
tesseract -l spa tmp.tif stdout

For some reason, this solved my issue.
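
A rough way to compare the two settings is to tally the misread string in each output. This is only a sketch: count_misreads is a made-up helper name, and the loop assumes tmp.tif is in the current directory and that both language packs are installed.

```shell
# Hypothetical helper: count occurrences of the misread "DE CV." in
# OCR text read from stdin (grep -o prints one match per line).
count_misreads() {
    { grep -o 'DE CV\.' || true; } | wc -l | tr -d ' '
}

# Run both language settings on the same image and print the tallies.
# Skipped silently when tesseract or tmp.tif is not available.
if command -v tesseract >/dev/null 2>&1 && [ -f tmp.tif ]; then
    for langs in eng+spa spa; do
        n=$(tesseract -l "$langs" tmp.tif stdout 2>/dev/null | count_misreads)
        echo "$langs: $n misreads"
    done
fi
```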

I'm a bit perplexed...

Why did changing the -l flag fix it?
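
The search-and-replace fallback raised in the earlier mail quoted below could look roughly like this. It is a sketch only: fix_cv is a made-up name, and it assumes "DE CV." never appears legitimately in the documents.

```shell
# Hypothetical post-processing filter: rewrite the misread "DE CV."
# as "DE C.V." in text read from stdin.
fix_cv() {
    sed 's/DE CV\./DE C.V./g'
}

# Example pipeline (assumes tmp.tif exists):
#   tesseract -l eng+spa tmp.tif stdout | fix_cv
```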

On Sun, 2017-11-12 at 09:52 -0800, Dan9er wrote:
> Try making a file named spa.user-words in tesseract-ocr/tessdata with this 
> line in it:
> C.V.
> 
> This will tell tesseract that this is a special word that it should also 
> look for. You can also add more words on each line in order of the 
> frequency they appear in your context. This feature was added so you can 
> add your-context-specific words to Tesseract's dictionary without having to 
> retrain it.
> 
> On Friday, November 10, 2017 at 12:50:36 PM UTC-5, i wrote:
> >
> > Hopefully this is clearer than my previous mail... 
> >
> > My commandline invocation is as follows... 
> >
> > convert -density 600 mailinglist01.pdf tmp.tif 
> > tesseract -l eng+spa tmp.tif stdout 
> >
> > I'm attaching the "mailinglist01.pdf" file... 
> >
> > I'm using data files downloaded from this section of the wiki... 
> >
> > https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-304305
> >  
> >
> > The text generated by tesseract contains the string 
> > "417575 5.1 COMUNICACION, S.A. DE CV." 
> >
> > This is incorrect, as it should say 
> > "417575 5.1 COMUNICACION, S.A. DE C.V." 
> > It's missing a period between the "C" and the "V" 
> >
> > A quick tally tells me that the above commandline sequence triggers 
> > this error 24 times... 
> >
> > Can anyone think of any Tesseract tweaks that would fix this?
> >
> > OTOH it's easy to fix this with text processing, after a Tesseract
> > invocation. Do people usually fix this type of thing with search
> > and replace?
> >
> > These are the software versions... 
> >
> > ~% convert --version 
> > Version: ImageMagick 6.7.7-10 2014-08-28 Q16 http://www.imagemagick.org 
> > Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC 
> > Features: OpenMP 
> >
> > ~% tesseract --version 
> > tesseract 3.05.01 
> >  leptonica-1.74.4 
> >    libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : 
> > libtiff 4.0.3 : zlib 1.2.8 
> >     
> >
> > On Thu, 2017-11-09 at 10:09 -0800, i wrote: 
> > > Hey! 
> > > 
> > > It's my first time using Tesseract. Apologies if my questions are
> > > off-topic.
> > > 
> > > This is the tesseract version: 
> > > 
> > > tesseract 3.05.01 
> > >  leptonica-1.74.4 
> > >   libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : 
> > >  libtiff 4.0.3 : zlib 1.2.8 
> > > 
> > > A recurrent error in the generated text concerns the string "C.V." 
> > > This string is often not being read / parsed / recognized correctly... 
> > > 
> > > Quite often, the generated text will contain the incorrect "CV." string 
> > > instead of the correct "C.V." string. 
> > > 
> > > I'm attaching a sample PDF. 
> > > FWIW the complete phrase is "S.A. DE C.V.", which is a common "type 
> > > of business entity" in Spanish-speaking geographies... 
> > > 
> > > Would anyone have any suggestions for fixing this? 
> > > 
> >
> > -- 
> > ------------------------------ 
> >
> >
> >
> >
> > Ivan Monroy 
> > Desarrollador en Tecnologías para la Transparencia 
> >
> > Datos - https://quienesquien.wiki - @QuienQuienWiki 
> > PODER - http://projectpoder.org - @projectPODER 
> > email - [email protected]
> > PGP --- 4EB8 DBD8 12DF 4CE2 D942  5FE6 CFB3 B835 BF0D 6582 
> >
> >
> >
> >


-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/1510598770.10777.0.camel%40user-ThinkPad-X200.