Re: [tesseract-ocr] Training data gets worse as I add characters

ShreeDevi Kumar Fri, 21 Nov 2014 19:56:00 -0800

Hi,

Have you added the fonts to the font_properties file?
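
In case it helps, here is a rough Python sketch for checking that every training font has an entry. It assumes the Tesseract 3.x format of one font per line, `<fontname> <italic> <bold> <fixed> <serif> <fraktur>`, with spaces in the font name replaced by underscores. The function names and the case-insensitive matching are my own choices for illustration, not part of Tesseract.

```python
def parse_font_properties(text):
    """Parse '<fontname> <italic> <bold> <fixed> <serif> <fraktur>' lines.

    Returns a dict mapping the lower-cased font name to its five 0/1 flags.
    Lines that do not have exactly six fields are skipped.
    """
    fonts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6 or not all(p in ("0", "1") for p in parts[1:]):
            continue
        fonts[parts[0].lower()] = tuple(int(p) for p in parts[1:])
    return fonts


def missing_fonts(training_fonts, font_properties_text):
    """Return the training fonts that have no entry in font_properties.

    Assumption: entries use underscores instead of spaces, and names are
    compared case-insensitively (a convenience of this sketch).
    """
    known = parse_font_properties(font_properties_text)
    return [f for f in training_fonts
            if f.replace(" ", "_").lower() not in known]


# Hypothetical file contents, for illustration only.
sample = """Arial_Unicode_MS 0 0 0 0 0
FreeSerif 0 0 0 1 0
"""
print(missing_fonts(["arial unicode ms", "freeserif", "liberation mono"], sample))
```

Any font the sketch reports would need its own line in the file before training.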


Try removing the 'narrow' font from your training set.

Test with just one or two similar fonts and see if results are better.
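
A reduced test also makes it easier to see which character classes compete with each other. As a rough sketch (plain Python with the standard unicodedata module, nothing Tesseract-specific; the grouping rule and all names are only illustrative assumptions), this groups the letters in your ranges by their undecorated base letter:

```python
import unicodedata
from collections import defaultdict

# Code point ranges quoted in the original post (inclusive).
RANGES = [
    (0x0000, 0x007F),
    (0x0080, 0x00FF),
    (0x0100, 0x017F),
    (0x0180, 0x024F),
    (0x1E00, 0x1EFF),
    (0x2500, 0x2594),
    (0xFB00, 0xFB06),
]


def base_letter(ch):
    """First code point of the canonical decomposition, lower-cased."""
    return unicodedata.normalize("NFD", ch)[0].lower()


def group_by_base(ranges=RANGES):
    """Map each base letter to all letters in the ranges that reduce to it."""
    groups = defaultdict(list)
    for start, end in ranges:
        for cp in range(start, end + 1):
            ch = chr(cp)
            # Keep letters only; controls, symbols and box drawing are skipped.
            if unicodedata.category(ch).startswith("L"):
                groups[base_letter(ch)].append(ch)
    return groups


groups = group_by_base()
print(len(groups["o"]), "".join(groups["o"]))
```

Each group is a set of glyphs that differ only by case and accents, which is where substitutions such as Ṏ for ö or Ȧ for A would be expected.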



ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Nov 22, 2014 at 7:11 AM, Ryan Dev <[email protected]> wrote:

> I am trying to cover as many of the Latin Unicode characters in the BMP as
> I can.
>
> What I find is that as I add more characters, the OCR results get worse.
>
> For example, instead of getting the correct ö I get Ö and then as I added
> more characters the latest result is Ṏ.
>
> In other words, not only is it getting worse at detecting capitalization
> correctly, but it is favoring more complex characters over the simpler
> solutions! This is just one example; another is Ȧ instead of the correct
> A.
>
> When I run a smaller set of training data I get better results (for the
> trained characters; the others, of course, are missed completely).
>
> Should I be trying to build multiple, smaller traineddata files? This will
> reduce performance, but I need accuracy most of all. I have also had cases
> where confidence is reported as high on incorrect results and lower on
> correct results.
>
> I'm using the latest Tesseract checkout on Ubuntu, with the tesstrain.sh
> script.
>
> Linked are the files I'm using: a sample training image, the traineddata,
> and an example image that I run OCR on.
>
>
> https://drive.google.com/folderview?id=0B5ebDnF6cn8UTVhBc25OOV9JYTg&usp=sharing
>
> The Unicode ranges I am trying to train for at the moment are:
>
> 0000 - 007f Basic Latin
> 0080 - 00ff Latin-1 Supplement
> 0100 - 017f Latin Extended-A
> 0180 - 024f Latin Extended-B
> 1e00 - 1eff Latin Extended Additional
> 2500 - 2594 Box Drawing and Block Elements
> fb00 - fb06 Alphabetic Presentation Forms (Latin ligatures)
>
> Using the following fonts for training:
> arial unicode ms
> freeserif
> liberation mono
> liberation sans
> liberation sans narrow condensed
> liberation serif
> segoe ui
>
> I can certainly add more if that helps, but so far adding fonts just means
> it takes longer to realize how bad the trained data is.
>
> If you are wondering why I am doing this, it is because I am trying to
> create a language-agnostic solution. You can see a test image in the link
> above; I am only looking at individual font glyphs, not full-page OCR.
>
> Any suggestions/advice appreciated!
>
>  --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/b5a502dd-78e8-467a-ad0d-a225bc12715b%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/b5a502dd-78e8-467a-ad0d-a225bc12715b%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>

