[tesseract-ocr] How to train tesseract and how to recognize multiple columns and multiple fonts

Guillermo Manglano Fri, 01 Sep 2017 04:03:08 -0700


My first post doesn't seem to have shown up.

I have been asked to convert a PDF made up of images into a txt or csv file
so that its contents can be stored in a database. I am trying to run OCR on
images like the one attached.

The results are as poor as the following:

`20—0
¿ ABÚEADD LDIDI ALBARH, JDSE
AHTÚHIÚ
—- EnlúndeLarreájzegm25- Sºt] . . . . . 944 355019
: ABDGADD 5E'I'IEH ÁLUAREI 5EUERIHD`

The phone number (944 355019) is especially important. It looks close to
correct, but it still has wrong digits, which makes the whole result useless.
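As a rough sanity check on lines like the one above, a small script can pull out anything that looks like a phone number. This is only a sketch: it assumes a phone number is nine digits, optionally separated by spaces, and the names `PHONE_RE` and `find_phone_numbers` are made up for illustration. It can catch a letter read in place of a digit, but not one digit misread as another.

```python
import re

# Assumption: a phone number is nine digits, optionally separated by single
# spaces (as in "944 355019"). The lookarounds reject longer digit runs.
PHONE_RE = re.compile(r"(?<!\d)\d(?: ?\d){8}(?!\d)")


def find_phone_numbers(line):
    """Return the nine-digit candidates found in one OCR line, spaces removed."""
    return [m.group(0).replace(" ", "") for m in PHONE_RE.finditer(line)]


if __name__ == "__main__":
    ocr_line = "—- EnlúndeLarreájzegm25- Sºt] . . . . . 944 355019"
    print(find_phone_numbers(ocr_line))
```

A line where no candidate is found (for example, when a `0` was read as the letter `O`) can then be flagged for manual review.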

After much reading I still do not know how to train tesseract. I am
following these instructions
<https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract>, among
others, but when I try to run:

text2image --text=training_text.txt --outputbase=spa.arial.exp0 --font='Arial' --fonts_dir=/home/Fonts

I get

Could not find font named Nimbus Sans. Pango suggested font

Please correct --font arg.:Error:Assert failed:in file text2image.cpp, line 437

Segmentation fault (core dumped)
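One thing that may help with this error is to check which font names Pango can actually see in the fonts directory, since the value passed to `--font` has to match one of them. The sketch below wraps `text2image --list_available_fonts`; the helper names are hypothetical, and `/home/Fonts` is just the directory from the command above.

```python
import shutil
import subprocess


def list_fonts_command(fonts_dir):
    """Build the text2image invocation that lists the fonts found in fonts_dir."""
    return ["text2image", "--list_available_fonts", "--fonts_dir=" + fonts_dir]


def list_fonts(fonts_dir):
    """Run the command and return its output, or None if text2image is missing."""
    cmd = list_fonts_command(fonts_dir)
    if shutil.which(cmd[0]) is None:
        return None
    # stderr is merged into stdout so the listing is captured either way.
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    return result.stdout


if __name__ == "__main__":
    output = list_fonts("/home/Fonts")
    print(output if output is not None else "text2image not found on PATH")
```

If Arial does not appear in the listing, the font files are probably not in that directory, or are not being picked up from it.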

How should I approach this problem, with multiple fonts, multiple columns,
and Spanish as the language?

[image: example]

