First post doesn't show.
I have been tasked with converting a PDF of images into a txt or csv file to store in a database. I am trying to run OCR on images like the one attached, but the results are as poor as the following:

`20—0 ¿ ABÚEADD LDIDI ALBARH, JDSE AHTÚHIÚ —- EnlúndeLarreájzegm25- Sºt] . . . . . 944 355019 : ABDGADD 5E'I'IEH ÁLUAREI 5EUERIHD`

The phone number (944 355019) is especially important. It comes out close to correct, but it still has wrong digits, which makes the whole result useless.

After much reading I still do not know how to train tesseract. I am following these instructions, among others: <https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract>

But when I try to run:

    text2image --text=training_text.txt --outputbase=spa.*arial*.exp0 --font=' *Arial*' --fonts_dir=/home/Fonts

I get:

    Could not find font named Nimbus Sans.
    Pango suggested font
    Please correct --font arg.:Error:Assert failed:in file text2image.cpp, line 437
    Segmentation fault (core dumped)

How should I approach this problem, given multiple fonts, multiple columns, and Spanish as the language?

[image: example]
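Since the phone number is the field I care about most, this is a minimal sketch of how I could at least flag lines where the OCR output contains no plausible phone number. It assumes phone numbers are nine digits, optionally separated by single spaces (as in "944 355019"); the function name `find_phone_numbers` is just illustrative. It only checks the shape of the number, so it would not catch a nine-digit number with wrong digits.

```python
import re

# Assumption: a phone number is nine ASCII digits, optionally separated
# by single spaces, and not embedded in a longer run of digits.
PHONE_RE = re.compile(r"(?<![0-9])([0-9](?: ?[0-9]){8})(?![0-9])")


def find_phone_numbers(ocr_text):
    """Return nine-digit candidates found in OCR output, with spaces removed."""
    return [m.group(1).replace(" ", "") for m in PHONE_RE.finditer(ocr_text)]
```

Lines for which this returns an empty list could be set aside for manual review instead of going straight into the database.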

