First post doesn't show. 


My task is to convert a PDF containing images into a txt or csv file to store 
in a database. I am trying to use OCR on images like the one attached.

The results are as poor as the following:

`20—0
¿ ABÚEADD LDIDI ALBARH, JDSE
AHTÚHIÚ
—- EnlúndeLarreájzegm25- Sºt] . . . . . 944 355019
: ABDGADD 5E'I'IEH ÁLUAREI 5EUERIHD`

The phone number (944 355019) is especially important. It looks close to 
correct, but it still has wrong digits, which makes the whole result useless.

After much reading I still do not know how to train Tesseract. I am 
following these instructions 
<https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract>, among 
others, but when I try to run:


text2image --text=training_text.txt --outputbase=spa.*arial*.exp0 --font='*Arial*' --fonts_dir=/home/Fonts


I get 


Could not find font named Nimbus Sans. Pango suggested font 

Please correct --font arg.:Error:Assert failed:in file text2image.cpp, line 437

Segmentation fault (core dumped)
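
For reference, here is a minimal sketch of how the same call could be assembled so that the font name and the output base stay consistent. The helper names `build_text2image_cmd` and `build_list_fonts_cmd` are hypothetical and not part of Tesseract. The `--text`, `--outputbase`, `--font` and `--fonts_dir` flags are the ones used above. `--list_available_fonts` is the flag that, as far as I understand the training wiki, prints the font names text2image can actually see in the fonts directory. The `lang.fontname.expN` output base pattern is also taken from that wiki. The sketch only builds and prints the commands; it does not run them.

```python
import shlex


def build_text2image_cmd(lang, font, text_path, fonts_dir, exp=0):
    """Return the text2image argument list for one training font.

    Hypothetical helper: lang, font, text_path and fonts_dir are supplied
    by the caller and are not validated here.
    """
    # Output base follows the lang.fontname.expN pattern, with spaces
    # removed from the font name so the file name stays a single token.
    fontname = font.replace(" ", "")
    outputbase = f"{lang}.{fontname}.exp{exp}"
    return [
        "text2image",
        f"--text={text_path}",
        f"--outputbase={outputbase}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
    ]


def build_list_fonts_cmd(fonts_dir):
    """Return the command that asks text2image which fonts it can see."""
    return ["text2image", "--list_available_fonts", f"--fonts_dir={fonts_dir}"]


if __name__ == "__main__":
    cmd = build_text2image_cmd("spa", "Arial", "training_text.txt", "/home/Fonts")
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
    print(shlex.join(build_list_fonts_cmd("/home/Fonts")))
```

Passing the arguments as a list avoids shell quoting problems around font names that contain spaces. The value given to `--font` presumably has to match one of the names reported for the fonts directory.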

How should I approach this problem with multiple fonts, multiple columns, and 
Spanish as the language?

[image: example]
