First of all, there is *already finished* langdata for Spanish here: <https://github.com/tesseract-ocr/langdata/tree/master/spa>. Download all the files, then run combine_tessdata spa. (with the trailing period).
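As a rough sketch of that step, assuming the Spanish files have already been downloaded into a local folder named spa (the folder name and the helper name build_spa are only placeholders for illustration), the call could be wrapped like this:

```shell
# Hypothetical helper: expects a directory that already holds the downloaded spa.* files.
build_spa() {
  dir="${1:-spa}"
  if [ ! -d "$dir" ]; then
    echo "missing directory: $dir" >&2
    return 1
  fi
  if ! command -v combine_tessdata >/dev/null 2>&1; then
    echo "combine_tessdata not found; install the Tesseract training tools" >&2
    return 1
  fi
  # The trailing period matters: "spa." is the file-name prefix that
  # combine_tessdata searches for, and the result is written to spa.traineddata.
  (cd "$dir" && combine_tessdata spa.)
}
```

Running build_spa spa should leave spa.traineddata in that folder if the component files are all present; what to do with the file afterwards depends on where your tessdata directory lives.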
Second, the fonts folder you're trying to access is *~/.fonts*, NOT /home/Fonts. Better yet, run nautilus (the file browser) as root (by running gksudo), then move your fonts to /usr/share/fonts. That is the default location for fonts, and it lets all users on the system use the fonts you downloaded.

On Friday, September 1, 2017 at 7:02:59 AM UTC-4, Guillermo Manglano wrote:
>
> First post doesn't show.
>
> I have the task of taking a PDF with images to a txt or csv file to store
> at a database. I am trying to use OCR on images like the one attached.
>
> The results are as poor as the following:
>
> `20—0
> ¿ ABÚEADD LDIDI ALBARH, JDSE
> AHTÚHIÚ
> —- EnlúndeLarreájzegm25- Sºt] . . . . . 944 355019
> : ABDGADD 5E'I'IEH ÁLUAREI 5EUERIHD`
>
> Of special importance is the phone number (944 355019), it seems close to
> correct but it still has wrong digits which makes the whole thing useless.
>
> After much reading I still do not know how to train tesseract. I am
> following this instructions
> <https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract> among
> others, but when I try to do:
>
> text2image --text=training_text.txt --outputbase=spa.*arial*.exp0 --font='*Arial*' --fonts_dir=/home/Fonts
>
> I get
>
> Could not find font named Nimbus Sans. Pango suggested font
> Please correct --font arg.:Error:Assert failed:in file text2image.cpp,
> line 437
> Segmentation fault (core dumped)
>
> 1. How to approach this problem with multiple fonts, multiple columns,
>    and spanish as language?
>
> [image: example]
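For the font step in the reply above, here is a minimal sketch of the per-user variant. It assumes the downloaded .ttf files sit in the current directory and that fontconfig's fc-cache is installed; FONT_DIR is just a variable name chosen for this example.

```shell
# Assumption: the downloaded .ttf files are in the current directory.
FONT_DIR="${FONT_DIR:-$HOME/.fonts}"   # per-user font folder
mkdir -p "$FONT_DIR"
for f in ./*.ttf; do
  if [ -e "$f" ]; then
    cp "$f" "$FONT_DIR"/
  fi
done

# Rebuild the fontconfig cache so the newly copied fonts are picked up.
if command -v fc-cache >/dev/null 2>&1; then
  fc-cache -f "$FONT_DIR" || echo "fc-cache failed" >&2
fi

# Ask text2image which font names it can see in that folder.
if command -v text2image >/dev/null 2>&1; then
  text2image --list_available_fonts --fonts_dir="$FONT_DIR" || echo "text2image failed" >&2
fi
```

After that, rerunning the original text2image command with --fonts_dir pointing at the same folder, and --font set to one of the names from that list, should get past the "Could not find font" error. For the system-wide location the same copy would go to /usr/share/fonts and needs root.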

