First of all, there is *already finished* langdata for Spanish here 
<https://github.com/tesseract-ocr/langdata/tree/master/spa>. Download all 
the files, then run combine_tessdata spa. (including the trailing period).

Second, the fonts folder you want is *~/.fonts*, NOT /home/Fonts. Better 
still, run nautilus (the file browser) as root (via gksudo) and move your 
fonts to /usr/share/fonts. That is the default system-wide location for 
fonts, so every user on the system can use the fonts you downloaded.
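
A rough sketch of the per-user route, in case it helps. It assumes the downloaded fonts are .ttf files in a directory you pass in and that training_text.txt is in the current directory; the function names and the outputbase naming helper are made up for illustration and are not part of Tesseract. The name given to --font should match what text2image itself lists for that fonts directory.

```shell
#!/bin/sh
# Sketch only. Assumptions (not from the thread): fonts are .ttf files in the
# directory given as $1, and training_text.txt is in the current directory.

# Copy fonts into the per-user font directory and refresh the fontconfig cache.
install_user_fonts() {
    mkdir -p "$HOME/.fonts"
    cp "$1"/*.ttf "$HOME/.fonts/"
    fc-cache -f "$HOME/.fonts"
}

# Hypothetical helper: build a <lang>.<fontname>.exp0 output base from a font
# name by lowercasing it and replacing spaces with underscores.
outputbase_for() {
    printf 'spa.%s.exp0\n' \
        "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr ' ' '_')"
}

# Render the training text with one font ($1) from a fonts directory ($2).
render_training_page() {
    text2image --text=training_text.txt \
        --outputbase="$(outputbase_for "$1")" \
        --font="$1" \
        --fonts_dir="$2"
}

# Typical use:
#   install_user_fonts ./my-fonts
#   text2image --list_available_fonts --fonts_dir="$HOME/.fonts"
#   render_training_page "Arial" "$HOME/.fonts"
```

Listing the available fonts first is the quickest way to see whether Pango can find the font at all before text2image aborts on the --font argument.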

On Friday, September 1, 2017 at 7:02:59 AM UTC-4, Guillermo Manglano wrote:
>
> First post doesn't show. 
>
>
> I have the task of taking a PDF with images to a txt or csv file to store 
> at a database. I am trying to use OCR on images like the one attached.
>
> The results are as poor as the following:
>
> `20—0
> ¿ ABÚEADD LDIDI ALBARH, JDSE
> AHTÚHIÚ
> —- EnlúndeLarreájzegm25- Sºt] . . . . . 944 355019
> : ABDGADD 5E'I'IEH ÁLUAREI 5EUERIHD`
>
> Of special importance is the phone number (944 355019), it seems close to 
> correct but it still has wrong digits which makes the whole thing useless.
>
> After much reading I still do not know how to train tesseract. I am 
> following these instructions 
> <https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract> among 
> others, but when I try to do:
>
>
> text2image --text=training_text.txt --outputbase=spa.arial.exp0 --font='Arial' --fonts_dir=/home/Fonts
>
>
> I get 
>
>
> Could not find font named Nimbus Sans. Pango suggested font 
>
> Please correct --font arg.:Error:Assert failed:in file text2image.cpp, 
> line 437
>
> Segmentation fault (core dumped)
>
>    1. How to approach this problem with multiple fonts, multiple columns,
>       and Spanish as the language?
>
>    [image: example]
>
