[tesseract-ocr] Re: Add an character to language data

Quan Nguyen Sun, 19 Jun 2016 18:33:06 -0700

You can unpack the deu.traineddata file, edit the extracted 
deu.unicharambigs so that it replaces the misrecognized characters with the 
§ symbol, and then recombine the component files. See the Training Wiki for 
details on the commands.


https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
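
As a rough sketch of the unpack / edit / recombine steps, the Python below 
only builds the combine_tessdata command lines and formats one 
unicharambigs entry; it does not run anything. It assumes the "v1" 
tab-separated unicharambigs layout described in the Training Wiki, and it 
assumes, purely for illustration, that § is being misread as "5" (check your 
own output for the real misrecognition). The helper names are made up for 
this example. The replacement character probably also has to be present in 
the extracted deu.unicharset for the rule to take effect.

```python
from typing import List


def ambig_line(wrong: List[str], right: List[str], mandatory: bool = True) -> str:
    """Format one entry for a v1 unicharambigs file.

    Fields are tab-separated: count of misread characters, the characters
    (space-separated), count of replacement characters, the replacements,
    and a type flag (1 = always replace, 0 = only a hint).
    """
    fields = [
        str(len(wrong)),
        " ".join(wrong),
        str(len(right)),
        " ".join(right),
        "1" if mandatory else "0",
    ]
    return "\t".join(fields)


def unpack_command(traineddata: str, prefix: str) -> List[str]:
    """Command that extracts all components, e.g. deu.unicharambigs."""
    return ["combine_tessdata", "-u", traineddata, prefix]


def repack_command(prefix: str) -> List[str]:
    """Command that rebuilds <prefix>traineddata from the component files."""
    return ["combine_tessdata", prefix]


if __name__ == "__main__":
    # 1. Unpack the German model into deu.* component files.
    print(" ".join(unpack_command("deu.traineddata", "deu.")))
    # 2. Append a line like this to deu.unicharambigs, saved as UTF-8.
    #    "5" is a hypothetical misreading of the section sign.
    print(ambig_line(["5"], ["§"]))
    # 3. Recombine the components into a new deu.traineddata.
    print(" ".join(repack_command("deu.")))
```

Note that a mandatory rule (flag 1) would rewrite every matching character, 
including genuine ones, so the optional flag (0) may be the safer starting 
point.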

On the other hand, full training is not that difficult. Tools are available 
that automate the entire training process.

https://github.com/tesseract-ocr/tesseract/wiki/AddOns

On Thursday, June 16, 2016 at 9:01:49 AM UTC-5, [email protected] wrote:
>
> Hi all,
>
> I'm trying to set up tesseract to scan German documents. So far everything 
> works just fine, except tesseract won't recognize the character "§". This 
> is slightly frustrating, since the documents in question are mostly legal 
> stuff and the "§" is used a lot. It has the meaning of article or section 
> and is not uncommon at all.
>
> I tried to add it as a user-pattern or user-word without success. I then 
> scanned the files at github in tesseract-ocr/langdata/tree/master/deu and 
> it seems the § is neither in the desired_characters file nor anywhere in 
> the deu.wordlist.
>
> Does that mean that tesseract does not try to find a § in the documents 
> at all? If so, is there a way to add the character to the language data 
> without completely retraining tesseract? I'm not sure I could do a full 
> training myself.
>
> Thanks
>

