You can unpack deu.traineddata, edit the extracted deu.unicharambigs so 
that the misrecognized characters are always replaced with the § symbol, 
and then recombine the component files. See the Training wiki 
<https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract> for 
details on the commands.
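
As a rough sketch of those steps: "combine_tessdata -u deu.traineddata deu." 
should extract the components, and "combine_tessdata -o deu.traineddata 
deu.unicharambigs" should write the edited file back. In between, you add a 
line to deu.unicharambigs. The small helper below formats such a line in the 
tab-separated v1 layout described in the wiki. The helper name and the 
choice of "S" as the misrecognized character are only assumptions for 
illustration; use whatever tesseract actually outputs for § in your scans, 
and check the version line at the top of your extracted file first. It also 
assumes § is already present in the extracted deu.unicharset.

```python
def ambig_line(source, target, mandatory=True):
    """Format one entry for a v1-style unicharambigs file.

    Fields are tab-separated: source count, source characters
    (space-separated), target count, target characters, and a type
    flag (1 = mandatory replacement, 0 = optional).
    """
    src = list(source)
    dst = list(target)
    fields = [
        str(len(src)),
        " ".join(src),
        str(len(dst)),
        " ".join(dst),
        "1" if mandatory else "0",
    ]
    return "\t".join(fields)


# Hypothetical example: assumes § is being misread as "S".
line = ambig_line("S", "§")
print(line)
```

Keep in mind that a mandatory rule on a single common character would also 
rewrite every genuine occurrence of it, so a multi-character source (or an 
optional rule, mandatory=False) is probably the safer choice.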

On the other hand, full training is not hard at all. There are tools 
available <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract> 
that automate the entire training process.

On Thursday, June 16, 2016 at 9:01:49 AM UTC-5, [email protected] wrote:
>
> Hi all,
>
> I'm trying to set up tesseract to scan German documents. So far everything 
> works just fine, except tesseract won't recognize the character "§". This 
> is slightly frustrating, since the documents in question are mostly legal 
> stuff and the "§" is used a lot. It has the meaning of article or section 
> and is not uncommon at all.
>
> I tried to add it as a user-pattern or user-word without success. I then 
> scanned the files at github in tesseract-ocr/langdata/tree/master/deu and 
> it seems the § is neither in the desired_characters file nor anywhere in 
> the deu.wordlist.
>
> Does that mean that tesseract does not try to find a § in the documents 
> at all? If so, is there a way to add the character to the language data 
> without completely retraining tesseract? I'm not sure I could do a full 
> training myself.
>
> Thanks
>
