You can unpack the deu.traineddata file, edit the extracted deu.unicharambigs so that it replaces the misrecognized characters with the § symbol, and then recombine the component files. Check the Training Wiki for details on the commands.
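As a rough sketch of that unpack / edit / recombine cycle, assuming the Tesseract 3.x training tools are installed and that deu.unicharambigs uses the tab-separated "v1" layout described in the wiki. The helper names (ambig_line, patch_traineddata) and the "$" source character are purely hypothetical; use whatever your own OCR output actually shows in place of §. I have not checked whether the replacement takes effect when § is missing from the unicharset, so treat that as an open assumption too.

```python
import subprocess
from pathlib import Path


def ambig_line(source, target, mandatory=True):
    """Build one tab-separated v1 unicharambigs entry (hypothetical helper).

    Fields: source length, source chars (space separated),
    target length, target chars, type (1 = mandatory, 0 = optional).
    """
    fields = [
        str(len(source)), " ".join(source),
        str(len(target)), " ".join(target),
        "1" if mandatory else "0",
    ]
    return "\t".join(fields)


def patch_traineddata(lang="deu", source="$", target="§", workdir="."):
    """Unpack <lang>.traineddata, append one ambiguity rule, recombine.

    "$" is only a placeholder for the misrecognized character.
    Recombining overwrites <lang>.traineddata, so back it up first.
    """
    work = Path(workdir)
    prefix = str(work / f"{lang}.")

    # Unpack all components next to the traineddata file (deu.unicharset, ...).
    subprocess.run(
        ["combine_tessdata", "-u", str(work / f"{lang}.traineddata"), prefix],
        check=True,
    )

    # Append the new rule; start a fresh v1 file if none was extracted.
    ambigs = work / f"{lang}.unicharambigs"
    text = ambigs.read_text(encoding="utf-8") if ambigs.exists() else "v1\n"
    if not text.endswith("\n"):
        text += "\n"
    ambigs.write_text(text + ambig_line(source, target) + "\n", encoding="utf-8")

    # Recombine every file that starts with the prefix into <lang>.traineddata.
    subprocess.run(["combine_tessdata", prefix], check=True)
```

Calling patch_traineddata(workdir="/path/to/a/copy/of/tessdata") would run the whole cycle on a copy of the data; ambig_line("rn", "m", mandatory=False) shows how a multi-character, optional rule is laid out.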
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract

On the other hand, full training is not that difficult. There are tools available that automate the entire training process.

https://github.com/tesseract-ocr/tesseract/wiki/AddOns

On Thursday, June 16, 2016 at 9:01:49 AM UTC-5, [email protected] wrote:
>
> Hi all,
>
> I'm trying to set up tesseract to scan German documents. So far everything
> works just fine, except tesseract won't recognize the character "§". This
> is slightly frustrating, since the documents in question are mostly legal
> stuff and the "§" is used a lot. It has the meaning of article or section
> and is not uncommon at all.
>
> I tried to add it as a user-pattern or user-word without success. I then
> scanned the files at github in tesseract-ocr/langdata/tree/master/deu and
> it seems the § is neither in the desired_characters file nor anywhere in
> the deu.wordlist.
>
> Does that mean that tesseract does not try to find a § in the documents
> at all? If so, is there a way to add the character to the language data
> without completely retraining tesseract? I'm not sure I could do a full
> training myself.
>
> Thanks
>

