Hi,

Did you manage to solve the issue with the section sign (§)? I am also working in the legal domain, and this issue bothers me. The best solution so far has been to always use the language combination deu+eng. But I would really like to learn how to extend the tesseract traineddata.
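The deu+eng workaround mentioned above can be sketched roughly as follows. This is only a minimal sketch, assuming the tesseract executable is on the PATH and that both the deu and eng language data are installed. The function names and the file names in the usage comment are hypothetical; the only part taken from the workaround itself is passing both languages, joined with "+", to the -l option.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, languages=("deu", "eng")):
    """Build the argument list for a Tesseract run with several languages.

    Tesseract accepts multiple language codes joined with '+' in the -l option.
    (Hypothetical helper, not part of Tesseract itself.)
    """
    return [
        "tesseract",
        str(image_path),
        str(output_base),
        "-l",
        "+".join(languages),
    ]


def ocr_with_languages(image_path, output_base, languages=("deu", "eng")):
    """Run Tesseract and return the text it wrote to <output_base>.txt.

    Assumes the tesseract executable and the requested language data
    are installed on this machine.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    command = build_tesseract_command(image_path, output_base, languages)
    subprocess.run(command, check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


# Usage (hypothetical file names):
# text = ocr_with_languages("contract_page.png", "contract_page")
# print("§" in text)
```

Whether the § is then recognized depends on the eng data containing the character, so this stays a workaround and does not extend the deu traineddata.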
Best regards,
Nikolai

On Thursday, June 16, 2016 at 4:01:49 PM UTC+2, [email protected] wrote:
>
> Hi all,
>
> I'm trying to set up tesseract to scan German documents. So far everything
> works just fine, except tesseract won't recognize the character "§". This
> is slightly frustrating, since the documents in question are mostly legal
> stuff and the "§" is used a lot. It has the meaning of article or section
> and is not uncommon at all.
>
> I tried to add it as a user-pattern or user-word without success. I then
> scanned the files on GitHub in tesseract-ocr/langdata/tree/master/deu and
> it seems the § is neither in the desired_characters file nor anywhere in
> the deu.wordlist.
>
> Does that mean that tesseract does not try to find a § in the documents
> at all? If so, is there a way to add the character to the language data
> without completely retraining tesseract? I'm not sure I could do a full
> training myself.
>
> Thanks

