Hi,

Did you manage to solve the issue with the section sign (§)? I am also working 
in the legal domain, and this issue bothers me. The best solution so far has 
been to always use the language combination deu+eng. But I would really like 
to learn how to extend the tesseract traineddata.
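
For reference, here is a minimal sketch of how the deu+eng workaround can be 
scripted. It only assembles the standard command line 
(tesseract <image> <outputbase> -l <langs>); the helper name 
build_tesseract_cmd and the file names scan.png / scan are made up for 
illustration, and it assumes both the deu and eng traineddata files are 
installed.

```python
import shlex


def build_tesseract_cmd(image_path, output_base, languages=("deu", "eng")):
    """Return the argument list for a tesseract run with several languages.

    Hypothetical helper: tesseract accepts multiple languages joined by '+'
    in the -l option, so ("deu", "eng") becomes "deu+eng".
    """
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


if __name__ == "__main__":
    # Placeholder file names; replace them with a real scan.
    cmd = build_tesseract_cmd("scan.png", "scan")
    # Print the command instead of running it, so the sketch works
    # even on a machine without tesseract installed.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the 
image exists. This does not add "§" to the deu data itself; it only lets the 
eng data contribute to the result.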

Best regards,
Nikolai

On Thursday, June 16, 2016 at 4:01:49 PM UTC+2, [email protected] wrote:
>
> Hi all,
>
> I'm trying to set up tesseract to scan German documents. So far everything 
> works just fine, except tesseract won't recognize the character "§". This 
> is slightly frustrating, since the documents in question are mostly legal 
> stuff and the "§" is used a lot. It has the meaning of article or section 
> and is not uncommon at all.
>
> I tried to add it as a user-pattern or user-word without success. I then 
> looked through the files on GitHub in tesseract-ocr/langdata/tree/master/deu, 
> and it seems the § is neither in the desired_characters file nor anywhere in 
> the deu.wordlist.
>
> Does that mean that tesseract does not try to find a § in the documents 
> at all? If so, is there a way to add the character to the language data 
> without completely retraining tesseract? I'm not sure I could do a full 
> training myself.
>
> Thanks
>
