[tesseract-ocr] Re-Training of German Fraktur (deu_frak)

Sebastian Goettel Tue, 24 May 2016 10:53:36 -0700

Dear fellows,

I have a problem that is somewhat difficult for me. I have googled many times
and found many hints and tips, but no method that really works for me.

I'll keep it as short as possible: I want to OCR an old German dictionary for
my master's thesis in linguistics. The available traineddata for German
(Fraktur) on GitHub is very good, but one type/style of German Fraktur is
missing and is not recognized properly. The following picture shows the
"problem": the first word is "Imporös", followed by a definition in which you
can see the word "Imporösein". You can see the two different Fraktur styles
if you compare the two words; the difference is most visible in the two I's
at the beginning of these words:

<https://lh3.googleusercontent.com/-wM1_X11Fy6g/V0Rd9YPzenI/AAAAAAAAAac/NpAdpWGRadEjASckuljk9-EjZq7LW0wUACLcB/s1600/example_fraktur.jpg>

For every letter of the alphabet there are at least two different
styles. The standard deu_frak traineddata can recognize only
ONE of them. I want to train Tesseract to read both. At first I thought I
should create "a new language", so I started with Aletheia and then
proceeded to Franken+. But the deu_frak traineddata on GitHub is not bad; I
just need to add some glyphs/letters. Otherwise I would have to build
completely new langdata from scratch, which would be too much work, since
the dictionary is very complicated and needs a lot of manual correction in
Aletheia.

I have downloaded the required langdata from GitHub (they are in the
folder "frk"), but I don't know what to do with them. How can I add further
letters/glyphs so that they are recognized correctly? I was also confused
that, when I unpacked the original traineddata "deu_frak" with Aletheia's
Tesseract, I somehow got completely different files. If needed, I can attach
the folder containing those files.
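
For reference, a traineddata file is usually unpacked with the
combine_tessdata tool that ships with Tesseract. The following is only a
minimal sketch, assuming combine_tessdata is installed and on the PATH; the
helper names (build_unpack_command, unpack) and the output folder "unpacked"
are hypothetical and chosen for illustration. Which component files come out
depends on the Tesseract version and on the traineddata file itself.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def build_unpack_command(traineddata: Path, out_dir: Path) -> list[str]:
    """Build the combine_tessdata call that unpacks a traineddata file.

    `combine_tessdata -u <file> <prefix>` writes every component as
    <prefix><component>, so the prefix ends with a dot.
    """
    prefix = out_dir / (traineddata.stem + ".")
    return ["combine_tessdata", "-u", str(traineddata), str(prefix)]


def unpack(traineddata: Path, out_dir: Path) -> list[Path]:
    """Unpack the file and return whatever ends up in out_dir (hypothetical helper)."""
    if shutil.which("combine_tessdata") is None:
        raise RuntimeError("combine_tessdata was not found on the PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_unpack_command(traineddata, out_dir), check=True)
    return sorted(out_dir.iterdir())


if __name__ == "__main__":
    # Assumed file name; adjust to wherever deu_frak.traineddata is stored.
    for component in unpack(Path("deu_frak.traineddata"), Path("unpacked")):
        print(component.name)
```

Listing the extracted components this way makes it easier to compare them
with the files in the "frk" langdata folder.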

I suppose it doesn't help that I'm working on Windows. I'm actually just a
linguist and have worked my way through all of this by myself, but I need to
be able to re-train the existing, good "deu_frak" traineddata.

Maybe someone here could help me out; that would be great!

Thanks a lot!

Regards,

Sebastian

