[tesseract-ocr] Re: english-arabic dictionary - transliteration text

Tom Morris Fri, 29 Mar 2024 08:46:40 -0700

Rather than using random web resources, I'd suggest using the official 
documentation. The most relevant section is probably this:
https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#fine-tuning-for--a-few-characters


I would suggest starting with script/Latin as your base model, which will 
at least give you š and ž to start with. In addition to the consonants with 
dots above and below, it looks like there's also an epsilon-like character 
that you may want to train (perhaps the Latin open E, 
see https://unicodeplus.com/U+0190).
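
Before fine-tuning, it may help to list exactly which non-ASCII characters occur in your transcriptions, so you know what the base model has to learn. The sketch below is only an illustration: the function name `inventory_non_ascii` is made up for this example, and it assumes your ground truth is plain Unicode text. It normalizes to NFC so that a letter typed as base + combining mark is counted together with its precomposed form.

```python
import unicodedata
from collections import Counter


def inventory_non_ascii(text):
    """Return (char, code point, Unicode name, count) for each non-ASCII character.

    Hypothetical helper for illustration; not part of Tesseract or tesstrain.
    """
    # NFC folds sequences such as "s" + U+030C into the precomposed "š",
    # so both spellings of the same letter are counted as one.
    normalized = unicodedata.normalize("NFC", text)
    counts = Counter(
        ch for ch in normalized if ord(ch) > 127 and not ch.isspace()
    )
    # Sort by code point so the report is stable between runs.
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in sorted(counts.items())
    ]


if __name__ == "__main__":
    sample = "s\u030c\u0161 \u0190"  # decomposed š, precomposed š, open E
    for ch, code, name, n in inventory_non_ascii(sample):
        print(f"{ch}\t{code}\t{name}\t{n}")
```

Running it over all of your transcription text gives a short checklist of characters to make sure appear often enough in the training lines.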

You may also want to think about whether it'd be better to train with 
synthetically rendered lines of text or with line images cut out of your 
page scans, each paired with its ground truth text. If you decide to go 
with the latter approach, looking at what the Fraktur OCR project did for 
training may be useful: https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR
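
If you go the line-image route, tesstrain expects each line image to sit next to a transcription with the same name and a .gt.txt extension. A quick sanity check along these lines can catch missing pairs before training starts. This is a rough sketch under assumptions: the helper name `unpaired` and the example directory name are hypothetical, and it only handles plain .png and .tif names (not variants such as .bin.png).

```python
from pathlib import Path

# Assumption: only plain .png / .tif line images are used.
IMAGE_SUFFIXES = (".png", ".tif")
GT_SUFFIX = ".gt.txt"


def unpaired(names):
    """Split file names into images lacking a transcription and
    transcriptions lacking an image. Hypothetical helper for illustration."""
    names = set(names)
    transcripts = {n for n in names if n.endswith(GT_SUFFIX)}
    images = {
        n for n in names
        if n.endswith(IMAGE_SUFFIXES) and n not in transcripts
    }
    # An image "foo.png" should be accompanied by "foo.gt.txt".
    missing_gt = sorted(
        img for img in images
        if img.rsplit(".", 1)[0] + GT_SUFFIX not in transcripts
    )
    # A transcription "foo.gt.txt" should have "foo.png" or "foo.tif".
    orphan_gt = sorted(
        gt for gt in transcripts
        if not any(gt[: -len(GT_SUFFIX)] + suf in images for suf in IMAGE_SUFFIXES)
    )
    return missing_gt, orphan_gt


if __name__ == "__main__":
    # Hypothetical directory name; substitute your own ground-truth folder.
    gt_dir = Path("data/mymodel-ground-truth")
    if gt_dir.is_dir():
        missing, orphans = unpaired(p.name for p in gt_dir.iterdir())
        print("images without transcription:", missing)
        print("transcriptions without image:", orphans)
```

The function works on bare file names rather than touching the disk, so it is easy to try on a handful of names first.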

Good luck!

Tom
