[tesseract-ocr] Re: problem detected using tesseract4 & arabic data

Ray Smith Wed, 29 Mar 2017 15:35:07 -0700

Thanks for spotting this!
I understand why it makes this error, but it will take some thought to fix 
it properly!
It is using a sort by x-position to re-order the boxes for RTL language 
training, but that doesn't work in the case of heavily kerned characters 
like ل in your example.
The fix is to simply reverse the RTL characters, but it must not disturb the 
order of Common-script characters (such as digits), which is why I was using 
a sort to begin with.
https://github.com/tesseract-ocr/tesseract/blob/master/training/boxchar.cpp#L202
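
The difference between the two strategies can be illustrated with a minimal sketch. This is not the code in boxchar.cpp: the box coordinates, the `(char, left_x)` tuple layout, and the function names `sort_by_left` and `reverse_rtl` are all hypothetical, and the run detection is a simplification that ignores the full Unicode bidirectional algorithm. It assumes boxes arrive in logical (reading) order and that training wants them in left-to-right visual order.

```python
import unicodedata

LAM, ALEF = "\u0644", "\u0627"  # Arabic letters lam and alef


def is_rtl(ch):
    # Bidi classes "R" and "AL" mark right-to-left letters.
    return unicodedata.bidirectional(ch) in ("R", "AL")


def sort_by_left(boxes):
    # Sort-based reordering: order boxes by their left edge.
    return sorted(boxes, key=lambda box: box[1])


def reverse_rtl(boxes):
    # Reversal-based reordering: reverse the whole line, then restore
    # the original order inside each run of non-RTL characters.
    out = list(reversed(boxes))
    i = 0
    while i < len(out):
        if is_rtl(out[i][0]):
            i += 1
            continue
        j = i
        while j < len(out) and not is_rtl(out[j][0]):
            j += 1
        out[i:j] = reversed(out[i:j])
        i = j
    return out


# Hypothetical line in logical order: lam, alef, then the digits "12".
# The lam box is kerned so far left (x=15) that it starts before the
# alef box (x=16), although lam is visually to the right of alef.
boxes = [(LAM, 15), (ALEF, 16), ("1", 0), ("2", 8)]

sorted_chars = [ch for ch, _ in sort_by_left(boxes)]
reversed_chars = [ch for ch, _ in reverse_rtl(boxes)]
```

With these made-up coordinates, the sort places lam before alef because of the kerned left edge, while the reversal keeps alef before lam and still leaves the digits in their original order.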


On Thursday, March 9, 2017 at 5:11:49 AM UTC-8, El Fakir Zakaria wrote:
>
> I noticed that tesseract4 reads الأ as األ which is pretty close, because 
> we need to switch the position of the last 2 letters to have ا ل أ, this 
> happens with similar word forms too like لا reads as ال and should be ل ا, 
> and I would like to correct it.
> Can someone show me how to fix this, or perhaps update the Arabic data?
> Thank you for your time.
>

