Re: making Persian traineddata doesn't support RTL

Reza M Mon, 29 Oct 2012 09:21:14 -0700

Hi,
I finally found out how to add RTL support to 3.02.
I wrote a script in Python that converts a unicharset file into one that
supports RTL.
After making the unicharset file, run the attached script to correct the
wrong properties.
I have attached the script; it may be useful for other RTL languages too!
To run it, you need Python installed on your PC.
Yours,
Reza
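
To see whether a unicharset still needs this fix, a check along the following lines may help. It is only a sketch in Python 3 (the attached script is Python 2), and it assumes the column layout the script itself relies on: character, properties, metrics, script, id, direction, mirror. The function name entries_needing_fix and the sample lines are made up for illustration.

```python
def entries_needing_fix(unicharset_text):
    """Return the characters whose script field is still NULL.

    Assumes space-separated columns laid out as the attached script expects:
    0 character, 1 properties, 2 metrics, 3 script, 4 id, 5 direction, 6 mirror.
    """
    pending = []
    for line in unicharset_text.splitlines():
        fields = line.split(' ')
        if len(fields) < 7:
            continue  # skips the leading count line and any shorter entry
        if fields[3] == 'NULL':
            pending.append(fields[0])
    return pending


# Hypothetical sample, not taken from a real traineddata file.
sample = "\n".join([
    "3",
    "( 10 0,255,0,255 NULL 1 0 1",
    "5 8 0,255,0,255 Common 2 2 2",
    "\u0631 1 0,255,0,255 NULL 3 0 3",
])
print(entries_needing_fix(sample))
```

An empty list after running the attached script would suggest that every full entry received a script name.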


On Thursday, October 18, 2012 11:38:39 PM UTC+2, Reza M wrote:
>
> Hi,
> I made a simple traineddata for Persian. It recognizes the characters, but it 
> reverses the direction of words; for example, instead of رضا it writes اضر
>
> I tried to use the Hebrew or Arabic data, but it doesn't work.
> I tried 3.02, and it doesn't work correctly for my data!
>
> Do you know how I can make my data work like Arabic with cube mode, or like 
> Hebrew, which works correctly for RTL?
> There are many RTL languages; would you please tell us how you 
> made the Arabic file?
>
> Yours,
> Reza
>


#!/usr/bin/python
# -*- coding: utf-8 -*-
# Reza1615
# Distributed under the terms of the CC-BY-SA 3.0 .

import codecs,re

filesample = 'unicharset'
lang=u'Arabic' # Persian also uses the Arabic script
lines=u'\n'
couple={}
couple_ch={u"<":u">",u">":u"<",u"(":u")",u")":u"(",u"[":u"]",u"]":u"[",u"{":u"}",u"}":u"{",u"?":u"?",u"?":u"?"}

# Read the whole unicharset file and close it again
with codecs.open(filesample, 'r', 'utf8') as f:
    text = f.read()

for line in text.split(u'\n'):    
    column=line.split(u' ')
    count=-1
    new_line=u' '
    for item in column: # Changing items in Columns
        count+=1
        if count==3:
            if column[0].strip() in u"~#@?'\/<>()[]}{.,?!$*+-_&0123456789????????????????????????????":# Common characters
                new_line+=item.replace(u'NULL',u'Common ')
            else:
                new_line+=item.replace(u'NULL',lang+u' ')
            continue
        if count==5:
            if column[0].strip() in u"0123456789":# English numbers
                new_line+=u'2 '
            elif column[0].strip() in u"+-":
                new_line+=u'3 '
            elif column[0].strip() in u"#$?":# Dollar sign and its alternatives
                new_line+=u'4 '
            elif column[0].strip() in u"????????????????????":# Your language's numbers and its alternatives
                new_line+=u'5 '
            elif column[0].strip() in u"/.,??":# Your language's separators 
                new_line+=u'6 '
            elif column[0].strip() in u"~@?'\<>()[]}{?!*_&??":
                new_line+=u'10 '
            else:
                new_line+=u'13 '
            continue
        if count==6:
            if column[0].strip() in u"<>()[]}{??":# Pair Characters
                couple[column[0].strip()]=column[4].strip()
                new_line+=u'[$['+column[0].strip()+u']$] '
            else:
                new_line+=column[4].strip()+u' '
            continue
        new_line+=item+u' '
    lines+=new_line.strip()+u'\n'

# Changing the paired characters' column number

total_couples=re.findall(ur"\[\$\[.*?\]\$\]",lines, re.S)
if total_couples:
    for i in total_couples:
        i=i.replace(u'[$[',u'').replace(u']$]',u'').strip()
        try:
            lines=lines.replace(u'[$['+i+u']$]',couple[couple_ch[i]])
        except KeyError:
            # The partner character is missing: fall back to the character's own id
            lines=lines.replace(u'[$['+i+u']$]',couple[i])
            print i + u' is not paired! Please add '+couple_ch[i] +u' to your box file'
else:
    print "Could not find paired characters like <>()[]}{?? "

# Write the result back, then report that the work is finished
with codecs.open(filesample ,mode = 'w',encoding = 'utf8' ) as f:
    f.write( lines.strip())

print "Now your unicharset is set for RTL languages (Arabic, Persian)!"
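
The script above picks the direction code from hand-written character lists. As an alternative sketch (not the method used above), the Unicode bidirectional class can be looked up with Python 3's unicodedata module and mapped to a number. The mapping below follows the order of ICU's UCharDirection enum; that the codes written by the script (2, 3, 4, 5, 6, 10, 13) come from that enum is an assumption, and the name direction_code is made up for illustration.

```python
import unicodedata

# Numeric codes in the order of ICU's UCharDirection enum. Assumption: the
# direction column of the unicharset uses these same numbers.
BIDI_CODES = {
    'L': 0, 'R': 1, 'EN': 2, 'ES': 3, 'ET': 4, 'AN': 5, 'CS': 6,
    'B': 7, 'S': 8, 'WS': 9, 'ON': 10, 'AL': 13, 'NSM': 17, 'BN': 18,
}


def direction_code(char, default=10):
    """Map one character's Unicode bidi class to a numeric direction code."""
    return BIDI_CODES.get(unicodedata.bidirectional(char), default)


# U+0631 is the Arabic letter reh, the first letter of the example word.
for ch in ['\u0631', '5', '+', '$', ',', '(']:
    print(repr(ch), unicodedata.bidirectional(ch), direction_code(ch))
```

The values from the Unicode database may not agree with the hand-written lists for every character, so compare the two before relying on either.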
