Yes, glyph handling and combining are important -- if you search the
archives you'll see how people have dealt with them for Asian languages,
mainly Indic scripts. You need to specify the component parts in
your training. I sent you two links about right-to-left (RTL) support in
training. Unfortunately, we don't have access to the raw data and training
code that Google used internally to train for Arabic. But some people have
been training for Farsi, so you could look through the archives for what
they've done and maybe contact them directly, since they have not posted much
information on the list.
--Sven
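
To illustrate the glyph point above, here is a minimal sketch, assuming
Python 3 and only its standard unicodedata module (nothing from Tesseract).
It shows that the joined shape of ئا is a rendering effect: the text still
holds two separate code points, and the positional shapes that Unicode
encodes for compatibility (the Arabic Presentation Forms) normalize back to
the base letters. The helper names describe and to_base_letters are
hypothetical, chosen only for this example.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

def to_base_letters(text):
    """Fold positional presentation forms back to base letters via NFKC."""
    return unicodedata.normalize("NFKC", text)

joined = "\u0626\u0627"  # ئ followed by ا, as typed in the thread

# The string still contains two code points; the new shape is drawn by the font.
for code, name in describe(joined):
    print(code, name)

# U+FE8B is the initial-form glyph of ئ and U+FE8E the final-form glyph of ا.
shaped = "\uFE8B\uFE8E"
print(to_base_letters(shaped) == joined)
```

Whether box files for a given script should list base letters or positional
shapes is a training decision this sketch does not settle; it only makes the
difference between the two visible.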


On Thu, Jan 17, 2013 at 10:33 AM, gold snake <[email protected]> wrote:

> Arabic and English fonts are somewhat different.
> In an English font, if you input a + b, the result is: ab.
> But in an Arabic font, if you input ئ + ا, the result is ئا. If that is
> not clear, copy ئا and add a space in the middle: you will see that when
> two letters are typed together, the result is a new glyph shape.
>
> My language works the same way, so I'm afraid that cube is what controls
> this. If cube is for this, that's terrible, because I don't know how to
> create the cube files. (I don't mean that you should tell me how; I just
> need an example or some documentation about it.)
>
> As for RTL, it looks like there is no built-in way to handle it, so maybe
> we can only handle it in our own code (when recognition finishes, change
> the display mode, or something like that).
>
> Thanks.
>
> On Thursday, January 17, 2013 at 10:36:44 PM UTC+8, sventech wrote:
>>
>> OK, the fact that cube is something different than combining languages is
>> a major revelation to me. However, huangjingshe, I don't think you need the
>> cube feature for what you're doing. I believe the problem you're having is
>> something else. I would solve the other issues first and then maybe try the
>> cube feature if necessary.
>> --Sven
>>
>>
>> On Wed, Jan 16, 2013 at 10:07 PM, gold snake <[email protected]> wrote:
>>
>>> Thanks again, but I still have a question. If cube is used only for
>>> combining with other languages during training, why can we choose cube
>>> mode when reading a document, as Sven said?
>>>
>>> Do you mean that we can combine with other languages using -l [lang]
>>> because the language has cube files, and that without any cube files we
>>> can't use -l [lang]?
>>>
>>> But I tested this. As everybody knows, Chinese has only a .traineddata
>>> file and no cube files, yet I can use
>>> tesseract -l chi_sim [lang].[fontname].exp0.tif [lang].[fontname].exp0
>>> batch.nochop makeb
>>>
>>> So maybe it is not about the cube files, or I'm not using it correctly.
>>>
>>>
>>> On Thursday, January 17, 2013 at 3:34:25 AM UTC+8, sventech wrote:
>>>>
>>>> Cube means combining different languages. There is not much
>>>> documentation on it -- Google developed it internally. But I don't think
>>>> you need it. The list of files you sent is related to the cube feature, so
>>>> you don't need to create them. For right to left, search the archives for
>>>> "right to left" -- someone wrote a python script to convert, though he
>>>> didn't provide info about how to use it.
>>>>
>>>> utility to convert training files:
>>>> https://groups.google.com/forum/?fromgroups=#!searchin/tesseract-ocr/rtl/tesseract-ocr/T035ZyQVlMU/tQVoGWdlBDMJ
>>>>
>>>> basic trick for right to left output from Dmitri Silaev:
>>>> https://groups.google.com/forum/?fromgroups=#!searchin/tesseract-ocr/right$20to$20left$20output/tesseract-ocr/8r2qGvMzz9U/so1WuMTyaU8J
>>>> --Sven
>>>>
>>>>
>>>> On Wed, Jan 16, 2013 at 10:57 AM, gold snake <[email protected]> wrote:
>>>>
>>>>> So you mean cube exists only so that users can combine a language with
>>>>> other languages, which means I don't need it (because my language is not
>>>>> Arabic). Thanks. Maybe my English is not good; I just can't understand
>>>>> what "cube" is or what it is used for, and I can't find an introduction
>>>>> to it.
>>>>>
>>>>> And that means cube has nothing to do with my result being left to right
>>>>> (the correct result must be right to left). Then why, when I use the
>>>>> command: tesseract 14.jpg output -l [lang], is the content of the result
>>>>> (output.txt) left to right?
>>>>>
>>>>> I'm very sorry if I am taking up the experts' valuable time with these
>>>>> small problems. Just a few days ago I didn't even know what OCR was.
>>>>> If I could find the answers to these questions myself, believe me, I
>>>>> would not ask anybody, because I really do understand that everyone is
>>>>> very busy. So from now on I will try hard to search for answers first.
>>>>> Sorry again.
>>>>>
>>>>> On Wednesday, January 16, 2013 at 10:34:21 PM UTC+8, sventech wrote:
>>>>>>
>>>>>> The reason why Arabic has those files and your language does not is
>>>>>> that Arabic is set up to use the "cube" feature to combine it with other
>>>>>> languages, so you can do "-l ara+eng" and OCR a document with both Arabic
>>>>>> and English. That training is harder, and not necessary if you mainly
>>>>>> want to do monolingual documents.
>>>>>>
>>>>>> And what Zdenko is saying is that you are asking questions that don't
>>>>>> show that you've tried to solve the problem yourself. We're all
>>>>>> professional programmers and we want to help people, but we don't have
>>>>>> time to teach elementary web searching or programming. You seem to be a
>>>>>> smart guy, but your questions appear to be lazy. You need to make an
>>>>>> effort to solve the problems and come to us for help, not ask us to
>>>>>> solve them for you.
>>>>>> --Sven
>>>>>>
>>>>>>
>>>>>> On Wed, Jan 16, 2013 at 2:59 AM, gold snake <[email protected]> wrote:
>>>>>>
>>>>>>> I can't find any answer to my question in this link.
>>>>>>> Can you just talk to me? Is it necessary to bully a rookie?
>>>>>>> Please...
>>>>>>>
>>>>>>> On Wednesday, January 16, 2013 at 4:02:25 PM UTC+8, zdenop wrote:
>>>>>>>>
>>>>>>>> Really ;-)? I got 93 results. E.g.:
>>>>>>>>
>>>>>>>> https://groups.google.com/forum/#!msg/tesseract-ocr/0msQtTB_XrI/D1noel9GpPgJ
>>>>>>>> https://groups.google.com/d/topic/tesseract-ocr/tyV5_z65XMk/discussion
>>>>>>>> https://groups.google.com/d/msg/tesseract-ocr/R7UCx0oV3PA/GE7KJ_76kS0J
>>>>>>>>
>>>>>>>> Please honor time of people on this list...
>>>>>>>>
>>>>>>>> Zdenko
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Jan 16, 2013 at 8:18 AM, gold snake <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I can't find anything. Come on....
>>>>>>>>>
>>>>>>>>> On Tuesday, January 15, 2013 at 10:38:42 PM UTC+8, zdenop wrote:
>>>>>>>>>>
>>>>>>>>>> Search the archive of the tesseract forums for cube.
>>>>>>>>>>
>>>>>>>>>> Zdenko
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 15, 2013 at 2:16 PM, gold snake <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> My language is somewhat special. It is like Arabic script, with
>>>>>>>>>>> some differences from Arabic, but really only in the shapes of
>>>>>>>>>>> the letters. It is also written right to left.
>>>>>>>>>>> I'm using the standard tutorial:
>>>>>>>>>>> https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>>>>>>>>>>>
>>>>>>>>>>> But when I finish and test, it cannot recognize text accurately.
>>>>>>>>>>> My steps are:
>>>>>>>>>>>
>>>>>>>>>>> tesseract as.kadas.exp0.tif as.kadas.exp0 batch.nochop makebox
>>>>>>>>>>>
>>>>>>>>>>> tesseract as.kadas.exp0.tif as.kadas.exp0 nobatch box.train
>>>>>>>>>>>
>>>>>>>>>>> unicharset_extractor as.kadas.exp0.box
>>>>>>>>>>>
>>>>>>>>>>> shapeclustering -F font_properties -U unicharset as.kadas.exp0.tr
>>>>>>>>>>>
>>>>>>>>>>> mftraining -F font_properties -U unicharset -O as.unicharset as.kadas.exp0.tr
>>>>>>>>>>>
>>>>>>>>>>> cntraining as.kadas.exp0.tr
>>>>>>>>>>>
>>>>>>>>>>> I don't have a word dictionary, so I skipped those steps.
>>>>>>>>>>> Then I renamed the output files, adding the as. prefix, and ran:
>>>>>>>>>>>
>>>>>>>>>>> combine_tessdata as.
>>>>>>>>>>>
>>>>>>>>>>> There was no error at any step up to the combine, so I'm sure
>>>>>>>>>>> that part has no problem.
>>>>>>>>>>> But when I test a picture whose content is 13, the result is: ئئ
>>>>>>>>>>> When I test any word, the result is just ئ
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I also looked in D:\Little\Tesseract-OCR\tessdata and
>>>>>>>>>>> found these files:
>>>>>>>>>>>
>>>>>>>>>>> ara.cube.bigrams
>>>>>>>>>>> ara.cube.fold
>>>>>>>>>>> ara.cube.lm
>>>>>>>>>>> ara.cube.nn
>>>>>>>>>>> ara.cube.params
>>>>>>>>>>> ara.cube.size
>>>>>>>>>>> ara.cube.word-freq
>>>>>>>>>>> ara.traineddata
>>>>>>>>>>>
>>>>>>>>>>> I don't understand why the Arabic trained data is not just
>>>>>>>>>>> ara.traineddata. What are the other ara.* files? Are they
>>>>>>>>>>> necessary if I'm training my own language?
>>>>>>>>>>> And where can I find those files, or how do I create them?
>>>>>>>>>>>
>>>>>>>>>>> Thanks very much.
>>>>>>>>>>>
>>>>>>>>>>>  --
>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>> To post to this group, send email to [email protected]
>>>>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>>>>> tesseract-oc...@googlegroups.com
>>>>>>>>>>> For more options, visit this group at
>>>>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>>>>>>>>
>>>>>> --
>>>>>> ``All that is gold does not glitter,
>>>>>>   not all those who wander are lost;
>>>>>> the old that is strong does not wither,
>>>>>>   deep roots are not reached by the frost.
>>>>>> From the ashes a fire shall be woken,
>>>>>>   a light from the shadows shall spring;
>>>>>> renewed shall be blade that was broken,
>>>>>>   the crownless again shall be king.”