Yes, cube remains a mystery to us common mortals. I am experimenting
with it within ScanBizCards, and here are my findings so far, running
Tesseract 3.02 on a black & white rendition of a standard business card
(image size 1,024x768) on an iPhone 4S:

1. OcrEngineMode=OEM_TESSERACT_ONLY  // Tess sources comment: Run Tesseract only - fastest
Time: 6 seconds
Accuracy: good

2. OEM_CUBE_ONLY  // Tess sources comment: Run Cube only - better accuracy, but slower
Time: 53 (!) seconds
Accuracy: I have yet to run it on a large enough sample, but for now I am
not convinced this mode is more accurate than OEM_TESSERACT_ONLY, at least
for business cards.

3. OEM_TESSERACT_CUBE_COMBINED  // Tess sources comment: Run both and combine results - best accuracy
Time: 63 (!) seconds
Accuracy: best, improves on OEM_TESSERACT_ONLY

As you can see, the performance penalty for cube is severe, but if you need
the highest accuracy I would recommend skipping OEM_CUBE_ONLY and using
OEM_TESSERACT_CUBE_COMBINED.
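
To put a number on the penalty, here is a small Python sketch of the
arithmetic. The timings are the ones measured above; the names TIMINGS_S
and slowdown_vs_baseline are made up for illustration, and nothing here
calls Tesseract itself.

```python
# Timings reported above, in seconds, for one 1,024x768 business card
# image on an iPhone 4S with Tesseract 3.02.
TIMINGS_S = {
    "OEM_TESSERACT_ONLY": 6,
    "OEM_CUBE_ONLY": 53,
    "OEM_TESSERACT_CUBE_COMBINED": 63,
}


def slowdown_vs_baseline(timings, baseline="OEM_TESSERACT_ONLY"):
    """Return each mode's run time as a multiple of the baseline mode's."""
    base = timings[baseline]
    return {mode: seconds / base for mode, seconds in timings.items()}


if __name__ == "__main__":
    for mode, factor in slowdown_vs_baseline(TIMINGS_S).items():
        print(f"{mode}: {factor:.1f}x")
```

On these numbers the two cube modes come out at roughly 9 and 10.5 times
the Tesseract-only run time.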

Patrick

On Thu, Jan 17, 2013 at 5:26 PM, zdenko podobny <[email protected]> wrote:

> Regarding cube:
>
>    - there is no more public information about cube than the 92 hits on
>    the forum I already mentioned (+ source code ;-))
>    - there is no information on how to create cube data files (ok, some of
>    them are text files...)
>
>
> So you can:
>
>    1. try to use/train tesseract without the cube part (IMO you will need
>    it for cube, because it looks like some cube files are part of the
>    traineddata file [1])
>    2. try to analyze the cube data and share your findings - it
>    can encourage more people to have a look at it :-)
>
> [1]
> http://tesseract-ocr.googlecode.com/svn/trunk/doc/combine_tessdata.1.html#_components
>
> Zdenko
>
>
> On Thu, Jan 17, 2013 at 5:33 PM, gold snake <[email protected]> wrote:
>
>> Arabic and English script are very different in one respect.
>> In English, if you input a+b, the result is: ab
>> But in Arabic script, if you input ئ+ا the result is ئا. If that is not
>> clear, copy ئا and add a space in the middle: you will see that when you
>> input two different letters together, the result is a new letter shape.
>>
>> My language works the same way, so I am afraid that cube is what controls
>> this. If cube is for this, that is terrible, because I don't know how to
>> create the cube files (I don't mean you should tell me how; I just need an
>> example or some documentation about it).
>>
>> As for RTL, it looks like there is no way to handle it here, so maybe we
>> can only handle it in our own code (when reading finishes, change the
>> display mode, or something like that).
>>
>> thanks.
>>
>> On Thursday, January 17, 2013 at 10:36:44 PM UTC+8, sventech wrote:
>>>
>>> OK, the fact that cube is something different than combining languages
>>> is a major revelation to me. However, huangjingshe, I don't think you need
>>> the cube feature for what you're doing. I believe the problem you're having
>>> is something else. I would solve the other issues first and then maybe try
>>> the cube feature if necessary.
>>> --Sven
>>>
>>>
>>> On Wed, Jan 16, 2013 at 10:07 PM, gold snake <[email protected]> wrote:
>>>
>>>> Thanks again, but I still have the same question. If cube is only for
>>>> combining with another language when training, why can we choose a cube
>>>> mode when reading a document, just like Sven said?
>>>>
>>>> Do you mean that we can combine with another language using -l [lang]
>>>> because it has cube files, and that without any cube file we can't use
>>>> -l [lang]?
>>>>
>>>> But I tested it. As everybody knows, the Chinese language only has a
>>>> .traineddata file and no cube files, yet I can use
>>>> tesseract -l chi_sim [lang].[fontname].exp0.tif [lang].[fontname].exp0
>>>> batch.nochop makeb
>>>>
>>>> So maybe it's not about the cube files, or I'm not using it right.
>>>>
>>>> On Thursday, January 17, 2013 at 3:34:25 AM UTC+8, sventech wrote:
>>>>>
>>>>> Cube means combining different languages. There is not much
>>>>> documentation on it -- Google developed it internally. But I don't think
>>>>> you need it. The list of files you sent is related to the cube feature, so
>>>>> you don't need to create them. For right to left, search the archives for
>>>>> "right to left" -- someone wrote a python script to convert, though he
>>>>> didn't provide info about how to use it.
>>>>>
>>>>> utility to convert training files:
>>>>> https://groups.google.com/forum/?fromgroups=#!searchin/tesseract-ocr/rtl/tesseract-ocr/T035ZyQVlMU/tQVoGWdlBDMJ
>>>>>
>>>>> basic trick for right to left output from Dmitri Silaev:
>>>>> https://groups.google.com/forum/?fromgroups=#!searchin/tesseract-ocr/right$20to$20left$20output/tesseract-ocr/8r2qGvMzz9U/so1WuMTyaU8J
>>>>> --Sven
>>>>>
>>>>>
>>>>> On Wed, Jan 16, 2013 at 10:57 AM, gold snake <[email protected]> wrote:
>>>>>
>>>>>> So you mean cube exists only so that users can combine a language with
>>>>>> other languages, which means I don't need it (because my language is
>>>>>> not Arabic). Thanks. Maybe my English is not good; I just can't
>>>>>> understand what "cube" is or what it is used for, and I can't find an
>>>>>> introduction.
>>>>>>
>>>>>> And that means cube has nothing to do with my result being left to
>>>>>> right (the correct result must be right to left). Then why, when I use
>>>>>> the command: tesseract 14.jpg output -l [lang], is the content of the
>>>>>> result (output.txt) left to right?
>>>>>>
>>>>>> I'm very sorry if I am taking up the masters' precious time with these
>>>>>> small problems. Just a few days ago I didn't even know what OCR is.
>>>>>> If I could find the answers to these questions myself, believe me, I
>>>>>> would not ask anybody, because I really understand that everyone is
>>>>>> very busy. So from now on I will try hard to search for answers first.
>>>>>> Sorry again.
>>>>>>
>>>>>> On Wednesday, January 16, 2013 at 10:34:21 PM UTC+8, sventech wrote:
>>>>>>>
>>>>>>> The reason why Arabic has those files and your language does not is
>>>>>>> that Arabic is set up to use the "cube" feature to combine it with other
>>>>>>> languages, so you can do "-l ara+eng" and OCR a document with both
>>>>>>> Arabic and English. That training is harder, and not necessary if you
>>>>>>> mainly want to do monolingual documents.
>>>>>>>
>>>>>>> And what Zdenko is saying is that you are asking questions that
>>>>>>> don't show that you've tried to solve the problem yourself. We're all
>>>>>>> professional programmers and we want to help people, but we don't have
>>>>>>> time to teach elementary web searching or programming. You seem to be
>>>>>>> a smart guy, but your questions appear to be lazy. You need to make an
>>>>>>> effort to solve the problems and come to us for help, not ask us to
>>>>>>> solve them for you.
>>>>>>> --Sven
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jan 16, 2013 at 2:59 AM, gold snake <[email protected]> wrote:
>>>>>>>
>>>>>>>> I can't find any answer to my question in this link.
>>>>>>>> Can you just talk to me? Is it necessary to bully a rookie?
>>>>>>>> Please...
>>>>>>>>
>>>>>>>> On Wednesday, January 16, 2013 at 4:02:25 PM UTC+8, zdenop wrote:
>>>>>>>>>
>>>>>>>>> Really ;-)? I got 93 results. E.g.:
>>>>>>>>>
>>>>>>>>> https://groups.google.com/forum/#!msg/tesseract-ocr/0msQtTB_XrI/D1noel9GpPgJ
>>>>>>>>> https://groups.google.com/d/topic/tesseract-ocr/tyV5_z65XMk/discussion
>>>>>>>>> https://groups.google.com/d/msg/tesseract-ocr/R7UCx0oV3PA/GE7KJ_76kS0J
>>>>>>>>>
>>>>>>>>> Please honor the time of people on this list...
>>>>>>>>>
>>>>>>>>> Zdenko
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jan 16, 2013 at 8:18 AM, gold snake <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I can't find anything. Come on....
>>>>>>>>>>
>>>>>>>>>> On Tuesday, January 15, 2013 at 10:38:42 PM UTC+8, zdenop wrote:
>>>>>>>>>>>
>>>>>>>>>>>  search archive of tesseract forums for cube.
>>>>>>>>>>>
>>>>>>>>>>> Zdenko
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 15, 2013 at 2:16 PM, gold snake <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> My language is somewhat special: its script is like Arabic, with
>>>>>>>>>>>> some differences, actually only in the shapes of the letters.
>>>>>>>>>>>> It is also written right to left.
>>>>>>>>>>>> I'm using the standard tutorial:
>>>>>>>>>>>> https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>>>>>>>>>>>>
>>>>>>>>>>>> But when I finish and test, the text is not recognized accurately.
>>>>>>>>>>>> My steps are:
>>>>>>>>>>>>
>>>>>>>>>>>> tesseract as.kadas.exp0.tif as.kadas.exp0 batch.nochop makebox
>>>>>>>>>>>>
>>>>>>>>>>>> tesseract as.kadas.exp0.tif as.kadas.exp0 nobatch box.train
>>>>>>>>>>>>
>>>>>>>>>>>> unicharset_extractor as.kadas.exp0.box
>>>>>>>>>>>>
>>>>>>>>>>>> shapeclustering -F font_properties -U unicharset as.kadas.exp0.tr
>>>>>>>>>>>>
>>>>>>>>>>>> mftraining -F font_properties -U unicharset -O as.unicharset as.kadas.exp0.tr
>>>>>>>>>>>>
>>>>>>>>>>>> cntraining as.kadas.exp0.tr
>>>>>>>>>>>>
>>>>>>>>>>>> I don't have a word dictionary, so I skipped some steps.
>>>>>>>>>>>> Then I renamed some files, adding the as. prefix:
>>>>>>>>>>>>
>>>>>>>>>>>> combine_tessdata as.
>>>>>>>>>>>>
>>>>>>>>>>>> There was no error all the way through the combine step, so I'm
>>>>>>>>>>>> sure the process itself has no problem.
>>>>>>>>>>>> But when I test a picture whose content is 13, the result is: ئئ
>>>>>>>>>>>> When I test any words, the result is just ئ
>>>>>>>>>>>>
>>>>>>>>>>>> I also looked in D:\Little\Tesseract-OCR\tessdata and
>>>>>>>>>>>> found these files:
>>>>>>>>>>>>
>>>>>>>>>>>> ara.cube.bigrams
>>>>>>>>>>>> ara.cube.fold
>>>>>>>>>>>> ara.cube.lm
>>>>>>>>>>>> ara.cube.nn
>>>>>>>>>>>> ara.cube.params
>>>>>>>>>>>> ara.cube.size
>>>>>>>>>>>> ara.cube.word-freq
>>>>>>>>>>>> ara.traineddata
>>>>>>>>>>>>
>>>>>>>>>>>> I can't understand why the Arabic trained data is not just
>>>>>>>>>>>> ara.traineddata. What are the other ara.* files? Are they
>>>>>>>>>>>> necessary if I'm training my own language?
>>>>>>>>>>>> And how can I find or create those files?
>>>>>>>>>>>>
>>>>>>>>>>>> thanks very much...
>>>>>>>>>>>>
>>>>>>>>>>>>  --
>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>> To post to this group, send email to [email protected]
>>>>>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>>>>>> tesseract-oc...@googlegroups.com
>>>>>>>>>>>> For more options, visit this group at
>>>>>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> ``All that is gold does not glitter,
>>>>>>>   not all those who wander are lost;
>>>>>>> the old that is strong does not wither,
>>>>>>>   deep roots are not reached by the frost.
>>>>>>> From the ashes a fire shall be woken,
>>>>>>>   a light from the shadows shall spring;
>>>>>>> renewed shall be blade that was broken,
>>>>>>>   the crownless again shall be king.”
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>>>
>



-- 
Patrick Questembert, *ScanBizCards*
+1-917-250-4177 | www.scanbizcards.com
twitter.com/ScanBizCards | www.facebook.com/ScanBizCards
Just released: Power Contacts -
http://itunes.apple.com/us/app/power-contacts/id476986356?mt=8

