Cube can provide higher accuracy. It is selected through the OcrEngineMode
parameter:

enum OcrEngineMode {
  OEM_TESSERACT_ONLY,           // Run Tesseract only - fastest
  OEM_CUBE_ONLY,                // Run Cube only - better accuracy, but slower
  OEM_TESSERACT_CUBE_COMBINED,  // Run both and combine results - best accuracy
  OEM_DEFAULT                   // Specify this mode when calling init_*(),
                                // to indicate that any of the above modes
                                // should be automatically inferred from the
                                // variables in the language-specific config,
                                // command-line configs, or if not specified
                                // in any of the above should be set to the
                                // default OEM_TESSERACT_ONLY.
};
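
The OEM_DEFAULT comment above describes a fallback chain. Here is a minimal,
self-contained sketch of that rule as I read it, not Tesseract's actual code.
ResolveEngineMode is a hypothetical helper, the enum is a local copy so the
snippet compiles without the Tesseract headers, and treating the
language-specific and command-line configs as a single optional value is my
simplification (the comment does not say which of the two wins).

```cpp
#include <cassert>

// Local copy of the enum quoted above, so the sketch builds on its own.
enum OcrEngineMode {
  OEM_TESSERACT_ONLY,
  OEM_CUBE_ONLY,
  OEM_TESSERACT_CUBE_COMBINED,
  OEM_DEFAULT
};

// Hypothetical helper, not part of Tesseract.
// requested:   the mode the caller passes at init time.
// from_config: the mode named by a config file or the command line,
//              or nullptr if no config specifies one (assumed shape).
inline OcrEngineMode ResolveEngineMode(OcrEngineMode requested,
                                       const OcrEngineMode* from_config) {
  assert(requested >= OEM_TESSERACT_ONLY && requested <= OEM_DEFAULT);
  // An explicit mode is used as given.
  if (requested != OEM_DEFAULT) return requested;
  // OEM_DEFAULT defers to the configs when they name a concrete mode.
  if (from_config != nullptr && *from_config != OEM_DEFAULT) return *from_config;
  // Nothing specified anywhere: fall back to plain Tesseract.
  return OEM_TESSERACT_ONLY;
}
```

With no config value at all, ResolveEngineMode(OEM_DEFAULT, nullptr) gives
OEM_TESSERACT_ONLY, which is the default the comment describes.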

I have been testing English with OEM_TESSERACT_CUBE_COMBINED: the results
are significantly better, but recognition takes twice as long. Testing
OEM_CUBE_ONLY on its own is still on my to-do list.

Patrick

On Thu, Jan 17, 2013 at 9:36 AM, Sven Pedersen <[email protected]> wrote:

> OK, the fact that cube is something different than combining languages is
> a major revelation to me. However, huangjingshe, I don't think you need the
> cube feature for what you're doing. I believe the problem you're having is
> something else. I would solve the other issues first and then maybe try the
> cube feature if necessary.
> --Sven
>
>
> On Wed, Jan 16, 2013 at 10:07 PM, gold snake <[email protected]> wrote:
>
>> Thanks again, but I still have the same question. If cube is only used
>> for combining with other languages during training, why can we choose a
>> cube mode when reading a document, as Sven said?
>>
>> Do you mean that we can combine languages with -l [lang] only because the
>> language has cube files, and that without any cube files we can't use
>> -l [lang]?
>>
>> But I tested this. As everybody knows, Chinese only has a .traineddata
>> file and no cube files, yet I can still run:
>> tesseract -l chi_sim [lang].[fontname].exp0.tif [lang].[fontname].exp0
>> batch.nochop makeb
>>
>> So maybe it is not about the cube files, or I am not using it correctly.
>>
>>
>> On Thursday, January 17, 2013 at 3:34:25 AM UTC+8, sventech wrote:
>>>
>>> Cube means combining different languages. There is not much
>>> documentation on it -- Google developed it internally. But I don't think
>>> you need it. The list of files you sent is related to the cube feature, so
>>> you don't need to create them. For right to left, search the archives for
>>> "right to left" -- someone wrote a python script to convert, though he
>>> didn't provide info about how to use it.
>>>
>>> utility to convert training files:
>>> https://groups.google.com/forum/?fromgroups=#!searchin/tesseract-ocr/rtl/tesseract-ocr/T035ZyQVlMU/tQVoGWdlBDMJ
>>>
>>> basic trick for right to left output from Dmitri Silaev:
>>> https://groups.google.com/forum/?fromgroups=#!searchin/tesseract-ocr/right$20to$20left$20output/tesseract-ocr/8r2qGvMzz9U/so1WuMTyaU8J
>>> --Sven
>>>
>>>
>>> On Wed, Jan 16, 2013 at 10:57 AM, gold snake <[email protected]> wrote:
>>>
>>>> So you mean that cube exists only so that users can combine a language
>>>> with other languages, which means I don't need it (because my language
>>>> is not Arabic). Thanks. Maybe my English is not good, but I just can't
>>>> understand what "cube" is or what it is used for, and I can't find an
>>>> introduction to it.
>>>>
>>>> That also means cube has nothing to do with my result coming out left to
>>>> right (the correct result must be right to left). Then why, when I run
>>>> the command
>>>> tesseract 14.jpg output -l [lang]
>>>> is the content of the result (output.txt) left to right?
>>>>
>>>> I'm very sorry if I am taking up the experts' valuable time with these
>>>> small problems. Just a few days ago I didn't even know what OCR was. If
>>>> I could find the answers to these questions myself, believe me, I would
>>>> not ask anybody, because I really do understand that everyone is very
>>>> busy. So from now on I will try hard to search for answers first. Sorry
>>>> again.
>>>>
>>>> On Wednesday, January 16, 2013 at 10:34:21 PM UTC+8, sventech wrote:
>>>>>
>>>>> The reason why Arabic has those files and your language does not is
>>>>> that Arabic is set up to use the "cube" feature to combine it with other
>>>>> languages, so you can do "-l ara+eng" and OCR a document with both Arabic
>>>>> and English. That training is harder, and not necessary if you mainly want
>>>>> to do monolingual documents.
>>>>>
>>>>> And what Zdenko is saying is that you are asking questions that don't
>>>>> show that you've tried to solve the problem yourself. We're all
>>>>> professional programmers and we want to help people but we don't have time
>>>>> to teach elementary web searching or programming. You seem to be a smart
>>>>> guy, but your questions appear to be lazy. You need to make an effort to
>>>>> solve the problems and come to us for help, not ask us to solve them for
>>>>> you.
>>>>> --Sven
>>>>>
>>>>>
>>>>> On Wed, Jan 16, 2013 at 2:59 AM, gold snake <[email protected]> wrote:
>>>>>
>>>>>> I can't find any answer to my question in this link.
>>>>>> Can you just talk to me? Is it necessary to bully a rookie?
>>>>>> Please...
>>>>>>
>>>>>> On Wednesday, January 16, 2013 at 4:02:25 PM UTC+8, zdenop wrote:
>>>>>>>
>>>>>>> Really ;-)? I got 93 results. E.g.:
>>>>>>>
>>>>>>> https://groups.google.com/forum/#!msg/tesseract-ocr/0msQtTB_XrI/D1noel9GpPgJ
>>>>>>> https://groups.google.com/d/topic/tesseract-ocr/tyV5_z65XMk/discussion
>>>>>>> https://groups.google.com/d/msg/tesseract-ocr/R7UCx0oV3PA/GE7KJ_76kS0J
>>>>>>>
>>>>>>> Please honor time of people on this list...
>>>>>>>
>>>>>>> Zdenko
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jan 16, 2013 at 8:18 AM, gold snake <[email protected]> wrote:
>>>>>>>
>>>>>>>> I can't find anything. Come on....
>>>>>>>>
>>>>>>>> On Tuesday, January 15, 2013 at 10:38:42 PM UTC+8, zdenop wrote:
>>>>>>>>>
>>>>>>>>> Search the archive of the tesseract forums for cube.
>>>>>>>>>
>>>>>>>>> Zdenko
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jan 15, 2013 at 2:16 PM, gold snake <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> My language is somewhat special. Its script is like Arabic, with
>>>>>>>>>> some differences, but really only in the shapes of the letters.
>>>>>>>>>> It is also written right to left.
>>>>>>>>>> I am following the standard tutorial:
>>>>>>>>>> https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>>>>>>>>>>
>>>>>>>>>> But when I finish and test, it cannot recognize text accurately.
>>>>>>>>>> My steps are:
>>>>>>>>>>
>>>>>>>>>> tesseract as.kadas.exp0.tif as.kadas.exp0 batch.nochop makebox
>>>>>>>>>>
>>>>>>>>>> tesseract as.kadas.exp0.tif as.kadas.exp0 nobatch box.train
>>>>>>>>>>
>>>>>>>>>> unicharset_extractor as.kadas.exp0.box
>>>>>>>>>>
>>>>>>>>>> shapeclustering -F font_properties -U unicharset as.kadas.exp0.tr
>>>>>>>>>>
>>>>>>>>>> mftraining -F font_properties -U unicharset -O as.unicharset as.kadas.exp0.tr
>>>>>>>>>>
>>>>>>>>>> cntraining as.kadas.exp0.tr
>>>>>>>>>>
>>>>>>>>>> I don't have a word dictionary, so I skipped those steps.
>>>>>>>>>> I renamed the output files, adding the "as." prefix, and ran:
>>>>>>>>>>
>>>>>>>>>> combine_tessdata as.
>>>>>>>>>>
>>>>>>>>>> There were no errors up to and including the combine step, so I
>>>>>>>>>> am sure the training itself ran without problems.
>>>>>>>>>> But when I test a picture whose content is "13", the result is: ئئ
>>>>>>>>>> When I test any word, the result is just: ئ
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I also looked in D:\Little\Tesseract-OCR\tessdata and found these
>>>>>>>>>> files:
>>>>>>>>>>
>>>>>>>>>> ara.cube.bigrams
>>>>>>>>>> ara.cube.fold
>>>>>>>>>> ara.cube.lm
>>>>>>>>>> ara.cube.nn
>>>>>>>>>> ara.cube.params
>>>>>>>>>> ara.cube.size
>>>>>>>>>> ara.cube.word-freq
>>>>>>>>>> ara.traineddata
>>>>>>>>>>
>>>>>>>>>> I don't understand why the Arabic training data is not just
>>>>>>>>>> ara.traineddata. What are the other ara.* files? Are they
>>>>>>>>>> necessary if I am training my own language?
>>>>>>>>>> And how can I find or create those files?
>>>>>>>>>>
>>>>>>>>>> Thanks very much.
>>>>>>>>>>
>>>>>>>>>>  --
>>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>>> To post to this group, send email to [email protected]
>>>>>>>>>>
>>>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>>>> tesseract-oc...@googlegroups.com
>>>>>>>>>>
>>>>>>>>>> For more options, visit this group at
>>>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> ``All that is gold does not glitter,
>>>>>   not all those who wander are lost;
>>>>> the old that is strong does not wither,
>>>>>   deep roots are not reached by the frost.
>>>>> From the ashes a fire shall be woken,
>>>>>   a light from the shadows shall spring;
>>>>> renewed shall be blade that was broken,
>>>>>   the crownless again shall be king.”
>>>>>
>>>
>
>
>



-- 
Patrick Questembert, *ScanBizCards*
+1-917-250-4177 | www.scanbizcards.com
twitter.com/ScanBizCards | www.facebook.com/ScanBizCards
Just released: Power Contacts -
http://itunes.apple.com/us/app/power-contacts/id476986356?mt=8

