Yes, cube remains a mystery to common mortals ... I am experimenting with it within ScanBizCards, and here are my findings so far, running Tesseract 3.02 on a black & white rendition of a standard business card (image size 1,024x768) on an iPhone 4S:
1. OcrEngineMode=OEM_TESSERACT_ONLY
   // Tess sources comment: Run Tesseract only - fastest
   Time: 6 seconds
   Accuracy: good

2. OEM_CUBE_ONLY
   // Tess sources comment: Run Cube only - better accuracy, but slower
   Time: 53 (!) seconds
   Accuracy: I have yet to run it on a large enough sample but for now I am not convinced this mode is more accurate than OEM_TESSERACT_ONLY, at least for business cards

3. OEM_TESSERACT_CUBE_COMBINED
   // Tess sources comment: Run both and combine results - best accuracy
   Time: 63 (!) seconds
   Accuracy: best, improves on OEM_TESSERACT_ONLY

As you can see, the performance penalty for cube is severe but if you need highest accuracy I would recommend skipping OEM_CUBE_ONLY and using OEM_TESSERACT_CUBE_COMBINED

Patrick

On Thu, Jan 17, 2013 at 5:26 PM, zdenko podobny <[email protected]> wrote:

> Regarding cube:
>
> - there are no more public information about cube than that 92 hits at
> the forum I mentioned already (+ source code ;-))
> - there are no information how to create cube data files (ok some of
> them are text files...)
>
> So you can:
>
> 1. try to use/train tesseract without cube part (IMO you will need for
> it for cube, because it looks like some cube files are part of traineddata
> file[1]
> 2. try to analyze cube data and share your finding - it
> can encourage more people to have a look on it :-)
>
> [1]
> http://tesseract-ocr.googlecode.com/svn/trunk/doc/combine_tessdata.1.html#_components
>
> Zdenko
>
> On Thu, Jan 17, 2013 at 5:33 PM, gold snake <[email protected]> wrote:
>
>> the Arab and English font some think very different.
>> English font if you input a+b , the result is :ab
>> but if you use Arab font input ئ+ا the result is ئا , if you not
>> understand, you can copy ئا and add a space for middle, you can find if
>> you input 2 different font , the result is a new font style.
>>
>> My language too, so, i just afraid the cube is the control for this.
>> if cube is for this , it's terrible, because i don't know how create(i not
>> mean you tell me how, i just need some example or document about this
>> information.)
>>
>> and about the RTL , looks mean that is not any way for handle this , may
>> be we only use programming handle this(when read finish, change display
>> mode....something like that).
>>
>> thanks.
>>
>> On Thursday, January 17, 2013 at 10:36:44 PM UTC+8, sventech wrote:
>>>
>>> OK, the fact that cube is something different than combining languages
>>> is a major revelation to me. However, huangjingshe, I don't think you need
>>> the cube feature for what you're doing. I believe the problem you're having
>>> is something else. I would solve the other issues first and then maybe try
>>> the cube feature if necessary.
>>> --Sven
>>>
>>> On Wed, Jan 16, 2013 at 10:07 PM, gold snake <[email protected]> wrote:
>>>
>>>> thanks again .but i have same question. if use cube just for combine
>>>> with other language when training. why when we read document can choice
>>>> cube mode just like Sven said??
>>>>
>>>> it that you mean we can combine with other language use -l [lang]because
>>>> it's have cube file. if there is no any cube file. we can't use
>>>> -l [lang]??
>>>>
>>>> but i'm test, and everybody knows china language only have .traindata
>>>> file, not have cube file .but i can use
>>>> tesseract -l chi_sim [lang].[fontname].exp0.tif [lang].[fontname].exp0 batch.nochop makeb
>>>>
>>>> so , it's maybe not about cube file. or i'm not using right.....
>>>>
>>>> On Thursday, January 17, 2013 at 3:34:25 AM UTC+8, sventech wrote:
>>>>>
>>>>> Cube means combining different languages. There is not much
>>>>> documentation on it -- Google developed it internally. But I don't think
>>>>> you need it. The list of files you sent is related to the cube feature, so
>>>>> you don't need to create them.
>>>>> For right to left, search the archives for
>>>>> "right to left" -- someone wrote a python script to convert, though he
>>>>> didn't provide info about how to use it.
>>>>>
>>>>> utility to convert training files:
>>>>> https://groups.google.com/forum/?fromgroups=#!searchin/tesseract-ocr/rtl/tesseract-ocr/T035ZyQVlMU/tQVoGWdlBDMJ
>>>>>
>>>>> basic trick for right to left output from Dmitri Silaev:
>>>>> https://groups.google.com/forum/?fromgroups=#!searchin/tesseract-ocr/right$20to$20left$20output/tesseract-ocr/8r2qGvMzz9U/so1WuMTyaU8J
>>>>> --Sven
>>>>>
>>>>> On Wed, Jan 16, 2013 at 10:57 AM, gold snake <[email protected]> wrote:
>>>>>
>>>>>> so you mean: cube exists just because for user combine it with other
>>>>>> language, the mean i'm not be need(because my language is not arab).
>>>>>> thanks.may be i'm English not good. i just cant understand what is
>>>>>> "cube", what is for use , can't find Introduction.
>>>>>>
>>>>>> and that mean cube and my result is left to right(accurate results
>>>>>> must is right to left ) not any relationship. then why when i'm use
>>>>>> command:tesseract 14.jpg output -l [lang]. the result(output.txt)
>>>>>> content is left to right??
>>>>>>
>>>>>> i'm very sorry if let masters take the beautiful time for these small
>>>>>> problems. just some days ago i'm event don't know what is OCR
>>>>>> if i can find that some question answer....believe me i'm not gonna
>>>>>> ask anybody , because it's true,
>>>>>> i really understand every friend is very busy. so , i'm trying hard
>>>>>> search some problem from now. sorry again....
>>>>>>
>>>>>> On Wednesday, January 16, 2013 at 10:34:21 PM UTC+8, sventech wrote:
>>>>>>>
>>>>>>> The reason why Arabic has those files and your language does not is
>>>>>>> that Arabic is set up to use the "cube" feature to combine it with other
>>>>>>> languages, so you can do "-l ara+eng" and OCR a document with both
>>>>>>> Arabic and English. That training is harder, and not necessary if you
>>>>>>> mainly want to do monolingual documents.
>>>>>>>
>>>>>>> And what Zdenko is saying is that you are asking questions that
>>>>>>> don't show that you're tried to solve the problem yourself. We're all
>>>>>>> professional programmers and we want to help people but we don't have
>>>>>>> time to teach elementary web searching or programming. You seem to be a
>>>>>>> smart guy, but your questions appear to be lazy. You need to make an
>>>>>>> effort to solve the problems and come to us for help, not ask us to
>>>>>>> solve them for you.
>>>>>>> --Sven
>>>>>>>
>>>>>>> On Wed, Jan 16, 2013 at 2:59 AM, gold snake <[email protected]> wrote:
>>>>>>>
>>>>>>>> I can't found any answer for my question in this link.
>>>>>>>> can you just tolk to me? Is have necessary to bully a rookie?
>>>>>>>> please...
>>>>>>>>
>>>>>>>> On Wednesday, January 16, 2013 at 4:02:25 PM UTC+8, zdenop wrote:
>>>>>>>>>
>>>>>>>>> Really ;-)? I got 93 results. E.g.:
>>>>>>>>>
>>>>>>>>> https://groups.google.com/forum/#!msg/tesseract-ocr/0msQtTB_XrI/D1noel9GpPgJ
>>>>>>>>> https://groups.google.com/d/topic/tesseract-ocr/tyV5_z65XMk/discussion
>>>>>>>>> https://groups.google.com/d/msg/tesseract-ocr/R7UCx0oV3PA/GE7KJ_76kS0J
>>>>>>>>>
>>>>>>>>> Please honor time of people on this list...
>>>>>>>>>
>>>>>>>>> Zdenko
>>>>>>>>>
>>>>>>>>> On Wed, Jan 16, 2013 at 8:18 AM, gold snake <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I can't found anything. common....
>>>>>>>>>>
>>>>>>>>>> On Tuesday, January 15, 2013 at 10:38:42 PM UTC+8, zdenop wrote:
>>>>>>>>>>>
>>>>>>>>>>> search archive of tesseract forums for cube.
>>>>>>>>>>>
>>>>>>>>>>> Zdenko
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 15, 2013 at 2:16 PM, gold snake <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> My language some special, just like arab font, but bitween
>>>>>>>>>>>> arab font have some different, actually only different on shape of
>>>>>>>>>>>> the font. and It's writing right to left too.
>>>>>>>>>>>> I'm using standard tutorial :
>>>>>>>>>>>> https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>>>>>>>>>>>>
>>>>>>>>>>>> but when i'm finish and test, it can't be accurately identify.
>>>>>>>>>>>> my step is :
>>>>>>>>>>>>
>>>>>>>>>>>> tesseract as.kadas.exp0.tif as.kadas.exp0 batch.nochop makebox
>>>>>>>>>>>>
>>>>>>>>>>>> tesseract as.kadas.exp0.tif as.kadas.exp0 nobatch box.train
>>>>>>>>>>>>
>>>>>>>>>>>> unicharset_extractor as.kadas.exp0.box
>>>>>>>>>>>>
>>>>>>>>>>>> shapeclustering -F font_properties -U unicharset as.kadas.exp0.tr
>>>>>>>>>>>>
>>>>>>>>>>>> mftraining -F font_properties -U unicharset -O as.unicharset as.kadas.exp0.tr
>>>>>>>>>>>>
>>>>>>>>>>>> cntraining as.kadas.exp0.tr
>>>>>>>>>>>>
>>>>>>>>>>>> I haven't words dict. so ... i'm not use some step.
>>>>>>>>>>>> rename some file , add as. prefix
>>>>>>>>>>>>
>>>>>>>>>>>> combine_tessdata as.
>>>>>>>>>>>>
>>>>>>>>>>>> there is no any error until i'm combne, so i'm sure it's not
>>>>>>>>>>>> have any problem.
>>>>>>>>>>>> and when i'm test picture ,content is 13.
>>>>>>>>>>>> the result is : ئئ
>>>>>>>>>>>> when i'm test any words, the result just ئ
>>>>>>>>>>>>
>>>>>>>>>>>> and i'm find the D:\Little\Tesseract-OCR\tessdata , and
>>>>>>>>>>>> i'm found some file :
>>>>>>>>>>>>
>>>>>>>>>>>> ara.cube.bigrams
>>>>>>>>>>>> ara.cube.fold
>>>>>>>>>>>> ara.cube.lm
>>>>>>>>>>>> ara.cube.nn
>>>>>>>>>>>> ara.cube.params
>>>>>>>>>>>> ara.cube.size
>>>>>>>>>>>> ara.cube.word-freq
>>>>>>>>>>>> ara.traineddata
>>>>>>>>>>>>
>>>>>>>>>>>> and i can't understand. why the arab trainddata not only
>>>>>>>>>>>> have ara.traineddata? what is any other arab.* file ?? and if i'm
>>>>>>>>>>>> trainning my lanugage it's necessary??
>>>>>>>>>>>> and how i cant find that file or create??
>>>>>>>>>>>>
>>>>>>>>>>>> thanks very much...
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>>>>> To post to this group, send email to [email protected]
>>>>>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>>>>>> tesseract-oc...@googlegroups.com
>>>>>>>>>>>> For more options, visit this group at
>>>>>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>>>>
>>>>>>> --
>>>>>>> ``All that is gold does not glitter,
>>>>>>> not all those who wander are lost;
>>>>>>> the old that is strong does not wither,
>>>>>>> deep roots are not reached by the frost.
>>>>>>> From the ashes a fire shall be woken,
>>>>>>> a light from the shadows shall spring;
>>>>>>> renewed shall be blade that was broken,
>>>>>>> the crownless again shall be king.”

--
Patrick Questembert, *ScanBizCards*
+1-917-250-4177 | www.scanbizcards.com
twitter.com/ScanBizCards | www.facebook.com/ScanBizCards
Just released: Power Contacts - http://itunes.apple.com/us/app/power-contacts/id476986356?mt=8
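
P.S. To put the "severe" penalty in numbers, here is a minimal Python sketch that turns the three timings reported at the top of this message into slowdown factors relative to OEM_TESSERACT_ONLY. The dictionary and the function name are just illustrative (they are not part of Tesseract), and the figures come from a single card on a single device, so treat the ratios as a rough indication only.

```python
# Timings reported in this message (seconds): Tesseract 3.02, one
# 1,024x768 black & white business card image on an iPhone 4S.
TIMINGS_S = {
    "OEM_TESSERACT_ONLY": 6,
    "OEM_CUBE_ONLY": 53,
    "OEM_TESSERACT_CUBE_COMBINED": 63,
}


def slowdown_vs_baseline(timings, baseline="OEM_TESSERACT_ONLY"):
    """Return each mode's run time as a multiple of the baseline mode's time."""
    base = timings[baseline]
    return {mode: seconds / base for mode, seconds in timings.items()}


if __name__ == "__main__":
    for mode, factor in slowdown_vs_baseline(TIMINGS_S).items():
        print(f"{mode}: {TIMINGS_S[mode]} s ({factor:.1f}x)")
```

On these numbers the combined mode costs roughly ten times the Tesseract-only run, and only about 10 seconds more than cube alone, which is why I would skip OEM_CUBE_ONLY when accuracy matters.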

