If I find a way to create cube data for my language, I will have to use it. Thanks anyway; that result is important.
On Friday, January 18, 2013 at 6:41:28 AM UTC+8, Patrick Questembert wrote:
>
> Yes, cube remains a mystery for the common mortals ... I am experimenting
> with it within ScanBizCards and here are my findings so far running
> Tesseract 3.02 on a black & white rendition of a standard business card
> (image size 1,024x768), on an iPhone 4S:
>
> 1. OcrEngineMode=OEM_TESSERACT_ONLY // Tess sources comment: Run
> Tesseract only - fastest
> Time: 6 seconds
> Accuracy: good
>
> 2. OEM_CUBE_ONLY // Tess sources comment: Run Cube only -
> better accuracy, but slower
> Time: 53 (!) seconds
> Accuracy: I have yet to run it on a large enough sample but for now I am
> not convinced this mode is more accurate than OEM_TESSERACT_ONLY, at least
> for business cards
>
> 3. OEM_TESSERACT_CUBE_COMBINED // Tess sources comment: Run both and
> combine results - best accuracy
> Time: 63 (!) seconds
> Accuracy: best, improves on OEM_TESSERACT_ONLY
>
> As you can see, the performance penalty for cube is severe but if you need
> highest accuracy I would recommend skipping OEM_CUBE_ONLY and using
> OEM_TESSERACT_CUBE_COMBINED
>
> Patrick
>
> On Thu, Jan 17, 2013 at 5:26 PM, zdenko podobny <[email protected]> wrote:
>
>> Regarding cube:
>>
>> - there is no more public information about cube than the 92 hits
>> at the forum I mentioned already (+ source code ;-))
>> - there is no information on how to create cube data files (ok, some of
>> them are text files...)
>>
>> So you can:
>>
>> 1. try to use/train tesseract without cube part (IMO you will need
>> for it for cube, because it looks like some cube files are part of
>> traineddata file[1])
>> 2. try to analyze cube data and share your findings - it
>> can encourage more people to have a look at it :-)
>>
>> [1]
>> http://tesseract-ocr.googlecode.com/svn/trunk/doc/combine_tessdata.1.html#_components
>>
>> Zdenko
>>
>> On Thu, Jan 17, 2013 at 5:33 PM, gold snake <[email protected]> wrote:
>>
>>> Arabic and English script are very different in some ways.
>>> In English, if you type a+b, the result is: ab
>>> But in Arabic, if you type ئ+ا the result is ئا. If that is not
>>> clear, copy ئا and put a space in the middle: you will see that
>>> joining two different letters produces a new letter shape.
>>>
>>> My language works the same way, so I am just afraid that cube is what
>>> controls this. If cube is for this, that is terrible, because I don't
>>> know how to create it (I don't mean you should tell me how; I just need
>>> some example or document about it).
>>>
>>> And about RTL, it looks like there is no built-in way to handle it;
>>> maybe we can only handle it in our own code (when reading is finished,
>>> change the display mode... something like that).
>>>
>>> Thanks.
>>>
>>> On Thursday, January 17, 2013 at 10:36:44 PM UTC+8, sventech wrote:
>>>>
>>>> OK, the fact that cube is something different than combining languages
>>>> is a major revelation to me. However, huangjingshe, I don't think you
>>>> need the cube feature for what you're doing. I believe the problem
>>>> you're having is something else. I would solve the other issues first
>>>> and then maybe try the cube feature if necessary.
>>>> --Sven
>>>>
>>>> On Wed, Jan 16, 2013 at 10:07 PM, gold snake <[email protected]> wrote:
>>>>
>>>>> Thanks again, but I still have the same question. If cube is only for
>>>>> combining with another language during training, why can we choose a
>>>>> cube mode when reading a document, just like Sven said??
>>>>>
>>>>> Do you mean we can combine with another language using -l [lang]
>>>>> because it has cube files, and if there are no cube files we can't
>>>>> use -l [lang]??
>>>>>
>>>>> But I tested it, and as everybody knows, Chinese only has a
>>>>> .traineddata file, not cube files, yet I can run
>>>>> tesseract -l chi_sim [lang].[fontname].exp0.tif [lang].[fontname].exp0
>>>>> batch.nochop makeb
>>>>>
>>>>> So maybe it's not about the cube files, or I'm not using it right.....
>>>>>
>>>>> On Thursday, January 17, 2013 at 3:34:25 AM UTC+8, sventech wrote:
>>>>>>
>>>>>> Cube means combining different languages. There is not much
>>>>>> documentation on it -- Google developed it internally. But I don't
>>>>>> think you need it. The list of files you sent is related to the cube
>>>>>> feature, so you don't need to create them. For right to left, search
>>>>>> the archives for "right to left" -- someone wrote a python script to
>>>>>> convert, though he didn't provide info about how to use it.
>>>>>>
>>>>>> utility to convert training files:
>>>>>> https://groups.google.com/forum/?fromgroups=#!searchin/tesseract-ocr/rtl/tesseract-ocr/T035ZyQVlMU/tQVoGWdlBDMJ
>>>>>>
>>>>>> basic trick for right to left output from Dmitri Silaev:
>>>>>> https://groups.google.com/forum/?fromgroups=#!searchin/tesseract-ocr/right$20to$20left$20output/tesseract-ocr/8r2qGvMzz9U/so1WuMTyaU8J
>>>>>> --Sven
>>>>>>
>>>>>> On Wed, Jan 16, 2013 at 10:57 AM, gold snake <[email protected]> wrote:
>>>>>>
>>>>>>> So you mean cube exists only so that users can combine a language
>>>>>>> with other languages, which means I don't need it (because my
>>>>>>> language is not Arabic). Thanks. Maybe my English is not good; I
>>>>>>> just can't understand what "cube" is or what it is used for, and I
>>>>>>> can't find an introduction.
>>>>>>>
>>>>>>> And that means cube has nothing to do with my result coming out
>>>>>>> left to right (the correct result must be right to left). Then why,
>>>>>>> when I run the command: tesseract 14.jpg output -l [lang], is the
>>>>>>> content of the result (output.txt) left to right??
>>>>>>>
>>>>>>> I'm very sorry if I am taking up the experts' valuable time with
>>>>>>> these small problems. Just a few days ago I didn't even know what
>>>>>>> OCR was. If I could find the answer to a question myself... believe
>>>>>>> me, I would not ask anybody, because I really understand that
>>>>>>> everyone is very busy. So I'm trying hard to search for answers
>>>>>>> myself from now on. Sorry again....
>>>>>>>
>>>>>>> On Wednesday, January 16, 2013 at 10:34:21 PM UTC+8, sventech wrote:
>>>>>>>>
>>>>>>>> The reason why Arabic has those files and your language does not
>>>>>>>> is that Arabic is set up to use the "cube" feature to combine it
>>>>>>>> with other languages, so you can do "-l ara+eng" and OCR a
>>>>>>>> document with both Arabic and English. That training is harder,
>>>>>>>> and not necessary if you mainly want to do monolingual documents.
>>>>>>>>
>>>>>>>> And what Zdenko is saying is that you are asking questions that
>>>>>>>> don't show that you've tried to solve the problem yourself. We're
>>>>>>>> all professional programmers and we want to help people but we
>>>>>>>> don't have time to teach elementary web searching or programming.
>>>>>>>> You seem to be a smart guy, but your questions appear to be lazy.
>>>>>>>> You need to make an effort to solve the problems and come to us
>>>>>>>> for help, not ask us to solve them for you.
>>>>>>>> --Sven
>>>>>>>>
>>>>>>>> On Wed, Jan 16, 2013 at 2:59 AM, gold snake <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I can't find any answer to my question in this link.
>>>>>>>>> Can you just talk to me? Is it necessary to bully a rookie?
>>>>>>>>> Please...
>>>>>>>>>
>>>>>>>>> On Wednesday, January 16, 2013 at 4:02:25 PM UTC+8, zdenop wrote:
>>>>>>>>>>
>>>>>>>>>> Really ;-)? I got 93 results. E.g.:
>>>>>>>>>>
>>>>>>>>>> https://groups.google.com/forum/#!msg/tesseract-ocr/0msQtTB_XrI/D1noel9GpPgJ
>>>>>>>>>> https://groups.google.com/d/topic/tesseract-ocr/tyV5_z65XMk/discussion
>>>>>>>>>> https://groups.google.com/d/msg/tesseract-ocr/R7UCx0oV3PA/GE7KJ_76kS0J
>>>>>>>>>>
>>>>>>>>>> Please honor the time of people on this list...
>>>>>>>>>>
>>>>>>>>>> Zdenko
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 16, 2013 at 8:18 AM, gold snake <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I can't find anything. Come on....
>>>>>>>>>>>
>>>>>>>>>>> On Tuesday, January 15, 2013 at 10:38:42 PM UTC+8, zdenop wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> search archive of tesseract forums for cube.
>>>>>>>>>>>>
>>>>>>>>>>>> Zdenko
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jan 15, 2013 at 2:16 PM, gold snake <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> My language is somewhat special. It is like Arabic script,
>>>>>>>>>>>>> but differs from it in some ways, actually only in the shape
>>>>>>>>>>>>> of the letters. It is also written right to left.
>>>>>>>>>>>>> I'm using the standard tutorial:
>>>>>>>>>>>>> https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>>>>>>>>>>>>>
>>>>>>>>>>>>> But when I finish and test, it can't recognize text accurately.
>>>>>>>>>>>>> My steps are:
>>>>>>>>>>>>>
>>>>>>>>>>>>> tesseract as.kadas.exp0.tif as.kadas.exp0 batch.nochop makebox
>>>>>>>>>>>>>
>>>>>>>>>>>>> tesseract as.kadas.exp0.tif as.kadas.exp0 nobatch box.train
>>>>>>>>>>>>>
>>>>>>>>>>>>> unicharset_extractor as.kadas.exp0.box
>>>>>>>>>>>>>
>>>>>>>>>>>>> shapeclustering -F font_properties -U unicharset as.kadas.exp0.tr
>>>>>>>>>>>>>
>>>>>>>>>>>>> mftraining -F font_properties -U unicharset -O as.unicharset as.kadas.exp0.tr
>>>>>>>>>>>>>
>>>>>>>>>>>>> cntraining as.kadas.exp0.tr
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't have a word dictionary, so I skipped some steps.
>>>>>>>>>>>>> Rename some files, adding the as. prefix.
>>>>>>>>>>>>>
>>>>>>>>>>>>> combine_tessdata as.
>>>>>>>>>>>>>
>>>>>>>>>>>>> There was no error up to and including the combine step, so
>>>>>>>>>>>>> I'm sure nothing went wrong there.
>>>>>>>>>>>>> But when I test a picture whose content is 13, the result is: ئئ
>>>>>>>>>>>>> When I test any word, the result is just ئ
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then I looked in D:\Little\Tesseract-OCR\tessdata and found
>>>>>>>>>>>>> these files:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ara.cube.bigrams
>>>>>>>>>>>>> ara.cube.fold
>>>>>>>>>>>>> ara.cube.lm
>>>>>>>>>>>>> ara.cube.nn
>>>>>>>>>>>>> ara.cube.params
>>>>>>>>>>>>> ara.cube.size
>>>>>>>>>>>>> ara.cube.word-freq
>>>>>>>>>>>>> ara.traineddata
>>>>>>>>>>>>>
>>>>>>>>>>>>> I can't understand why the Arabic trained data is not just
>>>>>>>>>>>>> ara.traineddata. What are the other ara.* files?? Are they
>>>>>>>>>>>>> necessary if I'm training my language??
>>>>>>>>>>>>> And how can I find or create those files??
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks very much...
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>> To post to this group, send email to
>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>>>>>>> tesseract-oc...@googlegroups.com
>>>>>>>>>>>>> For more options, visit this group at
>>>>>>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>>>>>
>>>>>>>> --
>>>>>>>> ``All that is gold does not glitter,
>>>>>>>> not all those who wander are lost;
>>>>>>>> the old that is strong does not wither,
>>>>>>>> deep roots are not reached by the frost.
>>>>>>>> From the ashes a fire shall be woken,
>>>>>>>> a light from the shadows shall spring;
>>>>>>>> renewed shall be blade that was broken,
>>>>>>>> the crownless again shall be king.”
>
> --
> Patrick Questembert, *ScanBizCards*
> +1-917-250-4177 | www.scanbizcards.com
> twitter.com/ScanBizCards | www.facebook.com/ScanBizCards
> Just released: Power Contacts -
> http://itunes.apple.com/us/app/power-contacts/id476986356?mt=8
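
For anyone following along, the training steps quoted in the first post of this thread can be scripted so the file names stay consistent. This is only a rough sketch in Python, not an official tool. The command lines are copied from that post. The function names `training_commands` and `rename_pairs` are made up for illustration, and the list of files that get the language prefix is an assumption recalled from the TrainingTesseract3 wiki, not something stated in the thread. The script only prints the commands; it does not run them.

```python
import shlex

# Assumption: these are the intermediate files that get the "<lang>." prefix
# before combine_tessdata (recalled from the TrainingTesseract3 wiki; the
# thread itself only says "rename some file, add as. prefix").
FILES_TO_PREFIX = ["inttemp", "normproto", "pffmtable", "shapetable"]


def training_commands(lang, font, exp=0):
    """Return the command lines from the first post as argv lists.

    The renames from rename_pairs() belong between cntraining and
    combine_tessdata (the last entry).
    """
    base = f"{lang}.{font}.exp{exp}"
    tr = f"{base}.tr"
    return [
        ["tesseract", f"{base}.tif", base, "batch.nochop", "makebox"],
        ["tesseract", f"{base}.tif", base, "nobatch", "box.train"],
        ["unicharset_extractor", f"{base}.box"],
        ["shapeclustering", "-F", "font_properties", "-U", "unicharset", tr],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", tr],
        ["cntraining", tr],
        ["combine_tessdata", f"{lang}."],
    ]


def rename_pairs(lang):
    """Return (old_name, new_name) pairs for the prefix-renaming step."""
    return [(name, f"{lang}.{name}") for name in FILES_TO_PREFIX]


if __name__ == "__main__":
    # Print the commands for the "as" language and "kadas" font from the post.
    for argv in training_commands("as", "kadas"):
        print(shlex.join(argv))
    for old, new in rename_pairs("as"):
        print(f"rename {old} -> {new}")
```

Printing rather than executing makes it easy to compare the generated lines against what was actually typed, which is where a mismatched prefix or font name would show up.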

