Regarding cube:
- there is no more public information about cube than the 92 hits on the
  forum that I already mentioned (plus the source code ;-))
- there is no information on how to create cube data files (OK, some of
  them are text files...)
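To see which of the cube files are text, a quick heuristic check could look like the sketch below. It is not part of Tesseract: the function names `looks_like_text` and `classify_cube_files` are made up for illustration, and the test (no NUL bytes, sample decodes as UTF-8) is only a guess that assumes the text files are UTF-8.

```python
import codecs
from pathlib import Path


def looks_like_text(path, sample_size=4096):
    """Heuristic only: treat a file as text if the first bytes contain
    no NUL byte and decode cleanly as UTF-8 (an assumption)."""
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    if b"\x00" in sample:
        return False
    # The incremental decoder tolerates a multi-byte character that is
    # cut off at the end of the sample, but rejects invalid bytes.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


def classify_cube_files(tessdata_dir, lang="ara"):
    """Map each <lang>.cube.* file name in tessdata_dir to 'text' or 'binary'."""
    result = {}
    for path in sorted(Path(tessdata_dir).glob(f"{lang}.cube.*")):
        result[path.name] = "text" if looks_like_text(path) else "binary"
    return result
```

Calling `classify_cube_files` on a tessdata directory returns a dict keyed by file name, so it is easy to print or to compare between languages. It says nothing about what the files mean, only which ones can be opened in a text editor.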
So you can:
1. try to use/train tesseract without the cube part (IMO that will still
   involve cube, because it looks like some cube files are part of the
   traineddata file [1])
2. try to analyze the cube data and share your findings - it can
   encourage more people to have a look at it :-)

[1] http://tesseract-ocr.googlecode.com/svn/trunk/doc/combine_tessdata.1.html#_components

Zdenko

On Thu, Jan 17, 2013 at 5:33 PM, gold snake <[email protected]> wrote:

> Arabic and English writing are quite different.
> In English, if you input a+b, the result is: ab
> But in Arabic, if you input ئ+ا the result is ئا. If that is not
> clear, copy ئا and add a space in the middle: you will see that when
> two different letters are joined, the result is a new letter shape.
>
> My language works this way too, so I'm afraid that cube is what
> controls this. If cube is for this, that's terrible, because I don't
> know how to create it (I don't mean that you should tell me how; I
> just need some example or documentation about it).
>
> As for RTL, it looks like there is no way to handle this; maybe we can
> only handle it in code (when reading is finished, change the display
> mode... something like that).
>
> thanks.
>
> On Thursday, January 17, 2013 at 10:36:44 PM UTC+8, sventech wrote:
>>
>> OK, the fact that cube is something different than combining
>> languages is a major revelation to me. However, huangjingshe, I don't
>> think you need the cube feature for what you're doing. I believe the
>> problem you're having is something else. I would solve the other
>> issues first and then maybe try the cube feature if necessary.
>> --Sven
>>
>> On Wed, Jan 16, 2013 at 10:07 PM, gold snake <[email protected]> wrote:
>>
>>> Thanks again, but I still have the same question. If cube is only
>>> used for combining with other languages during training, why can we
>>> choose cube mode when reading a document, as Sven said?
>>>
>>> Do you mean that we can combine with another language using
>>> -l [lang] because it has cube files? And if there are no cube files,
>>> we can't use -l [lang]?
>>>
>>> But I tested it. As everybody knows, the Chinese language only has a
>>> .traineddata file and no cube files, yet I can use
>>> tesseract -l chi_sim [lang].[fontname].exp0.tif [lang].[fontname].exp0 batch.nochop makeb
>>>
>>> So maybe it's not about the cube files, or I'm not using it right.....
>>>
>>> On Thursday, January 17, 2013 at 3:34:25 AM UTC+8, sventech wrote:
>>>>
>>>> Cube means combining different languages. There is not much
>>>> documentation on it -- Google developed it internally. But I don't
>>>> think you need it. The list of files you sent is related to the cube
>>>> feature, so you don't need to create them. For right to left, search
>>>> the archives for "right to left" -- someone wrote a Python script to
>>>> convert, though he didn't provide info about how to use it.
>>>>
>>>> utility to convert training files:
>>>> https://groups.google.com/forum/?fromgroups=#!searchin/tesseract-ocr/rtl/tesseract-ocr/T035ZyQVlMU/tQVoGWdlBDMJ
>>>>
>>>> basic trick for right to left output from Dmitri Silaev:
>>>> https://groups.google.com/forum/?fromgroups=#!searchin/tesseract-ocr/right$20to$20left$20output/tesseract-ocr/8r2qGvMzz9U/so1WuMTyaU8J
>>>> --Sven
>>>>
>>>> On Wed, Jan 16, 2013 at 10:57 AM, gold snake <[email protected]> wrote:
>>>>
>>>>> So you mean that cube exists only so that users can combine a
>>>>> language with other languages, which means I don't need it (because
>>>>> my language is not Arabic). Thanks. Maybe my English is not good; I
>>>>> just can't understand what "cube" is or what it is used for, and I
>>>>> can't find an introduction.
>>>>>
>>>>> And that means there is no relationship between cube and my result
>>>>> being left to right (the correct result must be right to left).
>>>>> Then why, when I use the command
>>>>> tesseract 14.jpg output -l [lang]
>>>>> is the content of the result (output.txt) left to right?
>>>>>
>>>>> I'm very sorry if I'm taking up the experts' valuable time with
>>>>> these small problems. Just a few days ago I didn't even know what
>>>>> OCR was. If I could find the answers to these questions myself,
>>>>> believe me, I wouldn't ask anybody, because I really do understand
>>>>> that everyone is very busy. So from now on I'm trying hard to
>>>>> search for these problems myself. Sorry again....
>>>>>
>>>>> On Wednesday, January 16, 2013 at 10:34:21 PM UTC+8, sventech wrote:
>>>>>>
>>>>>> The reason why Arabic has those files and your language does not
>>>>>> is that Arabic is set up to use the "cube" feature to combine it
>>>>>> with other languages, so you can do "-l ara+eng" and OCR a
>>>>>> document with both Arabic and English. That training is harder,
>>>>>> and not necessary if you mainly want to do monolingual documents.
>>>>>>
>>>>>> And what Zdenko is saying is that you are asking questions that
>>>>>> don't show that you've tried to solve the problem yourself. We're
>>>>>> all professional programmers and we want to help people, but we
>>>>>> don't have time to teach elementary web searching or programming.
>>>>>> You seem to be a smart guy, but your questions appear to be lazy.
>>>>>> You need to make an effort to solve the problems and come to us
>>>>>> for help, not ask us to solve them for you.
>>>>>> --Sven
>>>>>>
>>>>>> On Wed, Jan 16, 2013 at 2:59 AM, gold snake <[email protected]> wrote:
>>>>>>
>>>>>>> I can't find any answer to my question in this link.
>>>>>>> Can you just talk to me? Is it necessary to bully a rookie?
>>>>>>> Please...
>>>>>>>
>>>>>>> On Wednesday, January 16, 2013 at 4:02:25 PM UTC+8, zdenop wrote:
>>>>>>>>
>>>>>>>> Really ;-)? I got 93 results.
>>>>>>>> E.g.:
>>>>>>>>
>>>>>>>> https://groups.google.com/forum/#!msg/tesseract-ocr/0msQtTB_XrI/D1noel9GpPgJ
>>>>>>>> https://groups.google.com/d/topic/tesseract-ocr/tyV5_z65XMk/discussion
>>>>>>>> https://groups.google.com/d/msg/tesseract-ocr/R7UCx0oV3PA/GE7KJ_76kS0J
>>>>>>>>
>>>>>>>> Please respect the time of the people on this list...
>>>>>>>>
>>>>>>>> Zdenko
>>>>>>>>
>>>>>>>> On Wed, Jan 16, 2013 at 8:18 AM, gold snake <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I can't find anything. Come on....
>>>>>>>>>
>>>>>>>>> On Tuesday, January 15, 2013 at 10:38:42 PM UTC+8, zdenop wrote:
>>>>>>>>>>
>>>>>>>>>> Search the archive of the tesseract forums for cube.
>>>>>>>>>>
>>>>>>>>>> Zdenko
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 15, 2013 at 2:16 PM, gold snake <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> My language is somewhat special: its script is like Arabic,
>>>>>>>>>>> with some differences, actually only in the shape of the
>>>>>>>>>>> letters. It is also written right to left.
>>>>>>>>>>> I'm using the standard tutorial:
>>>>>>>>>>> https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>>>>>>>>>>>
>>>>>>>>>>> But when I finished and tested it, it could not recognize
>>>>>>>>>>> text accurately.
>>>>>>>>>>> My steps were:
>>>>>>>>>>>
>>>>>>>>>>> tesseract as.kadas.exp0.tif as.kadas.exp0 batch.nochop makebox
>>>>>>>>>>>
>>>>>>>>>>> tesseract as.kadas.exp0.tif as.kadas.exp0 nobatch box.train
>>>>>>>>>>>
>>>>>>>>>>> unicharset_extractor as.kadas.exp0.box
>>>>>>>>>>>
>>>>>>>>>>> shapeclustering -F font_properties -U unicharset as.kadas.exp0.tr
>>>>>>>>>>>
>>>>>>>>>>> mftraining -F font_properties -U unicharset -O as.unicharset as.kadas.exp0.tr
>>>>>>>>>>>
>>>>>>>>>>> cntraining as.kadas.exp0.tr
>>>>>>>>>>>
>>>>>>>>>>> I don't have a word dictionary, so ... I skipped some steps.
>>>>>>>>>>> Then I renamed some files, adding the as. prefix:
>>>>>>>>>>>
>>>>>>>>>>> combine_tessdata as.
>>>>>>>>>>>
>>>>>>>>>>> There were no errors up to and including the combine step, so
>>>>>>>>>>> I'm sure that part has no problem.
>>>>>>>>>>> But when I test a picture whose content is 13, the result
>>>>>>>>>>> is: ئئ
>>>>>>>>>>> When I test any words, the result is just ئ
>>>>>>>>>>>
>>>>>>>>>>> I also looked in D:\Little\Tesseract-OCR\tessdata and found
>>>>>>>>>>> these files:
>>>>>>>>>>>
>>>>>>>>>>> ara.cube.bigrams
>>>>>>>>>>> ara.cube.fold
>>>>>>>>>>> ara.cube.lm
>>>>>>>>>>> ara.cube.nn
>>>>>>>>>>> ara.cube.params
>>>>>>>>>>> ara.cube.size
>>>>>>>>>>> ara.cube.word-freq
>>>>>>>>>>> ara.traineddata
>>>>>>>>>>>
>>>>>>>>>>> I don't understand: why does the Arabic trained data not
>>>>>>>>>>> consist only of ara.traineddata? What are the other ara.*
>>>>>>>>>>> files? Are they necessary if I'm training my own language?
>>>>>>>>>>> And how can I find or create those files?
>>>>>>>>>>>
>>>>>>>>>>> thanks very much...
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>> To post to this group, send email to [email protected]
>>>>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>>>>> tesseract-oc...@googlegroups.com
>>>>>>>>>>> For more options, visit this group at
>>>>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>>>
>>>>>> --
>>>>>> ``All that is gold does not glitter,
>>>>>> not all those who wander are lost;
>>>>>> the old that is strong does not wither,
>>>>>> deep roots are not reached by the frost.
>>>>>> From the ashes a fire shall be woken,
>>>>>> a light from the shadows shall spring;
>>>>>> renewed shall be blade that was broken,
>>>>>> the crownless again shall be king.”
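For reference, the training steps quoted in this thread can be written down as a small Python sketch that only builds and prints the command lines; it does not run anything. The function name `training_commands` is hypothetical, the file naming follows the [lang].[fontname].exp[N] pattern used in the thread, and the commands are copied from the original post. The dictionary steps and the file renaming before combine_tessdata are left out because the thread does not spell them out.

```python
import shlex


def training_commands(lang, font, exp=0):
    """Return the command lines from the thread for one training image.

    Assumes the image is named <lang>.<font>.exp<N>.tif, as in the post.
    """
    base = f"{lang}.{font}.exp{exp}"
    return [
        # Generate a box file from the training image.
        ["tesseract", f"{base}.tif", base, "batch.nochop", "makebox"],
        # Produce the .tr feature file from the image and its box file.
        ["tesseract", f"{base}.tif", base, "nobatch", "box.train"],
        ["unicharset_extractor", f"{base}.box"],
        ["shapeclustering", "-F", "font_properties", "-U", "unicharset",
         f"{base}.tr"],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{base}.tr"],
        ["cntraining", f"{base}.tr"],
        # Not shown: renaming the output files with the "<lang>." prefix,
        # which the post mentions without listing the file names.
        ["combine_tessdata", f"{lang}."],
    ]


# Print the commands for the language and font used in the thread.
for argv in training_commands("as", "kadas"):
    print(shlex.join(argv))
```

Keeping the steps as data makes it easy to compare them against the tutorial one by one before running anything, which may help narrow down where a training run goes wrong.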

