Also see the language training data available at https://github.com/tesseract-ocr/langdata
ShreeDevi
____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com

On Thu, Jul 9, 2015 at 8:27 PM, ShreeDevi Kumar <[email protected]> wrote:

> Have you tried with the new traineddata files at
>
> https://github.com/tesseract-ocr/tessdata
>
>
> ShreeDevi
> ____________________________________________________________
> Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com
>
> On Thu, Jul 9, 2015 at 2:55 PM, <[email protected]> wrote:
>
>> Hi, Nade, thanks for your post.
>>
>> I've tried your method on chi_sim but got 17 empty sub dawgs. However, my
>> fixed-length.dawg is around 600Kb... BTW, do you have any idea what this
>> file is for? Any help to improve the accuracy of Chinese recognition?
>>
>> -Han
>>
>> On Tuesday, May 19, 2015 at 2:25:38 PM UTC+8, Nade Sritanyaratana wrote:
>>>
>>> cskau, thank you for posting this! I would have gotten stuck without it.
>>>
>>> The awk command you provided seems to work great on jpn.traineddata. I
>>> was just trying the same awk command for chi_sim.traineddata, but
>>> unfortunately did not have similar luck.
>>>
>>> Following your suggestion, I used a hex editor to view the dawgs file
>>> and a dawg file, both from chi_sim.traineddata. I see that the "magic
>>> number" was for some reason slightly different. I noticed instead the magic
>>> hexadecimal number "2A00A313".
>>>
>>> Fast forward a bit -- the following command worked for me:
>>> awk 'BEGIN {RS="\x2A\x00\xA3\x13"; FILENUM=-1} {FILENUM++; if (FILENUM == 0) {next}; FILENAME="chi_sim.fixed-length-dawg-"FILENUM; printf "%s",RS$0 > FILENAME;}' chi_sim.fixed-length-dawgs
>>>
>>> Detailing my steps for others:
>>>
>>> 1. Download chi_sim.traineddata from Tesseract's downloads page
>>> <https://code.google.com/p/tesseract-ocr/downloads/list>, untar, and cd
>>> to the directory containing the traineddata file.
>>> 2. combine_tessdata -u chi_sim.traineddata chi_sim.
>>> 3. Execute the awk command shown above.
>>> 4. % dawg2wordlist chi_sim.unicharset chi_sim.fixed-length-dawg-1 fixed-length-1_wordlist
>>> 5. Repeat step 4 for chi_sim.fixed-length-dawg-2,
>>> chi_sim.fixed-length-dawg-3.
>>>
>>>
>>> Cheers,
>>> Nade
>>>
>>> On Tuesday, January 7, 2014 at 8:39:19 AM UTC-5, cskau wrote:
>>>>
>>>> I was pondering the same thing this evening. So since there seems to be
>>>> precious little information out there, allow me to revive this 3-month-old
>>>> thread with a few of my findings.
>>>>
>>>> I too got a crash when I tried extracting the fixed-length-dawgs, and
>>>> dawg2wordlist
>>>> <http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/dawg2wordlist.1.html>
>>>> doesn't seem to offer any special flags for handling this special composite dawg.
>>>> However, wordlist2dawg
>>>> <http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/wordlist2dawg.1.html>
>>>> *does* have a special mode:
>>>>
>>>>> *wordlist2dawg* -l <short> <long> *WORDLIST* *DAWG* *lang.unicharset*
>>>>
>>>> and says about the option:
>>>>
>>>>> -l <short> <long> Produce a file with several dawgs in it, one each
>>>>> for words of length <short>, <short+1>,… <long>
>>>>
>>>>
>>>> While one could surely just look at the source to figure out the
>>>> details, I figured the "dawgs" file format is simply a bunch of "dawg"s
>>>> cat'ed together.
>>>> To verify this theory I compared a regular dawg and the
>>>> fixed-length-dawgs in a hex editor.
>>>> The regular dawg appears to use the magic number '2A001D0E', which was
>>>> suspiciously found several times in the dawgs.
>>>> An educated guess tells me the dawgs format is simply:
>>>> [4 bytes : number of dawgs] + ([4 bytes : length of words in dawg] + [DAWG ...])*
>>>>
>>>> This makes it very easy to manually extract the individual dawgs, and
>>>> one could even naively split the file on the headers:
>>>> awk 'BEGIN {RS="\x2A\x00\x1D\x0E"; FILENUM=-1} {FILENUM++; if (FILENUM == 0) {next}; FILENAME=".fixed-length-dawg-"FILENUM; printf "%s",RS$0 > FILENAME;}' .fixed-length-dawgs
>>>>
>>>> By using the above snippet I successfully managed to "extract" 6 dawgs
>>>> of various lengths from the pre-built jpn.traineddata.
>>>> You can then run the standard dawg2wordlist and extract the wordlists
>>>> from them.
>>>>
>>>>
>>>> On a separate note, it is still not clear to me what the exact purpose
>>>> of these sub dawgs is.
>>>> The jpn.traineddata appears to contain a .freq-dawg and the
>>>> .fixed-length-dawgs but no .word-dawg.
>>>> Why it is helpful to split the dictionary into many smaller
>>>> dictionaries based on word length, I cannot guess.
>>>>
>>>>
>>>> I hope this will be helpful to someone out there.
>>>>
>>>>
>>>> On Wednesday, 16 October 2013 17:48:09 UTC+9, Xiaohui Zhang wrote:
>>>>>
>>>>> Dears,
>>>>>
>>>>> Are there any tips on how to use the fixed-length-dawgs file? I
>>>>> tried to use dawg2wordlist
>>>>> <http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/dawg2wordlist.1.html>
>>>>> to extract some sample content from the provided chi_sim trained data, but
>>>>> with no success; the command crashes while "Reading squished dawg".
>>>>>
>>>>> Any suggestions about how to use this file?
>>>>>
>>>>> Thanks very much.
>>>>>
>>>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/f6946285-b07d-4c69-acf5-6aa9360e3f9b%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
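
For anyone who would rather not depend on awk handling binary record separators, the naive "split on the magic number" idea from the quoted messages can be sketched in Python. This is only a sketch under the thread's own assumptions: the two magic numbers are the hex-editor observations reported above (2A001D0E for jpn, 2A00A313 for chi_sim), not values checked against the Tesseract source, and the dawgs layout is cskau's educated guess. The names split_on_magic, split_file and dawg2wordlist_cmd are made up for illustration; only the dawg2wordlist argument order is taken from step 4 above.

```python
from typing import List

# Magic numbers as reported in this thread from hex-editor inspection.
# They are posters' observations, not values taken from the Tesseract source.
MAGIC_JPN = bytes.fromhex("2A001D0E")
MAGIC_CHI_SIM = bytes.fromhex("2A00A313")


def split_on_magic(data: bytes, magic: bytes) -> List[bytes]:
    """Split `data` into chunks that each begin with `magic`.

    Whatever precedes the first occurrence of `magic` is dropped, which is
    what skipping record 0 does in the awk one-liner.
    """
    pieces = data.split(magic)
    return [magic + piece for piece in pieces[1:]]


def split_file(src: str, out_prefix: str, magic: bytes) -> List[str]:
    """Write each chunk to `<out_prefix><n>` (n starts at 1); return the paths."""
    with open(src, "rb") as f:
        data = f.read()
    paths = []
    for n, chunk in enumerate(split_on_magic(data, magic), start=1):
        path = f"{out_prefix}{n}"
        with open(path, "wb") as out:
            out.write(chunk)
        paths.append(path)
    return paths


def dawg2wordlist_cmd(unicharset: str, dawg: str, wordlist: str) -> List[str]:
    """Build the argument list for the dawg2wordlist call shown in step 4."""
    return ["dawg2wordlist", unicharset, dawg, wordlist]
```

Calling split_file("chi_sim.fixed-length-dawgs", "chi_sim.fixed-length-dawg-", MAGIC_CHI_SIM) should produce the same file names as the chi_sim awk command, and each resulting path can be passed to subprocess.run(dawg2wordlist_cmd(...), check=True). Like the awk version, this is a blind byte split: if the guessed layout is right, each chunk also carries the next dawg's 4-byte length field at its end, and if the magic number is wrong for your traineddata you simply get no chunks.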

