Hello Han,

Sorry about the late response on my end. Did Shree's comments help with your inquiries?

Regarding fixed-length.dawg -- this is just one of the dawg files typically used with wordlist2dawg: https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Dictionary_Data_(Optional)

There is some information at the link above. My understanding is that it helps for languages with fixed-length characters, such as Chinese. I am not sure this is the answer you were looking for, though -- feel free to re-clarify in case others have a better idea of how to answer. I have also pulled the extraction steps from the quoted posts into a single script sketch at the bottom of this message.

- Nade

On Thursday, July 9, 2015 at 5:32:14 AM UTC-4, [email protected] wrote:
>
> Hi, Nade, thanks for your post.
>
> I've tried your method on chi_sim but got 17 empty sub-dawgs. However, my
> fixed-length.dawg is around 600 KB... BTW, do you have any idea what this
> file is for? Any tips for improving the accuracy of Chinese recognition?
>
> -Han
>
> On Tuesday, May 19, 2015 at 2:25:38 PM UTC+8, Nade Sritanyaratana wrote:
>>
>> cskau, thank you for posting this! I would have gotten stuck without it.
>>
>> The awk command you provided seems to work great on jpn.traineddata. I
>> tried the same command on chi_sim.traineddata, but unfortunately did not
>> have similar luck.
>>
>> Following your suggestion, I used a hex editor to view the dawgs file and
>> a dawg file, both from chi_sim.traineddata. The "magic number" turned out
>> to be slightly different: chi_sim uses the hexadecimal value "2A00A313".
>>
>> Fast forward a bit -- the following command worked for me:
>>
>> awk 'BEGIN {RS="\x2A\x00\xA3\x13"; FILENUM=-1} {FILENUM++; if (FILENUM == 0) {next}; FILENAME="chi_sim.fixed-length-dawg-"FILENUM; printf "%s",RS$0 > FILENAME;}' chi_sim.fixed-length-dawgs
>>
>> Detailing my steps for others:
>>
>> 1. Download chi_sim.traineddata from Tesseract's downloads page
>>    <https://code.google.com/p/tesseract-ocr/downloads/list>, untar it, and
>>    cd into the directory containing the traineddata file.
>> 2. combine_tessdata -u chi_sim.traineddata chi_sim.
>> 3. Execute the awk command shown above.
>> 4. dawg2wordlist chi_sim.unicharset chi_sim.fixed-length-dawg-1 fixed-length-1_wordlist
>> 5. Repeat step 4 for chi_sim.fixed-length-dawg-2 and chi_sim.fixed-length-dawg-3.
>>
>> Cheers,
>> Nade
>>
>> On Tuesday, January 7, 2014 at 8:39:19 AM UTC-5, cskau wrote:
>>>
>>> I was pondering the same thing this evening. Since there seems to be
>>> precious little information out there, allow me to revive this 3-month-old
>>> thread with a few of my findings.
>>>
>>> I too got a crash when I tried extracting the fixed-length-dawgs, and
>>> dawg2wordlist
>>> <http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/dawg2wordlist.1.html>
>>> doesn't seem to offer any special flags for handling this composite dawg.
>>> However, wordlist2dawg
>>> <http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/wordlist2dawg.1.html>
>>> *does* have a special mode:
>>>
>>>> wordlist2dawg -l <short> <long> WORDLIST DAWG lang.unicharset
>>>
>>> and says about the option:
>>>
>>>> -l <short> <long>  Produce a file with several dawgs in it, one each
>>>> for words of length <short>, <short+1>, ... <long>
>>>
>>> While one could surely just look at the source to figure out the
>>> details, I figured the "dawgs" file format is simply a bunch of "dawg"s
>>> cat'ed together.
>>> To verify this theory I compared a regular dawg and the
>>> fixed-length-dawgs in a hex editor.
>>> The regular dawg appears to use the magic number '2A001D0E', which was
>>> suspiciously found several times in the dawgs.
>>> An educated guess tells me the dawgs format is simply:
>>>
>>> [4 bytes : number of dawgs] + ([4 bytes : length of words in dawg] + [DAWG ...])*
>>>
>>> This makes it very easy to manually extract the individual dawgs, and
>>> one could even naively split the file on the headers:
>>>
>>> awk 'BEGIN {RS="\x2A\x00\x1D\x0E"; FILENUM=-1} {FILENUM++; if (FILENUM == 0) {next}; FILENAME=".fixed-length-dawg-"FILENUM; printf "%s",RS$0 > FILENAME;}' .fixed-length-dawgs
>>>
>>> Using the above snippet, I successfully managed to "extract" 6 dawgs of
>>> various lengths from the pre-built jpn.traineddata.
>>> You can then run the standard dawg2wordlist on them and extract the wordlists.
>>>
>>> On a separate note, it is still not clear to me what the exact purpose of
>>> these sub-dawgs is.
>>> The jpn.traineddata appears to contain a .freq-dawg and the
>>> .fixed-length-dawgs but no .word-dawg.
>>> Why it is helpful to split the dictionary into many smaller dictionaries
>>> based on word length, I cannot guess.
>>>
>>> I hope this will be helpful to someone out there.
>>>
>>> On Wednesday, 16 October 2013 17:48:09 UTC+9, Xiaohui Zhang wrote:
>>>>
>>>> Dears,
>>>>
>>>> Are there any tips on how to use the fixed-length-dawgs file? I tried to
>>>> use dawg2wordlist
>>>> <http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/dawg2wordlist.1.html>
>>>> to extract some sample content from the provided chi_sim trained data,
>>>> but with no success; the command crashes while "Reading squished dawg".
>>>>
>>>> Any suggestions on how to use this file?
>>>>
>>>> Thanks very much.
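
Pulling the steps from the quoted posts together, here is a rough end-to-end sketch for chi_sim. Treat it as untested: it assumes combine_tessdata and dawg2wordlist are built and on your PATH, and that your fixed-length-dawgs file really uses the 2A 00 A3 13 magic number found above (jpn uses 2A 00 1D 0E instead, so check first).

#!/bin/sh
# Unpack the individual components of the traineddata file; this produces
# chi_sim.unicharset, chi_sim.fixed-length-dawgs, and friends.
combine_tessdata -u chi_sim.traineddata chi_sim.

# Split the composite fixed-length-dawgs file on the chi_sim magic number
# (swap the escapes for \x2A\x00\x1D\x0E when working with jpn).
awk 'BEGIN {RS="\x2A\x00\xA3\x13"; FILENUM=-1} {FILENUM++; if (FILENUM == 0) {next}; FILENAME="chi_sim.fixed-length-dawg-"FILENUM; printf "%s",RS$0 > FILENAME;}' chi_sim.fixed-length-dawgs

# Turn each extracted sub-dawg back into a plain wordlist and report how
# many entries came out, so empty sub-dawgs (as Han saw) are easy to spot.
for d in chi_sim.fixed-length-dawg-*; do
  dawg2wordlist chi_sim.unicharset "$d" "$d.wordlist"
  echo "$d: $(wc -l < "$d.wordlist") words"
done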

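If you don't have a hex editor handy, od can show the magic number too, and it also lets you eyeball cskau's guess about the container layout (a 4-byte dawg count followed by a 4-byte word-length header in front of each dawg). I have not checked that guess against the Tesseract source, so treat the offsets below as assumptions.

# Dump the first 16 bytes as individual hex bytes. Under the guessed layout,
# bytes 0-3 hold the number of sub-dawgs, bytes 4-7 the first dawg's word
# length, and the magic (2a 00 a3 13 for chi_sim, 2a 00 1d 0e for jpn)
# should appear starting at offset 8.
od -A d -t x1 -N 16 chi_sim.fixed-length-dawgs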
