Hello Han,

Sorry about the late response on my end. Did Shree's comments help with your inquiries?

Regarding fixed-length.dawg -- this is just one of the dawg files typically used with wordlist2dawg: https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Dictionary_Data_(Optional)

There is some information at the link above. My understanding is that it helps for languages with fixed-length characters, such as Chinese. I am not sure this is the answer you were looking for, though -- feel free to re-clarify in case others have a better idea of how to answer. I have also pulled the extraction steps from the quoted posts into a single script sketch at the bottom of this message.

- Nade

On Thursday, July 9, 2015 at 5:32:14 AM UTC-4, [email protected] wrote:
>
> Hi, Nade, thanks for your post.
>
> I've tried your method on chi_sim but got 17 empty sub-dawgs. However, my
> fixed-length.dawg is around 600 KB... BTW, do you have any idea what this
> file is for? Any tips for improving the accuracy of Chinese recognition?
>
> -Han
>
> On Tuesday, May 19, 2015 at 2:25:38 PM UTC+8, Nade Sritanyaratana wrote:
>>
>> cskau, thank you for posting this! I would have gotten stuck without it.
>>
>> The awk command you provided seems to work great on jpn.traineddata. I
>> tried the same command on chi_sim.traineddata, but unfortunately did not
>> have similar luck.
>>
>> Following your suggestion, I used a hex editor to view the dawgs file and
>> a dawg file, both from chi_sim.traineddata. The "magic number" turned out
>> to be slightly different: chi_sim uses the hexadecimal value "2A00A313".
>>
>> Fast forward a bit -- the following command worked for me:
>>
>> awk 'BEGIN {RS="\x2A\x00\xA3\x13"; FILENUM=-1} {FILENUM++; if (FILENUM == 0) {next}; FILENAME="chi_sim.fixed-length-dawg-"FILENUM; printf "%s",RS$0 > FILENAME;}' chi_sim.fixed-length-dawgs
>>
>> Detailing my steps for others:
>>
>> 1. Download chi_sim.traineddata from Tesseract's downloads page
>>    <https://code.google.com/p/tesseract-ocr/downloads/list>, untar it, and
>>    cd into the directory containing the traineddata file.
>> 2. combine_tessdata -u chi_sim.traineddata chi_sim.
>> 3. Execute the awk command shown above.
>> 4. dawg2wordlist chi_sim.unicharset chi_sim.fixed-length-dawg-1 fixed-length-1_wordlist
>> 5. Repeat step 4 for chi_sim.fixed-length-dawg-2 and chi_sim.fixed-length-dawg-3.
>>
>> Cheers,
>> Nade
>>
>> On Tuesday, January 7, 2014 at 8:39:19 AM UTC-5, cskau wrote:
>>>
>>> I was pondering the same thing this evening. Since there seems to be
>>> precious little information out there, allow me to revive this 3-month-old
>>> thread with a few of my findings.
>>>
>>> I too got a crash when I tried extracting the fixed-length-dawgs, and
>>> dawg2wordlist
>>> <http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/dawg2wordlist.1.html>
>>> doesn't seem to offer any special flags for handling this composite dawg.
>>> However, wordlist2dawg
>>> <http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/wordlist2dawg.1.html>
>>> *does* have a special mode:
>>>
>>>> wordlist2dawg -l <short> <long> WORDLIST DAWG lang.unicharset
>>>
>>> and says about the option:
>>>
>>>> -l <short> <long>  Produce a file with several dawgs in it, one each
>>>> for words of length <short>, <short+1>, ... <long>
>>>
>>> While one could surely just look at the source to figure out the
>>> details, I figured the "dawgs" file format is simply a bunch of "dawg"s
>>> cat'ed together.
>>> To verify this theory I compared a regular dawg and the
>>> fixed-length-dawgs in a hex editor.
>>> The regular dawg appears to use the magic number '2A001D0E', which was
>>> suspiciously found several times in the dawgs.
>>> An educated guess tells me the dawgs format is simply:
>>>
>>> [4 bytes : number of dawgs] + ([4 bytes : length of words in dawg] + [DAWG ...])*
>>>
>>> This makes it very easy to manually extract the individual dawgs, and
>>> one could even naively split the file on the headers:
>>>
>>> awk 'BEGIN {RS="\x2A\x00\x1D\x0E"; FILENUM=-1} {FILENUM++; if (FILENUM == 0) {next}; FILENAME=".fixed-length-dawg-"FILENUM; printf "%s",RS$0 > FILENAME;}' .fixed-length-dawgs
>>>
>>> Using the above snippet, I successfully managed to "extract" 6 dawgs of
>>> various lengths from the pre-built jpn.traineddata.
>>> You can then run the standard dawg2wordlist on them and extract the wordlists.
>>>
>>> On a separate note, it is still not clear to me what the exact purpose of
>>> these sub-dawgs is.
>>> The jpn.traineddata appears to contain a .freq-dawg and the
>>> .fixed-length-dawgs but no .word-dawg.
>>> Why it is helpful to split the dictionary into many smaller dictionaries
>>> based on word length, I cannot guess.
>>>
>>> I hope this will be helpful to someone out there.
>>>
>>> On Wednesday, 16 October 2013 17:48:09 UTC+9, Xiaohui Zhang wrote:
>>>>
>>>> Dears,
>>>>
>>>> Are there any tips on how to use the fixed-length-dawgs file? I tried to
>>>> use dawg2wordlist
>>>> <http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/dawg2wordlist.1.html>
>>>> to extract some sample content from the provided chi_sim trained data,
>>>> but with no success; the command crashes while "Reading squished dawg".
>>>>
>>>> Any suggestions on how to use this file?
>>>>
>>>> Thanks very much.
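
Pulling the steps from the quoted posts together, here is a rough end-to-end sketch for chi_sim. Treat it as untested: it assumes combine_tessdata and dawg2wordlist are built and on your PATH, and that your fixed-length-dawgs file really uses the 2A 00 A3 13 magic number found above (jpn uses 2A 00 1D 0E instead, so check first).

#!/bin/sh
# Unpack the individual components of the traineddata file; this produces
# chi_sim.unicharset, chi_sim.fixed-length-dawgs, and friends.
combine_tessdata -u chi_sim.traineddata chi_sim.

# Split the composite fixed-length-dawgs file on the chi_sim magic number
# (swap the escapes for \x2A\x00\x1D\x0E when working with jpn).
awk 'BEGIN {RS="\x2A\x00\xA3\x13"; FILENUM=-1} {FILENUM++; if (FILENUM == 0) {next}; FILENAME="chi_sim.fixed-length-dawg-"FILENUM; printf "%s",RS$0 > FILENAME;}' chi_sim.fixed-length-dawgs

# Turn each extracted sub-dawg back into a plain wordlist and report how
# many entries came out, so empty sub-dawgs (as Han saw) are easy to spot.
for d in chi_sim.fixed-length-dawg-*; do
  dawg2wordlist chi_sim.unicharset "$d" "$d.wordlist"
  echo "$d: $(wc -l < "$d.wordlist") words"
done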

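If you don't have a hex editor handy, od can show the magic number too, and it also lets you eyeball cskau's guess about the container layout (a 4-byte dawg count followed by a 4-byte word-length header in front of each dawg). I have not checked that guess against the Tesseract source, so treat the offsets below as assumptions.

# Dump the first 16 bytes as individual hex bytes. Under the guessed layout,
# bytes 0-3 hold the number of sub-dawgs, bytes 4-7 the first dawg's word
# length, and the magic (2a 00 a3 13 for chi_sim, 2a 00 1d 0e for jpn)
# should appear starting at offset 8.
od -A d -t x1 -N 16 chi_sim.fixed-length-dawgs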
