cskau, thank you for posting this! I would have been stuck without it.
The awk command you provided seems to work well on jpn.traineddata. I tried
the same command on chi_sim.traineddata, but unfortunately had no such luck.
Following your suggestion, I used a hex editor to compare the dawgs file with
a regular dawg file, both from chi_sim.traineddata. For some reason the "magic
number" there is slightly different: I found the hexadecimal sequence
"2A00A313" instead.
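If a hex editor is not handy, the same check can be sketched in Python. This is only a minimal sketch: the helper names are illustrative, and reading the 4 bytes before each hit as a little-endian word length is an assumption based on cskau's guess at the layout, not something confirmed against the Tesseract source.

```python
import struct
from typing import List, Optional, Tuple

# Byte sequence observed in chi_sim's dawgs file (from the hex editor).
CHI_SIM_MAGIC = bytes.fromhex("2A00A313")


def magic_offsets(data: bytes, magic: bytes) -> List[int]:
    """Return the byte offset of every occurrence of magic in data."""
    offsets = []
    pos = data.find(magic)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(magic, pos + 1)
    return offsets


def preceding_words(data: bytes, magic: bytes) -> List[Tuple[int, Optional[int]]]:
    """Pair each offset with the 4 bytes before it, read as a little-endian
    unsigned int (ASSUMPTION: this is the word-length field cskau guessed).
    The value is None when fewer than 4 bytes precede the hit."""
    result = []
    for off in magic_offsets(data, magic):
        if off >= 4:
            value = struct.unpack("<I", data[off - 4:off])[0]
        else:
            value = None
        result.append((off, value))
    return result
```

Reading chi_sim.fixed-length-dawgs in binary mode and passing its contents to preceding_words(data, CHI_SIM_MAGIC) should list one entry per embedded dawg if the guess about the layout holds.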
Fast forward a bit -- the following command worked for me:
awk 'BEGIN {RS="\x2A\x00\xA3\x13"; FILENUM=-1}
{FILENUM++; if (FILENUM == 0) {next};
FILENAME="chi_sim.fixed-length-dawg-"FILENUM;
printf "%s",RS$0 > FILENAME;}' chi_sim.fixed-length-dawgs
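Since awk implementations differ in how they treat multi-byte, binary record separators, here is a rough Python equivalent of the split. It is a sketch under the same assumption as the awk command (every embedded dawg starts with the magic number, and everything before the first one is a header to discard); the function names are made up for illustration.

```python
from typing import List


def split_on_magic(data: bytes, magic: bytes) -> List[bytes]:
    """Split data into chunks that each start with magic.

    Bytes before the first occurrence (the header) are dropped, like
    record 0 in the awk command. Returns [] if magic never occurs.
    """
    parts = data.split(magic)
    return [magic + part for part in parts[1:]]


def write_chunks(src_path: str, out_prefix: str, magic: bytes) -> List[str]:
    """Write each chunk to out_prefix + N (N = 1, 2, ...) and return the paths."""
    with open(src_path, "rb") as src:
        data = src.read()
    paths = []
    for number, chunk in enumerate(split_on_magic(data, magic), start=1):
        path = f"{out_prefix}{number}"
        with open(path, "wb") as out:
            out.write(chunk)
        paths.append(path)
    return paths
```

For chi_sim this would be called as write_chunks("chi_sim.fixed-length-dawgs", "chi_sim.fixed-length-dawg-", bytes.fromhex("2A00A313")). As with the awk version, each chunk keeps whatever bytes sit between the end of one dawg and the next magic number.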
Detailing my steps for others:
1. Download chi_sim.traineddata from Tesseract's downloads page
<https://code.google.com/p/tesseract-ocr/downloads/list>, extract the archive,
and cd to the directory containing the traineddata file.
2. combine_tessdata -u chi_sim.traineddata chi_sim.
3. Execute the awk command shown above.
4. dawg2wordlist chi_sim.unicharset chi_sim.fixed-length-dawg-1
fixed-length-1_wordlist
5. Repeat step 4 for chi_sim.fixed-length-dawg-2 and
chi_sim.fixed-length-dawg-3.
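Steps 4 and 5 can also be scripted. The sketch below only builds and runs the same dawg2wordlist invocation as step 4 for each extracted file; the helper names are hypothetical, and it assumes dawg2wordlist is on the PATH and that the files follow the naming used above.

```python
import subprocess
from typing import List


def dawg2wordlist_commands(lang: str, count: int) -> List[List[str]]:
    """Build one dawg2wordlist argument list per extracted dawg (1..count),
    using the argument order from step 4: unicharset, dawg, output wordlist."""
    return [
        [
            "dawg2wordlist",
            f"{lang}.unicharset",
            f"{lang}.fixed-length-dawg-{i}",
            f"fixed-length-{i}_wordlist",
        ]
        for i in range(1, count + 1)
    ]


def run_all(lang: str, count: int) -> None:
    """Run each command in turn; raises CalledProcessError if one fails."""
    for argv in dawg2wordlist_commands(lang, count):
        subprocess.run(argv, check=True)
```

Calling run_all("chi_sim", 3) from the directory holding the extracted files would cover the three dawgs mentioned above.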
Cheers,
Nade
On Tuesday, January 7, 2014 at 8:39:19 AM UTC-5, cskau wrote:
>
> I was pondering the same thing this evening. So since there seems to be
> precious little information out there, allow me to revive this 3 month old
> thread with a few of my findings.
>
> I too got a crash when I tried extracting the fixed-length-dawgs, and
> dawg2wordlist
> <http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/dawg2wordlist.1.html>
> doesn't seem to offer any special flags for handling this special composite
> dawg. However, wordlist2dawg
> <http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/wordlist2dawg.1.html>
> *does* have a special mode:
>
>> *wordlist2dawg* -l <short> <long> *WORDLIST* *DAWG* *lang.unicharset*
>
> and says about the option:
>
>> -l <short> <long> Produce a file with several dawgs in it, one each for
>> words of length <short>, <short+1>,… <long>
>
>
> While one could surely just look at the source to figure out the details,
> I figured the "dawgs" file format is simply a bunch of "dawg"s cat'ed
> together.
> To verify this theory I compared a regular dawg and the fixed-length-dawgs
> in a hex editor.
> The regular dawg appears to use the magic number '2A001D0E', which,
> suspiciously, also shows up several times in the dawgs file.
> An educated guess tells me the dawgs format is simply:
> [4 bytes : number of dawgs] + ([4 bytes : length of words in dawg] + [DAWG
> ...])*
>
> This makes it very easy to manually extract the individual dawgs, and one
> could even naively split the file on the headers:
> awk 'BEGIN {RS="\x2A\x00\x1D\x0E"; FILENUM=-1}
> {FILENUM++; if (FILENUM == 0) {next};
> FILENAME=".fixed-length-dawg-"FILENUM;
> printf "%s",RS$0 > FILENAME;}' .fixed-length-dawgs
>
> By using the above snippet I successfully managed to "extract" 6 dawgs of
> various length from the pre-built jpn.traineddata.
> You can then run the standard dawg2wordlist and extract the wordlists from
> them.
>
>
> On a separate note it is still not clear to me what the exact purpose of
> these sub dawgs is.
> The jpn.traineddata appears to contain a .freq-dawg and the
> .fixed-length-dawgs but no .word-dawg.
> Why it is helpful to split the dictionary into many smaller dictionaries
> based on word length, I cannot guess.
>
>
> I hope this will be helpful to someone out there.
>
>
> On Wednesday, 16 October 2013 17:48:09 UTC+9, Xiaohui Zhang wrote:
>>
>> Dears,
>>
>> Are there any tips on how to use the fixed-length-dawgs file? I tried
>> to use dawg2wordlist
>> <http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/dawg2wordlist.1.html>
>> to extract some sample content from the provided chi_sim trained data, but
>> without success: the command crashes while "Reading squished dawg".
>>
>> Any suggestion about how to use this file?
>>
>> Thanks very much.
>>
>