Re: [tesseract-ocr] Re: what's the content of fixed-length-dawgs

ShreeDevi Kumar Thu, 09 Jul 2015 08:02:36 -0700

Also see the language training data available at

https://github.com/tesseract-ocr/langdata


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Jul 9, 2015 at 8:27 PM, ShreeDevi Kumar <[email protected]>
wrote:

> Have you tried with the new traineddata files at
>
> https://github.com/tesseract-ocr/tessdata
>
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Thu, Jul 9, 2015 at 2:55 PM, <[email protected]> wrote:
>
>> Hi, Nade, thanks for your post.
>>
>> I've tried your method on chi_sim but got 17 empty sub-dawgs. However, my
>> fixed-length.dawg is around 600 KB... BTW, do you have any idea what this
>> file is for? Any suggestions for improving the accuracy of Chinese recognition?
>>
>> -Han
>>
>> On Tuesday, May 19, 2015 at 2:25:38 PM UTC+8, Nade Sritanyaratana wrote:
>>>
>>> cskau, thank you for posting this! I would have gotten stuck without it.
>>>
>>> The awk command you provided seems to work great on jpn.traineddata. I
>>> tried the same awk command on chi_sim.traineddata, but unfortunately did
>>> not have the same luck.
>>>
>>> Following your suggestion, I used a hex editor to view the dawgs file
>>> and a dawg file, both from chi_sim.traineddata. I see that the "magic
>>> number" was for some reason slightly different. I noticed instead the magic
>>> hexadecimal number "2A00A313".
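The hex-editor step above can also be scripted. This is a minimal Python sketch, not part of the original posts: it reads the first four bytes of a standalone dawg and prints them in the escape form the awk record separator expects. The function names are made up for illustration, and it assumes the first four bytes of a regular dawg from the same traineddata are the right separator, which is simply what the thread observed for jpn and chi_sim.

```python
def read_magic(path, size=4):
    """Return the first `size` bytes of a file, e.g. a standalone dawg."""
    with open(path, "rb") as handle:
        return handle.read(size)


def as_awk_separator(magic):
    """Format bytes as an awk-style escape string, one \\xNN per byte."""
    return "".join("\\x%02X" % byte for byte in magic)
```

For example, `print(as_awk_separator(read_magic("chi_sim.freq-dawg")))` would print the string to use as RS; the file name here is only an example of a regular dawg unpacked from the same traineddata.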
>>>
>>> Fast forward a bit -- the following command worked for me:
>>> awk 'BEGIN {RS="\x2A\x00\xA3\x13"; FILENUM=-1} {FILENUM++; if (FILENUM == 0) {next}; FILENAME="chi_sim.fixed-length-dawg-"FILENUM; printf "%s",RS$0 > FILENAME;}' chi_sim.fixed-length-dawgs
>>>
>>> Detailing my steps for others:
>>>
>>>    1. Download chi_sim.traineddata from Tesseract's downloads page
>>>    <https://code.google.com/p/tesseract-ocr/downloads/list>, untar it, and
>>>    cd to the directory containing the traineddata file.
>>>    2. combine_tessdata -u chi_sim.traineddata chi_sim.
>>>    3. Execute the awk command shown above.
>>>    4. % dawg2wordlist chi_sim.unicharset chi_sim.fixed-length-dawg-1
>>>    fixed-length-1_wordlist
>>>    5. Repeat step 4 for chi_sim.fixed-length-dawg-2,
>>>    chi_sim.fixed-length-dawg-3.
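The steps above can be written down as a small Python helper that builds the command lines for steps 2, 4 and 5. This is only a sketch: the function name and the `count` parameter are invented here, and the argument order is copied from the commands in the list above, not checked against any particular Tesseract version. The splitting step (step 3) still has to run between the unpack command and the conversions.

```python
def build_commands(lang, count):
    """Return (unpack_command, convert_commands) as argument lists.

    `count` is the number of sub-dawgs produced by the splitting step.
    """
    # Step 2: unpack the traineddata into files prefixed with "<lang>.".
    unpack = ["combine_tessdata", "-u", lang + ".traineddata", lang + "."]

    # Steps 4 and 5: one dawg2wordlist call per extracted sub-dawg.
    convert = []
    for number in range(1, count + 1):
        convert.append([
            "dawg2wordlist",
            lang + ".unicharset",
            "%s.fixed-length-dawg-%d" % (lang, number),
            "fixed-length-%d_wordlist" % number,
        ])
    return unpack, convert
```

Each list can be passed to `subprocess.run(command, check=True)` once the Tesseract training tools are on the PATH.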
>>>
>>>
>>> Cheers,
>>> Nade
>>>
>>> On Tuesday, January 7, 2014 at 8:39:19 AM UTC-5, cskau wrote:
>>>>
>>>> I was pondering the same thing this evening. So since there seems to be
>>>> precious little information out there, allow me to revive this 3-month-old
>>>> thread with a few of my findings.
>>>>
>>>> I too got a crash when I tried extracting the fixed-length-dawgs, and
>>>> dawg2wordlist
>>>> <http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/dawg2wordlist.1.html>
>>>> doesn't seem to offer any special flags for handling this special
>>>> composite dawg. However, wordlist2dawg
>>>> <http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/wordlist2dawg.1.html>
>>>> *does* have a special mode:
>>>>
>>>>> *wordlist2dawg* -l <short> <long> *WORDLIST* *DAWG* *lang.unicharset*
>>>>
>>>> and says about the option:
>>>>
>>>>>  -l <short> <long> Produce a file with several dawgs in it, one each
>>>>> for words of length <short>, <short+1>,… <long>
>>>>
>>>>
>>>> While one could surely just look at the source to figure out the
>>>> details, I figured the "dawgs" file format is simply a bunch of "dawg"s
>>>> cat'ed together.
>>>> To verify this theory I compared a regular dawg and the
>>>> fixed-length-dawgs in a hex editor.
>>>> The regular dawg appears to use the magic number '2A001D0E', which was
>>>> suspiciously found several times in the dawgs.
>>>> An educated guess tells me the dawgs format is simply:
>>>> [4 bytes : number of dawgs] + ([4 bytes : length of words in dawg] + [DAWG
>>>> ...])*
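One way to test this guess without reading the Tesseract source is to compare the header fields it predicts with the positions where the magic number actually occurs. The Python sketch below is not from the original posts. It assumes little-endian 4-byte integers, which the thread does not state, and it takes the guessed layout at face value: a count at offset 0 and a word length in the 4 bytes before each magic number. The function name is made up for illustration.

```python
import struct


def check_layout_guess(data, magic):
    """Compare the guessed header fields with the magic-number positions.

    Returns (declared_count, offsets, word_lengths).
    """
    # Guess: the first 4 bytes hold the number of dawgs (little-endian assumed).
    declared_count = struct.unpack_from("<i", data, 0)[0]

    # Collect the byte offset of every occurrence of the magic number.
    offsets = []
    position = data.find(magic)
    while position != -1:
        offsets.append(position)
        position = data.find(magic, position + len(magic))

    # Guess: the 4 bytes just before each dawg hold its word length.
    word_lengths = [
        struct.unpack_from("<i", data, offset - 4)[0]
        for offset in offsets
        if offset >= 4
    ]
    return declared_count, offsets, word_lengths
```

If the guess holds, `declared_count` should equal `len(offsets)` and `word_lengths` should be a run of consecutive small integers. A mismatch would suggest the layout or the byte order is different.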
>>>>
>>>> This makes it very easy to manually extract the individual dawgs, and
>>>> one could even naively split the file on the headers:
>>>> awk 'BEGIN {RS="\x2A\x00\x1D\x0E"; FILENUM=-1} {FILENUM++; if (FILENUM == 0) {next}; FILENAME=".fixed-length-dawg-"FILENUM; printf "%s",RS$0 > FILENAME;}' .fixed-length-dawgs
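Not every awk implementation copes with NUL bytes in a record separator, so the same naive split may be easier to reproduce in Python. The sketch below is not from the original posts and its function names are invented. It mirrors the awk one-liner: it drops everything before the first magic number and puts the magic number back on the front of each piece. Like the awk version, it leaves any bytes that belong to the next sub-dawg's header at the end of each piece.

```python
def split_on_magic(data, magic):
    """Split raw bytes on `magic`, keeping the magic at the start of each piece."""
    pieces = data.split(magic)
    # pieces[0] is whatever precedes the first magic number; skip it.
    return [magic + piece for piece in pieces[1:]]


def write_parts(parts, prefix):
    """Write each piece to prefix + "1", prefix + "2", ... and return the names."""
    names = []
    for number, part in enumerate(parts, start=1):
        name = "%s%d" % (prefix, number)
        with open(name, "wb") as out:
            out.write(part)
        names.append(name)
    return names
```

For example, read `jpn.fixed-length-dawgs` in binary mode, call `split_on_magic(data, bytes.fromhex("2A001D0E"))`, and pass the result to `write_parts(parts, "jpn.fixed-length-dawg-")`.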
>>>>
>>>> By using the above snippet I successfully managed to "extract" 6 dawgs
>>>> of various lengths from the pre-built jpn.traineddata.
>>>> You can then run the standard dawg2wordlist and extract the wordlists
>>>> from them.
>>>>
>>>>
>>>> On a separate note it is still not clear to me what the exact purpose
>>>> of these sub dawgs is.
>>>> The jpn.traineddata appears to contain a .freq-dawg and the
>>>> .fixed-length-dawgs but no .word-dawg.
>>>> Why it is helpful to split the dictionary into many smaller
>>>> dictionaries based on word length, I cannot guess.
>>>>
>>>>
>>>> I hope this will be helpful to someone out there.
>>>>
>>>>
>>>> On Wednesday, 16 October 2013 17:48:09 UTC+9, Xiaohui Zhang wrote:
>>>>>
>>>>> Dears,
>>>>>
>>>>> Are there any tips on how to use the fixed-length-dawgs file?  I
>>>>> tried to use dawg2wordlist
>>>>> <http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/dawg2wordlist.1.html>
>>>>> to extract some sample content from the provided chi_sim trained data,
>>>>> but without success: the command crashes while "Reading squished dawg".
>>>>>
>>>>> Any suggestions about how to use this file?
>>>>>
>>>>> Thanks very much.
>>>>>
>>>>  --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/f6946285-b07d-4c69-acf5-6aa9360e3f9b%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/f6946285-b07d-4c69-acf5-6aa9360e3f9b%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
