Re: [tesseract-ocr] Detect only AlphaNumberic characters

Shree Devi Kumar Thu, 04 Sep 2014 19:02:56 -0700

You may also be able to do this by passing a config file as a parameter at
runtime. I haven't tried that with 'whitelist', though.
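
A minimal sketch of that runtime-config route, untested with a whitelist as
noted above. It assumes the Tesseract 3.02 command-line form
"tesseract imagename outputbase [configfile...]" and that a config file given
by path (or placed under tessdata/configs) is picked up. The names alnum_only,
scan.png, and out are hypothetical.

```python
import string
from pathlib import Path

# Characters to keep: a-z, A-Z, 0-9.
WHITELIST = string.ascii_lowercase + string.ascii_uppercase + string.digits


def write_whitelist_config(path):
    """Write a one-line Tesseract config file that sets the whitelist."""
    Path(path).write_text("tessedit_char_whitelist " + WHITELIST + "\n")
    return path


def tesseract_command(image, output_base, config_path):
    """Build (but do not run) the command line; config files go last."""
    return ["tesseract", str(image), str(output_base), str(config_path)]
```

The list returned by tesseract_command("scan.png", "out",
write_whitelist_config("alnum_only")) could then be handed to subprocess.call,
which leaves eng.traineddata itself untouched.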


Shree Devi Kumar
____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com


On Thu, Sep 4, 2014 at 9:14 PM, John Nilson <[email protected]> wrote:

> Thanks. That did the trick. I was able to switch to Alpha Numeric only.
> Here are the steps I took:
>
> 1) Copied the eng.* files into a new "Unpacked" directory I created, then
> ran combine_tessdata -u to unpack:
> ...\tessdata\Unpacked>combine_tessdata -u eng.traineddata ./eng.
> Extracting tessdata components from eng.traineddata
> Wrote ./eng.config
> Wrote ./eng.unicharset
> Wrote ./eng.unicharambigs
> Wrote ./eng.inttemp
> Wrote ./eng.pffmtable
> Wrote ./eng.normproto
> Wrote ./eng.punc-dawg
> Wrote ./eng.word-dawg
> Wrote ./eng.number-dawg
> Wrote ./eng.freq-dawg
> Wrote ./eng.cube-unicharset
> Wrote ./eng.cube-word-dawg
> Wrote ./eng.shapetable
> Wrote ./eng.bigram-dawg
>
> 2) Edited eng.config and added the following as a single line:
> tessedit_char_whitelist abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
>
> 3) Created a new eng.traineddata file using the following command:
> ...\tessdata\Unpacked>combine_tessdata eng.
> Combining tessdata files
> TessdataManager combined tesseract data files.
> Offset for type 0 is 140
> Offset for type 1 is 358
> Offset for type 2 is 7643
> Offset for type 3 is 8690
> Offset for type 4 is 980283
> Offset for type 5 is 981099
> Offset for type 6 is 997382
> Offset for type 7 is 1001704
> Offset for type 8 is 2085898
> Offset for type 9 is 2112548
> Offset for type 10 is -1
> Offset for type 11 is 2113958
> Offset for type 12 is 2115469
> Offset for type 13 is 3177575
> Offset for type 14 is 3240921
> Offset for type 15 is -1
> Offset for type 16 is -1
>
> 4) Ran Tesseract on the image file from which I wanted to extract only
> alphanumeric characters, and IT WORKED!
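
Steps 1 to 3 above can be sketched as a small script. This is only a sketch
under the assumption that combine_tessdata is on the PATH and is invoked as
shown in the quoted commands; the helper names are hypothetical, and the
functions build the command lines without running them.

```python
import string
from pathlib import Path

ALNUM = string.ascii_letters + string.digits  # a-z, A-Z, 0-9


def unpack_command(traineddata, prefix):
    # Step 1: combine_tessdata -u <traineddata> <output prefix>
    return ["combine_tessdata", "-u", traineddata, prefix]


def set_whitelist(config_path, chars=ALNUM):
    # Step 2: drop any existing whitelist line, then append the new one.
    path = Path(config_path)
    lines = path.read_text().splitlines() if path.exists() else []
    kept = [ln for ln in lines if not ln.startswith("tessedit_char_whitelist")]
    kept.append("tessedit_char_whitelist " + chars)
    path.write_text("\n".join(kept) + "\n")


def combine_command(prefix):
    # Step 3: combine_tessdata <prefix> writes <prefix>traineddata
    return ["combine_tessdata", prefix]
```

Each command list can be passed to subprocess.check_call from the directory
that holds the unpacked files, with set_whitelist("eng.config") called in
between.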
>
>
> On Thursday, September 4, 2014 5:30:19 AM UTC-4, shree wrote:
>>
>> http://tesseract-ocr.googlecode.com/svn/trunk/doc/combine_tessdata.1.html
>>
>> Use combine_tessdata -u to unpack the traineddata file and get all of its
>> component files, including the unicharset.
>>
>> I am not familiar with the cube files that you are changing, so can't
>> comment about that.
>>
>>
>> Shree Devi Kumar
>>
>>
>> On Thu, Sep 4, 2014 at 7:49 AM, John Nilson <[email protected]> wrote:
>>
>>> "did you unpack the eng.traineddata first to get all the files?"
>>>
>>> No. How do I do that?
>>>
>>> On Wednesday, September 3, 2014 9:23:08 PM UTC-4, shree wrote:
>>>>
>>>> did you unpack the eng.traineddata first to get all the files?
>>>>
>>>> Shree Devi Kumar
>>>>
>>>>
>>>> On Wed, Sep 3, 2014 at 9:20 PM, John Nilson <[email protected]> wrote:
>>>>
>>>>>
>>>>> Any help would be greatly appreciated.
>>>>>
>>>>> I would like to do something fairly simple: reduce the set of characters
>>>>> Tesseract looks for to alphanumerics only (0-9, a-z, A-Z). I'm using the
>>>>> very latest version, 3.02.02. I want to do this because Tesseract is
>>>>> doing things like confusing M with |'U'| (notice the pipe characters and
>>>>> single quotes). I'd like to remove any punctuation like that to reduce
>>>>> errors.
>>>>>
>>>>> My first attempt was to
>>>>>
>>>>> 1) Edit the default eng.cube.lm and eng.cube.lm_ files in the tessdata
>>>>> directory.
>>>>> 2) Remove the non-AlphaNumeric punctuation characters.
>>>>> 3) Run combine_tessdata to generate a new eng.traineddata
>>>>>
>>>>> Unfortunately this isn't working. Here's the directory listing and the
>>>>> output I get:
>>>>> C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation>dir
>>>>>  Volume in drive C has no label.
>>>>>  Volume Serial Number is F0DD-A475
>>>>>
>>>>>  Directory of C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation
>>>>>
>>>>> 09/03/2014  11:43 AM    <DIR>          .
>>>>> 09/03/2014  11:43 AM    <DIR>          ..
>>>>> 09/03/2014  10:30 AM    <DIR>          configs
>>>>> 02/03/2012  02:47 AM        21,876,572 eng - Copy.jpg
>>>>> 02/03/2012  03:15 AM           171,918 eng.cube.bigrams
>>>>> 02/03/2012  03:15 AM                38 eng.cube.fold
>>>>> 09/03/2014  10:38 AM               137 eng.cube.lm
>>>>> 09/03/2014  10:38 AM               137 eng.cube.lm_
>>>>> 02/03/2012  03:15 AM           857,304 eng.cube.nn
>>>>> 02/03/2012  03:15 AM               254 eng.cube.params
>>>>> 02/03/2012  03:15 AM        13,020,078 eng.cube.size
>>>>> 02/03/2012  03:15 AM         2,444,187 eng.cube.word-freq
>>>>> 02/03/2012  03:15 AM               996 eng.tesseract_cube.nn
>>>>> 09/03/2014  11:46 AM                 0 eng.traineddata
>>>>> 09/03/2014  11:44 AM                 0 lang.traineddata
>>>>> 02/03/2012  03:15 AM        10,562,727 osd.traineddata
>>>>> 09/03/2014  10:30 AM    <DIR>          tessconfigs
>>>>>               13 File(s)     48,934,348 bytes
>>>>>                4 Dir(s)  666,501,136,384 bytes free
>>>>>
>>>>> C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation>combine_tessdata eng.
>>>>> Combining tessdata files
>>>>> Error opening unicharset file
>>>>> Error combining tessdata files into eng.traineddata
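
The directory listing above shows eng.cube.* files but no eng.unicharset,
which is consistent with the "Error opening unicharset file" message. A small
pre-flight check, assuming (as that message suggests) that combine_tessdata
needs <lang>.unicharset in the working directory; the function name is
hypothetical.

```python
from pathlib import Path


def missing_unicharset(directory, lang="eng"):
    """Return True if <lang>.unicharset is absent from the directory."""
    return not (Path(directory) / (lang + ".unicharset")).is_file()
```

If this returns True, the components have most likely not been unpacked from
the original traineddata file yet.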
>>>>>
>>>>> C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>  --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit https://groups.google.com/d/
>>>>> msgid/tesseract-ocr/b2356fd1-be8c-45c7-9f18-afa2a459eef9%40goo
>>>>> glegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/b2356fd1-be8c-45c7-9f18-afa2a459eef9%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>

