Re: [tesseract-ocr] Detect only Alphanumeric characters

John Nilson Thu, 04 Sep 2014 02:12:08 -0700

"did you unpack the eng.traineddata first to get all the files?"


No. How do I do that?

On Wednesday, September 3, 2014 9:23:08 PM UTC-4, shree wrote:
>
> did you unpack the eng.traineddata first to get all the files?
>
> Shree Devi Kumar
> ____________________________________________________________
> Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com
>
>
> On Wed, Sep 3, 2014 at 9:20 PM, John Nilson <[email protected]> wrote:
>
>>
>> Any help would be greatly appreciated.
>>
>> I would like to do something fairly simple: restrict the characters 
>> Tesseract looks for to alphanumerics only (0-9, a-z, A-Z). I'm using the 
>> latest version, 3.02.02. I want to do this because Tesseract makes 
>> mistakes such as reading M as |'U'| (note the pipes and single quotes). 
>> I'd like to remove punctuation like that to reduce errors.
>>
>> My first attempt was to
>>
>> 1) Edit the default eng.cube.lm and eng.cube.lm_ files in the tessdata 
>> directory.
>> 2) Remove the non-alphanumeric punctuation characters.
>> 3) Run combine_tessdata to generate a new eng.traineddata.
>>
>> Unfortunately this isn't working. Here are the directory listing and the 
>> output I get:
>> C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation>dir
>>  Volume in drive C has no label.
>>  Volume Serial Number is F0DD-A475
>>
>>  Directory of C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation
>>
>> 09/03/2014  11:43 AM    <DIR>          .
>> 09/03/2014  11:43 AM    <DIR>          ..
>> 09/03/2014  10:30 AM    <DIR>          configs
>> 02/03/2012  02:47 AM        21,876,572 eng - Copy.jpg
>> 02/03/2012  03:15 AM           171,918 eng.cube.bigrams
>> 02/03/2012  03:15 AM                38 eng.cube.fold
>> 09/03/2014  10:38 AM               137 eng.cube.lm
>> 09/03/2014  10:38 AM               137 eng.cube.lm_
>> 02/03/2012  03:15 AM           857,304 eng.cube.nn
>> 02/03/2012  03:15 AM               254 eng.cube.params
>> 02/03/2012  03:15 AM        13,020,078 eng.cube.size
>> 02/03/2012  03:15 AM         2,444,187 eng.cube.word-freq
>> 02/03/2012  03:15 AM               996 eng.tesseract_cube.nn
>> 09/03/2014  11:46 AM                 0 eng.traineddata
>> 09/03/2014  11:44 AM                 0 lang.traineddata
>> 02/03/2012  03:15 AM        10,562,727 osd.traineddata
>> 09/03/2014  10:30 AM    <DIR>          tessconfigs
>>               13 File(s)     48,934,348 bytes
>>                4 Dir(s)  666,501,136,384 bytes free
>>
>> C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation>combine_tessdata eng.
>> Combining tessdata files
>> Error opening unicharset file
>> Error combining tessdata files into eng.traineddata
>>
>> C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation>
>>
>>  -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/b2356fd1-be8c-45c7-9f18-afa2a459eef9%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
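
For reference, the unpack step asked about in the quoted reply is normally 
done with the -u option of combine_tessdata, which splits a .traineddata 
file into its component files (eng.unicharset among them, the file the error 
above says it cannot open). The following is only a minimal sketch: the 
helper names unpack_command, has_unicharset and unpack are hypothetical, it 
assumes combine_tessdata is on the PATH, and it assumes an intact copy of the 
original eng.traineddata is available (the one in the listing above is 0 
bytes).

```python
import subprocess
from pathlib import Path


def unpack_command(traineddata, out_prefix):
    """Build the argument list for `combine_tessdata -u`.

    out_prefix is a path prefix ending in "." (for example "eng."), so the
    components are written next to it as eng.unicharset and so on.
    Hypothetical helper, named here for illustration only.
    """
    return ["combine_tessdata", "-u", str(traineddata), str(out_prefix)]


def has_unicharset(directory, lang="eng"):
    """Return True if <lang>.unicharset exists in the given directory."""
    return (Path(directory) / f"{lang}.unicharset").is_file()


def unpack(traineddata, out_prefix):
    """Run the unpack step; raises CalledProcessError if the tool fails."""
    subprocess.run(unpack_command(traineddata, out_prefix), check=True)
```

Calling unpack("eng.traineddata", "eng.") inside the working directory and 
then checking has_unicharset(".") before re-running "combine_tessdata eng." 
would show whether the missing file is what triggers the error.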

