You may also be able to do this without repacking eng.traineddata, by passing a config file as a parameter at runtime. I haven't tried this with tessedit_char_whitelist, though.
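As a rough, untested sketch of that idea: the Python below writes a one-line config file and builds the matching command line. The config name "alnum", the image name, and the output base are made up for illustration. It assumes the tesseract command accepts config file names after the output base and looks them up under tessdata/configs (or as a plain path), which is how the stock "digits" config is used.

```python
import string
from pathlib import Path
from typing import List

# Hypothetical config file name, chosen only for this example.
CONFIG_NAME = "alnum"


def whitelist_line() -> str:
    """Return a single config line: parameter name, one space, then the value.

    Name and value must stay on the same line in the config file.
    """
    chars = string.ascii_lowercase + string.ascii_uppercase + string.digits
    return "tessedit_char_whitelist " + chars + "\n"


def write_config(directory: Path) -> Path:
    """Write the config file into `directory` and return its path."""
    path = directory / CONFIG_NAME
    path.write_text(whitelist_line(), encoding="ascii")
    return path


def tesseract_command(image: str, output_base: str, config: Path) -> List[str]:
    """Build the argument list; config files come after the output base."""
    return ["tesseract", image, output_base, str(config)]
```

For example, write_config(Path("tessdata/configs")) followed by tesseract_command("scan.png", "out", Path("alnum")) would correspond to running "tesseract scan.png out alnum". Whether the whitelist is honoured this way on 3.02.02 is exactly the part I have not verified.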
Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Sep 4, 2014 at 9:14 PM, John Nilson <[email protected]> wrote:

> Thanks. That did the trick. I was able to switch to Alpha Numeric only.
> Here are the steps I took:
>
> 1) copied eng.* files into a new "Unpacked" directory I created. Then ran
> combine_tessdata -u to unpack:
> ...\tessdata\Unpacked>combine_tessdata -u eng.traineddata ./eng2.
> Extracting tessdata components from eng.traineddata
> Wrote ./eng.config
> Wrote ./eng.unicharset
> Wrote ./eng2.unicharambigs
> Wrote ./eng2.inttemp
> Wrote ./eng.pffmtable
> Wrote ./eng.normproto
> Wrote ./eng.punc-dawg
> Wrote ./eng.word-dawg
> Wrote ./eng.number-dawg
> Wrote ./eng.freq-dawg
> Wrote ./eng.cube-unicharset
> Wrote ./eng.cube-word-dawg
> Wrote ./eng.shapetable
> Wrote ./eng.bigram-dawg
>
> 2) Edited eng.config and added the line:
> tessedit_char_whitelist abcdefghijklmnopqrtsuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
>
> 3) created a new eng.traineddata file using the following command:
> ...\tessdata\Unpacked>combine_tessdata eng.
> Combining tessdata files
> TessdataManager combined tesseract data files.
> Offset for type 0 is 140
> Offset for type 1 is 358
> Offset for type 2 is 7643
> Offset for type 3 is 8690
> Offset for type 4 is 980283
> Offset for type 5 is 981099
> Offset for type 6 is 997382
> Offset for type 7 is 1001704
> Offset for type 8 is 2085898
> Offset for type 9 is 2112548
> Offset for type 10 is -1
> Offset for type 11 is 2113958
> Offset for type 12 is 2115469
> Offset for type 13 is 3177575
> Offset for type 14 is 3240921
> Offset for type 15 is -1
> Offset for type 16 is -1
>
> 4) ran Tesseract on the image file I wanted to extract AlphaNumeric only
> characters and IT WORKED!
>
> On Thursday, September 4, 2014 5:30:19 AM UTC-4, shree wrote:
>>
>> http://tesseract-ocr.googlecode.com/svn/trunk/doc/combine_tessdata.1.html
>>
>> Run combine_tessdata -u to unpack the traineddata file and get all of
>> its component files; these include the unicharset.
>>
>> I am not familiar with the cube files that you are changing, so can't
>> comment about that.
>>
>> Shree Devi Kumar
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Thu, Sep 4, 2014 at 7:49 AM, John Nilson <[email protected]> wrote:
>>
>>> "did you unpack the eng.traineddata first to get all the files?"
>>>
>>> No. How do I do that?
>>>
>>> On Wednesday, September 3, 2014 9:23:08 PM UTC-4, shree wrote:
>>>>
>>>> did you unpack the eng.traineddata first to get all the files?
>>>>
>>>> Shree Devi Kumar
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>> On Wed, Sep 3, 2014 at 9:20 PM, John Nilson <[email protected]> wrote:
>>>>
>>>>> Any help would be greatly appreciated.
>>>>>
>>>>> I would like to do something fairly simple and that's reduce the types
>>>>> of characters Tesseract looks for to be just AlphaNumeric, 0-9 a-z A-Z.
>>>>> I'm using the very latest version 3.02.02. I want to do this because
>>>>> Tesseract is doing things like confusing M with |'U'| . Notice the pipe
>>>>> and single quotes. I'd like to remove any punctuation like that to
>>>>> reduce errors.
>>>>>
>>>>> My first attempt was to
>>>>>
>>>>> 1) Edit the default eng.cube.lm and eng.cube.lm_ files in the tessdata
>>>>> directory.
>>>>> 2) Remove the non-AlphaNumeric punctuation characters.
>>>>> 3) Run combine_tessdata to generate a new eng.traineddata
>>>>>
>>>>> Unfortunately this isn't working. Here's the directory listing and the
>>>>> output I get in red below.
>>>>> C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation>dir
>>>>>  Volume in drive C has no label.
>>>>>  Volume Serial Number is F0DD-A475
>>>>>
>>>>>  Directory of C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation
>>>>>
>>>>> 09/03/2014  11:43 AM    <DIR>          .
>>>>> 09/03/2014  11:43 AM    <DIR>          ..
>>>>> 09/03/2014  10:30 AM    <DIR>          configs
>>>>> 02/03/2012  02:47 AM        21,876,572 eng - Copy.jpg
>>>>> 02/03/2012  03:15 AM           171,918 eng.cube.bigrams
>>>>> 02/03/2012  03:15 AM                38 eng.cube.fold
>>>>> 09/03/2014  10:38 AM               137 eng.cube.lm
>>>>> 09/03/2014  10:38 AM               137 eng.cube.lm_
>>>>> 02/03/2012  03:15 AM           857,304 eng.cube.nn
>>>>> 02/03/2012  03:15 AM               254 eng.cube.params
>>>>> 02/03/2012  03:15 AM        13,020,078 eng.cube.size
>>>>> 02/03/2012  03:15 AM         2,444,187 eng.cube.word-freq
>>>>> 02/03/2012  03:15 AM               996 eng.tesseract_cube.nn
>>>>> 09/03/2014  11:46 AM                 0 eng.traineddata
>>>>> 09/03/2014  11:44 AM                 0 lang.traineddata
>>>>> 02/03/2012  03:15 AM        10,562,727 osd.traineddata
>>>>> 09/03/2014  10:30 AM    <DIR>          tessconfigs
>>>>>               13 File(s)     48,934,348 bytes
>>>>>                4 Dir(s)  666,501,136,384 bytes free
>>>>>
>>>>> C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation>combine_tessdata eng.
>>>>> Combining tessdata files
>>>>> Error opening unicharset file
>>>>> Error combining tessdata files into eng.traineddata
>>>>>
>>>>> C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation>
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/b2356fd1-be8c-45c7-9f18-afa2a459eef9%40googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWSdCbWOzfz3wRLm-AtH6kdqX3JnWDQmQT%2BwLx%2Bwi5U6w%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

