http://tesseract-ocr.googlecode.com/svn/trunk/doc/combine_tessdata.1.html
Run combine_tessdata -u to unpack the traineddata file and extract all of its component files; the unicharset will be among them. I am not familiar with the cube files that you are changing, so I can't comment on those.

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Sep 4, 2014 at 7:49 AM, John Nilson wrote:

> "did you unpack the eng.traineddata first to get all the files?"
>
> No. How do I do that?
>
> On Wednesday, September 3, 2014 9:23:08 PM UTC-4, shree wrote:
>>
>> did you unpack the eng.traineddata first to get all the files?
>>
>> Shree Devi Kumar
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>>
>> On Wed, Sep 3, 2014 at 9:20 PM, John Nilson wrote:
>>
>>>
>>> Any help would be greatly appreciated.
>>>
>>> I would like to do something fairly simple: restrict the characters
>>> Tesseract looks for to alphanumerics only (0-9, a-z, A-Z). I'm using
>>> the latest version, 3.02.02. I want to do this because Tesseract is
>>> doing things like confusing M with |'U'| (note the pipes and single
>>> quotes). I'd like to remove punctuation like that to reduce errors.
>>>
>>> My first attempt was to
>>>
>>> 1) Edit the default eng.cube.lm and eng.cube.lm_ files in the tessdata
>>> directory.
>>> 2) Remove the non-alphanumeric punctuation characters.
>>> 3) Run combine_tessdata to generate a new eng.traineddata.
>>>
>>> Unfortunately this isn't working. Here are the directory listing and
>>> the output I get:
>>>
>>> C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation>dir
>>>  Volume in drive C has no label.
>>>  Volume Serial Number is F0DD-A475
>>>
>>>  Directory of C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation
>>>
>>> 09/03/2014  11:43 AM    <DIR>          .
>>> 09/03/2014  11:43 AM    <DIR>          ..
>>> 09/03/2014  10:30 AM    <DIR>          configs
>>> 02/03/2012  02:47 AM        21,876,572 eng - Copy.jpg
>>> 02/03/2012  03:15 AM           171,918 eng.cube.bigrams
>>> 02/03/2012  03:15 AM                38 eng.cube.fold
>>> 09/03/2014  10:38 AM               137 eng.cube.lm
>>> 09/03/2014  10:38 AM               137 eng.cube.lm_
>>> 02/03/2012  03:15 AM           857,304 eng.cube.nn
>>> 02/03/2012  03:15 AM               254 eng.cube.params
>>> 02/03/2012  03:15 AM        13,020,078 eng.cube.size
>>> 02/03/2012  03:15 AM         2,444,187 eng.cube.word-freq
>>> 02/03/2012  03:15 AM               996 eng.tesseract_cube.nn
>>> 09/03/2014  11:46 AM                 0 eng.traineddata
>>> 09/03/2014  11:44 AM                 0 lang.traineddata
>>> 02/03/2012  03:15 AM        10,562,727 osd.traineddata
>>> 09/03/2014  10:30 AM    <DIR>          tessconfigs
>>>               13 File(s)     48,934,348 bytes
>>>                4 Dir(s)  666,501,136,384 bytes free
>>>
>>> C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation>combine_tessdata eng.
>>> Combining tessdata files
>>> Error opening unicharset file
>>> Error combining tessdata files into eng.traineddata
>>>
>>> C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation>
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/b2356fd1-be8c-45c7-9f18-afa2a459eef9%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
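For anyone following the advice at the top of this thread, here is a minimal sketch of the unpack, check, and recombine round trip. It assumes a POSIX shell with combine_tessdata on the PATH; the function names has_unicharset and roundtrip and the work/ directory are made up for illustration and are not part of Tesseract. The two combine_tessdata invocations (-u with a traineddata file and an output prefix, and a bare prefix to combine) follow the usage described in the man page linked above, and should be typed the same way at a Windows command prompt.

```shell
#!/bin/sh
# Sketch only: helper names and the "work" directory are assumptions.

# Succeeds if the unicharset component exists for a path prefix.
# The prefix ends with a dot, e.g. "work/eng.".
has_unicharset() {
    [ -f "${1}unicharset" ]
}

# Unpack TRAINEDDATA into files named PREFIX*, then recombine them.
roundtrip() {
    traineddata=$1
    prefix=$2
    mkdir -p "$(dirname "$prefix")" || return 1
    # -u unpacks every component into files that start with the prefix.
    combine_tessdata -u "$traineddata" "$prefix" || return 1
    # Combining without a unicharset is what produced
    # "Error opening unicharset file" in the original post.
    if ! has_unicharset "$prefix"; then
        echo "missing ${prefix}unicharset; combining would fail" >&2
        return 1
    fi
    # ... edit the unpacked component files here ...
    # Given only a prefix, combine_tessdata writes ${prefix}traineddata.
    combine_tessdata "$prefix"
}

# Example call (not run automatically):
# roundtrip eng.traineddata work/eng.
```

Note that the directory listing in the original post shows a 0-byte eng.traineddata, which may mean the failed combine run overwrote it. If so, the unpack step would need an intact copy of the original file as its input.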

