"did you unpack the eng.traineddata first to get all the files?"
No. How do I do that?

On Wednesday, September 3, 2014 9:23:08 PM UTC-4, shree wrote:
>
> did you unpack the eng.traineddata first to get all the files?
>
> Shree Devi Kumar
>
> On Wed, Sep 3, 2014 at 9:20 PM, John Nilson <[email protected]> wrote:
>
>> Any help would be greatly appreciated.
>>
>> I would like to do something fairly simple: reduce the set of characters
>> Tesseract looks for to just alphanumerics (0-9, a-z, A-Z). I'm using the
>> very latest version, 3.02.02. I want to do this because Tesseract is
>> doing things like confusing M with |'U'| . Notice the pipe and single
>> quotes. I'd like to remove any punctuation like that to reduce errors.
>>
>> My first attempt was to
>>
>> 1) Edit the default eng.cube.lm and eng.cube.lm_ files in the tessdata
>>    directory.
>> 2) Remove the non-alphanumeric punctuation characters.
>> 3) Run combine_tessdata to generate a new eng.traineddata
>>
>> Unfortunately this isn't working. Here's the directory listing and the
>> output I get below.
>>
>> C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation>dir
>>  Volume in drive C has no label.
>>  Volume Serial Number is F0DD-A475
>>
>>  Directory of C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation
>>
>> 09/03/2014  11:43 AM    <DIR>          .
>> 09/03/2014  11:43 AM    <DIR>          ..
>> 09/03/2014  10:30 AM    <DIR>          configs
>> 02/03/2012  02:47 AM        21,876,572 eng - Copy.jpg
>> 02/03/2012  03:15 AM           171,918 eng.cube.bigrams
>> 02/03/2012  03:15 AM                38 eng.cube.fold
>> 09/03/2014  10:38 AM               137 eng.cube.lm
>> 09/03/2014  10:38 AM               137 eng.cube.lm_
>> 02/03/2012  03:15 AM           857,304 eng.cube.nn
>> 02/03/2012  03:15 AM               254 eng.cube.params
>> 02/03/2012  03:15 AM        13,020,078 eng.cube.size
>> 02/03/2012  03:15 AM         2,444,187 eng.cube.word-freq
>> 02/03/2012  03:15 AM               996 eng.tesseract_cube.nn
>> 09/03/2014  11:46 AM                 0 eng.traineddata
>> 09/03/2014  11:44 AM                 0 lang.traineddata
>> 02/03/2012  03:15 AM        10,562,727 osd.traineddata
>> 09/03/2014  10:30 AM    <DIR>          tessconfigs
>>               13 File(s)     48,934,348 bytes
>>                4 Dir(s)  666,501,136,384 bytes free
>>
>> C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation>combine_tessdata eng.
>> Combining tessdata files
>> Error opening unicharset file
>> Error combining tessdata files into eng.traineddata
>>
>> C:\Program Files (x86)\Tesseract-OCR\tessdata_NoPunctuation>

-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected]. Visit this group at http://groups.google.com/group/tesseract-ocr. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/7bdffe5d-cdfd-43e4-805c-340231b0a112%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
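
For anyone landing on this thread with the same question: combine_tessdata has an unpack mode, `combine_tessdata -u eng.traineddata eng.`, which writes the component files (eng.unicharset and the rest) next to the .traineddata file. The sketch below wraps that step in Python. It is only an illustration under assumptions: combine_tessdata is taken to be on PATH, the helper names (unpack_command, combine_command, has_unicharset, unpack) are made up for this example, and an intact eng.traineddata is needed, since the listing above shows a 0-byte one.

```python
import shutil
import subprocess
from pathlib import Path


def unpack_command(lang: str) -> list:
    """Arguments for splitting <lang>.traineddata into <lang>.* component files."""
    return ["combine_tessdata", "-u", f"{lang}.traineddata", f"{lang}."]


def combine_command(lang: str) -> list:
    """Arguments for rebuilding <lang>.traineddata from the <lang>.* components."""
    return ["combine_tessdata", f"{lang}."]


def has_unicharset(workdir: Path, lang: str) -> bool:
    """True if <lang>.unicharset exists in workdir.

    The 'Error opening unicharset file' message in the thread is what
    combine_tessdata prints when this component is missing.
    """
    return (workdir / f"{lang}.unicharset").is_file()


def unpack(workdir: Path, lang: str = "eng") -> None:
    """Unpack <lang>.traineddata inside workdir (hypothetical helper)."""
    traineddata = workdir / f"{lang}.traineddata"
    # A 0-byte file (as in the listing) cannot be unpacked; restore the original first.
    if not traineddata.is_file() or traineddata.stat().st_size == 0:
        raise FileNotFoundError(f"need an intact {traineddata}")
    # Assumption: the tool is installed and reachable on PATH.
    if shutil.which("combine_tessdata") is None:
        raise RuntimeError("combine_tessdata not found on PATH")
    subprocess.run(unpack_command(lang), cwd=workdir, check=True)
```

After unpacking and editing, running the arguments from combine_command("eng") in the same directory should find eng.unicharset and get past the error shown above, provided the other required components are present too.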

