I would also suggest that you add spaces between words in your input text.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jun 19, 2017 at 9:19 PM, ShreeDevi Kumar <[email protected]> wrote:

> You could also try running training on your Windows PC with 3.05.01 using
> tesstrain.sh, using `git for windows`, which will provide you a shell for
> running bash scripts.
>
> ShreeDevi
>
> On Mon, Jun 19, 2017 at 9:05 PM, ShreeDevi Kumar <[email protected]> wrote:
>
>> Where do you have your source files for English langdata?
>>
>> If they are in a directory such as ../langdata/eng/
>> then put the common.unicharset, latin.unicharset, font_properties etc. in
>> ../langdata
>>
>> ShreeDevi
>>
>> On Mon, Jun 19, 2017 at 8:34 PM, David Barishev <[email protected]> wrote:
>>
>>> Thanks for the reply.
>>> If you mean whether I have the latin and common unicharset in the tessdata
>>> directory (/usr/share/tesseract-ocr/tessdata): I have downloaded them
>>> and placed them in that directory, and I am still getting the same behavior.
>>> I have also tried doing it from my Windows machine, which has version 3.05,
>>> and got the same behavior.
>>>
>>> On Monday, June 19, 2017 at 2:58:40 PM UTC+3, shree wrote:
>>>>
>>>> Do you have the common and latin unicharset in your langdata directory?
>>>>
>>>> See https://github.com/tesseract-ocr/langdata
>>>>
>>>> Try to build the latest 3.05.01 version.
>>>>
>>>> ShreeDevi
>>>>
>>>> On Mon, Jun 19, 2017 at 3:23 PM, David Barishev <[email protected]> wrote:
>>>>
>>>>> Hello all!
>>>>> I'm trying to train Tesseract to recognize a new font in English
>>>>> (supercell-magic).
>>>>> I have created a .tif file and a matching .box file using jTessBoxEditor
>>>>> (eng.supercell-magic.exp0.tif and eng.supercell-magic.exp0.box),
>>>>> and trained Tesseract with them.
>>>>>
>>>>> Here is Tesseract's output:
>>>>> $ tesseract eng.supercell-magic.exp0.tif eng.supercell-magic.exp0 box.train
>>>>> Tesseract Open Source OCR Engine v3.04.01 with Leptonica
>>>>> Page 1
>>>>> row xheight=30, but median xheight = 37.5455
>>>>> APPLY_BOXES:
>>>>> Boxes read from boxfile: 1559
>>>>> Found 1559 good blobs.
>>>>> Generated training data for 34 words
>>>>> Page 2
>>>>> APPLY_BOXES:
>>>>> Boxes read from boxfile: 1677
>>>>> Found 1677 good blobs.
>>>>> Generated training data for 34 words
>>>>> Page 3
>>>>> APPLY_BOXES:
>>>>> Boxes read from boxfile: 1362
>>>>> Found 1362 good blobs.
>>>>> Generated training data for 28 words
>>>>>
>>>>> The next step is to extract the characters using unicharset_extractor.
>>>>> Its output looked normal:
>>>>> $ unicharset_extractor eng.supercell-magic.exp0.box
>>>>> Extracting unicharset from eng.supercell-magic.exp0.box
>>>>> Wrote unicharset file ./unicharset.
>>>>>
>>>>> But when I view the file, it's mostly 0 and 255, which is not like the
>>>>> example in the wiki
>>>>> <https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract#an-example-of-the-unicharset-file>:
>>>>>
>>>>> An example of the unicharset file
>>>>>
>>>>> 110
>>>>> NULL 0 NULL 0
>>>>> N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
>>>>> Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
>>>>> 1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
>>>>> 9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
>>>>> a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
>>>>> ...
>>>>>
>>>>> Mine looks more like this:
>>>>>
>>>>> 74
>>>>> NULL 0 NULL 0
>>>>> Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Joined [4a 6f 69 6e 65 64 ]
>>>>> |Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Broken
>>>>> t 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # t [74 ]
>>>>> h 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # h [68 ]
>>>>> a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # a [61 ]
>>>>> n 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # n [6e ]
>>>>> P 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # P [50 ]
>>>>> o 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # o [6f ]
>>>>> e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # e [65 ]
>>>>> : 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # : [3a ]
>>>>> r 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # r [72 ]
>>>>> l 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # l [6c ]
>>>>> i 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # i [69 ]
>>>>> 1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # 1 [31 ]
>>>>> N 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # N [4e ]
>>>>>
>>>>> Why is that? Thanks in advance.
>>>>>
>>>>> I'm using Ubuntu 16.04 with this Tesseract version:
>>>>>
>>>>> tesseract 3.04.01
>>>>> leptonica-1.73
>>>>> libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 :
>>>>> libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
>>>>>
>>>>> I have attached the box and tiff files, the data file, and the
>>>>> unicharset file.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/cd052525-9eb7-4527-b75b-82e1a687997d%40googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.
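
The difference between the two unicharset listings quoted in this thread can be checked mechanically. The following is only a minimal sketch: it assumes the column layout visible in those listings (symbol, properties, a comma-separated list of ten numbers, script name, further fields, optional `#` comment), and the function name `find_default_entries` is hypothetical, not part of Tesseract. It flags entries whose numbers are still the `0,255,0,255,0,0,0,0,0,0` placeholder with a `NULL` script; it does not explain why they are unset.

```python
from typing import List

# Placeholder values seen in the extracted unicharset quoted in this thread.
PLACEHOLDER_METRICS = "0,255,0,255,0,0,0,0,0,0"


def find_default_entries(unicharset_text: str) -> List[str]:
    """Return the symbols whose metrics and script are still placeholders.

    Assumes the layout shown in the thread:
    <symbol> <properties> <metrics> <script> ... [# comment]
    """
    lines = [ln for ln in unicharset_text.splitlines() if ln.strip()]
    flagged = []
    for line in lines[1:]:  # the first non-empty line is the entry count
        body = line.split(" #", 1)[0]  # drop the trailing "# ..." comment
        fields = body.split()
        # Lines such as "NULL 0 NULL 0" carry no metrics column; skip them.
        if len(fields) < 4 or "," not in fields[2]:
            continue
        symbol, metrics, script = fields[0], fields[2], fields[3]
        if metrics == PLACEHOLDER_METRICS and script == "NULL":
            flagged.append(symbol)
    return flagged


if __name__ == "__main__":
    # Lines copied from the two listings in the thread.
    sample = "\n".join([
        "4",
        "NULL 0 NULL 0",
        "N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N",
        "t 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # t [74 ]",
        "h 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # h [68 ]",
    ])
    print(find_default_entries(sample))
```

Running it on the full `./unicharset` file (read with `open(path, encoding="utf-8").read()`) would list every symbol that still has placeholder values.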

