Hi, I am trying to use Tesseract to recognize doctors' handwriting on Windows. It seems like an impossible task, but I want to see what accuracy Tesseract can achieve. I used an image of a doctor's-handwriting font for training and created the box file, the training (.tr) file, unicharset, and font_properties. However, the shapeclustering command gives the following error:
C:\Program Files (x86)\Tesseract-OCR>shapeclustering -F font_properties -U unicharset eng.a.exp0.box.tr
Reading eng.a.exp0.box.tr ...
Font id = -1/0, class id = 1/63 on sample 0
font_id >= 0 && font_id < font_id_map_.SparseSize():Error:Assert failed:in file ..\..\classify\trainingsampleset.cpp, line 622

Can someone please tell me how to deal with this error? Any help would be appreciated.

Thanks!

On Sunday, September 18, 2011 at 10:04:39 PM UTC+5:30, Sriranga(78yrsold) wrote:
>
> Mervet,
> Yes, you can cut off the letters pragmatically. attached files are answer
> to your different question. If you forward your all datafiles generated by
> you,I shall investigate where mistake happens and feedback to you.
> With Best Wishes,
> -sriranga(78yrs)
>
> On Sun, Sep 18, 2011 at 2:48 PM, merve t <merve...@gmail.com> wrote:
>
>> Hello,
>> I am computer scientist and have programming experience, thus i think i
>> can cut off the letters automatically, i think i will have questions on how
>> can i get image of words from tesseract.
>>
>> Anyway now i have a different question, i copy it here:
>>
>> ----------------------------------------------------------------------
>>
>> Hello,
>>
>> i wrote what i did;
>>
>> bnv is my lang code, files i used are attached.
>>
>> >>tesseract bnv.denemem.exp0.tif bnv.denemem.exp0 batch.nochop makebox
>>
>> i have a box file
>>
>> i edit it because there was mistakes. ok no problem.
>>
>> >>tesseract bnv.denemem.exp0.tif bnv.denemem.exp0 nobatch box.train
>>
>> >>unicharset_extractor bnv.denemem.exp0.box
>>
>> >>mftraining -F font_properties -U unicharset -O bnv.unicharset bnv.denemem.exp0.tr
>>
>> >>cntraining bnv.denemem.exp0.tr
>>
>> change names;
>>
>> inttemp
>> Microfeat
>> normproto
>> pffmtable
>>
>> to;
>>
>> bnv.inttemp
>> bnv.Microfeat
>> bnv.normproto
>> bnv.pffmtable
>>
>>
>> >>combine_tessdata bnv.
>>
>> my traning procedure finishes at this point
>>
>> move bnv.traineddata into /tessdata folder
>>
>>
>> >>tesseract 3example.tif output -l bnv
>>
>> i do nothing about training with file 3example.tif, should i do?
>>
>> I trained tesseract with a little dataset of my hand writing and i get
>> some good results, but when i try to "test" the image attached i get
>>
>> "fcgbcd"
>>
>> as output.
>>
>> the last three chars are correct "bcd".
>>
>> But for "a" it returns "fcg" , three chars.
>>
>> As another process i tried to generate a box file using the box file
>> generating step of training, for the file attached, it recognizes "a" and
>> its box correctly.
>>
>> The main problem is getting 6 letters instead of 4 in "testing".
>>
>> Also the situation about not to be able to get the right char is a
>> problem too.
>>
>> Thanks for your idea and time.
>>
>>
>>
>> 2011/9/18 Sriranga(78yrsold) <withbl...@gmail.com>
>>
>> Merve,
>>> You can ask Alex,Centre Raime reg: program for joined handwriting and
>>> evaluate suitability of YagpoOCR for your purpose. If you find YagpoOCR is
>>> better than tesseract-OCR,
>>> you can use it. but don't ask me for help since zero hand son experience
>>> with YagpoOCR.
>>> With best of Luck,
>>> -sriranga(78yrs)
>>>
>>>
>>> On Sun, Sep 18, 2011 at 11:28 AM, Sriranga(78yrsold) <withbl...@gmail.com> wrote:
>>>
>>>> Merve,
>>>> reg:*I have another question in this mail list, it would be
>>>> appreciated if you share your idea about it, i have sent my cmd transcript
>>>> to the mail list*. - I could not locate in the forum.
>>>>
>>>>
>>>> On Sun, Sep 18, 2011 at 8:41 AM, Sriranga(78yrsold) <withbl...@gmail.com> wrote:
>>>>
>>>>> Merve,
>>>>> thanks for the frank email. *you have not answered about programing
>>>>> knowledge you have*?
>>>>> Yes You are correct. joined handwritten text will not work unless it
>>>>> is cut off(split the joined portion of two chars).
>>>>> You have to train the
>>>>> handwriting(which has generally have different shape/style) - number of
>>>>> times just like fonts of regular, bold etc. please remember that output
>>>>> will not have 100% accuracy similar to regular fonts of any lang because of
>>>>> relevant source code have to be modified by the creator. As such by post
>>>>> processing program the accuracy can be improved further which i feel.
>>>>> Wishing you success in the your project.
>>>>> -sriranga(78yrs)
>>>>>
>>>>>
>>>>> On Sat, Sep 17, 2011 at 9:57 PM, merve t <merve...@gmail.com> wrote:
>>>>>
>>>>>> Sriranga,
>>>>>> Thanks very much for attention, i have a solution in my mind to solve
>>>>>> joined handwritten text. I am going to try to cut off letters and try if
>>>>>> the words are in dictionary or not. The best solution i have ever found is
>>>>>> this. I have another question in this mail list, it would be appreciated if
>>>>>> you share your idea about it, i have sent my cmd transcript to the mail
>>>>>> list.
>>>>>> Thanks very much.
>>>>>>
>>>>>> 2011/9/17 Sriranga(78yrsold) <withbl...@gmail.com>
>>>>>>
>>>>>> Mervert,
>>>>>>> I like to know which program you are specialised/well versed?
>>>>>>> With best wishes,
>>>>>>> -sriranga(78yrs)
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Sep 17, 2011 at 11:47 AM, Sriranga(78yrsold) <withbl...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Mervet,
>>>>>>>>
>>>>>>>> *regarding KannadaOC*R = Since I am not trained properly for
>>>>>>>> generating Kannada datafiles for Yagpo OCR by Center Rime.
>>>>>>>> As such I do not know how to generate datafile or operate the
>>>>>>>> yagpoOCR for OCR purpose. and also I am not in position to offer any
>>>>>>>> comments about *joined handwrite text*(as stated by Center Rime) -
>>>>>>>> which is *new to me* and just now I hearing. Further I am not
>>>>>>>> using YagpoOCR for my project like English,Kannada, etc.
>>>>>>>> In the circumstances, I am not in position to help/guide you about
>>>>>>>> YagpoOCR, in case, if you approach me.
>>>>>>>> Wishing you Good Luck,
>>>>>>>> -sriranga(78yrs)
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Sep 17, 2011 at 2:46 AM, Center Rime <go...@mail.ru> wrote:
>>>>>>>>
>>>>>>>>> Dear friends!
>>>>>>>>> At present we has engine for OCR sanskrit and joined hand write
>>>>>>>>> text.
>>>>>>>>> With help or Shriranga we has base model for Kannada OCR.
>>>>>>>>> We has frame agreement on sanskrit devanagary recognition. On next
>>>>>>>>> year we has in plan
>>>>>>>>> recognition of main Unicode Asian area.
>>>>>>>>> Send you current project status
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> We invite you to cooperation in using the open source tibetan
>>>>>>>>> text computer recognition software.
>>>>>>>>> This program already use TBRC for input of tibetan text.
>>>>>>>>> It is inputed more than 200 volumes already.
>>>>>>>>> In printed text we can OCR with 1-2 errors on page. Also we start
>>>>>>>>> work with woodblock and hand write text.
>>>>>>>>>
>>>>>>>>> At present OCR program can recognize printed text with 300 dpi
>>>>>>>>> grayscale scanned images.
>>>>>>>>> With support of Trace Foundation we start server for tibetan OCR
>>>>>>>>> project www.dharmabook.ru
>>>>>>>>> Material for OCR you can upload on our server or provide access
>>>>>>>>> for scanned material on your server.
>>>>>>>>> All OCR work free of charge, till end of this year it has support
>>>>>>>>> from Trace Foundation.
>>>>>>>>>
>>>>>>>>> We start work with woodblock also. It is need more advanced
>>>>>>>>> program and we work on it. Now we
>>>>>>>>> can OCR clear printed woodblock and handwrite text with OCR level
>>>>>>>>> about 90%.
>>>>>>>>> Also we can OCR dictionary and mixed tibetan-endlish or
>>>>>>>>> tibetan-sanskrit text.
>>>>>>>>> From our side most problem it is proofreading. For that we provide
>>>>>>>>> spellchecker - see example of recognition.
>>>>>>>>> Also we can develop tibetan software for your projects.
>>>>>>>>> སྒ་རབ་འབྱམས་པ་ཀུན་དགའ་ཡེ་ཤེས་ book OCR example
>>>>>>>>>
>>>>>>>>> http://www.dharmabook.ru/work_file/W00EGS1016747-I01JW143/index.php?img_page=I01JW1430066.tif&photo_index=65
>>>>>>>>>
>>>>>>>>> all this book in Zip
>>>>>>>>> http://www.dharmabook.ru/work_file/W00EGS1016747-I01JW143.zip
>>>>>>>>>
>>>>>>>>> We will be happy help you in your activity
>>>>>>>>>
>>>>>>>>> Also some example of Kanjur printed edition OCR
>>>>>>>>>
>>>>>>>>> http://www.dharmabook.ru/work_file/W1PD95844-I1PD95855/index.php?img_page=I1PD958550071.tif&photo_index=70
>>>>>>>>> TBRC now scan a half of this volumes. We has OCR for that text and
>>>>>>>>> can introduce for Trace and TBRC
>>>>>>>>> project of proofreading of this edition.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Sarva Mangalam!
>>>>>>>>> alex
>>>>>>>>> www.code.google.com/p/ocrlib
>>>>>>>>> www.dharmabook.ru