Hi

I am trying to use Tesseract to recognize doctors' handwriting on 
Windows. It seems like an impossible task, but I want to see what kind 
of accuracy Tesseract can achieve. I used an image of a doctor's font 
for training and created the box file, the .tr training file, the unicharset, 
and font_properties. However, the shapeclustering command gives the following 
error:

C:\Program Files (x86)\Tesseract-OCR>shapeclustering -F font_properties -U unicharset eng.a.exp0.box.tr
Reading eng.a.exp0.box.tr ...
Font id = -1/0, class id = 1/63 on sample 0
font_id >= 0 && font_id < font_id_map_.SparseSize():Error:Assert failed:in file ..\..\classify\trainingsampleset.cpp, line 622

Can someone please tell me how to deal with this error? Any help would be 
appreciated.

Thanks!
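
One possible cause, offered as an assumption rather than a confirmed diagnosis: "Font id = -1" suggests that the font name in the training data (apparently "a", from eng.a.exp0) was not found in font_properties, and the "/0" may mean that no fonts were loaded from that file at all. In Tesseract 3.x training, each line of font_properties has the form "<fontname> <italic> <bold> <fixed> <serif> <fraktur>", with each flag being 0 or 1. The Python sketch below only checks that convention; the helper names are hypothetical and are not part of Tesseract.

```python
import os


def font_name_from_tr(tr_filename):
    """Return the font part of a name shaped like lang.fontname.expN.box.tr."""
    parts = os.path.basename(tr_filename).split(".")
    if len(parts) < 3:
        raise ValueError("expected lang.fontname.expN...: " + tr_filename)
    return parts[1]


def parse_font_properties(text):
    """Map each font name to its five 0/1 flags; raise on malformed lines."""
    fonts = {}
    for number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines in this sketch
        if len(fields) != 6 or any(f not in ("0", "1") for f in fields[1:]):
            raise ValueError("line %d is malformed: %r" % (number, line))
        fonts[fields[0]] = [int(f) for f in fields[1:]]
    return fonts


def check_font(tr_filename, font_properties_text):
    """Return True if the font named in the .tr file name has an entry."""
    fonts = parse_font_properties(font_properties_text)
    return font_name_from_tr(tr_filename) in fonts


if __name__ == "__main__":
    # Assumed file content for illustration, not the poster's actual file.
    sample = "a 0 0 0 0 0\n"
    print(check_font("eng.a.exp0.box.tr", sample))
```

If the check fails for the real font_properties file, adding a line for the font "a" (or renaming the training files to match an existing entry) would be the first thing to try.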


On Sunday, September 18, 2011 at 10:04:39 PM UTC+5:30, Sriranga(78yrsold) 
wrote:
>
> Mervet,
> Yes, you can cut off the letters programmatically. The attached files answer 
> your other question. If you forward all the data files you generated, 
> I will investigate where the mistake occurs and report back to you.
> With Best Wishes,
> -sriranga(78yrs)
>
> On Sun, Sep 18, 2011 at 2:48 PM, merve t <merve...@gmail.com> wrote:
>
>> Hello,
>> I am a computer scientist and have programming experience, so I think I 
>> can cut off the letters automatically. I expect I will have questions about 
>> how to get images of words from Tesseract.
>>
>> Anyway, I now have a different question, which I copy here:
>>
>>
>> -----------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Hello,
>>
>> Here is what I did.
>>
>> bnv is my language code; the files I used are attached.
>>
>> >>tesseract bnv.denemem.exp0.tif bnv.denemem.exp0 batch.nochop makebox
>>
>> Now I have a box file.
>>
>> I edited it because there were mistakes. OK, no problem.
>>
>> >>tesseract bnv.denemem.exp0.tif bnv.denemem.exp0 nobatch box.train
>>
>> >>unicharset_extractor bnv.denemem.exp0.box
>>
>> >>mftraining -F font_properties -U unicharset -O bnv.unicharset bnv.denemem.exp0.tr
>>
>> >>cntraining bnv.denemem.exp0.tr
>>
>> Rename these files:
>>
>> inttemp
>> Microfeat
>> normproto
>> pffmtable
>>
>> to:
>>
>> bnv.inttemp
>> bnv.Microfeat
>> bnv.normproto
>> bnv.pffmtable
>>
>>
>> >>combine_tessdata bnv.
>>
>> My training procedure finishes at this point.
>>
>> I move bnv.traineddata into the /tessdata folder.
>>
>>
>> >>tesseract 3example.tif output -l bnv
>>
>> I did not use 3example.tif in training. Should I have?
>>
>> I trained Tesseract on a small dataset of my handwriting and got 
>> some good results, but when I test the attached image I get 
>>
>> "fcgbcd"
>>
>> as output.
>>
>> The last three characters, "bcd", are correct.
>>
>> But for "a" it returns "fcg", three characters.
>>
>> As a separate experiment, I generated a box file for the attached image 
>> using the box file generation step of training, and it recognizes "a" and 
>> its box correctly.
>>
>> The main problem is getting 6 letters instead of 4 when testing.
>>
>> Failing to get the right character is a 
>> problem too.
>>
>> Thanks for your ideas and time.
>>
>>
>>
>> 2011/9/18 Sriranga(78yrsold) <withbl...@gmail.com>
>>
>>> Merve,
>>> You can ask Alex of Center Rime about the program for joined handwriting 
>>> and evaluate whether YagpoOCR suits your purpose. If you find YagpoOCR is 
>>> better than Tesseract OCR, 
>>> you can use it, but please don't ask me for help, since I have zero 
>>> hands-on experience with YagpoOCR.
>>> With best of luck,
>>> -sriranga(78yrs)
>>>
>>>
>>> On Sun, Sep 18, 2011 at 11:28 AM, Sriranga(78yrsold) <withbl...@gmail.com> wrote:
>>>
>>>> Merve,
>>>> Regarding *I have another question in this mail list, it would be 
>>>> appreciated if you share your idea about it, i have sent my cmd transcript 
>>>> to the mail list*: I could not locate it in the forum.
>>>>
>>>>
>>>> On Sun, Sep 18, 2011 at 8:41 AM, Sriranga(78yrsold) <withbl...@gmail.com> wrote:
>>>>
>>>>> Merve,
>>>>> Thanks for the frank email. *You have not said what programming 
>>>>> knowledge you have.*
>>>>> Yes, you are correct. Joined handwritten text will not work unless it 
>>>>> is cut apart (the joined portion between two characters is split). You 
>>>>> have to train on the handwriting, which generally varies in shape and 
>>>>> style, a number of times, just as you would for regular, bold, and other 
>>>>> font styles. Please remember that the output will not be as accurate as 
>>>>> it is for regular fonts of any language, because the relevant source 
>>>>> code would have to be modified by its creator. Even so, I feel that a 
>>>>> post-processing program can improve the accuracy further. 
>>>>> Wishing you success in your project.
>>>>> -sriranga(78yrs)
>>>>>
>>>>>
>>>>> On Sat, Sep 17, 2011 at 9:57 PM, merve t <merve...@gmail.com> wrote:
>>>>>
>>>>>> Sriranga,
>>>>>> Thanks very much for your attention. I have a solution in mind for 
>>>>>> joined handwritten text: I am going to cut off the letters and check 
>>>>>> whether the resulting words are in the dictionary. This is the best 
>>>>>> solution I have found so far. I also have another question on this 
>>>>>> mailing list, and I would appreciate it if you shared your thoughts on 
>>>>>> it. I have sent my cmd transcript to the list.
>>>>>> Thanks very much.
>>>>>>
>>>>>> 2011/9/17 Sriranga(78yrsold) <withbl...@gmail.com>
>>>>>>
>>>>>>> Mervert,
>>>>>>> I would like to know which programs you are specialised or well versed in.
>>>>>>> With best wishes,
>>>>>>> -sriranga(78yrs)
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Sep 17, 2011 at 11:47 AM, Sriranga(78yrsold) <withbl...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Mervet,
>>>>>>>>
>>>>>>>> *Regarding Kannada OCR*: I have not been properly trained in 
>>>>>>>> generating Kannada data files for YagpoOCR by Center Rime, 
>>>>>>>> so I do not know how to generate a data file or operate 
>>>>>>>> YagpoOCR. I am also not in a position to offer any 
>>>>>>>> comments about *joined handwritten text* (as stated by Center Rime), 
>>>>>>>> which is *new to me*; I am hearing of it only now. Further, I am not 
>>>>>>>> using YagpoOCR for my projects in English, Kannada, etc. 
>>>>>>>> In the circumstances, I am not in a position to help or guide you 
>>>>>>>> with YagpoOCR if you approach me.
>>>>>>>> Wishing you Good Luck,
>>>>>>>> -sriranga(78yrs)
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Sep 17, 2011 at 2:46 AM, Center Rime <go...@mail.ru> wrote:
>>>>>>>>
>>>>>>>>> Dear friends!
>>>>>>>>> At present we have an engine for OCR of Sanskrit and joined 
>>>>>>>>> handwritten text. 
>>>>>>>>> With the help of Sriranga we have a base model for Kannada OCR. 
>>>>>>>>> We have a framework agreement on Sanskrit Devanagari recognition. 
>>>>>>>>> Next year we plan to add
>>>>>>>>> recognition of the main Asian scripts in Unicode.
>>>>>>>>> Below is the current project status.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> We invite you to cooperate in using the open source Tibetan 
>>>>>>>>> text recognition software.
>>>>>>>>> TBRC already uses this program to input Tibetan text.
>>>>>>>>> More than 200 volumes have been input already.
>>>>>>>>> On printed text our OCR makes 1-2 errors per page. We have also 
>>>>>>>>> started work on woodblock and handwritten text.
>>>>>>>>>
>>>>>>>>> At present the OCR program can recognize printed text from 300 dpi 
>>>>>>>>> grayscale scanned images.
>>>>>>>>> With the support of Trace Foundation we have started a server for 
>>>>>>>>> the Tibetan OCR project, www.dharmabook.ru
>>>>>>>>> You can upload material for OCR to our server or give us access 
>>>>>>>>> to scanned material on your server.
>>>>>>>>> All OCR work is free of charge; until the end of this year it is 
>>>>>>>>> supported by Trace Foundation.
>>>>>>>>>
>>>>>>>>> We have also started work on woodblock prints. This needs a more 
>>>>>>>>> advanced program, and we are working on it. At present we
>>>>>>>>> can OCR clearly printed woodblock and handwritten text with a 
>>>>>>>>> recognition rate of about 90%.
>>>>>>>>> We can also OCR dictionaries and mixed Tibetan-English or 
>>>>>>>>> Tibetan-Sanskrit text.
>>>>>>>>> On our side the biggest problem is proofreading. For that we 
>>>>>>>>> provide a spellchecker; see the recognition example. 
>>>>>>>>> We can also develop Tibetan software for your projects.
>>>>>>>>> སྒ་རབ་འབྱམས་པ་ཀུན་དགའ་ཡེ་ཤེས་ book OCR example
>>>>>>>>>
>>>>>>>>> http://www.dharmabook.ru/work_file/W00EGS1016747-I01JW143/index.php?img_page=I01JW1430066.tif&photo_index=65
>>>>>>>>>
>>>>>>>>> The whole book as a Zip archive:
>>>>>>>>> http://www.dharmabook.ru/work_file/W00EGS1016747-I01JW143.zip
>>>>>>>>>
>>>>>>>>> We will be happy to help you in your work. 
>>>>>>>>>
>>>>>>>>> Here is also an example of OCR of a printed Kanjur edition:
>>>>>>>>>
>>>>>>>>> http://www.dharmabook.ru/work_file/W1PD95844-I1PD95855/index.php?img_page=I1PD958550071.tif&photo_index=70
>>>>>>>>> TBRC has now scanned half of these volumes. We have OCR for that 
>>>>>>>>> text and can propose to Trace and TBRC
>>>>>>>>> a project for proofreading this edition.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Sarva Mangalam!
>>>>>>>>> alex
>>>>>>>>> www.code.google.com/p/ocrlib
>>>>>>>>> www.dharmabook.ru
>>>>>>>>> 
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
