Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-19 Thread reza
Thanks for your reply. I will test these as soon as possible. One weakness of Tesseract shows up when we want to OCR multiple languages. For example, if an image contains both Persian and English text, Tesseract can't recognize it as well as it does a single-language image. Do you have any
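For mixed Persian and English pages, Tesseract's `-l` option accepts several language codes joined with `+`. The following is a minimal sketch, assuming both the `fas` and `eng` traineddata files are installed. The image and output names are hypothetical, and the order of the languages may affect the result.

```shell
#!/usr/bin/env bash
# Join language codes with '+', the separator tesseract's -l option expects.
join_langs() {
  local IFS='+'
  echo "$*"
}

# Hypothetical file names, used only for illustration.
image="page.png"
out_base="page_out"
langs="$(join_langs fas eng)"

# Always print the command; run it only if tesseract and the image are present.
echo "tesseract $image $out_base -l $langs"
if command -v tesseract >/dev/null 2>&1 && [ -f "$image" ]; then
  tesseract "$image" "$out_base" -l "$langs"
fi
```

Trying both `fas+eng` and `eng+fas` on the same page is a cheap way to see whether the language order matters for a given document.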

Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-18 Thread ShreeDevi Kumar
Hi Reza, Attached are two scripts and one log file. You will need to change the directories in the scripts. finetune.sh and finetune log file are for a sample finetuning for eng. By changing the language code you can run it for fas. You can use that as a test. plus-fas.sh is for plusminus type
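The attached scripts are not reproduced in this archive. As a rough illustration of what "changing the language code" means, the sketch below prints the three fine-tuning commands described in the Tesseract 4.00 training wiki for a given language. It is not the attached finetune.sh: every directory is a hypothetical placeholder, the iteration count is only illustrative, and the commands are printed rather than executed.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print fine-tuning commands for a language code and font.
# All directories are hypothetical placeholders; adjust them to your setup.
finetune_cmds() {
  local lang_code="$1"
  local font="$2"
  local langdata_dir="./langdata"
  local tessdata_dir="./tessdata_best"
  local train_dir="./${lang_code}train"
  local out_dir="./${lang_code}_finetuned"

  # 1. Render the training text in the chosen font and build line data.
  echo "tesstrain.sh --fonts_dir /usr/share/fonts --lang $lang_code --linedata_only --noextract_font_properties --langdata_dir $langdata_dir --tessdata_dir $tessdata_dir --fontlist \"$font\" --output_dir $train_dir"
  # 2. Extract the LSTM model from the existing traineddata.
  echo "combine_tessdata -e $tessdata_dir/$lang_code.traineddata $out_dir/$lang_code.lstm"
  # 3. Continue training from the extracted model on the new line data.
  echo "lstmtraining --model_output $out_dir/$lang_code --continue_from $out_dir/$lang_code.lstm --traineddata $tessdata_dir/$lang_code.traineddata --train_listfile $train_dir/$lang_code.training_files.txt --max_iterations 400"
}

# Switching from eng to fas only changes the first argument.
finetune_cmds fas "B Nazanin"
```

Printing the commands first makes it easy to check the paths before starting a long training run.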

Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-18 Thread reza
Hi ShreeDevi, thanks. I tested the two models that you provided. The accuracy on samples without noise was about 98%, but on scanned samples or captured images it was about 80%. It still didn't work on different fonts. Could you send all the files needed for training the models? I want fine

Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-18 Thread ShreeDevi Kumar
I have posted a couple of test models for Farsi at https://github.com/Shreeshrii/tessdata_shreetest. These have not been trained on text with diacritics, as the normalization and training process was giving errors on the combining marks. Please give them a try and see if they provide better

Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-15 Thread reza
Hi again, thanks for your reply. I need more fonts, for example: B Koodak, B Lotus, B Titr, B Zar, B Yekan, Iran Nastaliq. If needed, should I send the .ttf files of those fonts? Thanks On Tuesday, May 15, 2018 at 5:35:10 PM UTC+4:30, shree wrote: > > I will try to put together complete steps. > > I am

Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-15 Thread ShreeDevi Kumar
I will try to put together complete steps. I am doing a test run for training persian. Are the following fonts ok for it? '55_Sarchia_Kurdish' \ '56_Sarchia_Kurdish_Bold Bold' \ 'Amiri' \ 'Arabic Typesetting' \ 'Arial' \ 'Arial Unicode MS' \ 'B Nazanin' \ 'B Nazanin Bold' \

Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-15 Thread reza
I tested it on Ubuntu, and that raised an error too. Could you help me and send me a new bash file for fine-tuning with new fonts? I put the "eng.traineddata" file in the tessdata_best folder, and "eng.training_text" and "eng.traineddata" in langdata\eng. Is that correct and sufficient, or are more files needed? Thanks
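A quick way to rule out path problems before training is to check that each expected file is readable. The sketch below checks only the two locations named in the message, written with forward slashes for bash. Whether that set of files is sufficient for training is not settled by this check, and the helper name `check_readable` is hypothetical.

```shell
#!/usr/bin/env bash
# Report which of the given files are missing or unreadable.
# Prints one "MISSING: <path>" line per problem; returns non-zero if any.
check_readable() {
  local status=0
  local path
  for path in "$@"; do
    if [ ! -r "$path" ]; then
      echo "MISSING: $path"
      status=1
    fi
  done
  return "$status"
}

# Paths as named in the message, relative to the current directory.
check_readable \
  "tessdata_best/eng.traineddata" \
  "langdata/eng/eng.training_text" || echo "Fix the paths above before training."
```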

Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-15 Thread ShreeDevi Kumar
Please use the latest Windows binaries from https://github.com/UB-Mannheim/tesseract/wiki provided by @stweil. How do you run a bash script on Windows 10? @stweil, I have not tried training on Windows. Do you have feedback from others who have tried it? ShreeDevi

Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-15 Thread reza
Thanks for the reply. Tesseract 4 beta, Windows 10. On Tuesday, May 15, 2018 at 1:12:20 PM UTC+4:30, shree wrote: > > What o/s are you running it on? > > Which version of tesseract? > > > ICU ERROR: U_FILE_ACCESS_ERROR ERROR: /tmp/tmp.6m4B2TUln1/eng/eng.unicharset > does not exist or is not readable

Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-15 Thread reza
Windows 10, Tesseract 4 alpha. On Tuesday, May 15, 2018 at 1:12:20 PM UTC+4:30, shree wrote: > > What o/s are you running it on? > > Which version of tesseract? > > > ICU ERROR: U_FILE_ACCESS_ERROR ERROR: /tmp/tmp.6m4B2TUln1/eng/eng.unicharset > does not exist or is not readable > > which version

Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-15 Thread ShreeDevi Kumar
What o/s are you running it on? Which version of tesseract? > ICU ERROR: U_FILE_ACCESS_ERROR ERROR: /tmp/tmp.6m4B2TUln1/eng/eng.unicharset does not exist or is not readable Which version of the ICU library? ShreeDevi भजन - कीर्तन - आरती @

Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-15 Thread reza
I used the attached finetune.sh file ... but it raised an error. Could you help me? Thanks > ## MAKING TRAINING DATA ## > > >> === Starting training for language 'eng' > > [Tue, May 15, 2018 11:42:36 AM] /c/Program Files >> (x86)/Tesseract-OCR/text2image --fonts_dir=C:WindowsFonts

Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-14 Thread reza
Thanks for your reply. I read that, but I am confused. Could you send me a bash file for fine-tuning for Impact? Thanks On Monday, May 14, 2018 at 6:18:11 PM UTC+4:30, shree wrote: > > please see > https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact > >

[tesseract-ocr] train more fonts on trained model fas in tesseract

2018-05-14 Thread reza
Hi, I tested Tesseract 4 beta on Persian, and the results were good, but I think it needs more training on more fonts and texts. How could we train more fonts and texts on the model that exists in Tesseract 4 beta for Persian? And the last question is, how could we apply a dictionary to correct that