eng+iast-plus-3600 => no diacritics at all
Latin+iast-plus-3600 => only macrons, no other diacritics
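
In case it helps to compare language combinations like the two above more systematically, here is a minimal sketch (not part of tesseract) that counts which combining marks survive in an OCR text. It uses only Python's standard `unicodedata` module. The function names `tally_diacritics` and `lost_diacritics` and the sample strings are my own assumptions for illustration; they are not taken from the attached page.

```python
import unicodedata
from collections import Counter


def tally_diacritics(text: str) -> Counter:
    """Count combining marks in `text`, keyed by Unicode character name."""
    counts: Counter = Counter()
    # NFD splits precomposed letters such as "ā" into a base letter plus marks.
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):  # non-zero combining class => a mark
            counts[unicodedata.name(ch, "UNKNOWN")] += 1
    return counts


def lost_diacritics(ground_truth: str, ocr_output: str) -> Counter:
    """Marks that occur more often in the ground truth than in the OCR output."""
    return tally_diacritics(ground_truth) - tally_diacritics(ocr_output)


if __name__ == "__main__":
    # Hypothetical sample strings, invented for this example.
    truth = "kṛṣṇa gītā"
    ocr = "krsna gītā"
    for name, n in lost_diacritics(truth, ocr).items():
        print(f"{n} x {name}")
```

Running it on the corrected ground truth and on the output of each language combination should show at a glance whether a model drops everything, or keeps macrons but loses the dots below. It only compares totals per mark, so it will not notice a mark that moved to the wrong letter.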



On Thursday, July 12, 2018 at 1:12:25 AM UTC+5:30, shree wrote:
>
> What about OCR with eng+iast?
>
>
>
> On Wed 11 Jul, 2018, 7:44 PM yajva, <nsvnar...@gmail.com> wrote:
>
>> shree
>> namaste
>>
>> I am trying to OCR the attached image, but the results are not so good, even 
>> for text which is apparently clear. E.g., in the first line B is recognized 
>> as H, and the under-dot for 't' in 'most' (4th line) is also wrong. The image 
>> has warping, but best/Latin and Google OCR still produce better results. Is it 
>> possible to add diacritics to Latin? Can you help in any way?
>>
>> regards
>> Venkatesh
>>
>>
>> On Monday, July 2, 2018 at 2:05:47 PM UTC+5:30, yajva wrote:
>>>
>>> Many thanks. Downloaded and using it.
>>> Will wait for the next version.
>>>
>>>
>>> On Sunday, July 1, 2018 at 12:21:19 AM UTC+5:30, shree wrote:
>>>>
>>>> I have uploaded a new version of the traineddata file at 
>>>>
>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/iast-layer-18003.traineddata
>>>>
>>>> Attached is the OCRed output for pages 13-24 of the dark PDF with it.
>>>>
>>>> I am still training a different variation.
>>>>
>>>>
>>>>
>>>> On Wed, Jun 27, 2018 at 6:46 PM Shree Devi Kumar <shree...@gmail.com> 
>>>> wrote:
>>>>
>>>>> ok. I will take a look.
>>>>>
>>>>> On Wed, Jun 27, 2018 at 5:04 PM yajva <nsvnar...@gmail.com> wrote:
>>>>>
>>>>>> Checked with both light & dark pdfs. The results are very good. 
>>>>>> Thanks.
>>>>>>
>>>>>> A few concerns. E is consistently missed in both. J is missed 
>>>>>> consistently in darker image but recognized as T in dark image. ṝ is 
>>>>>> recognized as ṛ consistently. Can these be addressed?
>>>>>> I am using the tesseract 4 alpha Windows build from the command line.
>>>>>>
>>>>>> Are the dev files in the repos?
>>>>>>
>>>>>>
>>>>>> On Tuesday, June 26, 2018 at 11:06:06 PM UTC+5:30, shree wrote:
>>>>>>>
>>>>>>> I had used ghostview to convert the PDF to tif or png.
>>>>>>>
>>>>>>> You can OCR the PDF directly with gImageReader using the traineddata 
>>>>>>> file I sent.
>>>>>>>
>>>>>>> See the links for new Windows binaries in the message below.
>>>>>>>
>>>>>>>
>>>>>>> At last, here are some fresh builds:
>>>>>>>
>>>>>>>
>>>>>>> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_i686_tesseract4.git87635c1.exe
>>>>>>>
>>>>>>> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_x86_64_tesseract4.git87635c1.exe
>>>>>>>
>>>>>>> I'd be also interested in testing of the tessdata manager, which 
>>>>>>> should now also properly handle script tessdatas
>>>>>>>
>>>>>>> On Tue 26 Jun, 2018, 10:59 PM yajva, <nsvnar...@gmail.com> wrote:
>>>>>>>
>>>>>>>> The doc is a different version of the same text. Here's the doc used for 
>>>>>>>> the first png. This one is slightly darker, but the one sent earlier is 
>>>>>>>> cleaner. Let me know which is more amenable to OCR. I use PDF Shaper to 
>>>>>>>> extract images and convert them to png using xnview.
>>>>>>>>
>>>>>>>> On Tuesday, June 26, 2018 at 7:48:28 PM UTC+5:30, shree wrote:
>>>>>>>>>
>>>>>>>>> Traineddata file is attached for use with tesseract4.0.0-beta.
>>>>>>>>>
>>>>>>>>> How did you create the test png from the pdf? I am not getting as 
>>>>>>>>> good quality, tried various settings with irfanview.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jun 26, 2018 at 4:58 PM yajva <nsvnar...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Sorry for the delay, my system was down.
>>>>>>>>>>
>>>>>>>>>> I am getting "Page not Found" for the link given. Can you please 
>>>>>>>>>> re-check?
>>>>>>>>>>
>>>>>>>>>> Here's the doc I am trying to OCR
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Saturday, June 23, 2018 at 9:46:08 PM UTC+5:30, shree wrote:
>>>>>>>>>>>
>>>>>>>>>>> Please test with traineddata file from 
>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_sanskrit/tree/master/iast-plus1
>>>>>>>>>>>
>>>>>>>>>>> Need to check that it is not overfitted.
>>>>>>>>>>>
>>>>>>>>>>> Please share a couple more images which I can use for testing.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jun 21, 2018 at 11:38 PM yajva <nsvnar...@gmail.com> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> one more correction.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thursday, June 21, 2018 at 11:34:00 PM UTC+5:30, yajva wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> done
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wednesday, June 20, 2018 at 9:05:01 PM UTC+5:30, shree 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am attaching the OCRed text. Please correct it so that I 
>>>>>>>>>>>>>> can use it as ground truth for further training and testing.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jun 20, 2018 at 3:15 PM Shree Devi Kumar <
>>>>>>>>>>>>>> shree...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I had done a training for Sanskrit for both Devanagari and 
>>>>>>>>>>>>>>> IAST, but it does not include the cedilla for Sh.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I will add it and let you know.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed 20 Jun, 2018, 1:17 AM yajva, <nsvnar...@gmail.com> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have tried Google OCR for recognizing Sanskrit text in 
>>>>>>>>>>>>>>>> Roman with diacritics (IAST). It recognizes the macron above but 
>>>>>>>>>>>>>>>> not the dots below, nor the joining grave and accent. Is there any 
>>>>>>>>>>>>>>>> traineddata available for tesseract that can do this with good 
>>>>>>>>>>>>>>>> accuracy? I have attached a sample page that I am interested in.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails 
>>>>>>>>>>>>>>>> from it, send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>>>>>>>>> To post to this group, send email to 
>>>>>>>>>>>>>>>> tesser...@googlegroups.com.
>>>>>>>>>>>>>>>> Visit this group at 
>>>>>>>>>>>>>>>> https://groups.google.com/group/tesseract-ocr.
>>>>>>>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/aef0797b-8df3-4db7-9a3b-02f62d2e5a28%40googlegroups.com
>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/aef0797b-8df3-4db7-9a3b-02f62d2e5a28%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ____________________________________________________________
>>>>>>>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>>>>>>>>>

