Thank you for your feedback on eng+

I will try training for this and get back to you.
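
In the meantime, the two language combinations from the report quoted below can be rerun from a small script so that the outputs are easy to compare side by side. This is only a sketch, not a tested recipe. It assumes the tesseract executable is on the PATH and that the eng, Latin and iast-plus-3600 traineddata files all sit in one tessdata directory. The image name page.png and the helper names are made up for illustration. The -l option (with + between models) and --tessdata-dir are standard tesseract command-line options.

```python
import shutil
import subprocess
from pathlib import Path

# Language combinations reported in this thread.
COMBINATIONS = [("eng", "iast-plus-3600"), ("Latin", "iast-plus-3600")]


def tesseract_command(image, output_base, languages, tessdata_dir=None):
    """Build the argument list for one tesseract run (hypothetical helper)."""
    cmd = ["tesseract", str(image), str(output_base), "-l", "+".join(languages)]
    if tessdata_dir is not None:
        # Directory holding the *.traineddata files (an assumption of this sketch).
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def run_all(image, tessdata_dir=None):
    """Run every combination and return the text files tesseract should write.

    Returns an empty list without running anything if tesseract is not installed.
    """
    if shutil.which("tesseract") is None:
        return []
    outputs = []
    for langs in COMBINATIONS:
        # One output base per combination, e.g. "page-eng_iast-plus-3600".
        base = Path(image).with_suffix("").name + "-" + "_".join(langs)
        subprocess.run(tesseract_command(image, base, langs, tessdata_dir), check=True)
        outputs.append(base + ".txt")
    return outputs
```

Calling run_all("page.png", "tessdata") would then leave one .txt file per combination in the current directory, ready for comparison.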


On Thu, Jul 12, 2018 at 2:18 PM yajva <[email protected]> wrote:

> eng+iast-plus-3600 => no diacritics at all
> Latin+iast-plus-3600 => only macrons, no other diacritics
>
>
>
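
One way to put numbers on the two results above is to tally which combining marks actually occur in each OCR output. The following is a minimal sketch using only the Python standard library. The function name is made up, and it assumes the output has already been read in as Unicode text.

```python
import unicodedata
from collections import Counter


def tally_diacritics(text):
    """Count combining marks in text after canonical decomposition (NFD)."""
    # NFD splits precomposed letters such as "ā" into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    marks = Counter(
        unicodedata.name(ch, "UNNAMED MARK")
        for ch in decomposed
        if unicodedata.combining(ch)  # non-zero combining class means a mark
    )
    return dict(marks)
```

For example, tally_diacritics("ā ṝ") reports two COMBINING MACRON and one COMBINING DOT BELOW, because ṝ decomposes into r + dot below + macron. An output with no diacritics at all gives an empty result, and a macrons-only output shows COMBINING MACRON alone.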
> On Thursday, July 12, 2018 at 1:12:25 AM UTC+5:30, shree wrote:
>>
>> What about OCR with
>>
>> eng+iast?
>>
>>
>>
>> On Wed 11 Jul, 2018, 7:44 PM yajva, <[email protected]> wrote:
>>
>>> shree
>>> namaste
>>>
>>> I am trying to OCR the attached image, but the results are not very good,
>>> even for text which is apparently clear. E.g., in the first line B is
>>> recognized as H, the under-dot for 't' in 'most' in the 4th line is wrong,
>>> etc. The image has warping, but best/Latin and Google OCR still produce
>>> better results. Is it possible to add diacritics to Latin? Can you help in
>>> any way?
>>>
>>> regards
>>> Venkatesh
>>>
>>>
>>> On Monday, July 2, 2018 at 2:05:47 PM UTC+5:30, yajva wrote:
>>>>
>>>> Many thanks. I have downloaded it and am using it.
>>>> Will wait for the next version.
>>>>
>>>>
>>>> On Sunday, July 1, 2018 at 12:21:19 AM UTC+5:30, shree wrote:
>>>>>
>>>>> I have uploaded a new version of the traineddata file at
>>>>>
>>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/iast-layer-18003.traineddata
>>>>>
>>>>> Attached is the OCRed output for pages 13-24 of the dark PDF produced with it.
>>>>>
>>>>> I am still training a different variation.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jun 27, 2018 at 6:46 PM Shree Devi Kumar <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> ok. I will take a look.
>>>>>>
>>>>>> On Wed, Jun 27, 2018 at 5:04 PM yajva <[email protected]> wrote:
>>>>>>
>>>>>>> Checked with both light & dark pdfs. The results are very good.
>>>>>>> Thanks.
>>>>>>>
>>>>>>> A few concerns. E is consistently missed in both. J is missed
>>>>>>> consistently in darker image but recognized as T in dark image. ṝ is
>>>>>>> recognized as ṛ consistently. Can these be addressed?
>>>>>>> I am using the Tesseract 4 alpha Windows build from the command line.
>>>>>>>
>>>>>>> Are the dev files in the repos?
>>>>>>>
>>>>>>>
>>>>>>> On Tuesday, June 26, 2018 at 11:06:06 PM UTC+5:30, shree wrote:
>>>>>>>>
>>>>>>>> I used ghostview to convert the PDF to TIFF or PNG.
>>>>>>>>
>>>>>>>> You can OCR the PDF directly with gImageReader using the traineddata
>>>>>>>> file I sent.
>>>>>>>>
>>>>>>>> See the links for new Windows binaries in the message below.
>>>>>>>>
>>>>>>>>
>>>>>>>> At last, here are some fresh builds:
>>>>>>>>
>>>>>>>>
>>>>>>>> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_i686_tesseract4.git87635c1.exe
>>>>>>>>
>>>>>>>> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_x86_64_tesseract4.git87635c1.exe
>>>>>>>>
>>>>>>>> I'd be also interested in testing of the tessdata manager, which
>>>>>>>> should now also properly handle script tessdatas
>>>>>>>>
>>>>>>>> On Tue 26 Jun, 2018, 10:59 PM yajva, <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> The doc is a different version of the same text. Here's the doc used for
>>>>>>>>> the first PNG. This one is slightly darker, but the one sent earlier is
>>>>>>>>> cleaner. Let me know which is more amenable to OCR. I use PDF Shaper to
>>>>>>>>> extract the images and convert them to PNG using XnView.
>>>>>>>>>
>>>>>>>>> On Tuesday, June 26, 2018 at 7:48:28 PM UTC+5:30, shree wrote:
>>>>>>>>>>
>>>>>>>>>> Traineddata file is attached for use with tesseract4.0.0-beta.
>>>>>>>>>>
>>>>>>>>>> How did you create the test PNG from the PDF? I am not getting as
>>>>>>>>>> good quality, though I tried various settings with IrfanView.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 26, 2018 at 4:58 PM yajva <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sorry for the delay, my system was down.
>>>>>>>>>>>
>>>>>>>>>>> I am getting "Page not Found" for the link given. Can you please
>>>>>>>>>>> re-check?
>>>>>>>>>>>
>>>>>>>>>>> Here's the doc I am trying to OCR
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Saturday, June 23, 2018 at 9:46:08 PM UTC+5:30, shree wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Please test with traineddata file from
>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_sanskrit/tree/master/iast-plus1
>>>>>>>>>>>>
>>>>>>>>>>>> I need to check that it is not overfitted.
>>>>>>>>>>>>
>>>>>>>>>>>> Please share a couple more images which I can use for testing.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jun 21, 2018 at 11:38 PM yajva <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> one more correction.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thursday, June 21, 2018 at 11:34:00 PM UTC+5:30, yajva
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> done
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wednesday, June 20, 2018 at 9:05:01 PM UTC+5:30, shree
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am attaching the OCRed text. Please correct it so that I
>>>>>>>>>>>>>>> can use it as ground truth for further training and testing.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Jun 20, 2018 at 3:15 PM Shree Devi Kumar <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have done a training run for Sanskrit covering both Devanagari
>>>>>>>>>>>>>>>> and IAST, but it does not include the cedilla for Sh.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I will add it and let you know.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed 20 Jun, 2018, 1:17 AM yajva, <[email protected]>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have tried Google OCR for recognizing Sanskrit text in
>>>>>>>>>>>>>>>>> Roman with diacritics (IAST). It recognizes the macron above but
>>>>>>>>>>>>>>>>> not the dots below, nor the joining grave and accent. Is there any
>>>>>>>>>>>>>>>>> traineddata available for tesseract that can do this with good
>>>>>>>>>>>>>>>>> accuracy? I have attached a sample page that I am interested in.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> You received this message because you are subscribed to
>>>>>>>>>>>>>>>>> the Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails
>>>>>>>>>>>>>>>>> from it, send an email to [email protected]
>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>> To post to this group, send email to
>>>>>>>>>>>>>>>>> [email protected].
>>>>>>>>>>>>>>>>> Visit this group at
>>>>>>>>>>>>>>>>> https://groups.google.com/group/tesseract-ocr.
>>>>>>>>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/aef0797b-8df3-4db7-9a3b-02f62d2e5a28%40googlegroups.com
>>>>>>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/aef0797b-8df3-4db7-9a3b-02f62d2e5a28%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout
>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ____________________________________________________________
>>>>>>>>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>>>>>>>>>>


