Thank you for your feedback on eng+. I will try training for this and get back to you.
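
In case it helps when comparing the language combinations reported below, here is a rough sketch of how the check could be scripted. It only assumes the standard `tesseract imagename outputbase -l LANG+LANG` command-line form; the helper names (`build_cmd`, `diacritic_counts`, `compare`) and the image name are hypothetical, and the traineddata files named in the thread are assumed to be already installed in the tessdata directory. The idea is to tally which combining marks (macron, dot below, and so on) survive in each OCR output, so that "no diacritics at all" versus "only macrons" becomes a count rather than an impression.

```python
import subprocess
import unicodedata
from collections import Counter


def build_cmd(image, out_base, langs, tessdata_dir=None):
    """Build a tesseract command line; languages are joined with '+' for -l."""
    cmd = ["tesseract", image, out_base, "-l", "+".join(langs)]
    if tessdata_dir:
        # Optional: point tesseract at a folder holding custom traineddata.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def diacritic_counts(text):
    """Count combining marks by Unicode name after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return Counter(
        unicodedata.name(ch) for ch in decomposed if unicodedata.combining(ch)
    )


def compare(image, combos):
    """Run tesseract once per language combination and tally diacritics.

    Hypothetical helper: requires tesseract on PATH and the traineddata
    files for every language in `combos`.
    """
    results = {}
    for langs in combos:
        out_base = "out_" + "_".join(langs)
        subprocess.run(build_cmd(image, out_base, langs), check=True)
        # tesseract writes plain text to <outputbase>.txt by default.
        with open(out_base + ".txt", encoding="utf-8") as fh:
            results["+".join(langs)] = diacritic_counts(fh.read())
    return results
```

Something like `compare("page.png", [["eng", "iast-plus-3600"], ["Latin", "iast-plus-3600"]])` would then return one tally per combination. Running the same tally over the corrected ground-truth text gives the counts to compare against.
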

On Thu, Jul 12, 2018 at 2:18 PM yajva <[email protected]> wrote:

> eng+iast-plus-3600 => no diacritics at all
> Latin+iast-plus-3600 => only macrons none other
>
> On Thursday, July 12, 2018 at 1:12:25 AM UTC+5:30, shree wrote:
>>
>> What about ocr with
>>
>> eng+iast
>>
>> On Wed 11 Jul, 2018, 7:44 PM yajva, <[email protected]> wrote:
>>
>>> shree
>>> namaste
>>>
>>> I am trying to OCR the attached image. Getting not so good results. Even
>>> for text which is apparently clear. Eg. in the first line, B is recognized
>>> as H, under dot for 't' in 'most' 4th line etc. The image has warping but
>>> still best/Latin and Google OCR produce better results. Is it possible
>>> to add diacritics to Latin? Can you help in any way?
>>>
>>> regards
>>> Venkatesh
>>>
>>> On Monday, July 2, 2018 at 2:05:47 PM UTC+5:30, yajva wrote:
>>>>
>>>> Many thanks. Downloaded and using.
>>>> Will wait for next ver.
>>>>
>>>> On Sunday, July 1, 2018 at 12:21:19 AM UTC+5:30, shree wrote:
>>>>>
>>>>> I have uploaded a new version of traineddata file at
>>>>>
>>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/iast-layer-18003.traineddata
>>>>>
>>>>> Attached is the OCRed output for pages 13-24 of dark pdf with it.
>>>>>
>>>>> I am still training a different variation.
>>>>>
>>>>> On Wed, Jun 27, 2018 at 6:46 PM Shree Devi Kumar <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> ok. I will take a look.
>>>>>>
>>>>>> On Wed, Jun 27, 2018 at 5:04 PM yajva <[email protected]> wrote:
>>>>>>
>>>>>>> Checked with both light & dark pdfs. The results are very good.
>>>>>>> Thanks.
>>>>>>>
>>>>>>> A few concerns. E is consistently missed in both. J is missed
>>>>>>> consistently in darker image but recognized as T in dark image. ṝ is
>>>>>>> recognized as ṛ consistently. Can these be addressed ?
>>>>>>> I am using tesseract 4 alpha windows build from command line.
>>>>>>>
>>>>>>> Are the dev files in repos ?
>>>>>>>
>>>>>>> On Tuesday, June 26, 2018 at 11:06:06 PM UTC+5:30, shree wrote:
>>>>>>>>
>>>>>>>> I had used ghostview to convert PDF to tif or png.
>>>>>>>>
>>>>>>>> You can ocr PDF directly with gimagereader using the traineddata
>>>>>>>> file I sent.
>>>>>>>>
>>>>>>>> See links for new windows binaries in msg below.
>>>>>>>>
>>>>>>>> At last, here are some fresh builds:
>>>>>>>>
>>>>>>>> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_i686_tesseract4.git87635c1.exe
>>>>>>>>
>>>>>>>> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_x86_64_tesseract4.git87635c1.exe
>>>>>>>>
>>>>>>>> I'd be also interested in testing of the tessdata manager, which
>>>>>>>> should now also properly handle script tessdatas
>>>>>>>>
>>>>>>>> On Tue 26 Jun, 2018, 10:59 PM yajva, <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> The doc is diff ver of the same text. Here's the doc used for the
>>>>>>>>> first. png. This is slightly darker, but the one sent earlier is
>>>>>>>>> cleaner.
>>>>>>>>> Let me know which is more amenable for OCRing. I use PDF Shaper to
>>>>>>>>> extract images and convert to png using xnview.
>>>>>>>>>
>>>>>>>>> On Tuesday, June 26, 2018 at 7:48:28 PM UTC+5:30, shree wrote:
>>>>>>>>>>
>>>>>>>>>> Traineddata file is attached for use with tesseract4.0.0-beta.
>>>>>>>>>>
>>>>>>>>>> How did you create the test png from the pdf? I am not getting as
>>>>>>>>>> good quality, tried various settings with irfanview.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 26, 2018 at 4:58 PM yajva <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sorry for the delay, my system was down.
>>>>>>>>>>>
>>>>>>>>>>> I am getting "Page not Found" for the link given. Can you pl
>>>>>>>>>>> re-check?
>>>>>>>>>>>
>>>>>>>>>>> Here's the doc I am trying to OCR
>>>>>>>>>>>
>>>>>>>>>>> On Saturday, June 23, 2018 at 9:46:08 PM UTC+5:30, shree wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Please test with traineddata file from
>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_sanskrit/tree/master/iast-plus1
>>>>>>>>>>>>
>>>>>>>>>>>> Need to check that is it not overfitted.
>>>>>>>>>>>>
>>>>>>>>>>>> Please share a couple more images which I can use for testing.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jun 21, 2018 at 11:38 PM yajva <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> one more correction.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thursday, June 21, 2018 at 11:34:00 PM UTC+5:30, yajva
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> done
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wednesday, June 20, 2018 at 9:05:01 PM UTC+5:30, shree
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am attaching the OCRed text. Please correct it so that I
>>>>>>>>>>>>>>> can use as groundtruth for further training and testing.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Jun 20, 2018 at 3:15 PM Shree Devi Kumar <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I had done a training for sanskrit for both devanagari and
>>>>>>>>>>>>>>>> IAST but it does not include cedilla for Sh
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I will add it and let you know.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed 20 Jun, 2018, 1:17 AM yajva, <[email protected]>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have tried Google OCR for recognizing Sanskrit text in
>>>>>>>>>>>>>>>>> Roman with diacritics (IAST). It recognizes above macron but
>>>>>>>>>>>>>>>>> not dots below
>>>>>>>>>>>>>>>>> also joining grave and accent.
>>>>>>>>>>>>>>>>> Is there any traineddata available for
>>>>>>>>>>>>>>>>> tesseract that can do this with good accuracy ? Attached a
>>>>>>>>>>>>>>>>> sample page that I am interested in.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> You received this message because you are subscribed to
>>>>>>>>>>>>>>>>> the Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails
>>>>>>>>>>>>>>>>> from it, send an email to [email protected].
>>>>>>>>>>>>>>>>> To post to this group, send email to
>>>>>>>>>>>>>>>>> [email protected].
>>>>>>>>>>>>>>>>> Visit this group at
>>>>>>>>>>>>>>>>> https://groups.google.com/group/tesseract-ocr.
>>>>>>>>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/aef0797b-8df3-4db7-9a3b-02f62d2e5a28%40googlegroups.com.
>>>>>>>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> ____________________________________________________________
>>>>>>>>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

