eng+iast-plus-3600 => no diacritics recognized at all
Latin+iast-plus-3600 => only macrons recognized, no other diacritics
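For anyone repeating this comparison, here is a minimal Python sketch, not a definitive procedure. It builds the two tesseract command lines using the standard `-l` and `--tessdata-dir` options, and counts which combining marks survive in an OCR output so that "no diacritics" versus "only macrons" can be checked numerically. The file names (`page.png`, the `page-...` output bases) and both helper names are hypothetical; nothing here runs tesseract itself.

```python
import unicodedata
from collections import Counter


def build_tesseract_cmd(image, out_base, langs, tessdata_dir=None):
    """Return the argv list for one tesseract run (nothing is executed)."""
    cmd = ["tesseract", image, out_base, "-l", "+".join(langs)]
    if tessdata_dir is not None:
        # Only needed when the traineddata files are outside the default dir.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def diacritic_counts(text):
    """Count combining marks in text after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    marks = Counter(
        unicodedata.name(ch) for ch in decomposed if unicodedata.combining(ch)
    )
    return dict(marks)


# Hypothetical input image; the language combinations are the two tested above.
for langs in (["eng", "iast-plus-3600"], ["Latin", "iast-plus-3600"]):
    out_base = "page-" + "-".join(langs)
    print(" ".join(build_tesseract_cmd("page.png", out_base, langs)))
```

Running `diacritic_counts` on each output text file and on a hand-corrected ground truth shows which marks (macron, dot below, and so on) each language combination drops.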

On Thursday, July 12, 2018 at 1:12:25 AM UTC+5:30, shree wrote:
> What about ocr with
> eng+iast
>
> On Wed 11 Jul, 2018, 7:44 PM yajva, <[email protected]> wrote:
>> shree
>> namaste
>>
>> I am trying to OCR the attached image. Getting not so good results. Even for text which is apparently clear. Eg. in the first line, B is recognized as H, under dot for 't' in 'most' 4th line etc. The image has warping but still best/Latin and Google OCR produce better results. Is it possible to add diacritics to Latin? Can you help in any way?
>>
>> regards
>> Venkatesh
>>
>> On Monday, July 2, 2018 at 2:05:47 PM UTC+5:30, yajva wrote:
>>> Many thanks. Downloaded and using.
>>> Will wait for next ver.
>>>
>>> On Sunday, July 1, 2018 at 12:21:19 AM UTC+5:30, shree wrote:
>>>> I have uploaded a new version of traineddata file at
>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/iast-layer-18003.traineddata
>>>>
>>>> Attached is the OCRed output for pages 13-24 of dark pdf with it.
>>>>
>>>> I am still training a different variation.
>>>>
>>>> On Wed, Jun 27, 2018 at 6:46 PM Shree Devi Kumar <[email protected]> wrote:
>>>>> ok. I will take a look.
>>>>>
>>>>> On Wed, Jun 27, 2018 at 5:04 PM yajva <[email protected]> wrote:
>>>>>> Checked with both light & dark pdfs. The results are very good. Thanks.
>>>>>>
>>>>>> A few concerns. E is consistently missed in both. J is missed consistently in darker image but recognized as T in dark image. ṝ is recognized as ṛ consistently. Can these be addressed?
>>>>>> I am using tesseract 4 alpha windows build from command line.
>>>>>>
>>>>>> Are the dev files in repos?
>>>>>>
>>>>>> On Tuesday, June 26, 2018 at 11:06:06 PM UTC+5:30, shree wrote:
>>>>>>> I had used ghostview to convert PDF to tif or png.
>>>>>>>
>>>>>>> You can ocr PDF directly with gimagereader using the traineddata file I sent.
>>>>>>>
>>>>>>> See links for new windows binaries in msg below.
>>>>>>>
>>>>>>> At last, here are some fresh builds:
>>>>>>>
>>>>>>> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_i686_tesseract4.git87635c1.exe
>>>>>>> https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_x86_64_tesseract4.git87635c1.exe
>>>>>>>
>>>>>>> I'd be also interested in testing of the tessdata manager, which should now also properly handle script tessdatas
>>>>>>>
>>>>>>> On Tue 26 Jun, 2018, 10:59 PM yajva, <[email protected]> wrote:
>>>>>>>> The doc is diff ver of the same text. Here's the doc used for the first png. This is slightly darker, but the one sent earlier is cleaner. Let me know which is more amenable for OCRing. I use PDF Shaper to extract images and convert to png using xnview.
>>>>>>>>
>>>>>>>> On Tuesday, June 26, 2018 at 7:48:28 PM UTC+5:30, shree wrote:
>>>>>>>>> Traineddata file is attached for use with tesseract4.0.0-beta.
>>>>>>>>>
>>>>>>>>> How did you create the test png from the pdf? I am not getting as good quality, tried various settings with irfanview.
>>>>>>>>>
>>>>>>>>> On Tue, Jun 26, 2018 at 4:58 PM yajva <[email protected]> wrote:
>>>>>>>>>> Sorry for the delay, my system was down.
>>>>>>>>>>
>>>>>>>>>> I am getting "Page not Found" for the link given. Can you pl re-check?
>>>>>>>>>>
>>>>>>>>>> Here's the doc I am trying to OCR
>>>>>>>>>>
>>>>>>>>>> On Saturday, June 23, 2018 at 9:46:08 PM UTC+5:30, shree wrote:
>>>>>>>>>>> Please test with traineddata file from
>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_sanskrit/tree/master/iast-plus1
>>>>>>>>>>>
>>>>>>>>>>> Need to check that it is not overfitted.
>>>>>>>>>>>
>>>>>>>>>>> Please share a couple more images which I can use for testing.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jun 21, 2018 at 11:38 PM yajva <[email protected]> wrote:
>>>>>>>>>>>> one more correction.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thursday, June 21, 2018 at 11:34:00 PM UTC+5:30, yajva wrote:
>>>>>>>>>>>>> done
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wednesday, June 20, 2018 at 9:05:01 PM UTC+5:30, shree wrote:
>>>>>>>>>>>>>> I am attaching the OCRed text. Please correct it so that I can use it as groundtruth for further training and testing.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jun 20, 2018 at 3:15 PM Shree Devi Kumar <[email protected]> wrote:
>>>>>>>>>>>>>>> I had done a training for sanskrit for both devanagari and IAST but it does not include cedilla for Sh
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I will add it and let you know.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed 20 Jun, 2018, 1:17 AM yajva, <[email protected]> wrote:
>>>>>>>>>>>>>>>> I have tried Google OCR for recognizing Sanskrit text in Roman with diacritics (IAST). It recognizes above macron but not dots below also joining grave and accent.
>>>>>>>>>>>>>>>> Is there any traineddata available for tesseract that can do this with good accuracy? Attached a sample page that I am interested in.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>>>>>>>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>>>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>>>>>>>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/aef0797b-8df3-4db7-9a3b-02f62d2e5a28%40googlegroups.com.
>>>>>>>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> ____________________________________________________________
>>>>>>>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

