Hi, I've just installed Tesseract to OCR some old epigraphy documents. I used Google Colab as well as a Mac install. All fine, except that I am unable to get the text with IAST: characters are substituted (ā becomes i, etc.). I tried setting the language to lat, but Tesseract doesn't find a Latin language package, and installing the Latin script data didn't help. I've searched through all of Shree's work on GitHub, but can't figure this out. I have three objectives:
1. OCR English pages and search through them.
2. It would be nice to convert the Sanskrit into IAST and search through it.
3. OCR Kannada inscriptions and keep them in OCR'ed format. This is optional, a "good to have".
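For the search side of objectives 1 and 2, here is a minimal sketch of what I have in mind, assuming the OCR output has already been saved as plain text, one string per page. The names fold_iast and find_keywords are just placeholders for illustration. The idea is to strip the diacritics from both the page text and the keyword before comparing, so that "dāna" and "dana" match. This only helps where the OCR drops or keeps a diacritic; it does not help where a character is replaced by a different letter altogether.

```python
import re
import unicodedata


def fold_iast(text):
    """Strip combining marks and case so that 'ā' and 'a' compare equal."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base_only.casefold()


def find_keywords(pages, keywords, context=20):
    """Return (page_number, keyword, snippet) for every keyword hit.

    pages is a list of page texts; page numbers start at 1.
    The snippet is taken from the folded text, not the original.
    """
    folded_keywords = [(kw, fold_iast(kw)) for kw in keywords]
    hits = []
    for page_no, page_text in enumerate(pages, start=1):
        folded_page = fold_iast(page_text)
        for keyword, folded_kw in folded_keywords:
            for match in re.finditer(re.escape(folded_kw), folded_page):
                start = max(0, match.start() - context)
                snippet = folded_page[start:match.end() + context]
                hits.append((page_no, keyword, snippet))
    return hits


if __name__ == "__main__":
    # Made-up sample pages, only to show the call.
    sample_pages = ["The grant of Śrī Kṛṣṇa", "dāna made by the king"]
    for hit in find_keywords(sample_pages, ["dana", "kṛṣṇa"]):
        print(hit)
```

Since the matching is done on folded text, the keyword list can be written with or without diacritics.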
Writing the search code doesn't seem to be tough; the IAST recognition/transcription is the challenge. Accuracy is not very important, as I have to search through volumes of inscriptions for specific keywords in order to recategorize a lot of miscategorised inscriptions on my research topic. The sheer volume makes the Google OCR solution that Shree suggested elsewhere impracticable. I'm new to Python and Tesseract, though I have programmed in the past. Any help is appreciated.

On Friday, July 27, 2018 at 6:29:09 AM UTC+2, shree wrote:
> You can try the IAST ones from https://github.com/Shreeshrii/tessdata_shreetest?files=1
>
> On Fri 27 Jul, 2018, 8:27 AM Shree Devi Kumar, <[email protected]> wrote:
>> There is no official traineddata for san_latn or IAST. I have created some experimental versions but the output is not fully accurate.
>>
>> On Fri 27 Jul, 2018, 12:21 AM John Muccigrosso, <[email protected]> wrote:
>>> You're telling tesseract that your text is in Latin. You need the traineddata for san-lat.

