Please see https://github.com/Shreeshrii/tesstrain-xsa/blob/master/langdata/latin2unicode.sh
It has sed substitution commands for converting transliterated text to Unicode for xsa, based on the mappings shown on Wikipedia and other web pages.

On Mon, Mar 23, 2020, 01:58 Wincent Balin <[email protected]> wrote:

> Hi Shree,
>
> I will add a tool to create random text within Unicode range soon.
>
> @aby tesh: Do you know anything about a converter from transliterated text
> to [xsa] Unicode text?
>
> On Mon, 16 Mar 2020 at 03:12, Shree Devi Kumar <[email protected]> wrote:
>
>> Hi Wincent,
>> Thanks for the link.
>>
>> I had checked that site earlier. It has text transcription in Latin
>> transliteration, e.g.
>> http://dasi.cnr.it/index.php?id=79&prjId=1&corId=5&colId=0&navId=522207406&recId=2149
>> I haven't found any conversion tool to Unicode for the same.
>>
>> 1 Yʿly w-ʾḏmr bny Whbʾl[ ... ...] ʾḏmr[ ... ... by]—
>> 2 t-(s¹m) Yġl b-rdʾ mrʾ-s¹[m ... ...]
>> 3 [... ... ]w-(b)-(rd)ʾ mrʾ-s¹m [... ...]
>> 4 [... ...] ʾḏ(mr) w-b-rd(ʾ)[ ... ...]
>>
>> Maybe you can add a tool in https://github.com/wincentbalin/pytesstrain to
>> create randomly generated training text from a range of characters/word
>> list, similar to
>>
>>> The tool language_metrics runs Tesseract OCR over images of random word
>>> sequences, which are created out of the supplied wordlist,
>>
>> On Mon, Mar 16, 2020 at 2:32 AM Wincent Balin <[email protected]> wrote:
>>
>>> Maybe http://dasi.cnr.it does have something usable?
>>>
>>> Shree Devi Kumar <[email protected]> wrote on Sun, 15 Mar 2020, 16:55:
>>>
>>>> There is no online corpus for xsa that I could find.
>>>>
>>>> Two of the fonts you sent are legacy fonts, that is they map English
>>>> letters to ancient Arabic characters.
>>>>
>>>> Are there any converters that convert from the legacy mapping to
>>>> Unicode?
>>>>
>>>> If there is existing text in legacy fonts, it can be converted to
>>>> Unicode and that can be used for training.
>>>>
>>>> On Sun, Mar 15, 2020, 17:57 aby tesh <[email protected]> wrote:
>>>>
>>>>> Where can I get the training text, or can I create a new one? I have a
>>>>> problem writing with the fonts, some of which are included in the
>>>>> attachment I sent you.
>>>>>
>>>>> On Sunday, March 15, 2020 at 4:32:08 AM UTC+3, shree wrote:
>>>>>>
>>>>>> I had used the findfonts feature of text2image and found only two
>>>>>> fonts that rendered the xsa text. I will check the fonts that you sent.
>>>>>> What about training text? Unless you have some more text, it will be
>>>>>> difficult to do training.
>>>>>>
>>>>>> Quivira
>>>>>> Segoe UI Historic
>>>>>>
>>>>>> On Sun, Mar 15, 2020, 04:01 aby tesh <[email protected]> wrote:
>>>>>>
>>>>>>> That is what I am not getting. I don't think they are all Unicode
>>>>>>> fonts; I couldn't get one. Some render on my machine (Linux), some don't.
>>>>>>>
>>>>>>> On Saturday, March 14, 2020 at 8:45:46 PM UTC+3, shree wrote:
>>>>>>>>
>>>>>>>> Are all these Unicode fonts?
>>>>>>>>
>>>>>>>> What about training text in utf-8 Unicode encoding?
>>>>>>>>
>>>>>>>> On Sat, Mar 14, 2020, 22:37 aby tesh <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hey shree, I have compiled all relevant fonts and attached them
>>>>>>>>> below. I am not sure how I can generate text data with them.
>>>>>>>>>
>>>>>>>>> On Tuesday, March 10, 2020 at 5:35:26 AM UTC+3, shree wrote:
>>>>>>>>>>
>>>>>>>>>> If you can share a large enough training text and fonts, I can
>>>>>>>>>> rerun the training.
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 10, 2020, 03:41 aby tesh <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey,
>>>>>>>>>>>
>>>>>>>>>>> I followed the steps in the readme file and started
>>>>>>>>>>> lstmtraining, but it seems my current computer's processor can't
>>>>>>>>>>> handle the training for a longer period of time.
>>>>>>>>>>>
>>>>>>>>>>> What can I do about it? When should I abort the training to get
>>>>>>>>>>> a good traineddata file?
>>>>>>>>>>> Or is there one which is accurate that you can share?
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>
>> --
>>
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduVf0OzOPf_yKGZOEShBPcsAmVzR9Hn5c%2BqaCjfBVccFMA%40mail.gmail.com.
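
For readers who want to see the idea behind the transliteration-to-Unicode substitution mentioned at the top of this message, here is a minimal Python sketch. It is not the contents of latin2unicode.sh, which uses sed. The function name to_xsa is hypothetical, the table is deliberately partial (only letters that occur in the sample lines quoted in the thread, plus the three sibilants s¹, s², s³), and it assumes that ʾ is U+02BE, ʿ is U+02BF, and that hyphens and editorial brackets in the transliteration can simply be dropped. Characters are looked up by their Unicode names in the Old South Arabian block.

```python
import re
import unicodedata

# Transliteration token -> Unicode letter name (partial table, illustration only).
NAMES = {
    "h": "HE", "l": "LAMEDH", "m": "MEM", "w": "WAW", "r": "RESH",
    "b": "BETH", "t": "TAW", "n": "NUN", "d": "DALETH", "y": "YODH",
    "\u02be": "ALEF",      # assumed: right half ring
    "\u02bf": "AYN",       # assumed: left half ring
    "\u0121": "GHAYN",     # g with dot above
    "\u1e0f": "DHALETH",   # d with line below
    "s\u00b9": "SAT", "s\u00b2": "SHIN", "s\u00b3": "SAMEKH",
}

# Resolve each name to the actual character; keys are NFC-normalized.
TABLE = {
    unicodedata.normalize("NFC", key):
        unicodedata.lookup("OLD SOUTH ARABIAN LETTER " + name)
    for key, name in NAMES.items()
}

# Longest tokens first, so that "s" plus superscript digit is matched as one unit.
TOKEN = re.compile(
    "|".join(re.escape(k) for k in sorted(TABLE, key=len, reverse=True))
)

# Assumption: hyphens and editorial brackets carry no letters and are removed.
EDITORIAL = str.maketrans("", "", "-()[]")


def to_xsa(text):
    """Convert transliterated text to Old South Arabian Unicode (partial)."""
    text = unicodedata.normalize("NFC", text).lower()
    text = text.translate(EDITORIAL)
    # Anything not in the table (spaces, digits, dots) passes through unchanged.
    return TOKEN.sub(lambda m: TABLE[m.group(0)], text)


if __name__ == "__main__":
    # Print code points rather than glyphs, so no special font is needed.
    print(" ".join("U+%04X" % ord(c) for c in to_xsa("bny")))
```

A complete converter would need the full 29-letter table and a decision about how to treat lacunae such as "[ ... ...]", which this sketch leaves as dots.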

