Re: [XeTeX] Devanagari ASCII to Unicode mapping
2018-02-22 11:44 GMT+01:00 Philip Taylor (RHUoL):

> Daniel Greenhoe wrote:
>
>> I think the conclusion is that I was going about the problem the wrong
>> way---that there is no one-to-one mapping between the Devanagari ASCII
>> font and unicode font. Rather, it is many-to-one.
>
> Is the problem not, in fact, that there is not one "Devanagari ASCII font"
> but rather many, for each of which there is potentially a different mapping
> required ?

Yes, there are many fonts with non-Unicode proprietary encodings. The web sites with such fonts offer a download of a Windows executable which installs these fonts into Windows, so I have never managed to view such pages on Linux. It is not difficult to define the mapping for TECkit if you know the encoding.

> Philip Taylor

Zdeněk Wagner
http://ttsm.icpf.cas.cz/team/wagner.shtml
http://icebearsoft.euweb.cz

--
Subscriptions, Archive, and List information, etc.:
http://tug.org/mailman/listinfo/xetex
Re: [XeTeX] Devanagari ASCII to Unicode mapping
Daniel Greenhoe wrote:

> I think the conclusion is that I was going about the problem the wrong
> way---that there is no one-to-one mapping between the Devanagari ASCII
> font and unicode font. Rather, it is many-to-one.

Is the problem not, in fact, that there is not one "Devanagari ASCII font" but rather many, for each of which there is potentially a different mapping required?

Philip Taylor
Re: [XeTeX] Devanagari ASCII to Unicode mapping
I think this is a TECkit converter for the Preeti font:
https://github.com/silnrsi/wsresources/tree/master/scripts/Deva/legacy/sag-preeti/mappings

Lorna

Original Message
Subject: Re: [XeTeX] Devanagari ASCII to Unicode mapping
From: ShreeDevi Kumar <shreesh...@gmail.com>
To: XeTeX (Unicode-based TeX) discussion. <xetex@tug.org>
Date: 2/17/2018 11:11 AM

> Please see view-source:http://hindi-fonts.com/tools/Preeti-to-Unicode-Converter
>
> There is no direct mapping, but array_one has the ASCII codes for Preeti, while array_two has the corresponding unicode.
>
> ShreeDevi
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Sat, Feb 17, 2018 at 10:32 PM, ShreeDevi Kumar <shreesh...@gmail.com> wrote:
>
>>> What I think I am looking for is something that would map a document typeset using something like the Devanagari Preeti font (https://fonts2u.com/preeti.font), which seems to have the Devanagari glyphs encoded in the range 0x00-0x7F, to something like the Devanagari unicode font Mukta (https://ektype.in/scripts/devanagari/mukta.html) in the range 0x0900-0x097F.
>>
>> Please try http://www.ashesh.com.np/preeti-unicode/
>>
>> Also see https://github.com/Shuvayatra/preeti
>>
>> ShreeDevi
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Sat, Feb 17, 2018 at 10:27 PM, Mike Maxwell <maxw...@umiacs.umd.edu> wrote:
>>
>>> On 2/17/2018 11:08 AM, Daniel Greenhoe wrote:
>>>
>>>> Does anyone know where I can find an ASCII to Unicode mapping for Devanagari?
>>>>
>>>> For example, it seems that the Devanagari glyph "ब" is encoded as 0x61 (hex) in ASCII (lower case 'a' for the Latin alphabet), but is 0x092C in the Unicode standard: http://www.unicode.org/charts/PDF/U0900.pdf
>>>>
>>>> So what I am asking for is a map (or table) that maps 0x00-0x7F in Devanagari ASCII to 0x0900-0x097F in Unicode.
>>>
>>> In addition to the ASCII-to-Devanagari transcription system that Philip Taylor mentioned, you may be interested in the ISCII encoding for Brahmi-derived writing systems, including Devanagari:
>>>
>>> https://en.wikipedia.org/wiki/Indian_Script_Code_for_Information_Interchange
>>>
>>> This is _not_ an ASCII-to-Devanagari encoding, rather it leaves the ASCII range intact, and encodes Devanagari (etc.) in the range 128 (actually, 161)-255. It was afaik never widely used, but there were (and probably still are) fonts for it. I don't imagine those fonts would be terribly high quality by today's standards, e.g. I'd be surprised if they handled conjunct characters.
>>>
>>> FWIW, there was a similar encoding called TSCII for Tamil.
>>>
>>> iconv can be used to map TSCII to other encodings, but for some reason it doesn't seem to have ISCII in its repertoire (it does include VISCII, but that's a legacy Vietnamese encoding).
>>> --
>>> Mike Maxwell
Re: [XeTeX] Devanagari ASCII to Unicode mapping
On 2/18/2018 4:10 AM, ShreeDevi Kumar wrote:

>> The LDC *might* still have the encoding converters laying around somewhere.
>
> These will be very useful, if they can be made available. There is a need for easily converting legacy documents to Unicode. One of the applications for which someone was looking for these recently was for checking for plagiarism in student projects/theses.

I'd suggest contacting them. Their website is ldc.upenn.edu. There's a "Contact us" tab near the upper right-hand corner of their page.

--
Mike Maxwell
"My definition of an interesting universe is one that has the capacity to study itself."
--Stephen Eastmond
Re: [XeTeX] Devanagari ASCII to Unicode mapping
Thank you for this info. There is still a lot of content in Hindi being generated in non-Unicode fonts (a lot of the DTP software being used in India still does not support Unicode).

>> The LDC *might* still have the encoding converters laying around somewhere.

These will be very useful, if they can be made available. There is a need for easily converting legacy documents to Unicode. One of the applications for which someone was looking for these recently was for checking for plagiarism in student projects/theses.

ShreeDevi
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Feb 17, 2018 at 10:45 PM, Mike Maxwell wrote:

> On 2/17/2018 11:58 AM, ShreeDevi Kumar wrote:
>
>> Before unicode, devanagari fonts used the ASCII range (legacy fonts) -
>> however AFAIK there is no standardization in the mapping, though various
>> families of fonts had similar mapping.
>>
>> see http://hindi-fonts.com/tools for converters from different mappings
>> to unicode.
>>
>> So, ASCII to Unicode mapping for Devanagari will change based on the
>> font used.
>
> Indeed! In 2003, DARPA held a "surprise language exercise", the goal of
> which was to produce (very basic) MT etc. tools for Hindi, in a month's
> time. I had been involved in the prep for it to ensure that there would be
> no roadblocks (at the time, I was working at the LDC). One of the things
> that Bill Poser and I verified was that there was a Unicode encoding for
> Hindi/Devanagari. There was, but that was the wrong question.
>
> The right question was whether any Hindi website used Unicode. The answer
> to that was that the BBC and Colgate did, but hardly anyone else. A few
> Indian government sites used ISCII, which wouldn't have been bad, but most
> places used proprietary encodings that went along with a proprietary font.
> Worse, these were not simple code-point-to-character encodings; it was as
> if the Latin letter 'l' had been encoded as 'l', but then 'd' had been
> encoded as 'c' + 'l', 'b' as 'l' + a sort of backwards 'c', 'p' as a
> lowered 'l' + the backwards 'c', etc. It was a mess, and for a while it was
> unclear whether the exercise would fail because most of the data we needed
> was in these weird proprietary encodings. (It eventually succeeded.)
>
> There are some notes here--
> http://languagelog.ldc.upenn.edu/myl/ldc/hindi_fonts_and_conversions.html
> --that Mark Liberman of the LDC made at the time concerning some of the
> issues. Most of it is long out of date (and the links are probably
> broken), and these proprietary encodings have thankfully been replaced by
> Unicode; but if you're dealing with documents from that era, you might
> still run into them. The LDC *might* still have the encoding converters
> laying around somewhere.
> --
> Mike Maxwell
Re: [XeTeX] Devanagari ASCII to Unicode mapping
On 2/17/2018 11:58 AM, ShreeDevi Kumar wrote:

> Before unicode, devanagari fonts used the ASCII range (legacy fonts) -
> however AFAIK there is no standardization in the mapping, though various
> families of fonts had similar mapping.
>
> see http://hindi-fonts.com/tools for converters from different mappings
> to unicode.
>
> So, ASCII to Unicode mapping for Devanagari will change based on the
> font used.

Indeed! In 2003, DARPA held a "surprise language exercise", the goal of which was to produce (very basic) MT etc. tools for Hindi, in a month's time. I had been involved in the prep for it to ensure that there would be no roadblocks (at the time, I was working at the LDC). One of the things that Bill Poser and I verified was that there was a Unicode encoding for Hindi/Devanagari. There was, but that was the wrong question.

The right question was whether any Hindi website used Unicode. The answer to that was that the BBC and Colgate did, but hardly anyone else. A few Indian government sites used ISCII, which wouldn't have been bad, but most places used proprietary encodings that went along with a proprietary font. Worse, these were not simple code-point-to-character encodings; it was as if the Latin letter 'l' had been encoded as 'l', but then 'd' had been encoded as 'c' + 'l', 'b' as 'l' + a sort of backwards 'c', 'p' as a lowered 'l' + the backwards 'c', etc. It was a mess, and for a while it was unclear whether the exercise would fail because most of the data we needed was in these weird proprietary encodings. (It eventually succeeded.)

There are some notes here--
http://languagelog.ldc.upenn.edu/myl/ldc/hindi_fonts_and_conversions.html
--that Mark Liberman of the LDC made at the time concerning some of the issues. Most of it is long out of date (and the links are probably broken), and these proprietary encodings have thankfully been replaced by Unicode; but if you're dealing with documents from that era, you might still run into them. The LDC *might* still have the encoding converters laying around somewhere.

--
Mike Maxwell
"My definition of an interesting universe is one that has the capacity to study itself."
--Stephen Eastmond
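[Editorial sketch] The point that these encodings were not simple code-point-to-character mappings can be illustrated with a toy longest-match converter. This is only a sketch of the general idea, not any real font's table: `TOY_TABLE` and `build_converter` are hypothetical names, the table reuses the Latin analogy from the message above, and the character ")" stands in for the "backwards 'c'".

```python
import re

# Hypothetical toy table modelled on the Latin analogy in the message above:
# 'd' is stored as 'c' + 'l', 'b' as 'l' + ')' (a stand-in for the backwards 'c'),
# and a lone 'l' is just 'l'. No real legacy font table is reproduced here.
TOY_TABLE = {
    "cl": "d",
    "l)": "b",
    "l": "l",
}


def build_converter(table):
    """Return a function that rewrites text using longest-match-first lookup."""
    # Longer keys must be tried first, otherwise 'l' would consume part of 'l)'.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return lambda text: pattern.sub(lambda m: table[m.group(0)], text)


convert = build_converter(TOY_TABLE)
print(convert("clal)"))  # characters without a table entry pass through unchanged
```

A per-character lookup cannot express this, because the meaning of 'l' depends on its neighbours; that is why a flat 0x00-0x7F table is not sufficient for such fonts.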
Re: [XeTeX] Devanagari ASCII to Unicode mapping
Please see view-source:http://hindi-fonts.com/tools/Preeti-to-Unicode-Converter

There is no direct mapping, but array_one has the ASCII codes for Preeti, while array_two has the corresponding unicode.

ShreeDevi
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Feb 17, 2018 at 10:32 PM, ShreeDevi Kumar wrote:

>> What I think I am looking for is something that would map a document
>> typeset using something like the Devanagari Preeti font
>> (https://fonts2u.com/preeti.font), which seems to have the Devanagari
>> glyphs encoded in the range 0x00-0x7F, to something like the
>> Devanagari unicode font Mukta
>> (https://ektype.in/scripts/devanagari/mukta.html) in the range
>> 0x0900-0x097F.
>
> Please try http://www.ashesh.com.np/preeti-unicode/
>
> Also see https://github.com/Shuvayatra/preeti
>
> ShreeDevi
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Sat, Feb 17, 2018 at 10:27 PM, Mike Maxwell wrote:
>
>> On 2/17/2018 11:08 AM, Daniel Greenhoe wrote:
>>
>>> Does anyone know where I can find an ASCII to Unicode mapping for
>>> Devanagari?
>>>
>>> For example, it seems that the Devanagari glyph "ब" is encoded as
>>> 0x61 (hex) in ASCII (lower case 'a' for the Latin alphabet), but is
>>> 0x092C in the Unicode standard:
>>> http://www.unicode.org/charts/PDF/U0900.pdf
>>>
>>> So what I am asking for is a map (or table) that maps 0x00-0x7F in
>>> Devanagari ASCII to 0x0900-0x097F in Unicode.
>>
>> In addition to the ASCII-to-Devanagari transcription system that Philip
>> Taylor mentioned, you may be interested in the ISCII encoding for
>> Brahmi-derived writing systems, including Devanagari:
>>
>> https://en.wikipedia.org/wiki/Indian_Script_Code_for_Information_Interchange
>>
>> This is _not_ an ASCII-to-Devanagari encoding, rather it leaves the ASCII
>> range intact, and encodes Devanagari (etc.) in the range 128 (actually,
>> 161)-255. It was afaik never widely used, but there were (and probably
>> still are) fonts for it. I don't imagine those fonts would be terribly
>> high quality by today's standards, e.g. I'd be surprised if they handled
>> conjunct characters.
>>
>> FWIW, there was a similar encoding called TSCII for Tamil.
>>
>> iconv can be used to map TSCII to other encodings, but for some reason it
>> doesn't seem to have ISCII in its repertoire (it does include VISCII, but
>> that's a legacy Vietnamese encoding).
>> --
>> Mike Maxwell
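[Editorial sketch] The parallel-array idea mentioned at the top of this message (array_one holding the legacy codes, array_two the corresponding Unicode) might be applied roughly as below. The contents of the real arrays are not reproduced in this thread, so the three entries here are placeholders built only from pairings that posters in this thread reported ('a' for "ब", 0x21 for ९, 0x23 for ३); `convert_with_arrays` is a hypothetical helper, and whether the web page applies its arrays in exactly this way is an assumption.

```python
# Placeholder parallel arrays; the real converter's arrays are much longer
# and are NOT reproduced here. Entries come from pairings quoted in this thread.
array_one = ["a", "!", "#"]                      # legacy (ASCII-range) codes
array_two = ["\u092C", "\u096F", "\u0969"]       # corresponding Unicode text


def convert_with_arrays(text, sources=array_one, targets=array_two):
    """Replace each legacy string with its Unicode counterpart, in array order."""
    if len(sources) != len(targets):
        raise ValueError("parallel arrays must have the same length")
    for src, dst in zip(sources, targets):
        text = text.replace(src, dst)
    return text


print(convert_with_arrays("a!"))
```

Because replacements run in array order, multi-character legacy sequences would need to appear before the single characters they contain; that ordering requirement is the main thing to check when adapting a table like this.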
Re: [XeTeX] Devanagari ASCII to Unicode mapping
> What I think I am looking for is something that would map a document typeset using something like the Devanagari Preeti font (https://fonts2u.com/preeti.font), which seems to have the Devanagari glyphs encoded in the range 0x00-0x7F, to something like the Devanagari unicode font Mukta (https://ektype.in/scripts/devanagari/mukta.html) in the range 0x0900-0x097F.

Please try http://www.ashesh.com.np/preeti-unicode/

Also see https://github.com/Shuvayatra/preeti

ShreeDevi
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Feb 17, 2018 at 10:27 PM, Mike Maxwell wrote:

> On 2/17/2018 11:08 AM, Daniel Greenhoe wrote:
>
>> Does anyone know where I can find an ASCII to Unicode mapping for
>> Devanagari?
>>
>> For example, it seems that the Devanagari glyph "ब" is encoded as
>> 0x61 (hex) in ASCII (lower case 'a' for the Latin alphabet), but is
>> 0x092C in the Unicode standard:
>> http://www.unicode.org/charts/PDF/U0900.pdf
>>
>> So what I am asking for is a map (or table) that maps 0x00-0x7F in
>> Devanagari ASCII to 0x0900-0x097F in Unicode.
>
> In addition to the ASCII-to-Devanagari transcription system that Philip
> Taylor mentioned, you may be interested in the ISCII encoding for
> Brahmi-derived writing systems, including Devanagari:
>
> https://en.wikipedia.org/wiki/Indian_Script_Code_for_Information_Interchange
>
> This is _not_ an ASCII-to-Devanagari encoding, rather it leaves the ASCII
> range intact, and encodes Devanagari (etc.) in the range 128 (actually,
> 161)-255. It was afaik never widely used, but there were (and probably
> still are) fonts for it. I don't imagine those fonts would be terribly
> high quality by today's standards, e.g. I'd be surprised if they handled
> conjunct characters.
>
> FWIW, there was a similar encoding called TSCII for Tamil.
>
> iconv can be used to map TSCII to other encodings, but for some reason it
> doesn't seem to have ISCII in its repertoire (it does include VISCII, but
> that's a legacy Vietnamese encoding).
> --
> Mike Maxwell
Re: [XeTeX] Devanagari ASCII to Unicode mapping
> For example, it seems that the Devanagari glyph "ब" is encoded as 0x61 (hex) in ASCII (lower case 'a' for the Latin alphabet),

Before unicode, devanagari fonts used the ASCII range (legacy fonts) - however AFAIK there is no standardization in the mapping, though various families of fonts had similar mapping.

see http://hindi-fonts.com/tools for converters from different mappings to unicode.

So, ASCII to Unicode mapping for Devanagari will change based on the font used.

ShreeDevi
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Feb 17, 2018 at 10:04 PM, Philip Taylor wrote:

> Daniel Greenhoe wrote:
>
>> Does anyone know where I can find an ASCII to Unicode mapping for
>> Devanagari?
>
> Would this be of any help ?
>
> https://clas.uiowa.edu/linguistics/hindi-verb-project/ascii-devanagari-chart
>
> Philip Taylor
Re: [XeTeX] Devanagari ASCII to Unicode mapping
On 2/17/2018 11:08 AM, Daniel Greenhoe wrote:

> Does anyone know where I can find an ASCII to Unicode mapping for Devanagari?
>
> For example, it seems that the Devanagari glyph "ब" is encoded as 0x61 (hex) in ASCII (lower case 'a' for the Latin alphabet), but is 0x092C in the Unicode standard: http://www.unicode.org/charts/PDF/U0900.pdf
>
> So what I am asking for is a map (or table) that maps 0x00-0x7F in Devanagari ASCII to 0x0900-0x097F in Unicode.

In addition to the ASCII-to-Devanagari transcription system that Philip Taylor mentioned, you may be interested in the ISCII encoding for Brahmi-derived writing systems, including Devanagari:

https://en.wikipedia.org/wiki/Indian_Script_Code_for_Information_Interchange

This is _not_ an ASCII-to-Devanagari encoding, rather it leaves the ASCII range intact, and encodes Devanagari (etc.) in the range 128 (actually, 161)-255. It was afaik never widely used, but there were (and probably still are) fonts for it. I don't imagine those fonts would be terribly high quality by today's standards, e.g. I'd be surprised if they handled conjunct characters.

FWIW, there was a similar encoding called TSCII for Tamil.

iconv can be used to map TSCII to other encodings, but for some reason it doesn't seem to have ISCII in its repertoire (it does include VISCII, but that's a legacy Vietnamese encoding).

--
Mike Maxwell
"My definition of an interesting universe is one that has the capacity to study itself."
--Stephen Eastmond
Re: [XeTeX] Devanagari ASCII to Unicode mapping
> https://clas.uiowa.edu/linguistics/hindi-verb-project/ascii-devanagari-chart

That one looks to be more like an input tool (like a teckit mapping) for Devanagari.

What I think I am looking for is something that would map a document typeset using something like the Devanagari Preeti font (https://fonts2u.com/preeti.font), which seems to have the Devanagari glyphs encoded in the range 0x00-0x7F, to something like the Devanagari unicode font Mukta (https://ektype.in/scripts/devanagari/mukta.html) in the range 0x0900-0x097F.

In short, I would maybe like a simple map something like this:

0x21 --> 0x096F (९)
0x22 --> 0x0942
0x23 --> 0x0969 (३)
0x24 --> 0x096A (४)
0x25 --> 0x096B (५)
0x26 --> 0x096D (७)
...

On Sat, Feb 17, 2018 at 4:34 PM, Philip Taylor wrote:

> Daniel Greenhoe wrote:
>>
>> Does anyone know where I can find an ASCII to Unicode mapping for
>> Devanagari?
>
> Would this be of any help ?
>
> https://clas.uiowa.edu/linguistics/hindi-verb-project/ascii-devanagari-chart
>
> Philip Taylor
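[Editorial sketch] A flat table of the kind asked for in this message could be expressed in Python with `str.translate`. The six entries are copied from the message as posted and have not been checked against the Preeti font itself; the rest of the table is unknown here, and `PARTIAL_MAP` and `convert_simple` are hypothetical names. As other replies in the thread point out, a per-character table like this may not be enough for fonts that build one letter from several codes.

```python
# Partial table copied from the message above (0x21-0x26 only).
# The remaining entries are not known from this thread and are left out.
PARTIAL_MAP = {
    0x21: 0x096F,  # DEVANAGARI DIGIT NINE
    0x22: 0x0942,  # DEVANAGARI VOWEL SIGN UU
    0x23: 0x0969,  # DEVANAGARI DIGIT THREE
    0x24: 0x096A,  # DEVANAGARI DIGIT FOUR
    0x25: 0x096B,  # DEVANAGARI DIGIT FIVE
    0x26: 0x096D,  # DEVANAGARI DIGIT SEVEN
}


def convert_simple(text: str) -> str:
    """Map each legacy code point through the table; unmapped ones are kept."""
    return text.translate(PARTIAL_MAP)


print(convert_simple("!#$"))
```

`str.translate` accepts a dictionary from code point to code point, so extending the sketch only means adding entries for whichever font is actually in use.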
Re: [XeTeX] Devanagari ASCII to Unicode mapping
Daniel Greenhoe wrote:

> Does anyone know where I can find an ASCII to Unicode mapping for Devanagari?

Would this be of any help?

https://clas.uiowa.edu/linguistics/hindi-verb-project/ascii-devanagari-chart

Philip Taylor