Re: [XeTeX] Devanagari ASCII to Unicode mapping

2018-02-22 Thread Zdenek Wagner
2018-02-22 11:44 GMT+01:00 Philip Taylor (RHUoL) :

>
>
> Daniel Greenhoe wrote:
>
>> I think the conclusion is that I was going about the problem the wrong
>> way---that there is no one-to-one mapping between the Devanagari ASCII
>> font and unicode font. Rather, it is many-to-one.
>>
> Is the problem not, in fact, that there is not one "Devanagari ASCII font"
> but rather many, for each of which there is potentially a different mapping
> required ?
>

Yes, there are many fonts with non-Unicode proprietary encodings. The web
sites offering such fonts provide a Windows executable for download that
installs the fonts into Windows, so I have never managed to view such pages
on Linux. It is not difficult to define the mapping for TECkit if you know
the encoding.


> Philip Taylor
>
>
>
> Zdeněk Wagner
> http://ttsm.icpf.cas.cz/team/wagner.shtml
> http://icebearsoft.euweb.cz
>
>
>
> --
> Subscriptions, Archive, and List information, etc.:
>  http://tug.org/mailman/listinfo/xetex
>


--
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex


Re: [XeTeX] Devanagari ASCII to Unicode mapping

2018-02-22 Thread Philip Taylor (RHUoL)



Daniel Greenhoe wrote:

> I think the conclusion is that I was going about the problem the wrong
> way---that there is no one-to-one mapping between the Devanagari ASCII
> font and unicode font. Rather, it is many-to-one.

Is the problem not, in fact, that there is not one "Devanagari ASCII
font" but rather many, for each of which there is potentially a
different mapping required?

Philip Taylor




Re: [XeTeX] Devanagari ASCII to Unicode mapping

2018-02-21 Thread Lorna Evans

I think this is a TECkit converter for the Preeti font:

https://github.com/silnrsi/wsresources/tree/master/scripts/Deva/legacy/sag-preeti/mappings

Lorna


-------- Original Message --------
Subject: Re: [XeTeX] Devanagari ASCII to Unicode mapping
From: ShreeDevi Kumar <shreesh...@gmail.com>
To: XeTeX (Unicode-based TeX) discussion. <xetex@tug.org>
Date: 2/17/2018 11:11 AM

Please see

view-source:http://hindi-fonts.com/tools/Preeti-to-Unicode-Converter

There is no direct mapping, but array_one has the ASCII codes for
Preeti, while array_two has the corresponding unicode.


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Feb 17, 2018 at 10:32 PM, ShreeDevi Kumar
<shreesh...@gmail.com> wrote:


> What I think I am looking for is something that would map a document
typeset using something like the Devanagari Preeti font
(https://fonts2u.com/preeti.font), which seems to have the Devanagari
glyphs encoded in the range 0x00-0x7F, to something like the
Devanagari unicode font Mukta
(https://ektype.in/scripts/devanagari/mukta.html) in the range
0x0900-0x097F.

Please try http://www.ashesh.com.np/preeti-unicode/

Also see

https://github.com/Shuvayatra/preeti

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Feb 17, 2018 at 10:27 PM, Mike Maxwell
<maxw...@umiacs.umd.edu> wrote:

On 2/17/2018 11:08 AM, Daniel Greenhoe wrote:

Does anyone know where I can find an ASCII to Unicode mapping for
Devanagari?

For example, it seems that the Devanagari glyph "ब" is encoded as
0x61 (hex) in ASCII (lower case 'a' for the Latin alphabet), but is
0x092C in the Unicode standard:
http://www.unicode.org/charts/PDF/U0900.pdf

So what I am asking for is a map (or table) that maps 0x00-0x7F in
Devanagari ASCII to 0x0900-0x097F in Unicode.


In addition to the ASCII-to-Devanagari transcription system
that Philip Taylor mentioned, you may be interested in the
ISCII encoding for Brahmi-derived writing systems, including
Devanagari:


https://en.wikipedia.org/wiki/Indian_Script_Code_for_Information_Interchange

This is _not_ an ASCII-to-Devanagari encoding, rather it
leaves the ASCII range intact, and encodes Devanagari (etc.)
in the range 128 (actually, 161)-255.  It was afaik never
widely used, but there were (and probably still are) fonts for
it.  I don't imagine those fonts would be terribly high
quality by today's standards, e.g. I'd be surprised if they
handled conjunct characters.

FWIW, there was a similar encoding called TSCII for Tamil.

iconv can be used to map TSCII to other encodings, but for
some reason it doesn't seem to have ISCII in its repertoire
(it does include VISCII, but that's a legacy Vietnamese encoding).
-- 
   Mike Maxwell

   "My definition of an interesting universe is
   one that has the capacity to study itself."
         --Stephen Eastmond





Re: [XeTeX] Devanagari ASCII to Unicode mapping

2018-02-18 Thread Mike Maxwell

On 2/18/2018 4:10 AM, ShreeDevi Kumar wrote:
 >> The LDC *might* still have the encoding converters laying around 
somewhere.


These will be very useful, if they can be made available. There is a 
need for easily converting legacy documents to Unicode. One of the 
applications for which someone was looking for these recently was for 
checking for plagiarism in student projects/thesis.


I'd suggest contacting them.  Their website is
ldc.upenn.edu
There's a "Contact us" tab near the upper right-hand corner of their page.
--
   Mike Maxwell
   "My definition of an interesting universe is
   one that has the capacity to study itself."
 --Stephen Eastmond




Re: [XeTeX] Devanagari ASCII to Unicode mapping

2018-02-18 Thread ShreeDevi Kumar
Thank you for this info.

There is still a lot of content in Hindi being generated in non-Unicode
fonts (a lot of the DTP software used in India still does not support
Unicode).

>> The LDC *might* still have the encoding converters laying around
somewhere.

These will be very useful, if they can be made available. There is a need
for easily converting legacy documents to Unicode. One of the applications
for which someone was looking for these recently was checking for
plagiarism in student projects and theses.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Feb 17, 2018 at 10:45 PM, Mike Maxwell 
wrote:

> On 2/17/2018 11:58 AM, ShreeDevi Kumar wrote:
>
>> Before unicode, devanagari fonts used the ASCII range (legacy fonts) -
>> however AFAIK there is no standardization in the mapping, though various
>> families of fonts had similar mapping.
>>
>> see http://hindi-fonts.com/tools for converters from different mappings
>> to unicode.
>>
>> So,  ASCII to Unicode mapping for Devanagari will change based on the
>> font used.
>>
>
> Indeed!  In 2003, DARPA held a "surprise language exercise", the goal of
> which was to produce (very basic) MT etc. tools for Hindi, in a month's
> time.  I had been involved in the prep for it to ensure that there would be
> no roadblocks (at the time, I was working at the LDC).  One of the things
> that Bill Poser and I verified was that there was a Unicode encoding for
> Hindi/Devanagari.  There was, but that was the wrong question.
>
> The right question was whether any Hindi website used Unicode.  The answer
> to that was that the BBC and Colgate did, but hardly anyone else.  A few
> Indian government sites used ISCII, which wouldn't have been bad, but most
> places used proprietary encodings that went along with a proprietary font.
> Worse, these were not simple code-point-to-character encodings; it was as
> if the Latin letter 'l' had been encoded as 'l', but then 'd' had been
> encoded as 'c' + 'l', 'b' as 'l' + a sort of backwards 'c', 'p' as a
> lowered 'l' + the backwards 'c', etc.  It was a mess, and for a while it was
> unclear whether the exercise would fail because most of the data we needed
> was in these weird proprietary encodings.  (It eventually succeeded.)
>
> There are some notes here--
>
> http://languagelog.ldc.upenn.edu/myl/ldc/hindi_fonts_and_conversions.html
> --that Mark Liberman of the LDC made at the time concerning some of the
> issues.  Most of it is long out of date (and the links are probably
> broken), and these proprietary encodings have thankfully been replaced by
> Unicode; but if you're dealing with documents from that era, you might
> still run into them.  The LDC *might* still have the encoding converters
> laying around somewhere.
> --
>Mike Maxwell
>"My definition of an interesting universe is
>one that has the capacity to study itself."
>  --Stephen Eastmond
>




Re: [XeTeX] Devanagari ASCII to Unicode mapping

2018-02-17 Thread Mike Maxwell

On 2/17/2018 11:58 AM, ShreeDevi Kumar wrote:
Before unicode, devanagari fonts used the ASCII range (legacy fonts) - 
however AFAIK there is no standardization in the mapping, though various 
families of fonts had similar mapping.


see http://hindi-fonts.com/tools for converters from different mappings 
to unicode.


So,  ASCII to Unicode mapping for Devanagari will change based on the 
font used.


Indeed!  In 2003, DARPA held a "surprise language exercise", the goal of 
which was to produce (very basic) MT etc. tools for Hindi, in a month's 
time.  I had been involved in the prep for it to ensure that there would 
be no roadblocks (at the time, I was working at the LDC).  One of the 
things that Bill Poser and I verified was that there was a Unicode 
encoding for Hindi/Devanagari.  There was, but that was the wrong 
question.


The right question was whether any Hindi website used Unicode.  The 
answer to that was that the BBC and Colgate did, but hardly anyone else. 
 A few Indian government sites used ISCII, which wouldn't have been 
bad, but most places used proprietary encodings that went along with a 
proprietary font.  Worse, these were not simple code-point-to-character 
encodings; it was as if the Latin letter 'l' had been encoded as 'l', 
but then 'd' had been encoded as 'c' + 'l', 'b' as 'l' + a sort of 
backwards 'c', 'p' as a lowered 'l' + the backwards 'c', etc.  It was a 
mess, and for a while it was unclear whether the exercise would fail 
because most of the data we needed was in these weird proprietary 
encodings.  (It eventually succeeded.)
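
To make the Latin analogy above concrete, here is a rough Python sketch of decoding a stream of glyph pieces into letters by greedy longest match. Everything in it is illustrative: the piece names ("backwards-c", "lowered-l"), the COMPOSED table and the decode function are invented for this example, and the entry for a bare 'c' is an assumption that the analogy does not state. It is not how any particular converter worked.

```python
# Pieces are named strings, not real code points; this only mirrors the
# Latin-letter analogy and does not describe any actual font encoding.
COMPOSED = {
    ("c", "l"): "d",
    ("l", "backwards-c"): "b",
    ("lowered-l", "backwards-c"): "p",
    ("l",): "l",
    ("c",): "c",  # assumption: a bare 'c' piece also stands for the letter c
}
MAX_LEN = max(len(key) for key in COMPOSED)


def decode(pieces):
    """Turn a list of glyph pieces into letters, trying the longest sequence first."""
    out = []
    i = 0
    while i < len(pieces):
        for n in range(min(MAX_LEN, len(pieces) - i), 0, -1):
            key = tuple(pieces[i:i + n])
            if key in COMPOSED:
                out.append(COMPOSED[key])
                i += n
                break
        else:
            # No sequence starting at this piece is known.
            raise ValueError(f"no mapping for piece {pieces[i]!r}")
    return "".join(out)


print(decode(["c", "l", "l", "backwards-c", "l"]))  # prints dbl
```

Note the ambiguity this exposes: 'c' followed by 'l' could have meant the two letters "cl" or the single letter "d", and longest match always picks "d". A per-character lookup table cannot express this at all, which is why these encodings were harder to convert than a simple code-point-to-character scheme.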


There are some notes here--

http://languagelog.ldc.upenn.edu/myl/ldc/hindi_fonts_and_conversions.html
--that Mark Liberman of the LDC made at the time concerning some of the 
issues.  Most of it is long out of date (and the links are probably 
broken), and these proprietary encodings have thankfully been replaced 
by Unicode; but if you're dealing with documents from that era, you 
might still run into them.  The LDC *might* still have the encoding 
converters laying around somewhere.

--
   Mike Maxwell
   "My definition of an interesting universe is
   one that has the capacity to study itself."
 --Stephen Eastmond




Re: [XeTeX] Devanagari ASCII to Unicode mapping

2018-02-17 Thread ShreeDevi Kumar
Please see

view-source:http://hindi-fonts.com/tools/Preeti-to-Unicode-Converter

There is no direct mapping, but  array_one has the ASCII codes for Preeti,
while array_two has the corresponding unicode.
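
A rough sketch of how two parallel arrays like these could be applied, written in Python rather than the page's own script. The array contents below are placeholders and were not copied from the converter: only the pairing of "a" with U+092C comes from this thread, the other two entries are unverified examples, and build_converter is a made-up helper name.

```python
import re

# Placeholder data: the real arrays are in the converter's page source.
array_one = ["a", "cf", "c"]                 # legacy codes (illustrative)
array_two = ["\u092c", "\u0906", "\u0905"]   # Unicode strings (illustrative pairing)


def build_converter(sources, targets):
    """Pair the arrays by index and return a function that converts text."""
    if len(sources) != len(targets):
        raise ValueError("arrays must have the same length")
    table = dict(zip(sources, targets))
    # Longest sequences first, so that "cf" is tried before "c".
    ordered = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in ordered))
    return lambda text: pattern.sub(lambda m: table[m.group(0)], text)


convert = build_converter(array_one, array_two)
print(convert("cfa"))  # prints आब
```

Doing the replacement in a single pass means already-converted output is never matched again, which is one reason to prefer it over a chain of successive replace calls.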

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Feb 17, 2018 at 10:32 PM, ShreeDevi Kumar 
wrote:

> > What I think I am looking for is something that would map a document
> typeset using something like the Devanagari Preeti font
> (https://fonts2u.com/preeti.font), which seems to have the Devanagari
> glyphs encoded in the range 0x00-0x7F, to something like the
> Devanagari unicode font Mukta
> (https://ektype.in/scripts/devanagari/mukta.html) in the range
> 0x0900-0x097F.
>
> Please try http://www.ashesh.com.np/preeti-unicode/
>
> Also see
>
> https://github.com/Shuvayatra/preeti
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Sat, Feb 17, 2018 at 10:27 PM, Mike Maxwell 
> wrote:
>
>> On 2/17/2018 11:08 AM, Daniel Greenhoe wrote:
>>
>>> Does anyone know where I can find an ASCII to Unicode mapping for
>>> Devanagari?
>>>
>>> For example, it seems that the Devanagari  glyph "ब" is encoded as
>>> 0x61 (hex) in ASCII (lower case 'a' for the Latin alphabet), but is
>>> 0x092C in the Unicode standard:
>>>http://www.unicode.org/charts/PDF/U0900.pdf
>>>
>>> So what I am asking for is a map (or table) that maps 0x00-0x7F in
>>> Devanagari ASCII to 0x0900-0x097F in Unicode.
>>>
>>
>> In addition to the ASCII-to-Devanagari transcription system that Philip
>> Taylor mentioned, you may be interested in the ISCII encoding for
>> Brahmi-derived writing systems, including Devanagari:
>>
>> https://en.wikipedia.org/wiki/Indian_Script_Code_for_Information_Interchange
>>
>> This is _not_ an ASCII-to-Devanagari encoding, rather it leaves the ASCII
>> range intact, and encodes Devanagari (etc.) in the range 128 (actually,
>> 161)-255.  It was afaik never widely used, but there were (and probably
>> still are) fonts for it.  I don't imagine those fonts would be terribly
>> high quality by today's standards, e.g. I'd be surprised if they handled
>> conjunct characters.
>>
>> FWIW, there was a similar encoding called TSCII for Tamil.
>>
>> iconv can be used to map TSCII to other encodings, but for some reason it
>> doesn't seem to have ISCII in its repertoire (it does include VISCII, but
>> that's a legacy Vietnamese encoding).
>> --
>>Mike Maxwell
>>"My definition of an interesting universe is
>>one that has the capacity to study itself."
>>  --Stephen Eastmond
>>
>>
>>


Re: [XeTeX] Devanagari ASCII to Unicode mapping

2018-02-17 Thread ShreeDevi Kumar
> What I think I am looking for is something that would map a document
typeset using something like the Devanagari Preeti font
(https://fonts2u.com/preeti.font), which seems to have the Devanagari
glyphs encoded in the range 0x00-0x7F, to something like the
Devanagari unicode font Mukta
(https://ektype.in/scripts/devanagari/mukta.html) in the range
0x0900-0x097F.

Please try http://www.ashesh.com.np/preeti-unicode/

Also see

https://github.com/Shuvayatra/preeti

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Feb 17, 2018 at 10:27 PM, Mike Maxwell 
wrote:

> On 2/17/2018 11:08 AM, Daniel Greenhoe wrote:
>
>> Does anyone know where I can find an ASCII to Unicode mapping for
>> Devanagari?
>>
>> For example, it seems that the Devanagari  glyph "ब" is encoded as
>> 0x61 (hex) in ASCII (lower case 'a' for the Latin alphabet), but is
>> 0x092C in the Unicode standard:
>>http://www.unicode.org/charts/PDF/U0900.pdf
>>
>> So what I am asking for is a map (or table) that maps 0x00-0x7F in
>> Devanagari ASCII to 0x0900-0x097F in Unicode.
>>
>
> In addition to the ASCII-to-Devanagari transcription system that Philip
> Taylor mentioned, you may be interested in the ISCII encoding for
> Brahmi-derived writing systems, including Devanagari:
>
> https://en.wikipedia.org/wiki/Indian_Script_Code_for_Information_Interchange
>
> This is _not_ an ASCII-to-Devanagari encoding, rather it leaves the ASCII
> range intact, and encodes Devanagari (etc.) in the range 128 (actually,
> 161)-255.  It was afaik never widely used, but there were (and probably
> still are) fonts for it.  I don't imagine those fonts would be terribly
> high quality by today's standards, e.g. I'd be surprised if they handled
> conjunct characters.
>
> FWIW, there was a similar encoding called TSCII for Tamil.
>
> iconv can be used to map TSCII to other encodings, but for some reason it
> doesn't seem to have ISCII in its repertoire (it does include VISCII, but
> that's a legacy Vietnamese encoding).
> --
>Mike Maxwell
>"My definition of an interesting universe is
>one that has the capacity to study itself."
>  --Stephen Eastmond
>
>
>


Re: [XeTeX] Devanagari ASCII to Unicode mapping

2018-02-17 Thread ShreeDevi Kumar
> For example, it seems that the Devanagari  glyph "ब" is encoded as
0x61 (hex) in ASCII (lower case 'a' for the Latin alphabet),

Before unicode, devanagari fonts used the ASCII range (legacy fonts) -
however AFAIK there is no standardization in the mapping, though various
families of fonts had similar mapping.

see http://hindi-fonts.com/tools for converters from different mappings to
unicode.

So,  ASCII to Unicode mapping for Devanagari will change based on the font
used.
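
In code terms, the conversion needs the font name as an input, not just the text. A minimal Python sketch, assuming one lookup table per legacy font: FONT_TABLES, convert and "OtherLegacyFont" are names made up for this example, the Preeti entry is the single pair reported earlier in this thread, and the second table is invented purely to show that the same code can mean different things.

```python
# One table per legacy font. Both tables are illustrative fragments.
FONT_TABLES = {
    "Preeti": {"a": "\u092c"},           # 0x61 -> U+092C, as reported in this thread
    "OtherLegacyFont": {"a": "\u0915"},  # invented entry, only to show tables differ
}


def convert(text, font):
    """Convert text using the table for the named font; unmapped characters pass through."""
    try:
        table = FONT_TABLES[font]
    except KeyError:
        raise ValueError(f"no mapping table for font {font!r}") from None
    return "".join(table.get(ch, ch) for ch in text)


print(convert("a", "Preeti"))           # prints ब
print(convert("a", "OtherLegacyFont"))  # prints क
```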


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Feb 17, 2018 at 10:04 PM, Philip Taylor  wrote:

> Daniel Greenhoe wrote:
>
>> Does anyone know where I can find an ASCII to Unicode mapping for
>> Devanagari?
>>
> Would this be of any help ?
>
> https://clas.uiowa.edu/linguistics/hindi-verb-project/ascii-devanagari-chart
>
> Philip Taylor
>
>
>


Re: [XeTeX] Devanagari ASCII to Unicode mapping

2018-02-17 Thread Mike Maxwell

On 2/17/2018 11:08 AM, Daniel Greenhoe wrote:

Does anyone know where I can find an ASCII to Unicode mapping for Devanagari?

For example, it seems that the Devanagari  glyph "ब" is encoded as
0x61 (hex) in ASCII (lower case 'a' for the Latin alphabet), but is
0x092C in the Unicode standard:
   http://www.unicode.org/charts/PDF/U0900.pdf

So what I am asking for is a map (or table) that maps 0x00-0x7F in
Devanagari ASCII to 0x0900-0x097F in Unicode.


In addition to the ASCII-to-Devanagari transcription system that Philip 
Taylor mentioned, you may be interested in the ISCII encoding for 
Brahmi-derived writing systems, including Devanagari:


https://en.wikipedia.org/wiki/Indian_Script_Code_for_Information_Interchange

This is _not_ an ASCII-to-Devanagari encoding, rather it leaves the 
ASCII range intact, and encodes Devanagari (etc.) in the range 128 
(actually, 161)-255.  It was afaik never widely used, but there were 
(and probably still are) fonts for it.  I don't imagine those fonts 
would be terribly high quality by today's standards, e.g. I'd be 
surprised if they handled conjunct characters.
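
The practical consequence of that layout is that ASCII text and script text can be told apart from the byte value alone. A small Python sketch of that split, assuming only the ranges described above (below 128 is ASCII, 161 to 255 is script); classify and runs are made-up names, the sample bytes are arbitrary, and nothing here decodes the script bytes to Unicode.

```python
from itertools import groupby


def classify(byte):
    """Label a byte by the ranges described above."""
    if byte < 0x80:
        return "ascii"
    if byte >= 0xA1:
        return "script"
    return "other"  # 128-160: outside both ranges as described above


def runs(data):
    """Split a byte string into consecutive runs with the same label."""
    return [(kind, bytes(group)) for kind, group in groupby(data, key=classify)]


print(runs(b"abc \xa4\xb3\xda!"))
# prints [('ascii', b'abc '), ('script', b'\xa4\xb3\xda'), ('ascii', b'!')]
```

A font-specific "Devanagari ASCII" encoding offers no such test, because the same bytes below 128 serve as both Latin text and Devanagari glyph codes.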


FWIW, there was a similar encoding called TSCII for Tamil.

iconv can be used to map TSCII to other encodings, but for some reason 
it doesn't seem to have ISCII in its repertoire (it does include VISCII, 
but that's a legacy Vietnamese encoding).

--
   Mike Maxwell
   "My definition of an interesting universe is
   one that has the capacity to study itself."
 --Stephen Eastmond




Re: [XeTeX] Devanagari ASCII to Unicode mapping

2018-02-17 Thread Daniel Greenhoe
> https://clas.uiowa.edu/linguistics/hindi-verb-project/ascii-devanagari-chart

That one looks to be more like an input tool (like a teckit mapping)
for Devanagari.

What I think I am looking for is something that would map a document
typeset using something like the Devanagari Preeti font
(https://fonts2u.com/preeti.font), which seems to have the Devanagari
glyphs encoded in the range 0x00-0x7F, to something like the
Devanagari unicode font Mukta
(https://ektype.in/scripts/devanagari/mukta.html) in the range
0x0900-0x097F.

In short, I would maybe like a simple map something like this:
  0x21 --> 0x096F  (९)
  0x22 --> 0x0942
  0x23 --> 0x0969 (३)
  0x24 --> 0x096A (४)
  0x25 --> 0x096B (५)
  0x26 --> 0x096D (७)
  ...
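
As a rough illustration only, a table like this could be applied in Python with str.translate. The six pairs are the ones listed above and have not been checked against the font itself; SIMPLE_MAP and legacy_to_unicode are names made up for this sketch, and it assumes every legacy code maps to exactly one Unicode code point.

```python
# Only the six pairs quoted in this message; a real table would be longer.
SIMPLE_MAP = {
    0x21: 0x096F,
    0x22: 0x0942,
    0x23: 0x0969,
    0x24: 0x096A,
    0x25: 0x096B,
    0x26: 0x096D,
}


def legacy_to_unicode(text):
    """Replace each mapped legacy code with its Unicode code point; leave the rest unchanged."""
    return text.translate(SIMPLE_MAP)


print(legacy_to_unicode("#$%"))  # prints ३४५
```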



On Sat, Feb 17, 2018 at 4:34 PM, Philip Taylor  wrote:
> Daniel Greenhoe wrote:
>>
>> Does anyone know where I can find an ASCII to Unicode mapping for
>> Devanagari?
>
> Would this be of any help ?
>
> https://clas.uiowa.edu/linguistics/hindi-verb-project/ascii-devanagari-chart
>
> Philip Taylor




Re: [XeTeX] Devanagari ASCII to Unicode mapping

2018-02-17 Thread Philip Taylor

Daniel Greenhoe wrote:

Does anyone know where I can find an ASCII to Unicode mapping for Devanagari?

Would this be of any help ?

https://clas.uiowa.edu/linguistics/hindi-verb-project/ascii-devanagari-chart

Philip Taylor

