Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Jonathan Kew

On 6/5/15 14:14, Joseph Wright wrote:


 Based on the current files, we have a block to set \XeTeXcharclass,
 which only applies to XeTeX. The logic followed in that code is that
 characters in the file LineBreak.txt which have class ID (ideographs)
 not only set the \XeTeXcharclass class to 1 but also set the \catcode of
 the code point to 11. That leads to a difference between the two Unicode
 engines. My current feeling is that the data file should split this
 process such that the category code change applies to both XeTeX and
 LuaTeX, with the XeTeX-specific code separate. Does this make sense and
 indeed does the current assignment make sense?



ISTM that the most appropriate (default) \catcode for characters with 
class ID is clearly letter (11), and would suggest that LuaTeX should 
follow XeTeX in this.


So yes, splitting out the XeTeX-specific code and having LuaTeX share 
the catcode assignments makes sense.


After all, if users can write control sequences such as

  \hello
  \halló
  \Здравствуйте
  \ሰላም
  \सलाम

they should equally well be able to write

  \你好
  \こんにちわ

and have each of these treated as a single control sequence, too. This 
will not work if class ID characters are given catcode 12.


If you're making improvements to unicode-letters.def, I would suggest 
also adding a section that assigns catcode 15 (invalid) to the code 
values D800 - DFFF (i.e. the UTF-16 surrogates, which should never be 
used in isolation as characters).


JK



--
Subscriptions, Archive, and List information, etc.:
 http://tug.org/mailman/listinfo/xetex


[XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Joseph Wright
Hello all,

As some people will have seen, the LaTeX team have recently integrated
setting of codes (\catcode, \lccode, etc.) for the entire Unicode range
into the kernel when XeTeX/LuaTeX are in use. This is not a functional
change for end users but does mean that the team now have some control
over these important settings. Notably, the new data file we have
created (unicode-letters.def) is compatible with plain TeX and works
with both XeTeX and LuaTeX. We are therefore hopeful that it will
prove useful not only to LaTeX users but also to those using
plain-based formats.

For the initial pass we have adopted the settings applied by
unicode-letters.tex (XeTeX)/luatex-unicode-letters.tex (LuaTeX) as-is.
We have constructed a new (TeX) script to generate this data from the
raw Unicode data files.

Most of the settings are straightforward and shared between XeTeX and
LuaTeX. For example, characters marked by Unicode as letters have
\catcode 11, \lccode and \uccode are set up based on case relationships,
etc. However, we would like to raise one area that may need revision.

Based on the current files, we have a block to set \XeTeXcharclass,
which only applies to XeTeX. The logic followed in that code is that
characters in the file LineBreak.txt which have class ID (ideographs)
not only set the \XeTeXcharclass class to 1 but also set the \catcode of
the code point to 11. That leads to a difference between the two Unicode
engines. My current feeling is that the data file should split this
process such that the category code change applies to both XeTeX and
LuaTeX, with the XeTeX-specific code separate. Does this make sense and
indeed does the current assignment make sense?
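
To make the proposed split concrete, here is a minimal Python sketch (not
the TeX script the team actually uses). It assumes only the usual UCD layout
of LineBreak.txt, i.e. "code[..code];class # comment". The sample lines and
the names ranges_with_class and assignments are made up for illustration.

```python
from typing import Dict, Iterator, Tuple

# Illustrative sample in the UCD "code[..code];class # comment" layout.
# These lines are NOT copied verbatim from any release of LineBreak.txt.
SAMPLE = """\
# a comment line
0041..005A;AL # Latin capital letters (sample)
3400..4DB5;ID # CJK ideographs (sample range)
4E00..9FCC;ID # CJK ideographs (sample range)
FF1F;EX # a single code point (sample)
"""


def ranges_with_class(text: str, wanted: str) -> Iterator[Tuple[int, int]]:
    """Yield (first, last) code point ranges whose line-break class is `wanted`."""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        codes, cls = (field.strip() for field in line.split(";")[:2])
        if cls != wanted:
            continue
        first, _, last = codes.partition("..")
        yield int(first, 16), int(last or first, 16)


def assignments(engine: str) -> Dict[str, int]:
    """Codes to set for a class-ID character (hypothetical helper).

    The \\catcode is shared by both engines; \\XeTeXcharclass is XeTeX-only.
    """
    codes = {"catcode": 11}
    if engine == "xetex":
        codes["XeTeXcharclass"] = 1
    return codes


if __name__ == "__main__":
    for first, last in ranges_with_class(SAMPLE, "ID"):
        print(f"{first:04X}..{last:04X}", assignments("luatex"), assignments("xetex"))
```

The point of the sketch is only that the range data is parsed once and the
engine test is applied at assignment time, so both engines see the same
\catcode values.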

We are very keen to hear about any other logic changes that may be
required in the data file. This is a complex area and we have at present
done little other than copy the current logic.
--
Joseph Wright




Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

2015-05-06 Thread Arthur Reutenauer
  While working on these bugs, we also discussed how surrogate
characters were handled in XeTeX.  Surrogate characters are the 2048
code points (U+D800 to U+DFFF) that are used in UTF-16 to encode
characters with code points above U+FFFF (65535): a pair of them makes
up one Unicode character; however, they're not meant to be used in
isolation, even though they have code points like other characters
(they're not just byte sequences).
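
For readers who want the arithmetic spelled out, the following Python
sketch shows the standard UTF-16 mapping between a supplementary code point
and its surrogate pair, using the pair D835 DC00 and the code point 1D400
that come up in this thread. The function names are illustrative only.

```python
HIGH_START = 0xD800  # first high (leading) surrogate
LOW_START = 0xDC00   # first low (trailing) surrogate
LOW_END = 0xDFFF     # last low surrogate


def is_surrogate(cp: int) -> bool:
    """True for any of the code points reserved for UTF-16 surrogates."""
    return HIGH_START <= cp <= LOW_END


def to_surrogates(cp: int) -> tuple:
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = cp - 0x10000  # a 20-bit value
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


def from_surrogates(high: int, low: int) -> int:
    """Combine a well-formed (high, low) surrogate pair into one code point."""
    if not (HIGH_START <= high < LOW_START and LOW_START <= low <= LOW_END):
        raise ValueError("not a well-formed surrogate pair")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


if __name__ == "__main__":
    high, low = to_surrogates(0x1D400)
    print(f"{high:04X} {low:04X}")                   # D835 DC00
    print(f"{from_surrogates(0xD835, 0xDC00):X}")    # 1D400
    print(LOW_END - HIGH_START + 1)                  # 2048
```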

  Right now, XeTeX allows isolated surrogate characters, and also
combines sequences such as ^^^^d835^^^^dc00 into one Unicode character.
We want to flag the former case but are not sure how: should we make the
characters invalid (with catcode 15)?  Or we could map them to the
standard unknown character (U+FFFD).  The latter case is more nasty
and should definitely be forbidden -- the ^^ notation should only be
used for proper characters (so instead of the above, the Unicode code
point of the resulting Unicode character should be used, in this case
^^^^^1d400).

  Any thoughts?

Best,

Arthur




Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

2015-05-06 Thread David Carlisle
On 6 May 2015 at 23:04, Arthur Reutenauer
arthur.reutena...@normalesup.org wrote:
   While working on these bugs, we also discussed how surrogate
 characters were handled in XeTeX.  Surrogate characters are the 2048
 code points that are used in UTF-16 to encode characters with code
 points above 65536: a pair of them makes up one Unicode character;
 however they're not meant to be used in isolation, even though they have
 code points like other characters (they're not just byte sequences).

   Right now, XeTeX allows isolated surrogate characters, and also
 combines sequences such as ^^^^d835^^^^dc00 into one Unicode character.
 We want to flag the former case but are not sure how: should we make the
 characters invalid (with catcode 15)?  Or we could map them to the
 standard unknown character (U+FFFD).  The latter case is more nasty
 and should definitely be forbidden -- the ^^ notation should only be
 used for proper characters (so instead of the above, the Unicode code
 point of the resulting Unicode character should be used, in this case
 ^^^^^1d400).

   Any thoughts?


A major difference between using catcode 15 and the engine's input
filter substituting U+FFFD is that the former could be over-ridden at
the macro layer. Whether that's a good thing or not depends a bit on
what happens if a document puts the catcodes back to (say) 12.

If you just get undefined characters and missing glyphs, then you get
what you ask for and probably it should be allowed just because.  If the
internals can't reliably deal with an unpaired surrogate (e.g. it
crashes some font library API) then the engine had better ensure it
doesn't easily happen, and U+FFFD is probably as good as anything.

If you do go for catcode 15, then (as suggested in the thread on
unicode-letters.def) it could be set in the macro layer or the engine
could initialise these catcodes. Doing it at the macro layer is probably
more in the spirit of the traditional catcode initialisation, which is
very minimalist.

As you say, combining ^^^^d835^^^^dc00 into one token is just wrong,
and I think it should do (twice) whatever you decide to do for
unpaired surrogates.

David




Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

2015-05-06 Thread David Carlisle
 The character itself, as bytes that is, is not wrong and users should be able 
 to create these.
 But preferably through macros that ensure that they come correctly paired.

Placing two character tokens representing a surrogate pair should not,
though, magically turn itself into a single character. The UTF-8 or ^^^^
encoding should refer to the Unicode code point, not to the UTF-16
encoding.

In the current versions ^^^^d835^^^^dc00 is two characters in LuaTeX
and one character in XeTeX, as the implementation detail that XeTeX's
underlying storage is mostly UTF-16 is exposed. If it is not possible to
prevent ^^^ or UTF-8 encoded surrogate pairs combining then it is better
to prevent them being formed.

This is no different to XML, where &#xd835;&#xdc00; always refers to
two (invalid) characters, not to &#x1d400;.
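
As an analogy outside TeX and XML, Python treats the same data in the way
argued for here. The sketch below shows only Python's behaviour, not that
of either engine: two surrogate code points stay two code points, strict
UTF-8 refuses them, and they merge only on an explicit round trip through
UTF-16 code units. The helper names are made up for illustration.

```python
PAIR = "\ud835\udc00"    # two separate code points, both surrogates
SINGLE = "\U0001d400"    # one code point: MATHEMATICAL BOLD CAPITAL A


def strict_utf8_ok(text: str) -> bool:
    """True if `text` can be encoded as strict (well-formed) UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


def via_utf16(text: str) -> str:
    """Round-trip through UTF-16 code units; only here does the pair merge."""
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


if __name__ == "__main__":
    print(len(PAIR), len(SINGLE))      # 2 1
    print(PAIR == SINGLE)              # False
    print(strict_utf8_ok(PAIR))        # False
    print(strict_utf8_ok(SINGLE))      # True
    print(via_utf16(PAIR) == SINGLE)   # True
```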

David




[XeTeX] Re: Re: Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Apostolos Syropoulos
The only mark that remains when making all capitals is the diaeresis
(dialytika). All others vanish. This is common knowledge for people who speak
and write Greek.
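
A small Python illustration of the difference under discussion. It is not
how xgreek or unicode-letters.def is implemented: unicode_upper uses the
Unicode case mapping as shipped with Python, while greek_all_caps is a
hypothetical helper that applies the rule stated above by dropping every
combining mark except the dialytika before upper-casing.

```python
import unicodedata

DIALYTIKA = "\u0308"  # COMBINING DIAERESIS


def unicode_upper(text: str) -> str:
    """Upper-case using the Unicode case mappings built into Python."""
    return text.upper()


def greek_all_caps(text: str) -> str:
    """Upper-case after removing all combining marks except the dialytika.

    A deliberately naive sketch: it strips every nonspacing mark (category
    Mn) other than U+0308, whatever the script.
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch for ch in decomposed
        if unicodedata.category(ch) != "Mn" or ch == DIALYTIKA
    )
    return unicodedata.normalize("NFC", kept.upper())


if __name__ == "__main__":
    psili = "\u1f10"  # GREEK SMALL LETTER EPSILON WITH PSILI
    print(f"{ord(unicode_upper(psili)):04X}")      # 1F18
    print(f"{ord(greek_all_caps(psili)):04X}")     # 0395
    print(f"{ord(greek_all_caps(chr(0x03CA))):04X}")  # 03AA, dialytika kept
```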


AS



Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

2015-05-06 Thread Ross Moore
Hi Arthur,

On 07/05/2015, at 8:04, Arthur Reutenauer arthur.reutena...@normalesup.org 
wrote:

  While working on these bugs, we also discussed how surrogate
 characters were handled in XeTeX.  Surrogate characters are the 2048
 code points that are used in UTF-16 to encode characters with code
 points above 65536: a pair of them makes up one Unicode character;
 however they're not meant to be used in isolation, even though they have
 code points like other characters (they're not just byte sequences).
 
  Right now, XeTeX allows isolated surrogate characters, and also
 combines sequences such as ^^^^d835^^^^dc00 into one Unicode character.
 We want to flag the former case but are not sure how: should we make the
 characters invalid (with catcode 15)?  

That would definitely be wrong.
The character itself, as bytes that is, is not wrong and users should be able 
to create these.
But preferably through macros that ensure that they come correctly paired.

IMHO, this is a macro issue, not an engine issue.

The same kind of thing applies with combining accents and diacritics.
I've written macros that take an argument and follow it with a combining 
character.
This is useful for generating correct UTF8 bytes to put into XML packets, as 
needed for the XMP Metadata that is required in PDF files that must validate 
for ISO specifications.

Similar macros could be used to construct upper-plane characters from 
surrogates, given only the math style and Latin letter. For these, single 
surrogate characters will be needed in the macro definitions, with the ultimate 
matching pair to be determined algorithmically, probably using an \ifcase  
instance. Single characters thus need to be able to be input, so as to create 
the macro definition.

OK, a clever macro programmer can change the catcodes to become valid local to 
the macro definition. But that is really complicating things.


 Or we could map them to the
 standard unknown character (U+FFFD).  The latter case is more nasty
 and should definitely be forbidden -- the ^^ notation should only be
 used for proper characters (so instead of the above, the Unicode code
 point of the resulting Unicode character should be used, in this case
 ^^^^^1d400).

I disagree. 
The ^^ notation can be used in macros to create the required bytes, for writing 
out into a file other than the  .dvi  or .pdf  output.
pdfTeX (or other engine) then can cause that file to become embedded as a file 
object stream in the final PDF.


 
  Any thoughts?
 
Best,
 
Arthur


Hope this helps,

Ross






Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Joseph Wright
On 06/05/2015 21:06, David Carlisle wrote:
 On 6 May 2015 at 20:15, Philip Taylor p.tay...@rhul.ac.uk wrote:


 Apostolos Syropoulos wrote:

 It seems to me that most people have no idea what Unicode is and what is 
 really
 involved.

 OK, so if we restrict the Universe of Discourse to the set of native
 Hellenic speakers who know what Unicode is, know the importance of being
 able to use it to identify the correct upper case of (for example)
 'GREEK SMALL LETTER EPSILON WITH PSILI', and hold an informed opinion on
 the matter, would you expect that 100% of these would agree that the
 uppercase is 'GREEK LETTER EPSILON' and not 'GREEK LETTER EPSILON WITH
 PSILI', or would you expect that some percentage (perhaps small) would
 hold the opposite point of view ?

 ** Phil.

 
 I don't think that's the right question. Even if everyone, including
 the Unicode technical committee,
 agreed some properties are incorrect for some characters, it isn't
 clear we should change
 them at this level.
 
 I think that unicode-letters.def makes most sense as a
 fully automated representation of the UCD data files in TeX syntax.
 
 That way everyone knows what data is in there.
 
 Individual language packages have far fewer characters to worry about
 and can over-ride
 the base settings where appropriate.

Indeed: provided hyphenation is correct then we are OK. (LuaTeX of
course is rather more flexible there than XeTeX.)
--
Joseph Wright





Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Julian Bradfield
On 2015-05-06, Apostolos Syropoulos asyropou...@yahoo.com wrote:
 I checked the file a bit and noticed that
 \L 1F10 1F18 1F10 %
 while xgreek.sty defines
 \global\lccode"1F10="1F10 \global\uccode"1F10="0395

 You see, the uppercase of 'GREEK SMALL LETTER EPSILON WITH PSILI'
 is 'GREEK CAPITAL LETTER EPSILON' and not 'GREEK CAPITAL LETTER EPSILON
 WITH PSILI'.

Not in standard representations of Ancient Greek it isn't, and
polytonic Greek is mostly used for that.

I thought you didn't even use the psili at all in Modern Greek?




Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread David Carlisle
On 6 May 2015 at 20:15, Philip Taylor p.tay...@rhul.ac.uk wrote:


 Apostolos Syropoulos wrote:

 It seems to me that most people have no idea what Unicode is and what is 
 really
 involved.

 OK, so if we restrict the Universe of Discourse to the set of native
 Hellenic speakers who know what Unicode is, know the importance of being
 able to use it to identify the correct upper case of (for example)
 'GREEK SMALL LETTER EPSILON WITH PSILI', and hold an informed opinion on
 the matter, would you expect that 100% of these would agree that the
 uppercase is 'GREEK LETTER EPSILON' and not 'GREEK LETTER EPSILON WITH
 PSILI', or would you expect that some percentage (perhaps small) would
 hold the opposite point of view ?

 ** Phil.


I don't think that's the right question. Even if everyone, including
the Unicode technical committee, agreed some properties are incorrect
for some characters, it isn't clear we should change them at this level.

I think that unicode-letters.def makes most sense as a
fully automated representation of the UCD data files in TeX syntax.

That way everyone knows what data is in there.

Individual language packages have far fewer characters to worry about
and can over-ride the base settings where appropriate.

David

[Joseph's original message was cross posted to luatex list,
is there a particular reason that has been dropped?
it seems unfortunate as  a major part of the question was
how to arrange to get the same settings on both systems]




Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Joseph Wright
On 06/05/2015 15:09, Jonathan Kew wrote:
 On 6/5/15 14:14, Joseph Wright wrote:
 
 Based on the current files, we have a block to set \XeTeXcharclass,
 which only applies to XeTeX. The logic followed in that code is that
 characters in the file LineBreak.txt which have class ID (ideographs)
 not only set the \XeTeXcharclass class to 1 but also set the \catcode of
 the code point to 11. That leads to a difference between the two Unicode
 engines. My current feeling is that the data file should split this
 process such that the category code change applies to both XeTeX and
 LuaTeX, with the XeTeX-specific code separate. Does this make sense and
 indeed does the current assignment make sense?

 
 ISTM that the most appropriate (default) \catcode for characters with
 class ID is clearly letter (11), and would suggest that LuaTeX should
 follow XeTeX in this.

Well for LaTeX at least the team get to make the call here and I think
we will pull everything into line.

 So yes, splitting out the XeTeX-specific code and having LuaTeX share
 the catcode assignments makes sense.

OK, if there are no objections I have a plan on this (I'll actually keep
all of the data, I think, and alter the assignment code).

 After all, if users can write control sequences such as
 
   \hello
   \halló
   \Здравствуйте
   \ሰላም
   \सलाम
 
 they should equally well be able to write
 
   \你好
   \こんにちわ
 
 and have each of these treated as single control sequences, too. This
 will not work if category ID characters are given catcode 12.

Entirely reasonable.

 If you're making improvements to unicode-letters.def, I would suggest
 also adding a section that assigns catcode 15 (invalid) to the code
 values D800 - DFFF (i.e. the UTF-16 surrogates, which should never be
 used in isolation as characters).

Noted: easy enough to add.
--
Joseph Wright






Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Philip Taylor


David Carlisle wrote:

 I don't think that's the right question. Even if everyone, including 
 the Unicode technical committee, agreed some properties are
 incorrect for some characters, it isn't clear we should change them
 at this level.

You are (inadvertently) conflating my question with earlier discussions.

My question was asked solely in the context of Apostolos's suggestion that :

 somewhere it is explained why this is not correct. Otherwise, people 
 would see strange things and might wonder why they see them.

and I was trying to ascertain how best this explanation might be cast.
If it is the case that (for UNIV) all agree that Unicode is wrong and
Apostolos is correct, then a simple explanation that 'Unicode is wrong'
would be all that is needed. But if (say) 50% of UNIV agree that Unicode
is correct, then the explanation would have to be cast bearing this in mind.

** Phil.




Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Joseph Wright
On 06/05/2015 16:04, Apostolos Syropoulos wrote:
 Hello,
 
 I checked the file a bit and noticed that

 \L 1F10 1F18 1F10 %

 while xgreek.sty defines

 \global\lccode"1F10="1F10 \global\uccode"1F10="0395

 You see, the uppercase of 'GREEK SMALL LETTER EPSILON WITH PSILI'
 is 'GREEK CAPITAL LETTER EPSILON' and not 'GREEK CAPITAL LETTER EPSILON
 WITH PSILI'.

 Some time ago I reported this to the Unicode people and they told me
 something like "we cannot change it now" (I do not remember the exact
 wording but the essence remains the same). Naturally, all \lccodes and
 \uccodes for Greek letters are wrong and I suspect many more are wrong.

This is slightly at a tangent from my original question (whether we are
processing the Unicode data in the right way), but is worth
consideration. It also has some impact on expl3 code related to case
changing (which does not use \lccode/\uccode).

I guess one could imagine deviating from the Unicode data but there are
issues. First, the current position is at least easy to explain. Second,
the current approach is the same position taken by I guess many other
pieces of software, so is cross-compatible with other stuff. Third, as a
non-Greek I can't comment on the technical correctness of what you say!
Is there some place I could see this discussed in detail? (I'm a bit
confused as to what 'GREEK CAPITAL LETTER EPSILON WITH PSILI' represents
if it's not the upper case of 'GREEK SMALL LETTER EPSILON WITH PSILI': I
notice in xgreek you map U+1F18 to U+0395 for upper casing and U+1F10
for lower casing.)
--
Joseph Wright




Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Philip Taylor


Apostolos Syropoulos wrote:

 It seems to me that most people have no idea what Unicode is and what is 
 really
 involved. 

OK, so if we restrict the Universe of Discourse to the set of native
Hellenic speakers who know what Unicode is, know the importance of being
able to use it to identify the correct upper case of (for example)
'GREEK SMALL LETTER EPSILON WITH PSILI', and hold an informed opinion on
the matter, would you expect that 100% of these would agree that the
uppercase is 'GREEK CAPITAL LETTER EPSILON' and not 'GREEK CAPITAL
LETTER EPSILON WITH PSILI', or would you expect that some percentage
(perhaps small) would hold the opposite point of view ?

** Phil.





Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Philip Taylor


Apostolos Syropoulos wrote:
 I'd suggest that the basic (Xe|Lua)TeX formats should simply follow
 Unicode properties.
 
 In addition, I would suggest that somewhere it is explained why this
 is not correct. Otherwise, people would see strange things and might 
 wonder why they see them.

How united is the Hellenic-speaking world about this, Apostolos ?  Is it
a universal truth, universally accepted, or are there some (even just a
few) who maintain that Unicode is right and everyone else is wrong ?

** Phil.




Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Apostolos Syropoulos

 How united is the Hellenic-speaking world about this, Apostolos ?  Is it
 a universal truth, universally accepted, or are there some (even just a
 few) who maintain that Unicode is right and everyone else is wrong ?
 


It seems to me that most people have no idea what Unicode is and what is really
involved. 


A.S.


--
Apostolos Syropoulos
Xanthi, Greece




Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Jonathan Kew

On 6/5/15 16:29, Philip Taylor wrote:

 Apostolos Syropoulos wrote:

 the uppercase of 'GREEK SMALL LETTER EPSILON WITH PSILI'
 is 'GREEK CAPITAL LETTER EPSILON' and not 'GREEK CAPITAL LETTER EPSILON
 WITH PSILI'.

 Some time ago I reported this to the Unicode people and they told me
 something like "we cannot change it now" (I do not remember the exact
 wording but the essence remains the same). Naturally, all \lccodes and
 \uccodes for Greek letters are wrong and I suspect many more are wrong.

 Nasty.  In that case I would propose a user-selectable option :

 \Unicodecompliance

 with possible values

 strict (as per current Unicode standard)

 and

 loose (as advised by consensus of native speakers)

 One might need to factor this out by language, as in :

 \Unicodecompliance {Greek} {strict}
 \Unicodecompliance {Greek} {loose}

 or perhaps

 \Unicodecompliance (Greek=loose, Turkish=strict, ...)



I'd suggest that the basic (Xe|Lua)TeX formats should simply follow 
Unicode properties. A package designed to support any particular 
language is of course free to offer other options and make whatever 
adjustments may be appropriate.


JK





Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

2015-05-06 Thread Ross Moore
Hi David,

On 07/05/2015, at 9:26 AM, David Carlisle wrote:

 The character itself, as bytes that is, is not wrong and users should be 
 able to create these.
 But preferably through macros that ensure that they come correctly paired.
 
 placing two character tokens representing a surrogate pair should not
 though magically turn itself
 into a single character.

Agreed.
You don't know whether you want a single character until 
you know what kind of output is being generated.
That need not be known on input.

 The UTF-8 or ^^^^ encoding should refer to
 the unicode code point not
 to the UTF-16 encoding,

No disagreement to this.

 
 In the current versions ^^^^d835^^^^dc00 is two characters in luatex
 and one character in xetex
 as the implementation detail that xetex's underlying storage is mostly
 UTF-16 is exposed.

This seems to be premature of XeTeX then.
It seems to be making an assumption on how those bytes 
will ultimately be used.

 If it is
 not possible to prevent ^^^ or utf8 encoded surrogate pairs combining
 then it is better to
 prevent them being formed.

Hmm. 
What if you have an entirely different purpose in mind for those bytes?
You still need to be able to create them and do further processing with them.

Maybe there should be a primitive that sets a flag controlling what
happens to surrogates' bytes on input?
It may well be that XeTeX's current behaviour is best for putting
content into PDF pages; but not best in other situations. So a macro
programmer should have a means to change this, when needed.

 
 this is no different to XML where &#xd835;&#xdc00; always refers to
 two (invalid) characters not
 to &#x1d400;

Seems fine to me.
If application software wants/needs to combine them, it can do so.

 
 David


Cheers,

Ross


Ross Moore






Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Philip Taylor


Apostolos Syropoulos wrote:

 the uppercase of 'GREEK SMALL LETTER EPSILON WITH PSILI'
 is 'GREEK CAPITAL LETTER EPSILON' and not 'GREEK CAPITAL LETTER EPSILON
 WITH PSILI'.

 Some time ago I reported this to the Unicode people and they told me
 something like "we cannot change it now" (I do not remember the exact
 wording but the essence remains the same). Naturally, all \lccodes and
 \uccodes for Greek letters are wrong and I suspect many more are wrong.

Nasty.  In that case I would propose a user-selectable option :

\Unicodecompliance

with possible values

strict (as per current Unicode standard)

and

loose (as advised by consensus of native speakers)

One might need to factor this out by language, as in :


\Unicodecompliance {Greek} {strict}
\Unicodecompliance {Greek} {loose}

or perhaps

\Unicodecompliance (Greek=loose, Turkish=strict, ...)

Philip Taylor






Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Apostolos Syropoulos
Hello,

I checked the file a bit and noticed that

\L 1F10 1F18 1F10 %

while xgreek.sty defines

\global\lccode"1F10="1F10 \global\uccode"1F10="0395

You see, the uppercase of 'GREEK SMALL LETTER EPSILON WITH PSILI'
is 'GREEK CAPITAL LETTER EPSILON' and not 'GREEK CAPITAL LETTER EPSILON
WITH PSILI'.

Some time ago I reported this to the Unicode people and they told me
something like "we cannot change it now" (I do not remember the exact
wording but the essence remains the same). Naturally, all \lccodes and
\uccodes for Greek letters are wrong and I suspect many more are wrong.


A.S.

PS Of course people who use the xgreek package have no problem.

 --
Apostolos Syropoulos
Xanthi, Greece

