Re: [PHP-DEV] Progress or just 'a mess'?

2017-09-20 Thread Stanislav Malyshev
Hi!

> Having established that the only characters that are case-insensitive in
> PHP7 ... the unicode basic latin set ... the discussion SHOULD be on
> either expanding that to cover all case folding or simply removing this
> rather limited case? 

Why? Does anybody seriously need Russian case folding in PHP constants?
I mean, sure, nice demo, but does anybody *need* it? I don't see much
code on github - in any language - that uses Russian identifiers, for
example.

> argument. However many of my clients do not use English as a first
> language so any data handling has to be unicode based, and case in that

You seem to be mixing data and code here. So which are you talking
about: data or code?
-- 
Stas Malyshev
smalys...@gmail.com

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Progress or just 'a mess'?

2017-09-20 Thread Lester Caine
On 20/09/17 08:26, Stanislav Malyshev wrote:
>> picking up on the base problem? Just what character set is PHP7 designed
>> to work with.
> 
> What do you mean by "work with"?

Actually that HAS already been identified in this thread, and it is only
the basic ASCII character set, but this is not actually specified anywhere?

>> For PHP8 is it not time to lay out a similar set of rules as provided by
>> SQL and identify just what 'case-insensitive' means and where it does apply?
> 
> I'm not sure which problem you are trying to solve here. Could you
> explain what you'd be using these rules for?

Having established that the only characters that are case-insensitive in
PHP7 ... the unicode basic latin set ... the discussion SHOULD be on
either expanding that to cover all case folding or simply removing this
rather limited case? Tony Marston is making an impassioned demand to
retain this very limited case, and therefore expand it to cover all
character sets, and as a fellow 'English only' coder, I can accept that
argument. However many of my clients do not use English as a first
language so any data handling has to be unicode based, and case in that
data can be important, so is case-insensitive really as universal as
Tony thinks? Certainly we need data case-insensitivity to handle unicode
properly and not just a few english characters ( should I really add a
capital 'E' to english just to please the spell checker? )

People are using their own languages when writing PHP variables and
function names, and apart from a few edge cases this does seem to be
working for them. As with SQL, the key programming words are in English,
and I don't think anybody would suggest adding aliases for them, so
restricting keywords to 'unicode basic latin set' can be defined, but
does THEN making that case-insensitive add to the problems of making PHP
more user friendly in handling unicode names elsewhere? I am seeing SQL
field names coming in with unicode content, and these are then array
keys in PHP ... the latin characters get lower cased at times and this
DOES cause a problem if the metadata defines upper case and I suspect
that is something that will never be changed now, but the actual rules
applied would be nice to know?

-- 
Lester Caine - G8HFL
-
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk




Re: [PHP-DEV] Progress or just 'a mess'?

2017-09-20 Thread Stanislav Malyshev
Hi!

> picking up on the base problem? Just what character set is PHP7 designed
> to work with.

What do you mean by "work with"?

> For PHP8 is it not time to lay out a similar set of rules as provided by
> SQL and identify just what 'case-insensitive' means and where it does apply?

I'm not sure which problem you are trying to solve here. Could you
explain what you'd be using these rules for?
-- 
Stas Malyshev
smalys...@gmail.com




Re: [PHP-DEV] Progress or just 'a mess'?

2017-09-17 Thread Christoph M. Becker
On 17.09.2017 at 15:45, Christoph M. Becker wrote:

> On 17.09.2017 at 14:37, Rowan Collins wrote:
> 
>> That makes much more sense, but doesn't answer the other question, of if 
>> there's a working definition of what we mean by "case insensitive".
> 
> For case-insensitive constants zend_register_constant() uses
> zend_str_tolower_copy() which uses zend_tolower_ascii() which looks up
> in tolower_map.
> As the name already says, this is a simple ASCII lower case mapping
> (A-Z are mapped to a-z; all others map to themselves).  So only
> identifiers consisting solely of ASCII characters can actually be
> case-insensitive.
> 
> I presume that this map is also used for other case-insensitive identifiers.

See also Sara's reply to the other thread.

-- 
Christoph M. Becker






Re: [PHP-DEV] Progress or just 'a mess'?

2017-09-17 Thread Lester Caine
On 17/09/17 11:53, Rowan Collins wrote:
> On 17 September 2017 09:54:54 BST, Lester Caine  wrote:
>> Just what character set is PHP7 designed to work with.
> 
> Focusing on the answerable part of this, PHP actually allows a very wide 
> variety of characters in identifiers (names of variables, classes, functions, 
> etc).
> 
> I checked the PHP lang-spec repo expecting to find a set of Unicode classes, 
> but it currently mentions "U+0080-U+00FF": 
> https://github.com/php/php-langspec/blob/master/spec/09-lexical-structure.md#names
>  That seems wrong to me, unless I'm looking at the wrong definition - the 
> first part of that range is control characters, and you can have variables 
> called things like $ (with an emoji as the entire name).
> 
> That would definitely be the place to document the allowed characters, 
> though, and a rigorous definition of "case insensitive" could also be added. 
> I was wrong, by the way, to say that using "to case fold" rather than "to 
> lower case" would solve the Turkish I problem - the key for that is to define 
> a single locale whose case folding you are using, independent of runtime 
> locale settings.

I think this is actually the problem. Unicode is simply NOT a general
solution! Normalizing is another aspect, and that can result in
differences between strings if one also 'case folds'. On top of which
one has to add the collation one is using to provide sort order which is
another can of worms? Sorting array keys in order depends on the
character set used ... which is perhaps why there seems to be a drive to
replace associative arrays with simple numeric ones?

"U+0020-U+007F" gives the Basic Latin set of characters (ASCII)
"U+0080-U+00FF" add the "Latin-1 Supplement"
The problem is that the second 128 characters are avoiding overlaying the
"U+0000-U+001F" control character block, while single byte character
sets WOULD be more productive if they followed the extra character
convention instead. One of the irritating compromises made by Unicode?

It would perhaps also be nice if the file naming convention used 'nbsp'
for spaces rather than 'sp' and eliminate the need for quotes around
file and directory names, but adding quotes is used by SQL to indicate
'case-sensitive' strings, yet another convention to be given a nod to?
If you get an associative key from a quoted field name it is NOT
case-insensitive and while a second field with the same combination of
characters would be 'silly' it is something that can happen for many
reasons ... and explode() falls over in some instances as a result.

-- 
Lester Caine - G8HFL
-
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk




Re: [PHP-DEV] Progress or just 'a mess'?

2017-09-17 Thread Christoph M. Becker
On 17.09.2017 at 14:37, Rowan Collins wrote:

> That makes much more sense, but doesn't answer the other question, of if 
> there's a working definition of what we mean by "case insensitive".

For case-insensitive constants zend_register_constant() uses
zend_str_tolower_copy() which uses zend_tolower_ascii() which looks up
in tolower_map.
As the name already says, this is a simple ASCII lower case mapping
(A-Z are mapped to a-z; all others map to themselves).  So only
identifiers consisting solely of ASCII characters can actually be
case-insensitive.
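
To make the behaviour described above concrete, here is a minimal sketch in C of an ASCII-only lower-case mapping. It is not the Zend engine's code: the names ascii_tolower_byte() and ascii_tolower_copy() are hypothetical, and the sketch uses a comparison where the engine reportedly uses a lookup table. It only illustrates the stated rule that A-Z map to a-z and every other byte maps to itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: maps 'A'..'Z' to 'a'..'z'; every other byte,
 * including all bytes >= 0x80, is returned unchanged. */
unsigned char ascii_tolower_byte(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Hypothetical helper: writes the ASCII-lower-cased copy of the
 * NUL-terminated string src into dst, which must have room for
 * strlen(src) + 1 bytes. */
void ascii_tolower_copy(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t len = strlen(src);
    for (size_t i = 0; i < len; i++) {
        dst[i] = (char)ascii_tolower_byte((unsigned char)src[i]);
    }
    dst[len] = '\0';
}
```

With a mapping like this, "FooBAR" and "foobar" compare equal after folding, but the UTF-8 bytes of a Cyrillic or accented capital letter pass through untouched, so two spellings of such a name that differ only in case stay distinct.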

I presume that this map is also used for other case-insensitive identifiers.

-- 
Christoph M. Becker




Re: [PHP-DEV] Progress or just 'a mess'?

2017-09-17 Thread Rowan Collins
On 17 September 2017 13:18:44 BST, "Christoph M. Becker" wrote:
>On 17.09.2017 at 12:53, Rowan Collins wrote:
>
>> I checked the PHP lang-spec repo expecting to find a set of Unicode
>> classes, but it currently mentions "U+0080-U+00FF":
>> https://github.com/php/php-langspec/blob/master/spec/09-lexical-structure.md#names
>> That seems wrong to me, unless I'm looking at the wrong definition -
>> the first part of that range is control characters, and you can have
>> variables called things like $ (with an emoji as the entire name).
>
>The specification in the PHP manual[1] appears to be more appropriate
>for our current implementation:
>
>| As a regular expression, it would be expressed thus:
>| '[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*'
>
>With regard to control characters: that depends on the chosen character
>encoding; for instance in Windows-1252 the ¢ character is mapped to
>\xA2.
>
>[1] 

Ah, so the mistake in the spec is that these aren't actually Unicode code 
points at all, but allowed *bytes*, which happen to allow for the UTF8 encoding 
of pretty much any Unicode codepoints.
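
The point about bytes versus code points can be checked with a small sketch in C. The functions utf8_encode() and utf8_bytes_all_high() are hypothetical names introduced here for illustration; the encoder follows the standard UTF-8 bit layout and, as a simplification, does not reject surrogate code points. It shows that every byte in the UTF-8 encoding of a code point at or above U+0080 falls in the 0x80-0xFF range, which is inside the byte range the quoted regular expression allows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes code point cp as UTF-8 into out and
 * returns the number of bytes written (1 to 4). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Hypothetical helper: true when every UTF-8 byte of cp is >= 0x80. */
bool utf8_bytes_all_high(uint32_t cp)
{
    unsigned char buf[4];
    size_t n = utf8_encode(cp, buf);
    for (size_t i = 0; i < n; i++) {
        if (buf[i] < 0x80) {
            return false;
        }
    }
    return true;
}
```

Only code points below U+0080 produce a byte under 0x80, and for those the ASCII part of the rule decides whether the character is allowed.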

That makes much more sense, but doesn't answer the other question, of if 
there's a working definition of what we mean by "case insensitive".

Regards,

-- 
Rowan Collins
[IMSoP]




Re: [PHP-DEV] Progress or just 'a mess'?

2017-09-17 Thread Christoph M. Becker
On 17.09.2017 at 12:53, Rowan Collins wrote:

> I checked the PHP lang-spec repo expecting to find a set of Unicode classes, 
> but it currently mentions "U+0080-U+00FF": 
> https://github.com/php/php-langspec/blob/master/spec/09-lexical-structure.md#names
>  That seems wrong to me, unless I'm looking at the wrong definition - the 
> first part of that range is control characters, and you can have variables 
> called things like $ (with an emoji as the entire name).

The specification in the PHP manual[1] appears to be more appropriate
for our current implementation:

| As a regular expression, it would be expressed thus:
| '[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*'

With regard to control characters: that depends on the chosen character
encoding; for instance in Windows-1252 the ¢ character is mapped to \xA2.
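
As an illustration only, the regular expression quoted from the manual can be mirrored byte by byte in C. The function names below are hypothetical and this is not how the PHP lexer is written; the sketch just applies the two character classes from the quoted expression to raw bytes, without assuming any character encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Byte allowed in first position: a-z, A-Z, '_' or 0x7f..0xff. */
bool is_label_start(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
        || c == '_' || c >= 0x7f;
}

/* Byte allowed in later positions: the above plus 0-9. */
bool is_label_part(unsigned char c)
{
    return is_label_start(c) || (c >= '0' && c <= '9');
}

/* Hypothetical helper: true when the whole NUL-terminated string s
 * matches [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]* as raw bytes. */
bool is_valid_label(const char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);
    if (len == 0 || !is_label_start((unsigned char)s[0])) {
        return false;
    }
    for (size_t i = 1; i < len; i++) {
        if (!is_label_part((unsigned char)s[i])) {
            return false;
        }
    }
    return true;
}
```

Because the check looks only at byte values, the single byte \xA2 passes whether it is read as a Windows-1252 cent sign or as part of some other encoding, and the four UTF-8 bytes of an emoji pass as well.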

[1] 

-- 
Christoph M. Becker




Re: [PHP-DEV] Progress or just 'a mess'?

2017-09-17 Thread Rowan Collins
On 17 September 2017 09:54:54 BST, Lester Caine  wrote:
> Just what character set is PHP7 designed to work with.

Focusing on the answerable part of this, PHP actually allows a very wide 
variety of characters in identifiers (names of variables, classes, functions, 
etc).

I checked the PHP lang-spec repo expecting to find a set of Unicode classes, 
but it currently mentions "U+0080-U+00FF": 
https://github.com/php/php-langspec/blob/master/spec/09-lexical-structure.md#names
 That seems wrong to me, unless I'm looking at the wrong definition - the first 
part of that range is control characters, and you can have variables called 
things like $ (with an emoji as the entire name).

That would definitely be the place to document the allowed characters, though, 
and a rigorous definition of "case insensitive" could also be added. I was 
wrong, by the way, to say that using "to case fold" rather than "to lower case" 
would solve the Turkish I problem - the key for that is to define a single 
locale whose case folding you are using, independent of runtime locale settings.

Regards,

-- 
Rowan Collins
[IMSoP]
