Re: Comparing inputs with source strings

2016-05-11 Thread Karl Williamson

On 05/11/2016 02:04 AM, Daniel Dehennin wrote:

> Karl Williamson  writes:
>
>> On 05/09/2016 08:53 AM, Daniel Dehennin wrote:
>>> Hello,
>>>
>>> I tried to make my Perl5 code unicode compliant after reading a post on
>>> stackoverflow[1].
>>>
>>> As suggested in the post:
>>>
>>>   “always run incoming stuff through NFD and outbound stuff from NFC.”
>>>
>>> I had a hard time finding why my Test::More was failing but displaying
>>> exactly the same strings for “got” and “expected”.
>>>
>>> I finally checked how UTF-8 sources are handled and found that they are in
>>> NFC form. I ran the following script:
>
> [...]
>
>> I'm afraid that when it comes to normalization in Perl5, you have to
>> do it yourself.  I hear that Perl6 is much friendlier in this regard,
>> but I have no personal experience with it.  Your $unistring is in
>> whatever normalization you made it when you typed it into your editor,
>> or whatever your editor did with it as you were typing.  You could
>> have typed it in NFD, but probably the most natural way to enter
>> things on your keyboard will, underneath it all, be NFC.
>
> That's what I finally found out from another post. Normally all my inputs
> are NFD, but my tests matched against static strings, so I now declare
> them with NFD to make it explicit.
>
> I added a note in my POD to signal that the sub returns NFD strings.

I forgot to mention that if you're just dealing with collation, it may 
be that comparisons actually work properly regardless of normalization, 
if you are doing the comparisons within the scope of 'use locale' and 
the locale is recognized by Perl5 to be a UTF-8 locale.  It depends on 
the libc implementation for your platform.  There are bugs in Perl5's 
handling of these, however, which I have fixes for, and expect to put 
into the latest development version, called blead, within the next week 
or two.
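
For comparisons that should not depend on the platform's libc, a different route from the 'use locale' one described above is the Unicode::Collate module mentioned earlier in the thread. The following is only a minimal sketch, assuming the default collator settings (which normalize internally) and two hypothetical sample strings written with explicit escapes so the file encoding does not matter:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use Unicode::Collate;

# Default collator; by default it normalizes its arguments before
# building sort keys, so the input form should not matter.
my $collator = Unicode::Collate->new;

# Hypothetical sample data: the same text in two normalization forms.
my $nfc = "\x{C9}t\x{E9}";         # precomposed E-acute and e-acute
my $nfd = "E\x{301}te\x{301}";     # base letters plus COMBINING ACUTE ACCENT

# Plain string equality compares code points and sees two different strings.
my $raw_equal = ($nfc eq $nfd) ? 1 : 0;

# The collator compares sort keys and treats the two as equal.
my $collated_equal = $collator->eq($nfc, $nfd) ? 1 : 0;

print "plain eq: $raw_equal, collator eq: $collated_equal\n";
```

This is not the locale-based mechanism itself, only a way to get normalization-insensitive comparisons without relying on it.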



>> Normalization is tricky, and the Unicode Consortium has had to modify
>> things years after they were first specified, because no one could
>> reasonably implement what was expected.  I may tackle getting
>> normalization to be more developer friendly in future Perl5 versions,
>> but not in the next couple of years.
>
> Thanks, as soon as my little work project is working well I'll try to
> redo it in Perl6.
>
> Regards.





Re: Comparing inputs with source strings

2016-05-11 Thread Daniel Dehennin
Karl Williamson  writes:

> On 05/09/2016 08:53 AM, Daniel Dehennin wrote:
>> Hello,
>>
>> I tried to make my Perl5 code unicode compliant after reading a post on
>> stackoverflow[1].
>>
>> As suggested in the post:
>>
>>  “always run incoming stuff through NFD and outbound stuff from NFC.”
>>
>> I had a hard time finding why my Test::More was failing but displaying
>> exactly the same strings for “got” and “expected”.
>>
>> I finally checked how UTF-8 sources are handled and found that they are in
>> NFC form. I ran the following script:

[...]

> I'm afraid that when it comes to normalization in Perl5, you have to
> do it yourself.  I hear that Perl6 is much friendlier in this regard,
> but I have no personal experience with it.  Your $unistring is in
> whatever normalization you made it when you typed it into your editor,
> or whatever your editor did with it as you were typing.  You could
> have typed it in NFD, but probably the most natural way to enter
> things on your keyboard will, underneath it all, be NFC.

That's what I finally found out from another post. Normally all my inputs
are NFD, but my tests matched against static strings, so I now declare
them with NFD to make it explicit.

I added a note in my POD to signal that the sub returns NFD strings.

> Normalization is tricky, and the Unicode Consortium has had to modify
> things years after they were first specified, because no one could
> reasonably implement what was expected.  I may tackle getting
> normalization to be more developer friendly in future Perl5 versions,
> but not in the next couple of years.

Thanks, as soon as my little work project is working well I'll try to
redo it in Perl6.

Regards.

-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF




Re: Comparing inputs with source strings

2016-05-10 Thread Karl Williamson

On 05/09/2016 08:53 AM, Daniel Dehennin wrote:

> Hello,
>
> I tried to make my Perl5 code unicode compliant after reading a post on
> stackoverflow[1].
>
> As suggested in the post:
>
>  “always run incoming stuff through NFD and outbound stuff from NFC.”
>
> I had a hard time finding why my Test::More was failing but displaying
> exactly the same strings for “got” and “expected”.
>
> I finally checked how UTF-8 sources are handled and found that they are in
> NFC form. I ran the following script:
>
> #+begin_src perl
> #!/usr/bin/env perl
>
> use utf8;
> use warnings;
>
> use Test::More;
> use Unicode::Normalize;
>
> my $unistring = 'C’est une chaîne unicode';
>
> my @forms = ("NFD", "NFC", "NFKD", "NFKC");
>
> for my $form (@forms) {
>     if ($unistring eq &$form($unistring)) {
>         print "UTF-8 source is in form '$form'\n";
>     }
> }
> #+end_src
>
> and got:
>
> #+begin_src
> UTF-8 source is in form 'NFC'
> UTF-8 source is in form 'NFKC'
> #+end_src
>
> So, the Test::More::is_deeply was trying to compare an input in NFD with
> the expected string in NFC.
>
> My code can use Unicode::Collate, but for all the code I did not write I
> wonder if there is a way to handle it cleanly.
>
> Or maybe I'm doing something wrong?


I'm afraid that when it comes to normalization in Perl5, you have to do 
it yourself.  I hear that Perl6 is much friendlier in this regard, but I 
have no personal experience with it.  Your $unistring is in whatever 
normalization you made it when you typed it into your editor, or 
whatever your editor did with it as you were typing.  You could have 
typed it in NFD, but probably the most natural way to enter things on 
your keyboard will, underneath it all, be NFC.
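
One way to "do it yourself" in a test file is to normalize both sides before comparing, so the form the editor happened to store no longer matters. This is only a sketch: is_normalized is a hypothetical helper name, not something Test::More provides, and the sample strings are written with explicit escapes so their form is unambiguous.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use Test::More;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: bring both strings to NFD, then compare them
# with Test::More's is().
sub is_normalized {
    my ($got, $expected, $name) = @_;
    return is(NFD($got), NFD($expected), $name);
}

my $expected = "cha\x{EE}ne";      # NFC: precomposed i with circumflex
my $got      = "chai\x{302}ne";    # NFD: i plus COMBINING CIRCUMFLEX ACCENT

isnt($got, $expected, 'raw strings differ');
is_normalized($got, $expected, 'equal once both sides are normalized');

done_testing();
```

Normalizing inside the helper keeps the static strings in the source readable, at the cost of one extra normalization per comparison.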


Normalization is tricky, and the Unicode Consortium has had to modify 
things years after they were first specified, because no one could 
reasonably implement what was expected.  I may tackle getting 
normalization to be more developer friendly in future Perl5 versions, 
but not in the next couple of years.


> Regards.
>
> Footnotes:
> [1]  https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default





Re: Comparing inputs with source strings

2016-05-10 Thread Daniel Dehennin
Daniel Dehennin  writes:


[...]

> I can't imagine declaring all my static string variables with:
>
> my $unistring = NFD('C’est une chaîne unicode');

Hey hey, it's more complicated than that: it depends on how the source
was encoded. The following matches none of the forms:

'C’est une chaîne unicode avec É'

since this “É” was entered as “\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}”
while the rest of the string is still precomposed, so the literal mixes
both forms.

So, it looks like no normalisation is done on sources.
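
The mixed case can be reproduced independently of the editor by spelling the code points out. A minimal sketch, assuming a shortened hypothetical sample string rather than the exact literal above:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use Unicode::Normalize qw(NFD NFC NFKD NFKC);

# Precomposed i-circumflex (U+00EE) followed by a decomposed E-acute
# (E plus U+0301), so the string is neither fully composed nor
# fully decomposed.
my $mixed = "cha\x{EE}ne E\x{301}";

my %forms = (
    NFD  => \&NFD,
    NFC  => \&NFC,
    NFKD => \&NFKD,
    NFKC => \&NFKC,
);

# Keep the names of the forms that leave the string unchanged.
my @matches = grep { $mixed eq $forms{$_}->($mixed) } sort keys %forms;

print @matches
    ? "already in: @matches\n"
    : "matches no normalization form\n";
```

Using code references in a hash avoids the symbolic call by name, so the check also runs under 'use strict'.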

Regards.
-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF




Re: Comparing inputs with source strings

2016-05-10 Thread Daniel Dehennin
Daniel Dehennin  writes:

> Hello,
>
> I tried to make my Perl5 code unicode compliant after reading a post on
> stackoverflow[1].
>
> As suggested in the post:
>
> “always run incoming stuff through NFD and outbound stuff from NFC.”

The same from perlunicode[1]:

“The usual advice is to convert your inputs to NFD before processing
further”

I can't imagine declaring all my static string variables with:

my $unistring = NFD('C’est une chaîne unicode');

Regards.

Footnotes: 
[1]  http://perldoc.perl.org/perlunicode.html

-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF




Comparing inputs with source strings

2016-05-09 Thread Daniel Dehennin
Hello,

I tried to make my Perl5 code unicode compliant after reading a post on
stackoverflow[1].

As suggested in the post:

“always run incoming stuff through NFD and outbound stuff from NFC.”

I had a hard time finding why my Test::More was failing but displaying
exactly the same strings for “got” and “expected”.

I finally checked how UTF-8 sources are handled and found that they are in
NFC form. I ran the following script:

#+begin_src perl
#!/usr/bin/env perl

use utf8;
use warnings;

use Test::More;
use Unicode::Normalize;

my $unistring = 'C’est une chaîne unicode';

my @forms = ("NFD", "NFC", "NFKD", "NFKC");

for my $form (@forms) {
    if ($unistring eq &$form($unistring)) {
        print "UTF-8 source is in form '$form'\n";
    }
}
#+end_src

and got:

#+begin_src
UTF-8 source is in form 'NFC'
UTF-8 source is in form 'NFKC'
#+end_src

So, the Test::More::is_deeply was trying to compare an input in NFD with
the expected string in NFC.

My code can use Unicode::Collate, but for all the code I did not write I
wonder if there is a way to handle it cleanly.

Or maybe I'm doing something wrong?

Regards.

Footnotes: 
[1]  https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF

