Re: select a variable as stdout and utf8 flag behaviour

2016-11-10 Thread Aristotle Pagaltzis
* Gert Brinkmann <g...@netcologne.de> [2016-11-09 16:00]:
> open(my $fh, '>:encoding(UTF-8)', \$html);
> my $orig_stdout = select( $fh );
> print "Ümläut Test ßaß; 使用下列语言\n";

Think of it this way:

Those three lines of code are an elaborate way of doing this:

$html = Encode::encode('UTF-8', "Ümläut Test ßaß; 使用下列语言\n");

If you wrote that code, would you be surprised that $html does not
have the UTF8 flag set afterwards?

Bonus question if you are not surprised then: what is the difference
between these two cases that makes your argument that “perl knows what
I put in there so it should know to set the UTF8 flag on it” not apply
to this?
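
(A minimal sketch to see this for yourself – assumes nothing beyond
core Encode, the utf8::is_utf8 introspection built-in, and a UTF-8
source file:

    use utf8;
    use Encode ();

    my $html;
    open my $fh, '>:encoding(UTF-8)', \$html;
    print {$fh} "Ümläut Test ßaß\n";
    close $fh;
    print utf8::is_utf8($html) ? "flag set\n" : "no flag\n";   # no flag

    my $enc = Encode::encode( 'UTF-8', "Ümläut Test ßaß\n" );
    print utf8::is_utf8($enc)  ? "flag set\n" : "no flag\n";   # no flag

Both ways you end up holding encoded octets, which is exactly why
there is no UTF8 flag to be found on the result.)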

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>


Re: Encode UTF-8 optimizations

2016-08-20 Thread Aristotle Pagaltzis
* Karl Williamson  [2016-08-21 03:12]:
> That should be done anyway to make sure we've got less buggy Unicode
> handling code available to older modules.

I think you meant “available to older perls”?


Re: UTF-8 encoding & decoding

2016-05-14 Thread Aristotle Pagaltzis
* Pali Rohár <pali.ro...@gmail.com> [2016-05-12 20:23]:
> If both functions should do same thing, why we have duplicity?

Encode.pm is big and fairly slow, because it handles a zillion encodings
and has lots of options for handling invalid input data. Perl needs only
UTF-8 transcoding and needs it fast, so it has code for just that. Since
that code is there anyway, it can just as well be exposed to Perl space.

> And which one is preferred to use?

Well, either you need Encode.pm or you don’t. The built-ins are faster
and always loaded, but they only do UTF-8 and if you have invalid data
then all you get is a false return value and no other help. If you need
anything else you pay the memory and take the speed hit of Encode.pm.
(If you are working on a large application, chances are high that you
have Encode.pm loaded anyway.)
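
For concreteness, a small sketch of the trade-off (core Encode only):

    use Encode ();

    my $bytes = "caf\xc3\xa9 \xff";        # valid UTF-8 plus one stray byte

    my $copy = $bytes;
    utf8::decode($copy)
        or warn "malformed UTF-8, no further detail\n";   # built-in: just a boolean

    my $text = Encode::decode( 'UTF-8', $bytes );         # default: U+FFFD substitution
    # or Encode::decode( 'UTF-8', $bytes, Encode::FB_CROAK ) to die instead

The built-in only tells you that something was wrong; Encode lets you
choose what to do about it, at the cost of loading the module.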

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>


Re: UTF-8 encoding & decoding

2016-05-06 Thread Aristotle Pagaltzis
* Pali Rohár <pali.ro...@gmail.com> [2016-05-06 14:50]:
> 1. What is difference between those two calls?
>
>  utf8::encode($str);
>
> and
>
>  $str = Encode::encode('utf8', $str);
>
> 2. What is difference between those?
>
>  utf8::decode($str);
>  $str = Encode::decode_utf8($str);

They do the same thing with different interfaces. utf8::encode/decode
modify a string in-place and return a boolean to signal success or not.
Encode.pm returns a copy and can be configured to do a range of things
with invalid input, from converting invalid bytes to replacement marks
to throwing an exception.
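
A minimal sketch of the two calling conventions (core Encode only):

    use Encode ();

    my $in_place = "Zur\x{fc}ck";
    utf8::encode($in_place);                        # $in_place itself now holds UTF-8 octets

    my $orig   = "Zur\x{fc}ck";
    my $octets = Encode::encode( 'UTF-8', $orig );  # returns a copy; $orig is untouched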

> 3. Where is implementation of utf8::encode/decode functions? It is not
> in utf8.pm, nor in utf8_heavy.pl and also not in unicore/Heavy.pl. And
> what those functions doing?

They are part of the perl interpreter and defined in universal.c as thin
wrappers around code ultimately from sv.c.

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>


Re: Choice of BOM for UTF-16 encoding

2014-02-09 Thread Aristotle Pagaltzis
* Geoffrey Leach <ge...@hughes.net> [2014-02-10 07:35]:
> Is there a way to force (from my module) the choice to be LE? It turns
> out that the library I'm supporting (taglib) works in LE.

Does it need a BOM prepended?

If not, just do the obvious and `encode('UTF-16LE', $str)`.

Cf. `perldoc Encode::Unicode`.
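
Something along these lines (a sketch; $str is assumed to hold your
character string):

    use Encode ();

    my $le = Encode::encode( 'UTF-16LE', $str );   # little-endian, no BOM

    # only if the consumer insists on a BOM: prepend U+FEFF in LE byte order
    my $le_with_bom = "\xFF\xFE" . $le;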

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/


Re: Matching upper ASCII characters in RE patterns

2010-12-20 Thread Aristotle Pagaltzis
* Jonathan Pool <p...@utilika.org> [2010-11-30 23:50]:
> As documented in
> http://rt.perl.org/rt3/Public/Bug/Display.html?id=80030 there
> seems to be a problem when use encoding 'utf8' is removed and
> replaced with use utf8, so the problem is not limited to the
> encoding pragma.

However, you can expect the `utf8` pragma to be fixed – though
that won’t help you right now. The `encoding` pragma OTOH is
irretrievably broken.

(There is also consensus that source files in arbitrary encodings
are not a sane idea anyway; if you need more than ASCII, your
code should be in UTF-8 and you should `use utf8`. So no, there is
no replacement for that aspect of the `encoding` pragma coming
down the pipe either, now or ever.)

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/


Re: Am I correct in thinking that the only way to get ord() to return a value over 256 is to send the character as a Unicode string instead of a byte string?

2010-10-29 Thread Aristotle Pagaltzis
* Dan Muey <d...@cpanel.net> [2010-10-28 21:55]:
> For example, note the differences in output between a unicode
> string and a byte string regarding character 257, as a unicode
> string it is 257, as a byte string it is 196.

That is not what’s going on.

$ perl -E'say ord 1234'
49

When you pass a multi-character string to `ord`, you get the code
point of the first character.

$ perl -E'say chr 49'
1

In your case you get 196. That is 0xC4, or the character Ä. It is
not the character ā (U+101 = code point 257).

0xC4 is the value of the first byte in the two-byte UTF-8
sequence that encodes the character 257. You are passing a string
containing a representation of those bytes as two characters to
`ord`, and `ord` is giving you the code point of the first
byte-as-character.

You are missing the rest of the bytes from the UTF-8 encoding.

You are losing data.

If you try this on more code points you will find that there are
*lots* of different characters that are reported as 196 – because
they get encoded as multi-byte sequences that all start with the
byte value 0xC4.
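
A quick sketch of both halves of that – the fix is to decode the
octets before asking for the code point:

    use Encode ();

    my $bytes = "\xc4\x81";                        # UTF-8 encoding of U+0101 (ā)
    print ord($bytes), "\n";                       # 196 – just the first byte

    my $chars = Encode::decode( 'UTF-8', $bytes );
    print ord($chars), "\n";                       # 257 – the actual code point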

-- 
*AUTOLOAD=*_;sub _{s/::([^:]*)$/print$1,(",$\/"," ")[defined wantarray]/e;chop;$_}
Just-another-Perl-hack;
#Aristotle Pagaltzis // http://plasmasturm.org/


Re: Silence “Wide character” warning globally one time

2010-08-02 Thread Aristotle Pagaltzis
* Michael Ludwig <mil...@gmx.de> [2010-07-30 01:20]:
> You need the equivalent of -CO in your script:
>
>   binmode STDOUT, ':utf8';

Argh, no, unfortunately not. The `:utf8` layer is bad. It does
the equivalent of `_utf8_on` on input and `_utf8_off` on output,
without actually decoding or (worse) encoding anything.

You want `:encoding(UTF-8)`.
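
I.e. something like

    binmode STDOUT, ':encoding(UTF-8)';

or, to cover STDIN/STDOUT/STDERR plus any handle opened in scope,

    use open ':encoding(UTF-8)', ':std';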

-- 
*AUTOLOAD=*_;sub _{s/(.*)::(.*)/print$2,(",$\/"," ")[defined wantarray]/e;$1}
Just-another-Perl-hack;
#Aristotle Pagaltzis // http://plasmasturm.org/


Re: Don't use the \C escape in regexes - Why not?

2010-05-04 Thread Aristotle Pagaltzis
* Michael Ludwig <michael.lud...@xing.com> [2010-05-04 14:55]:
> But wait a second: While URIs are meant to be made of
> characters, they're also meant to go over the wire, and there
> are no characters on the wire, only bytes. There is no standard
> encoding defined for the wire, although UTF-8 has come to be
> seen as the standard encoding for URIs containing non-ASCII
> characters. Perl having two standard encodings (UTF-8 and
> ISO-8859-1) for text and relying on the internal flag to tell
> which one is meant to matter, shouldn't the URI module either
> only accept bytes or only characters? Or rather, provide two
> different constructors instead of only one trying to be
> intelligent?
>
>   URI->bytes( $bytes ); # byte string
>   URI->chars( $chars ); # character string
>
> And, in addition, define the character encoding used for
> serialization.

Yes, exactly. And both methods would use the moral equivalent of
a plain `split //` – no trickery such as with `\C`. The only
difference between them is that the `chars` method would
`encode_utf8` the string first and then encode it blindly,
whereas the `bytes` method would leave it as is but then croak if
it found a code point > 0xFF (since the string is supposed to
represent an octet sequence already).

Notably absent in both cases: any dependence on the state of the
UTF8 flag of the string.
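
Purely as a sketch of the proposal – none of this is the actual URI.pm
API, and the package, constructor and helper names are made up:

    package URI::Hypothetical;
    use Carp ();
    use Encode ();

    sub chars {                           # URI::Hypothetical->chars( $chars )
        my ( $class, $chars ) = @_;
        my $octets = Encode::encode_utf8( $chars );
        return $class->_from_octets( $octets );      # escape blindly from here on
    }

    sub bytes {                           # URI::Hypothetical->bytes( $bytes )
        my ( $class, $bytes ) = @_;
        Carp::croak( 'not an octet string' )
            if $bytes =~ /[^\x00-\xFF]/;             # some code point > 0xFF present
        return $class->_from_octets( $bytes );
    }

Note that neither path ever consults the UTF8 flag.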

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/


Re: Effect of -C command line switch on `warn` and `die`

2010-04-22 Thread Aristotle Pagaltzis
Hi Michael,

* Michael Ludwig <michael.lud...@xing.com> [2010-04-22 17:00]:
> Consider the following script, the source of which is encoded
> in UTF-8:

I can’t answer your question, but I do want to suggest that you
re-post it to perl5-port...@perl.org – it’s much more likely that
someone over there will be able to tell you what’s up with this.

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/


Re: Use case for utf8::upgrade?

2010-04-08 Thread Aristotle Pagaltzis
* Michael Ludwig <michael.lud...@xing.com> [2010-04-08 09:25]:
> > since upgrading a string increases memory consumption and can
> > significantly slow down regex matches against it.
>
> Is it some copying behind the scenes that increases memory
> consumption?

Just the simple fact that some characters take multiple bytes to
encode in the UTF8-based format.

> Why does that have the potential to significantly slow down
> regex matches?

Because one byte and one character is no longer the same thing,
so if you know you want the 17th character in the string, you
can’t say where in memory that is. You have to scan the string.
This sort of access pattern is rare in practice – most
operations either just copy the entire string or scan over it one
character at a time. But the regex engine is one of those things
that sometimes needs to jump around in the string rather than
merely scanning linearly. (Perl’s regex engine does some caching
to avoid the worst penalties with this, but that in itself also
causes slowdown, so there’s a balance to strike.)
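
To make the memory half of that concrete – a sketch, using
bytes::length purely for introspection:

    use bytes ();                        # loads bytes::length without enabling the pragma

    my $str = "Zur\x{fc}ck";
    print length($str), "\n";            # 6 characters
    print bytes::length($str), "\n";     # 6 bytes while downgraded

    utf8::upgrade($str);
    print length($str), "\n";            # still 6 characters – same string
    print bytes::length($str), "\n";     # 7 bytes – the ü now takes two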

> Does that mean that when doing lots of matching, it might be
> preferable to use byte strings and byte semantics, not
> character strings and character semantics?

Almost all of the time the performance cost is negligible and not
worth sweating at the application code level.

Trying to work on text using byte semantics is a recipe for
massive headaches, and an invitation for bugs. It’s doable if you
are careful and disciplined, absolutely. But why punish yourself?
You gain little, at significant effort.

On 5.12, though, you can get a tiny potential improvement en
passant, with basically zero effort.

In that case – and only in that case: why not? The gain is small,
but so is the cost.

In the other direction, that doesn’t translate. Don’t go micro-
optimising your code for this.

> > Under older perls, it's a question of getting the wrong
> > results in less time and memory, so there's not an option.
>
> Wrong results? Could you clarify? Thanks :-)

Well, you get Latin-1 semantics, e.g. upper-/lowercasing will
ignore accented characters that fall outside the ASCII range.

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/


Re: Use case for utf8::upgrade?

2010-04-07 Thread Aristotle Pagaltzis
* Michael Ludwig <michael.lud...@xing.com> [2010-04-07 15:00]:
> Having read Juerd's list of useful advice, I don't understand
> the reason for its last three items:
>
> • utf8::upgrade before doing lc/lcfirst/uc
> • utf8::upgrade before doing case insensitive matching
> • utf8::upgrade before matching predefined character classes
>   like \w and \s
>
> Can anyone enlighten me on the background of using
> utf8::upgrade here?

Perl versions up to the upcoming 5.12.0 (I think) are buggy in
that they apply ISO-8859-1 semantics to downgraded strings and
Unicode semantics to upgraded strings, even when they contain the
same data. By upgrading your strings, you make sure that you get
Unicode semantics consistently.
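
On an affected perl the effect looks roughly like this (a sketch):

    my $s = "\xe4";            # ä, stored as a downgraded (byte) string
    print uc $s, "\n";         # stays "ä" – no case mapping applied

    utf8::upgrade($s);         # same character, upgraded storage
    print uc $s, "\n";         # now "Ä" – Unicode semantics apply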

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/


Re: Use case for utf8::upgrade?

2010-04-07 Thread Aristotle Pagaltzis
* Gisle Aas <gi...@aas.no> [2010-04-08 00:00]:
> This fix was withdrawn from 5.12.0. Currently you have to use
> feature 'unicode_strings' to get the sane behaviour in the
> current lexical scope. Current 'perldoc unicode' also says:
>
>   The use feature 'unicode_strings' pragma is intended to
>   always, regardless of platform, force Unicode semantics
>   in a particular lexical scope. In release 5.12, it is
>   partially implemented, applying only to case changes. See
>   The Unicode Bug below.
>
> This means that the utf8::upgrade() advice also applies to
> perl-5.12.0.

Oh right! That was it. (I couldn’t remember the specifics.)

Well, using `use feature 'unicode_strings';` and not upgrading
strings is a better strategy for code that doesn’t need to work
under earlier perl versions, since upgrading a string increases
memory consumption and can significantly slow down regex matches
against it.

Under older perls, it’s a question of getting the wrong results
in less time and memory, so there’s not an option.

If you want both, I guess you could do something like

    use constant UNICODE_BUG => ( $] < 5.012 );
    use if !UNICODE_BUG, feature => 'unicode_strings';
    # ...
    utf8::upgrade( $some_str ) if UNICODE_BUG;

(Note to readers who don’t already know: using a constant here
will cause either the conditional or the entire statement to get
optimised away depending on its truthiness, so that there won’t
be any runtime penalty for the conditionals.)

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/


Re: Character (or byte?) escapes under utf8 pragma

2010-04-04 Thread Aristotle Pagaltzis
Hi Michael,

I just noticed I never replied to this…

* Michael Ludwig <michael.lud...@xing.com> [2010-03-08 15:50]:
> On 2010-03-07 at 07:39, Aristotle Pagaltzis wrote:
> > Use the \U escape to indicate that you always mean a Unicode
> > code point. Due to other quirks in how \U is implemented, it
> > ends up not triggering the bug that \x would.
>
> How would I use that? I only know about the U specifier for
> pack:
>
>   my $smiley = pack 'U', 0x263a;

Sorry – I meant \N. E.g. in that case,

    my $smiley = "\N{U+263A}";

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/


Re: Character (or byte?) escapes under utf8 pragma

2010-03-06 Thread Aristotle Pagaltzis
Hi Michael,

[ perlbug readers, you will find the nut of the issue in the
  section marked BUG ]

* Michael Ludwig <michael.lud...@xing.com> [2010-03-03 14:05]:
> For convenience, I have test script source code in UTF-8. The
> test also deals with non-breaking spaces, which I prefer to
> keep as character references since they are not visible and
> might be mistaken by the casual onlooker for ordinary spaces.
> So I write them as \xa0. Or \x{a0}, or \x{00a0}.
>
> Now I find that they seem to be byte references, not character
> references.

Perl does not distinguish between bytes and characters. It does
distinguish between scalars that use a packed byte buffer for
storage vs. strings that use a variable-width integer sequence for
storage, but this is an implementation detail and does not mean
anything in terms of semantics. Strings are simply strings in
Perl. You cannot tell what kind of data they contain just by
looking at them and the UTF8 flag doesn’t tell you either.
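
A short illustration of that last point (a sketch):

    my $x = "Zur\xfcck";           # downgraded: UTF8 flag off
    my $y = $x;
    utf8::upgrade($y);             # upgraded: UTF8 flag on
    print $x eq $y ? "same string\n" : "different\n";   # same string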

> Consider the following test script:
>
> use strict;
> use warnings;
> use utf8; # source code in UTF-8 (Zurück)
> use open OUT => ':encoding(UTF-8)', ':std';
>
> my $str1 = "\xa0Zurück\n";      # byte - bad
> my $str2 = "\x{a0}Zurück\n";    # should be character, but isn't
> my $str3 = "\x{00a0}Zurück\n";  # ditto
> my $str4 = "\xa0" . "Zurück\n"; # upgrading hack, works
>
> print $str1, $str2, $str3, $str4;
>
> $str1 ne $str2 and die "won't die";
> $str1 ne $str3 and die "won't die";
> $str1 ne $str4 and die 'die now, somewhat counter-intuitively';

"\x{00a0}" does not map to utf8 at t.pl line 11.
\xA0Zurück
"\x{00a0}" does not map to utf8 at t.pl line 11.
\xA0Zurück
"\x{00a0}" does not map to utf8 at t.pl line 11.
\xA0Zurück
 Zurück
die now, somewhat counter-intuitively at t.pl line 15.

This is definitely a bug.

> The correct version of the string uses implicit upgrading of
> the byte escape \xa0 to a Unicode character. I've read
> upgrading should rather be avoided, but here it does the job.

No, upgrading is perfectly fine. Mixing byte and character data
is what should be avoided, because then Perl will assume it’s all
characters, which will result in mangling of one of the two kinds
of data. Usually the byte data is encoded text, in which case the
problem becomes apparent as double-encoded text. But it’s really
a problem both ways.

> Am I mistaken in my expectation that while \xa0 should be
> a byte, \x{a0} and \x{00a0} should be characters? Note that
> perlretut(1) seems to support this assumption:
>
>   Unicode characters in the range of 128-255 use two hexadecimal
>   digits with braces: \x{ab}. Note that this is different than
>   \xab, which is just a hexadecimal byte with no Unicode
>   significance.
>
>   http://perl.active-venture.com/pod/perlretut-morecharacter.html
>
> But maybe this only refers to these escapes inside regular expressions.

The documentation appears to be wrong. Unfortunately a lot of the
documentation of Perl itself is wrong or confused about Perl’s
string model.

> Or maybe the utf8 pragma breaks things here? Don't think so,
> though. If I comment it out, I have to recode my script to
> Latin1 in order for the strings to be valid.

Yes. This appears to be a utf8 pragma bug or a bug in the parser
that shows up in interaction with the utf8 pragma.

== BUG ==

What happens is that the presence of the ü under the utf8 pragma
triggers using the variable-width integer sequence format for the
string, but the 0xA0 byte from the \x escape gets written into
that buffer verbatim, as if it were a packed byte array string.
This is wrong and completely broken.

== BUG ==

> Note that the reason I use the utf8 pragma is so I can write
> Zurück in my source code and automatically have Perl informed
> that these are characters, not bytes - which is a great
> convenience.
>
> Yeah, it would also work in Latin1, and our editors handle
> various encodings just fine - but we have a good UTF-8
> development environment and there might be characters not
> representable in Latin1 that I'd like to add to the script
> source.

Writing source in UTF-8 is a perfectly sane practice. No need to
justify it.

> What's your advice for handling this situation more elegantly?

Use the \U escape to indicate that you always mean a Unicode code
point. Due to other quirks in how \U is implemented, it ends up
not triggering the bug that \x would.

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/


Re: use encoding 'utf8' and \x{00e4} notation

2010-02-03 Thread Aristotle Pagaltzis
* Michael Ludwig <michael.lud...@xing.com> [2010-02-02 17:35]:
>   use encoding 'utf8';

The `encoding` pragma is broken. Do not use it.

You want

use open ':encoding(UTF-8)', ':std';

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/


Re: Determining IO layer set on filehandle

2010-01-29 Thread Aristotle Pagaltzis
* Michael Ludwig <michael.lud...@xing.com> [2010-01-29 18:30]:
> It appears you can use that information to restore a filehandle
> configuration:
>
> # Good: duplicate STDOUT and reconfigure the duplicate.
> # STDOUT (the global one) is left untouched.
> sub out_bin_good {
>    open my $fh, '>&STDOUT' or die "dup STDOUT: $!";
>    binmode $fh, ':raw' or die "binmode: $!";
>    print $fh "BINÄR 3\t", @_;
>    print STDERR "* layer: $_\n" for PerlIO::get_layers( $fh );
> }
>
> # Also good: save the IO mode and restore it afterwards.
> sub out_bin_also_good {
>    my @layers = PerlIO::get_layers( STDOUT );
>    binmode STDOUT, ':raw' or die "binmode: $!";
>    print "BINÄR 4\t", @_;
>    print STDERR "* layer: $_\n" for PerlIO::get_layers( STDOUT );
>    my $layers = join '', map ":$_", @layers;
>    binmode STDOUT, $layers;
>    print STDERR "reset STDOUT to $layers\n";
>    print STDERR "* layer: $_\n" for PerlIO::get_layers( STDOUT );
> }

Considering the relative complexities of the approaches and the
fact that conservation of filehandle state is not a concern in
your case, I know which solution *I* would favour…

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/