Re: Fwd: Encode 3.0

2019-03-12 Thread pali
Hello, in the future please write in English.

That problem should already be fixed by this pull request:
https://github.com/dankogai/p5-encode/pull/138

On Tuesday 19 February 2019 20:08:00 dagmatritsa via perl-unicode wrote:
> 
> 
> 
>  Forwarded message 
> From: dagmatritsa 
> To: danko...@cpan.org
> Date: Tuesday, 19 February 2019, 20:06 +03:00
> Subject: Encode 3.0
> 
> 
> Hi there!
> 
> cpan-outdated -p | cpanm -f
> Slackware 14.2
> Linux  4.4.172-smp #2 SMP Wed Jan 30 16:13:07 CST 2019 i686 Intel(R) 
> Celeron(R) CPU    E3300  @ 2.50GHz GenuineIntel GNU/Linux
> gcc (GCC) 5.5.0
> perl v5.22.2
> Building and testing Encode-3.00 ... FAIL
> 
> cp Unicode.pm ../blib/lib/Encode/Unicode.pm
> Running Mkbootstrap for Unicode ()
> chmod 644 "Unicode.bs"
> "/usr/bin/perl5.22.2" -MExtUtils::Command::MM -e 'cp_nonempty' -- Unicode.bs 
> ../blib/arch/auto/Encode/Unicode/Unicode.bs 644
> "/usr/bin/perl5.22.2" "/usr/local/share/perl5/ExtUtils/xsubpp"  -typemap 
> '/usr/share/perl5/ExtUtils/typemap'  Unicode.xs > Unicode.xsc
> mv Unicode.xsc Unicode.c
> cc -c -I./Encode  -I../Encode -D_REENTRANT -D_GNU_SOURCE -fwrapv 
> -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include 
> -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2 -O2 
> -march=i586 -mtune=i686   -DVERSION=\"2.18\" -DXS_VERSION=\"2.18\" -fPIC 
> "-I/usr/lib/perl5/CORE"   Unicode.c
> In file included from Unicode.xs:11:0:
> ../Encode/encode.h: In function 'S_does_utf8_overflow':
> ../Encode/encode.h:416:19: warning: implicit declaration of function 
> 'S_is_utf8_overlong_given_start_byte_ok' [-Wimplicit-function-declaration]
>  is_overlong = S_is_utf8_overlong_given_start_byte_ok(s, len);
>    ^
> ../Encode/encode.h:450:20: warning: implicit declaration of function 'memGT' 
> [-Wimplicit-function-declaration]
>  return memGT(s + 1, conts_for_highest_30_bit, cmp_len);
>     ^
> ../Encode/encode.h: At top level:
> 
> ../Encode/encode.h:488:1: error: static declaration of 
> 'S_is_utf8_overlong_given_start_byte_ok' follows non-static declaration
>  S_is_utf8_overlong_given_start_byte_ok(const U8 * const s, const STRLEN len)
>  ^
> 
> ../Encode/encode.h:416:19: note: previous implicit declaration of 
> 'S_is_utf8_overlong_given_start_byte_ok' was here
>  is_overlong = S_is_utf8_overlong_given_start_byte_ok(s, len);
>    ^
> Makefile:313: recipe for target 'Unicode.o' failed
> 
> 
> With best regards dagmatritsa :)
> 
> 
> 
> --
> 
> With best regards dagmatritsa :)


Re: select a variable as stdout and utf8 flag behaviour

2016-11-09 Thread pali
On Wednesday 09 November 2016 15:55:47 Gert Brinkmann wrote:
> Hello,
> 
...
> 
> This prints out the utf8 characters corrupted. You have to flag the
> Variable after writing into it with Encode::_utf8_on() as utf8 to make
> it work correctly. (So activate the commented line.)
> 
> Using this _utf8_on() usually means that I am doing something wrong.

Yes, that is true! You should never use the _utf8_on/_utf8_off/is_utf8
functions! They exist *only* for dealing with buggy XS modules, not for
pure Perl code... In pure Perl code you must *not* care about the UTF8
flag.

> Is there a better way to achieve the correct behaviour?

Of course! When you think that you need Encode::_utf8_on(), use
utf8::decode() instead (or Encode::decode('UTF-8', ...)). Similarly, use
utf8::encode() (or Encode::encode('UTF-8', ...)) instead of
Encode::_utf8_off().
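As a minimal sketch of that replacement (the variable names here are mine, not from the original mail):

```perl
use strict;
use warnings;
use Encode ();

my $bytes = "\xC5\xBElu\xC5\xA5";   # UTF-8 octets for "žluť"

# Wrong: Encode::_utf8_on($bytes) would flip the internal flag
# without validating the octets.

# Right: decode the octets into a character string...
my $chars = Encode::decode('UTF-8', $bytes);   # or: utf8::decode($bytes)

# ...and encode back instead of using Encode::_utf8_off():
my $octets = Encode::encode('UTF-8', $chars);
```

decode() actually validates the input, so malformed octets are handled according to the CHECK argument instead of silently producing a corrupt string.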

> Btw. there was a change in the behaviour between perl v5.14.2 and
> v5.20.2: In older perl versions you could do a
> 
> my $html = '';
> Encode::_utf8_on($html);
> 
> before opening the file handle onto this variable. In newer perl
> versions the utf8 flag is reset on open() and print() to the variable's
> file handle.

The UTF8 flag just indicates whether the internal encoding of a Perl
scalar is Latin1 or UTF8. But it is internal: any Latin1 string can be
represented either in Latin1 (without the UTF8 flag) or in UTF-8 (with the
UTF8 flag). You should not care about the internal representation in pure
Perl code. Any Perl function may, at any time, convert a scalar between
these two encodings when that is possible (for the ASCII and Latin1
character sets).

(Btw, on EBCDIC platforms the UTF8 flag indicates that the internal
encoding is UTFEBCDIC or EBCDIC, not UTF-8!!, so really do not depend on
the UTF8 flag!)
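A small demonstration of this point (utf8::upgrade changes only the internal representation, not the characters):

```perl
use strict;
use warnings;

my $a = "caf\xE9";     # "café" stored as Latin-1 octets, UTF8 flag off
my $b = "caf\xE9";
utf8::upgrade($b);     # same characters, now stored as UTF-8, flag on

print utf8::is_utf8($a) ? "flag on\n" : "flag off\n";   # flag off
print utf8::is_utf8($b) ? "flag on\n" : "flag off\n";   # flag on
print $a eq $b ? "equal\n" : "different\n";             # equal
```

eq compares characters, not the internal bytes, which is why the two scalars are equal even though their internal encodings differ.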

And to your question, here is an explanation of your source code:

> -
> use strict;
> use utf8;

Now the source code is expected to be in UTF-8, and Perl string literals
are treated as wide-character strings.

> use Encode;
> use FileHandle;
> 
> binmode STDOUT, ":utf8";

Now the STDOUT handle accepts wide characters (above 0xFF) when printing
and converts output to UTF-8 octets. So your terminal should be configured
to accept and display UTF-8 sequences correctly.

> 
> my $html = '';
> 
> #-- open filehandle to write into the $html variable as utf8
> open(my $fh, '>:encoding(UTF-8)', \$html);

Now printing to $fh accepts wide characters and converts the printed
characters to UTF-8 octets before storing them in $html. It means that
$html will *always* contain a sequence of octets which represent UTF-8
sequences.

> my $orig_stdout = select( $fh );
> 
> 
> print "Ümläut Test ßaß; 使用下列语言\n";

Now you have a string with wide characters, and this print will send that
string to $html. Afterwards $html holds a sequence of octets containing
the encoded form of that wide string.

> 
> 
> select( $orig_stdout );
> $fh->close();
> 
> #You need to activate this line to make utf8 output correct
> #Encode::_utf8_on($html);
> 
> print $html;

And now you send a sequence of UTF-8 octets to STDOUT, which expects wide
characters and converts them to UTF-8 octets. So what you get is a
double-encoded UTF-8 sequence.

Now stop and think about why this happens!

> -

The fix is really simple. Either decode the UTF-8 octets in $html back to
wide characters (via utf8::decode($html)), or tell STDOUT that it expects
raw octets rather than wide strings (= remove the binmode STDOUT, ":utf8";
line).

Again... think about why both of my proposed fixes work.
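A condensed sketch of the first fix (the second fix, dropping the :utf8 layer on STDOUT, is noted in a comment):

```perl
use strict;
use warnings;
use utf8;   # string literals below contain wide characters

my $html = '';
open(my $fh, '>:encoding(UTF-8)', \$html) or die $!;
print {$fh} "Ümläut Test\n";
close $fh;
# $html now holds UTF-8 octets, not wide characters.

# Fix 1: decode the octets back to characters before printing
# them through a character-oriented (:utf8) STDOUT:
utf8::decode($html);
binmode STDOUT, ':utf8';
print $html;

# Fix 2 (alternative): skip both utf8::decode() and the binmode
# above, and print the raw octets to a byte-oriented STDOUT.
```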



Btw, perl does not use strict UTF-8 encoding internally, but perl's
extended utf8. If you want strict UTF-8, use the ":encoding(UTF-8)" layer.
The ":utf8" and ":encoding(utf8)" layers (without the hyphen) use that
non-strict extended utf8 encoding. utf8::encode/decode are also
non-strict...
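The difference is easy to demonstrate with a surrogate code point (a sketch; under the default CHECK argument the strict encoder substitutes the invalid character rather than dying):

```perl
use strict;
use warnings;
no warnings 'utf8';   # silence surrogate warnings for the demo
use Encode ();

my $surrogate = chr 0xD800;    # a surrogate: not a valid Unicode character

my $lax = $surrogate;
utf8::encode($lax);            # perl's extended utf8: encoded anyway
printf "lax:    %d octets\n", length $lax;    # 3 octets

my $strict = Encode::encode('UTF-8', $surrogate);
print $strict eq $lax ? "same\n" : "strict encoder refused the surrogate\n";
```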


Re: Encode UTF-8 optimizations

2016-11-01 Thread pali
Hi! New Encode 2.87 with lots of fixes for Encode.xs and
Encode::MIME::Header was released. Can you sync/import it into blead?


Re: Encode UTF-8 optimizations

2016-10-27 Thread pali
On Sunday 25 September 2016 10:49:41 Karl Williamson wrote:
> On 09/25/2016 04:06 AM, p...@cpan.org wrote:
> >On Thursday 01 September 2016 09:30:08 p...@cpan.org wrote:
> >>On Wednesday 31 August 2016 21:27:37 Karl Williamson wrote:
> >>>We may change Encode in blead too, since it already differs from
> >>>cpan. I'll have to get Sawyer's opinion on that.  But the next
> >>>step is for me to fix Devel::PPPort to handle the things that
> >>>Encode needs, issue a pull request there, and after that is
> >>>resolved issue an Encode PR.
> >
> >Hi! One month passed, do you have any progress in syncing blead and cpan
> >Encode version? Or do you need some help?
> 
> I don't see any way to easily split up the tasks.  In the next 48 hours I
> will push to blead the changes for Encode to use.  In working on this, I've
> seen some other things I think should happen to blead to give XS writers all
> they need so they won't be tempted to get to such a low level as before,and
> introduce security bugs.  And I am working on this and expect to finish this
> coming week.  After this soaks in blead for a while, I'll issue a pull
> request so that all the tools are in Devel::PPPort.  At that time Encode can
> be sync'd.

Hi! I sent my changes and fixes for Encode upstream on GitHub.
They include fixes for more crashes reported on rt.cpan.org.

I think those crash fixes should also be included in blead, as processing
untrusted data (e.g. prepared by an attacker) leads to a crash of the
whole perl...


Re: Encode UTF-8 optimizations

2016-09-25 Thread pali
On Thursday 01 September 2016 09:30:08 p...@cpan.org wrote:
> On Wednesday 31 August 2016 21:27:37 Karl Williamson wrote:
> > We may change Encode in blead too, since it already differs from
> > cpan. I'll have to get Sawyer's opinion on that.  But the next
> > step is for me to fix Devel::PPPort to handle the things that
> > Encode needs, issue a pull request there, and after that is
> > resolved issue an Encode PR.

Hi! One month passed, do you have any progress in syncing blead and cpan 
Encode version? Or do you need some help?

> In my opinion we should sync Encode version in blead and on cpan.
> Currently they are more or less different which can cause problems...
> 
> Anyway, I have some suggestions for changes about warnings in
> Encode::utf8 package. If you have time, please look at that (I sent
> email) and tell what do you think about it... In my opinion that
> should be fixed too and I can prepare patches (after decision will
> be made).


Re: Encode UTF-8 optimizations

2016-08-31 Thread pali
On Monday 29 August 2016 17:00:00 Karl Williamson wrote:
> If you'd be willing to test this out, especially the performance
> parts that would be great!
[snip]
> There are 2 experimental performance commits.  If you want to see if
> they actually improve performance by doing a before/after compare
> that would be nice.

So here are my results:

strict = bless({strict_utf8 => 1}, "Encode::utf8")->encode_xs/decode_xs
lax= bless({strict_utf8 => 0}, "Encode::utf8")->encode_xs/decode_xs
int= utf8::encode/decode

all= join "", map { chr } 0 .. 0x10
short  = "žluťoučký kůň pěl ďábelské ódy " x 45
long   = $short x 1000
ishort = "\xA0" x 1000
ilong  = "\xA0" x 100

your   = 9c03449800417dd02cc1af613951a1002490a52a
orig   = f16e7fa35c1302aa056db5d8d022b7861c1dd2e8
my = orig without c8247c27c13d1cf152398e453793a91916d2185d
your1  = your without b65e9a52d8b428146ee554d724b9274f8e77286c
your2  = your without 9ccc3ecd1119ccdb64e91b1f03376916aa8cc6f7


decode
                    all       ilong        ishort        long        short
    my: - int    285.94/s  14988.61/s  4694109.54/s   704.15/s  599678.93/s
  orig: - int    292.41/s  15121.98/s  4782883.50/s   494.33/s  553182.28/s
 your1: - int    271.21/s  14232.25/s  4706722.93/s   599.68/s  554941.90/s
 your2: - int    280.85/s  14090.33/s  4210573.40/s   593.93/s  558487.86/s
  your: - int    283.23/s  15121.98/s  4500252.51/s   691.95/s  678859.55/s

                    all       ilong        ishort        long        short
    my: - lax      83.28/s    202.22/s   142049.67/s   181.82/s  163352.41/s
  orig: - lax      53.49/s    201.58/s   152422.11/s   147.13/s  133974.37/s
 your1: - lax     255.13/s     53.75/s    47590.82/s   560.34/s  431447.77/s
 your2: - lax     281.71/s     48.41/s    43260.19/s   634.16/s  445365.29/s
  your: - lax     286.96/s     46.35/s    42848.40/s   632.20/s  442546.52/s

                    all       ilong        ishort        long        short
    my: - strict   90.48/s    200.00/s   143081.15/s   197.53/s  175800.00/s
  orig: - strict   49.21/s    202.22/s   149447.34/s   142.81/s  128290.63/s
 your1: - strict  154.94/s     48.16/s    44237.93/s   191.36/s  169228.16/s
 your2: - strict  158.75/s     40.06/s    37244.06/s   195.95/s  173588.68/s
  your: - strict  158.26/s     38.54/s    36898.14/s   195.95/s  172504.61/s


encode
                    all            ilong          ishort         long           short
    my: - int    5197722.67/s   5227338.26/s   5210583.97/s   5163520.62/s   5227338.26/s
  orig: - int    5449888.54/s   5381336.48/s   5370254.05/s   5449888.54/s   5301624.60/s
 your1: - int    5244200.62/s   5293830.28/s   5277183.02/s   5361483.07/s   5260640.13/s
 your2: - int    5435994.67/s   5432587.30/s   5398312.30/s   5487602.22/s   5606457.74/s
  your: - int    5261172.17/s   5327441.90/s   5310582.91/s   5310582.91/s   5361483.07/s

                    all       ilong        ishort         long         short
    my: - lax    2442.24/s  15084.08/s   2882995.00/s   7993.15/s   2716293.65/s
  orig: - lax    2438.39/s  15121.98/s   2933419.33/s   7965.22/s   2665521.81/s
 your1: - lax    2229.94/s  14908.60/s   2117316.51/s   7428.89/s   2011133.75/s
 your2: - lax    2400.92/s  15121.98/s   3046739.87/s   8065.41/s   2742961.18/s
  your: - lax    2368.00/s  15168.94/s   2862328.67/s   8090.85/s   2685694.50/s

                    all       ilong        ishort         long         short
    my: - strict   92.16/s    204.81/s   157772.05/s    200.00/s   190344.59/s
  orig: - strict   49.04/s    202.22/s   160767.72/s    142.81/s   133548.90/s
 your1: - strict  147.75/s     46.91/s    46095.57/s    194.36/s   176949.84/s
 your2: - strict  159.25/s     40.19/s    38034.59/s    196.20/s   185166.45/s
  your: - strict  158.26/s     38.54/s    37012.73/s    196.20/s   186357.23/s


So it looks like the experimental commits did not speed up the encoder or
decoder.

What is relevant from these tests is that your patches slow down encoding
and decoding of illegal sequences like "\xA0" x 100 by about 4-5 times.


Re: Encode UTF-8 optimizations

2016-08-25 Thread pali
On Wednesday 24 August 2016 22:49:21 Karl Williamson wrote:
> On 08/22/2016 02:47 PM, p...@cpan.org wrote:
> 
> snip
> 
> >I added some tests for overlong sequences. Only for ASCII platforms, tests 
> >for EBCDIC
> >are missing (sorry, I do not have access to any EBCDIC platform for testing).
> 
> It's fine to skip those tests on EBCDIC.

Ok.

>  > Anyway, how it behave on EBCDIC platforms? And maybe another question
>  > what should  Encode::encode('UTF-8', $str) do on EBCDIC? Encode $str to
>  > UTF-8 or to UTF-EBCDIC?
> >>>
> >>> It works fine on EBCDIC platforms.  There are other bugs in Encode on
> >>> EBCDIC that I plan on investigating as time permits.  Doing this has
> >>> fixed some of these for free.  The uvuni() functions should in almost
> >>> all instances be uvchr(), and my patch does that.
> >Now I'm thinking if FBCHAR_UTF8 define is working also on EBCDIC... I think 
> >that it
> >should be different for UTF-EBCDIC.
> 
> I'll fix that
> >
> >>> On EBCDIC platforms, UTF-8 is defined to be UTF-EBCDIC (or vice versa if
> >>> you prefer), so $str will effectively be in the version of UTF-EBCDIC
> >>> valid for the platform it is running on (there are differences depending
> >>> on the platform's underlying code page).
> >So it means that on EBCDIC platforms you cannot process file which is 
> >encoded in UTF-8?
> >As Encode::decode("UTF-8", $str) expect $str to be in UTF-EBCDIC and not in 
> >UTF-8 (as I
> >understood).
> >
> Yes.  The two worlds do not meet.  If you are on an EBCDIC platform, the
> native encoding is UTF-EBCDIC tailored to the code page the platform runs
> on.
> 
> In searching, I did not find anything that converts between the two, so I
> wrote a Perl script to do so.  Our OS/390 man, Yaroslav, wrote one in C.

Thank you for the information! I thought that the "UTF-8" encoding (with
hyphen) is the strict and correct UTF-8 version on both ASCII & EBCDIC
platforms, as nothing in the Encode documentation says that it is
different on EBCDIC...

Anyway, if you need some help with the Encode module or anything else, let
me know, as I want to have correctly working UTF-8 support in Encode...


Re: Encode UTF-8 optimizations

2016-08-22 Thread pali
(this applies only to strict UTF-8)

On Monday 22 August 2016 23:19:51 Karl Williamson wrote:
> The code could be tweaked to call UTF8_IS_SUPER first, but I'm
> asserting that an optimizing compiler will see that any call to
> is_utf8_char_slow() is pointless, and will optimize it out.

Such an optimization cannot be done; the compiler cannot know such a thing...

You have this code:

+const STRLEN char_len = isUTF8_CHAR(x, send);
+
+if (UNLIKELY(! char_len)
+|| (UNLIKELY(isUTF8_POSSIBLY_PROBLEMATIC(*x))
+&& (   UNLIKELY(UTF8_IS_SURROGATE(x, send))
+|| UNLIKELY(UTF8_IS_SUPER(x, send))
+|| UNLIKELY(UTF8_IS_NONCHAR(x, send)
+{
+*ep = x;
+return FALSE;
+}

Here the isUTF8_CHAR() macro will call the function is_utf8_char_slow() if 
the condition IS_UTF8_CHAR_FAST(UTF8SKIP(x)) is true. And because 
is_utf8_char_slow() is an external library function, the compiler has 
absolutely no idea what that function is doing. In a non-functional world 
such a function could have side effects, etc., and the compiler really 
cannot eliminate that call.

Moving UTF8_IS_SUPER before isUTF8_CHAR might help, but I'm skeptical 
whether gcc can really propagate constants from the PL_utf8skip[] array 
back and prove that IS_UTF8_CHAR_FAST must always be true when 
UTF8_IS_SUPER is true too...

Rather, add an IS_UTF8_CHAR_FAST(UTF8SKIP(s)) check (or similar) before 
the isUTF8_CHAR() call. That should totally eliminate generating code with 
a call to the is_utf8_char_slow() function.

With UTF8_IS_SUPER there can be a branch in the binary code which will 
never be evaluated.


Re: Encode utf8 warnings

2016-08-22 Thread pali
On Saturday 13 August 2016 19:41:46 p...@cpan.org wrote:
> Hello, I see that there is one big mess in utf8 warnings for Encode.

Per request this discussion was moved to perl5-port...@perl.org ML:
http://www.nntp.perl.org/group/perl.perl5.porters/2016/08/msg239061.html


Re: Encode UTF-8 optimizations

2016-08-22 Thread pali
On Sunday 21 August 2016 08:49:08 Karl Williamson wrote:
> On 08/21/2016 02:34 AM, p...@cpan.org wrote:
> >On Sunday 21 August 2016 03:10:40 Karl Williamson wrote:
> >>Top posting.
> >>
> >>Attached is my alternative patch.  It effectively uses a different
> >>algorithm to avoid decoding the input into code points, and to copy
> >>all spans of valid input at once, instead of character at a time.
> >>
> >>And it uses only currently available functions.
> >
> >And that's the problem. As already wrote in previous email, calling
> >function from shared library cannot be heavy optimized as inlined
> >function and cause slow down. You are calling is_utf8_string_loc for
> >non-strict mode which is not inlined and so encode/decode of non-strict
> >mode will be slower...
> >
> >And also in is_strict_utf8_string_loc you are calling isUTF8_CHAR which
> >is calling _is_utf8_char_slow and which is calling utf8n_to_uvchr which
> >cannot be inlined too...
> >
> >Therefore I think this is not good approach...
> >
> 
> Then you should run your benchmarks to find out the performance.

You are right, benchmarks are needed to show final results.

> On valid input, is_utf8_string_loc() is called once per string.  The
> function call overhead and non-inlining should be not noticeable.

Ah right, I misread it as being called once per valid sequence, not once
for the whole string. You are right.

> On valid input, is_utf8_char_slow() is never called.  The used-parts can be
> inlined.

Yes, but this function is there to be called primarily on unknown input
which does not have to be valid. If I know that the input is valid, then
utf8::encode/decode is enough :-)

> On invalid input, performance should be a minor consideration.

See below...

> The inner loop is much tighter in both functions; likely it can be held in
> the cache.  The algorithm avoids a bunch of work compared to the previous
> one.

Right, for valid input the algorithm is really faster. Whether it is
because of fewer cache misses... maybe... I can play with perf or another
tool to see where the bottleneck is now.

> I doubt that it will be slower than that.  The only way to know in any
> performance situation is to actually test.  And know that things will be
> different depending on the underlying hardware, so only large differences
> are really significant.

So, here are my test results. You can say that they are subjective, but I
would be happy if somebody provided better input for performance tests.

Abbreviations:
strict = Encode::encode/decode "UTF-8"
lax = Encode::encode/decode "utf8"
int = utf8::encode/decode
orig = commit 92d73bfab7792718f9ad5c5dc54013176ed9c76b
your = orig + 0001-Speed-up-Encode-UTF-8-validation-checking.patch
my = orig + revert commit c8247c27c13d1cf152398e453793a91916d2185d

Test cases:
all = join "", map { chr } 0 .. 0x10
short = "žluťoučký kůň pěl ďábelské ódy " x 45
long = $short x 1000
invalid-short = "\xA0" x 1000
invalid-long = "\xA0" x 100

Encoding was called on a string with the Encode::_utf8_on() flag set.
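The rates were collected with the Benchmark module; a reduced reconstruction of such a harness (not the exact script used above) could look like:

```perl
use strict;
use warnings;
use utf8;
use Benchmark qw(countit);
use Encode ();

my $short  = "žluťoučký kůň pěl ďábelské ódy " x 45;
my $octets = Encode::encode('UTF-8', $short);

for my $case (
    [ strict => sub { Encode::decode('UTF-8', $octets) } ],
    [ lax    => sub { Encode::decode('utf8',  $octets) } ],
    [ int    => sub { my $c = $octets; utf8::decode($c); $c } ],
) {
    my ($name, $code) = @$case;
    my $t = countit(1, $code);    # run the sub for ~1 second of CPU time
    printf "%-6s %10.2f/s\n", $name, $t->iters / $t->cpu_a;
}
```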


Rates:

encode:
                   all        short       long    invalid-short  invalid-long
orig - strict       41/s    124533/s     132/s       115197/s        172/s
your - strict      176/s    411523/s     427/s        54813/s         66/s
my   - strict       80/s    172712/s     186/s       113787/s        138/s

orig - lax        1010/s   3225806/s    6250/s       546800/s       5151/s
your - lax         952/s   3225806/s    5882/s       519325/s       4919/s
my   - lax        1060/s   3125000/s    6250/s       645119/s       5009/s

orig - int     8154604/s      1000/s    infty       9787566/s    9748151/s
your - int     9135243/s        /s      infty       8922821/s    9737657/s
my   - int     9779395/s      1000/s    infty       9822046/s    8949861/s

decode:
                   all        short       long    invalid-short  invalid-long
orig - strict       39/s    119048/s     131/s       108574/s        171/s
your - strict      173/s    353357/s     442/s        42440/s         55/s
my   - strict       69/s        17/s     182/s       117291/s        135/s

orig - lax          39/s    123609/s     137/s       127302/s        172/s
your - lax         230/s    393701/s     495/s        37346/s         65/s
my   - lax          79/s    158983/s     180/s       121456/s        138/s

orig - int         274/s    546448/s     565/s      8219513/s      12357/s
your - int         273/s    540541/s     562/s      7226066/s      12948/s
my   - int         274/s    543478/s     562/s      8502902/s      12421/s


int is there just for verification of the tests, as the utf8::encode/decode
functions were not changed.

Results are: your patch is faster for valid sequences (as you wrote
above), but slower for invalid ones (in some cases radically).

So I would propose two optimizations:

1) Replace the isUTF8_CHAR macro in is_strict_utf8_string_loc() with a new
   one which does not call utf8n_to_uvchr. That call is not needed, as in
   that case the sequence is already invalid.

2) Try to make an inline version of the function is_utf8_string_loc(). Maybe
   merge with 

Re: Encode UTF-8 optimizations

2016-08-21 Thread pali
On Sunday 21 August 2016 03:10:40 Karl Williamson wrote:
> Top posting.
> 
> Attached is my alternative patch.  It effectively uses a different
> algorithm to avoid decoding the input into code points, and to copy
> all spans of valid input at once, instead of character at a time.
> 
> And it uses only currently available functions.

And that's the problem. As I already wrote in a previous email, calling a 
function from a shared library cannot be heavily optimized like an inlined 
function, and causes a slowdown. You are calling is_utf8_string_loc for 
non-strict mode, which is not inlined, and so encode/decode in non-strict 
mode will be slower...

And also in is_strict_utf8_string_loc you are calling isUTF8_CHAR, which 
calls _is_utf8_char_slow, which calls utf8n_to_uvchr, which cannot be 
inlined either...

Therefore I think this is not a good approach...


Re: Encode UTF-8 optimizations

2016-08-19 Thread pali
On Thursday 18 August 2016 23:06:27 Karl Williamson wrote:
> On 08/12/2016 09:31 AM, p...@cpan.org wrote:
> >On Thursday 11 August 2016 17:41:23 Karl Williamson wrote:
> >>On 07/09/2016 05:12 PM, p...@cpan.org wrote:
> >>>Hi! As we know utf8::encode() does not provide correct UTF-8 encoding
> >>>and Encode::encode("UTF-8", ...) should be used instead. Also opening
> >>>file should be done by :encoding(UTF-8) layer instead :utf8.
> >>>
> >>>But UTF-8 strict implementation in Encode module is horrible slow when
> >>>comparing to utf8::encode(). It is implemented in Encode.xs file and for
> >>>benchmarking can be this XS implementation called directly by:
> >>>
> >>> use Encode;
> >>> my $output = Encode::utf8::encode_xs({strict_utf8 => 1}, $input)
> >>>
> >>>(without overhead of Encode module...)
> >>>
> >>>Here are my results on 160 bytes long input string:
> >>>
> >>> Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  8 wallclock secs ( 8.56 usr +  0.00 sys =  8.56 CPU) @ 467289.72/s (n=400)
> >>> Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.66 usr +  0.00 sys =  1.66 CPU) @ 2409638.55/s (n=400)
> >>> utf8::encode:  1 wallclock secs ( 0.39 usr +  0.00 sys =  0.39 CPU) @ 10256410.26/s (n=400)
> >>>
> >>> I found two bottle necks (slow sv_catpv* and utf8n_to_uvuni functions)
> >>> and did some optimizations. Final results are:
> >>>
> >>> Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  2 wallclock secs ( 3.27 usr +  0.00 sys =  3.27 CPU) @ 1223241.59/s (n=400)
> >>> Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.68 usr +  0.00 sys =  1.68 CPU) @ 2380952.38/s (n=400)
> >>> utf8::encode:  1 wallclock secs ( 0.40 usr +  0.00 sys =  0.40 CPU) @ 1000.00/s (n=400)
> >>>
> >>>Patches are on github at pull request:
> >>>https://github.com/dankogai/p5-encode/pull/56
> >>>
> >>>I would like if somebody review my patches and tell if this is the
> >>>right way for optimizations...
> >>>
> >>
> >>I'm sorry that this slipped off my radar until I saw it in the new Encode
> >>release
> >>
> >>There are a couple of things I see wrong with your patch.
> >>
> >>1) It does not catch the malformation of an overlong sequence.  This is a
> >>serious malformation which has been used for attacks.  Basically, after you
> >>get the result, you need to check that it is the expected length for that
> >>result.  For example, \xC2\x80 will have an input length of 2, and evaluates
> >>to \x00, whose expected length is 1, and so the input is overlong.  In
> >>modern perls, you can just do an OFFUNISKIP(uv) and compare that with the
> >>passed-in length.  This can be rewritten for perls back to 5.8 using
> >>UNI_SKIP and UNI_TO_NATIVE
> >
> >I do not see where can be a problem. At least I think my patches should
> >be compatible with previous implementation of Encode.xs...
> >
> >First UTF8_IS_INVARIANT is checked and one character processed.
> >
> >Otherwise UTF8_IS_START is checked and UTF8SKIP is used to get length of
> >sequence. And then len-1 characters are checked if they pass test for
> >UTF8_IS_CONTINUATION.
> >
> >If there are less characters then following does not
> >UTF8_IS_CONTINUATION and error is reported. If there are more, then next
> >iteration of loop starts and it fail on both UTF8_IS_CONTINUATION and
> >UTF8_IS_START.
> >
> >Can you describe in details what do you think it wrong and how to do
> >that attack?
> 
> This discussion has been active at
> https://github.com/dankogai/p5-encode/issues/64
> 
> For the curious out there, please refer to that discussion.  My bottom line
> is that I have come to believe the security risks are too high to have
> modules do their own security checking for UTF-8 correctness.
> >
> >>2) It does not work on EBCDIC platforms.  The NATIVE_TO_UTF() call is a good
> >>start, but the result uv needs to be transformed back to native, using
> >>UNI_TO_NATIVE(uv).
> >
> >uv is used just to check if it is valid Unicode code point. Real value
> >is used only for error/warn message. Previous implementation used
> >utf8n_to_uvuni which convert return value with NATIVE_TO_UNI.
> 
> As I said on that other thread, if this is really true, then it's faster to
> use a boolean function to verify the inputs.

The value of uv is used only in the warn/error message.

> Also, performance should not
> be a consideration for errors or warnings.  One can check validity fast; and
> then spend the time getting the message right in the rare cases where a
> message is generated.

Yes, I fully agree.

> >>3) The assumptions the subroutine runs under need to be documented for
> >>future maintainers and code readers.  For example, it assumes that there is
> >>enough space in the input to hold all the bytes.
> >
> >Function process_utf8 does not assume that. It calls SvGROW to increase
> >buffer size when needed.
> 
> You misunderstand what I meant here.  The bottom line is your patch adds a
> significant 

Encode utf8 warnings

2016-08-13 Thread pali
Hello, I see that there is one big mess in utf8 warnings for Encode.

First, warnings should be enabled by the warnings pragma. For utf8 these
are: utf8, non_unicode, nonchar, surrogate.

Second, warnings for Encode can be enabled by check flag Encode::FB_WARN
or Encode::WARN_ON_ERR.

Third, warnings for perlio :encoding layer can be enabled via
$PerlIO::encoding::fallback variable (same flags as for Encode module).

And here is problem:

Should Encode's utf8 throw warnings if:

* utf8 warnings are enabled by the pragma, but not enabled via Encode check
  flags?
* utf8 pragma warnings are disabled, but the Encode WARN_ON_ERR bit is
  enabled?
* utf8 pragma warnings are disabled and the $PerlIO::encoding::fallback
  variable was not modified?
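For reference, the Encode-side knob looks like this; whether the warning below also honours the lexical warnings pragma is exactly the open question here, and the behaviour may differ between Encode versions:

```perl
use strict;
use warnings;
use Encode ();

my @warned;
local $SIG{__WARN__} = sub { push @warned, $_[0] };

# FB_WARN = FB_DEFAULT | WARN_ON_ERR: substitute the malformed
# byte and emit a warning about it
my $decoded = Encode::decode('UTF-8', "\xA0", Encode::FB_WARN);

printf "warnings caught: %d\n", scalar @warned;
print $warned[0] if @warned;
```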

There are couple of bugs and comments about this problem:
https://rt.cpan.org/Public/Bug/Display.html?id=88592
https://github.com/dankogai/p5-encode/pull/26#issuecomment-235641347
https://rt.perl.org/Public/Bug/Display.html?id=128788
https://rt.cpan.org/Public/Bug/Display.html?id=116629
https://github.com/dankogai/p5-encode/issues/59
https://github.com/dankogai/p5-encode/commit/a6c2ba385875c2c03bd42350e23aef0188fb23b0
https://github.com/dankogai/p5-encode/commit/07c8adb58e55c7cf66b3d6673bf50010fe1a69ea

I think that we need to define how utf8 pragma warnings should
interact with Encode WARN_ON_ERR for the Unicode encodings
(Encode::utf8 and Encode::Unicode).

Documentation:
https://metacpan.org/pod/warnings
https://metacpan.org/pod/Encode#FB_WARN
https://metacpan.org/pod/PerlIO::encoding


Re: Encode UTF-8 optimizations

2016-08-12 Thread pali
On Thursday 11 August 2016 17:41:23 Karl Williamson wrote:
> On 07/09/2016 05:12 PM, p...@cpan.org wrote:
> >Hi! As we know utf8::encode() does not provide correct UTF-8 encoding
> >and Encode::encode("UTF-8", ...) should be used instead. Also opening
> >file should be done by :encoding(UTF-8) layer instead :utf8.
> >
> >But UTF-8 strict implementation in Encode module is horrible slow when
> >comparing to utf8::encode(). It is implemented in Encode.xs file and for
> >benchmarking can be this XS implementation called directly by:
> >
> >  use Encode;
> >  my $output = Encode::utf8::encode_xs({strict_utf8 => 1}, $input)
> >
> >(without overhead of Encode module...)
> >
> >Here are my results on 160 bytes long input string:
> >
> >  Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  8 wallclock secs ( 8.56 usr +  0.00 sys =  8.56 CPU) @ 467289.72/s (n=400)
> >  Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.66 usr +  0.00 sys =  1.66 CPU) @ 2409638.55/s (n=400)
> >  utf8::encode:  1 wallclock secs ( 0.39 usr +  0.00 sys =  0.39 CPU) @ 10256410.26/s (n=400)
> >
> > I found two bottle necks (slow sv_catpv* and utf8n_to_uvuni functions)
> > and did some optimizations. Final results are:
> >
> >  Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  2 wallclock secs ( 3.27 usr +  0.00 sys =  3.27 CPU) @ 1223241.59/s (n=400)
> >  Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.68 usr +  0.00 sys =  1.68 CPU) @ 2380952.38/s (n=400)
> >  utf8::encode:  1 wallclock secs ( 0.40 usr +  0.00 sys =  0.40 CPU) @ 1000.00/s (n=400)
> >
> >Patches are on github at pull request:
> >https://github.com/dankogai/p5-encode/pull/56
> >
> >I would like if somebody review my patches and tell if this is the
> >right way for optimizations...
> >
> 
> I'm sorry that this slipped off my radar until I saw it in the new Encode
> release
> 
> There are a couple of things I see wrong with your patch.
> 
> 1) It does not catch the malformation of an overlong sequence.  This is a
> serious malformation which has been used for attacks.  Basically, after you
> get the result, you need to check that it is the expected length for that
> result.  For example, \xC2\x80 will have an input length of 2, and evaluates
> to \x00, whose expected length is 1, and so the input is overlong.  In
> modern perls, you can just do an OFFUNISKIP(uv) and compare that with the
> passed-in length.  This can be rewritten for perls back to 5.8 using
> UNI_SKIP and UNI_TO_NATIVE

I do not see where the problem can be. At least I think my patches should
be compatible with the previous implementation of Encode.xs...

First UTF8_IS_INVARIANT is checked and one character is processed.

Otherwise UTF8_IS_START is checked and UTF8SKIP is used to get the length
of the sequence. Then the len-1 following bytes are checked to see whether
they pass the UTF8_IS_CONTINUATION test.

If there are fewer bytes, then the byte that follows fails
UTF8_IS_CONTINUATION and an error is reported. If there are more, then the
next iteration of the loop starts and fails on both UTF8_IS_CONTINUATION
and UTF8_IS_START.

Can you describe in detail what you think is wrong and how that attack
would work?

> 2) It does not work on EBCDIC platforms.  The NATIVE_TO_UTF() call is a good
> start, but the result uv needs to be transformed back to native, using
> UNI_TO_NATIVE(uv).

uv is used just to check whether it is a valid Unicode code point. The
real value is used only for the error/warning message. The previous
implementation used utf8n_to_uvuni, which converts the return value with
NATIVE_TO_UNI.

> 3) The assumptions the subroutine runs under need to be documented for
> future maintainers and code readers.  For example, it assumes that there is
> enough space in the input to hold all the bytes.

The process_utf8 function does not assume that. It calls SvGROW to
increase the buffer size when needed.

> Other than that, it looks ok to me.  But, to be sure, I think you should run
> it against the tests included in core's t/op/utf8decode.t, which came from an
> internet repository of edge cases.

How can I use and run that test with Encode?


Encode UTF-8 optimizations

2016-07-09 Thread pali
Hi! As we know, utf8::encode() does not provide correct UTF-8 encoding,
and Encode::encode("UTF-8", ...) should be used instead. Likewise, files
should be opened with the :encoding(UTF-8) layer instead of :utf8.

But the strict UTF-8 implementation in the Encode module is horribly slow
compared to utf8::encode(). It is implemented in the Encode.xs file, and
for benchmarking this XS implementation can be called directly by:

  use Encode;
  my $output = Encode::utf8::encode_xs({strict_utf8 => 1}, $input);

(without the overhead of the Encode module...)

Here are my results on a 160-byte-long input string:

  Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  8 wallclock secs ( 8.56 usr +  0.00 sys =  8.56 CPU) @ 467289.72/s (n=4000000)
  Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.66 usr +  0.00 sys =  1.66 CPU) @ 2409638.55/s (n=4000000)
  utf8::encode:  1 wallclock secs ( 0.39 usr +  0.00 sys =  0.39 CPU) @ 10256410.26/s (n=4000000)

I found two bottlenecks (the slow sv_catpv* and utf8n_to_uvuni functions)
and did some optimizations. The final results are:

  Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  2 wallclock secs ( 3.27 usr +  0.00 sys =  3.27 CPU) @ 1223241.59/s (n=4000000)
  Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.68 usr +  0.00 sys =  1.68 CPU) @ 2380952.38/s (n=4000000)
  utf8::encode:  1 wallclock secs ( 0.40 usr +  0.00 sys =  0.40 CPU) @ 10000000.00/s (n=4000000)

The patches are on GitHub in this pull request:
https://github.com/dankogai/p5-encode/pull/56

I would appreciate it if somebody reviewed my patches and told me
whether this is the right way to optimize...


Re: UTF-8 encoding & decoding

2016-05-12 Thread Pali Rohár
On Friday 06 May 2016 09:24:01 Karl Williamson wrote:
> On 05/05/2016 08:37 AM, Pali Rohár wrote:
> >Hi!
> >
> >I thought that I understood UTF-8 encoding/decoding in perl until I
> >looked into the source code of the Encode package... (specifically sub
> >encode_utf8)
> >
> >Before, I had only read the description of the Encode package (not the
> >source code):
> >https://metacpan.org/pod/Encode#UTF-8-vs.-utf8-vs.-UTF8
> >
> >I tried to find some more information (ideally answering my questions),
> >but without success. Can you help me? My questions are:
> >
> >1. What is the difference between these two calls?
> >
> >  utf8::encode($str);
> >
> >and
> >
> >  $str = Encode::encode('utf8', $str);
> >
> >2. What is the difference between these?
> >
> >  utf8::decode($str);
> >  $str = Encode::decode_utf8($str);
> 
> Each pair of functions is supposed to do essentially the same thing. I have
> not studied them to know what subtle differences there may be.

If both functions are supposed to do the same thing, why do we have the
duplication? And which one is preferred?

> >3. Where is the implementation of the utf8::encode/decode functions? It
> >is not in utf8.pm, nor in utf8_heavy.pl, and also not in unicore/Heavy.pl.
> >And what do those functions do?
> 
> The implementation is in universal.c.  But these are just wrappers for
> sv_utf8_encode and sv_utf8_decode, which are implemented in sv.c.  Their
> documentation is in perlapi.  It should match the documentation of
> utf8::decode and utf8::encode, whose documentation is in utf8.pm.  (I myself
> have a hard time mapping the names chosen for these operations with what
> they actually do)

Ok, thank you!

-- 
Pali Rohár
pali.ro...@gmail.com


UTF-8 encoding & decoding

2016-05-06 Thread Pali Rohár
Hi!

I thought that I understood UTF-8 encoding/decoding in perl until I
looked into the source code of the Encode package... (specifically sub
encode_utf8)

Before, I had only read the description of the Encode package (not the
source code):
https://metacpan.org/pod/Encode#UTF-8-vs.-utf8-vs.-UTF8

I tried to find some more information (ideally answering my questions),
but without success. Can you help me? My questions are:

1. What is the difference between these two calls?

 utf8::encode($str);

and

 $str = Encode::encode('utf8', $str);

2. What is the difference between these?

 utf8::decode($str);
 $str = Encode::decode_utf8($str);

3. Where is the implementation of the utf8::encode/decode functions? It
is not in utf8.pm, nor in utf8_heavy.pl, and also not in unicore/Heavy.pl.
And what do those functions do?

-- 
Pali Rohár
pali.ro...@gmail.com