Re: Encode UTF-8 optimizations

2016-08-22 Thread pali
(this only applies to strict UTF-8)

On Monday 22 August 2016 23:19:51 Karl Williamson wrote:
> The code could be tweaked to call UTF8_IS_SUPER first, but I'm
> asserting that an optimizing compiler will see that any call to
> is_utf8_char_slow() is pointless, and will optimize it out.

Such an optimization cannot be done; the compiler cannot know such a
thing...

You have this code:

+const STRLEN char_len = isUTF8_CHAR(x, send);
+
+if (UNLIKELY(! char_len)
+    || (   UNLIKELY(isUTF8_POSSIBLY_PROBLEMATIC(*x))
+        && (   UNLIKELY(UTF8_IS_SURROGATE(x, send))
+            || UNLIKELY(UTF8_IS_SUPER(x, send))
+            || UNLIKELY(UTF8_IS_NONCHAR(x, send)))))
+{
+    *ep = x;
+    return FALSE;
+}

Here the isUTF8_CHAR() macro will call the function is_utf8_char_slow()
if the condition IS_UTF8_CHAR_FAST(UTF8SKIP(x)) is false. And because
is_utf8_char_slow() is an external library function, the compiler has
absolutely no idea what that function is doing. In a non-functional
world such a function could have side effects, etc., and the compiler
really cannot eliminate that call.
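To illustrate the point outside of Perl's headers (hypothetical names,
not the real macros), this is the shape of the problem: the compiler can
fold away a branch that reaches only a static inline function, but it
must keep a call that goes through a shared-library prototype:

    #include <stddef.h>

    /* Out-of-line function in a shared library: the compiler sees only
     * this prototype, so it must assume side effects and emit any call
     * to it. */
    size_t is_char_slow(const unsigned char *s, size_t skip);

    /* Fully visible to the optimizer; calls can be inlined and folded. */
    static inline size_t is_char_fast(const unsigned char *s)
    {
        return (*s & 0x80) == 0;    /* e.g. a one-byte sequence */
    }

    static inline size_t check_char(const unsigned char *s, size_t skip)
    {
        return skip <= 4 ? is_char_fast(s)        /* optimizable branch */
                         : is_char_slow(s, skip); /* call must stay     */
    }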

Moving UTF8_IS_SUPER before isUTF8_CHAR maybe could help, but I'm
skeptical that gcc can really propagate a constant from the
PL_utf8skip[] array back and prove that IS_UTF8_CHAR_FAST must always be
true once UTF8_IS_SUPER has been ruled out...

Rather, add an IS_UTF8_CHAR_FAST(UTF8SKIP(s)) check (or similar) before
the isUTF8_CHAR() call, as sketched below. That should totally eliminate
generating code with a call to the is_utf8_char_slow() function.
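A minimal sketch of that guard, reusing the names from the patch above
(only the relevant fragment is shown, and it assumes x < send as in the
surrounding loop):

    /* In strict mode, any sequence the fast macro cannot handle encodes
     * a code point above Unicode and is invalid anyway, so reject it
     * before isUTF8_CHAR() runs.  After this guard the compiler can
     * prove the macro's is_utf8_char_slow() branch dead and drop it. */
    if (UNLIKELY(! IS_UTF8_CHAR_FAST(UTF8SKIP(x)))) {
        *ep = x;
        return FALSE;
    }
    const STRLEN char_len = isUTF8_CHAR(x, send);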

With UTF8_IS_SUPER there can be a branch in the binary code which will
never be evaluated.


Re: Encode UTF-8 optimizations

2016-08-22 Thread Karl Williamson

On 08/22/2016 03:19 PM, Karl Williamson wrote:

> On 08/22/2016 02:47 PM, p...@cpan.org wrote:
>
>>> And I think you misunderstand when is_utf8_char_slow() is called.  It
>>> is called only when the next byte in the input indicates that the
>>> only legal UTF-8 that might follow would be for a code point that is
>>> at least U+200000, almost twice as high as the highest legal Unicode
>>> code point.  It is a Perl extension to handle such code points,
>>> unlike other languages.  But the Perl core is not optimized for them,
>>> nor will it be.  My point is that is_utf8_char_slow() will only be
>>> called in very specialized cases, and we need not make those cases
>>> have as good a performance as normal ones.

>> In strict mode, there is absolutely no need to call
>> is_utf8_char_slow(), as in strict mode such a sequence must always be
>> invalid (it is above the last valid Unicode character).
>> This is what I tried to say.
>>
>> And currently is_strict_utf8_string_loc() first calls isUTF8_CHAR()
>> (which could call is_utf8_char_slow()) and only after that checks for
>> UTF8_IS_SUPER().


> I only have time to respond to this portion just now.
>
> The code could be tweaked to call UTF8_IS_SUPER first, but I'm
> asserting that an optimizing compiler will see that any call to
> is_utf8_char_slow() is pointless, and will optimize it out.



Now, I'm realizing I'm wrong.  It can't be optimized out by the compiler 
because it is not declared (nor can it be) to be a pure function.  And, 
I'd rather not tweak it to call UTF8_IS_SUPER first, because that relies 
on knowing what the current internal implementation is.
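For reference, the property involved, as a sketch (a hypothetical
declaration, not Perl's real one): GCC and Clang may delete a call whose
result is unused only when the function is declared pure, meaning it has
no observable effects besides its return value:

    #include <stddef.h>

    /* Hypothetical: if the slow check could be declared pure... */
    size_t slow_check(const unsigned char *s, size_t len)
                __attribute__((pure));

    void demo(const unsigned char *s, size_t len)
    {
        /* Result unused: a pure call may be optimized away entirely.
         * Without the attribute, the compiler must assume side effects
         * and emit the call. */
        slow_check(s, len);
    }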


But I still argue that it is fine the way it is.  It will only get 
called for code points much higher than Unicode, and the performance on 
those should not affect our decisions in any way.


Re: Encode UTF-8 optimizations

2016-08-22 Thread Karl Williamson

On 08/22/2016 02:47 PM, p...@cpan.org wrote:

>> And I think you misunderstand when is_utf8_char_slow() is called.  It
>> is called only when the next byte in the input indicates that the only
>> legal UTF-8 that might follow would be for a code point that is at
>> least U+200000, almost twice as high as the highest legal Unicode code
>> point.  It is a Perl extension to handle such code points, unlike
>> other languages.  But the Perl core is not optimized for them, nor
>> will it be.  My point is that is_utf8_char_slow() will only be called
>> in very specialized cases, and we need not make those cases have as
>> good a performance as normal ones.

> In strict mode, there is absolutely no need to call
> is_utf8_char_slow(), as in strict mode such a sequence must always be
> invalid (it is above the last valid Unicode character).
> This is what I tried to say.
>
> And currently is_strict_utf8_string_loc() first calls isUTF8_CHAR()
> (which could call is_utf8_char_slow()) and only after that checks for
> UTF8_IS_SUPER().


I only have time to respond to this portion just now.

The code could be tweaked to call UTF8_IS_SUPER first, but I'm asserting 
that an optimizing compiler will see that any call to 
is_utf8_char_slow() is pointless, and will optimize it out.


Re: Encode UTF-8 optimizations

2016-08-22 Thread Karl Williamson

On 08/22/2016 07:05 AM, p...@cpan.org wrote:

> On Sunday 21 August 2016 08:49:08 Karl Williamson wrote:
>
>> On 08/21/2016 02:34 AM, p...@cpan.org wrote:
>>
>>> On Sunday 21 August 2016 03:10:40 Karl Williamson wrote:
>>>
>>>> Top posting.
>>>>
>>>> Attached is my alternative patch.  It effectively uses a different
>>>> algorithm to avoid decoding the input into code points, and to copy
>>>> all spans of valid input at once, instead of character at a time.
>>>>
>>>> And it uses only currently available functions.
>>>
>>> And that's the problem. As I already wrote in a previous email, a
>>> call to a function in a shared library cannot be optimized as heavily
>>> as an inlined function, and causes a slowdown. You are calling
>>> is_utf8_string_loc for non-strict mode, which is not inlined, and so
>>> encode/decode in non-strict mode will be slower...
>>>
>>> And also in is_strict_utf8_string_loc you are calling isUTF8_CHAR,
>>> which calls _is_utf8_char_slow, which in turn calls utf8n_to_uvchr,
>>> which cannot be inlined either...
>>>
>>> Therefore I think this is not a good approach...



>> Then you should run your benchmarks to find out the performance.


> You are right, benchmarks are needed to show final results.


>> On valid input, is_utf8_string_loc() is called once per string.  The
>> function call overhead and non-inlining should not be noticeable.
>
> Ah right, I misread it as being called once per valid sequence, not
> once for the whole string. You are right.


It is called once per valid sequence.  See below.




>> On valid input, is_utf8_char_slow() is never called.  The used-parts
>> can be inlined.
>
> Yes, but this function is there to be called primarily on unknown
> input, which does not have to be valid. If I knew the input was valid,
> then utf8::encode/decode would be enough :-)


What process_utf8() does is to copy the alleged UTF-8 input to the 
output, verifying along the way that it actually is legal UTF-8 (with 2 
levels of strictness, depending on the input parameter), and taking some 
actions (exactly what depends on other input parameters) if and when it 
finds invalid UTF-8.


The way it works after my patch is like an instruction pipeline.  You 
start it up, and it stays in the pipeline as long as the next character 
in the input is legal, or until it reaches the end.  When it finds 
illegal input, it drops out of the pipeline, handles that, and starts up 
the pipeline again to process any remaining input.  If the entire input 
string is valid, a single instance of the pipeline is all that gets 
invoked.
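Schematically, the loop is something like this (my outline, not the
actual Encode.xs code; dst and handle_invalid() are hypothetical
stand-ins for the output SV and the parameter-dependent error handling):

    const U8 *s   = input;
    const U8 *end = input + len;

    while (s < end) {
        const U8 *ep;
        if (is_utf8_string_loc(s, end - s, &ep)) {
            /* The whole remainder is valid: copy it in one shot. */
            sv_catpvn(dst, (const char *)s, end - s);
            break;
        }
        /* Copy the valid span found so far, then drop out of the
         * pipeline to deal with the invalid sequence at ep. */
        sv_catpvn(dst, (const char *)s, ep - s);
        s = handle_invalid(dst, ep, end);  /* warn, substitute, or croak */
    }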


The use-case I envision is that the input is supposed to be valid UTF-8, 
and the purpose of process_utf8() is to verify that that is in fact 
true, and to take specified actions when it isn't.  Under that use-case, 
taking longer to deal with invalid input is not a problem.  If that is 
not your use-case, please explain what yours is.


And I think you misunderstand when is_utf8_char_slow() is called.  It is 
called only when the next byte in the input indicates that the only 
legal UTF-8 that might follow would be for a code point that is at least 
U+200000, almost twice as high as the highest legal Unicode code point. 
It is a Perl extension to handle such code points, unlike other 
languages.  But the Perl core is not optimized for them, nor will it be. 
My point is that is_utf8_char_slow() will only be called in very 
specialized cases, and we need not make those cases have as good a 
performance as normal ones.
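Concretely (my illustration, not code from the patch): a lead byte of
0xF8 or above starts a sequence of five or more bytes, and a five-byte
sequence already encodes at least U+200000, so a single byte compare
identifies the only inputs that can ever reach the slow path:

    /* Lead bytes 0xF8..0xFF start sequences of 5+ bytes; the smallest
     * code point a 5-byte sequence can encode is U+200000, well above
     * Unicode's maximum of U+10FFFF. */
    static inline int starts_above_unicode(unsigned char lead)
    {
        return lead >= 0xF8;
    }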



>> On invalid input, performance should be a minor consideration.
>
> See below...


See above. :)




>> The inner loop is much tighter in both functions; likely it can be
>> held in the cache.  The algorithm avoids a bunch of work compared to
>> the previous one.
>
> Right, for valid input the algorithm is really faster. Whether it is
> because of fewer cache misses... maybe... I can play with perf or
> another tool to see where the bottleneck is now.


>> I doubt that it will be slower than that.  The only way to know in any
>> performance situation is to actually test.  And know that things will
>> be different depending on the underlying hardware, so only large
>> differences are really significant.
>
> So, here are my test results. You can say that they are subjective,
> but I would be happy if somebody provided better input for performance
> tests.

> Abbreviations:
> strict = Encode::encode/decode "UTF-8"
> lax = Encode::encode/decode "utf8"
> int = utf8::encode/decode
> orig = commit 92d73bfab7792718f9ad5c5dc54013176ed9c76b
> your = orig + 0001-Speed-up-Encode-UTF-8-validation-checking.patch
> my = orig + revert commit c8247c27c13d1cf152398e453793a91916d2185d
>
> Test cases:
> all = join "", map { chr } 0 .. 0x10FFFF
> short = "žluťoučký kůň pěl ďábelské ódy " x 45
> long = $short x 1000
> invalid-short = "\xA0" x 1000
> invalid-long = "\xA0" x 100
>
> Encoding was called on strings with the Encode::_utf8_on() flag set.


> Rates:
>
> encode:
>                      all       short     long  invalid-short  invalid-long
> orig - strict       41/s    124533/s    132/s       115197/s         172/s
> your - strict      176/s    411523/s    427/s        54813/s          66/s
> my   - strict       80/s    172712/s    186/s       113787/s         138/s

Re: Encode utf8 warnings

2016-08-22 Thread pali
On Saturday 13 August 2016 19:41:46 p...@cpan.org wrote:
> Hello, I see that there is one big mess in utf8 warnings for Encode.

Per request this discussion was moved to perl5-port...@perl.org ML:
http://www.nntp.perl.org/group/perl.perl5.porters/2016/08/msg239061.html


Re: Encode UTF-8 optimizations

2016-08-22 Thread pali
On Sunday 21 August 2016 08:49:08 Karl Williamson wrote:
> On 08/21/2016 02:34 AM, p...@cpan.org wrote:
> >On Sunday 21 August 2016 03:10:40 Karl Williamson wrote:
> >>Top posting.
> >>
> >>Attached is my alternative patch.  It effectively uses a different
> >>algorithm to avoid decoding the input into code points, and to copy
> >>all spans of valid input at once, instead of character at a time.
> >>
> >>And it uses only currently available functions.
> >
> >And that's the problem. As I already wrote in a previous email, a call
> >to a function in a shared library cannot be optimized as heavily as an
> >inlined function, and causes a slowdown. You are calling
> >is_utf8_string_loc for non-strict mode, which is not inlined, and so
> >encode/decode in non-strict mode will be slower...
> >
> >And also in is_strict_utf8_string_loc you are calling isUTF8_CHAR,
> >which calls _is_utf8_char_slow, which in turn calls utf8n_to_uvchr,
> >which cannot be inlined either...
> >
> >Therefore I think this is not a good approach...
> >
> 
> Then you should run your benchmarks to find out the performance.

You are right, benchmarks are needed to show final results.

> On valid input, is_utf8_string_loc() is called once per string.  The
> function call overhead and non-inlining should not be noticeable.

Ah right, I misread it as being called once per valid sequence, not once
for the whole string. You are right.

> On valid input, is_utf8_char_slow() is never called.  The used-parts can be
> inlined.

Yes, but this function is there to be called primarily on unknown input,
which does not have to be valid. If I knew the input was valid, then
utf8::encode/decode would be enough :-)

> On invalid input, performance should be a minor consideration.

See below...

> The inner loop is much tighter in both functions; likely it can be held in
> the cache.  The algorithm avoids a bunch of work compared to the previous
> one.

Right, for valid input the algorithm is really faster. Whether it is
because of fewer cache misses... maybe... I can play with perf or
another tool to see where the bottleneck is now.

> I doubt that it will be slower than that.  The only way to know in any
> performance situation is to actually test.  And know that things will be
> different depending on the underlying hardware, so only large differences
> are really significant.

So, here are my test results. You can say that they are subjective, but
I would be happy if somebody provided better input for performance tests.

Abbreviations:
strict = Encode::encode/decode "UTF-8"
lax = Encode::encode/decode "utf8"
int = utf8::encode/decode
orig = commit 92d73bfab7792718f9ad5c5dc54013176ed9c76b
your = orig + 0001-Speed-up-Encode-UTF-8-validation-checking.patch
my = orig + revert commit c8247c27c13d1cf152398e453793a91916d2185d

Test cases:
all = join "", map { chr } 0 .. 0x10FFFF
short = "žluťoučký kůň pěl ďábelské ódy " x 45
long = $short x 1000
invalid-short = "\xA0" x 1000
invalid-long = "\xA0" x 100

Encoding was called on strings with the Encode::_utf8_on() flag set.


Rates:

encode:
                     all       short     long  invalid-short  invalid-long
orig - strict       41/s    124533/s    132/s       115197/s         172/s
your - strict      176/s    411523/s    427/s        54813/s          66/s
my   - strict       80/s    172712/s    186/s       113787/s         138/s

orig - lax        1010/s   3225806/s   6250/s       546800/s        5151/s
your - lax         952/s   3225806/s   5882/s       519325/s        4919/s
my   - lax        1060/s   3125000/s   6250/s       645119/s        5009/s

orig - int     8154604/s      1000/s    infty      9787566/s     9748151/s
your - int     9135243/s         /s     infty      8922821/s     9737657/s
my   - int     9779395/s      1000/s    infty      9822046/s     8949861/s


decode:
                     all       short     long  invalid-short  invalid-long
orig - strict       39/s    119048/s    131/s       108574/s         171/s
your - strict      173/s    353357/s    442/s        42440/s          55/s
my   - strict       69/s        17/s    182/s       117291/s         135/s

orig - lax          39/s    123609/s    137/s       127302/s         172/s
your - lax         230/s    393701/s    495/s        37346/s          65/s
my   - lax          79/s    158983/s    180/s       121456/s         138/s

orig - int         274/s    546448/s    565/s      8219513/s       12357/s
your - int         273/s    540541/s    562/s      7226066/s       12948/s
my   - int         274/s    543478/s    562/s      8502902/s       12421/s


int is there just as a sanity check of the tests, as the
utf8::encode/decode functions were not changed.

The results are: your patch is faster for valid sequences (as you wrote
above), but slower for invalid ones (in some cases radically so).

So I would propose two optimizations:

1) Replace the isUTF8_CHAR macro in is_strict_utf8_string_loc() with
   some new one which does not call utf8n_to_uvchr. That call is not
   needed, as in that case the sequence is already invalid (see the
   sketch below).

2) Try to make an inline version of the function is_utf8_string_loc().
   Maybe merge with
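A minimal sketch of what (1) could look like; isUTF8_CHAR_STRICT_FAST is
a hypothetical name built from the macros discussed above, and it
assumes s < e as in the caller's loop:

    /* Any sequence the fast table cannot handle encodes a code point of
     * at least U+200000, which strict UTF-8 rejects anyway, so report
     * it as invalid without ever reaching is_utf8_char_slow() and its
     * utf8n_to_uvchr() call. */
    #define isUTF8_CHAR_STRICT_FAST(s, e)                             \
        ((e) - (s) >= UTF8SKIP(s) && IS_UTF8_CHAR_FAST(UTF8SKIP(s))   \
            ? isUTF8_CHAR(s, e)                                       \
            : 0)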