Re: [PHP-DEV] [VOTE] Improve unserialize() error handling

2022-10-18 Thread Tim Starling

On 18/10/22 22:32, Tim Düsterhus wrote:


false is a valid value that might've been serialized. The manual 
specifically notes:


 false is returned both in the case of an error and if 
unserializing the serialized false value. It is possible to catch 
this special case by comparing data with serialize(false) or by 
catching the issued E_NOTICE. 


Yes, that is fair enough.




Note that your example code also errors out if the $result is an 
empty array, an empty string or another falsy values. This might be 
an acceptable result in your specific case, but is not the correct 
behavior in the general case.


Right, there was no instance in the 150 callers I reviewed yesterday 
in which unserialize() was explicitly checked for errors while 
unserializing a falsy value. Most often, there is an array wrapper 
around the actual value, or the possibility is otherwise excluded.





The example in the RFC was written to be correct for the general 
case, without imposing any constraints in the input data.


As I said, if throwing was optional, like in json_decode(), I would be 
all for it. I'm just doing a cost/benefit analysis. The common case 
should be easy, while the general case should be possible.


As a matter of style, I think PHP's false returns are fine. I don't 
think we need to follow Python into making every error an exception.


-- Tim Starling

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



Re: [PHP-DEV] RFC [Discussion]: Randomizer Additions

2022-10-18 Thread Larry Garfield
On Tue, Oct 18, 2022, at 12:22 PM, Tim Düsterhus wrote:

>>> Generating a random string containing specific characters...thus requires 
>>> multiple lines of code for what effectively is a very simple operation.
>> 
>> Yeah, though those lines of code add distinction and emphasis for is
>> meant by character.
>> 
>> In particular, users might be surprised when they give this string
>> "abc‍‍"* and get a non-ascii result.
>
> That's why the method includes 'bytes' in its name. This term is also 
> used in ->shuffleBytes() which was renamed in 
> https://wiki.php.net/rfc/random_extension_improvement due to this exact 
> problem.
>
> In fact ->shuffleBytes() can be considered the companion method to the 
> proposed ->getBytesFromAlphabet():

"Alphabet" here still, to me, implies a character set, not a byte stream.  
Maybe getBytesFromString?  getBytesFromList?  getBytesFrom() (because you're 
getting it "from" the string that's provided, so you don't need another noun 
there.)?

I'm not opposed to the functionality, but "alphabet" doesn't seem like the 
right word for it.

--Larry Garfield

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



Re: [PHP-DEV] RFC [Discussion]: Randomizer Additions

2022-10-18 Thread Jordan LeDoux
On Tue, Oct 18, 2022 at 10:22 AM Tim Düsterhus  wrote:

>
> This cannot be reasonably done in userland, because you pay an increased
> cost to turn the bytes into numbers and then to perform the necessary
> bit fiddling to debias the numbers.
>

To add to this, I'm going to link to a userland implementation I wrote for
my Fermat library. A couple of notes:

1. I am not at all claiming to be a subject matter expert, and this was my
first pass at the feature, not a fully researched and optimized version.
2. The main issue at hand for me was that my 'numbers' could be arbitrary
precision, which would require me to directly ask for bytes and then pull
out numbers from them depending on the number of integer digits possible
within the range.

https://github.com/JordanRL/Fermat/blob/5d186eac99ceb98b089865099c2d7622428bfdaa/src/Samsara/Fermat/Provider/RandomProvider.php#L128

Now again, this was my first pass, but the way I tackled this was to get a
string of bytes that fully contained the integer range. The issue of
debiasing the numbers with that approach is *so* difficult, that I opted
instead to simply recursively call the operation until a number within the
desired range was generated. This avoids introducing bias, but is obviously
less than optimal. Some kind of ability to directly provide a set of
possible characters and generating from bytes directly would at least
reduce the recursion space further in my case, though my particular
algorithm still needs additional improvement obviously.

Generating random numbers that have an arbitrary range is not easy,
particularly if your range might exceed PHP_INT_MAX. *Any* utilities that
core can provide to improve performance and reduce the number of places
that bias might be introduced in such a situation would be much
appreciated. Personally, I decided for my implementation that introducing
bias was a greater concern than performance.

Jordan


Re: [PHP-DEV] [VOTE] Improve unserialize() error handling

2022-10-18 Thread Nicolas Grekas
> > The other failing tests involve "unserialize_callback_func" to gracefully
> > detect class-not-found, and they are all plain broken by the RFC.
> >
> > I created this patch/PR to show the changes that would be required on
> > Symfony to work around the BC break:
> > https://github.com/symfony/symfony/pull/47882
> >
> > Note to readers: in this whole discussion, Symfony is just an example of
> > affected code. In the end, Symfony will adapt to the RFC if it passes.
> The
> > point being made is that PHP scripts should not have to be patched to run
> > on newer minor versions of PHP. That's what "keeping BC" means.
>
> Scripts will not need to be changed, unless they are already relying on
> observed behavior instead of the documented behavior.

When I use Symfony's 'PhpSerializer' and perform the following call:
>
> > $serializer->decode([
> > 'body' => '{"message": "bar"}',
> > ]);
>
> Then I can expect the input to be gracefully rejected with a
> MessageDecodingFailedException. This behavior is tested in the
> 'testDecodingFailsWithBadFormat' test.
>
> But *under no circumstances* must I perform the following method call:
>
> > $serializer->decode([
> > 'body' => 'O:8:"DateTime":0:{}',
> > ]);
>
> Because this payload is invalid and thus must not exist in practice.
>

That's the current behavior, yes. It allows graceful errors when one
misconfigures their workers and sends jsons to a worker that expects
php-ser. Under no circumstances do we need to guard against adversary
payloads. We could handle the situation you describe of course, but that'd
be just dead code in practice. ¯\_(ツ)_/¯



> > Breaking BC in the name of such invalid payloads doesn't make an argument
> > in favor of the RFC.
>
> I demonstrated how to adjust 'PhpSerializer' to improve the behavior for
> what I consider a reasonable use case without breaking any existing
> tests in step3b.diff and step3c.diff. At the same time the fixed version
> will automatically handle the change this RFC proposes. I even offered
> to adjust the RFC implementation to preserve the original Exception
> message and code.
>
> >> Failed asserting that exception message 'Could not decode message using
> >> PHP serialization: O:13:"ReceivedSt0mp":0:{}.' matches '/class
> >> "ReceivedSt0mp" not found/'.
> >>
> >> Okay, now the Exception message changed. Personally I do not consider
> >> this a BC break: I believe Exception messages are meant for human
> >> consumption, not for programs. Otherwise fixing a typo in the message
> >> would be a BC break. If the code wants to learn about the cause, it
> >> should either use the '$code' or different types of Exception should be
> >> thrown to clarify the cause by entering a different catch() block.
> >>
> >
> > Yes, the specific error message should be part of the BC promise. This
> > allows building test suites that can assert the message in a stable way.
>
> I'm not talking about test suites here. I believe makes sense to verify
> the error message to ensure a specific error message is emitted to the
> human observer in the error log.
>

Breaking test suites is the BC issue I wanted to highlight.



> I was talking about code that does something like this, which I consider
> to be inherently unsafe:
>
> try { … }
> catch (SomeException $e) {
>if ($e->getMessage() === 'Foobar') doSomething();
>else doSomethingElse();
> }
>
> As a library author I want to be able to provide the best possible
> Exception message to ease debugging for the user. This is not possible
> if I am locked into a bad choice forever.
>
> > This is also why we don't change the output of
> var_dump/print_r/var_export:
> > they're written now, in the same of BC, for the best of stability. (I've
> > barely read any PHP code where exception's code is used in any useful
> BTW -
> > that can't be a solution.)
>
> PDO is an example where the $code is useful, because it exposes the
> standardized SQL error code. Similarly a HTTP client could expose the
> status code within $code.
>
> For other libraries simply using a different Exception class within the
> same hierarchy is often sufficient to differentiate between different
> errors, because programmatically I often don't care *why exactly* an
> operation failed.
>
> > More importantly, the type of thrown exceptions should undoubtedly be
> part
> > of the BC promise.
>
> I agree.
>
> > Wrapping exceptions inside UnserializationFailedException breaks existing
> > catch/instanceof checks (see my PR above.)
> >
>
> unserialize() only promises you that a Throwable might be thrown:
>
>
> https://www.php.net/manual/en/function.unserialize.php#refsect1-function.unserialize-errors


Breaking BC at will unless it's properly documented looks like a terrible
policy. It'd make developing in PHP really risky. Especially here:
expecting that an exception would bubble down is just what everybody would
expect from unserialize(), like everything else in 

Re: [PHP-DEV] RFC [Discussion]: Randomizer Additions

2022-10-18 Thread Tim Düsterhus

Hi

On 10/16/22 22:24, Dan Ackroyd wrote:

Shall an option be added to getFloat() that changes the logic to

select from [$min, $max] (i.e. allowing the maximum to be returned)? And
how should that look like? Boolean parameter? Enum?

An enum would probably be nice, and possibly be for all four cases of
min_(inclusive|exclusive)_max_(inclusive|exclusive) unless there is a
technical reason to not include all of them.


No technical reason. The paper for the γ-section algorithm by Prof. 
Goualard includes implementations for all four combinations.


The [closed, open) and [closed, closed] variants together are the most 
useful combination, though. The former can be cleanly split into 
subintervals, whereas the latter is symmetric which is useful if your 
use case is as well.


With rejection sampling the [closed, closed] variant can also be turned 
into any of the other three without introducing any bias.



Generating a random string containing specific characters...thus requires 
multiple lines of code for what effectively is a very simple operation.


Yeah, though those lines of code add distinction and emphasis for is
meant by character.

In particular, users might be surprised when they give this string
"abc‍‍"* and get a non-ascii result.


That's why the method includes 'bytes' in its name. This term is also 
used in ->shuffleBytes() which was renamed in 
https://wiki.php.net/rfc/random_extension_improvement due to this exact 
problem.


In fact ->shuffleBytes() can be considered the companion method to the 
proposed ->getBytesFromAlphabet():


->shuffleBytes() allows to simulate a Multivariate hypergeometric 
distribution ("sampling without replacement") by shuffling the input 
string and then using 'substr' on the result to select a number of bytes 
'n'.


->getBytesFromAlphabet() directly maps to a Multinomial distribution 
("replacement sampling").



You're going to need to be really precise on the naming and I'm not at
all sure there is a single version that would be useful enough to
belong in core.


Personally I believe the proposed ->getBytesFromAlphabet() to be useful 
enough to belong in core. There are quite a few use cases that are 
naturally restricted to ASCII: Basically everything that can be 
considered an identifier.


Arbitrary numeric strings alone likely provide plenty of use cases:

- Backup codes for multi-factor authentication (restricting the input to 
digits allows you to leverage a numeric keyboard, reducing the chance 
for input errors).

- Random phone numbers for testing purposes.
- Random credit card numbers (you just need to calculate the checksum 
yourself).


While hexadecimal strings can easily be generated by applying bin2hex to 
the output of ->getBytes(), that is also unintuitive, because the result 
length is twice the number of bytes.



whereas a 64 Bit engine could generate randomness for 8 characters at once.


I'm really not sure that many programs are going to be speed limited
by random number generation.


The syscall cost to retrieve bytes from the 'Secure' engine (which is 
the default engine) can be expensive. Especially on older operating 
system versions and depending on how many more Meltdown/Spectre-style 
vulnerabilities they find.


A userland implementation that generates 1000 random numeric strings 
with 100 characters each using the 'Secure' engine requires 146ms on my 
computer. The native implementation without optimization requires 101ms 
and the optimized native implementation 26ms.


For Xoshiro256** (the fastest engine) the numbers are 89ms, 7ms and 3ms 
respectively.


Benchmark attached.


For those that are, writing their own generator to consume all 64 bits
of randomness for each call sounds reasonably sensible, unless a
useful general api can be thought of.


This cannot be reasonably done in userland, because you pay an increased 
cost to turn the bytes into numbers and then to perform the necessary 
bit fiddling to debias the numbers.



For the float side of the RFC, as there are technical limitations on
which platforms it would be usable on, there needs to be a way of
determining whether the nextFloat and getFloat methods are going to
work. The way this is done on Imagick is to put appropriate defines in
the stub file and in the C code implementations so that the methods
aren't available on the class for the platforms where it isn't going
to function correctly.

I made a PR for that to Tim's repo, though I don't know of an
environment where it can be tested.


I've seent he PR and we shortly discussed this in chat. I would defer 
this to the code review, because this amounts to an implementation 
detail. Especially since all reasonable server platforms use IEEE 754.


Best regards
Tim Düsterhus<>

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php

Re: [PHP-DEV] [VOTE] Improve unserialize() error handling

2022-10-18 Thread Dan Ackroyd
Hi Tim,

On Fri, 14 Oct 2022 at 15:44, Tim Düsterhus  wrote:
>
> Hi
>
> as announced on Wednesday [1] I've now opened the vote for:
>
> "Improve unserialize() error handling" [2]


I'm going to have to vote no for changing to throwing an exception.
The BC break is too big imo, and too hard for people to be aware of.

IMO There needs to be a clear path of deprecating the current
behaviour that people are alerted to (rather than having to read
upgrade notes), so that they know they need to change their code from:

function getUserPreferences($data) {
  $user_preferences = @unserialize($data);
  if ($user_preferences === false) {
$user_preferences = get_default_preferences();
  }
  return $user_preferences;
}

to something like:

function getUserPreferences($data) {
  try {
return unserialize($data);
  }
  catch (UnserializationFailedException $ufe) {
return get_default_preferences();
  }
}

but I'm not having great ideas of how that could be done 'nicely'.

Similar for the exception hierarchy change (example is below).
Changing the type of exception thrown would break valid (if not super
great) code in a way that would be hard to detect, and I'm not sure we
have a good way of performing this type of change.

IMO these isn't just a BC break that needs to wait for a major version
release; it's something that needs to have a clear upgrade path no
matter which version they are proposed for.

btw I think there are underlying interlinked problems with PHP that
causes proposed changes like these to be more difficult than anyone
would want:

1. The PHP core library is 'ungood' in quite a few aspects, and many
people would like to improve it, but some of those changes are hard to
do in a BC way.

2. Because of the way PHP works, it's not possible for one
library/module to use different versions of the same function. This is
different from JavaScript where with their (ridiculously convoluted*)
tools, each module is able to specify exactly which versions of which
libraries it wants to use.

3. Although PECL was good for the time it was created in (where it was
uncommon for individuals to have places to share files online) it
really isn't a tool fit for purpose any more. It's not really fit for
purpose any more, as even when there are (allegedly) great libraries
like ds (https://www.php.net/manual/en/book.ds.php) relatively few
people using PHP use them.

I think trying to improve the situation, so that changes to individual
functions avoid even requiring a discussion in internals in the first
place, is a worthy goal. But it's one that would require many multiple
years of effort, so is far too difficult to be done by individual
people volunteering their time.

Addressing problems that are far too large for individual contributors
to fix, is possibly a discussion to be taken up by the PHP Foundation.
If anyone who works for a for-profit company has read this far and
their company isn't already sponsoring something like 1% of their
engineers salaries to the foundation, then I encourage you to do so at
https://opencollective.com/phpfoundation .

For the  "Increase the severity of emitted E_NOTICE to E_WARNING"
part... I offer no opinion on the change. I personally think any
project that doesn't convert any unsilenced warning to an exception is
just asking for trouble.

cheers
Dan
Ack

* I wouldn't want to see a 'standard' library for PHP split across
80,000 repositories, but I think that the one (1) standard library
that PHP has is too few. Which possibly gives somes bounds to the
desirable number of libraries for a programming language.



// Example for __unserialize exception

class UserNotFoundException extends \Exception {}

function check_user_exists($user_id)
{
// check user account hasn't been deleted, otherwise throw:
throw new UserNotFoundException("exceptions for flow control are great.\n");
}

class User
{
private $id = 5;

function __unserialize(array $data) {
check_user_exists($data["\x00User\x00id"]);
}
}


$user = new User();
$data = serialize($user);

echo "$data \n";

try {
$result = unserialize($data);
}
catch (UserNotFoundException $unfe) {
// This exception would no longer be caught if the exception type
was changed.
echo "redirect user to login page\n";
exit(0);
}

echo "User is valid.\n";

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



Re: [PHP-DEV] [VOTE] Improve unserialize() error handling

2022-10-18 Thread Tim Düsterhus

Hi

On 10/18/22 03:26, Tim Starling wrote:

If I'm reading the implementation correctly, the original exception is
thrown away, there's no way to get the original message and backtrace.
That will make debugging more difficult.


See Derick's reply (and my reply to Derick's reply).


I reviewed about 150 callers of unserialize() in the MediaWiki core
and extensions. There were no instances of the kind of error guarding
shown in the RFC. Occasionally, we have:

$result = @unserialize( $data );
if ( !$result ) {
   throw new Exception( ... );
}

The assumption is that if there is a failure, false will be returned,
as is documented in the manual. I don't understand the assertion in
the RFC that set_error_handler() is required to detect unserialization
failure.


false is a valid value that might've been serialized. The manual 
specifically notes:


 false is returned both in the case of an error and if unserializing the serialized false value. It is possible to catch this special case by comparing data with serialize(false) or by catching the issued E_NOTICE. 


Note that your example code also errors out if the $result is an empty 
array, an empty string or another falsy values. This might be an 
acceptable result in your specific case, but is not the correct behavior 
in the general case.


The example in the RFC was written to be correct for the general case, 
without imposing any constraints in the input data.



Normally, exceptions in MediaWiki are allowed to propagate back to the
global exception handler. There are a couple of categories of
exceptions which should never be caught as a matter of policy:

* Database errors. These are permanent errors. Attempting to persist
with executing the request after a database error will mostly likely
lead to another error.

* Request timeout exceptions. We're using our Excimer extension to
implement request timeouts by throwing exceptions from a timer
callback. This has been an interesting can of worms which has brought
some nice features with a number of unexpected consequences. It means
that it's always incorrect to catch and discard an arbitrary Throwable.


Please note that this is already broken independently of this RFC. It is 
entirely reasonable for a library to catch all Exceptions that happen 
internally and wrap them in a well-defined "library exception" at the 
library boundary.


As a specific example, the "flysystem" file system abstraction library 
is doing this:


https://github.com/thephpleague/flysystem/blob/f5e93fad64cff5f45a1f79246dae15b31099b105/src/MountManager.php#L35-L37


Probably nothing in our ecosystem is doing database access in an
__unserialize() magic method. But if someone tried it, I would want
the resulting errors to be properly logged with correct backtraces,
not silently discarded.


The internal behavior of unserialization is unspecified. It only 
guarantees you that you get back your original structure and that all 
the unserialization callbacks will have been called if unserialization 
fails. For the error case there are no guarantees, except "a Throwable 
might be thrown".


As an example PHP 7.0.10 changed the behavior for unserialization 
failures to no longer call __destruct() on incompletely initialized 
objects to fix a security issue:


https://3v4l.org/82UJj

I believe this is the corresponding bug: 
https://bugs.php.net/bug.php?id=72663


In PHP 7.0.15/7.1.1 apparently there was another change:

https://3v4l.org/tnh2R

As such you can't rely on the database access Exception to actually be 
emitted if unserialization of another object also fails.


Best regards
Tim Düsterhus

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



Re: [PHP-DEV] [VOTE] Improve unserialize() error handling

2022-10-18 Thread Tim Düsterhus

Hi

On 10/18/22 09:32, Derick Rethans wrote:

If I'm reading the implementation correctly, the original exception
is thrown away, there's no way to get the
original message and backtrace.
That will make debugging more difficult.


I believe that is not true. It should be wrapped and accessible through 
->previous on the exception.



Correct, the original Exception/Error will be available using 
$e->getPrevious(). This is also explained in the RFC in the "Add wrapper 
Exception 'UnserializationFailedException'" section:


 The original \Throwable will be accessible through the $previous property of \UnserializationFailedException, allowing the developer to learn about the actual cause of the unserialization failure by inspecting ->getPrevious(). 


This is not apparent from taking a cursory look at the diff of the 
native implementation in the pull request, because setting $previous 
will happen implicitly whenever native code throws an Exception, while 
an Exception is already "active".


Best regards
Tim Düsterhus

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



Re: [PHP-DEV] [VOTE] Improve unserialize() error handling

2022-10-18 Thread Derick Rethans
On 18 October 2022 02:26:59 BST, Tim Starling  wrote:

> If I'm reading the implementation correctly, the original exception
> is thrown away, there's no way to get the
> original message and backtrace. 
> That will make debugging more difficult.

I believe that is not true. It should be wrapped and accessible through 
->previous on the exception. 

cheers
Derick 

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php