Re: [PHP-DEV] [VOTE] Improve unserialize() error handling
On 18/10/22 22:32, Tim Düsterhus wrote: false is a valid value that might've been serialized. The manual specifically notes: false is returned both in the case of an error and if unserializing the serialized false value. It is possible to catch this special case by comparing data with serialize(false) or by catching the issued E_NOTICE. Yes, that is fair enough. Note that your example code also errors out if the $result is an empty array, an empty string or another falsy values. This might be an acceptable result in your specific case, but is not the correct behavior in the general case. Right, there was no instance in the 150 callers I reviewed yesterday in which unserialize() was explicitly checked for errors while unserializing a falsy value. Most often, there is an array wrapper around the actual value, or the possibility is otherwise excluded. The example in the RFC was written to be correct for the general case, without imposing any constraints in the input data. As I said, if throwing was optional, like in json_decode(), I would be all for it. I'm just doing a cost/benefit analysis. The common case should be easy, while the general case should be possible. As a matter of style, I think PHP's false returns are fine. I don't think we need to follow Python into making every error an exception. -- Tim Starling -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: https://www.php.net/unsub.php
Re: [PHP-DEV] RFC [Discussion]: Randomizer Additions
On Tue, Oct 18, 2022, at 12:22 PM, Tim Düsterhus wrote: >>> Generating a random string containing specific characters...thus requires >>> multiple lines of code for what effectively is a very simple operation. >> >> Yeah, though those lines of code add distinction and emphasis for is >> meant by character. >> >> In particular, users might be surprised when they give this string >> "abc"* and get a non-ascii result. > > That's why the method includes 'bytes' in its name. This term is also > used in ->shuffleBytes() which was renamed in > https://wiki.php.net/rfc/random_extension_improvement due to this exact > problem. > > In fact ->shuffleBytes() can be considered the companion method to the > proposed ->getBytesFromAlphabet(): "Alphabet" here still, to me, implies a character set, not a byte stream. Maybe getBytesFromString? getBytesFromList? getBytesFrom() (because you're getting it "from" the string that's provided, so you don't need another noun there.)? I'm not opposed to the functionality, but "alphabet" doesn't seem like the right word for it. --Larry Garfield -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: https://www.php.net/unsub.php
Re: [PHP-DEV] RFC [Discussion]: Randomizer Additions
On Tue, Oct 18, 2022 at 10:22 AM Tim Düsterhus wrote: > > This cannot be reasonably done in userland, because you pay an increased > cost to turn the bytes into numbers and then to perform the necessary > bit fiddling to debias the numbers. > To add to this, I'm going to link to a userland implementation I wrote for my Fermat library. A couple of notes: 1. I am not at all claiming to be a subject matter expert, and this was my first pass at the feature, not a fully researched and optimized version. 2. The main issue at hand for me was that my 'numbers' could be arbitrary precision, which would require me to directly ask for bytes and then pull out numbers from them depending on the number of integer digits possible within the range. https://github.com/JordanRL/Fermat/blob/5d186eac99ceb98b089865099c2d7622428bfdaa/src/Samsara/Fermat/Provider/RandomProvider.php#L128 Now again, this was my first pass, but the way I tackled this was to get a string of bytes that fully contained the integer range. The issue of debiasing the numbers with that approach is *so* difficult, that I opted instead to simply recursively call the operation until a number within the desired range was generated. This avoids introducing bias, but is obviously less than optimal. Some kind of ability to directly provide a set of possible characters and generating from bytes directly would at least reduce the recursion space further in my case, though my particular algorithm still needs additional improvement obviously. Generating random numbers that have an arbitrary range is not easy, particularly if your range might exceed PHP_INT_MAX. *Any* utilities that core can provide to improve performance and reduce the number of places that bias might be introduced in such a situation would be much appreciated. Personally, I decided for my implementation that introducing bias was a greater concern than performance. Jordan
Re: [PHP-DEV] [VOTE] Improve unserialize() error handling
> > The other failing tests involve "unserialize_callback_func" to gracefully > > detect class-not-found, and they are all plain broken by the RFC. > > > > I created this patch/PR to show the changes that would be required on > > Symfony to work around the BC break: > > https://github.com/symfony/symfony/pull/47882 > > > > Note to readers: in this whole discussion, Symfony is just an example of > > affected code. In the end, Symfony will adapt to the RFC if it passes. > The > > point being made is that PHP scripts should not have to be patched to run > > on newer minor versions of PHP. That's what "keeping BC" means. > > Scripts will not need to be changed, unless they are already relying on > observed behavior instead of the documented behavior. When I use Symfony's 'PhpSerializer' and perform the following call: > > > $serializer->decode([ > > 'body' => '{"message": "bar"}', > > ]); > > Then I can expect the input to be gracefully rejected with a > MessageDecodingFailedException. This behavior is tested in the > 'testDecodingFailsWithBadFormat' test. > > But *under no circumstances* must I perform the following method call: > > > $serializer->decode([ > > 'body' => 'O:8:"DateTime":0:{}', > > ]); > > Because this payload is invalid and thus must not exist in practice. > That's the current behavior, yes. It allows graceful errors when one misconfigures their workers and sends jsons to a worker that expects php-ser. Under no circumstances do we need to guard against adversary payloads. We could handle the situation you describe of course, but that'd be just dead code in practice. ¯\_(ツ)_/¯ > > Breaking BC in the name of such invalid payloads doesn't make an argument > > in favor of the RFC. > > I demonstrated how to adjust 'PhpSerializer' to improve the behavior for > what I consider a reasonable use case without breaking any existing > tests in step3b.diff and step3c.diff. At the same time the fixed version > will automatically handle the change this RFC proposes. I even offered > to adjust the RFC implementation to preserve the original Exception > message and code. > > >> Failed asserting that exception message 'Could not decode message using > >> PHP serialization: O:13:"ReceivedSt0mp":0:{}.' matches '/class > >> "ReceivedSt0mp" not found/'. > >> > >> Okay, now the Exception message changed. Personally I do not consider > >> this a BC break: I believe Exception messages are meant for human > >> consumption, not for programs. Otherwise fixing a typo in the message > >> would be a BC break. If the code wants to learn about the cause, it > >> should either use the '$code' or different types of Exception should be > >> thrown to clarify the cause by entering a different catch() block. > >> > > > > Yes, the specific error message should be part of the BC promise. This > > allows building test suites that can assert the message in a stable way. > > I'm not talking about test suites here. I believe makes sense to verify > the error message to ensure a specific error message is emitted to the > human observer in the error log. > Breaking test suites is the BC issue I wanted to highlight. > I was talking about code that does something like this, which I consider > to be inherently unsafe: > > try { … } > catch (SomeException $e) { >if ($e->getMessage() === 'Foobar') doSomething(); >else doSomethingElse(); > } > > As a library author I want to be able to provide the best possible > Exception message to ease debugging for the user. This is not possible > if I am locked into a bad choice forever. > > > This is also why we don't change the output of > var_dump/print_r/var_export: > > they're written now, in the same of BC, for the best of stability. (I've > > barely read any PHP code where exception's code is used in any useful > BTW - > > that can't be a solution.) > > PDO is an example where the $code is useful, because it exposes the > standardized SQL error code. Similarly a HTTP client could expose the > status code within $code. > > For other libraries simply using a different Exception class within the > same hierarchy is often sufficient to differentiate between different > errors, because programmatically I often don't care *why exactly* an > operation failed. > > > More importantly, the type of thrown exceptions should undoubtedly be > part > > of the BC promise. > > I agree. > > > Wrapping exceptions inside UnserializationFailedException breaks existing > > catch/instanceof checks (see my PR above.) > > > > unserialize() only promises you that a Throwable might be thrown: > > > https://www.php.net/manual/en/function.unserialize.php#refsect1-function.unserialize-errors Breaking BC at will unless it's properly documented looks like a terrible policy. It'd make developing in PHP really risky. Especially here: expecting that an exception would bubble down is just what everybody would expect from unserialize(), like everything else in
Re: [PHP-DEV] RFC [Discussion]: Randomizer Additions
Hi On 10/16/22 22:24, Dan Ackroyd wrote: Shall an option be added to getFloat() that changes the logic to select from [$min, $max] (i.e. allowing the maximum to be returned)? And how should that look like? Boolean parameter? Enum? An enum would probably be nice, and possibly be for all four cases of min_(inclusive|exclusive)_max_(inclusive|exclusive) unless there is a technical reason to not include all of them. No technical reason. The paper for the γ-section algorithm by Prof. Goualard includes implementations for all four combinations. The [closed, open) and [closed, closed] variants together are the most useful combination, though. The former can be cleanly split into subintervals, whereas the latter is symmetric which is useful if your use case is as well. With rejection sampling the [closed, closed] variant can also be turned into any of the other three without introducing any bias. Generating a random string containing specific characters...thus requires multiple lines of code for what effectively is a very simple operation. Yeah, though those lines of code add distinction and emphasis for is meant by character. In particular, users might be surprised when they give this string "abc"* and get a non-ascii result. That's why the method includes 'bytes' in its name. This term is also used in ->shuffleBytes() which was renamed in https://wiki.php.net/rfc/random_extension_improvement due to this exact problem. In fact ->shuffleBytes() can be considered the companion method to the proposed ->getBytesFromAlphabet(): ->shuffleBytes() allows to simulate a Multivariate hypergeometric distribution ("sampling without replacement") by shuffling the input string and then using 'substr' on the result to select a number of bytes 'n'. ->getBytesFromAlphabet() directly maps to a Multinomial distribution ("replacement sampling"). You're going to need to be really precise on the naming and I'm not at all sure there is a single version that would be useful enough to belong in core. Personally I believe the proposed ->getBytesFromAlphabet() to be useful enough to belong in core. There are quite a few use cases that are naturally restricted to ASCII: Basically everything that can be considered an identifier. Arbitrary numeric strings alone likely provide plenty of use cases: - Backup codes for multi-factor authentication (restricting the input to digits allows you to leverage a numeric keyboard, reducing the chance for input errors). - Random phone numbers for testing purposes. - Random credit card numbers (you just need to calculate the checksum yourself). While hexadecimal strings can easily be generated by applying bin2hex to the output of ->getBytes(), that is also unintuitive, because the result length is twice the number of bytes. whereas a 64 Bit engine could generate randomness for 8 characters at once. I'm really not sure that many programs are going to be speed limited by random number generation. The syscall cost to retrieve bytes from the 'Secure' engine (which is the default engine) can be expensive. Especially on older operating system versions and depending on how many more Meltdown/Spectre-style vulnerabilities they find. A userland implementation that generates 1000 random numeric strings with 100 characters each using the 'Secure' engine requires 146ms on my computer. The native implementation without optimization requires 101ms and the optimized native implementation 26ms. For Xoshiro256** (the fastest engine) the numbers are 89ms, 7ms and 3ms respectively. Benchmark attached. For those that are, writing their own generator to consume all 64 bits of randomness for each call sounds reasonably sensible, unless a useful general api can be thought of. This cannot be reasonably done in userland, because you pay an increased cost to turn the bytes into numbers and then to perform the necessary bit fiddling to debias the numbers. For the float side of the RFC, as there are technical limitations on which platforms it would be usable on, there needs to be a way of determining whether the nextFloat and getFloat methods are going to work. The way this is done on Imagick is to put appropriate defines in the stub file and in the C code implementations so that the methods aren't available on the class for the platforms where it isn't going to function correctly. I made a PR for that to Tim's repo, though I don't know of an environment where it can be tested. I've seent he PR and we shortly discussed this in chat. I would defer this to the code review, because this amounts to an implementation detail. Especially since all reasonable server platforms use IEEE 754. Best regards Tim Düsterhus<> -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: https://www.php.net/unsub.php
Re: [PHP-DEV] [VOTE] Improve unserialize() error handling
Hi Tim, On Fri, 14 Oct 2022 at 15:44, Tim Düsterhus wrote: > > Hi > > as announced on Wednesday [1] I've now opened the vote for: > > "Improve unserialize() error handling" [2] I'm going to have to vote no for changing to throwing an exception. The BC break is too big imo, and too hard for people to be aware of. IMO There needs to be a clear path of deprecating the current behaviour that people are alerted to (rather than having to read upgrade notes), so that they know they need to change their code from: function getUserPreferences($data) { $user_preferences = @unserialize($data); if ($user_preferences === false) { $user_preferences = get_default_preferences(); } return $user_preferences; } to something like: function getUserPreferences($data) { try { return unserialize($data); } catch (UnserializationFailedException $ufe) { return get_default_preferences(); } } but I'm not having great ideas of how that could be done 'nicely'. Similar for the exception hierarchy change (example is below). Changing the type of exception thrown would break valid (if not super great) code in a way that would be hard to detect, and I'm not sure we have a good way of performing this type of change. IMO these isn't just a BC break that needs to wait for a major version release; it's something that needs to have a clear upgrade path no matter which version they are proposed for. btw I think there are underlying interlinked problems with PHP that causes proposed changes like these to be more difficult than anyone would want: 1. The PHP core library is 'ungood' in quite a few aspects, and many people would like to improve it, but some of those changes are hard to do in a BC way. 2. Because of the way PHP works, it's not possible for one library/module to use different versions of the same function. This is different from JavaScript where with their (ridiculously convoluted*) tools, each module is able to specify exactly which versions of which libraries it wants to use. 3. Although PECL was good for the time it was created in (where it was uncommon for individuals to have places to share files online) it really isn't a tool fit for purpose any more. It's not really fit for purpose any more, as even when there are (allegedly) great libraries like ds (https://www.php.net/manual/en/book.ds.php) relatively few people using PHP use them. I think trying to improve the situation, so that changes to individual functions avoid even requiring a discussion in internals in the first place, is a worthy goal. But it's one that would require many multiple years of effort, so is far too difficult to be done by individual people volunteering their time. Addressing problems that are far too large for individual contributors to fix, is possibly a discussion to be taken up by the PHP Foundation. If anyone who works for a for-profit company has read this far and their company isn't already sponsoring something like 1% of their engineers salaries to the foundation, then I encourage you to do so at https://opencollective.com/phpfoundation . For the "Increase the severity of emitted E_NOTICE to E_WARNING" part... I offer no opinion on the change. I personally think any project that doesn't convert any unsilenced warning to an exception is just asking for trouble. cheers Dan Ack * I wouldn't want to see a 'standard' library for PHP split across 80,000 repositories, but I think that the one (1) standard library that PHP has is too few. Which possibly gives somes bounds to the desirable number of libraries for a programming language. // Example for __unserialize exception class UserNotFoundException extends \Exception {} function check_user_exists($user_id) { // check user account hasn't been deleted, otherwise throw: throw new UserNotFoundException("exceptions for flow control are great.\n"); } class User { private $id = 5; function __unserialize(array $data) { check_user_exists($data["\x00User\x00id"]); } } $user = new User(); $data = serialize($user); echo "$data \n"; try { $result = unserialize($data); } catch (UserNotFoundException $unfe) { // This exception would no longer be caught if the exception type was changed. echo "redirect user to login page\n"; exit(0); } echo "User is valid.\n"; -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: https://www.php.net/unsub.php
Re: [PHP-DEV] [VOTE] Improve unserialize() error handling
Hi On 10/18/22 03:26, Tim Starling wrote: If I'm reading the implementation correctly, the original exception is thrown away, there's no way to get the original message and backtrace. That will make debugging more difficult. See Derick's reply (and my reply to Derick's reply). I reviewed about 150 callers of unserialize() in the MediaWiki core and extensions. There were no instances of the kind of error guarding shown in the RFC. Occasionally, we have: $result = @unserialize( $data ); if ( !$result ) { throw new Exception( ... ); } The assumption is that if there is a failure, false will be returned, as is documented in the manual. I don't understand the assertion in the RFC that set_error_handler() is required to detect unserialization failure. false is a valid value that might've been serialized. The manual specifically notes: false is returned both in the case of an error and if unserializing the serialized false value. It is possible to catch this special case by comparing data with serialize(false) or by catching the issued E_NOTICE. Note that your example code also errors out if the $result is an empty array, an empty string or another falsy values. This might be an acceptable result in your specific case, but is not the correct behavior in the general case. The example in the RFC was written to be correct for the general case, without imposing any constraints in the input data. Normally, exceptions in MediaWiki are allowed to propagate back to the global exception handler. There are a couple of categories of exceptions which should never be caught as a matter of policy: * Database errors. These are permanent errors. Attempting to persist with executing the request after a database error will mostly likely lead to another error. * Request timeout exceptions. We're using our Excimer extension to implement request timeouts by throwing exceptions from a timer callback. This has been an interesting can of worms which has brought some nice features with a number of unexpected consequences. It means that it's always incorrect to catch and discard an arbitrary Throwable. Please note that this is already broken independently of this RFC. It is entirely reasonable for a library to catch all Exceptions that happen internally and wrap them in a well-defined "library exception" at the library boundary. As a specific example, the "flysystem" file system abstraction library is doing this: https://github.com/thephpleague/flysystem/blob/f5e93fad64cff5f45a1f79246dae15b31099b105/src/MountManager.php#L35-L37 Probably nothing in our ecosystem is doing database access in an __unserialize() magic method. But if someone tried it, I would want the resulting errors to be properly logged with correct backtraces, not silently discarded. The internal behavior of unserialization is unspecified. It only guarantees you that you get back your original structure and that all the unserialization callbacks will have been called if unserialization fails. For the error case there are no guarantees, except "a Throwable might be thrown". As an example PHP 7.0.10 changed the behavior for unserialization failures to no longer call __destruct() on incompletely initialized objects to fix a security issue: https://3v4l.org/82UJj I believe this is the corresponding bug: https://bugs.php.net/bug.php?id=72663 In PHP 7.0.15/7.1.1 apparently there was another change: https://3v4l.org/tnh2R As such you can't rely on the database access Exception to actually be emitted if unserialization of another object also fails. Best regards Tim Düsterhus -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: https://www.php.net/unsub.php
Re: [PHP-DEV] [VOTE] Improve unserialize() error handling
Hi On 10/18/22 09:32, Derick Rethans wrote: If I'm reading the implementation correctly, the original exception is thrown away, there's no way to get the original message and backtrace. That will make debugging more difficult. I believe that is not true. It should be wrapped and accessible through ->previous on the exception. Correct, the original Exception/Error will be available using $e->getPrevious(). This is also explained in the RFC in the "Add wrapper Exception 'UnserializationFailedException'" section: The original \Throwable will be accessible through the $previous property of \UnserializationFailedException, allowing the developer to learn about the actual cause of the unserialization failure by inspecting ->getPrevious(). This is not apparent from taking a cursory look at the diff of the native implementation in the pull request, because setting $previous will happen implicitly whenever native code throws an Exception, while an Exception is already "active". Best regards Tim Düsterhus -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: https://www.php.net/unsub.php
Re: [PHP-DEV] [VOTE] Improve unserialize() error handling
On 18 October 2022 02:26:59 BST, Tim Starling wrote: > If I'm reading the implementation correctly, the original exception > is thrown away, there's no way to get the > original message and backtrace. > That will make debugging more difficult. I believe that is not true. It should be wrapped and accessible through ->previous on the exception. cheers Derick -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: https://www.php.net/unsub.php