Re: [PATCH] json_lex_string: don't overread on bad UTF8
On Wed, May 8, 2024 at 9:27 PM Michael Paquier wrote:
> This is a bit mitigated by the fact that d6607016c738 is recent, but
> this is incorrect since v13 so backpatched down to that.

Thank you!

--Jacob
Re: [PATCH] json_lex_string: don't overread on bad UTF8
On Wed, May 08, 2024 at 07:01:08AM -0700, Jacob Champion wrote:
> On Tue, May 7, 2024 at 10:31 PM Michael Paquier wrote:
>> But looking closer, I can see that in the JSON_INVALID_TOKEN case,
>> when !tok_done, we set token_terminator to point to the end of the
>> token, and that would include an incomplete byte sequence like in your
>> case. :/
>
> Ah, I see what you're saying. Yeah, that approach would need some more
> invasive changes.

My first feeling was actually to do that, and report the location in
the input string where we are seeing issues.  All code paths playing
with token_terminator would need to track that.

> Agreed. Fortunately (or unfortunately?) I think the JSON
> client-encoding work is now a prerequisite for OAuth in libpq, so
> hopefully some improvements can fall out of that work too.

I'm afraid so.  I don't quite see how this would be OK to tweak on
stable branches, but all areas that could report error states with
partial byte sequence contents would benefit from such a change.

>> Thoughts and/or objections?
>
> None here.

This is a bit mitigated by the fact that d6607016c738 is recent, but
this is incorrect since v13 so backpatched down to that.
--
Michael
Re: [PATCH] json_lex_string: don't overread on bad UTF8
On Tue, May 7, 2024 at 10:31 PM Michael Paquier wrote:
> But looking closer, I can see that in the JSON_INVALID_TOKEN case,
> when !tok_done, we set token_terminator to point to the end of the
> token, and that would include an incomplete byte sequence like in your
> case. :/

Ah, I see what you're saying. Yeah, that approach would need some more
invasive changes.

> This situation makes me
> uncomfortable and we should put more effort in printing error messages
> in a readable format, but that could always be tackled later as a
> separate problem.. And I don't see something backpatchable at short
> sight for v16.

Agreed. Fortunately (or unfortunately?) I think the JSON
client-encoding work is now a prerequisite for OAuth in libpq, so
hopefully some improvements can fall out of that work too.

> Thoughts and/or objections?

None here. Thanks!

--Jacob
Re: [PATCH] json_lex_string: don't overread on bad UTF8
On Tue, May 07, 2024 at 02:06:10PM -0700, Jacob Champion wrote:
> Maybe I've misunderstood, but isn't that what's being done in v2?

Something a bit different..  I was wondering if it could be possible
to tweak this code to truncate the data in the generated error string
so that the incomplete multi-byte sequence is entirely cut out, which
would come down to setting token_terminator to "s" (the last byte
before the incomplete byte sequence) rather than "term" (the last byte
available, even if incomplete):

#define FAIL_AT_CHAR_END(code) \
    do { \
        char *term = s + pg_encoding_mblen(lex->input_encoding, s); \
        lex->token_terminator = (term <= end) ? term : s; \
        return code; \
    } while (0)

But looking closer, I can see that in the JSON_INVALID_TOKEN case,
when !tok_done, we set token_terminator to point to the end of the
token, and that would include an incomplete byte sequence like in your
case. :/

At the end of the day, I think that I'm OK with your patch, to avoid
the overread for now in the back-branches.  This situation makes me
uncomfortable and we should put more effort into printing error
messages in a readable format, but that could always be tackled later
as a separate problem..  And I don't see something backpatchable at
short sight for v16.

Thoughts and/or objections?
--
Michael
Re: [PATCH] json_lex_string: don't overread on bad UTF8
On Mon, May 6, 2024 at 8:43 PM Michael Paquier wrote:
> On Fri, May 03, 2024 at 07:05:38AM -0700, Jacob Champion wrote:
>> We could port something like that to src/common. IMO that'd be more
>> suited for an actual conversion routine, though, as opposed to a
>> parser that for the most part assumes you didn't lie about the input
>> encoding and is just trying not to crash if you're wrong. Most of the
>> time, the parser just copies bytes between delimiters around and it's
>> up to the caller to handle encodings... the exceptions to that are the
>> \u escapes and the error handling.
>
> Hmm. That would still leave the backpatch issue at hand, which is
> kind of confusing to leave as it is. Would it be complicated to
> truncate the entire byte sequence in the error message and just give
> up because we cannot do better if the input byte sequence is
> incomplete?

Maybe I've misunderstood, but isn't that what's being done in v2?

>> Maybe I'm missing
>> code somewhere, but I don't see a conversion routine from
>> json_errdetail() to the actual client/locale encoding. (And the parser
>> does not support multibyte input_encodings that contain ASCII in trail
>> bytes.)
>
> Referring to json_lex_string() that does UTF-8 -> ASCII -> give-up in
> its conversion for FRONTEND, I guess?  Yep.  This limitation looks
> like a problem, especially if plugging that to libpq.

Okay. How we deal with that will likely guide the "optimal" fix to
error reporting, I think...

Thanks,
--Jacob
Re: [PATCH] json_lex_string: don't overread on bad UTF8
On Fri, May 03, 2024 at 07:05:38AM -0700, Jacob Champion wrote:
> On Fri, May 3, 2024 at 4:54 AM Peter Eisentraut wrote:
>> but for the general encoding conversion we have what
>> would appear to be the same behavior in report_invalid_encoding(), and
>> we go out of our way there to produce a verbose error message including
>> the invalid data.

I was looking for that a couple of days ago in the backend but could
not put my finger on it.  Thanks.

> We could port something like that to src/common. IMO that'd be more
> suited for an actual conversion routine, though, as opposed to a
> parser that for the most part assumes you didn't lie about the input
> encoding and is just trying not to crash if you're wrong. Most of the
> time, the parser just copies bytes between delimiters around and it's
> up to the caller to handle encodings... the exceptions to that are the
> \u escapes and the error handling.

Hmm.  That would still leave the backpatch issue at hand, which is
kind of confusing to leave as it is.  Would it be complicated to
truncate the entire byte sequence in the error message and just give
up because we cannot do better if the input byte sequence is
incomplete?  We could still have some information depending on the
string given in input, which should be enough, hopefully.  With the
location pointing to the beginning of the sequence, even better.

> Offhand, are all of our supported frontend encodings
> self-synchronizing? By that I mean, is it safe to print a partial byte
> sequence if the locale isn't UTF-8? (As I type this I'm staring at
> Shift-JIS, and thinking "probably not.")
>
> Actually -- hopefully this is not too much of a tangent -- that
> further crystallizes a vague unease about the API that I have. The
> JsonLexContext is initialized with something called the
> "input_encoding", but that encoding is necessarily also the output
> encoding for parsed string literals and error messages. For the server
> side that's fine, but frontend clients have the input_encoding locked
> to UTF-8, which seems like it might cause problems? Maybe I'm missing
> code somewhere, but I don't see a conversion routine from
> json_errdetail() to the actual client/locale encoding. (And the parser
> does not support multibyte input_encodings that contain ASCII in trail
> bytes.)

Referring to json_lex_string() that does UTF-8 -> ASCII -> give-up in
its conversion for FRONTEND, I guess?  Yep.  This limitation looks
like a problem, especially if plugging that to libpq.
--
Michael
Re: [PATCH] json_lex_string: don't overread on bad UTF8
On Fri, May 3, 2024 at 4:54 AM Peter Eisentraut wrote:
> On 30.04.24 19:39, Jacob Champion wrote:
>> Tangentially: Should we maybe rethink pieces of the json_lex_string
>> error handling? For example, do we really want to echo an incomplete
>> multibyte sequence once we know it's bad?
>
> I can't quite find the place you might be looking at in
> json_lex_string(),

(json_lex_string() reports the beginning and end of the "area of
interest" via the JsonLexContext; it's json_errdetail() that turns
that into an error message.)

> but for the general encoding conversion we have what
> would appear to be the same behavior in report_invalid_encoding(), and
> we go out of our way there to produce a verbose error message including
> the invalid data.

We could port something like that to src/common. IMO that'd be more
suited for an actual conversion routine, though, as opposed to a
parser that for the most part assumes you didn't lie about the input
encoding and is just trying not to crash if you're wrong. Most of the
time, the parser just copies bytes between delimiters around and it's
up to the caller to handle encodings... the exceptions to that are the
\u escapes and the error handling.

Offhand, are all of our supported frontend encodings
self-synchronizing? By that I mean, is it safe to print a partial byte
sequence if the locale isn't UTF-8? (As I type this I'm staring at
Shift-JIS, and thinking "probably not.")

Actually -- hopefully this is not too much of a tangent -- that
further crystallizes a vague unease about the API that I have. The
JsonLexContext is initialized with something called the
"input_encoding", but that encoding is necessarily also the output
encoding for parsed string literals and error messages. For the server
side that's fine, but frontend clients have the input_encoding locked
to UTF-8, which seems like it might cause problems? Maybe I'm missing
code somewhere, but I don't see a conversion routine from
json_errdetail() to the actual client/locale encoding. (And the parser
does not support multibyte input_encodings that contain ASCII in trail
bytes.)

Thanks,
--Jacob
Re: [PATCH] json_lex_string: don't overread on bad UTF8
On 30.04.24 19:39, Jacob Champion wrote:
> Tangentially: Should we maybe rethink pieces of the json_lex_string
> error handling? For example, do we really want to echo an incomplete
> multibyte sequence once we know it's bad?

I can't quite find the place you might be looking at in
json_lex_string(), but for the general encoding conversion we have
what would appear to be the same behavior in
report_invalid_encoding(), and we go out of our way there to produce a
verbose error message including the invalid data.
Re: [PATCH] json_lex_string: don't overread on bad UTF8
On Wed, May 1, 2024 at 8:40 PM Michael Paquier wrote:
> On Thu, May 02, 2024 at 11:23:13AM +0900, Michael Paquier wrote:
>> About the fact that we may finish by printing unfinished UTF-8
>> sequences, I'd be curious to hear your thoughts. Now, the information
>> provided about the partial byte sequences can be also useful for
>> debugging on top of having the error code, no?

Yes, but which information do you want? Do you want to know the bad
byte sequence, or see the glyph that corresponds to it (which is
probably �)? The glyph is better as long as it's complete; if it's a
bad sequence, then maybe you'd prefer to know the particular byte, but
that assumes a lot of technical knowledge on the part of whoever's
reading the message.

> By the way, as long as I have that in mind.. I am not sure that it is
> worth spending cycles in detecting the unfinished sequences and
> making these printable. Wouldn't it be enough for more cases to
> adjust token_error() to truncate the byte sequences we cannot print?

Maybe. I'm beginning to wonder if I'm overthinking this particular
problem, and if we should just go ahead and print the bad sequence. At
least for the case of UTF-8 console encoding, replacement glyphs will
show up as needed. There is the matter of a client that's not using
UTF-8, though. Do we deal with that correctly today? (I understand why
it was done the way it was, at least on the server side, but it's
still really weird to have code that parses "JSON" that isn't actually
Unicode.)

> Another thing that I think would be nice would be to calculate the
> location of what we're parsing on a given line, and provide that in
> the error context. That would not be backpatchable as it requires a
> change in JsonLexContext, unfortunately, but it would help in making
> more sense of an error if the incomplete byte sequence is at the
> beginning of a token or after an expected character.

+1, at least that way you can skip directly to the broken spot during
a postmortem.

Thanks,
--Jacob
Re: [PATCH] json_lex_string: don't overread on bad UTF8
On Thu, May 02, 2024 at 11:23:13AM +0900, Michael Paquier wrote:
> About the fact that we may finish by printing unfinished UTF-8
> sequences, I'd be curious to hear your thoughts. Now, the information
> provided about the partial byte sequences can be also useful for
> debugging on top of having the error code, no?

By the way, as long as I have that in mind..  I am not sure that it is
worth spending cycles in detecting the unfinished sequences and making
these printable.  Wouldn't it be enough for more cases to adjust
token_error() to truncate the byte sequences we cannot print?

Another thing that I think would be nice would be to calculate the
location of what we're parsing on a given line, and provide that in
the error context.  That would not be backpatchable as it requires a
change in JsonLexContext, unfortunately, but it would help in making
more sense of an error if the incomplete byte sequence is at the
beginning of a token or after an expected character.
--
Michael
Re: [PATCH] json_lex_string: don't overread on bad UTF8
On Wed, May 01, 2024 at 04:22:24PM -0700, Jacob Champion wrote:
> On Tue, Apr 30, 2024 at 11:09 PM Michael Paquier wrote:
>> Not sure to like much the fact that this advances token_terminator
>> first. Wouldn't it be better to calculate pg_encoding_mblen() first,
>> then save token_terminator? I feel a bit uneasy about saving a value
>> in token_terminator past the end of the string. That's a nit in this
>> context, still..
>
> v2 tries it that way; see what you think. Is the concern that someone
> might add code later that escapes that macro early?

Yeah, I am not sure if that's something that would really happen, but
that looks like a good practice to keep anyway, to keep a clean stack
at any time.

>> Ah, that makes sense. That looks OK here. A comment around the test
>> would be adapted to document that, I guess.
>
> Done.

That seems OK at quick glance.  I don't have much room to do something
about this patch this week as an effect of Golden Week and the
buildfarm effect, but I should be able to get to it next week once the
next round of minor releases is tagged.

About the fact that we may finish by printing unfinished UTF-8
sequences, I'd be curious to hear your thoughts.  Now, the information
provided about the partial byte sequences can be also useful for
debugging on top of having the error code, no?
--
Michael
Re: [PATCH] json_lex_string: don't overread on bad UTF8
On Tue, Apr 30, 2024 at 11:09 PM Michael Paquier wrote:
> Not sure to like much the fact that this advances token_terminator
> first. Wouldn't it be better to calculate pg_encoding_mblen() first,
> then save token_terminator? I feel a bit uneasy about saving a value
> in token_terminator past the end of the string. That's a nit in this
> context, still..

v2 tries it that way; see what you think. Is the concern that someone
might add code later that escapes that macro early?

> Hmm. Keeping it around as currently designed means that it could
> cause more harm than anything in the long term, even in the stable
> branches if new code uses it. There is a risk of seeing this new code
> incorrectly using it again, even if its top comment tells that we rely
> on the string being nul-terminated. A safer alternative would be to
> redesign it so that the length of the string is provided in input,
> removing the dependency on strlen in its internals, perhaps. Anyway,
> without any callers of it, I'd be tempted to wipe it from HEAD and
> call it a day.

Removed in v2.

>> The new test needs to record two versions of the error message, one
>> for invalid token and one for invalid escape sequence. This is
>> because, for smaller chunk sizes, the partial-token logic in the
>> incremental JSON parser skips the affected code entirely when it can't
>> find an ending double-quote.
>
> Ah, that makes sense. That looks OK here. A comment around the test
> would be adapted to document that, I guess.

Done.

> Advancing the tracking pointer 's' before reporting an error related
> to the end of the string is a bad practice, IMO, and we should avoid
> that. json_lex_string() does not offer a warm feeling regarding that
> with escape characters, at least :/

Yeah... I think some expansion of the json_errdetail test coverage is
probably in my future. :)

Thanks,
--Jacob

Attachment: v2-0001-json_lex_string-don-t-overread-on-bad-UTF8.patch
Re: [PATCH] json_lex_string: don't overread on bad UTF8
On Tue, Apr 30, 2024 at 10:39:04AM -0700, Jacob Champion wrote:
> When json_lex_string() hits certain types of invalid input, it calls
> pg_encoding_mblen_bounded(), which assumes that its input is
> null-terminated and calls strnlen(). But the JSON lexer is constructed
> with an explicit string length, and we don't ensure that the string is
> null-terminated in all cases, so we can walk off the end of the
> buffer. This isn't really relevant on the server side, where you'd
> have to get a superuser to help you break string encodings, but for
> client-side usage on untrusted input (such as my OAuth patch) it would
> be more important.

Not sure to like much the fact that this advances token_terminator
first.  Wouldn't it be better to calculate pg_encoding_mblen() first,
then save token_terminator?  I feel a bit uneasy about saving a value
in token_terminator past the end of the string.  That's a nit in this
context, still..

> Attached is a draft patch that explicitly checks against the
> end-of-string pointer and clamps the token_terminator to it. Note that
> this removes the only caller of pg_encoding_mblen_bounded() and I'm
> not sure what we should do with that function. It seems like a
> reasonable API, just not here.

Hmm.  Keeping it around as currently designed means that it could
cause more harm than anything in the long term, even in the stable
branches if new code uses it.  There is a risk of seeing this new code
incorrectly using it again, even if its top comment tells that we rely
on the string being nul-terminated.  A safer alternative would be to
redesign it so that the length of the string is provided in input,
removing the dependency on strlen in its internals, perhaps.  Anyway,
without any callers of it, I'd be tempted to wipe it from HEAD and
call it a day.

> The new test needs to record two versions of the error message, one
> for invalid token and one for invalid escape sequence. This is
> because, for smaller chunk sizes, the partial-token logic in the
> incremental JSON parser skips the affected code entirely when it can't
> find an ending double-quote.

Ah, that makes sense.  That looks OK here.  A comment around the test
would be adapted to document that, I guess.

> Tangentially: Should we maybe rethink pieces of the json_lex_string
> error handling? For example, do we really want to echo an incomplete
> multibyte sequence once we know it's bad? It also looks like there are
> places where the FAIL_AT_CHAR_END macro is called after the `s`
> pointer has already advanced past the code point of interest. I'm not
> sure if that's intentional.

Advancing the tracking pointer 's' before reporting an error related
to the end of the string is a bad practice, IMO, and we should avoid
that.  json_lex_string() does not offer a warm feeling regarding that
with escape characters, at least :/
--
Michael
[PATCH] json_lex_string: don't overread on bad UTF8
Hi all,

When json_lex_string() hits certain types of invalid input, it calls
pg_encoding_mblen_bounded(), which assumes that its input is
null-terminated and calls strnlen(). But the JSON lexer is constructed
with an explicit string length, and we don't ensure that the string is
null-terminated in all cases, so we can walk off the end of the
buffer. This isn't really relevant on the server side, where you'd
have to get a superuser to help you break string encodings, but for
client-side usage on untrusted input (such as my OAuth patch) it would
be more important.

Attached is a draft patch that explicitly checks against the
end-of-string pointer and clamps the token_terminator to it. Note that
this removes the only caller of pg_encoding_mblen_bounded() and I'm
not sure what we should do with that function. It seems like a
reasonable API, just not here.

The new test needs to record two versions of the error message, one
for invalid token and one for invalid escape sequence. This is
because, for smaller chunk sizes, the partial-token logic in the
incremental JSON parser skips the affected code entirely when it can't
find an ending double-quote.

Tangentially: Should we maybe rethink pieces of the json_lex_string
error handling? For example, do we really want to echo an incomplete
multibyte sequence once we know it's bad? It also looks like there are
places where the FAIL_AT_CHAR_END macro is called after the `s`
pointer has already advanced past the code point of interest. I'm not
sure if that's intentional.

Thanks,
--Jacob

Attachment: 0001-json_lex_string-don-t-overread-on-bad-UTF8.patch