Re: Leakproofness of texteq()/textne()
Peter Eisentraut writes: > On 2019-09-16 06:24, Tom Lane wrote: >> So it seems that the consensus is that it's okay to mark these >> functions leakproof, because if any of the errors they throw >> are truly reachable for other than data-corruption reasons, >> we would wish to try to prevent such errors. (Maybe through >> upstream validity checks? Hard to say how we'd do it exactly, >> when we don't have an idea what the problem is.) > Yeah, it seems like as we expand our Unicode capabilities, we will see > more cases like "it could fail here in theory, but it shouldn't happen > for normal data", and the answer can't be to call all that untrusted or > leaky. It's the job of the database software to sort that out. > Obviously, it will require careful evaluation in each case. Here's a proposed patch to mark functions that depend on varstr_cmp as leakproof. I think we can apply this to HEAD and then close the open item as "won't fix for v12". regards, tom lane diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c index d36156f..8c18294 100644 --- a/src/backend/utils/adt/varlena.c +++ b/src/backend/utils/adt/varlena.c @@ -1463,6 +1463,12 @@ check_collation_set(Oid collid) * to allow null-termination for inputs to strcoll(). * Returns an integer less than, equal to, or greater than zero, indicating * whether arg1 is less than, equal to, or greater than arg2. + * + * Note: many functions that depend on this are marked leakproof; therefore, + * avoid reporting the actual contents of the input when throwing errors. + * All errors herein should be things that can't happen except on corrupt + * data, anyway; otherwise we will have trouble with indexing strings that + * would cause them. */ int varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid) diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index e6645f1..5e69dee 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -681,44 +681,44 @@ proname => 'nameeqtext', proleakproof => 't', prorettype => 'bool', proargtypes => 'name text', prosrc => 'nameeqtext' }, { oid => '241', - proname => 'namelttext', prorettype => 'bool', proargtypes => 'name text', - prosrc => 'namelttext' }, + proname => 'namelttext', proleakproof => 't', prorettype => 'bool', + proargtypes => 'name text', prosrc => 'namelttext' }, { oid => '242', - proname => 'nameletext', prorettype => 'bool', proargtypes => 'name text', - prosrc => 'nameletext' }, + proname => 'nameletext', proleakproof => 't', prorettype => 'bool', + proargtypes => 'name text', prosrc => 'nameletext' }, { oid => '243', - proname => 'namegetext', prorettype => 'bool', proargtypes => 'name text', - prosrc => 'namegetext' }, + proname => 'namegetext', proleakproof => 't', prorettype => 'bool', + proargtypes => 'name text', prosrc => 'namegetext' }, { oid => '244', - proname => 'namegttext', prorettype => 'bool', proargtypes => 'name text', - prosrc => 'namegttext' }, + proname => 'namegttext', proleakproof => 't', prorettype => 'bool', + proargtypes => 'name text', prosrc => 'namegttext' }, { oid => '245', proname => 'namenetext', proleakproof => 't', prorettype => 'bool', proargtypes => 'name text', prosrc => 'namenetext' }, { oid => '246', descr => 'less-equal-greater', - proname => 'btnametextcmp', prorettype => 'int4', proargtypes => 'name text', - prosrc => 'btnametextcmp' }, + proname => 'btnametextcmp', proleakproof => 't', prorettype => 'int4', + proargtypes => 'name text', prosrc => 'btnametextcmp' }, { oid => '247', proname => 'texteqname', proleakproof => 't', prorettype => 'bool', proargtypes => 'text name', prosrc => 'texteqname' }, { oid => '248', - proname => 'textltname', prorettype => 'bool', proargtypes => 'text name', - prosrc => 'textltname' }, + proname => 'textltname', proleakproof => 't', prorettype => 'bool', + proargtypes => 'text name', prosrc => 'textltname' }, { oid => '249', - proname => 'textlename', prorettype => 'bool', proargtypes => 'text name', - prosrc => 'textlename' }, + proname => 'textlename', proleakproof => 't', prorettype => 'bool', + proargtypes => 'text name', prosrc => 'textlename' }, { oid => '250', - proname => 'textgename', prorettype => 'bool', proargtypes => 'text name', - prosrc => 'textgename' }, + proname => 'textgename', proleakproof => 't', prorettype => 'bool', + proargtypes => 'text name', prosrc => 'textgename' }, { oid => '251', - proname => 'textgtname', prorettype => 'bool', proargtypes => 'text name', - prosrc => 'textgtname' }, + proname => 'textgtname', proleakproof => 't', prorettype => 'bool', + proargtypes => 'text name', prosrc => 'textgtname' }, { oid => '252', proname => 'textnename', proleakproof => 't', prorettype => 'bool', proargtypes => 'text name', prosrc => 'textnename' }, { oid => '253', descr => 'less-equal-greater', - proname
Re: Leakproofness of texteq()/textne()
On 2019-09-16 06:24, Tom Lane wrote: > So it seems that the consensus is that it's okay to mark these > functions leakproof, because if any of the errors they throw > are truly reachable for other than data-corruption reasons, > we would wish to try to prevent such errors. (Maybe through > upstream validity checks? Hard to say how we'd do it exactly, > when we don't have an idea what the problem is.) Yeah, it seems like as we expand our Unicode capabilities, we will see more cases like "it could fail here in theory, but it shouldn't happen for normal data", and the answer can't be to call all that untrusted or leaky. It's the job of the database software to sort that out. Obviously, it will require careful evaluation in each case. > My inclination is to do the proleakproof changes in HEAD, but > not touch v12. The inconsistency in leakproof markings in v12 > is annoying but it's not a regression or security hazard, so > I'm thinking it's not worth a late catversion bump to fix. Sounds good, unless we do another catversion bump. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Leakproofness of texteq()/textne()
Stephen Frost writes: > * Robert Haas (robertmh...@gmail.com) wrote: >> On Thu, Sep 12, 2019 at 5:19 PM Tom Lane wrote: >>> However, if there is some character C that makes ICU misbehave like >>> that, we are going to have problems with indexing strings containing C, >>> whether we think varstr_cmp is leaky or not. So I'm not sure that >>> focusing our attention on leakiness is a helpful way to proceed. >> That seems like a compelling argument to me. > Agreed. So it seems that the consensus is that it's okay to mark these functions leakproof, because if any of the errors they throw are truly reachable for other than data-corruption reasons, we would wish to try to prevent such errors. (Maybe through upstream validity checks? Hard to say how we'd do it exactly, when we don't have an idea what the problem is.) My inclination is to do the proleakproof changes in HEAD, but not touch v12. The inconsistency in leakproof markings in v12 is annoying but it's not a regression or security hazard, so I'm thinking it's not worth a late catversion bump to fix. Thoughts? regards, tom lane
Re: Leakproofness of texteq()/textne()
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Thu, Sep 12, 2019 at 5:19 PM Tom Lane wrote: > > However, if there is some character C that makes ICU misbehave like > > that, we are going to have problems with indexing strings containing C, > > whether we think varstr_cmp is leaky or not. So I'm not sure that > > focusing our attention on leakiness is a helpful way to proceed. > > That seems like a compelling argument to me. Agreed. Thanks, Stephen signature.asc Description: PGP signature
Re: Leakproofness of texteq()/textne()
On Thu, Sep 12, 2019 at 5:19 PM Tom Lane wrote: > However, if there is some character C that makes ICU misbehave like > that, we are going to have problems with indexing strings containing C, > whether we think varstr_cmp is leaky or not. So I'm not sure that > focusing our attention on leakiness is a helpful way to proceed. That seems like a compelling argument to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Leakproofness of texteq()/textne()
Robert Haas writes: > I wouldn't feel comfortable with ignoring information leaks that can > happen with some valid strings but not others. That sounds like > exactly the sort of information leak that we must prevent. The user > can write arbitrary stuff in their query, potentially transforming > strings so that the result hits the ERROR iff the original string had > some arbitrary property P for which they wish to test. Agreed. > However, I wonder if there's any realistic case outside of an encoding > conversion where such failures can occur. I would expect, perhaps > naively, that the set of characters that can be represented by UTF-16 > is the same set as can be represented by UTF-8. Unless Microsoft did something weird, that doesn't seem very likely. I'm more worried about the possibility that ICU contains weird exception cases that will make it fail to compare specific strings. Noting that ucnv_toUChars has an error indicator but ucol_strcoll does not, it seems like again any such cases are going to boil down to encoding conversion problems. However, if there is some character C that makes ICU misbehave like that, we are going to have problems with indexing strings containing C, whether we think varstr_cmp is leaky or not. So I'm not sure that focusing our attention on leakiness is a helpful way to proceed. regards, tom lane
Re: Leakproofness of texteq()/textne()
On Thu, Sep 12, 2019 at 1:38 PM Tom Lane wrote: > In any case, from a purely theoretical viewpoint, such an error message > *does* constitute a leak of information about the input strings. Whether > it's a usable leak is very debatable, but that's basically what we've > got to decide. I'm pretty content to ignore information leaks that can only happen if the database is corrupt anyway. If that's moving the goalposts at all, it's about a quarter-inch. I mean, a slightly differently corrupted varlena would could crash the database entirely. I wouldn't feel comfortable with ignoring information leaks that can happen with some valid strings but not others. That sounds like exactly the sort of information leak that we must prevent. The user can write arbitrary stuff in their query, potentially transforming strings so that the result hits the ERROR iff the original string had some arbitrary property P for which they wish to test. Allowing that sounds no different than deciding that int4div is leakproof, which it sure isn't. However, I wonder if there's any realistic case outside of an encoding conversion where such failures can occur. I would expect, perhaps naively, that the set of characters that can be represented by UTF-16 is the same set as can be represented by UTF-8. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Leakproofness of texteq()/textne()
I wrote: > I agree with your point that this is a shouldn't-happen corner case. > The question boils down to, if it *does* happen, does that constitute > a meaningful information leak? Up to now we've taken quite a hard > line about what leakproofness means, so deciding that varstr_cmp > is leakproof would constitute moving the goalposts a bit. They'd > still be in the same stadium, though, IMO. For most of us it might be more meaningful to look at the non-Windows code paths, for which the question reduces to what we think of this: UErrorCodestatus; status = U_ZERO_ERROR; result = ucol_strcollUTF8(mylocale->info.icu.ucol, arg1, len1, arg2, len2, ); if (U_FAILURE(status)) ereport(ERROR, (errmsg("collation failed: %s", u_errorName(status; which, as it happens, is also a UTF8-encoding-only code path. Can this throw an error in practice, and if so does that constitute a meaningful information leak? (For bonus points: is this error report up to project standards?) Thumbing through the list of UErrorCode values, it seems like the only ones that are applicable here and aren't internal-error cases are U_INVALID_CHAR_FOUND and the like, so that this boils down to "one of the strings contains a character that ICU can't cope with". Maybe that's impossible except with a pre-existing encoding violation, or maybe not. In any case, from a purely theoretical viewpoint, such an error message *does* constitute a leak of information about the input strings. Whether it's a usable leak is very debatable, but that's basically what we've got to decide. regards, tom lane
Re: Leakproofness of texteq()/textne()
Robert Haas writes: > On Thu, Sep 12, 2019 at 12:19 PM Tom Lane wrote: >> After burrowing down further, it's visibly the case that >> text_cmp and varstr_cmp don't leak in the sense of actually >> reporting any part of their input strings. What they do do, >> in some code paths, is things like >> ereport(ERROR, >> (errmsg("could not convert string to UTF-16: error code %lu", >> GetLastError(; > Is this possible? I mean, I'm sure it could happen if the data's > corrupted, but we ought to have validated it on the way into the > database. But maybe this code path also gets used for non-Unicode > encodings? Nope, the above is inside #ifdef WIN32 /* Win32 does not have UTF-8, so we need to map to UTF-16 */ if (GetDatabaseEncoding() == PG_UTF8 && (!mylocale || mylocale->provider == COLLPROVIDER_LIBC)) I agree with your point that this is a shouldn't-happen corner case. The question boils down to, if it *does* happen, does that constitute a meaningful information leak? Up to now we've taken quite a hard line about what leakproofness means, so deciding that varstr_cmp is leakproof would constitute moving the goalposts a bit. They'd still be in the same stadium, though, IMO. Another approach would be to try to remove these failure cases, but I don't really see how we'd do that. regards, tom lane
Re: Leakproofness of texteq()/textne()
On Thu, Sep 12, 2019 at 12:19 PM Tom Lane wrote: > After burrowing down further, it's visibly the case that > text_cmp and varstr_cmp don't leak in the sense of actually > reporting any part of their input strings. What they do do, > in some code paths, is things like > > ereport(ERROR, > (errmsg("could not convert string to UTF-16: error code %lu", > GetLastError(; Is this possible? I mean, I'm sure it could happen if the data's corrupted, but we ought to have validated it on the way into the database. But maybe this code path also gets used for non-Unicode encodings? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Leakproofness of texteq()/textne()
I wrote: > Seems like we have two choices: > 1. Remove the leakproof marking on texteq()/textne(). This > would, unfortunately, be catastrophic for performance in > a lot of scenarios where leakproofness matters. > 2. Fix text_cmp to actually be leakproof, whereupon we might > as well mark all the remaining btree comparison operators > leakproof. > I think #2 is pretty obviously the superior answer, if we > can do it. After burrowing down further, it's visibly the case that text_cmp and varstr_cmp don't leak in the sense of actually reporting any part of their input strings. What they do do, in some code paths, is things like ereport(ERROR, (errmsg("could not convert string to UTF-16: error code %lu", GetLastError(; So this seems to mostly be a question of interpretation: does an error like this represent enough of an information leak to violate the promises of leakproofness? I'm not sure that this error is really capable of disclosing any information about the input strings. (Note that this error occurs only if we failed to convert UTF8 to UTF16, which probably could only happen if the input isn't valid UTF8, which would mean a failure of encoding checking somewhere up the line.) There are also various pallocs and such that could fail, which perhaps could be argued to disclose the lengths of the input strings. But that hazard exists already inside PG_GETARG_TEXT_PP; if you want to complain about that, then functions like byteaeq() aren't legitimately leakproof either. On the whole I'm prepared to say that these aren't leakproofness violations, but it would depend a lot on how strict you want to be. regards, tom lane
Leakproofness of texteq()/textne()
Currently, texteq() and textne() are marked leakproof, while sibling operations such as textlt() are not. The argument for that was that those two functions depend only on memcmp() so they can be seen to be safe, whereas it's a whole lot less clear that text_cmp() should be considered leakproof. Of course, the addition of nondeterministic collations has utterly obliterated that argument: it's now possible for texteq() to call text_cmp(), so if the latter is leaky then the former certainly must be considered so as well. Seems like we have two choices: 1. Remove the leakproof marking on texteq()/textne(). This would, unfortunately, be catastrophic for performance in a lot of scenarios where leakproofness matters. 2. Fix text_cmp to actually be leakproof, whereupon we might as well mark all the remaining btree comparison operators leakproof. I think #2 is pretty obviously the superior answer, if we can do it. ISTM we can't ship v12 without dealing with this one way or the other, so I'll go add an open item. Comments? regards, tom lane