Re: Leakproofness of texteq()/textne()

2019-09-17 Thread Tom Lane
Peter Eisentraut  writes:
> On 2019-09-16 06:24, Tom Lane wrote:
>> So it seems that the consensus is that it's okay to mark these
>> functions leakproof, because if any of the errors they throw
>> are truly reachable for other than data-corruption reasons,
>> we would wish to try to prevent such errors.  (Maybe through
>> upstream validity checks?  Hard to say how we'd do it exactly,
>> when we don't have an idea what the problem is.)

> Yeah, it seems like as we expand our Unicode capabilities, we will see
> more cases like "it could fail here in theory, but it shouldn't happen
> for normal data", and the answer can't be to call all that untrusted or
> leaky.  It's the job of the database software to sort that out.
> Obviously, it will require careful evaluation in each case.

Here's a proposed patch to mark functions that depend on varstr_cmp
as leakproof.  I think we can apply this to HEAD and then close the
open item as "won't fix for v12".

regards, tom lane

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index d36156f..8c18294 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1463,6 +1463,12 @@ check_collation_set(Oid collid)
  *	to allow null-termination for inputs to strcoll().
  * Returns an integer less than, equal to, or greater than zero, indicating
  * whether arg1 is less than, equal to, or greater than arg2.
+ *
+ * Note: many functions that depend on this are marked leakproof; therefore,
+ * avoid reporting the actual contents of the input when throwing errors.
+ * All errors herein should be things that can't happen except on corrupt
+ * data, anyway; otherwise we will have trouble with indexing strings that
+ * would cause them.
  */
 int
 varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index e6645f1..5e69dee 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -681,44 +681,44 @@
   proname => 'nameeqtext', proleakproof => 't', prorettype => 'bool',
   proargtypes => 'name text', prosrc => 'nameeqtext' },
 { oid => '241',
-  proname => 'namelttext', prorettype => 'bool', proargtypes => 'name text',
-  prosrc => 'namelttext' },
+  proname => 'namelttext', proleakproof => 't', prorettype => 'bool',
+  proargtypes => 'name text', prosrc => 'namelttext' },
 { oid => '242',
-  proname => 'nameletext', prorettype => 'bool', proargtypes => 'name text',
-  prosrc => 'nameletext' },
+  proname => 'nameletext', proleakproof => 't', prorettype => 'bool',
+  proargtypes => 'name text', prosrc => 'nameletext' },
 { oid => '243',
-  proname => 'namegetext', prorettype => 'bool', proargtypes => 'name text',
-  prosrc => 'namegetext' },
+  proname => 'namegetext', proleakproof => 't', prorettype => 'bool',
+  proargtypes => 'name text', prosrc => 'namegetext' },
 { oid => '244',
-  proname => 'namegttext', prorettype => 'bool', proargtypes => 'name text',
-  prosrc => 'namegttext' },
+  proname => 'namegttext', proleakproof => 't', prorettype => 'bool',
+  proargtypes => 'name text', prosrc => 'namegttext' },
 { oid => '245',
   proname => 'namenetext', proleakproof => 't', prorettype => 'bool',
   proargtypes => 'name text', prosrc => 'namenetext' },
 { oid => '246', descr => 'less-equal-greater',
-  proname => 'btnametextcmp', prorettype => 'int4', proargtypes => 'name text',
-  prosrc => 'btnametextcmp' },
+  proname => 'btnametextcmp', proleakproof => 't', prorettype => 'int4',
+  proargtypes => 'name text', prosrc => 'btnametextcmp' },
 { oid => '247',
   proname => 'texteqname', proleakproof => 't', prorettype => 'bool',
   proargtypes => 'text name', prosrc => 'texteqname' },
 { oid => '248',
-  proname => 'textltname', prorettype => 'bool', proargtypes => 'text name',
-  prosrc => 'textltname' },
+  proname => 'textltname', proleakproof => 't', prorettype => 'bool',
+  proargtypes => 'text name', prosrc => 'textltname' },
 { oid => '249',
-  proname => 'textlename', prorettype => 'bool', proargtypes => 'text name',
-  prosrc => 'textlename' },
+  proname => 'textlename', proleakproof => 't', prorettype => 'bool',
+  proargtypes => 'text name', prosrc => 'textlename' },
 { oid => '250',
-  proname => 'textgename', prorettype => 'bool', proargtypes => 'text name',
-  prosrc => 'textgename' },
+  proname => 'textgename', proleakproof => 't', prorettype => 'bool',
+  proargtypes => 'text name', prosrc => 'textgename' },
 { oid => '251',
-  proname => 'textgtname', prorettype => 'bool', proargtypes => 'text name',
-  prosrc => 'textgtname' },
+  proname => 'textgtname', proleakproof => 't', prorettype => 'bool',
+  proargtypes => 'text name', prosrc => 'textgtname' },
 { oid => '252',
   proname => 'textnename', proleakproof => 't', prorettype => 'bool',
   proargtypes => 'text name', prosrc => 'textnename' },
 { oid => '253', descr => 'less-equal-greater',
-  proname 

Re: Leakproofness of texteq()/textne()

2019-09-17 Thread Peter Eisentraut
On 2019-09-16 06:24, Tom Lane wrote:
> So it seems that the consensus is that it's okay to mark these
> functions leakproof, because if any of the errors they throw
> are truly reachable for other than data-corruption reasons,
> we would wish to try to prevent such errors.  (Maybe through
> upstream validity checks?  Hard to say how we'd do it exactly,
> when we don't have an idea what the problem is.)

Yeah, it seems like as we expand our Unicode capabilities, we will see
more cases like "it could fail here in theory, but it shouldn't happen
for normal data", and the answer can't be to call all that untrusted or
leaky.  It's the job of the database software to sort that out.
Obviously, it will require careful evaluation in each case.

> My inclination is to do the proleakproof changes in HEAD, but
> not touch v12.  The inconsistency in leakproof markings in v12
> is annoying but it's not a regression or security hazard, so
> I'm thinking it's not worth a late catversion bump to fix.

Sounds good, unless we do another catversion bump.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Leakproofness of texteq()/textne()

2019-09-15 Thread Tom Lane
Stephen Frost  writes:
> * Robert Haas (robertmh...@gmail.com) wrote:
>> On Thu, Sep 12, 2019 at 5:19 PM Tom Lane  wrote:
>>> However, if there is some character C that makes ICU misbehave like
>>> that, we are going to have problems with indexing strings containing C,
>>> whether we think varstr_cmp is leaky or not.  So I'm not sure that
>>> focusing our attention on leakiness is a helpful way to proceed.

>> That seems like a compelling argument to me.

> Agreed.

So it seems that the consensus is that it's okay to mark these
functions leakproof, because if any of the errors they throw
are truly reachable for other than data-corruption reasons,
we would wish to try to prevent such errors.  (Maybe through
upstream validity checks?  Hard to say how we'd do it exactly,
when we don't have an idea what the problem is.)

My inclination is to do the proleakproof changes in HEAD, but
not touch v12.  The inconsistency in leakproof markings in v12
is annoying but it's not a regression or security hazard, so
I'm thinking it's not worth a late catversion bump to fix.

Thoughts?

regards, tom lane




Re: Leakproofness of texteq()/textne()

2019-09-15 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Thu, Sep 12, 2019 at 5:19 PM Tom Lane  wrote:
> > However, if there is some character C that makes ICU misbehave like
> > that, we are going to have problems with indexing strings containing C,
> > whether we think varstr_cmp is leaky or not.  So I'm not sure that
> > focusing our attention on leakiness is a helpful way to proceed.
> 
> That seems like a compelling argument to me.

Agreed.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Leakproofness of texteq()/textne()

2019-09-13 Thread Robert Haas
On Thu, Sep 12, 2019 at 5:19 PM Tom Lane  wrote:
> However, if there is some character C that makes ICU misbehave like
> that, we are going to have problems with indexing strings containing C,
> whether we think varstr_cmp is leaky or not.  So I'm not sure that
> focusing our attention on leakiness is a helpful way to proceed.

That seems like a compelling argument to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Leakproofness of texteq()/textne()

2019-09-12 Thread Tom Lane
Robert Haas  writes:
> I wouldn't feel comfortable with ignoring information leaks that can
> happen with some valid strings but not others. That sounds like
> exactly the sort of information leak that we must prevent. The user
> can write arbitrary stuff in their query, potentially transforming
> strings so that the result hits the ERROR iff the original string had
> some arbitrary property P for which they wish to test.

Agreed.

> However, I wonder if there's any realistic case outside of an encoding
> conversion where such failures can occur. I would expect, perhaps
> naively, that the set of characters that can be represented by UTF-16
> is the same set as can be represented by UTF-8.

Unless Microsoft did something weird, that doesn't seem very likely.
I'm more worried about the possibility that ICU contains weird exception
cases that will make it fail to compare specific strings.  Noting
that ucnv_toUChars has an error indicator but ucol_strcoll does not,
it seems like again any such cases are going to boil down to
encoding conversion problems.

However, if there is some character C that makes ICU misbehave like
that, we are going to have problems with indexing strings containing C,
whether we think varstr_cmp is leaky or not.  So I'm not sure that
focusing our attention on leakiness is a helpful way to proceed.

regards, tom lane




Re: Leakproofness of texteq()/textne()

2019-09-12 Thread Robert Haas
On Thu, Sep 12, 2019 at 1:38 PM Tom Lane  wrote:
> In any case, from a purely theoretical viewpoint, such an error message
> *does* constitute a leak of information about the input strings.  Whether
> it's a usable leak is very debatable, but that's basically what we've
> got to decide.

I'm pretty content to ignore information leaks that can only happen if
the database is corrupt anyway.  If that's moving the goalposts at
all, it's about a quarter-inch.  I mean, a slightly differently
corrupted varlena would could crash the database entirely.

I wouldn't feel comfortable with ignoring information leaks that can
happen with some valid strings but not others. That sounds like
exactly the sort of information leak that we must prevent. The user
can write arbitrary stuff in their query, potentially transforming
strings so that the result hits the ERROR iff the original string had
some arbitrary property P for which they wish to test. Allowing that
sounds no different than deciding that int4div is leakproof, which it
sure isn't.

However, I wonder if there's any realistic case outside of an encoding
conversion where such failures can occur. I would expect, perhaps
naively, that the set of characters that can be represented by UTF-16
is the same set as can be represented by UTF-8.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Leakproofness of texteq()/textne()

2019-09-12 Thread Tom Lane
I wrote:
> I agree with your point that this is a shouldn't-happen corner case.
> The question boils down to, if it *does* happen, does that constitute
> a meaningful information leak?  Up to now we've taken quite a hard
> line about what leakproofness means, so deciding that varstr_cmp
> is leakproof would constitute moving the goalposts a bit.  They'd
> still be in the same stadium, though, IMO.

For most of us it might be more meaningful to look at the non-Windows
code paths, for which the question reduces to what we think of this:

UErrorCodestatus;

status = U_ZERO_ERROR;
result = ucol_strcollUTF8(mylocale->info.icu.ucol,
  arg1, len1,
  arg2, len2,
  );
if (U_FAILURE(status))
ereport(ERROR,
(errmsg("collation failed: %s", 
u_errorName(status;

which, as it happens, is also a UTF8-encoding-only code path.
Can this throw an error in practice, and if so does that
constitute a meaningful information leak?  (For bonus points:
is this error report up to project standards?)

Thumbing through the list of UErrorCode values, it seems like the only
ones that are applicable here and aren't internal-error cases are
U_INVALID_CHAR_FOUND and the like, so that this boils down to "one of
the strings contains a character that ICU can't cope with".  Maybe that's
impossible except with a pre-existing encoding violation, or maybe not.

In any case, from a purely theoretical viewpoint, such an error message
*does* constitute a leak of information about the input strings.  Whether
it's a usable leak is very debatable, but that's basically what we've
got to decide.

regards, tom lane




Re: Leakproofness of texteq()/textne()

2019-09-12 Thread Tom Lane
Robert Haas  writes:
> On Thu, Sep 12, 2019 at 12:19 PM Tom Lane  wrote:
>> After burrowing down further, it's visibly the case that
>> text_cmp and varstr_cmp don't leak in the sense of actually
>> reporting any part of their input strings.  What they do do,
>> in some code paths, is things like
>> ereport(ERROR,
>> (errmsg("could not convert string to UTF-16: error code %lu",
>> GetLastError(;

> Is this possible? I mean, I'm sure it could happen if the data's
> corrupted, but we ought to have validated it on the way into the
> database. But maybe this code path also gets used for non-Unicode
> encodings?

Nope, the above is inside 

#ifdef WIN32
/* Win32 does not have UTF-8, so we need to map to UTF-16 */
if (GetDatabaseEncoding() == PG_UTF8
&& (!mylocale || mylocale->provider == COLLPROVIDER_LIBC))

I agree with your point that this is a shouldn't-happen corner case.
The question boils down to, if it *does* happen, does that constitute
a meaningful information leak?  Up to now we've taken quite a hard
line about what leakproofness means, so deciding that varstr_cmp
is leakproof would constitute moving the goalposts a bit.  They'd
still be in the same stadium, though, IMO.

Another approach would be to try to remove these failure cases,
but I don't really see how we'd do that.

regards, tom lane




Re: Leakproofness of texteq()/textne()

2019-09-12 Thread Robert Haas
On Thu, Sep 12, 2019 at 12:19 PM Tom Lane  wrote:
> After burrowing down further, it's visibly the case that
> text_cmp and varstr_cmp don't leak in the sense of actually
> reporting any part of their input strings.  What they do do,
> in some code paths, is things like
>
> ereport(ERROR,
> (errmsg("could not convert string to UTF-16: error code %lu",
> GetLastError(;

Is this possible? I mean, I'm sure it could happen if the data's
corrupted, but we ought to have validated it on the way into the
database. But maybe this code path also gets used for non-Unicode
encodings?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Leakproofness of texteq()/textne()

2019-09-12 Thread Tom Lane
I wrote:
> Seems like we have two choices:
> 1. Remove the leakproof marking on texteq()/textne().  This
> would, unfortunately, be catastrophic for performance in
> a lot of scenarios where leakproofness matters.
> 2. Fix text_cmp to actually be leakproof, whereupon we might
> as well mark all the remaining btree comparison operators
> leakproof.
> I think #2 is pretty obviously the superior answer, if we
> can do it.

After burrowing down further, it's visibly the case that
text_cmp and varstr_cmp don't leak in the sense of actually
reporting any part of their input strings.  What they do do,
in some code paths, is things like

ereport(ERROR,
(errmsg("could not convert string to UTF-16: error code %lu",
GetLastError(;

So this seems to mostly be a question of interpretation:
does an error like this represent enough of an information
leak to violate the promises of leakproofness?  I'm not sure
that this error is really capable of disclosing any information
about the input strings.  (Note that this error occurs only if
we failed to convert UTF8 to UTF16, which probably could only
happen if the input isn't valid UTF8, which would mean a failure
of encoding checking somewhere up the line.)

There are also various pallocs and such that could fail, which
perhaps could be argued to disclose the lengths of the input
strings.  But that hazard exists already inside PG_GETARG_TEXT_PP;
if you want to complain about that, then functions like byteaeq()
aren't legitimately leakproof either.

On the whole I'm prepared to say that these aren't leakproofness
violations, but it would depend a lot on how strict you want to be.

regards, tom lane




Leakproofness of texteq()/textne()

2019-09-12 Thread Tom Lane
Currently, texteq() and textne() are marked leakproof, while
sibling operations such as textlt() are not.  The argument
for that was that those two functions depend only on memcmp()
so they can be seen to be safe, whereas it's a whole lot less
clear that text_cmp() should be considered leakproof.

Of course, the addition of nondeterministic collations has
utterly obliterated that argument: it's now possible for
texteq() to call text_cmp(), so if the latter is leaky then
the former certainly must be considered so as well.

Seems like we have two choices:

1. Remove the leakproof marking on texteq()/textne().  This
would, unfortunately, be catastrophic for performance in
a lot of scenarios where leakproofness matters.

2. Fix text_cmp to actually be leakproof, whereupon we might
as well mark all the remaining btree comparison operators
leakproof.

I think #2 is pretty obviously the superior answer, if we
can do it.

ISTM we can't ship v12 without dealing with this one way
or the other, so I'll go add an open item.

Comments?

regards, tom lane