Re: [HACKERS] index-only quals vs. security_barrier views

2012-03-02 Thread Noah Misch
On Thu, Feb 09, 2012 at 12:02:29PM -0500, Robert Haas wrote:
 When Heikki worked up his original index-only scan patches (which
 didn't end up looking much like what eventually got committed), he had
 the notion of an index-only qual.  That is, given a query like this:
 
 select sum(1) from foo where substring(a,1,3) = 'abc';
 
 We could evaluate the substring qual before performing the heap fetch,
 and fetch the tuple from the heap only if the qual passes.

 Now, there's a fly in the ointment here, which is that applying
 arbitrary user-defined functions to tuples that might not be visible
 doesn't sound very safe.  The user-defined function in question might
 perform some action based on those invisible tuples that has side
 effects, which would be bad, because now we're violating MVCC
 semantics.  Or it might throw an error, killing the scan dead on the
 basis of the contents of some tuple that the scan shouldn't even see.
  However, there is certainly a class of functions for which this type
 of optimization would be safe, and it's an awful lot like the set of
 functions that can be safely pushed down through a security_barrier
 view - namely, things that don't have side effects or throw errors.
 So it's possible that the proleakproof flag KaiGai is proposing to add
 to pg_proc could do double duty, serving also to identify when it's
 safe to apply a qual to an index tuple when the corresponding heap
 tuple might be invisible.  However, I have some reservations about
 assuming that the two concepts are exactly the same.  For one thing,
 people are inevitably going to want to cheat a little bit more here
 than is appropriate for security views, and encouraging people to mark
 things LEAKPROOF when they're really not is a security disaster
 waiting to happen.

The similarity is indeed tempting, but I find the concepts sufficiently
distinct to not make one device serve both.  Adding to the reservations you
offer, LEAKPROOF is superuser-only.  This other marker would not entail any
special privilege.

 For another thing, there are some important cases
 that this doesn't cover, like:
 
 select * from foo where substring(a,1,3) like '%def%';
 
 The like operator doesn't seem to be leakproof in the security sense,
 because it can throw an error if the pattern is something like a
 single backslash (ERROR:  LIKE pattern must not end with escape
 character) and indeed it doesn't seem like it would be safe here
 either if the pattern were stored in the table.  But if the pattern
 were constant, it'd be OK, or almost OK: there's still the edge case
 where the table contains invisible rows but no visible ones - whether
 or not we complain about the pattern there ought to be the same as
 whether or not we complain about it on a completely empty table.  If
 we got to that point, then we might as well consider the qual
 leakproof for security purposes under the same set of circumstances
 we'd consider it OK to apply to possibly-invisible tuples.

This sort of thing implicates substring(), too, when you call it as
substring(a, 1, b); b  0 produces an error.

To handle these, I think we'd need a facility along the lines of protransform.
Have a function inspecting call nodes for a particular other function and
determining whether each is ok-for-index-only-quals.  You could even force
protransform itself into that role.  Create an additional pg_proc entry
identical to the ordinary substring() but for a different name and having the
ok-for-index-only-quals flag.  Add a protransform to the main pg_proc entry
that inspects the argument nodes and, when they're safe, replaces the call
with a call to that errorfree_substring() at plan time.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] index-only quals vs. security_barrier views

2012-02-13 Thread Robert Haas
On Sat, Feb 11, 2012 at 7:16 AM, Jesper Krogh jes...@krogh.cc wrote:
 Ok, but there are still cases where we don't even need to construct
 a data tuple at all:

 2012-02-11 13:14:01.579 jk=# explain select count(*) from testtable where
 fts @@ to_tsquery('english','test1');
                                QUERY PLAN
 ---
  Aggregate  (cost=31.24..31.25 rows=1 width=0)
   -  Bitmap Heap Scan on testtable  (cost=16.03..31.23 rows=4 width=0)
         Recheck Cond: (fts @@ '''test1'''::tsquery)
         -  Bitmap Index Scan on ftsid  (cost=0.00..16.03 rows=4 width=0)
               Index Cond: (fts @@ '''test1'''::tsquery)
 (5 rows)

In that case I believe you DO need the heap tuple.  That Recheck
Cond there means that the index might be lossy - i.e. return tuples
that don't really match the search condition.

 Another idea sprung into my head, that indices on (ctid,some mix of
 columns)
 could actually serve as some kind of vertical partitioning of the table.

The ctid of a tuple is its physical position in the table.  It makes
no sense to index that.  Since it's unique, it makes even less sense
to index that plus other things in the same index.

Does anyone have any comments on the issue raised in my original
email?  I would like to get (some version of) his patch committed, but
I would also like to not back ourselves into a corner.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] index-only quals vs. security_barrier views

2012-02-11 Thread Jesper Krogh

On 2012-02-09 22:17, Jesper Krogh wrote:

On 2012-02-09 21:09, Robert Haas wrote:

That doesn't make sense to me.  If you probe index A for rows where a
= 1 and find that CTID (100,1) is such a row, and now want to return a
column value b that is not present in that index, the fastest way to
get the row is going to be to fetch block 100 from the heap and return
the data out of the first tuple.  To get the value out of some other
index that does include column b would require scanning the entire
index looking for that CTID, just so you could then grab the
corresponding index tuple, which wouldn't make any sense at all.


You're right, in my head, everything it wired up against my primary
keys, of-course that isn't the case for the DB. Sorry for the noise.


Ok, but there are still cases where we don't even need to construct
a data tuple at all:

2012-02-11 13:14:01.579 jk=# explain select count(*) from testtable 
where fts @@ to_tsquery('english','test1');

QUERY PLAN
---
 Aggregate  (cost=31.24..31.25 rows=1 width=0)
   -  Bitmap Heap Scan on testtable  (cost=16.03..31.23 rows=4 width=0)
 Recheck Cond: (fts @@ '''test1'''::tsquery)
 -  Bitmap Index Scan on ftsid  (cost=0.00..16.03 rows=4 width=0)
   Index Cond: (fts @@ '''test1'''::tsquery)
(5 rows)


Another idea sprung into my head, that indices on (ctid,some mix of 
columns)

could actually serve as some kind of vertical partitioning of the table.

Wether it actually will me more efficient or not need to be tested.

Jesper

--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] index-only quals vs. security_barrier views

2012-02-09 Thread Robert Haas
When Heikki worked up his original index-only scan patches (which
didn't end up looking much like what eventually got committed), he had
the notion of an index-only qual.  That is, given a query like this:

select sum(1) from foo where substring(a,1,3) = 'abc';

We could evaluate the substring qual before performing the heap fetch,
and fetch the tuple from the heap only if the qual passes.  The
current code is capable of generating an index-only scan plan for this
query, but the MVCC visibility check always happens first: if the page
is all-visible, we go ahead and evaluate the qual using only the index
data, but if the page is not all-visible, we do the heap fetch first
and then check the qual only if the tuple is visible to our snapshot.
It would be nice to have the ability to do those checks in the other
order, in the case where the projected cost of checking the qual is
less than the projected cost of the heap fetch.  This would allow
index-only scans to win in more situations than they can right now,
because we'd conceivably avoid quite a bit of random I/O if the qual
is fairly selective and there are a decent number of visibility map
bits that are unset.

In fact, this technique might pay off even if we don't have a covering index:

select * from foo where substring(a,1,3) = 'abc';

If the expected selectivity of the qual is low and the index is a lot
smaller than the table, we might want to iterate through all the index
tuples in the entire index and fetch the heap tuples for only those
where the qual passes.  This would allow index-only scans - or
whatever term you want to use, since there's not much that's index
only about this case - to potentially win even when no
visibility-map bits are set at all.

Now, there's a fly in the ointment here, which is that applying
arbitrary user-defined functions to tuples that might not be visible
doesn't sound very safe.  The user-defined function in question might
perform some action based on those invisible tuples that has side
effects, which would be bad, because now we're violating MVCC
semantics.  Or it might throw an error, killing the scan dead on the
basis of the contents of some tuple that the scan shouldn't even see.
 However, there is certainly a class of functions for which this type
of optimization would be safe, and it's an awful lot like the set of
functions that can be safely pushed down through a security_barrier
view - namely, things that don't have side effects or throw errors.
So it's possible that the proleakproof flag KaiGai is proposing to add
to pg_proc could do double duty, serving also to identify when it's
safe to apply a qual to an index tuple when the corresponding heap
tuple might be invisible.  However, I have some reservations about
assuming that the two concepts are exactly the same.  For one thing,
people are inevitably going to want to cheat a little bit more here
than is appropriate for security views, and encouraging people to mark
things LEAKPROOF when they're really not is a security disaster
waiting to happen.  For another thing, there are some important cases
that this doesn't cover, like:

select * from foo where substring(a,1,3) like '%def%';

The like operator doesn't seem to be leakproof in the security sense,
because it can throw an error if the pattern is something like a
single backslash (ERROR:  LIKE pattern must not end with escape
character) and indeed it doesn't seem like it would be safe here
either if the pattern were stored in the table.  But if the pattern
were constant, it'd be OK, or almost OK: there's still the edge case
where the table contains invisible rows but no visible ones - whether
or not we complain about the pattern there ought to be the same as
whether or not we complain about it on a completely empty table.  If
we got to that point, then we might as well consider the qual
leakproof for security purposes under the same set of circumstances
we'd consider it OK to apply to possibly-invisible tuples.

I don't have any appetite for trying to do anything more with
index-only scans for 9.2, though maybe someone else will think
otherwise.  But I would like very much to get KaiGai's leakproof stuff
committed, and so it seems like a good idea to reconcile the needs of
that machinery with what might eventually be needed here.

Comments?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] index-only quals vs. security_barrier views

2012-02-09 Thread Jesper Krogh

On 2012-02-09 18:02, Robert Haas wrote:

I don't have any appetite for trying to do anything more with
index-only scans for 9.2, though maybe someone else will think
otherwise.  But I would like very much to get KaiGai's leakproof stuff
committed, and so it seems like a good idea to reconcile the needs of
that machinery with what might eventually be needed here.

Those were a couple of nice cases where index-only-scans
could win more than they does today. I have another one here:

2012-02-09 19:17:28.788 jk=# \d testtable
  Table public.testtable
 Column |   Type   |   Modifiers
+--+
 id | integer  | not null default nextval('testtable_id_seq'::regclass)
 fts| tsvector |
Indexes:
prk_idx UNIQUE, btree (id)
fts_id gin (fts)

2012-02-09 19:19:39.054 jk=# explain select id from testtable where fts 
@@ to_tsquery('english','test1000');

  QUERY PLAN
---
 Bitmap Heap Scan on testtable  (cost=20.29..161.28 rows=37 width=4)
   Recheck Cond: (fts @@ '''test1000'''::tsquery)
   -  Bitmap Index Scan on fts_id  (cost=0.00..20.28 rows=37 width=0)
 Index Cond: (fts @@ '''test1000'''::tsquery)
(4 rows)

Time: 0.494 ms
2012-02-09 19:19:52.748 jk=#

In this situation the tuple can be regenerated from the index, but
not from the index-satisfying the where clause, this allows significantly
more complex where-clauses and may also benefit situations where
we only going for one or more of the primary-key/foreing-key columns
for join-conditions.

Above situation does not need to involve a gin-index, but a btree index
where the where clause can be matched up using one index, and the tuple
constructed using another falls into the same category.


--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] index-only quals vs. security_barrier views

2012-02-09 Thread Robert Haas
On Thu, Feb 9, 2012 at 1:33 PM, Jesper Krogh jes...@krogh.cc wrote:
 On 2012-02-09 18:02, Robert Haas wrote:

 I don't have any appetite for trying to do anything more with
 index-only scans for 9.2, though maybe someone else will think
 otherwise.  But I would like very much to get KaiGai's leakproof stuff
 committed, and so it seems like a good idea to reconcile the needs of
 that machinery with what might eventually be needed here.

 Those were a couple of nice cases where index-only-scans
 could win more than they does today. I have another one here:

 2012-02-09 19:17:28.788 jk=# \d testtable
                          Table public.testtable
  Column |   Type   |                       Modifiers
 +--+
  id     | integer  | not null default nextval('testtable_id_seq'::regclass)
  fts    | tsvector |
 Indexes:
    prk_idx UNIQUE, btree (id)
    fts_id gin (fts)

 2012-02-09 19:19:39.054 jk=# explain select id from testtable where fts @@
 to_tsquery('english','test1000');
                              QUERY PLAN
 ---
  Bitmap Heap Scan on testtable  (cost=20.29..161.28 rows=37 width=4)
   Recheck Cond: (fts @@ '''test1000'''::tsquery)
   -  Bitmap Index Scan on fts_id  (cost=0.00..20.28 rows=37 width=0)
         Index Cond: (fts @@ '''test1000'''::tsquery)
 (4 rows)

 Time: 0.494 ms
 2012-02-09 19:19:52.748 jk=#

 In this situation the tuple can be regenerated from the index, but
 not from the index-satisfying the where clause, this allows significantly
 more complex where-clauses and may also benefit situations where
 we only going for one or more of the primary-key/foreing-key columns
 for join-conditions.

I don't understand what you're saying here.

 Above situation does not need to involve a gin-index, but a btree index
 where the where clause can be matched up using one index, and the tuple
 constructed using another falls into the same category.

That doesn't make sense to me.  If you probe index A for rows where a
= 1 and find that CTID (100,1) is such a row, and now want to return a
column value b that is not present in that index, the fastest way to
get the row is going to be to fetch block 100 from the heap and return
the data out of the first tuple.  To get the value out of some other
index that does include column b would require scanning the entire
index looking for that CTID, just so you could then grab the
corresponding index tuple, which wouldn't make any sense at all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] index-only quals vs. security_barrier views

2012-02-09 Thread Jesper Krogh

On 2012-02-09 21:09, Robert Haas wrote:

That doesn't make sense to me.  If you probe index A for rows where a
= 1 and find that CTID (100,1) is such a row, and now want to return a
column value b that is not present in that index, the fastest way to
get the row is going to be to fetch block 100 from the heap and return
the data out of the first tuple.  To get the value out of some other
index that does include column b would require scanning the entire
index looking for that CTID, just so you could then grab the
corresponding index tuple, which wouldn't make any sense at all.


You're right, in my head, everything it wired up against my primary
keys, of-course that isn't the case for the DB. Sorry for the noise.

--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers