Re: [HACKERS] [RFC] Minmax indexes

2013-09-01 Thread Noah Misch
On Fri, Jun 14, 2013 at 06:28:06PM -0400, Alvaro Herrera wrote:
> Partial indexes are not supported; since an index is concerned with minimum
> and maximum values of the involved columns across all the pages in the
> table, it doesn't make sense to exclude values.

It can make sense if the predicate references a column other than the indexed
column(s).  Unlike a partial btree index, a partial minmax index would be no
smaller.  It could have narrower min-max spreads, reducing the recheck work
done by queries.

> Another way to see "partial" indexes here would be those that only
> considered some pages in the table instead of all of them; but this would
> be difficult to implement and manage and, most likely, pointless.

That's a distinct feature from the AM-independent partial index mechanism, in
any event.

> Expressional indexes can probably be supported in the future, but we disallow
> them initially for conceptual simplicity.

Whether an index column uses an expression is irrelevant to each existing core
AM.  How does minmax differ in this respect?

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] Minmax indexes

2013-07-19 Thread Josh Berkus
On 07/18/2013 10:39 PM, Alvaro Herrera wrote:
> To scan the index, we first obtain the TID of index tuple for page 0.  If
> this returns a valid TID, we read that tuple to determine the min/max bounds
> for this page range.  If it returns invalid, then the range is unsummarized,
> and the scan must return the whole range as needing scan.  After this
> index entry has been processed, we obtain the TID of the index tuple for
> page 0+pagesPerRange (currently this is a compile-time constant, but
> there's no reason this cannot be a per-index property).  Continue adding
> pagesPerRange until we reach the end of the heap.

Conceptually, this sounds like a good initial solution to the update
problem.

I still think we could do incremental updates to the minmax indexes per
the idea I discussed, but that could be a later version.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] [RFC] Minmax indexes

2013-07-18 Thread Alvaro Herrera
Alvaro Herrera wrote:

> This is a preliminary proposal for Minmax indexes.  I'm experimenting
> with the code, but it's too crude to post yet, so here's a document
> explaining what they are and how they work, so that reviewers can poke
> holes to have the design improved.

After going further with the implementation, I have added a new concept,
the reverse range map.

Reverse Range Map
-----------------

The reverse range map is a separate fork each Minmax index has.  Its purpose
is to provide a way to quickly find the index tuple corresponding to a page
range: for a given heap page number, there's an O(1) way to obtain the TID of
the corresponding index tuple.
 
To scan the index, we first obtain the TID of index tuple for page 0.  If
this returns a valid TID, we read that tuple to determine the min/max bounds
for this page range.  If it returns invalid, then the range is unsummarized,
and the scan must return the whole range as needing scan.  After this
index entry has been processed, we obtain the TID of the index tuple for
page 0+pagesPerRange (currently this is a compile-time constant, but
there's no reason this cannot be a per-index property).  Continue adding
pagesPerRange until we reach the end of the heap.
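The scan loop above can be sketched in a few lines. This is purely illustrative (the dict-based revmap, the names, and the range size are assumptions for the sketch, not the patch's actual data structures):

```python
# Sketch of the revmap scan loop described above.  The dict-based revmap,
# the names, and the range size are illustrative, not patch code.

PAGES_PER_RANGE = 128      # a compile-time constant in the prototype

def revmap_scan(revmap, heap_npages):
    """Yield (first_page_of_range, index_tuple_tid) pairs; a None TID
    stands for InvalidTid: the range is unsummarized and must be
    returned in whole as needing scan."""
    page = 0
    while page < heap_npages:
        # O(1): the range number indexes directly into the revmap
        tid = revmap.get(page // PAGES_PER_RANGE)
        yield page, tid
        page += PAGES_PER_RANGE

# A 300-page heap: ranges start at pages 0, 128 and 256; range 1 is
# unsummarized (its revmap entry is InvalidTid, here None).
revmap = {0: (5, 1), 2: (5, 7)}
print(list(revmap_scan(revmap, 300)))
# → [(0, (5, 1)), (128, None), (256, (5, 7))]
```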

To read the TID during an index scan, we follow this protocol:

* read revmap page
* obtain share lock on the buffer
* read the TID
* LockTuple that TID (using the index as relation).  A shared lock is
  sufficient.  We need the LockTuple to prevent VACUUM from recycling
  the index tuple; see below.
* release buffer lock
* read the index tuple
* release the tuple lock

On heap insert/update, it is normally cheaper to update the summary tuple
(grab the old tuple, expand the min/max range according to the new value
being inserted, insert the new index tuple, update the reverse range map)
than letting the range be unsummarized, which would require scanning the
pages in the range.
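The insert-path summary update reduces to widening the stored bounds; a minimal sketch (illustration only, the function name and tuple format are made up):

```python
# Minimal sketch of the insert path described above: expand the stored
# min/max according to the new value being inserted.  Illustration only.

def summary_after_insert(summary, new_value):
    """summary is a (min, max) pair, or None for an unsummarized range.
    An unsummarized range stays unsummarized until resummarization."""
    if summary is None:
        return None
    lo, hi = summary
    # The bounds only ever widen here; tightening needs a full range scan.
    return (min(lo, new_value), max(hi, new_value))

print(summary_after_insert((10, 20), 5))    # → (5, 20)
print(summary_after_insert((10, 20), 15))   # → (10, 20), already within bounds
```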

[However, when many updates on a range are going to be done, it'd be
better to mark it as unsummarized (i.e. set the revmap TID to Invalid)
and do a resummarization later, to prevent the index from bloating.
Also, it'd be sensible to allow postponing of the index update, if many
tuples are inserted; something like GIN's "pending list".  We would need
to keep track of the TIDs of the inserted heap tuples, so that we can
figure out the new min/max values without having to scan the whole
range.]

To update the summary tuple for a page range, we use this protocol:

* insert a new index tuple anywhere; note its TID
* read revmap page
* obtain exclusive lock on buffer
* write the TID
* release lock

This ensures no concurrent reader can obtain a partially-written TID.
Note we don't need a tuple lock here.  Concurrent scans don't have to
worry about whether they got the old or new index tuple: if they get the
old one, the tighter values are okay from a correctness standpoint because
due to MVCC they can't possibly see the just-inserted heap tuples anyway.

For vacuuming, we need to figure out which index tuples are no longer
referenced from the reverse range map.  This requires some brute force,
but is simple:

1) scan the complete index, storing each existing TID in a dynahash.
   Hash key is the TID, hash value is a boolean initially set to false.
2) scan the complete revmap sequentially, read the TIDs on each page.  Share
   lock on each page is sufficient.  For each TID so obtained, grab the
   element from the hash and update the boolean to true.
3) Scan the index again; for each tuple found, search the hash table.
   If the tuple is not present, it must have been added after our initial
   scan; ignore it.  If the hash flag is true, then the tuple is referenced;
   ignore it.  If the hash flag is false, then the index tuple is no longer
   referenced by the revmap; but it could be about to be accessed by a
   concurrent scan.  Do ConditionalLockTuple.  If this fails, ignore the
   tuple; it will be deleted by a future vacuum.  If the lock is acquired,
   then we can safely remove the index tuple.
4) Index pages with free space can be detected by this second scan.  Register
   those with the FSM.
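The mark-and-sweep above can be modeled in miniature. A plain dict stands in for the dynahash and a callback for ConditionalLockTuple; this is a sketch of the logic, not the buffer-level code:

```python
# Toy model of the four-step vacuum protocol above.  A dict stands in for
# the dynahash; try_lock stands in for ConditionalLockTuple.

def minmax_vacuum(index_tids, revmap_tids, try_lock=lambda tid: True):
    """Return the set of index tuple TIDs that are safe to remove."""
    # Step 1: first index scan; flag starts false for every TID.
    referenced = {tid: False for tid in index_tids}
    # Step 2: revmap scan; mark every TID the revmap still points to.
    for tid in revmap_tids:
        if tid in referenced:
            referenced[tid] = True
    # Step 3: second index scan.
    removable = set()
    for tid in index_tids:
        if referenced.get(tid, True):      # referenced (or added later): keep
            continue
        if try_lock(tid):                  # lock acquired: no concurrent reader
            removable.add(tid)             # else: leave it for a future vacuum
    return removable

print(minmax_vacuum({(1, 1), (1, 2), (1, 3)}, {(1, 1), (1, 3)}))
# → {(1, 2)}
```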

Note this doesn't require scanning the heap at all, or being involved in
the heap's cleanup procedure.  (In particular, tables with only minmax
indexes do not require the removed tuples' TIDs to be collected during
the heap cleanup pass.)   XXX I think this is wrong in detail; we
probably need to be using LockBufferForCleanup.

With the reverse range map it is very easy to see which page ranges in
the heap need resummarization; it's the ones marked with InvalidTid.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [RFC] Minmax indexes

2013-06-28 Thread Josh Berkus

>> On Fri, Jun 28, 2013 at 2:18 PM, Jim Nasby  wrote:
>>> If we add a dirty flag it would probably be wise to allow for more
>>> than one value so we can do a clock-sweep. That would allow for
>>> detecting a range that is getting dirtied repeatedly and not bother
>>> to try and re-summarize it until later.

Given that I'm talking about doing the resummarization at VACUUM time, I
don't think that's justified.  That only makes sense if you're doing the
below ...

>>> Something else I don't think was mentioned... re-summarization should
>>> be somehow tied to access activity: if a query will need to seqscan a
>>> segment that needs to be summarized, we should take that opportunity to
>>> summarize at the same time while pages are in cache. Maybe that can be
>>> done in the backend itself; maybe we'd want a separate process.

Writing at SELECT time is something our users hate, with significant
justification.  This suggestion is possibly a useful optimization, but
I'd like to see the simplest possible version of minmax indexes go into
9.4, so that we can see how they work in practice before we start adding
complicated performance optimizations involving extra write overhead and
background workers.  We're more likely to engineer an expensive
de-optimization that way.

I wouldn't be unhappy to have the very basic minmax indexes -- the one
where we just flag pages as invalid if they get updated -- go into 9.4.
That would still work for large, WORM tables, which are a significant
and pervasive use case.  If Alvaro gets to it, I think the "updating the
range, rescan at VACUUM time" approach isn't much more complex, but if
he doesn't I'd still vote to accept the feature.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] [RFC] Minmax indexes

2013-06-28 Thread Jim Nasby

On 6/28/13 12:26 PM, Claudio Freire wrote:

> On Fri, Jun 28, 2013 at 2:18 PM, Jim Nasby  wrote:
>> On 6/17/13 3:38 PM, Josh Berkus wrote:
>>>>> Why?  Why can't we just update the affected pages in the index?
>>>>
>>>> The page range has to be scanned in order to find out the min/max values
>>>> for the indexed columns on the range; and then, with these data, update
>>>> the index.
>>>
>>> Seems like you could incrementally update the range, at least for
>>> inserts.  If you insert a row which doesn't decrease the min or increase
>>> the max, you can ignore it, and if it does increase/decrease, you can
>>> change the min/max.  No?
>>>
>>> For updates, things are more complicated.  If the row you're updating
>>> was the min/max, in theory you should update it to adjust that, but you
>>> can't verify that it was the ONLY min/max row without doing a full scan.
>>> My suggestion would be to add a "dirty" flag which would indicate that
>>> that block could use a rescan next VACUUM, and otherwise ignore changing
>>> the min/max.  After all, the only defect of having min too low or max too
>>> high for a block would be scanning too many blocks.  Which you'd do
>>> anyway with it marked "invalid".
>>
>> If we add a dirty flag it would probably be wise to allow for more than
>> one value so we can do a clock-sweep. That would allow for detecting a
>> range that is getting dirtied repeatedly and not bother to try and
>> re-summarize it until later.
>>
>> Something else I don't think was mentioned... re-summarization should be
>> somehow tied to access activity: if a query will need to seqscan a
>> segment that needs to be summarized, we should take that opportunity to
>> summarize at the same time while pages are in cache. Maybe that can be
>> done in the backend itself; maybe we'd want a separate process.
>
> This smells a lot like hint bits and all the trouble they bring.

Possibly; though if that's the case just having a second process do the work
would probably eliminate most of that problem.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] [RFC] Minmax indexes

2013-06-28 Thread Claudio Freire
On Fri, Jun 28, 2013 at 2:18 PM, Jim Nasby  wrote:
> On 6/17/13 3:38 PM, Josh Berkus wrote:

>>>> Why?  Why can't we just update the affected pages in the index?
>>>
>>> The page range has to be scanned in order to find out the min/max values
>>> for the indexed columns on the range; and then, with these data, update
>>> the index.
>>
>> Seems like you could incrementally update the range, at least for
>> inserts.  If you insert a row which doesn't decrease the min or increase
>> the max, you can ignore it, and if it does increase/decrease, you can
>> change the min/max.  No?
>>
>> For updates, things are more complicated.  If the row you're updating
>> was the min/max, in theory you should update it to adjust that, but you
>> can't verify that it was the ONLY min/max row without doing a full scan.
>> My suggestion would be to add a "dirty" flag which would indicate that
>> that block could use a rescan next VACUUM, and otherwise ignore changing
>> the min/max.  After all, the only defect of having min too low or max too
>> high for a block would be scanning too many blocks.  Which you'd do
>> anyway with it marked "invalid".
>
>
> If we add a dirty flag it would probably be wise to allow for more than one
> value so we can do a clock-sweep. That would allow for detecting a range
> that is getting dirtied repeatedly and not bother to try and re-summarize it
> until later.
>
> Something else I don't think was mentioned... re-summarization should be
> somehow tied to access activity: if a query will need to seqscan a segment
> that needs to be summarized, we should take that opportunity to summarize at
> the same time while pages are in cache. Maybe that can be done in the
> backend itself; maybe we'd want a separate process.


This smells a lot like hint bits and all the trouble they bring.




Re: [HACKERS] [RFC] Minmax indexes

2013-06-28 Thread Jim Nasby

On 6/17/13 3:38 PM, Josh Berkus wrote:
>>> Why?  Why can't we just update the affected pages in the index?
>>
>> The page range has to be scanned in order to find out the min/max values
>> for the indexed columns on the range; and then, with these data, update
>> the index.
>
> Seems like you could incrementally update the range, at least for
> inserts.  If you insert a row which doesn't decrease the min or increase
> the max, you can ignore it, and if it does increase/decrease, you can
> change the min/max.  No?
>
> For updates, things are more complicated.  If the row you're updating
> was the min/max, in theory you should update it to adjust that, but you
> can't verify that it was the ONLY min/max row without doing a full scan.
> My suggestion would be to add a "dirty" flag which would indicate that
> that block could use a rescan next VACUUM, and otherwise ignore changing
> the min/max.  After all, the only defect of having min too low or max too
> high for a block would be scanning too many blocks.  Which you'd do
> anyway with it marked "invalid".


If we add a dirty flag it would probably be wise to allow for more than one 
value so we can do a clock-sweep. That would allow for detecting a range that 
is getting dirtied repeatedly and not bother to try and re-summarize it until 
later.
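The clock-sweep idea could work roughly like this toy sketch: the dirty marker becomes a small saturating counter, so ranges that are dirtied repeatedly get deferred for several passes. All names and numbers here are illustrative assumptions, not a proposed on-disk format:

```python
# Toy sketch of the clock-sweep idea above: a small counter instead of a
# boolean dirty flag, so heavily-churned ranges are deferred.  Illustrative.

MAX_DIRTY = 3

def mark_dirty(counters, rng):
    """Each invalidating change bumps the range's counter, saturating."""
    counters[rng] = min(counters.get(rng, 0) + 1, MAX_DIRTY)

def sweep(counters):
    """One pass: decrement every counter; resummarize ranges reaching zero.
    A repeatedly-dirtied range needs several quiet passes before it's redone."""
    done = []
    for rng in list(counters):
        counters[rng] -= 1
        if counters[rng] <= 0:
            del counters[rng]
            done.append(rng)
    return done

counters = {}
mark_dirty(counters, "r1")              # dirtied once
for _ in range(5):
    mark_dirty(counters, "r2")          # churned repeatedly (saturates at 3)
print(sweep(counters), counters)        # → ['r1'] {'r2': 2}
```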

Something else I don't think was mentioned... re-summarization should be 
somehow tied to access activity: if a query will need to seqscan a segment that 
needs to be summarized, we should take that opportunity to summarize at the 
same time while pages are in cache. Maybe that can be done in the backend 
itself; maybe we'd want a separate process.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] [RFC] Minmax indexes

2013-06-25 Thread Alvaro Herrera
I just noticed that this patch was closed with "returned with feedback"
in the commitfest app.  This is good, IMV -- it's saying that the
opinion of the various people commenting on the thread is positive, and
therefore no more discussion is currently needed.

I will post an actual patch to CF2, at which point more detailed
discussion can be had.

Thanks everyone for their feedback.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [RFC] Minmax indexes

2013-06-24 Thread Simon Riggs
On 25 June 2013 00:51, Bruce Momjian  wrote:

> On Sat, Jun 15, 2013 at 11:39:23AM -0400, Tom Lane wrote:
> > Simon Riggs  writes:
> > > On 15 June 2013 00:01, Josh Berkus  wrote:
> > >> If we're going to start adding reloptions for specific table behavior,
> > >> I'd rather think of all of the optimizations we might have for a
> > >> prospective "append-only table" and bundle those, rather than tying it
> > >> to whether a certain index exists or not.
> >
> > > I agree that the FSM behaviour shouldn't be linked to index existence.
> > > IMHO that should be a separate table parameter, WITH (fsm_mode = append)
> > > Index only scans would also benefit from that.
> >
> > -1 ... I cannot believe that such a parameter would ever get turned on
> > in production by anyone.  If your table has a significant update rate,
> > the resulting table bloat would make such behavior completely
> > infeasible.  If you have few enough updates to make such a behavior
> > practical, then you can live with the expensive index updates instead.
>
> Can you have pages that are receiving updates _not_ track min/max, until
> the page is nearly full?  This would require full scans of such pages,
> but there might be few of them.  The amount of free space on the page
> as reported by FSM might be useful here.
>

Yes, that is the proposal. Just like index only scans.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [RFC] Minmax indexes

2013-06-24 Thread Bruce Momjian
On Sat, Jun 15, 2013 at 11:39:23AM -0400, Tom Lane wrote:
> Simon Riggs  writes:
> > On 15 June 2013 00:01, Josh Berkus  wrote:
> >> If we're going to start adding reloptions for specific table behavior,
> >> I'd rather think of all of the optimizations we might have for a
> >> prospective "append-only table" and bundle those, rather than tying it
> >> to whether a certain index exists or not.
> 
> > I agree that the FSM behaviour shouldn't be linked to index existence.
> > IMHO that should be a separate table parameter, WITH (fsm_mode = append)
> > Index only scans would also benefit from that.
> 
> -1 ... I cannot believe that such a parameter would ever get turned on
> in production by anyone.  If your table has a significant update rate,
> the resulting table bloat would make such behavior completely
> infeasible.  If you have few enough updates to make such a behavior
> practical, then you can live with the expensive index updates instead.

Can you have pages that are receiving updates _not_ track min/max, until
the page is nearly full?  This would require full scans of such pages,
but there might be few of them.  The amount of free space on the page
as reported by FSM might be useful here.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] [RFC] Minmax indexes

2013-06-17 Thread Tomas Vondra
Hi!

This sounds really interesting, so a few quick comments.

On 15.6.2013 00:28, Alvaro Herrera wrote:

> In each index tuple (corresponding to one page range), we store:
> - first block this tuple applies to
> - last block this tuple applies to
> - for each indexed column:
>   * min() value across all tuples in the range
>   * max() value across all tuples in the range
>   * nulls present in any tuple?

What about adding a counter of how many times the min/max value is present
in the range? The other messages in this thread suggest that the refresh
after UPDATE/DELETE is one of the concerns - as Greg Stark mentioned the
range invalidation may only happen when running DELETE on a row matching
the min/max value. I believe having a counter would be an improvement -
a refresh would be needed only if it reaches 0.
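The counter idea on the DELETE path could behave like this rough sketch (the summary tuple format and names are made up for illustration):

```python
# Hypothetical summary carrying occurrence counters for the bounds, as
# suggested above: (min, min_count, max, max_count).  Illustration only.

def summary_after_delete(summary, deleted_value):
    """Return the updated summary, or None when the range must be
    invalidated: the last row carrying a bound is gone, so the true
    bound is unknown until a refresh."""
    lo, lo_n, hi, hi_n = summary
    if deleted_value == lo:
        lo_n -= 1
    if deleted_value == hi:
        hi_n -= 1
    if lo_n == 0 or hi_n == 0:
        return None          # counter reached 0: refresh needed
    return (lo, lo_n, hi, hi_n)

print(summary_after_delete((3, 2, 9, 1), 3))   # → (3, 1, 9, 1): no refresh
print(summary_after_delete((3, 1, 9, 1), 3))   # → None: refresh needed
```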

> Summarization
> -------------
> 
> At index creation time, the whole table is scanned; for each page
> range the min() and max() values and nulls bitmap are collected, and
> stored in the index. The possibly-incomplete range at the end of the
> table is also included.

Would it make sense to do this using an existing (b-tree) index for very
large tables? Clearly it doesn't make sense to create a b-tree index and
then minmax on the same column, but for very large databases upgraded
using pg_upgrade this might be interesting.

> Once in a while, it is necessary to summarize a bunch of unsummarized
> pages (because the table has grown since the index was created), or
> re-summarize a range that has been marked invalid.  This is simple:
> scan the page range calculating the min() and max() for each indexed
> column, then insert the new index entry at the end of the index.  The
> main interesting questions are:
> 
> a) when to do it.  The perfect time to do it is as soon as a complete
> page range of the configured range size has been filled (assuming
> page ranges are constant size).
> 
> b) who does it (what process).  It doesn't seem a good idea to have a
> client-connected process do it; it would incur unwanted latency.
> Three other options are (i) to spawn a specialized process to do it,
> which perhaps can be signalled by a client-connected process that
> executes a scan and notices the need to run summarization; or (ii) to
> let autovacuum do it, as a separate new maintenance task.  This seems
> simple enough to bolt on top of already existing autovacuum
> infrastructure.  The timing constraints of autovacuum might be
> undesirable, though.  (iii) wait for user command.
>
> 
> The easiest way to go around this seems to have vacuum do it.  That
> way we can simply do re-summarization on the amvacuumcleanup routine.
> Other answers would mean we need a separate AM routine, which appears
> unwarranted at this stage.

I don't think this should be added to the autovacuum daemon. It's quite
tricky to tune autovacuum to be aggressive just enough, i.e. not to run
too frequently / not to lag. I'm afraid this would add a task consuming
unpredictable amounts of time, which would make autovacuum tuning
even trickier.

I can live with non-summarized minmax indexes, but I can't live with
non-vacuumed tables.

> Open questions
> --------------
> 
> * Same-size page ranges? Current related literature seems to consider
> that each "index entry" in a minmax index must cover the same number
> of pages.  There doesn't seem to be a hard reason for this to be so;
> it might make sense to allow the index to self-tune so that some
> index entries cover smaller page ranges, if this allows the
> min()/max() values to be more compact.  This would incur larger space
> overhead for the index itself, but might allow better pruning of page
> ranges during scan.  In the limit of one index tuple per page, the
> index itself would occupy too much space, even though we would be
> able to skip reading most heap pages, because the min()/max()
> ranges are tight; in the opposite limit of a single tuple that
> summarizes the whole table, we wouldn't be able to prune anything
> from the seqscan even though the index is very small.

I see no particular reason not to allow that, and having variable range
size would be a very nice feature IMHO.

Do you have an idea of how the self-tuning might work?

I was thinking about something like this:

(1) estimate the initial range size, trying to keep good "selectivity"
while not having very small ranges, using

   (a) average row length
   (b) number of distinct values in the column
   (c) MCV/histogram/correlation

(2) once the index is built, running a "merge" on the ranges - merging
the adjacent pages if the resulting selectivity is not significantly
worse (again, this might be evaluated using MCV, histogram)

But it should be possible to override this (initial range size, max
range size) or disable heuristics completely, as having large ranges
probably makes the resummarizing more expensive. Although having the
counter might fix that, as it's more likely there's another row with
the min/max value.

BTW

Re: [HACKERS] [RFC] Minmax indexes

2013-06-17 Thread Josh Berkus

>> This begins to sound like these indexes are only useful on append-only
>> tables.  Not that there aren't plenty of those, but ...
> 
> But what?  

... "but" the other comments further down in my email.  Also, my
successive comments in other emails.

>> Why?  Why can't we just update the affected pages in the index?
> 
> The page range has to be scanned in order to find out the min/max values
> for the indexed columns on the range; and then, with these data, update
> the index.

Seems like you could incrementally update the range, at least for
inserts.  If you insert a row which doesn't decrease the min or increase
the max, you can ignore it, and if it does increase/decrease, you can
change the min/max.  No?

For updates, things are more complicated.  If the row you're updating
was the min/max, in theory you should update it to adjust that, but you
can't verify that it was the ONLY min/max row without doing a full scan.
My suggestion would be to add a "dirty" flag which would indicate that
that block could use a rescan next VACUUM, and otherwise ignore changing
the min/max.  After all, the only defect of having min too low or max too
high for a block would be scanning too many blocks.  Which you'd do
anyway with it marked "invalid".
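The two paths described above (inserts widen in place, updates that touch a bound just set a dirty flag) can be sketched as follows; the class and names are illustrative, not the proposed implementation:

```python
# Sketch of the scheme described above: inserts widen the bounds in place;
# an update whose old row carried a bound merely flags the range so the
# next VACUUM rescans it.  Illustrative only.

class RangeSummary:
    def __init__(self, lo, hi):
        self.lo, self.hi, self.dirty = lo, hi, False

    def on_insert(self, value):
        # Widening is always safe without rescanning the range.
        self.lo = min(self.lo, value)
        self.hi = max(self.hi, value)

    def on_update(self, old_value, new_value):
        self.on_insert(new_value)
        # Can't tell whether old_value was the ONLY row at the bound,
        # so don't tighten; just leave the range for rescan at VACUUM.
        if old_value in (self.lo, self.hi):
            self.dirty = True

s = RangeSummary(10, 20)
s.on_insert(5)          # min widens to 5; no dirty flag
s.on_update(5, 30)      # max widens to 30; old row was the min: dirty
print((s.lo, s.hi, s.dirty))   # → (5, 30, True)
```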

> This is not a requirement.  It merely makes the index more effective.

Right.  So I'm saying let's do this index without the FSM modifications,
and then consider those as their own, separate patch, if we even do them.

> Eh?  Sure, my intention for this reloption is for the user to be able to
> state their intention for the table, and each feature that has
> append-only table optimization does its thing.  I wasn't thinking in
> anything automatic.

99.7% of our users have no idea what to do with reloptions.  We'd have
to expose it with an ALTER TABLE SET append_only=true.

>> Also, I hate the name ...
> 
> Feel free to propose other names; that way I can hate your proposals
> too (or maybe not).

Well, my first thought was "block-range indexing", which I think is the
best description, but that isn't exactly an exciting name for a feature
which will likely be worthy of short-listing for 9.4.  I'd prefer it
over minmax, which users will think only works on aggregates, but it's
still not a great name.  "Summary Index" also comes to mind, but really
isn't a lot more exciting.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] [RFC] Minmax indexes

2013-06-17 Thread Alvaro Herrera
Tom Lane wrote:

> We've talked a lot about index-organized tables in the past.  How much
> of the use case for this would be subsumed by a feature like that?

IOTs are not flexible enough.  You can only have one index that you
index-organize the table on; and you can search only based on a prefix
of the index key.  If you want to change the key, ... um. I don't even
know what you'd do.

With minmax indexes, on the other hand, you can create one or several,
and they let you scan the table based on any of the indexed columns.  So
you can create a minmax index on creation_date, insertion_date, ship_date;
and have it serve queries that use any of these columns.  (You probably
don't add key column(s) to the minmax index because you already have
btrees on them.)

On the other hand, IOTs are expensive to insert into.  For each tuple to
insert you need to start from the root and descend the tree, insert your
tuple, then propagate splits upwards.  If you have a 10 TB table, you
cannot afford to have to do all that for each and every tuple.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [RFC] Minmax indexes

2013-06-17 Thread Andres Freund
On 2013-06-17 16:23:40 -0400, Alvaro Herrera wrote:
> > > Re-summarization is relatively expensive, because the complete page
> > > range has to be scanned.
> > 
> > Why?  Why can't we just update the affected pages in the index?
> 
> The page range has to be scanned in order to find out the min/max values
> for the indexed columns on the range; and then, with these data, update
> the index.

Why? Assuming the initial summarization has been performed you can check
the current min/max value, check whether it's still smaller/bigger than
the value you're inserting and if not update the index accordingly with
the new value. You don't even need to wait for the new value to become
visible since the ranges don't need to be minimal.

I think the contention this possibly causes may be a better argument, but I
am not sure how much of a problem that really becomes if we find a
deadlock-free way to only lock the minmax pages exclusively if the range
is violated.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [RFC] Minmax indexes

2013-06-17 Thread Alvaro Herrera
Josh Berkus wrote:

> > Value changes in columns that are part of a minmax index, and tuple
> > insertion in summarized pages, would invalidate the stored min/max
> > values.  To support this, each minmax index has a validity map; a range
> > can only be considered in a scan if it hasn't been invalidated by such
> > changes (A range "not considered" in the scan needs to be returned in
> > whole regardless of the stored min/max values, that is, it cannot be
> > pruned per query quals).  The validity map is very similar to the
> > visibility map in terms of performance characteristics: quick enough
> > that it's not contentious, allowing updates and insertions to proceed
> > even when data values violate the minmax index conditions.  An
> > invalidated range can be made valid by re-summarization (see below).
> 
> This begins to sound like these indexes are only useful on append-only
> tables.  Not that there aren't plenty of those, but ...

But what?  This is a useful use-case; one that's not served by any other
type of index.  Sure, you can have btrees over append-only tables, but
they are really large and have huge maintenance costs.  A smaller lossy
index is useful if it is low-maintenance and keeps the cost of scanning
the table low enough.

These indexes can be considered a sort of partitioning of a large table.
Only instead of having to define CHECK (insert_date >= 'a month') for
each partition, you just create the index on the insert_date column, and
the index will allow a seqscan of the table to skip pages that don't
match the query's WHERE clause on that column.

> > Re-summarization is relatively expensive, because the complete page
> > range has to be scanned.
> 
> Why?  Why can't we just update the affected pages in the index?

The page range has to be scanned in order to find out the min/max values
for the indexed columns on the range; and then, with these data, update
the index.

> >  To avoid this, a table having a minmax index would be configured so
> > that inserts only go to the page(s) at the end of the table; this
> > avoids frequent invalidation of ranges in the middle of the table.  We
> > provide a table reloption that tweaks the FSM behavior, so that
> > summarized pages are not candidates for insertion.
> 
> We haven't had an index type which modifies table insertion behavior
> before, and I'm not keen to start now; imagine having two indexes on the
> same table each with their own, conflicting, requirements.

This is not a requirement.  It merely makes the index more effective.

> If we're going to start adding reloptions for specific table behavior,
> I'd rather think of all of the optimizations we might have for a
> prospective "append-only table" and bundle those, rather than tying it
> to whether a certain index exists or not.

Eh?  Sure, my intention for this reloption is for the user to be able to
state their intention for the table, and each feature that has
append-only table optimization does its thing.  I wasn't thinking of
anything automatic.

> Also, I hate the name ...

Feel free to propose other names; that way I can hate your proposals
too (or maybe not).

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] Minmax indexes

2013-06-17 Thread Alvaro Herrera
Greg Stark wrote:
> On Fri, Jun 14, 2013 at 11:28 PM, Alvaro Herrera
>  wrote:
> > Re-summarization is relatively expensive, because the complete page range has
> > to be scanned.
> 
> That doesn't sound too bad to me. It just means there's a downside to
> having larger page ranges. I would expect the page ranges to be
> something in the ballpark of 32 pages --  scanning 32 pages to
> resummarize doesn't sound that painful but sounds like it's large
> enough that the resulting index would be a reasonable size.

Actually, I'm thinking that a range is more like, say, 1280 pages, or 10
MB.  My goal is to consider tables in the 10 TB magnitude; if I store
one index tuple for every 32 pages, I would end up with too large an
index.  With 10 MBs per range I can index the whole table with an index
of 50 MB, which seems reasonable to scan.  But anyway my intention is
that page range size is configurable.
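The arithmetic behind those numbers can be checked quickly. This is back-of-envelope math, not code from the patch, and it assumes the default 8 kB PostgreSQL block size:

```python
TB = 1024 ** 4
MB = 1024 ** 2
PAGE_SIZE = 8192                            # default PostgreSQL block size

pages_per_range = 1280
range_bytes = pages_per_range * PAGE_SIZE   # bytes covered by one index entry
n_entries = (10 * TB) // range_bytes        # entries needed for a 10 TB table

print(range_bytes // MB)   # 10 -> each range covers 10 MB
print(n_entries)           # 1048576 -> about one million index entries
# At a few tens of bytes per single-column entry, that is roughly the
# 50 MB index mentioned above; with 32-page ranges the entry count (and
# hence the index) would be 40 times larger.
```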

> But I don't understand why an insert would invalidate a tuple. An insert
> can just update the min and max incrementally. It's a delete that
> invalidates the range but as you note it doesn't really invalidate it,
> just mark it as needing a refresh -- and even then only if the value
> being deleted is equal to either the min or max.

No, I don't intend to update the index tuple with each heap insert.  I
think this will end up being too slow.  The validity map is there to
hold a single bit for each page saying whether the page range is known
valid or not; one insert into any page in the range invalidates the
range (and any scan of the table needs to scan that range as a whole).
This way I can wait until the storm of inserts has passed from a range,
and only then do the summarization for that range.  This avoids having
to summarize the range over and over.
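A minimal sketch of the validity map as described here, assuming one bit per page range. The class and method names are invented for illustration; the real structure would live in shared buffers like the visibility map:

```python
class ValidityMap:
    """One bit per page range: 1 = stored summary is valid,
    0 = the range must be returned in whole by scans and is a
    candidate for later re-summarization."""

    def __init__(self, n_ranges):
        self.bits = bytearray((n_ranges + 7) // 8)   # all invalid initially

    def _pos(self, r):
        return r // 8, 1 << (r % 8)

    def set_valid(self, r):
        """Called after re-summarizing range r."""
        byte, mask = self._pos(r)
        self.bits[byte] |= mask

    def invalidate(self, r):
        """Called on any insert/update into range r; a cheap bit clear,
        so inserts are not slowed down by index maintenance."""
        byte, mask = self._pos(r)
        self.bits[byte] &= ~mask

    def is_valid(self, r):
        byte, mask = self._pos(r)
        return bool(self.bits[byte] & mask)
```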

Alternatively, I could consider the index tuple always valid, and update
it online as soon as we do an insert or update (i.e. examine the min/max
values in the index tuple, and update it to match if the new value is
out of bounds).  This seems too slow, so I won't try.

Also, a delete does not invalidate a range either.  As Simon said
elsewhere in a thread, if the range is not minimal, this is not a
problem; we might have to scan some more ranges than absolutely
necessary, but it's not a correctness problem.  The heap scan recheck
node will get rid of the unwanted tuples anyway.

> > Same-size page ranges?
> > Current related literature seems to consider that each "index entry" in a
> > minmax index must cover the same number of pages.  There doesn't seem to be a
> 
> I assume the reason for this in the literature is the need to quickly
> find the summary for a given page when you're handling an insert or
> delete.

Yeah, that's one thing to keep in mind.  I haven't gotten too much into
this; I only added these two entries at the end for my own future
reference, because I will need to consider them at some point.

Right now my intention is to have each index have a fixed page range
size, which is defined at index creation time.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [RFC] Minmax indexes

2013-06-17 Thread Josh Berkus

> So there isn't a "fall down" thing here. We expect the recently
> loaded/updated data to be scanned and that's OK.
> 
> Having the minmax index updated greedily is just adding extra work for
> fast diminishing returns. We can always add that later if really
> needed, but I doubt it will be needed - in just the same way as mat
> views aren't greedily updated.

Ok, in that case, can we add the patch without messing with the FSM
logic?  It'll work out-of-the-box for append-only tables, and that's a
pretty solid use case.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] [RFC] Minmax indexes

2013-06-17 Thread Simon Riggs
On 17 June 2013 02:05, Josh Berkus  wrote:
>
>>> I agree that the FSM behaviour shouldn't be linked to index existence.
>>> IMHO that should be a separate table parameter, WITH (fsm_mode = append)
>>> Index only scans would also benefit from that.
>>
>> -1 ... I cannot believe that such a parameter would ever get turned on
>> in production by anyone.  If your table has a significant update rate,
>> the resulting table bloat would make such behavior completely
>> infeasible.  If you have few enough updates to make such a behavior
>> practical, then you can live with the expensive index updates instead.
>
> I'm also thinking that if a table is really append-only, then there are
> never any middle-of-the-table pages in the FSM, no?
>
> Where this falls down is the table which is
> mostly-append-but-occasionally-needs-an-update-or-delete.  I think the
> answer there is to look for a way to make updating the index block range
> faster, not ways to modify how we append to the heap.  If we told users
> "tables with Minmax indexes will be very expensive to update" then I
> think they'd live with it; dropping and readding an index to enable fast
> updates is something which is already familiar.

This feature is using a similar technique to enhance SeqScans as we
already use on VACUUM. We don't really care about whether we have 100%
scan avoidance because we were likely to be using a WHERE clause that
doesn't give perfect constraint elimination anyway. So on a large
table we don't care about the fact that we still have to scan 1-5% of
the table - we are still 20 times faster than a full seqscan.

So there isn't a "fall down" thing here. We expect the recently
loaded/updated data to be scanned and that's OK.

Having the minmax index updated greedily is just adding extra work for
fast diminishing returns. We can always add that later if really
needed, but I doubt it will be needed - in just the same way as mat
views aren't greedily updated.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [RFC] Minmax indexes

2013-06-16 Thread Josh Berkus

>> I agree that the FSM behaviour shouldn't be linked to index existence.
>> IMHO that should be a separate table parameter, WITH (fsm_mode = append)
>> Index only scans would also benefit from that.
> 
> -1 ... I cannot believe that such a parameter would ever get turned on
> in production by anyone.  If your table has a significant update rate,
> the resulting table bloat would make such behavior completely
> infeasible.  If you have few enough updates to make such a behavior
> practical, then you can live with the expensive index updates instead.

I'm also thinking that if a table is really append-only, then there are
never any middle-of-the-table pages in the FSM, no?

Where this falls down is the table which is
mostly-append-but-occasionally-needs-an-update-or-delete.  I think the
answer there is to look for a way to make updating the index block range
faster, not ways to modify how we append to the heap.  If we told users
"tables with Minmax indexes will be very expensive to update" then I
think they'd live with it; dropping and readding an index to enable fast
updates is something which is already familiar.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] [RFC] Minmax indexes

2013-06-16 Thread Craig Ringer
On 06/15/2013 11:39 PM, Tom Lane wrote:
> There's a nearby thread complaining bitterly about our willingness to
> create hard-to-use, hard-to-tune "features".  In its current form,
> this will be another one of those.
For what it's worth, I was going for "usability concerns", not
"complaint"... and I'd much rather have most of those features, even
with poor usability, than not have them at all. I'm not convinced anyone
can introduce something that's beautiful design and implementation
perfection first time, every time. Learning where the pain points are
and improving them is what I'm going for.

There's also, IMO, nothing wrong with somewhat obscure options for
obscure use cases and workloads. The problem is when difficult to
understand or weird and obscure parameters need adjustment for perfectly
ordinary middle-of-the-road workloads.

I'm curious about using minmax indexes to guide insert/update placement,
so new tuples are written in regions of the table where they already
match the indexed range - essentially providing coarse-grained
auto-clustering. Otherwise with updates a table is likely to see
scattering of values such that the minmax ranges all keep on expanding
and the index becomes less and less selective. I realise that this may
not be even remotely feasible, but if it were then it might help reduce
the need for explicit tuning. I suspect we'd also land up wanting table
optimisation down the track, where tuples were shuffled around during
autovacuum to narrow the minmax index ranges.

I'm not expressing an opinion on the option discussed here as I haven't
worked with enough DBs that'd benefit from it. Personally I'm just
seriously impressed that this feature is coming together.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services





Re: [HACKERS] [RFC] Minmax indexes

2013-06-15 Thread Simon Riggs
On 15 June 2013 16:39, Tom Lane  wrote:
> Simon Riggs  writes:
>> On 15 June 2013 00:01, Josh Berkus  wrote:
>>> If we're going to start adding reloptions for specific table behavior,
>>> I'd rather think of all of the optimizations we might have for a
>>> prospective "append-only table" and bundle those, rather than tying it
>>> to whether a certain index exists or not.
>
>> I agree that the FSM behaviour shouldn't be linked to index existence.
>> IMHO that should be a separate table parameter, WITH (fsm_mode = append)
>> Index only scans would also benefit from that.
>
> -1 ... I cannot believe that such a parameter would ever get turned on
> in production by anyone.

It's an extremely common use case, especially with larger databases.

> If your table has a significant update rate,
> the resulting table bloat would make such behavior completely
> infeasible.  If you have few enough updates to make such a behavior
> practical, then you can live with the expensive index updates instead.

Depends. What I had in mind was that "append" mode would allow updates
normally within the last 1 GB of a table, but relocate updates should
they occur in earlier segments. That would cover the majority of use
cases I see.

> I also find the analogy to index-only scans to be bogus, because those
> didn't require any user tuning.

I don't think anybody is suggesting there is a requirement for user
tuning on this type of index.

But I think having one parameter would make it the same as say, GIN
indexes which also have a single tuning parameter.

I think the idea that index only scans "just work" is itself bogus. A
mild random update pattern will reduce the benefit of IOS scans
considerably, which is hard to understand or control, especially
without any way of measuring it. We could eliminate that problem if we
chose, but its not even documented as a potential issue for some
reason so its hard to discuss. Not being able to see or control a
known problem is not the same thing as "doesn't require user tuning".

> There's a nearby thread complaining bitterly about our willingness to
> create hard-to-use, hard-to-tune "features".  In its current form,
> this will be another one of those.

Not based upon anything mentioned here so far, or that I'm aware of.

I think it would be interesting to see some anonymous voting on
feature complaints, so we can see what people really think without
needing to risk offending the main authors of each of the various
parts of our software.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [RFC] Minmax indexes

2013-06-15 Thread Tom Lane
Simon Riggs  writes:
> On 15 June 2013 00:01, Josh Berkus  wrote:
>> If we're going to start adding reloptions for specific table behavior,
>> I'd rather think of all of the optimizations we might have for a
>> prospective "append-only table" and bundle those, rather than tying it
>> to whether a certain index exists or not.

> I agree that the FSM behaviour shouldn't be linked to index existence.
> IMHO that should be a separate table parameter, WITH (fsm_mode = append)
> Index only scans would also benefit from that.

-1 ... I cannot believe that such a parameter would ever get turned on
in production by anyone.  If your table has a significant update rate,
the resulting table bloat would make such behavior completely
infeasible.  If you have few enough updates to make such a behavior
practical, then you can live with the expensive index updates instead.

I also find the analogy to index-only scans to be bogus, because those
didn't require any user tuning.

There's a nearby thread complaining bitterly about our willingness to
create hard-to-use, hard-to-tune "features".  In its current form,
this will be another one of those.

regards, tom lane




Re: [HACKERS] [RFC] Minmax indexes

2013-06-15 Thread Simon Riggs
On 15 June 2013 00:07, Tom Lane  wrote:

> We've talked a lot about index-organized tables in the past.  How much
> of the use case for this would be subsumed by a feature like that?

IOTs would only work for one specific organisation, acting like a
multi-column btree. So you could use IOTs for this but only in a
limited way.

MinMax indexes cover multiple columns, possibly even all columns if
desired. So MMIs have much greater usefulness, as well as not
requiring a full reload of the data when you decide you want one.

Personally, I think IOTs are not worth pursuing. IOTs for Postgres
would require major changes to the definition of the heap, so we're a
long way from implementing them directly and coping with all of the
detailed implementation horrors. Grouped Item Tuple indexes are
theoretically equivalent in terms of size and number of block accesses
required to access data. Heikki's earlier work on that could easily be
revived. IOTs would be a multi-year effort with conflicting use cases.

Note that SQLServer and Sybase use block range indexes for primary
indexes, neatly avoiding the hassle Oracle walked into when they chose
to do IOTs. I think we should learn that same lesson ourselves.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [RFC] Minmax indexes

2013-06-15 Thread Simon Riggs
On 15 June 2013 00:01, Josh Berkus  wrote:
> Alvaro,
>
> This sounds really interesting, and I can see the possibilities.
> However ...
>
>> Value changes in columns that are part of a minmax index, and tuple insertion
>> in summarized pages, would invalidate the stored min/max values.  To support
>> this, each minmax index has a validity map; a range can only be considered in a
>> scan if it hasn't been invalidated by such changes (A range "not considered" in
>> the scan needs to be returned in whole regardless of the stored min/max values,
>> that is, it cannot be pruned per query quals).  The validity map is very
>> similar to the visibility map in terms of performance characteristics: quick
>> enough that it's not contentious, allowing updates and insertions to proceed
>> even when data values violate the minmax index conditions.  An invalidated
>> range can be made valid by re-summarization (see below).
>
> This begins to sound like these indexes are only useful on append-only
> tables.  Not that there aren't plenty of those, but ...

The index is basically using the "index only scan" mechanism. The
"only useful on append-only tables" comment would/should apply also to
index only scans. I can't see a reason to raise that specifically for
this index type.


>> Re-summarization is relatively expensive, because the complete page range has
>> to be scanned.
>
> Why?  Why can't we just update the affected pages in the index?

Again, same thing as index-only scans. For IOS, we reset the
visibility info at vacuum. The route proposed here follows exactly the
same timing, same mechanism. I can't see a reason for any difference
between the two.


>>  To avoid this, a table having a minmax index would be
>> configured so that inserts only go to the page(s) at the end of the table;
>> this avoids frequent invalidation of ranges in the middle of the table.  We
>> provide a table reloption that tweaks the FSM behavior, so that summarized
>> pages are not candidates for insertion.
>
> We haven't had an index type which modifies table insertion behavior
> before, and I'm not keen to start now; imagine having two indexes on the
> same table each with their own, conflicting, requirements.  This is
> sounding a lot more like a candidate for our prospective pluggable
> storage manager.  Also, the above doesn't help us at all with UPDATEs.
>
> If we're going to start adding reloptions for specific table behavior,
> I'd rather think of all of the optimizations we might have for a
> prospective "append-only table" and bundle those, rather than tying it
> to whether a certain index exists or not.

I agree that the FSM behaviour shouldn't be linked to index existence.

IMHO that should be a separate table parameter, WITH (fsm_mode = append)

Index only scans would also benefit from that.


> Also, I hate the name ... if this feature goes ahead, I'm going to be
> lobbying to change it.  But that's pretty minor compared to the update
> issues.

This feature has already had 3 different names. I don't think the name
is crucial, but it makes sense to give it a name up front. So if you
want to lobby for that then you'd need to come up with a name soon, so
poor Alvaro can cope with name #4.

(There's no consistency in naming from any other implementation either).

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [RFC] Minmax indexes

2013-06-14 Thread Greg Stark
On Fri, Jun 14, 2013 at 11:28 PM, Alvaro Herrera
 wrote:
> Re-summarization is relatively expensive, because the complete page range has
> to be scanned.

That doesn't sound too bad to me. It just means there's a downside to
having larger page ranges. I would expect the page ranges to be
something in the ballpark of 32 pages --  scanning 32 pages to
resummarize doesn't sound that painful but sounds like it's large
enough that the resulting index would be a reasonable size.

But I don't understand why an insert would invalidate a tuple. An insert
can just update the min and max incrementally. It's a delete that
invalidates the range but as you note it doesn't really invalidate it,
just mark it as needing a refresh -- and even then only if the value
being deleted is equal to either the min or max.

> Same-size page ranges?
> Current related literature seems to consider that each "index entry" in a
> minmax index must cover the same number of pages.  There doesn't seem to be a

I assume the reason for this in the literature is the need to quickly
find the summary for a given page when you're handling an insert or
delete. If you have some kind of meta data structure that lets you
find it (which I gather is what the validity map is?) then you
wouldn't need it. But that seems like a difficult cost to justify
compared to just having a 1:1 mapping from block to bitmap tuple.

-- 
greg




Re: [HACKERS] [RFC] Minmax indexes

2013-06-14 Thread Tom Lane
Josh Berkus  writes:
>> To avoid this, a table having a minmax index would be
>> configured so that inserts only go to the page(s) at the end of the table;
>> this avoids frequent invalidation of ranges in the middle of the table.  We
>> provide a table reloption that tweaks the FSM behavior, so that summarized
>> pages are not candidates for insertion.

> We haven't had an index type which modifies table insertion behavior
> before, and I'm not keen to start now; imagine having two indexes on the
> same table each with their own, conflicting, requirements.

I agree; such a restriction is a nonstarter for a secondary index.  I
don't believe that hacking the FSM would be sufficient to guarantee the
required behavior, either.

We've talked a lot about index-organized tables in the past.  How much
of the use case for this would be subsumed by a feature like that?

regards, tom lane




Re: [HACKERS] [RFC] Minmax indexes

2013-06-14 Thread Josh Berkus
Alvaro,

This sounds really interesting, and I can see the possibilities.
However ...

> Value changes in columns that are part of a minmax index, and tuple insertion
> in summarized pages, would invalidate the stored min/max values.  To support
> this, each minmax index has a validity map; a range can only be considered in a
> scan if it hasn't been invalidated by such changes (A range "not considered" in
> the scan needs to be returned in whole regardless of the stored min/max values,
> that is, it cannot be pruned per query quals).  The validity map is very
> similar to the visibility map in terms of performance characteristics: quick
> enough that it's not contentious, allowing updates and insertions to proceed
> even when data values violate the minmax index conditions.  An invalidated
> range can be made valid by re-summarization (see below).

This begins to sound like these indexes are only useful on append-only
tables.  Not that there aren't plenty of those, but ...

> Re-summarization is relatively expensive, because the complete page range has
> to be scanned.

Why?  Why can't we just update the affected pages in the index?

>  To avoid this, a table having a minmax index would be
> configured so that inserts only go to the page(s) at the end of the table;
> this avoids frequent invalidation of ranges in the middle of the table.  We provide
> a table reloption that tweaks the FSM behavior, so that summarized pages are
> not candidates for insertion.

We haven't had an index type which modifies table insertion behavior
before, and I'm not keen to start now; imagine having two indexes on the
same table each with their own, conflicting, requirements.  This is
sounding a lot more like a candidate for our prospective pluggable
storage manager.  Also, the above doesn't help us at all with UPDATEs.

If we're going to start adding reloptions for specific table behavior,
I'd rather think of all of the optimizations we might have for a
prospective "append-only table" and bundle those, rather than tying it
to whether a certain index exists or not.

Also, I hate the name ... if this feature goes ahead, I'm going to be
lobbying to change it.  But that's pretty minor compared to the update
issues.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




[HACKERS] [RFC] Minmax indexes

2013-06-14 Thread Alvaro Herrera
Hi,

This is a preliminary proposal for Minmax indexes.  I'm experimenting
with the code, but it's too crude to post yet, so here's a document
explaining what they are and how they work, so that reviewers can poke
holes to have the design improved.  My intention is to have a patch to
show for CF2, so please do have a look at this and comment.

This is part of the AXLE project http://www.axleproject.eu and the
intention is to support tables of very large size.  In a quick
experiment, I have a table of ~12 GB and its corresponding index is 65
kB in size, making the time to do the equivalent of a seqscan a small
fraction of that taken by a real seqscan.  This technique sits between a
bitmap scan of a normal btree, and a seqscan: the minmax index tells the
bitmap heap scan what pages to seqscan, allowing it to skip a large
fraction of pages that are known not to contain tuples matching the
query quals.  This is a huge win for large data warehouses.  

Without further ado, here's what I propose.


Minmax Range Indexes
====================


Minmax indexes are a new access method intended to enable very fast scanning of
extremely large tables.

The essential idea of a minmax index is to keep track of the min() and max()
values in consecutive groups of heap pages (page ranges).  These values can be
used by constraint exclusion to avoid scanning such pages, depending on query
quals.

The main drawback of this is having to update the stored min/max values of each
page range as tuples are inserted into them.
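The essential idea above, and its drawback, can be shown in a toy Python sketch. This is illustrative only: heap pages are modeled as lists of indexed values, and nothing here reflects the actual access method implementation.

```python
PAGES_PER_RANGE = 2   # toy value; the proposal discusses far larger ranges

def summarize(pages):
    """Build {first_page_of_range: (min, max)} for a whole table.
    pages: list of heap pages, each a list of indexed values."""
    summaries = {}
    for start in range(0, len(pages), PAGES_PER_RANGE):
        vals = [v for p in pages[start:start + PAGES_PER_RANGE] for v in p]
        summaries[start] = (min(vals), max(vals))
    return summaries

def note_insert(summaries, page, value):
    """The main drawback in action: an insert may have to widen the
    stored bounds of the range its page belongs to."""
    start = (page // PAGES_PER_RANGE) * PAGES_PER_RANGE
    mn, mx = summaries[start]
    summaries[start] = (min(mn, value), max(mx, value))
```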

Other database systems already have this feature. Some examples:

* Oracle Exadata calls this "storage indexes"
  http://richardfoote.wordpress.com/category/storage-indexes/

* Netezza has "zone maps"
  http://nztips.com/2010/11/netezza-integer-join-keys/

* Infobright has this automatically within their "data packs"
  http://www.infobright.org/Blog/Entry/organizing_data_and_more_about_rough_data_contest/

* MonetDB seems to have it
  http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2662
  "Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS"

Grammar
-------

To create a minmax index, we use

  CREATE INDEX foo_minmax_idx ON foo USING MINMAX (a, b, e);

Partial indexes are not supported; since an index is concerned with minimum and
maximum values of the involved columns across all the pages in the table, it
doesn't make sense to exclude values.  Another way to see "partial" indexes
here would be those that only considered some pages in the table instead of all
of them; but this would be difficult to implement and manage and, most likely,
pointless.

Expressional indexes can probably be supported in the future, but we disallow
them initially for conceptual simplicity.

Having multiple minmax indexes in the same table is acceptable, though most of
the time it would make more sense to have a single index covering all the
interesting columns.  Multiple indexes might be useful for columns added later.

Access Method Design
--------------------


Since item pointers are not stored inside indexes of this type, it is not
possible to support the amgettuple interface.  Instead, we only provide
amgetbitmap support; scanning a relation using this index requires a recheck
node on top.  The amgetbitmap routine would return a TIDBitmap comprising all
the pages in those page groups that comply with the query quals; the recheck
node prunes tuples that are not visible per snapshot and those that are not
visible per query quals.
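The scan shape just described can be sketched as follows. This is a hedged illustration: `pages_to_scan` stands in for the lossy TIDBitmap that amgetbitmap would build, `recheck` for the recheck node, and all names are invented for the sketch.

```python
def pages_to_scan(summaries, pages_per_range, lo, hi):
    """Every page of every range whose stored [min, max] overlaps the
    qual interval [lo, hi] goes into the (lossy) result."""
    out = []
    for start, (mn, mx) in sorted(summaries.items()):
        if mx >= lo and mn <= hi:            # range might contain matches
            out.extend(range(start, start + pages_per_range))
    return out

def recheck(heap_pages, candidate_pages, lo, hi):
    """The recheck node: re-apply the qual to each tuple, since the
    bitmap only says a page *might* contain matching tuples."""
    return [v for p in candidate_pages for v in heap_pages[p] if lo <= v <= hi]
```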

For each supported datatype, we need an opclass with the following catalog
entries:

- support functions (pg_amproc)
  * pg_proc entry for min()
  * pg_proc entry for max()
- support operators (pg_amop): same as btree (<, <=, =, >=, >)

The min() and max() support functions are used during index construction.
The support operators are used in the optimizer, so that the index is chosen
when queries on the indexed table are planned.  (Also, we use them in the
amgetbitmap routine, to compare ScanKeys and decide whether to emit a certain
block or not).

In each index tuple (corresponding to one page range), we store:
- first block this tuple applies to
- last block this tuple applies to
- for each indexed column:
  * min() value across all tuples in the range
  * max() value across all tuples in the range
  * nulls present in any tuple?

With the default INDEX_MAX_KEYS of 32, and considering columns of 8-byte length
types (timestamptz, bigint), each tuple would be 524 bytes in length, which
seems reasonable.  Of course, larger columns are possible, such as varchar, but
creating minmax indexes on such columns seems of little practical usefulness.

This maximum index tuple size is calculated as:
BlockNumber (4 bytes) * 2 + data value (8 bytes) * 32 * 2 + null bitmap (4 bytes)
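Spelling that formula out as plain arithmetic (same numbers as above; not code from the patch):

```python
BLOCK_NUMBER = 4     # sizeof(BlockNumber)
N_KEYS = 32          # default INDEX_MAX_KEYS
VALUE = 8            # e.g. bigint or timestamptz
NULL_BITMAP = 4      # bytes

# first block + last block, then a min and a max per indexed column
tuple_size = BLOCK_NUMBER * 2 + VALUE * N_KEYS * 2 + NULL_BITMAP
print(tuple_size)  # 524
```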


Block ranges mentioned in index entries shouldn't overlap. However, there can
be gaps where some pages have no covering index entry. (In particular, the last
few pages of the table would commonly not be summarized.)

In