Re: [HACKERS] Reducing tuple overhead

2015-06-07 Thread Amit Kapila
On Sun, Jun 7, 2015 at 3:02 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On 23 April 2015 at 17:24, Andres Freund and...@anarazel.de wrote:



 It's hard to see how to save space there without reference to a specific
use case. I see different solutions depending upon whether we assume a low
number of transactions or a high number of transactions.


I have tried to check theoretically how much difference such a
change could give us.

Assuming BLK_SZ = 8192 bytes, page header = 24 bytes, each
line pointer = 4 bytes, and an average tuple of 150 bytes, roughly 53
tuples can be accommodated on one page.  Each of these tuples
contains 12 bytes of transaction information (xmin, xmax,
cid/combocid).  Now, considering that in an average workload 4
transactions operate on a page at the same time (I think for a
workload like pgbench tpc-b it shouldn't be more, otherwise it
should have been visible in perf reports), roughly 4 additional tuples [1]
could be accommodated on a page, which is approximately 7% savings
in space (which in turn means that much less I/O).  This gain
could vary based on tuple size, the number of transactions that can operate
on a page, page size, etc.

Some additional benefits that I could see from such a change:

1. I think we won't need to traverse the whole page while freezing,
so there should be some savings in the freeze operation as well.

2. With this, I think we might also be able to reduce WAL
if we can avoid writing some transaction-related info in the cases where
it is currently stored (update/delete).

3. Another gain could come if we want to add transaction information
to the index segment as well, because if such information can be stored at
page level, then there won't be much impact in adding it there, which
will help us avoid multiple passes of Vacuum (heap and index
could be vacuumed separately, which will definitely help with I/O and
bloat reduction).

[1]
Calc for 4 additional tuples:
(savings from removing trans info from tuples - new space consumed by trans info)
/ new tuple size
= (53 * 12 - 12 * 4) / (150 - 12) = 4.26
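
To sanity-check the arithmetic, here is a small standalone sketch using the
same assumptions (the constants are illustrative, not taken from the
PostgreSQL headers):

#include <stdio.h>

/* Assumptions as above: 8192-byte block, 24-byte page header, 4-byte line
 * pointers, 150-byte average tuple, 12 bytes of per-tuple transaction info,
 * and 4 concurrent transactions whose info would instead be kept once at
 * page level. */
int
main(void)
{
    const int   blksz = 8192, pagehdr = 24, lp = 4;
    const int   tuple = 150, xinfo = 12, xacts_per_page = 4;

    int     tuples_now = (blksz - pagehdr) / (tuple + lp);              /* ~53 */
    int     bytes_saved = tuples_now * xinfo - xacts_per_page * xinfo;  /* 588 */
    double  extra_tuples = (double) bytes_saved / (tuple - xinfo);      /* ~4.26 */

    printf("tuples/page now: %d, extra tuples: %.2f (~%.1f%% space saved)\n",
           tuples_now, extra_tuples,
           100.0 * extra_tuples / (tuples_now + extra_tuples));
    return 0;
}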

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Reducing tuple overhead

2015-06-07 Thread Simon Riggs
On 23 April 2015 at 17:24, Andres Freund and...@anarazel.de wrote:

 Split into a new thread, the other one is already growing fast
 enough. This discussion started at
 http://archives.postgresql.org/message-id/55391469.5010506%40iki.fi

 On April 23, 2015 6:48:57 PM GMT+03:00, Heikki Linnakangas 
 hlinn...@iki.fi wrote:
 Stop right there. You need to reserve enough space on the page to store
 
 an xmax for *every* tuple on the page. Because if you don't, what are
 you going to do when every tuple on the page is deleted by a different
 transaction.
 
 Even if you store the xmax somewhere else than the page header, you
 need
 to reserve the same amount of space for them, so it doesn't help at
 all.

 Depends on how you do it and what you optimize for (disk space, runtime,
 code complexity..).  You can e.g. apply a somewhat similar trick to
 xmin/xmax as done to cmin/cmax; only that the data structure needs to be
 persistent.

 In fact, we already have a combocid-like structure for xids that's
 persistent - multixacts. We could just have one xid saved that's either
 xmin or xmax (indicated by bits) or a multixact.  When a tuple is
 updated/deleted whose xmin is still required we could replace the former
 xmin with a multixact, otherwise just change the flag that it's now a
 xmax without a xmin.  To check visibility and if the xid is a multixact
 we'd just have to look for the relevant member for the actual xmin and
 xmax.

 To avoid excessive overhead when a tuple is repeatedly updated within one
 session we could store some of the data in the combocid entry that we
 anyway need in that case.

 Whether that's feasible complexity wise is debatable, but it's certainly
 possible.
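
For concreteness, here is a purely hypothetical sketch of the kind of layout
described above (the names and flags are invented; this is not PostgreSQL's
actual tuple header):

#include <stdint.h>

/* One stored xid per tuple: either xmin or xmax, chosen by a flag, or a
 * persistent multixact whose members supply the real xmin and xmax.
 * Visibility checks that hit the multixact case look up the relevant
 * member to recover xmin/xmax. */
typedef struct SlimTupleHeader
{
    uint32_t    t_xid;          /* xmin, xmax, or a multixact id */
    uint16_t    t_infomask;     /* carries the flags below, among others */
} SlimTupleHeader;

#define SLIM_XID_IS_XMIN    0x0001  /* t_xid is the inserting transaction */
#define SLIM_XID_IS_XMAX    0x0002  /* t_xid is the deleting transaction  */
#define SLIM_XID_IS_MULTI   0x0004  /* t_xid names a persistent multixact */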


 I do wonder what, in realistic cases, is actually the bigger contributor
 to the overhead. The tuple header or the padding we liberally add in
 many cases...


It's hard to see how to save space there without reference to a specific
use case. I see different solutions depending upon whether we assume a low
number of transactions or a high number of transactions.

A case we could optimize for is insert-mostly tables. But in that case if
you get rid of the xmax then you still have to freeze the tuples later.

I would have thought a better optimization would be to use the xmax for the
xid epoch by default, so that such rows would never need freezing. Then at
least we are using the xmax for something useful in a larger insert-mostly
database rather than just leaving it at zero.
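
As a rough illustration of that idea (names invented for the sketch): pairing
the epoch kept in the otherwise-unused xmax field with xmin yields a
transaction identifier that never wraps, so such tuples would not need
freezing to stay interpretable.

#include <stdint.h>

/* Illustrative only: if the unused xmax of an insert-only tuple held the
 * xid epoch, the (epoch, xmin) pair forms a 64-bit id with no wraparound. */
static inline uint64_t
epoch_qualified_xid(uint32_t epoch_from_xmax, uint32_t xmin)
{
    return ((uint64_t) epoch_from_xmax << 32) | xmin;
}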

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Reducing tuple overhead

2015-06-07 Thread Peter Geoghegan
On Thu, Apr 30, 2015 at 6:54 AM, Robert Haas robertmh...@gmail.com wrote:
 The other, related problem is that the ordering operator might start
 to return different results than it did at index creation time.  For
 example, consider a btree index built on a text column.  Now consider
 'yum update'.  glibc gets updated, collation ordering of various
 strings change, and now you've got tuples that are in the wrong
place in the index, because when the index was built, we thought A <
B, but now we think B < A.  You would think the glibc maintainers
 might avoid such changes in minor releases, or that the Red Hat guys
 would avoid packaging and shipping those changes in minor releases,
 but you'd be wrong.

I would not think that. Unicode Technical Standard #10 states:


Collation order is not fixed.

Over time, collation order will vary: there may be fixes needed as
more information becomes available about languages; there may be new
government or industry standards for the language that require
changes; and finally, new characters added to the Unicode Standard
will interleave with the previously-defined ones. This means that
collations must be carefully versioned.


Also, in the paper Modern B-Tree Techniques, by Goetz Graefe, page
238, it states:


In many operating systems, appropriate functions are provided to
compute a normalized key from a localized string value, date value, or
time value. This functionality is used, for example, to list files in
a directory as appropriate for the local language. Adding
normalization for numeric data types is relatively straightforward, as
is concatenation of multiple normalized values. Database code must not
rely on such operating system code, however. The problem with relying
on operating systems support for database indexes is the update
frequency. An operating system might update its normalization code due
to an error or extension in the code or in the definition of a local
sort order; it is unacceptable, however, if such an update silently
renders existing large database indexes incorrect.


Unfortunately, it is simply not the case that we can rely on OS
collations being immutable. We have *no* contract with any C standard
library concerning collation stability whatsoever. I'm surprised that
we don't hear more about this kind of corruption.
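
A minimal sketch of why (assuming a glibc-style system with en_US.UTF-8
installed): the sign of strcoll() is owned entirely by the C library's
collation tables, so a libc update can silently change it and leave existing
btree entries in the wrong place.

#include <locale.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    /* Byte order never changes... */
    printf("strcmp (\"a\", \"B\") = %d\n", strcmp("a", "B") > 0 ? 1 : -1);

    if (setlocale(LC_COLLATE, "en_US.UTF-8") == NULL)
    {
        fprintf(stderr, "locale not installed\n");
        return 1;
    }
    /* ...but this sign comes from the OS collation data, which can be
     * updated underneath an existing index. */
    printf("strcoll(\"a\", \"B\") = %d\n", strcoll("a", "B") > 0 ? 1 : -1);
    return 0;
}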
-- 
Peter Geoghegan




Re: [HACKERS] Reducing tuple overhead

2015-06-06 Thread Amit Kapila
On Thu, Apr 23, 2015 at 9:54 PM, Andres Freund and...@anarazel.de wrote:

 Split into a new thread, the other one is already growing fast
 enough. This discussion started at
 http://archives.postgresql.org/message-id/55391469.5010506%40iki.fi

 On April 23, 2015 6:48:57 PM GMT+03:00, Heikki Linnakangas 
hlinn...@iki.fi wrote:
 Stop right there. You need to reserve enough space on the page to store
 
 an xmax for *every* tuple on the page. Because if you don't, what are
 you going to do when every tuple on the page is deleted by a different
 transaction.
 
 Even if you store the xmax somewhere else than the page header, you
 need
 to reserve the same amount of space for them, so it doesn't help at
 all.

 Depends on how you do it and what you optimize for (disk space, runtime,
 code complexity..).  You can e.g. apply a somewhat similar trick to
 xmin/xmax as done to cmin/cmax; only that the data structure needs to be
 persistent.

Today, while reading how other databases (that store similar information
at page level) tackle this problem, I came across a link [1] which indicates
that this is controlled by some clauses (options) at the table level.  The idea
seems to be to have some preallocated space for this information in the page
(its minimum and maximum size can be specified by the user, with default
values of course), and if a concurrent write operation requires more space
than that, the operation needs to wait until the space becomes available.

I am not sure if this is the best way, as it depends on how the
preallocated space for transaction information at page level gets reused, but
I still think it is worth considering.


[1] -
https://techiedba.wordpress.com/2011/09/03/what-is-initrans-and-maxtrans/
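
For illustration only (none of these structures or names exist in
PostgreSQL), the scheme described in that link amounts to something like a
small per-page slot area:

#include <stdint.h>

/*
 * Hypothetical per-page transaction slot area, loosely modelled on the
 * INITRANS/MAXTRANS idea above.  A writer claims a free slot, or one whose
 * transaction is no longer running; if all max_slots are in use and the
 * area cannot grow, the writer has to wait.
 */
typedef struct PageXactSlot
{
    uint32_t    xid;        /* transaction currently occupying the slot */
    uint16_t    status;     /* in-progress / committed / aborted hint */
} PageXactSlot;

typedef struct PageXactArea
{
    uint8_t     init_slots; /* preallocated at page creation (INITRANS) */
    uint8_t     max_slots;  /* upper bound on growth (MAXTRANS) */
    uint8_t     used_slots;
    PageXactSlot slots[];   /* init_slots entries, growable to max_slots */
} PageXactArea;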


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Reducing tuple overhead

2015-05-01 Thread Jim Nasby

On 4/30/15 7:37 AM, Robert Haas wrote:

On Thu, Apr 30, 2015 at 8:05 AM, Simon Riggs si...@2ndquadrant.com wrote:

A much better idea is to work out how to avoid index bloat at cause. If we
are running an UPDATE and we cannot get a cleanup lock, we give up and do a
non-HOT update, causing the index to bloat. It seems better to wait for a
short period to see if we can get the cleanup lock. The short period is
currently 0, so let's start there and vary the duration of wait upwards
proportionally as the index gets more bloated.


That only happens if there wasn't already enough space on the page, so we 
need to defrag, yes? If there is enough space we can HOT update without 
the cleanup lock.


What would be useful to know is how often we abort a HOT update because 
of lack of free space; that would indicate to a DBA that a lower fill 
factor may be in order. What would be useful to -hackers would be stats 
on how often an update would have been HOT if only the page had been pruned.



What I'd be worried about there is that it would be very hard to tune
the wait time, and that the operating system scheduling granularity
(10ms?) would be way too long.


[1] indicates between 0.75 and 6ms by default on Linux. I think FreeBSD 
still uses a 1000Hz scheduler (1ms), but it's not as clear.


What might be more promising is ways to avoid holding a pin for a long 
time (like the outer side of a nested loop), or being more aggressive 
about attempting the lock (i.e., lower the threshold to trigger cleaning).


There's also a (in hindsight) questionable bit of logic in 
heap_page_prune_opt(); once we get the cleanup lock we check the page 
free space a second time. If we managed to actually get the lock, we 
should probably just clean it anyway.



But I'm in vigorous agreement with you on one point: the solution to
index bloat (and probably heap bloat, too) is not to clean it up
faster but to create less of it in the first place.  Making more
updates HOT is one way to do that.


+1.


1: 
http://stackoverflow.com/questions/16401294/how-to-know-linux-scheduler-time-slice

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Reducing tuple overhead

2015-05-01 Thread Amit Kapila
On Thu, Apr 30, 2015 at 5:35 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On 25 April 2015 at 01:12, Amit Kapila amit.kapil...@gmail.com wrote:

 On Sat, Apr 25, 2015 at 1:58 AM, Jim Nasby jim.na...@bluetreble.com
wrote:
 
  On 4/23/15 10:40 PM, Amit Kapila wrote:
 
  I agree with you and what I think one of the major reasons of bloat
is that
  Index segment doesn't have visibility information due to which
clearing of
  Index needs to be tied along with heap.  Now if we can move
transaction
  information at page level, then we can even think of having it in
Index
  segment as well and then Index can delete/prune its tuples on its
own
  which can reduce the bloat in index significantly and there is a
benefit
  to Vacuum as well.
 
 
  I don't see how putting visibility at the page level helps indexes at
all. We could already put XMIN in indexes if we wanted, but it won't help,
because...
 

 We can do that by putting transaction info at tuple level in index as
 well but that will be huge increase in size of index unless we devise
 a way to have variable index tuple header rather than a fixed.

  Now this has some downsides as well like Delete
  needs to traverse Index segment as well to Delete mark the tuples, but
  I think the upsides of reducing bloat can certainly outweigh the
downsides.
 
 
  ... which isn't possible. You can not go from a heap tuple to an index
tuple.

 We will have the access to index value during delete, so why do you
 think that we need linkage between heap and index tuple to perform
 Delete operation?  I think we need to think more to design Delete
 .. by CTID, but that should be doable.


 I see some assumptions here that need to be challenged.

 We can keep xmin and/or xmax on index entries. The above discussion
assumes that the information needs to be updated synchronously. We already
store visibility information on index entries using the lazily updated
killtuple mechanism, so I don't see much problem in setting the xmin in a
similar lazy manner. That way when we use the index if xmax is set we use
it, if it is not we check the heap. (And then you get to freeze indexes as
well ;-( )
 Anyway, I have no objection to making index AM pass visibility
information to indexes that wish to know the information, as long as it is
provided lazily.


Providing such information lazily can help to an extent, but I think
it won't help much with bloat reduction. For example, when an
insert tries to insert a row into an index page and finds that there is no
space, it can't kill or overwrite any tuple that is actually dead (unless
the visibility information happened to be updated lazily by that time), which
is, I think, one of the main reasons for index bloat.

 The second assumption is that if we had visibility information in the
index that it would make a difference to bloat. Since as I mention, we
already do have visibility information, I don't see that adding xmax or
xmin would make any difference at all to bloat. So -1 to adding it **for
that reason**.


The visibility information is only updated when such an index item
is accessed (lazy update), and by that time the new space
for the index insertion would already have been used, whereas if the
information is provided synchronously the dead space could be reclaimed much
earlier (for insertions on a page which has dead tuples), which will reduce
the chances of bloat.


 A much better idea is to work out how to avoid index bloat at cause. If
we are running an UPDATE and we cannot get a cleanup lock, we give up and
do a non-HOT update, causing the index to bloat. It seems better to wait
for a short period to see if we can get the cleanup lock. The short period
is currently 0, so let's start there and vary the duration of wait upwards
proportionally as the index gets more bloated.


I think this is a separate and another good way of avoiding
bloat, but independent of this, having something like what we
discussed above will further reduce the chances of bloat for a
non-HOT update in the scenario you describe.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Reducing tuple overhead

2015-04-30 Thread Simon Riggs
On 25 April 2015 at 01:12, Amit Kapila amit.kapil...@gmail.com wrote:

 On Sat, Apr 25, 2015 at 1:58 AM, Jim Nasby jim.na...@bluetreble.com
 wrote:
 
  On 4/23/15 10:40 PM, Amit Kapila wrote:
 
  I agree with you and what I think one of the major reasons of bloat is
 that
  Index segment doesn't have visibility information due to which clearing
 of
  Index needs to be tied along with heap.  Now if we can move transaction
  information at page level, then we can even think of having it in Index
  segment as well and then Index can delete/prune its tuples on its own
  which can reduce the bloat in index significantly and there is a benefit
  to Vacuum as well.
 
 
  I don't see how putting visibility at the page level helps indexes at
 all. We could already put XMIN in indexes if we wanted, but it won't help,
 because...
 

 We can do that by putting transaction info at tuple level in index as
 well but that will be huge increase in size of index unless we devise
 a way to have variable index tuple header rather than a fixed.

  Now this has some downsides as well like Delete
  needs to traverse Index segment as well to Delete mark the tuples, but
  I think the upsides of reducing bloat can certainly outweigh the
 downsides.
 
 
  ... which isn't possible. You can not go from a heap tuple to an index
 tuple.

 We will have the access to index value during delete, so why do you
 think that we need linkage between heap and index tuple to perform
 Delete operation?  I think we need to think more to design Delete
 .. by CTID, but that should be doable.


I see some assumptions here that need to be challenged.

We can keep xmin and/or xmax on index entries. The above discussion assumes
that the information needs to be updated synchronously. We already store
visibility information on index entries using the lazily updated killtuple
mechanism, so I don't see much problem in setting the xmin in a similar
lazy manner. That way when we use the index if xmax is set we use it, if it
is not we check the heap. (And then you get to freeze indexes as well ;-( )
Anyway, I have no objection to making index AM pass visibility information
to indexes that wish to know the information, as long as it is provided
lazily.

The second assumption is that if we had visibility information in the index
that it would make a difference to bloat. Since as I mention, we already do
have visibility information, I don't see that adding xmax or xmin would
make any difference at all to bloat. So -1 to adding it **for that reason**.


A much better idea is to work out how to avoid index bloat at cause. If we
are running an UPDATE and we cannot get a cleanup lock, we give up and do a
non-HOT update, causing the index to bloat. It seems better to wait for a
short period to see if we can get the cleanup lock. The short period is
currently 0, so let's start there and vary the duration of wait upwards
proportionally as the index gets more bloated.

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Reducing tuple overhead

2015-04-30 Thread Robert Haas
On Thu, Apr 30, 2015 at 8:05 AM, Simon Riggs si...@2ndquadrant.com wrote:
 A much better idea is to work out how to avoid index bloat at cause. If we
 are running an UPDATE and we cannot get a cleanup lock, we give up and do a
 non-HOT update, causing the index to bloat. It seems better to wait for a
 short period to see if we can get the cleanup lock. The short period is
 currently 0, so let's start there and vary the duration of wait upwards
 proportionally as the index gets more bloated.

What I'd be worried about there is that it would be very hard to tune
the wait time, and that the operating system scheduling granularity
(10ms?) would be way too long.

But I'm in vigorous agreement with you on one point: the solution to
index bloat (and probably heap bloat, too) is not to clean it up
faster but to create less of it in the first place.  Making more
updates HOT is one way to do that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Reducing tuple overhead

2015-04-30 Thread Robert Haas
On Thu, Apr 30, 2015 at 12:31 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 I think I am missing something here: when is this second
 evaluation needed?  Basically what I understand from index
 insertion is that it evaluates the value to be inserted in index
 before calling nbtree module and then nbtree just inserts the
 value/tuple passed to it.

Sure, but what happens if it doesn't evaluate to the same value?

Consider a tuple where a = 1 and a function f(a).  You insert the
tuple, evaluate f(a), and get 17.  So you insert an index tuple into
the btree with a value of 17, pointing at the tuple.  Now you delete
the tuple, evaluate f(a) again, and this time you get 42.  You search
the btree for an index tuple with a value of 42, and you don't find
one.  But the index tuple is still there.
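
To make that concrete: such a mismatch can only arise if f() is declared
IMMUTABLE while its implementation really isn't.  A deliberately broken
sketch (invented names; only the standard extension-function macros are
real):

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

/* Module-level state: the very thing an IMMUTABLE function must not have. */
static int32 fudge = 15;

PG_FUNCTION_INFO_V1(f);

/* Declared IMMUTABLE in SQL, but each call can return a different value for
 * the same input, so the key stored in the index at insert time may not
 * match the key computed when we later try to refind the entry. */
Datum
f(PG_FUNCTION_ARGS)
{
    int32   a = PG_GETARG_INT32(0);

    fudge += 1;                     /* 16 on the first call */
    PG_RETURN_INT32(a + fudge);     /* f(1) = 17 the first time, then 18, ... */
}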

With the current approach, that doesn't happen, because we effectively
search the entire index for tuples pointing at the heap tuple we're
trying to get rid of.  The only problem with that is that it's
crushingly expensive when the index is large and the number of tuples
we're cleaning out is comparatively small.  But that's a big problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Reducing tuple overhead

2015-04-30 Thread Amit Kapila
On Thu, Apr 30, 2015 at 5:03 PM, Robert Haas robertmh...@gmail.com wrote:

 On Thu, Apr 30, 2015 at 12:31 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
  I think I am missing something here: when is this second
  evaluation needed?  Basically what I understand from index
  insertion is that it evaluates the value to be inserted in index
  before calling nbtree module and then nbtree just inserts the
  value/tuple passed to it.

 Sure, but what happens if it doesn't evaluate to the same value?

 Consider a tuple where a = 1 and a function f(a).  You insert the
 tuple, evaluate f(a), and get 17.  So you insert an index tuple into
 the btree with a value of 17, pointing at the tuple.  Now you delete
 the tuple, evaluate f(a) again, and this time you get 42.

As the index expression contains table columns and all the functions
or operators used in the expression must be IMMUTABLE, won't that
guarantee that we avoid such a situation?

If not, then I think for such indexes we might need to search
for the tuple offset in the entire index and mark it as deleted, or maybe
for such indexes follow the current mechanism.  I think it will still
give us a lot of benefit in the more common cases.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Reducing tuple overhead

2015-04-30 Thread Robert Haas
On Thu, Apr 30, 2015 at 9:46 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 As the index expression contains table columns and all the functions
 or operators used in the expression must be IMMUTABLE, won't that
 guarantee that we avoid such a situation?

The concern is that they might be labeled as immutable but not
actually behave that way.

The other, related problem is that the ordering operator might start
to return different results than it did at index creation time.  For
example, consider a btree index built on a text column.  Now consider
'yum update'.  glibc gets updated, collation ordering of various
strings change, and now you've got tuples that are in the wrong
place in the index, because when the index was built, we thought A <
B, but now we think B < A.  You would think the glibc maintainers
might avoid such changes in minor releases, or that the Red Hat guys
would avoid packaging and shipping those changes in minor releases,
but you'd be wrong.  We've seen cases where the master and the standby
were both running RHEL X.Y (same X.Y on both servers) but they didn't
agree on the collation definitions, so queries that worked on the
master failed on the slave.

 If not, then I think for such indexes we might need to search
 for tupleoffset in the entire index and mark it as Delete or maybe
 for such indexes follow the current mechanism.

I think that *is* the current mechanism.

 I think it will still
 give us lot of benefit in more common cases.

It's hard to figure out exactly what can work here.  Aside from
correctness issues, the biggest problem with refinding the index
tuples one-by-one and killing them is that it may suck for bulk
operations.  When you delete one tuple from the table, refinding the
index tuples and killing them immediately is probably the smartest
thing you can do.  If you postpone it, somebody may be forced to split
the page because it's full, whereas if you do it right away, the next
guy who wants to put a tuple on that page may be able to make it fit.
That's a big win.

But if you want to delete 10 million tuples, doing an index scan for
each one may induce a tremendous amount of random I/O on the index
pages, possibly visiting and dirtying the same pages more than once.
Scanning the whole index for tuples to kill avoids that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Reducing tuple overhead

2015-04-30 Thread Amit Kapila
On Thu, Apr 30, 2015 at 7:24 PM, Robert Haas robertmh...@gmail.com wrote:

 On Thu, Apr 30, 2015 at 9:46 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
  As the index expression contains table columns and all the functions
  or operators used in the expression must be IMMUTABLE, won't that
  guarantee that we avoid such a situation?

 The concern is that they might be labeled as immutable but not
 actually behave that way.

 The other, related problem is that the ordering operator might start
 to return different results than it did at index creation time.  For
 example, consider a btree index built on a text column.  Now consider
 'yum update'.  glibc gets updated, collation ordering of various
 strings change, and now you've got tuples that are in the wrong
 place in the index, because when the index was built, we thought A <
 B, but now we think B < A.  You would think the glibc maintainers
 might avoid such changes in minor releases, or that the Red Hat guys
 would avoid packaging and shipping those changes in minor releases,
 but you'd be wrong.  We've seen cases where the master and the standby
 were both running RHEL X.Y (same X.Y on both servers) but they didn't
 agree on the collation definitions, so queries that worked on the
 master failed on the slave.


This is quite similar to an IMMUTABLE function, but for an operator.  I think
guaranteeing the immutable nature of a function is the job of the
application/user.
Oracle uses something similar (DETERMINISTIC functions) for function-based
indexes and asks the user to ensure the DETERMINISTIC property of the function.
I think having synchronous delete (delete from the index at the same time as
from the heap) is one of the main differentiating factors for not having bloat
in indexes in some of the other databases.  If we don't want to change the
current property of functions or operators for indexes, then we can have a new
type of index which carries visibility information, and users are advised to
use such an index only where they can ensure that the functions or operators
for that index are IMMUTABLE.  There is an argument that users might or
might not be able to ensure that, but I think if it can be ensured by users
of other successful databases, then the same should be possible for PostgreSQL
users as well; after all, this can bring a lot of value to the table (avoiding
index bloat) for users.


  I think it will still
  give us lot of benefit in more common cases.

 It's hard to figure out exactly what can work here.  Aside from
 correctness issues, the biggest problem with refinding the index
 tuples one-by-one and killing them is that it may suck for bulk
 operations.  When you delete one tuple from the table, refinding the
 index tuples and killing them immediately is probably the smartest
 thing you can do.  If you postpone it, somebody may be forced to split
 the page because it's full, whereas if you do it right away, the next
 guy who wants to put a tuple on that page may be able to make it fit.
 That's a big win.


Exactly, I have that big win in my mind.

 But if you want to delete 10 million tuples, doing an index scan for
 each one may induce a tremendous amount of random I/O on the index
 pages, possibly visiting and dirtying the same pages more than once.
 Scanning the whole index for tuples to kill avoids that.


Even doing it only from the heap might not be cheap.
I think for doing a BULK delete (as you described), users of other
databases have some workarounds like:
a.
If possible, drop the indexes.
b.
Instead of deleting the records you don't want, SELECT out the
records you do want, drop the old table, and rename the new table.
c.
Archive instead of delete: instead of deleting, just set a flag to mark
a row as archived, invalid, deleted, etc.  This avoids the immediate
overhead of index maintenance.  Ignore the rows temporarily and let
some other job delete them later.  The large downside to this is that it
affects any query on the table.
...
possibly some other ways.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Reducing tuple overhead

2015-04-30 Thread Alvaro Herrera
Robert Haas wrote:
 On Mon, Apr 27, 2015 at 5:01 PM, Jim Nasby jim.na...@bluetreble.com wrote:
  The problem with just having the value is that if *anything* changes between
  how you evaluated the value when you created the index tuple and when you
  evaluate it a second time you'll corrupt your index. This is actually an
  incredibly easy problem to have; witness how we allowed indexing
  timestamptz::date until very recently. That was clearly broken, but because
  we never attempted to re-run the index expression to do vacuuming at least
  we never corrupted the index itself.
 
 True.  But I guess what I don't understand is: how big a deal is this,
 really?  The uncorrupted index can still return wrong answers to
 queries.  The fact that you won't end up with index entries pointing
 to completely unrelated tuples is nice, but if index scans are missing
 tuples that they should see, aren't you still pretty hosed?

As I recall, there's a desire not to run expressions when vacuuming,
because running them means getting a snapshot, in case any functions look
at database state; and you can't get a snapshot while vacuuming, because
that would make the vacuum visible to concurrent processes; in
particular, such snapshots keep all processes' xmin from moving forward.

-- 
Álvaro Herrera   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Reducing tuple overhead

2015-04-29 Thread Robert Haas
On Mon, Apr 27, 2015 at 5:01 PM, Jim Nasby jim.na...@bluetreble.com wrote:
 The problem with just having the value is that if *anything* changes between
 how you evaluated the value when you created the index tuple and when you
 evaluate it a second time you'll corrupt your index. This is actually an
 incredibly easy problem to have; witness how we allowed indexing
 timestamptz::date until very recently. That was clearly broken, but because
 we never attempted to re-run the index expression to do vacuuming at least
 we never corrupted the index itself.

True.  But I guess what I don't understand is: how big a deal is this,
really?  The uncorrupted index can still return wrong answers to
queries.  The fact that you won't end up with index entries pointing
to completely unrelated tuples is nice, but if index scans are missing
tuples that they should see, aren't you still pretty hosed?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Reducing tuple overhead

2015-04-29 Thread Jim Nasby

On 4/29/15 12:18 PM, Robert Haas wrote:

On Mon, Apr 27, 2015 at 5:01 PM, Jim Nasby jim.na...@bluetreble.com wrote:

The problem with just having the value is that if *anything* changes between
how you evaluated the value when you created the index tuple and when you
evaluate it a second time you'll corrupt your index. This is actually an
incredibly easy problem to have; witness how we allowed indexing
timestamptz::date until very recently. That was clearly broken, but because
we never attempted to re-run the index expression to do vacuuming at least
we never corrupted the index itself.


True.  But I guess what I don't understand is: how big a deal is this,
really?  The uncorrupted index can still return wrong answers to
queries.  The fact that you won't end up with index entries pointing
to completely unrelated tuples is nice, but if index scans are missing
tuples that they should see, aren't you still pretty hosed?


Maybe, maybe not. You could argue it's better to miss some rows than 
have completely unrelated ones.


My recollection is that there are other scenarios where this causes problems, 
but that's from several years ago and I wasn't able to find anything in 
a quick search of the archives. I've wondered the same in the past and Tom 
had reasons it was bad, but perhaps they're overstated.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Reducing tuple overhead

2015-04-29 Thread Amit Kapila
On Tue, Apr 28, 2015 at 2:31 AM, Jim Nasby jim.na...@bluetreble.com wrote:

 On 4/25/15 12:12 AM, Amit Kapila wrote:

   ... which isn't possible. You can not go from a heap tuple to an
 index tuple.

 We will have the access to index value during delete, so why do you
 think that we need linkage between heap and index tuple to perform
 Delete operation?  I think we need to think more to design Delete
 .. by CTID, but that should be doable.


 The problem with just having the value is that if *anything* changes
between how you evaluated the value when you created the index tuple and
when you evaluate it a second time you'll corrupt your index.


I think I am missing something here: when is this second
evaluation needed?  Basically, what I understand about index
insertion is that the value to be inserted into the index is evaluated
before calling the nbtree module, and then nbtree just inserts the
value/tuple passed to it.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Reducing tuple overhead

2015-04-27 Thread Jim Nasby

On 4/25/15 12:12 AM, Amit Kapila wrote:

  ... which isn't possible. You can not go from a heap tuple to an
index tuple.

We will have the access to index value during delete, so why do you
think that we need linkage between heap and index tuple to perform
Delete operation?  I think we need to think more to design Delete
.. by CTID, but that should be doable.


The problem with just having the value is that if *anything* changes 
between how you evaluated the value when you created the index tuple and 
when you evaluate it a second time you'll corrupt your index. This is 
actually an incredibly easy problem to have; witness how we allowed 
indexing timestamptz::date until very recently. That was clearly broken, 
but because we never attempted to re-run the index expression to do 
vacuuming at least we never corrupted the index itself.



  This has been discussed in the past.

I have tried to search the archives, but I am not finding what the exact
problem is.


Unfortunately I can't find prior discussion now either... :/
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Reducing tuple overhead

2015-04-24 Thread Amit Kapila
On Sat, Apr 25, 2015 at 1:58 AM, Jim Nasby jim.na...@bluetreble.com wrote:

 On 4/23/15 10:40 PM, Amit Kapila wrote:

 I agree with you and what I think one of the major reasons of bloat is
that
 Index segment doesn't have visibility information due to which clearing
of
 Index needs to be tied along with heap.  Now if we can move transaction
 information at page level, then we can even think of having it in Index
  segment as well and then Index can delete/prune its tuples on its own
 which can reduce the bloat in index significantly and there is a benefit
 to Vacuum as well.


 I don't see how putting visibility at the page level helps indexes at
all. We could already put XMIN in indexes if we wanted, but it won't help,
because...


We can do that by putting transaction info at the tuple level in the index as
well, but that would be a huge increase in the size of the index unless we
devise a way to have a variable index tuple header rather than a fixed one.

 Now this has some downsides as well like Delete
 needs to traverse Index segment as well to Delete mark the tuples, but
 I think the upsides of reducing bloat can certainly outweigh the
downsides.


 ... which isn't possible. You can not go from a heap tuple to an index
tuple.

We will have access to the index value during delete, so why do you
think that we need a linkage between the heap and index tuples to perform
the Delete operation?  I think we need to think more about how to design Delete
... by CTID, but that should be doable.

 This has been discussed in the past.

I have tried to search the archives, but I am not finding what the exact
problem is.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Reducing tuple overhead

2015-04-24 Thread Jim Nasby

On 4/23/15 10:40 PM, Amit Kapila wrote:

I agree with you and what I think one of the major reasons of bloat is that
Index segment doesn't have visibility information due to which clearing of
Index needs to be tied along with heap.  Now if we can move transaction
information at page level, then we can even think of having it in Index
 segment as well and then Index can delete/prune its tuples on its own
which can reduce the bloat in index significantly and there is a benefit
to Vacuum as well.


I don't see how putting visibility at the page level helps indexes at 
all. We could already put XMIN in indexes if we wanted, but it won't 
help, because...



Now this has some downsides as well like Delete
needs to traverse Index segment as well to Delete mark the tuples, but
I think the upsides of reducing bloat can certainly outweigh the downsides.


... which isn't possible. You can not go from a heap tuple to an index 
tuple. This has been discussed in the past. If we could do that then 
vacuum would become REALLY cheap compared to today.


BTW, before actually tackling anything we should try and get more data 
from Robert/Jan about where the extra 80% came from. We don't know if 
it's indexes or what.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Reducing tuple overhead

2015-04-23 Thread Joshua D. Drake


On 04/23/2015 09:42 AM, Jim Nasby wrote:


On 4/23/15 11:24 AM, Andres Freund wrote:

I do wonder what, in realistic cases, is actually the bigger contributor
to the overhead. The tuple header or the padding we liberally add in
many cases...


Assuming you're talking about padding between fields...

Several years ago Enova paid Command Prompt to start work on logical
column ordering, and part of the motivation for that was to allow
re-ordering physical tuples into the most efficient on-disk format
possible. I think I did some tests re-arranging some tables into the
theoretically most efficient order and measuring heap size. I think
there was some modest size improvement, maybe 10-15%? This was several
years ago so it's all foggy. Maybe Josh can find some of this in CMD's
ticketing system?
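
As a tiny standalone illustration of that padding effect (the exact sizes are
ABI-dependent; the comments show typical LP64 results):

#include <stdio.h>

/* Same three fields, two orderings.  On a typical LP64 platform the first
 * layout pays 7 bytes of padding after 'flag' and 4 trailing bytes after
 * 'qty', while the second needs only 3 trailing bytes. */
struct padded  { char flag; double amount; int qty; };   /* usually 24 bytes */
struct compact { double amount; int qty; char flag; };   /* usually 16 bytes */

int
main(void)
{
    printf("padded=%zu compact=%zu\n",
           sizeof(struct padded), sizeof(struct compact));
    return 0;
}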


Yeah I dug around. I don't see anything about size improvement but here 
are our notes:


Alvaro said:

I ended up not producing notes as regularly as I had initially hoped. To 
try and make up for it, here's an update covering everything I've done 
since I started working on this issue.


This patch turned out to be completely different than what we had 
initially thought. We had thought it was going to be a matter of finding 
the places that used attnum and replacing it with either attnum, 
attlognum or attphysnum, depending on what order was necessary at any 
given spot. This wasn't an easy thing to do because there are several 
hundred of those. So it was supposed to be amazingly time-consuming and 
rather boring work.


This has nothing to do with reality: anywhere from parser down to 
optimizer and executor, the way things work is that a list of attributes 
is built, processed, and referenced. Some places assume that the list is 
in a certain order that's always the same order for those three cases. 
So the way to develop this feature is to change those places so that 
instead of receiving the list in one of these orders, they instead 
receive it in a different order.


So what I had to do early on, was find a way to retrieve the sort order 
from catalogs, preserve it when TupleDescriptors are built, and ensure 
the attribute list is extracted from TupleDesc in the correct order. But 
it turned out that this is not enough, because down in the parser guts, 
a target list is constructed; and later, a TupleDescriptor is built from 
the target list. So it's necessary to preserve the sorting info from the 
original tuple descriptor into the target list (which means adding order 
info to Var and TargetEntry nodes), so that the new TupleDesc can also 
have it.


Today I'm finding that even more than that is necessary. It turns out 
that the RangeTableEntries (i.e. the entries in the FROM clause of a 
query) have an item dubbed eref which is a list of column names; due 
to my changes in the earlier parser stages, this list is sorted in 
logical column order; but the code to resolve things such as columns 
used in JOIN/ON clauses walks the list (which is in logical order) and 
then uses the number of times it had to walk the elements in the list to 
construct a Var (column reference) in attnum order -- so it finds a 
different column, and it all fails.


So what I'm doing now is modify the RangeTableEntry node to keep a 
mapping list of logical to identity numbers. Then I'll have to search 
for places using rte->eref->colnames and make sure that they 
correctly use attlognum as index into it.
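
As a purely illustrative sketch (invented names, not actual PostgreSQL code),
the mapping involved is small: given each physical attribute's logical
position, build the inverse map so code that walks column lists in logical
order can still emit the correct attnum.

#include <stdlib.h>

/*
 * Hypothetical helper: attlognum[i] is the logical position (1-based) of
 * physical attribute i+1.  The returned array maps a logical column number
 * back to its physical attnum, which is what must go into a Var.
 */
int *
build_lognum_to_attnum_map(const int *attlognum, int natts)
{
    int *map = calloc(natts + 1, sizeof(int));   /* index 0 unused */

    for (int attnum = 1; attnum <= natts; attnum++)
        map[attlognum[attnum - 1]] = attnum;

    return map;
}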


And then later:

First of all I should note that I discussed the approach mentioned above 
to pgsql-hackers and got a very interesting comment from Tom Lane that 
adding sorting info to Var and TargetEntry nodes was not a very good 
idea because it'd break stored rules whenever a table column changed. So 
I went back and studied that code and noticed that it was really the 
change in RangeTableEntry that's doing the good magic; those other 
changes are fortunately not necessary. (Though they were a necessary 
vehicle for me to understand how the other stuff works.)


I've been continuing to study the backend code looking for uses of 
attribute lists that assume a single ordering. As I get more into it, 
more complex cases appear. The number of cases is fortunately bounded, 
though. Most of the uses of straight attribute lists are in places that 
do not require modification, or require little work or thought to update 
correctly.


However, some other places are not like that. I have fixed SQL 
functions two times now, and I believed the second fix (which seemed 
mostly correct) was going to be the final one, but I found out just now 
that it's not, and the proper fix is going to require 
something a bit more low-level (namely, a projection step that reorders 
columns correctly after the fact). Fortunately, I believe that this 
extra projection step is going to fix a lot of other cases too, which I 
originally had no idea how to attack. Moreover, understanding that bit 
means I also figured out what Tom Lane meant on the second half of his 
response to my original pgsql-hackers comment. So I think we're good on 
that front.

Re: [HACKERS] Reducing tuple overhead

2015-04-23 Thread Petr Jelinek

On 23/04/15 18:24, Andres Freund wrote:

Whether that's feasible complexity wise is debatable, but it's certainly
possible.


I do wonder what, in realistic cases, is actually the bigger contributor
to the overhead. The tuple header or the padding we liberally add in
many cases...



The logical ordering patch + auto optimizations of column layout on 
table creation/rewrite might help partially there.


But what seems to be clear is that we need a more in-depth analysis of 
what really contributes most to the tuple size in various use-cases and 
then we can debate what we can do about it.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Reducing tuple overhead

2015-04-23 Thread Alvaro Herrera
Thanks for posting this.

Joshua D. Drake wrote:

 First of all I should note that I discussed the approach mentioned above to
 pgsql-hackers and got a very interesting comment from Tom Lane that adding
 sorting info to Var and TargetEntry nodes was not a very good idea because
 it'd break stored rules whenever a table column changed. So I went back and
 studied that code and noticed that it was really the change in
 RangeTableEntry that's doing the good magic; those other changes are
 fortunately not necessary. (Though they were a necessary vehicle for me to
 understand how the other stuff works.)

So in the logical columns thread I was saying that this change of eref
didn't work either; it seems that most of the time (if not always) the
list should continue to be in attnum order, and the
logical-ordering-info data should be obtained from the tuple descriptor
and whatever needs to be sorted should be sorted afterwards, i.e. in
setrefs.c, using data previously stored in RelOptInfo.  I tried doing
that and ran into some other mess elsewhere.

 I've been continuing to study the backend code looking for uses of attribute
 lists that assume a single ordering. As I get more into it, more complex
 cases appear. The number of cases is fortunately bounded, though.

(he hopes, and eventually gives up)

 However, some other places are not like that. I have fixed SQL functions
 two times now, and I believed the second fix (which seemed mostly correct)
 was going to be the final one, but I found out just now that
 it's not, and the proper fix is going to require something a bit more
 low-level (namely, a projection step that reorders columns correctly after
 the fact). Fortunately, I believe that this extra projection step is going
 to fix a lot of other cases too, which I originally had no idea how to
 attack. Moreover, understanding that bit means I also figured out what Tom
 Lane meant on the second half of his response to my original pgsql-hackers
 comment. So I think we're good on that front.

I had forgotten my intention to rework things using projection.

In any case, I have posted a lot of patches now which others could
study, if there's interest.  I would sure like to see this split, and
there are multiple benefits such as reduction of padding space.  I'm
sure there's a nonempty set of guys brighter than me that could figure
it out in not that much time.

-- 
Álvaro Herrera   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Reducing tuple overhead

2015-04-23 Thread Amit Kapila
On Fri, Apr 24, 2015 at 1:03 AM, Jim Nasby jim.na...@bluetreble.com wrote:

 On 4/23/15 11:45 AM, Petr Jelinek wrote:

 On 23/04/15 18:24, Andres Freund wrote:

 Whether that's feasible complexity wise is debatable, but it's certainly
 possible.


 I do wonder what, in realistic cases, is actually the bigger contributor
 to the overhead. The tuple header or the padding we liberally add in
 many cases...


 The logical ordering patch + auto optimizations of column layout on
 table creation/rewrite might help partially there.

 But what seems to be clear is that we need more in depth analysis of
 what really contributes most to the tuple size in various use-cases and
 then we can debate what we can do about it.


 Also, what Robert posted was that while we started at something like
15%-30% larger, we ended the test at 80% larger. That makes me think that
the bigger win is not in reducing tuple size but tackling bloat.


I agree with you, and I think one of the major reasons for bloat is that
the Index segment doesn't have visibility information, due to which clearing of
the Index needs to be tied to the heap.  Now if we can move transaction
information to the page level, then we can even think of having it in the Index
segment as well, and then the Index can delete/prune its tuples on its own,
which can reduce the bloat in the index significantly, and there is a benefit
to Vacuum as well.  Now this has some downsides as well, like Delete
needs to traverse the Index segment as well to delete-mark the tuples, but
I think the upsides of reducing bloat can certainly outweigh the downsides.

In short, moving transaction information to the page level can not only
reduce the tuple size but can also help in reducing bloat.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Reducing tuple overhead

2015-04-23 Thread Jim Nasby

On 4/23/15 11:45 AM, Petr Jelinek wrote:

On 23/04/15 18:24, Andres Freund wrote:

Whether that's feasible complexity wise is debatable, but it's certainly
possible.


I do wonder what, in realistic cases, is actually the bigger contributor
to the overhead. The tuple header or the padding we liberally add in
many cases...



The logical ordering patch + auto optimizations of column layout on
table creation/rewrite might help partially there.

But what seems to be clear is that we need more in depth analysis of
what really contributes most to the tuple size in various use-cases and
then we can debate what we can do about it.


Also, what Robert posted was that while we started at something like 
15%-30% larger, we ended the test at 80% larger. That makes me think 
that the bigger win is not in reducing tuple size but in tackling bloat.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

