Re: [HACKERS] all_visible replay aborting due to uninitialized pages

2013-05-30 Thread Andres Freund
On 2013-05-29 23:01:31 -0400, Robert Haas wrote:
 On Wed, May 29, 2013 at 9:57 AM, Andres Freund and...@2ndquadrant.com wrote:
  Thought about that, but given that 9.3's visibilitymap_set already will
  already FPI heap pages I concluded it wouldn't really be an improvement
  since it's only one ||log_heap_page or so there. Not sure what's
  better. Will write the patch and see how it goes.
 
  Ended up using log_newpage_buffer since reusing visibilitymap_set's
  record would break the wal format as we currently do not accept an FPI
  on the heap pages during replay when  9.3. Forcing to upgrade the
  client first would be rather unfriendly...
 
  That has the disadvantage of logging a full heap page since it doesn't
  use the hole optimization but this happens really infrequently, so ...
 
 Yeah, I think it's fine.  The patch also looks fine, although I think
 the comments could use a bit of tidying.  I guess we need to
 back-patch this all the way back to 8.4?  It will require some
 adjustments for the older branches.

I think 9.2 is actually far enough and it should apply there. Before
that we only logged the unsetting of all_visible via
heap_(inset|update|delete)'s wal records not the setting as far as I can
tell. So I don't immediately see a danger  9.2.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] units in postgresql.conf comments

2013-05-30 Thread Heikki Linnakangas

On 30.05.2013 06:43, Bruce Momjian wrote:

On Wed, May 29, 2013 at 09:59:10PM -0400, Peter Eisentraut wrote:

I think these sort of entries don't make much sense:

#wal_sender_timeout = 60s  # in milliseconds; 0 disables

I think we should remove units from the comments when it's clear from
the name or the default value that time units are accepted.


We are documenting what happens when there are no units.  Are people are
going to change '60s' to '50' and assume that is '50s'?  Hopefully not.
I do like the clutter avoidance of removing the units from the comments.


We could make it mandatory to specify the unit in the value. Ie. throw 
an error on wal_sender_timeout = 50:


ERROR: unit required for option wal_sender_timeout
HINT:  Valid units for this parameter are ms, s, min, h, and d.

Then you wouldn't need a comment to explain what the unit of a naked 
value is. The only problem I see with that is backwards-compatibility. 
Old postgresql.conf files containing naked values would no longer work. 
But all you'd need to do is to add in the units, which would work on 
older versions too, and would be good for readability anyway.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] add --throttle to pgbench (submission 3)

2013-05-30 Thread Craig Ringer
On 05/28/2013 07:52 PM, Fabien COELHO wrote:

 However I'm not sure that pg_stat_replication currently has the
 necessary information on either side to measure the lag (in time
 transactions, but how do I know when a transaction was committed? or
 number of transactions?). 

The BDR codebase now has a handy function to report when a transaction
was committed, pg_get_transaction_committime(xid) .

It also adds pg_xlog_wait_remote_apply and pg_xlog_wait_remote_receive
that can be used with pg_current_xlog_location() to wait until one or
all replicas have caught up, or with LSNs from pg_stat_replication to
(say) wait until all replicas have caught up with the most up-to-date one.

I don't think these depend on anything BDR-specific, though Andres or
Álvaro would be able to say for sure. Take a look in:

git://git.postgresql.org/git/users/andresfreund/postgres.git

on the 'bdr' branch. Be aware that it is rebased regularly, though the
'0.4' tag applied earlier today will remain constant and contains the
functions of interest.

I hope this helps.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] units in postgresql.conf comments

2013-05-30 Thread Joshua D. Drake


On 05/30/2013 12:01 AM, Heikki Linnakangas wrote:


We could make it mandatory to specify the unit in the value. Ie. throw
an error on wal_sender_timeout = 50:

ERROR: unit required for option wal_sender_timeout
HINT:  Valid units for this parameter are ms, s, min, h, and d.

Then you wouldn't need a comment to explain what the unit of a naked
value is. The only problem I see with that is backwards-compatibility.
Old postgresql.conf files containing naked values would no longer work.
But all you'd need to do is to add in the units, which would work on
older versions too, and would be good for readability anyway.


I like this idea with one addition. We should have a default unit for 
each. For wal_sender_timeout seconds makes sense, but for 
checkpoint_timeout minutes makes sense (for example).


JD





- Heikki






--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] add --throttle to pgbench (submission 3)

2013-05-30 Thread Craig Ringer
On 05/30/2013 03:10 PM, Craig Ringer wrote:
 On 05/28/2013 07:52 PM, Fabien COELHO wrote:
 However I'm not sure that pg_stat_replication currently has the
 necessary information on either side to measure the lag (in time
 transactions, but how do I know when a transaction was committed? or
 number of transactions?). 
 The BDR codebase now has a handy function to report when a transaction
 was committed, pg_get_transaction_committime(xid) .

 It also adds pg_xlog_wait_remote_apply and pg_xlog_wait_remote_receive
 that can be used with pg_current_xlog_location() to wait until one or
 all replicas have caught up, or with LSNs from pg_stat_replication to
 (say) wait until all replicas have caught up with the most up-to-date one.

 I don't think these depend on anything BDR-specific
They do, however, require changes to Pg core. These aren't functions you
can just borrow and add to an extension, they require additional changes
to core to collect the data they use.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] units in postgresql.conf comments

2013-05-30 Thread Magnus Hagander
On Thu, May 30, 2013 at 3:52 AM, Joshua D. Drake j...@commandprompt.com wrote:

 On 05/30/2013 12:01 AM, Heikki Linnakangas wrote:

 We could make it mandatory to specify the unit in the value. Ie. throw
 an error on wal_sender_timeout = 50:

 ERROR: unit required for option wal_sender_timeout
 HINT:  Valid units for this parameter are ms, s, min, h, and d.

 Then you wouldn't need a comment to explain what the unit of a naked
 value is. The only problem I see with that is backwards-compatibility.
 Old postgresql.conf files containing naked values would no longer work.
 But all you'd need to do is to add in the units, which would work on
 older versions too, and would be good for readability anyway.

In general, I like this. Requiring full specification is never
wrong. Except possibly for thje backwards compatible thing.


 I like this idea with one addition. We should have a default unit for each.
 For wal_sender_timeout seconds makes sense, but for checkpoint_timeout
 minutes makes sense (for example).

This sounds like a good way to make things even more confusing. Right
now the confusion is only in the comments - this would make it
confusing in the actual values.

Requiring a unit seems like a much better idea. That way, there is no
way for confusion.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] add --throttle to pgbench (submission 3)

2013-05-30 Thread Andres Freund
On 2013-05-30 15:54:01 +0800, Craig Ringer wrote:
 On 05/30/2013 03:10 PM, Craig Ringer wrote:
  On 05/28/2013 07:52 PM, Fabien COELHO wrote:
  However I'm not sure that pg_stat_replication currently has the
  necessary information on either side to measure the lag (in time
  transactions, but how do I know when a transaction was committed? or
  number of transactions?). 
  The BDR codebase now has a handy function to report when a transaction
  was committed, pg_get_transaction_committime(xid) .
 
  It also adds pg_xlog_wait_remote_apply and pg_xlog_wait_remote_receive
  that can be used with pg_current_xlog_location() to wait until one or
  all replicas have caught up, or with LSNs from pg_stat_replication to
  (say) wait until all replicas have caught up with the most up-to-date one.
 
  I don't think these depend on anything BDR-specific
 They do, however, require changes to Pg core. These aren't functions you
 can just borrow and add to an extension, they require additional changes
 to core to collect the data they use.

pg_xlog_wait_remote_receive() doesn't require changes afaics and should
be easily packable as an extension. We might want to make it use the
sync commit infrastructure at some point instead of essentially busy
waiting, but...

'committs' - the mapping of xids to timestamp certainly does though.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] units in postgresql.conf comments

2013-05-30 Thread Joshua D. Drake


On 05/30/2013 12:55 AM, Magnus Hagander wrote:


I like this idea with one addition. We should have a default unit for each.
For wal_sender_timeout seconds makes sense, but for checkpoint_timeout
minutes makes sense (for example).


This sounds like a good way to make things even more confusing. Right
now the confusion is only in the comments - this would make it
confusing in the actual values.

Requiring a unit seems like a much better idea. That way, there is no
way for confusion.


I can buy into that.

JD




--
  Magnus Hagander
  Me: http://www.hagander.net/
  Work: http://www.redpill-linpro.com/






--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] units in postgresql.conf comments

2013-05-30 Thread Heikki Linnakangas

On 30.05.2013 10:52, Joshua D. Drake wrote:

On 05/30/2013 12:01 AM, Heikki Linnakangas wrote:


We could make it mandatory to specify the unit in the value. Ie. throw
an error on wal_sender_timeout = 50:

ERROR: unit required for option wal_sender_timeout
HINT: Valid units for this parameter are ms, s, min, h, and d.

Then you wouldn't need a comment to explain what the unit of a naked
value is. The only problem I see with that is backwards-compatibility.
Old postgresql.conf files containing naked values would no longer work.
But all you'd need to do is to add in the units, which would work on
older versions too, and would be good for readability anyway.


I like this idea with one addition. We should have a default unit for
each. For wal_sender_timeout seconds makes sense, but for
checkpoint_timeout minutes makes sense (for example).


Uh, if specifying the unit is mandatory, what exactly would the default 
unit mean?


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Eliminating PD_ALL_VISIBLE, take 2

2013-05-30 Thread Heikki Linnakangas

On 30.05.2013 06:54, Jeff Davis wrote:

Continuation of:

http://www.postgresql.org/message-id/1353551097.11440.128.camel@sussancws0025

Rebased patch attached; no other changes.



@@ -675,6 +675,16 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
}

/*
+* If this page is left over from an upgraded system, it may 
have a
+* PD_ALL_VISIBLE bit set (which is deprecated). If so, clear 
it.
+*/
+   if (PageIsAllVisible(page))
+   {
+   PageClearAllVisible(page);
+   MarkBufferDirty(buf);
+   }
+
+   /*
 * Prune all HOT-update chains in this page.
 *
 * We count tuples removed by the pruning step as removed by 
VACUUM.


That could cause a torn page and checksum failure if checksums are 
enabled. Actually, I think the later PageClearAllVisible() call later in 
the function has the same problem, even without this patch.


Instead of adding a new vmbuffer argument to heap_insert() and friends, 
could we put that into BulkInsertStateData? The new argument is similar 
to the current bulk-insert state in spirit. That would simplify the 
callers and make the heapam API cleaner.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] units in postgresql.conf comments

2013-05-30 Thread Joshua D. Drake


On 05/30/2013 01:14 AM, Heikki Linnakangas wrote:


On 30.05.2013 10:52, Joshua D. Drake wrote:

On 05/30/2013 12:01 AM, Heikki Linnakangas wrote:


We could make it mandatory to specify the unit in the value. Ie. throw
an error on wal_sender_timeout = 50:

ERROR: unit required for option wal_sender_timeout
HINT: Valid units for this parameter are ms, s, min, h, and d.

Then you wouldn't need a comment to explain what the unit of a naked
value is. The only problem I see with that is backwards-compatibility.
Old postgresql.conf files containing naked values would no longer work.
But all you'd need to do is to add in the units, which would work on
older versions too, and would be good for readability anyway.


I like this idea with one addition. We should have a default unit for
each. For wal_sender_timeout seconds makes sense, but for
checkpoint_timeout minutes makes sense (for example).


Uh, if specifying the unit is mandatory, what exactly would the default
unit mean?


Yeah, see my other email. I missed that part. It is late for me. Sorry 
for the noise.


JD



- Heikki





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Eliminating PD_ALL_VISIBLE, take 2

2013-05-30 Thread Heikki Linnakangas

On 30.05.2013 11:26, Heikki Linnakangas wrote:

On 30.05.2013 06:54, Jeff Davis wrote:

Continuation of:

http://www.postgresql.org/message-id/1353551097.11440.128.camel@sussancws0025


Rebased patch attached; no other changes.



@@ -675,6 +675,16 @@ lazy_scan_heap(Relation onerel, LVRelStats
*vacrelstats,
}

/*
+ * If this page is left over from an upgraded system, it may have a
+ * PD_ALL_VISIBLE bit set (which is deprecated). If so, clear it.
+ */
+ if (PageIsAllVisible(page))
+ {
+ PageClearAllVisible(page);
+ MarkBufferDirty(buf);
+ }
+
+ /*
* Prune all HOT-update chains in this page.
*
* We count tuples removed by the pruning step as removed by VACUUM.


That could cause a torn page and checksum failure if checksums are
enabled.


Come to think of it, even without the torn page  checksum issue, do we 
really want to actively clear the all-visible flags after upgrade? For 
tables that haven't been changed, and thus have the all-visible bits 
set, that amounts to a complete rewrite on the first vacuum after 
upgrade. That's expensive.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_dump with postgis extension dumps rules separately

2013-05-30 Thread Dimitri Fontaine
Joe Conway m...@joeconway.com writes:
 Here's a cleaned up version, which also includes documentation. I'll
 commit back to 9.1 in a day or two unless there are any objections.

Looks good to me.

Were you able to test it against an extension containing both rules and
views, to check that pg_dump has no problem with the new set of
dependencies?

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Robert Haas
On Wed, May 29, 2013 at 10:42 AM, Andres Freund and...@2ndquadrant.com wrote:
 FWIW, posix' description about posix_fallocate() doesn't actually say
 *anything* about reading. The guarantee it makes is:
 If posix_fallocate() returns successfully, subsequent writes to the
 specified file data shall not fail due to the lack of free space on the
 file system storage media..

 http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html

 So we don't even know whether we can read. I think that means we need to
 zero the file anyway...

Surely this is undue pessimism.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Greg Smith

On 5/30/13 6:49 AM, Robert Haas wrote:

On Wed, May 29, 2013 at 10:42 AM, Andres Freund and...@2ndquadrant.com wrote:

So we don't even know whether we can read. I think that means we need to
zero the file anyway...


Surely this is undue pessimism.


There have been many occasions where I've found the Linux kernel 
defining support for POSIX behavior with a NOP stub that basically says 
we should make this work one day.  I don't know whether the fallocate 
code is one of those or a fully implemented call.  Based on that 
history, until I see a reader that validates the resulting files are 
good I have to assume they're not.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Andres Freund
On 2013-05-30 06:49:42 -0400, Robert Haas wrote:
 On Wed, May 29, 2013 at 10:42 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  FWIW, posix' description about posix_fallocate() doesn't actually say
  *anything* about reading. The guarantee it makes is:
  If posix_fallocate() returns successfully, subsequent writes to the
  specified file data shall not fail due to the lack of free space on the
  file system storage media..
 
  http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html
 
  So we don't even know whether we can read. I think that means we need to
  zero the file anyway...
 
 Surely this is undue pessimism.

Why? The spec doesn't specify that case and that very well allows other
behaviour. Glibc sure does behave sensibly and zeroes the data
(sysdeps/posix/posix_fallocate64.c for the generic implementation) and
so does linux' fallocate() syscall, but that doesn't say much about
other implementations.

None of the manpages I could find, nor the spec says anything about the
file's contents in the extended range. Given there were at least three
manpages of different origins that didn't specify that behaviour I am
not too optimistic. Why they didn't specify that completely obvious
question is hard to understand from my pov.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Andres Freund
On 2013-05-30 06:55:16 -0400, Greg Smith wrote:
 On 5/30/13 6:49 AM, Robert Haas wrote:
 On Wed, May 29, 2013 at 10:42 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
 So we don't even know whether we can read. I think that means we need to
 zero the file anyway...
 
 Surely this is undue pessimism.
 
 There have been many occasions where I've found the Linux kernel defining
 support for POSIX behavior with a NOP stub that basically says we should
 make this work one day.  I don't know whether the fallocate code is one of
 those or a fully implemented call.  Based on that history, until I see a
 reader that validates the resulting files are good I have to assume they're
 not.

That argument in contrast I find not very convincing though. What was
the last incidence of such a system call that did not just error out
with ENOTSUPP or such?

The linux fallocate call is fully specified for this behaviour and got
added 2.6.23, there wasn't a stub before, so I am far less worried about
it than about the underspecifiedness of posix_fallocate(). Also, if some
system call doesn't follow its documented specifications it's not fully
our problem anymore. If we rely on undocumented behaviour though...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] extensible external toast tuple support

2013-05-30 Thread Andres Freund
Hi,

In
http://archives.postgresql.org/message-id/20130216164231.GA15069%40awork2.anarazel.de
I presented the need for 'indirect' toast tuples which point into memory
instead of a toast table. In the comments to that proposal, off-list and
in-person talks the wish to make that a more general concept has
been voiced.

The previous patch used varattrib_1b_e.va_len_1be to discern between
different types of external tuples. That obviously only works if the
data sizes of all possibly stored datum types are distinct which isn't
nice. So what the newer patch now does is to rename that field into
'va_tag' and decide based on that what kind of Datum we have. To get the
actual length of that datum there now is a VARTAG_SIZE() macro which
maps the tags back to size.
To keep on-disk compatibility the size of an external toast tuple
containing a varatt_external is used as its tag value.

This should allow for fairly easy development of a new compression
scheme for out-of-line toast tuples. It will *not* work for compressed
inline tuples (i.e. VARATT_4B_C). I am not convinced that that is a
problem or that if it is, that it cannot be solved separately.

FWIW, in some quick microbenchmarks I couldn't find any performance
difference due to the slightly more complex size computation which I do
*not* find surprising.

Opinions?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
From 43416ee71033a1bd12bec5e651ff45ea9eeafd56 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Sun, 17 Feb 2013 01:38:17 +0100
Subject: [PATCH] Add support for multiple kinds of external toast datums

There are several usecases where our current representation of external toast
datums is limiting:
* adding new compression schemes
* avoidance of repeated detoasting
* externally decoded toast tuples

For that support 'tags' on external (varattrib_1b_e) varlenas which recoin the
current va_len_1be field to store the tag (or type) of a varlena. To determine
the actual length a macro VARTAG_SIZE(tag) is added which can be used to map
from a tag to the actual length.

This patch adds support for 'indirect' tuples which point to some externally
allocated memory containing a toast tuple. It also implements the stub for a
different compression algorithm.
---
 src/backend/access/heap/tuptoaster.c | 115 ---
 src/include/postgres.h   | 100 +++---
 2 files changed, 183 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index fc37ceb..46c7cf4 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -128,7 +128,7 @@ heap_tuple_fetch_attr(struct varlena * attr)
 struct varlena *
 heap_tuple_untoast_attr(struct varlena * attr)
 {
-	if (VARATT_IS_EXTERNAL(attr))
+	if (VARATT_IS_EXTERNAL_OLDSTYLE(attr))
 	{
 		/*
 		 * This is an externally stored datum --- fetch it back from there
@@ -145,6 +145,19 @@ heap_tuple_untoast_attr(struct varlena * attr)
 			pfree(tmp);
 		}
 	}
+	else if (VARATT_IS_EXTERNAL_COMPRESSED(attr))
+	{
+		elog(ERROR, not yet);
+	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		struct varatt_indirect redirect;
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		attr = (struct varlena *)redirect.pointer;
+		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
+
+		attr = heap_tuple_untoast_attr(attr);
+	}
 	else if (VARATT_IS_COMPRESSED(attr))
 	{
 		/*
@@ -191,7 +204,7 @@ heap_tuple_untoast_attr_slice(struct varlena * attr,
 	char	   *attrdata;
 	int32		attrsize;
 
-	if (VARATT_IS_EXTERNAL(attr))
+	if (VARATT_IS_EXTERNAL_OLDSTYLE(attr))
 	{
 		struct varatt_external toast_pointer;
 
@@ -204,6 +217,13 @@ heap_tuple_untoast_attr_slice(struct varlena * attr,
 		/* fetch it back (compressed marker will get set automatically) */
 		preslice = toast_fetch_datum(attr);
 	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		struct varatt_indirect redirect;
+		VARATT_EXTERNAL_GET_POINTER(redirect, attr);
+		return heap_tuple_untoast_attr_slice(redirect.pointer,
+			 sliceoffset, slicelength);
+	}
 	else
 		preslice = attr;
 
@@ -267,7 +287,7 @@ toast_raw_datum_size(Datum value)
 	struct varlena *attr = (struct varlena *) DatumGetPointer(value);
 	Size		result;
 
-	if (VARATT_IS_EXTERNAL(attr))
+	if (VARATT_IS_EXTERNAL_OLDSTYLE(attr))
 	{
 		/* va_rawsize is the size of the original datum -- including header */
 		struct varatt_external toast_pointer;
@@ -275,6 +295,17 @@ toast_raw_datum_size(Datum value)
 		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
 		result = toast_pointer.va_rawsize;
 	}
+	else if (VARATT_IS_EXTERNAL_COMPRESSED(attr))
+	{
+		elog(ERROR, not yet);
+	}
+	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+	{
+		struct varatt_indirect toast_pointer;
+
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+		return 

Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Greg Smith

On 5/30/13 7:17 AM, Andres Freund wrote:

That argument in contrast I find not very convincing though. What was
the last incidence of such a system call that did not just error out
with ENOTSUPP or such?


http://linux.die.net/man/2/posix_fadvise talks about  POSIX_FADV_NOREUSE 
and POSIX_FADV_WILLNEED being both buggy and quietly mapped to a no-op, 
depending on your version.  I know there were more examples than just 
that one that popped up during the testing of effective_io_concurrency. 
 My starting position has to assume that posix_fallocate can have the 
same sort of surprising behavior that showed up repeatedly when we were 
trying to use posix_fadvise more aggressively.


The way O_SYNC was quietly mapped to O_DSYNC (which isn't the same 
thing) was a similar issue, and that's the first one that left me 
forever skeptical of Linux kernel claims in this area until they are 
explicitly validated:  http://lwn.net/Articles/350225/


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Andres Freund
On 2013-05-30 07:48:51 -0400, Greg Smith wrote:
 On 5/30/13 7:17 AM, Andres Freund wrote:
 That argument in contrast I find not very convincing though. What was
 the last incidence of such a system call that did not just error out
 with ENOTSUPP or such?
 
 http://linux.die.net/man/2/posix_fadvise talks about  POSIX_FADV_NOREUSE and
 POSIX_FADV_WILLNEED being both buggy and quietly mapped to a no-op,
 depending on your version.  I know there were more examples than just that
 one that popped up during the testing of effective_io_concurrency.  My
 starting position has to assume that posix_fallocate can have the same sort
 of surprising behavior that showed up repeatedly when we were trying to use
 posix_fadvise more aggressively.

Uh. How is that a correctness problem? fadvise is a hint which is pretty
different from a fallocate where ignoring would have way much more
severe consequences.
I don't think that's a very meaningful comparison.

 The way O_SYNC was quietly mapped to O_DSYNC (which isn't the same thing)
 was a similar issue, and that's the first one that left me forever skeptical
 of Linux kernel claims in this area until they are explicitly validated:
 http://lwn.net/Articles/350225/

Yea, but that mistake is literally decades old...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-30 Thread Robert Haas
On Thu, May 30, 2013 at 12:06 AM, Jeff Davis pg...@j-davis.com wrote:
 AFAICS, the main benefit of eliminating PD_ALL_VISIBLE is that we
 eliminate one write cycle; that is, we won't dirty the page once to
 hint it and then again to mark it all-visible.  But as of 9.3, that
 should really only be a problem in the insert-only case.  And in that
 case, my proposal to consider all-visible pages as frozen would be a
 huge win, because you'd only need to emit XLOG_HEAP_VISIBLE for every
 page in the heap, rather than XLOG_HEAP_FREEZE.

 Agreed.

Just to quantify that a bit more, I ran this command a couple of times:

dropdb rhaas ; sleep 5 ; createdb ; sleep 5 ; pgbench -i -s 1000 -n;
sleep 5 ; time psql -c checkpoint ; time psql -c 'vacuum'

And also this one:

dropdb rhaas ; sleep 5 ; createdb ; sleep 5 ; pgbench -i -s 1000 -n;
sleep 5 ; time psql -c checkpoint ; time psql -c 'vacuum freeze'

In the first one, the vacuum at the end takes about 25 seconds.  In
the second one, it takes about 15 minutes, during which time there's
one CPU core running at about 10%; the remainder of the time is spent
waiting for disk I/O.  A little follow-up testing shows that the
vacuum emits 88MB of WAL, while the vacuum freeze emits 13GB of WAL.

This is on the 16-core, 64-thread IBM POWER box with the following
non-default configuration settings:

shared_buffers = 8GB
maintenance_work_mem = 1GB
synchronous_commit = off
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
log_line_prefix = '%t [%p] '

Andres' proposal for freezing at the same time we mark pages
all-visible relies on emitting FPIs when we mark pages all-visible,
but I hope that the test above is convincing evidence that it would be
*really* expensive for some users.  My proposal to consider
all-visible pages as frozen avoids that cost, but as far as I can see,
it also requires PD_ALL_VISIBLE to stick around.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Greg Smith

On 5/30/13 7:52 AM, Andres Freund wrote:

fadvise is a hint which is pretty
different from a fallocate where ignoring would have way much more
severe consequences.


Yes, it will.  That's why I want to see it tested.  There is more than 
enough past examples of bad behavior here to be skeptical that this sort 
of API may not work exactly as specified.  If you're willing to believe 
the spec, that's fine, but I think that's dangerously optimistic.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Robert Haas
On Thu, May 30, 2013 at 7:13 AM, Andres Freund and...@2ndquadrant.com wrote:
 Surely this is undue pessimism.

 Why? The spec doesn't specify that case and that very well allows other
 behaviour. Glibc sure does behave sensibly and zeroes the data
 (sysdeps/posix/posix_fallocate64.c for the generic implementation) and
 so does linux' fallocate() syscall, but that doesn't say much about
 other implementations.

 None of the manpages I could find, nor the spec says anything about the
 file's contents in the extended range. Given there were at least three
 manpages of different origins that didn't specify that behaviour I am
 not too optimistic. Why they didn't specify that completely obvious
 question is hard to understand from my pov.

I think they didn't specify it because it IS obvious.  As Stephen
says, it's been understood for decades that allowing unzeroed pages to
be reallocated to some other file is a major security hole.  I think
we can assume that no credible OS does that.  If there's some OS out
there that chooses to fill the pre-extended pages with 0x55 or cat
/dev/urandom instead of 0x00, they probably deserve what they get.
It's hard for me to be believe that anything that silly actually
exists.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-30 Thread Andres Freund
On 2013-05-30 07:54:38 -0400, Robert Haas wrote:
 On Thu, May 30, 2013 at 12:06 AM, Jeff Davis pg...@j-davis.com wrote:
  AFAICS, the main benefit of eliminating PD_ALL_VISIBLE is that we
  eliminate one write cycle; that is, we won't dirty the page once to
  hint it and then again to mark it all-visible.  But as of 9.3, that
  should really only be a problem in the insert-only case.  And in that
  case, my proposal to consider all-visible pages as frozen would be a
  huge win, because you'd only need to emit XLOG_HEAP_VISIBLE for every
  page in the heap, rather than XLOG_HEAP_FREEZE.
 
  Agreed.
 
 Just to quantify that a bit more, I ran this command a couple of times:
 
 dropdb rhaas ; sleep 5 ; createdb ; sleep 5 ; pgbench -i -s 1000 -n;
 sleep 5 ; time psql -c checkpoint ; time psql -c 'vacuum'
 
 And also this one:
 
 dropdb rhaas ; sleep 5 ; createdb ; sleep 5 ; pgbench -i -s 1000 -n;
 sleep 5 ; time psql -c checkpoint ; time psql -c 'vacuum freeze'
 
 In the first one, the vacuum at the end takes about 25 seconds.  In
 the second one, it takes about 15 minutes, during which time there's
 one CPU core running at about 10%; the remainder of the time is spent
 waiting for disk I/O.  A little follow-up testing shows that the
 vacuum emits 88MB of WAL, while the vacuum freeze emits 13GB of WAL.
 
 This is on the 16-core, 64-thread IBM POWER box with the following
 non-default configuration settings:
 
 shared_buffers = 8GB
 maintenance_work_mem = 1GB
 synchronous_commit = off
 checkpoint_segments = 300
 checkpoint_timeout = 15min
 checkpoint_completion_target = 0.9
 log_line_prefix = '%t [%p] '
 
 Andres' proposal for freezing at the same time we mark pages
 all-visible relies on emitting FPIs when we mark pages all-visible,
 but I hope that the test above is convincing evidence that it would be
 *really* expensive for some users.  My proposal to consider
 all-visible pages as frozen avoids that cost

I think I basically suggested treating all visible as frozen, didn't I?
If not, I had lost sync between my fingers and my thoughts which happens
too often ;).
You had noticed that my proposed was lacking a bit around when we omit
FPIs for the page while setting all-visible, but we both thought that we
may find a workaround that - which looking at the page level flag first
basically is.

As far as I understand the trick basically is that we can rely on an FPI
being logged when an action unsetting ALL_VISIBLE is performed. That
all-visible would then make sure the hint-bits marking indvidual tuples
as frozen would hit disk. For that we need to add some more work though,
consider:

1) write tuples on a page
2) freeze page by setting ALL_VISIBLE and setting hint
bits. Setting ALL_VISIBLE is wall logged
3) crash
4) replay ALL_VISIBLE, set it on the page level. The individual tuples
   are *not* guaranteed to be marked frozen.
5) update tuple on the page unsetting all visible. Emits an FPI which
   does *not* have the tuples marked as frozen.

Easy enough and fairly cheap to fix by having a function that checks
that updates the hint bits on a page when unsetting all visible since we
can just set it for all pre-existing tuples.

 but as far as I can see, it also requires PD_ALL_VISIBLE to stick
 around.

Now, I am far from being convinced its a good idea to get rid of
PD_ALL_VISIBLE, but I don't think it does. Except that it currently is
legal for the page level ALL_VISIBLE being set while the corresponding
visibilitymap one isn't there's not much prohibiting us fundamentally
from looking in the vm when we need to know whether the page is all
visible, is there?
To the contrary, this actually seems to be a pretty good case for Jeff's
proposed behaviour since it would allow freezing while only writing the
vm?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Andres Freund
On 2013-05-30 08:02:56 -0400, Robert Haas wrote:
 On Thu, May 30, 2013 at 7:13 AM, Andres Freund and...@2ndquadrant.com wrote:
  Surely this is undue pessimism.
 
  Why? The spec doesn't specify that case and that very well allows other
  behaviour. Glibc sure does behave sensibly and zeroes the data
  (sysdeps/posix/posix_fallocate64.c for the generic implementation) and
  so does linux' fallocate() syscall, but that doesn't say much about
  other implementations.
 
  None of the manpages I could find, nor the spec says anything about the
  file's contents in the extended range. Given there were at least three
  manpages of different origins that didn't specify that behaviour I am
  not too optimistic. Why they didn't specify that completely obvious
  question is hard to understand from my pov.
 
 I think they didn't specify it because it IS obvious.  As Stephen
 says, it's been understood for decades that allowing unzeroed pages to
 be reallocated to some other file is a major security hole.  I think
 we can assume that no credible OS does that.  If there's some OS out
 there that chooses to fill the pre-extended pages with 0x55 or cat
 /dev/urandom instead of 0x00, they probably deserve what they get.
 It's hard for me to be believe that anything that silly actually
 exists.

I don't think there's much danger of getting uninitialized data or
such. That clearly would be insane. I think somebody might interpret it
as read(2) returning an error until the page has been written to which
isn't completely crazy.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Peter Eisentraut
On 5/30/13 7:13 AM, Andres Freund wrote:
 Why? The spec doesn't specify that case and that very well allows other
 behaviour. Glibc sure does behave sensibly and zeroes the data
 (sysdeps/posix/posix_fallocate64.c for the generic implementation) and
 so does linux' fallocate() syscall, but that doesn't say much about
 other implementations.

glibc actually only writes one byte to every file system block, to make
sure the block is allocated.  It doesn't actually zero every byte.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Peter Eisentraut
On 5/30/13 8:02 AM, Robert Haas wrote:
 If there's some OS out
 there that chooses to fill the pre-extended pages with 0x55 or cat
 /dev/urandom instead of 0x00, they probably deserve what they get.

Even that wouldn't be a problem for our purpose.  The only problem would
be if you can't read from the allocated region at all.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Andres Freund
On 2013-05-30 08:19:17 -0400, Peter Eisentraut wrote:
 On 5/30/13 8:02 AM, Robert Haas wrote:
  If there's some OS out
  there that chooses to fill the pre-extended pages with 0x55 or cat
  /dev/urandom instead of 0x00, they probably deserve what they get.
 
 Even that wouldn't be a problem for our purpose.  The only problem would
 be if you can't read from the allocated region at all.

Well, only as long as we only use it for preallocation of wal files. I
am much, much more interested in doing that for the heap. And there that
surely would be a problem.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Andres Freund
On 2013-05-30 08:17:28 -0400, Peter Eisentraut wrote:
 On 5/30/13 7:13 AM, Andres Freund wrote:
  Why? The spec doesn't specify that case and that very well allows other
  behaviour. Glibc sure does behave sensibly and zeroes the data
  (sysdeps/posix/posix_fallocate64.c for the generic implementation) and
  so does linux' fallocate() syscall, but that doesn't say much about
  other implementations.
 
 glibc actually only writes one byte to every file system block, to make
 sure the block is allocated.  It doesn't actually zero every byte.

Which is fine since that guarantees we can read from those areas... And
unless I misremember something that actually guarantees that the rest of
the data is initialized to zero as well. Yes: subsequent reads of data
in the gap shall return bytes with the value 0 until data is actually
written into the gap.

But really, I am not at all concerned about some obscure values being
returned, but about a read() not being successful..

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PostgreSQL 9.3 beta breaks some extensions make install

2013-05-30 Thread Stefan Kaltenbrunner
On 05/29/2013 06:08 PM, Cédric Villemain wrote:
 I just took time to inspect our contribs, USE_PGXS is not supported by all
 of them atm because of SHLIB_PREREQS (it used submake) I have a patch
 pending here to fix that. Once all our contribs can build with USE_PGXS I
 fix the VPATH.
 
 I've added 'most' of the patches to the commitfest... (I am not sure it is 
 required, as this is more bugfix than anything else IMHO)
 See 
 https://commitfest.postgresql.org/action/patch_view?id=1122
 https://commitfest.postgresql.org/action/patch_view?id=1123
 https://commitfest.postgresql.org/action/patch_view?id=1124
 
 
 I stopped trying to add new item after too many failures from 
 https://commitfest.postgresql.org/action/patch_form 
 So one patch is not in the commitfest yet (fix_install_ext_vpath.patch)

failures? what kind of issues did you experience?



Stefan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-30 Thread Heikki Linnakangas

On 30.05.2013 15:12, Andres Freund wrote:

Now, I am far from being convinced its a good idea to get rid of
PD_ALL_VISIBLE, but I don't think it does. Except that it currently is
legal for the page level ALL_VISIBLE being set while the corresponding
visibilitymap one isn't there's not much prohibiting us fundamentally
from looking in the vm when we need to know whether the page is all
visible, is there?


Hmm, so you're suggesting that the visibility map would be *required* to 
interpret the pages correctly. Ie. if you just zapped the visibility 
map, you'd lose critical information and the heap would appear to be 
corrupt. I guess that's possible, but it makes me quite uneasy. At the 
moment, it's relieving to know that it's always safe to just truncate 
the visibility map in case of emergency.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-30 Thread Andres Freund
On 2013-05-30 15:34:04 +0300, Heikki Linnakangas wrote:
 On 30.05.2013 15:12, Andres Freund wrote:
 Now, I am far from being convinced its a good idea to get rid of
 PD_ALL_VISIBLE, but I don't think it does. Except that it currently is
 legal for the page level ALL_VISIBLE being set while the corresponding
 visibilitymap one isn't there's not much prohibiting us fundamentally
 from looking in the vm when we need to know whether the page is all
 visible, is there?
 
 Hmm, so you're suggesting that the visibility map would be *required* to
 interpret the pages correctly. Ie. if you just zapped the visibility map,
 you'd lose critical information and the heap would appear to be corrupt. I
 guess that's possible, but it makes me quite uneasy. At the moment, it's
 relieving to know that it's always safe to just truncate the visibility map
 in case of emergency.

I didn't say its a good idea, just that I don't think Robert's
conclusion is necessarily valid. But requiring only the few kbytes of
the vm to be written instead of all of the heap during freezeing (or
whatever we would call it) has quite some allure, I admit that.

But I think that should be a separate project to reeingineering how
freezing works.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] pg_upgrade -u

2013-05-30 Thread Ray Stell

On May 29, 2013, at 11:07 AM, Bruce Momjian wrote:

 On Wed, May 29, 2013 at 08:59:42AM -0400, Ray Stell wrote:
 [ moved to hacker ]
 The question is whether hard-wiring these helps more than it hurts, and 
 which ones should be hard-wired.

I seems to me that superuser is exactly that special case and that if an 
alternate superuser is hardwired in the src cluster then -u/-U and that 
specific value will be required on both sides of pg_upgrade, no variability is 
needed and perhaps not possible.  You're point is well taken for port.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Stephen Frost
* Peter Eisentraut (pete...@gmx.net) wrote:
 On 5/30/13 7:13 AM, Andres Freund wrote:
  Why? The spec doesn't specify that case and that very well allows other
  behaviour. Glibc sure does behave sensibly and zeroes the data
  (sysdeps/posix/posix_fallocate64.c for the generic implementation) and
  so does linux' fallocate() syscall, but that doesn't say much about
  other implementations.
 
 glibc actually only writes one byte to every file system block, to make
 sure the block is allocated.  It doesn't actually zero every byte.

That goes back to the 'sane implementation' question..  Is there a case
where that would actually be different from writing zeros for the entire
block..?  Is there some OS that gives you random data for the 'hole'
when you write a byte, seek to the start of the next block and then
write another byte?  That actually *would* be against what's documented
and required by spec, no?

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
 On 2013-05-30 08:19:17 -0400, Peter Eisentraut wrote:
  On 5/30/13 8:02 AM, Robert Haas wrote:
   If there's some OS out
   there that chooses to fill the pre-extended pages with 0x55 or cat
   /dev/urandom instead of 0x00, they probably deserve what they get.
  
  Even that wouldn't be a problem for our purpose.  The only problem would
  be if you can't read from the allocated region at all.
 
 Well, only as long as we only use it for preallocation of wal files. I
 am much, much more interested in doing that for the heap. And there that
 surely would be a problem.

Yes, that was my thinking as well.  If posix_fallocate is faster than
writing out 8K of zeros, and the block can immediately be read as if it
had actually been written to, then I'd be very interested in using it to
extend heap files.  As I mentioned in this thread (or perhaps it was
another), I don't think this solves the locking issue around the
relation extention lock, but it might help some and it may make it
easier to tweak that logic to allocate larger chunks in the future.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
 But really, I am not at all concerned about some obscure values being
 returned, but about a read() not being successful..

Alright, so what do we need to do to test this?  We really just need a
short C program written up and then a bunch of folks to run it on
various architectures, right?  Gee, sounds like what the buildfarm was
made for (alright, alright, PostgreSQL isn't exactly a 'short C
program', but you get the idea).  As I recall, Andrew reworked the
buildfarm code to be more modular too..  Anyone have thoughts about how
we could run these kinds of tests with it?  Or do people think that's a
bad idea?

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Andres Freund
On 2013-05-30 08:53:37 -0400, Stephen Frost wrote:
 * Andres Freund (and...@2ndquadrant.com) wrote:
  But really, I am not at all concerned about some obscure values being
  returned, but about a read() not being successful..
 
 Alright, so what do we need to do to test this?  We really just need a
 short C program written up and then a bunch of folks to run it on
 various architectures, right?

After a bit of standard perusing writing a single byte to the end of the
file after the fallocate ought to make at least the reading guaranteed
to be defined. If we did seek(last_byte); write(); posix_fallocate() we
should even always have defined content. Yuck.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
 After a bit of standard perusing writing a single byte to the end of the
 file after the fallocate ought to make at least the reading guaranteed
 to be defined. If we did seek(last_byte); write(); posix_fallocate() we
 should even always have defined content. Yuck.

Alright, but would that actually be any better than just doing what
glibc's posix_fallocate() does in the generic case?  And, to be honest,
it makes me a bit nervous to seek/write like that because it looks like
the typical create a hole setup, which we certainly aren't intending,
yet if the posix_fallocate() call disappeared, or did nothing, or this
code was copied w/o it, or someone didn't understand what it did, we
could end up with that.

Not a fan. :(

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] PostgreSQL 9.3 beta breaks some extensions make install

2013-05-30 Thread Cédric Villemain
Le jeudi 30 mai 2013 14:32:46, Stefan Kaltenbrunner a écrit :
 On 05/29/2013 06:08 PM, Cédric Villemain wrote:
  I just took time to inspect our contribs, USE_PGXS is not supported by
  all of them atm because of SHLIB_PREREQS (it used submake) I have a
  patch pending here to fix that. Once all our contribs can build with
  USE_PGXS I fix the VPATH.
  
  I've added 'most' of the patches to the commitfest... (I am not sure it
  is required, as this is more bugfix than anything else IMHO)
  See
  https://commitfest.postgresql.org/action/patch_view?id=1122
  https://commitfest.postgresql.org/action/patch_view?id=1123
  https://commitfest.postgresql.org/action/patch_view?id=1124
  
  
  I stopped trying to add new item after too many failures from
  https://commitfest.postgresql.org/action/patch_form
  So one patch is not in the commitfest yet (fix_install_ext_vpath.patch)
 
 failures? what kind of issues did you experience?

I didn't sent too much details as I am not sure if it was my setup which 
breaks things or not.

[...]

Just tested with Stephen, looks like a problem with something on my side, 
sorry for the noise. (rekonq 0 - chromium 1)

-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation


signature.asc
Description: This is a digitally signed message part.


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Greg Smith

On 5/30/13 8:50 AM, Stephen Frost wrote:

I don't think this solves the locking issue around the
relation extention lock, but it might help some and it may make it
easier to tweak that logic to allocate larger chunks in the future.


It might make it a bit faster, but it doesn't make it any easier to 
implement.  The messy part of extending relations in larger chunks is 
how to communicate that back into the buffer manager usefully.  The 
extension path causing trouble is RelationGetBufferForTuple calling 
ReadBufferBI.  All of that is passing a single buffer around.  There's 
no simple way I can see to rewrite it to handle more than one at a time.


I have a test case for relation extension that I'm going to package up 
soon.  That makes it easy to see where the problem is at.  Far as I can 
tell the reason it hasn't been fixed before now is that it's a pain to 
write the code.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Freezing without write I/O

2013-05-30 Thread Heikki Linnakangas
Since we're bashing around ideas around freezing, let me write down the 
idea I've been pondering and discussing with various people for years. I 
don't think I invented this myself, apologies to whoever did for not 
giving credit.


The reason we have to freeze is that otherwise our 32-bit XIDs wrap 
around and become ambiguous. The obvious solution is to extend XIDs to 
64 bits, but that would waste a lot space. The trick is to add a field 
to the page header indicating the 'epoch' of the XID, while keeping the 
XIDs in tuple header 32-bit wide (*).


The other reason we freeze is to truncate the clog. But with 64-bit 
XIDs, we wouldn't actually need to change old XIDs on disk to FrozenXid. 
Instead, we could implicitly treat anything older than relfrozenxid as 
frozen.


That's the basic idea. Vacuum freeze only needs to remove dead tuples, 
but doesn't need to dirty pages that contain no dead tuples.


Since we're not storing 64-bit wide XIDs on every tuple, we'd still need 
to replace the XIDs with FrozenXid whenever the difference between the 
smallest and largest XID on a page exceeds 2^31. But that would only 
happen when you're updating the page, in which case the page is dirtied 
anyway, so it wouldn't cause any extra I/O.


This would also be the first step in allowing the clog to grow larger 
than 2 billion transactions, eliminating the need for anti-wraparound 
freezing altogether. You'd still want to truncate the clog eventually, 
but it would be nice to not be pressed against the wall with run vacuum 
freeze now, or the system will shut down.


(*) Adding an epoch is inaccurate, but I like to use that as my mental 
model. If you just add a 32-bit epoch field, then you cannot have xids 
from different epochs on the page, which would be a problem. In reality, 
you would store one 64-bit XID value in the page header, and use that as 
the reference point for all the 32-bit XIDs on the tuples. See 
existing convert_txid() function for how that works. Another method is 
to store the 32-bit xid values in tuple headers as offsets from the 
per-page 64-bit value, but then you'd always need to have the 64-bit 
value at hand when interpreting the XIDs, even if they're all recent.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Robert Haas
On Thu, May 30, 2013 at 8:14 AM, Andres Freund and...@2ndquadrant.com wrote:
 I don't think there's much danger of getting uninitialized data or
 such. That clearly would be insane. I think somebody might interpret it
 as read(2) returning an error until the page has been written to which
 isn't completely crazy.

In the absence of tangible evidence of some implementation that
behaves that way, I think that's just paranoia.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] removing PD_ALL_VISIBLE

2013-05-30 Thread Robert Haas
On Thu, May 30, 2013 at 8:12 AM, Andres Freund and...@2ndquadrant.com wrote:
 As far as I understand the trick basically is that we can rely on an FPI
 being logged when an action unsetting ALL_VISIBLE is performed. That
 all-visible would then make sure the hint-bits marking indvidual tuples
 as frozen would hit disk. For that we need to add some more work though,
 consider:

 1) write tuples on a page
 2) freeze page by setting ALL_VISIBLE and setting hint
 bits. Setting ALL_VISIBLE is wall logged
 3) crash
 4) replay ALL_VISIBLE, set it on the page level. The individual tuples
are *not* guaranteed to be marked frozen.
 5) update tuple on the page unsetting all visible. Emits an FPI which
does *not* have the tuples marked as frozen.

 Easy enough and fairly cheap to fix by having a function that checks
 that updates the hint bits on a page when unsetting all visible since we
 can just set it for all pre-existing tuples.

Basically, yes, though I would say infomask bits rather than hint
bits, since not all of them are only hints, and this case would not
be merely a hint.

 but as far as I can see, it also requires PD_ALL_VISIBLE to stick
 around.

 Now, I am far from being convinced its a good idea to get rid of
 PD_ALL_VISIBLE, but I don't think it does. Except that it currently is
 legal for the page level ALL_VISIBLE being set while the corresponding
 visibilitymap one isn't there's not much prohibiting us fundamentally
 from looking in the vm when we need to know whether the page is all
 visible, is there?
 To the contrary, this actually seems to be a pretty good case for Jeff's
 proposed behaviour since it would allow freezing while only writing the
 vm?

Well, as Heikki points out, I think that's unacceptably dangerous.
Loss or corruption of a single visibility map page means possible loss
of half a gigabyte of data.

Also, if we go that route, looking at the visibility map is no longer
an optimization; it's essential for correctness.  We can't decide to
skip it when it seems expensive, for example, as Jeff was proposing.

There's another thing that's bothering me about this whole discussion,
too.  If looking at another page for the information we need to make
visibility decisions is so cheap that we need not be concerned with
it, then why do we need hint bits?  I realize that it's not quite the
same thing, because CLOG doesn't have as much locality of access as
the visibility map; you're guaranteed to find all the information you
need for a single heap page on a single VM page.  Also, CLOG is
per-tuple, not per-page, and we get a decent speed-up from checking
once for the whole page rather than for each tuple.  Nonetheless, I
think the contrast between Jeff's tests,  which aren't showing much
impact from the increased visibility map traffic, and previous
hint-bit removal tests, which have crashed and burned, may be caused
in part by the fact that our algorithms and locking regimen for
shared_buffers are much more sophisticated than for SLRU.  I'm not
eager to have our design decisions driven by that gap.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_dump with postgis extension dumps rules separately

2013-05-30 Thread Joe Conway
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 05/30/2013 02:02 AM, Dimitri Fontaine wrote:
 Joe Conway m...@joeconway.com writes:
 Here's a cleaned up version, which also includes documentation.
 I'll commit back to 9.1 in a day or two unless there are any
 objections.
 
 Looks good to me.
 
 Were you able to test it against an extension containing both rules
 and views, to check that pg_dump has no problem with the new set
 of dependencies?

PostGIS has both:

test=# \dv
  List of relations
 Schema |   Name| Type |  Owner
- +---+--+--
 public | geography_columns | view | postgres
 public | geometry_columns  | view | postgres
 public | raster_columns| view | postgres
 public | raster_overviews  | view | postgres
(4 rows)

select tablename, rulename from pg_rules ;
tablename |rulename
- --+-
 pg_settings  | pg_settings_n
 pg_settings  | pg_settings_u
 geometry_columns | geometry_columns_delete
 geometry_columns | geometry_columns_update
 geometry_columns | geometry_columns_insert
(5 rows)

8
# pg_dump test  /tmp/test-01.dmp
# dropdb test
# createdb test
# psql test  /tmp/test-01.dmp
SET
SET
SET
SET
SET
SET
CREATE EXTENSION
COMMENT
CREATE EXTENSION
COMMENT
SET
REVOKE
REVOKE
GRANT
GRANT
# pg_dump test  /tmp/test-02.dmp
# diff /tmp/test-01.dmp /tmp/test-02.dmp
...no differences...
8

Joe

- -- 
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting,  24x7 Support
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIcBAEBAgAGBQJRp1ykAAoJEDfy90M199hlLQcP/1OkpFeb99EO9xca0RD+WIHS
FrqhBEJDHA4ujODvitZRMTFjpS1WH4Difm7P05Lvbr1xEUmwuSD6oBw/VQ1p6cxs
RyIvUM1uLVhR/nwjMeymner9kOINPu4rBVKf+7EgPJQcFvZuUSzafNGH1l70p6wk
dXMA2ggjjFdvF6voVxaKkHFbs+uttURNDZ2l0f6eb4QJRZta+NuFCOtIkPTqBESx
oABWuoAutAEX0Z7b0iEyNjjDLduPzjMIqQm8Y6NfsGmkEYd2jrpYVl04T8hcbqf0
vFJ2NvblvuaHoRIhq/ZYbFt9dQKIdoUtNuzR8MOK474mD/VrX6v/xquGh1rcmL4x
1k94Lis3Cf/QEBUEqwKNribkOLemaxEEDVnVTCWSC59FoDbaAZuiwtYspHYwAQED
D3nZ9jknEr+Bziqw6y8KP3wQGAbyssIKdtXFlw2u1BFeFjuWK5pL8vR3vi08j/Ij
diycCOBLotJGlkaHEt7vCNMTbHlIru4d4yblh0hbB6wL6JvI2HbGlK5chPPqIu+O
zHpPGUuTy7lgi+0809k5ceoqYUDJJTo0yu/3BuvLeaZwJqfS9QBIjCdryb/0MCVn
QJ6u3r54aSz4FQHP8iDoDnfbZIAdpCtjlqiTxLARxxYZqWt9nHoW0bC9fZpTkBfT
YxJX5C74NCVHE2Qdqnqx
=fbuL
-END PGP SIGNATURE-


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Behavior of a pg_trgm index for 2 (or 3) character LIKE queries

2013-05-30 Thread Robert Haas
On Wed, May 29, 2013 at 10:51 PM, Amit Langote amitlangot...@gmail.com wrote:
 So, it appears, for search strings consisting of 2 (or  3)
 characters, trigrams can not be utilized. No?

I think that's right.  trigram means a sequence of three characters,
and what's stored in the indexes are three-character sequences from
the original text.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Behavior of a pg_trgm index for 2 (or 3) character LIKE queries

2013-05-30 Thread Amit Langote
On Thu, May 30, 2013 at 11:47 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, May 29, 2013 at 10:51 PM, Amit Langote amitlangot...@gmail.com 
 wrote:
 So, it appears, for search strings consisting of 2 (or  3)
 characters, trigrams can not be utilized. No?

 I think that's right.  trigram means a sequence of three characters,
 and what's stored in the indexes are three-character sequences from
 the original text.


Was there any improvement to pg_trgm in recent past that could make it
better for partial matching (the case in question I suppose) or is
partial-matching a different thing altogether?


--
Amit Langote


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_dump with postgis extension dumps rules separately

2013-05-30 Thread Dimitri Fontaine
Joe Conway m...@joeconway.com writes:
 Were you able to test it against an extension containing both rules
 and views, to check that pg_dump has no problem with the new set
 of dependencies?

 PostGIS has both:
[...]
 # pg_dump test  /tmp/test-02.dmp
 # diff /tmp/test-01.dmp /tmp/test-02.dmp
 ...no differences...

Perfect, thanks!

-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Alvaro Herrera
Greg Smith escribió:

 The messy part of extending relations in larger chunks
 is how to communicate that back into the buffer manager usefully.
 The extension path causing trouble is RelationGetBufferForTuple
 calling ReadBufferBI.  All of that is passing a single buffer
 around.  There's no simple way I can see to rewrite it to handle
 more than one at a time.

No, but we can have it create several pages and insert them into the
FSM.  So they aren't returned to the original caller but are available
to future users.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Greg Smith

On 5/30/13 11:21 AM, Alvaro Herrera wrote:

Greg Smith escribió:


The messy part of extending relations in larger chunks
is how to communicate that back into the buffer manager usefully.
The extension path causing trouble is RelationGetBufferForTuple
calling ReadBufferBI.  All of that is passing a single buffer
around.  There's no simple way I can see to rewrite it to handle
more than one at a time.


No, but we can have it create several pages and insert them into the
FSM.  So they aren't returned to the original caller but are available
to future users.


There's actually a code comment wondering about this topic for the pages 
that are already created, in src/backend/access/heap/hio.c :


Remember the new page as our target for future insertions.
XXX should we enter the new page into the free space map immediately, or 
just keep it for this backend's exclusive use in the short run (until 
VACUUM sees it)?  Seems to depend on whether you expect the current 
backend to make more insertions or not, which is probably a good bet 
most of the time.  So for now, don't add it to FSM yet.


We have to be careful about touching too much at that particular point, 
because it's holding a relation extension lock at the obvious spot to 
make a change.


There's an interesting overlap with these questions about how files are 
extended too, with this comment in that file too, just before the above:


XXX This does an lseek - rather expensive - but at the moment it is the 
only way to accurately determine how many blocks are in a relation.  Is 
it worth keeping an accurate file length in shared memory someplace, 
rather than relying on the kernel to do it for us?


That whole sequence of code took the easy way forward when it was 
written, but it's obvious the harder one (also touching the FSM) was 
considered even then.  The whole sequence needs to be revisited to pull 
off multiple page extension.  I wouldn't say it's hard, but it's enough 
work that I haven't been able to find a block of time to go through the 
whole thing.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Behavior of a pg_trgm index for 2 (or 3) character LIKE queries

2013-05-30 Thread Robert Haas
On Thu, May 30, 2013 at 11:00 AM, Amit Langote amitlangot...@gmail.com wrote:
 Was there any improvement to pg_trgm in recent past that could make it
 better for partial matching (the case in question I suppose) or is
 partial-matching a different thing altogether?

Sorry, I am not sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Behavior of a pg_trgm index for 2 (or 3) character LIKE queries

2013-05-30 Thread Sawada Masahiko
2013/5/31 Amit Langote amitlangot...@gmail.com

 On Thu, May 30, 2013 at 11:47 PM, Robert Haas robertmh...@gmail.com wrote:
  On Wed, May 29, 2013 at 10:51 PM, Amit Langote amitlangot...@gmail.com 
  wrote:
  So, it appears, for search strings consisting of 2 (or  3)
  characters, trigrams can not be utilized. No?
 
  I think that's right.  trigram means a sequence of three characters,
  and what's stored in the indexes are three-character sequences from
  the original text.
 

 Was there any improvement to pg_trgm in recent past that could make it
 better for partial matching (the case in question I suppose) or is
 partial-matching a different thing altogether?


Hi Amit,

following emails are discussed about partial match of pg_trgm.  I hope
will this help.
http://www.postgresql.org/message-id/cahgqgwfjshvv2ngme19wdtw9tefw_w7h2ns4e+yysjkb9wd...@mail.gmail.com
as you may know, if search string contains multibyte characters
trigram key is converted to CRC of 4 byte and it is used as key.
(but only use upper 3 byte from CRC)
so we can do partial matching if KEEPONLYALNUM is enabled.

Regards,


Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] 9.3: Empty arrays returned by array_remove()

2013-05-30 Thread Dean Rasheed
Testing 9.3beta, it seems that array_remove() may return an empty 1-d
array whose upper bound is lower than its lower bound. I know that we
discussed allowing this kind of array, but I don't think that
discussion reached any conclusion, other than to agree that the
current empty 0-d array behaviour would be kept in 9.3.

I don't think it's intentional, but the current code in array_remove()
can return something like this:

SELECT array_dims(array_remove(array[1], 1));
 array_dims

 [1:0]
(1 row)

and so the resulting empty 1-d array won't compare as equal to the
usual 0-d empty array:

SELECT array_remove(array[1], 1) = '{}';
 ?column?
--
 f
(1 row)

The LHS is effectively '[1:0]={}', but we don't currently allow that
syntax, so I don't think we should be returning it (it wouldn't
survive a dump/restore, for example).

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Eliminating PD_ALL_VISIBLE, take 2

2013-05-30 Thread Jeff Davis
On Thu, 2013-05-30 at 11:32 +0300, Heikki Linnakangas wrote:
  That could cause a torn page and checksum failure if checksums are
  enabled.

Thank you, I missed that in the rebase; it should be
MarkBufferDirtyHint().

 Come to think of it, even without the torn page  checksum issue, do we 
 really want to actively clear the all-visible flags after upgrade? For 
 tables that haven't been changed, and thus have the all-visible bits 
 set, that amounts to a complete rewrite on the first vacuum after 
 upgrade. That's expensive.

I expected that question and intended to raise it for discussion when
and if we get past performance concerns (I believe Robert is still not
convinced that the patch is viable).

We have a few options: We can ignore the bit entirely, or we can
aggressively unset it, or we can opportunistically unset it if the page
is already dirty. I don't think we're in a hurry to reuse that bit for
something else, so maybe it's best to just ignore it entirely.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2013-05-30 Thread Josh Berkus
Heikki,

This sounds a lot like my idea for 9.3, which didn't go anywhere.
You've worked out the issues I couldn't, I think.

 Another method is
 to store the 32-bit xid values in tuple headers as offsets from the
 per-page 64-bit value, but then you'd always need to have the 64-bit
 value at hand when interpreting the XIDs, even if they're all recent.

Yeah, -1 on the latter, not least because it would require a 100%
rewrite of the tables in order to upgrade.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2013-05-30 Thread Robert Haas
On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
 and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
 that would waste a lot space. The trick is to add a field to the page header
 indicating the 'epoch' of the XID, while keeping the XIDs in tuple header
 32-bit wide (*).

Check.

 The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we
 wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we
 could implicitly treat anything older than relfrozenxid as frozen.

Check.

 That's the basic idea. Vacuum freeze only needs to remove dead tuples, but
 doesn't need to dirty pages that contain no dead tuples.

Check.

 Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to
 replace the XIDs with FrozenXid whenever the difference between the smallest
 and largest XID on a page exceeds 2^31. But that would only happen when
 you're updating the page, in which case the page is dirtied anyway, so it
 wouldn't cause any extra I/O.

It would cause some extra WAL activity, but it wouldn't dirty the page
an extra time.

 This would also be the first step in allowing the clog to grow larger than 2
 billion transactions, eliminating the need for anti-wraparound freezing
 altogether. You'd still want to truncate the clog eventually, but it would
 be nice to not be pressed against the wall with run vacuum freeze now, or
 the system will shut down.

Interesting.  That seems like a major advantage.

 (*) Adding an epoch is inaccurate, but I like to use that as my mental
 model. If you just add a 32-bit epoch field, then you cannot have xids from
 different epochs on the page, which would be a problem. In reality, you
 would store one 64-bit XID value in the page header, and use that as the
 reference point for all the 32-bit XIDs on the tuples. See existing
 convert_txid() function for how that works. Another method is to store the
 32-bit xid values in tuple headers as offsets from the per-page 64-bit
 value, but then you'd always need to have the 64-bit value at hand when
 interpreting the XIDs, even if they're all recent.

As I see it, the main downsides of this approach are:

(1) It breaks binary compatibility (unless you do something to
provided for it, like put the epoch in the special space).

(2) It consumes 8 bytes per page.  I think it would be possible to get
this down to say 5 bytes per page pretty easily; we'd simply decide
that the low-order 3 bytes of the reference XID must always be 0.
Possibly you could even do with 4 bytes, or 4 bytes plus some number
of extra bits.

(3) You still need to periodically scan the entire relation, or else
have a freeze map as Simon and Josh suggested.

The upsides of this approach as compared with what Andres and I are
proposing are:

(1) It provides a stepping stone towards allowing indefinite expansion
of CLOG, which is quite appealing as an alternative to a hard
shut-down.

(2) It doesn't place any particular requirements on PD_ALL_VISIBLE.  I
don't personally find this much of a benefit as I want to keep
PD_ALL_VISIBLE, but I know Jeff and perhaps others disagree.

Random thought: Could you compute the reference XID based on the page
LSN?  That would eliminate the storage overhead.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2013-05-30 Thread Merlin Moncure
On Thu, May 30, 2013 at 1:39 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
 and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
 that would waste a lot space. The trick is to add a field to the page header
 indicating the 'epoch' of the XID, while keeping the XIDs in tuple header
 32-bit wide (*).

 (3) You still need to periodically scan the entire relation, or else
 have a freeze map as Simon and Josh suggested.

Why is this scan required?

Also, what happens if you delete a tuple on a page when another tuple
on the same page with age  2^32 that is still in an open transaction?

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Vacuum, Freeze and Analyze: the big picture

2013-05-30 Thread Josh Berkus
Folks,

There's currently some great ideas bouncing around about eliminating the
overhead associated with FREEZE.  However, I wanted to take a step back
and take a look at the big picture for VACUUM, FREEZE and ANALYZE.
Otherwise, we're liable to repeat the 8.4 problem of making one
operation better (background vacuum) while making another one worse
(freezing).

The big, big picture is this:

   90% of our users need to think about VACUUM/ANALYZE
   at least 10% of the time
   and 10% of our users need to think about it
   almost 90% of the time.

That's considerably better than was the case 5 years ago, when vacuum
management was a daily or weekly responsibility for nearly 100% of our
users, but it's still not good enough.  Our target should be that only
those with really unusual setups should have to *ever* think about
vacuum and analyze.

So I've set up a wiki page to document the various problems that force
users to think about vacuum and analyze and try to troubleshoot it:

https://wiki.postgresql.org/wiki/VacuumHeadaches

We can also collect suggested solutions here.  I'm looking to create a
long-term development target which removes most of these vacuum
headaches over the next 3 or 4 releases, without making the unremoved
headaches siginficantly worse.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture

2013-05-30 Thread Andres Freund
Hi,

On 2013-05-30 11:48:12 -0700, Josh Berkus wrote:
 There's currently some great ideas bouncing around about eliminating the
 overhead associated with FREEZE.  However, I wanted to take a step back
 and take a look at the big picture for VACUUM, FREEZE and ANALYZE.
 Otherwise, we're liable to repeat the 8.4 problem of making one
 operation better (background vacuum) while making another one worse
 (freezing).

Inhowfar did 8.4 make freezing worse? I can't remember any new problems
there?

I agree that we need to be careful not to make things worse...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture

2013-05-30 Thread Thom Brown
On 30 May 2013 19:48, Josh Berkus j...@agliodbs.com wrote:
 Folks,

 There's currently some great ideas bouncing around about eliminating the
 overhead associated with FREEZE.  However, I wanted to take a step back
 and take a look at the big picture for VACUUM, FREEZE and ANALYZE.
 Otherwise, we're liable to repeat the 8.4 problem of making one
 operation better (background vacuum) while making another one worse
 (freezing).

 The big, big picture is this:

90% of our users need to think about VACUUM/ANALYZE
at least 10% of the time
and 10% of our users need to think about it
almost 90% of the time.

 That's considerably better than was the case 5 years ago, when vacuum
 management was a daily or weekly responsibility for nearly 100% of our
 users, but it's still not good enough.  Our target should be that only
 those with really unusual setups should have to *ever* think about
 vacuum and analyze.

 So I've set up a wiki page to document the various problems that force
 users to think about vacuum and analyze and try to troubleshoot it:

 https://wiki.postgresql.org/wiki/VacuumHeadaches

 We can also collect suggested solutions here.  I'm looking to create a
 long-term development target which removes most of these vacuum
 headaches over the next 3 or 4 releases, without making the unremoved
 headaches siginficantly worse.

Great collection of issues.

I'm not sure I understand this:

Problem: As of 9.3, there's a significant benefit to vacuum freezing
tables early so that index-only scan is enabled, since freezing also
updates the visibility map. However, with default settings, such
freezing only happens for data which is very old. This means that
index-only scan is less effective than it could be for tables which
have relatively infrequent updates and deletes.

Why specifically VACUUM FREEZE rather than regular VACUUM?  I thought
regular VACUUM updated the visibility map too?  And why as of 9.3
instead of 9.2?

--
Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2013-05-30 Thread Heikki Linnakangas

On 30.05.2013 21:46, Merlin Moncure wrote:

On Thu, May 30, 2013 at 1:39 PM, Robert Haasrobertmh...@gmail.com  wrote:

On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas
hlinnakan...@vmware.com  wrote:

The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
that would waste a lot space. The trick is to add a field to the page header
indicating the 'epoch' of the XID, while keeping the XIDs in tuple header
32-bit wide (*).


(3) You still need to periodically scan the entire relation, or else
have a freeze map as Simon and Josh suggested.


Why is this scan required?


To find all the dead tuples and remove them, and advance relfrozenxid. 
That in turn is required so that you can truncate the clog. This scheme 
relies on assuming that everything older than relfrozenxid committed, so 
if there are any aborted XIDs present in the table, you can't advance 
relfrozenxid past them.


Come to think of it, if there are no aborted XIDs in a range of XIDs, 
only commits, then you could just advance relfrozenxid past that range 
and truncate away the clog, without scanning the table. But that's quite 
a special case - generally there would be at least a few aborted XIDs - 
so it's probably not worth adding any special code to take advantage of 
that.



Also, what happens if you delete a tuple on a page when another tuple
on the same page with age  2^32 that is still in an open transaction?


Can't let that happen. Same as today.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture

2013-05-30 Thread Josh Berkus

 Problem: As of 9.3, there's a significant benefit to vacuum freezing
 tables early so that index-only scan is enabled, since freezing also
 updates the visibility map. However, with default settings, such
 freezing only happens for data which is very old. This means that
 index-only scan is less effective than it could be for tables which
 have relatively infrequent updates and deletes.
 
 Why specifically VACUUM FREEZE rather than regular VACUUM?  I thought
 regular VACUUM updated the visibility map too?  And why as of 9.3
 instead of 9.2?

As of 9.2, that was a typo.

Allvisible only gets set if there was some reason for VACUUM to visit
the page anyway, no?  Which means that an insert-only or insert-mostly
table doesn't get set allvisible until FREEZE.  And insert-only tables
are usually very large, and thus really *need* index-only scan.

Hmmm.  I should rewrite that item entirely.  It has nothing to do with
FREEZE, really.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture

2013-05-30 Thread Josh Berkus

 Inhowfar did 8.4 make freezing worse? I can't remember any new problems
 there?

Before the Visibility Map, we always vacuumed all pages in a relation
when it was vacuumed at all.  This means that we froze tuples at
vacuum_min_freeze_age, not at autovacuum_max_freeze_age, which is when
we do it post-8.4.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture

2013-05-30 Thread Andres Freund
On 2013-05-30 12:11:23 -0700, Josh Berkus wrote:
 
  Inhowfar did 8.4 make freezing worse? I can't remember any new problems
  there?
 
 Before the Visibility Map, we always vacuumed all pages in a relation
 when it was vacuumed at all.  This means that we froze tuples at
 vacuum_min_freeze_age, not at autovacuum_max_freeze_age, which is when
 we do it post-8.4.

If we have reason to vacuum the relation we do it at
vacuum_freeze_table_age, not only at autovacuum_max_freeze_age. The
difference is that the latter triggers a vacuum, while the former only
changes a partial vacuum into a full one.

Calling that behaviour unconditionally worse is, err, interesting...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture

2013-05-30 Thread Andres Freund
On 2013-05-30 20:01:01 +0100, Thom Brown wrote:
 Problem: As of 9.3, there's a significant benefit to vacuum freezing
 tables early so that index-only scan is enabled, since freezing also
 updates the visibility map. However, with default settings, such
 freezing only happens for data which is very old. This means that
 index-only scan is less effective than it could be for tables which
 have relatively infrequent updates and deletes.

 Why specifically VACUUM FREEZE rather than regular VACUUM?  I thought
 regular VACUUM updated the visibility map too?

It does. It's after all what it uses to decide which parts of the table
to scan if not doing a full table vacuum.

 And why as of 9.3 instead of 9.2?
 
Mabe because 9.3 updates the vm quicker than earlier version by checking
whether all tuples are visible after we've actually removed the dead
tuples.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture

2013-05-30 Thread Josh Berkus

 If we have reason to vacuum the relation we do it at
 vacuum_freeze_table_age, not only at autovacuum_max_freeze_age. The
 difference is that the latter triggers a vacuum, while the former only
 changes a partial vacuum into a full one.
 
 Calling that behaviour unconditionally worse is, err, interesting...

*overall* it's better.  But as far as FREEZE itself is concerned, it's
worse.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2013-05-30 Thread Andres Freund
On 2013-05-30 14:39:46 -0400, Robert Haas wrote:
  Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to
  replace the XIDs with FrozenXid whenever the difference between the smallest
  and largest XID on a page exceeds 2^31. But that would only happen when
  you're updating the page, in which case the page is dirtied anyway, so it
  wouldn't cause any extra I/O.
 
 It would cause some extra WAL activity, but it wouldn't dirty the page
 an extra time.

You probably could do it similarly to how we currently do
XLOG_HEAP_ALL_VISIBLE_CLEARED and just recheck the page on replay. The
insert/update/delete record will already contain a FPI if necessary, so
that should be safe.

  This would also be the first step in allowing the clog to grow larger than 2
  billion transactions, eliminating the need for anti-wraparound freezing
  altogether. You'd still want to truncate the clog eventually, but it would
  be nice to not be pressed against the wall with run vacuum freeze now, or
  the system will shut down.

 Interesting.  That seems like a major advantage.

Hm. Why? If freezing gets notably cheaper I don't really see much need
for keeping that much clog around? If we still run into anti-wraparound
areas, there has to be some major operational problem.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Behavior of a pg_trgm index for 2 (or 3) character LIKE queries

2013-05-30 Thread Alexander Korotkov
On Thu, May 30, 2013 at 12:49 PM, Sawada Masahiko sawada.m...@gmail.comwrote:

 following emails are discussed about partial match of pg_trgm.  I hope
  will this help.
 
 http://www.postgresql.org/message-id/cahgqgwfjshvv2ngme19wdtw9tefw_w7h2ns4e+yysjkb9wd...@mail.gmail.com
 
 as you may know, if search string contains multibyte characters
 trigram key is converted to CRC of 4 byte and it is used as key.
 (but only use upper 3 byte from CRC)
 so we can do partial matching if KEEPONLYALNUM is enabled.


Please, read the further discussion on that thread. We can't do partial
matching because of CRC independently of KEEPONLYALNUM.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Vacuum, Freeze and Analyze: the big picture

2013-05-30 Thread Andres Freund
On 2013-05-30 12:18:29 -0700, Josh Berkus wrote:
 
  If we have reason to vacuum the relation we do it at
  vacuum_freeze_table_age, not only at autovacuum_max_freeze_age. The
  difference is that the latter triggers a vacuum, while the former only
  changes a partial vacuum into a full one.
  
  Calling that behaviour unconditionally worse is, err, interesting...
 
 *overall* it's better.  But as far as FREEZE itself is concerned, it's
 worse.

I am not trying to give you a hard time, but I really can't follow. In
8.3 we only froze tuples that were older than vacuum_freeze_min_age,
just as today (although the default was higher then than today). 100mio
transactions is long enough that you almost guaranteedly be in a
different checkpoint cycle when freezing than when initially writing the
tuple's buffer. So independent of the time the buffer is frozen (be it
a) we always scan the whole relation, b) we have a partial vacuum
upgraded to a full one due to vacuum_freeze_table_age c) an anti
wraparound vacuum) we will usually write a buffer multiple times.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2013-05-30 Thread Bruce Momjian
On Thu, May 30, 2013 at 04:33:50PM +0300, Heikki Linnakangas wrote:
 This would also be the first step in allowing the clog to grow
 larger than 2 billion transactions, eliminating the need for
 anti-wraparound freezing altogether. You'd still want to truncate
 the clog eventually, but it would be nice to not be pressed against
 the wall with run vacuum freeze now, or the system will shut down.

Keep in mind that autovacuum_freeze_max_age is 200M to allow faster clog
truncation.  Does this help that?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Behavior of a pg_trgm index for 2 (or 3) character LIKE queries

2013-05-30 Thread Amit Langote
On Fri, May 31, 2013 at 4:25 AM, Alexander Korotkov
aekorot...@gmail.com wrote:
 On Thu, May 30, 2013 at 12:49 PM, Sawada Masahiko sawada.m...@gmail.com
 wrote:

 following emails are discussed about partial match of pg_trgm.  I hope
 will this help.

 http://www.postgresql.org/message-id/cahgqgwfjshvv2ngme19wdtw9tefw_w7h2ns4e+yysjkb9wd...@mail.gmail.com
 as you may know, if search string contains multibyte characters
 trigram key is converted to CRC of 4 byte and it is used as key.
 (but only use upper 3 byte from CRC)
 so we can do partial matching if KEEPONLYALNUM is enabled.


 Please, read the further discussion on that thread. We can't do partial
 matching because of CRC independently of KEEPONLYALNUM.


Thank you Sawada-san and Alexander.

I think the idea of using trigram text itself rather than its CRC
(due to its problems in partial matching) as GIN key (?) has not been
implemented into pg_trgm yet, right? And even though, such a facility
would  be added, we would still need to handle multibyte characters
case differently (even for partial matching), is that right?



--
Amit Langote


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2013-05-30 Thread Robert Haas
On Thu, May 30, 2013 at 3:22 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-05-30 14:39:46 -0400, Robert Haas wrote:
  Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to
  replace the XIDs with FrozenXid whenever the difference between the 
  smallest
  and largest XID on a page exceeds 2^31. But that would only happen when
  you're updating the page, in which case the page is dirtied anyway, so it
  wouldn't cause any extra I/O.

 It would cause some extra WAL activity, but it wouldn't dirty the page
 an extra time.

 You probably could do it similarly to how we currently do
 XLOG_HEAP_ALL_VISIBLE_CLEARED and just recheck the page on replay. The
 insert/update/delete record will already contain a FPI if necessary, so
 that should be safe.

Ah, good point.

  This would also be the first step in allowing the clog to grow larger than 
  2
  billion transactions, eliminating the need for anti-wraparound freezing
  altogether. You'd still want to truncate the clog eventually, but it would
  be nice to not be pressed against the wall with run vacuum freeze now, or
  the system will shut down.

 Interesting.  That seems like a major advantage.

 Hm. Why? If freezing gets notably cheaper I don't really see much need
 for keeping that much clog around? If we still run into anti-wraparound
 areas, there has to be some major operational problem.

That is true, but we have a decent number of customers who do in fact
have such problems.  I think that's only going to get more common.  As
hardware gets faster and PostgreSQL improves, people are going to
process more and more transactions in shorter and shorter periods of
time.  Heikki's benchmark results for the XLOG scaling patch show
rates of 80,000 tps.  Even at a more modest 10,000 tps, with default
settings, you'll do anti-wraparound vacuums of the entire cluster
about every 8 hours.  That's not fun.

Being able to do such scans only of the not-all-visible pages would be
a huge step forward, of course.  But not having to do them on any
particular deadline would be a whole lot better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Behavior of a pg_trgm index for 2 (or 3) character LIKE queries

2013-05-30 Thread Amit Langote
On Fri, May 31, 2013 at 4:25 AM, Alexander Korotkov
aekorot...@gmail.com wrote:
 On Thu, May 30, 2013 at 12:49 PM, Sawada Masahiko sawada.m...@gmail.com
 wrote:

 following emails are discussed about partial match of pg_trgm.  I hope
 will this help.

 http://www.postgresql.org/message-id/cahgqgwfjshvv2ngme19wdtw9tefw_w7h2ns4e+yysjkb9wd...@mail.gmail.com
 as you may know, if search string contains multibyte characters
 trigram key is converted to CRC of 4 byte and it is used as key.
 (but only use upper 3 byte from CRC)
 so we can do partial matching if KEEPONLYALNUM is enabled.


 Please, read the further discussion on that thread. We can't do partial
 matching because of CRC independently of KEEPONLYALNUM.


Also, a few more questions:

1) When building a trgm index, are there any differences for
multi-byte character strings. For example, would a 2 character
Japanese string (multi-byte offcourse) produce exactly 3 trigrams to
be stored in the index which would later be used while look-up?

2) And if that is so, is there problem in gin_extract_query_trgm(),
that is while generating trigrams from a query search term that causes
trigrams (stored in the index if answer to (1) is yes) NOT to be used
in such a partial matching case?

--
Amit Langote


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freezing without write I/O

2013-05-30 Thread Robert Haas
On Thu, May 30, 2013 at 2:39 PM, Robert Haas robertmh...@gmail.com wrote:
 Random thought: Could you compute the reference XID based on the page
 LSN?  That would eliminate the storage overhead.

After mulling this over a bit, I think this is definitely possible.
We begin a new half-epoch every 2 billion transactions.  We remember
the LSN at which the current half-epoch began and the LSN at which the
previous half-epoch began.  When a new half-epoch begins, the first
backend that wants to stamp a tuple with an XID from the new
half-epoch must first emit a new half-epoch WAL record, which
becomes the starting LSN for the new half-epoch.

We define a new page-level bit, something like PD_RECENTLY_FROZEN.
When this bit is set, it means there are no unfrozen tuples on the
page with XIDs that predate the current half-epoch.  Whenever we know
this to be true, we set the bit.  If the page LSN crosses more than
one half-epoch boundary at a time, we freeze the page and set the bit.
 If the page LSN crosses exactly one half-epoch boundary, then (1) if
the bit is set, we clear it and (2) if the bit is not set, we freeze
the page and set the bit.  The advantage of this is that we avoid an
epidemic of freezing right after a half-epoch change.  Immediately
after a half-epoch change, many pages will mix tuples from the current
and previous half-epoch - but relatively few pages will have tuples
from the current half-epoch and a half-epoch more than one in the
past.

As things stand today, we really only need to remember the last two
half-epoch boundaries; they could be stored, for example, in the
control file.  But if we someday generalize CLOG to allow indefinite
retention as you suggest, we could instead remember all half-epoch
boundaries that have ever occurred; just maintain a file someplace
with 8 bytes of data for every 2 billion XIDs consumed over the
lifetime of the cluster.  In fact, we might want to do it that way
anyhow, just to keep our options open, and perhaps for forensics.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers