Re: [HACKERS] Compression of full-page-writes

2013-10-22 Thread Andres Freund
On 2013-10-22 12:52:09 +0900, Fujii Masao wrote:
 So, our consensus is to introduce the hooks for FPW compression so that
 users can freely select their own best compression algorithm?

No, I don't think that's consensus yet. If you want to make it
configurable on that level you need to have:
1) compression format signature on fpws
2) mapping between identifiers for compression formats and the libraries 
implementing them.

Otherwise you can only change the configuration at initdb time...
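
Concretely, the two pieces might look something like the sketch below. This is purely
illustrative -- none of these names exist in PostgreSQL, and the layout is an assumption,
not an agreed design (uint8/int32 as in c.h):

typedef uint8 FPWCompressionId;     /* identifier stored with each full-page image */

#define FPW_COMPRESSION_NONE    0
#define FPW_COMPRESSION_PGLZ    1
/* values above some threshold reserved for loadable modules */

typedef struct FPWCompressionMethod
{
    FPWCompressionId id;            /* value written into the FPW header */
    const char *library;            /* shared library implementing the format */
    /* both return the output length, or -1 to fall back to storing raw */
    int32 (*compress) (const char *src, int32 srclen, char *dst, int32 dstlen);
    int32 (*decompress) (const char *src, int32 srclen, char *dst, int32 dstlen);
} FPWCompressionMethod;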

 Also, probably we need to implement at least one compression contrib module
 using that hook, maybe it's based on pglz or snappy.

From my tests for toast compression I'd suggest starting with lz4.

I'd suggest starting by publishing test results with more modern
compression formats, but without hacks like increasing padding.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Cube extension point support // GSoC'13

2013-10-22 Thread Alexander Korotkov
On Tue, Oct 22, 2013 at 2:00 AM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:

 Tom Lane t...@sss.pgh.pa.us writes:
  I believe the reason GIST has compress/decompress functions is not for
  TOAST (they predate that, if memory serves), but to allow the on-disk
  representation of an index entry to be different from the data type's
  normal representation in other ways --- think lossy storage in
 particular.

 My understanding of the use case for those functions is to do with
 storing a different data type in the index upper nodes and in the index
 leafs. It should be possible to do that in a non-lossy way, so that you
 would implement compress/decompress and not declare the RECHECK bits.

 Then again I'm talking from 8.3 era memories of when I tried to
 understand GiST enough to code the prefix extension.


Actually, I meant the purpose of this particular decompress function
implementation, not compress/decompress in general. I understand that in
general compress/decompress can do a useful job.

--
With best regards,
Alexander Korotkov.


[HACKERS] Why the asprintf patch is still breaking the buildfarm

2013-10-22 Thread Tom Lane
So I returned from vacation only to find that the buildfarm has a bad case
of acne.  All the Windows members are red or pink, and have been for
awhile.  Sigh.

After some research I believe that I understand the reason for the CHECK
failures, at least:

1. src/port/asprintf.c exhibits a truly touching faith that vsnprintf will
report exactly the number of bytes that would have been required, even if
the buffer is not that large.  While this is what is specified in recent
versions of the POSIX standard, older platforms have much sketchier
behavior.

2. In particular, our own src/port/snprintf.c follows the SUS v2 rule that
it should report the number of bytes it *actually wrote*.  This means
that asprintf.c will never think that its initial 128-byte allocation was
insufficient.  So, on platforms where we use this implementation (notably
including Windows), the result of any asprintf call is effectively
truncated at 128 bytes.

3. I believe the exact cause of the reported failures is that the
add_to_path calls in pg_regress.c result in truncating the value of the
PATH environment variable, causing system() to not find the perl
executable.  (jacana is probably failing because of truncation of
LD_LIBRARY_PATH instead, which is unsurprising since the problem would
move around depending on the directory path lengths in use.)

IMO src/port/asprintf.c is hopelessly naive, as well as ugly and
undocumented.  We should throw it away and replace it with an
implementation more like stringinfo.c's appendStringInfo, which is code
that has been through the wars and is known to be pretty bulletproof these
days.  Aside from the immediate problem, that would allow us to get rid of
the unportable va_copy calls.  (I say they're unportable because no such
functionality is specified in SUS v2.  And no, I do not have any faith at
all in commit c2316dcda1cd057d7d4a56e3a51e3f8f0527e906 as a workaround.)
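
For illustration, an allocate-and-retry loop that does not trust the vsnprintf return
value beyond "did it fit?" could look like the sketch below. This is only a sketch of
the approach described above, not the stringinfo.c code, and psprintf_sketch is an
invented name:

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

char *
psprintf_sketch(const char *fmt, ...)
{
    size_t      len = 128;          /* initial guess */

    for (;;)
    {
        char       *buf = malloc(len);
        va_list     args;
        int         needed;

        if (buf == NULL)
            return NULL;            /* caller decides how to report out-of-memory */

        va_start(args, fmt);
        needed = vsnprintf(buf, len, fmt, args);
        va_end(args);

        /*
         * Trust the result only when it clearly fits: pre-C99 libraries may
         * return -1, or the number of bytes actually written, on overflow.
         */
        if (needed >= 0 && (size_t) needed < len - 1)
            return buf;

        free(buf);

        if (needed >= 0 && (size_t) needed >= len)
            len = (size_t) needed + 2;      /* C99-style report: exact size known */
        else
            len *= 2;                       /* old-style report: just grow and retry */
    }
}

Because the loop lives inside the variadic function, each retry can simply va_start
again, avoiding the va_copy calls mentioned above.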

I have a lot of other gripes about this whole patch, but they can
wait till tomorrow.

regards, tom lane




Re: [HACKERS] Why the asprintf patch is still breaking the buildfarm

2013-10-22 Thread David Rowley
On Tue, Oct 22, 2013 at 8:58 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 So I returned from vacation only to find that the buildfarm has a bad case
 of acne.  All the Windows members are red or pink, and have been for
 awhile.  Sigh.

 After some research I believe that I understand the reason for the CHECK
 failures, at least:

 1. src/port/asprintf.c exhibits a truly touching faith that vsnprintf will
 report exactly the number of bytes that would have been required, even if
 the buffer is not that large.  While this is what is specified in recent
 versions of the POSIX standard, older platforms have much sketchier
 behavior.

 2. In particular, our own src/port/snprintf.c follows the SUS v2 rule that
 it should report the number of bytes it *actually wrote*.  This means
 that asprintf.c will never think that its initial 128-byte allocation was
 insufficient.  So, on platforms where we use this implementation (notably
 including Windows), the result of any asprintf call is effectively
 truncated at 128 bytes.


Thanks for looking at this. I had a bash at trying to figure out why
vcregress check would not work last night and didn't get very far...
I can confirm that you are right: just by changing the 128 into 12800 and
recompiling, vcregress check worked after that.

Regards

David Rowley



 I have a lot of other gripes about this whole patch, but they can
 wait till tomorrow.

 regards, tom lane





Re: [HACKERS] Heavily modified big table bloat even in auto vacuum is running

2013-10-22 Thread Haribabu kommi
On 22 October 2013 10:15 Amit Kapila wrote:
On Mon, Oct 21, 2013 at 10:54 AM, Haribabu kommi haribabu.ko...@huawei.com 
wrote:

 Yes, it's correct. The nkeep counter has the dead tuples which are recently
 dead and are not vacuumed. Removing the vacuumed tuples from the dead tuples
 should leave the same count as the nkeep counter.
 So if we subtract nkeep from num_tuples, that gives us the proper live
 tuples. How about the following statement at the end of the scan of all blocks:

 num_tuples -= nkeep;

Actually what I had in mind was to use nkeep to estimate n_dead_tuples similar 
to how num_tuples is used to estimate n_live_tuples. I think it will match 
what Tom had pointed out in his response (What
would make more sense to me is for VACUUM to estimate the number
of remaining dead tuples somehow and send that in its message.  
However, since the whole point here is that we aren't accounting for 
transactions that commit while VACUUM runs, it's not very clear how 
to do that.)

I changed the patch so that it passes the nkeep counter data to the stats
collector as the relation's new dead-tuple count, just like new_rel_tuples.
The nkeep counter is an approximation of a relation's dead-tuple count.
Instead of resetting the dead-tuple stats to zero, this value is used to set
n_dead_tuples, in the same way as n_live_tuples.
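
For readers following along, the end-of-scan bookkeeping amounts to roughly the
fragment below. The variable names follow lazy_scan_heap(), but new_dead_tuples and
the exact shape of the reporting call are assumptions made here for illustration,
not the committed interface:

    /* at the end of the heap scan */
    new_rel_tuples  = num_tuples;   /* tuples this VACUUM saw as live */
    new_dead_tuples = nkeep;        /* recently dead tuples it could not remove */

    /* report both, instead of implicitly resetting n_dead_tuples to zero */
    pgstat_report_vacuum(RelationGetRelid(onerel), onerel->rd_rel->relisshared,
                         new_rel_tuples, new_dead_tuples);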

Patch is attached in the mail. Please let me know if any changes are required.

Regards,
Hari Babu.



vacuum_fix_v3.patch
Description: vacuum_fix_v3.patch



Re: [HACKERS] Why the asprintf patch is still breaking the buildfarm

2013-10-22 Thread Manlio Perillo

On 22/10/2013 09:58, Tom Lane wrote:

So I returned from vacation only to find that the buildfarm has a bad case
of acne.  All the Windows members are red or pink, and have been for
awhile.  Sigh.

After some research I believe that I understand the reason for the CHECK
failures, at least:

1. src/port/asprintf.c exhibits a truly touching faith that vsnprintf will
report exactly the number of bytes that would have been required, even if
the buffer is not that large.  While this is what is specified in recent
versions of the POSIX standard, older platforms have much sketchier
behavior.



Just to be pedantic, this is required by C99.

 [...]


Regards  Manlio Perillo




Re: [HACKERS] Getting Tuples list from an index

2013-10-22 Thread Albe Laurenz
Naman wrote:
 I have 3 indexes on a relation t2(A,B,C): index1, index2, index3.
 
 What I need is: if I know the index name (say index1), is there any
 programmatic way by which I can
 get the list of tuples which come under the specified index (i.e. index1)?

Do you need anything that exceeds a catalog query?

Yours,
Laurenz Albe



Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2013-10-22 Thread Dimitri Fontaine
Tom Lane t...@sss.pgh.pa.us writes:
 Hm.  It's been a long time since college statistics, but doesn't the
 entire concept of standard deviation depend on the assumption that the
 underlying distribution is more-or-less normal (Gaussian)?  Is there a

I just had a quick chat with a statistician friend of mine on that
topic, and it seems that the only way to make sense of an average is if
you already know the distribution.

In our case, what I keep experiencing with tuning queries is that we
have like 99% of them running under acceptable threshold and 1% of them
taking more and more time.

In a normal (Gaussian) distribution, there would be no query time
farther away from the average than any other, so my experience tells me
that the query time distribution is anything BUT normal (Gaussian).

 good reason to suppose that query runtime is Gaussian?  (I'd bet not;
 in particular, multimodal behavior seems very likely due to things like
 plan changes.)  If not, how much does that affect the usefulness of
 a standard-deviation calculation?

I don't know what multi-modal is.

What I've been gathering from my quick chat this morning is that either
you know how to characterize the distribution, and then the min, max and
average are useful on their own, or you need to keep track of a
histogram where all the bins are of the same size to be able to learn
what the distribution actually is.

We didn't get to the point where I could understand whether storing a
constant-size histogram over log10 of the data, rather than the data
itself, is going to allow us to properly characterize the distribution.
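
As a rough illustration of that idea, a constant-size histogram over log10 of the
execution time could be accumulated like this (plain C sketch, not tied to
pg_stat_statements; all names and bin bounds are invented):

#include <math.h>

#define HIST_NBINS   32
#define HIST_LOG_MIN (-4.0)             /* 10^-4 s = 0.1 ms */
#define HIST_LOG_MAX (3.0)              /* 10^3 s, roughly 17 minutes */

typedef struct QueryTimeHistogram
{
    long        counts[HIST_NBINS];
} QueryTimeHistogram;

static void
hist_add(QueryTimeHistogram *h, double seconds)
{
    int         bin = 0;

    if (seconds > 0)
    {
        double      pos = (log10(seconds) - HIST_LOG_MIN) /
                          (HIST_LOG_MAX - HIST_LOG_MIN) * HIST_NBINS;

        if (pos >= HIST_NBINS)
            bin = HIST_NBINS - 1;       /* clamp very slow queries into the last bin */
        else if (pos > 0)
            bin = (int) pos;            /* very fast queries fall through to bin 0 */
    }
    h->counts[bin]++;
}

Percentile questions like the 95%/99% ones below can then be answered approximately
by walking the cumulative bin counts.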

The main question I want to answer here would be the percentiles one, I
want to get the query max execution timing for 95% of the executions,
then 99%, then 99.9% etc. There's no way to answer that without knowing
the distribution shape, so we need enough stats to learn what the
distribution shape is (hence, histograms).

Of course keeping enough stats seems to always begin with keeping the
min, max and average, so we can just begin there. We would just be
unable to answer interesting questions with just that.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2013-10-22 Thread Daniel Farina
On Tue, Oct 22, 2013 at 2:56 AM, Dimitri Fontaine
dimi...@2ndquadrant.fr wrote:
 Tom Lane t...@sss.pgh.pa.us writes:
 Hm.  It's been a long time since college statistics, but doesn't the
 entire concept of standard deviation depend on the assumption that the
 underlying distribution is more-or-less normal (Gaussian)?  Is there a

 I just had a quick chat with a statistician friends of mine on that
 topic, and it seems that the only way to make sense of an average is if
 you know already the distribution.

 In our case, what I keep experiencing with tuning queries is that we
 have like 99% of them running under acceptable threshold and 1% of them
 taking more and more time.

Agreed.

In a lot of Heroku's performance work, the Perc99 and Perc95 have
provided a lot more value than stddev, although stddev is a lot better
than nothing and probably easier to implement.

There are apparently high-quality statistical approximations of these
that are not expensive to compute and are small in memory representation.

That said, I'd take stddev over nothing for sure.

Handily for stddev, I think that by taking snapshots of count(x), sum(x), and
sum(x**2) (which I understand to be the components of stddev),
one can compute stddevs across different time spans using auxiliary
tools that sample this triplet on occasion.  That's kind of a handy
property that I'm not sure the percN approximations can offer as easily.
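
A minimal sketch of that triplet idea, in plain C with invented names: differencing
two snapshots of (count, sum, sum of squares) yields the mean and standard deviation
over the interval between them.

#include <math.h>

typedef struct StatSnapshot
{
    double      n;                      /* count(x) */
    double      sum;                    /* sum(x)   */
    double      sumsq;                  /* sum(x*x) */
} StatSnapshot;

static double
interval_stddev(StatSnapshot older, StatSnapshot newer)
{
    double      n = newer.n - older.n;
    double      sum = newer.sum - older.sum;
    double      sumsq = newer.sumsq - older.sumsq;
    double      mean;
    double      var;

    if (n < 2)
        return 0.0;

    mean = sum / n;
    var = (sumsq - n * mean * mean) / (n - 1);  /* sample variance */
    return (var > 0) ? sqrt(var) : 0.0;
}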




Re: [HACKERS] proposal: lob conversion functionality

2013-10-22 Thread Pavel Stehule
Hello

here is patch

Regards

Pavel


2013/10/21 Noah Misch n...@leadboat.com

 On Mon, Oct 21, 2013 at 08:10:24PM +0200, Pavel Stehule wrote:
   On 10/20/2013 07:52 PM, Noah Misch wrote:
   Anything we do here effectively provides wrappers around the existing
   functions tailored toward the needs of libpq.

 To clarify the above statement: the existing lo* SQL functions are
 designed to
 fit the needs of the libpq APIs that call those SQL functions internally.
  The
 additions we're discussing are SQL functions designed to fit the needs of
 user-written SQL statements.

  I am for including it in core - we have no built-in SQL functions that allow
  simple and fast access at the binary level. Next - these functions
  complete the lo functionality.
 
  Other question - should these functions be propagated to libpq?

 No; I agree that the existing libpq large object API is adequate.

  and who will write patch? You or me?

 If you're prepared to change the function names and add the subset-oriented
 functions, I would appreciate that.

 Thanks,
 nm

 --
 Noah Misch
 EnterpriseDB http://www.enterprisedb.com



load_lo_v3.patch
Description: Binary data



Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2013-10-22 Thread Gavin Flower

On 22/10/13 22:56, Dimitri Fontaine wrote:

Tom Lane t...@sss.pgh.pa.us writes:

Hm.  It's been a long time since college statistics, but doesn't the
entire concept of standard deviation depend on the assumption that the
underlying distribution is more-or-less normal (Gaussian)?  Is there a

I just had a quick chat with a statistician friends of mine on that
topic, and it seems that the only way to make sense of an average is if
you know already the distribution.

In our case, what I keep experiencing with tuning queries is that we
have like 99% of them running under acceptable threshold and 1% of them
taking more and more time.

In a normal (Gaussian) distribution, there would be no query time
farther away from the average than any other, so my experience tells me
that the query time distribution is anything BUT normal (Gaussian).


good reason to suppose that query runtime is Gaussian?  (I'd bet not;
in particular, multimodal behavior seems very likely due to things like
plan changes.)  If not, how much does that affect the usefulness of
a standard-deviation calculation?

I don't know what multi-modal is.


[...]

Multi-modal is basically having more than one hump when you graph the 
frequencies of values.

If you gave a series of mathematical questions of varying degrees of difficulty and
diverse areas of mathematics to a group of people between the ages of 20 and 25
selected at random in New Zealand, then you would have at least 2 humps.  One hump
would be those who had little mathematical training and/or no interest, the other
those who had had more advanced mathematical training and/or were interested in
mathematics.

You would also get at least 2 humps if you plotted the number of people under the
age of 50 against the number of visits to medical practitioners.  Basically those
people with chronic illnesses versus those who tend not to have extended periods
of illness - this implies 2 humps, but it may be more complicated.

Grabbing people at random and getting them to fire a rifle at targets would
also be multimodal.  A lot of people with low scores and a lesser percentage
with reasonable scores.  I would expect this to be quite pronounced, since people
with lots of rifle practice will tend to do significantly better.


Cheers,
Gavin






Re: [HACKERS] all_visible replay aborting due to uninitialized pages

2013-10-22 Thread Andres Freund
Hi Robert, Heikki,

On 2013-09-24 13:25:41 +0200, Andres Freund wrote:
  I'm afraid this patch was a few bricks shy of a load. The
  log_newpage_buffer() function asserts that:
  
 /* We should be in a critical section. */
 Assert(CritSectionCount > 0);
  
  But the call in vacuumlazy.c is not inside a critical section. Also, the
  comments in log_newpage_buffer() say that the caller should mark the buffer
  dirty *before* calling log_newpage_buffer(), but in vacuumlazy.c, it's
  marked dirty afterwards. I'm not sure what consequences that might have, but
  at least it contradicts the comment.
  
  (spotted this while working on a patch, and ran into the assertion on crash
  recovery)
 
 What about the attached patches (one for 9.3 and master, the other for
 9.2)? I've tested that I can trigger the assert before and not after by
 inserting faults...
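
For reference, the constraints being fixed here (inside a critical section, buffer
dirtied before WAL-logging) amount to the following calling pattern; this is a sketch
only, with the 9.3-era one-argument log_newpage_buffer() signature assumed:

    START_CRIT_SECTION();
    MarkBufferDirty(buf);               /* dirty the buffer before WAL-logging it */
    log_newpage_buffer(buf);            /* asserts CritSectionCount > 0 */
    END_CRIT_SECTION();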

Could either of you commit those patches to the corresponding branches?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2013-10-22 Thread Stephen Frost
All,

* Dimitri Fontaine (dimi...@2ndquadrant.fr) wrote:
 In our case, what I keep experiencing with tuning queries is that we
 have like 99% of them running under acceptable threshold and 1% of them
 taking more and more time.

This is usually described (at least where I come from) as 'rare events',
which goes to Tom's point that averages, stddev, etc, are not ideal
(though they are still better than nothing).

  good reason to suppose that query runtime is Gaussian?  (I'd bet not;
  in particular, multimodal behavior seems very likely due to things like
  plan changes.)  If not, how much does that affect the usefulness of
  a standard-deviation calculation?

Oscillating plan changes may fit multimodal but I don't feel that's
typical.  My experience has been it's either an extremely rare plan
difference or it's a shift from one plan to another over time.

 What I've been gathering from my quick chat this morning is that either
 you know how to characterize the distribution and then the min max and
 average are useful on their own, or you need to keep track of an
 histogram where all the bins are of the same size to be able to learn
 what the distribution actually is.

A histogram would certainly be useful.  We may also wish to look into
outlier/rare event detection methods and increase the logging we do in
those cases (if possible).

 Of course keeping enough stats seems to always begin with keeping the
 min, max and average, so we can just begin there. We would just be
 unable to answer interesting questions with just that.

It would probably be good to do some research into techniques for
outlier detection which minimizes CPU and storage cost.

Thanks,

Stephen




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Robert Haas
On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund and...@2ndquadrant.com wrote:
 So. As it turns out that solution isn't sufficient in the face of VACUUM
 FULL and mixed DML/DDL transaction that have not yet been decoded.

 To reiterate, as published it works like:
 For every modification of catalog tuple (insert, multi_insert, update,
 delete) that has influence over visibility issue a record that contains:
 * filenode
 * ctid
 * (cmin, cmax)

 When doing a visibility check on a catalog row during decoding of mixed
 DML/DDL transaction lookup (cmin, cmax) for that row since we don't
 store both for the tuple.

 That mostly works great.

 The problematic scenario is decoding a transaction that has done mixed
 DML/DDL *after* a VACUUM FULL/CLUSTER has been performed. The VACUUM
 FULL obviously changes the filenode and the ctid of a tuple, so we
 cannot successfully do a lookup based on what we logged before.

So I have a new idea for handling this problem, which seems obvious in
retrospect.  What if we make the VACUUM FULL or CLUSTER log the old
CTID -> new CTID mappings?  This would only need to be done for
catalog tables, and maybe could be skipped for tuples whose XIDs are
old enough that we know those transactions must already be decoded.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical changeset generation v6.4

2013-10-22 Thread Robert Haas
On Tue, Oct 22, 2013 at 10:07 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-10-21 20:16:29 +0200, Andres Freund wrote:
 On 2013-10-18 20:50:58 +0200, Andres Freund wrote:
  How about modifying the selection to go from:
  * all rows if ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL;
  * index chosen by ALTER TABLE ... REPLICA IDENTITY USING indexname
  * [later, maybe] ALTER TABLE ... REPLICA IDENTITY (cola, colb)

 Current draft is:
 ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL|DEFAULT
 ALTER TABLE ... REPLICA IDENTITY USING INDEX ...;

 which leaves the door open for

 ALTER TABLE ... REPLICA IDENTITY USING '(' column_name_list ')';

 Does anybody have a strong feeling about requiring support for CREATE
 TABLE for this?

 Attached is a patch ontop of master implementing this syntax. It's not
 wired up to the changeset extraction patch yet as I am not sure whether
 others agree about the storage.

 pg_class grew a 'relreplident' char, storing:
 * 'd' default
 * 'n' nothing
 * 'f' full
 * 'i' index with indisreplident set, or default if index has been
   dropped
 pg_index grew a 'indisreplident' bool indicating it is set as the
 replica identity for a replident = 'i' relation.

 Both changes shouldn't change the width of the affected relations, they
 should reuse existing padding.

 Does somebody prefer a different storage?

I had imagined that the storage might consist simply of a pg_attribute
boolean.  So full would turn them all on, null would turn them all off,
etc.  But that does make it hard to implement the "whatever the pkey
happens to be right now" behavior, so maybe your idea is better.
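
For reference, the storage Andres describes boils down to roughly the following
catalog additions (sketch only; the field names relreplident and indisreplident are
taken from the draft above):

    /* new column in pg_class */
    char        relreplident;   /* 'd' default, 'n' nothing, 'f' full,
                                 * 'i' index with indisreplident set
                                 * (or default if that index is dropped) */

    /* new column in pg_index */
    bool        indisreplident; /* this index is the replica identity of a
                                 * relation whose relreplident is 'i' */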

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Andres Freund
On 2013-10-22 10:52:48 -0400, Robert Haas wrote:
 On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund and...@2ndquadrant.com wrote:
  So. As it turns out that solution isn't sufficient in the face of VACUUM
  FULL and mixed DML/DDL transaction that have not yet been decoded.
 
  To reiterate, as published it works like:
  For every modification of catalog tuple (insert, multi_insert, update,
  delete) that has influence over visibility issue a record that contains:
  * filenode
  * ctid
  * (cmin, cmax)
 
  When doing a visibility check on a catalog row during decoding of mixed
  DML/DDL transaction lookup (cmin, cmax) for that row since we don't
  store both for the tuple.
 
  That mostly works great.
 
  The problematic scenario is decoding a transaction that has done mixed
  DML/DDL *after* a VACUUM FULL/CLUSTER has been performed. The VACUUM
  FULL obviously changes the filenode and the ctid of a tuple, so we
  cannot successfully do a lookup based on what we logged before.
 
 So I have a new idea for handling this problem, which seems obvious in
 retrospect.  What if we make the VACUUM FULL or CLUSTER log the old
 CTID -> new CTID mappings?  This would only need to be done for
 catalog tables, and maybe could be skipped for tuples whose XIDs are
 old enough that we know those transactions must already be decoded.

Ah. If it only were so simple ;). That was my first idea, and after I'd
bragged in an 2ndq internal chat that I'd found a simple idea I
obviously had to realize it doesn't work.

Consider:
INIT_LOGICAL_REPLICATION;
CREATE TABLE foo(...);
BEGIN;
INSERT INTO foo;
ALTER TABLE foo ...;
INSERT INTO foo;
COMMIT TX 3;
VACUUM FULL pg_class;
START_LOGICAL_REPLICATION;

When we decode tx 3 we haven't yet read the mapping from the vacuum
freeze. That scenario can happen either because decoding was stopped for
a moment, or because decoding couldn't keep up (slow connection,
whatever).

There also can be nasty variations where the VACUUM FULL happens while a
transaction is writing data since we don't hold locks on system
relations for very long.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Why the asprintf patch is still breaking the buildfarm

2013-10-22 Thread Tom Lane
Manlio Perillo manlio.peri...@gmail.com writes:
 On 22/10/2013 09:58, Tom Lane wrote:
 1. src/port/asprintf.c exhibits a truly touching faith that vsnprintf will
 report exactly the number of bytes that would have been required, even if
 the buffer is not that large.  While this is what is specified in recent
 versions of the POSIX standard, older platforms have much sketchier
 behavior.

 Just to be pedantic, this is required by C99.

Yeah.  As a separate matter, it might be useful to revise stringinfo.c
and the asprintf code so that *if* the returned value is larger than the
given buffer size, we use it as a guide to resizing, avoiding the possible
need to loop multiple times to make the buffer large enough.  And we could
also improve our own implementation of snprintf to follow the C99 spec.

The point here is that we still need to cope with pre-C99 implementations
that might return -1 or the given buffer size on overflow.  The NetBSD
implementation doesn't do that, which is reasonable in their context, but
not workable for us.

regards, tom lane




Re: [HACKERS] COPY table FROM STDIN doesn't show count tag

2013-10-22 Thread Rajeev rastogi
On 21 October 2013 20:48,  Robert Haas robertmh...@gmail.com wrote:
On Fri, Oct 18, 2013 at 7:37 AM, Rajeev rastogi rajeev.rast...@huawei.com 
wrote:
 From the following mail, COPY behaviour between stdin and a normal file
 has some inconsistency.


 http://www.postgresql.org/message-id/ce85a517.4878e%tim.k...@gmail.com



 The issue was that if COPY is executed from stdin, the command goes to the
 server for execution; when the server requests the input, it sends
 control back to the client to enter the data. Once the client sends the
 input to the server, the server executes the COPY command and sends the
 result back to the client, but the client does not print the result;
 instead it just clears it out.

 Changes are made to ensure the final result from the server gets printed
 before clearing the result.



 Please find the patch for the same and let me know your suggestions.



Please add your patch to the currently-open CommitFest so that it does not get 
forgotten:

https://commitfest.postgresql.org/action/commitfest_view/open

Added to the currently-open CommitFest.

Thanks and Regards,
Kumar Rajeev Rastogi




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Robert Haas
On Tue, Oct 22, 2013 at 11:02 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-10-22 10:52:48 -0400, Robert Haas wrote:
 On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  So. As it turns out that solution isn't sufficient in the face of VACUUM
  FULL and mixed DML/DDL transaction that have not yet been decoded.
 
  To reiterate, as published it works like:
  For every modification of catalog tuple (insert, multi_insert, update,
  delete) that has influence over visibility issue a record that contains:
  * filenode
  * ctid
  * (cmin, cmax)
 
  When doing a visibility check on a catalog row during decoding of mixed
  DML/DDL transaction lookup (cmin, cmax) for that row since we don't
  store both for the tuple.
 
  That mostly works great.
 
  The problematic scenario is decoding a transaction that has done mixed
  DML/DDL *after* a VACUUM FULL/CLUSTER has been performed. The VACUUM
  FULL obviously changes the filenode and the ctid of a tuple, so we
  cannot successfully do a lookup based on what we logged before.

 So I have a new idea for handling this problem, which seems obvious in
 retrospect.  What if we make the VACUUM FULL or CLUSTER log the old
  CTID -> new CTID mappings?  This would only need to be done for
 catalog tables, and maybe could be skipped for tuples whose XIDs are
 old enough that we know those transactions must already be decoded.

 Ah. If it only were so simple ;). That was my first idea, and after I'd
 bragged in an 2ndq internal chat that I'd found a simple idea I
 obviously had to realize it doesn't work.

 Consider:
 INIT_LOGICAL_REPLICATION;
 CREATE TABLE foo(...);
 BEGIN;
 INSERT INTO foo;
 ALTER TABLE foo ...;
 INSERT INTO foo;
 COMMIT TX 3;
 VACUUM FULL pg_class;
 START_LOGICAL_REPLICATION;

 When we decode tx 3 we haven't yet read the mapping from the vacuum
 freeze. That scenario can happen either because decoding was stopped for
 a moment, or because decoding couldn't keep up (slow connection,
 whatever).

That strikes me as a flaw in the implementation rather than the idea.
You're presupposing a patch where the necessary information is
available in WAL yet you don't make use of it at the proper time.  It
seems to me that you have to think of the CTID map as tied to a
relfilenode; if you try to use one relfilenode's map with a different
relfilenode, it's obviously not going to work.  So don't do that.

/me looks innocent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Andres Freund
On 2013-10-18 20:26:16 +0200, Andres Freund wrote:
 4) Store both (cmin, cmax) for catalog tuples.

BTW: That would have the nice side-effect of delivering the basis of
what you need to do parallel sort in a transaction that previously has
performed DDL.

Currently you cannot do anything in parallel after DDL, even if you only
scan the table in one backend, because operators et al. have to do
catalog lookups which you can't do consistently since cmin/cmax aren't
available in both.

Greetings,

Andres Freund




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Heikki Linnakangas

On 22.10.2013 19:12, Andres Freund wrote:

On 2013-10-18 20:26:16 +0200, Andres Freund wrote:

4) Store both (cmin, cmax) for catalog tuples.


BTW: That would have the nice side-effect of delivering the basis of
what you need to do parallel sort in a transaction that previously has
performed DDL.

Currently you cannot do anything in parallel after DDL, even if you only
scan the table in one backend, because operators et al. have to do
catalog lookups which you can't do consistently since cmin/cmax aren't
available in both.


Parallel workers will need cmin/cmax for user tables too, to know which 
tuples are visible to the snapshot.


- Heikki




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Andres Freund
On 2013-10-22 19:19:19 +0300, Heikki Linnakangas wrote:
 On 22.10.2013 19:12, Andres Freund wrote:
 On 2013-10-18 20:26:16 +0200, Andres Freund wrote:
 4) Store both (cmin, cmax) for catalog tuples.
 
 BTW: That would have the nice side-effect of delivering the basis of
 what you need to do parallel sort in a transaction that previously has
 performed DDL.
 
 Currently you cannot do anything in parallel after DDL, even if you only
 scan the table in one backend, because operators et al. have to do
 catalog lookups which you can't do consistently since cmin/cmax aren't
 available in both.
 
 Parallel workers will need cmin/cmax for user tables too, to know which
 tuples are visible to the snapshot.

The existing proposals were mostly about just parallelizing the sort and
similar operations, right? In such scenarios you really need it only for
the catalog.

But we could easily generalize it for user data too. We should even be
able to only use wide cids when a backend needs it, since
inherently it's only needed within a single transaction.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Heikki Linnakangas

On 22.10.2013 19:23, Andres Freund wrote:

On 2013-10-22 19:19:19 +0300, Heikki Linnakangas wrote:

On 22.10.2013 19:12, Andres Freund wrote:

On 2013-10-18 20:26:16 +0200, Andres Freund wrote:

4) Store both (cmin, cmax) for catalog tuples.


BTW: That would have the nice side-effect of delivering the basis of
what you need to do parallel sort in a transaction that previously has
performed DDL.

Currently you cannot do anything in parallel after DDL, even if you only
scan the table in one backend, because operators et al. have to do
catalog lookups which you can't do consistently since cmin/cmax aren't
available in both.


Parallel workers will need cmin/cmax for user tables too, to know which
tuples are visible to the snapshot.


The existing proposals were mostly about just parallelizing the sort and
similar operations, right? In such scenarios you really need it only for
the catalog.

But we could easily generalize it for user data too. We should even be
able to only use wide cids when we a backend needs it it since
inherently it's only needed within a single transaction.


Or just hand over a copy of the combocid map to the worker, along with 
the snapshot. Seems a lot simpler than this wide cids business..


- Heikki




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Robert Haas
On Tue, Oct 22, 2013 at 12:25 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Or just hand over a copy of the combocid map to the worker, along with the
 snapshot. Seems a lot simpler than this wide cids business..

Yes, that's what Noah and I talked about doing.  Or possibly even
making the map into a hash table in dynamic shared memory, so that new
combo CIDs could be allocated by any backend in the parallel group.
But that seems hard, so for starters I think we'll only parallelize
read-only operations and just hand over a copy of the map.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PATCH] Statistics collection for CLUSTER command

2013-10-22 Thread Robert Haas
On Sun, Oct 20, 2013 at 1:37 AM, Noah Misch n...@leadboat.com wrote:
  (2013/08/08 20:52), Vik Fearing wrote:
  As part of routine maintenance monitoring, it is interesting for us to
  have statistics on the CLUSTER command (timestamp of last run, and
  number of runs since stat reset) like we have for (auto)ANALYZE and
  (auto)VACUUM.  Patch against today's HEAD attached.

 Adding new fields to PgStat_StatTabEntry imposes a substantial distributed
 cost, because every database stats file write-out grows by the width of those
 fields times the number of tables in the database.  Associated costs have been
 and continue to be a pain point with large table counts:

 http://www.postgresql.org/message-id/flat/1718942738eb65c8407fcd864883f...@fuzzy.cz
 http://www.postgresql.org/message-id/flat/52268887.9010...@uptime.jp

 In that light, I can't justify widening PgStat_StatTabEntry by 9.5% for this.
 I recommend satisfying this monitoring need in your application by creating a
 cluster_table wrapper function that issues CLUSTER and then updates statistics
 you store in an ordinary table.  Issue all routine CLUSTERs by way of that
 wrapper function.  A backend change that would help here is to extend event
 triggers to cover the CLUSTER command, permitting you to inject monitoring
 after plain CLUSTER and dispense with the wrapper.

I unfortunately have to agree with this, but I think it points to the
need for further work on the pgstat infrastructure.  We used to have
one file; now we have one per database.  That's better for people with
lots of databases, but many people just have one big database.  We
need a solution here that relieves the pain for those people.

(I can't help thinking that the root of the problem here is that we're
rewriting the whole file, and that any solution that doesn't somehow
facilitate updates of individual records will be only a small
improvement.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Reasons not to like asprintf

2013-10-22 Thread Tom Lane
I wrote:
 I have a lot of other gripes about this whole patch, but they can
 wait till tomorrow.

The other big thing I don't like about this patch is that adopting
asprintf() as a model was not any better thought-out than the
portability considerations were.  It's a bad choice because:

1. asprintf() is not in C99 nor any POSIX standard.

2. The two known implementations, glibc and *BSD, behave differently,
as admitted in the Linux man page.  Thus, there is nothing resembling
a standard here.

3. Neither implementation has given any thought to error reporting;
it's not possible to tell whether a failure was due to out-of-memory
or a vsnprintf failure such as a format-string error.  (pg_asprintf
assumes the former, which probably has something to do with the bogus
out-of-memory failures we're seeing on buildfarm member anole, though
I'm not sure about the details.)

4. The use of a char** output parameter is not very friendly, because
(at least on older versions of gcc) it disables uninitialized-variable
detection.  That is, if you have code like

char *str;

if (...)
   asprintf(&str, ...);
else
   asprintf(&str, ...);
... use str for something ...

many compilers will not be able to check whether you actually assigned
to str in both code paths.

5. The reason for the char** parameter is that the function return
value is needed for error/buffer overrun handling --- but no caller
of pg_asprintf looks at the return value at all.

6. Thus, I much prefer the API design that was used for psprintf,
that is return the allocated buffer pointer as the function result.
It seems like a bad idea to me that we adopted two different APIs
for frontend and backend code; if we're going to hardwire
exit()-on-failure into pg_asprintf there is no benefit whatsoever
to reserving the function return value for failure indications.
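
For illustration, the two call styles being contrasted look roughly like this
(sketch; error handling elided, and the format string and relname variable are
invented):

    /* backend-style API (psprintf): the allocated buffer is the return value */
    char       *a = psprintf("relation \"%s\" does not exist", relname);

    /* API as the patch stands (pg_asprintf): buffer comes back via an out parameter */
    char       *b;

    pg_asprintf(&b, "relation \"%s\" does not exist", relname);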

In short, I want to change pg_asprintf to return the malloc'd buffer
as its function result.  This probably means we should change the
function name too, since the allusion to asprintf is completely useless
if we do that.  I'm thinking pg_psprintf for the frontend version of
psprintf, but maybe somebody has a better suggestion?

regards, tom lane




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Andres Freund
On 2013-10-22 19:25:31 +0300, Heikki Linnakangas wrote:
 On 22.10.2013 19:23, Andres Freund wrote:
 On 2013-10-22 19:19:19 +0300, Heikki Linnakangas wrote:
 On 22.10.2013 19:12, Andres Freund wrote:
 On 2013-10-18 20:26:16 +0200, Andres Freund wrote:
 4) Store both (cmin, cmax) for catalog tuples.
 
 BTW: That would have the nice side-effect of delivering the basis of
 what you need to do parallel sort in a transaction that previously has
 performed DDL.
 
 Currently you cannot do anything in parallel after DDL, even if you only
 scan the table in one backend, because operators et al. have to do
 catalog lookups which you can't do consistently since cmin/cmax aren't
 available in both.
 
 Parallel workers will need cmin/cmax for user tables too, to know which
 tuples are visible to the snapshot.
 
 The existing proposals were mostly about just parallelizing the sort and
 similar operations, right? In such scenarios you really need it only for
 the catalog.
 
 But we could easily generalize it for user data too. We should even be
 able to only use wide cids when we a backend needs it it since
 inherently it's only needed within a single transaction.
 
 Or just hand over a copy of the combocid map to the worker, along with the
 snapshot. Seems a lot simpler than this wide cids business..

That's not sufficient if you want to continue writing in the primary
backend though which isn't an uninteresting thing.

I am not saying that parallel XXX is a sufficient reason for this, just
that it might be a co-benefit.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Heikki Linnakangas
Splitting a B-tree page is a two-stage process: First, the page is 
split, and then a downlink for the new right page is inserted into the 
parent (which might recurse to split the parent page, too). What happens 
if inserting the downlink fails for some reason? I tried that out, and 
it turns out that it's not nice.


I used this to cause a failure:


--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel,
_bt_relbuf(rel, pbuf);
}

+   elog(ERROR, "fail!");
+
/* get high key from left page == lowest key on new right page 
*/
ritem = (IndexTuple) PageGetItem(page,

 PageGetItemId(page, P_HIKEY));


postgres=# create table foo (i int4 primary key);
CREATE TABLE
postgres=# insert into foo select generate_series(1, 1);
ERROR:  fail!

That's not surprising. But when I removed that elog again and restarted 
the server, I still can't insert. The index is permanently broken:


postgres=# insert into foo select generate_series(1, 1);
ERROR:  failed to re-find parent key in index foo_pkey for split pages 4/5

In real life, you would get a failure like this e.g if you run out of 
memory or disk space while inserting the downlink to the parent. 
Although rare in practice, it's no fun if it happens.


I fixed the same problem in GiST a few years back, by making it 
tolerate missing downlinks, and inserting them lazily. The B-tree code 
tolerates them already on scans, but gets confused on insertion, as seen 
above. I propose that we use the same approach I used with GiST, and add 
a flag to the page header to indicate the downlink hasn't been inserted 
yet. When insertion (or vacuum) bumps into a flagged page, it can 
finish the incomplete action by inserting the downlink.
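
In rough pseudocode, the insertion path would then include something like the
following check; the flag macro and helper names here are illustrative, not existing
nbtree symbols:

    /* while descending to the insertion target, before using a child page */
    if (P_INCOMPLETE_SPLIT(opaque))
    {
        /* a previous split never got its downlink: insert it now, then retry */
        _bt_finish_incomplete_split(rel, buf, stack);
    }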


- Heikki




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Andres Freund
On 2013-10-22 11:59:35 -0400, Robert Haas wrote:
  So I have a new idea for handling this problem, which seems obvious in
  retrospect.  What if we make the VACUUM FULL or CLUSTER log the old
  CTID -> new CTID mappings?  This would only need to be done for
  catalog tables, and maybe could be skipped for tuples whose XIDs are
  old enough that we know those transactions must already be decoded.
 
  Ah. If it only were so simple ;). That was my first idea, and after I'd
  bragged in an 2ndq internal chat that I'd found a simple idea I
  obviously had to realize it doesn't work.
 
  Consider:
  INIT_LOGICAL_REPLICATION;
  CREATE TABLE foo(...);
  BEGIN;
  INSERT INTO foo;
  ALTER TABLE foo ...;
  INSERT INTO foo;
  COMMIT TX 3;
  VACUUM FULL pg_class;
  START_LOGICAL_REPLICATION;
 
  When we decode tx 3 we haven't yet read the mapping from the vacuum
  freeze. That scenario can happen either because decoding was stopped for
  a moment, or because decoding couldn't keep up (slow connection,
  whatever).

 It seems to me that you have to think of the CTID map as tied to a
 relfilenode; if you try to use one relfilenode's map with a different
 relfilenode, it's obviously not going to work.  So don't do that.

It has to be tied to relfilenode (+ctid) *and* transaction
unfortunately.
 
 That strikes me as a flaw in the implementation rather than the idea.
 You're presupposing a patch where the necessary information is
 available in WAL yet you don't make use of it at the proper time.

The problem is that the mapping would be somewhere *ahead* from the
transaction/WAL we're currently decoding. We'd need to read ahead till
we find the correct one.
But I think I mainly misunderstood what you proposed. That mapping could
be written beside the relfilenode, instead of into the WAL. Then my
imagined problem doesn't exist anymore.

We only would need to write out mappings for tuples modified since the
xmin horizon, so it wouldn't even be *too* bad for bigger relations.

This won't easily work for two+ rewrites because we'd need to apply all
mappings in order and thus would have to keep a history of intermediate
nodes/mappings. But it'd be perfectly doable to simply wait till
decoders are caught up.

I still feel that simply storing both cmin, cmax is cleaner, but if
that's not acceptable, I can certainly live with something like this.
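
Pulling the constraints above together, one logged mapping entry might hypothetically
carry something like the following (names invented for illustration; not the format
that was eventually settled on):

    typedef struct RewriteMappingEntry
    {
        TransactionId   xid;                /* transaction whose (cmin, cmax) lookups need it */
        Oid             old_relfilenode;    /* catalog relation before VACUUM FULL/CLUSTER */
        Oid             new_relfilenode;    /* ... and after the rewrite */
        ItemPointerData old_tid;            /* tuple position before the rewrite */
        ItemPointerData new_tid;            /* tuple position after the rewrite */
    } RewriteMappingEntry;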

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Peter Geoghegan
On Tue, Oct 22, 2013 at 9:55 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 I propose that we use the same approach I used with GiST, and add a flag to
 the page header to indicate the downlink hasn't been inserted yet. When
 insertion (or vacuum) bumps into a flagged page, it can finish the
 incomplete action by inserting the downlink.

Sounds very reasonable, but I'm interested in how you intend to
structure things, given this sounds like what could loosely be called
btree private state. I may also need to use a flag bit for something
that is of interest to the btree code alone.


-- 
Peter Geoghegan




Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Heikki Linnakangas

On 22.10.2013 20:27, Peter Geoghegan wrote:

On Tue, Oct 22, 2013 at 9:55 AM, Heikki Linnakangas
hlinnakan...@vmware.com  wrote:

I propose that we use the same approach I used with GiST, and add a flag to
the page header to indicate the downlink hasn't been inserted yet. When
insertion (or vacuum) bumps into a flagged page, it can finish the
incomplete action by inserting the downlink.


Sounds very reasonable, but I'm interested in how you intend to
structure things, given this sounds like what could loosely be called
btree private state. I may also need to use a flag bit for something
that is of interest to the btree code alone.


I may be missing something, but there are already plenty of b-tree 
specific flags. See BTP_* in nbtree.h. I'll just add another to that list.


- Heikki




Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Peter Geoghegan
On Tue, Oct 22, 2013 at 10:30 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 I may be missing something, but there are already plenty of b-tree specific
 flags. See BTP_* in nbtree.h. I'll just add another to that list.

Based on your remarks, I thought that you were intent on directly
using page level bits (pd_flags), rather than the private state flag
bits. Blame it on the lack of coffee, I suppose.

-- 
Peter Geoghegan




Re: [HACKERS] Reasons not to like asprintf

2013-10-22 Thread Stephen Frost
Tom,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
 In short, I want to change pg_asprintf to return the malloc'd buffer
 as its function result.  This probably means we should change the
 function name too, since the allusion to asprintf is completely useless
 if we do that.  I'm thinking pg_psprintf for the frontend version of
 psprintf, but maybe somebody has a better suggestion?

Sounds reasonable, and I haven't got a better name, but I'm trying to
figure out why psprintf hasn't got the same issues which you mention in
your other thread (eg: depending on vsnprintf to return the would-be
result size).  I'm also a bit nervous that we only check vsnprintf()
success in Assert-enabled builds in psprintf, though I suppose that's
all we're doing in appendStringInfo too.

Thanks,

Stephen




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Robert Haas
On Tue, Oct 22, 2013 at 1:08 PM, Andres Freund and...@2ndquadrant.com wrote:
 It seems to me that you have to think of the CTID map as tied to a
 relfilenode; if you try to use one relfilenode's map with a different
 relfilenode, it's obviously not going to work.  So don't do that.

 It has to be tied to relfilenode (+ctid) *and* transaction
 unfortunately.

I agree that it does, but it doesn't seem particularly unfortunate to me.

 That strikes me as a flaw in the implementation rather than the idea.
 You're presupposing a patch where the necessary information is
 available in WAL yet you don't make use of it at the proper time.

 The problem is that the mapping would be somewhere *ahead* from the
 transaction/WAL we're currently decoding. We'd need to read ahead till
 we find the correct one.

Yes, I think that's what you need to do.

 But I think I mainly misunderstood what you proposed. That mapping could
 be written besides relfilenode, instead of into the WAL. Then my
 imagined problem doesn't exist anymore.

That's pretty ugly.  I think it should be written into WAL.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] logical changeset generation v6.2

2013-10-22 Thread Andres Freund
On 2013-10-22 13:57:53 -0400, Robert Haas wrote:
 On Tue, Oct 22, 2013 at 1:08 PM, Andres Freund and...@2ndquadrant.com wrote:
  That strikes me as a flaw in the implementation rather than the idea.
  You're presupposing a patch where the necessary information is
  available in WAL yet you don't make use of it at the proper time.
 
  The problem is that the mapping would be somewhere *ahead* from the
  transaction/WAL we're currently decoding. We'd need to read ahead till
  we find the correct one.
 
 Yes, I think that's what you need to do.

My problem with that is that rewrite can be gigabytes into the future.

When reading forward we could either just continue reading data into the
reorderbuffer, but delay replaying all future commits till we found the
currently needed remap. That might have quite the additional
storage/memory cost, but runtime complexity should be the same as normal
decoding.
Or we could individually read ahead for every transaction. But doing so
for every transaction will get rather expensive (roughly O(amount_of_wal^2)).

I think that'd be pretty similar to just disallowing VACUUM
FREEZE/CLUSTER on catalog relations since effectively it'd be too
expensive to use.

  But I think I mainly misunderstood what you proposed. That mapping could
  be written besides relfilenode, instead of into the WAL. Then my
  imagined problem doesn't exist anymore.
 
 That's pretty ugly.  I think it should be written into WAL.

It basically has O(1) access, that's why I was thinking about it.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2013-10-22 Thread Jeff Janes
On Mon, Oct 21, 2013 at 4:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Andrew Dunstan and...@dunslane.net writes:
  This is why I suggested the standard deviation, and why I find it would
  be more useful than just min and max. A couple of outliers will set the
  min and max to possibly extreme values but hardly perturb the standard
  deviation over a large number of observations.

 Hm.  It's been a long time since college statistics, but doesn't the
 entire concept of standard deviation depend on the assumption that the
 underlying distribution is more-or-less normal (Gaussian)?



It is easy to misinterpret the standard deviation if the distribution is
not gaussian, but that is also true of the average.  The standard deviation
(or the variance) is commonly used with non-gaussian distributions, either
because it is the most efficient estimator for those particular
distributions, or just because it is so commonly available.

Cheers,

Jeff


Re: [HACKERS] Reasons not to like asprintf

2013-10-22 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 Sounds reasonable, and I haven't got a better name, but I'm trying to
 figure out why psprintf hasn't got the same issues which you mention in
 your other thread (eg: depending on vsnprintf to return the would-be
 result size).

It does, though not directly in psprintf but rather in the underlying
vasprintf implementation.  (Although psprintf is exposed to the problem
that it can't tell out-of-memory from format errors.)

What I'm on about in this thread is the API used by all the call sites,
not the underlying implementation issues.  As long as we're okay with
exit or elog on any error, I think we should just have the widely-used
functions returning a buffer pointer.  Underneath that, we need some work
so that we can print more-specific error messages, but that will not be
of concern to the callers.

It's possibly debatable whether exit-on-error is okay or not.  I think
it is though, because none of our frontend code is prepared to do anything
except go belly-up on out-of-memory, while in the backend case it's only
elog(ERROR) anyway.  The other possible class of failures is format string
or encoding errors, but those seem to reduce to the same cases: I can't
envision that frontend code would have any useful recovery strategies.
I'd like the printed error message to distinguish these cases, but I don't
think the callers need to care.

 I'm also a bit nervous that we only check vsnprintf()
 success in Assert-enabled builds in psprintf, though I suppose that's
 all we're doing in appendStringInfo too.

Actually, appendStringInfo treats result < 0 as a buffer overrun report;
if the failure is persistent because it's really a format/encoding
problem, it'll loop till it runs out of memory, and then report the error
as that.  The trouble here is that in pre-C99 implementations of
vsnprintf, return code < 0 might mean either buffer overrun or a
format/encoding problem.  We can't do too much about this unless we are
prepared to move our goalposts about portability assumptions.  (Hmm ...
current POSIX requires *printf to set errno on error, so maybe we could
look at errno?)
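
To illustrate the portability hazard concretely: a caller that wants to
survive pre-C99 libraries has to treat a negative return, a return >= the
buffer size, and (on newer POSIX systems) errno as separate signals.  A
hedged sketch, not the actual stringinfo.c logic:

#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Returns 0 if the formatted result fit in buf, otherwise a buffer size to
 * retry with (exact on C99 libraries, a doubled guess on older ones).  The
 * caller must redo va_start before retrying.  Sketch only: the errno-based
 * distinction between format errors and overflow isn't guaranteed to work
 * on pre-POSIX-2008 libraries.
 */
static size_t
try_format(char *buf, size_t len, const char *fmt, va_list args)
{
	int			nprinted;

	errno = 0;
	nprinted = vsnprintf(buf, len, fmt, args);

	if (nprinted >= 0 && (size_t) nprinted < len)
		return 0;						/* result fit, we're done */

	if (nprinted >= 0)
		return (size_t) nprinted + 1;	/* C99: exact space needed */

	/* nprinted < 0: old-style overflow report, or a real error */
	if (errno == EILSEQ || errno == EINVAL)
	{
		fprintf(stderr, "invalid format string or encoding\n");
		exit(EXIT_FAILURE);
	}
	return len * 2;						/* assume overflow; grow and retry */
}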

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Andres Freund
On 2013-10-22 19:55:09 +0300, Heikki Linnakangas wrote:
 Splitting a B-tree page is a two-stage process: First, the page is split,
 and then a downlink for the new right page is inserted into the parent
 (which might recurse to split the parent page, too). What happens if
 inserting the downlink fails for some reason? I tried that out, and it turns
 out that it's not nice.
 
 I used this to cause a failure:
 
 --- a/src/backend/access/nbtree/nbtinsert.c
 +++ b/src/backend/access/nbtree/nbtinsert.c
 @@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel,
  _bt_relbuf(rel, pbuf);
  }
 
 +elog(ERROR, "fail!");
 +
  /* get high key from left page == lowest key on new right page 
  */
  ritem = (IndexTuple) PageGetItem(page,
  
   PageGetItemId(page, P_HIKEY));
 
 postgres=# create table foo (i int4 primary key);
 CREATE TABLE
 postgres=# insert into foo select generate_series(1, 1);
 ERROR:  fail!
 
 That's not surprising. But when I removed that elog again and restarted the
 server, I still can't insert. The index is permanently broken:
 
 postgres=# insert into foo select generate_series(1, 1);
 ERROR:  failed to re-find parent key in index foo_pkey for split pages 4/5
 
 In real life, you would get a failure like this e.g if you run out of memory
 or disk space while inserting the downlink to the parent. Although rare in
 practice, it's no fun if it happens.

Why doesn't the incomplete split mechanism prevent this? Because we do
not delay checkpoints on the primary and a checkpoint happened just
before your elog(ERROR) above?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Location for external scripts for Extensions?

2013-10-22 Thread Josh Berkus
All,

pg_partman has several external (python) scripts which help the
extension, located in /extras/ in its source.  The problem currently is
that if you install pg_partman via pgxn or package, you don't get those
scripts, because there's no install location for them.

Where should these go?  PGSHARE/pg_partman/?  Or somewhere else? I'm
looking to set a canonical location for extensions with external scripts.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Heikki Linnakangas

On 22.10.2013 21:25, Andres Freund wrote:

On 2013-10-22 19:55:09 +0300, Heikki Linnakangas wrote:

Splitting a B-tree page is a two-stage process: First, the page is split,
and then a downlink for the new right page is inserted into the parent
(which might recurse to split the parent page, too). What happens if
inserting the downlink fails for some reason? I tried that out, and it turns
out that it's not nice.

I used this to cause a failure:


--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel,
_bt_relbuf(rel, pbuf);
}

+   elog(ERROR, "fail!");
+
/* get high key from left page == lowest key on new right page 
*/
ritem = (IndexTuple) PageGetItem(page,

 PageGetItemId(page, P_HIKEY));


postgres=# create table foo (i int4 primary key);
CREATE TABLE
postgres=# insert into foo select generate_series(1, 1);
ERROR:  fail!

That's not surprising. But when I removed that elog again and restarted the
server, I still can't insert. The index is permanently broken:

postgres=# insert into foo select generate_series(1, 1);
ERROR:  failed to re-find parent key in index foo_pkey for split pages 4/5

In real life, you would get a failure like this e.g if you run out of memory
or disk space while inserting the downlink to the parent. Although rare in
practice, it's no fun if it happens.


Why doesn't the incomplete split mechanism prevent this? Because we do
not delay checkpoints on the primary and a checkpoint happened just
before your elog(ERROR) above?


Because there's no recovery involved. The failure I injected (or an 
out-of-memory or out-of-disk-space in the real world) doesn't cause a 
PANIC, just an ERROR that rolls back the current transaction, nothing more.


We could put a critical section around the whole recursion that inserts 
the downlinks, so that you would get a PANIC and the incomplete split 
mechanism would fix it at recovery. But that would hardly be an improvement.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2013-10-22 Thread Jeff Janes
On Mon, Oct 21, 2013 at 1:36 PM, Robert Haas robertmh...@gmail.com wrote:

 On Mon, Oct 21, 2013 at 4:01 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Gavin Flower gavinflo...@archidevsys.co.nz writes:
  If we're going to extend pg_stat_statements, even more than min and
 max
  I'd like to see the standard deviation in execution time.
 
  How about the 'median', often a lot more useful than the 'arithmetic
  mean' (which most people call the 'average').
 
  AFAIK, median is impossible to calculate cheaply (in particular, with
  a fixed amount of workspace).  So this apparently innocent request
  is actually moving the goalposts a long way, because the space per
  query table entry is a big concern for pg_stat_statements.

 Yeah, and I worry about min and max not being very usable - once they
 get pushed out to extreme values, there's nothing to drag them back
 toward normality except resetting the stats, and that's not something
 we want to encourage people to do frequently. Of course, averages over
 very long sampling intervals may not be too useful anyway, dunno.


I think the pg_stat_statements_reset() should be done every time you make a
change which you think (or hope) will push the system into a new regime,
which goes for either min/max or for average/stdev.

A histogram would be cool, but it doesn't seem very practical to implement.
 If I really needed that I'd probably set log_min_duration_statement = 0
and mine the log files.  But that means I'd have to wait to accumulate
enough logs once I made that change, then remember to turn it off.

What I'd like most in pg_stat_statements now is the ability to distinguish
which queries have a user grinding their teeth, versus which ones have a
cron job patiently doing a wait4.  I don't know the best way to figure that
out, other than stratify on application_name.  Or maybe a way to
selectively undo the query text normalization, so I could see which
parameters were causing the problem.

Cheers,

Jeff


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Andres Freund
On 2013-10-22 21:29:13 +0300, Heikki Linnakangas wrote:
 On 22.10.2013 21:25, Andres Freund wrote:
 On 2013-10-22 19:55:09 +0300, Heikki Linnakangas wrote:
 Splitting a B-tree page is a two-stage process: First, the page is split,
 and then a downlink for the new right page is inserted into the parent
 (which might recurse to split the parent page, too). What happens if
 inserting the downlink fails for some reason? I tried that out, and it turns
 out that it's not nice.
 
 I used this to cause a failure:
 
 --- a/src/backend/access/nbtree/nbtinsert.c
 +++ b/src/backend/access/nbtree/nbtinsert.c
 @@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel,
_bt_relbuf(rel, pbuf);
}
 
+  elog(ERROR, "fail!");
 +
/* get high key from left page == lowest key on new right page 
  */
ritem = (IndexTuple) PageGetItem(page,

   PageGetItemId(page, P_HIKEY));
 
 postgres=# create table foo (i int4 primary key);
 CREATE TABLE
 postgres=# insert into foo select generate_series(1, 1);
 ERROR:  fail!
 
 That's not surprising. But when I removed that elog again and restarted the
 server, I still can't insert. The index is permanently broken:
 
 postgres=# insert into foo select generate_series(1, 1);
 ERROR:  failed to re-find parent key in index foo_pkey for split pages 4/5
 
 In real life, you would get a failure like this e.g if you run out of memory
 or disk space while inserting the downlink to the parent. Although rare in
 practice, it's no fun if it happens.
 
 Why doesn't the incomplete split mechanism prevent this? Because we do
 not delay checkpoints on the primary and a checkpoint happened just
 before your elog(ERROR) above?
 
 Because there's no recovery involved. The failure I injected (or an
 out-of-memory or out-of-disk-space in the real world) doesn't cause a PANIC,
 just an ERROR that rolls back the current transaction, nothing more.
 
 We could put a critical section around the whole recursion that inserts the
 downlinks, so that you would get a PANIC and the incomplete split mechanism
 would fix it at recovery. But that would hardly be an improvement.

You were talking about restarting the server, that's why I assumed
recovery had been involved... But you just were talking about removing
the elog() again.

For me this clearly *has* to be in a critical section with the current
code. I had always assumed all multi-part actions would be.
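
For reference, the machinery being invoked here is the backend's critical
section macros (miscadmin.h); any elog(ERROR) raised while the critical
section counter is nonzero is promoted to PANIC, so recovery is forced to
finish the multi-part action.  A sketch of the shape only, not a proposal for
where the boundaries should actually go:

	START_CRIT_SECTION();

	/* ... WAL-log the split and insert the parent downlink ... */
	/* An elog(ERROR) anywhere in here becomes a PANIC. */

	END_CRIT_SECTION();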

Do you foresee the fix with ignoring missing downlinks to be
back-patchable? FWIW, I think I might have seen real-world cases of
this.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reasons not to like asprintf

2013-10-22 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 Stephen Frost sfr...@snowman.net writes:
  Sounds reasonable, and I haven't got a better name, but I'm trying to
  figure out why psprintf hasn't got the same issues which you mention in
  your other thread (eg: depending on vsnprintf to return the would-be
  result size).
 
 It does, though not directly in psprintf but rather in the underlying
 vasprintf implementation.  (Although psprintf is exposed to the problem
 that it can't tell out-of-memory from format errors.)

Right, pvsprintf().  Sorry for any confusion.

 What I'm on about in this thread is the API used by all the call sites,
 not the underlying implementation issues.  As long as we're okay with
 exit or elog on any error, I think we should just have the widely-used
 functions returning a buffer pointer.  Underneath that, we need some work
 so that we can print more-specific error messages, but that will not be
 of concern to the callers.

I agree that exit/Assert-or-elog is the right API here.  I wonder if
it'd be reasonable or worthwhile to try and actually make that happen-
that is to say, we really do only have one implementation for both
front-end and back-end here, but on the front-end we exit, while on the
back-end we elog(ERROR).  I'm guessing that's a pipe-dream, but figured
I'd mention it anyway, in case people have crafty ideas about how that
might be made to work (and if others think it's even a good idea).

 The other possible class of failures is format string
 or encoding errors, but those seem to reduce to the same cases: I can't
 envision that frontend code would have any useful recovery strategies.
 I'd like the printed error message to distinguish these cases, but I don't
 think the callers need to care.

Agreed.

  I'm also a bit nervous that we only check vsnprintf()
  success in Assert-enabled builds in psprintf, though I suppose that's
  all we're doing in appendStringInfo too.
 
 Actually, appendStringInfo treats result  0 as a buffer overrun report;

I had been looking at the Assert() that the last character is always
'\0' in allocStringInfoVA.  Probably a stretch to say that has to be
upgraded to an elog(), but I'd certainly be much happier with a runtime
"did this error?" check of some kind than not; hence my support for your
errno notion below..

 if the failure is persistent because it's really a format/encoding
 problem, it'll loop till it runs out of memory, and then report the error
 as that.  

Ah, yes, I see that in enlargeStringInfo and as long as we keep
MaxAllocSize reasonable that isn't terrible.

 (Hmm ...
 current POSIX requires *printf to set errno on error, so maybe we could
 look at errno?)

Sounds like a good idea to me..  Being explicit about checking for error
results, rather than hoping-against-hope that the only reason we'd ever
get a value less than the total length is when we need to provide more
memory, would be a lot better.  Perhaps it'd be worthwhile to run a test
against the build farm and see what kind of errno's are being set in
those cases where we're now getting the 'out of memory' failures you
mentioned upthread?

Thanks,

Stephen




Re: [HACKERS] Reasons not to like asprintf

2013-10-22 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 I agree that exit/Assert-or-elog is the right API here.  I wonder if
 it'd be reasonable or worthwhile to try and actually make that happen-
 that is to say, we really do only have one implementation for both
 front-end and back-end here, but on the front-end we exit, while on the
 back-end we elog(ERROR).  I'm guessing that's a pipe-dream,

It's not really --- we could make it happen with a single source-file in
libpgcommon, with a few #ifdef FRONTENDs, similarly to exec.c for
instance.  I'm not entirely sure that it's worth the trouble, since once
you take away the error and memory management there's not much left of
these functions; so the #ifdefs might end up being most of the code.
I'm going to go code it shortly, and we'll find out.
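
The usual shape of such a shared libpgcommon file is a handful of #ifdef
FRONTEND branches around error reporting and allocation; roughly like this
(a sketch of the pattern only, not the committed psprintf.c):

#ifndef FRONTEND
#include "postgres.h"
#else
#include "postgres_fe.h"
#endif

static void
format_failed(const char *detail)
{
#ifndef FRONTEND
	elog(ERROR, "could not format string: %s", detail);
#else
	fprintf(stderr, "could not format string: %s\n", detail);
	exit(EXIT_FAILURE);
#endif
}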

 I had been looking at the Assert() that the last character is always
 '\0' in allocStringInfoVA.  Probably a stretch to say that has to be
 upgraded to an elog(), but I'd certainly be much happier with a runtime
 did this error? check of some kind than not;

IIRC, that Assert is meant to catch an ancient buggy version of vsnprintf,
so it can't really be replaced by an errno check --- the whole point is
that that vsnprintf failed to recognize it was doing something wrong.
We could upgrade it to an elog; but we had that debate years ago when
the code went in, and I don't think the passage of time has made it more
important rather than less so to have an always-on check for that case.

 (Hmm ...
 current POSIX requires *printf to set errno on error, so maybe we could
 look at errno?)

 Sounds like a good idea to me..  Being explicit about checking for error
 results, rather than hoping-against-hope that the only reason we'd ever
 get a value less than the total length is when we need to provide more
 memory, would be a lot better.  Perhaps it'd be worthwhile to run a test
 against the build farm and see what kind of errno's are being set in
 those cases where we're now getting the 'out of memory' failures you
 mentioned upthread?

Yeah, I'm hoping that reporting errno will give us some more insight,
if anole's problem is still there after we replace the code.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-10-22 21:29:13 +0300, Heikki Linnakangas wrote:
 We could put a critical section around the whole recursion that inserts the
 downlinks, so that you would get a PANIC and the incomplete split mechanism
 would fix it at recovery. But that would hardly be an improvement.

 For me this clearly *has* to be in a critical section with the current
 code. I had always assumed all multi-part actions would be.

No, that's hardly a good idea.  As Heikki says, that would amount to
converting an entirely foreseeable situation into a PANIC.

Note also that the problem might be persistent, eg if you're out of disk
space.  In that case, you'd just get the PANIC again at recovery, so now
not only have you crashed all your sessions but you have a database that
won't start up.

I wonder whether Heikki's approach could be used to remove the need for
the incomplete-split-fixup code altogether, thus eliminating a class of
recovery failure possibilities.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Location for external scripts for Extensions?

2013-10-22 Thread Dimitri Fontaine
Josh Berkus j...@agliodbs.com writes:
 pg_partman has several external (python) scripts which help the
 extension, located in /extras/ in its source.  The problem currently is
 that if you install pg_partman via pgxn or package, you don't get those
 scripts, because there's no install location for them.

See also my proposal to solve that, I'd welcome some design level
discussions about it:

  http://www.postgresql.org/message-id/m28uyzgof3@2ndquadrant.fr

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Commitfest II CLosed

2013-10-22 Thread Josh Berkus
On 10/21/2013 11:59 AM, Tom Lane wrote:
 Josh Berkus j...@agliodbs.com writes:
 Either you're proposing a solution, supporting someone else's solution,
 or you're saying the problem isn't important.  There is no fourth
 alternative.
 
 Nonsense.  Pointing out that a proposed solution isn't workable is not
 saying that the problem isn't important.  Or are you trying to establish
 a rule that we can't complain about a proposal unless we have another one
 to make?  Sorry, I won't accept that.

In some cases the other solution is "we need to search for a better
solution."  But if you say the proposed solution is bad without even
proposing criteria for a better solution, then you are *de facto* saying
that the problem isn't important, whether or not you would like to
pretend that you're saying something else.  If a problem is important,
then it is worth solving.  If it's not worth solving, then it's not
important.

This is just as true of bugs in our process as it is of bugs in our
code; if we release without fixing a bug, then we are making a concrete
statement that the bug is not important.  For the past two years, we've
proceeded without fixing the bugs in our process, which is a material
statement that most contributors don't feel that the process is buggy.
Heck, the whole discussion about reviews got cut from the Developer's
Meeting agenda this year, and if that's not a statement of unimportance,
I don't know what is.

When I came up with the idea of CommitFests they were supposed to be an
incremental improvement for us to build on.  Instead it's remained
frozen in amber, and steadily becoming less and less effective.  I've
suggested a number of improvements and changes over the years, and
largely been rewarded with denial, attacks, ridicule, and general
sandbaggery.  I'm done.  If the community doesn't think there's a
problem, then clearly I'm in error for proposing fixes.

Not sure who you're going to get to do CF3, though.  I'm not going to be
CFM again, and I'm pretty sure nobody else wants the job either.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Heikki Linnakangas

On 22.10.2013 22:24, Tom Lane wrote:

I wonder whether Heikki's approach could be used to remove the need for
the incomplete-split-fixup code altogether, thus eliminating a class of
recovery failure possibilities.


Yes. I intend to do that, too.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Commitfest II CLosed

2013-10-22 Thread Stephen Frost
Josh,

* Josh Berkus (j...@agliodbs.com) wrote:
 In some cases the other solution is we need to search for a better
 solution.  But if you say the proposed solution is bad without even
 proposing criteria for a better solution, then you are *de facto* saying
 that the problem isn't important, whether or not you would like ot
 pretend that you're saying something else.  If a problem is important,
 then it is worth solving.  If it's not worth solving, then it's not
 important.

Or you're simply saying that other things hold priority over this
particular problem.  Sure, that makes it *less* important than other
things (if priority is your measure of importance, and it may or may not
be) but it is not the same to say that something is unimportant.

 This is just as true of bugs in our process as it is of bugs in our
 code; if we release without fixing a bug, then we are making a concrete
 statement that the bug is not important.  

Or that the priority of the release is *more* important.  Things are not
all either red-or-blue here (to use politically correct colors).

 For the past two years, we've
 proceeded without fixing the bugs in our process, which is a material
 statement that most contributors don't feel that the process is buggy.

It's not hard to imagine that developers might feel that bugs, code,
hacking, etc, are of a higher priority than very vague and extremely
challenging process 'bugs'.

 If the community doesn't think there's a
 problem, then clearly I'm in error for proposing fixes.

For my 2c, I agree that there's a problem, but I've got a ton of other
tasks, not to mention $dayjob that makes it unlikely that I'll find time
to come up with alternate proposals.  I will also say that I feel that
this process is still an *improvement* over the previous process, which
is really the only other one we've actually tested.  Perhaps that means
we should just try different things out and see if we can't build the
best process out there through supervised learning.

Thanks,

Stephen




Re: [HACKERS] Commitfest II CLosed

2013-10-22 Thread Joshua D. Drake


On 10/21/2013 08:11 AM, Robert Haas wrote:


Supposedly, we have a policy that for each patch you submit, you ought
to review a patch.  That right there ought to provide enough reviewers
for all the patches, but clearly it didn't.  And I'm pretty sure that
some people (like me) looked at a lot MORE patches than they
themselves submitted.  I think auditing who is not contributing in
that area and finding tactful ways to encourage them to contribute
would be a very useful service to the project.


What if as part of the patch submission process you had to pick the 
patch you were going to review? If there are no patches to review, then 
we obviously don't have a problem. If there are patches to review then 
we are all set.


I guess there is the problem of there only being patches that a
submitter is not qualified to review, but I find that minuscule, as every
person on this list (myself included) can do a cursory review (patch
applies, docs are good, indentation is appropriate, works as advertised).


The commitfest app would have to be modified for this but what do people 
think?


Joshua D. Drake


--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
   a rose in the deeps of my heart. - W.B. Yeats


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reasons not to like asprintf

2013-10-22 Thread Tom Lane
... BTW, another reason to choose identical APIs for frontend and backend
versions of these functions is that it greatly eases use of them in shared
frontend/backend code.  As I notice somebody has *already done* in
common/relpath.c.   I'm not exactly sure how those psprintf calls are
working at all in frontend builds.  Maybe they aren't quite, and that has
something to do with the failures on anole?

In order to avoid having to clutter stuff like that with #ifdef FRONTENDs,
I'm now thinking we should use exactly the same names for the frontend and
backend versions, ie psprintf() and pvsprintf().  The main reason for
considering a pg_ prefix for the frontend versions was to avoid cluttering
application namespace; but it's already the case that we don't expect
libpgcommon to be namespace clean.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reasons not to like asprintf

2013-10-22 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 ... BTW, another reason to choose identical APIs for frontend and backend
 versions of these functions is that it greatly eases use of them in shared
 frontend/backend code.  As I notice somebody has *already done* in
 common/relpath.c.   I'm not exactly sure how those psprintf calls are
 working at all in frontend builds.  Maybe they aren't quite, and that has
 something to do with the failures on anole?

Seems plausible.

 In order to avoid having to clutter stuff like that with #ifdef FRONTENDs,
 I'm now thinking we should use exactly the same names for the frontend and
 backend versions, ie psprintf() and pvsprintf().  The main reason for
 considering a pg_ prefix for the frontend versions was to avoid cluttering
 application namespace; but it's already the case that we don't expect
 libpgcommon to be namespace clean.

To be honest, I had been assuming we'd use the same names.  As for which
direction to go, I'd personally prefer psprintf but that's just my lazy
developer fingers talking.

Thanks,

Stephen




Re: [HACKERS] Reasons not to like asprintf

2013-10-22 Thread Alvaro Herrera
Tom Lane wrote:
 ... BTW, another reason to choose identical APIs for frontend and backend
 versions of these functions is that it greatly eases use of them in shared
 frontend/backend code.  As I notice somebody has *already done* in
 common/relpath.c.   I'm not exactly sure how those psprintf calls are
 working at all in frontend builds.

There's psprintf in src/common/fe_memutils.c, too, using the backend's
API, which is why this works.

-- 
Álvaro Herrera  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Andres Freund
On 2013-10-22 15:24:40 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2013-10-22 21:29:13 +0300, Heikki Linnakangas wrote:
  We could put a critical section around the whole recursion that inserts the
  downlinks, so that you would get a PANIC and the incomplete split mechanism
  would fix it at recovery. But that would hardly be an improvement.
 
  For me this clearly *has* to be in a critical section with the current
  code. I had always assumed all multi-part actions would be.
 
 No, that's hardly a good idea.  As Heikki says, that would amount to
 converting an entirely foreseeable situation into a PANIC.

But IIUC this can currently lead to an index giving wrong answers, not
just fail at further inserts? I think if we half-split (without
inserting the downlink) a page several times, via different child-pages,
we might succeed in splitting but we'll have corrupted left/right
links. With complete splits such things should be prevented by the
page-level locks we hold, but that's not the case anymore.
If so, that could, especially in combination with unique indexes, lead to
quite nasty data corruption.

 I wonder whether Heikki's approach could be used to remove the need for
 the incomplete-split-fixup code altogether, thus eliminating a class of
 recovery failure possibilities.

That'd be better...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reasons not to like asprintf

2013-10-22 Thread Tom Lane
Alvaro Herrera alvhe...@2ndquadrant.com writes:
 Tom Lane wrote:
 ... BTW, another reason to choose identical APIs for frontend and backend
 versions of these functions is that it greatly eases use of them in shared
 frontend/backend code.  As I notice somebody has *already done* in
 common/relpath.c.   I'm not exactly sure how those psprintf calls are
 working at all in frontend builds.

 There's psprintf in src/common/fe_memutils.c, too, using the backend's
 API, which is why this works.

Ah --- that's why it links, anyway.  As for "works", I suspect that the
answer is that anole has a vsnprintf that returns -1 on buffer overrun.
vasprintf and then psprintf will interpret that as out of memory,
resulting in the observed behavior.  It'd be interesting to see the stack
trace from that error, though, just to confirm this theory.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-10-22 15:24:40 -0400, Tom Lane wrote:
 No, that's hardly a good idea.  As Heikki says, that would amount to
 converting an entirely foreseeable situation into a PANIC.

 But IIUC this can currently lead to an index giving wrong answers, not
 just fail at further inserts?

No.  It's exactly the same situation as when the insert is still in
progress, ie searches have to move right when following the original
downlink.  Go read the nbtree README.
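
The move-right rule in question amounts to roughly the following (a
simplified sketch, assuming the backend's nbtree headers; it is not the
actual _bt_moveright() code, and it ignores half-dead/deleted pages and the
nextkey case):

static Buffer
move_right_if_needed(Relation rel, Buffer buf, ScanKey key, int keysz)
{
	for (;;)
	{
		Page		page = BufferGetPage(buf);
		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

		/* stay here if this is the rightmost page or key <= high key */
		if (P_RIGHTMOST(opaque) ||
			_bt_compare(rel, keysz, key, page, P_HIKEY) <= 0)
			break;

		/* key > high key: tuples moved right during a split, follow them */
		buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, BT_READ);
	}
	return buf;
}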

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Commitfest II CLosed

2013-10-22 Thread Jaime Casanova
On Tue, Oct 22, 2013 at 2:38 PM, Joshua D. Drake j...@commandprompt.com wrote:

 On 10/21/2013 08:11 AM, Robert Haas wrote:

 Supposedly, we have a policy that for each patch you submit, you ought
 to review a patch.  That right there ought to provide enough reviewers
 for all the patches, but clearly it didn't.  And I'm pretty sure that
 some people (like me) looked at a lot MORE patches than they
 themselves submitted.  I think auditing who is not contributing in
 that area and finding tactful ways to encourage them to contribute
 would be a very useful service to the project.


 What if as part of the patch submission process you had to pick the patch
 you were going to review? If there are no patches to review, then we
 obviously don't have a problem. If there are patches to review then we are
 all set.


If we are going to modify the CF app (not offering myself, and I'm not
trying to bind anyone else either), I would prefer to see a flag stating
whether the number of reviews registered there is less than the number of
submitted patches.  This could be a column just after the author of a
patch, so people can give preference to patches from submitters that are
also reviewing other people's patches.

-- 
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-22 Thread Andres Freund
On 2013-10-22 16:38:05 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2013-10-22 15:24:40 -0400, Tom Lane wrote:
  No, that's hardly a good idea.  As Heikki says, that would amount to
  converting an entirely foreseeable situation into a PANIC.
 
  But IIUC this can currently lead to an index giving wrong answers, not
  just fail at further inserts?
 
 No.  It's exactly the same situation as when the insert is still in
 progress, ie searches have to move right when following the original
 downlink.  Go read the nbtree README.

If the parent insert is still in progress we have a write lock on the
left and right page preventing such issues, right? At least that's what
the nbtree README says...
That doesn't hold true if we split a page but fail before re-inserting
the parent since the transaction rollback will obviously have released
the locks.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reasons not to like asprintf

2013-10-22 Thread Tom Lane
I wrote:
 Stephen Frost sfr...@snowman.net writes:
 I agree that exit/Assert-or-elog is the right API here.  I wonder if
 it'd be reasonable or worthwhile to try and actually make that happen-
 that is to say, we really do only have one implementation for both
 front-end and back-end here, but on the front-end we exit, while on the
 back-end we elog(ERROR).  I'm guessing that's a pipe-dream,

 It's not really --- we could make it happen with a single source-file in
 libpgcommon, with a few #ifdef FRONTENDs, similarly to exec.c for
 instance.  I'm not entirely sure that it's worth the trouble, since once
 you take away the error and memory management there's not much left of
 these functions; so the #ifdefs might end up being most of the code.
 I'm going to go code it shortly, and we'll find out.

Attached is a draft, which compiles though I've not yet actually tested it.
It seems pretty reasonable as far as readability goes.  A couple of notes:

1. I omitted pvsprintf(), which we don't actually use anywhere anyway.
I don't see any way to implement that API without va_copy, which is one
of the main things I'm trying to get rid of.  Since we've never needed
a va_args variant in stringinfo.c in all the years we've had that, I
think we can get away with omitting it here.

2. The file includes utils/memutils.h, which I'm not 100% sure is safe
to include in the FRONTEND case.  This is so that it can refer to
MaxAllocSize.  If it turns out that that causes build problems on some
platforms, we could use some more ad-hoc way to define the limit (maybe
just INT_MAX/4?), or we could move the definition of MaxAllocSize into
palloc.h.

Barring objections, I'll push forward with replacing the existing code
with this.

regards, tom lane

/*-
 *
 * psprintf.c
 *  sprintf into an allocated-on-demand buffer
 *
 *
 * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 *
 * IDENTIFICATION
 *src/common/psprintf.c
 *
 *-
 */

#ifndef FRONTEND
#include "postgres.h"
#else
#include "postgres_fe.h"
#endif

#include "utils/memutils.h"


static size_t pvsnprintf(char *buf, size_t len, const char *fmt, va_list args);


/*
 * psprintf
 *
 * Format text data under the control of fmt (an sprintf-style format string)
 * and return it in an allocated-on-demand buffer.  The buffer is allocated
 * with palloc in the backend, or malloc in frontend builds.  Caller is
 * responsible to free the buffer when no longer needed, if appropriate.
 *
 * Errors are not returned to the caller, but are reported via elog(ERROR)
 * in the backend, or printf-to-stderr-and-exit() in frontend builds.
 * One should therefore think twice about using this in libpq.
 */
char *
psprintf(const char *fmt,...)
{
	size_t		len = 128;		/* initial assumption about buffer size */

for (;;)
{
char   *result;
va_list args;

/*
	 * Allocate result buffer.  Note that in frontend this maps to malloc
 * with exit-on-error.
 */
result = (char *) palloc(len);

/* Try to format the data. */
va_start(args, fmt);
len = pvsnprintf(result, len, fmt, args);
va_end(args);

if (len == 0)
return result;  /* success */

		/* Release buffer and loop around to try again with larger len. */
pfree(result);
}
}

/*
 * pvsnprintf
 *
 * Attempt to format text data under the control of fmt (an sprintf-style
 * format string) and insert it into buf (which has length len).
 *
 * If successful, return zero.  If there's not enough space in buf, return
 * an estimate of the buffer size needed to succeed (this *must* be more
 * than len, else psprintf might loop infinitely).
 * Other error cases do not return.
 *
 * XXX This API is ugly, but there seems no alternative given the C spec's
 * restrictions on what can portably be done with va_list arguments: you have
 * to redo va_start before you can rescan the argument list, and we can't do
 * that from here.
 */
static size_t
pvsnprintf(char *buf, size_t len, const char *fmt, va_list args)
{
int nprinted;

	Assert(len > 0);

errno = 0;

/*
 * Assert check here is to catch buggy vsnprintf that overruns the
 * specified buffer length.  Solaris 7 in 64-bit mode is an example of a
 * platform with such a bug.
 */
#ifdef USE_ASSERT_CHECKING
buf[len - 1] = '\0';
#endif

nprinted = vsnprintf(buf, len, fmt, args);

Assert(buf[len - 1] == '\0');

/*
 * 

[HACKERS] tracking commit timestamps

2013-10-22 Thread Alvaro Herrera
Hi,

There has been some interest in keeping track of timestamp of
transaction commits.  This patch implements that.

There are some seemingly curious choices here.  First, this module can
be disabled, and in fact it's turned off by default.  At startup, we
verify whether it's enabled, and create the necessary SLRU segments if
so.  And if the server is started with this disabled, we set the oldest
value we know about to avoid trying to read the commit TS of
transactions of which we didn't keep record.  The ability to turn this
off is there to avoid imposing the overhead on systems that don't need
this feature.

Another thing of note is that we allow for some extra data alongside the
timestamp proper.  This might be useful for a replication system that
wants to keep track of the origin node ID of a committed transaction,
for example.  Exactly what we will do with the bit space we have is
unclear, so I have kept it generic and called it "commit extra data".
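
To make the "extra data" idea concrete, the per-transaction record could look
something like the struct below.  This is purely illustrative; the patch
defines its own SLRU record format and field widths.

#include <stdint.h>

typedef int64_t TimestampTz;	/* microseconds since the PostgreSQL epoch */

/* Illustrative only: one fixed-size record per transaction in the SLRU. */
typedef struct CommitTimestampEntry
{
	TimestampTz	time;			/* when the transaction committed */
	uint32_t	extra;			/* opaque extra data, e.g. an origin node id */
} CommitTimestampEntry;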

This offers the chance for outside modules to set the commit TS of a
transaction; there is support for WAL-logging such values.  But the core
user of the feature (RecordTransactionCommit) doesn't use it, because
xact.c's WAL logging itself is enough.  For systems that are replicating
transactions from remote nodes, it is useful.

We also keep track of the latest committed transaction.  This is
supposed to be useful to calculate replication lag.

-- 
Álvaro Herrera  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
*** a/contrib/pg_xlogdump/rmgrdesc.c
--- b/contrib/pg_xlogdump/rmgrdesc.c
***
*** 9,14 
--- 9,15 
  #include "postgres.h"
  
  #include "access/clog.h"
+ #include "access/committs.h"
  #include "access/gin.h"
  #include "access/gist_private.h"
  #include "access/hash.h"
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 2257,2262  include 'filename'
--- 2257,2277 
    </listitem>
   </varlistentry>
  
+  <varlistentry id="guc-track-commit-timestamp" xreflabel="track_commit_timestamp">
+   <term><varname>track_commit_timestamp</varname> (<type>bool</type>)</term>
+   <indexterm>
+    <primary><varname>track_commit_timestamp</> configuration parameter</primary>
+   </indexterm>
+   <listitem>
+    <para>
+     Record commit time of transactions.  This parameter
+     can only be set in
+     the <filename>postgresql.conf</> file or on the server command line.
+     The default value is off.
+    </para>
+   </listitem>
+  </varlistentry>
+ 
   </variablelist>
  </sect2>
  
*** a/src/backend/access/rmgrdesc/Makefile
--- b/src/backend/access/rmgrdesc/Makefile
***
*** 8,14  subdir = src/backend/access/rmgrdesc
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
  	   mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
  	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
  
--- 8,15 
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o \
!heapdesc.o \
  	   mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
  	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
  
*** /dev/null
--- b/src/backend/access/rmgrdesc/committsdesc.c
***
*** 0 
--- 1,53 
+ /*-
+  *
+  * committsdesc.c
+  *rmgr descriptor routines for access/transam/committs.c
+  *
+  * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *src/backend/access/rmgrdesc/committsdesc.c
+  *
+  *-
+  */
+ #include "postgres.h"
+ 
+ #include "access/committs.h"
+ #include "utils/timestamp.h"
+ 
+ 
+ void
+ committs_desc(StringInfo buf, uint8 xl_info, char *rec)
+ {
+ 	uint8		info = xl_info & ~XLR_INFO_MASK;
+ 
+ 	if (info == COMMITTS_ZEROPAGE)
+ 	{
+ 		int			pageno;
+ 
+ 		memcpy(&pageno, rec, sizeof(int));
+ 		appendStringInfo(buf, "zeropage: %d", pageno);
+ 	}
+ 	else if (info == COMMITTS_TRUNCATE)
+ 	{
+ 		int			pageno;
+ 
+ 		memcpy(&pageno, rec, sizeof(int));
+ 		appendStringInfo(buf, "truncate before: %d", pageno);
+ 	}
+ 	else if (info == COMMITTS_SETTS)
+ 	{
+ 		xl_committs_set *xlrec = (xl_committs_set *) rec;
+ 		int		i;
+ 
+ 		appendStringInfo(buf, "set committs %s for: %u",
+ 		 timestamptz_to_str(xlrec->timestamp),
+ 		 xlrec->mainxid);
+ 		for (i = 0; i < xlrec->nsubxids; i++)
+ 			appendStringInfo(buf, ", %u", xlrec->subxids[i]);
+ 	}
+ 	else
+ 		appendStringInfo(buf, "UNKNOWN");
+ }
*** a/src/backend/access/rmgrdesc/xlogdesc.c
--- b/src/backend/access/rmgrdesc/xlogdesc.c
***
*** 44,50  xlog_desc(StringInfo buf, 

[HACKERS] matviews do not lock relations during refresh

2013-10-22 Thread Andres Freund
Hi,

In a postgres build that has added instrumentation to detect cases
where indexes are index_open()ed without any locks on the underlying
relation, the matview code started to cry during the regression tests.

The problem seems to be that refresh_matview_datafill() uses
QueryRewrite() without previously using AcquireRewriteLocks() but as
QueryRewrite states:
 * ...
 * NOTE: the parsetree must either have come straight from the parser,
 * or have been scanned by AcquireRewriteLocks to acquire suitable locks.
 */

I think the missed locks could lead to postgres crashing under
concurrent DDL.

I've attached a quick patch that fixes the issue for me in HEAD, but I
am not 100% sure I understand the details of refresh_matview_datafill().

This probably needs to go into 9.3 in an adapted form.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From ae34a1b4d59627360131bb795da75326deea40b3 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Wed, 23 Oct 2013 01:25:49 +0200
Subject: [PATCH] Acquire appropriate locks when rewriting during REFRESH
 MATERIALIZED VIEW

---
 src/backend/commands/matview.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index fcfc678..1c02eab 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -283,6 +283,7 @@ refresh_matview_datafill(DestReceiver *dest, Query *query,
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	Query	   *copied_query;
 
 	/*
 	 * Switch to the owner's userid, so that any functions are run as that
@@ -294,8 +295,14 @@ refresh_matview_datafill(DestReceiver *dest, Query *query,
 		   save_sec_context | SECURITY_RESTRICTED_OPERATION);
 	save_nestlevel = NewGUCNestLevel();
 
-	/* Rewrite, copying the given Query to make sure it's not changed */
-	rewritten = QueryRewrite((Query *) copyObject(query));
+	/* copy the given Query to make sure it's not changed */
+	copied_query = copyObject(query);
+
+	/* acquire locks for rewriting */
+	AcquireRewriteLocks(copied_query, false);
+
+	/* Rewrite */
+	rewritten = QueryRewrite(copied_query);
 
 	/* SELECT should never rewrite to more or less than one SELECT query */
 	if (list_length(rewritten) != 1)
-- 
1.8.3.251.g1462b67


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why the asprintf patch is still breaking the buildfarm

2013-10-22 Thread Noah Misch
On Tue, Oct 22, 2013 at 11:00:42AM -0400, Tom Lane wrote:
 Yeah.  As a separate matter, it might be useful to revise stringinfo.c
 and the asprintf code so that *if* the returned value is larger than the
 given buffer size, we use it as a guide to resizing, avoiding the possible
 need to loop multiple times to make the buffer large enough.  And we could
 also improve our own implementation of snprintf to follow the C99 spec.
 
 The point here is that we still need to cope with pre-C99 implementations
 that might return -1 or the given buffer size on overflow.  The NetBSD
 implementation doesn't do that, which is reasonable in their context, but
 not workable for us.

I would vote for choosing the standard we want vsnprintf() to follow (probably
C99) and substituting a conforming implementation wherever configure detects
that libc does not conform.  We'll be shipping some replacement vsnprintf() in
any case; we may as well use it to insulate the rest of our code from
less-preferred variants.
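
A configure-time probe for the preferred behavior can be as simple as
checking what snprintf reports for an undersized buffer; a sketch of such a
test program (not the project's actual autoconf macro):

#include <stdio.h>

/*
 * Exits 0 if snprintf returns the length the full result would have needed
 * (C99/SUSv3 behavior); exits 1 for pre-C99 conventions such as returning -1
 * or the truncated length.
 */
int
main(void)
{
	char		buf[4];
	int			n = snprintf(buf, sizeof(buf), "%d", 123456);

	return (n == 6) ? 0 : 1;
}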

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] missing locking in at least INSERT INTO view WITH CHECK

2013-10-22 Thread Andres Freund
Hi,

Using the same debugging hack^Wpatch (0001) as in the matview patch
(0002) an hour or so ago I noticed that INSERT INTO view WITH CHECK
doesn't lock the underlying relations properly.

I've attached a sort-of-working (0003) hack, but I really doubt it's the
correct approach; I don't really know enough about that area of the
code.
This looks like something that needs to be fixed.

Also attached is 0004 which just adds a heap_lock() around a newly
created temporary table in the matview code which shouldn't be required
for correctness but gives warm and fuzzy feelings as well as less
debugging noise.

Wouldn't it be a good idea to tack such WARNINGs (in a proper and clean
form) to index_open (checking the underlying relation is locked),
relation_open(..., NoLock) (checking the relation has previously been
locked) and maybe RelationIdGetRelation() when cassert is enabled? ISTM
we frequently had bugs around this.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] missing locking in at least INSERT INTO view WITH CHECK

2013-10-22 Thread Andres Freund
On 2013-10-23 03:18:55 +0200, Andres Freund wrote:
 (000[1-4])

Attached.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From bf329af4eb6d839ae2a75c4f8a2d6867877510f4 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Wed, 23 Oct 2013 02:34:51 +0200
Subject: [PATCH 1/4] debug-lock-on-index-without-heap-lock

---
 src/backend/access/heap/heapam.c   | 10 ++
 src/backend/access/index/indexam.c | 14 ++
 src/backend/storage/lmgr/lmgr.c|  9 +
 src/backend/storage/lmgr/lock.c| 16 
 src/backend/utils/cache/relcache.c | 10 ++
 src/include/storage/lmgr.h |  2 ++
 src/include/storage/lock.h |  2 ++
 7 files changed, 63 insertions(+)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a21f31b..3eecd5f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1203,6 +1203,16 @@ heap_open(Oid relationId, LOCKMODE lockmode)
 				 errmsg("\"%s\" is a composite type",
 						RelationGetRelationName(r))));
 
+	if (IsNormalProcessingMode() && lockmode == NoLock)
+	{
+		LOCKMODE	mode;
+
+		mode = StrongestLocalRelationLock(relationId);
+		if (mode == NoLock)
+			elog(WARNING, "no relation lock on rel %s",
+				 RelationGetRelationName(r));
+	}
+
 	return r;
 }
 
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index b878155..10c3991 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -65,6 +65,8 @@
 
 #include "postgres.h"
 
+#include "miscadmin.h"
 
 #include "access/relscan.h"
 #include "access/transam.h"
 #include "catalog/index.h"
@@ -142,6 +144,8 @@ static IndexScanDesc index_beginscan_internal(Relation indexRelation,
  * 
  */
 
+#include "catalog/catalog.h"
+
 /* 
  *		index_open - open an index relation by relation OID
  *
@@ -169,6 +173,16 @@ index_open(Oid relationId, LOCKMODE lockmode)
 				 errmsg("\"%s\" is not an index",
 						RelationGetRelationName(r))));
 
+	if (IsNormalProcessingMode())
+	{
+		LOCKMODE	mode;
+
+		mode = StrongestLocalRelationLock(r->rd_index->indrelid);
+		if (mode == NoLock)
+			elog(WARNING, "no relation lock on rel %u held while opening index %s",
+				 r->rd_index->indrelid,
+				 RelationGetRelationName(r));
+	}
 	return r;
 }
 
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index a978172..32d7bba 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -232,6 +232,15 @@ UnlockRelation(Relation relation, LOCKMODE lockmode)
 	LockRelease(tag, lockmode, false);
 }
 
+LOCKMODE
+StrongestLocalRelationLock(Oid relid)
+{
+	LOCKTAG		tag;
+	SetLocktagRelationOid(&tag, relid);
+	return StrongestLocalLock(&tag);
+}
+
+
 /*
  *		LockHasWaitersRelation
  *
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index f4f32e9..2f068be 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2365,6 +2365,22 @@ LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent)
 	ResourceOwnerForgetLock(CurrentResourceOwner, locallock);
 }
 
+LOCKMODE
+StrongestLocalLock(const LOCKTAG* locktag)
+{
+	HASH_SEQ_STATUS status;
+	LOCALLOCK  *locallock;
+	LOCKMODE	strongest = NoLock;
+	hash_seq_init(&status, LockMethodLocalHash);
+
+	while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+	{
+		if (memcmp(&locallock->tag.lock, locktag, sizeof(LOCKTAG)) == 0)
+			strongest = Max(strongest, locallock->tag.mode);
+	}
+	return strongest;
+}
+
 /*
  * FastPathGrantRelationLock
  *		Grant lock using per-backend fast-path array, if there is space.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index b4cc6ad..f74c6af 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1577,6 +1577,16 @@ RelationIdGetRelation(Oid relationId)
 {
 	Relation	rd;
 
+	if (IsNormalProcessingMode())
+	{
+		LOCKMODE	mode;
+
+		mode = StrongestLocalRelationLock(relationId);
+		if (mode == NoLock)
+			elog(WARNING, "no relation lock for descriptor of oid %u",
+				 relationId);
+	}
+
 	/*
 	 * first try to find reldesc in the cache
 	 */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 1a8c018..c2c6026 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -40,6 +40,8 @@ extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
 
+extern LOCKMODE StrongestLocalRelationLock(Oid relid);
+
 /* Lock a page (currently only used within indexes) */
 extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
 extern bool 

Re: [HACKERS] Compression of full-page-writes

2013-10-22 Thread KONDO Mitsumasa

(2013/10/22 12:52), Fujii Masao wrote:

On Tue, Oct 22, 2013 at 12:47 PM, Amit Kapila amit.kapil...@gmail.com wrote:

On Mon, Oct 21, 2013 at 4:40 PM, KONDO Mitsumasa
kondo.mitsum...@lab.ntt.co.jp wrote:

(2013/10/19 14:58), Amit Kapila wrote:

On Tue, Oct 15, 2013 at 11:41 AM, KONDO Mitsumasa
kondo.mitsum...@lab.ntt.co.jp wrote:
In general, my thinking is that we should prefer compression to reduce
IO (WAL volume), because reducing WAL volume has other benefits as
well like sending it to subscriber nodes. I think it will help cases
where due to less n/w bandwidth, the disk allocated for WAL becomes
full due to high traffic on master and then users need some
alternative methods to handle such situations.

Do you talk about archiving WAL file?


One of the points what I am talking about is sending data over
network to subscriber nodes for streaming replication and another is
WAL in pg_xlog. Both scenario's get benefited if there is is WAL
volume.


It can easy to reduce volume that we
set and add compression command with copy command at archive_command.


Okay.


I think many users would like a method which can reduce WAL
volume, and the users who don't find it useful enough in their
environments, due to a decrease in TPS or no significant reduction in
WAL, have the option to disable it.

I favor selecting a compression algorithm for higher performance. If we need
to compress WAL files further, at the cost of some performance, we can change
the archive copy command to use a high-compression algorithm and document
how to compress archived WAL files in archive_command. Is that wrong?


  No, it is not wrong, but there are scenarios, as mentioned above,
where less WAL volume can be beneficial.


In fact, many NoSQL systems use snappy for higher performance.


Okay, you can also check the results with the snappy algorithm, but don't
just rely completely on snappy for this patch; you might want to think
of another alternative for this patch.


So, our consensus is to introduce the hooks for FPW compression so that
users can freely select their own best compression algorithm?
Yes, it will also be good for future improvement. But I think WAL compression for 
a disaster recovery system should be done in the walsender and walreceiver processes; 
that is the proper architecture for a DR system. A higher-compression-ratio, 
high-CPU-usage algorithm in FPW might hurt performance on the master server. If 
we can set the compression algorithm in walsender and walreceiver, performance stays 
the same as before or better, and WAL send performance will improve.


Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [BUGS] BUG #7873: pg_restore --clean tries to drop tables that don't exist

2013-10-22 Thread Josh Kupershmidt
On Mon, Oct 14, 2013 at 5:38 PM, Josh Kupershmidt schmi...@gmail.com wrote:
 Also, Pavel, this patch is still listed as 'Needs Review' in the CF
 app, but I haven't seen a response to the concerns in my last message.

It looks like this patch has been imported into the 2013-11 CF [1] and
marked Needs Review. I looked at the version of the patch pointed to
in that CF entry back in July, and noted [2] several problems that
still seemed to be present in the patch, for which I never saw a
followup from Pavel.  IMO this patch should have gotten marked
Returned with Feedback pending a response from Pavel.

Josh

[1] https://commitfest.postgresql.org/action/patch_view?id=1174
[2] 
http://www.postgresql.org/message-id/CAK3UJREth9DVL5U7ewOLQYhXF7EcV5BABFE+pzPQjkPfqbW=v...@mail.gmail.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why the asprintf patch is still breaking the buildfarm

2013-10-22 Thread Tom Lane
Noah Misch n...@leadboat.com writes:
 On Tue, Oct 22, 2013 at 11:00:42AM -0400, Tom Lane wrote:
 Yeah.  As a separate matter, it might be useful to revise stringinfo.c
 and the asprintf code so that *if* the returned value is larger than the
 given buffer size, we use it as a guide to resizing, avoiding the possible
 need to loop multiple times to make the buffer large enough.  And we could
 also improve our own implementation of snprintf to follow the C99 spec.
 
 The point here is that we still need to cope with pre-C99 implementations
 that might return -1 or the given buffer size on overflow.  The NetBSD
 implementation doesn't do that, which is reasonable in their context, but
 not workable for us.

 I would vote for choosing the standard we want vsnprintf() to follow (probably
 C99) and substituting a conforming implementation wherever configure detects
 that libc does not conform.  We'll be shipping some replacement vsnprintf() in
 any case; we may as well use it to insulate the rest of our code from
 less-preferred variants.

The problem is that we can't tell whether vsnprintf is standard-conforming
without a run-time test.  That's bad for cross-compiled builds, and it's
pretty hazardous even for normal cases, since conceivably an executable
built on one machine could be used on another one with different run-time
behavior.  I'd be willing to take those risks if we got a significant
benefit from it, but in this case I don't see much advantage to be had.
The code in stringinfo/psprintf wouldn't get very much simpler if we
assumed C99 behavior, and we've pretty well isolated the number of places
that care to those.  (I see a couple places in pg_dump that could be
modified to use psprintf instead of direct vsnprintf calls; will go fix.)
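
For illustration, a minimal sketch of the kind of retry loop such code has to
use when vsnprintf may return -1, the buffer size, or the bytes actually
written on overflow (pre-C99 behavior).  This is my own sketch, not the actual
psprintf implementation, and the helper name is made up:

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Format into a malloc'd buffer, tolerating pre-C99 vsnprintf
 * implementations.  Error handling is reduced to exit() for brevity.
 */
static char *
format_alloc(const char *fmt, ...)
{
	size_t		len = 128;		/* initial guess */

	for (;;)
	{
		char	   *buf = malloc(len);
		va_list		args;
		int			nprinted;

		if (buf == NULL)
			exit(1);

		va_start(args, fmt);
		nprinted = vsnprintf(buf, len, fmt, args);
		va_end(args);

		/*
		 * Trust the result only if it is clearly within the buffer: a
		 * SUS v2 snprintf returning len - 1 may mean "truncated", so treat
		 * that case (and anything larger, or -1) as an overflow and retry.
		 */
		if (nprinted >= 0 && (size_t) nprinted < len - 1)
			return buf;

		free(buf);

		if (nprinted >= 0 && (size_t) nprinted >= len)
			len = (size_t) nprinted + 2;	/* C99 told us the size needed */
		else
			len *= 2;			/* -1 or ambiguous: just double and retry */
	}
}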

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Sigh, my old HPUX box is totally broken by DSM patch

2013-10-22 Thread Tom Lane
initdb.c quoth:

 * ... but the fact that a platform has shm_open
 * doesn't guarantee that that call will succeed when attempted.

Indeed:

$ initdb
The files belonging to this database system will be owned by user postgres.
This user must also own the server process.

The database cluster will be initialized with locale C.
The default database encoding has accordingly been set to SQL_ASCII.
The default text search configuration will be set to english.

Data page checksums are disabled.

creating directory /home/postgres/testversion/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... Bad system call(coredump)
$ 

gdb shows:

Core was generated by `initdb'.
Program terminated with signal 12, Bad system call.
(gdb) bt
#0  0xc0143fb0 in ?? () from /usr/lib/libc.1
#1  0xa890 in choose_dsm_implementation () at initdb.c:1098
#2  0xab10 in test_config_settings () at initdb.c:1217
#3  0xe310 in initialize_data_directory () at initdb.c:3412
#4  0xed0c in main (argc=1, argv=0x7b03ac68) at initdb.c:3691
#5  0xc0065784 in ?? () from /usr/lib/libc.1

I'm not entirely sure what to do about this.  Conceivably we could have
initdb catch SIGSYS, but that seems rather ugly.  Maybe configure needs a
run-time test to see if shm_open will work, rather than just probing to
see if such a function exists?  I'm not thrilled with run-time tests in
configure though.  Another possibility is for initdb to execute the
probe in a forked subprocess instead of risking doing it itself.
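
For what it's worth, a minimal sketch of the forked-subprocess idea (my own
illustration, not a proposed patch): the parent only inspects the child's exit
status, so a SIGSYS can only take down the child, not initdb itself.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Probe whether shm_open() actually works, running the attempt in a child
 * process so that a fatal signal only kills the child.  Returns 1 if the
 * probe succeeded.  A real probe would use a unique segment name.
 */
static int
probe_shm_open(void)
{
	pid_t		pid = fork();

	if (pid < 0)
		return 0;				/* can't fork, assume not usable */

	if (pid == 0)
	{
		/* child: try it and report via exit status */
		int			fd = shm_open("/pg_dsm_probe", O_CREAT | O_RDWR, 0600);

		if (fd < 0)
			_exit(1);
		shm_unlink("/pg_dsm_probe");
		_exit(0);
	}
	else
	{
		int			status;

		if (waitpid(pid, &status, 0) != pid)
			return 0;
		/* only trust a clean, successful exit; death by signal means "no" */
		return WIFEXITED(status) && WEXITSTATUS(status) == 0;
	}
}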

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2013-10-22 Thread KONDO Mitsumasa

Hi All,

(2013/10/22 22:26), Stephen Frost wrote:

* Dimitri Fontaine (dimi...@2ndquadrant.fr) wrote:

In our case, what I keep experiencing with tuning queries is that we
have like 99% of them running under acceptable threshold and 1% of them
taking more and more time.


This is usually described (at least where I come from) as 'rare events',
which goes to Tom's point that averages, stddev, etc, are not ideal
(though they are still better than nothing).


good reason to suppose that query runtime is Gaussian?  (I'd bet not;
in particular, multimodal behavior seems very likely due to things like
plan changes.)  If not, how much does that affect the usefulness of
a standard-deviation calculation?


Oscillating plan changes may fit multimodal but I don't feel that's
typical.  My experience has been it's either an extremely rare plan
difference or it's a shift from one plan to another over time.
After all, avg, min, max, and stddev are only numerical values for a predictive 
model. They lack the robustness and strictness of something like Write Ahead Logging; 
it resembles a weather forecast. They are still better than nothing.
In the end, human judgment is needed to infer a cause from the numerical values.
That said, we can estimate the probability of a value from the stddev, so even 
when there is an extreme value in min/max we can easily judge whether it is 
normal or not.
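
As an illustration of how cheaply such a judgment can be made, here is a minimal
sketch (my own, not part of any patch) of an online mean/stddev accumulator using
Welford's algorithm plus a simple three-sigma check; it needs only a few counters
per entry. The names are illustrative, not pg_stat_statements code.

#include <math.h>
#include <stdbool.h>

/*
 * Running statistics for one statement entry: constant storage per entry.
 * Assumes the struct starts out zeroed.
 */
typedef struct QueryTimeStats
{
	long long	n;				/* number of executions seen */
	double		min_time;
	double		max_time;
	double		mean;			/* running mean (Welford) */
	double		m2;				/* sum of squared deviations from the mean */
} QueryTimeStats;

/* Fold one execution time (in ms) into the running statistics. */
static void
stats_add(QueryTimeStats *s, double t)
{
	double		delta;

	if (s->n == 0 || t < s->min_time)
		s->min_time = t;
	if (s->n == 0 || t > s->max_time)
		s->max_time = t;

	s->n++;
	delta = t - s->mean;
	s->mean += delta / s->n;
	s->m2 += delta * (t - s->mean);
}

/* Flag an execution as an outlier if it is more than 3 stddev above the mean. */
static bool
stats_is_outlier(const QueryTimeStats *s, double t)
{
	double		stddev;

	if (s->n < 2)
		return false;
	stddev = sqrt(s->m2 / (s->n - 1));
	return t > s->mean + 3.0 * stddev;
}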



What I've been gathering from my quick chat this morning is that either
you know how to characterize the distribution and then the min max and
average are useful on their own, or you need to keep track of an
histogram where all the bins are of the same size to be able to learn
what the distribution actually is.


A histogram would certainly be useful.  We may also wish to look into
outlier/rare event detection methods and increase the logging we do in
those cases (if possible).


Of course keeping enough stats seems to always begin with keeping the
min, max and average, so we can just begin there. We would just be
unable to answer interesting questions with just that.


It would probably be good to do some research into techniques for
outlier detection which minimizes CPU and storage cost.

pg_stat_statement is often used on production database systems, so I don't
like implementations with high CPU usage. Software that degrades performance
merely by being installed is too unpleasant to accept. And if we need
more detailed information for SQL tuning, it would be better to develop a
separate performance tuning and monitoring contrib module that is not meant
to run on production database systems.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Compression of full-page-writes

2013-10-22 Thread Amit Kapila
On Wed, Oct 23, 2013 at 7:05 AM, KONDO Mitsumasa
kondo.mitsum...@lab.ntt.co.jp wrote:
 (2013/10/22 12:52), Fujii Masao wrote:

 On Tue, Oct 22, 2013 at 12:47 PM, Amit Kapila amit.kapil...@gmail.com
 wrote:

 On Mon, Oct 21, 2013 at 4:40 PM, KONDO Mitsumasa
 kondo.mitsum...@lab.ntt.co.jp wrote:

 (2013/10/19 14:58), Amit Kapila wrote:

 On Tue, Oct 15, 2013 at 11:41 AM, KONDO Mitsumasa
 kondo.mitsum...@lab.ntt.co.jp wrote:
 In fact, many NoSQL systems use snappy for higher performance.


 Okay, you can also check the results with the snappy algorithm, but don't
 just rely completely on snappy for this patch; you might want to think
 of another alternative for this patch.


 So, our consensus is to introduce the hooks for FPW compression so that
 users can freely select their own best compression algorithm?

 Yes, it will also be good for future improvement. But I think WAL
 compression for a disaster recovery system should be done in the walsender and
 walreceiver processes; that is the proper architecture for a DR system. A
 higher-compression-ratio, high-CPU-usage algorithm in FPW might hurt
 performance on the master server.

   This is true; that's why there is a discussion of a pluggable API for
compression of WAL, and we should try to choose the best algorithm from the
available choices. Even then, I am not sure it works the same for all
kinds of loads, so the user will have the option to completely disable it as
well.
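
To make the pluggable idea concrete, here is a purely hypothetical sketch of
what such a hook could look like (the names and signatures are illustrative
assumptions of mine, not an existing or proposed PostgreSQL API):

#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical FPW compression hook.  A contrib module would install the
 * hook from its _PG_init(); core would call it while building a backup
 * block and fall back to the raw page if the hook is unset, fails, or
 * does not actually shrink the data.  Assumes dstcap >= pagelen.
 */
typedef bool (*fpw_compress_hook_type) (const char *src, int srclen,
										char *dst, int *dstlen);

fpw_compress_hook_type fpw_compress_hook = NULL;

static int
compress_backup_block(const char *page, int pagelen, char *dst, int dstcap)
{
	int			dstlen = dstcap;

	if (fpw_compress_hook != NULL &&
		fpw_compress_hook(page, pagelen, dst, &dstlen) &&
		dstlen < pagelen)
		return dstlen;			/* store the compressed image */

	memcpy(dst, page, pagelen); /* fall back to the uncompressed page */
	return pagelen;
}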

 If we can set the compression algorithm in
 walsender and walreceiver, performance is the same as before or better, and WAL
 send performance will be better.

Do you mean to say that the walsender should compress the data before
sending and the walreceiver will decompress it? If yes, won't it
add extra overhead on the standby, or do you think that because the walreceiver
has to read less data from the socket, it will compensate for it? I think maybe
we should consider this if the test results are good, but let's not
try to do this until the current patch proves that such a mechanism is
good for WAL compression.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add \i option to bring in the specified file as a quoted literal

2013-10-22 Thread Amit Kapila
On Tue, Oct 22, 2013 at 3:04 AM, Piotr Marcinczyk pmarc...@gmail.com wrote:
 Hi,

 I would like to implement the TODO item marked as easy: Add \i option
 to bring in the specified file as a quoted literal. I understand the intent
 of this item: to be able to have parts of a query written in separate
 files (currently this is impossible, because \i tries to execute the content
 of the file as a separate command via the function process_file).

For the usecase discussed in the mail chain of that TODO item, Robert
Haas has provided an alternative to achieve it, please check below
link:
http://www.postgresql.org/message-id/AANLkTi=7c8xfyf7uqw0y+si8ondkoy2nx8jc4bu0g...@mail.gmail.com

If you think that alternative is not sufficient for the use case, then
adding a new option/syntax is worthwhile; otherwise it would just be a shortcut
or another form of an existing method, whose usefulness depends on how
frequently users would use this syntax.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers