Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On 28 June 2015 at 17:17, Tom Lane t...@sss.pgh.pa.us wrote:
> Simon Riggs si...@2ndquadrant.com writes:
>> On 27 June 2015 at 15:10, Tom Lane t...@sss.pgh.pa.us wrote:
>>> I don't like this too much because it will fail badly if the caller is wrong about the maximum possible page number for the table, which seems not exactly far-fetched. (For instance, remember those kernel bugs we've seen that cause lseek to lie about the EOF position?)
>> If that is true, then our reliance on lseek elsewhere could also cause data loss, for example by failing to scan data during a seq scan.
> The lseek point was a for-example, not the entire universe of possible problem sources for this patch. (Also, underestimating the EOF point in a seqscan is normally not an issue since any rows in a just-added page are by definition not visible to the scan's snapshot. But I digress.)
>> The consequences of failure of lseek in this case are nowhere near as dire, since by definition the data is being destroyed by the user.
> I'm not sure what you consider dire, but missing a dirty buffer belonging to the to-be-destroyed table would result in the system being permanently unable to checkpoint, because attempts to write out the buffer to the no-longer-extant file would fail. You could only get out of the situation via a forced database crash (immediate shutdown), followed by replaying all the WAL since the time of the problem. In production contexts that could be pretty dire.

Yes, it's bad, but we do notice that it has happened. We can also put in code to specifically avoid this error at checkpoint time.

If lseek fails badly then SeqScans would give *silent* data loss, which in my view is worse. Just-added pages aren't the only thing we might miss if lseek is badly wrong.

So I think this patch still has legs.
We can check that the clean-up has been 100% when we do the buffer scan at the start of the checkpoint - that way we do just one scan of the buffer pool and move a time-consuming operation into a background process.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] proposal: condition blocks in psql
On Sun, Jun 28, 2015 at 1:59 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
>>>> \if_ver_eq 9.2 ... What do you think about it?
>>> Couldn't this kind of thing be done directly with PL/pgSQL?
>> you can use PL/pgSQL - but there are some limits:
>> * maintenance of large plpgsql functions
> I agree with large, but that would not necessarily mean complex. Also, some functions could be in SQL, and just the logic with PL/pgSQL.
>> * the plpgsql functions or anonymous blocks create transaction borders - which may not be wanted
> Hmmm... If something fails when installing an extension, a transaction border is probably a good thing? Also, the interaction of \if with possible BEGIN/COMMIT can lead to strange states.

Manual transaction control is the killer feature IMO; not being able to do it forces code out of SQL and into a scripting language. Transaction management in 'psql scripting' is no more or less fragile than in most imperative languages.

Personally, I prefer a server-side solution to this problem (stored procedures) so that the application can leverage this functionality through the protocol. However, psql extensions are probably worth it in their own right.

merlin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.
On Wed, Mar 18, 2015 at 1:59 PM, Michael Paquier michael.paqu...@gmail.com wrote:
> Feedback is of course welcome, but note that I am not seriously expecting any until we get into the 9.6 development cycle, and I am adding this patch to the next CF.

I have moved this patch to CF 2015-09, as I have enough patches to take care of for now... Let's focus on Windows support and improvement of logging for TAP in the first round. That will already be a good step forward.
--
Michael
Re: [HACKERS] anole: assorted stability problems
On 2015-06-29 12:11:08 +0200, Andres Freund wrote:
> On 2015-06-29 00:42:53 -0400, Tom Lane wrote:
>> #define S_UNLOCK(lock) \
>>     do { _Asm_sched_fence(); (*(lock)) = 0; } while (0)
> Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler barrier? Shouldn't this be an _Asm_mf()?

I've pushed a patch replacing the _sched_fence with _mf. That's what we use in the atomics code as well. I did reread http://h21007.www2.hp.com/portal/download/files/unprot/Itanium/inline_assem_ERS.pdf and I do think _Asm_mf is the correct thing.

What I did not change was the mask used for serialization - the default 0x3D3D (c.f. page 23) should be correct, even though slightly suboptimal.

Andres
Re: [HACKERS] UPSERT on partition
On 2015-06-25 AM 09:51, Amit Langote wrote:
> Peter,
> On 2015-06-25 AM 02:35, Peter Geoghegan wrote:
>> Inheritance with triggers is a leaky abstraction, so this kind of thing is always awkward. Still, UPSERT has full support for *inheritance* -- that just doesn't help in this case.
> Could you clarify what UPSERT's support for inheritance entails?

Oh, I see that this stuff has been discussed (-hackers) and written about (wiki). I'll go read about it.

I agree with Fujii-san's concern that any limitations of UPSERT on partitions, given those of the partitioning approach, should likewise be documented; if not under INSERT/UPSERT, then under partitioning; not really sure which.

Thanks,
Amit
Re: [HACKERS] anole: assorted stability problems
On 2015-06-29 00:42:53 -0400, Tom Lane wrote:
> #define S_UNLOCK(lock) \
>     do { _Asm_sched_fence(); (*(lock)) = 0; } while (0)

Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler barrier? Shouldn't this be an _Asm_mf()?

> which immediately raises the question of why omitting the volatile cast is okay.

Should be fine if _Asm_sched_fence() were a proper memory (or even a release) barrier. Which I don't think it is.

> I also wonder if we don't need a second _Asm_sched_fence() after the lock release.

Shouldn't be needed - the only thing that could be reordered is the actual lock release. Which will just impact timing in a minor manner (it can't move into another locked section).

Greetings,

Andres Freund
Re: [HACKERS] Rework the way multixact truncations work
On 2015-06-29 23:54:40 +1200, Thomas Munro wrote:
> On Mon, Jun 22, 2015 at 7:24 AM, Andres Freund and...@anarazel.de wrote:
>> It'd be very welcome to see some wider testing and review on this.
> I started looking at this and doing some testing. Here is some initial feedback:
> Perhaps vac_truncate_clog needs a new name now that it does more, maybe vac_truncate_transaction_logs?

It has done more before, so I don't really see a connection to this patch...

> MultiXactState->sawTruncationCkptCyle: is 'Cyle' supposed to be 'Cycle'?

Oops.

> In the struct xl_multixact_truncate, and also the function WriteMTruncateXlogRec and other places, I think you have confused the typedefs MultiXactOffset and MultiXactId. If I'm not mistaken, startMemb and endMemb have the correct type, but startOff and endOff should be of type MultiXactId despite their names (the *values* stored inside pg_multixact/offsets are indeed offsets (into pg_multixact/members), but their *location* is what a multixact ID represents).

IIRC I did it that way to make clear this is just a 'byte'-type offset, without other meaning. Wasn't a good idea.

> I was trying to understand if there could be any problem caused by setting latest_page_number to the pageno that holds (or will hold) xlrec.endOff in multixact_redo. The only thing that jumps out at me is that the next call to SlruSelectLRUPage will no longer be prevented from evicting the *real* latest page (the most recently created page).

That hasn't changed, unless I'm missing something?

> In SlruDeleteSegment, is it OK to call unlink() while holding the SLRU control lock?

I think it's safer than not doing it, but don't particularly care.

> In find_multixact_start you call SimpleLruFlush before calling SimpleLruDoesPhysicalPageExist. Should we do something like this instead? https://gist.github.com/macdice/8e5b2f0fe3827fdf3d5a

I'm currently slightly inclined to do it my way. The way these functions are used, it doesn't seem like a bad property to ensure things are on disk.
> I think I saw some extra autovacuum activity that I didn't expect in a simple scenario, but I'm not sure and will continue looking tomorrow.

Cool, thanks!

Greetings,

Andres Freund
Re: [HACKERS] pg_trgm version 1.2
On Sat, Jun 27, 2015 at 5:17 PM, Jeff Janes jeff.ja...@gmail.com wrote:
> This patch implements version 1.2 of contrib module pg_trgm.
> This supports the triconsistent function, introduced in version 9.4 of the server, to make it faster to implement indexed queries where some keys are common and some are rare.
> I've included the paths to both upgrade and downgrade between 1.1 and 1.2, although after doing so you must close and restart the session before you can be sure the change has taken effect. There is no change to the on-disk index structure.
> This shows the difference it can make in some cases:
>
> create extension pg_trgm version 1.1;
> create table foo as select
>   md5(random()::text) ||
>   case when random() < 0.05 then 'lmnop' else '123' end ||
>   md5(random()::text) as bar
> from generate_series(1,1000);
> create index on foo using gin (bar gin_trgm_ops);
> --some queries
> alter extension pg_trgm update to 1.2;
> --close, reopen, more queries
> select count(*) from foo where bar like '%12344321lmnabcddd%';
>
> V1.1: Time: 1743.691 ms --- after repeated execution to warm the cache
> V1.2: Time: 2.839 ms --- after repeated execution to warm the cache

Wow! I'm going to test this. I have some data sets for which trigram searching isn't really practical... if the search string touches trigrams with a lot of duplication, the algorithm can have trouble beating brute-force searches. Trigram searching is important: it's currently the only way to search string-encoded structures for partial strings quickly.

merlin
Re: [HACKERS] Rework the way multixact truncations work
On Mon, Jun 22, 2015 at 7:24 AM, Andres Freund and...@anarazel.de wrote:
> It'd be very welcome to see some wider testing and review on this.

I started looking at this and doing some testing. Here is some initial feedback:

Perhaps vac_truncate_clog needs a new name now that it does more, maybe vac_truncate_transaction_logs?

MultiXactState->sawTruncationCkptCyle: is 'Cyle' supposed to be 'Cycle'?

In the struct xl_multixact_truncate, and also the function WriteMTruncateXlogRec and other places, I think you have confused the typedefs MultiXactOffset and MultiXactId. If I'm not mistaken, startMemb and endMemb have the correct type, but startOff and endOff should be of type MultiXactId despite their names (the *values* stored inside pg_multixact/offsets are indeed offsets (into pg_multixact/members), but their *location* is what a multixact ID represents).

I was trying to understand if there could be any problem caused by setting latest_page_number to the pageno that holds (or will hold) xlrec.endOff in multixact_redo. The only thing that jumps out at me is that the next call to SlruSelectLRUPage will no longer be prevented from evicting the *real* latest page (the most recently created page).

In SlruDeleteSegment, is it OK to call unlink() while holding the SLRU control lock?

In find_multixact_start you call SimpleLruFlush before calling SimpleLruDoesPhysicalPageExist. Should we do something like this instead? https://gist.github.com/macdice/8e5b2f0fe3827fdf3d5a

I think I saw some extra autovacuum activity that I didn't expect in a simple scenario, but I'm not sure and will continue looking tomorrow.

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] pg_file_settings view vs. Windows
Sawada Masahiko sawada.m...@gmail.com writes:
> On Mon, Jun 29, 2015 at 12:20 AM, Tom Lane t...@sss.pgh.pa.us wrote:
>> So let me make a radical proposal that both gets rid of the portability problem and, arguably, makes the view more useful than it is today. I think we should define the view as showing, not what was in the config file(s) at the last SIGHUP, but what is in the files *now*. That means you could use it to validate edits before you attempt to apply them, rather than having to pull the trigger and then ask if it worked. And yet it would remain just about as useful as it is now for post-mortem checks when a SIGHUP didn't do what you expected.
> You mean that we parse each GUC parameter in the configuration file, and then pass changeVal=false to set_config_option whenever pg_file_settings is referred to. So this view would be used for checking whether the current configuration file would be applied successfully or not, right?

Well, it would check whether the current contents of the file could be applied successfully.

> Such a feature would also be good, but the main purpose of pg_file_settings was to resolve where each GUC parameter came from, especially in the case of using the ALTER SYSTEM command. I'm not sure that it would be adequate for our original purpose.

I'm not following. Surely pg_settings itself is enough to tell you where the currently-applied GUC value came from.

regards, tom lane
[HACKERS] Reduce ProcArrayLock contention
I have been working on analyzing different ways to reduce the contention around ProcArrayLock. I have evaluated mainly two ideas. The first is to partition ProcArrayLock (the basic idea is to allow multiple clients (equal to the number of ProcArrayLock partitions) to perform ProcArrayEndTransaction and then wait for all of them at GetSnapshotData time), and the second is to have a mechanism to GroupClear the XID during ProcArrayEndTransaction(). The second idea clearly stands out in my tests, so I have prepared a patch for it to discuss further here.

The idea behind the second approach (GroupClear XID) is: first try to get ProcArrayLock conditionally in ProcArrayEndTransaction(); if we get the lock, clear the advertised XID, else set a flag (which indicates that the advertised XID needs to be cleared for this proc) and push this proc onto pendingClearXidList. Except for one proc, all other procs will wait for their XID to be cleared. The one allowed proc will attempt the lock acquisition; after acquiring the lock, it pops all of the requests off the list using compare-and-swap, servicing each one before moving to the next proc, and clearing their XIDs. After servicing all the requests on pendingClearXidList, it releases the lock and once again goes through the saved pendingClearXidList to wake all the processes waiting for their XID to be cleared. To set the appropriate value for ShmemVariableCache->latestCompletedXid, we need to advertise latestXid in case the proc needs to be pushed onto pendingClearXidList. The attached patch implements the above idea.

Performance Data
----------------
RAM - 500GB
8 sockets, 64 cores (hyperthreaded, 128 threads total)

Non-default parameters:
max_connections = 150
shared_buffers = 8GB
min_wal_size = 10GB
max_wal_size = 15GB
checkpoint_timeout = 35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 256MB

pgbench setup: scale factor - 300. Data is on magnetic disk and WAL on SSD.
pgbench -M prepared tpc-b

Head: commit 51d0fe5d
Patch-1: group_xid_clearing_at_trans_end_rel_v1

Client Count      1      8     16     32     64    128
HEAD            814   6092  10899  19926  23636  17812
Patch-1        1086   6483  11093  19908  31220  28237

The graph for the data is attached.

Points about the performance data:
1. Gives a good performance improvement at or above 64 clients and a somewhat moderate improvement at lower client counts. The reason is that the contention around ProcArrayLock is mainly seen at higher client counts. I have checked that at higher client counts it starts behaving locklessly (which means performance with the patch is equivalent to just commenting out ProcArrayLock in ProcArrayEndTransaction()).
2. There is some noise in this data (at 1 client, I don't expect much difference).
3. I have done similar tests on a POWER8 machine and found similar gains.
4. The gains are visible when the data fits in shared_buffers; for other workloads I/O starts dominating.
5. I have seen that the effect of the patch is much more visible if we keep autovacuum = off (doing a manual vacuum after each run) and keep wal_writer_delay at a lower value (say 20ms). I have not included that data here, but if somebody is interested, I can do the detailed tests against HEAD with those settings and share the results.

Here are the steps used to take the data (they are repeated for each reading):
1. Start server
2. dropdb postgres
3. createdb postgres
4. pgbench -i -s 300 postgres
5. pgbench -c $threads -j $threads -T 1800 -M prepared postgres
6. checkpoint
7. Stop server

Thanks to Robert Haas for having a discussion (offlist) about the idea and suggestions to improve it, and also Andres Freund for having a discussion and sharing thoughts about this idea at PGCon.

Suggestions?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

group_xid_clearing_at_trans_end_v1.patch
Description: Binary data
Re: [HACKERS] Oh, this is embarrassing: init file logic is still broken
F.Y.I. here are the results I got on my laptop (Ubuntu 14, i7-4600U, 16GB mem, 512GB SSD). Unlike Josh, I used Unix domain sockets. In summary:

9.4.3: 943.439840
9.4.4: 429.953953
9.4 stable as of June 30: 929.804491

So compared with 9.4.3, 9.4.4 is 54% slower, and 9.4-stable is 1.4% slower. I think the difference between 9.4.3 and 9.4-stable is measurement error, but the difference between 9.4.3 and 9.4.4 is real. The good thing is that 9.4-stable fixes the issue as expected. Thanks Tom! Below are the details.

9.4.3:

t-ishii@localhost: pgbench -p 5434 -s 100 -j 2 -c 6 -r -C -S -T 1200 test
Scale option ignored, using pgbench_branches table count = 100
starting vacuum...end.
transaction type: SELECT only
scaling factor: 100
query mode: simple
number of clients: 6
number of threads: 2
duration: 1200 s
number of transactions actually processed: 1132130
latency average: 6.360 ms
tps = 943.439840 (including connections establishing)
tps = 64573.165915 (excluding connections establishing)
statement latencies in milliseconds:
	0.002809	\set naccounts 10 * :scale
	0.001427	\setrandom aid 1 :naccounts
	4.254619	SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

9.4.4:

t-ishii@localhost: pgbench -p 5433 -s 100 -j 2 -c 6 -r -C -S -T 1200 test
Scale option ignored, using pgbench_branches table count = 100
starting vacuum...end.
transaction type: SELECT only
scaling factor: 100
query mode: simple
number of clients: 6
number of threads: 2
duration: 1200 s
number of transactions actually processed: 515946
latency average: 13.955 ms
tps = 429.953953 (including connections establishing)
tps = 56235.664740 (excluding connections establishing)
statement latencies in milliseconds:
	0.003311	\set naccounts 10 * :scale
	0.001653	\setrandom aid 1 :naccounts
	9.320204	SELECT abalance FROM pgbench_accounts WHERE aid = :aid

9.4-stable:

t-ishii@localhost: pgbench -p 5435 -s 100 -j 2 -c 6 -r -C -S -T 1200 test
Scale option ignored, using pgbench_branches table count = 100
starting vacuum...end.
transaction type: SELECT only
scaling factor: 100
query mode: simple
number of clients: 6
number of threads: 2
duration: 1200 s
number of transactions actually processed: 1115767
latency average: 6.453 ms
tps = 929.804491 (including connections establishing)
tps = 64676.863921 (excluding connections establishing)
statement latencies in milliseconds:
	0.002803	\set naccounts 10 * :scale
	0.001395	\setrandom aid 1 :naccounts
	4.316686	SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

> OK, this is pretty bad in its real performance effects. On a workload which is dominated by new connection creation, we've lost about 17% throughput. To test it, I ran pgbench -s 100 -j 2 -c 6 -r -C -S -T 1200 against a database which fits in shared_buffers on two different m3.large instances on AWS (across the network, not on unix sockets).
>
> A typical run on 9.3.6 looks like this:
>
> scaling factor: 100
> query mode: simple
> number of clients: 6
> number of threads: 2
> duration: 1200 s
> number of transactions actually processed: 252322
> tps = 210.267219 (including connections establishing)
> tps = 31958.233736 (excluding connections establishing)
> statement latencies in milliseconds:
> 	0.002515	\set naccounts 10 * :scale
> 	0.000963	\setrandom aid 1 :naccounts
> 	19.042859	SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
>
> Whereas a typical run on 9.3.9 looks like this:
>
> scaling factor: 100
> query mode: simple
> number of clients: 6
> number of threads: 2
> duration: 1200 s
> number of transactions actually processed: 208180
> tps = 173.482259 (including connections establishing)
> tps = 31092.866153 (excluding connections establishing)
> statement latencies in milliseconds:
> 	0.002518	\set naccounts 10 * :scale
> 	0.000988	\setrandom aid 1 :naccounts
> 	23.076961	SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
>
> Numbers are pretty consistent on four runs each on two different instances (+/- 4%), so I don't think this is Amazon variability we're seeing. I think the syscache invalidation is really costing us 17%.
>
> :-(
>
> --
> Josh Berkus
> PostgreSQL Experts Inc.
> http://pgexperts.com
Re: [HACKERS] Reduce ProcArrayLock contention
On June 29, 2015 7:02:10 PM GMT+02:00, Simon Riggs si...@2ndquadrant.com wrote:
> On 29 June 2015 at 16:27, Amit Kapila amit.kapil...@gmail.com wrote:
>> Thanks to Robert Haas for having a discussion (offlist) about the idea and suggestions to improve it, and also Andres Freund for having a discussion and sharing thoughts about this idea at PGCon.
> Weird. This patch is implemented exactly the way I said to implement it publicly at PgCon. Was nobody recording the discussion at the unconference??

Amit presented an earlier version of this at the unconference?
---
Please excuse brevity and formatting - I am writing this on my mobile phone.
Re: [HACKERS] Reduce ProcArrayLock contention
On 29 June 2015 at 16:27, Amit Kapila amit.kapil...@gmail.com wrote:
> Thanks to Robert Haas for having a discussion (offlist) about the idea and suggestions to improve it, and also Andres Freund for having a discussion and sharing thoughts about this idea at PGCon.

Weird. This patch is implemented exactly the way I said to implement it publicly at PgCon. Was nobody recording the discussion at the unconference??

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Support for N synchronous standby servers - take 2
On 06/29/2015 01:01 AM, Michael Paquier wrote:

On Mon, Jun 29, 2015 at 4:20 AM, Josh Berkus j...@agliodbs.com wrote: Right. Well, another reason we should be using a system catalog and not a single GUC ...

I assume that this takes into account the fact that you will still need a SIGHUP to reload properly the new node information from those catalogs and to track if some information has been modified or not.

Well, my hope was NOT to need a sighup, which is something I see as a failing of the current system.

And the fact that a connection to those catalogs will be needed as well, something that we don't have now.

Hmmm? I was envisioning the catalog being used as one on the master. Why do we need an additional connection for that? Don't we already need a connection in order to update pg_stat_replication?

Another barrier to the catalog approach is that catalogs get replicated to the standbys, and I think that we want to avoid that.

Yeah, it occurred to me that that approach has its downside as well as an upside. For example, you wouldn't want a failed-over new master to synchrep to itself. Mostly, I was looking for something reactive, relational, and validated, instead of passing an unvalidated string to pg.conf and hoping that it's accepted on reload. Also some kind of catalog approach would permit incremental changes to the config instead of wholesale replacement. But perhaps you simply meant having an SQL interface with some metadata, right? Perhaps I got confused by the word 'catalog'.

No, that doesn't make any sense.

I'm personally not convinced that quorum and prioritization are compatible. I suggest instead that quorum and prioritization should be exclusive alternatives, that is that a synch set should be either a quorum set (with all members as equals) or a prioritization set (if rep1 fails, try rep2). I can imagine use cases for either mode, but not one which would involve doing both together.
Yep, separating the GUC parameter between prioritization and quorum could also be a good idea.

We're agreed, then ...

Er, I disagree here. Being able to get prioritization and quorum working together is a requirement of this feature in my opinion. Using again the example above with 2 data centers, being able to define a prioritization set on the set of nodes of data center 1, and a quorum set in data center 2, would reduce failure probability by preventing problems where, for example, one or more nodes lag behind (improving performance at the same time).

Well, then *someone* needs to define the desired behavior for all permutations of prioritized synch sets. If it's undefined, then we're far worse off than we are now.

Also I think that we must enable us to decide which server we should promote when the master server is down.

Yes, and probably my biggest issue with this patch is that it makes deciding which server to fail over to *more* difficult (by adding more synchronous options) without giving the DBA any more tools to decide how to fail over. Aside from the fact that we said we'd eventually do it, what real-world problem are we solving with this patch?

Hm. This patch needs to be coupled with improvements to pg_stat_replication to be able to represent a node tree, by basically adding to which group a node is assigned. I can draft that if needed, I am just a bit too lazy now...

Honestly, this is not a matter of tooling. Even today, if a DBA wants to change s_s_names without touching postgresql.conf, you could just run ALTER SYSTEM and then reload parameters.

You're confusing two separate things. The primary manageability problem has nothing to do with altering the parameter. The main problem is: if there is more than one synch candidate, how do we determine *after the master dies* which candidate replica was in synch at the time of failure? Currently there is no way to do that.
This proposal plans to, effectively, add more synch candidate configurations without addressing that core design failure *at all*. That's why I say that this patch decreases overall reliability of the system instead of increasing it.

When I set up synch rep today, I never use more than two candidate synch servers because of that very problem. And even with two I have to check the replay point because I have no way to tell which replica was in sync at the time of failure. Even in the current limited feature, this significantly reduces the utility of synch rep. In your proposal, where I could have multiple synch rep groups in multiple geos, how on Earth could I figure out what to do when the master datacenter dies?

BTW, ALTER SYSTEM is a strong reason to use JSON for the synch rep GUC (assuming it's one parameter) instead of some custom syntax. If it's JSON, we can validate it in psql, whereas if it's some custom syntax we have to wait for the db to reload and fail to figure out that we forgot a comma. Using JSON would also permit us to use jsonb_set and
Re: [HACKERS] Reduce ProcArrayLock contention
On 29 June 2015 at 18:11, Andres Freund and...@anarazel.de wrote:
> On June 29, 2015 7:02:10 PM GMT+02:00, Simon Riggs si...@2ndquadrant.com wrote:
>> On 29 June 2015 at 16:27, Amit Kapila amit.kapil...@gmail.com wrote:
>>> Thanks to Robert Haas for having a discussion (offlist) about the idea and suggestions to improve it, and also Andres Freund for having a discussion and sharing thoughts about this idea at PGCon.
>> Weird. This patch is implemented exactly the way I said to implement it publicly at PgCon. Was nobody recording the discussion at the unconference??
> Amit presented an earlier version of this at the unconference?

Yes, I know. And we all had a long conversation about how to do it without waking up the other procs. Forming a list, like we use for sync rep, and having just a single process walk the queue was the way I suggested then and previously. Weird.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
[HACKERS] Bug in bttext_abbrev_convert()
Commits b181a919 and arguably c79b6413 added bugs to bttext_abbrev_convert() in the process of fixing some others.

In the master branch, bttext_abbrev_convert() can leak memory when the C locale happens to be used and we must detoast, which is unacceptable for about the same reason that it's unacceptable for a standard B-Tree comparator routine. There could also be a use-after-free issue for large strings that are detoasted, because bttext_abbrev_convert() hashes memory which might already be freed (when hashing the authoritative value). The attached patch fixes these issues.

As we all know, the state of automated testing is pretty lamentable. This is the kind of thing that we could catch more easily in the future if better infrastructure were in place. I caught this by eyeballing bttext_abbrev_convert() with slightly fresher eyes than the last time I looked.

--
Peter Geoghegan

From 9aa381b1c136b9c4228b0b965c38fc62234ba251 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan peter.geoghega...@gmail.com
Date: Mon, 29 Jun 2015 16:21:16 -0700
Subject: [PATCH] Fix memory management bugs in text abbreviation

Make sure that a pfree() of an unpacked Datum value (from a toasted Datum) always occurs, including when a text sort uses the C locale. Free unpacked value memory at the latest possible opportunity, ensuring that it is not spuriously referenced after pfree().
---
 src/backend/utils/adt/varlena.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 779729d..2fbbf54 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -2034,13 +2034,9 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
 	}
 
 	/* Just like strcoll(), strxfrm() expects a NUL-terminated string */
-	memcpy(tss->buf1, VARDATA_ANY(authoritative), len);
+	memcpy(tss->buf1, authoritative_data, len);
 	tss->buf1[len] = '\0';
 
-	/* Don't leak memory here */
-	if (PointerGetDatum(authoritative) != original)
-		pfree(authoritative);
-
 	for (;;)
 	{
 #ifdef HAVE_LOCALE_T
@@ -2108,6 +2104,10 @@ bttext_abbrev_convert(Datum original, SortSupport ssup)
 
 	addHyperLogLog(&tss->abbr_card, hash);
 
+	/* Don't leak memory here */
+	if (PointerGetDatum(authoritative) != original)
+		pfree(authoritative);
+
 	return res;
 }
-- 
1.9.1
Re: [HACKERS] 9.5 release notes
Andres Freund and...@anarazel.de writes: I'm working on integrating the suggestions I made last week to the release notes. Would anybody mind if I start to add commit ids in comments, similar to what Tom has done for minor releases, to news items? +1. Helps confirm which items are meant to correspond to which commits. In case you didn't realize it already, the stuff I put into the minor release notes is from src/tools/git_changelog. Dunno whether we need to use that exact format for major releases though; there's no need to identify branches in major release notes. regards, tom lane
Re: [HACKERS] PANIC in GIN code
On 06/29/2015 07:20 PM, Jeff Janes wrote: On Mon, Jun 29, 2015 at 1:37 AM, Heikki Linnakangas hlinn...@iki.fi wrote: Another piece of info here that might be relevant. Almost all UPDATE_META_PAGE xlog records other than the last one have two backup blocks. The last UPDATE_META_PAGE record only has one backup block. And the metapage is mostly zeros:

head -c 8192 /tmp/data2_invalid_page/base/16384/16420 | od
000 00 00 161020 073642 00 00 00 00
020 00 00 00 00 053250 00 053250 00
040 006140 00 01 00 01 00 00 00
060 031215 00 000452 00 00 00 00 00
100 025370 00 00 00 02 00 00 00
120 00 00 00 00 00 00 00 00
*
002

Hmm. Looking at ginRedoUpdateMetapage, I think I see the problem: it doesn't initialize the page. It copies the metapage data, but it doesn't touch the page headers. The only way I can see that that would cause trouble is if the index somehow got truncated away or removed in the standby. That could happen in crash recovery, if you drop the index and then crash, but that should be harmless, because crash recovery doesn't try to read the metapage, only update it (by overwriting it), and by the time crash recovery has completed, the index drop is replayed too. But AFAICS that bug is present in earlier versions too. Nope, looking closer, in previous versions the page was always read from disk, even though we're overwriting it. That was made smarter in 9.5, by using the ZERO_OR_LOCK mode, but that means that the page headers indeed need to be initialized. Yes, I did see this error reported previously but it was always after the first appearance of the PANIC, so I assumed it was a sequela to that and didn't investigate it further at that time. Can you reproduce this easily? How? I can reproduce it fairly easily. [instructions to reproduce] I was actually not able to reproduce it that way, but came up with a much simpler method. The problem occurs when the metapage update WAL record is replayed, and the metapage is not in the page cache yet.
Usually it is, as if you do pretty much anything at all with the index, the metapage stays in cache. But VACUUM produces a metapage update record to update the stats, and it's pretty easy to arrange things so that that's the first record after a checkpoint, and to start recovery from that checkpoint:

postgres=# create table foo (t text[]);
CREATE TABLE
postgres=# insert into foo values ('{foo}');
INSERT 0 1
postgres=# create index i_foo on foo using gin (t);
CREATE INDEX
postgres=# vacuum foo;
VACUUM
postgres=# vacuum foo;
VACUUM
postgres=# checkpoint;
CHECKPOINT
postgres=# vacuum foo;
VACUUM

Now kill -9 postmaster, and restart. Voila, the page headers are all zeros:

postgres=# select * from page_header(get_raw_page('i_foo', 0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/1891270 |        0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
(1 row)

postgres=# select * from gin_metapage_info(get_raw_page('i_foo', 0));
ERROR:  input page is not a GIN metapage
DETAIL:  Flags 0189, expected 0008

I just pushed a fix for this, but unfortunately it didn't make it into 9.5alpha1.

- Heikki
Re: [HACKERS] Solaris testers wanted for strxfrm() behavior
Josh Berkus j...@agliodbs.com writes: Joyent confirms that the bug is fixed on SmartOS: The more interesting bit of information would be *when* it was fixed. regards, tom lane
Re: [HACKERS] 9.5 release notes
Andres Freund and...@anarazel.de writes: Haven't yet thought much about the format, maybe in the style of git log --pretty='format:[%h] %aN [%ci]: %s' upstream/master I'd personally like to see the hash and the time, the rest isn't particularly important to me. [ ... plays with that ... ] Hm. To keep down the bulk of the release notes, I'd kind of like to fit this info into single SGML comment lines, ie <!-- info here --> and if that's to fit into 80 columns without wrapping, the info is pretty constrained. I get something like this when running git log in a 70-column window:

[f78329d] Tom Lane [2015-06-29 15:42:18 -0400]: Stamp 9.5alpha1.
[85c25fd] Tom Lane [2015-06-29 15:38:46 -0400]: Desultory review of 9.
[cbc8d65] Tom Lane [2015-06-29 12:42:52 -0400]: Code + docs review for
[07cb8b0] Andres Freund [2015-06-29 14:53:32 +0200]: Replace ia64 S_UN
[c5e5d44] Peter Eisentraut [2015-06-28 23:56:55 -0400]: Translation up
[2bdc51a] Tom Lane [2015-06-28 20:49:35 -0400]: Run the C portions of
[62d16c7] Tom Lane [2015-06-28 18:06:14 -0400]: Improve design and imp
[d661532] Heikki Linnakangas [2015-06-29 00:09:10 +0300]: Also trigger
[6ab4d38] Heikki Linnakangas [2015-06-29 00:01:26 +0300]: Fix markup i
[a32c3ec] Heikki Linnakangas [2015-06-28 22:30:39 +0300]: Promote the
[a45c70a] Heikki Linnakangas [2015-06-28 22:16:21 +0300]: Fix double-X
[b36805f] Heikki Linnakangas [2015-06-28 21:35:51 +0300]: Don't choke
[cb2acb1] Heikki Linnakangas [2015-06-28 21:35:46 +0300]: Add missing_
[cca8ba9] Kevin Grittner [2015-06-28 12:43:59 -0500]: Fix comment for
[527e6d3] Tatsuo Ishii [2015-06-28 18:54:27 +0900]: Fix function decla
[0a52d37] Tom Lane [2015-06-27 17:47:39 -0400]: Avoid passing NULL to
[d47a113] Andres Freund [2015-06-27 19:00:45 +0200]: Fix test_decoding
[604e993] Kevin Grittner [2015-06-27 09:55:06 -0500]: Add opaque decla
[7845db2] Heikki Linnakangas [2015-06-27 10:17:42 +0300]: Fix typo in
[66fbcb0] Simon Riggs [2015-06-27 00:41:47 +0100]: Avoid hot standby c
[7d60b2a] Alvaro Herrera [2015-06-26 18:17:54 -0300]: Fix DDL command
[4028222] Alvaro Herrera [2015-06-26 18:13:05 -0300]: Fix BRIN xlog re
[7c02d48] Robert Haas [2015-06-26 16:04:46 -0400]: Fix grammar.
[8f15f74] Robert Haas [2015-06-26 15:53:13 -0400]: Be more conservativ
[c66bc72] Robert Haas [2015-06-26 14:49:37 -0400]: release notes: Add
[8a8c581] Robert Haas [2015-06-26 14:46:48 -0400]: Remove unnecessary
[31c018e] Robert Haas [2015-06-26 14:20:29 -0400]: release notes: Comb
[9043ef3] Robert Haas [2015-06-26 11:37:32 -0400]: Don't warn about cr

That might be all right but it seems like the timestamp is taking up an undue amount of space. If we could get git log to put out only the date, it would be better IMO, but I don't see a format option that does that :-(. I wonder if it's worth teaching git_changelog to have an option to put out summary lines in the specific format we select. regards, tom lane
Re: [HACKERS] Solaris testers wanted for strxfrm() behavior
On 06/29/2015 02:08 PM, Tom Lane wrote: Josh Berkus j...@agliodbs.com writes: Joyent confirms that the bug is fixed on SmartOS: The more interesting bit of information would be *when* it was fixed. Answer: not certain, but fixed at least 2 years ago. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] 9.5 release notes
On 2015-06-29 17:30:57 -0400, Tom Lane wrote: Andres Freund and...@anarazel.de writes: Haven't yet thought much about the format, maybe in the style of git log --pretty='format:[%h] %aN [%ci]: %s' upstream/master I'd personally like to see the hash and the time, the rest isn't particularly important to me. [ ... plays with that ... ] Hm. To keep down the bulk of the release notes, I'd kind of like to fit this info into single SGML comment lines, ie <!-- info here --> and if that's to fit into 80 columns without wrapping, the info is pretty constrained. I get something like this when running git log in a 70-column window: [f78329d] Tom Lane [2015-06-29 15:42:18 -0400]: Stamp 9.5alpha1. I agree this is a bit long, but I don't particularly care about 70/80 cols. How about:

git log --pretty='format:%cd [%h] %cN: %s' --date=short upstream/master

2015-06-29 [f78329d] Tom Lane: Stamp 9.5alpha1.
2015-06-29 [85c25fd] Tom Lane: Desultory review of 9.5 release notes.
2015-06-29 [cbc8d65] Tom Lane: Code + docs review for escaping of option values (commit 11a020eb6).
2015-06-29 [07cb8b0] Andres Freund: Replace ia64 S_UNLOCK compiler barrier with a full memory barrier.
2015-06-28 [c5e5d44] Peter Eisentraut: Translation updates
2015-06-28 [2bdc51a] Tom Lane: Run the C portions of guc-file.l through pgindent.
2015-06-28 [62d16c7] Tom Lane: Improve design and implementation of pg_file_settings view.

Greetings, Andres Freund
Re: [HACKERS] 9.5 release notes
Hi, On 2015-06-11 00:15:21 -0400, Bruce Momjian wrote: I have committed the first draft of the 9.5 release notes. I'm working on integrating the suggestions I made last week to the release notes. Would anybody mind if I start to add commit ids in comments, similar to what Tom has done for minor releases, to news items? Greetings, Andres Freund
Re: [HACKERS] pg_stat_*_columns?
On 6/26/15 6:09 PM, Joel Jacobson wrote: Can't we just use the infrastructure of PostgreSQL to handle the few megabytes of data we are talking about here? Why not just store the data in a regular table? Why bother with special files and special data structures? If it's just a table we want to produce as output, why can't we just store it in a regular table, in the pg_catalog schema? The problem is the update rate. I've never tried measuring it, but I'd bet that the stats collector can end up with 10s of thousands of updates per second. MVCC would collapse under that kind of load. What might be interesting is setting things up so the collector simply inserted into history tables every X seconds and then had a separate process to prune that data. The big problem with that is I see no way for that to easily allow access to real-time data (which is certainly necessary sometimes). -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] 9.5 release notes
Andres Freund and...@anarazel.de writes: How about: git log --pretty='format:%cd [%h] %cN: %s' --date=short upstream/master Ah yes, didn't see that option. That's definitely better. I'd still vote for restricting it to fit in an 80-column window though. regards, tom lane
Re: [HACKERS] Solaris testers wanted for strxfrm() behavior
All, Joyent confirms that the bug is fixed on SmartOS: This is what I see:

SunOS pkgsrc-pbulk-2014Q4-1.local 5.11 joyent_20141030T081701Z i86pc i386 i86pc
locale is_IS.ISO8859-1: strxfrm returned 212; last modified byte at 58 (0x0)
locale is_IS.ISO8859-1: strxfrm returned 212; last modified byte at 58 (0x0)
locale : strxfrm returned 26; last modified byte at 27 (0x0)

Cheers, -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] 9.5 release notes
On 2015-06-29 17:58:54 -0400, Tom Lane wrote: Yeah we are. The only places you'll find where we aren't formatting to 77 or 78 columns or so are where it would require breaking SGML tags in weird places. Which isn't exactly infrequent... Anyway, how about: format:%cd [%h] %(8,trunc)%cN: %(49,trunc)%s (which you can configure as pretty.pgmajor or so in .gitconfig btw.) (There are two different things to worry about here. One is how you're going to reverse-engineer the annotations into the release notes Bruce already made. Even un-truncated first lines of commit messages will often not be enough for matching up commits to those notes. But I'm thinking more about how we do this in future release cycles, and for that, getting git_changelog to help is clearly the ticket.) I'm not against doing that.
Re: [HACKERS] 9.5 release notes
On 2015-06-29 17:44:06 -0400, Tom Lane wrote: Andres Freund and...@anarazel.de writes: How about: git log --pretty='format:%cd [%h] %cN: %s' --date=short upstream/master Ah yes, didn't see that option. That's definitely better. I'd still vote for restricting it to fit in an 80-column window though. I don't see the benefit of truncating away the end of the commit message - that'll just mean more manual work and harder-to-understand headings... It's also not like we're generally very religious about the line length in sgml?
Re: [HACKERS] 9.5 release notes
Andres Freund and...@anarazel.de writes: On 2015-06-29 17:44:06 -0400, Tom Lane wrote: Ah yes, didn't see that option. That's definitely better. I'd still vote for restricting it to fit in an 80-column window though. I don't see the benefit of truncating away the end of the commit message - that'll just mean more manual work and harder to understand heading... It's also not like we're generally very religious about the line length in sgml? Yeah we are. The only places you'll find where we aren't formatting to 77 or 78 columns or so are where it would require breaking SGML tags in weird places. If we use this format without truncating the log lines then it'll become practically impossible to edit release notes in a window less than about 120 characters wide, and I for one will object to that. It doesn't fit well in my usual screen layout. And it would be very wasteful of screen space because even if you let regular text flow all the way to a 120-character margin, there are enough short lines like <para> that there'd just be huge amounts of white space on your screen. As for manual work to generate the format, we could fix that by getting git_changelog to emit what we want. I'd think the best thing to start from would be the summary lines interspersed with full commit messages anyway; the raw output of git log is going to be near unusable for this purpose, with or without truncation. (There are two different things to worry about here. One is how you're going to reverse-engineer the annotations into the release notes Bruce already made. Even un-truncated first lines of commit messages will often not be enough for matching up commits to those notes. But I'm thinking more about how we do this in future release cycles, and for that, getting git_changelog to help is clearly the ticket.) regards, tom lane
Re: [HACKERS] 9.5 release notes
On 2015-06-29 17:05:59 -0400, Tom Lane wrote: +1. Helps confirm which items are meant to correspond to which commits. That's what made me think of it. In case you didn't realize it already, the stuff I put into the minor release notes is from src/tools/git_changelog. Dunno whether we need to use that exact format for major releases though; there's no need to identify branches in major release notes. I'd recognized the format, but I didn't want to exactly go that way. As you say, the branch information is redundant. Haven't yet thought much about the format, maybe in the style of git log --pretty='format:[%h] %aN [%ci]: %s' upstream/master I'd personally like to see the hash and the time, the rest isn't particularly important to me. Greetings, Andres Freund
Re: [HACKERS] Parallel Seq Scan
[Jumping in without catching up on entire thread. Please let me know if these questions have already been covered.] 1. Can you change the name to something like ParallelHeapScan? Parallel Sequential is a contradiction. (I know this is bikeshedding and I won't protest further if you keep the name.) 2. Where is the speedup coming from? How much of it is CPU and IO overlapping (i.e. not leaving disk or CPU idle while the other is working), and how much from the CPU parallelism? I know this is difficult to answer rigorously, but it would be nice to have some breakdown even if for a specific machine. Regards, Jeff Davis
Re: [HACKERS] Support for N synchronous standby servers - take 2
On Mon, Jun 29, 2015 at 4:20 AM, Josh Berkus j...@agliodbs.com wrote: On 06/28/2015 04:36 AM, Sawada Masahiko wrote: On Sat, Jun 27, 2015 at 3:53 AM, Josh Berkus j...@agliodbs.com wrote: On 06/26/2015 11:32 AM, Robert Haas wrote: I think your proposal is worth considering, but you would need to fill in a lot more details and explain how it works in detail, rather than just via a set of example function calls. The GUC-based syntax proposal covers cases like multi-level rules and, now, prioritization, and it's not clear how those would be reflected in what you propose. So what I'm seeing from the current proposal is:
1. we have several defined synchronous sets
2. each set requires a quorum of k (defined per set)
3. within each set, replicas are arranged in priority order.
One thing which the proposal does not implement is *names* for synchronous sets. I would also suggest that if I lose this battle and we decide to go with a single stringy GUC, that we at least use JSON instead of defining our own, proprietary, syntax? JSON would be more flexible for making synchronous sets, but it would make us change how the configuration file is parsed, to allow a value to contain a newline. Right. Well, another reason we should be using a system catalog and not a single GUC ... I assume that this takes into account the fact that you will still need a SIGHUP to reload properly the new node information from those catalogs and to track if some information has been modified or not. And the fact that a connection to those catalogs will be needed as well, something that we don't have now. Another barrier to the catalog approach is that catalogs get replicated to the standbys, and I think that we want to avoid that. But perhaps you simply meant having an SQL interface with some metadata, right? Perhaps I got confused by the word 'catalog'. I'm personally not convinced that quorum and prioritization are compatible.
I suggest instead that quorum and prioritization should be exclusive alternatives, that is that a synch set should be either a quorum set (with all members as equals) or a prioritization set (if rep1 fails, try rep2). I can imagine use cases for either mode, but not one which would involve doing both together. Yep, separating the GUC parameter between prioritization and quorum could be also good idea. We're agreed, then ... Er, I disagree here. Being able to get prioritization and quorum working together is a requirement of this feature in my opinion. Using again the example above with 2 data centers, being able to define a prioritization set on the set of nodes of data center 1, and a quorum set in data center 2 would reduce failure probability by being able to prevent problems where for example one or more nodes lag behind (improving performance at the same time). Also I think that we must enable us to decide which server we should promote when the master server is down. Yes, and probably my biggest issue with this patch is that it makes deciding which server to fail over to *more* difficult (by adding more synchronous options) without giving the DBA any more tools to decide how to fail over. Aside from because we said we'd eventually do it, what real-world problem are we solving with this patch? Hm. This patch needs to be coupled with improvements to pg_stat_replication to be able to represent a node tree by basically adding to which group a node is assigned. I can draft that if needed, I am just a bit too lazy now... Honestly, this is not a matter of tooling. Even today if a DBA wants to change s_s_names without touching postgresql.conf you could just run ALTER SYSTEM and then reload parameters. It's always been a problem that one can accomplish a de-facto denial-of-service by joining a cluster using the same application_name as the synch standby, moreso because it's far too easy to do that accidentally. 
One needs to simply make the mistake of copying recovery.conf from the synch replica instead of the async replica, and you've created a reliability problem. That's a scripting problem then. There are many ways to do a false manipulation in this area when setting up a standby. application_name value is one, you can do worse by pointing to an incorrect IP as well, miss a firewall filter or point to an incorrect port. Also, the fact that we use application_name for synch_standby groups prevents us from giving the standbys in the group their own names for identification purposes. It's only the fact that synchronous groups are relatively useless in the current feature set that's prevented this from being a real operational problem; if we implement quorum commit, then users are going to want to use groups more often and will want to identify the members of the group, and not just by IP address. Managing groups in the synchronous protocol is adding one level of complexity for the operator, while what I had in mind first was to allow a user to be able
Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
Hi KaiGai-san, On 2015/06/27 21:09, Kouhei Kaigai wrote: BTW, if you try newer version of postgres_fdw foreign join patch, please provide me to reproduce the problem/ OK Did you forget to attach the patch, or v17 is in use? Sorry, I made a mistake. The problem was produced using v16 [1]. I'd like to suggest a solution that re-construct remote tuple according to the fdw_scan_tlist on ExecScanFetch, if given scanrelid == 0. It enables to run local qualifier associated with the ForeignScan node, and it will also work for the case when tuple in es_epqTupleSet[] was local heap. Maybe I'm missing something, but I don't think your proposal works properly because we don't have any component ForeignScan state node or subsidiary join state node once we've replaced the entire join with the ForeignScan performing the join remotely, IIUC. So, my image was to have another subplan for EvalPlanQual as well as the ForeignScan, to do the entire join locally for the component test tuples if we are inside an EvalPlanQual recheck. Hmm... Probably, we have two standpoints to tackle the problem. The first standpoint tries to handle the base foreign table as a prime relation for locking. Thus, we have to provide a way to fetch a remote tuple identified with the supplied ctid. The advantage of this approach is the way to fetch tuples from base relation is quite similar to the existing form, however, its disadvantage is another side of the same coin, because the ForeignScan node with scanrelid==0 (that represents remote join query) may have local qualifiers which shall run on the tuple according to fdw_scan_tlist. 
IIUC, I think this approach would also need to evaluate join conditions and remote qualifiers in addition to local qualifiers in the local, for component tuples that were re-fetched from the remote (and remaining component tuples that were copied from whole-row vars, if any), in cases where the re-fetched tuples were updated versions of those tuples rather than the same versions previously obtained. One other standpoint tries to handle a bunch of base foreign tables as a unit. That means, if any of base foreign table is the target of locking, it prompts FDW driver to fetch the latest joined tuple identified by ctid, even if this join contains multiple base relations to be locked. The advantage of this approach is that we can use qualifiers of the ForeignScan node with scanrelid==0 and no need to pay attention of remote qualifier and/or join condition individually. Its disadvantage is, we may extend EState structure to keep the joined tuples, in addition to es_epqTupleSet[]. That is an idea. However, ISTM there is another disadvantage; that is not efficient because that would need to perform another remote join query having such additional conditions during an EvalPlanQual check, as you proposed. I'm inclined to think the latter standpoint works well, because it does not need to reproduce an alternative execution path in local side again, even if a ForeignScan node represents a much more complicated remote query. If we would fetch tuples of individual base relations, we need to reconstruct an identical join path to be executed on the remote side, don't we? Yeah, that was my image for fixing this issue. IIUC, the purpose of EvalPlanQual() is to ensure the tuples to be locked are still visible, so it is not an essential condition to fetch base tuples individually. I think so too, but taking the similarity and/or efficiency of processing into consideration, I would vote for the idea of having an alternative execution path in the local.
That would also allow FDW authors to write the foreign join pushdown functionality in their FDWs with less effort. Best regards, Etsuro Fujita [1] http://www.postgresql.org/message-id/CAEZqfEe9KGy=1_wagh2rgzpg0o4pqgd+iauyaj8wtze+cyj...@mail.gmail.com
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Mon, Jun 29, 2015 at 3:46 AM, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/24/2015 09:43 AM, Michael Paquier wrote: Attached is a new set of patches. Except for the last ones that address one issue of pg_rewind (symlink management when streaming PGDATA), all the others introduce if_not_exists options for the functions of genfile.c. The pg_rewind stuff could be more polished though. Feel free to comment. I've committed the additional option to the functions in genfile.c (I renamed it to missing_ok, for clarity), and the pg_rewind changes to use that option. I ended up refactoring the patch quite a bit, so if you could double-check what I committed to make sure I didn't break anything, that would be great. Thanks, the new patch looks far better than what I did. I noticed a couple of typos in the docs though:
- s/behaviour/behavior: behavior is the American English spelling, and it is the one used elsewhere as well, hence I guess that it makes sense to use it.
- s/an non-existent/a non-existent
- pg_proc.h is still using if_not_exists as in my patch (you corrected it to use missing_ok).
Those are fixed as 0001 in the attached set. I didn't commit the tablespace or symlink handling changes yet, will review those separately. Thanks. Attached are rebased versions that take into account your previous changes as well (like the variable renaming, wrapper function usage, etc). I also added missing_ok to pg_readlink for consistency, and rebased the fix of pg_rewind for soft links with 0005. Note that 0005 does not use pg_tablespace_location(), still the patch series includes an implementation of missing_ok for it. Feel free to use it if necessary. I also didn't commit the new regression test yet. It would indeed be nice to have one, but I think it was a few bricks shy of a load. It should work in a freshly initdb'd system, but not necessarily on an existing installation.
First, it relied on the fact that postgresql.auto.conf exists, but a DBA might remove that if he wants to make sure the feature is not used. Secondly, it relied on the fact that pg_twophase is empty, but there is no guarantee of that either. Third, the error messages included in the expected output, e.g. No such file or directory, depend on the operating system and locale. And finally, it'd be nice to test more things, in particular the behaviour of different offsets and lengths to pg_read_binary_file(), although an incomplete test would be better than no test at all. Portability is going to really reduce the test range; the only things that we could test are:
- NULL results returned when missing_ok = true (with a dummy file name/link/directory), as missing_ok = false would choke depending on the platform as you mentioned.
- Ensure that those functions called by users without superuser rights fail properly.
- Path format errors for each function, like this:

=# select pg_ls_dir('..');
ERROR:  42501: path must be in or below the current directory
LOCATION:  convert_and_check_filename, genfile.c:78

For tests on pg_read_binary_file, do you think that there is one file of PGDATA that we could use for scanning? I cannot think of one off the top of my head now (postgresql.conf or any configuration files could be overridden by the user so they are out of scope, PG_VERSION is an additional maintenance burden). Well, I think that those things are still worth testing in any case and I think that you think so as well. If you are fine with that I could wrap up a patch that's better than nothing done for sure. Thoughts? Now we could have a TAP test for this stuff, where we could have a custom PGDATA with some dummy files that we know will be empty for example, still it seems like overkill to me for this purpose, initdb is rather costly in this facility.. And on small machines.
Regards,
--
Michael

From 378bd8af5c27e7e20b038f05b693b95e9871ef0b Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Mon, 29 Jun 2015 13:38:03 +0900
Subject: [PATCH 1/5] Fix some typos and an incorrect naming introduced by cb2acb1

---
 doc/src/sgml/func.sgml        | 4 ++--
 src/include/catalog/pg_proc.h | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index d49cd43..8b28e29 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17851,7 +17851,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
    <para>
     All of these functions take an optional <parameter>missing_ok</> parameter,
-    which specifies the behaviour when the file or directory does not exist.
+    which specifies the behavior when the file or directory does not exist.
     If <literal>true</literal>, the function returns NULL (except
     <function>pg_ls_dir</>, which returns an empty result set). If
     <literal>false</>, an error is raised. The default is <literal>false</>.
@@ -17867,7 +17867,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
Re: [HACKERS] PANIC in GIN code
On 06/29/2015 01:12 AM, Jeff Janes wrote: Now I'm getting a different error, with or without checksums. ERROR: invalid page in block 0 of relation base/16384/16420 CONTEXT: automatic vacuum of table jjanes.public.foo 16420 is the gin index. I can't even get the page with pageinspect: jjanes=# SELECT * FROM get_raw_page('foo_text_array_idx', 0); ERROR: invalid page in block 0 of relation base/16384/16420 This is the last few gin entries from pg_xlogdump rmgr: Gin len (rec/tot): 0/ 3893, tx: 0, lsn: 0/77270E90, prev 0/77270E68, desc: VACUUM_PAGE , blkref #0: rel 1663/16384/16420 blk 27 FPW rmgr: Gin len (rec/tot): 0/ 3013, tx: 0, lsn: 0/77272080, prev 0/77272058, desc: VACUUM_PAGE , blkref #0: rel 1663/16384/16420 blk 6904 FPW rmgr: Gin len (rec/tot): 0/ 3093, tx: 0, lsn: 0/77272E08, prev 0/77272DE0, desc: VACUUM_PAGE , blkref #0: rel 1663/16384/16420 blk 1257 FPW rmgr: Gin len (rec/tot): 8/ 4662, tx: 318119897, lsn: 0/77A2CF10, prev 0/77A2CEC8, desc: INSERT_LISTPAGE , blkref #0: rel 1663/16384/16420 blk 22184 rmgr: Gin len (rec/tot): 88/ 134, tx: 318119897, lsn: 0/77A2E188, prev 0/77A2E160, desc: UPDATE_META_PAGE , blkref #0: rel 1663/16384/16420 blk 0 And the metapage is mostly zeros: head -c 8192 /tmp/data2_invalid_page/base/16384/16420 | od 000 00 00 161020 073642 00 00 00 00 020 00 00 00 00 053250 00 053250 00 040 006140 00 01 00 01 00 00 00 060 031215 00 000452 00 00 00 00 00 100 025370 00 00 00 02 00 00 00 120 00 00 00 00 00 00 00 00 * 002 Hmm. Looking at ginRedoUpdateMetapage, I think I see the problem: it doesn't initialize the page. It copies the metapage data, but it doesn't touch the page headers. The only way I can see that that would cause trouble is if the index somehow got truncated away or removed in the standby. 
That could happen in crash recovery, if you drop the index and then crash, but that should be harmless, because crash recovery doesn't try to read the metapage, only update it (by overwriting it), and by the time crash recovery has completed, the index drop is replayed too. But AFAICS that bug is present in earlier versions too. Can you reproduce this easily? How? - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_rewind failure by file deletion in source server
On Mon, Jun 29, 2015 at 3:53 PM, Vladimir Borodin r...@simply.name wrote: On 28 June 2015 at 21:46, Heikki Linnakangas hlinn...@iki.fi wrote: I've committed the additional option to the functions in genfile.c (I renamed it to missing_ok, for clarity), and the pg_rewind changes to use that option. And since it changes the API it would not be back-ported to 9.4, right? Those API changes are not going to be back-patched to 9.4 as this is basically a new feature. For 9.4's pg_rewind, the problem does not concern this mailing list anyway, so let's discuss it directly on the project page on github. Just mentioning, but I am afraid that we are going to need a set of TRY/CATCH blocks with a wrapper function in plpgsql that does the failure legwork here, or to export those functions in an extension with some copy/paste, as all the necessary routines of genfile.c are not exposed externally, and to be sure that those functions are installed on the source server. I am not sure that I will be able to look at that in the short term unfortunately... Hence patches are welcome there and I will happily review, argue or integrate them if needed. The non-streaming option is not impacted in any case, so it's not like pg_rewind cannot be used at all on 9.4 or 9.3 instances. Regards, -- Michael
Re: [HACKERS] anole: assorted stability problems
On Mon, Jun 29, 2015 at 6:11 AM, Andres Freund and...@anarazel.de wrote: On 2015-06-29 00:42:53 -0400, Tom Lane wrote: #define S_UNLOCK(lock)\ do { _Asm_sched_fence(); (*(lock)) = 0; } while (0) Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler barrier? Shouldn't this be a _Asm_mf()? The point of the commit was to make spinlocks act as compiler barriers as well as CPU barriers. So I was just looking to add a compiler barrier to what was already there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reduce ProcArrayLock contention
On Mon, Jun 29, 2015 at 1:22 PM, Simon Riggs si...@2ndquadrant.com wrote: Yes, I know. And we all had a long conversation about how to do it without waking up the other procs. Forming a list, like we use for sync rep, and having just a single process walk the queue was the way I suggested then and previously. Weird. I am not sure what your point is. Are you complaining that you didn't get a design credit for this patch? If so, I think that's a bit petty. I agree that you mentioned something along these lines at PGCon, but Amit and I have been discussing this every week for over a month, so it's not as if the conversations at PGCon were the only ones, or the first. Nor is there a conspiracy to deprive Simon Riggs of credit for his ideas. I believe that you should assume good faith and take it for granted that Amit credited who he believed that he got his ideas from. The fact that you may have had similar ideas does not mean that he got his from you. It probably does mean that they are good ideas, since we are apparently all thinking in the same way. (I could provide internal EDB emails documenting the timeline of various ideas and showing which ones happened before and after PGCon, and we could debate exactly who thought of what when. But I don't really see the point. I certainly hope that a debate over who deserves how much credit for what isn't going to overshadow the good work Amit has done on this patch.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Solaris testers wanted for strxfrm() behavior
On Mon, Jun 29, 2015 at 6:07 PM, Josh Berkus j...@agliodbs.com wrote: On 06/29/2015 02:08 PM, Tom Lane wrote: Josh Berkus j...@agliodbs.com writes: Joyent confirms that the bug is fixed on SmartOS: The more interesting bit of information would be *when* it was fixed. Answer: not certain, but fixed at least 2 years ago. 2 years is definitely not long enough to assume that no unpatched machines exist in the wild. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] anole: assorted stability problems
On 2015-06-29 22:45:49 -0400, Robert Haas wrote: On Mon, Jun 29, 2015 at 10:32 PM, Andres Freund and...@anarazel.de wrote: On 2015-06-29 22:11:33 -0400, Robert Haas wrote: On Mon, Jun 29, 2015 at 6:11 AM, Andres Freund and...@anarazel.de wrote: On 2015-06-29 00:42:53 -0400, Tom Lane wrote: #define S_UNLOCK(lock)\ do { _Asm_sched_fence(); (*(lock)) = 0; } while (0) Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler barrier? Shouldn't this be a _Asm_mf()? The point of the commit was to make spinlocks act as compiler barriers as well as CPU barriers. So I was just looking to add a compiler barrier to what was already there. You removed a volatile at the same time, and volatile on IA64 has acquire/release semantics. Can you explain what you mean by volatile having acquire/release semantics? I don't see how volatile can create a CPU barrier, but I'm guessing that it somehow can and that you're about to enlighten me. It's an IA64 peculiarity. I think it started with Intel's compiler, but since then at least msvc and gcc copied it. In essence volatile reads implicitly have acquire semantics, and volatile writes release. That's relatively cheap on IA64 because they have 'opcode tags' marking normal moves etc. as having release or acquire semantics (mov.rel etc.). We even have a comment about it that scratches the surface a bit: /* * Intel Itanium, gcc or Intel's compiler. * * Itanium has weak memory ordering, but we rely on the compiler to enforce * strict ordering of accesses to volatile data. In particular, while the * xchg instruction implicitly acts as a memory barrier with 'acquire' * semantics, we do not have an explicit memory fence instruction in the * S_UNLOCK macro. We use a regular assignment to clear the spinlock, and * trust that the compiler marks the generated store instruction with the * .rel opcode. * * Testing shows that assumption to hold on gcc, although I could not find * any explicit statement on that in the gcc manual. 
In Intel's compiler, * the -m[no-]serialize-volatile option controls that, and testing shows that * it is enabled by default. */
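To make the distinction in this thread concrete, here is a minimal sketch of an unlock routine that combines an explicit compiler barrier with a volatile store. It is illustrative only: the names my_slock_t, MY_COMPILER_BARRIER and my_s_unlock are hypothetical, not PostgreSQL's, and the barrier shown is a gcc-style asm clobber. The point is that the barrier alone only constrains the compiler; on IA64 it is the volatile qualifier that makes gcc/icc emit the st.rel store giving release semantics.

```c
#include <assert.h>

/* Hypothetical spinlock type; PostgreSQL's real slock_t is per-platform. */
typedef volatile int my_slock_t;

/* Compiler barrier: stops the compiler from moving memory accesses
 * across it, but emits no CPU fence instruction at all. */
#define MY_COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

static void
my_s_unlock(my_slock_t *lock)
{
    MY_COMPILER_BARRIER();      /* keep protected-section stores above this point */
    *lock = 0;                  /* volatile store; gcc/icc mark it st.rel on IA64 */
}
```

Dropping the volatile from the lock type would leave only the compiler barrier, which is exactly the gap being discussed: correct on strongly ordered hardware, insufficient on IA64 without an explicit fence.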
[HACKERS] LWLock deadlock and gdb advice
I have a 9.5alpha1 cluster which is locked up. All the user backends seem to be waiting on semop, eventually on WALInsertLockAcquire. Is there a way to use gdb to figure out who holds the lock they are waiting for? It is compiled with both debug and cassert. I am hoping someone can give me a recipe similar to the one for "Examining backend memory use" in https://wiki.postgresql.org/wiki/Developer_FAQ example backtrace: #0 0x003dcb6eaf27 in semop () from /lib64/libc.so.6 #1 0x0067190f in PGSemaphoreLock (sema=0x7f28a98b9468) at pg_sema.c:387 #2 0x006d4b0c in LWLockAcquireCommon (l=0x7f28a0e6d600, valptr=0x7f28a0e6d618, val=0) at lwlock.c:1042 #3 LWLockAcquireWithVar (l=0x7f28a0e6d600, valptr=0x7f28a0e6d618, val=0) at lwlock.c:916 #4 0x004f3c4f in WALInsertLockAcquire (rdata=0xc5c130, fpw_lsn=0) at xlog.c:1411 #5 XLogInsertRecord (rdata=0xc5c130, fpw_lsn=0) at xlog.c:948 #6 0x004f7aac in XLogInsert (rmid=13 '\r', info=32 ' ') at xloginsert.c:453 #7 0x0047e0b0 in ginPlaceToPage (btree=0x7fffca9263e0, stack=0x2c94ff8, insertdata=value optimized out, updateblkno=value optimized out, childbuf=0, buildStats=0x0) at ginbtree.c:418 #8 0x0047f3ad in ginInsertValue (btree=0x7fffca9263e0, stack=0x2c94ff8, insertdata=0x7fffca926460, buildStats=0x0) at ginbtree.c:748 #9 0x00475c8b in ginEntryInsert (ginstate=0x7fffca9267e0, attnum=29784, key=1, category=value optimized out, items=0x7f28a0c7b458, nitem=47, buildStats=0x0) at gininsert.c:234 #10 0x00485ecc in ginInsertCleanup (ginstate=0x7fffca9267e0, vac_delay=value optimized out, stats=0x0) at ginfast.c:843 #11 0x00487059 in ginHeapTupleFastInsert (ginstate=0x7fffca9267e0, collector=value optimized out) at ginfast.c:436 #12 0x004760fa in gininsert (fcinfo=value optimized out) at gininsert.c:531 Cheers, Jeff
Re: [HACKERS] anole: assorted stability problems
On 2015-06-29 22:11:33 -0400, Robert Haas wrote: On Mon, Jun 29, 2015 at 6:11 AM, Andres Freund and...@anarazel.de wrote: On 2015-06-29 00:42:53 -0400, Tom Lane wrote: #define S_UNLOCK(lock)\ do { _Asm_sched_fence(); (*(lock)) = 0; } while (0) Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler barrier? Shouldn't this be a _Asm_mf()? The point of the commit was to make spinlocks act as compiler barriers as well as CPU barriers. So I was just looking to add a compiler barrier to what was already there. You removed a volatile at the same time, and volatile on IA64 has acquire/release semantics.
[HACKERS] Refactor to split nodeAgg.c?
I was going to rebase my HashAgg patch, and got some conflicts related to the grouping sets patch. I could probably sort them out, but I think that may be the tipping point where we want to break up nodeAgg.c into nodeSortedAgg.c and nodeHashAgg.c, and probably a common file as well. This would also (I hope) be convenient for Simon and David Rowley, who have been hacking on aggregates in general. Anyone see a reason I shouldn't give this a try? Regards, Jeff Davis
Re: [HACKERS] anole: assorted stability problems
On Mon, Jun 29, 2015 at 10:32 PM, Andres Freund and...@anarazel.de wrote: On 2015-06-29 22:11:33 -0400, Robert Haas wrote: On Mon, Jun 29, 2015 at 6:11 AM, Andres Freund and...@anarazel.de wrote: On 2015-06-29 00:42:53 -0400, Tom Lane wrote: #define S_UNLOCK(lock)\ do { _Asm_sched_fence(); (*(lock)) = 0; } while (0) Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler barrier? Shouldn't this be a _Asm_mf()? The point of the commit was to make spinlocks act as compiler barriers as well as CPU barriers. So I was just looking to add a compiler barrier to what was already there. You removed a volatile at the same time, and volatile on IA64 has acquire/release semantics. Can you explain what you mean by volatile having acquire/release semantics? I don't see how volatile can create a CPU barrier, but I'm guessing that it somehow can and that you're about to enlighten me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] anole: assorted stability problems
Robert Haas robertmh...@gmail.com writes: On Mon, Jun 29, 2015 at 10:32 PM, Andres Freund and...@anarazel.de wrote: You removed a volatile at the same time, and volatile on IA64 has acquire/release semantics. Can you explain what you mean by volatile having acquire/release semantics? I don't see how volatile can create a CPU barrier, but I'm guessing that it somehow can and that you're about to enlighten me. It's late and I'm tired, but: gcc (and, apparently, icc) treats accesses to volatile-qualified objects as cues to emit .acq or .rel memory ordering qualifiers on IA64 instructions, per the comments in s_lock.h. I have not seen any documentation stating specifically that aCC does the same, but the buildfarm evidence is pretty clear that the 9.4 IA64-non-gcc version of S_UNLOCK worked and the up-to-now-9.5 version does not. So personally, I would be inclined to put back the volatile qualifier, independently of any fooling around with _Asm_double_magic_xyzzy calls. Or to put it differently: where is the evidence that removing the volatile qual is a good idea? regards, tom lane
Re: [HACKERS] PANIC in GIN code
On Mon, Jun 29, 2015 at 2:08 PM, Heikki Linnakangas hlinn...@iki.fi wrote: Now kill -9 postmaster, and restart. Voila, the page headers are all zeros: postgres=# select * from page_header(get_raw_page('i_foo', 0)); lsn| checksum | flags | lower | upper | special | pagesize | version | prune_xid ---+--+---+---+---+-+--+-+- -- 0/1891270 |0 | 0 | 0 | 0 | 0 |0 | 0 | 0 (1 row) postgres=# select * from gin_metapage_info(get_raw_page('i_foo', 0)); ERROR: input page is not a GIN metapage DETAIL: Flags 0189, expected 0008 I just pushed a fix for this, but unfortunately it didn't make it into 9.5alpha1. Thanks. I think that that fixed it. It survived for over an hour this time. Then it hit a different problem (apparent deadlock in the LWLock code, but that is probably a topic for another thread; I'm working on reproducing it.) Cheers, Jeff
Re: [HACKERS] anole: assorted stability problems
On Mon, Jun 29, 2015 at 10:53 PM, Andres Freund and...@anarazel.de wrote: On 2015-06-29 22:45:49 -0400, Robert Haas wrote: On Mon, Jun 29, 2015 at 10:32 PM, Andres Freund and...@anarazel.de wrote: On 2015-06-29 22:11:33 -0400, Robert Haas wrote: On Mon, Jun 29, 2015 at 6:11 AM, Andres Freund and...@anarazel.de wrote: On 2015-06-29 00:42:53 -0400, Tom Lane wrote: #define S_UNLOCK(lock)\ do { _Asm_sched_fence(); (*(lock)) = 0; } while (0) Robert, how did you choose that? Isn't _Asm_sched_fence just a compiler barrier? Shouldn't this be a _Asm_mf()? The point of the commit was to make spinlocks act as compiler barriers as well as CPU barriers. So I was just looking to add a compiler barrier to what was already there. You removed a volatile at the same time, and volatile on IA64 has acquire/release semantics. Can you explain what you mean by volatile having acquire/release semantics? I don't see how volatile can create a CPU barrier, but I'm guessing that it somehow can and that you're about to enlighten me. It's a IA64 pecularity. I think it started with intel's compiler, but since then at least msvc and gcc copied it. In essence volatile reads implicitly have acquire semantics, and volatile writes release. That's relatively cheap on IA64 because they have 'opcode tags' marking normal moves etc. as having release or acquire semantics (mov.rel etc.). We even have a comment about it that scratches the surface a bit: /* * Intel Itanium, gcc or Intel's compiler. * * Itanium has weak memory ordering, but we rely on the compiler to enforce * strict ordering of accesses to volatile data. In particular, while the * xchg instruction implicitly acts as a memory barrier with 'acquire' * semantics, we do not have an explicit memory fence instruction in the * S_UNLOCK macro. We use a regular assignment to clear the spinlock, and * trust that the compiler marks the generated store instruction with the * .rel opcode. 
* * Testing shows that assumption to hold on gcc, although I could not find * any explicit statement on that in the gcc manual. In Intel's compiler, * the -m[no-]serialize-volatile option controls that, and testing shows that * it is enabled by default. */ Ah. So my bad, then, removing the volatile. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] LWLock deadlock and gdb advice
On Mon, Jun 29, 2015 at 5:37 PM, Jeff Janes jeff.ja...@gmail.com wrote: Is there a way to use gdb to figure out who holds the lock they are waiting for? Have you considered building with LWLOCK_STATS defined, and LOCK_DEBUG defined? That might do it. Otherwise, I suggest dereferencing the l argument to LWLockAcquireWithVar() or something -- set the frame to 3 in your example backtrace, using f 3. Then, p *l. You'll get the tranche id there, and so on. -- Peter Geoghegan
Re: [HACKERS] Reduce ProcArrayLock contention
On Mon, Jun 29, 2015 at 10:52 PM, Simon Riggs si...@2ndquadrant.com wrote: On 29 June 2015 at 18:11, Andres Freund and...@anarazel.de wrote: On June 29, 2015 7:02:10 PM GMT+02:00, Simon Riggs si...@2ndquadrant.com wrote: On 29 June 2015 at 16:27, Amit Kapila amit.kapil...@gmail.com wrote: Thanks to Robert Haas for having discussion (offlist) about the idea and suggestions to improve it, and also Andres Freund for having discussion and sharing thoughts about this idea at PGCon. Weird. This patch is implemented exactly the way I said to implement it publicly at PgCon. Was nobody recording the discussion at the unconference?? Amit presented an earlier version of this at the unconference? Yes, I know. And we all had a long conversation about how to do it without waking up the other procs. Forming a list, like we use for sync rep, and having just a single process walk the queue was the way I suggested then and previously. Yes, I remember very well the suggestion you gave during the unconference, and I really value that idea/suggestion. Here, the point is that I could get the chance to have in-person discussion with Robert and Andres where they spent more time discussing the problem (Robert has discussed this much more apart from PGCon as well), but that doesn't mean I am not thankful for the ideas or suggestions I got from you and others during the unconference. Thank you for participating in the discussion and giving suggestions, and I really mean it, as I felt good about it even afterwards. Now, I would like to briefly explain how the allow-one-waker idea has helped to improve the patch, as not everybody here was present at that unconference. 
The basic problem I was seeing with the initial version of the patch was that I was using an LWLockAcquireOrWait call for all the clients that were not able to get ProcArrayLock (conditionally in the first attempt); after waking, each proc would check whether its XID was cleared and, if not, would again try to AcquireOrWait for ProcArrayLock. This limited the patch's performance improvement at moderate client counts like 32 or 64, because even retrying the LWLock in exclusive mode again and again was the limiting factor (I tried LWLockAcquireConditional as well instead of LWLockAcquireOrWait; it helped a bit, but not very much). Allowing just one client to become a kind of group leader that clears the other procs' XIDs and wakes them up, with all the other procs waiting after adding themselves to the pendingClearXid list, has helped to reduce the bottleneck around ProcArrayLock significantly. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
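The group-leader scheme described above can be sketched in miniature: waiters push themselves onto a pending list, and whichever backend gets the lock walks the whole list, clearing every waiter's XID in one pass. This is only an illustration of the pattern, single-threaded and without any locking or semaphores; the names (Proc, enqueue_pending, leader_clear_xids, pending_head) are hypothetical, not PostgreSQL's actual structures.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a backend's PGPROC entry. */
typedef struct Proc
{
    int          xid;       /* pretend XID to be cleared at commit */
    struct Proc *next;      /* link in the pending-clear list */
} Proc;

static Proc *pending_head = NULL;

/* A backend that could not get the lock enqueues itself and would
 * then sleep on its semaphore (omitted here). */
static void
enqueue_pending(Proc *p)
{
    p->next = pending_head;
    pending_head = p;
}

/* The one backend that does acquire the lock acts as group leader:
 * it walks the queue, clears each waiter's XID, and "wakes" it. */
static void
leader_clear_xids(void)
{
    Proc *p = pending_head;

    while (p != NULL)
    {
        Proc *next = p->next;

        p->xid = 0;         /* clear the XID on the waiter's behalf */
        p->next = NULL;     /* unlink; a real leader would also wake it */
        p = next;
    }
    pending_head = NULL;
}
```

The payoff, as described in the mail, is that only one process contends for the exclusive lock per batch; the others pay a single enqueue plus a sleep instead of repeated lock retries.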
Re: [HACKERS] Refactor to split nodeAgg.c?
Jeff Davis pg...@j-davis.com writes: I was going to rebase my HashAgg patch, and got some conflicts related to the grouping sets patch. I could probably sort them out, but I think that may be the tipping point where we want to break up nodeAgg.c into nodeSortedAgg.c and nodeHashAgg.c, and probably a common file as well. This would also (I hope) be convenient for Simon and David Rowley, who have been hacking on aggregates in general. Anyone see a reason I shouldn't give this a try? As with the discussion about pgbench, it's hard to opine about this without seeing a concrete refactoring proposal. But if you want to try, now, very early in the dev cycle, would be the best time to try. regards, tom lane
Re: [HACKERS] Bug in bttext_abbrev_convert()
On Mon, Jun 29, 2015 at 7:47 PM, Peter Geoghegan p...@heroku.com wrote: Commits b181a919 and arguably c79b6413 added bugs to bttext_abbrev_convert() in the process of fixing some others. In the master branch, bttext_abbrev_convert() can leak memory when the C locale happens to be used and we must detoast, which is unacceptable for about the same reason that it's unacceptable for a standard B-Tree comparator routine. There could also be a use-after-free issue for large strings that are detoasted, because bttext_abbrev_convert() hashes memory which might already be freed (when hashing the authoritative value). Attached patch fixes these issues. As we all know, the state of automated testing is pretty lamentable. This is the kind of thing that we could catch more easily in the future if better infrastructure were in place. I caught this by eyeballing bttext_abbrev_convert() with slightly fresher eyes than the last time I looked. Committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Mon, Jun 29, 2015 at 7:18 PM, Simon Riggs si...@2ndquadrant.com wrote: On 28 June 2015 at 17:17, Tom Lane t...@sss.pgh.pa.us wrote: I'm not sure what you consider dire, but missing a dirty buffer belonging to the to-be-destroyed table would result in the system being permanently unable to checkpoint, because attempts to write out the buffer to the no-longer-extant file would fail. You could only get out of the situation via a forced database crash (immediate shutdown), followed by replaying all the WAL since the time of the problem. In production contexts that could be pretty dire. Yes, it's bad, but we do notice that it has happened. We can also put in code to specifically avoid this error at checkpoint time. If lseek fails badly then SeqScans would give *silent* data loss, which in my view is worse. Just-added pages aren't the only thing we might miss if lseek is badly wrong. So for the purpose of this patch, do we need to assume that lseek can give us the wrong size for a file, and should we add preventive checks and other handling for the same? I am okay with changing it that way if we are going to have that as an assumption in our code wherever we are using it or will use it in the future; otherwise we will end up with some preventive checks which are actually not required. Another idea here is to use some other way instead of lseek to find the size of a file. Do you think we can use stat() for this purpose, given that we are already using it in fd.c? So, I think this patch still has legs. We can check that the clean up has been 100% when we do the buffer scan at the start of the checkpoint One way to ensure that is to verify that each buffer header tag is valid (which essentially means checking whether the object exists); do you have something else in mind to accomplish this part if we decide to go this route? - that way we do just one scan of the buffer pool and move a time-consuming operation into a background process. 
Agreed, and if that turns out to be cheap, then we might want to optimize Drop Database and others in the same way. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Mon, Jun 29, 2015 at 5:41 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Sun, Jun 28, 2015 at 9:05 PM, Tom Lane t...@sss.pgh.pa.us wrote: Amit Kapila amit.kapil...@gmail.com writes: On Sat, Jun 27, 2015 at 7:40 PM, Tom Lane t...@sss.pgh.pa.us wrote: I don't like this too much because it will fail badly if the caller is wrong about the maximum possible page number for the table, which seems not exactly far-fetched. (For instance, remember those kernel bugs we've seen that cause lseek to lie about the EOF position?) Considering we already have an exclusive lock while doing this operation, and nobody else can write to this file, won't closing and opening it again avoid such problems? On what grounds do you base that touching faith? Sorry, but I don't get what problem you see in this. On thinking about it again, I think you are worried that if we close and reopen the file, it would break any flush operation that could happen in parallel via checkpoint. Still, I am not clear whether we want to assume that we can't rely on lseek for the size of a file if there can be parallel write activity on the file (even if that write doesn't increase the size of the file)? If yes, then we have the options below: a. Have some protection mechanism for file access (ignore the error for a file not present or accessed during flush) and clean the buffers containing invalid objects, as is being discussed up-thread. b. Use some other API like stat to get the size of the file? Do you prefer either of these, or if you have a better idea, please share it. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
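The stat-versus-lseek question above can be sketched as a cross-check: obtain the file size both ways and treat any disagreement as suspect. This is only an illustration of the idea under discussion, using raw syscalls; PostgreSQL's own code would go through fd.c, and the function name file_size_checked is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the file's size, but only if lseek(SEEK_END) and fstat()
 * agree on where EOF is; return -1 on error or disagreement.  A kernel
 * bug that makes lseek lie about EOF (as mentioned in the thread) would
 * show up here as a mismatch, assuming fstat is unaffected. */
static long
file_size_checked(const char *path)
{
    int         fd = open(path, O_RDONLY);
    off_t       seek_end;
    struct stat st;

    if (fd < 0)
        return -1;

    seek_end = lseek(fd, 0, SEEK_END);
    if (seek_end < 0 || fstat(fd, &st) != 0 || seek_end != st.st_size)
    {
        close(fd);
        return -1;      /* syscall failed, or the two views of EOF disagree */
    }
    close(fd);
    return (long) seek_end;
}
```

Note that both calls usually read the same cached inode size, so this check catches lseek-specific bugs rather than filesystem corruption in general; that is roughly why the thread treats it as a preventive check rather than a guarantee.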
Re: [HACKERS] LWLock deadlock and gdb advice
On Tue, Jun 30, 2015 at 6:25 AM, Peter Geoghegan p...@heroku.com wrote: On Mon, Jun 29, 2015 at 5:37 PM, Jeff Janes jeff.ja...@gmail.com wrote: Is there a way to use gdb to figure out who holds the lock they are waiting for? Have you considered building with LWLOCK_STATS defined, and LOCK_DEBUG defined? That might do it. If you define LOCK_DEBUG, then you can check owner of the lock [1], which will tell you about the Exclusive owner of that lock and can help you in debugging the problem. [1] #ifdef LOCK_DEBUG struct PGPROC *owner; /* last exlusive owner of the lock */ #endif } LWLock; With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On 30 June 2015 at 05:02, Amit Kapila amit.kapil...@gmail.com wrote: On Mon, Jun 29, 2015 at 7:18 PM, Simon Riggs si...@2ndquadrant.com wrote: On 28 June 2015 at 17:17, Tom Lane t...@sss.pgh.pa.us wrote: I'm not sure what you consider dire, but missing a dirty buffer belonging to the to-be-destroyed table would result in the system being permanently unable to checkpoint, because attempts to write out the buffer to the no-longer-extant file would fail. You could only get out of the situation via a forced database crash (immediate shutdown), followed by replaying all the WAL since the time of the problem. In production contexts that could be pretty dire. Yes, its bad, but we do notice that has happened. We can also put in code to specifically avoid this error at checkpoint time. If lseek fails badly then SeqScans would give *silent* data loss, which in my view is worse. Just added pages aren't the only thing we might miss if lseek is badly wrong. So for the purpose of this patch, do we need to assume that lseek can give us wrong size of file and we should add preventive checks and other handling for the same? I am okay to change that way, if we are going to have that as assumption in out code wherever we are using it or will use it in-future, otherwise we will end with some preventive checks which are actually not required. They're preventative checks. You always hope it is wasted effort. -- Simon Riggshttp://www.2ndQuadrant.com/ http://www.2ndquadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Solaris testers wanted for strxfrm() behavior
On Mon, Jun 29, 2015 at 11:52:26AM +1200, Thomas Munro wrote: On Mon, Jun 29, 2015 at 10:57 AM, Tom Lane t...@sss.pgh.pa.us wrote: Thomas Munro thomas.mu...@enterprisedb.com writes: Just by the way, I wonder if this was that bug: https://illumos.org/issues/1594 Oooh. Might or might not be *same* bug, but it sure looks like it could have the right symptom. If this is indeed inherited from old Solaris, I'm afraid we are totally fooling ourselves if we guess that it's no longer present in the wild. Very interesting. Looks like the illumos strxfrm() came from FreeBSD, not from Solaris; illumos introduced their bug independently: https://illumos.org/issues/2 https://github.com/illumos/illumos-gate/commits/master/usr/src/lib/libc/port/locale/collate.c Also, here is an interesting patch that went into the Apache C++ standard library. Maybe the problem was limited to amd64 system... https://github.com/illumos/illumos-userland/blob/master/components/stdcxx/patches/047-collate.cpp.patch That's a useful data point. Based on Oskari Saarenmaa's report, newer Solaris 10 is not affected. The fix presumably showed up after the 05/08 release and no later than the 01/13 release. On Sun, Jun 28, 2015 at 07:00:14PM -0400, Tom Lane wrote: On Sun, Jun 28, 2015 at 12:58 PM, Josh Berkus j...@agliodbs.com wrote: My perspective is that if both SmartOS and OmniOS pass, it's not our responsibility to support OldSolaris if they won't update libraries. Another idea would be to make a test during postmaster start to see if this bug exists, and fail if so. I'm generally on board with the thought that we don't need to work on systems with such a bad bug, but it would be a good thing if the failure was clean and produced a helpful error message, rather than looking like a Postgres bug. Failing cleanly on unpatched Solaris is adequate, agreed. A check at postmaster start isn't enough, because the postmaster might run in the C locale while individual databases or collations use problem locales. 
The safest thing is to test after every setlocale(LC_COLLATE) and newlocale(LC_COLLATE). That's once at backend start and once per backend per collation used, more frequent than I would like. Hmm. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
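The property such a check would verify is that strxfrm() key comparisons agree with strcoll() under the active LC_COLLATE setting. The real test would of course be written in C inside the backend; this Python sketch (the helper name `strxfrm_is_consistent` is mine, for illustration only) just shows the consistency condition being probed after a locale change:

```python
import locale
from functools import cmp_to_key

def strxfrm_is_consistent(words):
    """Return True if sorting by strxfrm() keys gives the same order as
    pairwise strcoll() comparisons under the current LC_COLLATE locale.
    A buggy strxfrm() (as in the illumos report) breaks this invariant."""
    by_xfrm = sorted(words, key=locale.strxfrm)
    by_coll = sorted(words, key=cmp_to_key(locale.strcoll))
    return by_xfrm == by_coll

# Run the check right after switching the collation locale, mirroring
# the "test after every setlocale(LC_COLLATE)" idea from the thread.
locale.setlocale(locale.LC_COLLATE, "C")
probe = ["zebra", "Apple", "mango", "banana"]
assert strxfrm_is_consistent(probe)
```

On an affected libc, a probe list like this would come back in two different orders, which is exactly the condition a clean startup-time (or per-collation) error check could report.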
Re: [HACKERS] pg_trgm version 1.2
On Mon, Jun 29, 2015 at 7:23 AM, Merlin Moncure mmonc...@gmail.com wrote:
> On Sat, Jun 27, 2015 at 5:17 PM, Jeff Janes jeff.ja...@gmail.com wrote:
>> V1.1: Time: 1743.691 ms --- after repeated execution to warm the cache
>> V1.2: Time: 2.839 ms --- after repeated execution to warm the cache
>
> Wow! I'm going to test this. I have some data sets for which trigram searching isn't really practical... if the search string touches trigrams with a lot of duplication, the algorithm can have trouble beating brute force searches. Trigram searching is important: it's currently the only way to quickly search string-encoded structures for partial strings.

I ran your patch against stock 9.4 and am happy to confirm massive speedups of pg_trgm; runtime reductions of 90% are common. Also, with the new changes it's hard to make the indexed search significantly underperform a brute force search, which is a huge improvement over the stock behavior, something that made me very wary of using these kinds of searches in the past.

data table: 'test2'
rows: ~2 million
heap size: 3.3GB (includes several unrelated fields)
index size: 1GB
(9.4: stock, 9.5: patched)

match 50% of rows, brute force seq scan:
  9.4: 11.5s    9.5: 9.1s
match 50% of rows, indexed (time is quite variable, with 9.4 giving 40-second times):
  9.4: 21.0s    9.5: 11.8s
match 1% of rows, indexed (90% time reduction!):
  9.4: 0.566s   9.5: 0.046s
match 0.1% of rows, one selective and one non-selective search term, selective term first:
  9.4: 0.563s   9.5: 0.028s
match 0.1% of rows, one selective and one non-selective search term, selective term last:
  9.4: 1.014s   9.5: 0.093s

Very nice! Recently I evaluated pg_trgm for an attribute searching system; it failed due to response-time variability and the lack of tools to control it. Had your patch been in place, it would have passed, and that was a 'real world' data set. With this, pg_trgm basically outperforms the SOLR search engine for all the cases we're interested in, whereas before the low-selectivity cases were having all kinds of trouble.
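For readers unfamiliar with the mechanics being benchmarked above: pg_trgm decomposes strings into 3-character "trigrams" and scores matches by trigram overlap. This Python sketch is only an illustrative approximation of that idea, not pg_trgm's actual C implementation (the padding and similarity formula here are simplified):

```python
def trigrams(s):
    """Extract the set of 3-character substrings (trigrams) from a string,
    lowercased and padded at the ends, roughly as pg_trgm does."""
    padded = "  " + s.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Jaccard-style trigram similarity in [0, 1]: shared trigrams over
    total distinct trigrams, analogous in spirit to pg_trgm's similarity()."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

This also shows why non-selective search terms hurt the stock index: a term whose trigrams appear in a large fraction of the rows forces many candidate matches to be fetched and rechecked, which is the case the patched version handles so much better.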
merlin