Re: [HACKERS] Proposal: Implement failover on libpq connect level.
Hans-Jürgen Schönig wrote:
> in addition to that you have the “problem” of transactions. if you failover
> in the middle of a transaction, strange things might happen from the
> application point of view. the good thing, however, is that stupid middleware
> is sometimes not able to handle failed connections. however, overall i think
> it is more of a danger than a benefit.

Maybe I misunderstood the original proposal, but my impression was that the alternative servers would be tried only at the time the connection is established, and there would be no such problems as you describe. Those could only happen if libpq automatically tried to reconnect upon failure without the client noticing. So the stupid middleware would get an error message, but the reconnect would actually work.

Yours,
Laurenz Albe

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
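The connect-time-only semantics Laurenz describes can be pictured as a simple loop on the client side — a hedged sketch, not libpq's actual API; `connect_with_failover` and `fake_connect` are hypothetical names:

```python
# Sketch of connect-time-only failover: alternative servers are tried once,
# when the connection is established; a later failure surfaces to the caller
# instead of triggering a silent reconnect. The connect callable stands in
# for a real driver's connect function.

def connect_with_failover(hosts, connect):
    """Try each (host, port) in order; return the first live connection."""
    last_error = None
    for host, port in hosts:
        try:
            return connect(host, port)
        except ConnectionError as e:
            last_error = e  # server unreachable: try the next one
    raise last_error or ConnectionError("no hosts given")

def fake_connect(host, port):
    # Simulate the primary being down and the standby answering.
    if host == "main-server":
        raise ConnectionError("main-server is down")
    return (host, port)

conn = connect_with_failover(
    [("main-server", 5432), ("standby1", 5432)], fake_connect)
print(conn)  # ('standby1', 5432)
```

Note that once `conn` is handed back, no further host-switching happens; that is exactly why an in-transaction failover cannot occur under this scheme.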
Re: [HACKERS] Declarative partitioning
On 18 August 2015 at 11:30, Amit Langote langote_amit...@lab.ntt.co.jp wrote:

Hi,

I would like to propose $SUBJECT for this development cycle. Attached is a WIP patch that implements most if not all of what's described below. Some yet unaddressed parts are mentioned below, too. I'll add this to the CF-SEP.

Syntax
==

1. Creating a partitioned table

CREATE TABLE table_name PARTITION BY {RANGE|LIST} ON (column_list);

Where column_list consists of simple column names or expressions:

PARTITION BY LIST ON (name)
PARTITION BY RANGE ON (year, month)
PARTITION BY LIST ON ((lower(left(name, 2))))
PARTITION BY RANGE ON ((extract(year from d)), (extract(month from d)))

Note: a LIST partition key supports only one column.

For each column, you can write an operator class name:

PARTITION BY LIST/RANGE ON (colname [USING] opclass_name)

If not specified, the default btree operator class based on the type of each key column is used. If none of the available btree operator classes is compatible with the partitioning strategy (list/range), an error is thrown. The built-in btree operator classes cover a good number of types for list and range partitioning in practical scenarios.

A table created using this form is of the proposed new relkind RELKIND_PARTITIONED_REL. An entry in pg_partitioned_rel (see below) is created to store the partition key info.

Note: a table cannot be partitioned after-the-fact using ALTER TABLE.

Normal dependencies are created between the partitioned table and the operator classes and any objects referenced in partition expressions, such as functions.

2. Creating a partition of a partitioned table

CREATE TABLE table_name PARTITION OF partitioned_table_name FOR VALUES values_spec;

Where values_spec is:

listvalues: [IN] (val1, ...)

rangevalues: START (col1min, ... ) END (col1max, ... )
           | START (col1min, ... )
           | END (col1max, ... )

A table created using this form has the proposed pg_class.relispartition set to true. An entry in pg_partition (see below) is created to store the partition bound info.
The values_spec should match the partitioning strategy of the partitioned table. In the case of a range partition, the values in START and/or END should match the columns in the partition key.

Defining a list partition is fairly straightforward - just spell out the list of comma-separated values. An error is thrown if the list of values overlaps with one of the existing partitions' lists.

CREATE TABLE persons_by_state (name text, state text) PARTITION BY LIST ON (state);

CREATE TABLE persons_IL PARTITION OF persons_by_state FOR VALUES IN ('IL');

CREATE TABLE persons_fail PARTITION OF persons_by_state FOR VALUES IN ('IL');
ERROR: cannot create partition that overlaps with an existing one

For a range partition, there is more than one way:

Specify both START and END bounds: the resulting range should not overlap with the range(s) covered by existing partitions. An error is thrown otherwise. Although rare in practice, gaps between ranges are OK.

CREATE TABLE measurement(logdate date NOT NULL) PARTITION BY RANGE ON (logdate);

CREATE TABLE measurement_y2006m02 PARTITION OF measurement FOR VALUES START ('2006-02-01') END ('2006-03-01'); --success

CREATE TABLE measurement_fail PARTITION OF measurement FOR VALUES START ('2006-02-15') END ('2006-03-01');
ERROR: cannot create partition that overlaps with an existing one

Specify only the START bound: adds the partition on the left of some range covered by existing partitions, provided no overlap occurs (also considering gaps between ranges, if any). If no such range exists, the new partition will cover the range [START, +INFINITY) and become the rightmost partition. An error is thrown if the specified START causes overlap.
CREATE TABLE measurement_y2006m01 PARTITION OF measurement FOR VALUES START ('2006-01-01'); --success

CREATE TABLE measurement_fail PARTITION OF measurement FOR VALUES START ('2006-02-01'); --overlaps with measurement_y2006m02
ERROR: cannot create partition that overlaps with an existing one

Specify only the END bound: adds the partition on the right of some range covered by existing partitions, provided no overlap occurs (also considering gaps between ranges, if any). If no such range exists, the new partition will cover the range (-INFINITY, END) and become the leftmost partition. An error is thrown if the specified END causes overlap.

CREATE TABLE measurement_y2006m03 PARTITION OF measurement FOR VALUES END ('2006-04-01'); --success

CREATE TABLE measurement_fail PARTITION OF measurement FOR VALUES END ('2006-03-01'); --overlaps with measurement_y2006m02
ERROR: cannot create partition that overlaps with an existing one

For each partition, the START and END bound values are stored in the catalog. Note that the lower bound is inclusive, whereas the upper bound is exclusive.

Note: at most one range partition can have a null min bound in which case it covers the range
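The inclusive-lower/exclusive-upper convention makes the overlap check that raises the errors above quite simple; here is a minimal sketch in Python (illustrative only, not the patch's actual C implementation), where `None` stands for an open -/+ infinity bound:

```python
# Overlap rule for range partitions: each partition covers [start, end),
# with None meaning an unbounded side. ISO date strings compare correctly
# as plain strings, so they serve as bound values here.

def overlaps(a, b):
    """True if half-open ranges a and b share any point."""
    a_start, a_end = a
    b_start, b_end = b
    starts_before_b_ends = b_end is None or a_start is None or a_start < b_end
    starts_before_a_ends = a_end is None or b_start is None or b_start < a_end
    return starts_before_b_ends and starts_before_a_ends

def can_create(new, existing):
    """Mimic the 'cannot create partition that overlaps' check."""
    return not any(overlaps(new, p) for p in existing)

parts = [("2006-02-01", "2006-03-01")]          # measurement_y2006m02
print(can_create(("2006-01-01", "2006-02-01"), parts))  # True: adjacent is OK
print(can_create(("2006-02-15", "2006-03-01"), parts))  # False: overlaps
```

Because the upper bound is exclusive, a partition ending at '2006-02-01' and one starting at '2006-02-01' are adjacent, not overlapping.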
Re: [HACKERS] Proposal: Implement failover on libpq connect level.
On 18 Aug 2015, at 11:19, Albe Laurenz laurenz.a...@wien.gv.at wrote:
> Hans-Jürgen Schönig wrote:
>> in addition to that you have the “problem” of transactions. if you failover
>> in the middle of a transaction, strange things might happen from the
>> application point of view. the good thing, however, is that stupid middleware
>> is sometimes not able to handle failed connections. however, overall i think
>> it is more of a danger than a benefit.
>
> Maybe I misunderstood the original proposal, but my impression was that the
> alternative servers would be tried only at the time the connection is
> established, and there would be no such problems as you describe.

it would still leave the problem of having a read-only node on the other side unless you are using BDR or so.

regards, hans

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Raising our compiler requirements for 9.6
On Mon, Aug 17, 2015 at 12:06:42PM +0200, Andres Freund wrote:
> On 2015-08-16 03:31:48 -0400, Noah Misch wrote:
>>> I'd love to make it a #warning instead of an error, but unfortunately
>>> that's not standard C :(
>>
>> Okay. Other than that benefit, making headers #error-on-FRONTEND mostly lets
>> us congratulate ourselves for having introduced the start of a header layer
>> distinction. I'd be for that if PostgreSQL were new, but I can't justify
>> doing it at the user cost already apparent. That would be bad business.
>
> To me that's basically saying that we'll never ever have any better
> separation between frontend/backend headers, since each incremental
> improvement won't be justifiable.

Exactly. This is one of those proposals that can never repay its costs. Header refactoring seduces hackers, but the benefits don't materialize.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb array-style subscripting
On Mon, Aug 17, 2015 at 11:26 PM, Peter Geoghegan p...@heroku.com wrote:
> On Mon, Aug 17, 2015 at 12:26 PM, Merlin Moncure mmonc...@gmail.com wrote:
>> ...is a good idea. postgres operators tend to return immutable copies of
>> the item they are referring to.
>
> This patch does not add an operator at all, actually. It feels like there
> ought to be an operator, but in fact there is not. The parser is hard-coded
> to recognize array-style subscripts, which this uses.
>
> While I'm certainly glad that Dmitry took the time to work on this, I think
> we will need an operator, too. Or, more accurately, there should probably be
> a way to make something like this use some available GIN index:
>
> postgres=# explain analyze select * from testjsonb where p['a'] = '[1]';
>                             QUERY PLAN
> -------------------------------------------------------------------
>  Seq Scan on testjsonb  (cost=0.00..27.00 rows=7 width=32) (actual time=0.022..0.023 rows=1 loops=1)
>    Filter: (p['a'] = '[1]'::jsonb)
>  Planning time: 0.070 ms
>  Execution time: 0.054 ms
> (4 rows)
>
> This doesn't really matter with arrays, but ISTM that it matters here. I
> have no strong feelings on how it should work, but certain things do seem to
> suggest themselves. For example, maybe the parser can be made to create a
> query tree that uses an indexable operator based on special-case logic.
> Although maybe that's a kludge too far, since I can imagine it breaking
> other legitimate things. My sense is that this will need to be discussed.

Peter, we are thinking about better indexing of subselects; let's first have the syntax sugar in core, which Dmitry implemented.

> --
> Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Implement failover on libpq connect level.
Victor Wagner wrote:
> Rationale
> =========
>
> Since the introduction of WAL-based replication into PostgreSQL, it has been
> possible to create high-availability and load-balancing clusters. However,
> there is no support for failover in the client libraries. So the only way to
> provide failover that is transparent to the client application is IP address
> migration. This approach has some limitations, i.e. it requires that the
> master and backup servers reside in the same subnet, or it may not be
> feasible for other reasons.
>
> Commercial RDBMS, such as Oracle, employ a more flexible approach. They allow
> specifying multiple servers in the connect string, so if the primary server
> is not available, the client library tries to connect to the other ones.
> This approach allows geographically distributed failover clusters and is
> also a cheap way to implement load balancing (which is not possible with IP
> address migration).

I wonder how useful this is at the present time.

If the primary goes down and the client gets connected to the standby, it would have read-only access there. Most applications wouldn't cope well with that.

Once we have multi-master replication that can be used for fail-over, the picture will change. Then a feature like that would be very useful indeed.

> host=main-server host=standby1 host=standby2 port=5432 dbname=database

It seems a bit arbitrary to require that all servers use the same port. Maybe parameters like host2, port2, host3, port3 etc. might be better.

Yours,
Laurenz Albe

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
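The host2/port2 alternative Laurenz suggests could pair each host with its own port, unlike the single shared `port` in the proposed connect string. A hedged sketch of how such parameters might be collected (hypothetical parsing logic, not an actual libpq behavior):

```python
# Sketch of pairing hostN/portN connection parameters, per the
# "host2, port2, host3, port3" suggestion; purely illustrative.

def parse_targets(params, default_port=5432):
    """Collect (host, port) pairs from host, host2, host3, ... keys."""
    targets = []
    n = 1
    while True:
        hkey = "host" if n == 1 else "host%d" % n
        pkey = "port" if n == 1 else "port%d" % n
        if hkey not in params:
            break
        targets.append((params[hkey], int(params.get(pkey, default_port))))
        n += 1
    return targets

print(parse_targets({"host": "main-server", "host2": "standby1",
                     "port2": "5433", "dbname": "database"}))
# [('main-server', 5432), ('standby1', 5433)]
```

Each host gets its own port, with a per-host fallback to the default, which is exactly what the shared `port=5432` form cannot express.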
Re: [HACKERS] [Proposal] Table partition + join pushdown
Hello Kondo-san,

I briefly checked your patch. Let me put some comments about its design and implementation, even though I have no arguments towards its concept. :-)

* Construction of RelOptInfo

In your patch, try_hashjoin_pushdown() called by try_hashjoin_path() constructs the RelOptInfo of the join-rel between the inner-rel and a subpath of the Append node. It is an entirely wrong implementation. I can understand we (may) have no RelOptInfo for the joinrel between tbl_child_0 and other_table, when the planner investigates a joinpath to process the join of the Append path with other_table.

HashJoin
  -> Append
       -> SeqScan on tbl_child_0
       -> SeqScan on tbl_child_1
       -> SeqScan on tbl_child_2
       -> SeqScan on tbl_child_3
  -> Hash
       -> SeqScan on other_table

How about these alternatives?

- Call make_join_rel() on the pair of tbl_child_X and other_table from try_hashjoin_pushdown() or somewhere. make_join_rel() internally constructs a RelOptInfo for the supplied pair of relations, so the relevant RelOptInfo shall be properly constructed.
- make_join_rel() also calls add_paths_to_joinrel() towards all the join logic, so it makes it easier to support pushing down other join logic, including nested-loop or custom-join.
- It may be an idea to add an extra argument to make_join_rel() to inform expressions to be applied for tuple filtering on construction of the inner hash table.

* Why only SeqScan is supported

I think it is the role of the Hash node to filter out inner tuples obviously unrelated to the join (if the CHECK constraint of the outer relation gives the information), because this join-pushdown may be able to support multi-stacked pushdown. For example, what if the planner considers a path to join this Append path with another relation, and the join clause contains a reference to X?

Append
  -> HashJoin
       -> SeqScan on tbl_child_0
       -> Hash ... Filter: hash_func(X) % 4 = 0
            -> SeqScan on other_table
  -> HashJoin
       -> SeqScan on tbl_child_1
       -> Hash ... Filter: hash_func(X) % 4 = 1
            -> SeqScan on other_table
  -> HashJoin
       -> SeqScan on tbl_child_2
       -> Hash ... Filter: hash_func(X) % 4 = 2
            -> SeqScan on other_table
  -> HashJoin
       -> SeqScan on tbl_child_3
       -> Hash ... Filter: hash_func(X) % 4 = 3
            -> SeqScan on other_table

It may be a good challenge to consider additional join pushdown, even if the subpaths of the Append are HashJoins, not SeqScans, like:

Append
  -> HashJoin
       -> HashJoin
            -> SeqScan on tbl_child_0
            -> Hash ... Filter: hash_func(X) % 4 = 0
                 -> SeqScan on other_table
       -> Hash ... Filter: hash_func(X) % 4 = 0
            -> SeqScan on another_table
  -> HashJoin
       -> HashJoin
            -> SeqScan on tbl_child_1
            -> Hash ... Filter: hash_func(X) % 4 = 1
                 -> SeqScan on other_table
       -> Hash ... Filter: hash_func(X) % 4 = 1
            -> SeqScan on another_table
  -> HashJoin
       -> HashJoin
            -> SeqScan on tbl_child_2
            -> Hash ... Filter: hash_func(X) % 4 = 2
                 -> SeqScan on other_table
       -> Hash ... Filter: hash_func(X) % 4 = 2
            -> SeqScan on another_table
  -> HashJoin
       -> HashJoin
            -> SeqScan on tbl_child_3
            -> Hash ... Filter: hash_func(X) % 4 = 3
                 -> SeqScan on other_table
       -> Hash ... Filter: hash_func(X) % 4 = 3
            -> SeqScan on another_table

In this case, the underlying nodes are not always SeqScans, so only the Hash node can have the filter clauses.

* Way to handle expression nodes

All this patch supports is a CHECK() constraint with an equality operation on the INT4 data type. You can learn about various useful pieces of PostgreSQL infrastructure. For example:

- expression_tree_mutator() is useful to make a copy of an expression node with small modifications.
- pull_varnos() is useful to check which relations are referenced by an expression node.
- RestrictInfo->can_join is useful to check whether the clause is a binary operator or not.

Anyway, reuse of existing infrastructure is the best way to make a reliable implementation and to keep it simple.
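The effect of the per-partition Hash filter in the plan trees above can be sketched in a few lines — an illustrative Python model, not the patch's C code; `hash_func` and the modulus 4 are taken from the example plans:

```python
# For partition i (holding rows with hash_func(X) % 4 == i), the Hash node
# only loads the slice of other_table whose join key hashes to i, shrinking
# each inner hash table to roughly 1/4 of the full relation.

def hash_func(x):
    return hash(x)  # stand-in for the partitioning hash function

def pushed_down_hash_join(partitions, other_table, key, nparts=4):
    """Join each partition against only its matching slice of other_table."""
    results = []
    for i, part in enumerate(partitions):
        # Hash node with Filter: hash_func(X) % nparts = i
        hash_table = {}
        for row in other_table:
            if hash_func(row[key]) % nparts == i:
                hash_table.setdefault(row[key], []).append(row)
        for row in part:
            for match in hash_table.get(row[key], []):
                results.append((row, match))
    return results

parts = [[{"x": v} for v in range(100) if hash_func(v) % 4 == i]
         for i in range(4)]
other = [{"x": v} for v in range(0, 100, 10)]
joined = pushed_down_hash_join(parts, other, "x")
print(len(joined))  # 10 matches, same as an unfiltered join
```

The filter is safe because any `other_table` row whose key hashes to a different partition number can never find an equal key inside that partition, so discarding it early cannot lose matches.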
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com

> -----Original Message-----
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Taiki Kondo
> Sent: Thursday, August 13, 2015 6:30 PM
> To: pgsql-hackers@postgresql.org
> Cc: Kaigai Kouhei(海外 浩平); Iwaasa Akio(岩浅 晃郎)
> Subject: [HACKERS] [Proposal] Table partition + join pushdown
>
> Hi all,
>
> I saw the email about the idea from KaiGai-san[1], and I worked to implement
> this idea. Now, I have implemented a part of this idea, so I want to propose
> this feature. The patch attached just shows my concept of this feature. It
> works fine for EXPLAIN, but it returns wrong results for other operations,
> sadly.
>
> Table partition + join pushdown
> ===============================
>
> Motivation
> ----------
> To make join logic work more effectively, it is important to
Re: [HACKERS] checkpointer continuous flushing
Hello Amit,

>> So the option is best kept as off for now, without further data, I'm fine
>> with that.
>
> One point to think about here is on what basis a user can decide to turn
> this option on; is it predictable in any way? I think one case could be when
> the data set fits in shared_buffers.

Yep.

> In general, providing an option is a good idea if the user can decide with
> ease when to use that option, or we can give some clear recommendation for
> it. Otherwise one has to recommend "test your workload with this option, and
> if it works then great, else don't use it", which might also be okay in some
> cases, but it is better to be clear.

My opinion, which is not backed by any data (anyone can feel free to provide a FreeBSD box for testing...), is that if you have a significant write load it would mostly be an improvement to have the flush option on when running on non-Linux systems which provide posix_fadvise. If you have a lot of reads and few writes, then postgresql currently works reasonably enough, which is why people do not complain too much about write stalls, and I expect that the situation would not be significantly degraded. Now there are competing positive and negative effects induced by using posix_fadvise, and moreover its implementation varies from OS to OS, so without running some experiments it is hard to be definite.

> One minor point: while glancing through the patch, I noticed that a couple
> of multiline comments are not written in the way usually used in the code
> (keep the first line empty).

Indeed. Please find attached a v10, where I have reviewed comments for style and contents, and also slightly extended the documentation about the flush option to hint that it is essentially useful for high write loads. Without further data, I think it is not obvious to give more definite advice.
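The buffer sorting that the attached patch enables via `checkpoint_sort` can be pictured with a short sketch — illustrative Python under assumed names (`relfile`, `block`, `checkpoint_write_order`), not the actual C implementation:

```python
# Illustrative sketch of checkpoint buffer sorting: writing dirty buffers
# ordered by (relation file, block number) turns the essentially random
# buffer-pool order into mostly-sequential writes, which helps HDDs most.

from random import shuffle

def checkpoint_write_order(dirty_buffers, sort=True):
    """Return the order in which dirty buffers would be written."""
    if sort:
        return sorted(dirty_buffers, key=lambda b: (b["relfile"], b["block"]))
    return list(dirty_buffers)

buffers = [{"relfile": r, "block": b} for r in range(2) for b in range(4)]
shuffle(buffers)  # buffer pool order is essentially random

ordered = checkpoint_write_order(buffers)
print(ordered[0])  # {'relfile': 0, 'block': 0}
```

On an SSD the write order matters far less, which is why the discussion above treats the option as mainly a HDD (and possibly wear-leveling) benefit.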
--
Fabien.

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e900dcc..1cec243 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2454,6 +2454,28 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-checkpoint-sort" xreflabel="checkpoint_sort">
+      <term><varname>checkpoint_sort</varname> (<type>bool</type>)
+      <indexterm>
+       <primary><varname>checkpoint_sort</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Whether to sort buffers before writing them out to disk on checkpoint.
+        For HDD storage, this setting allows grouping together
+        neighboring pages written to disk, thus improving performance by
+        reducing random write activity.
+        This sorting should have limited performance effects on SSD backends
+        as such storage has good random write performance, but it may
+        help with wear-leveling, so it may be worth keeping anyway.
+        The default is <literal>on</>.
+        This parameter can only be set in the <filename>postgresql.conf</>
+        file or on the server command line.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
       <term><varname>checkpoint_warning</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index e3941c9..f538698 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,18 @@
   </para>
 
   <para>
+    When hard-disk drives (HDD) are used for terminal data storage,
+    <xref linkend="guc-checkpoint-sort"> allows sorting pages
+    so that neighboring pages on disk will be flushed together by
+    checkpoints, reducing the random write load and improving performance.
+    If solid-state drives (SSD) are used, sorting pages induces no benefit
+    as their random write I/O performance is good: this feature could then
+    be disabled by setting <varname>checkpoint_sort</> to <literal>off</>.
+    It is possible that sorting may help with SSD wear leveling, so it may
+    be kept on that account.
+   </para>
+
+   <para>
     The number of WAL segment files in <filename>pg_xlog</> directory depends on
     <varname>min_wal_size</>, <varname>max_wal_size</> and
     the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 68e33eb..bee38ab 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7995,11 +7995,13 @@ LogCheckpointEnd(bool restartpoint)
 				sync_secs,
 				total_secs,
 				longest_secs,
+				sort_secs,
 				average_secs;
 	int			write_usecs,
 				sync_usecs,
 				total_usecs,
 				longest_usecs,
+				sort_usecs,
 				average_usecs;
 	uint64		average_sync_time;
 
@@ -8030,6 +8032,10 @@ LogCheckpointEnd(bool restartpoint)
 						CheckpointStats.ckpt_end_t,
 						&total_secs, &total_usecs);
 
+	TimestampDifference(CheckpointStats.ckpt_sort_t,
+						CheckpointStats.ckpt_sort_end_t,
+						&sort_secs, &sort_usecs);
+
 	/*
 	 * Timing values returned from CheckpointStats are in
Re: [HACKERS] Proposal: Implement failover on libpq connect level.
On 18 Aug 2015, at 10:32, Albe Laurenz laurenz.a...@wien.gv.at wrote:
> Victor Wagner wrote:
>> Rationale
>> =========
>>
>> Since the introduction of WAL-based replication into PostgreSQL, it has been
>> possible to create high-availability and load-balancing clusters. However,
>> there is no support for failover in the client libraries. So the only way to
>> provide failover that is transparent to the client application is IP address
>> migration. This approach has some limitations, i.e. it requires that the
>> master and backup servers reside in the same subnet, or it may not be
>> feasible for other reasons.
>>
>> Commercial RDBMS, such as Oracle, employ a more flexible approach. They
>> allow specifying multiple servers in the connect string, so if the primary
>> server is not available, the client library tries to connect to the other
>> ones. This approach allows geographically distributed failover clusters and
>> is also a cheap way to implement load balancing (which is not possible with
>> IP address migration).
>
> I wonder how useful this is at the present time.
>
> If the primary goes down and the client gets connected to the standby, it
> would have read-only access there. Most applications wouldn't cope well with
> that.
>
> Once we have multi-master replication that can be used for fail-over, the
> picture will change. Then a feature like that would be very useful indeed.
>
>> host=main-server host=standby1 host=standby2 port=5432 dbname=database
>
> It seems a bit arbitrary to require that all servers use the same port.
> Maybe parameters like host2, port2, host3, port3 etc. might be better.
>
> Yours,
> Laurenz Albe

i totally agree with laurenz. in addition to that you have the “problem” of transactions. if you failover in the middle of a transaction, strange things might happen from the application point of view. the good thing, however, is that stupid middleware is sometimes not able to handle failed connections. however, overall i think it is more of a danger than a benefit.

regards, hans

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] allowing wal_level change at run time
On Tue, Aug 18, 2015 at 7:59 AM, Peter Eisentraut pete...@gmx.net wrote:
> How would we handle decreases at run time? We can prevent >= archive ->
> minimal if archiving is running or there are physical replication slots, and
> we can prevent logical -> something less if there are logical replication
> slots, but AFAICT, we don't have a way to check whether anyone currently
> needs level hot_standby.

What do you mean by "prevent"? If the user edits postgresql.conf and reduces the setting, and then reloads the configuration file, they have a right to expect that the changes got applied.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] allowing wal_level change at run time
In [1], it was discussed to make wal_level changeable at run time (PGC_SIGHUP), as part of an effort to make replication easier to set up and/or provide better defaults. I was wondering what it would take to do that.

It looks like increasing the setting is doable, as long as we WAL-log the change using existing facilities. I don't understand the hot_standby -> logical transition, so maybe something else is necessary there.

How would we handle decreases at run time? We can prevent >= archive -> minimal if archiving is running or there are physical replication slots, and we can prevent logical -> something less if there are logical replication slots, but AFAICT, we don't have a way to check whether anyone currently needs level hot_standby. (Thread [1] originally proposed to get rid of the archive/hot_standby distinction, which would help here.)

Or we could just let users make their own mess if they want to.

Comments?

[1] http://www.postgresql.org/message-id/CA+TgmoZLji0tAjoVOPFAPJ8d2e8Q=zm4zdtqxska3j+z-pn...@mail.gmail.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
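The decrease rules being discussed can be summarized in a small validity check — a hedged sketch with hypothetical predicate arguments, not actual PostgreSQL code:

```python
# Sketch of the proposed run-time decrease checks for wal_level. The level
# ordering and the two preventable transitions come from the discussion;
# the boolean/count arguments are hypothetical stand-ins for server state.

LEVELS = ["minimal", "archive", "hot_standby", "logical"]

def can_decrease(current, new, archiving_active, physical_slots, logical_slots):
    """Check whether lowering wal_level from current to new is safe."""
    cur, tgt = LEVELS.index(current), LEVELS.index(new)
    if tgt >= cur:
        return True  # not a decrease; increases are handled separately
    if new == "minimal" and (archiving_active or physical_slots):
        return False  # >= archive -> minimal: archiver/physical slots in use
    if current == "logical" and logical_slots:
        return False  # logical -> something less: logical slots in use
    # no way to tell whether anyone still needs level hot_standby
    return True

print(can_decrease("logical", "hot_standby", False, 0, 2))  # False
```

The unchecked fall-through at the end mirrors the open problem in the email: there is no server-side state that says whether level hot_standby is still needed.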
Re: [HACKERS] Test code is worth the space
On Tue, Aug 18, 2015 at 02:03:19PM +0100, Greg Stark wrote:
> On Tue, Aug 18, 2015 at 6:57 AM, Noah Misch n...@leadboat.com wrote:
>> I suspect any effort to significantly improve Postgres test coverage is
>> doomed until there's an alternative to pg_regress.
>
> There is the src/test/perl/TestLib.pm harness.
>
> Sadly I think the test suite is only half the battle. The coding style of
> Postgres predates modern test suite systems and makes it hard to test. Most
> functions require a specific environment set up that would be laborious and
> difficult to do in any sane way. Even something as self-contained as
> tuplesort would be difficult to test without the whole operator class and
> syscache mechanisms initialized and populated. And that's an easy case,
> imagine trying to test individual functions in the planner without doing a
> complete planner run on a query.

I'm given to understand that this tight coupling is necessary for performance. Are you saying that it could be unwound, or that testing strategies mostly need to take it into account, or...?

Cheers,
David.
--
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning
Hi Thom,

On Tue, Aug 18, 2015 at 8:02 PM, Thom Brown t...@linux.com wrote:
> Wow, didn't expect to see that email this morning.
>
> A very quick test:
>
> CREATE TABLE purchases (purchase_id serial, purchase_time timestamp, item text)
>   PARTITION BY RANGE ON ((extract(year from purchase_time)), (extract(month from purchase_time)));
> ERROR: referenced relation "purchases" is not a table or foreign table

Thanks for the quick test. Damn, I somehow missed adding the new relkind to a check in process_owned_by(). Will fix this and look for any such oversights.

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] allowing wal_level change at run time
On Tue, Aug 18, 2015 at 1:59 PM, Peter Eisentraut pete...@gmx.net wrote:
> In [1], it was discussed to make wal_level changeable at run time
> (PGC_SIGHUP), as part of an effort to make replication easier to set up
> and/or provide better defaults. I was wondering what it would take to do
> that.
>
> It looks like increasing the setting is doable, as long as we WAL-log the
> change using existing facilities. I don't understand the hot_standby ->
> logical transition, so maybe something else is necessary there.
>
> How would we handle decreases at run time? We can prevent >= archive ->
> minimal if archiving is running or there are physical replication slots, and
> we can prevent logical -> something less if there are logical replication
> slots, but AFAICT, we don't have a way to check whether anyone currently
> needs level hot_standby. (Thread [1] originally proposed to get rid of the
> archive/hot_standby distinction, which would help here.)
>
> Or we could just let users make their own mess if they want to.

I still think we should get rid of the difference between archive/hot_standby. It creates more trouble than I've ever seen it save. I think we can safely label that as something we added because we didn't know if it was going to be needed (because back in 9.0 nobody knew what the impact *really* would be), but now we know it's not necessary, so we can just get rid of it. Especially as it makes something like this easier. Removing the complexity of such important parts of the code is a feature in itself!

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] Test code is worth the space
On Tue, Aug 18, 2015 at 6:57 AM, Noah Misch n...@leadboat.com wrote:
> My own position is based on having maintained a pg_regress suite an order of
> magnitude larger than that. I don't know why that outcome was so different.

Comparing the size of test suites by these numbers is impossible because people put more or fewer tests in each schedule file. I would be more interested in the run-time as a metric, but even that is fallible, as I've seen individual tests that took 30min to run. And does your pg_regress test suite actually find many bugs, or does it mainly detect when functionality has changed and require updating expected results to match?

> I suspect any effort to significantly improve Postgres test coverage is
> doomed until there's an alternative to pg_regress.

There is the src/test/perl/TestLib.pm harness.

Sadly I think the test suite is only half the battle. The coding style of Postgres predates modern test suite systems and makes it hard to test. Most functions require a specific environment set up that would be laborious and difficult to do in any sane way. Even something as self-contained as tuplesort would be difficult to test without the whole operator class and syscache mechanisms initialized and populated. And that's an easy case, imagine trying to test individual functions in the planner without doing a complete planner run on a query.

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freeze avoidance of very large table.
On Mon, Aug 10, 2015 at 11:05 AM, Michael Paquier michael.paqu...@gmail.com wrote:
> On Mon, Aug 10, 2015 at 12:39 AM, Robert Haas robertmh...@gmail.com wrote:
>> On Thu, Aug 6, 2015 at 11:33 AM, Jim Nasby jim.na...@bluetreble.com wrote:
>>> They also provide a level of control over what is and isn't installed in
>>> a cluster. Personally, I'd prefer that most users not even be aware of the
>>> existence of things like pageinspect.
>>
>> +1. [...] Extensions are a useful packaging mechanism for functionality
>> that is useful but not required, and debugging facilities are definitely
>> very useful but should not be required.
>
> +1.

Sorry to come to the discussion late. I have encountered many cases where pg_stat_statements and pgstattuple are required in production, so I basically agree with moving such extensions into core. But IMO the diagnostic tools for the visibility map, heap (pageinspect) and so on are a kind of debugging tool.

Attached are the latest v11 patches, which are separated into 2 patches: a frozen-bit patch and a diagnostic-function patch. Moving the diagnostic functions into core is still under discussion, but this patch puts such functions into core because the diagnostic function for the visibility map needs to be in core to run the regression tests, at least.

Regards,
--
Masahiko Sawada

diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index 22c5f7a..b1b6a06 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -87,7 +87,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	 * If the page has only visible tuples, then we can find out the free
 	 * space from the FSM and move on.
*/ - if (visibilitymap_test(rel, blkno, &vmbuffer)) + if (visibilitymap_test(rel, blkno, &vmbuffer, VISIBILITYMAP_ALL_VISIBLE)) { freespace = GetRecordedFreeSpace(rel, blkno); stat->tuple_len += BLCKSZ - freespace; diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 3701d8e..dabd632 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -2176,8 +2176,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, CheckForSerializableConflictIn(relation, NULL, InvalidBuffer); /* - * Find buffer to insert this tuple into. If the page is all visible, - * this will also pin the requisite visibility map page. + * Find buffer to insert this tuple into. If the page is all visible + * or all frozen, this will also pin the requisite visibility map and + * frozen map page. */ buffer = RelationGetBufferForTuple(relation, heaptup->t_len, InvalidBuffer, options, bistate, @@ -2192,7 +2193,11 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, if (PageIsAllVisible(BufferGetPage(buffer))) { all_visible_cleared = true; + + /* all-frozen information is also cleared at the same time */ PageClearAllVisible(BufferGetPage(buffer)); + PageClearAllFrozen(BufferGetPage(buffer)); + visibilitymap_clear(relation, ItemPointerGetBlockNumber(&(heaptup->t_self)), vmbuffer); @@ -2493,7 +2498,11 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples, if (PageIsAllVisible(page)) { all_visible_cleared = true; + + /* all-frozen information is also cleared at the same time */ PageClearAllVisible(page); + PageClearAllFrozen(page); + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), vmbuffer); @@ -2776,9 +2785,9 @@ heap_delete(Relation relation, ItemPointer tid, /* * If we didn't pin the visibility map page and the page has become all - * visible while we were busy locking the buffer, we'll have to unlock and - * re-lock, to avoid holding the buffer lock across an I/O.
That's a bit - * unfortunate, but hopefully shouldn't happen often. + * visible or all frozen while we were busy locking the buffer, we'll + * have to unlock and re-lock, to avoid holding the buffer lock across an + * I/O. That's a bit unfortunate, but hopefully shouldn't happen often. */ if (vmbuffer == InvalidBuffer && PageIsAllVisible(page)) { @@ -2972,10 +2981,15 @@ l1: */ PageSetPrunable(page, xid); + /* clear PD_ALL_VISIBLE and PD_ALL_FROZEN flags */ if (PageIsAllVisible(page)) { all_visible_cleared = true; + + /* all-frozen information is also cleared at the same time */ PageClearAllVisible(page); + PageClearAllFrozen(page); + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), vmbuffer); } @@ -3254,7 +3268,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup, * in the middle of changing this, so we'll need to recheck after we have * the lock. */ - if (PageIsAllVisible(page)) + if (PageIsAllVisible(page) || PageIsAllFrozen(page)) visibilitymap_pin(relation, block, &vmbuffer); LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); @@ -3850,14 +3864,22 @@ l2: if (PageIsAllVisible(BufferGetPage(buffer))) { all_visible_cleared = true; + +
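The clearing logic repeated throughout the patch above maintains one invariant: a page can only be marked all-frozen if it is also all-visible, so any operation that clears visibility must clear both flags together. A minimal standalone sketch of that invariant (the flag values here are illustrative stand-ins, not PostgreSQL's actual page-header bits):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for the PD_ALL_VISIBLE / PD_ALL_FROZEN page flags. */
#define ALL_VISIBLE 0x01u
#define ALL_FROZEN  0x02u

/* Any operation that clears visibility (insert, update, delete) clears
 * both bits, mirroring PageClearAllVisible + PageClearAllFrozen above. */
static uint8_t page_clear_visibility(uint8_t flags)
{
    return flags & (uint8_t) ~(ALL_VISIBLE | ALL_FROZEN);
}

/* Marking a page all-frozen implies it is all-visible. */
static uint8_t page_set_all_frozen(uint8_t flags)
{
    return flags | ALL_VISIBLE | ALL_FROZEN;
}
```

Keeping the two transitions paired like this is what lets readers trust that an all-frozen bit never survives on a page with new (unfrozen) tuples.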
Re: [HACKERS] missing documentation for partial WAL files
* Robert Haas (robertmh...@gmail.com) wrote: On Mon, Aug 17, 2015 at 2:50 PM, Peter Eisentraut pete...@gmx.net wrote: The commit message for de76884 contains some important information about the purpose and use of the new .partial WAL files. But I don't see anything about this in the documentation or another user-visible place. We should probably add something. +1. Heikki talked about this at his PGCon presentation, and I thought that was pretty helpful stuff, but not everyone was there (or will remember what he said when the time comes that they need to know). Hopefully the audio or video of it will be posted eventually.. I was hoping to review it myself and to discuss the points he made in depth with David, to make sure we cover all of them (pretty sure we do, but good to verify). The link to the slides, at least, is here: http://www.pgcon.org/2015/schedule/attachments/379_PGCon2015-Warm-standby-done-right.pdf Thanks! Stephen
Re: [HACKERS] Error message with plpgsql CONTINUE
Hi 2015-08-17 23:46 GMT+02:00 Tom Lane t...@sss.pgh.pa.us: Jim Nasby jim.na...@bluetreble.com writes: On 8/17/15 9:48 AM, Tom Lane wrote: I'm inclined to think that if we wanted to make this better, the way to improve it would be to detect the error *at compile time*, and get rid of this hack in plpgsql_exec_function altogether. So split PLPGSQL_NSTYPE_LABEL into PLPGSQL_NSTYPE_BLOCK_LABEL and PLPGSQL_NSTYPE_LOOP_LABEL, and split opt_block_label and opt_label the same way? I think using two NSTYPE codes would probably be a pain because there are numerous places that don't care about the distinction; it'd be better to have a secondary attribute distinguishing these cases. (It looks like you could perhaps reuse the itemno field for the purpose, since that seems to be going unused in LABEL items.) You likely do need to split opt_block_label into two productions, since that will be the easiest way to pass forward the knowledge of whether it's being called from a loop or non-loop construct. When I implemented this check in plpgsql_check I found another minor issue in the CONTINUE statement - the typename is wrong. Regards Pavel regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers diff --git a/src/pl/plpgsql/src/pl_funcs.c b/src/pl/plpgsql/src/pl_funcs.c new file mode 100644 index 7b26970..7603441 *** a/src/pl/plpgsql/src/pl_funcs.c --- b/src/pl/plpgsql/src/pl_funcs.c *** plpgsql_stmt_typename(PLpgSQL_stmt *stmt *** 235,241 case PLPGSQL_STMT_FOREACH_A: return _("FOREACH over array"); case PLPGSQL_STMT_EXIT: ! return "EXIT"; case PLPGSQL_STMT_RETURN: return "RETURN"; case PLPGSQL_STMT_RETURN_NEXT: --- 235,241 case PLPGSQL_STMT_FOREACH_A: return _("FOREACH over array"); case PLPGSQL_STMT_EXIT: ! return ((PLpgSQL_stmt_exit *) stmt)->is_exit ? "EXIT" : "CONTINUE"; case PLPGSQL_STMT_RETURN: return "RETURN"; case PLPGSQL_STMT_RETURN_NEXT: -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning
great :-) 2. Creating a partition of a partitioned table CREATE TABLE table_name PARTITION OF partitioned_table_name FOR VALUES values_spec; Where values_spec is: listvalues: [IN] (val1, ...) Would it make sense to allow one complementary partition to the listvalues? listvalues: [[NOT] IN] (val1, ...) I've thought a few times about moving data with some of the most common values to dedicated partitions and keeping the rest in a separate one... best regards, Marc Mamin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
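The complementary partition idea amounts to a routing rule: a row goes to the partition whose IN list contains its key value, and to the single [NOT IN] partition otherwise. A hypothetical sketch of such a tuple router (one value per list partition for brevity; names are illustrative, not PostgreSQL internals):

```c
#include <assert.h>

/* Hypothetical tuple router: returns the index of the list partition
 * whose value matches v, or complement_idx (the "[NOT IN]" partition)
 * when no list partition claims the value. */
static int route_list_value(int v, const int *part_values, int nparts,
                            int complement_idx)
{
    for (int i = 0; i < nparts; i++)
        if (part_values[i] == v)
            return i;
    return complement_idx;
}
```

With a complement partition the router never fails to find a destination, which is exactly the appeal for "most common values get their own partition, the rest share one".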
Re: [HACKERS] missing documentation for partial WAL files
* Stephen Frost (sfr...@snowman.net) wrote: * Robert Haas (robertmh...@gmail.com) wrote: On Mon, Aug 17, 2015 at 2:50 PM, Peter Eisentraut pete...@gmx.net wrote: The commit message for de76884 contains some important information about the purpose and use of the new .partial WAL files. But I don't see anything about this in the documentation or another user-visible place. We should probably add something. +1. Heikki talked about this at his PGCon presentation, and I thought that was pretty helpful stuff, but not everyone was there (or will remember what he said when the time comes that they need to know). Hopefully the audio or video of it will be posted eventually.. I was hoping to review it myself and to discuss the points he made in depth with David, to make sure we cover all of them (pretty sure we do, but good to verify). The link to the slides, at least, is here: http://www.pgcon.org/2015/schedule/attachments/379_PGCon2015-Warm-standby-done-right.pdf Dan corrected me, it's been posted already, here: https://www.youtube.com/watch?v=mlQ90MntJwM&list=PLWW0CjV-TafZo4lBWuzw7OYJY7Y4SW76B&index=17 Thanks! Stephen
Re: [HACKERS] Autonomous Transaction is back
On Tue, Aug 18, 2015 at 8:17 AM, Robert Haas robertmh...@gmail.com wrote: On Sat, Aug 15, 2015 at 6:47 PM, Noah Misch n...@leadboat.com wrote: CREATE TABLE t (c) AS SELECT 1; BEGIN; UPDATE t SET c = 2 WHERE c = 1; BEGIN_AUTONOMOUS; UPDATE t SET c = 3 WHERE c = 1; UPDATE t SET c = 4 WHERE c = 2; COMMIT_AUTONOMOUS; ROLLBACK; If you replace the autonomous transaction with a savepoint, the c=3 update finds no rows, and the c=4 update changes one row. When the outer transaction aborts, only the original c=1 row remains live. If you replace the autonomous transaction with a dblink/pg_background call, the c=3 update waits indefinitely for c=2 to commit or abort, an undetected deadlock. Suppose you make the autonomous transaction see tuples like in the savepoint case. The c=3 update finds no rows to update, and the c=4 update changes one row. When the outer transaction aborts, you have two live rows (c=1 and c=4). Suppose you instead make the autonomous transaction see tuples like in the dblink case, yet let c=3 ignore the lock and change a row. If both the autonomous transaction and the outer transaction were to commit, then you get two live rows (c=2 and c=3). Neither of those outcomes is acceptable, of course. In today's tuple update rules, c=3 must deadlock[1]. Other credible tuple update rules may not have this problem, but nothing jumps to mind. [1] That's not to say it must use the shmem lock structures and deadlock detector. This footnote goes to my point. It seems clear to me that having the autonomous transaction see the effects of the outer uncommitted transaction is a recipe for trouble. If the autonomous transaction updates a row and commits, and the outer transaction later aborts, the resulting state is inconsistent with any serial history. I'm fairly certain that's going to leave us in an unhappy place. Even more obviously, ending up with two committed row versions that are both updates of a single ancestor version is no good. 
So, I agree that this scenario should be an error. What I don't agree with is the idea that it should be the deadlock detector's job to throw that error. Rather, I think that when we examine the xmax of the tuple we can see - which is the original one, not the one updated by the outer transaction - we should check whether that XID belongs to an outer transaction. Hm: do you mean 'an' outer transaction (meaning, basically, any in progress transaction) or the outer transaction of the AT. I think you mean outer transaction of the AT, which makes a lot of sense and should be easy and fast to test. It's like an implied NOWAIT if the locker is the AT and the lockee is the parent. Can you get away with only looking at tuples though? For example, what about advisory locks? Table locks? merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
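Robert's suggestion - checking whether the tuple's locker is an outer transaction of the autonomous transaction (AT), and throwing an error immediately rather than leaving the case to the deadlock detector - can be sketched like this (names are hypothetical, not actual PostgreSQL internals):

```c
#include <assert.h>

typedef enum { TUPLE_WAIT, TUPLE_ERROR } TupleLockAction;

/* If the xmax we see belongs to one of our own (suspended) outer
 * transactions, waiting can never succeed: the outer transaction cannot
 * commit or abort until the AT finishes.  So raise an error instead of
 * waiting - an implied NOWAIT when the locker is our parent. */
static TupleLockAction decide_tuple_lock(unsigned locker_xid,
                                         const unsigned *outer_xids,
                                         int n_outer)
{
    for (int i = 0; i < n_outer; i++)
        if (outer_xids[i] == locker_xid)
            return TUPLE_ERROR;
    return TUPLE_WAIT;   /* unrelated transaction: wait normally */
}
```

As Merlin notes, the open question is whether a tuple-level check like this suffices, or whether table locks and advisory locks need the same parent-vs-child test.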
Re: [HACKERS] Proposal: Implement failover on libpq connect level.
On Tue, Aug 18, 2015 at 12:53 PM, Jaime Casanova jaime.casan...@2ndquadrant.com wrote: This is not completely true, you can always use something like pgbouncer or other middleware to change the server to which clients connect. you still need to solve the fact that you will have a read-only server at the other side. something like repmgr + pgbouncer will work fine. Sure, but pgbouncer is an extra hop, and has its own foibles. There's real appeal to doing this in the client. i agree that once/if we ever have multimaster included then this could be a good idea I think it has a lot of appeal *now*. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Declarative partitioning
Amit, I would like to propose $SUBJECT for this development cycle. Attached is a WIP patch that implements most if not all of what's described below. Some yet unaddressed parts are mentioned below, too. I'll add this to the CF-SEP. First of all, wow! Really happy to see this. Syntax == 1. Creating a partitioned table CREATE TABLE table_name PARTITION BY {RANGE|LIST} ON (column_list); Where column_list consists of simple column names or expressions: PARTITION BY LIST ON (name) PARTITION BY RANGE ON (year, month) PARTITION BY LIST ON ((lower(left(name, 2)))) PARTITION BY RANGE ON ((extract(year from d)), (extract(month from d))) So far so good. Have you given any thought as to how a user will determine which partition corresponds to which values (for purposes of dropping/maintaining a partition)? Also, won't doing things like extract() for range partitions make it much harder for you to develop the planner parts of this solution? What about defining an interval instead, such as: PARTITION BY RANGE USING ( interval ) ON ( column ); i.e. PARTITION BY RANGE USING ( INTERVAL '1 month' ) ON ( submitted_date ); PARTITION BY RANGE USING ( 10 ) ON ( user_id ); This would make it easy for you to construct range type values defining the range of each partition, which would then make the planner work much easier than calling a function would, no? Or am I misunderstanding how you're using ranges here? It kind of seems like you're still leaving specific range definitions up to the user, which is (from my perspective) unsatisfactory (see below). I'm assuming that all range partitions will be [ closed, open ) ranges. 2. Creating a partition of a partitioned table CREATE TABLE table_name PARTITION OF partitioned_table_name FOR VALUES values_spec; Where values_spec is: listvalues: [IN] (val1, ...) rangevalues: START (col1min, ... ) END (col1max, ... ) | START (col1min, ... ) | END (col1max, ...
) So, one thing I missed in here is anything about automated partitioning of tables; that is, creating new partitions based on incoming data or a simple statement which doesn't require knowledge of the partitioning scheme. It's possible (and entirely acceptable) that you're considering automated partition creation outside of the scope of this patch. However, for range partitions, it would be *really* useful to have this syntax: CREATE NEXT PARTITION ON parent_table; Which would just create the next partition based on whatever the range partitioning scheme is, instead of requiring the user to calculate start and end values which might or might not match the parent partitioning scheme, and might leave gaps. Also this would be useful for range partitions: CREATE PARTITION ON parent_table USING ( start_value ); ... where start_value is the start range of the new partition. Again, easier for users to get correct. Both of these require the idea of regular intervals for range partitions, that is, on a table partitioned by month on a timestamptz column, each partition will have the range [ month:1, nextmonth:1 ). This is the most common use-case for range partitions (like, 95% of all partitioning cases I've seen), so a new partitioning scheme ought to address it. While there are certainly users who desire the ability to define arbitrary ranges for each range partition, these are by far the minority and could be accommodated by a different path with more complex syntax. Further, I'd wager that most users who want to define arbitrary ranges for range partitions aren't going to be satisfied with the other restrictions on declarative partitioning (e.g. same constraints, columns for all partitions) and are going to use inheritance partitioning anyway. 5. Detach partition ALTER TABLE partitioned_table DETACH PARTITION partition_name [USING table_name] This removes partition_name as a partition of partitioned_table.
The table continues to exist with the same name or 'table_name', if specified. pg_class.relispartition is set to false for the table, so it behaves like a normal table. What about DROPping partitions? Do they need to be detached first? Creating index on parent is not allowed. They should be defined on (leaf) partitions. Because of this limitation, primary keys are not allowed on a partitioned table. Perhaps, we should be able to just create a dummy entry somewhere to represent an index on parent (which every partition then copies.) This would be preferable, yes. Making users remember to manually create indexes on each partition is undesirable. What should TRUNCATE on partitioned table do? On the master table? Truncate all individual partitions. Do not drop the partitions. On a partition? Truncate just that partition. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription:
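The CREATE NEXT PARTITION idea above relies on regular intervals: given the partitioning width and the END bound of the last partition, the next [start, end) range is fully determined, so the user never computes bounds and no gaps can appear. A sketch with integer keys, as in the hypothetical `USING ( 10 ) ON ( user_id )` example:

```c
#include <assert.h>

typedef struct { long start; long end; } PartBounds;   /* [start, end) */

/* Hypothetical helper behind "CREATE NEXT PARTITION": the new partition
 * starts exactly where the previous one ended, so consecutive partitions
 * tile the key space with no gaps and no overlaps. */
static PartBounds next_partition_bounds(long last_end, long width)
{
    PartBounds b;
    b.start = last_end;
    b.end = last_end + width;
    return b;
}
```

The same calculation works for date/timestamp keys with an interval width; the point is that only the parent's stored interval, not user arithmetic, drives the bounds.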
Re: [HACKERS] allowing wal_level change at run time
On 2015-08-18 13:24:54 -0400, Peter Eisentraut wrote: On 8/18/15 12:35 PM, Robert Haas wrote: If archive_mode=on or max_wal_senders > 0, then we need at least wal_level=archive. Otherwise wal_level=minimal is enough. Totally forgot about max_wal_senders. However, the thread I linked to earlier aimed for a different master plan (or if not, I'm aiming for it now). There is camp 1, which wants to keep all the defaults the same, for performance or something like that. And there is camp 2, which wants to have a replication-friendly setup by default. Instead of fighting over this, your idea was to be able to switch between 1 and 2 easily (which means in particular without restarts). I don't think not requiring restarts is sufficient; having to twiddle a bunch of parameters manually still is a lot more effort than people see as necessary. The only reason I think changing the default is a good approach is that it's doable within a relatively short amount of time. But if we tie the effective wal_level to archive_mode or max_wal_senders, both of which are restart-only, then we haven't gained anything. (We would have removed half a GUC parameter, effectively. Not bad, but not very exciting.) ISTM that it's not too hard to a) make archive_mode PGC_SIGHUP b) make wal_level PGC_SIGHUP c) automatically increase wal_level to logical whenever a logical replication slot is defined it seems considerably harder to d) make max_wal_senders PGC_SIGHUP e) make max_replication_slots PGC_SIGHUP because they need shmem, locks, and everything. Therefore I propose something slightly different: We change the default of max_wal_senders, max_replication_slots, to some reasonably high setting and make wal_level an internal automagically determined variable. archive_mode = on automatically increases the level to what's now hot-standby. To deal with streaming replication, we automatically create slots for replicas, but, by default, *without* having them reserve WAL.
The slot name would be determined in some automatic fashion (uuid or something) and stored in a new file in the data directory. That allows us to increase the internal wal_level to hot_standby/archive whenever a replica has connected (and thus a physical slot exists), and to logical whenever a logical slot exists. Now, that'll mean that the wal_level doesn't decrease automatically until a slot has been dropped. But that seems relatively harmless if it's not reserving WAL. Too crazy? Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
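Andres's proposal makes wal_level a derived value rather than a user-set GUC: archive_mode and any existing physical slot push it up to (what is now) hot_standby, and any logical slot pushes it to logical, dropping back only once the slots are gone. The ratchet could look roughly like this sketch (not the actual patch):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { WAL_MINIMAL, WAL_ARCHIVE, WAL_HOT_STANDBY, WAL_LOGICAL } WalLevel;

/* Internally determined wal_level: never set directly, only ratcheted
 * up by the features that need it, and recomputed when slots change. */
static WalLevel effective_wal_level(bool archive_mode,
                                    int n_physical_slots,
                                    int n_logical_slots)
{
    WalLevel level = WAL_MINIMAL;
    if (archive_mode || n_physical_slots > 0)
        level = WAL_HOT_STANDBY;
    if (n_logical_slots > 0)
        level = WAL_LOGICAL;
    return level;
}
```

The auto-created, non-WAL-reserving slots exist precisely so that `n_physical_slots` stays above zero for known replicas, keeping the derived level high without retaining WAL.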
Re: [HACKERS] Declarative partitioning
On Tue, Aug 18, 2015 at 6:30 AM, Amit Langote langote_amit...@lab.ntt.co.jp wrote: Hi, I would like to propose $SUBJECT for this development cycle. Attached is a WIP patch that implements most if not all of what's described below. Some yet unaddressed parts are mentioned below, too. I'll add this to the CF-SEP. Syntax == 1. Creating a partitioned table CREATE TABLE table_name PARTITION BY {RANGE|LIST} ON (column_list); Where column_list consists of simple column names or expressions: PARTITION BY LIST ON (name) PARTITION BY RANGE ON (year, month) PARTITION BY LIST ON ((lower(left(name, 2)))) PARTITION BY RANGE ON ((extract(year from d)), (extract(month from d))) Note: LIST partition key supports only one column. For each column, you could write operator class name: PARTITION BY LIST/RANGE ON (colname [USING] opclass_name), If not specified, the default btree operator class based on type of each key column is used. If none of the available btree operator classes are compatible with the partitioning strategy (list/range), an error is thrown. Built-in btree operator classes cover a good number of types for list and range partitioning in practical scenarios. A table created using this form is of proposed new relkind RELKIND_PARTITIONED_REL. An entry in pg_partitioned_rel (see below) is created to store partition key info. Note: A table cannot be partitioned after-the-fact using ALTER TABLE. Normal dependencies are created between the partitioned table and operator classes, objects in partition expressions like functions. 2. Creating a partition of a partitioned table CREATE TABLE table_name PARTITION OF partitioned_table_name FOR VALUES values_spec; Where values_spec is: listvalues: [IN] (val1, ...) rangevalues: START (col1min, ... ) END (col1max, ... ) | START (col1min, ... ) | END (col1max, ... ) A table created using this form has proposed pg_class.relispartition set to true. An entry in pg_partition (see below) is created to store the partition bound info.
The values_spec should match the partitioning strategy of the partitioned table. In case of a range partition, the values in START and/or END should match columns in the partition key. Defining a list partition is fairly straightforward - just spell out the list of comma-separated values. An error is thrown if the list of values overlaps with one of the existing partitions' list. CREATE TABLE persons_by_state (name text, state text) PARTITION BY LIST ON (state); CREATE TABLE persons_IL PARTITION OF persons_by_state FOR VALUES IN ('IL'); CREATE TABLE persons_fail PARTITION OF persons_by_state FOR VALUES IN ('IL'); ERROR: cannot create partition that overlaps with an existing one For a range partition, there is more than one way: Specify both START and END bounds: the resulting range should not overlap with the range(s) covered by existing partitions. An error is thrown otherwise. Although rare in practice, gaps between ranges are OK. CREATE TABLE measurement(logdate date NOT NULL) PARTITION BY RANGE ON (logdate); CREATE TABLE measurement_y2006m02 PARTITION OF measurement FOR VALUES START ('2006-02-01') END ('2006-03-01'); --success CREATE TABLE measurement_fail PARTITION OF measurement FOR VALUES START ('2006-02-15') END ('2006-03-01'); ERROR: cannot create partition that overlaps with an existing one Specify only the START bound: add the partition on the left of some range covered by existing partitions provided no overlap occurs (also considering gaps between ranges, if any). If no such range exists, the new partition will cover the range [START, +INFINITY) and become the rightmost partition. An error is thrown if the specified START causes overlap.
CREATE TABLE measurement_y2006m01 PARTITION OF measurement FOR VALUES START ('2006-01-01'); --success CREATE TABLE measurement_fail PARTITION OF measurement FOR VALUES START ('2006-02-01'); --overlaps with measurement_y2006m02 ERROR: cannot create partition that overlaps with an existing one Specify only the END bound: add the partition on the right of some range covered by existing partitions provided no overlap occurs (also considering gaps between ranges, if any). If no such range exists, the new partition would cover the range (-INFINITY, END) and become the leftmost partition. Error is thrown if the specified END causes overlap. CREATE TABLE measurement_y2006m03 PARTITION OF measurement FOR VALUES END ('2006-04-01'); --success CREATE TABLE measurement_fail PARTITION OF measurement FOR VALUES END ('2006-03-01'); --overlaps with measurement_y2006m02 ERROR: cannot create partition that overlaps with an existing one For each partition, START and END bound values are stored in the catalog. Note that the lower bound is inclusive, whereas the upper bound is exclusive. Note: At most one range partition can have null min bound in which case it covers the range
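Since every range partition is a half-open [START, END) interval, with a missing bound standing for -infinity or +infinity, the overlap check in all three cases above reduces to a single comparison. A sketch using sentinel values for the open bounds (illustrative, not the catalog representation):

```c
#include <assert.h>
#include <stdbool.h>
#include <limits.h>

/* [start, end) with LONG_MIN / LONG_MAX standing in for a missing
 * START (-infinity) or END (+infinity) bound. */
typedef struct { long start; long end; } PartRange;

/* Two half-open ranges overlap iff each starts before the other ends;
 * this is what "cannot create partition that overlaps" must detect. */
static bool ranges_overlap(PartRange a, PartRange b)
{
    return a.start < b.end && b.start < a.end;
}
```

Encoding dates as integers (e.g. 20060201 for 2006-02-01), the failing example above - [2006-02-15, 2006-03-01) against the existing [2006-02-01, 2006-03-01) - overlaps, while adjacent months do not, since a shared boundary value belongs only to the right-hand partition.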
Re: [HACKERS] WIP: SCRAM authentication
* Robert Haas (robertmh...@gmail.com) wrote: On Tue, Aug 18, 2015 at 2:07 PM, Stephen Frost sfr...@snowman.net wrote: SCRAM itself, as has been discussed, supports multiple password verifiers- that's a specific case all by itself, and it's done specifically to address the issue that one or another of the algorithms used is compromised, or that a new algorithm becomes available which is better. AD and Kerberos support multiple password verifiers because of this and that it allows you to migrate from one to the next without having to do wholesale replacment across all systems involved. I bring them up as examples of the advances in password-based authentication which we've missed and because they are what users expect from current password-based authentication systems, not because we support them and therefore should just push everyone to them. OK, that's an interesting argument. If SCRAM supports multiple password verifiers, and we support SCRAM, then I guess we should probably do that, too. I still don't like it all that much, though. I think it's absolutely inevitable that people are going to end up with an account with 3 or more different passwords that can all be used to log into it, and that won't be good. How do other systems avoid this pitfall? They provide: a) ability to configure which algorithms are used at change-password MIT Kerberos, kdc.conf: supported_enctypes and kadmin: enctype-salttype b) ability to configure which algorithms are allowed to be used at all MIT Kerberos, libdefaults: default_tgs_enctypes, default_tkt_enctypes, and permitted_enctypes, along with allow_weak_crypto which removes all those considered 'weak' from the other sets if set to false (the default). c) password aging (to get users to change their password regularly, so you know after, say, 90 days of the change to the set configured in 'a' that all users will have either expired passwords or have been updated to the latest set). 
Note that we don't need quite so many options; these were done over time for Kerberos and also address the different types of tickets involved in that system. Today, they discourage explicitly setting default_tgs_enctypes and default_tkt_enctypes. We really just need to allow configuration of which should be stored at change-password time and the set of allowed types. I wouldn't want those to be the same because you'd want to remove a compromised type as soon as possible from the change-password set while still allowing those users to connect, to get them to change their password using self-service (doing it all through the administrators of the system would be very painful in large organizations). We also don't strictly need password aging, but it certainly helps. We definitely do need to provide administrators with easy interfaces through functions and tools to manage the password verifiers used. Thanks! Stephen
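The two Kerberos-style sets Stephen describes - one controlling which verifier types are written at password change, a second controlling which are still accepted at login - can be modeled as simple bitmask policy checks. A sketch (flag values hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical verifier-type flags. */
#define VER_MD5    0x1u
#define VER_SCRAM  0x2u

/* Login succeeds as long as the account holds at least one verifier of
 * a still-permitted type.  Dropping a compromised type from the
 * change-password set first (while leaving it in permitted_set) lets
 * affected users log in once more and self-service a password change,
 * instead of being locked out immediately. */
static bool login_possible(unsigned account_verifiers, unsigned permitted_set)
{
    return (account_verifiers & permitted_set) != 0;
}
```

Only after a grace period (or password aging forces a change) would the compromised type be removed from the permitted set as well, at which point stale accounts fail closed.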
Re: [HACKERS] allowing wal_level change at run time
On 8/18/15 12:35 PM, Robert Haas wrote: If archive_mode=on or max_wal_senders > 0, then we need at least wal_level=archive. Otherwise wal_level=minimal is enough. Totally forgot about max_wal_senders. However, the thread I linked to earlier aimed for a different master plan (or if not, I'm aiming for it now). There is camp 1, which wants to keep all the defaults the same, for performance or something like that. And there is camp 2, which wants to have a replication-friendly setup by default. Instead of fighting over this, your idea was to be able to switch between 1 and 2 easily (which means in particular without restarts). But if we tie the effective wal_level to archive_mode or max_wal_senders, both of which are restart-only, then we haven't gained anything. (We would have removed half a GUC parameter, effectively. Not bad, but not very exciting.) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: SCRAM authentication
On Tue, Aug 18, 2015 at 2:07 PM, Stephen Frost sfr...@snowman.net wrote: I would expect there to be people who would run into pg_upgrade complaining, that's why there would be the check. That's actually a much better situation than what happened around standard_conforming_strings. Further, users would be able to continue with their existing version until they're ready to move or it goes out of support, by which time, if their connector isn't updated, they should be moving off of it also. We hear about people running 8.4 and older because of some application which was never maintained or updated, and that sucks, but it doesn't prevent us from making the changes we need to make to move the project forward for the users who properly manage their systems and use supported connectors. Sorry, that's a completely bogus argument. We do not need to prevent people from upgrading if they haven't moved off of MD5 authentication; that's just an arbitrary - and IMHO extremely user-hostile - policy decision. We have a legitimate need, to move the project forward, to introduce a better system for password authentication. Ripping out the old one is not a real need. I'm sure at some point it will seem like antiquated cruft that nobody uses any more, but that will not be in a year or two or anything close to that. SCRAM itself, as has been discussed, supports multiple password verifiers - that's a specific case all by itself, and it's done specifically to address the issue that one or another of the algorithms used is compromised, or that a new algorithm becomes available which is better. AD and Kerberos support multiple password verifiers because of this and that it allows you to migrate from one to the next without having to do wholesale replacement across all systems involved.
I bring them up as examples of the advances in password-based authentication which we've missed and because they are what users expect from current password-based authentication systems, not because we support them and therefore should just push everyone to them. OK, that's an interesting argument. If SCRAM supports multiple password verifiers, and we support SCRAM, then I guess we should probably do that, too. I still don't like it all that much, though. I think it's absolutely inevitable that people are going to end up with an account with 3 or more different passwords that can all be used to log into it, and that won't be good. How do other systems avoid this pitfall? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [patch] psql tab completion for grant execute
On Mon, Aug 17, 2015 at 5:07 PM, Daniel Verite dan...@manitou-mail.org wrote: When tab-completing after GRANT EXECUTE, currently psql injects PROCEDURE, rather than the expected ON. The code for completing with ON is there, but it's not reached due to falling earlier into another branch, one that handles CREATE TRIGGER. A trivial patch is attached. It adds the condition that if EXECUTE is preceded by GRANT itself preceded by nothing, then that completion with PROCEDURE is skipped. Thanks, I committed this. I don't think we usually back-patch tab completion fixes, but I back-patched this one to 9.5 anyway, so that it gets out into the wild a little sooner. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: SCRAM authentication
Robert, * Robert Haas (robertmh...@gmail.com) wrote: On Tue, Aug 18, 2015 at 10:06 AM, Stephen Frost sfr...@snowman.net wrote: That was the impetus for my earlier suggestion that in a release or two, we actively make pg_upgrade error (or perhaps warn first, then error, across two releases) if any of the old verifiers exist. I think there's basically no chance of that being acceptable. The time at which it's possible to get rid of MD5 is going to vary widely between installations. People who are using only libpq or libpq-derived connectors will be able to get rid of it almost immediately, if they want, though some won't. People who are using obscure connectors that are poorly maintained may not even have a version that supports SCRAM for 5 years. Think about how long it took us to roll out the standard_conforming_strings changes, and there were still people who got bitten. I would expect there to be people who would run into pg_upgrade complaining, that's why there would be the check. That's actually a much better situation than what happened around standard_conforming_strings. Further, users would be able to continue with their existing version until they're ready to move or it goes out of support, by which time, if their connector isn't updated, they should be moving off of it also. We hear about people running 8.4 and older because of some application which was never maintained or updated, and that sucks, but it doesn't prevent us from making the changes we need to make to move the project forward for the users who properly manage their systems and use supported connectors. The other concern with a single password verifier is that we're locking ourselves into a one-verifier-per-role solution when most of the serious solutions in use today (Active Directory, Kerberos, certificate based systems) allow for more than one. So what? If you want to delegate authentication to AD or Kerberos, we already support that. 
That's not a reason to invent the same functionality inside the server. If you've got a tangible plan, other than SCRAM, that would require us to support multiple verifiers, then please say what it is. If not, the mere fact that some other people support it is not a reason why we should. SCRAM itself, as has been discussed, supports multiple password verifiers - that's a specific case all by itself, and it's done specifically to address the issue that one or another of the algorithms used is compromised, or that a new algorithm becomes available which is better. AD and Kerberos support multiple password verifiers because of this, and because it allows you to migrate from one to the next without having to do wholesale replacement across all systems involved. I bring them up as examples of the advances in password-based authentication which we've missed and because they are what users expect from current password-based authentication systems, not because we support them and therefore should just push everyone to them. That said, if we aren't going to move password-based authentication in PG beyond the stone age it's currently in, then perhaps we should just drop it completely and force users to pick a solution which is well written, uses standard protocols, supports multiple ciphers and hashes (in case one or more become a problem) and is updated to keep up with the field. Obviously, that's a strawman and I appreciate that you are not arguing to keep the status quo, but I'm trying to make the point that multiple password verifiers are part of modern authentication technology, used by multiple standard and broadly installed solutions. In fact, we generally have taken the approach that needs which are already handled adequately by other tools do not need to also be handled inside the database. That's not a perfect approach and we always argue about it around the edges, but generally, I think it's served us pretty well. 
Unless we take my strawman literally, which I don't think we can, we can't simply punt on pushing all password-based authentication out of the server. If we agree that we need to support password-based authentication then we should be working to maintain it as well as we do the WAL and other subsystems. I fully agree that we should leverage existing systems, technologies, and protocols where possible, to avoid having to add a lot of extra code which we have to maintain. I believe that this proposal already meets that goal and we should move forward with full support for SCRAM. That, in my view at least, means that we also need to provide for multiple password verifiers. Thanks! Stephen signature.asc Description: Digital signature
[HACKERS] a few doubts around check_for_isn_and_int8_passing_mismatch
pg_upgrade's check.c contains a function called check_for_isn_and_int8_passing_mismatch. If the float8 pass-by-value values differ between the old and new clusters, which is likely to happen on Windows 64 because of cf376a4ad, and if the old cluster contains any databases which contain functions from '$libdir/isn', then it fails the upgrade. The comments say this: * contrib/isn relies on data type int8, and in 8.4 int8 can now be passed * by value. The schema dumps the CREATE TYPE PASSEDBYVALUE setting so * it must match for the old and new servers. There are a couple of things that bother me here: 1. The check is restricted to isn, but the issue isn't. Presumably, any datatype that is passed either by value or by reference depending on whether float8 is passed-by-value has the same issue. isn is the only module in core that has this issue, but there could be third-party modules with the same issue. I'm not sure there's anything we can do about that, but it's troublesome. 2. If pg_dump knew that we were trying to dump a type of this sort, it could easily emit an incantation that would work everywhere. Commit 3f936aacc introduced a LIKE = typename syntax that suits this problem perfectly, and changed the scripts for contrib/isn to use it. But pg_dump can't use that syntax, because there's no catalog state to record whether we want new typbyval = old typbyval or whether we instead want new typbyval = new cluster's float8-pass-by-value. Is it worth trying to fix this? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[HACKERS] Extension upgrade and GUCs
Hi hackers, I've been wrestling with this one for a while and gone down a couple blind alleys, so time to ask the experts. PostGIS has a couple things different from the extensions that live in contrib, - it has some GUCs - it has a versioned loadable library (postgis-2.1.so, postgis-2.2.so, etc) We've found that, when we run the following SQL,

CREATE EXTENSION postgis VERSION '2.1.9';
ALTER EXTENSION postgis UPDATE TO '2.2.0';

the update fails, because of a collision in the GUC. When the extension is CREATEd, postgis-2.1.so is loaded, _PG_init() is called, and that in turn calls DefineCustomStringVariable() to create our GUC. When the ALTER is called, the first time a C function definition is called, the new library, postgis-2.2.so, is loaded, the _PG_init() of *that* library is called, and DefineCustomStringVariable() is called, but this time it runs into the GUC definition from the first library load, and the EXTENSION update process stops as an ERROR is thrown. My initial attempt at avoiding this problem involved looking at GetConfigOption() before running DefineCustomStringVariable() to see if the GUC was already defined. This did not work, as it's possible to define a GUC before loading the library. So an otherwise innocent sequence of commands like:

SET my_guc = 'foo'; -- no library loaded yet
SELECT my_library_function(); -- causes library load and _PG_init() to fire

would now fail, as it hit my test for a pre-existing GUC. Unfortunately, the GUC we are using is not one where we simply read a value now and again. It switches the backend geometry library that various functions use, so performance is a big deal: instead of reading the GUC value, the code expects that GUC changes will flip a global variable, using the GUC assign callback. 
So I need a way to either (a) notice when I already have an (old) copy of the library loaded and avoid trying to set up the GUC in that case or (b) set up the GUC in a somewhat less brittle way than DefineCustomStringVariable() allows, something that can overwrite things instead of just erroring out. The ugly code in question is here https://github.com/postgis/postgis/blob/svn-trunk/postgis/lwgeom_backend_api.c#L105 Discussion is here https://trac.osgeo.org/postgis/ticket/2382 Thanks, P
Re: [HACKERS] WIP: SCRAM authentication
On Tue, Aug 18, 2015 at 7:19 PM, Robert Haas robertmh...@gmail.com wrote: Sorry, that's a completely bogus argument. We do not need to prevent people from upgrading if they haven't moved off of MD5 authentication; that's just an arbitrary - and IMHO extremely user-hostile - policy decision. We have a legitimate need, to move the project forward, to introduce a better system for password authentication. Ripping out the old one is not a real need. I'm sure at some point it will seem like antiquated cruft that nobody uses any more, but that will not be in a year or two or anything close to that. I would imagine a GUC like enable_legacy_authentication=true which we would just start defaulting to false after a few years and perhaps consider removing five years after that. pg_upgrade could check if there are any users which require it to be set to true and warn users that they must enable it but should check the documentation so they understand the impact. OK, that's an interesting argument. If SCRAM supports multiple password verifiers, and we support SCRAM, then I guess we should probably do that, too. I still don't like it all that much, though. I think it's absolutely inevitable that people are going to end up with an account with 3 or more different passwords that can all be used to log into it, and that won't be good. How do other systems avoid this pitfall? Fwiw having multiple passwords would make automated credential rotations *so* much easier. Heroku has a really baroque solution to this problem in Postgres involving creating new child roles and swapping them around. My team in Google wasted many man hours dealing with fallout from the quarterly password rotations. (I wish we could just drop the account management and authentication system entirely and dump the whole work on a system designed for this particular problem. 
It's a hard problem that's far outside our core features and Postgres is never going to be a good system for anyone let alone everyone when there are many different types of users.) -- greg
Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows
Kouhei Kaigai kai...@ak.jp.nec.com wrote:

	long		lbuckets;
	lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
	Assert(nbuckets > 0);

In my case, the hash_table_bytes was 101017630802, and bucket_size was 48. So, my_log2(hash_table_bytes / bucket_size) = 31, then lbuckets will have a negative number because both 1 and my_log2() are int32. So, Min(lbuckets, max_pointers) chooses 0x80000000, then it was set on the nbuckets and triggers the Assert(). Attached patch fixes the problem. So you changed the literal of 1 to 1U, but doesn't that just double the threshold for failure? Wouldn't 1L (to match the definition of lbuckets) be better? -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Add support for RADIUS passwords longer than 16 octets
On 2015-08-15 17:55, I wrote: The attached patch adds support for RADIUS passwords longer than 16 octets. Improved the coding and comments a bit, new version attached. .m

*** a/src/backend/libpq/auth.c
--- b/src/backend/libpq/auth.c
***************
*** 2168,2173 **** CheckCertAuth(Port *port)
--- 2168,2174 ----
  
  #define RADIUS_VECTOR_LENGTH 16
  #define RADIUS_HEADER_LENGTH 20
+ #define RADIUS_MAX_PASSWORD_LENGTH 128
  
  typedef struct
  {
***************
*** 2241,2247 **** CheckRADIUSAuth(Port *port)
  	radius_packet *receivepacket = (radius_packet *) receive_buffer;
  	int32		service = htonl(RADIUS_AUTHENTICATE_ONLY);
  	uint8	   *cryptvector;
! 	uint8		encryptedpassword[RADIUS_VECTOR_LENGTH];
  	int			packetlength;
  	pgsocket	sock;
--- 2242,2250 ----
  	radius_packet *receivepacket = (radius_packet *) receive_buffer;
  	int32		service = htonl(RADIUS_AUTHENTICATE_ONLY);
  	uint8	   *cryptvector;
! 	int			encryptedpasswordlen;
! 	uint8		encryptedpassword[RADIUS_MAX_PASSWORD_LENGTH];
! 	uint8	   *md5trailer;
  	int			packetlength;
  	pgsocket	sock;
***************
*** 2259,2264 **** CheckRADIUSAuth(Port *port)
--- 2262,2268 ----
  	fd_set		fdset;
  	struct timeval endtime;
  	int			i,
+ 				j,
  				r;
  
  	/* Make sure struct alignment is correct */
***************
*** 2316,2325 **** CheckRADIUSAuth(Port *port)
  		return STATUS_ERROR;
  	}
  
! 	if (strlen(passwd) > RADIUS_VECTOR_LENGTH)
  	{
  		ereport(LOG,
! 				(errmsg("RADIUS authentication does not support passwords longer than 16 characters")));
  		return STATUS_ERROR;
  	}
--- 2320,2329 ----
  		return STATUS_ERROR;
  	}
  
! 	if (strlen(passwd) > RADIUS_MAX_PASSWORD_LENGTH)
  	{
  		ereport(LOG,
! 				(errmsg("RADIUS authentication does not support passwords longer than %d characters", RADIUS_MAX_PASSWORD_LENGTH)));
  		return STATUS_ERROR;
  	}
***************
*** 2344,2371 **** CheckRADIUSAuth(Port *port)
  	radius_add_attribute(packet, RADIUS_NAS_IDENTIFIER, (unsigned char *) identifier, strlen(identifier));
  
  	/*
! 	 * RADIUS password attributes are calculated as: e[0] = p[0] XOR
! 	 * MD5(secret + vector)
  	 */
! 	cryptvector = palloc(RADIUS_VECTOR_LENGTH + strlen(port->hba->radiussecret));
  	memcpy(cryptvector, port->hba->radiussecret, strlen(port->hba->radiussecret));
! 	memcpy(cryptvector + strlen(port->hba->radiussecret), packet->vector, RADIUS_VECTOR_LENGTH);
! 	if (!pg_md5_binary(cryptvector, RADIUS_VECTOR_LENGTH + strlen(port->hba->radiussecret), encryptedpassword))
  	{
! 		ereport(LOG,
! 				(errmsg("could not perform MD5 encryption of password")));
! 		pfree(cryptvector);
! 		return STATUS_ERROR;
  	}
  	pfree(cryptvector);
  
! 	for (i = 0; i < RADIUS_VECTOR_LENGTH; i++)
! 	{
! 		if (i < strlen(passwd))
! 			encryptedpassword[i] = passwd[i] ^ encryptedpassword[i];
! 		else
! 			encryptedpassword[i] = '\0' ^ encryptedpassword[i];
! 	}
! 	radius_add_attribute(packet, RADIUS_PASSWORD, encryptedpassword, RADIUS_VECTOR_LENGTH);
  
  	/* Length need to be in network order on the wire */
  	packetlength = packet->length;
--- 2348,2390 ----
  	radius_add_attribute(packet, RADIUS_NAS_IDENTIFIER, (unsigned char *) identifier, strlen(identifier));
  
  	/*
! 	 * RADIUS password attributes are calculated as:
! 	 *   e[0] = p[0] XOR MD5(secret + Request Authenticator)
! 	 * for the first group of 16 octets, and then:
! 	 *   e[i] = p[i] XOR MD5(secret + e[i-1])
! 	 * for the following ones (if necessary)
  	 */
! 	encryptedpasswordlen = ((strlen(passwd) + RADIUS_VECTOR_LENGTH - 1) / RADIUS_VECTOR_LENGTH) * RADIUS_VECTOR_LENGTH;
! 	cryptvector = palloc(strlen(port->hba->radiussecret) + RADIUS_VECTOR_LENGTH);
  	memcpy(cryptvector, port->hba->radiussecret, strlen(port->hba->radiussecret));
! 
! 	/* for the first iteration, we use the Request Authenticator vector */
! 	md5trailer = packet->vector;
! 	for (i = 0; i < encryptedpasswordlen; i += RADIUS_VECTOR_LENGTH)
  	{
! 		memcpy(cryptvector + strlen(port->hba->radiussecret), md5trailer, RADIUS_VECTOR_LENGTH);
! 		/* .. and for subsequent iterations the result of the previous XOR (calculated below) */
! 		md5trailer = encryptedpassword + i;
! 
! 		if (!pg_md5_binary(cryptvector, strlen(port->hba->radiussecret) +
Re: [HACKERS] Test code is worth the space
On Tue, Aug 18, 2015 at 04:54:07PM +0100, Greg Stark wrote: On Tue, Aug 18, 2015 at 2:16 PM, David Fetter da...@fetter.org wrote: I'm given to understand that this tight coupling is necessary for performance. Are you saying that it could be unwound, or that testing strategies mostly need to take it into account, or...? I'm just saying that we shouldn't expect to find a magic bullet test framework that solves all these problems. Without restructuring code, which I don't think is really feasible, we won't be able to have good unit test coverage for most existing code. It might be more practical to start using such a new tool for new code only. Then the new code could be structured in ways that allow the environment to be mocked more easily and the results observed more easily. Great! Do we have examples of such tools and code bases structured to accommodate them that we'd like to use for reference, or at least for inspiration? Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] More WITH
Craig Ringer cr...@2ndquadrant.com writes: On 18 August 2015 at 01:18, David Fetter da...@fetter.org wrote: FETCH [in WITH] I'd be a huge fan of this one. I'd love to see FETCH in subqueries, too. Currently doing anything like this requires an ugly PL/PgSQL wrapper. The cursor would have to be known at plan-time so it could be interrogated for its types. That's barely the tip of the iceberg of the problems with this idea. How many rows would be fetched from the cursor? What row would it be left on? Whatever answer you give will be wrong from some perspective, but particularly that of giving the planner any freedom-of-action to optimize such a query. More generally, what would you hope to accomplish with such a construct that wouldn't be better done by writing the cursor's underlying query directly in the WITH clause? regards, tom lane
Re: [HACKERS] Proposal: Implement failover on libpq connect level.
I wonder how the extended protocol is handled by this proposal. Suppose load balancing mode is enabled. PQprepare is executed on standby1. Then PQexecPrepared gets called. This may be executed on standby2, which will fail because there's no prepared statement created by the former PQprepare call. Even if the simple protocol is used, the same thing can be said of SQL PREPARE/EXECUTE/DEALLOCATE. SQL BEGIN/COMMIT/ROLLBACK would be a more interesting example in load balancing mode. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
Re: [HACKERS] [DESIGN] ParallelAppend
On Thu, Aug 13, 2015 at 5:26 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: On Fri, Aug 7, 2015 at 2:15 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: Sure, that is what we should do, however the tricky part would be when the path for doing a local scan is extremely cheaper than the path for a parallel scan for one of the child nodes. For such cases, pulling up the Funnel node can incur more cost. I think some of the other possible ways to make this work could be to extend Funnel so that it is capable of executing both parallel and non-parallel nodes, or to have a new Funnel-like node which has such a capability. I think it is the job of a (more intelligent) planner, but not in the first version. If subplans of Append are a mixture of nodes which do or do not have worth of parallel execution, we will be able to arrange the original form:

Append
 + Scan on rel1 (large)
 + Scan on rel2 (large)
 + Scan on rel3 (middle)
 + Scan on rel4 (tiny)
 + Scan on rel5 (tiny)

to the Funnel-aware form, but partially:

Append
 + Funnel
 |  + Scan on rel1 (large)
 |  + Scan on rel2 (large)
 |  + Scan on rel3 (large)
 + Scan on rel4 (tiny)
 + Scan on rel5 (tiny)

This is exactly what I have in mind. Here is one other issue I found. Existing code assumes a TOC segment has only one contents per node type, so it uses a pre-defined key (like PARALLEL_KEY_SCAN) per node type; however, it is problematic if we put multiple PlannedStmt or PartialSeqScan nodes on a TOC segment. We have a few keys in the parallel-seq-scan patch (PARALLEL_KEY_TUPLE_QUEUE and PARALLEL_KEY_INST_INFO) for which multiple structures are shared between master and worker backends. Check if something similar can work for your use case. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows
I reported a similar issue before. * Re: DBT-3 with SF=20 got failed http://www.postgresql.org/message-id/557a19d1.9050...@2ndquadrant.com I didn't get a server crash at that time, however, palloc() complained about request size >= 1GB. So, we may need a couple of overhauls around HashJoin to support large sizes of data, not only nbuckets around 0x80000000. Also, we may need to pay attention to the reliability of scale estimation by the planner. Even though the plan below says that the Join generates 60521928028 rows, it actually generates 776157676 rows (1.28%).

tpcds100=# EXPLAIN ANALYZE select ws1.ws_order_number,ws1.ws_warehouse_sk wh1,ws2.ws_warehouse_sk wh2 from web_sales ws1,web_sales ws2 where ws1.ws_order_number = ws2.ws_order_number and ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk;
                              QUERY PLAN
 Merge Join  (cost=25374644.08..1160509591.61 rows=60521928028 width=24) (actual time=138347.979..491889.343 rows=776157676 loops=1)
   Merge Cond: (ws1.ws_order_number = ws2.ws_order_number)
   Join Filter: (ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
   Rows Removed by Join Filter: 127853313
   ->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16) (actual time=73252.300..79017.420 rows=72001237 loops=1)
         Sort Key: ws1.ws_order_number
         Sort Method: quicksort  Memory: 7083296kB
         ->  Seq Scan on web_sales ws1  (cost=0.00..3290612.48 rows=72001248 width=16) (actual time=0.023..39951.201 rows=72001237 loops=1)
   ->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16) (actual time=65095.655..128885.811 rows=904010978 loops=1)
         Sort Key: ws2.ws_order_number
         Sort Method: quicksort  Memory: 7083296kB
         ->  Seq Scan on web_sales ws2  (cost=0.00..3290612.48 rows=72001248 width=16) (actual time=0.014..31046.888 rows=72001237 loops=1)
 Planning time: 0.232 ms
 Execution time: 530176.521 ms
(14 rows)

So, even if we allow nodeHash.c to allocate hash buckets larger than 1GB, its initial size may be determined carefully. Probably, 1GB is a good starting point even if expanded later. 
Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com

-----Original Message----- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of David Rowley Sent: Wednesday, August 19, 2015 10:07 AM To: Tom Lane Cc: Kevin Grittner; Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows

On 19 August 2015 at 12:38, Tom Lane t...@sss.pgh.pa.us wrote: David Rowley david.row...@2ndquadrant.com writes: david=# set work_mem = '94GB'; ERROR: 98566144 is outside the valid range for parameter work_mem (64 .. 2097151) Apparently you're testing on a 32-bit server. 64-bit servers allow work_mem to go up to INT_MAX kilobytes. hmm, no, but it does appear that sizeof(long) is 4 bytes for me, despite a 64 bit OS. I guess this accounts for my misunderstanding that work_mem was limited to 2GB even on 64 bit machines. From guc.h:

#if SIZEOF_SIZE_T > 4 && SIZEOF_LONG > 4
#define MAX_KILOBYTES INT_MAX
#else
#define MAX_KILOBYTES (INT_MAX / 1024)
#endif

Apologies for the noise.
Re: [HACKERS] checkpointer continuous flushing
On Tue, Aug 18, 2015 at 12:38 PM, Fabien COELHO coe...@cri.ensmp.fr wrote: Hello Amit, So the option is best kept as off for now, without further data, I'm fine with that. One point to think about here is on what basis a user can decide to turn this option on; is it predictable in any way? I think one case could be when the data set fits in shared_buffers. Yep. In general, providing an option is a good idea if the user can decide with ease when to use that option, or we can give some clear recommendation for the same; otherwise one has to recommend "test your workload with this option and if it works then great, else don't use it", which might also be okay in some cases, but it is better to be clear. My opinion, which is not backed by any data (anyone can feel free to provide a FreeBSD box for testing...) is that it would mostly be an improvement, if you have a significant write load, to have the flush option on when running on non-Linux systems which provide posix_fadvise. If you have a lot of reads and few writes, then postgresql currently works reasonably enough, which is why people do not complain too much about write stalls, and I expect that the situation would not be significantly degraded. Now there are competing positive and negative effects induced by using posix_fadvise, and moreover its implementation varies from OS to OS, so without running some experiments it is hard to be definite. Sure, I think what can help here is a testcase or two (in the form of a script file or some other form, to test this behaviour of the patch) which you can write and post here, so that others can use that to get the data and share it. Of course, that is not mandatory to proceed with this patch, but it can still help you to prove your point, as you might not have access to different kinds of systems to run the tests. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Proposal: Implement failover on libpq connect level.
On Tue, Aug 18, 2015 at 9:48 AM, Victor Wagner vi...@wagner.pp.ru wrote: Behavior: If PQconnectdb encounters a connect string with multiple hosts specified, it attempts to establish a connection with all these hosts simultaneously, and begins to work with the server which responds first, unless the loadbalancing parameter is true. I think here you are mixing the behaviour for a load balancing solution and a failover solution. It seems to me that for a client-side failover solution (which is also known as Transparent Application Failover), the connection attempt to the second server should be done after the first connection is broken, as that provides more flexibility. If the loadbalancing parameter is true, it tries servers sequentially in random order. If the parameter readonly is false, after authenticating with the server it executes show transaction_read_only, to find out whether the current connection is to the master or to the hot standby, and the connection is considered successful only if the server allows read-write transactions. Although both ideas (load balancing and failover) seem worth discussing, they are separate features and can be worked on separately. It will be easier to sort out the details as well that way. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] More WITH
On 18 August 2015 at 01:18, David Fetter da...@fetter.org wrote: FETCH [in WITH] I'd be a huge fan of this one. I'd love to see FETCH in subqueries, too. Currently doing anything like this requires an ugly PL/PgSQL wrapper. The cursor would have to be known at plan-time so it could be interrogated for its types. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

On 18 August 2015 at 01:18, David Fetter da...@fetter.org wrote: Folks, In the interest of consistency, which is to say, of not hitting barriers that are essentially implementation details, I'd like to propose that we allow the rest of the row-returning commands inside WITH clauses. We currently have: SELECT VALUES INSERT/UPDATE/DELETE ... RETURNING We don't yet have: EXPLAIN [ANALYZE] SHOW FETCH A little further out there, although this would be an API change, we might consider allowing the results of VACUUM and ANALYZE as row sets, which would also be good to wrap in WITH. Is there a good reason, or more than one, why we shouldn't have all the row-returning commands in WITH? Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] How to compile, link and use a C++ extension
On 15 August 2015 at 00:51, Andres Freund and...@anarazel.de wrote: I started my tests by cloning the contrib/worker_spi code, and when transforming the code into C++, I could only note that C++ is not supported in the provided Makefiles. Yes, that doesn't surprise me. Postgres itself doesn't need to care about looking for a usable C++ compiler et al. I think it'd be necessary to add C++ support to PGXS to do this nicely. Because PostgreSQL's headers have C++ extern "C" { wrappers in them, you could potentially compile the whole extension with g++, using extern "C" to expose unmangled symbols. You'll run into some issues due to incompatible compiler flags, though:

$ PATH=$HOME/pg/95/bin:$PATH make USE_PGXS=1 CC=g++
g++ -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -O2 -fpic -I. -I./ -I/home/craig/pg/95/include/postgresql/server -I/home/craig/pg/95/include/postgresql/internal -D_GNU_SOURCE -c -o worker_spi.o worker_spi.c
cc1plus: warning: command line option ‘-Wmissing-prototypes’ is valid for C/ObjC but not for C++
cc1plus: warning: command line option ‘-Wdeclaration-after-statement’ is valid for C/ObjC but not for C++
cc1plus: sorry, unimplemented: -fexcess-precision=standard for C++
<builtin>: recipe for target 'worker_spi.o' failed
make: *** [worker_spi.o] Error 1

How exactly do you need to use the C++ code? The easiest probably would be to have a separate object file, built using your own makefile rule, that contains the C++ and provides a C interface, and then use that from a background worker purely written in C. This is what I would advise too. It helps to separate the PostgreSQL bits from the C++ bits, providing a clean interface across which exceptions are not thrown. 
It isn't safe to access PostgreSQL API from within your C++ code anyway, so the C++ code can be compiled without any PostgreSQL includes, exposing a pure C interface that your extension can then link to as just another library via PGXS's SHLIB_LINK option. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] Make HeapTupleSatisfiesMVCC more concurrent
When we check a tuple for MVCC, it has to pass checks that the inserting transaction has committed, and that it committed before our snapshot began. And similarly that the deleting transaction hasn't committed, or did so after our snapshot. XidInMVCCSnapshot is (or can be) very much cheaper than TransactionIdIsInProgress, because the former touches only local memory while the latter takes a highly contended lock and inspects shared memory. We do the slow one first, but we could do the fast one first and sometimes short-circuit the slow one. If the transaction is in our snapshot, it doesn't matter if it is still in progress or not. This was discussed back in 2013 ( http://www.postgresql.org/message-id/CAMkU=1yy-YEQVvqj2xJitT1EFkyuFk7uTV_hrOMGyGMxpU=n...@mail.gmail.com), and I wanted to revive it. The recent lwlock atomic changes haven't made the problem irrelevant. This patch swaps the order of the checks under some conditions. So that hackers can readily do testing without juggling binaries, I've added an experimental guc which controls the behavior. JJ_SNAP=0 is the original (git HEAD) behavior, JJ_SNAP=1 turns on the new behavior. I've added some flag variables to record if XidInMVCCSnapshot was already called. XidInMVCCSnapshot is cheap, but not so cheap that we want to call it twice if we can avoid it. Those would probably stay in some form or another when the experimental guc goes away. We might be able to rearrange the series of if-tests to get rid of the flag variables, but I didn't want to touch the HEAP_MOVED_OFF and HEAP_MOVED_IN parts of the code, as those must get about zero regression testing. The situation where the performance of this really shows up is when there are tuples that remain in an unresolved state while highly concurrent processes keep stumbling over them. 
I set that up by using the pgbench tables with scale factor of 1, and running a custom query at high concurrency which seq_scans the accounts table:

pgbench -f <(echo 'select sum(abalance) from pgbench_accounts') -T 30 \
  -n -c32 -j32 --startup='set JJ_SNAP=1'

While the test is contrived, it reproduces complaints I've seen on several forums. To create the burden of unresolved tuples, I open psql and run:

begin; update pgbench_accounts set abalance = 1 - abalance;

...and leave it uncommitted for a while. Representative numbers for test runs of the above custom query on an 8-CPU machine:

tps = 542 regardless of JJ_SNAP, when no in-progress tuples
tps = 30  JJ_SNAP=0 with uncommitted bulk update
tps = 364 JJ_SNAP=1 with uncommitted bulk update

A side effect of making this change would be that a query which finds a tuple inserted or deleted by a transaction still in the query's snapshot never checks to see if that transaction committed, and so it doesn't set the hint bit if it did commit or abort. Some future query with a newer snapshot will have to do that. It is at least theoretically possible that this could mean that many hint bits could fail to get set while the buffer is still dirty in shared_buffers, which means it needs to get dirtied again once set. I doubt this would be significant, but if anyone has a test case which they think could show up a problem in this area, please try it out or describe it.

There are other places in tqual.c which could probably use similar re-ordering tricks, but this is the one for which I have a reproducible test case.

Cheers, Jeff

SatisfiesMVCC_reorder_v001.patch Description: Binary data
Re: [HACKERS] Make HeapTupleSatisfiesMVCC more concurrent
Jeff Janes jeff.ja...@gmail.com writes: When we check a tuple for MVCC, it has to pass checks that the inserting transaction has committed, and that it committed before our snapshot began. And similarly that the deleting transaction hasn't committed, or did so after our snapshot. XidInMVCCSnapshot is (or can be) very much cheaper than TransactionIdIsInProgress, because the former touches only local memory while the latter takes a highly contended lock and inspects shared memory. We do the slow one first, but we could do the fast one first and sometimes short-circuit the slow one. If the transaction is in our snapshot, it doesn't matter if it is still in progress or not. This was discussed back in 2013 ( http://www.postgresql.org/message-id/CAMkU=1yy-YEQVvqj2xJitT1EFkyuFk7uTV_hrOMGyGMxpU=n...@mail.gmail.com), and I wanted to revive it. The recent lwlock atomic changes haven't made the problem irrelevant. This patch swaps the order of the checks under some conditions. Just thinking about this ... I wonder why we need to call TransactionIdIsInProgress() at all rather than believing the answer from the snapshot? Under what circumstances could TransactionIdIsInProgress() return true where XidInMVCCSnapshot() had not? I'm thinking maybe TransactionIdIsInProgress is only needed for non-MVCC snapshot types. regards, tom lane
Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows
Kouhei Kaigai kai...@ak.jp.nec.com wrote:

long lbuckets;
lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
Assert(nbuckets > 0);

In my case, the hash_table_bytes was 101017630802, and bucket_size was 48. So, my_log2(hash_table_bytes / bucket_size) = 31, then lbuckets will have a negative number because both 1 and my_log2() are int32. So, Min(lbuckets, max_pointers) chooses 0x80000000, then it was set on the nbuckets and triggers the Assert(). Attached patch fixes the problem.

So you changed the literal of 1 to 1U, but doesn't that just double the threshold for failure? Wouldn't 1L (to match the definition of lbuckets) be better?

Ah, yes, it is my careless mistake. Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com
Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows
On 19 August 2015 at 08:54, Kevin Grittner kgri...@ymail.com wrote: Kouhei Kaigai kai...@ak.jp.nec.com wrote:

long lbuckets;
lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
Assert(nbuckets > 0);

In my case, the hash_table_bytes was 101017630802, and bucket_size was 48. So, my_log2(hash_table_bytes / bucket_size) = 31, then lbuckets will have a negative number because both 1 and my_log2() are int32. So, Min(lbuckets, max_pointers) chooses 0x80000000, then it was set on the nbuckets and triggers the Assert(). Attached patch fixes the problem.

So you changed the literal of 1 to 1U, but doesn't that just double the threshold for failure? Wouldn't 1L (to match the definition of lbuckets) be better?

I agree, but I can only imagine this is happening because the maximum setting of work_mem has been modified with the code you're running. hash_table_bytes is set based on work_mem:

hash_table_bytes = work_mem * 1024L;

The size of your hash table is 101017630802 bytes, which is:

david=# select pg_size_pretty(101017630802);
 pg_size_pretty
 94 GB
(1 row)

david=# set work_mem = '94GB';
ERROR: 98566144 is outside the valid range for parameter work_mem (64 .. 2097151)

So I think the only way the following could cause an error, is if bucket_size was 1, which it can't be.

lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);

I think one day soon we'll need to allow larger work_mem sizes, but I think there's lots more to do than this change. Regards David Rowley -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows
On 19 August 2015 at 12:23, Kouhei Kaigai kai...@ak.jp.nec.com wrote: -Original Message- From: David Rowley [mailto:david.row...@2ndquadrant.com] Sent: Wednesday, August 19, 2015 9:00 AM

The size of your hash table is 101017630802 bytes, which is:

david=# select pg_size_pretty(101017630802);
 pg_size_pretty
 94 GB
(1 row)

david=# set work_mem = '94GB';
ERROR: 98566144 is outside the valid range for parameter work_mem (64 .. 2097151)

Hmm. Why could I set work_mem = '96GB' without error? It was described in the postgresql.conf.

postgres=# SHOW work_mem;
 work_mem
----------
 96GB
(1 row)

So I think the only way the following could cause an error, is if bucket_size was 1, which it can't be.

lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);

I think one day soon we'll need to allow larger work_mem sizes, but I think there's lots more to do than this change.

I overlooked this limitation, but why can I bypass the GUC limit check?

I'm unable to get the server to start if I set work_mem that big. I also tried starting the server with 1GB work_mem then doing a pg_ctl reload. With each of these I get the same error message that I would have gotten if I had done; set work_mem = '96GB'; Which version are you running? Are you sure there are no changes in guc.c for work_mem's range? Regards David Rowley -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Make HeapTupleSatisfiesMVCC more concurrent
I wrote: Just thinking about this ... I wonder why we need to call TransactionIdIsInProgress() at all rather than believing the answer from the snapshot? Under what circumstances could TransactionIdIsInProgress() return true where XidInMVCCSnapshot() had not? I experimented with the attached patch, which replaces HeapTupleSatisfiesMVCC's calls of TransactionIdIsInProgress with XidInMVCCSnapshot, and then as a cross-check has all the return false exits from XidInMVCCSnapshot assert !TransactionIdIsInProgress(). The asserts did not fire in the standard regression tests nor in a pgbench run, which is surely not proof of anything but it suggests that I'm not totally nuts. I wouldn't commit the changes in XidInMVCCSnapshot for real, but otherwise this is a possibly committable patch. regards, tom lane diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c index de7b3fc..c8f59f7 100644 *** a/src/backend/utils/time/tqual.c --- b/src/backend/utils/time/tqual.c *** *** 10,21 * the passed-in buffer. The caller must hold not only a pin, but at least * shared buffer content lock on the buffer containing the tuple. * ! * NOTE: must check TransactionIdIsInProgress (which looks in PGXACT array) * before TransactionIdDidCommit/TransactionIdDidAbort (which look in * pg_clog). Otherwise we have a race condition: we might decide that a * just-committed transaction crashed, because none of the tests succeed. * xact.c is careful to record commit/abort in pg_clog before it unsets ! * MyPgXact-xid in PGXACT array. That fixes that problem, but it also * means there is a window where TransactionIdIsInProgress and * TransactionIdDidCommit will both return true. If we check only * TransactionIdDidCommit, we could consider a tuple committed when a --- 10,22 * the passed-in buffer. The caller must hold not only a pin, but at least * shared buffer content lock on the buffer containing the tuple. * ! * NOTE: When using a non-MVCC snapshot, we must check ! 
* TransactionIdIsInProgress (which looks in the PGXACT array) * before TransactionIdDidCommit/TransactionIdDidAbort (which look in * pg_clog). Otherwise we have a race condition: we might decide that a * just-committed transaction crashed, because none of the tests succeed. * xact.c is careful to record commit/abort in pg_clog before it unsets ! * MyPgXact-xid in the PGXACT array. That fixes that problem, but it also * means there is a window where TransactionIdIsInProgress and * TransactionIdDidCommit will both return true. If we check only * TransactionIdDidCommit, we could consider a tuple committed when a *** *** 26,31 --- 27,37 * subtransactions of our own main transaction and so there can't be any * race condition. * + * When using an MVCC snapshot, we rely on XidInMVCCSnapshot rather than + * TransactionIdIsInProgress, but the logic is otherwise the same: do not + * check pg_clog until after deciding that the xact is no longer in progress. + * + * * Summary of visibility functions: * * HeapTupleSatisfiesMVCC() *** HeapTupleSatisfiesMVCC(HeapTuple htup, S *** 961,967 if (TransactionIdIsCurrentTransactionId(xvac)) return false; ! if (!TransactionIdIsInProgress(xvac)) { if (TransactionIdDidCommit(xvac)) { --- 967,973 if (TransactionIdIsCurrentTransactionId(xvac)) return false; ! if (!XidInMVCCSnapshot(xvac, snapshot)) { if (TransactionIdDidCommit(xvac)) { *** HeapTupleSatisfiesMVCC(HeapTuple htup, S *** 980,986 if (!TransactionIdIsCurrentTransactionId(xvac)) { ! if (TransactionIdIsInProgress(xvac)) return false; if (TransactionIdDidCommit(xvac)) SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED, --- 986,992 if (!TransactionIdIsCurrentTransactionId(xvac)) { ! if (XidInMVCCSnapshot(xvac, snapshot)) return false; if (TransactionIdDidCommit(xvac)) SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED, *** HeapTupleSatisfiesMVCC(HeapTuple htup, S *** 1035,1041 else return false; /* deleted before scan started */ } ! 
else if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmin(tuple))) return false; else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple))) SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED, --- 1041,1047 else return false; /* deleted before scan started */ } ! else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot)) return false; else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple))) SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED, *** HeapTupleSatisfiesMVCC(HeapTuple htup, S *** 1048,1061 return false; } } ! /* ! * By here, the inserting transaction has
Re: [HACKERS] Our trial to TPC-DS but optimizer made unreasonable plan
On Mon, Aug 17, 2015 at 6:40 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: Here is one other thing I could learn from the TPC-DS benchmark. The attached query is Q4 of TPC-DS, and its result was towards SF=100. It took a long time to complete (about 30min), please see the attached EXPLAIN ANALYZE output.

Look at this:

-> CTE Scan on year_total t_s_firstyear (cost=0.00..13120715.27 rows=3976 width=52) (actual time=0.020..5425.980 rows=1816438 loops=1)
   Filter: ((year_total > '0'::numeric) AND (sale_type = 's'::text) AND (dyear = 2001))
   Rows Removed by Filter: 19879897
-> CTE Scan on year_total t_s_secyear (cost=0.00..11927922.98 rows=11928 width=164) (actual time=0.007..45.249 rows=46636 loops=1)
   Filter: ((sale_type = 's'::text) AND (dyear = 2002))
   Rows Removed by Filter: 185596

CTE expansion shall help here as we can push the filter down. I did a quick patch to demonstrate the idea, following Tom's proposal (38448.1430519...@sss.pgh.pa.us). I see an obvious performance boost:

Turn off NLJ:
 original: Planning time: 4.391 ms Execution time: 77113.721 ms
 patched: Planning time: 8.429 ms Execution time: 18572.663 ms
+ work_mem to 1G
 original: Planning time: 4.487 ms Execution time: 29249.466 ms
 patched: Planning time: 11.148 ms Execution time: 7309.586 ms

Attached please find the WIP patch and also the ANALYZE results. Notes: the patch may not apply directly to head; due to some network issue here, my Linux box can't talk to the git server.

Thanks for your patch. Let me test and report the result in my environment. BTW, did you register the patch on the upcoming commit-fest? I think it may be a helpful feature, if we can add an alternative subquery-path towards cte-scan on set_cte_pathlist() and choose them according to the cost estimation. Best regards, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com
Re: [HACKERS] allowing wal_level change at run time
On Wed, Aug 19, 2015 at 2:46 AM, Andres Freund and...@anarazel.de wrote: On 2015-08-18 13:24:54 -0400, Peter Eisentraut wrote: But if we tie the effective wal_level to archive_mode or max_wal_senders, both of which are restart-only, then we haven't gained anything. (We would have removed half a GUC parameter, effectively. Not bad, but not very exciting.) ISTM that it's not too hard to a) make archive_mode PGC_SIGHUP b) make wal_level PGC_SIGHUP c) automatically increase wal_level to logical whenever a logical replication slot is defined Switching archive_mode and wal_level to PGC_SIGHUP would be nice. We can already fake that by setting archive_command to '/usr/bin/true' for the first one or similar, but that's not really the same as switching it completely to off. it seems considerably harder to d) make max_wal_senders PGC_SIGHUP e) make max_replication_slots PGC_SIGHUP because they need shmem, locks, and everything. Yes. Those have effects on the shared memory size allocated at postmaster startup. Therefore I propose something slightly different: We change the default of max_wal_senders, max_replication_slots, to some reasonably high setting and make wal_level an internal automagically determined variable. archive_mode = on automatically increases the level to what's now hot-standby. Or surely max_wal_senders > 0. To deal with streaming replication, we automatically create slots for replicas, but, by default, *without* having them reserve WAL. The slot name would be determined in some automatic fashion (uuid or something) and stored in a new file in the data directory. That allows us to increase the internal wal_level to hot_standby/archive whenever a replica has connected (and thus a physical slot exists), and to logical whenever a logical slot exists. So, wal_level is automatically bumped to hot_standby when the physical slot is created (or logical for a logical slot) even if WAL is not reserved, right? When would those slots be created?
Now, that'll mean that the wal_level doesn't decrease automatically until a slot has been dropped. But that seems relatively harmless if it's not reserving WAL. Too crazy? Perhaps, craziest ideas are usually worth it :) At least, we have a couple of things we can consider: - Make archive_mode SIGHUP - Remove wal_level and set it as follows: -- archive/hot_standby if max_wal_senders > 0 (depends surely on restart) or archive_mode = on (gets more complicated if archive_mode switches to SIGHUP) at startup. -- logical should a logical slot be created. -- If max_wal_senders = 0 and archive_mode = off, switch wal_level to hot_standby once a physical slot is created. In short, switching archive_mode to SIGHUP has as a requirement to get rid of wal_level first. Regards, -- Michael
Re: [HACKERS] allowing wal_level change at run time
On 8/18/15 1:46 PM, Andres Freund wrote: I don't think not requiring restarts is sufficient, having to twiddle a bunch of parameters manually still is a lot more effort than people see as necessary. I agree that we want both. But requiring a restart is a hard stop, whereas making configuration easier is a soft feature. ISTM that it's not too hard to a) make archive_mode PGC_SIGHUP That has been contentious in the past, but I would agree that it seems that it should be doable. (I consider archiving a legacy feature at this point, so for this purpose I don't really care whether it's possible.) b) make wal_level PGC_SIGHUP c) automatically increase wal_level to logical whenever a logical replication slot is defined it seems considerably harder to d) make max_wal_senders PGC_SIGHUP e) make max_replication_slots PGC_SIGHUP because they need shmem, locks, and everything. check Therefore I propose something slightly different: We change the default of max_wal_senders, max_replication_slots, to some reasonably high setting and make wal_level an internal automagically determined variable. archive_mode = on automatically increases the level to what's now hot-standby. check To deal with streaming replication, we automatically create slots for replicas, but, by default, *without* having them reserve WAL. The slot name would be determined in some automatic fashion (uuid or something) and stored in a new file in the data directory. That allows us to increase the internal wal_level to hot_standby/archive whenever a replica has connected (and thus a physical slot exists), and to logical whenever a logical slot exists. That seems kind of weird. So every time a replication client connects, we create a new replication slot? Won't they accumulate? Do we need the replication slot, or could we just trigger this on the existence of a replication client? 
Also note that this scheme and any similar one requires merging the archive and hot_standby levels, which was the core of your original proposal [1], which was then objected to, which subsequently led to Robert's proposal to make wal_level SIGHUP, which led to my message, which led to your message, which is similar to your initial one. Somehow we have to find a way to break out of this circle. ;-) [1] http://www.postgresql.org/message-id/20150203124317.ga24...@awork2.anarazel.de
Re: [HACKERS] Error message with plpgsql CONTINUE
Pavel Stehule pavel.steh...@gmail.com writes: when I implemented this check in plpgsql_check I found another minor issue in CONTINUE statement - the typename is wrong Hmmm ... a bit of nosing around says that fetch/move and get diagnostics are similarly sloppy. Will fix. regards, tom lane
Re: [HACKERS] pgbench - allow backslash-continuations in custom scripts
Hi, all. I don't think we actually want backslash-continuations. The feature we want is to allow SQL statements to span multiple lines, and using the psql lexer solves that. We don't need the backslash-continuations when we have that. Sure. The feature *I* initially wanted was to have multi-line meta-commands. For this feature ISTM that continuations are, alas, the solution. Indeed there are plenty of links already which are generated by makefiles (see src/bin/pg_xlogdump/*), and probably a copy is made on windows. There should be no file duplication within the source tree. Yeah, following the example of pg_xlogdump and others is the way to go. Docs need updating, and there's probably some cleanup to do before this is ready for committing, but overall I think this is definitely the right direction. I've created an entry for the next commitfest, and put the status to waiting on author. I complained upthread that this makes it impossible to use multi-statements in pgbench, as they would be split into separate statements, but looking at psqlscan.l there is actually a syntax for that in psql already. You escape the semicolon as \;, e.g. SELECT 1 \; SELECT 2;, and then both queries will be sent to the server as one. So even that's OK. Good! Hmm. psqlscan.l handles multistatement naturally. I worked on that and the attached patch set does:

- backslash continuation for pgbench metacommands.
  \set variable \
    some value
- SQL statement natural continuation lines.
  SELECT :foo
  FROM :bar;
- SQL multi-statement.
  SELECT 1; SELECT 2;

The work to be left is eliminating the double-format of the Command struct. regards, -- Kyotaro Horiguchi NTT Open Source Software Center From 274bc1cd6de4fb5806e308b002b086b1dfdf7479 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi horiguchi.kyot...@lab.ntt.co.jp Date: Thu, 23 Jul 2015 20:44:37 +0900 Subject: [PATCH 1/3] Prepare for share psqlscan with pgbench.
Eliminate direct usage of pset variables and enable parts unnecessary for other than psql to be disabled by defining OUTSIDE_PSQL. --- src/bin/psql/mainloop.c | 6 ++-- src/bin/psql/psqlscan.h | 14 + src/bin/psql/psqlscan.l | 79 - src/bin/psql/startup.c | 4 +-- 4 files changed, 67 insertions(+), 36 deletions(-) diff --git a/src/bin/psql/mainloop.c b/src/bin/psql/mainloop.c index b6cef94..e98cb94 100644 --- a/src/bin/psql/mainloop.c +++ b/src/bin/psql/mainloop.c @@ -233,7 +233,8 @@ MainLoop(FILE *source) /* * Parse line, looking for command separators. */ - psql_scan_setup(scan_state, line, strlen(line)); + psql_scan_setup(scan_state, line, strlen(line), + pset.db, pset.vars, pset.encoding); success = true; line_saved_in_history = false; @@ -373,7 +374,8 @@ MainLoop(FILE *source) resetPQExpBuffer(query_buf); /* reset parsing state since we are rescanning whole line */ psql_scan_reset(scan_state); - psql_scan_setup(scan_state, line, strlen(line)); + psql_scan_setup(scan_state, line, strlen(line), + pset.db, pset.vars, pset.encoding); line_saved_in_history = false; prompt_status = PROMPT_READY; } diff --git a/src/bin/psql/psqlscan.h b/src/bin/psql/psqlscan.h index 55070ca..4bf8dcb 100644 --- a/src/bin/psql/psqlscan.h +++ b/src/bin/psql/psqlscan.h @@ -11,7 +11,11 @@ #include pqexpbuffer.h #include prompt.h - +#if !defined OUTSIDE_PSQL +#include variables.h +#else +typedef int * VariableSpace; +#endif /* Abstract type for lexer's internal state */ typedef struct PsqlScanStateData *PsqlScanState; @@ -36,12 +40,11 @@ enum slash_option_type OT_NO_EVAL /* no expansion of backticks or variables */ }; - extern PsqlScanState psql_scan_create(void); extern void psql_scan_destroy(PsqlScanState state); -extern void psql_scan_setup(PsqlScanState state, -const char *line, int line_len); +extern void psql_scan_setup(PsqlScanState state, const char *line, int line_len, + PGconn *db, VariableSpace vars, int encoding); extern void psql_scan_finish(PsqlScanState state); extern 
PsqlScanResult psql_scan(PsqlScanState state, @@ -52,6 +55,7 @@ extern void psql_scan_reset(PsqlScanState state); extern bool psql_scan_in_quote(PsqlScanState state); +#if !defined OUTSIDE_PSQL extern char *psql_scan_slash_command(PsqlScanState state); extern char *psql_scan_slash_option(PsqlScanState state, @@ -60,5 +64,5 @@ extern char *psql_scan_slash_option(PsqlScanState state, bool semicolon); extern void psql_scan_slash_command_end(PsqlScanState state); - +#endif /* if !defined OUTSIDE_PSQL */ #endif /* PSQLSCAN_H */ diff --git a/src/bin/psql/psqlscan.l b/src/bin/psql/psqlscan.l index be059ab..f9a19cd 100644 --- a/src/bin/psql/psqlscan.l +++ b/src/bin/psql/psqlscan.l @@ -43,11 +43,6 @@ #include ctype.h -#include common.h -#include settings.h -#include variables.h - - /* * We use a stack of
Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows
-Original Message- From: David Rowley [mailto:david.row...@2ndquadrant.com] Sent: Wednesday, August 19, 2015 9:00 AM To: Kevin Grittner Cc: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows On 19 August 2015 at 08:54, Kevin Grittner kgri...@ymail.com wrote: Kouhei Kaigai kai...@ak.jp.nec.com wrote:

long lbuckets;
lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
Assert(nbuckets > 0);

In my case, the hash_table_bytes was 101017630802, and bucket_size was 48. So, my_log2(hash_table_bytes / bucket_size) = 31, then lbuckets will have a negative number because both 1 and my_log2() are int32. So, Min(lbuckets, max_pointers) chooses 0x80000000, then it was set on the nbuckets and triggers the Assert(). Attached patch fixes the problem.

So you changed the literal of 1 to 1U, but doesn't that just double the threshold for failure? Wouldn't 1L (to match the definition of lbuckets) be better?

I agree, but I can only imagine this is happening because the maximum setting of work_mem has been modified with the code you're running. hash_table_bytes is set based on work_mem:

hash_table_bytes = work_mem * 1024L;

The size of your hash table is 101017630802 bytes, which is:

david=# select pg_size_pretty(101017630802);
 pg_size_pretty
 94 GB
(1 row)

david=# set work_mem = '94GB';
ERROR: 98566144 is outside the valid range for parameter work_mem (64 .. 2097151)

Hmm. Why could I set work_mem = '96GB' without error? It was described in the postgresql.conf.

postgres=# SHOW work_mem;
 work_mem
----------
 96GB
(1 row)

So I think the only way the following could cause an error, is if bucket_size was 1, which it can't be.

lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);

I think one day soon we'll need to allow larger work_mem sizes, but I think there's lots more to do than this change.

I overlooked this limitation, but why can I bypass the GUC limit check?
-- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com
Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows
David Rowley david.row...@2ndquadrant.com writes: david=# set work_mem = '94GB'; ERROR: 98566144 is outside the valid range for parameter work_mem (64 .. 2097151) Apparently you're testing on a 32-bit server. 64-bit servers allow work_mem to go up to INT_MAX kilobytes. regards, tom lane
Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows
On 19 August 2015 at 12:38, Tom Lane t...@sss.pgh.pa.us wrote: David Rowley david.row...@2ndquadrant.com writes: david=# set work_mem = '94GB'; ERROR: 98566144 is outside the valid range for parameter work_mem (64 .. 2097151) Apparently you're testing on a 32-bit server. 64-bit servers allow work_mem to go up to INT_MAX kilobytes. hmm, no, but it does appear that sizeof(long) is 4 bytes for me, despite a 64 bit OS. I guess this accounts for my misunderstanding that work_mem was limited to 2GB even on 64 bit machines. From guc.h:

#if SIZEOF_SIZE_T > 4 && SIZEOF_LONG > 4
#define MAX_KILOBYTES INT_MAX
#else
#define MAX_KILOBYTES (INT_MAX / 1024)
#endif

Apologies for the noise.
Re: [HACKERS] Our trial to TPC-DS but optimizer made unreasonable plan
On Mon, Aug 17, 2015 at 6:40 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: Here is one other thing I could learn from the TPC-DS benchmark. The attached query is Q4 of TPC-DS, and its result was towards SF=100. It took a long time to complete (about 30min), please see the attached EXPLAIN ANALYZE output.

Look at this:

-> CTE Scan on year_total t_s_firstyear (cost=0.00..13120715.27 rows=3976 width=52) (actual time=0.020..5425.980 rows=1816438 loops=1)
   Filter: ((year_total > '0'::numeric) AND (sale_type = 's'::text) AND (dyear = 2001))
   Rows Removed by Filter: 19879897
-> CTE Scan on year_total t_s_secyear (cost=0.00..11927922.98 rows=11928 width=164) (actual time=0.007..45.249 rows=46636 loops=1)
   Filter: ((sale_type = 's'::text) AND (dyear = 2002))
   Rows Removed by Filter: 185596

CTE expansion shall help here as we can push the filter down. I did a quick patch to demonstrate the idea, following Tom's proposal (38448.1430519...@sss.pgh.pa.us). I see an obvious performance boost:

Turn off NLJ:
 original: Planning time: 4.391 ms Execution time: 77113.721 ms
 patched: Planning time: 8.429 ms Execution time: 18572.663 ms
+ work_mem to 1G
 original: Planning time: 4.487 ms Execution time: 29249.466 ms
 patched: Planning time: 11.148 ms Execution time: 7309.586 ms

Attached please find the WIP patch and also the ANALYZE results. Notes: the patch may not apply directly to head; due to some network issue here, my Linux box can't talk to the git server. Regards, Qingqing

ctes.patch Description: Binary data
result.out Description: Binary data
Re: [HACKERS] WIP: SCRAM authentication
On Wed, Aug 19, 2015 at 5:30 AM, Greg Stark st...@mit.edu wrote: On Tue, Aug 18, 2015 at 7:19 PM, Robert Haas robertmh...@gmail.com wrote: Sorry, that's a completely bogus argument. We do not need to prevent people from upgrading if they haven't moved off of MD5 authentication; that's just an arbitrary - and IMHO extremely user-hostile - policy decision. We have a legitimate need, to move the project forward, to introduce a better system for password authentication. Ripping out the old one is not a real need. I'm sure at some point it will seem like antiquated cruft that nobody uses any more, but that will not be in a year or two or anything close to that. I would imagine a GUC like enable_legacy_authentication=true which we would just start defaulting to false after a few years and perhaps consider removing five years after that. pg_upgrade could check if there are any users which require it to be set to true and warn users that they must enable it but should check the documentation so they understand the impact. Yep, that's one of the ideas mentioned upthread. This parameter should actually be filled with a list of verifiers that we consider out-of-support, deprecated, well, things that users should be warned about. One solution may be to log warnings when parsing pg_hba.conf should a deprecated method be used. OK, that's an interesting argument. If SCRAM supports multiple password verifiers, and we support SCRAM, then I guess we should probably do that, too. I still don't like it all that much, though. I think it's absolutely inevitable that people are going to end up with an account with 3 or more different passwords that can all be used to log into it, and that won't be good. How do other systems avoid this pitfall? Fwiw having multiple passwords would make automated credential rotations *so* much easier. Heroku has a really baroque solution to this problem in Postgres involving creating new child roles and swapping them around.
My team at Google wasted many man hours dealing with fallout from the quarterly password rotations. (I wish we could just drop the account management and authentication system entirely and dump the whole work on a system designed for this particular problem. It's a hard problem that's far outside our core features and Postgres is never going to be a good system for anyone let alone everyone when there are many different types of users.) This makes me think that we may actually need a finer grammar than what the patch is proposing: ADD PASSWORD VERIFIERS (method = 'value' [, ...] ) [ OPTIONS (validUntil = 'timestamp') ] DROP PASSWORD VERIFIERS (method [, ...]) PASSWORD VERIFIERS (method = 'value' [, ...]) [ OPTIONS (validUntil = 'timestamp') ] -- replace all the existing ones This maps to the following catalog, pg_auth_verifiers: Oid roleoid; char method; text value; timestamp validUntil; I am not sure if we would want to be able to have multiple verifier values for the same method, but feel free to comment. Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
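Michael's proposed grammar and pg_auth_verifiers catalog can be modeled in miniature. The following Python sketch is an illustration only (the class, method names, and semantics are assumptions, not the patch's actual behavior); it shows how one verifier per method makes credential rotation possible without a login outage:

```python
from datetime import datetime

class Role:
    """Toy model of a role holding one password verifier per method."""
    def __init__(self, name):
        self.name = name
        self.verifiers = {}  # method -> (value, valid_until)

    def add_verifiers(self, valid_until=None, **pairs):
        # ADD PASSWORD VERIFIERS (method = 'value', ...) [ OPTIONS (validUntil = ...) ]
        for method, value in pairs.items():
            self.verifiers[method] = (value, valid_until)

    def drop_verifiers(self, *methods):
        # DROP PASSWORD VERIFIERS (method, ...)
        for m in methods:
            self.verifiers.pop(m, None)

    def usable(self, method, now):
        v = self.verifiers.get(method)
        if v is None:
            return False
        _value, valid_until = v
        return valid_until is None or now < valid_until

# Rotation: add a SCRAM verifier alongside the legacy md5 one, then
# drop md5 once all clients have migrated -- no login downtime.
r = Role("alice")
r.add_verifiers(md5="md5abc...", scram="scramdef...")
now = datetime(2015, 8, 19)
assert r.usable("md5", now) and r.usable("scram", now)
r.drop_verifiers("md5")
assert not r.usable("md5", now) and r.usable("scram", now)
```

The overlap window between add and drop is exactly what the single-verifier-per-role model cannot express.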
[HACKERS] proposal: function parse_ident
Hi, I miss functionality that helps with parsing any identifier into its three basic parts - database, schema, objectname. We have this function internally, but it is not available at the SQL layer. FUNCTION parse_ident(IN ident text, OUT dbname text, OUT schemaname text, OUT objectname text) Examples: SELECT parse_ident('some schema.tablename'); Comments, ideas, notes? Regards Pavel
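A rough illustration of what parse_ident() would do. This Python sketch is a simplification (the real rules live in the server's scanner): it splits a qualified name on dots outside double quotes, folds unquoted parts to lower case, handles doubled quotes inside quoted names, and pads missing parts on the left so objectname is always last:

```python
def parse_ident(ident):
    """Split a qualified identifier into up to three parts
    (dbname, schemaname, objectname), honoring double quotes.
    A simplified sketch of the proposed parse_ident()."""
    parts, buf, i, quoted = [], [], 0, False
    while i < len(ident):
        c = ident[i]
        if c == '"':
            if quoted and i + 1 < len(ident) and ident[i + 1] == '"':
                buf.append('"')        # escaped "" inside a quoted name
                i += 1
            else:
                quoted = not quoted    # enter/leave a quoted name
        elif c == '.' and not quoted:
            parts.append(''.join(buf)); buf = []
        else:
            # unquoted identifiers are folded to lower case
            buf.append(c if quoted else c.lower())
        i += 1
    parts.append(''.join(buf))
    if len(parts) > 3:
        raise ValueError("improper qualified name")
    padded = [None] * (3 - len(parts)) + parts
    return tuple(padded)               # (dbname, schemaname, objectname)

assert parse_ident('db.s.o') == ('db', 's', 'o')
assert parse_ident('MySchema.Tbl') == (None, 'myschema', 'tbl')
assert parse_ident('"My Schema".t') == (None, 'My Schema', 't')
```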
Re: [HACKERS] Our trial to TPC-DS but optimizer made unreasonable plan
On Mon, Aug 17, 2015 at 9:40 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: I think the SortSupport logic provides a reasonable way to solve this kind of problem. For example, btint4sortsupport() provides a function pointer to the fast version of the comparator (btint4fastcmp), which takes two Datum arguments without indirect memory references. This mechanism would also make sense for the HashAggregate logic, to reduce the cost of function invocations. Please comment on this idea. It's possible that this can work, but it might be a good idea to run 'perf' on this query and find out where the CPU time is actually going. That might give you a clearer picture of why the HashAggregate is slow. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Declarative partitioning
On Tue, Aug 18, 2015 at 07:30:20PM +0900, Amit Langote wrote: Hi, I would like to propose $SUBJECT for this development cycle. Attached is a WIP patch that implements most if not all of what's described below. Some yet unaddressed parts are mentioned below, too. I'll add this to the CF-SEP. Thanks for pushing this forward! We've needed this done for at least a decade. 4. (yet unimplemented) Attach partition (from existing table) ALTER TABLE partitioned_table ATTACH PARTITION partition_name FOR VALUES values_spec USING [TABLE] table_name; ALTER TABLE table_name SET VALID PARTITION OF parent; The first of the above pair of commands would attach table_name as a (yet) 'invalid' partition of partitioned_table (after confirming that it matches the schema and does not overlap with other partitions per the FOR VALUES spec). It would also record the FOR VALUES part in the partition catalog and set pg_class.relispartition to true for table_name. After the first command is done, the second command would take an exclusive lock on table_name, scan the table to check if it contains any values outside the boundaries defined by the FOR VALUES clause defined previously, throw an error if so, and mark it a valid partition of the parent if not. One small change to make this part more efficient: 1. Take the access exclusive lock on table_name. 2. Check for a matching constraint on it. 3. If it's there, mark it as a valid partition. 4. If not, check for values outside the boundaries as above. Should that be a *valid* constraint? Perhaps that should be parameterized, as I'm not yet seeing a compelling argument in either direction. I'm picturing something like: ALTER TABLE table_name SET VALID PARTITION OF parent [TRUST] where TRUST would mean that an existing constraint need not be VALID. Does that make sense? Yep. 5. Detach partition ALTER TABLE partitioned_table DETACH PARTITION partition_name [USING table_name] This removes partition_name as a partition of partitioned_table.
The table continues to exist with the same name or 'table_name', if specified. pg_class.relispartition is set to false for the table, so it behaves like a normal table. Could this take anything short of an access exclusive lock on the parent? Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] allowing wal_level change at run time
Peter Eisentraut pete...@gmx.net writes: On 8/18/15 8:48 AM, Robert Haas wrote: What do you mean by prevent? If the user edits postgresql.conf and reduces the setting, and then reloads the configuration file, they have a right to expect that the changes got applied. We have certain checks in place that require a minimum wal_level before other things are allowed. For example, turning on archiving requires wal_level = archive. The issue is then, if you have archiving on and then turn wal_level to minimal at run time, we need to prevent that to preserve the integrity of the original check. IIRC, the reason for having a wal_level parameter separate from those other ones was precisely that only wal_level had to be PGC_POSTMASTER, and you could change the others if you wanted. If we are going to fix the mechanisms to allow dynamic changing in wal log verbosity, maybe we could simply get rid of wal_level as a user-settable parameter, and have its effective value be inferred from the active settings of the other parameters. IOW: let's simplify, not complicate even further. Try to get rid of the interdependencies between settable parameters. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb array-style subscripting
On 08/18/2015 01:11 AM, Kaare Rasmussen wrote: On 2015-08-17 22:33, Josh Berkus wrote: So, both perl and python do not allow deep nesting of assignments. For example: d = { 'a' : { } } d['a']['a1']['a2'] = 42 Traceback (most recent call last): File "<stdin>", line 1, in <module> KeyError: 'a1' Not sure I understand what you mean. In Perl you'd do $ perl -e '%d = (a => {}); $d{a}{a1}{a2} = 42; print $d{a}{a1}{a2}' 42 which looks pretty much like what's proposed. Indeed, I mentioned recently that perl auto-vivifies intermediate paths like this. But we don't do that in jsonb_set, as was discussed back in May, and as JS doesn't either I think that decision is sound. cheers andrew
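The behavioral difference is easy to demonstrate in Python itself: a plain dict rejects deep assignment, matching the decision made for jsonb_set, while a recursive defaultdict reproduces Perl's auto-vivification:

```python
from collections import defaultdict

# Plain dict: deep assignment fails because the intermediate
# key 'a1' does not exist -- the behavior jsonb_set also chose.
d = {'a': {}}
try:
    d['a']['a1']['a2'] = 42
    raised = False
except KeyError:
    raised = True
assert raised

# Recursive defaultdict: intermediate paths spring into existence,
# matching Perl's auto-vivification.
def autoviv():
    return defaultdict(autoviv)

p = autoviv()
p['a']['a1']['a2'] = 42
assert p['a']['a1']['a2'] == 42
```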
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
On Mon, Aug 10, 2015 at 12:36 AM, Rahila Syed rahilasye...@gmail.com wrote: Hello, Say, 6 bigint counters, 6 float8 counters, and 3 strings up to 80 characters each. So we have a fixed-size chunk of shared memory per backend, and each backend that wants to expose progress information can fill in those fields however it likes, and we expose the results. This would be sorta like the way pg_statistic works: the same columns can be used for different purposes depending on what estimator will be used to access them. After thinking more on this suggestion, I came up with following generic structure which can be used to store progress of any command per backend in shared memory. Struct PgBackendProgress { int32 *counter[COMMAND_NUM_SLOTS]; float8 *counter_float[COMMAND_NUM_SLOTS]; char *progress_message[COMMAND_NUM_SLOTS]; } This can't actually work, because we don't have a dynamic allocator for shared memory. What you need to do is something like this: struct PgBackendProgress { uint64 progress_integer[N_PROGRESS_INTEGER]; float8 progress_float[N_PROGRESS_FLOAT]; char progress_string[PROGRESS_STRING_LENGTH][N_PROGRESS_STRING]; }; You probably want to protect this with the st_changecount protocol, or just put the fields in PgBackendStatus. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Configure with thread sanitizer fails the thread test
On Mon, Aug 17, 2015 at 3:02 PM, Andres Freund and...@anarazel.de wrote: On 2015-08-17 14:31:24 -0300, Alvaro Herrera wrote: The postmaster process in particular runs in a rather unusual arrangement, where most of the interesting stuff does happen in signal handlers. FWIW, I think it might be worthwhile to convert postmaster into a loop over a process local latch, with that latch being set in signal handlers. My feeling is that that'd simplify the code rather significantly. I'm not 100% it's worth the code churn, but it'd definitely be easier to understand. Thread sanitizer isn't the first analysis tool that has problems coping with forks in signal handlers btw, valgrind on amd64 for a long while had misaligned stacks in the children afterwards leading to very odd crashes. Yeah, I'm a little worried about whether we'd destabilize things by changing them in that way, but if we could avoid that pitfall I suspect we'd end up better off. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Warnings around booleans
On Mon, Aug 17, 2015 at 8:01 AM, Andres Freund and...@anarazel.de wrote: I'd rather see those split into separate commits. Doing polishing and active bugs in one commit imo isn't a good idea if the polishing goes beyond a line or two. +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] allowing wal_level change at run time
On 8/18/15 8:48 AM, Robert Haas wrote: On Tue, Aug 18, 2015 at 7:59 AM, Peter Eisentraut pete...@gmx.net wrote: How would we handle decreases at run time? We can prevent =archive - minimal if archiving is running or there are physical replication slots, and we can prevent logical - something less if there are logical replication slots, but AFAICT, we don't have a way to check whether anyone currently needs level hot_standby. What do you mean by prevent? If the user edits postgresql.conf and reduces the setting, and then reloads the configuration file, they have a right to expect that the changes got applied. We have certain checks in place that require a minimum wal_level before other things are allowed. For example, turning on archiving requires wal_level = archive. The issue is then, if you have archiving on and then turn wal_level to minimal at run time, we need to prevent that to preserve the integrity of the original check. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: SCRAM authentication
Josh, * Josh Berkus (j...@agliodbs.com) wrote: I don't feel like you've correctly assessed the risk inherent in the md5 auth method, which is that, having captured an md5auth string by whatever means, an attacker can reuse that md5 string on other databases in the network *without* cracking it. That's the biggest risk as long as md5 is present. Robert and I, I believe, are both aware of that risk and had moved on to discussing the risk to the cleartext password by keeping an md5 verifier present, even if it can't be used for login. We will want to make it simple for admins to enable/disable md5 (eg: pg_hba), but we also have to consider how to make it simple for admins to remove the old md5 password verifier or there will be risk of the cleartext password being compromised from backups, etc. I fully agree that it's a serious risk that, today, what's stored in the database is directly usable for logging into the running system (or any system where the user has the same username and password). Aside from code complexity, the user security concern with a multiple-verifier-per-role approach is that the DBAs would never remember to completely disable md5auth, and attackers would capture md5 hashes either in flight or from backups. This approach can be used to capture an md5 hash from a non-critical database which is poorly secured, and then re-use it against an important database. The 'in flight' case is at least a bit less of an issue, as we don't ship the password verifier directly over the network. The backup concern is certainly correct though. I fully agree that we need a way to make sure users don't end up having the old password verifiers around longer than necessary. That was the impetus for my earlier suggestion that in a release or two, we actively make pg_upgrade error (or perhaps warn first, then error, across two releases) if any of the old verifiers exist.
Now, the counter-argument to this is that a DBA is just as likely to remember to remove md5 verifiers as she is to remember to remove roles with md5auth. Indeed, that's a serious concern also. The other concern with a single password verifier is that we're locking ourselves into a one-verifier-per-role solution when most of the serious solutions in use today (Active Directory, Kerberos, certificate based systems) allow for more than one. Regardless of the approach we take, encouraging users to migrate is going to be more of a matter of documentation, publicity, and administrative tools than one of multiple verifiers vs. multiple roles. That is, giving DBAs the ability to see and log who's using what kind of verifier, and what account has what verifier(s) available, will make more of a difference. Fully agreed. Thanks! Stephen
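The pass-the-hash risk discussed in this thread is concrete because PostgreSQL's md5 verifier is md5(password || username), and the wire response is just another hash of that verifier plus the session salt — so the stored string is itself a usable credential. A minimal sketch of the exchange (simplified; real messages carry protocol framing):

```python
import hashlib, os

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def stored_verifier(user: str, password: str) -> str:
    # What pg_authid stores for md5 auth: "md5" || md5(password || username)
    return "md5" + md5_hex(password.encode() + user.encode())

def challenge_response(inner_hex: str, salt: bytes) -> str:
    # What the client sends: "md5" || md5(md5(password || username) || salt)
    return "md5" + md5_hex(inner_hex.encode() + salt)

user, password = "alice", "s3cret"
salt = os.urandom(4)

# An honest client derives the response from the password...
honest = challenge_response(stored_verifier(user, password)[3:], salt)
# ...but an attacker holding only the stored verifier (e.g. pulled from
# a backup of a poorly secured database) computes the same response.
stolen = stored_verifier(user, password)
attacker = challenge_response(stolen[3:], salt)
assert honest == attacker  # the stored hash *is* the credential
```

This is exactly why a captured verifier can be replayed against any database where the same role/password pair exists, without ever cracking the hash.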
Re: [HACKERS] jsonb array-style subscripting
On 08/18/2015 01:32 AM, Pavel Stehule wrote: Hi 2015-08-17 21:12 GMT+02:00 Jim Nasby jim.na...@bluetreble.com mailto:jim.na...@bluetreble.com: On 8/17/15 12:57 PM, Dmitry Dolgov wrote: * is it interesting for the community? We definitely need better ways to manipulate JSON. * is that a good idea to extend the `ArrayRef` for jsonb? If it's appropriate, probably we can rename it to `ArrayJsonbRef` of something. * what can be improved in the code at the top level (function placement, probably, functionality duplication, etc.)? * are there any special cases, that I should take care of in this implementation? How would this work when you have a JSON array? Postgres array syntax suddenly becoming key/value syntax for JSON seems like a pretty bad idea to me. Could a different syntax (maybe {}) be used instead? I don't understand why '{}' should be better than '[]' ? The lot of modern languages doesn't different between arrays and hash. What is more, it would be a lot more intrusive to the parser, I suspect. We currently allow json path expressions to contain both object key and array index values, and I don't see any reason to do this differently. This is the syntax that was discussed at pgcon. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Autonomous Transaction is back
On Sat, Aug 15, 2015 at 6:47 PM, Noah Misch n...@leadboat.com wrote: CREATE TABLE t (c) AS SELECT 1; BEGIN; UPDATE t SET c = 2 WHERE c = 1; BEGIN_AUTONOMOUS; UPDATE t SET c = 3 WHERE c = 1; UPDATE t SET c = 4 WHERE c = 2; COMMIT_AUTONOMOUS; ROLLBACK; If you replace the autonomous transaction with a savepoint, the c=3 update finds no rows, and the c=4 update changes one row. When the outer transaction aborts, only the original c=1 row remains live. If you replace the autonomous transaction with a dblink/pg_background call, the c=3 update waits indefinitely for c=2 to commit or abort, an undetected deadlock. Suppose you make the autonomous transaction see tuples like in the savepoint case. The c=3 update finds no rows to update, and the c=4 update changes one row. When the outer transaction aborts, you have two live rows (c=1 and c=4). Suppose you instead make the autonomous transaction see tuples like in the dblink case, yet let c=3 ignore the lock and change a row. If both the autonomous transaction and the outer transaction were to commit, then you get two live rows (c=2 and c=3). Neither of those outcomes is acceptable, of course. In today's tuple update rules, c=3 must deadlock[1]. Other credible tuple update rules may not have this problem, but nothing jumps to mind. [1] That's not to say it must use the shmem lock structures and deadlock detector. This footnote goes to my point. It seems clear to me that having the autonomous transaction see the effects of the outer uncommitted transaction is a recipe for trouble. If the autonomous transaction updates a row and commits, and the outer transaction later aborts, the resulting state is inconsistent with any serial history. I'm fairly certain that's going to leave us in an unhappy place. Even more obviously, ending up with two committed row versions that are both updates of a single ancestor version is no good. So, I agree that this scenario should be an error. 
What I don't agree with is the idea that it should be the deadlock detector's job to throw that error. Rather, I think that when we examine the xmax of the tuple we can see - which is the original one, not the one updated by the outer transaction - we should check whether that XID belongs to an outer transaction. If it does, we should throw an error instead of trying to lock it. That way (1) the error message will be clear and specific to the situation and (2) we don't need a separate PGPROC for each autonomous transaction. The first of those benefits is agreeable; the second one is, in my opinion, a key design goal for this feature. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
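Robert's proposed check — error out as soon as the tuple's xmax is found to belong to one of the autonomous transaction's own outer transactions, since no amount of waiting can resolve that — can be sketched as follows (a toy model; the xid set and wait callback are stand-ins for the real locking machinery):

```python
class AutonomousDeadlockError(Exception):
    pass

def lock_tuple(tuple_xmax, my_outer_xids, wait):
    """Sketch of the proposed rule: if the tuple's xmax belongs to one
    of our (uncommitted) outer transactions, no wait can ever succeed,
    so raise a specific error instead of blocking."""
    if tuple_xmax in my_outer_xids:
        raise AutonomousDeadlockError(
            "tuple locked by an outer transaction of this "
            "autonomous transaction")
    wait(tuple_xmax)  # otherwise, wait for the locker as usual

# The c=3 update from Noah's example: xmax 100 is the outer transaction.
outer_xids = {100}
try:
    lock_tuple(100, outer_xids, wait=lambda xid: None)
    errored = False
except AutonomousDeadlockError:
    errored = True
assert errored
lock_tuple(200, outer_xids, wait=lambda xid: None)  # unrelated locker: waits
```

The point of doing the test at lock time is that no deadlock detector pass (and no extra PGPROC) is needed to reach the error.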
Re: [HACKERS] DTrace build dependency rules
On Sat, Aug 15, 2015 at 6:45 PM, Mark Johnston ma...@freebsd.org wrote: There seems to be a bug in the make rules when DTrace is enabled. It causes dtrace -G to be invoked twice when building PostgreSQL as a FreeBSD port: once during the build itself, and once during installation. For a long time this has been worked around on FreeBSD with a change to libdtrace itself, but this workaround is proving problematic and I'd like to fix the problem properly. I'm not sure whether the problem has been observed on other operating systems that support DTrace. The bug is in src/backend/Makefile. probes.o, the dtrace(1)-generated object file, depends on the objfiles.txt for each of the backend subdirs. These files depend in turn on the object files themselves; if objfiles.txt is out of date with respect to one of its object files, the mtime of objfiles.txt is updated with touch (see backend/common.mk). The problem is that dtrace -G, which runs at the end of the build, modifies a number of object files (it overwrites their probe sites with NOPs), thus making their corresponding objfiles.txt out of date. Then, when make install traverses the backend subdirs, it updates objfiles.txt, which causes probes.o to be rebuilt, resulting in an error from dtrace(1). Gosh, that's pretty ugly. I would have thought it would be a real no-no to update the .o file once it got generated. If nothing else, a modification to the .c file concurrent with a make invocation might lead to the .o not getting rebuilt the next time make is run. The attached patch fixes the problem by having probes.o depend on object files directly, rather than on objfiles.txt. I've tested it with PostgreSQL 9.0-9.4 on FreeBSD CURRENT. I don't see a particular reason not to make this change, though. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] allowing wal_level change at run time
On Tue, Aug 18, 2015 at 9:50 AM, Tom Lane t...@sss.pgh.pa.us wrote: Peter Eisentraut pete...@gmx.net writes: On 8/18/15 8:48 AM, Robert Haas wrote: What do you mean by prevent? If the user edits postgresql.conf and reduces the setting, and then reloads the configuration file, they have a right to expect that the changes got applied. We have certain checks in place that require a minimum wal_level before other things are allowed. For example, turning on archiving requires wal_level = archive. The issue is then, if you have archiving on and then turn wal_level to minimal at run time, we need to prevent that to preserve the integrity of the original check. IIRC, the reason for having a wal_level parameter separate from those other ones was precisely that only wal_level had to be PGC_POSTMASTER, and you could change the others if you wanted. If we are going to fix the mechanisms to allow dynamic changing in wal log verbosity, maybe we could simply get rid of wal_level as a user-settable parameter, and have its effective value be inferred from the active settings of the other parameters. Mmm, I like that idea. If we can do it, it seems much cleaner that way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
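Tom's suggestion — derive the effective wal_level from the features that need it, instead of keeping it as a separate settable GUC — might look roughly like this (the parameter names below are illustrative assumptions, not an actual proposal):

```python
# Ordered from least to most WAL detail.
LEVELS = ["minimal", "archive", "hot_standby", "logical"]

def effective_wal_level(archive_mode=False, max_wal_senders=0,
                        logical_slots=0):
    """Infer the WAL verbosity required by the active settings:
    each feature bumps the level to at least what it needs."""
    need = "minimal"
    if archive_mode:
        need = max(need, "archive", key=LEVELS.index)
    if max_wal_senders > 0:
        need = max(need, "hot_standby", key=LEVELS.index)
    if logical_slots > 0:
        need = max(need, "logical", key=LEVELS.index)
    return need

assert effective_wal_level() == "minimal"
assert effective_wal_level(archive_mode=True) == "archive"
assert effective_wal_level(archive_mode=True, logical_slots=1) == "logical"
```

With this shape, the "can't drop below what archiving needs" problem disappears by construction: there is no independent knob to set inconsistently.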
[HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows
Hello, I noticed ExecChooseHashTableSize() in nodeHash.c fails with Assert(nbuckets > 0) when a crazy number of rows is expected. BACKTRACE: #0 0x003f79432625 in raise () from /lib64/libc.so.6 #1 0x003f79433e05 in abort () from /lib64/libc.so.6 #2 0x0092600a in ExceptionalCondition (conditionName=0xac1ea0 "!(nbuckets > 0)", errorType=0xac1d88 "FailedAssertion", fileName=0xac1d40 "nodeHash.c", lineNumber=545) at assert.c:54 #3 0x006851ff in ExecChooseHashTableSize (ntuples=60521928028, tupwidth=8, useskew=1 '\001', numbuckets=0x7fff146bff04, numbatches=0x7fff146bff00, num_skew_mcvs=0x7fff146bfefc) at nodeHash.c:545 #4 0x00701735 in initial_cost_hashjoin (root=0x253a318, workspace=0x7fff146bffc0, jointype=JOIN_SEMI, hashclauses=0x257e4f0, outer_path=0x2569a40, inner_path=0x2569908, sjinfo=0x2566f40, semifactors=0x7fff146c0168) at costsize.c:2592 #5 0x0070e02a in try_hashjoin_path (root=0x253a318, joinrel=0x257d940, outer_path=0x2569a40, inner_path=0x2569908, hashclauses=0x257e4f0, jointype=JOIN_SEMI, extra=0x7fff146c0150) at joinpath.c:543 See the following EXPLAIN output, taken from a build without --enable-cassert. The planner expects 60.5B rows from the self-join of a relation with 72M rows. (Probably, this estimate is far too high.)
[kaigai@ayu ~]$ (echo EXPLAIN; cat ~/tpcds/query95.sql) | psql tpcds100 QUERY PLAN Limit (cost=9168667273.07..9168667273.08 rows=1 width=20) CTE ws_wh -> Custom Scan (GpuJoin) (cost=3342534.49..654642911.88 rows=60521928028 width=24) Bulkload: On (density: 100.00%) Depth 1: Logic: GpuHashJoin, HashKeys: (ws_order_number), JoinQual: ((ws_warehouse_sk <> ws_warehouse_sk) AND (ws_order_number = ws_order_number)), nrows (ratio: 84056.77%) -> Custom Scan (BulkScan) on web_sales ws1_1 (cost=0.00..3290612.48 rows=72001248 width=16) -> Seq Scan on web_sales ws2 (cost=0.00..3290612.48 rows=72001248 width=16) -> Sort (cost=8514024361.19..8514024361.20 rows=1 width=20) Sort Key: (count(DISTINCT ws1.ws_order_number)) : This crash was triggered by Assert(nbuckets > 0), and nbuckets is calculated as follows. /* * If there's not enough space to store the projected number of tuples and * the required bucket headers, we will need multiple batches. */ if (inner_rel_bytes + bucket_bytes > hash_table_bytes) { /* We'll need multiple batches */ long lbuckets; double dbatch; int minbatch; long bucket_size; /* * Estimate the number of buckets we'll want to have when work_mem is * entirely full. Each bucket will contain a bucket pointer plus * NTUP_PER_BUCKET tuples, whose projected size already includes * overhead for the hash code, pointer to the next tuple, etc. */ bucket_size = (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple)); lbuckets = 1 << my_log2(hash_table_bytes / bucket_size); lbuckets = Min(lbuckets, max_pointers); nbuckets = (int) lbuckets; bucket_bytes = nbuckets * sizeof(HashJoinTuple); : : } Assert(nbuckets > 0); Assert(nbatch > 0); In my case, hash_table_bytes was 101017630802 and bucket_size was 48. So my_log2(hash_table_bytes / bucket_size) = 31, and lbuckets gets a negative number because both the constant 1 and the result of my_log2() are int32, so the shift is done in 32 bits. Min(lbuckets, max_pointers) therefore chooses 0x80000000, which is then assigned to nbuckets and triggers the Assert(). The attached patch fixes the problem.
Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com pgsql-fix-hash-nbuckets.patch Description: pgsql-fix-hash-nbuckets.patch
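The arithmetic of the overflow above can be replayed outside C by emulating 32-bit signed wraparound (the MAX_POINTERS value below is illustrative only, not the server's actual cap):

```python
def to_int32(x):
    """Interpret x as a 32-bit signed C int (two's complement)."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

log2 = 31  # my_log2(hash_table_bytes / bucket_size) from the report

# Buggy: the shift happens in 32-bit int arithmetic, so 1 << 31 wraps
# to INT_MIN and lbuckets ends up negative.
lbuckets_buggy = to_int32(1 << log2)
assert lbuckets_buggy == -2147483648   # 0x80000000, trips Assert(nbuckets > 0)

# Fixed (in the spirit of the attached patch): do the shift in a wider
# type, e.g. 1L << ..., and clamp before narrowing to int.
MAX_POINTERS = 1 << 28                 # illustrative cap only
lbuckets_fixed = min(1 << log2, MAX_POINTERS)  # Python ints are unbounded
assert lbuckets_fixed > 0
```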
Re: [HACKERS] Freeze avoidance of very large table.
On Tue, Aug 18, 2015 at 7:27 AM, Masahiko Sawada sawada.m...@gmail.com wrote: I have encountered many cases where pg_stat_statements and pgstattuple are required in production, so I basically agree with moving such extensions into core. But IMO, the diagnostic tools for the visibility map, heap (pageinspect) and so on are a kind of debugging tool. Just because something might be required in production isn't a sufficient reason to put it in core. Debugging tools, or anything else, can be required in production, too. Attached are the latest v11 patches, separated into 2 patches: a frozen bit patch and a diagnostic function patch. Moving the diagnostic function into core is still under discussion, but this patch puts such a function into core because the diagnostic function for the visibility map needs to be in core at least to run the regression tests. As has been discussed recently, there are other ways to handle that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal: Implement failover on libpq connect level.
On Tue, Aug 18, 2015 at 6:07 AM, PostgreSQL - Hans-Jürgen Schönig postg...@cybertec.at wrote: On 18 Aug 2015, at 11:19, Albe Laurenz laurenz.a...@wien.gv.at wrote: Hans-Jürgen Schönig wrote: in addition to that you have the “problem” of transactions. if you failover in the middle of a transaction, strange things might happen from the application point of view. the good thing, however, is that stupid middleware is sometimes not able to handle failed connections. however, overall i think it is more of a danger than a benefit. Maybe I misunderstood the original proposal, but my impression was that the alternative servers would be tried only at the time the connection is established, and there would be no such problems as you describe. it would still leave the problem of having a read-only server on the other side unless you are using BDR or so. That doesn't make this a bad idea. Some people are using replication solutions that can cope with this already (EDB has a proprietary product, and I'm sure there are people using BDR, too) and, as the solutions get better and more widely deployed, more people will want to do it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] WIP: SCRAM authentication
On Tue, Aug 18, 2015 at 10:06 AM, Stephen Frost sfr...@snowman.net wrote: That was the impetus for my earlier suggestion that in a release or two, we actively make pg_upgrade error (or perhaps warn first, then error, across two releases) if any of the old verifiers exist. I think there's basically no chance of that being acceptable. The time at which it's possible to get rid of MD5 is going to vary widely between installations. People who are using only libpq or libpq-derived connectors will be able to get rid of it almost immediately, if they want, though some won't. People who are using obscure connectors that are poorly maintained may not even have a version that supports SCRAM for 5 years. Think about how long it took us to roll out the standard_conforming_strings changes, and there were still people who got bitten. The other concern with a single password verifier is that we're locking ourselves into a one-verifier-per-role solution when most of the serious solutions in use today (Active Directory, Kerberos, certificate based systems) allow for more than one. So what? If you want to delegate authentication to AD or Kerberos, we already support that. That's not a reason to invent the same functionality inside the server. If you've got a tangible plan, other than SCRAM, that would require us to support multiple verifiers, then please say what it is. If not, the mere fact that some other people support it is not a reason why we should. In fact, we generally have taken the approach that needs which are already handled adequately by other tools do not need to also be handled inside the database. That's not a perfect approach and we always argue about it around the edges, but generally, I think it's served us pretty well.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal: Implement failover on libpq connect level.
On 17 August 2015 at 23:18, Victor Wagner vi...@wagner.pp.ru wrote: Rationale = Since introduction of the WAL-based replication into the PostgreSQL, it is possible to create high-availability and load-balancing clusters. However, there is no support for failover in the client libraries. So, only way to provide transparent for client application failover is IP address migration. This approach has some limitation, i.e. it requires that master and backup servers reside in the same subnet or may not be feasible for other reasons. This is not completely true, you can always use something like pgbouncer or other middleware to change the server to which clients connect. you still need to solve the fact that you will have a read-only server at the other side. something like repmgr + pgbouncer will work fine. i agree that once/if we ever have multimaster included then this could be a good idea -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
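Connect-time-only failover, as discussed in this thread, can be sketched with a host list and an optional writability probe. Both callbacks below are hypothetical stand-ins for illustration (libpq would do the equivalent of connecting and then checking something like SELECT pg_is_in_recovery()); this is not the proposal's actual API:

```python
def connect_failover(hosts, connect, check_writable=None):
    """Try each host in turn at connection time only (no
    mid-transaction reconnect).  'connect' and 'check_writable'
    are hypothetical callbacks standing in for libpq internals."""
    last_err = None
    for host in hosts:
        try:
            conn = connect(host)
        except OSError as e:
            last_err = e
            continue
        # Optionally skip standbys so we don't hand back a
        # read-only server when the application needs to write.
        if check_writable and not check_writable(conn):
            last_err = OSError("%s is read-only (standby)" % host)
            continue
        return conn
    raise last_err or OSError("no hosts given")

# Stub demo: first host down, second is a standby, third is a primary.
state = {"a": None, "b": "standby", "c": "primary"}
def connect(h):
    if state[h] is None:
        raise OSError("connection refused")
    return h

conn = connect_failover(["a", "b", "c"], connect,
                        check_writable=lambda c: state[c] == "primary")
assert conn == "c"
```

Because failover happens only while establishing the connection, the mid-transaction concerns raised earlier in the thread do not arise: an established connection that later fails simply returns an error to the client.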
Re: [HACKERS] Potential GIN vacuum bug
On Mon, Aug 17, 2015 at 5:41 PM, Jeff Janes jeff.ja...@gmail.com wrote: User backends attempt to take the lock conditionally, because otherwise they would cause an autovacuum already holding the lock to cancel itself, which seems quite bad. Note that this is a substantial behavior change, in that with this code the user backends which find the list already being cleaned will just add to the end of the pending list and go about their business. So if things are added to the list faster than they can be cleaned up, the size of the pending list can increase without bound. Under the existing code each concurrent user backend will try to clean the pending list at the same time. The work doesn't parallelize, so doing this just burns CPU (and possibly consumes up to maintenance_work_mem for *each* backend), but it does serve to throttle the insertion rate and so keep the list from growing without bound. This is just a proof-of-concept patch, because I don't know if this approach is the right approach. I'm not sure if this is the right approach, but I'm a little wary of involving the heavyweight lock manager in this. If pending list cleanups are frequent, this could involve a lot of additional lock manager traffic, which could be bad for performance. Even if they are infrequent, it seems like it would be more natural to handle this with some regime of locks and pins and buffer cleanup locks on the buffers that are storing the pending list, rather than a heavyweight lock on the whole relation. But I am just waving my hands wildly here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
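The conditional-lock behavior described above (one backend cleans, everyone else just appends and moves on) is a generic pattern. A toy sketch with invented names, using a non-blocking lock acquire in place of the heavyweight lock manager and a plain list in place of the GIN pending list:

```python
import threading

pending = []                       # stands in for the GIN pending list
pending_lock = threading.Lock()    # guards the list contents
cleanup_lock = threading.Lock()    # stands in for the heavyweight lock

def insert(item):
    """Append to the pending list, then try a *conditional* cleanup:
    never block, and never force whoever already holds the lock (e.g.
    an autovacuum) to give it up -- just skip cleanup and return."""
    with pending_lock:
        pending.append(item)
    if cleanup_lock.acquire(blocking=False):
        try:
            clean_pending_list()
        finally:
            cleanup_lock.release()

def clean_pending_list():
    # In GIN this would move pending entries into the main index
    # structure; here we simply drain the list.
    with pending_lock:
        pending.clear()
```

This also makes the unbounded-growth concern concrete: while the cleanup lock is held, inserters keep appending without throttling, which is exactly the trade-off Jeff points out.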
Re: [HACKERS] DTrace build dependency rules
Robert Haas wrote: On Sat, Aug 15, 2015 at 6:45 PM, Mark Johnston ma...@freebsd.org wrote: The bug is in src/backend/Makefile. probes.o, the dtrace(1)-generated object file, depends on the objfiles.txt for each of the backend subdirs. These files depend in turn on the object files themselves; if objfiles.txt is out of date with respect to one of its object files, the mtime of objfiles.txt is updated with touch (see backend/common.mk). The problem is that dtrace -G, which runs at the end of the build, modifies a number of object files (it overwrites their probe sites with NOPs), thus making their corresponding objfiles.txt out of date. Then, when make install traverses the backend subdirs, it updates objfiles.txt, which causes probes.o to be rebuilt, resulting in an error from dtrace(1). Gosh, that's pretty ugly. I would have thought it would be a real no-no to update the .o file once it got generated. If nothing else, a modification to the .c file concurrent with a make invocation might lead to the .o not getting rebuilt the next time make is run. I had the same thought, and wondered for a bit whether we should instead have the compilation rules produce some intermediate file (prior to dtrace fumbling), then emit the .o from dtrace -G. OTOH this might be more trouble than it's worth for a feature that doesn't see a lot of use. The attached patch fixes the problem by having probes.o depend on object files directly, rather than on objfiles.txt. I've tested it with PostgreSQL 9.0-9.4 on FreeBSD CURRENT. I don't see a particular reason not to make this change, though. I wanted to test it on Linux yesterday but didn't get any further than installing a couple of packages. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
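The shape of the fix can be sketched as a make rule. This is an illustration only, not the actual src/backend/Makefile: `ALL_BACKEND_OBJS` is a hypothetical variable standing for the expanded per-subdirectory object list, and the dtrace invocation is the conventional `-G` form.

```make
# Sketch only (names are illustrative): make probes.o depend on the
# object files themselves rather than on the objfiles.txt stamp files.
# The stamps get re-touched after dtrace -G has already rewritten the
# objects, which is what spuriously re-triggered probes.o at install.
probes.o: utils/probes.d $(ALL_BACKEND_OBJS)
	$(DTRACE) $(DTRACEFLAGS) -G -s $< -o $@ $(ALL_BACKEND_OBJS)
```

Because the .o files' mtimes are only bumped by dtrace -G itself (which runs as part of building probes.o), depending on them directly breaks the touch-stamp feedback loop.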
Re: [HACKERS] allowing wal_level change at run time
On Tue, Aug 18, 2015 at 12:27 PM, Peter Eisentraut pete...@gmx.net wrote: On 8/18/15 9:50 AM, Tom Lane wrote: IIRC, the reason for having a wal_level parameter separate from those other ones was precisely that only wal_level had to be PGC_POSTMASTER, and you could change the others if you wanted. I think you are thinking of having split archive_mode/archive_command, so you can change archive_command at run time. If we are going to fix the mechanisms to allow dynamic changing in wal log verbosity, maybe we could simply get rid of wal_level as a user-settable parameter, and have its effective value be inferred from the active settings of the other parameters. That would be nice, but those other parameters aren't really things that we can easily look at. In the old days, you could say that archive_mode = on was a pretty sure sign that you'd need wal_level = archive, but nowadays you can run replication without archiving. We could dial wal_level up and down every time someone connects to stream WAL, but that would surely lead to complications. Also, we have no way of knowing whether someone needs wal_level hot_standby until the WAL is on the standby and the standby decides to set hot_standby = on. The permutations of what could possibly influence this setting are quite enormous, if you consider archiving, streaming, hot standby, hot standby feedback, replication slots, etc., and synchronizing all of that sounds like a mess. If archive_mode=on or max_wal_senders > 0, then we need at least wal_level=archive. Otherwise wal_level=minimal is enough. If we eliminate the distinction between wal_level=archive and wal_level=hot_standby, then we just need a way to distinguish between hot_standby and logical. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
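The inference rule at the end of the message can be written down directly. A toy sketch, not server code: the parameter names are real GUCs, but the function and the folding of archive into hot_standby are assumptions taken from the message itself.

```python
def effective_wal_level(archive_mode, max_wal_senders, logical_decoding):
    """Infer the minimum wal_level from related settings, per the rule
    in the message: archiving or any WAL sender needs at least
    'archive' (here merged with 'hot_standby', as proposed), logical
    decoding needs 'logical', and everything else can run 'minimal'."""
    if logical_decoding:
        return "logical"
    if archive_mode or max_wal_senders > 0:
        return "hot_standby"  # assumes archive/hot_standby are merged
    return "minimal"
```

The open problem in the thread is precisely that some of these inputs (a standby's hot_standby setting, future streaming clients) are not visible to the primary at the time the decision must be made.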
Re: [HACKERS] Colon Omitted From pgSQL Assignments
Hi 2015-08-18 17:41 GMT+02:00 Charles Sheridan cesh...@swbell.net: Hi, I was looking at the PL/pgSQL documentation and realized that, contrary to the spec, I've been omitting the colon ':' from assignments, e.g. writing 'x = 5' rather than the correct 'x := 5'. I don't see any error messages about this. This is an undocumented feature that should not be removed, due to possible compatibility issues. See the comments from gram.y: * Ada-based PL/SQL uses := for assignment and variable defaults, while * the SQL standard uses equals for these cases and for GET * DIAGNOSTICS, so we support both. FOR and OPEN only support :=. Personally I think this design is unfortunate, but in the end it is a small problem, not important enough to solve. Regards Pavel I am not aware of any problems due to this. I suppose that if a condition precedes this syntax, the code could be interpreted as an equality test. Should I immediately update all assignments? Are there known scenarios where this error becomes a problem? Regards, Charles
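Both spellings accepted by gram.y can be seen side by side in a short anonymous block. This is an untested sketch that needs a PostgreSQL server to run:

```sql
DO $$
DECLARE
    x integer;
    y integer;
BEGIN
    x := 5;   -- the documented, Ada-style assignment
    y = 6;    -- also accepted, per the gram.y comment quoted above
    RAISE NOTICE 'x = %, y = %', x, y;
END
$$;
```

Note that FOR bounds and OPEN ... FOR cursors accept only the `:=` form, so code using bare `=` is not uniformly portable within PL/pgSQL itself.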
Re: [HACKERS] missing documentation for partial WAL files
On Mon, Aug 17, 2015 at 2:50 PM, Peter Eisentraut pete...@gmx.net wrote: The commit message for de76884 contains some important information about the purpose and use of the new .partial WAL files. But I don't see anything about this in the documentation or another user-visible place. We should probably add something. +1. Heikki talked about this at his PGCon presentation, and I thought that was pretty helpful stuff, but not everyone was there (or will remember what he said when the time comes that they need to know). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Test code is worth the space
On Tue, Aug 18, 2015 at 2:16 PM, David Fetter da...@fetter.org wrote: I'm given to understand that this tight coupling is necessary for performance. Are you saying that it could be unwound, or that testing strategies mostly need to take it into account, or...? I'm just saying that we shouldn't expect to find a magic bullet test framework that solves all these problems. Without restructuring code, which I don't think is really feasible, we won't be able to have good unit test coverage for most existing code. It might be more practical to start using such a new tool for new code only. Then the new code could be structured in ways that allow the environment to be mocked more easily and the results observed more easily. -- greg