Re: [HACKERS] KNN-GiST with recheck

2014-09-26 Thread Alexander Korotkov
On Fri, Sep 26, 2014 at 5:18 AM, Bruce Momjian br...@momjian.us wrote:

 On Sun, Sep 14, 2014 at 11:34:26PM +0400, Alexander Korotkov wrote:
   Cost estimation of GiST is a big problem anyway. It doesn't care (and
   can't) about amount of recheck for regular operators. In this patch, same
   would be for knn recheck. The problem is that touching heap from access

 This is very important work.  While our existing KNN-GiST index code
 works fine for scalar values and point-to-point distance ordering, it
 doesn't work well for 2-dimensional objects because they are only
 indexed by their bounding boxes (a rectangle around the object).  The
 indexed bounding box can't produce accurate distances to other objects.

 As an example, see this PostGIS blog post showing how to use LIMIT in a
 CTE to filter results and then compute the closest object (search for
 LIMIT 50):


 http://shisaa.jp/postset/postgis-postgresqls-spatial-partner-part-3.html

 This patch fixes our code for distances from a point to indexed 2-D
 objects.

 Does this also fix the identical PostGIS problem or is there something
 PostGIS needs to do?


This patch provides general infrastructure for recheck in KNN-GiST. PostGIS
needs a corresponding change in its GiST opclass. Since PostGIS already
defines the <-> and <#> operators as distances to the bounding box border and
the bounding box center, it can't change their behaviour;
it has to support a new exact-distance operator in the opclass.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2014-09-26 Thread Amit Kapila
On Wed, Sep 24, 2014 at 3:18 PM, Dilip kumar dilip.ku...@huawei.com wrote:
 On 24 August 2014 11:33, Amit Kapila wrote:

 7. I think in the new mechanism the cancel handler will not work.
 In single-connection vacuum it was always set/reset
 in function executeMaintenanceCommand(). You might need
 to set/reset it in function run_parallel_vacuum().

 Good catch. Now I have called SetCancelConn(pSlot[0].connection) on the
 first connection. This will enable the cancel handler to cancel the query
 on the first connection so that the select loop will come out.


I don't think this can handle cancel requests properly, because you are
just setting it in GetIdleSlot(). What if the cancel request comes during
GetQueryResult(), after sending SQL for all connections? (That is probably
the reason why Jeff is not able to cancel vacuumdb when using the parallel
option.)




 1. I could see one shortcoming in the way the patch currently parallelizes
 the work for --analyze-in-stages. Basically, the patch performs the work
 for each stage for multiple tables in concurrent connections. That seems
 okay for the cases when the number of parallel connections is less than or
 equal to the number of tables, but for the case when the user has asked for
 more connections than there are tables, I think this strategy will not be
 able to use the extra connections.

 I think --analyze-in-stages should maintain the order.

Yes, you are right.  So let's keep the code as it is for this case.



 2. Similarly for the case of multiple databases, currently it will not be
 able to use more connections than the number of tables in each database,
 because the parallelizing strategy is to just use the concurrent
 connections for tables inside a single database.

 I think for multiple databases, to do the parallel execution we need to
 maintain multiple connections to multiple databases. And we need to
 maintain a table list for all the databases together to run them
 concurrently. I think this may impact the startup cost, as we need to
 create multiple connections and disconnect for preparing the list

Yeah, the startup cost will probably be a bit higher, but that cost we
will anyway incur during the overall operation.

 and I think it will increase the complexity also.

I understand that the complexity of the code might increase, and the
strategy to parallelize can also be different in case we want to
parallelize for the all-databases case, so let's leave it as it is unless
someone else raises a voice for the same.

Today, while again thinking about the strategy used in the patch to
parallelize the operation (vacuum database), I think we can improve it
for cases when the number of connections is less than the number of
tables in the database (which I presume will normally be the case).
Currently we are sending a command to vacuum one table per connection;
how about sending multiple commands (for example, "Vacuum t1; Vacuum t2")
on one connection?  It seems to me there are extra round trips for cases
when there are many small tables in the database and a few large tables.
Do you think we should optimize for any such cases?
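
To illustrate the round trip being saved, here is a minimal libpq sketch
(the helper and table names are hypothetical, not part of the patch): a
single PQsendQuery() call carries several VACUUM commands, and the simple
query protocol returns one PGresult per command.

#include <stdbool.h>
#include <libpq-fe.h>

/*
 * Hypothetical sketch: batch several VACUUM commands in one round trip.
 * One PGresult is returned per statement, so drain them all.
 */
static bool
run_vacuum_batch(PGconn *conn)
{
    PGresult   *res;
    bool        ok = true;

    if (!PQsendQuery(conn, "VACUUM t1; VACUUM t2;"))
        return false;

    while ((res = PQgetResult(conn)) != NULL)
    {
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            ok = false;
        PQclear(res);
    }
    return ok;
}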

Few other points
1.
+ vacuum_parallel(const char *dbname, bool full, bool verbose,
{
..
+ connSlot = (ParallelSlot*)pg_malloc(concurrentCons *
sizeof(ParallelSlot));
+ connSlot[0].connection = conn;

a.
Does the above memory get freed anywhere? If not, isn't it a good idea
to do so?
b. For slot 0, you are not setting it to non-blocking mode with
PQsetnonblocking(), whereas I think it can be used to run commands like
any other connection.

2.
+ /*
+  * If user has given the vacuum of complete db, then if
+  * any of the object vacuum failed it can be ignored and vacuuming
+  * of other object can be continued, this is the same behavior as
+  * vacuuming of complete db is handled without --jobs option
+  */

s/object/object's

3.
+ if (!completedb ||
+ (sqlState && strcmp(sqlState,
+ ERRCODE_UNDEFINED_TABLE) != 0))
+ {
+
+ fprintf(stderr, _("%s: vacuuming of
+ database \"%s\" failed: %s"),
+ progname, dbname, PQerrorMessage
+ (conn));

Indentation in both places is wrong.  Check other places for
similar issues.

4.
+ bool analyze_only, bool freeze, int numAsyncCons,

In the code there is still a reference to AsyncCons; as decided, let's
change it to concurrent_connections | conc_cons




With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


[HACKERS] copy pastos in atomics.h comments

2014-09-26 Thread Erik Rijkers
the the (6x)


--- src/include/port/atomics.h.orig	2014-09-26 09:06:10.553635053 +0200
+++ src/include/port/atomics.h	2014-09-26 09:07:14.763579885 +0200
@@ -346,7 +346,7 @@
 /*
  * pg_atomic_fetch_add_u32 - atomically add to variable
  *
- * Returns the the value of ptr before the arithmetic operation.
+ * Returns the value of ptr before the arithmetic operation.
  *
  * Full barrier semantics.
  */
@@ -360,7 +360,7 @@
 /*
  * pg_atomic_fetch_sub_u32 - atomically subtract from variable
  *
- * Returns the the value of ptr before the arithmetic operation. Note that
+ * Returns the value of ptr before the arithmetic operation. Note that
  * sub_ may not be INT_MIN due to platform limitations.
  *
  * Full barrier semantics.
@@ -376,7 +376,7 @@
 /*
  * pg_atomic_fetch_and_u32 - atomically bit-and and_ with variable
  *
- * Returns the the value of ptr before the arithmetic operation.
+ * Returns the value of ptr before the arithmetic operation.
  *
  * Full barrier semantics.
  */
@@ -390,7 +390,7 @@
 /*
  * pg_atomic_fetch_or_u32 - atomically bit-or or_ with variable
  *
- * Returns the the value of ptr before the arithmetic operation.
+ * Returns the value of ptr before the arithmetic operation.
  *
  * Full barrier semantics.
  */
@@ -404,7 +404,7 @@
 /*
  * pg_atomic_add_fetch_u32 - atomically add to variable
  *
- * Returns the the value of ptr after the arithmetic operation.
+ * Returns the value of ptr after the arithmetic operation.
  *
  * Full barrier semantics.
  */
@@ -418,7 +418,7 @@
 /*
  * pg_atomic_sub_fetch_u32 - atomically subtract from variable
  *
- * Returns the the value of ptr after the arithmetic operation. Note that sub_
+ * Returns the value of ptr after the arithmetic operation. Note that sub_
  * may not be INT_MIN due to platform limitations.
  *
  * Full barrier semantics.


Re: [HACKERS] copy pastos in atomics.h comments

2014-09-26 Thread Andres Freund
On 2014-09-26 09:18:08 +0200, Erik Rijkers wrote:
 the the (6x)

Gah. I thought I'd fixed the one Heikki had noticed, but apparently it
was an independent one. Thanks for watching/fixing. Pushed.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] RLS feature has been committed

2014-09-26 Thread Heikki Linnakangas

On 09/26/2014 01:07 AM, Stephen Frost wrote:

* Simon Riggs (si...@2ndquadrant.com) wrote:

My major reason to revert is the following: the documentation contains
no examples of real world usage, making the feature essentially
unusable out of the box.


I find this to be an interesting argument considering most of our
documentation doesn't include real-world examples.


+1 for adding examples.


This wouldn't be the only case of documentation (indeed, *any*
documentation) being added after a commit, and so I'm mystified by this
requirement for *real-world* examples in documentation to be provided
prior to commit.


IMO it depends on the situation. Yeah, we've had a lot of commits
without docs, where the documentation has been added later. I'm OK with
that if, for example, the committed patch is part of a series of patches,
where it doesn't make sense to add documentation until the whole series
has been committed. Then there are situations where the patch author
just hasn't gotten around to adding documentation yet (like in this
case). That's more questionable; documentation is an important part of
adding a feature, and there's the risk that the author never gets around
to adding documentation after the code has been committed, or only pays
lip service to it. It has usually worked out fine, though - people who
have successfully added major features are conscientious enough to get
it done.


But in many cases, lack of good documentation makes even reviewing the
patch difficult. How do you determine if the patch works as intended, if 
you don't know what it's supposed to do?


Wrt. reverting or not, I'm strongly of the opinion that whatever. If you 
just add the docs, I'm happy. This is a big and security-sensitive 
feature, and it requires documentation not only on how to use the 
feature, but an introduction to what row-level security is, what level 
of security to expect, what circumstances you would use it in, 
comparison to other approaches, the side-channels, etc. We'll probably 
have to go through a few rounds of review on the docs alone. I'd leave
it up to Stephen to decide how he wants to handle this. But I'd really
like to see the documentation added (at least posted as a patch, if not
committed) well before the next commitfest, so that Robert and others
who want to review the code have the documentation available.


- Heikki





Re: [HACKERS] [GENERAL] [SQL] pg_multixact issues

2014-09-26 Thread Dev Kumkar
On Fri, Sep 19, 2014 at 1:23 PM, Dev Kumkar devdas.kum...@gmail.com wrote:

 Apologies for the delay, was working/troubleshooting same issue and was
 away from my emails. :(
 Regards...


I received the database with a huge pg_multixact directory of size 21G;
there are ~82,000 files in pg_multixact/members and 202 files in the
pg_multixact/offsets directory.

I did run VACUUM FULL on this database and it was successful. However, now
I am not sure about the pg_multixact directory. Truncating this directory
(except the  file) results in database start-up issues; of course, this
is not the correct way of truncating.
 FATAL:  could not access status of transaction 13224692

Stumped! Please provide some comments on how to truncate the pg_multixact
files and whether these files have any impact on database
performance.

Regards...


Re: [HACKERS] RLS feature has been committed

2014-09-26 Thread Simon Riggs
On 26 September 2014 08:48, Heikki Linnakangas hlinnakan...@vmware.com wrote:

 But in many cases, lack of good documentation makes even reviewing the patch
 difficult. How do you determine if the patch works as intended, if you don't
 know what it's supposed to do?

Exactly.

Lack of review and lack of consensus are often caused by the author
not making the patch fully and genuinely accessible to peer review.
Don't say you're having problems getting buy in when you've done very
little to encourage that. Committing early is not the solution.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Replication identifiers, take 3

2014-09-26 Thread Petr Jelinek

On 26/09/14 04:44, Robert Haas wrote:

On Tue, Sep 23, 2014 at 2:24 PM, Andres Freund and...@2ndquadrant.com wrote:


Note that it depends on the replication solution whether these
external identifiers need to be coordinated across systems or not. I
think it's *good* if we don't propose a solution for that - different
replication solutions will have different requirements.


I'm pretty fuzzy on how this actually works.  Like, the short form
here is just getting injected into WAL by the apply process.  How does
it figure out what value to inject?  What if it injects a value that
doesn't have a short-to-long mapping?  What's the point of the
short-to-long mappings in the first place?  Is that only required
because of the possibility that there might be multiple replication
solutions in play on the same node?



From my perspective the short-to-long mapping is mainly a convenience
thing; the long id should be something that can be used to map the
identifier to the specific node for the purposes of configuration,
monitoring, troubleshooting, etc. You also usually don't use just Oids
to represent DB objects; I see some analogy there.
This could potentially be done by the solution itself, not by the
framework, but it seems logical (pardon the pun) that most (if not all)
solutions will want some kind of mapping of the generated ids to
something that represents the logical node.


So the answer to your first two questions depends on the specific solution:
it can map it from the connection configuration, it can get it from the
output plugin as part of the wire protocol, it can generate it based on
some internal logic, etc.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Replication identifiers, take 3

2014-09-26 Thread Andres Freund
On 2014-09-25 22:44:49 -0400, Robert Haas wrote:
 Thanks for this write-up.

 On Tue, Sep 23, 2014 at 2:24 PM, Andres Freund and...@2ndquadrant.com wrote:
  1) The ability to monitor how far replication has progressed in a
 crashsafe manner to allow it to restart at the right point after
 errors/crashes.
  2) Efficiently identify the origin of individual changes and
 transactions. In multimaster and some cascading scenarios it is
 necessary to do so to avoid sending out replayed changes again.
  3) Allow to efficiently filter out replayed changes from logical
 decoding. It's currently possible to filter changes from inside the
 output plugin, but it's much more efficient to filter them out
 before decoding.

 I agree with these goals.

 Let me try to summarize the information requirements for each of these
 things.  For #1, you need to know, after crash recovery, for each
 standby, the last commit LSN which the client has confirmed via a
 feedback message.

I'm not sure I understand what you mean here? This is all happening on
the *standby*. The standby needs to know, after crash recovery, the
latest commit LSN from the primary that it has successfully replayed.

Say you replay the following:
SET synchronous_commit = off;

BEGIN;
INSERT INTO foo ...
COMMIT /* original LSN 0/10 */;

BEGIN;
INSERT INTO foo ...
COMMIT /* original LSN 0/20 */;

BEGIN;
INSERT INTO foo ...
COMMIT /* original LSN 0/30 */;

If postgres crashes at any point during this, we need to know whether we
successfully replayed up to 0/10, 0/20 or 0/30. Note that the problem
exists independently of s_c=off; it just exacerbates the issue.

Then, after finishing recovery and discovering only 0/10 has persisted,
the standby can reconnect to the primary and do
START_REPLICATION SLOT .. LOGICAL 0/10;
and it'll receive all transactions that have committed since 0/10.

 For #2, you need to know, when decoding each
 change, what the origin node was.  And for #3, you need to know, when
 decoding each change, whether it is of local origin.  The information
 requirements for #3 are a subset of those for #2.

Right. For #3 it's more important to have the information available
efficiently on individual records.

  A rather efficient solution for 1) is to attach the 'origin node' and
  the remote commit LSN to every local commit record produced by
  replay. That allows to have a shared memory table (remote_node,
  local_lsn, remote_lsn).

 This seems OK to me, modulo some arguing about what the origin node
 information ought to look like.  People who are not using logical
 replication can use the compact form of the commit record in most
 cases, and people who are using logical replication can pay for it.

Exactly.

  Similarly, to solve the problem of identifying the origin of changes
  during decoding, the problem can be solved nicely by adding the origin
  node of every change to changes/transactions. At first it might seem
  to be sufficient to do so on transaction level, but for cascading
  scenarios it's very useful to be able to have changes from different
  source transactions combined into a larger one.

 I think this is a lot more problematic.  I agree that having the data
 in the commit record isn't sufficient here, because for filtering
 purposes (for example) you really want to identify the problematic
 transactions at the beginning, so you can chuck their WAL, rather than
 reassembling the transaction first and then throwing it out.  But
 putting the origin ID in every insert/update/delete is pretty
 unappealing from my point of view - not just because it adds bytes to
 WAL, though that's a non-trivial concern, but also because it adds
 complexity - IMHO, a non-trivial amount of complexity.  I'd be a lot
 happier with a solution where, say, we have a separate WAL record that
 says XID 1234 will be decoding for origin 567 until further notice.

I think it actually ends up much simpler than what you propose. In the
apply process, you simply execute
SELECT pg_replication_identifier_setup_replaying_from('bdr: 
this-is-my-identifier');
or its C equivalent. That sets a global variable which XLogInsert()
includes in the record.
Note that this doesn't actually require any additional space in the WAL
- padding bytes in struct XLogRecord are used to store the
identifier. These have been unused at least since 8.0.
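
For illustration only, a minimal C sketch of that pattern (the names below
are made up, not the patch's actual API): the apply process registers the
origin it is replaying from once, and the WAL insertion path stamps every
record with that id, which the real patch would store in the unused
XLogRecord padding.

#include <stdint.h>

typedef uint16_t RepOriginId;

#define INVALID_REP_ORIGIN_ID 0     /* locally originated changes */

/* Set once per apply session; read on every WAL record insertion. */
static RepOriginId current_replay_origin = INVALID_REP_ORIGIN_ID;

static void
setup_replaying_from(RepOriginId origin)
{
    current_replay_origin = origin;
}

/* Conceptual stand-in for the spot in XLogInsert() that fills the header. */
static void
stamp_record_origin(RepOriginId *xl_origin_id_field)
{
    *xl_origin_id_field = current_replay_origin;
}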

I don't think a solution which logs the change of origin will be
simpler. When the origin is in every record, you can filter without keeping
track of any state. That's different if you can switch the origin per
tx. At the very least you need an in-memory entry for the origin.

  == Replication Identifiers ==
 
  The above explains the need to have as small as possible identifiers
  for nodes. Two years back I'd suggested that we rely on the user to
  manually assign 16bit ids to individual nodes. Not very surprisingly
  that was shot down because a) 16bit numbers are not descriptive b) a
  per node identifier is problematic 

Re: [HACKERS] Scaling shared buffer eviction

2014-09-26 Thread Amit Kapila
 On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas robertmh...@gmail.com
wrote:

 But this gets at another point: the way we're benchmarking this right
 now, we're really conflating the effects of three different things:

 1. Changing the locking regimen around the freelist and clocksweep.
 2. Adding a bgreclaimer process.
 3. Raising the number of buffer locking partitions.

First of all, thanks for committing part-1 of these changes. It seems
you are planning to commit part-3 based on the results of tests which
Andres is planning to do. For the remaining part (part-2), today I have
tried some tests, the results of which are as follows:

Scale Factor - 3000, Shared_buffer - 8GB

Patch_Ver/Client_Count                                  16      32      64      128
reduce-replacement-locking.patch + 128 Buf Partitions   157732  229547  271536  245295
scalable_buffer_eviction_v9.patch                       163762  230753  275147  248309

Scale Factor - 3000, Shared_buffer - 8GB

Patch_Ver/Client_Count                                  16      32      64      128
reduce-replacement-locking.patch + 128 Buf Partitions   157781  212134  202209  171176
scalable_buffer_eviction_v9.patch                       160301  213922  208680  172720

The results indicate that in all cases there is a benefit from doing
part-2 (bgreclaimer).  The benefit at this configuration is not high,
but there might be more benefit at some higher configurations
(scale factor - 1).  Do you see any merit in pursuing part-2 further?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-09-26 Thread Heikki Linnakangas

On 09/25/2014 05:40 PM, Andres Freund wrote:

There's two reasons for that: a) dynahash just isn't very good and it
does a lot of things that will never be necessary for these hashes. b)
the key into the hash table is *far* too wide. A significant portion of
the time is spent comparing buffer/lock tags.


Hmm. Is it the comparing, or calculating the hash? We could precalculate
the hash for RelFileNode+ForkNumber, and store it in RelationData. At a
lookup, you'd only need to mix in the block number.
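
A rough sketch of that idea, with hypothetical names (the 32-bit mixing
function is a generic finalizer chosen purely for illustration, not the
hash PostgreSQL actually uses):

#include <stdint.h>

typedef struct PrehashedRelFork
{
    uint32_t    partial_hash;   /* cached hash of RelFileNode + ForkNumber */
} PrehashedRelFork;

static inline uint32_t
mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6b;
    h ^= h >> 13;
    h *= 0xc2b2ae35;
    h ^= h >> 16;
    return h;
}

/* Computed once, e.g. when the relation-level struct is built. */
static uint32_t
prehash_relfork(uint32_t spcNode, uint32_t dbNode, uint32_t relNode, int forkNum)
{
    uint32_t    h = 0;

    h = mix32(h ^ spcNode);
    h = mix32(h ^ dbNode);
    h = mix32(h ^ relNode);
    h = mix32(h ^ (uint32_t) forkNum);
    return h;
}

/* Per buffer lookup: only the block number still needs to be mixed in. */
static inline uint32_t
buffer_hash(const PrehashedRelFork *rel, uint32_t blockNum)
{
    return mix32(rel->partial_hash ^ blockNum);
}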


- Heikki





Re: [HACKERS] delta relations in AFTER triggers

2014-09-26 Thread Heikki Linnakangas

On 09/25/2014 06:54 PM, Kevin Grittner wrote:

Heikki Linnakangas hlinnakan...@vmware.com wrote:


You cast the TuplestoreRelation to Plan, and pass it to CopyPlanFields.
That will crash, because TuplestoreRelation is nothing like a Plan:


Oops.  That's a copy/paste error I should have noticed.  Fixed,
even though the node type might be going away.  Since all of this
seems to be working very well from a user point of view, I'm going
to try to generate a lot more regression tests against the existing
code before taking another run at the API, to make sure that things
don't break in the refactoring.

I didn't hit the copy/out bugs in testing so far -- any suggestions
on a test that would exercise this code?  (I'm probably missing
something obvious.)


There's some debugging code in tcop/postgres.c, search for 
COPY_PARSE_PLAN_TREES. It won't catch everything, but probably would've 
caught this one.


- Heikki





Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-26 Thread Heikki Linnakangas

On 09/26/2014 12:13 AM, Peter Geoghegan wrote:

On Thu, Sep 25, 2014 at 1:48 PM, Simon Riggs si...@2ndquadrant.com wrote:

C. Internal weirdness
Submit C based upon earlier agreed B, submit to Dec 15 CF, major patch
deadline, so we can fine tune for last CF.
Then Heikki rewrites half your patch in a better way, you thank him
and then we commit. All done.


I don't have a problem with Heikki or anyone else rewriting the value
locking part of the patch, provided it meets my requirements for such
a mechanism. Since Heikki already agreed that that standard should be
imposed, he'd hardly take issue with it now.

However, the fact is that once you actually make something like
promise tuples meet that standard, at the very least it becomes a lot
messier than you'd think. Heikki's final prototype super deleted
tuples by setting their xmin to InvalidTransactionId. We weren't sure
that that doesn't break some random other heapam code. Consider this,
for example:

https://github.com/postgres/postgres/blob/REL9_4_STABLE/src/backend/executor/execMain.c#L1961

So that looks safe in the face of setting xmin to InvalidTransactionId
in the way the later prototype patch did if you think about it for a
while, but there are other places where that is less clear. In short,
it becomes something that we have to worry about forever, because
"xmin cannot change without the tuple in the slot changing" is clearly
an invariant for certain purposes. It might accidentally fail to fail
right now, but I'm not comfortable with it.


Just to be clear: I wrote the initial patch to demonstrate what I had in 
mind, because I was not able to explain it well enough otherwise. You 
pointed out issues with it, which I then fixed. You then pointed out 
more issues, which I then fixed again.


My patch version was a proof of concept, to demonstrate that it can be 
done. What I'd like you to do now, as the patch author, is to take the 
promise tuple approach and clean it up. If the xmin stuff is ugly, 
figure out some other way to do it.



Now, I might be convinced that that's actually the way to go. I have
an open mind. But that will take discussion. I like that page
hwlocking is something that many systems do (even including Oracle, I
believe). Making big changes to nbtree is always something that
deserves to be met with skepticism, but it is nice to have an
implementation that lives in the head of AM.


I don't know what you mean by "in the head of AM", but IMO it would be
far better if we can implement this outside the index AMs. Then it will 
work with any index AM.


BTW, in the discussions, you pointed out that exclusion constraints 
currently behave differently from a unique index, when two backends 
insert a tuple at the same time. With a unique index, one of them will 
fail, but one is always guaranteed to succeed. With an exclusion 
constraint, they can both fail if you're unlucky. I think the promise 
tuples would allow us to fix that, too, while we're at it. In fact, you 
might consider tackling that first, and build the new INSERT ON CONFLICT 
syntax on top of that. Basically, an INSERT to a table with an exclusion 
constraint would be the same as INSERT ON CONFLICT "throw an error".
That would be a useful way to split this patch into two parts.


- Heikki





Re: [HACKERS] Scaling shared buffer eviction

2014-09-26 Thread Andres Freund
On 2014-09-26 15:04:54 +0300, Heikki Linnakangas wrote:
 On 09/25/2014 05:40 PM, Andres Freund wrote:
 There's two reasons for that: a) dynahash just isn't very good and it
 does a lot of things that will never be necessary for these hashes. b)
 the key into the hash table is *far* too wide. A significant portion of
 the time is spent comparing buffer/lock tags.
 
 Hmm. Is it the comparing, or calculating the hash?

Neither, really. The hash calculation is visible in the profile, but not
that pronounced yet. The primary thing noticeable in profiles (besides
cache efficiency) is the comparison of the full tag after locating a
possible match in a bucket. 20 byte memcmp's aren't free.

Besides making the hashtable more efficient, a smaller key (say, 4 byte
relfilenode, 4 byte blocknumber) would also make using a radix tree or
similar more realistic. I've prototyped that once and it has nice
properties, but the tree is too deep.  Obviously it'd also help to make
buffer descriptors smaller, which is also good from a cache efficiency
perspective...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-26 Thread Andres Freund
On 2014-09-26 15:24:21 +0300, Heikki Linnakangas wrote:
 I don't know what you mean by in the head of AM, but IMO it would be far
 better if we can implement this outside the index AMs. Then it will work
 with any index AM.

Also, it's the only chance to make this ever work across partitions.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-26 Thread Heikki Linnakangas

On 09/26/2014 03:30 PM, Andres Freund wrote:

On 2014-09-26 15:24:21 +0300, Heikki Linnakangas wrote:

I don't know what you mean by in the head of AM, but IMO it would be far
better if we can implement this outside the index AMs. Then it will work
with any index AM.


Also, it's the only chance to make this ever work across partitions.


How so? Assuming there's no overlap in the partitions, you could lock 
the page in the index of the partition you're inserting to, just like 
you would insert the promise tuple to the right partition.


- Heikki





Re: [HACKERS] Inefficient barriers on solaris with sun cc

2014-09-26 Thread Oskari Saarenmaa

25.09.2014, 16:34, Andres Freund kirjoitti:

Binaries compiled on solaris using sun studio cc currently don't have
compiler and memory barriers implemented. That means we fall back to
relatively slow generic implementations for those. Especially compiler,
read, write barriers will be much slower than necessary (since they all
just need to prevent compiler reordering as both sparc and x86 are run
in TSO mode under solaris).


Attached patch implements compiler and memory barriers for Solaris 
Studio based on documentation at

http://docs.oracle.com/cd/E18659_01/html/821-1383/gjzmf.html

I defined read and write barriers as acquire and release barriers 
instead of pure read and write ones as that's what other platforms 
appear to do.


/ Oskari
From 0d1ee2b1d720a6ff1ae6b4707356e198b763fccf Mon Sep 17 00:00:00 2001
From: Oskari Saarenmaa o...@ohmu.fi
Date: Fri, 26 Sep 2014 15:05:34 +0300
Subject: [PATCH] atomics: add compiler and memory barriers for solaris studio
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 configure |  2 +-
 configure.in  |  2 +-
 src/include/pg_config.h.in|  3 +++
 src/include/port/atomics/generic-sunpro.h | 17 +
 4 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/configure b/configure
index f0580ce..6aa55d1 100755
--- a/configure
+++ b/configure
@@ -9163,7 +9163,7 @@ fi
 done
 
 
-for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
+for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
 do :
   as_ac_Header=`$as_echo ac_cv_header_$ac_header | $as_tr_sh`
 ac_fn_c_check_header_mongrel $LINENO $ac_header $as_ac_Header $ac_includes_default
diff --git a/configure.in b/configure.in
index 527b076..6dc9c08 100644
--- a/configure.in
+++ b/configure.in
@@ -1016,7 +1016,7 @@ AC_SUBST(UUID_LIBS)
 ##
 
 dnl sys/socket.h is required by AC_FUNC_ACCEPT_ARGTYPES
-AC_CHECK_HEADERS([atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h])
+AC_CHECK_HEADERS([atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h])
 
 # On BSD, test for net/if.h will fail unless sys/socket.h
 # is included first.
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index ddcf4b0..3e78d65 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -340,6 +340,9 @@
 /* Define to 1 if `long long int' works and is 64 bits. */
 #undef HAVE_LONG_LONG_INT_64
 
+/* Define to 1 if you have the mbarrier.h header file. */
+#undef HAVE_MBARRIER_H
+
 /* Define to 1 if you have the `mbstowcs_l' function. */
 #undef HAVE_MBSTOWCS_L
 
diff --git a/src/include/port/atomics/generic-sunpro.h b/src/include/port/atomics/generic-sunpro.h
index b71b523..faab3d7 100644
--- a/src/include/port/atomics/generic-sunpro.h
+++ b/src/include/port/atomics/generic-sunpro.h
@@ -17,6 +17,23 @@
  * -
  */
 
+#ifdef HAVE_MBARRIER_H
+#include <mbarrier.h>
+
+#define pg_compiler_barrier_impl()	__compiler_barrier()
+
+#ifndef pg_memory_barrier_impl
+#	define pg_memory_barrier_impl()		__machine_rw_barrier()
+#endif
+#ifndef pg_read_barrier_impl
+#	define pg_read_barrier_impl()		__machine_acq_barrier()
+#endif
+#ifndef pg_write_barrier_impl
+#	define pg_write_barrier_impl()		__machine_rel_barrier()
+#endif
+
+#endif /* HAVE_MBARRIER_H */
+
 /* Older versions of the compiler don't have atomic.h... */
 #ifdef HAVE_ATOMIC_H
 
-- 
2.1.0




Re: [HACKERS] Inefficient barriers on solaris with sun cc

2014-09-26 Thread Robert Haas
On Fri, Sep 26, 2014 at 8:36 AM, Oskari Saarenmaa o...@ohmu.fi wrote:
 25.09.2014, 16:34, Andres Freund kirjoitti:
 Binaries compiled on solaris using sun studio cc currently don't have
 compiler and memory barriers implemented. That means we fall back to
 relatively slow generic implementations for those. Especially compiler,
 read, write barriers will be much slower than necessary (since they all
 just need to prevent compiler reordering as both sparc and x86 are run
 in TSO mode under solaris).

 Attached patch implements compiler and memory barriers for Solaris Studio
 based on documentation at
 http://docs.oracle.com/cd/E18659_01/html/821-1383/gjzmf.html

 I defined read and write barriers as acquire and release barriers instead of
 pure read and write ones as that's what other platforms appear to do.

So you think a read barrier is the same thing as an acquire barrier
and a write barrier is the same as a release barrier?  That would be
surprising.  It's certainly not true in general.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-26 Thread Andres Freund
On 2014-09-26 15:32:35 +0300, Heikki Linnakangas wrote:
 On 09/26/2014 03:30 PM, Andres Freund wrote:
 On 2014-09-26 15:24:21 +0300, Heikki Linnakangas wrote:
 I don't know what you mean by in the head of AM, but IMO it would be far
 better if we can implement this outside the index AMs. Then it will work
 with any index AM.
 
 Also, it's the only chance to make this ever work across partitions.
 
 How so? Assuming there's no overlap in the partitions, you could lock the
 page in the index of the partition you're inserting to, just like you would
 insert the promise tuple to the right partition.

Well, the 'no overlap' case is boring. At least if you mean that each
partition has distinct value ranges in the index?

And the reason that the buffer locking approach is problematic in the overlapping case
is that you'd need to hold a large number of pages locked at the same
time. Right?

But primarily I mean that bulk of the uniqueness checking logic has to
live outside the individual AMs. It doesn't sound enticing to reach from
inside one AM into another partition's index to do stuff.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Inefficient barriers on solaris with sun cc

2014-09-26 Thread Oskari Saarenmaa

26.09.2014, 15:39, Robert Haas kirjoitti:

On Fri, Sep 26, 2014 at 8:36 AM, Oskari Saarenmaa o...@ohmu.fi wrote:

25.09.2014, 16:34, Andres Freund kirjoitti:

Binaries compiled on solaris using sun studio cc currently don't have
compiler and memory barriers implemented. That means we fall back to
relatively slow generic implementations for those. Especially compiler,
read, write barriers will be much slower than necessary (since they all
just need to prevent compiler reordering as both sparc and x86 are run
in TSO mode under solaris).


Attached patch implements compiler and memory barriers for Solaris Studio
based on documentation at
http://docs.oracle.com/cd/E18659_01/html/821-1383/gjzmf.html

I defined read and write barriers as acquire and release barriers instead of
pure read and write ones as that's what other platforms appear to do.


So you think a read barrier is the same thing as an acquire barrier
and a write barrier is the same as a release barrier?  That would be
surprising.  It's certainly not true in general.


The above doc describes the difference: read barrier requires loads 
before the barrier to be completed before loads after the barrier - an 
acquire barrier is the same, but it also requires loads to be complete 
before stores after the barrier.


Similarly write barrier requires stores before the barrier to be 
completed before stores after the barrier - a release barrier is the 
same, but it also requires loads before the barrier to be completed 
before stores after the barrier.


So acquire is read + loads-before-stores and release is write + 
loads-before-stores.


The generic gcc atomics also define read barrier to __ATOMIC_ACQUIRE and 
write barrier to __ATOMIC_RELEASE.


/ Oskari




Re: [HACKERS] GIN improvements part2: fast scan

2014-09-26 Thread Heikki Linnakangas

On 09/25/2014 09:05 PM, Thom Brown wrote:

On 12 March 2014 16:29, Heikki Linnakangas hlinnakan...@vmware.com wrote:

Good point. We have done two major changes to GIN in this release cycle:
changed the data page format and made it possible to skip items without
fetching all the keys (fast scan). gincostestimate doesn't know about
either change.


Did this cost estimation issue get fixed?


Nope.


If not, and if this is due for 9.5, can this be removed from the 9.4 open items 
list?


Removed and moved to the TODO list. It's a pity, but no-one seems to 
have a great idea on what to do about the cost estimation, nor interest 
in working on it.


- Heikki





Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-26 Thread Heikki Linnakangas

On 09/26/2014 03:40 PM, Andres Freund wrote:

On 2014-09-26 15:32:35 +0300, Heikki Linnakangas wrote:

On 09/26/2014 03:30 PM, Andres Freund wrote:

On 2014-09-26 15:24:21 +0300, Heikki Linnakangas wrote:

I don't know what you mean by in the head of AM, but IMO it would be far
better if we can implement this outside the index AMs. Then it will work
with any index AM.


Also, it's the only chance to make this ever work across partitions.


How so? Assuming there's no overlap in the partitions, you could lock the
page in the index of the partition you're inserting to, just like you would
insert the promise tuple to the right partition.


Well, the 'no overlap' case is boring.


Ok.


At least if you mean that each partition has distinct value ranges
in the index?


Right.


And the reason that the buffer locking approach is problematic in the overlapping case
is that you'd need to hold a large number of pages locked at the same
time. Right?


Yeah, you would. To be honest, I didn't even think about the overlapping
case; I just assumed that the non-overlapping case is the typical one and
only thought about that.



But primarily I mean that bulk of the uniqueness checking logic has to
live outside the individual AMs. It doesn't sound enticing to reach from
inside one AM into another partition's index to do stuff.


Yeah, that's a non-starter. Even with the index locking stuff, though, 
it wouldn't be the AM's responsibility to reach out to other partitions.


- Heikki





Re: [HACKERS] Inefficient barriers on solaris with sun cc

2014-09-26 Thread Andres Freund
On 2014-09-26 08:39:38 -0400, Robert Haas wrote:
 On Fri, Sep 26, 2014 at 8:36 AM, Oskari Saarenmaa o...@ohmu.fi wrote:
  25.09.2014, 16:34, Andres Freund kirjoitti:
  Binaries compiled on solaris using sun studio cc currently don't have
  compiler and memory barriers implemented. That means we fall back to
  relatively slow generic implementations for those. Especially compiler,
  read, write barriers will be much slower than necessary (since they all
  just need to prevent compiler reordering as both sparc and x86 are run
  in TSO mode under solaris).
 
  Attached patch implements compiler and memory barriers for Solaris Studio
  based on documentation at
  http://docs.oracle.com/cd/E18659_01/html/821-1383/gjzmf.html
 
  I defined read and write barriers as acquire and release barriers instead of
  pure read and write ones as that's what other platforms appear to do.
 
 So you think a read barrier is the same thing as an acquire barrier
 and a write barrier is the same as a release barrier?  That would be
 surprising.  It's certainly not true in general.

It's generally true that a read barrier is implied by an acquire
barrier, no? Same for write barriers being implied by release
barriers. Neither is true the other way round, but that's fine.

Given how postgres uses memory barriers we actually could declare
read/write barriers to be compiler barriers when on solaris. Both
supported architectures (x86, sparc) are run in TSO mode. As the
existing barrier code for x86 says:
 * Both 32 and 64 bit x86 do not allow loads to be reordered with other loads,
 * or stores to be reordered with other stores, but a load can be performed
 * before a subsequent store.
 *
 * Technically, some x86-ish chips support uncached memory access and/or
 * special instructions that are weakly ordered.  In those cases we'd need
 * the read and write barriers to be lfence and sfence.  But since we don't
 * do those things, a compiler barrier should be enough.
 *
 * "lock; addl" has worked for longer than "mfence". It's also rumored to be
 * faster in many scenarios

Unless I miss something the same is true for sparc *in solaris
userland*. But I'd be perfectly happy to go with something like Oskari's
version because it's still much better than the current code.
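
For what it's worth, a hedged sketch of that alternative (not the committed
code), reusing the mbarrier.h interface from Oskari's patch: under TSO the
read and write barriers collapse to compiler barriers, while the full
memory barrier stays a real fence.

#ifdef HAVE_MBARRIER_H
#include <mbarrier.h>

/* Sketch only: assumes Solaris x86/sparc userland running in TSO mode. */
#define pg_compiler_barrier_impl()	__compiler_barrier()
#define pg_memory_barrier_impl()	__machine_rw_barrier()
#define pg_read_barrier_impl()		__compiler_barrier()
#define pg_write_barrier_impl()		__compiler_barrier()

#endif	/* HAVE_MBARRIER_H */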

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-09-26 Thread Robert Haas
On Fri, Sep 26, 2014 at 7:40 AM, Amit Kapila amit.kapil...@gmail.com
wrote:

 First of all, thanks for committing part-1 of these changes. It seems
 you are planning to commit part-3 based on the results of tests which
 Andres is planning to do. For the remaining part (part-2), today I have
 tried some tests, the results of which are as follows:

 Scale Factor - 3000, Shared_buffer - 8GB

 Patch_Ver/Client_Count                                  16      32      64      128
 reduce-replacement-locking.patch + 128 Buf Partitions   157732  229547  271536  245295
 scalable_buffer_eviction_v9.patch                       163762  230753  275147  248309

 Scale Factor - 3000, Shared_buffer - 8GB

 Patch_Ver/Client_Count                                  16      32      64      128
 reduce-replacement-locking.patch + 128 Buf Partitions   157781  212134  202209  171176
 scalable_buffer_eviction_v9.patch                       160301  213922  208680  172720

 The results indicate that in all cases there is a benefit from doing
 part-2 (bgreclaimer).  The benefit at this configuration is not high,
 but there might be more benefit at some higher configurations
 (scale factor - 1).  Do you see any merit in pursuing part-2 further?


Interesting results.  Thanks for gathering this data.

If this is the best we can do with the bgreclaimer, I think the case for
pursuing it is somewhat marginal.  The biggest jump you've got seems to be
at scale factor 3000 with 64 clients, where you picked up about 4%.  4%
isn't nothing, but it's not a lot, either.  On the other hand, this might
not be the best we can do.  There may be further improvements to
bgreclaimer that make the benefit larger.

Backing up a bit, to what extent have we actually solved the problem here?
If we had perfectly removed all of the scalability bottlenecks, what would
we expect to see?  You didn't say which machine this testing was done on,
or how many cores it had, but for example on the IBM POWER7 machine, we
probably wouldn't expect the throughput at 64 clients to be 4 times the
throughput at 16 cores because up to 16 clients each one can have a full
CPU core, whereas after that and out to 64 each is getting a hardware
thread, which is not quite as good.  Still, we'd expect performance to go
up, or at least not go down.  Your data shows a characteristic performance
knee: between 16 and 32 clients we go up, but then between 32 and 64 we go
down, and between 64 and 128 we go down more.  You haven't got enough data
points there to show very precisely where the knee is, but unless you
tested this on a smaller box than what you have been using, we're certainly
hitting the knee sometime before we run out of physical cores.  That
implies a remaining contention bottleneck.

My results from yesterday were a bit different.  I tested 1 client, 8
clients, and multiples of 16 clients out to 96.  With
reduce-replacement-locking.patch + 128 buffer mapping partitions,
performance continued to rise all the way out to 96 clients.  It definitely
wasn't linearly, but it went up, not down.  I don't know why this is
different from what you are seeing.  Anyway there's a little more ambiguity
there about how much contention remains, but my bet is that there is at
least some contention that we could still hope to remove.  We need to
understand where that contention is.  Are the buffer mapping locks still
contended?  Is the new spinlock contended?  Are there other contention
points?  I won't be surprised if it turns out that the contention is on the
new spinlock and that a proper design for bgreclaimer is the best way to
remove that contention -- but I also won't be surprised if it turns out
that there are bigger wins elsewhere.  So I think you should try to figure
out where the remaining contention is first, and then we can discuss what
to do about it.

On another point, I think it would be a good idea to rebase the bgreclaimer
patch over what I committed, so that we have a clean patch against master
to test with.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2014-09-26 Thread Alvaro Herrera
Amit Kapila wrote:

 Today while again thinking about the startegy used in patch to
 parallelize the operation (vacuum database), I think we can
 improve the same for cases when number of connections are
 lesser than number of tables in database (which I presume
 will normally be the case).  Currently we are sending command
 to vacuum one table per connection, how about sending multiple
 commands (example Vacuum t1; Vacuum t2) on one connection.
 It seems to me there is extra roundtrip for cases when there
 are many small tables in database and few large tables.  Do
 you think we should optimize for any such cases?

I don't think this is a good idea; at least not in a first cut of this
patch.  It's easy to imagine that a table you initially think is small
enough turns out to have grown much larger since last analyze.  In that
case, putting one worker to process that one together with some other
table could end up being bad for parallelism, if later it turns out that
some other worker has no table to process.  (Table t2 in your example
could grown between the time the command is sent and t1 is vacuumed.)

It's simpler to have workers do one thing at a time only.

I don't think it's a very good idea to call pg_relation_size() on every
table in the database from vacuumdb.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-09-26 Thread Heikki Linnakangas

On 09/26/2014 03:26 PM, Andres Freund wrote:

On 2014-09-26 15:04:54 +0300, Heikki Linnakangas wrote:

On 09/25/2014 05:40 PM, Andres Freund wrote:

There's two reasons for that: a) dynahash just isn't very good and it
does a lot of things that will never be necessary for these hashes. b)
the key into the hash table is *far* too wide. A significant portion of
the time is spent comparing buffer/lock tags.


Hmm. Is it the comparing, or calculating the hash?


Neither, really. The hash calculation is visible in the profile, but not
that pronounced yet. The primary thing noticeable in profiles (besides
cache efficiency) is the comparison of the full tag after locating a
possible match in a bucket. 20 byte memcmp's aren't free.


Hmm. We could provide a custom compare function instead of relying on 
memcmp. We can do somewhat better than a generic memcmp when we know that
the BufferTag is MAXALIGNed (is it? at least it's 4 bytes aligned), and 
it's always exactly 20 bytes. I wonder if you're actually just seeing a 
cache miss showing up in the profile, though.
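
For illustration, a sketch of such a specialized comparison; the struct
layout and names below are illustrative stand-ins, not the actual
BufferTag definition:

#include <stdbool.h>
#include <stdint.h>

typedef struct IllustrativeBufferTag
{
    uint32_t    spcNode;    /* tablespace */
    uint32_t    dbNode;     /* database */
    uint32_t    relNode;    /* relation */
    int32_t     forkNum;    /* fork number */
    uint32_t    blockNum;   /* block number */
} IllustrativeBufferTag;    /* five 4-byte fields = 20 bytes */

static inline bool
buffertag_equal(const IllustrativeBufferTag *a, const IllustrativeBufferTag *b)
{
    /*
     * Compare five aligned 32-bit words instead of a byte-wise memcmp;
     * checking relNode and blockNum first rejects most mismatches early.
     */
    return a->relNode == b->relNode &&
        a->blockNum == b->blockNum &&
        a->forkNum == b->forkNum &&
        a->dbNode == b->dbNode &&
        a->spcNode == b->spcNode;
}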


- Heikki





Re: [HACKERS] Replication identifiers, take 3

2014-09-26 Thread Robert Haas
On Fri, Sep 26, 2014 at 5:05 AM, Andres Freund and...@2ndquadrant.com wrote:
 Let me try to summarize the information requirements for each of these
 things.  For #1, you need to know, after crash recovery, for each
 standby, the last commit LSN which the client has confirmed via a
 feedback message.

 I'm not sure I understand what you mean here? This is all happening on
 the *standby*. The standby needs to know, after crash recovery, the
 latest commit LSN from the primary that it has successfully replayed.

Ah, sorry, you're right: so, you need to know, after crash recovery,
for each machine you are replicating *from*, the last transaction (in
terms of LSN) from that server that you successfully replayed.

  Similarly, to solve the problem of identifying the origin of changes
  during decoding, the problem can be solved nicely by adding the origin
  node of every change to changes/transactions. At first it might seem
  to be sufficient to do so on transaction level, but for cascading
  scenarios it's very useful to be able to have changes from different
  source transactions combined into a larger one.

 I think this is a lot more problematic.  I agree that having the data
 in the commit record isn't sufficient here, because for filtering
 purposes (for example) you really want to identify the problematic
 transactions at the beginning, so you can chuck their WAL, rather than
 reassembling the transaction first and then throwing it out.  But
 putting the origin ID in every insert/update/delete is pretty
 unappealing from my point of view - not just because it adds bytes to
 WAL, though that's a non-trivial concern, but also because it adds
 complexity - IMHO, a non-trivial amount of complexity.  I'd be a lot
 happier with a solution where, say, we have a separate WAL record that
 says XID 1234 will be decoding for origin 567 until further notice.

 I think it actually ends up much simpler than what you propose. In the
 apply process, you simply execute
 SELECT pg_replication_identifier_setup_replaying_from('bdr: 
 this-is-my-identifier');
  or its C equivalent. That sets a global variable which XLogInsert()
  includes in the record.
 Note that this doesn't actually require any additional space in the WAL
 - padding bytes in struct XLogRecord are used to store the
 identifier. These have been unused at least since 8.0.

Sure, that's simpler for logical decoding, for sure.  That doesn't
make it the right decision for the system overall.

 I don't think a solution which logs the change of origin will be
  simpler. When the origin is in every record, you can filter without keeping
  track of any state. That's different if you can switch the origin per
  tx. At the very least you need an in-memory entry for the origin.

But again, that complexity pertains only to logical decoding.
Somebody who wants to tweak the WAL format for an UPDATE in the future
doesn't need to understand how this works, or care.  You know me: I've
been a huge advocate of logical decoding.  But just like row-level
security or BRIN indexes or any other feature, I think it needs to be
designed in a way that minimizes the impact it has on the rest of the
system.  I simply don't believe your contention that this isn't adding
any complexity to the code path for regular DML operations.  It's
entirely possible we could need bit space in those records in the
future for something that actually pertains to those operations; if
you've burned it for logical decoding, it'll be difficult to claw it
back.  And what if Tom gets around, some day, to doing that pluggable
heap AM work?  Then every heap AM has got to allow for those bits, and
maybe that doesn't happen to be free for them.

Admittedly, these are hypothetical scenarios, but I don't think
they're particularly far-fetched.  And as a fringe benefit, if you do
it the way that I'm proposing, you can use an OID instead of a 16-bit
thing that we picked to be 16 bits because that happens to be 100% of
the available bit-space.  Yeah, there's some complexity on decoding,
but it's minimal: one more piece of fixed-size state to track per XID.
That's trivial compared to what you've already got.

 What's the point of the short-to-long mappings in the first place?  Is
 that only required because of the possibility that there might be
 multiple replication solutions in play on the same node?

 In my original proposal, 2 years+ back, I only used short numeric
 ids. And people didn't like it because it requires coordination between
 the replication solutions and possibly between servers. Using a string
 identifier like in the above allows to easily build unique names; and
 allows every solution to add the information it needs into replication
 identifiers.

I get that, but what I'm asking is why those mappings can't be managed
on a per-replication-solution basis.  I think that's just because
there's a limited namespace and so coordination is needed between
multiple replication solutions that might possibly be running 

Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-26 Thread Andres Freund
On 2014-09-26 15:58:17 +0300, Heikki Linnakangas wrote:
 On 09/26/2014 03:40 PM, Andres Freund wrote:
 And the reason that the buffer locking approach doesn't work in the
 overlapping case is that you'd need to hold a large number of pages locked
 at the same time. Right?
 
 Yeah, you would. To be honest, I didn't even think about the overlapping
 case, I just assumed that the non-overlapping case is the typical one and only
 thought about that.

I think it's actually quite common to want to have uniqueness constraint
overlapping partitions. Consider e.g. partitioning on the username. You
might still want to ensure emails are unique.

 But primarily I mean that the bulk of the uniqueness checking logic has to
 live outside the individual AMs. It doesn't sound enticing to reach from
 inside one AM into another partition's index to do stuff.
 
 Yeah, that's a non-starter. Even with the index locking stuff, though, it
 wouldn't be the AM's responsibility to reach out to other partitions.

I'm thinking of the way btree currently does uniqueness checks. Unless
you move a large chunk of that out of the AM you'll have a hard time
building anything crossing partitions based on it. At least I can't see
how.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-09-26 Thread Robert Haas
On Fri, Sep 26, 2014 at 8:26 AM, Andres Freund and...@2ndquadrant.com wrote:
 Neither, really. The hash calculation is visible in the profile, but not
 that pronounced yet. The primary thing noticeable in profiles (besides
 cache efficiency) is the comparison of the full tag after locating a
 possible match in a bucket. 20 byte memcmp's aren't free.

Would it be faster to test the relfilenode first, and then test the
rest only if that matches?

One idea for improving this is to get rid of relation forks.  Like all
true PostgreSQL patriots, I love the free space map and the visibility
map passionately, and I believe that those features are one of the
biggest contributors to the success of the project over the last
half-decade.  But I'd love them even more if they didn't triple our
rate of inode consumption and bloat our buffer tags.  More, it's just
not an extensible mechanism: too many things have to loop over all
forks, and it just doesn't scale to keep adding more of them.  If we
added a metapage to each heap, we could have the FSM and VM have their
own relfilenode and just have the heap point at them.  Or (maybe
better still) we could store the data in the heap itself.

It would be a lot of work, though.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-09-26 Thread Ants Aasma
On Fri, Sep 26, 2014 at 3:26 PM, Andres Freund and...@2ndquadrant.com wrote:
 Neither, really. The hash calculation is visible in the profile, but not
 that pronounced yet. The primary thing noticeable in profiles (besides
 cache efficiency) is the comparison of the full tag after locating a
 possible match in a bucket. 20 byte memcmp's aren't free.

I'm not arguing against a more efficient hash table, but one simple
win would be to have a buffer tag specific comparison function. I mean
something like the attached patch. This way we avoid the loop counter
overhead, can check the blocknum first and possibly have better branch
prediction.

Do you have a workload where I could test if this helps alleviate the
comparison overhead?

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 7a38f2f..ba37b5f 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -34,6 +34,7 @@ typedef struct
 
 static HTAB *SharedBufHash;
 
+static int BufTableCmpBufferTag(BufferTag *a, BufferTag *b, int n);
 
 /*
  * Estimate space needed for mapping hashtable
@@ -46,6 +47,19 @@ BufTableShmemSize(int size)
 }
 
 /*
+ * Compare contents of two buffer tags. We use this instead of the default
+ * memcmp to minimize comparison overhead.
+ */
+static int
+BufTableCmpBufferTag(BufferTag *a, BufferTag *b, int n)
+{
+	Assert(offsetof(BufferTag, blockNum) == 2*sizeof(uint64));
+	return (a->blockNum == b->blockNum
+			 && ((uint64*)a)[0] == ((uint64*)b)[0]
+			 && ((uint64*)a)[1] == ((uint64*)b)[1]) ? 0 : 1;
+}
+
+/*
  * Initialize shmem hash table for mapping buffers
  *		size is the desired hash table size (possibly more than NBuffers)
  */
@@ -61,6 +75,7 @@ InitBufTable(int size)
 	info.entrysize = sizeof(BufferLookupEnt);
 	info.hash = tag_hash;
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
+	info.match = BufTableCmpBufferTag;
 
 	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
   size, size,



[HACKERS] WITH CHECK and Column-Level Privileges

2014-09-26 Thread Stephen Frost
All,

  Through continued testing, we've discovered an issue in the
  WITH CHECK OPTION code when it comes to column-level privileges
  which impacts 9.4.

  It's pretty straight-forward, thankfully, but:

postgres=# create view myview
postgres-# with (security_barrier = true,
postgres-# check_option = 'local')
postgres-# as select * from passwd where username = current_user;
CREATE VIEW
postgres=# grant select (username) on myview to public;
GRANT
postgres=# grant update on myview to public;
GRANT
postgres=# set role alice;
SET
postgres=> update myview set username = 'joe';
ERROR:  new row violates WITH CHECK OPTION for myview
DETAIL:  Failing row contains (joe, abc).

  Note that the entire failing tuple is returned, including the
  'password' column, even though the 'alice' user does not have select
  rights on that column.

  The detail information is useful for debugging, but I believe we have
  to remove it from the error message.
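
  The shape of that change, for illustration only (a sketch, not the
  committed fix), would be to keep the error but drop the errdetail that
  echoes the whole row, roughly like this in ExecWithCheckOptions() (the
  wco->viewname field name is assumed here):

/* Sketch only: report the violation without echoing the failing row,
 * since the row may contain column values the user has no SELECT
 * privilege on. */
ereport(ERROR,
		(errcode(ERRCODE_WITH_CHECK_OPTION_VIOLATION),
		 errmsg("new row violates WITH CHECK OPTION for \"%s\"",
				wco->viewname)));
/* the errdetail("Failing row contains (%s).", ...) is intentionally omitted */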

  Barring objections, and in the hopes of getting the next beta out the
  door soon, I'll move forward with this change and back-patch it to
  9.4 after a few hours (or I can do it tomorrow if there is contention;
  I don't know what, if any, specific plans there are for the next beta,
  just that it's hopefully 'soon').
  
  To hopefully shorten the discussion about 9.4, I'll clarify that I'm
  happy to discuss trying to re-work this in 9.5 to include what columns
  the user should be able to see (if there is consensus that we should
  do that at all)  but I don't see that as a change which should be
  back-patched to 9.4 at this point given that we're trying to get it
  out the door.

Thanks!

Stephen




Re: [HACKERS] Replication identifiers, take 3

2014-09-26 Thread Andres Freund
On 2014-09-26 09:53:09 -0400, Robert Haas wrote:
 On Fri, Sep 26, 2014 at 5:05 AM, Andres Freund and...@2ndquadrant.com wrote:
  Let me try to summarize the information requirements for each of these
  things.  For #1, you need to know, after crash recovery, for each
  standby, the last commit LSN which the client has confirmed via a
  feedback message.
 
  I'm not sure I understand what you mean here? This is all happening on
  the *standby*. The standby needs to know, after crash recovery, the
  latest commit LSN from the primary that it has successfully replayed.
 
 Ah, sorry, you're right: so, you need to know, after crash recovery,
 for each machine you are replicating *from*, the last transaction (in
 terms of LSN) from that server that you successfully replayed.

Precisely.

  I don't think a solution which logs the change of origin will be
  simpler. When the origin is in every record, you can filter without keeping
  track of any state. That's different if you can switch the origin per
  tx. At the very least you need an in-memory entry for the origin.
 
 But again, that complexity pertains only to logical decoding.

 Somebody who wants to tweak the WAL format for an UPDATE in the future
 doesn't need to understand how this works, or care.

I agree that that's a worthy goal. But I don't see how this isn't the
case with what I propose? This isn't happening on the level of
individual rmgrs/ams - there've been two padding bytes after 'xl_rmid'
in struct XLogRecord for a long time.

There's also the significant advantage that not basing this on the xid
allows it to work correctly with records not tied to a
transaction. There's not that much of that happening yet, but I've
several features in mind:

* separately replicate 2PC commits. 2PC commits don't have an xid
  anymore... With some tooling on top, replicating 2PC in two phases
  allows for very cool stuff, like optionally synchronous multimaster
  replication.
* I have a pending patch that allows sending 'messages' through logical
  decoding - yielding a really fast and persistent queue. For that it's
  useful to have transactional *and* nontransactional messages.
* Sanely replicating CONCURRENTLY stuff gets harder if you tie things to
  the xid.

The absolutely, super, uber most convincing reason is:
It's trivial to build tools to analyze how much WAL traffic is generated
by which replication stream and how much originates locally. A
pg_xlogdump --stats=replication_identifier wouldn't be hard ;)

 You know me: I've
 been a huge advocate of logical decoding.  But just like row-level
 security or BRIN indexes or any other feature, I think it needs to be
 designed in a way that minimizes the impact it has on the rest of the
 system.

Totally agreed. And that always will take some arguing...

 I simply don't believe your contention that this isn't adding
 any complexity to the code path for regular DML operations.  It's
 entirely possible we could need bit space in those records in the
 future for something that actually pertains to those operations; if
 you've burned it for logical decoding, it'll be difficult to claw it
 back.  And what if Tom gets around, some day, to doing that pluggable
 heap AM work?  Then every heap AM has got to allow for those bits, and
 maybe that doesn't happen to be free for them.

As explained above this isn't happening on the level of individual AMs.

 Admittedly, these are hypothetical scenarios, but I don't think
 they're particularly far-fetched.  And as a fringe benefit, if you do
 it the way that I'm proposing, you can use an OID instead of a 16-bit
 thing that we picked to be 16 bits because that happens to be 100% of
 the available bit-space.  Yeah, there's some complexity on decoding,
 but it's minimal: one more piece of fixed-size state to track per XID.
 That's trivial compared to what you've already got.

But it forces you to track the xids/transactions. With my proposal you
can ignore transactions *entirely* unless they manipulate the
catalog. For concurrent OLTP workloads that's quite the advantage.

  What's the point of the short-to-long mappings in the first place?  Is
  that only required because of the possibility that there might be
  multiple replication solutions in play on the same node?
 
  In my original proposal, 2 years+ back, I only used short numeric
  ids. And people didn't like it because it requires coordination between
  the replication solutions and possibly between servers. Using a string
  identifier like in the above makes it easy to build unique names; and
  allows every solution to add the information it needs into replication
  identifiers.
 
 I get that, but what I'm asking is why those mappings can't be managed
 on a per-replication-solution basis.  I think that's just because
 there's a limited namespace and so coordination is needed between
 multiple replication solutions that might possibly be running on the
  same system.  But I want to confirm if that's actually what you're thinking.
 

Re: [HACKERS] Inefficient barriers on solaris with sun cc

2014-09-26 Thread Robert Haas
On Fri, Sep 26, 2014 at 8:55 AM, Oskari Saarenmaa o...@ohmu.fi wrote:
 So you think a read barrier is the same thing as an acquire barrier
 and a write barrier is the same as a release barrier?  That would be
 surprising.  It's certainly not true in general.

 The above doc describes the difference: read barrier requires loads before
 the barrier to be completed before loads after the barrier - an acquire
 barrier is the same, but it also requires loads to be complete before stores
 after the barrier.

 Similarly write barrier requires stores before the barrier to be completed
 before stores after the barrier - a release barrier is the same, but it also
 requires loads before the barrier to be completed before stores after the
 barrier.

 So acquire is read + loads-before-stores and release is write +
 loads-before-stores.

Hmm.  My impression was that an acquire barrier means that loads and
stores can migrate forward across the barrier but not backward; and
that a release barrier means that loads and stores can migrate
backward across the barrier but not forward.  I'm actually not really
sure what this means unless the barrier also does something in and of
itself.  For example, consider this:

some stuff
CAS(lock, 0, 1) // i am an acquire barrier
more stuff
lock = 0 // i am a release barrier
even more stuff

If the CAS() and lock = 0 instructions were FULL barriers, then we'd
be saying that the stuff that happens in the critical section needs to
be exactly "more stuff".  But if they are acquire and release
barriers, respectively, then the CPU is allowed to move "some stuff"
or "even more stuff" into the critical section; but what it can't do
is move "more stuff" out.

Now if you just have a naked acquire barrier that is not doing
anything itself, I don't really know what the semantics of that should
be.  Say I want to appear to only change things while flag is 1, so I
write this code:

flag = 1
acquire barrier
things++
release barrier
flag = 0

With the definition you (and Oracle) propose, this won't work, because
there's nothing to keep the modification of things from being
reordered before flag = 1.  What good is that?  Apparently, I don't
have any idea!
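
For what it's worth, here is a minimal sketch of how that pattern would be
written against PostgreSQL's own primitives in src/include/storage/barrier.h
(illustrative only; the shared variables are assumed, and a reader would need
matching pg_read_barrier() calls, which is exactly the subtlety under
discussion):

#include "storage/barrier.h"

/* Illustrative sketch: writer-side ordering for "things only changes
 * while flag is 1", using pg_write_barrier() to order the stores. */
static volatile int flag = 0;
static volatile int things = 0;

static void
update_things(void)
{
	flag = 1;
	pg_write_barrier();		/* publish flag = 1 before touching things */
	things++;
	pg_write_barrier();		/* publish the change before clearing flag */
	flag = 0;
}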

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-09-26 Thread Amit Kapila
On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas robertmh...@gmail.com wrote:

 On Fri, Sep 26, 2014 at 7:40 AM, Amit Kapila amit.kapil...@gmail.com
 wrote:

 First of all, thanks for committing part-1 of these changes.  It
 seems you are planning to commit part-3 based on the results of tests
 which Andres is planning to do.  For the remaining part (part-2), today
 I have tried some tests, the results of which are as follows:

 Scale Factor - 3000, Shared_buffer - 8GB

Patch_Ver/Client_Count                   16      32      64     128
reduce-replacement-locking.patch
  + 128 Buf Partitions               157732  229547  271536  245295
scalable_buffer_eviction_v9.patch    163762  230753  275147  248309


 Scale Factor - 3000, Shared_buffer - 8GB


Typo here Scale Factor - 3000, Shared_buffer - *2*GB



Patch_Ver/Client_Count                   16      32      64     128
reduce-replacement-locking.patch
  + 128 Buf Partitions               157781  212134  202209  171176
scalable_buffer_eviction_v9.patch    160301  213922  208680  172720


 The results indicate that in all cases there is a benefit from doing
 part-2 (bgreclaimer).  Though the benefit at this configuration is
 not high, there might be more benefit at some higher configurations
 (scale factor - 1).  Do you see any merit
 in pursuing further to accomplish part-2 as well?


 Interesting results.  Thanks for gathering this data.


One more point I have missed is that above data is using
-M prepared option of pgbench.



 If this is the best we can do with the bgreclaimer, I think the case for
 pursuing it is somewhat marginal.  The biggest jump you've got seems to be
 at scale factor 3000 with 64 clients, where you picked up about 4%.  4%
 isn't nothing, but it's not a lot, either.  On the other hand, this might
 not be the best we can do.  There may be further improvements to
 bgreclaimer that make the benefit larger.

 Backing up a bit, to what extent have we actually solved the problem here?
 If we had perfectly removed all of the scalability bottlenecks, what would
 we expect to see?  You didn't say which machine this testing was done on


It was IBM POWER7
Sorry, I should have mentioned it.


 , or how many cores it had, but for example on the IBM POWER7 machine, we
 probably wouldn't expect the throughput at 64 clients to be 4 times the
 throughput at 16 cores because up to 16 clients each one can have a full
 CPU core, whereas after that and out to 64 each is getting a hardware
 thread, which is not quite as good.  Still, we'd expect performance to go
 up, or at least not go down.  Your data shows a characteristic performance
 knee: between 16 and 32 clients we go up, but then between 32 and 64 we go
 down,


Here another point worth noting is that it goes down between
32 and 64 when shared_buffers is 2GB, however when
shared_buffers are 8GB it doesn't go down between 32 and 64.


 and between 64 and 128 we go down more.  You haven't got enough data
 points there to show very precisely where the knee is, but unless you
 tested this on a smaller box than what you have been using, we're certainly
 hitting the knee sometime before we run out of physical cores.  That
 implies a remaining contention bottleneck.

 My results from yesterday were a bit different.  I tested 1 client, 8
 clients, and multiples of 16 clients out to 96.  With
 reduce-replacement-locking.patch + 128 buffer mapping partitions,
 performance continued to rise all the way out to 96 clients.  It definitely
 wasn't linearly, but it went up, not down.  I don't know why this is
 different from what you are seeing.


I think it is almost same if we consider same configuration
(scale_factor - 3000, shared_buffer - 8GB).


 Anyway there's a little more ambiguity there about how much contention
 remains, but my bet is that there is at least some contention that we could
 still hope to remove.  We need to understand where that contention is.  Are
 the buffer mapping locks still contended?  Is the new spinlock contended?
 Are there other contention points?  I won't be surprised if it turns out
 that the contention is on the new spinlock and that a proper design for
 bgreclaimer is the best way to remove that contention  but I also won't
 be surprised if it turns out that there are bigger wins elsewhere.  So I
 think you should try to figure out where the remaining contention is first,
 and then we can discuss what to do about it.


Make sense.


 On another point, I think it would be a good idea to rebase the
 bgreclaimer patch over what I committed, so that we have a clean patch
 against master to test with.


I think this also makes sense; however, I think it is better to first
see where the bottleneck is.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-09-26 Thread Andres Freund
On 2014-09-26 17:01:52 +0300, Ants Aasma wrote:
 On Fri, Sep 26, 2014 at 3:26 PM, Andres Freund and...@2ndquadrant.com wrote:
  Neither, really. The hash calculation is visible in the profile, but not
  that pronounced yet. The primary thing noticeable in profiles (besides
  cache efficiency) is the comparison of the full tag after locating a
  possible match in a bucket. 20 byte memcmp's aren't free.
 
 I'm not arguing against a more efficient hash table, but one simple
 win would be to have a buffer tag specific comparison function. I mean
 something like the attached patch. This way we avoid the loop counter
 overhead, can check the blocknum first and possibly have better branch
 prediction.

Heh. Yea. As I wrote to Heikki, 64bit compares were the thing showing
most benefits - at least with my own hashtable implementation.

 Do you have a workload where I could test if this helps alleviate the
 comparison overhead?

Fully cached readonly pgbench workloads with -M prepared already show
it. But it gets more pronounced for workloads that access buffers at a
higher frequency.

At two customers I've seen this really badly in the profile because they
have OLTP statements that do some index nested loops, often looking the
same buffer up *over and over*. There often isn't a better plan (with
current pg join support at least) for joining a somewhat small number of
rows out of a large table to a small/mid-sized table.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Replication identifiers, take 3

2014-09-26 Thread Robert Haas
On Fri, Sep 26, 2014 at 10:21 AM, Andres Freund and...@2ndquadrant.com wrote:
 As explained above this isn't happening on the level of individual AMs.

Well, that's even worse.  You want to grab 100% of the available
generic bitspace applicable to all record types for purposes specific
to logical decoding (and it's still not really enough bits).

 I get that, but what I'm asking is why those mappings can't be managed
 on a per-replication-solution basis.  I think that's just because
 there's a limited namespace and so coordination is needed between
 multiple replication solutions that might possibly be running on the
 same system.  But I want to confirm if that's actually what you're
 thinking.

 Yes, that and that such a mapping needs to be done across all database
 are the primary reasons. As it's currently impossible to create further
 shared relations you'd have to invent something living in the data
 directory on filesystem level... Brr.

 I think it'd also be much worse for debugging if there'd be no way to
 map such a internal identifier back to the replication solution in some
 form.

OK.

One question I have is what the structure of the names should be.  It
seems some coordination could be needed here.  I mean, suppose BDR
uses bdr:$NODENAME and Slony uses
$SLONY_CLUSTER_NAME:$SLONY_INSTANCE_NAME and EDB's xDB replication
server uses xdb__$NODE_NAME.  That seems like it would be sad.  Maybe
we should decide that names ought to be of the form
replication-solution.further-period-separated-components or
something like that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-09-26 Thread Andres Freund
On 2014-09-26 09:59:41 -0400, Robert Haas wrote:
 On Fri, Sep 26, 2014 at 8:26 AM, Andres Freund and...@2ndquadrant.com wrote:
  Neither, really. The hash calculation is visible in the profile, but not
  that pronounced yet. The primary thing noticeable in profiles (besides
  cache efficiency) is the comparison of the full tag after locating a
  possible match in a bucket. 20 byte memcmp's aren't free.
 
 Would it be faster to test the relfilenode first, and then test the
 rest only if that matches?

I tried that and I couldn't see that much benefit. Check my message to
Heikki.

 One idea for improving this is to get rid of relation forks.

I think that's the hardest end to start from. A cool goal, but
hard. Getting rid of the tablespace sounds comparatively simple. And even
getting rid of the database in the buffer tag seems to be simpler
although already pretty hard.

 Like all
 true PostgreSQL patriots, I love the free space map and the visibility
 map passionately, and I believe that those features are one of the
 biggest contributors to the success of the project over the last
 half-decade.  But I'd love them even more if they didn't triple our
 rate of inode consumption and bloat our buffer tags.  More, it's just
 not an extensible mechanism: too many things have to loop over all
 forks, and it just doesn't scale to keep adding more of them.  If we
 added a metapage to each heap, we could have the FSM and VM have their
 own relfilenode and just have the heap point at them.  Or (maybe
 better still) we could store the data in the heap itself.
 
 It would be a lot of work, though.  :-(

Yea, it's really hard. And nearly impossible to do without breaking
binary compatibility.

What I've been wondering about was to give individual forks their own
relfilenodes and manage them via columns in pg_class. But that's also a
heck of a lot of work and gets complicated for unlogged relations
because we need to access those during recovery when we don't have
catalog access yet and can't access all databases anyway.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2014-09-26 Thread Amit Kapila
On Fri, Sep 26, 2014 at 7:06 PM, Alvaro Herrera alvhe...@2ndquadrant.com
wrote:

 Amit Kapila wrote:

  Today, while again thinking about the strategy used in the patch to
  parallelize the operation (vacuum database), I think we can
  improve it for cases when the number of connections is
  less than the number of tables in the database (which I presume
  will normally be the case).  Currently we are sending a command
  to vacuum one table per connection; how about sending multiple
  commands (for example, VACUUM t1; VACUUM t2) on one connection?
  It seems to me there is an extra roundtrip for cases when there
  are many small tables in the database and few large tables.  Do
  you think we should optimize for any such cases?

 I don't think this is a good idea; at least not in a first cut of this
 patch.  It's easy to imagine that a table you initially think is small
 enough turns out to have grown much larger since last analyze.

That could be possible, but currently it vacuums even system tables
one by one (where I think the chances of growth are comparatively
lower), which was the main reason I thought it might be worth
considering whether the current work distribution strategy is good enough.

 In that
 case, putting one worker to process that one together with some other
 table could end up being bad for parallelism, if later it turns out that
 some other worker has no table to process.  (Table t2 in your example
  could have grown between the time the command is sent and t1 is vacuumed.)

 It's simpler to have workers do one thing at a time only.

Yeah probably that is best at least for initial patch.
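
A purely illustrative sketch (not the patch itself; all names invented) of
the "each worker handles exactly one table at a time" dispatch discussed
above, using plain libpq; error handling and the cancel handler are omitted,
and a real implementation would block on select()/PQsocket() instead of
polling:

#include <stdio.h>
#include "libpq-fe.h"

static void
vacuum_one_table_at_a_time(PGconn *conns[], int nconn,
						   const char *tables[], int ntables)
{
	int			next = 0;
	int			i;

	while (next < ntables)
	{
		for (i = 0; i < nconn && next < ntables; i++)
		{
			PGresult   *res;
			char		sql[1024];

			/* Skip connections still working on their previous table. */
			PQconsumeInput(conns[i]);
			if (PQisBusy(conns[i]))
				continue;

			/* Drain the finished command's result before reusing the slot. */
			while ((res = PQgetResult(conns[i])) != NULL)
				PQclear(res);

			/* Hand this idle connection exactly one table. */
			snprintf(sql, sizeof(sql), "VACUUM %s;", tables[next++]);
			PQsendQuery(conns[i], sql);
		}
	}
	/* Results of the commands still in flight would be drained here. */
}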

 I don't think it's a very good idea to call pg_relation_size() on every
 table in the database from vacuumdb.

You might be right; however, I was a bit skeptical about the current
strategy where the work unit is one object and each object is considered
the same irrespective of its size/bloat.  OTOH I agree with you that it is
good to keep the first version simpler.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Sloppy thinking about leakproof properties of opclass co-members

2014-09-26 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 On Wed, Sep 24, 2014 at 5:34 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  ISTM that the most appropriate solution here is to insist that all or none
  of the members of an operator class be marked leakproof.  (Possibly we
  could restrict that to btree opclasses, but I'm not sure any exception is
  needed for other index types.)  I looked for existing violations of this
  precept, and unfortunately found a *lot*.  For example, texteq is marked
  leakproof but its fellow text comparison operators aren't.  Is that really
  sane?
 
 Not really.  Fortunately, AFAICT, most if not all of these are in the
 good direction: there are some things not marked leakproof that can be
 so marked.  The reverse direction would be a hard-to-fix security
 hole.  I think at some point somebody went through and tried to mark
 all of the same-type equality operators as leakproof, and it seems
 like that got expanded somewhat without fully rationalizing what we
 had in pg_proc... mostly because I think nobody had a clear idea how
 to do that.

We'll need to investigate and see if there are any which *aren't* safe
and probably fix those to be safe rather than trying to deal with this
risk in some other way.  In other words, I hope it's really all of
these rather than just most.  In general, I've been hoping that we have
more leakproof functions than not as, while it's non-trivial to write
them and ensure they're correct, it's better for us to take that on than
to ask our users to do so (or have them get some set that someone else
wrote off of a website or even the extension network).  We can't cover
everything, of course, but hopefully we'll cover all reasonable use
cases for types we ship.

 I think your proposal here is a good one.  Heikki proposed treating
 opclass functions as leakproof *in lieu of* adding a flag, but that
 didn't seem good because we wanted to allow for the possibility of
 cases where that wasn't true, and ensure individual scrutiny of each
 case.  Your idea of making sure the flag is set consistently
 throughout the opclass (opfamily?) is similar in spirit, but better in
 detail.

Agreed- a regression test here is definitely needed, and any
exceptions which must exist, after we've determined that they don't
produce an actual security hole (not sure how they wouldn't, but still),
should be explicitly called out in the regression tests, to avoid individuals
missing this requirement by copying an existing pg_proc example and
thinking they can add a new opclass item (or change an existing one) that is
not leakproof when the rest are.

Thanks,

Stephen




Re: [HACKERS] RLS feature has been committed

2014-09-26 Thread Robert Haas
On Fri, Sep 26, 2014 at 5:12 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 26 September 2014 08:48, Heikki Linnakangas hlinnakan...@vmware.com 
 wrote:
 But in many cases, lack of good documentation makes even reviewing the patch
 difficult. How do you determine if the patch works as intended, if you don't
 know what it's supposed to do?

 Exactly.

 Lack of review and lack of consensus are often caused by the author
 not making the patch fully and genuinely accessible to peer review.
 Don't say you're having problems getting buy in when you've done very
 little to encourage that. Committing early is not the solution.

That is quite true.

Furthermore, in this particular case, I had already put a lot of
effort into reviewing the patch and had expressed a clear intention to
put in more.  If the worst that happens is that the patch has a few
bugs, no great harm will have been done by committing it.  Things get
a lot more thorny if there are still design-level issues.  I think we
made a lot of progress on those issues in previous rounds of review,
but I'm not sure we squashed them all, and I didn't appreciate having
that process short-circuited.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 On Thu, Sep 25, 2014 at 4:49 PM, Stephen Frost sfr...@snowman.net wrote:
  * Gregory Smith (gregsmithpg...@gmail.com) wrote:
  On 9/25/14, 2:02 PM, Peter Eisentraut wrote:
  But having the same parameter setting mean different things in
  different versions is the path to complete madness.
 
  Could we go so far as to remove support for unitless time settings
  eventually?  The fact that people are setting raw numbers in the
  configuration file and have to know the unit to understand what they
  just did has never been something I like.
 
  I could certainly get behind that idea...  Tho I do understand that
  people will complain about backwards compatibility, etc, etc.
 
 There's also the fact that it doesn't fix the originally complained-of
 problem.  It does fix a problem with one of the fixes proposed for
 that original problem, though.

That's certainly an excellent point- this is really orthogonal to the
actual issue of setting a value smaller than a single unit for that
setting.  Even if you have units attached to every GUC, an individual
could take a setting which is understood at the '1 minute' level and
attempt to set it to '30 seconds', for example.

On the other hand, if we're making it an error to set values without
units, I'd again be behind the idea of throwing an error on the
smaller-than-one-unit case- people are going to have to go update their
configurations to deal with the errors from the lack-of-units, this
would just be another item to fix during that process.  It'd certainly
be worse to change to require units and then wait another release to fix
or change things to address the original complaint.

Thanks,

Stephen




[HACKERS] json (b) and null fields

2014-09-26 Thread Andrew Dunstan
I should have been paying a bit more attention to the recent work on 
adding an ignore_nulls option to row_to_json(). Here are some belated 
thoughts. I apologize to Pavel and Stephen for not having commented earlier.


I think this is really a bandaid, and it will fail to catch lots of 
cases. Several examples:


 * it doesn't apply recursively, so if the row has a nested composite,
   or an array of composites, none of those will have nulls ignored,
   only the top level will.
 * it doesn't apply to other json generators, notably json_agg().
   That's a very big omission.
 * it does nothing to allow us to strip nulls from existing json/jsonb
   data.

I think a much more comprehensive solution would be preferable. What I 
have in mind is something like


json_strip_null_fields(json) -> json

and a similar function for jsonb.

These would operate recursively. There is a downside, in that they would 
be required to reprocess the json/jsonb. But adding an option like this 
to all the json generator functions would be seriously ugly, especially 
since they are mostly aggregate functions or variadic functions. At 
least in the jsonb case the cost of reprocessing is likely to be fairly low.


cheers

andrew





Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread Robert Haas
On Thu, Sep 25, 2014 at 4:49 PM, Stephen Frost sfr...@snowman.net wrote:
 * Gregory Smith (gregsmithpg...@gmail.com) wrote:
 On 9/25/14, 2:02 PM, Peter Eisentraut wrote:
 But having the same parameter setting mean different things in
 different versions is the path to complete madness.

 Could we go so far as to remove support for unitless time settings
 eventually?  The fact that people are setting raw numbers in the
 configuration file and have to know the unit to understand what they
 just did has never been something I like.

 I could certainly get behind that idea...  Tho I do understand that
 people will complain about backwards compatibility, etc, etc.

There's also the fact that it doesn't fix the originally complained-of
problem.  It does fix a problem with one of the fixes proposed for
that original problem, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Replication identifiers, take 3

2014-09-26 Thread Robert Haas
On Fri, Sep 26, 2014 at 10:55 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-09-26 10:40:37 -0400, Robert Haas wrote:
 On Fri, Sep 26, 2014 at 10:21 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  As explained above this isn't happening on the level of individual AMs.

 Well, that's even worse.  You want to grab 100% of the available
 generic bitspace applicable to all record types for purposes specific
 to logical decoding (and it's still not really enough bits).

 I don't think that's a fair characterization. Right now it's available
 to precisely nobody. You can't put any data in there in *any* way. It
 just has been sitting around unused for at least 8 years.

Huh?  That's just to say that the unused bit space is, in fact,
unused.  But so what?  We've always been very careful about using up
things like infomask bits, because there are only so many bits
available, and when they're gone they are gone.

 One question I have is what the structure of the names should be.  It
 seems some coordination could be needed here.  I mean, suppose BDR
 uses bdr:$NODENAME and Slony uses
 $SLONY_CLUSTER_NAME:$SLONY_INSTANCE_NAME and EDB's xDB replication
 server uses xdb__$NODE_NAME.  That seems like it would be sad.  Maybe
 we should decide that names ought to be of the form
 replication-solution.further-period-separated-components or
 something like that.

 I've also wondered about that. Perhaps we simply should have an
 additional 'name' column indicating the replication solution?

Yeah, maybe, but there's still the question of substructure within the
non-replication-solution part of the name.  Not sure if we can assume
that a bipartite identifier, specifically, is right, or whether some
solutions will end up with different numbers of components.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WITH CHECK and Column-Level Privileges

2014-09-26 Thread Heikki Linnakangas

On 09/26/2014 05:20 PM, Stephen Frost wrote:

All,

   Through continued testing, we've discovered an issue in the
   WITH CHECK OPTION code when it comes to column-level privileges
   which impacts 9.4.

   It's pretty straight-forward, thankfully, but:

postgres=# create view myview
postgres-# with (security_barrier = true,
postgres-# check_option = 'local')
postgres-# as select * from passwd where username = current_user;
CREATE VIEW
postgres=# grant select (username) on myview to public;
GRANT
postgres=# grant update on myview to public;
GRANT
postgres=# set role alice;
SET
postgres=> update myview set username = 'joe';
ERROR:  new row violates WITH CHECK OPTION for myview
DETAIL:  Failing row contains (joe, abc).

   Note that the entire failing tuple is returned, including the
   'password' column, even though the 'alice' user does not have select
   rights on that column.


Are there similar problems with unique or exclusion constraints?


   The detail information is useful for debugging, but I believe we have
   to remove it from the error message.

   Barring objections, and in the hopes of getting the next beta out the
   door soon, I'll move forward with this change and back-patch it to
   9.4 after a few hours


What exactly are you going to commit? Did you forget to attach a patch?


(or I can do it tomorrow if there is contention;
   I don't know what, if any, specific plans there are for the next beta,
   just that it's hopefully 'soon').


Probably would be wise to wait 'till tomorrow; there's no need to rush this.

- Heikki





Re: [HACKERS] WITH CHECK and Column-Level Privileges

2014-09-26 Thread Stephen Frost
* Heikki Linnakangas (hlinnakan...@vmware.com) wrote:
 On 09/26/2014 05:20 PM, Stephen Frost wrote:
Note that the entire failing tuple is returned, including the
'password' column, even though the 'alice' user does not have select
rights on that column.
 
 Are there similar problems with unique or exclusion constraints?

That's certainly an excellent question..  I'll have to go look.

The detail information is useful for debugging, but I believe we have
to remove it from the error message.
 
Barring objections, and in the hopes of getting the next beta out the
door soon, I'll move forward with this change and back-patch it to
9.4 after a few hours
 
 What exactly are you going to commit? Did you forget to attach a patch?

I had, but now it seems like it might be insufficient anyway..  Let me
go poke at the unique constraint question.

 (or I can do it tomorrow if there is contention;
I don't know what, if any, specific plans there are for the next beta,
just that it's hopefully 'soon').
 
 Probably would be wise to wait 'till tomorrow; there's no need to rush this.

Sure, works for me.

Would be great to get an idea of when the next beta is going to be cut,
if there's been any thought to that..

Thanks,

Stephen




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread Robert Haas
On Fri, Sep 26, 2014 at 10:58 AM, Stephen Frost sfr...@snowman.net wrote:
 * Robert Haas (robertmh...@gmail.com) wrote:
 On Thu, Sep 25, 2014 at 4:49 PM, Stephen Frost sfr...@snowman.net wrote:
  * Gregory Smith (gregsmithpg...@gmail.com) wrote:
  On 9/25/14, 2:02 PM, Peter Eisentraut wrote:
  But having the same parameter setting mean different things in
  different versions is the path to complete madness.
 
  Could we go so far as to remove support for unitless time settings
  eventually?  The fact that people are setting raw numbers in the
  configuration file and have to know the unit to understand what they
  just did has never been something I like.
 
  I could certainly get behind that idea...  Tho I do understand that
  people will complain about backwards compatibility, etc, etc.

 There's also the fact that it doesn't fix the originally complained-of
 problem.  It does fix a problem with one of the fixes proposed for
 that original problem, though.

 That's certainly an excellent point- this is really orthogonal to the
 actual issue of setting a value smaller than a single unit for that
 setting.  Even if you have units attached to every GUC, an individual
  could take a setting which is understood at the '1 minute' level and
 attempt to set it to '30 seconds', for example.

 On the other hand, if we're making it an error to set values without
 units, I'd again be behind the idea of throwing an error on the
 smaller-than-one-unit case- people are going to have to go update their
 configurations to deal with the errors from the lack-of-units, this
 would just be another item to fix during that process.  It'd certainly
  be worse to change to require units and then wait another release to fix
 or change things to address the original complaint.

Yep, I'll buy that.

If we want the narrowest possible fix for this, I think it's complain
if a non-zero value would round to zero.  That fixes the original
complaint and changes absolutely nothing else.  But I think that's
kind of wussy.  Yeah, rounding 29 seconds down to a special magic
value of 0 is more surprising than rounding 30 seconds up to a minute,
but the latter is still surprising.  We're generally not averse to
tighter validation, so why here?  We're not talking about preventing
the use of 60s to mean 1min, just the use of 42s to mean 1min, which
is no different than not letting 2/29/93 mean 3/1/93, a decision about
which we have no compunctions.
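
For concreteness, the "narrowest possible fix" would look roughly like the
following inside the GUC unit-conversion code (a sketch only, with simplified
and invented names; not an actual patch):

#include "postgres.h"

/*
 * Sketch: once a user-supplied value has been converted into the GUC's
 * base unit, reject inputs that were non-zero but rounded down to zero,
 * instead of silently turning them into the special value 0.
 */
static void
reject_round_to_zero(double user_value, int64 converted_value,
					 const char *gucname)
{
	if (user_value != 0 && converted_value == 0)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("value for parameter \"%s\" is smaller than its unit and would round to zero",
						gucname)));
}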

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WITH CHECK and Column-Level Privileges

2014-09-26 Thread Stephen Frost
* Stephen Frost (sfr...@snowman.net) wrote:
  Are there similar problems with unique or exclusion constraints?
 
 That's certainly an excellent question..  I'll have to go look.

Looks like there is an issue here with CHECK constraints and NOT NULL
constraints, yes.  The uniqueness check complains about the key already
existing and returns the key, but I don't think that's actually a
problem- to get that to happen you have to specify the new key and
that's what is returned.

Looks like this goes all the way back to column-level privileges and was
just copied into WithCheckOptions from ExecConstraints. :(

Here's the test case I used:

create table passwd (username text primary key, password text);
grant select (username) on passwd to public;
grant update on passwd to public;
insert into passwd values ('abc','hidden');
insert into passwd values ('def','hidden2');

set role alice;
update passwd set username = 'def';
update passwd set username = null;

Results in:

postgres=> update passwd set username = 'def';
ERROR:  duplicate key value violates unique constraint passwd_pkey
DETAIL:  Key (username)=(def) already exists.
postgres=> update passwd set username = null;
ERROR:  null value in column username violates not-null constraint
DETAIL:  Failing row contains (null, hidden).

Thoughts?

Thanks,

Stephen




Re: [HACKERS] WITH CHECK and Column-Level Privileges

2014-09-26 Thread Stephen Frost
* Stephen Frost (sfr...@snowman.net) wrote:
 * Stephen Frost (sfr...@snowman.net) wrote:
   Are there similar problems with unique or exclusion constraints?
  
  That's certainly an excellent question..  I'll have to go look.
 
 Looks like there is an issue here with CHECK constraints and NOT NULL
 constraints, yes.  The uniqueness check complains about the key already
 existing and returns the key, but I don't think that's actually a
 problem- to get that to happen you have to specify the new key and
 that's what is returned.

Yeah, I take that back.  If there is a composite key involved then you
can run into the same issue- you update one of the columns to a
conflicting value and get back the entire key, including columns you
shouldn't be allowed to see.

Ugh.

Thanks,

Stephen




Re: [HACKERS] Inefficient barriers on solaris with sun cc

2014-09-26 Thread Oskari Saarenmaa

26.09.2014, 17:28, Robert Haas wrote:

On Fri, Sep 26, 2014 at 8:55 AM, Oskari Saarenmaa o...@ohmu.fi wrote:

So you think a read barrier is the same thing as an acquire barrier
and a write barrier is the same as a release barrier?  That would be
surprising.  It's certainly not true in general.


The above doc describes the difference: read barrier requires loads before
the barrier to be completed before loads after the barrier - an acquire
barrier is the same, but it also requires loads to be complete before stores
after the barrier.

Similarly write barrier requires stores before the barrier to be completed
before stores after the barrier - a release barrier is the same, but it also
requires loads before the barrier to be completed before stores after the
barrier.

So acquire is read + loads-before-stores and release is write +
loads-before-stores.


Hmm.  My impression was that an acquire barrier means that loads and
stores can migrate forward across the barrier but not backward; and
that a release barrier means that loads and stores can migrate
backward across the barrier but not forward.  I'm actually not really
sure what this means unless the barrier also does something in and of
itself.  For example, consider this:


[...]


With the definition you (and Oracle) propose, this won't work, because
there's nothing to keep the modification of things from being
reordered before flag = 1.  What good is that?  Apparently, I don't
have any idea!


I'm not proposing any definition for acquire or release barriers; I was 
just proposing to use the things Solaris Studio defines as acquire and 
release barriers to implement read and write barriers in PostgreSQL, 
because similar barrier names are used with gcc, and on Solaris Studio 
acquire is a stronger read barrier and release is a stronger write 
barrier.  atomics.h's definition of pg_(read|write)_barrier doesn't have 
any requirements for loads before stores, though, so we could use 
__machine_r_barrier and __machine_w_barrier instead.


But as Andres pointed out all this is probably unnecessary and we could 
define read and write barrier as __compiler_barrier with Solaris Studio 
cc.  It's only available for Solaris (x86 and Sparc) and Linux (x86).
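
Concretely, the two options sketch out roughly like this (illustrative only;
it is assumed here that Solaris Studio's <mbarrier.h> provides the intrinsics
named above):

/* Option 1: map the PostgreSQL barriers to the Studio machine barriers. */
#include <mbarrier.h>

#define pg_read_barrier()	__machine_r_barrier()
#define pg_write_barrier()	__machine_w_barrier()

/* Option 2 (per Andres): define both as a plain compiler barrier, on the
 * assumption that the hardware Studio supports does not need more:
 *
 * #define pg_read_barrier()	__compiler_barrier()
 * #define pg_write_barrier()	__compiler_barrier()
 */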


/ Oskari




Re: [HACKERS] BRIN indexes - TRAP: BadArgument

2014-09-26 Thread Erik Rijkers
On Tue, September 23, 2014 21:04, Alvaro Herrera wrote:
 Alvaro Herrera wrote:

 [minmax-19.patch]
 [minmax-19a.patch]

Although admittedly it is not directly likely for us to need it, and although I 
see that there is a BRIN Extensibility
chapter added (good!), I am still a bit surprised by the absence of a built-in 
BRIN operator class for bigint, as the BRIN
index type is specifically useful for huge tables (where after all huge values 
are more likely to occur).

Will a brin int8 operator class be added for 9.5? (I know, quite some time 
left...)

(btw, so far the patch proves quite stable under my abusive testing...)


thanks,


Erik Rijkers











Re: [HACKERS] Replication identifiers, take 3

2014-09-26 Thread Andres Freund
On 2014-09-26 11:02:16 -0400, Robert Haas wrote:
 On Fri, Sep 26, 2014 at 10:55 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  On 2014-09-26 10:40:37 -0400, Robert Haas wrote:
  On Fri, Sep 26, 2014 at 10:21 AM, Andres Freund and...@2ndquadrant.com 
  wrote:
   As explained above this isn't happening on the level of individual AMs.
 
  Well, that's even worse.  You want to grab 100% of the available
  generic bitspace applicable to all record types for purposes specific
  to logical decoding (and it's still not really enough bits).
 
  I don't think that's a fair characterization. Right now it's available
  to precisely nobody. You can't put any data in there in *any* way. It
  just has been sitting around unused for at least 8 years.
 
 Huh?  That's just to say that the unused bit space is, in fact,
 unused.  But so what?  We've always been very careful about using up
 things like infomask bits, because there are only so many bits
 available, and when they're gone they are gone.

I don't think that's a very meaningful comparison. The problem with
infomask bits is that it's impossible to change anything once added
because of pg_upgrade'ability. That problem does not exist for
XLogRecord. We've twiddled with the WAL format pretty much in every
release. We can reconsider every release.

I can't remember anyone but me thinking about using these two bytes. So
the comparison here really is using two free bytes vs. issuing at least
~30 bytes (record + origin) for every replayed transaction. Don't think
that's an unfair tradeoff.

  One question I have is what the structure of the names should be.  It
  seems some coordination could be needed here.  I mean, suppose BDR
  uses bdr:$NODENAME and Slony uses
  $SLONY_CLUSTER_NAME:$SLONY_INSTANCE_NAME and EDB's xDB replication
  server uses xdb__$NODE_NAME.  That seems like it would be sad.  Maybe
  we should decide that names ought to be of the form
  replication-solution.further-period-separated-components or
  something like that.
 
  I've also wondered about that. Perhaps we simply should have an
  additional 'name' column indicating the replication solution?
 
 Yeah, maybe, but there's still the question of substructure within the
 non-replication-solution part of the name.  Not sure if we can assume
 that a bipartite identifier, specifically, is right, or whether some
 solutions will end up with different numbers of components.

Ah. I thought you only wanted to suggest a separator between the
replication solution and its internal data. But you actually want to
suggest an internal separator to be used in the solution's namespace?
I'm fine with that. I don't think we can suggest much beyond that -
different solutions will have fundamentally differing requirements about
which information to store.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 If we want the narrowest possible fix for this, I think it's complain
 if a non-zero value would round to zero.  That fixes the original
 complaint and changes absolutely nothing else.  But I think that's
 kind of wussy.  Yeah, rounding 29 seconds down to a special magic
 value of 0 is more surprising than rounding 30 seconds up to a minute,
 but the latter is still surprising.  We're generally not averse to
 tighter validation, so why here?

So in other words, if I set shared_buffers = 100KB, you are proposing
that that be rejected because it's not an exact multiple of 8KB?  This
seems like it's throwing away one of the fundamental reasons why we
invented GUC units in the first place.

I apparently have got to make this point one more time: if the user
cares about the difference between 30sec and 1min, then we erred in
designing the GUC in question; it should have had a smaller unit.
I am completely not impressed by arguments based on such cases.
The right fix for such a case is to choose a different unit for the GUC.

regards, tom lane




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread Stephen Frost
Tom,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
 Robert Haas robertmh...@gmail.com writes:
  If we want the narrowest possible fix for this, I think it's complain
  if a non-zero value would round to zero.  That fixes the original
  complaint and changes absolutely nothing else.  But I think that's
  kind of wussy.  Yeah, rounding 29 seconds down to a special magic
  value of 0 is more surprising than rounding 30 seconds up to a minute,
  but the latter is still surprising.  We're generally not averse to
  tighter validation, so why here?
 
 So in other words, if I set shared_buffers = 100KB, you are proposing
 that that be rejected because it's not an exact multiple of 8KB?  This
 seems like it's throwing away one of the fundamental reasons why we
 invented GUC units in the first place.

I don't believe that's what was suggested at all (it's not what I was
suggesting, in any case), but rather shared_buffers = 1KB would error
(erm, won't it error *anyway*?  so this wouldn't be a change there..)

shared_buffers = 100KB would be rounded the same as today.

 I apparently have got to make this point one more time: if the user
 cares about the difference between 30sec and 1min, then we erred in
 designing the GUC in question; it should have had a smaller unit.
 I am completely not impressed by arguments based on such cases.
 The right fix for such a case is to choose a different unit for the GUC.

I don't think anyone is arguing that we should do away with the
rounding rules entirely, only that we should a) require units to be
specified, and b) error if the value specified is below '1 unit', but
still non-zero, as it would then be rounded to zero.

Thanks,

Stephen




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread Josh Berkus
On 09/26/2014 10:27 AM, Stephen Frost wrote:
 I don't think anyone is argueing that we should do away with the
 rounding rules entirely, only that we should a) require units to be
 specified, and b) error if the value specified is below '1 unit', but
 still non-zero, as it would then be rounded to zero.

That would not be a back-portable fix.

There are many 3rd-party tools out there (AWS RDS, for one) which do not
use the units.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread Stephen Frost
* Josh Berkus (j...@agliodbs.com) wrote:
 On 09/26/2014 10:27 AM, Stephen Frost wrote:
  I don't think anyone is argueing that we should do away with the
  rounding rules entirely, only that we should a) require units to be
  specified, and b) error if the value specified is below '1 unit', but
  still non-zero, as it would then be rounded to zero.
 
 That would not be a back-portable fix.

No, certainly not.  Sorry, thought that was clear..

 There are many 3rd-party tools out there (AWS RDS, for one) which do not
 use the units.

Of course.  Those would need to be updated for whichever major release
introduced this requirement.

Thanks,

Stephen




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread David Johnston
On Fri, Sep 26, 2014 at 1:22 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Robert Haas robertmh...@gmail.com writes:
  If we want the narrowest possible fix for this, I think it's complain
  if a non-zero value would round to zero.  That fixes the original
  complaint and changes absolutely nothing else.  But I think that's
  kind of wussy.  Yeah, rounding 29 seconds down to a special magic
  value of 0 is more surprising than rounding 30 seconds up to a minute,
  but the latter is still surprising.  We're generally not averse to
  tighter validation, so why here?

 So in other words, if I set shared_buffers = 100KB, you are proposing
 that that be rejected because it's not an exact multiple of 8KB?  This
 seems like it's throwing away one of the fundamental reasons why we
 invented GUC units in the first place.


Both Robert and I at one point made suggestions to this effect, but I
believe at this point we are both good simply solving the sub-1-unit rounding
and calling it a day.


 I apparently have got to make this point one more time: if the user
 cares about the difference between 30sec and 1min, then we erred in
 designing the GUC in question; it should have had a smaller unit.
 I am completely not impressed by arguments based on such cases.
 The right fix for such a case is to choose a different unit for the GUC.


You are right - both those values are probably quite stupid to set
log_rotation_age to...but we cannot error if they choose 1min.

Stephen suggested changing the unit, but that particular cure seems to be
considerably worse than simply changing floor() to ceil().

I'm out of arguments for why the minimally invasive solution is, for me,
the best of the currently proposed solutions.  That option gets my +1.  I
have to ask someone else to actually write such a patch, but it seems
straightforward enough, as no one has complained that the solution would be
hard to implement - only that they dislike the idea.

David J


Re: [HACKERS] Escaping from blocked send() reprised.

2014-09-26 Thread Heikki Linnakangas

On 09/09/2014 01:31 PM, Kyotaro HORIGUCHI wrote:

Hi, I added and edited some comments.


Sorry, It tha patch contains a silly bug. Please find the
attatched one.


I must say this scares the heck out of me. The current code goes through
some trouble to not throw an error while in a recv() or send(). For
example, you removed the DoingCommandRead check from
prepare_for_client_read(). There's a comment in postgres.c that says this:



	/*
	 * (2) Allow asynchronous signals to be executed immediately if they
	 * come in while we are waiting for client input. (This must be
	 * conditional since we don't want, say, reads on behalf of COPY FROM
	 * STDIN doing the same thing.)
	 */
	DoingCommandRead = true;


With the patch, we do allow asynchronous signals to be processed while 
blocked in COPY FROM STDIN. Maybe that's OK, but I don't feel 
comfortable just changing it. (the comment is now wrong, of course)


This patch also enables processing query cancel signals while blocked, 
not just SIGTERM. That's not good; we might be in the middle of sending 
a message, and we cannot just error out of that or we might violate the 
fe/be protocol. That's OK with a SIGTERM as you're terminating the 
connection anyway, and we have the PqCommBusy safeguard in place that 
prevents us from sending broken messages to the client, but that's not 
good enough if we wanted to keep the backend alive, as we won't be able 
to send anything to the client anymore.


BTW, we've been talking about blocking in send(), but this patch also 
lets a recv() in e.g. COPY FROM STDIN be interrupted. That's 
probably a good thing; surely you have exactly the same issues with that 
as with send(). But I didn't realize we had a problem with that too.



In summary, this patch is not ready as it is, but I think we can fix it. 
The key question is: is it safe to handle SIGTERM in the signal handler, 
calling the exit-handlers and exiting the backend, when blocked in a 
recv() or send()? It's a change in the pqcomm.c API; most pqcomm.c 
functions have not thrown errors or processed interrupts before. But 
looking at the callers, I think it's safe, and there aren't actually any 
comments explicitly saying that pqcomm.c will never throw errors.


I propose the attached patch. It adds a new flag ImmediateDieOK, which 
is a weaker form of ImmediateInterruptOK that only allows handling a 
pending die-signal in the signal handler.


Robert, others, do you see a problem with this?

Over IM, Robert pointed out that it's not safe to jump out of a signal 
handler with siglongjmp, when we're inside library calls, like in a 
callback called by OpenSSL. But even with current master branch, that's 
exactly what we do. In secure_raw_read(), we set ImmediateInterruptOK = 
true, which means that any incoming signal will be handled directly in 
the signal handler, which can mean elog(ERROR). Should we be worried? 
OpenSSL might get confused if control never returns to the SSL_read() or 
SSL_write() function that called secure_raw_read().


- Heikki

diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c
index 41ec1ad..049e5b1 100644
--- a/src/backend/libpq/be-secure.c
+++ b/src/backend/libpq/be-secure.c
@@ -178,5 +178,13 @@ secure_write(Port *port, void *ptr, size_t len)
 ssize_t
 secure_raw_write(Port *port, const void *ptr, size_t len)
 {
-	return send(port->sock, ptr, len, 0);
+	ssize_t result;
+
+	prepare_for_client_write();
+
+	result = send(port->sock, ptr, len, 0);
+
+	client_write_ended();
+
+	return result;
 }
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 61f17bf..138060b 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -497,6 +497,17 @@ ReadCommand(StringInfo inBuf)
  * non-reentrant libc functions.  This restriction makes it safe for us
  * to allow interrupt service routines to execute nontrivial code while
  * we are waiting for input.
+ *
+ * When waiting in the main loop, we can process any interrupt immediately
+ * in the signal handler. In any other read from the client, like in a COPY
+ * FROM STDIN, we can't safely process a query cancel signal, because we might
+ * be in the middle of sending a message to the client, and jumping out would
+ * violate the protocol. Or rather, pqcomm.c would detect it and refuse to
+ * send any more messages to the client. But handling a SIGTERM is OK, because
+ * we're terminating the backend and don't need to send any more messages
+ * anyway. That means that we might not be able to send an error message to
+ * the client, but that seems better than waiting indefinitely, in case the
+ * client is not responding.
  */
 void
 prepare_for_client_read(void)
@@ -513,6 +524,15 @@ prepare_for_client_read(void)
 		/* And don't forget to detect one that already arrived */
 		CHECK_FOR_INTERRUPTS();
 	}
+	else
+	{
+		/* Allow die 

Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread Tom Lane
David Johnston david.g.johns...@gmail.com writes:
 On Fri, Sep 26, 2014 at 1:22 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I apparently have got to make this point one more time: if the user
 cares about the difference between 30sec and 1min, then we erred in
 designing the GUC in question; it should have had a smaller unit.
 I am completely not impressed by arguments based on such cases.
 The right fix for such a case is to choose a different unit for the GUC.

 You are right - both those values are probably quite stupid to set
 log_rotation_age to...but we cannot error if they choose 1min

I've been thinking more about this, and I think I'm about ready to
change my position on it: why shouldn't we error out if the value
is too small?  If we believe that a GUC's unit is reasonably chosen,
then it's not sensible to try to set the value to less than 1 unit.
If the user tries then there's fairly good reason to assume that
either it's a typo or he fundamentally doesn't understand the GUC.
(This does not imply that he cares about the difference between
 and .4 units.)

Or another way to say it is that if we had set the min_val to something
positive, he'd have gotten a "value out of range" type of error.
The only reason we don't do that, typically, is that we're allowing
zero as a special case.  Well, ok, let's allow zero as a special case,
but it has to be written as "0", not something else.  If you try to
set a positive value then we should act as though the min_val is 1 unit.

So I'm coming around to the idea that "throw an error if a nonzero
input would round (or truncate) to zero" is a reasonable solution.
I think it'd be even more reasonable if we also fixed the rounding
rule to be round-to-nearest, but the two changes can be considered
independently.
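
For illustration only (this is not the actual guc.c conversion code, and the
function and parameter names below are invented), a minimal standalone sketch
of that combined rule: round to nearest, accept an explicit 0, and reject a
nonzero setting that would otherwise land on zero.

#include <math.h>
#include <stdbool.h>

/*
 * Illustrative sketch: convert a value given in some input unit into the
 * GUC's base unit.  "factor" is how many base units one input unit is
 * worth (e.g. 60.0 for minutes when the base unit is seconds).  Returns
 * false when a nonzero input would round to zero, which the caller would
 * report as an error instead of silently storing 0.
 */
static bool
convert_guc_value(double value, double factor, long *result)
{
	long		rounded = (long) floor(value * factor + 0.5);	/* round to nearest */

	if (value != 0.0 && rounded == 0)
		return false;			/* nonzero input would become zero: reject */

	*result = rounded;
	return true;
}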

regards, tom lane




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread David Johnston
On Fri, Sep 26, 2014 at 2:02 PM, Robert Haas robertmh...@gmail.com wrote:

 On Fri, Sep 26, 2014 at 1:22 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Robert Haas robertmh...@gmail.com writes:
  If we want the narrowest possible fix for this, I think it's complain
  if a non-zero value would round to zero.  That fixes the original
  complaint and changes absolutely nothing else.  But I think that's
  kind of wussy.  Yeah, rounding 29 seconds down to a special magic
  value of 0 is more surprising than rounding 30 seconds up to a minute,
  but the latter is still surprising.  We're generally not averse to
  tighter validation, so why here?
 
  So in other words, if I set shared_buffers = 100KB, you are proposing
  that that be rejected because it's not an exact multiple of 8KB?

 Absolutely.  And if anyone is inconvenienced by that, then they should
 upgrade to a 386.  Seriously, who is going to set a value of
 shared_buffers that is not measured in MB?  And if they do, why
 shouldn't we complain if we can't honor the value exactly?  If they
 really put in a value that small, it's not stupid to think that the
 difference between 96kB and 104kB is significant.  Honestly, the most
 likely explanation for that value is that it's a developer doing
 testing.


Related
thought - why don't we allow the user to specify 1.5MB as a valid value?
Since we don't, the rounding on the 8kB stuff makes sense because not
everyone wants to choose between 1MB and 2MB.  A difference of 1 minute is
not as noticeable.

In the thread Tom linked to earlier, the whole idea of a unit being 8kB
(instead of 1 block) is problematic, and this is just another symptom of that.

David J.


Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 I've been thinking more about this, and I think I'm about ready to
 change my position on it: why shouldn't we error out if the value
 is too small?  If we believe that a GUC's unit is reasonably chosen,
 then it's not sensible to try to set the value to less than 1 unit.

Right, agreed.

 If the user tries then there's fairly good reason to assume that
 either it's a typo or he fundamentally doesn't understand the GUC.
 (This does not imply that he cares about the difference between
  and .4 units.)

+1

 Or another way to say it is that if we had set the min_val to something
 positive, he'd have gotten a value out of range type of error.
 The only reason we don't do that, typically, is that we're allowing
 zero as a special case.  Well, ok, let's allow zero as a special case,
 but it has to be written as 0 not something else.  If you try to
 set a positive value then we should act as though the min_val is 1 unit.

Yes.

 So I'm coming around to the idea that throw an error if a nonzero
 input would round (or truncate) to zero is a reasonable solution.
 I think it'd be even more reasonable if we also fixed the rounding
 rule to be round to nearest, but the two changes can be considered
 independently.

Agreed- they're independent considerations and the original concern was
about the nonzero-to-zero issue, so I'd suggest we address that first,
though in doing so we will need to consider what *actual* min values we
should have for some cases which currently allow going to zero for the
special case and that, I believe, makes this all 9.5 material and allows
us a bit more freedom in deciding how to handle things more generally.

As such, I'd also +1 the round-to-nearest suggestion as being quite
sensible too.

Thanks,

Stephen




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 Agreed- they're independent considerations and the original concern was
 about the nonzero-to-zero issue, so I'd suggest we address that first,
 though in doing so we will need to consider what *actual* min values we
 should have for some cases which currently allow going to zero for the
 special case and that, I believe, makes this all 9.5 material and allows
 us a bit more freedom in deciding how to hanlde things more generally.

Yeah, I was thinking the same: we should go through the GUCs having zero
as min_val and see if any of them could be tightened up.  And I agree
that *all* of this is 9.5 material --- it's not a big enough deal to
risk changing behaviors in a minor release.

regards, tom lane




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread David Johnston
On Fri, Sep 26, 2014 at 2:27 PM, Stephen Frost sfr...@snowman.net wrote:


 Agreed- they're independent considerations and the original concern was
 about the nonzero-to-zero issue, so I'd suggest we address that first,
 though in doing so we will need to consider what *actual* min values we
 should have for some cases which currently allow going to zero for the
 special case and that, I believe, makes this all 9.5 material and allows
 us a bit more freedom in deciding how to hanlde things more generally.


This is 9.5 material because 1) it isn't all that critical and 2) we DO
NOT want a system to not come up because of a GUC parsing error after a
minor-release update.

I don't get where we need to do anything else besides that...the whole
"actual min values" comment is unclear to me.

My thought on rounding is simply no-complaint, no-change; round-to-nearest
would be my first choice if implementing anew.

David J.


Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread Stephen Frost
David,

* David Johnston (david.g.johns...@gmail.com) wrote:
 This is 9.5 material because 1) it isn't all that critical and, 2) we DO
 NOT want a system to not come up because of a GUC paring error after a
 minor-release update.

Agreed.

 I don't get where we need to do anything else besides that...the whole
 actual min values comment is unclear to me.

Well, for cases that allow going to zero as an off option, we've
already decided, I believe, that sub-1-unit options are off the table
and so the min value is at *least* 1, but there could be cases where '1'
doesn't actually make any sense and it should be higher than that.

Consider the log file rotation bit.  If it was in seconds, would it
actually make sense to support actually doing a rotation *every second*?

No.

In that case, perhaps we'd set the minimum to '60s', even though
technically we could represent less than that, it's not sensible to do
so.  The point of having minimum (and maximum..) values is that typos
and other mistakes happen and we want the user to realize they've made a
mistake.

What needs to happen next is a review of all the GUCs which allow going
to zero and which treat zero as a special value, and consider what the
*actual* minimum value for those should be (excluding zero).  I was
hoping you might be interested in doing that... :D

Thanks,

Stephen




Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2014-09-26 Thread Gavin Flower

On 27/09/14 01:36, Alvaro Herrera wrote:

Amit Kapila wrote:


Today while again thinking about the strategy used in the patch to
parallelize the operation (vacuum database), I think we can
improve the same for cases when number of connections are
lesser than number of tables in database (which I presume
will normally be the case).  Currently we are sending command
to vacuum one table per connection, how about sending multiple
commands (example Vacuum t1; Vacuum t2) on one connection.
It seems to me there is extra roundtrip for cases when there
are many small tables in database and few large tables.  Do
you think we should optimize for any such cases?

I don't think this is a good idea; at least not in a first cut of this
patch.  It's easy to imagine that a table you initially think is small
enough turns out to have grown much larger since last analyze.  In that
case, putting one worker to process that one together with some other
table could end up being bad for parallelism, if later it turns out that
some other worker has no table to process.  (Table t2 in your example
could have grown between the time the command is sent and t1 is vacuumed.)

It's simpler to have workers do one thing at a time only.

I don't think it's a very good idea to call pg_relation_size() on every
table in the database from vacuumdb.

Curious: would it be both feasible and useful to have multiple workers 
process a 'large' table, without complicating things too much?  They 
could each start at a different position in the file.



Cheers,
Gavin




Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2014-09-26 Thread Alvaro Herrera
Gavin Flower wrote:

 Curious: would it be both feasible and useful to have multiple
 workers process a 'large' table, without complicating things too
 much?  The could each start at a different position in the file.

Feasible: no.  Useful: maybe, we don't really know.  (You could just as
well have a worker at double the speed, i.e. double vacuum_cost_limit).

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread David G Johnston
On Fri, Sep 26, 2014 at 2:39 PM, Stephen Frost [via PostgreSQL] 
ml-node+s1045698n5820714...@n5.nabble.com wrote:

 David,

 * David Johnston ([hidden email]) wrote:
  This is 9.5 material because 1) it isn't all that critical and, 2) we
 DO
  NOT want a system to not come up because of a GUC paring error after a
  minor-release update.

 Agreed.

  I don't get where we need to do anything else besides that...the
 whole
  actual min values comment is unclear to me.

 Well, for cases that allow going to zero as an off option, we've
 already decided, I believe, that sub-1-unit options are off the table
 and so the min value is at *least* 1, but there could be cases where '1'
 doesn't actually make any sense and it should be higher than that.

 Consider the log file rotation bit.  If it was in seconds, would it
 actually make sense to support actually doing a rotation *every second*?

 No.

 In that case, perhaps we'd set the minimum to '60s', even though
 technically we could represent less than that, it's not sensible to do
 so.  The point of having minimum (and maximum..) values is that typos
 and other mistakes happen and we want the user to realize they've made a
 mistake.

 What needs to happen next is a review of all the GUCs which allow going
 to zero and which treat zero as a special value, and consider what the
 *actual* minimum value for those should be (excluding zero).  I was
 hoping you might be interested in doing that... :D


Like I said I just want to fix the bug and call it a day :)

For me just enforcing 1 as the minimum for everything would be fine.

I'd rather mandate non-integer data entry than impose an actual minimum
that is greater than 1.  Specifically, a too-short/too-small value might be
used during exploration and testing by a new user even if the same value
would never be useful in production.  That, in fact, is the one reason that
allowing 5s for log rotation age would make sense - to allow people to
more easily check their log rotation policies.  But making it work
without disrupting people using '=60' (1hr) is impossible without simply
outlawing unitless values.

David J.





Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread Josh Berkus
On 09/26/2014 11:17 AM, Tom Lane wrote:
 So I'm coming around to the idea that throw an error if a nonzero
 input would round (or truncate) to zero is a reasonable solution.
 I think it'd be even more reasonable if we also fixed the rounding
 rule to be round to nearest, but the two changes can be considered
 independently.

I'm good with the error.  We'll want to add stuff to both the docs and
pg_settings to *show* the minimum value, and an informative error
message would help, i.e.:

Invalid value for log_rotation_interval.  Minimum value is 1min
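
For illustration (a hedged sketch, not actual guc.c code; the wrapper
function, parameter name, and values are invented), such a message could be
raised with the usual ereport() machinery:

#include "postgres.h"

/*
 * Illustrative sketch only: report a setting below the parameter's
 * minimum, spelling out the minimum in the hint.
 */
static void
report_below_minimum(const char *name, const char *given, const char *minimum)
{
	ereport(ERROR,
			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
			 errmsg("invalid value for parameter \"%s\": \"%s\"", name, given),
			 errhint("The minimum allowed value is %s.", minimum)));
}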

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread Gregory Smith

On 9/26/14, 2:17 PM, Tom Lane wrote:
Well, ok, let's allow zero as a special case, but it has to be written 
as 0 not something else. If you try to set a positive value then we 
should act as though the min_val is 1 unit. So I'm coming around to 
the idea that throw an error if a nonzero input would round (or 
truncate) to zero is a reasonable solution. 


I expressed some distaste for throwing errors before, but I find this 
specific rule reasonable.  I'll even write the patch.  I owe the GUC 
system a re-match after my wal_buffers auto-scaling thing for 9.1 
rippled to cause extra work for you (maybe Robert too).


I just changed this submission from "Ready for Committer" to "Returned 
with Feedback", with the feedback being that we'd like to see special 
values rounded to 0 treated differently altogether first, before 
touching any rounding.  And that's a whole new submission for the next CF.


I think it'd be even more reasonable if we also fixed the rounding 
rule to be round to nearest, but the two changes can be considered 
independently. regards, tom lane 


Personally I'm still -1 on any rounding change, instead preferring to 
work toward eliminating these 0 and -1 special values altogether.  Outside 
of these special cases, I feel you are fundamentally right that if the 
rounding really matters, the unit is simply too large.  And I believe 
that is the case for the log rotation one right now.


I can see my idea for rescaling the rotation age parameter is an 
unpopular one, but I still like it.  That cleanly splits out into a 
third thing though, where it can live or die without being tied to the 
rest of these issues.


--
Greg Smith greg.sm...@crunchydatasolutions.com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/




Re: [HACKERS] Replication identifiers, take 3

2014-09-26 Thread Robert Haas
On Fri, Sep 26, 2014 at 12:32 PM, Andres Freund and...@2ndquadrant.com wrote:
 Huh?  That's just to say that the unused bit space is, in fact,
 unused.  But so what?  We've always been very careful about using up
 things like infomask bits, because there are only so many bits
 available, and when they're gone they are gone.

 I don't think that's a very meaningful comparison. The problem with
 infomask bits is that it's impossible to change anything once added
 because of pg_upgrade'ability. That problem does not exist for
 XLogRecord. We've twiddled with the WAL format pretty much in every
 release. We can reconsider every release.

 I can't remember anyone but me thinking about using these two bytes. So
 the comparison here really is using two free bytes vs. issuing at least
 ~30 (record + origin) for every replayed transaction. Don't think that's
 a unfair tradeof.

Mmph.  You have a point about the WAL format being easier to change.
Reconsidering, though, would mean that some developer who probably
isn't you needs those bytes for something that really is a more
general need than this, so they write a patch to get them back by
doing what I proposed - and then it gets rejected because it's not as
good for logical replication.  So I'm not sure I really buy this as an
argument.  For all practical purposes, if you grab them, they'll be
gone.

  I've also wondered about that. Perhaps we simply should have an
  additional 'name' column indicating the replication solution?

 Yeah, maybe, but there's still the question of substructure within the
 non-replication-solution part of the name.  Not sure if we can assume
 that a bipartite identifier, specifically, is right, or whether some
 solutions will end up with different numbers of components.

 Ah. I thought you only wanted to suggest a separator between the
 replication solution and it's internal dat. But you actually want to
 suggest an internal separator to be used in the solution's namespace?
 I'm fine with that. I don't think we can suggest much beyond that -
 different solutions will have fundamentally differing requirements about
 which information to store.

Agreed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread Gregory Smith

On 9/26/14, 2:34 PM, David Johnston wrote:


I don't get where we need to do anything else besides that...the 
whole actual min values comment is unclear to me.


If you look at pg_settings, there is a minimum value exposed there as 
min_val.  For some of these parameters, that number would normally be 
1.  But since we have decided that 0 is a special flag value, min_val is 
0 instead.


There are others where min_val is -1 for the same reason, where 
functionally the minimum is really 0.  Some of us would like to see 
min_val reflect the useful minimum, period, and move all these special 
case ones out of there.  That is a multi-year battle to engage in 
though, and there's little real value to the user community coming out 
of it relative to that work scope.






Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread Tom Lane
Gregory Smith gregsmithpg...@gmail.com writes:
 On 9/26/14, 2:34 PM, David Johnston wrote:
I don't get where we need to do anything else besides that...the 
 whole actual min values comment is unclear to me.

 If you look at pg_settings, there is a minimum value exposed there as 
 min_val.  For some of these parameters, that number would normally be 
 1.  But since we have decided that 0 is a special flag value, min_val is 
 0 instead.

Right.

 There are others where min_val is -1 for the same reason, where 
 functionally the minimum is really 0.  Some of us would like to see 
 min_val reflect the useful minimum, period, and move all these special 
 case ones out of there.  That is a multi-year battle to engage in 
 though, and there's little real value to the user community coming out 
 of it relative to that work scope.

The impression I had was that Stephen was thinking of actually setting
min_val to 1 (or whatever) and handling zero or -1 in some out-of-band
fashion, perhaps by adding GUC flag bits showing those as allowable
special cases.  I'm not sure how we would display such a state of affairs
in pg_settings, but other than that it doesn't sound implausible.
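
To make that concrete, a rough and purely hypothetical sketch (the flag name
and helper function below are invented for illustration and are not an
existing guc.c API):

#include <stdbool.h>

/*
 * Hypothetical: a GUC flag that lets an explicit 0 bypass the normal
 * min_val check, so min_val itself can be the real useful minimum.
 */
#define GUC_ALLOW_SPECIAL_ZERO	0x100000

static bool
check_int_setting(int value, int min_val, int max_val, int flags)
{
	if (value == 0 && (flags & GUC_ALLOW_SPECIAL_ZERO))
		return true;			/* 0 accepted as the out-of-band "disabled" case */

	return value >= min_val && value <= max_val;	/* otherwise enforce the real limits */
}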

We could alternatively try to split up these cases into multiple GUCs,
which I guess is what you're imagining as a multi-year battle.  But
personally I think any such proposal will fail on the grounds that
it's too much compatibility loss for the value gained.

regards, tom lane




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
  There are others where min_val is -1 for the same reason, where 
  functionally the minimum is really 0.  Some of us would like to see 
  min_val reflect the useful minimum, period, and move all these special 
  case ones out of there.  That is a multi-year battle to engage in 
  though, and there's little real value to the user community coming out 
  of it relative to that work scope.
 
 The impression I had was that Stephen was thinking of actually setting
 min_val to 1 (or whatever) and handling zero or -1 in some out-of-band
 fashion, perhaps by adding GUC flag bits showing those as allowable
 special cases.  I'm not sure how we would display such a state of affairs
 in pg_settings, but other than that it doesn't sound implausible.

Yes.  I'm not 100% sure about how to deal with it in pg_settings, but
that is the general idea.

 We could alternatively try to split up these cases into multiple GUCs,
 which I guess is what you're imagining as a multi-year battle.  But
 personally I think any such proposal will fail on the grounds that
 it's too much compatibility loss for the value gained.

Agreed.

Thanks,

Stephen




[HACKERS] jsonb generator functions

2014-09-26 Thread Andrew Dunstan


Here is a patch for the generator and aggregate functions for jsonb that 
we didn't manage to get done in time for 9.4. They are all equivalents 
of the similarly named json functions. Included are


to_jsonb
jsonb_build_object
jsonb_build_array
jsonb_object
jsonb_agg
jsonb_object_agg


Still to come: documentation.

Adding to the next commitfest.

cheers

andrew
diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c
index 2fd87fc..6c23b75 100644
--- a/src/backend/utils/adt/jsonb.c
+++ b/src/backend/utils/adt/jsonb.c
@@ -12,11 +12,20 @@
  */
 #include "postgres.h"
 
+#include "miscadmin.h"
+#include "access/htup_details.h"
+#include "access/transam.h"
+#include "catalog/pg_cast.h"
+#include "catalog/pg_type.h"
 #include "libpq/pqformat.h"
 #include "utils/builtins.h"
+#include "utils/datetime.h"
+#include "utils/lsyscache.h"
 #include "utils/json.h"
 #include "utils/jsonapi.h"
 #include "utils/jsonb.h"
+#include "utils/syscache.h"
+#include "utils/typcache.h"
 
 typedef struct JsonbInState
 {
@@ -24,6 +33,23 @@ typedef struct JsonbInState
 	JsonbValue *res;
 } JsonbInState;
 
+/* unlike with json categories, we need to treat json and jsonb differently */
+typedef enum	/* type categories for datum_to_jsonb */
+{
+	JSONBTYPE_NULL,				/* null, so we didn't bother to identify */
+	JSONBTYPE_BOOL,				/* boolean (built-in types only) */
+	JSONBTYPE_NUMERIC,			/* numeric (ditto) */
+	JSONBTYPE_TIMESTAMP,		/* we use special formatting for timestamp */
+	JSONBTYPE_TIMESTAMPTZ,		/* ... and timestamptz */
+	JSONBTYPE_JSON,				/* JSON */
+	JSONBTYPE_JSONB,			/* JSONB */
+	JSONBTYPE_ARRAY,			/* array */
+	JSONBTYPE_COMPOSITE,		/* composite */
+	JSONBTYPE_JSONCAST,			/* something with an explicit cast to JSON */
+	JSONBTYPE_JSONBCAST,		/* something with an explicit cast to JSONB */
+	JSONBTYPE_OTHER				/* all else */
+}	JsonbTypeCategory;
+
 static inline Datum jsonb_from_cstring(char *json, int len);
 static size_t checkStringLen(size_t len);
 static void jsonb_in_object_start(void *pstate);
@@ -33,6 +59,22 @@ static void jsonb_in_array_end(void *pstate);
 static void jsonb_in_object_field_start(void *pstate, char *fname, bool isnull);
 static void jsonb_put_escaped_value(StringInfo out, JsonbValue *scalarVal);
 static void jsonb_in_scalar(void *pstate, char *token, JsonTokenType tokentype);
+static void jsonb_categorize_type(Oid typoid,
+	  JsonbTypeCategory * tcategory,
+	  Oid *outfuncoid);
+static void composite_to_jsonb(Datum composite, JsonbInState *result);
+static void array_dim_to_jsonb(JsonbInState *result, int dim, int ndims, int *dims,
+   Datum *vals, bool *nulls, int *valcount,
+   JsonbTypeCategory tcategory, Oid outfuncoid);
+static void array_to_jsonb_internal(Datum array, JsonbInState *result);
+static void jsonb_categorize_type(Oid typoid,
+	  JsonbTypeCategory * tcategory,
+	  Oid *outfuncoid);
+static void datum_to_jsonb(Datum val, bool is_null, JsonbInState *result,
+			   JsonbTypeCategory tcategory, Oid outfuncoid,
+			   bool key_scalar);
+static void add_jsonb(Datum val, bool is_null, JsonbInState *result,
+		  Oid val_type, bool key_scalar);
 
 /*
  * jsonb type input function
@@ -462,3 +504,1278 @@ JsonbToCString(StringInfo out, JsonbContainer *in, int estimated_len)
 
 	return out->data;
 }
+
+
+/*
+ * Determine how we want to render values of a given type in datum_to_jsonb.
+ *
+ * Given the datatype OID, return its JsonbTypeCategory, as well as the type's
+ * output function OID.  If the returned category is JSONBTYPE_CAST, we
+ * return the OID of the type->JSON cast function instead.
+ */
+static void
+jsonb_categorize_type(Oid typoid,
+	  JsonbTypeCategory * tcategory,
+	  Oid *outfuncoid)
+{
+	bool		typisvarlena;
+
+	/* Look through any domain */
+	typoid = getBaseType(typoid);
+
+	/* We'll usually need to return the type output function */
+	getTypeOutputInfo(typoid, outfuncoid, typisvarlena);
+
+	/* Check for known types */
+	switch (typoid)
+	{
+		case BOOLOID:
+			*tcategory = JSONBTYPE_BOOL;
+			break;
+
+		case INT2OID:
+		case INT4OID:
+		case INT8OID:
+		case FLOAT4OID:
+		case FLOAT8OID:
+		case NUMERICOID:
+			*tcategory = JSONBTYPE_NUMERIC;
+			break;
+
+		case TIMESTAMPOID:
+			*tcategory = JSONBTYPE_TIMESTAMP;
+			break;
+
+		case TIMESTAMPTZOID:
+			*tcategory = JSONBTYPE_TIMESTAMPTZ;
+			break;
+
+		case JSONBOID:
+			*tcategory = JSONBTYPE_JSONB;
+			break;
+
+		case JSONOID:
+			*tcategory = JSONBTYPE_JSON;
+			break;
+
+		default:
+			/* Check for arrays and composites */
+			if (OidIsValid(get_element_type(typoid)))
+				*tcategory = JSONBTYPE_ARRAY;
+			else if (type_is_rowtype(typoid))
+				*tcategory = JSONBTYPE_COMPOSITE;
+			else
+			{
+				/* It's probably the general case ... */
+				*tcategory = JSONBTYPE_OTHER;
+
+				/*
+				 * but let's look for a cast to json or jsonb, if it's not
+				 * built-in
+				 */
+				if (typoid >= FirstNormalObjectId)
+				{
+					HeapTuple	tuple;
+
+					tuple = 

Re: [HACKERS] jsonb generator functions

2014-09-26 Thread Peter Geoghegan
On Fri, Sep 26, 2014 at 1:54 PM, Andrew Dunstan and...@dunslane.net wrote:
 Here is a patch for the generator and aggregate functions for jsonb that we
 didn't manage to get done in time for 9.4.


That's cool, but I hope someone revisits adding a concatenate
operator. That's the biggest omission IMHO. I'm not going to have time
for that.

-- 
Peter Geoghegan




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread Alvaro Herrera
Tom Lane wrote:

 The impression I had was that Stephen was thinking of actually setting
 min_val to 1 (or whatever) and handling zero or -1 in some out-of-band
 fashion, perhaps by adding GUC flag bits showing those as allowable
 special cases.  I'm not sure how we would display such a state of affairs
 in pg_settings, but other than that it doesn't sound implausible.

I would think that if we're going to have out of band values, we
should just use off, not 0 or -1.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread David Johnston
On Friday, September 26, 2014, Alvaro Herrera alvhe...@2ndquadrant.com
wrote:

 Tom Lane wrote:

  The impression I had was that Stephen was thinking of actually setting
  min_val to 1 (or whatever) and handling zero or -1 in some out-of-band
  fashion, perhaps by adding GUC flag bits showing those as allowable
  special cases.  I'm not sure how we would display such a state of affairs
  in pg_settings, but other than that it doesn't sound implausible.

 I would think that if we're going to have out of band values, we
 should just use off, not 0 or -1.



Except off is not always semantically correct - some GUCs (not sure which
ones ATM) use zero to mean 'default'.  I think -1 always means off.
Instead of 0 and -1 we'd need 'default' and 'off' with the ability to error
if there is no meaningful default defined or if a feature cannot be turned
off.

David J.


Re: [HACKERS] jsonb generator functions

2014-09-26 Thread Andrew Dunstan


On 09/26/2014 05:00 PM, Peter Geoghegan wrote:

On Fri, Sep 26, 2014 at 1:54 PM, Andrew Dunstan and...@dunslane.net wrote:

Here is a patch for the generator and aggregate functions for jsonb that we
didn't manage to get done in time for 9.4.


That's cool, but I hope someone revisits adding a concatenate
operator. That's a biggest omission IMHO. I'm not going to have time
for that.




This patch is the work that I have publicly promised to do.

Dmitry Dolgov is in fact working on jsonb_concat(), and several other 
utility functions.


cheers

andrew




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-26 Thread Peter Geoghegan
On Fri, Sep 26, 2014 at 5:24 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Just to be clear: I wrote the initial patch to demonstrate what I had in
 mind, because I was not able to explain it well enough otherwise. You
 pointed out issues with it, which I then fixed. You then pointed out more
 issues, which I then fixed again.

 My patch version was a proof of concept, to demonstrate that it can be done.

Right. It was a rough prototype built to prove a point. It also served
to show what I was talking about as regards deadlocks (and how the
locks could problematically persist in other ways), which I was
previously unable to effectively explain to Andres. So it was a very
useful exercise, and I wish we did that kind of thing more frequently.
But at the same time, I don't want to hold you to that prototype, or
misrepresent that prototype as showing your final position on any
technical issue. So please correct me if I do that. I've tried to be
careful about that.

 What I'd like you to do now, as the patch author, is to take the promise
 tuple approach and clean it up. If the xmin stuff is ugly, figure out some
 other way to do it.

My concern with the xmin stuff is not that it's ugly; it's that it's
potentially dangerous. It isn't at all easy to reason about where bugs
might appear - lots of things could interact with it in unpredictable
ways. I think we'd have to audit a lot of code, all over the place,
just to make sure nowhere had an assumption broken. This is a big
issue. You are asking me to find a way to save a design that I don't
particularly believe in. That might change, but right now I'm afraid
that that's the reality. Whereas, my design is entirely contained in
the file nbtinsert.c.

 I don't know what you mean by in the head of AM, but IMO it would be far
 better if we can implement this outside the index AMs. Then it will work
 with any index AM.

I mean that value locking is an abstraction that lives "in the head"
of amcanunique AMs. That kind of encapsulation has considerable value
in reducing the risk of bugs. If what I've done has bugs, there aren't
that many places that could expose interactions with other complicated
code. There are fewer moving parts. It's a generalization of the
existing mechanism for unique index enforcement. Plus, database
systems have used heavyweight index locks for this kind of thing since
the 1970s. That's how this works everywhere else (SQL Server certainly
does this for MERGE [1], and only grabs the page-level lock for a
second at lower isolation levels, as in my implementation). I think
that that ought to count for something.

I will be frank. Everyone knows that the nbtree locking parts of this
are never going to be committed over your objections. It cannot
happen. And yet, I persist in proposing that we go that way. I may be
stubborn, but I am not so stubborn that I'd jeopardize all the work
I've put into this to save one aspect of it that no one really cares
about anyway (even I only care about meeting my goals for user visible
behavior [2]). I may actually come up with a better way to make what
you outline work; then again, I may not. I have no idea, to be honest.
It's pretty clear that I'm going to have a hard time getting your
basic approach to value locking accepted without rethinking it a lot,
though. Can you really say that you won't have serious misgivings
about something like the tuple->xmin = InvalidTransactionId
swapping, if I actually formally propose it? That's very invasive to a
lot of places. And right now, I have no idea how we could do better.

I really only want to get to where we have a design that's acceptable.
In all sincerity, I may yet be convinced to go your way. It's possible
that I've failed to fully understand your concerns. Is it really just
about making INSERT ... ON CONFLICT IGNORE work with exclusion
constraints (UPDATE clearly makes little sense)?

 Basically, an INSERT to a table with an exclusion constraint would be the
 same as INSERT ON CONFLICT throw an error. That would be a useful way to
 split this patch into two parts.

I'll think about it. I don't want to do that until I see a way to make
your approach to value locking work in a way that someone is actually
going to be comfortable committing. I am looking for one.

By the way, IMO stress testing has a very useful role to play in the
development of this feature. I've been doing things like trying to
flush out races by running long stress tests with random delays
artificially added at key points. I would like to make that part of
the testing strategy public and transparent.

[1] http://weblogs.sqlteam.com/mladenp/archive/2007/08/03/60277.aspx
[2] Slide 8, Goals for UPSERT in Postgres:
http://www.pgcon.org/2014/schedule/attachments/327_upsert_weird.pdf
-- 
Peter Geoghegan




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread Alvaro Herrera
David Johnston wrote:
 On Friday, September 26, 2014, Alvaro Herrera alvhe...@2ndquadrant.com
 wrote:
 
  Tom Lane wrote:
 
   The impression I had was that Stephen was thinking of actually setting
   min_val to 1 (or whatever) and handling zero or -1 in some out-of-band
   fashion, perhaps by adding GUC flag bits showing those as allowable
   special cases.  I'm not sure how we would display such a state of affairs
   in pg_settings, but other than that it doesn't sound implausible.
 
  I would think that if we're going to have out of band values, we
  should just use off, not 0 or -1.

 Except off is not always semantically correct - some GUCs (not sure which
 ones ATM) use zero to mean 'default'.  I think -1 always means off.
 Instead of 0 and -1 we'd need 'default' and 'off' with the ability to error
 if there is no meaningful default defined or if a feature cannot be turned
 off.

Sure, off (and other spellings of boolean false value) and default
where they make sense, and whatever other values that make sense.  My
point is to avoid collapsing such logical values to integer/floating
point values just because the option uses a numeric scale.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] Replication identifiers, take 3

2014-09-26 Thread Andres Freund
On 2014-09-26 14:57:12 -0400, Robert Haas wrote:
 On Fri, Sep 26, 2014 at 12:32 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  Huh?  That's just to say that the unused bit space is, in fact,
  unused.  But so what?  We've always been very careful about using up
  things like infomask bits, because there are only so many bits
  available, and when they're gone they are gone.
 
  I don't think that's a very meaningful comparison. The problem with
  infomask bits is that it's impossible to change anything once added
  because of pg_upgrade'ability. That problem does not exist for
  XLogRecord. We've twiddled with the WAL format pretty much in every
  release. We can reconsider every release.
 
  I can't remember anyone but me thinking about using these two bytes. So
  the comparison here really is using two free bytes vs. issuing at least
  ~30 (record + origin) for every replayed transaction. Don't think that's
  a unfair tradeof.
 
 Mmph.  You have a point about the WAL format being easier to change.
 Reconsidering, though, would mean that some developer who probably
 isn't you needs those bytes for something that really is a more
 general need than this, so they write a patch to get them back by
 doing what I proposed - and then it gets rejected because it's not as
 good for logical replication.  So I'm not sure I really buy this as an
 argument.  For all practical purposes, if you grab them, they'll be
 gone.

Sure, it'll possibly not be trivial to move them elsewhere. On the other
hand, the padding bytes have been unused for 8+ years without anybody but
me laying claim to them. I don't think it's a good idea to leave
them there unused when nobody has even proposed another use for a long
while. That'll just end up with them continuing to be unused.  And
there are actually four more consecutive bytes on 64-bit systems that are
unused.

Should there really be a dire need after that, we can simply bump the
record size. WAL volume wise it'd not be too bad to make the record a
tiny bit bigger - the header is only a relatively small fraction of the
entire content.

   I've also wondered about that. Perhaps we simply should have an
   additional 'name' column indicating the replication solution?
 
  Yeah, maybe, but there's still the question of substructure within the
  non-replication-solution part of the name.  Not sure if we can assume
  that a bipartite identifier, specifically, is right, or whether some
  solutions will end up with different numbers of components.
 
  Ah. I thought you only wanted to suggest a separator between the
  replication solution and it's internal dat. But you actually want to
  suggest an internal separator to be used in the solution's namespace?
  I'm fine with that. I don't think we can suggest much beyond that -
  different solutions will have fundamentally differing requirements about
  which information to store.
 
 Agreed.

So, let's recommend underscores as that separator?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-26 Thread Peter Geoghegan
On Fri, Sep 26, 2014 at 5:58 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 And the reason that the buffer locking approach in the overlapping case
 is that you'd need to hold a large number of pages locked at the same
 time. Right?


 Yeah, you would. To be honest, I didn't even think about the overlapping
 case, I just assumed that the overlapping case is the typical one and only
 thought about that.

I'm not sure that I follow. Unique constraints don't work across
partitions today. Why should this work across partitions in the most
general case? Simply because there'd have to be one page lock held per
unique index/partition, where promise tuples are somewhat like row
locks, so presumably only one lock table entry is required?

In other database systems with better partitioning support, there is
such a thing as indexes that apply across all partitions (global
indexes). There are also local indexes, that can only be unique if
that comports with the partitioning key in a way that makes sense. But
we don't have anything like global indexes, and even in those other
systems there are huge caveats around MERGE and its impact on global
indexes (they are automatically *marked unusable* by an SQL MERGE
command). So I think making what you have in mind here work for
current Postgres partitioning is totally unrealistic, unless (at the
very least) someone also goes and writes a global index feature, which
is obviously an enormous project.

-- 
Peter Geoghegan




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-26 Thread Alvaro Herrera
Peter Geoghegan wrote:

 Can you really say that you won't have serious misgivings
 about something like the tuple->xmin = InvalidTransactionId
 swapping, if I actually formally propose it? That's very invasive to a
 lot of places. And right now, I have no idea how we could do better.

FWIW there are 28 callers of HeapTupleHeaderGetXmin.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-09-26 Thread Robert Haas
On Sat, Sep 20, 2014 at 3:03 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 Okay, but as there is no predictability (it can be either same as what
 launching process has at the when it has launched background worker
 or it could be some different value if got changed later due to sighup)
 which GUC value will be used by background worker, it might be good
 to clarify the same in pg_bacground docs (which are not there currently,
 but I assume it will eventually be part of this patch).

OK, I will mention that in the documentation when I write it.  I
didn't sweat that too much originally because I wasn't sure how much
churn there was going to be in the user-visible API, but so far
everybody seems happy with that, so maybe it's time to go document it.
It's a pretty simple API but, as you say, there are a few details
worth mentioning.  I still need some review of the earlier patches in
the series before this really gets urgent, though: so far no one has
commented on #1, #2, #4, or #5, and I'm not entirely sure whether my
revised version of #3 passed muster.

 Keeping transaction control (Start/commit) outside the function
 execute_sql_string() could lead to EndCommand() message being
 sent before transaction end which could be a problem in case
 transaction commit fails due to any reason. exec_simple_query() takes
 care of the same by calling finish_xact_command() before reporting
 command-complete for last parse tree.  It even has comment indicating
 that we should follow this protocol.

Fixed in the attached version.

 Won't CommandCounterIncrement() required after every command like
 we do in exec_simple_query()?

Fixed in the attached version.
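
For illustration, a rough sketch of the ordering those two review points ask
for, mirroring what exec_simple_query() does; execute_one_statement() and
commit_current_transaction() are hypothetical stand-ins, not the real
pg_background helpers:

#include "postgres.h"
#include "access/xact.h"		/* CommandCounterIncrement() */
#include "nodes/pg_list.h"
#include "tcop/dest.h"			/* EndCommand(), DestRemote */

/* Hypothetical stand-ins for the purposes of this sketch. */
extern void execute_one_statement(Node *parsetree);
extern void commit_current_transaction(void);

static void
run_sql_statements_sketch(List *parsetree_list, const char *completionTag)
{
	ListCell   *lc;

	foreach(lc, parsetree_list)
	{
		Node	   *parsetree = (Node *) lfirst(lc);

		execute_one_statement(parsetree);

		if (lnext(lc) != NULL)
			CommandCounterIncrement();	/* make effects visible to the next statement */
	}

	/* Finish the transaction work before reporting command completion. */
	commit_current_transaction();
	EndCommand(completionTag, DestRemote);
}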

 Whats the need to reverse the order of execution for EndCommand()
 and PortalDrop()?  Any error in PortalDrop() will lead to wrong
 message being sent to client.

Fixed in the attached version.

 What is the reason for not logging statements executed via
 pg_background, even though it allows to report the sql statement
 to various monitoring tools by setting debug_query_string?

I wasn't really sure whether core GUCs should affect the behavior of a
contrib module, and I wasn't excited about duplicating the code.

 Isn't it better to add a comment why  execute_sql_string() uses
 binary format for result?

Done in the attached version.

 Sure, please take a call based on what you feel is right here, I
 mentioned it because I felt it might be little bit easier for other people
 to understand that code.

I played around with this a bit but it didn't seem like it worked out
to a win.  There were a bunch of things that had to be passed down
into that function and it didn't seem like it was really reducing
complexity.  What I think makes sense is to keep an eye on the
complexity of the handling for each individual message type and move
any handlers that get complex to their own functions.

 There can be certain scenarios where user might like to invoke this
 again.  Assume, user calls function
 pg_background_launch('select count(*) from t1') and this statement
 execution via background worker is going to take very long time before
 it could return anything. After sometime user tries to retrieve data via
 pg_background_result(), but the call couldn't came out as it is waiting
 for results, so user presses cancel and on again trying after sometime,
 he won't get any data.  I think this behaviour is bit annoying.

Yep. I don't have a better solution at the moment, but there may be one.

 To avoid user to wait for longer, function pg_background_result()
 can take an additional parameter where user can specify whether
 to WAIT incase results are not available.

That gets complicated.  Until any results are available?  Until all
results are available?  What if we try to read from the queue to find
out if results are available, and the first message in the queue is
long enough that it wraps the queue, so that we have to block and wait
for the background worker to send more data before we can continue?
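
To make the wrap problem concrete: even a no-wait probe of the queue can
only report "would block"; it can't hand back half a message.  A sketch
against the shm_mq API -- mqh and friends are made-up variable names:

    Size            nbytes;
    void           *data;
    shm_mq_result   res;

    res = shm_mq_receive(mqh, &nbytes, &data, true);    /* nowait = true */
    if (res == SHM_MQ_WOULD_BLOCK)
    {
        /* Nothing has arrived yet -- or a long message wrapped the ring
         * buffer and hasn't been fully sent.  We can't tell which, and we
         * can't return a partial message to the user. */
    }
    else if (res == SHM_MQ_SUCCESS)
    {
        /* a complete message of nbytes is available at 'data' */
    }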

 Why is a FATAL inside the background worker not propagated at the same
 level by the launcher process?
 If a PANIC in the background worker can kill other backends and restart the
 server, then ideally a FATAL in the background worker should also be treated
 at the same level by the client backend.

It was initially like that, but then I realized it was silly.  If the
background worker hits some error that forces its session to
terminate, there is no need to terminate the user's session too - and
in fact doing so is really annoying, as I rapidly found out while
experimenting with this.  Generally a FATAL is something we do because
backend-local state is corrupted to a degree that makes it impractical
to continue, but the fact that that other backend is messed up does
not mean our backend is messed up too.

 Any error (unable to map the dynamic shared memory segment) before
 pq_redirect_to_shm_mq() will not reach the launcher.  The launcher's client
 will get: ERROR:  lost connection to worker process with PID 4020.

 I think it is very 

Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-26 Thread Peter Geoghegan
On Fri, Sep 26, 2014 at 3:11 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 FWIW there are 28 callers of HeapTupleHeaderGetXmin.

31 by my count, though that difference hardly matters. A lot of those
callers are in parts of the code that I don't know well. For example,
CheckForSerializableConflictOut().

Don't forget about direct callers to HeapTupleHeaderGetRawXmin(),
though. There are plenty of those in tqual.c.
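
(For anyone following along, the distinction is roughly this, per
htup_details.h:)

    /* the raw value as stored in the tuple header */
    #define HeapTupleHeaderGetRawXmin(tup) \
        ((tup)->t_choice.t_heap.t_xmin)

    /* what most callers want: frozen tuples read as FrozenTransactionId */
    #define HeapTupleHeaderGetXmin(tup) \
        (HeapTupleHeaderXminFrozen(tup) ? \
         FrozenTransactionId : HeapTupleHeaderGetRawXmin(tup))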

-- 
Peter Geoghegan




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread Gregory Smith

On 9/26/14, 2:50 PM, David G Johnston wrote:


Like I said I just want to fix the bug and call it a day :)


I'm afraid you've come to the wrong project and mailing list for *that*.






Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-26 Thread David Johnston
On Friday, September 26, 2014, Alvaro Herrera alvhe...@2ndquadrant.com
wrote:

 David Johnston wrote:
  On Friday, September 26, 2014, Alvaro Herrera alvhe...@2ndquadrant.com
  wrote:
 
   Tom Lane wrote:
  
The impression I had was that Stephen was thinking of actually
 setting
min_val to 1 (or whatever) and handling zero or -1 in some
 out-of-band
fashion, perhaps by adding GUC flag bits showing those as allowable
special cases.  I'm not sure how we would display such a state of
 affairs
in pg_settings, but other than that it doesn't sound implausible.
  
   I would think that if we're going to have out of band values, we
   should just use off, not 0 or -1.
 
  Except off is not always semantically correct - some GUCs (not sure
 which
  ones ATM) use zero to mean 'default'.  I think -1 always means off.
  Instead of 0 and -1 we'd need 'default' and 'off' with the ability to
 error
  if there is no meaningful default defined or if a feature cannot be
 turned
  off.

 Sure, off (and other spellings of boolean false value) and default
 where they make sense, and whatever other values that make sense.  My
 point is to avoid collapsing such logical values to integer/floating
 point values just because the option uses a numeric scale.


My comment was more about storage than data entry.  I'm not sure, though,
that we'd want to allow 0 as valid input even if it is acceptable for
Boolean.
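
(To make the flag-bit idea upthread concrete -- purely a hypothetical
sketch, none of these names exist in guc.h, and the sentinel choices are
only illustrative:)

    /* hypothetical GUC flag bits marking accepted out-of-band spellings */
    #define GUC_ALLOW_OFF        0x100000   /* "off" accepted, stored as -1 */
    #define GUC_ALLOW_DEFAULT    0x200000   /* "default" accepted, stored as 0 */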

David J.


Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-26 Thread Peter Geoghegan
On Fri, Sep 26, 2014 at 3:25 PM, Peter Geoghegan p...@heroku.com wrote:
 On Fri, Sep 26, 2014 at 3:11 PM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
 FWIW there are 28 callers of HeapTupleHeaderGetXmin.

 Don't forget about direct callers to HeapTupleHeaderGetRawXmin(),
 though. There are plenty of those in tqual.c.

Which reminds me: commit 37484ad2 added the opportunistic freezing
stuff. To quote the commit message:


Instead of changing the tuple xmin to FrozenTransactionId, the combination
of HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID, which were previously never
set together, is now defined as HEAP_XMIN_FROZEN.  A variety of previous
proposals to freeze tuples opportunistically before vacuum_freeze_min_age
is reached have foundered on the objection that replacing xmin by
FrozenTransactionId might hinder debugging efforts when things in this
area go awry; this patch is intended to solve that problem by keeping
the XID around (but largely ignoring the value to which it is set).



Why wouldn't the same objection (the objection that the earlier
opportunistic freezing ideas stalled on) apply to directly setting
tuple xmin to InvalidTransactionId?

You get the idea, though: Making promise tuples possible to release
early (before transaction end) by setting tuple xmin to
InvalidTransactionId is certainly hard to get right, and seems
dangerous.
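
(The flag trick from that commit, roughly as defined in htup_details.h, is:)

    #define HEAP_XMIN_COMMITTED     0x0100
    #define HEAP_XMIN_INVALID       0x0200
    #define HEAP_XMIN_FROZEN        (HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID)

so the raw xmin survives for debugging, whereas overwriting xmin with
InvalidTransactionId would throw it away -- which is exactly the objection.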

-- 
Peter Geoghegan




Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2014-09-26 Thread Gregory Smith

On 9/26/14, 2:38 PM, Gavin Flower wrote:
Curious: would it be both feasible and useful to have multiple workers 
process a 'large' table, without complicating things too much?  They
could each start at a different position in the file.


Not really feasible without a major overhaul.  It might be mildly useful 
in one rare case.  Occasionally I'll find very hot single tables that 
vacuum is constantly processing, despite mostly living in RAM because 
the server has a lot of memory.  You can set vacuum_cost_page_hit=0 in 
order to get vacuum to chug through such a table as fast as possible.


However, the speed at which that happens will often then be limited by 
how fast a single core can read from memory, for things in 
shared_buffers.  That is limited by the speed of memory transfers from a 
single NUMA memory bank.  Which bank you get will vary depending on the 
core that owns that part of shared_buffers' memory, but it's only one at 
a time.


On large servers, that can be only a small fraction of the total memory 
bandwidth the server is able to reach.  I've attached a graph showing 
how this works on a system with many NUMA banks of RAM, and this is only 
a medium sized system.  This server can hit 40GB/s of memory transfers 
in total; no one process will ever see more than 8GB/s.


If we had more vacuum processes running against the same table, there 
would then be more situations where they were doing work against 
different NUMA memory banks at the same time, therefore making faster 
progress through the hits in shared_buffers possible. In the real world, 
this situation is rare enough compared to disk-bound vacuum work that I 
doubt it's worth getting excited over.  Systems with lots of RAM where 
performance is regularly dominated by one big ugly table are common 
though, so I wouldn't just rule the idea out as not useful either.


--
Greg Smith greg.sm...@crunchydatasolutions.com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/



Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-26 Thread Josh Berkus
All,

So these results have become a bit complex.  So spreadsheet time.

https://docs.google.com/spreadsheets/d/1Mokpx3EqlbWlFDIkF9qzpM7NneN9z-QOXWSzws3E-R4

Some details:

The Length-and-Offset test was performed using a more recent 9.4
checkout than the other two tests.  This was regrettable, and due to a
mistake with git, since the results tell me that there have been some
other changes.

I added two new datasets:

errlog2 is a simple, 4-column error log in JSON format, with 2 small
values and 2 large values in each datum.  It was there to check if any
of our changes affected the performance or size of such simple
structures (answer: no).

processed_b is a synthetic version of Mozilla Socorro's crash dumps,
about 900,000 of them, with nearly identical JSON on each row. These are
large json values (around 4KB each) with a broad mix of values and 5
levels of nesting.  However, none of the levels have very many keys per
level; the max is that the top level has up to 40 keys.  Unlike the
other data sets, I can provide a copy of processed_b for the asking.

So, some observations:

* Data sizes with lengths-and-offsets are slightly (3%) larger than
all-lengths for the pathological case (jsonbish) and unaffected for
other cases.

* Even large, complex JSON (processed_b) gets better compression with
the two patches than with head, although only slightly better (16%)

* This better compression for processed_b leads to slightly slower
extraction (6-7%), and surprisingly slower extraction for
length-and-offset than for all-lengths (about 2%).

* In the pathological case, length-and-offset was notably faster on Q1
than all-lengths (24%), and somewhat slower on Q2 (8%).  I think this
shows me that I don't understand which JSON keys are at the end.

* Notably, length-and-offset when uncompressed (EXTERNAL) was faster on
Q1 than head!  This was surprising enough that I retested it.

Overall, I'm satisfied with the performance of the length-and-offset
patch.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2014-09-26 Thread Jeff Janes
On Fri, Sep 26, 2014 at 11:47 AM, Alvaro Herrera alvhe...@2ndquadrant.com
wrote:

 Gavin Flower wrote:

  Curious: would it be both feasible and useful to have multiple
  workers process a 'large' table, without complicating things too
  much?  They could each start at a different position in the file.

 Feasible: no.  Useful: maybe, we don't really know.  (You could just as
 well have a worker at double the speed, i.e. double vacuum_cost_limit).


Vacuum_cost_delay is already 0 by default.  So unless you changed that,
vacuum_cost_limit does not take effect under vacuumdb.
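
That follows from the gate in the cost-based delay code -- roughly, as it
appears in vacuum.c and vacuum_delay_point():

    VacuumCostActive = (VacuumCostDelay > 0);

    /* in vacuum_delay_point() */
    if (VacuumCostActive && !InterruptPending &&
        VacuumCostBalance >= VacuumCostLimit)
    {
        /* sleep for vacuum_cost_delay (scaled), then reset the balance */
    }

With the delay at 0, VacuumCostActive is false and the limit is never
consulted.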

It is pretty easy for vacuum to be CPU-limited, and even easier for analyze
to be CPU-limited (it does a lot of sorting).  I think analyzing is the
main use case for this patch, to shorten the pg_upgrade window.  At least,
that is how I anticipate using it.

Cheers,

Jeff


Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2014-09-26 Thread Gavin Flower

On 27/09/14 11:36, Gregory Smith wrote:

On 9/26/14, 2:38 PM, Gavin Flower wrote:
Curious: would it be both feasible and useful to have multiple 
workers process a 'large' table, without complicating things too 
much?  They could each start at a different position in the file.


Not really feasible without a major overhaul.  It might be mildly 
useful in one rare case.  Occasionally I'll find very hot single 
tables that vacuum is constantly processing, despite mostly living in 
RAM because the server has a lot of memory.  You can set 
vacuum_cost_page_hit=0 in order to get vacuum to chug through such a 
table as fast as possible.


However, the speed at which that happens will often then be limited by 
how fast a single core can read from memory, for things in 
shared_buffers.  That is limited by the speed of memory transfers from 
a single NUMA memory bank.  Which bank you get will vary depending on 
the core that owns that part of shared_buffers' memory, but it's only 
one at a time.


On large servers, that can be only a small fraction of the total 
memory bandwidth the server is able to reach.  I've attached a graph 
showing how this works on a system with many NUMA banks of RAM, and 
this is only a medium sized system.  This server can hit 40GB/s of 
memory transfers in total; no one process will ever see more than 8GB/s.


If we had more vacuum processes running against the same table, there 
would then be more situations where they were doing work against 
different NUMA memory banks at the same time, therefore making faster 
progress through the hits in shared_buffers possible. In the real 
world, this situation is rare enough compared to disk-bound vacuum 
work that I doubt it's worth getting excited over.  Systems with lots 
of RAM where performance is regularly dominated by one big ugly table 
are common though, so I wouldn't just rule the idea out as not useful 
either.



Thanks for the very detailed reply of yours, and the comments from others.

Cheers,
Gavin




Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2014-09-26 Thread Amit Kapila
On Wed, Sep 24, 2014 at 2:26 PM, Abhijit Menon-Sen a...@2ndquadrant.com
wrote:

 Hi Amit.

 Thanks for your comments, and I'm sorry it's taken me so long to
 respond.

No issues.

 At 2014-08-03 11:18:57 +0530, amit.kapil...@gmail.com wrote:

  7.
  HeapTupleSatisfiesVacuumNoHint()
  a. Why can't we use HeapTupleSatisfiesVisibility() with dirty snapshot
  as we use for pgstattuple?

 Heavier locking. I tried to make do with the existing HTS* functions,
 but fastbloat was noticeably faster in tests with the current setup.

I am not sure why, for the purpose of this functionality, we need to
use a different HTS (HeapTupleSatisfiesVacuum) routine compared to
pgstat_heap().  Unless you have some specific purpose to achieve,
I think it is better to use HeapTupleSatisfiesVisibility().
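
A sketch of what that could look like (placeholder variable names; the
dirty-snapshot setup is the same one mentioned above for pgstattuple):

    SnapshotData    SnapshotDirty;

    InitDirtySnapshot(SnapshotDirty);

    /* ... inside the heap-scan loop, with the buffer content lock held ... */
    if (HeapTupleSatisfiesVisibility(tuple, &SnapshotDirty, buffer))
        stat.tuple_len += tuple->t_len;         /* counted as live */
    else
        stat.dead_tuple_len += tuple->t_len;    /* counted as dead */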

 Maybe I'll call it not_too_slow_bloat().

How about pgfaststattuple() or pgquickstattuple() or pgfuzzystattuple()?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com