Re: [HACKERS] [PATCH] libpq: Allow specifying multiple host names to try to connect to
On Thu, Aug 6, 2015 at 03:15 AM, Michael Paquier wrote:
> On Wed, Aug 5, 2015 at 11:53 PM, Bruce Momjian wrote:
>> On Wed, Jul 8, 2015 at 12:24:37PM -0400, Robbie Harwood wrote:
>>> > You update the documentation just for psql but your change effects any
>>> > libpq application if we go forward with this patch we should update the
>>> > documentation for libpq as well.
>>> >
>>> > This approach seems to work with the url style of conninfo.
>>> >
>>> > For example
>>> >   postgres://some-down-host.info,some-other-host.org:5435/test1
>>> > seems to work as expected but I don't like that syntax I would rather see
>>> >   postgres://some-down-host.info:5435/test1,postgres://some-other-host.org:5435/test1
>>> >
>>> > This would be a more invasive change but I think the syntax is more
>>> > usable.
>>>
>>> I agree with this; it seems to me that it's more powerful to be able to
>>> specify complete urls for when they may differ.
>>>
>>> For the non-url case though, I don't see a clean way of doing this. We
>>> could always, e.g., locally bind port specification to the closest host
>>> specification, but that seems nasty, and is still less powerful than
>>> passing urls (or we could just do the same for all parameters, but
>>> that's just a mess).
>>>
>>> Might it be reasonable to only allow the multi-host syntax in the
>>> url-style and not otherwise?
>>
>> First, I agree this is a very useful feature that we want. Many NoSQL
>> databases are promoting multi-host client libraries as HA, which is kind
>> of humorous, and also makes sense because many NoSQL solutions are
>> multi-host.
>>
>> I can see this feature benefitting us for clients to auto-failover
>> without requiring a pooler or virtual IP reassignment, and also useful
>> for read-only connections that want to connect to a read-only slave, but
>> don't care which one. The idea of randomly selecting a host from the
>> list might be a future feature.
>
> Yep. The JDBC driver is doing it as well.
I added a similar feature to the JDBC driver. Given a list of hostname/port
combinations to connect to, it currently supports the following tuning
parameters:

  targetServerType=any|master|slave|preferSlave
  loadBalanceHosts=false|true

For example, in a two-node master/replica setup one would open write
connections with host1,host2 & targetServerType=master and read-only
connections with host1,host2 & targetServerType=preferSlave.

>> I realize this is libpq-feature-creep, but considering the complexities
>> of a pooler and virtual IP address reassignment, I think adding this
>> The fact that other DBs are doing it, including I think
>> VMWare's libpq, supports the idea of adding this simple specification.

Because the feature at its simplest is a for loop in libpq, I would not
consider it much of a feature creep, especially since my original patch
showed the loop has been hidden in libpq for a long time already; it just
needed a special DNS record for the postgresql hosts that returned records
for all hosts.

Even when there are poolers in front of postgres, they can be set up in a
much simpler and more reliable non-cluster mode when libpq can be given
multiple pooler addresses to connect to.

-Mikko

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
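The host-list handling described above boils down to splitting the
comma-separated list and keeping a per-entry port. A minimal sketch of that
parsing step (my own hypothetical helper, not actual driver code):

```java
import java.util.ArrayList;
import java.util.List;

public class MultiHost {
    /** Split "host1,host2:5433" into host/port pairs; entries without an
     *  explicit ":port" fall back to the default port. Hypothetical
     *  illustration, not code from pgjdbc or libpq. */
    static List<String[]> parseHosts(String spec, int defaultPort) {
        List<String[]> targets = new ArrayList<>();
        for (String entry : spec.split(",")) {
            int colon = entry.indexOf(':');
            String host = colon < 0 ? entry : entry.substring(0, colon);
            int port = colon < 0 ? defaultPort
                    : Integer.parseInt(entry.substring(colon + 1));
            targets.add(new String[] { host, Integer.toString(port) });
        }
        return targets;
    }

    public static void main(String[] args) {
        for (String[] t : parseHosts("host1,host2:5433", 5432))
            System.out.println(t[0] + ":" + t[1]);
        // prints host1:5432 then host2:5433
    }
}
```

A client would then attempt a connection against each pair in order
(honouring targetServerType) until one succeeds.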
[HACKERS] [PATCH] libpq: Allow specifying multiple host names to try to connect to
Hi,

I would like to allow specifying multiple host names for libpq to try
connecting to. This is currently only supported if the host name resolves
to multiple addresses. Having support for it without a complex DNS setup
would be much easier.

Example:

  psql -h dbslave,dbmaster -p 5432 dbname
  psql 'postgresql://dbslave,dbmaster:5432/dbname'

Here the idea is that without the added complexity of pgbouncer or a
similar tool I can get any libpq client to try connecting to multiple
nodes until one answers. I added similar functionality to the jdbc driver
a few years ago.

Because libpq almost supported the feature already, the patch is very
simple: I just split the given host name, do a dns lookup on each part
separately, and link the results. If you configure a port that does not
exist you can see libpq trying to connect to multiple hosts:

  psql -h 127.0.0.2,127.0.0.3, -p 
  psql: could not connect to server: Connection refused
        Is the server running on host "127.0.0.2,127.0.0.3" (127.0.0.2) and accepting
        TCP/IP connections on port ?
  could not connect to server: Connection refused
        Is the server running on host "127.0.0.2,127.0.0.3" (127.0.0.3) and accepting
        TCP/IP connections on port ?

A further improvement would be to add a connection parameter to limit
connections only to the master (writable) or to a slave (read only).

-Mikko

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 62a3b21..e79f96c 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -232,6 +232,12 @@ EOF
       with a slash, it is used as the directory for the Unix-domain socket.
+
+      It is possible to provide multiple host names to connect to. The first host name that accepts
+      the connection will be used. To explicitly specify multiple host names they must be separated
+      by comma.
+      An alternative is to use dns record for the host name that resolves to multiple addresses.
+
diff --git a/src/backend/libpq/ip.c b/src/backend/libpq/ip.c
index db939b5..90d3c1e 100644
--- a/src/backend/libpq/ip.c
+++ b/src/backend/libpq/ip.c
@@ -56,6 +56,9 @@ static int getnameinfo_unix(const struct sockaddr_un * sa,
 			int salen, int flags);
 #endif
 
+/* Marker for multiple dns resolver results linked together */
+#define AI_PG_LINKED (1<<30)
+
 /*
  * pg_getaddrinfo_all - get address info for Unix, IPv4 and IPv6 sockets
@@ -65,6 +68,9 @@ pg_getaddrinfo_all(const char *hostname, const char *servname,
 			const struct addrinfo * hintp, struct addrinfo ** result)
 {
 	int			rc;
+	char	   *hostnames;
+	char	   *host_start;
+	struct addrinfo *addr;
 
 	/* not all versions of getaddrinfo() zero *result on failure */
 	*result = NULL;
@@ -73,10 +79,46 @@ pg_getaddrinfo_all(const char *hostname, const char *servname,
 	if (hintp->ai_family == AF_UNIX)
 		return getaddrinfo_unix(servname, hintp, result);
 #endif
-	/* NULL has special meaning to getaddrinfo(). */
-	rc = getaddrinfo((!hostname || hostname[0] == '\0') ? NULL : hostname,
-					 servname, hintp, result);
+	if (!hostname || hostname[0] == '\0')
+		return getaddrinfo(NULL, servname, hintp, result);
+
+	hostnames = strdup(hostname);
+	if (!hostnames)
+		return EAI_MEMORY;
+
+	host_start = hostnames;
+
+	/* dns lookups in reverse order to make result links in correct order */
+	do
+	{
+		host_start = strrchr(hostnames, ',');
+		if (host_start)
+		{
+			*host_start = '\0';
+			host_start++;
+		}
+		else
+			host_start = hostnames;
+
+		addr = NULL;
+		rc = getaddrinfo(host_start, servname, hintp, &addr);
+		if (rc || !addr)
+		{
+			if (addr)
+				freeaddrinfo(addr);
+			break;
+		}
+		if (*result)
+		{
+			(*result)->ai_flags |= AI_PG_LINKED;
+			addr->ai_next = *result;
+		}
+		*result = addr;
+
+	} while (host_start > hostnames);
+
+	free(hostnames);
 
 	return rc;
 }
@@ -94,6 +136,9 @@ pg_getaddrinfo_all(const char *hostname, const char *servname,
 void
 pg_freeaddrinfo_all(int hint_ai_family, struct addrinfo * ai)
 {
+	struct addrinfo * last_addrinfo;
+	struct addrinfo * ai_ptr;
+
 #ifdef HAVE_UNIX_SOCKETS
 	if (hint_ai_family == AF_UNIX)
 	{
@@ -109,10 +154,24 @@ pg_freeaddrinfo_all(int hint_ai_family, struct addrinfo * ai)
 	}
 	else
 #endif   /* HAVE_UNIX_SOCKETS */
+	/* struct was built by getaddrinfo() */
+	if (ai != NULL)
 	{
-		/* struct was built by getaddrinfo() */
-		if (ai != NULL)
-			freeaddrinfo(ai);
+		while (true)
+		{
+			/* unlink extra addrinfo structures before freeing */
+			last_addrinfo = NULL;
+			for (ai_ptr = ai; ai_ptr->ai_next; ai_ptr = ai_ptr->ai_next)
+			{
+				if (ai_ptr->ai_next->ai_flags & AI_PG_LINKED)
+					last_addrinfo = ai_ptr;
+			}
+			if (!last_addrinfo)
+				break;
+			freeaddrinfo(last_addrinfo->ai_next);
+			last_addrinfo->ai_next = NULL;
+		}
+		freeaddrinfo(ai);
 	}
 }
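The behaviour the patch adds — resolve each comma-separated name and chain
the result lists in order — can be demonstrated with a small standalone
toy. This is not the patch itself: it appends forward instead of iterating
in reverse, uses AI_NUMERICHOST so no real DNS is needed, and skips the
AI_PG_LINKED unlink-before-free dance (so it deliberately leaks the chain):

```c
#include <assert.h>
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Toy multi-host resolver: resolve each comma-separated host and chain
 * the addrinfo lists together in order. */
static int getaddrinfo_multi(const char *hosts, const char *port,
                             struct addrinfo **result)
{
    struct addrinfo hints, *tail = NULL, *addr;
    char *copy = strdup(hosts), *save = NULL, *host;
    int rc = 0;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_NUMERICHOST;    /* keep the demo offline-friendly */

    *result = NULL;
    for (host = strtok_r(copy, ",", &save); host;
         host = strtok_r(NULL, ",", &save))
    {
        rc = getaddrinfo(host, port, &hints, &addr);
        if (rc)
            break;
        if (tail)
            tail->ai_next = addr;       /* link onto the previous list */
        else
            *result = addr;
        for (tail = addr; tail->ai_next; tail = tail->ai_next)
            ;
    }
    free(copy);
    return rc;
}

int main(void)
{
    struct addrinfo *ai;
    char buf[INET_ADDRSTRLEN];

    assert(getaddrinfo_multi("127.0.0.2,127.0.0.3", "5432", &ai) == 0);
    for (struct addrinfo *p = ai; p; p = p->ai_next) {
        struct sockaddr_in *sin = (struct sockaddr_in *) p->ai_addr;
        printf("%s\n", inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof(buf)));
    }
    return 0;
}
```

A real implementation must remember which sub-lists came from separate
getaddrinfo() calls and free each one individually, which is exactly what
the AI_PG_LINKED flag in the patch is for.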
Re: [JDBC] [HACKERS] Pipelining executions to postgresql server
Kevin Wooten wrote:
>> On Nov 4, 2014, at 12:55 AM, Craig Ringer wrote:
>>
>> I have to say I like some aspects of how pgjdbc-ng is being done - in
>> particular use of Maven and being built around non-blocking I/O.
>>
>> OTOH, the use of Google Guava I find pretty inexplicable in a JDBC
>> driver, and since it hard-depends on JDK 7 I don't understand why Netty
>> was used instead of the async / non-blocking features of the core JVM.
>> That may simply be my inexperience with writing non-blocking socket I/O
>> code on Java though.
>
> Java6 has been EOL since 2011 and Java7 is EOL in less than 6 months.

I think that depending on Java 7, an old version that soon should not even
be used in production (without paying for extra support), can hardly be too
hard a requirement.

> It confuses me as to why you consider using stable, well implemented, well
> tested and well cared for libraries as inexplicable.
> Just because we are building a “driver” means we have to write every line
> of code ourselves? No thanks.

Embedding parts of other projects into the code base during the build, with
renamed packages, is nowadays common practice in java projects: spring does
it, elasticsearch embeds the whole of netty and more, and even the jre
embeds for example xerces and asm. It might not be the optimal solution,
but it is still definitely better than writing everything from scratch or
copy-pasting code from other projects.

If pgjdbc-ng provides both a thin maven version (with external versioned
dependencies) and a fat jar with the external dependencies repackaged
inside, users can choose either the old way (just drop a jdbc jar into the
project) or the new way, where their chosen build tool automatically
manages the dependencies.

-Mikko
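The embed-with-renamed-packages practice described above is typically done
with the maven-shade-plugin's relocation feature. A sketch of what
relocating an embedded Netty could look like (the coordinates and shaded
package name here are illustrative, not taken from pgjdbc-ng's actual
build):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <!-- rewrite io.netty.* classes (and references to them)
               into a private namespace inside the fat jar -->
          <relocation>
            <pattern>io.netty</pattern>
            <shadedPattern>com.example.shaded.io.netty</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

The thin artifact is then the plain jar with declared dependencies, and the
shaded artifact is what drop-a-jar-in users would pick up.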
Re: [HACKERS] Pipelining executions to postgresql server
> Craig Ringer wrote:
> On 11/02/2014 09:27 PM, Mikko Tiihonen wrote:
>> Is the following summary correct:
>> - the network protocol supports pipelining
>
> Yes.
>
> All you have to do is *not* send a Sync message and be aware that the
> server will discard all input until the next Sync, so pipelining +
> autocommit doesn't make a ton of sense for error handling reasons.

I do not quite grasp why not sending Sync is so important.

My proof-of-concept setup was for queries with autocommit enabled. When
looking with wireshark I could observe that the client sent 3-10 P/B/D/E/S
messages to the server before the server started sending the corresponding
1/2/T/D/C/Z replies for each request. Every 10 requests the test
application waited for all the replies to arrive, so as not to overflow the
network buffers (which is known to cause deadlocks with the current pg jdbc
driver).

If I want separate error handling for each execution, shouldn't I be
sending a separate Sync for each pipelined operation?

>> - the server handles operations in order, starting the processing of next
>>   operation only after fully processing the previous one
>>   - thus pipelining is invisible to the server
>
> As far as I know, yes. The server just doesn't care.
>
>> - libpq driver does not support pipelining, but that is due to internal
>>   limitations
>
> Yep.
>
>> - if proper error handling is done by the client then there is no reason
>>   why pipelining could not be supported by any pg client
>
> Indeed, and most should support it. Sending batches of related queries
> would make things a LOT faster.
>
> PgJDBC's batch support is currently write-oriented. There is no
> fundamental reason it can't be expanded for reads. I've already written
> a patch to do just that for the case of returning generated keys.
>
> https://github.com/ringerc/pgjdbc/tree/batch-returning-support
>
> and just need to rebase it so I can send a pull for upstream PgJDBC.
> It's already linked in the issues documenting the limitations in batch
> support.

Your code looked good. Returning inserts are an important thing to
optimize.

> If you want to have more general support for batches that return rowsets
> there's no fundamental technical reason why it can't be added. It just
> requires some tedious refactoring of the driver to either:
>
> - Sync and wait before it fills its *send* buffer, rather than trying
>   to manage its receive buffer (the server send buffer), so it can
>   reliably avoid deadlocks; or
>
> - Do async I/O in a select()-like loop over a protocol state machine,
>   so it can simultaneously read and write on the wire.

I also think async I/O is the way to go. Luckily that has already been done
in pgjdbc-ng (https://github.com/impossibl/pgjdbc-ng), built on top of the
netty java NIO library. It has quite good feature parity with the original
pgjdbc driver. I'll try next if I can enable pipelining with it, now that I
have tried the proof of concept with the original pgjdbc driver.

> I might need to do some of that myself soon, but it's a big (and
> therefore error-prone) job I've so far avoided by making smaller, more
> targeted changes.
>
> For JDBC the JDBC batch interface is the right place to do this, and you
> should not IMO attempt to add pipelining outside that interface.
> (Multiple open resultsets from portals, yes, but not pipelining of
> queries).

I do not think the JDBC batch interface even allows doing updates to
multiple tables when using prepared statements?

> --
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
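The deadlock-avoidance pattern discussed above — send ahead of reads, but
drain replies before the buffers can fill — can be sketched with a fake
wire. The queue below stands in for the socket, and `window` plays the role
of the every-10-requests drain in the proof of concept (a toy model only; a
real driver would be doing blocking or non-blocking socket I/O):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

public class PipelineDemo {
    // Fake wire: each send immediately queues the reply the server
    // would eventually produce, in order.
    static final ArrayDeque<String> wire = new ArrayDeque<>();

    static void send(String query) { wire.add("result:" + query); }
    static String recv()           { return wire.remove(); }

    /** Pipeline queries, never leaving more than `window` replies
     *  outstanding, so neither side's buffers fill up and deadlock. */
    static List<String> run(List<String> queries, int window) {
        List<String> results = new ArrayList<>();
        int outstanding = 0;
        for (String q : queries) {
            send(q);
            if (++outstanding == window)
                while (outstanding > 0) { results.add(recv()); outstanding--; }
        }
        while (outstanding > 0) { results.add(recv()); outstanding--; }
        return results;
    }

    public static void main(String[] args) {
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < 25; i++) queries.add("q" + i);
        List<String> results = run(queries, 10);
        System.out.println(results.size() + " " + results.get(24));
        // prints: 25 result:q24
    }
}
```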
Re: [HACKERS] Pipelining executions to postgresql server
>>> From: Andres Freund
>>> On 2014-11-01 14:04:05 +0000, Mikko Tiihonen wrote:
>>>> I created a proof of concept patch for postgresql JDBC driver that
>>>> allows the caller to do pipelining of requests within a
>>>> transaction. The pipelining here means same as for HTTP: the client
>>>> can send the next execution already before waiting for the response of
>>>> the previous request to be fully processed.
>>>
>>> Slightly confused here. To my knowledge the jdbc driver already employs
>>> some pipelining? There's some conditions where it's disabled (IIRC
>>> RETURNING for DML is one of them), but otherwise it's available.
>>>
>>> I'm very far from a pgjdbc expert, but that's what I gathered from the
>>> code when investigating issues a while back and from my colleague Craig.
>>
>> Most DB interfaces make the server operations look synchronous.
>
> You IIRC can use jdbc's batch interface.

Yes, there is a limited batch interface for inserts and updates. But for
example when using prepared statements you can only do batches of the same
statement (with different parameters, of course).

>> I should have searched earlier for a better reference to libpq. I am
>> planning on adding support for something similar to
>> http://www.postgresql.org/docs/9.3/static/libpq-async.html
>> More specifically, operations like:
>>   int PQsendQuery(PGconn *conn, const char *command);
>>   PGresult *PQgetResult(PGconn *conn);
>
> The network protocol allows for pipelining, yes. But libpq unfortunately
> doesn't.
> You should read the protocol definition.

I have studied the protocol; that is why I concluded that it would be
possible to add support for pipelining for clients.

>> It should be, if the server started to process (read in) the next query
>> only after it has sent the previous response out.
>
> There's complexities around error handling and such making it slightly
> more complex.

Are you referring to some complexities on the server side related to error
handling, or on the client side?

Is the following summary correct:
- the network protocol supports pipelining
- the server handles operations in order, starting the processing of the
  next operation only after fully processing the previous one
  - thus pipelining is invisible to the server
- the libpq driver does not support pipelining, but that is due to internal
  limitations
- if proper error handling is done by the client then there is no reason
  why pipelining could not be supported by any pg client

-Mikko
Re: [HACKERS] Pipelining executions to postgresql server
> From: Andres Freund
> On 2014-11-01 14:04:05 +0000, Mikko Tiihonen wrote:
>> I created a proof of concept patch for postgresql JDBC driver that
>> allows the caller to do pipelining of requests within a
>> transaction. The pipelining here means same as for HTTP: the client
>> can send the next execution already before waiting for the response of
>> the previous request to be fully processed.
>
> Slightly confused here. To my knowledge the jdbc driver already employs
> some pipelining? There's some conditions where it's disabled (IIRC
> RETURNING for DML is one of them), but otherwise it's available.
>
> I'm very far from a pgjdbc expert, but that's what I gathered from the
> code when investigating issues a while back and from my colleague Craig.

Most DB interfaces make the server operations look synchronous. For JDBC
the only standard interface is similar to libpq's:

  PGresult *PQexec(PGconn *conn, const char *command);

I should have searched earlier for a better reference to libpq. I am
planning on adding support for something similar to
http://www.postgresql.org/docs/9.3/static/libpq-async.html
More specifically, operations like:

  int PQsendQuery(PGconn *conn, const char *command);
  PGresult *PQgetResult(PGconn *conn);

The Java API will of course be custom to the postgresql jdbc driver, since
there is no official java api for async db operations. The above I can do
purely on the jdbc driver side.

But my question was about pipelining. In libpq terms: is it safe to do
multiple PQsendQuery operations before invoking PQgetResult as many times?
It should be, if the server starts to process (read in) the next query only
after it has sent the previous response out.

And do any pg connection poolers support having multiple executions
on-the-fly at the same time for the same connection?

-Mikko
[HACKERS] Pipelining executions to postgresql server
Hi,

I created a proof of concept patch for the postgresql JDBC driver that
allows the caller to do pipelining of requests within a transaction. The
pipelining here means the same as for HTTP: the client can send the next
execution already before waiting for the response of the previous request
to be fully processed. The goal is to reduce the effects of latency between
server and client.

In my test against a localhost postgresql, a jdbc test that queries a
single value from the database 200 times got more than a 20% speed-up from
pipelining. The pipelining version processes the responses every 10
queries. With actual latency between the server and client, larger
speed-ups are of course possible. I think pipelining + jsonb support would
make postgresql an even more competitive key/value and document store.

Example use cases:
1) one insert into the shopping cart table, plus 3 inserts into the
   shopping cart rows table, in one pipeline
   - only one round-trip penalty instead of 4
2) query the shopping cart row and query the shopping cart rows in one
   pipeline
   - only one round-trip penalty instead of 2

The only alternative way to reduce the round-trip latency is to make every
operation a single round trip, and that can only be done with pl functions
and by passing in more complex objects, for example the whole shopping cart
with rows as json data.

What kind of problems could pipelining cause (assuming we limit it to
rather simple use cases only)?
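The arithmetic behind these speed-ups can be sketched with a toy cost
model: each query costs some fixed processing time, and each pipelined
batch costs one network round trip instead of one per query. The numbers
below are illustrative assumptions of mine, not measurements from the
patch:

```java
public class LatencyModel {
    /** Toy model: total time = per-query work + one round trip per
     *  batch of `batch` pipelined queries. */
    static double totalMillis(int queries, double workMs, double rttMs,
                              int batch) {
        return queries * workMs + rttMs * Math.ceil((double) queries / batch);
    }

    public static void main(String[] args) {
        // Localhost-ish guesses: 0.1 ms work per query, 0.025 ms rtt.
        // Pipelining in batches of 10 trims mostly the rtt component,
        // giving a modest win of roughly the 20% order observed.
        System.out.println(totalMillis(200, 0.1, 0.025, 1) + " -> "
                + totalMillis(200, 0.1, 0.025, 10));

        // With 10 ms of real network latency the win is dramatic.
        System.out.println(totalMillis(200, 0.1, 10, 1) + " -> "
                + totalMillis(200, 0.1, 10, 10));
    }
}
```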
-Mikko

From c662b8865c58cba714655148401ac86a21c10f3c Mon Sep 17 00:00:00 2001
From: Mikko Tiihonen
Date: Sat, 1 Nov 2014 15:43:19 +0200
Subject: [PATCH] Example pipelined single-shot query

---
 org/postgresql/core/QueryExecutor.java           |  13 +++
 org/postgresql/core/v2/QueryExecutorImpl.java    |   5 +
 org/postgresql/core/v3/QueryExecutorImpl.java    |  41 +++
 org/postgresql/jdbc2/AbstractJdbc2Statement.java |  79 +
 .../test/jdbc2/PipelineExecutionTest.java        | 129 +
 5 files changed, 267 insertions(+)
 create mode 100644 org/postgresql/test/jdbc2/PipelineExecutionTest.java

diff --git a/org/postgresql/core/QueryExecutor.java b/org/postgresql/core/QueryExecutor.java
index e80a23c..b8e46a6 100644
--- a/org/postgresql/core/QueryExecutor.java
+++ b/org/postgresql/core/QueryExecutor.java
@@ -8,6 +8,7 @@
  */
 package org.postgresql.core;
 
+import java.io.IOException;
 import java.sql.SQLException;
 
 import org.postgresql.copy.CopyOperation;
@@ -101,6 +102,16 @@ public interface QueryExecutor {
     static int QUERY_NO_BINARY_TRANSFER = 256;
 
     /**
+     * Flag for pipeline executions with responses read later.
+     */
+    static int QUERY_PIPELINE = 512;
+
+    /**
+     * Flag for pipeline executions with responses read later.
+     */
+    static int QUERY_DEQUEUE_PIPELINE = 1024;
+
+    /**
      * Execute a Query, passing results to a provided ResultHandler.
      *
      * @param query the query to execute; must be a query returned from
@@ -125,6 +136,8 @@ public interface QueryExecutor {
                  int flags)
         throws SQLException;
 
+    void processPipelinedResult(ResultHandler handler) throws SQLException;
+
     /**
      * Execute several Query, passing results to a provided ResultHandler.
      *
diff --git a/org/postgresql/core/v2/QueryExecutorImpl.java b/org/postgresql/core/v2/QueryExecutorImpl.java
index 33c0048..5a6f607 100644
--- a/org/postgresql/core/v2/QueryExecutorImpl.java
+++ b/org/postgresql/core/v2/QueryExecutorImpl.java
@@ -616,4 +616,9 @@ public class QueryExecutorImpl implements QueryExecutor {
     public CopyOperation startCopy(String sql, boolean suppressBegin) throws SQLException {
         throw new PSQLException(GT.tr("Copy not implemented for protocol version 2"), PSQLState.NOT_IMPLEMENTED);
     }
+
+    @Override
+    public void processPipelinedResult(ResultHandler handler) throws SQLException {
+        throw new PSQLException(GT.tr("Copy not implemented for protocol version 2"), PSQLState.NOT_IMPLEMENTED);
+    }
 }
diff --git a/org/postgresql/core/v3/QueryExecutorImpl.java b/org/postgresql/core/v3/QueryExecutorImpl.java
index 966a6f6..7297764 100644
--- a/org/postgresql/core/v3/QueryExecutorImpl.java
+++ b/org/postgresql/core/v3/QueryExecutorImpl.java
@@ -11,6 +11,7 @@ package org.postgresql.core.v3;
 import org.postgresql.core.*;
 
 import java.util.ArrayList;
+import java.util.LinkedList;
 import java.util.List;
 import java.util.HashMap;
 import java.util.Properties;
@@ -1713,7 +1714,33 @@ public class QueryExecutorImpl implements QueryExecutor {
         }
     }
 
+    public void processPipelinedResult(ResultHandler handler) throws SQLException {
+        ResultHandlerHolder holder;
+        while ((holder = pipelineResultHandlers.remove(0)) != null) {
+            try {
+                processResults(holder.handler, holder.flags & (~QUERY_PIPELINE) | QUERY_DEQUEUE_PIPELINE);
+            } catch (IOException e) {
+
[HACKERS] Documenting the Frontend/Backend Protocol update criteria
Hi,

Currently the criteria for updating the F/B protocol are undefined. This
makes it hard to update the protocol going forward. It also makes it hard
to write library/driver/application implementations that will be more
future proof against future server versions.

Ideally the documentation for 9.4 (backport?) would say what kinds of
things are allowed to change within the v3 protocol, and thus imply what
kinds of changes need a new v4 protocol. Is there some wishlist page of
items to do differently for v4 already?

I did find the following text in the documentation: "binary
representations for complex data types might change across server
versions". But having more specific rules would help, especially since the
text seems to be there just to scare: so far changes have been strongly
discouraged.

An example to consider: some binary formats have flags (arrays) or a
version (jsonb) field. We should explicitly say that clients must reject
any unknown bits/versions that they do not know about. This guarantees they
detect small format updates instead of silently accepting them and possibly
returning corrupt data.

My motivation: two years ago I accidentally opened a discussion on how to
do updates to the binary encoding of data in the protocol [1]. I would like
to reopen the discussion now, since the jsonb 'binary' encoding is just a
version '1' + text json. The result of the last discussion was that having
a version or flags as part of the binary format is not enough, since
drivers/libraries (fixable) and applications (unfixable) depend on the
current encoding. And if we add a new bit to the flags or bump the version
number, we will break backward compatibility.

To summarize the previous discussion:
- there are currently no written rules for modifying the binary encoding
  formats
- the bytea modification was done with a GUC, but a GUC was seen as a bad
  solution in general
- my proposal to add a minor format version number was not good either,
  since any per-session state would be problematic for connection poolers

[1]: http://grokbase.com/t/postgresql/pgsql-hackers/11bwhv1esa/add-minor-version-to-v3-protocol-to-allow-changes-without-breaking-backwards-compatibility
Re: [HACKERS] Use gcc built-in atomic inc/dec in lock.c
On 12/14/2012 05:55 PM, Merlin Moncure wrote:
> On Fri, Dec 14, 2012 at 9:33 AM, Mikko Tiihonen wrote:
>> On 12/13/2012 12:19 AM, Peter Geoghegan wrote:
>>> On 12 December 2012 22:11, Mikko Tiihonen wrote:
>>>> noticed a "XXX: It might be worth considering using an atomic
>>>> fetch-and-add instruction here, on architectures where that is
>>>> supported." in lock.c. Here is my first try at using it.
>>>
>>> That's interesting, but I have to wonder if there is any evidence that
>>> this *is* actually helpful to performance.
>>
>> One of my open questions listed in the original email was a request for
>> help on creating a test case that exercises the code path enough so that
>> any improvements can be measured.
>>
>> But apart from performance I think there are two other aspects to
>> consider:
>> 1) Code clarity: I think the lock.c code is easier to understand after
>>    the patch
>> 2) Future possibilities: having atomic_inc/dec generally available
>>    allows other performance critical parts of postgres to take advantage
>>    of them in the future
>
> This was actually attempted a little while back; a spinlock was replaced
> with a few atomic increment and decrement calls for managing the refcount
> and other things on the freelist. It helped or hurt depending on
> contention but the net effect was negative. On reflection I think that
> was because the assembly 'lock' instructions are really expensive
> relative to the others: so it's not safe to assume that say 2-3 gcc
> primitive increment calls are cheaper than a spinlock.

The spinlock uses one 'lock' instruction when taken, and no lock
instructions when released. Thus I think replacing one spinlock-protected
add/sub with an atomic 'lock' add/sub should not perform worse. But if you
replace "mutex-lock, add, add, unlock" with "atomic add, atomic add" you
already have more hw-level synchronization and thus risk lower performance
if they are on separate cache lines. This of course limits the use cases of
the atomic operations.
-Mikko
Re: [HACKERS] Use gcc built-in atomic inc/dec in lock.c
On 12/13/2012 12:19 AM, Peter Geoghegan wrote:
> On 12 December 2012 22:11, Mikko Tiihonen wrote:
>> noticed a "XXX: It might be worth considering using an atomic
>> fetch-and-add instruction here, on architectures where that is
>> supported." in lock.c. Here is my first try at using it.
>
> That's interesting, but I have to wonder if there is any evidence that
> this *is* actually helpful to performance.

One of my open questions listed in the original email was a request for
help on creating a test case that exercises the code path enough so that
any improvements can be measured.

But apart from performance I think there are two other aspects to consider:
1) Code clarity: I think the lock.c code is easier to understand after the
   patch
2) Future possibilities: having atomic_inc/dec generally available allows
   other performance critical parts of postgres to take advantage of them
   in the future

-Mikko
[HACKERS] Use gcc built-in atomic inc/dec in lock.c
     : 8b 44 ab 04          	mov    0x4(%rbx,%rbp,4),%eax
 e49: 83 e8 01             	sub    $0x1,%eax
 e4c: 89 44 ab 04          	mov    %eax,0x4(%rbx,%rbp,4)
 e50: c6 03 00             	movb   $0x0,(%rbx)
 e53: 48 8b 5c 24 08       	mov    0x8(%rsp),%rbx
 e58: 48 8b 6c 24 10       	mov    0x10(%rsp),%rbp
 e5d: 48 83 c4 18          	add    $0x18,%rsp
 e61: f3 c3                	repz retq
 e63: 0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
 e68: ba 41 00 00 00       	mov    $0x41,%edx
 e6d: be 00 00 00 00       	mov    $0x0,%esi
 e72: 48 89 df             	mov    %rbx,%rdi
 e75: e8 00 00 00 00       	callq  e7a
 e7a: eb c9                	jmp    e45
 e7c: 0f 1f 40 00          	nopl   0x0(%rax)

Code with gcc-built-ins:

0da0 :
 da0: 48 8b 05 00 00 00 00 	mov    0x0(%rip),%rax        # da7
 da7: 48 85 c0             	test   %rax,%rax
 daa: 74 2a                	je     dd6
 dac: 8b 50 28             	mov    0x28(%rax),%edx
 daf: c6 40 40 00          	movb   $0x0,0x40(%rax)
 db3: 48 c7 05 00 00 00 00 	movq   $0x0,0x0(%rip)        # dbe
 dba: 00 00 00 00
 dbe: 48 89 d0             	mov    %rdx,%rax
 dc1: 48 8b 15 00 00 00 00 	mov    0x0(%rip),%rdx        # dc8
 dc8: 25 ff 03 00 00       	and    $0x3ff,%eax
 dcd: 48 c1 e0 02          	shl    $0x2,%rax
 dd1: f0 83 2c 02 01       	lock subl $0x1,(%rdx,%rax,1)
 dd6: f3 c3                	repz retq
 dd8: 0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
 ddf: 00

From 958b04eb08603167dee2fe6684f9430f5b578f28 Mon Sep 17 00:00:00 2001
From: Mikko Tiihonen
Date: Wed, 12 Dec 2012 20:02:49 +0200
Subject: [PATCH] Use gcc built-in atomic add/sub instructions, if available

diff --git a/configure.in b/configure.in
index 2dee4b3..dec2785 100644
--- a/configure.in
+++ b/configure.in
@@ -1451,6 +1451,28 @@
 if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
   AC_DEFINE(HAVE_GCC_INT_ATOMICS, 1, [Define to 1 if you have __sync_lock_test_and_set(int *) and friends.])
 fi
 
+#TODO, check for __atomic_is_lock_free (sizeof(int), 0) too
+
+AC_CACHE_CHECK([for builtin atomic functions], pgac_cv_gcc_int_atomic_add,
+[AC_TRY_LINK([],
+  [int counter = 0;
+   __atomic_add_fetch(&counter, 1, __ATOMIC_SEQ_CST);],
+  [pgac_cv_gcc_int_atomic_add="yes"],
+  [pgac_cv_gcc_int_atomic_add="no"])])
+if test x"$pgac_cv_gcc_int_atomic_add" = x"yes"; then
+  AC_DEFINE(HAVE_GCC_INT_ATOMIC_ADD, 1, [Define to 1 if you have __atomic_add_fetch(int *, int, int) and friends.])
+fi
+
+AC_CACHE_CHECK([for builtin sync functions], pgac_cv_gcc_int_sync_add,
+[AC_TRY_LINK([],
+  [int counter = 0;
+   __sync_add_and_fetch(&counter, 1);],
+  [pgac_cv_gcc_int_sync_add="yes"],
+  [pgac_cv_gcc_int_sync_add="no"])])
+if test x"$pgac_cv_gcc_int_sync_add" = x"yes"; then
+  AC_DEFINE(HAVE_GCC_INT_SYNC_ADD, 1, [Define to 1 if you have __sync_add_and_fetch(int *, int) and friends.])
+fi
+
 
 #
 # Pthreads
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index ec4da20..c8c0b91 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -40,7 +40,7 @@
 #include "pgstat.h"
 #include "storage/proc.h"
 #include "storage/sinvaladt.h"
-#include "storage/spin.h"
+#include "storage/atomics.h"
 #include "storage/standby.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -234,7 +234,12 @@ static PROCLOCK *FastPathGetRelationLockEntry(LOCALLOCK *locallock);
 
 typedef struct
 {
+#ifdef MUTEXLESS_ATOMIC_INC
+#define FAST_PATH_MUTEX(data) NULL
+#else
+#define FAST_PATH_MUTEX(data) &(data)->mutex
 	slock_t		mutex;
+#endif
 	uint32		count[FAST_PATH_STRONG_LOCK_HASH_PARTITIONS];
 } FastPathStrongRelationLockData;
 
@@ -427,8 +432,10 @@ InitLocks(void)
 	FastPathStrongRelationLocks =
 		ShmemInitStruct("Fast Path Strong Relation Lock Data",
 						sizeof(FastPathStrongRelationLockData), &found);
+#ifndef MUTEXLESS_ATOMIC_INC
 	if (!found)
 		SpinLockInit(&FastPathStrongRelationLocks->mutex);
+#endif
 
 	/*
 	 * Allocate non-shared hash table for LOCALLOCK structs. This stores lock
@@ -1207,11 +1214,8 @@ RemoveLocalLock(LOCALLOCK *locallock)
 		fasthashcode = FastPathStrongLockHashPartition(locallock->hashcode);
 
-		SpinLockAcquire(&FastPathStrongRelationLocks->mutex);
-		Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0);
-		FastPathStrongRelationLocks->count[fasthashcode]--;
 		locallock->holdsStrongLockCount = FALSE;
-		SpinLockRelease(&FastPathStro
Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
On 01/25/2012 06:40 PM, Tom Lane wrote:
> Marko Kreen writes:
>> On Wed, Jan 25, 2012 at 10:23:14AM -0500, Tom Lane wrote:
>>> Huh? How can that work? If we decide to change the representation of
>>> some other "well known type", say numeric, how do we decide whether a
>>> client setting that bit is expecting that change or not?
>> It sets that bit *and* version code - which means that it is
>> up-to-date with all "well-known" type formats in that version.
> Then why bother with the bit in the format code? If you've already done
> some other negotiation to establish what datatype formats you will
> accept, this doesn't seem to be adding any value.
>> Basically, I see 2 scenarios here:
>> 1) Client knows the result types and can set the text/bin/version code
>>    safely, without further restrictions.
>> 2) There is a generic framework that does not know query contents but
>>    can be expected to track Postgres versions closely. Such a framework
>>    cannot say "binary" for results safely, but *could* do it for some
>>    well-defined subset of types.
> The hole in approach (2) is that it supposes that the client side knows
> the specific datatypes in a query result in advance. While this is
> sometimes workable for application-level code that knows what query
> it's issuing, it's really entirely untenable for a framework or
> library. The only way that a framework can deal with arbitrary queries
> is to introduce an extra round trip (Describe step) to see what
> datatypes the query will produce so it can decide what format codes to
> issue ... and that will pretty much eat up any time savings you might
> get from a more efficient representation.

This is pretty much what the jdbc driver already does, since it does not have 100% coverage of even the current binary formats. The first time it executes a query it requests text encoding, but caches the Describe results. The next time it sets the binary bits on all return columns that it knows how to decode.
> You really want to do the negotiation once, at connection setup, and
> then be able to process queries without client-side prechecking of what
> data types will be sent back.

I think my original minor_version patch tried to do that. It introduced a per-connection setting for the version. The server GUC_REPORTed the maximum supported minor_version, but defaulted to the baseline wire format. The jdbc client could then bump minor_version to a higher supported value (with an error if the value is larger than what the server advertised). A way was provided for applications using the jdbc driver to override the requested minor_version in the rare event that something broke (rare, because the jdbc driver generally does not expose the wire encoding to applications). If pgbouncer and other pooling solutions reset minor_version to 0, it should work there as well.

Scenarios where the other end is too old to know about minor_version:
* Vserver >> Vlibpq => client does nothing -> baseline version used
* Vlibpq >> Vserver => no supported_minor_version in GUC_REPORT -> baseline version used

Normal 9.2+ scenarios:
* Vserver > Vlibpq => libpq sets minor_version to the largest value it supports -> libpq's requested version used
* Vlibpq > Vserver => libpq notices that the server's supported value is lower than its own, so it sets minor_version to the server-supported value -> server version used

For a driver that exposes the wire format to applications by default, such as the Perl driver, I can envision the driver adding a new API that applications must use to explicitly bump minor_version up, instead of defaulting to the largest supported value as jdbc/libpq do.

The reason I proposed an incrementing minor_version instead of bit flags for new encodings is that it takes less space and is easier to document and understand, which makes exposing it to applications possible. But how do we handle postgres extensions that change their wire format? Maybe we do need "oid:minor_version,oid:ver,oid:ver" as the negotiated version variable?
-Mikko

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements
Previous title was: Add minor version to v3 protocol to allow changes without breaking backwards compatibility

On 01/20/2012 04:45 AM, Noah Misch wrote:
> On Thu, Jan 19, 2012 at 02:00:20PM -0500, Robert Haas wrote:
>> On Thu, Jan 19, 2012 at 10:37 AM, Noah Misch wrote:
>>> I agree with Merlin; the frontend/backend protocol is logically
>>> distinct from the binary send/recv formats of data types. For one key
>>> point, the latter is not exclusively core-defined; third-party
>>> extensions change their send/recv formats on a different schedule.
>>> They can add myext.binary_format_version GUCs of their own to cope in
>>> a similar way.
>> I agree. It occurs to me that we recently changed the default *text*
>> output format for bytea for reasons not dissimilar to those
>> contemplated here. Presumably, that's a much more disruptive change,
>> and yet we've had minimal complaints because anyone who gets bitten
>> can easily set bytea_output='escape' and the problem goes away. The
>> same thing seems like it would work here, only the number of people
>> needing to change the parameter will probably be even smaller, because
>> fewer people use binary than text.
>>
>> Having said that, if we're to follow the precedent set by
>> bytea_output, maybe we ought to just add
>> binary_array_format={huge,ittybitty} and be done with it, rather than
>> inventing a minor protocol version GUC for something that isn't really
>> a protocol version change at all. We could introduce a
>> differently-named general mechanism, but I guess I'm not seeing the
>> point of that either. Just because someone has a
>> backward-compatibility issue with one change of this type doesn't mean
>> they have a similar issue with all of them. So I think adding a
>> special-purpose GUC is more logical and more parallel to what we've
>> done in the past, and it doesn't saddle us with having to be certain
>> that we've designed the mechanism generally enough to handle all the
>> cases that may come later.
> That makes sense.
> An attraction of a single binary format version was avoiding the "Is
> this worth a GUC?" conversation for each change. However, adding a GUC
> should be no more notable than bumping a binary format version.

I see the main difference between a GUC per feature and a minor version being that with versioned changes old clients keep working, because they have to explicitly request a specific version, whereas with separate GUC variables each feature is enabled by default and users have to either keep up with new client versions or figure out how to explicitly disable the changes.

However, due to popular vote I removed the minor version proposal for now. Here is a second version of the patch. The binary array encoding changes stay the same, but all the code around them was rewritten.

Changes from previous versions, based on received comments:
* removed the minor protocol version concept
* introduced a new GUC variable array_output, modeled on the existing
  bytea_output type, with values "full" (the old behavior) and
  "smallfixed" (the new default)
* added documentation for the new GUC variable
* used constants for the array flags variable values

-Mikko

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index e55b503..179a081
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*** COPY postgres_log FROM '/full/path/to/lo
*** 4888,4893 ****
--- 4888,4910 ----
+       array_output (enum)
+
+       array_output configuration parameter
+
+       Sets the output format for binary encoding of values of
+       type array. Valid values are
+       smallfixed (the default)
+       and full (the traditional PostgreSQL
+       format). The array type always
+       accepts both formats on input, regardless of this setting.
+
+
+
+       xmlbinary (enum)
diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
new file mode 100644
index 5582a06..6192a2e
*** a/src/backend/utils/adt/arrayfuncs.c
--- b/src/backend/utils/adt/arrayfuncs.c
***************
*** 30,41 ****
--- 30,45 ----
   * GUC parameter
   */
  bool		Array_nulls = true;
+ int		array_output = ARRAY_OUTPUT_SMALLFIXED;
 
  /*
   * Local definitions
   */
  #define ASSGN "="
 
+ #define FLAG_HASNULLS 1
+ #define FLAG_FIXEDLEN 2
+
  typedef enum
  {
  	ARRAY_NO_LEVEL,
*** static void ReadArrayBinary(StringInfo b
*** 86,92 ****
  			FmgrInfo *receiveproc, Oid typioparam, int32 typmod,
  			int typlen, bool typbyval, char typalign,
  			Datum *values, bool *nulls,
! 			bool *hasnulls, int32 *nbytes);
  static void CopyArrayEls(ArrayType *array,
  			Datum *values, bool *nulls, int nitems,
  			int typlen, bool typbyval, char typalign,
--- 90,96 ----
  			FmgrInfo *receiveproc, Oid typioparam, int32 typmod,
  			int typlen, bool typbyval, char typalign,
  			Datum *values, bool *nulls,
! 			bool *hasnulls, int32 *nbytes, bool fixedlen);
  static void CopyArrayEls
Re: [HACKERS] Add minor version to v3 protocol to allow changes without breaking backwards compatibility
On 11/30/2011 06:52 PM, Merlin Moncure wrote:
> On Mon, Nov 28, 2011 at 9:18 AM, Mikko Tiihonen wrote:
>> Hi,
>> As discussed a few days ago, it would be beneficial if we could change
>> the v3 backend<->client protocol without breaking backwards
>> compatibility.
>> Here is a working patch that exports a supported_binary_minor constant
>> and a binary_minor session variable that can be used by clients to
>> enable newer features. I also added an example usage where the array
>> encoding for constant-size elements is optimized if the binary_minor
>> version is new enough.
>> I have coded the client side support for binary_minor for the jdbc
>> driver and tested that it worked. But let's first discuss if this is
>> an acceptable path forward.
> Regarding your TODO in the code comments about interactions with
> replication: I think it should be removed. WAL streaming depends on
> more things being identical than what is guaranteed here, which is
> basically the protocol + data formats.

OK. I'll remove the comments about replication.

> I think all references to 'protocol' should be removed; binary wire
> formats != protocol: the protocol could bump to v4 but the wire formats
> (by happenstance) could all still continue to work -- therefore the
> version isn't minor (it's not dependent on the protocol version but
> only on itself). Therefore, I don't much like the name
> 'supported_binary_minor'. How about binary_format_version?

I was thinking that it would be possible to use the new minor version to also introduce new protocol messages, such as streaming of large data into the database without knowing its size beforehand.

> Also, is a non-granular approach really buying us anything here? ISTM
> *something* is likely to change format on most releases of the server,
> so I'm wondering what's the point (if you don't happen to be on the
> same x.x release of the server, you might as well assume not to match,
> or at least 'go on your own risk'). The value added to the client
> version query is quite low.
You have a very good point about changes in every new postgres version. Either the text or the binary encoding is likely to change for some types in each new release. There needs to be a decision on an official policy about breaking backwards compatibility of postgresql clients. Is it:

A) Every x.y postgres release is free to break any part of the
   * protocol
   * text encoding
   * binary encoding
   as long as it is documented.
   -> All client libraries should be coded so that they refuse to
   support version x.y+1 or newer (they might have an option to override
   this, but there are no guarantees that it would work).
   Good: no random bugs when using an old client library
   Bad: initial complaints from users before they understand that this
   is the best option for everyone

B) Every x.y postgres release must guarantee that no client-visible
   part of the protocol, text encoding or binary encoding changes from
   the previous release in the v3 protocol. If any changes are needed, a
   new protocol version must be created.
   -> Very high barrier for new features.
   Good: can upgrade the server without updating clients
   Bad: new features are introduced only rarely, after enough have
   accumulated
   Bad: many feature/change patches will rot while waiting for the
   upcoming new version

C) There is an effort to avoid incompatible changes. Some changes are
   blocked because it is detected that they can break backwards
   compatibility, while others are let through (often with some
   compatibility option on the server side to fall back to the previous
   behavior, f.ex. bytea hex encoding).
   -> As far as I understand, this is the current situation.
   Good: has worked so far
   Bad: accumulates compatibility flags in the server

D) My proposed compromise, where there is one minor_version setting and
   code in the server to support all the different minor versions. The
   client requests the minor version, so all releases default to the
   backwards-compatible version.
   When the combinations start to create too much maintenance overhead,
   a new clean v4 version of the protocol is specified.
   Good: keeps full backwards compatibility
   Good: allows introducing changes at any time
   Bad: accumulates conditional code in the server and clients until a
   new version of the protocol is released

I'd like the official policy to be A; otherwise I'll push for D.

-Mikko
[HACKERS] Add minor version to v3 protocol to allow changes without breaking backwards compatibility
Hi,

As discussed a few days ago, it would be beneficial if we could change the v3 backend<->client protocol without breaking backwards compatibility.

Here is a working patch that exports a supported_binary_minor constant and a binary_minor session variable that can be used by clients to enable newer features. I also added an example usage where the array encoding for constant-size elements is optimized if the binary_minor version is new enough.

I have coded the client side support for binary_minor for the jdbc driver and tested that it works. But let's first discuss if this is an acceptable path forward.

On 11/25/2011 02:20 AM, Oliver Jowett wrote:
> Re list vs. always-incrementing minor version, you could just use an
> integer and set bits to represent features, which would keep it simple
> but also let clients be more selective about which features they
> implement (you could support feature 21 and 23 without supporting 22)

I decided not to use feature flags because when features start to depend on each other, we need multiple negotiation round trips until the final feature set can be known. If, in your example above, feature 23 depends on the server side on feature 22, but the client only requests 21,23, the server must inform the client that the combination 21,23 is reduced to 21. And if the client cannot support 21 without 23, the final feature set will contain neither 21 nor 23.
-Mikko

commit d7ef5f1aef0941ec4b931f24745b65d77e7511e4
Author: Mikko Tiihonen
Date:   Sun Nov 27 14:12:59 2011 +0200

    Define binary_minor variable to control binary protocol minor version

diff --git a/src/backend/utils/adt/version.c b/src/backend/utils/adt/version.c
index 6ea5bd2..33ed4d3 100644
--- a/src/backend/utils/adt/version.c
+++ b/src/backend/utils/adt/version.c
@@ -16,6 +16,7 @@
 
 #include "utils/builtins.h"
 
+int binary_minor = 0;
 
 Datum
 pgsql_version(PG_FUNCTION_ARGS)
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index da7b6d4..67e80f1 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -64,6 +64,7 @@
 #include "storage/predicate.h"
 #include "tcop/tcopprot.h"
 #include "tsearch/ts_cache.h"
+#include "utils/binaryminorversion.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
 #include "utils/guc_tables.h"
@@ -2053,6 +2054,18 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"binary_minor", PGC_USERSET, CLIENT_CONN_LOCALE,
+			gettext_noop("Sets the binary protocol minor version that controls enabling"
+						 " of newer features."),
+			gettext_noop("The default value is 0 which uses the binary protocol features"
+						 " as specified in postgres 9.1 release.")
+		},
+		&binary_minor,
+		0, 0, SUPPORTED_BINARY_MINOR_VERSION,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"log_min_duration_statement", PGC_SUSET, LOGGING_WHEN,
 			gettext_noop("Sets the minimum execution time above which "
 						 "statements will be logged."),
diff --git a/src/include/utils/binaryminorversion.h b/src/include/utils/binaryminorversion.h
new file mode 100644
index 000..40ba970
--- /dev/null
+++ b/src/include/utils/binaryminorversion.h
@@ -0,0 +1,32 @@
+/*-
+ *
+ * binaryminorversion.h
+ *	  "Binary protocol encoding minor version number" for PostgreSQL.
+ *
+ * The binary minor version number is used to flag incompatible changes in
+ * the PostgreSQL v3 binary protocol.
+ * Whenever anyone changes the
+ * format of the binary protocol, or modifies the standard types in such
+ * a way that an updated client wouldn't work with an old database
+ * (or vice versa), the binary protocol encoding version number
+ * should be changed.
+ *
+ * The point of this feature is to provide a way to allow introducing
+ * small new features to the binary encoding of types or the actual
+ * v3 protocol while still allowing full backward compatibility with
+ * old clients. The client can be an application using postgres,
+ * any tool using COPY BINARY or another postgres backend using
+ * replication (TODO: does this affect replication?).
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/binaryminorversion.h
+ *
+ *-
+ */
+#ifndef BINARYMINORVERSION_H
+#define BINARYMINORVERSION_H
+
+#define SUPPORTED_BINARY_MINOR_VERSION 1
+
+#endif
diff --g
Re: [HACKERS] [JDBC] Optimize postgres protocol for fixed size arrays
On 11/24/2011 02:36 AM, Kevin Grittner wrote:
> Oliver Jowett wrote:
>> Can we get a mechanism for minor protocol changes in this future
>> version? Something as simple as exchanging a list of protocol features
>> during the initial handshake (then use only features that are present
>> on both sides) would be enough. The difficulty of making any protocol
>> changes at the moment is a big stumbling block.
> I've been thinking the same thing. Any new protocol should include a
> way for each side to publish a list of what it can accept from the
> other during initial handshaking.
>> (You could probably retrofit that to the current protocol version)
> Perhaps. It would be great if both sides could recognize the case where
> the "feature negotiation" was absent and use a default feature list for
> the protocol available on the other end.

What about a handshake protocol based on a simple binary-protocol minor version instead of features? We keep the v3 protocol as is, but can add cumulative, conditionally enabled features when we bump the minor version.

The handshake requires that the server sends a parameter back with its highest supported minor version:

FE=> StartupPacket
<=BE ParameterStatus(binary_minor = 23)

And the client can send any number between 1 and the server's binary_minor back to enable newer protocol versions and/or limit what the server sends:

FE=> Execute(SET binary_minor = 20)

To keep full backwards compatibility:
1) if the backend does not send a binary_minor parameter on connection,
   the highest supported minor version is assumed to be 0 (current
   format)
2) the backend assumes the active minor version is 0 unless SET
   binary_minor is received

I think bumping a minor version is better than feature flags because:
1) the handshake is simpler and faster
2) coding is easier, as all previous features are known to be supported
   and active when implementing feature+1

I'm not exactly sure about the COPY BINARY feature Tom mentioned.
But probably we could prefix the data with the int4 containing the minor version?

-Mikko
[HACKERS] Optimize postgres protocol for fixed size arrays
Hi,

During conversion of the jdbc driver to use binary encoding when receiving array objects from postgres, it was noticed that, for example for int[] arrays, the binary encoding is normally 30% to 200% larger in bytes than the text encoding. This was of concern to some users with slower links to the database. Even though the encoded size was larger, the binary encoding was still consistently faster (with a 50% speed-up for float[]).

Here is a patch that adds a new flag to the protocol that is set when all elements of the array are of the same fixed size. When the bit is set, the 4-byte length is sent only once, not for each element. Another restriction is that the flag can only be set when there are no NULLs in the array.

The postgres part is my first try at the problem, and I really need some feedback on the detection of fixed-size elements. I just made a guess and noticed that typlen != 0 seemed to work for the basic types I used for testing.

First the postgres patch:

diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index bfb6065..970272f 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -86,7 +86,7 @@ static void ReadArrayBinary(StringInfo buf, int nitems,
 			FmgrInfo *receiveproc, Oid typioparam, int32 typmod,
 			int typlen, bool typbyval, char typalign,
 			Datum *values, bool *nulls,
-			bool *hasnulls, int32 *nbytes);
+			bool *hasnulls, int32 *nbytes, bool fixedlen);
 static void CopyArrayEls(ArrayType *array,
 			Datum *values, bool *nulls, int nitems,
 			int typlen, bool typbyval, char typalign,
@@ -1242,7 +1242,7 @@ array_recv(PG_FUNCTION_ARGS)
 						ndim, MAXDIM)));
 
 	flags = pq_getmsgint(buf, 4);
-	if (flags != 0 && flags != 1)
+	if ((flags & ~3) != 0)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_BINARY_REPRESENTATION),
 				 errmsg("invalid array flags")));
@@ -1327,7 +1327,7 @@ array_recv(PG_FUNCTION_ARGS)
 					&my_extra->proc, typioparam, typmod,
 					typlen, typbyval, typalign,
 					dataPtr, nullsPtr,
-					&hasnulls, &nbytes);
+					&hasnulls, &nbytes, (flags & 2) != 0);
 	if (hasnulls)
 	{
 		dataoffset = ARR_OVERHEAD_WITHNULLS(ndim, nitems);
@@ -1390,7 +1390,8 @@ ReadArrayBinary(StringInfo buf,
 				Datum *values, bool *nulls,
 				bool *hasnulls,
-				int32 *nbytes)
+				int32 *nbytes,
+				bool fixedlen)
 {
 	int			i;
 	bool		hasnull;
@@ -1403,7 +1404,7 @@ ReadArrayBinary(StringInfo buf,
 		char		csave;
 
 		/* Get and check the item length */
-		itemlen = pq_getmsgint(buf, 4);
+		itemlen = fixedlen ? typlen : pq_getmsgint(buf, 4);
 		if (itemlen < -1 || itemlen > (buf->len - buf->cursor))
 			ereport(ERROR,
 					(errcode(ERRCODE_INVALID_BINARY_REPRESENTATION),
@@ -1496,6 +1497,7 @@ array_send(PG_FUNCTION_ARGS)
 	int			bitmask;
 	int			nitems,
 				i;
+	int			flags;
 	int			ndim,
 			   *dim;
 	StringInfoData buf;
@@ -1535,6 +1537,8 @@ array_send(PG_FUNCTION_ARGS)
 	typbyval = my_extra->typbyval;
 	typalign = my_extra->typalign;
 
+	flags = ARR_HASNULL(v) ? 1 : (typlen > 0 ? 2 : 0);
+
 	ndim = ARR_NDIM(v);
 	dim = ARR_DIMS(v);
 	nitems = ArrayGetNItems(ndim, dim);
@@ -1543,7 +1547,7 @@ array_send(PG_FUNCTION_ARGS)
 	/* Send the array header information */
 	pq_sendint(&buf, ndim, 4);
-	pq_sendint(&buf, ARR_HASNULL(v) ? 1 : 0, 4);
+	pq_sendint(&buf, flags, 4);
 	pq_sendint(&buf, element_type, sizeof(Oid));
 	for (i = 0; i < ndim; i++)
 	{
@@ -1571,7 +1575,8 @@ array_send(PG_FUNCTION_ARGS)
 		itemvalue = fetch_att(p, typbyval, typlen);
 
 		outputbytes = SendFunctionCall(&my_extra->proc, itemvalue);
-		pq_sendint(&buf, VARSIZE(outputbytes) - VARHDRSZ, 4);
+		if ((flags & 2) == 0)
+			pq_sendint(&buf, VARSIZE(outputbytes) - VARHDRSZ, 4);
 		pq_sendbytes(&buf, VARDATA(outputbytes),