Re: [HACKERS] Potential GIN vacuum bug
On Mon, Aug 17, 2015 at 6:23 AM, Jeff Janes jeff.ja...@gmail.com wrote: On Aug 16, 2015 11:49 PM, Heikki Linnakangas hlinn...@iki.fi wrote: On 08/16/2015 12:58 AM, Jeff Janes wrote: When ginbulkdelete gets called for the first time in a VACUUM (i.e. stats == NULL), one of the first things it does is call ginInsertCleanup to get rid of the pending list. It does this in lieu of vacuuming the pending list. This is important because if there are any dead tids still in the pending list, someone else could come along during the vacuum and post the dead tids into a part of the index that VACUUM has already passed over. The potential bug is that ginInsertCleanup exits early (ginfast.c lines 796, 860, 898) if it detects that someone else is cleaning up the pending list, without waiting for that someone else to finish the job. Isn't this a problem? Yep, I think you're right. When that code runs as part of VACUUM, it should not give up like that. Hmm, I see other race conditions in that code too. Even if VACUUM wins the race you spotted, and performs all the insertions, reaches the end of the pending items list, and deletes the pending list pages, it's possible that another backend started earlier, and is still processing the same items from the pending items list. It will add them to the tree, and after it's finished with that it will see that the pending list page was already deleted, and bail out. But if there is a dead tuple in the pending items list, you have trouble. The other backend will re-insert it, and that might happen after VACUUM had already removed it from the tree. Could the right to clean the pending list be represented by a self-conflicting heavyweight lock on the index? Vacuum could block on it, while user backends could try to get it conditionally and just give up on the cleanup if it is not available. Also, ginInsertCleanup() seems to assume that if another backend has just finished cleaning up the same page, it will see the page marked as deleted. 
But what if the page is not only marked as deleted, but also reused for something else already? Yeah. Which is possible but pretty unlikely now; but would be far more likely if we added these pages to the fsm more aggressively. The attached patch takes a ShareUpdateExclusiveLock on the index in order to clean the pending list. This fixes the problem I had been seeing when testing https://commitfest.postgresql.org/6/322/ (which was probably caused by the deleted-page situation you mentioned, not by the tids-getting-revived issue I originally brought up.) User backends attempt to take the lock conditionally, because otherwise they would cause an autovacuum already holding the lock to cancel itself, which seems quite bad. Note that this is a substantial behavior change, in that with this code the user backends which find the list already being cleaned will just add to the end of the pending list and go about their business. So if things are added to the list faster than they can be cleaned up, the size of the pending list can increase without bound. Under the existing code each concurrent user backend will try to clean the pending list at the same time. The work doesn't parallelize, so doing this just burns CPU (and possibly consumes up to maintenance_work_mem for *each* backend), but it does serve to throttle the insertion rate and so keep the list from growing without bound. This is just a proof-of-concept patch, because I don't know if this approach is the right approach. One potential problem is how it will interact with create index concurrently. Cheers, Jeff gin_pending_lock.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More WITH
On Mon, Aug 17, 2015 at 1:41 PM, Tom Lane t...@sss.pgh.pa.us wrote: Peter Geoghegan p...@heroku.com writes: On Mon, Aug 17, 2015 at 10:22 AM, Josh Berkus j...@agliodbs.com wrote: Would be tricky. We don't currently have any way to wrap an EXPLAIN in any larger statement, do we? Would be very useful for automated query analysis, though. No. In the grammar, ExplainStmt expects the EXPLAIN to be at the top-level. Having it work any other way would require significant refactoring. You can use EXPLAIN as the source of rows in a plpgsql FOR-over-query loop, so there's a workaround available that way when you need to read EXPLAIN output programmatically. I'm not convinced there's sufficient value in trying to make EXPLAIN a full-fledged subquery otherwise. I think a lot of people would find that handy - I would - but I don't know how hard it is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
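The workaround Tom describes can be packaged up as a set-returning plpgsql function; a sketch (the function name explain_lines is made up) that captures EXPLAIN output row by row via EXECUTE:

```sql
CREATE OR REPLACE FUNCTION explain_lines(query text)
RETURNS SETOF text LANGUAGE plpgsql AS $$
DECLARE
    ln text;
BEGIN
    -- EXPLAIN serves as the row source of a FOR-over-query loop
    FOR ln IN EXECUTE 'EXPLAIN ' || query LOOP
        RETURN NEXT ln;
    END LOOP;
END
$$;

-- The plan text can now be consumed by ordinary SQL:
SELECT * FROM explain_lines('SELECT 1');
```

This gives automated query analysis most of what wrapping EXPLAIN in a subquery would, without the grammar refactoring.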
Re: [HACKERS] Error message with plpgsql CONTINUE
Jim Nasby jim.na...@bluetreble.com writes: On 8/17/15 9:48 AM, Tom Lane wrote: I'm inclined to think that if we wanted to make this better, the way to improve it would be to detect the error*at compile time*, and get rid of this hack in plpgsql_exec_function altogether. So split PLPGSQL_NSTYPE_LABEL into PLPGSQL_NSTYPE_BLOCK_LABEL and PLPGSQL_NSTYPE_LOOP_LABEL, and split opt_block_label and opt_label the same way? I think using two NSTYPE codes would probably be a pain because there are numerous places that don't care about the distinction; it'd be better to have a secondary attribute distinguishing these cases. (It looks like you could perhaps reuse the itemno field for the purpose, since that seems to be going unused in LABEL items.) You likely do need to split opt_block_label into two productions, since that will be the easiest way to pass forward the knowledge of whether it's being called from a loop or non-loop construct. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
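For reference, a minimal case that exercises the error under discussion — CONTINUE naming a block label rather than a loop label (the label name is arbitrary):

```sql
DO $$
BEGIN
    <<blk>>                 -- a block label, not a loop label
    BEGIN
        FOR i IN 1..3 LOOP
            CONTINUE blk;   -- invalid: blk does not belong to a loop;
                            -- currently detected only at run time,
                            -- in plpgsql_exec_function
        END LOOP;
    END;
END
$$;
```

Detecting this at compile time, as proposed, would report the mistake when the function is first parsed rather than when the CONTINUE is reached.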
[HACKERS] Proposal: Implement failover on libpq connect level.
Rationale = Since the introduction of WAL-based replication into PostgreSQL, it has been possible to create high-availability and load-balancing clusters. However, there is no support for failover in the client libraries, so the only way to provide failover that is transparent to the client application is IP address migration. This approach has limitations: it requires that master and backup servers reside in the same subnet, or it may not be feasible for other reasons. Commercial RDBMSs, such as Oracle, employ a more flexible approach: they allow multiple servers to be specified in the connect string, so if the primary server is not available, the client library tries to connect to the other ones. This approach permits geographically distributed failover clusters and is also a cheap way to implement load balancing (which is not possible with IP address migration). Proposed change === Allow multiple hosts to be specified in the libpq connect string. Make libpq attempt to connect to all hosts simultaneously, or in random order, and use the server which successfully establishes a connection first. Syntax -- A libpq connect string can be either a set of keyword=value pairs or a URL. In the first form, the host keyword can simply be allowed to appear multiple times: host=main-server host=standby1 host=standby2 port=5432 dbname=database In the second form, the host can be specified either in the first part of the URL or in the query parameters: postgresql://user@host/database postgresql:///database?host=hostname&user=username If the host is specified as a parameter, it is also possible to allow multiple host parameters without breaking the existing syntax: postgresql:///database?host=main-server&host=standby1&host=standby2 In order to support load-balancing clusters, additional parameters should be added: readonly=boolean and loadbalancing=boolean. Support for these syntax extensions is added to the PQconnectdb, PQconnectdbParams, PQconnectStart and PQconnectStartParams functions, but not to PQsetdb/PQsetdblogin. 
Behavior If PQconnectdb encounters a connect string with multiple hosts specified, it attempts to establish a connection with all these hosts simultaneously, and begins to work with the server which responds first, unless the loadbalancing parameter is true. If the loadbalancing parameter is true, it tries the servers sequentially in random order. If the parameter readonly is false, after authenticating with the server it executes show transaction_read_only, to find out whether the current connection is to the master or to a hot standby, and the connection is considered successful only if the server allows read-write transactions. This makes it possible to have clients which write to the database and clients which perform read-only access. Read-only clients would be load-balanced between the master and slave servers, and read-write clients would connect only to the master (whichever server has this role at the moment of connection). Information about the alternate servers should be stored in the PGconn structure. The function PQreset should be able to take advantage of the new syntax and possibly open a connection to the new master, if failover occurred during the lifetime of the connection. Possible drawbacks == Compatibility - The proposed patch requires no modifications to the server or protocol, and modification of the synchronous functions (PQconnectdb, PQconnectdbParams) doesn't introduce incompatible changes to the client library. Even if a connect string with multiple hosts were erroneously used with a version of libpq which does not support this feature, it would not be an error: it would just use the last host specified in the connect string. There could be some compatibility problems with asynchronous connections created with the PQconnectStart functions. The problem is that we are trying to establish several connections at once, so there are several sockets which should be integrated into the application event loop. 
Even if we tried servers in some particular order (such as a randomized order during load balancing), the file descriptor of the socket can change during execution of PQconnectPoll, and existing applications are not prepared for that. Performance impact -- The performance impact seems to be negligible. 1. If the connect string contains only one host, the only complication is the maintenance of the data structure which can possibly hold more than one host name. The connection process itself would not be affected. 2. If there is a pure high-availability cluster, i.e. standby servers do not accept client connections on the specified port, there is no extra load on the standby servers, and almost none on the client (only several unsuccessful connect calls). 3. If there is a load-balancing cluster, there is no performance impact for read-only clients, but each read-write client causes the standby servers to process an extra connection up to the point where the server can report the read-only state of the transaction (i.e. including SSL handshake and
Re: [HACKERS] checkpointer continuous flushing
On 2015-08-11 17:15:22 +0200, Fabien COELHO wrote:

+void
+PerformFileFlush(FileFlushContext * context)
+{
+	if (context->ncalls != 0)
+	{
+		int rc;
+
+#if defined(HAVE_SYNC_FILE_RANGE)
+
+		/* Linux: tell the memory manager to move these blocks to io so
+		 * that they are considered for being actually written to disk.
+		 */
+		rc = sync_file_range(context->fd, context->offset, context->nbytes,
+							 SYNC_FILE_RANGE_WRITE);
+
+#elif defined(HAVE_POSIX_FADVISE)
+
+		/* Others: say that data should not be kept in memory...
+		 * This is not exactly what we want to say, because we want to write
+		 * the data for durability but we may need it later nevertheless.
+		 * It seems that Linux would free the memory *if* the data has
+		 * already been written to disk, else the dontneed call is ignored.
+		 * For FreeBSD this may have the desired effect of moving the
+		 * data to the io layer, although the system does not seem to
+		 * take into account the provided offset size, so it is rather
+		 * rough...
+		 */
+		rc = posix_fadvise(context->fd, context->offset, context->nbytes,
+						   POSIX_FADV_DONTNEED);
+
+#endif
+
+		if (rc < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not flush block " INT64_FORMAT
+							" on " INT64_FORMAT " blocks in file \"%s\": %m",
+							context->offset / BLCKSZ,
+							context->nbytes / BLCKSZ,
+							context->filename)));
+	}

I'm a bit wary that this might cause significant regressions on platforms not supporting sync_file_range, but supporting posix_fadvise(), for workloads that are bigger than shared_buffers. Consider what happens if the workload does *not* fit into shared_buffers but *does* fit into the OS's buffer cache. Suddenly reads will go to disk again, no? Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Raising our compiler requirements for 9.6
On 2015-08-16 03:31:48 -0400, Noah Misch wrote: As we see from the patches' need to add #include statements to .c files and from Robert's report of a broken EDB build, this change creates work for maintainers of non-core code, both backend code (modules) and frontend code (undocumented, but folks do it). That's to be expected and doesn't take a great deal of justification, but users should get benefits in connection with the work. This brings to mind the htup_details.h introduction, which created so much busy work in non-core code. I don't expect the lock.h split to be quite that disruptive. Statistics on PGXN modules: Four modules (imcs, pg_stat_kcache, query_histogram, query_recorder) mentioned LWLock without mentioning lwlock.h. These patches (trivially) break the imcs build. The other three failed to build for unrelated reasons, but I suspect these patches give those modules one more thing to update. What do users get out of this? They'll learn if their code is not portable to pademelon, but "#warning atomics.h in frontend code is not portable" would communicate the same without compelling non-core code to care. I'd love to make it a #warning instead of an error, but unfortunately that's not standard C :( Other than that benefit, making headers #error-on-FRONTEND mostly lets us congratulate ourselves for having introduced the start of a header layer distinction. I'd be for that if PostgreSQL were new, but I can't justify doing it at the user cost already apparent. That would be bad business. To me that's basically saying that we'll never ever have any better separation between frontend/backend headers since each incremental improvement won't be justifiable. Adding an explicit include which exists in all versions of postgres is really rather low cost; you don't even need a version-dependent define. The modules you name are, as you noticed, likely to require minor surgery for new major versions anyway, being rather low level. 
As you say, breaking C code over major releases doesn't have to meet a high barrier, and getting rid of the wart of lock.h being used that widely seems to be more than sufficient to meet that qualification. If others think this is the right way, ok, I can live with that and implement it, but I won't agree. Therefore, I urge you to instead add the aforementioned #warning and wrap with #ifdef FRONTEND the two #include port/atomics.h in headers. That tightly limits build breakage; it can only break frontend code, which is rare outside core. Hackers will surely notice if a patch makes the warning bleat, so mistakes won't survive long. Non-core frontend code, if willing to cede a shred of portability, can experiment with atomics. Folks could do nifty things with that. I don't think that's something worth keeping in mind from our side. If you don't care about portability you can just use C11 atomics or such. If somebody actually wants to add FRONTEND support for the fallback code - possibly falling back to pthread mutexes - ok, but before that... Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing
Hi Fabien, On 2015-08-12 22:34:59 +0200, Fabien COELHO wrote: sort/flush : tps avg stddev (percent of time beyond 10.0 tps) on on : 631 +- 131 (0.1%) on off : 564 +- 303 (12.0%) off on : 167 +- 315 (76.8%) # stuck... off off : 177 +- 305 (71.2%) # ~ current pg What exactly do you mean with 'stuck'? - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Warnings around booleans
Hi, On 2015-08-12 16:46:01 -0400, Stephen Frost wrote: From ab44660e9b8fc5b10f84ce6ba99a8340047ac8a5 Mon Sep 17 00:00:00 2001 From: Stephen Frost sfr...@snowman.net Date: Wed, 12 Aug 2015 15:50:54 -0400 Subject: [PATCH] In AlterRole, make bypassrls an int When reworking bypassrls in AlterRole to operate the same way the other attribute handling is done, I missed that the variable was incorrectly a bool rather than an int. This meant that on platforms with an unsigned char, we could end up with incorrect behavior during ALTER ROLE. Pointed out by Andres thanks to tests he did changing our bool to be the one from stdbool.h which showed this and a number of other issues. Add regression tests to test CREATE/ALTER role for the various role attributes. Arrange to leave roles behind for testing pg_dumpall, but none which have the LOGIN attribute. In passing, to avoid confusion, rename CreatePolicyStmt's 'cmd' to 'cmd_name', parse_policy_command's 'cmd' to 'polcmd', and AlterPolicy's 'cmd_datum' to 'polcmd_datum', per discussion with Noah and as a follow-up to his correction of copynodes/equalnodes handling of the CreatePolicyStmt 'cmd' field. I'd rather see those split into separate commits. Doing polishing and active bugs in one commit imo isn't a good idea if the polishing goes beyond a line or two. Otherwise this looks ok to me. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb array-style subscripting
Peter Geoghegan p...@heroku.com writes: This patch does not add an operator at all, actually. It feels like there ought to be an operator, but in fact there is not. The parser is hard-coded to recognize array-style subscripts, which this uses. While I'm certainly glad that Dmitry took the time to work on this, I think we will need an operator, too. Or, more accurately, there should probably be a way to make something like this use some available GIN index: postgres=# explain analyze select * from testjsonb where p['a'] = '[1]'; Hm. There is definitely attraction to the idea that x[y] is some sort of operator-like syntactic sugar for invocation of an underlying function, rather than necessarily a hard-coded behavior. That would provide a framework for extending the feature to all container-like datatypes, whereas the approach Dmitry has prototyped doesn't look like it offers much help at all for other datatypes. But I do not think that that will change things very much as far as making it possible for things like the above to be indexed. You'd still have an expression tree that's a nest of two operators, which doesn't fit into our ideas about index opclasses. (That is, it still has to be a special case, so it doesn't matter that much if one of the nodes is some kind of NotAnArrayRef rather than a FuncExpr with some new CoercionForm value to make it print differently.) Also, the syntactic sugar would have to be able to deal with multiple subscripts, which makes things a lot more complicated. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
In case of vacuum, I think we need to track the number of scanned heap pages at least, and the information about the index scans is the additional information. Actually, the progress of the heap page scan depends on the index scans, so complete VACUUM progress needs a count of index pages scanned too. Progress can then be calculated by measuring index_pages_scanned + heap_pages_scanned against total_index_pages + total_heap_pages. This can constitute the essential information, followed by additional per-phase information. The following fields, common across different commands, can be used to display progress:

Command | work done | total work | percent complete | message
VACUUM  |     x     |     y      |        z         | total progress
        |     u     |     v      |        w         | phase 1

The command code can be divided into distinct phases, and each phase's progress can be represented separately, with a summary of the entire command's progress as the first entry. The summary can be the summation of the individual phase entries. If a phase repeats during command execution, the previous entry for the phase is replaced (for example, the index scan in vacuum). The essential information has one numeric datum, which stores information about the command's processing. We may need more than one numeric datum, as mentioned above, to represent scanned blocks versus total blocks. The additional information has two data items, text and numeric; these are free-style data which each backend stores as it likes. If I understand your point correctly, I think you are missing the following: the amount of additional information for each command can differ, so we may need an array of text and numeric data to represent more additional information. Thank you, Rahila Syed
Re: [HACKERS] replication slot restart_lsn initialization
On Tue, Aug 18, 2015 at 4:46 AM, Andres Freund and...@anarazel.de wrote: On 2015-08-17 15:22:44 -0400, Peter Eisentraut wrote: On 8/14/15 3:54 AM, Andres Freund wrote: I'd name it RESERVE_WAL. Why reserve? Isn't it preserve? Hm. I honestly do not know. I was thinking of ticket reservations... Or retain? -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing
On Tue, Aug 18, 2015 at 1:02 AM, Fabien COELHO coe...@cri.ensmp.fr wrote: Hello Andres, [...] posix_fadvise(). My current thinking is maybe yes, maybe no:-), as it may depend on the OS implementation of posix_fadvise, so it may differ between OSes. As long as fadvise has no 'undirty' option, I don't see how that problem goes away. You're telling the OS to throw the buffer away, so unless it ignores it that'll have consequences when you read the page back in. Yep, probably. Note that we are talking about checkpoints, which write buffers out *but* keep them nevertheless. As the buffer is kept, the OS page is a duplicate, and freeing it should not harm, at least immediately. This theory could make sense if we can predict in some way that the data we are flushing out of the OS cache won't be needed soon. After the flush, we can only rely, to an extent, on the data being found in shared_buffers if the usage_count is high; otherwise it could be replaced at any moment by a backend needing the buffer when there is no free buffer. Now, one way to think about it is that if the usage_count is low, then it's okay to assume the data won't be needed in the near future; however, I don't think relying only on usage_count for such a thing is a good idea. To sum up, I agree that it is indeed possible that flushing with posix_fadvise could reduce read OS-memory hits on some systems for some workloads, although not on Linux, see below. So the option is best kept as off for now; without further data, I'm fine with that. One point to think about here is on what basis the user can decide to turn this option on — is it predictable in any way? I think one case could be when the data set fits in shared_buffers. 
In general, providing an option is a good idea if the user can decide with ease when to use it, or if we can give some clear recommendation; otherwise one has to recommend "test your workload with this option and if it works, great, else don't use it", which might be okay in some cases, but it is better to be clear. One minor point: while glancing through the patch, I noticed that a couple of multiline comments are not written in the style usually used in the code (keep the first line empty):

+/* Status of buffers to checkpoint for a particular tablespace,
+ * used internally in BufferSync.
+ * - space: oid of the tablespace
+ * - num_to_write: number of checkpoint pages counted for this tablespace
+ * - num_written: number of pages actually written out
+ */

+/* entry structure for table space to count hashtable,
+ * used internally in BufferSync.
+ */

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [patch] psql tab completion for grant execute
On Tue, Aug 18, 2015 at 6:07 AM, Daniel Verite dan...@manitou-mail.org wrote: Hi, When tab-completing after GRANT EXECUTE, currently psql injects PROCEDURE, rather than the expected ON. The code for completing with ON is there, but it's not reached due to falling earlier into another branch, one that handles CREATE TRIGGER. A trivial patch is attached. It adds the condition that if EXECUTE is preceded by GRANT itself preceded by nothing, then that completion with PROCEDURE is skipped. I've looked at fixing it more directly, by testing if the EXECUTE is part of a CREATE TRIGGER, but it didn't seem fitting to go looking backwards that many words into the string (more than the 5 words suggested by the rest of the code). You should consider adding it to the next CF: https://commitfest.postgresql.org/6/ -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Potential GIN vacuum bug
Jeff Janes wrote: The attached patch takes a ShareUpdateExclusiveLock lock on the index in order to clean the pending list. Does it take a lock on the table also? Because if not ... One potential problem is how it will interact with create index concurrently. ... then I don't understand how you could have a problem here. Surely no pending list cleanup can happen concurrently with the index being created? -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing
Hello Andres, On 2015-08-12 22:34:59 +0200, Fabien COELHO wrote: sort/flush : tps avg stddev (percent of time beyond 10.0 tps) on on : 631 +- 131 (0.1%) on off : 564 +- 303 (12.0%) off on : 167 +- 315 (76.8%) # stuck... off off : 177 +- 305 (71.2%) # ~ current pg What exactly do you mean with 'stuck'? I mean that during the I/O storms induced by the checkpoint, pgbench sometimes gets stuck, i.e. does not report its progression every second (I run with -P 1). This occurs when sort is off, either with or without flush; for instance, an extract from the off/off medium run:

progress: 573.0 s, 5.0 tps, lat 933.022 ms stddev 83.977
progress: 574.0 s, 777.1 tps, lat 7.161 ms stddev 37.059
progress: 575.0 s, 148.9 tps, lat 4.597 ms stddev 10.708
progress: 814.4 s, 0.0 tps, lat -nan ms stddev -nan
progress: 815.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 816.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 817.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 818.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 819.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 820.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 821.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 822.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 823.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 824.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 825.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 826.0 s, 0.0 tps, lat -nan ms stddev -nan

There is a 239.4 second gap in the pgbench output. This occurs from time to time and may represent a significant part of the run, and I count these stuck times as 0 tps. Sometimes pgbench is stuck performance-wise but nevertheless manages to report a 0.0 tps every second, as above after it became unstuck. The actual origin of the issue with a stuck client (pgbench, libpq, OS, postgres...) is unclear to me, but the whole system does not behave well under an I/O storm anyway, and I have not succeeded in understanding where pgbench is stuck when it does not report its progress. 
I tried some runs with gdb but it did not get stuck, and reported a lot of 0.0 tps during the storms. Here are a few more figures with the v8 version of the patch, on a host with 8 cores, 16 GB, RAID 1 HDD, under Ubuntu precise. I already reported the medium case; the small case was run afterwards.

small postgresql.conf: shared_buffers = 2GB, checkpoint_timeout = 300s # this is the default, checkpoint_completion_target = 0.8 # initialization: pgbench -i -s 120

medium postgresql.conf: ## ALREADY REPORTED shared_buffers = 4GB, checkpoint_timeout = 15min, checkpoint_completion_target = 0.8, max_wal_size = 4GB # initialization: pgbench -i -s 250

warmup: pgbench -T 1200 -M prepared -S -j 2 -c 4

# 400 tps throttled test
sh> pgbench -M prepared -N -P 1 -T 4000 -R 400 -L 100 -j 2 -c 4
options / percent of skipped/late transactions
sort/flush /  small medium
on  on  :   3.5    2.7
on  off :  24.6   16.2
off on  :  66.1   68.4
off off :  63.2   68.7

# 200 tps throttled test
sh> pgbench -M prepared -N -P 1 -T 4000 -R 200 -L 100 -j 2 -c 4
sort/flush /  small medium
on  on  :   1.9    2.7
on  off :  14.3    9.5
off on  :  45.6   47.4
off off :  47.4   48.8

# 100 tps throttled test
sh> pgbench -M prepared -N -P 1 -T 4000 -R 100 -L 100 -j 2 -c 4
sort/flush /  small medium
on  on  :   0.9    1.8
on  off :   9.3    7.9
off on  :   5.0   13.0
off off :  31.2   31.9

# full speed 1 client
sh> pgbench -M prepared -N -P 1 -T 4000
options / tps avg stddev (percent of time below 10.0 tps)
sort/flush /  small               medium
on  on  : 564 +- 148 ( 0.1%)   631 +- 131 ( 0.1%)
on  off : 470 +- 340 (21.7%)   564 +- 303 (12.0%)
off on  : 157 +- 296 (66.2%)   167 +- 315 (76.8%)
off off : 154 +- 251 (61.5%)   177 +- 305 (71.2%)

# full speed 2 threads 4 clients
sh> pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4
options / tps avg stddev (percent of time below 10.0 tps)
sort/flush /  small                medium
on  on  :  757 +- 417 ( 0.1%)   1058 +- 455 ( 0.1%)
on  off :  752 +- 893 (48.4%)   1056 +- 942 (32.8%)
off on  :  173 +- 521 (83.0%)    170 +- 500 (88.3%)
off off :  199 +- 512 (82.5%)    209 +- 506 (82.0%)

In all cases, sort=on flush=on provides the best results, with a tps speedup of 3-5x and overall high responsiveness (lower latency). -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Raising our compiler requirements for 9.6
On Wed, Jul 1, 2015 at 11:14 AM, Andres Freund and...@anarazel.de wrote: Hi, During the 9.5 cycle, and earlier, the topic of increasing our minimum bar for compilers came up a bunch of times. Specifically whether we still should continue to use C90 as a baseline. Minor question: is there any value to keeping the client-side code to older standards? On a previous project we compiled libpq on many legacy architectures because we were using it as a frontend to backup software. The project didn't end up taking off so it's no big deal to me, but just throwing it out there: libpq has traditionally enjoyed broader compiler support than the full project (borland, windows back in the day). merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Our trial to TPC-DS but optimizer made unreasonable plan
Here is one other thing I could learn from the TPC-DS benchmark. The attached query is Q4 of TPC-DS, run against SF=100. It took a long time to complete (about 30 min); please see the attached EXPLAIN ANALYZE output. Its core workload is in the CTE year_total. Its Append node has three underlying HashAggregate nodes, each with table joins beneath it. Below is the first HashAggregate node. It consumes 268M rows, then generates 9M rows. The underlying GpuJoin takes 74 sec to process the 268M rows, so we can estimate that HashAggregate consumed ~400 sec.

HashAggregate (cost=18194948.40..21477516.00 rows=262605408 width=178) (actual time=480652.856..488055.918 rows=9142442 loops=1)
  Group Key: customer.c_customer_id, customer.c_first_name, customer.c_last_name, customer.c_preferred_cust_flag, customer.c_birth_country, customer.c_login, customer.c_email_address, date_dim.d_year
  -> Custom Scan (GpuJoin) (cost=101342.54..9660272.64 rows=262605408 width=178) (actual time=2472.174..73908.894 rows=268562375 loops=1)
       Bulkload: On (density: 100.00%)
       Depth 1: Logic: GpuHashJoin, HashKeys: (ss_sold_date_sk), JoinQual: (ss_sold_date_sk = d_date_sk), nrows (287997024 -> 275041999, 95.50% expected 95.47%)
       Depth 2: Logic: GpuHashJoin, HashKeys: (ss_customer_sk), JoinQual: (ss_customer_sk = c_customer_sk), nrows (275041999 -> 268562375, 93.25% expected 91.18%)
       -> Custom Scan (BulkScan) on store_sales (cost=0.00..9649559.60 rows=287996960 width=38) (actual time=18.372..54522.803 rows=287997024 loops=1)
       -> Seq Scan on date_dim (cost=0.00..2705.49 rows=73049 width=16) (actual time=0.015..15.533 rows=73049 loops=1)
       -> Seq Scan on customer (cost=0.00..87141.74 rows=274 width=156) (actual time=0.018..582.373 rows=200 loops=1)

The other two HashAggregate nodes show similar behavior. The second one consumed 281 sec, including 30 sec in the underlying GpuJoin. The third one consumed 138 sec, including 25 sec in the underlying GpuJoin.
Apart from my custom join implementation, it seems to me the HashAggregate node consumed much more time than expected. One characteristic of this workload is that the aggregation takes eight grouping keys. I suspect the cost of function invocations for hash values and equality checks may be the culprit. Let's dive into nodeAgg.c. ExecAgg() calls agg_fill_hash_table() to fill up the hash table with rows fetched from the underlying tables. During construction of the hash table, it calls hash functions (at TupleHashTableHash) and equality-check functions (at execTuplesMatch) repeatedly. Both use the FunctionCallX() interface, which exchanges arguments through the FmgrInfo structure. That is usually not the best approach from a performance perspective. In particular, this workload has 268M input rows and 8 grouping keys, so 268M (rows) x 8 (grouping keys) x 2 (for hash/equal) = roughly 4.3B function calls via FmgrInfo will happen. I think the SortSupport logic provides a reasonable way to solve this kind of problem. For example, btint4sortsupport() supplies a function pointer to the fast version of the comparator (btint4fastcmp), which takes two Datum arguments without indirect memory references. The same mechanism would also make sense for the HashAggregate logic, to reduce the cost of function invocations. Please comment on the idea I noticed here. And, as an aside, if HashAggregate could pick up a hash value of the grouping keys from the target list of the underlying plan node (GpuJoin in this case), it may be possible to off-load the calculation of hash values to the GPU device.
:-) Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com -Original Message- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai Sent: Thursday, August 13, 2015 8:23 PM To: Greg Stark Cc: PostgreSQL-development Subject: Re: [HACKERS] Our trial to TPC-DS but optimizer made unreasonable plan On Thu, Aug 13, 2015 at 2:49 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: In fact, the cost of the HashJoin underlying the Sort node is: -> Hash Join (cost=621264.91..752685.48 rows=1 width=132) On the other hand, the NestLoop at the same place is: -> Nested Loop (cost=0.00..752732.26 rows=1 width=132) Probably, a small GUC adjustment may make the optimizer prefer HashJoin for these kinds of queries. With that kind of discrepancy I doubt adjusting GUCs will be sufficient Do you have a good idea? Do you have EXPLAIN ANALYZE from the plan that finishes? Are there any row estimates that are way off? Yes, EXPLAIN ANALYZE is attached. According to this, CTE year_total generates 384,208 rows. It is much smaller than the estimate (4.78M rows); however, the filter selectivity of the CTE Scan was not as large as expected. For example, the deepest CTE Scan returns 37923 rows and 26314 rows, even though 40 rows were expected. On the next level, relations join between 11324 rows and
Re: [HACKERS] Raising our compiler requirements for 9.6
On 2015-08-17 15:51:33 +0100, Greg Stark wrote: But I'm not clear from the discussion exactly which compilers we're thinking of ruling out. None.
Re: [HACKERS] Raising our compiler requirements for 9.6
On Wed, Jul 1, 2015 at 5:14 PM, Andres Freund and...@anarazel.de wrote: During the 9.5 cycle, and earlier, the topic of increasing our minimum bar for compilers came up a bunch of times. Specifically whether we still should continue to use C90 as a baseline. I think the time has come to rely at least on some newer features. I don't have much opinion on the topic (aside from it's nice that we run on old systems but that's neither here nor there). But I'm not clear from the discussion exactly which compilers we're thinking of ruling out. For GCC are we talking about bumping the minimum version required or are all current versions of GCC new enough and we're only talking about old HPUX/Solaris/etc compilers? -- greg
[HACKERS] what would tar file FDW look like?
I'm starting to work on a tar FDW as a proxy for a much more specific FDW. (It's the 'faster to build two and toss the first away' approach - tar lets me get the FDW stuff nailed down before attacking the more complex container.) It could also be useful in its own right, or as the basis for a zip file FDW. I have figured out that in one mode the FDW mapping would take the name of the tarball as an option and produce a relation that has all of the metadata for the contained files - filename, size, owner, timestamp, etc. I can use the same approach I used for the /etc/passwd FDW for that. (BTW the current version is at https://github.com/beargiles/passwd-fdw. It's skimpy on automated tests until I can figure out how to handle the user mapping, but it works.) The problem is the second mode, where I pull a single file out of the FDW. I've identified three approaches so far:

1. An FDW mapping specific to each file. It would take the name of the tarfile and the embedded file. Cleanest in some ways, but it would be a real pain if you're reading a tarball dynamically.

2. A user-defined function that takes the name of the tarball and file and returns a blob. This is the traditional approach, but why bother with an FDW then? It also brings up access control issues, since it requires disclosure of the tarball name to the user. An FDW could hide that.

3. A user-defined function that takes a tar FDW and the name of a file and returns a blob. I think this is the best approach, but I don't know if I can specify an FDW as a parameter or how to access it.

I've skimmed the existing list of FDWs but didn't find anything that can serve as a model. The foreign DB ones are closest but, again, they aren't designed for dynamic use where you want to do something with every file in an archive / table in a foreign DB. Is there an obvious approach? Or is it simply a bad match for FDW and should it be two standard UDFs? (One returns the metadata, the second returns the specific file.) Thanks, Bear
Re: [HACKERS] what would tar file FDW look like?
On Mon, Aug 17, 2015 at 3:14 PM, Bear Giles bgi...@coyotesong.com wrote: I'm starting to work on a tar FDW as a proxy for a much more specific FDW. (It's the 'faster to build two and toss the first away' approach - tar lets me get the FDW stuff nailed down before attacking the more complex container.) It could also be useful in its own right, or as the basis for a zip file FDW. Hm. tar may be a bad fit where zip may be much easier. Tar has no index or table of contents. You have to scan the entire file to find all the members. IIRC Zip does have a table of contents at the end of the file. The most efficient way to process a tar file is to describe exactly what you want to happen with each member and then process it linearly from start to end (or until you've found the members you're looking for). Trying to return meta info and then go looking for individual members will be quite slow and have a large startup cost. -- greg
Re: [HACKERS] checkpointer continuous flushing
Oops, stalled post, sorry wrong From, resent.. Hello Andres, + rc = posix_fadvise(context->fd, context->offset, [...] I'm a bit wary that this might cause significant regressions on platforms not supporting sync_file_range, but supporting posix_fadvise(), for workloads that are bigger than shared_buffers. Consider what happens if the workload does *not* fit into shared_buffers but *does* fit into the OS's buffer cache. Suddenly reads will go to disk again, no? That is an interesting question! My current thinking is maybe yes, maybe no :-), as it may depend on the OS implementation of posix_fadvise, so it may differ between OSes. This is a reason why I think that flushing should be kept a guc, even if the sort guc is removed and always on. The sync_file_range implementation is clearly always very beneficial for Linux, but posix_fadvise may or may not induce good behavior depending on the underlying system. This is also a reason why the default value for the flush guc is currently set to false in the patch. The documentation should advise turning it on for Linux and testing otherwise. Or, if Linux is assumed to often be the host, then maybe set the default to on and suggest that on some systems it may be better to have it off. (Another reason to keep it off is that I'm not sure what happens with such HD flushing features on virtual servers.) Overall, I'm not pessimistic, because I've seen I/O storms on a FreeBSD host and it was as bad as Linux (namely the database and even the box were offline for long minutes...), and if you can avoid that, having to read back some data may not be that bad a down payment. The issue is largely mitigated if the data is not removed from shared_buffers, because the OS buffer is then just a copy of already-held data. What I would do on such systems is to increase shared_buffers and keep flushing on, that is, to count less on the system cache and more on postgres' own cache.

Overall, I'm not convinced that the practice of relying on the OS cache is a good one, given what it does with it, at least on Linux. Now, if someone could provide a dedicated box with posix_fadvise (say FreeBSD, maybe others...) for testing, that would allow us to provide data instead of speculating... and then maybe decide to change its default value. -- Fabien.
Re: [HACKERS] Error message with plpgsql CONTINUE
Jim Nasby jim.na...@bluetreble.com writes: Calling CONTINUE with a label that's not a loop produces an error message with no context info [1]. True. I think err_stmt should probably only be reset in the non-return case a bit below that. I'm not sure about err_text though. That is not going to help, as you'd soon find if you experimented: given your example, the produced error message would be

ERROR:  CONTINUE cannot be used outside a loop
CONTEXT:  PL/pgSQL function inline_code_block line 2 at statement block

rather than pointing at the CONTINUE. To get where you needed to be, you'd need to have some complicated and fragile rules about where err_stmt is reset or not reset as a statement nest gets unwound. I'm inclined to think that if we wanted to make this better, the way to improve it would be to detect the error *at compile time*, and get rid of this hack in plpgsql_exec_function altogether. pl_gram.y already successfully detects cases where CONTINUE mentions a label that doesn't exist or isn't surrounding the CONTINUE. What it is missing is that we don't distinguish labels on loops from labels on non-loop statements, and thus it can't tell if CONTINUE is referencing a non-loop label or has no label but is not inside any loop-type statement. Seems like that detail could be added to the PLpgSQL_nsitem data structure without a huge amount of work. regards, tom lane
Re: [HACKERS] jsonb array-style subscripting
Hi 2015-08-17 21:12 GMT+02:00 Jim Nasby jim.na...@bluetreble.com: On 8/17/15 12:57 PM, Dmitry Dolgov wrote: * is it interesting for the community? We definitely need better ways to manipulate JSON. * is that a good idea to extend the `ArrayRef` for jsonb? If it's appropriate, probably we can rename it to `ArrayJsonbRef` or something. * what can be improved in the code at the top level (function placement, probably, functionality duplication, etc.)? * are there any special cases that I should take care of in this implementation? How would this work when you have a JSON array? Postgres array syntax suddenly becoming key/value syntax for JSON seems like a pretty bad idea to me. Could a different syntax (maybe {}) be used instead? I don't understand why '{}' should be better than '[]'? Many modern languages don't differentiate between arrays and hashes. Regards Pavel I'm not sure having the UPDATE you show cause objects to spring to life is so great either. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Test code is worth the space
On Mon, Aug 17, 2015 at 02:04:40PM -0500, Jim Nasby wrote: On 8/16/15 8:48 AM, Greg Stark wrote: On Sun, Aug 16, 2015 at 7:33 AM, Noah Misch n...@leadboat.com wrote: When I've just spent awhile implementing a behavior change, the test diffs are a comforting sight. They confirm that the test suite exercises the topic I just changed. Furthermore, most tests today do not qualify under this stringent metric you suggest. The nature of pg_regress opposes it. It's not a comforting sight for me. It means I need to change the test results and then of course the tests will pass. So they didn't really test anything and I don't really know whether the changes broke anything. Any test I had to adjust for my change was effectively worthless. This is precisely my point about pg_regress and why it's the reason there's pressure not to have extensive tests. It's useful to have some end-to-end black box tests like this but it's self-limiting. You can't test many details without tying the tests to specific behaviour. I have worked with insanely extensive testing suites that didn't have this problem. But they were unit tests for code that was structured around testability. When the API is designed to be testable and you have good infrastructure for mocking up the environment and comparing function results in a much narrower way than just looking at text output you can test the correctness of each function without tying the test to the specific behaviour. *That* gives a real comfort. When you change a function to behave entirely differently and can still run all the tests and see that all the code that depends on it still operate correctly then you know you haven't broken anything. In that case it actually *speeds up* development rather than slowing it down. It lets newbie developers hack on code where they don't necessarily know all the downstream dependent code and not be paralyzed by fear that they'll break something. 
As someone who oversaw a database test suite with almost 500 test files (each testing multiple cases), I completely agree. Our early framework was actually similar to pg_regress and that worked OK up to about 200 test files, but it got very unwieldy very quickly. We switched to pgTap and it was far easier to work with. My own position is based on having maintained a pg_regress suite an order of magnitude larger than that. I don't know why that outcome was so different. I suspect any effort to significantly improve Postgres test coverage is doomed until there's an alternative to pg_regress. There is the src/test/perl/TestLib.pm harness.
Re: [HACKERS] missing documentation for partial WAL files
On Tue, Aug 18, 2015 at 3:57 AM, Peter Geoghegan p...@heroku.com wrote: On Mon, Aug 17, 2015 at 11:50 AM, Peter Eisentraut pete...@gmx.net wrote: The commit message for de76884 contains some important information about the purpose and use of the new .partial WAL files. But I don't see anything about this in the documentation or another user-visible place. We should probably add something. This makes sense; those files are exposed in the user's archives when the end of a timeline is reached at promotion. I think that this should be added in "Continuous archiving in standby" as a new paragraph, as the first paragraph argues about archive_mode = 'always' and the second about 'on'. What about the attached? Uh, some documentation around .ready files would be nice too. Why? Users normally need no knowledge of those; the status files are managed only by the backend. -- Michael

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 37aa047..fe161b6 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1259,6 +1259,18 @@ primary_slot_name = 'node_a_slot'
     When a server is not in recovery mode, there is no difference between
     <literal>on</literal> and <literal>always</literal> modes.
    </para>
+
+   <para>
+    When a standby is promoted and <varname>archive_mode</varname> is set to
+    <literal>on</literal> or <literal>always</literal>, it will archive the
+    last, partial and incomplete segment from the previous timeline with the
+    suffix <filename>.partial</filename>. There is no automatic mechanism to
+    detect and use <filename>.partial</filename> files at recovery, so they
+    will go unused unless they are renamed and placed in
+    <filename>pg_xlog</filename>. This segment would not be needed when
+    recovering on the new timeline, as the first segment of the new timeline
+    would be used instead; still, it may be useful for debugging purposes.
+   </para>
  </sect2>
 </sect1>
Re: [HACKERS] Potential GIN vacuum bug
On Mon, Aug 17, 2015 at 3:02 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Jeff Janes wrote: The attached patch takes a ShareUpdateExclusiveLock lock on the index in order to clean the pending list. Does it take a lock on the table also? Because if not ... There must be some kind of lock held on the table already (either RowExclusive at least for user backends or ShareRowExclusive for vacuum backends), but I don't do anything in this patch with it. One potential problem is how it will interact with create index concurrently. ... then I don't understand how you could have a problem here. Surely no pending list cleanup can happen concurrently with the index being created? While grepping through the code looking for precedent, I noticed that create index concurrently takes a ShareUpdateExclusiveLock on both table and index. I don't know why it needs one on the index, I didn't investigate it thoroughly. During the last stage of the create index concurrently, inserters and updaters are obliged to maintain the index even though they can't use it yet. So that would mean adding to the pending list, which might get big enough to need cleaning. Cheers, Jeff
Re: [HACKERS] jsonb array-style subscripting
On 2015-08-17 22:33, Josh Berkus wrote: So, both perl and python do not allow deep nesting of assignments. For example:

d = { "a" : { } }
d["a"]["a1"]["a2"] = 42
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'a1'

Not sure I understand what you mean. In Perl you'd do

$ perl -e '%d = (a => {}); $d{a}{a1}{a2} = 42; print $d{a}{a1}{a2}'
42

which looks pretty much like what's proposed. /kaare
Re: [HACKERS] what would tar file FDW look like?
On 08/17/2015 10:14 AM, Bear Giles wrote: I'm starting to work on a tar FDW as a proxy for a much more specific FDW. (It's the 'faster to build two and toss the first away' approach - tar lets me get the FDW stuff nailed down before attacking the more complex container.) It could also be useful in its own right, or as the basis for a zip file FDW. I have figured out that in one mode the FDW mapping that would take the name of the tarball as an option and produce a relation that has all of the metadata for the contained files - filename, size, owner, timestamp, etc. I can use the same approach I used for the /etc/passwd FDW for that. (BTW the current version is at https://github.com/beargiles/passwd-fdw. It's skimpy on automated tests until I can figure out how to handle the user mapping but it works.) The problem is the second mode where I pull a single file out of the FDW. I've identified three approachs so far: 1. A FDW mapping specific to each file. It would take the name of the tarfile and the embedded file. Cleanest in some ways but it would be a real pain if you're reading a tarball dynamically. 2. A user-defined function that takes the name of the tarball and file and returns a blob. This is the traditional approach but why bother with a FDW then? It also brings up access control issues since it requires disclosure of the tarball name to the user. A FDW could hide that. 3. A user-defined function that takes a tar FDW and the name of a file and returns a blob. I think this is the best approach but I don't know if I can specify a FDW as a parameter or how to access it. I've skimmed the existing list of FDW but didn't find anything that can serve as a model. The foreign DB are closest but, again, they aren't designed for dynamic use where you want to do something with every file in an archive / table in a foreign DB. Is there an obvious approach? Or is it simply a bad match for FDW and should be two standard UDF? 
(One returns the metadata, the second returns the specific file.) I would probably do something like this: In this mode, define a table that has path, blob. To get the blob for a single file, just do select blob from fdwtable where path = '/path/to/foo'. Make sure you process the qual in the FDW. e.g. create foreign table tarblobs (path text, blob bytea) server tarfiles options (filename '/path/to/tarball', mode 'contents'); cheers andrew
Re: [HACKERS] what would tar file FDW look like?
I've written readers for both from scratch. Tar isn't that bad since it's blocked - you read the header, skip forward N blocks, continue. The hardest part is setting up the decompression libraries if you want to support tar.gz or tar.bz2 files. Zip files are more complex. You have (iirc) 5 control blocks - start of archive, start of file, end of file, start of index, end of archive, and the information in the control block is pretty limited. That's not a huge burden since there's support for extensions for things like the unix file metadata. One complication is that you need to support compression from the start. Zip files support two types of encryption. There's a really weak version that almost nobody supports and a much stronger modern version that's subject to license restrictions. (Some people use the weak version on embedded systems because of legal requirements to /do something/, no matter how lame.) There are third-party libraries, of course, but that introduces dependencies. Both formats are simple enough to write from scratch. I guess my bigger question is if there's an interest in either or both for real use. I'm doing this as an exercise but am willing to contrib the code if there's a general interest in it. (BTW the more complex object I'm working on is the .p12 keystore for digital certificates and private keys. We have everything we need in the openssl library so there's no additional third-party dependencies. I have a minimal FDW for the digital certificate itself and am now working on a way to access keys stored in a standard format on the filesystem instead of in the database itself. A natural fit is a specialized archive FDW. Unlike tar and zip it will have two payloads, the digital certificate and the (optionally encrypted) private key. It has searchable metadata, e.g., finding all records with a specific subject.) 
Bear On Mon, Aug 17, 2015 at 8:29 AM, Greg Stark st...@mit.edu wrote: On Mon, Aug 17, 2015 at 3:14 PM, Bear Giles bgi...@coyotesong.com wrote: I'm starting to work on a tar FDW as a proxy for a much more specific FDW. (It's the 'faster to build two and toss the first away' approach - tar lets me get the FDW stuff nailed down before attacking the more complex container.) It could also be useful in its own right, or as the basis for a zip file FDW. Hm. tar may be a bad fit where zip may be much easier. Tar has no index or table of contents. You have to scan the entire file to find all the members. IIRC Zip does have a table of contents at the end of the file. The most efficient way to process a tar file is to describe exactly what you want to happen with each member and then process it linearly from start to end (or until you've found the members you're looking for). Trying to return meta info and then go looking for individual members will be quite slow and have a large startup cost. -- greg
Re: [HACKERS] Raising our compiler requirements for 9.6
On Sat, Aug 15, 2015 at 8:03 PM, Andres Freund and...@anarazel.de wrote: That gave me new respect for STATIC_IF_INLINE. While it does add tedious work to the task of introducing a new batch of inline functions, the work is completely mechanical. Anyone can write it; anyone can review it; there's one correct way to write it. What's the advantage of using STATIC_IF_INLINE over just #ifndef FRONTEND? That doesn't work well for entire headers in my opinion, but for individual functions it shouldn't be a problem? In fact we've done it for years for MemoryContextSwitchTo(). And by the problem definition such functions cannot be required by frontend code. STATIC_IF_INLINE is really tedious because to know whether it works you essentially need to recompile postgres with inlines enabled/disabled. In fact STATIC_IF_INLINE does *not* even help here unless we also detect compilers that support inlining but balk when inline functions reference symbols not available. There was an example of that in the buildfarm as of 2014, c.f. a9baeb361d63 . We'd have to disable inlining for such compilers. The advantage of STATIC_IF_INLINE is that we had everything working in that model. We seem to be replacing problems that we had solved and code that worked on our entire buildfarm with new problems that we haven't solved yet and which don't seem to be a whole lot simpler than the ones they replaced. As far as I can see, the anticipated benefits of what we're doing here are: - Get a cleaner separation of frontend and backend headers (this could also be done independently of STATIC_IF_INLINE, but removing STATIC_IF_INLINE increases the urgency). - Eliminate multiple evaluations hazards. - Modest improvements to code generation. And the costs are: - Lots of warnings with compilers that are not smart about static inline, and probably significantly worse code generation. 
- The possibility that we may repeatedly break #define FRONTEND compilation when we add static inline functions, where instead adding macros would not have caused breakage, thus resulting in continual tinkering with the header files. What I'm basically worried about is that second one. Actually, what I'm specifically worried about is that we will break somebody's #ifdef FRONTEND code, they will eventually complain, and we will refuse to fix it because we don't think their use case is valid. If that happens even a few times, it will lead me to think that this whole effort was misguided. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] More WITH
On Mon, Aug 17, 2015 at 10:22:11AM -0700, Josh Berkus wrote: EXPLAIN [ANALYZE] Would be tricky. We don't currently have any way to wrap an EXPLAIN in any larger statement, do we? We do, but it's kinda horrible.

CREATE OR REPLACE FUNCTION get_something_from_explain(your_query text)
RETURNS TEXT
LANGUAGE plpgsql /* uh oh */
AS $$
DECLARE
    foo JSON;
BEGIN
    EXECUTE format('EXPLAIN (FORMAT json) %s', your_query) INTO foo;
    RETURN foo #>> '{bar,baz,quux}';
END;
$$;

Would be very useful for automated query analysis, though. Among many other things, certainly :) SHOW Not very useful, easy to work around (pg_settings). This particular one is just about being consistent, or the way I look at it, about avoiding surprising users with inconsistencies. Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] More WITH
On 08/17/2015 01:30 PM, Peter Geoghegan wrote: On Mon, Aug 17, 2015 at 10:22 AM, Josh Berkus j...@agliodbs.com wrote: Would be tricky. We don't currently have any way to wrap an EXPLAIN in any larger statement, do we? Would be very useful for automated query analysis, though. No. In the grammar, ExplainStmt expects the EXPLAIN to be at the top-level. Having it work any other way would require significant refactoring. Slightly apropos, I have wrapped EXPLAIN calls inside a function, such as one that gets back the result and then sends it off to http://explain.depesz.com, returning the URL cheers andrew
Re: [HACKERS] Configure with thread sanitizer fails the thread test
Ewan Higgs wrote: So I changed volatile to _Atomic and continued (patch is in thread_test_atomic.patch). I then ran it against sqlsmith. The good news: I didn't happen to find any problems in normal use. The bad news: I did find a lot of warnings about improper use of functions like malloc and free from signal handlers. There's a reason why we don't offer a threaded server ... The postmaster process in particular runs in a rather unusual arrangement, where most of the interesting stuff does happen in signal handlers. I doubt there's any chance that we would make it run in a threaded environment. -- Álvaro Herrera  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Configure with thread sanitizer fails the thread test
On 2015-08-17 07:37:45 +0000, Ewan Higgs wrote:
> So I changed volatile to _Atomic and continued (patch is in
> thread_test_atomic.patch). I then ran it against sqlsmith. The good
> news: I didn't happen to find any problems in normal use. The bad news:
> I did find a lot of warnings about improper use of functions like
> malloc and free from signal handlers. I attached the log under
> 'errors.log'.

Aren't pretty much all of those false positives? I checked the first few
and they all look like:

==
WARNING: ThreadSanitizer: signal-unsafe call inside of a signal (pid=26767)
    #0 free <null> (libtsan.so.0+0x00025d09)
    #1 AllocSetDelete /home/ehiggs/src/download/postgresql/src/backend/utils/mmgr/aset.c:643 (postgres+0x009ece3d)
    #2 MemoryContextDelete /home/ehiggs/src/download/postgresql/src/backend/utils/mmgr/mcxt.c:226 (postgres+0x009ef7cc)
    #3 MemoryContextDeleteChildren /home/ehiggs/src/download/postgresql/src/backend/utils/mmgr/mcxt.c:246 (postgres+0x009ef83b)
    #4 MemoryContextDelete /home/ehiggs/src/download/postgresql/src/backend/utils/mmgr/mcxt.c:209 (postgres+0x009ef798)
    #5 StartChildProcess /home/ehiggs/src/download/postgresql/src/backend/postmaster/postmaster.c:5211 (postgres+0x007b2e72)
    #6 reaper /home/ehiggs/src/download/postgresql/src/backend/postmaster/postmaster.c:2717 (postgres+0x007b44ac)
    #7 <null> <null> (libtsan.so.0+0x000236e3)
    #8 ServerLoop /home/ehiggs/src/download/postgresql/src/backend/postmaster/postmaster.c:1631 (postgres+0x007b6f78)
    #9 PostmasterMain /home/ehiggs/src/download/postgresql/src/backend/postmaster/postmaster.c:1274 (postgres+0x007b6f78)
    #10 main /home/ehiggs/src/download/postgresql/src/backend/main/main.c:223 (postgres+0x006f2da9)

This is after a fork. And fork is an async-signal-safe function. So it
seems tsan doesn't properly recognize that we actually escaped the
signal handler and are in a new process.
>  #include <stdio.h>
>  #include <stdlib.h>
> +#include <stdatomic.h>
>  #include <unistd.h>
>  #include <netdb.h>
>  #include <sys/types.h>
> @@ -84,11 +85,11 @@ static void func_call_2(void);
>
>  static pthread_mutex_t init_mutex = PTHREAD_MUTEX_INITIALIZER;
>
> -static volatile int thread1_done = 0;
> -static volatile int thread2_done = 0;
> +static _Atomic int thread1_done = 0;
> +static _Atomic int thread2_done = 0;
>
> -static volatile int errno1_set = 0;
> -static volatile int errno2_set = 0;
> +static _Atomic int errno1_set = 0;
> +static _Atomic int errno2_set = 0;
>
>  #ifndef HAVE_STRERROR_R
>  static char *strerror_p1;

We can't do that because we have to work on older compilers as well. My
suggestion for now would be to disable tsan during configure and only
enable it during the actual build (you can do stuff like
make COPT='-fsanitize...').

Greetings,

Andres Freund
[HACKERS] jsonb array-style subscripting
Hi,

Some time ago the array-style subscripting for the jsonb data type was
discussed in this mailing list. I think it will be quite convenient to
have such a nice syntax to update jsonb objects, so I'm trying to
implement this. I created a patch that allows doing something like this:

=# create TEMP TABLE test_jsonb_subscript (
       id int,
       test_json jsonb
   );

=# insert into test_jsonb_subscript values (1, '{}'), (2, '{}');

=# update test_jsonb_subscript set test_json['a']['a1']['a2'] = 42;

=# select * from test_jsonb_subscript;
 id |         test_json
----+---------------------------
  1 | {"a": {"a1": {"a2": 42}}}
  2 | {"a": {"a1": {"a2": 42}}}
(2 rows)

=# select test_json['a']['a1'] from test_jsonb_subscript;
 test_json
------------
 {"a2": 42}
 {"a2": 42}
(2 rows)

This patch has a status "work in progress" of course. Generally
speaking, this implementation extends the `ArrayRef` usage for the
jsonb. And I need some sort of advice about several questions:

* is it interesting for the community?
* is that a good idea to extend the `ArrayRef` for jsonb? If it's
  appropriate, probably we can rename it to `ArrayJsonbRef` or something.
* what can be improved in the code at the top level (function placement,
  probably, functionality duplication, etc.)?
* are there any special cases that I should take care of in this
  implementation?

jsonb_subscript2.patch
Description: Binary data
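For readers skimming the example above: assigning through a subscript
path creates any missing intermediate objects along the way, which is how
the '{}' rows end up as nested documents. A rough Python analogue of that
behavior (the helper name is invented for illustration; this is not the
patch's actual C implementation):

```python
def jsonb_path_set(doc, path, value):
    """Set a nested key, creating intermediate objects as needed,
    mirroring: update ... set test_json['a']['a1']['a2'] = 42."""
    node = doc
    for key in path[:-1]:
        node = node.setdefault(key, {})  # create missing levels
    node[path[-1]] = value
    return doc

doc = {}  # corresponds to the '{}' rows in test_jsonb_subscript
jsonb_path_set(doc, ["a", "a1", "a2"], 42)
print(doc)  # {'a': {'a1': {'a2': 42}}}
```

Existing sibling keys are left alone, which matches the intuition that
only the addressed path is replaced.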
Re: [HACKERS] More WITH
Peter Geoghegan p...@heroku.com writes:
> On Mon, Aug 17, 2015 at 10:22 AM, Josh Berkus j...@agliodbs.com wrote:
>> Would be tricky.  We don't currently have any way to wrap an EXPLAIN in
>> any larger statement, do we?  Would be very useful for automated query
>> analysis, though.

> No. In the grammar, ExplainStmt expects the EXPLAIN to be at the
> top-level. Having it work any other way would require significant
> refactoring.

You can use EXPLAIN as the source of rows in a plpgsql FOR-over-query
loop, so there's a workaround available that way when you need to read
EXPLAIN output programmatically.  I'm not convinced there's sufficient
value in trying to make EXPLAIN a full-fledged subquery otherwise.

			regards, tom lane
Re: [HACKERS] Raising our compiler requirements for 9.6
On 2015-08-17 12:30:56 -0400, Robert Haas wrote:
> As far as I can see, the anticipated benefits of what we're doing here are:
> - Get a cleaner separation of frontend and backend headers (this could
>   also be done independently of STATIC_IF_INLINE, but removing
>   STATIC_IF_INLINE increases the urgency).
> - Eliminate multiple evaluations hazards.
> - Modest improvements to code generation.

Plus:

* Not having 7k long macros, that e.g. need extra flags to even be
  supported. C.f.
  http://archives.postgresql.org/message-id/4407.1435763473%40sss.pgh.pa.us
* Easier development due to actual type checking and such.

Compare the errors from heap_getattr as a macro being passed a boolean
with the same from an inline function:

Inline:

/home/andres/src/postgresql/src/backend/executor/spi.c: In function ‘SPI_getvalue’:
/home/andres/src/postgresql/src/backend/executor/spi.c:883:46: error: incompatible type for argument 4 of ‘heap_getattr’
  val = heap_getattr(tuple, fnumber, tupdesc, isnull);
                                              ^
In file included from /home/andres/src/postgresql/src/backend/executor/spi.c:17:0:
/home/andres/src/postgresql/src/include/access/htup_details.h:765:1: note: expected ‘_Bool *’ but argument is of type ‘_Bool’
 heap_getattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
 ^

Macro:

In file included from /home/andres/src/postgresql/src/backend/executor/spi.c:17:0:
/home/andres/src/postgresql/src/backend/executor/spi.c: In function ‘SPI_getvalue’:
/home/andres/src/postgresql/src/include/access/htup_details.h:750:6: error: invalid type argument of unary ‘*’ (have ‘int’)
   (*(isnull) = true), \
      ^
/home/andres/src/postgresql/src/backend/executor/spi.c:883:8: note: in expansion of macro ‘heap_getattr’
  val = heap_getattr(tuple, fnumber, tupdesc, isnull);
        ^
/home/andres/src/postgresql/src/include/access/htup_details.h:750:23: warning: left-hand operand of comma expression has no effect [-Wunused-value]
   (*(isnull) = true), \
                     ^
/home/andres/src/postgresql/src/backend/executor/spi.c:883:8: note: in expansion of macro ‘heap_getattr’
  val = heap_getattr(tuple, fnumber, tupdesc, isnull);
        ^
/home/andres/src/postgresql/src/include/access/htup_details.h:697:3: error: invalid type argument of unary ‘*’ (have ‘int’)
   (*(isnull) = false), \
   ^
/home/andres/src/postgresql/src/include/access/htup_details.h:754:5: note: in expansion of macro ‘fastgetattr’
     fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
     ^
/home/andres/src/postgresql/src/backend/executor/spi.c:883:8: note: in expansion of macro ‘heap_getattr’
  val = heap_getattr(tuple, fnumber, tupdesc, isnull);
        ^
/home/andres/src/postgresql/src/include/access/htup_details.h:713:5: error: invalid type argument of unary ‘*’ (have ‘int’)
     (*(isnull) = true), \
     ^
/home/andres/src/postgresql/src/include/access/htup_details.h:754:5: note: in expansion of macro ‘fastgetattr’
     fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
     ^
/home/andres/src/postgresql/src/backend/executor/spi.c:883:8: note: in expansion of macro ‘heap_getattr’
  val = heap_getattr(tuple, fnumber, tupdesc, isnull);
        ^
/home/andres/src/postgresql/src/include/access/htup_details.h:713:22: warning: left-hand operand of comma expression has no effect [-Wunused-value]
     (*(isnull) = true), \
                      ^
/home/andres/src/postgresql/src/include/access/htup_details.h:754:5: note: in expansion of macro ‘fastgetattr’
     fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
     ^
/home/andres/src/postgresql/src/backend/executor/spi.c:883:8: note: in expansion of macro ‘heap_getattr’
  val = heap_getattr(tuple, fnumber, tupdesc, isnull);
        ^
/home/andres/src/postgresql/src/include/access/htup_details.h:697:21: warning: left-hand operand of comma expression has no effect [-Wunused-value]
   (*(isnull) = false), \
                     ^
/home/andres/src/postgresql/src/include/access/htup_details.h:754:5: note: in expansion of macro ‘fastgetattr’
     fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
     ^
/home/andres/src/postgresql/src/backend/executor/spi.c:883:8: note: in expansion of macro ‘heap_getattr’
  val = heap_getattr(tuple, fnumber, tupdesc, isnull);
        ^
/home/andres/src/postgresql/src/include/access/htup_details.h:757:50: warning: passing argument 4 of ‘heap_getsysattr’ makes pointer from integer without a cast [-Wint-conversion]
     heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
                                                  ^
/home/andres/src/postgresql/src/backend/executor/spi.c:883:8: note: in expansion of macro ‘heap_getattr’
  val = heap_getattr(tuple, fnumber, tupdesc, isnull);
        ^
/home/andres/src/postgresql/src/include/access/htup_details.h:771:14: note: expected ‘bool * {aka char *}’ but argument is of type ‘bool {aka char}’
 extern Datum
[HACKERS] More WITH
Folks,

In the interest of consistency, which is to say, of not hitting barriers
that are essentially implementation details, I'd like to propose that we
allow the rest of the row-returning commands inside WITH clauses.

We currently have:

    SELECT
    VALUES
    INSERT/UPDATE/DELETE ... RETURNING

We don't yet have:

    EXPLAIN [ANALYZE]
    SHOW
    FETCH

A little further out there, although this would be an API change, we
might consider allowing the results of VACUUM and ANALYZE as row sets,
which would also be good to wrap in WITH.

Is there a good reason, or more than one, why we shouldn't have all the
row-returning commands in WITH?

Cheers,
David.
--
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] More WITH
> EXPLAIN [ANALYZE]

Would be tricky.  We don't currently have any way to wrap an EXPLAIN in
any larger statement, do we?

Would be very useful for automated query analysis, though.

> SHOW

Not very useful, easy to work around (pg_settings).

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] More WITH
On Mon, Aug 17, 2015 at 10:22 AM, Josh Berkus j...@agliodbs.com wrote:
> Would be tricky.  We don't currently have any way to wrap an EXPLAIN in
> any larger statement, do we?  Would be very useful for automated query
> analysis, though.

No. In the grammar, ExplainStmt expects the EXPLAIN to be at the
top-level. Having it work any other way would require significant
refactoring.

--
Peter Geoghegan
Re: [HACKERS] Raising our compiler requirements for 9.6
On 2015-08-17 12:30:56 -0400, Robert Haas wrote:
> - The possibility that may repeatedly break #define FRONTEND compilation
>   when we add static inline functions, where instead adding macros would
>   not have caused breakage, thus resulting in continual tinkering with
>   the header files.

Again, that's really independent. Inlines have that problem, even with
STATIC_IF_INLINE. C.f. MemoryContextSwitch() and a9baeb361d.

Greetings,

Andres Freund
Re: [HACKERS] Memory allocation in spi_printtup()
Neil Conway neil.con...@gmail.com writes:

Hi Neil!  Long time no see.

> spi_printtup() has the following code (spi.c:1798):
>
>     if (tuptable->free == 0)
>     {
>         tuptable->free = 256;
>         tuptable->alloced += tuptable->free;
>         tuptable->vals = (HeapTuple *) repalloc(tuptable->vals,
>                                tuptable->alloced * sizeof(HeapTuple));
>     }
>
> i.e., it grows the size of the tuptable by a fixed amount when it runs
> out of space. That seems odd; doubling the size of the table would be
> more standard. Does anyone know if there is a rationale for this
> behavior?

Seems like it must be just legacy code.  We're only allocating pointers
here; the actual tuples will likely be significantly larger.  So there's
not a lot of reason not to use the normal doubling rule.

> Attached is a one-liner to double the size of the table when space is
> exhausted.

I think this could use a comment, but otherwise seems OK.

Should we back-patch this change?  Seems like it's arguably a
performance bug.

			regards, tom lane
[HACKERS] DTrace build dependency rules
Hi,

There seems to be a bug in the make rules when DTrace is enabled. It
causes "dtrace -G" to be invoked twice when building PostgreSQL as a
FreeBSD port: once during the build itself, and once during installation.
For a long time this has been worked around on FreeBSD with a change to
libdtrace itself, but this workaround is proving problematic and I'd
like to fix the problem properly. I'm not sure whether the problem has
been observed on other operating systems that support DTrace.

The bug is in src/backend/Makefile. probes.o, the dtrace(1)-generated
object file, depends on the objfiles.txt for each of the backend
subdirs. These files depend in turn on the object files themselves; if
objfiles.txt is out of date with respect to one of its object files, the
mtime of objfiles.txt is updated with touch (see backend/common.mk). The
problem is that "dtrace -G", which runs at the end of the build,
modifies a number of object files (it overwrites their probe sites with
NOPs), thus making their corresponding objfiles.txt out of date. Then,
when "make install" traverses the backend subdirs, it updates
objfiles.txt, which causes probes.o to be rebuilt, resulting in an error
from dtrace(1).

The attached patch fixes the problem by having probes.o depend on object
files directly, rather than on objfiles.txt. I've tested it with
PostgreSQL 9.0-9.4 on FreeBSD CURRENT.

Thanks!
-Mark

diff --git a/src/backend/Makefile b/src/backend/Makefile
index 98b978f..b1e3969 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -180,8 +180,8 @@ $(top_builddir)/src/include/utils/probes.h: utils/probes.h
 	$(LN_S) ../../../$(subdir)/utils/probes.h .
 
-utils/probes.o: utils/probes.d $(SUBDIROBJS)
-	$(DTRACE) $(DTRACEFLAGS) -C -G -s $(call expand_subsys,$^) -o $@
+utils/probes.o: utils/probes.d $(call expand_subsys,$(SUBDIROBJS))
+	$(DTRACE) $(DTRACEFLAGS) -C -G -s $< $(call expand_subsys,$(SUBDIROBJS)) -o $@
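The mtime cascade described above is easy to demonstrate outside
PostgreSQL. The following is a toy reproduction (not the real PostgreSQL
makefiles): "final" stands in for probes.o, "objfiles.txt" for a
per-subdirectory object list, and touching a.o plays the role of
"dtrace -G" rewriting an object file after the build finished.

```shell
set -e
dir=$(mktemp -d)
cd "$dir"

# Build the toy Makefile with printf so recipe lines get literal tabs.
printf 'final: objfiles.txt\n\tcp objfiles.txt final\n' > Makefile
printf 'objfiles.txt: a.o\n\ttouch objfiles.txt\n' >> Makefile
printf 'a.o:\n\ttouch a.o\n' >> Makefile

make -s final        # clean build: everything is now up to date
sleep 1
touch a.o            # simulate dtrace -G modifying the object file

# Dry-run the next make invocation (as "make install" would trigger):
# a.o is newer than objfiles.txt, so the rebuild cascades up to "final".
rebuild=$(make -sn final | grep -c 'cp objfiles.txt final')
echo "spurious rebuild steps: $rebuild"
```

With the patched dependency (probes.o depending on the object files
directly), there is no intermediate touch-ed file, so the second make
sees nothing to regenerate beyond the one genuinely changed prerequisite.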
Re: [HACKERS] Memory allocation in spi_printtup()
On Mon, Aug 17, 2015 at 7:56 AM, Tom Lane t...@sss.pgh.pa.us wrote:
> Hi Neil!  Long time no see.

Likewise :)

>> Attached is a one-liner to double the size of the table when space is
>> exhausted.
>
> I think this could use a comment, but otherwise seems OK.

Attached is a revised patch with a comment.

> Should we back-patch this change?  Seems like it's arguably a
> performance bug.

Sounds good to me.

Neil

diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index d544ad9..05a2b21 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -1797,7 +1797,8 @@ spi_printtup(TupleTableSlot *slot, DestReceiver *self)
 
 	if (tuptable->free == 0)
 	{
-		tuptable->free = 256;
+		/* Double the size of the table */
+		tuptable->free = tuptable->alloced;
 		tuptable->alloced += tuptable->free;
 		tuptable->vals = (HeapTuple *) repalloc(tuptable->vals,
 							tuptable->alloced * sizeof(HeapTuple));
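The difference between the two growth policies is easy to quantify in
isolation. A back-of-the-envelope sketch (Python, not the actual C code;
the 128-slot starting size is an illustrative assumption and does not
change the conclusion): fixed-increment growth copies O(n^2) pointers
over n insertions, while doubling keeps total copies under 2n.

```python
def count_copies(n, grow):
    """Count elements copied by repalloc-style reallocation while
    appending n items, under a given growth policy."""
    alloced, used, copies = 128, 0, 0  # assume 128 initial slots
    for _ in range(n):
        if used == alloced:
            copies += used             # repalloc copies the old pointers
            alloced = grow(alloced)
        used += 1
    return copies

n = 100_000
fixed = count_copies(n, lambda a: a + 256)   # old spi_printtup policy
doubled = count_copies(n, lambda a: 2 * a)   # patched policy
print(fixed, doubled)  # fixed-increment copies vastly more than doubling
assert doubled < 2 * n < fixed
```

This is why Tom calls it "arguably a performance bug": for a large SPI
result set the old policy spends quadratic time just shuffling pointers.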
Re: [HACKERS] checkpointer continuous flushing
On 2015-08-17 15:21:22 +0200, Fabien COELHO wrote:
> My current thinking is maybe yes, maybe no:-), as it may depend on the
> OS implementation of posix_fadvise, so it may differ between OS.

As long as fadvise has no 'undirty' option, I don't see how that problem
goes away. You're telling the OS to throw the buffer away, so unless it
ignores it that'll have consequences when you read the page back in.

> This is a reason why I think that flushing should be kept a guc, even
> if the sort guc is removed and always on. The sync_file_range
> implementation is clearly always very beneficial for Linux, and the
> posix_fadvise may or may not induce a good behavior depending on the
> underlying system.

That's certainly an argument.

> This is also a reason why the default value for the flush guc is
> currently set to false in the patch. The documentation should advise to
> turn it on for Linux and to test otherwise. Or if Linux is assumed to
> be often a host, then maybe to set the default to on and to suggest
> that on some systems it may be better to have it off.

I'd say it should then be an os-specific default. No point in making
people work for it needlessly on linux and/or elsewhere.

> (Another reason to keep it off is that I'm not sure about what happens
> with such HD flushing features on virtual servers).

I don't see how that matters? Either the host will entirely ignore
flushing, and thus the sync_file_range and the fsync won't cost much, or
fsync will be honored, in which case the pre-flushing is helpful.

> Overall, I'm not pessimistic, because I've seen I/O storms on a FreeBSD
> host and it was as bad as Linux (namely the database and even the box
> was offline for long minutes...), and if you can avoid that, having to
> read back some data may be not that bad a down payment.

I don't see how that'd alleviate my fear. Sure, the latency for many
workloads will be better, but I don't see how that argument says
anything about the reads? And we'll not just use this in cases it'd be
beneficial...

> The issue is largely mitigated if the data is not removed from
> shared_buffers, because the OS buffer is just a copy of already held
> data. What I would do on such systems is to increase shared_buffers and
> keep flushing on, that is to count less on the system cache and more on
> postgres own cache.

That doesn't work that well for a bunch of reasons. For one it's
completely non-adaptive. With the OS's page cache you can rely on free
memory being used for caching *and* it being available should a query or
another program need lots of memory.

> Overall, I'm not convinced that the practice of relying on the OS cache
> is a good one, given what it does with it, at least on Linux.

The alternatives aren't super realistic near-term though. Using direct
IO efficiently on the set of operating systems we support is *hard*.
It's more or less trivial to hack pg up to use direct IO for
relations/shared_buffers, but it'll perform utterly horribly in many
many cases. To pick one thing out: Without the OS buffering writes, any
write will have to wait for the disks, instead of being asynchronous.
That'll make writes performed by backends a massive bottleneck.

> Now, if someone could provide a dedicated box with posix_fadvise (say
> FreeBSD, maybe others...) for testing, that would allow to provide data
> instead of speculating... and then maybe to decide to change its
> default value.

Testing, as an approximation, how it turns out to work on linux would be
a good step.

Greetings,

Andres Freund
[HACKERS] missing documentation for partial WAL files
The commit message for de76884 contains some important information about the purpose and use of the new .partial WAL files. But I don't see anything about this in the documentation or another user-visible place. We should probably add something.
Re: [HACKERS] Test code is worth the space
On 8/15/15 4:45 AM, Petr Jelinek wrote:
> We could fix a) by adding ORDER BY to those queries but I don't see how
> to fix the rest easily or at all without sacrificing some test coverage.

Hopefully at some point we'll have test frameworks that don't depend on
capturing raw psql output, which presumably makes dealing with a lot of
this easier.

Maybe some of this could be handled by factoring BLKSZ into the
queries... A really crude solution would be to have expected output be
BLKSZ dependent. No one will remember to test that by hand, but if we
have CI setup then maybe it's workable...

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Raising our compiler requirements for 9.6
On Mon, Aug 17, 2015 at 12:36 PM, Andres Freund and...@anarazel.de wrote:
> On 2015-08-17 12:30:56 -0400, Robert Haas wrote:
>> - The possibility that may repeatedly break #define FRONTEND
>>   compilation when we add static inline functions, where instead adding
>>   macros would not have caused breakage, thus resulting in continual
>>   tinkering with the header files.
>
> Again, that's really independent. Inlines have that problem, even with
> STATIC_IF_INLINE. C.f. MemoryContextSwitch() and a9baeb361d.

Inlines, yes, but macros don't. I'm not saying we shouldn't do this, but
I *am* saying that we need to be prepared to treat breaking FRONTEND
compilation as a problem, not just today and tomorrow, but way off into
the future. It's not at all a stretch to think that we could still be
hitting fallout from these changes in 2-3 years time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Test code is worth the space
On 8/16/15 8:48 AM, Greg Stark wrote: On Sun, Aug 16, 2015 at 7:33 AM, Noah Misch n...@leadboat.com wrote: When I've just spent awhile implementing a behavior change, the test diffs are a comforting sight. They confirm that the test suite exercises the topic I just changed. Furthermore, most tests today do not qualify under this stringent metric you suggest. The nature of pg_regress opposes it. It's not a comforting sight for me. It means I need to change the test results and then of course the tests will pass. So they didn't really test anything and I don't really know whether the changes broke anything. Any test I had to adjust for my change was effectively worthless. This is precisely my point about pg_regress and why it's the reason there's pressure not to have extensive tests. It's useful to have some end-to-end black box tests like this but it's self-limiting. You can't test many details without tying the tests to specific behaviour. I have worked with insanely extensive testing suites that didn't have this problem. But they were unit tests for code that was structured around testability. When the API is designed to be testable and you have good infrastructure for mocking up the environment and comparing function results in a much narrower way than just looking at text output you can test the correctness of each function without tying the test to the specific behaviour. *That* gives a real comfort. When you change a function to behave entirely differently and can still run all the tests and see that all the code that depends on it still operate correctly then you know you haven't broken anything. In that case it actually *speeds up* development rather than slowing it down. It lets newbie developers hack on code where they don't necessarily know all the downstream dependent code and not be paralyzed by fear that they'll break something. As someone who oversaw a database test suite with almost 500 test files (each testing multiple cases), I completely agree. 
Our early framework was actually similar to pg_regress and that worked OK
up to about 200 test files, but it got very unwieldy very quickly. We
switched to pgTAP and it was far easier to work with.

I suspect any effort to significantly improve Postgres test coverage is
doomed until there's an alternative to pg_regress.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
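To make the contrast in this subthread concrete, here is a toy
illustration (Python, with an invented function; nothing PostgreSQL-
specific): a pg_regress-style test pins the exact output text, so any
cosmetic change breaks it, while a contract-style test pins only the
behavior, so a refactor that preserves behavior still passes.

```python
def format_size(nbytes):
    """Toy function under test (a stand-in for any internal routine)."""
    units = ["B", "kB", "MB"]
    i = 0
    while nbytes >= 1024 and i < len(units) - 1:
        nbytes /= 1024
        i += 1
    return f"{nbytes:g} {units[i]}"

# pg_regress style: diff captured output against a stored expected file.
# A cosmetic change to the text breaks this without telling you whether
# behavior actually changed.
expected_output = "2 kB\n"
assert format_size(2048) + "\n" == expected_output

# Contract style: assert properties that must hold however the function
# is implemented; an internal rewrite that preserves behavior stays green.
assert format_size(0) == "0 B"
assert format_size(2048).endswith("kB")
assert float(format_size(1536).split()[0]) == 1.5
print("all checks passed")
```

The second style is what frameworks like pgTAP (and unit tests against a
mockable API) make practical, which is the point being argued above.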
Re: [HACKERS] missing documentation for partial WAL files
On Mon, Aug 17, 2015 at 11:50 AM, Peter Eisentraut pete...@gmx.net wrote:
> The commit message for de76884 contains some important information
> about the purpose and use of the new .partial WAL files. But I don't
> see anything about this in the documentation or another user-visible
> place. We should probably add something.

Uh, some documentation around .ready files would be nice too.

--
Peter Geoghegan
Re: [HACKERS] Configure with thread sanitizer fails the thread test
On 2015-08-17 14:31:24 -0300, Alvaro Herrera wrote:
> The postmaster process in particular runs in a rather unusual
> arrangement, where most of the interesting stuff does happen in signal
> handlers.

FWIW, I think it might be worthwhile to convert postmaster into a loop
over a process local latch, with that latch being set in signal
handlers. My feeling is that that'd simplify the code rather
significantly. I'm not 100% sure it's worth the code churn, but it'd
definitely be easier to understand.

Thread sanitizer isn't the first analysis tool that has problems coping
with forks in signal handlers btw; valgrind on amd64 for a long while
had misaligned stacks in the children afterwards, leading to very odd
crashes.

Greetings,

Andres Freund
Re: [HACKERS] pgbench bug in head
On 08/11/2015 06:44 PM, Fabien COELHO wrote:
> Probably since 1bc90f7a (shame on me) pgbench does not report skipped
> transactions (-L) counts properly: data are not aggregated over all
> threads as they should be, either under --progress or in the end of run
> report.
>
> The attached patch fixes these regressions.

Applied, thanks.

- Heikki
Re: [HACKERS] Autonomous Transaction is back
Noah Misch wrote:
> If the autonomous transaction can interact with uncommitted work in a
> way that other backends could not, crazy things happen when the
> autonomous transaction commits and the suspended transaction aborts:
>
> CREATE TABLE t (c) AS SELECT 1;
> BEGIN;
> UPDATE t SET c = 2 WHERE c = 1;
> BEGIN_AUTONOMOUS;
> UPDATE t SET c = 3 WHERE c = 1;
> UPDATE t SET c = 4 WHERE c = 2;
> COMMIT_AUTONOMOUS;
> ROLLBACK;
>
> If you replace the autonomous transaction with a savepoint, the c=3
> update finds no rows, and the c=4 update changes one row. When the
> outer transaction aborts, only the original c=1 row remains live. If
> you replace the autonomous transaction with a dblink/pg_background
> call, the c=3 update waits indefinitely for c=2 to commit or abort, an
> undetected deadlock.
>
> My starting expectation is that the semantics of an autonomous
> transaction will be exactly those of dblink/pg_background. (I said that
> during the unconference session.) The application would need to read
> data from tables before switching to the autonomous section. Autonomous
> transactions are then a performance and syntactic help, not a source of
> new semantics. Does any database have autonomous transactions that do
> otherwise?

Oracle behaves like that, i.e. it deadlocks with your example:

SQL> SELECT * FROM t;

         C
----------
         1

SQL> CREATE PROCEDURE proc2 IS
  2  PRAGMA AUTONOMOUS_TRANSACTION;
  3  BEGIN
  4     UPDATE t SET c = 3 WHERE c = 1;
  5     UPDATE t SET c = 4 WHERE c = 2;
  6     COMMIT;
  7  END;
  8  /

Procedure created.

SQL> CREATE PROCEDURE proc1 IS
  2  BEGIN
  3     UPDATE t SET c = 2 WHERE c = 1;
  4     proc2;
  5     ROLLBACK;
  6  END;
  7  /

Procedure created.

SQL> CALL proc1();
CALL proc1()
     *
ERROR at line 1:
ORA-00060: deadlock detected while waiting for resource
ORA-06512: at "LAURENZ.PROC2", line 4
ORA-06512: at "LAURENZ.PROC1", line 4

Yours,
Laurenz Albe
Re: [HACKERS] pgbench stats per script & other stuff
v7 is a rebase after another small bug fix in pgbench. v8 is a rebase after yet another small bug fix in pgbench. -- Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml index 2517a3a..99670d4 100644 --- a/doc/src/sgml/ref/pgbench.sgml +++ b/doc/src/sgml/ref/pgbench.sgml @@ -261,6 +261,23 @@ pgbench optional replaceableoptions/ /optional replaceabledbname/ benchmarking arguments: variablelist + varlistentry + termoption-b/ replaceablescriptname[@weight]//term + termoption--builtin/ replaceablescriptname[@weight]//term + listitem + para +Add the specified builtin script to the list of executed scripts. +An optional integer weight after literal@/ allows to adjust the +probability of drawing the test. +Available builtin scripts are: literaltpcb-like/, +literalsimple-update/ and literalselect-only/. +The provided repleacablescriptname/ needs only to be a prefix +of the builtin name, hence literalsimp/ would be enough to select +literalsimple-update/. + /para + /listitem + /varlistentry + varlistentry termoption-c/option replaceableclients//term @@ -307,14 +324,15 @@ pgbench optional replaceableoptions/ /optional replaceabledbname/ /varlistentry varlistentry - termoption-f/option replaceablefilename//term - termoption--file=/optionreplaceablefilename//term + termoption-f/ replaceablefilename[@weight]//term + termoption--file=/replaceablefilename[@weight]//term listitem para -Read transaction script from replaceablefilename/. +Add a transaction script read from replaceablefilename/ to +the list of executed scripts. +An optional integer weight after literal@/ allows to adjust the +probability of drawing the test. See below for details. -option-N/option, option-S/option, and option-f/option -are mutually exclusive. 
/para /listitem /varlistentry @@ -404,10 +422,7 @@ pgbench optional replaceableoptions/ /optional replaceabledbname/ termoption--skip-some-updates/option/term listitem para -Do not update structnamepgbench_tellers/ and -structnamepgbench_branches/. -This will avoid update contention on these tables, but -it makes the test case even less like TPC-B. +Shorthand for option-b simple-update@1/. /para /listitem /varlistentry @@ -499,9 +514,9 @@ pgbench optional replaceableoptions/ /optional replaceabledbname/ Report the specified scale factor in applicationpgbench/'s output. With the built-in tests, this is not necessary; the correct scale factor will be detected by counting the number of -rows in the structnamepgbench_branches/ table. However, when testing -custom benchmarks (option-f/ option), the scale factor -will be reported as 1 unless this option is used. +rows in the structnamepgbench_branches/ table. +However, when testing only custom benchmarks (option-f/ option), +the scale factor will be reported as 1 unless this option is used. /para /listitem /varlistentry @@ -511,7 +526,7 @@ pgbench optional replaceableoptions/ /optional replaceabledbname/ termoption--select-only/option/term listitem para -Perform select-only transactions instead of TPC-B-like test. +Shorthand for option-b select-only@1/. /para /listitem /varlistentry @@ -567,6 +582,16 @@ pgbench optional replaceableoptions/ /optional replaceabledbname/ /varlistentry varlistentry + termoption--per-script-stats/option/term + listitem + para +Report some statistics per script run by pgbench. 
+ /para + /listitem + /varlistentry + + + varlistentry termoption--sampling-rate=replaceablerate//option/term listitem para @@ -661,7 +686,20 @@ pgbench optional replaceableoptions/ /optional replaceabledbname/ titleWhat is the quoteTransaction/ Actually Performed in pgbench?/title para - The default transaction script issues seven commands per transaction: + Pgbench executes test scripts chosen randomly from a specified list. + They include built-in scripts with option-b/ and + user-provided custom scripts with option-f/. + Each script may be given a relative weight specified after a + literal@/ so as to change its drawing probability. + The default weight is literal1/. + /para + + para + The default builtin transaction script (also invoked with option-b tpcb-like/) + issues seven commands per transaction over randomly chosen literalaid/, + literaltid/, literalbid/ and literalbalance/. + The scenario is inspired by the TPC-B benchmark, but is not actually TPC-B, + hence the name. /para orderedlist @@ -675,9
Re: [HACKERS] replication slot restart_lsn initialization
On 8/14/15 3:54 AM, Andres Freund wrote: I'd name it RESERVE_WAL. My feeling is that the options for the logical case are geared towards the output plugin, not the walsender. I think it'd be confusing to use (slot_options) differently for physical slots. Why reserve? Isn't it preserve? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb array-style subscripting
On 08/17/2015 03:26 PM, Merlin Moncure wrote: I'm not sure if this: update test_jsonb_subscript set test_json['a']['a1']['a2'] = 42; ...is a good idea. postgres operators tend to return immutable copies of the item they are referring to. In other words, you'd never see a column operator on the 'left' side of the equals in an update statement. I think you need to look at a function to get the behavior you want: update test_jsonb_subscript set test_json = jsonb_modify(test_json, '[a][a1][a2] = 42'); ...as a hypothetical example. The idea is you need to make a function that provides the ability to make the complete json you want. Update statements always make a copy of the record anyways. Why should jsonb be different from an array? You can assign to an array element, using exactly this syntax, except that the index expressions have to be integers. This was discussed at pgcon and generally met with approval. There is some demand for it. If Dmitry hadn't done this I would probably have done it myself. cheers andrew
Re: [HACKERS] Error message with plpgsql CONTINUE
On 8/17/15 9:48 AM, Tom Lane wrote: I'm inclined to think that if we wanted to make this better, the way to improve it would be to detect the error *at compile time*, and get rid of this hack in plpgsql_exec_function altogether. pl_gram.y already successfully detects cases where CONTINUE mentions a label that doesn't exist or isn't surrounding the CONTINUE. What it is missing is that we don't distinguish labels on loops from labels on non-loop statements, and thus it can't tell if CONTINUE is referencing a non-loop label or has no label but is not inside any loop-type statement. Seems like that detail could be added to the PLpgSQL_nsitem data structure without a huge amount of work. So split PLPGSQL_NSTYPE_LABEL into PLPGSQL_NSTYPE_BLOCK_LABEL and PLPGSQL_NSTYPE_LOOP_LABEL, and split opt_block_label and opt_label the same way? -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] jsonb array-style subscripting
On Mon, Aug 17, 2015 at 12:26 PM, Merlin Moncure mmonc...@gmail.com wrote: I'm not sure if this: update test_jsonb_subscript set test_json['a']['a1']['a2'] = 42; ...is a good idea. This kind of targetlist indirection is already possible with arrays and composite types. -- Peter Geoghegan
Re: [HACKERS] jsonb array-style subscripting
On 8/17/15 12:57 PM, Dmitry Dolgov wrote: * is it interesting for the community? We definitely need better ways to manipulate JSON. * is that a good idea to extend the `ArrayRef` for jsonb? If it's appropriate, probably we can rename it to `ArrayJsonbRef` or something. * what can be improved in the code at the top level (function placement, probably, functionality duplication, etc.)? * are there any special cases, that I should take care of in this implementation? How would this work when you have a JSON array? Postgres array syntax suddenly becoming key/value syntax for JSON seems like a pretty bad idea to me. Could a different syntax (maybe {}) be used instead? I'm not sure having the UPDATE you show cause objects to spring to life is so great either. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] jsonb array-style subscripting
On Mon, Aug 17, 2015 at 12:57 PM, Dmitry Dolgov 9erthali...@gmail.com wrote: Hi, Some time ago the array-style subscripting for the jsonb data type was discussed in this mailing list. I think it will be quite convenient to have such a nice syntax to update jsonb objects, so I'm trying to implement this. I created a patch, that allows doing something like this:

=# create TEMP TABLE test_jsonb_subscript ( id int, test_json jsonb );
=# insert into test_jsonb_subscript values (1, '{}'), (2, '{}');
=# update test_jsonb_subscript set test_json['a']['a1']['a2'] = 42;
=# select * from test_jsonb_subscript;
 id |         test_json
----+---------------------------
  1 | {"a": {"a1": {"a2": 42}}}
  2 | {"a": {"a1": {"a2": 42}}}
(2 rows)

=# select test_json['a']['a1'] from test_jsonb_subscript;
 test_json
------------
 {"a2": 42}
 {"a2": 42}
(2 rows)

This patch has a status "work in progress" of course. Generally speaking, this implementation extends the `ArrayRef` usage for the jsonb. And I need some sort of advice about several questions: * is it interesting for the community? * is that a good idea to extend the `ArrayRef` for jsonb? If it's appropriate, probably we can rename it to `ArrayJsonbRef` or something. * what can be improved in the code at the top level (function placement, probably, functionality duplication, etc.)? * are there any special cases, that I should take care of in this implementation? I'm not sure if this: update test_jsonb_subscript set test_json['a']['a1']['a2'] = 42; ...is a good idea. postgres operators tend to return immutable copies of the item they are referring to. In other words, you'd never see a column operator on the 'left' side of the equals in an update statement. I think you need to look at a function to get the behavior you want: update test_jsonb_subscript set test_json = jsonb_modify(test_json, '[a][a1][a2] = 42'); ...as a hypothetical example. The idea is you need to make a function that provides the ability to make the complete json you want. Update statements always make a copy of the record anyways.
merlin
Re: [HACKERS] checkpointer continuous flushing
Hello Andres, [...] posix_fadvise(). My current thinking is maybe yes, maybe no:-), as it may depend on the OS implementation of posix_fadvise, so it may differ between OSes. As long as fadvise has no 'undirty' option, I don't see how that problem goes away. You're telling the OS to throw the buffer away, so unless it ignores it that'll have consequences when you read the page back in. Yep, probably. Note that we are talking about checkpoints, which write buffers out *but* keep them nevertheless. As the buffer is kept, the OS page is a duplicate, and freeing it should not harm, at least immediately. The situation is different if the memory is reused in between, which is the work of the bgwriter I think, based on LRU/LFU heuristics, but such writes are not flushed by the current patch. Now, if a buffer was recently updated it should not be selected by the bgwriter, if the LRU/LFU heuristics works as expected, which mitigates the issue somehow... To sum up, I agree that it is indeed possible that flushing with posix_fadvise could reduce read OS-memory hits on some systems for some workloads, although not on Linux, see below. So the option is best kept as off for now, without further data, I'm fine with that. [...] I'd say it should then be an os-specific default. No point in making people work for it needlessly on linux and/or elsewhere. Ok. Version 9 attached does that, on for Linux, off for others because of the potential issues you mentioned. (Another reason to keep it off is that I'm not sure about what happens with such HD flushing features on virtual servers). I don't see how that matters? Either the host will entirely ignore flushing, and thus the sync_file_range and the fsync won't cost much, or fsync will be honored, in which case the pre-flushing is helpful. Possibly.
I know that I do not know:-) The distance between the database and real hardware is so great in VM, that I think that it may have any effect, including good, bad or none:-) Overall, I'm not pessimistic, because I've seen I/O storms on a FreeBSD host and it was as bad as Linux (namely the database and even the box was offline for long minutes...), and if you can avoid that having to read back some data may be not that bad a down payment. I don't see how that'd alleviate my fear. I'm trying to mitigate your fears, not to alleviate them:-) Sure, the latency for many workloads will be better, but I don't how that argument says anything about the reads? It just says that there may be a compromise, better in some case, possibly not so in others, because posix_fadvise does not really say what the database would like to say to the OS, this is why I wrote such a large comment about it in the source file in the first place. And we'll not just use this in cases it'd be beneficial... I'm fine if it is off by default for some systems. If people want to avoid write stalls they can use the option, but it may have adverse effect on the tps in some cases, that's life? Not using the option also has adverse effects in some cases, because you have write stalls... and currently you do not have the choice, so it would be a progress. The issue is largely mitigated if the data is not removed from shared_buffers, because the OS buffer is just a copy of already hold data. What I would do on such systems is to increase shared_buffers and keep flushing on, that is to count less on the system cache and more on postgres own cache. That doesn't work that well for a bunch of reasons. For one it's completely non-adaptive. With the OS's page cache you can rely on free memory being used for caching *and* it be available should a query or another program need lots of memory. Yep. I was thinking about a dedicated database server, not a shared one. 
Overall, I'm not convinced that the practice of relying on the OS cache is a good one, given what it does with it, at least on Linux. The alternatives aren't super realistic near-term though. Using direct IO efficiently on the set of operating systems we support is *hard*. [...] Sure. This is not necessarily what I had in mind. Currently pg writes stuff to the OS, and then suddenly calls fsync out of the blue, hoping that in between the OS will actually have done a good job with the underlying hardware. This is pretty naive: the fsync generates write storms, and the database is offline: trying to improve these things is the motivation for this patch. Now if you think of the bgwriter, it does pretty much the same, and probably may generate plenty of random I/Os, because the underlying LRU/LFU heuristics used to select buffers does not care about the file structures. So I think that to get good performance the database must take some control over the OS. That does not mean that direct I/O needs to be involved, although maybe it could, but this patch shows that it is not needed to improve things. Now, if someone could provide a
Re: [HACKERS] jsonb array-style subscripting
Dmitry Dolgov wrote: Some time ago the array-style subscripting for the jsonb data type was discussed in this mailing list. I think it will be quite convenient to have such a nice syntax to update jsonb objects, so I'm trying to implement this. I created a patch, that allows doing something like this: =# update test_jsonb_subscript set test_json['a']['a1']['a2'] = 42; FWIW we discussed exactly this during the unconference https://wiki.postgresql.org/wiki/PgCon_2015_Developer_Unconference#Direction_of_json_and_jsonb and Andrew and Tom seemed okay with this syntax. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] replication slot restart_lsn initialization
On 2015-08-17 15:22:44 -0400, Peter Eisentraut wrote: On 8/14/15 3:54 AM, Andres Freund wrote: I'd name it RESERVE_WAL. Why reserve? Isn't it preserve? Hm. I honestly do not know. I was thinking of ticket reservations...
[HACKERS] [patch] psql tab completion for grant execute
Hi, When tab-completing after GRANT EXECUTE, currently psql injects PROCEDURE, rather than the expected ON. The code for completing with ON is there, but it's not reached due to falling earlier into another branch, one that handles CREATE TRIGGER. A trivial patch is attached. It adds the condition that if EXECUTE is preceded by GRANT itself preceded by nothing, then that completion with PROCEDURE is skipped. I've looked at fixing it more directly, by testing if the EXECUTE is part of a CREATE TRIGGER, but it didn't seem fitting to go looking backwards that many words into the string (more than the 5 words suggested by the rest of the code). Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 62cb721..816deda 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2622,6 +2622,7 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_views, NULL);
 	/* complete CREATE TRIGGER ... EXECUTE with PROCEDURE */
 	else if (pg_strcasecmp(prev_wd, "EXECUTE") == 0 &&
+			 !(pg_strcasecmp(prev2_wd, "GRANT") == 0 && prev3_wd[0] == '\0') &&
 			 prev2_wd[0] != '\0')
 		COMPLETE_WITH_CONST("PROCEDURE");
Re: [HACKERS] jsonb array-style subscripting
On 8/17/15 3:33 PM, Josh Berkus wrote: Again, how do we handle missing keys? Just return NULL? or ERROR? I'd prefer the former, but there will be arguments the other way. I've been wondering if we should add some kind of strict JSON. My big concern is throwing an error if you try to provide duplicate keys, but it seems reasonable that json_strict would throw an error if you try to reference something that doesn't exist. This patch has a status "work in progress" of course. Generally speaking, this implementation extends the `ArrayRef` usage for the jsonb. And I need some sort of advice about several questions: * is it interesting for the community? * is that a good idea to extend the `ArrayRef` for jsonb? If it's appropriate, probably we can rename it to `ArrayJsonbRef` or something. * what can be improved in the code at the top level (function placement, probably, functionality duplication, etc.)? * are there any special cases, that I should take care of in this implementation? array/key ambiguity is going to be painful. JSON keys are required to be strings, so maybe it's OK to differentiate based on whether the index is a string or a number. Or perhaps we use different nomenclature (ie: {} for objects). We should also think about what would work for hstore. Adding this for hstore is clearly separate work, but it'd be bad to close that door here. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] jsonb array-style subscripting
Dmitry Dolgov 9erthali...@gmail.com writes: * is that a good idea to extend the `ArrayRef` for jsonb? No. Make a new expression node type. (Salesforce did something similar for an internal feature, and it was a disaster both for code modularity and performance. We had to change it to a separate node type, which I just got finished doing. Don't go down that path. While you're at it, I'd advise that fetch and assignment be two different node types rather than copying ArrayRef's bad precedent of using only one.) regards, tom lane
Re: [HACKERS] jsonb array-style subscripting
On 08/17/2015 02:18 PM, Jim Nasby wrote: On 8/17/15 3:33 PM, Josh Berkus wrote: Again, how do we handle missing keys? Just return NULL? or ERROR? I'd prefer the former, but there will be arguments the other way. I've been wondering if we should add some kind of strict JSON. My big concern is throwing an error if you try to provide duplicate keys, but it seems reasonable that json_strict would throw an error if you try to reference something that doesn't exist. Only if there's demand for it. Is there? array/key ambiguity is going to be painful. JSON keys are required to be strings, so maybe it's OK to differentiate based on whether the index is a string or a number. Or perhaps we use different nomenclature (ie: {} for objects). Well, we did get rid of all of those implicit conversions for a reason. So maybe that's good enough? i.e.

json['a']['b'][1] = 5    -- assign 5 as the first element in the array 'b' of object 'a'
json['a']['b']['1'] = 5  -- assign 5 to key '1' of object 'b' of object 'a'

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
[HACKERS] Memory allocation in spi_printtup()
spi_printtup() has the following code (spi.c:1798):

	if (tuptable->free == 0)
	{
		tuptable->free = 256;
		tuptable->alloced += tuptable->free;
		tuptable->vals = (HeapTuple *) repalloc(tuptable->vals,
									  tuptable->alloced * sizeof(HeapTuple));
	}

i.e., it grows the size of the tuptable by a fixed amount when it runs out of space. That seems odd; doubling the size of the table would be more standard. Does anyone know if there is a rationale for this behavior? Attached is a one-liner to double the size of the table when space is exhausted. Constructing a simple test case in which a large result is materialized via SPI_execute() (e.g., by passing two large queries to crosstab() from contrib/tablefunc), this change reduces the time required to exceed the palloc size limit from ~300 seconds to ~93 seconds on my laptop. Of course, using SPI_execute() rather than cursors for queries with large result sets is not a great idea, but there is demonstrably code that does this (e.g., contrib/tablefunc -- I'll send a patch for that shortly). Neil

diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index d544ad9..2573b3f 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -1797,7 +1797,7 @@ spi_printtup(TupleTableSlot *slot, DestReceiver *self)
 	if (tuptable->free == 0)
 	{
-		tuptable->free = 256;
+		tuptable->free = tuptable->alloced;
 		tuptable->alloced += tuptable->free;
 		tuptable->vals = (HeapTuple *) repalloc(tuptable->vals,
									  tuptable->alloced * sizeof(HeapTuple));
Re: [HACKERS] Configure checks and optimizations/crc32 tests
On 08/14/2015 07:42 PM, Andres Freund wrote: Going over my vpaths I noticed another problem with the test. With gcc I get slice-by-8 instead of the runtime variant:

checking for builtin __atomic int64 atomic operations... yes
checking for __get_cpuid... yes
checking for __cpuid... no
checking for _mm_crc32_u8 and _mm_crc32_u32 with CFLAGS=... no
checking for _mm_crc32_u8 and _mm_crc32_u32 with CFLAGS=-msse4.2... no
checking which CRC-32C implementation to use... slicing-by-8

That's because I get a warning

conftest.c:179:1: warning: old-style function definition [-Wold-style-definition]
 main ()
 ^

and PGAC_SSE42_CRC32_INTRINSICS turns on ac_c_werror_flag. Now I can work around this by , but I don't really see why that test needs to turn on -Werror? Isn't the point of using a linker test that we'd get a proper linker failure if the functions don't exist? Yeah, it was probably just copy-pasted from the other macros in c-compiler.m4 without thinking. - Heikki
Re: [HACKERS] Error message with plpgsql CONTINUE
2015-08-17 6:19 GMT+02:00 Jim Nasby jim.na...@bluetreble.com: Calling CONTINUE with a label that's not a loop produces an error message with no context info [1]. This is because of

	rc = exec_stmt_block(&estate, func->action);
	if (rc != PLPGSQL_RC_RETURN)
	{
		estate.err_stmt = NULL;
		estate.err_text = NULL;

I trawled through git blame a bit and it looks like it's been that way for a very long time. I think err_stmt should probably only be reset in the non-return case a bit below that. I'm not sure about err_text though. Also, the code treats PLPGSQL_RC_OK and PLPGSQL_RC_EXIT the same, which seems like a bug; I would think PLPGSQL_RC_EXIT should be handled the same way as CONTINUE. If someone can confirm this and tell me what to do about err_text I'll submit a patch. maybe during function exit ? Regards Pavel

[1] decibel@decina.attlocal/50703=# do $$
begin
  <<outer>>
  for a in 1..3 loop
    <<sub>>
    BEGIN
      <<inner>>
      for b in 8..9 loop
        if a=2 then
          continue sub;
        end if;
        raise notice '% %', a, b;
      end loop inner;
    END sub;
  end loop outer;
end; $$;
NOTICE:  1 8
NOTICE:  1 9
ERROR:  CONTINUE cannot be used outside a loop
CONTEXT:  PL/pgSQL function inline_code_block
decibel@decina.attlocal/50703=#

[2] https://github.com/postgres/postgres/blob/83604cc42353b6c0de2a3f3ac31f94759a9326ae/src/pl/plpgsql/src/pl_exec.c#L438 -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Potential GIN vacuum bug
On 08/16/2015 12:58 AM, Jeff Janes wrote: When ginbulkdelete gets called for the first time in a VACUUM (i.e. stats == NULL), one of the first things it does is call ginInsertCleanup to get rid of the pending list. It does this in lieu of vacuuming the pending list. This is important because if there are any dead tids still in the pending list, someone else could come along during the vacuum and post the dead tids into a part of the index that VACUUM has already passed over. The potential bug is that ginInsertCleanup exits early (ginfast.c lines 796, 860, 898) if it detects that someone else is cleaning up the pending list, without waiting for that someone else to finish the job. Isn't this a problem? Yep, I think you're right. When that code runs as part of VACUUM, it should not give up like that. Hmm, I see other race conditions in that code too. Even if VACUUM wins the race you spotted, and performs all the insertions, reaches the end of the pending items list, and deletes the pending list pages, it's possible that another backend started earlier, and is still processing the same items from the pending items list. It will add them to the tree, and after it's finished with that it will see that the pending list page was already deleted, and bail out. But if there is a dead tuple in the pending items list, you have trouble. The other backend will re-insert it, and that might happen after VACUUM had already removed it from the tree. Also, ginInsertCleanup() seems to assume that if another backend has just finished cleaning up the same page, it will see the page marked as deleted. But what if the page is not only marked as deleted, but also reused for something else already? - Heikki
Re: [HACKERS] jsonb array-style subscripting
On Mon, Aug 17, 2015 at 12:26 PM, Merlin Moncure mmonc...@gmail.com wrote: ...is a good idea. postgres operators tend to return immutable copies of the item they are referring to. This patch does not add an operator at all, actually. It feels like there ought to be an operator, but in fact there is not. The parser is hard-coded to recognize array-style subscripts, which this uses. While I'm certainly glad that Dmitry took the time to work on this, I think we will need an operator, too. Or, more accurately, there should probably be a way to make something like this use some available GIN index:

postgres=# explain analyze select * from testjsonb where p['a'] = '[1]';
                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Seq Scan on testjsonb  (cost=0.00..27.00 rows=7 width=32) (actual time=0.022..0.023 rows=1 loops=1)
   Filter: (p['a'] = '[1]'::jsonb)
 Planning time: 0.070 ms
 Execution time: 0.054 ms
(4 rows)

This doesn't really matter with arrays, but ISTM that it matters here. I have no strong feelings on how it should work, but certain things do seem to suggest themselves. For example, maybe the parser can be made to create a query tree that uses an indexable operator based on special-case logic. Although maybe that's a kludge too far, since I can imagine it breaking other legitimate things. My sense is that this will need to be discussed. -- Peter Geoghegan
Re: [HACKERS] jsonb array-style subscripting
On Mon, Aug 17, 2015 at 2:44 PM, Andrew Dunstan and...@dunslane.net wrote: On 08/17/2015 03:26 PM, Merlin Moncure wrote: I'm not sure if this: update test_jsonb_subscript set test_json['a']['a1']['a2'] = 42; ...is a good idea. postgres operators tend to return immutable copies of the item they are referring to. In other words, you'd never see a column operator on the 'left' side of the equals in an update statement. I think you need to look at a function to get the behavior you want: update test_jsonb_subscript set test_json = jsonb_modify(test_json, '[a][a1][a2] = 42'); ...as a hypothetical example. The idea is you need to make a function that provides the ability to make the complete json you want. Update statements always make a copy of the record anyways. Why should jsonb be different from an array? You can assign to an array element, using exactly this syntax, except that the index expressions have to be integers. This was discussed at pgcon and generally met with approval. There is some demand for it. If Dmitry hadn't done this I would probably have done it myself. Yeah, this all makes sense. I withdraw the statement. merlin
Re: [HACKERS] jsonb array-style subscripting
On 08/17/2015 10:57 AM, Dmitry Dolgov wrote: Hi, Some time ago the array-style subscripting for the jsonb data type was discussed in this mailing list. I think it will be quite convenient to have such a nice syntax to update jsonb objects, so I'm trying to implement this. I created a patch, that allows doing something like this: Yaaay!

=# create TEMP TABLE test_jsonb_subscript ( id int, test_json jsonb );
=# insert into test_jsonb_subscript values (1, '{}'), (2, '{}');
=# update test_jsonb_subscript set test_json['a']['a1']['a2'] = 42;
=# select * from test_jsonb_subscript;
 id |         test_json
----+---------------------------
  1 | {"a": {"a1": {"a2": 42}}}
  2 | {"a": {"a1": {"a2": 42}}}
(2 rows)

So, both perl and python do not allow deep nesting of assignments. For example:

d = { "a" : { } }
d["a"]["a1"]["a2"] = 42
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'a1'

... you have to append one key level at a time. Your approach, on the other hand, feels more user-friendly to me; I can't tell you the number of if 'a2' in dic[key] tests I've written. So, is there any reason why consistency with perl/python behavior would be more desirable than user-friendliness? I'm thinking no, but figured that it's something which needs to come up. There is one ambiguous case you need to address:

testjson = '{ "a" : { } }'
SET testjson['a']['a1']['1'] = 42

... so in this case, is '1' a key, or the first item of an array? how do we determine that? How does the user assign something to an array?

=# select test_json['a']['a1'] from test_jsonb_subscript;
 test_json
------------
 {"a2": 42}
 {"a2": 42}
(2 rows)

Again, how do we handle missing keys? Just return NULL? or ERROR? I'd prefer the former, but there will be arguments the other way. This patch has a status "work in progress" of course. Generally speaking, this implementation extends the `ArrayRef` usage for the jsonb. And I need some sort of advice about several questions: * is it interesting for the community? * is that a good idea to extend the `ArrayRef` for jsonb? If it's appropriate, probably we can rename it to `ArrayJsonbRef` or something. * what can be improved in the code at the top level (function placement, probably, functionality duplication, etc.)? * are there any special cases, that I should take care of in this implementation? array/key ambiguity is going to be painful. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com