Re: [HACKERS] postgresql.auto.conf and reload
On Fri, Aug 8, 2014 at 11:41 AM, Fujii Masao masao.fu...@gmail.com wrote:
> Yep, right. ParseConfigFp() is not a good place to pick up data_directory. What about the attached patch, which makes ProcessConfigFile() instead of ParseConfigFp() pick up data_directory just after the configuration file is parsed?

I think this is a better way to handle it. A few comments about the patch:

1. Can't we have the code directly by adding an else to the branch below?

if (DataDir && !ParseConfigFile(PG_AUTOCONF_FILENAME, NULL, false, 0, elevel, &head, &tail))

2.
+ if (DataDir == NULL)
+ {
+     ConfigVariable *prev = NULL;
+     for (item = head; item;)
+     {
..
..
+ }

If data_directory is not present in the list, then can't we directly return and leave the other work in ProcessConfigFile() for the second pass?

3. I think the comments can be improved:

a.
+ In this case,
+ * we should not pick up all the settings except the data_directory
+ * from postgresql.conf yet because they might be overwritten
+ * with the settings in PG_AUTOCONF_FILENAME file which will be
+ * read later.

It would be better if we change it as below: "In this case, we shouldn't pick any settings except the data_directory from postgresql.conf, because they might be overwritten with the settings in the PG_AUTOCONF_FILENAME file, which will be read later."

b.
+ Now only the data_directory needs to be picked up to
+ * read PG_AUTOCONF_FILENAME file later.

This part of the comment seems repetitive, as you already mentioned some of it in the first line as well as at one other location:

+ * Pick up only the data_directory if DataDir is not set,

+ /*
+  * Read the configuration file for the first time. This time only the
+  * data_directory parameter is picked up to determine the data directory,
+  * so that we can read the PG_AUTOCONF_FILENAME file next time.

Please see if you can improve it.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
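As an aside for readers following along, the two-pass scheme under review (read postgresql.conf once just to learn data_directory, then read it again together with postgresql.auto.conf, whose settings win) can be sketched in miniature. This is a toy model with invented names, not the patch's C code:

```python
# Toy model of the two-pass configuration read discussed above (not
# PostgreSQL's actual code; all names are illustrative): pass 1 parses
# postgresql.conf but keeps only data_directory, so that pass 2 knows
# where to find postgresql.auto.conf, whose settings override the
# earlier ones.

def parse_config(text):
    """Parse simple "name = value" lines into an ordered (name, value) list."""
    items = []
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()      # drop comments and blanks
        if not line:
            continue
        name, _, value = line.partition('=')
        items.append((name.strip(), value.strip().strip("'")))
    return items

def process_config(postgresql_conf, auto_conf_by_dir, data_dir=None):
    items = parse_config(postgresql_conf)
    if data_dir is None:
        # First pass: pick up only data_directory and return immediately,
        # leaving everything else for the second pass.
        for name, value in items:
            if name == 'data_directory':
                return process_config(postgresql_conf, auto_conf_by_dir, value)
        return dict(items)                        # no data_directory set at all
    # Second pass: append the auto.conf items so they override earlier ones.
    items += parse_config(auto_conf_by_dir.get(data_dir, ''))
    return {name: value for name, value in items}
```

The early return in the first pass mirrors the "directly return and leave the other work for the second pass" suggestion in comment 2.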
Re: [HACKERS] postgresql.auto.conf and reload
On Sun, Aug 10, 2014 at 12:24 PM, Amit Kapila amit.kapil...@gmail.com wrote:
> On Fri, Aug 8, 2014 at 11:41 AM, Fujii Masao masao.fu...@gmail.com wrote:
>> Yep, right. ParseConfigFp() is not a good place to pick up data_directory. What about the attached patch, which makes ProcessConfigFile() instead of ParseConfigFp() pick up data_directory just after the configuration file is parsed?
> I think this is a better way to handle it. A few comments about the patch:
> 1. Can't we have the code directly by adding an else to the branch below?
> if (DataDir && !ParseConfigFile(PG_AUTOCONF_FILENAME, NULL, false, 0, elevel, &head, &tail))

I think for this we need to change the above condition a bit.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reporting the commit LSN at commit time
On 2014-08-10 08:50:58 +0800, Craig Ringer wrote:
> On 08/10/2014 12:54 AM, Andres Freund wrote:
>> On 2014-08-07 21:02:54 -0400, Tom Lane wrote:
>>> Craig Ringer cr...@2ndquadrant.com writes:
>>>> On 08/08/2014 03:54 AM, Tom Lane wrote:
>>>>> FWIW, I think it's a seriously bad idea to expose LSNs in the protocol at all. What happens five years from now when we switch to some other implementation that doesn't have LSNs?
>>>> Everyone who's relying on them already via pg_stat_replication, etc., breaks. They're _already_ exposed to users. That ship has sailed.
>>> They're exposed to replication tools, yeah, but embedding them in the wire protocol would be moving the goalposts a long way past that. As an example of something that doubtless seemed like a good idea at the time, consider the business about how an INSERT command completion tag includes the OID of the inserted row. We're stuck with that obsolete idea *forever* because it's embedded in the protocol for all clients.
>> I don't think we really need to embed it at that level. And it doesn't have to be always on - only clients that ask for it need to get the answer. Something like COMMIT WITH (report_commit_lsn ON); or similar might do the trick?
> Wouldn't that force client drivers - libpq, psqlODBC, PgJDBC, etc - to all watch for explicit COMMITs sent by the application and rewrite them?

Any application doing such transparent failover would need to have a driver that's aware of all that anyway. They need to learn about the transaction boundaries, the commit command and such. I personally think this should mean that the feature requires an explicit API call for transaction control.

Greetings,
Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 8 August 2014 16:03, Heikki Linnakangas hlinnakan...@vmware.com wrote:
> 1. MMTuple contains the block number of the heap page (range) that the tuple represents. Vacuum is no longer needed to clean up old tuples; when an index tuple is updated, the old tuple is deleted atomically with the insertion of a new tuple and updating the revmap, so no garbage is left behind.

What happens if the transaction that does this aborts? Surely that means the new value is itself garbage? What cleans up that?

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] parametric block size?
Hello Andres,

> But further benchmarks sound like a good idea.

I've started running some benchmarks with pgbench, with varying block and WAL block sizes. I've done a blog post on a small subset of the results, focusing on block size with SSDs and on validating the significance of the figures found; see it for more details: http://blog.coelho.net/database/2014/08/08/postgresql-page-size-for-SSD/

I've also found an old post by Tomas Vondra, who did really extensive tests, including playing around with file system options: http://www.fuzzy.cz/en/articles/ssd-benchmark-results-read-write-pgbench/

The cumulative and consistent result of all these tests, including Hans-Jürgen Schönig's short tests, is that reducing the page size on SSDs significantly increases pgbench-reported performance, by about 10%.

I've also done some tests with HDDs, which are quite disappointing, with PostgreSQL running in batch mode: a few seconds at 1000 tps followed by a catch-up phase of 20 seconds at about 0 (zero) tps, and back to a new cycle. I'm not sure which parameter to tweak (postgresql configuration, linux io scheduler, ext4 options, or possibly staying away from ext4) to get something more stable.

-- Fabien.
Re: [HACKERS] Minmax indexes
On 8 August 2014 10:01, Heikki Linnakangas hlinnakan...@vmware.com wrote:
> It's possible that two backends arrive at phase 3 at the same time, with different values. For example, backend A wants to update the minimum to contain 10, and backend B wants to update it to 5. Now, if backend B gets to update the tuple first, to 5, backend A will update the tuple to 10 when it gets the lock, which is wrong. The simplest solution would be to get the buffer lock in exclusive mode to begin with, so that you don't need to release it between steps 2 and 5. That might be a significant hit on concurrency, though, when most of the insertions don't in fact have to update the value. Another idea is to re-check the updated values after acquiring the lock in exclusive mode, to see if they match the previous values.

The simplest solution is to re-apply the test just before the update; so, in the above example, if we think we want to lower the minimum to 10 and when we get there it is already 5, we just don't update.

We don't need to do the re-check always, though. We can read the page LSN while holding the share lock, then re-read it once we acquire the exclusive lock. If the LSN is the same, there is no need for datatype-specific re-checks at all.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
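The LSN-based shortcut described here can be modeled in a few lines. This is an illustrative sketch with invented names, not PostgreSQL code: check() stands for what a backend does under the share lock (and records the page LSN it saw), while apply() stands for the exclusive-lock phase, re-checking the values only if the LSN moved in between.

```python
# Hypothetical model of the "re-read the page LSN" idea: an unchanged LSN
# proves nobody updated the page between the share-lock and exclusive-lock
# phases, so the datatype-specific re-check can be skipped entirely.

class RangeSummary:
    """One min/max summary tuple plus the LSN of the page holding it."""
    def __init__(self, lo, hi):
        self.lo, self.hi, self.lsn = lo, hi, 0

    def check(self, value):
        """Share-lock phase: return (needs_update, page LSN seen)."""
        return not (self.lo <= value <= self.hi), self.lsn

    def apply(self, value, seen_lsn):
        """Exclusive-lock phase: widen the range, re-checking against the
        current values only if the page LSN moved since check()."""
        if self.lsn != seen_lsn and self.lo <= value <= self.hi:
            return False            # a concurrent updater already covered us
        self.lo, self.hi = min(self.lo, value), max(self.hi, value)
        self.lsn += 1               # writing WAL advances the page LSN
        return True

# Backend A decides under share lock that inserting 10 needs an update...
summary = RangeSummary(20, 30)
needed, lsn_seen_by_a = summary.check(10)
# ...but backend B gets the exclusive lock first and lowers the minimum to 5.
summary.apply(5, summary.check(5)[1])
# A now sees that the LSN moved and that 10 is already covered: no update.
assert summary.apply(10, lsn_seen_by_a) is False
assert (summary.lo, summary.hi) == (5, 30)
```

The interleaving at the bottom is exactly Heikki's A/B scenario: without the re-check, backend A would overwrite B's minimum of 5 with 10.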
Re: [HACKERS] Minmax indexes
On 8 August 2014 16:03, Heikki Linnakangas hlinnakan...@vmware.com wrote:
> I couldn't resist starting to hack on this, and implemented the scheme I've been having in mind:
> 1. MMTuple contains the block number of the heap page (range) that the tuple represents. Vacuum is no longer needed to clean up old tuples; when an index tuple is updated, the old tuple is deleted atomically with the insertion of a new tuple and updating the revmap, so no garbage is left behind.
> 2. LockTuple is gone. When following the pointer from revmap to MMTuple, the block number is used to check that you land on the right tuple. If not, the search is started over, looking at the revmap again.

Part 2 sounds interesting, especially because of the reduction in CPU that it might allow. Part 1 doesn't sound good yet. Are they connected?

More importantly, can't we tweak this after commit? Delaying commit just means less time for other people to see, test, understand, tune and fix. I see you (Heikki) doing lots of incremental development, lots of small commits. Can't we do this one the same way?

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Minmax indexes
On 08/10/2014 12:22 PM, Simon Riggs wrote:
> On 8 August 2014 16:03, Heikki Linnakangas hlinnakan...@vmware.com wrote:
>> 1. MMTuple contains the block number of the heap page (range) that the tuple represents. Vacuum is no longer needed to clean up old tuples; when an index tuple is updated, the old tuple is deleted atomically with the insertion of a new tuple and updating the revmap, so no garbage is left behind.
> What happens if the transaction that does this aborts? Surely that means the new value is itself garbage? What cleans up that?

It's no different from Alvaro's patch. The updated MMTuple covers the aborted value, but that's OK from a correctness point of view.

- Heikki
Re: [HACKERS] Minmax indexes
On 08/10/2014 12:42 PM, Simon Riggs wrote:
> On 8 August 2014 16:03, Heikki Linnakangas hlinnakan...@vmware.com wrote:
>> I couldn't resist starting to hack on this, and implemented the scheme I've been having in mind:
>> 1. MMTuple contains the block number of the heap page (range) that the tuple represents. Vacuum is no longer needed to clean up old tuples; when an index tuple is updated, the old tuple is deleted atomically with the insertion of a new tuple and updating the revmap, so no garbage is left behind.
>> 2. LockTuple is gone. When following the pointer from revmap to MMTuple, the block number is used to check that you land on the right tuple. If not, the search is started over, looking at the revmap again.
> Part 2 sounds interesting, especially because of the reduction in CPU that it might allow. Part 1 doesn't sound good yet. Are they connected?

Yes. The optimistic locking in part 2 is based on checking that the block number on the MMTuple matches what you're searching for, and that there is never more than one MMTuple in the index with the same block number.

> More importantly, can't we tweak this after commit? Delaying commit just means less time for other people to see, test, understand, tune and fix. I see you (Heikki) doing lots of incremental development, lots of small commits. Can't we do this one the same way?

Well, I wouldn't consider "let's redesign how locking and vacuuming works and change the on-disk format" incremental development ;-). It's more like, well, redesigning the whole thing. Any testing and tuning would certainly need to be redone after such big changes. If you agree that these changes make sense, let's do them now and not waste people's time testing and tuning a dead-end design. If you don't agree, then let's discuss that.

- Heikki
[HACKERS] HINT: pg_hba.conf changed since last config reload
Hi all,

I just had an idea I wanted to run by you all before turning it into a patch. People seem to get confused when they get auth errors because they changed pg_hba.conf but didn't reload. Should we emit a HINT alongside the main auth error in that case?

Given the amount of confusion that I see around pg_hba.conf from new users, I figure anything that makes it less confusing might be a good thing if there aren't other consequences.

Interested in a patch?

-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
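The check itself would be cheap. As a rough sketch of the logic (a hypothetical helper in Python, not server code): record when pg_hba.conf was last loaded, and on an auth failure compare that timestamp to the file's current modification time.

```python
# Toy model of the proposed hint: if pg_hba.conf has been modified since it
# was last loaded, an auth failure probably means the admin forgot to reload.
# All names here are invented for illustration.

import os
import time

class HbaState:
    def __init__(self, path):
        self.path = path
        self.loaded_at = None       # wall-clock time of the last (re)load

    def load(self):
        # A real server would parse the rules here; we only record the time.
        self.loaded_at = time.time()

    def auth_failure_hint(self):
        """Return a hint string if the file changed after the last load."""
        if self.loaded_at is not None and os.path.getmtime(self.path) > self.loaded_at:
            return ('HINT: pg_hba.conf has been changed since it was last '
                    'loaded; did you forget to reload?')
        return None
```

Per the security concern raised downthread, a real implementation would presumably log this server-side only rather than send it to the client.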
Re: [HACKERS] Patch to support SEMI and ANTI join removal
On Tue, Aug 5, 2014 at 10:35 PM, David Rowley dgrowle...@gmail.com wrote:
> The patch (attached) is also now able to detect when a NOT EXISTS clause cannot produce any records at all.

I've attached an updated version of the patch which fixes up some incorrect logic in the foreign key matching code, plus various comment improvements.

Regards,
David Rowley

semianti_join_removal_f92541e_2014-08-10.patch Description: Binary data
Re: [HACKERS] HINT: pg_hba.conf changed since last config reload
Hi,

On 2014-08-10 19:48:29 +0800, Craig Ringer wrote:
> I just had an idea I wanted to run by you all before turning it into a patch. People seem to get confused when they get auth errors because they changed pg_hba.conf but didn't reload. Should we emit a HINT alongside the main auth error in that case? Given the amount of confusion that I see around pg_hba.conf from new users, I figure anything that makes it less confusing might be a good thing if there aren't other consequences.

I think we could/would only emit that to the server log, because of security concerns. It very well might be interesting for an attacker to know that an outdated hba.conf is still being used... Would that still provide enough benefits?

Greetings,
Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Improvement of versioning on Windows, take two
From: Michael Paquier michael.paqu...@gmail.com
> Please find attached a patch finishing the work begun during CF1. This adds versioning support for all the remaining dll and exe files on both MinGW and MSVC:
> - regress.dll (MSVC only)
> - isolationtester.exe
> - pg_isolation_regress.exe
> - pg_regress.exe
> - pg_regress_ecpg.exe
> - zic.exe
> I will add this patch to CF2. Comments are welcome.

The patch applied cleanly to the latest source code, but the build failed with MSVC 2008 Express due to the exact same LNK1104 error mentioned in: http://www.postgresql.org/message-id/cab7npqrty1eikqgmz1_wbvhpvfyve9vma67ricepq6d8-eg...@mail.gmail.com

LINK : fatal error LNK1104: cannot open file '.\release\postgres\src_timezone_win32ver.obj'

Regards,
MauMau
Re: [HACKERS] HINT: pg_hba.conf changed since last config reload
* Andres Freund (and...@2ndquadrant.com) wrote:
> On 2014-08-10 19:48:29 +0800, Craig Ringer wrote:
>> I just had an idea I wanted to run by you all before turning it into a patch. People seem to get confused when they get auth errors because they changed pg_hba.conf but didn't reload. Should we emit a HINT alongside the main auth error in that case? Given the amount of confusion that I see around pg_hba.conf from new users, I figure anything that makes it less confusing might be a good thing if there aren't other consequences.
> I think we could/would only emit that to the server log because of security concerns. It very well might be interesting for an attacker to know that an outdated hba.conf is still being used... Would that still provide enough benefits?

I'd think that'd be useful even if it's only in the main log. To Craig's point on addressing user confusion (when the user is really an admin trying to work through changes), a client-side HINT along the lines of "this incident has been logged with details to the PostgreSQL log file" or something might help. It amazes me how often just telling people to go *look at the server log file* helps...

Thanks,
Stephen
Re: [HACKERS] PostgreSQL vs oracle doing 1 million sqrts am I doing it wrong?
Hi,

2014-08-09 10:20 GMT+02:00 Guillaume Lelarge guilla...@lelarge.info:
> Hi,
> On 9 August 2014 at 05:57, Ramirez, Danilo danilo.rami...@hmhco.com wrote:
>> Thanks to all for the great info. We are new to postgresql, and this discussion has both instructed us and increased our respect for the database and the community.
>> I am seeing a behavior that I don't understand and hopefully you guys can clear it up. I am using AWS postgresql db.m3.2xlarge with pgadmin III 1.18, comparing against AWS oracle on db.m3.2xlarge using sql developer and TOAD.
>> I am running a query with 30 tables in the from clause, getting 137 columns back (this is our most basic query; they get a lot more complex). It returns 4800 rows. In oracle the 1st run takes 3.92 seconds, the 2nd .38 seconds. Scrolling to the end takes an extra 1.5 seconds, for a total of 5.5.
>> Using pgadmin, I run the query. Looking at the lower right hand corner I can see the time going up. It stops at 8200 ms or close to it every time, then it takes an extra 6 seconds before it displays the rows on the screen. The 2nd, 3rd, etc. runs all take about the same amount of time: 8 sec plus 6 sec.
>> I then changed it to return only 1 column back. In oracle/sqldeveloper, identical behavior as before, same time. In postgresql it now goes down to 1.8 seconds for the 1st, 2nd, etc. runs.
>> I then change it so that I am asking for the sum of 1 column. In oracle the time goes down to .2 seconds, and postgresql now goes down to .2 seconds also.
>> I then change it back to get the full result set, and the behavior goes back to the original: oracle .38 since it's cached, postgresql 8 seconds.
> Are you sure postgresql is really taking 8 seconds? I'd believe this is more something like postgresql taking something really low and PgAdmin spending around 8 seconds displaying it. What I mean is, PgAdmin uses really slow UI components, and the time it shows is the time to execute the query and display the data. IOW, you shouldn't use it to benchmark. You'd be better off using psql.
> Or, much better, you should set log_min_duration_statement to 0 and see exactly how much time postgresql needs to execute it.

Yes, try to eliminate the impact of PgAdmin. For this purpose use psql:

\timing
\o /dev/null
SELECT ... -- your query

Regards,
Pavel

P.S. You can send the plans of the slow and fast variants.
Re: [HACKERS] Support for N synchronous standby servers
On Sat, Aug 9, 2014 at 3:03 PM, Michael Paquier michael.paqu...@gmail.com wrote:
> Hi all,
> Please find attached a patch to add support of synchronous replication for multiple standby servers. This is controlled by the addition of a new GUC parameter called synchronous_standby_num, which makes the server wait for transaction commit on the first N standbys defined in synchronous_standby_names. The implementation is really straightforward, and has just needed a couple of modifications in walsender.c for pg_stat_get_wal_senders and in syncrep.c.

Great! This is really the feature which I want. Though I forget why we left this feature out when we added synchronous replication, it may be worth reading the old discussion, which may point out the potential problems of N sync standbys.

I just tested this feature with synchronous_standby_num = 2. I started up only one synchronous standby and ran a write transaction. The transaction completed successfully, i.e., it didn't wait for two standbys. Probably this is a bug in the patch.

Also, you forgot to add a line for synchronous_standby_num to postgresql.conf.sample.

Regards,
-- Fujii Masao
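For reference, the behavior expected from synchronous_standby_num = N can be stated as a tiny predicate (a toy model, not the patch's code): a commit may only be released once at least N of the standbys named in synchronous_standby_names have reported flushing past the commit's LSN.

```python
# Toy model of quorum release for N synchronous standbys. LSNs are plain
# integers here; standby_flush_lsns maps connected standby name -> flushed LSN.

def commit_can_release(commit_lsn, standby_flush_lsns, sync_names, num_sync):
    """True once at least num_sync of the named sync standbys have
    flushed at or beyond commit_lsn."""
    acked = sum(
        1 for name in sync_names
        if standby_flush_lsns.get(name, -1) >= commit_lsn
    )
    return acked >= num_sync
```

With num_sync = 2 and only one standby connected, the predicate stays false, so the commit must keep waiting; the fact that the tested transaction completed is exactly the reported bug.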
Re: [HACKERS] Minmax indexes
On Fri, Aug 8, 2014 at 6:01 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
> It's possible that two backends arrive at phase 3 at the same time, with different values. For example, backend A wants to update the minimum to contain 10, and backend B wants to update it to 5. Now, if backend B gets to update the tuple first, to 5, backend A will update the tuple to 10 when it gets the lock, which is wrong. The simplest solution would be to get the buffer lock in exclusive mode to begin with, so that you don't need to release it between steps 2 and 5. That might be a significant hit on concurrency, though, when most of the insertions don't in fact have to update the value. Another idea is to re-check the updated values after acquiring the lock in exclusive mode, to see if they match the previous values.

No, the simplest solution is to re-check the bounds after acquiring the exclusive lock. So instead of doing the addValue with the share lock, do a consistency check first, and if it's not consistent, do the addValue with the exclusive lock.
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On Fri, 2014-08-08 at 01:16 -0700, Jeff Davis wrote: Either way, it's better to be conservative. Attached is a version of the patch with opt-in memory usage tracking. Child contexts inherit the setting from their parent. There was a problem with the previous patch: the parent was unlinked before the delete_context method was called, which meant that the parent's memory was not being decremented. Attached is a fix. It would be simpler to just reorder the operations in MemoryContextDelete, but there is a comment warning against that. So, I pass the old parent as an argument to the delete_context method. Regards, Jeff Davis *** a/src/backend/utils/mmgr/aset.c --- b/src/backend/utils/mmgr/aset.c *** *** 242,247 typedef struct AllocChunkData --- 242,249 #define AllocChunkGetPointer(chk) \ ((AllocPointer)(((char *)(chk)) + ALLOC_CHUNKHDRSZ)) + static void update_allocation(MemoryContext context, int64 size); + /* * These functions implement the MemoryContext API for AllocSet contexts. */ *** *** 250,256 static void AllocSetFree(MemoryContext context, void *pointer); static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size); static void AllocSetInit(MemoryContext context); static void AllocSetReset(MemoryContext context); ! static void AllocSetDelete(MemoryContext context); static Size AllocSetGetChunkSpace(MemoryContext context, void *pointer); static bool AllocSetIsEmpty(MemoryContext context); static void AllocSetStats(MemoryContext context, int level); --- 252,258 static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size); static void AllocSetInit(MemoryContext context); static void AllocSetReset(MemoryContext context); ! 
static void AllocSetDelete(MemoryContext context, MemoryContext parent); static Size AllocSetGetChunkSpace(MemoryContext context, void *pointer); static bool AllocSetIsEmpty(MemoryContext context); static void AllocSetStats(MemoryContext context, int level); *** *** 430,435 randomize_mem(char *ptr, size_t size) --- 432,440 * minContextSize: minimum context size * initBlockSize: initial allocation block size * maxBlockSize: maximum allocation block size + * + * The flag determining whether this context tracks memory usage is inherited + * from the parent context. */ MemoryContext AllocSetContextCreate(MemoryContext parent, *** *** 438,443 AllocSetContextCreate(MemoryContext parent, --- 443,468 Size initBlockSize, Size maxBlockSize) { + return AllocSetContextCreateTracked( + parent, name, minContextSize, initBlockSize, maxBlockSize, + (parent == NULL) ? false : parent-track_mem); + } + + /* + * AllocSetContextCreateTracked + * Create a new AllocSet context. + * + * Implementation for AllocSetContextCreate, but also allows the caller to + * specify whether memory usage should be tracked or not. + */ + MemoryContext + AllocSetContextCreateTracked(MemoryContext parent, + const char *name, + Size minContextSize, + Size initBlockSize, + Size maxBlockSize, + bool track_mem) + { AllocSet context; /* Do the type-independent part of context creation */ *** *** 445,451 AllocSetContextCreate(MemoryContext parent, sizeof(AllocSetContext), AllocSetMethods, parent, ! name); /* * Make sure alloc parameters are reasonable, and save them. --- 470,477 sizeof(AllocSetContext), AllocSetMethods, parent, ! name, ! track_mem); /* * Make sure alloc parameters are reasonable, and save them. 
*** *** 500,505 AllocSetContextCreate(MemoryContext parent, --- 526,534 errdetail(Failed while creating memory context \%s\., name))); } + + update_allocation((MemoryContext) context, blksize); + block-aset = context; block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ; block-endptr = ((char *) block) + blksize; *** *** 590,595 AllocSetReset(MemoryContext context) --- 619,625 else { /* Normal case, release the block */ + update_allocation(context, -(block-endptr - ((char*) block))); #ifdef CLOBBER_FREED_MEMORY wipe_mem(block, block-freeptr - ((char *) block)); #endif *** *** 611,617 AllocSetReset(MemoryContext context) * But note we are not responsible for deleting the context node itself. */ static void ! AllocSetDelete(MemoryContext context) { AllocSet set = (AllocSet) context; AllocBlock block = set-blocks; --- 641,647 * But note we are not responsible for deleting the context node itself. */ static void ! AllocSetDelete(MemoryContext context, MemoryContext parent) { AllocSet set = (AllocSet) context; AllocBlock block =
[HACKERS] 9.5: Memory-bounded HashAgg
This patch requires the Memory Accounting patch, or something similar to track memory usage.

The attached patch enables hashagg to spill to disk, which means that hashagg will contain itself to work_mem even if the planner makes a bad misestimate of the cardinality.

This is a well-known concept; there's even a Berkeley homework assignment floating around to implement it -- in postgres 7.2, no less. I didn't take the exact same approach as the homework assignment suggests, but it's not much different, either. My apologies if some classes are still using this as a homework assignment, but postgres needs to eventually have an answer to this problem.

Included is a GUC, enable_hashagg_disk (default on), which allows the planner to choose hashagg even if it doesn't expect the hashtable to fit in memory. If it's off, and the planner misestimates the cardinality, hashagg will still use the disk to contain itself to work_mem.

One situation that might surprise the user is if work_mem is set too low, and the user is *relying* on a misestimate to pick hashagg. With this patch, it would end up going to disk, which might be significantly slower. The solution for the user is to increase work_mem.

Rough Design:

Change the hash aggregate algorithm to accept a generic "work item", which consists of an input file as well as some other bookkeeping information.

Initially prime the algorithm by adding a single work item where the file is NULL, indicating that it should read from the outer plan.

If memory is exhausted during execution of a work item, then continue to allow existing groups to be aggregated, but do not allow new groups to be created in the hash table. Tuples representing new groups are saved in an output partition file referenced in the work item that is currently being executed.

When the work item is done, emit any groups in the hash table, clear the hash table, and turn each output partition file into a new work item.
Each time through at least some groups are able to stay in the hash table, so eventually none will need to be saved in output partitions, no new work items will be created, and the algorithm will terminate. This is true even if the number of output partitions is always one. Open items: * costing * EXPLAIN details for disk usage * choose number of partitions intelligently * performance testing Initial tests indicate that it can be competitive with sort+groupagg when the disk is involved, but more testing is required. Feedback welcome. Regards, Jeff Davis *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** *** 2884,2889 include_dir 'conf.d' --- 2884,2904 /listitem /varlistentry + varlistentry id=guc-enable-hashagg-disk xreflabel=enable_hashagg_disk + termvarnameenable_hashagg_disk/varname (typeboolean/type) + indexterm +primaryvarnameenable_hashagg_disk/ configuration parameter/primary + /indexterm + /term + listitem +para + Enables or disables the query planner's use of hashed aggregation plan + types when the planner expects the hash table size to exceed + varnamework_mem/varname. The default is literalon/. +/para + /listitem + /varlistentry + varlistentry id=guc-enable-hashjoin xreflabel=enable_hashjoin termvarnameenable_hashjoin/varname (typeboolean/type) indexterm *** a/src/backend/executor/execGrouping.c --- b/src/backend/executor/execGrouping.c *** *** 331,336 TupleHashEntry --- 331,385 LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot, bool *isnew) { + uint32 hashvalue; + + hashvalue = TupleHashEntryHash(hashtable, slot); + return LookupTupleHashEntryHash(hashtable, slot, hashvalue, isnew); + } + + /* + * TupleHashEntryHash + * + * Calculate the hash value of the tuple. + */ + uint32 + TupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot) + { + TupleHashEntryData dummy; + TupleHashTable saveCurHT; + uint32hashvalue; + + /* + * Set up data needed by hash function. 
+ * + * We save and restore CurTupleHashTable just in case someone manages to + * invoke this code re-entrantly. + */ + hashtable-inputslot = slot; + hashtable-in_hash_funcs = hashtable-tab_hash_funcs; + hashtable-cur_eq_funcs = hashtable-tab_eq_funcs; + + saveCurHT = CurTupleHashTable; + CurTupleHashTable = hashtable; + + dummy.firstTuple = NULL; /* flag to reference inputslot */ + hashvalue = TupleHashTableHash(dummy, sizeof(TupleHashEntryData)); + + CurTupleHashTable = saveCurHT; + + return hashvalue; + } + + /* + * LookupTupleHashEntryHash + * + * Like LookupTupleHashEntry, but allows the caller to specify the tuple's + * hash value, to avoid recalculating it. + */ + TupleHashEntry + LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot, + uint32 hashvalue, bool *isnew) + {
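To make the design in this message concrete, here is a deliberately simplified model of the spilling scheme (Python, with in-memory lists standing in for partition temp files and a cap on the number of groups standing in for work_mem; the real patch's hashing and file handling are of course different):

```python
# Toy model of memory-bounded hash aggregation with spilling: when the group
# limit is hit, tuples that would create *new* groups are routed to output
# partitions, and each non-empty partition later becomes a work item of its own.

from collections import deque

def hashagg(tuples, max_groups, num_partitions=2):
    result = {}
    work = deque([list(tuples)])           # one initial work item: the outer plan
    while work:
        item = work.popleft()
        table = {}
        partitions = [[] for _ in range(num_partitions)]
        spilled = False
        for key, val in item:
            if key in table:
                table[key] += val          # existing groups keep aggregating
            elif len(table) < max_groups:
                table[key] = val           # room for a new group
            else:
                spilled = True             # no room: save the tuple for later
                partitions[hash(key) % num_partitions].append((key, val))
        result.update(table)               # work item done: emit its groups
        if spilled:
            work.extend(p for p in partitions if p)
    return result
```

Each pass retains at least some groups in the table, so every output partition has strictly fewer distinct keys than its parent work item and the loop terminates, matching the argument in the mail; a key either stays in the table for the whole item or spills entirely to one partition, so no group is ever emitted twice.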
Re: [HACKERS] Partitioning performance: cache stringToNode() of pg_constraint.ccbin
On Thu, Aug 07, 2014 at 09:39:57AM -0400, Robert Haas wrote:
> On Wed, Aug 6, 2014 at 9:35 PM, Bruce Momjian br...@momjian.us wrote:
>> On Sun, Jan 12, 2014 at 12:53:40PM -0500, Noah Misch wrote:
>>> Further study revealed a defect in the patch's memory management, and I have not gotten around to correcting that.
>> I talked to Noah and he can't continue on this item. Can someone else work on it?
> Well, it would be helpful if he could describe the defect he found, so that the next person doesn't have to guess.

Quite right. Primarily, a leak: FreeTupleDesc() did a mere pfree() on a node tree, leaking all but the root node. Perhaps the best fix there is to give each TupleDesc a memory context and then simplify FreeTupleDesc() to a context deletion. That will tend to waste more memory, but I haven't thought of something clearly better.

The patch passes a relcache-owned node tree to eval_const_expressions() via ExecPrepareExpr(). If that stands, I should add a comment to the effect that eval_const_expressions() must not modify the original tree in place or reference it from the result tree. The comment at expression_planner() implies that's already true. I know of no exceptions, but I have not audited. How reasonable is reliance on continued conformance to that rule? An alternative is to have the relcache always return a node tree copy, like it does in RelationGetIndexExpressions(). That sacrifices about 1/4 of the performance gain.

-- Noah Misch EnterpriseDB http://www.enterprisedb.com
[HACKERS] nulls in GIN index
http://www.postgresql.org/docs/9.1/static/gin-implementation.html As of PostgreSQL 9.1, NULL key values can be included in the index. Also, placeholder NULLs are included in the index for indexed items that are NULL or contain no keys according to extractValue. This allows searches that should find empty items to do so. How do I define an index that includes nulls then?
Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL
[Due for a new subject line?] On Sat, Aug 09, 2014 at 08:16:01PM +0200, Andres Freund wrote: On 2014-08-09 14:09:36 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-08-09 14:00:49 -0400, Tom Lane wrote: I don't think it's anywhere near as black-and-white as you guys claim. What it comes down to is whether allowing existing transactions/sessions to finish is more important than allowing new sessions to start. Depending on the application, either could be more important. Nah. The current behaviour circumvents security measures we normally consider absolutely essential. If the postmaster died some bad shit went on. The likelihood of hitting corner case bugs where it's important that we react to a segfault/panic with a restart/crash replay is rather high. What's your point? Once a new postmaster starts, it *will* do a crash restart, because certainly no shutdown checkpoint ever happened. That's not saying much. For one, there can be online checkpoints in that time. So it's certainly not guaranteed (or even all that likely) that all the WAL since the incident is replayed. For another, it can be *hours* before all the backends finish. IIRC we'll continue to happily write WAL and everything after postmaster (and possibly some backends, corrupting shmem) have crashed. The bgwriter, checkpointer, backends will continue to write dirty buffers to disk. We'll IIRC continue to write checkpoints. That's simply not things we should be doing after postmaster crashed if we can avoid at all. The basic support processes, including the checkpointer, exit promptly upon detecting a postmaster exit. Checkpoints cease. Your central point still stands. WAL protects data integrity only to the extent that we stop writing it after shared memory ceases to be trustworthy. Crash recovery of WAL written based on corrupt buffers just reproduces the corruption. 
The only issue here is what grace period existing orphaned backends are given to finish their work --- and it's not possible for the answer to that to be zero, so you don't get to assume that nothing happens in backend-land after the instant of postmaster crash. Our grace period for active backends after unclean exit of one of their peers is low, milliseconds to seconds. Our grace period for active backends after unclean exit of the postmaster is unconstrained. At least one of those policies has to be wrong. Like Andres and Robert, I pick the second one. -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL
* Noah Misch (n...@leadboat.com) wrote: [Due for a new subject line?] Probably. Our grace period for active backends after unclean exit of one of their peers is low, milliseconds to seconds. Our grace period for active backends after unclean exit of the postmaster is unconstrained. At least one of those policies has to be wrong. Like Andres and Robert, I pick the second one. Ditto for me. The postmaster going away really is a bad sign and the confusion due to leftover processes is terrible for our users. Thanks, Stephen
Re: [HACKERS] 9.5: Memory-bounded HashAgg
Hi, it's 1AM here, so only a few comments after quickly reading the patch. On 10.8.2014 23:26, Jeff Davis wrote: This patch requires the Memory Accounting patch, or something similar to track memory usage. I think the patch you sent actually includes the accounting patch. Is that on purpose, or by accident? I'd suggest keeping these two patches separate. Rough Design: Change the hash aggregate algorithm to accept a generic work item, which consists of an input file as well as some other bookkeeping information. Initially prime the algorithm by adding a single work item where the file is NULL, indicating that it should read from the outer plan. If the memory is exhausted during execution of a work item, then continue to allow existing groups to be aggregated, but do not allow new groups to be created in the hash table. Tuples representing new groups are saved in an output partition file referenced in the work item that is currently being executed. When the work item is done, emit any groups in the hash table, clear the hash table, and turn each output partition file into a new work item. Each time through at least some groups are able to stay in the hash table, so eventually none will need to be saved in output partitions, no new work items will be created, and the algorithm will terminate. This is true even if the number of output partitions is always one. So once a group gets into memory, it stays there? That's going to work fine for aggregates with fixed-size state (int4, or generally state that gets allocated and does not grow), but I'm afraid for aggregates with growing state (as for example array_agg and similar) that's not really a solution. How difficult would it be to dump the current state into a file (and remove them from the hash table)? While hacking on the hash join, I envisioned the hash aggregate might work in a very similar manner, i.e. 
something like this:

 * nbatches=1, nbits=0
 * when work_mem gets full => nbatches *= 2, nbits += 1
 * get rid of half the groups, using nbits from the hash
    => dump the current states into 'states.batchno' file
    => dump further tuples to 'tuples.batchno' file
 * continue until the end, or until work_mem gets full again

This is pretty much what the hashjoin does, except that the join needs to batch the outer relation too (which hashagg does not need to do). Otherwise most of the batching logic can be copied. It also seems to me that the logic of the patch is about this:

 * try to lookup the group in the hash table
 * found => call the transition function
 * not found
    * enough space => call transition function
    * not enough space => tuple/group goes to a batch

Which pretty much means all tuples need to do the lookup first. The nice thing on the hash-join approach is that you don't really need to do the lookup - you just need to peek at the hash whether the group belongs to the current batch (and if not, to which batch it should go). Of course, that would require the ability to dump the current state of the group, but for the aggregates using basic types as a state (int4/int8, ...) with fixed-length state that's trivial. For aggregates using 'internal' to pass pointers that requires some help from the author - serialization/deserialization functions.

Open items:

* costing

Not sure how this is done for the hash-join, but I guess that might be a good place for inspiration.

* EXPLAIN details for disk usage
* choose number of partitions intelligently

What is the purpose of HASH_DISK_MAX_PARTITIONS? I mean, when we decide we need 2048 partitions, why should we use less if we believe it will get us over work_mem?

* performance testing

Initial tests indicate that it can be competitive with sort+groupagg when the disk is involved, but more testing is required. For us, removing the sort is a big deal, because we're working with 100M rows regularly. 
It's more complicated though, because the sort is usually enforced by COUNT(DISTINCT) and that's not going to disappear because of this patch. But that's solvable with a custom aggregate. Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
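The work-item scheme Jeff describes (aggregate what fits, spill tuples of new groups to partition files, then treat each partition as a new work item) can be condensed into a toy model. This is a hedged Python sketch, not the patch's C code — the name hash_agg and the max_groups cap (a stand-in for the work_mem check) are invented for illustration:

```python
def hash_agg(tuples, key, init, trans, max_groups, num_partitions=4):
    """Toy model of memory-bounded hash aggregation.

    `table` holds at most max_groups groups (standing in for the work_mem
    check).  Tuples whose group is already in memory are always aggregated;
    tuples for new groups that no longer fit are spilled to one of
    num_partitions lists, and each non-empty partition becomes a new work
    item processed recursively.  Since groups are split by hash, all tuples
    of a spilled group land in the same partition and the recursive passes
    produce disjoint result sets.
    """
    table = {}
    partitions = [[] for _ in range(num_partitions)]
    for t in tuples:
        g = key(t)
        if g in table:
            table[g] = trans(table[g], t)       # existing group: keep aggregating
        elif len(table) < max_groups:
            table[g] = trans(init(), t)         # room for a new group
        else:
            # "work_mem exhausted": defer this group to a later pass
            partitions[hash(g) % num_partitions].append(t)
    results = dict(table)
    for part in partitions:
        if part:                                # each partition is a new work item
            results.update(hash_agg(part, key, init, trans,
                                    max_groups, num_partitions))
    return results
```

Note the toy only ever spills input tuples, never per-group state — which is exactly why growing-state aggregates like array_agg are the hard case: dumping their state, as suggested above, needs serialization support. Termination holds even with a single partition, since each pass absorbs up to max_groups new groups.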
[HACKERS] Function to know last log write timestamp
We can know the LSN of last committed WAL record on primary by using pg_current_xlog_location(). It seems there's no API to know the time when the WAL record was created. I would like to know standby delay by using pg_last_xact_replay_timestamp() and such an API. If there's no such API, it would be useful to invent such an API IMO. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: [HACKERS] Function to know last log write timestamp
On Mon, Aug 11, 2014 at 9:23 AM, Tatsuo Ishii is...@postgresql.org wrote: We can know the LSN of last committed WAL record on primary by using pg_current_xlog_location(). It seems there's no API to know the time when the WAL record was created. I would like to know standby delay by using pg_last_xact_replay_timestamp() and such an API. If there's no such API, it would be useful to invent such an API IMO. +1 I proposed that function before, but unfortunately it failed to be applied. But I still think that function is useful to calculate the replication delay. The past discussion is http://www.postgresql.org/message-id/cahgqgwf3zjfunej5ka683ku5rqubtswtqfq7g1x0g34o+jx...@mail.gmail.com Regards, -- Fujii Masao
Re: [HACKERS] Function to know last log write timestamp
On Mon, Aug 11, 2014 at 9:23 AM, Tatsuo Ishii is...@postgresql.org wrote: We can know the LSN of last committed WAL record on primary by using pg_current_xlog_location(). It seems there's no API to know the time when the WAL record was created. I would like to know standby delay by using pg_last_xact_replay_timestamp() and such an API. If there's no such API, it would be useful to invent such an API IMO. +1 I proposed that function before, but unfortunately it failed to be applied. But I still think that function is useful to calculate the replication delay. The past discussion is http://www.postgresql.org/message-id/cahgqgwf3zjfunej5ka683ku5rqubtswtqfq7g1x0g34o+jx...@mail.gmail.com I looked into the thread briefly and found Simon and Robert gave -1 for this because of performance concerns. I'm not sure if it's an actual performance penalty or not. Maybe we need to measure the penalty? However I still think that kind of API is very useful because replication delay is one of DBAs' big concerns. Why don't we have a switch to enable the API for DBAs who think the priority is replication delay over a small performance penalty? Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
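To make the proposed use concrete: with a primary-side function returning the timestamp of the last commit WAL record (the function proposed in the earlier thread) and the existing pg_last_xact_replay_timestamp() on the standby, the delay is just a clamped difference. A minimal Python sketch — the function name and arguments are assumptions for illustration, not an existing API:

```python
from datetime import timedelta

def replication_delay(primary_last_commit_ts, standby_last_replay_ts):
    """Approximate standby delay from two timestamps: the last commit
    record written on the primary vs. the last commit record replayed on
    the standby.  Clamp to zero when the standby has replayed everything
    (clock skew can make the replay timestamp run slightly ahead)."""
    if standby_last_replay_ts >= primary_last_commit_ts:
        return timedelta(0)
    return primary_last_commit_ts - standby_last_replay_ts
```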
Re: [HACKERS] Support for N synchronous standby servers
On Mon, Aug 11, 2014 at 1:31 AM, Fujii Masao masao.fu...@gmail.com wrote: On Sat, Aug 9, 2014 at 3:03 PM, Michael Paquier michael.paqu...@gmail.com wrote: Great! This is really the feature which I really want. Though I forgot why we missed this feature when we had added the synchronous replication feature, maybe it's worth reading the old discussion which may suggest the potential problem of N sync standbys. Sure, I'll double check. Thanks for your comments. I just tested this feature with synchronous_standby_num = 2. I started up only one synchronous standby and ran the write transaction. Then the transaction was successfully completed, i.e., it didn't wait for two standbys. Probably this is a bug of the patch. Oh OK, yes this is a bug of what I did. The number of standbys to wait for takes precedence on the number of standbys found in the list of active WAL senders. I changed the patch to take into account that behavior. So for example if you have only one sync standby connected, and synchronous_standby_num = 2, client waits indefinitely. And, you forgot to add the line of synchronous_standby_num to postgresql.conf.sample. Yep, right. On top of that, I refactored the code in such a way that pg_stat_get_wal_senders and SyncRepReleaseWaiters rely on a single API to get the list of synchronous standbys found. This reduces code duplication, duplication that already exists in HEAD... Regards, -- Michael

*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 2586,2597 **** include_dir 'conf.d'
        Specifies a comma-separated list of standby names that can support
        <firstterm>synchronous replication</>, as described in
        <xref linkend="synchronous-replication">.
!       At any one time there will be at most one active synchronous standby;
!       transactions waiting for commit will be allowed to proceed after
!       this standby server confirms receipt of their data.
!       The synchronous standby will be the first standby named in this list
!       that is both currently connected and streaming data in real-time
!       (as shown by a state of <literal>streaming</literal> in the
!       <link linkend="monitoring-stats-views-table">
        <literal>pg_stat_replication</literal></link> view).
        Other standby servers appearing later in this list represent potential
--- 2586,2598 ----
        Specifies a comma-separated list of standby names that can support
        <firstterm>synchronous replication</>, as described in
        <xref linkend="synchronous-replication">.
!       At any one time there will be at a number of active synchronous standbys
!       defined by <varname>synchronous_standby_num</>; transactions waiting
!       for commit will be allowed to proceed after those standby servers
!       confirms receipt of their data. The synchronous standbys will be
!       the first entries named in this list that are both currently connected
!       and streaming data in real-time (as shown by a state of
!       <literal>streaming</literal> in the
!       <link linkend="monitoring-stats-views-table">
        <literal>pg_stat_replication</literal></link> view).
        Other standby servers appearing later in this list represent potential
***************
*** 2627,2632 **** include_dir 'conf.d'
--- 2628,2652 ----
       </listitem>
      </varlistentry>

+     <varlistentry id="guc-synchronous-standby-num" xreflabel="synchronous_standby_num">
+      <term><varname>synchronous_standby_num</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>synchronous_standby_num</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the number of standbys that support
+        <firstterm>synchronous replication</>, as described in
+        <xref linkend="synchronous-replication">, and listed as the first
+        elements of <xref linkend="guc-synchronous-standby-names">.
+       </para>
+       <para>
+        Default value is 1.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-defer-cleanup-age" xreflabel="vacuum_defer_cleanup_age">
       <term><varname>vacuum_defer_cleanup_age</varname> (<type>integer</type>)
       <indexterm>
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 1081,1092 **** primary_slot_name = 'node_a_slot'
      WAL record is then sent to the standby. The standby sends reply
      messages each time a new batch of WAL data is written to disk, unless
      <varname>wal_receiver_status_interval</> is set to zero on the standby.
!     If the standby is the first matching standby, as specified in
!     <varname>synchronous_standby_names</> on the primary, the reply
!     messages from that standby will be used to wake users waiting for
!     confirmation that the commit record has been received. These parameters
!     allow the administrator to specify which standby servers should be
!     synchronous standbys. Note
Re: [HACKERS] Function to know last log write timestamp
On Mon, Aug 11, 2014 at 10:48 AM, Tatsuo Ishii is...@postgresql.org wrote: On Mon, Aug 11, 2014 at 9:23 AM, Tatsuo Ishii is...@postgresql.org wrote: We can know the LSN of last committed WAL record on primary by using pg_current_xlog_location(). It seems there's no API to know the time when the WAL record was created. I would like to know standby delay by using pg_last_xact_replay_timestamp() and such an API. If there's no such API, it would be useful to invent such an API IMO. +1 I proposed that function before, but unfortunately it failed to be applied. But I still think that function is useful to calculate the replication delay. The past discussion is http://www.postgresql.org/message-id/cahgqgwf3zjfunej5ka683ku5rqubtswtqfq7g1x0g34o+jx...@mail.gmail.com I looked into the thread briefly and found Simon and Robert gave -1 for this because of performance concerns. I'm not sure if it's an actual performance penalty or not. Maybe we need to measure the penalty? I think that the performance penalty is negligibly small because the patch I posted before added only three stores to shared memory per commit/abort. No time-consuming operations like lock, gettimeofday, etc were added. Of course, it's worth checking whether the penalty is actually small or not, though. Regards, -- Fujii Masao
Re: [HACKERS] Hokey wrong versions of libpq in apt.postgresql.org
The SO major version should be sufficient for applications to operate normally. If that isn't the case then I agree that we need to review the changes we are making to see if the SO should be bumped. Note that Debian's viewpoint on this is more along the lines of why build against an old version if the latest one, whose major SO is the same, exists? This is largely to avoid the hell of versioned Build-Depends and having to support multiple sets of -dev packages concurrently (consider that builds generally only look for the unversioned '.so' file also..). At the same time, there would be more merit in Josh's position if he could point to any *actual* libpq changes that might pose application compatibility problems. I don't recall that we've made many such, so the above argument might just be hypothetical. I don't believe it's hypothetical from Josh's perspective, but I didn't follow the threads completely to confirm that there was a real issue. If there is a real issue here, I'd most likely vote to fix it and backpatch it as a bug, though it's not clear if that would be considered 'good enough' for this case. The issue that I specifically ran into is that by using apt.postgresql.org in its default configuration, I can't add certain extensions (without jumping through hoops). Simple: Assume a running 9.2.9 from apt.postgresql.org apt-get install pgxnclient sudo pgxnclient install pg_repack Doesn't work. Because it is looking for libs in /var/lib/postgresql/9.2 not /var/lib/postgresql/9.3. It also failed on another extension (but I don't recall which one it is). Yes. I can get the 9.2 libpq but that isn't really the point is it? This is quite unexpected behavior from an operational perspective. It should just work but it doesn't because we are shipping from apt.postgresql.org a 9.3 version of libpq. (protocol versions don't really matter here, this is an operational thing). Sincerely, Joshua D. Drake Thanks, Stephen -- Command Prompt, Inc. 
- http://www.commandprompt.com/ 503-667-4564 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, @cmdpromptinc If we send our children to Caesar for their education, we should not be surprised when they come back as Romans. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for N synchronous standby servers
On Mon, Aug 11, 2014 at 11:54 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Mon, Aug 11, 2014 at 1:31 AM, Fujii Masao masao.fu...@gmail.com wrote: On Sat, Aug 9, 2014 at 3:03 PM, Michael Paquier michael.paqu...@gmail.com wrote: Great! This is really the feature which I really want. Though I forgot why we missed this feature when we had added the synchronous replication feature, maybe it's worth reading the old discussion which may suggest the potential problem of N sync standbys. Sure, I'll double check. Thanks for your comments. I just tested this feature with synchronous_standby_num = 2. I started up only one synchronous standby and ran the write transaction. Then the transaction was successfully completed, i.e., it didn't wait for two standbys. Probably this is a bug of the patch. Oh OK, yes this is a bug of what I did. The number of standbys to wait for takes precedence on the number of standbys found in the list of active WAL senders. I changed the patch to take into account that behavior. So for example if you have only one sync standby connected, and synchronous_standby_num = 2, client waits indefinitely. Thanks for updating the patch! Again I tested the feature and found something wrong. I set synchronous_standby_num to 2 and started three standbys. Two of them are included in synchronous_standby_names, i.e., they are synchronous standbys. That is, the other one standby is always asynchronous. When I shutdown one of synchronous standbys and executed the write transaction, the transaction was successfully completed. So the transaction doesn't wait for two sync standbys in that case. Probably this is a bug. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
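The bug being reported here is in the release condition for waiting commits. A hedged Python sketch of the intended rule — not the patch's SyncRepReleaseWaiters code; the names and the LSN-as-integer simplification are invented — a commit at commit_lsn may only be released once sync_num of the standbys listed in synchronous_standby_names have flushed past it, so with synchronous_standby_num = 2 and only one sync standby connected, the commit must keep waiting:

```python
def commit_can_release(standbys, sync_names, sync_num, commit_lsn):
    """Decide whether a commit waiting at commit_lsn may be released.

    standbys:   list of (application_name, flush_lsn) for connected,
                streaming walsenders.
    sync_names: synchronous_standby_names, in priority order.
    sync_num:   synchronous_standby_num.

    Only standbys named in sync_names count toward the quorum; if fewer
    than sync_num of them have flushed past commit_lsn (e.g. because
    some are not even connected), the commit keeps waiting -- the bug
    discussed above was releasing it anyway.
    """
    flushed = dict(standbys)
    acks = 0
    for name in sync_names:                     # walk in priority order
        if flushed.get(name, -1) >= commit_lsn:
            acks += 1
            if acks >= sync_num:
                return True
    return False
```

This is a simplification: it counts acknowledgements from any listed standby rather than strictly the sync_num highest-priority connected ones, but it captures the property under discussion — an asynchronous standby or a missing sync standby never satisfies the quorum.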
Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]
On Mon, Aug 4, 2014 at 11:41 AM, Dilip kumar dilip.ku...@huawei.com wrote: On 31 July 2014 10:59, Amit kapila Wrote, Thanks for the review and valuable comments. I have fixed all the comments and attached the revised patch. I have again looked into your revised patch and would like to share my findings with you.

1. +Number of parallel connections to perform the operation. This option will enable the vacuum +operation to run on parallel connections, at a time one table will be operated on one connection.
a. How about describing w.r.t asynchronous connections instead of parallel connections?
b. It is better to have line length less than 80.
c. As you are using multiple connections to achieve parallelism, I suggest you add a line in your description indicating user should verify max_connections parameters. Something similar to pg_dump: pg_dump will open njobs + 1 connections to the database, so make sure your max_connections setting is high enough to accommodate all connections.

2. +So at one time as many tables will be vacuumed parallely as number of jobs. Can you briefly mention the case when number of jobs is more than number of tables?

3.
+ /* When user is giving the table list, and list is smaller then
+  * number of tables
+  */
+ if (tbl_count && (parallel > tbl_count))
+     parallel = tbl_count;
+
Again here multiline comments are wrong. Some other instances are as below:
a. /* This will give the free connection slot, if no slot is free it will * wait for atleast one slot to get free. */
b. /* if table list is not provided then we need to do vaccum for whole DB * get the list of all tables and prpare the list */
c. /* Some of the slot are free, Process the results for slots whichever * are free */

4. src/bin/scripts/vacuumdb.c:51: indent with spaces. + bool analyze_only, bool freeze, PQExpBuffer sql); src/bin/scripts/vacuumdb.c:116: indent with spaces. +int parallel = 0; src/bin/scripts/vacuumdb.c:198: indent with spaces. 
+optind++; src/bin/scripts/vacuumdb.c:299: space before tab in indent. + vacuum_one_database(dbname, full, verbose, and_analyze, There are a lot of redundant whitespaces; check with git diff --check and fix them.

5. res = executeQuery(conn, select relname, nspname from pg_class c, pg_namespace ns where (relkind = \'r\' or relkind = \'m\') and c.relnamespace = ns.oid order by relpages desc, progname, echo);
a. Here you need to use SQL keywords in capital letters; refer to one of the other callers of executeQuery() in vacuumdb.c.
b. Why do you need this condition c.relnamespace = ns.oid in the above query? I think to get the list of required objects from pg_class, you don't need a join with pg_namespace.

6. vacuum_parallel() { .. if (!tables || !tables->head) { .. tbl_count++; } .. }
a. Why do you need a separate variable (tbl_count) to verify the number of asynchronous/parallel connections you want; why can't we use ntuple?
b. There is a warning in the code (I have compiled it on Windows) related to this variable: vacuumdb.c(695): warning C4700: uninitialized local variable 'tbl_count' used

7. Fix for one of my previous comments is as below: GetQueryResult() { .. if (!r && !completedb) .. } Here I think some generic errors like connection broken or others will also get ignored. Is it possible to ignore only the particular error we want to ignore, without complicating the code? Also, in any case, add comments to explain why you are ignoring the error for a particular case.

8. + fprintf(stderr, _(%s: Number of parallel \jobs\ should be at least 1\n), + progname); Formatting of the 2nd line (progname) is not as per standard (you can refer to other fprintf calls in the same file).

9. + int parallel = 0; I think it is better to name it numAsyncCons or something similar.

10. It is better if you can add function headers for newly added functions. 
Test2: (as per your scenario, where actual vacuum time is very less) Vacuum done for complete DB 8 tables all with 1 records and few dead tuples I think this test is missing in attached file. Few means? Can you try with 0.1%, 1% of dead tuples in table and try to take time in milliseconds if it is taking less time to complete the test. PS - It is better if you mention against each review comment/suggestion what you have done, because in some cases it will help reviewer to understand your fix easily and as author you will also be sure that all got fixed. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
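The scheduling policy under review — open up to njobs connections, never more than there are tables, and vacuum the largest tables first (hence the ORDER BY relpages DESC in the query quoted above) — can be sketched with a thread pool. The actual patch uses asynchronous libpq connections rather than threads; this hedged Python model, with invented names, shows only the dispatch policy and error propagation (point 7: failures should not be silently ignored):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_vacuum(tables, jobs, vacuum_one):
    """Process tables over up to `jobs` concurrent workers.

    tables: list of (name, relpages).  Tables are dispatched largest-first
    so a big table starts early instead of serializing at the end, and the
    worker count is capped by the number of tables, as the review suggests.
    """
    jobs = max(1, min(jobs, len(tables)))
    ordered = sorted(tables, key=lambda t: t[1], reverse=True)
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = [pool.submit(vacuum_one, name) for name, _ in ordered]
        for f in futures:
            f.result()   # re-raise worker errors instead of swallowing them
```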
Re: [HACKERS] Support for N synchronous standby servers
On Mon, Aug 11, 2014 at 1:26 PM, Fujii Masao masao.fu...@gmail.com wrote: Thanks for updating the patch! Again I tested the feature and found something wrong. I set synchronous_standby_num to 2 and started three standbys. Two of them are included in synchronous_standby_names, i.e., they are synchronous standbys. That is, the other one standby is always asynchronous. When I shutdown one of synchronous standbys and executed the write transaction, the transaction was successfully completed. So the transaction doesn't wait for two sync standbys in that case. Probably this is a bug. Well, that's working in my case :) Please see below with 4 nodes: 1 master and 3 standbys on same host. Master listens to 5432, other nodes to 5433, 5434 and 5435. Each standby's application_name is node_$PORT

=# show synchronous_standby_names ;
 synchronous_standby_names
---------------------------
 node_5433,node_5434
(1 row)

=# show synchronous_standby_num ;
 synchronous_standby_num
-------------------------
 2
(1 row)

=# SELECT application_name,
     pg_xlog_location_diff(sent_location, flush_location) AS replay_delta,
     sync_priority, sync_state
   FROM pg_stat_replication ORDER BY replay_delta ASC, application_name;
 application_name | replay_delta | sync_priority | sync_state
------------------+--------------+---------------+------------
 node_5433        |            0 |             1 | sync
 node_5434        |            0 |             2 | sync
 node_5435        |            0 |             0 | async
(3 rows)

=# create table aa (a int);
CREATE TABLE
[...]
-- Stopped node with port 5433:
[...]
=# SELECT application_name,
     pg_xlog_location_diff(sent_location, flush_location) AS replay_delta,
     sync_priority, sync_state
   FROM pg_stat_replication ORDER BY replay_delta ASC, application_name;
 application_name | replay_delta | sync_priority | sync_state
------------------+--------------+---------------+------------
 node_5434        |            0 |             2 | sync
 node_5435        |            0 |             0 | async
(2 rows)

=# create table ab (a int);
^CCancel request sent
WARNING:  01000: canceling wait for synchronous replication due to user request
DETAIL:  The transaction has already committed locally, but might not have been replicated to the standby(s).
LOCATION:  SyncRepWaitForLSN, syncrep.c:227
CREATE TABLE

Regards, -- Michael