[HACKERS] Re: proposal: ignore null fields in not relation type composite type based constructors
Hi Stephen,

Can I help with something? Is there some open question?

Regards,
Pavel

2014-09-08 6:27 GMT+02:00 Stephen Frost sfr...@snowman.net:
> * Pavel Stehule (pavel.steh...@gmail.com) wrote:
> > ignore_nulls in array_to_json_pretty probably is not necessary. On the
> > other hand, the cost is zero, and we can have it for API consistency.
>
> I'm willing to be persuaded either way on this, really. I do think it
> would be nice for both array_to_json and row_to_json to be single
> functions which take default values, but as for whether array_to_json
> should have an ignore_nulls option, I'm on the fence and would defer to
> people who use that function regularly (I don't today). Beyond that,
> I'm pretty happy moving forward with this patch.
>
> Thanks,
> Stephen
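The ignore_nulls behavior under discussion can be modeled with a short sketch. This is an illustrative Python model of the proposed semantics, not the server's C implementation; the function name and keyword argument merely mirror the proposal:

```python
import json

def row_to_json(row, ignore_nulls=False):
    # Hypothetical Python model of the proposed behavior: with
    # ignore_nulls=True, columns whose value is NULL (None here)
    # are omitted from the resulting JSON object entirely.
    if ignore_nulls:
        row = {k: v for k, v in row.items() if v is not None}
    return json.dumps(row)

print(row_to_json({"a": 1, "b": None}))                     # {"a": 1, "b": null}
print(row_to_json({"a": 1, "b": None}, ignore_nulls=True))  # {"a": 1}
```

With a `DEFAULT false` on such a parameter, a single function could serve both behaviors, which is the "single functions which take default values" point above.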
Re: [HACKERS] add modulo (%) operator to pgbench
Hello Robert,

> I am not objecting to the functionality; I'm objecting to bolting on
> ad-hoc operators one at a time. I think an expression syntax would let
> us do this in a much more scalable way. If I had time, I'd go do that,
> but I don't. We could add abs(x) and hash(x) and it would all be grand.

Ok. I do agree that an expression syntax would be great! However, that would not diminish nor change much the amount and kind of code necessary to add an operator or a function: the parser would have to be updated, as would the expression structure and the executor. Currently the pgbench parser and expressions are very limited, but they are also very generic, so that nothing is needed to add a new operator there; only the execution code needs to be updated, and the update would be basically the same ("if this is this function or operator, actually do it...") as if the execution part of a real expression syntax had to be updated.

So although I agree that a real expression syntax would be nice to have, I do not think that it is a valid objection to block this patch.

Moreover, upgrading pgbench to handle an actual expression syntax is not so trivial, because for instance some parts of the text need to be parsed (\set syntax) while others need not be parsed (SQL commands), so some juggling would be needed in the lexer, or maybe only call the parser on some lines and not others... Everything is possible, but this one would require some careful thinking.

--
Fabien.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
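The table-driven style Fabien alludes to can be sketched quickly. This toy Python evaluator is purely illustrative (pgbench's real implementation is C); it shows why adding `%`, `abs(x)`, or `hash(x)` to a generic expression evaluator touches only the operator and function tables, not the overall structure:

```python
import operator

# Operator and function tables: adding %, abs(x), or hash(x) later only
# means adding an entry here; the evaluator itself is unchanged.
OPS = {
    '+': operator.add, '-': operator.sub,
    '*': operator.mul, '/': operator.floordiv,
    '%': operator.mod,
}
FUNCS = {'abs': abs}

def eval_expr(node):
    # node is a nested tuple: ('const', v), ('op', name, left, right),
    # or ('func', name, arg) -- a stand-in for a parsed expression tree.
    kind = node[0]
    if kind == 'const':
        return node[1]
    if kind == 'op':
        return OPS[node[1]](eval_expr(node[2]), eval_expr(node[3]))
    if kind == 'func':
        return FUNCS[node[1]](eval_expr(node[2]))
    raise ValueError(f"unknown node kind: {kind}")

# (1 + 7) % 3
print(eval_expr(('op', '%', ('op', '+', ('const', 1), ('const', 7)), ('const', 3))))  # 2
```

The contested part in pgbench is not this evaluator but the parsing: deciding which lines (\set) go through the expression parser and which (SQL) do not.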
Re: [HACKERS] RLS Design
On Wed, September 10, 2014 23:50, Stephen Frost wrote:
> [rls_9-10-2014.patch]

I can't get this to apply; I attach the complaints of patch.

Erik Rijkers

-- git clone git://git.postgresql.org/git/postgresql.git master
Cloning into 'master'...
-- git branch
* master

patching file doc/src/sgml/catalogs.sgml
patching file doc/src/sgml/config.sgml
Hunk #1 succeeded at 5411 (offset 114 lines).
patching file doc/src/sgml/keywords.sgml
patching file doc/src/sgml/ref/allfiles.sgml
patching file doc/src/sgml/ref/alter_policy.sgml
patching file doc/src/sgml/ref/alter_role.sgml
patching file doc/src/sgml/ref/create_policy.sgml
patching file doc/src/sgml/ref/create_role.sgml
patching file doc/src/sgml/ref/drop_policy.sgml
patching file doc/src/sgml/reference.sgml
patching file src/backend/catalog/Makefile
patching file src/backend/catalog/aclchk.c
patching file src/backend/catalog/dependency.c
patching file src/backend/catalog/heap.c
patching file src/backend/catalog/objectaddress.c
patching file src/backend/catalog/system_views.sql
patching file src/backend/commands/Makefile
patching file src/backend/commands/alter.c
patching file src/backend/commands/copy.c
patching file src/backend/commands/event_trigger.c
patching file src/backend/commands/policy.c
patching file src/backend/commands/tablecmds.c
patching file src/backend/commands/user.c
patching file src/backend/nodes/copyfuncs.c
patching file src/backend/nodes/equalfuncs.c
patching file src/backend/nodes/outfuncs.c
patching file src/backend/nodes/readfuncs.c
patching file src/backend/optimizer/plan/planner.c
patching file src/backend/optimizer/plan/setrefs.c
patching file src/backend/optimizer/prep/prepsecurity.c
patching file src/backend/parser/gram.y
patching file src/backend/parser/parse_agg.c
patching file src/backend/parser/parse_expr.c
patching file src/backend/rewrite/Makefile
patching file src/backend/rewrite/rewriteHandler.c
patching file src/backend/rewrite/rowsecurity.c
patching file src/backend/tcop/utility.c
Hunk #2 succeeded at 833 (offset -4 lines).
patching file src/backend/utils/adt/acl.c
patching file src/backend/utils/adt/ri_triggers.c
patching file src/backend/utils/cache/plancache.c
patching file src/backend/utils/cache/relcache.c
patching file src/backend/utils/misc/guc.c
patching file src/backend/utils/misc/postgresql.conf.sample
patching file src/bin/pg_dump/common.c
patching file src/bin/pg_dump/pg_backup.h
patching file src/bin/pg_dump/pg_backup_archiver.c
patching file src/bin/pg_dump/pg_dump.c
patching file src/bin/pg_dump/pg_dump.h
patching file src/bin/pg_dump/pg_dump_sort.c
patching file src/bin/pg_dump/pg_dumpall.c
patching file src/bin/pg_dump/pg_restore.c
patching file src/bin/psql/describe.c
patching file src/include/catalog/catversion.h
Hunk #1 FAILED at 53.
1 out of 1 hunk FAILED -- saving rejects to file src/include/catalog/catversion.h.rej
patching file src/include/catalog/dependency.h
patching file src/include/catalog/indexing.h
patching file src/include/catalog/pg_authid.h
patching file src/include/catalog/pg_class.h
patching file src/include/catalog/pg_rowsecurity.h
patching file src/include/commands/policy.h
patching file src/include/miscadmin.h
patching file src/include/nodes/nodes.h
patching file src/include/nodes/parsenodes.h
patching file src/include/nodes/plannodes.h
patching file src/include/nodes/relation.h
patching file src/include/optimizer/planmain.h
patching file src/include/parser/kwlist.h
patching file src/include/parser/parse_node.h
patching file src/include/rewrite/rowsecurity.h
patching file src/include/utils/acl.h
patching file src/include/utils/plancache.h
patching file src/include/utils/rel.h
patching file src/test/regress/expected/dependency.out
patching file src/test/regress/expected/privileges.out
patching file src/test/regress/expected/rowsecurity.out
patching file src/test/regress/expected/rules.out
patching file src/test/regress/expected/sanity_check.out
patching file src/test/regress/parallel_schedule
patching file src/test/regress/serial_schedule
patching file src/test/regress/sql/rowsecurity.sql
Re: [HACKERS] about half processes are blocked by btree, btree is bottleneck?
On 09/11/2014 06:08 AM, Xiaoyulei wrote:
> I use benchmarksql with more than 200 clients on pg 9.3.3. When the
> test is going on, I collect all the process stacks. I found about 100
> processes are blocked by btree insert. Another 100 are blocked by
> xloginsert. Does btree have bad performance in concurrency scenarios?

Well, there's always a bottleneck. I'd suggest trying 9.4; there were changes to make WAL insertion more scalable, as well as small tweaks to the spinlock implementation.

I'm not too familiar with BenchmarkSQL, but if 200 clients means 200 concurrent active connections to the database, that's a lot. You'll likely see better performance if you dial it down closer to the number of CPU cores in the server. Also make sure you use a large enough scale factor. Otherwise all transactions try to access a small set of rows, which naturally becomes a bottleneck.

- Heikki
Re: [HACKERS] Support for N synchronous standby servers
On Thu, Sep 11, 2014 at 9:10 AM, Michael Paquier michael.paqu...@gmail.com wrote:
> On Thu, Sep 11, 2014 at 5:21 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
> > On 08/28/2014 10:10 AM, Michael Paquier wrote:
> > > + #synchronous_standby_num = -1 # number of standbys servers using sync rep
> >
> > To be honest, that's a horrible name for the GUC. Back when
> > synchronous replication was implemented, we had looong discussions on
> > this feature. It was called "quorum commit" back then. I'd suggest
> > using the quorum term in this patch, too; that's a fairly well-known
> > term in distributed computing for this.
>
> I am open to any suggestions. Then what about the following parameter names?
> - synchronous_quorum_num
> - synchronous_standby_quorum
> - synchronous_standby_quorum_num
> - synchronous_quorum_commit
> or simply synchronous_standbys

> When synchronous replication was added, quorum was left out to keep
> things simple; the current feature set was the most we could all agree
> on to be useful. If you search the archives for "quorum commit" you'll
> see what I mean. There was a lot of confusion on what is possible and
> what is useful, but regarding this particular patch: people wanted to
> be able to describe more complicated scenarios. For example, imagine
> that you have a master and two standbys in the primary data center,
> and two more standbys in a different data center. It should be
> possible to specify that you must get acknowledgment from at least one
> standby in both data centers. Maybe you could hack that by giving the
> standbys in the same data center the same name, but it gets ugly, and
> it still won't scale to even more complex scenarios.

Won't this problem be handled if synchronous mode is supported in cascading replication? I am not sure about the feasibility of the same, but I think the basic problem mentioned above can be addressed, and maybe others as well.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
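The multi-datacenter requirement Heikki describes amounts to a per-group quorum check: a commit may only be acknowledged once every required group of standbys contains at least one acknowledging member. A minimal Python sketch of that predicate, with all names hypothetical:

```python
def quorum_satisfied(acks, required_groups):
    # acks: names of standbys that have acknowledged the commit;
    # required_groups: one set of standby names per data center.
    # The commit may proceed only if every group has at least one ack.
    return all(bool(group & acks) for group in required_groups)

dc1 = {"standby_a", "standby_b"}   # primary data center
dc2 = {"standby_c", "standby_d"}   # remote data center

print(quorum_satisfied({"standby_a", "standby_c"}, [dc1, dc2]))  # True
print(quorum_satisfied({"standby_a", "standby_b"}, [dc1, dc2]))  # False
```

A flat "N synchronous standbys" GUC cannot express this, which is the argument for a richer quorum specification rather than a single counter.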
Re: [HACKERS] pg_xlogdump --stats
At 2014-08-21 10:06:39 +0300, hlinnakan...@vmware.com wrote:
> Committed the patch to add INT64_MODIFIER, with minor fixes.

Thank you.

> The new rm_identify method needs to be documented. […] Perhaps add
> comments to the RmgrData struct, explaining all of the methods.

OK, I'll submit a patch to add these comments.

> I think the names that rm_identify returns should match those that the
> rm_desc functions print.

I had originally started off trying to keep the output in sync, but it doesn't work very well. There are rm_desc functions that print things like "truncate before" and "Create posting tree", and many decisions are quite arbitrary (freeze_page, cleanup info, multi-insert). I think it's better the (grep-friendly) way it is.

If anything, perhaps rm_desc should output "${rm_identify}[: optional explanation]". That would also make it very clear what should go in rm_identify and what should go in rm_desc. Thoughts?

The corresponding rm_identify output is: HOT_UPDATE+INIT

The +INIT is admittedly a special case, and I would have no objection to writing that as (INIT) or something instead.

-- Abhijit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
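The `${rm_identify}[: optional explanation]` convention proposed above is easy to model: the identify part is the stable, grep-friendly record name, and the desc part is an optional detail suffix. A tiny Python sketch; the record names and detail strings are illustrative only:

```python
def format_record(identify, desc):
    # identify: the stable, grep-friendly name from rm_identify();
    # desc: the optional detail part from rm_desc(). Joining them here
    # (instead of inside each rm_desc) keeps the two roles separate.
    return f"{identify}: {desc}" if desc else identify

print(format_record("HOT_UPDATE+INIT", "off 3 xmax 1234"))  # HOT_UPDATE+INIT: off 3 xmax 1234
print(format_record("CHECKPOINT_SHUTDOWN", ""))             # CHECKPOINT_SHUTDOWN
```

The payoff is that tools like pg_xlogdump --stats can group on the identify part alone while the human-readable dump still carries the details.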
Re: [HACKERS] pg_xlogdump --stats
On 09/11/2014 11:43 AM, Abhijit Menon-Sen wrote:
> At 2014-08-21 10:06:39 +0300, hlinnakan...@vmware.com wrote:
> > I think the names that rm_identify returns should match those that
> > the rm_desc functions print.
>
> I had originally started off trying to keep the output in sync, but it
> doesn't work very well. There are rm_desc functions that print things
> like "truncate before" and "Create posting tree", and many decisions
> are quite arbitrary (freeze_page, cleanup info, multi-insert).

It would be good to clean up those messages to be more sensible and consistent.

> I think it's better the (grep-friendly) way it is. If anything, perhaps
> rm_desc should output "${rm_identify}[: optional explanation]". That
> would also make it very clear what should go in rm_identify and what
> should go in rm_desc.

Yeah, that makes sense too.

> The corresponding rm_identify output is: HOT_UPDATE+INIT
>
> The +INIT is admittedly a special case, and I would have no objection
> to writing that as (INIT) or something instead.

I don't care much, as long as it's consistent with the rm_desc output.

It's quite arbitrary that the INIT cases are considered different record types, with their own identity string, instead of being just a flag in the same record type. It's an implementation detail that that flag is stored in the xl_info, and that implementation detail is now exposed in the pg_xlogdump --stats output. There are similar cases in B-tree splits, for example, where a split where the new tuple goes to the left half is considered a different record type from one where it goes to the right half. It might be interesting to count them separately in the stats, but then again it might just be confusing.

The xl_info flags and the rm_desc output were never intended to be a useful categorization for statistics purposes, so it's not surprising that they're not too well suited for that. Nevertheless, pg_xlogdump --stats is useful as it is, if you know the limitations.

- Heikki
Re: [HACKERS] inherit support for foreign tables
(2014/09/11 4:32), Heikki Linnakangas wrote:
> I had a cursory look at this patch and the discussions around this.

Thank you!

> ISTM there are actually two new features in this: 1) allow CHECK
> constraints on foreign tables, and 2) support inheritance for foreign
> tables. How about splitting it into two?

That's right. There are the two in this patch. I'm not sure I should split the patch into the two, because 1) won't make sense without 2). As described in the following note added to the docs on CREATE FOREIGN TABLE, CHECK constraints on foreign tables are intended to support constraint exclusion for partitioned foreign tables:

+ Constraints on foreign tables are not enforced on insert or update.
+ Those definitions simply declare the constraints hold for all rows
+ in the foreign tables. It is the user's responsibility to ensure
+ that those definitions match the remote side. Such constraints are
+ used for some kind of query optimization such as constraint exclusion
+ for partitioned tables.

Thanks,

Best regards,
Etsuro Fujita
Re: [HACKERS] pg_xlogdump --stats
On 2014-09-11 12:14:42 +0300, Heikki Linnakangas wrote:
> On 09/11/2014 11:43 AM, Abhijit Menon-Sen wrote:
> > I had originally started off trying to keep the output in sync, but
> > it doesn't work very well. There are rm_desc functions that print
> > things like "truncate before" and "Create posting tree", and many
> > decisions are quite arbitrary (freeze_page, cleanup info,
> > multi-insert).
>
> It would be good to clean up those messages to be more sensible and
> consistent.
>
> > I think it's better the (grep-friendly) way it is. If anything,
> > perhaps rm_desc should output "${rm_identify}[: optional
> > explanation]". That would also make it very clear what should go in
> > rm_identify and what should go in rm_desc.
>
> Yeah, that makes sense too.

I'm not sure I agree here. From a theoretical POV sure, but wouldn't that lead to even longer lines for xlogdump and other error messages for some records? We probably need to see how it plays out.

> > The corresponding rm_identify output is: HOT_UPDATE+INIT
> >
> > The +INIT is admittedly a special case, and I would have no objection
> > to writing that as (INIT) or something instead.
>
> I don't care much, as long as it's consistent with the rm_desc output.
>
> It's quite arbitrary that the INIT cases are considered different
> record types, with their own identity string, instead of being just a
> flag in the same record type. It's an implementation detail that that
> flag is stored in the xl_info, and that implementation detail is now
> exposed in the pg_xlogdump --stats output.

It's also actually quite good for display purposes. +INIT will have quite different WAL logging characteristics :)

> There are similar cases in B-tree splits, for example, where a split
> where the new tuple goes to the left half is considered a different
> record type from one where it goes to the right half. It might be
> interesting to count them separately in the stats, but then again it
> might just be confusing.
>
> The xl_info flags and the rm_desc output were never intended to be a
> useful categorization for statistics purposes, so it's not surprising
> that they're not too well suited for that. Nevertheless, pg_xlogdump
> --stats is useful as it is, if you know the limitations.

I agree it's not perfect, but I think it's a huge step forward. We very well might want to improve upon the stats granularity once it's in, but I think it already gives a fair amount of direction on where to look.

Greetings,

Andres Freund

--
Andres Freund                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] pg_xlogdump --stats
On 09/11/2014 12:22 PM, Andres Freund wrote:
> > > I think it's better the (grep-friendly) way it is. If anything,
> > > perhaps rm_desc should output "${rm_identify}[: optional
> > > explanation]". That would also make it very clear what should go
> > > in rm_identify and what should go in rm_desc.
> >
> > Yeah, that makes sense too.
>
> I'm not sure I agree here. From a theoretical POV sure, but wouldn't
> that lead to even longer lines for xlogdump and other error messages
> for some records?

No. All the rm_desc functions already follow that convention, and print "foo: details", where "foo" identifies the record type based on xl_info. This proposal would just make that convention more official, and stipulate that the "foo" part should match what the new rm_identify() function returns. (Or perhaps even better, define it so that rm_desc prints only the "details" part and rm_identify() the "foo" part, and have the callers put the two strings together if they want.)

> > The xl_info flags and the rm_desc output were never intended to be a
> > useful categorization for statistics purposes, so it's not
> > surprising that they're not too well suited for that. Nevertheless,
> > pg_xlogdump --stats is useful as it is, if you know the limitations.
>
> I agree it's not perfect, but I think it's a huge step forward. We very
> well might want to improve upon the stats granularity once it's in,
> but I think it already gives a fair amount of direction on where to
> look.

Agreed. That's what I was also trying to say.

- Heikki
[HACKERS] Fix MSVC isnan warning from e80252d4
The attached small patch fixes the following warning:

src\backend\utils\adt\arrayfuncs.c(5613): warning C4013: '_isnan' undefined; assuming extern returning int [D:\Postgres\a\postgres.vcxproj]

The fix is pretty much just a copy and paste from costsize.c.

Regards

David Rowley

isnan_msvc_fix.patch
Description: Binary data
Re: [HACKERS] pgbench throttling latency limit
On 09/10/2014 05:47 PM, Mitsumasa KONDO wrote:
> Hi, I find a typo in your patch. Please confirm.
>
> @line 239
> - agg->sum2_lag = 0;
> + agg->sum_lag = 0;

Ah thanks, good catch!

> And back patch is welcome for me.

I've committed and backpatched this, as well as a patch to refactor the way the Poisson delay is computed. I kept the log file format unchanged when --rate is not used, so it now has a different number of fields depending on whether --rate is used or not.

Please review the changes I made one more time, to double-check that I didn't mess anything up.

- Heikki
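For reference, throttling to a target rate with Poisson-distributed arrivals means drawing exponentially distributed inter-transaction delays with mean 1/rate. This Python sketch models that computation via inverse-transform sampling; it illustrates the idea only and is not the pgbench C code:

```python
import math
import random

def poisson_delay(rate_per_sec, rng=random.random):
    # Inter-arrival times of a Poisson process are exponentially
    # distributed with mean 1/rate; inverse-transform sampling turns a
    # uniform u in [0, 1) into such a delay. Using 1.0 - u keeps the
    # argument of log() strictly positive.
    u = rng()
    return -math.log(1.0 - u) / rate_per_sec  # delay in seconds

random.seed(42)
delays = [poisson_delay(100.0) for _ in range(10000)]
print(sum(delays) / len(delays))  # mean delay, close to 1/100 s
```

Scheduling each transaction at the sum of such delays (rather than sleeping after each one) is what lets lateness relative to the schedule be measured, which matters for the latency-limit work in the other subthread.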
Re: [HACKERS] pgbench throttling latency limit
On 08/30/2014 07:16 PM, Fabien COELHO wrote:
> > + if (latency_limit)
> > + printf("number of transactions above the %.1f ms latency limit: " INT64_FORMAT "\n",
> > + latency_limit / 1000.0, latency_late);
> >
> > Any reason not to report a percentage here?
>
> Yes: I did not think of it. Here is a v7, with a percent. I also added
> a paragraph in the documentation about how the latency is computed
> under throttling, and I tried to reorder the reported stuff so that it
> is more logical.

Now that I've finished the detour and committed and backpatched the changes to the way latency is calculated, we can get back to this patch. It needs to be rebased.

How should skipped transactions be taken into account in the log file output, with and without aggregation? I assume we'll want to have some trace of skipped transactions in the logs.

- Heikki
Re: [HACKERS] inherit support for foreign tables
On 09/11/2014 12:22 PM, Etsuro Fujita wrote:
> That's right. There are the two in this patch. I'm not sure I should
> split the patch into the two, because 1) won't make sense without 2).
> As described in the following note added to the docs on CREATE FOREIGN
> TABLE, CHECK constraints on foreign tables are intended to support
> constraint exclusion for partitioned foreign tables.

The planner can do constraint exclusion based on CHECK constraints even without inheritance. It's not enabled by default because it can increase planning time, but if you set constraint_exclusion=on, it will work. For example:

postgres=# create table foo (i int4 CHECK (i > 0));
CREATE TABLE
postgres=# explain select * from foo WHERE i < 0;
                      QUERY PLAN
------------------------------------------------------
 Seq Scan on foo  (cost=0.00..40.00 rows=800 width=4)
   Filter: (i < 0)
 Planning time: 0.335 ms
(3 rows)

postgres=# show constraint_exclusion;
 constraint_exclusion
----------------------
 partition
(1 row)

postgres=# set constraint_exclusion = 'on';
SET
postgres=# explain select * from foo WHERE i < 0;
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0)
   One-Time Filter: false
 Planning time: 0.254 ms
(3 rows)

postgres=#

- Heikki
Re: [HACKERS] Fix MSVC isnan warning from e80252d4
On 09/11/2014 12:52 PM, David Rowley wrote:
> The attached small patch fixes the following warning:
>
> src\backend\utils\adt\arrayfuncs.c(5613): warning C4013: '_isnan' undefined; assuming extern returning int [D:\Postgres\a\postgres.vcxproj]
>
> The fix is pretty much just a copy and paste from costsize.c

Thanks, committed.

- Heikki
Re: [HACKERS] Scaling shared buffer eviction
Hi,

On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:
> include $(top_srcdir)/src/backend/common.mk
> diff --git a/src/backend/postmaster/bgreclaimer.c b/src/backend/postmaster/bgreclaimer.c
> new file mode 100644
> index 000..3df2337
> --- /dev/null
> +++ b/src/backend/postmaster/bgreclaimer.c

A fair number of comments in that file refer to bgwriter...

> + * The background reclaimer (bgreclaimer) is new as of Postgres 9.5.  It
> + * attempts to keep regular backends from having to run clock sweep (which
> + * they would only do when they don't find a usable shared buffer from
> + * freelist to read in another page).

That's not really accurate. Freelist pages are often also needed to write new pages, without reading anything in. I'd phrase it as "which they only need to do if they don't find a victim buffer from the freelist".

> + * In the best scenario all requests
> + * for shared buffers will be fulfilled from freelist as the background
> + * reclaimer process always tries to maintain buffers on freelist.  However,
> + * regular backends are still empowered to run clock sweep to find a usable
> + * buffer if the bgreclaimer fails to maintain enough buffers on freelist.

"empowered" sounds strange to me. "can still run the clock sweep"?

> + * The bgwriter is started by the postmaster as soon as the startup subprocess
> + * finishes, or as soon as recovery begins if we are doing archive recovery.

Why only archive recovery? I guess (only read this far...) it's not just during InArchiveRecovery but also StandbyMode? But I don't see why we shouldn't use it during normal crash recovery. That's also often painfully slow and the reclaimer could help? Less, but still.

> +static void bgreclaim_quickdie(SIGNAL_ARGS);
> +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
> +static void ReqShutdownHandler(SIGNAL_ARGS);
> +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);

This looks inconsistent.

> + /*
> +  * If an exception is encountered, processing resumes here.
> +  *
> +  * See notes in postgres.c about the design of this coding.
> +  */
> + if (sigsetjmp(local_sigjmp_buf, 1) != 0)
> + {
> +     /* Since not using PG_TRY, must reset error stack by hand */
> +     error_context_stack = NULL;
> +
> +     /* Prevent interrupts while cleaning up */
> +     HOLD_INTERRUPTS();
> +
> +     /* Report the error to the server log */
> +     EmitErrorReport();
> +
> +     /*
> +      * These operations are really just a minimal subset of
> +      * AbortTransaction().  We don't have very many resources to worry
> +      * about in bgreclaim, but we do have buffers and file descriptors.
> +      */
> +     UnlockBuffers();
> +     AtEOXact_Buffers(false);
> +     AtEOXact_Files();

No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a good idea, regardless of it possibly being true today (which I'm not sure about yet).

> +     /*
> +      * Now return to normal top-level context and clear ErrorContext for
> +      * next time.
> +      */
> +     MemoryContextSwitchTo(bgreclaim_context);
> +     FlushErrorState();
> +
> +     /* Flush any leaked data in the top-level context */
> +     MemoryContextResetAndDeleteChildren(bgreclaim_context);
> +
> +     /* Now we can allow interrupts again */
> +     RESUME_INTERRUPTS();

Other processes sleep for a second here; I think that's a good idea. E.g. that bit:

	/*
	 * Sleep at least 1 second after any error.  A write error is likely
	 * to be repeated, and we don't want to be filling the error logs as
	 * fast as we can.
	 */
	pg_usleep(1000000L);

> + /*
> +  * Loop forever
> +  */
> + for (;;)
> + {
> +     int rc;
> +
> +     /*
> +      * Backend will signal bgreclaimer when the number of buffers in
> +      * freelist falls below the low water mark of freelist.
> +      */
> +     rc = WaitLatch(&MyProc->procLatch,
> +                    WL_LATCH_SET | WL_POSTMASTER_DEATH,
> +                    -1);

That's probably not going to work well directly after a (re)start of bgreclaim (depending on how you handle the water mark, I'll see in a bit). Maybe it should rather be

	ResetLatch();
	BgMoveBuffersToFreelist();
	pgstat_send_bgwriter();
	rc = WaitLatch();
	if (rc & WL_POSTMASTER_DEATH)
		exit(1);

> +Background Reclaimer's Processing
> +---------------------------------
> +
> +The background reclaimer is designed to move buffers to freelist that are
> +likely to be recycled soon, thereby offloading the need to perform
> +clock sweep work from active
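The low/high water mark interaction between backends and the bgreclaimer can be modeled in a few lines. This is a toy single-threaded Python sketch; the class, the refill-to-high-mark policy, and all names are assumptions for illustration (the real patch uses latches and shared memory, and backends proceed without waiting for the refill):

```python
class Freelist:
    # Toy model: backends pop buffers and "wake" the reclaimer when the
    # freelist drops below the low water mark; the reclaimer refills it
    # up to the high water mark.
    def __init__(self, low, high):
        self.buffers = [object() for _ in range(high)]
        self.low, self.high = low, high
        self.reclaimer_wakeups = 0

    def get_buffer(self):          # what a backend does
        buf = self.buffers.pop()
        if len(self.buffers) < self.low:
            self.wake_reclaimer()  # stands in for SetLatch()
        return buf

    def wake_reclaimer(self):      # what the bgreclaimer does when woken
        self.reclaimer_wakeups += 1
        while len(self.buffers) < self.high:
            self.buffers.append(object())  # notionally, a clock-sweep victim

fl = Freelist(low=2, high=8)
for _ in range(20):
    fl.get_buffer()
print(fl.reclaimer_wakeups)  # 2 refills for 20 buffer requests
```

The point of the two marks is batching: one wakeup amortizes the clock-sweep work over many buffer requests, instead of signaling on every allocation.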
Re: [HACKERS] Patch to support SEMI and ANTI join removal
On Thu, Aug 28, 2014 at 6:23 AM, Tom Lane t...@sss.pgh.pa.us wrote: If the majority of the added code is code that will be needed for less-bogus optimizations, it might be all right; but I'd kind of want to see the less-bogus optimizations working first. That seems fair. Likely there'd be not a great deal of value to semi and anti joins removal alone. I was more just trying to spread weight of an inner join removal patch... So In response to this, I've gone off and written an inner join removal support patch (attached), which turned out to be a bit less complex than I thought. Here's a quick demo, of the patch at work: test=# create table c (id int primary key); CREATE TABLE test=# create table b (id int primary key, c_id int not null references c(id)); CREATE TABLE test=# create table a (id int primary key, b_id int not null references b(id)); CREATE TABLE test=# test=# explain select a.* from a inner join b on a.b_id = b.id inner join c on b.c_id = c.id; QUERY PLAN - Seq Scan on a (cost=0.00..31.40 rows=2140 width=8) Planning time: 1.061 ms (2 rows) Perhaps not a greatly useful example, but if you can imagine the joins are hidden in a view and the user is just requesting a small subset of columns, then it does seem quite powerful. There's currently a few things with the patch that I'll list below, which may raise a few questions: 1. I don't think that I'm currently handling removing eclass members properly. So far the code just removes the Vars that belong to the relation being removed. I likely should also be doing bms_del_member(ec-ec_relids, relid); on the eclass, but perhaps I should just be marking the whole class as ec_broken = true and adding another eclass all everything that the broken one has minus the parts from the removed relation? 2. Currently the inner join removal is dis-allowed if the (would be) removal relation has *any* baserestrictinfo items. 
The reason for this is that we must ensure that the inner join gives us exactly 1 row match on the join condition, but a baserestrictinfo can void the proof that a foreign key would give us that a matching row does exist. However there is an exception to this that could allow that restriction to be relaxed. That is if the qual in baserestrictinfo use vars that are in an eclass, where the same eclass also has ec members vars that belong to the rel that we're using the foreign key for to prove the relation not needed umm.. that's probably better described by example: Assume there's a foreign key a (x) reference b(x) SELECT a.* FROM a INNER JOIN b ON a.x = b.x WHERE b.x = 1 relation b should be removable because an eclass will contain {a.x, b.x} and therefore s baserestrictinfo for a.x = 1 should also exist on relation a. Therefore removing relation b should produce equivalent results, i.e everything that gets filtered out on relation b will also be filtered out on relation a anyway. I think the patch without this is still worth it, but if someone feels strongly about it I'll take a bash at supporting it. 3. Currently the inner join support does not allow removals using foreign keys which contain duplicate columns on the referencing side. e.g (a,a) references (x,y), this is basically because of the point I made in item 2. In this case a baserestrictinfo would exist on the referenced relation to say WHERE x = y. I'd have to remove the restriction described in item 2 and do a small change to the code that extracts the join condition from the eclass for this to work. But it's likely a corner case that's not worth too much trouble to support. I think probably if I saw an FK like that in the field, I'd probably scratch my head for a while, while trying to understanding why they bothered. 4. The patch currently only allows removals for eclass join types. If the rel has any joininfo items, then the join removal is disallowed. 
From what I can see, equality-type inner join conditions get described in eclasses, and only non-equality join conditions make it into the joininfo list; since foreign keys only support equality operators, I thought this was a valid restriction. However, if someone can show me a flaw in my assumption, then I may need to improve this. 5. I've added a flag to pg_class called relhasfkey. Currently this gets set to true when a foreign key is added, though I've added nothing to set it back to false again. I notice that relhasindex gets set back to false during vacuum, if vacuum happens to find there not to be any indexes on the rel. I didn't put my logic here as I wasn't too sure if scanning pg_constraint during a vacuum seemed very correct, so I just left out the set-it-back-to-false logic, based on the fact that I noticed that relhaspkey gets away with quite lazy setting-back-to-false logic (only when there are no indexes of any kind left on the relation at all). The only thing else I can think of is perhaps optimising a little.
Re: [HACKERS] Support for N synchronous standby servers
On Thu, Aug 28, 2014 at 12:40 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Wed, Aug 27, 2014 at 2:46 PM, Rajeev rastogi rajeev.rast...@huawei.com wrote: I have done some more review, below are my comments: Thanks! 1. There are currently two loops on *num_sync. Can we simplify the function SyncRepGetSynchronousNodes by moving the priority calculation inside the upper loop?

if (*num_sync == allowed_sync_nodes)
{
    for (j = 0; j < *num_sync; j++)
    {

Anyway we require priority only if the *num_sync == allowed_sync_nodes condition matches. So in this loop itself, we can calculate the priority as well as the assignment of new standbys with lower priority. Let me know if you see any issue with this. OK, I see, yes this can minimize processing a bit, so I refactored the code by integrating the second loop into the first. This has needed the removal of the break portion, as we need to find the highest priority value among the nodes already determined as synchronous. 2. Comment inside the function SyncRepReleaseWaiters:

/*
 * We should have found ourselves at least, except if it is not expected
 * to find any synchronous nodes.
 */
Assert(num_sync > 0);

I think "except if it is not expected to find any synchronous nodes" is not required, as if it has come till this point it means at least this node is synchronous. Yes, removed. 3. The documentation says that s_s_num should be less than max_wal_senders, but code-wise there is no protection for the same. IMHO, s_s_num should be less than or equal to max_wal_senders, otherwise COMMIT will never return back to the console without the user's knowledge. I see that some discussion has happened regarding this, but I think just adding documentation for this is not enough. I am not sure what issue is observed in adding a check during GUC initialization, but if there is an unavoidable issue during GUC initialization, then can't we try to add the check at later points?
The trick here is that you cannot really return a warning at GUC loading level to the user, as a warning could be easily triggered if for example s_s_num is present before max_wal_senders in postgresql.conf. I am open to any solutions if there are any (like an error when initializing WAL senders?!). Documentation seems enough for me to warn the user. How about making it a PGC_POSTMASTER parameter and then having a check similar to the below in PostmasterMain()?

/*
 * Check for invalid combinations of GUC settings.
 */
if (ReservedBackends >= MaxConnections)
{
    write_stderr("%s: superuser_reserved_connections must be less than max_connections\n", progname);
    ExitPostmaster(1);
}
if (max_wal_senders >= MaxConnections)
{
    write_stderr("%s: max_wal_senders must be less than max_connections\n", progname);
    ExitPostmaster(1);
}
if (XLogArchiveMode && wal_level == WAL_LEVEL_MINIMAL)
    ereport(ERROR,
            (errmsg("WAL archival (archive_mode=on) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
    ereport(ERROR,
            (errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] RLS Design
Erik, * Erik Rijkers (e...@xs4all.nl) wrote: On Wed, September 10, 2014 23:50, Stephen Frost wrote: [rls_9-10-2014.patch] I can't get this to apply; I attach the complaints of patch. Thanks for taking a look at this! [...] patching file src/include/catalog/catversion.h Hunk #1 FAILED at 53. 1 out of 1 hunk FAILED -- saving rejects to file src/include/catalog/catversion.h.rej That's just the catversion bump- you can simply ignore it and everything should be fine. Look forward to hearing how it works for you! Thanks again, Stephen signature.asc Description: Digital signature
Re: [HACKERS] inherit support for foreign tables
(2014/09/11 19:46), Heikki Linnakangas wrote: On 09/11/2014 12:22 PM, Etsuro Fujita wrote: (2014/09/11 4:32), Heikki Linnakangas wrote: I had a cursory look at this patch and the discussions around this. Thank you! ISTM there are actually two new features in this: 1) allow CHECK constraints on foreign tables, and 2) support inheritance for foreign tables. How about splitting it into two? That's right. There are the two in this patch. I'm not sure if I should split the patch into the two, because 1) won't make sense without 2). As described in the following note added to the docs on CREATE FOREIGN TABLE, CHECK constraints on foreign tables are intended to support constraint exclusion for partitioned foreign tables:

+ Constraints on foreign tables are not enforced on insert or update.
+ Those definitions simply declare the constraints hold for all rows
+ in the foreign tables. It is the user's responsibility to ensure
+ that those definitions match the remote side. Such constraints are
+ used for some kind of query optimization such as constraint exclusion
+ for partitioned tables

The planner can do constraint exclusion based on CHECK constraints even without inheritance. It's not enabled by default because it can increase planning time, but if you set constraint_exclusion=on, it will work. Exactly! For example:

postgres=# create table foo (i int4 CHECK (i > 0));
CREATE TABLE
postgres=# explain select * from foo WHERE i < 0;
                      QUERY PLAN
------------------------------------------------------
 Seq Scan on foo  (cost=0.00..40.00 rows=800 width=4)
   Filter: (i < 0)
 Planning time: 0.335 ms
(3 rows)

postgres=# show constraint_exclusion;
 constraint_exclusion
----------------------
 partition
(1 row)

postgres=# set constraint_exclusion = 'on';
SET
postgres=# explain select * from foo WHERE i < 0;
                 QUERY PLAN
--------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0)
   One-Time Filter: false
 Planning time: 0.254 ms
(3 rows)

postgres=#

Actually, this patch allows the exact same thing to apply to foreign tables. My explanation was insufficient about that. Sorry for that.
So, should I split the patch into the two? Thanks, Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On Fri, Jul 25, 2014 at 11:41 PM, Robert Haas robertmh...@gmail.com wrote: Patch 4 adds infrastructure that allows one session to save all of its non-default GUC values and another session to reload those values. This was written by Amit Khandekar and Noah Misch. It allows pg_background to start up the background worker with the same GUC settings that the launching process is using. I intend this as a demonstration of how to synchronize any given piece of state between cooperating backends. For real parallelism, we'll need to synchronize snapshots, combo CIDs, transaction state, and so on, in addition to GUCs. But GUCs are ONE of the things that we'll need to synchronize in that context, and this patch shows the kind of API we're thinking about for these sorts of problems. Don't we need some way to prohibit the launching process from changing a GUC once it has shared the existing GUCs? Patch 6 is pg_background itself. I'm quite pleased with how easily this came together. The existing background worker, dsm, shm_toc, and shm_mq infrastructure handles most of the heavy lifting here - obviously with some exceptions addressed by the preceding patches. Again, this is the kind of set-up that I'm expecting will happen in a background worker used for actual parallelism - clearly, more state will need to be restored there than here, but nonetheless the general flow of the code here is about what I'm imagining, just with somewhat different kinds of state. Most of the work of writing this patch was actually figuring out how to execute the query itself; what I ended up with is mostly copied from exec_simple_query, but with some differences here and there. I'm not sure if it would be possible/advisable to try to refactor to reduce duplication. 1. This patch generates a warning on Windows:

1>pg_background.obj : error LNK2001: unresolved external symbol StatementTimeout

You need to add PGDLLIMPORT for StatementTimeout. 2.
CREATE FUNCTION pg_background_launch(sql pg_catalog.text, queue_size pg_catalog.int4 DEFAULT 65536) — here, shouldn't queue_size be pg_catalog.int8, as I could see some related functions in test_shm_mq use int8?

CREATE FUNCTION test_shm_mq(queue_size pg_catalog.int8,
CREATE FUNCTION test_shm_mq_pipelined(queue_size pg_catalog.int8,

Anyway, I think the corresponding C function doesn't use a matching macro to extract the function args:

pg_background_launch(PG_FUNCTION_ARGS)
{
    text   *sql = PG_GETARG_TEXT_PP(0);
    int32   queue_size = PG_GETARG_INT64(1);

Here it should use the _INT32 variant to match the current SQL definition, otherwise it leads to the below error:

postgres=# select pg_background_launch('vacuum verbose foo');
ERROR: queue size must be at least 64 bytes

3. Comparing execute_sql_string() and exec_simple_query(), I could see the below main differences: a. Allocates a new memory context different from the message context b. Start/commit control of the transaction is outside execute_sql_string c. enable/disable of statement timeout is done from outside in case of execute_sql_string() d. execute_sql_string() prohibits Start/Commit/Abort transaction statements. e. execute_sql_string() doesn't log statements f. execute_sql_string() uses binary format for results, whereas exec_simple_query() uses TEXT as the default format g. processed-rows stat related info is handled by the caller in case of execute_sql_string(). Can't we handle all these or other changes inside exec_simple_query() based on some parameter, or maybe use a hook similar to what we do in ProcessUtility? Basically it looks a bit odd to have a (mostly) duplicate copy of exec_simple_query(). 4. Won't it be better if pg_background_worker_main() can look more like PostgresMain() (in terms of handling different kinds of messages), so that it can be extended in future to handle parallel workers? 5. pg_background_result() { .. /* Read and processes messages from the shared memory queue.
*/ } Shouldn't the processing of messages be a separate function, as we do for pqParseInput3()? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] inherit support for foreign tables
On 09/11/2014 02:30 PM, Etsuro Fujita wrote: Actually, this patch allows the exact same thing to apply to foreign tables. My explanation was insufficient about that. Sorry for that. Great, that's what I thought. So, should I split the patch into the two? Yeah, please do. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [TODO] Process pg_hba.conf keywords as case-insensitive
On Wed, Sep 10, 2014 at 4:54 AM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: Finally I think that we need case-insensitive versions of get_role_id() and get_database_id() to accomplish this patch's objective. (This runs full scans on pg_database or pg_authid.) Any such thing is certainly grounds for rejecting the patch outright. It may be that pg_hba.conf should follow the same case-folding rules we use elsewhere, but it should not invent novel semantics, especially ones that make connecting to the database a far more expensive operation than it is today. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: proposal: ignore null fields in not relation type composite type based constructors
* Pavel Stehule (pavel.steh...@gmail.com) wrote: Can I help with something, it is there some open question? I had been hoping for a more definitive answer regarding this option for array_to_json, or even a comment about the change to row_to_json. Andrew- any thoughts on this? (that's what the ping on IRC is for :). Thanks, Stephen 2014-09-08 6:27 GMT+02:00 Stephen Frost sfr...@snowman.net: * Pavel Stehule (pavel.steh...@gmail.com) wrote: ignore_nulls in array_to_json_pretty probably is not necessary. On second hand, the cost is zero, and we can have it for API consistency. I'm willing to be persuaded either way on this, really. I do think it would be nice for both array_to_json and row_to_json to be single functions which take default values, but as for if array_to_json has a ignore_nulls option, I'm on the fence and would defer to people who use that function regularly (I don't today). Beyond that, I'm pretty happy moving forward with this patch. signature.asc Description: Digital signature
Re: [HACKERS] pgbench throttling latency limit
Hello Heikki, Now that I've finished the detour and committed and backpatched the changes to the way latency is calculated, we can get back to this patch. It needs to be rebased. Before rebasing, I think that there are a few small problems with the modification applied to switch from an integer range to a double. (1) ISTM that the + 0.5 which remains in the PoissonRand computation comes from the previous integer approach and is not needed here. If I'm not mistaken, the formula should be plain: -log(uniform) * center (2) I'm not sure about the name center; I think that lambda or mean would be more appropriate. (3) I wish that the maximum implied multiplier could be explicitly documented in the source code. From the pg_rand48 source code, I think that it is 33.27106466687737 -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
I agree that there's no reason to fix an algorithm to it, unless maybe it's pglz. There's some initial talk about implementing pluggable compression algorithms for TOAST, and I guess the same must be taken into consideration for the WAL. -- Arthur Silva On Thu, Sep 11, 2014 at 2:46 AM, Rahila Syed rahilasyed...@gmail.com wrote: I will repeat the above tests with high load on CPU and using the benchmark given by Fujii-san and post the results. Average % of CPU usage at user level for each of the compression algorithms is as follows:

Compression   Multiple   Single
Off           81.1338    81.1267
LZ4           81.0998    81.1695
Snappy        80.9741    80.9703
Pglz          81.2353    81.2753

http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user_single.png http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user.png The numbers show CPU utilization of Snappy is the least. The CPU utilization in decreasing order is: pglz > No compression > LZ4 > Snappy. The variance of the average CPU utilization numbers is very low. However, snappy seems to be best when it comes to lower utilization of CPU. As per the measurement results posted to date, LZ4 outperforms snappy and pglz in terms of compression ratio and performance. However, CPU utilization numbers show snappy utilizes the least amount of CPU. The difference is not much, though. As there has been no consensus yet about which compression algorithm to adopt, is it better to make this decision independent of the FPW compression patch, as suggested earlier in this thread? FPW compression can be done using the built-in compression pglz, as it shows a considerable performance benefit over uncompressed WAL and a good compression ratio. Also, the patch to compress multiple blocks at once gives better compression as compared to a single block. ISTM that the performance overhead introduced by multiple-block compression is slightly higher than single-block compression, which can be tested again after modifying the patch to use pglz.
Hence, this patch can be built using multiple blocks compression. Thoughts? -- View this message in context: http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5818552.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench throttling latency limit
(3) I wish that the maximum implied multiplier could be explicitly documented in the source code. From the pg_rand48 source code, I think that it is 33.27106466687737. Small possibly buggy code attached, to show how I computed the above figure. -- Fabien.

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* largest value pg_erand48() can return: three 16-bit words of 0xFFFF,
     * i.e. v = 1 - 2^-48, so -log(1 - v) = 48 * log(2) = 33.27106466687737 */
    double v = ldexp(0xFFFF, -48) + ldexp(0xFFFF, -32) + ldexp(0xFFFF, -16);
    double u = 1.0 - v;
    double m = -log(u);

    printf("v=%.16g u=%.16g m=%.16g\n", v, u, m);
}

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Thu, Sep 11, 2014 at 09:37:07AM -0300, Arthur Silva wrote: I agree that there's no reason to fix an algorithm to it, unless maybe it's pglz. There's some initial talk about implementing pluggable compression algorithms for TOAST, and I guess the same must be taken into consideration for the WAL. -- Arthur Silva On Thu, Sep 11, 2014 at 2:46 AM, Rahila Syed rahilasyed...@gmail.com wrote: I will repeat the above tests with high load on CPU and using the benchmark given by Fujii-san and post the results. Average % of CPU usage at user level for each of the compression algorithms is as follows:

Compression   Multiple   Single
Off           81.1338    81.1267
LZ4           81.0998    81.1695
Snappy        80.9741    80.9703
Pglz          81.2353    81.2753

http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user_single.png http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user.png The numbers show CPU utilization of Snappy is the least. The CPU utilization in decreasing order is: pglz > No compression > LZ4 > Snappy. The variance of the average CPU utilization numbers is very low. However, snappy seems to be best when it comes to lower utilization of CPU. As per the measurement results posted to date, LZ4 outperforms snappy and pglz in terms of compression ratio and performance. However, CPU utilization numbers show snappy utilizes the least amount of CPU. The difference is not much, though. As there has been no consensus yet about which compression algorithm to adopt, is it better to make this decision independent of the FPW compression patch, as suggested earlier in this thread? FPW compression can be done using the built-in compression pglz, as it shows a considerable performance benefit over uncompressed WAL and a good compression ratio. Also, the patch to compress multiple blocks at once gives better compression as compared to a single block.
ISTM that performance overhead introduced by multiple blocks compression is slightly higher than single block compression which can be tested again after modifying the patch to use pglz . Hence, this patch can be built using multiple blocks compression. Thoughts? Hi, The big (huge) win for lz4 (not the HC variant) is the enormous compression and decompression speed. It compresses quite a bit faster (33%) than snappy and decompresses twice as fast as snappy. Regards, Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
Thanks for reviewing, Andres. On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund and...@2ndquadrant.com wrote: +static void bgreclaim_quickdie(SIGNAL_ARGS); +static void BgreclaimSigHupHandler(SIGNAL_ARGS); +static void ReqShutdownHandler(SIGNAL_ARGS); +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS); This looks inconsistent. It's exactly the same as what bgwriter.c does. No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a good idea, regardless of it possibly being true today (which I'm not sure about yet). We really need a more centralized way to handle error cleanup in auxiliary processes. The current state of affairs is really pretty helter-skelter. But for this patch, I think we should aim to mimic the existing style, as ugly as it is. I'm not sure whether Amit's got the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum, is probably a good idea. +Two water mark indicators are used to maintain sufficient number of buffers +on freelist. Low water mark indicator is used by backends to wake bgreclaimer +when the number of buffers in freelist falls below it. High water mark +indicator is used by bgreclaimer to move buffers to freelist. For me the description of the high water as stated here doesn't seem to explain anything. Yeah, let me try to revise and expand on that a bit: Background Reclaimer's Processing - The background reclaimer runs the clock sweep to identify buffers that are good candidates for eviction and puts them on the freelist. This makes buffer allocation much faster, since removing a buffer from the head of a linked list is much cheaper than linearly scanning the whole buffer pool until a promising candidate is found. It's possible that a buffer we add to the freelist may be accessed or even pinned before it's evicted; if that happens, the backend that would have evicted it will simply disregard it and take the next buffer instead (or run the clock sweep itself, if necessary). 
However, to make sure that doesn't happen too often, we need to keep the freelist as short as possible, so that there won't be many other buffer accesses between the time a buffer is added to the freelist and the time when it's actually evicted. We use two water marks to control the activity of the bgreclaimer process. Each time bgreclaimer is awoken, it will move buffers to the freelist until the length of the free list reaches the high water mark. It will then sleep. When the number of buffers on the freelist reaches the low water mark, backends attempting to allocate new buffers will set the bgreclaimer's latch, waking it up again. While it's important for the high water mark to be small (for the reasons described above), we also need to ensure adequate separation between the low and high water marks, so that the bgreclaimer isn't constantly being awoken to find just a handful of additional candidate buffers, and we need to ensure that the low watermark is adequate to keep the freelist from becoming completely empty before bgreclaimer has time to wake up and begin filling it again. This section should have a description of how the reclaimer interacts with the bgwriter logic. Do we put dirty buffers on the freelist that are then cleaned by the bgwriter? Which buffers does the bgwriter write out? The bgwriter is cleaning ahead of the strategy point, and the bgreclaimer is advancing the strategy point. I think we should consider having the bgreclaimer wake the bgwriter if it comes across a dirty buffer, because while the bgwriter only estimates the rate of buffer allocation, bgreclaimer *knows* the rate of allocation, because its own activity is tied to the allocation rate. I think there's the potential for this kind of thing to make the background writer significantly more effective than it is today, but I'm heavily in favor of leaving it for a separate patch. I wonder if we don't want to increase the high watermark when tmp_recent_backend_clocksweep > 0?
That doesn't really work unless there's some countervailing force to eventually reduce it again; otherwise, it'd just converge to infinity. And it doesn't really seem necessary at the moment. Hm. Perhaps we should do a bufHdr->refcount != zero check without locking here? The atomic op will transfer the cacheline exclusively to the reclaimer's CPU, even though it will very shortly afterwards be touched by the pinning backend. Meh. I'm not in favor of adding more funny games with locking unless we can prove they're necessary for performance. * Are we sure that the freelist_lck spinlock won't cause pain? Right now there will possibly be dozens of processes busily spinning on it... I think it's an acceptable risk, but we should think about it. As you and I have talked about before, we could reduce contention here by partitioning the freelist, or by using a CAS loop to pop items off of it. But I am not convinced either
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 11, 2014 at 6:32 PM, Robert Haas robertmh...@gmail.com wrote: Thanks for reviewing, Andres. On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund and...@2ndquadrant.com wrote: +static void bgreclaim_quickdie(SIGNAL_ARGS); +static void BgreclaimSigHupHandler(SIGNAL_ARGS); +static void ReqShutdownHandler(SIGNAL_ARGS); +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS); This looks inconsistent. It's exactly the same as what bgwriter.c does. No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a good idea, regardless of it possibly being true today (which I'm not sure about yet). We really need a more centralized way to handle error cleanup in auxiliary processes. The current state of affairs is really pretty helter-skelter. But for this patch, I think we should aim to mimic the existing style, as ugly as it is. I'm not sure whether Amit's got the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum, is probably a good idea. Code related to bgreclaimer logic itself doesn't take any LWLock, do you suspect the same might be required due to some Signal/Interrupt handling? From myside, I have thought about what to keep for error cleanup based on the working of bgreclaimer. However there is a chance that I have missed something. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-11 09:02:34 -0400, Robert Haas wrote: Thanks for reviewing, Andres. On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund and...@2ndquadrant.com wrote: +static void bgreclaim_quickdie(SIGNAL_ARGS); +static void BgreclaimSigHupHandler(SIGNAL_ARGS); +static void ReqShutdownHandler(SIGNAL_ARGS); +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS); This looks inconsistent. It's exactly the same as what bgwriter.c does. So what? There's no code in common, so I see no reason to have one signal handler using underscores and the next one camelcase names. No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a good idea, regardless of it possibly being true today (which I'm not sure about yet). We really need a more centralized way to handle error cleanup in auxiliary processes. The current state of affairs is really pretty helter-skelter. Agreed. There really should be three variants: * full abort including support for transactions * full abort without transactions being used (most background processes) * abort without shared memory interactions But for this patch, I think we should aim to mimic the existing style, as ugly as it is. Agreed. Background Reclaimer's Processing - The background reclaimer runs the clock sweep to identify buffers that are good candidates for eviction and puts them on the freelist. This makes buffer allocation much faster, since removing a buffer from the head of a linked list is much cheaper than linearly scanning the whole buffer pool until a promising candidate is found. It's possible that a buffer we add to the freelist may be accessed or even pinned before it's evicted; if that happens, the backend that would have evicted it will simply disregard it and take the next buffer instead (or run the clock sweep itself, if necessary). 
However, to make sure that doesn't happen too often, we need to keep the freelist as short as possible, so that there won't be many other buffer accesses between the time a buffer is added to the freelist and the time when it's actually evicted. We use two water marks to control the activity of the bgreclaimer process. Each time bgreclaimer is awoken, it will move buffers to the freelist until the length of the free list reaches the high water mark. It will then sleep. I wonder if we should recheck the number of freelist items before sleeping. As the latch currently is reset before sleeping (IIRC), we might miss being woken up soon. It very well might be that bgreclaim needs to run for more than one cycle in a row to keep up... This section should have a description of how the reclaimer interacts with the bgwriter logic. Do we put dirty buffers on the freelist that are then cleaned by the bgwriter? Which buffers does the bgwriter write out? The bgwriter is cleaning ahead of the strategy point, and the bgreclaimer is advancing the strategy point. That sentence, in some form, should be in the above paragraph. I think we should consider having the bgreclaimer wake the bgwriter if it comes across a dirty buffer, because while the bgwriter only estimates the rate of buffer allocation, bgreclaimer *knows* the rate of allocation, because its own activity is tied to the allocation rate. I think there's the potential for this kind of thing to make the background writer significantly more effective than it is today, but I'm heavily in favor of leaving it for a separate patch. Yes, doing that sounds like a good plan. I'm happy with that being done in a separate patch. I wonder if we don't want to increase the high watermark when tmp_recent_backend_clocksweep > 0? That doesn't really work unless there's some countervailing force to eventually reduce it again; otherwise, it'd just converge to infinity. And it doesn't really seem necessary at the moment.
Right, it obviously needs to go both ways. I'm a bit sceptical about untunable, fixed numbers proving to be accurate for widely varied workloads. Hm. Perhaps we should do a bufHdr->refcount != zero check without locking here? The atomic op will transfer the cacheline exclusively to the reclaimer's CPU, even though it will very shortly afterwards be touched by the pinning backend. Meh. I'm not in favor of adding more funny games with locking unless we can prove they're necessary for performance. Well, this in theory increases the number of processes touching buffer headers regularly. Currently, if you have one read-IO-intensive backend, there's pretty much only one process touching the cachelines. This will make it two. I don't think it's unreasonable to try to reduce the cacheline pingpong caused by that... * Are we sure that the freelist_lck spinlock won't cause pain? Right now there will possibly be dozens of processes busily spinning on it... I think it's an acceptable risk, but we should think about it. As you and I have talked
Re: [HACKERS] pgbench throttling latency limit
Hello Heikki, Now that I've finished the detour and committed and backpatched the changes to the way latency is calculated, we can get back to this patch. It needs to be rebased. Here is the rebase, which seems ok. See also the small issues raised about getPoissonRand in another email. -- Fabien.

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4001a98..d36fa2c 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -141,6 +141,18 @@
 double		sample_rate = 0.0;
 int64		throttle_delay = 0;
 /*
+ * Transactions which take longer than this limit are counted as late
+ * and reported as such, although they are completed anyway.
+ *
+ * When under throttling: execution time slots which are more than
+ * this late (in us) are simply skipped, and the corresponding transaction
+ * is counted as such... it is not even started;
+ * otherwise above the limit transactions are counted as such, with the latency
+ * measured wrt the transaction schedule, not its actual start.
+ */ +int64 latency_limit = 0; + +/* * tablespace selection */ char *tablespace = NULL; @@ -238,6 +250,8 @@ typedef struct int64 throttle_trigger; /* previous/next throttling (us) */ int64 throttle_lag; /* total transaction lag behind throttling */ int64 throttle_lag_max; /* max transaction lag */ + int64 throttle_latency_skipped; /* lagging transactions skipped */ + int64 latency_late; /* late transactions */ } TState; #define INVALID_THREAD ((pthread_t) 0) @@ -250,6 +264,8 @@ typedef struct int64 sqlats; int64 throttle_lag; int64 throttle_lag_max; + int64 throttle_latency_skipped; + int64 latency_late; } TResult; /* @@ -372,6 +388,10 @@ usage(void) -f, --file=FILENAME read transaction script from FILENAME\n -j, --jobs=NUM number of threads (default: 1)\n -l, --logwrite transaction times to log file\n + -L, --limit=NUM count transactions lasting more than NUM ms.\n + under throttling (--rate), transactions behind schedule\n + more than NUM ms are skipped, and those finishing more\n + than NUM ms after their scheduled start are counted.\n -M, --protocol=simple|extended|prepared\n protocol for submitting queries (default: simple)\n -n, --no-vacuum do not run VACUUM before tests\n @@ -1038,6 +1058,23 @@ top: thread-throttle_trigger += wait; + if (latency_limit) + { + instr_time now; + int64 now_us; + INSTR_TIME_SET_CURRENT(now); + now_us = INSTR_TIME_GET_MICROSEC(now); + while (thread-throttle_trigger now_us - latency_limit) + { +/* if too far behind, this slot is skipped, and we + * iterate till the next nearly on time slot. 
+ */ +int64 wait = getPoissonRand(thread, throttle_delay); +thread-throttle_trigger += wait; +thread-throttle_latency_skipped ++; + } + } + st-txn_scheduled = thread-throttle_trigger; st-sleeping = 1; st-throttling = true; @@ -1110,8 +1147,11 @@ top: thread-exec_count[cnum]++; } - /* transaction finished: record latency under progress or throttling */ - if ((progress || throttle_delay) commands[st-state + 1] == NULL) + /* transaction finished: record latency under progress or throttling, + * or if latency limit is set + */ + if ((progress || throttle_delay || latency_limit) + commands[st-state + 1] == NULL) { int64 latency; @@ -1133,6 +1173,11 @@ top: * would take 256 hours. */ st-txn_sqlats += latency * latency; + + /* record over the limit transactions if needed. + */ + if (latency_limit latency latency_limit) +thread-latency_late++; } /* @@ -1360,7 +1405,7 @@ top: } /* Record transaction start time under logging, progress or throttling */ - if ((logfile || progress || throttle_delay) st-state == 0) + if ((logfile || progress || throttle_delay || latency_limit) st-state == 0) { INSTR_TIME_SET_CURRENT(st-txn_begin); @@ -2426,7 +2471,8 @@ printResults(int ttype, int64 normal_xacts, int nclients, TState *threads, int nthreads, instr_time total_time, instr_time conn_total_time, int64 total_latencies, int64 total_sqlats, - int64 throttle_lag, int64 throttle_lag_max) + int64 throttle_lag, int64 throttle_lag_max, + int64 throttle_latency_skipped, int64 latency_late) { double time_include, tps_include, @@ -2465,7 +2511,17 @@ printResults(int ttype, int64 normal_xacts, int nclients, normal_xacts); } - if (throttle_delay || progress) + if (throttle_delay latency_limit) + printf(number of transactions skipped: INT64_FORMAT (%.3f %%)\n, + throttle_latency_skipped, + 100.0 * throttle_latency_skipped / (throttle_latency_skipped + normal_xacts)); + + if (latency_limit) + printf(number of transactions above the %.1f ms latency limit:
Re: [HACKERS] Scaling shared buffer eviction
We really need a more centralized way to handle error cleanup in auxiliary processes. The current state of affairs is really pretty helter-skelter. But for this patch, I think we should aim to mimic the existing style, as ugly as it is. I'm not sure whether Amit's got the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum, is probably a good idea. Code related to the bgreclaimer logic itself doesn't take any LWLock; do you suspect the same might be required due to some signal/interrupt handling? I suspect it might creep in at some point at some unrelated place. Which will only ever break in production scenarios. Say, an lwlock in config file processing. I seem to recall somebody seeing a version of a patch adding an lwlock there... :). Or a logging hook. Or ... The savings from not doing LWLockReleaseAll() are nonexistent, so ... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On Wed, Sep 10, 2014 at 5:09 PM, Tomas Vondra t...@fuzzy.cz wrote: OK. So here's v13 of the patch, reflecting this change. With the exception of ExecChooseHashTableSize() and a lot of stylistic issues along the lines of what I've already complained about, this patch seems pretty good to me. It does three things: (1) It changes NTUP_PER_BUCKET to 1. Although this increases memory utilization, it also improves hash table performance, and the additional memory we use is less than what we saved as a result of commit 45f6240a8fa9d35548eb2ef23dba2c11540aa02a. (2) It changes ExecChooseHashTableSize() to include the effect of the memory consumed by the bucket array. If we care about properly respecting work_mem, this is clearly right for any NTUP_PER_BUCKET setting, but more important for a small setting like 1 than for the current value of 10. I'm not sure the logic in this function is as clean and simple as it can be just yet. (3) It allows the number of batches to increase on the fly while the hash join is in process. This case arises when we initially estimate that we only need a small hash table, and then it turns out that there are more tuples than we expect. Without this code, the hash table's load factor gets too high and things start to suck. I recommend separating this patch in two patches, the first covering items (1) and (2) and the second covering item (3), which really seems like a separate improvement. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Memory Alignment in Postgres
On Wed, Sep 10, 2014 at 12:43 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Sep 9, 2014 at 10:08 AM, Arthur Silva arthur...@gmail.com wrote: I'm continuously studying the Postgres codebase. Hopefully I'll be able to make some contributions in the future. For now I'm intrigued about the extensive use of memory alignment. I'm sure there's some legacy reasoning behind it, and some architectures that require it. That aside, since it wastes space (a lot of space in some cases), there must be a tipping point somewhere. I'm sure one can prove aligned access is faster in a micro-benchmark, but I'm not sure that's the case in a DBMS like Postgres, especially in the page/rows area. Just for the sake of comparison, MySQL's COMPACT storage (default and recommended since 5.5) doesn't align data at all. MySQL NDB uses a fixed 4-byte alignment. Not sure about Oracle and others. Is it worth the extra space on newer architectures (especially Intel)? Do you guys think this is something worth looking at? Yes. At least in my opinion, though, it's not a good project for a beginner. If you get your changes to take effect, you'll find that a lot of things will break in places that are not easy to find or fix. You're getting into really low-level areas of the system that get touched infrequently and require a lot of expertise in how things work today to adjust. I thought all memory alignment was (or at least the bulk of it) handled using some codebase-wide macros/settings, otherwise how could different parts of the code interoperate? Poking this area might suffice for some initial testing to check if it's worth any more attention. Unaligned memory access received a lot of attention in the Intel post-Nehalem era, so it may very well pay off on Intel servers. You might find this blog post and its comments/external links interesting: http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/ I'm a newbie in the codebase, so please let me know if I'm saying anything nonsensical.
The idea I've had before is to try to reduce the widest alignment we ever require from 8 bytes to 4 bytes. That is, look for types with typalign = 'd', and rewrite them to have typalign = 'i' by having them use two 4-byte loads to load an eight-byte value. In practice, I think this would probably save a high percentage of what can be saved, because 8-byte alignment implies a maximum of 7 bytes of wasted space, while 4-byte alignment implies a maximum of 3 bytes of wasted space. And it would probably be pretty cheap, too, because any type with less than 8-byte alignment wouldn't be affected at all, and even those types that were affected would only be slightly slowed down by doing two loads instead of one. In contrast, getting rid of alignment requirements completely would save a little more space, but probably at the cost of a lot more slowdown: any type with alignment requirements would have to fetch the value byte-by-byte instead of pulling the whole thing out at once. Does byte-by-byte access stand true nowadays? I thought modern processors would fetch memory at the very least in word-sized chunks, so 4/8 bytes, then merge/slice. But there are a couple of obvious problems with this idea, too, such as: 1. It's really complicated and a ton of work. 2. It would break pg_upgrade pretty darn badly unless we employed some even-more-complex strategy to mitigate that. 3. The savings might not be enough to justify the effort. Very true. It might be interesting for someone to develop a tool measuring the number of bytes of alignment padding we lose per tuple or per page and gather some statistics on it on various databases. That would give us some sense as to the possible savings.
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
2014-09-11 22:01 GMT+09:00 k...@rice.edu k...@rice.edu: On Thu, Sep 11, 2014 at 09:37:07AM -0300, Arthur Silva wrote: I agree that there's no reason to fix an algorithm to it, unless maybe it's pglz. Yes, it seems difficult to judge only the algorithm performance. We have to start to consider source code maintenance, quality, and other factors. The big (huge) win for lz4 (not the HC variant) is the enormous compression and decompression speed. It compresses quite a bit faster (33%) than snappy and decompresses twice as fast as snappy. Please show us the evidence. Postgres members showed the test results and their considerations; that is a very objective comparison. Best Regards, -- Mitsumasa KONDO
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 11, 2014 at 6:59 PM, Andres Freund and...@2ndquadrant.com wrote: We really need a more centralized way to handle error cleanup in auxiliary processes. The current state of affairs is really pretty helter-skelter. But for this patch, I think we should aim to mimic the existing style, as ugly as it is. I'm not sure whether Amit's got the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum, is probably a good idea. Code related to the bgreclaimer logic itself doesn't take any LWLock; do you suspect the same might be required due to some signal/interrupt handling? I suspect it might creep in at some point at some unrelated place. Which will only ever break in production scenarios. Say, an lwlock in config file processing. Yeah, I suspected the same and checked that path, but couldn't find one; maybe it is there in some path, as the code has many flows. I seem to recall somebody seeing a version of a patch adding an lwlock there... :). Or a logging hook. Or ... The savings from not doing LWLockReleaseAll() are nonexistent, so ... Okay, I shall add it in the next version of the patch and mention in the comments the reasons quoted by you. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 11, 2014 at 9:22 AM, Andres Freund and...@2ndquadrant.com wrote: It's exactly the same as what bgwriter.c does. So what? There's no code in common, so I see no reason to have one signal handler using underscores and the next one camelcase names. /me shrugs. It's not always possible to have things be consistent with each other within a file and also with what gets done in other files. I'm not sure we should fault patch authors for choosing a different one than we would have chosen. FWIW, I probably would have done it the way Amit did it. I don't actually care, though. I wonder if we should recheck the number of freelist items before sleeping. As the latch currently is reset before sleeping (IIRC) we might miss being woken up soon. It very well might be that bgreclaim needs to run for more than one cycle in a row to keep up... The outer loop in BgMoveBuffersToFreelist() was added to address precisely this point, which I raised in a previous review. I wonder if we don't want to increase the high watermark when tmp_recent_backend_clocksweep > 0? That doesn't really work unless there's some countervailing force to eventually reduce it again; otherwise, it'd just converge to infinity. And it doesn't really seem necessary at the moment. Right, it obviously needs to go both ways. I'm a bit skeptical about untunable, fixed numbers proving to be accurate for widely varied workloads. Me, too, but I'm *even more* skeptical about making things complicated on the pure theory that a simple solution can't be correct. I'm not blind to the possibility that the current logic is inadequate, but testing proves that it works well enough to produce a massive performance boost over where we are now. When, and if, we develop a theory about specifically how it falls short then, sure, let's adjust it. But I think it would be a serious error to try to engineer a perfect algorithm here based on the amount of testing that we can reasonably do pre-commit.
We have no chance of getting that right, and I'd rather have an algorithm that is simple and imperfect than an algorithm that is complex and still imperfect. No matter what, it's better than what we have now. Hm. Perhaps we should do a bufHdr->refcount != zero check without locking here? The atomic op will transfer the cacheline exclusively to the reclaimer's CPU. Even though it will be touched very shortly afterwards by the pinning backend. Meh. I'm not in favor of adding more funny games with locking unless we can prove they're necessary for performance. Well, this in theory increases the number of processes touching buffer headers regularly. Currently, if you have one read-IO-intensive backend, there's pretty much only one process touching the cachelines. This will make it two. I don't think it's unreasonable to try to reduce the cacheline pingpong caused by that... It's not unreasonable, but this is a good place to apply Knuth's first law of optimization. There's no proof we need to do this, so let's not until there is. I also think it would be good to get some statistics on how often regular backends are running the clocksweep vs. how often bgreclaimer is satisfying their needs. I think that's necessary. The patch added buf_backend_clocksweep. Maybe we just also need buf_backend_from_freelist? That's just (or should be just) buf_alloc - buf_backend_clocksweep. I think buf_backend_clocksweep should really be called buf_alloc_clocksweep, and should be added (in all relevant places) right next to buf_alloc. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Memory Alignment in Postgres
On Thu, Sep 11, 2014 at 9:32 AM, Arthur Silva arthur...@gmail.com wrote: I thought all memory alignment was (or at least the bulk of it) handled using some codebase wide macros/settings, otherwise how could different parts of the code inter-op? Poking this area might suffice for some initial testing to check if it's worth any more attention. Well, sure, but the issues aren't too simple. For example, I think there are cases where we rely on the alignment bytes being zero to distinguish between an aligned value following and an unaligned toasted value. That stuff can make your head explode, or at least mine. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] add modulo (%) operator to pgbench
2014-09-11 15:47 GMT+09:00 Fabien COELHO coe...@cri.ensmp.fr: Hello Robert, I am not objecting to the functionality; I'm objecting to bolting on ad-hoc operators one at a time. I think an expression syntax would let us do this in a much more scalable way. If I had time, I'd go do that, but I don't. We could add abs(x) and hash(x) and it would all be grand. Ok. I do agree that an expression syntax would be great! Yes, it's not bad. However, which method of modulo calculation do we need? I don't think it matters much, but I think the most intuitive modulo calculation method is floor modulo, because when negative values are involved it behaves as if both operands were positive, which makes the result easier to predict than the alternatives. This strong point is good for benchmark script users. But I don't have any strong opinion about this patch; not blocking :) Best Regards -- Mitsumasa KONDO
Re: [HACKERS] Aussie timezone database changes incoming
On Thu, Sep 11, 2014 at 12:20 AM, Craig Ringer cr...@2ndquadrant.com wrote: I shouldn't be surprised that Australia gets to change. While the cynic in me thinks this is the usual USA-is-the-center-of-the-universe-ism, in reality it makes sense given relative population and likely impact. Just because it makes sense doesn't mean it isn't USA-is-the-center-of-the-universe-ism. ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
Robert Haas robertmh...@gmail.com writes: With the exception of ExecChooseHashTableSize() and a lot of stylistic issues along the lines of what I've already complained about, this patch seems pretty good to me. It does three things: ... (3) It allows the number of batches to increase on the fly while the hash join is in process. This case arises when we initially estimate that we only need a small hash table, and then it turns out that there are more tuples than we expect. Without this code, the hash table's load factor gets too high and things start to suck. Pardon me for not having read the patch yet, but what part of (3) wasn't there already? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-11 09:48:10 -0400, Robert Haas wrote: On Thu, Sep 11, 2014 at 9:22 AM, Andres Freund and...@2ndquadrant.com wrote: I wonder if we should recheck the number of freelist items before sleeping. As the latch currently is reset before sleeping (IIRC) we might miss being woken up soon. It very well might be that bgreclaim needs to run for more than one cycle in a row to keep up... The outer loop in BgMoveBuffersToFreelist() was added to address precisely this point, which I raised in a previous review. Hm, right. But then let's move the BgWriterStats.m_buf_alloc += ..., ... pgstat_send_bgwriter(); into that loop. Otherwise it'd possibly end up being continuously busy without being visible. I wonder if we don't want to increase the high watermark when tmp_recent_backend_clocksweep > 0? That doesn't really work unless there's some countervailing force to eventually reduce it again; otherwise, it'd just converge to infinity. And it doesn't really seem necessary at the moment. Right, it obviously needs to go both ways. I'm a bit skeptical about untunable, fixed numbers proving to be accurate for widely varied workloads. Me, too, but I'm *even more* skeptical about making things complicated on the pure theory that a simple solution can't be correct. Fair enough. I'm not blind to the possibility that the current logic is inadequate, but testing proves that it works well enough to produce a massive performance boost over where we are now. But, to be honest, the testing so far was pretty narrow in the kind of workloads that were run, if I cross-read things accurately. Don't get me wrong, I'm *really* happy about having this patch, that just doesn't mean every detail is right ;) Hm. Perhaps we should do a bufHdr->refcount != zero check without locking here? The atomic op will transfer the cacheline exclusively to the reclaimer's CPU. Even though it will be touched very shortly afterwards by the pinning backend. Meh.
I'm not in favor of adding more funny games with locking unless we can prove they're necessary for performance. Well, this in theory increases the number of processes touching buffer headers regularly. Currently, if you have one read-IO-intensive backend, there's pretty much only one process touching the cachelines. This will make it two. I don't think it's unreasonable to try to reduce the cacheline pingpong caused by that... It's not unreasonable, but this is a good place to apply Knuth's first law of optimization. There's no proof we need to do this, so let's not until there is. That's true for new (pieces of) software; less so when working with an installed base that you might regress... But whatever. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: With the exception of ExecChooseHashTableSize() and a lot of stylistic issues along the lines of what I've already complained about, this patch seems pretty good to me. It does three things: ... (3) It allows the number of batches to increase on the fly while the hash join is in process. This case arises when we initially estimate that we only need a small hash table, and then it turns out that there are more tuples than we expect. Without this code, the hash table's load factor gets too high and things start to suck. Pardon me for not having read the patch yet, but what part of (3) wasn't there already? EINSUFFICIENTCAFFEINE. It allows the number of BUCKETS to increase, not the number of batches. As you say, the number of batches could already increase. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
Robert Haas robertmh...@gmail.com writes: On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: (3) It allows the number of batches to increase on the fly while the hash join is in process. Pardon me for not having read the patch yet, but what part of (3) wasn't there already? EINSUFFICIENTCAFFEINE. It allows the number of BUCKETS to increase, not the number of batches. As you say, the number of batches could already increase. Ah. Well, that would mean that we need a heuristic for deciding when to increase the number of buckets versus the number of batches ... seems like a difficult decision. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 11, 2014 at 10:03 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-11 09:48:10 -0400, Robert Haas wrote: On Thu, Sep 11, 2014 at 9:22 AM, Andres Freund and...@2ndquadrant.com wrote: I wonder if we should recheck the number of freelist items before sleeping. As the latch currently is reset before sleeping (IIRC) we might miss being woken up soon. It very well might be that bgreclaim needs to run for more than one cycle in a row to keep up... The outer loop in BgMoveBuffersToFreelist() was added to address precisely this point, which I raised in a previous review. Hm, right. But then let's move the BgWriterStats.m_buf_alloc += ..., ... pgstat_send_bgwriter(); into that loop. Otherwise it'd possibly end up being continuously busy without being visible. Good idea. I'm not blind to the possibility that the current logic is inadequate, but testing proves that it works well enough to produce a massive performance boost over where we are now. But, to be honest, the testing so far was pretty narrow in the kind of workloads that were run, if I cross-read things accurately. Don't get me wrong, I'm *really* happy about having this patch, that just doesn't mean every detail is right ;) Oh, sure. Totally agreed. And, to the extent that we're improving things based on actual testing, I'm A-OK with that. I just don't want to start speculating, or we'll never get this thing off the ground. Some possibly-interesting test cases would be: (1) A read-only pgbench workload that is just a tiny bit larger than shared_buffers, say size of shared_buffers plus 0.01%. Such workloads tend to stress buffer eviction heavily. (2) A workload that maximizes the rate of concurrent buffer eviction relative to other tasks. Read-only pgbench is not bad for this, but maybe somebody's got a better idea.
As I sort of mentioned in what I was writing for the bufmgr README, there are, more or less, three ways this can fall down, at least that I can see: (1) if the high water mark is too high, then we'll start finding buffers in the freelist that have already been touched since we added them; (2) if the low water mark is too low, the freelist will run dry; and (3) if the low and high water marks are too close together, the bgreclaimer will be constantly getting woken up and going to sleep again. I can't personally think of a workload that will enable us to get a better handle on those cases than high-concurrency pgbench, but you're known to be ingenious at coming up with destruction workloads, so if you have an idea, by all means fire away. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench throttling latency limit
How should skipped transactions be taken into account in the log file output, with and without aggregation? I assume we'll want to have some trace of skipped transactions in the logs. The problem with this point is that how to report something not done is unclear, especially as the logic of the log is one line per performed transaction. Obviously we can log something, but as the transactions are not performed, the format would be different, which breaks the expectation of a simple and homogeneous log file format that people like to process directly. So, barring any great idea, I would suggest not to log skipped transactions and to wait for someone who needs access to this detailed information and for whom the final report is not enough. -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Memory Alignment in Postgres
On 2014-09-11 10:32:24 -0300, Arthur Silva wrote: Unaligned memory access received a lot of attention in the Intel post-Nehalem era. So it may very well pay off on Intel servers. You might find this blog post and its comments/external links interesting: http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/ FWIW, the reported results are imo pretty meaningless for postgres. It's sequential access over a larger amount of memory, i.e. a perfectly prefetchable workload where it doesn't matter if superfluous cachelines are fetched because they're going to be needed next round anyway. In many production workloads one of the most busy accesses to individual datums is the binary search on individual pages during index lookups. That's pretty much exactly the contrary of the above. Not saying that it's not going to be a benefit in many scenarios, but it's far from being as simple as saying that unaligned accesses on their own aren't penalized anymore. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] gist vacuum seaq access
After the discussion of GiST sequential access in vacuum, there are 2 issues: Heikki says: "Vacuum needs to memorize the current NSN when it begins." 1) How can I get the correct NSN? Also, I must set the F_DELETED flag on empty pages and clean the parent of links to deleted pages. 2) How can I get the parent of a page fast? Thanks.
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 11 Září 2014, 15:31, Robert Haas wrote: On Wed, Sep 10, 2014 at 5:09 PM, Tomas Vondra t...@fuzzy.cz wrote: OK. So here's v13 of the patch, reflecting this change. With the exception of ExecChooseHashTableSize() and a lot of stylistic issues along the lines of what I've already complained about, this patch seems pretty good to me. It does three things: I believe I fixed the stylistic issues you pointed out, except maybe the bracketing (which seems fine to me, though). So if you could point the issues out, that'd be helpful. (1) It changes NTUP_PER_BUCKET to 1. Although this increases memory utilization, it also improves hash table performance, and the additional memory we use is less than what we saved as a result of commit 45f6240a8fa9d35548eb2ef23dba2c11540aa02a. (2) It changes ExecChooseHashTableSize() to include the effect of the memory consumed by the bucket array. If we care about properly respecting work_mem, this is clearly right for any NTUP_PER_BUCKET setting, but more important for a small setting like 1 than for the current value of 10. I'm not sure the logic in this function is as clean and simple as it can be just yet. I made that as clear as I was able to (based on your feedback), but I'm stuck, as I'm too familiar with the logic. That is not to say it can't be improved, but this needs to be reviewed by someone else. (3) It allows the number of batches to increase on the fly while the hash join is in process. This case arises when we initially estimate that we only need a small hash table, and then it turns out that there are more tuples than we expect. Without this code, the hash table's load factor gets too high and things start to suck. I recommend separating this patch into two patches, the first covering items (1) and (2) and the second covering item (3), which really seems like a separate improvement. That probably makes sense.
regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal (9.5) : psql unicode border line styles
Pavel,

* Pavel Stehule (pavel.steh...@gmail.com) wrote: I removed dynamic allocation and reduced patch size.

This is certainly better, imv, though there are a couple of minor issues (extra semicolons, extraneous whitespace, get_line_style was still changed to non-const, even though it doesn't need to be now).

What I tested: the old unicode style is the same as the new unicode style. Nothing was changed .. some fields are specified in the refresh_utf8format function.

I don't particularly like this (having these fields set in refresh_utf8format to hard-coded strings in the function); why not have those handled the same as the rest, where the strings themselves are in the unicode_style structure?

The rest looks pretty good. Need to step out for a bit but I'll look at making the above changes when I get back if I don't hear anything.

Thanks, Stephen
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 11 Září 2014, 16:11, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: (3) It allows the number of batches to increase on the fly while the hash join is in process. Pardon me for not having read the patch yet, but what part of (3) wasn't there already? EINSUFFICIENTCAFFEINE. It allows the number of BUCKETS to increase, not the number of batches. As you say, the number of batches could already increase. Ah. Well, that would mean that we need a heuristic for deciding when to increase the number of buckets versus the number of batches ... seems like a difficult decision. That's true, but that's not the aim of this patch. The patch simply increases the number of buckets if the load happens to get too high, and does not try to decide between increasing nbuckets and nbatch. It's true that we can often get better performance with more batches, but the cases I was able to inspect were caused either by (a) underestimates resulting in inappropriate nbucket count, or (b) caching effects. This patch solves (a), not even trying to fix (b). regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add shutdown_at_recovery_target option to recovery.conf
On 10/09/14 13:13, Fujii Masao wrote: On Wed, Sep 10, 2014 at 1:45 AM, Petr Jelinek p...@2ndquadrant.com wrote: Hi, I recently wanted several times to have slave server prepared at certain point in time to reduce the time it takes for it to replay remaining WALs (say I have pg_basebackup -x on busy db for example). In your example, you're thinking to perform the recovery after taking the backup and stop it at the consistent point (i.e., end of backup) by using the proposed feature? Then you're expecting that the future recovery will start from that consistent point and which will reduce the recovery time? This is true if checkpoint is executed at the end of backup. But there might be no occurrence of checkpoint during backup. In this case, even future recovery would need to start from very start of backup. That is, we cannot reduce the recovery time. So, for your purpose, for example, you might also need to add new option to pg_basebackup so that checkpoint is executed at the end of backup if the option is set. For my use-case it does not matter much as I am talking here of huge volumes where it would normally take hours to replay so being behind one checkpoint is not too bad, but obviously I am not sure that it's good enough for project in general. Adding checkpoint for pg_basebackup might be useful addition, yes. Also I forgot to write another use-case which making sure that I actually do have all the WAL present to get to certain point in time (this one could be done via patch to pg_receivexlog I guess, but I see advantage in having the changes already applied compared to just having the wal files). So I wrote simple patch that adds option to shut down the cluster once recovery_target is reached. The server will still be able to continue WAL replay if needed later or can be configured to start standalone. What about adding something like action_at_recovery_target=pause|shutdown instead of increasing the number of parameters? 
That will also increase the number of parameters, as we can't remove the current pause one if we want to be backwards compatible. Also, there would have to be something like action_at_recovery_target=none or off or something, since the default is that pause is on and we need to be able to turn off pause without having to have shutdown on. What's more, I am not sure I see any other actions that could be added in the future, as the promote action already works and listen (for RO queries) also already works independently of this.

-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Memory Alignment in Postgres
On Thu, Sep 11, 2014 at 8:32 AM, Arthur Silva arthur...@gmail.com wrote: On Wed, Sep 10, 2014 at 12:43 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Sep 9, 2014 at 10:08 AM, Arthur Silva arthur...@gmail.com wrote: I'm continuously studying the Postgres codebase. Hopefully I'll be able to make some contributions in the future.

For now I'm intrigued by the extensive use of memory alignment. I'm sure there's some legacy and some architecture-that-requires-it reasoning behind it. That aside, since it wastes space (a lot of space in some cases) there must be a tipping point somewhere. I'm sure one can prove aligned access is faster in a micro-benchmark, but I'm not sure that's the case in a DBMS like Postgres, especially in the page/rows area. Just for the sake of comparison, MySQL COMPACT storage (default and recommended since 5.5) doesn't align data at all. MySQL NDB uses a fixed 4-byte alignment. Not sure about Oracle and others. Is it worth the extra space on newer architectures (especially Intel)? Do you guys think this is something worth looking at?

Yes. At least in my opinion, though, it's not a good project for a beginner. If you get your changes to take effect, you'll find that a lot of things will break in places that are not easy to find or fix. You're getting into really low-level areas of the system that get touched infrequently and require a lot of expertise in how things work today to adjust.

I thought all memory alignment was (or at least the bulk of it) handled using some codebase-wide macros/settings; otherwise how could different parts of the code interoperate? Poking this area might suffice for some initial testing to check if it's worth any more attention. Unaligned memory access received a lot of attention in the Intel post-Nehalem era, so it may very well pay off on Intel servers.
You might find this blog post and its comments/external links interesting: http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/ I'm a newbie in the codebase, so please let me know if I'm talking nonsense.

Be advised of the difficulties you are going to face here. Assuming for a second there is no reason not to go unaligned on Intel and there are material benefits to justify the effort, that doesn't necessarily hold for other platforms like ARM/POWER. Even though Intel handles the vast majority of installations, it's not gonna fly to optimize for that platform at the expense of others, so there'd have to be some kind of compile-time setting to control alignment behavior. That being said, if you could pull this off cleanly, it'd be pretty neat.

merlin
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On Thu, Sep 11, 2014 at 10:11 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Thu, Sep 11, 2014 at 9:59 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: (3) It allows the number of batches to increase on the fly while the hash join is in process. Pardon me for not having read the patch yet, but what part of (3) wasn't there already? EINSUFFICIENTCAFFEINE. It allows the number of BUCKETS to increase, not the number of batches. As you say, the number of batches could already increase. Ah. Well, that would mean that we need a heuristic for deciding when to increase the number of buckets versus the number of batches ... seems like a difficult decision. Well, the patch takes the approach of increasing the number of buckets when (1) there's only one batch and (2) the load factor exceeds NTUP_PER_BUCKET. As Tomas points out, it's just increasing the number of buckets to the exact same number which we would have allocated had we known that this many tuples would show up. Now, there is a possibility that we could lose. Let's suppose that the tuples overflow work_mem, but just barely. If we've accurately estimate how many tuples there are, the patch changes nothing: either way we're going to need two batches. But let's say we've underestimated the number of tuples by 3x. Then it could be that without the patch we'd allocate fewer buckets, never increase the number of buckets, and avoid batching; whereas with the patch, we'll discover that our tuple estimate was wrong, increase the number of buckets on the fly, and then be forced by the slightly-increased memory consumption that results from increasing the number of buckets to do two batches. That would suck. But catering to that case is basically hoping that we'll fall into a septic tank and come out smelling like a rose - that is, we're hoping that our initial poor estimate will cause us to accidentally do the right thing later. 
I don't think that's the way to go. It's much more important to get the case where things are bigger than we expected but still fit within work_mem right; and we're currently taking a huge run-time penalty in that case, as Tomas' results show.

If we wanted to cater to the other case in the future, we could consider the opposite approach: if work_mem is exhausted, throw the bucket headers away and keep reading tuples into the dense-packed chunks added by yesterday's commit. If we again run out of work_mem, then we *definitely* need to increase the batch count. If we manage to make everything fit, then we know exactly how many bucket headers we can afford, and can use some heuristic to decide between that and using more batches.

I don't think that should be a requirement for this patch, though: I think the cumulative effect of Tomas's patches is going to be a win *much* more often than it's a loss. In 100% of cases, the dense-packing stuff will make a hash table containing the same tuples significantly smaller. Except when we've radically overestimated the tuple count, the change to NTUP_PER_BUCKET will mean less time spent chasing hash chains. It does use more memory, but that's more than paid for by the first change. Both the dense-packing stuff and the changes to include bucket headers in the work_mem calculation will have the effect of making the work_mem bound on memory utilization far more accurate than it's ever been previously. And the change to increase the number of buckets at runtime will mean that we're significantly more resilient against the planner underestimating the number of tuples. That's clearly good. The fact that we can construct borderline cases where it loses shouldn't deter us from pushing forward.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
Tomas Vondra t...@fuzzy.cz writes: On 11 Září 2014, 16:11, Tom Lane wrote: Ah. Well, that would mean that we need a heuristic for deciding when to increase the number of buckets versus the number of batches ... seems like a difficult decision. That's true, but that's not the aim of this patch. The patch simply increases the number of buckets if the load happens to get too high, and does not try to decide between increasing nbuckets and nbatch.

On further thought, increasing nbuckets without changing the batch boundaries would not get us out of an out-of-work_mem situation; in fact it makes memory consumption worse, not better (assuming you count the bucket headers towards work_mem ;-)). So in principle, the rule seems like it ought to go "if load (defined as max bucket chain length, I imagine?) gets too high, but we are still well below work_mem, increase nbuckets; else increase nbatch". And perhaps we reset nbuckets again for the next batch, not sure. If we are dealing with an out-of-work_mem situation then only increasing nbatch would be a suitable response. Because of the risk that increasing nbuckets would itself lead to a work_mem violation, I don't think it's sane to ignore the interaction entirely, even in a first patch.

regards, tom lane
Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW
* Albe Laurenz (laurenz.a...@wien.gv.at) wrote: Etsuro Fujita wrote: I agree with you on that point. So, I've updated the patch to have the explicit flag, as you proposed. Attached is the updated version of the patch. In this version, I've also revised code and its comments a bit. Thank you, I have set the patch to Ready for Committer.

I had a few minutes, so I started looking at this patch and I definitely like where it's going. One concern which was brought up that I didn't see completely addressed was around UPDATE/DELETE with LIMIT, which seems to be making progress towards getting in. Presumably, we'd simply disallow this optimization in that case, but we'll need a way to identify that case..

I have to admit that, while I applaud the effort made to have this change only be to postgres_fdw, I'm not sure that having the update/delete happening during the Scan phase and then essentially no-op'ing the ExecForeignUpdate/ExecForeignDelete is entirely in line with the defined API. I would have thought we'd add a capability check function that says "can this Modify be completely pushed down" and then have a way for that to happen in ExecForeignUpdate/ExecForeignDelete. That means changes to the existing API, of course, and if people feel that's unnecessary then I'd suggest that we need to at least document that we're willing to support these bulk operations happening on the remote side during the pre-Modify Scan and not during the ExecForeignUpdate/ExecForeignDelete.

Thanks, Stephen
Re: [HACKERS] Patch to support SEMI and ANTI join removal
On Thu, Sep 11, 2014 at 7:14 AM, David Rowley dgrowle...@gmail.com wrote: Here's a quick demo of the patch at work:

test=# create table c (id int primary key);
CREATE TABLE
test=# create table b (id int primary key, c_id int not null references c(id));
CREATE TABLE
test=# create table a (id int primary key, b_id int not null references b(id));
CREATE TABLE
test=# explain select a.* from a inner join b on a.b_id = b.id inner join c on b.c_id = c.id;
                     QUERY PLAN
-----------------------------------------------------
 Seq Scan on a  (cost=0.00..31.40 rows=2140 width=8)
 Planning time: 1.061 ms
(2 rows)

That is just awesome. You are my new hero.

1. I don't think that I'm currently handling removing eclass members properly. So far the code just removes the Vars that belong to the relation being removed. I likely should also be doing bms_del_member(ec->ec_relids, relid); on the eclass, but perhaps I should just be marking the whole class as ec_broken = true and adding another eclass with everything that the broken one has minus the parts from the removed relation?

I haven't read the patch, but I think the question is why this needs to be different than what we do for left join removal.

Assume there's a foreign key a(x) references b(x):

SELECT a.* FROM a INNER JOIN b ON a.x = b.x WHERE b.x = 1

Relation b should be removable because an eclass will contain {a.x, b.x} and therefore a baserestrictinfo for a.x = 1 should also exist on relation a. Therefore removing relation b should produce equivalent results, i.e. everything that gets filtered out on relation b will also be filtered out on relation a anyway.

I think the patch without this is still worth it, but if someone feels strongly about it I'll take a bash at supporting it. That'd be nice to fix, but IMHO not essential.

3. Currently the inner join support does not allow removals using foreign keys which contain duplicate columns on the referencing side, e.g. (a,a) references (x,y); this is basically because of the point I made in item 2.
In this case a baserestrictinfo would exist on the referenced relation to say WHERE x = y. I think it's fine to not bother with this case. Who cares?

4. The patch currently only allows removals for eclass join types. If the rel has any joininfo items, then the join removal is disallowed. From what I can see, equality-type inner join conditions get described in eclasses, and only non-equality join conditions make it into the joininfo list, and since foreign keys only support equality operators, I thought this was a valid restriction; however, if someone can show me a flaw in my assumption then I may need to improve this. Seems OK.

5. I've added a flag to pg_class called relhasfkey. Currently this gets set to true when a foreign key is added, though I've added nothing to set it back to false again. I notice that relhasindex gets set back to false during vacuum, if vacuum happens to find there not to be any indexes on the rel. I didn't put my logic here as I wasn't too sure if scanning pg_constraint during a vacuum seemed very correct, so I just left out the setting-it-to-false logic, based on the fact that I noticed that relhaspkey gets away with quite lazy setting-back-to-false logic (only when there are no indexes of any kind left on the relation at all).

The alternative to resetting the flag somehow is not having it in the first place. Would that be terribly expensive?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Memory Alignment in Postgres
Merlin Moncure mmonc...@gmail.com writes: Be advised of the difficulties you are going to face here. Assuming for a second there is no reason not to go unaligned on Intel and there are material benefits to justify the effort, that doesn't necessarily hold for other platforms like arm/power. Note that on many (most?) non-Intel architectures, unaligned access is simply not an option. The chips themselves will throw SIGBUS or equivalent if you try it. Some kernels provide signal handlers that emulate the unaligned access in software rather than killing the process; but the performance consequences of hitting such traps more than very occasionally would be catastrophic. Even on Intel, I'd wonder what unaligned accesses do to atomicity guarantees and suchlike. This is not a big deal for row data storage, but we'd have to be careful about it if we were to back off alignment requirements for in-memory data structures such as latches and buffer headers. Another fun thing you'd need to deal with is ensuring that the C structs we overlay onto catalog data rows still match up with the data layout rules. On the whole, I'm pretty darn skeptical that such an effort would repay itself. There are lots of more promising things to hack on. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On 10/09/14 22:53, Robert Haas wrote: Here's what the other approach looks like. I can't really see doing this way and then only providing hooks for those two functions, so this is with hooks for all the send-side stuff. Original version: 9 files changed, 295 insertions(+), 3 deletions(-) This version: 9 files changed, 428 insertions(+), 47 deletions(-) There is admittedly a certain elegance to providing a complete set of hooks, so maybe this is the way to go. The remaining patches in the patch series work with either the old version or this one; the changes here don't affect anything else. Anyone else have an opinion on which way is better here?

Ok so it is more than 20 lines :) I do like this approach more though, it looks cleaner to me.

On 10/09/14 22:53, Robert Haas wrote:

+extern int (*pq_putmessage_hook)(char msgtype, const char *s, size_t len);
+extern int (*pq_flush_hook)(void);

Btw you forgot to remove the above.

-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Memory Alignment in Postgres
On 2014-09-11 11:39:12 -0400, Tom Lane wrote: Even on Intel, I'd wonder what unaligned accesses do to atomicity guarantees and suchlike. They pretty much kill atomicity guarantees. Atomicity is guaranteed while you're inside a cacheline, but not once you span them. This is not a big deal for row data storage, but we'd have to be careful about it if we were to back off alignment requirements for in-memory data structures such as latches and buffer headers. Right. I don't think that's an option. Another fun thing you'd need to deal with is ensuring that the C structs we overlay onto catalog data rows still match up with the data layout rules. Yea, this would require some nastiness in the bki generation, but it'd probably be doable to have different alignment for system catalogs. On the whole, I'm pretty darn skeptical that such an effort would repay itself. There are lots of more promising things to hack on. I have no desire to hack on it, but I can understand the desire to reduce the space overhead...

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
[HACKERS] Re: proposal: ignore null fields in not relation type composite type based constructors
On 09/11/2014 08:29 AM, Stephen Frost wrote: * Pavel Stehule (pavel.steh...@gmail.com) wrote: Can I help with something, it is there some open question? I had been hoping for a more definitive answer regarding this option for array_to_json, or even a comment about the change to row_to_json. Andrew- any thoughts on this? (that's what the ping on IRC is for :). The change in row_to_json looks OK, I think. we're replacing an overloading with use of default params, yes? That seems quite reasonable, and users shouldn't notice the difference. There might be a case for optionally suppressing nulls in array_to_json, and it might work reasonably since unlike SQL arrays JSON arrays don't have to be regular (if nested they are arrays of arrays, not multi-dimensional single arrays). OTOH I'm not sure if it's worth doing just for the sake of orthogonality. If someone wants it, then go for it. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch to support SEMI and ANTI join removal
Robert Haas robertmh...@gmail.com writes: On Thu, Sep 11, 2014 at 7:14 AM, David Rowley dgrowle...@gmail.com wrote: 5. I've added a flag to pg_class called relhasfkey. Currently this gets set to true when a foreign key is added, though I've added nothing to set it back to false again. I notice that relhasindex gets set back to false during vacuum, if vacuum happens to find there to not be any indexes on the rel. I didn't put my logic here as I wasn't too sure if scanning pg_constraint during a vacuum seemed very correct, so I just left out the setting it to false logic based on the the fact that I noticed that relhaspkey gets away with quite lazy setting back to false logic (only when there's no indexes of any kind left on the relation at all). The alternative to resetting the flag somehow is not having it in the first place. Would that be terribly expensive? The behavior of relhaspkey is a legacy thing that we've tolerated only because nothing whatsoever in the backend depends on it at all. I'm not eager to add more equally-ill-defined pg_class columns. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for N synchronous standby servers
On Wed, Sep 10, 2014 at 11:40 PM, Michael Paquier michael.paqu...@gmail.com wrote: Currently two nodes can only have the same priority if they have the same application_name, so we could for example add a new connstring parameter called, let's say, application_group, to define groups of nodes that will have the same priority (if a node does not define application_group, it defaults to application_name; if app_name is NULL, well, we don't care much, it cannot be a sync candidate). That's a first idea that we could use to control groups of nodes. And we could switch syncrep.c to use application_group in s_s_names instead of application_name. That would be backward-compatible, and could open the door for more improvements for quorum commits as we could control groups of nodes. Well, this is a super-set of what application_name can already do, but there is no problem identifying single nodes of the same data center and how much they could be late in replication, so I think that this would be really user-friendly. An idea similar to that would be a base work for the next thing... See below.

In general, I think the user's requirement for which synchronous standbys need to acknowledge a commit could be an arbitrary Boolean expression - well, probably no NOT, but any amount of AND and OR that you want to use. Can someone want A OR (((B AND C) OR (D AND E)) AND F)? Maybe! Based on previous discussions, it seems not unlikely that as soon as we decide we don't want to support that, someone will tell us they can't live without it. In general, though, I'd expect the two common patterns to be more or less what you've set forth above: any K servers from set X plus any L servers from set Y plus any M servers from set Z, etc. However, I'm not confident it's right to control this by adding more configuration on the client side. I think it would be better to stick with the idea that each client specifies an application_name, and then the master specifies the policy in some way.
One advantage of that is that you can change the rules in ONE place - the master - rather than potentially having to update every client. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal (9.5) : psql unicode border line styles
Hi 2014-09-11 16:42 GMT+02:00 Stephen Frost sfr...@snowman.net: Pavel, * Pavel Stehule (pavel.steh...@gmail.com) wrote: I removed dynamic allocation and reduced patch size. This is certainly better, imv, though there are a couple of minor issues (extra semi-colons, extraneous whitespace, get_line_style was still changed to non-const, even though it doesn't need to be now). fixed non-const -- other, I am sorry, I am blind What I tested a old unicode style is same as new unicode style. There nothing was changed .. some fields are specified in refresh_utf8format function I don't particularly like this (having these fields set in refresh_utf8format to hard-coded strings in the function), why not have those handled the same as the rest, where the strings themselves are in the unicode_style structure? I am not sure if I understand well. With refresh_utf8format I can do shortly 6 possible combinations - or more (when it will be requested) I have no idea how to write as rest without repeating all 6 combinations - what was one noticed issue of some older variant, where I designed unicode1, unicode2, ... Any idea, tip how to it? Regards Pavel The rest looks pretty good. Need to step out for a bit but I'll look at making the above changes when I get back if I don't hear anything. 
Thanks, Stephen

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
new file mode 100644
index aa71674..1d59dce
*** a/doc/src/sgml/ref/psql-ref.sgml
--- b/doc/src/sgml/ref/psql-ref.sgml
*** lo_import 152801
*** 2306,2311 ****
--- 2306,2347 ----
        </para>
        </listitem>
      </varlistentry>
+
+     <varlistentry>
+       <term><literal>unicode_border_style</literal></term>
+       <listitem>
+       <para>
+       Sets the border line drawing style to one
+       of <literal>single</literal> or <literal>double</literal>
+       This option only affects the <literal>unicode</>
+       linestyle
+       </para>
+       </listitem>
+     </varlistentry>
+
+     <varlistentry>
+       <term><literal>unicode_column_style</literal></term>
+       <listitem>
+       <para>
+       Sets the column line drawing style to one
+       of <literal>single</literal> or <literal>double</literal>
+       This option only affects the <literal>unicode</>
+       linestyle
+       </para>
+       </listitem>
+     </varlistentry>
+
+     <varlistentry>
+       <term><literal>unicode_header_style</literal></term>
+       <listitem>
+       <para>
+       Sets the header line drawing style to one
+       of <literal>single</literal> or <literal>double</literal>
+       This option only affects the <literal>unicode</>
+       linestyle
+       </para>
+       </listitem>
+     </varlistentry>
      </variablelist>
    </para>
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
new file mode 100644
index 5d90ca2..94d8f45
*** a/src/bin/psql/command.c
--- b/src/bin/psql/command.c
*** exec_command(const char *cmd,
*** 1054,1059 ****
--- 1054,1062 ----
            "footer", "format", "linestyle", "null",
            "numericlocale", "pager", "recordsep",
            "tableattr", "title", "tuples_only",
+           "unicode_border_linestyle",
+           "unicode_column_linestyle",
+           "unicode_header_linestyle",
            NULL
        };
*** _align2string(enum printFormat in)
*** 2248,2253 ****
--- 2251,2290 ----
        return "unknown";
  }

+ /*
+  * Parse entered unicode linestyle. Returns true, when entered string is
+  * known linestyle: single, double else returns false.
+  */
+ static bool
+ set_unicode_line_style(printQueryOpt *popt, const char *value, size_t vallen,
+                        unicode_linestyle *linestyle)
+ {
+     if (pg_strncasecmp("single", value, vallen) == 0)
+         *linestyle = UNICODE_LINESTYLE_SINGLE;
+     else if (pg_strncasecmp("double", value, vallen) == 0)
+         *linestyle = UNICODE_LINESTYLE_DOUBLE;
+     else
+         return false;
+
+     /* input is ok, generate new unicode style */
+     refresh_utf8format(&(popt->topt));
+
+     return true;
+ }
+
+ static const char *
+ _unicode_linestyle2string(int linestyle)
+ {
+     switch (linestyle)
+     {
+         case UNICODE_LINESTYLE_SINGLE:
+             return "single";
+             break;
+         case UNICODE_LINESTYLE_DOUBLE:
+             return "double";
+             break;
+     }
+     return "unknown";
+ }

  bool
  do_pset(const char *param, const char *value, printQueryOpt *popt, bool quiet)
*** do_pset(const char *param, const char *v
*** 2305,2310 ****
--- 2342,2383 ----
  }

+ /* set unicode border line style */
+ else if (strcmp(param, "unicode_border_linestyle") == 0)
+ {
+     if (!value)
+         ;
+     else if (!set_unicode_line_style(popt, value, vallen,
+                                      &popt->topt.unicode_border_linestyle))
+     {
+         psql_error("\\pset: allowed unicode border linestyle are single, double\n");
+         return false;
+     }
+ }
+
+ /* set unicode column line style */
+ else if (strcmp(param, "unicode_column_linestyle") == 0)
Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW
Stephen Frost sfr...@snowman.net writes: I have to admit that, while I applaud the effort made to have this change only be to postgres_fdw, I'm not sure that having the update/delete happening during the Scan phase and then essentially no-op'ing the ExecForeignUpdate/ExecForeignDelete is entirely in-line with the defined API. Yeah, I've started looking at this patch and that seemed like not necessarily such a wise choice. I think it'd be better if the generated plan tree had a different structure in this case. The existing approach is an impressive hack but it's hard to call it anything but a hack. I'm not sure offhand what the new plan tree ought to look like. We could just generate a ForeignScan node, but that seems like rather a misnomer. Is it worth inventing a new ForeignUpdate node type? Or maybe it'd still be ForeignScan but with a flag field saying "hey, this is really an update (or a delete)". The main benefit I can think of right now is that the EXPLAIN output would be less strange-looking --- but EXPLAIN is hardly the only thing that ever looks at plan trees, so having an outright misleading plan structure is likely to burn us down the line. One advantage of getting the core code involved in the decision about whether an update/delete can be pushed down is that FDW-independent checks like whether there are relevant triggers could be implemented in the core code, rather than having to duplicate them (and later maintain them) in every FDW that wants to do this. OTOH, maybe the trigger issue is really the only thing that could be shared, not sure. Stuff like "is there a LIMIT" probably has to be in the FDW since some FDWs could support pushing down LIMIT and others not. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 11 Září 2014, 17:28, Tom Lane wrote: Tomas Vondra t...@fuzzy.cz writes: On 11 Září 2014, 16:11, Tom Lane wrote: Ah. Well, that would mean that we need a heuristic for deciding when to increase the number of buckets versus the number of batches ... seems like a difficult decision. That's true, but that's not the aim of this patch. The patch simply increases the number of buckets if the load happens to get too high, and does not try to decide between increasing nbuckets and nbatch. On further thought, increasing nbuckets without changing the batch boundaries would not get us out of an out-of-work_mem situation; in fact it makes memory consumption worse, not better (assuming you count the bucket headers towards work_mem ;-)). Yes, the patch counts the bucket headers towards work_mem, while the original code didn't. The reasoning is that due to changing NTUP_PER_BUCKET from 10 to 1, the memory occupied by buckets is not negligible and should be accounted for. So in principle, the rule seems like it ought to go "if load (defined as max bucket chain length, I imagine?) gets too high, but we are still well below work_mem, increase nbuckets; else increase nbatch". And perhaps we reset nbuckets again for the next batch, not sure. If we are dealing with an out-of-work_mem situation then only increasing nbatch would be a suitable response. Almost. If we expect batching at the very beginning, we size nbuckets for full work_mem (see how many tuples we can get into work_mem, while not breaking the NTUP_PER_BUCKET threshold). If we expect to be fine without batching, we start with the 'right' nbuckets and track the optimal nbuckets as we go (without actually resizing the hash table). Once we hit work_mem (considering the optimal nbuckets value), we keep the value. Only at the end, we check whether (nbuckets != nbuckets_optimal) and resize the hash table if needed.
Also, we keep this value for all batches (it's OK because it assumes full work_mem, and it makes the batchno evaluation trivial). So the resize happens only once and only for the first batch. Because of the risk that increasing nbuckets would itself lead to a work_mem violation, I don't think it's sane to ignore the interaction entirely, even in a first patch. This possibility of increasing the number of batches is the consequence of counting the buckets towards work_mem. I believe that's the right thing to do here, to make the work_mem guarantee stronger, but it's not something this patch depends on. If this is the only concern, we can leave this out. It's also worth mentioning that the dense allocation of tuples makes a huge difference. palloc easily results in 100% overhead for narrow tuples (that is, allocating 2x the amount of needed memory). For example over here [1] is an example of a hashjoin consuming 1.4GB with work_mem=800MB. And the dense allocation patch pretty much makes this go away (as demonstrated in the [1]). For wider tuples, the difference is smaller, but that when the buckets are less important. What I'm trying to say is that it's only a matter of work_mem values. Currently, because of the weak guarantee, people tend to set the value lower than necessary. The dense allocation makes this unnecessary, allowing higher work_mem values, and thus eliminating the issue. If this is not enough, we can stop counting the buckets towards work_mem. It'll still be more memory efficient than the old approach (as a pointer is smaller than palloc overhead), and it won't cause additional batches. But I don't like this - I think we should make work_mem a stronger guarantee. [1] http://www.postgresql.org/message-id/53beea9e.2080...@fuzzy.cz regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
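The grow-buckets-or-grow-batches rule Tom sketches ("if load gets too high, but we are still well below work_mem, increase nbuckets; else increase nbatch") can be illustrated with a small sketch. This is not PostgreSQL source: the constants, the doubling policy, and the assumption that half the tuples spill when nbatch doubles are all simplifications for illustration; note it counts bucket headers towards work_mem, as the patch does.

```python
# Illustrative sketch of the heuristic under discussion (not PostgreSQL code).
NTUP_PER_BUCKET = 1   # the patch's new target bucket load
POINTER_SIZE = 8      # bytes per bucket header (one pointer), an assumption

def adjust_hash_table(ntuples, tuple_width, nbuckets, nbatch, work_mem_bytes):
    """Return (nbuckets, nbatch) after applying the grow-buckets-or-batch rule."""
    while ntuples / nbuckets > NTUP_PER_BUCKET:
        # Projected memory if we double nbuckets: tuples plus bucket headers.
        projected = ntuples * tuple_width + (nbuckets * 2) * POINTER_SIZE
        if projected <= work_mem_bytes:
            nbuckets *= 2      # still room: relieve long bucket chains
        else:
            nbatch *= 2        # out of memory: spill to more batches
            ntuples //= 2      # roughly half the tuples stay in this batch
    return nbuckets, nbatch

# A bad initial estimate (1024 buckets for 1M tuples) gets corrected:
print(adjust_hash_table(1_000_000, 64, 1024, 1, 256 * 1024 * 1024))
```

With plenty of work_mem only nbuckets grows; shrink work_mem and the same call starts doubling nbatch instead, which is the interaction the thread is worried about.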
Re: [HACKERS] RLS Design
On Sat, Sep 6, 2014 at 2:54 AM, Brightwell, Adam adam.brightw...@crunchydatasolutions.com wrote: * Add ALTER TABLE name { ENABLE | DISABLE } ROW LEVEL SECURITY - set flag on table to allow for a default-deny capability. If RLS is enabled on a table and has no policies, then a default-deny policy is automatically applied. If RLS is disabled on a table and the table still has policies on it, then an error is raised. Though if DISABLE is accompanied with CASCADE, then all policies will be removed and no error is raised. This text doesn't make it clear that all of the cases have been covered; in particular, you didn't specify whether an error is thrown if you try to add a policy to a table with DISABLE ROW LEVEL SECURITY in effect. Backing up a bit, I think there are two sensible designs here: 1. Row level security policies can't exist for a table with DISABLE ROW LEVEL SECURITY in effect. It sounds like this is what you have implemented, modulo any hypothetical bugs. You can't add policies without enabling RLS, and you can't disable RLS without dropping them all. 2. Row level security policies can exist for a table with DISABLE ROW LEVEL SECURITY in effect, but they don't do anything until RLS is enabled. A possible advantage of this approach is that you could *temporarily* shut off RLS for a table without having to drop all of your policies and put them back. I kind of like this approach; we have something similar for triggers, and I think it could be useful to people. If you stick with approach #1, make sure pg_dump is guaranteed to enable RLS before applying the policies. And either way, you should check that pg_dump behaves sanely in the case where there are circular dependencies, like you have two tables A and B, and each has an RLS policy that manages to use the other table's row-type. (You probably also want to check that DROP and DROP .. CASCADE on either policy or either table does the right thing in that situation, but that's probably easier to get right.)
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
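The dump-ordering constraint Robert raises for approach #1 can be sketched briefly. This is a hypothetical illustration, not pg_dump code, and the table/policy names and statement shapes are made up: if policies cannot exist while RLS is disabled, a dump must emit ENABLE ROW LEVEL SECURITY before any CREATE POLICY on that table, or the restore fails.

```python
# Hypothetical sketch of the statement ordering pg_dump must guarantee
# under design #1 (policies forbidden while RLS is disabled).
def dump_table(table, policies):
    stmts = [f"CREATE TABLE {table} (...);"]
    if policies:
        # Must precede every CREATE POLICY, or restore would error out.
        stmts.append(f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY;")
        for pol in policies:
            stmts.append(f"CREATE POLICY {pol} ON {table} ...;")
    return stmts

print("\n".join(dump_table("accounts", ["manager_only"])))
```

Under design #2 the ordering would not matter, which is one more argument for that approach.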
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Thu, Sep 11, 2014 at 1:46 AM, Rahila Syed rahilasyed...@gmail.com wrote: I will repeat the above tests with high load on CPU and using the benchmark given by Fujii-san and post the results. Average % of CPU usage at user level for each of the compression algorithms is as follows.

Compression    Multiple    Single
Off            81.1338     81.1267
LZ4            81.0998     81.1695
Snappy         80.9741     80.9703
pglz           81.2353     81.2753

http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user_single.png http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user.png

The numbers show the CPU utilization of Snappy is the least. The CPU utilization in decreasing order is pglz > No compression > LZ4 > Snappy. The variance of the average CPU utilization numbers is very low. However, Snappy seems to be best when it comes to lesser utilization of CPU. As per the measurement results posted to date, LZ4 outperforms Snappy and pglz in terms of compression ratio and performance. However, the CPU utilization numbers show Snappy utilizes the least amount of CPU. The difference is not much, though. As there has been no consensus yet about which compression algorithm to adopt, is it better to make this decision independent of the FPW compression patch, as suggested earlier in this thread? FPW compression can be done using the built-in compression, pglz, as it shows considerable performance over uncompressed WAL and a good compression ratio. Also, the patch to compress multiple blocks at once gives better compression as compared to a single block. ISTM that the performance overhead introduced by multiple-block compression is slightly higher than single-block compression, which can be tested again after modifying the patch to use pglz. Hence, this patch can be built using multiple-block compression. I advise supporting pglz only for the initial patch, and adding support for the others later if it seems worthwhile. The approach seems to work well enough with pglz that it's worth doing even if we never add the other algorithms.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On 2014-09-11 12:55:21 -0400, Robert Haas wrote: I advise supporting pglz only for the initial patch, and adding support for the others later if it seems worthwhile. The approach seems to work well enough with pglz that it's worth doing even if we never add the other algorithms. That approach is fine with me. Note though that I am pretty strongly against adding support for more than one algorithm at the same time. So, if we gain lz4 support - which I think is definitely where we should go - we should drop pglz support for the WAL. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Thu, Sep 11, 2014 at 12:55:21PM -0400, Robert Haas wrote: I advise supporting pglz only for the initial patch, and adding support for the others later if it seems worthwhile. The approach seems to work well enough with pglz that it's worth doing even if we never add the other algorithms. +1 -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Commitfest status
On 10.9.2014 22:39, Heikki Linnakangas wrote: The bad news is that the rest don't seem to be graduating from the Needs Review state. ISTM that many patches (a) in 'needs review' actually have a review, or are being thoroughly discussed; (b) in 'waiting on author' are not really waiting, because the author already responded / posted a new version of the patch. Except that the patch status was not updated, which makes it really difficult to spot patches that currently need a review :-( regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Thu, Sep 11, 2014 at 12:58 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-11 12:55:21 -0400, Robert Haas wrote: I advise supporting pglz only for the initial patch, and adding support for the others later if it seems worthwhile. The approach seems to work well enough with pglz that it's worth doing even if we never add the other algorithms. That approach is fine with me. Note though that I am pretty strongly against adding support for more than one algorithm at the same time. What if one algorithm compresses better and the other algorithm uses less CPU time? I don't see a compelling need for an option if we get a new algorithm that strictly dominates what we've already got in all parameters, and it may well be that, as respects pglz, that's achievable. But ISTM that it need not be true in general. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] add modulo (%) operator to pgbench
On Thu, Sep 11, 2014 at 2:47 AM, Fabien COELHO coe...@cri.ensmp.fr wrote: Ok. I do agree that an expression syntax would be great! However, that would not diminish nor change much the amount and kind of code necessary to add an operator or a function That's not really true. You can't really add abs(x) or hash(x) right now because the current code only supports this syntax: \set varname operand1 [ operator operand2 ] There's no way to add support for a unary operator with that syntax. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
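Robert's point about the fixed `\set varname operand1 [ operator operand2 ]` grammar can be made concrete with a small sketch. This is illustrative Python, not pgbench code; the tokenization and variable syntax are assumptions. A grammar hard-wired to "operand, optional operator, operand" simply has no production where a unary function like abs(x) could attach.

```python
# Illustrative sketch (not pgbench source) of the current \set grammar's limits.
import operator

OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul,
       '/': operator.floordiv, '%': operator.mod}

def eval_fixed(tokens, variables):
    """The current pgbench shape: operand [ operator operand ] -- nothing else."""
    def val(t):
        return variables[t[1:]] if t.startswith(':') else int(t)
    if len(tokens) == 1:
        return val(tokens[0])
    if len(tokens) == 3 and tokens[1] in OPS:
        return OPS[tokens[1]](val(tokens[0]), val(tokens[2]))
    raise SyntaxError("not of the form: operand [ operator operand ]")

print(eval_fixed([':x', '%', '7'], {'x': 23}))  # the proposed modulo: 2
try:
    eval_fixed(['abs', '(', ':x'], {'x': -5})   # unary syntax has no slot
except SyntaxError as err:
    print(err)
```

Supporting abs(x) or hash(x) means replacing this shape with a recursive expression grammar, which is exactly the bigger change the thread is weighing against the one-operator patch.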
Re: [HACKERS] Memory Alignment in Postgres
On Thu, Sep 11, 2014 at 11:27 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-11 10:32:24 -0300, Arthur Silva wrote: Unaligned memory access received a lot of attention in Intel's post-Nehalem era, so it may very well pay off on Intel servers. You might find this blog post and its comments/external links interesting: http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/ FWIW, the reported results are imo pretty meaningless for postgres. It's sequential access over a larger amount of memory, i.e. a perfectly prefetchable workload where it doesn't matter if superfluous cachelines are fetched because they're going to be needed next round anyway. In many production workloads one of the busiest accesses to individual datums is the binary search on individual pages during index lookups. That's pretty much exactly the contrary of the above. Not saying that it's not going to be a benefit in many scenarios, but it's far from being as simple as saying that unaligned accesses on their own aren't penalized anymore. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services I modified the test code to use a completely random scan pattern to test something that completely trashes the cache. Not realistic, but it still confirms the hypothesis that the overhead is minimal on modern Intel.
-- test results compiling for 32bit --

processing word of size 2
offset = 0  average time for offset 0 is 422.7
offset = 1  average time for offset 1 is 422.85

processing word of size 4
offset = 0  average time for offset 0 is 436.6
offset = 1  average time for offset 1 is 451
offset = 2  average time for offset 2 is 444.3
offset = 3  average time for offset 3 is 441.9

processing word of size 8
offset = 0  average time for offset 0 is 630.15
offset = 1  average time for offset 1 is 653
offset = 2  average time for offset 2 is 655.5
offset = 3  average time for offset 3 is 660.85
offset = 4  average time for offset 4 is 650.1
offset = 5  average time for offset 5 is 656.9
offset = 6  average time for offset 6 is 656.6
offset = 7  average time for offset 7 is 656.9

-- test results compiling for 64bit --

processing word of size 2
offset = 0  average time for offset 0 is 402.55
offset = 1  average time for offset 1 is 406.9

processing word of size 4
offset = 0  average time for offset 0 is 424.05
offset = 1  average time for offset 1 is 436.55
offset = 2  average time for offset 2 is 435.1
offset = 3  average time for offset 3 is 435.3

processing word of size 8
offset = 0  average time for offset 0 is 444.9
offset = 1  average time for offset 1 is 470.25
offset = 2  average time for offset 2 is 468.95
offset = 3  average time for offset 3 is 476.75
offset = 4  average time for offset 4 is 474.9
offset = 5  average time for offset 5 is 468.25
offset = 6  average time for offset 6 is 469.8
offset = 7  average time for offset 7 is 469.1

// g++ -O2 -o test test.cpp && ./test
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <iostream>
#include <cassert>
#include <vector>
#include <inttypes.h>

using namespace std;

class WallClockTimer
{
public:
    struct timeval t1, t2;

    WallClockTimer() : t1(), t2()
    {
        gettimeofday(&t1, 0);
        t2 = t1;
    }

    void reset()
    {
        gettimeofday(&t1, 0);
        t2 = t1;
    }

    int elapsed()
    {
        return (t2.tv_sec * 1000 + t2.tv_usec / 1000) - (t1.tv_sec * 1000 + t1.tv_usec / 1000);
    }

    int split()
    {
        gettimeofday(&t2, 0);
        return elapsed();
    }
};

// xor shift
uint32_t xor128(void)
{
    static uint32_t x = 123456789;
    static uint32_t y = 362436069;
    static uint32_t z = 521288629;
    static uint32_t w = 88675123;
    uint32_t t;

    t = x ^ (x << 11);
    x = y;
    y = z;
    z = w;
    return w = w ^ (w >> 19) ^ (t ^ (t >> 8));
}

template <class T>
void runtest()
{
    size_t N = 10 * 1000 * 1000;
    int repeat = 20;
    WallClockTimer timer;
    const bool paranoid = false;

    cout << "processing word of size " << sizeof(T) << endl;
    for (unsigned int offset = 0; offset < sizeof(T); ++offset)
    {
        vector<T> bigarray(N + 2);
        cout << "offset = " << offset << endl;
        T * const begin = reinterpret_cast<T *>(reinterpret_cast<uintptr_t>(&bigarray[0]) + offset);
        assert(offset + reinterpret_cast<uintptr_t>(&bigarray[0]) == reinterpret_cast<uintptr_t>(begin));
        T * const end = begin + N;
        if (paranoid)
            assert(reinterpret_cast<uintptr_t>(end) < reinterpret_cast<uintptr_t>(&bigarray.back()));
        int sumt = 0;
        //cout << "ignore this: ";
        for (int k = 0; k < repeat; ++k)
        {
            timer.reset();
            for (size_t i = 0; i < N; ++i)
            {
                int ri = xor128() % N;
                begin[ri] = static_cast<T>(i);
            }
            volatile T val = 1;
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On 2014-09-11 13:04:43 -0400, Robert Haas wrote: On Thu, Sep 11, 2014 at 12:58 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-11 12:55:21 -0400, Robert Haas wrote: I advise supporting pglz only for the initial patch, and adding support for the others later if it seems worthwhile. The approach seems to work well enough with pglz that it's worth doing even if we never add the other algorithms. That approach is fine with me. Note though that I am pretty strongly against adding support for more than one algorithm at the same time. What if one algorithm compresses better and the other algorithm uses less CPU time? Then we make a choice for our users. A configuration option about an aspect of postgres that darned few people will understand, for the marginal differences between snappy and lz4, doesn't make sense. I don't see a compelling need for an option if we get a new algorithm that strictly dominates what we've already got in all parameters, and it may well be that, as respects pglz, that's achievable. But ISTM that it need not be true in general. If you look at the results, lz4 is pretty much there. Sure, there are algorithms which have much better compression -- but the time overhead is so large it just doesn't make sense for full page compression. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
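The ratio-versus-CPU trade-off Robert describes is easy to reproduce. None of pglz, lz4, or snappy ship with Python, so this sketch uses zlib's compression levels as a stand-in (an assumption, not a claim about those algorithms): one setting wins on CPU, another on ratio, and neither strictly dominates, which is exactly the situation where a choice has to be made for users.

```python
# Stand-in demonstration using zlib levels; not a benchmark of pglz/lz4/snappy.
import time
import zlib

# An 8192-byte synthetic "page" with repetitive, compressible content.
page = (b"tuple header" + bytes(52)) * 128

def measure(level):
    start = time.perf_counter()
    out = zlib.compress(page, level)
    return len(out), time.perf_counter() - start

fast_size, fast_time = measure(1)  # cheap on CPU, larger output
best_size, best_time = measure(9)  # more CPU, smaller output
print("level 1:", fast_size, "bytes;", "level 9:", best_size, "bytes")
```

On real full-page images the numbers differ, but the shape of the trade-off is the same one the thread is debating.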
Re: [HACKERS] implement subject alternative names support for SSL connections
On Mon, Sep 8, 2014 at 8:04 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 09/05/2014 07:30 PM, Alexey Klyukin wrote: The error does not state whether the names comes from the CN or from the SAN section. I'd reword that slightly, to: psql: server certificate for example.com (and 2 other names) does not match host name example-foo.com I never liked the current wording, server name foo very much. I think it's important to use the word server certificate in the error message, to make it clear where the problem is. +1 For translations, that message should be pluralized, but there is no libpq_ngettext macro equivalent to ngettext(), like libpq_gettext. I wonder if that was left out on purpose, or if we just haven't needed that in libpq before. Anyway, I think we need to add that for this. Well, the translation guidelines in the documentation suggest avoiding plurals if possible, but I don't like rephrasing the sentence with 'names total: %d', so attached is my attempt to add the function. I have also moved the one-time library binding code to the function of its own. This version also checks for the availability of the subject name, it looks like RFC 5280 does not require it at all. 4.1.2.6. Subject The subject field identifies the entity associated with the public key stored in the subject public key field. The subject name MAY be carried in the subject field and/or the subjectAltName extension. Ok. The pattern of allocating the name in the subroutine and then reporting it (and releasing when necessary) in the main function is a little bit ugly, but it looks to me the least ugly of anything else I could come up (i.e. extracting those names at the time the error message is shown). I reworked that a bit, see attached. What do you think? Thank you, I like your approach of unconditionally allocating and freeing memory, it makes the code easier to read. 
2 minor changes I've made in v7 (in addition to adding the libpq_ngettext):

- the verify_peer_name_matches_certificate does not try to return -1 (since it returns bool, it would be misinterpreted as true)
- moved the !got_error && !found_match check from the for loop header to the loop body, per a style comment from Tom at the beginning of this thread.

I think this is ready for commit, except for two things: 1. The pluralization of the message for translation 2. I still wonder if we should follow RFC 6125 and not check the Common Name at all, if SANs are present. I don't really understand the point of that rule, and it seems unlikely to pose a security issue in practice, because surely a CA won't sign a certificate with a bogus/wrong CN, because an older client that doesn't look at the SANs at all would use the CN anyway. But still... what does a Typical Web Browser do? At least Firefox and Safari seem to follow the RFC strictly, according to some anecdotal evidence (I haven't got enough time to dig into the source code yet): http://superuser.com/questions/230136/primary-common-name-in-subjectaltname https://bugzilla.mozilla.org/show_bug.cgi?id=238142 Chrome seems to follow them too, so the major open-source browsers are ignoring the Common Name if the SubjectAltName is present. Still, I see no win in ignoring the CN (except for the shorter code); it just annoys some users who forgot to put the CN name also in the SAN section, so my 5 cents is that we should check both.
Regards, -- Alexey Klyukin diff --git a/src/interfaces/libpq/fe-misc.c b/src/interfaces/libpq/fe-misc.c new file mode 100644 index fc930bd..a082268 *** a/src/interfaces/libpq/fe-misc.c --- b/src/interfaces/libpq/fe-misc.c *** static int pqSendSome(PGconn *conn, int *** 64,69 --- 64,71 static int pqSocketCheck(PGconn *conn, int forRead, int forWrite, time_t end_time); static intpqSocketPoll(int sock, int forRead, int forWrite, time_t end_time); + static void libpq_bindomain(); + /* * PQlibVersion: return the libpq version number *** PQenv2encoding(void) *** 1210,1223 #ifdef ENABLE_NLS ! char * ! libpq_gettext(const char *msgid) { static bool already_bound = false; if (!already_bound) { ! /* dgettext() preserves errno, but bindtextdomain() doesn't */ #ifdef WIN32 int save_errno = GetLastError(); #else --- 1212,1225 #ifdef ENABLE_NLS ! static void ! libpq_bindomain() { static bool already_bound = false; if (!already_bound) { ! /* bindtextdomain() does not preserve errno */ #ifdef WIN32 int save_errno = GetLastError(); #else *** libpq_gettext(const char *msgid) *** 1237,1244 --- 1239,1258 errno = save_errno; #endif } + } + char * + libpq_gettext(const char *msgid) + { + libpq_bindomain();
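The CN-versus-SAN policy question above can be sketched compactly. This is a hedged illustration, not libpq's code: the function names and the simplified wildcard rule are made up for the example, and real certificate matching has more cases (IP addresses, internationalized names, multi-label wildcards).

```python
# Hypothetical sketch of the two verification policies under discussion.
def match_name(pattern, host):
    """Simplified matcher: exact match, or a single leading '*.' wildcard."""
    if pattern.startswith("*."):
        return "." in host and host.split(".", 1)[1] == pattern[2:]
    return pattern == host

def verify_peer_name(host, cn, sans, rfc6125=False):
    if any(match_name(s, host) for s in sans):
        return True
    if rfc6125 and sans:   # strict mode: any SAN present -> CN is ignored
        return False
    return cn is not None and match_name(cn, host)

# The case that annoys users: CN correct, but forgotten in the SAN section.
print(verify_peer_name("example.com", "example.com", ["www.example.com"]))
print(verify_peer_name("example.com", "example.com", ["www.example.com"], rfc6125=True))
```

The first call reflects the "check both" position Alexey argues for; the second reflects the strict RFC 6125 / browser behavior, which rejects the connection.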
Re: [HACKERS] Memory Alignment in Postgres
On Thu, Sep 11, 2014 at 12:39 PM, Tom Lane t...@sss.pgh.pa.us wrote: Merlin Moncure mmonc...@gmail.com writes: Be advised of the difficulties you are going to face here. Assuming for a second there is no reason not to go unaligned on Intel and there are material benefits to justify the effort, that doesn't necessarily hold for other platforms like arm/power. Note that on many (most?) non-Intel architectures, unaligned access is simply not an option. The chips themselves will throw SIGBUS or equivalent if you try it. Some kernels provide signal handlers that emulate the unaligned access in software rather than killing the process; but the performance consequences of hitting such traps more than very occasionally would be catastrophic. Even on Intel, I'd wonder what unaligned accesses do to atomicity guarantees and suchlike. This is not a big deal for row data storage, but we'd have to be careful about it if we were to back off alignment requirements for in-memory data structures such as latches and buffer headers. Another fun thing you'd need to deal with is ensuring that the C structs we overlay onto catalog data rows still match up with the data layout rules. On the whole, I'm pretty darn skeptical that such an effort would repay itself. There are lots of more promising things to hack on. regards, tom lane Indeed, I don't know any other architectures where this would be an option. So if this ever moves forward it must be turned on at compile time for x86-64 only. I wonder how MySQL handles its rows even on those architectures, as their storage format is completely packed. If we just reduced the alignment requirements when laying out columns in the rows and indexes by reducing/removing padding -- typalign, it'd be enough gain in my (humble) opinion. If you think alignment is not an issue you can see savings everywhere, which is kinda insane... I'm unsure how this equates to patch complexity, but judging by the reactions so far I'm assuming a lot.
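The typalign-style padding being discussed can be observed directly with Python's struct module, which follows the platform C ABI in native ('@') mode: a 1-byte char followed by an 8-byte double gets 7 bytes of padding on x86-64, while the packed ('=') layout, roughly what a MySQL-style packed row format uses, needs only 9 bytes.

```python
# Demonstration of alignment padding versus packed layout via the C ABI.
import struct

aligned = struct.calcsize("@cd")  # native mode: char + padding + double
packed = struct.calcsize("=cd")   # standard (packed) mode: no padding
print(aligned, packed)            # 16 9 on typical x86-64 builds
```

The exact aligned size is platform-dependent, but wherever doubles require alignment the padded layout is strictly larger, which is the space saving the thread is weighing against unaligned-access costs.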
Re: [HACKERS] proposal (9.5) : psql unicode border line styles
* Pavel Stehule (pavel.steh...@gmail.com) wrote: 2014-09-11 16:42 GMT+02:00 Stephen Frost sfr...@snowman.net: I don't particularly like this (having these fields set in refresh_utf8format to hard-coded strings in the function), why not have those handled the same as the rest, where the strings themselves are in the unicode_style structure? I am not sure if I understand well. With refresh_utf8format I can do shortly 6 possible combinations - or more (when it will be requested) I have no idea how to write as rest without repeating all 6 combinations - what was one noticed issue of some older variant, where I designed unicode1, unicode2, ... Any idea, tip how to it? All I was suggesting was pulling these strings out of the function:

+   /* ↵ */
+   popt->header_nl_right = "\342\206\265";
+
+   popt->nl_left = " ";
+
+   /* ↵ */
+   popt->nl_right = "\342\206\265";
+
+   /* … */
+   popt->wrap_left = "\342\200\246";
+   popt->wrap_right = "\342\200\246";

and adding them to unicode_style and then referencing them there, similar to how the rest of printTextFormat popt (by the way- don't really like that variable name, particularly as it's used elsewhere with a very different meaning.. why not 'format' or 'ptf'?) is set up in refresh_utf8format, that's all. Thanks, Stephen signature.asc Description: Digital signature
[HACKERS] Re: proposal: ignore null fields in not relation type composite type based constructors
Andrew, * Andrew Dunstan (and...@dunslane.net) wrote: On 09/11/2014 08:29 AM, Stephen Frost wrote: * Pavel Stehule (pavel.steh...@gmail.com) wrote: Can I help with something, it is there some open question? I had been hoping for a more definitive answer regarding this option for array_to_json, or even a comment about the change to row_to_json. Andrew- any thoughts on this? (that's what the ping on IRC is for :). The change in row_to_json looks OK, I think. we're replacing an overloading with use of default params, yes? That seems quite reasonable, and users shouldn't notice the difference. Right. Great, thanks. There might be a case for optionally suppressing nulls in array_to_json, and it might work reasonably since unlike SQL arrays JSON arrays don't have to be regular (if nested they are arrays of arrays, not multi-dimensional single arrays). OTOH I'm not sure if it's worth doing just for the sake of orthogonality. If someone wants it, then go for it. Ok. I'll handle updating both of these to remove the overloading and use default params instead, but I'll only add the 'ignore_null' option to row_to_json. Thanks again! Stephen signature.asc Description: Digital signature
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Thu, Sep 11, 2014 at 1:17 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-11 13:04:43 -0400, Robert Haas wrote: On Thu, Sep 11, 2014 at 12:58 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-11 12:55:21 -0400, Robert Haas wrote: I advise supporting pglz only for the initial patch, and adding support for the others later if it seems worthwhile. The approach seems to work well enough with pglz that it's worth doing even if we never add the other algorithms. That approach is fine with me. Note though that I am pretty strongly against adding support for more than one algorithm at the same time. What if one algorithm compresses better and the other algorithm uses less CPU time? Then we make a choice for our users. A configuration option about an aspect of postgres that darned few people will understand, for the marginal differences between snappy and lz4, doesn't make sense. Maybe. Let's get the basic patch done first; then we can argue about that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Thu, Sep 11, 2014 at 06:58:06PM +0200, Andres Freund wrote: On 2014-09-11 12:55:21 -0400, Robert Haas wrote: I advise supporting pglz only for the initial patch, and adding support for the others later if it seems worthwhile. The approach seems to work well enough with pglz that it's worth doing even if we never add the other algorithms. That approach is fine with me. Note though that I am pretty strongly against adding support for more than one algorithm at the same time. So, if we gain lz4 support - which I think is definitely where we should go - we should drop pglz support for the WAL. Greetings, Andres Freund +1 Regards, Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Thu, Sep 11, 2014 at 07:17:42PM +0200, Andres Freund wrote: On 2014-09-11 13:04:43 -0400, Robert Haas wrote: On Thu, Sep 11, 2014 at 12:58 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-11 12:55:21 -0400, Robert Haas wrote: I advise supporting pglz only for the initial patch, and adding support for the others later if it seems worthwhile. The approach seems to work well enough with pglz that it's worth doing even if we never add the other algorithms. That approach is fine with me. Note though that I am pretty strongly against adding support for more than one algorithm at the same time. What if one algorithm compresses better and the other algorithm uses less CPU time? Then we make a choice for our users. A configuration option about an aspect of postgres that darned few people will understand, for the marginal differences between snappy and lz4, doesn't make sense. I don't see a compelling need for an option if we get a new algorithm that strictly dominates what we've already got in all parameters, and it may well be that, as respects pglz, that's achievable. But ISTM that it need not be true in general. If you look at the results, lz4 is pretty much there. Sure, there are algorithms which have a much better compression - but the time overhead is so large it just doesn't make sense for full page compression. Greetings, Andres Freund In addition, you can leverage the presence of a higher-compression variant of lz4 (lz4hc) that uses the same decompression engine. It could be applied to static tables as a REINDEX option, or even to slowly growing tables that would benefit from the better compression as well as the increased decompression speed. Regards, Ken
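The store-it-compressed-only-if-smaller decision at the heart of FPW compression can be sketched independently of which compressor wins the argument above. The following is a self-contained toy, not the patch's code: a run-length encoder stands in for pglz/lz4, and `store_page_image`/`fpw_hdr` are invented names. The point is the policy - keep the compressed image only when it is actually smaller, and record which form was stored in a flag the redo side can test.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Toy run-length encoder standing in for pglz/lz4: emits (count, byte) pairs. */
static size_t toy_compress(const uint8_t *src, size_t len, uint8_t *dst, size_t dstcap)
{
    size_t out = 0;
    for (size_t i = 0; i < len; )
    {
        uint8_t b = src[i];
        size_t run = 1;
        while (i + run < len && src[i + run] == b && run < 255)
            run++;
        if (out + 2 > dstcap)
            return 0;           /* doesn't fit: caller stores the page raw */
        dst[out++] = (uint8_t) run;
        dst[out++] = b;
        i += run;
    }
    return out;
}

static size_t toy_decompress(const uint8_t *src, size_t len, uint8_t *dst)
{
    size_t out = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
    {
        memset(dst + out, src[i + 1], src[i]);
        out += src[i];
    }
    return out;
}

/* FPW-style policy: keep the compressed image only when it is smaller, and
 * flag which form was stored so redo knows whether to decompress. */
typedef struct { int is_compressed; size_t stored_len; } fpw_hdr;

static fpw_hdr store_page_image(const uint8_t *page, size_t pagesz,
                                uint8_t *out /* at least pagesz bytes */)
{
    uint8_t tmp[8192];          /* assumes pagesz <= 8192, like BLCKSZ */
    size_t  clen = toy_compress(page, pagesz, tmp, pagesz);
    fpw_hdr h;

    if (clen > 0 && clen < pagesz)
    {
        memcpy(out, tmp, clen);
        h.is_compressed = 1;
        h.stored_len = clen;
    }
    else
    {
        memcpy(out, page, pagesz);  /* incompressible: store raw */
        h.is_compressed = 0;
        h.stored_len = pagesz;
    }
    return h;
}
```

A mostly-empty page compresses to a few bytes and round-trips exactly; a page the toy encoder would expand falls back to the raw copy, which is also the behavior one would want from the real compressor hook.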
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On Thu, Sep 11, 2014 at 7:34 AM, Amit Kapila amit.kapil...@gmail.com wrote: Don't we need some way to prohibit changing GUC by launching process, once it has shared the existing GUC? Nope. I mean, eventually, for true parallelism ... we absolutely will need that. But pg_background itself doesn't need that; it's perfectly fine for configuration settings to get changed in the background worker. So it's a different piece of infrastructure from this patch set. 1. This patch generates a warning on Windows: 1pg_background.obj : error LNK2001: unresolved external symbol StatementTimeout You need to add PGDLLIMPORT for StatementTimeout OK. I still think we should go back and PGDLLIMPORT-ize all the GUC variables. 2. CREATE FUNCTION pg_background_launch(sql pg_catalog.text, queue_size pg_catalog.int4 DEFAULT 65536) Here shouldn't queue_size be pg_catalog.int8, as I could see some related functions in test_shm_mq use int8? I think if you think you want a queue that's more than 2GB in size, you should re-think. pg_background_launch(PG_FUNCTION_ARGS) { text *sql = PG_GETARG_TEXT_PP(0); int32 queue_size = PG_GETARG_INT64(1); Here it should be the _INT32 variant to match the current sql definition, otherwise it leads to the below error. postgres=# select pg_background_launch('vacuum verbose foo'); ERROR: queue size must be at least 64 bytes Oops. 3. Comparing execute_sql_string() and exec_simple_query(), I could see the below main differences: a. Allocate a new memory context different from message context b. Start/commit control of transaction is outside execute_sql_string c. enable/disable statement timeout is done from outside in the case of execute_sql_string() d. execute_sql_string() prohibits Start/Commit/Abort transaction statements. e. execute_sql_string() doesn't log statements f. execute_sql_string() uses binary format for the result whereas exec_simple_query() uses TEXT as the default format g. processed-statement stat info is handled by the caller in the case of execute_sql_string(). 
Can't we handle all these or other changes inside exec_simple_query() based on some parameter, or maybe use a hook similar to what we do in ProcessUtility? Basically it looks a bit odd to have a duplicate (mostly) copy of exec_simple_query(). It is. But I didn't think hacking up exec_simple_query() was a better option. We could do that if most people like that approach, but to me it seemed there were enough differences to make it unappealing. 4. Won't it be better if pg_background_worker_main() can look more like PostgresMain() (in terms of handling different kinds of messages), so that it can be extended in future to handle parallel workers. I don't think that a parallel worker will look like pg_background in much more than broad outline. Some of the same constructs will get reused, but overall I think it's a different problem that I'd rather not conflate with this patch. 5. pg_background_result() { .. /* Read and processes messages from the shared memory queue. */ } Shouldn't the processing of messages be a separate function as we do for pqParseInput3(). I guess we could. It's not all that much code, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
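The PG_GETARG_INT64-on-an-int4-argument mismatch Amit spotted (point 2 above) can be illustrated with a toy model of a by-value Datum on a 64-bit build. These are hypothetical stand-ins, not the real fmgr macros - in actual Postgres the situation is worse still, because int8 is pass-by-reference on some platforms, so the wrong getter would dereference the value as a pointer. The by-value sketch already shows why a mismatched getter yields garbage:

```c
#include <assert.h>
#include <stdint.h>

/* Toy Datum: int4 arguments are stored zero-extended in a pointer-sized slot. */
typedef uintptr_t Datum;

static Datum   Int32GetDatum(int32_t x) { return (Datum) (uint32_t) x; }
static int32_t DatumGetInt32(Datum d)   { return (int32_t) d; }  /* right getter */
static int64_t DatumGetInt64(Datum d)   { return (int64_t) d; }  /* wrong getter */
```

For a positive value like the 65536 default the mismatch happens to go unnoticed, which is exactly what makes this kind of bug easy to ship; a negative int4 read through the 64-bit getter comes back as a huge positive number because the slot was only zero-extended to 32 bits.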
Re: [HACKERS] Memory Alignment in Postgres
On Thu, Sep 11, 2014 at 02:54:36PM -0300, Arthur Silva wrote: Indeed, I don't know of any other architectures where this would be an option. So if this ever moves forward it must be turned on at compile time for x86-64 only. I wonder how MySQL handles its rows even on those architectures, as its storage format is completely packed. If we just reduced the alignment requirements when laying out columns in the rows and indexes by reducing/removing padding -- typalign, it'd be enough gain in my (humble) opinion. If you think alignment is not an issue you can see savings everywhere, which is kinda insane... I'm unsure how this translates into patch complexity, but judging by the reactions so far I'm assuming a lot. If the column order in the table was independent of the physical layout, it would be possible to order columns to reduce the padding needed. Not my suggestion, just repeating a valid comment from earlier in the thread. Regards, Ken
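The padding cost that column reordering would avoid is the same one C programmers see in struct layout, since heap tuples are laid out with essentially struct-like alignment rules (typalign). A minimal sketch - the struct names are illustrative and the exact sizes are ABI-dependent, which is why only the ordering effect is asserted:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* An int8 ("bigint") attribute aligned to 8 bytes forces padding when
 * narrow attributes precede or follow it in an unfavorable order. */
typedef struct { char a; int64_t b; int16_t c; } BadOrder;   /* pad after a and c */
typedef struct { int64_t b; int16_t c; char a; } GoodOrder;  /* wide-to-narrow */
```

On a typical x86-64 ABI `BadOrder` is 24 bytes and `GoodOrder` is 16 - a 33% saving from reordering alone, with no change to alignment rules at all, which is the "order columns to reduce padding" idea repeated above.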
Re: [HACKERS] about half processes are blocked by btree, btree is bottleneck?
On Wed, Sep 10, 2014 at 8:08 PM, Xiaoyulei xiaoyu...@huawei.com wrote:

Sum:66
#0 0x7f8273a77627 in semop () from /lib64/libc.so.6
#1 0x0060cda7 in PGSemaphoreLock ()
#2 0x006511a9 in LWLockAcquire ()
#3 0x004987f7 in _bt_relandgetbuf ()
#4 0x0049c116 in _bt_search ()
#5 0x00497e13 in _bt_doinsert ()
#6 0x0049af52 in btinsert ()
#7 0x0072dce4 in FunctionCall6Coll ()
#8 0x0049592e in index_insert ()
#9 0x00590ac5 in ExecInsertIndexTuples ()

Sum:36
#0 0x7f8273a77627 in semop () from /lib64/libc.so.6
#1 0x0060cda7 in PGSemaphoreLock ()
#2 0x006511a9 in LWLockAcquire ()
#3 0x00497e31 in _bt_doinsert ()
#4 0x0049af52 in btinsert ()
#5 0x0072dce4 in FunctionCall6Coll ()
#6 0x0049592e in index_insert ()
#7 0x00590ac5 in ExecInsertIndexTuples ()

I don't know what Sum indicates here, but if the ratio of total calls to LWLockAcquire() between each LWLockAcquire() caller listed here is roughly in line with the Sum ratio, this must be a pretty small B-Tree (or the average size of each B-Tree must be very small). I suppose you could still see a lot of contention without the B-Tree ever getting much bigger if there are a lot of non-HOT updates. -- Peter Geoghegan
Re: [HACKERS] add modulo (%) operator to pgbench
However, that would not diminish nor change much the amount and kind of code necessary to add an operator or a function That's not really true. You can't really add abs(x) or hash(x) right now because the current code only supports this syntax: \set varname operand1 [ operator operand2 ] There's no way to add support for a unary operator with that syntax. Hmmm. If you accept a postfix syntax, there is:-) But this is not convincing. Adding a unary function with a clean syntax indeed requires doing something with the parser. -- Fabien.
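For what it's worth, the jump from the current `\set varname operand1 [ operator operand2 ]` form to a real expression grammar is not huge in code terms. Here is a hedged sketch of a recursive-descent evaluator (hypothetical, not pgbench's parser) where adding a unary function like abs(x) touches only the primary rule - which is the scalability argument for an expression syntax:

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

static const char *p;           /* cursor into the expression string */

static long parse_expr(void);   /* forward declaration: primary recurses */

static void skip_ws(void) { while (isspace((unsigned char) *p)) p++; }

/* primary := '(' expr ')' | 'abs(' expr ')' | '-' primary | integer
 * New unary functions (hash(x), ...) would be added right here. */
static long parse_primary(void)
{
    skip_ws();
    if (*p == '(') { p++; long v = parse_expr(); skip_ws(); if (*p == ')') p++; return v; }
    if (strncmp(p, "abs(", 4) == 0)
        { p += 4; long v = parse_expr(); skip_ws(); if (*p == ')') p++; return labs(v); }
    if (*p == '-') { p++; return -parse_primary(); }
    {
        char *end;
        long v = strtol(p, &end, 10);
        p = end;
        return v;
    }
}

static long parse_term(void)    /* handles * / % with higher precedence */
{
    long v = parse_primary();
    for (;;)
    {
        skip_ws();
        if (*p == '*')      { p++; v *= parse_primary(); }
        else if (*p == '/') { p++; v /= parse_primary(); }
        else if (*p == '%') { p++; v %= parse_primary(); }
        else return v;
    }
}

static long parse_expr(void)    /* handles + - with lower precedence */
{
    long v = parse_term();
    for (;;)
    {
        skip_ws();
        if (*p == '+')      { p++; v += parse_term(); }
        else if (*p == '-') { p++; v -= parse_term(); }
        else return v;
    }
}

static long eval(const char *s) { p = s; return parse_expr(); }
```

It deliberately ignores error handling and the lexer-juggling issue Fabien raises (deciding which parts of a script line get parsed at all), which is the genuinely tricky part.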
Re: [HACKERS] RLS Design
Robert, * Robert Haas (robertmh...@gmail.com) wrote: On Sat, Sep 6, 2014 at 2:54 AM, Brightwell, Adam adam.brightw...@crunchydatasolutions.com wrote: * Add ALTER TABLE name { ENABLE | DISABLE } ROW LEVEL SECURITY - set flag on table to allow for a default-deny capability. If RLS is enabled on a table and has no policies, then a default-deny policy is automatically applied. If RLS is disabled on a table and the table still has policies on it, then an error is raised. Though if DISABLE is accompanied by CASCADE, then all policies will be removed and no error is raised. This text doesn't make it clear that all of the cases have been covered; in particular, you didn't specify whether an error is thrown if you try to add a policy to a table with DISABLE ROW LEVEL SECURITY in effect. Backing up a bit, I think there are two sensible designs here: Ah, yeah, the text could certainly be clearer. 1. Row level security policies can't exist for a table with DISABLE ROW LEVEL SECURITY in effect. It sounds like this is what you have implemented, modulo any hypothetical bugs. You can't add policies without enabling RLS, and you can't disable RLS without dropping them all. Right, this was the approach we were taking. Specifically, adding policies would implicitly enable RLS for the relation. 2. Row level security policies can exist for a table with DISABLE ROW LEVEL SECURITY in effect, but they don't do anything until RLS is enabled. A possible advantage of this approach is that you could *temporarily* shut off RLS for a table without having to drop all of your policies and put them back. I kind of like this approach; we have something similar for triggers, and I think it could be useful to people. I like the idea of being able to turn them off without dropping them. We have that with row_security = off, but that would only work for the owner or a superuser (or a user with bypassrls). This would allow disabling RLS temporarily for everything accessing the table. 
The one thing I'm wondering about with this design is- what happens when a policy is initially added? Currently, we automatically turn on RLS for the table when that happens. I'm not thrilled with the idea that you have to add policies AND turn on RLS explicitly- someone might add policies but then forget to turn RLS on.. If you stick with approach #1, make sure pg_dump is guaranteed to enable RLS before applying the policies. Currently, adding a policy automatically turns on RLS, so we don't have any issue with pg_dump from that perspective. Handling cases where RLS is disabled but policies exist would get more complicated for pg_dump if we keep the current idea that adding policies implicitly turns on RLS- it'd essentially have to go back and turn it off after the policies are added. Not a big fan of that either. And either way, you should check that pg_dump behaves sanely in the case where there are circular dependencies, like you have two tables A and B, and each has a RLS policy that manages to use the other table's row-type. (You probably also want to check that DROP and DROP .. CASCADE on either policy or either table does the right thing in that situation, but that's probably easier to get right.) Agreed, we'll double-check that this is working. As these are attributes of the table which get added later on by pg_dump, similar to permissions, I'd think it'd all work fine, but good to make sure (and ditto with DROP/DROP CASCADE.. We have some checks for that, but good to make sure it works in a circular-dependency case too). If we want to be able to disable RLS w/o dropping the policies, then I think we have to completely de-couple the two and users would then have to both add policies AND turn on RLS to have RLS actually be enabled for a given table. I'm on the fence about that. Thoughts? Thanks! Stephen
Re: [HACKERS] Commitfest status
On 11/09/14 18:59, Tomas Vondra wrote: On 10.9.2014 22:39, Heikki Linnakangas wrote: The bad news is that the rest don't seem to be graduating from the Needs Review state. ISTM that many patches (b) in 'waiting on author' are not really waiting, because the author already responded / posted a new version of the patch Except that the patch status was not updated, which makes it really difficult to spot patches that currently need a review :-( I think that still means the patch is 'waiting on author', as the author is responsible for changing this. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On 11/09/14 20:37, Robert Haas wrote: 1. This patch generates warning on windows 1pg_background.obj : error LNK2001: unresolved external symbol StatementTimeout You need to add PGDLLIMPORT for StatementTimeout OK. I still think we should go back and PGDLLIMPORT-ize all the GUC variables. +1 4. Won't it be better if pg_background_worker_main() can look more like PostgresMain() (in terms of handling different kind of messages), so that it can be extended in future to handle parallel worker. I don't think that a parallel worker will look like pg_background in much more than broad outline. Some of the same constructs will get reused, but overall I think it's a different problem that I'd rather not conflate with this patch. Yeah, I agree here. While the patch provides a lot of necessary plumbing that any kind of parallel processing needs, pg_background itself does not seem to be the base for that. Actually, when I first saw pg_background, I was thinking that it looks like a good base for some kind of background job scheduling mechanism that was requested a few times over the years (the actual scheduling is the only part missing now IMHO, but that's a separate discussion). -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On 2014-09-11 21:27:15 +0200, Petr Jelinek wrote: On 11/09/14 20:37, Robert Haas wrote: OK. I still think we should go back and PGDLLIMPORT-ize all the GUC variables. +1 Same here. I think Tom was the only one against it, but pretty much everyone else was for it. We should fix all the GUCs declared as externs in multiple .c files at the same time imo. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
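For readers unfamiliar with the Windows issue being +1'd here: a global variable exported from the server binary must be declared dllimport for code built as a separate DLL (like pg_background) to link against it, while on other platforms no annotation is needed. A hedged sketch of the conventional idiom - this mirrors the shape of Postgres's macro but is not copied from the headers:

```c
#include <assert.h>

/* On Windows the annotation marks the symbol as imported from the server
 * executable; elsewhere the macro expands to nothing, so it costs nothing. */
#ifdef _WIN32
#define PGDLLIMPORT __declspec(dllimport)
#else
#define PGDLLIMPORT
#endif

/* With the annotation, an extension's reference to the GUC links on MSVC;
 * without it, you get "unresolved external symbol StatementTimeout".
 * (Declaration only - the variable is defined in the server.) */
extern PGDLLIMPORT int StatementTimeout;
```

This is why "PGDLLIMPORT-izing all the GUC variables" is purely additive: on non-Windows builds the declarations are textually unchanged after preprocessing.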
Re: [HACKERS] Aussie timezone database changes incoming
On 12/09/14 01:57, Robert Haas wrote: On Thu, Sep 11, 2014 at 12:20 AM, Craig Ringer cr...@2ndquadrant.com wrote: I shouldn't be surprised that Australia gets to change. While the cynic in me thinks this is the usual USA-is-the-center-of-the-universe-ism, in reality it makes sense given relative population and likely impact. Just because it makes sense doesn't mean it isn't USA-is-the-center-of-the-universe-ism. ...Robert In the same way, Americans tend to act like they are not on a planet, saying something will happen in Summer - completely ignoring that for people who live in the Southern hemisphere it will be Winter! Cheers, Gavin
Re: [HACKERS] [v9.5] Custom Plan API
On Thu, Sep 11, 2014 at 11:24 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: Don't the changes to src/backend/optimizer/plan/createplan.c belong in patch #2? The borderline between #1 and #2 is a little bit bogus. So, I moved most of that portion into #1; however, the invocation of InitCustomScan (that is a callback in CustomPlanMethod) in create_custom_plan() is still in #2. Eh, create_custom_scan() certainly looks like it is in #1 from here, or at least part of it is. It calculates tlist and clauses and then does nothing with them. That clearly can't be the right division. I think it would make sense to have create_custom_scan() compute tlist and clauses first, and then pass those to CreateCustomPlan(). Then you don't need a separate InitCustomScan() - which is misnamed anyway, since it has nothing to do with ExecInitCustomScan(). OK, I revised. Now custom-scan assumes it has a particular valid relation to be scanned, so no code path with scanrelid == 0 at this moment. Let us revisit this scenario when custom-scan replaces relation-joins. In this case, custom-scan will not be associated with a particular base-relation, thus it needs to admit a custom-scan node with scanrelid == 0. Yeah, I guess the question there is whether we'll want to let CustomScan have scanrelid == 0 or require that CustomJoin be used there instead. Why can't the Custom(GpuHashJoin) node build the hash table internally instead of using a separate node? It's possible; however, it prevents checking sub-plans using EXPLAIN if we manage inner-plans internally. So, I'd like to have a separate node connected to the inner-plan. Isn't that just a matter of letting the EXPLAIN code print more stuff? Why can't it? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
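To make the suggested division of labor concrete: the core planner computes what it can (tlist, clauses) and hands the results to a provider callback through a method table, so no separate init entry point is needed. This is a hedged sketch of that callback-table pattern only - the struct and function names are illustrative, not the proposed Custom Plan API:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for planner state; real nodes carry far more. */
typedef struct CustomScanState
{
    int initialized;
    int scanrelid;      /* 0 would mean "no single base relation" (a join) */
} CustomScanState;

/* The method-table idiom: the provider registers function pointers that
 * the core code invokes at well-defined points. */
typedef struct CustomPlanMethods
{
    const char *name;
    void (*CreateCustomPlan) (CustomScanState *node, int scanrelid);
} CustomPlanMethods;

/* A provider's implementation: receives what the core already computed. */
static void gpu_create_plan(CustomScanState *node, int scanrelid)
{
    node->scanrelid = scanrelid;
    node->initialized = 1;
}

static CustomPlanMethods gpuscan_methods = { "GpuScan", gpu_create_plan };
```

The design point under discussion is simply where the boundary sits: if the core computes tlist/clauses and passes them into `CreateCustomPlan`, the provider never needs a second `InitCustomScan`-style hook.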
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Wed, Sep 10, 2014 at 11:36 AM, Robert Haas robertmh...@gmail.com wrote: No, not really. All you have to do is write a little test program to gather the information. I don't think a little test program is useful - IMV it's too much of a simplification to suppose that a strcoll() has a fixed cost, and a memcmp() has a fixed cost, and that we can determine algebraically that we should (say) proceed or not proceed with the additional opportunistic memcmp() == 0 optimization based solely on that. I'm not sure if that's what you meant, but it might have been. Temporal locality is surely a huge factor here, for example. Are we talking about a memcmp() that always immediately precedes a similar strcoll() call on the same memory? Are we factoring in the cost of NUL-termination in order to make each strcoll() possible? And that's just for starters. However, I think it's perfectly fair to consider a case where the opportunistic memcmp() == 0 thing never works out (as opposed to mostly not helping, which is what I considered earlier [1]), as long as we're sorting real tuples. You mentioned Heikki's test case; it seems fair to consider that, but for the non-abbreviated case where the additional, *totally* opportunistic memcmp == 0 optimization only applies (so no abbreviated keys), while still having the additional optimization be 100% useless. Clearly that test case is also just about perfectly pessimal for this case too. (Recall that Heikki's test case shows performance on pre-sorted input, so there are far fewer comparisons than would typically be required anyway - n comparisons, or a bubble sort best case. If I wanted to cheat, I could reduce work_mem so that an external tape sort is used, since as it happens tapesort doesn't opportunistically check for pre-sorted input, but I won't do that. 
Heikki's case both emphasizes the amortized cost of a strxfrm() where we abbreviate, and in this instance de-emphasizes the importance of memory latency by having access be sequential/predictable.) The only variation I'm adding here to Heikki's original test case is to have a leading int4 attribute that always has a value of 1 -- that conveniently removes abbreviation (including strxfrm() overhead) as a factor that can influence the outcome, since right now that isn't under consideration. So: create table sorttest (dummy int4, t text); insert into sorttest select 1, 'foobarfo' || (g) || repeat('a', 75) from generate_series(1, 3) g; Benchmark: pg@hamster:~/tests$ cat heikki-sort.sql select * from (select * from sorttest order by dummy, t offset 100) f; pgbench -f heikki-sort.sql -T 100 -n With optimization enabled tps = 77.861353 (including connections establishing) tps = 77.862260 (excluding connections establishing) tps = 78.211053 (including connections establishing) tps = 78.212016 (excluding connections establishing) tps = 77.996117 (including connections establishing) tps = 77.997069 (excluding connections establishing) With optimization disabled (len1 == len2 thing is commented out) = tps = 78.719389 (including connections establishing) tps = 78.720366 (excluding connections establishing) tps = 78.764819 (including connections establishing) tps = 78.765712 (excluding connections establishing) tps = 78.472902 (including connections establishing) tps = 78.473844 (excluding connections establishing) So, yes, it looks like I might have just about regressed this case - it's hard to be completely sure. However, this is still a very unrealistic case, since invariably len1 == len2 without the optimization ever working out, whereas the case that benefits [2] is quite representative. As I'm sure you were expecting, I still favor pursuing this additional optimization. If you think I've been unfair or not thorough, I'm happy to look at other cases. 
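For readers following along, the optimization being benchmarked above can be sketched as follows. This is a hedged stand-in for the text comparator, not varstr_cmp itself: strcmp plays the role of the expensive strcoll() path (plus the NUL-termination copies it requires), so the sketch is locale-independent.

```c
#include <assert.h>
#include <string.h>

/* Stands in for the strcoll() path: locale-aware comparison plus the cost
 * of NUL-terminating copies of both inputs in the real code. */
static int expensive_collate_cmp(const char *a, const char *b)
{
    return strcmp(a, b);
}

/* The opportunistic "len1 == len2 thing": equal content implies equal
 * length, so the fast path can only ever fire when lengths match, and when
 * it misses the only wasted work is one memcmp over memory we were about
 * to read anyway. */
static int text_cmp_with_fastpath(const char *a, size_t alen,
                                  const char *b, size_t blen)
{
    if (alen == blen && memcmp(a, b, alen) == 0)
        return 0;
    return expensive_collate_cmp(a, b);
}
```

Heikki's pessimal case is precisely the input where `alen == blen` always holds but the memcmp never returns 0, so every comparison pays for both paths - hence the small regression measured above.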
Also, I'm not sure that you accept that I'm justified in considering this a separate question from the more important question of what to do in the tie-breaker abbreviation case (where we may be almost certain that equality will be indicated by a memcmp()). If you don't accept that I'm right about that more important case, I guess that means that you don't have confidence in my ad-hoc cost model (the HyperLogLog/cardinality stuff). [1] http://www.postgresql.org/message-id/CAM3SWZR9dtGO+zX4VEn7GTW2=+umSNq=c57sjgxg8oqhjar...@mail.gmail.com [2] http://www.postgresql.org/message-id/cam3swzqtyv3kp+cakzjzv3rwb1ojjahwpcz9coyjxpkhbtc...@mail.gmail.com -- Peter Geoghegan
Re: [HACKERS] RLS Design
On Thu, Sep 11, 2014 at 3:08 PM, Stephen Frost sfr...@snowman.net wrote: 2. Row level security policies can exist for a table with DISABLE ROW LEVEL SECURITY in effect, but they don't do anything until RLS is enabled. A possible advantage of this approach is that you could *temporarily* shut off RLS for a table without having to drop all of your policies and put them back. I kind of like this approach; we have something similar for triggers, and I think it could be useful to people. I like the idea of being able to turn them off without dropping them. We have that with row_security = off, but that would only work for the owner or a superuser (or a user with bypassrls). This would allow disabling RLS temporairly for everything accessing the table. The one thing I'm wondering about with this design is- what happens when a policy is initially added? Currently, we automatically turn on RLS for the table when that happens. I'm not thrilled with the idea that you have to add policies AND turn on RLS explicitly- someone might add policies but then forget to turn RLS on.. Whoa. I think that's a bad idea. I think the default value for RLS should be disabled, and users should have to turn it on explicitly if they want to get it. It's arguable whether the behavior if you try to create a policy beforehand should be (1) outright failure or (2) command accepted but no effect, but I think (3) automagically enable the feature is a POLA violation. When somebody adds a policy and then drops it again, they will expect to be back in the same state they started out in, and for good reason. If we want to be able to disable RLS w/o dropping the policies, then I think we have to completely de-couple the two and users would then have both add policies AND turn on RLS to have RLS actually be enabled for a given table. I'm on the fence about that. Thoughts? A strong +1 for doing just that. 
Look, anybody who is going to use row-level security but isn't careful enough to verify that it's actually working as desired after configuring it is a lost cause anyway. That is the moral equivalent of a locksmith who comes out and replaces a lock for you and at no point while he's there does he ever close the door and verify that it latches and won't reopen. I'm sure somebody has done that, but if a security breach results, surely everybody would agree that the locksmith is at fault, not the lock manufacturer. Personally, I have to test every GRANT and REVOKE I issue, because there's no error for granting a privilege that the target already has or revoking one they don't, and with group membership and PUBLIC it's quite easy to have not done what you thought you did. Fixing that might be worthwhile but it doesn't take away from the fact that, like any other configuration change you make, security-relevant changes need to be tested. There is another possible advantage of the explicit-enable approach as well, which is that you might want to create several policies and then turn them all on at once. With what you have now, creating the first policy will enable RLS on the table and then everyone who wasn't the beneficiary of that initial policy is locked out. Now, granted, you can probably get around that by doing all of the operations in one transaction, so it's a minor point. But it's still nice to think about being able to add several policies and then flip them on. If it doesn't work out, flip them off, adjust, and flip them back on again. Now, again, the core design issue, IMHO, is that the switch from default-allow to default-deny should be explicit and unmistakable, so the rest of this is just tinkering around the edges. But we might as well make those edges as nice as possible, and the usability of this approach feels good to me. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] RLS Design
* Robert Haas (robertmh...@gmail.com) wrote: On Thu, Sep 11, 2014 at 3:08 PM, Stephen Frost sfr...@snowman.net wrote: The one thing I'm wondering about with this design is- what happens when a policy is initially added? Currently, we automatically turn on RLS for the table when that happens. I'm not thrilled with the idea that you have to add policies AND turn on RLS explicitly- someone might add policies but then forget to turn RLS on.. Whoa. I think that's a bad idea. I think the default value for RLS should be disabled, and users should have to turn it on explicitly if they want to get it. It's arguable whether the behavior if you try to create a policy beforehand should be (1) outright failure or (2) command accepted but no effect, but I think (3) automagically enable the feature is a POLA violation. When somebody adds a policy and then drops it again, they will expect to be back in the same state they started out in, and for good reason. Yeah, that I can agree with. Prior to adding the ability to explicitly enable RLS, that's what they got, but that's changed now that we've made the ability to turn on/off RLS half-way independent of policies. Also.. If we want to be able to disable RLS w/o dropping the policies, then I think we have to completely de-couple the two and users would then have both add policies AND turn on RLS to have RLS actually be enabled for a given table. I'm on the fence about that. A strong +1 for doing just that. Look, anybody who is going to use row-level security but isn't careful enough to verify that it's actually working as desired after configuring it is a lost cause anyway. I had been thinking the same, which is why I was on the fence about if it was really an issue or not. This all amounts to actually making the patch smaller also, which isn't a bad thing. 
Personally, I have to test every GRANT and REVOKE I issue, because there's no error for granting a privilege that the target already has or revoking one they don't, and with group membership and PUBLIC it's quite easy to have not done what you thought you did. Fixing that might be worthwhile but it doesn't take away from the fact that, like any other configuration change you make, security-relevant changes need to be tested. Hmm, pretty sure that'd end up going against the spec too, but that's a whole different discussion anyway. There is another possible advantage of the explicit-enable approach as well, which is that you might want to create several policies and then turn them all on at once. With what you have now, creating the first policy will enable RLS on the table and then everyone who wasn't the beneficiary of that initial policy is locked out. Now, granted, you can probably get around that by doing all of the operations in one transaction, so it's a minor point. But it's still nice to think about being able to add several policies and then flip them on. If it doesn't work out, flip them off, adjust, and flip them back on again. Now, again, the core design issue, IMHO, is that the switch from default-allow to default-deny should be explicit and unmistakable, so the rest of this is just tinkering around the edges. But we might as well make those edges as nice as possible, and the usability of this approach feels good to me. Fair enough. Thanks! Stephen
Re: [HACKERS] Stating the significance of Lehman Yao in the nbtree README
On Tue, Sep 9, 2014 at 12:01 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: + Lehman and Yao don't require read locks, but assume that in-memory + copies of tree pages are unshared. This is the existing paragraph, just moved to different place in the README. That's right - it seemed to make just as much sense to talk about that here, while doing so establishes the context of talking about how what we do differs from the canonical algorithm (which I think is true of real-world LY implementations generally). Which makes the next paragraph easier to understand: + Although it could be argued that Lehman and Yao isn't followed to the + letter because single pages are read locked as the tree is descended, + this is at least necessary to support deletion, a common requirement + which LY hardly acknowledge. Read locks also ensure that B-tree + pages are self-consistent (LY appear to assume atomic page reads and + writes). This is just duplicating the existing paragraph. I don't see the point of this. The point is to make clear the reason for the differences - evidently, Amit felt it was unclear why we don't follow LY. I am suggesting that LY's talk of having no read locks is unrealistic (it might be realistic in the 21st century, but that's a whole other story). Even with these read locks, following LY obviates the need + of earlier schemes to hold multiple locks concurrently when descending + the tree as part of servicing index scans (pessimistic lock coupling). + The addition of right-links at all levels, as well as the addition of + a page high key allows detection and dynamic recovery from + concurrent page splits (that is, splits between unlocking an internal + page, and subsequently locking its child page during a descent). This explains in a few sentences what a LY B-tree looks like. The current README assumes that you're already familiar with what a LY tree looks like, or that you go read the paper mentioned at the top of the README. 
It might be a good idea to expand on that, and add an introduction like this so that an unfamiliar reader doesn't need to read the L&Y paper first. Is that the purpose of this patch? Please make it more explicit. Yes, although even L&Y don't get around to telling the reader why they should care, the mistake that many good papers make. We now know its significance, both in general and to Postgres. Framing the discussion like this aids understanding more than you'd think. And please make the sentences simpler - the above reads like a Shakespeare play. Out, damned lock! Something like: The basic Lehman & Yao Algorithm Compared to a classic B-tree, L&Y adds a right-link pointer to each page, to the page's right sibling. It also adds a high key to each page, which is an upper bound on the keys that are allowed on that page. These two additions make it possible to detect a concurrent page split, which allows the tree to be searched without holding any read locks (except to keep a single page from being modified while reading it). When a search follows a downlink to a child page, it compares the page's high key with the search key. If the search key is greater than the high key, the page must've been split concurrently, and you must follow the right-link to find the new page containing the key range you're looking for. This might need to be repeated, if the page has been split more than once. Differences to the Lehman & Yao algorithm = (current Lehman and Yao Algorithm and Insertions section) I think that's pretty much the same information you added, but it's in a separate section, with the clear purpose that it explains what an L&Y tree looks like. You can skip over it, if you have read the paper or are otherwise already familiar with it. It still assumes that you're familiar with B-trees in general. That seems fair enough - I'd just expand on why we don't completely avoid read locks, which L&Y suppose we can get away with. That is clearly a point of confusion.
Anyway, I see that you had resurrected this in the commitfest app after three weeks of inactivity. I'm going to mark this back to Returned with Feedback. Please don't resurrect it again, this patch has received more than its fair share of attention. I didn't mean to suggest that it deserves much attention. I didn't know how to interpret the fact that you changed the status, since in fact not much additional work was required. I was busy throughout, for reasons that are perhaps obvious. I am not fussed about when this happens, but I really think we should get around to it. Instead, please help by signing up to review a patch. The commitfest progress is glacial at the moment, and we really need experienced reviewers like you to bring closure to people's patches. I'm back from holidays now. I plan to look at Tomas Vondra's patch. If you think I should look at some particular
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Thu, Sep 11, 2014 at 4:13 PM, Peter Geoghegan p...@heroku.com wrote: On Wed, Sep 10, 2014 at 11:36 AM, Robert Haas robertmh...@gmail.com wrote: No, not really. All you have to do is write a little test program to gather the information. I don't think a little test program is useful - IMV it's too much of a simplification to suppose that a strcoll() has a fixed cost, and a memcmp() has a fixed cost, and that we can determine algebraically that we should (say) proceed or not proceed with the additional opportunistic memcmp() == 0 optimization based solely on that. I'm not sure if that's what you meant, but it might have been. I think I said pretty clearly that it was. However, I think it's perfectly fair to consider a case where the opportunistic memcmp() == 0 thing never works out (as opposed to mostly not helping, which is what I considered earlier [1]), as long as we're sorting real tuples. You mentioned Heikki's test case; it seems fair to consider that, but for the non-abbreviated case where the additional, *totally* opportunistic memcmp == 0 optimization only applies (so no abbreviated keys), while still having the additional optimization be 100% useless. Clearly that test case is also just about perfectly pessimal for this case too. (Recall that Heikki's test case shows performance on pre-sorted input, so there are far fewer comparisons than would typically be required anyway - n comparisons, or a bubble sort best case. If I wanted to cheat, I could reduce work_mem so that an external tape sort is used, since as it happens tapesort doesn't opportunistically check for pre-sorted input, but I won't do that. Heikki's case both emphasizes the amortized cost of a strxfrm() where we abbreviate, and in this instance de-emphasizes the importance of memory latency by having access be sequential/predictable.)
The only variation I'm adding here to Heikki's original test case is to have a leading int4 attribute that always has a value of 1 -- that conveniently removes abbreviation (including strxfrm() overhead) as a factor that can influence the outcome, since right now that isn't under consideration. So:

create table sorttest (dummy int4, t text);
insert into sorttest select 1, 'foobarfo' || (g) || repeat('a', 75) from generate_series(1, 3) g;

Benchmark:

pg@hamster:~/tests$ cat heikki-sort.sql
select * from (select * from sorttest order by dummy, t offset 100) f;

pgbench -f heikki-sort.sql -T 100 -n

With optimization enabled:
tps = 77.861353 (including connections establishing)
tps = 77.862260 (excluding connections establishing)
tps = 78.211053 (including connections establishing)
tps = 78.212016 (excluding connections establishing)
tps = 77.996117 (including connections establishing)
tps = 77.997069 (excluding connections establishing)

With optimization disabled (len1 == len2 thing is commented out):
tps = 78.719389 (including connections establishing)
tps = 78.720366 (excluding connections establishing)
tps = 78.764819 (including connections establishing)
tps = 78.765712 (excluding connections establishing)
tps = 78.472902 (including connections establishing)
tps = 78.473844 (excluding connections establishing)

So, yes, it looks like I might have just about regressed this case - it's hard to be completely sure. However, this is still a very unrealistic case, since invariably len1 == len2 without the optimization ever working out, whereas the case that benefits [2] is quite representative. As I'm sure you were expecting, I still favor pursuing this additional optimization. Well, I have to agree that doesn't look too bad, but your reluctance to actually do the microbenchmark worries me. Granted, macrobenchmarks are what actually matters, but they can be noisy and there can be other confounding factors.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_upgrade and epoch
On Wed, Sep 10, 2014 at 02:24:17AM +0100, Greg Stark wrote: On Tue, Sep 9, 2014 at 4:05 PM, Bruce Momjian br...@momjian.us wrote: Yes, I did think about that, but it seems like a behavior change. However, it is tempting to avoid future bug reports about this. When this came up in March, Tom and I agreed that this wasn't something we wanted to slip into 9.4. Given that, it is hard to argue we should now slip this into 9.5, 9.4, and 9.3, so unless someone else votes for inclusion, I think I will leave this as 9.5-only. With no one replying, I will consider this issue closed and not backpatch this. I think the reason nobody's responding is because nobody has anything significant to add. It's a behavior change from not-working to working. Why wouldn't it be backpatched? OK, Greg seems to be passionate about this. Does anyone _object_ to my back-patching the epoch preservation fix through 9.3? Tom? The patch is commit a74a4aa23bb95b590ff01ee564219d2eacea3706. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] pg_upgrade and epoch
Bruce Momjian br...@momjian.us writes: On Wed, Sep 10, 2014 at 02:24:17AM +0100, Greg Stark wrote: I think the reason nobody's responding is because nobody has anything significant to add. It's a behavior change from not-working to working. Why wouldn't it be backpatched? OK, Greg seems to be passionate about this. Does anyone _object_ to my back-patching the epoch preservation fix through 9.3? Tom? Not I. This is a data-loss bug fix, no? Why would we not back-patch it? regards, tom lane