Re: [HACKERS] Scaling shared buffer eviction
On Tue, Sep 9, 2014 at 12:16 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, Sep 5, 2014 at 9:19 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila amit.kapil...@gmail.com wrote: Apart from above, I think for this patch, a catversion bump is required as I have modified the system catalog. However I have not done the same in the patch as otherwise it will be a bit difficult to take performance data. One regression failed on Linux due to a spacing issue, which is fixed in the attached patch. I took another read through this patch. Here are some further review comments: 1. In BgMoveBuffersToFreelist(), it seems quite superfluous to have both num_to_free and tmp_num_to_free. num_to_free is used to accumulate the total number of buffers that are freed in one cycle of BgMoveBuffersToFreelist(), which is reported for debug info (BGW_DEBUG), and tmp_num_to_free is used as a temporary number which is a count of the number of buffers to be freed in one sub-cycle (inner while loop). I'd get rid of tmp_num_to_free and move the declaration of num_to_free inside the outer loop. I'd also move the definitions of tmp_next_to_clean, tmp_recent_alloc, tmp_recent_backend_clocksweep into the innermost scope in which they are used. Okay, I have moved the tmp_* variables into the innermost scope. 2. Also in that function, I think the innermost bit of logic could be rewritten more compactly, and in such a way as to make it clearer for what set of instructions the buffer header will be locked. LockBufHdr(bufHdr); if (bufHdr->refcount == 0) { if (bufHdr->usage_count > 0) bufHdr->usage_count--; else add_to_freelist = true; } UnlockBufHdr(bufHdr); if (add_to_freelist && StrategyMoveBufferToFreeListEnd(bufHdr)) num_to_free--; Changed as per suggestion. 3. This comment is now obsolete: + /* +* If bgwriterLatch is set, we need to waken the bgwriter, but we should +* not do so while holding freelist_lck; so set it after releasing the +* freelist_lck.
This is annoyingly tedious, but it happens at most once +* per bgwriter cycle, so the performance hit is minimal. +*/ + We're not actually holding any lock in need of releasing at that point in the code, so this can be shortened to If bgwriterLatch is set, we need to waken the bgwriter. Changed as per suggestion. * Ideally numFreeListBuffers should get called under freelist spinlock, That doesn't make any sense. numFreeListBuffers is a variable, so you can't call it. The value should be *read* under the spinlock, but it is. I think this whole comment can be deleted and replaced with If the number of free buffers has fallen below the low water mark, awaken the bgreclaimer to repopulate it. Changed as per suggestion. 4. StrategySyncStartAndEnd() is kind of a mess. One, it can return the same victim buffer that's being handed out - at almost the same time - to a backend running the clock sweep; if it does, they'll fight over the buffer. Two, the *end out parameter actually returns a count, not an endpoint. I think we should have BgMoveBuffersToFreelist() call StrategySyncNextVictimBuffer() at the top of the inner loop rather than the bottom, and change StrategySyncStartAndEnd() so that it knows nothing about victimbuf_lck. Let's also change StrategyGetBuffer() to call StrategySyncNextVictimBuffer() so that the logic is centralized in one place, and rename StrategySyncStartAndEnd() to something that better matches its revised purpose. Changed as per suggestion. I have also updated StrategySyncNextVictimBuffer() such that it increments completePasses on completion of cycle as I think it is appropriate to update it, even when clock sweep is done by bgreclaimer. Maybe StrategyGetReclaimInfo(). I have changed it to StrategyGetFreelistAccessInfo() as it seems most other functions in freelist.c uses the names to sound something related to buffers. 5. Have you tested that this new bgwriter statistic is actually working? 
Because it looks to me like BgMoveBuffersToFreelist is changing BgWriterStats but never calling pgstat_send_bgwriter(), which I'm thinking will mean the counters accumulate forever inside the reclaimer but never get forwarded to the stats collector. pgstat_send_bgwriter() is called in the bgreclaimer loop (the caller of BgMoveBuffersToFreelist; this is similar to what bgwriter does). I have done a few tests with it before sending the previous patch. 6. StrategyInitialize() uses #defines for HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT and LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT but inline constants (5, 2000) for clamping. Let's have constants for all of those (and omit mention of the specific values in the comments). Changed as per suggestion. 7. The line you've added to the definition of view pg_stat_bgwriter doesn't seem to be indented the same way as all the others. Tab vs. space problem? Fixed. Performance Data:
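The locking discipline Robert proposes in point 2 — decide under the buffer-header lock, but touch the freelist only after releasing it — can be sketched as a standalone program (hypothetical, heavily simplified stand-ins; the real code uses a spinlock via LockBufHdr/UnlockBufHdr and a shared freelist):

```c
#include <stdbool.h>

/* Simplified stand-in for a PostgreSQL buffer header. */
typedef struct BufDesc
{
    int  refcount;      /* pins held by backends */
    int  usage_count;   /* clock-sweep usage counter */
    bool locked;        /* stand-in for the buffer-header spinlock */
} BufDesc;

static void LockBufHdr(BufDesc *b)   { b->locked = true;  }
static void UnlockBufHdr(BufDesc *b) { b->locked = false; }

/* Hypothetical freelist push; returns true if the buffer was added. */
static bool
StrategyMoveBufferToFreeListEnd(BufDesc *b)
{
    (void) b;
    return true;
}

/*
 * One step of the reclaimer's inner loop, in the compact form Robert
 * suggests: decrement usage_count under the header lock, and only
 * touch the freelist after releasing it.  Returns true if the buffer
 * was moved to the freelist (and decrements *num_to_free).
 */
static bool
reclaim_step(BufDesc *bufHdr, int *num_to_free)
{
    bool add_to_freelist = false;

    LockBufHdr(bufHdr);
    if (bufHdr->refcount == 0)
    {
        if (bufHdr->usage_count > 0)
            bufHdr->usage_count--;
        else
            add_to_freelist = true;
    }
    UnlockBufHdr(bufHdr);

    if (add_to_freelist && StrategyMoveBufferToFreeListEnd(bufHdr))
    {
        (*num_to_free)--;
        return true;
    }
    return false;
}
```

The point of the shape is that the header lock covers only the refcount/usage_count inspection, never the freelist operation.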
Re: [HACKERS] Scaling shared buffer eviction
On Wed, Sep 10, 2014 at 5:46 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: On 05/09/14 23:50, Amit Kapila wrote: On Fri, Sep 5, 2014 at 8:42 AM, Mark Kirkwood FWIW below are some test results on the 60 core beast with this patch applied to 9.4. I'll need to do more runs to iron out the variation, but it looks like the patch helps the standard (write heavy) pgbench workload a little, and clearly helps the read only case. Thanks for doing the test. I think if possible you can take the performance data with higher scale factor (4000) as it seems your m/c has 1TB of RAM. You might want to use latest patch I have posted today. Here's some fairly typical data from read-write and read-only runs at scale 4000 for 9.4 beta2 with and without the v7 patch (below). I'm not seeing much variation between repeated read-write runs with the same config (which is nice - sleep 30 and explicit checkpoint call between each one seem to help there). Interestingly, I note anecdotally that (unpatched) 9.4 beta2 seems to be better at higher client counts than beta1 was... In terms of the effect of the patch - looks pretty similar to the scale 2000 results for read-write, but read-only is a different and more interesting story - unpatched 9.4 is noticeably impacted in the higher client counts, whereas the patched version scales as well (or even better perhaps) than in the scale 2000 case. Yeah, that's what I was expecting, the benefit of this patch will be more at higher client count when there is large data and all the data can fit in RAM . Many thanks for doing the performance test for patch. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] FD_SETSIZE on Linux?
On 10 September 2014 00:21, Tom Lane t...@sss.pgh.pa.us wrote: Thom Brown t...@linux.com writes: I find this in pgbench.c: #ifdef FD_SETSIZE #define MAXCLIENTS (FD_SETSIZE - 10) #else #define MAXCLIENTS 1024 #endif FD_SETSIZE is supposed to be defined, according to the POSIX spec: The sys/select.h header shall define the following symbolic constant, which shall have a value suitable for use in #if preprocessing directives: FD_SETSIZE Maximum number of file descriptors in an fd_set structure. It looks like Linux sets it to 1024. On RHEL6, at least, I find this: $ grep -r FD_SETSIZE /usr/include /usr/include/linux/posix_types.h:#undef __FD_SETSIZE /usr/include/linux/posix_types.h:#define __FD_SETSIZE 1024 ... /usr/include/sys/select.h:#define FD_SETSIZE __FD_SETSIZE ... Ah yes, I have the same on Debian: /usr/include/linux/posix_types.h:#undef __FD_SETSIZE /usr/include/linux/posix_types.h:#define __FD_SETSIZE 1024 ... usr/include/x86_64-linux-gnu/sys/select.h:#define FD_SETSIZE __FD_SETSIZE /usr/include/x86_64-linux-gnu/bits/typesizes.h:#define __FD_SETSIZE 1024 ... I didn't think to look beyond Postgres' code. #ifdef WIN32 #define FD_SETSIZE 1024 /* set before winsock2.h is included */ #endif /* ! WIN32 */ Windows probably hasn't got sys/select.h at all, so it may not provide this symbol. Interestingly, it looks like POSIX also requires sys/time.h to define FD_SETSIZE. I wonder whether Windows has that header? It'd definitely be better to get this symbol from the system than assume 1024 will work. Okay, this now makes sense. It just takes the system value and reduces it by 10 to get the MAXCLIENTS value. Thanks for the explanation. Thom
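The pattern Tom describes — take the system's FD_SETSIZE when the header provides it, minus a slop of 10 descriptors, else fall back to 1024 — mirrors the pgbench.c #ifdef and can be checked in isolation (this is a sketch shaped like the pgbench code, not copied from it):

```c
#include <sys/select.h>   /* POSIX: shall define FD_SETSIZE */

/*
 * Same fallback dance as pgbench.c: prefer the system's fd_set limit,
 * reserving 10 descriptors for stdin/stdout/logs etc.
 */
#ifdef FD_SETSIZE
#define MAXCLIENTS (FD_SETSIZE - 10)
#else
#define MAXCLIENTS 1024
#endif

static int
max_clients(void)
{
    /* On glibc FD_SETSIZE is 1024, so this comes out as 1014. */
    return MAXCLIENTS;
}
```

Since POSIX requires sys/select.h to define FD_SETSIZE, the #else branch is effectively a safety net for non-POSIX toolchains such as Windows, where the thread notes pgbench defines FD_SETSIZE itself before including winsock2.h.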
Re: [HACKERS] LIMIT for UPDATE and DELETE
On 2014-09-10 04:25, Etsuro Fujita wrote: (2014/09/09 18:57), Heikki Linnakangas wrote: What's not clear to me is whether it makes sense to do 1) without 2)? Is UPDATE .. LIMIT without support for an ORDER BY useful enough? And if we apply this patch now, how much of it needs to be rewritten after 2)? If the answers are yes and not much, then we should review this patch now, and put 2) on the TODO list. Otherwise 2) should be done first. My answers are yes but completely rewritten. Any particular reason for you to say that? Because an UPDATE might have a RETURNING clause, all the updated tuples have to go through the ModifyTable node one at a time. I don't see why we couldn't LIMIT there after implementing #2. .marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] replication commands and log_statements
On 08/28/2014 11:38 AM, Fujii Masao wrote: On Thu, Jun 19, 2014 at 5:29 PM, Ian Barwick i...@2ndquadrant.com wrote: - minor rewording for the description, mentioning that statements will still be logged as DEBUG1 even if parameter set to 'off' (might prevent reports of the kind I set it to 'off', why am I still seeing log entries?). para Causes each replication command to be logged in the server log. See xref linkend=protocol-replication for more information about replication commands. The default value is literaloff/. When set to literaloff/, commands will be logged at log level literalDEBUG1/literal. Only superusers can change this setting. /para Yep, fixed. Attached is the updated version of the patch. I don't think it's necessary to mention that the commands will still be logged at DEBUG1 level. We log all kinds of crap at the various DEBUG levels, and they're not supposed to be used in normal operation. - I feel it would be more consistent to use the plural form for this parameter, i.e. log_replication_commands, in line with log_lock_waits, log_temp_files, log_disconnections etc. But log_statement is in the singular form. So I just used log_replication_command. For the consistency, maybe we need to change both parameters in the plural form? I don't have strong opinion about this. Yeah, we seem to be inconsistent. log_replication_commands would sound better to me in isolation, but then there is log_statement.. I'll mark this as Ready for Committer in the commitfest app (I assume you'll take care of committing this yourself when ready). - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
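The behaviour agreed above — log replication commands at LOG when the new GUC is on, and keep the old DEBUG1 logging otherwise — reduces to picking an elog level from a boolean. A minimal sketch (stand-alone; the level values match elog.h, but the function name is hypothetical):

```c
#include <stdbool.h>

/* Stand-ins for PostgreSQL's message levels (values as in elog.h). */
#define DEBUG1 14
#define LOG    15

static bool log_replication_commands = false;   /* the proposed GUC */

/*
 * Level at which a received replication command is logged: LOG when
 * the GUC is enabled, DEBUG1 otherwise, so the pre-patch behaviour
 * is preserved when the parameter is off.
 */
static int
replication_command_log_level(void)
{
    return log_replication_commands ? LOG : DEBUG1;
}
```

In the actual patch this appears inline as `ereport(log_replication_commands ? LOG : DEBUG1, ...)` in exec_replication_command().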
Re: [HACKERS] Scaling shared buffer eviction
On 10/09/14 18:54, Amit Kapila wrote: On Wed, Sep 10, 2014 at 5:46 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz mailto:mark.kirkw...@catalyst.net.nz wrote: In terms of the effect of the patch - looks pretty similar to the scale 2000 results for read-write, but read-only is a different and more interesting story - unpatched 9.4 is noticeably impacted in the higher client counts, whereas the patched version scales as well (or even better perhaps) than in the scale 2000 case. Yeah, that's what I was expecting, the benefit of this patch will be more at higher client count when there is large data and all the data can fit in RAM . Many thanks for doing the performance test for patch. No worries, this is looking like a patch we're going to apply to 9.4 to make the 60 core beast scale a bit better, so thanks very much for your work in this area. If you would like more tests run at higher scales let me know (we have two of these machines at pre-production state currently so I can run benchmarks as reqd)! Regards Mark -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
(2014/09/10 12:31), Fujii Masao wrote: On Wed, Sep 10, 2014 at 12:15 PM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: (2014/09/09 22:17), Fujii Masao wrote: Attached is the updated version of the patch. I took a quick review of the patch. It looks good to me, but one thing I'm concerned about is You wrote: The attached patch introduces the GIN index storage parameter PENDING_LIST_CLEANUP_SIZE which specifies the maximum size of the GIN pending list. If it's not set, work_mem is used as that maximum size, instead. So this patch doesn't break existing applications which currently use work_mem as the threshold of the cleanup operation of the pending list. They only have to leave PENDING_LIST_CLEANUP_SIZE unset. As you mentioned, I think it's important to consider the existing applications, but I'm wondering if it would be a bit confusing to users to have two parameters, Yep. PENDING_LIST_CLEANUP_SIZE and work_mem, for this setting. Wouldn't it be simpler to have only one parameter, PENDING_LIST_CLEANUP_SIZE? How about setting PENDING_LIST_CLEANUP_SIZE to work_mem as the default value when running the CREATE INDEX command? That's an idea. But there might be some users who want to change the cleanup size per session like they can do by setting work_mem, and your idea would prevent them from doing that... Why not use ALTER INDEX ... SET (PENDING_LIST_CLEANUP_SIZE= ...)? Maybe I'm missing something, though. So what about introducing pending_list_cleanup_size also as a GUC? That is, users can set the threshold by using either the reloption or the GUC. Yeah, that's an idea. So, I'd like to hear the opinions of others. Thanks, Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
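The fallback semantics under discussion — use the per-index reloption when set, else fall back to work_mem — is the common PostgreSQL pattern of a reloption whose default of -1 means "inherit the GUC". A rough sketch under assumed names (GinOptions and the field name here are hypothetical stand-ins, not the patch's actual structs):

```c
#include <stddef.h>

/* Session-level GUC fallback, in KB (work_mem's default is 4MB). */
static int work_mem = 4096;

typedef struct GinOptions
{
    int pending_list_cleanup_size;   /* -1 means "not set" */
} GinOptions;

/*
 * Effective pending-list cleanup threshold for one index: the
 * reloption wins if it was set; otherwise the session GUC applies.
 */
static int
gin_pending_list_limit(const GinOptions *opts)
{
    if (opts != NULL && opts->pending_list_cleanup_size >= 0)
        return opts->pending_list_cleanup_size;   /* per-index reloption */
    return work_mem;                              /* session-level fallback */
}
```

This shape is what lets existing applications keep their work_mem-driven behaviour simply by never setting the reloption, while still allowing per-index (or, with the proposed GUC, per-session) overrides.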
Re: [HACKERS] add modulo (%) operator to pgbench
Hello Robert, Sure, and I would have looked at that patch and complained that you were implementing a modulo operator with different semantics than C. And then we'd be right back where we are now. [...] Probably. To be clear about my intent, which is a summary of what you already know: I would like to be able to generate a reasonable non-uniform throttled workload with pgbench. I do think such a feature is worth having for anybody who would like to do something realistic with pgbench, and that it is in the general interest of postgresql to have such features. In particular, given the current state of abysmal performance under some trivial loads with pg, I really think that it is worth having a better tool, and I think that my effort with rate and progress helped to put these hidden problems into the light, and I do hope that they are going to be solved quickly. Now if I submitted a big patch with all the necessary features in it, someone would ask to break it into pieces. So they are submitted one by one (progress, throttling, limit, modulo, ...). Note I did not start with the non-uniform stuff, but Mitsumasa-san sent a gaussian distribution patch and I jumped onto the wagon to complement it with an exponential distribution patch. I knew when doing it that it was not enough, but as I said, one piece at a time, given the effort required to pass a simple patch. What is still needed for the overall purpose is the ability to scatter the distribution. This is really: (1) a positive remainder modulo, which is the trivial patch under discussion (2) a fast hash function of good-enough quality for the purpose, also a rather small patch, not submitted yet. (3) maybe the '|' operator to do a TP*-like non-uniform load, which is really periodic so I do not like it. A trivial patch, not submitted yet.
If you do not want one of these pieces (1 and 2), basically the interest of the gaussian/exponential addition is much reduced, and this is just a half-baked effort aborted because you did not want what was required to make it useful. Well, I can only disagree, but you are a committer and I'm not! -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
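The "positive remainder modulo" of point (1) differs from C's built-in `%`, which (since C99) truncates toward zero and therefore yields negative remainders for negative operands. A floored/Euclidean variant — the semantics the pgbench operator would need — is a two-liner:

```c
/*
 * C's % truncates toward zero, so -7 % 3 == -1 (C99 and later).
 * This variant always returns a result in [0, m) for m > 0, which
 * is what you want when mapping scattered values onto key ranges.
 */
static long
positive_mod(long n, long m)
{
    long r = n % m;
    return (r < 0) ? r + m : r;
}
```

With this, e.g. positive_mod(-7, 3) is 2 rather than -1, so hashed or shifted values land uniformly in a valid id range.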
Re: [HACKERS] [TODO] Process pg_hba.conf keywords as case-insensitive
Hello, I had a closer look at this patch. Finally I think that we need case-insensitive versions of get_role_id() and get_database_id() to accomplish this patch's objective. (This runs full scans on pg_database or pg_authid.) And I'd like to propose to change the token categorization from notation-based to how-to-treat-based. Concretely, this patch categorizes tokens using 'special quote is used' and 'quote from the first', but it seems to make the logic clearer to categorize them using 'case sensitive or not' and 'it represents a group name'. The attached patch is a revised version of your original patch with regard to the above point. (Sorry in advance that this is a quick hack; especially the code related to file inclusion is not tested at all.) I have tested this only at a superficial level, but it seems to work as expected. Under the new specifications, next_token will work as follows: - USER : token: USER , case-insensitive - USeR: token: USeR , case-SENSITIVE - +uSeR : token: +uSeR , case-SENSITIVE - +UsER : token: +UsEr , case-insensitive - USeR : token: USeR , case-insensitive - +USER : token: USER , case-insensitive, group_name - +uSeR : token: uSeR , case_SENSITIVE, group_name - +UsEr : token: UsEr , case-insensitive, group_name - + : token: + , (useless?) - @ : token: @ , (useless?) - @hoge: token: hoge, file_inclusion (not confirmed) There's a concern that case-insensitive matching is accomplished by a full scan on pg_database or pg_authid, so it would be slower than case-sensitive matching. This might not be acceptable to the community. And one known defect is that you will get a bit odd message if you put an hba line having keywords quoted or prefixed with '+', for example +locAl postgres +sUstRust The server complains for the line above that *| LOG: invalid connection type locAl | CONTEXT: line 84 of configuration file /home/horiguti/data/data_work/pg_hba.conf The prefixing '+' is omitted.
To correct this, either deparsing the token into the original string or storing the original string into tokens is needed, I think. What do you think about the changes, Viswanatham or all? regards, -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c index f480be8..db73dd9 100644 --- a/src/backend/commands/dbcommands.c +++ b/src/backend/commands/dbcommands.c @@ -1991,6 +1991,50 @@ get_database_oid(const char *dbname, bool missing_ok) return oid; } +/* + * get_database_oid_case_insensitive - given a database name, look up the + * OID in a case-insensitive manner. + * + * If missing_ok is false, throw an error if database name not found. If + * true, just return InvalidOid. + */ +Oid +get_database_oid_case_insensitive(const char *dbname, bool missing_ok) +{ + Relation relation; + SysScanDesc scandesc; + HeapTuple tuple; + Oid oid = InvalidOid; + + /* + * SysCache has no ability to do case-insensitive matching, so we have no + * means except scanning the whole systable.
 + */ + relation = heap_open(DatabaseRelationId, AccessShareLock); + + scandesc = systable_beginscan(relation, InvalidOid, false, + NULL, 0, NULL); + while (HeapTupleIsValid(tuple = systable_getnext(scandesc))) + { + Form_pg_database dbForm = (Form_pg_database) GETSTRUCT(tuple); + + if (pg_strcasecmp(dbname, dbForm->datname.data) == 0) + { + oid = HeapTupleGetOid(tuple); + break; + } + } + systable_endscan(scandesc); + heap_close(relation, AccessShareLock); + + if (!OidIsValid(oid) && !missing_ok) + ereport(ERROR, +(errcode(ERRCODE_UNDEFINED_DATABASE), + errmsg("database \"%s\" does not exist", + dbname))); + + return oid; +} /* * get_database_name - given a database OID, look up the name diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c index 84da823..2d3a059 100644 --- a/src/backend/libpq/hba.c +++ b/src/backend/libpq/hba.c @@ -60,9 +60,20 @@ typedef struct check_network_data bool result; /* set to true if match */ } check_network_data; - -#define token_is_keyword(t, k) (!t->quoted && strcmp(t->string, k) == 0) -#define token_matches(t, k) (strcmp(t->string, k) == 0) +typedef enum TokenType +{ + NORMAL, + GROUP_NAME, /* this token had leading '+' */ + FILE_INCLUSION, /* this token had leading '@' */ +} TokenType; + +#define token_is_keyword(tk, kw) \ + ((tk)->type != NORMAL || (tk)->case_sensitive ? false : \ + (pg_strcasecmp((tk)->string, (kw)) == 0)) +#define token_matches(t, k) \ + ((t)->type != NORMAL ? false : \ + ((t)->case_sensitive ? (strcmp((t)->string, (k)) == 0) : \ + (pg_strcasecmp((t)->string, (k)) == 0))) /* * A single string token lexed from the HBA config file, together with whether @@ -71,7 +82,8 @@ typedef struct check_network_data typedef struct HbaToken { char *string; - bool quoted; + TokenType type; + bool case_sensitive; } HbaToken; /* @@ -111,6 +123,7 @@ pg_isblank(const char
Re: [HACKERS] LIMIT for UPDATE and DELETE
(2014/09/10 16:57), Marko Tiikkaja wrote: On 2014-09-10 04:25, Etsuro Fujita wrote: (2014/09/09 18:57), Heikki Linnakangas wrote: What's not clear to me is whether it makes sense to do 1) without 2)? Is UPDATE .. LIMIT without support for an ORDER BY useful enough? And if we apply this patch now, how much of it needs to be rewritten after 2)? If the answers are yes and not much, then we should review this patch now, and put 2) on the TODO list. Otherwise 2) should be done first. My answers are yes but completely rewritten. Any particular reason for you to say that? Because an UPDATE might have a RETURNING clause, all the updated tuples have to go through the ModifyTable node one at a time. I don't see why we couldn't LIMIT there after implementing #2. The reason is because I think that after implementing #2, we should re-implement this feature by extending the planner to produce a plan tree such as ModifyTable+Limit+Append, maybe with LockRows below the Limit node. Such an approach would prevent duplication of the LIMIT code in ModifyTable, making the ModifyTable code simpler, I think. Thanks, Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [TODO] Process pg_hba.conf keywords as case-insensitive
Hmm... case-insensitive matching could get multiple matches, which should be an error, but I forgot to make it one. regards, 2014/09/10 17:54 Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp: And one known defect is that you will get a bit odd message if you put an hba line having keywords quoted or prefixed with '+', for example +locAl postgres +sUstRust The server complains for the line above that *| LOG: invalid connection type locAl | CONTEXT: line 84 of configuration file /home/horiguti/data/data_work/pg_hba.conf The prefixing '+' is omitted. To correct this, either deparsing the token into the original string or storing the original string into tokens is needed, I think. What do you think about the changes, Viswanatham or all? -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] [TODO] Process pg_hba.conf keywords as case-insensitive
On Sep10, 2014, at 10:54 , Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: Under the new specifications, next_token will work as following, - USER : token: USER , case-insensitive - USeR: token: USeR , case-SENSITIVE - +uSeR : token: +uSeR , case-SENSITIVE - +UsER : token: +UsEr , case-insensitive - USeR : token: USeR , case-insensitive - +USER : token: USER , case-insensitive, group_name - +uSeR : token: uSeR , case_SENSITIVE, group_name - +UsEr : token: UsEr , case-insensitive, group_name - + : token: + , (useless?) - @ : token: @ , (useless?) - @hoge: token: hoge, file_inclusion (not confirmed) There's a concern that Case-insensitive matching is accomplished by full-scan on pg_database or pg_authid so it would be rather slow than case-sensitive matching. This might not be acceptable by the community. That does indeed sound bad. Couldn't we handle this the same way we handle SQL identifiers, i.e. simply downcase unquoted identifiers, and then compare case-sensitively? So foo, Foo and FOO would all match the user called foo, but Foo would match the user called Foo, and FOO the user called FOO. An unquoted + would cause whatever follows it to be interpreted as a group name, whereas a quoted + would simply become part of the user name (or group name, if there's an additional unquoted + before it). So +foo would refer to the group foo, +FOO to the group FOO, and ++A to the group +A. I haven't checked if such an approach would be sufficiently backwards-compatible, though. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
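Florian's proposal mirrors SQL identifier handling: fold unquoted tokens to lower case once at parse time, then compare everything case-sensitively, so no catalog full scan is needed. The folding step might look like this (a hypothetical helper, ASCII-only for brevity; real PostgreSQL downcasing also has locale considerations):

```c
#include <ctype.h>

/*
 * Fold an hba token in place, SQL-identifier style: unquoted tokens
 * are lower-cased, quoted ones are kept verbatim.  After folding,
 * plain strcmp() suffices everywhere, so "FOO", Foo and foo all
 * match the role foo, while "Foo" matches only the role Foo.
 */
static void
fold_hba_token(char *tok, int quoted)
{
    if (quoted)
        return;                 /* "Foo" stays Foo */
    for (; *tok; tok++)
        *tok = (char) tolower((unsigned char) *tok);
}
```

Note this sidesteps the multiple-match problem raised upthread: since lookup is by the folded exact string, at most one catalog row can match.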
Re: [HACKERS] LIMIT for UPDATE and DELETE
On 9/10/14 11:25 AM, Etsuro Fujita wrote: The reason is because I think that after implementing #2, we should re-implement this feature by extending the planner to produce a plan tree such as ModifyTable+Limit+Append, maybe with LockRows below the Limit node. Such an approach would prevent duplication of the LIMIT code in ModifyTable, making the ModifyTable code more simple, I think. You can already change *this patch* to do ModifyTable - Limit - LockRows. If we think that's what we want, we should rewrite this patch right now. This isn't a reason not to implement LIMIT without ORDER BY. Like I said upthread, I think LockRows is a bad idea, but I'll need to run some performance tests first. But whichever method we decide to implement for this patch shouldn't need to be touched when the changes to UPDATE land, so your reasoning is incorrect. .marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
On Wed, Sep 10, 2014 at 5:35 PM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: (2014/09/10 12:31), Fujii Masao wrote: On Wed, Sep 10, 2014 at 12:15 PM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: (2014/09/09 22:17), Fujii Masao wrote: Attached is the updated version of the patch. I took a quick review of the patch. It looks good to me, but one thing I'm concerned about is You wrote: The attached patch introduces the GIN index storage parameter PENDING_LIST_CLEANUP_SIZE which specifies the maximum size of the GIN pending list. If it's not set, work_mem is used as that maximum size, instead. So this patch doesn't break existing applications which currently use work_mem as the threshold of the cleanup operation of the pending list. They only have to leave PENDING_LIST_CLEANUP_SIZE unset. As you mentioned, I think it's important to consider the existing applications, but I'm wondering if it would be a bit confusing to users to have two parameters, Yep. PENDING_LIST_CLEANUP_SIZE and work_mem, for this setting. Wouldn't it be simpler to have only one parameter, PENDING_LIST_CLEANUP_SIZE? How about setting PENDING_LIST_CLEANUP_SIZE to work_mem as the default value when running the CREATE INDEX command? That's an idea. But there might be some users who want to change the cleanup size per session like they can do by setting work_mem, and your idea would prevent them from doing that... Why not use ALTER INDEX ... SET (PENDING_LIST_CLEANUP_SIZE= ...)? Maybe I'm missing something, though. It takes an AccessExclusive lock and affects every session (not only the specified session). Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] LIMIT for UPDATE and DELETE
(2014/09/10 18:33), Marko Tiikkaja wrote: On 9/10/14 11:25 AM, Etsuro Fujita wrote: The reason is because I think that after implementing #2, we should re-implement this feature by extending the planner to produce a plan tree such as ModifyTable+Limit+Append, maybe with LockRows below the Limit node. Such an approach would prevent duplication of the LIMIT code in ModifyTable, making the ModifyTable code more simple, I think. You can already change *this patch* to do ModifyTable - Limit - LockRows. If we think that's what we want, we should rewrite this patch right now. I think it might be relatively easy to do that for single-table cases, but for inheritance cases, inheritance_planner needs work and I'm not sure how much work it would take ... Like I said upthread, I think LockRows is a bad idea, but I'll need to run some performance tests first. But whichever method we decide to implement for this patch shouldn't need to be touched when the changes to UPDATE land, so your reasoning is incorrect. Yeah, as you say, we need the performance tests, and I think it would depend on those results whether LockRows is a bad idea or not. Thanks, Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] LIMIT for UPDATE and DELETE
On 9/10/14 12:05 PM, Etsuro Fujita wrote: (2014/09/10 18:33), Marko Tiikkaja wrote: You can already change *this patch* to do ModifyTable - Limit - LockRows. If we think that's what we want, we should rewrite this patch right now. I think it might be relatively easy to do that for single-table cases, but for inheritance cases, inheritance_planner needs work and I'm not sure how much work it would take ... Uh. Yeah, I think I'm an idiot and you're right. I'll try and get some benchmarking done and see what comes out. .marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] replication commands and log_statements
Thanks for reviewing the patch! On Wed, Sep 10, 2014 at 4:57 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 08/28/2014 11:38 AM, Fujii Masao wrote: On Thu, Jun 19, 2014 at 5:29 PM, Ian Barwick i...@2ndquadrant.com wrote: - minor rewording for the description, mentioning that statements will still be logged as DEBUG1 even if parameter set to 'off' (might prevent reports of the kind I set it to 'off', why am I still seeing log entries?). para Causes each replication command to be logged in the server log. See xref linkend=protocol-replication for more information about replication commands. The default value is literaloff/. When set to literaloff/, commands will be logged at log level literalDEBUG1/literal. Only superusers can change this setting. /para Yep, fixed. Attached is the updated version of the patch. I don't think it's necessary to mention that the commands will still be logged at DEBUG1 level. We log all kinds of crap at the various DEBUG levels, and they're not supposed to be used in normal operation. Agreed. I removed that mention from the document. - I feel it would be more consistent to use the plural form for this parameter, i.e. log_replication_commands, in line with log_lock_waits, log_temp_files, log_disconnections etc. But log_statement is in the singular form. So I just used log_replication_command. For the consistency, maybe we need to change both parameters in the plural form? I don't have strong opinion about this. Yeah, we seem to be inconsistent. log_replication_commands would sound better to me in isolation, but then there is log_statement.. Agreed. I changed the GUC name to log_replication_commands. I'll mark this as Ready for Committer in the commitfest app (I assume you'll take care of committing this yourself when ready). Attached is the updated version of the patch. After at least one day I will commit the patch. 
Regards, -- Fujii Masao *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** *** 4558,4563 FROM pg_stat_activity; --- 4558,4579 /listitem /varlistentry + varlistentry id=guc-log-replication-commands xreflabel=log_replication_commands + termvarnamelog_replication_commands/varname (typeboolean/type) + indexterm +primaryvarnamelog_replication_commands/ configuration parameter/primary + /indexterm + /term + listitem +para + Causes each replication command to be logged in the server log. + See xref linkend=protocol-replication for more information about + replication command. The default value is literaloff/. + Only superusers can change this setting. +/para + /listitem + /varlistentry + varlistentry id=guc-log-temp-files xreflabel=log_temp_files termvarnamelog_temp_files/varname (typeinteger/type) indexterm *** a/doc/src/sgml/protocol.sgml --- b/doc/src/sgml/protocol.sgml *** *** 1306,1311 To initiate streaming replication, the frontend sends the --- 1306,1313 of literaltrue/ tells the backend to go into walsender mode, wherein a small set of replication commands can be issued instead of SQL statements. Only the simple query protocol can be used in walsender mode. + Replication commands are logged in the server log when + xref linkend=guc-log-replication-commands is enabled. Passing literaldatabase/ as the value instructs walsender to connect to the database specified in the literaldbname/ parameter, which will allow the connection to be used for logical replication from that database. *** a/src/backend/replication/walsender.c --- b/src/backend/replication/walsender.c *** *** 108,113 bool am_db_walsender = false; /* Connected to a database? 
*/ --- 108,114 int max_wal_senders = 0; /* the maximum number of concurrent walsenders */ int wal_sender_timeout = 60 * 1000; /* maximum time to send one * WAL data message */ + bool log_replication_commands = false; /* * State for WalSndWakeupRequest *** *** 1268,1280 exec_replication_command(const char *cmd_string) MemoryContext old_context; /* * CREATE_REPLICATION_SLOT ... LOGICAL exports a snapshot until the next * command arrives. Clean up the old stuff if there's anything. */ SnapBuildClearExportedSnapshot(); - elog(DEBUG1, received replication command: %s, cmd_string); - CHECK_FOR_INTERRUPTS(); cmd_context = AllocSetContextCreate(CurrentMemoryContext, --- 1269,1287 MemoryContext old_context; /* + * Log replication command if log_replication_commands is enabled. + * Even when it's disabled, log the command with DEBUG1 level for + * backward compatibility. + */ + ereport(log_replication_commands ? LOG : DEBUG1, + (errmsg(received replication
Re: [HACKERS] Add shutdown_at_recovery_target option to recovery.conf
On Wed, Sep 10, 2014 at 1:45 AM, Petr Jelinek p...@2ndquadrant.com wrote: Hi, I recently wanted several times to have a slave server prepared at a certain point in time, to reduce the time it takes for it to replay the remaining WALs (say I have pg_basebackup -x on a busy db, for example). In your example, you're thinking to perform the recovery after taking the backup and stop it at the consistent point (i.e., end of backup) by using the proposed feature? Then you're expecting that the future recovery will start from that consistent point, which will reduce the recovery time? This is true if a checkpoint is executed at the end of the backup. But there might be no occurrence of a checkpoint during the backup. In this case, even the future recovery would need to start from the very start of the backup. That is, we cannot reduce the recovery time. So, for your purpose, for example, you might also need to add a new option to pg_basebackup so that a checkpoint is executed at the end of the backup if the option is set. Currently the way to do it is to have pause_at_recovery_target set to true (the default), wait until pg accepts connections, and then shut it down. The issue is that this is ugly, and also there is a chance that somebody else connects and does bad things (tm) before my process does. So I wrote a simple patch that adds an option to shut down the cluster once the recovery target is reached. The server will still be able to continue WAL replay if needed later, or can be configured to start standalone. What about adding something like action_at_recovery_target=pause|shutdown instead of increasing the number of parameters? Regards, -- Fujii Masao
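If the action_at_recovery_target idea floated above were adopted, a recovery.conf might look like the following sketch. The parameter name is taken from the suggestion in this thread and is hypothetical, as are the archive path and the target time.

```
restore_command = 'cp /mnt/archive/%f "%p"'     # hypothetical archive location
recovery_target_time = '2014-09-10 12:00:00'    # hypothetical target
# One parameter instead of pause_at_recovery_target plus a new shutdown flag:
action_at_recovery_target = 'shutdown'
```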
Re: [HACKERS] pgcrypto: PGP armor headers
On 9/9/14 11:01 AM, I wrote: On 9/9/14 10:54 AM, Heikki Linnakangas wrote: So I think this (and the corresponding dearmor code too) should be refactored to use a StringInfo that gets enlarged as needed, instead of hoping to guess the size correctly beforehand. To ease review, might make sense to do that as a separate patch over current sources, and the main patch on top of that. Yeah, I thought the same thing while writing that code. Thanks, I'll do it that way. Attached is what I have right now. I started working on the decoding part, but it has this piece of code: /* decode crc */ if (b64_decode(p + 1, 4, buf) != 3) goto out; which makes the approach a bit uglier. If I did this the same way, I would have to create and destroy a StringInfo just for this operation, which seems ugly. So I wonder if I shouldn't try and instead keep the code closer to what it is in HEAD right now; I could call enlargeStringInfo() first, then hand out a pointer to b64_encode (or b64_decode()) and finally increment StringInfoData.len by how much was actually written. That would keep the code changes a lot smaller, too. Is either of these approaches anywhere near what you had in mind? I'm also not sure why we need to keep a copy of the base64 encoding/decoding logic instead of exporting it in utils/adt/encode.c. .marko *** a/contrib/pgcrypto/pgp-armor.c --- b/contrib/pgcrypto/pgp-armor.c *** *** 31,36 --- 31,38 #include postgres.h + #include lib/stringinfo.h + #include px.h #include pgp.h *** *** 38,58 * BASE64 - duplicated :( */ ! static const unsigned char _base64[] = ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/; ! static int ! b64_encode(const uint8 *src, unsigned len, uint8 *dst) { - uint8 *p, - *lend = dst + 76; const uint8 *s, *end = src + len; int pos = 2; unsigned long buf = 0; s = src; - p = dst; while (s end) { --- 40,58 * BASE64 - duplicated :( */ ! static const char _base64[] = ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/; ! 
static void ! b64_encode(const uint8 *src, unsigned len, StringInfo dst) { const uint8 *s, *end = src + len; int pos = 2; unsigned long buf = 0; + int linepos = 0; s = src; while (s end) { *** *** 65,93 b64_encode(const uint8 *src, unsigned len, uint8 *dst) */ if (pos 0) { ! *p++ = _base64[(buf 18) 0x3f]; ! *p++ = _base64[(buf 12) 0x3f]; ! *p++ = _base64[(buf 6) 0x3f]; ! *p++ = _base64[buf 0x3f]; pos = 2; buf = 0; } ! if (p = lend) { ! *p++ = '\n'; ! lend = p + 76; } } if (pos != 2) { ! *p++ = _base64[(buf 18) 0x3f]; ! *p++ = _base64[(buf 12) 0x3f]; ! *p++ = (pos == 0) ? _base64[(buf 6) 0x3f] : '='; ! *p++ = '='; } - - return p - dst; } /* probably should use lookup table */ --- 65,92 */ if (pos 0) { ! appendStringInfoCharMacro(dst, _base64[(buf 18) 0x3f]); ! appendStringInfoCharMacro(dst, _base64[(buf 12) 0x3f]); ! appendStringInfoCharMacro(dst, _base64[(buf 6) 0x3f]); ! appendStringInfoCharMacro(dst, _base64[buf 0x3f]); + linepos += 4; pos = 2; buf = 0; } ! if (linepos = 76) { ! appendStringInfoCharMacro(dst, '\n'); ! linepos = 0; } } if (pos != 2) { ! appendStringInfoCharMacro(dst, _base64[(buf 18) 0x3f]); ! appendStringInfoCharMacro(dst, _base64[(buf 12) 0x3f]); ! appendStringInfoCharMacro(dst, (pos == 0) ? _base64[(buf 6) 0x3f] : '='); ! appendStringInfoCharMacro(dst, '='); } } /* probably should use lookup table */ *** *** 160,174 b64_decode(const uint8 *src, unsigned len, uint8 *dst) } static unsigned - b64_enc_len(unsigned srclen) - { - /* -* 3 bytes will be converted to 4, linefeed after 76 chars -*/ - return (srclen + 2) * 4 / 3 + srclen / (76 * 3 / 4); - } - - static unsigned b64_dec_len(unsigned srclen) { return (srclen * 3) 2; ---
Re: [HACKERS] pgcrypto: PGP armor headers
On 09/10/2014 02:26 PM, Marko Tiikkaja wrote: On 9/9/14 11:01 AM, I wrote: On 9/9/14 10:54 AM, Heikki Linnakangas wrote: So I think this (and the corresponding dearmor code too) should be refactored to use a StringInfo that gets enlarged as needed, instead of hoping to guess the size correctly beforehand. To ease review, might make sense to do that as a separate patch over current sources, and the main patch on top of that. Yeah, I thought the same thing while writing that code. Thanks, I'll do it that way. Attached is what I have right now. I started working on the decoding part, but it has this piece of code: /* decode crc */ if (b64_decode(p + 1, 4, buf) != 3) goto out; which makes the approach a bit uglier. If I did this the same way, I would have to create and destroy a StringInfo just for this operation, which seems ugly. So I wonder if I shouldn't try and instead keep the code closer to what it is in HEAD right now; I could call enlargeStringInfo() first, then hand out a pointer to b64_encode() (or b64_decode()) and finally increment StringInfoData.len by how much was actually written. That would keep the code changes a lot smaller, too. Yeah, that sounds reasonable. I'm also not sure why we need to keep a copy of the base64 encoding/decoding logic instead of exporting it in utils/adt/encode.c. Good question... - Heikki
Re: [HACKERS] ALTER SYSTEM RESET?
On Wed, Sep 3, 2014 at 12:08 AM, Christoph Berg c...@df7cb.de wrote: Re: Vik Fearing 2014-09-02 5405d2d9.9050...@dalibo.com Uhm, are we agreed on the decision not to backpatch this? I would think this should have been part of the initial ALTER SYSTEM SET patch and thus should be backpatched to 9.4. I think it belongs in 9.4 as well, especially if we're having another beta. My original complaint was about 9.4, so I'd like to see it there, yes. ISTM that the consensus is to do the back-patch to 9.4. Does anyone object to the back-patch? If not, I will do that. Regards, -- Fujii Masao
Re: [HACKERS] pgcrypto: PGP armor headers
On 9/10/14 1:38 PM, Heikki Linnakangas wrote: On 09/10/2014 02:26 PM, Marko Tiikkaja wrote: So I wonder if I shouldn't try and instead keep the code closer to what it is in HEAD right now; I could call enlargeStringInfo() first, then hand out a pointer to b64_encode (or b64_decode()) and finally increment StringInfoData.len by how much was actually written. That would keep the code changes a lot smaller, too. Yeah, that sounds reasonable. OK, I've attemped to do that in the attached. I'm pretty sure I didn't get all of the overflows right, so someone should probably take a really good look at it. (I'm not too confident the original code got them right either, but whatever). Speaking of good looks, should I add it to the next commitfest as a new patch, or should we try and get someone to review it like this? I'm also not sure why we need to keep a copy of the base64 encoding/decoding logic instead of exporting it in utils/adt/encode.c. Good question... I've left this question unanswered for now. We can fix it later, independently of this patch. .marko *** a/contrib/pgcrypto/pgp-armor.c --- b/contrib/pgcrypto/pgp-armor.c *** *** 203,240 crc24(const uint8 *data, unsigned len) return crc 0xffL; } ! int ! pgp_armor_encode(const uint8 *src, unsigned len, uint8 *dst) { ! int n; ! uint8 *pos = dst; unsignedcrc = crc24(src, len); ! n = strlen(armor_header); ! memcpy(pos, armor_header, n); ! pos += n; ! n = b64_encode(src, len, pos); ! pos += n; ! if (*(pos - 1) != '\n') ! *pos++ = '\n'; ! *pos++ = '='; ! pos[3] = _base64[crc 0x3f]; ! crc = 6; ! pos[2] = _base64[crc 0x3f]; ! crc = 6; ! pos[1] = _base64[crc 0x3f]; ! crc = 6; ! pos[0] = _base64[crc 0x3f]; ! pos += 4; ! n = strlen(armor_footer); ! memcpy(pos, armor_footer, n); ! pos += n; ! ! return pos - dst; } static const uint8 * --- 203,239 return crc 0xffL; } ! void ! pgp_armor_encode(const uint8 *src, int len, StringInfo dst) { ! int res; ! unsignedb64len; unsignedcrc = crc24(src, len); ! 
appendStringInfoString(dst, armor_header); ! /* make sure we have enough room to b64_encode() */ ! b64len = b64_enc_len(len); ! if (b64len INT_MAX) ! ereport(ERROR, ! (errcode(ERRCODE_OUT_OF_MEMORY), !errmsg(out of memory))); ! enlargeStringInfo(dst, (int) b64len); ! res = b64_encode(src, len, (uint8 *) dst-data + dst-len); ! if (res b64len) ! elog(FATAL, overflow - encode estimate too small); ! dst-len += res; ! if (*(dst-data + dst-len - 1) != '\n') ! appendStringInfoChar(dst, '\n'); ! appendStringInfoChar(dst, '='); ! appendStringInfoChar(dst, _base64[(crc 18) 0x3f]); ! appendStringInfoChar(dst, _base64[(crc 12) 0x3f]); ! appendStringInfoChar(dst, _base64[(crc 6) 0x3f]); ! appendStringInfoChar(dst, _base64[crc 0x3f]); ! appendStringInfoString(dst, armor_footer); } static const uint8 * *** *** 309,315 find_header(const uint8 *data, const uint8 *datend, } int ! pgp_armor_decode(const uint8 *src, unsigned len, uint8 *dst) { const uint8 *p = src; const uint8 *data_end = src + len; --- 308,314 } int ! pgp_armor_decode(const uint8 *src, int len, StringInfo dst) { const uint8 *p = src; const uint8 *data_end = src + len; *** *** 319,324 pgp_armor_decode(const uint8 *src, unsigned len, uint8 *dst) --- 318,324 const uint8 *base64_end = NULL; uint8 buf[4]; int hlen; + int blen; int res = PXE_PGP_CORRUPT_ARMOR; /* armor start */ *** *** 360,382 pgp_armor_decode(const uint8 *src, unsigned len, uint8 *dst) crc = (((long) buf[0]) 16) + (((long) buf[1]) 8) + (long) buf[2]; /* decode data */ ! res = b64_decode(base64_start, base64_end - base64_start, dst); ! ! /* check crc */ ! if (res = 0 crc24(dst, res) != crc) ! res = PXE_PGP_CORRUPT_ARMOR; out: return res; } - - unsigned - pgp_armor_enc_len(unsigned len) - { - return b64_enc_len(len) + strlen(armor_header) + strlen(armor_footer) + 16; - } - - unsigned - pgp_armor_dec_len(unsigned len) - { - return b64_dec_len(len); - } --- 360,377 crc = (((long) buf[0]) 16) + (((long) buf[1]) 8) + (long) buf[2]; /* decode data */ ! 
blen = (int) b64_dec_len(len); ! enlargeStringInfo(dst, blen); ! res =
Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
Fujii Masao wrote: On Wed, Sep 10, 2014 at 12:15 PM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: PENDING_LIST_CLEANUP_SIZE and work_mem, for this setting. Wouldn't it be easier to use to have only one parameter, PENDING_LIST_CLEANUP_SIZE? How about setting PENDING_LIST_CLEANUP_SIZE to work_mem as the default value when running the CREATE INDEX command? That's an idea. But there might be some users who want to change the cleanup size per session like they can do by setting work_mem, and your idea would prevent them from doing that... So what about introducing pending_list_cleanup_size also as a GUC? That is, users can set the threshold by using either the reloption or the GUC. Yes, I think having both a GUC and a reloption makes sense -- the GUC applies to all indexes, and can be tweaked for individual indexes using the reloption. I'm not sure about the idea of being able to change it per session, though. Do you mean that you would like insert processes to use a very large value so that they can just append new values to the pending list, and have vacuum use a small value so that it cleans up as soon as it runs? Two things: 1. we could have an autovacuum_ reloption which only changes what autovacuum does; 2. we could have autovacuum run index cleanup actions separately from actual vacuuming. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] SKIP LOCKED DATA (work in progress)
Thomas Munro wrote: @@ -2022,7 +2030,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode, bool noWait, */ test = heap_lock_tuple(relation, tuple, estate->es_output_cid, - lockmode, noWait, + lockmode, wait_policy, false, &buffer, &hufd); /* We now have two pins on the buffer, get rid of one */ ReleaseBuffer(buffer); Doesn't this heap_lock_tuple() need to check for a WouldBlock result? Not sure that this is the same case that you were trying to test in heap_lock_updated_tuple; I think this requires an update chain (so that EPQFetch is invoked) and some tuple later in the chain is locked by a third transaction. I attach some additional minor suggestions to your patch. Please feel free to reword comments differently if you think my wording isn't an improvements (or I've maked an english mistakes). -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] pgbench throttling latency limit
On 09/09/2014 03:35 PM, Fabien COELHO wrote: Hello Heikki, I think we have to reconsider what we're reporting in 9.4, when --rate is enabled, even though it's already very late in the release cycle. It's a bad idea to change the definition of latency between 9.4 and 9.5, so let's get it right in 9.4. Indeed. As per the attached patch. I think we should commit this to 9.4. Any objections? Ok for me. Some more propositions about the doc below. I looked closer at the this, and per Jan's comments, realized that we don't log the lag time in the per-transaction log file. I think that's a serious omission; when --rate is used, the schedule lag time is important information to make sense of the result. I think we have to apply the attached patch, and backpatch to 9.4. (The documentation on the log file format needs updating) Also, this is bizarre: /* * Use inverse transform sampling to randomly generate a delay, such * that the series of delays will approximate a Poisson distribution * centered on the throttle_delay time. * * 1 implies a 9.2 (-log(1/1)) to 0.0 (log 1) delay * multiplier, and results in a 0.055 % target underestimation bias: * * SELECT 1.0/AVG(-LN(i/1.0)) FROM generate_series(1,1) AS i; * = 1.00055271703266335474 * * If transactions are too slow or a given wait is shorter than a * transaction, the next transaction will start right away. */ int64 wait = (int64) (throttle_delay * 1.00055271703 * -log(getrand(thread, 1, 1) / 1.0)); We're using getrand() to generate a uniformly distributed random value between 1 and 1, and then convert it to a double between 0.0 and 1.0. But getrand() is implemented by taking a double between 0.0 and 1.0 and converting it an integer, so we're just truncating the original floating point number unnecessarily. I think we should add a new function, getPoissonRand(), that uses pg_erand48() directly. We already have similiar getGaussianRand() and getExponentialRand() functions. 
Barring objections, I'll prepare another patch to do that, and backpatch to 9.4. - Heikki diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c index 2f7d80e..f7a3a5f 100644 --- a/contrib/pgbench/pgbench.c +++ b/contrib/pgbench/pgbench.c @@ -210,10 +210,10 @@ typedef struct * sent */ int sleeping; /* 1 indicates that the client is napping */ bool throttling; /* whether nap is for throttling */ - int64 until; /* napping until (usec) */ Variable *variables; /* array of variable definitions */ int nvariables; - instr_time txn_begin; /* used for measuring transaction latencies */ + int64 txn_scheduled; /* scheduled start time of transaction (usec) */ + instr_time txn_begin; /* used for measuring schedule lag times */ instr_time stmt_begin; /* used for measuring statement latencies */ int64 txn_latencies; /* cumulated latencies */ int64 txn_sqlats; /* cumulated square latencies */ @@ -284,12 +284,17 @@ typedef struct long start_time; /* when does the interval start */ int cnt; /* number of transactions */ - double min_duration; /* min/max durations */ - double max_duration; - double sum; /* sum(duration), sum(duration^2) - for + + double min_latency; /* min/max latencies */ + double max_latency; + double sum_latency; /* sum(latency), sum(latency^2) - for * estimates */ - double sum2; + double sum2_latency; + double min_lag; + double max_lag; + double sum_lag; /* sum(lag) */ + double sum2_lag; /* sum(lag*lag) */ } AggVals; static Command **sql_files[MAX_FILES]; /* SQL script files */ @@ -968,12 +973,18 @@ agg_vals_init(AggVals *aggs, instr_time start) { /* basic counters */ aggs-cnt = 0;/* number of transactions */ - aggs-sum = 0;/* SUM(duration) */ - aggs-sum2 = 0;/* SUM(duration*duration) */ + aggs-sum_latency = 0; /* SUM(latency) */ + aggs-sum2_latency = 0;/* SUM(latency*latency) */ /* min and max transaction duration */ - aggs-min_duration = 0; - aggs-max_duration = 0; + aggs-min_latency = 0; + aggs-max_latency = 0; + + /* schedule lag counters */ + 
aggs-sum_lag = 0; + aggs-sum2_lag = 0; + aggs-min_lag = 0; + aggs-max_lag = 0; /* start of the current interval */ aggs-start_time = INSTR_TIME_GET_DOUBLE(start); @@ -1016,7 +1027,7 @@ top: thread-throttle_trigger += wait; - st-until = thread-throttle_trigger; + st-txn_scheduled = thread-throttle_trigger; st-sleeping = 1; st-throttling = true; st-is_throttled = true; @@ -1032,13 +1043,13 @@ top: INSTR_TIME_SET_CURRENT(now); now_us = INSTR_TIME_GET_MICROSEC(now); - if (st-until = now_us) + if (st-txn_scheduled = now_us) {
Re: [HACKERS] pgbench throttling latency limit
Hi, I found a typo in your patch. Please confirm. @line 239 - agg->sum2_lag = 0; + agg->sum_lag = 0; And a backpatch is welcome for me. Best Regards, -- Mitsumasa KONDO
Re: [HACKERS] pgbench throttling latency limit
Hello Heikki, I looked closer at this, and per Jan's comments, realized that we don't log the lag time in the per-transaction log file. I think that's a serious omission; when --rate is used, the schedule lag time is important information to make sense of the result. I think we have to apply the attached patch, and backpatch to 9.4. (The documentation on the log file format needs updating) Indeed. I think that people do not like it to change. I remember that I suggested to change timestamps to xxx.yyy instead of the unreadable xxx yyy, and was told not to, because some people have tools which process the output so the format MUST NOT CHANGE. So my behavior is to avoid touching anything in this area. I'm fine if you do it, though:-) However I have no time to have a precise look at your patch to cross-check it before Friday. Also, this is bizarre: int64 wait = (int64) (throttle_delay * 1.00055271703 * -log(getrand(thread, 1, 10000) / 10000.0)); We're using getrand() to generate a uniformly distributed random value between 1 and 10000, and then convert it to a double between 0.0 and 1.0. The reason for this conversion is to have randomness but to still avoid going to extreme multiplier values. The idea is to avoid a very large multiplier which would generate (even if it is not often) a 0 tps when 100 tps is required. The 1/10000 granularity is basically random but the multiplier stays bounded (max 9.2, so the minimum possible tps would be around 11 for a target of 100 tps, bar issues from the database for processing the transactions). So although this feature can be discussed and amended, it is fully voluntary. I think that it makes sense so I would prefer to keep it as is. Maybe the comments could be updated to be clearer. -- Fabien.
Re: [HACKERS] Spinlocks and compiler/memory barriers
On Tue, Sep 9, 2014 at 6:00 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-09 17:54:03 -0400, Robert Haas wrote: So, that's committed, then. Yay, finally. I think we should pick something that uses spinlocks and is likely to fail spectacularly if we haven't got this totally right yet, and de-volatilize it. And then watch to see what turns red in the buildfarm and/or which users start screaming. Good plan. I'm inclined to propose lwlock.c as a candidate, since that's very widely used and a place where we know there's significant contention. I suggest, additionally possibly, GetSnapshotData(). It's surely one of the hottest functions in postgres, and I've seen some performance increases from de-volatilizing it. IIRC it requires one volatile cast in one place to enforce that a variable is accessed only once. Not sure if we want to add such volatile casts or use something like Linux's ACCESS_ONCE macros for some common types. Helps to grep for places worthy of inspection. GetSnapshotData() isn't quite to the point, because it's using a volatile pointer, but not because of anything about spinlock manipulation. That would, perhaps, be a good test for barriers, but not for this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[HACKERS] Aussie timezone database changes incoming
In connection with a question asked today on pgsql-general, I had occasion to go check the release announcements for the IANA timezone database files, and it turns out that there are some big changes in 2014f: http://mm.icann.org/pipermail/tz-announce/2014-August/23.html The Russian changes are perhaps not such a big deal because they've done that sort of thing before, but this is an earful: Australian eastern time zone abbreviations are now AEST/AEDT not EST, and similarly for the other Australian zones. That is, for eastern standard and daylight saving time the abbreviations are AEST and AEDT instead of the former EST for both; similarly, ACST/ACDT, ACWST/ACWDT, and AWST/AWDT are now used instead of the former CST, CWST, and WST. This change does not affect UTC offsets, only time zone abbreviations. (Thanks to Rich Tibbett and many others.) I'm wondering how many Aussie applications are going to break when this goes in, and if we could/should do anything about it. One idea that comes to mind is to create an Australia_old tznames file containing the current Aussie zone abbreviations, so as to provide an easy way to maintain backwards compatibility at need (you'd select that as your timezone_abbreviations GUC setting). Anyone from down under care to remark about the actual usage of old and new abbreviations? regards, tom lane
Re: [HACKERS] add modulo (%) operator to pgbench
On Wed, Sep 10, 2014 at 4:48 AM, Fabien COELHO coe...@cri.ensmp.fr wrote: Note I did not start with the non-uniform stuff, but Mitsumasa-san sent a gaussian distribution patch and I jumped onto the wagon to complement it with an exponential distribution patch. I knew when doing it that it was not enough, but as I said, one piece at a time, given the effort required to pass a simple patch. What is still needed for the overall purpose is the ability to scatter the distribution. This is really: (1) a positive remainder modulo, which is the trivial patch under discussion (2) a fast but good-quality-for-the-purpose hash function, also a rather small patch, not submitted yet. (3) maybe the '|' operator to do a TP*-like non-uniform load, which is really periodic so I do not like it; a trivial patch, not submitted yet. If you do not want one of these pieces (1 & 2), basically the interest of the gaussian/exponential addition is much reduced, and this is just a half-baked effort aborted because you did not want what was required to make it useful. Well, I can only disagree, but you are a committer and I'm not! I am not objecting to the functionality; I'm objecting to bolting on ad-hoc operators one at a time. I think an expression syntax would let us do this in a much more scalable way. If I had time, I'd go do that, but I don't. We could add abs(x) and hash(x) and it would all be grand. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pgbench throttling latency limit
On 09/10/2014 05:57 PM, Fabien COELHO wrote: Hello Heikki, I looked closer at this, and per Jan's comments, realized that we don't log the lag time in the per-transaction log file. I think that's a serious omission; when --rate is used, the schedule lag time is important information to make sense of the result. I think we have to apply the attached patch, and backpatch to 9.4. (The documentation on the log file format needs updating) Indeed. I think that people do not like it to change. I remember that I suggested to change timestamps to xxx.yyy instead of the unreadable xxx yyy, and was told not to, because some people have tools which process the output so the format MUST NOT CHANGE. So my behavior is to avoid touching anything in this area. I'm fine if you do it, though:-) However I have no time to have a precise look at your patch to cross-check it before Friday. This is different from changing xxx yyy to xxx.yyy in two ways. First, this adds new information to the log that just isn't there otherwise, it's not just changing the format for the sake of it. Second, in this case it's easy to write a parser for the log format so that it works with the old and new formats. Many such programs would probably ignore any unexpected extra fields, as a matter of lazy programming, while changing the separator from a space to a dot would surely require changing every parsing program. We could leave out the lag fields, though, when --rate is not used. Without --rate, the lag is always zero anyway. That would keep the file format unchanged when you don't use the new --rate feature. I'm not sure if that would be better or worse for programs that might want to parse the information. Also, this is bizarre: int64 wait = (int64) (throttle_delay * 1.00055271703 * -log(getrand(thread, 1, 10000) / 10000.0)); We're using getrand() to generate a uniformly distributed random value between 1 and 10000, and then convert it to a double between 0.0 and 1.0. But getrand() is implemented by taking a double between 0.0 and 1.0 and converting it to an integer, so we're just truncating the original floating point number unnecessarily. I think we should add a new function, getPoissonRand(), that uses pg_erand48() directly. We already have similar getGaussianRand() and getExponentialRand() functions. 
The reason for this conversion is to have randomness but to still avoid going to extreme multiplier values. The idea is to avoid a very large multiplier which would generate (even if it is not often) a 0 tps when 100 tps is required. The 1/10000 granularity is basically random but the multiplier stays bounded (max 9.2, so the minimum possible tps would be around 11 for a target of 100 tps, bar issues from the database for processing the transactions). So although this feature can be discussed and amended, it is fully voluntary. I think that it makes sense so I would prefer to keep it as is. Maybe the comments could be updated to be clearer. Ok, yeah, the comments indeed didn't mention anything about that. I don't think such clamping is necessary. With that 9.2x clamp on the multiplier, the probability that any given transaction hits it is about 1/10000. And a delay 9.2 times the average is still quite reasonable, IMHO. The maximum delay on my laptop, when pg_erand48() returns DBL_MIN, seems to be about 700 x the average, which is still reasonable if you run a decent number of transactions. And of course, the probability of hitting such an extreme value is minuscule, so if you're just doing a few quick test runs with a small total number of transactions, you won't hit that. - Heikki
Re: [HACKERS] pgcrypto: PGP armor headers
On 09/10/2014 04:35 PM, Marko Tiikkaja wrote: Speaking of good looks, should I add it to the next commitfest as a new patch, or should we try and get someone to review it like this? Let's handle this in this commitfest. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Memory Alignment in Postgres
On Tue, Sep 9, 2014 at 10:08 AM, Arthur Silva arthur...@gmail.com wrote: I'm continuously studying the Postgres codebase. Hopefully I'll be able to make some contributions in the future. For now I'm intrigued by the extensive use of memory alignment. I'm sure there's some "legacy" and some "the architecture requires it" reasoning behind it. That aside, since it wastes space (a lot of space in some cases) there must be a tipping point somewhere. I'm sure one can prove aligned access is faster in a micro-benchmark, but I'm not sure that's the case in a DBMS like Postgres, especially in the page/rows area. Just for the sake of comparison, MySQL COMPACT storage (default and recommended since 5.5) doesn't align data at all. MySQL NDB uses a fixed 4-byte alignment. Not sure about Oracle and others. Is it worth the extra space on newer architectures (especially Intel)? Do you guys think this is something worth looking at? Yes. At least in my opinion, though, it's not a good project for a beginner. If you get your changes to take effect, you'll find that a lot of things will break in places that are not easy to find or fix. You're getting into really low-level areas of the system that get touched infrequently and require a lot of expertise in how things work today to adjust. The idea I've had before is to try to reduce the widest alignment we ever require from 8 bytes to 4 bytes. That is, look for types with typalign = 'd', and rewrite them to have typalign = 'i' by having them use two 4-byte loads to load an eight-byte value. In practice, I think this would probably save a high percentage of what can be saved, because 8-byte alignment implies a maximum of 7 bytes of wasted space, while 4-byte alignment implies a maximum of 3 bytes of wasted space. And it would probably be pretty cheap, too, because any type with less than 8 byte alignment wouldn't be affected at all, and even those types that were affected would only be slightly slowed down by doing two loads instead of one. 
In contrast, getting rid of alignment requirements completely would save a little more space, but probably at the cost of a lot more slowdown: any type with alignment requirements would have to fetch the value byte-by-byte instead of pulling the whole thing out at once. But there are a couple of obvious problems with this idea, too, such as: 1. It's really complicated and a ton of work. 2. It would break pg_upgrade pretty darn badly unless we employed some even-more-complex strategy to mitigate that. 3. The savings might not be enough to justify the effort. It might be interesting for someone to develop a tool measuring the number of bytes of alignment padding we lose per tuple or per page and gather some statistics on it on various databases. That would give us some sense as to the possible savings. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
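The "two 4-byte loads" idea Robert describes can be sketched like this. This is illustrative only, not PostgreSQL code; the function name is made up.

```c
#include <stdint.h>
#include <string.h>

/* Fetch an 8-byte value from a location guaranteed only 4-byte
 * alignment by doing two 4-byte copies instead of one 8-byte load.
 * Compilers typically turn these fixed-size memcpy calls into two
 * aligned 32-bit loads, avoiding a potentially-trapping misaligned
 * 64-bit access on strict-alignment platforms. */
static uint64_t
fetch_int64_4byte_aligned(const void *p)
{
    uint64_t    v;

    memcpy(&v, p, 4);
    memcpy((char *) &v + 4, (const char *) p + 4, 4);
    return v;
}
```

A type whose on-disk representation only needed 4-byte alignment (typalign = 'i') could use an accessor like this, at the cost of one extra load, in exchange for up to 4 fewer padding bytes per value.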
Re: [HACKERS] Escaping from blocked send() reprised.
Wrong thread... On 09/10/2014 03:04 AM, Kyotaro HORIGUCHI wrote: Hmm. Sorry, I misunderstood the specification. Your approach of coloring tokens seems right, but you have broken the parse logic by adding your code. Other than the mistakes others pointed out, I found that - non-SQL-ident-like tokens are ignored by their token style, quoted or not, so the following line works. | local All aLL trust I suppose this is not what you intended. This is because you have ignored the attribute of a token when comparing it as non-SQL-ident tokens. - '+' at the head of the sequence '+' is treated as the first character of the *quoted* string. e.g. +hoge is tokenized as +hoge:special_quoted. I found this is what was intended. This should be documented in comments. |2) users and user-groups only require special handling and behavior as follows |Normal user : | A. unquoted ( USER ) will be treated as user ( downcase ). | B. quoted ( USeR ) will be treated as USeR (case-sensitive). | C. quoted ( +USER ) will be treated as the normal user +USER (i.e. it will not be considered a user-group), and case-sensitively, as the string is quoted. This seems confusing with the B below. It seems this should be rearranged. | User Group : | A. unquoted ( +USERGROUP ) will be treated as +usergroup ( downcase ). | B. plus quoted ( +UserGROUP ) will be treated as +UserGROUP (case-sensitive). This is why you simply continued processing for '+' without discarding and skipping the '+', and without setting in_quote, so the following parser code does not work as intended. You should understand what the original code does and insert or modify logic without breaking its assumptions. regards, -- - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
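The user/user-group quoting rules enumerated above can be sketched as a small normalization function. This is a hypothetical illustration of the described behavior, not the patch's code; the names `UserToken` and `normalize_user_token` are made up.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

typedef struct
{
    bool        is_group;
    char        name[64];
} UserToken;

/* quoted: the whole token was quoted, so "+USER" is a plain user
 * literally named +USER.  plus_quoted: the part after an unquoted
 * '+' was quoted, so +"UserGROUP" is the group UserGROUP with case
 * preserved.  Unquoted tokens are down-cased. */
static UserToken
normalize_user_token(const char *tok, bool quoted, bool plus_quoted)
{
    UserToken   t = {false, ""};
    char       *p;

    if (!quoted && tok[0] == '+')
    {
        t.is_group = true;      /* group marker, not part of the name */
        tok++;
    }
    strncpy(t.name, tok, sizeof(t.name) - 1);
    if (!quoted && !plus_quoted)
        for (p = t.name; *p; p++)
            *p = (char) tolower((unsigned char) *p);
    return t;
}
```

Under this sketch, `USER` becomes the user `user`, `"USeR"` stays `USeR`, `"+USER"` is a plain user named `+USER`, `+USERGROUP` becomes the group `usergroup`, and `+"UserGROUP"` becomes the group `UserGROUP`, matching rules A-C above.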
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
Tomas == Tomas Vondra t...@fuzzy.cz writes: Tomas If we can get rid of the excessive ChainAggregate, that's Tomas certainly enough for now. I found an algorithm that should provably give the minimal number of sorts (I was afraid that problem would turn out to be NP-hard, but not so - it's solvable in P by reducing it to a problem of maximal matching in bipartite graphs). Updated patch should be forthcoming in a day or two. -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Tue, Sep 9, 2014 at 2:00 PM, Robert Haas robertmh...@gmail.com wrote: Boiled down, what you're saying is that you might have a set that contains lots of duplicates in general, but not very many where the abbreviated-keys also match. Sure, that's true. Abbreviated keys are not used in the case where we do a (fully) opportunistic memcmp(), without really having any idea of whether or not it'll work out. Abbreviated keys aren't really relevant to that case, except perhaps in that we know we'll have statistics available for leading attributes, which will make the case less frequent in practice. But you might also not have that case, so I don't see where that gets us; the same worst-case test case Heikki developed the last time we relitigated this point is still relevant here. Well, I think that there should definitely be a distinction made between abbreviated and non-abbreviated cases; you could frequently have almost 100% certainty that each of those optimistic memcmp()s will work out with abbreviated keys. Low cardinality sets are very common. I'm not sure what your position on that is. My proposal to treat both of those cases (abbreviated with a cost model/cardinality statistics; non-abbreviated without) the same is based on different arguments for each case. In order to know how much we're giving up in that case, we need the exact number I asked you to provide in my previous email: the ratio of the cost of strcoll() to the cost of memcmp(). I see that you haven't chosen to provide that information in any of your four responses. Well, it's kind of difficult to give that number in a vacuum. I showed a case that had a large majority of opportunistic memcmp()s go to waste, while a small number were useful, which still put us ahead. I can look at Heikki's case with this again if you think that'll help. 
Heikki said that his case was all about wasted strxfrm()s, which are surely much more expensive than wasted memcmp()s, particularly when you consider temporal locality (we needed that memory to be stored in a cacheline for the immediately subsequent operation anyway, should the memcmp() thing not work out - the simple ratio that you're interested in may be elusive). In case I haven't been clear enough on this point, I re-emphasize that I do accept that for something like the non-abbreviated case, the opportunistic memcmp() thing must virtually be free if we're to proceed, since it is purely opportunistic. If it can be demonstrated that that isn't the case (and if that cannot be fixed by limiting it to CACHE_LINE_SIZE), clearly we should not proceed with opportunistic (non-abbreviated) memcmp()s. In fact, I think I'm holding it to a higher standard than you are - I believe that it had better be virtually free. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
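The opportunistic-memcmp() idea under discussion can be sketched as follows. This is an illustration of the technique, not the patch itself: byte-for-byte identical strings compare equal under any collation, so a cheap memcmp() can short-circuit the expensive strcoll().

```c
#include <locale.h>
#include <string.h>

/* Opportunistic equality check before a locale-aware comparison.
 * Costs one memcmp() when it fails, but saves a strcoll() whenever
 * the strings are bytewise equal -- the low-cardinality case Peter
 * describes, where equal keys are very common. */
static int
opportunistic_cmp(const char *a, const char *b)
{
    size_t      alen = strlen(a);
    size_t      blen = strlen(b);

    if (alen == blen && memcmp(a, b, alen) == 0)
        return 0;               /* cheap win: definitely equal */

    return strcoll(a, b);       /* fall back to the collation */
}
```

Note that the memcmp() result is only trusted when it reports equality; an inequality says nothing about collation order, which is why the strcoll() fallback is still required.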
Re: [Fwd: Re: [HACKERS] proposal: new long psql parameter --on-error-stop]
2014-09-10 0:13 GMT+02:00 Andres Freund and...@2ndquadrant.com: On 2014-09-09 22:22:45 +0200, Andres Freund wrote: I plan to push this soon. Done. Thanks for the patch! Thank you very much Pavel Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 09/10/2014 01:49 AM, Tomas Vondra wrote: On 9.9.2014 16:09, Robert Haas wrote: On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra t...@fuzzy.cz wrote: So I only posted the separate patch for those who want to do a review, and then a complete patch with both parts combined. But it sure may be a bit confusing. Let's do this: post each new version of the patches only on this thread, as a series of patches that can be applied one after another. OK, attached. Apply in this order: 1) dense-alloc-v5.patch 2) hashjoin-nbuckets-v12.patch The dense-alloc-v5.patch looks good to me. I have committed that with minor cleanup (more comments below). I have not looked at the second patch. I also did a few 'minor' changes to the dense allocation patch, most notably: * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData The original naming seemed a bit awkward. That's too easy to confuse with regular MemoryContext and AllocChunk stuff. I renamed it to HashMemoryChunk. * the chunk size is 32kB (instead of 16kB), and we're using a 1/4 threshold for 'oversized' items We need the threshold to be >= 8kB, to trigger the special case within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. Should we care about the fact that if there are only a few tuples, we will nevertheless waste 32kB of memory for the chunk? I guess not, but I thought I'd mention it. The smallest allowed value for work_mem is 64kB. I also considered Heikki's suggestion that while rebatching, we can reuse chunks with a single large tuple, instead of copying it needlessly. My attempt however made the code much uglier (I almost used a GOTO, but my hands refused to type it ...). I'll look into that. I tried constructing a test case where the rehashing time would be significant enough for this to matter, but I couldn't find one. 
Even if the original estimate on the number of batches is way too small, we ramp up quickly to a reasonable size and the rehashing time is completely insignificant compared to all the other work. So I think we can drop that idea. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
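The dense allocation scheme being committed here (32 kB chunks, tuples larger than 1/4 of the chunk getting a dedicated chunk) can be sketched as follows. This is illustrative only; PostgreSQL's version allocates from a memory context rather than malloc(), and the names here are made up.

```c
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE          (32 * 1024)
#define OVERSIZE_THRESHOLD  (CHUNK_SIZE / 4)

typedef struct HashChunk
{
    struct HashChunk *next;
    size_t      used;           /* bytes handed out so far */
    size_t      maxlen;         /* capacity of data[] */
    char        data[];         /* tuples packed densely here */
} HashChunk;

static HashChunk *chunk_list = NULL;

static void *
dense_alloc(size_t size)
{
    HashChunk  *chunk;
    void       *ptr;

    if (size > OVERSIZE_THRESHOLD)
    {
        /* oversized tuple: give it its own chunk, linked in *behind*
         * the list head so the current chunk keeps filling up */
        chunk = malloc(offsetof(HashChunk, data) + size);
        chunk->used = chunk->maxlen = size;
        if (chunk_list != NULL)
        {
            chunk->next = chunk_list->next;
            chunk_list->next = chunk;
        }
        else
        {
            chunk->next = NULL;
            chunk_list = chunk;
        }
        return chunk->data;
    }

    /* start a new chunk if the current one can't fit this tuple */
    if (chunk_list == NULL || chunk_list->maxlen - chunk_list->used < size)
    {
        chunk = malloc(offsetof(HashChunk, data) + CHUNK_SIZE);
        chunk->used = 0;
        chunk->maxlen = CHUNK_SIZE;
        chunk->next = chunk_list;
        chunk_list = chunk;
    }

    ptr = chunk_list->data + chunk_list->used;
    chunk_list->used += size;   /* the real code would MAXALIGN(size) */
    return ptr;
}
```

The payoff is that small tuples carry no per-allocation header overhead, and rebatching can walk and free whole chunks instead of individual tuples.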
Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)
On Wed, Sep 10, 2014 at 1:36 PM, Peter Geoghegan p...@heroku.com wrote: In order to know how much we're giving up in that case, we need the exact number I asked you to provide in my previous email: the ratio of the cost of strcoll() to the cost of memcmp(). I see that you haven't chosen to provide that information in any of your four responses. Well, it's kind of difficult to give that number in a vacuum. No, not really. All you have to do is write a little test program to gather the information. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
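A little test program of the sort Robert suggests could look like this. It is a hypothetical sketch: the helper name, inputs, and iteration count are arbitrary, and a serious measurement would vary string length, locale, and how similar the operands are.

```c
#include <string.h>
#include <time.h>

/* Time `iters` comparisons of the same pair of strings, using either
 * strcoll() or memcmp(), and return the elapsed CPU seconds.  The
 * ratio of the two timings estimates the strcoll()/memcmp() cost
 * ratio asked for above. */
static double
time_cmp(int use_strcoll, const char *a, const char *b,
         size_t len, int iters)
{
    volatile int sink = 0;      /* keep the compiler from eliding the loop */
    clock_t     start = clock();
    int         i;

    for (i = 0; i < iters; i++)
        sink += use_strcoll ? strcoll(a, b) : memcmp(a, b, len);
    (void) sink;
    return (double) (clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_cmp(0, ...)` and `time_cmp(1, ...)` on identical inputs and dividing the timings gives a first-order estimate of the ratio; results will differ substantially by locale and platform.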
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: The dense-alloc-v5.patch looks good to me. I have committed that with minor cleanup (more comments below). I have not looked at the second patch. Gah. I was in the middle of doing this. Sigh. * the chunk size is 32kB (instead of 16kB), and we're using a 1/4 threshold for 'oversized' items We need the threshold to be >= 8kB, to trigger the special case within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. Should we care about the fact that if there are only a few tuples, we will nevertheless waste 32kB of memory for the chunk? I guess not, but I thought I'd mention it. The smallest allowed value for work_mem is 64kB. I think we should change the threshold here to 1/8th. The worst case memory wastage as-is is ~32k/5 > 6k. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgbench throttling latency limit
On 09/10/2014 11:28 AM, Heikki Linnakangas wrote: On 09/10/2014 05:57 PM, Fabien COELHO wrote: Hello Heikki, I looked closer at this, and per Jan's comments, realized that we don't log the lag time in the per-transaction log file. I think that's a serious omission; when --rate is used, the schedule lag time is important information to make sense of the result. I think we have to apply the attached patch, and backpatch to 9.4. (The documentation on the log file format needs updating) Indeed. I think that people do not like it to change. I remember that I suggested to change timestamps to xxx.yy instead of the unreadable xxxyyy, and was told not to, because some people have tools which process the output, so the format MUST NOT CHANGE. So my behavior is now to avoid touching anything in this area. I'm fine if you do it, though:-) However I have no time to have a precise look at your patch to cross-check it before Friday. This is different from changing xxx yyy to xxx.yyy in two ways. First, this adds new information to the log that just isn't there otherwise; it's not just changing the format for the sake of it. Second, in this case it's easy to write a parser for the log format so that it works with both the old and new formats. Many such programs would probably ignore any unexpected extra fields, as a matter of lazy programming, while changing the separator from space to a dot would surely require changing every parsing program. We could leave out the lag fields, though, when --rate is not used. Without --rate, the lag is always zero anyway. That would keep the file format unchanged when you don't use the new --rate feature. I'm not sure if that would be better or worse for programs that might want to parse the information. We could also leave the default output format as is and introduce another option with a % style format string. 
Jan Also, this is bizarre: int64 wait = (int64) (throttle_delay * 1.00055271703 * -log(getrand(thread, 1, 10000) / 10000.0)); We're using getrand() to generate a uniformly distributed random value between 1 and 10000, and then converting it to a double between 0.0 and 1.0. The reason for this conversion is to have randomness but still avoid going to extreme multiplier values. The idea is to avoid a very large multiplier which would generate (even if not often) 0 tps when 100 tps is required. The 1/10000 granularity is basically random, but the multiplier stays bounded (max 9.2, so the minimum possible tps would be around 11 for a target of 100 tps, barring issues from the database in processing the transactions). So although this feature can be discussed and amended, it is fully deliberate. I think that it makes sense, so I would prefer to keep it as is. Maybe the comments could be updated to be clearer. Ok, yeah, the comments indeed didn't mention anything about that. I don't think such clamping is necessary. With that 9.2x clamp on the multiplier, the probability that any given transaction hits it is about 1/10000. And a delay 9.2 times the average is still quite reasonable, IMHO. The maximum delay on my laptop, when pg_erand48() returns DBL_MIN, seems to be about 700 x the average, which is still reasonable if you run a decent number of transactions. And of course, the probability of hitting such an extreme value is minuscule, so if you're just doing a few quick test runs with a small total number of transactions, you won't hit that. - Heikki -- Jan Wieck Senior Software Engineer http://slony.info -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] removing volatile qualifiers from lwlock.c
As discussed on the thread Spinlocks and compiler/memory barriers, now that we've made the spinlock primitives function as compiler barriers (we think), it should be possible to remove volatile qualifiers from many places in the source code. The attached patch does this in lwlock.c. If the changes in commit 0709b7ee72e4bc71ad07b7120acd117265ab51d0 (and follow-on commits) are correct and complete, applying this shouldn't break anything, while possibly giving the compiler room to optimize things better than it does today. However, demonstrating the necessity of that commit for these changes seems to be non-trivial. I tried applying this patch and reverting commits 5b26278822c69dd76ef89fd50ecc7cdba9c3f035, b4c28d1b92c81941e4fc124884e51a7c110316bf, and 0709b7ee72e4bc71ad07b7120acd117265ab51d0 on a PPC64 POWER8 box with a whopping 192 hardware threads (thanks, IBM!). I then ran the regression tests repeatedly, and I ran several long pgbench runs with as many as 350 concurrent clients. No failures. So I'm posting this patch in the hope that others can help. The relevant tests are: 1. If you apply this patch to master and run tests of whatever kind strikes your fancy, does anything break under high concurrency? If it does, then the above commits weren't enough to make this safe on your platform. 2. If you apply this patch to master, revert the commits mentioned above, and again run tests, does anything now break? If it does (but the first tests were OK), then that shows that those commits did something useful on your platform. The change to make spinlocks act as compiler barriers provides the foundation for, hopefully, a cleaner code base and easier future work on scalability and performance projects, so help is much appreciated. 
Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c index 7c96da5..66fb2e4 100644 --- a/src/backend/storage/lmgr/lwlock.c +++ b/src/backend/storage/lmgr/lwlock.c @@ -112,7 +112,7 @@ static lwlock_stats lwlock_stats_dummy; bool Trace_lwlocks = false; inline static void -PRINT_LWDEBUG(const char *where, const volatile LWLock *lock) +PRINT_LWDEBUG(const char *where, const LWLock *lock) { if (Trace_lwlocks) elog(LOG, %s(%s %d): excl %d shared %d head %p rOK %d, @@ -406,9 +406,7 @@ LWLock * LWLockAssign(void) { LWLock *result; - - /* use volatile pointer to prevent code rearrangement */ - volatile int *LWLockCounter; + int *LWLockCounter; LWLockCounter = (int *) ((char *) MainLWLockArray - 3 * sizeof(int)); SpinLockAcquire(ShmemLock); @@ -429,9 +427,7 @@ int LWLockNewTrancheId(void) { int result; - - /* use volatile pointer to prevent code rearrangement */ - volatile int *LWLockCounter; + int *LWLockCounter; LWLockCounter = (int *) ((char *) MainLWLockArray - 3 * sizeof(int)); SpinLockAcquire(ShmemLock); @@ -511,9 +507,8 @@ LWLockAcquireWithVar(LWLock *l, uint64 *valptr, uint64 val) /* internal function to implement LWLockAcquire and LWLockAcquireWithVar */ static inline bool -LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val) +LWLockAcquireCommon(LWLock *lock, LWLockMode mode, uint64 *valptr, uint64 val) { - volatile LWLock *lock = l; PGPROC *proc = MyProc; bool retry = false; bool result = true; @@ -525,7 +520,7 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val) PRINT_LWDEBUG(LWLockAcquire, lock); #ifdef LWLOCK_STATS - lwstats = get_lwlock_stats_entry(l); + lwstats = get_lwlock_stats_entry(lock); /* Count lock acquisition attempts */ if (mode == LW_EXCLUSIVE) @@ -642,13 +637,13 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val) * so that the lock manager or 
signal manager will see the received * signal when it next waits. */ - LOG_LWDEBUG(LWLockAcquire, T_NAME(l), T_ID(l), waiting); + LOG_LWDEBUG(LWLockAcquire, T_NAME(lock), T_ID(lock), waiting); #ifdef LWLOCK_STATS lwstats-block_count++; #endif - TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(l), T_ID(l), mode); + TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), T_ID(lock), mode); for (;;) { @@ -659,9 +654,9 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val) extraWaits++; } - TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(l), T_ID(l), mode); + TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), T_ID(lock), mode); - LOG_LWDEBUG(LWLockAcquire, T_NAME(l), T_ID(l), awakened); + LOG_LWDEBUG(LWLockAcquire, T_NAME(lock), T_ID(lock), awakened); /* Now loop back and try to acquire lock again. */ retry = true; @@ -675,10 +670,10 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val) /* We are done updating shared state of the lock itself. */ SpinLockRelease(lock-mutex); - TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(l), T_ID(l), mode); +
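The property the patch relies on, that a lock acquire/release which is also a compiler barrier makes volatile qualifiers on the protected state unnecessary, can be illustrated with a C11 sketch. This is illustrative only; PostgreSQL's spinlocks are its own s_lock.h primitives, not C11 atomics, and the names here are made up.

```c
#include <stdatomic.h>

static atomic_flag mutex = ATOMIC_FLAG_INIT;
static int shared_counter = 0;      /* note: no volatile qualifier */

static void
spin_acquire(void)
{
    /* test-and-set has (at least) acquire semantics and acts as a
     * compiler barrier, so reads of shared_counter cannot be hoisted
     * above this point */
    while (atomic_flag_test_and_set(&mutex))
        ;                           /* spin */
}

static void
spin_release(void)
{
    /* clear has release semantics: the store to shared_counter
     * cannot be sunk below this point */
    atomic_flag_clear(&mutex);
}

static int
increment_counter(void)
{
    int         val;

    spin_acquire();
    val = ++shared_counter;
    spin_release();
    return val;
}
```

Before commit 0709b7ee, the spinlock macros were not guaranteed compiler barriers, so code had to access shared state through volatile pointers to prevent the compiler from caching values across the lock calls; with barrier semantics in the primitives, as above, that crutch can go.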
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 09/10/2014 09:31 PM, Robert Haas wrote: On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: The dense-alloc-v5.patch looks good to me. I have committed that with minor cleanup (more comments below). I have not looked at the second patch. Gah. I was in the middle of doing this. Sigh. Sorry. * the chunk size is 32kB (instead of 16kB), and we're using a 1/4 threshold for 'oversized' items We need the threshold to be >= 8kB, to trigger the special case within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. Should we care about the fact that if there are only a few tuples, we will nevertheless waste 32kB of memory for the chunk? I guess not, but I thought I'd mention it. The smallest allowed value for work_mem is 64kB. I think we should change the threshold here to 1/8th. The worst case memory wastage as-is is ~32k/5 > 6k. Not sure how you arrived at that number. The worst case is that each 32k chunk is filled up to 24k+1 bytes, i.e. 8k-1 bytes is wasted in each chunk. That's 25% wastage. That's not too bad IMHO - the worst case waste of AllocSet is 50%. But if you want to twiddle it, feel free. Note that there's some interplay with the allocset code, like Tomas mentioned. If the threshold is 8k, palloc() will round up request sizes smaller than 8k anyway. That wastes some memory, nothing more serious than that, but it seems like a good idea to avoid it. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 10.9.2014 20:25, Heikki Linnakangas wrote: On 09/10/2014 01:49 AM, Tomas Vondra wrote: I also did a few 'minor' changes to the dense allocation patch, most notably: * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData The original naming seemed a bit awkward. That's too easy to confuse with regular MemoryContext and AllocChunk stuff. I renamed it to HashMemoryChunk. OK. * the chunk size is 32kB (instead of 16kB), and we're using a 1/4 threshold for 'oversized' items We need the threshold to be >= 8kB, to trigger the special case within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. Should we care about the fact that if there are only a few tuples, we will nevertheless waste 32kB of memory for the chunk? I guess not, but I thought I'd mention it. The smallest allowed value for work_mem is 64kB. I don't think that's a problem. I also considered Heikki's suggestion that while rebatching, we can reuse chunks with a single large tuple, instead of copying it needlessly. My attempt however made the code much uglier (I almost used a GOTO, but my hands refused to type it ...). I'll look into that. I tried constructing a test case where the rehashing time would be significant enough for this to matter, but I couldn't find one. Even if the original estimate on the number of batches is way too small, we ramp up quickly to a reasonable size and the rehashing time is completely insignificant compared to all the other work. So I think we can drop that idea. I don't think that had anything to do with rehashing. What you pointed out is that when rebatching, we have to keep ~50% of the tuples, which means 'copy into a new chunk, then throw away the old ones'. For large tuples (you mentioned 100MB tuples, IIRC), we needlessly allocate this amount of memory only to copy the single tuple and then throw away the old tuple. So (a) that's an additional memcpy, and (b) it allocates an additional 100MB of memory for a short period of time. 
Now, I'd guess when dealing with tuples this large, there will be more serious trouble elsewhere, but I don't want to neglect it. Maybe it's worth optimizing? Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 10.9.2014 20:31, Robert Haas wrote: On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: The dense-alloc-v5.patch looks good to me. I have committed that with minor cleanup (more comments below). I have not looked at the second patch. Gah. I was in the middle of doing this. Sigh. * the chunk size is 32kB (instead of 16kB), and we're using a 1/4 threshold for 'oversized' items We need the threshold to be >= 8kB, to trigger the special case within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. Should we care about the fact that if there are only a few tuples, we will nevertheless waste 32kB of memory for the chunk? I guess not, but I thought I'd mention it. The smallest allowed value for work_mem is 64kB. I think we should change the threshold here to 1/8th. The worst case memory wastage as-is is ~32k/5 > 6k. So you'd lower the threshold to 4kB? That may lower the wastage in the chunks, but palloc will actually allocate 8kB anyway, wasting up to an additional 4kB. So I don't see how lowering the threshold to 1/8th improves the situation ... Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On Wed, Sep 10, 2014 at 3:02 PM, Tomas Vondra t...@fuzzy.cz wrote: On 10.9.2014 20:31, Robert Haas wrote: On Wed, Sep 10, 2014 at 2:25 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: The dense-alloc-v5.patch looks good to me. I have committed that with minor cleanup (more comments below). I have not looked at the second patch. Gah. I was in the middle of doing this. Sigh. * the chunk size is 32kB (instead of 16kB), and we're using a 1/4 threshold for 'oversized' items We need the threshold to be >= 8kB, to trigger the special case within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. Should we care about the fact that if there are only a few tuples, we will nevertheless waste 32kB of memory for the chunk? I guess not, but I thought I'd mention it. The smallest allowed value for work_mem is 64kB. I think we should change the threshold here to 1/8th. The worst case memory wastage as-is is ~32k/5 > 6k. So you'd lower the threshold to 4kB? That may lower the wastage in the chunks, but palloc will actually allocate 8kB anyway, wasting up to an additional 4kB. So I don't see how lowering the threshold to 1/8th improves the situation ... Ah, OK. Well, never mind then. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 10.9.2014 20:55, Heikki Linnakangas wrote: On 09/10/2014 09:31 PM, Robert Haas wrote: * the chunk size is 32kB (instead of 16kB), and we're using a 1/4 threshold for 'oversized' items We need the threshold to be >= 8kB, to trigger the special case within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION. Should we care about the fact that if there are only a few tuples, we will nevertheless waste 32kB of memory for the chunk? I guess not, but I thought I'd mention it. The smallest allowed value for work_mem is 64kB. I think we should change the threshold here to 1/8th. The worst case memory wastage as-is is ~32k/5 > 6k. Not sure how you arrived at that number. The worst case is that each 32k chunk is filled up to 24k+1 bytes, i.e. 8k-1 bytes is wasted in each chunk. That's 25% wastage. That's not too bad IMHO - the worst case waste of AllocSet is 50%. But if you want to twiddle it, feel free. Note that there's some interplay with the allocset code, like Tomas mentioned. If the threshold is 8k, palloc() will round up request sizes smaller than 8k anyway. That wastes some memory, nothing more serious than that, but it seems like a good idea to avoid it. Yes, and pfree then keeps these blocks, which means a chance of keeping memory that won't be reused. That's why I chose the 8kB threshold. So I'd keep the 32kB/8kB, as proposed in the patch. But I don't see this as a big issue - my experience is that we either have tuples much smaller than 4kB, or much larger than 8kB. And even for the tuples between 4kB and 8kB, the waste is not that bad. regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 10.9.2014 20:25, Heikki Linnakangas wrote: On 09/10/2014 01:49 AM, Tomas Vondra wrote: I also did a few 'minor' changes to the dense allocation patch, most notably: * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData The original naming seemed a bit awkward. That's too easy to confuse with regular MemoryContext and AllocChunk stuff. I renamed it to HashMemoryChunk. BTW this breaks the second patch, which is allocating new chunks when resizing the hash table. Should I rebase the patch, or can the committer do s/MemoryChunk/HashMemoryChunk/ ? Assuming there are no more fixes needed, of course. Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] inherit support for foreign tables
I had a cursory look at this patch and the discussions around this. ISTM there are actually two new features in this: 1) allow CHECK constraints on foreign tables, and 2) support inheritance for foreign tables. How about splitting it into two? - Heikki
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On Wed, Sep 10, 2014 at 3:12 PM, Tomas Vondra t...@fuzzy.cz wrote: On 10.9.2014 20:25, Heikki Linnakangas wrote: On 09/10/2014 01:49 AM, Tomas Vondra wrote: I also did a few 'minor' changes to the dense allocation patch, most notably: * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData The original naming seemed a bit awkward. That's too easy to confuse with regular MemoryContext and AllocChunk stuff. I renamed it to HashMemoryChunk. BTW this breaks the second patch, which is allocating new chunks when resizing the hash table. Should I rebase the patch, or can the committer do s/MemoryChunk/HashMemoryChunk/ ? Assuming there are no more fixes needed, of course. Rebasing it will save the committer time, which will increase the chances that someone will look at your patch. So it's highly recommended. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] SKIP LOCKED DATA (work in progress)
On Wed, Sep 10, 2014 at 9:47 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I attach some additional minor suggestions to your patch. I don't see any attachment. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On Tue, Sep 9, 2014 at 6:03 PM, Petr Jelinek p...@2ndquadrant.com wrote: - The remaining functions are pq_init(), pq_comm_reset(), pq_flush(), pq_flush_if_writable(), pq_is_send_pending(), pq_putmessage(), pq_putmessage_noblock(), pq_startcopyout(), and pq_endcopyout(). These are the ones that I think are potentially interesting. I didn't choose to provide hooks for all of these in the submitted patch because they're not all needed for what I want to do here: pq_startcopyout() and pq_endcopyout() are only needed for V2 protocol support, which did not interest me (nor did COPY, really); pq_putmessage_noblock(), pq_flush_if_writable(), and pq_is_send_pending() are only used for the walsender protocol, which doesn't seem useful to redirect to a non-socket; and I just didn't happen to have any use for pq_init() or pq_comm_reset(). Hence what I ended up with. But, I could revisit that. Suppose I provide a structure with 10 function pointers for all ten of those functions, or maybe 9 since pq_init() is called so early that it's not likely we'll have control to put the hooks in place before that point, and anyway whatever code installs the hooks can do its own initialization then. Then make a global variable like pqSendMethods and #define pq_comm_reset() to be pqSendMethods->comm_reset(), pq_flush() to be pqSendMethods->flush(), and so on for all 9 or 10 methods. Then the pqmq code could just change pqSendMethods to point to its own method structure instead of the default one. Would that address this concern? It's more than 20 lines of code, but it's not TOO bad. Yes, although my issue with the hooks was not that you only provided them for 2 functions, but the fact that it had no structure and the implementation was "if hook set, do this, else do that", which I don't see as a good way of doing it. We've done it that way in a bunch of other places, like ExecutorRun().
An advantage of this approach (I think) is that jumping to a fixed address is faster than jumping through a function pointer - so with the approach I've taken here, the common case where we're talking to the client incurs only the overhead of a null-test, and the larger overhead of the function pointer jump is incurred only when the hook is in use. Maybe that's not enough of a difference to matter to anything, but I think the contention that I've invented some novel kind of interface here doesn't hold up to scrutiny. We have lots of hooks that work just like what I did here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
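The two dispatch styles being compared can be sketched as follows. All names here are hypothetical stand-ins, not the actual libpq symbols: a nullable hook pointer keeps the common case a direct call plus a null-test, while a method struct always jumps through a pointer but groups the related functions into one structured interface.

```c
#include <stddef.h>

static int sent_via_socket = 0;
static int sent_via_mq = 0;

static int
socket_putmessage(char msgtype, const char *s, size_t len)
{
    (void) msgtype; (void) s; (void) len;
    return ++sent_via_socket;       /* stand-in for the real socket path */
}

static int
mq_putmessage(char msgtype, const char *s, size_t len)
{
    (void) msgtype; (void) s; (void) len;
    return ++sent_via_mq;           /* stand-in for the shm_mq redirect */
}

/* Style A: a single nullable hook; the fast path is a direct call */
static int (*putmessage_hook) (char, const char *, size_t) = NULL;

static int
pq_putmessage_hooked(char msgtype, const char *s, size_t len)
{
    if (putmessage_hook != NULL)
        return putmessage_hook(msgtype, s, len);    /* redirected */
    return socket_putmessage(msgtype, s, len);      /* common case */
}

/* Style B: a struct of methods, like the pqSendMethods idea */
typedef struct
{
    int (*putmessage) (char, const char *, size_t);
    /* ... comm_reset, flush, flush_if_writable, etc. would go here ... */
} SendMethods;

static SendMethods socket_methods = { socket_putmessage };
static SendMethods mq_methods = { mq_putmessage };
static SendMethods *pq_send_methods = &socket_methods;

#define pq_putmessage_struct(t, s, l) \
    (pq_send_methods->putmessage((t), (s), (l)))
```

Style A is what the mail above defends (direct call in the common case); Style B is the structured alternative, where redirecting the whole protocol is one pointer assignment.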
Re: [HACKERS] Support for N synchronous standby servers
On 08/28/2014 10:10 AM, Michael Paquier wrote: + #synchronous_standby_num = -1 # number of standbys servers using sync rep To be honest, that's a horrible name for the GUC. Back when synchronous replication was implemented, we had looong discussions on this feature. It was called quorum commit back then. I'd suggest using the quorum term in this patch, too, that's a fairly well-known term in distributed computing for this. When synchronous replication was added, quorum was left out to keep things simple; the current feature set was the most we could all agree on to be useful. If you search the archives for quorum commit you'll see what I mean. There was a lot of confusion on what is possible and what is useful, but regarding this particular patch: people wanted to be able to describe more complicated scenarios. For example, imagine that you have a master and two standbys in the primary data center, and two more standbys in a different data center. It should be possible to specify that you must get acknowledgment from at least one standby in both data centers. Maybe you could hack that by giving the standbys in the same data center the same name, but it gets ugly, and it still won't scale to even more complex scenarios. Maybe that's OK - we don't necessarily need to solve all scenarios at once. But it's worth considering. BTW, how does this patch behave if there are multiple standbys connected with the same name? - Heikki
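The cross-data-center requirement Heikki describes is exactly what a single standby count cannot express. A grouped quorum check would look something like the sketch below — purely illustrative, not a proposed API: each group is a data center, and a commit is only released once every group has enough acknowledgments.

```c
#include <stdbool.h>

/*
 * Illustrative only: given the number of acknowledging synchronous
 * standbys in each group (e.g. each data center), require at least
 * 'needed' acks from every group before a commit is released.
 */
static bool
sync_quorum_satisfied(const int acks_per_group[], int ngroups, int needed)
{
    int     i;

    for (i = 0; i < ngroups; i++)
        if (acks_per_group[i] < needed)
            return false;
    return true;
}
```

A flat `synchronous_standby_num = 2` cannot distinguish "two acks from anywhere" from "one ack per data center", which is the scenario above.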
Re: [HACKERS] Memory Alignment in Postgres
On Wed, Sep 10, 2014 at 11:43:52AM -0400, Robert Haas wrote: But there are a couple of obvious problems with this idea, too, such as: 1. It's really complicated and a ton of work. 2. It would break pg_upgrade pretty darn badly unless we employed some even-more-complex strategy to mitigate that. 3. The savings might not be enough to justify the effort. It might be interesting for someone to develop a tool measuring the number of bytes of alignment padding we lose per tuple or per page and gather some statistics on it on various databases. That would give us some sense as to the possible savings. And will we ever implement a logical attribute system so we can reorder the stored attributes to minimize wasted space? -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
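Bruce's question about logical attribute ordering can be made concrete with a toy layout calculation — a sketch, not PostgreSQL's actual tuple-forming code: pad each attribute to its alignment in declared order, then compare against the same attributes sorted by descending alignment, which never needs padding between attributes.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct
{
    size_t  size;   /* attribute's storage size */
    size_t  align;  /* required alignment, a power of two */
} Attr;

/* Total bytes used when attributes are laid out in the given order. */
static size_t
row_size(const Attr *attrs, int natts)
{
    size_t  off = 0;
    int     i;

    for (i = 0; i < natts; i++)
    {
        /* round offset up to the attribute's alignment (cf. TYPEALIGN) */
        off = (off + attrs[i].align - 1) & ~(attrs[i].align - 1);
        off += attrs[i].size;
    }
    return off;
}

/* qsort comparator: widest alignment first; every subsequent offset is
 * then already suitably aligned, so no padding is ever inserted. */
static int
by_align_desc(const void *a, const void *b)
{
    const Attr *x = a;
    const Attr *y = b;

    return (x->align < y->align) - (x->align > y->align);
}
```

For a row declared as (boolean, bigint, integer, bigint) this yields 32 bytes as declared but only 21 after sorting — 11 bytes of padding recovered per tuple, which is the kind of number the measuring tool Robert suggests would report in aggregate.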
Re: [HACKERS] SKIP LOCKED DATA (work in progress)
Robert Haas wrote: On Wed, Sep 10, 2014 at 9:47 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I attach some additional minor suggestions to your patch. I don't see any attachment. Uh. Here it is. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 51d1889..2b336b0 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -4090,7 +4090,7 @@ get_mxact_status_for_lock(LockTupleMode mode, bool is_update) * cid: current command ID (used for visibility test, and stored into * tuple's cmax if lock is successful) * mode: indicates if shared or exclusive tuple lock is desired - * wait_policy: whether to block, ereport or skip if lock not available + * wait_policy: what to do if tuple lock is not available * follow_updates: if true, follow the update chain to also lock descendant * tuples. * diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c index e4065c2..a718c0b 100644 --- a/src/backend/executor/execMain.c +++ b/src/backend/executor/execMain.c @@ -1928,6 +1928,9 @@ EvalPlanQual(EState *estate, EPQState *epqstate, * * Returns a palloc'd copy of the newest tuple version, or NULL if we find * that there is no newest version (ie, the row was deleted not updated). + * We also return NULL if the tuple is locked and the wait policy is to skip + * such tuples. + * * If successful, we have locked the newest tuple version, so caller does not * need to worry about it changing anymore.
* @@ -1994,7 +1997,7 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode, break; case LockWaitSkip: if (!ConditionalXactLockTableWait(SnapshotDirty.xmax)) - return NULL; /* skipping instead of waiting */ + return NULL; /* skip instead of waiting */ break; case LockWaitError: if (!ConditionalXactLockTableWait(SnapshotDirty.xmax)) @@ -2076,6 +2079,10 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode, /* tuple was deleted, so give up */ return NULL; +case HeapTupleWouldBlock: + elog(WARNING, "this is an odd case"); + return NULL; + default: ReleaseBuffer(buffer); elog(ERROR, "unrecognized heap_lock_tuple status: %u", diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c index f9feff4..990240b 100644 --- a/src/backend/executor/nodeLockRows.c +++ b/src/backend/executor/nodeLockRows.c @@ -179,7 +179,10 @@ lnext: if (copyTuple == NULL) { - /* Tuple was deleted or skipped (in SKIP LOCKED), so don't return it */ + /* + * Tuple was deleted; or it's locked and we're under SKIP + * LOCKED policy, so don't return it + */ goto lnext; } /* remember the actually locked tuple's TID */ diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c index 0f6e9b0..1f66928 100644 --- a/src/backend/parser/analyze.c +++ b/src/backend/parser/analyze.c @@ -2515,11 +2515,12 @@ applyLockingClause(Query *qry, Index rtindex, * a shared and exclusive lock at the same time; it'll end up being * exclusive anyway.) * - * We also consider that NOWAIT wins if it is specified multiple ways, - * otherwise SKIP LOCKED wins. This is a bit more debatable but - * raising an error doesn't seem helpful. (Consider for instance - * SELECT FOR UPDATE NOWAIT from a view that internally contains a - * plain FOR UPDATE spec.) + * Similarly, if the same RTE is specified with more than one lock wait + * policy, consider that NOWAIT wins over SKIP LOCKED, which in turn + * wins over waiting for the lock (the default).
This is a bit more + * debatable but raising an error doesn't seem helpful. (Consider for + * instance SELECT FOR UPDATE NOWAIT from a view that internally + * contains a plain FOR UPDATE spec.) * * And of course pushedDown becomes false if any clause is explicit. */ diff --git a/src/include/utils/lockwaitpolicy.h b/src/include/utils/lockwaitpolicy.h index f8ad07e..7f9a693 100644 --- a/src/include/utils/lockwaitpolicy.h +++ b/src/include/utils/lockwaitpolicy.h @@ -1,34 +1,31 @@ /*- - * * lockwaitpolicy.h - * Header file for the enum LockWaitPolicy enum, which is needed in - * several modules (parser, planner, executor, heap manager). + * Header file for LockWaitPolicy enum. * * Copyright (c) 2014, PostgreSQL Global Development Group * * src/include/utils/lockwaitpolicy.h - * *- */ #ifndef LOCKWAITPOLICY_H #define LOCKWAITPOLICY_H /* - * Policy for what to do when a row lock cannot be obtained immediately. - * - * The enum values defined here control how the parser treats multiple FOR - * UPDATE/SHARE
[HACKERS] Commitfest status
Another commitfest week has passed, and here we still are. There are now 24 patches in Needs Review state, and 8 in Ready for Committer. I'm not paying attention to the Waiting on Author patches - once we're close to zero on the other patches, those will be bounced back. The good news is that all but two patches have a reviewer assigned. The bad news is that the rest don't seem to be graduating from the Needs Review state. Reviewers: please review your patches. And then pick another patch to review; one of the two that have no reviewer assigned yet, or some other patch. It is only good to have more than one reviewer for the same patch. Patch authors: if your patch is not getting reviewed in a timely fashion, please send an off-list note to the reviewer after a few days and ask what the status is. The reviewer might not realize that you're waiting. - Heikki
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On Wed, Sep 10, 2014 at 4:01 PM, Robert Haas robertmh...@gmail.com wrote: Yes, although my issue with the hooks was not that you only provided them for 2 functions, but the fact that it had no structure and the implementation was "if hook set, do this, else do that", which I don't see as a good way of doing it. We've done it that way in a bunch of other places, like ExecutorRun(). An advantage of this approach (I think) is that jumping to a fixed address is faster than jumping through a function pointer - so with the approach I've taken here, the common case where we're talking to the client incurs only the overhead of a null-test, and the larger overhead of the function pointer jump is incurred only when the hook is in use. Maybe that's not enough of a difference to matter to anything, but I think the contention that I've invented some novel kind of interface here doesn't hold up to scrutiny. We have lots of hooks that work just like what I did here. Here's what the other approach looks like. I can't really see doing it this way and then only providing hooks for those two functions, so this is with hooks for all the send-side stuff. Original version: 9 files changed, 295 insertions(+), 3 deletions(-) This version: 9 files changed, 428 insertions(+), 47 deletions(-) There is admittedly a certain elegance to providing a complete set of hooks, so maybe this is the way to go. The remaining patches in the patch series work with either the old version or this one; the changes here don't affect anything else. Anyone else have an opinion on which way is better here? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company commit 76b9b171f81b1639e9d0379f5066212107a740f6 Author: Robert Haas rh...@postgresql.org Date: Wed May 28 19:30:21 2014 -0400 Support frontend-backend protocol communication using a shm_mq.
A background worker can use pq_redirect_to_shm_mq() to direct protocol that would normally be sent to the frontend to a shm_mq so that another process may read them. The receiving process may use pq_parse_errornotice() to parse an ErrorResponse or NoticeResponse from the background worker and, if it wishes, ThrowErrorData() to propagate the error (with or without further modification). V2: Lots more hooks, and a struct! diff --git a/src/backend/libpq/Makefile b/src/backend/libpq/Makefile index 8be0572..09410c4 100644 --- a/src/backend/libpq/Makefile +++ b/src/backend/libpq/Makefile @@ -15,7 +15,7 @@ include $(top_builddir)/src/Makefile.global # be-fsstubs is here for historical reasons, probably belongs elsewhere OBJS = be-fsstubs.o be-secure.o auth.o crypt.o hba.o ip.o md5.o pqcomm.o \ - pqformat.o pqsignal.o + pqformat.o pqmq.o pqsignal.o ifeq ($(with_openssl),yes) OBJS += be-secure-openssl.o diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index 605d891..7c4252d 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -102,7 +102,6 @@ int Unix_socket_permissions; char *Unix_socket_group; - /* Where the Unix socket files are (list of palloc'd strings) */ static List *sock_paths = NIL; @@ -134,16 +133,38 @@ static bool DoingCopyOut; /* Internal functions */ -static void pq_close(int code, Datum arg); +static void socket_comm_reset(void); +static void socket_close(int code, Datum arg); +static void socket_set_nonblocking(bool nonblocking); +static int socket_flush(void); +static int socket_flush_if_writable(void); +static bool socket_is_send_pending(void); +static int socket_putmessage(char msgtype, const char *s, size_t len); +static void socket_putmessage_noblock(char msgtype, const char *s, size_t len); +static void socket_startcopyout(void); +static void socket_endcopyout(bool errorAbort); static int internal_putbytes(const char *s, size_t len); static int internal_flush(void); -static void pq_set_nonblocking(bool 
nonblocking); +static void socket_set_nonblocking(bool nonblocking); #ifdef HAVE_UNIX_SOCKETS static int Lock_AF_UNIX(char *unixSocketDir, char *unixSocketPath); static int Setup_AF_UNIX(char *sock_path); #endif /* HAVE_UNIX_SOCKETS */ +PQsendMethods PQsendSocketMethods; + +static PQsendMethods PqSendSocketMethods = { + socket_comm_reset, + socket_flush, + socket_flush_if_writable, + socket_is_send_pending, + socket_putmessage, + socket_putmessage_noblock, + socket_startcopyout, + socket_endcopyout +}; + /* * pq_init - initialize libpq at backend startup @@ -152,24 +173,25 @@ static int Setup_AF_UNIX(char *sock_path); void pq_init(void) { + PqSendMethods = PqSendSocketMethods; PqSendBufferSize = PQ_SEND_BUFFER_SIZE; PqSendBuffer = MemoryContextAlloc(TopMemoryContext, PqSendBufferSize); PqSendPointer = PqSendStart = PqRecvPointer = PqRecvLength = 0; PqCommBusy = false; DoingCopyOut = false; - on_proc_exit(pq_close, 0); + on_proc_exit(socket_close,
Re: [HACKERS] Memory Alignment in Postgres
On Wed, Sep 10, 2014 at 4:29 PM, Bruce Momjian br...@momjian.us wrote: On Wed, Sep 10, 2014 at 11:43:52AM -0400, Robert Haas wrote: But there are a couple of obvious problems with this idea, too, such as: 1. It's really complicated and a ton of work. 2. It would break pg_upgrade pretty darn badly unless we employed some even-more-complex strategy to mitigate that. 3. The savings might not be enough to justify the effort. It might be interesting for someone to develop a tool measuring the number of bytes of alignment padding we lose per tuple or per page and gather some statistics on it on various databases. That would give us some sense as to the possible savings. And will we ever implement a logical attribute system so we can reorder the stored attributes to minimize wasted space? You forgot to attach the patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins
On 10.9.2014 21:34, Robert Haas wrote: On Wed, Sep 10, 2014 at 3:12 PM, Tomas Vondra t...@fuzzy.cz wrote: On 10.9.2014 20:25, Heikki Linnakangas wrote: On 09/10/2014 01:49 AM, Tomas Vondra wrote: I also did a few 'minor' changes to the dense allocation patch, most notably: * renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData The original naming seemed a bit awkward. That's too easy to confuse with regular MemoryContext and AllocChunk stuff. I renamed it to HashMemoryChunk. BTW this breaks the second patch, which is allocating new chunks when resizing the hash table. Should I rebase the patch, or can the committer do s/MemoryChunk/HashMemoryChunk/ ? Assuming there are no more fixes needed, of course. Rebasing it will save the committer time, which will increase the chances that someone will look at your patch. So it's highly recommended. OK. So here's v13 of the patch, reflecting this change. regards Tomas diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 781a736..d128e08 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -1900,18 +1900,21 @@ show_hash_info(HashState *hashstate, ExplainState *es) if (es->format != EXPLAIN_FORMAT_TEXT) { ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es); + ExplainPropertyLong("Original Hash Buckets", + hashtable->nbuckets_original, es); ExplainPropertyLong("Hash Batches", hashtable->nbatch, es); ExplainPropertyLong("Original Hash Batches", hashtable->nbatch_original, es); ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es); } - else if (hashtable->nbatch_original != hashtable->nbatch) + else if ((hashtable->nbatch_original != hashtable->nbatch) || + (hashtable->nbuckets_original != hashtable->nbuckets)) { appendStringInfoSpaces(es->str, es->indent * 2); appendStringInfo(es->str, - "Buckets: %d Batches: %d (originally %d) Memory Usage: %ldkB\n", - hashtable->nbuckets, hashtable->nbatch, - hashtable->nbatch_original, spacePeakKb); + "Buckets: %d (originally %d) Batches: %d (originally %d) Memory Usage: %ldkB\n", + hashtable->nbuckets, hashtable->nbuckets_original, + hashtable->nbatch, hashtable->nbatch_original, spacePeakKb); } else { diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c index 3ef7cfb..4f00426 100644 --- a/src/backend/executor/nodeHash.c +++ b/src/backend/executor/nodeHash.c @@ -39,6 +39,7 @@ static void ExecHashIncreaseNumBatches(HashJoinTable hashtable); +static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable); static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node, int mcvsToUse); static void ExecHashSkewTableInsert(HashJoinTable hashtable, @@ -49,6 +50,9 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable); static void *dense_alloc(HashJoinTable hashtable, Size size); +/* Memory needed for optimal number of buckets. */ +#define BUCKETS_SPACE(htab) ((htab)->nbuckets_optimal * sizeof(HashJoinTuple)) + /* * ExecHash * @@ -117,6 +121,7 @@ MultiExecHash(HashState *node) /* It's a skew tuple, so put it into that hash table */ ExecHashSkewTableInsert(hashtable, slot, hashvalue, bucketNumber); +hashtable->skewTuples += 1; } else { @@ -127,6 +132,25 @@ MultiExecHash(HashState *node) } } + /* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */ + if (hashtable->nbuckets != hashtable->nbuckets_optimal) + { + /* We never decrease the number of buckets. */ + Assert(hashtable->nbuckets_optimal > hashtable->nbuckets); + +#ifdef HJDEBUG + printf("Increasing nbuckets %d => %d\n", + hashtable->nbuckets, hashtable->nbuckets_optimal); +#endif + + ExecHashIncreaseNumBuckets(hashtable); + } + + /* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */ + hashtable->spaceUsed += BUCKETS_SPACE(hashtable); + if (hashtable->spaceUsed > hashtable->spacePeak) + hashtable->spacePeak = hashtable->spaceUsed; + /* must provide our own instrumentation support */ if (node->ps.instrument) InstrStopNode(node->ps.instrument, hashtable->totalTuples); @@ -272,7 +296,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls) */ hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData)); hashtable->nbuckets = nbuckets; + hashtable->nbuckets_original = nbuckets; + hashtable->nbuckets_optimal = nbuckets; hashtable->log2_nbuckets = log2_nbuckets; + hashtable->log2_nbuckets_optimal = log2_nbuckets; hashtable->buckets = NULL; hashtable->keepNulls = keepNulls; hashtable->skewEnabled = false; @@ -286,6 +313,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls) hashtable->nbatch_outstart = nbatch; hashtable->growEnabled = true; hashtable->totalTuples = 0; + hashtable->skewTuples = 0; hashtable->innerBatchFile = NULL; hashtable->outerBatchFile = NULL;
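The core of the v13 change above can be paraphrased in a few lines. This is a simplified sketch of the patch's bookkeeping, not the patch itself: while the hash table is being built, only the *optimal* bucket count is advanced; the actual bucket array is rebuilt once, at the end of the build, if the two counts differ.

```c
#define NTUP_PER_BUCKET 10      /* illustrative target load factor */

typedef struct
{
    long    ntuples;            /* tuples inserted so far */
    int     nbuckets;           /* buckets currently allocated */
    int     nbuckets_optimal;   /* buckets we would like to have */
} HashSketch;

/*
 * Called per inserted tuple: we never shrink, only double the optimal
 * count while the load factor would exceed NTUP_PER_BUCKET.  The real
 * resize (ExecHashIncreaseNumBuckets in the patch) then happens just
 * once, after the build, if nbuckets != nbuckets_optimal.
 */
static void
note_inserted_tuple(HashSketch *h)
{
    h->ntuples++;
    while (h->ntuples > (long) h->nbuckets_optimal * NTUP_PER_BUCKET)
        h->nbuckets_optimal *= 2;
}
```

Deferring the resize this way means the bucket array is reallocated at most once per batch instead of repeatedly as the underestimated tuple count grows.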
Re: [HACKERS] Need Multixact Freezing Docs
On Fri, Sep 5, 2014 at 07:39:36PM -0400, Bruce Momjian wrote: On Wed, Sep 3, 2014 at 05:17:17PM -0400, Robert Haas wrote: I had a look at this and came upon a problem --- there is no multi-xid SQL data type, and in fact the system catalogs that store mxid values use xid, e.g.: relminmxid | xid | not null With no mxid data type, there is no way to do function overloading to cause age to call the mxid variant. Should we use an explicit mxid_age() function name? Add an mxid data type? Maybe both. But mxid_age() is probably the simpler way forward just to start. OK, patch applied using mxid_age() and no new data type. Applied. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] [BUGS] BUG #10823: Better REINDEX syntax.
On 09/08/2014 06:17 AM, Stephen Frost wrote: * Vik Fearing (vik.fear...@dalibo.com) wrote: On 09/02/2014 10:17 PM, Marko Tiikkaja wrote: Yeah, I think I like this better than allowing all of them without the database name. Why? It's just a noise word! Eh, because it ends up reindexing system tables too, which is probably not what new folks are expecting. No behavior is changed at all. REINDEX DATABASE dbname; has always hit the system tables. Since dbname can *only* be the current database, there's no logic nor benefit in requiring it to be specified. Also, it's not required when you say 'user tables', so it's similar to your user_tables v1 patch in that regard. The fact that REINDEX USER TABLES; is the only one that doesn't require the dbname seems very inconsistent and confusing. Yes, I will update the patch. Still planning to do this..? Marking this back to waiting-for-author. Yes, but probably not for this commitfest unfortunately. -- Vik
Re: [HACKERS] [BUGS] BUG #10823: Better REINDEX syntax.
* Vik Fearing (vik.fear...@dalibo.com) wrote: On 09/08/2014 06:17 AM, Stephen Frost wrote: * Vik Fearing (vik.fear...@dalibo.com) wrote: On 09/02/2014 10:17 PM, Marko Tiikkaja wrote: Yeah, I think I like this better than allowing all of them without the database name. Why? It's just a noise word! Eh, because it ends up reindexing system tables too, which is probably not what new folks are expecting. No behavior is changed at all. REINDEX DATABASE dbname; has always hit the system tables. Since dbname can *only* be the current database, there's no logic nor benefit in requiring it to be specified. Sure, but I think the point is that reindexing the system tables as part of a database-wide reindex is a *bad* thing which we shouldn't be encouraging by making it easier. I realize you're a bit 'stuck' here because we don't like the current behavior, but we don't want to change it either. Also, it's not required when you say 'user tables', so it's similar to your user_tables v1 patch in that regard. The fact that REINDEX USER TABLES; is the only one that doesn't require the dbname seems very inconsistent and confusing. I understand, but the alternative would be a 'reindex;' which *doesn't* reindex the system tables- would that be less confusing? Or getting rid of the current 'reindex database' which also reindexes system tables... Yes, I will update the patch. Still planning to do this..? Marking this back to waiting-for-author. Yes, but probably not for this commitfest unfortunately. Fair enough, I'll mark it 'returned with feedback'. Thanks! Stephen
Re: [HACKERS] pg_basebackup failed to back up large file
On Mon, Jun 9, 2014 at 11:17:41AM +0200, Magnus Hagander wrote: On Wednesday, June 4, 2014, Tom Lane t...@sss.pgh.pa.us wrote: Magnus Hagander mag...@hagander.net writes: On Tue, Jun 3, 2014 at 6:38 PM, Tom Lane t...@sss.pgh.pa.us wrote: Another thought is we could make pg_basebackup simply skip any files that exceed RELSEG_SIZE, on the principle that you don't really need/want enormous log files to get copied anyhow. We'd still need the pax extension if the user had configured large RELSEG_SIZE, but having a compatible tar could be documented as a requirement of doing that. I think going all the way to pax is the proper long-term thing to do, at least if we can confirm it works in the main tar implementations. For backpatchable that seems more reasonable. It doesn't work today, and we just need to document that it doesn't, with larger RELSEG_SIZE. And then fix it properly for the future. Agreed, this would be a reasonable quick fix that we could replace in 9.5 or later. Fujii, are you going to be able to work on this with the now expanded scope? :) Is this a TODO or doc item? -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
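The underlying limitation here is mechanical: the ustar header stores a member's size in a 12-byte field holding 11 octal digits, so anything of 8GB or more needs the pax extended-header extension discussed above. A hedged sketch of the check involved (the helper name is hypothetical, not pg_basebackup's actual code):

```c
#include <stdbool.h>
#include <stdint.h>

/* 11 octal digits: the maximum representable member size is 2^33 - 1,
 * i.e. 8GB minus one byte. */
#define USTAR_MAX_MEMBER_SIZE   ((UINT64_C(1) << 33) - 1)

/* Hypothetical helper: can this file go into a plain ustar archive
 * without pax extended headers? */
static bool
fits_in_ustar(uint64_t filesize)
{
    return filesize <= USTAR_MAX_MEMBER_SIZE;
}
```

With the default 1GB RELSEG_SIZE every relation segment fits comfortably; only unsegmented files (such as large server logs) or a much larger configured RELSEG_SIZE can cross the 8GB line.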
Re: [HACKERS] pg_control is missing a field for LOBLKSIZE
On Tue, Jun 17, 2014 at 08:46:00PM -0400, Bruce Momjian wrote: On Tue, Jun 17, 2014 at 07:12:16PM -0400, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: On Tue, Jun 17, 2014 at 03:55:02PM -0400, Alvaro Herrera wrote: Can't you compare it to the historic default value? I mean, add an assumption that people thus far have never tweaked it. Well, if they did tweak it, then they would be unable to use pg_upgrade because it would complain about a mismatch if they actually matched the old and new servers. What about comparing to the symbolic value LOBLKSIZE? This would make pg_upgrade assume that the old installation had been tweaked the same as in its own build. This ends up being the same as what you said, ie, effectively no comparison ... but it might be less complicated to code/understand. OK, assume the compiled-in default is the value for an old cluster that has no value --- yeah, I could do that. Turns out I already had values that could be missing in the old cluster, so I just used the same format for this, rather than testing for LOBLKSIZE. Attached patch applied. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god.
+ diff --git a/contrib/pg_upgrade/controldata.c b/contrib/pg_upgrade/controldata.c new file mode 100644 index 9282b8e..d105a59 *** a/contrib/pg_upgrade/controldata.c --- b/contrib/pg_upgrade/controldata.c *** get_control_data(ClusterInfo *cluster, b *** 54,59 --- 54,60 bool got_ident = false; bool got_index = false; bool got_toast = false; + bool got_large_object = false; bool got_date_is_int = false; bool got_float8_pass_by_value = false; bool got_data_checksum_version = false; *** get_control_data(ClusterInfo *cluster, b *** 357,362 --- 358,374 cluster->controldata.toast = str2uint(p); got_toast = true; } + else if ((p = strstr(bufin, "Size of a large-object chunk:")) != NULL) + { + p = strchr(p, ':'); + + if (p == NULL || strlen(p) <= 1) + pg_fatal("%d: controldata retrieval problem\n", __LINE__); + + p++; /* removing ':' char */ + cluster->controldata.large_object = str2uint(p); + got_large_object = true; + } else if ((p = strstr(bufin, "Date/time type storage:")) != NULL) { p = strchr(p, ':'); *** get_control_data(ClusterInfo *cluster, b *** 475,480 --- 487,494 !got_tli || !got_align || !got_blocksz || !got_largesz || !got_walsz || !got_walseg || !got_ident || !got_index || !got_toast || + (!got_large_object && + cluster->controldata.cat_ver >= LARGE_OBJECT_SIZE_PG_CONTROL_VER) || !got_date_is_int || !got_float8_pass_by_value || !got_data_checksum_version) { pg_log(PG_REPORT, *** get_control_data(ClusterInfo *cluster, b *** 527,532 --- 541,550 if (!got_toast) pg_log(PG_REPORT, "maximum TOAST chunk size\n"); + if (!got_large_object && + cluster->controldata.cat_ver >= LARGE_OBJECT_SIZE_PG_CONTROL_VER) + pg_log(PG_REPORT, "large-object chunk size\n"); + if (!got_date_is_int) pg_log(PG_REPORT, "dates/times are integers?\n"); *** check_control_data(ControlData *oldctrl, *** 576,581 --- 594,602 if (oldctrl->toast == 0 || oldctrl->toast != newctrl->toast) pg_fatal("old and new pg_controldata maximum TOAST chunk sizes are invalid or do not match\n"); + if (oldctrl->large_object == 0 || oldctrl->large_object != newctrl->large_object) + pg_fatal("old and new pg_controldata large-object chunk sizes are invalid or do not match\n"); + if (oldctrl->date_is_int != newctrl->date_is_int) pg_fatal("old and new pg_controldata date/time storage types do not match\n"); diff --git a/contrib/pg_upgrade/pg_upgrade.h b/contrib/pg_upgrade/pg_upgrade.h new file mode 100644 index 1ac3394..0207391 *** a/contrib/pg_upgrade/pg_upgrade.h --- b/contrib/pg_upgrade/pg_upgrade.h *** extern char *output_files[]; *** 116,121 --- 116,127 #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231 /* + * large object chunk size added to pg_controldata, + * commit 5f93c37805e7485488480916b4585e098d3cc883 + */ + #define LARGE_OBJECT_SIZE_PG_CONTROL_VER 942 + + /* * Each relation is represented by a relinfo structure. */ typedef struct *** typedef struct *** 203,208 --- 209,215 uint32 ident; uint32 index; uint32 toast; + uint32 large_object; bool date_is_int; bool float8_pass_by_value; bool data_checksum_version;
Re: [HACKERS] "cancelling statement due to user request" error occurs but the transaction has committed.
On Tue, Jun 10, 2014 at 10:30:24AM -0400, Robert Haas wrote: On Tue, Jun 10, 2014 at 10:18 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: I don't agree with this analysis. If the connection is closed after the client sends a COMMIT and before it gets a response, then the client must indeed be smart enough to figure out whether or not the commit happened. But if the server sends a response, the client should be able to rely on that response being correct. In this case, an ERROR is getting sent but the transaction is getting committed; yuck. I'm not sure whether the fix is right, but this definitely seems like a bug. In general, the only way to avoid that sort of behavior for a post-commit error would be to PANIC ... and even then, the transaction got committed, which might not be the expectation of a client that got an error message, even if it said PANIC. So this whole area is a minefield, and the only attractive thing we can do is to try to reduce the number of errors that can get thrown post-commit. We already, for example, do not treat post-commit file unlink failures as ERROR, though we surely would prefer to do that. We could treat it as a lost-communication scenario. The appropriate recovery actions from the client's point of view are identical. So from this standpoint, redefining SIGINT as not throwing an error when we're in post-commit seems like a good idea. I'm not endorsing any details of the patch here, but the 2-foot view seems generally sound. Cool, that makes sense to me also. Did we ever do anything about this? -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] Inaccuracy in VACUUM's tuple count estimates
On Thu, Jun 12, 2014 at 01:40:59PM +0200, Andres Freund wrote: Hi Tom, On 2014-06-06 15:44:25 -0400, Tom Lane wrote: I figured it'd be easy enough to get a better estimate by adding another counter to count just LIVE and INSERT_IN_PROGRESS tuples (thus effectively assuming that in-progress inserts and deletes will both commit). Did you plan to backpatch that? My inclination would be no... I did that, and found that it helped Tim's test case not at all :-(. A bit of sleuthing revealed that HeapTupleSatisfiesVacuum actually returns INSERT_IN_PROGRESS for any tuple whose xmin isn't committed, regardless of whether the transaction has since marked it for deletion:

    /*
     * It'd be possible to discern between INSERT/DELETE in progress
     * here by looking at xmax - but that doesn't seem beneficial for
     * the majority of callers and even detrimental for some. We'd
     * rather have callers look at/wait for xmin than xmax. It's
     * always correct to return INSERT_IN_PROGRESS because that's
     * what's happening from the view of other backends.
     */
    return HEAPTUPLE_INSERT_IN_PROGRESS;

It did not use to blow this question off: back around 8.3 you got DELETE_IN_PROGRESS if the tuple had a delete pending. I think we need less laziness + fuzzy thinking here. Maybe we should have a separate HEAPTUPLE_INSERT_AND_DELETE_IN_PROGRESS result code? Is it *really* the case that callers other than VACUUM itself are okay with failing to make this distinction? I'm dubious: there are very few if any callers that treat the INSERT and DELETE cases exactly alike. My current position on this is that we should leave the code as is in < 9.4 and return HEAPTUPLE_INSERT_IN_PROGRESS for the 9.4/master. Would you be ok with that? The second best thing imo would be to discern and return HEAPTUPLE_INSERT_IN_PROGRESS/HEAPTUPLE_DELETE_IN_PROGRESS for the respective cases. Which way would you like to go? Did we ever come to a conclusion on this?
-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
[HACKERS] about half processes are blocked by btree, btree is bottleneck?
I use benchmarksql with more than 200 clients on pg 9.3.3. While the test is running, I collect all the process stacks. I found about 100 processes are blocked on btree insert, and another 100 are blocked on XLogInsert. Does btree have bad performance in concurrency scenarios?

Sum:66
#0 0x7f8273a77627 in semop () from /lib64/libc.so.6
#1 0x0060cda7 in PGSemaphoreLock ()
#2 0x006511a9 in LWLockAcquire ()
#3 0x004987f7 in _bt_relandgetbuf ()
#4 0x0049c116 in _bt_search ()
#5 0x00497e13 in _bt_doinsert ()
#6 0x0049af52 in btinsert ()
#7 0x0072dce4 in FunctionCall6Coll ()
#8 0x0049592e in index_insert ()
#9 0x00590ac5 in ExecInsertIndexTuples ()

Sum:36
#0 0x7f8273a77627 in semop () from /lib64/libc.so.6
#1 0x0060cda7 in PGSemaphoreLock ()
#2 0x006511a9 in LWLockAcquire ()
#3 0x00497e31 in _bt_doinsert ()
#4 0x0049af52 in btinsert ()
#5 0x0072dce4 in FunctionCall6Coll ()
#6 0x0049592e in index_insert ()
#7 0x00590ac5 in ExecInsertIndexTuples ()
Re: [HACKERS] Support for N synchronous standby servers
On Thu, Sep 11, 2014 at 5:21 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 08/28/2014 10:10 AM, Michael Paquier wrote: + #synchronous_standby_num = -1 # number of standbys servers using sync rep To be honest, that's a horrible name for the GUC. Back when synchronous replication was implemented, we had looong discussions on this feature. It was called quorum commit back then. I'd suggest using the quorum term in this patch, too; that's a fairly well-known term in distributed computing for this. I am open to any suggestions. Then what about the following parameter names?
- synchronous_quorum_num
- synchronous_standby_quorum
- synchronous_standby_quorum_num
- synchronous_quorum_commit
When synchronous replication was added, quorum was left out to keep things simple; the current feature set was the most we could all agree on to be useful. If you search the archives for quorum commit you'll see what I mean. There was a lot of confusion on what is possible and what is useful, but regarding this particular patch: people wanted to be able to describe more complicated scenarios. For example, imagine that you have a master and two standbys in the primary data center, and two more standbys in a different data center. It should be possible to specify that you must get acknowledgment from at least one standby in both data centers. Maybe you could hack that by giving the standbys in the same data center the same name, but it gets ugly, and it still won't scale to even more complex scenarios. Currently two nodes can only have the same priority if they have the same application_name, so we could for example add a new connstring parameter called, let's say, application_group, to define groups of nodes that will have the same priority (if a node does not define application_group, it defaults to application_name; if app_name is NULL, well, we don't care much, it cannot be a sync candidate). That's a first idea that we could use to control groups of nodes.
And we could switch syncrep.c to use application_group in s_s_names instead of application_name. That would be backward-compatible, and could open the door for more improvements for quorum commits as we could control groups of nodes. Well, this is a super-set of what application_name can already do, but there is no problem to identify single nodes of the same data center and how much they could be late in replication, so I think that this would be really user-friendly. An idea similar to that would be a base work for the next thing... See below. Now, in your case the two nodes on the second data center need to have the same priority either way. With this patch you can achieve that with the same node name. Where things are not that cool with this patch is something like that though:
- 5 slaves: 1 with master (node_local), 2 on a 2nd data center (node_center1), 2 last on a 3rd data center (node_center2)
- s_s_num = 3
- s_s_names = 'node_local,node_center1,node_center2'
In this case the nodes have the following priority:
- node_local = 1
- the 2 nodes with node_center1 = 2
- the 2 nodes with node_center2 = 3
In this {1,2,2,3,3} schema, the patch makes the system wait for node_local and the two nodes in node_center1, without caring about the ones in node_center2, as it will pick up only the nodes with the highest priority. If a user expects the system to wait for a node in node_center2 he'll be disappointed. That's perhaps where we could improve things, by adding an extra parameter able to control the priority ranks, say synchronous_priority_check:
- absolute|individual: wait for the first s_s_num nodes having the lowest priority; in this case we'll wait for {1,2,2}
- group: wait for only one node in each of the lowest s_s_num priorities; here we'll wait for {1,2,3}
Note that we may not even need this parameter if we assume by default that we wait for only one node in a given group that has the same priority. Maybe that's OK - we don't necessarily need to solve all scenarios at once.
But it's worth considering. Parametrizing and coverage of the user expectations are tricky. Either way not everybody can be happy :) There are even people unhappy now because we can only define one single sync node. BTW, how does this patch behave if there are multiple standbys connected with the same name? All the nodes have the same priority. For example, in the case of a cluster with 5 slaves having the same application name and s_s_num = 3, the first three nodes when scanning the WAL sender array are expected to return a COMMIT before committing locally:

=# show synchronous_standby_num;
 synchronous_standby_num
-------------------------
 3
(1 row)

=# show synchronous_standby_names;
 synchronous_standby_names
---------------------------
 node
(1 row)

=# SELECT application_name, client_port,
       pg_xlog_location_diff(sent_location, flush_location) AS replay_delta,
       sync_priority, sync_state
   FROM pg_stat_replication ORDER BY replay_delta ASC, appl
Re: [HACKERS] pg_basebackup vs. Windows and tablespaces
On 20 August 2014 19:49, Amit Kapila wrote: There are some comments I would like to share with you. 1. Rebase the patch to current GIT head. Done. + initStringInfo(&symlinkfbuf); I think the declaration and initialization of the symlinkfbuf string can be moved under the #ifdef WIN32 compile-time macro; for other platforms it's simply allocated and freed, which can be avoided. Agreed, I have changed the patch as per your suggestion. I have done the testing and the behavior is as per expectation. Do we need to make some documentation change? I mean, is this limitation on Windows mentioned anywhere? If no change is needed then I will move the patch to "Ready For Committer". Thanks Regards, Dilip
Re: [HACKERS] Aussie timezone database changes incoming
On 09/10/2014 11:23 PM, Tom Lane wrote: In connection with a question asked today on pgsql-general, I had occasion to go check the release announcements for the IANA timezone database files, and it turns out that there are some big changes in 2014f: http://mm.icann.org/pipermail/tz-announce/2014-August/23.html The Russian changes are perhaps not such a big deal because they've done that sort of thing before, but this is an earful: Australian eastern time zone abbreviations are now AEST/AEDT not EST, and similarly for the other Australian zones. That is, for eastern standard and daylight saving time the abbreviations are AEST and AEDT instead of the former EST for both; similarly, ACST/ACDT, ACWST/ACWDT, and AWST/AWDT are now used instead of the former CST, CWST, and WST. This change does not affect UTC offsets, only time zone abbreviations. (Thanks to Rich Tibbett and many others.) Oh, lovely. I shouldn't be surprised that Australia gets to change. While the cynic in me thinks this is the usual USA-is-the-center-of-the-universe-ism, in reality it makes sense given relative population and likely impact. I'm wondering how many Aussie applications are going to break when this goes in, and if we could/should do anything about it. One idea that comes to mind is to create an Australia_old tznames file containing the current Aussie zone abbreviations, so as to provide an easy way to maintain backwards compatibility at need (you'd select that as your timezone_abbreviations GUC setting). Anyone from down under care to remark about the actual usage of old and new abbreviations? Most systems I see work in UTC, but I don't actually work with many that're in Australia. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] pg_basebackup vs. Windows and tablespaces
On Thu, Sep 11, 2014 at 9:10 AM, Dilip kumar dilip.ku...@huawei.com wrote: I have done the testing and the behavior is as per expectation. Do we need to make some documentation change? I mean, is this limitation on Windows mentioned anywhere? I don't think currently such a limitation is mentioned in the docs, however I think we can update the docs at the below locations: 1. In the description of pg_start_backup in the below page: http://www.postgresql.org/docs/devel/static/functions-admin.html#FUNCTIONS-ADMIN-BACKUP 2. In the explanation of Base Backup, basically under the heading Making a Base Backup Using the Low Level API at the below page: http://www.postgresql.org/docs/devel/static/continuous-archiving.html#BACKUP-BASE-BACKUP In general, we can explain the symlink_label file wherever we explain the backup_label file. If you think it is sufficient to explain symlink_label in the above places, then I can update the patch. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Aussie timezone database changes incoming
The Russian changes are perhaps not such a big deal because they've done that sort of thing before, but this is an earful: Australian eastern time zone abbreviations are now AEST/AEDT not EST, and similarly for the other Australian zones. That is, for eastern standard and daylight saving time the abbreviations are AEST and AEDT instead of the former EST for both; similarly, ACST/ACDT, ACWST/ACWDT, and AWST/AWDT are now used instead of the former CST, CWST, and WST. This change does not affect UTC offsets, only time zone abbreviations. (Thanks to Rich Tibbett and many others.) [...] Anyone from down under care to remark about the actual usage of old and new abbreviations? AEST/AEDT/etc are the official abbreviations and are commonly used. They have been increasingly used over the last 20 years or so, and the EST/EDT stuff in the Olson tz database has been a source of annoyance for a very long time, eg: http://thread.gmane.org/gmane.comp.time.tz/2262 Quite likely this change will break stuff, but my feeling is more people will be cheering than screaming. -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
I will repeat the above tests with high load on CPU and using the benchmark given by Fujii-san and post the results. Average % of CPU usage at user level for each of the compression algorithms is as follows.

Compression   Multiple    Single
Off           81.1338     81.1267
LZ4           81.0998     81.1695
Snappy        80.9741     80.9703
pglz          81.2353     81.2753

http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user_single.png
http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user.png

The numbers show the CPU utilization of Snappy is the least. The CPU utilization in increasing order is pglz > No compression > LZ4 > Snappy. The variance of the average CPU utilization numbers is very low. However, Snappy seems to be best when it comes to lesser utilization of CPU. As per the measurement results posted to date, LZ4 outperforms Snappy and pglz in terms of compression ratio and performance. However, CPU utilization numbers show Snappy utilizes the least amount of CPU, though the difference is not much. As there has been no consensus yet about which compression algorithm to adopt, is it better to make this decision independent of the FPW compression patch, as suggested earlier in this thread? FPW compression can be done using the built-in compression pglz, as it shows considerable performance over uncompressed WAL and a good compression ratio. Also, the patch to compress multiple blocks at once gives better compression as compared to a single block. ISTM that the performance overhead introduced by multiple-block compression is slightly higher than single-block compression, which can be tested again after modifying the patch to use pglz. Hence, this patch can be built using multiple-block compression. Thoughts? -- View this message in context: http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5818552.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.