Re: Review: [HACKERS] ECPG infrastructure changes part 1, was: Re: ECPG fixes
On 2014-01-16 22:13, Alvaro Herrera wrote: Alvaro Herrera wrote: Boszormenyi Zoltan wrote: All patches are attached again for completeness. Thanks. I pushed a commit comprising patches 09 through 14. Now also pushed 15 to 17. Thanks very much. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ECPG FETCH readahead, was: Re: ECPG fixes
On 2014-01-16 23:42, Alvaro Herrera wrote: Boszormenyi Zoltan wrote: Rebased patches after the regression test and other details were fixed in the infrastructure part. This thread started in 2010; various pieces have been applied already and some others have changed in nature. Would you please post a new patchset, containing rebased patches that still need application, in a new email thread to be linked in the commitfest entry? I will do that. Best regards, Zoltán Böszörményi
[HACKERS] Funny representation in pg_stat_statements.query.
Hello, I noticed that pg_stat_statements.query can have funny values.

| =# select pg_stat_statements_reset();
| =# select current_timestamp(0);
| (snip)
| =# select query from pg_stat_statements;
|               query
| select ?(0);
| select pg_stat_statements_reset();

The same thing happens for CURRENT_DATE, CURRENT_TIME, LOCALTIME and LOCALTIMESTAMP, which are specially treated, that is, immediately replaced with a combination of a source function and a typecast in gram.y. The story is as follows. At first, for instance, CURRENT_TIMESTAMP(0) is replaced with 'now()::timestamptz(0)'; CURRENT_TIMESTAMP's location is used as that of 'now', and the location of the const '1' of LOCALTIME is used as that of the const '1' for 'timestamptz'. Let's assume the original query was,

| pos: Query
|  0: SELECT
|  7: CURRENT_TIMESTAMP(
| 25: 0
|   : );

This is parsed in gram.y as follows; the location for 'CURRENT_TIMESTAMP' above (7) is used as the location for the Const 'now'.

| TypeCast {
|   TypeCast {
|     A_Const(now)
|     TypeName { names = [text], loc = -1 (undef) }
|     loc = 7
|   }
|   Typename { names = [timestamptz], loc = -1 }
|   typemods = [Const {1, loc = 25}]
| }

Then this is transformed into,

| FuncExpr {
|   funcid = 'timestamptz'
|   args = [
|     CoerceViaIO {
|       arg = Const { type = text, value = now, loc = 7 }
|       loc = -1
|     }
|     Const { type = int4, value = 1, loc = -1 }
|   ]
| }

Finally, pg_stat_statements picks the location '7' as the location of some constant and replaces the token there with '?', even though it is originally the location of 'CURRENT_TIMESTAMP', which is not a constant from the user's point of view. I found two ways to fix this issue. Both of them prevent the wrong masking, but the original constant parameter ('0') will no longer be masked. This behavior seems sane enough for the purpose.

A. Making pg_stat_statements conscious of the token types to prevent wrong masking.
20140117_remove_needless_location_setting.patch

B. Letting gram.y not set the location for the problematic nodes.
20140117_skip_nonconstants_on_nomalization.patch

I don't have a firm idea which is preferable, or whether there is another possible solution.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index f0b9507..30da7aa 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -11355,7 +11355,7 @@ func_expr_common_subexpr:
 				 * to rely on it.)
 				 */
 				Node *n;
-				n = makeStringConstCast("now", @1, SystemTypeName("text"));
+				n = makeStringConstCast("now", -1, SystemTypeName("text"));
 				$$ = makeTypeCast(n, SystemTypeName("date"), -1);
 			}
 		| CURRENT_TIME
@@ -11365,7 +11365,7 @@ func_expr_common_subexpr:
 				 * See comments for CURRENT_DATE.
 				 */
 				Node *n;
-				n = makeStringConstCast("now", @1, SystemTypeName("text"));
+				n = makeStringConstCast("now", -1, SystemTypeName("text"));
 				$$ = makeTypeCast(n, SystemTypeName("timetz"), -1);
 			}
 		| CURRENT_TIME '(' Iconst ')'
@@ -11376,7 +11376,7 @@ func_expr_common_subexpr:
 				 */
 				Node *n;
 				TypeName *d;
-				n = makeStringConstCast("now", @1, SystemTypeName("text"));
+				n = makeStringConstCast("now", -1, SystemTypeName("text"));
 				d = SystemTypeName("timetz");
 				d->typmods = list_make1(makeIntConst($3, @3));
 				$$ = makeTypeCast(n, d, -1);
@@ -11397,7 +11397,7 @@ func_expr_common_subexpr:
 				 */
 				Node *n;
 				TypeName *d;
-				n = makeStringConstCast("now", @1, SystemTypeName("text"));
+				n = makeStringConstCast("now", -1, SystemTypeName("text"));
 				d = SystemTypeName("timestamptz");
 				d->typmods = list_make1(makeIntConst($3, @3));
 				$$ = makeTypeCast(n, d, -1);
@@ -11409,7 +11409,7 @@ func_expr_common_subexpr:
 				 * See comments for CURRENT_DATE.
 				 */
 				Node *n;
-				n = makeStringConstCast("now", @1, SystemTypeName("text"));
+				n = makeStringConstCast("now", -1, SystemTypeName("text"));
 				$$ = makeTypeCast((Node *)n, SystemTypeName("time"), -1);
 			}
 		| LOCALTIME '(' Iconst ')'
@@ -11420,7 +11420,7 @@ func_expr_common_subexpr:
 				 */
 				Node *n;
 				TypeName *d;
-				n = makeStringConstCast("now", @1, SystemTypeName("text"));
+				n = makeStringConstCast("now", -1, SystemTypeName("text"));
 				d = SystemTypeName("time");
 				d->typmods = list_make1(makeIntConst($3, @3));
 				$$ = makeTypeCast((Node *)n, d, -1);
@@ -11432,7 +11432,7 @@ func_expr_common_subexpr:
 				 * See comments for CURRENT_DATE.
 				 */
 				Node *n;
-				n = makeStringConstCast("now", @1, SystemTypeName("text"));
+				n = makeStringConstCast("now", -1, SystemTypeName("text"));
 				$$ = makeTypeCast(n, SystemTypeName("timestamp"), -1);
 			}
 		| LOCALTIMESTAMP '(' Iconst ')'
@@ -11443,7 +11443,7 @@ func_expr_common_subexpr:
 				 */
 				Node *n;
 				TypeName *d;
-				n = makeStringConstCast("now", @1, SystemTypeName("text"));
+				n =
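To make the masking behavior above concrete, here is a small standalone C sketch (hypothetical illustration, not the actual pg_stat_statements code) of replacing the token at a recorded constant location with '?'. When a recorded location really points at a keyword such as CURRENT_TIMESTAMP, it is the keyword that gets masked — which is exactly the funny output shown at the top of the message. A location of -1 means "unknown" and is skipped, which is the property both proposed fixes rely on.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical sketch, not the real pg_stat_statements implementation:
 * mask the token starting at the recorded parse location with '?'.
 * A location of -1 means "no location recorded" and is skipped.
 * (For simplicity this handles one location and does not adjust later
 * offsets after shortening the string.) */
static void
mask_constant(char *query, int loc)
{
	if (loc < 0 || loc >= (int) strlen(query))
		return;					/* loc = -1: nothing to mask */
	query[loc] = '?';
	/* remove the remainder of the masked token */
	while (isalnum((unsigned char) query[loc + 1]) || query[loc + 1] == '_')
		memmove(query + loc + 1, query + loc + 2, strlen(query + loc + 1));
}
```

With the location 7 recorded for the 'now' Const, "select current_timestamp(0);" becomes "select ?(0);", matching the report; with loc = -1 the query is left alone.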
[HACKERS] improve the help message about psql -F
In the official doc, -F says: "Use separator as the field separator for unaligned output." But in psql --help, -F says: "set field separator (default: "|")". If users don't read the official doc carefully, they may run: psql -F , -c 'select ...' and not get what they want. It is a bad user experience. I made a patch changing the help message for -F to: "set field separator for unaligned output (default: "|")". Patch for head attached. Jov blog: http://amutu.com/blog filed_separator_set_help_imporove.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] plpgsql.warn_shadow
Hello After some thinking, I don't think this design is good. It changes how warning/exception (error) levels work, and it is not consistent with other parts of PostgreSQL. The benefit is smaller than the cost of an unclean configuration. Better to solve similar issues via specialized plpgsql extensions, or to help me push plpgsql_check_function into core; that could be the best home for this and similar checks. Regards Pavel 2014/1/15 Marko Tiikkaja ma...@joh.to On 1/15/14 3:09 PM, Pavel Stehule wrote: You first should say what is a warning, and why it is only a warning and not an error. Personally, I'm a huge fan of the computer telling me that I might have made a mistake. But on the other hand, I also really hate it when the computer gets in my way when I'm trying to test something quickly and am making these mistakes on purpose. Warnings are really good for that: hard to ignore (YMMV) accidentally, but easy to spot when developing. As to how we would categorize these checks between warnings and errors, I can't really answer that. I'm tempted to say anything that is an error now is an error, and any additional checks we might add are warnings, but that's effectively just freezing the definition at an arbitrary point in time. And why should plpgsql warning processing be different than general PostgreSQL processing? What do you mean? We're adding extra checks on *top* of the normal "this is clearly an error" conditions. PostgreSQL in general doesn't really do that. Consider: SELECT * FROM foo WHERE fooid IN (SELECT fooid FROM bar); where the table bar doesn't have a column fooid. That's a perfectly valid query, but it almost certainly doesn't do what you would want. Personally I'd like to see a WARNING here normally, but I've eventually learned to live with this caveat. I'm hoping that in PL/PgSQL we could at least solve some of the most annoying pitfalls. My objection is against a too-general option. Every option should do one clean thing.
It looks to me like the GUC *is* doing only one thing: it takes the list of warnings I want to see, or the shorthand "all" for convenience. Regards, Marko Tiikkaja
Re: [HACKERS] plpgsql.warn_shadow
On 1/17/14 11:26 AM, Pavel Stehule wrote: After some thinking, I don't think this design is good. It changes how warning/exception (error) levels work, and it is not consistent with other parts of PostgreSQL. The benefit is smaller than the cost of an unclean configuration. Better to solve similar issues via specialized plpgsql extensions, or to help me push plpgsql_check_function into core; that could be the best home for this and similar checks. I see these as two different things. There *is* a need for a full-blown static analyzer for PL/PgSQL, but I don't think it needs to be in core. However, there seem to be a number of pitfalls we could warn the user about with little effort in core, and I think those are worth doing. Regards, Marko Tiikkaja
Re: [HACKERS] GIN improvements part 1: additional information
On Wed, Jan 15, 2014 at 10:47 AM, Alexander Korotkov aekorot...@gmail.com wrote: On Wed, Jan 15, 2014 at 5:17 AM, Tomas Vondra t...@fuzzy.cz wrote: On 14.1.2014 00:38, Tomas Vondra wrote: On 13.1.2014 18:07, Alexander Korotkov wrote: On Sat, Jan 11, 2014 at 6:15 AM, Tomas Vondra t...@fuzzy.cz wrote: On 8.1.2014 22:58, Alexander Korotkov wrote: Thanks for reporting. Fixed version is attached. I've tried to rerun the 'archie' benchmark with the current patch, and once again I got PANIC: could not split GIN page, didn't fit I reran it with '--enable-cassert' and with that I got TRAP: FailedAssertion(!(ginCompareItemPointers(items[i - 1], items[i]) < 0), File: gindatapage.c, Line: 149) LOG: server process (PID 5364) was terminated by signal 6: Aborted DETAIL: Failed process was running: INSERT INTO messages ... so the assert in GinDataLeafPageGetUncompressed fails for some reason. I can easily reproduce it, but my knowledge in this area is rather limited, so I'm not entirely sure what to look for. I've fixed this bug and many other bugs. Now the patch passes the test suite that I used earlier. The results are as follows: OK, it seems the bug is gone. However, now there's a memory leak somewhere. I'm loading pgsql mailing list archives (~600k messages) using this script https://bitbucket.org/tvondra/archie/src/1bbeb920/bin/load.py And after loading about 1/5 of the data, all the memory gets filled by the pgsql backends (loading the data in parallel) and the DB gets killed by the OOM killer. I've spent a fair amount of time trying to locate the memory leak, but so far no luck; I'm not sufficiently familiar with the GIN code. I can however demonstrate that it's there, and I have a rather simple test case to reproduce it - basically just a CREATE INDEX on a table with ~1M email message bodies (in a tsvector column).
The data is available here (360MB compressed, 1GB raw): http://www.fuzzy.cz/tmp/message-b.data.gz Simply create a single-column table, load the data, and create the index:

CREATE TABLE test ( body_tsvector TSVECTOR );
COPY test FROM '/tmp/message-b.data';
CREATE INDEX test_idx ON test USING gin ( body_tsvector );

I'm running this on a machine with 8GB of RAM, with these settings: shared_buffers=1GB, maintenance_work_mem=1GB. According to top, CREATE INDEX from the current HEAD never consumes more than ~25% of RAM:

  PID USER   PR  NI    VIRT    RES    SHR  %CPU %MEM COMMAND
32091 tomas  20   0 2026032  1,817g 1,040g 56,2 23,8 postgres

which is about right, as (shared_buffers + maintenance_work_mem) is about 1/4 of RAM. With the v5 patch version applied, the CREATE INDEX process eventually goes crazy and allocates almost all the available memory (but sometimes finishes, mostly by pure luck). This is what I was able to get from top

  PID USER   PR  NI    VIRT    RES    SHR  S %CPU %MEM COMMAND
14090 tomas  20   0 7913820  6,962g 955036 D  4,3 91,1 postgres

while the system was still reasonably responsive. Thanks a lot for your help! I believe the problem is that each decompressed item pointers array is palloc'd but not freed. I hope to fix it today. Seems to be fixed in the attached version of the patch. Another improvement in this version: only the left-most modified segment and those following it are re-encoded; the left part of the page is left untouched. -- With best regards, Alexander Korotkov. gin-packed-postinglists-varbyte6.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] [PATCH] pgcrypto: implement gen_random_uuid
2014/1/9 Oskari Saarenmaa o...@ohmu.fi: The only useful feature of the uuid-ossp module in my opinion is the uuid_generate_v4 function, and as uuid-ossp is more or less abandonware, people have had trouble building and installing it. This patch implements an alternative UUID v4 generation function in pgcrypto, which could be moved to core once there's a core PRNG with a large enough internal state. It is a small but very useful patch. Installing uuid-ossp can be very hard on some systems. There is not much to review. The patch applies cleanly to HEAD. The function generates valid UUID version 4 values. The code and the documentation style seem to fit the pgcrypto extension. I am marking it as Ready for Committer. The problem is users probably would not look to the pgcrypto extension for a UUID generator, especially when there is another extension with "uuid" in its name. Also, a UUID generator does not sound like a cryptographic function. It would be much better if this were in core with the UUID type. There is a reference on the UUID type documentation page to the uuid-ossp extension. We can add a reference to the pgcrypto extension on that page, and consider adding a deprecation note to the uuid-ossp extension if it is not possible to add the function to core for now.
Re: [HACKERS] [PATCH] Fix double-inclusion of pg_config_os.h when building extensions with Visual Studio
On Sat, Jan 11, 2014 at 8:57 AM, Craig Ringer cr...@2ndquadrant.com wrote: Hi all Related to the earlier comments about building extensions on Windows, I just noticed that we don't treat WINDLL as equivalent to WIN32, and WIN32 isn't set in a Visual Studio DLL project. We should actually be testing _WIN32, which is the compiler's pre-defined macro. The attached patch to c.h takes care of that - it tests _WIN32 in c.h and sets WIN32 early if it's found. _WIN32 is set for both win32 and win64, like we expect from WIN32. Regardless of where the other thread goes, this seems like something we should fix. Thus - applied, with minor changes to the comment, thanks. My understanding is that this change alone doesn't actually help us very much, so I haven't backpatched it anywhere. Let me know if that understanding was incorrect, and it would actually help as a backpatch. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] Feature request: Logging SSL connections
On Sun, Dec 8, 2013 at 10:27 AM, Marko Kreen mark...@gmail.com wrote: On Fri, Dec 06, 2013 at 02:53:27PM +0100, Dr. Andreas Kunert wrote: Anything else missing? Functionally it's fine now, but I see a few style problems: - if (port->ssl > 0) is wrong; ->ssl is a pointer, so use just if (port->ssl). - Deeper indentation would look nicer with braces. - There are some duplicated messages; could you restructure it so that each message exists only once? New version is attached. I could add braces before and after the ereport() calls too, but then I need two more #ifdefs to catch the closing braces. Thank you. Looks good now. I added it to the next commitfest: https://commitfest.postgresql.org/action/patch_view?id=1324 Applied, thanks! -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
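The style point about the pointer test can be illustrated with a tiny standalone sketch (the struct here is a stand-in for illustration, not the real PostgreSQL Port definition):

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the relevant part of the backend's Port struct: the ssl
 * field is a pointer to per-connection SSL state, so the idiomatic test
 * is "if (port->ssl)", not a numeric comparison like "if (port->ssl > 0)". */
struct FakePort
{
	void	   *ssl;			/* NULL when the connection is not SSL */
};

static bool
connection_uses_ssl(const struct FakePort *port)
{
	return port->ssl != NULL;	/* same as "if (port->ssl)" */
}
```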
Re: [HACKERS] patch: option --if-exists for pg_dump
Hello 2014/1/16 Jeevan Chalke jeevan.cha...@enterprisedb.com: Hi Pavel, I have reviewed the patch and here are my concerns and notes:

POSITIVES:
---
1. Patch applies with some white-space errors.
2. make / make install / make check is smooth. No issues as such.
3. Feature looks good as well.
4. NO concern on overall design.
5. Good work.

NEGATIVES:
---
Here are the points which I saw in the review and would like you to pay attention to.

1.
+It use conditional commands (with <literal>IF EXISTS</literal>
Grammar mistake: use = uses

2.
@@ -55,7 +55,8 @@
 static ArchiveHandle *_allocAH(const char *FileSpec, const ArchiveFormat fmt,
 	 const int compression, ArchiveMode mode, SetupWorkerPtr setupWorkerPtr);
 static void _getObjectDescription(PQExpBuffer buf, TocEntry *te, ArchiveHandle *AH);
-static void _printTocEntry(ArchiveHandle *AH, TocEntry *te, RestoreOptions *ropt, bool isData, bool acl_pass);
+static void _printTocEntry(ArchiveHandle *AH, TocEntry *te, RestoreOptions *ropt,
+				bool isData, bool acl_pass);
 static char *replace_line_endings(const char *str);
 static void _doSetFixedOutputState(ArchiveHandle *AH);
 static void _doSetSessionAuth(ArchiveHandle *AH, const char *user);
@@ -234,6 +235,7 @@ RestoreArchive(Archive *AHX)
 	bool		parallel_mode;
 	TocEntry   *te;
 	OutputContext sav;
+
 	AH->stage = STAGE_INITIALIZING;
@@ -2961,7 +3005,8 @@ _getObjectDescription(PQExpBuffer buf, TocEntry *te, ArchiveHandle *AH)
 }
 static void
-_printTocEntry(ArchiveHandle *AH, TocEntry *te, RestoreOptions *ropt, bool isData, bool acl_pass)
+_printTocEntry(ArchiveHandle *AH, TocEntry *te, RestoreOptions *ropt, bool isData,
+			   bool acl_pass)
 {
 	/* ACLs are dumped only during acl pass */
 	if (acl_pass)
The above changes are NOT at all related to the patch. Please remove them, even though they are clean-up-like changes. Don't mix them with the actual changes.

3.
+ if (strncmp(te->dropStmt + desc_len + 5, " IF EXISTS", 9) != 0)
" IF EXISTS" has 10 characters, NOT 9.

4.
+ if (strncmp(te->dropStmt + desc_len + 5, " IF EXISTS", 9) != 0)
+ 	ahprintf(AH, "DROP %s IF EXISTS %s",
+ 			 te->desc,
+ 			 te->dropStmt + 6 + desc_len);
Here you have used strncmp, starting at te->dropStmt + X, where X = desc_len + 5. While adding it back you used X = 6 + desc_len. The first time you used 5 because you included the space in the comparison, but for adding back we want the position past the space, and thus you used 6. That's correct, but a little bit confusing. Why not simply use
+ if (strstr(te->dropStmt, "IF EXISTS") != NULL)
to check whether the drop statement has IF EXISTS or not, like you did at some other place? This would remove my concern 3 and the above confusion as well. What do you think?

5.
+ }
+
+ else
Extra line before the else part. Please remove it for consistency.

6.
+ printf(_("  --if-exists              use IF EXISTS when dropping objects\n")); (pg_dump)
+ printf(_("  --if-exists              don't report error if cleaned object doesn't exist\n")); (pg_dumpall)
+ printf(_("  --if-exists              use IF EXISTS when dropping objects\n")); (pg_restore)
Please have the same message for all three.

7.
 printf(_("  --binary-upgrade         for use by upgrade utilities only\n"));
 printf(_("  --column-inserts         dump data as INSERT commands with column names\n"));
+ printf(_("  --if-exists              don't report error if cleaned object doesn't exist\n"));
 printf(_("  --disable-dollar-quoting disable dollar quoting, use SQL standard quoting\n"));
 printf(_("  --disable-triggers       disable triggers during data-only restore\n"));
Please maintain the ordering used by pg_dump and pg_restore, also at the variable declaration and in the options-parsing code.

I fixed a previous issue, see the attachment please

8.
+ if (if_exists && !outputClean)
+ 	exit_horribly(NULL, "option --if-exists requires -c/--clean option\n");
Do we really want to exit when -c is not provided? Can't we simply ignore --if-exists in that case (or emit a warning)?
This behavior is based on the discussion around the proposal of this feature, and I think this behavior is a little bit safer; ignoring requested functionality is not something I like. The error message is simple and clear in this case; it is not difficult to use, and it is not difficult for the user to fix the missing option. Regards Pavel Marking "Waiting on author". Thanks -- Jeevan B Chalke Principal Software Engineer, Product Development EnterpriseDB Corporation The Enterprise PostgreSQL Company commit 3fecb7805261e96db764fffcae2bb912538903f5 Author: Pavel Stehule pavel.steh...@gmail.com Date: Fri
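The strstr() suggestion from the review can be sketched as follows (hypothetical standalone code, not the actual pg_dump source; the dropStmt name mirrors the field used in the patch):

```c
#include <stdbool.h>
#include <string.h>

/* Sketch of the reviewer's suggestion: instead of strncmp() at a
 * hand-computed offset into the DROP statement, simply check whether the
 * statement already contains IF EXISTS anywhere. */
static bool
drop_stmt_has_if_exists(const char *dropStmt)
{
	return strstr(dropStmt, "IF EXISTS") != NULL;
}
```

This sidesteps the off-by-one length argument to strncmp entirely, at the (harmless, for generated DDL) cost of matching IF EXISTS anywhere in the statement.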
[HACKERS] wal_buffers = -1
Is there any real use-case for not setting wal_buffers to -1 these days? Or should we just remove it and use the -1 behaviour at all times? IIRC we discussed not keeping it at all when the autotune behavior was introduced, but said we wanted to keep it just in case. If we're not ready to remove it, then does that just mean that we need to fix it so we can? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] wal_buffers = -1
On 17 January 2014 13:01, Magnus Hagander mag...@hagander.net wrote: Is there any real use-case for not setting wal_buffers to -1 these days? Or should we just remove it and use the -1 behaviour at all times? IIRC we discussed not keeping it at all when the autotune behavior was introduced, but said we wanted to keep it just in case. If we're not ready to remove it, then does that just mean that we need to fix it so we can? Robert Haas reported that setting it to 32MB can yield a considerable performance benefit: http://www.postgresql.org/message-id/CA+TgmobgMv_aaakLoasBt=5iYfi=kdcOUz0X9TdYe0c2SZ=2...@mail.gmail.com -- Thom
Re: [HACKERS] wal_buffers = -1
Hi, On 2014-01-17 14:01:56 +0100, Magnus Hagander wrote: Is there any real use-case for not setting wal_buffers to -1 these days? Or should we just remove it and use the -1 behaviour at all times? I have seen improvements from setting it larger than the maximum value the -1 autotuning will choose. Also, for some workloads (low latency) it can be beneficial to have a small s_b (shared_buffers) but still have a larger wal_buffers setting. So -1 for removing it from me. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] wal_buffers = -1
On Fri, Jan 17, 2014 at 2:07 PM, Thom Brown t...@linux.com wrote: On 17 January 2014 13:01, Magnus Hagander mag...@hagander.net wrote: Is there any real use-case for not setting wal_buffers to -1 these days? Or should we just remove it and use the -1 behaviour at all times? IIRC we discussed not keeping it at all when the autotune behavior was introduced, but said we wanted to keep it just in case. If we're not ready to remove it, then does that just mean that we need to fix it so we can? Robert Haas reported that setting it to 32MB can yield a considerable performance benefit: http://www.postgresql.org/message-id/CA+TgmobgMv_aaakLoasBt=5iYfi=kdcOUz0X9TdYe0c2SZ=2...@mail.gmail.com In that case, shouldn't the autotuning be changed to not limit it at 16MB? :) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Thu, Jan 16, 2014 at 03:58:56PM -0800, Jeff Janes wrote: On Thu, Jan 16, 2014 at 3:23 PM, Dave Chinner da...@fromorbit.com wrote: On Wed, Jan 15, 2014 at 06:14:18PM -0600, Jim Nasby wrote: On 1/15/14, 12:00 AM, Claudio Freire wrote: My completely unproven theory is that swapping is overwhelmed by near-misses. Ie: a process touches a page, and before it's actually swapped in, another process touches it too, blocking on the other process' read. But the second process doesn't account for that page when evaluating predictive models (ie: read-ahead), so the next I/O by process 2 is unexpected to the kernel. Then the same with 1. Etc... In essence, swap, by a fluke of its implementation, fails utterly to predict the I/O pattern, and results in far sub-optimal reads. Explicit I/O is free from that effect, all read calls are accountable, and that makes a difference. Maybe, if the kernel could be fixed in that respect, you could consider mmap'd files as a suitable form of temporary storage. But that would depend on the success and availability of such a fix/patch. Another option is to consider some of the more radical ideas in this thread, but only for temporary data. Our write sequencing and other needs are far less stringent for this stuff. -- Jim C. I suspect that a lot of the temporary data issues can be solved by using tmpfs for temporary files Temp files can collectively reach hundreds of gigs. So unless you have terabytes of RAM you're going to have to write them back to disk. But there's something here that I'm not getting - you're talking about a data set that you want to keep cache resident that is at least an order of magnitude larger than the cyclic 5-15 minute WAL dataset that ongoing operations need to manage to avoid IO storms. Where do these temporary files fit into this picture, how fast do they grow, and why do they need to be so large in comparison to the ongoing modifications being made to the database? Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 06:14:18PM -0600, Jim Nasby wrote: On 1/15/14, 12:00 AM, Claudio Freire wrote: My completely unproven theory is that swapping is overwhelmed by near-misses. Ie: a process touches a page, and before it's actually swapped in, another process touches it too, blocking on the other process' read. But the second process doesn't account for that page when evaluating predictive models (ie: read-ahead), so the next I/O by process 2 is unexpected to the kernel. Then the same with 1. Etc... In essence, swap, by a fluke of its implementation, fails utterly to predict the I/O pattern, and results in far sub-optimal reads. Explicit I/O is free from that effect, all read calls are accountable, and that makes a difference. Maybe, if the kernel could be fixed in that respect, you could consider mmap'd files as a suitable form of temporary storage. But that would depend on the success and availability of such a fix/patch. Another option is to consider some of the more radical ideas in this thread, but only for temporary data. Our write sequencing and other needs are far less stringent for this stuff. -- Jim C. I suspect that a lot of the temporary data issues can be solved by using tmpfs for temporary files Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Thu, Jan 16, 2014 at 08:48:24PM -0500, Robert Haas wrote: On Thu, Jan 16, 2014 at 7:31 PM, Dave Chinner da...@fromorbit.com wrote: But there's something here that I'm not getting - you're talking about a data set that you want to keep cache resident that is at least an order of magnitude larger than the cyclic 5-15 minute WAL dataset that ongoing operations need to manage to avoid IO storms. Where do these temporary files fit into this picture, how fast do they grow, and why do they need to be so large in comparison to the ongoing modifications being made to the database? [ snip ] Temp files are something else again. If PostgreSQL needs to sort a small amount of data, like a kilobyte, it'll use quicksort. But if it needs to sort a large amount of data, like a terabyte, it'll use a merge sort.[1] IOWs, the temp files contain data that requires transformation as part of a query operation. So temp file size is bound by the dataset, and growth is determined by the data retrieval and transformation rate. IOWs, there are two very different IO and caching requirements in play here, and tuning the kernel for one actively degrades the performance of the other. Right, got it now. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Thu, 16 Jan 2014 20:48:24 -0500 Robert Haas robertmh...@gmail.com wrote: On Thu, Jan 16, 2014 at 7:31 PM, Dave Chinner da...@fromorbit.com wrote: But there's something here that I'm not getting - you're talking about a data set that you want to keep cache resident that is at least an order of magnitude larger than the cyclic 5-15 minute WAL dataset that ongoing operations need to manage to avoid IO storms. Where do these temporary files fit into this picture, how fast do they grow, and why do they need to be so large in comparison to the ongoing modifications being made to the database? I'm not sure you've got that quite right. WAL is fsync'd very frequently - on every commit, at the very least, and multiple times per second even when there are no commits going on, just to make sure we get it all down to the platter as fast as possible. The thing that causes the I/O storm is the data file writes, which are performed either when we need to free up space in PostgreSQL's internal buffer pool (aka shared_buffers) or once per checkpoint interval (5-60 minutes) in any event. The point of this system is that if we crash, we're going to need to replay all of the WAL to recover the data files to the proper state; but we don't want to keep WAL around forever, so we checkpoint periodically. By writing all the data back to the underlying data files, checkpoints render older WAL segments irrelevant, at which point we can recycle those files before the disk fills up. So this says to me that the WAL is a place where DIO should really be reconsidered. It's mostly sequential writes that need to hit the disk ASAP, and you need to know that they have hit the disk before you can proceed with other operations. Also, is the WAL actually ever read under normal (non-recovery) conditions, or is it write-only under normal operation? If it's seldom read, then using DIO for it also avoids some double buffering, since it wouldn't go through the pagecache.
Again, I think this discussion would really benefit from an outline of the different files used by pgsql, and what sort of data access patterns you expect with them. Temp files are something else again. If PostgreSQL needs to sort a small amount of data, like a kilobyte, it'll use quicksort. But if it needs to sort a large amount of data, like a terabyte, it'll use a merge sort.[1] The reason is of course that quicksort requires random access to work well; if parts of quicksort's working memory get paged out during the sort, your life sucks. Merge sort (or at least our implementation of it) is slower overall, but it only accesses the data sequentially. When we do a merge sort, we use files to simulate the tapes that Knuth had in mind when he wrote down the algorithm. If the OS runs short of memory - because the sort is really big or just because of other memory pressure - it can page out the parts of the file we're not actively using without totally destroying performance. It'll be slow, of course, because disks always are, but not like quicksort would be if it started swapping. I haven't actually experienced (or heard mentioned) the problem Jeff Janes is mentioning where temp files get written out to disk too aggressively; as mentioned before, the problems I've seen are usually the other way - stuff not getting written out aggressively enough. But it sounds plausible. The OS only lets you set one policy, and if you make that policy right for permanent data files that get checkpointed, it could well be wrong for temp files that get thrown out. Just stuffing the data on RAMFS will work for some installations, but might not be good if you actually do want to perform sorts whose size exceeds RAM. BTW, I haven't heard anyone on pgsql-hackers say they'd be interested in attending Collab on behalf of the PostgreSQL community. Although the prospect of a cross-country flight is a somewhat depressing thought, it does sound pretty cool, so I'm potentially interested.
I have no idea what the procedure is here for moving forward though, especially since it sounds like there might be only one seat available and I don't know who else may wish to sit in it. -- Jeff Layton jlay...@redhat.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
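The tape-based merge sort described above can be sketched in miniature: sort each run in memory, spill it to a temp file, then merge the runs with strictly sequential reads. (A toy Python illustration, not PostgreSQL's actual tuplesort code; spill_run and merge_tapes are invented names.)

```python
import heapq
import tempfile

def spill_run(values):
    """Sort one run in memory and spill it to a temporary 'tape' file."""
    tape = tempfile.TemporaryFile(mode="w+")
    for v in sorted(values):
        tape.write(f"{v}\n")
    tape.seek(0)
    return tape

def merge_tapes(tapes):
    """Merge already-sorted tapes, reading each one strictly sequentially."""
    streams = [map(int, tape) for tape in tapes]
    return list(heapq.merge(*streams))

runs = [[9, 1, 5], [4, 8, 2], [7, 3, 6]]
merged = merge_tapes([spill_run(run) for run in runs])
print(merged)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The key property is in merge_tapes: each tape is only ever read front to back, so the kernel can evict the unread tails under memory pressure without turning the sort into random I/O.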
Re: [HACKERS] wal_buffers = -1
On 17 January 2014 13:20, Magnus Hagander mag...@hagander.net wrote: On Fri, Jan 17, 2014 at 2:07 PM, Thom Brown t...@linux.com wrote: On 17 January 2014 13:01, Magnus Hagander mag...@hagander.net wrote: Is there any real use-case for not setting wal_buffers to -1 these days? Or should we just remove it and use the -1 behaviour at all times? IIRC we discussed not keeping it at all when the autotune behavior was introduced, but said we wanted to keep it just in case. If we're not ready to remove it, then does that just mean that we need to fix it so we can? Robert Haas reported that setting it to 32MB can yield a considerable performance benefit: http://www.postgresql.org/message-id/CA+TgmobgMv_aaakLoasBt=5iYfi=kdcOUz0X9TdYe0c2SZ=2...@mail.gmail.com In that case, shouldn't the autotuning be changed to not limit it at 16MB? :) Well, that's the question. Do we have a heuristic sweet-spot that folk would agree on? -- Thom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
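For context on the 16MB cap: with wal_buffers = -1 the server picks roughly 1/32 of shared_buffers, clamped between 64kB and one WAL segment (16MB by default). A sketch of that heuristic, based on my reading of the 9.x-era XLOGChooseNumBuffers logic (treat the exact bounds as an assumption):

```python
XLOG_BLCKSZ = 8192                   # default WAL block size
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default WAL segment size

def autotune_wal_buffers(shared_buffers_pages):
    """Approximate the wal_buffers = -1 heuristic: 1/32 of shared_buffers,
    clamped to [8 pages (64kB), one WAL segment (16MB)]."""
    xbuffers = shared_buffers_pages // 32
    xbuffers = min(xbuffers, WAL_SEGMENT_SIZE // XLOG_BLCKSZ)
    xbuffers = max(xbuffers, 8)
    return xbuffers * XLOG_BLCKSZ

# 512MB of shared_buffers already hits the 16MB cap, so Robert's 32MB
# sweet spot is unreachable by autotuning:
print(autotune_wal_buffers((512 * 1024 * 1024) // XLOG_BLCKSZ) // (1024 * 1024))  # 16
```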
Re: [HACKERS] plpgsql.consistent_into
On Wed, Jan 15, 2014 at 8:23 AM, Jim Nasby j...@nasby.net wrote: Do we actually support = right now? We already support v_field := field FROM table ... ; and I think it's a bad idea to have different meaning for = and :=. That was already discussed before. Yes, we support both = and := and they have exactly the same behavior; I agree, we should keep them equivalent. Regards, Marti -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Fri, Jan 17, 2014 at 7:34 AM, Jeff Layton jlay...@redhat.com wrote: So this says to me that the WAL is a place where DIO should really be reconsidered. It's mostly sequential writes that need to hit the disk ASAP, and you need to know that they have hit the disk before you can proceed with other operations. Ironically enough, we actually *have* an option to use O_DIRECT here. But it doesn't work well. See below. Also, is the WAL actually ever read under normal (non-recovery) conditions or is it write-only under normal operation? If it's seldom read, then using DIO for them also avoids some double buffering since they wouldn't go through pagecache. This is the first problem: if replication is in use, then the WAL gets read shortly after it gets written. Using O_DIRECT bypasses the kernel cache for the writes, but then the reads stink. However, if you configure wal_sync_method=open_sync and disable replication, then you will in fact get O_DIRECT|O_SYNC behavior. But that still doesn't work out very well, because now the guy who does the write() has to wait for it to finish before he can do anything else. That's not always what we want, because WAL gets written out from our internal buffers for multiple different reasons. If we're forcing the WAL out to disk because of transaction commit or because we need to write the buffer protected by a certain WAL record only after the WAL hits the platter, then it's fine. But sometimes we're writing WAL just because we've run out of internal buffer space, and we don't want to block waiting for the write to complete. Opening the file with O_SYNC deprives us of the ability to control the timing of the sync relative to the timing of the write. Again, I think this discussion would really benefit from an outline of the different files used by pgsql, and what sort of data access patterns you expect with them. 
I think I more or less did that in my previous email, but here it is again in briefer form: - WAL files are written (and sometimes read) sequentially and fsync'd very frequently and it's always good to write the data out to disk as soon as possible - Temp files are written and read sequentially and never fsync'd. They should only be written to disk when memory pressure demands it (but are a good candidate when that situation comes up) - Data files are read and written randomly. They are fsync'd at checkpoint time; between checkpoints, it's best not to write them sooner than necessary, but when the checkpoint arrives, they all need to get out to the disk without bringing the system to a standstill We have other kinds of files, but off-hand I'm not thinking of any that are really very interesting, apart from those. Maybe it'll be useful to have hints that say always write this file to disk as quick as you can and always postpone writing this file to disk for as long as you can for WAL and temp files respectively. But the rule for the data files, which are the really important case, is not so simple. fsync() is actually a fine API except that it tends to destroy system throughput. Maybe what we need is just for fsync() to be less aggressive, or a less aggressive version of it. We wouldn't mind waiting an almost arbitrarily long time for fsync to complete if other processes could still get their I/O requests serviced in a reasonable amount of time in the meanwhile. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
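Robert's point about O_SYNC versus explicit syncing can be seen in a minimal sketch: with plain write() plus a later fsync(), the caller decides when to pay the durability cost, whereas O_SYNC would make every single write() block. (A toy Python illustration, not PostgreSQL code; append_then_sync is an invented name.)

```python
import os
import tempfile

def append_then_sync(records):
    """Buffered appends now, durability when *we* choose (contrast with
    O_SYNC, which would block on every individual write)."""
    fd, path = tempfile.mkstemp()
    try:
        for rec in records:
            os.write(fd, rec)   # cheap: no waiting on storage here
        os.fsync(fd)            # pay the sync cost once, at "commit" time
        return os.path.getsize(path)
    finally:
        os.close(fd)
        os.unlink(path)

print(append_then_sync([b"wal record 1\n", b"wal record 2\n"]))  # 26
```

This is exactly the flexibility lost when WAL is written with wal_sync_method=open_sync: writes forced out only to free internal buffer space would still stall on the sync.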
Re: [HACKERS] WAL Rate Limiting
On Thu, Jan 16, 2014 at 10:56 AM, Andres Freund and...@2ndquadrant.com wrote: That'd be most places doing XLogInsert() if you want to be consistent. Each needing to be analyzed whether we're blocking important resources, determining where in the callstack we can do the CHECK_FOR_WAL_BUDGET(). And doing that probably wouldn't be a problem except for the fact that this patch is showing up at the absolute very last minute with multiple unresolved design considerations. I'm with Tom: this should be postponed to 9.5. That having been said, I bet it could be done at the tail of XLogInsert(). If there are cases where that's not desirable, then add macros HOLDOFF_WAL_THROTTLING() and RESUME_WAL_THROTTLING() that bump a counter up and down. When the counter is greater than 0, XLogInsert() doesn't sleep; when RESUME_WAL_THROTTLING() drops the counter to 0, it also considers sleeping. I suspect only a few places would need to do this, like where we're holding one of the SLRU locks. Some testing would be needed to figure out exactly which places cause problems, but that's why it's a good idea to start discussion of your proposed feature before January 14th. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
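The holdoff counter Robert sketches could look like this in runnable pseudocode (Python standing in for the proposed C macros; over_budget and the sleep bookkeeping are invented stand-ins):

```python
wal_holdoff_count = 0
sleep_requests = []  # stand-in for actually sleeping

def over_budget():
    return True  # pretend this session has exceeded its WAL budget

def maybe_throttle():
    """Only sleep when no holdoff is active and we're over budget."""
    if wal_holdoff_count == 0 and over_budget():
        sleep_requests.append("sleep")  # real code would sleep here

def holdoff_wal_throttling():
    global wal_holdoff_count
    wal_holdoff_count += 1

def resume_wal_throttling():
    global wal_holdoff_count
    wal_holdoff_count -= 1
    if wal_holdoff_count == 0:
        maybe_throttle()  # counter dropped to zero: consider sleeping now

def xlog_insert(record):
    # ... write the record ...
    maybe_throttle()  # tail of XLogInsert: safe point unless held off

# While a sensitive lock is held, throttling is deferred:
holdoff_wal_throttling()
xlog_insert("rec1")          # no sleep: counter > 0
resume_wal_throttling()      # counter hits 0, so it considers sleeping
print(sleep_requests)        # ['sleep']
```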
Re: [HACKERS] WAL Rate Limiting
On 2014-01-17 09:04:54 -0500, Robert Haas wrote: That having been said, I bet it could be done at the tail of XLogInsert(). If there are cases where that's not desirable, then add macros HOLDOFF_WAL_THROTTLING() and RESUME_WAL_THROTTLING() that bump a counter up and down. When the counter is greater than 0, XLogInsert() doesn't sleep; when RESUME_WAL_THROTTLING() drops the counter to 0, it also considers sleeping. I suspect only a few places would need to do this, like where we're holding one of the SLRU locks. I don't think there are many locations where this would be ok. Sleeping while holding exclusive buffer locks? Quite possibly inside a critical section? Surely not. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Fri, Jan 17, 2014 at 3:05 PM, Florian Pflug f...@phlo.org wrote: I had some more fun with this, the result is v2.5 of the patch (attached). Changes are explained below. Looks like you've been busy on this! Thank you for implementing all the changes you talked about. I've now started working on the 2.5 patch you sent and I've ripped out all the numeric inverse transition functions on that patch and implemented inverse transitions for max and min using the new NULL returning method that you've implemented. I've also made a very fast pass and fixed up a few comments and some random white space. I've not gotten to the documents yet. I did add a whole bunch of regression tests for all of the inverse transition functions that are used for max and min. I'm just calling these like: SELECT int4larger_inv(3,2),int4larger_inv(3,3),int4larger_inv(3,4); rather than using the aggregate as a window function each time. I thought it was better to validate each of those functions individually and then put in various tests that target a smaller number of aggregates with various inputs. A few things still to do that I'll try and get to soon. 1. More testing 2. Docs updated. 3. Perhaps I'll look at adding bitand and bitor inverse transition functions 4. ? Thanks again for putting so much effort into this. If you make any more changes would it be possible for you to base them on the attached patch instead of the last one you sent? Perhaps if there's much more hacking to do we could start pushing to a GitHub repo to make it easier to work together on this. Regards David Rowley inverse_transition_functions_v2.6.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Jan17, 2014, at 15:14 , David Rowley dgrowle...@gmail.com wrote: If you make any more changes would it be possible for you to base them on the attached patch instead of the last one you sent? Sure! The only reason I didn't rebase my last patch onto yours was that having the numeric stuff in there meant potentially better test coverage. Thanks for doing the merge! Perhaps if there's much more hacking to do we could start pushing to a GitHub repo to make it easier to work together on this. Yeah, that seems like a good idea. Easier to keep track of than a bunch of patches floating around. Could you push your latest version? best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL Rate Limiting
Andres Freund and...@2ndquadrant.com writes: On 2014-01-17 09:04:54 -0500, Robert Haas wrote: That having been said, I bet it could be done at the tail of XLogInsert(). I don't think there are many locations where this would be ok. Sleeping while holding exclusive buffer locks? Quite possibly inside a critical section? More or less by definition, you're always doing both when you call XLogInsert. Surely not. I agree. It's got to be somewhere further up the call stack. I'm inclined to think that what we ought to do is reconceptualize vacuum_delay_point() as something a bit more generic, and sprinkle calls to it in a few more places than now. It's also interesting to wonder about the relationship to CHECK_FOR_INTERRUPTS --- although I think that currently, we assume that that's *cheap* (1 test and branch) as long as nothing is pending. I don't want to see a bunch of arithmetic added to it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
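Tom's aside about CHECK_FOR_INTERRUPTS being cheap refers to the pattern where the common path is a single flag test and branch, with real work only on the rare pending case; a toy sketch of that shape (not the actual PostgreSQL macro):

```python
interrupt_pending = False
handled = []

def process_interrupts():
    """The rare slow path: only reached when something is pending."""
    global interrupt_pending
    interrupt_pending = False
    handled.append("interrupt")

def check_for_interrupts():
    # Common path: one test and branch, effectively free. Adding rate
    # arithmetic here would make every call site pay for it.
    if interrupt_pending:
        process_interrupts()

check_for_interrupts()      # nothing pending: no work done
interrupt_pending = True
check_for_interrupts()      # slow path runs exactly once
print(handled)              # ['interrupt']
```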
Re: [HACKERS] WAL Rate Limiting
On 16 January 2014 17:20, Simon Riggs si...@2ndquadrant.com wrote: Thank you everyone for responding so positively to this proposal. It is something many users have complained about. In time, I think we might want both WAL Rate Limiting and I/O Rate Limiting. IMHO I/O rate limiting is harder and so this proposal is restricted solely to WAL rate limiting. I'm comfortable with a session level parameter. I was mulling over a wal_rate_limit_scope parameter but I think that is too much. I will follow Craig's proposal to define this in terms of MB/s, adding that I would calculate it from the beginning of the statement to now, so it is an averaged rate. Setting of zero means unlimited. The rate would be per-session (in this release) rather than a globally calculated setting. Parameter renamed to wal_rate_limit. Not yet modified to produce this in MB/s. I would like to call it CHECK_FOR_WAL_RATE_LIMIT() Used CHECK_WAL_BUDGET() as proposed That seems like a cheap enough call to allow it to be added in more places, so my expanded list of commands touched are CLUSTER/VACUUM FULL ALTER TABLE (only in phase 3 table rewrite) ALTER TABLE SET TABLESPACE CREATE INDEX which are already there, plus new ones suggested/implied COPY CREATE TABLE AS SELECT INSERT/UPDATE/DELETE Added, no problems or technical difficulties encountered. Please let me know if I missed something (rather than debating what is or is not a maintenance command). Any other thoughts before I cut v2 ? v2 attached -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services wal_rate_limiting.v2.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
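The averaged MB/s rate Simon describes, measured from the beginning of the statement to now, could be computed along these lines (a hedged sketch; the function names are invented, and zero meaning unlimited follows his proposal):

```python
def wal_rate_mb_per_s(wal_bytes_written, statement_start, now):
    """Average WAL write rate from statement start to now, in MB/s."""
    elapsed = max(now - statement_start, 1e-9)  # guard against zero elapsed
    return wal_bytes_written / (1024 * 1024) / elapsed

def check_wal_budget(wal_bytes_written, statement_start, now, limit_mb_per_s):
    """True if the session should sleep to honour its rate limit;
    a limit of 0 means unlimited, per the proposal."""
    if limit_mb_per_s == 0:
        return False
    return wal_rate_mb_per_s(wal_bytes_written, statement_start, now) > limit_mb_per_s

# 8MB written in 2 seconds is 4 MB/s, which exceeds a 2 MB/s limit:
print(check_wal_budget(8 * 1024 * 1024, 0.0, 2.0, 2))  # True
```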
Re: [HACKERS] WAL Rate Limiting
On 01/17/2014 05:20 PM, Simon Riggs wrote:

+ if (RelationNeedsWAL(indexrel))
+     CHECK_FOR_WAL_BUDGET();

I don't think we need the RelationNeedsWAL tests. If the relation is not WAL-logged, you won't write much WAL, so you should easily stay under the limit. And CHECK_FOR_WAL_BUDGET() better be cheap when you're below the limit. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL Rate Limiting
It occurs to me that this is very similar to the method I proposed in June to enforce a hard limit on WAL usage, to avoid PANIC caused by running out of disk space when writing WAL: http://www.postgresql.org/message-id/51b095fe.6050...@vmware.com Enforcing a global limit needs more book-keeping than rate limiting individual sessions. But I'm hoping that the CHECK_WAL_BUDGET() calls could be used for both purposes. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Thu, Jan 16, 2014 at 04:30:59PM -0800, Jeff Janes wrote: On Wed, Jan 15, 2014 at 2:08 AM, Mel Gorman mgor...@suse.de wrote: On Tue, Jan 14, 2014 at 09:30:19AM -0800, Jeff Janes wrote: That could be something we look at. There are cases buried deep in the VM where pages get shuffled to the end of the LRU and get tagged for reclaim as soon as possible. Maybe you need access to something like that via posix_fadvise to say reclaim this page if you need memory but leave it resident if there is no memory pressure or something similar. Not exactly sure what that interface would look like or offhand how it could be reliably implemented. I think the reclaim this page if you need memory but leave it resident if there is no memory pressure hint would be more useful for temporary working files than for what was being discussed above (shared buffers). When I do work that needs large temporary files, I often see physical write IO spike but physical read IO does not. I interpret that to mean that the temporary data is being written to disk to satisfy either dirty_expire_centisecs or dirty_*bytes, but the data remains in the FS cache and so disk reads are not needed to satisfy it. So a hint that says this file will never be fsynced so please ignore dirty_*bytes and dirty_expire_centisecs. It would be good to know if dirty_expire_centisecs or dirty ratio|bytes were the problem here. Is there an easy way to tell? I would guess it has to be at least dirty_expire_centisecs, if not both, as a very large sort operation takes a lot more than 30 seconds to complete. There is not an easy way to tell. To be 100% certain, it would require an instrumentation patch or a systemtap script to detect when a particular page is being written back and track the context. There are approximations though. Monitor nr_dirty pages over time. If at the time of the stall there are fewer dirty pages than allowed by dirty_ratio then the dirty_expire_centisecs kicked in. 
That or monitor the process for stalls, when it stalls check /proc/PID/stack and see if it's stuck in balance_dirty_pages or something similar which would indicate the process hit dirty_ratio. An interface that forces a dirty page to stay dirty regardless of the global system would be a major hazard. It potentially allows the creator of the temporary file to stall all other processes dirtying pages for an unbounded period of time. Are the dirty ratio/bytes limits the mechanisms by which adequate clean memory is maintained? Yes, for file-backed pages. I thought those were there just to put a limit on how long it would take to execute a sync call should one be issued, and there were other settings which said how much clean memory to maintain. It should definitely write out the pages if it needs the memory for other things, just not write them out due to fear of how long it would take to sync it if a sync was called. (And if it needs the memory, it should be able to write it out quickly as the writes would be mostly sequential, not random--although how the kernel can believe me that that will always be the case could be a problem) It has been suggested on more than one occasion that a more sensible interface would be to not allow more dirty data than it takes N seconds to writeback. The details of how to implement this are tricky and no one has taken up the challenge yet. I proposed in another part of the thread a hint for open inodes to have the background writer thread ignore dirty pages belonging to that inode. Dirty limits and fsync would still be obeyed. It might also be workable for temporary files but the proposal could be full of holes. If calling fsync would fail with an error, would that lower the risk of DoS? I do not understand the proposal. If there are pages that must remain dirty and the kernel cannot touch then there will be the risk that dirty_ratio number of pages are all untouchable and the system livelocks until userspace takes an action. 
That still leaves the possibility of flagging temp pages that should only be written to disk if the kernel really needs to. -- Mel Gorman SUSE Labs -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
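The posix_fadvise() hints discussed above are the closest existing userspace interface; POSIX_FADV_DONTNEED is roughly the "reclaim this if you need memory" end of the spectrum, though it drops cached pages unconditionally rather than only under pressure. A hedged sketch of the calls as they exist today (Linux honors them to varying degrees, and POSIX makes no hard guarantees):

```python
import os
import tempfile

def hint_access_pattern(fd, pattern):
    """Apply an advisory kernel hint for the whole file (offset 0, length 0)."""
    os.posix_fadvise(fd, 0, 0, pattern)

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"x" * 8192)
    hint_access_pattern(fd, os.POSIX_FADV_SEQUENTIAL)  # temp-file-like scans
    os.fsync(fd)                                       # DONTNEED mainly affects clean pages
    hint_access_pattern(fd, os.POSIX_FADV_DONTNEED)    # drop cached pages now
finally:
    os.close(fd)
    os.unlink(path)
```

Note what is missing, which is the thread's point: there is no hint for "keep this dirty page out of background writeback" or "reclaim only under pressure".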
Re: [HACKERS] Funny representation in pg_stat_statements.query.
Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp writes: Hello, I noticed that pg_stat_statements.query can have funny values. I don't think that's an acceptable reason for lobotomizing the parser's ability to print error cursors, which is what your first patch does (and without even any documentation that would keep someone from changing it back). The second patch might be all right, though I'm a bit disturbed by its dependency on gram.h constants. The numeric values of those change on a regular basis, and who's to say that these are exactly the set of tokens that we care about? In the end, really the cleanest fix for this would be to get rid of the grammar's translation of these special functions into hacky expressions. They ought to get translated into some new node type(s) that could be reverse-listed in standard form by ruleutils.c. I've wanted to do that for years, but never got around to it ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Feature request: Logging SSL connections
Magnus Hagander mag...@hagander.net writes: Applied, thanks! Minor bikeshedding: the messages would read better, to my eye, as user=%s database=%s SSL enabled (protocol=%s, cipher=%s) Putting enabled where it is requires extra mental gymnastics on the part of the reader. And why the random change between = in one part of the string and : in the new part? You could take that last point a bit further and make it user=%s database=%s SSL=enabled (protocol=%s, cipher=%s) but I'm not sure if that's an improvement. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On 01/17/2014 06:40 AM, Dave Chinner wrote: On Thu, Jan 16, 2014 at 08:48:24PM -0500, Robert Haas wrote: On Thu, Jan 16, 2014 at 7:31 PM, Dave Chinner da...@fromorbit.com wrote: But there's something here that I'm not getting - you're talking about a data set that you want to keep cache resident that is at least an order of magnitude larger than the cyclic 5-15 minute WAL dataset that ongoing operations need to manage to avoid IO storms. Where do these temporary files fit into this picture, how fast do they grow and why do they need to be so large in comparison to the ongoing modifications being made to the database? [ snip ] Temp files are something else again. If PostgreSQL needs to sort a small amount of data, like a kilobyte, it'll use quicksort. But if it needs to sort a large amount of data, like a terabyte, it'll use a merge sort.[1] IOWs the temp files contain data that requires transformation as part of a query operation. So, temp file size is bound by the dataset, Basically yes, though the size of the dataset can be orders of magnitude bigger than the database in case of some queries. growth determined by data retrieval and transformation rate. IOWs, there are two very different IO and caching requirements in play here and tuning the kernel for one actively degrades the performance of the other. Right, got it now. Yes. A step toward the right solution would be some way to tune this on a per-device basis, but as a large part of this in Linux seems to be driven from the keeping-the-VM-clean side, I guess it will be far from simple. Cheers, Dave. -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] currawong is not a happy animal
I've not been able to build master for a few days now on Visual Studio due to some problems in the new test_shm_mq contrib module. I'm not too familiar with how the Visual Studio project files are generated so I've not managed to come up with a fix just yet. However I have attached a fix for the many compiler warnings that currawong has been seeing since 9c14dd2 define_WIN32.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance (summary v2 2014-1-17)
On Wed, Jan 15, 2014 at 02:14:08PM +, Mel Gorman wrote: One assumption would be that Postgres is perfectly happy with the current kernel behaviour in which case our discussion here is done. It has been demonstrated that this statement was farcical. The thread is massive just from interaction with the LSF/MM program committee. I'm hoping that there will be Postgres representation at LSF/MM this year to bring the issues to a wider audience. I expect that LSF/MM can only commit to one person attending the whole summit due to limited seats but we could be more flexible for the Postgres track itself so informal meetings can be arranged for the evenings and at collab summit. We still have not decided on a person that can definitely attend but we'll get back to that shortly. I wanted to revise the summary mail so that there is a record that can be easily digested without trawling through archives. As before if I missed something important, prioritised poorly or emphasised incorrectly then shout at me. On testing of modern kernels Josh Berkus claims that most people are using Postgres with 2.6.19 and consequently there may be poor awareness of recent kernel developments. This is a disturbingly large window of opportunity for problems to have been introduced. Minimally, Postgres has concerns about IO-related stalls which may or may not exist in current kernels. There were indications that large writes starve reads. There have been variants of this style of bug in the past but it's unclear what the exact shape of this problem is and if IO-less dirty throttling affected it. It is possible that Postgres was burned in the past by data being written back from reclaim context in low memory situations. That would have looked like massive stalls with drops in IO throughput but it was fixed in relatively recent kernels. Any data on historical tests would be helpful. 
Alternatively, a pgbench-based reproduction test could potentially be used by people in the kernel community that track performance over time and have access to a suitable testing rig. Postgres bug reports and LKML - It is claimed that LKML does not welcome bug reports but it's less clear what the basis of this claim is. Is it because the reports are ignored? A possible explanation is that they are simply getting lost in the LKML noise and there would be better luck if the bug report was cc'd to a specific subsystem list. A second possibility is the bug report is against an old kernel and unless it is reproduced on a recent kernel the bug report will be ignored. Finally it is possible that there is not enough data available to debug the problem. The worst explanation is that to date the problem has not been fixable but the details of this have been lost and are now unknown. Is it possible that some of these bug reports can be refreshed so at least there is a chance they get addressed? Apparently there were changes to the reclaim algorithms that crippled performance without any sysctls. The problem may be compounded by the introduction of adaptive replacement cache in the shape of the thrash detection patches currently being reviewed. Postgres investigated the use of ARC in the past and ultimately abandoned it. Details are in the archives (http://www.Postgres.org/search/?m=1q=arcl=1d=-1s=r). I have not read them, just noting they exist for future reference. Sysctls to control VM behaviour are not popular as such tuning parameters are often used as an excuse to not properly fix the problem. Would it be possible to describe a test case that shows 2.6.19 performing well and a modern kernel failing? That would give the VM people a concrete basis to work from to either fix the problem or identify exactly what sysctls are required to make this work. I am confident that any bug related to VM reclaim in this area has been lost. 
At least, I recall no instances of it being discussed on linux-mm and it has not featured on LSF/MM during the last years. IO Scheduling - Kevin Grittner has stated that it is known that the DEADLINE and NOOP schedulers perform better than any alternatives for most database loads. It would be desirable to quantify this for some test case and see whether the default scheduler can cope in some way. The deadline scheduler makes sense to a large extent though. Postgres is sensitive to large latencies due to IO write spikes. It is at least plausible that deadline would give more deterministic behaviour for parallel reads in the presence of large writes assuming there were not ordering problems between the reads/writes and the underlying filesystem. For reference, these IO spikes can be massive. If the shared buffer is completely dirtied in a short space of time then it could be 20-25% of RAM being dirtied and writeback required in typical configurations. There have been cases where it was worked around by limiting the size of the shared buffer to a
Re: [HACKERS] currawong is not a happy animal
On 01/17/2014 11:26 AM, David Rowley wrote: I've not been able to build master for a few days now on visual studios due some problems in the new test_shm_mq contrib module. I'm not too familiar with how the visual studio project files are generated so I've not yet managed to come up with a fix just yet. However I have attached a fix for the many compiler warnings that currawong has been seeing since 9c14dd2 That seems perfectly sane, so I have committed it. I'll look into the other issue. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Filter error log statements by sqlstate
Oskari Saarenmaa o...@ohmu.fi writes: 17.01.2014 00:13, Tom Lane kirjoitti: I find it hard to follow exactly what the use-case for this is; could you make a defense of why we should carry a feature like this? I usually run PG with log_min_messages = warning and log_min_error_statement = error to log any unexpected errors. But as I have a lot of check constraints in my database as well as a lot of plpgsql and plpython code which raises exceptions on invalid user input I also get tons of logs for statements causing expected errors which I have to try to filter out with other tools. But if the goal is to reduce clutter in your log, wouldn't you wish to suppress the *entire* log entry for expected errors, not just the SQL-statement field? Certainly all the previous discussion about this type of feature (and there has been plenty) has presumed that you'd want to suppress whole entries. In any case, I'm finding it hard to believe that filtering by individual SQLSTATEs is a practical API. I don't know about others, but filtering by individual SQLSTATE is exactly what I need - I don't want to suppress all plpgsql errors or constraint violations, but I may want to suppress plpgsql RAISE EXCEPTION and CHECK violations. Meh. Again, there's been lots of prior discussion, and I think there's consensus that a filtering API based solely on a list of SQLSTATEs wouldn't be widely adequate. The most recent discussion I can find about this is in this thread: http://www.postgresql.org/message-id/flat/20131205204512.gd6...@eldon.alvh.no-ip.org#20131205204512.gd6...@eldon.alvh.no-ip.org That thread references an older one http://www.postgresql.org/message-id/flat/19791.1335902...@sss.pgh.pa.us#19791.1335902...@sss.pgh.pa.us and I'm pretty sure that there are several others you could find with a bit of digging. 
The more ambitious proposals required decorating ereport calls with a new kind of severity labeling, reflecting how likely it'd be that DBAs would want to read about the particular error in their logs. That'd be a lot of work, though, and would require us to make a lot of value judgements that might be wrong. The main alternative that was discussed was to filter on the basis of SQLSTATE classes, and maybe relocate a few specific ERRCODEs to different classes if it seemed that they were a lot unlike other things in their class. It hasn't really been proven that SQLSTATE-class filtering would work conveniently, but AFAICS the only way to prove or disprove that is for somebody to code it up and use it in production. What I'd suggest is that you revise this patch so that it can handle filtering on the basis of either individual SQLSTATEs or whole classes, the former being able to override the latter. With a bit of wholesale notation invention, an example could be log_sqlstates = 'P0,!P0001,!42,42P05' which would mean log all messages in class P0, except don't log P0001; don't log anything in class 42, except always log 42P05; for everything else, log according to log_min_messages. If you don't like that notation, feel free to propose another. I did not like the one you had to start with, though, because it looked like it was actually changing the error severity level, not just the log or don't log decision. BTW, I'd suggest that this filtering only happen for messages below PANIC level; it's pretty hard to believe that anybody would ever want to suppress a PANIC report. Another thought here is that if you're willing to have the filter only able to *prevent* logging, and not to let it *cause* logging of messages that would otherwise be suppressed by log_min_messages, it could be implemented as a loadable module using the emit_log_hook. 
An experimental feature, which is what this really is, is always a much easier sell in that format, since anybody who finds it useless needn't pay the overhead (which I'm still concerned about ...). But I'm not sure how important it might be to be able to upgrade a message's log priority, so maybe that approach would be significantly less usable. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
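To make the proposed notation concrete, here is a minimal sketch (not PostgreSQL code; the function name and return convention are invented for illustration, and well-formed entries are assumed) of how a log_sqlstates list such as 'P0,!P0001,!42,42P05' could be matched, with a full five-character SQLSTATE overriding a two-character class entry:

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical matcher for a filter list like "P0,!P0001,!42,42P05".
 * A leading '!' means "suppress"; a full five-character entry overrides
 * a two-character class entry.  Returns +1 = always log, -1 = never log,
 * 0 = fall back to log_min_messages.
 */
static int
filter_decision(const char *list, const char *sqlstate)
{
    int class_verdict = 0;      /* verdict from a 2-char class entry */
    int exact_verdict = 0;      /* verdict from a full 5-char entry */
    const char *p = list;

    while (*p)
    {
        bool suppress = false;
        char entry[6];
        size_t len = 0;

        if (*p == '!')
        {
            suppress = true;
            p++;
        }
        while (*p && *p != ',' && len < 5)
            entry[len++] = *p++;
        entry[len] = '\0';
        if (*p == ',')
            p++;

        if (len == 5 && strcmp(entry, sqlstate) == 0)
            exact_verdict = suppress ? -1 : 1;
        else if (len == 2 && strncmp(entry, sqlstate, 2) == 0)
            class_verdict = suppress ? -1 : 1;
    }
    /* an exact SQLSTATE match takes precedence over its class */
    return exact_verdict != 0 ? exact_verdict : class_verdict;
}
```

With Tom's example list, P0002 is logged via its class, P0001 is suppressed by its exact entry despite the class, 42601 is suppressed by class 42, and 42P05 is logged by its exact entry despite class 42.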
Re: [HACKERS] Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance (summary v2 2014-1-17)
Hi Mel, On 2014-01-17 16:31:48 +, Mel Gorman wrote: Direct IO, buffered IO, double buffering and wishlists -- 3. Hint that a page should be dropped immediately when IO completes. There is already something like this buried in the kernel internals and sometimes called immediate reclaim which comes into play when pages are bgin invalidated. It should just be a case of investigating if that is visible to userspace, if not why not and do it in a semi-sensible fashion. bgin invalidated? Generally, +1 on the capability to achieve such a behaviour from userspace. 7. Allow userspace process to insert data into the kernel page cache without marking the page dirty. This would allow the application to request that the OS use the application copy of data as page cache if it does not have a copy already. The difficulty here is that the application has no way of knowing if something else has altered the underlying file in the meantime via something like direct IO. Granted, such activity has probably corrupted the database already but initial reactions are that this is not a safe interface and there are coherency concerns. I was one of the people suggesting that capability in this thread (after pondering about it in the back of my mind for quite some time), and I first thought it would never be acceptable for pretty much those reasons. But on second thought I don't think that line of argument makes too much sense. If such an API would require write permissions on the file - which it surely would - it wouldn't allow an application to do anything it previously wasn't able to. And I don't see the dangers of concurrent direct IO as anything new. Right now the page's contents reside in userspace memory and aren't synced in any way with either the page cache or the actual on-disk state. And afaik there are already several data races if a file is modified and read both via the page cache and direct IO. 
The scheme that'd allow us is the following: When postgres reads a data page, it will continue to first look up the page in its shared buffers; if it's not there, it will perform a page-cache-backed read, but instruct that read to immediately remove the page from the page cache afterwards (new API, posix_fadvise(), or whatever). As long as it's in shared_buffers, postgres will not need to issue new reads, so there's no benefit in keeping it in the page cache. If the page is dirtied, it will be written out normally, telling the kernel to forget about caching the page (using 3) or possibly direct IO). When a page in postgres's buffers (which wouldn't be set to very large values) isn't needed anymore and is *not* dirty, it will seed the kernel page cache with the current data. Now, such a scheme wouldn't likely be zero-copy, but it would avoid double buffering. I think the cost of buffer copying has been overstated in this thread... The major advantage is that all that could easily be implemented in a very localized manner, without hurting other OSs, and it could easily degrade on kernels not providing that capability, which would surely be the majority of installations for the next couple of years. So, I think such an interface would be hugely beneficial - and I'd be surprised if other applications couldn't reuse it. And I don't think it'd be all that hard to implement on the kernel side? Dave Chinner asked "why, exactly, do you even need the kernel page cache here?" when Postgres already knows how and when data should be written back to disk. The answer boiled down to "to let the kernel do the job that it is good at, namely managing the write-back of dirty buffers to disk and to manage (possible) read-ahead pages". Postgres has some ordering requirements but it does not want to be responsible for all cache replacement and IO scheduling. 
Hannu Krosing summarised it best as "The other part is that using the page cache for the majority of warm, but not burning hot pages, allows the kernel to much more sensibly adapt to concurrent workloads requiring memory in some form or other (possibly giving it to other VMs when mostly idle and such)". 8. Allow copy-on-write of page-cache pages to anonymous. This would limit the double ram usage to some extent. It's not as simple as having a MAP_PRIVATE mapping of a file-backed page because presumably they want this data in a shared buffer shared between Postgres processes. The implementation details of something like this are hairy because it's mmap()-like but not mmap() as it does not have the same writeback semantics due to the write ordering requirements Postgres has for database integrity. 9. Hint that a page in an anonymous buffer is a copy of a page cache page and invalidate the page cache page on COW. This limits the
Re: [HACKERS] currawong is not a happy animal
On Fri, Jan 17, 2014 at 5:51 PM, Andrew Dunstan and...@dunslane.net wrote: On 01/17/2014 11:26 AM, David Rowley wrote: I've not been able to build master for a few days now on visual studios due to some problems in the new test_shm_mq contrib module. I'm not too familiar with how the visual studio project files are generated so I've not yet managed to come up with a fix just yet. However I have attached a fix for the many compiler warnings that currawong has been seeing since 9c14dd2 That seems perfectly sane, so I have committed it. I'll look into the other issue. Thanks. I could've sworn I committed it that way, but I see now that it's sitting unstaged in my working directory... -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] currawong is not a happy animal
On 01/17/2014 11:51 AM, Andrew Dunstan wrote: On 01/17/2014 11:26 AM, David Rowley wrote: I've not been able to build master for a few days now on visual studios due to some problems in the new test_shm_mq contrib module. I'm not too familiar with how the visual studio project files are generated so I've not yet managed to come up with a fix just yet. However I have attached a fix for the many compiler warnings that currawong has been seeing since 9c14dd2 That seems perfectly sane, so I have committed it. I'll look into the other issue. It looks like what we need to fix the immediate problem is to mark set_latch_on_sigusr1 with PGDLLIMPORT. Is that the only one of these symbols we might need to make visible? cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] currawong is not a happy animal
Andrew Dunstan and...@dunslane.net writes: It looks like what we need to fix the immediate problem is to mark set_latch_on_sigusr1 with PGDLLIMPORT. Is that the only one of these symbols we might need to make visible? We've generally relied on the buildfarm to cue us to add PGDLLIMPORT. Add that one and see what happens ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
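For readers unfamiliar with the macro, the sketch below illustrates roughly what is going on; it is an illustration, not the actual PostgreSQL headers, and the variable name is a stand-in:

```c
/*
 * On Windows, a global variable exported from the server binary must be
 * declared __declspec(dllimport) in the loadable modules that reference
 * it; on other platforms the macro expands to nothing, so non-Windows
 * builds never notice a missing marker -- which is why the buildfarm's
 * MSVC members are what flag these omissions.
 */
#if defined(_WIN32) && !defined(BUILDING_SERVER)
#define PGDLLIMPORT __declspec(dllimport)
#else
#define PGDLLIMPORT
#endif

/* In a header, an extension-visible flag would be declared like this: */
extern PGDLLIMPORT int demo_set_latch_on_sigusr1;

/* ...and defined once in the server proper. */
int demo_set_latch_on_sigusr1 = 0;
```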
[HACKERS] Minor improvements to sslinfo contrib module
Please find attached the patch that fixes a couple of comments, and adds 'static' qualifier to functions that are not used anywhere else in the code base. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com http://www.enterprisedb.com diff --git a/contrib/sslinfo/sslinfo.c b/contrib/sslinfo/sslinfo.c index 7a58470..18405f6 100644 --- a/contrib/sslinfo/sslinfo.c +++ b/contrib/sslinfo/sslinfo.c @@ -31,9 +31,10 @@ Datum ssl_client_dn_field(PG_FUNCTION_ARGS); Datum ssl_issuer_field(PG_FUNCTION_ARGS); Datum ssl_client_dn(PG_FUNCTION_ARGS); Datum ssl_issuer_dn(PG_FUNCTION_ARGS); -Datum X509_NAME_field_to_text(X509_NAME *name, text *fieldName); -Datum X509_NAME_to_text(X509_NAME *name); -Datum ASN1_STRING_to_text(ASN1_STRING *str); + +static Datum X509_NAME_field_to_text(X509_NAME *name, text *fieldName); +static Datum X509_NAME_to_text(X509_NAME *name); +static Datum ASN1_STRING_to_text(ASN1_STRING *str); /* @@ -51,7 +52,7 @@ ssl_is_used(PG_FUNCTION_ARGS) /* - * Returns SSL cipher currently in use. + * Returns SSL version currently in use. */ PG_FUNCTION_INFO_V1(ssl_version); Datum @@ -77,7 +78,7 @@ ssl_cipher(PG_FUNCTION_ARGS) /* - * Indicates whether current client have provided a certificate + * Indicates whether current client provided a certificate * * Function has no arguments. Returns bool. True if current session * is SSL session and client certificate is verified, otherwise false. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] currawong is not a happy animal
On 01/17/2014 12:43 PM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: It looks like what we need to fix the immediate problem is to mark set_latch_on_sigusr1 with PGDLLIMPORT. Is that the only one of these symbols we might need to make visible? We've generally relied on the buildfarm to cue us to add PGDLLIMPORT. Add that one and see what happens ... OK, done. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] currawong is not a happy animal
On Sat, Jan 18, 2014 at 6:50 AM, Andrew Dunstan and...@dunslane.net wrote: On 01/17/2014 12:43 PM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: It looks like what we need to fix the immediate problem is to mark set_latch_on_sigusr1 with PGDLLIMPORT. Is that the only one of these symbols we might need to make visible? We've generally relied on the buildfarm to cue us to add PGDLLIMPORT. Add that one and see what happens ... OK, done. Great, that fixes it on my end. Thanks for fixing those. Regards David Rowley cheers andrew
Re: [Lsf-pc] [HACKERS] Re: Linux kernel impact on PostgreSQL performance (summary v2 2014-1-17)
On Fri, Jan 17, 2014 at 06:14:37PM +0100, Andres Freund wrote: Hi Mel, On 2014-01-17 16:31:48 +, Mel Gorman wrote: Direct IO, buffered IO, double buffering and wishlists -- 3. Hint that a page should be dropped immediately when IO completes. There is already something like this buried in the kernel internals and sometimes called immediate reclaim which comes into play when pages are bgin invalidated. It should just be a case of investigating if that is visible to userspace, if not why not and do it in a semi-sensible fashion. bgin invalidated? s/bgin/being/ I admit that invalidated in this context is very vague and I did not explain myself. This paragraph should remind anyone familiar with VM internals about what happens when invalidate_mapping_pages calls deactivate_page and how PageReclaim pages are treated by both page reclaim and the end_page_writeback handler. It's similar but not identical to what Postgres wants and is a reasonable starting position for an implementation. Generally, +1 on the capability to achieve such a behaviour from userspace. 7. Allow userspace process to insert data into the kernel page cache without marking the page dirty. This would allow the application to request that the OS use the application copy of data as page cache if it does not have a copy already. The difficulty here is that the application has no way of knowing if something else has altered the underlying file in the meantime via something like direct IO. Granted, such activity has probably corrupted the database already but initial reactions are that this is not a safe interface and there are coherency concerns. I was one of the people suggesting that capability in this thread (after pondering about it in the back of my mind for quite some time), and I first thought it would never be acceptable for pretty much those reasons. But on second thought I don't think that line of argument makes too much sense. 
If such an API would require write permissions on the file - which it surely would - it wouldn't allow an application to do anything it previously wasn't able to. And I don't see the dangers of concurrent direct IO as anything new. Right now the page's contents reside in userspace memory and aren't synced in any way with either the page cache or the actual on-disk state. And afaik there are already several data races if a file is modified and read both via the page cache and direct IO. All of this is true. The objections may not hold up over time and it may seem much more reasonable when/if the easier stuff is addressed. The scheme that'd allow us is the following: When postgres reads a data page, it will continue to first look up the page in its shared buffers; if it's not there, it will perform a page-cache-backed read, but instruct that read to immediately remove the page from the page cache afterwards (new API, posix_fadvise(), or whatever). As long as it's in shared_buffers, postgres will not need to issue new reads, so there's no benefit in keeping it in the page cache. If the page is dirtied, it will be written out normally, telling the kernel to forget about caching the page (using 3) or possibly direct IO). When a page in postgres's buffers (which wouldn't be set to very large values) isn't needed anymore and is *not* dirty, it will seed the kernel page cache with the current data. Ordinarily the initial read page could be discarded with fadvise but the later write would cause the data to be read back in again, which is a waste. The details of avoiding that re-read are tricky from a core kernel perspective because ordinarily the kernel at that point does not know if the write is a full complete aligned write of an underlying filesystem structure or not. It may need a different write path which potentially leads into needing changes to the address_space operations on a filesystem basis -- that would get messy and be a Linux-specific extension. 
I have not researched this properly at all, I could be way off but I have a feeling the details get messy. Now, such a scheme wouldn't likely be zero-copy, but it would avoid double buffering. It wouldn't be zero copy because minimally the data needs to be handed over to the filesystem for writing to the disk and the interface for that is offset,length based, not page based. Maybe sometimes it will be zero copy but it would be a filesystem-specific thing. I think the cost of buffer copying has been overstated in this thread... The major advantage is that all that could easily be implemented in a very localized manner, without hurting other OSs, and it could easily degrade on kernels not providing that capability, which would surely be the majority of installations for the next couple of years. So, I think such an interface would be hugely
Re: [HACKERS] GIN improvements part 1: additional information
On 01/17/2014 01:05 PM, Alexander Korotkov wrote: Seems to be fixed in the attached version of patch. Another improvement in this version: only left-most segments and further are re-encoded. The left part of the page is left untouched. I'm looking into this now. A few quick notes: * Even when you don't re-encode the whole page, you still WAL-log the whole page. While correct, it'd be a pretty obvious optimization to only WAL-log the modified part. * When new items are appended to the end of the page so that only the last existing compressed segment is re-encoded, and the page has to be split, the items aren't divided 50/50 on the pages. The loop that moves segments destined for the left page to the right won't move any existing, untouched, segments. ! /* !* Didn't fit uncompressed. We'll have to encode them. Check if both !* new items and uncompressed items can be placed starting from last !* segment of page. Then re-encode only last segment of page. !*/ ! minNewItem = newItems[0]; ! if (nolduncompressed == 0 && ! ginCompareItemPointers(olduncompressed[0], minNewItem) < 0) ! minNewItem = olduncompressed[0]; That looks wrong. If I'm understanding it right, it's trying to do minNewItem = Min(newItems[0], olduncompressed[0]). The test should be nolduncompressed > 0 ... No need to send a new patch, I'll just fix those while reviewing... - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
On Fri, Jan 17, 2014 at 10:38 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 01/17/2014 01:05 PM, Alexander Korotkov wrote: Seems to be fixed in the attached version of patch. Another improvement in this version: only left-most segments and further are re-encoded. The left part of the page is left untouched. I'm looking into this now. A few quick notes: * Even when you don't re-encode the whole page, you still WAL-log the whole page. While correct, it'd be a pretty obvious optimization to only WAL-log the modified part. Oh, I overlooked it. I wrote code in ginRedoInsertData which finds the appropriate place for data but didn't notice that I still wrote the whole page to the xlog record. * When new items are appended to the end of the page so that only the last existing compressed segment is re-encoded, and the page has to be split, the items aren't divided 50/50 on the pages. The loop that moves segments destined for the left page to the right won't move any existing, untouched, segments. I think this loop will move one segment. However, it's too small. ! /* !* Didn't fit uncompressed. We'll have to encode them. Check if both !* new items and uncompressed items can be placed starting from last !* segment of page. Then re-encode only last segment of page. !*/ ! minNewItem = newItems[0]; ! if (nolduncompressed == 0 && ! ginCompareItemPointers(olduncompressed[0], minNewItem) < 0) ! minNewItem = olduncompressed[0]; That looks wrong. If I'm understanding it right, it's trying to do minNewItem = Min(newItems[0], olduncompressed[0]). The test should be nolduncompressed > 0 ... Yes, definitely a bug. No need to send a new patch, I'll just fix those while reviewing... Ok, thanks! -- With best regards, Alexander Korotkov.
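The corrected logic Heikki describes amounts to the following (a simplified sketch with plain integers standing in for GIN item pointers; the names are invented, not the patch's code):

```c
/* Plain ints stand in for (block, offset) GIN item pointers here. */
typedef int DemoItemPointer;

/* -1, 0, or +1, mirroring ginCompareItemPointers' return convention */
static int
demo_compare(DemoItemPointer a, DemoItemPointer b)
{
    return (a > b) - (a < b);
}

/*
 * minNewItem = Min(newItems[0], olduncompressed[0]) -- but olduncompressed
 * may only be consulted when it is non-empty, i.e. the guard must be
 * nold > 0, not nold == 0 (which was the bug Heikki spotted).
 */
static DemoItemPointer
demo_min_item(const DemoItemPointer *newItems,
              const DemoItemPointer *olduncompressed, int nold)
{
    DemoItemPointer minNewItem = newItems[0];

    if (nold > 0 && demo_compare(olduncompressed[0], minNewItem) < 0)
        minNewItem = olduncompressed[0];
    return minNewItem;
}
```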
Re: [HACKERS] Add %z support to elog/ereport?
Andres Freund and...@2ndquadrant.com writes: [ 0001-Add-support-for-printing-Size-arguments-to-elog-erep.patch ] I think this approach is fundamentally broken, because it can't reasonably cope with any case more complicated than %zu or %zd. While it's arguable that we can get along without the ability to specify field width, since the [U]INT64_FORMAT macros never allowed that, it is surely going to be the case that translators will need to insert n$ flags in the translated versions of messages. Another objection is that this only fixes the problem in elog/ereport format strings, not anyplace else that it might be useful to print a size_t value. You suggest below that we could invent some additional macros to support that; but since the z flag is in C99, there really ought to be only a small minority of platforms where it doesn't work. So I don't think we should be contorting our code with some nonstandard notation for the case, if there's any way to avoid that. I think a better solution approach is to teach our src/port/snprintf.c about the z flag, then extend configure's checking to force use of our snprintf if the platform's version doesn't handle z. While it might be objected that this creates a need for a run-time check in configure, we already have one to check if snprintf handles n$, so this approach doesn't really make life any worse for cross-compiles. In patch 01, I've modified configure to not define [U]INT64_FORMAT directly, but rather just define INT64_LENGTH_MODIFIER as appropriate. The formats are now defined in c.h. INT64_LENGTH_MODIFIER is defined without quotes - I am not sure whether that was the right choice, it requires using CppAsString2() in some places. Maybe I should just have defined it with quotes instead. Happy to rejigger it if somebody thinks the other way would be better. Meh. This isn't needed if we do what I suggest above, but in any case I don't approve of removing the existing [U]INT64_FORMAT macros. 
That breaks code that doesn't need to get broken, probably including third-party modules. It'd be more sensible to just add an additional macro for the flag character(s). (And yeah, I'd put quotes in it.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
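For illustration, the portability question above is about the C99 'z' length modifier for size_t; a configure-selected flag macro of the kind Tom suggests might look like this sketch (the macro name is hypothetical, and here the C99 branch is simply hard-wired on):

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Sketch: configure would define the modifier based on a runtime test of
 * the platform's snprintf; platforms lacking 'z' would be forced onto a
 * replacement snprintf (or, as sketched here, a fallback modifier).
 */
#if 1                           /* configure found a C99-conforming snprintf */
#define SIZE_T_MODIFIER "z"
#else                           /* hypothetical fallback, e.g. on LP64 */
#define SIZE_T_MODIFIER "l"
#endif

static void
format_size(char *dst, size_t dstlen, size_t value)
{
    /* string literals concatenate, yielding e.g. "allocated %zu bytes" */
    snprintf(dst, dstlen, "allocated %" SIZE_T_MODIFIER "u bytes", value);
}
```

This keeps translatable message strings free of platform-specific notation, which is the point of routing the problem through snprintf rather than through elog-only macros.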
Re: [HACKERS] currawong is not a happy animal
On 01/17/2014 12:55 PM, David Rowley wrote: On Sat, Jan 18, 2014 at 6:50 AM, Andrew Dunstan and...@dunslane.net wrote: On 01/17/2014 12:43 PM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: It looks like what we need to fix the immediate problem is to mark set_latch_on_sigusr1 with PGDLLIMPORT. Is that the only one of these symbols we might need to make visible? We've generally relied on the buildfarm to cue us to add PGDLLIMPORT. Add that one and see what happens ... OK, done. Great, that fixes it on my end. Thanks for fixing those. Not quite out of the woods yet. We're getting this regression failure on Windows/MSVC (see http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2014-01-17%2018%3A20%3A35): *** c:/prog/bf/root/HEAD/pgsql.2612/contrib/test_shm_mq/expected/test_shm_mq.out Fri Jan 17 13:20:53 2014 --- c:/prog/bf/root/HEAD/pgsql.2612/contrib/test_shm_mq/results/test_shm_mq.out Fri Jan 17 13:44:33 2014 *** *** 5,18 -- internal sanity tests fail. -- SELECT test_shm_mq(32768, (select string_agg(chr(32+(random()*96)::int), '') from generate_series(1,400)), 1, 1); ! test_shm_mq ! - ! ! (1 row) ! SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*96)::int), '') from generate_series(1,27)), 200, 3); ! test_shm_mq_pipelined ! --- ! ! (1 row) ! --- 5,10 -- internal sanity tests fail. -- SELECT test_shm_mq(32768, (select string_agg(chr(32+(random()*96)::int), '') from generate_series(1,400)), 1, 1); ! ERROR: queue size must be at least 472262143 bytes SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*96)::int), '') from generate_series(1,27)), 200, 3); ! ERROR: queue size must be at least 472262143 bytes cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Re: Linux kernel impact on PostgreSQL performance (summary v2 2014-1-17)
Mel, So we have a few interested parties. What do we need to do to set up the Collab session? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] currawong is not a happy animal
Andrew Dunstan and...@dunslane.net writes: Not quite out of the woods yet. We're getting this regression failure on Windows/MSVC (see http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=bowerbirddt=2014-01-17%2018%3A20%3A35): SELECT test_shm_mq(32768, (select string_agg(chr(32+(random()*96)::int), '') from generate_series(1,400)), 1, 1); ! ERROR: queue size must be at least 472262143 bytes It looks like this is computing a bogus value: const Size shm_mq_minimum_size = MAXALIGN(offsetof(shm_mq, mq_ring)) + MAXIMUM_ALIGNOF; I seem to recall that we've previously found that you have to write MAXALIGN(offsetof(shm_mq, mq_ring[0])) + MAXIMUM_ALIGNOF; to keep MSVC happy with a reference to an array member in offsetof. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add %z support to elog/ereport?
I wrote: Meh. This isn't needed if we do what I suggest above, but in any case I don't approve of removing the existing [U]INT64_FORMAT macros. That breaks code that doesn't need to get broken, probably including third-party modules. After looking more closely I see you didn't actually *remove* those macros, just define them in a different place/way. So the above objection is just noise, sorry. (Though I think it'd be notationally cleaner to let configure continue to define the macros; it doesn't need to do anything as ugly as CppAsString2() to concatenate...) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] wal_buffers = -1
On Fri, Jan 17, 2014 at 8:20 AM, Magnus Hagander mag...@hagander.net wrote: Robert Haas reported that setting it to 32MB can yield a considerable performance benefit: http://www.postgresql.org/message-id/CA+TgmobgMv_aaakLoasBt=5iYfi=kdcOUz0X9TdYe0c2SZ=2...@mail.gmail.com In that case, shouldn't the autotuning be changed to not limit it at 16MB? :) I'm in favor of keeping the setting; I think that the auto-tuning has largely eliminated the pain in this area for the majority of users, but that doesn't mean we should deny someone who really wants to squeeze the last drop of performance out of their system the opportunity to poke at it manually. I doubt it's the least useful setting we have. The test above shows 32MB beating 16MB, but I think I did other tests where 16MB and 64MB came out the same. Back when I was working heavily on performance, I did a number of tests to try to understand what events cause latency spikes. Many of those events are related to write-ahead logging. It turns out that writing a page from PostgreSQL's WAL buffers to the OS cache is usually pretty fast, unless the OS thinks we're dirtying data too quickly and decides to slam on the brakes. Calling fsync() to get the data out to disk, though, is very slow. However, both of those operations are protected by the same lock (WALWriteLock), so it's frequently the case that no more WAL buffer space can be freed up by calling write() because the guy who has the lock is busy waiting for an fsync(). That sucks, because there's no intrinsic reason why we can't have one backend doing a write() while another backend is doing an fsync(). On a related note, there's no real reason why the poor bastard who writes the WAL record that fills a segment should be forced to synchronously flush the segment instead of letting it be done in the background, but right now he is. 
I think if we fix these problems, the optimal value for wal_buffers is likely to change; however, I'm not certain we'll be able to auto-tune it perfectly on day one. Having a setting makes it easier for people to experiment with different values, and I think that's good. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal, patch: allow multiple plpgsql plugins
2014/1/16 Marko Tiikkaja ma...@joh.to Hi Pavel, First of all, thanks for working on this! On 1/12/14, 8:58 PM, Pavel Stehule wrote: I'm still not happy with plugin_info - it is only per plugin now and should be per plugin and per function. I'm not sure I understand the point of plugin_info in the first place, but what would having a separate info per (plugin, function) achieve that can't be done otherwise? First use case - I would like to protect against repeated calls of plpgsql_check_function in passive mode. Where do I have to store information about a successful first start? It is related to the function instance, so the function oid can be ambiguous (for functions with polymorphic parameters). When the function instance is destroyed, this information should be destroyed too. It is impossible to do this check from a plugin. Second use case - attach session-lifetime plugin data to some function - for example for coverage calculation. Inside a plugin without function-specific data you have to hold a hash of all used functions, and you have to search it again and again. When plpgsql holds this info in its internal function structures, you don't need to search for anything. Regards Pavel As for the current patch, I'd like to see improvements on a few things: 1) It doesn't currently compile because of extra semicolons in the PLpgSQL_plugin struct. 2) The previous comment above the same struct still talks about the rendezvous variable which is now gone. The comment should be updated to reflect the new API. 3) The same comment talks about how important it is to unregister a plugin if its _PG_fini() is ever called, but the current API does not support unregistering. That should probably be added? I'm not sure when _PG_fini() would be called. 4) The comment /* reserved for use by optional plugin */ seems a bit weird in its new context. Regards, Marko Tiikkaja
Re: [HACKERS] [v9.4] row level security
On 12/13/13 11:40 PM, Craig Ringer wrote: You may want to check out the updated writable security-barrier views patch. http://www.postgresql.org/message-id/52ab112b.6020...@2ndquadrant.com It may offer a path forward for the CF submission for RLS, letting us get rid of the var/attr fiddling that many here objected to. With my advocacy hat on, I'd like to revisit this idea now that there's a viable updatable security barrier view submission. I thought the most serious showstopper feedback from the last CF's RLS submission was that this needed to be sorted out first. Reworking KaiGai's submission to merge against Dean's new one makes it viable again in my mind, and I'd like to continue toward re-reviewing it as part of this CF in that light. Admittedly it's not ideal to try and do that at the same time the barrier view patch is being modified, but I see that as a normal CF merge of things based on other people's submissions. I mentioned advocacy because the budding new PostgreSQL test instances I'm seeing now will lose a lot of momentum if we end up with no user visible RLS features in 9.4. The pieces we have now can assemble into something that's useful, and I don't think that goal is unreasonably far away. -- Greg Smith greg.sm...@crunchydatasolutions.com Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Fri, Jan 17, 2014 at 3:05 PM, Florian Pflug f...@phlo.org wrote: I've now shuffled things around so that we can use inverse transition functions even if only some aggregates provide them, and to allow inverse transition functions to force a restart by returning NULL. The rules regarding NULL handling are now the following Maybe this is me thinking out loud, but I'm just thinking about the numeric case again. Since the patch can now handle inverse transition functions returning NULL when they fail to perform inverse transitions, I'm wondering if we could add an expectedscale to NumericAggState: set it to -1 initially; when we get the first value, set expectedscale to the dscale of that numeric; then, if we get anything but that dscale, set expectedscale back to -1 again. If we are asked to perform an inverse transition while expectedscale is -1, we return NULL; otherwise we can perform the inverse transition... Thoughts? Regards David Rowley
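David's idea can be sketched in miniature like so (plain C with long integers standing in for numeric; all names are invented, not the patch's code):

```c
#include <stdbool.h>

/*
 * Toy aggregate state illustrating the expectedscale idea: remember the
 * display scale (dscale) of the first input; if a later input has a
 * different scale, mark the state non-invertible, so the inverse
 * transition reports failure ("returns NULL") and forces a restart.
 */
typedef struct DemoAggState
{
    long    sum;                /* running sum of inputs */
    long    count;              /* number of inputs aggregated */
    int     expectedscale;      /* -1 = mixed scales seen, cannot invert */
} DemoAggState;

static void
demo_accum(DemoAggState *state, long value, int dscale)
{
    if (state->count == 0)
        state->expectedscale = dscale;
    else if (state->expectedscale != dscale)
        state->expectedscale = -1;      /* inverse transitions now unsafe */
    state->sum += value;
    state->count++;
}

/* Returns false (the "NULL" case) when the caller must restart instead. */
static bool
demo_inverse(DemoAggState *state, long value)
{
    if (state->expectedscale == -1)
        return false;
    state->sum -= value;
    state->count--;
    return true;
}
```

The trade-off mirrors the discussion above: as long as every input shares one dscale the window can slide cheaply; the first mixed-scale input permanently (until restart) disables inversion rather than risking a wrong result scale.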
Re: [HACKERS] [BUGS] surprising to_timestamp behavior
Jeevan Chalke jeevan.cha...@enterprisedb.com writes: Attached patch which fixes this issue. I have just tweaked the code around ignoring spaces in DCH_from_char() function. Adding to CommitFest 2014-01 (Open). I went to review this, and found that there's not actually a patch attached ... regards, tom lane
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Wed, Jan 15, 2014 at 5:14 PM, Simon Riggs si...@2ndquadrant.com wrote: We already know that HOT is ineffective in areas of high contention (previous thread by me). Prior experience was that smaller tables didn't show much apparent benefit from using HOT either; its effectiveness was limited to medium and large tables being updated. The two already stated use cases that would apply are these ones Do you have a link to that previous thread? I don't happen to recall that conversation. I've found that HOT can be very important on smaller tables, so I'm skeptical of that as a general conclusion. What I think might be true is that if VACUUM is going to hit the table often enough to make you happy, then you don't really need HOT. In other words, if the update rate is non-zero but low, not too much cruft will accumulate before the table gets vacuumed, and you may be OK. If the update rate is high, though, I think disabling HOT will be painful on a table of any size. There might be exceptions, but I can't think of what they are off-hand. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] currawong is not a happy animal
Tom Lane escribió: Andrew Dunstan and...@dunslane.net writes: Not quite out of the woods yet. We're getting this regression failure on Windows/MSVC (see http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2014-01-17%2018%3A20%3A35): SELECT test_shm_mq(32768, (select string_agg(chr(32+(random()*96)::int), '') from generate_series(1,400)), 1, 1); ! ERROR: queue size must be at least 472262143 bytes It looks like this is computing a bogus value: const Size shm_mq_minimum_size = MAXALIGN(offsetof(shm_mq, mq_ring)) + MAXIMUM_ALIGNOF; I seem to recall that we've previously found that you have to write MAXALIGN(offsetof(shm_mq, mq_ring[0])) + MAXIMUM_ALIGNOF; to keep MSVC happy with a reference to an array member in offsetof. Hmm, this seems to contradict what's documented at the definition of FLEXIBLE_ARRAY_MEMBER: /* Define to nothing if C supports flexible array members, and to 1 if it does not. That way, with a declaration like `struct s { int n; double d[FLEXIBLE_ARRAY_MEMBER]; };', the struct hack can be used with pre-C99 compilers. When computing the size of such an object, don't use 'sizeof (struct s)' as it overestimates the size. Use 'offsetof (struct s, d)' instead. Don't use 'offsetof (struct s, d[0])', as this doesn't work with MSVC and with C++ compilers. */ #define FLEXIBLE_ARRAY_MEMBER /**/ -- Álvaro Herrera  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] currawong is not a happy animal
On 01/17/2014 02:54 PM, Alvaro Herrera wrote: Tom Lane escribió: Andrew Dunstan and...@dunslane.net writes: Not quite out of the woods yet. We're getting this regression failure on Windows/MSVC (see http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2014-01-17%2018%3A20%3A35): SELECT test_shm_mq(32768, (select string_agg(chr(32+(random()*96)::int), '') from generate_series(1,400)), 1, 1); ! ERROR: queue size must be at least 472262143 bytes It looks like this is computing a bogus value: const Size shm_mq_minimum_size = MAXALIGN(offsetof(shm_mq, mq_ring)) + MAXIMUM_ALIGNOF; I seem to recall that we've previously found that you have to write MAXALIGN(offsetof(shm_mq, mq_ring[0])) + MAXIMUM_ALIGNOF; to keep MSVC happy with a reference to an array member in offsetof. Hmm, this seems to contradict what's documented at the definition of FLEXIBLE_ARRAY_MEMBER: /* Define to nothing if C supports flexible array members, and to 1 if it does not. That way, with a declaration like `struct s { int n; double d[FLEXIBLE_ARRAY_MEMBER]; };', the struct hack can be used with pre-C99 compilers. When computing the size of such an object, don't use 'sizeof (struct s)' as it overestimates the size. Use 'offsetof (struct s, d)' instead. Don't use 'offsetof (struct s, d[0])', as this doesn't work with MSVC and with C++ compilers. 
*/ #define FLEXIBLE_ARRAY_MEMBER /**/ Well, there's a bunch of these in the source: [andrew@emma pg_head]$ grep -r offsetof.*\\[0 src/backend src/backend/access/nbtree/nbtutils.c:size = offsetof(BTVacInfo, vacuums[0]); src/backend/utils/adt/geo_ops.c:size = offsetof(PATH, p[0]) + sizeof(path->p[0]) * npts; src/backend/utils/adt/geo_ops.c:if (npts <= 0 || npts >= (int32) ((INT_MAX - offsetof(PATH, p[0])) / sizeof(Point))) src/backend/utils/adt/geo_ops.c:size = offsetof(PATH, p[0]) + sizeof(path->p[0]) * npts; src/backend/utils/adt/geo_ops.c:size = offsetof(POLYGON, p[0]) + sizeof(poly->p[0]) * npts; src/backend/utils/adt/geo_ops.c:if (npts <= 0 || npts >= (int32) ((INT_MAX - offsetof(POLYGON, p[0])) / sizeof(Point))) src/backend/utils/adt/geo_ops.c:size = offsetof(POLYGON, p[0]) + sizeof(poly->p[0]) * npts; src/backend/utils/adt/geo_ops.c:size = offsetof(PATH, p[0]) + base_size; src/backend/utils/adt/geo_ops.c:size = offsetof(POLYGON, p[0]) + sizeof(poly->p[0]) * path->npts; src/backend/utils/adt/geo_ops.c:size = offsetof(POLYGON, p[0]) + sizeof(poly->p[0]) * 4; src/backend/utils/adt/geo_ops.c:size = offsetof(PATH, p[0]) + sizeof(path->p[0]) * poly->npts; src/backend/utils/adt/geo_ops.c:size = offsetof(POLYGON, p[0]) + base_size; src/backend/postmaster/pgstat.c:len = offsetof(PgStat_MsgTabstat, m_entry[0]) + src/backend/postmaster/pgstat.c:pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) + src/backend/postmaster/pgstat.c:pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) + src/backend/postmaster/pgstat.c:len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) src/backend/postmaster/pgstat.c:len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) src/backend/postmaster/pgstat.c:len = offsetof(PgStat_MsgFuncpurge, m_functionid[0]) src/backend/postmaster/pgstat.c:len = offsetof(PgStat_MsgFuncpurge, m_functionid[0]) src/backend/postmaster/pgstat.c:len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid); cheers andrew
Re: [HACKERS] currawong is not a happy animal
Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane escribió: I seem to recall that we've previously found that you have to write MAXALIGN(offsetof(shm_mq, mq_ring[0])) + MAXIMUM_ALIGNOF; to keep MSVC happy with a reference to an array member in offsetof. Hmm, this seems to contradict what's documented at the definition of FLEXIBLE_ARRAY_MEMBER: Ah, I thought we had that issue documented somewhere, but failed to find this comment, or I'd have known that was backwards. The other possibility I was contemplating is that "export a const variable" doesn't actually work for some reason. We're not in the habit of doing that elsewhere, so I don't find that theory outlandish. Perhaps it could be fixed by adding PGDLLIMPORT to the extern, but on the whole I'd rather avoid the technique altogether. The least-unlike-other-Postgres-code approach would be to go ahead and export the struct so that the size computation could be provided as a #define in the same header. Robert stated a couple days ago that he didn't foresee much churn in this struct, so that doesn't seem unacceptable. Another possibility is to refactor so that testing an allocation request against shm_mq_minimum_size is the responsibility of storage/ipc/shm_mq.c, not some random code in a contrib module. It's not immediately apparent to me why it's good code modularization to have a contrib module responsible for checking sizes based on the sizeof a struct it's not supposed to have any access to. regards, tom lane
Re: [HACKERS] currawong is not a happy animal
Andrew Dunstan and...@dunslane.net writes: On 01/17/2014 02:54 PM, Alvaro Herrera wrote: Hmm, this seems to contradict what's documented at the definition of FLEXIBLE_ARRAY_MEMBER: Well, there's a bunch of these in the source: Yeah, I did the same grep and noted the same thing --- but on second look, I think those are all structs that are defined without use of FLEXIBLE_ARRAY_MEMBER, which OTOH *is* used in struct shm_mq. regards, tom lane
Re: [HACKERS] currawong is not a happy animal
Andrew Dunstan escribió: On 01/17/2014 02:54 PM, Alvaro Herrera wrote: Hmm, this seems to contradict what's documented at the definition of FLEXIBLE_ARRAY_MEMBER: Well, there's a bunch of these in the source: AFAICT these are all defined with non-zero, non-empty array sizes, not FLEXIBLE_ARRAY_MEMBER (which mq_ring is). I don't know why that makes a difference but apparently it does. -- Álvaro Herrera  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On 1/17/14 10:37 AM, Mel Gorman wrote: There is not an easy way to tell. To be 100%, it would require an instrumentation patch or a systemtap script to detect when a particular page is being written back and track the context. There are approximations though. Monitor nr_dirty pages over time. I have a benchmarking wrapper for the pgbench testing program called pgbench-tools: https://github.com/gregs1104/pgbench-tools As of October, on Linux it now plots the Dirty value from /proc/meminfo over time. You get that on the same time axis as the transaction latency data. The report at the end includes things like the maximum amount of dirty memory observed during the test sampling. That doesn't tell you exactly what's happening to the level someone reworking the kernel logic might want, but you can easily see things like the database's checkpoint cycle reflected by watching the dirty memory total. This works really well for monitoring production servers too. I have a lot of data from a plugin for the Munin monitoring system that plots the same way. Once you have some history about what's normal, it's easy to see when systems fall behind in a way that's ruining writes, and the high water mark often correlates with bad responsiveness periods. Another recent change is that pgbench for the upcoming PostgreSQL 9.4 now allows you to specify a target transaction rate. Seeing the write latency behavior with that in place is far more interesting than anything we were able to watch with pgbench before. The pgbench write tests we've been doing for years mainly told you the throughput rate when all of the caches were always as full as the database could make them, and tuning for that is not very useful. Turns out it's far more interesting to run at 50% of what the storage is capable of, then watch what happens to latency when you adjust things like the dirty_* parameters. 
I've been working on the problem of how we can make a benchmark test case that acts enough like real busy PostgreSQL servers that we can share it with kernel developers, and then everyone has an objective way to measure changes. These rate limited tests are working much better for that than anything I came up with before. I am skeptical that the database will take over very much of this work and perform better than the Linux kernel does. My take is that our most useful role would be providing test cases kernel developers can add to a performance regression suite. Ugly "we never thought that would happen" situations seem to be at the root of many of the kernel performance regressions people here get nailed by. Effective I/O scheduling is very hard, and we are unlikely to ever out-innovate the kernel hacking community by pulling more of that into the database. It's already possible to experiment with moving in that direction with tuning changes. Use a larger database shared_buffers value, tweak checkpoints to spread I/O out, and reduce things like dirty_ratio. I do some of that, but I've learned it's dangerous to wander too far that way. If instead you let Linux do even more work--give it a lot of memory to manage and room to re-order I/O--that can work out quite well. For example, I've seen a lot of people try to keep latency down by using the deadline scheduler and very low settings for the expire times. Theory is great, but it never works out in the real world for me though. Here's the sort of deadline tuning I deploy instead now: echo 500 > ${DEV}/queue/iosched/read_expire echo 30 > ${DEV}/queue/iosched/write_expire echo 1048576 > ${DEV}/queue/iosched/writes_starved These numbers look insane compared to the defaults, but I assure you they're from a server that's happily chugging through 5 to 10K transactions/second around the clock. PostgreSQL forces writes out with fsync when they must go out, but this sort of tuning is basically giving up on it managing writes beyond that. 
We really have no idea what order they should go out in. I just let the kernel have a large pile of work queued up, and trust things like the kernel's block elevator and congestion code are smarter than the database can possibly be. -- Greg Smith greg.sm...@crunchydatasolutions.com Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/
Re: [HACKERS] [PATCH] pgcrypto: implement gen_random_uuid
Emre Hasegeli e...@hasegeli.com writes: 2014/1/9 Oskari Saarenmaa o...@ohmu.fi: The only useful feature of the uuid-ossp module in my opinion is the uuid_generate_v4 function and as uuid-ossp is more or less abandonware people have had trouble building and installing it. This patch implements an alternative uuid v4 generation function in pgcrypto which could be moved to core once there's a core PRNG with large enough internal state. It is a small but very useful patch. Installing uuid-ossp can be very hard on some systems. There is not much to review. The patch applies cleanly to HEAD. The function is generating valid UUID version 4. The code and the documentation style seems to fit in the pgcrypto extension. I am marking it as Ready for Committer. The problem is users probably would not look to the pgcrypto extension for a UUID generator, especially when there is another extension with uuid in its name. Also, a UUID generator does not sound like a cryptographic function. It would be much better if this were in core with the UUID type. There is a reference on the UUID Type documentation page to the uuid-ossp extension. We can add a reference to the pgcrypto extension on that page and consider adding a deprecation note to the uuid-ossp extension, if it is not possible to add the function to the core for now. Well, we're not pulling pgcrypto into core in the foreseeable future; there are legal (export control) issues that make that too risky. Even aside from that, there was general consensus when type uuid went in that the various generation algorithms were, how shall I say it, too intellectually unsatisfying to be part of the core code. So I think from a code standpoint this solution is just fine. I agree that we need some extra work on the documentation to point people towards this approach instead of uuid-ossp, though. I'll take care of that and commit. 
regards, tom lane
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Jan17, 2014, at 20:34 , David Rowley dgrowle...@gmail.com wrote: On Fri, Jan 17, 2014 at 3:05 PM, Florian Pflug f...@phlo.org wrote: I've now shuffled things around so that we can use inverse transition functions even if only some aggregates provide them, and to allow inverse transition functions to force a restart by returning NULL. The rules regarding NULL handling are now the following Maybe this is me thinking out loud, but I'm just thinking about the numeric case again. Since the patch can now handle inverse transition functions returning NULL when they fail to perform inverse transitions, I'm wondering if we could add an expectedscale to NumericAggState, set it to -1 initially, when we get the first value set the expectedscale to the dscale of that numeric, then if we get anything but that value we'll set the expectedscale back to -1 again, if we are asked to perform an inverse transition with a expectedscale as -1 we'll return null, otherwise we can perform the inverse transition... You could do better than that - the numeric problem amounts to tracking the maximum scale AFAICS, so it'd be sufficient to restart if we remove a value whose scale equals the current maximum. But if we do SUM(numeric) at all, I think we should do so without requiring restarts - I still think the overhead of tracking the maximum scale within the aggregated values isn't that bad. If we zero the dscale counters lazily, the numbers of entries we have to zero is bound by the maximum dscale we encounter. Since actually summing N digits is probably more expensive than zeroing them, and since we pay the zeroing overhead only once per aggregation but save potentially many summations, I'm pretty sure we come out ahead by quite some margin. It'd be interesting to do float() similar to the way you describe, though. We might not be able to guarantee that we yield exactly the same result as without inverse aggregation, but we might be able to bound the error. 
That might make it acceptable to do this - as Kevin pointed out, float is always an approximation anyway. I haven't really thought that through, though... Anyway, with time running out fast if we want to get this into 9.4, I think we should focus on getting this into a committable state right now. I've started to look over what you've pushed to github, and it looks mostly fine. I have a few comments - mostly cosmetic stuff - that I'll post once I've finished reading through it. I also plan to do some basic performance testing to verify that my reshuffling of eval_windowaggregates() doesn't hurt aggregates without an inverse transition function. best regards, Florian Pflug
Re: [HACKERS] currawong is not a happy animal
On 01/17/2014 03:15 PM, Tom Lane wrote: The other possibility I was contemplating is that "export a const variable" doesn't actually work for some reason. We're not in the habit of doing that elsewhere, so I don't find that theory outlandish. Perhaps it could be fixed by adding PGDLLIMPORT to the extern, but on the whole I'd rather avoid the technique altogether. The least-unlike-other-Postgres-code approach would be to go ahead and export the struct so that the size computation could be provided as a #define in the same header. Robert stated a couple days ago that he didn't foresee much churn in this struct, so that doesn't seem unacceptable. Another possibility is to refactor so that testing an allocation request against shm_mq_minimum_size is the responsibility of storage/ipc/shm_mq.c, not some random code in a contrib module. It's not immediately apparent to me why it's good code modularization to have a contrib module responsible for checking sizes based on the sizeof a struct it's not supposed to have any access to. Or maybe we could expose the value via a function rather than a const variable. cheers andrew
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Sat, Jan 18, 2014 at 10:17 AM, Florian Pflug f...@phlo.org wrote: On Jan17, 2014, at 20:34 , David Rowley dgrowle...@gmail.com wrote: On Fri, Jan 17, 2014 at 3:05 PM, Florian Pflug f...@phlo.org wrote: I've now shuffled things around so that we can use inverse transition functions even if only some aggregates provide them, and to allow inverse transition functions to force a restart by returning NULL. The rules regarding NULL handling are now the following Maybe this is me thinking out loud, but I'm just thinking about the numeric case again. Since the patch can now handle inverse transition functions returning NULL when they fail to perform inverse transitions, I'm wondering if we could add an expectedscale to NumericAggState, set it to -1 initially, when we get the first value set the expectedscale to the dscale of that numeric, then if we get anything but that value we'll set the expectedscale back to -1 again, if we are asked to perform an inverse transition with a expectedscale as -1 we'll return null, otherwise we can perform the inverse transition... You could do better than that - the numeric problem amounts to tracking the maximum scale AFAICS, so it'd be sufficient to restart if we remove a value whose scale equals the current maximum. But if we do SUM(numeric) at all, I think we should do so without requiring restarts - I still think the overhead of tracking the maximum scale within the aggregated values isn't that bad. If we zero the dscale counters lazily, the numbers of entries we have to zero is bound by the maximum dscale we encounter. Since actually summing N digits is probably more expensive than zeroing them, and since we pay the zeroing overhead only once per aggregation but save potentially many summations, I'm pretty sure we come out ahead by quite some margin. We'll work that out, I don't think it will take very long to code up your idea either. 
I just thought that my idea was good enough and very cheap too; won't all numerics that are actually stored in a column have the same scale anyway? Has it only been a problem because we've been testing with random numeric literals the whole time? The test turned out to become: if (state->expectedScale == -1) state->expectedScale = X.dscale; else if (state->expectedScale != X.dscale) state->expectedScale = -2; In do_numeric_accum, then when we do the inverse I just check if expectedScale < 0 then return NULL. I'm not set on it, and I'm willing to try the lazy zeroing of the scale tracker array, but I'm just not quite sure what extra real use cases it covers that the one above does not. Perhaps my way won't perform inverse transitions if the query did sum(numericCol / 10) or so. I'll be committing this to the github repo very soon, so we can hack away at the scale tracking code once it's back in. David Rowley
Re: [HACKERS] Triggers on foreign tables
On Tue, Jan 07, 2014 at 12:11:28PM +0100, Ronan Dunklau wrote: Since the last discussion about it (http://www.postgresql.org/message-id/cadyhksugp6ojb1pybtimop3s5fg_yokguto-7rbcltnvaj5...@mail.gmail.com), I finally managed to implement row triggers for foreign tables. For statement-level triggers, nothing has changed since the last patch. Statement-level triggers are just allowed in the command checking. For row-level triggers, it works as follows: - during rewrite phase, every attribute of the foreign table is duplicated as a resjunk entry if a trigger is defined on the relation. These entries are then used to store the old values for a tuple. There is room for improvement here: we could check if any trigger is in fact a row-level trigger The need here is awfully similar to a need of INSTEAD OF triggers on views. For them, we add a single wholerow resjunk TLE. Is there a good reason to do it differently for triggers on foreign tables? - for after triggers, the whole queuing mechanism is bypassed for foreign tables. This is IMO acceptable, since foreign tables cannot have constraints or constraint triggers, and thus have no need for deferrable execution. This design avoids the need for storing and retrieving/identifying remote tuples until the query or transaction end. Whether an AFTER ROW trigger is deferred determines whether it runs at the end of the firing query or at the end of the firing query's transaction. In all cases, every BEFORE ROW trigger of a given query fires before any AFTER ROW trigger of the same query. SQL requires that. This proposal would give foreign table AFTER ROW triggers a novel firing time; let's not do that. I think the options going forward are either (a) design a way to queue foreign table AFTER ROW triggers such that we can get the old and/or new rows at the end of the query or (b) not support AFTER ROW triggers on foreign tables for the time being. It's not clear to me whether SQL/MED contemplates triggers on foreign tables. 
Its "drop basic column definition" General Rules do mention the possibility of a reference from a trigger column list. On the other hand, I see nothing overriding the fact that CREATE TRIGGER only targets base tables. Is this clearer to anyone else? (This is a minor point, mainly bearing on the Compatibility section of the CREATE TRIGGER documentation.) - the duplicated resjunk attributes are identified by being: - marked as resjunk in the targetlist - not being system or whole-row attributes (varno > 0) There is still one small issue with the attached patch: modifications to the tuple performed by the foreign data wrapper (via the returned TupleTableSlot in ExecForeignUpdate and ExecForeignInsert hooks) are not visible to the AFTER trigger. This could be fixed by merging the planslot containing the resjunk columns with the returned slot before calling the trigger, but I'm not really sure how to safely perform that. Any advice ? Currently, FDWs are permitted to skip returning columns not actually referenced by any RETURNING clause. I would change that part of the API contract to require returning all columns when an AFTER ROW trigger is involved. You can't get around doing that by merging old column values, because, among other reasons, an INSERT does not have those values at all. 
On Tue, Jan 07, 2014 at 05:31:10PM +0100, Ronan Dunklau wrote: --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out +NOTICE: TG_NAME: trig_row_after +NOTICE: TG_WHEN: AFTER +NOTICE: TG_LEVEL: ROW +NOTICE: TG_OP: DELETE +NOTICE: TG_RELID::regclass: rem1 +NOTICE: TG_RELNAME: rem1 +NOTICE: TG_TABLE_NAME: rem1 +NOTICE: TG_TABLE_SCHEMA: public +NOTICE: TG_NARGS: 2 +NOTICE: TG_ARGV: [23, skidoo] +NOTICE: OLD: (2,bye) +NOTICE: TG_NAME: trig_row_before +NOTICE: TG_WHEN: BEFORE +NOTICE: TG_LEVEL: ROW +NOTICE: TG_OP: DELETE +NOTICE: TG_RELID::regclass: rem1 +NOTICE: TG_RELNAME: rem1 +NOTICE: TG_TABLE_NAME: rem1 +NOTICE: TG_TABLE_SCHEMA: public +NOTICE: TG_NARGS: 2 +NOTICE: TG_ARGV: [23, skidoo] +NOTICE: OLD: (11,bye remote) +NOTICE: TG_NAME: trig_row_after +NOTICE: TG_WHEN: AFTER +NOTICE: TG_LEVEL: ROW +NOTICE: TG_OP: DELETE +NOTICE: TG_RELID::regclass: rem1 +NOTICE: TG_RELNAME: rem1 +NOTICE: TG_TABLE_NAME: rem1 +NOTICE: TG_TABLE_SCHEMA: public +NOTICE: TG_NARGS: 2 +NOTICE: TG_ARGV: [23, skidoo] +NOTICE: OLD: (11,bye remote) +insert into rem1 values(1,'insert'); Would you trim the verbosity a bit? Maybe merge several of the TG_ fields onto one line, and remove the low-importance ones. Perhaps issue one line like this in place of all the current TG_ lines: NOTICE: trig_row_after(23, skidoo) AFTER ROW DELETE ON rem1 --- a/contrib/postgres_fdw/sql/postgres_fdw.sql +++ b/contrib/postgres_fdw/sql/postgres_fdw.sql @@ -384,3 +384,188 @@ insert into loc1(f2) values('bye'); insert into rem1(f2) values('bye
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Sat, Jan 18, 2014 at 10:42 AM, David Rowley dgrowle...@gmail.com wrote: You could do better than that - the numeric problem amounts to tracking the maximum scale AFAICS, so it'd be sufficient to restart if we remove a value whose scale equals the current maximum. But if we do SUM(numeric) at all, I think we should do so without requiring restarts - I still think the overhead of tracking the maximum scale within the aggregated values isn't that bad. If we zero the dscale counters lazily, the numbers of entries we have to zero is bound by the maximum dscale we encounter. Since actually summing N digits is probably more expensive than zeroing them, and since we pay the zeroing overhead only once per aggregation but save potentially many summations, I'm pretty sure we come out ahead by quite some margin. We'll work that out, I don't think it will take very long to code up your idea either. I just thought that my idea was good enough and very cheap too; won't all numerics that are actually stored in a column have the same scale anyway? Has it only been a problem because we've been testing with random numeric literals the whole time? The test turned out to become: if (state->expectedScale == -1) state->expectedScale = X.dscale; else if (state->expectedScale != X.dscale) state->expectedScale = -2; In do_numeric_accum, then when we do the inverse I just check if expectedScale < 0 then return NULL. I'm not set on it, and I'm willing to try the lazy zeroing of the scale tracker array, but I'm just not quite sure what extra real use cases it covers that the one above does not. Perhaps my way won't perform inverse transitions if the query did sum(numericCol / 10) or so. I'll be committing this to the github repo very soon, so we can hack away at the scale tracking code once it's back in. Ok, we're back up to 86 aggregate function / type combinations with inverse transition functions. 
I've committed my latest work up to github and here's a fresh patch which I will need to do more tests on. The test (below) that used to fail a few versions ago is back in there and it's now passing. SELECT SUM(n::numeric) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) FROM (VALUES(1,1.01),(2,2),(3,3)) v(i,n); In this case it won't use inverse transitions because the forward transition function detects that the scale is not fixed. I tested throwing some numerics into a table and I'm pretty happy with the results. create table num (num numeric(10,2) not null); insert into num (num) select * from generate_series(1,2); select sum(num) over(order by num rows between current row and unbounded following) from num; -- 124ms select sum(num / 10) over(order by num rows between current row and unbounded following) from num; -- 254ms select sum(num / 1) over(order by num rows between current row and unbounded following) from num; -- 108156.917 ms The divide by 1 case is slow because of those weird 20 trailing zeros instead of 16 when dividing a numeric by 1, which causes the inverse transition function to return NULL because the scale changes. I've not tested an unpatched version yet to see how that divide by 1 query performs on that but I'll get to that soon. I'm thinking that the idea about detecting the numeric range with floating point types and performing an inverse transition providing the range has not gone beyond some defined danger zone is not material for this patch... I think it would be not a great deal of work to code, but the testing involved would probably make this patch not possible for 9.4. The latest version of the patch is attached. Regards David Rowley inverse_transition_functions_v2.7.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
David Rowley dgrowle...@gmail.com writes: I just thought that my idea was good enough and very cheap too, won't all numerics that are actually stored in a column have the same scale anyway?

No; unconstrained numeric columns (ie, if you just say numeric) don't force their contents to any particular scale. It might be that we don't have to optimize for that case, since it's not in the SQL spec, but it is definitely supported by PG.

regards, tom lane
[HACKERS] Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance (summary v2 2014-1-17)
On Fri, Jan 17, 2014 at 9:14 AM, Andres Freund and...@2ndquadrant.com wrote: The scheme that'd allow us is the following: When postgres reads a data page, it will continue to first look up the page in its shared buffers; if it's not there, it will perform a page cache backed read, but instruct that read to immediately remove the page from the page cache afterwards (new API, or posix_fadvise(), or whatever). As long as it's in shared_buffers, postgres will not need to issue new reads, so there's no benefit keeping it in the page cache. If the page is dirtied, it will be written out normally, telling the kernel to forget about caching the page (using 3) or possibly direct io). When a page in postgres's buffers (which wouldn't be set to very large values) isn't needed anymore and *not* dirty, it will seed the kernel page cache with the current data.

This seems like mostly an exact reimplementation of DIRECT_IO semantics using posix_fadvise.

-- greg
Re: [HACKERS] Add %z support to elog/ereport?
Hi,

On 2014-01-17 13:50:08 -0500, Tom Lane wrote: I think a better solution approach is to teach our src/port/snprintf.c about the z flag, then extend configure's checking to force use of our snprintf if the platform's version doesn't handle z. While it might be objected that this creates a need for a run-time check in configure, we already have one to check if snprintf handles n$, so this approach doesn't really make life any worse for cross-compiles.

Hm. I had thought about that, but dismissed it because I thought people would argue about it being too invasive... If we're going there, we should just eliminate expand_fmt_string() from elog.c and test for it in configure too, right?

You suggest below that we could invent some additional macros to support that; but since the z flag is in C99, there really ought to be only a small minority of platforms where it doesn't work.

Well, maybe just a minority numberwise, but one of them being windows surely makes it count in number of installations...

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Add %z support to elog/ereport?
On 2014-01-17 14:18:55 -0500, Tom Lane wrote: I wrote: Meh. This isn't needed if we do what I suggest above, but in any case I don't approve of removing the existing [U]INT64_FORMAT macros. That breaks code that doesn't need to get broken, probably including third-party modules. After looking more closely I see you didn't actually *remove* those macros, just define them in a different place/way. So the above objection is just noise, sorry. (Though I think it'd be notationally cleaner to let configure continue to define the macros; it doesn't need to do anything as ugly as CppAsString2() to concatenate...)

I prefer having configure just define the length modifier, since that allows us to define further macros containing formats. But I think defining them as strings instead of raw literals as I had might make it a bit less ugly...

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Add %z support to elog/ereport?
On 2014-01-17 13:50:08 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: [ 0001-Add-support-for-printing-Size-arguments-to-elog-erep.patch ] I think this approach is fundamentally broken, because it can't reasonably cope with any case more complicated than %zu or %zd. While it's arguable that we can get along without the ability to specify field width, since the [U]INT64_FORMAT macros never allowed that, it is surely going to be the case that translators will need to insert n$ flags in the translated versions of messages.

Am I just too tired, or am I not getting how INT64_FORMAT currently allows the arguments to be used positionally? Admittedly most places currently just cast down the value, avoiding the need for INT64_FORMAT in many places.

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance (summary v2 2014-1-17)
On 2014-01-17 16:18:49 -0800, Greg Stark wrote: On Fri, Jan 17, 2014 at 9:14 AM, Andres Freund and...@2ndquadrant.com wrote: The scheme that'd allow us is the following: When postgres reads a data page, it will continue to first look up the page in its shared buffers; if it's not there, it will perform a page cache backed read, but instruct that read to immediately remove the page from the page cache afterwards (new API, or posix_fadvise(), or whatever). As long as it's in shared_buffers, postgres will not need to issue new reads, so there's no benefit keeping it in the page cache. If the page is dirtied, it will be written out normally, telling the kernel to forget about caching the page (using 3) or possibly direct io). When a page in postgres's buffers (which wouldn't be set to very large values) isn't needed anymore and *not* dirty, it will seed the kernel page cache with the current data.

This seems like mostly an exact reimplementation of DIRECT_IO semantics using posix_fadvise.

Not at all. The significant bits are a) the kernel uses the pagecache for writeback of dirty data and, more importantly, b) uses the pagecache to cache data not in postgres's s_b.

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Add %z support to elog/ereport?
Andres Freund and...@2ndquadrant.com writes: On 2014-01-17 13:50:08 -0500, Tom Lane wrote: I think a better solution approach is to teach our src/port/snprintf.c about the z flag, then extend configure's checking to force use of our snprintf if the platform's version doesn't handle z. Hm. I had thought about that, but dismissed it because I thought people would argue about it being too invasive...

How so? It'd be a lot less invasive than what we'd have to do to use 'z' flags the other way.

If we're going there, we should just eliminate expand_fmt_string() from elog.c and test for it in configure too, right?

If you mean let's rely on glibc for %m, the answer is not bloody likely. See useful_strerror(), which is functionality we'd lose if we short-circuit that.

You suggest below that we could invent some additional macros to support that; but since the z flag is in C99, there really ought to be only a small minority of platforms where it doesn't work. Well, maybe just a minority numberwise, but one of them being windows surely makes it count in number of installations...

Agreed, but I believe we're already using src/port/snprintf.c on Windows because of lack of %n$ support (which is also required by C99).

regards, tom lane
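For reference, the z flag being debated is the C99 length modifier for size_t. A minimal example of the two styles under discussion - using %zu directly versus casting to an explicitly sized type when the platform's printf can't be trusted:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* C99 style: %zu formats a size_t directly, with no cast. */
static void
format_size(char *dst, size_t dstlen, size_t v)
{
	snprintf(dst, dstlen, "%zu bytes", v);
}

/* Pre-C99 fallback style: cast to a known-width type and use its
 * format; this is the kind of workaround the configure check avoids. */
static void
format_size_fallback(char *dst, size_t dstlen, size_t v)
{
	snprintf(dst, dstlen, "%lu bytes", (unsigned long) v);
}
```

The thread's point about translations is that once gettext() has substituted a translated string, the format may legally contain positional n$ flags, so whatever printf implementation is used has to handle both z and n$.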
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
First, I've got the feeling that I should somehow update the commitfest app, but I don't really know in which way. Should I put myself in as a reviewer, or as a second author? Or neither? Suggestions welcome...

On Jan17, 2014, at 23:34 , David Rowley dgrowle...@gmail.com wrote: The test turned out to become:

if (state->expectedScale == -1)
    state->expectedScale = X.dscale;
else if (state->expectedScale != X.dscale)
    state->expectedScale = -2;

In do_numeric_accum, then when we do the inverse I just check if expectedScale < 0 then return NULL.

Ok, so this will rescan unless the dscales of all inputs match. I still think that's overly pessimistic - we've only got a problem when we removed the input with the largest dscale, no? So my suggestion would be

state->maxDScale = MAX(X.dscale, state->maxDScale);

in do_numeric_accum, and in the inverse

if (state->maxDScale == X.dscale)
    return PG_RETURN_NULL;

I'd think that this avoids more restarts with about the same effort, but I haven't tried this though, so maybe I'm missing something.

I'm not set on it, and I'm willing to try the lazy zeroing of the scale tracker array, but I'm just not quite sure what extra real use cases it covers that the one above does not. Perhaps my way won't perform inverse transitions if the query did sum(numericCol / 10) or so.

Dunno how many people SUM over numerics with different dscales. Easily possible that it's few enough to not be worth fussing over.
create table num (num numeric(10,2) not null);
insert into num (num) select * from generate_series(1,2);
select sum(num) over(order by num rows between current row and unbounded following) from num; -- 124ms
select sum(num / 10) over(order by num rows between current row and unbounded following) from num; -- 254ms
select sum(num / 1) over(order by num rows between current row and unbounded following) from num; -- 108156.917 ms

The divide by 1 case is slow because of those weird 20 trailing zeros instead of 16 when dividing a numeric by 1, and that causes the inverse transition function to return NULL because the scale changes. I've not tested an unpatched version yet to see how that divide by 1 query performs on that but I'll get to that soon.

So without the patch, all three queries should perform similarly, i.e. take about 10 seconds, right? If so, the improvement is fantastic!

I'm thinking that the idea about detecting the numeric range with floating point types and performing an inverse transition providing the range has not gone beyond some defined danger zone is not material for this patch... I think it would be not a great deal of work to code, but the testing involved would probably make this patch not possible for 9.4

Yeah, I never imagined that this would happen for 9.4.

The latest version of the patch is attached.

OK, there are a few more comments

* build_aggregate_fnexprs() should allow NULL to be passed for invtransfn_oid, I think. I don't quite like that we construct that even for plain aggregates, and avoiding that requires just an additional if.

* Don't we need to check for volatile functions in the filter expression too?

* As it stands, I don't think intXand_inv and intXor_inv are worth it, since the case where they return non-NULL is awfully slim (only for inputs containing only ones, respectively only zeros). We should either track the number of zeros and ones per bit, which would allow us to always perform inverse transitions, or just drop them.
* Quite a few of the inverse transition functions are marked strict, yet contain code to handle NULL inputs. You can just remove that code - the system makes sure that strict functions never receive NULL arguments. Affected are, AFAICS, numeric_accum_inv, numeric_avg_accum_inv, int2_accum_inv, int4_accum_inv, int8_accum_inv, int8_avg_accum_inv, int2_sum_inv, int4_sum_inv, int8_sum_inv. Not sure that list is exhaustive...

* For any of the new inverse transition functions, I'd be inclined to just elog() if they're called directly and not as an aggregate. In particular those which check for that anyway, plus the *smaller_inv and *larger_inv ones. I don't see why anyone would ever want to call these directly - and if we elog() now, we can easily change these functions later, because no external code can depend on them. E.g., maybe someone wants to keep the top N elements in the MIN/MAX aggregates one day...

* The number of new regression tests seems a bit excessive. I don't think there really is a policy on what to test and what not, but in this case I think it suffices if we test the basic machinery, plus the more complex functions. But e.g. most of the SUM and AVG aggregates use numeric_accum or numeric_avg_accum internally, and the wrapper code basically just does datatype conversion, so testing a few cases seems enough there. What I think we *should* test, but don't do currently, is whether
Re: [HACKERS] Add %z support to elog/ereport?
Andres Freund and...@2ndquadrant.com writes: On 2014-01-17 13:50:08 -0500, Tom Lane wrote: I think this approach is fundamentally broken, because it can't reasonably cope with any case more complicated than %zu or %zd. Am I just too tired, or am I not getting how INT64_FORMAT currently allows the arguments to be used positionally?

It doesn't, which is one of the reasons for not allowing it in translatable strings (the other being lack of standardization of the strings that would be subject to translation). Adding 'z' will only fix this for cases where what we want to print is really a size_t. That's a usefully large set, but not all the cases by any means.

regards, tom lane
Re: [HACKERS] Add %z support to elog/ereport?
I wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-01-17 13:50:08 -0500, Tom Lane wrote: I think this approach is fundamentally broken, because it can't reasonably cope with any case more complicated than %zu or %zd. Am I just too tired, or am I not getting how INT64_FORMAT currently allows the arguments to be used positionally? It doesn't, which is one of the reasons for not allowing it in translatable strings (the other being lack of standardization of the strings that would be subject to translation).

On second thought, that answer was too glib. There's no need for %n$ in the format strings *in the source code*, so INT64_FORMAT isn't getting in the way from that perspective. However, expand_fmt_string() is necessarily applied to formats *after* they've been through gettext(), so it has to expect that it might see %n$ in the now-translated strings.

It's still the case that we need a policy against INT64_FORMAT in translatable strings, though, because what's passed to the gettext mechanism has to be a fixed string not something with platform dependencies in it. Or so I would assume, anyway.

regards, tom lane
Re: [HACKERS] Minor improvements to sslinfo contrib module
Gurjeet Singh gurj...@singh.im writes: Please find attached the patch that fixes a couple of comments, and adds 'static' qualifier to functions that are not used anywhere else in the code base.

Committed, with the trivial fix that the function definitions have to be marked static too. (gcc doesn't complain about this, which I think may be correct per C standard; but some other compilers do whine about that sort of inconsistency.)

regards, tom lane
Re: [HACKERS] array_length(anyarray)
On 1/12/14, 5:53 AM, I wrote: On 1/9/14, 2:57 PM, Dean Rasheed wrote: How it should behave for multi-dimensional arrays is less clear, but I'd argue that it should return the total number of elements, i.e. cardinality('{{1,2},{3,4}}'::int[][]) = 4. That would make it consistent with the choices we've already made for unnest() and ordinality: - cardinality(foo) = (select count(*) from unnest(foo)). - unnest with ordinality would always result in ordinals in the range [1, cardinality]. Ignoring my proposal, this seems like the most reasonable option. I'll send an updated patch along these lines.

Here's the patch as promised. Thoughts?

Regards, Marko Tiikkaja

*** a/doc/src/sgml/array.sgml
--- b/doc/src/sgml/array.sgml
*** 338,343 ****
--- 338,356 ----
  SELECT array_length(schedule, 1) FROM sal_emp WHERE name = 'Carol';
   2
  (1 row)
  </programlisting>
+ 
+    <function>cardinality</function> returns the total number of elements in an
+    array across all dimensions.  It is effectively the number of rows a call to
+    <function>unnest</function> would yield:
+ 
+ <programlisting>
+ SELECT cardinality(schedule) FROM sal_emp WHERE name = 'Carol';
+ 
+  cardinality
+ -------------
+            4
+ (1 row)
+ </programlisting>
  </para>
  </sect2>

*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*** 11009,11014 **** SELECT NULLIF(value, '(none)') ...
--- 11009,11017 ----
    <primary>array_upper</primary>
   </indexterm>
   <indexterm>
+   <primary>cardinality</primary>
+  </indexterm>
+  <indexterm>
    <primary>string_to_array</primary>
   </indexterm>
   <indexterm>
*** 11167,11172 **** SELECT NULLIF(value, '(none)') ...
--- 11170,11186 ----
  <row>
   <entry>
    <literal>
+    <function>cardinality</function>(<type>anyarray</type>)
+   </literal>
+  </entry>
+  <entry><type>int</type></entry>
+  <entry>returns the total number of elements in the array, or 0 if the array is empty</entry>
+  <entry><literal>cardinality(ARRAY[[1,2],[3,4]])</literal></entry>
+  <entry><literal>4</literal></entry>
+ </row>
+ <row>
+  <entry>
+   <literal>
    <function>string_to_array</function>(<type>text</type>, <type>text</type> <optional>, <type>text</type></optional>)
   </literal>
  </entry>

*** a/src/backend/utils/adt/arrayfuncs.c
--- b/src/backend/utils/adt/arrayfuncs.c
*** 1740,1745 **** array_length(PG_FUNCTION_ARGS)
--- 1740,1766 ----
  }
  
  /*
+  * array_cardinality:
+  *		returns the total number of elements in an array
+  */
+ Datum
+ array_cardinality(PG_FUNCTION_ARGS)
+ {
+ 	ArrayType  *v = PG_GETARG_ARRAYTYPE_P(0);
+ 	int		   *dimv;
+ 	int			i;
+ 	int			cardinality;
+ 
+ 	dimv = ARR_DIMS(v);
+ 	cardinality = 1;
+ 	for (i = 0; i < ARR_NDIM(v); i++)
+ 		cardinality *= dimv[i];
+ 
+ 	PG_RETURN_INT32(cardinality);
+ }
+ 
+ 
+ /*
   * array_ref :
   *	  This routine takes an array pointer and a subscript array and returns
   *	  the referenced item as a Datum.  Note that for a pass-by-reference

*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*** 840,845 **** DATA(insert OID = 2092 ( array_upper PGNSP PGUID 12 1 0 0 0 f f f f t f i 2
--- 840,847 ----
  DESCR("array upper dimension");
  DATA(insert OID = 2176 ( array_length PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 23 "2277 23" _null_ _null_ _null_ _null_ array_length _null_ _null_ _null_ ));
  DESCR("array length");
+ DATA(insert OID = 3179 ( cardinality	PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 23 "2277" _null_ _null_ _null_ _null_ array_cardinality _null_ _null_ _null_ ));
+ DESCR("array cardinality");
  DATA(insert OID = 378 ( array_append	PGNSP PGUID 12 1 0 0 0 f f f f f f i 2 0 2277 "2277 2283" _null_ _null_ _null_ _null_ array_push _null_ _null_ _null_ ));
  DESCR("append element onto end of array");
  DATA(insert OID = 379 ( array_prepend	PGNSP PGUID 12 1 0 0 0 f f f f f f i 2 0 2277 "2283 2277" _null_ _null_ _null_ _null_ array_push _null_ _null_ _null_ ));

*** a/src/include/utils/array.h
--- b/src/include/utils/array.h
*** 204,209 **** extern Datum array_dims(PG_FUNCTION_ARGS);
--- 204,210 ----
  extern Datum array_lower(PG_FUNCTION_ARGS);
  extern Datum array_upper(PG_FUNCTION_ARGS);
  extern Datum array_length(PG_FUNCTION_ARGS);
+ extern Datum array_cardinality(PG_FUNCTION_ARGS);
  extern Datum array_larger(PG_FUNCTION_ARGS);
  extern Datum array_smaller(PG_FUNCTION_ARGS);
  extern Datum generate_subscripts(PG_FUNCTION_ARGS);

*** a/src/test/regress/expected/arrays.out
--- b/src/test/regress/expected/arrays.out
*** 1455,1460 **** select array_length(array[[1,2,3], [4,5,6]], 3);
--- 1455,1502 ----
  (1 row)
  
+ select cardinality(NULL::int[]);
+  cardinality
+ -------------
+ 
+ (1 row)
+ 
+ select
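The heart of the patch above is just the product of the dimension sizes. A standalone model of that loop (not the PostgreSQL fmgr interface, and separate from the ARR_DIMS/ARR_NDIM macros):

```c
#include <assert.h>

/* Model of array_cardinality's core: the element count of an array is
 * the product of the sizes of all of its dimensions. */
static int
cardinality_of(const int *dims, int ndim)
{
	int			result = 1;
	int			i;

	for (i = 0; i < ndim; i++)
		result *= dims[i];
	return result;
}
```

Note that with zero dimensions the loop body never runs and the result is 1, so as posted the function depends on empty arrays being handled elsewhere if the documented "0 if the array is empty" behavior is to hold.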
Re: [HACKERS] [PATCH] Make various variables read-only (const)
Oskari Saarenmaa o...@ohmu.fi writes: On Sun, Dec 22, 2013 at 09:43:57PM -0500, Robert Haas wrote: - Why change the API of transformRelOptions()? The comment was changed to reflect the new API. I modified transformRelOptions to only accept a single valid namespace to make things simpler in the calling code. Nothing used more than one valid namespace anyway, and it allows us to just use a constant "toast" without having to create a 2-element char* array with a NULL.

That may be true today, but the code was obviously designed to allow for more than one namespace in the future. I'm inclined to reject this part of the patch, or at least rework it to const-ify the existing data structure rather than editorialize on what's needed. However, I believe this code was Alvaro's to begin with, so I'd be interested in his opinion on how important it is for transformRelOptions to allow more than one namespace per call.

Actually, I'm kind of wondering why the code has a concept of namespaces at all, given the fact that it fails to store them in the resulting array. It seems impossible to verify after the fact that an option was given with the right namespace, so isn't the feature pretty much pointless?

regards, tom lane
Re: [HACKERS] 9.3.2 Client-only installation per docs fails creating directory.
On Thu, 2014-01-16 at 15:59 -0600, Mike Blackwell wrote: Per the docs (15.4 Installation Procedure, Client-only installation), after running make, the following should work: gmake -C src/bin install gmake -C src/include install gmake -C src/interfaces install gmake -C doc install

Fixed, thanks. (It works if you have NLS enabled, which is why this was probably not reported before.)
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Sat, Jan 18, 2014 at 2:20 PM, Florian Pflug f...@phlo.org wrote: First, I've got the feeling that I should somehow update the commitfest app, but I don't really know in which way. Should I put myself in as a reviewer, or as a second author? Or neither? Suggestions welcome...

Well, I guess you're both now, but it's a bit weird to be author and reviewer, so I've put your name against author too; hopefully Dean can review our combined results and we can review each other's work at the same time.

On Jan17, 2014, at 23:34 , David Rowley dgrowle...@gmail.com wrote: The test turned out to become:

if (state->expectedScale == -1)
    state->expectedScale = X.dscale;
else if (state->expectedScale != X.dscale)
    state->expectedScale = -2;

In do_numeric_accum, then when we do the inverse I just check if expectedScale < 0 then return NULL.

Ok, so this will rescan unless the dscales of all inputs match. I still think that's overly pessimistic - we've only got a problem when we removed the input with the largest dscale, no? So my suggestion would be

state->maxDScale = MAX(X.dscale, state->maxDScale);

in do_numeric_accum, and in the inverse

if (state->maxDScale == X.dscale)
    return PG_RETURN_NULL;

I'd think that this avoids more restarts with about the same effort, but I haven't tried this though, so maybe I'm missing something.

This is not quite right, as it means that if all the values are the same then we reject inverse transitions, since state->maxScale will always be equal to X.dscale. But you are right about the overly strict code I've put in; we should allow values with less than the maximum dscale to be unaggregated without complaint. To implement this I needed a maxScaleCounter variable too, so we only reject when the maxScaleCounter gets back to 0 again.
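The maxScale/maxScaleCounter bookkeeping described above can be modeled compactly. Again a standalone sketch with illustrative names (the real state lives in PostgreSQL's numeric aggregate code): track the largest dscale seen and how many not-yet-removed inputs carry it; a removal is only unsafe when it takes away the last maximum-dscale input.

```c
#include <assert.h>

typedef struct ScaleState
{
	int			maxScale;		/* largest dscale aggregated so far */
	int			maxScaleCounter;	/* how many current inputs carry it */
} ScaleState;

static void
scale_forward(ScaleState *s, int dscale)
{
	if (dscale > s->maxScale)
	{
		s->maxScale = dscale;
		s->maxScaleCounter = 1;
	}
	else if (dscale == s->maxScale)
		s->maxScaleCounter++;
}

/* returns 1 if the removal is safe, 0 if the aggregate must rescan */
static int
scale_inverse(ScaleState *s, int dscale)
{
	if (dscale == s->maxScale && --s->maxScaleCounter == 0)
		return 0;				/* removed the last max-scale input */
	return 1;
}
```

Unlike the expectedScale approach, this keeps inverse transitions working when every input has the same dscale, which is exactly the case David's fix addresses.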
Note that after this fix the results for my quick benchmark now look like: create table num (num numeric(10,2) not null); insert into num (num) select * from generate_series(1,2); select sum(num) over(order by num rows between current row and unbounded following) from num; -- 113 ms select sum(num / 10) over(order by num rows between current row and unbounded following) from num; -- 156ms select sum(num / 1) over(order by num rows between current row and unbounded following) from num; -- 144 ms So it seems to be much less prone to falling back to brute force forward transitions. It also seems the / 10 version must have had to previously do 1 brute force rescan but now it looks like it can do it in 1 scan. I'm not set on it, and I'm willing to try the lazy zeroing of the scale tracker array, but I'm just not quite sure what extra real use cases it covers that the one above does not. Perhaps my way won't perform inverse transitions if the query did sum(numericCol / 10) or so. Dunno how many people SUM over numerics with different dscales. Easily possible that it's few enough to not be worth fussing over. Going by Tom's comments in the post above this is possible just by having an unconstrained numeric column, but I guess there's still a good chance that even those unconstrained numbers have the same scale or at least the scale will likely not vary wildly enough to make us have to perform brute force forward transitions for each row. 
create table num (num numeric(10,2) not null);
insert into num (num) select * from generate_series(1,2);
select sum(num) over(order by num rows between current row and unbounded following) from num; -- 124ms
select sum(num / 10) over(order by num rows between current row and unbounded following) from num; -- 254ms
select sum(num / 1) over(order by num rows between current row and unbounded following) from num; -- 108156.917 ms

The divide by 1 case is slow because of those weird 20 trailing zeros instead of 16 when dividing a numeric by 1, and that causes the inverse transition function to return NULL because the scale changes. I've not tested an unpatched version yet to see how that divide by 1 query performs on that but I'll get to that soon.

So without the patch, all three queries should perform similarly, i.e. take about 10 seconds, right? If so, the improvement is fantastic!

Well, it's actually 100 seconds, not 10. I tested the worst-case performance against an unpatched head and got 107 seconds instead of the 108. So I'm guessing that's pretty good, as the worst case is not really any worse, and the worst case is pretty hard to get to. I guess the results would have to all have a different scale with the biggest scale on the first aggregated values... Reaching that worst case just seems impossible in a real world workload.

I'm thinking that the idea about detecting the numeric range with floating point types and performing an inverse transition providing the range has not gone beyond some defined danger zone is not material for this patch... I think it would be not a great deal of work to code, but
Re: [HACKERS] [PATCH] Make various variables read-only (const)
I wrote: However, I believe this code was Alvaro's to begin with, so I'd be interested in his opinion on how important it is for transformRelOptions to allow more than one namespace per call. Actually, I'm kind of wondering why the code has a concept of namespaces at all, given the fact that it fails to store them in the resulting array. It seems impossible to verify after the fact that an option was given with the right namespace, so isn't the feature pretty much pointless?

After thinking about it some more, I realize that the intended usage pattern is to call transformRelOptions once for each allowed namespace. Since each call would ignore options outside its namespace, that would result in any options with invalid namespace names being silently ignored; so the fix was to add a parameter saying which namespaces are valid, and then each call checks that, independently of extracting the options it cares about.

This design seems a bit odd to me; it's certainly wasting effort, since each namespace check after the first one is redundant. I'm inclined to propose that we factor out the namespace validity checking into a separate function, such that you do something like

void checkOptionNamespaces(List *defList, const char * const validnsps[])

just once, and then call transformRelOptions for each of the desired namespaces; transformRelOptions's validnsps argument goes away. In at least some places this looks like it would net out cleaner; for instance, there is no good reason why callers that are trying to extract toast options should have to know which other namespaces are allowed.

regards, tom lane
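The core of the proposed refactoring - validating option namespaces once, up front, instead of inside every transformRelOptions() call - comes down to a membership test against a NULL-terminated list of valid namespace names. A standalone sketch (options are modeled as plain strings here; in PostgreSQL they are DefElem nodes in a List):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if nsp is NULL (an unqualified option, always allowed) or
 * appears in the NULL-terminated validnsps array, else 0. A
 * checkOptionNamespaces() as proposed would run this test over every
 * option's namespace once and error out on any failure. */
static int
namespace_is_valid(const char *nsp, const char *const validnsps[])
{
	int			i;

	if (nsp == NULL)
		return 1;
	for (i = 0; validnsps[i] != NULL; i++)
	{
		if (strcmp(nsp, validnsps[i]) == 0)
			return 1;
	}
	return 0;
}
```

With the check hoisted out this way, each subsequent per-namespace extraction call no longer needs to know which *other* namespaces exist, which is the cleanup the message argues for.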
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Sat, Jan 18, 2014 at 6:15 PM, David Rowley dgrowle...@gmail.com wrote: On Sat, Jan 18, 2014 at 2:20 PM, Florian Pflug f...@phlo.org wrote: * Don't we need to check for volatile functions in the filter expression too?

I did manual testing on this before, and the volatility test for the aggregate arguments seems to cover this. I didn't look into why, but it just did. I've not tested this again since your refactoring. I could test this easily before, when my numeric case was changing the results because of the dscale problem; I noticed that if I did FILTER(WHERE random() > 0) the extra trailing zeros would disappear. The problem now is that it's pretty hard to determine if an inverse transition took place; the only way we can really tell is performance. I'll see if I can invent a new test case for this by creating a user defined aggregate as you described. I'm thinking just append '+' to a string for transitions and '-' to a string for inverse transitions, then just make sure we only have a string of '+'s when doing something like filter(where random() >= 0).

I've added a test case for this and it seems to work as expected: https://github.com/david-rowley/postgres/commit/43a5021e8f8ae1af272e7e21a842d1b0d5cbe577

Regards

David Rowley
Re: [HACKERS] currawong is not a happy animal
On Sat, Jan 18, 2014 at 2:48 AM, Andrew Dunstan and...@dunslane.net wrote: On 01/17/2014 03:15 PM, Tom Lane wrote: The other possibility I was contemplating is that export a const variable doesn't actually work for some reason. We're not in the habit of doing that elsewhere, so I don't find that theory outlandish. Perhaps it could be fixed by adding PGDLLIMPORT to the extern, but on the whole I'd rather avoid the technique altogether.

Is there any specific reason why adding PGDLLIMPORT for exporting const variables is not good, when we are already doing it for other variables like InterruptHoldoffCount and CritSectionCount? On searching in the code, I found that for a few const variables we do extern PGDLLIMPORT. For example:

extern PGDLLIMPORT const int NumScanKeywords;
extern PGDLLIMPORT const char *debug_query_string;

The least-unlike-other-Postgres-code approach would be to go ahead and export the struct so that the size computation could be provided as a #define in the same header. Robert stated a couple days ago that he didn't foresee much churn in this struct, so that doesn't seem unacceptable. Another possibility is to refactor so that testing an allocation request against shm_mq_minimum_size is the responsibility of storage/ipc/shm_mq.c, not some random code in a contrib module. It's not immediately apparent to me why it's good code modularization to have a contrib module responsible for checking sizes based on the sizeof a struct it's not supposed to have any access to. Or maybe we could expose the value via a function rather than a const variable.

All of the above suggested alternatives will fix this problem. However, after fixing this problem, when I ran the test further it crashed the bgworker, and I found that the reason was that there are some other variables (ImmediateInterruptOK, MyBgworkerEntry) used in the test module without PGDLLIMPORT.
After adding PGDLLIMPORT to variables (ImmediateInterruptOK, MyBgworkerEntry, shm_mq_minimum_size) both the tests defined in contrib module passed. Attached patch fixes the problems related to test_shm_mq for me.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

fix_test_shm_mq.patch Description: Binary data