Re: [HACKERS] Resource Owner reassign Locks
On Thu, Jun 21, 2012 at 5:32 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 18.06.2012 13:59, Heikki Linnakangas wrote: On 10.06.2012 23:39, Jeff Janes wrote: I found the interface between resowner.c and lock.c a bit confusing. resowner.c would sometimes call LockReassignCurrentOwner() to reassign all the locks, and sometimes it would call it on each individual lock, with a net effect that is the same as calling LockReleaseCurrentOwner(). And it doesn't seem right for ResourceOwnerRemember/ForgetLock to have to accept a NULL owner.

I rearranged that so that there's just a single LockReassignCurrentOwner() function, like before this patch. But it takes as an optional argument a list of locks held by the current resource owner. If the caller knows it, it can pass them to make the call faster, but if it doesn't, it can just pass NULL and the function will traverse the hash table the old-fashioned way. I think that's a better API. Please take a look to see if I broke something.

I hear no complaints, so committed.

Thanks! I'd like to advocate for back-patching this to 9.0, 9.1, and 9.2. It has run without problems for a while now, and it can be considered a bug that systems with a very large number of objects cannot be upgraded in a reasonable time. There are possible ways to make the upgrade smoother by changing the new system's pg_upgrade rather than the old system's server, but those methods would not be simpler, nor as well tested. I'll add this proposal to the commit fest.

Thanks, Jeff
Re: [HACKERS] [CORE] Restore-reliability mode
On Sat, Jun 6, 2015 at 7:35 AM, Geoff Winkless pgsqlad...@geoff.dj wrote: On 6 June 2015 at 13:41, Sehrope Sarkuni sehr...@jackdb.com wrote: It's much easier to work into dev/test setups if there are system packages, as it's just a config change to an existing script. Building from source would require a whole new workflow that I don't have time to incorporate.

Really? You genuinely don't have time to paste, say:

mkdir -p ~/src/pgdevel
cd ~/src/pgdevel
wget https://ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.bz2
tar xjf postgresql-snapshot.tar.bz2
mkdir bld
cd bld
../postgresql-9.5devel/configure $(pg_config --configure | sed -e 's/\(pg\|postgresql[-\/]\)\(doc-\)\?9\.[0-9]*\(dev\)\?/\1\29.5dev/g')
make world
make check
make world-install

I think this is rather uncharitable. You don't include yum, zypper, or apt-get anywhere in there, and I vaguely recall it took me quite a bit of time to install all the prerequisites in order to get it to compile several years ago, when I first started trying to compile it. And then even more time to get it to compile with several of the config flags I wanted. And then even more time to get the docs to compile. And now, after all of that, when I try your code it still doesn't work: $(pg_config ) seems not to quote things the way configure wants them quoted, or something. And the package from which I was using pg_config uses more config options than I was set up for when compiling myself, so I either have to manually remove the flags or find more dependencies (pam, xslt, ossp-uuid, tcl, tcl-dev, and counting). This is not very fun, and I didn't even need to get bureaucratic approval to do any of this, like a lot of people do. And then when I try to install it, it tries to overwrite some of the files initially installed by the package manager in /usr/lib. That doesn't seem good. And how do I, as a hypothetical package-manager user, start this puppy up? Where is pg_ctlcluster?
How does one do pg_upgrade between a package-controlled data directory and this new binary? And then, when I find a bug, how do I know it is a bug and not me doing something wrong in the build process, getting the wrong .so loaded with the wrong binary or something like that?

and yet you think you have enough time to provide more than a "looks like it's working" report to the developers?

If it isn't working, reports of "it isn't working" with error messages are pretty helpful and don't take a whole lot of time. If it is working, reports of that probably aren't terribly helpful without putting a lot more work into them. But people might be willing to put a lot of work into, say, performance regression testing, if that is their area of expertise and they don't also have to put a lot of work into getting the software to compile in the first place, which is not their area.

Now, I don't see a lot of evidence of beta testing from the public (i.e. unfamiliar names) on the -hackers and -bugs lists. But a lot of hackers report things that a client or someone on IRC reported to them, so I'm willing to believe that a lot of useful beta testing does go on, even though I don't directly see it; and if there were alpha releases, why wouldn't that extend to them?

(NB the sed for the pg_config line will probably need work; it looks like it should work on the two types of system I have here, but I have to admit I changed the config line manually when I built it)

Right, and are the people who use apt-get to install everything likely to know how to do that work?

Cheers, Jeff
Re: [HACKERS] PoC: Partial sort
On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson andr...@proxel.se wrote: Are you planning to work on this patch for 9.6? FWIW I hope so. It's a nice patch. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [CORE] Restore-reliability mode
Joshua D. Drake j...@commandprompt.com wrote: On 06/06/2015 07:14 PM, Peter Geoghegan wrote: On Sat, Jun 6, 2015 at 7:07 PM, Robert Haas robertmh...@gmail.com wrote: Perhaps we're honoring this more in the breach than in the observance, but I'm not making up what Tom has said about this: http://www.postgresql.org/message-id/27310.1251410...@sss.pgh.pa.us

That's 9.0 release discussion:

| I think that the traditional criterion is that we don't release beta1
| as long as there are any known issues that might force an initdb.
| We were successful in avoiding a post-beta initdb this time, although
| IIRC the majority of release cycles have had one --- so maybe you
| could argue that that's not so important. It would certainly be
| less important if we had working pg_migrator functionality to ease
| the pain of going from beta to final.

http://www.postgresql.org/message-id/19174.1299782...@sss.pgh.pa.us

That's 9.1 release discussion:

| Historically we've declared it beta once we think we are done with
| initdb-forcing problems.
| In any case, the existence of pg_upgrade means that "might we need
| another initdb?" is not as strong a consideration as it once was, so
| I'm not sure if we should still use that as a criterion. I don't know
| quite what "ready for beta" should mean otherwise, though.

http://www.postgresql.org/message-id/3413.1301154...@sss.pgh.pa.us

Also 9.1, where it is listed as one criterion:

| * No open issues that are expected to result in a catversion bump.
|   (With pg_upgrade, this is not as critical as it used to be, but
|   I still think catalog stability is a good indicator of a release's
|   maturity)

http://www.postgresql.org/message-id/3261.1401915...@sss.pgh.pa.us

Here we jump to 9.4 discussion:

| Agreed. Additionally I also agree with Stefan that the price of an initdb
| during beta isn't that high these days.
|
| Yeah, if nothing else it gives testers another opportunity to exercise
| pg_upgrade ;-).
| The policy about post-beta1 initdb is "avoid if practical",
| not "avoid at all costs".

So I think these examples show that the policy has shifted from a pretty strong requirement to "it's probably nice if" status, with some benefits seen in pg_upgrade testing from actually having a bump. Of course, not doing a catversion bump after beta1 doesn't necessarily have much value in and of itself. *Promising* not to do a catversion bump, and then usually keeping that promise, definitely has a certain value, but clearly we are incapable of that. As someone who was able to bring up a new production application on 8.2 because it was all redundant data and not yet mission-critical, I appreciate that in very rare circumstances that combination could have benefit. But really, how often are people in that position?

It seems to me that a cat bump during alpha or beta should be absolutely fine and reservedly fine, respectively. Where we should absolutely not cat bump, unless there is absolutely no other choice, is during an RC.

+1 on all of that. And for a while now we've been talking about an alpha test release, right?

-- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [Proposal] More Vacuum Statistics
Hi, On 06/05/15 14:10, Naoya Anzai wrote: Thank you for the quick feedback, and I'm sorry for the slow response. All of your opinions were very helpful to me. I have confirmed Greg's "Timing events" idea. http://www.postgresql.org/message-id/509300f7.5000...@2ndquadrant.com Greg said at first, "Parsing log files for commonly needed performance data is no fun." Yes, I completely agree with him. That looks like a nice idea, but I don't know why it has not been committed yet. Does anybody know?

Most likely lack of time, I guess.

I have reworked my idea since I heard the other hackers' opinions: a pg_stat_vacuum view. I understand it is not good to simply add more counters to pg_stat_*_tables. For now, I'd like to suggest an extension which can report vacuum statistics, like pg_stat_statements.

I don't see how you want to collect the necessary information from an extension. pg_stat_statements gets most of its stats from the BufferUsage structure, but vacuum keeps all this internal, AFAIK. So it'd be necessary to make this somehow public - either by creating something like BufferUsage with all the vacuum stats, or perhaps a set of callbacks (either a single log_relation_vacuum, or different callbacks for tables and indexes). IMHO the callbacks are a better idea - for example because they naturally handle database-wide vacuum. The global structure makes this difficult, because you'd only see data for all the vacuumed objects together (or it'd have to track per-object stats internally, somehow).

VACUUM is one of the most important features in PostgreSQL, but a dedicated view for vacuum does not exist. Don't you think that is inconvenient? At least, I am disgruntled that we need to parse pg_log to tune VACUUM.

+1

My first design of the pg_stat_vacuum view is as follows. (There are two views.)
pg_stat_vacuum_table
---
dbid
schemaname
relid
relname
elapsed
page_removed
page_remain
page_skipped
tuple_removed
tuple_remain
tuple_notremovable
buffer_hit
buffer_miss
buffer_dirty
avg_read
avg_write
vm_count
vac_start
vac_end
is_autovacuum

pg_stat_vacuum_index
---
dbid
schemaname
relid
indexrelid
indexname
elapsed
num_index_tuples
num_pages
tuples_removed
pages_deleted
pages_free
is_autovacuum

At present, I think the memory design of pg_stat_statements can be adapted for this feature. And I think this module needs to provide the following parameters, like pg_stat_statements.

I'm not really sure about this. Firstly, the very first response from TL in this thread was that adding per-table counters is not a particularly good idea, as it'll bloat the statistics files. It's true you're not adding the data into the main stats files, but you effectively establish a new 'vertical partition' with one record per table/index. It might be worth the overhead, if it really brings useful functionality (especially if it's an opt-in feature, like pg_stat_statements).

Secondly, the main issue of this design IMHO is that it only tracks the very last vacuum run (or do I understand it wrong?). That means even if you snapshot the pg_stat_vacuum views, you'll not know how many vacuums executed in between (and the more frequently you snapshot, the greater the overhead). The other stats counters have the same issue, but the snapshotting works a bit better because the counters are cumulative (so you can easily do deltas etc.). But that's not the case here - certainly not with the timestamps, for example.

I don't think the vacuum start/end timestamps are particularly interesting, TBH - we already have them in pg_stat_all_tables anyway, including vacuum_count etc. So I'd propose dropping the timestamps, possibly replacing them with a single 'elapsed time', and making all the counters cumulative (so that you can do snapshots and deltas).
I'm also wondering whether this should track the vacuum costs (because that determines how aggressive the vacuum is, and how much work it will do in a particular time), whether it was an anti-wraparound vacuum, whether an ANALYZE was also performed, whether the autovacuum was interrupted because of user activity, etc.

pg_stat_vacuum.max(integer)
pg_stat_vacuum.save(boolean)
pg_stat_vacuum.excluded_dbnames(text)
pg_stat_vacuum.excluded_schemas(text)
pg_stat_vacuum.min_duration(integer)
... and so on.

To implement this feature, I have to collect vacuum stats in each lazy_vacuum_* function, and I need to embed a hook point where needed (probably the last point of lazy_vacuum_rel). Do you hesitate to add the hook only for this function?

Aha! So you plan to use the callbacks.

A similar feature is already provided by the pg_statsinfo package. But it is a full-stack package for PG stats, and introducing it requires redesigning pg_log and designing a repository database. And it is not a core extension for PostgreSQL. (I don't intend to disparage pg_statsinfo; I think this package is a very convenient tool.) Everyone will be able to do more easily tuning of
Re: [HACKERS] checkpointer continuous flushing
Hello Andres,

They pretty much can't if you flush things frequently. That's why I think this won't be acceptable without the sorting in the checkpointer.

* VERSION 2 "WORK IN PROGRESS"

The implementation is more a proof-of-concept for getting feedback than clean code. What it does:

- as in version 1: simplified asynchronous flush based on Andres Freund's patch, with sync_file_range/posix_fadvise used to hint the OS that the buffer must be sent to disk now.

- added: checkpoint buffer sorting based on a 2007 patch by Takahiro Itagaki, but with a smaller, static buffer allocated once. Also, sorting is done in chunks in the current version.

- also added: sync/advise calls are now merged if possible, so fewer calls are used, especially when buffers are sorted, but also if there are few files.

* PERFORMANCE TESTS

Impact on "pgbench -M prepared -N -P 1" at scale 10 (simple-update pgbench with mostly-write activity), with checkpoint_completion_target=0.8 and shared_buffers=1GB. Contrary to v1, I have not tested bgwriter flushing, as its impact in the first round was close to nought. This does not mean that particular loads may not benefit from, or be harmed by, flushing from the bgwriter.

- 100 tps throttled, max 100 ms latency, over 6400 seconds with checkpoint_timeout=30s

  flush | sort | late transactions
    off |  off | 6.0 %
    off |   on | 6.1 %
     on |  off | 0.4 %
     on |   on | 0.4 % (93% improvement)

- 100 tps throttled, max 100 ms latency, over 4000 seconds with checkpoint_timeout=10mn

  flush | sort | late transactions
    off |  off | 1.5 %
    off |   on | 0.6 % (?!)
     on |  off | 0.8 %
     on |   on | 0.6 % (60% improvement)

- 150 tps throttled, max 100 ms latency, over 19600 seconds (5.5 hours) with checkpoint_timeout=30s

  flush | sort | late transactions
    off |  off | 8.5 %
    off |   on | 8.1 %
     on |  off | 0.5 %
     on |   on | 0.4 % (95% improvement)

- full-speed pgbench over 6400 seconds with checkpoint_timeout=30s

  flush | sort | tps performance over per-second data
    off |  off | 676 +- 230
    off |   on | 683 +- 213
     on |  off | 712 +- 130
     on |   on | 725 +- 116 (7.2% avg / 50% stddev improvements)

- full-speed pgbench over 4000 seconds with checkpoint_timeout=10mn

  flush | sort | tps performance over per-second data
    off |  off | 885 +- 188
    off |   on | 940 +- 120 (6%/36%!)
     on |  off | 778 +- 245 (hmmm... not very consistent?)
     on |   on | 927 +- 108 (4.5% avg / 43% stddev improvements)

- full-speed pgbench -j2 -c4 over 6400 seconds with checkpoint_timeout=30s

  flush | sort | tps performance over per-second data
    off |  off | 2012 +- 747
    off |   on | 2086 +- 708
     on |  off | 2099 +- 459
     on |   on | 2114 +- 422 (5% avg / 44% stddev improvements)

* CONCLUSION

For all these HDD tests, when both options are activated the tps performance is improved, the latency is reduced, and the performance is more stable (smaller standard deviation). Overall, the effects of the two options are, not surprisingly, quite (with exceptions) orthogonal:

- latency is essentially improved (60 to 95% reduction) by flushing
- throughput is improved (4 to 7% better) thanks to sorting

In detail, some loads may benefit more from only one option activated. Also, on SSDs both options would probably have limited benefit.

Usual caveat: these are only benchmarks on one host at a particular time and location, which may or may not be reproducible, nor representative as such of any other load. The good news is that all these tests tell the same thing.

* LOOKING FOR THOUGHTS

- The bgwriter flushing option seems ineffective; should it be removed from the patch?

- Move fsync as early as possible, as suggested by Andres Freund?
In these tests, when the flush option is activated, the fsync duration at the end of the checkpoint is small: of more than 5525 checkpoint fsyncs, 0.5% are above 1 second when flush is on, but the figure rises to 24% when it is off. This suggests that doing the fsync as soon as possible would probably have no significant effect on these tests. My opinion is that this should be left out for the nonce.

- Take into account tablespaces, as pointed out by Andres Freund?

The issue is that if writes are sorted, they are no longer distributed randomly over tablespaces, inducing lower performance on such systems. How to do it: while scanning shared_buffers, count dirty buffers for each tablespace. Then start as many threads as there are tablespaces, each one doing its own independent throttling for one tablespace? For some obscure reason there are 2 tablespaces by default (pg_global and pg_default), so that would mean at least 2 threads. Alternatively, maybe it can be done from one thread, but it would probably involve some strange hocus-pocus to switch frequently between tablespaces.

-- Fabien.

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 1da7dfb..2e6bb10 100644
--- a/doc/src/sgml/config.sgml
+++
Re: [HACKERS] PoC: Partial sort
On 09/15/2014 01:58 PM, Alexander Korotkov wrote: On Sun, Sep 14, 2014 at 9:32 AM, Peter Geoghegan p...@heroku.com mailto:p...@heroku.com wrote:

I think we might be better off if a tuplesort function was called shortly after tuplesort_begin_heap() is called. How top-n heap sorts work is something that largely lives in tuplesort's head. Today, we call tuplesort_set_bound() to hint to tuplesort "by the way, this is a sort to which a top-n heap sort is applicable". I think that with this patch, we should then hint (where applicable) "by the way, you won't actually be required to sort those first n indexed attributes; rather, you can expect to scan those in logical order. You may work the rest out yourself, and may be clever about exploiting the sortedness of the first few columns". The idea of managing a bunch of tiny sorts from within ExecSort(), and calling the new function tuplesort_reset(), seems questionable. tuplesortstate is supposed to be private/opaque to nodeSort.c, and the current design strains that. I'd like to keep nodeSort.c simple. I think it's pretty clear that the guts of this do not belong within ExecSort(), in any case. Also, the additions there should be much better commented, wherever they finally end up.

As I understand it, you propose to encapsulate the partial sort algorithm into tuplesort. However, in this case we anyway need some significant change to its interface: let tuplesort decide when it's able to return a tuple. Otherwise, we would miss a significant part of the LIMIT clause optimization. tuplesort_set_bound() can't solve all the cases. There could be other planner nodes between the partial sort and the LIMIT.

Hi,

Are you planning to work on this patch for 9.6? I generally agree with Peter that the changes to the sorting probably belong in the tuplesort code rather than in the executor. This way it should also be theoretically possible to support mark/restore.
Andreas
[HACKERS] skipping pg_log in basebackup (was Re: pg_basebackup and pg_stat_tmp directory)
Hi. This is a followup to a 2014-02 discussion that led to pg_stats_temp being excluded from pg_basebackup. At the time, it was discussed to exclude pg_log as well, but nothing eventually came of that. Recently, one of our customers has had a basebackup fail because pg_log contained files that were over 8GB:

FATAL: archive member pg_log/postgresql-20150119.log too large for tar format

I think pg_basebackup should also skip pg_log entries, as it does for pg_stats_temp and pg_replslot, etc. I've attached a patch along those lines for discussion.

-- Abhijit

P.S. Aren't we leaking statrelpath?

From 8db162c1385b1cdd2b0e666975b76aa814f09f35 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Mon, 27 Apr 2015 12:58:52 +0530
Subject: Skip files in pg_log during basebackup

---
 src/backend/replication/basebackup.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 1e86e4c..cc75a03 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -27,6 +27,7 @@
 #include "nodes/pg_list.h"
 #include "pgtar.h"
 #include "pgstat.h"
+#include "postmaster/syslogger.h"
 #include "replication/basebackup.h"
 #include "replication/walsender.h"
 #include "replication/walsender_private.h"
@@ -72,6 +73,9 @@ static bool backup_started_in_recovery = false;
 /* Relative path of temporary statistics directory */
 static char *statrelpath = NULL;
 
+/* Relative path to log directory */
+static char logpath[MAXPGPATH];
+
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -157,6 +161,18 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 	else
 		statrelpath = pgstat_stat_directory;
 
+	/*
+	 * Do the same for the log_directory.
+	 */
+	if (is_absolute_path(Log_directory) &&
+		path_is_prefix_of_path(DataDir, Log_directory))
+		snprintf(logpath, MAXPGPATH, "./%s", Log_directory + datadirpathlen + 1);
+	else if (strncmp(Log_directory, "./", 2) != 0)
+		snprintf(logpath, MAXPGPATH, "./%s", Log_directory);
+	else
+		strncpy(logpath, Log_directory, MAXPGPATH);
+
 	/* Add a node for the base directory at the end */
 	ti = palloc0(sizeof(tablespaceinfo));
 	ti->size = opt->progress ? sendDir(".", 1, true, tablespaces, true) : -1;
@@ -965,6 +981,19 @@ sendDir(char *path, int basepathlen, bool sizeonly, List *tablespaces,
 		}
 
 		/*
+		 * We can skip pg_log (or whatever log_directory is set to, if
+		 * it's under the data directory), but include it as an empty
+		 * directory anyway, so we get permissions right.
+		 */
+		if (strcmp(pathbuf, logpath) == 0)
+		{
+			if (!sizeonly)
+				_tarWriteHeader(pathbuf + basepathlen + 1, NULL, statbuf);
+			size += 512;
+			continue;
+		}
+
+		/*
 		 * Skip pg_replslot, not useful to copy. But include it as an empty
 		 * directory anyway, so we get permissions right.
 		 */
-- 
1.9.1
Re: [HACKERS] skipping pg_log in basebackup
At 2015-06-08 13:09:02 +0900, michael.paqu...@gmail.com wrote: It seems to be that: http://www.postgresql.org/message-id/cahgqgwh0okz6ckpjkcwojga3ejwffm1enrmro3dkdoteaai...@mail.gmail.com (Note that this is about calculating the wrong size, whereas my bug is about the file being too large to be written to a tar archive.) And a recent discussion about that is this one: http://www.postgresql.org/message-id/82897A1301080E4B8E461DDAA0FFCF142A1B2660@SYD1216 Oh, sorry, I somehow did miss that thread entirely. Thanks for the pointer. (I've added Vaishnavi to the Cc: list here.) I'm not convinced that we need a mechanism to let people exclude the torrent files they've stored in their data directory, but if we have to do it, the idea of having a GUC setting rather than specifying excludes on the basebackup command line each time does have a certain appeal. Anyone else interested in doing it that way? -- Abhijit
[HACKERS] Collection of memory leaks for ECPG driver
Hi all, Please find attached a patch aimed at fixing a couple of memory leaks in the ECPG driver. Coverity (and sometimes I, after reading some other code paths) found them. Regards, -- Michael

diff --git a/src/interfaces/ecpg/compatlib/informix.c b/src/interfaces/ecpg/compatlib/informix.c
index d6de3ea..c1e3dfb 100644
--- a/src/interfaces/ecpg/compatlib/informix.c
+++ b/src/interfaces/ecpg/compatlib/informix.c
@@ -672,6 +672,7 @@ intoasc(interval * i, char *str)
 	if (!str)
 		return -errno;
 
+	free(str);
 	return 0;
 }
diff --git a/src/interfaces/ecpg/ecpglib/connect.c b/src/interfaces/ecpg/ecpglib/connect.c
index 55c5680..12f445d 100644
--- a/src/interfaces/ecpg/ecpglib/connect.c
+++ b/src/interfaces/ecpg/ecpglib/connect.c
@@ -298,7 +298,8 @@ ECPGconnect(int lineno, int c, const char *name, const char *user, const char *p
 	envname = getenv("PG_DBPATH");
 	if (envname)
 	{
-		ecpg_free(dbname);
+		if (dbname)
+			ecpg_free(dbname);
 		dbname = ecpg_strdup(envname, lineno);
 	}
 
@@ -314,14 +315,19 @@ ECPGconnect(int lineno, int c, const char *name, const char *user, const char *p
 	/* check if the identifier is unique */
 	if (ecpg_get_connection(connection_name))
 	{
-		ecpg_free(dbname);
+		if (dbname)
+			ecpg_free(dbname);
 		ecpg_log("ECPGconnect: connection identifier %s is already in use\n",
 				 connection_name);
 		return false;
 	}
 
 	if ((this = (struct connection *) ecpg_alloc(sizeof(struct connection), lineno)) == NULL)
+	{
+		if (dbname)
+			ecpg_free(dbname);
 		return false;
+	}
 
 	if (dbname != NULL)
 	{
diff --git a/src/interfaces/ecpg/preproc/descriptor.c b/src/interfaces/ecpg/preproc/descriptor.c
index 053a7af..ebd95d3 100644
--- a/src/interfaces/ecpg/preproc/descriptor.c
+++ b/src/interfaces/ecpg/preproc/descriptor.c
@@ -175,6 +175,7 @@ output_get_descr(char *desc_name, char *index)
 	for (results = assignments; results != NULL; results = results->next)
 	{
 		const struct variable *v = find_variable(results->variable);
+		char	   *str_zero = mm_strdup("0");
 
 		switch (results->value)
 		{
@@ -188,7 +189,8 @@ output_get_descr(char *desc_name, char *index)
 				break;
 		}
 		fprintf(yyout, "%s,", get_dtype(results->value));
-		ECPGdump_a_type(yyout, v->name, v->type, v->brace_level, NULL, NULL, -1, NULL, NULL, mm_strdup("0"), NULL, NULL);
+		ECPGdump_a_type(yyout, v->name, v->type, v->brace_level, NULL, NULL, -1, NULL, NULL, str_zero, NULL, NULL);
+		free(str_zero);
 	}
 	drop_assignments();
 	fputs("ECPGd_EODT);\n", yyout);
@@ -292,8 +294,12 @@ output_set_descr(char *desc_name, char *index)
 			case ECPGd_indicator:
 			case ECPGd_length:
 			case ECPGd_type:
-				fprintf(yyout, "%s,", get_dtype(results->value));
-				ECPGdump_a_type(yyout, v->name, v->type, v->brace_level, NULL, NULL, -1, NULL, NULL, mm_strdup("0"), NULL, NULL);
+				{
+					char	   *str_zero = mm_strdup("0");
+
+					fprintf(yyout, "%s,", get_dtype(results->value));
+					ECPGdump_a_type(yyout, v->name, v->type, v->brace_level, NULL, NULL, -1, NULL, NULL, str_zero, NULL, NULL);
+					free(str_zero);
+				}
 				break;
 
 			default:
diff --git a/src/interfaces/ecpg/preproc/pgc.l b/src/interfaces/ecpg/preproc/pgc.l
index c70f298..0453409 100644
--- a/src/interfaces/ecpg/preproc/pgc.l
+++ b/src/interfaces/ecpg/preproc/pgc.l
@@ -27,7 +27,7 @@ extern YYSTYPE yylval;
 
 static int	xcdepth = 0;	/* depth of nesting in slash-star comments */
-static char *dolqstart;		/* current $foo$ quote start string */
+static char *dolqstart = NULL;	/* current $foo$ quote start string */
 
 static YY_BUFFER_STATE scanbufhandle;
 static char *scanbuf;
@@ -550,6 +550,8 @@ cppline		{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})(.*\\{space})*.*
 }
 <SQL>{dolqdelim} {
 				token_start = yytext;
+				if (dolqstart)
+					free(dolqstart);
 				dolqstart = mm_strdup(yytext);
 				BEGIN(xdolq);
 				startlit();
diff --git a/src/interfaces/ecpg/preproc/variable.c b/src/interfaces/ecpg/preproc/variable.c
index 2ad7b5d..bb4b664 100644
--- a/src/interfaces/ecpg/preproc/variable.c
+++ b/src/interfaces/ecpg/preproc/variable.c
@@ -437,11 +437,13 @@ remove_variable_from_list(struct arguments ** list, struct variable * var)
 void
 dump_variables(struct arguments * list, int mode)
 {
-	char	   *str_zero = mm_strdup("0");
+	char	   *str_zero;
 
 	if (list == NULL)
 		return;
 
+	str_zero = mm_strdup("0");
+
 	/*
 	 * The list is build up from the beginning so lets first dump the end of
 	 * the list:
Re: [HACKERS] Restore-reliability mode
On Sat, Jun 6, 2015 at 12:58 PM, Noah Misch n...@leadboat.com wrote: - Call VALGRIND_MAKE_MEM_NOACCESS() on a shared buffer when its local pin count falls to zero. Under CLOBBER_FREED_MEMORY, wipe a shared buffer when its global pin count falls to zero. Did a patch for this ever materialize? -- Peter Geoghegan
Re: [HACKERS] skipping pg_log in basebackup (was Re: pg_basebackup and pg_stat_tmp directory)
On Mon, Jun 8, 2015 at 12:42 PM, Abhijit Menon-Sen a...@2ndquadrant.com wrote: This is a followup to a 2014-02 discussion that led to pg_stats_temp being excluded from pg_basebackup. At the time, it was discussed to exclude pg_log as well, but nothing eventually came of that. It seems to be that: http://www.postgresql.org/message-id/cahgqgwh0okz6ckpjkcwojga3ejwffm1enrmro3dkdoteaai...@mail.gmail.com Recently, one of our customers has had a basebackup fail because pg_log contained files that were 8GB: FATAL: archive member pg_log/postgresql-20150119.log too large for tar format I think pg_basebackup should also skip pg_log entries, as it does for pg_stats_temp and pg_replslot, etc. I've attached a patch along those lines for discussion. And a recent discussion about that is this one: http://www.postgresql.org/message-id/82897A1301080E4B8E461DDAA0FFCF142A1B2660@SYD1216 Bringing the point: some users may want to keep log files in a base backup, and some users may want to skip some of them, and not only pg_log. Hence we may want more flexibility than what is proposed here. Regards, -- Michael
Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file
On Mon, Jun 8, 2015 at 5:52 AM, Andrew Dunstan and...@dunslane.net wrote: On 06/05/2015 11:08 PM, Amit Kapila wrote: Okay, I think I can understand why you want to be cautious about having a different check for this path, but in that case there is a chance that recovery might fail when it tries to create a symlink for that file. Shouldn't we then call this new function only when we are trying to restore from a tablespace_map file, and also, is there a need for the ifdef S_ISLINK in remove_tablespace_link? Shall I send an updated patch along these lines, or do you want to proceed with the v3 version?

The point seems to me that we should not be removing anything that's not an empty directory or symlink, and that nothing else has any business being in pg_tblspc. If we encounter such a thing whose name conflicts with the name of a tablespace link we wish to create, rather than quietly blowing it away we should complain loudly and error out, making it the user's responsibility to clean up their mess. Am I missing something?

What if it is just a flat file with the same name as the tablespace link - why would we want to raise an error for that case? I think right now we just don't do anything with that file (the unlink will fail with ENOENT and be ignored; at least that's the way it currently behaves on Windows) and create a separate symlink with the same name, which seems okay to me, whereas with the change you propose it will raise an error. Do you see any reason for doing so?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Memory leak with XLogFileCopy since de768844 (WAL file with .partial)
On Fri, Jun 5, 2015 at 10:45 PM, Fujii Masao wrote: Why don't we call InstallXLogFileSegment() at the end of XLogFileCopy()? If we do that, the risk of the memory leak you're worried about will disappear entirely. Yes, that looks fine: XLogFileCopy() would copy to a temporary file, then install it at the end. Updated patch attached. -- Michael

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 666fa37..c03f70e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -807,7 +807,7 @@ static bool XLogCheckpointNeeded(XLogSegNo new_segno);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
 static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 					   bool find_free, XLogSegNo max_segno,
-					   bool use_lock);
+					   bool use_lock, int elevel);
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 					   int source, bool notexistOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source);
@@ -3012,7 +3012,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	max_segno = logsegno + CheckPointSegments;
 	if (!InstallXLogFileSegment(&installed_segno, tmppath,
 								*use_existent, max_segno,
-								use_lock))
+								use_lock, LOG))
 	{
 		/*
 		 * No need for any more future segments, or InstallXLogFileSegment()
@@ -3041,18 +3041,16 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 /*
  * Copy a WAL segment file in pg_xlog directory.
  *
- * dstfname		destination filename
  * srcfname		source filename
  * upto			how much of the source file to copy? (the rest is filled with
  *				zeros)
+ * segno		identify segment to install.
  *
- * If dstfname is not given, the file is created with a temporary filename,
- * which is returned. Both filenames are relative to the pg_xlog directory.
- *
- * NB: Any existing file with the same name will be overwritten!
+ * The file is first copied with a temporary filename, and then installed as
+ * a newly-created segment.
  */
-static char *
-XLogFileCopy(char *dstfname, char *srcfname, int upto)
+static void
+XLogFileCopy(char *srcfname, int upto, XLogSegNo segno)
 {
 	char		srcpath[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
@@ -3150,25 +3148,9 @@ XLogFileCopy(char *dstfname, char *srcfname, int upto)
 	CloseTransientFile(srcfd);

-	/*
-	 * Now move the segment into place with its final name. (Or just return
-	 * the path to the file we created, if the caller wants to handle the rest
-	 * on its own.)
-	 */
-	if (dstfname)
-	{
-		char		dstpath[MAXPGPATH];
-
-		snprintf(dstpath, MAXPGPATH, XLOGDIR "/%s", dstfname);
-		if (rename(tmppath, dstpath) < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not rename file \"%s\" to \"%s\": %m",
-							tmppath, dstpath)));
-		return NULL;
-	}
-	else
-		return pstrdup(tmppath);
+	/* install the new file */
+	(void) InstallXLogFileSegment(&segno, tmppath, false,
+								  0, false, ERROR);
 }

 /*
@@ -3195,6 +3177,8 @@ XLogFileCopy(char *dstfname, char *srcfname, int upto)
  * place. This should be TRUE except during bootstrap log creation. The
  * caller must *not* hold the lock at call.
  *
+ * elevel: log level used by this routine.
+ *
  * Returns TRUE if the file was installed successfully. FALSE indicates that
  * max_segno limit was exceeded, or an error occurred while renaming the
  * file into place.
@@ -3202,7 +3186,7 @@ XLogFileCopy(char *dstfname, char *srcfname, int upto)
 static bool
 InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 					   bool find_free, XLogSegNo max_segno,
-					   bool use_lock)
+					   bool use_lock, int elevel)
 {
 	char		path[MAXPGPATH];
 	struct stat stat_buf;
@@ -3247,7 +3231,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 		{
 			if (use_lock)
 				LWLockRelease(ControlFileLock);
-			ereport(LOG,
+			ereport(elevel,
 					(errcode_for_file_access(),
 					 errmsg("could not link file \"%s\" to \"%s\" (initialization of log file): %m",
 							tmppath, path)));
@@ -3259,7 +3243,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 		{
 			if (use_lock)
 				LWLockRelease(ControlFileLock);
-			ereport(LOG,
+			ereport(elevel,
 					(errcode_for_file_access(),
 					 errmsg("could not rename file \"%s\" to \"%s\" (initialization of log file): %m",
 							tmppath, path)));
@@ -3748,7 +3732,7 @@ RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 	if (endlogSegNo <= recycleSegNo &&
 		lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
 		InstallXLogFileSegment(&endlogSegNo, path,
-							   true, recycleSegNo, true))
+							   true, recycleSegNo, true, LOG))
 	{
 		ereport(DEBUG2,
 				(errmsg("recycled transaction log file \"%s\"",
@@ -5227,8 +5211,6 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 */
 	if (endLogSegNo == startLogSegNo)
 	{
-		char	   *tmpfname;
-
 		XLogFileName(xlogfname, endTLI, endLogSegNo);

 		/*
@@ -5238,9 +5220,7 @@
Re: [HACKERS] pg_stat_archiver issue with aborted archiver
On Sun, Jun 7, 2015 at 1:11 AM, Julien Rouhaud julien.rouh...@dalibo.com wrote: I just noticed that if the archiver aborts (for instance if the archive_command exited with a return code of 127), pg_stat_archiver won't report those failed attempts. This happens with both the 9.4 and 9.5 branches. Please find attached a patch that fixes this issue, based on current HEAD. The current code seems right to me. When the archive command dies because of a signal (exit code higher than 128), the server should fail immediately with FATAL and should not do any extra processing. It will also try again to archive the same segment file after restart. When trying again, if the command fails once more but this time the failure is not caused by a signal, it will be reported to pg_stat_archiver. -- Michael
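The distinction Michael draws rests on the shell's exit-status convention: a command killed by signal N is reported as 128 + N, while codes up to 127 (including 127, "command not found") are ordinary failures. A small illustrative helper, with my naming and not the actual pgarch.c logic:

```python
def classify_archive_command_exit(rc):
    """Classify an archive_command exit code the way the discussion
    describes: signal deaths are treated as FATAL and bypass the stats
    counter; ordinary failures are counted in pg_stat_archiver."""
    if rc == 0:
        return "success: archived_count updated in pg_stat_archiver"
    if rc > 128:
        # shell convention: 128 + N means the command died from signal N
        return "killed by signal %d: FATAL, no stats update" % (rc - 128)
    return "failure: failed_count updated in pg_stat_archiver"
```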
Re: [HACKERS] Reducing tuple overhead
On Sun, Jun 7, 2015 at 3:02 PM, Simon Riggs si...@2ndquadrant.com wrote: On 23 April 2015 at 17:24, Andres Freund and...@anarazel.de wrote: It's hard to see how to save space there without reference to a specific use case. I see different solutions depending upon whether we assume a low number of transactions or a high number of transactions. I have tried to check theoretically how much difference such a change could give us. Assuming BLCKSZ = 8192 bytes, page header = 24 bytes, each line pointer = 4 bytes, and an average tuple of 150 bytes, roughly 53 tuples can be accommodated in one page. Each of these tuples contains 12 bytes of transaction information (xmin, xmax, cid/combocid). Now, considering that in an average workload 4 transactions operate on a page at the same time (I think for a workload like pgbench tpc-b it shouldn't be more, otherwise it would have been visible in perf reports), 4 additional tuples [1] could be accommodated on a page, which is approximately 7% savings in space (which in turn means that much less I/O). This gain could vary based on tuple size, the number of transactions that can operate on a page, page size, etc. Some additional benefits that I could see from such a change: 1. We wouldn't need to traverse the whole page while freezing, so there should be some savings in the freeze operation as well. 2. We might also be able to reduce WAL if we can avoid some transaction-related info in the cases where it is currently stored (update/delete). 3. Another gain could come if we want to add transaction information to the index segment as well, because if such information can be stored at the page level, there won't be much impact in adding it there, which would help us avoid multiple passes of vacuum (heap and index could be vacuumed separately, which will definitely help with I/O and bloat reduction).
[1] Calculation for 4 additional tuples: (saving by removing transaction info from tuples - new space consumed by transaction info) / new tuple size = (53 * 12 - 12 * 4) / (150 - 12) = 4.26 With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
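Amit's arithmetic above checks out; spelled out under the same assumptions as in the mail (150-byte average tuples, 12 bytes of transaction info per tuple, 4 concurrent transactions per page):

```python
BLCKSZ = 8192
PAGE_HEADER = 24
LINE_POINTER = 4
AVG_TUPLE = 150          # assumed average tuple size from the mail
XACT_INFO = 12           # xmin, xmax, cid/combocid
XACTS_PER_PAGE = 4       # assumed concurrent transactions per page

tuples_per_page = (BLCKSZ - PAGE_HEADER) // (AVG_TUPLE + LINE_POINTER)
saved_bytes = tuples_per_page * XACT_INFO - XACTS_PER_PAGE * XACT_INFO
extra_tuples = saved_bytes / (AVG_TUPLE - XACT_INFO)

print(tuples_per_page)         # 53
print(round(extra_tuples, 2))  # 4.26 -> about 4 extra tuples, ~7% of the page
```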
Re: [HACKERS] Reducing tuple overhead
On 23 April 2015 at 17:24, Andres Freund and...@anarazel.de wrote: Split into a new thread, the other one is already growing fast enough. This discussion started at http://archives.postgresql.org/message-id/55391469.5010506%40iki.fi On April 23, 2015 6:48:57 PM GMT+03:00, Heikki Linnakangas hlinn...@iki.fi wrote: Stop right there. You need to reserve enough space on the page to store an xmax for *every* tuple on the page. Because if you don't, what are you going to do when every tuple on the page is deleted by a different transaction? Even if you store the xmax somewhere other than the page header, you need to reserve the same amount of space for them, so it doesn't help at all. Depends on how you do it and what you optimize for (disk space, runtime, code complexity...). You can e.g. apply a somewhat similar trick to xmin/xmax as was done for cmin/cmax; only that the data structure needs to be persistent. In fact, we already have a combocid-like structure for xids that's persistent: multixacts. We could just have one xid saved that's either xmin or xmax (indicated by flag bits) or a multixact. When a tuple is updated/deleted whose xmin is still required, we could replace the former xmin with a multixact; otherwise just change the flag so that it's now an xmax without an xmin. To check visibility, if the xid is a multixact we'd just have to look up the relevant member for the actual xmin and xmax. To avoid excessive overhead when a tuple is repeatedly updated within one session, we could store some of the data in the combocid entry that we need anyway in that case. Whether that's feasible complexity-wise is debatable, but it's certainly possible. I do wonder what, in realistic cases, is actually the bigger contributor to the overhead: the tuple header, or the padding we liberally add in many cases... It's hard to see how to save space there without reference to a specific use case.
I see different solutions depending upon whether we assume a low number of transactions or a high number of transactions. A case we could optimize for is insert-mostly tables. But in that case if you get rid of the xmax then you still have to freeze the tuples later. I would have thought a better optimization would be to use the xmax for the xid epoch by default, so that such rows would never need freezing. Then at least we are using the xmax for something useful in a larger insert-mostly database rather than just leaving it at zero. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
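Simon's suggestion amounts to widening the transaction identifier: if the otherwise-idle xmax field held the xid epoch, the pair would form a 64-bit value that never wraps, so such rows would never need freezing. A toy model of the idea (illustrative only, not PostgreSQL's actual on-disk representation):

```python
def full_xid(epoch, xid):
    """Combine a 32-bit epoch and a 32-bit xid into one 64-bit value.
    With 64 bits, "is A older than B?" becomes plain integer comparison,
    and wraparound (hence freezing) never arises."""
    assert 0 <= epoch < 2**32 and 0 <= xid < 2**32
    return (epoch << 32) | xid

# An xid from the next epoch is unambiguously newer, even though its
# 32-bit part is numerically smaller.
assert full_xid(1, 100) > full_xid(0, 4_000_000_000)
```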
Re: [HACKERS] could not truncate directory pg_subtrans: apparent wraparound
On Sat, Jun 6, 2015 at 4:51 PM, Thomas Munro thomas.mu...@enterprisedb.com wrote: On Sat, Jun 6, 2015 at 1:25 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Thomas Munro wrote: My idea was that if I could get oldestXact == next XID in TruncateSUBTRANS, then TransactionIdToPage(oldestXact) for a value of oldestXact that happens to be immediately after a page boundary (so that xid % 2048 == 0) might give a page number that is > latest_page_number, causing SimpleLruTruncate to print that message. But I can't figure out how to get next XID == oldest XID, because vacuumdb --freeze --all consumes xids itself, so in my first attempt at this, next XID is always 3 ahead of the oldest XID when a checkpoint is run. vacuumdb starts by querying pg_database, which eats one XID. Vacuum itself only uses one XID when vac_truncate_clog() is called. This is called from vac_update_datfrozenxid(), which always happens at the end of each user-invoked VACUUM (so three times for vacuumdb if you have three databases); autovacuum does it also at the end of each run. Maybe you can get autovacuum to quit before doing it. OTOH, if the values in the pg_database entry do not change, vac_truncate_clog is not called, and thus vacuum would finish without consuming an XID. I have managed to reproduce it a few times but haven't quite found the right synchronisation hacks to make it reliable, so I'm not posting a repro script yet. I think it's a scary-sounding message but very rare and entirely harmless (unless you really have wrapped around...). The fix is probably something like: if oldest XID == next XID, then just don't call SimpleLruTruncate (truncation is deferred until the next checkpoint), or perhaps (if we can confirm this doesn't cause problems for dirty pages, or that there can't be any dirty pages before the cutoff page because of the preceding flush, as I suspect) we could use cutoffPage = TransactionIdToPage(oldestXact - 1) if oldest == next, or maybe even always.
Here's a repro script and a suggested patch. (What I said about dirty pages in parentheses above was nonsense, I was confusing this with something else.) -- Thomas Munro http://www.enterprisedb.com fix-bogus-subtrans-wraparound-error.patch Description: Binary data checkpoint-subtrans-boundary.sh Description: Bourne shell script
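The boundary condition Thomas describes can be modeled in a few lines (a simplified sketch; the real logic lives in slru.c and subtrans.c): with 2048 subtransaction slots per page, an oldest XID that equals the next XID and lands exactly on a page boundary makes the truncation cutoff point one page past the latest page that exists.

```python
SUBTRANS_XACTS_PER_PAGE = 2048  # xids per pg_subtrans page, per the mail

def xid_to_page(xid):
    return xid // SUBTRANS_XACTS_PER_PAGE

def looks_like_wraparound(oldest_xact, next_xid):
    """True when the cutoff page computed from oldestXact lies beyond the
    latest page in use, which SimpleLruTruncate reports as an apparent
    wraparound even though nothing is actually wrong."""
    latest_page = xid_to_page(next_xid - 1)
    return xid_to_page(oldest_xact) > latest_page

# oldest == next XID, exactly on a page boundary: spurious warning
assert looks_like_wraparound(4096, 4096)
# one xid earlier: no warning -- hence the suggested oldestXact - 1 fix
assert not looks_like_wraparound(4096 - 1, 4096)
```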
Re: [HACKERS] Resource Owner reassign Locks
On 2015-06-07 13:44:08 -0700, Jeff Janes wrote: I'd like to advocate for back-patching this to 9.0, 9.1, and 9.2. It has run without problems for a while now, and it can be considered a bug that systems with a very large number of objects cannot be upgraded in a reasonable time. +1 Greetings, Andres Freund
Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file
On 06/05/2015 11:08 PM, Amit Kapila wrote: On Fri, Jun 5, 2015 at 10:51 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Fri, Jun 5, 2015 at 9:57 AM, Andrew Dunstan and...@dunslane.net wrote: On 06/04/2015 11:35 PM, Amit Kapila wrote: Theoretically, I don't see much problem by changing the checks way you have done in patch, but it becomes different than what we have in destroy_tablespace_directories() and it is slightly changing the way check was originally done in create_tablespace_directories(), basically original check will try unlink if lstat returns non-zero return code. If you want to proceed with the changed checks as in v3, then may be we can modify comments on top of function remove_tablespace_symlink() which indicates that it works like destroy_tablespace_directories(). The difference is that here we're getting the list from a base backup and it seems to me the risk of having a file we don't really want to unlink is significantly greater. Okay, I think I can understand why you want to be cautious for having a different check for this path, but in that case there is a chance that recovery might fail when it will try to create a symlink for that file. Shouldn't we then try to call this new function only when we are trying to restore from tablespace_map file and also is there a need of ifdef S_ISLINK in remove_tablespace_link? Shall I send an updated patch on these lines or do you want to proceed with v3 version? The point seems to me that we should not be removing anything that's not an empty directory or symlink, and that nothing else has any business being in pg_tblspc. If we encounter such a thing whose name conflicts with the name of a tablespace link we wish to create, rather than quietly blowing it away we should complain loudly, and error out, making it the user's responsibility to clean up their mess. Am I missing something? 
cheers andrew
Re: [HACKERS] could not truncate directory pg_subtrans: apparent wraparound
On Mon, Jun 8, 2015 at 12:29 PM, Thomas Munro thomas.mu...@enterprisedb.com wrote: Here's a repro script and a suggested patch. Argh... I realised immediately after sending this that subtransaction truncation doesn't even use the oldest XID computed by vacuum, it uses GetOldestXmin (the oldest transaction that was running when any current transaction was started). So here is an even simpler repro, though the fix is the same (with different comments). -- Thomas Munro http://www.enterprisedb.com checkpoint-subtrans-boundary-v2.sh Description: Bourne shell script fix-bogus-subtrans-wraparound-error-v2.patch Description: Binary data
Re: [HACKERS] [CORE] Restore-reliability mode
Joshua D. Drake wrote: On 06/05/2015 08:07 PM, Bruce Momjian wrote: From my side, it is only recently that I got some clear answers to my questions about how it worked. I think it is very important that major features have extensive README-type documentation with them, so that the underlying principles used in the development are clear. I would define the measure of a good feature as whether another committer can read the code comments and get a good feel. A bad feature is one where committers walk away from it, saying I don't really get it and I can't read an explanation of why it does that. Tom's most significant contribution is his long descriptive comments on what the problem is that needs to be solved, the options, and the method chosen. Clarity of thought is what solves bugs. Yes, I think we should have done that early on for multixacts, and I am hopeful we will learn to do that more often when complex features are implemented, or when we identify areas that are more complex than we thought. I see this idea of the README as very useful. There are far more people like me in this community than Simon or Alvaro. I can test, I can break things, I can script up a harness, but I need to understand HOW, and the README would help allow for that. There is a src/backend/access/README.tuplock that attempts to describe multixacts. Is that not sufficient? Now that I think about it, this file hasn't been updated with the latest changes, so it's probably a bit outdated now. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Reducing tuple overhead
On Thu, Apr 30, 2015 at 6:54 AM, Robert Haas robertmh...@gmail.com wrote: The other, related problem is that the ordering operator might start to return different results than it did at index creation time. For example, consider a btree index built on a text column. Now consider 'yum update'. glibc gets updated, collation ordering of various strings changes, and now you've got tuples that are in the wrong place in the index, because when the index was built, we thought A < B, but now we think B < A. You would think the glibc maintainers might avoid such changes in minor releases, or that the Red Hat guys would avoid packaging and shipping those changes in minor releases, but you'd be wrong. I would not think that. Unicode Technical Standard #10 states: Collation order is not fixed. Over time, collation order will vary: there may be fixes needed as more information becomes available about languages; there may be new government or industry standards for the language that require changes; and finally, new characters added to the Unicode Standard will interleave with the previously-defined ones. This means that collations must be carefully versioned. Also, in the paper Modern B-Tree Techniques, by Goetz Graefe, page 238, it states: In many operating systems, appropriate functions are provided to compute a normalized key from a localized string value, date value, or time value. This functionality is used, for example, to list files in a directory as appropriate for the local language. Adding normalization for numeric data types is relatively straightforward, as is concatenation of multiple normalized values. Database code must not rely on such operating system code, however. The problem with relying on operating system support for database indexes is the update frequency. 
An operating system might update its normalization code due to an error or extension in the code or in the definition of a local sort order; it is unacceptable, however, if such an update silently renders existing large database indexes incorrect. Unfortunately, it is simply not the case that we can rely on OS collations being immutable. We have *no* contract with any C standard library concerning collation stability whatsoever. I'm surprised that we don't hear more about this kind of corruption. -- Peter Geoghegan
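The failure mode described above is easy to demonstrate without any real locale machinery: build a "btree" (here just a sorted list probed with bisection) under one comparison rule, then search it under a changed rule. The two sort keys below are stand-ins for two glibc versions, not real collations.

```python
from bisect import bisect_left

# Hypothetical before/after sort keys standing in for two collation
# versions; real collation changes are subtler, but the mechanism is
# the same: the index layout no longer matches the comparison rule.
def old_key(s):
    return s.lower()   # "old glibc": case-insensitive ordering

def new_key(s):
    return s           # "new glibc": plain byte ordering

# "index" built (sorted) under the old rule
index = sorted(["apple", "Banana", "cherry"], key=old_key)

def lookup(tree, target, key):
    """Binary search that assumes `tree` is ordered by `key` -- exactly
    the assumption a btree makes about its collation."""
    keys = [key(s) for s in tree]
    i = bisect_left(keys, key(target))
    return i < len(keys) and tree[i] == target

assert lookup(index, "Banana", old_key)       # found with build-time order
assert not lookup(index, "Banana", new_key)   # missed after the "update"
```

The row is still physically present; it has merely become unreachable through the index, which is precisely the silent corruption Peter describes.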