Re: [HACKERS] gaussian distribution pgbench
Please find attached two patches, which are a split of the patch discussed in this thread, plus a very minor improvement: a code (variable name) simplification applied directly in patch A so as to avoid a change in patch B. The cumulated patch is the same as before.

(A) Add gaussian and exponential options to pgbench \setrandom.

The patch includes SQL test files. There is no change in the *code* from previous already-reviewed submissions, so I do not think it needs another review on that account. However, I have (yet again) reworked the *documentation* (for Andres Freund and Robert Haas); in particular, both descriptions now follow the same structure (introduction, formula, intuition, rule of thumb and constraint). I have differentiated the concept and the option by putting the latter in literal tags, and added a link to the corresponding Wikipedia pages. Please bear in mind that:

1. English is not my native language.
2. This is not easy reading... this is maths, to read slowly :-)
3. Wordsmithing contributions are welcome.

I assume somehow that a user interested in gaussian and exponential distributions must know a little bit about probabilities...

(B) Add pgbench test variants with gauss and expo.

I have reworked the patch so as to avoid copy-pasting the 3 test cases, as requested by Andres Freund, so this is new, although quite simple, code. I have also added explanations in the documentation about how to interpret the decile outputs, so as to hopefully address Robert Haas's comments.
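To make (A) concrete, a custom script using the new options might look like this (a sketch based on my reading of the patch's documentation; the option spellings are from the patch under discussion and may change before commit):

```
\set naccounts 100000 * :scale
\setrandom aid 1 :naccounts gaussian 5.0
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
```

With `gaussian 5.0`, accounts near the middle of the key range are drawn far more often than those near the edges; `exponential` instead skews the draws toward the low end of the range.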
-- Fabien.

diff --git a/contrib/pgbench/README b/contrib/pgbench/README
new file mode 100644
index 000..6881256
--- /dev/null
+++ b/contrib/pgbench/README
@@ -0,0 +1,5 @@
+# gaussian and exponential tests
+# with XXX as expo or gauss
+psql test < test-init.sql
+./pgbench -M prepared -f test-XXX-run.sql -t 100 -P 1 -n test
+psql test < test-XXX-check.sql
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..379ef24 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -41,6 +41,7 @@
 #include <math.h>
 #include <signal.h>
 #include <sys/time.h>
+#include <assert.h>
 #ifdef HAVE_SYS_SELECT_H
 #include <sys/select.h>
 #endif
@@ -98,6 +99,8 @@ static int pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS 5		/* seconds between log messages */
 #define DEFAULT_NXACTS 10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD 2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
@@ -471,6 +474,76 @@ getrand(TState *thread, int64 min, int64 max)
 	return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * the threshold is so that the density of probability for the last cut-off max
+ * value is exp(-threshold).
+ */
+static int64
+getExponentialrand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double		cut, uniform, rand;
+
+	assert(threshold > 0.0);
+	cut = exp(-threshold);
+	/* erand in [0, 1), uniform in (0, 1] */
+	uniform = 1.0 - pg_erand48(thread->random_state);
+	/*
+	 * inner expression in (cut, 1] (if threshold > 0),
+	 * rand in [0, 1)
+	 */
+	assert((1.0 - cut) != 0.0);
+	rand = -log(cut + (1.0 - cut) * uniform) / threshold;
+	/* return int64 random number within between min and max */
+	return min + (int64) ((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianrand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double		stdev;
+	double		rand;
+
+	/*
+	 * Get user specified random number from this loop, with
+	 * -threshold <= stdev < threshold
+	 *
+	 * This loop is executed until the number is in the expected range.
+	 *
+	 * As the minimum threshold is 2.0, the probability of looping is low:
+	 * sqrt(-2 ln(r)) >= 2 <=> r <= e^{-2} ~ 0.135, then when taking the
+	 * average sinus multiplier as 2/pi, we have a 8.6% looping probability
+	 * in the worst case. For a 5.0 threshold value, the looping probability
+	 * is about e^{-5} * 2 / pi ~ 0.43%.
+	 */
+	do
+	{
+		/*
+		 * pg_erand48 generates [0,1), but for the basic version of the
+		 * Box-Muller transform the two uniformly distributed random numbers
+		 * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+		 */
+		double		rand1 = 1.0 - pg_erand48(thread->random_state);
+		double		rand2 = 1.0 - pg_erand48(thread->random_state);
+
+		/* Box-Muller basic form transform */
+		double		var_sqrt = sqrt(-2.0 * log(rand1));
+
+		stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+
+		/*
+		 * we may try with cos, but there may be a bias induced if the
+		 * previous value fails the test? To be on the safe side, let us
+		 * try over.
+		 */
+	}
+	while (stdev < -threshold || stdev >= threshold);
+
+	/* stdev is in [-threshold, threshold), normalization to [0,1)
Re: [HACKERS] [bug fix] Suppress autovacuum: found orphan temp table message
On 2014-07-22 10:09:04 +0900, MauMau wrote: From: Andres Freund and...@2ndquadrant.com On 2014-07-18 23:38:09 +0900, MauMau wrote: So, I propose a simple fix to change the LOG level to DEBUG1. I don't know which of DEBUG1-DEBUG5 is appropriate, and any level is OK. Could you include this in 9.2.9? Surely that's the wrong end to tackle this from. Hiding actual problems is a seriously bad idea. No, there is no serious problem in the user operation in this situation. Server crash cannot be avoided, and must be anticipated. The problem is that PostgreSQL makes users worried about lots of (probably) unnecessary messages. Is there any problem if we don't output the message? Yes. The user won't know that possibly gigabytes worth of diskspace aren't freed. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] plpgsql.extra_warnings='num_into_expressions'
On 7/22/14, 7:06 AM, Pavel Stehule wrote: I looked at this patch and I think it is not a good idea. It introduces an early dependency between functions and pg_class-based objects. What dependency? The patch only looks at the raw parser output, so it won't e.g. know whether SELECT * INTO a, b FROM foo; is problematic or not. .marko
Re: [HACKERS] Index-only scans for multicolumn GIST
On 07/21/2014 10:47 PM, Anastasia Lubennikova wrote:

Hi, hackers! There are new results of my work on the GSoC project "Index-only scans for GIST". Previous post is here: http://postgresql.1045698.n5.nabble.com/Index-only-scans-for-GIST-td5804892.html Repository is https://github.com/lubennikovaav/postgres/tree/indexonlygist2 Patch is in attachments. It includes index-only scan for multicolumn GiST. It works correctly - results are the same as with other scan plans. The Fetch() method is implemented for the box and circle opclasses. The improvement for the circle opclass is not as distinct as for the box opclass, because of the recheck.

For a circle, the GiST index stores a bounding box of the circle. The new fetch function reverses that, calculating the radius and center of the circle from the bounding box. Those conversions lose some precision due to rounding. Are we okay with that? Floating point math is always subject to rounding errors, but there's a good argument to be made that it's not acceptable to get a different value back when the user didn't explicitly invoke any math functions. As an example:

create table circle_tbl (c circle);
create index circle_tbl_idx on circle_tbl using gist (c);
insert into circle_tbl values ('1.23456789012345,1.23456789012345,1e300');

postgres=# set enable_seqscan=off; set enable_bitmapscan=off; set enable_indexonlyscan=on;
SET
SET
SET
postgres=# select * from circle_tbl ;
       c
----------------
 <(0,0),1e+300>
(1 row)

postgres=# set enable_indexonlyscan=off;
SET
postgres=# select * from circle_tbl ;
                      c
----------------------------------------------
 <(1.23456789012345,1.23456789012345),1e+300>
(1 row)

- Heikki
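Heikki's example is exactly what one would predict from double-precision arithmetic: with a radius of 1e300, the center's contribution to the bounding-box corners is far below one ulp, so reconstructing the center as the box midpoint yields 0. A tiny C sketch of the effect (the helper names are mine, not the patch's):

```c
#include <assert.h>

/* A GiST leaf entry for a circle stores its bounding box: per axis,
 * [center - radius, center + radius].  A fetch function must invert that. */
static double center_from_bbox(double lo, double hi)
{
	return (lo + hi) / 2.0;
}

static double radius_from_bbox(double lo, double hi)
{
	return (hi - lo) / 2.0;
}
```

For Heikki's circle, c - r and c + r both round to exactly ±1e300, so the radius survives the round trip but the center collapses to (0,0).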
Re: [HACKERS] plpgsql.extra_warnings='num_into_expressions'
2014-07-22 8:52 GMT+02:00 Marko Tiikkaja ma...@joh.to: On 7/22/14, 7:06 AM, Pavel Stehule wrote: I looked at this patch and I think it is not a good idea. It introduces an early dependency between functions and pg_class-based objects. What dependency? The patch only looks at the raw parser output, so it won't e.g. know whether SELECT * INTO a, b FROM foo; is problematic or not. I am sorry, I was confused. There is a dependency on the variable type, but this dependency is not new. Regards Pavel .marko
Re: [HACKERS] Production block comparison facility
On Sun, Jul 20, 2014 at 5:31 PM, Simon Riggs si...@2ndquadrant.com wrote:

The block comparison facility presented earlier by Heikki would not be able to be used in production systems. ISTM that it would be desirable to have something that could be used in that way. ISTM easy to make these changes:

* optionally generate a FPW for every WAL record, not just the first change after a checkpoint: full_page_writes = 'always'
* when an FPW arrives, optionally run a check to see if it compares correctly against the page already there, when running streaming replication without a recovery target. We could skip reporting any problems until the database is consistent: full_page_write_check = on

The above changes seem easy to implement. With FPW compression, this would be a usable feature in production. Comments?

This is an interesting idea, and it would be easier to use than what has been submitted for CF1. However, full_page_writes set to always would generate a large amount of WAL even for small records, increasing I/O for the partition holding pg_xlog, and the frequency of checkpoints run on the system. Is this really something suitable for production? Then, looking at the code, we would need to tweak XLogInsert for the WAL record construction to always do a FPW and to update XLogCheckBufferNeedsBackup. Then for the redo part, we would need to do some extra operations in the area of RestoreBackupBlock/RestoreBackupBlockContents, including masking operations before comparing the content of the FPW and the current page. Does that sound right? -- Michael
Re: [HACKERS] [bug fix] Suppress autovacuum: found orphan temp table message
From: Andres Freund and...@2ndquadrant.com On 2014-07-22 10:09:04 +0900, MauMau wrote: Is there any problem if we don't output the message? Yes. The user won't know that possibly gigabytes worth of diskspace aren't freed.

RemovePgTempFiles() frees the disk space by removing temp relation files at server start. In addition, the temp relation files are not replicated to the standby server of streaming replication (this is the customer's case). So, those messages seem to be no more than noise. With this in mind, are those messages really necessary?

Regards MauMau
[HACKERS] parametric block size?
Hello devs, The default blocksize is currently 8k, which is not necessarily optimal for all setups, especially with SSDs, where the latency is much lower than with HDDs. There is a case for different values with a significant impact on performance (up to a not-to-be-sneezed-at 10% on a pgbench run on SSD, see http://www.cybertec.at/postgresql-block-sizes-getting-started/), and ISTM that the ability to align the PostgreSQL block size to the underlying FS/HW block size would be nice. This is currently possible, but it requires recompiling and maintaining distinct executables for various block sizes. This is annoying, so most admins will not bother. ISTM that a desirable and reasonably simple to implement feature would be the ability to set the blocksize at initdb time, so that postgres could use the value found in the database instead of a compile-time one. More advanced features, but with much more impact on the code, would be the ability to change the size at database/table level. Any thoughts? -- Fabien.
Re: [HACKERS] Use unique index for longer pathkeys.
Sorry, the previous version has bugs. It stamps over the stack and crashes X( The attached is the bug-fixed version, with no substantial changes.

On Tue, Jul 15, 2014 at 2:17 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: Hi, the attached is the revised version.

Thanks Horiguchi-San for the updated patch.

# By the way, this style of addressing a person is quite common
# among Japanese, since a first-name basis implies a very close
# relationship or can convey an offensive shade. But I'm not sure
# what I should call others who are not Japanese natives.

Anyway, today while looking into the updated patch, I was wondering why we can't eliminate useless keys in query_pathkeys when we actually build the same in the function standard_qp_callback(), basically somewhat similar to what we do in standard_qp_callback->make_pathkeys_for_sortclauses->pathkey_is_redundant().

I agree that standard_qp_callback is basically preferable. We already have index information related to base_rels before building the query pathkeys. I noticed that you mentioned the below in your original mail, which indicates that information related to inheritance tables is built only after set_base_rel_sizes():

These steps take place between set_base_rel_sizes() and set_base_rel_pathlists() in make_one_rel(). The reason for the position is that step 2 above needs all inheritance tables to be extended in PlannerInfo, and set_base_rel_sizes (currently) does that.

Sorry, my memory had been worn down. After some reconfirmation, this description turned out to be a bit (quite?) wrong. collect_common_primary_pathkeys needs root->eq_classes to be set up beforehand, not appendrels, because build_index_pathkeys doesn't work before every EC member for all relations involved is registered. standard_qp_callback registers EC members in the following path, but only for the primary(?) tlist of the subquery, so EC members for the child relations are not registered at that time.
...->make_pathkeys_for_sortclauses->make_pathkey_from_sortop
   ->make_pathkey_from_sortinfo->get_eclass_for_sort_expr

EC members for the child rels are registered in

set_base_rel_sizes->set_rel_size->set_append_rel_size
   ->add_child_rel_equivalences

So trim_query_pathkeys (collect_common...) doesn't work before set_base_rel_sizes(). These EC members need to be registered at least if root->query_pathkeys exists, so this seems to need to be done in standard_qp_callback, adding a separate inheritance tree walk.

# rel->has_eclass_joins seems not to be working, but that is not the topic here.

Could you please explain to me why the index information built in the above path is not sufficient, or is there any other case which I am missing? Is the description above enough as the explanation for the place? Sorry for the inaccurate description.

regards, -- Kyotaro Horiguchi NTT Open Source Software Center

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 9573a9b..546e112 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1789,9 +1789,11 @@ _outIndexOptInfo(StringInfo str, const IndexOptInfo *node)
 	/* indexprs is redundant since we print indextlist */
 	WRITE_NODE_FIELD(indpred);
 	WRITE_NODE_FIELD(indextlist);
+	/* cached pathkeys are omitted as indexkeys */
 	WRITE_BOOL_FIELD(predOK);
 	WRITE_BOOL_FIELD(unique);
 	WRITE_BOOL_FIELD(immediate);
+	WRITE_BOOL_FIELD(allnotnull);
 	WRITE_BOOL_FIELD(hypothetical);
 	/* we don't bother with fields copied from the pg_am entry */
 }
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index c81efe9..c12d0e7 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -58,6 +58,7 @@ int geqo_threshold;
 join_search_hook_type join_search_hook = NULL;
 
+static List *collect_common_primary_pathkeys(PlannerInfo *root);
 static void set_base_rel_sizes(PlannerInfo *root);
 static void set_base_rel_pathlists(PlannerInfo *root);
 static void set_rel_size(PlannerInfo *root, RelOptInfo *rel,
@@ -66,6 +67,7 @@ static void set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, Index rti, RangeTblEntry *rte);
 static void set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte);
+static void trim_query_pathkeys(PlannerInfo * root);
 static void set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte);
 static void set_foreign_size(PlannerInfo *root, RelOptInfo *rel,
@@ -112,6 +114,203 @@ static void recurse_push_qual(Node *setOp, Query *topquery, RangeTblEntry *rte, Index rti, Node *qual);
 static void remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel);
+/*
+ * collect_common_primary_pathkeys: Collects common unique and non-null index
+ * pathkeys from all base relations in current root.
+ */
+static List *
+collect_common_primary_pathkeys(PlannerInfo *root)
+{
+	List	   *result = NIL;
+	Index		rti;
+	int			nindex = 0;
+	List
Re: [HACKERS] [bug fix] Suppress autovacuum: found orphan temp table message
On 2014-07-22 17:05:22 +0900, MauMau wrote: From: Andres Freund and...@2ndquadrant.com On 2014-07-22 10:09:04 +0900, MauMau wrote: Is there any problem if we don't output the message? Yes. The user won't know that possibly gigabytes worth of diskspace aren't freed. RemovePgTempFiles() frees the disk space by removing temp relation files at server start. But it's not called during a crash restart. In addition, the temp relation files are not replicated to the standby server of streaming replication (this is the customer's case). So? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL replay bugs
Hello, Although I doubt the necessity of the flexibility given the current testing framework, I don't have a strong objection to it. Nevertheless, it would perhaps be appreciated if you put a notice in the README or somewhere. Hm, well... Fine, I added it in this updated series. Thank you for your patience :) After all, I have no more comments about this patch. I will mark this as 'Ready for committer' if no comments come from others for a few days. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] BUFFER_LOCK_EXCLUSIVE is used in ginbuildempty().
Hello, At Thu, 17 Jul 2014 15:54:31 -0400, Tom Lane t...@sss.pgh.pa.us wrote in 10710.1405626...@sss.pgh.pa.us:

Peter Geoghegan p...@heroku.com writes: On Thu, Jul 17, 2014 at 7:47 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I don't understand the point of having these GIN_EXCLUSIVE / GIN_SHARED symbols. It's not like we could do anything different than BUFFER_LOCK_EXCLUSIVE etc instead. If there were a GinLockBuffer() it might make more sense to have specialized symbols, but as it is it seems pointless.

I agree with you. To my eyes, which are not specialized for each AM, the translation-only symbols didn't make me so happy.

It's a pattern common to the index AMs. I think it's kind of pointless myself, but as long as we're doing it we might as well be consistent. I think that to the extent that these symbols are used in APIs above the direct buffer-access layer, they are useful --- for example using BT_READ/BT_WRITE in _bt_search calls seems like a useful increment of readability. GIN seems to have less of that than some of the other AMs, but I do see GIN_SHARE being used that way in some calls. BTW, there's one direct usage of BUFFER_LOCK_EXCLUSIVE in the GIST code as well, which should probably be replaced with GIST_EXCLUSIVE if we're trying to be consistent.

Though I brought up this topic, this kind of consistency seems not to be needed so much. If so, it seems better to leave it as it is.

regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] [bug fix] Suppress autovacuum: found orphan temp table message
From: Andres Freund and...@2ndquadrant.com On 2014-07-22 17:05:22 +0900, MauMau wrote: RemovePgTempFiles() frees the disk space by removing temp relation files at server start. But it's not called during a crash restart.

Yes, the comment of the function says:

 * NOTE: we could, but don't, call this during a post-backend-crash restart
 * cycle. The argument for not doing it is that someone might want to examine
 * the temp files for debugging purposes. This does however mean that
 * OpenTemporaryFile had better allow for collision with an existing temp
 * file name.

But this is only true if restart_after_crash = on in postgresql.conf, because the crash restart only occurs in that case. However, in an HA cluster, whether it is shared-disk or replication, restart_after_crash is set to off, isn't it? Moreover, as the comment says, the behavior of keeping leftover temp files is for debugging by developers. It's not helpful for users, is it? I thought messages of DEBUG level would be more appropriate, because the behavior is for debugging purposes.

In addition, the temp relation files are not replicated to the standby server of streaming replication (this is the customer's case). So?

Yes, so nobody can convince serious customers that the current behavior makes real sense. Could you please reconsider this?

Regards MauMau
Re: [HACKERS] how to reach D5 in tuplesort.c 's polyphase merge algorithm?
hi, I got the same result after setting work_mem = 64, but I can get to D5 and D6 after using a bigger data sample (at least 10 records) as Tom said! 2014-07-19 6:35 GMT+08:00 土卜皿 pengcz.n...@gmail.com: 2014-07-19 6:26 GMT+08:00 Tom Lane t...@sss.pgh.pa.us: =?UTF-8?B?5Zyf5Y2c55q/?= pengcz.n...@gmail.com writes: for studying the polyphase merge algorithm of tuplesort.c, I use ddd and append a table, which has a schema as follows: ... and has 36684 records, and every record is like: id code article name department 31800266\NMachault77 and to get into the external sort, I type the following command: select * from towns order by name desc; but I found it did not reach D5 and D6 during sorting. That doesn't sound like enough data to force it to spill to disk at all; at least not unless you turn down work_mem to some very small value. hi, Tom, thanks a lot! The work_mem you mentioned reminds me of one more thing I did: I tried to change BLCKSZ = 8192/2, and it compiled successfully, but I got an error when executing initdb. Dillon
Re: [HACKERS] [bug fix] Suppress autovacuum: found orphan temp table message
On 2014-07-22 19:13:56 +0900, MauMau wrote: From: Andres Freund and...@2ndquadrant.com On 2014-07-22 17:05:22 +0900, MauMau wrote: RemovePgTempFiles() frees the disk space by removing temp relation files at server start. But it's not called during a crash restart. Yes, the comment of the function says: * NOTE: we could, but don't, call this during a post-backend-crash restart * cycle. The argument for not doing it is that someone might want to examine * the temp files for debugging purposes. This does however mean that * OpenTemporaryFile had better allow for collision with an existing temp * file name. But this is true if restart_after_crash = on in postgresql.conf, because the crash restart only occurs in that case. However, in HA cluster, whether it is shared-disk or replication, restart_after_crash is set to off, isn't it? In almost all setups I've seen it's set to on, even in HA scenarios. Moreover, as the comment says, the behavior of keeping leftover temp files is for debugging by developers. It's not helpful for users, isn't it? I thought messages of DEBUG level is more appropriate, because the behavior is for debugging purposes. GRR. That doesn't change the fact that there'll be files left over after a crash restart. Yes, so nobody can convince serious customers that the current behavior makes real sense. I think you're making lots of noise over a trivial log message. Could you please reconsider this? No. Just removing a warning isn't the way to solve this. If you want to improve things you'll actually need to improve things not just stick your head into the sand. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Production block comparison facility
On 22 July 2014 08:49, Michael Paquier michael.paqu...@gmail.com wrote: On Sun, Jul 20, 2014 at 5:31 PM, Simon Riggs si...@2ndquadrant.com wrote: The block comparison facility presented earlier by Heikki would not be able to be used in production systems. ISTM that it would be desirable to have something that could be used in that way. ISTM easy to make these changes * optionally generate a FPW for every WAL record, not just first change after checkpoint full_page_writes = 'always' * when an FPW arrives, optionally run a check to see if it compares correctly against the page already there, when running streaming replication without a recovery target. We could skip reporting any problems until the database is consistent full_page_write_check = on The above changes seem easy to implement. With FPW compression, this would be a usable feature in production. Comments? This is an interesting idea, and it would be easier to use than what has been submitted for CF1. However, full_page_writes set to always would generate a large amount of WAL even for small records, increasing I/O for the partition holding pg_xlog, and the frequency of checkpoints run on system. Is this really something suitable for production? For critical systems, yes, I think it is. It would be possible to make that user selectable for particular transactions or tables. Then, looking at the code, we would need to tweak XLogInsert for the WAL record construction to always do a FPW and to update XLogCheckBufferNeedsBackup. Then for the redo part, we would need to do some extra operations in the area of RestoreBackupBlock/RestoreBackupBlockContents, including masking operations before comparing the content of the FPW and the current page. Does that sound right? Yes, it doesn't look very much code because it fits well with existing approaches. 
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Production block comparison facility
If you're always going FPW then there's no point in the rest of the record.

The point here was to find problems so that users could run normally with confidence. The cases you might want to run in the mode you describe are the build farm or integration testing. When testing your application on the next release of postgres it would be nice to have tests for the replication in your workload, given the experience in 9.3.

Even without the constant full page writes, a live production system could do a FPW comparison after a FPW if it was in a consistent state. That would give standbys periodic verification at low cost.

-- greg

On 22 Jul 2014 12:28, Simon Riggs si...@2ndquadrant.com wrote: On 22 July 2014 08:49, Michael Paquier michael.paqu...@gmail.com wrote: On Sun, Jul 20, 2014 at 5:31 PM, Simon Riggs si...@2ndquadrant.com wrote: The block comparison facility presented earlier by Heikki would not be able to be used in production systems. ISTM that it would be desirable to have something that could be used in that way. ISTM easy to make these changes * optionally generate a FPW for every WAL record, not just first change after checkpoint full_page_writes = 'always' * when an FPW arrives, optionally run a check to see if it compares correctly against the page already there, when running streaming replication without a recovery target. We could skip reporting any problems until the database is consistent full_page_write_check = on The above changes seem easy to implement. With FPW compression, this would be a usable feature in production. Comments? This is an interesting idea, and it would be easier to use than what has been submitted for CF1. However, full_page_writes set to always would generate a large amount of WAL even for small records, increasing I/O for the partition holding pg_xlog, and the frequency of checkpoints run on system. Is this really something suitable for production? For critical systems, yes, I think it is.
It would be possible to make that user selectable for particular transactions or tables. Then, looking at the code, we would need to tweak XLogInsert for the WAL record construction to always do a FPW and to update XLogCheckBufferNeedsBackup. Then for the redo part, we would need to do some extra operations in the area of RestoreBackupBlock/RestoreBackupBlockContents, including masking operations before comparing the content of the FPW and the current page. Does that sound right? Yes, it doesn't look very much code because it fits well with existing approaches. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Production block comparison facility
On 22 July 2014 12:54, Greg Stark st...@mit.edu wrote:

If you're always going FPW then there's no point in the rest of the record.

I think it's a simple matter to mark them XLP_BKP_REMOVABLE and to skip any optimization of the remainder of WAL records.

The point here was to find problems so that users could run normally with confidence.

Yes, but a full overwrite mode would provide an even safer mode of operation.

The cases you might want to run in the mode you describe are the build farm or integration testing. When testing your application on the next release of postgres it would be nice to have tests for the replication in your workload, given the experience in 9.3. Even without the constant full page writes, a live production system could do a FPW comparison after a FPW if it was in a consistent state. That would give standbys periodic verification at low cost.

Yes, the two options I proposed are somewhat independent of each other.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
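The check Michael describes — comparing an incoming full-page image against the standby's current page after masking fields that may legitimately differ (LSN, hint bits, unused space) — could be structured roughly as below. This is a hedged sketch in plain C: the mask representation and all names are mine, not from any submitted patch, and a real implementation would mask per access method.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define BLCKSZ 8192

/* Compare an incoming FPW image against the local page, skipping the given
 * (sorted, non-overlapping) byte ranges, e.g. the page LSN or hint bits. */
static bool pages_equal_masked(const char *fpw, const char *local,
							   const size_t *skip_off, const size_t *skip_len,
							   int nskip)
{
	size_t		pos = 0;
	int			i;

	for (i = 0; i < nskip; i++)
	{
		/* compare the unmasked bytes before this masked range */
		if (memcmp(fpw + pos, local + pos, skip_off[i] - pos) != 0)
			return false;
		pos = skip_off[i] + skip_len[i];	/* jump over the masked range */
	}
	/* compare the tail after the last masked range */
	return memcmp(fpw + pos, local + pos, BLCKSZ - pos) == 0;
}
```

A mismatch outside the masked ranges would then be reported (or deferred until the standby reaches consistency, as Simon suggested).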
Re: [HACKERS] [bug fix] Suppress autovacuum: found orphan temp table message
From: Andres Freund and...@2ndquadrant.com On 2014-07-22 19:13:56 +0900, MauMau wrote: But this is true if restart_after_crash = on in postgresql.conf, because the crash restart only occurs in that case. However, in HA cluster, whether it is shared-disk or replication, restart_after_crash is set to off, isn't it? In almost all setups I've seen it's set to on, even in HA scenarios. I'm afraid that's because people don't notice the existence or purpose of this parameter. The 9.1 release note says: Add restart_after_crash setting which disables automatic server restart after a backend crash (Robert Haas) This allows external cluster management software to control whether the database server restarts or not. Reading this, I guess the parameter was introduced, and should be used, for HA environments controlled by the clusterware. Restarting the database server on the same machine may fail, or the restarted server may fail again, due to the broken hardware components, so I guess it was considered better to let the clusterware determine what to do. Moreover, as the comment says, the behavior of keeping leftover temp files is for debugging by developers. It's not helpful for users, isn't it? I thought messages of DEBUG level is more appropriate, because the behavior is for debugging purposes. GRR. That doesn't change the fact that there'll be files left over after a crash restart. Yes... that's a source of headache. But please understand that there's a problem -- trying to leave temp relations just for debugging is causing a flood of messages, which the customer is actually concerned about. I think you're making lots of noise over a trivial log message. Maybe so, and I hope so. I may be too nervous about what the customer will ask and/or request next. If they request something similar to what I proposed here, let me consult you again. Could you please reconsider this? No. Just removing a warning isn't the way to solve this. 
If you want to improve things you'll actually need to improve things not just stick your head into the sand. I have a few ideas below, but none of them seems better than the original proposal. What do you think?

1. The startup process deletes the catalog entries and data files of leftover temp relations at the end of recovery. This is probably difficult, impossible, or undesirable, because the startup process cannot access system catalogs. Even if it's possible, it is against the developers' desire to leave temp relation files for debugging.

2. The autovacuum launcher deletes the catalog entries and data files of leftover temp relations during its initialization. This may be possible, but it is against the developers' desire to leave temp relation files for debugging.

3. Emit the orphan temp relation message only when the associated data file actually exists. Autovacuum workers check with stat() whether the temp relation file is left over. If not, delete the catalog entry in pg_class silently. This sounds reasonable because the purpose of the message is to notify users of potential disk space shortage. In the streaming replication case, no data files should exist on the promoted new primary, so no messages should be emitted. However, in the shared-disk HA cluster case, the temp relation files are left over on the shared disk, so this fix doesn't improve anything.

4. Emit the orphan temp relation message only when restart_after_crash is on, i.e. ereport(restart_after_crash ? LOG : DEBUG1, ...

Regards MauMau -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [bug fix] Suppress autovacuum: found orphan temp table message
On Tue, Jul 22, 2014 at 6:23 AM, Andres Freund and...@2ndquadrant.com wrote: Yes, so nobody can convince serious customers that the current behavior makes real sense. I think you're making lots of noise over a trivial log message. Could you please reconsider this? No. Just removing a warning isn't the way to solve this. If you want to improve things you'll actually need to improve things not just stick your head into the sand. I've studied this area of the code before, and I would actually propose to fix this in the opposite way - namely, adopt the logic currently used for the wraparound case, which drops the temp table, even if wraparound is not an issue. The current code seems to be laboring under the impression that there's some benefit to keeping the temporary relation around, which, as far as I can see, there isn't. Now, you could perhaps argue that it's useful for forensics, but that presumes that the situation where a backend crashes without cleaning up its temporary relations is exciting enough to merit an investigation, which I believe to be false. RemoveTempRelationsCallback just does this:

    AbortOutOfAnyTransaction();
    StartTransactionCommand();
    RemoveTempRelations(myTempNamespace);
    CommitTransactionCommand();

and RemoveTempRelations uses:

    deleteWhatDependsOn(object, false);

These are pretty high-level operations, and there are all kinds of reasons they could fail. Many of those reasons do indeed involve the system being messed up in some way, but it's likely that the user will know about that for other reasons anyway - e.g. if the cleanup fails because the disk where the files are located has gone read-only at the operating system level, things are going to go downhill in a hurry. When the user restarts, they expect recovery - or other automatic cleanup mechanisms - to put things right.
And normally they do: the first backend to get the same backend ID as any orphaned temp schema will clear it out anyway on first use - completely silently - but if it so happens that a crashed backend had a very high backend ID and that temp table usage is relatively uncommon, then the user gets log spam from now and until eternity because of a problem that, in their mind, is already fixed. Even better, they will typically be completely unable to connect the log spam back to the event that triggered it, and will have no idea how to put it right. So while I disagree with MauMau's proposed fix, I agree with him that this error message is useless log spam. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index-only scans for multicolumn GIST
On Tue, Jul 22, 2014 at 2:55 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 07/21/2014 10:47 PM, Anastasia Lubennikova wrote: Hi, hackers! There are new results of my work on the GSoC project Index-only scans for GIST. The previous post is here: http://postgresql.1045698.n5.nabble.com/Index-only-scans-for-GIST-td5804892.html Repository is https://github.com/lubennikovaav/postgres/tree/indexonlygist2 Patch is in attachments. It includes indexonlyscan for multicolumn GiST. It works correctly - results are the same as with other scan plans. The Fetch() method is implemented for the box and circle opclasses. The improvement for the circle opclass is not as distinct as for the box opclass, because of the recheck. For a circle, the GiST index stores a bounding box of the circle. The new fetch function reverses that, calculating the radius and center of the circle from the bounding box. Those conversions lose some precision due to rounding. Are we okay with that? Floating point math is always subject to rounding errors, but there's a good argument to be made that it's not acceptable to get a different value back when the user didn't explicitly invoke any math functions. As an example:

create table circle_tbl (c circle);
create index circle_tbl_idx on circle_tbl using gist (c);
insert into circle_tbl values ('1.23456789012345,1.23456789012345,1e300');

postgres=# set enable_seqscan=off; set enable_bitmapscan=off; set enable_indexonlyscan=on;
SET
SET
SET
postgres=# select * from circle_tbl ;
      c
--------------
 (0,0),1e+300
(1 row)

postgres=# set enable_indexonlyscan=off;
SET
postgres=# select * from circle_tbl ;
                     c
--------------------------------------------
 (1.23456789012345,1.23456789012345),1e+300
(1 row)

I really don't think it's ever OK for a query to produce different answers depending on which plan is chosen.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
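The precision loss Heikki describes can be reproduced outside the index code. A minimal C sketch (the struct and function names here are illustrative stand-ins, not PostgreSQL's actual geo_decls.h definitions; it assumes IEEE-754 doubles): store a circle as its bounding box, then recover center and radius the way a fetch function must.

```c
#include <assert.h>

/* Illustrative structs; not PostgreSQL's CIRCLE/BOX definitions. */
typedef struct { double cx, cy, r; } Circle;
typedef struct { double xlo, ylo, xhi, yhi; } BBox;

/* What the GiST index stores: the circle's bounding box. */
static BBox
circle_to_bbox(Circle c)
{
    BBox b = { c.cx - c.r, c.cy - c.r, c.cx + c.r, c.cy + c.r };
    return b;
}

/* What a fetch function has to do: reverse the conversion. */
static Circle
bbox_to_circle(BBox b)
{
    Circle c = { (b.xlo + b.xhi) / 2.0,     /* center = box midpoint */
                 (b.ylo + b.yhi) / 2.0,
                 (b.xhi - b.xlo) / 2.0 };   /* radius = half the width */
    return c;
}
```

With center (1.23456789012345, 1.23456789012345) and radius 1e300, the round trip returns center (0,0): the tiny center coordinate is absorbed when added to the huge radius, which matches the (0,0),1e+300 output in the index-only-scan session above.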
Re: [HACKERS] Index-only scans for multicolumn GIST
Heikki Linnakangas hlinnakan...@vmware.com writes: For a circle, the GiST index stores a bounding box of the circle. The new fetch function reverses that, calculating the radius and center of the circle from the bounding box. Those conversions lose some precision due to rounding. Are we okay with that? That seems like a nonstarter :-(. Index-only scans don't have a license to return approximations. If we drop the behavior for circles, how much functionality do we have left? regards, tom lane
Re: [HACKERS] [bug fix] Suppress autovacuum: found orphan temp table message
On 2014-07-22 09:39:13 -0400, Robert Haas wrote: On Tue, Jul 22, 2014 at 6:23 AM, Andres Freund and...@2ndquadrant.com wrote: Yes, so nobody can convince serious customers that the current behavior makes real sense. I think you're making lots of noise over a trivial log message. Could you please reconsider this? No. Just removing a warning isn't the way to solve this. If you want to improve things you'll actually need to improve things not just stick your head into the sand. I've studied this area of the code before, and I would actually proposed to fix this in the opposite way - namely, adopt the logic currently used for the wraparound case, which drops the temp table, even if wraparound is not an issue. I'm absolutely not opposed to fixing this for real. I doubt it's backpatching material, but that's something to judge when we know what to do. The current code seems to be laboring under the impression that there's some benefit to keeping the temporary relation around, which, as far as I can see, there isn't. Now, you could perhaps argue that it's useful for forensics, but that presumes that the situation where a backend fails to crash without cleaning up its temporary relations is exciting enough to merit an investigation, which I believe to be false. I think the picture here changed with the introduction of the restart_after_crash GUC - before it it was pretty hard to investigate anything that involved crashes. So I'm ok with changing things around. But I think it's more complex than just doing the if (wraparound) in do_autovacuum(). a) There very well could be a backend reconnecting to that backendId. Then we potentially might try to remove the temp schema from two backends - I'm not sure that's always going to end up going well. There's already a race window, but it's pretty darn unlikely to hit it right now because the wraparound case pretty much implies that nothing has used that backendid slot for a while. 
I guess we could do something like:

    LockDatabaseObject(tempschema);
    if (SearchSysCacheExists1)
        /* bailout */
    performDeletion(...);

b) I think at the very least we also need to call RemovePgTempFiles() during crash restart. RemoveTempRelationsCallback just does this:

    AbortOutOfAnyTransaction();
    StartTransactionCommand();
    RemoveTempRelations(myTempNamespace);
    CommitTransactionCommand();

RemoveTempRelations uses:

    deleteWhatDependsOn(object, false);

These are pretty high-level operations, and there are all kinds of reasons they could fail. One could argue, without being very convincing from my pov, that that's a reason not to always do it from autovacuum because it'll prevent vacuum from progressing past the borked temporary relation. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [bug fix] Suppress autovacuum: found orphan temp table message
Robert Haas robertmh...@gmail.com writes: On Tue, Jul 22, 2014 at 6:23 AM, Andres Freund and...@2ndquadrant.com wrote: No. Just removing a warning isn't the way to solve this. If you want to improve things you'll actually need to improve things not just stick your head into the sand. I've studied this area of the code before, and I would actually propose to fix this in the opposite way - namely, adopt the logic currently used for the wraparound case, which drops the temp table, even if wraparound is not an issue. The current code seems to be laboring under the impression that there's some benefit to keeping the temporary relation around, which, as far as I can see, there isn't. FWIW, I agree with Andres on this. The right response is to drop the temp table, not complain about log spam. Or even more to the point, investigate why it's there in the first place; perhaps there's an actual fixable bug somewhere in there. But deciding that orphaned temp tables are normal is *not* an improvement IMO. regards, tom lane
Re: [HACKERS] [bug fix] Suppress autovacuum: found orphan temp table message
On 2014-07-22 10:17:15 -0400, Tom Lane wrote: Or even more to the point, investigate why it's there in the first place; perhaps there's an actual fixable bug somewhere in there. I think MauMau's scenario of a failover to another database explains their existence - there's no step that'd remove them after promoting a standby. So there indeed is a need to have a sensible mechanism for removing them at some point. But it should be about removing, not ignoring them. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [bug fix] Suppress autovacuum: found orphan temp table message
On 2014-07-22 22:18:03 +0900, MauMau wrote: From: Andres Freund and...@2ndquadrant.com On 2014-07-22 19:13:56 +0900, MauMau wrote: But this is true if restart_after_crash = on in postgresql.conf, because the crash restart only occurs in that case. However, in HA cluster, whether it is shared-disk or replication, restart_after_crash is set to off, isn't it? In almost all setups I've seen it's set to on, even in HA scenarios. I'm afraid that's because people don't notice the existence or purpose of this parameter. The 9.1 release note says: I think it's also because it's simply a good idea to keep it on in many production scenarios. Failing over 'just' because something caused a crash restart is often too expensive. No. Just removing a warning isn't the way to solve this. If you want to improve things you'll actually need to improve things not just stick your head into the sand. I have a few ideas below, but none of them seems better than the original proposal. What do you think? 1. startup process deletes the catalog entries and data files of leftover temp relations at the end of recovery. This is probably difficult, impossible or undesirable, because the startup process cannot access system catalogs. Even if it's possible, it is against the developers' desire to leave temp relation files for debugging. 2. autovacuum launcher deletes the catalog entries and data files of leftover temp relations during its initialization. This may be possible, but it is against the developers' desire to leave temp relation files for debugging. I think that desire is pretty much antiquated and doesn't really count for much. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [bug fix] Suppress autovacuum: found orphan temp table message
Andres Freund and...@2ndquadrant.com writes: On 2014-07-22 10:17:15 -0400, Tom Lane wrote: Or even more to the point, investigate why it's there in the first place; perhaps there's an actual fixable bug somewhere in there. I think MauMau's scenario of a failover to another database explains their existence - there's no step that'd remove them after promoting a standby. So there indeed is a need to have a sensible mechanism for removing them at some point. But it should be about removing, not ignoring them. Agreed. Note that RemovePgTempFiles, as such, only reclaims disk space. It does not clean out the pg_class entries, which means that just running that at standby promotion would do nothing to get rid of autovacuum's whining. regards, tom lane
Re: [HACKERS] Some bogus results from prairiedog
On 07/22/2014 12:24 AM, Tom Lane wrote: According to http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedogdt=2014-07-21%2022%3A36%3A55 prairiedog saw a crash in make check on the 9.4 branch earlier tonight; but there's not a lot of evidence as to why in the buildfarm report, because the postmaster log file is truncated well before where things got interesting. Fortunately, I was able to capture a copy of check.log before it got overwritten by the next run. I find that the place where the webserver report stops matches this section of check.log: [53cd99bb.134a:158] LOG: statement: create index test_range_gist_idx on test_range_gist using gist (ir); [53cd99bb.134a:159] LOG: statement: insert into test_range_gist select int4range(g, g+10) from generate_series(1,2000) g; ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^\ @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@[53cd99ba.1344:329] LOG: statement: INSERT INTO num_exp_div VALUES (7,8,'-1108.80577182462841041118'); [53cd99ba.1344:330] LOG: statement: INSERT INTO num_exp_add VALUES (7,9,'-107955289.045047420'); [53cd99ba.1344:331] LOG: statement: INSERT INTO num_exp_sub VALUES (7,9,'-58101680.954952580'); The ^@'s represent nul bytes, which I find runs of elsewhere in the file as well. I think they are an artifact of OS X buffering policy caused by multiple processes writing into the same file without any interlocks. Perhaps we ought to consider making buildfarm runs use the logging collector by default? But in any case, it seems uncool that either the buildfarm log-upload process, or the buildfarm web server, is unable to cope with log files containing nul bytes. The data is there, just click on the check stage link at the top of the page to see it in raw form. 
cheers andrew
Re: [HACKERS] [GSoC2014] Patch ALTER TABLE ... SET LOGGED
On Thu, Jul 17, 2014 at 8:02 PM, Andres Freund and...@2ndquadrant.com wrote: That means should I FlushRelationBuffers(rel) before changing the relpersistence? I don't think that'd help. I think what this means is that you simply cannot change the relpersistence of the old relation before the heap swap is successful. So I guess it has to be something like (pseudocode):

    OIDNewHeap = make_new_heap(..);
    newrel = heap_open(OIDNewHeap, AEL);

    /*
     * Change the temporary relation to be unlogged/logged. We have to do
     * that here so buffers for the new relfilenode will have the right
     * persistency set while the original filenode's buffers won't get read
     * in with the wrong (i.e. new) persistency setting. Otherwise a
     * rollback after the rewrite would possibly result with buffers for the
     * original filenode having the wrong persistency setting.
     *
     * NB: This relies on swap_relation_files() also swapping the
     * persistency. That wouldn't work for pg_class, but that can't be
     * unlogged anyway.
     */
    AlterTableChangeCatalogToLoggedOrUnlogged(newrel);
    FlushRelationBuffers(newrel);

    /* copy heap data into newrel */

    finish_heap_swap();

And then in swap_relation_files() also copy the persistency. That's the best I can come up with right now at least. Isn't it better if we can set the relpersistence as an argument to make_new_heap? I'm thinking to change make_new_heap:

From: make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, bool forcetemp, LOCKMODE lockmode)
To:   make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, char relpersistence, LOCKMODE lockmode)

That way we can create the new heap with the appropriate relpersistence, so in swap_relation_files also copy the persistency, of course. And after copying the heap data to the new table (ATRewriteTable), change the relpersistence of the OldHeap's indexes, because in finish_heap_swap they'll be rebuilt. Thoughts?
Regards, -- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL Timbira: http://www.timbira.com.br Blog sobre TI: http://fabriziomello.blogspot.com Perfil Linkedin: http://br.linkedin.com/in/fabriziomello Twitter: http://twitter.com/fabriziomello
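The proposed signature change can be sketched as a toy model (the RelationStub type and both function names are hypothetical stand-ins for illustration; the real make_new_heap and swap_relation_files operate on catalog state, not C structs): the caller passes the target relpersistence explicitly, the new heap is created with it, and the swap step then only has to exchange the field.

```c
#include <assert.h>

#define RELPERSISTENCE_PERMANENT 'p'
#define RELPERSISTENCE_UNLOGGED  'u'

/* Toy stand-in for a pg_class row. */
typedef struct { char relpersistence; } RelationStub;

/* Sketch of the proposed signature: persistence is an explicit
 * argument instead of a "forcetemp" boolean. */
static RelationStub
make_new_heap_stub(RelationStub old_heap, char relpersistence)
{
    RelationStub new_heap = old_heap;
    new_heap.relpersistence = relpersistence;   /* created with target setting */
    return new_heap;
}

/* swap_relation_files would then also exchange the persistency. */
static void
swap_persistence(RelationStub *a, RelationStub *b)
{
    char tmp = a->relpersistence;
    a->relpersistence = b->relpersistence;
    b->relpersistence = tmp;
}
```

The design point is that the new relfilenode never exists with the wrong persistence setting, so no buffer can be read in under the old setting and survive a rollback.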
Re: [HACKERS] [bug fix] Suppress autovacuum: found orphan temp table message
On 2014-07-23 00:13:26 +0900, MauMau wrote: Hello, Robert-san, Andres-san, Tom-san, From: Andres Freund and...@2ndquadrant.com a) There very well could be a backend reconnecting to that backendId. Then we potentially might try to remove the temp schema from two backends - I'm not sure that's always going to end up going well. There's already a race window, but it's pretty darn unlikely to hit it right now because the wraparound case pretty much implies that nothing has used that backendid slot for a while. I guess we could do something like: LockDatabaseObject(tempschema); if (SearchSysCacheExists1) /* bailout */ performDeletion(...); b) I think at the very least we also need to call RemovePgTempFiles() during crash restart. Thank you for showing the direction. I'll investigate the code. But that will be tomorrow as it's already past midnight. Could it be included in 9.2.9 if I could submit the patch tomorrow? (I'm not confident I can finish it...) I'd really appreciate it if you could create the fix, if tomorrow will be late. 9.2.9 is already stamped, so no. But even without that, I don't think that this is a change that should be rushed into the backbranches. The risk/benefit ratio just isn't on the side of doing things hastily. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [bug fix] Suppress autovacuum: found orphan temp table message
Hello, Robert-san, Andres-san, Tom-san, From: Andres Freund and...@2ndquadrant.com a) There very well could be a backend reconnecting to that backendId. Then we potentially might try to remove the temp schema from two backends - I'm not sure that's always going to end up going well. There's already a race window, but it's pretty darn unlikely to hit it right now because the wraparound case pretty much implies that nothing has used that backendid slot for a while. I guess we could do something like: LockDatabaseObject(tempschema); if (SearchSysCacheExists1) /* bailout */ performDeletion(...); b) I think at the very least we also need to call RemovePgTempFiles() during crash restart. Thank you for showing the direction. I'll investigate the code. But that will be tomorrow as it's already past midnight. Could it be included in 9.2.9 if I could submit the patch tomorrow? (I'm not confident I can finish it...) I'd really appreciate it if you could create the fix, if tomorrow will be late. Regards MauMau
Re: [HACKERS] Some bogus results from prairiedog
On Tue, Jul 22, 2014 at 12:24 AM, Tom Lane t...@sss.pgh.pa.us wrote: According to http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedogdt=2014-07-21%2022%3A36%3A55 prairiedog saw a crash in make check on the 9.4 branch earlier tonight; but there's not a lot of evidence as to why in the buildfarm report, because the postmaster log file is truncated well before where things got interesting. Fortunately, I was able to capture a copy of check.log before it got overwritten by the next run. I find that the place where the webserver report stops matches this section of check.log: [53cd99bb.134a:158] LOG: statement: create index test_range_gist_idx on test_range_gist using gist (ir); [53cd99bb.134a:159] LOG: statement: insert into test_range_gist select int4range(g, g+10) from generate_series(1,2000) g; ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^\ @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@[53cd99ba.1344:329] LOG: statement: INSERT INTO num_exp_div VALUES (7,8,'-1108.80577182462841041118'); [53cd99ba.1344:330] LOG: statement: INSERT INTO num_exp_add VALUES (7,9,'-107955289.045047420'); [53cd99ba.1344:331] LOG: statement: INSERT INTO num_exp_sub VALUES (7,9,'-58101680.954952580'); The ^@'s represent nul bytes, which I find runs of elsewhere in the file as well. I think they are an artifact of OS X buffering policy caused by multiple processes writing into the same file without any interlocks. Perhaps we ought to consider making buildfarm runs use the logging collector by default? But in any case, it seems uncool that either the buildfarm log-upload process, or the buildfarm web server, is unable to cope with log files containing nul bytes. 
Anyway, to cut to the chase, the crash seems to be from this:

TRAP: FailedAssertion(!(FastPathStrongRelationLocks->count[fasthashcode] > 0), File: lock.c, Line: 2957)

which the postmaster shortly later says this about:

[53cd99b6.130e:2] LOG: server process (PID 5230) was terminated by signal 6: Abort trap
[53cd99b6.130e:3] DETAIL: Failed process was running: ROLLBACK PREPARED 'foo1';
[53cd99b6.130e:4] LOG: terminating any other active server processes

So there is still something rotten in the fastpath lock logic. Gosh, that sucks. The inconstancy of this problem would seem to suggest some kind of locking bug rather than a flat-out concurrency issue, but it looks to me like everything relevant is marked volatile. But ... maybe the problem could be in s_lock.h? That machine is powerpc, and powerpc's definition of S_UNLOCK looks like this:

#ifdef USE_PPC_LWSYNC
#define S_UNLOCK(lock) \
do \
{ \
    __asm__ __volatile__ (" lwsync \n"); \
    *((volatile slock_t *) (lock)) = 0; \
} while (0)
#else
#define S_UNLOCK(lock) \
do \
{ \
    __asm__ __volatile__ (" sync \n"); \
    *((volatile slock_t *) (lock)) = 0; \
} while (0)
#endif   /* USE_PPC_LWSYNC */

I believe Andres recently reported that this isn't enough to prevent compiler reordering; for that, the __asm__ __volatile__ would need to be augmented with ::: "memory". If that's correct, then the compiler could decide to issue the volatile store before the lwsync or sync. Then, the CPU could decide to perform the volatile store to the spinlock before performing the update to FastPathStrongRelationLocks. That would explain this problem. The other possible explanation (at least, AFAICS) is that lock_twophase_postcommit() is trying to release a strong lock count that it doesn't actually hold. That would indicate a strong-lock-count handling bug upstream someplace - e.g. that the count got released when the transaction was prepared, or somesuch. I certainly can't rule that out, although I don't know exactly where to look.
We could add an assertion to AtPrepare_Locks(), just before setting locallock->holdsStrongLockCount = FALSE, that locallock->holdsStrongLockCount is true if and only if the lock tag and mode suggest that it ought to be. But that's really just guessing wildly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
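The missing-clobber hazard can be illustrated with a generic compiler barrier (a portability sketch: the PPC lwsync/sync sequences only assemble on powerpc, so an empty asm stands in for the fence instruction here, and S_UNLOCK_SKETCH is a made-up name; a real fix needs the hardware fence as well). Without the ::: "memory" clobber the compiler is free to sink a protected store past the asm; with it, the asm acts as a full compiler barrier.

```c
#include <assert.h>

typedef volatile int slock_t;

/*
 * Sketch of an unlock with a compiler barrier.  The empty asm string is
 * a stand-in for PPC's "lwsync"/"sync"; the ::: "memory" clobber is the
 * part reported as missing, and it forbids the compiler from moving
 * memory accesses across the asm statement.
 */
#define S_UNLOCK_SKETCH(lock) \
    do { \
        __asm__ __volatile__("" ::: "memory"); \
        *((volatile slock_t *) (lock)) = 0; \
    } while (0)

static slock_t lock = 1;
static int strong_count = 1;

static void
release(void)
{
    strong_count--;             /* protected update ...              */
    S_UNLOCK_SKETCH(&lock);     /* ... must not be reordered past it */
}
```

Without the clobber, the compiler could legally reorder the non-volatile strong_count store after the volatile store to the lock, which is exactly the window that would let another backend see a stale FastPathStrongRelationLocks count.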
Re: [HACKERS] [COMMITTERS] pgsql: Diagnose incompatible OpenLDAP versions during build and test.
Noah Misch n...@leadboat.com writes: Diagnose incompatible OpenLDAP versions during build and test. Hmm. I'm pretty sure it is not considered good style to drop AC_DEFUN blocks right into configure.in; at least, we have never done that before. PGAC_LDAP_SAFE should get defined somewhere in config/*.m4, instead. I am not real sure however which existing config/ file is appropriate, or whether we should create a new one. Peter, any opinion? regards, tom lane
[HACKERS] Behavior of OFFSET -1
Before 9.3, you got an error from this:

regression=# select * from tenk1 offset -1;
ERROR:  OFFSET must not be negative

But 9.3 and up ignore the negative OFFSET. This seems to be a thinko in my commit 1a1832eb. limit_needed() thinks it can discard the Limit plan node altogether, which of course prevents nodeLimit.c from complaining:

    /* Executor would treat less-than-zero same as zero */
    if (offset > 0)
        return true;    /* OFFSET with a positive value */

I don't recall the reasoning behind that comment for sure, but I imagine I examined the behavior of ExecLimit() and failed to notice that there was an error check in recompute_limits(). This seems to me to be a clear bug: we should reinstate the former behavior by tightening this check so it only discards OFFSET with a constant value of exactly 0. Anyone think differently? regards, tom lane
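Tom's proposed tightening amounts to the following (a toy model of the constant-OFFSET branch of limit_needed(), with a hypothetical function name, not the actual planner code): keep the Limit node for any constant offset other than exactly zero, so that a negative value still reaches the executor's error check in recompute_limits().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model of limit_needed()'s constant-OFFSET branch.
 * Returns true when the Limit plan node must be kept.
 */
static bool
limit_needed_const_offset(int64_t offset)
{
    /*
     * The 9.3 check was "if (offset > 0) return true;", which also
     * discards negative offsets and thereby hides the executor's
     * "OFFSET must not be negative" error.  The tightened check only
     * discards the node for a constant 0.
     */
    if (offset != 0)
        return true;    /* keep the node; executor validates/applies it */
    return false;       /* OFFSET 0 is a no-op; node can be dropped */
}
```

With this check, OFFSET -1 keeps its Limit node and the pre-9.3 error reappears, while the OFFSET 0 optimization is preserved.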
Re: [HACKERS] Some bogus results from prairiedog
On 07/22/2014 10:55 AM, Andrew Dunstan wrote: On 07/22/2014 12:24 AM, Tom Lane wrote: According to http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedogdt=2014-07-21%2022%3A36%3A55 prairiedog saw a crash in make check on the 9.4 branch earlier tonight; but there's not a lot of evidence as to why in the buildfarm report, because the postmaster log file is truncated well before where things got interesting. Fortunately, I was able to capture a copy of check.log before it got overwritten by the next run. I find that the place where the webserver report stops matches this section of check.log: [53cd99bb.134a:158] LOG: statement: create index test_range_gist_idx on test_range_gist using gist (ir); [53cd99bb.134a:159] LOG: statement: insert into test_range_gist select int4range(g, g+10) from generate_series(1,2000) g; ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^\ @^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@[53cd99ba.1344:329] LOG: statement: INSERT INTO num_exp_div VALUES (7,8,'-1108.80577182462841041118'); [53cd99ba.1344:330] LOG: statement: INSERT INTO num_exp_add VALUES (7,9,'-107955289.045047420'); [53cd99ba.1344:331] LOG: statement: INSERT INTO num_exp_sub VALUES (7,9,'-58101680.954952580'); The ^@'s represent nul bytes, which I find runs of elsewhere in the file as well. I think they are an artifact of OS X buffering policy caused by multiple processes writing into the same file without any interlocks. Perhaps we ought to consider making buildfarm runs use the logging collector by default? But in any case, it seems uncool that either the buildfarm log-upload process, or the buildfarm web server, is unable to cope with log files containing nul bytes. The data is there, just click on the check stage link at the top of the page to see it in raw form. I have made a change in the upload receiver app to escape nul bytes in the main log field too. 
This will operate prospectively. Using the logging collector would be a larger change, but we can look at it if this isn't sufficient. cheers andrew
Re: [HACKERS] parametric block size?
Fabien wrote: ISTM that a desirable and reasonably simple to implement feature would be to be able to set the blocksize at initdb time, and postgres could use the value found in the database instead of a compile-time one. I think you will find it more difficult to implement than it seems at first. For one thing, there are several macros that depend on the block size and the algorithms involved cannot work with dynamic sizes; consider MaxIndexTuplesPerPage, which is used in PageIndexMultiDelete() for instance. That value is used to allocate an array on the stack, but that doesn't work if the array size is dynamic. (Actually it works almost everywhere, but that feature is not in C89 and thus it fails on Windows.) That shouldn't be a problem, you say, just palloc() the array -- except that that function is called within critical sections in some places (e.g. _bt_delitems_vacuum) and you cannot use palloc there. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
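Alvaro's stack-array point can be shown concretely (simplified macros for illustration; the real definitions in bufpage.h and itup.h involve page headers, item pointers, and alignment): MaxIndexTuplesPerPage must be a compile-time constant precisely because it sizes an automatic array, which C89 - and therefore the Windows build - does not allow to be dynamic.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; the actual PostgreSQL formulas differ. */
#define BLCKSZ 8192
#define MIN_INDEX_TUPLE_SIZE 16
#define MaxIndexTuplesPerPage (BLCKSZ / MIN_INDEX_TUPLE_SIZE)

static size_t
count_slots(void)
{
    /*
     * Legal C89: the array bound is a constant expression.  With a
     * run-time block size this would become a variable-length array,
     * which MSVC rejects - and palloc() is off-limits as a fallback
     * inside critical sections such as _bt_delitems_vacuum.
     */
    int itemnos[MaxIndexTuplesPerPage];
    return sizeof(itemnos) / sizeof(itemnos[0]);
}
```

This is why a run-time-configurable block size ripples through far more code than the single BLCKSZ definition suggests.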
Re: [HACKERS] Portability issues in TAP tests
On Mon, Jul 21, 2014 at 11:39 PM, Tom Lane t...@sss.pgh.pa.us wrote: Andrew Dunstan and...@dunslane.net writes: On 07/21/2014 02:40 PM, Tom Lane wrote: I had the same feeling about the Perl on RHEL6 ;-). The TAP tests will need to be a great deal more portable than they've proven so far before they'll really be useful for anything much. I trust we're committed to making that happen. I'd be rather inclined to remove them from the check-world and installcheck-world targets until we have dealt with the issues people have identified. I think it would be reasonable to do that in the 9.4 branch, since it's a bit hard to argue that the TAP tests are production grade at the moment. We could leave them there in HEAD. That doesn't do developers who can't run the tests in their environment much good: most people develop against master. Maybe that's the point: to get these fixed. But it's pretty annoying that I can no longer do 'make world'. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Behavior of OFFSET -1
On Tue, Jul 22, 2014 at 12:49 PM, Tom Lane t...@sss.pgh.pa.us wrote: Before 9.3, you got an error from this: regression=# select * from tenk1 offset -1; ERROR: OFFSET must not be negative But 9.3 and up ignore the negative OFFSET. This seems to be a thinko in my commit 1a1832eb. limit_needed() thinks it can discard the Limit plan node altogether, which of course prevents nodeLimit.c from complaining: /* Executor would treat less-than-zero same as zero */ if (offset > 0) return true; /* OFFSET with a positive value */ I don't recall the reasoning behind that comment for sure, but I imagine I examined the behavior of ExecLimit() and failed to notice that there was an error check in recompute_limits(). This seems to me to be a clear bug: we should reinstate the former behavior by tightening this check so it only discards OFFSET with a constant value of exactly 0. Anyone think differently? Not I. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parametric block size?
Hello Alvaro, ISTM that a desirable and reasonably simple to implement feature would be to be able to set the blocksize at initdb time, and postgres could use the value found in the database instead of a compile-time one. I think you will find it more difficult to implement than it seems at first. For one thing, there are several macros that depend on the block size and the algorithms involved cannot work with dynamic sizes; consider MaxIndexTuplesPerPage, which is used in PageIndexMultiDelete() for instance. That value is used to allocate an array on the stack, but that doesn't work if the array size is dynamic. (Actually it works almost everywhere, but that feature is not in C89 and thus it fails on Windows.) That shouldn't be a problem, you say, just palloc() the array -- except that that function is called within critical sections in some places (e.g. _bt_delitems_vacuum) and you cannot use palloc there. Hmmm. Thanks for your point... indeed there may be implementation details... not a surprise:-) Note that I was more asking about the desirability of the feature; the implementation is another, although also relevant, issue. To me it is really desirable given the potential performance impact, but maybe we should not care about 10%? About your point: if we really have to do without dynamic stack allocation (C99 is only 15, not ripe for adult use yet, maybe when it turns 18 or 21, depending on the state:-), a possible workaround would be to allocate a larger area with some MAX_BLCKSZ, with an #ifdef for compilers that really would not support dynamic stack allocation. Moreover, it might be possible to hide it more or less cleanly in a macro. I had to put -pedantic -Werror to manage to get an error on dynamic stack allocation with gcc -std=c89. -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Shared Data Structure b/w clients
Hi All, I am working on the PostgreSQL code and having some problems. :) I need to create a shared data structure, so that different clients and connections can update and share the state of those data structures in memory. I planned to use the top memory context, but that only gives me a structure shared within one session/terminal. Please tell me how PostgreSQL does this and how I can do it? Regards, Rohit Goyal
Re: [HACKERS] Behavior of OFFSET -1
On Tue, Jul 22, 2014 at 12:49:37PM -0400, Tom Lane wrote: Before 9.3, you got an error from this: regression=# select * from tenk1 offset -1; ERROR: OFFSET must not be negative That seems eminently sane, and should continue to error out, IMO. The only circumstance I can imagine where this could be argued not to be is just casuistry, namely LIMIT m OFFSET -n might be argued to mean LIMIT m-n. Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [GSoC2014] Patch ALTER TABLE ... SET LOGGED
On Tue, Jul 22, 2014 at 12:01 PM, Fabrízio de Royes Mello fabriziome...@gmail.com wrote: On Thu, Jul 17, 2014 at 8:02 PM, Andres Freund and...@2ndquadrant.com wrote: That means I should FlushRelationBuffers(rel) before changing the relpersistence? I don't think that'd help. I think what this means is that you simply cannot change the relpersistence of the old relation before the heap swap is successful. So I guess it has to be something like (pseudocode): OIDNewHeap = make_new_heap(..); newrel = heap_open(OIDNewHeap, AEL); /* * Change the temporary relation to be unlogged/logged. We have to do * that here so buffers for the new relfilenode will have the right * persistency set while the original filenode's buffers won't get read * in with the wrong (i.e. new) persistency setting. Otherwise a * rollback after the rewrite would possibly result with buffers for the * original filenode having the wrong persistency setting. * * NB: This relies on swap_relation_files() also swapping the * persistency. That wouldn't work for pg_class, but that can't be * unlogged anyway. */ AlterTableChangeCatalogToLoggedOrUnlogged(newrel); FlushRelationBuffers(newrel); /* copy heap data into newrel */ finish_heap_swap(); And then in swap_relation_files() also copy the persistency. That's the best I can come up with right now at least. Isn't it better if we can set the relpersistence as an argument to make_new_heap? I'm thinking to change make_new_heap: From: make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, bool forcetemp, LOCKMODE lockmode) To: make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, char relpersistence, LOCKMODE lockmode) That way we can create the new heap with the appropriate relpersistence, so swap_relation_files() also copies the persistency, of course. And after copying the heap data to the new table (ATRewriteTable), change the relpersistence of the OldHeap's indexes, because in finish_heap_swap() they'll be rebuilt. Thoughts?
The attached patch implements my previous idea based on Andres' thoughts. Regards, -- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL Timbira: http://www.timbira.com.br Blog sobre TI: http://fabriziomello.blogspot.com Perfil Linkedin: http://br.linkedin.com/in/fabriziomello Twitter: http://twitter.com/fabriziomello diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml index 69a1e14..2d131df 100644 --- a/doc/src/sgml/ref/alter_table.sgml +++ b/doc/src/sgml/ref/alter_table.sgml @@ -59,16 +59,17 @@ ALTER TABLE [ IF EXISTS ] replaceable class=PARAMETERname/replaceable ENABLE ALWAYS RULE replaceable class=PARAMETERrewrite_rule_name/replaceable CLUSTER ON replaceable class=PARAMETERindex_name/replaceable SET WITHOUT CLUSTER +SET {LOGGED | UNLOGGED} SET WITH OIDS SET WITHOUT OIDS SET ( replaceable class=PARAMETERstorage_parameter/replaceable = replaceable class=PARAMETERvalue/replaceable [, ... ] ) +SET TABLESPACE replaceable class=PARAMETERnew_tablespace/replaceable RESET ( replaceable class=PARAMETERstorage_parameter/replaceable [, ... ] ) INHERIT replaceable class=PARAMETERparent_table/replaceable NO INHERIT replaceable class=PARAMETERparent_table/replaceable OF replaceable class=PARAMETERtype_name/replaceable NOT OF OWNER TO replaceable class=PARAMETERnew_owner/replaceable -SET TABLESPACE replaceable class=PARAMETERnew_tablespace/replaceable REPLICA IDENTITY {DEFAULT | USING INDEX replaceable class=PARAMETERindex_name/replaceable | FULL | NOTHING} phraseand replaceable class=PARAMETERtable_constraint_using_index/replaceable is:/phrase @@ -447,6 +448,20 @@ ALTER TABLE [ IF EXISTS ] replaceable class=PARAMETERname/replaceable /varlistentry varlistentry +termliteralSET {LOGGED | UNLOGGED}/literal/term +listitem + para + This form changes the table persistence type from unlogged to permanent or + from permanent to unlogged (see xref linkend=SQL-CREATETABLE-UNLOGGED).
+ /para + para + Changing the table persistence type acquires an literalACCESS EXCLUSIVE/literal lock + and rewrites the table contents and associated indexes into new disk files. + /para +/listitem + /varlistentry + + varlistentry termliteralSET WITH OIDS/literal/term listitem para diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c index b1c411a..7f497c7 100644 --- a/src/backend/commands/cluster.c +++ b/src/backend/commands/cluster.c @@ -574,7 +574,8 @@ rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose) heap_close(OldHeap, NoLock); /* Create the transient table that will receive the re-ordered data */ - OIDNewHeap = make_new_heap(tableOid, tableSpace, false, + OIDNewHeap = make_new_heap(tableOid, tableSpace, + OldHeap-rd_rel-relpersistence,
Re: [HACKERS] Shared Data Structure b/w clients
On Tuesday, July 22, 2014, Rohit Goyal rhtgyl...@gmail.com wrote: Hi All, I am working on postgresql code and having some problem. :) I need to create shared data structure, so that different client and connection can update and share the state of those data structures in memory. I planned to use top memory context but it can give me shared structure within one session/terminal. Please tel me how postgresql do that and how i can do that? Regards, Rohit Goyal How about making it a part of shared mem, like shared buffers? Regards, Atri -- Regards, Atri *l'apprenant*
Re: [HACKERS] Shared Data Structure b/w clients
Hi Atri/All, I am very new to the PostgreSQL code. Can you please help in a bit more detail, or tell me how to create a structure in shared memory (shared buffers)? It would be really easy for me if you can give me a code snippet or any link to follow. Regards, Rohit Goyal On Tue, Jul 22, 2014 at 8:30 PM, Atri Sharma atri.j...@gmail.com wrote: On Tuesday, July 22, 2014, Rohit Goyal rhtgyl...@gmail.com wrote: Hi All, I am working on postgresql code and having some problem. :) I need to create shared data structure, so that different client and connection can update and share the state of those data structures in memory. I planned to use top memory context but it can give me shared structure within one session/terminal. Please tell me how postgresql do that and how i can do that? Regards, Rohit Goyal How about making it a part of shared mem, like shared buffers? Regards, Atri -- Regards, Atri *l'apprenant* -- Regards, Rohit Goyal
Re: [HACKERS] Shared Data Structure b/w clients
On Tue, Jul 22, 2014 at 1:33 PM, Rohit Goyal rhtgyl...@gmail.com wrote: Hi Atri/All, I am very new to the postgresql code. Can you please help in a bit more detail, or tell me how to create a structure in shared memory (shared buffers)? It would be really easy for me if you can give me a code snippet or any link to follow. You can look at contrib/pg_stat_statements. -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: 24x7 support and training Phone: +593 4 5107566 Cell: +593 987171157 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal (9.5) : psql unicode border line styles
On 28.6.2014 21:29, Pavel Stehule wrote: Hello rebase for 9.5 test: \pset linestyle unicode \pset border 2 \pset unicode_header_linestyle double \l Regards Pavel I did a quick review of the patch today: * it applies cleanly to current HEAD (no failures, small offsets) * compiles and generally seems to work just fine Two questions: (1) Shouldn't the new options be listed in '\?' (as possible names for pset)? I mean, here: \pset [NAME [VALUE]] set table output option (NAME := {format|border|expanded|fieldsep|fieldsep_zero|footer|null| numericlocale|recordsep|recordsep_zero|tuples_only|title|tableattr|pager}) (2) I noticed this piece of code: +typedef enum unicode_linestyle +{ + UNICODE_LINESTYLE_SINGLE = 0, /* to make sure someone initializes this */ + UNICODE_LINESTYLE_DOUBLE = 1 +} unicode_linestyle; Why are the values defined explicitly? These values are set by the compiler automatically, so why set them manually? Only a few of the other enums are defined explicitly, and most of them have to do that to define different values (e.g. 0x01, 0x02, 0x04, ...). I don't understand how the comment "to make sure someone initializes this" explains the purpose? regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Some bogus results from prairiedog
Robert Haas robertmh...@gmail.com writes: On Tue, Jul 22, 2014 at 12:24 AM, Tom Lane t...@sss.pgh.pa.us wrote: Anyway, to cut to the chase, the crash seems to be from this: TRAP: FailedAssertion(!(FastPathStrongRelationLocks->count[fasthashcode] > 0), File: lock.c, Line: 2957) So there is still something rotten in the fastpath lock logic. Gosh, that sucks. The inconstancy of this problem would seem to suggest some kind of locking bug rather than a flat-out concurrency issue, but it looks to me like everything relevant is marked volatile. I don't think that you need any big assumptions about machine-specific coding issues to spot the problem. The assert in question is here: /* * Decrement strong lock count. This logic is needed only for 2PC. */ if (decrement_strong_lock_count && ConflictsWithRelationFastPath(lock->tag, lockmode)) { uint32 fasthashcode = FastPathStrongLockHashPartition(hashcode); SpinLockAcquire(&FastPathStrongRelationLocks->mutex); Assert(FastPathStrongRelationLocks->count[fasthashcode] > 0); FastPathStrongRelationLocks->count[fasthashcode]--; SpinLockRelease(&FastPathStrongRelationLocks->mutex); } and it sure looks to me like that ConflictsWithRelationFastPath(lock->tag, ...) is looking at the tag of the shared-memory lock object you just released. If someone else had managed to recycle that locktable entry for some other purpose, the ConflictsWithRelationFastPath call might incorrectly return true. I think s/lock->tag/locktag/ would fix it, but maybe I'm missing something. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Diagnose incompatible OpenLDAP versions during build and test.
Noah Misch n...@leadboat.com writes: Diagnose incompatible OpenLDAP versions during build and test. Oh, one more thing: the Windows buildfarm members mostly don't like the dblink test case you added. Looks like the mechanism for finding the shared library doesn't work. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README
On Mon, Jul 21, 2014 at 9:55 PM, Amit Kapila amit.kapil...@gmail.com wrote: There is a mention about the race condition where it needs to move right in another caller (_bt_search) of _bt_moveright() as well. /* * Race -- the page we just grabbed may have split since we read its * pointer in the parent (or metapage). If it has, we may need to * move right to its new sibling. Do that. .. Do you think there is more to what is already mentioned on top of second caller which we should add or you think if it is true for both, then it should be on top of _bt_moveright()? Well, maybe it is justified to mention it multiple times, so as to not break the reader's train of thought. I'm not sure. In general, I agree with you that we should mention about any advantage of the algorithm we are using and especially if it is significant. I think it will be better if can also mention how that advantage or use is realized in our implementation as we are already doing in README of nbtree. Right. It seems like the nbtree README is very shy about telling us what the point of all this extra work is. IMV that should be stated very prominently to aid understanding. A lot of people believe that we have to do lock coupling/crabbing when descending the tree. We do not. The locks acquired when descending a B-Tree in Postgres are pretty granular. One read buffer lock is held at a time in the process of servicing index scans. There are times during the descent of the tree when no buffer locks are held whatsoever. Moreover, (with some caveats) it doesn't really matter if a stale view of the page is seen during a descent (as it happens I've been trying to think of ways to further take advantage of this). That's pretty cool. If you only know one thing about how the nbtree code works, that should probably be it. The above indicates 2 things: a. L Y doesn't need to hold read locks concurrently. b. Advantage of right-links at all levels and high-key. 
As per my understanding, we are not following point (a) in our code, so what is the benefit we get by having a reference of same in README? The major reason that we don't completely avoid read locks, is, I suppose, the need for self-consistent pages (but also because it would break page deletion - I'm pretty sure that LY don't consider page deletion, and the page deletion logic is entirely based on the Lanin and Shasha paper and original research, but I didn't check). I think that the sentence Lehman and Yao don't require read locks, but assume that in-memory copies of tree pages are unshared is intended to convey the point on the self-consistency of pages. Of course, Lehman and Yao must assume that the B-Tree is in some sense in shared memory. Otherwise, there wouldn't be much point to their elaborate locking protocol. :-) Isn't it better if we mention how the point (b) is used in our code and it's advantage together rather than keeping it at top of README? Maybe it deserves more prominent mention in the code too. Already README mentions in brief about right-link and how it is used as below: .. The scan must remember the page's right-link at the time it was scanned, since that is the page to move right to; if we move right to the current right-link then we'd re-scan any items moved by a page split. ... This is talking about how index scans interlock against VACUUM while going to the heap, by keeping a leaf page pinned (this prevents super exclusive lock acquisition). This is after the tree has been descended. This is really just a detail (albeit one that follows similar principles, since pages split right and it similarly exploits that fact), whereas the use of right links and high keys while descending the tree, and in particular the fact that the move right LY technique obviates the prior need for lock coupling is pretty much the whole point of LY. 
In more concrete terms, _bt_search() releases and only then acquires read locks during a descent of the tree (by calling _bt_relandgetbuf()), and, perhaps counterintuitively, that's just fine. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README
Peter Geoghegan p...@heroku.com writes: Right. It seems like the nbtree README is very shy about telling us what the point of all this extra work is. IIRC, the README was written on the assumption that you'd already read LY. If this patch is mostly about not assuming that, why not? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] pgkill() is not POSIX-like for exiting processes
My new OpenLDAP test case has been breaking each MSVC buildfarm member. Most MinGW members are fine, though the 9.0 and 9.1 narwhal members broke. (Newer narwhal members have been broken long-term.) The MSVC build system has a mundane inability to handle a Makefile construct I used; the first attached patch fixes that. With that fixed, though, the test case reveals a departure from POSIX in pgkill(), our emulation of kill(2) for Windows. A pgkill() call falling very close in time to process exit can easily give ERROR_BROKEN_PIPE, and I at least once observed it give ERROR_BAD_PIPE. pgkill() translates those codes to EINVAL. Only a buggy POSIX program will ever see EINVAL from kill(2). This situation is just like signalling a zombie, and POSIX kill(2) returns zero for zombies. Let's do that, as attached. I'm inclined to back-patch this, although a quick scan didn't turn up any code having bugs today as a result. The change will affect occasional error messages. (An alternative is to change the test case code in the back branches to treat EINVAL like success.) On my systems, MSVC builds get ERROR_BROKEN_PIPE far more easily than MinGW builds. In MinGW, I ran make -C contrib/dblink check many times without ever hitting the error, but vcregress contribcheck hit it over 99% of the time. I don't have a theory to explain that. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com diff --git a/src/tools/msvc/vcregress.pl b/src/tools/msvc/vcregress.pl index 39698ee..a9e95cd 100644 --- a/src/tools/msvc/vcregress.pl +++ b/src/tools/msvc/vcregress.pl @@ -331,9 +331,13 @@ sub fetchRegressOpts if ($m =~ /^\s*REGRESS_OPTS\s*=(.*)/m) { - # ignore options that use makefile variables - can't handle those - # ignore anything that isn't an option staring with -- - @opts = grep { $_ !~ /\$\(/ $_ =~ /^--/ } split(/\s+/, $1); + # Substitute known Makefile variables, then ignore options that retain + # an unhandled variable reference. 
Ignore anything that isn't an + # option starting with --. + @opts = grep { + s/\Q$(top_builddir)\E/\"$topdir\"/; + $_ !~ /\$\(/ && $_ =~ /^--/ + } split(/\s+/, $1); } if ($m =~ /^\s*ENCODING\s*=\s*(\S+)/m) { diff --git a/src/port/kill.c b/src/port/kill.c index 5a7d483..b33283e 100644 --- a/src/port/kill.c +++ b/src/port/kill.c @@ -70,13 +70,28 @@ pgkill(int pid, int sig) return 0; } - if (GetLastError() == ERROR_FILE_NOT_FOUND) - errno = ESRCH; - else if (GetLastError() == ERROR_ACCESS_DENIED) - errno = EPERM; - else - errno = EINVAL; - return -1; + switch (GetLastError()) + { + case ERROR_BROKEN_PIPE: + case ERROR_BAD_PIPE: + + /* +* These arise transiently as a process is exiting. Treat them +* like POSIX treats a zombie process, reporting success. +*/ + return 0; + + case ERROR_FILE_NOT_FOUND: + /* pipe fully gone, so treat the process as gone */ + errno = ESRCH; + return -1; + case ERROR_ACCESS_DENIED: + errno = EPERM; + return -1; + default: + errno = EINVAL; /* unexpected */ + return -1; + } } #endif -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Shared Data Structure b/w clients
On 07/23/2014 02:33 AM, Rohit Goyal wrote: I am very new in postgresql code. Can you please help in a bit detail ortel me how to create structure in shared memory(shared buffer). It would be really easy for me if you can give me a code snippet or any link to follow. There's a lot of detail on how to do this in the BDR codebase, see contrib/bdr in http://git.postgresql.org/gitweb/?p=2ndquadrant_bdr.git;a=summary -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Shared Data Structure b/w clients
On 07/23/2014 09:46 AM, Craig Ringer wrote: There's a lot of detail on how to do this in the BDR codebase, see contrib/bdr in http://git.postgresql.org/gitweb/?p=2ndquadrant_bdr.git;a=summary Oh, sorry: in the bdr-next branch. Should've mentioned. http://git.postgresql.org/gitweb/?p=2ndquadrant_bdr.git;a=tree;f=contrib/bdr;h=fad1aa59a15724deb98f9b923d84f0ce818afc1f;hb=refs/heads/bdr-next -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] IS NOT DISTINCT FROM + Indexing
On Jul 22, 2014, at 12:40 AM, Tom Lane t...@sss.pgh.pa.us wrote: Jonathan S. Katz jonathan.k...@excoventures.com writes: On Jul 21, 2014, at 9:51 PM, Tom Lane t...@sss.pgh.pa.us wrote: The short reason why not is that it's not an operator (where operator is defined as something with a pg_operator entry), and all our indexing infrastructure is built around the notion that indexable clauses are of the form indexed_column indexable_operator comparison_value. What got me thinking about this problem initially is that I know IS NULL is indexable, and I was unsure how adding IS NOT DISTINCT FROM would be too different from that - of course, this is from my perspective of primarily operating on the surface. It sounds like the IS NULL work is in the btree code? We hacked in IS [NOT] NULL as a potentially indexable construct, but the key thing that made that possible without major redesign is that IS [NOT] NULL is datatype independent, so there's no need to identify any particular underlying operator or opclass. I'm not sure what we'd do to handle IS [NOT] DISTINCT FROM, but that particular approach ain't gonna cut it. Another point is that people are unlikely to be satisfied with planner optimization for IS NOT DISTINCT FROM that doesn't support it as a join clause (i.e., tab1.col1 IS NOT DISTINCT FROM tab2.col2); which is an issue that doesn't arise for IS [NOT] NULL, as it has only one argument. So that brings you into not just indexability but hashing and merging support. I hasten to say that that doesn't necessarily have to happen in a version-zero patch; but trying to make IS NOT DISTINCT FROM into a first-class construct is a big project. Well, that definitely answers how hard it would be. Before embarking on something laborious (as even just indexing is nontrivial), I think it would be good to figure out how people are using IS [NOT] DISTINCT FROM and if there is interest in having it be indexable, let alone used in a JOIN optimization.
It could become a handy tool to simplify the SQL in queries that are returning a lot of NULL / NOT NULL data mixed together. Jonathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README
On Wed, Jul 23, 2014 at 5:58 AM, Peter Geoghegan p...@heroku.com wrote: On Mon, Jul 21, 2014 at 9:55 PM, Amit Kapila amit.kapil...@gmail.com wrote: The above indicates 2 things: a. L&Y doesn't need to hold read locks concurrently. b. Advantage of right-links at all levels and high-key. As per my understanding, we are not following point (a) in our code, so what is the benefit we get by having a reference of same in README? The major reason that we don't completely avoid read locks, is, I suppose, the need for self-consistent pages (but also because it would break page deletion - I'm pretty sure that L&Y don't consider page deletion, and the page deletion logic is entirely based on the Lanin and Shasha paper and original research, but I didn't check). I think that the sentence "Lehman and Yao don't require read locks, but assume that in-memory copies of tree pages are unshared" is intended to convey the point on the self-consistency of pages. Of course, Lehman and Yao must assume that the B-Tree is in some sense in shared memory. Otherwise, there wouldn't be much point to their elaborate locking protocol. :-) Okay, but how does this justify adding the below new text in the README? + Even with these read locks, Lehman and Yao's approach obviates the + need of earlier schemes to hold multiple read locks concurrently when + descending the tree as part of servicing index scans (pessimistic lock + coupling). Actually, I think adding it can lead to an inconsistency in the README. Currently it indicates that our algorithm is different from L&Y w.r.t taking Read Locks and has given an explanation for same. Isn't it better if we mention how the point (b) is used in our code and its advantage together rather than keeping it at top of README? Maybe it deserves more prominent mention in the code too. Already README mentions in brief about right-link and how it is used as below: ..
The scan must remember the page's right-link at the time it was scanned, since that is the page to move right to; if we move right to the current right-link then we'd re-scan any items moved by a page split. ... This is talking about how index scans interlock against VACUUM while going to the heap, by keeping a leaf page pinned (this prevents super exclusive lock acquisition). This is after the tree has been descended. This is really just a detail (albeit one that follows similar principles, since pages split right and it similarly exploits that fact), whereas the use of right links and high keys while descending the tree, and in particular the fact that the move right LY technique obviates the prior need for lock coupling is pretty much the whole point of LY. In more concrete terms, _bt_search() releases and only then acquires read locks during a descent of the tree (by calling _bt_relandgetbuf()), and, perhaps counterintuitively, that's just fine. So don't you think that it needs bit more explanation than you have quoted in below text. + The addition of right-links at all levels, as well as the + addition of a page high key allows detection of, and dynamic + recovery from concurrent page splits (that is, splits between + unlocking an internal page, and subsequently locking its child page + during a descent). Basically I think it will be better if you can explain in bit more detail that how does right-links at all levels and high-key helps to detect and recover from concurrent page splits. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README
On Tue, Jul 22, 2014 at 5:39 PM, Tom Lane t...@sss.pgh.pa.us wrote: IIRC, the README was written on the assumption that you'd already read L&Y. If this patch is mostly about not assuming that, why not? L&Y made the same mistake that the authors of most influential papers make - they never get around to telling the reader why they should bother to read it. The paper is over 30 years old, and we now know that it's very influential, and the reasons why. I think that both the nbtree README and L&Y would be a lot more approachable with a high level introduction (arguably L&Y attempt this, but the way they go about it seems impenetrable, mostly consisting of esoteric references to other papers). Surely making that code more approachable is a worthy goal. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Inconsistencies of service failure handling on Windows
Hi all,

While playing on Windows with services, I noticed an inconsistent behavior in the way failures are handled when using a service for a Postgres instance. Let's assume that there is a service called postgres that has been registered:

$ psql -At -c 'select version()'
PostgreSQL 9.5devel, compiled by Visual C++ build 1600, 64-bit
$ tasklist.exe -svc -FI "SERVICES eq postgres"
Image Name                     PID Services
========================= ======== ============
pg_ctl.exe                     556 postgres

When pg_ctl is directly killed, the service manager is able to detect the failure correctly.

$ taskkill.exe -PID 556 -F
SUCCESS: The process with PID 556 has been terminated.
$ sc query postgres
SERVICE_NAME: postgres
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 1  STOPPED
        WIN32_EXIT_CODE    : 1067  (0x42b)
        SERVICE_EXIT_CODE  : 0  (0x0)
        CHECKPOINT         : 0x0
        WAIT_HINT          : 0x0

In this case 1067 means that the process left unexpectedly. Note that at this point the Postgres instance is still running, but we can use the failure callback to run a script that could do some cleanup before restarting the service properly.

However, when a backend process is directly killed, something different happens.

$ tasklist.exe -FI "IMAGENAME eq postgres.exe"
Image Name                     PID Session Name        Session#    Mem Usage
========================= ======== ================ =========== ============
postgres.exe                  2088 Services                   0     17,380 K
postgres.exe                  2132 Services                   0      4,400 K
postgres.exe                  2236 Services                   0      5,064 K
postgres.exe                  1524 Services                   0      6,304 K
postgres.exe                  2084 Services                   0      9,200 K
postgres.exe                  2384 Services                   0      5,968 K
postgres.exe                  2652 Services                   0      4,500 K
postgres.exe                  2116 Services                   0      4,384 K
$ taskkill.exe -PID 2084 -F
SUCCESS: The process with PID 2084 has been terminated.

After that some processes remain:

$ tasklist.exe -FI "IMAGENAME eq postgres.exe"
Image Name                     PID Session Name        Session#    Mem Usage
========================= ======== ================ =========== ============
postgres.exe                  2088 Services                   0      5,708 K
postgres.exe                  2132 Services                   0      4,400 K

These remaining processes are immediately taken down when attempting a connection to the server.
Note that before attempting any connections the service is considered as running normally:

$ sc query postgres
SERVICE_NAME: postgres
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 4  RUNNING
                                (STOPPABLE, PAUSABLE, ACCEPTS_SHUTDOWN)
        WIN32_EXIT_CODE    : 0  (0x0)
        SERVICE_EXIT_CODE  : 0  (0x0)
        CHECKPOINT         : 0x0
        WAIT_HINT          : 0x0
$ psql
psql: could not connect to server: Connection refused (0x274D/10061)
        Is the server running on host "localhost" (::1) and accepting
        TCP/IP connections on port 5432?
could not connect to server: Connection refused (0x274D/10061)
        Is the server running on host "localhost" (127.0.0.1) and accepting
        TCP/IP connections on port 5432?
$ tasklist.exe -FI "IMAGENAME eq postgres.exe"
INFO: No tasks are running which match the specified criteria.

But now the service has stopped, and it is not considered as having failed:

$ sc query postgres
SERVICE_NAME: postgres
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 1  STOPPED
        WIN32_EXIT_CODE    : 0  (0x0)
        SERVICE_EXIT_CODE  : 0  (0x0)
        CHECKPOINT         : 0x0
        WAIT_HINT          : 0x0

This seems like an inconsistent behavior in error detection. I am guessing that pg_ctl is not able to appropriately track failures that happen on the postgres side. But are there things we could do to improve the failure detection here?

Regards,
--
Michael
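For context, the failure callback mentioned above is configured through the Service Control Manager's recovery options, which can be set from the command line with sc.exe. A sketch (Windows-only; the service name postgres and the cleanup script path are assumptions for illustration):

```
rem Run a cleanup script on first failure, then restart after 5 seconds;
rem reset the failure counter after a day.
sc failure postgres reset= 86400 command= "C:\pg\cleanup.bat" actions= run/1000/restart/5000

rem Inspect the configured recovery actions.
sc qfailure postgres
```

The catch, as the transcripts above show, is that these actions only fire when the service process itself (pg_ctl.exe) dies with a failure exit status; a backend killed underneath it leads to a clean-looking STOPPED state, so no recovery action is triggered.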
Re: [HACKERS] proposal (9.5) : psql unicode border line styles
Hi Tomas,

2014-07-22 23:20 GMT+02:00 Tomas Vondra t...@fuzzy.cz:

On 28.6.2014 21:29, Pavel Stehule wrote:

Hello

rebase for 9.5

test:
\pset linestyle unicode
\pset border 2
\pset unicode_header_linestyle double
\l

Regards
Pavel

I did a quick review of the patch today:

* it applies cleanly to current HEAD (no failures, small offsets)
* compiles and generally seems to work just fine

Two questions:

(1) Shouldn't the new options be listed in '\?' (as possible names for pset)? I mean, here:

\pset [NAME [VALUE]]   set table output option
(NAME := {format|border|expanded|fieldsep|fieldsep_zero|footer|null|numericlocale|recordsep|recordsep_zero|tuples_only|title|tableattr|pager})

fixed

(2) I noticed this piece of code:

+typedef enum unicode_linestyle
+{
+	UNICODE_LINESTYLE_SINGLE = 0, /* to make sure someone initializes this */
+	UNICODE_LINESTYLE_DOUBLE = 1
+} unicode_linestyle;

Why are the values defined explicitly? These values are set by the compiler automatically, so why set them manually? Only a few of the other enums are defined explicitly, and most of those have to do that to define different values (e.g. 0x01, 0x02, 0x04, ...).

this is useless - I removed it.

I don't understand how the comment "to make sure someone initializes this" explains the purpose?
copy/paste error :( - removed

updated version is in attachment

Regards
Pavel

regards
Tomas

commit fb99f3b1e12d7dfb203b70187e63647e5d0a674d
Author: Pavel Stehule pavel.steh...@gooddata.com
Date:   Wed Jul 23 07:30:46 2014 +0200

    second version - minor changes: help, remove bogus comment and not
    necessary exact enum specification

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index fa0d6f2..fc8a503 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -2294,6 +2294,42 @@ lo_import 152801
         </para>
         </listitem>
       </varlistentry>
+
+      <varlistentry>
+      <term><literal>unicode_border_style</literal></term>
+      <listitem>
+      <para>
+      Sets the border line drawing style to one
+      of <literal>single</literal> or <literal>double</literal>.
+      This option only affects the <literal>unicode</>
+      linestyle.
+      </para>
+      </listitem>
+      </varlistentry>
+
+      <varlistentry>
+      <term><literal>unicode_column_style</literal></term>
+      <listitem>
+      <para>
+      Sets the column line drawing style to one
+      of <literal>single</literal> or <literal>double</literal>.
+      This option only affects the <literal>unicode</>
+      linestyle.
+      </para>
+      </listitem>
+      </varlistentry>
+
+      <varlistentry>
+      <term><literal>unicode_header_style</literal></term>
+      <listitem>
+      <para>
+      Sets the header line drawing style to one
+      of <literal>single</literal> or <literal>double</literal>.
+      This option only affects the <literal>unicode</>
+      linestyle.
+      </para>
+      </listitem>
+      </varlistentry>
     </variablelist>
    </para>
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 161de75..c0a09b1 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -1054,6 +1054,9 @@ exec_command(const char *cmd,
 		"footer", "format", "linestyle", "null",
 		"numericlocale", "pager", "recordsep",
 		"tableattr", "title", "tuples_only",
+		"unicode_border_linestyle",
+		"unicode_column_linestyle",
+		"unicode_header_linestyle",
 		NULL
 	};

@@ -2248,6 +2251,55 @@
_align2string(enum printFormat in)
 	return "unknown";
 }

+/*
+ * Parse entered unicode linestyle. Returns true, when entered string is
+ * known linestyle: single, double else returns false.
+ */
+static bool
+set_unicode_line_style(const char *value, size_t vallen, unicode_linestyle *linestyle)
+{
+	if (pg_strncasecmp("single", value, vallen) == 0)
+		*linestyle = UNICODE_LINESTYLE_SINGLE;
+	else if (pg_strncasecmp("double", value, vallen) == 0)
+		*linestyle = UNICODE_LINESTYLE_DOUBLE;
+	else
+		return false;
+
+	return true;
+}
+
+static const char *
+_unicode_linestyle2string(int linestyle)
+{
+	switch (linestyle)
+	{
+		case UNICODE_LINESTYLE_SINGLE:
+			return "single";
+			break;
+		case UNICODE_LINESTYLE_DOUBLE:
+			return "double";
+			break;
+	}
+	return "unknown";
+}
+
+static const char *
+_linestyle2string(linestyle_type line_style)
+{
+	switch (line_style)
+	{
+		case LINESTYLE_ASCII:
+			return "ascii";
+			break;
+		case LINESTYLE_OLD_ASCII:
+			return "old-ascii";
+			break;
+		case LINESTYLE_UNICODE:
+			return "unicode";
+			break;
+	}
+	return "unknown";
+}

 bool
 do_pset(const char *param, const char *value, printQueryOpt *popt, bool quiet)
@@ -2292,11 +2344,11 @@ do_pset(const char
Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README
On Tue, Jul 22, 2014 at 8:59 PM, Amit Kapila amit.kapil...@gmail.com wrote:

Okay, but how does this justify adding the below new text to the README?

+   Even with these read locks, Lehman and Yao's approach obviates the
+   need of earlier schemes to hold multiple read locks concurrently when
+   descending the tree as part of servicing index scans (pessimistic lock
+   coupling).

Actually I think putting it can lead to inconsistency in the README. Currently it indicates that our algorithm is different from L&Y w.r.t. taking read locks and has given an explanation for same.

Not really. Firstly, that sentence acknowledges that there are read locks where L&Y assume there will not be. "Even with these read locks" references the first paragraph, where it is stated that Postgres B-Trees still acquire read locks while descending the tree.

Secondly, I'm pretty sure that even Lehman and Yao realized that their apparent assumption that real implementations would not require read locks isn't realistic. Their handling of deletion seems perfunctory to me. They say: "In situations where excessive deletions cause the storage utilization of tree nodes to be unacceptably low, a batch reorganization or an underflow operation which locks the entire tree can be performed". I'm pretty sure that that sounded almost as bad in 1980 as it does now.

We don't have a "not quite L&Y" implementation just because there are single read locks acquired while descending the tree. Prior schemes needed multiple *concurrent* exclusive locks. B-Trees were around for about 10 years before L&Y. There is reason to think that pretty much every practical implementation has used read locks for many years, because there is a well-received 2001 paper [1] that describes a scheme where L&Y style B-link trees can *actually* be made to not require read locks, which discusses things like caching effects on contemporary hardware - it involves playing tricks with detecting and recovering from page-level inconsistencies, IIRC.
Furthermore, it references a scheme from the late 90s involving local copies of B-link pages. I thought about pursuing something like that myself, but the cost of latching (buffer locking) B-Trees in PostgreSQL is likely to be reduced before too long anyway, which makes the general idea seem unenticing right now.

Basically I think it will be better if you can explain in a bit more detail how right-links at all levels and the high key help to detect and recover from concurrent page splits.

You might be right about that - perhaps I should go into more detail.

[1] http://www.vldb.org/conf/2001/P181.pdf

--
Peter Geoghegan