Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On 01/14/2014 03:44 AM, Dave Chinner wrote: > On Tue, Jan 14, 2014 at 02:26:25AM +0100, Andres Freund wrote: >> On 2014-01-13 17:13:51 -0800, James Bottomley wrote: >>> a file into a user provided buffer, thus obtaining a page cache entry >>> and a copy in their userspace buffer, then insert the page of the user >>> buffer back into the page cache as the page cache page ... that's right, >>> isn't it postgres people? >> Pretty much, yes. We'd probably hint (*advise(DONTNEED)) that the page >> isn't needed anymore when reading. And we'd normally write if the page >> is dirty. > So why, exactly, do you even need the kernel page cache here? You've > got direct access to the copy of data read into userspace, and you > want direct control of when and how the data in that buffer is > written and reclaimed. Why push that data buffer back into the > kernel and then have to add all sorts of kernel interfaces to > control the page you already have control of?

To let the kernel do the job it is good at, namely managing the write-back of dirty buffers to disk and managing (possible) read-ahead pages. While we do have control of "the page", we do not (and really don't want to) have control of the complex and varied task of efficiently reading and writing to various file systems with possibly very different disk configurations. We much prefer the kernel to take care of this, and we generally like how it manages it. We have a few suggestions about giving the kernel extra info about the application's usage patterns of the data.

> >>> Effectively you end up with buffered read/write that's also mapped into >>> the page cache. It's a pretty awful way to hack around mmap. >> Well, the problem is that you can't really use mmap() for the things we >> do. Postgres' durability works by guaranteeing that our journal entries >> (called WAL := Write Ahead Log) are written & synced to disk before the >> corresponding entries of tables and indexes reach the disk. That also >> allows to group together many random-writes into a few contiguous writes >> fdatasync()ed at once. Only during a checkpointing phase the big bulk of >> the data is then (slowly, in the background) synced to disk. > Which is the exact algorithm most journalling filesystems use for > ensuring durability of their metadata updates. Indeed, here's an > interesting piece of architecture that you might like to consider: > > * Neither XFS and BTRFS use the kernel page cache to back their > metadata transaction engines.

But file system code is supposed to know much more about the underlying disk than a mere application program like postgresql. We do not want to start duplicating the OS if we can avoid it. What we would like is to have a way to tell the kernel:

1) "here is the modified copy of the file page, it is now safe to write it back" - the current 'lazy' write

2) "here is the page, write it back now, before returning success to me" - unbuffered write, or write + sync

but we also would like to have

3) "here is the page as it is currently on disk, I may need it soon, so keep it together with your other clean pages accessed at time X" - this is the non-dirtying write discussed. The page may be in the buffer cache, in which case just update its LRU position (to either the current time or a time provided by postgresql), or it may not be there, in which case put it there if its LRU position makes that reasonable.
And we would like all this to work together with the other current Linux kernel goodness of managing the whole disk-side interaction of efficient reading and writing and managing the buffers :)

> Why not? Because the page cache is too simplistic to adequately > represent the complex object hierarchies that the filesystems have > and so its flat LRU reclaim algorithms and writeback control > mechanisms are a terrible fit and cause lots of performance issues > under memory pressure.

The same is true for postgresql - if we just used direct writes and reads from disk then the performance would be terrible. We would need to duplicate all the complicated algorithms a file system uses for good performance if we were to start implementing that part of the file system ourselves.

> IOWs, the two most complex high performance transaction engines in > the Linux kernel have moved to fully customised cache and (direct) > IO implementations because the requirements for scalability and > performance are far more complex than the kernel page cache > infrastructure can provide.

And we would like to avoid implementing all this again by delegating this part of the work to said complex high performance transaction engines in the Linux kernel. We do not want to abandon all work on postgresql's business code and go into file system development mode for the next few years. Again, as said above, the Linux file system is doing fine. What we want is a few ways to interact with it to let it do even better when working with postgresql, by telling it some stuff it otherwise would have to second guess and by sometimes giving it back some cache pages which were copied away for potential modification but ended up clean in the end.
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Tue, Jan 14, 2014 at 2:29 PM, Tom Lane wrote: > Florian Pflug writes: > > I think it'd be worthwhile to get this into 9.4, if that's still an option, > > even if we only support COUNT. > > My thought is > > (1) we can certainly implement inverse transitions for COUNT() and the > integer variants of SUM(), and that alone would be a killer feature for > some people. > >

100% agreed about sum(int) and count, but I've so far managed only to find problems with SUM(numeric); the other inverse transition functions seem just fine with numeric types. E.g. with AVG(numeric) and STDDEV(numeric) I can't produce any results that differ from the unpatched head... but perhaps someone can give me a case where things could vary? I've been testing with something like:

select avg(n) over w, SUM(3::float) over w from (values(1,10::numeric),(2,0),(3,0)) v(i,n) window w as (order by i rows between current row and unbounded following);

where I included the sum(float) to force the old school brute force transitions to be performed. I then compared the results to the ones given without the sum(float) tagged on the end. I've so far not found anything out of place.

> (2) the float and numeric variants should be implemented under nondefault > names (I'm thinking FAST_SUM(), but bikeshed away). People who need > extra speed and don't mind the slightly different results can alter > their queries to use these variants. >

I'm not as keen on this idea: if someone later thought of a nice way to allow the inverse functions to provide exactly the same result as if they were not used, then they would become redundant legacy that would likely have to be supported forever and a day. I know we already have things like that, but of course we want to minimise them as much as possible. If AVG(numeric) and the stddev numeric aggregates turn out to be valid, then numeric_avg_accum_inv() must remain in core code to support that, and that is exactly the function a user needs to define their own SUM(numeric) with. We could leave it at that and document what a user must do to create their own FAST_SUM, then leave the decision on including these in core until later, based perhaps on the number of complaints about SUM(numeric) being slow when used in windows with a moving frame head.

(For me) It does not seem too unreasonable to think that one day we might decide that:

SELECT AVG(n) FROM (VALUES(10::NUMERIC(10,3)),(0),(0)) v(n);

should return 3.333 instead of what it does now. If we ever did that then we could support these numeric inverse transitions on SUM() without having to worry about weird extra trailing zeros. Instead we'd just give the user what they asked for, when they asked for it. Reasons like that make me think that we might be jumping the gun a little on FAST_SUM(), as it could end up redundant more quickly than we might initially think.

> One reason I'm thinking this is that whatever we do to ameliorate > the semantic issues is going to slow down the forward transition > function --- to no benefit unless the aggregate is being used as a > window function in a moving window. So I'm less than convinced > that we *should* implement any of these designs in the default > aggregates, even if we get to the point where we arguably *could* > do it with little risk of functional differences. > >

So far there's only one case of a possible slowdown of forward transitions, and that's with the numeric types.
I had to change do_numeric_accum() to add a NaN counter increment and also change the logic so that it continues summing non-NaN values even after the first NaN has been encountered. Would you see this as too big a performance risk to include in core? If not, then I think we might be able to support the AVG(numeric) and STDDEV(numeric) functions as the patch sits, without having to worry that there might be an extra overhead on forward transitions that include NaN values.

Florian has done some really good work researching the standards around this area. His research seems to indicate that the results should be of the same scale as the inputs, which the current patch does not do, provided you assume that the user is showing us the intended scale through the numbers that we're working with rather than a scale specified by the column or cast. Florian's idea of how to fix the scale on inverse transition seems pretty valid to me, and I think the overhead of it could be made pretty minimal. I'm just not sure that I can implement it in such a way as to make the overhead small enough to look like background noise. But I'm very willing to give it a try and see...

*** some moments pass ***

I think unless anyone has some objections I'm going to remove the inverse transition for SUM(numeric) and modify the documents to tell the user how to build their own FAST_SUM(numeric) using the built-in functions to do it. I'm starting to think that playing around with resetting numeric scale will turn a possible 9.4 patch into a 9.5/10.0 patch. I see no reason why what's there so far, minus sum(numeric), can't go in...
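To make the trailing zeros issue concrete, here is a made-up frame with a moving frame head (the numbers are only for illustration):

select i, sum(n) over (order by i rows between current row and unbounded following)
from (values (1, 1.01), (2, 2), (3, 3)) v(i, n);

Computed normally, the sum for row 2 is just 2 + 3 = 5. With an inverse transition the state starts out as 6.01 and has 1.01 removed from it, so unless the scale is reset the same row would come out as 5.00.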
Re: [HACKERS] Add CREATE support to event triggers
Hello

2014/1/13 Alvaro Herrera

> Alvaro Herrera escribió: > > > In an event trigger, the function > pg_event_trigger_get_creation_commands() > > returns the following JSON blob: > > After playing with this for a while, I realized something that must have > seemed quite obvious to those paying attention: what this function is, > is just a glorified sprintf() for JSON. So I propose we take our > existing format(text) and use it to model a new format(json) function, > which will be useful to the project at hand and be of more general > applicability. > > To make it a better fit, I have changed the spec slightly. The format > string is now the "fmt" element in the topmost JSON. This format string > can contain % escapes, which consist of: > > * the literal % itself > * an element name, enclosed in braces { }. The name can optionally be > followed by a colon and a possibly-empty array separator. > * a format specifier, which can be I (identifier), D (dotted name), or s > (string) > * Alternatively, %% expands to a literal %, as usual. > > For each such escape, the JSON object is searched using the element name > as key. For identifiers, the element is expected to be a string, and > will be quoted per identifier quoting rules. Dotted-names are used to > format possibly-qualified relation names and such; the element must be > an object with one, two or three string elements, each of which is > quoted per identifier rules, and output separated by periods. > > Finally, for arrays we expand each element in the JSON array element, > and separate them with the separator specified in the {} part of the > format specifier. > > For instance, > alvherre=# select format(json '{"fmt":"hello, %{who}s! This is %{name}I", > "who":"world", "name":"a function"}'); >format > -- > hello, world! This is "a function" > > Elements can be objects, in which case they are expanded recursively: a > "fmt" element is looked up and expanded as described above. > > > I don't yet see a need for %L escapes (that is, literals that can expand > to a single-quoted value or to NULL), but if I see it I will add that > too. >

I am not against this idea, although I don't see a strong benefit. A separate special-purpose function might be better - this has only a minimal relation to the variadic "function" format - and nested mixed formats can be messy.

> > -- > Álvaro Herrera http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
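For the record, this is the kind of use I would expect from the array-separator escape described above - a made-up example, since the exact behaviour is of course up to the patch:

select format(json '{"fmt":"CREATE TABLE %{name}I (%{columns:, }s)", "name":"foo", "columns":["a integer","b text"]}');
-- would presumably produce: CREATE TABLE "foo" (a integer, b text)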
Re: [HACKERS] Proposal: variant of regclass
> On Tue, Jan 14, 2014 at 4:28 PM, Yugo Nagata wrote: >> Here is the patch to implement to_regclass, to_regproc, to_regoper, >> and to_regtype. They are new functions similar to regclass, regproc, >> regoper, and regtype, except that if the requested object is not found >> they return InvalidOid rather than raising an error. > > You should add this patch to the upcoming commit fest (beginning > tomorrow actually), I am not seeing it in the list: > https://commitfest.postgresql.org/action/commitfest_view?id=21 > Thanks, > -- > Michael

Of course. The problem is that, in order to add a patch to the commitfest, the patch's URL is needed in the first place, and the URL does not exist until a posting has been made.

Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
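For reference, the proposed behaviour would look roughly like this (the table name here is made up):

select 'no_such_table'::regclass;    -- currently: ERROR: relation "no_such_table" does not exist
select to_regclass('no_such_table'); -- proposed: returns InvalidOid instead of raising an error
select to_regclass('pg_class');      -- resolves normally when the object exists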
Re: [HACKERS] Capturing EXPLAIN ANALYZE for a query without discarding the normal result set
On Tue, Jan 14, 2014 at 5:13 AM, Marti Raudsepp wrote: > You can use the auto_explain contrib module I just remembered that there's also the pg_stat_plans extension, which is closer to what you asked: https://github.com/2ndQuadrant/pg_stat_plans . This one you'll have to build yourself (it's not in contrib). Regards, Marti -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On 01/13/2014 11:22 PM, James Bottomley wrote: > >> The less exciting, more conservative option would be to add kernel >> interfaces to teach Postgres about things like raid geometries. Then >> Postgres could use directio and decide to do prefetching based on the >> raid geometry, how much available i/o bandwidth and iops is available, >> etc. >> >> Reimplementing i/o schedulers and all the rest of the work that the >> kernel provides inside Postgres just seems like something outside our >> competency and that none of us is really excited about doing. > This would also be a well trodden path ... I believe that some large > database company introduced Direct IO for roughly this purpose. >

File systems at that time were much worse than they are now, so said large companies had no choice but to write their own. Since Linux file handling has been much better for most of PostgreSQL's active development, we have been able to avoid that and still get reasonable performance. What has been pointed out above are some (allegedly desktop/mobile influenced) decisions which broke good performance.

Cheers -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] plpgsql.consistent_into
>> I am thinking the GUC and the plpgsql option can live together. If you want to
>> make a particular behaviour explicit, then you can use the plpgsql option. On the
>> other hand, I may want to use some functionality that is safe, but not want to
>> dirty the source code with repeated options; then I have to check (and account
>> for the risk of) the GUC settings.
>>
>> One idea: a required GUC? It could be nice to have a way to ensure certain GUC
>> settings, and restore those values or raise a warning.
>>
>> Back to the main topic. The feature as described doesn't change the behaviour of
>> the INTO clause. I can enable or disable this functionality and well-written code
>> should work without change (and without problems). When the check is disabled,
>> execution is just less safe. So in this case the impact of the GUC is significantly
>> smaller than the issues you described. Does anybody know a use case where this
>> check should be disabled?
>>
>> We probably have different experiences with GUCs. I had a problem with
>> standard_conforming_strings and the bytea format some years ago, and now I prepare
>> a document about the required settings. But I more often see (from my experience
>> in the Czech area) problems related to effective_cache_size or from_collapse_limit
>> and similar GUCs. These parameters are beyond the knowledge (and visibility) of
>> the typical user.
>>
> ISTM that in this case, it should be safe to make the new default behavior
> STRICT; if you forget to set the GUC to disable it then you'll get an error
> that points directly at the problem, at which point you'll go "Oh, yeah...
> I forgot to set X..."
>

I disagree - STRICT can be too restrictive - and a combination of SELECT INTO and IF FOUND can be significantly faster; our exception handling is not cheap.

Regards Pavel

> Outside of the GUC, I believe the default should definitely be STRICT. If
> your app is relying on non-strict then you need to be made aware of that.
> We should be able to provide a DO block that will change this setting for
> every function you've got if someone isn't happy with STRICT mode.
> --
> Jim C. Nasby, Data Architect j...@nasby.net
> 512.569.9461 (cell) http://jim.nasby.net
>
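For illustration, the two styles look something like this (the items table here is hypothetical):

-- SELECT INTO STRICT raises an error for zero or multiple rows, which the
-- caller has to trap in an exception block if it wants to handle it
CREATE OR REPLACE FUNCTION get_price_strict(p_id int) RETURNS numeric AS $$
DECLARE
  v_price numeric;
BEGIN
  SELECT price INTO STRICT v_price FROM items WHERE id = p_id;
  RETURN v_price;
END;
$$ LANGUAGE plpgsql;

-- plain SELECT INTO plus IF FOUND involves no exception machinery
CREATE OR REPLACE FUNCTION get_price(p_id int) RETURNS numeric AS $$
DECLARE
  v_price numeric;
BEGIN
  SELECT price INTO v_price FROM items WHERE id = p_id;
  IF NOT FOUND THEN
    RETURN NULL;
  END IF;
  RETURN v_price;
END;
$$ LANGUAGE plpgsql;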
Re: [HACKERS] GIN improvements part 1: additional information
On 01/13/2014 07:07 PM, Alexander Korotkov wrote:

I've fixed this bug and many other bugs. Now the patch passes the test suite that I've used earlier. The results are as follows:

Operations time:
         event         |     period
-----------------------+-----------------
 index_build           | 00:01:47.53915
 index_build_recovery  | 00:00:04
 index_update          | 00:05:24.388163
 index_update_recovery | 00:00:53
 search_new            | 00:24:02.289384
 search_updated        | 00:27:09.193343
(6 rows)

Index sizes:
     label     |   size
---------------+-----------
 new           | 384761856
 after_updates | 667942912
(2 rows)

Also, I made the following changes to the algorithms:

- Now, there is a limit on the number of uncompressed TIDs in the page. After reaching this limit, they are encoded regardless of whether they can fit in the page. That seems to me more desirable behaviour, and somehow it accelerates search speed. Before this change the times were the following:

         event         |     period
-----------------------+-----------------
 index_build           | 00:01:51.467888
 index_build_recovery  | 00:00:04
 index_update          | 00:05:03.315155
 index_update_recovery | 00:00:51
 search_new            | 00:24:43.194882
 search_updated        | 00:28:36.316784
(6 rows)

Hmm, that's strange. Any idea why that's happening? One theory is that when you re-encode the pages more aggressively, there are fewer pages with a mix of packed and unpacked items. Mixed pages are somewhat slower to scan than fully packed or fully unpacked pages, because GinDataLeafPageGetItems() has to merge the packed and unpacked items into a single list. But I wouldn't expect that effect to be large enough to explain the results you got.

- Pages are not fully re-encoded if it's enough to re-encode just the last segment.

Great! We should also take advantage of that in the WAL record that's written; no point in WAL-logging all the segments if we know that only the last one was modified.

- Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing wrote: > Again, as said above the linux file system is doing fine. What we > want is a few ways to interact with it to let it do even better when > working with postgresql by telling it some stuff it otherwise would > have to second guess and by sometimes giving it back some cache > pages which were copied away for potential modifying but ended > up clean in the end.

You don't need new interfaces. Only a slight modification of what fadvise DONTNEED does. This insistence on injecting pages from postgres into the kernel is just a bad idea. At the very least, it still needs postgres to know too much about the filesystem (block layout) to work properly. I.e. pg would be required to put entire filesystem-level blocks into the page cache, since that's how the page cache works. At the very worst, it may introduce serious security and reliability implications, when applications can destroy the consistency of the page cache (even if full access rights are checked, there's still the possibility this inconsistency might be exploitable). Simply making fadvise DONTNEED move pages to the head of the LRU (i.e. discard next if you need) should work as expected without all the complication of the above proposal. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part 1: additional information
On Tue, Jan 14, 2014 at 12:34 PM, Heikki Linnakangas < hlinnakan...@vmware.com> wrote: > On 01/13/2014 07:07 PM, Alexander Korotkov wrote: > >> I've fixed this bug and many other bug. Now patch passes test suite that >> I've used earlier. The results are so: >> >> Operations time: >> event | period >> ---+- >> index_build | 00:01:47.53915 >> index_build_recovery | 00:00:04 >> index_update | 00:05:24.388163 >> index_update_recovery | 00:00:53 >> search_new| 00:24:02.289384 >> search_updated| 00:27:09.193343 >> (6 rows) >> >> Index sizes: >> label | size >> ---+--- >> new | 384761856 >> after_updates | 667942912 >> (2 rows) >> >> Also, I made following changes in algorithms: >> >> - Now, there is a limit to number of uncompressed TIDs in the page. >> >> After reaching this limit, they are encoded independent on if they >> can fit >> page. That seems to me more desirable behaviour and somehow it >> accelerates >> search speed. Before this change times were following: >> >> event | period >> ---+- >> index_build | 00:01:51.467888 >> index_build_recovery | 00:00:04 >> index_update | 00:05:03.315155 >> index_update_recovery | 00:00:51 >> search_new| 00:24:43.194882 >> search_updated| 00:28:36.316784 >> (6 rows) >> > > Hmm, that's strange. Any idea why that's happening? One theory is that > when you re-encode the pages more aggressively, there are fewer pages with > a mix of packed and unpacked items. Mixed pages are somewhat slower to scan > than fully packed or fully unpacked pages, because > GinDataLeafPageGetItems() has to merge the packed and unpacked items into a > single list. But I wouldn't expect that effect to be large enough to > explain the results you got. > Probably, it's because of less work in ginMergeItemPointers. - Page are not fully re-encoded if it's enough to re-encode just last >> segment. >> > > Great! We should also take advantage of that in the WAL record that's > written; no point in WAL-logging all the segments, if we know that only > last one was modified. Already. -- With best regards, Alexander Korotkov.
Re: [HACKERS] UNION ALL on partitioned tables won't use indices.
This is cont'd from the previous CF3. You'll find the overview and the discussion since then in the thread beginning there. The issue remains as of the current 9.4dev head. http://www.postgresql.org/message-id/20131024.193953.233464126.horiguchi.kyot...@lab.ntt.co.jp

The issue in brief is that UNION ALL on inheritance tables does not make use of indices, in contrast to UNION ALL on bare tables. This is because of appendrels with a depth of more than 2 levels brought in by inheritance tables. Some UNION ALL operations - especially those using ORDER BY and LIMIT - will be accelerated by this patch. Details are in the message above.

I proposed 3 types of solution, but one of them - tweaking the Equivalence Class (Type 1) - was dismissed because of inappropriate manipulation of Equivalence Classes. The rest do essentially the same thing - flattening appendrels - at different timings. In type 2, the process is implemented as a generic appendrel flattening function and applied after expand_inherited_tables() in subquery_planner. In type 3, the equivalent process is combined into expand_inherited_rtentry(). Type 2 loops once over the whole appendrel list (in most cases) so it might be more efficient than type 3, which searches for the parent appendrel for each appended one. Type 3 is the more appropriate design assuming that appendrel trees deeper than 2 levels will be generated only by expand_inherited_tables().

The attached two patches are rebased to the current 9.4dev HEAD and pass make check at the topmost directory and in src/test/isolation without error. One bug was found and fixed on the way. It was an assertion failure caused by a probably unexpected type conversion introduced by collapse_appendrels(), which leads to an implicit whole-row cast from some valid reltype to an invalid reltype (0) in adjust_appendrel_attrs_mutator(). Attached files are the two versions of patches mentioned above.
- unionall_inh_idx_typ2_v4_20140114.patch - unionall_inh_idx_typ3_v4_20140114.patch regards, -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c index e249628..582e8c3 100644 --- a/src/backend/optimizer/prep/prepunion.c +++ b/src/backend/optimizer/prep/prepunion.c @@ -57,6 +57,11 @@ typedef struct AppendRelInfo *appinfo; } adjust_appendrel_attrs_context; +typedef struct { + AppendRelInfo *child_appinfo; + Index target_rti; +} transvars_merge_context; + static Plan *recurse_set_operations(Node *setOp, PlannerInfo *root, double tuple_fraction, List *colTypes, List *colCollations, @@ -98,6 +103,8 @@ static List *generate_append_tlist(List *colTypes, List *colCollations, List *input_plans, List *refnames_tlist); static List *generate_setop_grouplist(SetOperationStmt *op, List *targetlist); +static Node *transvars_merge_mutator(Node *node, + transvars_merge_context *ctx); static void expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti); static void make_inh_translation_list(Relation oldrelation, @@ -1210,6 +1217,34 @@ expand_inherited_tables(PlannerInfo *root) } } +static Node * +transvars_merge_mutator(Node *node, transvars_merge_context *ctx) +{ + if (node == NULL) + return NULL; + + if (IsA(node, Var)) + { + Var *oldv = (Var*)node; + + if (!oldv->varlevelsup && oldv->varno == ctx->target_rti) + { + if (oldv->varattno > + list_length(ctx->child_appinfo->translated_vars)) +elog(ERROR, + "attribute %d of relation \"%s\" does not exist", + oldv->varattno, + get_rel_name(ctx->child_appinfo->parent_reloid)); + + return (Node*)copyObject( +list_nth(ctx->child_appinfo->translated_vars, + oldv->varattno - 1)); + } + } + return expression_tree_mutator(node, + transvars_merge_mutator, (void*)ctx); +} + /* * expand_inherited_rtentry * Check whether a rangetable entry represents an inheritance set. @@ -1237,6 +1272,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti) List *inhOIDs; List *appinfos; ListCell *l; + AppendRelInfo *parent_appinfo = NULL; /* Does RT entry allow inheritance? */ if (!rte->inh) @@ -1301,6 +1337,21 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti) oldrc->isParent = true; /* + * If parent relation is appearing in a subselect of UNION ALL, it has + * furthur parent appendrelinfo. Save it to pull up inheritance children + * later. + */ + foreach(l, root->append_rel_list) + { + AppendRelInfo *appinfo = (AppendRelInfo *)lfirst(l); + if(appinfo->child_relid == rti) + { + parent_appinfo = appinfo; + break; + } + } + + /* * Must open the parent relation to examine its tupdesc. We need not lock * it; we assume the rewriter already did. */ @@ -1378,6 +1429,28 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti) } /* + * Pull up this appinfo onto just above of the parent. The parent +
Re: [HACKERS] Get more from indices.
Hello, since CF4 is already closed but this patch ramains marked as 'Ready for Committer', please let me re-post the latest version for CF4 to get rid of vanishing :-p > tgl> But aside from hasty typos, > > Oops! I've picked up wrong node. It always denies pathkeys extension. > > | !IsA(member, Var)) > > is a mistake of the following. > > | !IsA(member->em_expr, Var)) regards, -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index 4f63906..b695e40 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -1790,6 +1790,7 @@ _outIndexOptInfo(StringInfo str, const IndexOptInfo *node) WRITE_BOOL_FIELD(predOK); WRITE_BOOL_FIELD(unique); WRITE_BOOL_FIELD(immediate); + WRITE_BOOL_FIELD(allnotnull); WRITE_BOOL_FIELD(hypothetical); /* we don't bother with fields copied from the pg_am entry */ } diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 606734a..0b2f529 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -952,8 +952,11 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel, { index_pathkeys = build_index_pathkeys(root, index, ForwardScanDirection); - useful_pathkeys = truncate_useless_pathkeys(root, rel, - index_pathkeys); + if (index_pathkeys_are_extensible(root, index, index_pathkeys)) + useful_pathkeys = root->query_pathkeys; + else + useful_pathkeys = truncate_useless_pathkeys(root, rel, + index_pathkeys); orderbyclauses = NIL; orderbyclausecols = NIL; } diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index 9c8ede6..ad3a8b7 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -502,6 +502,60 @@ build_index_pathkeys(PlannerInfo *root, } /* + * index_pathkeys_are_extensible + * Check whether the pathkeys are extensible to query_pathkeys. + */ +bool +index_pathkeys_are_extensible(PlannerInfo *root, + IndexOptInfo *index, + List *pathkeys) +{ + bool result; + ListCell *lc1; + + if (root->query_pathkeys == NIL || pathkeys == NIL) + return false; + + if (!index->unique || !index->immediate || !index->allnotnull) + return false; + + if (!pathkeys_contained_in(pathkeys, root->query_pathkeys)) + return false; + + if (list_length(pathkeys) == list_length(root->query_pathkeys)) + return true; + + result = true; + foreach(lc1, root->query_pathkeys) + { + PathKey*pathkey = (PathKey *) lfirst(lc1); + bool found = false; + ListCell *lc2; + + foreach(lc2, pathkey->pk_eclass->ec_members) + { + EquivalenceMember *member = (EquivalenceMember *) lfirst(lc2); + + if (!bms_equal(member->em_relids, index->rel->relids) || +!IsA(member->em_expr, Var)) +continue; + else + { +found = true; +break; + } + } + + if (!found) + { + result = false; + break; + } + } + return result; +} + +/* * build_expression_pathkey * Build a pathkeys list that describes an ordering by a single expression * using the given sort operator. 
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c index de981cb..4e24220 100644 --- a/src/backend/optimizer/util/plancat.c +++ b/src/backend/optimizer/util/plancat.c @@ -333,6 +333,26 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent, info->immediate = index->indimmediate; info->hypothetical = false; + info->allnotnull = true; + for (i = 0; i < ncolumns; i++) + { +int attrno = info->indexkeys[i]; + +if (attrno == 0) +{ + info->allnotnull = false; + break; +} +else if (attrno > 0) +{ + if (!relation->rd_att->attrs[attrno - 1]->attnotnull) + { + info->allnotnull = false; + break; + } +} + } + /* * Estimate the index size. If it's not a partial index, we lock * the number-of-tuples estimate to equal the parent table; if it diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h index a9219e0..d9a3b9b 100644 --- a/src/include/nodes/relation.h +++ b/src/include/nodes/relation.h @@ -525,6 +525,7 @@ typedef struct IndexOptInfo bool predOK; /* true if predicate matches query */ bool unique; /* true if a unique index */ bool immediate; /* is uniqueness enforced immediately? */ + bool allnotnull; /* true if index's keys are all not null */ bool hypothetical; /* true if index doesn't really exist */ bool canreturn; /* can index return IndexTuples? */ bool amcanorderbyop; /* does AM support order by operator result? */ diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h index 999adaa..5ee2e56 100644 --- a/src/include/optimizer/paths.h +++ b/src/include/optimizer/paths.h @@ -196,5 +196,8 @@ extern List *truncate_useless_pathkeys(PlannerInfo *root, RelOptInfo *rel,
Re: [HACKERS] Using indices for UNION.
This is cont'd from CF3. http://www.postgresql.org/message-id/20131122.165927.27412386.horiguchi.kyot...@lab.ntt.co.jp

The issue in brief is that UNION is never flattened, unlike UNION ALL, so UNION cannot make use of index scans even when they would be usable. This patch flattens UNION in the same way as is currently done for UNION ALL.

This patch needs the other 'UNION ALL' patch, and gains further with yet another 'Widening application of indices' patch ('Ready for Committer' now in CF3..) http://www.postgresql.org/message-id/20140114.180447.145186052.horiguchi.kyot...@lab.ntt.co.jp http://www.postgresql.org/message-id/20140114.180810.122352231.horiguchi.kyot...@lab.ntt.co.jp

You can see the detailed outline in the message here: http://www.postgresql.org/message-id/20131031.194251.12577697.horiguchi.kyot...@lab.ntt.co.jp

The attached patch is rebased to the current 9.4dev HEAD and passes make check at the topmost directory and in src/test/isolation without error. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
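To give a rough idea of the target, with this flattening a query of the following shape (t1 and t2 here are hypothetical tables with btree indexes on x) can be driven by index scans under a MergeAppend instead of appending everything and then sorting and de-duplicating the whole result:

select x from t1
union
select x from t2
order by x
limit 10;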
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
On 01/14/2014 12:26 AM, Mel Gorman wrote: On Mon, Jan 13, 2014 at 03:15:16PM -0500, Robert Haas wrote: The other thing that comes to mind is the kernel's caching behavior. We've talked a lot over the years about the difficulties of getting the kernel to write data out when we want it to and to not write data out when we don't want it to. Is sync_file_range() broke? When it writes data back to disk too aggressively, we get lousy throughput because the same page can get written more than once when caching it for longer would have allowed write-combining. Do you think that is related to dirty_ratio or dirty_writeback_centisecs? If it's dirty_writeback_centisecs then that would be particularly tricky because poor interactions there would come down to luck basically. When it doesn't write data to disk aggressively enough, we get huge latency spikes at checkpoint time when we call fsync() and the kernel says "uh, what? you wanted that data *on the disk*? sorry boss!" and then proceeds to destroy the world by starving the rest of the system for I/O for many seconds or minutes at a time. Ok, parts of that are somewhat expected. It *may* depend on the underlying filesystem. Some of them handle fsync better than others. If you are syncing the whole file though when you call fsync then you are potentially burned by having to writeback dirty_ratio amounts of memory which could take a substantial amount of time. We've made some desultory attempts to use sync_file_range() to improve things here, but I'm not sure that's really the right tool, and if it is we don't know how to use it well enough to obtain consistent positive results. That implies that either sync_file_range() is broken in some fashion we (or at least I) are not aware of and that needs kicking. Let me try to explain the problem: Checkpoints can cause an I/O spike, which slows down other processes. When it's time to perform a checkpoint, PostgreSQL will write() all dirty buffers from the PostgreSQL buffer cache, and finally perform an fsync() to flush the writes to disk. After that, we know the data is safely on disk. In older PostgreSQL versions, the write() calls would cause an I/O storm as the OS cache quickly fills up with dirty pages, up to dirty_ratio, and after that all subsequent write()s block. That's OK as far as the checkpoint is concerned, but it significantly slows down queries running at the same time. Even a read-only query often needs to write(), to evict a dirty page from the buffer cache to make room for a different page. We made that less painful by adding sleeps between the write() calls, so that they are trickled over a long period of time and hopefully stay below dirty_ratio at all times. However, we still have to perform the fsync()s after the writes(), and sometimes that still causes a similar I/O storm. The checkpointer is not in a hurry. A checkpoint typically has 10-30 minutes to finish, before it's time to start the next checkpoint, and even if it misses that deadline that's not too serious either. But the OS doesn't know that, and we have no way of telling it. As a quick fix, some sort of a lazy fsync() call would be nice. It would behave just like fsync() but it would not change the I/O scheduling at all. Instead, it would sleep until all the pages have been flushed to disk, at the speed they would've been without the fsync() call. Another approach would be to give the I/O that the checkpointer process initiates a lower priority. 
This would be slightly preferable, because PostgreSQL could then issue the writes() as fast as it can, and have the checkpoint finish earlier when there's not much other load. Last I looked into this (which was a long time ago), there was no suitable priority system for writes, only reads. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using indices for UNION.
Sorry, I missed to attach file. > This is cont'd from CF3. > > http://www.postgresql.org/message-id/20131122.165927.27412386.horiguchi.kyot...@lab.ntt.co.jp > > The issue in brief is that UNION is never flattened differently > to UNION ALL so UNION cannot make use of index scan even if > usable. > > This patch flattens UNION likewise currently did for UNION ALL. > > This patch needs another 'UNION ALL' patch and further gain with > even another 'Widening application of indices' patch. ('Ready for > Commit' now in CF3..) > > http://www.postgresql.org/message-id/20140114.180447.145186052.horiguchi.kyot...@lab.ntt.co.jp > http://www.postgresql.org/message-id/20140114.180810.122352231.horiguchi.kyot...@lab.ntt.co.jp > > > You can see the detailed outlines in the message at here, > > http://www.postgresql.org/message-id/20131031.194251.12577697.horiguchi.kyot...@lab.ntt.co.jp > > The attached patche is rebased to current 9.4dev HEAD and make > check at the topmost directory and src/test/isolation are passed > without error. regards, -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 1da4b2f..e89f8b3 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -353,13 +353,14 @@ subquery_planner(PlannerGlobal *glob, Query *parse, pull_up_subqueries(root, (Node *) parse->jointree); /* - * If this is a simple UNION ALL query, flatten it into an appendrel. We - * do this now because it requires applying pull_up_subqueries to the leaf - * queries of the UNION ALL, which weren't touched above because they - * weren't referenced by the jointree (they will be after we do this). + * If this is a simple UNION/UNION ALL query, flatten it into an + * appendrel. We do this now because it requires applying + * pull_up_subqueries to the leaf queries of the UNION/UNION ALL, which + * weren't touched above because they weren't referenced by the jointree + * (they will be after we do this). 
*/ if (parse->setOperations) - flatten_simple_union_all(root); + flatten_simple_union(root); /* * Detect whether any rangetable entries are RTE_JOIN kind; if not, we can diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c index 1c6083b..04bba48 100644 --- a/src/backend/optimizer/prep/prepjointree.c +++ b/src/backend/optimizer/prep/prepjointree.c @@ -32,6 +32,7 @@ #include "optimizer/subselect.h" #include "optimizer/tlist.h" #include "parser/parse_relation.h" +#include "parser/parse_clause.h" #include "parser/parsetree.h" #include "rewrite/rewriteManip.h" @@ -81,8 +82,8 @@ static void make_setop_translation_list(Query *query, Index newvarno, static bool is_simple_subquery(Query *subquery, RangeTblEntry *rte, JoinExpr *lowest_outer_join); static bool is_simple_union_all(Query *subquery); -static bool is_simple_union_all_recurse(Node *setOp, Query *setOpQuery, - List *colTypes); +static bool is_simple_union_recurse(SetOperationStmt *topop, Node *setOp, + Query *setOpQuery, List *colTypes); static bool is_safe_append_member(Query *subquery); static bool jointree_contains_lateral_outer_refs(Node *jtnode, bool restricted, Relids safe_upper_varnos); @@ -1440,12 +1441,14 @@ is_simple_union_all(Query *subquery) return false; /* Recursively check the tree of set operations */ - return is_simple_union_all_recurse((Node *) topop, subquery, - topop->colTypes); + if (topop->op != SETOP_UNION || !topop->all) return false; + return is_simple_union_recurse(topop, (Node *) topop, subquery, + topop->colTypes); } static bool -is_simple_union_all_recurse(Node *setOp, Query *setOpQuery, List *colTypes) +is_simple_union_recurse(SetOperationStmt *topop, Node *setOp, + Query *setOpQuery, List *colTypes) { if (IsA(setOp, RangeTblRef)) { @@ -1463,13 +1466,16 @@ is_simple_union_all_recurse(Node *setOp, Query *setOpQuery, List *colTypes) { SetOperationStmt *op = (SetOperationStmt *) setOp; - /* Must be UNION ALL */ - if (op->op != SETOP_UNION || !op->all) + /* All SetOps under topop are UNION and ->all is same to topop */ + if (op->op != SETOP_UNION || op->all != topop->all) return false; /* Recurse to check inputs */ - return is_simple_union_all_recurse(op->larg, setOpQuery, colTypes) && - is_simple_union_all_recurse(op->rarg, setOpQuery, colTypes); + return + is_simple_union_recurse(topop, op->larg, + setOpQuery, colTypes) && + is_simple_union_recurse(topop, op->rarg, + setOpQuery, colTypes); } else { @@ -1908,23 +1914,22 @@ pullup_replace_vars_subquery(Query *query, NULL); } - /* * flatten_simple_union_all - * Try to optimize top-level UNION ALL structure into an appendrel + * Try to optimize top-level UNION/UNION ALL structure into an appendrel * - * If a query's setOperations tree consists e
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Tue, Jan 14, 2014 at 9:09 PM, David Rowley wrote: > I think unless anyone has some objections I'm going to remove the inverse > transition for SUM(numeric) and modify the documents to tell the user how > to build their own FAST_SUM(numeric) using the built in functions to do it. > I'm starting to think that playing around with resetting numeric scale will > turn a possible 9.4 patch into a 9.5/10.0 patch. I see no reason why what's > there so far, minus sum(numeric), can't go in... > >

Of course it's only now that I discover that it's not possible to do this:

CREATE AGGREGATE fast_sum (numeric) (
  stype = numeric,
  sfunc = numeric_avg_accum,
  invfunc = numeric_avg_accum_inv,
  finalfunc = numeric_sum
);

because SUM(numeric) uses an internal type to store the transition state.

hmmm, built-in fast_sum anyone? Is there any simple way to limit these to only be used in the context of a window? If so, is it worth it? Would we want fast_sum() for float too?

Regards David Rowley
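The closest thing to a pure SQL-level workaround I can see would be a sketch like the one below, built from plain numeric addition and subtraction and using the patch's invfunc syntax. Note that, unlike SUM(numeric), it returns 0 rather than NULL over an empty frame, and it cannot recover correctly once a NaN has entered the frame:

CREATE AGGREGATE fast_sum (numeric) (
  stype = numeric,
  initcond = '0',
  sfunc = numeric_add,   -- state + value
  invfunc = numeric_sub  -- state - value
);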
Re: [HACKERS] plpgsql.consistent_into
2014/1/14 Florian Pflug

> On Jan14, 2014, at 00:52 , Marko Tiikkaja wrote: > > When I've worked with PL/PgSQL, this has been a source of a few bugs that > > would have been noticed during testing if the behaviour of INTO wasn't as > > dangerous as it is right now. > > The question is, how many bugs stemmed from wrong SQL queries, and what > percentage of those would have been caught by this? The way I see it, there > are thousands of ways to screw up a query, and having it return multiple > rows instead of one is just one of them. > > > Yes, it breaks backwards compatibility, but that's why there's a nice > GUC. > > Which doesn't help, because the GUC isn't tied to the code. This *adds* > an error case, not remove one - now, instead of getting your code correct, > you *also* have to get the GUC correct. If you even *know* that such a GUC > exists. > > > If we're not going to scrap PL/PgSQL and > > start over again, we are going to have to do changes like this to make > the > > language better. Also I think that out of all the things we could do to > > break backwards compatibility, this is closer to "harmless" than "a pain > > in the butt". > > I very strongly believe that languages don't get better by adding a > thousand > little knobs which subtly change semantics. Look at the mess that is PHP - > we absolutely, certainly don't want to go there. The most important rule in > language design is in my opinion "stick with your choices". C, C++ and JAVA > all seem to follow this, and it's one of the reasons these languages are > popular for big projects, I think. > > The way I see it, the only OK way to change existing behaviour is to have > the concept of a "language version", and force code to indicate the > language > version it expects. The important thing is that the language version is an > attribute of code, not some global setting that you can change without ever > looking at the code it'd affect. > > So if we really want to change this, I think we need to have a > LANGUAGE_VERSION attribute on functions. Each time a major postgres release > changes the behaviour of one of the procedural languages, we'd increment > that language's version, and enable the old behaviour for all functions > tagged with an older one. >

I dislike this proposal - too enterprise, too complex, too bad. After ten years we could have ten language versions and it would help nothing.

Back to the topic:

a) there is agreement that we would like this functionality as a plpgsql option;

b) there is no agreement that we would like this functionality as the default (now or in the near future).

@b is a topic that we should not try to solve here; it comes back again and again. Sometimes we have success, sometimes not. A temporary GUC is not a sufficient solution for some people. So my conclusion: @a can be implemented, @b not, and we can continue in another thread about a SELECT INTO redesign and about the possibilities that we do or do not have. It is a matter of personal perspective - someone can think of the missing check after INTO as a feature, someone else can see it as a bug. My opinion: it is a bug, with possibly ambiguous results and possible negative performance impacts. Probably we can make the documentation more precise about wrong and good usage of the SELECT INTO clause.

I am working on a plpgsql_debug extension, and I think I am able (with a small change in the plpgsql executor) to implement this check as an extension. Then it can be available for advanced users (those who know about additional plpgsql extensions).

Regards Pavel

> best regards, > Florian Pflug >
Re: [HACKERS] inherit support for foreign tables
2013/11/18 Tom Lane : > Robert Haas writes: >> I think it's been previously proposed that we have some version of a >> CHECK constraint that effectively acts as an assertion for query >> optimization purposes, but isn't actually enforced by the system. I >> can see that being useful in a variety of situations, including this >> one. > > Yeah, I think it would be much smarter to provide a different syntax > to explicitly represent the notion that we're only assuming the condition > is true, and not trying to enforce it.

I'd like to revisit this feature. An explicit notation to represent not-enforcing (assertive?) constraints is an interesting idea. I'm not sure which word is most appropriate, but for example let's use the word ASSERTIVE as a new constraint attribute.

CREATE TABLE parent (
    id int NOT NULL,
    "group" int NOT NULL,
    name text
);
CREATE FOREIGN TABLE child_grp1 (
    /* no additional column */
) INHERITS (parent) SERVER server1;
ALTER TABLE child_grp1 ADD CONSTRAINT chk_group1 CHECK ("group" = 1) ASSERTIVE;

If ASSERTIVE is specified, it's not guaranteed that the constraint is enforced completely, so it should be treated as a hint for the planner. As Robert mentioned, enforcing as much as we can during INSERT/UPDATE is one option for this issue. In addition, an idea I can't throw away is to treat all constraints defined on foreign tables as ASSERTIVE. Foreign tables always carry the risk of holding "wrong" data, because the source data can be updated without going through the foreign table. This is not specific to any particular FDW, so IMO constraints defined on foreign tables are basically ASSERTIVE. Of course PG can try to keep the data correct, but somebody might always break it.

Besides CHECK constraints, currently NOT NULL constraints are virtually ASSERTIVE (not enforced). Should they also be marked explicitly?

Thoughts? -- Shigeru HANADA -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
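As a rough sketch of the planner-hint side, assuming an ASSERTIVE check would participate in constraint exclusion the same way an ordinary inherited CHECK constraint does, a query that contradicts it could skip the foreign child entirely:

SET constraint_exclusion = partition;
EXPLAIN SELECT * FROM parent WHERE "group" = 2;
-- child_grp1 could be excluded from the plan without ever contacting the remote server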
[HACKERS] Trigger information for auto_explain.
Hello, Now explain can show trigger statistics (from when?). =# create table t (a int, b int); =# create or replace function hoge() returns trigger as 'begin new.b = new.a; return new; end;' language plpgsql; =# create trigger ins_hoge before insert or update on t for each row execute procedure hoge(); Explain analyze shows trigger information. =# explain analyze insert into t (select a from generate_series(0, 100) a); |QUERY PLAN | | Insert on t (cost=0.00..10.00 rows=1000 width=4) (actual time=2.712. | -> Function Scan on generate_series a (cost=0.00..10.00 rows=1000 width=4) | (actual time=0.047..0.147 rows=101 loops=1) | Trigger ins_hoge: time=1.881 calls=101 | Total runtime: 3.009 ms On the other hand, auto_explain doesn't. =# load auto_explain; =# set auto_explain.log_min_duration = 0; =# set auto_explain.log_analyze = 'yes'; =# insert into t (select a from generate_series(0, 100) a); |LOG: duration: 2.871 ms plan: |Query Text: insert into t (select a from generate_series(0, 100) a); |Insert on t (cost=0.00..10.00 rows=1000 width=4) | -> Function Scan on generate_series a (cost=0.00..10.00 ... auto_explain will show trigger infos with this patch like this. =# set auto_explain.log_triggers = 'yes'; =# insert into t (select a from generate_series(0, 100) a); | LOG: duration: 2.098 ms plan: | Query Text: insert into t (select a from generate_series(0, 100) a); | Insert on t (cost=0.00..10.00 rows=1000 width=4) (actual time=2.097..2.097 rows=0 loops=1) | -> Function Scan on generate_series a (cost=0.00..10.00 rows=1000 width=4) (actual time=0.044..0.123 rows=101 loops=1) | Trigger ins_hoge: time=1.452 calls=101 This patch consists of two parts, 0001_expose_explain_triggers_v1_20140114.patch Expose the code to print out trigger information currently hidden in ExplainOnePlan(). 0002_auto_explain_triggers_v1_20140114.patch Enable auto_explain to output trigger information. Documentation will be added later.. regards, -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 9969a25..45beb5b 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -484,29 +484,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es, /* Print info about runtime of triggers */ if (es->analyze) - { - ResultRelInfo *rInfo; - bool show_relname; - int numrels = queryDesc->estate->es_num_result_relations; - List *targrels = queryDesc->estate->es_trig_target_relations; - int nr; - ListCell *l; - - ExplainOpenGroup("Triggers", "Triggers", false, es); - - show_relname = (numrels > 1 || targrels != NIL); - rInfo = queryDesc->estate->es_result_relations; - for (nr = 0; nr < numrels; rInfo++, nr++) - report_triggers(rInfo, show_relname, es); - - foreach(l, targrels) - { - rInfo = (ResultRelInfo *) lfirst(l); - report_triggers(rInfo, show_relname, es); - } - - ExplainCloseGroup("Triggers", "Triggers", false, es); - } + ExplainPrintTriggers(es, queryDesc); /* * Close down the query and free resources. Include time for this in the @@ -563,6 +541,42 @@ ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc) } /* + * ExplainPrintTriggers - + + * convert a QueryDesc's trigger statistics to text and append it to + * es->str + * + * The caller should have set up the options fields of *es, as well as + * initializing the output buffer es->str. Other fields in *es are + * initialized here. 
+ */ +void +ExplainPrintTriggers(ExplainState *es, QueryDesc *queryDesc) +{ + ResultRelInfo *rInfo; + bool show_relname; + int numrels = queryDesc->estate->es_num_result_relations; + List *targrels = queryDesc->estate->es_trig_target_relations; + int nr; + ListCell *l; + + ExplainOpenGroup("Triggers", "Triggers", false, es); + + show_relname = (numrels > 1 || targrels != NIL); + rInfo = queryDesc->estate->es_result_relations; + for (nr = 0; nr < numrels; rInfo++, nr++) + report_triggers(rInfo, show_relname, es); + + foreach(l, targrels) + { + rInfo = (ResultRelInfo *) lfirst(l); + report_triggers(rInfo, show_relname, es); + } + + ExplainCloseGroup("Triggers", "Triggers", false, es); +} + +/* * ExplainQueryText - * add a "Query Text" node that contains the actual text of the query * diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h index ca213d7..4c4226d 100644 --- a/src/include/commands/explain.h +++ b/src/include/commands/explain.h @@ -71,6 +71,7 @@ extern void ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, const char *queryString, ParamListInfo params); extern void ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc); +extern void ExplainPrintTriggers(ExplainState *es, QueryDesc *queryDesc); e
Re: [HACKERS] Case sensitive mode in windows build option
On 01/14/2014 11:25 AM Craig Ringer Wrote, > > As per current behavior if user want to build in debug mode in > > windows, then he need to give debug in capital letters (DEBUG), > > > > I think many user will always make mistake in giving this option, in > > my opinion we can make it case insensitive. > > The idea seems reasonable, the implementation does not. You've changed > the meaning rather more than making it case insensitive. > > Use the Perl 'lc' function to compare a lower-cased input instead. > > http://perldoc.perl.org/functions/lc.html

I think I have done the same thing: I converted the user input to upper case and compared it with DEBUG, so this always gives a case-insensitive comparison. Now we can input debug in any case (Debug, DEBUG, debug..) and it will work fine.

And to achieve this I used the Perl 'uc' function to compare the upper-cased input.

! if (uc($ARGV[0]) eq 'DEBUG') { $bconf = "Debug"; }

Will it make any difference if I use 'lc' instead of 'uc'? Please correct me if I have missed something here..

Regards, Dilip -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Case sensitive mode in windows build option
On 01/14/2014 05:35 PM, Dilip kumar wrote: > On 01/14/2014 11:25 AM Craig Ringer Wrote, > >>> As per current behavior if user want to build in debug mode in >>> windows, then he need to give debug in capital letters (DEBUG), >>> >>> I think many user will always make mistake in giving this option, in >>> my opinion we can make it case insensitive. >> >> The idea seems reasonable, the implementation does not. You've changed >> the meaning rather more than making it case insensitive. >> >> Use the Perl 'lc' function to compare a lower-cased input instead. >> >> http://perldoc.perl.org/functions/lc.html > > I think I have done the same thing, converted user input to upper case and > compared with DEBUG, so this will always give the case insensitive comparison. > Now we can input debug in any case (Debug, DEBUG, debug..) and it will work > fine.. You're completely right. My apologies. I'm not used to reading that awful (IMO) context-diff format - despite it being the official standard for PostgreSQL, I still misread it on a regular basis. I saw: ! if (uc($ARGV[0]) eq 'DEBUG') ... ! elsif (uc($ARGV[0]) ne "RELEASE") and thought "WTF?". In fact, the WTF is all me. Please disregard. This seems quite sensible. Please add it to the commitfest app if it isn't there already: http://commitfest.postgresql.org/ and I'll sign up as a reviewer so I can do some build-testing on it after the rather pressing deadline I've got in the next couple of days has passed. If you don't hear from me by Friday, poke me. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] nested hstore patch
Erik, thanks for docs fixes, we have even more :) Oleg On Tue, Jan 14, 2014 at 4:18 AM, Erik Rijkers wrote: > On Mon, January 13, 2014 18:30, Andrew Dunstan wrote: >> >> >> On 01/13/2014 11:16 AM, Oleg Bartunov wrote: >>> Andrew, >>> >>> did you run perl script ? Actually, I found, that operator table needs >>> to be fixed. >>> >> >> No. My build machine doesn't actually have DBD::Pg installed. Can you >> send me a patch if you don't want to push it yourself, or maybe Erik can >> send a pacth top adjust the table. >> > >> [ nested_hstore_and_jsonb-2.patch ] > > ( centos 6.5, gcc 4.8.2. ) > > The patch applies & compiles with warnings (see below). > > The opr_sanity test fails during make check: regression.diffs attached. > > Also attached are changes to hstore.sgml, to operator + functions table, plus > some typos. > > Thanks, > Erik Rijkers > > > make > > jsonfuncs.c: In function ‘each_object_field_end_jsonb’: > jsonfuncs.c:1328:7: warning: assignment from incompatible pointer type > [enabled by default] >val = DatumGetPointer(DirectFunctionCall1(jsonb_in, > CStringGetDatum(cstr))); >^ > jsonfuncs.c: In function ‘elements_array_element_end_jsonb’: > jsonfuncs.c:1530:8: warning: assignment from incompatible pointer type > [enabled by default] > jbval = DatumGetPointer(DirectFunctionCall1(jsonb_in, > CStringGetDatum(cstr))); > ^ > > > make contrib: > > hstore_io.c: In function ‘array_to_hstore’: > hstore_io.c:1694:30: warning: ‘result’ may be used uninitialized in this > function [-Wmaybe-uninitialized] > PG_RETURN_POINTER(hstoreDump(result)); > > > > > -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Tue, Jan 14, 2014 at 9:09 PM, David Rowley wrote: > I think unless anyone has some objections I'm going to remove the inverse > transition for SUM(numeric) and modify the documents to tell the user how > to build their own FAST_SUM(numeric) using the built in functions to do it. > I'm starting to think that playing around with resetting numeric scale will > turn a possible 9.4 patch into a 9.5/10.0 patch. I see no reason why what's > there so far, minus sum(numeric), can't go in... > > If so then there's 36 aggregate functions ticked off my "create an inverse > transition function for" list! I personally think that's a pretty good win. > I'd rather do this than battle and miss deadlines for 9.4. I'd find that > pretty annoying. > > Here's a patch which removes sum(numeric) and changes the documents a little to remove a reference to using sum(numeric) to workaround the fact that there's no inverse transitions for sum(float). I also made a small change in the aggregates.sql tests to check that the aggfnoid <= . I've also pulled the regression tests that I had added for sum(numeric) as they no longer test anything new. All tests are now passing. With the attached patch I feel like I've left users a bit high and dry for their sum(numeric) needs. I guess there is no true workaround as even if they created their functions in SQL using simple + and - arithmetic, they would likely suffer from NaN recovery problems. I'm starting to come around to Tom's FAST_SUM idea as I simply can't see any fool proof workaround that could be created without writing things in C. The only problems I see with the FAST_SUM idea are that the number of trailing zeros may appear a little random based on if inverse transitions are used or are not used... Keep in mind that inverse transitions are not performed if any aggregate in the window does not support them OR if any aggregate in the frame contains a volatile function in the aggregate's parameters or the FILTER (WHERE clause). Does this matter or can we just document to warn about that? If there's a few more +1s for FAST_SUM(numeric) then let me know and I'll add it. If anyone feels strongly against adding FAST_SUM then please let the reasons for that known too. Or failing that, if anyone has any other ideas that have not yet been written on this thread, please post them so we can discuss. Regards David Rowley > Regards > > David Rowley > > inverse_transition_functions_v2.1.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
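For readers new to the thread, an inverse transition function lets a window aggregate be maintained incrementally: as the frame slides, the value entering the frame is folded in with the normal transition function and the value leaving it is removed with the inverse one, instead of rescanning the whole frame. A minimal standalone C sketch of the idea, using plain doubles rather than PostgreSQL's numeric machinery (this is only an illustration of the concept, not PostgreSQL code), which also shows why inverse transitions for sum(float) are unsafe: floating-point addition is not exactly invertible.

#include <stdio.h>

/*
 * Sliding-window SUM maintained with a "forward" transition (add the
 * incoming value) and an "inverse" transition (remove the outgoing one).
 * Illustration of the concept only.
 */
static double
forward_transition(double state, double newval)
{
    return state + newval;
}

static double
inverse_transition(double state, double oldval)
{
    return state - oldval;
}

int
main(void)
{
    double vals[] = {1.0, 2.0, 1e15, 3.0, 4.0};
    int    nvals = 5;
    int    window = 2;          /* frame: current row and the one before it, once full */
    double state = 0.0;

    for (int i = 0; i < nvals; i++)
    {
        state = forward_transition(state, vals[i]);
        if (i >= window)
            state = inverse_transition(state, vals[i - window]);

        /*
         * With floating point, (x + 1e15) - 1e15 need not equal x, which is
         * why sum(float) gets no inverse transition, and the numeric scale /
         * NaN-recovery issues discussed above are the analogous problem for
         * sum(numeric).
         */
        printf("window sum ending at row %d: %g\n", i, state);
    }
    return 0;
}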
[HACKERS] Inheritance and indexes
From the PostgreSQL manual: "A serious limitation of the inheritance feature is that indexes (including unique constraints) and foreign key constraints only apply to single tables, not to their inheritance children." But is it possible to use an index for a derived table at all? Why is a sequential scan used for the derived table in the example below?

create table base_table (x integer primary key);
create table derived_table (y integer) inherits (base_table);
insert into base_table values (1);
insert into derived_table values (2,2);
create index derived_index on derived_table(x);

explain select * from base_table where x>=0;
QUERY PLAN
--
 Append  (cost=0.14..4.56 rows=81 width=4)
   ->  Index Only Scan using base_table_pkey on base_table  (cost=0.14..3.55 rows=80 width=4)
         Index Cond: (x >= 0)
   ->  Seq Scan on derived_table  (cost=0.00..1.01 rows=1 width=4)
         Filter: (x >= 0)
(5 rows)

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Mon, Jan 13, 2014 at 6:45 PM, Peter Geoghegan wrote: > + uint32 > + SpeculativeInsertionIsInProgress(TransactionId xid, RelFileNode rel, > ItemPointer tid) > + { For the purposes of preventing unprincipled deadlocking, commenting out the following (the only caller of the above) has no immediately discernible effect with any of the test-cases that I've published: /* XXX shouldn't we fall through to look at xmax? */ + /* XXX why? or is that now covered by the above check? */ + snapshot->speculativeToken = + SpeculativeInsertionIsInProgress(HeapTupleHeaderGetRawXmin(tuple), + rnode, + &htup->t_self); + + snapshot->xmin = HeapTupleHeaderGetRawXmin(tuple); return true;/* in insertion by other */ I think that the prevention of unprincipled deadlocking is all down to this immediately prior piece of code, at least in those test cases: ! /* !* in insertion by other. !* !* Before returning true, check for the special case that the !* tuple was deleted by the same transaction that inserted it. !* Such a tuple will never be visible to anyone else, whether !* the transaction commits or aborts. !*/ ! if (!(tuple->t_infomask & HEAP_XMAX_INVALID) && ! !(tuple->t_infomask & HEAP_XMAX_COMMITTED) && ! !(tuple->t_infomask & HEAP_XMAX_IS_MULTI) && ! !HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) && ! HeapTupleHeaderGetRawXmax(tuple) == HeapTupleHeaderGetRawXmin(tuple)) ! { ! return false; ! } But why should it be acceptable to change the semantics of dirty snapshots like this, which previously always returned true when control reached here? It is a departure from their traditional behavior, not limited to clients of this new promise tuple infrastructure. Now, it becomes entirely a matter of whether we tried to insert before or after the deleting xact's deletion (of a tuple it originally inserted) as to whether or not we block. So in general we don't get to "keep our old value locks" until xact end when we update or delete. Even if you don't consider this a bug for existing dirty snapshot clients (I myself do - we can't rely on deleting a row and re-inserting the same values now, which could be particularly undesirable for updates), I have already described how we can take advantage of deleting tuples while still holding on to their "value locks" [1] to Andres. I think it'll be very important for multi-master conflict resolution. I've already described this useful property of dirty snapshots numerous times on this thread in relation to different aspects, as it happens. It's essential. Anyway, I guess you're going to need an infomask bit to fix this, so you can differentiate between 'promise' tuples and 'proper' tuples. Those are in short supply. I still think this problem is more or less down to a modularity violation, and I suspect that this is not the last problem that will be found along these lines if we continue to pursue this approach. [1] http://www.postgresql.org/message-id/CAM3SWZQpLSGPS2Kd=-n6hvyiqkf_mcxmx-q72ar9upzq-x6...@mail.gmail.com -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, Jan 13, 2014 at 03:24:38PM -0800, Josh Berkus wrote: > On 01/13/2014 02:26 PM, Mel Gorman wrote: > > Really? > > > > zone_reclaim_mode is often a complete disaster unless the workload is > > partitioned to fit within NUMA nodes. On older kernels enabling it would > > sometimes cause massive stalls. I'm actually very surprised to hear it > > fixes anything and would be interested in hearing more about what sort > > of circumstnaces would convince you to enable that thing. > > So the problem with the default setting is that it pretty much isolates > all FS cache for PostgreSQL to whichever socket the postmaster is > running on, and makes the other FS cache unavailable. I'm not being pedantic but the default depends on the NUMA characteristics of the machine so I need to know if it was enabled or disabled. Some machines will default zone_reclaim_mode to 0 and others will default it to 1. In my experience the majority of bugs that involved zone_reclaim_mode were due to zone_reclaim_mode enabled by default. If I see a bug that involves a file-based workload on a NUMA machine with stalls and/or excessive IO when there is plenty of memory free then zone_reclaim_mode is the first thing I check. I'm guessing from context that in your experience it gets enabled by default on the machines you care about. This would indeed limit FS cache usage to the node where the process is initiating IO (postmaster I guess). > This means that, > for example, if you have two memory banks, then only one of them is > available for PostgreSQL filesystem caching ... essentially cutting your > available cache in half. > > And however slow moving cached pages between memory banks is, it's an > order of magnitude faster than moving them from disk. But this isn't > how the NUMA stuff is configured; it seems to assume that it's less > expensive to get pages from disk than to move them between banks, so Yes, this is right. The history behind this "logic" is that it was assumed NUMA machines would only ever be used for HPC and that the workloads would always be partitioned to run within NUMA nodes. This has not been the case for a long time and I would argue that we should leave that thing disabled by default in all cases. Last time I tried it was met with resistance but maybe it's time to try again. > whatever you've got cached on the other bank, it flushes it to disk as > fast as possible. I understand the goal was to make memory usage local > to the processors stuff was running on, but that includes an implicit > assumption that no individual process will ever want more than one > memory bank worth of cache. > > So disabling all of the NUMA optimizations is the way to go for any > workload I personally deal with. > I would hesitate to recommend "all" on the grounds that zone_reclaim_mode is brain damage and I'd hate to lump all tuning parameters into the same box. There is an interesting side-line here. If all IO is initiated by one process in postgres then the memory locality will be sub-optimal. The consumer of the data may or may not be running on the same node as the process that read the data from disk. It is possible to migrate this from user space but the interface is clumsy and assumes the data is mapped. Automatic NUMA balancing does not help you here because that thing also depends on the data being mapped. It does nothing for data accessed via read/write. There is nothing fundamental that prevents this, it was not implemented because it was not deemed to be important enough. 
The amount of effort spent on addressing this would depend on how important NUMA locality is for postgres performance. -- Mel Gorman SUSE Labs -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
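Whether an application should warn about this setting is debatable, but detecting it is at least cheap. A minimal sketch, assuming only the standard Linux procfs path for the knob; PostgreSQL performs no such check today, so this is purely illustrative:

#include <stdio.h>

/*
 * Sketch: warn if vm.zone_reclaim_mode is enabled.  Assumes the standard
 * Linux procfs location; illustrative only, not PostgreSQL code.
 */
int
main(void)
{
    FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "r");
    int   mode;

    if (f == NULL)
        return 0;               /* not Linux, or no NUMA awareness */

    if (fscanf(f, "%d", &mode) == 1 && mode != 0)
        fprintf(stderr,
                "warning: vm.zone_reclaim_mode = %d; file-backed pages may be "
                "reclaimed rather than cached on remote NUMA nodes, which can "
                "hurt workloads with a large shared file cache\n", mode);

    fclose(f);
    return 0;
}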
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On 01/14/2014 12:20 PM, Peter Geoghegan wrote: I think that the prevention of unprincipled deadlocking is all down to this immediately prior piece of code, at least in those test cases: ! /* !* in insertion by other. !* !* Before returning true, check for the special case that the !* tuple was deleted by the same transaction that inserted it. !* Such a tuple will never be visible to anyone else, whether !* the transaction commits or aborts. !*/ ! if (!(tuple->t_infomask & HEAP_XMAX_INVALID) && ! !(tuple->t_infomask & HEAP_XMAX_COMMITTED) && ! !(tuple->t_infomask & HEAP_XMAX_IS_MULTI) && ! !HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) && ! HeapTupleHeaderGetRawXmax(tuple) == HeapTupleHeaderGetRawXmin(tuple)) ! { ! return false; ! } But why should it be acceptable to change the semantics of dirty snapshots like this, which previously always returned true when control reached here? It is a departure from their traditional behavior, not limited to clients of this new promise tuple infrastructure. Now, it becomes entirely a matter of whether we tried to insert before or after the deleting xact's deletion (of a tuple it originally inserted) as to whether or not we block. So in general we don't get to "keep our old value locks" until xact end when we update or delete. Hmm. So the scenario would be that a process inserts a tuple, but kills it again later in the transaction, and then re-inserts the same value. The expectation is that because it inserted the value once already, inserting it again will not block. Ie. inserting and deleting a tuple effectively acquires a value-lock on the inserted values. Even if you don't consider this a bug for existing dirty snapshot clients (I myself do - we can't rely on deleting a row and re-inserting the same values now, which could be particularly undesirable for updates), Yeah, it would be bad if updates start failing because of this. We could add a check for that, and return true if the tuple was updated rather than deleted. I have already described how we can take advantage of deleting tuples while still holding on to their "value locks" [1] to Andres. I think it'll be very important for multi-master conflict resolution. I've already described this useful property of dirty snapshots numerous times on this thread in relation to different aspects, as it happens. It's essential. I didn't understand that description. Anyway, I guess you're going to need an infomask bit to fix this, so you can differentiate between 'promise' tuples and 'proper' tuples. Yeah, that's one way. Or you could set xmin to invalid, to make the killed tuple look thoroughly dead to everyone. Those are in short supply. I still think this problem is more or less down to a modularity violation, and I suspect that this is not the last problem that will be found along these lines if we continue to pursue this approach. You have suspected that many times throughout this thread, and every time there's been a relatively simple solutions to the issues you've raised. I suspect that's also going to be true for whatever mundane next issue you come up with. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] plpgsql.consistent_into
On 1/14/14 10:16 AM, Pavel Stehule wrote: 2014/1/14 Florian Pflug So if we really want to change this, I think we need to have a LANGUAGE_VERSION attribute on functions. Each time a major postgres release changes the behaviour of one of the procedural languages, we'd increment that language's version, and enable the old behaviour for all functions tagged with an older one. I dislike this proposal too enterprise, too complex, too bad - after ten years we can have a ten language versions and it helps nothing. At this point I'm almost tempted to agree with Florian -- I'm really hoping we could change PL/PgSQL for the better over the next 10 years, but given the backwards compatibility requirements we have, this seems like an absolute nightmare. Not saying we need a version number we can keep bumping every release, but something similar. return back to topic a) there is agreement so we like this functionality as plpgsql option b) there is no agreement so we would to see this functionality as default (or in near future) @b is a topic, that we should not to solve and it is back again and again. Sometimes we have success, sometimes not. Temporal GUC is not enough solution for some people. So my result - @a can be implement, @b not FWIW, I would like to have this behaviour even if it's not the default (that was my original proposal anyway). It could help catch a lot of bugs in testing, and for important code it could be even turned on in production (perhaps on a per-function basis). Maybe even globally, idk. I am working on plpgsql_debug extension and I am thinking so I am able (with small change in plpgsql executor) implement this check as extension. So it can be available for advanced users (that will has a knowledge about additional plpgsql extensions). While this sounds cool, running a modified postgres locally seems a lot easier. I already have to do that for PL/PgSQL development because of the reasons I outlined in the thread "PL/PgSQL, RAISE and error context". Regards, Marko Tiikkaja -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On 01/14/2014 12:44 AM, Peter Geoghegan wrote: On Mon, Jan 13, 2014 at 12:58 PM, Heikki Linnakangas wrote: Well, even if you don't agree that locking all the conflicting rows for update is sensible, it's still perfectly sensible to return the rejected rows to the user. For example, you're inserting N rows, and if some of them violate a constraint, you still want to insert the non-conflicting rows instead of rolling back the whole transaction. Right, but with your approach, can you really be sure that you have the right rejecting tuple ctid (not reject)? In other words, as you wait for the exclusion constraint to conclusively indicate that there is a conflict, minutes may have passed in which time other conflicts may emerge in earlier unique indexes. Whereas with an approach where values are locked, you are guaranteed that earlier unique indexes have no conflicting values. Maintaining that property seems useful, since we check in a well-defined order, and we're still projecting a ctid. Unlike when row locking is involved, we can make no assumptions or generalizations around where conflicts will occur. Although that may also be a general concern with your approach when row locking, for multi-master replication use-cases. There may be some value in knowing it cannot have been earlier unique indexes (and so the existing values for those unique indexes in the locked row should stay the same - don't many conflict resolution policies work that way?). I don't understand what you're saying. Can you give an example? In the use case I was envisioning above, ie. you insert N rows, and if any of them violate constraint, you still want to insert the non-violating instead of rolling back the whole transaction, you don't care. You don't care what existing rows the new rows conflicted with. Even if you want to know what you conflicted with, I can't make sense of what you're saying. In the btreelock approach, the value locks are immediately released once you discover that there's conflict. So by the time you get to do anything with the ctid of the existing tuple you conflicted with, new conflicting tuples might've appeared. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PostgreSQL Service on Windows does not start if data directory given is relative path
On Tue, Jan 12, 2014 David Rowley wrote: >>I have found a case that PostgreSQL as win32 service does not start, if the >>data directory given as relative path. >>Error observed in this case is: >>"The PostgreSQL on Local Computer started and >> then stopped". >>This may happen because relative path given will be relative to path from >>where service is registered but >>the path from where WIN32 service execution gets invoked may be different and >>hence it won't be able >>to find the data directory. >>I have fixed the same by internally converting the relative path to absolute >>path as it is being done for executable file. >Hi, >I've not quite got around to testing this yet, but I think the patch is a good >idea as I can see that I can use a relative path when I start postgres.exe >from the command line, I guess the behaviour should likely be the same when >installed as a windows > service. Thanks for reviewing and providing the first level of feedback. >So, I've just been looking over this patch and I'm just wondering about a few >things: >In pgwin32_CommandLine you declare dbPath, it looks like you could declare >this in the scope of; if (pg_config) Yes I have declared the same in the scope of "if (pg_config) " >In find_my_abs_path you're making a call to StrNCpy, I know you likely just >used find_my_exec as an example here, but I'd say the use of StrNCpy is not >really needed here and is a bit of a waste of cycles. I'd rather see strlcpy >being used as strncpy > will needlessly zero the remaining buffer space. Yes you are right, it is much better to use strlcpy instead of StrNCpy. I have modified the patch. >Also in find_my_abs_path, I'm just wondering if the cwd variable is really >needed, I think you could just use retpath each time and it would also save a >little bit of copying that's done in join_path_components(). By the looks of >it you can just call > join_path_components(retpath, retpath, inpath). Perhaps some people would > disagree, but I'm not really seeing the harm in it and it saves some copying > too. Yes I am also convinced with your suggestion. It really avoid one level of copy. I have seen that the similar mechanism is used in many places where join_path_components() is called. So I think it should be OK to use in our case also. I have made the necessary changes for the same. I have attached the changed patch. Please provide your further feedback, if any. Thanks and Regards, Kumar Rajeev Rastogi pgctl_win32service_rel_dbpath_v2.patch Description: pgctl_win32service_rel_dbpath_v2.patch -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
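For context, the fix boils down to resolving a relative -D path against the current working directory at the moment the service is registered, because the directory the service control manager later starts the process from may be different. A rough standalone sketch of that resolution, written in POSIX terms for brevity (the actual patch uses the existing frontend helpers such as getcwd() plus join_path_components() from src/port and deals with Windows paths; the function below is only an illustration, not the patch):

#include <stdio.h>
#include <unistd.h>
#include <limits.h>

/*
 * Resolve a possibly-relative data directory against the current working
 * directory.  Illustrative only: error handling and Windows drive-letter
 * handling are omitted.
 */
static void
make_absolute(const char *inpath, char *outpath, size_t outlen)
{
    if (inpath[0] == '/')                       /* already absolute */
        snprintf(outpath, outlen, "%s", inpath);
    else
    {
        char cwd[PATH_MAX];

        if (getcwd(cwd, sizeof(cwd)) == NULL)
            cwd[0] = '\0';
        snprintf(outpath, outlen, "%s/%s", cwd, inpath);
    }
}

int
main(int argc, char **argv)
{
    char abspath[PATH_MAX];

    make_absolute(argc > 1 ? argv[1] : "data", abspath, sizeof(abspath));
    printf("registering service with -D \"%s\"\n", abspath);
    return 0;
}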
Re: [HACKERS] plpgsql.consistent_into
I've always hated INTO in procedures since it makes the code harder to follow and has very different behavior on the SQL level, in addition to the multi-row problem you bring up. If we can make assignment syntax more versatile and eventually replace INTO, then that solves multiple problems in the language without breaking backwards compatibility. On Tue, Jan 14, 2014 at 4:30 AM, Marko Tiikkaja wrote: > On 2014-01-14 02:54, Marti Raudsepp wrote: >> But PL/pgSQL already has an assignment syntax with the behavior you want: > > According to the docs, that doesn't set FOUND which would make this a pain > to deal with.. Right you are. If we can extend the syntax then we could make it such that "= SELECT" sets FOUND and other diagnostics, and a simple assignment doesn't. Which makes sense IMO: a = 10; -- simple assignments really shouldn't affect FOUND With explicit SELECT, clearly the intent is to perform a query: a = SELECT foo FROM table; And this could also work: a = INSERT INTO table (foo) VALUES (10) RETURNING foo_id; AFAICT the fact that this works is more of an accident and should be discouraged. We can leave it as is for compatibility's sake: a = foo FROM table; Now, another question is whether it's possible to make the syntax work. Is this an assignment from the result of a subquery, or is it a query by itself? a = (SELECT foo FROM table); Does anyone consider this proposal workable? Regards, Marti -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inheritance and indexes
On Tue, Jan 14, 2014 at 12:07 PM, knizhnik wrote: > But is it possible to use index for derived table at all? Yes, the planner will do an index scan when it makes sense. > Why sequential search is used for derived table in the example below: > insert into derived_table values (2,2); > create index derived_index on derived_table(x); > explain select * from base_table where x>=0; With only 1 row in the table, the planner decides there's no point in scanning the index. Try with more realistic data. Regards, Marti -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] plpgsql.consistent_into
On 1/14/14 12:28 PM, Marti Raudsepp wrote: I've always hated INTO in procedures since it makes the code harder to follow and has very different behavior on the SQL level, in addition to the multi-row problem you bring up. If we can make assignment syntax more versatile and eventually replace INTO, then that solves multiple problems in the language without breaking backwards compatibility. I don't personally have a problem with INTO other than the behaviour that started this thread. But I'm willing to consider other options. On Tue, Jan 14, 2014 at 4:30 AM, Marko Tiikkaja wrote: On 2014-01-14 02:54, Marti Raudsepp wrote: But PL/pgSQL already has an assignment syntax with the behavior you want: According to the docs, that doesn't set FOUND which would make this a pain to deal with.. Right you are. If we can extend the syntax then we could make it such that "= SELECT" sets FOUND and other diagnostics, and a simple assignment doesn't. Which makes sense IMO: a = 10; -- simple assignments really shouldn't affect FOUND With you so far. With explicit SELECT, clearly the intent is to perform a query: a = SELECT foo FROM table; And this could also work: a = INSERT INTO table (foo) VALUES (10) RETURNING foo_id; I'm not sure that would work with the grammar. Basically what PL/PgSQL does right now is for a statement like: a = 1; It parses the "a =" part itself, and then just reads until the next unquoted semicolon without actually looking at it, and slams a "SELECT " in front of it. With this approach we'd have to look into the query and try and guess what it does. That might be possible, but I don't like the idea. AFAICT the fact that this works is more of an accident and should be discouraged. We can leave it as is for compatibility's sake: a = foo FROM table; I've always considered that ugly (IIRC it's still undocumented as well), and would encourage people not to do that. Now, another question is whether it's possible to make the syntax work. Is this an assignment from the result of a subquery, or is it a query by itself? a = (SELECT foo FROM table); That looks like a scalar subquery, which is wrong because they can't return more than one column (nor can they be INSERT etc., obviously). How about: (a) = SELECT 1; (a, b) = SELECT 1, 2; (a, b) = INSERT INTO foo RETURNING col1, col2; Same semantics: TOO_MANY_ROWS on rows > 1, sets FOUND and row_count. AFAICT this can be parsed unambiguously, too, and we don't need to look at the query string because this is new syntax. Regards, Marko Tiikkaja -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
Hi,

I have fixed and submitted this patch in CF4. My previous patch had a significant bug: a mistaken calculation of the offset passed to posix_fadvise() :-( It nevertheless worked without problems in pgbench, because pgbench transactions are always random access...

I have also tested my patch with the DBT-2 benchmark. The results are as follows.

* Test server
  Server: HP Proliant DL360 G7
  CPU: Xeon E5640 2.66GHz (1P/4C)
  Memory: 18GB (PC3-10600R-9)
  Disk: 146GB(15k)*4 RAID1+0
  RAID controller: P410i/256MB
  OS: RHEL 6.4 (x86_64)
  FS: Ext4

* DBT-2 result (WH400, SESSION=100, ideal_score=5160)
  Method     | score | average | 90%tile | Maximum
  plain      | 3589  | 9.751   | 33.680  | 87.8036
  option=off | 3670  | 9.107   | 34.267  | 79.3773
  option=on  | 4222  | 5.140   | 7.619   | 102.473

  "option" is the "readahead_strategy" option, and "on" is my proposed method. "average", "90%tile" and "Maximum" are latencies. Average latency is 2 times faster than plain!

* Detailed results (uploading now, please wait for an hour...)
  [plain] http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/normal_20140109/HTML/index_thput.html
  [option=off] http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/readahead_off_20140109/HTML/index_thput.html
  [option=on] http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/readahead_on_20140109/HTML/index_thput.html

We can see a group of very slow latencies in the test of my proposed method. The share of active transactions is 20%, and the share of slow transactions is 80%. It might be the Pareto principle in CHECKPOINT ;-) #It's a joke.

I will test the performance of some join SQLs and the TPC-3 benchmark this week or next.

Regards, -- Mitsumasa KONDO NTT Open Source Software Center

*** a/configure --- b/configure *** *** 11303,11309 fi LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" --- 11303,11309 LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fadvise pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
*** a/contrib/pg_prewarm/pg_prewarm.c --- b/contrib/pg_prewarm/pg_prewarm.c *** *** 179,185 pg_prewarm(PG_FUNCTION_ARGS) */ for (block = first_block; block <= last_block; ++block) { ! smgrread(rel->rd_smgr, forkNumber, block, blockbuffer); ++blocks_done; } } --- 179,185 */ for (block = first_block; block <= last_block; ++block) { ! smgrread(rel->rd_smgr, forkNumber, block, blockbuffer, BAS_BULKREAD); ++blocks_done; } }
*** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** *** 1293,1298 include 'filename' --- 1293,1322 + + readahead_strategy (integer) + +readahead_strategyconfiguration parameter + + + + This feature is to select which readahead strategy is used. When we + set off(default), readahead strategy is optimized by OS. On the other + hands, when we set on, readahead strategy is optimized by Postgres. 
+ In typicaly situations, OS readahead strategy will be good working, + however Postgres often knows better readahead strategy before + executing disk access. For example, we can easy to predict access + pattern when we input SQLs, because planner of postgres decides + efficient access pattern to read faster. And it might be random access + pattern or sequential access pattern. It will be less disk IO and more + efficient to use file cache in OS. It will be better performance. + However this optimization is not complete now, so it is necessary to + choose it carefully in considering situations. Default setting is off + that is optimized by OS, and whenever it can change it. + + + + *** a/src/backend/commands/tablecmds.c --- b/src/backend/commands/tablecmds.c *** *** 9125,9131 copy_relation_data(SMgrRelation src, SMgrRelation dst, /* If we got a cancel signal during the copy of the data, quit */ CHECK_FOR_INTERRUPTS(); ! smgrread(src, forkNum, blkno, buf); if (!PageIsVerified(page, blkno)) ereport(ERROR, --- 9125,9131 /* If we got a cancel signal during th
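The mechanism behind the option is straightforward: the executor already knows whether an upcoming scan is sequential or random (that is what the buffer access strategy encodes), so the same knowledge can be passed to the kernel per file descriptor instead of letting the kernel's readahead heuristics guess from the observed access pattern. A minimal standalone sketch of the kind of call involved, with a simplified stand-in for the strategy enum and a hypothetical relation file path; this illustrates the mechanism only and is not the patch itself:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Simplified stand-in for PostgreSQL's buffer access strategy types. */
typedef enum
{
    STRATEGY_NORMAL,
    STRATEGY_BULKREAD,          /* e.g. sequential scans */
    STRATEGY_RANDOM             /* e.g. index probes */
} ReadStrategy;

/*
 * Hint the kernel about the expected access pattern before reading.
 * posix_fadvise() is advisory: the kernel may enlarge, shrink or disable
 * its readahead window accordingly.
 */
static void
advise_readahead(int fd, off_t offset, off_t len, ReadStrategy strategy)
{
#ifdef POSIX_FADV_SEQUENTIAL
    int advice;

    switch (strategy)
    {
        case STRATEGY_BULKREAD:
            advice = POSIX_FADV_SEQUENTIAL;
            break;
        case STRATEGY_RANDOM:
            advice = POSIX_FADV_RANDOM;
            break;
        default:
            advice = POSIX_FADV_NORMAL;
            break;
    }

    if (posix_fadvise(fd, offset, len, advice) != 0)
        perror("posix_fadvise");
#else
    (void) fd; (void) offset; (void) len; (void) strategy;
#endif
}

int
main(void)
{
    /* hypothetical relation segment file, used only for illustration */
    int fd = open("base/12345/16384", O_RDONLY);

    if (fd >= 0)
    {
        advise_readahead(fd, 0, 0, STRATEGY_BULKREAD);  /* len = 0: to end of file */
        close(fd);
    }
    return 0;
}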
Re: [HACKERS] plpgsql.consistent_into
2014/1/14 Marko Tiikkaja > On 1/14/14 12:28 PM, Marti Raudsepp wrote: > >> I've always hated INTO in procedures since it makes the code harder to >> follow and has very different behavior on the SQL level, in addition >> to the multi-row problem you bring up. If we can make assignment >> syntax more versatile and eventually replace INTO, then that solves >> multiple problems in the language without breaking backwards >> compatibility. >> > > I don't personally have a problem with INTO other than the behaviour that > started this thread. But I'm willing to consider other options. > > > On Tue, Jan 14, 2014 at 4:30 AM, Marko Tiikkaja wrote: >> >>> On 2014-01-14 02:54, Marti Raudsepp wrote: >>> But PL/pgSQL already has an assignment syntax with the behavior you want: >>> >>> According to the docs, that doesn't set FOUND which would make this a >>> pain >>> to deal with.. >>> >> >> Right you are. If we can extend the syntax then we could make it such >> that "= SELECT" sets FOUND and other diagnostics, and a simple >> assignment doesn't. Which makes sense IMO: >> >> a = 10; -- simple assignments really shouldn't affect FOUND >> > > With you so far. > > > With explicit SELECT, clearly the intent is to perform a query: >>a = SELECT foo FROM table; >> And this could also work: >>a = INSERT INTO table (foo) VALUES (10) RETURNING foo_id; >> > > I'm not sure that would work with the grammar. Basically what PL/PgSQL > does right now is for a statement like: > > a = 1; > > It parses the "a =" part itself, and then just reads until the next > unquoted semicolon without actually looking at it, and slams a "SELECT " in > front of it. With this approach we'd have to look into the query and try > and guess what it does. That might be possible, but I don't like the idea. > > > AFAICT the fact that this works is more of an accident and should be >> discouraged. We can leave it as is for compatibility's sake: >>a = foo FROM table; >> > > I've always considered that ugly (IIRC it's still undocumented as well), > and would encourage people not to do that. > > > Now, another question is whether it's possible to make the syntax >> work. Is this an assignment from the result of a subquery, or is it a >> query by itself? >>a = (SELECT foo FROM table); >> > only this form is allowed in SQL/PSM - and it has some logic - you can assign result of subquery (should be one row only) to variable. > > That looks like a scalar subquery, which is wrong because they can't > return more than one column (nor can they be INSERT etc., obviously). > > How about: > > (a) = SELECT 1; > (a, b) = SELECT 1, 2; > (a, b) = INSERT INTO foo RETURNING col1, col2; > > I prefer subquery only syntax - a := (some) or (a,b,c) = (some a,b,c) with possible enhancing for statements with RETURNING a advance is compatibility with DB2 (SQL/PSM) syntax - and this code is written now - it is done in my sql/psm implementation Regards Pavel > Same semantics: TOO_MANY_ROWS on rows > 1, sets FOUND and row_count. > AFAICT this can be parsed unambiguously, too, and we don't need to look at > the query string because this is new syntax. > > > Regards, > Marko Tiikkaja > > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
[HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format
Hi all, As of today, replication protocol has a command called BASE_BACKUP to allow a client connecting with the replication protocol to retrieve a full backup from server through a connection stream. The description of its current options are here: http://www.postgresql.org/docs/9.3/static/protocol-replication.html This command is in charge to put the server in start backup by using do_pg_start_backup, then it sends the backup, and finalizes the backup with do_pg_stop_backup. Thanks to that it is as well possible to get backups from even standby nodes as the stream contains the backup_label file necessary for recovery. Full backup is sent in tar format for obvious performance reasons to limit the amount of data sent through the stream, and server contains necessary coding to send the data in correct format. This forces the client as well to perform some decoding if the output of the base backup received needs to be analyzed on the fly but doing something similar to what now pg_basebackup does when the backup format is plain. I would like to propose the following things to extend BASE_BACKUP to retrieve a backup from a stream: - Addition of an option FORMAT, to control the output format of backup, with possible options as 'plain' and 'tar'. Default is tar for backward compatibility purposes. The purpose of this option is to make easier for backup tools playing with postgres to retrieve and backup and analyze it on the fly, the purpose being to filter and analyze the data while it is being received without all the tar decoding necessary, what would consist in copying portions of pg_basebackup code more or less. - Addition of an option called INCREMENTAL to send an incremental backup to the client. This option uses as input an LSN, and sends back to client relation pages (in the shape of reduced relation files) that are newer than the LSN specified by looking at pd_lsn of PageHeaderData. In this case the LSN needs to be determined by client based on the latest full backup taken. This option is particularly interesting to reduce the amount of data taken between two backups, even if it increases the restore time as client needs to reconstitute a base backup depending on the recovery target and the pages modified. Client would be in charge of rebuilding pages from incremental backup by scanning all the blocks that need to be updated based on the full backup as the LSN from which incremental backup is taken is known. But this is not really something the server cares about... Such things are actually done by pg_rman as well. As a next step, I would imagine that pg_basebackup could be extended to take incremental backups as well. Having another tool able to rebuild backups based on a full backup with incremental information would be nice as well. This is of course not material for 9.4, I just would like for now to measure the temperature about such things and gather opinions... Regards -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
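To make the INCREMENTAL idea concrete: every PostgreSQL page begins with a header whose first field, pd_lsn, records the LSN of the last WAL record that modified the page, so the server can filter 8kB blocks against the threshold LSN supplied by the client. A simplified standalone sketch of that filter over a single relation segment file follows; the struct is a cut-down stand-in for the real page header, the file name and threshold are hypothetical, and real server code would also have to handle forks, segments and reads racing with concurrent writes:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192             /* PostgreSQL's default block size */

/*
 * Minimal stand-in for the start of PageHeaderData: pd_lsn is stored as
 * two 32-bit halves at the very beginning of every page.
 */
typedef struct
{
    uint32_t    xlogid;         /* high 32 bits of the page LSN */
    uint32_t    xrecoff;        /* low 32 bits of the page LSN */
} PageLSN;

/*
 * Report every block whose LSN is newer than the threshold taken from the
 * previous full backup.  Illustrative sketch only, not server code.
 */
static void
scan_segment(const char *path, uint64_t threshold_lsn)
{
    FILE   *f = fopen(path, "rb");
    char    buf[BLCKSZ];
    long    blkno = 0;

    if (f == NULL)
        return;

    while (fread(buf, 1, BLCKSZ, f) == BLCKSZ)
    {
        PageLSN     hdr;
        uint64_t    page_lsn;

        memcpy(&hdr, buf, sizeof(hdr));
        page_lsn = ((uint64_t) hdr.xlogid << 32) | hdr.xrecoff;

        if (page_lsn > threshold_lsn)
            printf("%s block %ld changed (LSN %lX/%08lX)\n",
                   path, blkno,
                   (unsigned long) hdr.xlogid, (unsigned long) hdr.xrecoff);
        blkno++;
    }
    fclose(f);
}

int
main(void)
{
    /* hypothetical segment file and threshold LSN (1/0) from the last full backup */
    scan_segment("base/12345/16384", (uint64_t) 1 << 32);
    return 0;
}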
Re: [HACKERS] Soften pg_[start|stop]_backup to allow them on a standby?
Hi, On 2014-01-14 12:31:09 +0900, Michael Paquier wrote: > Currently, pg_start_backup and pg_stop_backup cannot run on a standby > because it is not possible to write a backup_label file to disk, > because of the nature of a standby server preventing to write any data > in its PGDATA. Is this thought right? This is what the comments at the > top of do_pg_start_backup make me conclude. No, the actual reason is that a plain pg_stop_backup() writes WAL - which we can't do on a standby. The walsender command gets around this by storing the required data in the backup label itself, but that requires the label to be written after the basebackup actually finished which doesn't work for plain start/stop backup. > Another idea would be to send the backup label file directly as the > output of pg_start_backup such as client application can grab it and > reuse it. Any thoughts about that as well? Yea, I think extending the "protocols" available is the way to go here. We need to be able to send the backup label after the actual base backup finished. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format
Hi, On 2014-01-14 21:47:43 +0900, Michael Paquier wrote: > I would like to propose the following things to extend BASE_BACKUP to > retrieve a backup from a stream: > - Addition of an option FORMAT, to control the output format of > backup, with possible options as 'plain' and 'tar'. Default is tar for > backward compatibility purposes. The purpose of this option is to make > easier for backup tools playing with postgres to retrieve and backup > and analyze it on the fly, the purpose being to filter and analyze the > data while it is being received without all the tar decoding > necessary, what would consist in copying portions of pg_basebackup > code more or less. We'd need our own serialization format since we're dealing with more than one file, what would be the point? > - Addition of an option called INCREMENTAL to send an incremental > backup to the client. This option uses as input an LSN, and sends back > to client relation pages (in the shape of reduced relation files) that > are newer than the LSN specified by looking at pd_lsn of > PageHeaderData. In this case the LSN needs to be determined by client > based on the latest full backup taken. This option is particularly > interesting to reduce the amount of data taken between two backups, > even if it increases the restore time as client needs to reconstitute > a base backup depending on the recovery target and the pages modified. > Client would be in charge of rebuilding pages from incremental backup > by scanning all the blocks that need to be updated based on the full > backup as the LSN from which incremental backup is taken is known. But > this is not really something the server cares about... Such things are > actually done by pg_rman as well. Why not just rely on WAL replay since you're relying on the consistency of the standby anyway? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format
On 01/14/2014 02:47 PM, Michael Paquier wrote: I would like to propose the following things to extend BASE_BACKUP to retrieve a backup from a stream: - Addition of an option FORMAT, to control the output format of backup, with possible options as 'plain' and 'tar'. Default is tar for backward compatibility purposes. The purpose of this option is to make easier for backup tools playing with postgres to retrieve and backup and analyze it on the fly, the purpose being to filter and analyze the data while it is being received without all the tar decoding necessary, what would consist in copying portions of pg_basebackup code more or less. Umm, you have to somehow mark in the protocol where one file begins and another one ends. The 'tar' format seems perfectly OK for that purpose. What exactly would the 'plain' format do? - Addition of an option called INCREMENTAL to send an incremental backup to the client. This option uses as input an LSN, and sends back to client relation pages (in the shape of reduced relation files) that are newer than the LSN specified by looking at pd_lsn of PageHeaderData. In this case the LSN needs to be determined by client based on the latest full backup taken. This option is particularly interesting to reduce the amount of data taken between two backups, even if it increases the restore time as client needs to reconstitute a base backup depending on the recovery target and the pages modified. Client would be in charge of rebuilding pages from incremental backup by scanning all the blocks that need to be updated based on the full backup as the LSN from which incremental backup is taken is known. But this is not really something the server cares about... Such things are actually done by pg_rman as well. How does the server find all the pages with LSN > the threshold? If it needs to scan the whole database, it's not all that useful. I guess it would be better than nothing, but I think you might as well just use rsync. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format
On Tue, Jan 14, 2014 at 1:47 PM, Michael Paquier wrote: > Hi all, > > As of today, replication protocol has a command called BASE_BACKUP to > allow a client connecting with the replication protocol to retrieve a > full backup from server through a connection stream. The description > of its current options are here: > http://www.postgresql.org/docs/9.3/static/protocol-replication.html > > This command is in charge to put the server in start backup by using > do_pg_start_backup, then it sends the backup, and finalizes the backup > with do_pg_stop_backup. Thanks to that it is as well possible to get > backups from even standby nodes as the stream contains the > backup_label file necessary for recovery. Full backup is sent in tar > format for obvious performance reasons to limit the amount of data > sent through the stream, and server contains necessary coding to send > the data in correct format. This forces the client as well to perform > some decoding if the output of the base backup received needs to be > analyzed on the fly but doing something similar to what now > pg_basebackup does when the backup format is plain. > > I would like to propose the following things to extend BASE_BACKUP to > retrieve a backup from a stream: > - Addition of an option FORMAT, to control the output format of > backup, with possible options as 'plain' and 'tar'. Default is tar for > backward compatibility purposes. The purpose of this option is to make > easier for backup tools playing with postgres to retrieve and backup > and analyze it on the fly, the purpose being to filter and analyze the > data while it is being received without all the tar decoding > necessary, what would consist in copying portions of pg_basebackup > code more or less. > How would this be different/better than the tar format? pg_basebackup already does this analysis, for example, when it comes to recovery.conf. The tar format is really easy to analyze as a stream, that's one of the reasons we picked it... > - Addition of an option called INCREMENTAL to send an incremental > backup to the client. This option uses as input an LSN, and sends back > to client relation pages (in the shape of reduced relation files) that > are newer than the LSN specified by looking at pd_lsn of > PageHeaderData. In this case the LSN needs to be determined by client > based on the latest full backup taken. This option is particularly > interesting to reduce the amount of data taken between two backups, > even if it increases the restore time as client needs to reconstitute > a base backup depending on the recovery target and the pages modified. > Client would be in charge of rebuilding pages from incremental backup > by scanning all the blocks that need to be updated based on the full > backup as the LSN from which incremental backup is taken is known. But > this is not really something the server cares about... Such things are > actually done by pg_rman as well. > This sounds a lot like DIFFERENTIAL in other databases? Or I guess it's the same underlying technology, depending only on if you go back to the full base backup, or to the last incremental one. But if you look at the terms otherwise, I think incremental often refers to what we call WAL. Either way - if we can do this in a safe way, it sounds like a good idea. It would be sort of like rsync, except relying on the fact that we can look at the LSN and don't have to compare the actual files, right? 
As a next step, I would imagine that pg_basebackup could be extended > to take incremental backups as well. Having another tool able to > rebuild backups based on a full backup with incremental information > would be nice as well. > I would say those are requirements, not just next step and "nice as well" :) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
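On the point that tar is easy to analyze as a stream: a ustar archive is just a sequence of 512-byte header blocks, each giving the member's name and its size as an octal string, followed by the member's data padded to a 512-byte boundary, so a client can find file boundaries without buffering the whole backup. A minimal sketch of such a walk over a tar stream arriving on stdin (end-of-archive handling and checksum verification are glossed over):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TAR_BLOCK 512

/*
 * Walk a tar stream and print each member's name and size.  In a ustar
 * header the name occupies bytes 0..99 and the size is an octal string
 * at bytes 124..135.
 */
int
main(void)
{
    unsigned char hdr[TAR_BLOCK];

    while (fread(hdr, 1, TAR_BLOCK, stdin) == TAR_BLOCK)
    {
        char            name[101];
        char            sizefield[13];
        unsigned long   size;
        unsigned long   padded;

        if (hdr[0] == '\0')     /* zero block: end of archive */
            break;

        memcpy(name, hdr, 100);
        name[100] = '\0';
        memcpy(sizefield, hdr + 124, 12);
        sizefield[12] = '\0';
        size = strtoul(sizefield, NULL, 8);
        padded = (size + TAR_BLOCK - 1) / TAR_BLOCK * TAR_BLOCK;

        printf("%s\t%lu bytes\n", name, size);

        /* skip (or, in a real client, inspect) the member's data */
        for (unsigned long skipped = 0; skipped < padded; skipped += TAR_BLOCK)
            if (fread(hdr, 1, TAR_BLOCK, stdin) != TAR_BLOCK)
                return 1;
    }
    return 0;
}

Filtering the stream like this, for example to pull out just the backup_label or to rewrite recovery.conf on the fly, is essentially what the existing pg_basebackup analysis mentioned above already does.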
Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format
On Tue, Jan 14, 2014 at 10:01 PM, Heikki Linnakangas wrote: >> - Addition of an option called INCREMENTAL to send an incremental >> backup to the client. This option uses as input an LSN, and sends back >> to client relation pages (in the shape of reduced relation files) that >> are newer than the LSN specified by looking at pd_lsn of >> PageHeaderData. In this case the LSN needs to be determined by client >> based on the latest full backup taken. This option is particularly >> interesting to reduce the amount of data taken between two backups, >> even if it increases the restore time as client needs to reconstitute >> a base backup depending on the recovery target and the pages modified. >> Client would be in charge of rebuilding pages from incremental backup >> by scanning all the blocks that need to be updated based on the full >> backup as the LSN from which incremental backup is taken is known. But >> this is not really something the server cares about... Such things are >> actually done by pg_rman as well. > > > How does the server find all the pages with LSN > the threshold? If it needs > to scan the whole database, it's not all that useful. I guess it would be > better than nothing, but I think you might as well just use rsync. Yes, it would be necessary to scan the whole database as the LSN to be checked is kept in PageHeaderData :). Perhaps it is not that performant, but my initial thought was that perhaps the amount of data necessary to maintain incremental backups could balance with the amount of WAL necessary to keep and limit the whole amount on disk. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format
On 2014-01-14 14:12:46 +0100, Magnus Hagander wrote: > Either way - if we can do this in a safe way, it sounds like a good idea. > It would be sort of like rsync, except relying on the fact that we can look > at the LSN and don't have to compare the actual files, right? Which is an advantage, yes. On the other hand, it doesn't fix problems with a subtly broken replica, e.g. after a bug in replay, or disk corruption. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format
On Tue, Jan 14, 2014 at 10:12 PM, Magnus Hagander wrote: > On Tue, Jan 14, 2014 at 1:47 PM, Michael Paquier > wrote: >> >> Hi all, >> >> As of today, replication protocol has a command called BASE_BACKUP to >> allow a client connecting with the replication protocol to retrieve a >> full backup from server through a connection stream. The description >> of its current options are here: >> http://www.postgresql.org/docs/9.3/static/protocol-replication.html >> >> This command is in charge to put the server in start backup by using >> do_pg_start_backup, then it sends the backup, and finalizes the backup >> with do_pg_stop_backup. Thanks to that it is as well possible to get >> backups from even standby nodes as the stream contains the >> backup_label file necessary for recovery. Full backup is sent in tar >> format for obvious performance reasons to limit the amount of data >> sent through the stream, and server contains necessary coding to send >> the data in correct format. This forces the client as well to perform >> some decoding if the output of the base backup received needs to be >> analyzed on the fly but doing something similar to what now >> pg_basebackup does when the backup format is plain. >> >> I would like to propose the following things to extend BASE_BACKUP to >> retrieve a backup from a stream: >> - Addition of an option FORMAT, to control the output format of >> backup, with possible options as 'plain' and 'tar'. Default is tar for >> backward compatibility purposes. The purpose of this option is to make >> easier for backup tools playing with postgres to retrieve and backup >> and analyze it on the fly, the purpose being to filter and analyze the >> data while it is being received without all the tar decoding >> necessary, what would consist in copying portions of pg_basebackup >> code more or less. > > > How would this be different/better than the tar format? pg_basebackup > already does this analysis, for example, when it comes to recovery.conf. > The tar format is really easy to analyze as a stream, that's one of the > reasons we picked it... > > >> >> - Addition of an option called INCREMENTAL to send an incremental >> >> backup to the client. This option uses as input an LSN, and sends back >> to client relation pages (in the shape of reduced relation files) that >> are newer than the LSN specified by looking at pd_lsn of >> PageHeaderData. In this case the LSN needs to be determined by client >> based on the latest full backup taken. This option is particularly >> interesting to reduce the amount of data taken between two backups, >> even if it increases the restore time as client needs to reconstitute >> a base backup depending on the recovery target and the pages modified. >> Client would be in charge of rebuilding pages from incremental backup >> by scanning all the blocks that need to be updated based on the full >> backup as the LSN from which incremental backup is taken is known. But >> this is not really something the server cares about... Such things are >> actually done by pg_rman as well. > > > This sounds a lot like DIFFERENTIAL in other databases? Or I guess it's the > same underlying technology, depending only on if you go back to the full > base backup, or to the last incremental one. Yes, that's actually a LSN-differential, I got my head in pg_rman for a couple of weeks, where a similar idea is called incremental there. > > But if you look at the terms otherwise, I think incremental often refers to > what we call WAL. 
> > Either way - if we can do this in a safe way, it sounds like a good idea. It > would be sort of like rsync, except relying on the fact that we can look at > the LSN and don't have to compare the actual files, right? Yep, that's the idea. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format
On Tue, Jan 14, 2014 at 2:18 PM, Andres Freund wrote: > On 2014-01-14 14:12:46 +0100, Magnus Hagander wrote: > > Either way - if we can do this in a safe way, it sounds like a good idea. > > It would be sort of like rsync, except relying on the fact that we can > look > > at the LSN and don't have to compare the actual files, right? > > Which is an advantage, yes. On the other hand, it doesn't fix problems > with a subtly broken replica, e.g. after a bug in replay, or disk > corruption. > > Right. But neither does rsync, right? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format
On 2014-01-14 14:40:46 +0100, Magnus Hagander wrote: > On Tue, Jan 14, 2014 at 2:18 PM, Andres Freund wrote: > > > On 2014-01-14 14:12:46 +0100, Magnus Hagander wrote: > > > Either way - if we can do this in a safe way, it sounds like a good idea. > > > It would be sort of like rsync, except relying on the fact that we can > > look > > > at the LSN and don't have to compare the actual files, right? > > > > Which is an advantage, yes. On the other hand, it doesn't fix problems > > with a subtly broken replica, e.g. after a bug in replay, or disk > > corruption. > > > > > Right. But neither does rsync, right? Hm? Rsync's really only safe with --checksum and with that it definitely should fix those? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format
On Tue, Jan 14, 2014 at 2:16 PM, Michael Paquier wrote: > On Tue, Jan 14, 2014 at 10:01 PM, Heikki Linnakangas > wrote: > >> - Addition of an option called INCREMENTAL to send an incremental > >> backup to the client. This option uses as input an LSN, and sends back > >> to client relation pages (in the shape of reduced relation files) that > >> are newer than the LSN specified by looking at pd_lsn of > >> PageHeaderData. In this case the LSN needs to be determined by client > >> based on the latest full backup taken. This option is particularly > >> interesting to reduce the amount of data taken between two backups, > >> even if it increases the restore time as client needs to reconstitute > >> a base backup depending on the recovery target and the pages modified. > >> Client would be in charge of rebuilding pages from incremental backup > >> by scanning all the blocks that need to be updated based on the full > >> backup as the LSN from which incremental backup is taken is known. But > >> this is not really something the server cares about... Such things are > >> actually done by pg_rman as well. > > > > > > How does the server find all the pages with LSN > the threshold? If it > needs > > to scan the whole database, it's not all that useful. I guess it would be > > better than nothing, but I think you might as well just use rsync. > Yes, it would be necessary to scan the whole database as the LSN to be > checked is kept in PageHeaderData :). Perhaps it is not that > performant, but my initial thought was that perhaps the amount of data > necessary to maintain incremental backups could balance with the > amount of WAL necessary to keep and limit the whole amount on disk. > It wouldn't be worse performance wise than a full backup. That one also has to read all the blocks after all... You're decreasing network traffic and client storage, with the same I/O on the server side. Seems worthwhile. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
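To make the cost being discussed concrete, the whole-database scan amounts to something like the sketch below: read every 8kB block, extract pd_lsn from the front of the page header, and count (or ship) only blocks newer than the previous backup's LSN. The file-at-a-time structure and all names here are illustrative; this is not code from the proposal or from pg_rman.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

/* pd_lsn is the first field of PageHeaderData: high and low halves of the LSN. */
typedef struct
{
    uint32_t xlogid;
    uint32_t xrecoff;
} PageXLogRecPtr;

static uint64_t page_lsn(const char *page)
{
    PageXLogRecPtr p;

    memcpy(&p, page, sizeof(p));
    return ((uint64_t) p.xlogid << 32) | p.xrecoff;
}

/* Count (and, in a real implementation, send) blocks changed since threshold_lsn. */
static long scan_relation_file(FILE *fp, uint64_t threshold_lsn)
{
    char page[BLCKSZ];
    long changed = 0;

    while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
    {
        if (page_lsn(page) > threshold_lsn)
            changed++;          /* here the block image would be emitted */
    }
    return changed;
}

int main(int argc, char **argv)
{
    if (argc < 3)
    {
        fprintf(stderr, "usage: %s relation-file threshold-lsn\n", argv[0]);
        return 1;
    }

    FILE *fp = fopen(argv[1], "rb");
    if (fp == NULL)
    {
        perror("fopen");
        return 1;
    }

    uint64_t threshold = strtoull(argv[2], NULL, 0);
    printf("%ld changed blocks\n", scan_relation_file(fp, threshold));
    fclose(fp);
    return 0;
}

Every block still has to be read from disk, which is Heikki's point; the savings are in network traffic and client-side storage, as Magnus notes.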
Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format
On Tue, Jan 14, 2014 at 2:41 PM, Andres Freund wrote: > On 2014-01-14 14:40:46 +0100, Magnus Hagander wrote: > > On Tue, Jan 14, 2014 at 2:18 PM, Andres Freund >wrote: > > > > > On 2014-01-14 14:12:46 +0100, Magnus Hagander wrote: > > > > Either way - if we can do this in a safe way, it sounds like a good > idea. > > > > It would be sort of like rsync, except relying on the fact that we > can > > > look > > > > at the LSN and don't have to compare the actual files, right? > > > > > > Which is an advantage, yes. On the other hand, it doesn't fix problems > > > with a subtly broken replica, e.g. after a bug in replay, or disk > > > corruption. > > > > > > > > Right. But neither does rsync, right? > > Hm? Rsync's really only safe with --checksum and with that it definitely > should fix those? > > I think we're talking about different scenarios. I thought you were talking about a backup taken from a replica, that already has corruption. rsync checksums surely aren't going to help with that? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format
On 2014-01-14 14:42:36 +0100, Magnus Hagander wrote: > On Tue, Jan 14, 2014 at 2:41 PM, Andres Freund wrote: > > > On 2014-01-14 14:40:46 +0100, Magnus Hagander wrote: > > > On Tue, Jan 14, 2014 at 2:18 PM, Andres Freund > >wrote: > > > > > > > On 2014-01-14 14:12:46 +0100, Magnus Hagander wrote: > > > > > Either way - if we can do this in a safe way, it sounds like a good > > idea. > > > > > It would be sort of like rsync, except relying on the fact that we > > can > > > > look > > > > > at the LSN and don't have to compare the actual files, right? > > > > > > > > Which is an advantage, yes. On the other hand, it doesn't fix problems > > > > with a subtly broken replica, e.g. after a bug in replay, or disk > > > > corruption. > > > > > > > > > > > Right. But neither does rsync, right? > > > > Hm? Rsync's really only safe with --checksum and with that it definitely > > should fix those? > > > > > I think we're talking about difference scenarios. Sounds like it. > I thought you were talking about a backup taken from a replica, that > already has corruption. rsync checksums surely aren't going to help with > that? I was talking about updating a standby using such an incremental or differential backup from the primary (or a standby higher up in the cascade). If your standby is corrupted in any way a rsync --checksum will certainly correct errors if it syncs from a correct source? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Filter error log statements by sqlstate
On Tue, Jan 14, 2014 at 12:22:30PM +0530, Jeevan Chalke wrote: > On Mon, Jan 13, 2014 at 4:30 PM, Oskari Saarenmaa wrote: > > On 13/01/14 10:26, Jeevan Chalke wrote: > > > > > 1. Documentation is missing and thus becomes difficult to understand > > > what exactly you are trying to do. Or in other words, user will be > > > uncertain about using it more efficiently. > > > > I figured I'd write documentation for this if it looks like a useful > > feature which would be accepted for 9.4, but I guess it would've > > helped to have a bit better description of this for the initial > > submission as well. > > > > > > > 2. Some more comments required. At each new function and > > > specifically at get_sqlstate_error_level(). > > > > Just after I submitted the patch I noticed that I had a placeholder > > for comment about that function but never wrote the actual comment, > > sorry about that. > > > > > 3. Please add test-case if possible. > > > > Sure. > > > > > 4. Some code part does not comply with PostgreSQL indentation style. > > > (Can be ignored as it will pass through pg_indent, but better fix > > > it). > > > > > > > I'll try to fix this for v2. > > > > > 5. You have used ""XX000:warning," string to get maximum possible > > > length of the valid sqlstate:level identifier. It's perfect, but > > > small explanation about that will be good there. Also in future if > > > we have any other error level which exceeds this, we need changes > > > here too. Right ? > > > > Good point, I'll address this in v2. > > > > > I will look into this further. But please have your attention on above > > > points. > > > > Thanks for the review! > > Since you are taking care of most of the points above. I will wait for v2 > patch. Till then marking "Waiting on Author". Attached v2 of the patch which addresses the above points. I couldn't figure out how to test log output, but at least the patch now tests that it can set and show the log level. Thanks, Oskari >From 213f647657f318141e3866087a17a863a0f322d9 Mon Sep 17 00:00:00 2001 From: Oskari Saarenmaa Date: Tue, 14 Jan 2014 15:47:39 +0200 Subject: [PATCH] Filter error log statements by sqlstate Allow the default log_min_error_statement to be overridden per sqlstate to make it possible to filter out some error types while maintaining a low log_min_error_statement or enable logging for some error types when the default is to not log anything. --- doc/src/sgml/config.sgml | 30 ++ src/backend/utils/error/elog.c| 220 +- src/backend/utils/misc/guc.c | 14 ++- src/include/utils/guc.h | 4 + src/include/utils/guc_tables.h| 1 + src/test/regress/expected/guc.out | 24 + src/test/regress/sql/guc.sql | 8 ++ 7 files changed, 298 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 0f2f2bf..73a58ad 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3743,6 +3743,36 @@ local0.*/var/log/postgresql + + log_sqlstate_error_statement (string) + + log_sqlstate_error_statement configuration parameter + + + +Controls which error types in SQL statements condition are recorded +in the server log. This overrides the global per error type and can be used to +disable logging for certain error types and/or to enable logging for +other types. + + +The value must be a comma-separated list in the form +'ERRORCODE:LOGLEVEL,...'. 
For example, a setting +of 'P0001:PANIC,22012:ERROR' would never log the +SQL statements for errors generated by the PL/pgSQL +RAISE statement but would always log the +statements causing division by zero errors. + +See for the list of valid error +codes and for valid log +levels. + +Only superusers can change this setting. + + + + log_min_duration_statement (integer) diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index 8705586..2e74fd5 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -74,7 +74,9 @@ #include "storage/ipc.h" #include "storage/proc.h" #include "tcop/tcopprot.h" +#include "utils/builtins.h" #include "utils/guc.h" +#include "utils/guc_tables.h" #include "utils/memutils.h" #include "utils/ps_status.h" @@ -111,6 +113,11 @@ char *Log_line_prefix = NULL; /* format for extra log line info */ intLog_destination = LOG_DESTINATION_STDERR; char *Log_destination_string = NULL; +static uint64 *log_sqlstate_error_statement = NULL; +static size_t log_sqlstate_error_statement_len = 0; + +static int get_sqlstate_error_level(int sqlstate); + #ifdef HAVE_SYSLOG /* @@ -2496,6 +2503,7 @@ static void write_csvlog(ErrorData *edata) { StringInfoData buf
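Purely to illustrate the 'SQLSTATE:LEVEL,...' format discussed above, here is a small standalone sketch of splitting such a list into pairs; it is not the parsing code from the patch, and the buffer size and output are arbitrary.

#include <stdio.h>
#include <string.h>

#define SQLSTATE_LEN 5

/* Parse a comma-separated list of SQLSTATE:LEVEL entries, e.g.
 * "P0001:PANIC,22012:ERROR".  Standalone illustration only. */
static int parse_sqlstate_levels(const char *value)
{
    char buf[1024];

    strncpy(buf, value, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    for (char *tok = strtok(buf, ","); tok != NULL; tok = strtok(NULL, ","))
    {
        char *colon = strchr(tok, ':');

        if (colon == NULL || colon - tok != SQLSTATE_LEN)
        {
            fprintf(stderr, "invalid entry: \"%s\"\n", tok);
            return -1;
        }
        *colon = '\0';
        printf("sqlstate %s -> level %s\n", tok, colon + 1);
    }
    return 0;
}

int main(void)
{
    return parse_sqlstate_levels("P0001:PANIC,22012:ERROR");
}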
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
On Tue, Jan 14, 2014 at 8:58 AM, KONDO Mitsumasa wrote: > > In my past patch, there is a significant bug: a mistaken calculation of the > offset in posix_fadvise() :-( However, it works well without problems in > pgbench, > because pgbench transactions are always random access... Did you notice any difference? AFAIK, when specifying read patterns (i.e. RANDOM, SEQUENTIAL and the like), the offset doesn't matter. At least on Linux. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
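For reference, a sketch of the calls under discussion. On Linux the pattern advices (POSIX_FADV_RANDOM, POSIX_FADV_SEQUENTIAL) apply to the whole open file, so a miscalculated offset is harmless for them, while the range advices (POSIX_FADV_WILLNEED, POSIX_FADV_DONTNEED) really do act on the given byte range. The helper below and its parameters are invented for illustration.

#define _XOPEN_SOURCE 600
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Hint the kernel about an upcoming access to one block. */
static void hint_block(int fd, off_t offset, off_t len, int sequential)
{
    int rc;

    /* Pattern advice: the offset and length are effectively ignored on Linux. */
    rc = posix_fadvise(fd, 0, 0,
                       sequential ? POSIX_FADV_SEQUENTIAL : POSIX_FADV_RANDOM);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise(pattern): %s\n", strerror(rc));

    /* Range advice: here a miscalculated offset really would prefetch
     * the wrong pages. */
    rc = posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise(WILLNEED): %s\n", strerror(rc));
}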
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On 01/14/2014 09:39 AM, Claudio Freire wrote: > On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing wrote: >> Again, as said above the linux file system is doing fine. What we >> want is a few ways to interact with it to let it do even better when >> working with postgresql by telling it some stuff it otherwise would >> have to second guess and by sometimes giving it back some cache >> pages which were copied away for potential modifying but ended >> up clean in the end. > You don't need new interfaces. Only a slight modification of what > fadvise DONTNEED does. > > This insistence in injecting pages from postgres to kernel is just a > bad idea. Do you think it would be possible to map copy-on-write pages from linux cache to postgresql cache ? this would be a step in direction of solving the double-ram-usage of pages which have not been read from syscache to postgresql cache without sacrificing linux read-ahead (which I assume does not happen when reads bypass system cache). and we can write back the copy at the point when it is safe (from postgresql perspective) to let the system write them back ? Do you think it is possible to make it work with good performance for a few million 8kb pages ? > At the very least, it still needs postgres to know too much > of the filesystem (block layout) to properly work. Ie: pg must be > required to put entire filesystem-level blocks into the page cache, > since that's how the page cache works. I was more thinking of an simple write() interface with extra flags/sysctls to tell kernel that "we already have this on disk" > At the very worst, it may > introduce serious security and reliability implications, when > applications can destroy the consistency of the page cache (even if > full access rights are checked, there's still the possibility this > inconsistency might be exploitable). If you allow write() which just writes clean pages, I can not see where the extra security concerns are beyond what normal write can do. Cheers -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman wrote: >> Amen to that. Actually, I think NUMA can be (mostly?) fixed by >> setting zone_reclaim_mode; is there some other problem besides that? > > Really? > > zone_reclaim_mode is often a complete disaster unless the workload is > partitioned to fit within NUMA nodes. On older kernels enabling it would > sometimes cause massive stalls. I'm actually very surprised to hear it > fixes anything and would be interested in hearing more about what sort > of circumstances would convince you to enable that thing. By "set" I mean "set to zero". We've seen multiple instances of people complaining about large amounts of system memory going unused because this setting defaulted to 1. >> The other thing that comes to mind is the kernel's caching behavior. >> We've talked a lot over the years about the difficulties of getting >> the kernel to write data out when we want it to and to not write data >> out when we don't want it to. > > Is sync_file_range() broke? I don't know. I think a few of us have played with it and not been able to achieve a clear win. Whether the problem is with the system call or the programmer is harder to determine. I think the problem is in part that it's not exactly clear when we should call it. So suppose we want to do a checkpoint. What we used to do a long time ago is write everything, and then fsync it all, and then call it good. But that produced horrible I/O storms. So what we do now is do the writes over a period of time, with sleeps in between, and then fsync it all at the end, hoping that the kernel will write some of it before the fsyncs arrive so that we don't get a huge I/O spike. And that sorta works, and it's definitely better than doing it all at full speed, but it's pretty imprecise. If the kernel doesn't write enough of the data out in advance, then there's still a huge I/O storm when we do the fsyncs and everything grinds to a halt. If it writes out more data than needed in advance, it increases the total number of physical writes because we get less write-combining, and that hurts performance, too. I basically feel like the I/O scheduler sucks, though whether it sucks because it's not theoretically possible to do any better or whether it sucks because of some more tractable reason is not clear to me. In an ideal world, when I call fsync() a bunch of times from one process, other processes on the same machine should not begin to observe 30+-second (or sometimes 300+-second) times for read or write of an 8kB block. Imagine a hypothetical UNIX-like system where when one process starts running at 100% CPU, every other process on the machine gets timesliced in only once per minute. That's obviously ridiculous, and yet it's pretty much exactly what happens with I/O. >> When it writes data back to disk too >> aggressively, we get lousy throughput because the same page can get >> written more than once when caching it for longer would have allowed >> write-combining. > > Do you think that is related to dirty_ratio or dirty_writeback_centisecs? > If it's dirty_writeback_centisecs then that would be particularly tricky > because poor interactions there would come down to luck basically. See above; I think it's related to fsync. >> When it doesn't write data to disk aggressively >> enough, we get huge latency spikes at checkpoint time when we call >> fsync() and the kernel says "uh, what? you wanted that data *on the >> disk*? sorry boss!"
and then proceeds to destroy the world by starving >> the rest of the system for I/O for many seconds or minutes at a time. > > Ok, parts of that are somewhat expected. It *may* depend on the > underlying filesystem. Some of them handle fsync better than others. If > you are syncing the whole file though when you call fsync then you are > potentially burned by having to writeback dirty_ratio amounts of memory > which could take a substantial amount of time. Yeah. ext3 apparently fsyncs the whole filesystem, which is terrible for throughput, but if you happen to have xlog (which is flushed regularly) on the same filesystem as the data files (which are flushed only periodically) then at least you don't have the problem of the write queue getting too large. But I think most of our users are on ext4 at this point, probably some xfs and other things. We track the number of un-fsync'd blocks we've written to each file, and have gotten desperate enough to think of approaches like - ok, well if the total number of un-fsync'd blocks in the system exceeds some threshold, then fsync the file with the most such blocks, not because we really need the data on disk just yet but so that the write queue won't get too large for the kernel to deal with. And I think there may even be some test results from such crocks showing some benefit. But really, I don't understand why we have to baby the kernel like this. Ensuring scheduling fairness is a basic job of the kernel; if we wanted to have to cont
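For concreteness, one hedged sketch of the kind of scheme sync_file_range() invites: after writing a batch of buffers, start asynchronous writeback of just that range, so the final fsync() has little left to flush. The batch size and sleep are invented tuning knobs, not anything PostgreSQL ships.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

/* Start asynchronous writeback of one just-written range. */
static int trickle_range(int fd, off_t offset, off_t nbytes)
{
    if (sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) != 0)
    {
        perror("sync_file_range");
        return -1;
    }
    return 0;
}

/* Spread writeback of a whole file over time, then make it durable. */
static int checkpoint_file(int fd, off_t filesize, off_t batch)
{
    for (off_t off = 0; off < filesize; off += batch)
    {
        if (trickle_range(fd, off, batch) != 0)
            return -1;
        usleep(100 * 1000);     /* pause so the I/O is trickled, not dumped */
    }
    return fsync(fd);           /* durability still comes only from fsync() */
}

Whether this actually smooths the fsync() spike depends on how the filesystem and the writeback code respond, which is exactly the uncertainty described above.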
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
First off, I want to give a +1 on everything in the recent posts from Heikki and Hannu. Jan Kara wrote: > Now the aging of pages marked as volatile as it is currently > implemented needn't be perfect for your needs but you still have > time to influence what gets implemented... Actually developers of > the vrange() syscall were specifically looking for some ideas > what to base aging on. Currently I think it is first marked - > first evicted. The "first marked - first evicted" seems like what we would want. The ability to "unmark" and have the page no longer be considered preferred for eviction would be very nice. That seems to me like it would cover the multiple layers of buffering *clean* pages very nicely (although I know nothing more about vrange() than what has been said on this thread, so I could be missing something). The other side of that is related avoiding multiple writes of the same page as much as possible, while avoid write gluts. The issue here is that PostgreSQL tries to hang on to dirty pages for as long as possible before "writing" them to the OS cache, while the OS tries to avoid writing them to storage for as long as possible until they reach a (configurable) threshold or are fsync'd. The problem is that a under various conditions PostgreSQL may need to write and fsync a lot of dirty pages it has accumulated in a short time. That has an "avalanche" effect, creating a "write glut" which can stall all I/O for a period of many seconds up to a few minutes. If the OS was aware of the dirty pages pending write in the application, and counted those for purposes of calculating when and how much to write, the glut could be avoided. Currently, people configure the PostgreSQL background writer to be very aggressive, configure a small PostgreSQL shared_buffers setting, and/or set the OS thresholds low enough to minimize the problem; but all of these mitigation strategies have their own costs. A new hint that the application has dirtied a page could be used by the OS to improve things this way: When the OS is notified that a page is dirty, it takes action depending on whether the page is considered dirty by the OS. If it is not dirty, the page is immediately discarded from the OS cache. It is known that the application has a modified version of the page that it intends to write, so the version in the OS cache has no value. We don't want this page forcing eviction of vrange()-flagged pages. If it is dirty, any write ordering to storage by the OS based on when the page was written to the OS would be pushed back as far as possible without crossing any write barriers, in hopes that the writes could be combined. Either way, this page is counted toward dirty pages for purposes of calculating how much to write from the OS to storage, and the later write of the page doesn't redundantly add to this number. The combination of these two changes could boost PostgreSQL performance quite a bit, at least for some common workloads. The MMAP approach always seems tempting on first blush, but the need to "pin" pages and the need to assure that dirty pages are not written ahead of the WAL-logging of those pages makes it hard to see how we can use it. The "pin" means that we need to ensure that a particular 8KB page remains available for direct reference by all PostgreSQL processes until it is "unpinned". The other thing we would need is the ability to modify a page with a solid assurance that the modified page would *not* be written to disk until we authorize it. 
The page would remain pinned until we do authorize write, at which point the changes are available to be written, but can wait for an fsync or accumulations of sufficient dirty pages to cross the write threshold. Next comes the hard part. The page may or may not be unpinned after that, and if it remains pinned or is pinned again, there may be further changes to the page. While the prior changes can be written (and *must* be written for an fsync), these new changes must *not* be until we authorize it. If MMAP can be made to handle that, we could probably use it (and some of the previously-discussed techniques might not be needed), but my understanding is that there is currently no way to do so. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 11:39 AM, Hannu Krosing wrote: > On 01/14/2014 09:39 AM, Claudio Freire wrote: >> On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing wrote: >>> Again, as said above the linux file system is doing fine. What we >>> want is a few ways to interact with it to let it do even better when >>> working with postgresql by telling it some stuff it otherwise would >>> have to second guess and by sometimes giving it back some cache >>> pages which were copied away for potential modifying but ended >>> up clean in the end. >> You don't need new interfaces. Only a slight modification of what >> fadvise DONTNEED does. >> >> This insistence in injecting pages from postgres to kernel is just a >> bad idea. > Do you think it would be possible to map copy-on-write pages > from linux cache to postgresql cache ? > > this would be a step in direction of solving the double-ram-usage > of pages which have not been read from syscache to postgresql > cache without sacrificing linux read-ahead (which I assume does > not happen when reads bypass system cache). > > and we can write back the copy at the point when it is safe (from > postgresql perspective) to let the system write them back ? > > Do you think it is possible to make it work with good performance > for a few million 8kb pages ? I don't think so. The kernel would need to walk the page mapping on each page fault, which would incurr the cost of a read cache hit on each page fault. A cache hit is still orders of magnitude slower than a regular page fault, because the process page map is compact and efficient. But if you bloat it, or if you make the kernel go read the buffer cache, it would mean bad performance for RAM access, which I'd venture isn't really a net gain. That's probably the reason there is no zero-copy read mechanism. Because you always have to copy from/to the buffer cache anyway. Of course, this is just OTOMH. Without actually benchmarking, this is all blabber. >> At the very worst, it may >> introduce serious security and reliability implications, when >> applications can destroy the consistency of the page cache (even if >> full access rights are checked, there's still the possibility this >> inconsistency might be exploitable). > If you allow write() which just writes clean pages, I can not see > where the extra security concerns are beyond what normal > write can do. I've been working on security enough to never dismiss any kind of system-level inconsistency. The fact that you can make user-land applications see different data than kernel-land code has over-reaching consequences that are hard to ponder. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 3:39 AM, Claudio Freire wrote: > On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing wrote: >> Again, as said above the linux file system is doing fine. What we >> want is a few ways to interact with it to let it do even better when >> working with postgresql by telling it some stuff it otherwise would >> have to second guess and by sometimes giving it back some cache >> pages which were copied away for potential modifying but ended >> up clean in the end. > > You don't need new interfaces. Only a slight modification of what > fadvise DONTNEED does. Yeah. DONTREALLYNEEDALLTHATTERRIBLYMUCH. > This insistence in injecting pages from postgres to kernel is just a > bad idea. At the very least, it still needs postgres to know too much > of the filesystem (block layout) to properly work. Ie: pg must be > required to put entire filesystem-level blocks into the page cache, > since that's how the page cache works. At the very worst, it may > introduce serious security and reliability implications, when > applications can destroy the consistency of the page cache (even if > full access rights are checked, there's still the possibility this > inconsistency might be exploitable). I agree with all that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 5:00 AM, Jan Kara wrote: > I thought that instead of injecting pages into pagecache for aging as you > describe in 3), you would mark pages as volatile (i.e. for reclaim by > kernel) through vrange() syscall. Next time you need the page, you check > whether the kernel reclaimed the page or not. If yes, you reload it from > disk, if not, you unmark it and use it. > > Now the aging of pages marked as volatile as it is currently implemented > needn't be perfect for your needs but you still have time to influence what > gets implemented... Actually developers of the vrange() syscall were > specifically looking for some ideas what to base aging on. Currently I > think it is first marked - first evicted. This is an interesting idea but it stinks of impracticality. Essentially when the last buffer pin on a page is dropped we'd have to mark it as discardable, and then the next person wanting to pin it would have to check whether it's still there. But the system call overhead of calling vrange() every time the last pin on a page was dropped would probably hose us. *thinks* Well, I guess it could be done lazily: make periodic sweeps through shared_buffers, looking for pages that haven't been touched in a while, and vrange() them. That's quite a bit of new mechanism, but in theory it could work out to a win. vrange() would have to scale well to millions of separate ranges, though. Will it? And a lot depends on whether the kernel makes the right decision about whether to chunk data from our vrange() vs. any other page it could have reclaimed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
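Since vrange() was only a proposed syscall, there is no real interface to code against; the sketch below merely illustrates the lazy periodic sweep described above, with a stubbed-in vrange() whose name, signature and flag are entirely hypothetical.

#include <stddef.h>
#include <time.h>

#define VRANGE_VOLATILE 1

/* Stub standing in for the proposed (never merged) syscall. */
static int vrange(void *start, size_t length, int mode)
{
    (void) start; (void) length; (void) mode;
    return 0;
}

typedef struct
{
    void  *page;            /* the 8kB buffer */
    int    pin_count;
    time_t last_unpinned;
    int    is_volatile;
} BufferDesc;

/*
 * Periodic sweep: mark buffers that have been unpinned for idle_secs as
 * reclaimable, instead of paying a syscall on every pin release.
 */
static void volatile_sweep(BufferDesc *buf, int nbuffers, int idle_secs)
{
    time_t now = time(NULL);
    int    i;

    for (i = 0; i < nbuffers; i++)
    {
        if (buf[i].pin_count == 0 && !buf[i].is_volatile &&
            now - buf[i].last_unpinned > idle_secs)
        {
            if (vrange(buf[i].page, 8192, VRANGE_VOLATILE) == 0)
                buf[i].is_volatile = 1;
        }
    }
}

Before reusing a swept buffer the reverse call would be needed, plus a check of whether the kernel already reclaimed the page; that is the "reload it from disk" branch in Jan's description.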
Re: [HACKERS] plpgsql.consistent_into
On 1/14/14 1:28 PM, Pavel Stehule wrote: > I prefer subquery only syntax - a := (some) or (a,b,c) = (some a,b,c) with > possible enhancement for statements with RETURNING > > an advantage is compatibility with DB2 (SQL/PSM) syntax - and this code is > written now - it is done in my sql/psm implementation Are you saying that DB2 supports the multi-variable assignment? I couldn't find any reference suggesting that, can you point me to one? Regards, Marko Tiikkaja -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] extension_control_path
Dimitri Fontaine writes: > Please find attached to this email a patch implementing a new GUC that > allows users to setup a list of path where PostgreSQL will search for > the extension control files at CREATE EXTENSION time. Why is that a good idea? It's certainly not going to simplify DBAs' lives, more the reverse. ("This dump won't reload." "Uh, where did you get that extension from?" "Ummm...") Assuming that there is some need for loading extensions from nonstandard places, would it be better to just allow a filename specification in CREATE EXTENSION? (I don't know the answer, since the use-case isn't apparent to me in the first place, but it seems worth asking.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GIN improvements part2: fast scan
On Thu, Nov 21, 2013 at 12:14 AM, Alexander Korotkov wrote: > On Wed, Nov 20, 2013 at 3:06 AM, Alexander Korotkov > wrote: > >> On Fri, Nov 15, 2013 at 11:19 AM, Alexander Korotkov < >> aekorot...@gmail.com> wrote: >> >>> On Fri, Nov 15, 2013 at 12:34 AM, Heikki Linnakangas < >>> hlinnakan...@vmware.com> wrote: >>> On 14.11.2013 19:26, Alexander Korotkov wrote: > On Sun, Jun 30, 2013 at 3:00 PM, Heikki Linnakangas < > hlinnakan...@vmware.com > >> wrote: >> > > On 28.06.2013 22:31, Alexander Korotkov wrote: >> >> Now, I got the point of three state consistent: we can keep only one >>> consistent in opclasses that support new interface. exact true and >>> exact >>> false values will be passed in the case of current patch consistent; >>> exact >>> false and unknown will be passed in the case of current patch >>> preConsistent. That's reasonable. >>> >>> >> I'm going to mark this as "returned with feedback". For the next >> version, >> I'd like to see the API changed per above. Also, I'd like us to do >> something about the tidbitmap overhead, as a separate patch before >> this, so >> that we can assess the actual benefit of this patch. And a new test >> case >> that demonstrates the I/O benefits. >> > > > Revised version of patch is attached. > Changes are so: > 1) Patch rebased against packed posting lists, not depends on > additional > information now. > 2) New API with tri-state logic is introduced. > Thanks! A couple of thoughts after a 5-minute glance: * documentation >>> >>> Will provide documented version this week. >>> >>> * How about defining the tri-state consistent function to also return a tri-state? True would mean that the tuple definitely matches, false means the tuple definitely does not match, and Unknown means it might match. Or does return value true with recheck==true have the same effect? If I understood the patch, right, returning Unknown or True wouldn't actually make any difference, but it's conceivable that we might come up with more optimizations in the future that could take advantage of that. For example, for a query like "foo OR (bar AND baz)", you could immediately return any tuples that match foo, and not bother scanning for bar and baz at all. >>> >>> >>> The meaning of recheck flag when input contains unknown is undefined >>> now. :) >>> For instance, we could define it in following ways: >>> 1) Like returning Unknown meaning that consistent with true of false >>> instead of input Unknown could return either true or false. >>> 2) Consistent with true of false instead of input Unknown could return >>> recheck. This meaning is probably logical, but I don't see any usage of it. >>> >>> I'm not against idea of tri-state returning value for consisted, because >>> it's logical continuation of its tri-state input. However, I don't see >>> usage of distinguish True and Unknown in returning value for now :) >>> >>> In example you give we can return foo immediately, but we have to create >>> full bitmap. So we anyway will have to scan (bar AND baz). We could skip >>> part of trees for bar and baz. But it's possible only when foo contains >>> large amount of sequential TIDS so we can be sure that we didn't miss any >>> TIDs. This seems to be very narrow use-case for me. >>> >>> Another point is that one day we probably could immediately return >>> tuples in gingettuple. And with LIMIT clause and no sorting we can don't >>> search for other tuples. However, gingettuple was removed because of >>> reasons of concurrency. 
And my patches for index-based ordering didn't >>> return it in previous manner: it collects all the results and then returns >>> them one-by-one. >>> >> >> I'm trying to make fastscan work with GinFuzzySearchLimit. Then I figure >> out that I don't understand how GinFuzzySearchLimit works. Why with >> GinFuzzySearchLimit startScan can return without doing startScanKey? Is it >> a bug? >> > > Revised version of patch is attached. Changes are so: > 1) Support for GinFuzzySearchLimit. > 2) Some documentation. > Question about GinFuzzySearchLimit is still relevant. > Attached version is rebased against last version of packed posting lists. -- With best regards, Alexander Korotkov. gin-fast-scan.9.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
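For readers following the tri-state discussion, here is a self-contained sketch of a ternary consistent check for a query that ANDs its keys. The type and constant names are invented, and the exact API shape (including whether the return value itself is ternary) is precisely what is being debated above.

/* Three-valued result of checking one key against one item. */
typedef enum
{
    TS_FALSE = 0,   /* key definitely not present for this item */
    TS_TRUE  = 1,   /* key definitely present */
    TS_MAYBE = 2    /* not checked yet / unknown */
} TernaryValue;

/* Ternary consistent check for an AND-of-keys query (e.g. 'bar & baz'). */
static TernaryValue
and_consistent(const TernaryValue *check, int nkeys)
{
    int saw_maybe = 0;
    int i;

    for (i = 0; i < nkeys; i++)
    {
        if (check[i] == TS_FALSE)
            return TS_FALSE;        /* one definite miss refutes the AND */
        if (check[i] == TS_MAYBE)
            saw_maybe = 1;
    }
    /* All keys are TRUE or MAYBE: the item may match; it is a definite
     * match only if nothing was left MAYBE. */
    return saw_maybe ? TS_MAYBE : TS_TRUE;
}

The payoff for fast scan is that a definite FALSE obtained while some keys are still MAYBE lets items be skipped without ever fetching the posting lists for those unknown keys.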
Re: [HACKERS] plpgsql.consistent_into
Hello 2014/1/14 Marko Tiikkaja > On 1/14/14 1:28 PM, Pavel Stehule wrote: > >> I prefer subquery only syntax - a := (some) or (a,b,c) = (some a,b,c) with >> possible enhancement for statements with RETURNING >> >> an advantage is compatibility with DB2 (SQL/PSM) syntax - and this code is >> written now - it is done in my sql/psm implementation >> > > Are you saying that DB2 supports the multi-variable assignment? I > couldn't find any reference suggesting that, can you point me to one? > I was wrong - it is different syntax - it is SQL/PSM form - DB2 supports SET var=val, var=val, ... http://postgres.cz/wiki/Basic_statements_SQL/PSM_samples Regards Pavel > > Regards, > Marko Tiikkaja >
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
James Bottomley writes: > The current mechanism for coherency between a userspace cache and the > in-kernel page cache is mmap ... that's the only way you get the same > page in both currently. Right. > glibc used to have an implementation of read/write in terms of mmap, so > it should be possible to insert it into your current implementation > without a major rewrite. The problem I think this brings you is > uncontrolled writeback: you don't want dirty pages to go to disk until > you issue a write() Exactly. > I think we could fix this with another madvise(): > something like MADV_WILLUPDATE telling the page cache we expect to alter > the pages again, so don't be aggressive about cleaning them. "Don't be aggressive" isn't good enough. The prohibition on early write has to be absolute, because writing a dirty page before we've done whatever else we need to do results in a corrupt database. It has to be treated like a write barrier. > The problem is we can't give you absolute control of when pages are > written back because that interface can be used to DoS the system: once > we get too many dirty uncleanable pages, we'll thrash looking for memory > and the system will livelock. Understood, but that makes this direction a dead end. We can't use it if the kernel might decide to write anyway. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PoC: Partial sort
Hi! On Tue, Jan 14, 2014 at 12:54 AM, Marti Raudsepp wrote: > First, thanks a lot for working on this feature. This PostgreSQL > shortcoming crops up in all the time in web applications that implement > paging by multiple sorted columns. > Thanks! I've been trying it out in a few situations. I implemented a new > enable_partialsort GUC to make it easier to turn on/off, this way it's a > lot easier to test. The attached patch applies on top of > partial-sort-5.patch > I though about such option. Generally not because of testing convenience, but because of overhead of planning. This way you implement it is quite naive :) For instance, merge join rely on partial sort which will be replaced with simple sort. > I will spend more time reviewing the patch, but some of this planner code > is over my head. If there's any way I can help to make sure this lands in > the next version, let me know. > > > > The patch performs just as well as I would expect it to: > > marti=# select ac.name, r.name from artist_credit ac join release r on ( > ac.id=r.artist_credit) order by ac.name, r.name limit 1000; > Time: 9.830 ms > marti=# set enable_partialsort = off; > marti=# select ac.name, r.name from artist_credit ac join release r on ( > ac.id=r.artist_credit) order by ac.name, r.name limit 1000; > Time: 1442.815 ms > > A difference of almost 150x! > > There's a missed opportunity in that the code doesn't consider pushing new > Sort steps into subplans. For example, if there's no index on > language(name) then this query cannot take advantage partial sorts: > > marti=# explain select l.name, r.name from language l join release r on ( > l.id=r.language) order by l.name, r.name limit 1000; > Limit (cost=123203.20..123205.70 rows=1000 width=32) >-> Sort (cost=123203.20..126154.27 rows=1180430 width=32) > Sort Key: l.name, r.name > -> Hash Join (cost=229.47..58481.49 rows=1180430 width=32) >Hash Cond: (r.language = l.id) >-> Seq Scan on release r (cost=0.00..31040.10 > rows=1232610 width=26) >-> Hash (cost=131.43..131.43 rows=7843 width=14) > -> Seq Scan on language l (cost=0.00..131.43 > rows=7843 width=14) > > But because there are only so few languages, it would be a lot faster to > sort languages in advance and then do partial sort: > Limit (rows=1000 width=31) >-> Partial sort (rows=1180881 width=31) > Sort Key: l.name, r.name > Presorted Key: l.name > -> Nested Loop (rows=1180881 width=31) >-> Sort (rows=7843 width=10) > Sort Key: name > -> Seq Scan on language (rows=7843 width=14) >-> Index Scan using release_language_idx on release r > (rows=11246 width=25) > Index Cond: (language = l.id) > > Even an explicit sorted CTE cannot take advantage of partial sorts: > marti=# explain with sorted_lang as (select id, name from language order > by name) > marti-# select l.name, r.name from sorted_lang l join release r on > (l.id=r.language) > order by l.name, r.name limit 1000; > Limit (cost=3324368.83..3324371.33 rows=1000 width=240) >CTE sorted_lang > -> Sort (cost=638.76..658.37 rows=7843 width=14) >Sort Key: language.name >-> Seq Scan on language (cost=0.00..131.43 rows=7843 width=14) >-> Sort (cost=3323710.46..3439436.82 rows=46290543 width=240) > Sort Key: l.name, r.name > -> Merge Join (cost=664.62..785649.92 rows=46290543 width=240) >Merge Cond: (r.language = l.id) >-> Index Scan using release_language_idx on release r > (cost=0.43..87546.06 rows=1232610 width=26) >-> Sort (cost=664.19..683.80 rows=7843 width=222) > Sort Key: l.id > -> CTE Scan on sorted_lang l (cost=0.00..156.86 > rows=7843 width=222) > > 
But even with these limitations, this will easily be the killer feature of > the next release, for me at least. > I see. But I don't think it can be achieved by small changes in the planner. Moreover, I didn't check, but I think that if you remove the ordering by r.name you will still not get the languages sorted in the inner node. So this problem is not directly related to partial sort. -- With best regards, Alexander Korotkov.
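As a toy illustration of the underlying idea (not the patch's executor code), a partial sort only has to re-sort runs of rows that already share the ordered leading key; the Row type and the data below are invented.

#include <stdio.h>
#include <stdlib.h>

typedef struct { int lead; int trail; } Row;

static int cmp_trail(const void *a, const void *b)
{
    const Row *ra = a;
    const Row *rb = b;

    return (ra->trail > rb->trail) - (ra->trail < rb->trail);
}

/* Input is already ordered by lead; sort each equal-lead run on trail. */
static void partial_sort(Row *rows, size_t n)
{
    size_t start = 0;

    while (start < n)
    {
        size_t end = start + 1;

        while (end < n && rows[end].lead == rows[start].lead)
            end++;
        /* Runs are typically much smaller than the whole input; a real
         * executor could emit each run (and satisfy a LIMIT) before
         * reading the rest. */
        qsort(rows + start, end - start, sizeof(Row), cmp_trail);
        start = end;
    }
}

int main(void)
{
    Row rows[] = { {1,9}, {1,3}, {1,5}, {2,2}, {2,1}, {3,7} };
    size_t n = sizeof(rows) / sizeof(rows[0]);
    size_t i;

    partial_sort(rows, n);
    for (i = 0; i < n; i++)
        printf("%d %d\n", rows[i].lead, rows[i].trail);
    return 0;
}

Because each run can be emitted as soon as it is sorted, a LIMIT can be satisfied after reading only a fraction of the input, which is essentially where the 150x difference reported above comes from.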
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 12:42 PM, Trond Myklebust wrote: >> James Bottomley writes: >>> The current mechanism for coherency between a userspace cache and the >>> in-kernel page cache is mmap ... that's the only way you get the same >>> page in both currently. >> >> Right. >> >>> glibc used to have an implementation of read/write in terms of mmap, so >>> it should be possible to insert it into your current implementation >>> without a major rewrite. The problem I think this brings you is >>> uncontrolled writeback: you don't want dirty pages to go to disk until >>> you issue a write() >> >> Exactly. >> >>> I think we could fix this with another madvise(): >>> something like MADV_WILLUPDATE telling the page cache we expect to alter >>> the pages again, so don't be aggressive about cleaning them. >> >> "Don't be aggressive" isn't good enough. The prohibition on early write >> has to be absolute, because writing a dirty page before we've done >> whatever else we need to do results in a corrupt database. It has to >> be treated like a write barrier. > > Then why are you dirtying the page at all? It makes no sense to tell the > kernel “we're changing this page in the page cache, but we don't want you to > change it on disk”: that's not consistent with the function of a page cache. PG doesn't currently. All that dirtying happens in anonymous shared memory, in pg-specific buffers. The proposal is to use mmap instead of anonymous shared memory as pg-specific buffers to avoid the extra copy (mmap would share the page with both kernel and user space). But that would dirty the page when written to, because now the kernel has the correspondence between that specific memory region and the file, and that's forbidden for PG's usage. I believe the only option here is for the kernel to implement zero-copy reads. But that implementation is doomed for the performance reasons I outlined in an earlier mail. So... -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
Trond Myklebust writes: > On Jan 14, 2014, at 10:39, Tom Lane wrote: >> "Don't be aggressive" isn't good enough. The prohibition on early write >> has to be absolute, because writing a dirty page before we've done >> whatever else we need to do results in a corrupt database. It has to >> be treated like a write barrier. > Then why are you dirtying the page at all? It makes no sense to tell the > kernel "we're changing this page in the page cache, but we don't want you to > change it on disk": that's not consistent with the function of a page cache. As things currently stand, we dirty the page in our internal buffers, and we don't write it to the kernel until we've written and fsync'd the WAL data that needs to get to disk first. The discussion here is about whether we could somehow avoid double-buffering between our internal buffers and the kernel page cache. I personally think there is no chance of using mmap for that; the semantics of mmap are pretty much dictated by POSIX and they don't work for this. However, disregarding the fact that the two communities speaking here don't control the POSIX spec, you could maybe imagine making it work if *both* pending WAL file contents and data file contents were mmap'd, and there were kernel APIs allowing us to say "you can write this mmap'd page if you want, but not till you've written that mmap'd data over there". That'd provide the necessary write-barrier semantics, and avoid the cache coherency question because all the data visible to the kernel could be thought of as the "current" filesystem contents, it just might not all have reached disk yet; which is the behavior of the kernel disk cache already. I'm dubious that this sketch is implementable with adequate efficiency, though, because in a live system the kernel would be forced to deal with a whole lot of active barrier restrictions. Within Postgres we can reduce write-ordering tests to a very simple comparison: don't write this page until WAL is flushed to disk at least as far as WAL sequence number XYZ. I think any kernel API would have to be a great deal more general and thus harder to optimize. Another difficulty with merging our internal buffers with the kernel cache is that when we're in the process of applying a change to a page, there are intermediate states of the page data that should under no circumstances reach disk (eg, we might need to shuffle records around within the page). We can deal with that fairly easily right now by not issuing a write() while a page change is in progress. I don't see that it's even theoretically possible in an mmap'd world; there are no atomic updates to an mmap'd page that are larger than whatever is an atomic update for the CPU. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue 14-01-14 09:08:40, Hannu Krosing wrote: > >>> Effectively you end up with buffered read/write that's also mapped into > >>> the page cache. It's a pretty awful way to hack around mmap. > >> Well, the problem is that you can't really use mmap() for the things we > >> do. Postgres' durability works by guaranteeing that our journal entries > >> (called WAL := Write Ahead Log) are written & synced to disk before the > >> corresponding entries of tables and indexes reach the disk. That also > >> allows to group together many random-writes into a few contiguous writes > >> fdatasync()ed at once. Only during a checkpointing phase the big bulk of > >> the data is then (slowly, in the background) synced to disk. > > Which is the exact algorithm most journalling filesystems use for > > ensuring durability of their metadata updates. Indeed, here's an > > interesting piece of architecture that you might like to consider: > > > > * Neither XFS and BTRFS use the kernel page cache to back their > > metadata transaction engines. > But file system code is supposed to know much more about the > underlying disk than a mere application program like postgresql. > > We do not want to start duplicating OS if we can avoid it. > > What we would like is to have a way to tell the kernel > > 1) "here is the modified copy of file page, it is now safe to write > it back" - the current 'lazy' write > > 2) "here is the page, write it back now, before returning success > to me" - unbuffered write or write + sync > > but we also would like to have > > 3) "here is the page as it is currently on disk, I may need it soon, > so keep it together with your other clean pages accessed at time X" > - this is the non-dirtying write discussed > > the page may be in buffer cache, in which case just update its LRU > position (to either current time or time provided by postgresql), or > it may not be there, in which case put it there if reasonable by it's > LRU position. > > And we would like all this to work together with other current linux > kernel goodness of managing the whole disk-side interaction of > efficient reading and writing and managing the buffers :) So when I was speaking about the proposed vrange() syscall in this thread, I thought that instead of injecting pages into pagecache for aging as you describe in 3), you would mark pages as volatile (i.e. for reclaim by kernel) through vrange() syscall. Next time you need the page, you check whether the kernel reclaimed the page or not. If yes, you reload it from disk, if not, you unmark it and use it. Now the aging of pages marked as volatile as it is currently implemented needn't be perfect for your needs but you still have time to influence what gets implemented... Actually developers of the vrange() syscall were specifically looking for some ideas what to base aging on. Currently I think it is first marked - first evicted. Honza -- Jan Kara SUSE Labs, CR -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue 14-01-14 11:11:28, Heikki Linnakangas wrote: > On 01/14/2014 12:26 AM, Mel Gorman wrote: > >On Mon, Jan 13, 2014 at 03:15:16PM -0500, Robert Haas wrote: > >>The other thing that comes to mind is the kernel's caching behavior. > >>We've talked a lot over the years about the difficulties of getting > >>the kernel to write data out when we want it to and to not write data > >>out when we don't want it to. > > > >Is sync_file_range() broke? > > > >>When it writes data back to disk too > >>aggressively, we get lousy throughput because the same page can get > >>written more than once when caching it for longer would have allowed > >>write-combining. > > > >Do you think that is related to dirty_ratio or dirty_writeback_centisecs? > >If it's dirty_writeback_centisecs then that would be particularly tricky > >because poor interactions there would come down to luck basically. > > >>When it doesn't write data to disk aggressively > >>enough, we get huge latency spikes at checkpoint time when we call > >>fsync() and the kernel says "uh, what? you wanted that data *on the > >>disk*? sorry boss!" and then proceeds to destroy the world by starving > >>the rest of the system for I/O for many seconds or minutes at a time. > > > >Ok, parts of that are somewhat expected. It *may* depend on the > >underlying filesystem. Some of them handle fsync better than others. If > >you are syncing the whole file though when you call fsync then you are > >potentially burned by having to writeback dirty_ratio amounts of memory > >which could take a substantial amount of time. > > > >>We've made some desultory attempts to use sync_file_range() to improve > >>things here, but I'm not sure that's really the right tool, and if it > >>is we don't know how to use it well enough to obtain consistent > >>positive results. > > > >That implies that either sync_file_range() is broken in some fashion we > >(or at least I) are not aware of and that needs kicking. > > Let me try to explain the problem: Checkpoints can cause an I/O > spike, which slows down other processes. > > When it's time to perform a checkpoint, PostgreSQL will write() all > dirty buffers from the PostgreSQL buffer cache, and finally perform > an fsync() to flush the writes to disk. After that, we know the data > is safely on disk. > > In older PostgreSQL versions, the write() calls would cause an I/O > storm as the OS cache quickly fills up with dirty pages, up to > dirty_ratio, and after that all subsequent write()s block. That's OK > as far as the checkpoint is concerned, but it significantly slows > down queries running at the same time. Even a read-only query often > needs to write(), to evict a dirty page from the buffer cache to > make room for a different page. We made that less painful by adding > sleeps between the write() calls, so that they are trickled over a > long period of time and hopefully stay below dirty_ratio at all > times. Hum, I wonder whether you see any difference with reasonably recent kernels (say newer than 3.2). Because those have IO-less dirty throttling. That means that: a) checkpointing thread (or other threads blocked due to dirty limit) won't issue IO on their own but rather wait for flusher thread to do the work. b) there should be more noticeable difference between the delay imposed on heavily dirtying thread (i.e. the checkpointing thread) and the delay imposed on lightly dirtying thread (that's what I would expect from those threads having to do occasional page eviction to make room for other page). 
> However, we still have to perform the fsync()s after the > writes(), and sometimes that still causes a similar I/O storm. Because there is still quite some dirty data in the page cache or because e.g. ext3 has to flush a lot of unrelated dirty data? > The checkpointer is not in a hurry. A checkpoint typically has 10-30 > minutes to finish, before it's time to start the next checkpoint, > and even if it misses that deadline that's not too serious either. > But the OS doesn't know that, and we have no way of telling it. > > As a quick fix, some sort of a lazy fsync() call would be nice. It > would behave just like fsync() but it would not change the I/O > scheduling at all. Instead, it would sleep until all the pages have > been flushed to disk, at the speed they would've been without the > fsync() call. > > Another approach would be to give the I/O that the checkpointer > process initiates a lower priority. This would be slightly > preferable, because PostgreSQL could then issue the writes() as fast > as it can, and have the checkpoint finish earlier when there's not > much other load. Last I looked into this (which was a long time > ago), there was no suitable priority system for writes, only reads. Well, IO priority works for writes in principle, the trouble is it doesn't work for writes which end up just in the page cache. Then writeback of page cache is usually done by flusher thread so that's completely disconnected f
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Mon, Jan 13, 2014 at 09:29:02PM +, Greg Stark wrote: > On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund wrote: > > For one, postgres doesn't use mmap for files (and can't without major > > new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has > > horrible consequences for performance/scalability - very quickly you > > contend on locks in the kernel. > > I may as well dump this in this thread. We've discussed this in person > a few times, including at least once with Ted T'so when he visited > Dublin last year. > > The fundamental conflict is that the kernel understands better the > hardware and other software using the same resources, Postgres > understands better its own access patterns. We need to either add > interfaces so Postgres can teach the kernel what it needs about its > access patterns or add interfaces so Postgres can find out what it > needs to know about the hardware context. In my experience applications don't need to know anything about the underlying storage hardware - all they need is for someone to tell them the optimal IO size and alignment to use. > The more ambitious and interesting direction is to let Postgres tell > the kernel what it needs to know to manage everything. To do that we > would need the ability to control when pages are flushed out. This is > absolutely necessary to maintain consistency. Postgres would need to > be able to mark pages as unflushable until some point in time in the > future when the journal is flushed. We discussed various ways that > interface could work but it would be tricky to keep it low enough > overhead to be workable. IMO, the concept of allowing userspace to pin dirty page cache pages in memory is just asking for trouble. Apart from the obvious memory reclaim and OOM issues, some filesystems won't be able to move their journals forward until the data is flushed. i.e. ordered mode data writeback on ext3 will have all sorts of deadlock issues that result from pinning pages and then issuing fsync() on another file which will block waiting for the pinned pages to be flushed. Indeed, what happens if you do pin_dirty_pages(fd); fsync(fd);? If fsync() blocks because there are pinned pages, and there's no other thread to unpin them, then that code just deadlocked. If fsync() doesn't block and skips the pinned pages, then we haven't done an fsync() at all, and so violated the expectation that users have that after fsync() returns their data is safe on disk. And if we return an error to fsync(), then what the hell does the user do if it is some other application we don't know about that has pinned the pages? And if the kernel unpins them after some time, then we just violated the application's consistency guarantees H. What happens if the process crashes after pinning the dirty pages? How do we even know what process pinned the dirty pages so we can clean up after it? What happens if the same page is pinned by multiple processes? What happens on truncate/hole punch if the partial pages in the range that need to be zeroed and written are pinned? What happens if we do direct IO to a range with pinned, unflushable pages in the page cache? These are all complex corner cases that are introduced by allowing applications to pin dirty pages in memory. I've only spent a few minutes coming up with these, and I'm sure there's more of them. As such, I just don't see that allowing userspace to pin dirty page cache pages in memory being a workable solution. 
> The less exciting, more conservative option would be to add kernel > interfaces to teach Postgres about things like raid geometries. Then /sys/block/<dev>/queue/* contains all the information that is exposed to filesystems to optimise layout for storage geometry. Some filesystems can already expose the relevant parts of this information to userspace, others don't. What I think we really need to provide is a generic interface similar to the old XFS_IOC_DIOINFO ioctl that can be used to expose IO characteristics to applications in a simple, easy to gather manner. Something like:

    struct io_info {
        u64 minimum_io_size;      /* sector size */
        u64 maximum_io_size;      /* currently 2GB */
        u64 optimal_io_size;      /* stripe unit/width */
        u64 optimal_io_alignment; /* stripe unit/width */
        u64 mem_alignment;        /* PAGE_SIZE */
        u32 queue_depth;          /* max IO concurrency */
    };

> Postgres could use directio and decide to do prefetching based on the > raid geometry, Underlying storage array raid geometry and optimal IO sizes for the filesystem may be different. Hence you want what the filesystem considers optimal, not what the underlying storage is configured with. Indeed, a filesystem might be able to supply per-file IO characteristics depending on where it is located in the filesystem (think tiered storage) > how much available i/o bandwidth and iops is available, > etc. The kern
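Several of those queue parameters can already be gathered from sysfs today; a minimal sketch, assuming the device is sda and that the files below exist on the running kernel:

    /* Sketch: read I/O geometry hints from sysfs for a block device.
     * Assumes the device name (here "sda") is known; error handling is minimal. */
    #include <stdio.h>

    static long read_queue_param(const char *dev, const char *param)
    {
        char path[256];
        long value = -1;
        snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, param);
        FILE *f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%ld", &value) != 1)
                value = -1;
            fclose(f);
        }
        return value;
    }

    int main(void)
    {
        printf("logical block size: %ld\n", read_queue_param("sda", "logical_block_size"));
        printf("minimum I/O size:   %ld\n", read_queue_param("sda", "minimum_io_size"));
        printf("optimal I/O size:   %ld\n", read_queue_param("sda", "optimal_io_size"));
        return 0;
    }

As Dave notes, these are per-device values; what applications really want is what the filesystem considers optimal for a given file, which is what an io_info-style interface would provide.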
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, Jan 13, 2014 at 03:24:38PM -0800, Josh Berkus wrote: > On 01/13/2014 02:26 PM, Mel Gorman wrote: > > Really? > > > > zone_reclaim_mode is often a complete disaster unless the workload is > > partitioned to fit within NUMA nodes. On older kernels enabling it would > > sometimes cause massive stalls. I'm actually very surprised to hear it > > fixes anything and would be interested in hearing more about what sort > > of circumstnaces would convince you to enable that thing. > > So the problem with the default setting is that it pretty much isolates > all FS cache for PostgreSQL to whichever socket the postmaster is > running on, and makes the other FS cache unavailable. This means that, > for example, if you have two memory banks, then only one of them is > available for PostgreSQL filesystem caching ... essentially cutting your > available cache in half. No matter what default NUMA allocation policy we set, there will be an application for which that behaviour is wrong. As such, we've had tools for setting application specific NUMA policies for quite a few years now. e.g: $ man 8 numactl --interleave=nodes, -i nodes Set a memory interleave policy. Memory will be allocated using round robin on nodes. When memory cannot be allocated on the current interleave target fall back to other nodes. Multiple nodes may be specified on --interleave, --membind and --cpunodebind. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
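For completeness, the interleaving that numactl requests can also be set up from inside a process with libnuma; a rough sketch of roughly what "numactl --interleave=all" arranges, assuming the libnuma headers are installed (link with -lnuma):

    /* Sketch: interleave this process's future memory allocations across all
     * NUMA nodes, as `numactl --interleave=all` would.  Link with -lnuma. */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA not available on this system\n");
            return 1;
        }
        /* Pages allocated after this point are spread round-robin across
         * every node instead of being placed on the local node. */
        numa_set_interleave_mask(numa_all_nodes_ptr);

        /* ... allocate and touch shared memory here ... */
        return 0;
    }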
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 02:26:25AM +0100, Andres Freund wrote: > On 2014-01-13 17:13:51 -0800, James Bottomley wrote: > > a file into a user provided buffer, thus obtaining a page cache entry > > and a copy in their userspace buffer, then insert the page of the user > > buffer back into the page cache as the page cache page ... that's right, > > isn't it postgress people? > > Pretty much, yes. We'd probably hint (*advise(DONTNEED)) that the page > isn't needed anymore when reading. And we'd normally write if the page > is dirty. So why, exactly, do you even need the kernel page cache here? You've got direct access to the copy of data read into userspace, and you want direct control of when and how the data in that buffer is written and reclaimed. Why push that data buffer back into the kernel and then have to add all sorts of kernel interfaces to control the page you already have control of? > > Effectively you end up with buffered read/write that's also mapped into > > the page cache. It's a pretty awful way to hack around mmap. > > Well, the problem is that you can't really use mmap() for the things we > do. Postgres' durability works by guaranteeing that our journal entries > (called WAL := Write Ahead Log) are written & synced to disk before the > corresponding entries of tables and indexes reach the disk. That also > allows to group together many random-writes into a few contiguous writes > fdatasync()ed at once. Only during a checkpointing phase the big bulk of > the data is then (slowly, in the background) synced to disk. Which is the exact algorithm most journalling filesystems use for ensuring durability of their metadata updates. Indeed, here's an interesting piece of architecture that you might like to consider: * Neither XFS and BTRFS use the kernel page cache to back their metadata transaction engines. Why not? Because the page cache is too simplistic to adequately represent the complex object heirarchies that the filesystems have and so it's flat LRU reclaim algorithms and writeback control mechanisms are a terrible fit and cause lots of performance issues under memory pressure. IOWs, the two most complex high performance transaction engines in the Linux kernel have moved to fully customised cache and (direct) IO implementations because the requirements for scalability and performance are far more complex than the kernel page cache infrastructure can provide. Just food for thought Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, 2014-01-13 at 19:48 -0500, Trond Myklebust wrote: > On Jan 13, 2014, at 19:03, Hannu Krosing wrote: > > > On 01/13/2014 09:53 PM, Trond Myklebust wrote: > >> On Jan 13, 2014, at 15:40, Andres Freund wrote: > >> > >>> On 2014-01-13 15:15:16 -0500, Robert Haas wrote: > On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner > wrote: > > I notice, Josh, that you didn't mention the problems many people > > have run into with Transparent Huge Page defrag and with NUMA > > access. > Amen to that. Actually, I think NUMA can be (mostly?) fixed by > setting zone_reclaim_mode; is there some other problem besides that? > >>> I think that fixes some of the worst instances, but I've seen machines > >>> spending horrible amounts of CPU (& BUS) time in page reclaim > >>> nonetheless. If I analyzed it correctly it's in RAM << working set > >>> workloads where RAM is pretty large and most of it is used as page > >>> cache. The kernel ends up spending a huge percentage of time finding and > >>> potentially defragmenting pages when looking for victim buffers. > >>> > On a related note, there's also the problem of double-buffering. When > we read a page into shared_buffers, we leave a copy behind in the OS > buffers, and similarly on write-out. It's very unclear what to do > about this, since the kernel and PostgreSQL don't have intimate > knowledge of what each other are doing, but it would be nice to solve > somehow. > >>> I've wondered before if there wouldn't be a chance for postgres to say > >>> "my dear OS, that the file range 0-8192 of file x contains y, no need to > >>> reread" and do that when we evict a page from s_b but I never dared to > >>> actually propose that to kernel people... > >> O_DIRECT was specifically designed to solve the problem of double > >> buffering > >> between applications and the kernel. Why are you not able to use that in > >> these situations? > > What is asked is the opposite of O_DIRECT - the write from a buffer inside > > postgresql to linux *buffercache* and telling linux that it is the same > > as what > > is currently on disk, so don't bother to write it back ever. > > I don’t understand. Are we talking about mmap()ed files here? Why > would the kernel be trying to write back pages that aren’t dirty? No ... if I have it right, it's pretty awful: they want to do a read of a file into a user provided buffer, thus obtaining a page cache entry and a copy in their userspace buffer, then insert the page of the user buffer back into the page cache as the page cache page ... that's right, isn't it postgress people? Effectively you end up with buffered read/write that's also mapped into the page cache. It's a pretty awful way to hack around mmap. James -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
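For reference, the O_DIRECT route Trond suggests looks roughly like the sketch below; the file name is illustrative and the 4096-byte alignment is an assumption (the real requirement depends on the filesystem and device):

    /* Sketch: a page-cache-bypassing read with O_DIRECT (Linux).
     * Buffer, offset and length must be suitably aligned; 4096 is assumed here. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        int fd = open("datafile", O_RDONLY | O_DIRECT);   /* illustrative name */
        if (fd < 0)
            return 1;
        if (posix_memalign(&buf, 4096, 8192) != 0)
            return 1;
        /* Reads straight from disk into buf; nothing is left in the page cache. */
        ssize_t n = read(fd, buf, 8192);
        (void) n;
        close(fd);
        free(buf);
        return 0;
    }

Nothing read this way lingers in the page cache, which is exactly the kernel-managed caching, readahead and writeback that the PostgreSQL side of the thread says it does not want to give up.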
Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format
On Jan 14, 2014 2:44 PM, "Andres Freund" wrote: > > On 2014-01-14 14:42:36 +0100, Magnus Hagander wrote: > > On Tue, Jan 14, 2014 at 2:41 PM, Andres Freund wrote: > > > > > On 2014-01-14 14:40:46 +0100, Magnus Hagander wrote: > > > > On Tue, Jan 14, 2014 at 2:18 PM, Andres Freund < and...@2ndquadrant.com > > > >wrote: > > > > > > > > > On 2014-01-14 14:12:46 +0100, Magnus Hagander wrote: > > > > > > Either way - if we can do this in a safe way, it sounds like a good > > > idea. > > > > > > It would be sort of like rsync, except relying on the fact that we > > > can > > > > > look > > > > > > at the LSN and don't have to compare the actual files, right? > > > > > > > > > > Which is an advantage, yes. On the other hand, it doesn't fix problems > > > > > with a subtly broken replica, e.g. after a bug in replay, or disk > > > > > corruption. > > > > > > > > > > > > > > Right. But neither does rsync, right? > > > > > > Hm? Rsync's really only safe with --checksum and with that it definitely > > > should fix those? > > > > > > > > I think we're talking about difference scenarios. > > Sounds like it. > > > I thought you were talking about a backup taken from a replica, that > > already has corruption. rsync checksums surely aren't going to help with > > that? > > I was talking about updating a standby using such an incremental or > differential backup from the primary (or a standby higher up in the > cascade). If your standby is corrupted in any way a rsync --checksum > will certainly correct errors if it syncs from a correct source? Sure, but as I understand it that's not at all the scenario that the suggested functionality is for. You can still use rsync for that, I don't think anybody suggested removing that ability. Replicas weren't the target... /Magnus
Re: [HACKERS] extension_control_path
Tom Lane writes: > Why is that a good idea? It's certainly not going to simplify DBAs' > lives, more the reverse. ("This dump won't reload." "Uh, where did > you get that extension from?" "Ummm...") The latest users for the feature are the Red Hat team working on Open Shift, where they want to have co-existing per-user PostgreSQL clusters on a machine, each with its own set of extensions. Having extension_control_path also makes it possible to install extension files in a place not owned by root. Lastly, as a developer, you might enjoy being able to have your own non-system-global place to install extensions, as Andres explained on this list not too long ago. > Assuming that there is some need for loading extensions from nonstandard > places, would it be better to just allow a filename specification in > CREATE EXTENSION? (I don't know the answer, since the use-case isn't > apparent to me in the first place, but it seems worth asking.) With the extension_control_path idea, we are still addressing needs where the people managing the OS and the database are distinct sets. The GUC allows the system admins to set up PostgreSQL the way they want, and then the database guy doesn't need to know anything about that at CREATE EXTENSION time. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, 2014-01-14 at 15:39 +0100, Hannu Krosing wrote: > On 01/14/2014 09:39 AM, Claudio Freire wrote: > > On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing > > wrote: > >> Again, as said above the linux file system is doing fine. What we > >> want is a few ways to interact with it to let it do even better when > >> working with postgresql by telling it some stuff it otherwise would > >> have to second guess and by sometimes giving it back some cache > >> pages which were copied away for potential modifying but ended > >> up clean in the end. > > You don't need new interfaces. Only a slight modification of what > > fadvise DONTNEED does. > > > > This insistence in injecting pages from postgres to kernel is just a > > bad idea. > Do you think it would be possible to map copy-on-write pages > from linux cache to postgresql cache ? > > this would be a step in direction of solving the double-ram-usage > of pages which have not been read from syscache to postgresql > cache without sacrificing linux read-ahead (which I assume does > not happen when reads bypass system cache). The current mechanism for coherency between a userspace cache and the in-kernel page cache is mmap ... that's the only way you get the same page in both currently. glibc used to have an implementation of read/write in terms of mmap, so it should be possible to insert it into your current implementation without a major rewrite. The problem I think this brings you is uncontrolled writeback: you don't want dirty pages to go to disk until you issue a write() I think we could fix this with another madvise(): something like MADV_WILLUPDATE telling the page cache we expect to alter the pages again, so don't be aggressive about cleaning them. Plus all the other issues with mmap() ... but if you can detail those, we might be able to fix them. > and we can write back the copy at the point when it is safe (from > postgresql perspective) to let the system write them back ? Using MADV_WILLUPDATE, possibly ... you're still not going to have absolute control. The kernel will write back the pages if the dirty limits are exceeded, for instance, but we could tune it to be useful. > Do you think it is possible to make it work with good performance > for a few million 8kb pages ? > > > At the very least, it still needs postgres to know too much > > of the filesystem (block layout) to properly work. Ie: pg must be > > required to put entire filesystem-level blocks into the page cache, > > since that's how the page cache works. > I was more thinking of an simple write() interface with extra > flags/sysctls to tell kernel that "we already have this on disk" > > At the very worst, it may > > introduce serious security and reliability implications, when > > applications can destroy the consistency of the page cache (even if > > full access rights are checked, there's still the possibility this > > inconsistency might be exploitable). > If you allow write() which just writes clean pages, I can not see > where the extra security concerns are beyond what normal > write can do. The problem is we can't give you absolute control of when pages are written back because that interface can be used to DoS the system: once we get too many dirty uncleanable pages, we'll thrash looking for memory and the system will livelock. James -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
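A sketch of the mmap()-based coherency James describes. MADV_WILLUPDATE is purely hypothetical (no kernel implements it), so it appears only behind an #ifdef as a placeholder, and the file name is also an assumption:

    /* Sketch: sharing a file page between the kernel page cache and an
     * application cache via mmap().  MADV_WILLUPDATE is hypothetical and only
     * marks where the proposed hint would go; it is NOT a real flag. */
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("relation_segment", O_RDWR);   /* assumed data file name */
        if (fd < 0)
            return 1;

        /* The mapped page and the page cache page are the same physical page. */
        char *page = mmap(NULL, 8192, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (page == MAP_FAILED)
            return 1;

    #ifdef MADV_WILLUPDATE
        /* Proposed (hypothetical) hint: "we will modify this again, don't be
         * aggressive about writing it back". */
        madvise(page, 8192, MADV_WILLUPDATE);
    #endif

        page[0] = 'x';   /* dirties the shared page; the kernel may now write it
                          * back whenever it likes -- which is exactly the
                          * WAL-ordering problem discussed in the thread */

        munmap(page, 8192);
        close(fd);
        return 0;
    }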
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Jan 14, 2014, at 10:39, Tom Lane wrote: > James Bottomley writes: >> The current mechanism for coherency between a userspace cache and the >> in-kernel page cache is mmap ... that's the only way you get the same >> page in both currently. > > Right. > >> glibc used to have an implementation of read/write in terms of mmap, so >> it should be possible to insert it into your current implementation >> without a major rewrite. The problem I think this brings you is >> uncontrolled writeback: you don't want dirty pages to go to disk until >> you issue a write() > > Exactly. > >> I think we could fix this with another madvise(): >> something like MADV_WILLUPDATE telling the page cache we expect to alter >> the pages again, so don't be aggressive about cleaning them. > > "Don't be aggressive" isn't good enough. The prohibition on early write > has to be absolute, because writing a dirty page before we've done > whatever else we need to do results in a corrupt database. It has to > be treated like a write barrier. Then why are you dirtying the page at all? It makes no sense to tell the kernel “we’re changing this page in the page cache, but we don’t want you to change it on disk”: that’s not consistent with the function of a page cache. >> The problem is we can't give you absolute control of when pages are >> written back because that interface can be used to DoS the system: once >> we get too many dirty uncleanable pages, we'll thrash looking for memory >> and the system will livelock. > > Understood, but that makes this direction a dead end. We can't use > it if the kernel might decide to write anyway. > > regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] extension_control_path
* Tom Lane (t...@sss.pgh.pa.us) wrote: > Dimitri Fontaine writes: > > Please find attached to this email a patch implementing a new GUC that > > allows users to setup a list of path where PostgreSQL will search for > > the extension control files at CREATE EXTENSION time. > > Why is that a good idea? It's certainly not going to simplify DBAs' > lives, more the reverse. ("This dump won't reload." "Uh, where did > you get that extension from?" "Ummm...") We *already* have that problem. I don't think this makes it particularly worse- you still need to go back to the old box and look at what came from where. Sure, you *might* be lucky enough to find the right extension by guessing at what packages were installed or searching for one that looks like the correct one, but then, you might discover that the version available isn't the right version for the database you're trying to restore anyway. Indeed, this might allow folks who don't particularly care for package systems to build consistent dumps without having to worry quite as much about what the package system is doing. > Assuming that there is some need for loading extensions from nonstandard > places, would it be better to just allow a filename specification in > CREATE EXTENSION? (I don't know the answer, since the use-case isn't > apparent to me in the first place, but it seems worth asking.) For my 2c, I could absolutely see it as being worthwhile to have an independent directory to install not-from-package extensions. That would keep things which are "managed by the package system" and things which are installed independent separate, which is absolutely a good thing, imv. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Jan14, 2014, at 11:06 , David Rowley wrote: > Here's a patch which removes sum(numeric) and changes the documents a little > to remove a reference to using sum(numeric) to workaround the fact that > there's no inverse transitions for sum(float). I also made a small change in > the aggregates.sql tests to check that the aggfnoid <= . I've looked over the patch, here a few comments. For STRICT pairs of transfer and inverse transfer functions we should complain if any of them ever return NULL. That can never be correct anyway, since a STRICT function cannot undo a non-NULL -> NULL transition. The same goes for non-STRICT, at least if we ever want to allow an inverse transfer function to indicate "Sorry, cannot undo in this case, please rescan". If we allowed NULL as just another state value now, we couldn't easily undo that later, so we'd rob ourselves of the obvious way for the inverse transfer function to indicate this condition to its caller. The notnullcount machinery seems to apply to both STRICT and non-STRICT transfer function pairs. Shouldn't that be constrained to STRICT transfer function pairs? For non-STRICT pairs, it's up to the transfer functions to deal with NULL inputs however they please, no? The logic around movedaggbase in eval_windowaggregates() seems a bit convoluted. Couldn't the if be moved before the code that pulls aggregatedbase up to frameheadpos using the inverse aggregation function? Also, could we easily check whether the frames corresponding to the individual rows are all either identical or disjoint, and don't use the inverse transfer function then? Currently, for a frame which contains either just the current row, or all the current row's peers, I think we'd use the inverse transfer function to fully un-add the old frame, and then add back the new frame. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 11:44 AM, James Bottomley wrote: > No, I'm sorry, that's never going to be possible. No user space > application has all the facts. If we give you an interface to force > unconditional holding of dirty pages in core you'll livelock the system > eventually because you made a wrong decision to hold too many dirty > pages. I don't understand why this has to be absolute: if you advise > us to hold the pages dirty and we do up until it becomes a choice to > hold on to the pages or to thrash the system into a livelock, why would > you ever choose the latter? And if, as I'm assuming, you never would, > why don't you want the kernel to make that choice for you? If you don't understand how write-ahead logging works, this conversation is going nowhere. Suffice it to say that the word "ahead" is not optional. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 1:48 PM, Robert Haas wrote: > On Tue, Jan 14, 2014 at 11:44 AM, James Bottomley > wrote: >> No, I'm sorry, that's never going to be possible. No user space >> application has all the facts. If we give you an interface to force >> unconditional holding of dirty pages in core you'll livelock the system >> eventually because you made a wrong decision to hold too many dirty >> pages. I don't understand why this has to be absolute: if you advise >> us to hold the pages dirty and we do up until it becomes a choice to >> hold on to the pages or to thrash the system into a livelock, why would >> you ever choose the latter? And if, as I'm assuming, you never would, >> why don't you want the kernel to make that choice for you? > > If you don't understand how write-ahead logging works, this > conversation is going nowhere. Suffice it to say that the word > "ahead" is not optional. In essence, if you do flush when you shouldn't, and there is a hardware failure, or kernel panic, or anything that stops the rest of the writes from succeeding, your database is kaputt, and you've got to restore a backup. Ie: very very bad. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On 01/14/2014 06:08 PM, Tom Lane wrote: Trond Myklebust writes: On Jan 14, 2014, at 10:39, Tom Lane wrote: "Don't be aggressive" isn't good enough. The prohibition on early write has to be absolute, because writing a dirty page before we've done whatever else we need to do results in a corrupt database. It has to be treated like a write barrier. Then why are you dirtying the page at all? It makes no sense to tell the kernel “we’re changing this page in the page cache, but we don’t want you to change it on disk”: that’s not consistent with the function of a page cache. As things currently stand, we dirty the page in our internal buffers, and we don't write it to the kernel until we've written and fsync'd the WAL data that needs to get to disk first. The discussion here is about whether we could somehow avoid double-buffering between our internal buffers and the kernel page cache. To be honest, I think the impact of double buffering in real-life applications is greatly exaggerated. If you follow the usual guideline and configure shared_buffers to 25% of available RAM, at worst you're wasting 25% of RAM to double buffering. That's significant, but it's not the end of the world, and it's a problem that can be compensated by simply buying more RAM. Of course, if someone can come up with an easy way to solve that, that'd be great, but if it means giving up other advantages that we get from relying on the OS page cache, then -1 from me. The usual response to the "why don't you just use O_DIRECT?" is that it'd require reimplementing a lot of I/O infrastructure, but misses an IMHO more important point: it would require setting shared_buffers a lot higher to get the same level of performance you get today. That has a number of problems: 1. It becomes a lot more important to tune shared_buffers correctly. Set it too low, and you're not taking advantage of all the RAM available. Set it too high, and you'll start swapping, totally killing performance. I can already hear consultants rubbing their hands, waiting for the rush of customers that will need expert help to determine the optimal shared_buffers setting. 2. Memory spent on the buffer cache can't be used for other things. For example, an index build can temporarily allocate several gigabytes of memory; if that memory is allocated to the shared buffer cache, it can't be used for that purpose. Yeah, we could change that, and allow borrowing pages from the shared buffer cache for other purposes, but that means more work and more code. 3. Memory used for the shared buffer cache can't be used by other processes (without swapping). It becomes a lot harder to be a good citizen on a system that's not entirely dedicated to PostgreSQL. So not only would we need to re-implement I/O infrastructure, we'd also need to make memory management a lot smarter and a lot more flexible. We'd need a lot more information on what else is running on the system and how badly they need memory. I personally think there is no chance of using mmap for that; the semantics of mmap are pretty much dictated by POSIX and they don't work for this. Agreed. It would be possible to use mmap() for pages that are not modified, though. When you're not modifying, you could mmap() the data you need, and bypass the PostgreSQL buffer cache that way. The interaction with the buffer cache becomes complicated, because you couldn't use the buffer cache's locks etc., and some pages might have a never version in the buffer cache than on-disk, but it might be doable. 
- Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
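A minimal sketch of the read-only mmap() idea at the end of Heikki's mail, assuming an 8 kB block size and leaving out all the hard parts he mentions (buffer cache locking, and pages that are newer in shared buffers than on disk):

    /* Sketch: read a clean block directly out of the kernel page cache with a
     * read-only mapping instead of copying it into a local buffer pool.
     * BLCKSZ and the calling convention are assumptions for illustration. */
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    void *map_block(int fd, off_t blocknum)
    {
        /* No copy is made; the kernel's cached page is used in place. */
        void *p = mmap(NULL, BLCKSZ, PROT_READ, MAP_SHARED, fd, blocknum * BLCKSZ);
        return (p == MAP_FAILED) ? NULL : p;   /* munmap(p, BLCKSZ) when done */
    }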
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 11:57 AM, James Bottomley wrote: > On Tue, 2014-01-14 at 11:48 -0500, Robert Haas wrote: >> On Tue, Jan 14, 2014 at 11:44 AM, James Bottomley >> wrote: >> > No, I'm sorry, that's never going to be possible. No user space >> > application has all the facts. If we give you an interface to force >> > unconditional holding of dirty pages in core you'll livelock the system >> > eventually because you made a wrong decision to hold too many dirty >> > pages. I don't understand why this has to be absolute: if you advise >> > us to hold the pages dirty and we do up until it becomes a choice to >> > hold on to the pages or to thrash the system into a livelock, why would >> > you ever choose the latter? And if, as I'm assuming, you never would, >> > why don't you want the kernel to make that choice for you? >> >> If you don't understand how write-ahead logging works, this >> conversation is going nowhere. Suffice it to say that the word >> "ahead" is not optional. > > No, I do ... you mean the order of write out, if we have to do it, is > important. In the rest of the kernel, we do this with barriers which > causes ordered grouping of I/O chunks. If we could force a similar > ordering in the writeout code, is that enough? Probably not. There are a whole raft of problems here. For that to be any of any use, we'd have to move to mmap()ing each buffer instead of read()ing them in, and apparently mmap() doesn't scale well to millions of mappings. And even if it did, then we'd have a solution that only works on Linux. Plus, as Tom pointed out, there are critical sections where it's not just a question of ordering but in fact you need to completely hold off writes. In terms of avoiding double-buffering, here's my thought after reading what's been written so far. Suppose we read a page into our buffer pool. Until the page is clean, it would be ideal for the mapping to be shared between the buffer cache and our pool, sort of like copy-on-write. That way, if we decide to evict the page, it will still be in the OS cache if we end up needing it again (remember, the OS cache is typically much larger than our buffer pool). But if the page is dirtied, then instead of copying it, just have the buffer pool forget about it, because at that point we know we're going to write the page back out anyway before evicting it. This would be pretty similar to copy-on-write, except without the copying. It would just be forget-from-the-buffer-pool-on-write. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 12:12 PM, Robert Haas wrote: > In terms of avoiding double-buffering, here's my thought after reading > what's been written so far. Suppose we read a page into our buffer > pool. Until the page is clean, it would be ideal for the mapping to Correction: "For so long as the page is clean..." > be shared between the buffer cache and our pool, sort of like > copy-on-write. That way, if we decide to evict the page, it will > still be in the OS cache if we end up needing it again (remember, the > OS cache is typically much larger than our buffer pool). But if the > page is dirtied, then instead of copying it, just have the buffer pool > forget about it, because at that point we know we're going to write > the page back out anyway before evicting it. > > This would be pretty similar to copy-on-write, except without the > copying. It would just be forget-from-the-buffer-pool-on-write. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
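The kernel's existing MAP_PRIVATE file mappings already behave much like this, except that the first write produces a private copy rather than simply "forgetting": reads share the cached page, and the page cache copy stays clean after the write fault. A small sketch (file name assumed):

    /* Sketch: MAP_PRIVATE gives copy-on-write semantics over the page cache.
     * Until the first store, the mapping and the page cache share one page;
     * the first store faults, the process gets a private anonymous copy, and
     * the file's cached page remains clean. */
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("relation_segment", O_RDONLY);   /* assumed data file */
        if (fd < 0)
            return 1;
        char *buf = mmap(NULL, 8192, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
        if (buf == MAP_FAILED)
            return 1;

        char c = buf[0];   /* shared with the page cache, no copy yet */
        buf[0] = c + 1;    /* write fault: kernel copies the page; from here on
                            * the mapping is private and the cached page is
                            * untouched */

        munmap(buf, 8192);
        close(fd);
        return 0;
    }

The proposal in the mail above differs in wanting to avoid even that copy, since the buffer pool is about to write its own version out anyway.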
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas wrote: > > In terms of avoiding double-buffering, here's my thought after reading > what's been written so far. Suppose we read a page into our buffer > pool. Until the page is clean, it would be ideal for the mapping to > be shared between the buffer cache and our pool, sort of like > copy-on-write. That way, if we decide to evict the page, it will > still be in the OS cache if we end up needing it again (remember, the > OS cache is typically much larger than our buffer pool). But if the > page is dirtied, then instead of copying it, just have the buffer pool > forget about it, because at that point we know we're going to write > the page back out anyway before evicting it. > > This would be pretty similar to copy-on-write, except without the > copying. It would just be forget-from-the-buffer-pool-on-write. But... either copy-on-write or forget-on-write needs a page fault, and thus a page mapping. Is a page fault more expensive than copying 8k? (I really don't know). -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
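That cost is measurable; a rough micro-benchmark sketch comparing first-touch minor faults on fresh anonymous memory against 8 kB memcpy()s (numbers vary widely with hardware and kernel version, so this is indicative only):

    /* Sketch: crude comparison of first-touch page faults vs. 8 kB memcpy. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <time.h>

    #define BLOCKS 10000
    #define BLCKSZ 8192

    static double elapsed(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void)
    {
        struct timespec t0, t1;
        size_t len = (size_t) BLOCKS * BLCKSZ;

        /* Fresh anonymous mapping: every first touch takes a minor fault. */
        char *m = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (m == MAP_FAILED)
            return 1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t off = 0; off < len; off += 4096)
            m[off] = 1;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("first-touch faults: %.3f s\n", elapsed(t0, t1));

        /* memcpy of already-faulted memory, 8 kB at a time. */
        char *src = malloc(BLCKSZ);
        memset(src, 2, BLCKSZ);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t off = 0; off < len; off += BLCKSZ)
            memcpy(m + off, src, BLCKSZ);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("8 kB memcpy:        %.3f s\n", elapsed(t0, t1));

        free(src);
        munmap(m, len);
        return 0;
    }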
Re: [HACKERS] plpgsql.consistent_into
Marko Tiikkaja writes: > On 1/14/14 12:28 PM, Marti Raudsepp wrote: >> Now, another question is whether it's possible to make the syntax >> work. Is this an assignment from the result of a subquery, or is it a >> query by itself? >> a = (SELECT foo FROM table); > That looks like a scalar subquery, which is wrong because they can't > return more than one column (nor can they be INSERT etc., obviously). Yeah, it's a scalar subquery, which means that plpgsql already assigns a non-error meaning to this syntax. > How about: >(a) = SELECT 1; >(a, b) = SELECT 1, 2; >(a, b) = INSERT INTO foo RETURNING col1, col2; > Same semantics: TOO_MANY_ROWS on rows > 1, sets FOUND and row_count. > AFAICT this can be parsed unambiguously, too, and we don't need to look > at the query string because this is new syntax. The idea of inventing new syntax along this line seems like a positive direction to pursue. Since assignment already rejects multiple rows from the source expression, this wouldn't be weirdly inconsistent. It might be worth thinking about the UPDATE syntax that's in recent versions of the SQL standard: UPDATE targettab SET (a, b, c) = row-valued-expression [ , ... ] [ WHERE ... ] We don't actually implement this in PG yet, except for trivial cases, but it will certainly happen eventually. I think your sketch above deviates unnecessarily from what the standard says for UPDATE. In particular I think it'd be better to write things like (a, b) = ROW(1, 2); (a, b, c) = (SELECT x, y, z FROM foo WHERE id = 42); which would exactly match what you'd write in a multiple-assignment UPDATE, and it has the same rejects-multiple-rows semantics too. Also note that the trivial cases we do already implement in UPDATE look like UPDATE targettab SET (a, b, c) = (1, 2, 3) [ WHERE ... ] that is, we allow a row constructor where the optional keyword ROW has been omitted. I think people would expect to be able to write this in plpgsql: (a, b) = (1, 2); Now, this doesn't provide any guidance for INSERT/UPDATE/DELETE RETURNING, but frankly I don't feel any need to invent new syntax for those, since RETURNING INTO already works the way you want. I'm not too sure what it'd take to make this work. Right now, SELECT (SELECT x, y FROM foo WHERE id = 42); would generate "ERROR: subquery must return only one column", but I think it's mostly a historical artifact that it does that rather than returning a composite value (of an anonymous record type). If we were willing to make that change then it seems like it'd be pretty straightforward to teach plpgsql to handle (a, b, ...) = row-valued-expression where there wouldn't actually be any need to parse the RHS any differently from the way plpgsql parses an assignment RHS right now. Which would be a good thing IMO. If we don't generalize the behavior of scalar subqueries then plpgsql would have to jump through a lot of hoops to support the subselect case. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 12:15 PM, Claudio Freire wrote: > On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas wrote: >> In terms of avoiding double-buffering, here's my thought after reading >> what's been written so far. Suppose we read a page into our buffer >> pool. Until the page is clean, it would be ideal for the mapping to >> be shared between the buffer cache and our pool, sort of like >> copy-on-write. That way, if we decide to evict the page, it will >> still be in the OS cache if we end up needing it again (remember, the >> OS cache is typically much larger than our buffer pool). But if the >> page is dirtied, then instead of copying it, just have the buffer pool >> forget about it, because at that point we know we're going to write >> the page back out anyway before evicting it. >> >> This would be pretty similar to copy-on-write, except without the >> copying. It would just be forget-from-the-buffer-pool-on-write. > > But... either copy-on-write or forget-on-write needs a page fault, and > thus a page mapping. > > Is a page fault more expensive than copying 8k? I don't know either. I wasn't thinking so much that it would save CPU time as that it would save memory. Consider a system with 32GB of RAM. If you set shared_buffers=8GB, then in the worst case you've got 25% of your RAM wasted storing pages that already exist, dirtied, in shared_buffers. It's easy to imagine scenarios in which that results in lots of extra I/O, so that the CPU required to do the accounting comes to seem cheap by comparison. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
Claudio Freire wrote: > Robert Haas wrote: >> James Bottomley wrote: >>> I don't understand why this has to be absolute: if you advise >>> us to hold the pages dirty and we do up until it becomes a >>> choice to hold on to the pages or to thrash the system into a >>> livelock, why would you ever choose the latter? Because the former creates database corruption and the latter does not. >>> And if, as I'm assuming, you never would, That assumption is totally wrong. >>> why don't you want the kernel to make that choice for you? >> >> If you don't understand how write-ahead logging works, this >> conversation is going nowhere. Suffice it to say that the word >> "ahead" is not optional. > > In essence, if you do flush when you shouldn't, and there is a > hardware failure, or kernel panic, or anything that stops the > rest of the writes from succeeding, your database is kaputt, and > you've got to restore a backup. > > Ie: very very bad. Yup. And when that's a few terabytes, you will certainly find yourself wishing that you had been able to do a recovery up to the end of the last successfully committed transaction rather than a restore from backup. Now, as Tom said, if there were an API to create write boundaries between particular dirty pages, we could leave it to the OS. Each WAL record's write would be conditional on the previous one and each data page write would be conditional on the WAL record for the last update to the page. But nobody seems to think that would yield acceptable performance. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On 01/14/2014 05:44 PM, James Bottomley wrote: > On Tue, 2014-01-14 at 10:39 -0500, Tom Lane wrote: >> James Bottomley writes: >>> The current mechanism for coherency between a userspace cache and the >>> in-kernel page cache is mmap ... that's the only way you get the same >>> page in both currently. >> Right. >> >>> glibc used to have an implementation of read/write in terms of mmap, so >>> it should be possible to insert it into your current implementation >>> without a major rewrite. The problem I think this brings you is >>> uncontrolled writeback: you don't want dirty pages to go to disk until >>> you issue a write() >> Exactly. >> >>> I think we could fix this with another madvise(): >>> something like MADV_WILLUPDATE telling the page cache we expect to alter >>> the pages again, so don't be aggressive about cleaning them. >> "Don't be aggressive" isn't good enough. The prohibition on early write >> has to be absolute, because writing a dirty page before we've done >> whatever else we need to do results in a corrupt database. It has to >> be treated like a write barrier. >> >>> The problem is we can't give you absolute control of when pages are >>> written back because that interface can be used to DoS the system: once >>> we get too many dirty uncleanable pages, we'll thrash looking for memory >>> and the system will livelock. >> Understood, but that makes this direction a dead end. We can't use >> it if the kernel might decide to write anyway. > No, I'm sorry, that's never going to be possible. No user space > application has all the facts. If we give you an interface to force > unconditional holding of dirty pages in core you'll livelock the system > eventually because you made a wrong decision to hold too many dirty > pages. I don't understand why this has to be absolute: if you advise > us to hold the pages dirty and we do up until it becomes a choice to > hold on to the pages or to thrash the system into a livelock, why would > you ever choose the latter? And if, as I'm assuming, you never would, > why don't you want the kernel to make that choice for you? The short answer is "crash safety". A database system worth its name must make sure that all data reported as stored to clients is there even after crash. Write ahead log is the means for that. And writing wal files and data pages has to be in certain order to guarantee consistent recovery after crash. -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
James Bottomley wrote: > you mean the order of write out, if we have to do it, is > important. In the rest of the kernel, we do this with barriers > which causes ordered grouping of I/O chunks. If we could force a > similar ordering in the writeout code, is that enough? Unless it can be between particular pairs of pages, I don't think performance could be at all acceptable. Each data page has an associated Log Sequence Number reflecting the last Write-Ahead Log record which records a change to that page, and the referenced WAL record must be safely persisted before the data page is allowed to be written. Currently, when we need to write a dirty page to the OS, we must ensure that the WAL record is written and fsync'd first. We also write a WAL record for transaction command and fsync it at each COMMIT, before telling the client that the COMMIT request was successful. (Well, at least by default; they can choose to set synchronous_commit to off for some or all transactions.) If a write barrier to control this applied to everything on the filesystem, performance would be horrible. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
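Schematically, the ordering rule Kevin describes looks like the sketch below. This is not the actual PostgreSQL code; all names are illustrative:

    /* Sketch of the write-ahead ordering rule, with illustrative names.
     * Every data page carries the LSN of the last WAL record that touched it;
     * the page may not be handed to the OS until WAL is durable up to that LSN. */
    typedef unsigned long long Lsn;

    extern Lsn  flushed_wal_lsn;                 /* how far WAL is known durable */
    extern void wal_flush_up_to(Lsn lsn);        /* write + fsync WAL through lsn */
    extern void os_write_page(int fd, void *page, long blocknum);

    void write_dirty_page(int fd, void *page, long blocknum, Lsn page_lsn)
    {
        /* Write-ahead rule: the WAL record for the last change to this page
         * must be on disk before the page itself may be written. */
        if (page_lsn > flushed_wal_lsn)
            wal_flush_up_to(page_lsn);

        os_write_page(fd, page, blocknum);
        /* The data file itself is only fsync()ed later, at checkpoint time. */
    }

A filesystem-wide write barrier would impose this ordering on every dirty page in the system, which is why the performance concern above applies.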
[HACKERS] Exposing currentTransactionWALVolume
Short patch to expose a function GetCurrentTransactionWALVolume() that gives the total number of bytes written to WAL by the current transaction. The user interface to this information is discussed on a separate thread, so that we don't throw the baby out with the bathwater when people discuss UI pros and cons. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services GetCurrentTransactionWALVolume.v1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] shared memory message queues
On Mon, Dec 23, 2013 at 12:46 PM, Robert Haas wrote: > Oh, dear. That's rather embarrassing. > > Incremental (incremental-shm-mq.patch) and full (shm-mq-v3.patch) > patches attached. OK, I have pushed the patches in this stack. I'm not sure we quite concluded the review back-and-forth but nobody really seems to have had serious objections to this line of attack, other than wanting some more comments which I have added. I don't doubt that there will be more things to tweak and tune here, and a whole lot more stuff that needs to be built using this infrastructure, but I don't think the code that's here is going to get better for remaining outside the tree longer. I decided not to change the dsm_toc patch to use functions instead of macros as Andres suggested; the struct definition would still have needed to be public, so any change would still need a recompile, at least if the size of the struct changed. It could be changed to work with a palloc'd chunk, but then you need to worry about context-lifespan memory leaks and it so didn't seem worth it. I can't imagine this having a lot of churn, anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] extension_control_path
Dimitri Fontaine writes: > Tom Lane writes: >> Why is that a good idea? It's certainly not going to simplify DBAs' >> lives, more the reverse. ("This dump won't reload." "Uh, where did >> you get that extension from?" "Ummm...") > The latest users for the feature are the Red Hat team working on Open > Shift where they want to have co-existing per-user PostgreSQL clusters > on a machine, each with its own set of extensions. Um ... "own set of installed extensions" doesn't need to mean "own set of available extensions", any more than those clusters need to have their own Postgres executables. If the clusters *do* have their own executables, eg because they're different PG versions, then they can certainly also have their own $SHAREDIR trees too. So this example is totally without value for your case. > Having extension_control_path also allows to install extension files in > a place not owned by root. As far as the control files go, there's nothing saying that $SHAREDIR/extension has to be root-owned. If there are .so's involved, I do not believe the Red Hat crew is asking you to support loading .so's from non-root-owned dirs, because that'd be against their own corporate security policies. (But in any case, where we find the control and SQL files need not have anything to do with where the .so's are.) > Lastly, as a developer, you might enjoy being able to have your own > non-system-global place to install extensions, as Andres did explain on > this list not too long ago. And again, if you're working on a development version, $SHAREDIR/extension is probably owned by you anyway. I don't see that any of these scenarios create a need to install extension files anywhere but $SHAREDIR/extension. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, Jan 13, 2014 at 2:36 PM, Mel Gorman wrote: > On Mon, Jan 13, 2014 at 06:27:03PM -0200, Claudio Freire wrote: > > On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby wrote: > > > On 1/13/14, 2:19 PM, Claudio Freire wrote: > > >> > > >> On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas > > >> wrote: > > >>> > > >>> On a related note, there's also the problem of double-buffering. > When > > >>> we read a page into shared_buffers, we leave a copy behind in the OS > > >>> buffers, and similarly on write-out. It's very unclear what to do > > >>> about this, since the kernel and PostgreSQL don't have intimate > > >>> knowledge of what each other are doing, but it would be nice to solve > > >>> somehow. > > >> > > >> > > >> > > >> There you have a much harder algorithmic problem. > > >> > > >> You can basically control duplication with fadvise and WONTNEED. The > > >> problem here is not the kernel and whether or not it allows postgres > > >> to be smart about it. The problem is... what kind of smarts > > >> (algorithm) to use. > > > > > > > > > Isn't this a fairly simple matter of when we read a page into shared > buffers > > > tell the kernel do forget that page? And a corollary to that for when > we > > > dump a page out of shared_buffers (here kernel, please put this back > into > > > your cache). > > > > > > That's my point. In terms of kernel-postgres interaction, it's fairly > simple. > > > > What's not so simple, is figuring out what policy to use. Remember, > > you cannot tell the kernel to put some page in its page cache without > > reading it or writing it. So, once you make the kernel forget a page, > > evicting it from shared buffers becomes quite expensive. > > posix_fadvise(POSIX_FADV_WILLNEED) is meant to cover this case by > forcing readahead. But telling the kernel to forget a page, then telling it to read it in again from disk because it might be needed again in the near future is itself very expensive. We would need to hand the page to the kernel so it has it without needing to go to disk to get it. > If you evict it prematurely then you do get kinda > screwed because you pay the IO cost to read it back in again even if you > had enough memory to cache it. Maybe this is the type of kernel-postgres > interaction that is annoying you. > > If you don't evict, the kernel eventually steps in and evicts the wrong > thing. If you do evict and it was unnecessarily you pay an IO cost. > > That could be something we look at. There are cases buried deep in the > VM where pages get shuffled to the end of the LRU and get tagged for > reclaim as soon as possible. Maybe you need access to something like > that via posix_fadvise to say "reclaim this page if you need memory but > leave it resident if there is no memory pressure" or something similar. > Not exactly sure what that interface would look like or offhand how it > could be reliably implemented. > I think the "reclaim this page if you need memory but leave it resident if there is no memory pressure" hint would be more useful for temporary working files than for what was being discussed above (shared buffers). When I do work that needs large temporary files, I often see physical write IO spike but physical read IO does not. I interpret that to mean that the temporary data is being written to disk to satisfy either dirty_expire_centisecs or dirty_*bytes, but the data remains in the FS cache and so disk reads are not needed to satisfy it. 
So a hint that says "this file will never be fsynced so please ignore dirty_*bytes and dirty_expire_centisecs. I will need it again relatively soon (but not after a reboot), but will do so mostly sequentially, so please don't evict this without need, but if you do need to then it is a good candidate" would be good. Cheers, Jeff
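The hint Jeff describes does not exist; the closest existing knobs are the posix_fadvise() flags, sketched here against a temporary sort file (the file name is illustrative):

    /* Sketch: the posix_fadvise() hints that exist today, applied to a temp file.
     * None of them expresses "never fsync'd, keep if convenient, sequential
     * reuse soon" exactly -- that is the missing hint discussed above. */
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("pgsql_tmp_sort.0", O_RDWR | O_CREAT, 0600);  /* illustrative */
        if (fd < 0)
            return 1;

        /* Access will be sequential: more aggressive readahead. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        /* "I won't need this cached": drops clean cached pages for the range.
         * This is the eviction side, not the "keep it if you can" side. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

        /* "I will need this soon": starts readahead into the page cache. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);

        close(fd);
        return 0;
    }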
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On 7 July 2013 14:24, Simon Riggs wrote: > On 3 January 2012 18:42, Tom Lane wrote: >> I wrote: Another point that requires some thought is that switching SnapshotNow to be MVCC-based will presumably result in a noticeable increase in each backend's rate of wanting to acquire snapshots. >> >> BTW, I wonder if this couldn't be ameliorated by establishing some >> ground rules about how up-to-date a snapshot really needs to be. >> Arguably, it should be okay for successive SnapshotNow scans to use the >> same snapshot as long as we have not acquired a new lock in between. >> If not, reusing an old snap doesn't introduce any race condition that >> wasn't there already. > Now that has been implemented using the above design, we can resubmit > the lock level reduction patch, with thanks to Robert. > > Submitted patch passes original complaint/benchmark. > > Changes > * various forms of ALTER TABLE, notably ADD constraint and VALIDATE > * CREATE TRIGGER > > One minor correction to earlier thinking with respect to toast > tables. That might be later relaxed. > > Full tests including proof of lock level reductions, plus docs. Rebased to v14 -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services reduce_lock_levels.v14.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 12:20 PM, James Bottomley wrote: > On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote: >> On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas wrote: >> > In terms of avoiding double-buffering, here's my thought after reading >> > what's been written so far. Suppose we read a page into our buffer >> > pool. Until the page is clean, it would be ideal for the mapping to >> > be shared between the buffer cache and our pool, sort of like >> > copy-on-write. That way, if we decide to evict the page, it will >> > still be in the OS cache if we end up needing it again (remember, the >> > OS cache is typically much larger than our buffer pool). But if the >> > page is dirtied, then instead of copying it, just have the buffer pool >> > forget about it, because at that point we know we're going to write >> > the page back out anyway before evicting it. >> > >> > This would be pretty similar to copy-on-write, except without the >> > copying. It would just be forget-from-the-buffer-pool-on-write. >> >> But... either copy-on-write or forget-on-write needs a page fault, and >> thus a page mapping. >> >> Is a page fault more expensive than copying 8k? >> >> (I really don't know). > > A page fault can be expensive, yes ... but perhaps you don't need one. > > What you want is a range of memory that's read from a file but treated > as anonymous for writeout (i.e. written to swap if we need to reclaim > it). Then at some time later, you want to designate it as written back > to the file instead so you control the writeout order. I'm not sure we > can do this: the separation between file backed and anonymous pages is > pretty deeply ingrained into the OS, but if it were possible, is that > what you want? Doesn't sound exactly like what I had in mind. What I was suggesting is an analogue of read() that, if it reads full pages of data to a page-aligned address, shares the data with the buffer cache until it's first written instead of actually copying the data. The pages are write-protected so that an attempt to write the address range causes a page fault. In response to such a fault, the pages become anonymous memory and the buffer cache no longer holds a reference to the page. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] shared memory message queues
On 14 January 2014 17:29, Robert Haas wrote: > On Mon, Dec 23, 2013 at 12:46 PM, Robert Haas wrote: >> Oh, dear. That's rather embarrassing. >> >> Incremental (incremental-shm-mq.patch) and full (shm-mq-v3.patch) >> patches attached. > > OK, I have pushed the patches in this stack. I'm not sure we quite > concluded the review back-and-forth but nobody really seems to have > had serious objections to this line of attack, other than wanting some > more comments which I have added. I don't doubt that there will be > more things to tweak and tune here, and a whole lot more stuff that > needs to be built using this infrastructure, but I don't think the > code that's here is going to get better for remaining outside the tree > longer. > > I decided not to change the dsm_toc patch to use functions instead of > macros as Andres suggested; the struct definition would still have > needed to be public, so any change would still need a recompile, at > least if the size of the struct changed. It could be changed to work > with a palloc'd chunk, but then you need to worry about > context-lifespan memory leaks and it so didn't seem worth it. I can't > imagine this having a lot of churn, anyway. postgres=# SELECT test_shm_mq(32768, (select string_agg(chr(32+(random()*96)::int), '') from generate_series(1,3)), 1, 10); ERROR: could not register background process HINT: You may need to increase max_worker_processes. STATEMENT: SELECT test_shm_mq(32768, (select string_agg(chr(32+(random()*96)::int), '') from generate_series(1,3)), 1, 10); LOG: registering background worker "test_shm_mq" LOG: starting background worker process "test_shm_mq" ERROR: could not register background process HINT: You may need to increase max_worker_processes. ERROR: unable to map dynamic shared memory segment LOG: worker process: test_shm_mq (PID 21939) exited with exit code 1 LOG: unregistering background worker "test_shm_mq" What's going on here? This occurs when starting Postgres and run as the first statement. I also noticed that everything exits with exit code 1: postgres=# SELECT test_shm_mq(32768, (select string_agg(chr(32+(random()*96)::int), '') from generate_series(1,400)), 1, 1); LOG: registering background worker "test_shm_mq" LOG: starting background worker process "test_shm_mq" test_shm_mq - (1 row) LOG: worker process: test_shm_mq (PID 22041) exited with exit code 1 LOG: unregistering background worker "test_shm_mq" -- Thom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
Robert Haas writes: > On Tue, Jan 14, 2014 at 11:57 AM, James Bottomley > wrote: >> No, I do ... you mean the order of write out, if we have to do it, is >> important. In the rest of the kernel, we do this with barriers, which >> cause ordered grouping of I/O chunks. If we could force a similar >> ordering in the writeout code, is that enough? > Probably not. There are a whole raft of problems here. For that to > be of any use, we'd have to move to mmap()ing each buffer instead > of read()ing them in, and apparently mmap() doesn't scale well to > millions of mappings. We would presumably mmap whole files, not individual pages (at least on 64-bit machines; else address space size is going to be a problem). However, without a fix for the critical-section/atomic-update problem, the idea's still going nowhere. > This would be pretty similar to copy-on-write, except without the > copying. It would just be forget-from-the-buffer-pool-on-write. That might possibly work. regards, tom lane
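To make the "whole files, not individual pages" point concrete: PostgreSQL already splits large relations into 1GB segment files, so on a 64-bit machine one mapping per segment keeps the mapping count low. A minimal sketch, with an invented segment path and only basic error handling:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "base/16384/16385";   /* invented relation segment path */
    int         fd = open(path, O_RDONLY);
    struct stat st;

    if (fd < 0 || fstat(fd, &st) != 0)
        return 1;

    /* One mapping covers the whole segment (up to 1GB), rather than one
     * mapping per 8kB buffer as the quoted objection assumes. */
    void *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
        return 1;

    printf("mapped %lld bytes at %p\n", (long long) st.st_size, base);
    munmap(base, st.st_size);
    close(fd);
    return 0;
}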
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, 2014-01-14 at 11:48 -0500, Robert Haas wrote: > On Tue, Jan 14, 2014 at 11:44 AM, James Bottomley > wrote: > > No, I'm sorry, that's never going to be possible. No user space > > application has all the facts. If we give you an interface to force > > unconditional holding of dirty pages in core you'll livelock the system > > eventually because you made a wrong decision to hold too many dirty > > pages. I don't understand why this has to be absolute: if you advise > > us to hold the pages dirty and we do up until it becomes a choice to > > hold on to the pages or to thrash the system into a livelock, why would > > you ever choose the latter? And if, as I'm assuming, you never would, > > why don't you want the kernel to make that choice for you? > > If you don't understand how write-ahead logging works, this > conversation is going nowhere. Suffice it to say that the word > "ahead" is not optional. No, I do ... you mean the order of write out, if we have to do it, is important. In the rest of the kernel, we do this with barriers, which cause ordered grouping of I/O chunks. If we could force a similar ordering in the writeout code, is that enough? James
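The ordering in question is easy to state in code. Below is a minimal sketch of the constraint, assuming plain buffered I/O and invented file names and record contents; the point is that step 2 must complete before step 3 is allowed to reach disk, which is why purely advisory "hold it if you can" semantics are not enough on their own:

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

int main(void)
{
    char wal_record[128] = "change: page 42 of base/16384/16385";  /* invented record */
    char page[BLCKSZ] = { 0 };                                     /* the modified data page */

    int wal_fd  = open("pg_xlog/000000010000000000000001", O_WRONLY | O_APPEND);
    int data_fd = open("base/16384/16385", O_WRONLY);
    if (wal_fd < 0 || data_fd < 0)
        return 1;

    /* 1. Write the WAL record describing the change ... */
    write(wal_fd, wal_record, sizeof(wal_record));
    /* 2. ... make it durable ("ahead") ... */
    fdatasync(wal_fd);
    /* 3. ... and only then may the data page itself be written back.
     * If the kernel writes the dirty data page out earlier on its own,
     * crash recovery can no longer reconstruct a consistent state. */
    pwrite(data_fd, page, BLCKSZ, 42 * (off_t) BLCKSZ);
    return 0;
}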
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote: > On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas wrote: > > > > In terms of avoiding double-buffering, here's my thought after reading > > what's been written so far. Suppose we read a page into our buffer > > pool. As long as the page is clean, it would be ideal for the mapping to > > be shared between the buffer cache and our pool, sort of like > > copy-on-write. That way, if we decide to evict the page, it will > > still be in the OS cache if we end up needing it again (remember, the > > OS cache is typically much larger than our buffer pool). But if the > > page is dirtied, then instead of copying it, just have the buffer pool > > forget about it, because at that point we know we're going to write > > the page back out anyway before evicting it. > > > > This would be pretty similar to copy-on-write, except without the > > copying. It would just be forget-from-the-buffer-pool-on-write. > > > But... either copy-on-write or forget-on-write needs a page fault, and > thus a page mapping. > > Is a page fault more expensive than copying 8k? > > (I really don't know). A page fault can be expensive, yes ... but perhaps you don't need one. What you want is a range of memory that's read from a file but treated as anonymous for writeout (i.e. written to swap if we need to reclaim it). Then at some time later, you want to designate it as written back to the file instead so you control the writeout order. I'm not sure we can do this: the separation between file backed and anonymous pages is pretty deeply ingrained into the OS, but if it were possible, is that what you want? James
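The "page fault versus copying 8k" question is easy to get a rough feel for. Here is a minimal sketch that times a memcpy() of an 8kB block against the minor fault taken on first touch of freshly mmap()ed anonymous memory; the numbers are only a lower bound for the copy-on-write or forget-on-write case, which would also pay for protection changes and, in userspace schemes, signal delivery (build with -O0 or the copy loop may be optimized away):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define BLCKSZ 8192
#define N      10000

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    static char src[BLCKSZ], dst[BLCKSZ];
    double      t;

    t = now_sec();
    for (int i = 0; i < N; i++)
    {
        src[0] = (char) i;              /* keep the copy from being hoisted */
        memcpy(dst, src, BLCKSZ);
    }
    printf("memcpy 8kB:        %.0f ns/op\n", (now_sec() - t) / N * 1e9);

    char *p = mmap(NULL, (size_t) N * BLCKSZ, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;
    t = now_sec();
    for (int i = 0; i < N; i++)
        p[(size_t) i * BLCKSZ] = 1;     /* one minor fault per touched 4kB page */
    printf("first-touch fault: %.0f ns/op\n", (now_sec() - t) / N * 1e9);
    return (dst[0] == 42);              /* keep dst live */
}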
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, 2014-01-14 at 10:39 -0500, Tom Lane wrote: > James Bottomley writes: > > The current mechanism for coherency between a userspace cache and the > > in-kernel page cache is mmap ... that's the only way you get the same > > page in both currently. > > Right. > > > glibc used to have an implementation of read/write in terms of mmap, so > > it should be possible to insert it into your current implementation > > without a major rewrite. The problem I think this brings you is > > uncontrolled writeback: you don't want dirty pages to go to disk until > > you issue a write() > > Exactly. > > > I think we could fix this with another madvise(): > > something like MADV_WILLUPDATE telling the page cache we expect to alter > > the pages again, so don't be aggressive about cleaning them. > > "Don't be aggressive" isn't good enough. The prohibition on early write > has to be absolute, because writing a dirty page before we've done > whatever else we need to do results in a corrupt database. It has to > be treated like a write barrier. > > > The problem is we can't give you absolute control of when pages are > > written back because that interface can be used to DoS the system: once > > we get too many dirty uncleanable pages, we'll thrash looking for memory > > and the system will livelock. > > Understood, but that makes this direction a dead end. We can't use > it if the kernel might decide to write anyway. No, I'm sorry, that's never going to be possible. No user space application has all the facts. If we give you an interface to force unconditional holding of dirty pages in core you'll livelock the system eventually because you made a wrong decision to hold too many dirty pages. I don't understand why this has to be absolute: if you advise us to hold the pages dirty and we do up until it becomes a choice to hold on to the pages or to thrash the system into a livelock, why would you ever choose the latter? And if, as I'm assuming, you never would, why don't you want the kernel to make that choice for you? James
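From the application side, the madvise() idea would look roughly like the sketch below. MADV_WILLUPDATE does not exist; it is only the flag suggested in this exchange, faked here with a placeholder value so the sketch compiles (a current kernel would reject it with EINVAL). MADV_DONTNEED is a real flag; everything else is illustrative:

#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_WILLUPDATE
#define MADV_WILLUPDATE 0x1000      /* hypothetical: "we expect to dirty this again, don't clean it aggressively" */
#endif

#define BLCKSZ 8192

/* After mapping a buffer we intend to keep dirtying. */
static void advise_buffer(void *page)
{
    /* Advice, not a guarantee: the kernel may still write the page back
     * under memory pressure, which is exactly why Tom argues above that
     * advisory semantics cannot express the WAL-before-data rule. */
    madvise(page, BLCKSZ, MADV_WILLUPDATE);
}

/* Once the page's WAL is durable and we have issued our own write. */
static void release_buffer(void *page)
{
    madvise(page, BLCKSZ, MADV_DONTNEED);   /* real flag: drop our interest in this range */
}

int main(void)
{
    void *page = mmap(NULL, BLCKSZ, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return 1;
    advise_buffer(page);    /* fails harmlessly (EINVAL) on current kernels */
    release_buffer(page);
    return 0;
}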
Re: [HACKERS] plpgsql.consistent_into
On 1/14/14, 6:15 PM, Tom Lane wrote: Marko Tiikkaja writes: How about: (a) = SELECT 1; (a, b) = SELECT 1, 2; (a, b) = INSERT INTO foo RETURNING col1, col2; Same semantics: TOO_MANY_ROWS on rows > 1, sets FOUND and row_count. AFAICT this can be parsed unambiguously, too, and we don't need to look at the query string because this is new syntax. The idea of inventing new syntax along this line seems like a positive direction to pursue. Since assignment already rejects multiple rows from the source expression, this wouldn't be weirdly inconsistent. It might be worth thinking about the UPDATE syntax that's in recent versions of the SQL standard: UPDATE targettab SET (a, b, c) = row-valued-expression [ , ... ] [ WHERE ... ] We don't actually implement this in PG yet, except for trivial cases, but it will certainly happen eventually. I think your sketch above deviates unnecessarily from what the standard says for UPDATE. In particular I think it'd be better to write things like (a, b) = ROW(1, 2); (a, b, c) = (SELECT x, y, z FROM foo WHERE id = 42); which would exactly match what you'd write in a multiple-assignment UPDATE, and it has the same rejects-multiple-rows semantics too. Hmm. That's a fair point I did not consider. Also note that the trivial cases we do already implement in UPDATE look like UPDATE targettab SET (a, b, c) = (1, 2, 3) [ WHERE ... ] that is, we allow a row constructor where the optional keyword ROW has been omitted. I think people would expect to be able to write this in plpgsql: (a, b) = (1, 2); Now, this doesn't provide any guidance for INSERT/UPDATE/DELETE RETURNING, but frankly I don't feel any need to invent new syntax for those, since RETURNING INTO already works the way you want. Yeah, I don't feel strongly about having to support them with this syntax. The inconsistency is a bit ugly, but it's not the end of the world. I'm not too sure what it'd take to make this work. Right now, SELECT (SELECT x, y FROM foo WHERE id = 42); would generate "ERROR: subquery must return only one column", but I think it's mostly a historical artifact that it does that rather than returning a composite value (of an anonymous record type). If we were willing to make that change then it seems like it'd be pretty straightforward to teach plpgsql to handle (a, b, ...) = row-valued-expression where there wouldn't actually be any need to parse the RHS any differently from the way plpgsql parses an assignment RHS right now. Which would be a good thing IMO. If we don't generalize the behavior of scalar subqueries then plpgsql would have to jump through a lot of hoops to support the subselect case. You can already do the equivalent of (a,b,c) = (1,2,3) with SELECT .. INTO. Would you oppose starting the work on this by only supporting the subquery syntax, with the implementation being similar to how we currently handle SELECT .. INTO? If we could also not flat-out reject including that in 9.4, I would be quite happy. Regards, Marko Tiikkaja