Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Hannu Krosing
On 01/14/2014 03:44 AM, Dave Chinner wrote:
 On Tue, Jan 14, 2014 at 02:26:25AM +0100, Andres Freund wrote:
 On 2014-01-13 17:13:51 -0800, James Bottomley wrote:
 a file into a user provided buffer, thus obtaining a page cache entry
 and a copy in their userspace buffer, then insert the page of the user
 buffer back into the page cache as the page cache page ... that's right,
 isn't it, postgres people?
 Pretty much, yes. We'd probably hint (*advise(DONTNEED)) that the page
 isn't needed anymore when reading. And we'd normally write if the page
 is dirty.
 So why, exactly, do you even need the kernel page cache here? You've
 got direct access to the copy of data read into userspace, and you
 want direct control of when and how the data in that buffer is
 written and reclaimed. Why push that data buffer back into the
 kernel and then have to add all sorts of kernel interfaces to
 control the page you already have control of?
To let the kernel do the job it is good at, namely managing the write-back
of dirty buffers to disk and managing (possible) read-ahead pages.

While we do have control of the page, we do not (and really don't want to)
have control of the complex and varied business of efficiently reading from
and writing to various file systems with possibly very different disk
configurations.

We much prefer the kernel to take care of that, and we generally like how
the kernel manages it.

We have a few suggestions for giving the kernel extra information about the
application's usage patterns for the data.

 Effectively you end up with buffered read/write that's also mapped into
 the page cache.  It's a pretty awful way to hack around mmap.
 Well, the problem is that you can't really use mmap() for the things we
 do. Postgres' durability works by guaranteeing that our journal entries
 (called WAL := Write Ahead Log) are written and synced to disk before the
 corresponding entries of tables and indexes reach the disk. That also
 allows us to group many random writes into a few contiguous writes
 fdatasync()ed at once. Only during a checkpointing phase is the big bulk of
 the data then (slowly, in the background) synced to disk.
 Which is the exact algorithm most journalling filesystems use for
 ensuring durability of their metadata updates.  Indeed, here's an
 interesting piece of architecture that you might like to consider:

 * Neither XFS and BTRFS use the kernel page cache to back their
   metadata transaction engines.
But file system code is supposed to know much more about the
underlying disk than a mere application program like PostgreSQL does.

We do not want to start duplicating the OS if we can avoid it.

What we would like is to have a way to tell the kernel:

1) here is the modified copy of a file page, it is now safe to write
it back - the current 'lazy' write

2) here is the page, write it back now, before returning success
to me - unbuffered write, or write + sync

but we would also like to have

3) here is the page as it is currently on disk, I may need it soon,
so keep it together with your other clean pages accessed at time X
- this is the non-dirtying write discussed

The page may already be in the buffer cache, in which case just update its
LRU position (to either the current time or a time provided by PostgreSQL),
or it may not be there, in which case put it there if that is reasonable
given its LRU position.

And we would like all this to work together with the Linux kernel's other
current goodness in managing the whole disk-side interaction: efficient
reading and writing, and managing the buffers :)
 Why not? Because the page cache is too simplistic to adequately
 represent the complex object heirarchies that the filesystems have
 and so it's flat LRU reclaim algorithms and writeback control
 mechanisms are a terrible fit and cause lots of performance issues
 under memory pressure.
The same is true for PostgreSQL - if we just used direct writes
and reads from disk then the performance would be terrible.

We would need to duplicate all the complicated algorithms file
systems use for good performance if we were to start implementing
that part of the file system ourselves.
 
 IOWs, the two most complex high performance transaction engines in
 the Linux kernel have moved to fully customised cache and (direct)
 IO implementations because the requirements for scalability and
 performance are far more complex than the kernel page cache
 infrastructure can provide.
And we would like to avoid implementing this again by delegating
this part of the work to said complex high performance transaction
engines in the Linux kernel.

We do not want to abandon all work on PostgreSQL business code
and go into file system development mode for the next few years.

Again, as said above, the Linux file system is doing fine. What we
want is a few ways to interact with it, to let it do even better when
working with PostgreSQL, by telling it some stuff it otherwise would
have to second-guess and by sometimes giving it back some cache
pages which were copied away for potential modifying but ended
up clean in the end.

Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-14 Thread David Rowley
On Tue, Jan 14, 2014 at 2:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Florian Pflug f...@phlo.org writes:
  I think it'd be worthwile to get this into 9.4, if that's still an
 option,
  even if we only support COUNT.

 My thought is

 (1) we can certainly implement inverse transitions for COUNT() and the
 integer variants of SUM(), and that alone would be a killer feature for
 some people.


100% agreed about SUM(int) and COUNT(), but so far I've managed to find
problems only with SUM(numeric); the other inverse transition functions seem
just fine with numeric types, e.g. AVG(numeric) and STDDEV(numeric). I can't
produce any results that differ from the unpatched head... but perhaps
someone can give me a case where things could vary?

I've been testing with something like:
select avg(n) over w,SUM(3::float) over w
from (values(1,10::numeric),(2,0),(3,0)) v(i,n)
window w as (order by i rows between current row and unbounded following);

where I included the sum(float) to force the old school brute force
transitions to be performed. I then compared the results to the ones given
without the sum(float) tagged on the end. I've so far not found anything
out of place.

(2) the float and numeric variants should be implemented under nondefault
 names (I'm thinking FAST_SUM(), but bikeshed away).  People who need
 extra speed and don't mind the slightly different results can alter
 their queries to use these variants.


I'm not as keen on this idea: if someone later thought of a nice way to
allow the inverse functions to provide exactly the same result as if they
were not used, then these would become redundant legacy that would likely
have to be supported forever and a day. I know we already have things like
that, but of course we want to minimise it as much as possible.

If AVG(numeric) and the stddev numeric aggregates turn out to be valid,
then numeric_avg_accum_inv() must remain in core to support them, and this
is exactly the function required for a user to define their own SUM(numeric)
with. We could leave it at that and document what a user must do to create
their own FAST_SUM, then perhaps leave the decision on including these in
core to later, based maybe on the number of complaints about SUM(numeric)
being slow when used in windows with a moving frame head.

(For me) It does not seem too unreasonable to think that one day we might
decide that:

SELECT AVG(n) FROM (VALUES(10::NUMERIC(10,3)),(0),(0)) v(n);

return 3.333 instead of 3. like it does now.

If we ever did that then we could support these numeric inverse transitions
on SUM() without having to worry about weird extra trailing zeros. Instead
we'd just give the user what they asked for, when they asked for it.
Reasons like that make me think that we might be jumping the gun a little
on FAST_SUM() as it could end up redundant more quickly than we might
initially think.


 One reason I'm thinking this is that whatever we do to ameliorate
 the semantic issues is going to slow down the forward transition
 function --- to no benefit unless the aggregate is being used as a
 window function in a moving window.  So I'm less than convinced
 that we *should* implement any of these designs in the default
 aggregates, even if we get to the point where we arguably *could*
 do it with little risk of functional differences.


So far there's only one case of possible slowdown of forward transitions,
and that's with numeric types.
I had to change do_numeric_accum() to add a NaN counter increment and also
change the logic so that it continues summing non-NaN values even after the
first NaN is encountered.
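
As a plain-SQL illustration of why the NaN counter matters (this is current
behaviour, nothing patch-specific): once a NaN enters a running numeric sum
the result stays NaN, so an inverse transition that later removes the NaN row
can only recover the true total if the non-NaN values have kept being
accumulated separately.

select sum(x) from (values ('1'::numeric), ('NaN'), ('2')) t(x);  -- NaN
select sum(x) from (values ('1'::numeric), ('2')) t(x);           -- 3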

Would you see this as too big a performance risk to include in core?

If not then I think we might be able to support AVG(numeric) and
STDDEV(numeric) functions as the patch sits without having to worry that
there might be an extra overhead on forward transitions that include NaN
values.


Florian has done some really good work researching the standards around
this area. His research seems to indicate that the results should be of the
same scale as the inputs, which the current patch does not do, providing
you assume that the user is showing us the intended scale based on the
numbers that we're working with rather than a scale specified by the column
or cast. Florian's idea of how to fix the scale on inverse transition seems
pretty valid to me and I think the overhead of it could be made pretty
minimal. I'm just not sure that I can implement it in such a way as to make
the overhead small enough to look like background noise. But I'm very
willing to give it a try and see...

*** some moments pass ***

I think unless anyone has some objections I'm going to remove the inverse
transition for SUM(numeric) and modify the documents to tell the user how
to build their own FAST_SUM(numeric) using the built-in functions.
I'm starting to think that playing around with resetting numeric scale will
turn a possible 9.4 patch into a 9.5/10.0 patch. I see no reason why what's
there so far, minus sum(numeric), can't go in...

Re: [HACKERS] Add CREATE support to event triggers

2014-01-14 Thread Pavel Stehule
Hello


2014/1/13 Alvaro Herrera alvhe...@2ndquadrant.com

 Alvaro Herrera wrote:

  In an event trigger, the function
 pg_event_trigger_get_creation_commands()
  returns the following JSON blob:

 After playing with this for a while, I realized something that must have
 seemed quite obvious to those paying attention: what this function is,
 is just a glorified sprintf() for JSON.  So I propose we take our
 existing format(text) and use it to model a new format(json) function,
 which will be useful to the project at hand and be of more general
 applicability.

 To make it a better fit, I have changed the spec slightly.  The format
 string is now the fmt element in the topmost JSON.  This format string
 can contain % escapes, which consist of:

 * the literal % itself
 * an element name, enclosed in braces { }.  The name can optionally be
   followed by a colon and a possibly-empty array separator.
 * a format specifier, which can be I (identifier), D (dotted name), or s
   (string)
 * Alternatively, %% expands to a literal %, as usual.

 For each such escape, the JSON object is searched using the element name
 as key.  For identifiers, the element is expected to be a string, and
 will be quoted per identifier quoting rules.  Dotted-names are used to
 format possibly-qualified relation names and such; the element must be
 an object with one, two or three string elements, each of which is
 quoted per identifier rules, and output separated by periods.

 Finally, for arrays we expand each element in the JSON array element,
 and separate them with the separator specified in the {} part of the
 format specifier.

 For instance,
 alvherre=# select format(json '{"fmt":"hello, %{who}s! This is %{name}I",
 "who":"world", "name":"a function"}');
                format
 -----------------------------------
  hello, world! This is "a function"

 Elements can be objects, in which case they are expanded recursively: a
 fmt element is looked up and expanded as described above.
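
To illustrate the array separator and identifier escapes described above, a
hypothetical invocation might look like this (element names and values are
invented for the example; the function only exists in the proposed patch, so
the output shown is what the rules above suggest, not tested behaviour):

select format(json '{"fmt":"CREATE TABLE %{tab}I (%{cols:, }s)",
                     "tab":"My Table",
                     "cols":["a integer", "b text"]}');
-- expected, per the description: CREATE TABLE "My Table" (a integer, b text)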


 I don't yet see a need for %L escapes (that is, literals that can expand
 to a single-quoted value or to NULL), but if I see it I will add that
 too.


I am not against this idea, although I don't see a strong benefit.
Maybe a special-purpose function would be better - it has minimal relation to
the variadic format() function - and a nested, mixed format can get messy.



 --
 Álvaro Herrera                http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services





Re: [HACKERS] Proposal: variant of regclass

2014-01-14 Thread Tatsuo Ishii
 On Tue, Jan 14, 2014 at 4:28 PM, Yugo Nagata nag...@sraoss.co.jp wrote:
 Here is the patch to implement to_regclass, to_regproc, to_regoper,
 and to_regtype. They are new functions similar to regclass, regproc,
 regoper, and regtype, except that if the requested object is not found,
 they return InvalidOid rather than raising an error.
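
A hedged sketch of the intended difference, using the function names from the
proposal above (the exact "not found" representation is whatever the patch
settles on):

select 'no_such_table'::regclass;     -- raises an error today
select to_regclass('no_such_table');  -- proposed: no error, just "not found"
select to_regclass('pg_class');       -- resolves to pg_class's OID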
 
 You should add this patch to the upcoming commit fest (beginning
 tomorrow actually), I am not seeing it in the list:
 https://commitfest.postgresql.org/action/commitfest_view?id=21
 Thanks,
 -- 
 Michael

Of course. The problem is that, in order to add it to the commitfest, the
patch's URL is needed in the first place, and that URL does not exist until
the posting has been made.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp




Re: [HACKERS] Capturing EXPLAIN ANALYZE for a query without discarding the normal result set

2014-01-14 Thread Marti Raudsepp
On Tue, Jan 14, 2014 at 5:13 AM, Marti Raudsepp ma...@juffo.org wrote:
 You can use the auto_explain contrib module

I just remembered that there's also the pg_stat_plans extension, which
is closer to what you asked:
https://github.com/2ndQuadrant/pg_stat_plans . This one you'll have to
build yourself (it's not in contrib).
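
For reference, a minimal auto_explain setup that captures plans without
discarding the normal result set (values are illustrative only; the plan goes
to the server log while the query still returns its rows to the client):

load 'auto_explain';
set auto_explain.log_min_duration = 0;   -- log every statement's plan
set auto_explain.log_analyze = on;       -- include actual rows/timings
select count(*) from pg_class;           -- result returned as usual, plan logged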

Regards,
Marti




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Hannu Krosing
On 01/13/2014 11:22 PM, James Bottomley wrote:

 The less exciting, more conservative option would be to add kernel
 interfaces to teach Postgres about things like raid geometries. Then
 Postgres could use directio and decide to do prefetching based on the
 raid geometry, how much available i/o bandwidth and iops is available,
 etc.

 Reimplementing i/o schedulers and all the rest of the work that the
 kernel provides inside Postgres just seems like something outside our
 competency and that none of us is really excited about doing.
 This would also be a well trodden path ... I believe that some large
 database company introduced Direct IO for roughly this purpose.

The file systems at that time were much worse than they are now,
so said large companies had no choice but to write their own.

As Linux file handling has been much better for most of PostgreSQL's
active development, we have been able to avoid doing the same
and still get reasonable performance.

What has been pointed out above are some (allegedly
desktop/mobile-influenced) decisions which broke good
performance.

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] plpgsql.consistent_into

2014-01-14 Thread Pavel Stehule
 I am thinking a GUC and a plpgsql option can live together. If you want to
 emphasise some behaviour, you can use the plpgsql option. On the other hand,
 if I want to use some functionality that is safe, I don't want to dirty the
 source code with repeated options - but then I have to check (and accept the
 risk of) the GUC settings.

 One idea: a required GUC? It could be nice to have a way to ensure certain
 GUC settings, and to restore those values or raise a warning.

 Back to the main topic. The requested and described feature doesn't change
 the behaviour of the INTO clause. I can enable or disable this functionality,
 and well-written code should work without change (and without problems).
 When the check is disabled, execution is just less safe. So in this case the
 impact of the GUC is significantly smaller than the issues you described.
 Does anybody know of a use case where this check should be disabled?

 We probably have different experiences with GUCs. I had a problem with
 standard_conforming_strings and the bytea format some years ago; now I
 prepare documentation about the required settings. But I see (from my
 experience in the Czech area) more frequent problems related to
 effective_cache_size or from_collapse_limit and similar GUCs. These
 parameters are beyond the knowledge (and visibility) of the typical user.


 ISTM that in this case, it should be safe to make the new default behavior
 STRICT; if you forget to set the GUC to disable it then you'll get an error
 that points directly at the problem, at which point you'll go "Oh, yeah...
 I forgot to set X..."


I disagree - STRICT can be too restrictive - and a combination of SELECT INTO
and IF FOUND can be significantly faster: our exception handling is not
cheap.
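
A minimal sketch of the two idioms being contrasted here (hypothetical items
table, purely to show where the exception-handling cost would come from):

create function get_price_strict(p_id int) returns numeric as $$
declare v numeric;
begin
  -- raises an error for zero rows or for more than one row
  select price into strict v from items where id = p_id;
  return v;
end $$ language plpgsql;

create function get_price_checked(p_id int) returns numeric as $$
declare v numeric;
begin
  select price into v from items where id = p_id;
  if not found then
    return null;  -- handle "no row" without raising or catching an exception
  end if;
  return v;
end $$ language plpgsql;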

Regards

Pavel


 Outside of the GUC, I believe the default should definitely be STRICT. If
 your app is relying on non-strict then you need to be made aware of that.
 We should be able to provide a DO block that will change this setting for
 every function you've got if someone isn't happy with STRICT mode.
 --
 Jim C. Nasby, Data Architect   j...@nasby.net
 512.569.9461 (cell) http://jim.nasby.net



Re: [HACKERS] GIN improvements part 1: additional information

2014-01-14 Thread Heikki Linnakangas

On 01/13/2014 07:07 PM, Alexander Korotkov wrote:

I've fixed this bug and many other bug. Now patch passes test suite that
I've used earlier. The results are so:

Operations time:
  event | period
---+-
  index_build   | 00:01:47.53915
  index_build_recovery  | 00:00:04
  index_update  | 00:05:24.388163
  index_update_recovery | 00:00:53
  search_new| 00:24:02.289384
  search_updated| 00:27:09.193343
(6 rows)

Index sizes:
  label |   size
---+---
  new   | 384761856
  after_updates | 667942912
(2 rows)

Also, I made following changes in algorithms:

- Now, there is a limit to number of uncompressed TIDs in the page.
After reaching this limit, they are encoded independent on if they can fit
page. That seems to me more desirable behaviour and somehow it accelerates
search speed. Before this change times were following:

  event | period
---+-
  index_build   | 00:01:51.467888
  index_build_recovery  | 00:00:04
  index_update  | 00:05:03.315155
  index_update_recovery | 00:00:51
  search_new| 00:24:43.194882
  search_updated| 00:28:36.316784
(6 rows)


Hmm, that's strange. Any idea why that's happening? One theory is that 
when you re-encode the pages more aggressively, there are fewer pages 
with a mix of packed and unpacked items. Mixed pages are somewhat slower 
to scan than fully packed or fully unpacked pages, because 
GinDataLeafPageGetItems() has to merge the packed and unpacked items 
into a single list. But I wouldn't expect that effect to be large enough 
to explain the results you got.



- Page are not fully re-encoded if it's enough to re-encode just last
segment.


Great! We should also take advantage of that in the WAL record that's
written; no point in WAL-logging all the segments if we know that only
the last one was modified.


- Heikki




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Claudio Freire
On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing ha...@2ndquadrant.com wrote:
 Again, as said above the linux file system is doing fine. What we
 want is a few ways to interact with it to let it do even better when
 working with postgresql by telling it some stuff it otherwise would
 have to second guess and by sometimes giving it back some cache
 pages which were copied away for potential modifying but ended
 up clean in the end.

You don't need new interfaces. Only a slight modification of what
fadvise DONTNEED does.

This insistence on injecting pages from postgres into the kernel is just a
bad idea. At the very least, it still needs postgres to know too much
of the filesystem (block layout) to properly work. Ie: pg must be
required to put entire filesystem-level blocks into the page cache,
since that's how the page cache works. At the very worst, it may
introduce serious security and reliability implications, when
applications can destroy the consistency of the page cache (even if
full access rights are checked, there's still the possibility this
inconsistency might be exploitable).

Simply making fadvise DONTNEED move pages to the head of the LRU (ie:
discard next if you need) should work as expected without all the
complication of the above proposal.




Re: [HACKERS] GIN improvements part 1: additional information

2014-01-14 Thread Alexander Korotkov
On Tue, Jan 14, 2014 at 12:34 PM, Heikki Linnakangas 
hlinnakan...@vmware.com wrote:

 On 01/13/2014 07:07 PM, Alexander Korotkov wrote:

 I've fixed this bug and many other bug. Now patch passes test suite that
 I've used earlier. The results are so:

 Operations time:
   event | period
 ---+-
   index_build   | 00:01:47.53915
   index_build_recovery  | 00:00:04
   index_update  | 00:05:24.388163
   index_update_recovery | 00:00:53
   search_new| 00:24:02.289384
   search_updated| 00:27:09.193343
 (6 rows)

 Index sizes:
   label |   size
 ---+---
   new   | 384761856
   after_updates | 667942912
 (2 rows)

 Also, I made following changes in algorithms:

 - Now, there is a limit to number of uncompressed TIDs in the page.

 After reaching this limit, they are encoded independent on if they
 can fit
 page. That seems to me more desirable behaviour and somehow it
 accelerates
 search speed. Before this change times were following:

   event | period
 ---+-
   index_build   | 00:01:51.467888
   index_build_recovery  | 00:00:04
   index_update  | 00:05:03.315155
   index_update_recovery | 00:00:51
   search_new| 00:24:43.194882
   search_updated| 00:28:36.316784
 (6 rows)


 Hmm, that's strange. Any idea why that's happening? One theory is that
 when you re-encode the pages more aggressively, there are fewer pages with
 a mix of packed and unpacked items. Mixed pages are somewhat slower to scan
 than fully packed or fully unpacked pages, because
 GinDataLeafPageGetItems() has to merge the packed and unpacked items into a
 single list. But I wouldn't expect that effect to be large enough to
 explain the results you got.


Probably, it's because of less work in ginMergeItemPointers.

 - Page are not fully re-encoded if it's enough to re-encode just last
 segment.


 Great! We should also take advantage of that in the WAL record that's
 written; no point in WAL-logging all the segments, if we know that only
 last one was modified.


Already.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] UNION ALL on partitioned tables won't use indices.

2014-01-14 Thread Kyotaro HORIGUCHI
This is continued from the previous CF3.

You'll find the overview and the subsequent discussion in the thread
beginning there. The issue remains as of the current 9.4dev head.

http://www.postgresql.org/message-id/20131024.193953.233464126.horiguchi.kyot...@lab.ntt.co.jp

The issue, in brief, is that UNION ALL on inheritance tables does
not make use of indices, in contrast to UNION ALL on bare tables. This
is because inheritance tables produce appendrels nested more than 2
levels deep.

Some UNION ALL operations - especially those using ORDER BY and LIMIT
- will be accelerated by this patch; a sketch of the affected query
shape follows below. Details are in the message above.
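
A hypothetical sketch of the affected query shape (names invented): each
branch is itself an inheritance parent with an index on (x), and without the
patch the ORDER BY ... LIMIT cannot be served by those indexes through the
nested appendrels.

select x from parent_a
union all
select x from parent_b
order by x limit 10;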

I proposed 3 types of solution, but one of them - tweaking the
Equivalence Class (Type 1) - was dismissed because it manipulated
Equivalence Classes inappropriately. The remaining two do
essentially the same thing - flattening appendrels - at different
timings.

In type 2, the process is implemented as a generic appendrel
flattening function and applied after expand_inherited_tables()
in subquery_planner.

In type 3, the equivalent process is folded into
expand_inherited_rtentry().

Type 2 loops once over the whole appendrel list (in most cases), so it
might be more efficient than type 3, which searches for the parent
appendrel for each appended one. Type 3 is the more appropriate design
under the assumption that appendrel trees deeper than 2 levels are
generated only by expand_inherited_tables().

The attached two patches are rebased onto the current 9.4dev HEAD, and
make check at the topmost directory and in src/test/isolation passes
without error. One bug was found and fixed on the way: an assertion
failure caused by a probably unexpected type conversion introduced by
collapse_appendrels(), which led to an implicit whole-row cast from a
valid reltype to the invalid reltype (0) inside
adjust_appendrel_attrs_mutator().

Attached files are the two versions of patches mentioned above.

 - unionall_inh_idx_typ2_v4_20140114.patch
 - unionall_inh_idx_typ3_v4_20140114.patch

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index e249628..582e8c3 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -57,6 +57,11 @@ typedef struct
 	AppendRelInfo *appinfo;
 } adjust_appendrel_attrs_context;
 
+typedef struct {
+	AppendRelInfo	*child_appinfo;
+	Index			 target_rti;
+} transvars_merge_context;
+
 static Plan *recurse_set_operations(Node *setOp, PlannerInfo *root,
 	   double tuple_fraction,
 	   List *colTypes, List *colCollations,
@@ -98,6 +103,8 @@ static List *generate_append_tlist(List *colTypes, List *colCollations,
 	  List *input_plans,
 	  List *refnames_tlist);
 static List *generate_setop_grouplist(SetOperationStmt *op, List *targetlist);
+static Node *transvars_merge_mutator(Node *node,
+	 transvars_merge_context *ctx);
 static void expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte,
 		 Index rti);
 static void make_inh_translation_list(Relation oldrelation,
@@ -1210,6 +1217,34 @@ expand_inherited_tables(PlannerInfo *root)
 	}
 }
 
+static Node *
+transvars_merge_mutator(Node *node, transvars_merge_context *ctx)
+{
+	if (node == NULL)
+		return NULL;
+
+	if (IsA(node, Var))
+	{
+		Var *oldv = (Var *) node;
+
+		if (!oldv->varlevelsup && oldv->varno == ctx->target_rti)
+		{
+			if (oldv->varattno >
+				list_length(ctx->child_appinfo->translated_vars))
+				elog(ERROR,
+					 "attribute %d of relation \"%s\" does not exist",
+					 oldv->varattno,
+					 get_rel_name(ctx->child_appinfo->parent_reloid));
+
+			return (Node *) copyObject(
+				list_nth(ctx->child_appinfo->translated_vars,
+						 oldv->varattno - 1));
+		}
+	}
+	return expression_tree_mutator(node,
+								   transvars_merge_mutator, (void *) ctx);
+}
+
 /*
  * expand_inherited_rtentry
  *		Check whether a rangetable entry represents an inheritance set.
@@ -1237,6 +1272,7 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	List	   *inhOIDs;
 	List	   *appinfos;
 	ListCell   *l;
+	AppendRelInfo *parent_appinfo = NULL;
 
 	/* Does RT entry allow inheritance? */
 	if (!rte->inh)
@@ -1301,6 +1337,21 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 		oldrc->isParent = true;
 
 	/*
+	 * If the parent relation appears in a subselect of UNION ALL, it has a
+	 * further parent appendrelinfo. Save it to pull up inheritance children
+	 * later.
+	 */
+	foreach(l, root->append_rel_list)
+	{
+		AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(l);
+		if (appinfo->child_relid == rti)
+		{
+			parent_appinfo = appinfo;
+			break;
+		}
+	}
+	
+	/*
 	 * Must open the parent relation to examine its tupdesc.  We need not lock
 	 * it; we assume the rewriter already did.
 	 */
@@ -1378,6 +1429,28 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 		}
 
 		/*
+		 * Pull up this appinfo to just above the parent. The parent
+		 * relation has its 

Re: [HACKERS] Get more from indices.

2014-01-14 Thread Kyotaro HORIGUCHI
Hello. Since CF4 is already closed but this patch remains marked
as 'Ready for Committer', please let me re-post the latest
version for CF4 so that it doesn't vanish :-p

 tgl But aside from hasty typos,
 
 Oops! I've picked up wrong node. It always denies pathkeys extension.
 
 | !IsA(member, Var))
 
 is a mistake of the following.
 
 | !IsA(member-em_expr, Var))

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 4f63906..b695e40 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1790,6 +1790,7 @@ _outIndexOptInfo(StringInfo str, const IndexOptInfo *node)
 	WRITE_BOOL_FIELD(predOK);
 	WRITE_BOOL_FIELD(unique);
 	WRITE_BOOL_FIELD(immediate);
+	WRITE_BOOL_FIELD(allnotnull);
 	WRITE_BOOL_FIELD(hypothetical);
 	/* we don't bother with fields copied from the pg_am entry */
 }
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 606734a..0b2f529 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -952,8 +952,11 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 	{
 		index_pathkeys = build_index_pathkeys(root, index,
 			  ForwardScanDirection);
-		useful_pathkeys = truncate_useless_pathkeys(root, rel,
-	index_pathkeys);
+		if (index_pathkeys_are_extensible(root, index, index_pathkeys))
+			useful_pathkeys = root->query_pathkeys;
+		else
+			useful_pathkeys = truncate_useless_pathkeys(root, rel,
+		index_pathkeys);
 		orderbyclauses = NIL;
 		orderbyclausecols = NIL;
 	}
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 9c8ede6..ad3a8b7 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -502,6 +502,60 @@ build_index_pathkeys(PlannerInfo *root,
 }
 
 /*
+ * index_pathkeys_are_extensible
+ *	  Check whether the pathkeys are extensible to query_pathkeys.
+ */
+bool
+index_pathkeys_are_extensible(PlannerInfo *root,
+			  IndexOptInfo *index,
+			  List *pathkeys)
+{
+	bool		result;
+	ListCell   *lc1;
+
+	if (root->query_pathkeys == NIL || pathkeys == NIL)
+		return false;
+
+	if (!index->unique || !index->immediate || !index->allnotnull)
+		return false;
+
+	if (!pathkeys_contained_in(pathkeys, root->query_pathkeys))
+		return false;
+
+	if (list_length(pathkeys) == list_length(root->query_pathkeys))
+		return true;
+
+	result = true;
+	foreach(lc1, root->query_pathkeys)
+	{
+		PathKey    *pathkey = (PathKey *) lfirst(lc1);
+		bool		found = false;
+		ListCell   *lc2;
+
+		foreach(lc2, pathkey->pk_eclass->ec_members)
+		{
+			EquivalenceMember *member = (EquivalenceMember *) lfirst(lc2);
+
+			if (!bms_equal(member->em_relids, index->rel->relids) ||
+				!IsA(member->em_expr, Var))
+				continue;
+			else
+			{
+				found = true;
+				break;
+			}
+		}
+
+		if (!found)
+		{
+			result = false;
+			break;
+		}
+	}
+	return result;
+}
+
+/*
  * build_expression_pathkey
  *	  Build a pathkeys list that describes an ordering by a single expression
  *	  using the given sort operator.
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index de981cb..4e24220 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -333,6 +333,26 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 			info->immediate = index->indimmediate;
 			info->hypothetical = false;
 
+			info->allnotnull = true;
+			for (i = 0; i < ncolumns; i++)
+			{
+				int			attrno = info->indexkeys[i];
+
+				if (attrno == 0)
+				{
+					info->allnotnull = false;
+					break;
+				}
+				else if (attrno > 0)
+				{
+					if (!relation->rd_att->attrs[attrno - 1]->attnotnull)
+					{
+						info->allnotnull = false;
+						break;
+					}
+				}
+			}
+
 			/*
 			 * Estimate the index size.  If it's not a partial index, we lock
 			 * the number-of-tuples estimate to equal the parent table; if it
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index a9219e0..d9a3b9b 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -525,6 +525,7 @@ typedef struct IndexOptInfo
 	bool		predOK;			/* true if predicate matches query */
 	bool		unique;			/* true if a unique index */
 	bool		immediate;		/* is uniqueness enforced immediately? */
+	bool		allnotnull;		/* true if index's keys are all not null */
 	bool		hypothetical;	/* true if index doesn't really exist */
 	bool		canreturn;		/* can index return IndexTuples? */
 	bool		amcanorderbyop; /* does AM support order by operator result? */
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 999adaa..5ee2e56 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -196,5 +196,8 @@ extern List *truncate_useless_pathkeys(PlannerInfo *root,
 		  RelOptInfo *rel,
 		  List *pathkeys);
 extern bool 

Re: [HACKERS] Using indices for UNION.

2014-01-14 Thread Kyotaro HORIGUCHI
This is continued from CF3.

http://www.postgresql.org/message-id/20131122.165927.27412386.horiguchi.kyot...@lab.ntt.co.jp

The issue, in brief, is that UNION is never flattened, unlike UNION ALL,
so UNION cannot make use of an index scan even where one is usable.

This patch flattens UNION in the same way as is currently done for UNION
ALL; a sketch of the kind of query that benefits follows below.
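
A hypothetical sketch of a query that benefits (names invented): with an
index on t1(x) and one on t2(x), flattening the UNION lets the planner use
index scans plus a unique/merge step instead of appending both inputs and
sorting the whole result.

select x from t1
union
select x from t2
order by x limit 10;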

This patch needs the other 'UNION ALL' patch, and gains further from yet
another patch, 'Widening application of indices' (now 'Ready for
Committer' in CF3).

http://www.postgresql.org/message-id/20140114.180447.145186052.horiguchi.kyot...@lab.ntt.co.jp
http://www.postgresql.org/message-id/20140114.180810.122352231.horiguchi.kyot...@lab.ntt.co.jp


You can see the detailed outline in the message here:

http://www.postgresql.org/message-id/20131031.194251.12577697.horiguchi.kyot...@lab.ntt.co.jp

The attached patch is rebased onto the current 9.4dev HEAD, and make
check at the topmost directory and in src/test/isolation passes
without error.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Heikki Linnakangas

On 01/14/2014 12:26 AM, Mel Gorman wrote:

On Mon, Jan 13, 2014 at 03:15:16PM -0500, Robert Haas wrote:

The other thing that comes to mind is the kernel's caching behavior.
We've talked a lot over the years about the difficulties of getting
the kernel to write data out when we want it to and to not write data
out when we don't want it to.


Is sync_file_range() broke?


When it writes data back to disk too
aggressively, we get lousy throughput because the same page can get
written more than once when caching it for longer would have allowed
write-combining.


Do you think that is related to dirty_ratio or dirty_writeback_centisecs?
If it's dirty_writeback_centisecs then that would be particularly tricky
because poor interactions there would come down to luck basically.



When it doesn't write data to disk aggressively
enough, we get huge latency spikes at checkpoint time when we call
fsync() and the kernel says uh, what? you wanted that data *on the
disk*? sorry boss! and then proceeds to destroy the world by starving
the rest of the system for I/O for many seconds or minutes at a time.


Ok, parts of that are somewhat expected. It *may* depend on the
underlying filesystem. Some of them handle fsync better than others. If
you are syncing the whole file though when you call fsync then you are
potentially burned by having to writeback dirty_ratio amounts of memory
which could take a substantial amount of time.


We've made some desultory attempts to use sync_file_range() to improve
things here, but I'm not sure that's really the right tool, and if it
is we don't know how to use it well enough to obtain consistent
positive results.


That implies that either sync_file_range() is broken in some fashion we
(or at least I) are not aware of and that needs kicking.


Let me try to explain the problem: Checkpoints can cause an I/O spike, 
which slows down other processes.


When it's time to perform a checkpoint, PostgreSQL will write() all 
dirty buffers from the PostgreSQL buffer cache, and finally perform an 
fsync() to flush the writes to disk. After that, we know the data is 
safely on disk.


In older PostgreSQL versions, the write() calls would cause an I/O storm 
as the OS cache quickly fills up with dirty pages, up to dirty_ratio, 
and after that all subsequent write()s block. That's OK as far as the 
checkpoint is concerned, but it significantly slows down queries running 
at the same time. Even a read-only query often needs to write(), to 
evict a dirty page from the buffer cache to make room for a different 
page. We made that less painful by adding sleeps between the write() 
calls, so that they are trickled over a long period of time and 
hopefully stay below dirty_ratio at all times. However, we still have to
perform the fsync()s after the write()s, and sometimes that still causes
a similar I/O storm.
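
(For reference, the trickling described above is governed by the existing
checkpoint_completion_target setting, together with checkpoint_timeout;
shown here only for orientation, not as recommended values:)

show checkpoint_completion_target;  -- 0.5 by default: spread writes over half the interval
show checkpoint_timeout;            -- how often checkpoints come due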


The checkpointer is not in a hurry. A checkpoint typically has 10-30 
minutes to finish, before it's time to start the next checkpoint, and 
even if it misses that deadline that's not too serious either. But the 
OS doesn't know that, and we have no way of telling it.


As a quick fix, some sort of a lazy fsync() call would be nice. It would 
behave just like fsync() but it would not change the I/O scheduling at 
all. Instead, it would sleep until all the pages have been flushed to 
disk, at the speed they would've been without the fsync() call.


Another approach would be to give the I/O that the checkpointer process 
initiates a lower priority. This would be slightly preferable, because 
PostgreSQL could then issue the writes() as fast as it can, and have the 
checkpoint finish earlier when there's not much other load. Last I 
looked into this (which was a long time ago), there was no suitable 
priority system for writes, only reads.


- Heikki




Re: [HACKERS] Using indices for UNION.

2014-01-14 Thread Kyotaro HORIGUCHI
Sorry, I forgot to attach the file.

 This is cont'd from CF3.
 
 http://www.postgresql.org/message-id/20131122.165927.27412386.horiguchi.kyot...@lab.ntt.co.jp
 
 The issue in brief is that UNION is never flattened differently
 to UNION ALL so UNION cannot make use of index scan even if
 usable.
 
 This patch flattens UNION likewise currently did for UNION ALL.
 
 This patch needs another 'UNION ALL' patch and further gain with
 even another 'Widening application of indices' patch. ('Ready for
 Commit' now in CF3..)
 
 http://www.postgresql.org/message-id/20140114.180447.145186052.horiguchi.kyot...@lab.ntt.co.jp
 http://www.postgresql.org/message-id/20140114.180810.122352231.horiguchi.kyot...@lab.ntt.co.jp
 
 
 You can see the detailed outlines in the message at here,
 
 http://www.postgresql.org/message-id/20131031.194251.12577697.horiguchi.kyot...@lab.ntt.co.jp
 
 The attached patche is rebased to current 9.4dev HEAD and make
 check at the topmost directory and src/test/isolation are passed
 without error.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 1da4b2f..e89f8b3 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -353,13 +353,14 @@ subquery_planner(PlannerGlobal *glob, Query *parse,
 		pull_up_subqueries(root, (Node *) parse->jointree);
 
 	/*
-	 * If this is a simple UNION ALL query, flatten it into an appendrel. We
-	 * do this now because it requires applying pull_up_subqueries to the leaf
-	 * queries of the UNION ALL, which weren't touched above because they
-	 * weren't referenced by the jointree (they will be after we do this).
+	 * If this is a simple UNION/UNION ALL query, flatten it into an
+	 * appendrel. We do this now because it requires applying
+	 * pull_up_subqueries to the leaf queries of the UNION/UNION ALL, which
+	 * weren't touched above because they weren't referenced by the jointree
+	 * (they will be after we do this).
 	 */
 	if (parse->setOperations)
-		flatten_simple_union_all(root);
+		flatten_simple_union(root);
 
 	/*
 	 * Detect whether any rangetable entries are RTE_JOIN kind; if not, we can
diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c
index 1c6083b..04bba48 100644
--- a/src/backend/optimizer/prep/prepjointree.c
+++ b/src/backend/optimizer/prep/prepjointree.c
@@ -32,6 +32,7 @@
 #include "optimizer/subselect.h"
 #include "optimizer/tlist.h"
 #include "parser/parse_relation.h"
+#include "parser/parse_clause.h"
 #include "parser/parsetree.h"
 #include "rewrite/rewriteManip.h"
 
@@ -81,8 +82,8 @@ static void make_setop_translation_list(Query *query, Index newvarno,
 static bool is_simple_subquery(Query *subquery, RangeTblEntry *rte,
    JoinExpr *lowest_outer_join);
 static bool is_simple_union_all(Query *subquery);
-static bool is_simple_union_all_recurse(Node *setOp, Query *setOpQuery,
-			List *colTypes);
+static bool is_simple_union_recurse(SetOperationStmt *topop, Node *setOp,
+	Query *setOpQuery, List *colTypes);
 static bool is_safe_append_member(Query *subquery);
 static bool jointree_contains_lateral_outer_refs(Node *jtnode, bool restricted,
 	 Relids safe_upper_varnos);
@@ -1440,12 +1441,14 @@ is_simple_union_all(Query *subquery)
 		return false;
 
 	/* Recursively check the tree of set operations */
-	return is_simple_union_all_recurse((Node *) topop, subquery,
-									   topop->colTypes);
+	if (topop->op != SETOP_UNION || !topop->all) return false;
+	return is_simple_union_recurse(topop, (Node *) topop, subquery,
+								   topop->colTypes);
 }
 
 static bool
-is_simple_union_all_recurse(Node *setOp, Query *setOpQuery, List *colTypes)
+is_simple_union_recurse(SetOperationStmt *topop, Node *setOp,
+		Query *setOpQuery, List *colTypes)
 {
 	if (IsA(setOp, RangeTblRef))
 	{
@@ -1463,13 +1466,16 @@ is_simple_union_all_recurse(Node *setOp, Query *setOpQuery, List *colTypes)
 	{
 		SetOperationStmt *op = (SetOperationStmt *) setOp;
 
-		/* Must be UNION ALL */
-		if (op->op != SETOP_UNION || !op->all)
+		/* All SetOps under topop are UNION and ->all is same to topop */
+		if (op->op != SETOP_UNION || op->all != topop->all)
 			return false;
 
 		/* Recurse to check inputs */
-		return is_simple_union_all_recurse(op->larg, setOpQuery, colTypes) &&
-			is_simple_union_all_recurse(op->rarg, setOpQuery, colTypes);
+		return
+			is_simple_union_recurse(topop, op->larg,
+									setOpQuery, colTypes) &&
+			is_simple_union_recurse(topop, op->rarg,
+									setOpQuery, colTypes);
 	}
 	else
 	{
@@ -1908,23 +1914,22 @@ pullup_replace_vars_subquery(Query *query,
 		   NULL);
 }
 
-
 /*
  * flatten_simple_union_all
- *		Try to optimize top-level UNION ALL structure into an appendrel
+ *		Try to optimize top-level UNION/UNION ALL structure into an appendrel
  *
- * If a query's setOperations tree consists entirely of simple UNION ALL
- * operations, flatten it 

Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-14 Thread David Rowley
On Tue, Jan 14, 2014 at 9:09 PM, David Rowley dgrowle...@gmail.com wrote:

 I think unless anyone has some objections I'm going to remove the inverse
 transition for SUM(numeric) and modify the documents to tell the user how
 to build their own FAST_SUM(numeric) using the built in functions to do it.
 I'm starting to think that playing around with resetting numeric scale will
 turn a possible 9.4 patch into a 9.5/10.0 patch. I see no reason why what's
 there so far, minus sum(numeric), can't go in...


Of course it's only now that I discover that it's not possible to do this:

CREATE AGGREGATE fast_sum (numeric)
(
stype = numeric,
sfunc = numeric_avg_accum,
invfunc = numeric_avg_accum_inv,
finalfunc = numeric_sum
);

because SUM(numeric) uses an internal type to store the transition state.

hmmm, built-in fast_sum anyone?
Is there any simple way to limit these to only be used in the context of a
window? If so is it worth it?
Would we want fast_sum() for float too?

Regards

David Rowley


Re: [HACKERS] plpgsql.consistent_into

2014-01-14 Thread Pavel Stehule
2014/1/14 Florian Pflug f...@phlo.org

 On Jan14, 2014, at 00:52 , Marko Tiikkaja ma...@joh.to wrote:
  When I've worked with PL/PgSQL, this has been a source of a few bugs that
  would have been noticed during testing if the behaviour of INTO wasn't as
  dangerous as it is right now.

 The question is, how many bugs stemmed from wrong SQL queries, and what
 percentage of those would have been caught by this? The way I see it, there
 are thousands of ways to screw up a query, and having it return multiple
 rows instead of one is just one of them.

  Yes, it breaks backwards compatibility, but that's why there's a nice
 GUC.

 Which doesn't help, because the GUC isn't tied to the code. This *adds*
 an error case, not remove one - now, instead of getting your code correct,
 you *also* have to get the GUC correct. If you even *know* that such a GUC
 exists.

  If we're not going to scrap PL/PgSQL and
  start over again, we are going to have to do changes like this to make
 the
  language better.  Also I think that out of all the things we could do to
  break backwards compatibility, this is closer to harmless than a pain
  in the butt.

 I very strongly believe that languages don't get better by adding a
 thousand
 little knobs which subtly change semantics. Look at the mess that is PHP -
 we absolutely, certainly don't want to go there. The most important rule in
 language design is in my opinion stick with your choices. C, C++ and JAVA
 all seem to follow this, and it's one of the reasons these languages are
 popular for big projects, I think.

 The way I see it, the only OK way to change existing behaviour is to have
 the concept of a language version, and force code to indicate the
 language
 version it expects. The important thing is that the language version is an
 attribute of code, not some global setting that you can change without ever
 looking at the code it'd affect.

 So if we really want to change this, I think we need to have a
 LANGUAGE_VERSION attribute on functions. Each time a major postgres release
 changes the behaviour of one of the procedural languages, we'd increment
 that language's version, and enable the old behaviour for all functions
 tagged with an older one.


I dislike this proposal.

Too enterprise-y, too complex, too bad - after ten years we could have ten
language versions and it would help nothing.

Returning to the topic:

a) there is agreement that we would like this functionality as a plpgsql option

b) there is no agreement that we would like this functionality as the default
(now or in the near future)

@b is a topic that we should not try to solve here; it comes back again and
again. Sometimes we have success, sometimes not. A temporary GUC is not
enough of a solution for some people.

So my conclusion: @a can be implemented, @b not.

And we can continue in another thread about a SELECT INTO redesign and about
the possibilities we do or do not have. It is a matter of personal
perspective - someone can think of the missing check after INTO as a feature,
someone else can see it as a bug. My opinion: it is a bug, with a possibly
ambiguous result and possible negative performance impacts. We could probably
make the documentation about wrong and good usage of the SELECT INTO clause
more precise.

I am working on a plpgsql_debug extension, and I think I am able (with a
small change in the plpgsql executor) to implement this check as an
extension. That way it can be available for advanced users (who know about
additional plpgsql extensions).

Regards

Pavel





 best regards,
 Florian Pflug

Re: [HACKERS] inherit support for foreign tables

2014-01-14 Thread Shigeru Hanada
2013/11/18 Tom Lane t...@sss.pgh.pa.us:
 Robert Haas robertmh...@gmail.com writes:
 I think it's been previously proposed that we have some version of a
 CHECK constraint that effectively acts as an assertion for query
 optimization purposes, but isn't actually enforced by the system.  I
 can see that being useful in a variety of situations, including this
 one.

 Yeah, I think it would be much smarter to provide a different syntax
 to explicitly represent the notion that we're only assuming the condition
 is true, and not trying to enforce it.

I'd like to revisit this feature.

Explicit notation to represent not-enforcing (assertive?) is an
interesting idea.  I'm not sure which word is appropriate, but for
example, let's use the word ASSERTIVE as a new constraint attribute.

CREATE TABLE parent (
id int NOT NULL,
"group" int NOT NULL,
name text
);

CREATE FOREIGN TABLE child_grp1 (
  /* no additional column */
) INHERITS (parent) SERVER server1;
ALTER TABLE child_grp1 ADD CONSTRAINT chk_group1 CHECK ("group" = 1) ASSERTIVE;

If ASSERTIVE is specified, it's not guaranteed that the constraint is
enforced completely, so it should be treated as a hint for the planner.
As Robert mentioned, enforcing as much as we can during INSERT/UPDATE
is one option for this issue.
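
As a hedged illustration of the planner-hint aspect (using the hypothetical
tables above, and assuming ASSERTIVE leaves the constraint visible to the
planner the way ordinary CHECK constraints are today): constraint exclusion
already treats CHECK constraints as facts it may plan with.

SET constraint_exclusion = on;
SELECT * FROM parent WHERE "group" = 2;
-- with an ordinary CHECK ("group" = 1) on child_grp1 the planner can skip
-- that child entirely; ASSERTIVE would keep this planning benefit while
-- dropping the enforcement on writes.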

In addition, an idea which I can't throw away is to treat all
constraints defined on foreign tables as ASSERTIVE.  Foreign tables
inherently risk containing wrong data, because the source data can be
updated without going through the foreign table.  This is not specific
to any particular FDW, so IMO constraints defined on foreign tables are
basically ASSERTIVE.  Of course PG can try to keep the data correct, but
somebody might always break it.

Besides CHECK constraints, NOT NULL constraints on foreign tables are
currently virtually ASSERTIVE (not enforced).  Should that also be noted
explicitly?

Thoughts?

--
Shigeru HANADA




[HACKERS] Trigger information for auto_explain.

2014-01-14 Thread Kyotaro HORIGUCHI
Hello,

EXPLAIN can now show trigger statistics (since when, I wonder?).

=# create table t (a int, b int);
=# create or replace function hoge() returns trigger as 'begin new.b = new.a; 
return new; end;' language plpgsql;
=# create trigger ins_hoge before insert or update on t for each row execute 
procedure hoge();

Explain analyze shows trigger information.

=# explain analyze insert into t (select a from generate_series(0, 100) a);
|QUERY PLAN  
|
| Insert on t  (cost=0.00..10.00 rows=1000 width=4) (actual time=2.712.
|   ->  Function Scan on generate_series a  (cost=0.00..10.00 rows=1000 width=4)
|   (actual time=0.047..0.147 rows=101 loops=1)
| Trigger ins_hoge: time=1.881 calls=101
| Total runtime: 3.009 ms

On the other hand, auto_explain doesn't.

=# load auto_explain;
=# set auto_explain.log_min_duration = 0;
=# set auto_explain.log_analyze = 'yes';
=# insert into t (select a from generate_series(0, 100) a);
|LOG:  duration: 2.871 ms  plan:
|Query Text: insert into t (select a from generate_series(0, 100) a);
|Insert on t  (cost=0.00..10.00 rows=1000 width=4)
|  ->  Function Scan on generate_series a  (cost=0.00..10.00 ...

With this patch, auto_explain will show trigger info, like this:

=# set auto_explain.log_triggers = 'yes';
=# insert into t (select a from generate_series(0, 100) a);
| LOG:  duration: 2.098 ms  plan:
| Query Text: insert into t (select a from generate_series(0, 100) a);
| Insert on t  (cost=0.00..10.00 rows=1000 width=4) (actual 
time=2.097..2.097 rows=0 loops=1)
|   ->  Function Scan on generate_series a  (cost=0.00..10.00 rows=1000 
width=4) (actual time=0.044..0.123 rows=101 loops=1)
| Trigger ins_hoge: time=1.452 calls=101

This patch consists of two parts,

 0001_expose_explain_triggers_v1_20140114.patch

  Expose the code to print out trigger information currently
  hidden in ExplainOnePlan().

 0002_auto_explain_triggers_v1_20140114.patch

  Enable auto_explain to output trigger information.

Documentation will be added later..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9969a25..45beb5b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -484,29 +484,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 
 	/* Print info about runtime of triggers */
 	if (es->analyze)
-	{
-		ResultRelInfo *rInfo;
-		bool		show_relname;
-		int			numrels = queryDesc->estate->es_num_result_relations;
-		List	   *targrels = queryDesc->estate->es_trig_target_relations;
-		int			nr;
-		ListCell   *l;
-
-		ExplainOpenGroup("Triggers", "Triggers", false, es);
-
-		show_relname = (numrels > 1 || targrels != NIL);
-		rInfo = queryDesc->estate->es_result_relations;
-		for (nr = 0; nr < numrels; rInfo++, nr++)
-			report_triggers(rInfo, show_relname, es);
-
-		foreach(l, targrels)
-		{
-			rInfo = (ResultRelInfo *) lfirst(l);
-			report_triggers(rInfo, show_relname, es);
-		}
-
-		ExplainCloseGroup("Triggers", "Triggers", false, es);
-	}
+		ExplainPrintTriggers(es, queryDesc);
 
 	/*
 	 * Close down the query and free resources.  Include time for this in the
@@ -563,6 +541,42 @@ ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc)
 }
 
 /*
+ * ExplainPrintTriggers -
+
+ *	  convert a QueryDesc's trigger statistics to text and append it to
+ *	  es-str
+ *	  es->str
+ *
+ * The caller should have set up the options fields of *es, as well as
+ * initializing the output buffer es->str.  Other fields in *es are
+ */
+void
+ExplainPrintTriggers(ExplainState *es, QueryDesc *queryDesc)
+{
+	ResultRelInfo *rInfo;
+	bool		show_relname;
+	int			numrels = queryDesc->estate->es_num_result_relations;
+	List	   *targrels = queryDesc->estate->es_trig_target_relations;
+	int			nr;
+	ListCell   *l;
+
+	ExplainOpenGroup("Triggers", "Triggers", false, es);
+
+	show_relname = (numrels > 1 || targrels != NIL);
+	rInfo = queryDesc->estate->es_result_relations;
+	for (nr = 0; nr < numrels; rInfo++, nr++)
+		report_triggers(rInfo, show_relname, es);
+
+	foreach(l, targrels)
+	{
+		rInfo = (ResultRelInfo *) lfirst(l);
+		report_triggers(rInfo, show_relname, es);
+	}
+
+	ExplainCloseGroup("Triggers", "Triggers", false, es);
+}
+
+/*
  * ExplainQueryText -
  *	  add a Query Text node that contains the actual text of the query
  *
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index ca213d7..4c4226d 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -71,6 +71,7 @@ extern void ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into,
 			   const char *queryString, ParamListInfo params);
 
 extern void ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc);
+extern void ExplainPrintTriggers(ExplainState *es, QueryDesc *queryDesc);
 
 extern void 

Re: [HACKERS] Case sensitive mode in windows build option

2014-01-14 Thread Dilip kumar
On 01/14/2014 11:25 AM Craig Ringer Wrote,

  As per current behavior if user want to build in debug mode in
  windows, then he need to give debug in capital letters (DEBUG),
 
  I think many user will always make mistake in giving this option, in
  my opinion we can make it case insensitive.
 
 The idea seems reasonable, the implementation does not. You've changed
 the meaning rather more than making it case insensitive.
 
 Use the Perl 'lc' function to compare a lower-cased input instead.
 
 http://perldoc.perl.org/functions/lc.html

I think I have done the same thing: converted the user input to upper case and
compared it with 'DEBUG', so this always gives a case-insensitive comparison.
Now we can input debug in any case (Debug, DEBUG, debug, ...) and it will work
fine.

And to achieve this I used the Perl 'uc' function on the input before comparing.

! if (uc($ARGV[0]) eq 'DEBUG')
  {
      $bconf = "Debug";
  }

Will it make any difference if I use 'lc' instead of 'uc' ?

Please correct me if I have missed something here..


Regards,
Dilip


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Case sensitive mode in windows build option

2014-01-14 Thread Craig Ringer
On 01/14/2014 05:35 PM, Dilip kumar wrote:
 On 01/14/2014 11:25 AM Craig Ringer Wrote,
 
 As per current behavior if user want to build in debug mode in
 windows, then he need to give debug in capital letters (DEBUG),

 I think many user will always make mistake in giving this option, in
 my opinion we can make it case insensitive.

 The idea seems reasonable, the implementation does not. You've changed
 the meaning rather more than making it case insensitive.

 Use the Perl 'lc' function to compare a lower-cased input instead.

 http://perldoc.perl.org/functions/lc.html
 
 I think I have done the same thing, converted user input to upper case and 
 compared with DEBUG, so this will always give the case insensitive comparison.
 Now we can input debug in any case (Debug, DEBUG, debug..) and it will work 
 fine..

You're completely right. My apologies. I'm not used to reading that
awful (IMO) context-diff format - despite it being the official standard
for PostgreSQL, I still misread it on a regular basis.

I saw:

! if (uc($ARGV[0]) eq 'DEBUG')
...
! elsif (uc($ARGV[0]) ne 'RELEASE')

and thought "WTF?".

In fact, the WTF is all me.

Please disregard.

This seems quite sensible. Please add it to the commitfest app if it
isn't there already:

http://commitfest.postgresql.org/

and I'll sign up as a reviewer so I can do some build-testing on it
after the rather pressing deadline I've got in the next couple of days
has passed.

If you don't hear from me by Friday, poke me.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] nested hstore patch

2014-01-14 Thread Oleg Bartunov
Erik,

thanks for docs fixes, we have even more :)

Oleg

On Tue, Jan 14, 2014 at 4:18 AM, Erik Rijkers e...@xs4all.nl wrote:
 On Mon, January 13, 2014 18:30, Andrew Dunstan wrote:


 On 01/13/2014 11:16 AM, Oleg Bartunov wrote:
 Andrew,

 did you run perl script ? Actually, I found, that operator table needs
 to be fixed.


 No. My build machine doesn't actually have DBD::Pg installed. Can you
 send me a patch if you don't want to push it yourself, or maybe Erik can
 send a patch to adjust the table.


 [ nested_hstore_and_jsonb-2.patch ]

 ( centos 6.5, gcc 4.8.2. )

 The patch applies & compiles with warnings (see below).

 The opr_sanity test fails during make check: regression.diffs attached.

 Also attached are changes to hstore.sgml, to operator + functions table, plus 
 some typos.

 Thanks,
 Erik Rijkers


 make

 jsonfuncs.c: In function ‘each_object_field_end_jsonb’:
 jsonfuncs.c:1328:7: warning: assignment from incompatible pointer type 
 [enabled by default]
val = DatumGetPointer(DirectFunctionCall1(jsonb_in, 
 CStringGetDatum(cstr)));
^
 jsonfuncs.c: In function ‘elements_array_element_end_jsonb’:
 jsonfuncs.c:1530:8: warning: assignment from incompatible pointer type 
 [enabled by default]
   jbval = DatumGetPointer(DirectFunctionCall1(jsonb_in, 
 CStringGetDatum(cstr)));
 ^


 make contrib:

 hstore_io.c: In function ‘array_to_hstore’:
 hstore_io.c:1694:30: warning: ‘result’ may be used uninitialized in this 
 function [-Wmaybe-uninitialized]
   PG_RETURN_POINTER(hstoreDump(result));







-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-14 Thread David Rowley
On Tue, Jan 14, 2014 at 9:09 PM, David Rowley dgrowle...@gmail.com wrote:

 I think unless anyone has some objections I'm going to remove the inverse
 transition for SUM(numeric) and modify the documents to tell the user how
 to build their own FAST_SUM(numeric) using the built in functions to do it.
 I'm starting to think that playing around with resetting numeric scale will
 turn a possible 9.4 patch into a 9.5/10.0 patch. I see no reason why what's
 there so far, minus sum(numeric), can't go in...

 If so then there are 36 aggregate functions ticked off my "create an inverse
 transition function for" list! I personally think that's a pretty good win.
 I'd rather do this than battle and miss deadlines for 9.4. I'd find that
 pretty annoying.


Here's a patch which removes sum(numeric) and changes the documents a
little to remove a reference to using sum(numeric) to work around the fact
that there are no inverse transitions for sum(float). I also made a small
change in the aggregates.sql tests to check that the aggfnoid = .

I've also pulled the regression tests that I had added for sum(numeric) as
they no longer test anything new. All tests are now passing.

With the attached patch I feel like I've left users a bit high and dry for
their sum(numeric) needs. I guess there is no true workaround as even if
they created their functions in SQL using simple + and - arithmetic, they
would likely suffer from NaN recovery problems. I'm starting to come around
to Tom's FAST_SUM idea, as I simply can't see any foolproof workaround that
could be created without writing things in C.

The only problem I see with the FAST_SUM idea is that the number of
trailing zeros may appear a little random, depending on whether inverse
transitions were used or not... Keep in mind that inverse transitions are not
performed if any aggregate in the window does not support them, OR if any
aggregate in the frame contains a volatile function in the aggregate's
parameters or in the FILTER (WHERE) clause. Does this matter, or can we just
document a warning about that?
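
For context, the sort of query where the inverse transitions pay off is a
moving aggregate over a sliding frame; a rough sketch (the table and column
names are made up):

  SELECT d,
         sum(amount) OVER (ORDER BY d
                           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_sum
  FROM daily_sales;

Without an inverse transition function the sum is recomputed from scratch for
every output row; with one, each step only adds the incoming row and subtracts
the outgoing one, which is where the trailing-zero difference could show up.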

If there's a few more +1s for FAST_SUM(numeric) then let me know and I'll
add it.
If anyone feels strongly against adding FAST_SUM then please let the
reasons for that known too.
Or failing that, if anyone has any other ideas that have not yet been
written on this thread, please post them so we can discuss.


Regards

David Rowley


 Regards

 David Rowley




inverse_transition_functions_v2.1.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Inheritance and indexes

2014-01-14 Thread knizhnik

From PostgreSQL manual:
A serious limitation of the inheritance feature is that indexes 
(including unique constraints) and foreign key constraints only apply to 
single tables, not to their inheritance children.


But is it possible to use index for derived table at all?
Why sequential search is used for derived table in the example below:

create table base_table (x integer primary key);
create table derived_table (y integer) inherits (base_table);
insert into base_table values (1);
insert into derived_table values (2,2);
create index derived_index on derived_table(x);
explain select * from base_table where x=0;
  QUERY PLAN
--
 Append  (cost=0.14..4.56 rows=81 width=4)
   ->  Index Only Scan using base_table_pkey on base_table  (cost=0.14..3.55 rows=80 width=4)
         Index Cond: (x = 0)
   ->  Seq Scan on derived_table  (cost=0.00..1.01 rows=1 width=4)
         Filter: (x = 0)
(5 rows)



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-14 Thread Peter Geoghegan
On Mon, Jan 13, 2014 at 6:45 PM, Peter Geoghegan p...@heroku.com wrote:
 + uint32
+ SpeculativeInsertionIsInProgress(TransactionId xid, RelFileNode rel, ItemPointer tid)
 + {

For the purposes of preventing unprincipled deadlocking, commenting
out the following (the only caller of the above) has no immediately
discernible effect with any of the test-cases that I've published:

/* XXX shouldn't we fall through to look at xmax? */
+   /* XXX why? or is that now covered by the above check? */
+   snapshot->speculativeToken =
+       SpeculativeInsertionIsInProgress(HeapTupleHeaderGetRawXmin(tuple),
+                                        rnode,
+                                        htup->t_self);
+
+   snapshot->xmin = HeapTupleHeaderGetRawXmin(tuple);
return true;/* in insertion by other */

I think that the prevention of unprincipled deadlocking is all down to
this immediately prior piece of code, at least in those test cases:

!   /*
!* in insertion by other.
!*
!* Before returning true, check for the special case 
that the
!* tuple was deleted by the same transaction that 
inserted it.
!* Such a tuple will never be visible to anyone else, 
whether
!* the transaction commits or aborts.
!*/
!   if (!(tuple->t_infomask & HEAP_XMAX_INVALID) &&
!   !(tuple->t_infomask & HEAP_XMAX_COMMITTED) &&
!   !(tuple->t_infomask & HEAP_XMAX_IS_MULTI) &&
!   !HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
!   HeapTupleHeaderGetRawXmax(tuple) == HeapTupleHeaderGetRawXmin(tuple))
!   {
!   return false;
!   }

But why should it be acceptable to change the semantics of dirty
snapshots like this, which previously always returned true when
control reached here? It is a departure from their traditional
behavior, not limited to clients of this new promise tuple
infrastructure. Now, it becomes entirely a matter of whether we tried
to insert before or after the deleting xact's deletion (of a tuple it
originally inserted) as to whether or not we block. So in general we
don't get to keep our old value locks until xact end when we update
or delete. Even if you don't consider this a bug for existing dirty
snapshot clients (I myself do - we can't rely on deleting a row and
re-inserting the same values now, which could be particularly
undesirable for updates), I have already described how we can take
advantage of deleting tuples while still holding on to their value
locks [1] to Andres. I think it'll be very important for multi-master
conflict resolution. I've already described this useful property of
dirty snapshots numerous times on this thread in relation to different
aspects, as it happens. It's essential.

Anyway, I guess you're going to need an infomask bit to fix this, so
you can differentiate between 'promise' tuples and 'proper' tuples.
Those are in short supply. I still think this problem is more or less
down to a modularity violation, and I suspect that this is not the
last problem that will be found along these lines if we continue to
pursue this approach.

[1] 
http://www.postgresql.org/message-id/CAM3SWZQpLSGPS2Kd=-n6hvyiqkf_mcxmx-q72ar9upzq-x6...@mail.gmail.com
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Mel Gorman
On Mon, Jan 13, 2014 at 03:24:38PM -0800, Josh Berkus wrote:
 On 01/13/2014 02:26 PM, Mel Gorman wrote:
  Really?
  
  zone_reclaim_mode is often a complete disaster unless the workload is
  partitioned to fit within NUMA nodes. On older kernels enabling it would
  sometimes cause massive stalls. I'm actually very surprised to hear it
  fixes anything and would be interested in hearing more about what sort
  of circumstances would convince you to enable that thing.
 
 So the problem with the default setting is that it pretty much isolates
 all FS cache for PostgreSQL to whichever socket the postmaster is
 running on, and makes the other FS cache unavailable. 

I'm not being pedantic but the default depends on the NUMA characteristics of
the machine so I need to know if it was enabled or disabled. Some machines
will default zone_reclaim_mode to 0 and others will default it to 1. In my
experience the majority of bugs that involved zone_reclaim_mode were due
to zone_reclaim_mode enabled by default.  If I see a bug that involves
a file-based workload on a NUMA machine with stalls and/or excessive IO
when there is plenty of memory free then zone_reclaim_mode is the first
thing I check.
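
For what it's worth, a minimal standalone sketch of that first check (it only
reads the Linux /proc file; changing the value persistently is a
sysctl/sysctl.conf matter):

#include <stdio.h>

int
main(void)
{
	FILE	   *f = fopen("/proc/sys/vm/zone_reclaim_mode", "r");
	int			mode = -1;

	if (f != NULL)
	{
		if (fscanf(f, "%d", &mode) != 1)
			mode = -1;
		fclose(f);
	}
	if (mode < 0)
	{
		fprintf(stderr, "could not read zone_reclaim_mode\n");
		return 1;
	}
	/* 0 = use remote nodes freely, non-zero = reclaim locally before going off-node */
	printf("vm.zone_reclaim_mode = %d\n", mode);
	return 0;
}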

I'm guessing from context that in your experience it gets enabled by default
on the machines you care about. This would indeed limit FS cache usage to
the node where the process is initiating IO (postmaster I guess).

 This means that,
 for example, if you have two memory banks, then only one of them is
 available for PostgreSQL filesystem caching ... essentially cutting your
 available cache in half.
 
 And however slow moving cached pages between memory banks is, it's an
 order of magnitude faster than moving them from disk.  But this isn't
 how the NUMA stuff is configured; it seems to assume that it's less
 expensive to get pages from disk than to move them between banks, so

Yes, this is right. The history behind this logic is that it was assumed
NUMA machines would only ever be used for HPC and that the workloads would
always be partitioned to run within NUMA nodes. This has not been the case
for a long time and I would argue that we should leave that thing disabled
by default in all cases. Last time I tried it was met with resistance but
maybe it's time to try again.

 whatever you've got cached on the other bank, it flushes it to disk as
 fast as possible.  I understand the goal was to make memory usage local
 to the processors stuff was running on, but that includes an implicit
 assumption that no individual process will ever want more than one
 memory bank worth of cache.
 
 So disabling all of the NUMA optimizations is the way to go for any
 workload I personally deal with.
 

I would hesitate to recommend "all" on the grounds that zone_reclaim_mode
is brain damage and I'd hate to lump all tuning parameters into the same box.

There is an interesting side-line here. If all IO is initiated by one
process in postgres then the memory locality will be sub-optimal.
The consumer of the data may or may not be running on the same
node as the process that read the data from disk. It is possible to
migrate this from user space but the interface is clumsy and assumes the
data is mapped.

Automatic NUMA balancing does not help you here because that thing also
depends on the data being mapped. It does nothing for data accessed via
read/write. There is nothing fundamental that prevents this, it was not
implemented because it was not deemed to be important enough. The amount
of effort spent on addressing this would depend on how important NUMA
locality is for postgres performance.

-- 
Mel Gorman
SUSE Labs


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-14 Thread Heikki Linnakangas

On 01/14/2014 12:20 PM, Peter Geoghegan wrote:

I think that the prevention of unprincipled deadlocking is all down to
this immediately prior piece of code, at least in those test cases:



!   /*
!* in insertion by other.
!*
!* Before returning true, check for the special case 
that the
!* tuple was deleted by the same transaction that 
inserted it.
!* Such a tuple will never be visible to anyone else, 
whether
!* the transaction commits or aborts.
!*/
!   if (!(tuple->t_infomask & HEAP_XMAX_INVALID) &&
!   !(tuple->t_infomask & HEAP_XMAX_COMMITTED) &&
!   !(tuple->t_infomask & HEAP_XMAX_IS_MULTI) &&
!   !HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
!   HeapTupleHeaderGetRawXmax(tuple) == HeapTupleHeaderGetRawXmin(tuple))
!   {
!   return false;
!   }

But why should it be acceptable to change the semantics of dirty
snapshots like this, which previously always returned true when
control reached here? It is a departure from their traditional
behavior, not limited to clients of this new promise tuple
infrastructure. Now, it becomes entirely a matter of whether we tried
to insert before or after the deleting xact's deletion (of a tuple it
originally inserted) as to whether or not we block. So in general we
don't get to keep our old value locks until xact end when we update
or delete.


Hmm. So the scenario would be that a process inserts a tuple, but kills 
it again later in the transaction, and then re-inserts the same value. 
The expectation is that because it inserted the value once already, 
inserting it again will not block. Ie. inserting and deleting a tuple 
effectively acquires a value-lock on the inserted values.



Even if you don't consider this a bug for existing dirty
snapshot clients (I myself do - we can't rely on deleting a row and
re-inserting the same values now, which could be particularly
undesirable for updates),


Yeah, it would be bad if updates start failing because of this. We could 
add a check for that, and return true if the tuple was updated rather 
than deleted.



I have already described how we can take
advantage of deleting tuples while still holding on to their value
locks [1] to Andres. I think it'll be very important for multi-master
conflict resolution. I've already described this useful property of
dirty snapshots numerous times on this thread in relation to different
aspects, as it happens. It's essential.


I didn't understand that description.


Anyway, I guess you're going to need an infomask bit to fix this, so
you can differentiate between 'promise' tuples and 'proper' tuples.


Yeah, that's one way. Or you could set xmin to invalid, to make the 
killed tuple look thoroughly dead to everyone.



Those are in short supply. I still think this problem is more or less
down to a modularity violation, and I suspect that this is not the
last problem that will be found along these lines if we continue to
pursue this approach.


You have suspected that many times throughout this thread, and every 
time there's been a relatively simple solution to the issues you've 
raised. I suspect that's also going to be true for whatever mundane next 
issue you come up with.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] plpgsql.consistent_into

2014-01-14 Thread Marko Tiikkaja

On 1/14/14 10:16 AM, Pavel Stehule wrote:

2014/1/14 Florian Pflug f...@phlo.org


So if we really want to change this, I think we need to have a
LANGUAGE_VERSION attribute on functions. Each time a major postgres release
changes the behaviour of one of the procedural languages, we'd increment
that language's version, and enable the old behaviour for all functions
tagged with an older one.



I dislike this proposal

too enterprise, too complex, too bad - after ten years we could have ten
language versions and it would help nothing.


At this point I'm almost tempted to agree with Florian -- I'm really 
hoping we could change PL/PgSQL for the better over the next 10 years, 
but given the backwards compatibility requirements we have, this seems 
like an absolute nightmare.  Not saying we need a version number we can 
keep bumping every release, but something similar.



return back to topic

a) there is agreement so we like this functionality as plpgsql option

b) there is no agreement so we would to see this functionality as default
(or in near future)

@b is a topic, that we should not to solve and it is back again and again.
Sometimes we have success, sometimes not. Temporal GUC is not enough
solution for some people.

So my result - @a can be implement, @b not


FWIW, I would like to have this behaviour even if it's not the default 
(that was my original proposal anyway).  It could help catch a lot of 
bugs in testing, and for important code it could even be turned on in 
production (perhaps on a per-function basis).  Maybe even globally, idk.
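
For reference, a small sketch of the behaviour in question (assuming a
throwaway table t with a few rows); the STRICT form already raises the error
that a check like this would make the default:

  CREATE TABLE t AS SELECT generate_series(1, 3) AS x;

  DO $$
  DECLARE
      v int;
  BEGIN
      SELECT x INTO v FROM t;          -- silently picks one row, FOUND = true
      RAISE NOTICE 'plain INTO got %', v;
      SELECT x INTO STRICT v FROM t;   -- raises TOO_MANY_ROWS
  END
  $$;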



I am working on a plpgsql_debug extension and I think I am able
(with a small change in the plpgsql executor) to implement this check as an
extension. So it can be available for advanced users (who know about
additional plpgsql extensions).


While this sounds cool, running a modified postgres locally seems a lot 
easier.  I already have to do that for PL/PgSQL development because of 
the reasons I outlined in the thread "PL/PgSQL, RAISE and error context".



Regards,
Marko Tiikkaja


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-14 Thread Heikki Linnakangas

On 01/14/2014 12:44 AM, Peter Geoghegan wrote:

On Mon, Jan 13, 2014 at 12:58 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

Well, even if you don't agree that locking all the conflicting rows for
update is sensible, it's still perfectly sensible to return the rejected
rows to the user. For example, you're inserting N rows, and if some of them
violate a constraint, you still want to insert the non-conflicting rows
instead of rolling back the whole transaction.


Right, but with your approach, can you really be sure that you have
the right rejecting tuple ctid (not reject)? In other words, as you
wait for the exclusion constraint to conclusively indicate that there
is a conflict, minutes may have passed in which time other conflicts
may emerge in earlier unique indexes. Whereas with an approach where
values are locked, you are guaranteed that earlier unique indexes have
no conflicting values. Maintaining that property seems useful, since
we check in a well-defined order, and we're still projecting a ctid.
Unlike when row locking is involved, we can make no assumptions or
generalizations around where conflicts will occur. Although that may
also be a general concern with your approach when row locking, for
multi-master replication use-cases. There may be some value in knowing
it cannot have been earlier unique indexes (and so the existing values
for those unique indexes in the locked row should stay the same -
don't many conflict resolution policies work that way?).


I don't understand what you're saying. Can you give an example?

In the use case I was envisioning above, ie. you insert N rows, and if 
any of them violate constraint, you still want to insert the 
non-violating instead of rolling back the whole transaction, you don't 
care. You don't care what existing rows the new rows conflicted with.


Even if you want to know what you conflicted with, I can't make sense of 
what you're saying. In the btreelock approach, the value locks are 
immediately released once you discover that there's conflict. So by the 
time you get to do anything with the ctid of the existing tuple you 
conflicted with, new conflicting tuples might've appeared.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PostgreSQL Service on Windows does not start if data directory given is relative path

2014-01-14 Thread Rajeev rastogi
On Tue, Jan 12, 2014 David Rowley wrote:
I have found a case that PostgreSQL as win32 service does not start, if the 
data directory given as relative path.
Error observed in this case is:
"The PostgreSQL on Local Computer started and then stopped."
This may happen because the relative path given will be relative to the path
from where the service is registered, but the path from where the WIN32
service execution gets invoked may be different, and hence it won't be able
to find the data directory.
I have fixed this by internally converting the relative path to an absolute
path, as is being done for the executable file.
 Hi,
I've not quite got around to testing this yet, but I think the patch is a good 
idea as I can see that I can use a relative path when I start postgres.exe 
from the command line, I guess the behaviour should likely be the same when 
installed as a windows
 service.

Thanks for reviewing and providing the first level of feedback.

So, I've just been looking over this patch and I'm just wondering about a few 
things:

In pgwin32_CommandLine you declare dbPath; it looks like you could declare 
this in the scope of if (pg_config)

Yes I have declared the same in the scope of if (pg_config) 

In find_my_abs_path you're making a call to StrNCpy, I know you likely just 
used find_my_exec as an example here, but I'd say the use of StrNCpy is not 
really needed here and is a bit of a waste of cycles. I'd rather see strlcpy 
being used as strncpy
 will needlessly zero the remaining buffer space.
Yes you are right, it is much better to use strlcpy instead of 
StrNCpy. I have modified the patch.

Also in find_my_abs_path, I'm just wondering if the cwd variable is really 
needed, I think you could just use retpath each time and it would also save a 
little bit of copying that's done in join_path_components(). By the looks of 
it you can just call
 join_path_components(retpath, retpath, inpath). Perhaps some people would 
 disagree, but I'm not really seeing the harm in it and it saves some copying 
 too.
Yes, I am also convinced by your suggestion. It really avoids one 
level of copying.
I have seen that a similar mechanism is used in many places where 
join_path_components() is called. So I think it should be OK to use it in our case 
also.
I have made the necessary changes for the same.

I have attached the changed patch. Please provide your further feedback, if any.

Thanks and Regards,

Kumar Rajeev Rastogi





pgctl_win32service_rel_dbpath_v2.patch
Description: pgctl_win32service_rel_dbpath_v2.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] plpgsql.consistent_into

2014-01-14 Thread Marti Raudsepp
I've always hated INTO in procedures since it makes the code harder to
follow and has very different behavior on the SQL level, in addition
to the multi-row problem you bring up. If we can make assignment
syntax more versatile and eventually replace INTO, then that solves
multiple problems in the language without breaking backwards
compatibility.

On Tue, Jan 14, 2014 at 4:30 AM, Marko Tiikkaja ma...@joh.to wrote:
 On 2014-01-14 02:54, Marti Raudsepp wrote:
 But PL/pgSQL already has an assignment syntax with the behavior you want:

 According to the docs, that doesn't set FOUND which would make this a pain
 to deal with..

Right you are. If we can extend the syntax then we could make it such
that = SELECT sets FOUND and other diagnostics, and a simple
assignment doesn't. Which makes sense IMO:

a = 10; -- simple assignments really shouldn't affect FOUND

With explicit SELECT, clearly the intent is to perform a query:
  a = SELECT foo FROM table;
And this could also work:
  a = INSERT INTO table (foo) VALUES (10) RETURNING foo_id;

AFAICT the fact that this works is more of an accident and should be
discouraged. We can leave it as is for compatibility's sake:
  a = foo FROM table;

Now, another question is whether it's possible to make the syntax
work. Is this an assignment from the result of a subquery, or is it a
query by itself?
  a = (SELECT foo FROM table);

Does anyone consider this proposal workable?

Regards,
Marti


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inheritance and indexes

2014-01-14 Thread Marti Raudsepp
On Tue, Jan 14, 2014 at 12:07 PM, knizhnik knizh...@garret.ru wrote:
 But is it possible to use index for derived table at all?

Yes, the planner will do an index scan when it makes sense.

 Why sequential search is used for derived table in the example below:

 insert into derived_table values (2,2);
 create index derived_index on derived_table(x);
 explain select * from base_table where x=0;

With only 1 row in the table, the planner decides there's no point in
scanning the index. Try with more realistic data.
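
For example (the row count is made up), continuing the session from the
original mail:

  INSERT INTO derived_table SELECT g, g FROM generate_series(1, 100000) AS g;
  ANALYZE derived_table;
  EXPLAIN SELECT * FROM base_table WHERE x = 0;
  --  Append
  --    ->  Index Only Scan using base_table_pkey on base_table
  --    ->  Index Scan using derived_index on derived_table
  --          Index Cond: (x = 0)

The exact plan shape and costs will vary, but with realistic row counts and
fresh statistics the child table is no longer scanned sequentially.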

Regards,
Marti


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] plpgsql.consistent_into

2014-01-14 Thread Marko Tiikkaja

On 1/14/14 12:28 PM, Marti Raudsepp wrote:

I've always hated INTO in procedures since it makes the code harder to
follow and has very different behavior on the SQL level, in addition
to the multi-row problem you bring up. If we can make assignment
syntax more versatile and eventually replace INTO, then that solves
multiple problems in the language without breaking backwards
compatibility.


I don't personally have a problem with INTO other than the behaviour 
that started this thread.  But I'm willing to consider other options.



On Tue, Jan 14, 2014 at 4:30 AM, Marko Tiikkaja ma...@joh.to wrote:

On 2014-01-14 02:54, Marti Raudsepp wrote:

But PL/pgSQL already has an assignment syntax with the behavior you want:


According to the docs, that doesn't set FOUND which would make this a pain
to deal with..


Right you are. If we can extend the syntax then we could make it such
that = SELECT sets FOUND and other diagnostics, and a simple
assignment doesn't. Which makes sense IMO:

a = 10; -- simple assignments really shouldn't affect FOUND


With you so far.


With explicit SELECT, clearly the intent is to perform a query:
   a = SELECT foo FROM table;
And this could also work:
   a = INSERT INTO table (foo) VALUES (10) RETURNING foo_id;


I'm not sure that would work with the grammar.  Basically what PL/PgSQL 
does right now is for a statement like:


  a = 1;

It parses the a = part itself, and then just reads until the next 
unquoted semicolon without actually looking at it, and slams a "SELECT " 
in front of it.  With this approach we'd have to look into the query and 
try and guess what it does.  That might be possible, but I don't like 
the idea.



AFAICT the fact that this works is more of an accident and should be
discouraged. We can leave it as is for compatibility's sake:
   a = foo FROM table;


I've always considered that ugly (IIRC it's still undocumented as well), 
and would encourage people not to do that.



Now, another question is whether it's possible to make the syntax
work. Is this an assignment from the result of a subquery, or is it a
query by itself?
   a = (SELECT foo FROM table);


That looks like a scalar subquery, which is wrong because they can't 
return more than one column (nor can they be INSERT etc., obviously).


How about:

  (a) = SELECT 1;
  (a, b) = SELECT 1, 2;
  (a, b) = INSERT INTO foo RETURNING col1, col2;

Same semantics: TOO_MANY_ROWS on rows > 1, sets FOUND and row_count. 
AFAICT this can be parsed unambiguously, too, and we don't need to look 
at the query string because this is new syntax.



Regards,
Marko Tiikkaja


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimize kernel readahead using buffer access strategy

2014-01-14 Thread KONDO Mitsumasa

Hi,

I have fixed this patch and submitted it in CF4.

In my past patch there was a significant bug: the calculation of the offset
passed to posix_fadvise() was wrong :-( However, it still worked without problems
in pgbench, because pgbench transactions are always random access...
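
For illustration, a minimal sketch (not the patch itself) of the per-block
hinting involved; fd and blocknum are assumed to come from the fd/smgr layer
in the real code:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdbool.h>

#define BLCKSZ 8192				/* PostgreSQL default block size */

/*
 * Hint the kernel before reading one block: suppress OS readahead for
 * scattered access, encourage it for bulk reads (e.g. BAS_BULKREAD).
 */
int
hint_readahead(int fd, unsigned int blocknum, bool bulk_read)
{
	off_t		offset = (off_t) blocknum * BLCKSZ;	/* the offset that was miscalculated */
	int			advice = bulk_read ? POSIX_FADV_SEQUENTIAL : POSIX_FADV_RANDOM;

	return posix_fadvise(fd, offset, BLCKSZ, advice);
}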

And I tested my patch with the DBT-2 benchmark. The results are below.

* Test server
  Server: HP Proliant DL360 G7
  CPU:Xeon E5640 2.66GHz (1P/4C)
  Memory: 18GB(PC3-10600R-9)
  Disk:   146GB(15k)*4 RAID1+0
  RAID controller: P410i/256MB
  OS: RHEL 6.4(x86_64)
  FS: Ext4


* DBT-2 result(WH400, SESSION=100, ideal_score=5160)
Method | score | average | 90%tile | Maximum

plain  | 3589  | 9.751   | 33.680  | 87.8036
option=off | 3670  | 9.107   | 34.267  | 79.3773
option=on  | 4222  | 5.140   | 7.619   | 102.473

 "option" is the readahead_strategy setting, and "on" is my proposed method.
 average, 90%tile, and Maximum are latencies.
 Average latency is 2 times better than plain!


* Detail results (uploading now. please wait for a hour...)
[plain]
http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/normal_20140109/HTML/index_thput.html
[option=off]
http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/readahead_off_20140109/HTML/index_thput.html
[option=on]
http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/readahead_on_20140109/HTML/index_thput.html

We can see a portion of very slow latencies in the test of my proposed method.
The share of active transactions is 20%, and the share of slow transactions is 80%.
It might be the Pareto principle in CHECKPOINT ;-)
(That is a joke.)

I will test the performance of some join SQLs and the TPC-3 benchmark this week or next.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
*** a/configure
--- b/configure
***
*** 11303,11309  fi
  LIBS_including_readline=$LIBS
  LIBS=`echo $LIBS | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
  
! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
  do :
as_ac_var=`$as_echo ac_cv_func_$ac_func | $as_tr_sh`
  ac_fn_c_check_func $LINENO $ac_func $as_ac_var
--- 11303,11309 
  LIBS_including_readline=$LIBS
  LIBS=`echo $LIBS | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
  
! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fadvise pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
  do :
as_ac_var=`$as_echo ac_cv_func_$ac_func | $as_tr_sh`
  ac_fn_c_check_func $LINENO $ac_func $as_ac_var
*** a/contrib/pg_prewarm/pg_prewarm.c
--- b/contrib/pg_prewarm/pg_prewarm.c
***
*** 179,185  pg_prewarm(PG_FUNCTION_ARGS)
  		 */
  		for (block = first_block; block = last_block; ++block)
  		{
! 			smgrread(rel->rd_smgr, forkNumber, block, blockbuffer);
  			++blocks_done;
  		}
  	}
--- 179,185 
  		 */
  		for (block = first_block; block = last_block; ++block)
  		{
! 			smgrread(rel->rd_smgr, forkNumber, block, blockbuffer, BAS_BULKREAD);
  			++blocks_done;
  		}
  	}
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 1293,1298  include 'filename'
--- 1293,1322 
    </listitem>
   </varlistentry>
 
+  <varlistentry id="guc-readahead-strategy" xreflabel="readahead_strategy">
+   <term><varname>readahead_strategy</varname> (<type>integer</type>)</term>
+   <indexterm>
+    <primary><varname>readahead_strategy</varname> configuration parameter</primary>
+   </indexterm>
+   <listitem>
+    <para>
+     This feature is to select which readahead strategy is used. When we
+     set off(default), readahead strategy is optimized by OS. On the other
+     hands, when we set on, readahead strategy is optimized by Postgres.
+     In typicaly situations, OS readahead strategy will be good working,
+     however Postgres often knows better readahead strategy before 
+     executing disk access. For example, we can easy to predict access 
+     pattern when we input SQLs, because planner of postgres decides 
+     efficient access pattern to read faster. And it might be random access
+     pattern or sequential access pattern. It will be less disk IO and more
+     efficient to use file cache in OS. It will be better performance.
+     However this optimization is not complete now, so it is necessary to
+     choose it carefully in considering situations. Default setting is off
+     that is optimized by OS, and whenever it can change it.
+    </para>
+   </listitem>
+  </varlistentry> 
+ 
   </variablelist>
  </sect2>
  
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
***
*** 9125,9131  copy_relation_data(SMgrRelation src, SMgrRelation dst,
  		/* If we got a cancel signal during the copy of the data, quit */
  		

Re: [HACKERS] plpgsql.consistent_into

2014-01-14 Thread Pavel Stehule
2014/1/14 Marko Tiikkaja ma...@joh.to

 On 1/14/14 12:28 PM, Marti Raudsepp wrote:

 I've always hated INTO in procedures since it makes the code harder to
 follow and has very different behavior on the SQL level, in addition
 to the multi-row problem you bring up. If we can make assignment
 syntax more versatile and eventually replace INTO, then that solves
 multiple problems in the language without breaking backwards
 compatibility.


 I don't personally have a problem with INTO other than the behaviour that
 started this thread.  But I'm willing to consider other options.


  On Tue, Jan 14, 2014 at 4:30 AM, Marko Tiikkaja ma...@joh.to wrote:

 On 2014-01-14 02:54, Marti Raudsepp wrote:

 But PL/pgSQL already has an assignment syntax with the behavior you
 want:


 According to the docs, that doesn't set FOUND which would make this a
 pain
 to deal with..


 Right you are. If we can extend the syntax then we could make it such
 that = SELECT sets FOUND and other diagnostics, and a simple
 assignment doesn't. Which makes sense IMO:

 a = 10; -- simple assignments really shouldn't affect FOUND


 With you so far.


  With explicit SELECT, clearly the intent is to perform a query:
a = SELECT foo FROM table;
 And this could also work:
a = INSERT INTO table (foo) VALUES (10) RETURNING foo_id;


 I'm not sure that would work with the grammar.  Basically what PL/PgSQL
 does right now is for a statement like:

   a = 1;

 It parses the a = part itself, and then just reads until the next
  unquoted semicolon without actually looking at it, and slams a "SELECT " in
 front of it.  With this approach we'd have to look into the query and try
 and guess what it does.  That might be possible, but I don't like the idea.


  AFAICT the fact that this works is more of an accident and should be
 discouraged. We can leave it as is for compatibility's sake:
a = foo FROM table;


 I've always considered that ugly (IIRC it's still undocumented as well),
 and would encourage people not to do that.


  Now, another question is whether it's possible to make the syntax
 work. Is this an assignment from the result of a subquery, or is it a
 query by itself?
a = (SELECT foo FROM table);


only this form is allowed in SQL/PSM - and it has some logic: you can
assign the result of a subquery (which should return only one row) to a variable.



 That looks like a scalar subquery, which is wrong because they can't
 return more than one column (nor can they be INSERT etc., obviously).

 How about:

   (a) = SELECT 1;
   (a, b) = SELECT 1, 2;
   (a, b) = INSERT INTO foo RETURNING col1, col2;


I prefer the subquery-only syntax - a := (some) or (a,b,c) = (some a,b,c) - with
a possible enhancement for statements with RETURNING.

An advantage is compatibility with the DB2 (SQL/PSM) syntax - and this code is
already written; it is done in my SQL/PSM implementation.

Regards

Pavel



 Same semantics: TOO_MANY_ROWS on rows > 1, sets FOUND and row_count.
 AFAICT this can be parsed unambiguously, too, and we don't need to look at
 the query string because this is new syntax.


 Regards,
 Marko Tiikkaja



 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



[HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format

2014-01-14 Thread Michael Paquier
Hi all,

As of today, replication protocol has a command called BASE_BACKUP to
allow a client connecting with the replication protocol to retrieve a
full backup from server through a connection stream. The description
of its current options are here:
http://www.postgresql.org/docs/9.3/static/protocol-replication.html

This command is in charge of putting the server in backup mode using
do_pg_start_backup, then it sends the backup, and finalizes the backup
with do_pg_stop_backup. Thanks to that it is also possible to take
backups even from standby nodes, as the stream contains the
backup_label file necessary for recovery. A full backup is sent in tar
format for obvious performance reasons, to limit the amount of data
sent through the stream, and the server contains the necessary code to
send the data in the correct format. This also forces the client to perform
some decoding if the output of the received base backup needs to be
analyzed on the fly, doing something similar to what pg_basebackup does
now when the backup format is plain.

I would like to propose the following things to extend BASE_BACKUP to
retrieve a backup from a stream:
- Addition of an option FORMAT, to control the output format of the
backup, with 'plain' and 'tar' as possible values. The default is tar, for
backward compatibility purposes. The purpose of this option is to make it
easier for backup tools working with Postgres to retrieve a backup
and analyze it on the fly: to filter and analyze the
data while it is being received, without all the tar decoding that would
otherwise be necessary, which would consist of more or less copying
portions of pg_basebackup code.
- Addition of an option called INCREMENTAL to send an incremental
backup to the client. This option takes an LSN as input, and sends back
to the client the relation pages (in the shape of reduced relation files) that
are newer than the LSN specified, by looking at pd_lsn in
PageHeaderData. In this case the LSN needs to be determined by the client
based on the latest full backup taken. This option is particularly
interesting to reduce the amount of data taken between two backups,
even if it increases the restore time, as the client needs to reconstitute
a base backup depending on the recovery target and the pages modified.
The client would be in charge of rebuilding pages from the incremental backup
by scanning all the blocks that need to be updated based on the full
backup, as the LSN from which the incremental backup is taken is known. But
this is not really something the server cares about... Such things are
actually done by pg_rman as well (a rough sketch of the page filter follows below).
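
To make the idea concrete, a rough standalone sketch of the page filter (the
segment file name and the threshold LSN are made up; the real server-side code
would go through the smgr/buffer layer rather than raw reads):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

/* pd_lsn is the first 8 bytes of every page header: {xlogid, xrecoff} */
static uint64_t
page_lsn(const unsigned char *page)
{
	uint32_t	hi,
				lo;

	memcpy(&hi, page, 4);
	memcpy(&lo, page + 4, 4);
	return ((uint64_t) hi << 32) | lo;
}

int
main(void)
{
	const uint64_t threshold = UINT64_C(0x1000000);		/* LSN of the last full backup (made up) */
	FILE	   *f = fopen("base/12345/16384", "rb");	/* hypothetical relation segment */
	unsigned char page[BLCKSZ];
	long		blkno = 0;

	if (f == NULL)
		return 1;
	while (fread(page, 1, BLCKSZ, f) == BLCKSZ)
	{
		if (page_lsn(page) > threshold)
			printf("block %ld is newer than the base backup\n", blkno);
		blkno++;
	}
	fclose(f);
	return 0;
}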

As a next step, I would imagine that pg_basebackup could be extended
to take incremental backups as well. Having another tool able to
rebuild backups based on a full backup with incremental information
would be nice as well.

This is of course not material for 9.4, I just would like for now to
measure the temperature about such things and gather opinions...

Regards
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Soften pg_[start|stop]_backup to allow them on a standby?

2014-01-14 Thread Andres Freund
Hi,

On 2014-01-14 12:31:09 +0900, Michael Paquier wrote:
 Currently, pg_start_backup and pg_stop_backup cannot run on a standby
 because it is not possible to write a backup_label file to disk,
 because of the nature of a standby server preventing to write any data
 in its PGDATA. Is this thought right? This is what the comments at the
 top of do_pg_start_backup make me conclude.

No, the actual reason is that a plain pg_stop_backup() writes WAL -
which we can't do on a standby. The walsender command gets around this
by storing the required data in the backup label itself, but that
requires the label to be written after the basebackup actually finished
which doesn't work for plain start/stop backup.

 Another idea would be to send the backup label file directly as the
 output of pg_start_backup such as client application can grab it and
 reuse it. Any thoughts about that as well?

Yea, I think extending the protocols available is the way to go
here. We need to be able to send the backup label after the actual base
backup finished.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format

2014-01-14 Thread Andres Freund
Hi,

On 2014-01-14 21:47:43 +0900, Michael Paquier wrote:
 I would like to propose the following things to extend BASE_BACKUP to
 retrieve a backup from a stream:
 - Addition of an option FORMAT, to control the output format of
 backup, with possible options as 'plain' and 'tar'. Default is tar for
 backward compatibility purposes. The purpose of this option is to make
 easier for backup tools playing with postgres to retrieve and backup
 and analyze it on the fly, the purpose being to filter and analyze the
 data while it is being received without all the tar decoding
 necessary, what would consist in copying portions of pg_basebackup
 code more or less.

We'd need our own serialization format since we're dealing with more
than one file, what would be the point?

 - Addition of an option called INCREMENTAL to send an incremental
 backup to the client. This option uses as input an LSN, and sends back
 to client relation pages (in the shape of reduced relation files) that
 are newer than the LSN specified by looking at pd_lsn of
 PageHeaderData. In this case the LSN needs to be determined by client
 based on the latest full backup taken. This option is particularly
 interesting to reduce the amount of data taken between two backups,
 even if it increases the restore time as client needs to reconstitute
 a base backup depending on the recovery target and the pages modified.
 Client would be in charge of rebuilding pages from incremental backup
 by scanning all the blocks that need to be updated based on the full
 backup as the LSN from which incremental backup is taken is known. But
 this is not really something the server cares about... Such things are
 actually done by pg_rman as well.

Why not just rely on WAL replay since you're relying on the consistency
of the standby anyway?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format

2014-01-14 Thread Heikki Linnakangas

On 01/14/2014 02:47 PM, Michael Paquier wrote:

I would like to propose the following things to extend BASE_BACKUP to
retrieve a backup from a stream:
- Addition of an option FORMAT, to control the output format of
backup, with possible options as 'plain' and 'tar'. Default is tar for
backward compatibility purposes. The purpose of this option is to make
easier for backup tools playing with postgres to retrieve and backup
and analyze it on the fly, the purpose being to filter and analyze the
data while it is being received without all the tar decoding
necessary, what would consist in copying portions of pg_basebackup
code more or less.


Umm, you have to somehow mark in the protocol where one file begins and 
another one ends. The 'tar' format seems perfectly OK for that purpose. 
What exactly would the 'plain' format do?



- Addition of an option called INCREMENTAL to send an incremental
backup to the client. This option uses as input an LSN, and sends back
to client relation pages (in the shape of reduced relation files) that
are newer than the LSN specified by looking at pd_lsn of
PageHeaderData. In this case the LSN needs to be determined by client
based on the latest full backup taken. This option is particularly
interesting to reduce the amount of data taken between two backups,
even if it increases the restore time as client needs to reconstitute
a base backup depending on the recovery target and the pages modified.
Client would be in charge of rebuilding pages from incremental backup
by scanning all the blocks that need to be updated based on the full
backup as the LSN from which incremental backup is taken is known. But
this is not really something the server cares about... Such things are
actually done by pg_rman as well.


How does the server find all the pages with LSN > the threshold? If it 
needs to scan the whole database, it's not all that useful. I guess it 
would be better than nothing, but I think you might as well just use rsync.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format

2014-01-14 Thread Magnus Hagander
On Tue, Jan 14, 2014 at 1:47 PM, Michael Paquier
michael.paqu...@gmail.comwrote:

 Hi all,

 As of today, replication protocol has a command called BASE_BACKUP to
 allow a client connecting with the replication protocol to retrieve a
 full backup from server through a connection stream. The description
 of its current options are here:
 http://www.postgresql.org/docs/9.3/static/protocol-replication.html

 This command is in charge to put the server in start backup by using
 do_pg_start_backup, then it sends the backup, and finalizes the backup
 with do_pg_stop_backup. Thanks to that it is as well possible to get
 backups from even standby nodes as the stream contains the
 backup_label file necessary for recovery. Full backup is sent in tar
 format for obvious performance reasons to limit the amount of data
 sent through the stream, and server contains necessary coding to send
 the data in correct format. This forces the client as well to perform
 some decoding if the output of the base backup received needs to be
 analyzed on the fly but doing something similar to what now
 pg_basebackup does when the backup format is plain.

 I would like to propose the following things to extend BASE_BACKUP to
 retrieve a backup from a stream:
 - Addition of an option FORMAT, to control the output format of
 backup, with possible options as 'plain' and 'tar'. Default is tar for
 backward compatibility purposes. The purpose of this option is to make
 easier for backup tools playing with postgres to retrieve and backup
 and analyze it on the fly, the purpose being to filter and analyze the
 data while it is being received without all the tar decoding
 necessary, what would consist in copying portions of pg_basebackup
 code more or less.


How would this be different/better than the tar format? pg_basebackup
already does this analysis, for example, when it comes to recovery.conf.
The tar format is really easy to analyze as a stream, that's one of the
reasons we picked it...



 - Addition of an option called INCREMENTAL to send an incremental

backup to the client. This option uses as input an LSN, and sends back
 to client relation pages (in the shape of reduced relation files) that
 are newer than the LSN specified by looking at pd_lsn of
 PageHeaderData. In this case the LSN needs to be determined by client
 based on the latest full backup taken. This option is particularly
 interesting to reduce the amount of data taken between two backups,
 even if it increases the restore time as client needs to reconstitute
 a base backup depending on the recovery target and the pages modified.
 Client would be in charge of rebuilding pages from incremental backup
 by scanning all the blocks that need to be updated based on the full
 backup as the LSN from which incremental backup is taken is known. But
 this is not really something the server cares about... Such things are
 actually done by pg_rman as well.


This sounds a lot like DIFFERENTIAL in other databases? Or I guess it's the
same underlying technology, depending only on if you go back to the full
base backup, or to the last incremental one.

But if you look at the terms otherwise, I think incremental often refers to
what we call WAL.

Either way - if we can do this in a safe way, it sounds like a good idea.
It would be sort of like rsync, except relying on the fact that we can look
at the LSN and don't have to compare the actual files, right?


As a next step, I would imagine that pg_basebackup could be extended
 to take incremental backups as well. Having another tool able to
 rebuild backups based on a full backup with incremental information
 would be nice as well.


I would say those are requirements, not just "next step" and "nice as well" :)


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format

2014-01-14 Thread Michael Paquier
On Tue, Jan 14, 2014 at 10:01 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 - Addition of an option called INCREMENTAL to send an incremental
 backup to the client. This option uses as input an LSN, and sends back
 to client relation pages (in the shape of reduced relation files) that
 are newer than the LSN specified by looking at pd_lsn of
 PageHeaderData. In this case the LSN needs to be determined by client
 based on the latest full backup taken. This option is particularly
 interesting to reduce the amount of data taken between two backups,
 even if it increases the restore time as client needs to reconstitute
 a base backup depending on the recovery target and the pages modified.
 Client would be in charge of rebuilding pages from incremental backup
 by scanning all the blocks that need to be updated based on the full
 backup as the LSN from which incremental backup is taken is known. But
 this is not really something the server cares about... Such things are
 actually done by pg_rman as well.


 How does the server find all the pages with LSN > the threshold? If it needs
 to scan the whole database, it's not all that useful. I guess it would be
 better than nothing, but I think you might as well just use rsync.
Yes, it would be necessary to scan the whole database, as the LSN to be
checked is kept in PageHeaderData :). Perhaps it is not that
performant, but my initial thought was that the amount of data
necessary to maintain incremental backups could balance against the
amount of WAL that would otherwise need to be kept, limiting the total amount on disk.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format

2014-01-14 Thread Andres Freund
On 2014-01-14 14:12:46 +0100, Magnus Hagander wrote:
 Either way - if we can do this in a safe way, it sounds like a good idea.
 It would be sort of like rsync, except relying on the fact that we can look
 at the LSN and don't have to compare the actual files, right?

Which is an advantage, yes. On the other hand, it doesn't fix problems
with a subtly broken replica, e.g. after a bug in replay, or disk
corruption.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format

2014-01-14 Thread Michael Paquier
On Tue, Jan 14, 2014 at 10:12 PM, Magnus Hagander mag...@hagander.net wrote:
 On Tue, Jan 14, 2014 at 1:47 PM, Michael Paquier michael.paqu...@gmail.com
 wrote:

 Hi all,

 As of today, replication protocol has a command called BASE_BACKUP to
 allow a client connecting with the replication protocol to retrieve a
 full backup from server through a connection stream. The description
 of its current options are here:
 http://www.postgresql.org/docs/9.3/static/protocol-replication.html

 This command is in charge to put the server in start backup by using
 do_pg_start_backup, then it sends the backup, and finalizes the backup
 with do_pg_stop_backup. Thanks to that it is as well possible to get
 backups from even standby nodes as the stream contains the
 backup_label file necessary for recovery. Full backup is sent in tar
 format for obvious performance reasons to limit the amount of data
 sent through the stream, and server contains necessary coding to send
 the data in correct format. This forces the client as well to perform
 some decoding if the output of the base backup received needs to be
 analyzed on the fly but doing something similar to what now
 pg_basebackup does when the backup format is plain.

 I would like to propose the following things to extend BASE_BACKUP to
 retrieve a backup from a stream:
 - Addition of an option FORMAT, to control the output format of the
 backup, with 'plain' and 'tar' as possible values. The default is tar,
 for backward compatibility. The purpose of this option is to make it
 easier for backup tools working with Postgres to retrieve a backup and
 analyze it on the fly, i.e. to filter and analyze the data while it is
 being received, without all the tar decoding that would otherwise be
 necessary and that would more or less mean copying portions of
 pg_basebackup code.


 How would this be different/better than the tar format? pg_basebackup
 already does this analysis, for example, when it comes to recovery.conf.
 The tar format is really easy to analyze as a stream, that's one of the
 reasons we picked it...



 - Addition of an option called INCREMENTAL to send an incremental
 backup to the client. This option takes an LSN as input, and sends back
 to the client the relation pages (in the shape of reduced relation
 files) that are newer than the specified LSN, judging by the pd_lsn
 field of PageHeaderData. In this case the LSN needs to be determined by
 the client based on the latest full backup taken. This option is
 particularly interesting for reducing the amount of data transferred
 between two backups, even if it increases the restore time, as the
 client needs to reconstitute a base backup depending on the recovery
 target and the pages modified. The client would be in charge of
 rebuilding pages from the incremental backup by scanning all the blocks
 that need to be updated based on the full backup, as the LSN from which
 the incremental backup is taken is known. But this is not really
 something the server cares about... Such things are actually done by
 pg_rman as well.


 This sounds a lot like DIFFERENTIAL in other databases? Or I guess it's the
 same underlying technology, depending only on if you go back to the full
 base backup, or to the last incremental one.
Yes, that's actually an LSN-based differential; I had my head in pg_rman
for a couple of weeks, where a similar idea is called incremental.


 But if you look at the terms otherwise, I think incremental often refers to
 what we call WAL.

 Either way - if we can do this in a safe way, it sounds like a good idea. It
 would be sort of like rsync, except relying on the fact that we can look at
 the LSN and don't have to compare the actual files, right?
Yep, that's the idea.
-- 
Michael




Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format

2014-01-14 Thread Magnus Hagander
On Tue, Jan 14, 2014 at 2:18 PM, Andres Freund and...@2ndquadrant.com wrote:

 On 2014-01-14 14:12:46 +0100, Magnus Hagander wrote:
  Either way - if we can do this in a safe way, it sounds like a good idea.
  It would be sort of like rsync, except relying on the fact that we can
 look
  at the LSN and don't have to compare the actual files, right?

 Which is an advantage, yes. On the other hand, it doesn't fix problems
 with a subtly broken replica, e.g. after a bug in replay, or disk
 corruption.


Right. But neither does rsync, right?


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format

2014-01-14 Thread Andres Freund
On 2014-01-14 14:40:46 +0100, Magnus Hagander wrote:
 On Tue, Jan 14, 2014 at 2:18 PM, Andres Freund and...@2ndquadrant.com wrote:
 
  On 2014-01-14 14:12:46 +0100, Magnus Hagander wrote:
   Either way - if we can do this in a safe way, it sounds like a good idea.
   It would be sort of like rsync, except relying on the fact that we can
  look
   at the LSN and don't have to compare the actual files, right?
 
  Which is an advantage, yes. On the other hand, it doesn't fix problems
  with a subtly broken replica, e.g. after a bug in replay, or disk
  corruption.
 
 
 Right. But neither does rsync, right?

Hm? Rsync's really only safe with --checksum and with that it definitely
should fix those?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format

2014-01-14 Thread Magnus Hagander
On Tue, Jan 14, 2014 at 2:16 PM, Michael Paquier
michael.paqu...@gmail.comwrote:

 On Tue, Jan 14, 2014 at 10:01 PM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
  - Addition of an option called INCREMENTAL to send an incremental
  backup to the client. This option takes an LSN as input, and sends back
  to the client the relation pages (in the shape of reduced relation
  files) that are newer than the specified LSN, judging by the pd_lsn
  field of PageHeaderData. In this case the LSN needs to be determined by
  the client based on the latest full backup taken. This option is
  particularly interesting for reducing the amount of data transferred
  between two backups, even if it increases the restore time, as the
  client needs to reconstitute a base backup depending on the recovery
  target and the pages modified. The client would be in charge of
  rebuilding pages from the incremental backup by scanning all the blocks
  that need to be updated based on the full backup, as the LSN from which
  the incremental backup is taken is known. But this is not really
  something the server cares about... Such things are actually done by
  pg_rman as well.
 
 
  How does the server find all the pages with LSN > the threshold? If it
 needs
  to scan the whole database, it's not all that useful. I guess it would be
  better than nothing, but I think you might as well just use rsync.
 Yes, it would be necessary to scan the whole database, as the LSN to be
 checked is kept in PageHeaderData :). Perhaps it is not that performant,
 but my initial thought was that the amount of data needed to maintain
 incremental backups could balance against the amount of WAL that would
 otherwise have to be kept, limiting the total amount stored on disk.


It wouldn't be worse performance-wise than a full backup. That one also has
to read all the blocks after all... You're decreasing network traffic and
client storage, with the same I/O on the server side. Seems worthwhile.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format

2014-01-14 Thread Magnus Hagander
On Tue, Jan 14, 2014 at 2:41 PM, Andres Freund and...@2ndquadrant.com wrote:

 On 2014-01-14 14:40:46 +0100, Magnus Hagander wrote:
  On Tue, Jan 14, 2014 at 2:18 PM, Andres Freund and...@2ndquadrant.com
 wrote:
 
   On 2014-01-14 14:12:46 +0100, Magnus Hagander wrote:
Either way - if we can do this in a safe way, it sounds like a good
 idea.
It would be sort of like rsync, except relying on the fact that we
 can
   look
at the LSN and don't have to compare the actual files, right?
  
   Which is an advantage, yes. On the other hand, it doesn't fix problems
   with a subtly broken replica, e.g. after a bug in replay, or disk
   corruption.
  
  
  Right. But neither does rsync, right?

 Hm? Rsync's really only safe with --checksum and with that it definitely
 should fix those?


I think we're talking about different scenarios.

I thought you were talking about a backup taken from a replica, that
already has corruption. rsync checksums surely aren't going to help with
that?


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format

2014-01-14 Thread Andres Freund
On 2014-01-14 14:42:36 +0100, Magnus Hagander wrote:
 On Tue, Jan 14, 2014 at 2:41 PM, Andres Freund and...@2ndquadrant.com wrote:
 
  On 2014-01-14 14:40:46 +0100, Magnus Hagander wrote:
   On Tue, Jan 14, 2014 at 2:18 PM, Andres Freund and...@2ndquadrant.com
  wrote:
  
On 2014-01-14 14:12:46 +0100, Magnus Hagander wrote:
 Either way - if we can do this in a safe way, it sounds like a good
  idea.
 It would be sort of like rsync, except relying on the fact that we
  can
look
 at the LSN and don't have to compare the actual files, right?
   
Which is an advantage, yes. On the other hand, it doesn't fix problems
with a subtly broken replica, e.g. after a bug in replay, or disk
corruption.
   
   
   Right. But neither does rsync, right?
 
  Hm? Rsync's really only safe with --checksum and with that it definitely
  should fix those?
 
 
  I think we're talking about different scenarios.

Sounds like it.

 I thought you were talking about a backup taken from a replica, that
 already has corruption. rsync checksums surely aren't going to help with
 that?

I was talking about updating a standby using such an incremental or
differential backup from the primary (or a standby higher up in the
cascade). If your standby is corrupted in any way an rsync --checksum
will certainly correct errors if it syncs from a correct source?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH] Filter error log statements by sqlstate

2014-01-14 Thread Oskari Saarenmaa
On Tue, Jan 14, 2014 at 12:22:30PM +0530, Jeevan Chalke wrote:
 On Mon, Jan 13, 2014 at 4:30 PM, Oskari Saarenmaa o...@ohmu.fi wrote:
  On 13/01/14 10:26, Jeevan Chalke wrote:
 
   1. Documentation is missing and thus becomes difficult to understand
   what exactly you are trying to do.  Or in other words, user will be
   uncertain about using it more efficiently.
 
  I figured I'd write documentation for this if it looks like a useful
  feature which would be accepted for 9.4, but I guess it would've
  helped to have a bit better description of this for the initial
  submission as well.
 
 
   2. Some more comments required. At each new function and
   specifically at get_sqlstate_error_level().
 
  Just after I submitted the patch I noticed that I had a placeholder
  for comment about that function but never wrote the actual comment,
  sorry about that.
 
   3. Please add test-case if possible.
 
  Sure.
 
   4. Some code part does not comply with PostgreSQL indentation style. 
   (Can be ignored as it will pass through pg_indent, but better fix
   it).
  
 
  I'll try to fix this for v2.
 
   5. You have used XX000:warning, string to get maximum possible
   length of the valid sqlstate:level identifier.  It's perfect, but
   small explanation about that will be good there.  Also in future if
   we have any other error level which exceeds this, we need changes
   here too.  Right ?
 
  Good point, I'll address this in v2.
 
   I will look into this further. But please have your attention on above
   points.
 
  Thanks for the review!
 
 Since you are taking care of most of the points above, I will wait for v2
 patch. Till then marking Waiting on Author.

Attached v2 of the patch which addresses the above points.  I couldn't
figure out how to test log output, but at least the patch now tests that
it can set and show the log level.

Thanks,
Oskari
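
As a rough illustration of what such a filter boils down to (hypothetical
names and level values, not the actual code in the patch below): the GUC is
parsed into (sqlstate, level) pairs, and each logged error consults that
list before falling back to the global log_min_error_statement.

#include <stdio.h>
#include <string.h>

/* hypothetical parsed form of one "SQLSTATE:LEVEL" entry */
typedef struct SqlstateLevel
{
    char    sqlstate[6];    /* five-character SQLSTATE plus terminator */
    int     level;          /* numeric log level */
} SqlstateLevel;

/* return the per-sqlstate override, or the global default */
static int
statement_log_level(const char *sqlstate,
                    const SqlstateLevel *overrides, int noverrides,
                    int default_level)
{
    for (int i = 0; i < noverrides; i++)
    {
        if (strncmp(sqlstate, overrides[i].sqlstate, 5) == 0)
            return overrides[i].level;
    }
    return default_level;
}

int
main(void)
{
    /* pretend ERROR is 20 and PANIC is 22, with 'P0001:PANIC,22012:ERROR' set */
    SqlstateLevel overrides[] = {{"P0001", 22}, {"22012", 20}};

    printf("22012 -> level %d\n", statement_log_level("22012", overrides, 2, 20));
    printf("42601 -> level %d\n", statement_log_level("42601", overrides, 2, 20));
    return 0;
}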
From 213f647657f318141e3866087a17a863a0f322d9 Mon Sep 17 00:00:00 2001
From: Oskari Saarenmaa o...@ohmu.fi
Date: Tue, 14 Jan 2014 15:47:39 +0200
Subject: [PATCH] Filter error log statements by sqlstate

Allow the default log_min_error_statement to be overridden per sqlstate to
make it possible to filter out some error types while maintaining a low
log_min_error_statement or enable logging for some error types when the
default is to not log anything.
---
 doc/src/sgml/config.sgml  |  30 ++
 src/backend/utils/error/elog.c| 220 +-
 src/backend/utils/misc/guc.c  |  14 ++-
 src/include/utils/guc.h   |   4 +
 src/include/utils/guc_tables.h|   1 +
 src/test/regress/expected/guc.out |  24 +
 src/test/regress/sql/guc.sql  |   8 ++
 7 files changed, 298 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0f2f2bf..73a58ad 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3743,6 +3743,36 @@ local0.*/var/log/postgresql
   /listitem
  /varlistentry
 
+ varlistentry id=guc-log-sqlstate-error-statement 
xreflabel=log_sqlstate_error_statement
+  termvarnamelog_sqlstate_error_statement/varname 
(typestring/type)/term
+  indexterm
+   primaryvarnamelog_sqlstate_error_statement/ configuration 
parameter/primary
+  /indexterm
+  listitem
+   para
+Controls which error types in SQL statements condition are recorded
+in the server log.  This overrides the global xref
+linkend=guc-log-min-messages per error type and can be used to
+disable logging for certain error types and/or to enable logging for
+other types.
+   /para
+   para
+The value must be a comma-separated list in the form
+literal'ERRORCODE:LOGLEVEL,...'/literal.  For example, a setting
+of literal'P0001:PANIC,22012:ERROR'/literal would never log the
+SQL statements for errors generated by the PL/pgSQL
+literalRAISE/literal statement but would always log the
+statements causing division by zero errors.
+
+See xref linkend=errcodes-appendix for the list of valid error
+codes and xref linkend=guc-log-min-messages for valid log
+levels.
+
+Only superusers can change this setting.
+   /para
+  /listitem
+ /varlistentry
+
  varlistentry id=guc-log-min-duration-statement 
xreflabel=log_min_duration_statement
   termvarnamelog_min_duration_statement/varname 
(typeinteger/type)/term
   indexterm
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 8705586..2e74fd5 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -74,7 +74,9 @@
 #include storage/ipc.h
 #include storage/proc.h
 #include tcop/tcopprot.h
+#include utils/builtins.h
 #include utils/guc.h
+#include utils/guc_tables.h
 #include utils/memutils.h
 #include utils/ps_status.h
 
@@ -111,6 +113,11 @@ char  *Log_line_prefix = NULL; /* 
format for extra log line 

Re: [HACKERS] Optimize kernel readahead using buffer access strategy

2014-01-14 Thread Claudio Freire
On Tue, Jan 14, 2014 at 8:58 AM, KONDO Mitsumasa
kondo.mitsum...@lab.ntt.co.jp wrote:

 In my past patch, there is a significant bug: a mistaken calculation of
 the offset in posix_fadvise() :-( However it works well without problem in
 pgbench.
 Because pgbench transactions are always random access...


Did you notice any difference?

AFAIK, when specifying read patterns (ie, RANDOM, SEQUENTIAL and stuff
like that), the offset doesn't matter. At least in linux.
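
For reference, a small self-contained example of the call being discussed
(the file name is only a placeholder): on Linux, POSIX_FADV_RANDOM and
POSIX_FADV_SEQUENTIAL switch the readahead behaviour of the whole open
file, while range advice such as POSIX_FADV_WILLNEED is where the offset
and length actually select pages.

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
    int     fd = open("relation_segment", O_RDONLY);    /* placeholder path */

    if (fd < 0)
        return 1;

    /* readahead mode for the whole file; offset/len are effectively ignored */
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

    /* range advice: offset/len say which pages we expect to need soon */
    (void) posix_fadvise(fd, 0, 16 * 8192, POSIX_FADV_WILLNEED);

    close(fd);
    return 0;
}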






Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Hannu Krosing
On 01/14/2014 09:39 AM, Claudio Freire wrote:
 On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing ha...@2ndquadrant.com wrote:
 Again, as said above the linux file system is doing fine. What we
 want is a few ways to interact with it to let it do even better when
 working with postgresql by telling it some stuff it otherwise would
 have to second guess and by sometimes giving it back some cache
 pages which were copied away for potential modifying but ended
 up clean in the end.
 You don't need new interfaces. Only a slight modification of what
 fadvise DONTNEED does.

 This insistence in injecting pages from postgres to kernel is just a
 bad idea. 
Do you think it would be possible to map copy-on-write pages
from linux cache to postgresql cache ?

this would be a step in direction of solving the double-ram-usage
of pages which have not been read from syscache to postgresql
cache without sacrificing linux read-ahead (which I assume does
not happen when reads bypass system cache).

and we can write back the copy at the point when it is safe (from
postgresql perspective)  to let the system write them back ?

Do you think it is possible to make it work with good performance
for a few million 8kb pages ?

 At the very least, it still needs postgres to know too much
 of the filesystem (block layout) to properly work. Ie: pg must be
 required to put entire filesystem-level blocks into the page cache,
 since that's how the page cache works. 
I was thinking more of a simple write() interface with extra
flags/sysctls to tell the kernel that we already have this on disk
 At the very worst, it may
 introduce serious security and reliability implications, when
 applications can destroy the consistency of the page cache (even if
 full access rights are checked, there's still the possibility this
 inconsistency might be exploitable).
If you allow write() which just writes clean pages, I can not see
where the extra security concerns are beyond what normal
write can do.


Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Robert Haas
On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman mgor...@suse.de wrote:
 Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
 setting zone_reclaim_mode; is there some other problem besides that?

 Really?

 zone_reclaim_mode is often a complete disaster unless the workload is
 partitioned to fit within NUMA nodes. On older kernels enabling it would
 sometimes cause massive stalls. I'm actually very surprised to hear it
 fixes anything and would be interested in hearing more about what sort
 of circumstances would convince you to enable that thing.

By "set" I mean "set to zero".  We've seen multiple instances of
people complaining about large amounts of system memory going unused
because this setting defaulted to 1.

 The other thing that comes to mind is the kernel's caching behavior.
 We've talked a lot over the years about the difficulties of getting
 the kernel to write data out when we want it to and to not write data
 out when we don't want it to.

 Is sync_file_range() broke?

I don't know.  I think a few of us have played with it and not been
able to achieve a clear win.  Whether the problem is with the system
call or the programmer is harder to determine.  I think the problem is
in part that it's not exactly clear when we should call it.  So
suppose we want to do a checkpoint.  What we used to do a long time
ago is write everything, and then fsync it all, and then call it good.
 But that produced horrible I/O storms.  So what we do now is do the
writes over a period of time, with sleeps in between, and then fsync
it all at the end, hoping that the kernel will write some of it before
the fsyncs arrive so that we don't get a huge I/O spike.
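
To make that concrete, here is a heavily simplified, Linux-specific sketch
of the pattern (not PostgreSQL's actual checkpointer code); the
sync_file_range() call marks the spot where the experiments mentioned above
tried to nudge writeback into starting early instead of leaving everything
to the final fsync.

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192

/* write nblocks dummy blocks, nudging writeback as we go, then fsync */
static void
spread_checkpoint(int fd, int nblocks)
{
    char    page[BLCKSZ];

    memset(page, 0, sizeof(page));
    for (int blkno = 0; blkno < nblocks; blkno++)
    {
        /* stand-in for flushing one dirty shared buffer */
        pwrite(fd, page, BLCKSZ, (off_t) blkno * BLCKSZ);

        /* ask the kernel to start writeback of what was just written */
        sync_file_range(fd, (off_t) blkno * BLCKSZ, BLCKSZ,
                        SYNC_FILE_RANGE_WRITE);

        /* spread the I/O out over time, like the checkpointer's sleeps */
        usleep(1000);
    }

    /* durability still requires the final fsync */
    fsync(fd);
}

int
main(void)
{
    int     fd = open("ckpt_demo.dat", O_CREAT | O_RDWR, 0600);

    if (fd < 0)
        return 1;
    spread_checkpoint(fd, 64);
    close(fd);
    return 0;
}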

And that sorta works, and it's definitely better than doing it all at
full speed, but it's pretty imprecise.  If the kernel doesn't write
enough of the data out in advance, then there's still a huge I/O storm
when we do the fsyncs and everything grinds to a halt.  If it writes
out more data than needed in advance, it increases the total number of
physical writes because we get less write-combining, and that hurts
performance, too.  I basically feel like the I/O scheduler sucks,
though whether it sucks because it's not theoretically possible to do
any better or whether it sucks because of some more tractable reason
is not clear to me.  In an ideal world, when I call fsync() a bunch of
times from one process, other processes on the same machine should not
begin to observe 30+-second (or sometimes 300+-second) times for read
or write of an 8kB block.  Imagine a hypothetical UNIX-like system
where when one process starts running at 100% CPU, every other process
on the machine gets timesliced in only once per minute.  That's
obviously ridiculous, and yet it's pretty much exactly what happens
with I/O.

 When it writes data back to disk too
 aggressively, we get lousy throughput because the same page can get
 written more than once when caching it for longer would have allowed
 write-combining.

 Do you think that is related to dirty_ratio or dirty_writeback_centisecs?
 If it's dirty_writeback_centisecs then that would be particularly tricky
 because poor interactions there would come down to luck basically.

See above; I think it's related to fsync.

 When it doesn't write data to disk aggressively
 enough, we get huge latency spikes at checkpoint time when we call
 fsync() and the kernel says uh, what? you wanted that data *on the
 disk*? sorry boss! and then proceeds to destroy the world by starving
 the rest of the system for I/O for many seconds or minutes at a time.

 Ok, parts of that are somewhat expected. It *may* depend on the
 underlying filesystem. Some of them handle fsync better than others. If
 you are syncing the whole file though when you call fsync then you are
 potentially burned by having to writeback dirty_ratio amounts of memory
 which could take a substantial amount of time.

Yeah.  ext3 apparently fsyncs the whole filesystem, which is terrible
for throughput, but if you happen to have xlog (which is flushed
regularly) on the same filesystem as the data files (which are flushed
only periodically) then at least you don't have the problem of the
write queue getting too large.   But I think most of our users are on
ext4 at this point, probably some xfs and other things.

We track the number of un-fsync'd blocks we've written to each file,
and have gotten desperate enough to think of approaches like - ok,
well if the total number of un-fsync'd blocks in the system exceeds
some threshold, then fsync the file with the most such blocks, not
because we really need the data on disk just yet but so that the write
queue won't get too large for the kernel to deal with.  And I think
there may even be some test results from such crocks showing some
benefit.  But really, I don't understand why we have to baby the
kernel like this.  Ensuring scheduling fairness is a basic job of the
kernel; if we wanted to have to control caching behavior manually, we
could 

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Kevin Grittner
First off, I want to give a +1 on everything in the recent posts
from Heikki and Hannu.

Jan Kara j...@suse.cz wrote:

 Now the aging of pages marked as volatile as it is currently
 implemented needn't be perfect for your needs but you still have
 time to influence what gets implemented... Actually developers of
 the vrange() syscall were specifically looking for some ideas
 what to base aging on. Currently I think it is first marked -
 first evicted.

The first marked - first evicted seems like what we would want. 
The ability to unmark and have the page no longer be considered
preferred for eviction would be very nice.  That seems to me like
it would cover the multiple layers of buffering *clean* pages very
nicely (although I know nothing more about vrange() than what has
been said on this thread, so I could be missing something).

The other side of that is related to avoiding multiple writes of the
same page as much as possible, while avoiding write gluts.  The issue
here is that PostgreSQL tries to hang on to dirty pages for as long
as possible before writing them to the OS cache, while the OS
tries to avoid writing them to storage for as long as possible
until they reach a (configurable) threshold or are fsync'd.  The
problem is that under various conditions PostgreSQL may need to
write and fsync a lot of dirty pages it has accumulated in a short
time.  That has an avalanche effect, creating a write glut
which can stall all I/O for a period of many seconds up to a few
minutes.  If the OS was aware of the dirty pages pending write in
the application, and counted those for purposes of calculating when
and how much to write, the glut could be avoided.  Currently,
people configure the PostgreSQL background writer to be very
aggressive, configure a small PostgreSQL shared_buffers setting,
and/or set the OS thresholds low enough to minimize the problem;
but all of these mitigation strategies have their own costs.

A new hint that the application has dirtied a page could be used by
the OS to improve things this way:  When the OS is notified that a
page is dirty, it takes action depending on whether the page is
considered dirty by the OS.  If it is not dirty, the page is
immediately discarded from the OS cache.  It is known that the
application has a modified version of the page that it intends to
write, so the version in the OS cache has no value.  We don't want
this page forcing eviction of vrange()-flagged pages.  If it is
dirty, any write ordering to storage by the OS based on when the
page was written to the OS would be pushed back as far as possible
without crossing any write barriers, in hopes that the writes could
be combined.  Either way, this page is counted toward dirty pages
for purposes of calculating how much to write from the OS to
storage, and the later write of the page doesn't redundantly add to
this number.

The combination of these two changes could boost PostgreSQL
performance quite a bit, at least for some common workloads.

The MMAP approach always seems tempting on first blush, but the
need to pin pages and the need to assure that dirty pages are not
written ahead of the WAL-logging of those pages makes it hard to
see how we can use it.  The pin means that we need to ensure that
a particular 8KB page remains available for direct reference by all
PostgreSQL processes until it is unpinned.  The other thing we
would need is the ability to modify a page with a solid assurance
that the modified page would *not* be written to disk until we
authorize it.  The page would remain pinned until we do authorize
write, at which point the changes are available to be written, but
can wait for an fsync or accumulations of sufficient dirty pages to
cross the write threshold.  Next comes the hard part.  The page may
or may not be unpinned after that, and if it remains pinned or is
pinned again, there may be further changes to the page.  While the
prior changes can be written (and *must* be written for an fsync),
these new changes must *not* be until we authorize it.  If MMAP can
be made to handle that, we could probably use it (and some of the
previously-discussed techniques might not be needed), but my
understanding is that there is currently no way to do so.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Claudio Freire
On Tue, Jan 14, 2014 at 11:39 AM, Hannu Krosing ha...@2ndquadrant.com wrote:
 On 01/14/2014 09:39 AM, Claudio Freire wrote:
 On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing ha...@2ndquadrant.com wrote:
 Again, as said above the linux file system is doing fine. What we
 want is a few ways to interact with it to let it do even better when
 working with postgresql by telling it some stuff it otherwise would
 have to second guess and by sometimes giving it back some cache
 pages which were copied away for potential modifying but ended
 up clean in the end.
 You don't need new interfaces. Only a slight modification of what
 fadvise DONTNEED does.

 This insistence in injecting pages from postgres to kernel is just a
 bad idea.
 Do you think it would be possible to map copy-on-write pages
 from linux cache to postgresql cache ?

 this would be a step in direction of solving the double-ram-usage
 of pages which have not been read from syscache to postgresql
 cache without sacrificing linux read-ahead (which I assume does
 not happen when reads bypass system cache).

 and we can write back the copy at the point when it is safe (from
 postgresql perspective)  to let the system write them back ?

 Do you think it is possible to make it work with good performance
 for a few million 8kb pages ?

I don't think so. The kernel would need to walk the page mapping on
each page fault, which would incur the cost of a read cache hit on
each page fault.

A cache hit is still orders of magnitude slower than a regular page
fault, because the process page map is compact and efficient. But if
you bloat it, or if you make the kernel go read the buffer cache, it
would mean bad performance for RAM access, which I'd venture isn't
really a net gain.

That's probably the reason there is no zero-copy read mechanism.
Because you always have to copy from/to the buffer cache anyway.

Of course, this is just OTOMH. Without actually benchmarking, this is
all blabber.

 At the very worst, it may
 introduce serious security and reliability implications, when
 applications can destroy the consistency of the page cache (even if
 full access rights are checked, there's still the possibility this
 inconsistency might be exploitable).
 If you allow write() which just writes clean pages, I can not see
 where the extra security concerns are beyond what normal
 write can do.

I've been working on security enough to never dismiss any kind of
system-level inconsistency.

The fact that you can make user-land applications see different data
than kernel-land code has over-reaching consequences that are hard to
ponder.




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Robert Haas
On Tue, Jan 14, 2014 at 3:39 AM, Claudio Freire klaussfre...@gmail.com wrote:
 On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing ha...@2ndquadrant.com wrote:
 Again, as said above the linux file system is doing fine. What we
 want is a few ways to interact with it to let it do even better when
 working with postgresql by telling it some stuff it otherwise would
 have to second guess and by sometimes giving it back some cache
 pages which were copied away for potential modifying but ended
 up clean in the end.

 You don't need new interfaces. Only a slight modification of what
 fadvise DONTNEED does.

Yeah.  DONTREALLYNEEDALLTHATTERRIBLYMUCH.

 This insistence in injecting pages from postgres to kernel is just a
 bad idea. At the very least, it still needs postgres to know too much
 of the filesystem (block layout) to properly work. Ie: pg must be
 required to put entire filesystem-level blocks into the page cache,
 since that's how the page cache works. At the very worst, it may
 introduce serious security and reliability implications, when
 applications can destroy the consistency of the page cache (even if
 full access rights are checked, there's still the possibility this
 inconsistency might be exploitable).

I agree with all that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Robert Haas
On Tue, Jan 14, 2014 at 5:00 AM, Jan Kara j...@suse.cz wrote:
 I thought that instead of injecting pages into pagecache for aging as you
 describe in 3), you would mark pages as volatile (i.e. for reclaim by
 kernel) through vrange() syscall. Next time you need the page, you check
 whether the kernel reclaimed the page or not. If yes, you reload it from
 disk, if not, you unmark it and use it.

 Now the aging of pages marked as volatile as it is currently implemented
 needn't be perfect for your needs but you still have time to influence what
 gets implemented... Actually developers of the vrange() syscall were
 specifically looking for some ideas what to base aging on. Currently I
 think it is first marked - first evicted.

This is an interesting idea but it stinks of impracticality.
Essentially when the last buffer pin on a page is dropped we'd have to
mark it as discardable, and then the next person wanting to pin it
would have to check whether it's still there.  But the system call
overhead of calling vrange() every time the last pin on a page was
dropped would probably hose us.

*thinks*

Well, I guess it could be done lazily: make periodic sweeps through
shared_buffers, looking for pages that haven't been touched in a
while, and vrange() them.  That's quite a bit of new mechanism, but in
theory it could work out to a win.  vrange() would have to scale well
to millions of separate ranges, though.  Will it?  And a lot depends
on whether the kernel makes the right decision about whether to chunk
data from our vrange() vs. any other page it could have reclaimed.
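
A sketch of that sweep, under heavy assumptions: BufferDesc here is a toy
structure rather than the real one, and mark_range_volatile() is only a
stub standing in for whatever the vrange()/volatile-range call ends up
looking like.

#include <stddef.h>
#include <time.h>

#define BLCKSZ 8192

typedef struct BufferDesc
{
    char   *page;           /* the 8kB of buffer memory */
    int     pin_count;      /* pins currently held by backends */
    time_t  last_touched;   /* last access time, maintained elsewhere */
    int     is_volatile;    /* already offered to the kernel for reclaim */
} BufferDesc;

/* stub for the proposed volatile-range call; the real one would be a syscall */
static int
mark_range_volatile(void *start, size_t len)
{
    (void) start;
    (void) len;
    return 0;
}

/* periodic sweep: offer cold, unpinned buffers back to the kernel */
static void
sweep_cold_buffers(BufferDesc *buffers, int nbuffers, int idle_seconds)
{
    time_t  now = time(NULL);

    for (int i = 0; i < nbuffers; i++)
    {
        BufferDesc *buf = &buffers[i];

        if (buf->pin_count == 0 &&
            !buf->is_volatile &&
            now - buf->last_touched > idle_seconds)
        {
            if (mark_range_volatile(buf->page, BLCKSZ) == 0)
                buf->is_volatile = 1;
        }
    }
}

int
main(void)
{
    static char pages[4][BLCKSZ];
    BufferDesc  buffers[4];

    for (int i = 0; i < 4; i++)
        buffers[i] = (BufferDesc) {pages[i], 0, 0, 0};

    sweep_cold_buffers(buffers, 4, 60);     /* everything here looks idle */
    return 0;
}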

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] plpgsql.consistent_into

2014-01-14 Thread Marko Tiikkaja

On 1/14/14 1:28 PM, Pavel Stehule wrote:

I prefer subquery only syntax - a := (some) or (a,b,c) = (some a,b,c) with
possible enhancing for statements with RETURNING

an advantage is compatibility with DB2 (SQL/PSM) syntax - and this code is
written now - it is done in my sql/psm implementation


Are you saying that DB2 supports the multi-variable assignment?  I 
couldn't find any reference suggesting that, can you point me to one?



Regards,
Marko Tiikkaja




Re: [HACKERS] extension_control_path

2014-01-14 Thread Tom Lane
Dimitri Fontaine dimi...@2ndquadrant.fr writes:
 Please find attached to this email a patch implementing a new GUC that
 allows users to set up a list of paths where PostgreSQL will search for
 the extension control files at CREATE EXTENSION time.

Why is that a good idea?  It's certainly not going to simplify DBAs'
lives, more the reverse.  (This dump won't reload. Uh, where did
you get that extension from? Ummm...)

Assuming that there is some need for loading extensions from nonstandard
places, would it be better to just allow a filename specification in
CREATE EXTENSION?  (I don't know the answer, since the use-case isn't
apparent to me in the first place, but it seems worth asking.)

regards, tom lane




Re: [HACKERS] GIN improvements part2: fast scan

2014-01-14 Thread Alexander Korotkov
On Thu, Nov 21, 2013 at 12:14 AM, Alexander Korotkov
aekorot...@gmail.com wrote:

 On Wed, Nov 20, 2013 at 3:06 AM, Alexander Korotkov 
  aekorot...@gmail.com wrote:

 On Fri, Nov 15, 2013 at 11:19 AM, Alexander Korotkov 
 aekorot...@gmail.com wrote:

 On Fri, Nov 15, 2013 at 12:34 AM, Heikki Linnakangas 
 hlinnakan...@vmware.com wrote:

 On 14.11.2013 19:26, Alexander Korotkov wrote:

 On Sun, Jun 30, 2013 at 3:00 PM, Heikki Linnakangas 
 hlinnakan...@vmware.com

 wrote:


  On 28.06.2013 22:31, Alexander Korotkov wrote:

  Now, I got the point of three state consistent: we can keep only one
 consistent in opclasses that support new interface. exact true and
 exact
 false values will be passed in the case of current patch consistent;
 exact
 false and unknown will be passed in the case of current patch
 preConsistent. That's reasonable.


 I'm going to mark this as returned with feedback. For the next
 version,
 I'd like to see the API changed per above. Also, I'd like us to do
 something about the tidbitmap overhead, as a separate patch before
 this, so
 that we can assess the actual benefit of this patch. And a new test
 case
 that demonstrates the I/O benefits.



 Revised version of patch is attached.
 Changes are so:
 1) Patch rebased against packed posting lists, not depends on
 additional
 information now.
 2) New API with tri-state logic is introduced.


 Thanks! A couple of thoughts after a 5-minute glance:

 * documentation


 Will provide documented version this week.


 * How about defining the tri-state consistent function to also return a
 tri-state? True would mean that the tuple definitely matches, false means
 the tuple definitely does not match, and Unknown means it might match. Or
 does return value true with recheck==true have the same effect? If I
 understood the patch, right, returning Unknown or True wouldn't actually
 make any difference, but it's conceivable that we might come up with more
 optimizations in the future that could take advantage of that. For example,
 for a query like foo OR (bar AND baz), you could immediately return any
 tuples that match foo, and not bother scanning for bar and baz at all.


 The meaning of the recheck flag when the input contains unknown is
 undefined now. :)
 For instance, we could define it in the following ways:
 1) Like returning Unknown, meaning that consistent with true or false
 substituted for the input Unknown could return either true or false.
 2) Consistent with true or false substituted for the input Unknown could
 return recheck. This meaning is probably logical, but I don't see any
 usage for it.

 I'm not against the idea of a tri-state return value for consistent,
 because it's a logical continuation of its tri-state input. However, I
 don't see a use for distinguishing True and Unknown in the return value
 for now :)
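
To make the tri-state idea concrete, a small illustration with hypothetical
names (not the patch's actual API): for a query that requires both keys,
false from either key makes the result false, any remaining unknown makes
it unknown, and only a definite false would allow skipping the item
outright.

#include <stdio.h>

typedef enum TriState
{
    TS_FALSE = 0,
    TS_TRUE = 1,
    TS_UNKNOWN = 2
} TriState;

/* tri-state AND: false dominates, otherwise unknown dominates */
static TriState
tri_and(TriState a, TriState b)
{
    if (a == TS_FALSE || b == TS_FALSE)
        return TS_FALSE;
    if (a == TS_UNKNOWN || b == TS_UNKNOWN)
        return TS_UNKNOWN;
    return TS_TRUE;
}

/* consistent function for a query requiring all of its keys */
static TriState
and_consistent(const TriState *entries, int nentries)
{
    TriState    result = TS_TRUE;

    for (int i = 0; i < nentries; i++)
        result = tri_and(result, entries[i]);
    return result;
}

int
main(void)
{
    TriState    keys[2] = {TS_TRUE, TS_UNKNOWN};

    /* one key unchecked: result is unknown, so the item cannot be skipped */
    printf("consistent = %d (0=false, 1=true, 2=unknown)\n",
           and_consistent(keys, 2));
    return 0;
}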

 In the example you give we can return foo immediately, but we still have
 to create the full bitmap, so we will have to scan (bar AND baz) anyway.
 We could skip parts of the trees for bar and baz, but that is possible
 only when foo contains a large amount of sequential TIDs, so we can be
 sure that we didn't miss any TIDs. This seems to be a very narrow
 use-case to me.

 Another point is that one day we could probably return tuples
 immediately in gingettuple. And with a LIMIT clause and no sorting we
 wouldn't need to search for other tuples. However, gingettuple was
 removed for concurrency reasons, and my patches for index-based ordering
 didn't bring it back in its previous form: they collect all the results
 and then return them one-by-one.


 I'm trying to make fastscan work with GinFuzzySearchLimit. Then I
 figured out that I don't understand how GinFuzzySearchLimit works. Why,
 with GinFuzzySearchLimit, can startScan return without doing
 startScanKey? Is it a bug?


 Revised version of patch is attached. Changes are so:
 1) Support for GinFuzzySearchLimit.
 2) Some documentation.
 Question about GinFuzzySearchLimit is still relevant.


Attached version is rebased against last version of packed posting lists.

--
With best regards,
Alexander Korotkov.


gin-fast-scan.9.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] plpgsql.consistent_into

2014-01-14 Thread Pavel Stehule
Hello


2014/1/14 Marko Tiikkaja ma...@joh.to

 On 1/14/14 1:28 PM, Pavel Stehule wrote:

 I prefer subquery only syntax - a := (some) or (a,b,c) = (some a,b,c) with
 possible enhancing for statements with RETURNING

 an advantage is compatibility with DB2 (SQL/PSM) syntax - and this code is
 written now - it is done in my sql/psm implementation


 Are you saying that DB2 supports the multi-variable assignment?  I
 couldn't find any reference suggesting that, can you point me to one?


I was wrong - it is different syntax - it is SQL/PSM form - db2 supports
SET var=val, var=val, ...

 http://postgres.cz/wiki/Basic_statements_SQL/PSM_samples

Regards

Pavel



 Regards,
 Marko Tiikkaja



Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Tom Lane
James Bottomley james.bottom...@hansenpartnership.com writes:
 The current mechanism for coherency between a userspace cache and the
 in-kernel page cache is mmap ... that's the only way you get the same
 page in both currently.

Right.

 glibc used to have an implementation of read/write in terms of mmap, so
 it should be possible to insert it into your current implementation
 without a major rewrite.  The problem I think this brings you is
 uncontrolled writeback: you don't want dirty pages to go to disk until
 you issue a write()

Exactly.

 I think we could fix this with another madvise():
 something like MADV_WILLUPDATE telling the page cache we expect to alter
 the pages again, so don't be aggressive about cleaning them.

"Don't be aggressive" isn't good enough.  The prohibition on early write
has to be absolute, because writing a dirty page before we've done
whatever else we need to do results in a corrupt database.  It has to
be treated like a write barrier.

 The problem is we can't give you absolute control of when pages are
 written back because that interface can be used to DoS the system: once
 we get too many dirty uncleanable pages, we'll thrash looking for memory
 and the system will livelock.

Understood, but that makes this direction a dead end.  We can't use
it if the kernel might decide to write anyway.

regards, tom lane




Re: [HACKERS] PoC: Partial sort

2014-01-14 Thread Alexander Korotkov
Hi!

On Tue, Jan 14, 2014 at 12:54 AM, Marti Raudsepp ma...@juffo.org wrote:

 First, thanks a lot for working on this feature. This PostgreSQL
 shortcoming crops up in all the time in web applications that implement
 paging by multiple sorted columns.


Thanks!

I've been trying it out in a few situations. I implemented a new
 enable_partialsort GUC to make it easier to turn on/off, this way it's a
 lot easier to test. The attached patch applies on top of
 partial-sort-5.patch


I thought about such an option, generally not because of testing
convenience but because of the overhead of planning. The way you
implement it is quite naive :) For instance, merge join relies on
partial sort, which will be replaced with a simple sort.


 I will spend more time reviewing the patch, but some of this planner code
 is over my head. If there's any way I can help to make sure this lands in
 the next version, let me know.

 

 The patch performs just as well as I would expect it to:

 marti=# select ac.name, r.name from artist_credit ac join release r on (
 ac.id=r.artist_credit) order by ac.name, r.name limit 1000;
 Time: 9.830 ms
 marti=# set enable_partialsort = off;
 marti=# select ac.name, r.name from artist_credit ac join release r on (
 ac.id=r.artist_credit) order by ac.name, r.name limit 1000;
 Time: 1442.815 ms

 A difference of almost 150x!

 There's a missed opportunity in that the code doesn't consider pushing new
 Sort steps into subplans. For example, if there's no index on
 language(name) then this query cannot take advantage partial sorts:

 marti=# explain select l.name, r.name from language l join release r on (
 l.id=r.language) order by l.name, r.name limit 1000;
  Limit  (cost=123203.20..123205.70 rows=1000 width=32)
-  Sort  (cost=123203.20..126154.27 rows=1180430 width=32)
  Sort Key: l.name, r.name
  -  Hash Join  (cost=229.47..58481.49 rows=1180430 width=32)
Hash Cond: (r.language = l.id)
-  Seq Scan on release r  (cost=0.00..31040.10
 rows=1232610 width=26)
-  Hash  (cost=131.43..131.43 rows=7843 width=14)
  -  Seq Scan on language l  (cost=0.00..131.43
 rows=7843 width=14)

 But because there are only so few languages, it would be a lot faster to
 sort languages in advance and then do partial sort:
  Limit  (rows=1000 width=31)
-  Partial sort  (rows=1180881 width=31)
  Sort Key: l.name, r.name
  Presorted Key: l.name
  -  Nested Loop  (rows=1180881 width=31)
-  Sort  (rows=7843 width=10)
  Sort Key: name
  -  Seq Scan on language  (rows=7843 width=14)
-  Index Scan using release_language_idx on release r
 (rows=11246 width=25)
  Index Cond: (language = l.id)

 Even an explicit sorted CTE cannot take advantage of partial sorts:
 marti=# explain with sorted_lang as (select id, name from language order
 by name)
 marti-# select l.name, r.name from sorted_lang l join release r on 
 (l.id=r.language)
 order by l.name, r.name limit 1000;
  Limit  (cost=3324368.83..3324371.33 rows=1000 width=240)
CTE sorted_lang
  -  Sort  (cost=638.76..658.37 rows=7843 width=14)
Sort Key: language.name
-  Seq Scan on language  (cost=0.00..131.43 rows=7843 width=14)
-  Sort  (cost=3323710.46..3439436.82 rows=46290543 width=240)
  Sort Key: l.name, r.name
  -  Merge Join  (cost=664.62..785649.92 rows=46290543 width=240)
Merge Cond: (r.language = l.id)
-  Index Scan using release_language_idx on release r
 (cost=0.43..87546.06 rows=1232610 width=26)
-  Sort  (cost=664.19..683.80 rows=7843 width=222)
  Sort Key: l.id
  -  CTE Scan on sorted_lang l  (cost=0.00..156.86
 rows=7843 width=222)

 But even with these limitations, this will easily be the killer feature of
 the next release, for me at least.


I see. But I don't think it can be achieved by small changes in the
planner. Moreover, I didn't check, but I think that if you remove the
ordering by r.name you will still not get the languages sorted in the
inner node. So this problem is not directly related to partial sort.

--
With best regards,
Alexander Korotkov.


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Claudio Freire
On Tue, Jan 14, 2014 at 12:42 PM, Trond Myklebust tron...@gmail.com wrote:
 James Bottomley james.bottom...@hansenpartnership.com writes:
 The current mechanism for coherency between a userspace cache and the
 in-kernel page cache is mmap ... that's the only way you get the same
 page in both currently.

 Right.

 glibc used to have an implementation of read/write in terms of mmap, so
 it should be possible to insert it into your current implementation
 without a major rewrite.  The problem I think this brings you is
 uncontrolled writeback: you don't want dirty pages to go to disk until
 you issue a write()

 Exactly.

 I think we could fix this with another madvise():
 something like MADV_WILLUPDATE telling the page cache we expect to alter
 the pages again, so don't be aggressive about cleaning them.

 Don't be aggressive isn't good enough.  The prohibition on early write
 has to be absolute, because writing a dirty page before we've done
 whatever else we need to do results in a corrupt database.  It has to
 be treated like a write barrier.

 Then why are you dirtying the page at all? It makes no sense to tell the 
 kernel “we’re changing this page in the page cache, but we don’t want you to 
 change it on disk”: that’s not consistent with the function of a page cache.


PG doesn't currently.

All that dirtying happens in anonymous shared memory, in pg-specific buffers.

The proposal is to use mmap instead of anonymous shared memory as
pg-specific buffers to avoid the extra copy (mmap would share the page
with both kernel and user space). But that would dirty the page when
written to, because now the kernel has the correspondence between that
specific memory region and the file, and that's forbidden for PG's
usage.

I believe the only option here is for the kernel to implement
zero-copy reads. But that implementation is doomed for the performance
reasons I outlined in an earlier mail. So...




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Tom Lane
Trond Myklebust tron...@gmail.com writes:
 On Jan 14, 2014, at 10:39, Tom Lane t...@sss.pgh.pa.us wrote:
 Don't be aggressive isn't good enough.  The prohibition on early write
 has to be absolute, because writing a dirty page before we've done
 whatever else we need to do results in a corrupt database.  It has to
 be treated like a write barrier.

 Then why are you dirtying the page at all? It makes no sense to tell the 
 kernel “we’re changing this page in the page cache, but we don’t want you to 
 change it on disk”: that’s not consistent with the function of a page cache.

As things currently stand, we dirty the page in our internal buffers,
and we don't write it to the kernel until we've written and fsync'd the
WAL data that needs to get to disk first.  The discussion here is about
whether we could somehow avoid double-buffering between our internal
buffers and the kernel page cache.

I personally think there is no chance of using mmap for that; the
semantics of mmap are pretty much dictated by POSIX and they don't work
for this.  However, disregarding the fact that the two communities
speaking here don't control the POSIX spec, you could maybe imagine
making it work if *both* pending WAL file contents and data file
contents were mmap'd, and there were kernel APIs allowing us to say
you can write this mmap'd page if you want, but not till you've written
that mmap'd data over there.  That'd provide the necessary
write-barrier semantics, and avoid the cache coherency question because
all the data visible to the kernel could be thought of as the current
filesystem contents, it just might not all have reached disk yet; which
is the behavior of the kernel disk cache already.

I'm dubious that this sketch is implementable with adequate efficiency,
though, because in a live system the kernel would be forced to deal with
a whole lot of active barrier restrictions.  Within Postgres we can
reduce write-ordering tests to a very simple comparison: don't write
this page until WAL is flushed to disk at least as far as WAL sequence
number XYZ.  I think any kernel API would have to be a great deal more
general and thus harder to optimize.
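
A simplified sketch of that comparison, with stand-in names rather than the
actual backend routines: before a dirty data page is handed to the kernel,
WAL must already be flushed at least up to that page's LSN.

#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* highest WAL position known to be durably on disk */
static XLogRecPtr wal_flushed_upto = 0;

/* stand-in: write and fsync WAL through 'upto', then remember it */
static void
flush_wal_through(XLogRecPtr upto)
{
    if (upto > wal_flushed_upto)
        wal_flushed_upto = upto;
}

/* stand-in: the write() of one 8kB page into the kernel page cache */
static void
write_page_to_kernel(const char *page)
{
    (void) page;
}

/* the write-ordering test: a single comparison per dirty page */
static void
flush_dirty_page(const char *page, XLogRecPtr page_lsn)
{
    if (page_lsn > wal_flushed_upto)
        flush_wal_through(page_lsn);    /* WAL first ... */
    write_page_to_kernel(page);         /* ... then the data page */
}

int
main(void)
{
    char    page[8192] = {0};

    flush_dirty_page(page, 12345);
    return 0;
}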

Another difficulty with merging our internal buffers with the kernel
cache is that when we're in the process of applying a change to a page,
there are intermediate states of the page data that should under no
circumstances reach disk (eg, we might need to shuffle records around
within the page).  We can deal with that fairly easily right now by not
issuing a write() while a page change is in progress.  I don't see that
it's even theoretically possible in an mmap'd world; there are no atomic
updates to an mmap'd page that are larger than whatever is an atomic
update for the CPU.

regards, tom lane




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Jan Kara
On Tue 14-01-14 09:08:40, Hannu Krosing wrote:
  Effectively you end up with buffered read/write that's also mapped into
  the page cache.  It's a pretty awful way to hack around mmap.
  Well, the problem is that you can't really use mmap() for the things we
  do. Postgres' durability works by guaranteeing that our journal entries
  (called WAL := Write Ahead Log) are written  synced to disk before the
  corresponding entries of tables and indexes reach the disk. That also
  allows to group together many random-writes into a few contiguous writes
  fdatasync()ed at once. Only during a checkpointing phase the big bulk of
  the data is then (slowly, in the background) synced to disk.
  Which is the exact algorithm most journalling filesystems use for
  ensuring durability of their metadata updates.  Indeed, here's an
  interesting piece of architecture that you might like to consider:
 
  * Neither XFS and BTRFS use the kernel page cache to back their
metadata transaction engines.
 But file system code is supposed to know much more about the
 underlying disk than a mere application program like postgresql.
 
 We do not want to start duplicating OS if we can avoid it.
 
 What we would like is to have a way to tell the kernel
 
 1) here is the modified copy of file page, it is now safe to write
 it back - the current 'lazy' write
 
 2) here is the page, write it back now, before returning success
 to me - unbuffered write or write + sync
 
 but we also would like to have
 
 3) here is the page as it is currently on disk, I may need it soon,
 so keep it together with your other clean pages accessed at time X
 - this is the non-dirtying write discussed

 the page may be in buffer cache, in which case just update its LRU
 position (to either current time or time provided by postgresql), or
 it may not be there, in which case put it there if reasonable by it's
 LRU position.
 
 And we would like all this to work together with other current linux
 kernel goodness of managing the whole disk-side interaction of
 efficient reading and writing and managing the buffers :)
  So when I was speaking about the proposed vrange() syscall in this thread,
I thought that instead of injecting pages into pagecache for aging as you
describe in 3), you would mark pages as volatile (i.e. for reclaim by
kernel) through vrange() syscall. Next time you need the page, you check
whether the kernel reclaimed the page or not. If yes, you reload it from
disk, if not, you unmark it and use it.

Now the aging of pages marked as volatile as it is currently implemented
needn't be perfect for your needs but you still have time to influence what
gets implemented... Actually developers of the vrange() syscall were
specifically looking for some ideas what to base aging on. Currently I
think it is first marked - first evicted.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Jan Kara
On Tue 14-01-14 11:11:28, Heikki Linnakangas wrote:
 On 01/14/2014 12:26 AM, Mel Gorman wrote:
 On Mon, Jan 13, 2014 at 03:15:16PM -0500, Robert Haas wrote:
 The other thing that comes to mind is the kernel's caching behavior.
 We've talked a lot over the years about the difficulties of getting
 the kernel to write data out when we want it to and to not write data
 out when we don't want it to.
 
 Is sync_file_range() broke?
 
 When it writes data back to disk too
 aggressively, we get lousy throughput because the same page can get
 written more than once when caching it for longer would have allowed
 write-combining.
 
 Do you think that is related to dirty_ratio or dirty_writeback_centisecs?
 If it's dirty_writeback_centisecs then that would be particularly tricky
 because poor interactions there would come down to luck basically.
 
 When it doesn't write data to disk aggressively
 enough, we get huge latency spikes at checkpoint time when we call
 fsync() and the kernel says uh, what? you wanted that data *on the
 disk*? sorry boss! and then proceeds to destroy the world by starving
 the rest of the system for I/O for many seconds or minutes at a time.
 
 Ok, parts of that are somewhat expected. It *may* depend on the
 underlying filesystem. Some of them handle fsync better than others. If
 you are syncing the whole file though when you call fsync then you are
 potentially burned by having to writeback dirty_ratio amounts of memory
 which could take a substantial amount of time.
 
 We've made some desultory attempts to use sync_file_range() to improve
 things here, but I'm not sure that's really the right tool, and if it
 is we don't know how to use it well enough to obtain consistent
 positive results.
 
 That implies that either sync_file_range() is broken in some fashion we
 (or at least I) are not aware of and that needs kicking.
 
 Let me try to explain the problem: Checkpoints can cause an I/O
 spike, which slows down other processes.
 
 When it's time to perform a checkpoint, PostgreSQL will write() all
 dirty buffers from the PostgreSQL buffer cache, and finally perform
 an fsync() to flush the writes to disk. After that, we know the data
 is safely on disk.
 
 In older PostgreSQL versions, the write() calls would cause an I/O
 storm as the OS cache quickly fills up with dirty pages, up to
 dirty_ratio, and after that all subsequent write()s block. That's OK
 as far as the checkpoint is concerned, but it significantly slows
 down queries running at the same time. Even a read-only query often
 needs to write(), to evict a dirty page from the buffer cache to
 make room for a different page. We made that less painful by adding
 sleeps between the write() calls, so that they are trickled over a
 long period of time and hopefully stay below dirty_ratio at all
 times.
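
A minimal sketch of that trickled-write pattern, assuming one file descriptor
per relation segment and ignoring all error handling (the pacing below is
arbitrary; choosing it well is the hard part), might look like:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Illustrative only: write checkpoint blocks in small paced steps and ask
 * the kernel to start asynchronous writeback of each one, so dirty memory
 * never piles up towards dirty_ratio and the final fsync() has little left
 * to do.  Function and parameter names are made up. */
static void trickle_checkpoint(int fd, const char *buf,
                               size_t nblocks, size_t blksz)
{
    for (size_t i = 0; i < nblocks; i++)
    {
        off_t off = (off_t) (i * blksz);

        pwrite(fd, buf + i * blksz, blksz, off);

        /* Start async writeback of this block; does not wait for it. */
        sync_file_range(fd, off, blksz, SYNC_FILE_RANGE_WRITE);

        usleep(10000);          /* pacing sleep between blocks */
    }
    fsync(fd);                  /* the durability point */
}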
  Hum, I wonder whether you see any difference with reasonably recent
kernels (say newer than 3.2). Because those have IO-less dirty throttling.
That means that:
  a) checkpointing thread (or other threads blocked due to dirty limit)
won't issue IO on their own but rather wait for flusher thread to do the
work.
  b) there should be more noticeable difference between the delay imposed
on heavily dirtying thread (i.e. the checkpointing thread) and the delay
imposed on lightly dirtying thread (that's what I would expect from those
threads having to do occasional page eviction to make room for other page).

 However, we still have to perform the fsync()s after the
 writes(), and sometimes that still causes a similar I/O storm.
  Because there is still quite some dirty data in the page cache or because
e.g. ext3 has to flush a lot of unrelated dirty data?

 The checkpointer is not in a hurry. A checkpoint typically has 10-30
 minutes to finish, before it's time to start the next checkpoint,
 and even if it misses that deadline that's not too serious either.
 But the OS doesn't know that, and we have no way of telling it.
 
 As a quick fix, some sort of a lazy fsync() call would be nice. It
 would behave just like fsync() but it would not change the I/O
 scheduling at all. Instead, it would sleep until all the pages have
 been flushed to disk, at the speed they would've been without the
 fsync() call.
 
 Another approach would be to give the I/O that the checkpointer
 process initiates a lower priority. This would be slightly
 preferable, because PostgreSQL could then issue the writes() as fast
 as it can, and have the checkpoint finish earlier when there's not
 much other load. Last I looked into this (which was a long time
 ago), there was no suitable priority system for writes, only reads.
  Well, IO priority works for writes in principle, the trouble is it
doesn't work for writes which end up just in the page cache. Then writeback
of page cache is usually done by flusher thread so that's completely
disconnected from whoever created the dirty data (now I know this is dumb
and long term we want to do something about it so that IO cgroups 

Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Dave Chinner
On Mon, Jan 13, 2014 at 09:29:02PM +, Greg Stark wrote:
 On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund and...@2ndquadrant.com wrote:
  For one, postgres doesn't use mmap for files (and can't without major
  new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has
  horrible consequences for performance/scalability - very quickly you
  contend on locks in the kernel.
 
 I may as well dump this in this thread. We've discussed this in person
 a few times, including at least once with Ted T'so when he visited
 Dublin last year.
 
 The fundamental conflict is that the kernel understands better the
 hardware and other software using the same resources, Postgres
 understands better its own access patterns. We need to either add
 interfaces so Postgres can teach the kernel what it needs about its
 access patterns or add interfaces so Postgres can find out what it
 needs to know about the hardware context.

In my experience applications don't need to know anything about the
underlying storage hardware - all they need is for someone to 
tell them the optimal IO size and alignment to use.

 The more ambitious and interesting direction is to let Postgres tell
 the kernel what it needs to know to manage everything. To do that we
 would need the ability to control when pages are flushed out. This is
 absolutely necessary to maintain consistency. Postgres would need to
 be able to mark pages as unflushable until some point in time in the
 future when the journal is flushed. We discussed various ways that
 interface could work but it would be tricky to keep it low enough
 overhead to be workable.

IMO, the concept of allowing userspace to pin dirty page cache
pages in memory is just asking for trouble. Apart from the obvious
memory reclaim and OOM issues, some filesystems won't be able to
move their journals forward until the data is flushed. i.e. ordered
mode data writeback on ext3 will have all sorts of deadlock issues
that result from pinning pages and then issuing fsync() on another
file which will block waiting for the pinned pages to be flushed.

Indeed, what happens if you do pin_dirty_pages(fd) followed by fsync(fd)?
If fsync() blocks because there are pinned pages, and there's no
other thread to unpin them, then that code just deadlocked. If
fsync() doesn't block and skips the pinned pages, then we haven't
done an fsync() at all, and so violated the expectation that users
have that after fsync() returns their data is safe on disk. And if
we return an error to fsync(), then what the hell does the user do
if it is some other application we don't know about that has pinned
the pages? And if the kernel unpins them after some time, then we
just violated the application's consistency guarantees

Hmmm.  What happens if the process crashes after pinning the dirty
pages?  How do we even know what process pinned the dirty pages so
we can clean up after it? What happens if the same page is pinned by
multiple processes? What happens on truncate/hole punch if the
partial pages in the range that need to be zeroed and written are
pinned? What happens if we do direct IO to a range with pinned,
unflushable pages in the page cache?

These are all complex corner cases that are introduced by allowing
applications to pin dirty pages in memory. I've only spent a few
minutes coming up with these, and I'm sure there's more of them.
As such, I just don't see that allowing userspace to pin dirty
page cache pages in memory being a workable solution.

 The less exciting, more conservative option would be to add kernel
 interfaces to teach Postgres about things like raid geometries. Then

/sys/block/dev/queue/* contains all the information that is
exposed to filesystems to optimise layout for storage geometry.
Some filesystems can already expose the relevant parts of this
information to userspace, others don't.

What I think we really need to provide is a generic interface
similar to the old XFS_IOC_DIOINFO ioctl that can be used to
expose IO characteristics to applications in a simple, easy to
gather manner.  Something like:

struct io_info {
        u64 minimum_io_size;      /* sector size */
        u64 maximum_io_size;      /* currently 2GB */
        u64 optimal_io_size;      /* stripe unit/width */
        u64 optimal_io_alignment; /* stripe unit/width */
        u64 mem_alignment;        /* PAGE_SIZE */
        u32 queue_depth;          /* max IO concurrency */
};
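
A consumer of such an interface would mostly just round its IO to the
reported values; a rough sketch (the helper below is hypothetical, and the
u64/u32 fields above correspond to uint64_t/uint32_t in userspace):

#include <stdint.h>

/* Hypothetical helper: pick an IO size from the values reported by the
 * proposed interface above, rounding up to the optimal size and clamping
 * to the reported minimum/maximum. */
static uint64_t choose_io_size(uint64_t want, uint64_t optimal,
                               uint64_t minimum, uint64_t maximum)
{
    uint64_t sz = ((want + optimal - 1) / optimal) * optimal;

    if (sz < minimum)
        sz = minimum;
    if (sz > maximum)
        sz = maximum;
    return sz;
}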

 Postgres could use directio and decide to do prefetching based on the
 raid geometry,

Underlying storage array raid geometry and optimal IO sizes for the
filesystem may be different. Hence you want what the filesystem
considers optimal, not what the underlying storage is configured
with. Indeed, a filesystem might be able to supply per-file IO
characteristics depending on where it is located in the filesystem
(think tiered storage)

 how much available i/o bandwidth and iops is available,
 etc.

The kernel 

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Dave Chinner
On Mon, Jan 13, 2014 at 03:24:38PM -0800, Josh Berkus wrote:
 On 01/13/2014 02:26 PM, Mel Gorman wrote:
  Really?
  
  zone_reclaim_mode is often a complete disaster unless the workload is
  partitioned to fit within NUMA nodes. On older kernels enabling it would
  sometimes cause massive stalls. I'm actually very surprised to hear it
  fixes anything and would be interested in hearing more about what sort
  of circumstances would convince you to enable that thing.
 
 So the problem with the default setting is that it pretty much isolates
 all FS cache for PostgreSQL to whichever socket the postmaster is
 running on, and makes the other FS cache unavailable.  This means that,
 for example, if you have two memory banks, then only one of them is
 available for PostgreSQL filesystem caching ... essentially cutting your
 available cache in half.

No matter what default NUMA allocation policy we set, there will be
an application for which that behaviour is wrong. As such, we've had
tools for setting application specific NUMA policies for quite a few
years now. e.g:

$ man 8 numactl

   --interleave=nodes, -i nodes
  Set a memory interleave policy. Memory will be
  allocated using round robin on nodes.  When memory
  cannot be allocated on the current interleave target
  fall back to other nodes.  Multiple nodes may be
  specified on --interleave, --membind and
  --cpunodebind.
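
For example (assuming the numactl utility is installed), the postmaster can
be started under an interleaved policy with something like:

numactl --interleave=all postgres -D /path/to/data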

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Dave Chinner
On Tue, Jan 14, 2014 at 02:26:25AM +0100, Andres Freund wrote:
 On 2014-01-13 17:13:51 -0800, James Bottomley wrote:
  a file into a user provided buffer, thus obtaining a page cache entry
  and a copy in their userspace buffer, then insert the page of the user
  buffer back into the page cache as the page cache page ... that's right,
  isn't it postgress people?
 
 Pretty much, yes. We'd probably hint (*advise(DONTNEED)) that the page
 isn't needed anymore when reading. And we'd normally write if the page
 is dirty.

So why, exactly, do you even need the kernel page cache here? You've
got direct access to the copy of data read into userspace, and you
want direct control of when and how the data in that buffer is
written and reclaimed. Why push that data buffer back into the
kernel and then have to add all sorts of kernel interfaces to
control the page you already have control of?

  Effectively you end up with buffered read/write that's also mapped into
  the page cache.  It's a pretty awful way to hack around mmap.
 
 Well, the problem is that you can't really use mmap() for the things we
 do. Postgres' durability works by guaranteeing that our journal entries
 (called WAL := Write Ahead Log) are written & synced to disk before the
 corresponding entries of tables and indexes reach the disk. That also
 allows to group together many random-writes into a few contiguous writes
 fdatasync()ed at once. Only during a checkpointing phase the big bulk of
 the data is then (slowly, in the background) synced to disk.

Which is the exact algorithm most journalling filesystems use for
ensuring durability of their metadata updates.  Indeed, here's an
interesting piece of architecture that you might like to consider:

* Neither XFS nor BTRFS use the kernel page cache to back their
  metadata transaction engines.

Why not? Because the page cache is too simplistic to adequately
represent the complex object hierarchies that the filesystems have
and so its flat LRU reclaim algorithms and writeback control
mechanisms are a terrible fit and cause lots of performance issues
under memory pressure.

IOWs, the two most complex high performance transaction engines in
the Linux kernel have moved to fully customised cache and (direct)
IO implementations because the requirements for scalability and
performance are far more complex than the kernel page cache
infrastructure can provide.

Just food for thought

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread James Bottomley
On Mon, 2014-01-13 at 19:48 -0500, Trond Myklebust wrote:
 On Jan 13, 2014, at 19:03, Hannu Krosing ha...@2ndquadrant.com wrote:
 
  On 01/13/2014 09:53 PM, Trond Myklebust wrote:
  On Jan 13, 2014, at 15:40, Andres Freund and...@2ndquadrant.com wrote:
  
  On 2014-01-13 15:15:16 -0500, Robert Haas wrote:
  On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner kgri...@ymail.com 
  wrote:
  I notice, Josh, that you didn't mention the problems many people
  have run into with Transparent Huge Page defrag and with NUMA
  access.
  Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
  setting zone_reclaim_mode; is there some other problem besides that?
  I think that fixes some of the worst instances, but I've seen machines
  spending horrible amounts of CPU (& BUS) time in page reclaim
  nonetheless. If I analyzed it correctly it's in RAM > working set
  workloads where RAM is pretty large and most of it is used as page
  cache. The kernel ends up spending a huge percentage of time finding and
  potentially defragmenting pages when looking for victim buffers.
  
  On a related note, there's also the problem of double-buffering.  When
  we read a page into shared_buffers, we leave a copy behind in the OS
  buffers, and similarly on write-out.  It's very unclear what to do
  about this, since the kernel and PostgreSQL don't have intimate
  knowledge of what each other are doing, but it would be nice to solve
  somehow.
  I've wondered before if there wouldn't be a chance for postgres to say
  "my dear OS, that the file range 0-8192 of file x contains y, no need to
  reread" and do that when we evict a page from s_b, but I never dared to
  actually propose that to kernel people...
  O_DIRECT was specifically designed to solve the problem of double 
  buffering 
  between applications and the kernel. Why are you not able to use that in 
  these situations?
  What is asked is the opposite of O_DIRECT - the write from a buffer inside
  postgresql to linux *buffercache* and telling linux that it is the same
  as what is currently on disk, so don't bother to write it back ever.
 
 I don’t understand. Are we talking about mmap()ed files here? Why
 would the kernel be trying to write back pages that aren’t dirty?

No ... if I have it right, it's pretty awful: they want to do a read of
a file into a user provided buffer, thus obtaining a page cache entry
and a copy in their userspace buffer, then insert the page of the user
buffer back into the page cache as the page cache page ... that's right,
isn't it postgress people?

Effectively you end up with buffered read/write that's also mapped into
the page cache.  It's a pretty awful way to hack around mmap.

James




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extending BASE_BACKUP in replication protocol: incremental backup and backup format

2014-01-14 Thread Magnus Hagander
On Jan 14, 2014 2:44 PM, Andres Freund and...@2ndquadrant.com wrote:

 On 2014-01-14 14:42:36 +0100, Magnus Hagander wrote:
  On Tue, Jan 14, 2014 at 2:41 PM, Andres Freund and...@2ndquadrant.com wrote:
 
   On 2014-01-14 14:40:46 +0100, Magnus Hagander wrote:
    On Tue, Jan 14, 2014 at 2:18 PM, Andres Freund and...@2ndquadrant.com wrote:
 
     On 2014-01-14 14:12:46 +0100, Magnus Hagander wrote:
      Either way - if we can do this in a safe way, it sounds like a good idea.
      It would be sort of like rsync, except relying on the fact that we can
      look at the LSN and don't have to compare the actual files, right?
 
     Which is an advantage, yes. On the other hand, it doesn't fix problems
     with a subtly broken replica, e.g. after a bug in replay, or disk
     corruption.
 
    Right. But neither does rsync, right?
 
   Hm? Rsync's really only safe with --checksum and with that it definitely
   should fix those?
  
  
  I think we're talking about different scenarios.

 Sounds like it.

  I thought you were talking about a backup taken from a replica, that
  already has corruption. rsync checksums surely aren't going to help with
  that?

 I was talking about updating a standby using such an incremental or
 differential backup from the primary (or a standby higher up in the
 cascade). If your standby is corrupted in any way a rsync --checksum
 will certainly correct errors if it syncs from a correct source?

Sure, but as I understand it that's not at all the scenario that the
suggested functionality is for. You can still use rsync for that, I don't
think anybody suggested removing that ability. Replicas weren't the
target...

/Magnus


Re: [HACKERS] extension_control_path

2014-01-14 Thread Dimitri Fontaine
Tom Lane t...@sss.pgh.pa.us writes:
 Why is that a good idea?  It's certainly not going to simplify DBAs'
 lives, more the reverse.  (This dump won't reload. Uh, where did
 you get that extension from? Ummm...)

The latest users for the feature are the Red Hat team working on Open
Shift where they want to have co-existing per-user PostgreSQL clusters
on a machine, each with its own set of extensions.

Having extension_control_path also allows installing extension files in
a place not owned by root.

Lastly, as a developer, you might enjoy being able to have your own
non-system-global place to install extensions, as Andres did explain on
this list not too long ago.

 Assuming that there is some need for loading extensions from nonstandard
 places, would it be better to just allow a filename specification in
 CREATE EXTENSION?  (I don't know the answer, since the use-case isn't
 apparent to me in the first place, but it seems worth asking.)

In the extension_control_path idea, we are still addressing needs where
the people managing the OS and the database are distinct sets. The GUC
allows the system admins to setup PostgreSQL the way they want, then the
database guy doesn't need to know anything about that at CREATE
EXTENSION time.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread James Bottomley
On Tue, 2014-01-14 at 15:39 +0100, Hannu Krosing wrote:
 On 01/14/2014 09:39 AM, Claudio Freire wrote:
  On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing ha...@2ndquadrant.com 
  wrote:
  Again, as said above the linux file system is doing fine. What we
  want is a few ways to interact with it to let it do even better when
  working with postgresql by telling it some stuff it otherwise would
  have to second guess and by sometimes giving it back some cache
  pages which were copied away for potential modifying but ended
  up clean in the end.
  You don't need new interfaces. Only a slight modification of what
  fadvise DONTNEED does.
 
  This insistence in injecting pages from postgres to kernel is just a
  bad idea. 
 Do you think it would be possible to map copy-on-write pages
 from linux cache to postgresql cache?
 
 This would be a step in the direction of solving the double-ram-usage
 of pages which have not been read from syscache to postgresql
 cache without sacrificing linux read-ahead (which I assume does
 not happen when reads bypass system cache).

The current mechanism for coherency between a userspace cache and the
in-kernel page cache is mmap ... that's the only way you get the same
page in both currently.

glibc used to have an implementation of read/write in terms of mmap, so
it should be possible to insert it into your current implementation
without a major rewrite.  The problem I think this brings you is
uncontrolled writeback: you don't want dirty pages to go to disk until
you issue a write().  I think we could fix this with another madvise():
something like MADV_WILLUPDATE telling the page cache we expect to alter
the pages again, so don't be aggressive about cleaning them.  Plus all
the other issues with mmap() ... but if you can detail those, we might
be able to fix them.

 and we can write back the copy at the point when it is safe (from
 postgresql perspective)  to let the system write them back ?

Using MADV_WILLUPDATE, possibly ... you're still not going to have
absolute control.  The kernel will write back the pages if the dirty
limits are exceeded, for instance, but we could tune it to be useful.

 Do you think it is possible to make it work with good performance
 for a few million 8kb pages ?
 
  At the very least, it still needs postgres to know too much
  of the filesystem (block layout) to properly work. Ie: pg must be
  required to put entire filesystem-level blocks into the page cache,
  since that's how the page cache works. 
 I was more thinking of a simple write() interface with extra
 flags/sysctls to tell kernel that we already have this on disk
  At the very worst, it may
  introduce serious security and reliability implications, when
  applications can destroy the consistency of the page cache (even if
  full access rights are checked, there's still the possibility this
  inconsistency might be exploitable).
 If you allow write() which just writes clean pages, I can not see
 where the extra security concerns are beyond what normal
 write can do.

The problem is we can't give you absolute control of when pages are
written back because that interface can be used to DoS the system: once
we get too many dirty uncleanable pages, we'll thrash looking for memory
and the system will livelock.

James




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Trond Myklebust

On Jan 14, 2014, at 10:39, Tom Lane t...@sss.pgh.pa.us wrote:

 James Bottomley james.bottom...@hansenpartnership.com writes:
 The current mechanism for coherency between a userspace cache and the
 in-kernel page cache is mmap ... that's the only way you get the same
 page in both currently.
 
 Right.
 
 glibc used to have an implementation of read/write in terms of mmap, so
 it should be possible to insert it into your current implementation
 without a major rewrite.  The problem I think this brings you is
 uncontrolled writeback: you don't want dirty pages to go to disk until
 you issue a write()
 
 Exactly.
 
 I think we could fix this with another madvise():
 something like MADV_WILLUPDATE telling the page cache we expect to alter
 the pages again, so don't be aggressive about cleaning them.
 
 Don't be aggressive isn't good enough.  The prohibition on early write
 has to be absolute, because writing a dirty page before we've done
 whatever else we need to do results in a corrupt database.  It has to
 be treated like a write barrier.

Then why are you dirtying the page at all? It makes no sense to tell the kernel 
“we’re changing this page in the page cache, but we don’t want you to change it 
on disk”: that’s not consistent with the function of a page cache.

 The problem is we can't give you absolute control of when pages are
 written back because that interface can be used to DoS the system: once
 we get too many dirty uncleanable pages, we'll thrash looking for memory
 and the system will livelock.
 
 Understood, but that makes this direction a dead end.  We can't use
 it if the kernel might decide to write anyway.
 
   regards, tom lane



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] extension_control_path

2014-01-14 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 Dimitri Fontaine dimi...@2ndquadrant.fr writes:
  Please find attached to this email a patch implementing a new GUC that
  allows users to setup a list of path where PostgreSQL will search for
  the extension control files at CREATE EXTENSION time.
 
 Why is that a good idea?  It's certainly not going to simplify DBAs'
 lives, more the reverse.  (This dump won't reload. Uh, where did
 you get that extension from? Ummm...)

We *already* have that problem.  I don't think this makes it
particularly worse- you still need to go back to the old box and look at
what came from where.  Sure, you *might* be lucky enough to find the
right extension by guessing at what packages were installed or searching
for one that looks like the correct one, but then, you might discover
that the version available isn't the right version for the database
you're trying to restore anyway.  Indeed, this might allow folks who
don't particularly care for package systems to build consistent dumps
without having to worry quite as much about what the package system is
doing.

 Assuming that there is some need for loading extensions from nonstandard
 places, would it be better to just allow a filename specification in
 CREATE EXTENSION?  (I don't know the answer, since the use-case isn't
 apparent to me in the first place, but it seems worth asking.)

For my 2c, I could absolutely see it as being worthwhile to have an
independent directory to install not-from-package extensions.  That
would keep things which are managed by the package system and things
which are installed independent separate, which is absolutely a good
thing, imv.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-14 Thread Florian Pflug
On Jan14, 2014, at 11:06 , David Rowley dgrowle...@gmail.com wrote:
 Here's a patch which removes sum(numeric) and changes the documents a little 
 to remove a reference to using sum(numeric) to workaround the fact that 
 there's no inverse transitions for sum(float). I also made a small change in 
 the aggregates.sql tests to check that the aggfnoid = .

I've looked over the patch, here a few comments.

For STRICT pairs of transfer and inverse transfer functions we should complain 
if any of them ever return NULL. That can never be correct anyway, since a 
STRICT function cannot undo a non-NULL -> NULL transition.

The same goes for non-STRICT, at least if we ever want to allow an inverse 
transfer function to indicate "sorry, cannot undo in this case, please rescan". 
If we allowed NULL as just another state value now, we couldn't easily undo 
that later, so we'd rob ourselves of the obvious way for the inverse transfer 
function to indicate this condition to its caller.

The notnullcount machinery seems to apply to both STRICT and non-STRICT 
transfer function pairs. Shouldn't that be constrained to STRICT transfer 
function pairs? For non-STRICT pairs, it's up to the transfer functions to deal 
with NULL inputs however they please, no?

The logic around movedaggbase in eval_windowaggregates() seems a bit 
convoluted. Couldn't the if be moved before the code that pulls aggregatedbase 
up to frameheadpos using the inverse aggregation function? 

Also, could we easily check whether the frames corresponding to the individual 
rows are all either identical or disjoint, and don't use the inverse transfer 
function then? Currently, for a frame which contains either just the current 
row, or all the current row's peers, I think we'd use the inverse transfer 
function to fully un-add the old frame, and then add back the new frame.

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Robert Haas
On Tue, Jan 14, 2014 at 11:44 AM, James Bottomley
james.bottom...@hansenpartnership.com wrote:
 No, I'm sorry, that's never going to be possible.  No user space
 application has all the facts.  If we give you an interface to force
 unconditional holding of dirty pages in core you'll livelock the system
 eventually because you made a wrong decision to hold too many dirty
 pages.   I don't understand why this has to be absolute: if you advise
 us to hold the pages dirty and we do up until it becomes a choice to
 hold on to the pages or to thrash the system into a livelock, why would
 you ever choose the latter?  And if, as I'm assuming, you never would,
 why don't you want the kernel to make that choice for you?

If you don't understand how write-ahead logging works, this
conversation is going nowhere.  Suffice it to say that the word
ahead is not optional.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Claudio Freire
On Tue, Jan 14, 2014 at 1:48 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jan 14, 2014 at 11:44 AM, James Bottomley
 james.bottom...@hansenpartnership.com wrote:
 No, I'm sorry, that's never going to be possible.  No user space
 application has all the facts.  If we give you an interface to force
 unconditional holding of dirty pages in core you'll livelock the system
 eventually because you made a wrong decision to hold too many dirty
 pages.   I don't understand why this has to be absolute: if you advise
 us to hold the pages dirty and we do up until it becomes a choice to
 hold on to the pages or to thrash the system into a livelock, why would
 you ever choose the latter?  And if, as I'm assuming, you never would,
 why don't you want the kernel to make that choice for you?

 If you don't understand how write-ahead logging works, this
 conversation is going nowhere.  Suffice it to say that the word
 ahead is not optional.


In essence, if you do flush when you shouldn't, and there is a
hardware failure, or kernel panic, or anything that stops the rest of
the writes from succeeding, your database is kaputt, and you've got to
restore a backup.

Ie: very very bad.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Heikki Linnakangas

On 01/14/2014 06:08 PM, Tom Lane wrote:

Trond Myklebust tron...@gmail.com writes:

On Jan 14, 2014, at 10:39, Tom Lane t...@sss.pgh.pa.us wrote:

Don't be aggressive isn't good enough.  The prohibition on early write
has to be absolute, because writing a dirty page before we've done
whatever else we need to do results in a corrupt database.  It has to
be treated like a write barrier.



Then why are you dirtying the page at all? It makes no sense to tell the kernel 
“we’re changing this page in the page cache, but we don’t want you to change it 
on disk”: that’s not consistent with the function of a page cache.


As things currently stand, we dirty the page in our internal buffers,
and we don't write it to the kernel until we've written and fsync'd the
WAL data that needs to get to disk first.  The discussion here is about
whether we could somehow avoid double-buffering between our internal
buffers and the kernel page cache.


To be honest, I think the impact of double buffering in real-life 
applications is greatly exaggerated. If you follow the usual guideline 
and configure shared_buffers to 25% of available RAM, at worst you're 
wasting 25% of RAM to double buffering. That's significant, but it's not 
the end of the world, and it's a problem that can be compensated by 
simply buying more RAM.


Of course, if someone can come up with an easy way to solve that, that'd 
be great, but if it means giving up other advantages that we get from 
relying on the OS page cache, then -1 from me. The usual response to 
"why don't you just use O_DIRECT?" is that it'd require reimplementing a 
lot of I/O infrastructure, but that misses an IMHO more important point: it 
would require setting shared_buffers a lot higher to get the same level 
of performance you get today. That has a number of problems:


1. It becomes a lot more important to tune shared_buffers correctly. Set 
it too low, and you're not taking advantage of all the RAM available. 
Set it too high, and you'll start swapping, totally killing performance. 
I can already hear consultants rubbing their hands, waiting for the rush 
of customers that will need expert help to determine the optimal 
shared_buffers setting.


2. Memory spent on the buffer cache can't be used for other things. For 
example, an index build can temporarily allocate several gigabytes of 
memory; if that memory is allocated to the shared buffer cache, it can't 
be used for that purpose. Yeah, we could change that, and allow 
borrowing pages from the shared buffer cache for other purposes, but 
that means more work and more code.


3. Memory used for the shared buffer cache can't be used by other 
processes (without swapping). It becomes a lot harder to be a good 
citizen on a system that's not entirely dedicated to PostgreSQL.


So not only would we need to re-implement I/O infrastructure, we'd also 
need to make memory management a lot smarter and a lot more flexible. 
We'd need a lot more information on what else is running on the system 
and how badly they need memory.



I personally think there is no chance of using mmap for that; the
semantics of mmap are pretty much dictated by POSIX and they don't work
for this.


Agreed. It would be possible to use mmap() for pages that are not 
modified, though. When you're not modifying, you could mmap() the data 
you need, and bypass the PostgreSQL buffer cache that way. The 
interaction with the buffer cache becomes complicated, because you 
couldn't use the buffer cache's locks etc., and some pages might have a 
newer version in the buffer cache than on-disk, but it might be doable.
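
A bare-bones sketch of that read-only path, assuming the default 8 kB block
size and ignoring all of the locking and invalidation questions just
mentioned:

#include <sys/mman.h>
#include <unistd.h>

#define BLCKSZ 8192             /* PostgreSQL's default block size */

/* Illustrative only: map one relation block read-only straight from the
 * kernel page cache instead of copying it into shared_buffers. */
static const void *map_block_readonly(int fd, off_t blocknum)
{
    void *p = mmap(NULL, BLCKSZ, PROT_READ, MAP_SHARED,
                   fd, blocknum * BLCKSZ);
    return (p == MAP_FAILED) ? NULL : p;
}

static void unmap_block(const void *p)
{
    if (p != NULL)
        munmap((void *) p, BLCKSZ);
}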


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Robert Haas
On Tue, Jan 14, 2014 at 11:57 AM, James Bottomley
james.bottom...@hansenpartnership.com wrote:
 On Tue, 2014-01-14 at 11:48 -0500, Robert Haas wrote:
 On Tue, Jan 14, 2014 at 11:44 AM, James Bottomley
 james.bottom...@hansenpartnership.com wrote:
  No, I'm sorry, that's never going to be possible.  No user space
  application has all the facts.  If we give you an interface to force
  unconditional holding of dirty pages in core you'll livelock the system
  eventually because you made a wrong decision to hold too many dirty
  pages.   I don't understand why this has to be absolute: if you advise
  us to hold the pages dirty and we do up until it becomes a choice to
  hold on to the pages or to thrash the system into a livelock, why would
  you ever choose the latter?  And if, as I'm assuming, you never would,
  why don't you want the kernel to make that choice for you?

 If you don't understand how write-ahead logging works, this
 conversation is going nowhere.  Suffice it to say that the word
 ahead is not optional.

 No, I do ... you mean the order of write out, if we have to do it, is
 important.  In the rest of the kernel, we do this with barriers which
 causes ordered grouping of I/O chunks.  If we could force a similar
 ordering in the writeout code, is that enough?

Probably not.  There are a whole raft of problems here.  For that to
be any of any use, we'd have to move to mmap()ing each buffer instead
of read()ing them in, and apparently mmap() doesn't scale well to
millions of mappings.  And even if it did, then we'd have a solution
that only works on Linux.  Plus, as Tom pointed out, there are
critical sections where it's not just a question of ordering but in
fact you need to completely hold off writes.

In terms of avoiding double-buffering, here's my thought after reading
what's been written so far.  Suppose we read a page into our buffer
pool.  Until the page is clean, it would be ideal for the mapping to
be shared between the buffer cache and our pool, sort of like
copy-on-write.  That way, if we decide to evict the page, it will
still be in the OS cache if we end up needing it again (remember, the
OS cache is typically much larger than our buffer pool).  But if the
page is dirtied, then instead of copying it, just have the buffer pool
forget about it, because at that point we know we're going to write
the page back out anyway before evicting it.

This would be pretty similar to copy-on-write, except without the
copying.  It would just be forget-from-the-buffer-pool-on-write.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Robert Haas
On Tue, Jan 14, 2014 at 12:12 PM, Robert Haas robertmh...@gmail.com wrote:
 In terms of avoiding double-buffering, here's my thought after reading
 what's been written so far.  Suppose we read a page into our buffer
 pool.  Until the page is clean, it would be ideal for the mapping to

Correction: For so long as the page is clean...

 be shared between the buffer cache and our pool, sort of like
 copy-on-write.  That way, if we decide to evict the page, it will
 still be in the OS cache if we end up needing it again (remember, the
 OS cache is typically much larger than our buffer pool).  But if the
 page is dirtied, then instead of copying it, just have the buffer pool
 forget about it, because at that point we know we're going to write
 the page back out anyway before evicting it.

 This would be pretty similar to copy-on-write, except without the
 copying.  It would just be forget-from-the-buffer-pool-on-write.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Claudio Freire
On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas robertmh...@gmail.com wrote:

 In terms of avoiding double-buffering, here's my thought after reading
 what's been written so far.  Suppose we read a page into our buffer
 pool.  Until the page is clean, it would be ideal for the mapping to
 be shared between the buffer cache and our pool, sort of like
 copy-on-write.  That way, if we decide to evict the page, it will
 still be in the OS cache if we end up needing it again (remember, the
 OS cache is typically much larger than our buffer pool).  But if the
 page is dirtied, then instead of copying it, just have the buffer pool
 forget about it, because at that point we know we're going to write
 the page back out anyway before evicting it.

 This would be pretty similar to copy-on-write, except without the
 copying.  It would just be forget-from-the-buffer-pool-on-write.


But... either copy-on-write or forget-on-write needs a page fault, and
thus a page mapping.

Is a page fault more expensive than copying 8k?

(I really don't know).


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] plpgsql.consistent_into

2014-01-14 Thread Tom Lane
Marko Tiikkaja ma...@joh.to writes:
 On 1/14/14 12:28 PM, Marti Raudsepp wrote:
 Now, another question is whether it's possible to make the syntax
 work. Is this an assignment from the result of a subquery, or is it a
 query by itself?
 a = (SELECT foo FROM table);

 That looks like a scalar subquery, which is wrong because they can't 
 return more than one column (nor can they be INSERT etc., obviously).

Yeah, it's a scalar subquery, which means that plpgsql already assigns
a non-error meaning to this syntax.

 How about:

(a) = SELECT 1;
(a, b) = SELECT 1, 2;
(a, b) = INSERT INTO foo RETURNING col1, col2;

 Same semantics: TOO_MANY_ROWS on rows > 1, sets FOUND and row_count. 
 AFAICT this can be parsed unambiguously, too, and we don't need to look 
 at the query string because this is new syntax.

The idea of inventing new syntax along this line seems like a positive
direction to pursue.  Since assignment already rejects multiple rows
from the source expression, this wouldn't be weirdly inconsistent.

It might be worth thinking about the multiple column assignment UPDATE
syntax that's in recent versions of the SQL standard:

UPDATE targettab SET (a, b, c) = row-valued-expression [ , ... ] [ 
WHERE ... ]

We don't actually implement this in PG yet, except for trivial cases, but
it will certainly happen eventually. I think your sketch above deviates
unnecessarily from what the standard says for UPDATE.  In particular
I think it'd be better to write things like

(a, b) = ROW(1, 2);
(a, b, c) = (SELECT x, y, z FROM foo WHERE id = 42);

which would exactly match what you'd write in a multiple-assignment
UPDATE, and it has the same rejects-multiple-rows semantics too.

Also note that the trivial cases we do already implement in UPDATE
look like

UPDATE targettab SET (a, b, c) = (1, 2, 3) [ WHERE ... ]

that is, we allow a row constructor where the optional keyword ROW has
been omitted.  I think people would expect to be able to write this in
plpgsql:

(a, b) = (1, 2);

Now, this doesn't provide any guidance for INSERT/UPDATE/DELETE RETURNING,
but frankly I don't feel any need to invent new syntax for those, since
RETURNING INTO already works the way you want.

I'm not too sure what it'd take to make this work.  Right now,

SELECT (SELECT x, y FROM foo WHERE id = 42);

would generate ERROR:  subquery must return only one column, but
I think it's mostly a historical artifact that it does that rather than
returning a composite value (of an anonymous record type).  If we were
willing to make that change then it seems like it'd be pretty
straightforward to teach plpgsql to handle

(a, b, ...) = row-valued-expression

where there wouldn't actually be any need to parse the RHS any differently
from the way plpgsql parses an assignment RHS right now.  Which would be
a good thing IMO.  If we don't generalize the behavior of scalar
subqueries then plpgsql would have to jump through a lot of hoops to
support the subselect case.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Robert Haas
On Tue, Jan 14, 2014 at 12:15 PM, Claudio Freire klaussfre...@gmail.com wrote:
 On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas robertmh...@gmail.com wrote:
 In terms of avoiding double-buffering, here's my thought after reading
 what's been written so far.  Suppose we read a page into our buffer
 pool.  Until the page is clean, it would be ideal for the mapping to
 be shared between the buffer cache and our pool, sort of like
 copy-on-write.  That way, if we decide to evict the page, it will
 still be in the OS cache if we end up needing it again (remember, the
 OS cache is typically much larger than our buffer pool).  But if the
 page is dirtied, then instead of copying it, just have the buffer pool
 forget about it, because at that point we know we're going to write
 the page back out anyway before evicting it.

 This would be pretty similar to copy-on-write, except without the
 copying.  It would just be forget-from-the-buffer-pool-on-write.

 But... either copy-on-write or forget-on-write needs a page fault, and
 thus a page mapping.

 Is a page fault more expensive than copying 8k?

I don't know either.  I wasn't thinking so much that it would save CPU
time as that it would save memory.  Consider a system with 32GB of
RAM.  If you set shared_buffers=8GB, then in the worst case you've got
25% of your RAM wasted storing pages that already exist, dirtied, in
shared_buffers.  It's easy to imagine scenarios in which that results
in lots of extra I/O, so that the CPU required to do the accounting
comes to seem cheap by comparison.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Kevin Grittner
Claudio Freire klaussfre...@gmail.com wrote:
 Robert Haas robertmh...@gmail.com wrote:
 James Bottomley james.bottom...@hansenpartnership.com wrote:

 I don't understand why this has to be absolute: if you advise
 us to hold the pages dirty and we do up until it becomes a
 choice to hold on to the pages or to thrash the system into a
 livelock, why would you ever choose the latter?

Because the former creates database corruption and the latter does
not.

 And if, as I'm assuming, you never would,

That assumption is totally wrong.

 why don't you want the kernel to make that choice for you?

 If you don't understand how write-ahead logging works, this
 conversation is going nowhere.  Suffice it to say that the word
 ahead is not optional.

 In essence, if you do flush when you shouldn't, and there is a
 hardware failure, or kernel panic, or anything that stops the
 rest of the writes from succeeding, your database is kaputt, and
 you've got to restore a backup.

 Ie: very very bad.

Yup.  And when that's a few terabytes, you will certainly find
yourself wishing that you had been able to do a recovery up to the
end of the last successfully committed transaction rather than a
restore from backup.

Now, as Tom said, if there was an API to create write boundaries
between particular dirty pages we could leave it to the OS.  Each
WAL record's write would be conditional on the previous one and
each data page write would be conditional on the WAL record for the
last update to the page.  But nobody seems to think that would
yield acceptable performance.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Hannu Krosing
On 01/14/2014 05:44 PM, James Bottomley wrote:
 On Tue, 2014-01-14 at 10:39 -0500, Tom Lane wrote:
 James Bottomley james.bottom...@hansenpartnership.com writes:
 The current mechanism for coherency between a userspace cache and the
 in-kernel page cache is mmap ... that's the only way you get the same
 page in both currently.
 Right.

 glibc used to have an implementation of read/write in terms of mmap, so
 it should be possible to insert it into your current implementation
 without a major rewrite.  The problem I think this brings you is
 uncontrolled writeback: you don't want dirty pages to go to disk until
 you issue a write()
 Exactly.

 I think we could fix this with another madvise():
 something like MADV_WILLUPDATE telling the page cache we expect to alter
 the pages again, so don't be aggressive about cleaning them.
 Don't be aggressive isn't good enough.  The prohibition on early write
 has to be absolute, because writing a dirty page before we've done
 whatever else we need to do results in a corrupt database.  It has to
 be treated like a write barrier.

 The problem is we can't give you absolute control of when pages are
 written back because that interface can be used to DoS the system: once
 we get too many dirty uncleanable pages, we'll thrash looking for memory
 and the system will livelock.
 Understood, but that makes this direction a dead end.  We can't use
 it if the kernel might decide to write anyway.
 No, I'm sorry, that's never going to be possible.  No user space
 application has all the facts.  If we give you an interface to force
 unconditional holding of dirty pages in core you'll livelock the system
 eventually because you made a wrong decision to hold too many dirty
 pages.   I don't understand why this has to be absolute: if you advise
 us to hold the pages dirty and we do up until it becomes a choice to
 hold on to the pages or to thrash the system into a livelock, why would
 you ever choose the latter?  And if, as I'm assuming, you never would,
 why don't you want the kernel to make that choice for you?
The short answer is crash safety.

A database system worth its name must make sure that all data
reported as stored to clients is there even after crash.

Write ahead log is the means for that. And writing wal files and
data pages has to be in certain order to guarantee consistent
recovery after crash.

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Kevin Grittner
James Bottomley james.bottom...@hansenpartnership.com wrote:

 you mean the order of write out, if we have to do it, is
 important.  In the rest of the kernel, we do this with barriers
 which causes ordered grouping of I/O chunks.  If we could force a
 similar ordering in the writeout code, is that enough?

Unless it can be between particular pairs of pages, I don't think
performance could be at all acceptable.  Each data page has an
associated Log Sequence Number reflecting the last Write-Ahead Log
record which records a change to that page, and the referenced WAL
record must be safely persisted before the data page is allowed to
be written.  Currently, when we need to write a dirty page to the
OS, we must ensure that the WAL record is written and fsync'd
first.  We also write a WAL record for the transaction commit and
fsync it at each COMMIT, before telling the client that the COMMIT
request was successful.  (Well, at least by default; they can
choose to set synchronous_commit to off for some or all
transactions.)  If a write barrier to control this applied to
everything on the filesystem, performance would be horrible.
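
In rough pseudo-C, the rule being described is the following (the helpers
are hypothetical names, not PostgreSQL's actual xlog/bufmgr functions):

/* Sketch of the write-ahead rule described above.  The helpers are
 * hypothetical, not PostgreSQL's actual functions. */
typedef unsigned long long LSN;

extern LSN  wal_flushed_up_to(void);            /* WAL durable up to here */
extern void wal_write_and_fsync_up_to(LSN lsn); /* write + fsync WAL to lsn */
extern void write_data_page(int fd, const void *page, long offset);

static void flush_dirty_page(int fd, const void *page, long offset,
                             LSN page_lsn)
{
    /* The WAL record that last modified this page must reach disk first. */
    if (wal_flushed_up_to() < page_lsn)
        wal_write_and_fsync_up_to(page_lsn);

    /* Only now is it safe to hand the data page to the kernel. */
    write_data_page(fd, page, offset);
}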

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Exposing currentTransactionWALVolume

2014-01-14 Thread Simon Riggs
Short patch to expose a function GetCurrentTransactionWALVolume() that
gives the total number of bytes written to WAL by the current transaction.

The user interface to this information is discussed on a separate thread, so
that we don't throw the baby out with the bathwater when people
discuss UI pros and cons.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


GetCurrentTransactionWALVolume.v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] shared memory message queues

2014-01-14 Thread Robert Haas
On Mon, Dec 23, 2013 at 12:46 PM, Robert Haas robertmh...@gmail.com wrote:
 Oh, dear.  That's rather embarrassing.

 Incremental (incremental-shm-mq.patch) and full (shm-mq-v3.patch)
 patches attached.

OK, I have pushed the patches in this stack.  I'm not sure we quite
concluded the review back-and-forth but nobody really seems to have
had serious objections to this line of attack, other than wanting some
more comments which I have added.  I don't doubt that there will be
more things to tweak and tune here, and a whole lot more stuff that
needs to be built using this infrastructure, but I don't think the
code that's here is going to get better for remaining outside the tree
longer.

I decided not to change the dsm_toc patch to use functions instead of
macros as Andres suggested; the struct definition would still have
needed to be public, so any change would still need a recompile, at
least if the size of the struct changed.  It could be changed to work
with a palloc'd chunk, but then you need to worry about
context-lifespan memory leaks and it so didn't seem worth it.  I can't
imagine this having a lot of churn, anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] extension_control_path

2014-01-14 Thread Tom Lane
Dimitri Fontaine dimi...@2ndquadrant.fr writes:
 Tom Lane t...@sss.pgh.pa.us writes:
 Why is that a good idea?  It's certainly not going to simplify DBAs'
 lives, more the reverse.  (This dump won't reload. Uh, where did
 you get that extension from? Ummm...)

 The latest users for the feature are the Red Hat team working on Open
 Shift where they want to have co-existing per-user PostgreSQL clusters
 on a machine, each with its own set of extensions.

Um ... own set of installed extensions doesn't need to mean own set of
available extensions, any more than those clusters need to have their
own Postgres executables.  If the clusters *do* have their own
executables, eg because they're different PG versions, then they can
certainly also have their own $SHAREDIR trees too.  So this example
is totally without value for your case.

 Having extension_control_path also allows to install extension files in
 a place not owned by root.

As far as the control files go, there's nothing saying that
$SHAREDIR/extension has to be root-owned.  If there are .so's involved,
I do not believe the Red Hat crew is asking you to support loading .so's
from non-root-owned dirs, because that'd be against their own corporate
security policies.  (But in any case, where we find the control and SQL
files need not have anything to do with where the .so's are.)

 Lastly, as a developer, you might enjoy being able to have your own
 non-system-global place to install extensions, as Andres did explain on
 this list not too long ago.

And again, if you're working on a development version, $SHAREDIR/extension
is probably owned by you anyway.

I don't see that any of these scenarios create a need to install extension
files anywhere but $SHAREDIR/extension.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Jeff Janes
On Mon, Jan 13, 2014 at 2:36 PM, Mel Gorman mgor...@suse.de wrote:

 On Mon, Jan 13, 2014 at 06:27:03PM -0200, Claudio Freire wrote:
  On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby j...@nasby.net wrote:
   On 1/13/14, 2:19 PM, Claudio Freire wrote:
  
   On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas robertmh...@gmail.com
   wrote:
  
    On a related note, there's also the problem of double-buffering.  When
    we read a page into shared_buffers, we leave a copy behind in the OS
   buffers, and similarly on write-out.  It's very unclear what to do
   about this, since the kernel and PostgreSQL don't have intimate
   knowledge of what each other are doing, but it would be nice to solve
   somehow.
  
  
  
   There you have a much harder algorithmic problem.
  
   You can basically control duplication with fadvise and WONTNEED. The
   problem here is not the kernel and whether or not it allows postgres
   to be smart about it. The problem is... what kind of smarts
   (algorithm) to use.
  
  
   Isn't this a fairly simple matter of when we read a page into shared buffers
   tell the kernel to forget that page? And a corollary to that for when we
   dump a page out of shared_buffers (here kernel, please put this back into
   your cache).
 
 
  That's my point. In terms of kernel-postgres interaction, it's fairly
  simple.
 
  What's not so simple is figuring out what policy to use. Remember,
  you cannot tell the kernel to put some page in its page cache without
  reading it or writing it. So, once you make the kernel forget a page,
  evicting it from shared buffers becomes quite expensive.

 posix_fadvise(POSIX_FADV_WILLNEED) is meant to cover this case by
 forcing readahead.


But telling the kernel to forget a page, then telling it to read it in
again from disk because it might be needed again in the near future is
itself very expensive.  We would need to hand the page to the kernel so it
has it without needing to go to disk to get it.
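
A minimal sketch of that round trip in terms of the bare syscalls (the
8 kB block size and the file name are placeholders, and this is not
PostgreSQL code, just the sequence being described):

/*
 * Drop a block from the OS page cache once we hold our own copy, then
 * ask for readahead before an expected re-read.  The second call is
 * the extra IO being pointed out above: the data has to come back
 * from disk even if memory was plentiful all along.
 */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#define BLCKSZ 8192

int main(void)
{
    char buf[BLCKSZ];
    int fd = open("relation.dat", O_RDONLY);    /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    /* Read one block into our own buffer (our "shared_buffers" copy). */
    if (pread(fd, buf, BLCKSZ, 0) != BLCKSZ) { perror("pread"); return 1; }

    /* Hint: we have our own copy, the kernel may drop its copy now. */
    (void) posix_fadvise(fd, 0, BLCKSZ, POSIX_FADV_DONTNEED);

    /* Later, before evicting our copy but expecting to need the block
     * again soon, ask the kernel to start reading it back in. */
    (void) posix_fadvise(fd, 0, BLCKSZ, POSIX_FADV_WILLNEED);

    close(fd);
    return 0;
}
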


 If you evict it prematurely then you do get kinda
 screwed because you pay the IO cost to read it back in again even if you
 had enough memory to cache it. Maybe this is the type of kernel-postgres
 interaction that is annoying you.

 If you don't evict, the kernel eventually steps in and evicts the wrong
 thing. If you do evict and it was unnecessary, you pay an IO cost.

 That could be something we look at. There are cases buried deep in the
 VM where pages get shuffled to the end of the LRU and get tagged for
 reclaim as soon as possible. Maybe you need access to something like
 that via posix_fadvise to say reclaim this page if you need memory but
 leave it resident if there is no memory pressure or something similar.
 Not exactly sure what that interface would look like or offhand how it
 could be reliably implemented.


I think the "reclaim this page if you need memory but leave it resident
if there is no memory pressure" hint would be more useful for temporary
working files than for what was being discussed above (shared buffers).
When I do work that needs large temporary files, I often see physical
write IO spike while physical read IO does not.  I interpret that to mean
that the temporary data is being written to disk to satisfy either
dirty_expire_centisecs or dirty_*bytes, but the data remains in the FS
cache, so disk reads are not needed to read it back.  So a hint that says
"this file will never be fsynced, so please ignore dirty_*bytes and
dirty_expire_centisecs; I will need it again relatively soon (but not
after a reboot), and mostly sequentially, so please don't evict it
without need, but if you do need to then it is a good candidate" would
be good.
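
No such hint exists today.  The sketch below uses a made-up advice
constant purely to show where a temporary-file annotation of that kind
might sit in the code if the kernel ever offered one; the flag name and
value are invented, and the call as written would simply fail with
EINVAL on a real system:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

/* Made-up value, not a real posix_fadvise flag. */
#define POSIX_FADV_TEMPFILE  1001

void hint_temp_file(int fd, off_t len)
{
    /*
     * With a real kernel interface, this would ask writeback to ignore
     * dirty_expire_centisecs / dirty_*bytes for this range and to keep
     * the pages resident unless there is genuine memory pressure.
     */
    (void) posix_fadvise(fd, 0, len, POSIX_FADV_TEMPFILE);
}
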

Cheers,

Jeff


Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe

2014-01-14 Thread Simon Riggs
On 7 July 2013 14:24, Simon Riggs si...@2ndquadrant.com wrote:
 On 3 January 2012 18:42, Tom Lane t...@sss.pgh.pa.us wrote:
 I wrote:
 Another point that requires some thought is that switching SnapshotNow
 to be MVCC-based will presumably result in a noticeable increase in each
 backend's rate of wanting to acquire snapshots.

 BTW, I wonder if this couldn't be ameliorated by establishing some
 ground rules about how up-to-date a snapshot really needs to be.
 Arguably, it should be okay for successive SnapshotNow scans to use the
 same snapshot as long as we have not acquired a new lock in between.
 If not, reusing an old snap doesn't introduce any race condition that
 wasn't there already.

 Now that has been implemented using the above design, we can resubmit
 the lock level reduction patch, with thanks to Robert.

 Submitted patch passes original complaint/benchmark.

 Changes
 * various forms of ALTER TABLE, notably ADD constraint and VALIDATE
 * CREATE TRIGGER

 One minor correction to earlier thinking with respect to toast
 tables. That might later be relaxed.

 Full tests including proof of lock level reductions, plus docs.

Rebased to v14

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


reduce_lock_levels.v14.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Robert Haas
On Tue, Jan 14, 2014 at 12:20 PM, James Bottomley
james.bottom...@hansenpartnership.com wrote:
 On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote:
 On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas robertmh...@gmail.com wrote:
  In terms of avoiding double-buffering, here's my thought after reading
  what's been written so far.  Suppose we read a page into our buffer
  pool.  Until the page is clean, it would be ideal for the mapping to
  be shared between the buffer cache and our pool, sort of like
  copy-on-write.  That way, if we decide to evict the page, it will
  still be in the OS cache if we end up needing it again (remember, the
  OS cache is typically much larger than our buffer pool).  But if the
  page is dirtied, then instead of copying it, just have the buffer pool
  forget about it, because at that point we know we're going to write
  the page back out anyway before evicting it.
 
  This would be pretty similar to copy-on-write, except without the
  copying.  It would just be forget-from-the-buffer-pool-on-write.

 But... either copy-on-write or forget-on-write needs a page fault, and
 thus a page mapping.

 Is a page fault more expensive than copying 8k?

 (I really don't know).

 A page fault can be expensive, yes ... but perhaps you don't need one.

 What you want is a range of memory that's read from a file but treated
 as anonymous for writeout (i.e. written to swap if we need to reclaim
 it). Then at some time later, you want to designate it as written back
 to the file instead so you control the writeout order.  I'm not sure we
 can do this: the separation between file backed and anonymous pages is
 pretty deeply ingrained into the OS, but if it were possible, is that
 what you want?

Doesn't sound exactly like what I had in mind.  What I was suggesting
is an analogue of read() that, if it reads full pages of data to a
page-aligned address, shares the data with the buffer cache until it's
first written instead of actually copying the data.  The pages are
write-protected so that an attempt to write the address range causes a
page fault.  In response to such a fault, the pages become anonymous
memory and the buffer cache no longer holds a reference to the page.
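
For comparison, the closest thing that exists today is a MAP_PRIVATE
file mapping: pages are shared with the page cache until the first
write, at which point the kernel copies the page and the process keeps
a private anonymous copy.  A minimal sketch (placeholder file name),
including the copy that the proposal above is trying to avoid:

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#define BLCKSZ 8192

int main(void)
{
    int fd = open("relation.dat", O_RDONLY);     /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    char *p = mmap(NULL, BLCKSZ, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    char c = p[0];   /* read: backed by the shared page cache page      */
    p[0] = c + 1;    /* first write: page fault, the kernel copies the
                      * page; our copy is now anonymous memory and is
                      * never written back to the file                  */

    munmap(p, BLCKSZ);
    close(fd);
    return 0;
}
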

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] shared memory message queues

2014-01-14 Thread Thom Brown
On 14 January 2014 17:29, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Dec 23, 2013 at 12:46 PM, Robert Haas robertmh...@gmail.com wrote:
 Oh, dear.  That's rather embarrassing.

 Incremental (incremental-shm-mq.patch) and full (shm-mq-v3.patch)
 patches attached.

 OK, I have pushed the patches in this stack.  I'm not sure we quite
 concluded the review back-and-forth but nobody really seems to have
 had serious objections to this line of attack, other than wanting some
 more comments which I have added.  I don't doubt that there will be
 more things to tweak and tune here, and a whole lot more stuff that
 needs to be built using this infrastructure, but I don't think the
 code that's here is going to get better for remaining outside the tree
 longer.

 I decided not to change the dsm_toc patch to use functions instead of
 macros as Andres suggested; the struct definition would still have
 needed to be public, so any change would still need a recompile, at
 least if the size of the struct changed.  It could be changed to work
 with a palloc'd chunk, but then you need to worry about
 context-lifespan memory leaks, and so it didn't seem worth it.  I can't
 imagine this having a lot of churn, anyway.

postgres=# SELECT test_shm_mq(32768, (select
string_agg(chr(32+(random()*96)::int), '') from generate_series(1,3)),
1, 10);
ERROR:  could not register background process
HINT:  You may need to increase max_worker_processes.
STATEMENT:  SELECT test_shm_mq(32768, (select
string_agg(chr(32+(random()*96)::int), '') from generate_series(1,3)),
1, 10);
LOG:  registering background worker test_shm_mq
LOG:  starting background worker process test_shm_mq
ERROR:  could not register background process
HINT:  You may need to increase max_worker_processes.
ERROR:  unable to map dynamic shared memory segment
LOG:  worker process: test_shm_mq (PID 21939) exited with exit code 1
LOG:  unregistering background worker test_shm_mq


What's going on here?  This occurs when starting Postgres and running
it as the first statement.

I also noticed that everything exits with exit code 1:


postgres=# SELECT test_shm_mq(32768, (select
string_agg(chr(32+(random()*96)::int), '') from
generate_series(1,400)), 1, 1);
LOG:  registering background worker test_shm_mq
LOG:  starting background worker process test_shm_mq
 test_shm_mq
-

(1 row)

LOG:  worker process: test_shm_mq (PID 22041) exited with exit code 1
LOG:  unregistering background worker test_shm_mq

-- 
Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Jan 14, 2014 at 11:57 AM, James Bottomley
 james.bottom...@hansenpartnership.com wrote:
 No, I do ... you mean the order of write out, if we have to do it, is
 important.  In the rest of the kernel, we do this with barriers which
 cause ordered grouping of I/O chunks.  If we could force a similar
 ordering in the writeout code, is that enough?

 Probably not.  There are a whole raft of problems here.  For that to
 be of any use, we'd have to move to mmap()ing each buffer instead
 of read()ing them in, and apparently mmap() doesn't scale well to
 millions of mappings.

We would presumably mmap whole files, not individual pages (at least
on 64-bit machines; else address space size is going to be a problem).
However, without a fix for the critical-section/atomic-update problem,
the idea's still going nowhere.

 This would be pretty similar to copy-on-write, except without the
 copying.  It would just be forget-from-the-buffer-pool-on-write.

That might possibly work.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread James Bottomley
On Tue, 2014-01-14 at 11:48 -0500, Robert Haas wrote:
 On Tue, Jan 14, 2014 at 11:44 AM, James Bottomley
 james.bottom...@hansenpartnership.com wrote:
  No, I'm sorry, that's never going to be possible.  No user space
  application has all the facts.  If we give you an interface to force
  unconditional holding of dirty pages in core you'll livelock the system
  eventually because you made a wrong decision to hold too many dirty
  pages.   I don't understand why this has to be absolute: if you advise
  us to hold the pages dirty and we do up until it becomes a choice to
  hold on to the pages or to thrash the system into a livelock, why would
  you ever choose the latter?  And if, as I'm assuming, you never would,
  why don't you want the kernel to make that choice for you?
 
 If you don't understand how write-ahead logging works, this
 conversation is going nowhere.  Suffice it to say that the word
 ahead is not optional.

No, I do ... you mean the order of write out, if we have to do it, is
important.  In the rest of the kernel, we do this with barriers which
cause ordered grouping of I/O chunks.  If we could force a similar
ordering in the writeout code, is that enough?

James
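
A minimal sketch of the ordering constraint in question, with
placeholder file names and record format (real WAL is considerably more
involved): the log record must be durable before the data page it
describes may reach disk, which is why a writeback that jumps the queue
is fatal rather than merely suboptimal.

#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

#define BLCKSZ 8192

int main(void)
{
    char wal_record[64] = "change: block 0 of relation.dat";  /* placeholder */
    char data_page[BLCKSZ];
    memset(data_page, 'x', sizeof(data_page));

    int wal_fd  = open("pg_wal.log",   O_WRONLY | O_CREAT | O_APPEND, 0600);
    int data_fd = open("relation.dat", O_WRONLY | O_CREAT, 0600);
    if (wal_fd < 0 || data_fd < 0) { perror("open"); return 1; }

    /* 1. Write the log record and make it durable ... */
    if (write(wal_fd, wal_record, sizeof(wal_record)) < 0) { perror("write"); return 1; }
    if (fdatasync(wal_fd) != 0) { perror("fdatasync"); return 1; }

    /* 2. ... only then may the modified data page go to disk.  If the
     *    kernel wrote this page before step 1 completed, a crash in
     *    between would leave a change on disk with no log record saying
     *    how to redo or undo it. */
    if (pwrite(data_fd, data_page, BLCKSZ, 0) != BLCKSZ) { perror("pwrite"); return 1; }

    close(wal_fd);
    close(data_fd);
    return 0;
}
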




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread James Bottomley
On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote:
 On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas robertmh...@gmail.com wrote:
 
  In terms of avoiding double-buffering, here's my thought after reading
  what's been written so far.  Suppose we read a page into our buffer
  pool.  Until the page is clean, it would be ideal for the mapping to
  be shared between the buffer cache and our pool, sort of like
  copy-on-write.  That way, if we decide to evict the page, it will
  still be in the OS cache if we end up needing it again (remember, the
  OS cache is typically much larger than our buffer pool).  But if the
  page is dirtied, then instead of copying it, just have the buffer pool
  forget about it, because at that point we know we're going to write
  the page back out anyway before evicting it.
 
  This would be pretty similar to copy-on-write, except without the
  copying.  It would just be forget-from-the-buffer-pool-on-write.
 
 
 But... either copy-on-write or forget-on-write needs a page fault, and
 thus a page mapping.
 
 Is a page fault more expensive than copying 8k?
 
 (I really don't know).

A page fault can be expensive, yes ... but perhaps you don't need one. 

What you want is a range of memory that's read from a file but treated
as anonymous for writeout (i.e. written to swap if we need to reclaim
it).  Then at some time later, you want to designate it as written back
to the file instead so you control the writeout order.  I'm not sure we
can do this: the separation between file backed and anonymous pages is
pretty deeply ingrained into the OS, but if it were possible, is that
what you want?

James




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread James Bottomley
On Tue, 2014-01-14 at 10:39 -0500, Tom Lane wrote:
 James Bottomley james.bottom...@hansenpartnership.com writes:
  The current mechanism for coherency between a userspace cache and the
  in-kernel page cache is mmap ... that's the only way you get the same
  page in both currently.
 
 Right.
 
  glibc used to have an implementation of read/write in terms of mmap, so
  it should be possible to insert it into your current implementation
  without a major rewrite.  The problem I think this brings you is
  uncontrolled writeback: you don't want dirty pages to go to disk until
  you issue a write()
 
 Exactly.
 
  I think we could fix this with another madvise():
  something like MADV_WILLUPDATE telling the page cache we expect to alter
  the pages again, so don't be aggressive about cleaning them.
 
 "Don't be aggressive" isn't good enough.  The prohibition on early write
 has to be absolute, because writing a dirty page before we've done
 whatever else we need to do results in a corrupt database.  It has to
 be treated like a write barrier.
 
  The problem is we can't give you absolute control of when pages are
  written back because that interface can be used to DoS the system: once
  we get too many dirty uncleanable pages, we'll thrash looking for memory
  and the system will livelock.
 
 Understood, but that makes this direction a dead end.  We can't use
 it if the kernel might decide to write anyway.

No, I'm sorry, that's never going to be possible.  No user space
application has all the facts.  If we give you an interface to force
unconditional holding of dirty pages in core you'll livelock the system
eventually because you made a wrong decision to hold too many dirty
pages.   I don't understand why this has to be absolute: if you advise
us to hold the pages dirty and we do up until it becomes a choice to
hold on to the pages or to thrash the system into a livelock, why would
you ever choose the latter?  And if, as I'm assuming, you never would,
why don't you want the kernel to make that choice for you?

James



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] plpgsql.consistent_into

2014-01-14 Thread Marko Tiikkaja


On 1/14/14, 6:15 PM, Tom Lane wrote:

Marko Tiikkaja ma...@joh.to writes:

How about:



(a) = SELECT 1;
(a, b) = SELECT 1, 2;
(a, b) = INSERT INTO foo RETURNING col1, col2;



Same semantics: TOO_MANY_ROWS on rows > 1, sets FOUND and row_count.
AFAICT this can be parsed unambiguously, too, and we don't need to look
at the query string because this is new syntax.


The idea of inventing new syntax along this line seems like a positive
direction to pursue.  Since assignment already rejects multiple rows
from the source expression, this wouldn't be weirdly inconsistent.

It might be worth thinking about the multiple column assignment UPDATE
syntax that's in recent versions of the SQL standard:

UPDATE targettab SET (a, b, c) = row-valued-expression [ , ... ] [ 
WHERE ... ]

We don't actually implement this in PG yet, except for trivial cases, but
it will certainly happen eventually. I think your sketch above deviates
unnecessarily from what the standard says for UPDATE.  In particular
I think it'd be better to write things like

(a, b) = ROW(1, 2);
(a, b, c) = (SELECT x, y, z FROM foo WHERE id = 42);

which would exactly match what you'd write in a multiple-assignment
UPDATE, and it has the same rejects-multiple-rows semantics too.


Hmm.  That's a fair point I did not consider.


Also note that the trivial cases we do already implement in UPDATE
look like

UPDATE targettab SET (a, b, c) = (1, 2, 3) [ WHERE ... ]

that is, we allow a row constructor where the optional keyword ROW has
been omitted.  I think people would expect to be able to write this in
plpgsql:

(a, b) = (1, 2);

Now, this doesn't provide any guidance for INSERT/UPDATE/DELETE RETURNING,
but frankly I don't feel any need to invent new syntax for those, since
RETURNING INTO already works the way you want.


Yeah, I don't feel strongly about having to support them with this 
syntax.  The inconsistency is a bit ugly, but it's not the end of the world.



I'm not too sure what it'd take to make this work.  Right now,

SELECT (SELECT x, y FROM foo WHERE id = 42);

would generate ERROR:  subquery must return only one column, but
I think it's mostly a historical artifact that it does that rather than
returning a composite value (of an anonymous record type).  If we were
willing to make that change then it seems like it'd be pretty
straightforward to teach plpgsql to handle

(a, b, ...) = row-valued-expression

where there wouldn't actually be any need to parse the RHS any differently
from the way plpgsql parses an assignment RHS right now.  Which would be
a good thing IMO.  If we don't generalize the behavior of scalar
subqueries then plpgsql would have to jump through a lot of hoops to
support the subselect case.


You can already do the equivalent of  (a,b,c) = (1,2,3)  with SELECT ..
INTO.  Would you be opposed to starting the work on this by only
supporting the subquery syntax, with the implementation being similar to
how we currently handle SELECT .. INTO?  If we could also not flat-out
reject including that in 9.4, I would be quite happy.



Regards,
Marko Tiikkaja


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

