Re: [HACKERS] Where do we stand on 9.3 bugs?

2014-01-13 Thread Tom Lane
Marti Raudsepp  writes:
> On Mon, Jan 13, 2014 at 5:16 PM, Tom Lane  wrote:
>> What remaining issues are there blocking a 9.3.3 release?

> Well, hardly a blocker since this has missed 2 releases already, but
> I'm still hopeful about getting many PGXS-based extensions to build again
> without the dreaded "install: will not overwrite just-created ..."
> http://www.postgresql.org/message-id/52406191.6040...@dunslane.net

AFAICT, the state of play on that is that it was reverted out of the
non-devel branches immediately before the October releases (9.3.1
et al), on the grounds that it still seemed to be a moving target, cf
http://www.postgresql.org/message-id/1381193255.25702.4.ca...@vanquo.pezone.net

I've not seen any further work on PGXS since commit eebdea08b, so maybe
that complaint is no longer valid?  On the other hand, dddc91ddd leaves
the impression that all is not yet well in PGXS-land.

> Thankfully I'm mostly using Debian so it's already patched for me.

The Debian guys do seem remarkably willing to apply patches that
upstream doesn't trust yet, don't they.

Anyway, my opinion about this one is that it's not a regression and therefore
not a blocker, but I have no objection to fixing it in the back branches
once we have consensus that the fix is stable.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Disallow arrays with non-standard lower bounds

2014-01-13 Thread Florian Pflug
On Jan14, 2014, at 00:33 , Craig Ringer  wrote:
> So I guess the question is: Is it worth all that hassle to remove a
> misfeature you have to go out of your way to use? Is support for non-1
> lower bounds stopping us from doing something useful and important? Or
> is it just an irritation that it exists?

I don't think it's worth it - as you say, the actual risk of bugs is low,
because you have to go out of your way to end up with a lower bound other
than one.

Also, at least from my POV, the fact that we use one type to represent
arrays with an arbitrary number of dimensions is actually worse than
the lower-bound problem. So *if* we ever remove support for arbitrary
lower bounds, we should also add distinct types for different dimensions.
That'd probably require some extension of the type system, though...

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Comment typo in src/include/access/gin_private.h

2014-01-13 Thread Etsuro Fujita
I ran into a typo in src/include/access/gin_private.h.  Patch attached.

Thanks,

Best regards,
Etsuro Fujita


gin_private.h-typo.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Capturing EXPLAIN ANALYZE for a query without discarding the normal result set

2014-01-13 Thread Dave Cole
I apologise for dropping this out of nowhere.  I had an idea about EXPLAIN
ANALYZE that would be very useful for the system we are developing and
supporting.

It would be really cool if you could direct the EXPLAIN ANALYZE output to a
temporary table so that the query being analyzed could execute normally.
Something like this:

EXPLAIN ANALYZE INTO a_temp
   SELECT 

Then we could temporarily cause our application to log EXPLAIN ANALYZE
information for certain queries without disrupting normal operation of the
system. In the case where we notice long-running queries, we would then
immediately follow up the original query with a select on the temporary
table.
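
To be concrete, with the hypothetical syntax above, that follow-up would
just be something like:

   SELECT * FROM a_temp;  -- read back the captured EXPLAIN ANALYZE output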

We deal with a lot of transient data, so the conditions that cause bad
query performance are not always reproducible.

I have no idea how feasible this is, so please feel free to tell me I am an
idiot.

- Dave


Re: [HACKERS] Capturing EXPLAIN ANALYZE for a query without discarding the normal result set

2014-01-13 Thread Marti Raudsepp
On Tue, Jan 14, 2014 at 5:06 AM, Dave Cole  wrote:
> It would be really cool if you could direct the EXPLAIN ANALYZE output to a
> temporary table so that the query being analyzed could execute normally.

You can use the auto_explain contrib module to log the query plans of
slow(er) queries:
http://www.postgresql.org/docs/current/static/auto-explain.html
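
A minimal setup looks something like this (the threshold here is just an
example; see the docs page above for the full list of settings):

LOAD 'auto_explain';
SET auto_explain.log_min_duration = 250;  -- log plans of statements over 250ms
SET auto_explain.log_analyze = true;      -- include actual rows and timings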

If you Really Need To, you can use the csvlog log format and import
that into a table, but really it's easier to use less/grep/etc.

Regards,
Marti


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Capturing EXPLAIN ANALYZE for a query without discarding the normal result set

2014-01-13 Thread Tom Lane
Dave Cole  writes:
> It would be really cool if you could direct the EXPLAIN ANALYZE output to a
> temporary table so that the query being analyzed could execute normally.

What happens if the current transaction rolls back?

If you want noninvasive explain data, contrib/auto_explain offers
a solution right now.  The info goes to the postmaster log, which is
perhaps less convenient than a temp table for interactive use, but
it doesn't have the rollback problem --- and you can capture data
about queries issued by a live application, without hacking the app.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Disallow arrays with non-standard lower bounds

2014-01-13 Thread Florian Pflug
On Jan14, 2014, at 02:10 , Kevin Grittner  wrote:
> The fact that some
> day some new programmer might not be aware of all business rules,
> or might choose to try to ignore them, is the reason you add
> constraints to columns and domains.

Well, for columns and domains that seems easy. We could have

  array_has_shape(value anyarray, variadic bounds int4range[])

and

  array_has_shape(value anyarray, variadic bounds int4[])

The first returns true if the value has length(bounds) dimensions
and each dimension's bounds match the corresponding range's bounds,
where NULL means "arbitrary". The second one requires all lower
bounds to be 1, and checks the upper bounds against the bounds array.

Checking that an array is one-dimensional with lower bound 1 is then
accomplished by

  array_has_shape(myarray, int4range(1, NULL))

or simply

  array_has_shape(myarray, NULL);
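
To illustrate the columns-and-domains point, a domain enforcing the usual
one-dimensional, lower-bound-1 shape could then be declared like this
(hypothetical, of course, since array_has_shape doesn't exist yet):

  CREATE DOMAIN int_vector AS int4[]
    CHECK (array_has_shape(VALUE, int4range(1, NULL)));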

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Soften pg_[start|stop]_backup to allow them on a standby?

2014-01-13 Thread Michael Paquier
Hi all,

This is perhaps something that has already been discussed on hackers;
I just could not find anything in the archives.
Currently, pg_start_backup and pg_stop_backup cannot run on a standby
because it is not possible to write a backup_label file to disk, the
nature of a standby server preventing it from writing any data in its
PGDATA. Is this understanding right? This is what the comments at the
top of do_pg_start_backup make me conclude.

Could we consider softening the rules of pg_start_backup and
pg_stop_backup (aka allowing creation of a backup_label file) to be able
to run them on a standby? Or not? This could make it faster to take
backups from standbys when taking them locally.

Another idea would be to send the backup label file directly as the
output of pg_start_backup, such that a client application can grab it
and reuse it. Any thoughts about that as well?
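
As a sketch of that second idea (purely hypothetical; today's function
returns the backup start location instead):

  -- hypothetical variant returning the backup_label contents
  SELECT pg_start_backup('standby_backup') AS label_file;

The client would then be responsible for storing label_file inside the
backup itself.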

Note that pg_basebackup uses the BASE_BACKUP command to bypass that,
sending the label file directly through a stream without writing it
to disk, satisfying this condition and making possible the creation of
backups from a standby.

Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Case sensitive mode in windows build option

2014-01-13 Thread Dilip kumar

As per the current behavior, if a user wants to build in debug mode on
Windows, they need to give the option in capital letters (DEBUG).
I think many users will make mistakes with this option; in my opinion
we can make it case-insensitive.

I have attached a small patch for the same (it just converts the
comparison to a case-insensitive one).

Regards,
Dilip


windows_build.patch
Description: windows_build.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Case sensitive mode in windows build option

2014-01-13 Thread Craig Ringer
On 01/14/2014 11:49 AM, Dilip kumar wrote:
>  
> 
> As per the current behavior, if a user wants to build in debug mode on
> Windows, they need to give the option in capital letters (DEBUG).
> 
> I think many users will make mistakes with this option; in my opinion
> we can make it case-insensitive.

The idea seems reasonable; the implementation does not. You've changed
the meaning rather more than just making it case-insensitive.

Use the Perl 'lc' function to compare a lower-cased input instead.

http://perldoc.perl.org/functions/lc.html

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2014-01-13 Thread Amit Kapila
On Tue, Jan 14, 2014 at 2:16 AM, Robert Haas  wrote:
> On Sat, Jan 11, 2014 at 1:08 AM, Amit Kapila  wrote:
>> Yes, currently this applies to update, what I have in mind is that
>> in future if some one wants to use WAL compression for any other
>> operation like 'full_page_writes', then it can be easily extendible.
>>
>> To be honest, I have not evaluated whether such a flag or compression
>> would make sense for full page writes, but I think it should be possible
>> while doing full page write (BkpBlock has RelFileNode) to check such a
>> flag if it's present.
>
> Makes sense.

   So shall I change it to a string instead of a bool, and keep the name as
   compress_wal or compress_wal_for_opr?

>> The reason of adding the same chunk in head of list is that it uses same
>> technique as pglz_hist_add. Now in pglz, it will not have repeat steps
>> from c~f, as it has concept of good_match which leads to get this done in
>> one go.
>>
>> Being said above, I am really not sure, how much real world data falls
>> in above category and should we try to optimize based on above example,
>> but yes it will save some CPU cycles in current test we are using.
>
> In the Rabin algorithm, we shouldn't try to find a longer match.  The
> match should end at the chunk end, period.  Otherwise, you lose the
> shift-resistant property of the algorithm.

   Okay, it will work well for cases where most chunks in the tuple are due
   to a special pattern in it, but it will lose out on CPU cycles in
   cases where most of the chunks are due to the maximum chunk boundary
   and most of the new tuple matches the old tuple. The reason is that
   if the algorithm has some such property of finding longer matches than
   chunk boundaries, then it can save us from calculating the hash again and
   again when we try to find a match in the old tuple.
   However I think it is better to go with Rabin's algorithm instead of adding
   optimizations based on our own assumptions, because it is difficult to
   predict real-world tuple data.

>>
>> Isn't it similar to how the current pglz works? Basically it also
>> uses the next 4 bytes to calculate an index (pglz_hist_idx) but still
>> does byte-by-byte comparison; here if we try to map to Rabin's
>> delta encoding, then the chunk size is always 1.
>
> I don't quite understand this.  The point of the Rabin algorithm is to
> split the old tuple up into chunks and then search for those chunks in the
> new tuple.  For example, suppose the old tuple is
> abcdefghijklmnopqrstuvwxyz.  It might get split like this: abcdef
> hijklmnopqrstuvw xyz.  If any of those three chunks appear in the new
> tuple, then we'll use them for compression.  If not, we'll just copy
> the literal bytes.  If the chunks appear in the new tuple reordered or
> shifted or with stuff inserted between one chunk and the next, we'll
> still find them.  Unless I'm confused, which is possible, what you're
> doing is essentially looking at the string and splitting it in those
> three places, but then recording the chunks as being three bytes
> shorter than they really are.  I don't see how that can be right.

  Spending some more time on the algorithm today, I found the bug you
  are pointing to, and you are right in saying that the chunk is shorter.
  I think it should not be difficult to address this issue without affecting
  most of the algorithm; let me try to handle it.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Filter error log statements by sqlstate

2014-01-13 Thread Jeevan Chalke
On Mon, Jan 13, 2014 at 4:30 PM, Oskari Saarenmaa  wrote:

> Hi,
>
>
> On 13/01/14 10:26, Jeevan Chalke wrote:
>
>> 1. Documentation is missing, and it thus becomes difficult to understand what
>> exactly you are trying to do. In other words, the user will be uncertain
>> about using it efficiently.
>>
>
> I figured I'd write documentation for this if it looks like a useful
> feature which would be accepted for 9.4, but I guess it would've helped to
> have a bit better description of this for the initial submission as well.
>
>
>  2. Some more comments required. At each new function and specifically at
>> get_sqlstate_error_level().
>>
>
> Just after I submitted the patch I noticed that I had a placeholder for
> comment about that function but never wrote the actual comment, sorry about
> that.
>
>
>  3. Please add test-case if possible.
>>
>
> Sure.
>
>
>  4. Some code part does not comply with PostgreSQL indentation style. (Can
>> be
>> ignored as it will pass through pg_indent, but better fix it).
>>
>
> I'll try to fix this for v2.
>
>
>  5. You have used the "XX000:warning," string to get the maximum possible length
>> of
>> the valid sqlstate:level identifier. It's perfect, but small explanation
>> about that will be good there. Also in future if we have any other error
>> level
>> which exceeds this, we need changes here too. Right ?
>>
>
> Good point, I'll address this in v2.
>
>
>  I will look into this further. But please have your attention on above
>> points.
>>
>
> Thanks for the review!
>

Since you are taking care of most of the points above, I will wait for the v2
patch. Till then, marking it "Waiting on Author".

Thanks


>
> / Oskari
>
>
>  On Fri, Jan 10, 2014 at 12:56 AM, Oskari Saarenmaa 
>> wrote:
>> > Allow the default log_min_error_statement to be overridden per
>> > sqlstate to make it possible to filter out some error types while
>> > maintaining a low log_min_error_statement or enable logging for some
>> > error types when the default is to not log anything.
>> >
>> > I've tried to do something like this using rsyslog filters, but
>> > that's pretty awkward and doesn't work at all when the statement is
>> > split to multiple syslog messages.
>> >
>> > https://github.com/saaros/postgres/compare/log-by-sqlstate
>>
>
>


-- 
Jeevan B Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 30589500

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb



Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-13 Thread David Rowley
On Tue, Jan 14, 2014 at 2:00 PM, Florian Pflug  wrote:

> On Jan10, 2014, at 22:27 , David Rowley  wrote:
> > As the patch stands at the moment, I currently have a regression test
> > which currently fails due to these extra zeros after the decimal point:
> >
> > -- This test currently fails due to extra trailing 0 digits.
> > SELECT SUM(n::numeric) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND
> UNBOUNDED FOLLOWING)
> >   FROM (VALUES(1,1.01),(2,2),(3,3)) v(i,n);
> >
> > Patched produces:
> >  6.01
> >  5.00
> >  3.00
> > Unpatched produces:
> >  6.01
> >  5
> >  3
>
> Hm, that's annoying. I checked the standard to see if it offers any
> guidance
> here, but it doesn't look like it does - the standard just says that SUM()
> ought
> to return a type with the same scale as the input type has. That's fine if
> every NUMERIC type has a fixed scale, but that's not the case in postgres -
> we allow values of unconstrained NUMERIC types (i.e., numeric types
> without an
> explicitly specified precision or scale) to each define their own scale.
>
>
Thanks for digging that out of the standard, and thanks for all the
information you supplied here
http://www.postgresql.org/message-id/0fa6c08e-2166-405b-83f7-63b196b88...@phlo.org
too.
Sorry I've not had the chance to reply yet. I was kind of hoping that the
answer would be more in my favour, to help with inverse transitions for
sum(numeric).



> To fix this, we'd have to track the maximum scale within the current frame.
> That's easier than the general problem of providing an inverse transition
> function for MAX, because AFAIK we limit the scale to at most 1000. So it'd
> be sufficient to track the number of times we saw each scale, and also the
> current maximum. Once we reduce the current maximum's count back to zero
> in the inverse transition function, we'd scan from that value to the left
> to
> find the next non-zero entry.
>
>
I've been thinking about this, but I had thought that the maximum dscale
was bigger than 1000. The docs seem to claim 16383 here -->
http://www.postgresql.org/docs/devel/static/datatype-numeric.html
I'd go ahead and implement this if that number were smaller, but I'm thinking
zeroing out an array of 16383 elements on the first call to do_numeric_accum
might be too big an overhead to write off as "background noise". If it were
20 or even 100 then I'd probably try for that.

I think the overhead for each call after that would likely be OK, as it
would probably just be an operation like
state->scaleCount[X.dscale]++; which I would imagine would be a very small
percentage overhead on normal aggregate functions. Of course the inverse
would have to do the harder work of looping backwards over the array until
it found an element with a count above 0 and setting that as the current
maximum. I think this would be a winner if it weren't for the high initial
hit of zeroing that 16383-element array... or 1000, whichever.



> We could also choose to ignore this, although I'm not sure I really like
> that.
> It seems entirely OK at first sight - after all, the values all stay the
> same,
> we just emit a different number of trailing zeros. But it still causes
> results
> to be affected by values, even if only in the number of trailing zeros,
> which
> lie outside the value's range. That seems like more non-determinism than a
> database should show.
>
> > With inverse transitions this query still produces correct results, it
> just does
> > not produce the numeric in the same format as it does without
> performing inverse
> > transitions. Personally I'd rather focus on trying to get SUM(numeric)
> in there
> > for 9.4
>
> I think it'd be worthwile to get this into 9.4, if that's still an option,
> even if we only support COUNT.
>
> best regards,
> Florian Pflug
>
>


Re: [HACKERS] [BUGS] surprising to_timestamp behavior

2014-01-13 Thread Jeevan Chalke
On Thu, Oct 31, 2013 at 10:50 AM, Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

>
>
>
> On Tue, Oct 29, 2013 at 11:05 PM, Robert Haas wrote:
>
>> On Tue, Oct 29, 2013 at 12:03 PM, Tom Lane  wrote:
>> > Robert Haas  writes:
>> >> It turns out that when you use the to_timestamp function, a space in
>> >> the format mask can result in skipping any character at all, even a
>> >> digit, in the input string.  Consider this example, where 10 hours are
>> >> lost:
>> >
>> >> rhaas=# select to_timestamp('2013-10-29 10:47:18', 'YYYY-MM-DD
>> >>  HH24:MI:SS');
>> >>   to_timestamp
>> >> ------------------------
>> >>  2013-10-29 00:47:18-04
>> >> (1 row)
>> >
>> > And that's a bug why?  The format says to ignore two characters before
>> the
>> > hours field.  I think you're proposing to remove important
>> functionality.
>> >
>> > To refine the point a bit, it's absolutely stupid to be using
>> to_timestamp
>> > at all for sane input data like this example.  Just cast the string to
>> > timestamp(tz), and the standard datatype input function will do a better
>> > job than to_timestamp ever would.  The point of to_timestamp, IMNSHO,
>> > is to extract data successfully from weirdly formatted input; which
>> might
>> > well include cases where there are stray digits you don't want taken as
>> > data.  So I'm not on board with proposals to "fix" cases like this by
>> > making the format string's meaning squishier.
>>
>> Well, you're the second person to react that way to this proposal, but
>> the current behavior seems mighty odd to me - even odder, now that I
>> realize that we'll happily match '"cat"' to 'dog'.  I just work here,
>> though.
>>
>
> Well, I agree with Tom that the user provided two spaces to skip before the
> hours, and this is exactly what we are doing.
>
> Still, here are a few other observations:
>
>
> (1) I don't see the following as wrong output in PostgreSQL, as I already
> said above and agreed with Tom. (In the input there is only one space
> between DD and HH24, but in the mask we have 2 spaces.)
>
> postgres=# select to_timestamp('2011-03-18 23:38:15', 'YYYY-MM-DD
>  HH24:MI:SS');
>        to_timestamp
> ----------------------------
>  2011-03-18 03:38:15+05:30
> (1 row)
>
> (Note that time 23 became 03 due to the extra space in the mask eating the 2
> in 23, resulting in 3 for HH24. But fair enough, as expected and thus NO
> issues.)
>
>
> (2) But I see the following as buggy (in both the input and the mask we
> have 2 spaces between DD and HH24):
>
> postgres=# select to_timestamp('2011-03-18  23:38:15', 'YYYY-MM-DD
>  HH24:MI:SS');
>        to_timestamp
> ----------------------------
>  2011-03-18 03:38:15+05:30
> (1 row)
>
> (Note that this time we should not end up eating the 2 from 23, as we have
> exactly matching spaces in the mask and the input. NOT so good and NOT
> expected; looks like a BUG.)
>
> So I think we need to resolve the second case.
>

Attached is a patch which fixes this issue.

I have just tweaked the code around ignoring spaces in the DCH_from_char()
function.

Adding to CommitFest 2014-01 (Open).

Thanks


> Thanks
>
>
>>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>>
>>
>> --
>> Sent via pgsql-bugs mailing list (pgsql-b...@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-bugs
>>
>
> --
> Jeevan B Chalke
> Principal Software Engineer, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
>
>


-- 
Jeevan B Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 30589500

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb

This e-mail message (and any attachment) is intended for the use of the
individual or entity to whom it is addressed. This message contains
information from EnterpriseDB Corporation that may be privileged,
confidential, or exempt from disclosure under applicable law. If you are
not the intended recipient or authorized to receive this for the intended
recipient, any use, dissemination, distribution, retention, archiving, or
copying of this communication is strictly prohibited. If you have received
this e-mail in error, please notify the sender immediately by reply e-mail
and delete this message.


Re: [HACKERS] Proposal: variant of regclass

2014-01-13 Thread Michael Paquier
Hi,

On Tue, Jan 14, 2014 at 4:28 PM, Yugo Nagata  wrote:
> Here is the patch to implement to_regclass, to_regproc, to_regoper,
> and to_regtype. They are new functions similar to regclass, regproc,
> regoper, and regtype, except that if the requested object is not found,
> they return InvalidOid rather than raising an error.

You should add this patch to the upcoming commit fest (beginning
tomorrow actually), I am not seeing it in the list:
https://commitfest.postgresql.org/action/commitfest_view?id=21
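
For the archives, the proposed behavior would look something like this
(relation names purely as examples):

  SELECT to_regclass('pg_class');       -- existing relation: its OID as regclass
  SELECT to_regclass('no_such_table');  -- missing: InvalidOid instead of an ERROR
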
Thanks,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Doc fix for VACUUM FREEZE

2014-01-13 Thread Amit Kapila
On Fri, Jan 3, 2014 at 9:02 PM, Peter Eisentraut  wrote:
> On 12/17/13, 8:16 PM, Maciek Sakrejda wrote:
>> (now with patch--sorry about that)
>
> This patch doesn't apply.

There are some recent changes around the same place which broke this
patch. Please find the modified patch attached to this mail; I have
changed the order in which vacuum_freeze_min_age and
vacuum_freeze_table_age are specified for the VACUUM FREEZE option.
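
(For context, as the docs describe it, FREEZE is shorthand for running with
both parameters set to zero, i.e. something like:

  VACUUM FREEZE some_table;
  -- behaves as if vacuum_freeze_min_age = 0 and vacuum_freeze_table_age = 0

with some_table being an arbitrary example.)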

I will mark this patch as Ready For Committer unless there is
any other objection to it.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


correct-vacuum-freeze-docs_v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch (v2) for updatable security barrier views

2014-01-13 Thread Craig Ringer
On 01/09/2014 11:19 PM, Tom Lane wrote:
> Dean Rasheed  writes:
>> My first thought was that it should just preprocess any security
>> barrier quals in subquery_planner() in the same way as other quals are
>> preprocessed. But thinking about it further, those quals are destined
>> to become the quals of subqueries in the range table, so we don't
>> actually want to preprocess them at that stage --- that will happen
>> later when the new subquery is planned by recursion back into
>> subquery_planner(). So I think the right answer is to make
>> adjust_appendrel_attrs() handle recursion into sublink subqueries.
> 
> TBH, this sounds like doubling down on a wrong design choice.  I see
> no good reason that updatable security views should require any
> fundamental rearrangements of the order of operations in the planner;
> and I doubt that this is the last bug you'll have if you insist on
> doing that.

I'd be quite happy to do this entirely within the rewriter. I've found
two persistent obstacles to that, and frankly I'm stuck. I'm going to be
reworking the RLS patches on top of Dean's functional patch unless I can
find some way to progress with a rewriter-based approach.

The key problems are:

1. preprocess_targetlist in the planner assumes that the resultRelation
is the correct RTE to set as the varno in a new Var it adds to fetch the
row ctid (with label "ctid1") as a resjunk attr for row-marking. This
causes the tlist to have entries pointing to a different RTE from the one
being scanned by the eventual seqscan / indexscan, though the underlying
Relation is the same. tlist validation checks don't like that.

There may be other places that need to add tlist entries pointing to the
relation we're reading rows from. They'll also need to be able to deal
with the fact that this no longer the resultRelation.


2. Despite bashing my head against it for ages, I haven't figured out
how to inject references to the base-rel's ctid, oid (if WITH OIDS), and
any tlist entries not specified in the DML statement into the subquery
tree. These are only accessible at the deepest level of rewriting, when
the final view is expanded into a subquery and processed with
rewriteTargetListUD(..). At this point we don't have "breadcrumbs" to
use to walk back up the nested subqueries adding the required tlist entries.

I keep on exploring ideas for this one, and get stuck in a dead end for
every one.


Without a way to move on these, I don't have much hope of adding
updatable security barrier views support using work done in the rewriter.

It seems inevitable that we'll have to add the separate concepts of
"source relation" (tuples to feed into HeapModifyTable for ctid, and for
heap_modify_table after junkfiltering) and "result relation" (target
Relation of heap_modify_table to actually write tuples to, target of row
level locking operations).

There's also going to need to be some kind of breadcrumb chain to allow
us to walk from the inner-most expanded view's RTE_RELATION back up the
expanded view subquery tlists, adding next-inner-most refs to resjunk
"ctid" and (if needed) "oid", injecting defaults, and expanding the
target list with Vars to match non-referenced attributes of the
inner-most RTE_RELATION. So far I haven't come up with a sensible form
for that breadcrumb trail.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-13 Thread Heikki Linnakangas

On 01/11/2014 12:40 AM, Peter Geoghegan wrote:

My problem is that in general I'm not sold on the actual utility of
making this kind of row locking work with exclusion constraints. I'm
sincerely having a hard time thinking of a practical use-case
(although, as I've said, I want to make it work with IGNORE). Even if
you work all this row locking stuff out, and the spill-to-disk aspect
out, the interface is still wrong, because you need to figure out a
way to project more than one reject per slot. Maybe I lack imagination
around how to make that work, but there are a lot of "ifs" and "buts"
either way.


Exclusion constraints can be used to implement uniqueness checks with
SP-GiST or GiST indexes. For example, suppose you want to enforce that there
are no two tuples with the same x and y coordinates, i.e. you use a point as
the key. You could add a b-tree index just to enforce the constraint,
but it's better if you don't have to. In general, it's just always
better if features don't have implementation-specific limitations like this.
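
Something like this sketch, assuming the gist point opclass (which supports
the ~= "same as" operator):

CREATE TABLE markers (
    p point,
    EXCLUDE USING gist (p WITH ~=)  -- rejects two rows with equal coordinates
);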


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Filter error log statements by sqlstate

2014-01-13 Thread Jeevan Chalke
Hi Oskari,

I had a quick look over the patch (Not compiled though). Here are few
comments on the changes:

1. Documentation is missing, and it thus becomes difficult to understand what
exactly you are trying to do. In other words, the user will be uncertain
about using it efficiently.
2. Some more comments required. At each new function and specifically at
get_sqlstate_error_level().
3. Please add test-case if possible.
4. Some code part does not comply with PostgreSQL indentation style. (Can be
ignored as it will pass through pg_indent, but better fix it).
5. You have used the "XX000:warning," string to get the maximum possible
length of a valid sqlstate:level identifier. It's perfect, but a small
explanation about it would be good there. Also, in future if we have any
other error level which exceeds this, we will need changes here too. Right?
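
(Just to make point 5 concrete for other readers: judging by that identifier
format, the setting value is presumably a comma-separated list of
sqlstate:level pairs, e.g. this purely illustrative value:

  log_min_error_statement = '23505:error, XX000:panic'

The sqlstates and levels above are made up, not from the patch.)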

I will look into this further. But please have your attention on above
points.

Thanks



On Fri, Jan 10, 2014 at 12:56 AM, Oskari Saarenmaa  wrote:

> Allow the default log_min_error_statement to be overridden per sqlstate to
> make it possible to filter out some error types while maintaining a low
> log_min_error_statement, or to enable logging for some error types when the
> default is to not log anything.
>
> I've tried to do something like this using rsyslog filters, but that's
> pretty awkward and doesn't work at all when the statement is split to
> multiple syslog messages.
>
> https://github.com/saaros/postgres/compare/log-by-sqlstate
>
>  src/backend/utils/error/elog.c | 183 ++++++++++++++++++++++++++++++++++-
>  src/backend/utils/misc/guc.c   |  14 +++-
>  src/include/utils/guc.h|   4 +
>  src/include/utils/guc_tables.h |   1 +
>  4 files changed, 199 insertions(+), 3 deletions(-)
>
> / Oskari
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>


-- 
Jeevan B Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 30589500

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb



Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-13 Thread Heikki Linnakangas

On 01/11/2014 12:39 PM, Peter Geoghegan wrote:

In any case, my patch is bound to win decisively for the other
extreme, the insert-only case, because the overhead of doing an index
scan first is always wasted there with your approach, and the overhead
of extended btree leaf page locking has been shown to be quite low.


Quite possibly. Run the benchmark, and we'll see how big a difference 
we're talking about.



In
the past you've spoken of avoiding that overhead through an adaptive
strategy based on statistics, but I think you'll have a hard time
beating a strategy where the decision comes as late as possible, and
is informed by highly localized page-level metadata already available.
My implementation can abort an attempt to just read an existing
would-be duplicate very inexpensively (with no strong locks), going
back to just after the _bt_search() to get a heavyweight lock if just
reading doesn't work out (if there is no duplicate found), so as to
not waste all of its prior work. Doing one of the two extremes of
insert-mostly or update-only well is relatively easy; dynamically
adapting to one or the other is much harder. Especially if it's a
consistent mix of inserts and updates, where general observations
aren't terribly useful.


Another way to optimize it is to keep the b-tree page pinned after doing 
the pre-check. Then you don't need to descend the tree again when doing 
the insert. That would require small indexam API changes, but wouldn't 
be too invasive, I think.



All other concerns of mine still remain, including the concern over
the extra locking of the proc array - I'm concerned about the
performance impact of that on other parts of the system not exercised
by this test.


Yeah, I'm not thrilled about that part either. Fortunately there are 
other ways to implement that. In fact, I think you could just not bother 
taking the ProcArrayLock when setting the fields. The danger is that 
another backend sees a mixed state of the fields, but that's OK. The 
worst that can happen is that it will do an unnecessary lock/release on 
the heavy-weight lock. And to reduce the overhead when reading the 
fields, you could merge the SpeculativeInsertionIsInProgress() check 
into TransactionIdIsInProgress(). The call site in tqual.c always calls 
it together with TransactionIdIsInProgress(), which scans the proc array 
anyway, while holding the lock.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ECPG regression tests generating warnings

2014-01-13 Thread Michael Meskes
On Sun, Jan 12, 2014 at 08:28:57AM -0800, Kevin Grittner wrote:
> desc.pgc:55: WARNING: descriptor "outdesc" does not exist
> desc.pgc:86: WARNING: descriptor "outdesc" does not exist

Thanks, I didn't notice, fixed.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
Jabber: michael.meskes at gmail dot com
VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch (v2) for updatable security barrier views

2014-01-13 Thread Dean Rasheed
On 12 January 2014 10:12, Craig Ringer  wrote:
> On 01/09/2014 06:48 PM, Dean Rasheed wrote:
>> On 8 January 2014 10:19, Dean Rasheed  wrote:
>>> The assertion failure with inheritance and sublinks is a separate
>>> issue --- adjust_appendrel_attrs() is not expecting to find any
>>> unplanned sublinks in the query tree when it is invoked, since they
>>> would normally have all been planned by that point. However, the
>>> addition of the new security barrier subqueries after inheritance
>>> expansion can now insert new sublinks which need to be planned. I'll
>>> look into how best to make that happen.
>>
>> The attached patch does that, which fixes the case you reported.
>
> Dean, any objections to adding this to the current CF, or to my doing so?
>

OK, I'll do that.

I've added a page to the wiki with a more in-depth description of how
the patch works, and the test cases I've tried so far:

https://wiki.postgresql.org/wiki/Making_security_barrier_views_automatically_updatable

There's obviously still a lot more testing to do, but the early signs
are encouraging.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Filter error log statements by sqlstate

2014-01-13 Thread Oskari Saarenmaa

Hi,

On 13/01/14 10:26, Jeevan Chalke wrote:

1. Documentation is missing, and it thus becomes difficult to understand what
exactly you are trying to do. In other words, the user will be uncertain
about using it efficiently.


I figured I'd write documentation for this if it looks like a useful 
feature which would be accepted for 9.4, but I guess it would've helped 
to have a bit better description of this for the initial submission as well.



2. Some more comments required. At each new function and specifically at
get_sqlstate_error_level().


Just after I submitted the patch I noticed that I had a placeholder for 
comment about that function but never wrote the actual comment, sorry 
about that.



3. Please add test-case if possible.


Sure.


4. Some code part does not comply with PostgreSQL indentation style. (Can be
ignored as it will pass through pg_indent, but better fix it).


I'll try to fix this for v2.


5. You have used the "XX000:warning," string to get the maximum possible
length of a valid sqlstate:level identifier. It's perfect, but a small
explanation about it would be good there. Also, in future if we have any
other error level which exceeds this, we will need changes here too. Right?


Good point, I'll address this in v2.


I will look into this further. But please have your attention on above
points.


Thanks for the review!

/ Oskari


On Fri, Jan 10, 2014 at 12:56 AM, Oskari Saarenmaa 
wrote:
> Allow the default log_min_error_statement to be overridden per
> sqlstate to make it possible to filter out some error types while
> maintaining a low log_min_error_statement, or to enable logging for some
> error types when the default is to not log anything.
>
> I've tried to do something like this using rsyslog filters, but
> that's pretty awkward and doesn't work at all when the statement is
> split to multiple syslog messages.
>
> https://github.com/saaros/postgres/compare/log-by-sqlstate




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-01-13 Thread Heikki Linnakangas

On 01/09/2014 10:55 PM, Josh Berkus wrote:

On 01/09/2014 12:05 PM, Heikki Linnakangas wrote:


Actually, why is the partially-filled 000000010000000000000002 file
archived in the first place? Looking at the code, it's been like that
forever, but it seems like a bad idea. If the original server is still
up and running, and writing more data to that file, what will happen is
that when the original server later tries to archive it, it will fail
because the partial version of the file is already in the archive. Or
worse, the partial version overwrites a previously archived more
complete version.


Oh!  This explains some transient errors I've seen.


Wouldn't it be better to not archive the old segment, and instead switch
to a new segment after writing the end-of-recovery checkpoint, so that
the segment on the new timeline is archived sooner?


It would be better to zero-fill and switch segments, yes.  We should
NEVER be in a position of archiving two different versions of the same
segment.


Ok, I think we're in agreement that that's the way to go for master.

Now, what to do about back-branches? On one hand, I'd like to apply the 
same fix to all stable branches, as the current behavior is silly and 
always has been. On the other hand, we haven't heard any complaints 
about it, so we probably shouldn't fix what ain't broken. Perhaps we 
should apply it to 9.3, as that's where we have the acute problem the OP 
reported. Thoughts?


In summary, I propose that we change master and REL9_3_STABLE to not 
archive the partial segment from previous timeline. Older branches will 
keep the current behavior.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] generic pseudotype IO functions?

2014-01-13 Thread Peter Eisentraut
On Mon, 2014-01-06 at 17:36 +0100, Andres Freund wrote:
> FWIW, I am perfectly fine with duplicating the functions for now - I
> just thought that that might not be the best way but I didn't (and
> still
> don't) have a strong opinion.

Could we just put 0 in for the functions' OID and have code elsewhere
that errors "there is no input function for this type"?




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-13 Thread Rajeev rastogi
 
> On Sun, Jan 12, Amit Kapila wrote:
> >> How would that work?  Would it be a tool in contrib?  There already
> >> is a timeout, so if a tool checked more frequently than the timeout,
> >> it should work.  The durable notification of the admin would happen
> >> in the tool, right?
> >
> > Well, you know what tool *I'm* planning to use.
> >
> > Thing is, when we talk about auto-degrade, we need to determine
> things
> > like "Is the replica down or is this just a network blip"? and take
> > action according to the user's desired configuration.  This is not
> > something, realistically, that we can do on a single request.
> Whereas
> > it would be fairly simple for an external monitoring utility to do:
> >
> > 1. decide replica is offline for the duration (several poll attempts
> > have failed)
> >
> > 2. Send ALTER SYSTEM SET to the master and change/disable the
> > synch_replicas.
> 
>Will it be possible in the current mechanism, because presently the master
>will not accept any new command when the sync replica is not available?
>Or is there something else which also needs to be done along with the
>above 2 points to make it possible.

Since no WAL is written for the ALTER SYSTEM SET command, the
master should be able to handle this command even though the sync
replica is not available.
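
For example, the external tool could presumably issue something like:

ALTER SYSTEM SET synchronous_standby_names = '';  -- degrade to async replication
SELECT pg_reload_conf();                          -- make the change effective

(assuming clearing synchronous_standby_names is the desired degraded mode).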

Thanks and Regards,
Kumar Rajeev Rastogi


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ISN extension bug? (with patch)

2014-01-13 Thread Heikki Linnakangas

On 01/03/2014 07:53 PM, Fabien COELHO wrote:



If so, there is only the one-liner patch to consider.


This patch doesn't apply anymore.  Please submit an updated patch for
the commit fest.


In src/include/utils/elog.h there is an include of "utils/errcodes.h",
which is generated somehow when compiling PostgreSQL but is not present by
default. So you have to compile PostgreSQL and then the contrib module, or
use PGXS with an already-installed version.

With this caveat, the one-liner patch (4 characters removed) reattached
does compile for me:


Thanks, applied.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch (v2) for updatable security barrier views

2014-01-13 Thread Craig Ringer
On 01/09/2014 11:19 PM, Tom Lane wrote:
> Dean Rasheed  writes:
>> My first thought was that it should just preprocess any security
>> barrier quals in subquery_planner() in the same way as other quals are
>> preprocessed. But thinking about it further, those quals are destined
>> to become the quals of subqueries in the range table, so we don't
>> actually want to preprocess them at that stage --- that will happen
>> later when the new subquery is planned by recursion back into
>> subquery_planner(). So I think the right answer is to make
>> adjust_appendrel_attrs() handle recursion into sublink subqueries.
> 
> TBH, this sounds like doubling down on a wrong design choice.  I see
> no good reason that updatable security views should require any
> fundamental rearrangements of the order of operations in the planner.

In that case, would you mind offerign a quick sanity check on the
following alternative idea:

- Add "sourceRelation" to Query. This refers to the RTE that supplies
tuple projections to feed into ExecModifyTable, with appropriate resjunk
"ctid" and (if requ'd) "oid" cols present.

- When expanding a target view into a subquery, set "sourceRelation" on
the outer view to the index of the RTE of the newly expanded subquery.

- In rewriteTargetView, as now, reassign resultRelation to the target
view's base rel. This is required so that we can do any RETURNING and WITH
CHECK OPTION fixups required to adjust the RETURNING list to the new
result relation, so they act on the final tuple after any BEFORE
triggers act. Do not flatten the view subquery and merge the quals (as
currently happens); allow it to be expanded as a subquery by the
rewriter instead. Don't mess with the view tlist at this point except by
removing the whole-row Var added by rewriteTargetListUD.

- When doing tlist expansion in preprocess_targetlist, when we process
the outer Query (the only one for which query type is not SELECT, and
the only one that has a non-zero resultRelation), if resultRelation !=
sourceRelation recursively follow the chain of sourceRelation s to the
bottom one with type RTE_RELATION. Do tlist expansion on that inner-most
Query first, using sourceRelation to supply the varno for injected TLEs,
including injecting "ctid", "oid" if req'd, etc. During call stack
unwind, have each intermediate layer do regular tlist expansion, adding
a Var pointing to each tlist entry of the inner subquery.

At the outer level of preprocess_targetlist, sort the tlist, now
expanded to include all required vars, into attribute order for the
resultRelation. (this level is the only one that has resultRelation set).

Avoid invoking preprocess_targetlist on the inner Query again when it's
processed in turn, or just bail out when we see sourceRelation set since
we know it's already been done.

(Alternately, it might be possible to run preprocess_targetlist
depth-first instead of the current outermost-first, but I haven't looked
at that).


The optimizer can still flatten non-security-barrier updatable views,
following the chain of Vars as it collapses each layer. That's
effectively what the current rewriteTargetView code is doing manually at
each pass right now.

I'm sure there are some holes in this outline, but it's struck me as
possibly workable. The key is to set sourceRelation on every inner
subquery in the target query chain, not just the outer one, so it can be
easily followed from the outer query though the subqueries into the
innermost query with RTE_RELATION type.



The only alternative I've looked at looks clumsier the longer I
examine it: adding a back-reference in each subquery's Query struct, to
the Query containing it and the RTI of the subquery within the outer
Query. That way, once rewriting hits the innermost rel with RTE_RELATION
type, the rewriter can walk back up the Query tree doing tlist
rewriting. I'm not sure if this is workable yet, and it creates ugly
pointer-based backrefs *up* the Query chain, making what was previously
a tree of Query* into a graph. That could get exciting, though there'd
never be any need for mutators to follow the parent query pointer so it
wouldn't make tree rewrites harder.






-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] nested hstore patch

2014-01-13 Thread Oleg Bartunov
Thank you, Erik !

Oleg

On Mon, Jan 13, 2014 at 12:25 PM, Erik Rijkers  wrote:
> On Mon, January 13, 2014 00:24, Erik Rijkers wrote:
>> On Sat, January 11, 2014 22:47, Andrew Dunstan wrote:
>>>
>>> On 01/11/2014 03:03 PM, Erik Rijkers wrote:
 On Sat, January 11, 2014 20:30, Peter Eisentraut wrote:
> The documentation doesn't build.
 corrective patch is here:

 http://www.postgresql.org/message-id/37b9f104d5a838eec9b75f3668517aa5.squir...@webmail.xs4all.nl

>>> It will be in the next version of the patch posted.
>>
>>
>> Attached is another handful of doc-fixes...
>>
>
> There are errors in the example expressions in "Table F-6. hstore Operators".
>
> Attached is a cumulative doc-patch (which includes the changes I sent 
> earlier) which fixes these.
>
> I also attach an test perl program that shows the (small) differences in 
> output between what's in that doc table and what
> one actually gets. (I found these too insignificant to change but perhaps you 
> have a different opinion.)
>
>
> thanks,
>
> Erik Rijkers
>
>
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-13 Thread Florian Pflug
On Jan12, 2014, at 04:18 , Josh Berkus  wrote:
> Thing is, when we talk about auto-degrade, we need to determine things
> like "Is the replica down or is this just a network blip"? and take
> action according to the user's desired configuration.  This is not
> something, realistically, that we can do on a single request.  Whereas
> it would be fairly simple for an external monitoring utility to do:
> 
> 1. decide replica is offline for the duration (several poll attempts
> have failed)
> 
> 2. Send ALTER SYSTEM SET to the master and change/disable the
> synch_replicas.
> 
> In other words, if we're going to have auto-degrade, the most
> intelligent place for it is in
> RepMgr/HandyRep/OmniPITR/pgPoolII/whatever.  It's also the *easiest*
> place.  Anything we do *inside* Postgres is going to have a really,
> really hard time determining when to degrade.

+1

This is also how 2PC works, btw - the database provides the building
blocks, i.e. PREPARE and COMMIT, and leaves it to a transaction manager
to deal with issues that require a whole-cluster perspective.
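
(In SQL terms, the transaction manager drives each participant with:

  PREPARE TRANSACTION 'tx1';
  COMMIT PREPARED 'tx1';    -- or ROLLBACK PREPARED 'tx1' on failure

where 'tx1' is an arbitrary global transaction identifier.)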

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Where do we stand on 9.3 bugs?

2014-01-13 Thread Tom Lane
What remaining issues are there blocking a 9.3.3 release?  I know that
there were unresolved multixact issues when we put out 9.3.2 --- are
those all dealt with now?  What else do people see as release-blockers?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] generic pseudotype IO functions?

2014-01-13 Thread Tom Lane
Peter Eisentraut  writes:
> On Mon, 2014-01-06 at 17:36 +0100, Andres Freund wrote:
>> FWIW, I am perfectly fine with duplicating the functions for now - I
>> just thought that that might not be the best way but I didn't (and
>> still don't) have a strong opinion.

> Could we just put 0 in for the functions' OID and have code elsewhere
> that errors "there is no input function for this type"?

That doesn't seem like much of an improvement to me: that would be
taking a catalog corruption condition and blessing it as a legitimate
state of affairs, thereby reducing our ability to detect problems.

One instance where it would create issues is that I'm pretty sure
pg_dump would get confused by such a type.  Admittedly, pg_dump will
never try to dump the built-in pseudotypes, but do we really want them
handled so differently from user-definable types?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Where do we stand on 9.3 bugs?

2014-01-13 Thread Alvaro Herrera
Tom Lane wrote:
> What remaining issues are there blocking a 9.3.3 release?  I know that
> there were unresolved multixact issues when we put out 9.3.2 --- are
> those all dealt with now?  What else do people see as release-blockers?

The only thing I'm aware of still outstanding in multixact land is the
creation of a separate multixact freeze age GUC variable.  All other
issues were resolved.  Bug #8470 is still unresolved, but that's a
performance issue.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Where do we stand on 9.3 bugs?

2014-01-13 Thread Andres Freund
On 2014-01-13 12:26:45 -0300, Alvaro Herrera wrote:
> Tom Lane wrote:
> > What remaining issues are there blocking a 9.3.3 release?  I know that
> > there were unresolved multixact issues when we put out 9.3.2 --- are
> > those all dealt with now?  What else do people see as release-blockers?
> 
> The only thing I'm aware of still outstanding in multixact land is the
> creation of a separate multixact freeze age GUC variable.  All other
> issues were resolved.  Bug #8470 is still unresolved, but that's a
> performance issue.

It's not strictly a 9.3 issue, but there's still the issue of /members
wrapping around into valid data, right?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Where do we stand on 9.3 bugs?

2014-01-13 Thread Alvaro Herrera
Andres Freund wrote:
> On 2014-01-13 12:26:45 -0300, Alvaro Herrera wrote:
> > Tom Lane wrote:
> > > What remaining issues are there blocking a 9.3.3 release?  I know that
> > > there were unresolved multixact issues when we put out 9.3.2 --- are
> > > those all dealt with now?  What else do people see as release-blockers?
> > 
> > The only thing I'm aware of that's still outstanding in multixact land is
> > the creation of a separate multixact freeze age GUC variable.  All other
> > issues were resolved.  Bug #8470 is still unresolved, but that's a
> > performance issue.
> 
> It's not strictly a 9.3 issue, but there's still the issue of /members
> wrapping around into valid data, right?

Yes, but TBH I don't have much of an intention to fix that other than
having the freeze limit.  At least not for 9.3.3.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] nested hstore patch

2014-01-13 Thread Alvaro Herrera
Andrew Dunstan wrote:
> 
> On 01/13/2014 03:25 AM, Erik Rijkers wrote:
> 
> >There are errors in the example expressions in "Table F-6. hstore Operators".
> >
> >Attached is a cumulative doc-patch (which includes the changes I sent 
> >earlier) which fixes these.
> >
>I also attach a test perl program that shows the (small) differences in 
> >output between what's in that doc table and what
> >one actually gets. (I found these too insignificant to change but perhaps 
> >you have a different opinion.)
> 
> 
> A new version of the patch is attached. It includes all of Erik's
> docs fixes and a small fix by Alexander Korotkov for hstore hash
> ops.

Interestingly, this also includes transaction_commit event triggers.

There are also a few PANIC elogs, probably not what's intended.

(I was just giving this a quick skim to see if there's support to build
JSON objects incrementally from C source, i.e. not have to call
functions using the fmgr interface.  Apparently that's not the case, but
if I'm wrong please let me know.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] nested hstore patch

2014-01-13 Thread Oleg Bartunov
Andrew,

did you run the perl script? Actually, I found that the operator table
needs to be fixed.

Oleg

On Mon, Jan 13, 2014 at 7:36 PM, Andrew Dunstan  wrote:
>
> On 01/13/2014 03:25 AM, Erik Rijkers wrote:
>
>> There are errors in the example expressions in "Table F-6. hstore
>> Operators".
>>
>> Attached is a cumulative doc-patch (which includes the changes I sent
>> earlier) which fixes these.
>>
>> I also attach a test perl program that shows the (small) differences in
>> output between what's in that doc table and what
>> one actually gets. (I found these too insignificant to change but perhaps
>> you have a different opinion.)
>>
>>
>
>
>
> A new version of the patch is attached. It includes all of Erik's docs fixes
> and a small fix by Alexander Korotkov for hstore hash ops.
>
> cheers
>
> andrew
>




Re: [HACKERS] Add CREATE support to event triggers

2014-01-13 Thread Alvaro Herrera
Alvaro Herrera escribió:

> In an event trigger, the function pg_event_trigger_get_creation_commands()
> returns the following JSON blob:

After playing with this for a while, I realized something that must have
seemed quite obvious to those paying attention: what this function is,
is just a glorified sprintf() for JSON.  So I propose we take our
existing format(text) and use it to model a new format(json) function,
which will be useful to the project at hand and be of more general
applicability.

To make it a better fit, I have changed the spec slightly.  The format
string is now the "fmt" element in the topmost JSON.  This format string
can contain % escapes, which consist of:

* the literal % itself
* an element name, enclosed in braces { }.  The name can optionally be
  followed by a colon and a possibly-empty array separator.
* a format specifier, which can be I (identifier), D (dotted name), or s
  (string)
* Alternatively, %% expands to a literal %, as usual.

For each such escape, the JSON object is searched using the element name
as key.  For identifiers, the element is expected to be a string, and
will be quoted per identifier quoting rules.  Dotted-names are used to
format possibly-qualified relation names and such; the element must be
an object with one, two or three string elements, each of which is
quoted per identifier rules, and output separated by periods.

Finally, for arrays we expand each element in the JSON array element,
and separate them with the separator specified in the {} part of the
format specifier.

For instance,
alvherre=# select format(json '{"fmt":"hello, %{who}s! This is %{name}I", 
"who":"world", "name":"a function"}');
   format   
------------------------------------
 hello, world! This is "a function"

Elements can be objects, in which case they are expanded recursively: a
"fmt" element is looked up and expanded as described above.


I don't yet see a need for %L escapes (that is, literals that can expand
to a single-quoted value or to NULL), but if I see it I will add that
too.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] nested hstore patch

2014-01-13 Thread Andrew Dunstan




On 01/13/2014 11:03 AM, Alvaro Herrera wrote:

Andrew Dunstan wrote:

On 01/13/2014 03:25 AM, Erik Rijkers wrote:


There are errors in the example expressions in "Table F-6. hstore Operators".

Attached is a cumulative doc-patch (which includes the changes I sent earlier) 
which fixes these.

I also attach a test perl program that shows the (small) differences in output 
between what's in that doc table and what
one actually gets. (I found these too insignificant to change but perhaps you 
have a different opinion.)


A new version of the patch is attached. It includes all of Erik's
docs fixes and a small fix by Alexander Korotkov for hstore hash
ops.

Interestingly, this also includes transaction_commit event triggers.


Oh, wow, really? Either git did something horrible, or I did something 
inadvertently. This is what comes from using the same directory for 
multiple development lines :-(



Will fix



There are also a few PANIC elogs, probably not what's intended.



Oleg, Teodor, please address.



(I was just giving this a quick skim to see if there's support to build
JSON objects incrementally from C source, i.e. not have to call
functions using the fmgr interface.  Apparently that's not the case, but
if I'm wrong please let me know.)



Erm, maybe you need the other json patch: 



If we need to adjust some of that a bit to make it more friendly for 
internal use I'm happy to try to do that. Unfortunately, I don't think 
that's terribly easy for VARIADIC "any" functions like these.


cheers

andrew






Re: [HACKERS] Disallow arrays with non-standard lower bounds

2014-01-13 Thread Merlin Moncure
On Sun, Jan 12, 2014 at 4:38 AM, Craig Ringer  wrote:
> Implicit casts to text, anybody?

This backward compatibility break orphaned the company I work for on
8.1 until last year and very nearly caused postgres to be summarily
extirpated (only rescued at the last minute by my arrival). It cost
hundreds of thousands of dollars to qualify a sprawling java code base
so that it could be moved back into a supported version.  Breaking
compatibility sucks -- it hurts your users and costs people money.
Hacking type casts may not have been a mistake, but the arbitrary
introduction of the breakage certainly was.

This project has no deprecation policy, and I'd argue we'd need one
before considering breaking changes.  For example, maybe we could pull
out an occasional release for longer-term support to help users that get
caught out.  But really, the better way to go IMNSHO is to take a
hard line on compatibility issues pretty much always -- consider the
case of libc and win32 api.  There are certain limited exceptions to
this rule -- for example security problems or gross violations of the
standard (bringing row-wise comparison to spec comes to mind as an
example of that).

merlin




[HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Mel Gorman
Hi,

I'm the chair for Linux Storage, Filesystem and Memory Management Summit 2014
(LSF/MM). A CFP was sent out last month (https://lwn.net/Articles/575681/)
that you may have seen already.

In recent years we have had at least one topic that was shared between
all three tracks that was led by a person outside of the usual kernel
development community. I am checking if the PostgreSQL community
would be willing to volunteer someone to lead a topic discussing
PostgreSQL performance with recent kernels or to highlight regressions
or future developments you feel are potentially a problem. With luck
someone suitable is already travelling to the collaboration summit
(http://events.linuxfoundation.org/events/collaboration-summit) and it
would not be too inconvenient to drop in for LSF/MM as well.

There are two reasons why I'm suggesting this. First, PostgreSQL was the
basis of a test used to highlight a scheduler problem around kernel 3.6
but otherwise in my experience it is rare that PostgreSQL is part of a
bug report.  I am skeptical this particular bug report was a typical use
case for PostgreSQL (pgbench, read-only, many threads, very small in-memory
database). I wonder why reports related to PostgreSQL are not more common.
One assumption would be that PostgreSQL is perfectly happy with the current
kernel behaviour in which case our discussion here is done.

This brings me to the second reason -- there is evidence
that the PostgreSQL community is not happy with the current
direction of kernel development. The most obvious example is this thread
http://postgresql.1045698.n5.nabble.com/Why-we-are-going-to-have-to-go-DirectIO-td5781471.html
but I suspect there are others. The thread alleges that the kernel community
are in the business of pushing hackish changes into the IO stack without
much thought or testing although the linked article describes a VM and not
a storage problem. I'm not here to debate the kernels regression testing
or development methodology but LSF/MM is one place where a large number
of people involved with the IO layers will be attending.  If you have a
concrete complaint then here is a soap box.

Does the PostgreSQL community have a problem with recent kernels,
particularly with respect to the storage, filesystem or memory management
layers? If yes, do you have some data that can highlight this and can you
volunteer someone to represent your interests to the kernel community? Are
current developments in the IO layer counter to the PostgreSQL requirements?
If so, what developments, why are they a problem, do you have a suggested
alternative or some idea of what we should watch out for? The track topic
would be up to you but just as a hint, we'd need something a lot more
concrete than "you should test more".

-- 
Mel Gorman
SUSE Labs




Re: [HACKERS] GIN improvements part 1: additional information

2014-01-13 Thread Alexander Korotkov
On Sat, Jan 11, 2014 at 6:15 AM, Tomas Vondra  wrote:

> On 8.1.2014 22:58, Alexander Korotkov wrote:
> > Thanks for reporting. Fixed version is attached.
>
> I've tried to rerun the 'archie' benchmark with the current patch, and
> once again I got
>
>PANIC:  could not split GIN page, didn't fit
>
> I reran it with '--enable-cassert' and with that I got
>
> TRAP: FailedAssertion("!(ginCompareItemPointers(&items[i - 1],
>&items[i]) < 0)", File: "gindatapage.c", Line: 149)
> LOG:  server process (PID 5364) was terminated by signal 6: Aborted
> DETAIL:  Failed process was running: INSERT INTO messages ...
>
> so the assert in GinDataLeafPageGetUncompressed fails for some reason.
>
> I can easily reproduce it, but my knowledge in this area is rather
> limited so I'm not entirely sure what to look for.


I've fixed this bug and many other bugs. Now the patch passes the test suite
that I've used earlier. The results are as follows:

Operations time:
 event | period
---+-
 index_build   | 00:01:47.53915
 index_build_recovery  | 00:00:04
 index_update  | 00:05:24.388163
 index_update_recovery | 00:00:53
 search_new| 00:24:02.289384
 search_updated| 00:27:09.193343
(6 rows)

Index sizes:
 label |   size
---+---
 new   | 384761856
 after_updates | 667942912
(2 rows)

Also, I made following changes in algorithms:

   - Now there is a limit on the number of uncompressed TIDs on the page.
   After reaching this limit, they are encoded regardless of whether they
   would fit the page. That seems to me more desirable behaviour, and it
   also accelerates search speed. Before this change the times were as follows:

 event | period
---+-
 index_build   | 00:01:51.467888
 index_build_recovery  | 00:00:04
 index_update  | 00:05:03.315155
 index_update_recovery | 00:00:51
 search_new| 00:24:43.194882
 search_updated| 00:28:36.316784
(6 rows)

   - Pages are not fully re-encoded if it's enough to re-encode just the
   last segment.


README is updated.
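
For anyone skimming: the general varbyte delta-encoding technique applied to
the posting lists looks roughly like the sketch below. It is illustrative
only, not the patch's actual code -- the function name is made up, and TIDs
are assumed to already be mapped to sorted 64-bit integers.

#include <stddef.h>
#include <stdint.h>

/*
 * Encode a sorted array of 64-bit keys as varbyte-compressed deltas:
 * each value's difference from its predecessor is emitted 7 bits at a
 * time, low bits first, with the high bit of each byte marking "more
 * bytes follow".  Returns the number of bytes written to "out".
 */
size_t
encode_varbyte_deltas(const uint64_t *keys, size_t nkeys, unsigned char *out)
{
    size_t      n = 0;
    uint64_t    prev = 0;
    size_t      i;

    for (i = 0; i < nkeys; i++)
    {
        uint64_t    delta = keys[i] - prev;    /* requires sorted input */

        prev = keys[i];
        while (delta >= 0x80)
        {
            out[n++] = (unsigned char) (delta | 0x80);
            delta >>= 7;
        }
        out[n++] = (unsigned char) delta;      /* high bit clear: last byte */
    }
    return n;
}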

--
With best regards,
Alexander Korotkov.


gin-packed-postinglists-varbyte5.patch.gz



[HACKERS] KNN-GiST with recheck

2014-01-13 Thread Alexander Korotkov
Hackers!

This patch was split from thread:
http://www.postgresql.org/message-id/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=gon-...@mail.gmail.com

I've split it into a separate thread, because it's related to partial sort
only conceptually, not technically. Also, I renamed it to "knn-gist-recheck"
from "partial-knn" as a more appropriate name. In the attached version the
docs are updated. A possible weak point of this patch's design is that it
fetches the heap tuple from the GiST scan. However, I didn't receive any
notes about its design, so I'm going to put it into the commitfest.

Here is a description of this patch, the same as in the original thread.

KNN-GiST provides the ability to get ordered results from an index, but this
order is based only on index information. For instance, a GiST index contains
bounding rectangles for polygons, and we can't get the exact distance to a
polygon from the index (the situation is similar in PostGIS). In the attached
patch, the GiST distance method can set a recheck flag (similar to the
consistent method). This flag means that the distance method returned a lower
bound of the distance and we should recheck it against the heap.

See an example.

create table test as (select id, polygon(3+(random()*10)::int,
circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
generate_series(1,100) id);
create index test_idx on test using gist (p);

We can get results ordered by distance from polygon to point.

postgres=# select id, p <-> point(0.5,0.5) from test order by p <->
point(0.5,0.5) limit 10;
   id   |   ?column?
+--
 755611 | 0.000405855808916853
 807562 | 0.000464123777564343
 437778 | 0.000738524708741959
 947860 |  0.00076250998760724
 389843 | 0.000886362723569568
  17586 | 0.000981960100555216
 411329 |  0.00145338112316853
 894191 |  0.00149399559703506
 391907 |   0.0016647896049741
 235381 |  0.00167554614889509
(10 rows)

It's fast using just index scan.

QUERY PLAN

--
 Limit  (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
rows=10 loops=1)
   ->  Index Scan using test_idx on test  (cost=0.29..157672.29
rows=100 width=36) (actual time=0.179..0.228 rows=10 loops=1)
 Order By: (p <-> '(0.5,0.5)'::point)
 Total runtime: 0.305 ms
(4 rows)

--
With best regards,
Alexander Korotkov.


knn-gist-recheck-1.patch.gz



Re: [HACKERS] [PATCH] Add transforms feature

2014-01-13 Thread Robert Haas
On Fri, Jan 10, 2014 at 10:40 PM, Peter Eisentraut  wrote:
> On Wed, 2013-12-11 at 11:07 -0500, Tom Lane wrote:
>> We should have learned by now that those are usually a bad idea.
>> In this case, we've got changes in the behavior of function calling,
>> which seems like not only a nightmare for debugging but a fertile
>> source of security issues.
>
> I note that this is the same mechanism that we have elaborately designed
> for *avoiding* security issues from search_path.

And it works like crap.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] nested hstore patch

2014-01-13 Thread Andrew Dunstan



On 01/13/2014 11:16 AM, Oleg Bartunov wrote:

Andrew,

did you run the perl script? Actually, I found that the operator table
needs to be fixed.





No. My build machine doesn't actually have DBD::Pg installed. Can you 
send me a patch if you don't want to push it yourself, or maybe Erik can 
send a patch to adjust the table.


cheers

andrew




Re: [HACKERS] PoC: Partial sort

2014-01-13 Thread Alexander Korotkov
On Tue, Dec 31, 2013 at 5:41 AM, Andreas Karlsson  wrote:

> On 12/29/2013 08:24 AM, David Rowley wrote:
>
>> If it was possible to devise some way to reuse any
>> previous tuplesortstate perhaps just inventing a reset method which
>> clears out tuples, then we could see performance exceed the standard
>> seqscan -> sort. The code the way it is seems to lookup the sort
>> functions from the syscache for each group then allocate some sort
>> space, so quite a bit of time is also spent in palloc0() and pfree()
>>
>> If it was not possible to do this then maybe adding a cost to the number
>> of sort groups would be better so that the optimization is skipped if
>> there are too many sort groups.
>>
>
> It should be possible. I have hacked a quick proof of concept for reusing
> the tuplesort state. Can you try it and see if the performance regression
> is fixed by this?
>
> One thing which have to be fixed with my patch is that we probably want to
> close the tuplesort once we have returned the last tuple from ExecSort().
>
> I have attached my patch and the incremental patch on Alexander's patch.


Thanks. It's included in the attached version of the patch, as well as
estimation improvements, more comments, and a regression tests fix.

--
With best regards,
Alexander Korotkov.


partial-sort-5.patch.gz



Re: [HACKERS] Patch: show xid and xmin in pg_stat_activity and pg_stat_replication

2014-01-13 Thread Heikki Linnakangas

On 12/17/2013 04:58 PM, Christian Kruse wrote:

attached you will find a patch for showing the current transaction id
(xid) and the xmin of a backend in pg_stat_activity and the xmin in
pg_stat_replication.


Docs.

When an admin is looking for a long-running transaction that's blocking 
vacuum, he will currently rely on the timestamp fields, xact_start and 
query_start. I'm not sure how much extra value this adds over those 
timestamps in pg_stat_activity, but there are no such fields in 
pg_stat_replication, so that part is definitely useful. And if we're 
going to add xmin to pg_stat_replication, it makes sense to add it to 
pg_stat_activity too. Unless someone can come up with something better 
to display for walsenders. The timestamp of the last commit record 
that's been replayed, perhaps?


What else would a user want to do with these?

This definitely sounds useful to me as a developer, though. So I'm 
thinking we should add these for that reason, in any case.


- Heikki




Re: [HACKERS] Standalone synchronous master

2014-01-13 Thread Hannu Krosing
On 01/13/2014 04:12 PM, Florian Pflug wrote:
> On Jan12, 2014, at 04:18 , Josh Berkus  wrote:
>> Thing is, when we talk about auto-degrade, we need to determine things
>> like "Is the replica down or is this just a network blip"? and take
>> action according to the user's desired configuration.  This is not
>> something, realistically, that we can do on a single request.  Whereas
>> it would be fairly simple for an external monitoring utility to do:
>>
>> 1. decide replica is offline for the duration (several poll attempts
>> have failed)
>>
>> 2. Send ALTER SYSTEM SET to the master and change/disable the
>> synch_replicas.
>>
>> In other words, if we're going to have auto-degrade, the most
>> intelligent place for it is in
>> RepMgr/HandyRep/OmniPITR/pgPoolII/whatever.  It's also the *easiest*
>> place.  Anything we do *inside* Postgres is going to have a really,
>> really hard time determining when to degrade.
> +1
>
> This is also how 2PC works, btw - the database provides the building
> blocks, i.e. PREPARE and COMMIT, and leaves it to a transaction manager
> to deal with issues that require a whole-cluster perspective.
>

++1

I like Simon's idea to have a pg_xxx function for switching between
replication modes, which should be enough to support a monitor
daemon doing the switching.

Maybe we could have a 'syncrep_taking_too_long_command' GUC
which could be used to alert such a monitoring daemon, so it can
immediately check whether to

a) switch the master to async rep or standalone mode (in case the sync slave
becomes unavailable)

or

b) fail over to the slave (in the almost equally likely case that it was the
master which became disconnected from the world and the slave is available)

or

c) do something else depending on circumstances/policy :)


NB! Note that in case b) 'syncrep_taking_too_long_command' will
very likely also not reach the monitor daemon, so it cannot rely on
this as the main trigger!

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] Standalone synchronous master

2014-01-13 Thread Joshua D. Drake


On 01/13/2014 10:12 AM, Hannu Krosing wrote:

In other words, if we're going to have auto-degrade, the most
intelligent place for it is in
RepMgr/HandyRep/OmniPITR/pgPoolII/whatever.  It's also the *easiest*
place.  Anything we do *inside* Postgres is going to have a really,
really hard time determining when to degrade.

+1

This is also how 2PC works, btw - the database provides the building
blocks, i.e. PREPARE and COMMIT, and leaves it to a transaction manager
to deal with issues that require a whole-cluster perspective.



++1


+1



I like Simon's idea to have a pg_xxx function for switching between
replication modes, which should be enough to support a monitor
daemon doing the switching.

Maybe we could have a 'syncrep_taking_too_long_command' GUC
which could be used to alert such a monitoring daemon, so it can
immediately check whether to



I would think that would be a column in pg_stat_replication. Basically 
last_ack or something like that.




a) switch the master to async rep or standalone mode (in case the sync slave
becomes unavailable)


Yep.



or

b) fail over to the slave (in the almost equally likely case that it was the
master which became disconnected from the world and the slave is available)

or


I think this should be left to external tools.

JD


--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
"In a time of universal deceit - telling the truth is a revolutionary 
act.", George Orwell





Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Josh Berkus
Mel,

> I'm the chair for Linux Storage, Filesystem and Memory Management Summit 2014
> (LSF/MM). A CFP was sent out last month (https://lwn.net/Articles/575681/)
> that you may have seen already.
> 
> In recent years we have had at least one topic that was shared between
> all three tracks that was led by a person outside of the usual kernel
> development community. I am checking if the PostgreSQL community
> would be willing to volunteer someone to lead a topic discussing
> PostgreSQL performance with recent kernels or to highlight regressions
> or future developments you feel are potentially a problem. With luck
> someone suitable is already travelling to the collaboration summit
> (http://events.linuxfoundation.org/events/collaboration-summit) and it
> would not be too inconvenient to drop in for LSF/MM as well.

We can definitely get someone there.  I'll certainly be there; I'm
hoping to get someone who has closer involvement with our kernel
interaction as well.

> There are two reasons why I'm suggesting this. First, PostgreSQL was the
> basis of a test used to highlight a scheduler problem around kernel 3.6
> but otherwise in my experience it is rare that PostgreSQL is part of a
> bug report.  I am skeptical this particular bug report was a typical use
> case for PostgreSQL (pgbench, read-only, many threads, very small in-memory
> database). I wonder why reports related to PostgreSQL are not more common.
> One assumption would be that PostgreSQL is perfectly happy with the current
> kernel behaviour in which case our discussion here is done.

To be frank, it's because most people are still running on 2.6.19, and
as a result are completely unaware of recent developments.  Second,
because there's no obvious place to complain to ... lkml doesn't welcome
bug reports, and where else do you go?

> Does the PostgreSQL community have a problem with recent kernels,
> particularly with respect to the storage, filesystem or memory management
> layers? If yes, do you have some data that can highlight this and can you
> volunteer someone to represent your interests to the kernel community? 

Yes, and yes.

> Are
> current developments in the IO layer counter to the PostgreSQL requirements?
> If so, what developments, why are they a problem, do you have a suggested
> alternative or some idea of what we should watch out for? 

Mostly the issue is changes to the IO scheduler which improve one use
case at the expense of others, or set defaults which emphasize desktop
hardware over server hardware.

What also came up with the recent change to LRU is that the Postgres
community apparently has more experience than the Linux community with
buffer-clearing algorithms, and we ought to share that.

> The track topic
> would be up to you but just as a hint, we'd need something a lot more
> concrete than "you should test more".

How about "don't add major IO behavior changes with no
backwards-compatibility switches"?  ;-)

Seriously, one thing I'd like to get out of Collab would be a reasonable
regimen for testing database performance on Linux kernels.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Kevin Grittner
Josh Berkus  wrote:

>> Does the PostgreSQL community have a problem with recent
>> kernels, particularly with respect to the storage, filesystem or
>> memory management layers?

> How about "don't add major IO behavior changes with no
> backwards-compatibility switches"?  ;-)

I notice, Josh, that you didn't mention the problems many people
have run into with Transparent Huge Page defrag and with NUMA
access.  Is that because there *are* configuration options that
allow people to get decent performance once the issue is diagnosed?
It seems like maybe there could be a better way to give a heads-up
on hazards in a new kernel to the database world, but I don't know
quite what that would be.  For all I know, it is already available
if you know where to look.

> Seriously, one thing I'd like to get out of Collab would be a
> reasonable regimen for testing database performance on Linux
> kernels.

... or perhaps you figure this is what would bring such issues to
the community's attention before people are bitten in production
environments?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-13 Thread Peter Geoghegan
On Mon, Jan 13, 2014 at 12:23 AM, Heikki Linnakangas
 wrote:
> Exclusion constraints can be used to implement uniqueness checks with
> SP-GiST or GiST indexes. For example, if you want to enforce that there are
> no two tuples with the same x and y coordinates, ie. use a point as the key.
> You could add a b-tree index just to enforce the constraint, but it's better
> if you don't have to. In general, it's just always better if features don't
> have implementation-specific limitations like this.

That seems rather narrow. Among other things, I worry about the
baggage for users in documenting support for SP-GiST/GiST. "We support
it, but it only really works for the case where you're using exclusion
constraints as unique constraints, something that might make sense in
certain narrow contexts, contrary to our earlier general statement
that a unique index should be preferred there". We catalog amcanunique
methods as the way that we support unique indexes. I really do feel
that that's the appropriate level to support the feature at, and I
have not precluded other amcanunique implementations from doing the
same, having documented the intended value locking interface/contract
for the benefit of any future amcanunique AM author. It's ON DUPLICATE
KEY, not ON OVERLAPPING KEY, or any other syntax suggestive of
exclusion constraints and their arbitrary commutative operators.


-- 
Peter Geoghegan




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Josh Berkus
On 01/13/2014 10:51 AM, Kevin Grittner wrote:
>> How about "don't add major IO behavior changes with no
>> backwards-compatibility switches"?  ;-)
> 
> I notice, Josh, that you didn't mention the problems many people
> have run into with Transparent Huge Page defrag and with NUMA
> access.  Is that because there *are* configuration options that
> allow people to get decent performance once the issue is diagnosed?
> It seems like maybe there could be a better way to give a heads-up
> on hazards in a new kernel to the database world, but I don't know
> quite what that would be.  For all I know, it is already available
> if you know where to look.

Well, it was the lack of sysctl options which took the 2Q change from
"annoyance" to "potential disaster".  We can't ever get away from the
possibility that the Postgres use-case might be the minority use-case,
and we might have to use non-default options.  It's when those options
aren't present *at all* that we're stuck.

However, I agree that a worthwhile thing to talk about is having some
better channel to notify the Postgres (and other DB) communities about
major changes to IO and Memory management.

Wanna go to Collab?

>> Seriously, one thing I'd like to get out of Collab would be a
>> reasonable regimen for testing database performance on Linux
>> kernels.
> 
> ... or perhaps you figure this is what would bring such issues to
> the community's attention before people are bitten in production
> environments?

That, too.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Show lossy heap block info in EXPLAIN ANALYZE for bitmap heap scan

2014-01-13 Thread Robert Haas
On Thu, Jan 9, 2014 at 10:57 PM, Etsuro Fujita
 wrote:
> I wrote:
>> Robert Haas wrote:
>> > Hmm, fair point.  But I'm still not convinced that we really need to
>> > add extra accounting for this.  What's wrong with just reporting the
>> > number of exact and lossy pages?
>
>> No.  I intended to show the desired memory space for a TIDBitmap rather
>> than the peak memory usage for that TIDBitmap.  And I thought it'd be
>> better
>> for the latter to be displayed as additional information.  However, I've
>> removed the functionality for showing the desired memory space due to
>> technical problems.  Now I should probably remove the functionality for
>> showing the peak memory usage too.
>
>> Yes, as Andres mentioned, showing the peak memory usage is not a bad idea,
>> I think.  But I start to think it's not necessarily worth complicating the
>> code ...
>
>> If there are no objections of others, I'll remove extra accounting for
>> showing the peak memory usage.
>
> Done.  Please find attached a patch.

Looks good to me, so committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] ALTER SYSTEM SET command to change postgresql.conf parameters

2014-01-13 Thread Robert Haas
On Thu, Jan 9, 2014 at 11:06 PM, Amit Kapila  wrote:
> On Thu, Jan 9, 2014 at 12:26 AM, Robert Haas  wrote:
>> On Mon, Jan 6, 2014 at 11:37 PM, Amit Kapila  wrote:
>>> On Tue, Jan 7, 2014 at 12:52 AM, Robert Haas  wrote:
 On Mon, Jan 6, 2014 at 9:48 AM, Amit Kapila  
 wrote:
>> Couldn't we also handle this by postponing FreeConfigVariables until
>> after the if (error) block?
>
>Wouldn't doing it that way lead to a bigger memory leak if the error level
>is ERROR? Though the current fix can also leak memory, it will be
>just for ErrorConfFile_save. I think a similar case can happen for
>'pre_value' in the current code as well; that's why I have fixed it in a
>similar way in the patch.

 I was assuming that error-recovery would reset the containing memory
 context, but I'm not sure what memory context we're executing in at
 this point.
>>>
>>>
>>> In the current code, the only time it can go to the error path with
>>> elevel as ERROR is during Postmaster startup
>>> (context == PGC_POSTMASTER), at which point it will anyway upgrade
>>> ERROR to FATAL, so it should not be a problem to move the
>>> FreeConfigVariables() call after the error block check. However,
>>> in the future, if someone added any more ERRORs (the chances of which
>>> seem to be quite low), it could cause a leak; maybe that's why the
>>> original code was written that way.
>>>
>>> If you think it's better to fix by moving FreeConfigVariables() after error
>>> block check, then I can update the patch by doing so and incorporate other
>>> change (directly use PG_AUTOCONF_FILENAME) suggested by you
>>> as well?
>>
>> Yeah, let's do it that way.
>
> Okay, done. The attached patch fixes both the display of the wrong file
> name and the usage of PG_AUTOCONF_FILENAME.

Committed with a comment change.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Planning time in explain/explain analyze

2014-01-13 Thread Robert Haas
On Thu, Jan 9, 2014 at 11:45 PM, Tom Lane  wrote:
> Greg Stark  writes:
>> On Thu, Jan 9, 2014 at 9:14 PM, Tom Lane  wrote:
>>> In short then, I think we should just add this to EXPLAIN and be done.
>>> -1 for sticking the info into PlannedStmt or anything like that.
>
>> I'm confused. I thought I was arguing to support your suggestion that
>> the initial planning store the time in the cached plan and explain
>> should output the time the original planning took.
>
> Uh, no, wasn't my suggestion.  Doesn't that design imply measuring *every*
> planning cycle, explain or no?  I was thinking more of just putting the
> timing calls into explain.c.

Currently the patch includes changes to prepare.c which is what seems
odd to me.  I think it'd be fine to say, hey, I can't give you the
planning time in this EXPLAIN ANALYZE because I just used a cached
plan and did not re-plan.  But saying, hey, the planning time is
$TINYVALUE, when what we really mean is that looking up the
previously-cached plan took only that long, seems actively misleading
to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Planning time in explain/explain analyze

2014-01-13 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
> Currently the patch includes changes to prepare.c which is what seems
> odd to me.  I think it'd be fine to say, hey, I can't give you the
> planning time in this EXPLAIN ANALYZE because I just used a cached
> plan and did not re-plan.  But saying, hey, the planning time is
> $TINYVALUE, when what we really mean is that looking up the
> previously-cached plan took only that long, seems actively misleading
> to me.

My thought, at least, was to always grab the planning time and then
provide it for explain and/or explain analyze, and then for re-plan
cases, indicate if a cached plan was returned, if a replan happened, and
if so, what the old plan time and the new plan time were.

I don't think it makes any sense to report on the time returned from
pulling a previously-cached plan.

I understand that it's not completely free to track the plan time for
every query but I'm in the camp that says "we need better metrics and
information for 99% of what we do" and I'd like to see us eventually
able to track average plan time (maybe on a per-query basis..), average
run-time, how many times we do a hashjoin, mergejoin, the number of
records in/out of each, memory usage, etc, etc..  I don't think we need
per-tuple timing information.  I certainly wouldn't want to try and
collect all of this through shared memory or our existing stats
collector.

Thanks,

Stephen




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Robert Haas
On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner  wrote:
> I notice, Josh, that you didn't mention the problems many people
> have run into with Transparent Huge Page defrag and with NUMA
> access.

Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
setting zone_reclaim_mode; is there some other problem besides that?
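
(For reference, vm.zone_reclaim_mode is the real sysctl in question, and 0 --
reclaim disabled -- is the usual suggestion for database servers, though
that's a workload assumption rather than a universal rule. A trivial,
self-contained check:

#include <stdio.h>

int
main(void)
{
    FILE       *f = fopen("/proc/sys/vm/zone_reclaim_mode", "r");
    int         mode;

    if (f != NULL && fscanf(f, "%d", &mode) == 1)
        printf("vm.zone_reclaim_mode = %d\n", mode);   /* 0 = reclaim off */
    if (f != NULL)
        fclose(f);
    return 0;
}

)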

The other thing that comes to mind is the kernel's caching behavior.
We've talked a lot over the years about the difficulties of getting
the kernel to write data out when we want it to and to not write data
out when we don't want it to.  When it writes data back to disk too
aggressively, we get lousy throughput because the same page can get
written more than once when caching it for longer would have allowed
write-combining.  When it doesn't write data to disk aggressively
enough, we get huge latency spikes at checkpoint time when we call
fsync() and the kernel says "uh, what? you wanted that data *on the
disk*? sorry boss!" and then proceeds to destroy the world by starving
the rest of the system for I/O for many seconds or minutes at a time.
We've made some desultory attempts to use sync_file_range() to improve
things here, but I'm not sure that's really the right tool, and if it
is we don't know how to use it well enough to obtain consistent
positive results.
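
For context, the call in question looks like the sketch below. The function
and flag are the real Linux API; the helper name is made up, and the
unresolved part is exactly when, and over which ranges, to issue it.

#define _GNU_SOURCE            /* sync_file_range() is Linux-specific */
#include <fcntl.h>

/*
 * Ask the kernel to start writeback of a dirty file range now, without
 * waiting for completion, so that a later fsync() has less work to do.
 */
void
hint_writeback(int fd, off_t offset, off_t nbytes)
{
    (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}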

On a related note, there's also the problem of double-buffering.  When
we read a page into shared_buffers, we leave a copy behind in the OS
buffers, and similarly on write-out.  It's very unclear what to do
about this, since the kernel and PostgreSQL don't have intimate
knowledge of what each other are doing, but it would be nice to solve
somehow.
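
As an aside, the extent of the duplication is at least observable from
userspace. Here's a minimal sketch using mmap() and mincore() -- real
POSIX/Linux calls, error handling mostly omitted -- whose count can be
compared against pg_buffercache to estimate how much data is held twice:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    struct stat st;
    long        pagesz = sysconf(_SC_PAGESIZE);
    size_t      npages,
                cached = 0,
                i;
    unsigned char *vec;
    void       *map;
    int         fd;

    if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0 ||
        fstat(fd, &st) < 0 || st.st_size == 0)
        return 1;
    npages = (st.st_size + pagesz - 1) / pagesz;
    map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    vec = malloc(npages);
    if (map == MAP_FAILED || mincore(map, st.st_size, vec) != 0)
        return 1;
    for (i = 0; i < npages; i++)
        cached += vec[i] & 1;      /* low bit set: page is resident */
    printf("%zu of %zu pages in OS cache\n", cached, npages);
    return 0;
}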

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-13 Thread Robert Haas
On Mon, Jan 13, 2014 at 1:53 PM, Peter Geoghegan  wrote:
> On Mon, Jan 13, 2014 at 12:23 AM, Heikki Linnakangas
>  wrote:
>> Exclusion constraints can be used to implement uniqueness checks with
>> SP-GiST or GiST indexes. For example, if you want to enforce that there are
>> no two tuples with the same x and y coordinates, ie. use a point as the key.
>> You could add a b-tree index just to enforce the constraint, but it's better
>> if you don't have to. In general, it's just always better if features don't
>> have implementation-specific limitations like this.
>
> That seems rather narrow. Among other things, I worry about the
> baggage for users in documenting support for SP-GiST/GiST. "We support
> it, but it only really works for the case where you're using exclusion
> constraints as unique constraints, something that might make sense in
> certain narrow contexts, contrary to our earlier general statement
> that a unique index should be preferred there". We catalog amcanunique
> methods as the way that we support unique indexes. I really do feel
> that that's the appropriate level to support the feature at, and I
> have not precluded other amcanunique implementations from doing the
> same, having documented the intended value locking interface/contract
> for the benefit of any future amcanunique AM author. It's ON DUPLICATE
> KEY, not ON OVERLAPPING KEY, or any other syntax suggestive of
> exclusion constraints and their arbitrary commutative operators.

For what it's worth, I agree with Heikki.  There's probably nothing
sensible an upsert can do if it conflicts with more than one tuple,
but if it conflicts with just exactly one, it oughta be OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Claudio Freire
On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas  wrote:
> On a related note, there's also the problem of double-buffering.  When
> we read a page into shared_buffers, we leave a copy behind in the OS
> buffers, and similarly on write-out.  It's very unclear what to do
> about this, since the kernel and PostgreSQL don't have intimate
> knowledge of what each other are doing, but it would be nice to solve
> somehow.


There you have a much harder algorithmic problem.

You can basically control duplication with fadvise and WONTNEED. The
problem here is not the kernel and whether or not it allows postgres
to be smart about it. The problem is... what kind of smarts
(algorithm) to use.
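
Mechanically, that control point is simple -- the POSIX spelling is
POSIX_FADV_DONTNEED, and note it only reliably drops clean pages. A minimal
sketch (the helper name is made up); the policy question above is untouched
by it:

#include <fcntl.h>
#include <unistd.h>

/*
 * Read one block into our own buffer, then advise the kernel that its
 * cached copy can be dropped, leaving shared_buffers as the only copy.
 */
ssize_t
read_block_and_drop(int fd, off_t blkoff, char *buf, size_t blksz)
{
    ssize_t     nread = pread(fd, buf, blksz, blkoff);

    if (nread == (ssize_t) blksz)
        (void) posix_fadvise(fd, blkoff, (off_t) blksz, POSIX_FADV_DONTNEED);
    return nread;
}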




Re: [HACKERS] Planning time in explain/explain analyze

2014-01-13 Thread Andreas Karlsson

On 01/13/2014 09:06 PM, Stephen Frost wrote:

My thought, at least, was to always grab the planning time and then
provide it for explain and/or explain analyze, and then for re-plan
cases, indicate if a cached plan was returned, if a replan happened, and
if a replan happened, what the old plan time and the new plan time was.


I like this solution due to its correctness and that all information is 
included. But I also think my original patch was fine in how it, in its 
simplicity, solved the problem without adding any overhead in the 
non-EXPLAIN case. Either solution would be fine by me. If the consensus 
is that we want to always measure it I will look at implementing that 
instead.


--
Andreas Karlsson




Re: [HACKERS] Planning time in explain/explain analyze

2014-01-13 Thread Tom Lane
Robert Haas  writes:
> On Thu, Jan 9, 2014 at 11:45 PM, Tom Lane  wrote:
>> Uh, no, wasn't my suggestion.  Doesn't that design imply measuring *every*
>> planning cycle, explain or no?  I was thinking more of just putting the
>> timing calls into explain.c.

> Currently the patch includes changes to prepare.c which is what seems
> odd to me.  I think it'd be fine to say, hey, I can't give you the
> planning time in this EXPLAIN ANALYZE because I just used a cached
> plan and did not re-plan.  But saying, hey, the planning time is
> $TINYVALUE, when what we really mean is that looking up the
> previously-cached plan took only that long, seems actively misleading
> to me.

Meh.  Why?  This would only come into play for EXPLAIN EXECUTE stmtname.
I don't think users would be surprised to see a report of minimal planning
time for that.  In fact, it might be a good thing, as it would make it
easier to tell the difference between whether you were seeing a generic
plan or a custom plan for the prepared statement.

regards, tom lane




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Jim Nasby

On 1/13/14, 2:19 PM, Claudio Freire wrote:

On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas  wrote:

On a related note, there's also the problem of double-buffering.  When
we read a page into shared_buffers, we leave a copy behind in the OS
buffers, and similarly on write-out.  It's very unclear what to do
about this, since the kernel and PostgreSQL don't have intimate
knowledge of what each other are doing, but it would be nice to solve
somehow.



There you have a much harder algorithmic problem.

You can basically control duplication with fadvise and WONTNEED. The
problem here is not the kernel and whether or not it allows postgres
to be smart about it. The problem is... what kind of smarts
(algorithm) to use.


Isn't this a fairly simple matter of, when we read a page into shared buffers, 
telling the kernel to forget that page? And a corollary to that for when we dump a 
page out of shared_buffers (here kernel, please put this back into your cache).
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-13 Thread Robert Haas
On Fri, Jan 10, 2014 at 2:07 PM, Tom Lane  wrote:
> Florian Pflug  writes:
>> On Jan10, 2014, at 19:08 , Tom Lane  wrote:
>>> Although, having said that ... maybe "build your own aggregate" would
>>> be a reasonable suggestion for people who need this?  I grant that
>>> it's going to be a minority requirement, maybe even a small minority
>>> requirement.  People who have the chops to get this sort of thing right
>>> can probably manage a custom aggregate definition.
>
>> So we'd put a footgun into the hands of people who don't know what they're
>> doing, to be fired for performance's sake, and leave it to the people
>> who know what they are doing to put the safety on?
>
> If I may put words in Kevin's mouth, I think his point is that having
> float8 sum() at all is a foot-gun, and that's hard to deny.  You need
> to know how to use it safely.

Yeah, but Florian's point is that not all foot-guns are created equal.
 The fact that we're walking around with a loaded BB-gun in our hip
pocket is not a good reason to replace it with a howitzer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Claudio Freire
On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby  wrote:
> On 1/13/14, 2:19 PM, Claudio Freire wrote:
>>
>> On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas 
>> wrote:
>>>
>>> On a related note, there's also the problem of double-buffering.  When
>>> we read a page into shared_buffers, we leave a copy behind in the OS
>>> buffers, and similarly on write-out.  It's very unclear what to do
>>> about this, since the kernel and PostgreSQL don't have intimate
>>> knowledge of what each other are doing, but it would be nice to solve
>>> somehow.
>>
>>
>>
>> There you have a much harder algorithmic problem.
>>
>> You can basically control duplication with fadvise and WONTNEED. The
>> problem here is not the kernel and whether or not it allows postgres
>> to be smart about it. The problem is... what kind of smarts
>> (algorithm) to use.
>
>
> Isn't this a fairly simple matter of, when we read a page into shared buffers,
> telling the kernel to forget that page? And a corollary to that for when we
> dump a page out of shared_buffers (here kernel, please put this back into
> your cache).


That's my point. In terms of kernel-postgres interaction, it's fairly simple.

What's not so simple, is figuring out what policy to use. Remember,
you cannot tell the kernel to put some page in its page cache without
reading it or writing it. So, once you make the kernel forget a page,
evicting it from shared buffers becomes quite expensive.




Re: [HACKERS] Planning time in explain/explain analyze

2014-01-13 Thread Robert Haas
On Mon, Jan 13, 2014 at 3:23 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, Jan 9, 2014 at 11:45 PM, Tom Lane  wrote:
>>> Uh, no, wasn't my suggestion.  Doesn't that design imply measuring *every*
>>> planning cycle, explain or no?  I was thinking more of just putting the
>>> timing calls into explain.c.
>
>> Currently the patch includes changes to prepare.c which is what seems
>> odd to me.  I think it'd be fine to say, hey, I can't give you the
>> planning time in this EXPLAIN ANALYZE because I just used a cached
>> plan and did not re-plan.  But saying, hey, the planning time is
>> $TINYVALUE, when what we really mean is that looking up the
>> previously-cached plan took only that long, seems actively misleading
>> to me.
>
> Meh.  Why?  This would only come into play for EXPLAIN EXECUTE stmtname.
> I don't think users would be surprised to see a report of minimal planning
> time for that.  In fact, it might be a good thing, as it would make it
> easier to tell the difference between whether you were seeing a generic
> plan or a custom plan for the prepared statement.

It would also make it easier to be wrong.  If you want to display that
information explicitly, fine.  But asking the user to use the elapsed
time to guess whether or not we really planned anything is just going
to confuse people who don't have enough experience with the system to
know what the boundary is between the largest time that could be a
cache lookup and the smallest time that could be real planning
activity.  And that means virtually everyone, me included.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Jim Nasby

On 1/13/14, 2:27 PM, Claudio Freire wrote:

On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby  wrote:

On 1/13/14, 2:19 PM, Claudio Freire wrote:


On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas 
wrote:


On a related note, there's also the problem of double-buffering.  When
we read a page into shared_buffers, we leave a copy behind in the OS
buffers, and similarly on write-out.  It's very unclear what to do
about this, since the kernel and PostgreSQL don't have intimate
knowledge of what each other are doing, but it would be nice to solve
somehow.




There you have a much harder algorithmic problem.

You can basically control duplication with fadvise and WONTNEED. The
problem here is not the kernel and whether or not it allows postgres
to be smart about it. The problem is... what kind of smarts
(algorithm) to use.



Isn't this a fairly simple matter of, when we read a page into shared buffers,
telling the kernel to forget that page? And a corollary to that for when we
dump a page out of shared_buffers (here kernel, please put this back into
your cache).



That's my point. In terms of kernel-postgres interaction, it's fairly simple.

What's not so simple, is figuring out what policy to use. Remember,
you cannot tell the kernel to put some page in its page cache without
reading it or writing it. So, once you make the kernel forget a page,
evicting it from shared buffers becomes quite expensive.


Well, if we were to collaborate with the kernel community on this then presumably we could 
do better than that for eviction... even to the extent of "here's some data from 
this range in this file. It's (clean|dirty). Put it in your cache. Just trust me on 
this."
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Claudio Freire
On Mon, Jan 13, 2014 at 5:32 PM, Jim Nasby  wrote:
>>
>> That's my point. In terms of kernel-postgres interaction, it's fairly
>> simple.
>>
>> What's not so simple, is figuring out what policy to use. Remember,
>> you cannot tell the kernel to put some page in its page cache without
>> reading it or writing it. So, once you make the kernel forget a page,
>> evicting it from shared buffers becomes quite expensive.
>
>
> Well, if we were to collaborate with the kernel community on this then
> presumably we can do better than that for eviction... even to the extent of
> "here's some data from this range in this file. It's (clean|dirty). Put it
> in your cache. Just trust me on this."


If I had a kernel developer hat, I'd put it on to say: I don't think
allowing that last bit is wise for a kernel.

It would violate oh-so-many separation rules and open an oh-so-big can-o-worms.




Re: [HACKERS] Planning time in explain/explain analyze

2014-01-13 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jan 13, 2014 at 3:23 PM, Tom Lane  wrote:
>> Meh.  Why?  This would only come into play for EXPLAIN EXECUTE stmtname.
>> I don't think users would be surprised to see a report of minimal planning
>> time for that.  In fact, it might be a good thing, as it would make it
>> easier to tell the difference between whether you were seeing a generic
>> plan or a custom plan for the prepared statement.

> It would also make it easier to be wrong.  If you want to display that
> information explicitly, fine.  But asking the user to use the elapsed
> time to guess whether or not we really planned anything is just going
> to confuse people who don't have enough experience with the system to
> know what the boundary is between the largest time that could be a
> cache lookup and the smallest time that could be real planning
> activity.  And that means virtually everyone, me included.

If you're saying that you'd like EXPLAIN to explicitly mention whether
the plan was cached or custom, I don't have any great complaint about
that.  I'm just not seeing how you arrive at the conclusion that we
mustn't report the amount of time EXPLAIN spent to get the plan.
If we do what you're proposing we'll just have a different set of confused
users, who will be wondering how EXPLAIN could have managed to spend
100 msec planning something when the EXPLAIN only took 10 msec in toto
according to psql.

regards, tom lane




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Andres Freund
On 2014-01-13 15:15:16 -0500, Robert Haas wrote:
> On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner  wrote:
> > I notice, Josh, that you didn't mention the problems many people
> > have run into with Transparent Huge Page defrag and with NUMA
> > access.
> 
> Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
> setting zone_reclaim_mode; is there some other problem besides that?

I think that fixes some of the worst instances, but I've seen machines
spending horrible amounts of CPU (& BUS) time in page reclaim
nonetheless. If I analyzed it correctly it's in RAM << working set
workloads where RAM is pretty large and most of it is used as page
cache. The kernel ends up spending a huge percentage of time finding and
potentially defragmenting pages when looking for victim buffers.
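
For anyone following along, the knob in question is a real sysctl:

sysctl vm.zone_reclaim_mode                  # 0 disables zone reclaim
echo 0 > /proc/sys/vm/zone_reclaim_mode      # same setting, via procfs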

> On a related note, there's also the problem of double-buffering.  When
> we read a page into shared_buffers, we leave a copy behind in the OS
> buffers, and similarly on write-out.  It's very unclear what to do
> about this, since the kernel and PostgreSQL don't have intimate
> knowledge of what each other are doing, but it would be nice to solve
> somehow.

I've wondered before if there wouldn't be a chance for postgres to say
"my dear OS, the file range 0-8192 of file x contains y, no need to
reread" and do that when we evict a page from s_b, but I never dared to
actually propose that to kernel people...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2014-01-13 Thread Robert Haas
On Sat, Jan 11, 2014 at 1:08 AM, Amit Kapila  wrote:
> Yes, currently this applies to update, what I have in mind is that
> in future if some one wants to use WAL compression for any other
> operation like 'full_page_writes', then it can be easily extendible.
>
> To be honest, I have not evaluated whether such a flag or compression
> would make sense for full page writes, but I think it should be possible
> while doing full page write (BkpBlock has RelFileNode) to check such a
> flag if it's present.

Makes sense.

> The reason for adding the same chunk at the head of the list is that it
> uses the same technique as pglz_hist_add. Now in pglz, it will not repeat
> steps c~f, as it has the concept of good_match, which gets this done in
> one go.
>
> That being said, I am really not sure how much real-world data falls into
> the above category, or whether we should try to optimize based on the
> above example, but yes, it will save some CPU cycles in the current test
> we are using.

In the Rabin algorithm, we shouldn't try to find a longer match.  The
match should end at the chunk end, period.  Otherwise, you lose the
shift-resistant property of the algorithm.

>>But I do think there might be a bug here, which is
>> that, unless I'm misinterpreting something, hp is NOT the end of the
>> chunk.  After calling pgrb_hash_init(), we've looked at the first FOUR
>> bytes of the input.  If we find that we have a zero hash value at that
>> point, shouldn't the chunk size be 4, not 1?  And similarly if we find
>> it after sucking in one more byte, shouldn't the chunk size be 5, not
>> 2?  Right now, we're deciding where the chunks should end based on the
>> data in the chunk plus the following 3 bytes, and that seems wonky.  I
>> would expect us to include all of those bytes in the chunk.
>
> It depends on how we define a chunk; basically, the chunk size is based
> on the byte for which we consider the hindex. The hindex for any byte is
> calculated from that byte and the following 3 bytes, so after calling
> pgrb_hash_init(), even though we have looked at 4 bytes, the hindex is
> still for the first byte, and that's why it considers the chunk size to
> be 1, not 4.
>
> Isn't it similar to how the current pglz works? It also uses the next
> 4 bytes to calculate the index (pglz_hist_idx) but still does
> byte-by-byte comparison; if we try to map that to Rabin's delta encoding,
> the chunk size is always 1.

I don't quite understand this.  The point of the Rabin algorithm is to
split the old tuple up into chunks and then search for those chunks in the
new tuple.  For example, suppose the old tuple is
abcdefghijklmnopqrstuvwxyz.  It might get split like this: abcdefg
hijklmnopqrstuvw xyz.  If any of those three chunks appear in the new
tuple, then we'll use them for compression.  If not, we'll just copy
the literal bytes.  If the chunks appear in the new tuple reordered or
shifted or with stuff inserted between one chunk at the next, we'll
still find them.  Unless I'm confused, which is possible, what you're
doing is essentially looking at the string and splitting it at those
three places, but then recording the chunks as being three bytes
shorter than they really are.  I don't see how that can be right.
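
To illustrate the property I mean, here is a toy version (not the actual
pgrb_* code; the boundary hash is a stand-in).  The boundary test looks
only at the last few bytes, so boundaries depend purely on local content,
and every byte -- including the bytes the boundary test looked at --
belongs to exactly one chunk:

#include <stdio.h>

#define WINDOW 4
/* Mix the 4-byte window; ~1 in 16 positions ends a chunk. */
#define IS_BOUNDARY(h) ((((h) * 2654435761u) & 0xF) == 0)

static void
chunk_print(const unsigned char *data, int len)
{
    unsigned int h = 0;
    int start = 0;

    for (int i = 0; i < len; i++)
    {
        h = (h << 8) | data[i];     /* h holds the last WINDOW bytes */
        if (i >= WINDOW - 1 && IS_BOUNDARY(h))
        {
            /* The chunk ends here, and it includes the window bytes. */
            printf("chunk: %.*s\n", i - start + 1, (const char *) data + start);
            start = i + 1;
        }
    }
    if (start < len)
        printf("chunk: %.*s\n", len - start, (const char *) data + start);
}

int
main(void)
{
    chunk_print((const unsigned char *) "abcdefghijklmnopqrstuvwxyz", 26);
    return 0;
}

Because the boundary decision only depends on a local window, an insertion
early in the data cannot shift every later boundary -- which is exactly the
shift-resistance that gets lost if matches are allowed to run past chunk
ends.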

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Planning time in explain/explain analyze

2014-01-13 Thread Robert Haas
On Mon, Jan 13, 2014 at 3:40 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Mon, Jan 13, 2014 at 3:23 PM, Tom Lane  wrote:
>>> Meh.  Why?  This would only come into play for EXPLAIN EXECUTE stmtname.
>>> I don't think users would be surprised to see a report of minimal planning
>>> time for that.  In fact, it might be a good thing, as it would make it
>>> easier to tell the difference between whether you were seeing a generic
>>> plan or a custom plan for the prepared statement.
>
>> It would also make it easier to be wrong.  If you want to display that
>> information explicitly, fine.  But asking the user to use the elapsed
>> time to guess whether or not we really planned anything is just going
>> to confuse people who don't have enough experience with the system to
>> know what the boundary is between the largest time that could be a
>> cache lookup and the smallest time that could be real planning
>> activity.  And that means virtually everyone, me included.
>
> If you're saying that you'd like EXPLAIN to explicitly mention whether
> the plan was cached or custom, I don't have any great complaint about
> that.  I'm just not seeing how you arrive at the conclusion that we
> mustn't report the amount of time EXPLAIN spent to get the plan.
> If we do what you're proposing we'll just have a different set of confused
> users, who will be wondering how EXPLAIN could have managed to spend
> 100 msec planning something when the EXPLAIN only took 10 msec in toto
> according to psql.

What I'm saying is that if EXPLAIN reports something that's labelled
"Planning Time", it should *be* the planning time, and not anything
else.  When we retrieve a plan from cache, it would be sensible not to
report the planning time at all, and IMHO it would also be sensible to
report the time it actually took to plan whenever we originally did
it.  But reporting a value that is not the planning time and calling
it the planning time does not seem like a good idea to me.
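
For concreteness, what I'd find non-misleading (labels and numbers
hypothetical):

=# EXPLAIN ANALYZE EXECUTE stmtname(42);
   ...plan...
 Planning Time: 0.310 ms    <- shown only when we actually planned
 Total runtime: 1.284 ms

and for a plan fetched from the cache, either no Planning Time line at all
or the time recorded when the plan was originally made.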

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-13 Thread Peter Geoghegan
On Mon, Jan 13, 2014 at 12:17 PM, Robert Haas  wrote:
> For what it's worth, I agree with Heikki.  There's probably nothing
> sensible an upsert can do if it conflicts with more than one tuple,
> but if it conflicts with just exactly one, it oughta be OK.

If there is exactly one, *and* the existing value is exactly the same
as the value proposed for insertion (or, I suppose, a subset of the
existing value, but that's so narrow that it might as well not apply).
In short, when you're using an exclusion constraint as a unique
constraint. Which is very narrow indeed. Weighing the costs and the
benefits, that seems like far more cost than benefit, before we even
consider anything beyond simply explaining the applicability and
limitations of upserting with exclusion constraints. It's generally
far cleaner to define speculative insertion as something that happens
with unique indexes only.
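
For clarity, the narrow case I mean is the one where an exclusion
constraint merely emulates a unique index, e.g. (assuming the btree_gist
extension for the gist opclass on int):

create table t (
    k int,
    exclude using gist (k with =)    -- in effect, unique (k)
);

Anything with a richer operator, like overlap on ranges, can conflict with
rows whose values don't equal the incoming one, which is where this breaks
down.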

-- 
Peter Geoghegan




Re: [HACKERS] PoC: Partial sort

2014-01-13 Thread Marti Raudsepp
Hi Alexander,

First, thanks a lot for working on this feature. This PostgreSQL
shortcoming crops up all the time in web applications that implement
paging by multiple sorted columns.

I've been trying it out in a few situations. I implemented a new
enable_partialsort GUC to make it easier to turn on/off, this way it's a
lot easier to test. The attached patch applies on top of
partial-sort-5.patch

I will spend more time reviewing the patch, but some of this planner code
is over my head. If there's any way I can help to make sure this lands in
the next version, let me know.



The patch performs just as well as I would expect it to:

marti=# select ac.name, r.name from artist_credit ac join release r on (ac.id=r.artist_credit) order by ac.name, r.name limit 1000;
Time: 9.830 ms
marti=# set enable_partialsort = off;
marti=# select ac.name, r.name from artist_credit ac join release r on (ac.id=r.artist_credit) order by ac.name, r.name limit 1000;
Time: 1442.815 ms

A difference of almost 150x!

There's a missed opportunity in that the code doesn't consider pushing new
Sort steps into subplans. For example, if there's no index on
language(name) then this query cannot take advantage of partial sorts:

marti=# explain select l.name, r.name from language l join release r on (l.id=r.language) order by l.name, r.name limit 1000;
 Limit  (cost=123203.20..123205.70 rows=1000 width=32)
   ->  Sort  (cost=123203.20..126154.27 rows=1180430 width=32)
         Sort Key: l.name, r.name
         ->  Hash Join  (cost=229.47..58481.49 rows=1180430 width=32)
               Hash Cond: (r.language = l.id)
               ->  Seq Scan on release r  (cost=0.00..31040.10 rows=1232610 width=26)
               ->  Hash  (cost=131.43..131.43 rows=7843 width=14)
                     ->  Seq Scan on language l  (cost=0.00..131.43 rows=7843 width=14)

But because there are so few languages, it would be a lot faster to sort
the languages in advance and then do a partial sort:
 Limit  (rows=1000 width=31)
   ->  Partial sort  (rows=1180881 width=31)
         Sort Key: l.name, r.name
         Presorted Key: l.name
         ->  Nested Loop  (rows=1180881 width=31)
               ->  Sort  (rows=7843 width=10)
                     Sort Key: name
                     ->  Seq Scan on language  (rows=7843 width=14)
               ->  Index Scan using release_language_idx on release r  (rows=11246 width=25)
                     Index Cond: (language = l.id)
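
(Today the workaround is to provide the ordering yourself, e.g. with a
hypothetical index

create index on language (name);

after which the index scan supplies the presorted l.name prefix and the
existing patch can do the rest.)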

Even an explicitly sorted CTE cannot take advantage of partial sorts:
marti=# explain with sorted_lang as (select id, name from language order by name)
marti-# select l.name, r.name from sorted_lang l join release r on (l.id=r.language)
marti-# order by l.name, r.name limit 1000;
 Limit  (cost=3324368.83..3324371.33 rows=1000 width=240)
   CTE sorted_lang
     ->  Sort  (cost=638.76..658.37 rows=7843 width=14)
           Sort Key: language.name
           ->  Seq Scan on language  (cost=0.00..131.43 rows=7843 width=14)
   ->  Sort  (cost=3323710.46..3439436.82 rows=46290543 width=240)
         Sort Key: l.name, r.name
         ->  Merge Join  (cost=664.62..785649.92 rows=46290543 width=240)
               Merge Cond: (r.language = l.id)
               ->  Index Scan using release_language_idx on release r  (cost=0.43..87546.06 rows=1232610 width=26)
               ->  Sort  (cost=664.19..683.80 rows=7843 width=222)
                     Sort Key: l.id
                     ->  CTE Scan on sorted_lang l  (cost=0.00..156.86 rows=7843 width=222)

But even with these limitations, this will easily be the killer feature of
the next release, for me at least.

Regards,
Marti


On Mon, Jan 13, 2014 at 8:01 PM, Alexander Korotkov wrote:

> On Tue, Dec 31, 2013 at 5:41 AM, Andreas Karlsson wrote:
>
>> On 12/29/2013 08:24 AM, David Rowley wrote:
>>
>>> If it was possible to devise some way to reuse any
>>> previous tuplesortstate perhaps just inventing a reset method which
>>> clears out tuples, then we could see performance exceed the standard
>>> seqscan -> sort. The code the way it is seems to lookup the sort
>>> functions from the syscache for each group then allocate some sort
>>> space, so quite a bit of time is also spent in palloc0() and pfree()
>>>
>>> If it was not possible to do this then maybe adding a cost to the number
>>> of sort groups would be better so that the optimization is skipped if
>>> there are too many sort groups.
>>>
>>
>> It should be possible. I have hacked a quick proof of concept for reusing
>> the tuplesort state. Can you try it and see if the performance regression
>> is fixed by this?
>>
>> One thing which have to be fixed with my patch is that we probably want
>> to close the tuplesort once we have returned the last tuple from ExecSort().
>>
>> I have attached my patch and the incremental patch on Alexander's patch.
>
>
> Thanks. It's included in the attached version of the patch, as well as
> estimation improvements, more comments, and regression test fixes.
>
> --
> With best regards,
> Alexander Korotkov

Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Jim Nasby

On 1/13/14, 2:37 PM, Claudio Freire wrote:

On Mon, Jan 13, 2014 at 5:32 PM, Jim Nasby  wrote:


That's my point. In terms of kernel-postgres interaction, it's fairly
simple.

What's not so simple, is figuring out what policy to use. Remember,
you cannot tell the kernel to put some page in its page cache without
reading it or writing it. So, once you make the kernel forget a page,
evicting it from shared buffers becomes quite expensive.



Well, if we were to collaborate with the kernel community on this then
presumably we can do better than that for eviction... even to the extent of
"here's some data from this range in this file. It's (clean|dirty). Put it
in your cache. Just trust me on this."



If I had a kernel developer hat, I'd put it on to say: I don't think
allowing that last bit is wise for a kernel.

It would violate oh-so-many separation rules and open an oh-so-big can-o-worms.


Yeah, if it were me I'd probably want to keep a hash of the page and its
address and only accept putting a page back into the kernel if it matched my
hash. Otherwise you'd just have to treat it as a write.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-13 Thread Heikki Linnakangas

On 01/13/2014 10:53 PM, Peter Geoghegan wrote:

On Mon, Jan 13, 2014 at 12:17 PM, Robert Haas  wrote:

For what it's worth, I agree with Heikki.  There's probably nothing
sensible an upsert can do if it conflicts with more than one tuple,
but if it conflicts with just exactly one, it oughta be OK.


If there is exactly one, *and* the existing value is exactly the same
as the value proposed for insertion (or, I suppose, a subset of the
existing value, but that's so narrow that it might as well not apply).
In short, when you're using an exclusion constraint as a unique
constraint. Which is very narrow indeed. Weighing the costs and the
benefits, that seems like far more cost than benefit, before we even
consider anything beyond simply explaining the applicability and
limitations of upserting with exclusion constraints. It's generally
far cleaner to define speculative insertion as something that happens
with unique indexes only.


Well, even if you don't agree that locking all the conflicting rows for 
update is sensible, it's still perfectly sensible to return the rejected 
rows to the user. For example, you're inserting N rows, and if some of 
them violate a constraint, you still want to insert the non-conflicting 
rows instead of rolling back the whole transaction.


- Heikki




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Andres Freund
On 2014-01-13 15:53:36 -0500, Trond Myklebust wrote:
> > I've wondered before if there wouldn't be a chance for postgres to say
> > "my dear OS, that the file range 0-8192 of file x contains y, no need to
> > reread" and do that when we evict a page from s_b but I never dared to
> > actually propose that to kernel people...
> 
> O_DIRECT was specifically designed to solve the problem of double buffering 
> between applications and the kernel. Why are you not able to use that in 
> these situations?

Because we like to let the OS handle part of postgres' caching. For
one, it makes servers with several applications/databases much more
realistic without seriously overallocating memory; for another, doing it
all ourselves would mean a huge chunk of platform-dependent code to get
good performance everywhere.
The above was explicitly not to avoid double buffering but to move a
buffer away from postgres' own buffers to the kernel's buffers once it's
not 100% clear we need it in buffers anymore.

Part of the reason this is being discussed is that people previously
suggested going the direct IO route, and others (most prominently
J. Corbet in
http://archives.postgresql.org/message-id/20131204083345.31c60dd1%40lwn.net
) disagreed because that goes the route of reinventing storage layers
everywhere without improving the common codepaths.
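
For readers who haven't used it, direct IO looks roughly like this (a
minimal sketch; "datafile" and the 4k/8k sizes are placeholders):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
    void   *buf = NULL;
    int     fd = open("datafile", O_RDONLY | O_DIRECT);

    /* O_DIRECT requires buffer, offset and length alignment, typically
     * to the device's logical block size. */
    if (fd < 0 || posix_memalign(&buf, 4096, 8192) != 0)
        return 1;
    if (pread(fd, buf, 8192, 0) != 8192)
        return 1;
    close(fd);
    free(buf);
    return 0;
}

Once you go this route you also own readahead, write scheduling and so on,
which is where the platform-dependent code comes in.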

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Jeff Janes
On Mon, Jan 13, 2014 at 12:32 PM, Jim Nasby  wrote:

> On 1/13/14, 2:27 PM, Claudio Freire wrote:
>
>> On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby  wrote:
>>
>>> On 1/13/14, 2:19 PM, Claudio Freire wrote:
>>>

 On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas 
 wrote:

>
> On a related note, there's also the problem of double-buffering.  When
> we read a page into shared_buffers, we leave a copy behind in the OS
> buffers, and similarly on write-out.  It's very unclear what to do
> about this, since the kernel and PostgreSQL don't have intimate
> knowledge of what each other are doing, but it would be nice to solve
> somehow.
>



 There you have a much harder algorithmic problem.

 You can basically control duplication with fadvise and WONTNEED. The
 problem here is not the kernel and whether or not it allows postgres
 to be smart about it. The problem is... what kind of smarts
 (algorithm) to use.

>>>
>>>
>>> Isn't this a fairly simple matter of when we read a page into shared
>>> buffers
>>> tell the kernel do forget that page? And a corollary to that for when we
>>> dump a page out of shared_buffers (here kernel, please put this back into
>>> your cache).
>>>
>>
>>
>> That's my point. In terms of kernel-postgres interaction, it's fairly
>> simple.
>>
>> What's not so simple, is figuring out what policy to use.
>
>
I think the above is pretty simple for both interaction (allow us to inject
a clean page into the file page cache) and policy (forget it after you hand
it to us, then remember it again when we hand it back to you clean).  And I
think it would pretty likely be an improvement over what we currently do.
 But I think it is probably the wrong way to get the improvement.  I think
the real problem is that we don't trust ourselves to manage more of the
memory ourselves.

As far as I know, we still don't have a publicly disclosable and readily
reproducible test case for the reports of performance degradation when we
have more than 8GB in shared_buffers.   If we had one of those, we could
likely reduce the double buffering problem by fixing our own scalability
issues and therefore taking responsibility for more of the data ourselves.



Remember,
>> you cannot tell the kernel to put some page in its page cache without
>> reading it or writing it. So, once you make the kernel forget a page,
>> evicting it from shared buffers becomes quite expensive.
>>
>
> Well, if we were to collaborate with the kernel community on this then
> presumably we can do better than that for eviction... even to the extent of
> "here's some data from this range in this file. It's (clean|dirty). Put it
> in your cache. Just trust me on this."


Which, in the case of it being clean, amounts to "Here is data we don't
want in memory any more because we think it is cold.  But we don't trust
ourselves, so please hold on to it anyway."  That might be a tough sell to
the kernel people.

 Cheers,

Jeff


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Robert Haas
On Mon, Jan 13, 2014 at 3:53 PM, Trond Myklebust  wrote:
> O_DIRECT was specifically designed to solve the problem of double buffering 
> between applications and the kernel. Why are you not able to use that in 
> these situations?

O_DIRECT was apparently designed by a deranged monkey on some serious
mind-controlling substances.  But don't take it from me, I have it on
good authority:

http://yarchive.net/comp/linux/o_direct.html

One might even say the best authority.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread James Bottomley
On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote:
> On 1/13/14, 2:27 PM, Claudio Freire wrote:
> > On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby  wrote:
> >> On 1/13/14, 2:19 PM, Claudio Freire wrote:
> >>>
> >>> On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas 
> >>> wrote:
> 
>  On a related note, there's also the problem of double-buffering.  When
>  we read a page into shared_buffers, we leave a copy behind in the OS
>  buffers, and similarly on write-out.  It's very unclear what to do
>  about this, since the kernel and PostgreSQL don't have intimate
>  knowledge of what each other are doing, but it would be nice to solve
>  somehow.
> >>>
> >>>
> >>>
> >>> There you have a much harder algorithmic problem.
> >>>
> >>> You can basically control duplication with fadvise and WONTNEED. The
> >>> problem here is not the kernel and whether or not it allows postgres
> >>> to be smart about it. The problem is... what kind of smarts
> >>> (algorithm) to use.
> >>
> >>
> >> Isn't this a fairly simple matter of when we read a page into shared 
> >> buffers
> >> tell the kernel do forget that page? And a corollary to that for when we
> >> dump a page out of shared_buffers (here kernel, please put this back into
> >> your cache).
> >
> >
> > That's my point. In terms of kernel-postgres interaction, it's fairly 
> > simple.
> >
> > What's not so simple, is figuring out what policy to use. Remember,
> > you cannot tell the kernel to put some page in its page cache without
> > reading it or writing it. So, once you make the kernel forget a page,
> > evicting it from shared buffers becomes quite expensive.
> 
> Well, if we were to collaborate with the kernel community on this then
> presumably we can do better than that for eviction... even to the
> extent of "here's some data from this range in this file. It's (clean|
> dirty). Put it in your cache. Just trust me on this."

This should be the madvise() interface (with MADV_WILLNEED and
MADV_DONTNEED); is there something in that interface that is
insufficient?
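
For reference, a minimal sketch of that interface; note that it operates
on a mapping the process already has (the file name is a placeholder):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
    int   fd = open("datafile", O_RDONLY);
    void *p;

    if (fd < 0)
        return 1;
    p = mmap(NULL, 8192, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;
    (void) madvise(p, 8192, MADV_WILLNEED);   /* hint: fault it in */
    (void) madvise(p, 8192, MADV_DONTNEED);   /* hint: may be discarded */
    munmap(p, 8192);
    close(fd);
    return 0;
}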

James






Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Trond Myklebust

On Jan 13, 2014, at 15:40, Andres Freund  wrote:

> On 2014-01-13 15:15:16 -0500, Robert Haas wrote:
>> On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner  wrote:
>>> I notice, Josh, that you didn't mention the problems many people
>>> have run into with Transparent Huge Page defrag and with NUMA
>>> access.
>> 
>> Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
>> setting zone_reclaim_mode; is there some other problem besides that?
> 
> I think that fixes some of the worst instances, but I've seen machines
> spending horrible amounts of CPU (& BUS) time in page reclaim
> nonetheless. If I analyzed it correctly it's in RAM << working set
> workloads where RAM is pretty large and most of it is used as page
> cache. The kernel ends up spending a huge percentage of time finding and
> potentially defragmenting pages when looking for victim buffers.
> 
>> On a related note, there's also the problem of double-buffering.  When
>> we read a page into shared_buffers, we leave a copy behind in the OS
>> buffers, and similarly on write-out.  It's very unclear what to do
>> about this, since the kernel and PostgreSQL don't have intimate
>> knowledge of what each other are doing, but it would be nice to solve
>> somehow.
> 
> I've wondered before if there wouldn't be a chance for postgres to say
> "my dear OS, that the file range 0-8192 of file x contains y, no need to
> reread" and do that when we evict a page from s_b but I never dared to
> actually propose that to kernel people...

O_DIRECT was specifically designed to solve the problem of double buffering 
between applications and the kernel. Why are you not able to use that in these 
situations?

Cheers,
   Trond



Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Kevin Grittner
Josh Berkus  wrote:

> Wanna go to Collab?

I don't think that works out for me, but thanks for suggesting it.

I'd be happy to brainstorm with anyone who does go about issues to
discuss; although the ones I keep running into have already been
mentioned.

Regarding the problems others have mentioned, there are a few
features that might be a very big plus for us.  Additional ways of
hinting pages might be very useful.  If we had a way to specify how
many dirty pages were cached in PostgreSQL, the OS would count
those in its calculations for writing dirty pages, and we could avoid
the "write avalanche" which is currently so tricky to avoid without
causing repeated writes to the same page.  Or perhaps instead a way
to hint a page as dirty so that the OS could not only count those,
but discard the obsolete data from its cache if it is not already
dirty at the OS level, and lower the write priority if it is dirty
(to improve the odds of collapsing multiple writes).  If there was
a way to use DONTNEED or something similar with the ability to
rescind it if the page still happened to be in the OS cache,
that might help for when we discard a still-clean page from our
buffers.  And I seem to have a vague memory of there being cases
where the OS is first reading pages when we ask to write them,
which seems like avoidable I/O.  (I'm not sure about that one,
though.)

Also, something like THP support should really have sysctl support
rather than requiring people to put echo commands into scripts and
tie those into runlevel changes.  That's pretty ugly for something
which has turned out to be necessary so often.

I don't get too excited about changes to the default schedulers --
it's been pretty widely known for a long time that DEADLINE or NOOP
perform better than any alternatives for most database loads. 
Anyone with a job setting up Linux machines to be used for database
servers should know to cover that.  As long as those two don't get
broken, I'm good.
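
(The echo commands mentioned above typically boil down to something like
this, with sda as an example device:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo deadline > /sys/block/sda/queue/scheduler

tied into whatever init mechanism the distribution uses.)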

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Andres Freund
On 2014-01-13 12:34:35 -0800, James Bottomley wrote:
> On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote:
> > Well, if we were to collaborate with the kernel community on this then
> > presumably we can do better than that for eviction... even to the
> > extent of "here's some data from this range in this file. It's (clean|
> > dirty). Put it in your cache. Just trust me on this."
> 
> This should be the madvise() interface (with MADV_WILLNEED and
> MADV_DONTNEED) is there something in that interface that is
> insufficient?

For one, postgres doesn't use mmap for files (and can't without major
new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has
horrible consequences for performance/scalability - very quickly you
contend on locks in the kernel.
Also, that will mark the page dirty, which isn't what we want in this
case. One major use case is transplanting a page coming from postgres'
buffers into the kernel's buffer cache, because the latter has a much
better chance of properly allocating system resources across independent
running applications.

Oh, and the kernel's page-cache management while far from perfect,
actually scales much better than postgres'.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Standalone synchronous master

2014-01-13 Thread Jim Nasby

On 1/13/14, 12:21 PM, Joshua D. Drake wrote:


On 01/13/2014 10:12 AM, Hannu Krosing wrote:

In other words, if we're going to have auto-degrade, the most
intelligent place for it is in
RepMgr/HandyRep/OmniPITR/pgPoolII/whatever.  It's also the *easiest*
place.  Anything we do *inside* Postgres is going to have a really,
really hard time determining when to degrade.

+1

This is also how 2PC works, btw - the database provides the building
blocks, i.e. PREPARE and COMMIT, and leaves it to a transaction manager
to deal with issues that require a whole-cluster perspective.



++1


+1


Josh, what do you think of the upthread idea of being able to recover 
in-progress transactions that are waiting when we turn off sync rep? I'm 
thinking that would be a very good feature to have... and it's not something 
you can easily do externally.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Standalone synchronous master

2014-01-13 Thread Andres Freund
On 2014-01-13 15:14:21 -0600, Jim Nasby wrote:
> On 1/13/14, 12:21 PM, Joshua D. Drake wrote:
> >
> >On 01/13/2014 10:12 AM, Hannu Krosing wrote:
> In other words, if we're going to have auto-degrade, the most
> intelligent place for it is in
> RepMgr/HandyRep/OmniPITR/pgPoolII/whatever.  It's also the *easiest*
> place.  Anything we do *inside* Postgres is going to have a really,
> really hard time determining when to degrade.
> >>>+1
> >>>
> >>>This is also how 2PC works, btw - the database provides the building
> >>>blocks, i.e. PREPARE and COMMIT, and leaves it to a transaction manager
> >>>to deal with issues that require a whole-cluster perspective.
> >>>
> >>
> >>++1
> >
> >+1
> 
> Josh, what do you think of the upthread idea of being able to recover 
> in-progress transactions that are waiting when we turn off sync rep? I'm 
> thinking that would be a very good feature to have... and it's not something 
> you can easily do externally.

I think it'd be a fairly simple patch to re-check the state of syncrep
config in SyncRepWaitForLsn(). Alternatively you can just write code to
iterate over the procarray and set Proc->syncRepState to
SYNC_REP_WAIT_CANCELLED or such.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Disallow arrays with non-standard lower bounds

2014-01-13 Thread Jim Nasby

On 1/13/14, 10:40 AM, Merlin Moncure wrote:

On Sun, Jan 12, 2014 at 4:38 AM, Craig Ringer  wrote:

>Implicit casts to text, anybody?

This backward compatibility break orphaned the company I work for on
8.1 until last year and very nearly caused postgres to be summarily
extirpated (only rescued at the last minute by my arrival). It cost
hundreds of thousands of dollars to qualify a sprawling java code base
so that it could be moved back into a supported version.  Breaking
compatibility sucks -- it hurts your users and costs people money.
Hacking type casts may not have been a mistake, but the arbitrary
introduction of the breakage certainly was.


IMHO, I see no reason we need to plan on removing support. Presumably it's not 
that much burden on our codebase; it's only a PITA for users writing correct 
code. (It'd be very interesting to see how much user code would blow up if 
presented with anything other than 1 as the lower bound...)

I'd be perfectly happy with an initdb option to allow for lower bound support
if you wanted it, disabled by default. People that legitimately want/need <> 1
lower bounds can set that up, but our general user population will end up
protected from a class of heisenbugs.
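
For anyone who hasn't seen it, producing a non-1 lower bound takes one
cast, and naive subscripting silently misbehaves:

select ('[0:2]={10,20,30}'::int[])[1];             -- 20, not 10
select array_lower('[0:2]={10,20,30}'::int[], 1);  -- 0
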
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Greg Stark
On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund  wrote:
> For one, postgres doesn't use mmap for files (and can't without major
> new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> horrible consequences for performance/scalability - very quickly you
> contend on locks in the kernel.


I may as well dump this in this thread. We've discussed this in person
a few times, including at least once with Ted T'so when he visited
Dublin last year.

The fundamental conflict is that the kernel understands better the
hardware and other software using the same resources, Postgres
understands better its own access patterns. We need to either add
interfaces so Postgres can teach the kernel what it needs about its
access patterns or add interfaces so Postgres can find out what it
needs to know about the hardware context.

The more ambitious and interesting direction is to let Postgres tell
the kernel what it needs to know to manage everything. To do that we
would need the ability to control when pages are flushed out. This is
absolutely necessary to maintain consistency. Postgres would need to
be able to mark pages as unflushable until some point in time in the
future when the journal is flushed. We discussed various ways that
interface could work but it would be tricky to keep it low enough
overhead to be workable.

The less exciting, more conservative option would be to add kernel
interfaces to teach Postgres about things like raid geometries. Then
Postgres could use directio and decide to do prefetching based on the
raid geometry, how much available i/o bandwidth and iops is available,
etc.

Reimplementing i/o schedulers and all the rest of the work that the
kernel provides inside Postgres just seems like something outside our
competency and that none of us is really excited about doing.

-- 
greg




Re: [HACKERS] Standalone synchronous master

2014-01-13 Thread Joshua D. Drake


On 01/13/2014 01:14 PM, Jim Nasby wrote:


On 1/13/14, 12:21 PM, Joshua D. Drake wrote:


On 01/13/2014 10:12 AM, Hannu Krosing wrote:

In other words, if we're going to have auto-degrade, the most
intelligent place for it is in
RepMgr/HandyRep/OmniPITR/pgPoolII/whatever.  It's also the *easiest*
place.  Anything we do *inside* Postgres is going to have a really,
really hard time determining when to degrade.

+1

This is also how 2PC works, btw - the database provides the building
blocks, i.e. PREPARE and COMMIT, and leaves it to a transaction manager
to deal with issues that require a whole-cluster perspective.



++1


+1


Josh, what do you think of the upthread idea of being able to recover
in-progress transactions that are waiting when we turn off sync rep? I'm
thinking that would be a very good feature to have... and it's not
something you can easily do externally.


I think it is extremely valuable, else we have lost those transactions,
which is exactly what we don't want.


JD


--
Command Prompt, Inc. - http://www.commandprompt.com/  509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
"In a time of universal deceit - telling the truth is a revolutionary 
act.", George Orwell





Re: [HACKERS] Disallow arrays with non-standard lower bounds

2014-01-13 Thread David Fetter
On Mon, Jan 13, 2014 at 10:40:57AM -0600, Merlin Moncure wrote:
> On Sun, Jan 12, 2014 at 4:38 AM, Craig Ringer  wrote:
> > Implicit casts to text, anybody?
> 
> This backward compatibility break orphaned the company I work for on
> 8.1 until last year and very nearly caused postgres to be summarily
> extirpated (only rescued at the last minute by my arrival). It cost
> hundreds of thousands of dollars to qualify a sprawling java code base
> so that it could be moved back into a supported version.  Breaking
> compatibility sucks -- it hurts your users and costs people money.
> Hacking type casts may not have been a mistake, but the arbitrary
> introduction of the breakage certainly was.

With utmost respect, it was not.  Databases are no good if there are
fixable things in them that cause them to produce incorrect results at
random, as auto-casting to text did.

> This project has no deprecation policy,

I believe it actually does, although it's not a formal, written
policy.  Would you like to help draft one up?

> and I'd argue we'd need one
> before considering breaking changes.  For example, maybe we could pull
> out an occasional release for longer term support to help users that
> caught out.   But really, the better way to go IMNSHO is to take a
> hard line on compatibility issues pretty much always -- consider the
> case of libc and win32 api.

Could you please help remind us what that was?

> There are certain limited exceptions to this rule -- for example
> security problems

Probably not.

> or gross violations of the standard

We're definitely there on lower bounds of arrays.  The standard, for a
wonder, is clear and unambiguous about them.  Whether we should go
there on the rest of our array implementation is a question for
another thread.

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] plpgsql.consistent_into

2014-01-13 Thread Jim Nasby

On 1/13/14, 1:44 AM, Pavel Stehule wrote:




2014/1/12 Florian Pflug <f...@phlo.org>

On Jan12, 2014, at 22:37 , Pavel Stehule <pavel.steh...@gmail.com> wrote:
 > There is  GUC for variable_conflict already too. In this case I would to
 > enable this functionality everywhere (it is tool how to simply eliminate
 > some kind of strange bugs) so it needs a GUC.
 >
 > We have GUC for plpgsql.variable_conflict three years and I don't know
 > about any problem.

I must say I hate behaviour-changing GUCs with quite some passion. IMHO
they tend to cause bugs, not avoid them, in the long run. The pattern
usually is

   1) Code gets written, depends on some particular set of settings
  to work correctly

   2) Code gets reused, with little further testing since it's supposed
  to be battle-proven anyway. Settings get dropped.

   3) Code blows up for those corner-cases where the setting actually
  matter. Debugging is hell, because you effectively have to go
  over the code line-by-line and check if it might be affected by
  some GUC or another.

Only a few days ago I spent more than an hour tracking down a bug
which, as it turned out, was caused by a regex which subtly changed its
meaning depending on whether standard_conforming_strings is on or off.

Some GUCs are unavoidable - standard_conforming_strings, for example
probably still was a good idea, since the alternative would have been
to stick with the historical, non-standard behaviour forever.

But in this case, my feeling is that the trouble such a GUC may cause
out-weights the potential benefits. I'm all for having a directive like
#consistent_into (though I feel that the name could convey the
meaning better). If we *really* think that this ought to be the default
from 9.4 onward, then we should

   *) Change it to always complain, except if the function explictly
  specifies "#consistent_into on" or whatever.

   *) Have pg_dump add that to all plpgsql functions if the server
  version is < 9.4 or whatever major release this ends up in

That's all just my opinion of course.


I think the GUC and the plpgsql option can live together. If you want to
emphasize a particular behaviour, you can use the plpgsql option. On the
other hand, when I want to use some functionality that is safe, I don't want
to dirty the source code with repeated options; but then I have to check
(and accept the risk of) the GUC settings.

One idea: required GUCs? It could be nice to have a way to ensure certain
GUC settings, and either restore those values or raise a warning.

Back to the main topic. The feature described here doesn't change the
behaviour of the INTO clause. I can enable or disable this functionality,
and well-written code should work without change (and without problems).
When the check is disabled, execution is just less safe. So in this case the
impact of the GUC is significantly smaller than in the issues you described.
Does anybody know a use case where this check should be disabled?

We probably have different experiences with GUCs. I had a problem with
standard_conforming_strings and the bytea format some years ago, and now I
am preparing documentation about required settings. But I see (from my
experience in the Czech area) more frequent problems related to
effective_cache_size or from_collapse_limit and similar GUCs. These
parameters are beyond the knowledge (and visibility) of the typical user.


ISTM that in this case, it should be safe to make the new default behavior
STRICT; if you forget to set the GUC to disable it, then you'll get an error
that points directly at the problem, at which point you'll go "Oh, yeah... I
forgot to set X..."

Outside of the GUC, I believe the default should definitely be STRICT. If your 
app is relying on non-strict then you need to be made aware of that. We should 
be able to provide a DO block that will change this setting for every function 
you've got if someone isn't happy with STRICT mode.
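
For reference, the per-statement spelling that exists today; with STRICT,
zero rows or more than one row raises an error:

DO $$
DECLARE
  v int;
BEGIN
  SELECT i INTO STRICT v FROM generate_series(1, 2) AS g(i);
  -- ERROR:  query returned more than one row
END $$;
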
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Disallow arrays with non-standard lower bounds

2014-01-13 Thread Robert Haas
On Mon, Jan 13, 2014 at 4:45 PM, David Fetter  wrote:
> On Mon, Jan 13, 2014 at 10:40:57AM -0600, Merlin Moncure wrote:
>> On Sun, Jan 12, 2014 at 4:38 AM, Craig Ringer  wrote:
>> > Implicit casts to text, anybody?
>>
>> This backward compatibility break orphaned the company I work for on
>> 8.1 until last year and very nearly caused postgres to be summarily
>> extirpated (only rescued at the last minute by my arrival). It cost
>> hundreds of thousands of dollars to qualify a sprawling java code base
>> so that it could be moved back into a supported version.  Breaking
>> compatibility sucks -- it hurts your users and costs people money.
>> Hacking type casts may not have been a mistake, but the arbitrary
>> introduction of the breakage certainly was.
>
> With utmost respect, it was not.  Databases are no good if there are
> fixable things in them that cause them to produce incorrect results at
> random, as auto-casting to text did.

With a precisely equal level of respect, that's a load of bunk.  As
has been discussed here many times in the past, those changes broke
many applications that were just fine.  Mine included, EnterpriseDB's
included.  It was designed to maximize rather than minimize breakage
in ways that were completely unnecessary, and every time anyone dares
to argue that the change was less than perfect in every respect, the
reponses evince a level of dubiousness normally reserved for the
parole hearings of convicted murderers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Josh Berkus
Everyone,

I am looking for one or more hackers to go to Collab with me to discuss
this.  If you think that might be you, please let me know and I'll look
for funding for your travel.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Syntax of INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-13 Thread Jim Nasby

On 1/12/14, 9:35 PM, Andreas Karlsson wrote:

On 01/12/2014 11:20 PM, Peter Geoghegan wrote:

On Sun, Jan 12, 2014 at 8:12 AM, Andreas Karlsson  wrote:

On 01/11/2014 11:42 PM, Peter Geoghegan wrote:

I recently suggested that rather than RETURNING REJECTS, we could have
a REJECTING clause, which would see a DML statement project strictly
the complement of what RETURNING projects in the same context. So
perhaps you could also see what RETURNING would not have projected
because a before row trigger returned NULL (i.e. when a before trigger
indicates to not proceed with insertion). That is certainly more
general, and so is perhaps preferable. It's also less verbose, and it
seems less likely to matter that we'll need to make REJECTING a fully
reserved keyword, as compared to REJECTS. (RETURNING is already a
fully reserved keyword not described by the standard, so this makes a
certain amount of sense to me). If nothing else, REJECTING is more
terse than RETURNING REJECTS.


I do not entirely understand what you are proposing here.  Any example how
this would look compared to your RETURNING REJECTS proposal?


It's very similar - REJECTING is a total generalization of what I
already have. The difference is only that REJECTING is accepted in all
contexts that RETURNING is, and not just with INSERT...ON DUPLICATE
KEY LOCK FOR UPDATE. So you could potentially have REJECTING project
the slot proposed for insertion on an UPDATE where RETURNING would
not. If for example a BEFORE ROW trigger fired, and returned NULL,
perhaps it'd then be possible to project the slot as it was before
being passed to the trigger. Perhaps there is no real demand for that,
but, as I said, from a usability perspective it may be easier to
reason about a feature that projects strictly the complement of what
RETURNING would project in the same context.


So simply this?

WITH rej AS (
 INSERT INTO foo (a, b, c)
 VALUES (44, 1078, 'insert'), (55, 1088, 'insert')
 REJECTING a)
UPDATE foo SET c = 'update' FROM rej WHERE foo.a = rej.a;

Another question: have you given any thought to the case where you want to use
both the successfully inserted tuples and the rejected ones in the CTE? Is
that even something anyone would want? Would perhaps MERGE be more suited for
that?


Well, a common case for INSERT RETURNING is to get your set of surrogate keys 
back; so I think users would want the ability to RETURN what finally made it 
into the table.
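
That is, the familiar pattern (reusing the table from upthread, where a is
the key):

INSERT INTO foo (b, c) VALUES (1078, 'insert') RETURNING a;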

Also, if we want to support the case of identifying tuples where a BEFORE 
trigger disallowed the insert, we probably want to expose that that's why those 
tuples were rejected (as opposed to them being rejected due to a duplicate key 
violation).
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Disallow arrays with non-standard lower bounds

2014-01-13 Thread Merlin Moncure
On Mon, Jan 13, 2014 at 3:45 PM, David Fetter  wrote:
> On Mon, Jan 13, 2014 at 10:40:57AM -0600, Merlin Moncure wrote:
>> This project has no deprecation policy,
>
> I believe it actually does, although it's not a formal, written
> policy.  Would you like to help draft one up?

Lack of 'formal, written, policy' is equivalent to 'no policy'.
Regardless, the way things were done in the 7.x/8.x series may no longer
apply today; the project has grown up and we need to be more serious
about things, at least, IMNSHO.

>> and I'd argue we'd need one
>> before considering breaking changes.  For example, maybe we could pull
>> out an occasional release for longer term support to help users that
>> caught out.   But really, the better way to go IMNSHO is to take a
>> hard line on compatibility issues pretty much always -- consider the
>> case of libc and win32 api.
>
> Could you please help remind us what that was?

Let's take gets() for example.  C11 finally ditched it 12 years (!)
after it was formally deprecated in C99 and informally deprecated in
endless man pages ("don't use this!") for decades before that.  And
even then most compilers, at least the decent ones, still allow you to
request previous standards for some time beyond that.  The win32 API
is also remarkably stable; ancient code written for it beyond the dim
horizon of time will still compile and execute today.  These are
probably strong contenders for most popular APIs ever made -- see the
connection?  Now, comparing C APIs to an SQL implementation for
deprecation purposes isn't quite apples to apples, but I'll stand by
the analogy.

>> or gross violations of the standard
>
> We're definitely there on lower bounds of arrays.  The standard, for a
> wonder, is clear and unambiguous about them.  Whether we should go
> there on the rest of our array implementation is a question for
> another thread.

The SQL standard requests that standard syntax gives standard
behavior.  Alternate bounds is non-standard syntax giving non-standard
behavior and is thus excepted.  Naturally, non-standard syntax is
dangerous because the standard may later implement it in which case
you then have a real problem (that may be the case here: I don't
know).  Our array implementation is a real mess on multiple levels but
at least it's an internally consistent mess.  Maybe it really should
be 'fixed', but not before the super un-fun discussion of how to ease
the path for our hapless users happens first.

merlin




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Mel Gorman
On Mon, Jan 13, 2014 at 03:15:16PM -0500, Robert Haas wrote:
> On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner  wrote:
> > I notice, Josh, that you didn't mention the problems many people
> > have run into with Transparent Huge Page defrag and with NUMA
> > access.
> 

Ok, there are at least three potential problems there that you may or
may not have run into.

First, THP when it was first introduced was a bit of a disaster. In 3.0,
it was *very* heavy handed and would trash the system reclaiming memory
to satisfy an allocation. When it did this, it would also writeback a
bunch of data and block on it to boot. It was not the smartest move of
all time but was improved over time and in some cases the patches were
also backported by 3.0.101. This is a problem that should have
alleviated over time.

The general symptoms of the problem would be massive stalls and
monitoring the /proc/PID/stack of interesting processes would show it to
be somewhere in do_huge_pmd_anonymous_page -> alloc_page_nodemask ->
try_to_free_pages -> migrate_pages or something similar. You may have
worked around it by disabling THP with a command line switch or
/sys/kernel/mm/transparent_hugepage/enabled in the past.

This is "not meant to happen" any more or at least it has been a while
since a bug was filed against me in this area. There are corner cases
though. If the underlying filesystem is NFS, the problem might still be
experienced.

That is the simple case.

You might have also hit the case where THPages filled with zeros did not
use the zero page. That would have looked like a larger footprint than
anticipated and led to another range of problems. This is also addressed
since but maybe not recently enough. It's less likely this is your problem
though as I expect you actually use your buffers, not leave them filled
with zeros.

You mention NUMA, but that problem is trickier to figure out without more
context.  THP can cause unexpected interleaving between NUMA nodes. Memory
that would have been local on a 4K page boundary becomes remote accesses
when THP is enabled and performance would be hit (maybe 3-5% depending on
the machine). It's not the only possibility though. If memory was being
used sparsely and THP was in use then the overall memory footprint may be
higher than it should be. This potentially would cause allocations to spill
over to remote nodes while kswapd wakes up to reclaim local memory. That
would lead to weird buffer aging inversion problems. This is a hell of a
lot of guessing though and we'd need a better handle on the reproduction
case to pin it down.

> Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
> setting zone_reclaim_mode; is there some other problem besides that?
> 

Really?

zone_reclaim_mode is often a complete disaster unless the workload is
partitioned to fit within NUMA nodes. On older kernels enabling it would
sometimes cause massive stalls. I'm actually very surprised to hear it
fixes anything and would be interested in hearing more about what sort
of circumstances would convince you to enable that thing.

> The other thing that comes to mind is the kernel's caching behavior.
> We've talked a lot over the years about the difficulties of getting
> the kernel to write data out when we want it to and to not write data
> out when we don't want it to. 

Is sync_file_range() broken?

> When it writes data back to disk too
> aggressively, we get lousy throughput because the same page can get
> written more than once when caching it for longer would have allowed
> write-combining. 

Do you think that is related to dirty_ratio or dirty_writeback_centisecs?
If it's dirty_writeback_centisecs then that would be particularly tricky
because poor interactions there would come down to luck basically.

> When it doesn't write data to disk aggressively
> enough, we get huge latency spikes at checkpoint time when we call
> fsync() and the kernel says "uh, what? you wanted that data *on the
> disk*? sorry boss!" and then proceeds to destroy the world by starving
> the rest of the system for I/O for many seconds or minutes at a time.

Ok, parts of that are somewhat expected. It *may* depend on the
underlying filesystem. Some of them handle fsync better than others. If
you are syncing the whole file though when you call fsync then you are
potentially burned by having to write back dirty_ratio amounts of memory
which could take a substantial amount of time.

> We've made some desultory attempts to use sync_file_range() to improve
> things here, but I'm not sure that's really the right tool, and if it
> is we don't know how to use it well enough to obtain consistent
> positive results.
> 

That implies that sync_file_range() is broken in some fashion that we
(or at least I) are not aware of, and that needs kicking.
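
Roughly, the staging pattern under discussion (a minimal sketch, not
postgres code; fd/offset/len are placeholders):

    #define _GNU_SOURCE             /* for sync_file_range() */
    #include <fcntl.h>

    /* Kick off asynchronous writeback of a range; does not wait for
     * the I/O and does not flush file metadata or the drive cache. */
    static int
    stage_writeback(int fd, off64_t offset, off64_t len)
    {
        return sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE);
    }

    /* Wait for previously staged writes, write anything still dirty,
     * then wait again: the closest thing to a "range fsync", minus
     * the metadata and cache flushes a real fsync() implies. */
    static int
    wait_writeback(int fd, off64_t offset, off64_t len)
    {
        return sync_file_range(fd, offset, len,
                               SYNC_FILE_RANGE_WAIT_BEFORE |
                               SYNC_FILE_RANGE_WRITE |
                               SYNC_FILE_RANGE_WAIT_AFTER);
    }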

> On a related note, there's also the problem of double-buffering.  When
> we read a page into shared_buffers, we leave a copy behind in the OS
> buffers, and similarly on write-out.  It's very unclear what to do
> about this, since the kernel and PostgreSQL don't have intimate
> knowledge of what each other are doing, but it would be nice to solve
> somehow.

Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Mel Gorman
On Mon, Jan 13, 2014 at 06:27:03PM -0200, Claudio Freire wrote:
> On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby  wrote:
> > On 1/13/14, 2:19 PM, Claudio Freire wrote:
> >>
> >> On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas 
> >> wrote:
> >>>
> >>> On a related note, there's also the problem of double-buffering.  When
> >>> we read a page into shared_buffers, we leave a copy behind in the OS
> >>> buffers, and similarly on write-out.  It's very unclear what to do
> >>> about this, since the kernel and PostgreSQL don't have intimate
> >>> knowledge of what each other are doing, but it would be nice to solve
> >>> somehow.
> >>
> >>
> >>
> >> There you have a much harder algorithmic problem.
> >>
> >> You can basically control duplication with fadvise and WONTNEED. The
> >> problem here is not the kernel and whether or not it allows postgres
> >> to be smart about it. The problem is... what kind of smarts
> >> (algorithm) to use.
> >
> >
> > Isn't this a fairly simple matter of when we read a page into shared buffers
> > tell the kernel to forget that page? And a corollary to that for when we
> > dump a page out of shared_buffers (here kernel, please put this back into
> > your cache).
> 
> 
> That's my point. In terms of kernel-postgres interaction, it's fairly simple.
> 
> What's not so simple, is figuring out what policy to use. Remember,
> you cannot tell the kernel to put some page in its page cache without
> reading it or writing it. So, once you make the kernel forget a page,
> evicting it from shared buffers becomes quite expensive.

posix_fadvise(POSIX_FADV_WILLNEED) is meant to cover this case by
forcing readahead. If you evict it prematurely then you do get kinda
screwed because you pay the IO cost to read it back in again even if you
had enough memory to cache it. Maybe this is the type of kernel-postgres
interaction that is annoying you.

If you don't evict, the kernel eventually steps in and evicts the wrong
thing. If you do evict and it was unnecessarily you pay an IO cost.

That could be something we look at. There are cases buried deep in the
VM where pages get shuffled to the end of the LRU and get tagged for
reclaim as soon as possible. Maybe you need access to something like
that via posix_fadvise to say "reclaim this page if you need memory but
leave it resident if there is no memory pressure" or something similar.
Not exactly sure what that interface would look like or offhand how it
could be reliably implemented.

-- 
Mel Gorman
SUSE Labs


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-13 Thread Peter Geoghegan
On Mon, Jan 13, 2014 at 12:58 PM, Heikki Linnakangas
 wrote:
> Well, even if you don't agree that locking all the conflicting rows for
> update is sensible, it's still perfectly sensible to return the rejected
> rows to the user. For example, you're inserting N rows, and if some of them
> violate a constraint, you still want to insert the non-conflicting rows
> instead of rolling back the whole transaction.

Right, but with your approach, can you really be sure that you have
the right rejecting tuple ctid (not reject)? In other words, as you
wait for the exclusion constraint to conclusively indicate that there
is a conflict, minutes may have passed in which time other conflicts
may emerge in earlier unique indexes. Whereas with an approach where
values are locked, you are guaranteed that earlier unique indexes have
no conflicting values. Maintaining that property seems useful, since
we check in a well-defined order, and we're still projecting a ctid.
Unlike when row locking is involved, we can make no assumptions or
generalizations around where conflicts will occur. Although that may
also be a general concern with your approach when row locking, for
multi-master replication use-cases. There may be some value in knowing
it cannot have been earlier unique indexes (and so the existing values
for those unique indexes in the locked row should stay the same -
don't many conflict resolution policies work that way?).


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Andres Freund
On 2014-01-13 14:19:56 -0800, James Bottomley wrote:
> >  Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> > horrible consequences for performance/scalability - very quickly you
> > contend on locks in the kernel.
> 
> Is this because of problems in the mmap_sem?

It's been a while since I looked at it, but yes, mmap_sem was part of
it. I also seem to recall the number of IPIs increasing far too much for
it to be practical, but I am not sure anymore.
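
For concreteness, the per-buffer pattern that falls over looks
roughly like this (illustrative only; BLCKSZ stands in for postgres'
8kb block size):

    #include <sys/mman.h>
    #include <sys/types.h>

    #define BLCKSZ 8192

    static void
    touch_block(int fd, off_t offset)
    {
        /* mmap() and munmap() each take mmap_sem, and munmap() may
         * trigger TLB-shootdown IPIs to other CPUs; doing this once
         * per 8kb block is where the scalability collapses. */
        void *p = mmap(NULL, BLCKSZ, PROT_READ, MAP_SHARED, fd, offset);
        if (p == MAP_FAILED)
            return;
        madvise(p, BLCKSZ, MADV_WILLNEED);
        /* ... read the block ... */
        munmap(p, BLCKSZ);
    }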

> > Also, that will mark that page dirty, which isn't what we want in this
> > case.
> 
> You mean madvise (page_addr)?  It shouldn't ... the state of the dirty
> bit should only be updated by actual writes.  Which MADV_ primitive is
> causing the dirty marking, because we might be able to fix it (unless
> there's some weird corner case I don't know about).

Not the madvise() itself, but transplanting the buffer from postgres'
buffers to the mmap() area of the underlying file would, right?

> We also do have a way of transplanting pages: it's called splice.  How
> do the semantics of splice differ from what you need?

Hm. I don't really see how splice would allow us to seed the kernel's
pagecache with content *without* marking the page as dirty in the
kernel.
We don't need zero-copy IO here, the important thing is just to fill the
pagecache with content without a) rereading the page from disk b)
marking the page as dirty.
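
A sketch of why the naive transplant fails both a) and b) above
(pg_buffer is a placeholder for a buffer in postgres shared memory):

    #include <string.h>
    #include <sys/mman.h>

    #define BLCKSZ 8192

    static void
    transplant(int fd, off_t offset, const char *pg_buffer)
    {
        char *p = mmap(NULL, BLCKSZ, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, offset);
        if (p == MAP_FAILED)
            return;
        /* The fault below rereads the page from disk (violating a),
         * and the store marks it dirty in the kernel (violating b),
         * scheduling writeback even though the contents may already
         * match what is on disk. */
        memcpy(p, pg_buffer, BLCKSZ);
        munmap(p, BLCKSZ);
    }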

> >  One major usecase is transplanting a page comming from postgres'
> > buffers into the kernel's buffercache because the latter has a much
> > better chance of properly allocating system resources across independent
> > applications running.
> 
> If you want to share pages between the application and the page cache,
> the only known interface is mmap ... perhaps we can discuss how better
> to improve mmap for you?

I think purely using mmap() is pretty unlikely to work out - there's
just too many constraints about when a page is allowed to be written out
(e.g. it's interlocked with postgres' write ahead log). I also think
that for many practical purposes using mmap() would result in an absurd
number of mappings or mapping way too huge areas; e.g. large btree
indexes are usually accessed in a quite fragmented manner.

> > Oh, and the kernel's page-cache management while far from perfect,
> > actually scales much better than postgres'.
> 
> Well, then, it sounds like the best way forward would be to get
> postgress to use the kernel page cache more efficiently.

No arguments there, although working on postgres scalability is a good
idea as well ;)

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Mel Gorman
On Mon, Jan 13, 2014 at 11:38:44PM +0100, Jan Kara wrote:
> On Mon 13-01-14 22:26:45, Mel Gorman wrote:
> > The flipside is also meant to hold true. If you know data will be needed
> > in the near future then posix_fadvise(POSIX_FADV_WILLNEED). Glancing at
> > the implementation it does a forced read-ahead on the range of pages of
> > interest. It doesn't look like it would block.
>   That's not quite true. POSIX_FADV_WILLNEED still needs to map logical
> file offsets to physical disk blocks and create IO requests. This happens
> synchronously. So if your disk is congested and relevant metadata is out of
> cache, or we simply run out of free IO requests, POSIX_FADV_WILLNEED can
> block for a significant amount of time.
> 

Umm, yes, you're right. It also potentially stalls allocating the pages
up front, even though it will only attempt direct reclaim of pages once.
That can stall in some circumstances, particularly if there are a number
of processes trying to reclaim memory.

That kinda sucks though. One point of discussion would be to check
whether this is an interface that can be used; if so, whether it is
required never to block; and if so, whether there is something we can
do about it -- queue the IO asynchronously if you can, but if the
kernel would block then do not bother. That does mean fadvise would
not guarantee that the pages will be resident in the future, but that
was never the intent of the interface anyway.
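
One possible shape for a never-blocking variant from userspace today:
push the hint to a helper thread so the caller cannot stall (a sketch
only; the names are made up and lifetime/error handling is elided):

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdlib.h>

    struct prefetch_req { int fd; off_t offset; off_t len; };

    static void *
    prefetch_worker(void *arg)
    {
        struct prefetch_req *req = arg;
        /* May block on metadata reads or IO request allocation;
         * that is fine here because only this thread waits. */
        posix_fadvise(req->fd, req->offset, req->len, POSIX_FADV_WILLNEED);
        free(req);
        return NULL;
    }

    static void
    prefetch_async(int fd, off_t offset, off_t len)
    {
        pthread_t t;
        struct prefetch_req *req = malloc(sizeof(*req));
        if (req == NULL)
            return;
        req->fd = fd; req->offset = offset; req->len = len;
        if (pthread_create(&t, NULL, prefetch_worker, req) == 0)
            pthread_detach(t);
        else
            free(req);
    }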

-- 
Mel Gorman
SUSE Labs


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Syntax of INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2014-01-13 Thread Peter Geoghegan
On Mon, Jan 13, 2014 at 2:20 PM, Jim Nasby  wrote:
> Well, a common case for INSERT RETURNING is to get your set of surrogate
> keys back; so I think users would want the ability to RETURN what finally
> made it into the table.

Your update can also have a RETURNING clause. I'm not necessarily that
attached to fully generalizing RETURNING REJECTS as REJECTING. It was
just an idea. When an insert is rejected and you lock a conflicting
row, it hardly matters what your surrogate key might have been had
that insert succeeded.

To get the surrogate key when the upsert inserts, do a regular
INSERT ... RETURNING, and break the work up into multiple commands.
That will almost always be sufficient, because you'll almost always
know ahead of time where the conflict might be (certainly, the MySQL
feature mandates that you do know).

> Also, if we want to support the case of identifying tuples where a BEFORE
> trigger disallowed the insert, we probably want to expose that that's why
> those tuples were rejected (as opposed to them being rejected due to a
> duplicate key violation).

The ctid *won't* indicate a specific rejecting row in that case, I
guess, which will do it.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

