date:20151221

Alvaro Herrera  writes:
> I think the new comment that talks about Toast Index should explain
> *why* we can skip the pinning in all cases except that one, instead of
> just saying we can do it.

I've not looked at that code in a long while, but my recollection is
that it's designed that way to protect non-MVCC catalog scans, which
are gone now ... except for SnapshotToast.  Maybe the right way to
approach this is to adjust HeapTupleSatisfiesToast (or maybe just
convince ourselves that no one could be dereferencing a stale toast
pointer in the first place?) and then redesign btree vacuuming without
the requirement to support non-MVCC scans, period.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Coding note: truncating with strlcpy() is not such a hot idea

There's been some effort to replace uses of strncpy() and our homegrown
StrNCpy() macro with strlcpy().  I had my nose rubbed today in the fact
that that isn't always a good transformation.  The reason why not is that
strlcpy() is defined to return strlen(src), which means that even after
it's transferred all the bytes it's going to, it keeps walking down the
source string looking for a '\0'.  Therefore:

1. Using strlcpy to copy from sources that are not known null-terminated
is outright unsafe: someday, you'll fall off the end of memory and
SIGSEGV.

2. Even with a known null-terminated string, using strlcpy to chop small
pieces out of it is not bright, because each call will traverse all the
rest of the input.  This means for example that extracting N words out
of a long string will spend O(N^2) time in strlcpy.

I ran into the latter effect in tab-complete.c; the old coding there
used strlcpy as a substitute for memcpy and explicit placement of
a trailing '\0'.  We hadn't noticed the problem because it never got
done more than nine times per query, but with the new dispensation
of parsing the whole input line, the O(N^2) cost became obvious.

I did a quick troll looking for places where we might have either an
actual bug or a performance problem worth worrying about, and did not find
others, though it was a very quick scan and I might've missed something.

In general, I would say that it's best to use strlcpy() only as a safety
blanket to protect fixed-length buffers that you're not really intending
to overflow.  If you are expecting it to actually truncate strings in
normal use, there is a significant risk that You're Doing It Wrong.
Caveat programmer.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

On Fri, Dec 18, 2015 at 2:22 PM, Peter Geoghegan  wrote:
> On Fri, Dec 18, 2015 at 9:35 AM, Robert Haas  wrote:
>>> Attached revision updates both the main commit (the optimization), and
>>> the backpatch commit (updated the contract).
>>
>> -   /* abbreviation is possible here only for by-reference types */
>> +   /*
>> +* Abbreviation is possible here only for by-reference types.
>> Note that a
>> +* pass-by-value representation for abbreviated values is forbidden, 
>> but
>> +* that's a distinct, generic restriction imposed by the SortSupport
>> +* contract.
>>
>> I think that you have not written what you meant to write here.  I
>> think what you mean is "Note that a pass-by-REFERENCE representation
>> for abbreviated values is forbidden...".
>
> You're right. Sorry about that.

PFA my proposal for comment changes for 9.5 and master.  This is based
on your 0001, but I edited somewhat.  Please let me know your
thoughts.  I am not willing to go further and rearrange actual code in
9.5 at this point; it just isn't necessary.

Looking at this whole system again, I wonder if we're missing a trick
here.  How about if we decree from on high that the abbreviated-key
comparator is always just the application of an integer comparison
operator?  The only abbreviation function that we have right now that
wouldn't be happy with that is the one for numeric, and that looks
like it could be fixed.  Then we could hard-code knowledge of this
representation in tuplesort.c in such a way that it wouldn't need to
call a comparator function at all - instead of doing
ApplySortComparator() and then maybe ApplySortAbbrevFullComparator(),
it would do something like:

if (using_abbreviation && (compare = ApplyAbbrevComparator(...)) != 0)
return compare;

I'm not sure if that would save enough cycles vs. the status quo to be
worth bothering with, but it seems like it might.  You may have a
better feeling for that than I do.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 6572af8..cf1cdcb 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -930,7 +930,14 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	state->sortKeys->ssup_collation = sortCollation;
 	state->sortKeys->ssup_nulls_first = nullsFirstFlag;
 
-	/* abbreviation is possible here only for by-reference types */
+	/*
+	 * Abbreviation is possible here only for by-reference types.  In theory,
+	 * a pass-by-value datatype could have an abbreviated form that is cheaper
+	 * to compare.  In a tuple sort, we could support that, because we can
+	 * always extract the original datum from the tuple is needed.  Here, we
+	 * can't, because a datum sort only stores a single copy of the datum;
+	 * the "tuple" field of each sortTuple is NULL.
+	 */
 	state->sortKeys->abbreviate = !typbyval;
 
 	PrepareSortSupportFromOrderingOp(sortOperator, state->sortKeys);
@@ -1271,7 +1278,7 @@ tuplesort_putindextuplevalues(Tuplesortstate *state, Relation rel,
 		 * Store ordinary Datum representation, or NULL value.  If there is a
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
-		 * converter and just set datum1 to "void" representation (to be
+		 * converter and just set datum1 to zeroed representation (to be
 		 * consistent).
 		 */
 		stup.datum1 = original;
@@ -3155,7 +3162,7 @@ copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * Store ordinary Datum representation, or NULL value.  If there is a
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
-		 * converter and just set datum1 to "void" representation (to be
+		 * converter and just set datum1 to zeroed representation (to be
 		 * consistent).
 		 */
 		stup->datum1 = original;
@@ -3397,7 +3404,7 @@ copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * Store ordinary Datum representation, or NULL value.  If there is a
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
-		 * converter and just set datum1 to "void" representation (to be
+		 * converter and just set datum1 to zeroed representation (to be
 		 * consistent).
 		 */
 		stup->datum1 = original;
@@ -3701,7 +3708,7 @@ copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup)
 		 * Store ordinary Datum representation, or NULL value.  If there is a
 		 * converter it won't expect NULL values, and cost model is not
 		 * required to account for NULL, so in that case we avoid calling
-		 * converter and just set datum1 to "void" representation (to be
+		 * converter and just set datum1 to zeroed

Re: [HACKERS] Combining Aggregates

On 22 December 2015 at 01:30, Robert Haas  wrote:

> Can we use Tom's expanded-object stuff instead of introducing
> aggserialfn and aggdeserialfn?  In other words, if you have a
> aggtranstype = INTERNAL, then what we do is:
>
> 1. Create a new data type that represents the transition state.
> 2. Use expanded-object notation for that data type when we're just
> within a single process, and flatten it when we need to send it
> between processes.
>
>
I'd not seen this before, but on looking at it I'm not sure if using it
will be practical to use for this. I may have missed something, but it
seems that after each call of the transition function, I'd need to ensure
that the INTERNAL state was in the varlana format. This might be ok for a
state like Int8TransTypeData, since that struct has no pointers, but I
don't see how that could be done efficiently for NumericAggState, which has
two NumericVar, which will have pointers to other memory. The trans
function also has no idea whether it'll be called again for this state, so
it does not seem possible to delay the conversion until the final call of
the trans function.


> One thing to keep in mind is that we also want to be able to support a
> plan that involves having one or more remote servers do partial
> aggregation, send us the partial values, combine them across servers
> and possibly also with locally computed-values, and the finalize the
> aggregation.  So it would be nice if there were a way to invoke the
> aggregate function from SQL and get a transition value back rather
> than a final value.


This will be possible with what I proposed. The Agg Node will just need to
be setup with finalizeAggs=false, serialState=true. That way the returned
aggregate values will be the states converted into the serial type, to
which we can call the output function on and send where ever we like.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Parallel Aggregate

On 22 December 2015 at 04:16, Paul Ramsey  wrote:

> Shouldn’t parallel aggregate come into play regardless of scan selectivity?
>

I'd say that the costing should take into account the estimated number of
groups.

The more tuples that make it into each group, the more attractive parallel
grouping should seem. In the extreme case if there's 1 tuple per group,
then it's not going to be of much use to use parallel agg, this would be
similar to a scan with 100% selectivity. So perhaps the costings for it can
be modeled around a the parallel scan costing, but using the estimated
groups instead of the estimated tuples.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Patch to improve a few appendStringInfo* calls

On 22 December 2015 at 01:58, Robert Haas  wrote:

> On Thu, Jul 2, 2015 at 6:08 AM, David Rowley
>  wrote:
> >> I left out the changes like
> >>
> >>> -   appendStringInfoString(, buf.data);
> >>> +   appendBinaryStringInfo(, buf.data, buf.len);
> >>
> >>
> >> because they're not an improvement in readablity, IMHO, and they were
> not
> >> in performance-critical paths.
> >
> > Perhaps we can come up with appendStringInfoStringInfo at some later
> date.
>
> concatenateStringInfo?
>
>
 According to grep -rE "appendStringInfoString\(.*\.data\);" . we have 13
such matches. None of them seem to be in very performance critical places,
perhaps with the exception of xmlconcat().

Would you say that we should build a macro to wrap up a call
to appendBinaryStringInfo() or invent another function which looks similar?


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Parallel Aggregate

2015-12-21 Thread Haribabu Kommi

On Tue, Dec 22, 2015 at 2:16 AM, Paul Ramsey  wrote:
> Shouldn’t parallel aggregate come into play regardless of scan selectivity?
> I know in PostGIS land there’s a lot of stuff like:
>
> SELECT ST_Union(geom) FROM t GROUP BY areacode;
>
> Basically, in the BI case, there’s often no filter at all. Hoping that’s
> considered a prime case for parallel agg :)

Yes, the latest patch attached in the thread addresses this issue.
But it still lacks of proper cost calculation and comparison with
original aggregate cost.

The parallel aggregate selects only when the number of groups
should be at least less than 1/4 of rows that are getting selected.
Otherwise, doing aggregation two times for more number of
records leads to performance drop compared to original aggregate.

Regards,
Hari Babu
Fujitsu Australia

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] plpgsql - DECLARE - cannot to use %TYPE or %ROWTYPE for composite types

Hi



2015-12-21 16:21 GMT+01:00 Artur Zakirov :

> Hi.
> I have tried to do some review of this patch. Below are my comments.
>
> Introduction:
>
> This patch fixes and adds the following functionality:
> - %TYPE - now can be used for composite types.
> - %ARRAYTYPE - new functionality, provides the array type from a variable
> or table column.
> - %ELEMENTTYPE - new funcitonality, provides the element type of a given
> array.
>
> New regression tests are included in the patch. Changes to the
> documentation are not provided.
>
> Initial Run:
>
> The patch applies correctly to HEAD. Regression tests pass successfully,
> without errors. It seems that the patch work as you expected.
>
> Performance:
>
> It seems patch have not possible performance issues for the existing
> functionality.
>
> Coding:
>
> The style looks fine. I attached the patch that does some corrections in
> code and documentation. I have corrected indentation in pl_comp.c and
> "read_datatype" function in pl_gram.y. I think changes in "read_datatype"
> function would be better to avoid a code duplication. But I could be wrong
> of course.
>
> Conclusion:
>
> The patch could be applied on master with documentation corrections. But
> I'm not sure that your task could be resloved only by adding %ARRAYTYPE and
> %ELEMENTTYPE. Maybe you will give some examples?
>

Thank you for review. The changes in code are good idea.

I waited with documentation if there will be some objections to syntax. The
month later, there are not any known objection.

The target of this feature isn't using for storing of database objects
only, but for storing the values of polymorphic parameters.

CREATE OR REPLACE FUNCTION buble_sort(a anyarray)
RETURNS anyarray AS $$
DECLARE
  aux a%ELEMENTTYPE;
  repeat_again boolean DEFAULT true;
BEGIN
  -- Don't use this code for large arrays!
  -- use builtin sort
  WHILE repeat_again
repeat_again := false;
FOR i IN array_lower(a, 1) .. array_upper(a, 2)
LOOP
  IF a[i] > a[i+1] THEN
aux := a[i+1];
a[i+1] := a[i]; a[i] := aux;
repeat_again := true;
  END IF;
END LOOP;
  END WHILE;
  RETURN a;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION array_init(v anyelement, size integer)
RETURNS anyarray AS $$
DECLARE result v%ARRAYTYPE DEFAULT '{}';
BEGIN
  -- prefer builtin function array_fill
  FOR i IN 1 .. size
  LOOP
result := result || v;
  END LOOP;
  RETURN result;
END;
$$ LANGUAGE plpgsql;


Regards

Pavel


>
> --
> Artur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company
>

Re: [HACKERS] Patch to improve a few appendStringInfo* calls

2015-12-21 13:58 GMT+01:00 Robert Haas :

> On Thu, Jul 2, 2015 at 6:08 AM, David Rowley
>  wrote:
> >> I left out the changes like
> >>
> >>> -   appendStringInfoString(, buf.data);
> >>> +   appendBinaryStringInfo(, buf.data, buf.len);
> >>
> >>
> >> because they're not an improvement in readablity, IMHO, and they were
> not
> >> in performance-critical paths.
> >
> > Perhaps we can come up with appendStringInfoStringInfo at some later
> date.
>
> concatenateStringInfo?
>

concatStringInfo ?

Pavel



>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

Re: [HACKERS] Experimental evaluation of PostgreSQL's query optimizer

Viktor Leis  writes:
> I think it would be a good start to distinguish between nested loop
> joins with and without a index.

We do.

> In my opinion, the latter should simply NEVER be chosen.

So, if you're given a query with a non-equality join condition
that doesn't match any index on either table, you think the planner
should just fail?  This is not realistic.  Certainly nestloop with
inner seqscan is likely to be slow, but that's already reflected
in the cost estimates.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Combining Aggregates

On 18 December 2015 at 01:28, David Rowley 
wrote:

> # select sum(x::numeric) from generate_series(1,3) x(x);
> ERROR:  invalid memory alloc request size 18446744072025250716
>
> The reason that this happens is down to the executor always thinking that
> Aggref returns the aggtype, but in cases where we're not finalizing the
> aggregate the executor needs to know that we're actually returning
> aggtranstype instead. Hash aggregates appear to work as the trans value is
> just stuffed into a hash table, but with plain and sorted aggregates this
> gets formed into a Tuple again, and forming a tuple naturally requires the
> types to be set correctly! I'm not sure exactly what the best way to fix
> this will be. I've hacked something up in the attached test patch which
> gets around the problem by adding a new aggtranstype to Aggref and also an
> 'aggskipfinalize'  field which I manually set to true in a bit of a hacky
> way inside the grouping planner. Then in exprType() for Aggref I
> conditionally return the aggtype or aggtranstype based on the
> aggskipfinalize setting.  This is most likely not the way to properly fix
> this, but I'm running out of steam today to think of how it should be done,
> so I'm currently very open to ideas on this.
>

 Ok, so it seems that my mindset was not in parallel process space when I
was thinking about this.  I think having the pointer in the Tuple is
probably going to be fine for this multiple stage aggregation when that's
occurring in a single backend process, but obviously if the memory that the
pointer points to belongs to a worker process in a parallel aggregate
situation, then bad things will happen.

Now, there has been talk of this previously, on various threads, but I
don't believe any final decisions were made on how exactly it should be
done. At the moment I plan to make changes as follows:

   1.  Add 3 new columns to pg_aggregate, aggserialfn, aggdeserialfn and
   aggserialtype These will only be required when aggtranstype is INTERNAL.
   Perhaps we should disallow CREATE AGGREAGET from accepting them for any
   other type... The return type of aggserialfn should be aggserialtype, and
   it should take a single parameter of aggtranstype. aggdeserialfn will be
   the reverse of that.
   2. Add a new bool field to nodeAgg's state named serialStates. If this
   is field is set to true then when we're in finalizeAgg = false mode, we'll
   call the serialfn on the agg state instead of the finalfn. This will allow
   the serialized state to be stored in the tuple and sent off to the main
   backend.  The combine agg node should also be set to serialStates = true,
   so that it knows to deserialize instead of just assuming that the agg state
   is of type aggtranstype.

I believe this should allow us to not cause any performance regressions by
moving away from INTERNAL agg states. It should also be very efficient for
internal states such as Int8TransTypeData, as this struct merely has 2
int64 fields which should be very simple to stuff into a bytea varlena
type. We don't need to mess around converting the ->count and ->sum into a
strings or anything.

Then in order for the planner to allow parallel aggregation all aggregates
must:

   1. Not have a DISTINCT or ORDER BY clause
   2. Have a combinefn
   3. If aggtranstype = INTERNAL, must have a aggserialfn and aggdeserialfn.

We can relax the requirement on 3 if we're using 2-stage aggregation, but
not parallel aggregation.

Any objections, or better ideas?

That just leaves me to figure out how to set the correct Aggref->aggtype
during planning, as now there's 3 possible types:

if (finalizeAggs == false)
{
   if (serialStates == true)
  type = aggserialtype;
  else
 type = aggtranstype;
}
else
  type = prorettype; /* normal case */

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Optimizing away second VACUUM heap scan when only BRIN indexes on table

2015-12-21 Thread Peter Geoghegan

On Mon, Dec 21, 2015 at 1:24 AM, Simon Riggs  wrote:
> Given BRIN's characteristics, such a table design is compelling when the
> table is very large, yet possible only for certain use cases.

You can say the same thing about BRIN itself, of course.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-21 Thread Heikki Linnakangas


On 17/12/15 19:07, Robert Haas wrote:

On Mon, Dec 14, 2015 at 6:08 PM, Tomas Vondra
 wrote:

So we know that we should expect about

   (prev_wal_bytes - wal_bytes) + (prev_wal_fpw_bytes - wal_fpw_bytes)

   (   regular WAL) + (  FPW WAL )

to be produced until the end of the current checkpoint. I don't have a clear
idea how to transform this into the 'progress' yet, but I'm pretty sure
tracking the two types of WAL is a key to a better solution. The x^1.5 is
probably a step in the right direction, but I don't feel particularly
confident about the 1.5 (which is rather arbitrary).


If it works well empirically, does it really matter that it's
arbitrary?  I mean, the entire planner is full of fairly arbitrary
assumptions about which things to consider in the cost model and which
to ignore.  The proof that we have made good decisions there is in the
query plans it generates.  (The proof that we have made bad decisions
in some cases in the query plans, too.)


Agreed.


I think a bigger problem for this patch is that Heikki seems to have
almost completely disappeared.


Yeah, there's that problem too :-).

The reason I didn't commit this back then was lack of performance 
testing. I'm fairly confident that this would be a significant 
improvement for some workloads, and shouldn't hurt much even in the 
worst case. But I did only a little testing on my laptop. I think Simon 
was in favor of just committing it immediately, and Fabien wanted to see 
more performance testing before committing.


I was hoping that Digoal would re-ran his original test case, and report 
back on whether it helps. Fabien had a performance test setup, for 
testing another patch, but he didn't want to run it to test this patch. 
Amit did some testing, but didn't see a difference. We can take that as 
a positive sign - no regression - or as a negative sign, but I think 
that basically means that his test was just not sensitive to the FPW issue.


So Tomas, if you're willing to do some testing on this, that would be 
brilliant!


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Costing foreign joins in postgres_fdw

2015-12-21 Thread Ashutosh Bapat

On Fri, Dec 18, 2015 at 6:39 PM, Albe Laurenz 
wrote:

> Ashutosh Bapat wrote:
> > Costs for foreign queries are either obtained from the foreign server
> using EXPLAIN (if
> > use_remote_estimate is ON) otherwise they are cooked up locally based on
> the statistics available. For
> > joins as well, we have to do the same. If use_remote_estimates [1] is
> ON, we can get the costs from
> > the foreign server. Rest of the mail discusses approaches for estimating
> the costs when
> > use_remote_estimates is OFF.
> >
> >
> > 1. Unlike base relations where the table data "has to be" fetched from
> the foreign server, a join
> > doesn't "have to be" fetched from the foreign server. So, even if
> use_remote_estimate is OFF for a
> > base relation, we do try to estimate something locally. But for a join
> that's not compulsory, so we
> > can choose not to estimate anything and not push down the join if
> use_remote_estimate is OFF. Whether
> > we do that depends upon how well we can estimate the join cost when
> use_remote_estimate is OFF.
> >
> > 2. Locally estimating the cost of join that will be performed on the
> foreign server is difficult
> > because we do not know which join strategy the foreign server is going
> to use, which in turn depends
> > upon the availability of indexes, work memory, statistics about joining
> expressions etc. One way to do
> > this is to use the cost of cheapest local join path built upon foreign
> outer and inner paths, to
> > estimate the cost of executing the join at the foreign server The
> startup and run time costs for
> > sending, parsing and planning query at the foreign server as well as the
> cost to fetch the tuples need
> > to be adjusted, so that it doesn't get counted twice. We may assume that
> the cost for the foreign join
> > will be some factor of the adjusted cost, like we have done for
> estimating cost of sort pushdown. The
> > reason we choose cheapest path with foreign inner and outer paths is
> because that's likely to be a
> > closer to the real estimate than the path which does not have foreign
> inner and outer paths. In the
> > absence of such path, we should probably not push the join down since no
> local path has found pushing
> > inner and outer to be cheaper and it's likely (certainly not a rule)
> that pushing the join in question
> > down is not going to be cheaper than the local paths.
> >
> >
> > 1st option is easy but it sounds too restrictive. 2nd option liberal but
> is complex.
>
> My gut feeling is that for a join where all join predicates can be pushed
> down, it
> will usually be a win to push the join to the foreign server.
>

At least in the first cut, we are planning to push a join down only when
all the "join" predicates are pushable, otherwise, we do not push down the
join.


> So in your first scenario, I'd opt for always pushing down the join
> if possible if use_remote_estimate is OFF.
>

Well, that's what the mail is about. How do we decide whether or not push a
join based on the costs when remote estimation is not possible.


>
> Your second scenario is essentially to estimate that a pushed down join
> will
> always be executed as a nested loop join, which will in most cases produce
> an unfairly negative estimate.
>

No, that's not what is intention behind the second scenario. The cheapest
local strategy needn't always be (in fact it's mostly not) nested loop
join. It could be anything hash join, merge join. The reason I suggested
cheapest local was to give foreign join an advantage over all the local
paths created. If we cost the foreign join on fairly lower side compared to
all the local paths and then add transfer costs, that might give foreign
join an added advantage. But relying on local cheapest strategy may cause
the costs to sway a lot because of inaccuracies in local statistics. That's
where I think your next idea helps.


>
> What about using local statistics to come up with an estimated row count
> for
> the join and use that as the basis for an estimate?  My idea here is that
> it
> is always be a win to push down a join unless the result set is so large
> that
> transferring it becomes the bottleneck.
> Maybe, to come up with something remotely realistic, a formula like
>
> sum of locally estimated costs of sequential scan for the base table
> plus count of estimated result rows (times a factor)
>

Cost of foreign join = cost of scanning all the involved base relations +
cost of applying quals + cost of transferring result of the join. The join
is planned two relations (base or join) at a time, so we should be able to
compute this cost recursively, rather than grabbing cost of scanning base
relations every time. That will also work if part of the join tree is built
with use_remote_estimate ON and part without.


>
> Yours,
> Laurenz Albe
>
>
>


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] Patch: Implement failover on libpq connect level.

2015-12-21 Thread Teodor Sigaev


Sorry, but there is something wrong with your patch:
% patch -p1 -C < ~/Downloads/libpq-failover-5.patch

--
|diff --git a/src/interfaces/libpq/fe-connect.c 
b/src/interfaces/libpq/fe-connect.c
...
Hunk #18 succeeded at 2804.
patch:  malformed patch at line 666: <<< BEGIN MERGE CONFLICT: local 
copy shown first <<<


Victor Wagner wrote:

On Mon, 07 Dec 2015 15:26:33 -0500
Korry Douglas  wrote:



The problem seems to be in PQconnectPoll() in the case for
CONNECTION_AUTH_OK, specifically this code:

/* We can release the address list now. */
pg_freeaddrinfo_all(conn->addrlist_family, conn->addrlist);
conn->addrlist = NULL;
conn->addr_cur = NULL;
That frees up the list of alternative host addresses.  The state
machine then progresses to CONNECTION_CHECK_RO (which invokes
pg_is_in_recovery()), then CONNECTION_CHECK_RW (waiting for the


Thank you for pointing to this problem. I've overlooked it. Probably
I should improve my testing scenario.

I', attaching new version of the patch, which, hopefully, handles
address list freeing correctly.









--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Experimental evaluation of PostgreSQL's query optimizer

Albe Laurenz  writes:
> - I also can corroborate your finding that nested loop joins are often
>   harmful, particularly when the inner loop is a sequential scan.
>   One of the first things I do when investigating bad performance of a query
>   whose plan has a nestend loop join is to set enable_nestloop to "off"
>   and see if that makes a difference, and it often does.
>   Maybe it would be a win to bias the planner against nested loop joins.
>   This is dreaming,

Yeah, it sure is, because there's another half of the world that complains
bitterly any time they *don't* get a nested-loop plan; and for their
queries and data, they're right.  I'd be the first to agree that it would
be good to make the planner smarter, but simplistic solutions like "bias
it against (or for) nested loops" aren't likely to improve matters.

>   but it might be nice to have some number as to how
>   reliable a certain estimate is, which is high if the estimate is, say,
>   derived from a single filter on a base table and sinks as more conditions
>   are involved or numbers pulled out of thin air.

That might indeed be a useful thing to try to do, though I confess I'm
not quite sure what we'd do with such numbers if we had them.  It seems
like the next thing would be to replace single cost estimates for plans
with cost ranges, but then how do you compare a couple of overlapping
ranges?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] SET SESSION AUTHORIZATION superuser limitation.

On Sun, Dec 20, 2015 at 1:47 PM, Tom Lane  wrote:
> The syntax you propose exposes the user's password in cleartext in
> the command, where it is likely to get captured in logs for example.
> That's not going to do.

Of course, right now, the ALTER USER ... PASSWORD command has that
problem which is, uh, bad.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: Implement failover on libpq connect level.

2015-12-21 Thread Victor Wagner

On Mon, 21 Dec 2015 17:18:37 +0300
Teodor Sigaev  wrote:

> Sorry, but there is something wrong with your patch:
> % patch -p1 -C < ~/Downloads/libpq-failover-5.patch

Really, somehow broken version of the patch got into message.

Here is correct one.

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 9c0e4c8..162bd4f 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -792,7 +792,7 @@ host=localhost port=5432 dbname=mydb connect_timeout=10

The general form for a connection URI is:
 
-postgresql://[user[:password]@][netloc][:port][/dbname][?param1=value1...]
+postgresql://[user[:password]@][netloc][:port][,netloc[:port]...][/dbname][?param1=value1...]
 

 
@@ -809,6 +809,7 @@ postgresql://localhost/mydb
 postgresql://user@localhost
 postgresql://user:secret@localhost
 postgresql://other@localhost/otherdb?connect_timeout=10application_name=myapp
+postgresql://node1,node2:5433,node3:4432,node4/mydb?hostorder=randomreadonly=1
 
 Components of the hierarchical part of the URI can also
 be given as parameters.  For example:
@@ -831,7 +832,9 @@ postgresql:///mydb?host=localhostport=5433

 For improved compatibility with JDBC connection URIs,
 instances of parameter ssl=true are translated into
-sslmode=require.
+sslmode=require and
+loadBalanceHosts=true  into
+hostorder=random.

 

@@ -841,6 +844,10 @@ postgresql:///mydb?host=localhostport=5433
 postgresql://[2001:db8::1234]/database
 

+   
+   There can be serveral host specifications, optionally accompanied
+   with port, separated by comma.
+   
 

 The host component is interpreted as described for the parameter PostgreSQL was built). On machines without
 Unix-domain sockets, the default is to connect to localhost.

+   
+   There can be more than one host parameter in
+   the connect string. In this case these hosts would be considered
+   alternate entries into same database and if connect to first one
+   fails, library would try to connect second etc. This can be used
+   for high availability cluster or for load balancing. See
+parameter.
+   
+   
+   Network host name can be accompanied with port number, separated by
+   colon. If so, this port number is used only when connected to
+   this host. If there is no port number, port specified in the
+parameter would be used.
+   
   
  
 
@@ -942,8 +963,44 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname


   
-
+
+  
+  hostorder
+  
+  
+  Specifies how to choose host from list of alternate hosts,
+  specified in the  parameter.
+  
+  
+  If value of this argument is sequential (the
+  default) library connects to the hosts in order they specified,
+  and tries to connect second one only if connection to the first
+  fails.
+  
+  
+  If value is random host to connect is randomly
+  picked from the list. It allows to balance load between several
+  cluster nodes. However, currently PostgreSQL doesn't support
+  multimaster clusters. So, without use of third-party products,
+  only read-only connections can take advantage from the
+  load-balancing. See 
+  
+  
+  
+  
+  readonly
+  
+  
+  If this parameter is 0 (the default), upon successful connection
+  library checks if host is in recovery state, and if it is so,
+  tries next host in the connect string. If this parameter is
+  non-zero, connection to warm standby nodes are considered
+  successful.
+  
+  
+  
   
+
port


@@ -985,7 +1042,6 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
   
   
  
-
  
   connect_timeout
   
@@ -996,7 +1052,27 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
   
   
  
-
+ 
+ falover_timeout
+ 
+ 
+ Maximum time to cycilically retry all the hosts in commect string.
+ (as decimal integer number of seconds). If not specified, then
+ hosts are tried just once.
+ 
+ 
+ If we have replicating cluster, and master node fails, it might
+ take some time to promote one of standby nodes to the new master.
+ So clients which notice that connect to the master fails, can
+ already give up attempt to reestablish a connection when new master
+ became available. 
+ 
+ 
+ Setting this parameter to reasonable time makes library try to
+ reconnect all the host in cyclically until new master appears.
+ 
+ 
+ 
  
   client_encoding
   
@@ -7165,6 +7241,18 @@ user=admin
An example file is provided at
share/pg_service.conf.sample.
   
+  
+  If more than one host option present in the section of service file, it
+  is interpeted as alternate servers for failover or load-balancing. See
+   option

Re: [HACKERS] ToDo list update for BRIN indexes

On Mon, Dec 21, 2015 at 8:06 AM, Simon Riggs  wrote:
> On 21 December 2015 at 12:54, Robert Haas  wrote:
>>
>> On Thu, Jul 9, 2015 at 4:49 PM, Jeff Janes  wrote:
>> > Is there anything in the below section which has been been implemented
>> > or
>> > rendered irrelevant by BRIN indexes?
>> >
>> > https://wiki.postgresql.org/wiki/Todo#Indexes
>> >
>> > "Consider smaller indexes that record a range of values per heap page,
>> > rather than having one index entry for every heap row"
>>
>> [ catching up on old threads ]
>>
>> BRIN is exactly this, isn't it?  Well, moreso: it's a range of values
>> for a range of heap pages.
>
> It's close, but not the same.
>
> BRIN is a summary index and so could never support uniqueness.

Hmm, but that Todo wording seems to suggest a summary index, so I
don't think that proposal would support uniqueness either.

> It's also possible to have an index type that has a precise TID entry, yet a
> more compact format, which would then allow unique values. This would be
> similar to the way SQLServer compresses primary key indexes.

True.  But would that require a new index type, or would we do that
just by optimizing btree?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ToDo list update for BRIN indexes

On 21 December 2015 at 14:38, Robert Haas  wrote:

> On Mon, Dec 21, 2015 at 8:06 AM, Simon Riggs 
> wrote:
> > On 21 December 2015 at 12:54, Robert Haas  wrote:
> >>
> >> On Thu, Jul 9, 2015 at 4:49 PM, Jeff Janes 
> wrote:
> >> > Is there anything in the below section which has been been implemented
> >> > or
> >> > rendered irrelevant by BRIN indexes?
> >> >
> >> > https://wiki.postgresql.org/wiki/Todo#Indexes
> >> >
> >> > "Consider smaller indexes that record a range of values per heap page,
> >> > rather than having one index entry for every heap row"
> >>
> >> [ catching up on old threads ]
> >>
> >> BRIN is exactly this, isn't it?  Well, moreso: it's a range of values
> >> for a range of heap pages.
> >
> > It's close, but not the same.
> >
> > BRIN is a summary index and so could never support uniqueness.
>
> Hmm, but that Todo wording seems to suggest a summary index, so I
> don't think that proposal would support uniqueness either.


Probably garbled slightly.


> > It's also possible to have an index type that has a precise TID entry,
> yet a
> > more compact format, which would then allow unique values. This would be
> > similar to the way SQLServer compresses primary key indexes.
>
> True.  But would that require a new index type, or would we do that
> just by optimizing btree?


Heikki worked on Grouped Item Tuple indexes about 8 years ago doing exactly
that. They gave about 30% performance advantage at the time, when used with
HOT, which was the original driver for that feature.

BRIN and GIT approaches degrade under heavy non-HOT updates.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] A question regarding LWLock in ProcSleep

2015-12-21 Thread Kenan Yao

Hi Amit,

Thanks for the reply!

Yes, in LockAcquireExtended, after exiting WaitOnLock, there would be a
read access to proclock, however, since the lock has either been granted or
rejected by deadlock check, I can think of no possible concurrent write
access to the proclock, so is it really necessary to acquire the LWLock?

Thanks,
Kenan

On Fri, Dec 18, 2015 at 10:25 PM, Amit Kapila 
wrote:

> On Thu, Dec 17, 2015 at 1:51 PM, Kenan Yao  wrote:
>
>> Hi there,
>>
>> In function ProcSleep, after the process has been waken up, either with
>> lock granted or deadlock detected, it would re-acquire the lock table's
>> partition LWLock.
>>
>> The code episode is here:
>>
>> /*
>>  * Re-acquire the lock table's partition lock.  We have to do this to 
>> hold
>>  * off cancel/die interrupts before we can mess with lockAwaited (else we
>>  * might have a missed or duplicated locallock update).
>>  */
>> LWLockAcquire(partitionLock, LW_EXCLUSIVE);
>>
>> /*
>>  * We no longer want LockErrorCleanup to do anything.
>>  */
>> lockAwaited = NULL;
>>
>> /*
>>  * If we got the lock, be sure to remember it in the locallock table.
>>  */
>> if (MyProc->waitStatus == STATUS_OK)
>> GrantAwaitedLock();
>>
>> /*
>>  * We don't have to do anything else, because the awaker did all the
>>  * necessary update of the lock table and MyProc.
>>  */
>> return MyProc->waitStatus;
>>
>> 
>> Questions are:
>>
>> (1) The comment says that "we might have a missed or duplicated locallock
>> update", in what cases would we hit this if without holding the LWLock?
>>
>> (2) The comment says "we have to do this to hold off cancel/die
>> interrupts", then:
>>
>>- why using LWLockAcquire instead of HOLD_INTERRUPTS directly?
>>- From the handler of SIGINT and SIGTERM, seems nothing serious would
>>be processed here, since no CHECK_FOR_INTERRUPS is called before releasing
>>this LWLock. Why we should hold off cancel/die interrupts here?
>>
>> (3) Before releasing this LWLock, the only share memory access is
>> MyProc->waitStatus; since the process has been granted the lock or removed
>> from the lock's waiting list because of deadlock, is it possible some other
>> processes would access this field? if not, then why we need LWLock here?
>> what does this lock protect?
>>
>>
> I think the other thing which needs protection of LWLock is
> access to proclock which is done in the caller
> (LockAcquireExtended).
>
>
>
>
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>

Re: [HACKERS] A question regarding LWLock in ProcSleep

2015-12-21 Thread Kenan Yao

+Heikki,

Hi Heikki,

Could you please help provide a case to elaborate why we need the LWLock
here? or could you please tell me to whom should I ask this question?

Thanks,
Kenan

On Tue, Dec 22, 2015 at 11:41 AM, Kenan Yao  wrote:

> Hi Amit,
>
> Thanks for the reply!
>
> Yes, in LockAcquireExtended, after exiting WaitOnLock, there would be a
> read access to proclock, however, since the lock has either been granted or
> rejected by deadlock check, I can think of no possible concurrent write
> access to the proclock, so is it really necessary to acquire the LWLock?
>
> Thanks,
> Kenan
>
> On Fri, Dec 18, 2015 at 10:25 PM, Amit Kapila 
> wrote:
>
>> On Thu, Dec 17, 2015 at 1:51 PM, Kenan Yao  wrote:
>>
>>> Hi there,
>>>
>>> In function ProcSleep, after the process has been waken up, either with
>>> lock granted or deadlock detected, it would re-acquire the lock table's
>>> partition LWLock.
>>>
>>> The code episode is here:
>>>
>>> /*
>>>  * Re-acquire the lock table's partition lock.  We have to do this to 
>>> hold
>>>  * off cancel/die interrupts before we can mess with lockAwaited (else 
>>> we
>>>  * might have a missed or duplicated locallock update).
>>>  */
>>> LWLockAcquire(partitionLock, LW_EXCLUSIVE);
>>>
>>> /*
>>>  * We no longer want LockErrorCleanup to do anything.
>>>  */
>>> lockAwaited = NULL;
>>>
>>> /*
>>>  * If we got the lock, be sure to remember it in the locallock table.
>>>  */
>>> if (MyProc->waitStatus == STATUS_OK)
>>> GrantAwaitedLock();
>>>
>>> /*
>>>  * We don't have to do anything else, because the awaker did all the
>>>  * necessary update of the lock table and MyProc.
>>>  */
>>> return MyProc->waitStatus;
>>>
>>> 
>>> Questions are:
>>>
>>> (1) The comment says that "we might have a missed or duplicated
>>> locallock update", in what cases would we hit this if without holding the
>>> LWLock?
>>>
>>> (2) The comment says "we have to do this to hold off cancel/die
>>> interrupts", then:
>>>
>>>- why using LWLockAcquire instead of HOLD_INTERRUPTS directly?
>>>- From the handler of SIGINT and SIGTERM, seems nothing serious
>>>would be processed here, since no CHECK_FOR_INTERRUPS is called before
>>>releasing this LWLock. Why we should hold off cancel/die interrupts here?
>>>
>>> (3) Before releasing this LWLock, the only share memory access is
>>> MyProc->waitStatus; since the process has been granted the lock or removed
>>> from the lock's waiting list because of deadlock, is it possible some other
>>> processes would access this field? if not, then why we need LWLock here?
>>> what does this lock protect?
>>>
>>>
>> I think the other thing which needs protection of LWLock is
>> access to proclock which is done in the caller
>> (LockAcquireExtended).
>>
>>
>>
>>
>> With Regards,
>> Amit Kapila.
>> EnterpriseDB: http://www.enterprisedb.com
>>
>
>

Re: [HACKERS] Additional LWLOCK_STATS statistics

On Mon, Dec 21, 2015 at 4:50 PM, Jesper Pedersen
 wrote:
> On 12/18/2015 01:16 PM, Robert Haas wrote:
>>
>> Is this just for informational purposes, or is this something you are
>> looking to have committed?  I originally thought the former, but now
>> I'm wondering if I misinterpreted your intent.  I have a hard time
>> getting excited about committing something that would, unless I'm
>> missing something, pretty drastically increase the overhead of running
>> with LWLOCK_STATS...
>>
>
> Yeah, so unless other people using LWLOCK_STATS find the additional
> information of use (w/ the extra overhead), I think we can mark it as
> "Returned with feedback" or "Rejected".

Marked as rejected for this CF then, log overhead is not something to
ignore. There has been a fair amount of infrastructure work done btw
thanks to your impulse.

> Alternative, I can redo the patch requiring an additional #define - f.ex.
> LWLOCK_STATS_QUEUE_SIZES

Feel free to do so if you wish, that may be interesting to see what this gives.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Additional role attributes && superuser review

On Tue, Dec 22, 2015 at 1:41 AM, Stephen Frost  wrote:
> Updated and rebased patch attached which takes the 'pg_switch_xlog'
> default role back out, leaving us with:
>
> pg_monitor - View privileged info
> pg_backup - start/stop backups, switch xlog, create restore points
> pg_replay - Pause/resume xlog replay on replicas
> pg_replication - Create/destroy/etc replication slots
> pg_rotate_logfile - Request logfile rotation
> pg_signal_backend - Signal other backends (cancel query/terminate)
> pg_file_settings - View configuration settings in all config files
>
> Michael, another review would be great, if you don't mind.  I'm going to
> be going through it also more closely since it sounds like we've got
> consensus on at least this initial set of default roles and rights.  If
> all looks good then I'll commit it.

Thanks. This looks fine to me. I just have one single comment:

+   Request logfile rotation.
s/logfile/transaction log file/
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] custom function for converting human readable sizes to bytes

On Tue, Dec 22, 2015 at 12:11 AM, Robert Haas  wrote:
> On Sun, Dec 20, 2015 at 4:54 AM, Pavel Stehule  
> wrote:
>> new update:
>>
>> 1. unit searching is case insensitive
>>
>> 2. initial support for binary byte prefixes - KiB, MiB, ..  (IEC standard),
>> change behave for SI units
>>
>> Second point is much more complex then it is looking - if pg_size_bytes
>> should be consistent with pg_size_pretty.
>>
>> The current pg_size_pretty and transformations in guc.c are based on JEDEC
>> standard. Using this standard for GUC has sense - using it for object sizes
>> is probably unhappy.
>>
>> I tried to fix (and enhance) pg_size_pretty - now reports correct units, and
>> via second parameter it allows to specify base: 2 (binary, IEC  - default)
>> or 10 (SI).
>
> -1 from me.  I don't think we should muck with the way pg_size_pretty works.

Yeah.

+ static const unit_multiplier unit_multiplier_table[] =
+ {
+ {"B", 1L},
+ {"kiB", 1024L},
+ {"MiB", 1024L * 1024},
+ {"GiB", 1024L * 1024 * 1024},
+ {"TiB", 1024L * 1024 * 1024 * 1024},
+ {"PiB", 1024L * 1024 * 1024 * 1024 * 1024},
This is rather close to memory_unit_conversion_table in guc.c. Would
it be worth refactoring those unit tables into something more generic
instead of duplicating them?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Commit fest status for 2015-11

Michael Paquier  writes:
> To committers: there are 4 patches waiting for input.

Those are:

: SQL function to report log message

AFAICT, there is no committer who likes this idea enough to commit it.
It's questionable whether we need such a feature at all, and it's even
more questionable whether this particular instantiation of the feature
is well-designed.  I'm not going to opine as to whether it should get
marked "returned with feedback" or "rejected", but I doubt it's going
to get committed as-is.

: Reusing abbreviated keys during second pass of ordered [set] aggregates

Robert has been committer for all the predecessor patches, so I assume
this one will be his.

: Speedup timestamp/time/date output functions

I'd look into this if Andres hadn't already claimed it ...

: Improving replay of XLOG_BTREE_VACUUM records

Heikki and Andres have already reviewed this, so I assume one or the
other of them will be committer.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Experimental evaluation of PostgreSQL's query optimizer

2015-12-21 Thread Craig Ringer

On 21 December 2015 at 23:57, Viktor Leis  wrote:

>
> Please have a look at Figure 6 (page 6) in
> http://www.vldb.org/pvldb/vol9/p204-leis.pdf Disabling nested loop
> joins without index scan (going from (a) to (b)) results in great
> improvements across the board. And even more importantly, it avoids
> most of the cases where queries took unreasonably long and timed
> out. Basically this amounts to the being able to run the query on
> PostgreSQL or not.
>

For that data, yes. But you're ignoring other important cases. Small or
even 1-element lookup tables can be one where a nestloop over a seqscan
turns out to be by far the fastest way to do the job. This can really add
up if it's deep in a subplan that's excuted repeatedly, or if it's part of
queries that get run very frequently on a busy OLTP system.

That said, these cases are also the ones that land up hurting very badly if
the stats are inaccurate or outdated and our expected 3 loops turns into
3000.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Additional role attributes && superuser review

On Tue, Dec 22, 2015 at 2:54 PM, Amit Langote
 wrote:
> On 2015/12/22 14:05, Michael Paquier wrote:
>> On Tue, Dec 22, 2015 at 1:41 AM, Stephen Frost  wrote:
>>> Updated and rebased patch attached which takes the 'pg_switch_xlog'
>>> default role back out, leaving us with:
>>>
>>> pg_monitor - View privileged info
>>> pg_backup - start/stop backups, switch xlog, create restore points
>>> pg_replay - Pause/resume xlog replay on replicas
>>> pg_replication - Create/destroy/etc replication slots
>>> pg_rotate_logfile - Request logfile rotation
>>> pg_signal_backend - Signal other backends (cancel query/terminate)
>>> pg_file_settings - View configuration settings in all config files
>>
>> Thanks. This looks fine to me. I just have one single comment:
>>
>> +   Request logfile rotation.
>> s/logfile/transaction log file/
>
> Looks correct as is. Or maybe "server's log file" as in:
>
> 9.26.2. Server Signaling Functions
>
> pg_rotate_logfile(): Rotate server's log file

You're right, this is not a WAL segment, just a normal log file. Your
phrasing is better.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch: Implement failover on libpq connect level.

On Mon, Dec 21, 2015 at 11:50 PM, Victor Wagner  wrote:
> On Mon, 21 Dec 2015 17:18:37 +0300
> Teodor Sigaev  wrote:
>
>> Sorry, but there is something wrong with your patch:
>> % patch -p1 -C < ~/Downloads/libpq-failover-5.patch
>
> Really, somehow broken version of the patch got into message.
>
> Here is correct one.

Patch is moved to next CF as you are still working on it.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] custom function for converting human readable sizes to bytes

2015-12-22 6:57 GMT+01:00 Michael Paquier :

> On Tue, Dec 22, 2015 at 2:33 PM, Pavel Stehule 
> wrote:
> >
> >
> > 2015-12-22 6:22 GMT+01:00 Michael Paquier :
> >>
> >> On Tue, Dec 22, 2015 at 12:11 AM, Robert Haas 
> >> wrote:
> >> > On Sun, Dec 20, 2015 at 4:54 AM, Pavel Stehule <
> pavel.steh...@gmail.com>
> >> > wrote:
> >> >> new update:
> >> >>
> >> >> 1. unit searching is case insensitive
> >> >>
> >> >> 2. initial support for binary byte prefixes - KiB, MiB, ..  (IEC
> >> >> standard),
> >> >> change behave for SI units
> >> >>
> >> >> Second point is much more complex then it is looking - if
> pg_size_bytes
> >> >> should be consistent with pg_size_pretty.
> >> >>
> >> >> The current pg_size_pretty and transformations in guc.c are based on
> >> >> JEDEC
> >> >> standard. Using this standard for GUC has sense - using it for object
> >> >> sizes
> >> >> is probably unhappy.
> >> >>
> >> >> I tried to fix (and enhance) pg_size_pretty - now reports correct
> >> >> units, and
> >> >> via second parameter it allows to specify base: 2 (binary, IEC  -
> >> >> default)
> >> >> or 10 (SI).
> >> >
> >> > -1 from me.  I don't think we should muck with the way pg_size_pretty
> >> > works.
> >>
> >> Yeah.
> >>
> >> + static const unit_multiplier unit_multiplier_table[] =
> >> + {
> >> + {"B", 1L},
> >> + {"kiB", 1024L},
> >> + {"MiB", 1024L * 1024},
> >> + {"GiB", 1024L * 1024 * 1024},
> >> + {"TiB", 1024L * 1024 * 1024 * 1024},
> >> + {"PiB", 1024L * 1024 * 1024 * 1024 * 1024},
> >> This is rather close to memory_unit_conversion_table in guc.c. Would
> >> it be worth refactoring those unit tables into something more generic
> >> instead of duplicating them?
> >
> >
> > yes, it is possible with following impacts:
> >
> > 1. We need add PB to memory_unit_conversion_table in guc.c
>
> No real objection to that. It would would make sense to have it, but
> we could not use it directly for a GUC. This just reminded me that
> even if we support TB in GUC params, it is not possible to set for
> example a GUC_UNIT_KB param to more than 2TB because those are limited
> to be int32.
>
> > 2. This table holds multipliers in JEDEC standard - and introduce other
> > standards IEC, SI there isn't good idea.
> >
> > Is it ok?
>
> Do you think it would be interesting to have GUC parameters able to
> use those units? If this is a standard, it may make sense to actually
> have them, no? Just a random thought, that's not something this patch
> should take care of, but it would be good to avoid code duplication
> where we can avoid it.
> Regards,
>

Nice to have it. But it can introduce a upgrade issues - there isn't
possible to specify base. If we do some change here, we have to do related
changes everywhere.

Our implementation is little bit obsolete, but nobody did some error
reports, so probably now isn't time to change it. It's not too important.

Better to have two conversion tables (simple and small), than opening new
topic, where I don't see any agreement to do changes - and these changes
can be done separately.

Pavel



> --
> Michael
>

Re: [HACKERS] Declarative partitioning

2015-12-21 Thread Amit Langote

On 2015/12/18 3:56, Robert Haas wrote:
> On Mon, Dec 14, 2015 at 2:14 AM, Amit Langote
>  wrote:
>> Syntax to create a partitioned table (up to 2 levels of partitioning):
>>
>> CREATE TABLE foo (
>> ...
>> )
>> PARTITION BY R/L ON (key0)
>> SUBPARTITION BY R/L ON (key1)
>> [(PARTITION foo_1 FOR VALUES  [] []
>> [(SUBPARTITION foo_1_1 FOR VALUES  [] [],
>> ...)], ...)];
>>
>> The above creates two pg_partitioned_rel entries for foo with partlevel 0
>> and 1, for key0 and key1, respectively. For foo_1 and foo_1_1, this
>> creates pg_partition entries, with foo and foo_1 as partparent,
>> respectively.
>>
>> Why just 2 levels? - it seems commonplace and makes the syntax more
>> intuitive? I guess it might be possible to generalize the syntax for
>> multi-level partitioning. Ideas? If we want to support the notion of
>> sub-partition template in future, that would require some thought, more
>> importantly proper catalog organization for the same.
> 
> I do not think this is a particularly good idea.  You're going to need
> to dump each partition separately at least in --binary-upgrade mode,
> because each is going to have its own magic OIDs that need to be
> restored, and also because there will most likely be at least some
> properties that are going to vary between partitions.  You could
> require that every partition have exactly the same set of columns,
> constraints, rules, triggers, policies, attribute defaults, comments,
> column comments, and everything else that might be different from one
> partition to another, and further require that they have exactly
> matching indexes.  It would take a fair amount of code to prohibit all
> that, but it could be done.  However, do we really want that?  There
> may well be some things were we want to enforce that the parent and
> the child are exactly identical, but I doubt we want that for
> absolutely every property, current and future, of the partition.  And
> even if you did, because of the --binary-upgrade stuff, you still need
> to to be able to dump them separately.
> 
> Therefore, I believe it is a whole lot better to make the primary
> syntax for table partitioning something where you issue a CREATE
> statement for the parent and then a CREATE statement for each child.
> If we want to also have a convenience syntax so that people who want
> to create a parent and a bunch of children in one fell swoop can do
> so, fine.

Regarding --binary-upgrade dump mode, how about we teach pg_dump to dump
each partition separately using ALTER TABLE parent ADD PARTITION
especially for the "magic OIDs" reason? It may very well be a CREATE
PARTITION-style command though. Note that each such command could specify
properties that can be different per partition. I said in my email,
perhaps not so clearly, that "only" WITH options, tablespace and
relpersistence can be different per partition. But I can see why that may
be severely restrictive at this stage.

By the way, what do you think about SUBPARTITION keyword-based syntax for
multi-level partitioning? Should we instead require that each partition
has its own PARTITION BY in its creation command?

> 
> I would not choose to model the syntax for creating partitions on
> Oracle.  I don't find that syntax particularly nice or easy to
> remember.  I say PARTITION BY RANGE, and then inside the parentheses I
> use the PARTITION keyword for each partition?  Really?  But I think
> copying the style while having the details be incompatible is an even
> worse idea.

As for the convenience syntax (if at all), how about:

CREATE TABLE foo (
  ...
)
PARTITION BY ... ON (...)
SUBPARTITION BY ... ON (...)
opt_partition_list;

where opt_partition_list is:

PARTITIONS (
  partname FOR VALUES ... [WITH] [TABLESPACE] opt_subpart_list
  [, ...]
)

where opt_subpart_list is:

SUBPARTITIONS (
  subpartname FOR VALUES ... [WITH] [ TABLESPACE]
  [, ...]
)

PARTITIONS, SUBPARTITIONS would be new unreserved keywords. Or we can do
away with them.

>> What about ALTER TABLE? - Instead of allowing ALTER TABLE to be applied
>> directly to partitions on case-by-case basis (they are tables under the
>> hood after all), we should restrict AT to the master table. Most of the AT
>> changes implicitly propagate from the master table to its partitions. Some
>> of them could be directly applied to partitions and/or sub-partitions such
>> as rename, storage manipulations like - changing tablespace, storage
>> parameters (reloptions), etc.:
>>
>> ALTER TABLE foo
>> RENAME PARTITION  TO ;
>>
>> ALTER TABLE foo
>> RENAME SUBPARTITION  TO ;
>>
>> ALTER TABLE foo
>> SET TABLESPACE ... [DEFAULT] FOR PARTITION ;
>>
>> ALTER TABLE foo
>> SET TABLESPACE ... FOR SUBPARTITION ;
>>
>> ALTER TABLE foo
>> SET (storage_parameter = value) [DEFAULT] FOR PARTITION ;
>>
>> ALTER TABLE foo
>> SET (storage_parameter = value) FOR SUBPARTITION ;
> 
> I don't think this is a very good idea.  This is basically proposing
> that for every DDL command that

Re: [HACKERS] Experimental evaluation of PostgreSQL's query optimizer

2015-12-21 Thread Craig Ringer

On 21 December 2015 at 20:53, Viktor Leis  wrote:

> I think your suggestion amounts to caching the cardinalities of all
> two-way joins. One major issue is that for a query like
>
> select * from r1, r2 where r1.x = r2.y and r1.a = ? and r2.b;
>
> it depends on the specific values of r1.a and r2.b whether there is
> any (anti-)correlation. And obviously one cannot store correction
> factors for each value of a and b.
>
>
I see a parallel with indexing and partial indexes here.

We obviously cannot afford to keep cross-table correlations for every
possible pairing of join conditions across every possible set of joined
tables. Much like we can't afford to keep indexes for every possible set of
columns, but even worse.

Much as we let users CREATE INDEX to tell us what cols to index, maybe we
should let them CREATE a cross-table join statistics collector for a
particular set of tables, optionally qualified with a filter condition just
like we do on partial indexes, and optionally transformed via an immutable
expression like we do for expression indexes, e.g.:

CREATE JOIN STATISTICS ON t1 JOIN t2 ON (t1.col1 = t2.col2);

CREATE JOIN STATISTICS ON t1 JOIN t2 ON (lower(t1.col1) = lower(t2.col2))
WHERE t1.othercol IS NOT NULL;

CREATE JOIN STATISTICS ON t1 JOIN t2 ON (t1.colx = t2.colx AND t1.coly =
t2.coly);

plus a simplified form like

CREATE JOIN STATISTICS ON t1 JOIN t2 USING (somecol);

That way we let an admin who's tuning queries direct effort at problem
areas. It's not automagical, but it's an area where tools could analyze
pg_stat_statements to direct effort, much like is currently done for index
creation. Like index creation I don't think it's practical to do this
entirely automatically and behind the scenes since collecting the stats for
all possibilities rapidly gets prohibitive.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] custom function for converting human readable sizes to bytes

2015-12-22 6:22 GMT+01:00 Michael Paquier :

> On Tue, Dec 22, 2015 at 12:11 AM, Robert Haas 
> wrote:
> > On Sun, Dec 20, 2015 at 4:54 AM, Pavel Stehule 
> wrote:
> >> new update:
> >>
> >> 1. unit searching is case insensitive
> >>
> >> 2. initial support for binary byte prefixes - KiB, MiB, ..  (IEC
> standard),
> >> change behave for SI units
> >>
> >> Second point is much more complex then it is looking - if pg_size_bytes
> >> should be consistent with pg_size_pretty.
> >>
> >> The current pg_size_pretty and transformations in guc.c are based on
> JEDEC
> >> standard. Using this standard for GUC has sense - using it for object
> sizes
> >> is probably unhappy.
> >>
> >> I tried to fix (and enhance) pg_size_pretty - now reports correct
> units, and
> >> via second parameter it allows to specify base: 2 (binary, IEC  -
> default)
> >> or 10 (SI).
> >
> > -1 from me.  I don't think we should muck with the way pg_size_pretty
> works.
>
> Yeah.
>
> + static const unit_multiplier unit_multiplier_table[] =
> + {
> + {"B", 1L},
> + {"kiB", 1024L},
> + {"MiB", 1024L * 1024},
> + {"GiB", 1024L * 1024 * 1024},
> + {"TiB", 1024L * 1024 * 1024 * 1024},
> + {"PiB", 1024L * 1024 * 1024 * 1024 * 1024},
> This is rather close to memory_unit_conversion_table in guc.c. Would
> it be worth refactoring those unit tables into something more generic
> instead of duplicating them?
>

yes, it is possible with following impacts:

1. We need add PB to memory_unit_conversion_table in guc.c
2. This table holds multipliers in JEDEC standard - and introduce other
standards IEC, SI there isn't good idea.

Is it ok?

Regards

Pavel



> --
> Michael
>

Re: [HACKERS] custom function for converting human readable sizes to bytes

On Tue, Dec 22, 2015 at 2:33 PM, Pavel Stehule  wrote:
>
>
> 2015-12-22 6:22 GMT+01:00 Michael Paquier :
>>
>> On Tue, Dec 22, 2015 at 12:11 AM, Robert Haas 
>> wrote:
>> > On Sun, Dec 20, 2015 at 4:54 AM, Pavel Stehule 
>> > wrote:
>> >> new update:
>> >>
>> >> 1. unit searching is case insensitive
>> >>
>> >> 2. initial support for binary byte prefixes - KiB, MiB, ..  (IEC
>> >> standard),
>> >> change behave for SI units
>> >>
>> >> Second point is much more complex then it is looking - if pg_size_bytes
>> >> should be consistent with pg_size_pretty.
>> >>
>> >> The current pg_size_pretty and transformations in guc.c are based on
>> >> JEDEC
>> >> standard. Using this standard for GUC has sense - using it for object
>> >> sizes
>> >> is probably unhappy.
>> >>
>> >> I tried to fix (and enhance) pg_size_pretty - now reports correct
>> >> units, and
>> >> via second parameter it allows to specify base: 2 (binary, IEC  -
>> >> default)
>> >> or 10 (SI).
>> >
>> > -1 from me.  I don't think we should muck with the way pg_size_pretty
>> > works.
>>
>> Yeah.
>>
>> + static const unit_multiplier unit_multiplier_table[] =
>> + {
>> + {"B", 1L},
>> + {"kiB", 1024L},
>> + {"MiB", 1024L * 1024},
>> + {"GiB", 1024L * 1024 * 1024},
>> + {"TiB", 1024L * 1024 * 1024 * 1024},
>> + {"PiB", 1024L * 1024 * 1024 * 1024 * 1024},
>> This is rather close to memory_unit_conversion_table in guc.c. Would
>> it be worth refactoring those unit tables into something more generic
>> instead of duplicating them?
>
>
> yes, it is possible with following impacts:
>
> 1. We need add PB to memory_unit_conversion_table in guc.c

No real objection to that. It would would make sense to have it, but
we could not use it directly for a GUC. This just reminded me that
even if we support TB in GUC params, it is not possible to set for
example a GUC_UNIT_KB param to more than 2TB because those are limited
to be int32.

> 2. This table holds multipliers in JEDEC standard - and introduce other
> standards IEC, SI there isn't good idea.
>
> Is it ok?

Do you think it would be interesting to have GUC parameters able to
use those units? If this is a standard, it may make sense to actually
have them, no? Just a random thought, that's not something this patch
should take care of, but it would be good to avoid code duplication
where we can avoid it.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] SET SESSION AUTHORIZATION superuser limitation.

2015-12-21 Thread Craig Ringer

On 21 December 2015 at 22:57, Tom Lane  wrote:

> Robert Haas  writes:
> > On Sun, Dec 20, 2015 at 1:47 PM, Tom Lane  wrote:
> >> The syntax you propose exposes the user's password in cleartext in
> >> the command, where it is likely to get captured in logs for example.
> >> That's not going to do.
>
> > Of course, right now, the ALTER USER ... PASSWORD command has that
> > problem which is, uh, bad.
>
> Which is why we invented the ENCRYPTED PASSWORD syntax

... which doesn't actually help anything much at all.

It prevents exposure of the user's cleartext password, sure, but the hashed
("encrypted") password passed to ALTER USER ... ENCRYPTED PASSWORD is
sufficient to log in. It substitutes for the original password entirely.

Right now the logs just have to be treated as security critical. Which
sucks, but is not easily solved.

Nothing is going to stop:

ALTER USER fred PAWORD 'sekrit';

from logging the password in a syntax error. But it'd be nice to let
utility commands define a log hook that lets them emit a sanitized version
of themselves based on their parse tree representation to the logs.

Except that users will want to be able to mask log output too. I see lots
of questions about how to stop pgcrypto sql function calls from exposing
key materials in the logs. Right now the answer is "you can't". With
logging based on the raw statement text before parsing I don't see any way
to change that. I advise people to do their symmetric crypto and their
secret key operations in the application instead, which has the advantage
of also better isolating the key material from its persistent storage in
the database.

We have to be able to emit syntax errors and other things that use the raw
SQL text. We also don't have any functionality to turn a parsetree back
into SQL text with parts of it masked out, and it'd be impractical to do
that just for logging anyway.

I can see it being useful to be able to set a session level flag that
limits logging to command tags, not command text. Let the superuser GRANT
the right to set it to other users. Use a GUC to toggle it, preferably via
SET LOCAL. It has to be session level not statement level because we've got
no way to set generic options per-statement, and plus that'd risk leaking
the statement on a parse error. We'd probably replace the statement text
with a string like 'PARSE_ERROR' until the command tag was known, then
replace it with the command tag. This would reduce the audit utility of the
logs a little, but if it's superuser-only unless granted you're already
stuffed if someone who's not meant to gets hold of it.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

[HACKERS] Commit fest status for 2015-11

Hi all,

As of today, at the moment I am writing this message, here is the
commit fest app status:
Needs review: 20.
Waiting on Author: 24.
Ready for Committer: 4.
Committed: 29.
Moved to next CF: 11.
Rejected: 8.
Returned with Feedback: 7.
Total: 103.

This means in short that 48 out of 103 patches still need to be
treated by the end of this commit fest, which was actually 3 weeks
ago. As the so-said CFM, perhaps I slacked a bit too much here, but to
be honest at this rhythm we are not going to make it by the end of the
month, so I am going to look at the 48 remaining patches and see if
each one can be moved to the next commit fest or returned with
feedback. I would suspect as well that the status of some of those
patches has not been updated for a while.

It would be cool to get a largely cleaned up CF app by Christmas, or
err... 3 days. This is going to be a vacation period and a lot of
people are not going to be here.

To patch authors: please be careful that the status of your patch is
correctly updated. And feel free to scream injustice regarding your
patch if you think the CFM was unfair. I'll update the related threads
regarding the status of the patch, so feel free to complain if needed
on the related thread.
To reviewers: thanks for the reviews done!
To committers: there are 4 patches waiting for input.

Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-12-21 Thread Amit Langote

On 2015/12/22 15:24, Michael Paquier wrote:
> On Thu, Dec 10, 2015 at 1:32 AM, Robert Haas  wrote:
>> If we get the feature - join pushdown for postgres_fdw -
>> working, then we might get some feedback from users about what they
>> like about it or don't, and certainly if this is a frequent complaint
>> then that bolsters the case for doing something about it, and possibly
>> also helps us figure out what that thing should be.  On the other
>> hand, if we don't get the feature because we're busy debating
>> interface details related to this patch, then none of these details
>> matter anyway because nobody except developer is actually running the
>> code in question.
> 
> As this debate continues, I think that moving this patch to the next
> CF would make the most sense then.. So done this way.

Perhaps, this ended (?) with the following commit:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=385f337c9f39b21dca96ca4770552a10a6d5af24

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Coding note: truncating with strlcpy() is not such a hot idea

2015-12-21 Thread Noah Misch

On Mon, Dec 21, 2015 at 03:47:56PM -0500, Tom Lane wrote:
> 1. Using strlcpy to copy from sources that are not known null-terminated
> is outright unsafe: someday, you'll fall off the end of memory and
> SIGSEGV.
> 
> 2. Even with a known null-terminated string, using strlcpy to chop small
> pieces out of it is not bright, because each call will traverse all the
> rest of the input.  This means for example that extracting N words out
> of a long string will spend O(N^2) time in strlcpy.

Good points.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Some questions about the array.

On Tue, Dec 22, 2015 at 2:28 AM, Tom Lane  wrote:
> Uriy Zhuravlev  writes:
>>> I'm dubious that the parsetree representation is well-chosen.
>>> Probably a single is_slice flag would have been better.
>
>> What do you mean? This flag is for what? You are about the A_Indices
>> node(lidx_default/uidx_default)?
>
> Yes.  Those flags are partially redundant with the subtree pointers being
> NULL, and there are combinations that would be invalid (such as
> lidx_default being set but lidx not being null), and it's pretty unobvious
> what the difference in representation is between a non-slice case and a
> slice case with only the upper index provided.  In fact, since you have
> five syntaxes to represent, it's impossible for the two bools to
> distinguish them all, which means that at least one case *must* be
> identified by null-ness of a pointer contradicting what the corresponding
> bool's setting would imply.  So this just seems like a mess to me.
>
> I think it would come out cleaner if you had just one bool is_slice,
> which corresponds directly to whether a colon was present.  The four
> sub-possibilities of colon notation would be represented by combinations
> of null and non-null lidx and uidx.  With is_slice false, the only valid
> case is lidx==NULL, uidx!=NULL, as before for non-slice notation.

Patch is still in the works and author is still active, so moved to next CF.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Freeze avoidance of very large table.

2015-12-21 Thread Kyotaro HORIGUCHI

Hello,

At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas  wrote 
in 
> On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
>  wrote:
> > I am not really getting the meaning of this sentence. Shouldn't this
> > be reworded something like:
> > "Freezing occurs on the whole table once all pages of this relation require 
> > it."
> 
> That statement isn't remotely true, and I don't think this patch
> changes that.  Freezing occurs on the whole table once relfrozenxid is
> old enough that we think there might be at least one page in the table
> that requires it.

I doubt I can explain this accurately, but I took the original
phrase as that if and only if all pages of the table are marked
as "requires freezing" by accident, all pages are frozen. It's
quite obvious but it is what I think "happen to require freezing"
means. Does this make sense?

The phrase might not be necessary if this is correct.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Optimizing away second VACUUM heap scan when only BRIN indexes on table

On 21 December 2015 at 02:14, Alvaro Herrera 
wrote:

> Jim Nasby wrote:
> > On 11/23/15 5:06 PM, Peter Geoghegan wrote:
> > >I realize that the second scan performed by lazy_vacuum_heap() only
> > >visits those pages known to contain dead tuples. However, the
> > >experience of seeing problems with the random sampling of ANALYZE
> > >makes me think that that might not be very helpful. There is no good
> > >reason to think that there won't be a uniform distribution of dead
> > >tuples across the heap, and so only visiting pages known to contain
> > >dead tuples might be surprisingly little help even when there are
> > >relatively few VACUUM-able tuples in the table.
> >
> > Even worse is if you can't fit all the dead TIDs in memory and have to do
> > multiple passes for no reason...
>
> Since BRIN indexes cannot be primary keys nor unique keys, it's hard to
> be convinced that the use case of a table with only BRIN indexes is
> terribly interesting.
>

Given BRIN's characteristics, such a table design is compelling when the
table is very large, yet possible only for certain use cases.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] Parallel Aggregate

2015-12-21 Thread Haribabu Kommi

On Mon, Dec 21, 2015 at 6:48 PM, David Rowley
 wrote:
> On 21 December 2015 at 17:23, Haribabu Kommi 
> wrote:
>>
>>
>> Attached latest performance report. Parallel aggregate is having some
>> overhead
>> in case of low selectivity.This can be avoided with the help of cost
>> comparison
>> between normal and parallel aggregates.
>>
>
> Hi, Thanks for posting an updated patch.
>
> Would you be able to supply a bit more detail on your benchmark? I'm
> surprised by the slowdown reported with the high selectivity version. It
> gives me the impression that the benchmark might be producing lots of groups
> which need to be pushed through the tuple queue to the main process. I think
> it would be more interesting to see benchmarks with varying number of
> groups, rather than scan selectivity. Selectivity was important for parallel
> seqscan, but less so for this, as it's aggregated groups we're sending to
> main process, not individual tuples.

Yes the query is producing more groups according to the selectivity.
For example - scan selectivity - 40, the number of groups - 400

Following is the query:

SELECT tenpoCord,
SUM(yokinZandaka)   AS yokinZandakaxGOUKEI,
SUM(kashikoshiZandaka)   AS kashikoshiZandakaxGOUKEI,
SUM(kouzasuu) AS kouzasuuxGOUKEI,
SUM(sougouKouzasuu) AS sougouKouzasuuxGOUKEI
   FROM public.test01
  WHERE tenpoCord   <= '001'  AND
kamokuCord   = '01'   AND
kouzaKatujyoutaiCord = '0'
GROUP BY kinkoCord,tenpoCord;


Regards,
Hari Babu
Fujitsu Australia


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Optimizing away second VACUUM heap scan when only BRIN indexes on table

On 21 December 2015 at 09:35, Peter Geoghegan  wrote:

> On Mon, Dec 21, 2015 at 1:24 AM, Simon Riggs 
> wrote:
> > Given BRIN's characteristics, such a table design is compelling when the
> > table is very large, yet possible only for certain use cases.
>
> You can say the same thing about BRIN itself, of course.
>


AFAICS, this idea is workable and so I'd say "patches welcome" on it.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] Experimental evaluation of PostgreSQL's query optimizer

2015-12-21 Thread Albe Laurenz

Viktor Leis wrote:
> We have recently performed an experimental evaluation of PostgreSQL's
> query optimizer. For example, we measured the contributions of
> cardinality estimation and the cost model on the overall query
> performance. You can download the resulting paper here:
> http://www.vldb.org/pvldb/vol9/p204-leis.pdf
> 
> Some findings:
> 1. Perhaps unsurprisingly, we found that cardinality
> estimation is the biggest problem in query optimization.
> 2. The quality of Postgres' cardinality estimates is not generally worse
> than that of the major commerical systems.
> 3. It seems to me that one obvious way to avoid many bad situations
> would be to disable nested loop joins when the inner relation is NOT
> an index scan.
> 
> I hope this will be of interest to some of you.

I have read the paper with great interest, and I have some comments.

- The paper mentions that the "Join Order Benchmark" has high cross-table
  correlation, and this correlation is responsible for bad cardinality
  estimates that cause bad plans with all RDBMS.
  Wouldn't it be interesting to do the same experiment with a different
  real-word data sets to see if that is indeed typical and not an
  idiosyncrasy of that specific benchmark?

- The paper suggests that sampling the base tables is preferable to
  using statistics because it gives better estimates, but I think that that
  is only a win with long running, complicated, data warehouse style queries.
  For the typical OLTP query it would incur intolerable planning times.
  Any ideas on that?

- From my experience in tuning SQL queries I can confirm your one finding,
  namely that bad cardinality estimates are the prime source for bad
  plan choices.
  Perhaps it would be valuable to start thinking about statistics for
  inter-table correlation. What about something as "simple" as a factor
  per (joinable) attribute pair that approximates the total row count
  of a join on these attributes, divided by the planner's estimate?

- I also can corroborate your finding that nested loop joins are often
  harmful, particularly when the inner loop is a sequential scan.
  One of the first things I do when investigating bad performance of a query
  whose plan has a nestend loop join is to set enable_nestloop to "off"
  and see if that makes a difference, and it often does.
  Maybe it would be a win to bias the planner against nested loop joins.
  This is dreaming, but it might be nice to have some number as to how
  reliable a certain estimate is, which is high if the estimate is, say,
  derived from a single filter on a base table and sinks as more conditions
  are involved or numbers pulled out of thin air.

Yours,
Laurenz Albe

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Commit fest status for 2015-11

On Tue, Dec 22, 2015 at 3:35 PM, Tom Lane  wrote:
> Michael Paquier  writes:
>> To committers: there are 4 patches waiting for input.
>
> Those are:
>
> : SQL function to report log message
>
> AFAICT, there is no committer who likes this idea enough to commit it.
> It's questionable whether we need such a feature at all, and it's even
> more questionable whether this particular instantiation of the feature
> is well-designed.  I'm not going to opine as to whether it should get
> marked "returned with feedback" or "rejected", but I doubt it's going
> to get committed as-is.

Thanks, that's hard to follow all those threads. Let's cut the apple
in half and mark it as returned with feedback then.

> : Reusing abbreviated keys during second pass of ordered [set] aggregates
>
> Robert has been committer for all the predecessor patches, so I assume
> this one will be his.
>
> : Speedup timestamp/time/date output functions
>
> I'd look into this if Andres hadn't already claimed it ...
>
> : Improving replay of XLOG_BTREE_VACUUM records
>
> Heikki and Andres have already reviewed this, so I assume one or the
> other of them will be committer.

OK, if those don't get addressed soon, they will be moved to the next
CF with the same status.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pgbench - allow backslash-continuations in custom scripts

On Wed, Oct 14, 2015 at 5:49 PM, Fabien COELHO  wrote:
>
> Hello,
>
> Here is a review, sorry for the delay...
>
>> This is done as the additional fourth patch, not merged into
>> previous ones, to show what's changed in the manner of command
>> storing.
>> [...]
>>>
>>>  - SQL multi-statement.
>>>
>>>SELECT 1; SELECT 2;
>
>
> I think this is really "SELECT 1\; SELECT 2;"
>
> I join a test script I used.
>
>
> The purpose of this 4 parts patch is to reuse psql scanner from pgbench
> so that commands are cleanly separated by ";", including managing dollar
> quoting, having \ continuations in backslash-commands, having
> multi-statement commands...
>
> This review is about 4 part v4 of the patch. The patches apply and compile
> cleanly.
>
> I think that the features are worthwhile. I would have prefer more limited
> changes to get them, but my earlier attempt was rejected, and the scanner
> sharing with psql results in reasonably limited changes, so I would go for
> it.

Regarding that:
+#if !defined OUTSIDE_PSQL
+#include "variables.h"
+#else
+typedef int * VariableSpace;
+#endif

And that:
+/* Provide dummy macros when no use of psql variables */
+#if defined OUTSIDE_PSQL
+#define GetVariable(space,name) NULL
+#define standard_strings() true
+#define psql_error(fmt,...) do { \
+fprintf(stderr, "psql_error is called. abort.\n");\
+exit(1);\
+} while(0)
+#endif
That's ugly... Wouldn't it be better with something say in src/common
which is frontend-only? We could start with a set of routines allowing
commands to be parsed. That gives us more room for future improvement.

+# fix up pg_xlogdump once it's been set up
+# files symlinked on Unix are copied on windows
+my $pgbench = AddSimpleFrontend('pgbench');
+$pgbench->AddDefine('FRONTEND');
+$pgbench->AddDefine('OUTSIDE_PSQL');
+$pgbench->AddFile('src/bin/psql/psqlscan.l');
+$pgbench->AddIncludeDir('src/bin/psql');
This is a simple copy-paste, with an incorrect comment at least
(haven't tested compilation with MSVC, I suspect that this is going to
fail still the flags are correctly set).

This patch is waiting for input from its author for quite some time
now, and the structure of this patch needs a rework. Are folks on this
thread fine if it is returned with feedback?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Additional role attributes && superuser review

2015-12-21 Thread Amit Langote

On 2015/12/22 14:05, Michael Paquier wrote:
> On Tue, Dec 22, 2015 at 1:41 AM, Stephen Frost  wrote:
>> Updated and rebased patch attached which takes the 'pg_switch_xlog'
>> default role back out, leaving us with:
>>
>> pg_monitor - View privileged info
>> pg_backup - start/stop backups, switch xlog, create restore points
>> pg_replay - Pause/resume xlog replay on replicas
>> pg_replication - Create/destroy/etc replication slots
>> pg_rotate_logfile - Request logfile rotation
>> pg_signal_backend - Signal other backends (cancel query/terminate)
>> pg_file_settings - View configuration settings in all config files
> 
> Thanks. This looks fine to me. I just have one single comment:
> 
> +   Request logfile rotation.
> s/logfile/transaction log file/

Looks correct as is. Or maybe "server's log file" as in:

9.26.2. Server Signaling Functions

pg_rotate_logfile(): Rotate server's log file

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

On Thu, Dec 10, 2015 at 1:32 AM, Robert Haas  wrote:
> On Wed, Dec 9, 2015 at 3:22 AM, Etsuro Fujita
>  wrote:
>> Sorry, my explanation might be not enough, but I'm not saying to hide the
>> subplan.  I think it would be better to show the subplan somewhere in the
>> EXPLAIN outout, but I'm not sure that it's a good idea to show that in the
>> current form.  We have two plan trees; one for normal query execution and
>> another for EvalPlanQual testing.  I think it'd be better to show the
>> EXPLAIN output the way that allows users to easily identify each of the plan
>> trees.
>
> It's hard to do that because we don't identify that internally
> anywhere.  Like I said before, the possibility of a ForeignScan having
> an outer subplan is formally independent of the new EPQ stuff, and I'd
> prefer to maintain that separation and just address this with
> documentation.

Fujita-san, others, could this be addressed with documentation?

> Getting this bug fixed has been one of the more exhausting experiences
> of my involvement with PostgreSQL, and to be honest, I think I'd like
> to stop spending too much time on this now and work on getting the
> feature that this is intended to support working.  Right now, the only
> people who can have an opinion on this topic are those who are
> following this thread in detail, and there really aren't that many of
> those.

I am numbering that to mainly 3 people, you included :)

> If we get the feature - join pushdown for postgres_fdw -
> working, then we might get some feedback from users about what they
> like about it or don't, and certainly if this is a frequent complaint
> then that bolsters the case for doing something about it, and possibly
> also helps us figure out what that thing should be.  On the other
> hand, if we don't get the feature because we're busy debating
> interface details related to this patch, then none of these details
> matter anyway because nobody except developer is actually running the
> code in question.

As this debate continues, I think that moving this patch to the next
CF would make the most sense then.. So done this way.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fix parallel workers connection bug in pg_dump (Bug #13727)

On Fri, Nov 6, 2015 at 12:23 AM, Marko Tiikkaja  wrote:
> On 11/5/15 4:11 PM, Zeus Kronion wrote:
>>
>> On Nov 1, 2015 5:04 PM, "Marko Tiikkaja"  wrote:
>>>
>>> However, I don't quite like the way the password cache is kept up to date
>>
>> in the old *or* the new code.  It seems to me that it should instead look
>> like:
>>>
>>>
>>> if (PQconnectionUsedPassword(AH->connection))
>>> AH->savedPassword = PQpass(AH->connection);
>>>
>>> What do you think?
>>
>>
>> I don't understand why this logic is preferable. Is your concern that
>> AH->savedPassword may contain a password even when none is needed?
>
>
> The opposite, sort of.  If the first connection uses a password, the second
> one doesn't, and the third one does again, you need to ask for a password
> again because you emptied the cache on the second attempt since it didn't
> use a password.  Granted, this situation is quite unlikely to occur in
> practice, but I find the "correct" code *also* more readable. To me it reads
> like "if the what we're caching was applied during the connection attempt,
> update the cache; otherwise keep the previous value in case it's useful in
> the future".

OK, I think that we had better address this bug for this CF. I am
attaching an updated patch following Marko's suggestion, and marking
this patch as ready for committer. I have noticed that the fix was
actually not complete: ConnectDatabase needs a similar fix, this is a
code path when a connection string like "dbname=db user=foo" is used.
And this was failing as well when the password is passed directly in
the connection string for multiple jobs.
-- 
Michael


20151222_pgdump_jobs_pass.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

On Tue, Dec 22, 2015 at 3:52 PM, Amit Langote
 wrote:
> On 2015/12/22 15:24, Michael Paquier wrote:
>> As this debate continues, I think that moving this patch to the next
>> CF would make the most sense then.. So done this way.
>
> Perhaps, this ended (?) with the following commit:
>
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=385f337c9f39b21dca96ca4770552a10a6d5af24

Ah, thanks! What has been committed is actually more or less
epq-recheck-v6-efujita.patch posted upthread, I'll mark the patch as
committed then.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pglogical_output - a general purpose logical decoding output plugin

On Mon, Dec 7, 2015 at 12:10 PM, Craig Ringer  wrote:
> Removed, change pushed.
>
> Also pushed a change to expose the decoded row data to row filter hooks.
>
> I won't cut a v4 for this, as I'm working on merging the SGML-ified docs and
> will do a v4 with that and the above readme change once that's done.

Patch is moved to next CF, you seem to be still working on it..
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Getting sorted data from foreign server for merge join

2015-12-21 Thread Ashutosh Bapat

>
> I went over this patch in some detail today and did a lot of cosmetic
> cleanup.  The results are attached.  I'm fairly happy with this
> version, but let me know what you think.  Of course, feedback from
> others is more than welcome also.
>
>
Attached patch with some cosmetic changes (listed here for your quick
reference)
1. , was replaced with ; in comment "inner join, expressions in the " at
one place, which is correct, but missed other place.
2. The comment "First, consider whether any each active EC is potentially"
should use either "any" or "each". I have reworded it as "First, consider
whether any of the active ECs is potentially ...". Or we can use "First,
find all of the active ECs which are potentially ".
3. "having the remote side due the sort generally won't be any worse ..." -
instead of "due" we should use "do"?
4. Added static prototype of function get_useful_ecs_for_relation().
5. The comment "Extract unique EC for query, if any, so we don't consider
it again." is too crisp. Phrase "Unique EC for query" is confusing; EC can
not be associated with a query per say and EC's are always unique because
of canonicalisation. May be we should reword it as "Extract single EC for
ordering of query, if any, so we don't consider it again." Is that cryptic
as well?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


order-for-merge-pushdown-rmh_v2.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Combining Aggregates

On Mon, Dec 21, 2015 at 4:02 AM, David Rowley
 wrote:
>  Ok, so it seems that my mindset was not in parallel process space when I
> was thinking about this.  I think having the pointer in the Tuple is
> probably going to be fine for this multiple stage aggregation when that's
> occurring in a single backend process, but obviously if the memory that the
> pointer points to belongs to a worker process in a parallel aggregate
> situation, then bad things will happen.
>
> Now, there has been talk of this previously, on various threads, but I don't
> believe any final decisions were made on how exactly it should be done. At
> the moment I plan to make changes as follows:
>
>  Add 3 new columns to pg_aggregate, aggserialfn, aggdeserialfn and
> aggserialtype These will only be required when aggtranstype is INTERNAL.
> Perhaps we should disallow CREATE AGGREAGET from accepting them for any
> other type... The return type of aggserialfn should be aggserialtype, and it
> should take a single parameter of aggtranstype. aggdeserialfn will be the
> reverse of that.
> Add a new bool field to nodeAgg's state named serialStates. If this is field
> is set to true then when we're in finalizeAgg = false mode, we'll call the
> serialfn on the agg state instead of the finalfn. This will allow the
> serialized state to be stored in the tuple and sent off to the main backend.
> The combine agg node should also be set to serialStates = true, so that it
> knows to deserialize instead of just assuming that the agg state is of type
> aggtranstype.
>
> I believe this should allow us to not cause any performance regressions by
> moving away from INTERNAL agg states. It should also be very efficient for
> internal states such as Int8TransTypeData, as this struct merely has 2 int64
> fields which should be very simple to stuff into a bytea varlena type. We
> don't need to mess around converting the ->count and ->sum into a strings or
> anything.
>
> Then in order for the planner to allow parallel aggregation all aggregates
> must:
>
> Not have a DISTINCT or ORDER BY clause
> Have a combinefn
> If aggtranstype = INTERNAL, must have a aggserialfn and aggdeserialfn.
>
> We can relax the requirement on 3 if we're using 2-stage aggregation, but
> not parallel aggregation.
>
> Any objections, or better ideas?

Can we use Tom's expanded-object stuff instead of introducing
aggserialfn and aggdeserialfn?  In other words, if you have a
aggtranstype = INTERNAL, then what we do is:

1. Create a new data type that represents the transition state.
2. Use expanded-object notation for that data type when we're just
within a single process, and flatten it when we need to send it
between processes.

One thing to keep in mind is that we also want to be able to support a
plan that involves having one or more remote servers do partial
aggregation, send us the partial values, combine them across servers
and possibly also with locally computed-values, and the finalize the
aggregation.  So it would be nice if there were a way to invoke the
aggregate function from SQL and get a transition value back rather
than a final value.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Experimental evaluation of PostgreSQL's query optimizer

2015-12-21 Thread Viktor Leis

Am 21.12.2015 um 09:22 schrieb Albe Laurenz:
> Viktor Leis wrote:
>> We have recently performed an experimental evaluation of PostgreSQL's
>> query optimizer. For example, we measured the contributions of
>> cardinality estimation and the cost model on the overall query
>> performance. You can download the resulting paper here:
>> http://www.vldb.org/pvldb/vol9/p204-leis.pdf
>>
>> Some findings:
>> 1. Perhaps unsurprisingly, we found that cardinality
>> estimation is the biggest problem in query optimization.
>> 2. The quality of Postgres' cardinality estimates is not generally worse
>> than that of the major commerical systems.
>> 3. It seems to me that one obvious way to avoid many bad situations
>> would be to disable nested loop joins when the inner relation is NOT
>> an index scan.
>>
>> I hope this will be of interest to some of you.
> 
> I have read the paper with great interest, and I have some comments.
> 
> - The paper mentions that the "Join Order Benchmark" has high cross-table
>   correlation, and this correlation is responsible for bad cardinality
>   estimates that cause bad plans with all RDBMS.
>   Wouldn't it be interesting to do the same experiment with a different
>   real-word data sets to see if that is indeed typical and not an
>   idiosyncrasy of that specific benchmark?
The IMDB data set certainly has lots of correlations in comparison
with synthetic benchmarks like TPC-H.

I do not claim that IMDB is representative (whatever that means). But
it is certainly a data set that people are putting into a database and
run join queries on it. It would indeed be very interesting to do
similar experiments with other real-world data sets. We had to come up
with our own queries because you seldom find combinations of public
relational data sets with non-trivial OLAP queries.

> - The paper suggests that sampling the base tables is preferable to
>   using statistics because it gives better estimates, but I think that that
>   is only a win with long running, complicated, data warehouse style queries.
>   For the typical OLTP query it would incur intolerable planning times.
>   Any ideas on that?
I agree that sampling is not suitable for most OLTP queries. (One hack
would be to run the optimizer without sampling and check if the
estimated cost is high and reoptimize with sampling.)

In data warehouse settings, I've seen suggestions to increase
default_statistics_target to a large value, which in my experience
results in extremely large planning times. Sampling could be a more
precise alternative.

> - From my experience in tuning SQL queries I can confirm your one finding,
>   namely that bad cardinality estimates are the prime source for bad
>   plan choices.
>   Perhaps it would be valuable to start thinking about statistics for
>   inter-table correlation. What about something as "simple" as a factor
>   per (joinable) attribute pair that approximates the total row count
>   of a join on these attributes, divided by the planner's estimate?
I think your suggestion amounts to caching the cardinalities of all
two-way joins. One major issue is that for a query like

select * from r1, r2 where r1.x = r2.y and r1.a = ? and r2.b;

it depends on the specific values of r1.a and r2.b whether there is
any (anti-)correlation. And obviously one cannot store correction
factors for each value of a and b.


> - I also can corroborate your finding that nested loop joins are often
>   harmful, particularly when the inner loop is a sequential scan.
>   One of the first things I do when investigating bad performance of a query
>   whose plan has a nestend loop join is to set enable_nestloop to "off"
>   and see if that makes a difference, and it often does.
>   Maybe it would be a win to bias the planner against nested loop joins.
>   This is dreaming, but it might be nice to have some number as to how
>   reliable a certain estimate is, which is high if the estimate is, say,
>   derived from a single filter on a base table and sinks as more conditions
>   are involved or numbers pulled out of thin air.

I think it would be a good start to distinguish between nested loop
joins with and without a index. In my opinion, the latter should
simply NEVER be chosen.

--
Viktor Leis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-21 Thread Fabien COELHO



Hello Heikki,

The reason I didn't commit this back then was lack of performance testing. 
I'm fairly confident that this would be a significant improvement for some 
workloads, and shouldn't hurt much even in the worst case. But I did only a 
little testing on my laptop. I think Simon was in favor of just committing it 
immediately, and



Fabien wanted to see more performance testing before committing.


I confirm. To summarize my opinion:

I think that the 1.5 value somewhere in the patch is much too high for the 
purpose because it shifts the checkpoint load quite a lot (50% more load 
at the end of the checkpoint) just for the purpose of avoiding a spike 
which lasts a few seconds (I think) at the beginning. A much smaller value 
should be used (1.0 <= factor < 1.1), as it would be much less disruptive 
and would probably avoid the issue just the same. I recommend not to 
commit with a 1.5 factor in any case.


Another issue I raised is that the load change occurs both with xlog and 
time triggered checkpoints, and I'm sure it should be applied in both 
case.


Another issue is that the patch makes sense when the WAL & relations are 
on the same disk, but might degrade performance otherwise.


Another point that it interacts potentially with a patch I submitted which 
has a large impact on performance (order of magnitude better in some cases 
by sorting & flushing blocks on checkpoints), so it would make sense to 
check that.


So more testing is definitely needed. A guc would be nice for this 
purpose, especially to look at different factors.


I was hoping that Digoal would re-ran his original test case, and report 
back on whether it helps. Fabien had a performance test setup, for 
testing another patch, but he didn't want to run it to test this patch.


Indeed, I have, but I'm quite behind at the moment, I cannot promise 
anything. Moreover, I'm not sure I see this "spike" issue in my setting, 
AFAICR.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] A typo in syncrep.c

On Mon, Dec 21, 2015 at 12:23 AM, Kyotaro HORIGUCHI
 wrote:
> At Fri, 18 Dec 2015 12:44:34 -0500, Robert Haas  wrote 
> in 
>> On Wed, Dec 16, 2015 at 3:33 AM, Kyotaro HORIGUCHI
>>  wrote:
>> > Hello, I think I found a typo in a comment of syncrep.c.
>> >
>> >>   * acknowledge the commit nor raise ERROR or FATAL.  The latter would
>> >> - * lead the client to believe that that the transaction aborted, which
>> >>   * is not true: it's already committed locally. The former is no good
>> >
>> > The 'that' looks duplicate.
>>
>> Agreed.
>>
>> > And it might be better to put a
>> > be-verb before the 'aborted'.
>> >
>> >> + * lead the client to believe that the transaction is aborted, which
>>
>> No, that's correct the way it is.  What you're proposing wouldn't
>> exactly be wrong, but it's a little less clear and direct.
>
> Hmm. I thought they are equal in meaning and make clearer, but I
> understand they have such difference. Thank you for correcting
> it.

The difference is that if you say "the transaction aborted" you mean
that the transaction did something - specifically, it aborted.  If you
say that "the transaction is aborted" you are talking about the state
in which the transaction ended up, without really saying how it got
that way.  In this case we are talking about whether the client might
think that the transaction did something (aborted), not what the
client might think about the state we ended up in (aborted), so the
current wording seems better to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Patch to improve a few appendStringInfo* calls

On Thu, Jul 2, 2015 at 6:08 AM, David Rowley
 wrote:
>> I left out the changes like
>>
>>> -   appendStringInfoString(, buf.data);
>>> +   appendBinaryStringInfo(, buf.data, buf.len);
>>
>>
>> because they're not an improvement in readablity, IMHO, and they were not
>> in performance-critical paths.
>
> Perhaps we can come up with appendStringInfoStringInfo at some later date.

concatenateStringInfo?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW

2015-12-21 Thread Etsuro Fujita


On 2015/11/26 18:00, Etsuro Fujita wrote:

On 2015/11/25 20:36, Thom Brown wrote:

On 13 May 2015 at 04:10, Etsuro Fujita 
wrote:

On 2015/05/13 0:55, Stephen Frost wrote:

While the EXPLAIN output changed, the structure hasn't really changed
from what was discussed previously and there's not been any real
involvment from the core code in what's happening here.

Clearly, the documentation around how to use the FDW API hasn't changed
at all and there's been no additions to it for handling bulk work.
Everything here continues to be done inside of postgres_fdw, which
essentially ignores the prescribed "Update/Delete one tuple" interface
for ExecForeignUpdate/ExecForeignDelete.

I've spent the better part of the past two days trying to reason my way
around that while reviewing this patch and I haven't come out the other
side any happier with this approach than I was back in
20140911153049.gc16...@tamriel.snowman.net.

There are other things that don't look right to me, such as what's
going
on at the bottom of push_update_down(), but I don't think there's much
point going into it until we figure out what the core FDW API here
should look like.  It might not be all that far from what we have now,
but I don't think we can just ignore the existing, documented, API.



OK, I'll try to introduce the core FDW API for this (and make changes
to the
core code) to address your previous comments.



I'm a bit behind in reading up on this, so maybe it's been covered
since, but is there a discussion of this API on another thread, or a
newer patch available?


To address Stephen's comments, I'd like to propose the following FDW APIs:

bool
PlanDMLPushdown (PlannerInfo *root,
 ModifyTable *plan,
 Index resultRelation,
 int subplan_index);

This is called in make_modifytable, before calling PlanForeignModify. 
This checks to see whether a given UPDATE/DELETE .. RETURNING .. is 
pushdown-safe and if so, performs planning actions needed for the DML 
pushdown.  The idea is to modify a ForeignScan subplan accordingly as in 
the previous patch.  If the DML is pushdown-safe, this returns true, and 
we don't call PlanForeignModify anymore.  (Else returns false and call 
PlanForeignModify as before.)  When the DML is pushdown-safe, we hook 
the following FDW APIs located in nodeForeignscan.c, instead of 
BeginForeignModify, ExecForeignUpdate/ExecForeignDelete and 
EndForeignModify:


void
BeginDMLPushdown (ForeignScanState *node,
  int eflags);

This initializes the DML pushdown, like BeginForeignScan.

TupleTableSlot *
IterateDMLPushdown (ForeignScanState *node);

This fetches one returning result from the foreign server, like 
IterateForeignScan, if having a RETURNING clause.  If not, just return 
an empty slot.  (I'm thinking that it's required that the FDW replaces 
the targetlist of the ForeignScan subplan to the RETURNING clause during 
PlanDMLPushdown, if having the clause, so that we do nothing at 
ModifyTable.)


void
EndDMLPushdown (ForeignScanState *node);

This finishes the DML pushdown, like EndForeignScan.

I'm attaching a WIP patch, which only includes changes to the core.  I'm 
now working on the postgres_fdw patch to demonstrate that these APIs 
work well, but I'd be happy if I could get any feedback earlier.


Best regards,
Etsuro Fujita
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
***
*** 887,893  ExplainNode(PlanState *planstate, List *ancestors,
  			pname = sname = "WorkTable Scan";
  			break;
  		case T_ForeignScan:
! 			pname = sname = "Foreign Scan";
  			break;
  		case T_CustomScan:
  			sname = "Custom Scan";
--- 887,911 
  			pname = sname = "WorkTable Scan";
  			break;
  		case T_ForeignScan:
! 			sname = "Foreign Scan";
! 			switch (((ForeignScan *) plan)->operation)
! 			{
! case CMD_SELECT:
! 	pname = "Foreign Scan";
! 	operation = "Select";
! 	break;
! case CMD_UPDATE:
! 	pname = "Foreign Update";
! 	operation = "Update";
! 	break;
! case CMD_DELETE:
! 	pname = "Foreign Delete";
! 	operation = "Delete";
! 	break;
! default:
! 	pname = "???";
! 	break;
! 			}
  			break;
  		case T_CustomScan:
  			sname = "Custom Scan";
***
*** 1658,1663  show_plan_tlist(PlanState *planstate, List *ancestors, ExplainState *es)
--- 1676,1686 
  		return;
  	if (IsA(plan, RecursiveUnion))
  		return;
+ 	/* Likewise for ForeignScan in case of pushed-down UPDATE/DELETE */
+ 	if (IsA(plan, ForeignScan) &&
+ 		(((ForeignScan *) plan)->operation == CMD_UPDATE ||
+ 		 ((ForeignScan *) plan)->operation == CMD_DELETE))
+ 		return;
  
  	/* Set up deparsing context */
  	context = set_deparse_context_planstate(es->deparse_cxt,
*** a/src/backend/executor/execMain.c
--- b/src/backend/executor/execMain.c
***
*** 1011,1020  InitPlan(QueryDesc *queryDesc, int eflags)
   * CheckValidRowMarkRel.

Re: [HACKERS] ToDo list update for BRIN indexes

On 21 December 2015 at 12:54, Robert Haas  wrote:

> On Thu, Jul 9, 2015 at 4:49 PM, Jeff Janes  wrote:
> > Is there anything in the below section which has been been implemented or
> > rendered irrelevant by BRIN indexes?
> >
> > https://wiki.postgresql.org/wiki/Todo#Indexes
> >
> > "Consider smaller indexes that record a range of values per heap page,
> > rather than having one index entry for every heap row"
>
> [ catching up on old threads ]
>
> BRIN is exactly this, isn't it?  Well, moreso: it's a range of values
> for a range of heap pages.
>

It's close, but not the same.

BRIN is a summary index and so could never support uniqueness.

It's also possible to have an index type that has a precise TID entry, yet
a more compact format, which would then allow unique values. This would be
similar to the way SQLServer compresses primary key indexes.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-21 Thread Tomas Vondra


Hi,

On 12/21/2015 12:03 PM, Heikki Linnakangas wrote:

On 17/12/15 19:07, Robert Haas wrote:

On Mon, Dec 14, 2015 at 6:08 PM, Tomas Vondra
 wrote:

So we know that we should expect about

   (prev_wal_bytes - wal_bytes) + (prev_wal_fpw_bytes - wal_fpw_bytes)

   (   regular WAL) + (  FPW WAL )

to be produced until the end of the current checkpoint. I don't
have a clear idea how to transform this into the 'progress' yet,
but I'm pretty sure tracking the two types of WAL is a key to a
better solution. The x^1.5 is probably a step in the right
direction, but I don't feel particularly confident about the 1.5
(which is rather arbitrary).


If it works well empirically, does it really matter that it's
arbitrary? I mean, the entire planner is full of fairly arbitrary
assumptions about which things to consider in the cost model and
which to ignore. The proof that we have made good decisions there
is in the query plans it generates. (The proof that we have made
bad decisions in some cases in the query plans, too.)


Agreed.


What if it only seems to work well because it was tested on cases it was 
designed for? What about the workloads that behave differently?


Whenever we do changes to costing and query planning, we carefully 
consider counter-examples and cases where it might fail. I see nothing 
like that in this thread - all I see is a bunch of pgbench tests, which 
seems rather insufficient to me.





I think a bigger problem for this patch is that Heikki seems to have
almost completely disappeared.


Yeah, there's that problem too :-).

The reason I didn't commit this back then was lack of performance
testing. I'm fairly confident that this would be a significant
improvement for some workloads, and shouldn't hurt much even in the
worst case. But I did only a little testing on my laptop. I think
Simon was in favor of just committing it immediately, and Fabien
wanted to see more performance testing before committing.

I was hoping that Digoal would re-ran his original test case, and
report back on whether it helps. Fabien had a performance test setup,
for testing another patch, but he didn't want to run it to test this
patch. Amit did some testing, but didn't see a difference. We can
take that as a positive sign - no regression - or as a negative sign,
but I think that basically means that his test was just not sensitive
to the FPW  issue.

So Tomas, if you're willing to do some testing on this, that would
be brilliant!


I'm ready to spend some time on this, assuming we can agree on what 
tests to run. Can we come up with realistic workloads where we expect 
the patch might actually work poorly?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-21 Thread Heikki Linnakangas


On 21/12/15 13:53, Tomas Vondra wrote:

On 12/21/2015 12:03 PM, Heikki Linnakangas wrote:

On 17/12/15 19:07, Robert Haas wrote:

If it works well empirically, does it really matter that it's
arbitrary? I mean, the entire planner is full of fairly arbitrary
assumptions about which things to consider in the cost model and
which to ignore. The proof that we have made good decisions there
is in the query plans it generates. (The proof that we have made
bad decisions in some cases in the query plans, too.)


Agreed.


What if it only seems to work well because it was tested on cases it was
designed for? What about the workloads that behave differently?

Whenever we do changes to costing and query planning, we carefully
consider counter-examples and cases where it might fail. I see nothing
like that in this thread - all I see is a bunch of pgbench tests, which
seems rather insufficient to me.


Agreed on that too.


I'm ready to spend some time on this, assuming we can agree on what
tests to run. Can we come up with realistic workloads where we expect
the patch might actually work poorly?


I think the worst case scenario would be the case where there is no 
FPW-related WAL burst at all, and checkpoints are always triggered by 
max_wal_size rather than checkpoint_timeout. In that scenario, the 
compensation formula will cause the checkpoint to be too lazy in the 
beginning, and it will have to catch up more aggressively towards the 
end of the checkpoint cycle.


One such scenario might be to do only COPYs into a table with no 
indexes. Or hack pgbench to do concentrate all the updates on only a few 
very rows. There will be a FPW on those few pages initially, but the 
spike will be much shorter. Or turn full_page_writes=off, and hack the 
patch to do compensation even when fullpage_writes=off, and then just 
run pgbench.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ToDo list update for BRIN indexes

On Thu, Jul 9, 2015 at 4:49 PM, Jeff Janes  wrote:
> Is there anything in the below section which has been been implemented or
> rendered irrelevant by BRIN indexes?
>
> https://wiki.postgresql.org/wiki/Todo#Indexes
>
> "Consider smaller indexes that record a range of values per heap page,
> rather than having one index entry for every heap row"

[ catching up on old threads ]

BRIN is exactly this, isn't it?  Well, moreso: it's a range of values
for a range of heap pages.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Odd behaviour of SELECT ... ORDER BY ... FOR UPDATE

On Thu, Jul 2, 2015 at 3:59 AM, Etsuro Fujita
 wrote:
> Hi Marko,
>
> On 2015/07/02 16:27, Marko Tiikkaja wrote:
>> On 7/2/15 9:15 AM, Etsuro Fujita wrote:
>>> While working on the foreign-join-pushdown issue, I noticed that in READ
>>> COMMITTED isolation level it's possible that the result of SELECT ...
>>> ORDER BY ... FOR UPDATE is not sorted correctly due to concurrent
>>> updates that replaced the sort key columns with new values as shown in
>>> the below example.  That seems odd to me.  So, I'd like to propose
>>> raising an error rather than returning a possibly-incorrect result for
>>> cases where the sorted tuples to be locked were modified by concurrent
>>> updates.
>
>> I don't like the idea of READ COMMITTED suddenly throwing errors due to
>> concurrency problems.  Using FOR UPDATE correctly is really tricky, and
>> this is just one example.  And a documented one, at that, too.
>
> Ah, you are right.  I'll withdraw this.  Sorry for the noise.

Does 385f337c9f39b21dca96ca4770552a10a6d5af24 make any difference to
the issue described here?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Additional role attributes && superuser review

2015-12-21 Thread Stephen Frost

Michael, all,

* Michael Paquier (michael.paqu...@gmail.com) wrote:
> OK, let's do so then by having this one fall under pg_backup. Let's
> not be my grunting concerns be an obstacle for this patch, and we
> could still change it afterwards in this release beta cycle anyway
> based on user feedback.

Updated and rebased patch attached which takes the 'pg_switch_xlog'
default role back out, leaving us with:

pg_monitor - View privileged info
pg_backup - start/stop backups, switch xlog, create restore points
pg_replay - Pause/resume xlog replay on replicas
pg_replication - Create/destroy/etc replication slots
pg_rotate_logfile - Request logfile rotation
pg_signal_backend - Signal other backends (cancel query/terminate)
pg_file_settings - View configuration settings in all config files

Michael, another review would be great, if you don't mind.  I'm going to
be going through it also more closely since it sounds like we've got
consensus on at least this initial set of default roles and rights.  If
all looks good then I'll commit it.

Thanks!

Stephen
From 59bda4266a96976547e0aa44874ad716bf3dbdc9 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Wed, 18 Nov 2015 11:50:57 -0500
Subject: [PATCH 1/3] Add note regarding permissions in pg_catalog

Add a note to the system catalog section pointing out that while
modifying the permissions on catalog tables is possible, it's
unlikely to have the desired effect.
---
 doc/src/sgml/catalogs.sgml | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 97ef618..3b7768c 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -21,6 +21,17 @@
particularly esoteric operations, such as adding index access methods.
   
 
+  
+   
+Changing the permissions on objects in the system catalogs, while
+possible, is unlikely to have the desired effect as the internal
+lookup functions use a cache and do not check the permissions nor
+policies of tables in the system catalog.  Further, permission
+changes to objects in the system catalogs are not preserved by
+pg_dump or across upgrades.
+   
+  
+
  
   Overview
 
-- 
2.5.0


From a659b7ecd220c0671d3cc272b3774318bce97567 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Wed, 30 Sep 2015 07:04:55 -0400
Subject: [PATCH 2/3] Reserve the "pg_" namespace for roles

This will prevent users from creating roles which begin with "pg_" and
will check for those roles before allowing an upgrade using pg_upgrade.

This will allow for default roles to be provided at initdb time.
---
 doc/src/sgml/ref/psql-ref.sgml  |  8 +--
 src/backend/catalog/catalog.c   |  5 ++--
 src/backend/commands/user.c | 40 
 src/backend/utils/adt/acl.c | 41 +
 src/bin/pg_dump/pg_dumpall.c|  2 ++
 src/bin/pg_upgrade/check.c  | 40 ++--
 src/bin/psql/command.c  |  4 ++--
 src/bin/psql/describe.c |  5 +++-
 src/bin/psql/describe.h |  2 +-
 src/bin/psql/help.c |  4 ++--
 src/include/utils/acl.h |  1 +
 src/test/regress/expected/rolenames.out | 18 +++
 src/test/regress/sql/rolenames.sql  |  8 +++
 13 files changed, 166 insertions(+), 12 deletions(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 6d0cb3d..76bb642 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1365,13 +1365,15 @@ testdb=
 
 
   
-\dg[+] [ pattern ]
+\dg[S+] [ pattern ]
 
 
 Lists database roles.
 (Since the concepts of users and groups have been
 unified into roles, this command is now equivalent to
 \du.)
+By default, only user-created roles are shown; supply the
+S modifier to include system roles.
 If pattern is specified,
 only those roles whose names match the pattern are listed.
 If the form \dg+ is used, additional information
@@ -1525,13 +1527,15 @@ testdb=
   
 
   
-\du[+] [ pattern ]
+\du[S+] [ pattern ]
 
 
 Lists database roles.
 (Since the concepts of users and groups have been
 unified into roles, this command is now equivalent to
 \dg.)
+By default, only user-created roles are shown; supply the
+S modifier to include system roles.
 If pattern is specified,
 only those roles whose names match the pattern are listed.
 If the form \du+ is used, additional information
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index 81ccebf..184aa7d 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -184,8 +184,9 @@ IsToastNamespace(Oid namespaceId)

Re: [HACKERS] Some questions about the array.

2015-12-21 Thread Uriy Zhuravlev


In the continuation of thread:
http://www.postgresql.org/message-id/19144.1450457...@sss.pgh.pa.us


I'm dubious that the parsetree representation is well-chosen.
Probably a single is_slice flag would have been better.


What do you mean? This flag is for what? You are about the A_Indices 
node(lidx_default/uidx_default)?


All the other issues I fixed.

Thanks!

--
Uriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Parallel Aggregate

2015-12-21 Thread Paul Ramsey

On December 21, 2015 at 2:33:56 AM, Haribabu Kommi (kommi.harib...@gmail.com) 
wrote:
Yes the query is producing more groups according to the selectivity. 
For example - scan selectivity - 40, the number of groups - 400 

Following is the query: 

SELECT tenpoCord, 
SUM(yokinZandaka) AS yokinZandakaxGOUKEI, 
SUM(kashikoshiZandaka) AS kashikoshiZandakaxGOUKEI, 
SUM(kouzasuu) AS kouzasuuxGOUKEI, 
SUM(sougouKouzasuu) AS sougouKouzasuuxGOUKEI 
FROM public.test01 
WHERE tenpoCord <= '001' AND 
kamokuCord = '01' AND 
kouzaKatujyoutaiCord = '0' 
GROUP BY kinkoCord,tenpoCord; 
Shouldn’t parallel aggregate come into play regardless of scan selectivity? I 
know in PostGIS land there’s a lot of stuff like:

SELECT ST_Union(geom) FROM t GROUP BY areacode;

Basically, in the BI case, there’s often no filter at all. Hoping that’s 
considered a prime case for parallel agg :)

P

Re: [HACKERS] extend pgbench expressions with functions

2015-12-21 Thread Fabien COELHO



Hello Michael,


I'm not sure whether we are talking about the same thing:
 - there a "double" type managed within expressions, but not variables
 - there is a double() function, which takes an int and casts to double

I understood that you were suggesting to remove all "double" expressions,
but now it seems to be just about the double() function.


There is indeed a misunderstanding here: I meant from the start the
removal of only the "double" function.


Ok. I clearly misunderstood everything...

It would be nice to keep as user-visible only things that have some 
meaning.


Why not.

The purpose of some of these functions what to show how the function 
infrastructured extended, as was asked on the thread. The really useful 
ones in the bunch are the arithmetic operators and randoms generators.


[...] Well, if there were doubles as return results really allocated as 
doubles in variables having both would make sense. And honestly 
something like sqrt that returns an integer when allocated in a variable 
is really surprising.. And as you mentioned upthread there is no real 
meaning to have doubles variable types that can be allocated.


So you would just like to remove double double(int) and double 
sqrt(double) from the patch, and basically that is all? int int(double)??

debug??? (hmmm, useful debugging a non trivial expression).

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Freeze avoidance of very large table.

On Mon, Dec 21, 2015 at 3:27 AM, Kyotaro HORIGUCHI
 wrote:
> Hello,
>
> At Fri, 18 Dec 2015 12:09:43 -0500, Robert Haas  wrote 
> in 
>> On Thu, Dec 17, 2015 at 1:17 AM, Michael Paquier
>>  wrote:
>> > I am not really getting the meaning of this sentence. Shouldn't this
>> > be reworded something like:
>> > "Freezing occurs on the whole table once all pages of this relation 
>> > require it."
>>
>> That statement isn't remotely true, and I don't think this patch
>> changes that.  Freezing occurs on the whole table once relfrozenxid is
>> old enough that we think there might be at least one page in the table
>> that requires it.
>
> I doubt I can explain this accurately, but I took the original
> phrase as that if and only if all pages of the table are marked
> as "requires freezing" by accident, all pages are frozen. It's
> quite obvious but it is what I think "happen to require freezing"
> means. Does this make sense?
>
> The phrase might not be necessary if this is correct.

Maybe you are trying to say something like "only those pages which
require freezing are frozen?".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Avoiding pin scan during btree vacuum

During VACUUM of btrees, we need to pin all pages, even those where tuples
are not removed, which I am calling here the "pin scan". This is especially
a problem during redo, where removing one tuple from a 100GB btree can take
a minute or more. That causes replication lags. Bad Thing.

Previously, I have suggested ways of optimizing that and code comments
reflect that. I would like to look at removing the pin scan entirely, on a
standby only.

In
www.postgresql.org/message-id/flat/721615179.3351449.1423959585771.javamail.ya...@mail.yahoo.com,
the focus was on removing pins from btrees.
In the commit message for that thread/patch,
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2ed5b87f96d473962ec5230fd820abfeaccb2069
we see that there are some preconditions to altering the locking.

"This patch leaves behavior unchanged for some cases, which can be
addressed separately so that each case can be evaluated on its own
merits. These unchanged cases are when a scan uses a non-MVCC
snapshot, an index-only scan, and a scan of a btree index for which
modifications are not WAL-logged. If later patches allow all of
these cases to drop the buffer pin after reading a leaf page, then
the btree vacuum process can be simplified; it will no longer need
the "super-exclusive" lock to delete tuples from a page."

The case for master and standby are different. The standby is simpler, yet
more important to change since the pin scan happens in the foreground.

Looking just at the case for standbys, we see there are 3 cases
* non-WAL logged indexes - does not apply on a standby, so ignore
* non-MVCC snapshot - looks like only TOAST scans are a problem on standbys
* index only scans (IOS) - the analysis of which looks wrong to me...

IOSs always check the visibility before using the tuple. If a tuple is
about to be removed by a VACUUM then the tuple will already be dead and the
visibility map will always be set to not-all-visible. So any tuple being
removed by vacuum can never cause a problem to an IOS. Hence the locking
interactions are not required, at least on standbys, for normal tables.

So it looks like we can skip the "pin scan" during redo unless we are
vacuuming a toast index.

Patch attached.

Notice that the patch does not slacken the requirement to
super-exclusive-lock the block from which tuples are being removed. The
only thing it does is skip past the requirement to pin each of the
intervening blocks where nothing has happened.

--
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

avoid_standby_pin_scan.v1.patch
Description: Binary data

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Experimental evaluation of PostgreSQL's query optimizer

2015-12-21 Thread Viktor Leis

Am 21.12.2015 um 15:42 schrieb Tom Lane:
> Viktor Leis  writes:
>> I think it would be a good start to distinguish between nested loop
>> joins with and without a index.
> 
> We do.
> 
>> In my opinion, the latter should simply NEVER be chosen.
> 
> So, if you're given a query with a non-equality join condition
> that doesn't match any index on either table, you think the planner
> should just fail?  This is not realistic.  Certainly nestloop with
> inner seqscan is likely to be slow, but that's already reflected
> in the cost estimates.
You are right that for non-equality joins there is no alternative, so
nested loop is obviously the right choice. However, that does not make
the selection of nested loop join in cases where a hash join would be
possible a good choice.

Please have a look at Figure 6 (page 6) in
http://www.vldb.org/pvldb/vol9/p204-leis.pdf Disabling nested loop
joins without index scan (going from (a) to (b)) results in great
improvements across the board. And even more importantly, it avoids
most of the cases where queries took unreasonably long and timed
out. Basically this amounts to the being able to run the query on
PostgreSQL or not.

The cost model does not save you because the estimated cardinality is
close to 1. Also note that a hash join is fast too if the estimate is
correct. Picking nested loop join in these situations is very risky
but there is practically no upside for this decision.

--
Viktor Leis

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] SET SESSION AUTHORIZATION superuser limitation.

2015-12-21 Thread Dmitry Igrishin

2015-12-21 17:57 GMT+03:00 Tom Lane :

> Robert Haas  writes:
> > On Sun, Dec 20, 2015 at 1:47 PM, Tom Lane  wrote:
> >> The syntax you propose exposes the user's password in cleartext in
> >> the command, where it is likely to get captured in logs for example.
> >> That's not going to do.
>
> > Of course, right now, the ALTER USER ... PASSWORD command has that
> > problem which is, uh, bad.
>
> Which is why we invented the ENCRYPTED PASSWORD syntax, as well as
> psql's \password command ... but using that approach for actual
> login to an account would be a security fail as well.
>
The connection should be secured somehow (SSL/SSH...) to prevent password
thefts. On the other hand, the logging system should not log details of
commands
like ALTER USER ...

Re: [HACKERS] SET SESSION AUTHORIZATION superuser limitation.

Robert Haas  writes:
> On Sun, Dec 20, 2015 at 1:47 PM, Tom Lane  wrote:
>> The syntax you propose exposes the user's password in cleartext in
>> the command, where it is likely to get captured in logs for example.
>> That's not going to do.

> Of course, right now, the ALTER USER ... PASSWORD command has that
> problem which is, uh, bad.

Which is why we invented the ENCRYPTED PASSWORD syntax, as well as
psql's \password command ... but using that approach for actual
login to an account would be a security fail as well.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] custom function for converting human readable sizes to bytes

On Sun, Dec 20, 2015 at 4:54 AM, Pavel Stehule  wrote:
> new update:
>
> 1. unit searching is case insensitive
>
> 2. initial support for binary byte prefixes - KiB, MiB, ..  (IEC standard),
> change behave for SI units
>
> Second point is much more complex then it is looking - if pg_size_bytes
> should be consistent with pg_size_pretty.
>
> The current pg_size_pretty and transformations in guc.c are based on JEDEC
> standard. Using this standard for GUC has sense - using it for object sizes
> is probably unhappy.
>
> I tried to fix (and enhance) pg_size_pretty - now reports correct units, and
> via second parameter it allows to specify base: 2 (binary, IEC  - default)
> or 10 (SI).

-1 from me.  I don't think we should muck with the way pg_size_pretty works.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] plpgsql - DECLARE - cannot to use %TYPE or %ROWTYPE for composite types

2015-12-21 Thread Artur Zakirov


Hi.
I have tried to do some review of this patch. Below are my comments.

Introduction:

This patch fixes and adds the following functionality:
- %TYPE - now can be used for composite types.
- %ARRAYTYPE - new functionality, provides the array type from a 
variable or table column.
- %ELEMENTTYPE - new funcitonality, provides the element type of a given 
array.


New regression tests are included in the patch. Changes to the 
documentation are not provided.


Initial Run:

The patch applies correctly to HEAD. Regression tests pass successfully, 
without errors. It seems that the patch work as you expected.


Performance:

It seems patch have not possible performance issues for the existing 
functionality.


Coding:

The style looks fine. I attached the patch that does some corrections in 
code and documentation. I have corrected indentation in pl_comp.c and 
"read_datatype" function in pl_gram.y. I think changes in 
"read_datatype" function would be better to avoid a code duplication. 
But I could be wrong of course.


Conclusion:

The patch could be applied on master with documentation corrections. But 
I'm not sure that your task could be resloved only by adding %ARRAYTYPE 
and %ELEMENTTYPE. Maybe you will give some examples?


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/doc/src/sgml/plpgsql.sgml
--- b/doc/src/sgml/plpgsql.sgml
***
*** 710,715  SELECT merge_fields(t.*) FROM table1 t WHERE ... ;
--- 710,756 
 

  
+   
+Array Types
+ 
+ 
+ variable%ARRAYTYPE
+ 
+ 
+
+ %ARRAYTYPE provides the array type from a variable or
+ table column.  You can use this to declare array variables that will
+ hold database values.  To declare an array variable that will hold
+ values from users.user_id you can write:
+ 
+ user_id users.user_id%ARRAYTYPE;
+ 
+
+ 
+
+ By using %ARRAYTYPE you don't need to know the data
+ type of elements you are referencing.
+
+   
+ 
+   
+Array Element Types
+ 
+ 
+ variable%ELEMENTTYPE
+ 
+ 
+
+ %ELEMENTTYPE provides the element type of a given
+ array.  To declare a variable with the same data type as
+ users array element you can write:
+ 
+ user_id users%ELEMENTTYPE;
+ 
+
+ 
+   
+ 

 Collation of PL/pgSQL Variables
  
*** a/src/pl/plpgsql/src/pl_comp.c
--- b/src/pl/plpgsql/src/pl_comp.c
***
*** 1619,1625  plpgsql_parse_tripword(char *word1, char *word2, char *word3,
  
  /*
   * Derive type from ny base type controlled by reftype_mode
-  *
   */
  static PLpgSQL_type *
  derive_type(PLpgSQL_type *base_type, int reftype_mode)
--- 1619,1624 
***
*** 1661,1667  derive_type(PLpgSQL_type *base_type, int reftype_mode)
  ereport(ERROR,
  		(errcode(ERRCODE_DATATYPE_MISMATCH),
  		 errmsg("there are not array type for type %s",
! 	format_type_be(base_type->typoid;
  
  			return plpgsql_build_datatype(typoid, -1,
  			plpgsql_curr_compile->fn_input_collation);
--- 1660,1666 
  ereport(ERROR,
  		(errcode(ERRCODE_DATATYPE_MISMATCH),
  		 errmsg("there are not array type for type %s",
! format_type_be(base_type->typoid;
  
  			return plpgsql_build_datatype(typoid, -1,
  			plpgsql_curr_compile->fn_input_collation);
*** a/src/pl/plpgsql/src/pl_gram.y
--- b/src/pl/plpgsql/src/pl_gram.y
***
*** 2694,2700  read_datatype(int tok)
  	StringInfoData		ds;
  	char			   *type_name;
  	int	startlocation;
! 	PLpgSQL_type		*result;
  	int	parenlevel = 0;
  
  	/* Should only be called while parsing DECLARE sections */
--- 2694,2700 
  	StringInfoData		ds;
  	char			   *type_name;
  	int	startlocation;
! 	PLpgSQL_type		*result = 0;
  	int	parenlevel = 0;
  
  	/* Should only be called while parsing DECLARE sections */
***
*** 2720,2751  read_datatype(int tok)
  			tok = yylex();
  			if (tok_is_keyword(tok, ,
  			   K_TYPE, "type"))
- 			{
  result = plpgsql_parse_wordtype(dtname, PLPGSQL_REFTYPE_TYPE);
! if (result)
! 	return result;
! 			}
! 			if (tok_is_keyword(tok, ,
! 			   K_ELEMENTTYPE, "elementtype"))
! 			{
  result = plpgsql_parse_wordtype(dtname, PLPGSQL_REFTYPE_ELEMENT);
! if (result)
! 	return result;
! 			}
! 			if (tok_is_keyword(tok, ,
! 			   K_ARRAYTYPE, "arraytype"))
! 			{
  result = plpgsql_parse_wordtype(dtname, PLPGSQL_REFTYPE_ARRAY);
- if (result)
- 	return result;
- 			}
  			else if (tok_is_keyword(tok, ,
  	K_ROWTYPE, "rowtype"))
- 			{
  result = plpgsql_parse_wordrowtype(dtname);
! if (result)
! 	return result;
! 			}
  		}
  	}
  	else if (plpgsql_token_is_unreserved_keyword(tok))
--- 2720,2737 
  			tok = yylex();
  			if (tok_is_keyword(tok, ,
  			   K_TYPE, "type"))
  result = plpgsql_parse_wordtype(dtname, PLPGSQL_REFTYPE_TYPE);
! 			else if (tok_is_keyword(tok, ,
! 	K_ELEMENTTYPE,

[HACKERS] tracking owner of extension-managed objects

2015-12-21 Thread Chapman Flack

I'm looking for best-practice advice.

PL/Java is an extension that manages some objects (jar files, which
users can tell PL/Java to load, drop, or replace). The objects have
owners (have had since PL/Java 1.1.0 anyway).

When the owner tracking was added for 1.1.0 it recorded the owner oid.

In 2006, before 1.3.0, it was changed to keep the owner name instead
of the oid, in response to a bug report 1506 that involved the wrong
owner name being shown after dump/restore into another db where the
user oids were different.

Neither approach seems perfect to me (in fact, they both strike me
as having complementary sides of the same weakness, which dump/restore
just happens to expose). I am also wondering whether PL/Java ought to
create references, or a trigger, on pg_authid to clean up if the user
goes away; it currently doesn't.

Before I spend too many cycles on this, is there a most favored
design pattern that has already emerged for this kind of thing?

-Chap


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Some questions about the array.

Uriy Zhuravlev  writes:
>> I'm dubious that the parsetree representation is well-chosen.
>> Probably a single is_slice flag would have been better.

> What do you mean? This flag is for what? You are about the A_Indices 
> node(lidx_default/uidx_default)?

Yes.  Those flags are partially redundant with the subtree pointers being
NULL, and there are combinations that would be invalid (such as
lidx_default being set but lidx not being null), and it's pretty unobvious
what the difference in representation is between a non-slice case and a
slice case with only the upper index provided.  In fact, since you have
five syntaxes to represent, it's impossible for the two bools to
distinguish them all, which means that at least one case *must* be
identified by null-ness of a pointer contradicting what the corresponding
bool's setting would imply.  So this just seems like a mess to me.

I think it would come out cleaner if you had just one bool is_slice,
which corresponds directly to whether a colon was present.  The four
sub-possibilities of colon notation would be represented by combinations
of null and non-null lidx and uidx.  With is_slice false, the only valid
case is lidx==NULL, uidx!=NULL, as before for non-slice notation.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] tracking owner of extension-managed objects