Re: [HACKERS] WIP: Aggregation push-down

2017-11-03 Thread Antonin Houska
Antonin Houska <a...@cybertec.at> wrote:

> This is another version of the patch.

> shard.tgz demonstrates the typical postgres_fdw use case. One query shows
> base scans of the base relation's partitions being pushed to shard nodes,
> the other pushes down a join and performs aggregation of the join result on
> the remote node. Of course, the query can only reference one particular
> partition until the "partition-wise join" [1] patch gets committed and
> merged with this patch of mine.

Since [1] has been committed, the new version of shard.tgz shows what I
consider the typical use case. (I'm aware of the postgres_fdw regression test
failures; I'll try to fix them all in the next version.)

Besides that:

* A concept of "path unique keys" has been introduced. It helps to find out
  whether the final relation appears to generate a distinct set of grouping
  keys. If so, the final aggregation is replaced by a mere call of the
  aggfinalfn() function on each transient state value.

* The FDW can now sort rows by an aggregate.

* An enable_agg_pushdown GUC was added; its default value is false.

* I fixed errors reported during the previous CF.

* Added a few more regression tests.


I'm not about to add any other features now. Implementation of the missing
parts (see the TODO comments in the code) is the next step. But what I'd
appreciate most is feedback on the design. Thanks.

> [1] https://commitfest.postgresql.org/14/994/

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



agg_pushdown_v4.tgz
Description: GNU Zip compressed data


shard.tgz
Description: GNU Zip compressed data



Re: [HACKERS] Removing [Merge]Append nodes which contain a single subpath

2017-10-26 Thread Antonin Houska
David Rowley <david.row...@2ndquadrant.com> wrote:

> Method 1:
> 
> In set_append_rel_size() detect when just a single subpath would be
> added to the Append path.

I spent some time reviewing the partition-wise join patch during the last CF,
and my impression is that set_append_rel_size() is called too early to find
out how many child paths will eventually exist if the Append represents a
join of partitioned tables. I think the partition matching takes place at a
later stage, and that determines the actual number of partitions, and
therefore the number of Append subpaths.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] WIP: Separate log file for extension

2017-09-27 Thread Antonin Houska
Magnus Hagander <mag...@hagander.net> wrote:

> > On Fri, Aug 25, 2017 at 12:12 AM, Antonin Houska <a...@cybertec.at> wrote:

> 
> I like this idea in general.
> 
>  Then it's supposed to change some of its attributes
> 
> >  adjust_log_stream_attr(>filename, "my_extension.log");
> 
> This, however, seems to be wrong.
> 
> The logfile name does not belong in the extension, it belongs in the
> configuration file. I think the extension should set its "stream id" or
> whatever you want to call it, and then it should be possible to control in
> postgresql.conf where that log is sent.

Doesn't the last paragraph of

https://www.postgresql.org/message-id/11412.1503912190%40localhost

address your concerns?

> Also, what if this extension is loaded on demand in a session and not via
> shared_preload_libraries? It looks like the syslogger only gets the list of
> configured streams when it starts?

Yes, the syslogger gets the list of streams only when it starts, so the
extension that wants to use this feature needs to provide the file information
via shared_preload_libraries. I consider it sufficient because various
existing logging-related GUCs also can't be changed on-the-fly.

> In short, I think the solution should be more generic, and not "just for 
> extensions".

o.k. Any idea how to divide the streams into categories? Should they, for
example, somehow correspond to the categories of GUC variables?

> I completely missed this thread when I did my quick-wip at
> https://www.postgresql.org/message-id/flat/cabuevexztl0gorywm9s4tr_ft3fmjbraxqdxj+bqzjpvmru...@mail.gmail.com#cabuevexztl0gorywm9s4tr_ft3fmjbraxqdxj+bqzjpvmru...@mail.gmail.com
> -- some of the changes made were close enough that I got the two confused :)
> Based on the feedback of that one, have you done any performance checks?

I don't expect mere routing of messages into multiple files to add any
overhead. I'll run some tests, just out of curiosity.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] WIP: Separate log file for extension

2017-09-27 Thread Antonin Houska
Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:

> On 08/28/2017 11:23 AM, Antonin Houska wrote:
> > Craig Ringer <cr...@2ndquadrant.com> wrote:
> > 
> >> On 25 August 2017 at 15:12, Antonin Houska <a...@cybertec.at> wrote:
> >>
> >> How will this play with syslog? csvlog? etc?
> > 
> > There's nothing special about csvlog: the LogStream structure has a
> > "destination" field, so if a particular extension wants this kind of
> > output, it simply sets the LOG_DESTINATION_CSVLOG bit here.
> > 
> 
> I assume Craig's point was more about CSVLOG requiring log_collector=on.
> So what will happen if PostgreSQL is started without the collector,
> and an extension attempts to use LOG_DESTINATION_CSVLOG? Or will it
> start a separate collector for each such extension?

The patch does not try to change this behavior, so there'll be no csvlog file
and only the plain (not CSV-formatted) message will be written to the console.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] WIP: Aggregation push-down

2017-09-08 Thread Antonin Houska
Ashutosh Bapat <ashutosh.ba...@enterprisedb.com> wrote:

> On Thu, Aug 17, 2017 at 8:52 PM, Antonin Houska <a...@cybertec.at> wrote:
> > Antonin Houska <a...@cybertec.at> wrote:
> >
> >> Antonin Houska <a...@cybertec.at> wrote:
> >>
> >> > This is a new version of the patch I presented in [1].
> >>
> >> Rebased.
> >>
> >> cat .git/refs/heads/master
> >> b9a3ef55b253d885081c2d0e9dc45802cab71c7b
> >
> > This is another version of the patch.
> >
> > Besides other changes, it enables the aggregation push-down for 
> > postgres_fdw,
> > although only for aggregates whose transient aggregation state is equal to 
> > the
> > output type. For other aggregates (like avg()) the remote nodes will have to
> > return the transient state value in an appropriate form (maybe bytea type),
> > which does not depend on PG version.
> 
> Having transient aggregation state type same as output type doesn't
> mean that we can feed the output of remote aggregation as is to the
> finalization stage locally or finalization is not needed at all. I am
> not sure why is that test being used to decide whether or not to push
> an aggregate (I assume they are partial aggregates) to the foreign
> server.

I agree. This seems to be the same problem as reported in [2].

> postgres_fdw doesn't have a feature to fetch partial aggregate
> results from the foreign server. Till we do that, I don't think this
> patch can push any partial aggregates down to the foreign server.

Even if the query contains only aggregates like sum(int4), i.e. aggfinalfn
does not exist and the transient type can be transferred from the remote
server in textual form? (Of course there's a question whether such behaviour
is consistent from the user's perspective.)
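
To be a bit more concrete, the kind of test I have in mind is roughly the
following (just an illustrative sketch with a made-up function name, not the
patch's actual code):

#include "postgres.h"
#include "access/htup_details.h"
#include "catalog/pg_aggregate.h"
#include "catalog/pg_type.h"
#include "utils/syscache.h"

/*
 * Sketch only: can the aggregate's transition state be shipped from a
 * remote node as an ordinary value, with no local finalization needed?
 */
static bool
agg_state_is_shippable(Oid aggfnoid)
{
	HeapTuple	tup;
	Form_pg_aggregate aggform;
	bool		result;

	tup = SearchSysCache1(AGGFNOID, ObjectIdGetDatum(aggfnoid));
	if (!HeapTupleIsValid(tup))
		elog(ERROR, "cache lookup failed for aggregate %u", aggfnoid);
	aggform = (Form_pg_aggregate) GETSTRUCT(tup);

	/* No final function and a transition type other than "internal". */
	result = !OidIsValid(aggform->aggfinalfn) &&
		aggform->aggtranstype != INTERNALOID;

	ReleaseSysCache(tup);
	return result;
}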

> >
> > One thing I'm not sure about is whether the patch should remove the
> > GetForeignUpperPaths function from FdwRoutine, which it essentially makes
> > obsolete. Or should it only be deprecated for now? I understand that
> > deprecation is important for "ordinary SQL users", but FdwRoutine is an
> > interface for extension developers, so the policy might be different.
> 
> I doubt if that's correct. We can do that only when we know that all kinds
> of aggregates/grouping can be pushed down the join tree, which looks
> impossible to me. Apart from that, FdwRoutine handles all kinds of upper
> relations like windowing, distinct and ordered, which are not pushed down by
> this set of patches.

Good point.

> Here's review of the patchset
> The patches compile cleanly when applied together.
> 
> Testcase "limit" fails in make check.

If I remember correctly, this test generates a different plan because the
aggregate push-down gets involved. I tend to ignore this failure until the
patch's cost estimates are refined. Then let's see whether the test passes or
whether the expected output should be adjusted.

> This is a pretty large patchset and the patches are divided by stages in the
> planner rather than functionality. IOW, no single patch provides a full
> functionality. Reviewing and probably committing such patches is not easy and
> may take quite long. Instead, if the patches are divided by functionality, 
> i.e.
> each patch provides some functionality completely, it will be easy to review
> and commit such patches with each commit giving something useful. I had
> suggested breaking patches on that line at [1]. The patches also refactor
> existing code. It's better to move such refactoring in a patch/es by itself.

I definitely saw commits whose purpose is preparation for something else. But
you're right that some splitting of the actual functionality wouldn't hurt.
I'll at least separate partial aggregation of base relations from that of
joins, and maybe add some more preparation steps.

> The patches need more comments, explaining various decisions

o.k.

> The patch maintains an extra rel target, two pathlists, partial pathlist and
> non-partial pathlist for grouping in RelOptInfo. These two extra
> pathlists require "grouped" argument to be passed to a number of functions.
> Instead probably it makes sense to create a RelOptInfo for 
> aggregation/grouping
> and set pathlist and partial_pathlist in that RelOptInfo. That will also make 
> a
> place for keeping grouped estimates.

If grouped paths require a separate RelOptInfo, why don't the existing
partial paths require one?

> For placeholders we have two function add_placeholders_to_base_rels() and
> add_placeholders_to_joinrel(). We have add_grouping_info_to_base_rels(), but 
> do
> not have corresponding function for joinrels. How do we push aggregates
> involving two or more relations to the corresponding joinrels?

add_grou

Re: [HACKERS] WIP: Aggregation push-down

2017-09-08 Thread Antonin Houska
Jeevan Chalke <jeevan.cha...@enterprisedb.com> wrote:

> 1.
> + if (aggref->aggvariadic ||
> + aggref->aggdirectargs || aggref->aggorder ||
> + aggref->aggdistinct || aggref->aggfilter)
> 
> I did not understand, why you are not pushing aggregate in above cases?
> Can you please explain?

Currently I'm trying to implement the infrastructure to propagate the result
of partial aggregation from a base relation or join to the root of the join
tree, and to use it as input for the final aggregation. Some special cases
are disabled because I haven't thought about them enough so far. Some of
these restrictions may be easy to relax, others may be hard. I'll get back to
them at some point.

> 2. "make check" in contrib/postgres_fdw crashes.
> 
> SELECT COUNT(*) FROM ft1 t1;
> ! server closed the connection unexpectedly
> ! This probably means the server terminated abnormally
> ! before or while processing the request.
> ! connection to server was lost
> 
> From your given setup, if I wrote a query like:
> EXPLAIN VERBOSE SELECT count(*) FROM orders_0;
> it crashes.
> 
> Seems like missing few checks.

Please consider this a temporary limitation.

> 3. In case of pushing partial aggregation on the remote side, you use schema
> named "partial", I did not get that change. If I have AVG() aggregate,
> then I end up getting an error saying
> "ERROR: schema "partial" does not exist".

Attached to this email is an extension that I hacked together quickly at some
point, for experimental purposes only. It's pretty raw but may be useful if
you just want to play with it.

> 4. I am not sure about the code changes related to pushing partial
> aggregate on the remote side. Basically, you are pushing a whole aggregate
> on the remote side and result of that is treated as partial on the
> basis of aggtype = transtype. But I am not sure that is correct unless
> I miss something here. The type may be same but the result may not.

You're right. I mostly used sum() and count() in my examples but these are
special cases. It should also be checked whether aggfinalfn exists or
not. I'll fix it, thanks.

> 5. There are lot many TODO comments in the patch-set, are you working
> on those?

Yes. I wasn't too eager to complete all the TODOs soon because reviews of the
patch may result in a major rework. And if large parts of the code have to be
thrown away, some or many of the TODOs will go with them.

> Thanks
> 
> On Thu, Aug 17, 2017 at 8:52 PM, Antonin Houska <a...@cybertec.at> wrote:
> 
>  Antonin Houska <a...@cybertec.at> wrote:
> 
>  > Antonin Houska <a...@cybertec.at> wrote:
>  >
>  > > This is a new version of the patch I presented in [1].
>  >
>  > Rebased.
>  >
>  > cat .git/refs/heads/master
>  > b9a3ef55b253d885081c2d0e9dc45802cab71c7b
> 
>  This is another version of the patch.
> 
>  Besides other changes, it enables the aggregation push-down for postgres_fdw,
>  although only for aggregates whose transient aggregation state is equal to 
> the
>  output type. For other aggregates (like avg()) the remote nodes will have to
>  return the transient state value in an appropriate form (maybe bytea type),
>  which does not depend on PG version.
> 
>  shard.tgz demonstrates the typical postgres_fdw use case. One query shows
>  base scans of the base relation's partitions being pushed to shard nodes,
>  the other pushes down a join and performs aggregation of the join result on
>  the remote node. Of course, the query can only reference one particular
>  partition until the "partition-wise join" [1] patch gets committed and
>  merged with this patch of mine.
> 
>  One thing I'm not sure about is whether the patch should remove the
>  GetForeignUpperPaths function from FdwRoutine, which it essentially makes
>  obsolete. Or should it only be deprecated for now? I understand that
>  deprecation is important for "ordinary SQL users", but FdwRoutine is an
>  interface for extension developers, so the policy might be different.
> 
>  [1] https://commitfest.postgresql.org/14/994/
> 
>  Any feedback is appreciated.
> 
>  --
>  Antonin Houska
>  Cybertec Schönig & Schönig GmbH
>  Gröhrmühlgasse 26
>  A-2700 Wiener Neustadt
>  Web: http://www.postgresql-support.de, http://www.cybertec.at
> 
>  --
>  Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
>  To make changes to your subscription:
>  http://www.postgresql.org/mailpref/pgsql-hackers
> 
> -- 
> Jeevan Chalke
> Principal Software Engineer, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
> 
> 

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



partial.tgz
Description: GNU Zip compressed data



Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2017-09-07 Thread Antonin Houska
Ashutosh Bapat <ashutosh.ba...@enterprisedb.com> wrote:

> I have fixed the issues which were marked as TODOs in the attached
> patches. Also, I have included your test change patch in my series of
> patches.

I've noticed that partition_bounds_merge() is called twice from
make_join_rel():

 * build_join_rel -> build_joinrel_partition_info -> partition_bounds_merge

 * try_partition_wise_join -> partition_bounds_merge

Is this intentional, or just a thinko?

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-07 Thread Antonin Houska
Ashutosh Bapat <ashutosh.ba...@enterprisedb.com> wrote:

> On Fri, Sep 1, 2017 at 6:05 PM, Antonin Houska <a...@cybertec.at> wrote:
> >
> >
> >
> > * get_partitioned_child_rels_for_join()
> >
> > I think the Assert() statement is easier to understand inside the loop, see
> > the assert.diff attachment.

> The assert at the end of function also checks that we have got
> child_rels lists for all the parents passed in.

Really? I can imagine that some instances of PartitionedChildRelInfo have the
child_rels list empty, while other ones have these lists long enough to
compensate for the empty lists.

> >
> >
> > * have_partkey_equi_join()
> >
> > As the function handles generic join, this comment doesn't seem to me
> > relevant:
> >
> > /*
> >  * The equi-join between partition keys is strict if equi-join between
> >  * at least one partition key is using a strict operator. See
> >  * explanation about outer join reordering identity 3 in
> >  * optimizer/README
> >  */
> > strict_op = op_strict(opexpr->opno);
> 
> What in that comment is not exactly relevant?

Basically I don't understand why you mention join reordering here. The join
ordering questions must all have been resolved by the time
have_partkey_equi_join() is called.

> >
> > And I think the function can return true even if strict_op is false for all
> > the operators evaluated in the loop.
> 
> I think it does that. Do you have a case where it doesn't?

Here I refer to this part of the comment above:

"... if equi-join between at least one partition key is using a strict
operator."

My understanding of the code (especially match_expr_to_partition_keys) is that
no operator actually needs to be strict as long as each operator involved in
the join matches at least one non-nullable expression on both sides of the
join.

> > * match_expr_to_partition_keys()
> >
> > I'm not sure this comment is clear enough:
> >
> > /*
> >  * If it's a strict equi-join a NULL partition key on one side will
> >  * not join a NULL partition key on the other side. So, rows with NULL
> >  * partition key from a partition on one side can not join with those
> >  * from a non-matching partition on the other side. So, search the
> >  * nullable partition keys as well.
> >  */
> > if (!strict_op)
> > continue;
> >
> > My understanding of the problem of NULL values generated by outer join is:
> > these NULL values --- if evaluated by non-strict expression --- can make row
> > of N-th partition on one side of the join match row(s) of *other than* N-th
> > partition(s) on the other side. Thus the nullable input expressions may only
> > be evaluated by strict operators. I think it'd be clearer if you stressed 
> > that
> > (undesired) *match* of partition keys can be a problem, rather than mismatch
> 
> Sorry, I am not able to understand this. To me it looks like my
> wording conveys what you are saying.

I just tried to express the idea in a way that is clearer to me. I think we
both mean the same thing. I'm not sure I should spend more effort on another
version of the comment.

> > If you insist on your wording, then I think you should at least move the
> > comment below to the part that only deals with strict operators.
> 
> Done.

o.k.

> >
> > * map_and_merge_partitions()
> >
> > Besides a few changes proposed in map_and_merge_partitions.diff (a few of 
> > them
> > to suppress compiler warnings) I think that this part needs more thought:
> >
> > {
> > Assert(mergemap1[index1] != mergemap2[index2] &&
> >mergemap1[index1] >= 0 && mergemap2[index2] >= 0);
> >
> > /*
> >  * Both the partitions map to different merged partitions. This
> >  * means that multiple partitions from one relation matches to 
> > one
> >  * partition from the other relation. Partition-wise join does 
> > not
> >  * handle this case right now, since it requires ganging 
> > multiple
> >  * partitions together (into one RelOptInfo).
> >  */
> > merged_index = -1;
> > }
> >
> > I could hit this path with the following test:
> >
> > CREATE TABLE a(i int) PARTITION BY LIST(i);
> > CREATE TABLE a_0 PARTITION OF a FOR VALUES IN (0, 2);
> > CREATE TABLE b(j int) PARTITION BY LIST(j);
> > CREATE TABLE b_0 PARTITION OF b FOR VALUES IN (1, 2);
> >
> > SET 

Re: [HACKERS] pg_basebackup throttling doesn't throttle as promised

2017-09-02 Thread Antonin Houska
Jeff Janes <jeff.ja...@gmail.com> wrote:

> On Fri, Sep 1, 2017 at 1:32 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> 
>  The "-r" option to pg_basebackup is supposed to throttle the rate of the
>  backup. But it only works properly if the server is mostly idle.
> 
>  Every non-trivial call to XLogFlush or XLogBackgroundFlush will wake up the
>  wal sender (the one which is not really sending wal, but base files), and
>  the throttling routine doesn't go back to sleep after being awoke
>  early. Rather, it releases another 32kb of data.

Sorry, I missed this fact when coding the feature.

>  Should the basebackup.c throttle sleep in a loop until its time has
>  expired?

I think this is the correct approach because the latch can be set for
unrelated reasons.

The patch makes sense to me. I just recommend moving this part in front of the
loop because elapsed_min does not have to be re-computed each time:

/* How much should have elapsed at minimum? */
elapsed_min = elapsed_min_unit * (throttling_counter / throttling_sample);

Also, a comment explaining the purpose of the loop would be appreciated.
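For illustration, the kind of loop I have in mind is roughly this (just a
sketch, not the actual basebackup.c code; it reuses the variables of the
existing throttle() function):

	/*
	 * Sleep in a loop: the latch may be set for reasons that have nothing
	 * to do with throttling, so keep waiting until the minimum time has
	 * really elapsed.
	 */
	elapsed_min = elapsed_min_unit * (throttling_counter / throttling_sample);

	for (;;)
	{
		TimeOffset	elapsed = GetCurrentTimestamp() - throttled_last;
		TimeOffset	sleep = elapsed_min - elapsed;

		if (sleep <= 0)
			break;

		/* Sleep only for the remaining time, then re-check. */
		(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT,
						 (long) (sleep / 1000),
						 WAIT_EVENT_BASE_BACKUP_THROTTLE);
		ResetLatch(MyLatch);
	}

Thanks.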

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-01 Thread Antonin Houska
Ashutosh Bapat <ashutosh.ba...@enterprisedb.com> wrote:

> I originally thought to provide it along-with the changes to
> expand_inherited_rtentry(), but that thread is taking longer. Jeevan
> Chalke needs rebased patches for his work on aggregate pushdown and
> Thomas might need them for further review. So, here they are.

Since I have a related patch in the current commitfest
(https://commitfest.postgresql.org/14/1247/), I spent some time reviewing your
patch:

* generate_partition_wise_join_paths()

A right parenthesis is missing in the prologue.


* get_partitioned_child_rels_for_join()

I think the Assert() statement is easier to understand inside the loop, see
the assert.diff attachment.


* have_partkey_equi_join()

As the function handles generic join, this comment doesn't seem to me
relevant:

/*
 * The equi-join between partition keys is strict if equi-join between
 * at least one partition key is using a strict operator. See
 * explanation about outer join reordering identity 3 in
 * optimizer/README
 */
strict_op = op_strict(opexpr->opno);

And I think the function can return true even if strict_op is false for all
the operators evaluated in the loop.


* match_expr_to_partition_keys()

I'm not sure this comment is clear enough:

/*
 * If it's a strict equi-join a NULL partition key on one side will
 * not join a NULL partition key on the other side. So, rows with NULL
 * partition key from a partition on one side can not join with those
 * from a non-matching partition on the other side. So, search the
 * nullable partition keys as well.
 */
if (!strict_op)
continue;

My understanding of the problem of NULL values generated by outer join is:
these NULL values --- if evaluated by non-strict expression --- can make row
of N-th partition on one side of the join match row(s) of *other than* N-th
partition(s) on the other side. Thus the nullable input expressions may only
be evaluated by strict operators. I think it'd be clearer if you stressed that
(undesired) *match* of partition keys can be a problem, rather than mismatch.

If you insist on your wording, then I think you should at least move the
comment below to the part that only deals with strict operators.


* There are several places where lfirst_node() macro should be used. For
  example

rel = lfirst_node(RelOptInfo, lc);

instead of

rel = (RelOptInfo *) lfirst(lc);


* map_and_merge_partitions()

Besides a few changes proposed in map_and_merge_partitions.diff (a few of them
to suppress compiler warnings) I think that this part needs more thought:

{
Assert(mergemap1[index1] != mergemap2[index2] &&
   mergemap1[index1] >= 0 && mergemap2[index2] >= 0);

/*
 * Both the partitions map to different merged partitions. This
 * means that multiple partitions from one relation matches to one
 * partition from the other relation. Partition-wise join does not
 * handle this case right now, since it requires ganging multiple
 * partitions together (into one RelOptInfo).
 */
merged_index = -1;
}

I could hit this path with the following test:

CREATE TABLE a(i int) PARTITION BY LIST(i);
CREATE TABLE a_0 PARTITION OF a FOR VALUES IN (0, 2);
CREATE TABLE b(j int) PARTITION BY LIST(j);
CREATE TABLE b_0 PARTITION OF b FOR VALUES IN (1, 2);

SET enable_partition_wise_join TO on;

SELECT  *
FROM    a
        FULL JOIN
        b ON i = j;

I don't think there's a reason not to join a_0 partition to b_0, is there?

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
new file mode 100644
index a75b1a3..3094b56
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*** get_partitioned_child_rels_for_join(Plan
*** 6160,6170 
  		PartitionedChildRelInfo *pc = lfirst(l);
  
  		if (bms_is_member(pc->parent_relid, join_relids))
  			result = list_concat(result, list_copy(pc->child_rels));
  	}
  
- 	/* The root partitioned table is included as a child rel */
- 	Assert(list_length(result) >= bms_num_members(join_relids));
- 
  	return result;
  }
--- 6160,6172 
  		PartitionedChildRelInfo *pc = lfirst(l);
  
  		if (bms_is_member(pc->parent_relid, join_relids))
+ 		{
+ 			/* The root partitioned table is included as a child rel */
+ 			Assert(list_length(pc->child_rels) >= 1);
+ 
  			result = list_concat(result, list_copy(pc->child_rels));
+ 		}
  	}
  
  	return result;
  }
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
new file mode 100644
index eb35fab..aa9c70d
*** a/src/backend/catalog/partition.c
--- b/src/backend

Re: [HACKERS] WIP: Separate log file for extension

2017-08-30 Thread Antonin Houska
Masahiko Sawada <sawada.m...@gmail.com> wrote:

> On Fri, Aug 25, 2017 at 4:12 PM, Antonin Houska <a...@cybertec.at> wrote:
> > Attached is a draft patch to allow extension to write log messages to a
> > separate file.
> 
> Does it allow postgres core code to write log messages to multiple log
> files as well? I can imagine a use case where we want to write log
> messages to separate log files by error level or purpose (server log
> and SQL log etc).

At a low level it should work, as long as several log streams are reserved
for the core code. However, as every single stream needs multiple variables,
I have no idea how to configure those additional streams without adding many
new GUC variables to the core.

Also, the discussion about what should (or should not) be logged separately
would probably be difficult.

Specifically, to log SQL statements to a separate file, I think (but have not
verified) that you can write an extension that reserves a stream for itself
and also uses emit_log_hook to recognize SQL statements among the logged
messages. If the hook adjusts the syslogger_stream field of ErrorData
accordingly, the message should end up in the separate file.
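
A minimal sketch of such a hook could look like this (illustrative only:
syslogger_stream is the field my patch adds to ErrorData, the "statement: "
prefix test is just a crude heuristic, and the stream registration is left
out because its details may still change):

#include "postgres.h"
#include "fmgr.h"
#include "utils/elog.h"

PG_MODULE_MAGIC;

void		_PG_init(void);

/* Stream id obtained from get_log_stream(), as described earlier. */
static int	sql_stream_id;
static emit_log_hook_type prev_emit_log_hook = NULL;

static void
sql_log_hook(ErrorData *edata)
{
	/* log_statement produces messages of the form "statement: ...". */
	if (edata->message != NULL &&
		strncmp(edata->message, "statement: ", 11) == 0)
		edata->syslogger_stream = sql_stream_id;

	if (prev_emit_log_hook)
		prev_emit_log_hook(edata);
}

void
_PG_init(void)
{
	/* sql_stream_id = get_log_stream(...) would go here. */
	prev_emit_log_hook = emit_log_hook;
	emit_log_hook = sql_log_hook;
}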

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




[HACKERS] Write operations in parallel mode

2017-08-28 Thread Antonin Houska
Now that the dynamic shared memory hash table has been committed
(8c0d7bafad36434cb08ac2c78e69ae72c194ca20), I wonder if it's still a big deal
to remove restrictions like this one (e.g. in heap_update()):

	/*
	 * Forbid this during a parallel operation, lest it allocate a combocid.
	 * Other workers might need that combocid for visibility checks, and we
	 * have no provision for broadcasting it to them.
	 */
	if (IsInParallelMode())
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
				 errmsg("cannot update tuples during a parallel operation")));



-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] WIP: Separate log file for extension

2017-08-28 Thread Antonin Houska
Craig Ringer <cr...@2ndquadrant.com> wrote:

> On 25 August 2017 at 15:12, Antonin Houska <a...@cybertec.at> wrote:
> 
> How will this play with syslog? csvlog? etc?

There's nothing special about csvlog: the LogStream structure has a
"destination" field, so if a particular extension wants this kind of output,
it simply sets the LOG_DESTINATION_CSVLOG bit here.

LOG_DESTINATION_SYSLOG is a problem because (AFAIK) a single process can
maintain no more than one connection to the syslog server. I don't like the
idea of closing the current connection and opening a different one whenever
the next ereport() is associated with a different stream than the previous
one was.

As for LOG_DESTINATION_EVENTLOG, I haven't checked yet.

> I wonder if a level of indirection is appropriate here, where extensions (or
> postgres subsystems, even) provide a log stream label. Then the logging
> backed takes care of using that appropriately for the logging mechanism in
> use; for logging to file that'd generally be separate files. Same for
> CSVlog. Other mechanisms could be left as stubs initially.
> 
> So the outcome would be the same, just without the assumption of specific
> file name and output mechanism baked in.

The example I showed in my previous email was the simplest case I could think
of. But that does not mean the file name has to be hard-coded in the
extension. Instead of setting the LogStream.filename field directly, you can
pass a pointer to this field to the DefineCustomStringVariable() function, so
that a specific GUC can control the value.
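
For example (a sketch only; my_log_stream is assumed to have been obtained
from get_log_stream() in _PG_init(), and the GUC name is arbitrary):

	DefineCustomStringVariable("my_extension.log_filename",
							   "File that receives my_extension's log messages.",
							   NULL,
							   &my_log_stream->filename,
							   "my_extension.log",
							   PGC_POSTMASTER,
							   0,
							   NULL, NULL, NULL);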

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




[HACKERS] WIP: Separate log file for extension

2017-08-25 Thread Antonin Houska
Attached is a draft patch to allow an extension to write log messages to a
separate file. It introduces the concept of a "log stream". The extension's
shared library gets its stream assigned by calling this function from
_PG_init():

my_stream_id = get_log_stream("my_extension", _log_stream);

Then it's supposed to change some of its attributes

adjust_log_stream_attr(>filename, "my_extension.log");

and to use the stream id in ereport() calls

ereport(LOG, (errmsg("Hello world"), errstream(my_stream_id)));

The EXEC_BACKEND mechanism makes initialization of the log streams by
postmaster child processes non-trivial. I decided to extend
save_backend_variables() and restore_backend_variables() accordingly. Maybe
someone has a better idea.

pgaudit seems to be the most obvious use case for this enhancement, but it
might be useful for many other extensions.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
new file mode 100644
index 95180b2..e9b5684
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
*** typedef struct
*** 464,469 
--- 464,470 
  
  static pid_t backend_forkexec(Port *port);
  static pid_t internal_forkexec(int argc, char *argv[], Port *port);
+ static Size get_backend_params_size(void);
  
  /* Type for a socket that can be inherited to a client process */
  #ifdef WIN32
*** typedef int InheritableSocket;
*** 482,488 
--- 483,495 
   */
  typedef struct
  {
+ 	/*
+ 	 * read_backend_variables() relies on size to be the first field, followed
+ 	 * by port.
+ 	 */
+ 	Size		size;
  	Port		port;
+ 
  	InheritableSocket portsocket;
  	char		DataDir[MAXPGPATH];
  	pgsocket	ListenSocket[MAXLISTEN];
*** typedef struct
*** 528,533 
--- 535,542 
  	char		my_exec_path[MAXPGPATH];
  	char		pkglib_path[MAXPGPATH];
  	char		ExtraOptions[MAXPGPATH];
+ 	int			nlogstreams;
+ 	char		log_streams[FLEXIBLE_ARRAY_MEMBER];
  } BackendParameters;
  
  static void read_backend_variables(char *id, Port *port);
*** PostmasterMain(int argc, char *argv[])
*** 578,583 
--- 587,593 
  	bool		listen_addr_saved = false;
  	int			i;
  	char	   *output_config_variable = NULL;
+ 	LogStream  *log = &log_streams[0];
  
  	MyProcPid = PostmasterPid = getpid();
  
*** PostmasterMain(int argc, char *argv[])
*** 1273,1279 
  	 * saying so, to provide a breadcrumb trail for users who may not remember
  	 * that their logging is configured to go somewhere else.
  	 */
! 	if (!(Log_destination & LOG_DESTINATION_STDERR))
  		ereport(LOG,
  (errmsg("ending log output to stderr"),
   errhint("Future log output will go to log destination \"%s\".",
--- 1283,1289 
  	 * saying so, to provide a breadcrumb trail for users who may not remember
  	 * that their logging is configured to go somewhere else.
  	 */
! 	if (!(log->destination & LOG_DESTINATION_STDERR))
  		ereport(LOG,
  (errmsg("ending log output to stderr"),
   errhint("Future log output will go to log destination \"%s\".",
*** internal_forkexec(int argc, char *argv[]
*** 4421,4431 
  	static unsigned long tmpBackendFileNum = 0;
  	pid_t		pid;
  	char		tmpfilename[MAXPGPATH];
! 	BackendParameters param;
  	FILE	   *fp;
  
! 	if (!save_backend_variables(&param, port))
  		return -1;/* log made by save_backend_variables */
  
  	/* Calculate name for temp file */
  	snprintf(tmpfilename, MAXPGPATH, "%s/%s.backend_var.%d.%lu",
--- 4431,4448 
  	static unsigned long tmpBackendFileNum = 0;
  	pid_t		pid;
  	char		tmpfilename[MAXPGPATH];
! 	Size		param_size;
! 	BackendParameters *param;
  	FILE	   *fp;
  
! 	param_size = get_backend_params_size();
! 	param = (BackendParameters *) palloc(param_size);
! 	if (!save_backend_variables(param, port))
! 	{
! 		pfree(param);
  		return -1;/* log made by save_backend_variables */
+ 	}
+ 	Assert(param->size == param_size);
  
  	/* Calculate name for temp file */
  	snprintf(tmpfilename, MAXPGPATH, "%s/%s.backend_var.%d.%lu",
*** internal_forkexec(int argc, char *argv[]
*** 4449,4466 
  	(errcode_for_file_access(),
  	 errmsg("could not create file \"%s\": %m",
  			tmpfilename)));
  			return -1;
  		}
  	}
  
! 	if (fwrite(&param, sizeof(param), 1, fp) != 1)
  	{
  		ereport(LOG,
  (errcode_for_file_access(),
   errmsg("could not write to file \"%s\": %m", tmpfilename)));
  		FreeFile(fp);
  		return -1;
  	}
  
  	/* Release file */
  	if (FreeFile(fp))
--- 4466,4486 
  	(errcode_for_file_access(),
  	 errmsg("could not create file \"%s\": %m&qu

Re: [HACKERS] WIP: Aggregation push-down

2017-08-17 Thread Antonin Houska
Antonin Houska <a...@cybertec.at> wrote:

> Antonin Houska <a...@cybertec.at> wrote:
> 
> > This is a new version of the patch I presented in [1].
> 
> Rebased.
> 
> cat .git/refs/heads/master 
> b9a3ef55b253d885081c2d0e9dc45802cab71c7b

This is another version of the patch.

Besides other changes, it enables the aggregation push-down for postgres_fdw,
although only for aggregates whose transient aggregation state is equal to the
output type. For other aggregates (like avg()) the remote nodes will have to
return the transient state value in an appropriate form (maybe bytea type),
which does not depend on PG version.

shard.tgz demonstrates the typical postgres_fdw use case. One query shows base
scans of the base relation's partitions being pushed to shard nodes, the other
pushes down a join and performs aggregation of the join result on the remote
node. Of course, the query can only reference one particular partition until
the "partition-wise join" [1] patch gets committed and merged with this patch
of mine.

One thing I'm not sure about is whether the patch should remove the
GetForeignUpperPaths function from FdwRoutine, which it essentially makes
obsolete. Or should it only be deprecated for now? I understand that
deprecation is important for "ordinary SQL users", but FdwRoutine is an
interface for extension developers, so the policy might be different.

[1] https://commitfest.postgresql.org/14/994/

Any feedback is appreciated.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



agg_pushdown_v3.tgz
Description: GNU Zip compressed data


shard.tgz
Description: GNU Zip compressed data



[HACKERS] Unused field of ProjectionPath

2017-08-03 Thread Antonin Houska
I've noticed that the dummypp field of ProjectionPath is set but never read.

If the only possible change between path creation time and plan creation time
is that the targetlists of the projection path and its subpath become
different, then dummypp could be used this way:
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
new file mode 100644
index 5c934f2..1910d3f
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*** create_projection_plan(PlannerInfo *root
*** 1572,1578 
 * not using.)
 */
if (is_projection_capable_path(best_path->subpath) ||
!   tlist_same_exprs(tlist, subplan->targetlist))
{
/* Don't need a separate Result, just assign tlist to subplan */
plan = subplan;
--- 1572,1578 
 * not using.)
 */
if (is_projection_capable_path(best_path->subpath) ||
!   (best_path->dummypp && tlist_same_exprs(tlist, subplan->targetlist)))
{
/* Don't need a separate Result, just assign tlist to subplan */
plan = subplan;


On the other hand, if the targetlists can also be different at path creation
time and equal at plan creation time, the lists always need to be compared at
plan creation time, and the dummypp field should probably be removed.


-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] asynchronous execution

2017-07-11 Thread Antonin Houska
Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:

> > Just one idea that I had while reading the code.
> > 
> > In ExecAsyncEventLoop you iterate estate->es_pending_async, then move the
> > complete requests to the end and finally adjust
> > estate->es_num_pending_async so that the array no longer contains the
> > complete requests. I think the point is that then you can add new requests
> > to the end of the array.
> > 
> > I wonder if a set (Bitmapset) of incomplete requests would make the code
> > more efficient. The set would contain the position of each incomplete
> > request in estate->es_pending_async (I think it's the myindex field of
> > PendingAsyncRequest). If ExecAsyncEventLoop used this set to retrieve the
> > requests subject to ExecAsyncNotify etc., then the compaction of
> > estate->es_pending_async wouldn't be necessary.
> > 
> > ExecAsyncRequest would use the set to look for space for new requests by
> > iterating it and trying to find the first gap (which corresponds to 
> > completed
> > request).
> > 
> > And finally, item would be removed from the set at the moment the request
> > state is being set to ASYNCREQ_COMPLETE.
> 
> Effectively it is a waiting-queue followed by a
> completed-list. The point of the compaction is keeping the order
> of waiting or not-yet-completed requests, which is crucial to
> avoid kind-a precedence inversion. We cannot keep the order by
> using bitmapset in such way.

> The current code waits all waiters at once and processes all
> fired events at once. The order in the waiting-queue is
> inessential in the case. On the other hand I suppose waiting on
> several-tens to near-hundred remote hosts is in a realistic
> target range. Keeping the order could be crucial if we process a
> part of the queue at once in the case.
> 
> Putting significance on the deviation of response time of
> remotes, process-all-at-once is effective. In turn we should
> consider the effectiveness of the lifecycle of the larger wait
> event set.

ok, I missed the fact that the order of es_pending_async entries is
important. I think this is worth mentioning in a comment.

Actually, the reason I thought of a simplification was that I noticed a small
inefficiency in the way you do the compaction. In particular, I think it's
not always necessary to swap the tail and head entries. Would something like
this make sense?


	/* If any node completed, compact the array. */
	if (any_node_done)
	{
		int			hidx = 0,
					tidx;

		/*
		 * Swap all non-yet-completed items to the start of the array.
		 * Keep them in the same order.
		 */
		for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
		{
			PendingAsyncRequest *tail = estate->es_pending_async[tidx];

			Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);

			if (tail->state == ASYNCREQ_COMPLETE)
				continue;

			/*
			 * If the array starts with one or more incomplete requests,
			 * both head and tail point at the same item, so there's no
			 * point in swapping.
			 */
			if (tidx > hidx)
			{
				PendingAsyncRequest *head = estate->es_pending_async[hidx];

				/*
				 * Once the tail got ahead, it should only leave
				 * ASYNCREQ_COMPLETE behind. Only those can then be seen
				 * by head.
				 */
				Assert(head->state == ASYNCREQ_COMPLETE);

				estate->es_pending_async[tidx] = head;
				estate->es_pending_async[hidx] = tail;
			}

			++hidx;
		}

		estate->es_num_pending_async = hidx;
	}

And besides that, I think it'd be more intuitive if the meanings of "head"
and "tail" were reversed: if the array is iterated from lower to higher
positions, then I'd consider head to be at the higher position, not tail.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] asynchronous execution

2017-06-28 Thread Antonin Houska
Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:

> The patch got conflicted. This is a new version just rebased to
> the current master. Further amendment will be taken later.

Just one idea that I had while reading the code.

In ExecAsyncEventLoop you iterate estate->es_pending_async, then move the
complete requests to the end and finally adjust estate->es_num_pending_async
so that the array no longer contains the complete requests. I think the point
is that then you can add new requests to the end of the array.

I wonder if a set (Bitmapset) of incomplete requests would make the code more
efficient. The set would contain the position of each incomplete request in
estate->es_pending_async (I think it's the myindex field of
PendingAsyncRequest). If ExecAsyncEventLoop used this set to retrieve the
requests subject to ExecAsyncNotify etc., then the compaction of
estate->es_pending_async wouldn't be necessary.

ExecAsyncRequest would use the set to look for space for new requests by
iterating it and trying to find the first gap (which corresponds to a
completed request).

And finally, an item would be removed from the set at the moment the request
state is set to ASYNCREQ_COMPLETE.
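
Roughly like this (just a sketch; PendingAsyncRequest, es_pending_async and
ASYNCREQ_COMPLETE are the names from your patch, the rest is made up):

	Bitmapset  *incomplete = NULL;	/* positions of incomplete requests */
	int			pos;

	/* When a new request is registered at position "pos": */
	incomplete = bms_add_member(incomplete, pos);

	/* In ExecAsyncEventLoop, process only the incomplete requests: */
	pos = -1;
	while ((pos = bms_next_member(incomplete, pos)) >= 0)
	{
		PendingAsyncRequest *areq = estate->es_pending_async[pos];

		/* ... ExecAsyncNotify() etc. ... */

		if (areq->state == ASYNCREQ_COMPLETE)
			incomplete = bms_del_member(incomplete, pos);
	}

	/* In ExecAsyncRequest, a position missing from the set can be reused: */
	for (pos = 0; pos < estate->es_num_pending_async; pos++)
	{
		if (!bms_is_member(pos, incomplete))
			break;				/* free slot found */
	}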

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] asynchronous execution

2017-06-28 Thread Antonin Houska
Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:

> The patch got conflicted. This is a new version just rebased to
> the current master. Further amendment will be taken later.

Can you please explain this part of make_append() ?

/* Currently async on partitioned tables is not available */
Assert(nasyncplans == 0 || partitioned_rels == NIL);

I don't think the output of the Append plan is supposed to be ordered even if
the underlying relation is partitioned. Besides ordering, is there any other
reason not to use asynchronous execution?

And even if there were, the planner should ensure that the executor does not
hit the assertion above. The attached script shows an example of how to cause
the assertion failure.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

OPT_HOST="-h localhost"
USER=postgres

for x in $(psql $OPT_HOST -c "SELECT datname FROM pg_database WHERE datname ~ 'shard.*' " postgres -A -t)
do
	echo dropping $x
	dropdb $OPT_HOST $x
done;

createdb $OPT_HOST shard

psql $OPT_HOST -U $USER shard -c "
CREATE EXTENSION postgres_fdw;
CREATE TABLE orders(customer_id int, product_id int)
   PARTITION BY LIST (customer_id);"
createdb $OPT_HOST shard_0
psql $OPT_HOST -U $USER shard_0 -c "
CREATE TABLE orders_0 AS
SELECT trunc(2 * random())::int AS customer_id,
   trunc(100 * random())::int AS product_id
FROM generate_series(1, 5000);
ANALYZE;"
psql $OPT_HOST -U $USER shard -c "
CREATE SERVER server_0 FOREIGN DATA WRAPPER postgres_fdw
	OPTIONS (host 'localhost', dbname 'shard_0', fetch_size '1000',
		use_remote_estimate 'true', fdw_tuple_cost '10');
CREATE USER MAPPING FOR CURRENT_USER SERVER server_0 OPTIONS (user '$USER');
IMPORT FOREIGN SCHEMA public FROM SERVER server_0 INTO public;
ALTER TABLE orders ATTACH PARTITION orders_0 FOR VALUES IN (0, 1);"

createdb $OPT_HOST shard_1
psql $OPT_HOST -U $USER shard_1 -c "
CREATE TABLE orders_1 AS
SELECT trunc(2 * random())::int + 2 AS customer_id,
   trunc(100 * random())::int + 2 AS product_id
FROM generate_series(1, 5000);
ANALYZE;"
psql $OPT_HOST -U $USER shard -c "
CREATE SERVER server_1 FOREIGN DATA WRAPPER postgres_fdw
	OPTIONS (host 'localhost', dbname 'shard_1', fetch_size '1000',
		use_remote_estimate 'true', fdw_tuple_cost '10');
CREATE USER MAPPING FOR CURRENT_USER SERVER server_1 OPTIONS (user '$USER');
IMPORT FOREIGN SCHEMA public FROM SERVER server_1 INTO public;
ALTER TABLE orders ATTACH PARTITION orders_1 FOR VALUES IN (2, 3);"


psql $OPT_HOST -U $USER shard -c "SELECT * FROM orders"



Re: [HACKERS] snapbuild woes

2017-06-09 Thread Antonin Houska
Andres Freund <and...@anarazel.de> wrote:

> Looking at 0001:
> - GetOldestSafeDecodingTransactionId() only guarantees to return an xid
>   safe for decoding (note how procArray->replication_slot_catalog_xmin
>   is checked), not one for the initial snapshot - so afaics this whole
>   exercise doesn't guarantee much so far.

I happen to use CreateInitDecodingContext() in an extension, so I had to
think about what the new argument exactly means (as for the incompatibility
between PG 9.6.2 and 9.6.3, I suppose preprocessor directives can handle it).

One thing I'm failing to understand is: if TRUE is passed for
need_full_snapshot, then slot->effective_xmin receives the result of

GetOldestSafeDecodingTransactionId(need_full_snapshot)

but this does include "catalog xmin".

If the value returned is determined by an existing slot which has a valid
effective_catalog_xmin and an invalid effective_xmin (i.e. that slot only
protects catalog tables from VACUUM, but not the regular ones), then the new
slot will think it (i.e. the new slot) protects even non-catalog tables, but
that's not true.

Shouldn't the xmin_horizon be computed by this call instead?

GetOldestSafeDecodingTransactionId(!need_full_snapshot);

(If so, I think "considerCatalog" argument name would be clearer than
"catalogOnly".)

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] Partition-wise aggregation/grouping

2017-04-28 Thread Antonin Houska
Jeevan Chalke <jeevan.cha...@enterprisedb.com> wrote:

> On Thu, Apr 27, 2017 at 4:53 PM, Antonin Houska <a...@cybertec.at> wrote:
>
> > Robert Haas <robertmh...@gmail.com> wrote:
> > > Well, I'm confused. I see that there's a relationship between what
> > > Antonin is trying to do and what Jeevan is trying to do, but I can't
> > > figure out whether one is a subset of the other, whether they're both
> > > orthogonal, or something else. This plan looks similar to what I
> > > would expect Jeevan's patch to produce,

> >  The point is that the patch Jeevan wanted to work on is actually a subset 
> > of
> >  [1] combined with [2].

> Seems like, as you are targeting every relation whether or not it is
> partitioned.

Yes.

> With my patch, I am getting following plan where we push entire
> aggregation below append.
> 
> QUERY PLAN
> --
>  Append
>    ->  HashAggregate
>          Group Key: b_1.j
>          ->  Hash Join
>                Hash Cond: (b_1.j = c_1.k)
>                ->  Seq Scan on b_1
>                ->  Hash
>                      ->  Seq Scan on c_1
>    ->  HashAggregate
>          Group Key: b_2.j
>          ->  Hash Join
>                Hash Cond: (b_2.j = c_2.k)
>                ->  Seq Scan on b_2
>                ->  Hash
>                      ->  Seq Scan on c_2
> (15 rows)

I think this is not generic enough because the result of the Append plan can
be joined to another relation. As such a join can duplicate the
already-aggregated values, the aggregates should not be finalized below the
top-level plan.

> Antonin, I have tried applying your patch on master but it doesn't get
> apply. Can you please provide the HEAD and any other changes required
> to be applied first?

I've lost that information. I'll post a new version to the [1] thread asap.

> How the plan look like when GROUP BY key does not match with the
> partitioning key i.e. GROUP BY b.v ?

EXPLAIN (COSTS false)
SELECT  b.v, avg(b.v + c.v)
FROM    b
        JOIN
        c ON b.j = c.k
GROUP BY b.v;

   QUERY PLAN   

 Finalize HashAggregate
   Group Key: b_1.v
   ->  Append
 ->  Partial HashAggregate
   Group Key: b_1.v
   ->  Hash Join
 Hash Cond: (b_1.j = c_1.k)
 ->  Seq Scan on b_1
 ->  Hash
   ->  Seq Scan on c_1
 ->  Partial HashAggregate
   Group Key: b_2.v
   ->  Hash Join
 Hash Cond: (b_2.j = c_2.k)
 ->  Seq Scan on b_2
 ->  Hash
   ->  Seq Scan on c_2


> >  [1] https://www.postgresql.org/message-id/9666.1491295317%40localhost
> >
> >  [2] https://commitfest.postgresql.org/14/994/


-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] Partition-wise aggregation/grouping

2017-04-05 Thread Antonin Houska
Antonin Houska <a...@cybertec.at> wrote:
> 
> Jeevan Chalke <jeevan.cha...@enterprisedb.com> wrote:
> 
> > Our work will overlap when we are pushing down the aggregate on partitioned
> > base relation to its children/partitions.
> > 
> > I think you should continue working on pushing down aggregate onto the
> > joins/scans where as I will continue my work on pushing down aggregates to
> > partitions (joins as well as single table). Once we are done with these 
> > task,
> > then we may need to find a way to integrate them.
> > 
> > [1] 
> > https://www.postgresql.org/message-id/flat/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=ead...@mail.gmail.com#CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=ead...@mail.gmail.com
> 
> My patch also creates (partial) aggregation paths below the Append node,
> but only expects SeqScan as input. Please check whether your patch can be
> based on this or whether there's any conflict.

Well, I haven't imposed any explicit restriction on the kind of path to be
aggregated below the Append path. Maybe the only thing to do is to merge my
patch with the "partition-wise join" patch (which I haven't checked yet).

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] Partition-wise aggregation/grouping

2017-04-05 Thread Antonin Houska
The promised new version of my patch is here:

https://www.postgresql.org/message-id/9666.1491295317%40localhost

Jeevan Chalke <jeevan.cha...@enterprisedb.com> wrote:

> On Tue, Mar 21, 2017 at 1:47 PM, Antonin Houska <a...@cybertec.at> wrote:
> 
>  Jeevan Chalke <jeevan.cha...@enterprisedb.com> wrote:
> 
> IIUC, it seems that you are trying to push down the aggregation into the
> joining relations. So basically you are converting
> Agg -> Join -> {scan1, scan2} into
> FinalAgg -> Join -> {PartialAgg -> scan1, PartialAgg -> scan2}.
> In addition to that your patch pushes aggregates on base rel to its children,
> if any.
> 
> Where as what I propose here is pushing down aggregation below the append
> node keeping join/scan as is. So basically I am converting
> Agg -> Append-> Join -> {scan1, scan2} into
> Append -> Agg -> Join -> {scan1, scan2}.
> This will require partition-wise join as posted in [1].
> But I am planning to make this work for partitioned relations and not for
> generic inheritance.
> 
> I treat these two as separate strategies/paths to be consider while planning.
> 
> Our work will overlap when we are pushing down the aggregate on partitioned
> base relation to its children/partitions.
> 
> I think you should continue working on pushing down aggregate onto the
> joins/scans where as I will continue my work on pushing down aggregates to
> partitions (joins as well as single table). Once we are done with these task,
> then we may need to find a way to integrate them.
> 
> [1] 
> https://www.postgresql.org/message-id/flat/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=ead...@mail.gmail.com#CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=ead...@mail.gmail.com

My patch also creates (partial) aggregation paths below the Append node, but
only expects SeqScan as input. Please check whether your patch can be based
on this or whether there's any conflict.

(I'll probably be unable to respond before Monday 04/17.)

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise aggregation/grouping

2017-03-21 Thread Antonin Houska
Jeevan Chalke <jeevan.cha...@enterprisedb.com> wrote:

> Declarative partitioning is supported in PostgreSQL 10 and work is already in
> progress to support partition-wise joins. Here is a proposal for 
> partition-wise
> aggregation/grouping. Our initial performance measurement has shown a 7x
> performance improvement when partitions are on foreign servers and
> approximately 15% when partitions are local.
> 
> Partition-wise aggregation/grouping computes aggregates for each partition
> separately. If the group clause contains the partition key, all the rows
> belonging to a given group come from one partition, thus allowing aggregates
> to be computed completely for each partition. Otherwise, partial aggregates
> computed for each partition are combined across the partitions to produce the
> final aggregates. This technique improves performance because:

> i. When partitions are located on foreign server, we can push down the
> aggregate to the foreign server.

> ii. If hash table for each partition fits in memory, but that for the whole
> relation does not, each partition-wise aggregate can use an in-memory hash
> table.

> iii. Aggregation at the level of partitions can exploit properties of
> partitions like indexes, their storage etc.

I suspect this overlaps with

https://www.postgresql.org/message-id/29111.1483984605%40localhost

I'm working on the next version of the patch, which will be able to aggregate
the result of both base relation scans and joins. I'm trying hard to make the
next version available before an urgent vacation that I'll have to take at a
random date between today and early April. I suggest that we coordinate the
effort; it's a lot of work in any case.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parameterization of partial path

2017-02-10 Thread Antonin Houska
Robert Haas <robertmh...@gmail.com> wrote:

> On Thu, Feb 9, 2017 at 12:36 PM, Antonin Houska <a...@cybertec.at> wrote:
> > When looking at try_partial_hashjoin_path and try_partial_nestloop_path
> > functions I'm failing to understand the comment "Parameterized partial paths
> > are not supported.".
> >
> > It's clear to me that GatherPath can't have parameters because repeated
> > execution of parallel plan with adjusted parameters is not supported.
> 
> Actually, I think in theory that is fine.  You'd start up and shut
> down workers for every execution, which is likely to stink in terms of
> performance, but it should work OK.  The only problem is that it'll
> only work if you pass the values of the parameters down to the worker
> processes, which the code currently doesn't. Bind parameters sent by
> the user are passed down, but there's no provision to pass down
> parameters populated at execution time.  Amit has written code for
> that, though, and it's been posted here.  It just hasn't gotten into
> the tree yet for one reason or another.

OK, I also thought it was about a missing implementation rather than an issue
of principle.

> > But the
> > fact that generic partial path has parameters should not be a problem if 
> > those
> > parameters are satisfied below the nearest GatherPath node higher in the 
> > plan
> > tree. Am I missing something about the concept?
> 
> Yes, I think you're missing a key point.  A parameterized path would
> normally be used for a query like small LJ (big1 IJ big2 ON big1.x =
> big2.x) ON big1.y = small.y.  The plan might look like this:

Thanks for the detailed explanation. I think the point missing from my thoughts
was that the only way to satisfy the parameters of a relation (so that Gather
does not have to pass any parameters) is to put the relation on the inner side
of a join. However, the inner side must provide the whole result set, not just
part of it, or else the result is wrong.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Parameterization of partial path

2017-02-09 Thread Antonin Houska
When looking at try_partial_hashjoin_path and try_partial_nestloop_path
functions I'm failing to understand the comment "Parameterized partial paths
are not supported.".

It's clear to me that GatherPath can't have parameters because repeated
execution of parallel plan with adjusted parameters is not supported. But the
fact that generic partial path has parameters should not be a problem if those
parameters are satisfied below the nearest GatherPath node higher in the plan
tree. Am I missing something about the concept?

Actually this check

if (!bms_is_subset(inner_paramrels, outer_path->parent->relids))
   return;

in try_partial_nestloop_path only rejects the special case where the inner path
references relations outside the join, but it still allows the outer path to
have outside parameters.

As for try_partial_hashjoin_path, the comment "If the inner path is
parameterized ..." seems to have been copied and pasted from
try_partial_nestloop_path, but I think it does not fit a hash join. From
hash_inner_and_outer I judge that neither side of a hash join can be
parameterized by the other:

/*
 * If either cheapest-total path is parameterized by the other rel, we
 * can't use a hashjoin.  (There's no use looking for alternative
 * input paths, since these should already be the least-parameterized
 * available paths.)
 */
if (PATH_PARAM_BY_REL(cheapest_total_outer, innerrel) ||
PATH_PARAM_BY_REL(cheapest_total_inner, outerrel))
return;

Shouldn't try_partial_hashjoin_path and try_partial_nestloop_path do just the
same checks that try_hashjoin_path and try_nestloop_path do respectively?
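
For illustration, the kind of check I have in mind would be something like
this (a sketch only - I'm using the paths' parent rels where
hash_inner_and_outer uses outerrel/innerrel, so the exact spelling may
differ):

	/*
	 * Sketch: reject parameterization by the other side of the join,
	 * analogously to the hash_inner_and_outer test quoted above.
	 */
	if (PATH_PARAM_BY_REL(outer_path, inner_path->parent) ||
		PATH_PARAM_BY_REL(inner_path, outer_path->parent))
		return;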

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_bsd_indent: implement -lps ("leave preprocessor space")

2017-02-09 Thread Antonin Houska
Andres Freund <and...@anarazel.de> wrote:

> On 2017-02-07 23:30:44 -0500, Tom Lane wrote:
> > Piotr Stefaniak <postg...@piotr-stefaniak.me> writes:
> > > this is a patch that Andres asked me for. It makes pg_bsd_indent leave
> > > preprocessor space alone, as in this example:
> > 
> > > #if 0
> > > #  if 0
> > > # if 0
> > > #   error
> > > # endif
> > > #  endif
> > > #else
> > > #  line 7
> > > #endif
> 
> For context: I'd asked Piotr how difficult it'd be to add this.
> 
> > Um ... but the point of pgindent is to standardize spacing, not to let
> > people invent their own style. If you wanted to have a discussion about
> > whether pgindent should force preprocessor directives to look like the
> > above, we could talk about that.  But I do not want to be reading code that
> > looks like the above in one place and code that does not ten lines away.
> 
> I don't think that's something easily done in an automatic
> manner. Because you'd e.g. obviously not want to indent everything
> within include guards.   I don't really buy the danger of large
> divergences in code nearby - this seems mostly useful when writing a bit
> more complicated macros, and I don't think they'll be that frequently
> added in existing files.
> 
> I do think this makes the nesting for #ifdefs a *lot* more readable, and
> we have plenty of cases where it's currently really hard to discern what
> "branch" one is currently reading.  Allowing to opt-in into the "newer"
> formatting in places where it makes sense, seems reasonable to me.

As an alternative approach, the formatting could be implemented in elisp and
added to src/tools/editors/emacs.samples. In particular, I mean one function to
make the code human-readable and another one to turn it back into the concise
style. The user would call the function only on a text selection (region), as
opposed to the whole file.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous execution

2017-02-03 Thread Antonin Houska
Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:

> I noticed that this patch is conflicting with 665d1fa (Logical
> replication) so I rebased this. Only executor/Makefile
> conflicted.

I was lucky enough to see an infinite loop when using this patch, which I
fixed by this change:

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 588ba18..9b87fbd
*** a/src/backend/executor/execAsync.c
--- b/src/backend/executor/execAsync.c
*** ExecAsyncEventWait(EState *estate, long
*** 364,369 
--- 364,370 
  
if ((w->events & WL_LATCH_SET) != 0)
{
+   ResetLatch(MyLatch);
process_latch_set = true;
continue;
}

Actually it is only _almost_ fixed, because at some point one of the following

Assert(areq->state == ASYNC_WAITING);

statements fired. I think it was the one immediately following, but I can
imagine the same happening in the branch

if (process_latch_set)
...

I think the wants_process_latch field of PendingAsyncRequest is not useful
alone because the process latch can be set for reasons completely unrelated to
the asynchronous processing. If the asynchronous node is to use the latch to
signal its readiness, I think an additional flag is needed in the request
which tells ExecAsyncEventWait that the latch was set by the asynchronous
node.
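
Just to illustrate what I mean (a rough sketch - the field name and its
placement are invented here, they are not part of the posted patch):

	typedef struct PendingAsyncRequest
	{
		/* ... the fields of the posted patch ... */
		bool	latch_set_by_node;	/* latch set by the async node itself? */
	} PendingAsyncRequest;

	/* and in ExecAsyncEventWait(), after resetting the latch: */
	if (areq->latch_set_by_node)
		process_latch_set = true;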

BTW, do we really need the ASYNC_CALLBACK_PENDING state? I can imagine the
async node either changing ASYNC_WAITING directly to ASYNC_COMPLETE, or leaving
it as ASYNC_WAITING if the data is not ready.


In addition, the following comments are based only on code review, I didn't
verify my understanding experimentally:

* Isn't it possible for AppendState.as_asyncresult to contain multiple
  responses from the same async node? Since the array stores TupleTableSlot
  instead of the actual tuple (so multiple items of as_asyncresult point to
  the same slot), I suspect the slot contents might not be defined when the
  Append node eventually tries to return it to the upper plan.

* For the WaitEvent subsystem to work, I think postgres_fdw should keep a
  separate libpq connection per node, not per user mapping. Currently the
  connections are cached by user mapping, but it's legal to locate multiple
  child postgres_fdw nodes of Append plan on the same remote server. I expect
  that these "co-located" nodes would currently use the same user mapping and
  therefore the same connection.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-27 Thread Antonin Houska
I thought about the patch from the perspective of "grouped relations"
(especially [1]). When looking for the appropriate context within the thread,
I picked this message.

David Rowley <david.row...@2ndquadrant.com> wrote:

> On 12 March 2016 at 11:43, Tom Lane <t...@sss.pgh.pa.us> wrote:
> >
> > It seems like the major intellectual complexity here is to figure out
> > how to detect inner-side-unique at reasonable cost.  I see that for
> > LEFT joins you're caching that in the SpecialJoinInfos, which is probably
> > fine.  But for INNER joins it looks like you're just doing it over again
> > for every candidate join, and that seems mighty expensive.

> ... I'll look into that.
> 
> The other thing I thought of was to add a dedicated list for unique
> indexes in RelOptInfo, this would also allow
> rel_supports_distinctness() to do something a bit smarter than just
> return false if there's no indexes. That might not buy us much though,
> but at least relations tend to have very little unique indexes, even
> when they have lots of indexes.

I'm thinking of a concept of "unique keys", similar to the path keys that the
planner already uses. Besides the current evaluation of the uniqueness of the
inner side of a join, the planner would (kind of) union the unique keys of the
joined rels, i.e. compute a list of expressions which generates a unique row
throughout the new join result. (The requirement is that each key must be
usable in a join expression, as opposed to a filter.)

To figure out whether at most one inner row exists per outer row, each unique
key of the inner relation which references the outer relation needs to match
a unique key of the outer relation (but it's probably wrong if multiple
unique keys of the inner rel reference the same key of the outer rel).

Like a path key, the unique key would also point to an equivalence class. Thus
mere equality of the EC pointers could perhaps be used to evaluate the match
between the inner and outer keys.
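
To make this a bit more concrete, a minimal sketch of such a structure (the
name and the fields are only illustrative, nothing of this exists yet):

	/* One "unique key" of a relation, resembling PathKey. */
	typedef struct UniqueKey
	{
		NodeTag		type;
		EquivalenceClass *uk_eclass;	/* expressions forming the key */
		Oid			uk_opfamily;	/* opfamily used to prove uniqueness */
	} UniqueKey;

Each RelOptInfo would then carry a list of these, and the join code would
match the inner and outer lists as described above.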

Given that rel_is_distinct_for() currently does not accept joins, this change
would make the patch more generic. (BTW, with this approach, unique_rels and
non_unique_rels caches would have to be stored per-relation (RelOptInfo), as
opposed to PlannerInfo.)

The reason I'd like the unique keys is that - from the "grouped relation"
point of view - the relation uniqueness also needs to be checked against the
GROUP BY clause. Thus the "unique keys" concept seems to me like a useful
abstraction.

Does this proposal seem to have a serious flaw?

[1]
https://www.postgresql.org/message-id/CAKJS1f_h1CLff92B%3D%2BbdrMK2Nf3EfGWaJu2WbzQUYcSBUi02ag%40mail.gmail.com

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-01-25 Thread Antonin Houska
David Rowley <david.row...@2ndquadrant.com> wrote:
> On 19 January 2017 at 11:06, David Rowley <david.row...@2ndquadrant.com> 
> wrote:
> > Old patch no longer applies, so I've attached a rebased patch. This
> > also re-adds a comment line which I mistakenly removed.
> 
> (meanwhile Andres commits 69f4b9c)
> 
> I should've waited a bit longer.
> 
> Here's another that fixes the new conflicts.

I suspect that "inner" and "outer" relation / tuple are sometimes confused in
comments:


* analyzejoins.c:70

 "searches for subsequent matching outer tuples."


* analyzejoins.c:972

/*
 * innerrel_is_unique
 *Check for proofs which prove that 'innerrel' can, at most, match a
 *single tuple in 'outerrel' based on the join condition in
 *'restrictlist'.
 */


* relation.h:1831

boolinner_unique;   /* inner side of join matches no more 
than one
         * outer side 
tuple */


-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PoC: Grouped base relation

2017-01-25 Thread Antonin Houska
David Rowley <david.row...@2ndquadrant.com> wrote:

> On 20 January 2017 at 00:22, Antonin Houska <a...@cybertec.at> wrote:
> > Sorry, it was my thinko - I somehow confused David's CROSS JOIN example with
> > this one. If one side of the join clause is unique and the other becomes
> > unique due to aggregation (and if parallel processing is not engaged) then
> > neither combinefn nor multiplyfn should be necessary before the finalfn.
> 
> Yes, if the join can be detected not to duplicate the groups then a
> normal aggregate node can be pushed below the join. No need for
> Partial Aggregate, or Finalize Aggregate nodes.
> 
> I've a pending patch in the commitfest named "Unique Joins", which
> aims teach the planner about the unique properties of joins. So you
> should just have both stages of aggregation occur for now, and that
> can be improved on once the planner is a bit smart and knows about
> unique joins.

Thanks for the hint. I hadn't paid attention to the "Unique Joins" patch until
today. Yes, that's definitely useful.

Given the progress of your patch, I'm not worried about making the next version
of my patch depend on it. Implementing a temporary solution for the aggregation
push-down seems like wasted effort to me.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PoC: Grouped base relation

2017-01-19 Thread Antonin Houska
Ashutosh Bapat <ashutosh.ba...@enterprisedb.com> wrote:

> >> 1. Pushing down aggregates/groups down join tree, so that the number of 
> >> rows
> >> to be joined decreases.  This might be a good optimization to have. However
> >> there are problems in the current patch. Every path built for a relation
> >> (join or base) returns the same result expressed by the relation or its
> >> subset restricted by parameterization or unification. But this patch 
> >> changes
> >> that. It creates paths which represent grouping in the base relation.  I
> >> think, we need a separate relation to represent that result and hold paths
> >> which produce that result. That itself would be a sizable patch.
> >
> > Whether a separate relation (RelOptInfo) should be created for grouped
> > relation is an important design decision indeed. More important than your
> > argument about the same result ("partial path", used to implement parallel
> > nodes actually does not fit this criterion perfectly - it only returns part 
> > of
> > the set) is the fact that the data type (target) differs.
> > I even spent some time coding a prototype where separate RelOptInfo is 
> > created
> > for the grouped relation but it was much more invasive. In particular, if 
> > only
> > some relations are grouped, it's hard to join them with non-grouped ones w/o
> > changing make_rel_from_joinlist and subroutines substantially. (Decision
> > whether the plain or the grouped relation should be involved in joining 
> > makes
> > little sense at the leaf level of the join tree.)
> >
> > So I took the approach that resembles the partial paths - separate pathlists
> > within the same RelOptInfo.

> Yes, it's hard, but I think without having a separate RelOptInfo the
> design won't be complete. Is there a subset of problem that can be
> solved by using a separate RelOptInfo e.g. pushing aggregates down
> child relations or anything else.

I'm still not convinced that all the fields of RelOptInfo (typically relids)
need to be duplicated. If the current concept should be improved, I'd move all
the grouping related fields to a separate structure, e.g. GroupPathInfo, and
let RelOptInfo point to it. Similar to ParamPathInfo, which contains
parameterization-specific information, GroupPathInfo would contain the
grouping-specific information: target, row count, width, maybe path lists too.
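
Just to illustrate (a sketch only - the fields are guessed from the
description above, not taken from any existing code):

	typedef struct GroupPathInfo
	{
		NodeTag		type;
		PathTarget *gpi_target;		/* grouped target of the relation */
		double		gpi_rows;		/* estimated number of groups */
		int			gpi_width;		/* estimated width of a grouped row */
		List	   *gpi_pathlist;	/* grouped paths */
		List	   *gpi_partial_pathlist;	/* partially grouped paths */
	} GroupPathInfo;

RelOptInfo itself would only gain a single pointer to this structure, NULL if
the relation cannot be grouped.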

> >
> >> 2. Try to push down aggregates based on the equivalence classes, where
> >> grouping properties can be transferred from one relation to the other using
> >> EC mechanism.
> >
> > I don't think the EC part should increase the patch complexity a lot. 
> > Unless I
> > missed something, it's rather isolated to the part where target of the 
> > grouped
> > paths is assembled. And I think it's important even for initial version of 
> > the
> > patch.
> >
> >> This seems to require solving the problem of combining aggregates across 
> >> the
> >> relations. But there might be some usecases which could benefit without
> >> solving this problem.
> >
> > If "combining aggregates ..." refers to joining grouped relations, then I
> > insist on doing this in the initial version of the new feature too. 
> > Otherwise
> > it'd only work if exactly one base relation of the query is grouped.
> 
> No. "combining aggregates" refers to what aggtransmultifn does. But,
> possibly that problem needs to be solved in the first step itself.

OK. As the discussion goes on, I see that this part could be more useful than
I originally thought. I'll consider it.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PoC: Grouped base relation

2017-01-19 Thread Antonin Houska
Antonin Houska <a...@cybertec.at> wrote:

Well, the following one does not seem to be a typical example. I could
generate the plan, but now I think that the aggregation push-down does not, in
general, decrease the number of groups the final aggregation has to
process. Maybe I just hit a planner limitation in estimating the number of
groups within an append relation.

> For this query
> 
> SELECT
> p.id, sum(price)
> FROM
> products AS p
> JOIN sales AS s ON s.product_id = p.id
> GROUP BY
> p.id
> 
> I get this plan at "normal circumstances"
> 
>  HashAggregate
>Group Key: p.id
>->  Hash Join
>  Hash Cond: (s.product_id = p.id)
>  ->  Gather
>Workers Planned: 2
>->  Append
>  ->  Parallel Seq Scan on sales s
>  ->  Parallel Seq Scan on sales_2015 s_1
>  ->  Parallel Seq Scan on sales_2016 s_2
>  ->  Parallel Seq Scan on sales_2017 s_3
>  ->  Hash
>->  Gather
>  Workers Planned: 2
>  ->  Append
>->  Parallel Seq Scan on products p
>->  Parallel Seq Scan on products_01 p_1
>->  Parallel Seq Scan on products_02 p_2
>->  Parallel Seq Scan on products_03 p_3
>->  Parallel Seq Scan on products_04 p_4
> 
> 
> but if work_mem is sufficiently low for the hash join to be efficient, the
> aggregation can be moved to individual partitions.
> 
>  Gather
>Workers Planned: 1
>Single Copy: true
>->  Finalize HashAggregate
>  Group Key: p.id
>  ->  Hash Join
>Hash Cond: (p.id = s.product_id)
>->  Append
>  ->  Partial HashAggregate
>Group Key: p.id
>->  Seq Scan on products p
>  ->  Partial HashAggregate
>Group Key: p_1.id
>->  Seq Scan on products_01 p_1
>  ->  Partial HashAggregate
>Group Key: p_2.id
>->  Seq Scan on products_02 p_2
>  ->  Partial HashAggregate
>Group Key: p_3.id
>->  Seq Scan on products_03 p_3
>  ->  Partial HashAggregate
>Group Key: p_4.id
>->  Seq Scan on products_04 p_4
>->  Hash
>  ->  Append
>->  Partial HashAggregate
>  Group Key: s.product_id
>  ->  Seq Scan on sales s
>->  Partial HashAggregate
>  Group Key: s_1.product_id
>  ->  Seq Scan on sales_2015 s_1
>->  Partial HashAggregate
>  Group Key: s_2.product_id
>  ->  Seq Scan on sales_2016 s_2
>->  Partial HashAggregate
>  Group Key: s_3.product_id
>  ->  Seq Scan on sales_2017 s_3

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PoC: Grouped base relation

2017-01-19 Thread Antonin Houska
Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> On 01/17/2017 08:05 PM, Antonin Houska wrote:
> > Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> >

> >
> >> Another thing is that in my experience most queries do joins on foreign 
> >> keys
> >> (so the PK side is unique by definition), so the benefit on practical 
> >> examples
> >> is likely much smaller.
> >
> > OK. So in some cases David's approach might be better.
> >
> In which cases would David's approach be more efficient? But even if there are
> such cases, I assume we could generate both paths and decide based on cost,
> just like with all other alternative paths.

Sorry, it was my thinko - I somehow confused David's CROSS JOIN example with
this one. If one side of the join clause is unique and the other becomes
unique due to aggregation (and if parallel processing is not engaged) then
neither combinefn nor multiplyfn should be necessary before the finalfn.

> > However I think the ability to join 2 grouped (originally non-unique)
> > relations is still important. Consider a query involving "sales" as well as
> > another table which also has many-to-one relationship to "products".
> >
> 
> Well, can you give a practical example? What you describe seems like a
> combination of two fact tables + a dimension, something like this:
> 
> CREATE TABLE products (
> idSERIAL PRIMARY KEY,
> name  TEXT,
> category_id   INT,
> producer_id   INT
> );
> 
> CREATE TABLE sales (
> product_idREFERENCES products (id),
> nitemsINT,
> price NUMERIC
> );
> 
> CREATE TABLE reviews (
> product_idREFERENCES products (id),
> stars INT
> );
> 
> But how exactly do you join that together? Because
> 
> SELECT * FROM products p JOIN sales s ON (p.id = s.product_id)
>  JOIN reviews r ON (p.id = r.product_id)
> 
> is clearly wrong - it's essentially M:N join between the two fact tables,
> increasing the number of rows.

Without elaborating the details, I imagined a join condition between the two
fact tables that references their non-unique columns, but that does not make
sense here. Sorry.

> It'd helpful to have an example of a practical query optimized by the
> patch. I'm not claiming it does not exist, but I've been unable to come up
> with something reasonable at the moment.

As I mentioned elsewhere, remote aggregation is the primary use case.

Besides that, I can imagine another one - join of partitioned tables (costs
are not displayed just to make the plan easier to read).

For this query

SELECT
p.id, sum(price)
FROM
products AS p
JOIN sales AS s ON s.product_id = p.id
GROUP BY
p.id

I get this plan at "normal circumstances"

 HashAggregate
   Group Key: p.id
   ->  Hash Join
 Hash Cond: (s.product_id = p.id)
 ->  Gather
   Workers Planned: 2
   ->  Append
 ->  Parallel Seq Scan on sales s
 ->  Parallel Seq Scan on sales_2015 s_1
 ->  Parallel Seq Scan on sales_2016 s_2
 ->  Parallel Seq Scan on sales_2017 s_3
 ->  Hash
   ->  Gather
 Workers Planned: 2
 ->  Append
   ->  Parallel Seq Scan on products p
   ->  Parallel Seq Scan on products_01 p_1
   ->  Parallel Seq Scan on products_02 p_2
   ->  Parallel Seq Scan on products_03 p_3
   ->  Parallel Seq Scan on products_04 p_4


but if work_mem is sufficiently low for the hash join to be efficient, the
aggregation can be moved to individual partitions.

 Gather
   Workers Planned: 1
   Single Copy: true
   ->  Finalize HashAggregate
 Group Key: p.id
 ->  Hash Join
   Hash Cond: (p.id = s.product_id)
   ->  Append
 ->  Partial HashAggregate
   Group Key: p.id
   ->  Seq Scan on products p
 ->  Partial HashAggregate
   Group Key: p_1.id
   ->  Seq Scan on products_01 p_1
 ->  Partial HashAggregate
   Group Key: p_2.id
   ->  Seq Scan on products_02 p_2
 ->  Partial HashAggregate
   Group Key: p_3.id
   ->  Seq Scan on products_03 p_3
 ->  Partial HashAggregate
   Group Key: p_4.id
   ->  Seq Scan on products_04 p_4
->  Hash
  ->  Append
->  Partial HashAggregate
  Group Key: s.product_id
  ->  Seq Scan on sales s
->  Partial HashAggregate
  Group Key: s_1.product_id
  ->  Seq Scan on sales_2015 s_1
->  Partial HashAggregate
  Group Key: s_2.product_id
  ->  Seq Scan on sales_2016 s_2
->  Partial HashAggregate
  Group Key: s_3.product_id
  ->  Seq Scan on sales_2017 s_3

Re: [HACKERS] PoC: Grouped base relation

2017-01-17 Thread Antonin Houska
[ Trying to respond to both Tomas and David. I'll check tomorrow if anything
else of the thread needs my comment. ]

Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:

> On 01/17/2017 12:42 AM, David Rowley wrote:
> > On 10 January 2017 at 06:56, Antonin Houska <a...@cybertec.at> wrote:

> > I've been thinking about this aggtransmultifn and I'm not sure if it's
> > really needed. Adding a whole series of new transition functions is
> > quite a pain. At least I think so, and I have a feeling Robert might
> > agree with me.
> >
> > Let's imagine some worst case (and somewhat silly) aggregate query:
> >
> > SELECT count(*)
> > FROM million_row_table
> > CROSS JOIN another_million_row_table;
> >
> > Today that's going to cause 1 TRILLION transitions! Performance will
> > be terrible.
> >
> > If we pushed the aggregate down into one of those tables and performed
> > a Partial Aggregate on that, then a Finalize Aggregate on that single
> > row result (after the join), then that's 1 million transfn calls, and
> > 1 million combinefn calls, one for each row produced by the join.
> >
> > If we did it your way (providing I understand your proposal correctly)
> > there's 1 million transfn calls on one relation, then 1 million on the
> > other and then 1 multiplyfn call. which does 100 * 100
> >
> > What did we save vs. using the existing aggcombinefn infrastructure
> > which went into 9.6? Using this actually costs us 1 extra function
> > call, right? I'd imagine the size of the patch to use aggcombinefn
> > instead would be a fraction of the size of the one which included all
> > the new aggmultiplyfns and pg_aggregate.h changes.
> >

> I think the patch relies on the assumption that the grouping reduces
> cardinality,

Yes.

> so a CROSS JOIN without a GROUP BY clause may not be the best
> counterexample.

Yet it tells me that my approach is not ideal in some cases ...

> > It sounds like a much more manageable project by using aggcombinefn
> > instead. Then maybe one day when we can detect if a join did not cause
> > any result duplication (i.e Unique Joins), we could finalise the
> > aggregates on the first call, and completely skip the combine state
> > altogether.
> >

> I don't quite see how the patch could use aggcombinefn without sacrificing a
> lot of the benefits. Sure, it's possible to run the aggcombinefn in a loop
> (with number of iterations determined by the group size on the other side of
> the join), but that sounds pretty expensive and eliminates the reduction of
> transition function calls. The join cardinality would still be reduced,
> though.

That's what I think. The generic case is that neither side of the join is
unique. If it appears that both relations should be aggregated below the join,
aggcombinefn would have to be called multiple times on each output row of the
join to achieve the same result as calling aggmultiplyfn once.
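
To illustrate on the simplest aggregate (an example only, no such function
exists in the patch or in core): if a partial count c from the aggregated side
meets a group of n rows on the other side of the join, n combinefn calls would
add c to the result n times, whereas one multiplying step computes the same
value at once:

	#include <stdint.h>

	/* Illustration of what an "aggmultiplyfn" for count() would compute. */
	static int64_t
	count_multiply(int64_t transvalue, int64_t nrows)
	{
		/* replaces nrows combine-function calls, each of which adds transvalue */
		return transvalue * nrows;
	}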

> I do have other question about the patch, however. It seems to rely on the
> fact that the grouping and joins both reference the same columns. I wonder how
> uncommon such queries are.
> 
> To give a reasonable example, imagine the typical star schema, which is
> pretty standard for large analytical databases. A dimension table is
> "products" and the fact table is "sales", and the schema might look like this:
> 
> CREATE TABLE products (
> idSERIAL PRIMARY KEY,
> name  TEXT,
> category_id   INT,
> producer_id   INT
> );
> 
> CREATE TABLE sales (
> product_idREFERENCES products (id),
> nitemsINT,
> price NUMERIC
> );
> 
> A typical query then looks like this:
> 
> SELECT category_id, SUM(nitems), SUM(price)
> FROM products p JOIN sales s ON (p.id = s.product_id)
> GROUP BY p.category_id;
> 
> which obviously uses different columns for the grouping and join, and so the
> patch won't help with that. Of course, a query grouping by product_id would
> allow the patch to work

Right, the current version does not handle this. Thanks for the suggestion.

> Another thing is that in my experience most queries do joins on foreign keys
> (so the PK side is unique by definition), so the benefit on practical examples
> is likely much smaller.

OK. So in some cases David's approach might be better.

However, I think the ability to join two grouped (originally non-unique)
relations is still important. Consider a query involving "sales" as well as
another table which also has a many-to-one relationship to "products".

> But I guess my main question is if there are actual examples of queries the

Re: [HACKERS] PoC: Grouped base relation

2017-01-17 Thread Antonin Houska
Ashutosh Bapat <ashutosh.ba...@enterprisedb.com> wrote:
> [... snip ]]
>
> This all works well, as long as the aggregate is "summing" something
> across rows. The method doesn't work when aggregation is say
> "multiplying" across the rows or "concatenating" across the rows like
> array_agg() or string_agg(). They need a different strategy to combine
> aggregates across relations.

Good point. The common characteristic of these seems to be that they don't
have aggcombinefn defined.

> IIUC, we are trying to solve multiple problems here:

> 1. Pushing down aggregates/groups down join tree, so that the number of rows
> to be joined decreases.  This might be a good optimization to have. However
> there are problems in the current patch. Every path built for a relation
> (join or base) returns the same result expressed by the relation or its
> subset restricted by parameterization or unification. But this patch changes
> that. It creates paths which represent grouping in the base relation.  I
> think, we need a separate relation to represent that result and hold paths
> which produce that result. That itself would be a sizable patch.

Whether a separate relation (RelOptInfo) should be created for the grouped
relation is indeed an important design decision. More important than your
argument about the same result (a "partial path", used to implement parallel
nodes, actually does not fit this criterion perfectly - it only returns part of
the set) is the fact that the data type (target) differs.

I even spent some time coding a prototype where separate RelOptInfo is created
for the grouped relation but it was much more invasive. In particular, if only
some relations are grouped, it's hard to join them with non-grouped ones w/o
changing make_rel_from_joinlist and subroutines substantially. (Decision
whether the plain or the grouped relation should be involved in joining makes
little sense at the leaf level of the join tree.)

So I took the approach that resembles the partial paths - separate pathlists
within the same RelOptInfo.

> 2. Try to push down aggregates based on the equivalence classes, where
> grouping properties can be transferred from one relation to the other using
> EC mechanism.

I don't think the EC part should increase the patch complexity a lot. Unless I
missed something, it's rather isolated to the part where target of the grouped
paths is assembled. And I think it's important even for initial version of the
patch.

> This seems to require solving the problem of combining aggregates across the
> relations. But there might be some usecases which could benefit without
> solving this problem.

If "combining aggregates ..." refers to joining grouped relations, then I
insist on doing this in the initial version of the new feature too. Otherwise
it'd only work if exactly one base relation of the query is grouped.

> 3. If the relation to which we push the aggregate is an append relation,
> push (partial) aggregation/grouping down into the child relations. - We
> don't do that right now even for grouping aggregation on a single append
> table. Parallel partial aggregation does that, but not exactly per
> relation. That may be a sizable project in itself.  Even without this piece
> the rest of the optimizations proposed by this patch are important.

Yes, this can be done in a separate patch. I'll consider it.

> 4. Additional goal: push down the aggregation to any relation (join/base)
> where it can be computed.

I think this can be achieved by adding extra aggregation nodes to the join
tree. As I still anticipate more important design changes, this part is not at
the top of my TODO list.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Slow I/O can break throttling of base backup

2016-12-16 Thread Antonin Houska
Antonin Houska <a...@cybertec.at> wrote:

> It seems to be my bug. I'll check tomorrow.

I could reproduce the problem by adding sufficient sleep time to the
loop.

> Magnus Hagander <mag...@hagander.net> wrote:
>> I wonder if the else if (sleep > 0) at the bottom of throttle() should just
>> be a simple else and always run, resetting last_throttle?

I agree. In fact, I could simplify the code even more.

Since (elapsed + sleep) almost equals GetCurrentIntegerTimestamp(), we can
use the following statement unconditionally (I think I tried too hard to avoid
calling GetCurrentIntegerTimestamp too often in the original patch):

throttled_last = GetCurrentIntegerTimestamp();

Thus we can also get rid of the "else" branch that clears both "sleep" and
"wait_result".

(The patch contains a minor comment refinement that I found useful when
revisiting the code after years.)

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
new file mode 100644
index ffc7e58..40b3c11
*** a/src/backend/replication/basebackup.c
--- b/src/backend/replication/basebackup.c
*** throttle(size_t increment)
*** 1370,1395 
  		if (wait_result & WL_LATCH_SET)
  			CHECK_FOR_INTERRUPTS();
  	}
- 	else
- 	{
- 		/*
- 		 * The actual transfer rate is below the limit.  A negative value
- 		 * would distort the adjustment of throttled_last.
- 		 */
- 		wait_result = 0;
- 		sleep = 0;
- 	}
  
  	/*
! 	 * Only a whole multiple of throttling_sample was processed. The rest will
! 	 * be done during the next call of this function.
  	 */
  	throttling_counter %= throttling_sample;
  
! 	/* Once the (possible) sleep has ended, new period starts. */
! 	if (wait_result & WL_TIMEOUT)
! 		throttled_last += elapsed + sleep;
! 	else if (sleep > 0)
! 		/* Sleep was necessary but might have been interrupted. */
! 		throttled_last = GetCurrentIntegerTimestamp();
  }
--- 1370,1385 
  		if (wait_result & WL_LATCH_SET)
  			CHECK_FOR_INTERRUPTS();
  	}
  
  	/*
! 	 * As we work with integers, only whole multiple of throttling_sample was
! 	 * processed. The rest will be done during the next call of this function.
  	 */
  	throttling_counter %= throttling_sample;
  
! 	/*
! 	 * Time interval for the remaining amount and possible next increments
! 	 * starts now.
! 	 */
! 	throttled_last = GetCurrentIntegerTimestamp();
  }

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Slow I/O can break throttling of base backup

2016-12-15 Thread Antonin Houska
It seems to be my bug. I'll check tomorrow.

Magnus Hagander <mag...@hagander.net> wrote:

> Running pg_basebackup with a throttling of say 10M runs it into the risk of
> the I/O on the server actually being slower than pg_basebackup (I have
> reproduced similar issues on fake-slow disks with lower rate limits).
> 
> What happens in this case in basebackup.c is that the value for "sleep"
> comes out negative. This means we don't sleep, which is fine.
> 
> However, that also means we don't set throttle_last.
> 
> That means that the next time we come around to throttle(), the value for
> "elapsed" is even bigger, which results in an even bigger negative number,
> and we're "stuck".
> 
> AFAICT this means that a temporary I/O spike that makes reading of the disk
> too slow can leave us in a situation where we never recover, and thus never
> end up sleeping ever again, effectively turning off rate limiting.
> 
> I wonder if the else if (sleep > 0) at the bottom of throttle() should just
> be a simple else and always run, resetting last_throttle?
> 
> -- 
> Magnus Hagander
> Me: http://www.hagander.net/
> Work: http://www.redpill-linpro.com/
> 

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Possibly too stringent Assert() in b-tree code

2016-09-19 Thread Antonin Houska
I've recently seen this when using 9.6:

#0  0x7f147892f0c7 in raise () from /lib64/libc.so.6
#1  0x7f1478930478 in abort () from /lib64/libc.so.6
#2  0x009683a1 in ExceptionalCondition (conditionName=0x9f2ef8 
"!(((PageHeader) (page))->pd_special >= (__builtin_offsetof (PageHeaderData, 
pd_linp)))", errorType=0x9f2e8a "FailedAssertion",
fileName=0x9f2e60 "../../../../src/include/storage/bufpage.h", 
lineNumber=314) at assert.c:54
#3  0x004ee3d2 in PageValidateSpecialPointer (page=0x7f14700f3480 "") 
at ../../../../src/include/storage/bufpage.h:314
#4  0x004ef913 in _bt_getbuf (rel=0x7f14795a1a50, blkno=11, access=2) 
at nbtpage.c:629
#5  0x004eae8c in _bt_split (rel=0x7f14795a1a50, buf=136, cbuf=0, 
firstright=367, newitemoff=408, newitemsz=16, newitem=0x2608330, 
newitemonleft=0 '\000') at nbtinsert.c:986
#6  0x004ea563 in _bt_insertonpg (rel=0x7f14795a1a50, buf=136, cbuf=0, 
stack=0x26090c8, itup=0x2608330, newitemoff=408, split_only_page=0 '\000') at 
nbtinsert.c:781
#7  0x004e954c in _bt_doinsert (rel=0x7f14795a1a50, itup=0x2608330, 
checkUnique=UNIQUE_CHECK_YES, heapRel=0x25aa0a0) at nbtinsert.c:204
#8  0x004f3b73 in btinsert (rel=0x7f14795a1a50, values=0x7ffe46c6f7f0, 
isnull=0x7ffe46c6f7d0 "", ht_ctid=0x260927c, heapRel=0x25aa0a0, 
checkUnique=UNIQUE_CHECK_YES) at nbtree.c:279
#9  0x004e7964 in index_insert (indexRelation=0x7f14795a1a50, 
values=0x7ffe46c6f7f0, isnull=0x7ffe46c6f7d0 "", heap_t_ctid=0x260927c, 
heapRelation=0x25aa0a0, checkUnique=UNIQUE_CHECK_YES)
at indexam.c:204

Of course, the database could have been corrupted after having encountered
many crashes during my experiments. Nevertheless, even without in-depth
knowledge of the b-tree code, I suspect that this stack trace might reflect a
legitimate situation. In particular, if _bt_page_recyclable() returned on this
condition:

if (PageIsNew(page))
return true;


Unfortunately I no longer have the cluster or the core dump, but I remember
that all the fields of the PageHeader were indeed zeroed.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Unused function arguments

2016-09-05 Thread Antonin Houska
While reading the logical replication (and related) code, I found a few unused
function arguments:

 * XactLogCommitRecord() - unused argument forceSync

 * SnapBuildBuildSnapshot() - xid

 * TeardownHistoricSnapshot() - is_error

No idea which ones are intended for future use and which ones can be
removed.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Question about MVCC-unsafe commands

2016-08-23 Thread Antonin Houska
Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Aug 23, 2016 at 10:10 AM, Antonin Houska <a...@cybertec.at> wrote:

> CLUSTER preserves xmin/xmax when rewriting, but ALTER TABLE does not.

Indeed, CLUSTER was the command I picked for my experiments. I didn't expect
such a subtle difference. Thanks for the answer.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Question about MVCC-unsafe commands

2016-08-23 Thread Antonin Houska
I'm failing to understand why [1] mentions "table-rewriting forms of ALTER
TABLE" besides the TRUNCATE command.

For TRUNCATE it seems clear: if transaction A takes its snapshot before it
accesses the table for the first time (typically because the isolation level is
at least REPEATABLE READ) and transaction B manages to commit TRUNCATE soon
enough, then A sees the pg_class entry of the table already affected by B,
which has the new (empty) relfilenode. (The original pg_class entry is no
longer visible to the catalog snapshot, nor does it contain a valid OID of the
original relfilenode.)

But IMO heap rewriting changes neither the table contents nor visibility. Can
anyone explain what I am missing?

[1] https://www.postgresql.org/docs/9.6/static/mvcc-caveats.html
-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Typos in logical decoding

2016-07-18 Thread Antonin Houska
While reading the logical decoding code I noticed a few apparently mistyped
comments; see the diff below.

Besides that, the output_plugin_options argument of the CreateInitDecodingContext
function is unused, and the function passes NIL to the corresponding argument
of StartupDecodingContext. This looks like a thinko to me.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
new file mode 100644
index 7c8a777..ecf9a03
*** a/src/backend/replication/logical/logical.c
--- b/src/backend/replication/logical/logical.c
*** CreateInitDecodingContext(char *plugin,
*** 281,287 
  	LWLockRelease(ProcArrayLock);
  
  	/*
! 	 * tell the snapshot builder to only assemble snapshot once reaching the a
  	 * running_xact's record with the respective xmin.
  	 */
  	xmin_horizon = slot->data.catalog_xmin;
--- 281,287 
  	LWLockRelease(ProcArrayLock);
  
  	/*
! 	 * tell the snapshot builder to only assemble snapshot once reaching the
  	 * running_xact's record with the respective xmin.
  	 */
  	xmin_horizon = slot->data.catalog_xmin;
*** LogicalIncreaseRestartDecodingForSlot(XL
*** 880,886 
  }
  
  /*
!  * Handle a consumer's conformation having received all changes up to lsn.
   */
  void
  LogicalConfirmReceivedLocation(XLogRecPtr lsn)
--- 880,886 
  }
  
  /*
!  * Handle a consumer's confirmation having received all changes up to lsn.
   */
  void
  LogicalConfirmReceivedLocation(XLogRecPtr lsn)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
new file mode 100644
index 00e31a2..9e2208a
*** a/src/backend/replication/logical/reorderbuffer.c
--- b/src/backend/replication/logical/reorderbuffer.c
*** ReorderBufferGetTupleBuf(ReorderBuffer *
*** 466,473 
  	/*
  	 * Most tuples are below MaxHeapTupleSize, so we use a slab allocator for
  	 * those. Thus always allocate at least MaxHeapTupleSize. Note that tuples
! 	 * tuples generated for oldtuples can be bigger, as they don't have
! 	 * out-of-line toast columns.
  	 */
  	if (alloc_len < MaxHeapTupleSize)
  		alloc_len = MaxHeapTupleSize;
--- 466,473 
  	/*
  	 * Most tuples are below MaxHeapTupleSize, so we use a slab allocator for
  	 * those. Thus always allocate at least MaxHeapTupleSize. Note that tuples
! 	 * generated for oldtuples can be bigger, as they don't have out-of-line
! 	 * toast columns.
  	 */
  	if (alloc_len < MaxHeapTupleSize)
  		alloc_len = MaxHeapTupleSize;
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
new file mode 100644
index b4dc617..b5fa3db
*** a/src/backend/replication/logical/snapbuild.c
--- b/src/backend/replication/logical/snapbuild.c
*** SnapBuildEndTxn(SnapBuild *builder, XLog
*** 901,907 
  	/*
  	 * NB: This handles subtransactions correctly even if we started from
  	 * suboverflowed xl_running_xacts because we only keep track of toplevel
! 	 * transactions. Since the latter are always are allocated before their
  	 * subxids and since they end at the same time it's sufficient to deal
  	 * with them here.
  	 */
--- 901,907 
  	/*
  	 * NB: This handles subtransactions correctly even if we started from
  	 * suboverflowed xl_running_xacts because we only keep track of toplevel
! 	 * transactions. Since the latter are always allocated before their
  	 * subxids and since they end at the same time it's sufficient to deal
  	 * with them here.
  	 */
*** SnapBuildCommitTxn(SnapBuild *builder, X
*** 981,987 
  		 * we reached consistency.
  		 */
  		forced_timetravel = true;
! 		elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running to early", xid);
  	}
  
  	for (nxact = 0; nxact < nsubxacts; nxact++)
--- 981,987 
  		 * we reached consistency.
  		 */
  		forced_timetravel = true;
! 		elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
  	}
  
  	for (nxact = 0; nxact < nsubxacts; nxact++)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UNIQUE capability to hash indexes

2016-02-04 Thread Antonin Houska
Tom Lane <t...@sss.pgh.pa.us> wrote:

> Alvaro Herrera <alvhe...@2ndquadrant.com> writes:
> > Tom Lane wrote:
> >> Not that I've heard of.  It's very hard to muster any enthusiasm for
> >> improving hash indexes unless their lack of WAL-logging is fixed first.
> 
> > This is really strange though.  Surely adding WAL-logging is not an
> > enormous task anymore ... I mean, we're undertaking far larger efforts
> > now, the WAL logging code is simpler than before, and we even have a
> > tool (ok, gotta streamline that one a little bit) to verify that the
> > results are correct.
> 
> ISTR that we discussed this previously and ran into some stumbling block
> or other that made it less-than-trivial.  Don't recall what though.

Concurrency of bucket splits is the issue. It makes sense to fix that problem
before anyone tries to implement WAL logging. Otherwise the WAL support will
have to be reworked from scratch someday.

I had some ideas which I even published:

http://www.postgresql.org/message-id/32423.1427413442@localhost

Then I spent some more time on it sometime in October and improved the concept
a bit (and also found bugs in the version I had published, so please don't
spend much time looking at it). I also wrote a function to check whether the
index is consistent (possibly to be added to the pageinspect extension). I even
wrote a function that inserts tuples only into the index, not into the heap - I
suppose that should make comparing index performance with and without the patch
simpler.

Now, besides making the patch easier to read, I need to test it
thoroughly. The lack of time is one problem, but I need to admit that it's a
personal issue too :-) So far I think I have a good idea, but now I should try
hard to break it.

Now that I see the problem mentioned again, I feel kind of "ignited".
I expect to have some leisure time at the end of February, so I'll test the
patch and post my results early in March.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bitmap index scans use of filters on available columns

2015-11-04 Thread Antonin Houska
While the prefix expression

y like 'abc%'

can be converted to

y >= 'abc'

(see expand_indexqual_opclause()), I'm not sure any kind of expansion is
possible for '%abc%' that would result in a b-tree searchable condition.
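
For completeness (based on my reading of prefix_quals() in indxpath.c, so
treat the exact form as an assumption), the prefix actually expands to a
range, roughly

	y >= 'abc' AND y < 'abd'

which is what makes the condition b-tree searchable; nothing similar can be
derived from an infix pattern like '%abc%'.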

Jeff Janes <jeff.ja...@gmail.com> wrote:

> Is there a fundamental reason the filter on y is not being applied to
> the index scan, rather than the heap scan?

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Non-volatile variables used for spinlock manipulation

2015-09-04 Thread Antonin Houska
Since the SpinLockAcquire() / SpinLockRelease() macros usually reference
variables declared as volatile, I wonder whether the following changes should
be applied.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index 0e60dbc..256d09d 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -195,11 +195,11 @@ shm_mq_set_receiver(shm_mq *mq, PGPROC *proc)
 	volatile shm_mq *vmq = mq;
 	PGPROC	   *sender;
 
-	SpinLockAcquire(&mq->mq_mutex);
+	SpinLockAcquire(&vmq->mq_mutex);
 	Assert(vmq->mq_receiver == NULL);
 	vmq->mq_receiver = proc;
 	sender = vmq->mq_sender;
-	SpinLockRelease(&mq->mq_mutex);
+	SpinLockRelease(&vmq->mq_mutex);
 
 	if (sender != NULL)
 		SetLatch(&sender->procLatch);
@@ -214,11 +214,11 @@ shm_mq_set_sender(shm_mq *mq, PGPROC *proc)
 	volatile shm_mq *vmq = mq;
 	PGPROC	   *receiver;
 
-	SpinLockAcquire(&mq->mq_mutex);
+	SpinLockAcquire(&vmq->mq_mutex);
 	Assert(vmq->mq_sender == NULL);
 	vmq->mq_sender = proc;
 	receiver = vmq->mq_receiver;
-	SpinLockRelease(&mq->mq_mutex);
+	SpinLockRelease(&vmq->mq_mutex);
 
 	if (receiver != NULL)
 		SetLatch(&receiver->procLatch);
@@ -233,9 +233,9 @@ shm_mq_get_receiver(shm_mq *mq)
 	volatile shm_mq *vmq = mq;
 	PGPROC	   *receiver;
 
-	SpinLockAcquire(&mq->mq_mutex);
+	SpinLockAcquire(&vmq->mq_mutex);
 	receiver = vmq->mq_receiver;
-	SpinLockRelease(&mq->mq_mutex);
+	SpinLockRelease(&vmq->mq_mutex);
 
 	return receiver;
 }
@@ -249,9 +249,9 @@ shm_mq_get_sender(shm_mq *mq)
 	volatile shm_mq *vmq = mq;
 	PGPROC	   *sender;
 
-	SpinLockAcquire(&mq->mq_mutex);
+	SpinLockAcquire(&vmq->mq_mutex);
 	sender = vmq->mq_sender;
-	SpinLockRelease(&mq->mq_mutex);
+	SpinLockRelease(&vmq->mq_mutex);
 
 	return sender;
 }

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Thinko in processing of SHM message size info?

2015-08-06 Thread Antonin Houska
Robert Haas robertmh...@gmail.com wrote:
 On Thu, Aug 6, 2015 at 1:24 PM, Robert Haas robertmh...@gmail.com wrote:
  On Thu, Aug 6, 2015 at 9:04 AM, Antonin Houska a...@cybertec.at wrote:
  Can anyone please explain why the following patch shouldn't be applied?
 
  diff --git a/src/backend/storage/ipc/shm_mq.c 
  b/src/backend/storage/ipc/shm_mq.c
  index 126cb07..4cd52ac 100644
  --- a/src/backend/storage/ipc/shm_mq.c
  +++ b/src/backend/storage/ipc/shm_mq.c
  @@ -584,7 +584,7 @@ shm_mq_receive(shm_mq_handle *mqh, Size *nbytesp, void 
  **datap, bool nowait)
  if (mqh->mqh_partial_bytes + rb < sizeof(Size))
  lengthbytes = sizeof(Size) - 
  mqh->mqh_partial_bytes;
  else
  -   lengthbytes = rb - mqh->mqh_partial_bytes;
  +   lengthbytes = rb;
  memcpy(&mqh->mqh_buffer[mqh->mqh_partial_bytes], 
  rawdata,
 lengthbytes);
  mqh->mqh_partial_bytes += lengthbytes;
 
 
  I'm failing to understand why anything should be subtracted. Note that the
  previous iteration must have called shm_mq_inc_bytes_read(), so rb should
  not include anything of mqh->mqh_partial_bytes. Thanks.
 

  Hmm, I think you are correct.  This would matter in the case where the
  message length word was read in more than two chunks.  I don't *think*
  that's possible right now because I believe the only systems where
  MAXIMUM_ALIGNOF < sizeof(Size) are those with MAXIMUM_ALIGNOF == 4 and
  sizeof(Size) == 8. However, if we had systems where MAXIMUM_ALIGNOF
  == 4 and sizeof(Size) == 16, or systems where MAXIMUM_ALIGNOF == 2 and
  sizeof(Size) == 8, this would be a live bug.

I have to admit that I didn't think about the specific combinations of
MAXIMUM_ALIGNOF and sizeof(Size), and considered the problem rather rare
(maybe also because it can't happen on my workstation). But your next
consideration makes sense to me:

 Hmm, actually, maybe it is a live bug anyway, because the if statement
 tests > rather than >=.  If we've read 4 bytes and exactly 4 more
 bytes are available, we'd set lengthbytes to 0 instead of 4.  Oops.
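
To make the arithmetic concrete, here is a minimal standalone sketch (not the
shm_mq code itself; sizeof(Size) == 8 and two 4-byte chunks are assumed, i.e.
the MAXIMUM_ALIGNOF == 4 case):

#include <stdio.h>

#define SIZEOF_SIZE 8			/* stands in for sizeof(Size) */

/* pre-patch computation */
static int lengthbytes_old(int partial_bytes, int rb)
{
	if (partial_bytes + rb > SIZEOF_SIZE)
		return SIZEOF_SIZE - partial_bytes;
	else
		return rb - partial_bytes;	/* the questionable subtraction */
}

/* computation proposed in the patch */
static int lengthbytes_new(int partial_bytes, int rb)
{
	if (partial_bytes + rb > SIZEOF_SIZE)
		return SIZEOF_SIZE - partial_bytes;
	else
		return rb;
}

int main(void)
{
	/* second iteration: 4 bytes already buffered, 4 more just arrived */
	int		partial_bytes = 4;
	int		rb = 4;

	printf("old: lengthbytes = %d\n", lengthbytes_old(partial_bytes, rb));
	printf("new: lengthbytes = %d\n", lengthbytes_new(partial_bytes, rb));
	return 0;
}

The old formula copies 0 of the 4 remaining length-word bytes, the patched one
copies all 4.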

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] Race conditions in shm_mq.c

2015-08-06 Thread Antonin Houska
Robert Haas robertmh...@gmail.com wrote:
 On Thu, Aug 6, 2015 at 2:38 PM, Robert Haas robertmh...@gmail.com wrote:
  On Thu, Aug 6, 2015 at 10:10 AM, Antonin Houska a...@cybertec.at wrote:
  During my experiments with parallel workers I sometimes saw the master 
  and
  worker process blocked. The master uses shm queue to send data to the 
  worker,
  both sides nowait==false. I concluded that the following happened:
 
  The worker process set itself as a receiver on the queue after
  shm_mq_wait_internal() has completed its first check of ptr, so this
  function left sender's procLatch in reset state. But before the procLatch 
  was
  reset, the receiver still managed to read some data and set sender's 
  procLatch
  to signal the reading, and eventually called its (receiver's) WaitLatch().
 
  So sender has effectively missed the receiver's notification and called
  WaitLatch() too (if the receiver already waits on its latch, it does not 
  help
  for sender to call shm_mq_notify_receiver(): receiver won't do anything
  because there's no new data in the queue).
 
  Below is my patch proposal.
 
  Another good catch.  However, I would prefer to fix this without
  introducing a continue as I think that will make the control flow
  clearer.  Therefore, I propose the attached variant of your idea.
 
 Err, that doesn't work at all.  Have a look at this version instead.

This makes sense to me.

One advantage of the continue was that I could apply the patch to my test code
(containing the appropriate sleep() calls to simulate the race conditions)
with no conflicts and see the difference. The restructuring you did does not
allow for such mechanical testing, but it's clear enough.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




[HACKERS] Thinko in processing of SHM message size info?

2015-08-06 Thread Antonin Houska
Can anyone please explain why the following patch shouldn't be applied?

diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index 126cb07..4cd52ac 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -584,7 +584,7 @@ shm_mq_receive(shm_mq_handle *mqh, Size *nbytesp, void 
**datap, bool nowait)
 if (mqh->mqh_partial_bytes + rb > sizeof(Size))
 	lengthbytes = sizeof(Size) - mqh->mqh_partial_bytes;
 else
-	lengthbytes = rb - mqh->mqh_partial_bytes;
+	lengthbytes = rb;
 memcpy(&mqh->mqh_buffer[mqh->mqh_partial_bytes], rawdata,
 	   lengthbytes);
 mqh->mqh_partial_bytes += lengthbytes;


I'm failing to understand why anything should be subtracted. Note that the
previous iteration must have called shm_mq_inc_bytes_read(), so rb should
not include anything of mqh->mqh_partial_bytes. Thanks.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




[HACKERS] Race conditions in shm_mq.c

2015-08-06 Thread Antonin Houska
During my experiments with parallel workers I sometimes saw the master and
worker process blocked. The master uses shm queue to send data to the worker,
both sides nowait==false. I concluded that the following happened:

The worker process set itself as a receiver on the queue after
shm_mq_wait_internal() has completed its first check of ptr, so this
function left sender's procLatch in reset state. But before the procLatch was
reset, the receiver still managed to read some data and set sender's procLatch
to signal the reading, and eventually called its (receiver's) WaitLatch().

So sender has effectively missed the receiver's notification and called
WaitLatch() too (if the receiver already waits on its latch, it does not help
for sender to call shm_mq_notify_receiver(): receiver won't do anything
because there's no new data in the queue).
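
To illustrate, here is a single-threaded replay of that interleaving (a sketch
only, not PostgreSQL code: the latch is a plain flag and the steps of the two
processes are interleaved by hand in exactly the order that loses the
notification):

#include <stdbool.h>
#include <stdio.h>

static bool sender_latch_set;		/* stands in for the sender's procLatch */
static bool receiver_attached;		/* stands in for mq_receiver != NULL */
static bool bytes_were_read;		/* what the sender never learns about */

int main(void)
{
	/* sender (shm_mq_wait_internal): first check - no receiver yet */
	printf("sender  : receiver attached? %d\n", receiver_attached);

	/* receiver: attaches, reads some data, signals the sender's latch */
	receiver_attached = true;
	bytes_were_read = true;
	sender_latch_set = true;
	printf("receiver: attached, read data, set sender's latch, now waits\n");

	/* sender: resets its latch before sleeping, discarding the signal */
	sender_latch_set = false;
	printf("sender  : latch reset, notification about the read is lost\n");

	/*
	 * sender: still believes the queue is full, calls WaitLatch() and
	 * sleeps; the receiver is already sleeping too, so nobody wakes it.
	 */
	printf("sender  : queue still looks full (read seen: %d), WaitLatch()\n",
		   bytes_were_read && sender_latch_set);
	return 0;
}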

Below is my patch proposal.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index 126cb07..39ea673 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -803,6 +808,22 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
 	return SHM_MQ_DETACHED;
 }
 mqh->mqh_counterparty_attached = true;
+
+/*
+ * Re-check if the queue is still full.
+ *
+ * While we were using our procLatch to detect receiver's
+ * connection, the receiver could have connected and started
+ * reading from it - that includes concurrent manipulation of
+ * the latch, in order to report on reading activity. Thus we
+ * could miss the information that some data has already been
+ * consumed, and cause a deadlock by calling SetLatch() below.
+ *
+ * (If the receiver starts waiting on its latch soon enough,
+ * our call of shm_mq_notify_receiver() will have no effect.)
+ */
+if (!nowait)
+	continue;
 			}
 
 			/* Let the receiver know that we need them to read some data. */



Re: [HACKERS] Parallel Seq Scan

2015-07-15 Thread Antonin Houska
Amit Kapila amit.kapil...@gmail.com wrote:
 Attached, find the rebased version of patch.

[I haven't read this thread so far, sorry for possibly redundant comment.]

I noticed that false is passed for the required_outer argument of
create_partialseqscan_path(), while NULL seems to be cleaner in terms of C
language.

But in terms of semantics, I'm not sure this is correct anyway. Why does
create_parallelscan_paths() not accept the actual rel->lateral_relids, just
like create_seqscan_path() does? (See set_plain_rel_pathlist().) If there's a
reason for your approach, I think it's worth a comment.


BTW, emacs shows whitespace on otherwise empty line parallelpath.c:57.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




[HACKERS] Possible gaps/garbage in the output of XLOG reader

2015-04-09 Thread Antonin Houska
While playing with xlogreader, I was lucky enough to see one of the many
record validations fail. After having some fun with gdb, I found out that
in some cases the reader does not ensure that enough data is in state->readBuf
before copying into state->readRecordBuf starts. This should not happen if the
callback always reads XLOG_BLCKSZ bytes, but in fact only *reqLen* is the
mandatory size of the chunk delivered.

There are probably various ways to fix this problem. Attached is what I did in
my environment. I hit the problem on 9.4.1, but the patch seems to apply to
master too.
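
To illustrate the failure mode with made-up numbers (a standalone sketch, not
xlogreader code; the record is assumed to start near the end of a page while
the callback has so far delivered only the bytes explicitly requested for the
record header):

#include <stdio.h>

#define XLOG_BLCKSZ		8192
#define HEADER_BYTES	24		/* stand-in for the record header size */

static int Min(int a, int b) { return a < b ? a : b; }

int main(void)
{
	int		targetRecOff = 7000;	/* record offset within the page */
	int		total_len = 3000;		/* record continues on the next page */
	int		readOff = targetRecOff + HEADER_BYTES;	/* bytes actually valid */

	int		len_old = XLOG_BLCKSZ - targetRecOff;	/* pre-patch */
	int		len_new = Min(XLOG_BLCKSZ - targetRecOff, total_len);	/* patched */

	printf("valid bytes beyond the record start: %d\n",
		   readOff - targetRecOff);
	printf("old code copies %d bytes, %d of which were never read\n",
		   len_old, len_old - (readOff - targetRecOff));
	printf("patched code needs %d bytes too, but re-reads the page up to "
		   "offset %d first\n", len_new, targetRecOff + len_new);
	return 0;
}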

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 474137a..e6ebd9d 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -313,7 +313,21 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 		goto err;
 	}
 
-	len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
+	/* Bytes of the current record residing on the current page. */
+	len = Min(XLOG_BLCKSZ - targetRecOff, total_len);
+
+	/*
+	 * Nothing beyond the record header is guaranteed to be in state->readBuf
+	 * so far.
+	 */
+	if (readOff < targetRecOff + len)
+	{
+		readOff = ReadPageInternal(state, targetPagePtr, targetRecOff + len);
+
+		if (readOff < 0)
+			goto err;
+	}
+
 	if (total_len > len)
 	{
 		/* Need to reassemble record */
@@ -322,9 +336,11 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 		char	   *buffer;
 		uint32		gotlen;
 
+		Assert(readOff == targetRecOff + len);
+		Assert(readOff == XLOG_BLCKSZ);
+
 		/* Copy the first fragment of the record from the first page. */
-		memcpy(state->readRecordBuf,
-			   state->readBuf + RecPtr % XLOG_BLCKSZ, len);
+		memcpy(state->readRecordBuf, state->readBuf + targetRecOff, len);
 		buffer = state->readRecordBuf + len;
 		gotlen = len;
 
@@ -413,20 +429,16 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	}
 	else
 	{
-		/* Wait for the record data to become available */
-		readOff = ReadPageInternal(state, targetPagePtr,
- Min(targetRecOff + total_len, XLOG_BLCKSZ));
-		if (readOff < 0)
-			goto err;
+		Assert(readOff >= targetRecOff + len);
 
 		/* Record does not cross a page boundary */
 		if (!ValidXLogRecord(state, record, RecPtr))
 			goto err;
 
-		state->EndRecPtr = RecPtr + MAXALIGN(total_len);
+		state->EndRecPtr = RecPtr + MAXALIGN(len);
 
 		state->ReadRecPtr = RecPtr;
-		memcpy(state->readRecordBuf, record, total_len);
+		memcpy(state->readRecordBuf, record, len);
 	}
 
 	/*



[HACKERS] Concurrent calls of _hash_getnewbuf()

2015-03-30 Thread Antonin Houska
When doing my experiments with bucket split ([1]), I noticed a comment that
_hash_getnewbuf should not be called concurrently. However, there's no
synchronization of calls from _hash_splitbucket in HEAD. I could reproduce
such concurrent calls using gdb (multiple bucket splits in progress at a
time).

When the function is called from _hash_getovflpage, content lock of metapage
buffer seems to be (mis)used to synchronize the calls:

/*
 * Fetch the page with _hash_getnewbuf to ensure smgr's idea of the
 * relation length stays in sync with ours.  XXX It's annoying to do 
this
 * with metapage write lock held; would be better to use a lock that
 * doesn't block incoming searches.
 */
newbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);

I think it'd also be the easiest fix for _hash_splitbucket. Or should a
separate (regular) lock be introduced and used in both cases?


[1] http://www.postgresql.org/message-id/32423.1427413442@localhost

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
new file mode 100644
index 46c6c96..25c1dd1
*** a/src/backend/access/hash/hashpage.c
--- b/src/backend/access/hash/hashpage.c
*** _hash_splitbucket(Relation rel,
*** 765,771 
--- 765,773 
  	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
  
  	nblkno = start_nblkno;
+ 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
  	nbuf = _hash_getnewbuf(rel, nblkno, MAIN_FORKNUM);
+ 	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
  	npage = BufferGetPage(nbuf);
  
  	/* initialize the new bucket's primary page */



Re: [HACKERS] WIP: Split of hash index bucket

2015-03-27 Thread Antonin Houska
Antonin Houska a...@cybertec.at wrote:
 I'm still testing it. especially the concurrent access. There are probably
 bugs in the code, but it can help understand what I mean.

I've traced the most important cases of concurrent insertion into a bucket
whose split is in progress. A few related bugs are fixed. Some tool to check
the index structure is still needed before performance testing makes
sense. That might include enhancement of contrib/pageinspect.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
new file mode 100644
index 24b06a5..149bbcf
*** a/src/backend/access/hash/hash.c
--- b/src/backend/access/hash/hash.c
*** loop_top:
*** 572,577 
--- 572,599 
  			opaque = (HashPageOpaque) PageGetSpecialPointer(page);
  			Assert(opaque->hasho_bucket == cur_bucket);
  
+ 			/*
+ 			 * If the bucket participates in a split, give up.
+ 			 *
+ 			 * (Unlike the metapage copy, the flags at bucket level should
+ 			 * always be up-to-date.)
+ 			 *
+ 			 * TODO
+ 			 *
+ 			 * 1. Analyze if both buckets participating in the split impose
+ 			 * too severe restrictions, and if it makes sense to introduce
+ 			 * separate flags for old and new bucket. Also, would such a
+ 			 * restricted VACUUM still make sense?
+ 			 *
+ 			 * 2. Consider how statistics should reflect the fact that some
+ 			 * buckets are skipped because of split.
+ 			 */
+ 			if (opaque->hasho_flag & LH_BUCKET_SPLIT)
+ 			{
+ _hash_relbuf(rel, buf);
+ break;
+ 			}
+ 
  			/* Scan each tuple in page */
  			maxoffno = PageGetMaxOffsetNumber(page);
  			for (offno = FirstOffsetNumber;
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
new file mode 100644
index 63aaec9..de445c9
*** a/src/backend/access/hash/hashinsert.c
--- b/src/backend/access/hash/hashinsert.c
*** _hash_doinsert(Relation rel, IndexTuple
*** 37,42 
--- 37,43 
  	Page		page;
  	HashPageOpaque pageopaque;
  	Size		itemsz;
+ 	uint16  buckets_total;
  	bool		do_expand;
  	uint32		hashkey;
  	Bucket		bucket;
*** _hash_doinsert(Relation rel, IndexTuple
*** 123,129 
  		 */
  		BlockNumber nextblkno = pageopaque->hasho_nextblkno;
  
  		if (BlockNumberIsValid(nextblkno))
  		{
  			/*
  			 * ovfl page exists; go get it.  if it doesn't have room, we'll
--- 124,131 
  		 */
  		BlockNumber nextblkno = pageopaque->hasho_nextblkno;
  
! 		if (BlockNumberIsValid(nextblkno) &&
! 			!(pageopaque->hasho_flag & LH_BUCKET_SPLIT_LAST))
  		{
  			/*
  			 * ovfl page exists; go get it.  if it doesn't have room, we'll
*** _hash_doinsert(Relation rel, IndexTuple
*** 136,142 
  		else
  		{
  			/*
! 			 * we're at the end of the bucket chain and we haven't found a
  			 * page with enough room.  allocate a new overflow page.
  			 */
  
--- 138,145 
  		else
  		{
  			/*
! 			 * we're at the end of the bucket chain, or (during a split) right
! 			 * before redirection to the old bucket, and we haven't found a
  			 * page with enough room.  allocate a new overflow page.
  			 */
  
*** _hash_doinsert(Relation rel, IndexTuple
*** 151,157 
  			Assert(PageGetFreeSpace(page) >= itemsz);
  		}
  		pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
! 		Assert(pageopaque->hasho_flag == LH_OVERFLOW_PAGE);
  		Assert(pageopaque->hasho_bucket == bucket);
  	}
  
--- 154,160 
  			Assert(PageGetFreeSpace(page) >= itemsz);
  		}
  		pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
! 		Assert(pageopaque->hasho_flag & LH_OVERFLOW_PAGE);
  		Assert(pageopaque->hasho_bucket == bucket);
  	}
  
*** _hash_doinsert(Relation rel, IndexTuple
*** 173,180 
  	metap->hashm_ntuples += 1;
  
  	/* Make sure this stays in sync with _hash_expandtable() */
! 	do_expand = metap->hashm_ntuples >
! 		(double) metap->hashm_ffactor * (metap->hashm_maxbucket + 1);
  
  	/* Write out the metapage and drop lock, but keep pin */
  	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
--- 176,184 
  	metap->hashm_ntuples += 1;
  
  	/* Make sure this stays in sync with _hash_expandtable() */
! 	buckets_total = metap->hashm_maxbucket + 1 + metap->hashm_split_count;
! 	do_expand = metap->hashm_split_count < HASH_MAX_SPLITS &&
! 		metap->hashm_ntuples > (double) metap->hashm_ffactor * buckets_total;
  
  	/* Write out the metapage and drop lock, but keep pin */
  	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
new file mode 100644
index b775164..4345f29
*** a/src/backend/access/hash/hashovfl.c
--- b/src/backend/access/hash/hashovfl.c
***
*** 21,27 
  #include utils/rel.h
  
  
- static Buffer _hash_getovflpage(Relation rel, Buffer metabuf);
  static uint32 _hash_firstfreebit(uint32 map

[HACKERS] WIP: Split of hash index bucket

2015-03-26 Thread Antonin Houska
I've read proposal [1] and also a major part of thread [2]. [1] encouraged me to
try an alternative approach, which does not require flags at tuple
level. After thinking about it for some time, I decided to code something -
see attachment of this mail. (I was not sure whether I should write some kind
of pseudocode, but was too eager to try whether my idea works :-) )

The idea is that the new bucket is initialized as an empty primary page, whose
'hasho_nextblkno' points at the first page of the old bucket (the one being
split). Then, tuples belonging to the new bucket are copied there and the link
at the end of the new bucket is redirected to the 2nd page of the old
bucket. And so on. When the last page of the old bucket is processed, the link
from the new to the old bucket is broken.

Any bucket participating in a split (whether the original one or the one being
created) has a flag on its primary page, so that its split-in-progress status
does not require access to the index metapage.

This logic should ensure that the split can be performed in small steps, w/o
blocking scans and inserts at bucket level (of course, contention still exists
at page level).
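
A toy model of the page chain may help to visualize this (plain C, not hash
index code; one primary page per bucket and three old pages are assumed):

#include <stdio.h>

typedef struct Page
{
	const char *name;
	struct Page *next;
} Page;

static void print_chain(const char *label, Page *p)
{
	printf("%s:", label);
	for (; p != NULL; p = p->next)
		printf(" %s", p->name);
	printf("\n");
}

int main(void)
{
	Page	old3 = {"old3", NULL};
	Page	old2 = {"old2", &old3};
	Page	old1 = {"old1", &old2};
	Page	newb = {"new", &old1};	/* new bucket initially links to old bucket */

	print_chain("before split", &newb);

	/* old1 processed: tail link of the new bucket redirected to old2 */
	newb.next = &old2;
	print_chain("after old1  ", &newb);

	/* old2 processed */
	newb.next = &old3;
	print_chain("after old2  ", &newb);

	/* last old page processed: link to the old bucket is broken */
	newb.next = NULL;
	print_chain("after old3  ", &newb);
	return 0;
}

A scan that starts at the new bucket's primary page thus always reaches the
not-yet-processed pages of the old bucket while the split is in progress.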

I'm still testing it. especially the concurrent access. There are probably
bugs in the code, but it can help understand what I mean.

If this split algorithm proves to be viable, an important question about the
role of bucket-level locks (implemented currently as heavyweight lock of the
bucket's primary page) remains.


(Note that squeeze bucket functionality is not implemented in this version.)


References:

[1] 
http://www.postgresql.org/message-id/ca+tgmozymojsrfxhxq06g8jhjxqcskvdihb_8z_7nc7hj7i...@mail.gmail.com
[2]
http://www.postgresql.org/message-id/ca+tgmoy4x7vkyc4dawujutuboyxe2qsjf9aybhwzjxxwoc6...@mail.gmail.co

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
new file mode 100644
index 24b06a5..149bbcf
*** a/src/backend/access/hash/hash.c
--- b/src/backend/access/hash/hash.c
*** loop_top:
*** 572,577 
--- 572,599 
  			opaque = (HashPageOpaque) PageGetSpecialPointer(page);
  			Assert(opaque->hasho_bucket == cur_bucket);
  
+ 			/*
+ 			 * If the bucket participates in a split, give up.
+ 			 *
+ 			 * (Unlike the metapage copy, the flags at bucket level should
+ 			 * always be up-to-date.)
+ 			 *
+ 			 * TODO
+ 			 *
+ 			 * 1. Analyze if both buckets participating in the split impose
+ 			 * too severe restrictions, and if it makes sense to introduce
+ 			 * separate flags for old and new bucket. Also, would such a
+ 			 * restricted VACUUM still make sense?
+ 			 *
+ 			 * 2. Consider how statistics should reflect the fact that some
+ 			 * buckets are skipped because of split.
+ 			 */
+ 			if (opaque->hasho_flag & LH_BUCKET_SPLIT)
+ 			{
+ _hash_relbuf(rel, buf);
+ break;
+ 			}
+ 
  			/* Scan each tuple in page */
  			maxoffno = PageGetMaxOffsetNumber(page);
  			for (offno = FirstOffsetNumber;
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
new file mode 100644
index 63aaec9..4b372ae
*** a/src/backend/access/hash/hashinsert.c
--- b/src/backend/access/hash/hashinsert.c
*** _hash_doinsert(Relation rel, IndexTuple
*** 37,42 
--- 37,43 
  	Page		page;
  	HashPageOpaque pageopaque;
  	Size		itemsz;
+ 	uint16  buckets_total;
  	bool		do_expand;
  	uint32		hashkey;
  	Bucket		bucket;
*** _hash_doinsert(Relation rel, IndexTuple
*** 173,180 
  	metap->hashm_ntuples += 1;
  
  	/* Make sure this stays in sync with _hash_expandtable() */
! 	do_expand = metap->hashm_ntuples >
! 		(double) metap->hashm_ffactor * (metap->hashm_maxbucket + 1);
  
  	/* Write out the metapage and drop lock, but keep pin */
  	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
--- 174,182 
  	metap->hashm_ntuples += 1;
  
  	/* Make sure this stays in sync with _hash_expandtable() */
! 	buckets_total = metap->hashm_maxbucket + 1 + metap->hashm_split_count;
! 	do_expand = metap->hashm_split_count < HASH_MAX_SPLITS &&
! 		metap->hashm_ntuples > (double) metap->hashm_ffactor * buckets_total;
  
  	/* Write out the metapage and drop lock, but keep pin */
  	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
new file mode 100644
index b775164..4345f29
*** a/src/backend/access/hash/hashovfl.c
--- b/src/backend/access/hash/hashovfl.c
***
*** 21,27 
  #include utils/rel.h
  
  
- static Buffer _hash_getovflpage(Relation rel, Buffer metabuf);
  static uint32 _hash_firstfreebit(uint32 map);
  
  
--- 21,26 
*** _hash_addovflpage(Relation rel, Buffer m
*** 127,133 
  		pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
  		nextblkno = pageopaque->hasho_nextblkno

Re: [HACKERS] Corner case for add_path_precheck

2015-02-11 Thread Antonin Houska
Tom Lane t...@sss.pgh.pa.us wrote:
 Antonin Houska a...@cybertec.at writes:
  The special case is that the path passed to add_path_precheck() has costs
  *equal to* those of the old_path. If pathkeys, outer rells and costs are the
  same, as summarized in the comment above, I expect add_path_precheck() to
  return false. Do I misread anything?
 
 It does, so I don't see your point?

Just that pre-check is - in this special (and very rare?) case - more
stringent than the proper check would be.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




[HACKERS] Corner case for add_path_precheck

2015-02-09 Thread Antonin Houska
While thinking about the add_path_precheck() function, it occurred to me that
it can discard some paths that would otherwise have a chance of being accepted
in this part of add_path():

/*
 * Same pathkeys and outer rels, and fuzzily
 * the same cost, so keep just one; to decide
 * which, first check rows and then do a fuzzy
 * cost comparison with very small fuzz limit.
 * (We used to do an exact cost comparison,
 * but that results in annoying
 * platform-specific plan variations due to
 * roundoff in the cost estimates.)  If things
 * are still tied, arbitrarily keep only the
 * old path.  Notice that we will keep only
 * the old path even if the less-fuzzy
 * comparison decides the startup and total
 * costs compare differently.
 */
if (new_path->rows < old_path->rows)
remove_old = true;  /* new dominates old */
else if (new_path->rows > old_path->rows)
accept_new = false; /* old dominates new */
else if (compare_path_costs_fuzzily(new_path,

The special case is that the path passed to add_path_precheck() has costs
*equal to* those of the old_path. If pathkeys, outer rels and costs are the
same, as summarized in the comment above, I expect add_path_precheck() to
return false. Do I misread anything?

(Maybe the reason this does not happen too often is that
add_path_precheck() compares the costs exactly, as opposed to the fuzzy
comparison?)
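
A toy illustration of the difference (not the planner code; startup cost and
pathkeys are ignored and the fuzz factor is made up):

#include <stdio.h>

#define FUZZ 1.0000000001

static int exact_precheck_accepts(double new_cost, double old_cost)
{
	return new_cost < old_cost;		/* equal costs: new path rejected */
}

static int fuzzily_same(double a, double b)
{
	return !(a > b * FUZZ) && !(b > a * FUZZ);
}

int main(void)
{
	double	old_cost = 125.0, new_cost = 125.0;
	double	old_rows = 1000.0, new_rows = 500.0;

	printf("precheck accepts the new path: %d\n",
		   exact_precheck_accepts(new_cost, old_cost));

	if (fuzzily_same(new_cost, old_cost) && new_rows < old_rows)
		printf("add_path itself would keep the new path (fewer rows at the "
			   "same cost)\n");
	return 0;
}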

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] memory explosion on planning complex query

2014-11-26 Thread Antonin Houska
On 11/26/2014 11:00 PM, Andrew Dunstan wrote:
 
 Attached is some anonymized DDL for a fairly complex schema from a 
 PostgreSQL Experts client. Also attached is an explain query that runs 
 against the schema. The client's problem is that in trying to run the 
 explain, Postgres simply runs out of memory. On my untuned 9.3 test rig, 
 (Scientific Linux 6.4 with 24Gb of RAM and 24Gb of swap) vmstat clearly 
 shows the explain chewing up about 7Gb of memory. When it's done the 
 free memory jumps back to where it was. On a similar case on the clients 
 test rig we saw memory use jump lots more.
 
 The client's question is whether this is not a bug. It certainly seems 
 like it should be possible to plan a query without chewing up this much 
 memory, or at least to be able to limit the amount of memory that can be 
 grabbed during planning. Going from humming along happily to OOM 
 conditions all through running explain somequery is not very friendly.

It's not trivial to track the whole hierarchy of views, but I think it
can result in the FROM list or some JOIN lists being too long. How about
setting from_collapse_limit / join_collapse_limit to a lower-than-default
value?

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




[HACKERS] pg_class(relpersistence) of hash index

2014-11-24 Thread Antonin Houska
While checking how the BM_PERMANENT flag is set (in the buffer header), I
noticed that a hash index has it set too. Shouldn't pg_class(relpersistence)
be 'u' in this case? Currently it's set to 'p':

postgres=# CREATE TABLE a(i int);
CREATE TABLE
postgres=# CREATE INDEX ON a USING HASH (i);
WARNING:  hash indexes are not WAL-logged and their use is discouraged
CREATE INDEX
postgres=# \d a
      Table "public.a"
 Column |  Type   | Modifiers 
--------+---------+-----------
 i      | integer | 
Indexes:
    "a_i_idx" hash (i)

postgres=# select relpersistence from pg_class where relname='a_i_idx';
 relpersistence 
----------------
 p
(1 row)

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] some ideas from users

2014-11-22 Thread Antonin Houska
On 11/22/2014 12:24 PM, Marko Tiikkaja wrote:
 On 2014-11-22 12:20 PM, Pavel Stehule wrote:
 2. missing table function with all settings. Like SHOW ALL, but with
 filtering possibility
 
 What's wrong with pg_settings?

Do you mean pg_show_all_settings() ?

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] Functions used in index definitions shouldn't be changed

2014-11-19 Thread Antonin Houska
Albe Laurenz laurenz.a...@wien.gv.at wrote:
 Currently it is possible to change the behaviour of a function with
 CREATE OR REPLACE FUNCTION even if the function is part of an index 
 definition.
 
 I think that should be forbidden, because it is very likely to corrupt
 the index.  I expect the objection that this would break valid use cases
 where people know exactly what they are doing, but I believe that this
 is a footgun for inexperienced users that should be disarmed.
 
 I'd also opt for forbidding behaviour changing modifications with ALTER 
 FUNCTION
 for functions used in index definitions, specifically altering strictness.
 
 Attached is a patch implementing a fix.

Instead of adding an extra check, shouldn't you just ensure that
DEPENDENCY_INTERNAL is the dependency type and let the existing logic do the
work?

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] Unintended restart after recovery error

2014-11-14 Thread Antonin Houska
Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Nov 13, 2014 at 8:30 AM, Robert Haas robertmh...@gmail.com wrote:
 It's true that if the startup process dies we don't try to restart,
 but it's also true that if the checkpointer dies we do try to restart.
 I'm not sure why this specific situation should be an exception to
 that general rule.

My distinction was during recovery vs outside recovery, rather than
startup process vs checkpointer. But I'm not sure it's easy enough to
recognize that checkpointer (maybe also bgwriter) no longer participates in
recovery.

 442231d7f71764b8c628044e7ce2225f9aa43b6 introduced the latter rule
 for hot-standby case.

I didn't fully understand the purpose of this condition until I saw the commit
message. Thanks for pointing out.

 Maybe *during crash recovery* (i.e., hot standby should not be enabled) it's
 better to treat the crash of startup process as a catastrophic crash.

Yes, that's what I thought too. But I think the current StartupXLOG() does not
always (need to) determine the exact boundary between crash and archive
recovery. I'd need to think more if the end of crash recovery can be safely
identified during the replay and signaled to postmaster.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] array exclusion constraints

2014-11-13 Thread Antonin Houska
franc...@hush.com wrote:
 
 CREATE TABLE test (
foo text[],
   EXCLUDE USING gist (foo WITH &&)
 );
 
 ERROR:  data type text[] has no default operator class for access method 
 gist
 HINT:  You must specify an operator class for the index or define a default 
 operator class for the data type.
 
 It looks like exclusion constraints don't work with arrays, there's no gist 
 opclass for them. This would be a nice to have feature and, if I recall 
 correctly, exclusion constraints were meant to work both with ranges and 
 arrays. Am I missing something?

I recall I spent some time thinking about this issue, and even could find the
link (also referenced in Indexes section of
https://wiki.postgresql.org/wiki/TODO)

http://www.postgresql.org/message-id/ca+tgmobzhfrjnyz-fyw5kdtrurk0hjwp0vtp5fgzle6evsw...@mail.gmail.com

Regarding implementation, more recent thread summarizes the difficulties:

http://www.postgresql.org/message-id/5297dc17.7000...@proxel.se

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




[HACKERS] Unintended restart after recovery error

2014-11-12 Thread Antonin Houska
While looking at postmaster.c:reaper(), one problematic case occurred to me.


1. Startup process signals PMSIGNAL_RECOVERY_STARTED.

2. Checkpointer process is forked and immediately dies.

3. reaper() catches this failure, calls HandleChildCrash() and thus sets
FatalError to true.

4. Startup process exits with non-zero status code too - either due to SIGQUIT
received from HandleChildCrash or due to some other failure of the startup
process itself. However, FatalError is already set, because of the previous
crash of the checkpointer. Thus reaper() does not set RecoveryError.

5. As RecoveryError failed to be set to true, the postmaster will try to restart
the cluster, although it apparently should not.
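
The flag interaction can be replayed in a few lines (a sketch only, not
postmaster.c; just the two booleans involved, driven in the order above):

#include <stdbool.h>
#include <stdio.h>

static bool FatalError;
static bool RecoveryError;

/* simplified version of the reaper() branch for the startup process */
static void startup_process_crashed(void)
{
	if (!FatalError)			/* the condition the patch below removes */
		RecoveryError = true;
}

int main(void)
{
	/* steps 2-3: checkpointer dies first, HandleChildCrash() runs */
	FatalError = true;

	/* step 4: the startup process exits with a non-zero status as well */
	startup_process_crashed();

	/* step 5: RecoveryError was never set, so postmaster reinitializes */
	printf("RecoveryError = %d -> cluster %s\n", RecoveryError,
		   RecoveryError ? "stays down" : "is restarted");
	return 0;
}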


I could simulate the problem using the following changes (and by installing
recovery.conf into DATA directory):


diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 99f702c..0cbd1c1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6178,6 +6178,12 @@ StartupXLOG(void)
SetForwardFsyncRequests();
SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
bgwriterLaunched = true;
+
+   /*
+* Accidental delay, ensuring that checkpointer's crash 
is caught
+* by PM first.
+*/
+   pg_usleep(500L);
}
 
/*
diff --git a/src/backend/postmaster/checkpointer.c 
b/src/backend/postmaster/checkpointer.c
index 6c814ba..4585119 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -194,6 +194,8 @@ CheckpointerMain(void)
sigjmp_buf  local_sigjmp_buf;
MemoryContext checkpointer_context;
 
+   ereport(FATAL, (errmsg("early failure")));
+
CheckpointerShmem-checkpointer_pid = MyProcPid;
 
/*


This works for me as a fix:

diff --git a/src/backend/postmaster/postmaster.c 
b/src/backend/postmaster/postmaster.c
index 6220a8e..0fb13bb 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2573,15 +2573,11 @@ reaper(SIGNAL_ARGS)
 * After PM_STARTUP, any unexpected exit (including 
FATAL exit) of
 * the startup process is catastrophic, so kill other 
children,
 * and set RecoveryError so we don't try to 
reinitialize after
-* they're gone.  Exception: if FatalError is already 
set, that
-* implies we previously sent the startup process a 
SIGQUIT, so
-* that's probably the reason it died, and we do want 
to try to
-* restart in that case.
+* they're gone.
 */
if (!EXIT_STATUS_0(exitstatus))
{
-   if (!FatalError)
-   RecoveryError = true;
+   RecoveryError = true;
HandleChildCrash(pid, exitstatus,
 _(startup 
process));
continue;

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] Unintended restart after recovery error

2014-11-12 Thread Antonin Houska
Antonin Houska a...@cybertec.at wrote:

 While looking at postmaster.c:reaper(), one problematic case occurred to me.
 
 
 1. Startup process signals PMSIGNAL_RECOVERY_STARTED.
 
 2. Checkpointer process is forked and immediately dies.
 
 3. reaper() catches this failure, calls HandleChildCrash() and thus sets
 FatalError to true.
 
 4. Startup process exits with non-zero status code too - either due to SIGQUIT
 received from HandleChildCrash or due to some other failure of the startup
 process itself. However, FatalError is already set, because of the previous
 crash of the checkpointer. Thus reaper() does not set RecoveryError.
 
 5. As RecoverError failed to be set to true, postmaster will try to restart
 the cluster, although it apparently should not.

A more common case occurred to me as soon as I had sent the previous mail: any
process of a standby cluster has died. Thus the proposed fix would make
restart_after_crash (GUC) completely ineffective for standbys. I'm not sure if
that's desired. The question is whether RecoveryError should reflect problems
during any kind of recovery, or just during crash recovery.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] Unintended restart after recovery error

2014-11-12 Thread Antonin Houska
Fujii Masao masao.fu...@gmail.com wrote:

 On Wed, Nov 12, 2014 at 6:52 PM, Antonin Houska a...@cybertec.at wrote:
  While looking at postmaster.c:reaper(), one problematic case occurred to me.
 
 
  1. Startup process signals PMSIGNAL_RECOVERY_STARTED.
 
  2. Checkpointer process is forked and immediately dies.
 
  3. reaper() catches this failure, calls HandleChildCrash() and thus sets
  FatalError to true.
 
  4. Startup process exits with non-zero status code too - either due to 
  SIGQUIT
  received from HandleChildCrash or due to some other failure of the startup
  process itself. However, FatalError is already set, because of the previous
  crash of the checkpointer. Thus reaper() does not set RecoveryError.
 
  5. As RecoverError failed to be set to true, postmaster will try to restart
  the cluster, although it apparently should not.
 
 Why shouldn't postmaster restart the cluster in that case?
 

At least for the behavior to be consistent with simpler cases of failed
recovery (e.g. any FATAL error in StartupXLOG), which end up not restarting
the cluster.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] Convert query plan to sql query

2014-11-04 Thread Antonin Houska
mariem mariem.benfad...@gmail.com wrote:

 Hello,
 
 I would like to transform the query plan (output of the planner,
 debug_print_plan) into an sql query.

I don't think SQL can express the information the plan contains. For example,
join methods (hash, nest loop, merge).

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] from_collapse_limit considerations

2014-10-22 Thread Antonin Houska
[ I think I responded earlier but don't see my mail in the archive... ]

Tom Lane t...@sss.pgh.pa.us wrote:
 Antonin Houska a...@cybertec.at writes:
  I noticed that - unlike join_collapse_limit - the from_collapse_limit does 
  not
  enforce maximum length of the top-level list. Shouldn't it do?

 How would it do that?  You want it to fail outright if there are more than
 N tables?  That seems unhelpful.

Sure, truncation of the FROM list would be crazy, and that's not what I tried 
to propose.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




[HACKERS] from_collapse_limit considerations

2014-09-22 Thread Antonin Houska
;
+
+	/*
+	 * If appropriate, fold items into sub-lists so that from_collapse_limit is
+	 * never exceeded.
+	 */
+	while (list_length(result) > from_collapse_limit)
+	{
+		ListCell *lc;
+		List *fromlist_new = NIL;
+		List *fromlist_sub = NIL;
+
+		foreach(lc, result)
+		{
+			fromlist_sub = lappend(fromlist_sub, lfirst(lc));
+
+			if (list_length(fromlist_sub) >= from_collapse_limit)
+			{
+/*
+ * Incorporate the whole sub-list into the new FROM list
+ * and start a new sub-list.
+ */
+fromlist_new = lappend(fromlist_new, fromlist_sub);
+fromlist_sub = NIL;
+			}
+		}
+
+		/*
+		 * Length of the last sublist might not have reached
+		 * from_collapse_limit while being checked in the loop above.
+		 */
+		if (list_length(fromlist_sub) > 0)
+		{
+			if (list_length(fromlist_sub) == 1)
+/* No 1-element sublists. */
+fromlist_new = list_concat(fromlist_new, fromlist_sub);
+			else
+fromlist_new = lappend(fromlist_new, fromlist_sub);
+		}
+
+		/* Free the original list (but not the contained items). */
+		list_free(result);
+
+		/* Replace the list with folded one. */
+		result = fromlist_new;
+	}
+
+	return result;
+}
+
+
 /*
  * make_outerjoininfo
  *	  Build a SpecialJoinInfo for the current outer join

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



[HACKERS] Another logical decoding assertion failure

2014-08-15 Thread Antonin Houska
http://www.postgresql.org/message-id/blu436-smtp12682d628f61ab9736099c3dc...@phx.gbl
reminds me that I also saw an assertion failure recently. Although I
wanted to isolate and report my issue when my vacation is over, this
report made me curious whether I had seen the same thing. Eventually it seems
to be a different symptom.

Following are the steps that trigger the failure in my environment
(9.5devel), although it's not always that straightforward - sometimes I
need to create and populate one more table. (I use the 'test_decoding'
contrib module.)


postgres=# SELECT pg_create_logical_replication_slot('my_slot',
'test_decoding');
 pg_create_logical_replication_slot

 (my_slot,0/16F3B30)
(1 row)

postgres=# CREATE TABLE a AS SELECT * FROM
pg_logical_slot_get_changes('my_slot', NULL, NULL) WITH NO DATA;
SELECT 0
postgres=# INSERT INTO a SELECT * FROM
pg_logical_slot_get_changes('my_slot', NULL, NULL);
INSERT 0 2
postgres=# INSERT INTO a SELECT * FROM
pg_logical_slot_get_changes('my_slot', NULL, NULL);
The connection to the server was lost. Attempting reset: Failed.
!

The stack trace (should I upload / send the whole core file anywhere?):

(gdb) bt
#0  0x7f12d0640849 in raise () from /lib64/libc.so.6
#1  0x7f12d0641cd8 in abort () from /lib64/libc.so.6
#2  0x008c2d51 in ExceptionalCondition (conditionName=0xa926a0
"!(((bool)((relation)->rd_refcnt == 0)))",
errorType=0xa92210 "FailedAssertion", fileName=0xa92104
"relcache.c", lineNumber=1892) at assert.c:54
#3  0x008b2776 in RelationDestroyRelation
(relation=0x7f12c766d240, remember_tupdesc=0 '\000') at relcache.c:1892
#4  0x008b2c28 in RelationClearRelation
(relation=0x7f12c766d240, rebuild=1 '\001') at relcache.c:2106
#5  0x008b2fa3 in RelationFlushRelation
(relation=0x7f12c766d240) at relcache.c:2204
#6  0x008b30b5 in RelationCacheInvalidateEntry
(relationId=16384) at relcache.c:2256
#7  0x008abc65 in LocalExecuteInvalidationMessage
(msg=0x28c1260) at inval.c:567
#8  0x0073e1f5 in ReorderBufferExecuteInvalidations
(rb=0x28b0318, txn=0x28b0430) at reorderbuffer.c:1798
#9  0x0073ddbf in ReorderBufferForget (rb=0x28b0318, xid=1279,
lsn=24154944) at reorderbuffer.c:1645
#10 0x007383f8 in DecodeCommit (ctx=0x2894658,
buf=0x7fff3e658c30, xid=1279, dboid=12736, commit_time=461418357793855,
nsubxacts=0, sub_xids=0x28af650, ninval_msgs=69, msgs=0x28af650) at
decode.c:539
#11 0x00737c91 in DecodeXactOp (ctx=0x2894658,
buf=0x7fff3e658c30) at decode.c:207
#12 0x007379ce in LogicalDecodingProcessRecord (ctx=0x2894658,
record=0x28af610) at decode.c:103
#13 0x0073b0d6 in pg_logical_slot_get_changes_guts
(fcinfo=0x7fff3e658f50, confirm=1 '\001', binary=0 '\000') at
logicalfuncs.c:432
#14 0x0073b221 in pg_logical_slot_get_changes
(fcinfo=0x7fff3e658f50) at logicalfuncs.c:478
#15 0x0063e1a3 in ExecMakeTableFunctionResult
(funcexpr=0x288be78, econtext=0x288b758, argContext=0x28b5580,
expectedDesc=0x288ccd8, randomAccess=0 '\000') at execQual.c:2157
#16 0x0065e302 in FunctionNext (node=0x288ba18) at
nodeFunctionscan.c:95
#17 0x0064537e in ExecScanFetch (node=0x288ba18,
accessMtd=0x65e24b FunctionNext, recheckMtd=0x65e630 FunctionRecheck)
at execScan.c:82
#18 0x006453ed in ExecScan (node=0x288ba18, accessMtd=0x65e24b
FunctionNext, recheckMtd=0x65e630 FunctionRecheck)
at execScan.c:132
#19 0x0065e665 in ExecFunctionScan (node=0x288ba18) at
nodeFunctionscan.c:269
#20 0x0063a649 in ExecProcNode (node=0x288ba18) at
execProcnode.c:426
#21 0x0065ca53 in ExecModifyTable (node=0x288b8c0) at
nodeModifyTable.c:926
#22 0x0063a577 in ExecProcNode (node=0x288b8c0) at
execProcnode.c:377
#23 0x00638465 in ExecutePlan (estate=0x288b518,
planstate=0x288b8c0, operation=CMD_INSERT, sendTuples=0 '\000',
numberTuples=0,
direction=ForwardScanDirection, dest=0x2815ac0) at execMain.c:1475
#24 0x0063667a in standard_ExecutorRun (queryDesc=0x288f948,
direction=ForwardScanDirection, count=0) at execMain.c:308
#25 0x00636512 in ExecutorRun (queryDesc=0x288f948,
direction=ForwardScanDirection, count=0) at execMain.c:256
#26 0x007998ec in ProcessQuery (plan=0x28159c8,
sourceText=0x2854d18 INSERT INTO a SELECT * FROM
pg_logical_slot_get_changes('my_slot', NULL, NULL);, params=0x0,
dest=0x2815ac0,
completionTag=0x7fff3e659920 ) at pquery.c:185
#27 0x0079b16e in PortalRunMulti (portal=0x2890638, isTopLevel=1
'\001', dest=0x2815ac0, altdest=0x2815ac0,
completionTag=0x7fff3e659920 ) at pquery.c:1279
#28 0x0079a7b1 in PortalRun (portal=0x2890638,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x2815ac0,
altdest=0x2815ac0,
completionTag=0x7fff3e659920 ) at pquery.c:816
#29 0x0079487b in exec_simple_query (
query_string=0x2854d18 INSERT INTO a SELECT * FROM
pg_logical_slot_get_changes('my_slot', NULL, NULL);) at 

Re: [HACKERS] Another logical decoding assertion failure

2014-08-15 Thread Antonin Houska
On 08/15/2014 03:16 PM, Andres Freund wrote:
 On 2014-08-15 14:53:45 +0200, Antonin Houska wrote:
 postgres=# SELECT pg_create_logical_replication_slot('my_slot',
 'test_decoding');
  pg_create_logical_replication_slot
 
  (my_slot,0/16F3B30)
 (1 row)

 postgres=# CREATE TABLE a AS SELECT * FROM
 pg_logical_slot_get_changes('my_slot', NULL, NULL) WITH NO DATA;
 SELECT 0
 postgres=# INSERT INTO a SELECT * FROM
 pg_logical_slot_get_changes('my_slot', NULL, NULL);
 INSERT 0 2
 postgres=# INSERT INTO a SELECT * FROM
 pg_logical_slot_get_changes('my_slot', NULL, NULL);
 The connection to the server was lost. Attempting reset: Failed.
 !
 
 Is this just to reproduce the issue, or are you really storing the
 results of logical decoding in a table again?
 
 Why? That'll just result in circularity, no?  It's not something I
 really thought about allowing when writing this. Possibly we can make it
 work, but I don't really see the point. We obviously shouldn't just
 crash, but that's not my point.

It was basically an experiment. I tried to capture changes into a table.
I don't know whether it's legal (i.e. whether the current call of
pg_logical_slot_get_changes() should include changes that the current
transaction did). An unlogged / temporary table may be necessary for a case
like this.

The reason I report it is that (I think), if such a use case is not
legal, it should be detected somehow and handled using ereport / elog.

// Tony





[HACKERS] Review: ECPG FETCH readahead

2014-04-23 Thread Antonin Houska
I haven't been too familiar with the ECPG internals so far but tried to
do my best.


Generic criteria


* Does it follow the project coding guidelines?

  Yes.


* Are there portability issues?

  Shouldn't be. I even noticed the code tries to avoid platform-specific
behaviour of standard library function - see comment about strtoll() in
Linux in 25.patch. (I personally don't know how that function works
elsewhere but that shouldn't matter.)


* Are the comments sufficient and accurate?

  Yes, mostly. Just a few improvements recommended below.


* Does it do what it says, correctly?
  IMO, yes.


* Does it produce compiler warnings?
  No.


* Can you make it crash?
  No.


Only some of the parts deserve comments:


23.patch


Reviewed earlier as a component of the related patch
(http://www.postgresql.org/message-id/52a1e61a.7010...@gmail.com)
and minimal changes have been done since that time. Nevertheless, a few more comments:

* How about a regression test for the new ECPGcursor_dml() function?

* ECPGtrans() - arguments are explained, but the return (bool) value
should be as well.

* line breaks (pgindent might help):

static void
output_cursor_name(struct cursor *ptr)
{

instead of

static void output_cursor_name(struct cursor *ptr)
{


* confused messages in src/interfaces/ecpg/test/sql/cursorsubxact.pgc,
starting at  100:

exec sql open :curname;
if (sqlca.sqlcode < 0)
    printf("open %s (case sensitive) failed with SQLSTATE %5s\n",
           curname, sqlca.sqlstate);
else
    printf("close %s (case sensitive) succeeded\n", curname);

I suppose both should be open.


26.patch (the READAHEAD feature itself)
---

I tried to understand the code but couldn't find any obvious error. The
coding style looks clean. Maybe just the arguments and return value of
the ecpglib functinons (ECPGopen, ECPGopen, ECPGfetch, ...) deserve a
bit more attention.

As for tests, I find them comprehensive and almost everything they do is
clear to me. Just the following was worth questions:

* sql-cursor-ra-fetch.stderr

[NO_PID]: ecpg_execute on line 169: query: move forward all in
scroll_cur; with 0 parameter(s) on connection regress1
...
[NO_PID]: ecpg_execute on line 169: query: move relative -3 in
scroll_cur; with 0 parameter(s) on

As the first iteration is special anyway, wouldn't move absolute -3 be
more efficient than the existing 2 commands?

The following test (also FETCH RELATIVE) uses move absolute:

[NO_PID]: ecpg_execute on line 186: query: move absolute -20 in
scroll_cur; with 0 parameter(s) on connection regress1


Other than this, I had an idea to improve the behaviour if READAHEAD is
smaller than the actual step, but then realized that 29.patch actually
does fix that :-)


* cursor-ra-move.pgc

What's the relevant difference between unspec_cur1 and unspec_cur2
cursors? There's no difference in scrollability or ordering. And the
tests seem to be identical.


* cursor-ra-swdir.pgc

  No questions


* cursorsubxact.pgc

This appears to be the test we discussed earlier:
http://www.postgresql.org/message-id/52a1cab6.9020...@cybertec.at

The only difference I see is minor change of log message of DECLARE
command. Therefore I didn't have to recheck the logic of the test.


28.patch


 * ecpg_cursor_do_move_absolute() and ecpg_cursor_do_move_all() should
better be declared in 26.patch. Just a pedantic comment - all the parts
will probably be applied all at once.


Other
-

Besides the individual parts I propose some typo fixes and
improvements in wording:


* doc/src/sgml/ecpg.sgml

462c462
ECPG does cursor accounting in its runtime library and this makes
possible
---
ECPG does cursor accounting in its runtime library and this makes
it possible
504c504
  recommended to recompile using option <option>-r
readahead=number</option>
---
  recommended to recompile it using option <option>-r
readahead=number</option>


* src/interfaces/ecpg/ecpglib/extern.h

97c97
* The cursor was created in this level of * (sub-)transaction.
---
* The cursor was created at this level of (sub-)transaction.


* src/interfaces/ecpg/ecpglib/README.cursor+subxact

4c4
 Contents of tuples returned by a cursor always reflect the data present at
---
 Contents of tuples returned by a cursor always reflects the data
present at
29c29
 needs two operations. If the next command issued by the application spills
---
 need two operations. If the next command issued by the application spills
32c32
 kinds of commands as is after 3 cache misses. FETCH FORWARD/BACKWARD
allows
---
 kinds of commands as is after 3 cache misses. FETCH FORWARD/BACKWARD
allows
81c81
 I can also be negative (but also 1-based) if the application started with
---
 It can also be negative (but also 1-based) if the application started with
132c132
 is that (sub-)transactions are also needed to be tracked. These two are
---
 is that (sub-)transactions also need to 

Re: [HACKERS] Review: ECPG FETCH readahead

2014-04-23 Thread Antonin Houska
[Now I'm only replying where my explanation seems useful. If you expect
anything else, please remind me.]

On 04/23/2014 06:41 PM, Boszormenyi Zoltan wrote:
 
 All exported ECPG functions returns bool. IIRC the code generated by
 EXEC SQL WHENEVER something-else-than-CONTINUE makes use
 of the returned value.

ok


 26.patch (the READAHEAD feature itself)
 ---
 Maybe just the arguments and return value of
 the ecpglib functinons (ECPGopen, ECPGopen, ECPGfetch, ...) deserve a
 bit more attention.
 
 What do you mean exactly?

Basically the missing description of the return type was the most blatant
issue, but you explained it above. Now that I see some of the existing library
functions, the descriptions of parameters are not too eloquent either. So
ignore this remark.


 * sql-cursor-ra-fetch.stderr

 [NO_PID]: ecpg_execute on line 169: query: move forward all in
 scroll_cur; with 0 parameter(s) on connection regress1
 ...
 [NO_PID]: ecpg_execute on line 169: query: move relative -3 in
 scroll_cur; with 0 parameter(s) on

 As the first iteration is special anyway, wouldn't move absolute -3 be
 more efficient than the existing 2 commands?
 
 The caching code tries to do the correct thing whichever direction
 the cursor is scanned. AFAIR this one explicitly tests invalidating
 the readahead window if you fall off it by using MOVE FORWARD ALL.

I have no doubt about the correctness of the logic, I just suspect that a
single MOVE command could do the job. Perhaps consider it my paranoia
and let the committer judge whether it's worth a change (especially if the
related amount of coding would seem inadequate).


 Other
 -

 Besides the individual parts I propose some typo fixes and
 improvements in wording:


 * doc/src/sgml/ecpg.sgml

In general, I'm not a native English speaker and can be wrong in some cases.
I just pointed out what I thought was worth checking.

// Tony




Re: [HACKERS] Review: ECPG FETCH readahead

2014-04-23 Thread Antonin Houska
On 04/23/2014 05:24 PM, Alvaro Herrera wrote:
 Antonin Houska wrote:
 I haven't been too familiar with the ECPG internals so far but tried to
 do my best.
 
 I'm afraid we're stuck on this patch until Michael has time to review
 it, or some other committer wants to acquire maintainership rights in
 the ECPG code.
 

Committer availability might well be the issue, but a missing review
probably is too.

Whether this review is enough to move the patch to ready for committer
- I tend to let the next CFM decide. (I don't find it productive to
ignite another round of discussion about kinds of reviews - I've already seen
some.)

// Tony




Re: [HACKERS] Proposal: fix range queries in btree_gin

2014-03-28 Thread Antonin Houska
On 03/28/2014 04:30 PM, Alexander Korotkov wrote:

 We have range types, and restriction col <@ range can be correctly
 handled by gin_extract_query, because it will be passed there as single
 restriction. This is workaround itself, but it's weird to force users
 express queries like this.

This reminds me of my earlier experiment

http://www.postgresql.org/message-id/51fbc99d.7040...@gmail.com

even though my motivation was different: to make comparePartial()
support function unnecessary.

// Antonin Houska (Tony)





Re: [HACKERS] Comment - uniqueness of relfilenode

2014-03-05 Thread Antonin Houska
On 03/06/2014 04:33 AM, Robert Haas wrote:
 On Wed, Mar 5, 2014 at 8:54 PM, Bruce Momjian br...@momjian.us wrote:
 On Mon, Nov 11, 2013 at 05:48:52PM +0100, Antonin Houska wrote:
 On 11/10/2013 12:57 AM, Robert Haas wrote:
 On Thu, Nov 7, 2013 at 10:56 AM, Antonin Houska
 antonin.hou...@gmail.com wrote:
 catalog/catalog.c:GetNewRelFileNode() and its calls indicate that the
 following change makes sense:


 diff --git a/src/include/storage/relfilenode.h
 b/src/include/storage/relfilenode.h
 index 75f897f..7190974 100644
 --- a/src/include/storage/relfilenode.h
 +++ b/src/include/storage/relfilenode.h
 @@ -55,7 +55,7 @@ typedef enum ForkNumber
   * relNode identifies the specific relation.  relNode corresponds to
   * pg_class.relfilenode (NOT pg_class.oid, because we need to be able
   * to assign new physical files to relations in some situations).
 - * Notice that relNode is only unique within a particular database.
 + * Notice that relNode is only unique within a particular tablespace.
   *
   * Note: spcNode must be GLOBALTABLESPACE_OID if and only if dbNode is
   * zero.  We support shared relations only in the global tablespace.


 // Antonin Houska (Tony)

 Technically speaking, I think it's only guaranteed to be unique with a
 database-tablespace combination.  In other words, the same OID can be
 reused as a relfilenode if *either* of those two values differs.

 You're right. I missed the fact that Postgres (unlike another DBMS that
 I worked with) allows for tablespace to be shared across databases.

 I have update the C comment:

   * Notice that relNode is only unique within a particular database.
 ---
   * Notice that relNode is only unique within a particular 
 tablespace.
 
 Yep. But the new text is no more correct than the old text.  Did you
 read what I wrote upthread?

Perhaps ... unique within a particular (tablespace, database)
combination. ?

// Tony



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Equivalence Rules

2014-03-02 Thread Antonin Houska
There are 2 kinds of rules in this document: for joins and for set
operations.

As for joins, I think they are all about *inner* joins. Postgres (IMO)
implements them by not doing anything special if the query only contains
inner joins.

On the other hand, attention has to be paid if there's at least one
*outer* join in the query. Identities summarized in 'Valid OUTER JOIN
Optimizations' section of optimizer/README come into play then. I think
make_outerjoininfo() is the code to recognize these relationships in the
original query, and join_is_legal() then to check if new joins (those
not present in the original query) do not change the semantics.

(As for set operations, someone else needs to explain.)

// Antonin Houska (Tony)


On 03/02/2014 09:02 AM, Ali Piroozi wrote:
 Hi
 
 My question is:
 Does PostgreSQL implements equivalence rules(from those are listed in
 email's attachment)?
 Which function or which part of source code(in PostgreSQL ) implements
 the equivalence rules?
 I think, this should be implemented in query optimization part of
 PostgreSQL, but which rule
 and where, I don't know?
 I want to use that(function or part of source code), to produce the
 equivalence Relational Algebras (based on equivalence rules in
 attachment) for a given SQL query(Relational Algebra).
 
 Thanks
 
 
 
 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Backup throttling

2014-02-28 Thread Antonin Houska
On 02/27/2014 11:04 PM, Alvaro Herrera wrote:
 I pushed this patch with a few further tweaks.  In your changes to
 address the above point, you made the suffix mandatory in the
 pg_basebackup -r option.  This seemed a strange restriction, so I
 removed it.  It seems more user-friendly to me to accept the value as
 being expressed in kilobytes per second without requiring the suffix to
 be there; the 'k' suffix is then also accepted and has no effect.  I
 amended the docs to say that also.
 
 If you or others feel strongly about this, we can still tweak it, of
 course.

I'm used to assuming the base unit when there's no suffix, but I have no
objections against considering kB the default. I see you adjusted the
documentation too.

 I also moved the min/max #defines to replication/basebackup.h, and
 included that file in pg_basebackup.c.  This avoids the duplicated
 values.  That file is okay to be included there.

I had in mind that pg_basebackup.c is not linked to the backend, but
you're right, merely including the header is something else.
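
Just to illustrate the arrangement (a sketch only - the constant values are
those from the patch, the exact header contents are my assumption):

/* src/include/replication/basebackup.h */
#define MAX_RATE_LOWER	32768			/* 32 kB/s */
#define MAX_RATE_UPPER	(1024 << 20)	/* 1 GB/s */

/* src/bin/pg_basebackup/pg_basebackup.c */
#include "replication/basebackup.h"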

 Thanks for your patch, and the numerous reviewers who took part.

Thanks for committing - this is my first patch :-)

// Tony




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Backup throttling

2014-02-03 Thread Antonin Houska
On 01/31/2014 06:26 AM, Fujii Masao wrote:
 Is there a good place to define the constant, so that both backend and
 client can use it? I'd say 'include/common' but no existing file seems
 to be appropriate. I'm not sure if it's worth to add a new file there.
 
 If there is no good place to define them, it's okay to define them
 also in client side
 for now.
 +termBASE_BACKUP [literalLABEL/literal
 replaceable'label'/replaceable] [literalPROGRESS/literal]
 [literalFAST/literal] [literalWAL/literal]
 [literalNOWAIT/literal] [literalMAX_RATE/literal]/term
 
 It's better to add something like replaceable'rate'/replaceable just after
 literalMAX_RATE/literal.
 
 + para
 +  literalMAX_RATE/literal does not affect WAL streaming.
 + /para
 
 I don't think that this paragraph is required here because BASE_BACKUP is
 basically independent from WAL streaming.
 
 Why did you choose bytes per second as a valid rate which we can specify?
 Since the minimum rate is 32kB, isn't it better to use KB per second for 
 that?
 If we do that, we can easily increase the maximum rate from 1GB to very large
 number in the future if required.

The attached version addresses all the comments above.


 +wait_result = WaitLatch(&MyWalSnd->latch, WL_LATCH_SET | WL_TIMEOUT
 +| WL_POSTMASTER_DEATH, (long) (sleep / 1000));
 
 If WL_POSTMASTER_DEATH is triggered, we should exit immediately like
 other process does? This is not a problem of this patch. This problem exists
 also in current master. But ISTM it's better to solve that together. Thought?

Once we're careful about not missing signals, I think PM death should be
noticed too. The backup functionality itself would probably manage to
finish without the postmaster; however, it's executed under the walsender
process.

The question is where the !PostmasterIsAlive() check should be added. I
think it should go into the main loop of perform_base_backup(), but that's
probably not in the scope of this patch.

Do you think that my patch should only add a comment like "Don't forget
to set the WL_POSTMASTER_DEATH flag" when making basebackup.c sensitive
to PM death?
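
To make the question more concrete, this is roughly the check I have in mind
(a sketch only; the placement in perform_base_backup() and the exact wording
are my assumptions, not part of the patch):

/* React to postmaster death while the backup is being sent. */
if (!PostmasterIsAlive())
	ereport(FATAL,
			(errcode(ERRCODE_ADMIN_SHUTDOWN),
			 errmsg("terminating base backup due to unexpected postmaster exit")));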

// Antonin Houska (Tony)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 832524e..704b653 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1772,7 +1772,7 @@ The commands accepted in walsender mode are:
   /varlistentry
 
   varlistentry
-termBASE_BACKUP [literalLABEL/literal replaceable'label'/replaceable] [literalPROGRESS/literal] [literalFAST/literal] [literalWAL/literal] [literalNOWAIT/literal]/term
+termBASE_BACKUP [literalLABEL/literal replaceable'label'/replaceable] [literalPROGRESS/literal] [literalFAST/literal] [literalWAL/literal] [literalNOWAIT/literal] [literalMAX_RATE/literal replaceable'rate'/replaceable]/term
 listitem
  para
   Instructs the server to start streaming a base backup.
@@ -1840,7 +1840,20 @@ The commands accepted in walsender mode are:
   the waiting and the warning, leaving the client responsible for
   ensuring the required log is available.
  /para
- /listitem
+/listitem
+   /varlistentry
+   varlistentry
+termliteralMAX_RATE/literal replaceablerate//term
+listitem
+ para
+  Limit (throttle) the maximum amount of data transferred from server
+  to client per unit of time. The expected unit is kilobytes per second.
+  If this option is specified, the value must either be equal to zero
+  or it must fall within the range from 32 kB through 1 GB (inclusive).
+  If zero is passed or the option is not specified, no restriction is
+  imposed on the transfer.
+ /para
+/listitem
/varlistentry
   /variablelist
  /para
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index c379df5..f8866db 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -189,6 +189,27 @@ PostgreSQL documentation
  /varlistentry
 
  varlistentry
+  termoption-r replaceable class=parameterrate/replaceable/option/term
+  termoption--max-rate=replaceable class=parameterrate/replaceable/option/term
+  listitem
+   para
+The maximum amount of data transferred from server per second.
+The purpose is to limit the impact of applicationpg_basebackup/application
+on the running server.
+   /para
+   para
+This option always affects transfer of the data directory. Transfer of
+WAL files is only affected if the collection method is literalfetch/literal.
+   /para
+   para
+Valid values are between literal32 kB/literal and literal1024 MB/literal.
+Suffixes literalk/literal (kilobytes) and literalM/literal
+(megabytes) are accepted.
+   /para
+  /listitem
+ /varlistentry
+
+ varlistentry
   termoption-R/option

[HACKERS] bgworker crashed or not?

2014-01-31 Thread Antonin Houska
In 9.3 I noticed that the postmaster considers a bgworker crashed (and
therefore tries to restart it) even if it has exited with a zero status code.

I first thought about a patch like the one below, but then noticed that
the postmaster.c:bgworker_quickdie() signal handler exits with 0 too (even
when there's no success). Do we need my patch, my patch plus something for
the handler, or no patch at all?

// Antonin Houska (Tony)


diff --git a/src/backend/postmaster/postmaster.c
b/src/backend/postmaster/postmaster.c
index 0957e91..0313fd7 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2791,11 +2814,7 @@ reaper(SIGNAL_ARGS)

/* Was it one of our background workers? */
if (CleanupBackgroundWorker(pid, exitstatus))
-   {
-   /* have it be restarted */
-   HaveCrashedWorker = true;
continue;
-   }

/*
 * Else do standard backend child cleanup.
@@ -2851,7 +2870,10 @@ CleanupBackgroundWorker(int pid,

/* Delay restarting any bgworker that exits with a
nonzero status. */
if (!EXIT_STATUS_0(exitstatus))
+   {
rw->rw_crashed_at = GetCurrentTimestamp();
+   HaveCrashedWorker = true;
+   }
else
rw->rw_crashed_at = 0;



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Backup throttling

2014-01-21 Thread Antonin Houska
I realized the following should be applied on top of v7:

index a0216c1..16dd939 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1263,7 +1263,7 @@ throttle(size_t increment)
throttling_counter %= throttling_sample;

/* Once the (possible) sleep has ended, new period starts. */
-   if (wait_result | WL_TIMEOUT)
+   if (wait_result & WL_TIMEOUT)
throttled_last += elapsed + sleep;
else if (sleep > 0)
/* Sleep was necessary but might have been interrupted. */

// Tony

On 01/20/2014 05:10 PM, Antonin Houska wrote:
 On 01/15/2014 10:52 PM, Alvaro Herrera wrote:
 I gave this patch a look.  There was a bug that the final bounds check
 for int32 range was not done when there was no suffix, so in effect you
 could pass numbers larger than UINT_MAX and pg_basebackup would not
 complain until the number reached the server via BASE_BACKUP.  Maybe
 that's fine, given that numbers above 1G will cause a failure on the
 server side anyway, but it looked like a bug to me.  I tweaked the parse
 routine slightly; other than fix the bug, I made it accept fractional
 numbers, so you can say 0.5M for instance.
 
 Thanks.
 
 Perhaps we should make sure pg_basebackup rejects numbers larger than 1G
 as well.
 
 Is there a good place to define the constant, so that both backend and
 client can use it? I'd say 'include/common' but no existing file seems
 to be appropriate. I'm not sure if it's worth to add a new file there.
 
 Another thing I found a bit strange was the use of the latch.  What this
 patch does is create a separate latch which is used for the throttling.
 This means that if the walsender process receives a signal, it will not
 wake up if it's sleeping in throttling.  Perhaps this is okay: as Andres
 was quoted upthread as saying, maybe this is not a problem because the
 sleep times are typically short anyway.  But we're pretty much used to
 the idea that whenever a signal is sent, processes act on it
 *immediately*.  Maybe some admin will not feel comfortable about waiting
 some extra 20ms when they cancel their base backups.  In any case,
 having a secondary latch to sleep on in a process seems weird.  Maybe
 this should be using MyWalSnd->latch somehow.
 
 o.k., MyWalSnd->latch is used now.
 
 You have this interesting THROTTLING_MAX_FREQUENCY constant defined to
 128, with the comment check this many times per second.
 Let's see: if the user requests 1MB/s, this value results in
 throttling_sample = 1MB / 128 = 8192.  So for every 8kB transferred, we
 would stop, check the wall clock time, and if less time has lapsed than
 we were supposed to spend transferring those 8kB then we sleep.  Isn't a
 check every 8kB a bit too frequent?  This doesn't seem sensible to me.
 I think we should be checking perhaps every tenth of the requested
 maximum rate, or something like that, not every 1/128th.

 Now, what the code actually does is not necessarily that, because the
 sampling value is clamped to a minimum of 32 kB.  But then if we're
 going to do that, why use such a large divisor value in the first place?
 I propose we set that constant to a smaller value such as 8.
 
 I tried to use THROTTLING_SAMPLE_MIN and THROTTLING_MAX_FREQUENCY to
 control both the minimum and maximum chunk size. It was probably too
 generic, THROTTLING_SAMPLE_MIN is no longer there.
 
 New patch version is attached.
 
 // Antonin Houska (Tony)
 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Backup throttling

2014-01-20 Thread Antonin Houska
On 01/15/2014 10:52 PM, Alvaro Herrera wrote:
 I gave this patch a look.  There was a bug that the final bounds check
 for int32 range was not done when there was no suffix, so in effect you
 could pass numbers larger than UINT_MAX and pg_basebackup would not
 complain until the number reached the server via BASE_BACKUP.  Maybe
 that's fine, given that numbers above 1G will cause a failure on the
 server side anyway, but it looked like a bug to me.  I tweaked the parse
 routine slightly; other than fix the bug, I made it accept fractional
 numbers, so you can say 0.5M for instance.

Thanks.

 Perhaps we should make sure pg_basebackup rejects numbers larger than 1G
 as well.

Is there a good place to define the constant, so that both backend and
client can use it? I'd say 'include/common' but no existing file seems
to be appropriate. I'm not sure if it's worth to add a new file there.

 Another thing I found a bit strange was the use of the latch.  What this
 patch does is create a separate latch which is used for the throttling.
 This means that if the walsender process receives a signal, it will not
 wake up if it's sleeping in throttling.  Perhaps this is okay: as Andres
 was quoted upthread as saying, maybe this is not a problem because the
 sleep times are typically short anyway.  But we're pretty much used to
 the idea that whenever a signal is sent, processes act on it
 *immediately*.  Maybe some admin will not feel comfortable about waiting
 some extra 20ms when they cancel their base backups.  In any case,
 having a secondary latch to sleep on in a process seems weird.  Maybe
 this should be using MyWalSnd->latch somehow.

o.k., MyWalSnd->latch is used now.
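
In other words, the sleep now looks roughly like this (simplified sketch;
the real code also has to handle the other result flags, and the
WL_POSTMASTER_DEATH reaction shown here is only an illustration):

wait_result = WaitLatch(&MyWalSnd->latch,
						WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
						(long) (sleep / 1000));		/* microseconds -> milliseconds */
if (wait_result & WL_POSTMASTER_DEATH)
	proc_exit(1);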

 You have this interesting THROTTLING_MAX_FREQUENCY constant defined to
 128, with the comment check this many times per second.
 Let's see: if the user requests 1MB/s, this value results in
 throttling_sample = 1MB / 128 = 8192.  So for every 8kB transferred, we
 would stop, check the wall clock time, and if less time has lapsed than
 we were supposed to spend transferring those 8kB then we sleep.  Isn't a
 check every 8kB a bit too frequent?  This doesn't seem sensible to me.
 I think we should be checking perhaps every tenth of the requested
 maximum rate, or something like that, not every 1/128th.
 
 Now, what the code actually does is not necessarily that, because the
 sampling value is clamped to a minimum of 32 kB.  But then if we're
 going to do that, why use such a large divisor value in the first place?
 I propose we set that constant to a smaller value such as 8.

I tried to use THROTTLING_SAMPLE_MIN and THROTTLING_MAX_FREQUENCY to
control both the minimum and maximum chunk size. It was probably too
generic, THROTTLING_SAMPLE_MIN is no longer there.
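
Just to put numbers on the frequency discussion above (a standalone sketch
of the arithmetic, nothing more):

#include <stdio.h>

int
main(void)
{
	unsigned int rate = 1024 * 1024;	/* 1 MB/s requested */

	/* chunk size after which the elapsed time is checked */
	printf("divisor 128: %u bytes\n", rate / 128);	/* 8192   */
	printf("divisor   8: %u bytes\n", rate / 8);	/* 131072 */
	return 0;
}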

New patch version is attached.

// Antonin Houska (Tony)
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 7d99976..799d214 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1720,7 +1720,7 @@ The commands accepted in walsender mode are:
   /varlistentry
 
   varlistentry
-termBASE_BACKUP [literalLABEL/literal replaceable'label'/replaceable] [literalPROGRESS/literal] [literalFAST/literal] [literalWAL/literal] [literalNOWAIT/literal]/term
+termBASE_BACKUP [literalLABEL/literal replaceable'label'/replaceable] [literalPROGRESS/literal] [literalFAST/literal] [literalWAL/literal] [literalNOWAIT/literal] [literalMAX_RATE/literal]/term
 listitem
  para
   Instructs the server to start streaming a base backup.
@@ -1788,7 +1788,23 @@ The commands accepted in walsender mode are:
   the waiting and the warning, leaving the client responsible for
   ensuring the required log is available.
  /para
- /listitem
+/listitem
+   /varlistentry
+   varlistentry
+termliteralMAX_RATE/literal replaceablerate//term
+listitem
+ para
+  Limit (throttle) the maximum amount of data transferred from server
+  to client per unit of time. The expected unit is bytes per second.
+  If this option is specified, the value must either be equal to zero
+  or it must fall within the range from 32 kB through 1 GB (inclusive).
+  If zero is passed or the option is not specified, no restriction is
+  imposed on the transfer.
+ /para
+ para
+  literalMAX_RATE/literal does not affect WAL streaming.
+ /para
+/listitem
/varlistentry
   /variablelist
  /para
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index c379df5..2ec81b7 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -189,6 +189,27 @@ PostgreSQL documentation
  /varlistentry
 
  varlistentry
+  termoption-r replaceable class=parameterrate/replaceable/option/term
+  termoption--max-rate=replaceable class=parameterrate/replaceable

[HACKERS] Assertion failure in base backup code path

2013-12-16 Thread Antonin Houska
In HEAD, pg_basebackup causes the WAL sender to fail when the replication
user is not a superuser:


#0  0x7f34f671dd25 in raise () from /lib64/libc.so.6
#1  0x7f34f671f1a8 in abort () from /lib64/libc.so.6
#2  0x008989a9 in ExceptionalCondition (conditionName=0xa51ac1
!(IsTransactionState()), errorType=0xa51734 FailedAssertion,
fileName=0xa516e0 catcache.c, lineNumber=) at assert.c:54
#3  0x0087ea36 in SearchCatCache (cache=0x23c4fb8, v1=16384,
v2=0, v3=0, v4=0) at catcache.c:
#4  0x00890cdd in SearchSysCache (cacheId=11, key1=16384,
key2=0, key3=0, key4=0) at syscache.c:909
#5  0x008a9a99 in has_rolreplication (roleid=16384) at
miscinit.c:401
#6  0x005146d1 in do_pg_start_backup (backupidstr=0x239d5b0
bkp_01, fast=0 '\000', starttli_p=0x7fff78e4f8ec,
labelfile=0x7fff78e4f8e0) at xlog.c:9633
#7  0x00733a24 in perform_base_backup (opt=0x7fff78e4fa30,
tblspcdir=0x242c6a0) at basebackup.c:106
#8  0x00735013 in SendBaseBackup (cmd=0x239dbf8) at basebackup.c:563
#9  0x0072f4f2 in exec_replication_command (cmd_string=0x23ea078
BASE_BACKUP LABEL 'bkp_01'  WAL  NOWAIT) at walsender.c:668
#10 0x0077c5c4 in PostgresMain (argc=1, argv=0x2385358,
dbname=0x2385248 , username=0x2385210 postgres_replication) at
postgres.c:4009
#11 0x00717c94 in BackendRun (port=0x23a2e90) at postmaster.c:4085
#12 0x0071742e in BackendStartup (port=0x23a2e90) at
postmaster.c:3774
#13 0x00713cc9 in ServerLoop () at postmaster.c:1585
#14 0x00713370 in PostmasterMain (argc=3, argv=0x2381f60) at
postmaster.c:1240
#15 0x00677698 in main (argc=3, argv=0x2381f60) at main.c:196

Some additional condition may be needed in the Assert() statement?
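
Or, instead of touching the Assert(), the catalog lookup could be wrapped in
a transaction. Purely as an illustration (the placement and the message are
my assumptions, this is not a proposed fix):

/* sketch: make IsTransactionState() true around the syscache access */
StartTransactionCommand();
if (!superuser() && !has_rolreplication(GetUserId()))
	ereport(ERROR,
			(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
			 errmsg("must be superuser or replication role to run a base backup")));
CommitTransactionCommand();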

// Antonin Houska (Tony)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-13 Thread Antonin Houska
On 12/12/2013 04:36 PM, Tom Lane wrote:
 BTW, on further thought, I'm afraid this is a bigger can of worms than
 it appears.  The remarks above presume that the subquery is simple enough
 to be pulled up, which is the case in this example.  It might not be too
 hard to make that case work.  But what if the subquery *isn't* simple
 enough to be pulled up --- for instance, it includes grouping or
 aggregation?  Then there's no way to unify its WHERE clause with the upper
 semijoin qual.  At the very least, this breaks the uniqueify-then-do-a-
 plain-join implementation strategy for semijoins.

After having thought about it further, I think I understand.

 So I'm now thinking this patch isn't worth pursuing.  Getting all the
 corner cases right would be a significant amount of work, and in the
 end it would only benefit strangely-written queries.

Originally it seemed to me that I just (luckily) found a new opportunity
for the existing infrastructure. To change the infrastructure because of
this small feature would be exactly the opposite. Thanks for having
taken a look at it.

// Antonin Houska (Tony)



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-11 Thread Antonin Houska
On 12/11/2013 10:15 PM, Tom Lane wrote:
 
 FWIW, that plan isn't obviously wrong; if it is broken, most likely the
 reason is that the HashAggregate is incorrectly unique-ifying the lower
 table.  (Unfortunately, EXPLAIN doesn't show enough about the HashAgg
 to know what it's doing exactly.)  The given query is, I think, in
 principle equivalent to
 
  SELECT ...
   FROM SUBSELECT_TBL upper
   WHERE (f1, f2::float) IN
 (SELECT f2, f3 FROM SUBSELECT_TBL);
 
 and if you ask unmodified HEAD to plan that you get
 
  Hash Join  (cost=41.55..84.83 rows=442 width=16)
Hash Cond: ((upper.f1 = subselect_tbl.f2) AND ((upper.f2)::double 
 precision = subselect_tbl.f3))
->  Seq Scan on subselect_tbl upper  (cost=0.00..27.70 rows=1770 width=16)
->  Hash  (cost=38.55..38.55 rows=200 width=12)
  ->  HashAggregate  (cost=36.55..38.55 rows=200 width=12)
->  Seq Scan on subselect_tbl  (cost=0.00..27.70 rows=1770 
 width=12)

Before I opened your mail, I had also recalled the technique I noticed in
the planner code - evaluating a SEMI JOIN as an INNER JOIN with the RHS
uniquified - so I too thought it could be about the uniquification.
 which is the same thing at the visible level of detail ... but this
 version computes the correct result.  The cost of the HashAggregate is
 estimated higher, though, which suggests that maybe it's distinct'ing on
 two columns where the bogus plan only does one.

debug_print_plan output contains
:grpColIdx 2
in the AGG node. I think this corresponds to the join condition, which
IMO should be
(upper.f1 = subselect_tbl.f2)
while the other condition was not in the list of join clauses and
therefore ignored for the uniquification's purpose.

And gdb tells me that create_unique_path() never gets more than one
clause. I can't tell whether it should get more, just for this special purpose.

 Not sure about where Antonin's patch is going off the rails.  I suspect
 it's too simple somehow, but it's also possible that it's OK and the
 real issue is some previously undetected bug in LATERAL processing.

So far I have no idea how to achieve such conditions without this patch.
Thanks for your comments.

// Antonin Houska (Tony)




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Backup throttling

2013-12-10 Thread Antonin Houska
Thanks for checking. The new version addresses your findings.

// Antonin Houska (Tony)

On 12/09/2013 03:49 PM, Fujii Masao wrote:
 On Fri, Dec 6, 2013 at 6:43 PM, Boszormenyi Zoltan z...@cybertec.at wrote:
 Hi,

 2013-12-05 15:36 keltezéssel, Antonin Houska írta:

 On 12/02/2013 02:23 PM, Boszormenyi Zoltan wrote:

 Hi,

 I am reviewing your patch.

 Thanks. New version attached.


 I have reviewed and tested it and marked it as ready for committer.
 
 Here are the review comments:
 
 +  termoption-r/option/term
 +  termoption--max-rate/option/term
 
 You need to add something like replaceable
 class=parameterrate/replaceable.
 
 +The purpose is to limit impact of
 applicationpg_basebackup/application
 +on a running master server.
 
 s/master server/server because we can take a backup from also the standby.
 
 I think that it's better to document the default value and the accepted range 
 of
 the rate that we can specify.
 
 You need to change the protocol.sgml because you changed BASE_BACKUP
 replication command.
 
 +printf(_(  -r, --max-rate maximum transfer rate to
 transfer data directory\n));
 
 You need to add something like =RATE just after --max-rate.
 
 +result = strtol(src, after_num, 0);
 
 errno should be set to 0 just before calling strtol().
 
 +if (errno_copy == ERANGE || result != (uint64) ((uint32) result))
 +{
 +fprintf(stderr, _(%s: transfer rate \%s\ exceeds integer
 range\n), progname, src);
 +exit(1);
 +}
 
 We can move this check after the check of src == after_num like
 parse_int() in guc.c does.
 If we do this, the local variable 'errno_copy' is no longer necessary.
 
 I think that it's better to output the hint message like Valid units for
 the transfer rate are \k\ and \M\. when a user specified wrong unit.
 
 +/*
 + * THROTTLING_SAMPLE_MIN / MAX_RATE_LOWER (in seconds) should be the
 + * longest possible time to sleep. Thus the cast to long is safe.
 + */
 +pg_usleep((long) sleep);
 
 It's better to use the latch here so that we can interrupt immediately.
 
 Regards,
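
For reference, a standalone sketch of a parse routine along the lines
suggested above (function name, messages and the exact bounds handling are
placeholders, not what the patch finally does):

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static int64_t
parse_max_rate(const char *src)
{
	char	   *after_num;
	long		result;

	errno = 0;
	result = strtol(src, &after_num, 0);
	if (after_num == src)
	{
		fprintf(stderr, "transfer rate \"%s\" is not a valid value\n", src);
		exit(1);
	}
	if (errno == ERANGE || result != (long) ((int32_t) result))
	{
		fprintf(stderr, "transfer rate \"%s\" exceeds integer range\n", src);
		exit(1);
	}
	if (*after_num == 'k')
	{
		result *= 1024;
		after_num++;
	}
	else if (*after_num == 'M')
	{
		result *= 1024 * 1024;
		after_num++;
	}
	if (*after_num != '\0')
	{
		fprintf(stderr, "invalid unit in transfer rate \"%s\"\n", src);
		fprintf(stderr, "hint: valid units are \"k\" and \"M\"\n");
		exit(1);
	}
	return (int64_t) result;
}

int
main(void)
{
	printf("%lld\n", (long long) parse_max_rate("10M"));	/* 10485760 */
	return 0;
}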
 

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 0b2e60e..2f24fff 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1719,7 +1719,7 @@ The commands accepted in walsender mode are:
   /varlistentry
 
   varlistentry
-termBASE_BACKUP [literalLABEL/literal replaceable'label'/replaceable] [literalPROGRESS/literal] [literalFAST/literal] [literalWAL/literal] [literalNOWAIT/literal]/term
+termBASE_BACKUP [literalLABEL/literal replaceable'label'/replaceable] [literalPROGRESS/literal] [literalFAST/literal] [literalWAL/literal] [literalNOWAIT/literal] [literalMAX_RATE/literal]/term
 listitem
  para
   Instructs the server to start streaming a base backup.
@@ -1787,7 +1787,23 @@ The commands accepted in walsender mode are:
   the waiting and the warning, leaving the client responsible for
   ensuring the required log is available.
  /para
- /listitem
+/listitem
+   /varlistentry
+   varlistentry
+termliteralMAX_RATE/literal/term
+listitem
+ para
+  Limit the maximum amount of data transferred to client per unit of
+  time. The expected unit is bytes per second. If
+  literalMAX_RATE/literal is passed, it must be either equal to
+  zero or fall to range 32 kB through 1 GB (inclusive). If zero is
+  passed or the option is not passed at all, no restriction is imposed
+  on the transfer.
+ /para
+ para
+  literalMAX_RATE/literal does not affect WAL streaming.
+ /para
+/listitem
/varlistentry
   /variablelist
  /para
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index c379df5..caede77 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -189,6 +189,28 @@ PostgreSQL documentation
  /varlistentry
 
  varlistentry
+  termoption-r replaceable class=parameterrate/replaceable/option/term
+  termoption--max-rate=replaceable class=parameterrate/replaceable/option/term
+  listitem
+   para
+The maximum amount of data transferred from server per second.
+The purpose is to limit impact of applicationpg_basebackup/application
+on running server.
+   /para
+   para
+This option always affects transfer of the data directory. Transfer of
+WAL files is only affected if the collection method is literalfetch/literal.
+   /para
+   para
+Value between literal32 kB/literal and literal1024 MB/literal
+is expected. Suffixes literalk/literal (kilobytes) and
+literalM/literal (megabytes) are accepted. For example:
+literal10M/literal
+   /para
+  /listitem
+ /varlistentry
+
+ varlistentry

Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-10 Thread Antonin Houska
On 12/06/2013 03:33 PM, Kevin Grittner wrote:
 Antonin Houska antonin.hou...@gmail.com wrote:
 
 SELECT *
 FROMtab1 a
  LEFT JOIN
  tab2 b
  ON a.i = ANY (
  SELECT  k
  FROMtab3 c
  WHEREk = a.i);
 
 This query works with k in any or all tables, but the semantics
 certainly vary depending on where k happens to be.  It would help a
 lot if you showed SQL statements to create and populate the tables
 involved and/or if you qualified all referenced column names with
 the table alias to avoid ambiguity.

I used the attached DDLs (tables.ddl) for this query too, not only for
the queries in quaries.sql. Yes, if I had mentioned that and/or qualified
the 'k' column reference, there would have been no ambiguity.

 If I assume that the k reference is supposed to be a column in
 tab3, what you have is a query where you always get all rows from
 tab1, and for each row from tab1 you either match it to all rows
 from tab2 or no rows from tab2 depending on whether the tab1 row
 has a match in tab3.

I concede this particular query is not useful. But the important thing
to consider here is which side of the LEFT JOIN the subquery references.

 SELECT  *
 FROMtab1 a
  LEFT JOIN
  (
SELECT *
tab2 b
SEMI JOIN
(  SELECT  k
FROMtab3 c
WHERE  k = a.i
) AS ANY_subquery
ON a.i = ANY_subquery.k
  ) AS SJ_subquery
  ON true;
 
 It is hard to see what you intend here, since this is not valid
 syntax.

This is what I - after having read the related source code - imagine
happens internally when the ANY predicate of the first query is being
processed. In fact it should become something like this (also internal
stuff):

SELECT  *
FROMtab1 a
LEFT JOIN
(
  tab2 b
  SEMI JOIN
  (  SELECT  k
  FROMtab3 c
  WHERE  k = a.i
  ) AS ANY_subquery
  ON a.i = ANY_subquery.k
)
ON true;

that is, a SEMI JOIN node is inserted into the tree rather than a subquery
(SJ_subquery). I posted the construct with SJ_subquery to show how I
thought about the problem: I considered it safe (even though not
necessarily beautiful) to wrap the SEMI JOIN into the SJ_subquery and
let the existing infrastructure decide whether it's legal to turn it
into a join node. I concluded that the subquery's references to tab1
ensure that SJ_subquery won't be flattened, so the patch does nothing if
such a reference exists.

 PostgreSQL supports semi-joins; but that is an implementation detail
 for the EXISTS or IN syntax.

... and for ANY, see subselect.c:convert_ANY_sublink_to_join()

 Could you clarify your intent?

To get rid of a subplan in some cases that so far require one: when the
subquery references a table exactly one level higher (i.e. in the immediate
parent query).

(I got the idea while reading the source code, as opposed to query
tuning.)

// Antonin Houska (Tony)

 --
 Kevin Grittner
 EDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company
 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Review: [HACKERS] ECPG infrastructure changes part 1, was: Re: ECPG fixes

2013-12-06 Thread Antonin Houska
Tested git apply and build again. No warnings.

The regression test also looks good to me now.

I'm done with this review.

(Not sure whether I should move it to 'ready for committer' status or the
CFM should do that.)

// Antonin Houska (Tony)

On 12/06/2013 02:01 PM, Boszormenyi Zoltan wrote:
 2013-12-04 14:51 keltezéssel, Boszormenyi Zoltan írta:
 2013-12-03 16:48 keltezéssel, Antonin Houska írta:

 Tests - 23.patch
 

 src/interfaces/ecpg/test/sql/cursorsubxact.pgc


  /*
   * Test the implicit RELEASE SAVEPOINT if a SAVEPOINT
   * is used with an already existing name.
   */

 Shouldn't it be ... if a CURSOR is used with an already existing
 name?. Or just ... implicit RELEASE SAVEPOINT after an error?
 I'd also appreciate a comment where exactly the savepoint is
 (implicitly) released.

 I have already answered this in my previous answer.
 
 And I was wrong in that. The comments in the test were rearranged
 a little and the fact in the above comment is now actually tested.
 
 Some harmless unused variables were also removed and an
 uninitialized variable usage was fixed. Because of these and the above
 changes a lot of patches need to be rebased.
 
 All patches are attached again for completeness.
 
 Best regards,
 Zoltán Böszörményi
 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Backup throttling

2013-12-05 Thread Antonin Houska
On 12/02/2013 02:23 PM, Boszormenyi Zoltan wrote:
 Hi,
 
 I am reviewing your patch.

Thanks. New version attached.

 
 * Does it follow the project coding guidelines?
 
 Yes. A nitpicking: this else branch below might need brackets
 because there is also a comment in that branch:
 
 +   /* The 'real data' starts now (header was ignored). */
 +   throttled_last = GetCurrentIntegerTimestamp();
 +   }
 +   else
 +   /* Disable throttling. */
 +   throttling_counter = -1;
 +

Done.

 
 * Does it do what it says, correctly?
 
 Yes.
 
 Although it should be mentioned in the docs that rate limiting
 applies to walsenders individually, not globally. I tried this
 on a freshly created database:
 
 $ time pg_basebackup -D data2 -r 1M -X stream -h 127.0.0.1
 real0m26.508s
 user0m0.054s
 sys0m0.360s
 
 The source database had 3 WAL files in pg_xlog, one of them was
 also streamed. The final size of data2 was 43MB or 26MB without pg_xlog,
 i.e. without the -X stream option. The backup used 2 walsenders
 in parallel (one for WAL) which is a known feature.

Right, if the method is 'stream', a separate WAL sender is used and the
transfer is not limited: the client must keep up with the stream
unconditionally. I added a note to the documentation.

But there's no reason not to throttle WAL files if the method is
'fetch'. That's fixed now.

 Another note. This chunk should be submitted separately as a comment bugfix:
 
 diff --git a/src/backend/utils/adt/timestamp.c 
 b/src/backend/utils/adt/timestamp.c
 index c3c71b7..5736fd8 100644
 --- a/src/backend/utils/adt/timestamp.c
 +++ b/src/backend/utils/adt/timestamp.c
 @@ -1288,7 +1288,7 @@ GetCurrentTimestamp(void)
   /*
* GetCurrentIntegerTimestamp -- get the current operating system time as 
 int64
*
 - * Result is the number of milliseconds since the Postgres epoch. If compiled
 + * Result is the number of microseconds since the Postgres epoch. If compiled
* with --enable-integer-datetimes, this is identical to 
 GetCurrentTimestamp(),
* and is implemented as a macro.
*/

Will do.

// Tony

diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index c379df5..e878592 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -189,6 +189,26 @@ PostgreSQL documentation
  /varlistentry
 
  varlistentry
+  termoption-r/option/term
+  termoption--max-rate/option/term
+  listitem
+   para
+The maximum amount of data transferred from server per second.
+The purpose is to limit impact of applicationpg_basebackup/application
+on a running master server.
+   /para
+   para
+This option always affects transfer of the data directory. Transfer of
+WAL files is only affected if the collection method is literalfetch/literal.
+   /para
+   para
+Suffixes literalk/literal (kilobytes) and literalM/literal
+(megabytes) are accepted. For example: literal10M/literal
+   /para
+  /listitem
+ /varlistentry
+
+ varlistentry
   termoption-R/option/term
   termoption--write-recovery-conf/option/term
   listitem
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index ba8d173..f194746 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -33,6 +33,7 @@
 #include "utils/builtins.h"
 #include "utils/elog.h"
 #include "utils/ps_status.h"
+#include "utils/timestamp.h"
 #include "pgtar.h"
 
 typedef struct
@@ -42,6 +43,7 @@ typedef struct
 	bool		fastcheckpoint;
 	bool		nowait;
 	bool		includewal;
+	uint32		maxrate;
 } basebackup_options;
 
 
@@ -59,6 +61,7 @@ static void perform_base_backup(basebackup_options *opt, DIR *tblspcdir);
 static void parse_basebackup_options(List *options, basebackup_options *opt);
 static void SendXlogRecPtrResult(XLogRecPtr ptr, TimeLineID tli);
 static int	compareWalFileNames(const void *a, const void *b);
+static void throttle(size_t increment);
 
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
@@ -68,6 +71,41 @@ static bool backup_started_in_recovery = false;
  */
 #define TAR_SEND_SIZE 32768
 
+
+/*
+ * The maximum amount of data per second - bounds of the user input.
+ *
+ * If the maximum should be increased to more than 4 GB, uint64 must
+ * be introduced for the related variables. However such high values have
+ * little to do with throttling.
+ */
+#define MAX_RATE_LOWER	32768
+#define MAX_RATE_UPPER	(1024 << 20)
+
+/*
+ * Transfer rate is only measured when this number of bytes has been sent.
+ * (Too frequent checks would impose too high CPU overhead.)
+ *
+ * The default value is used unless it'd result in too frequent checks.
+ */
+#define THROTTLING_SAMPLE_MIN	32768
+
+/* The maximum number of checks per 

Re: [HACKERS] Reference to parent query from ANY sublink

2013-12-05 Thread Antonin Houska
On 10/31/2013 09:37 PM, Antonin Houska wrote:
 On 10/31/2013 03:46 PM, Antonin Houska wrote:
 I'm not sure if it's legal for the WHERE clause to reference LHS of the
 original outer join (a.j). Some more restriction may be needed. I need
 to think about it a bit more.

For a subquery or sublink expression referencing the outer table of an
OJ (see tab1)

SELECT *
FROMtab1 a
LEFT JOIN
tab2 b
ON a.i = ANY (
SELECT  k
FROMtab3 c
WHERE   k = a.i);

I started my considerations by inserting the SEMI JOIN in a form of
subquery, instead of a join node - see SJ_subquery here:

SELECT  *
FROMtab1 a
LEFT JOIN
(
   SELECT *
   tab2 b
   SEMI JOIN
   (  SELECT  k
  FROMtab3 c
  WHERE   k = a.i
   ) AS ANY_subquery
   ON a.i = ANY_subquery.k
) AS SJ_subquery
ON true;

(To allow a.i in the sublink expression, we'd only need to pass both
tab1 and tab2 to pull_up_sublinks_qual_recurse() in available_rels1.)

However, it seems to be these lateral references (from the subquery and/or
the sublink expression) to tab1 that make it impossible for
SJ_subquery to be pulled up into the parent query's join tree - see
jointree_contains_lateral_outer_refs(). I'm not sure it makes much
sense to pull up the sublink in such a case, does it?

I ended up with this logic: if the join is INNER, both the subquery and
sublink expression can reference either side. If the join is OUTER, only
the inner side can be referenced. Otherwise no attempt to introduce the
SEMI JOIN.

Can this be considered a patch, or is it wrong/incomplete?

// Antonin Houska (Tony)



diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index d8cabbd..6095008 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -1167,6 +1167,7 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	JoinExpr   *result;
 	Query	   *parse = root->parse;
 	Query	   *subselect = (Query *) sublink->subselect;
+	bool	subselectLateral;
 	Relids		upper_varnos;
 	int			rtindex;
 	RangeTblEntry *rte;
@@ -1177,11 +1178,10 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	Assert(sublink->subLinkType == ANY_SUBLINK);
 
 	/*
-	 * The sub-select must not refer to any Vars of the parent query. (Vars of
-	 * higher levels should be okay, though.)
+	 * If the subselect refers to any Vars of the parent query, it must be
+	 * marked LATERAL.
 	 */
-	if (contain_vars_of_level((Node *) subselect, 1))
-		return NULL;
+	subselectLateral = contain_vars_of_level((Node *) subselect, 1);
 
 	/*
 	 * The test expression must contain some Vars of the parent query, else
@@ -1199,6 +1199,20 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 		return NULL;
 
 	/*
+	 * Neither the subselect can refer to outside available_rels.
+	 * (Such lateral references would prevent the resulting subquery from
+	 * being reached by pull_up_simple_subquery.)
+	 */
+	if (subselectLateral)
+	{
+		Relids sub_varnos = pull_varnos_of_level((Node *) subselect, 1);
+
+		if (!bms_is_subset(sub_varnos, available_rels))
+			return NULL;
+	}
+
+
+	/*
 	 * The combining operators and left-hand expressions mustn't be volatile.
 	 */
 	if (contain_volatile_functions(sublink-testexpr))
@@ -1215,7 +1229,7 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	rte = addRangeTableEntryForSubquery(NULL,
 		subselect,
 		makeAlias(ANY_subquery, NIL),
-		false,
+		subselectLateral,
 		false);
 	parse-rtable = lappend(parse-rtable, rte);
 	rtindex = list_length(parse-rtable);
CREATE TABLE tab1(i int);
CREATE TABLE tab2(j int);
CREATE TABLE tab3(k int);
EXPLAIN
SELECT *
FROM	tab1 a
	JOIN
	tab2 b
	ON b.j = ANY (
		SELECT  k
		FROM	tab3
		WHERE k = b.j);

EXPLAIN
SELECT *
FROM	tab1 a
	JOIN
	tab2 b
	-- Available_rels1 contains both 'a' and 'b'.
	-- Both sublink expression and the subselect can refer to either side
	-- of the JOIN.
	ON a.i = ANY (
		SELECT  k
		FROM	tab3
		WHERE k = a.i);

EXPLAIN
SELECT *
FROM	tab1 a
	LEFT JOIN
	tab2 b
	ON b.j = ANY (
		SELECT  k
		FROM	tab3
		WHERE k = b.j);

EXPLAIN
SELECT *
FROM	tab1 a
	LEFT JOIN
	tab2 b
	ON b.j = ANY (
		SELECT  k
		FROM	tab3
		-- Lateral reference to 'a', no sublink pull-up
	 	WHERE k = a.i); 

EXPLAIN
SELECT *
FROM	tab1 a
	LEFT JOIN
	tab2 b
	-- Lateral reference to 'a', no sublink pull-up
	ON a.i = ANY (
		SELECT  k
		FROM	tab3
	 	WHERE k = b.j); 
QUERY PLAN
--
 Nested Loop  (cost=0.00..84113.00 rows=288 width=8)
   ->  Seq Scan on tab1 a  (cost=0.00..34.00 rows=2400 width=4)
   ->  Materialize  (cost=0.00..48082.00 rows=1200 width=4)
 ->  Seq Scan on tab2 b  (cost=0.00..48076.00 rows

Review: [HACKERS] ECPG infrastructure changes part 1, was: Re: ECPG fixes

2013-12-03 Thread Antonin Houska
The changes are very well divided into logical units, so I think I could
understand the ideas. I'm not too familiar with the ecpg internals, so
consider this at least a coding review.


git apply:  Clean, except for one finding below

Build:  no errors/warnings

Documentation build: no errors/warnings. The changes do appear in ecpg.html.

Regression tests: all passed

Non-Linux platforms: I can't verify, but I haven't noticed any new
dependencies added.

Comments (in the source code): good.


(My) additional comments:


22.patch


tuples_left > 0

instead of just

tuples_left

seems to me safer in for-loops. Not sure if there's a convention for
this though.
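
A trivial standalone illustration of why the explicit comparison seems safer
to me (only the variable name is taken from the patch):

#include <stdio.h>

int
main(void)
{
	int			tuples_left = 5;

	/*
	 * If the body can consume more than one tuple per iteration (or the
	 * counter could ever go negative), "tuples_left > 0" stops where a
	 * bare "tuples_left" (i.e. != 0) would keep looping past zero.
	 */
	while (tuples_left > 0)
	{
		tuples_left -= 2;		/* e.g. a batch consumed two rows */
		printf("remaining: %d\n", tuples_left);
	}
	return 0;
}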

23.patch


git apply --verbose ~/cybertec/patches/ecpq/23.patch
/home/anton/cybertec/patches/ecpq/23.patch:494: space before tab in indent.
/*--
/home/anton/cybertec/patches/ecpq/23.patch:495: space before tab in indent.
   translator: this string will be truncated at
149 characters expanded.  */
/home/anton/cybertec/patches/ecpq/23.patch:4019: trailing whitespace.
exec sql close :curname;



Tests - 23.patch


src/interfaces/ecpg/test/sql/cursorsubxact.pgc


/*
 * Test the implicit RELEASE SAVEPOINT if a SAVEPOINT
 * is used with an already existing name.
 */

Shouldn't it be ... if a CURSOR is used with an already existing
name?. Or just ... implicit RELEASE SAVEPOINT after an error?
I'd also appreciate a comment where exactly the savepoint is
(implicitly) released.


23.patch and 24.patch
-

SO_MAJOR_VERSION and also interfaces/ecpg/include/ecpglib.h are changed.

Thus all client applications probably need to be preprocessed and compiled
against PG 9.4. I don't know how this is usually enforced; I'm mentioning
it for the sake of completeness.

// Antonin Houska (Tony)


On 11/28/2013 03:21 PM, Boszormenyi Zoltan wrote:
 2013-11-20 14:53 keltezéssel, Boszormenyi Zoltan írta:
 2013-11-20 14:41 keltezéssel, Boszormenyi Zoltan írta:
 2013-11-12 07:15 keltezéssel, Boszormenyi Zoltan írta:
 2013-11-12 07:01 keltezéssel, Noah Misch írta:
 On Mon, Nov 11, 2013 at 10:17:54AM +0100, Boszormenyi Zoltan wrote:
 The old contents of my GIT repository was removed so you need to
 clone it fresh. https://github.com/zboszor/ecpg-readahead.git
 I won't post the humongous patch again, since sending a 90KB
 compressed file to everyone on the list is rude.
 Patches of that weight show up on a regular basis.  I don't think it's 
 rude.

 OK, here it is.

 ...
 Subsequent patches will come as reply to this email.

 Infrastructure changes in ecpglib/execute.c to split up
 ECPGdo and ecpg_execute() and expose the parts as
 functions internal to ecpglib.
 
 Rebased after killing the patch that changed the DECLARE CURSOR command tag.
 All infrastructure patches are attached, some of them compressed.
 
 Best regards,
 Zoltán Böszörményi
 
 
 
 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Todo item: Support amgettuple() in GIN

2013-11-29 Thread Antonin Houska
On 11/29/2013 01:13 AM, Andreas Karlsson wrote:

 When doing partial matching the code need to be able to return the union 
 of all TIDs in all the matching posting trees in TID order (to be able 
 to do AND and OR operations with multiple search keys later). It does 
 this by traversing them posting tree after posting tree and collecting 
 them all in a TIDBitmap which is later iterated over.

I think it's not a plain union. My understanding is that - to evaluate a
single key (typically an array) - you first need to get all the TID streams
for that key (i.e. one posting list/tree per element of the key array)
and then iterate all these streams in parallel and 'merge' them using
the consistent() function. That's how I understand ginget.c:keyGetItem().

So the problem of partial match is (IMO) that there can be too many TID
streams to merge - many more than the number of elements of the key array.
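
A standalone sketch of the kind of per-key merge I mean (the data structures
and the consistent() signature are deliberately simplified; this is not the
real GIN API):

#include <stdbool.h>
#include <stdio.h>

/*
 * Merge several sorted "posting lists" (one per key entry) in parallel and
 * call a consistent-like check on the per-entry match flags for every
 * candidate item.  Simplified stand-in for ginget.c:keyGetItem().
 */
#define NENTRIES 3

static bool
consistent(const bool check[NENTRIES])
{
	/* e.g. "any element matches" semantics of an ANY-array key */
	for (int i = 0; i < NENTRIES; i++)
		if (check[i])
			return true;
	return false;
}

int
main(void)
{
	int			streams[NENTRIES][4] = {
		{2, 5, 7, -1},			/* -1 terminates a stream */
		{5, 9, -1, -1},
		{1, 5, 7, -1}
	};
	int			pos[NENTRIES] = {0, 0, 0};

	for (;;)
	{
		int			item = -1;
		bool		check[NENTRIES] = {false, false, false};

		/* find the smallest not-yet-consumed item across all streams */
		for (int i = 0; i < NENTRIES; i++)
		{
			int			cur = streams[i][pos[i]];

			if (cur >= 0 && (item < 0 || cur < item))
				item = cur;
		}
		if (item < 0)
			break;				/* all streams exhausted */

		/* which entries contain the item? */
		for (int i = 0; i < NENTRIES; i++)
			if (streams[i][pos[i]] == item)
			{
				check[i] = true;
				pos[i]++;
			}

		if (consistent(check))
			printf("item %d matches the key\n", item);
	}
	return 0;
}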

// Antonin Houska (Tony)



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Todo item: Support amgettuple() in GIN

2013-11-29 Thread Antonin Houska
On 11/29/2013 01:57 PM, Andreas Karlsson wrote:
 On 11/29/2013 09:54 AM, Antonin Houska wrote:
 On 11/29/2013 01:13 AM, Andreas Karlsson wrote:

 When doing partial matching the code need to be able to return the union
 of all TIDs in all the matching posting trees in TID order (to be able
 to do AND and OR operations with multiple search keys later). It does
 this by traversing them posting tree after posting tree and collecting
 them all in a TIDBitmap which is later iterated over.

 I think it's not a plain union. My understanding is that - to evaluate a
 single key (typically array) - you first need to get all the TID streams
 for that key (i.e. one posting list/tree per element of the key array)
 and then iterate all these streams in parallel and 'merge' them  using
 consistent() function. That's how I understand ginget.c:keyGetItem().
 
 For partial matches the merging is done in two steps: first a simple 
 union of all the streams per key and then second merging those union 
 streams using the consistent() function.

Yes, shortly after I sent my previous mail I realized that your union
probably referred to the things that collectMatchBitmap() does.

// Tony



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Easily reading debug_print_plan

2013-11-20 Thread Antonin Houska
On 11/20/2013 09:12 AM, Craig Ringer wrote:
 Hi all
 
 I'm spending a lot of time staring at parse and plan trees at the
 moment, and I'm finding reading them rather cumbersome.
 
 For those of you who do this a lot, do you use any sort of tooling to
 help you out?

The vim editor. The '%' shortcut can be used to jump between opening and
closing brackets and thus skip smaller or bigger parts of the output.
IMO, this output is primarily for hackers (as opposed to application
developers or users) and a hacker should know at least a few vim shortcuts
anyway :-)

// Tony



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Review: HStore Gin Speedup

2013-11-18 Thread Antonin Houska
Following are my initial comments for
https://commitfest.postgresql.org/action/patch_view?id=1203


1. It no longer applies, probably due to commit
a18167510f4c385329697588ce5132cbf95779c3

error: contrib/hstore/hstore--1.1.sql: No such file or directory


2. Compatibility with HStore v2.0
(https://commitfest.postgresql.org/action/patch_view?id=1289)

As long as the separate operator class gin_hstore_combined_ops is used,
there should be no conflict. I'll verify as soon as the $subject patch
is applicable to master.


3. Regression tests

Shouldn't EXPLAIN be there too? I suggest a test that creates 2 indexes
on the hstore column - one using the existing GIN (default) opclass and
one using the new one (combined). The execution plan will then show if
an index is used and if it's the correct one.


// Antonin Houska (Tony)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Information about Access methods

2013-11-13 Thread Antonin Houska
On 11/13/2013 08:59 AM, Rohit Goyal wrote:
 Could you please suggest something about abt update operation of B tree
 index.

access/nbtree/README is probably the next text to read. It points to the
theoretical background and also explains specifics of the Postgres
implementation.

// Antonin Houska (Tony)



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Comment - uniqueness of relfilenode

2013-11-11 Thread Antonin Houska
On 11/10/2013 12:57 AM, Robert Haas wrote:
 On Thu, Nov 7, 2013 at 10:56 AM, Antonin Houska
 antonin.hou...@gmail.com wrote:
 catalog/catalog.c:GetNewRelFileNode() and its calls indicate that the
 following change makes sense:


 diff --git a/src/include/storage/relfilenode.h
 b/src/include/storage/relfilenode.h
 index 75f897f..7190974 100644
 --- a/src/include/storage/relfilenode.h
 +++ b/src/include/storage/relfilenode.h
 @@ -55,7 +55,7 @@ typedef enum ForkNumber
   * relNode identifies the specific relation.  relNode corresponds to
   * pg_class.relfilenode (NOT pg_class.oid, because we need to be able
   * to assign new physical files to relations in some situations).
 - * Notice that relNode is only unique within a particular database.
 + * Notice that relNode is only unique within a particular tablespace.
   *
   * Note: spcNode must be GLOBALTABLESPACE_OID if and only if dbNode is
   * zero.  We support shared relations only in the global tablespace.


 // Antonin Houska (Tony)
 
 Technically speaking, I think it's only guaranteed to be unique with a
 database-tablespace combination.  In other words, the same OID can be
 reused as a relfilenode if *either* of those two values differs.

You're right. I missed the fact that Postgres (unlike another DBMS that
I worked with) allows for tablespace to be shared across databases.

// Antonin Houska (Tony)



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Comment - uniqueness of relfilenode

2013-11-07 Thread Antonin Houska
catalog/catalog.c:GetNewRelFileNode() and its calls indicate that the
following change makes sense:


diff --git a/src/include/storage/relfilenode.h
b/src/include/storage/relfilenode.h
index 75f897f..7190974 100644
--- a/src/include/storage/relfilenode.h
+++ b/src/include/storage/relfilenode.h
@@ -55,7 +55,7 @@ typedef enum ForkNumber
  * relNode identifies the specific relation.  relNode corresponds to
  * pg_class.relfilenode (NOT pg_class.oid, because we need to be able
  * to assign new physical files to relations in some situations).
- * Notice that relNode is only unique within a particular database.
+ * Notice that relNode is only unique within a particular tablespace.
  *
  * Note: spcNode must be GLOBALTABLESPACE_OID if and only if dbNode is
  * zero.  We support shared relations only in the global tablespace.


// Antonin Houska (Tony)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Reference to parent query from ANY sublink

2013-10-31 Thread Antonin Houska
So far, a subquery of an ANY sublink located in the WHERE/ON clause can't
reference vars exactly one level up, as long as pull-up into the join
tree is expected. Now that we have LATERAL subqueries (there seem to be
no specifics of SEMI JOIN when it comes to parameterization etc.), I
think this restriction can be lifted. Thus a subplan can often be avoided.

I'm not sure if something like that is applicable to EXISTS: various parts
are cut off, so there are probably no vars having (varlevelsup == 1).

The attachments show cases where the SEMI JOIN should be inserted above
INNER JOIN and into the nullable side of OUTER JOIN respectively, each
before the patch is applied and after that.

So far I haven't tested recursive processing, but I don't expect problems there.

Can the change be as simple as this or do I neglect anything?

// Antonin Houska (Tony)
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 0df70c4..062175f 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -1167,6 +1167,7 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	JoinExpr   *result;
 	Query	   *parse = root->parse;
 	Query	   *subselect = (Query *) sublink->subselect;
+	bool	subqueryLateral;
 	Relids		upper_varnos;
 	int			rtindex;
 	RangeTblEntry *rte;
@@ -1176,12 +1177,7 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 
 	Assert(sublink->subLinkType == ANY_SUBLINK);
 
-	/*
-	 * The sub-select must not refer to any Vars of the parent query. (Vars of
-	 * higher levels should be okay, though.)
-	 */
-	if (contain_vars_of_level((Node *) subselect, 1))
-		return NULL;
+	subqueryLateral = contain_vars_of_level((Node *) subselect, 1);
 
 	/*
 	 * The test expression must contain some Vars of the parent query, else
@@ -1215,7 +1211,7 @@ convert_ANY_sublink_to_join(PlannerInfo *root, SubLink *sublink,
 	rte = addRangeTableEntryForSubquery(NULL,
 		subselect,
 		makeAlias("ANY_subquery", NIL),
-		false,
+		subqueryLateral,
 		false);
 	parse->rtable = lappend(parse->rtable, rte);
 	rtindex = list_length(parse->rtable);
EXPLAIN
SELECT	*
FROM	tab1
WHERE	i = ANY (SELECT  k
		FROM	tab2
		WHERE	k = j);
  QUERY PLAN  
--
 Seq Scan on tab1  (cost=0.00..39388.68 rows=1070 width=8)
   Filter: (SubPlan 1)
   SubPlan 1
 ->  Seq Scan on tab2  (cost=0.00..36.75 rows=11 width=4)
   Filter: (k = tab1.j)
(5 rows)

 QUERY PLAN 

 Hash Join  (cost=73.64..76.50 rows=6 width=8)
   Hash Cond: (tab2.k = tab1.j)
   ->  HashAggregate  (cost=36.75..38.75 rows=200 width=4)
 ->  Seq Scan on tab2  (cost=0.00..31.40 rows=2140 width=4)
   ->  Hash  (cost=36.75..36.75 rows=11 width=8)
 ->  Seq Scan on tab1  (cost=0.00..36.75 rows=11 width=8)
   Filter: (j = i)
(7 rows)

EXPLAIN
SELECT	*
FROM	tab1 a
	LEFT JOIN
	tab1 b
	ON a.i = b.i
WHERE	a.i = ANY (
		SELECT  tab2.k
		FROM	tab2
		WHERE	k = a.j);
  QUERY PLAN  
--
 Hash Left Join  (cost=58.15..39850.22 rows=11449 width=16)
   Hash Cond: (a.i = b.i)
   ->  Seq Scan on tab1 a  (cost=0.00..39388.68 rows=1070 width=8)
 Filter: (SubPlan 1)
 SubPlan 1
   ->  Seq Scan on tab2  (cost=0.00..36.75 rows=11 width=4)
 Filter: (k = a.j)
   ->  Hash  (cost=31.40..31.40 rows=2140 width=8)
 ->  Seq Scan on tab1 b  (cost=0.00..31.40 rows=2140 width=8)
(9 rows)

   QUERY PLAN   

 Hash Right Join  (cost=76.57..116.64 rows=59 width=16)
   Hash Cond: (b.i = a.i)
   ->  Seq Scan on tab1 b  (cost=0.00..31.40 rows=2140 width=8)
   ->  Hash  (cost=76.50..76.50 rows=6 width=8)
 ->  Hash Join  (cost=73.64..76.50 rows=6 width=8)
   Hash Cond: (tab2.k = a.i)
   ->  HashAggregate  (cost=36.75..38.75 rows=200 width=4)
 ->  Seq Scan on tab2  (cost=0.00..31.40 rows=2140 width=4)
   ->  Hash  (cost=36.75..36.75 rows=11 width=8)
 ->  Seq Scan on tab1 a  (cost=0.00..36.75 rows=11 width=8)
   Filter: (i = j)
(11 rows)

CREATE TABLE tab1(i int, j int);
CREATE TABLE tab2(k int, l int);
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reference to parent query from ANY sublink

2013-10-31 Thread Antonin Houska
On 10/31/2013 03:46 PM, Antonin Houska wrote:
 Can the change be as simple as this or do I neglect anything?

Well, the example of outer join is wrong. Instead I think query

SELECT *
FROMtab1 a
LEFT JOIN
tab1 b
ON b.i = ANY (
SELECT  tab2.k
FROMtab2
WHERE   k = a.j);


should be converted to

SELECT  *
FROMtab1 a
LEFT JOIN
(  tab1 b
   LATERAL SEMI JOIN
   (  SELECT  tab2.k
  FROM  tab2
  WHERE k = a.j
   ) AS ANY_subquery
   ON b.i = sub.k
)

I'm not sure if it's legal for the WHERE clause to reference LHS of the
original outer join (a.j). Some more restriction may be needed. I need
to think about it a bit more.

// Tony



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   >