Re: [HACKERS] path toward faster partition pruning

2017-11-13 Thread Amit Langote
Horiguchi-san,

Thanks for taking a look.  Replying to all your emails here.

On 2017/11/10 12:30, Kyotaro HORIGUCHI wrote:
> In 0002, bms_add_range has a bit naive-looking loop
> 
> + while (wordnum <= uwordnum)
> + {
> + bitmapword mask = (bitmapword) ~0;
> +
> + /* If working on the lower word, zero out bits below 'lower'. */
> + if (wordnum == lwordnum)
> + {
> + int lbitnum = BITNUM(lower);
> + mask >>= lbitnum;
> + mask <<= lbitnum;
> + }
> +
> + /* Likewise, if working on the upper word, zero bits above 'upper' */
> + if (wordnum == uwordnum)
> + {
> + int ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(upper) + 1);
> + mask <<= ushiftbits;
> + mask >>= ushiftbits;
> + }
> +
> + a->words[wordnum++] |= mask;
> + }
> 
> Without aggressive optimization, the loop spends most of its time on
> check-and-jump work that accomplishes nothing, especially with many
> partitions, and it reads somewhat unintuitively.
> 
> The following uses a bit tricky bitmap operation but
> is straightforward as a whole.
> 
> =
> /* fill the bits upward from BITNUM(lower) (0-based) of the first word */
> a->words[wordnum++] |= ~(bitmapword) (((bitmapword) 1 << BITNUM(lower)) - 1);
> 
> /* fill up intermediate words */
> while (wordnum < uwordnum)
>    a->words[wordnum++] = ~(bitmapword) 0;
> 
> /* fill up to BITNUM(upper) bit (0-based) of the last word */
> a->words[wordnum++] |=
>  (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) + 1));
> =

Considering also David's comment downthread, I will try to incorporate
this into bms_add_range().
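
Just for the archives, here's a rough sketch of how that might look once
the single-word case (lwordnum == uwordnum), which the snippet above
doesn't handle, is accounted for (variable names assumed from the 0002
patch; treat this as illustrative, not the final code):

/* Masks for the partial words at either end of the range. */
lmask = ~(bitmapword) 0 << BITNUM(lower);   /* bits at and above 'lower' */
umask = ~(bitmapword) 0 >> (BITS_PER_BITMAPWORD - (BITNUM(upper) + 1));

if (lwordnum == uwordnum)
    a->words[lwordnum] |= (lmask & umask);  /* range within one word */
else
{
    a->words[lwordnum] |= lmask;
    for (wordnum = lwordnum + 1; wordnum < uwordnum; wordnum++)
        a->words[wordnum] = ~(bitmapword) 0;
    a->words[uwordnum] |= umask;
}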

> In 0003, 
> 
> +match_clauses_to_partkey(RelOptInfo *rel,
> ...
> +  if (rinfo->pseudoconstant &&
> +    (IsA(clause, Const) &&
> +     ((((Const *) clause)->constisnull) ||
> +      !DatumGetBool(((Const *) clause)->constvalue))))
> +  {
> +*constfalse = true;
> +continue;
> 
> We call this function in both conjunction and disjunction contexts
> (the latter only in the recursive case). constfalse == true means
> that no clauses are needed at all in the former case.
> 
> Since (I think) just a list of RestrictInfo is expected to be
> treated as a conjunction (it's quite dubious, though..),

I think it makes sense to consider a list of RestrictInfo's, such as
baserestrictinfo, that is passed as input to match_clauses_to_partkey(),
to be mutually conjunctive for our purpose here.

> it might be better to call this for each subnode of a disjunction. Or,
> like match_clauses_to_index, it might be better to provide
> match_clause_to_partkey(rel, rinfo, contains_const), which
> returns NULL if constfalse. (I'm not confident about this..)

After reading your comment, I realized that it was wrong that the
recursive call to match_clauses_to_partkey() passed the arguments of an OR
clause all at once.  That broke the assumption mentioned above that all of
the clauses in the list passed to match_clauses_to_partkey() are mutually
conjunctive.  Instead, we must make a single-member list for each of the
OR clause's arguments and pass each such list separately.

If we then get constfalse for all of the OR's arguments, we return
constfalse=true to the original caller.
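
Roughly, as a sketch (names and signature illustrative, not the actual
patch code):

/* Process each OR argument as its own single-member "conjunction". */
all_args_constfalse = true;
foreach(lc, ((BoolExpr *) clause)->args)
{
    bool    arg_constfalse = false;
    List   *arglist = list_make1(lfirst(lc));

    match_clauses_to_partkey(rel, arglist, &arg_constfalse /* ... */);
    if (!arg_constfalse)
        all_args_constfalse = false;    /* this arm can still be true */
}

/* The whole OR is constant-false only if every one of its arms is. */
if (all_args_constfalse)
    *constfalse = true;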

> 
> +  /*
> +   * If no commutator exists, cannot flip the qual's args,
> +   * so give up.
> +   */
> +  if (!OidIsValid(expr_op))
> +continue;
> 
> I suppose it's better to keep leftop and rightop as is rather than
> flipping them over so that the var is placed on the left side. Does
> that make things so complex?

The reason to do it that way is that the code placed far away (the code in
partition.c that extracts constant values to use for pruning from matched
clauses) can then always assume that the clauses determined to be useful
for partition pruning come in the 'partkey op constant' form.
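
For illustration, the flipping amounts to something like this (a sketch;
error handling and OpExpr bookkeeping elided):

OpExpr *opclause = (OpExpr *) clause;
Oid     comm_op = get_commutator(opclause->opno);

/* 'constant op partkey' -- rewrite as 'partkey op constant' using the
 * commutator, so the downstream pruning code can rely on that shape. */
if (!OidIsValid(comm_op))
    continue;               /* no commutator, give up on this clause */

constexpr = (Node *) linitial(opclause->args);  /* the constant side */
partexpr = (Node *) lsecond(opclause->args);    /* the partkey side */
/* ... record (partexpr, comm_op, constexpr) as the matched clause ... */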

> + * It the operator happens to be '<>', which is never listed
> If?

Right, will fix.

> +if (!op_in_opfamily(expr_op, partopfamily))
> +{
> +  Oidnegator = get_negator(expr_op);
> +
> +  if (!OidIsValid(negator) ||
> +!op_in_opfamily(negator, partopfamily))
> +continue;
> 
> classify_partition_bounding_keys() checks the same thing by
> checking whether the negator's strategy is
> BTEqualStrategyNumber. (I'm not sure the operator is guaranteed
> to be of btree, though..) Shouldn't they be checked in a similar
> way?

You're right.  The <>'s negator may not always be a btree operator.  So,
we should add a check in match_clauses_to_partkey() that list or range
partitioning is in use, because only those require a btree operator
family.  We now have hash partitioning, so we need to be careful not to
make the assumption that all partitioning operators are from btree
operator families.
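
A sketch of what the stricter check could look like (hypothetical code,
mirroring what classify_partition_bounding_keys() does):

if (!op_in_opfamily(expr_op, partopfamily))
{
    int     strategy;
    Oid     lefttype,
            righttype,
            negator;

    /* Only list and range partitioning use btree operator families, so
     * a <> clause is usable only there, via its btree equality negator. */
    if (part_scheme->strategy == PARTITION_STRATEGY_HASH)
        continue;

    negator = get_negator(expr_op);
    if (!OidIsValid(negator) || !op_in_opfamily(negator, partopfamily))
        continue;

    get_op_opfamily_properties(negator, partopfamily, false,
                               &strategy, &lefttype, &righttype);
    if (strategy != BTEqualStrategyNumber)
        continue;
}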

If match_clauses_to_partkey() 

Re: [HACKERS] path toward faster partition pruning

2017-11-09 Thread Amit Langote
Hi Amul.

On 2017/11/09 20:05, amul sul wrote:
> On Mon, Nov 6, 2017 at 3:31 PM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> On 2017/11/06 14:32, David Rowley wrote:
>>> On 6 November 2017 at 17:30, Amit Langote wrote:
>>>> On 2017/11/03 13:32, David Rowley wrote:
>>>>> On 31 October 2017 at 21:43, Amit Langote wrote:
> []
>>
>> Attached updated set of patches, including the fix to make the new pruning
>> code handle Boolean partitioning.
>>
> 
> I am getting the following warning on Mac OS X:

Thanks for the review.

> partition.c:1800:24: warning: comparison of constant -1 with
> expression of type 'NullTestType'
>   (aka 'enum NullTestType') is always false
> [-Wtautological-constant-out-of-range-compare]
> if (keynullness[i] == -1)
> ~~ ^  ~~
> partition.c:1932:25: warning: comparison of constant -1 with
> expression of type 'NullTestType'
>   (aka 'enum NullTestType') is always false
> [-Wtautological-constant-out-of-range-compare]
> if (keynullness[i] == -1)
> ~~ ^  ~~
> 2 warnings generated.
> 
> 
> Comment for 0004 patch:
> 270 +   /* -1 represents an invalid value of NullTestType. */
>  271 +   memset(keynullness, -1, PARTITION_MAX_KEYS * sizeof(NullTestType));
> 
> I think we should not use memset to set a value other than 0 or true/false.
> This will work for -1 on systems where values are stored in two's
> complement, but I am wary of other architectures.

OK, I will remove all instances of comparing and setting variables of type
NullTestType to a value of -1.
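
For instance, a portable alternative (just a sketch) is to track
initialization separately instead of memset()ing the enum array to -1:

NullTestType keynullness[PARTITION_MAX_KEYS];
bool         keynullness_set[PARTITION_MAX_KEYS];

/* 0/false is always safe to memset, unlike -1 */
memset(keynullness_set, 0, sizeof(keynullness_set));

/* ... and later, instead of "if (keynullness[i] == -1)" ... */
if (!keynullness_set[i])
{
    /* no IS [NOT] NULL clause was seen for this key column */
}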

Thanks,
Amit





Re: [HACKERS] [PATCH] fix wrong create table statement in documentation

2017-11-08 Thread Amit Langote
On 2017/11/09 7:21, Tom Lane wrote:
> jotpe writes:
>> In the current documentation [1] this create table statement is listed:
>> CREATE TABLE measurement_y2008m01 PARTITION OF measurement
>>  FOR VALUES FROM ('2008-01-01') TO ('2008-02-01')
>>  TABLESPACE fasttablespace
>>  WITH (parallel_workers = 4);
> 
> Yup, that's wrong.  Fix pushed, thanks!

Oops!  Thanks Tom for the fix.

Regards,
Amit





Re: [HACKERS] path toward faster partition pruning

2017-11-08 Thread Amit Langote
Hi Rajkumar,

Thanks for testing.

On 2017/11/08 15:52, Rajkumar Raghuwanshi wrote:
> On Mon, Nov 6, 2017 at 3:31 PM, Amit Langote <langote_amit...@lab.ntt.co.jp>
> wrote:
> 
>> Attached updated set of patches, including the fix to make the new pruning
>> code handle Boolean partitioning.
>>
> 
> Hi Amit,
> 
> I have tried pruning for different values of constraint exclusion GUC
> change, not sure exactly how it should behave, but I can see with the
> delete statement pruning is not happening when constraint_exclusion is off,
> but select is working as expected. Is this expected behaviour?

Hmm, the new pruning only works for selects, not DML.  The patch also
changes get_relation_constraints() to not include the internal partition
constraints, but mistakenly does so for all query types, not just select.
Will look into it.
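
A minimal sketch of the intended fix, assuming it's done inside
get_relation_constraints() (the actual change may well look different):

/* Include the internal partition constraint for non-SELECT queries,
 * which still rely on constraint exclusion rather than the new pruning
 * code. */
if (root->parse->commandType != CMD_SELECT)
{
    List   *pcqual = RelationGetPartitionQual(relation);

    if (pcqual != NIL)
        result = list_concat(result, pcqual);
}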

Thanks,
Amit





Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

2017-11-07 Thread Amit Langote
Hi David.

Thanks for the review.

(..also looking at the comments you sent earlier today.)

On 2017/11/07 11:14, David Rowley wrote:
> On 7 November 2017 at 01:52, David Rowley wrote:
>> Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13)
> 
> I have a little more review to share:
> 
> 1. Missing "in" in comment. Should be "mentioned in"
> 
>  * get_append_rel_partitions
>  * Return the list of partitions of rel that pass the clauses mentioned
>  * rel->baserestrictinfo
> 
> 2. Variable should be declared in inner scope with the following fragment:
> 
> void
> set_basic_child_rel_properties(PlannerInfo *root,
>RelOptInfo *rel,
>RelOptInfo *childrel,
>AppendRelInfo *appinfo)
> {
> AttrNumber attno;
> 
> if (rel->part_scheme)
> {
> 
> which makes the code the same as where you moved it from.

It seems like you included the above changes in your attached C file,
which I will incorporate into my repository.

> 3. Normally lfirst() is assigned to a variable at the start of a
> foreach() loop. You have code which does not follow this.
> 
> foreach(lc, clauses)
> {
> Expr   *clause;
> int i;
> 
> if (IsA(lfirst(lc), RestrictInfo))
> {
> RestrictInfo *rinfo = lfirst(lc);
> 
> You could assign this to a Node * since the type is unknown to you at
> the start of the loop.

Will make the suggested changes to match_clauses_to_partkey().

> 4.
> /*
> * Useless if what we're thinking of as a constant is actually
> * a Var coming from this relation.
> */
> if (bms_is_member(rel->relid, constrelids))
> continue;
> 
> should this be moved to just above the op_strict() test? This one seems 
> cheaper.

Agreed, will do.  Also makes sense to move the PartCollMatchesExprColl()
test together.

> 5. Typo "paritions": /* No clauses to prune paritions, so scan all
> partitions. */
> 
> But thinking about it more, the comment should be something more along
> the lines of /* No useful clauses for partition pruning. Scan all
> partitions. */

You fixed it. :)

> The key difference is that there might be clauses, just without Consts.
> 
> Actually, the more I look at get_append_rel_partitions() I think it
> would be better if you re-shaped that if/else if test so that it only
> performs the loop over the partindexes if it's been set.
> 
> I ended up with the attached version of the function after moving
> things around a little bit.

Thanks a lot for that.  Looks much better now.

> I'm still reviewing but thought I'd share this part so far.

As mentioned at the top, I'm looking at your latest comments and they all
seem to be good points to me, so will address those in the next version.

Thanks,
Amit





Re: [HACKERS] UPDATE of partition key

2017-11-06 Thread Amit Langote
On 2017/11/07 14:40, Amit Khandekar wrote:
> On 7 November 2017 at 00:33, Robert Haas  wrote:
> 
>> Also, +1 for Amit Langote's idea of trying to merge
>> mt_perleaf_childparent_maps with mt_persubplan_childparent_maps.
> 
> Currently I am trying to see if it simplifies things if we do that. We
> will be merging these arrays into one, but we are adding a new int[]
> array that maps subplans to leaf partitions. Will get back with how it
> looks finally.

One thing to note is that the int[] array I mentioned will be much faster
to compute than calling convert_tuples_by_name() to build the additional
maps array.
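
To sketch what I mean (hypothetical names; both OID arrays assumed to be
in the canonical partition-bound order):

/* Map each UPDATE subplan to its position among all leaf partitions
 * with a simple merge, rather than building TupleConversionMaps via
 * convert_tuples_by_name() for entries that aren't needed. */
int *subplan_leaf_map = (int *) palloc(nplans * sizeof(int));
int  i,
     j = 0;

for (i = 0; i < nplans; i++)
{
    while (leaf_part_oids[j] != subplan_rel_oids[i])
        j++;
    subplan_leaf_map[i] = j;
}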

Thanks,
Amit





Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

2017-11-06 Thread Amit Langote
On 2017/11/06 21:52, David Rowley wrote:
> On 6 November 2017 at 23:01, Amit Langote <langote_amit...@lab.ntt.co.jp> 
> wrote:
>> OK, I have gotten rid of the min/max partition index interface and instead
>> adopted the bms_add_range() approach by including your patch to add the
>> same in the patch set (which is now 0002 in the whole set).  I have to
>> admit that it's simpler to understand the new code with just Bitmapsets to
>> look at, but I'm still a bit concerned about materializing the whole set
>> right within partition.c, although we can perhaps optimize it later.
> 
> Thanks for making that change. The code looks much more simple now.
> 
> For performance, if you're worried about a very large number of
> partitions, then I think you're better off using bms_next_member()
> rather than bms_first_member(), (likely this applies globally, but you
> don't need to worry about those).
> 
> The problem with bms_first_member is that it must always loop over the
> 0 words before it finds any bits set for each call, whereas
> bms_next_member will start on the word it was last called for. There
> will likely be a pretty big performance difference between the two
> when processing a large Bitmapset.

Ah, thanks for the explanation.  I will change it to bms_next_member() in
the next version.
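
For reference, the non-destructive iteration pattern is:

int     i = -1;

/* bms_next_member() resumes from where the previous call left off and
 * doesn't modify the set; bms_first_member() destructively clears the
 * bit and has to rescan from word 0 on every call. */
while ((i = bms_next_member(partindexes, i)) >= 0)
{
    /* ... use partition index i ... */
}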

>> Attached updated set of patches, including the fix to make the new pruning
>> code handle Boolean partitioning.
> 
> Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13)

Thank you.

Regards,
Amit





Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

2017-11-06 Thread Amit Langote
On 2017/11/06 13:15, David Rowley wrote:
> On 31 October 2017 at 21:43, Amit Langote <langote_amit...@lab.ntt.co.jp> 
> wrote:
>> Attached updated version of the patches
> 
> match_clauses_to_partkey() needs to allow for the way quals on Bool
> columns are represented.
> 
> create table pt (a bool not null) partition by list (a);
> create table pt_true partition of pt for values in('t');
> create table pt_false partition of pt for values in('f');
> explain select * from pt where a = true;
> QUERY PLAN
> --
>  Append  (cost=0.00..76.20 rows=2810 width=1)
>->  Seq Scan on pt_false  (cost=0.00..38.10 rows=1405 width=1)
>  Filter: a
>->  Seq Scan on pt_true  (cost=0.00..38.10 rows=1405 width=1)
>  Filter: a
> (5 rows)
> 
> match_clause_to_indexcol() shows an example of how to handle this.
> 
> explain select * from pt where a = false;
> 
> will need to be allowed too. This works slightly differently.

You're right.  I've fixed things to handle Boolean partitioning in the
updated set of patches I will post shortly.
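
For the archives, the gist of what match_clause_to_indexcol() does for
boolean columns, sketched here in partition-key terms (IsBooleanOpfamily()
is the test indxpath.c uses; treat the rest as illustrative):

/* A bare boolean key reference means "key = true"; NOT key means
 * "key = false". */
if (IsBooleanOpfamily(partopfamily))
{
    if (equal(clause, partkey))
        ;   /* match as partkey = true */
    else if (not_clause((Node *) clause) &&
             equal(get_notclausearg((Expr *) clause), partkey))
        ;   /* match as partkey = false */
}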

Thanks,
Amit





Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

2017-11-05 Thread Amit Langote
On 2017/11/06 12:53, David Rowley wrote:
> On 3 November 2017 at 17:32, David Rowley wrote:
>> 2. This code is way more complex than it needs to be.
>>
>> if (num_parts > 0)
>> {
>> int j;
>>
>> all_indexes = (int *) palloc(num_parts * sizeof(int));
>> j = 0;
>> if (min_part_idx >= 0 && max_part_idx >= 0)
>> {
>> for (i = min_part_idx; i <= max_part_idx; i++)
>> all_indexes[j++] = i;
>> }
>> if (!bms_is_empty(other_parts))
>> while ((i = bms_first_member(other_parts)) >= 0)
>> all_indexes[j++] = i;
>> if (j > 1)
>> qsort((void *) all_indexes, j, sizeof(int), intcmp);
>> }
>>
>> It looks like the min/max partition stuff is just complicating things
>> here. If you need to build this array of all_indexes[] anyway, I don't
>> quite understand the point of the min/max. It seems like min/max would
>> probably work a bit nicer if you didn't need the other_parts
>> BitmapSet, so I recommend just getting rid of min/max completely and
>> just have a BitmapSet with bit set for each partition's index you
>> need, you'd not need to go to the trouble of performing a qsort on an
>> array and you could get rid of quite a chunk of code too.
>>
>> The entire function would then not be much more complex than:
>>
>> partindexes = get_partitions_from_clauses(parent, partclauses);
>>
>> while ((i = bms_first_member(partindexes)) >= 0)
>> {
>> AppendRelInfo *appinfo = rel->part_appinfos[i];
>> result = lappend(result, appinfo);
>> }
>>
>> Then you can also get rid of your intcmp() function too.
> 
> I've read a bit more of the patch and I think even more now that the
> min/max stuff should be removed.

Oops, I didn't catch this email before sending my earlier reply.  Thanks
for the bms range patch.  Will reply to this shortly after reading your
patch and thinking on it a bit.

Thanks,
Amit






Re: [HACKERS] dropping partitioned tables without CASCADE

2017-11-05 Thread Amit Langote
On 2017/11/03 21:39, Alvaro Herrera wrote:
> Ashutosh Bapat wrote:
>> On Fri, Nov 3, 2017 at 1:42 PM, Alvaro Herrera wrote:
> 
>>> I think adding "is partitioned" at end of line isn't good; looks like a
>>> phrase but isn't translatable.  Maybe add keyword PARTITIONED instead?
>>
>> In that case may be we should separate bounds and "PARTITIONED" with a
>> ",". "part_default DEFAULT, PARTITIONED" would read better than
>> "part_default DEFAULT PARTITIONED"?
> 
> Hmm, I vote +0.5 for the comma.

Me too.

>>> Is it possible to put it at either start or end of the list?
>>
>> Right now, we could do that if we order the list by bound expression;
>> lexically DEFAULT would come before FOR VALUES ... . But that's not
>> future-safe; we may have a bound expression starting with A, B or C.
>> Beyond that it really gets tricky to order the partitions by bounds.
> 
> I was just thinking in changing the query to be "order by
> is_the_default_partition, partition_name" instead of just "order by
> partition_name".  Sorting by bounds rather than name (a feature whose
> worth should definitely be discussed separately IMV) sounds a lot more
> complicated.

Yeah, it sounds like a desirable feature, but as you both say, it should
be discussed separately.  Since the facility to order partitions by bound
is currently internal to the server, we'd need new server-side
functionality to expose it through a sane SQL-callable interface, which
clearly needs its own separate discussion.

Thanks,
Amit





Re: [HACKERS] Setting pd_lower in GIN metapage

2017-11-05 Thread Amit Langote
On 2017/11/03 6:24, Tom Lane wrote:
> Amit Langote <langote_amit...@lab.ntt.co.jp> writes:
>> On 2017/09/26 16:30, Michael Paquier wrote:
>>> Cool, let's switch it back to a ready for committer status then.
> 
>> Sure, thanks.
> 
> Pushed with some cosmetic adjustments --- mostly, making the comments more
> explicit about why we need the apparently-redundant assignments to
> pd_lower.

Thank you.

Regards,
Amit





Re: [HACKERS] UPDATE of partition key

2017-11-02 Thread Amit Langote
Hi Amit.

Thanks a lot for updated patches and sorry that I couldn't get to looking
at your emails sooner.  Note that I'm replying here to both of your
emails, but looking at only the latest v22 patch.

On 2017/10/24 0:15, Amit Khandekar wrote:
> On 16 October 2017 at 08:28, Amit Langote <langote_amit...@lab.ntt.co.jp> 
> wrote:
>>
>> +   (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL)))
>>
>> Is there some reason why a bitwise operator is used here?
> 
> That exact condition means that the function is called for transition
> capture for updated rows being moved to another partition. For this
> scenario, either the oldtup or the newtup is NULL. I wanted to exactly
> capture that condition there. I think the bitwise operator is more
> user-friendly in emphasizing the point that it is indeed an "either a
> or b, not both" condition.

I see.  In that case, since this patch adds the new condition, a note
about it in the comment just above would be good, because the situation
you describe here seems to arise only during update-tuple-routing, IIUC.

>> + * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects
>> + * with one entry for every leaf partition (required to convert input tuple
>> + * based on the root table's rowtype to a leaf partition's rowtype after
>> + * tuple routing is done)
>>
>> Could this be named leaf_tupconv_maps, maybe?  It perhaps makes clear that
>> they are maps needed for "tuple conversion".  And the other field holding
>> the reverse map as leaf_rev_tupconv_maps.  Either that or use underscores
>> to separate words, but then it gets too long I guess.
> 
> In master branch, now this param is already there with the name
> "tup_conv_maps". In the rebased version in the earlier mail, I haven't
> again changed it. I think "tup_conv_maps" looks clear enough.

OK.

In the latest patch:

+ * 'update_rri' has the UPDATE per-subplan result rels. These are re-used
+ *  instead of allocating new ones while generating the array of all leaf
+ *  partition result rels.

Instead of:

"These are re-used instead of allocating new ones while generating the
array of all leaf partition result rels."

how about:

"There is no need to allocate a new ResultRellInfo entry for leaf
partitions for which one already exists in this array"

>> ISTM, ConvertPartitionTupleSlot has a very specific job and a bit complex
>> interface.  I guess it could simply have the following interface:
>>
>> static HeapTuple ConvertPartitionTuple(ModifyTableState *mtstate,
>>HeapTuple tuple, bool is_update);
>>
>> And figure out, based on the value of is_update, which map to use and
>> which slot to set *p_new_slot to (what is now "new_slot" argument).
>> You're getting mtstate here anyway, which contains all the information you
>> need here.  It seems better to make that (selecting which map and which
>> slot) part of the function's implementation if we're having this function
>> at all, imho.  Maybe I'm missing some details there, but my point still
>> remains that we should try to put more logic in that function instead of
>> it just do the mechanical tuple conversion.
> 
> I tried to see how the interface would look if we do that way. Here is
> how the code looks :
> 
> static TupleTableSlot *
> ConvertPartitionTupleSlot(ModifyTableState *mtstate,
> bool for_update_tuple_routing,
> int map_index,
> HeapTuple *tuple,
> TupleTableSlot *slot)
> {
>TupleConversionMap   *map;
>TupleTableSlot *new_slot;
> 
>if (for_update_tuple_routing)
>{
>   map = mtstate->mt_persubplan_childparent_maps[map_index];
>   new_slot = mtstate->mt_rootpartition_tuple_slot;
>}
>else
>{
>   map = mtstate->mt_perleaf_parentchild_maps[map_index];
>   new_slot = mtstate->mt_partition_tuple_slot;
>}
> 
>if (!map)
>   return slot;
> 
>*tuple = do_convert_tuple(*tuple, map);
> 
>/*
> * Change the partition tuple slot descriptor, as per converted tuple.
> */
>ExecSetSlotDescriptor(new_slot, map->outdesc);
>ExecStoreTuple(*tuple, new_slot, InvalidBuffer, true);
> 
>return new_slot;
> }
> 
> It looks like the interface does not much simplify, and above that, we
> have more number of lines in that function. Also, the caller anyway
> has to be aware whether the map_index is the index into the leaf
> partitions or the update subplans. So it i

Re: [HACKERS] Adding column_constraint description in ALTER TABLE synopsis

2017-10-31 Thread Amit Langote
On 2017/10/31 21:31, Stephen Frost wrote:
> * Lætitia Avrot (laetitia.av...@gmail.com) wrote:
>> As Amit Langote pointed out, the column_constraint definition is missing
>> whereas it is used in ALTER TABLE synopsis. It can be easily found in the
>> CREATE TABLE synopsis, but it's not very user friendly.
>
> Thanks, this looks pretty reasonable, but did you happen to look for any
> other keywords in the ALTER TABLE that should really be in ALTER TABLE
> also?
> 
> I'm specifically looking at, at least, partition_bound_spec.  Maybe you
> could propose an updated patch which addresses that also, and any other
> cases you find?

Ah, yes.  I remember having left out partition_bound_spec simply because I
thought that was how it was supposed to be done, seeing that neither
column_constraint nor table_constraint was expanded in the ALTER TABLE
synopsis.

It seems that there are indeed a couple of other things that need to be
brought over to the ALTER TABLE synopsis, including partition_bound_spec.
9f295c08f877 [1] added table_constraint, but missed adding the descriptions
of index_parameters and exclude_element, which are referenced therein.

Attached find updated version of the Lætitia's patch.

Thanks,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9f295c
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 41acda003f..e059f87875 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -85,6 +85,20 @@ ALTER TABLE [ IF EXISTS ] name
     OWNER TO { new_owner | CURRENT_USER | SESSION_USER }
     REPLICA IDENTITY { DEFAULT | USING INDEX index_name | FULL | NOTHING }
 
+and column_constraint is:
+
+[ CONSTRAINT constraint_name ]
+{ NOT NULL |
+  NULL |
+  CHECK ( expression ) [ NO INHERIT ] |
+  DEFAULT default_expr |
+  GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( sequence_options ) ] |
+  UNIQUE index_parameters |
+  PRIMARY KEY index_parameters |
+  REFERENCES reftable [ ( refcolumn ) ] [ MATCH FULL | MATCH PARTIAL | MATCH SIMPLE ]
+    [ ON DELETE action ] [ ON UPDATE action ] }
+[ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
+
 and table_constraint is:
 
 [ CONSTRAINT constraint_name ]
@@ -96,6 +110,15 @@ ALTER TABLE [ IF EXISTS ] name
 [ MATCH FULL | MATCH PARTIAL | MATCH SIMPLE ] [ ON DELETE action ] [ ON UPDATE action ] }
 [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
 
+index_parameters in UNIQUE, PRIMARY KEY, and EXCLUDE constraints are:
+
+[ WITH ( storage_parameter [= value] [, ... ] ) ]
+[ USING INDEX TABLESPACE tablespace_name ]
+
+exclude_element in an EXCLUDE constraint is:
+
+{ column_name | ( expression ) } [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ]
+
 and table_constraint_using_index is:
 
 [ CONSTRAINT constraint_name ]
@@ -104,6 +127,13 @@ ALTER TABLE [ IF EXISTS ] name
 
  
 
+and partition_bound_spec is:
+
+IN ( { numeric_literal | string_literal | NULL } [, ...] ) |
+FROM ( { numeric_literal | string_literal | MINVALUE | MAXVALUE } [, ...] )
+  TO ( { numeric_literal | string_literal | MINVALUE | MAXVALUE } [, ...] )
+
+
  
   Description



Re: [HACKERS] path toward faster partition pruning

2017-10-31 Thread Amit Langote
Thanks for the test case.

On 2017/10/30 17:09, Rajkumar Raghuwanshi wrote:
> I am getting wrong output when the default partition is sub-partitioned
> further; below is a test case.
> 
> CREATE TABLE lpd(a int, b varchar, c float) PARTITION BY LIST (a);
> CREATE TABLE lpd_p1 PARTITION OF lpd FOR VALUES IN (1,2,3);
> CREATE TABLE lpd_p2 PARTITION OF lpd FOR VALUES IN (4,5);
> CREATE TABLE lpd_d PARTITION OF lpd DEFAULT PARTITION BY LIST(a);
> CREATE TABLE lpd_d1 PARTITION OF lpd_d FOR VALUES IN (7,8,9);
> CREATE TABLE lpd_d2 PARTITION OF lpd_d FOR VALUES IN (10,11,12);
> CREATE TABLE lpd_d3 PARTITION OF lpd_d FOR VALUES IN (6,null);
> INSERT INTO lpd SELECT i,i,i FROM generate_Series (1,12)i;
> INSERT INTO lpd VALUES (null,null,null);
> 
> --on HEAD
> postgres=# EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM lpd WHERE
> a IS NOT NULL ORDER BY 1;
>  QUERY PLAN
> -
>  Sort
>Sort Key: ((lpd_p1.tableoid)::regclass)
>->  Result
>  ->  Append
>->  Seq Scan on lpd_p1
>  Filter: (a IS NOT NULL)
>->  Seq Scan on lpd_p2
>  Filter: (a IS NOT NULL)
>->  Seq Scan on lpd_d3
>  Filter: (a IS NOT NULL)
>->  Seq Scan on lpd_d1
>  Filter: (a IS NOT NULL)
>->  Seq Scan on lpd_d2
>  Filter: (a IS NOT NULL)
> (14 rows)
> 
> postgres=#
> postgres=# SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT NULL ORDER
> BY 1;
>  tableoid | a  | b  | c
> --+++
>  lpd_p1   |  1 | 1  |  1
>  lpd_p1   |  2 | 2  |  2
>  lpd_p1   |  3 | 3  |  3
>  lpd_p2   |  4 | 4  |  4
>  lpd_p2   |  5 | 5  |  5
>  lpd_d1   |  7 | 7  |  7
>  lpd_d1   |  8 | 8  |  8
>  lpd_d1   |  9 | 9  |  9
>  lpd_d2   | 12 | 12 | 12
>  lpd_d2   | 10 | 10 | 10
>  lpd_d2   | 11 | 11 | 11
>  lpd_d3   |  6 | 6  |  6
> (12 rows)
> 
> 
> --on HEAD + v8 patches
> 
> postgres=# EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM lpd WHERE
> a IS NOT NULL ORDER BY 1;
>  QUERY PLAN
> -
>  Sort
>Sort Key: ((lpd_p1.tableoid)::regclass)
>->  Result
>  ->  Append
>->  Seq Scan on lpd_p1
>  Filter: (a IS NOT NULL)
>->  Seq Scan on lpd_p2
>  Filter: (a IS NOT NULL)
> (8 rows)
> 
> postgres=# SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT NULL ORDER
> BY 1;
>  tableoid | a | b | c
> --+---+---+---
>  lpd_p1   | 1 | 1 | 1
>  lpd_p1   | 2 | 2 | 2
>  lpd_p1   | 3 | 3 | 3
>  lpd_p2   | 4 | 4 | 4
>  lpd_p2   | 5 | 5 | 5
> (5 rows)

I found bugs in 0003 and 0005 that caused this.  Will post the patches
containing the fix in reply to Dilip's email, which contains some code
review comments [1].

Also, I noticed that the new pruning code was having a hard time dealing
with the fact that the default "range" partition doesn't explicitly say in
its partition constraint that it might contain null values.  More
precisely perhaps, the default range partition's constraint appears to
imply that it can only contain non-null values, which confuses the new
pruning code.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFiTN-thYsobXxPS6bwOA_9erpax_S=iztsn3rtuxkkmkg4...@mail.gmail.com





Re: [HACKERS] path toward faster partition pruning

2017-10-27 Thread Amit Langote
On 2017/10/27 13:57, Robert Haas wrote:
> On Fri, Oct 27, 2017 at 3:17 AM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>>> I don't think we really want to get into theorem-proving here, because
>>> it's slow.
>>
>> Just to be clear, I'm saying we could use theorem-proving (if at all) just
>> for the default partition.
> 
> I don't really see why it should be needed there either.  We've got
> all the bounds in order, so we should know where there are any gaps
> that are covered by the default partition in the range we care about.

Sorry, I forgot to add: "...just for the default partition, for cases like
the one in Beena's example."

In normal cases, default partition selection doesn't require any
theorem-proving.  It proceeds in a straightforward manner more or less
like what you said it should.

After thinking more on it, I think there is a rather straightforward trick
to implement the idea you mentioned to get this working for the case
presented in Beena's example, which works as follows:

For any non-root partitioned tables, we add the list of its partition
constraint clauses to the query-provided list of clauses and use the whole
list to drive the partition-pruning algorithm.  So, when partition-pruning
runs for tprt_1, along with (< 1) which the original query provides,
we also have (>= 1) which comes from the partition constraint of tprt_1
(which is >= 1 and < 5).  Note that there exists a trick in the new
code for the (< 5) coming from the constraint to be overridden by the
more restrictive (< 1) coming from the original query.
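
In pseudo-code, the trick is roughly this (hypothetical helper names):

/* When pruning partitions of a non-root partitioned table, let its own
 * partition constraint clauses participate alongside the query's. */
if (rel_is_nonroot_partitioned_table)
    clauses = list_concat(get_partition_constraint_clauses(rel),
                          list_copy(clauses));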

Thanks,
Amit





Re: [HACKERS] path toward faster partition pruning

2017-10-26 Thread Amit Langote
On 2017/10/26 20:34, Robert Haas wrote:
> On Thu, Oct 26, 2017 at 1:17 PM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> It can perhaps be taught not to make that conclusion by taking into account
>> the default partition's partition constraint, which includes constraint
>> inherited from the parent, viz. 1 <= col1 < 50001.  To do that, it might
>> be possible to summon up predtest.c's powers to conclude from the default
>> partition's partition constraint that it cannot contain any keys < 1, but
>> then we'll have to frame up a clause expression describing the latter.
>> Generating such a clause expression can be a bit daunting for a
>> multi-column key.   So, I haven't yet tried really hard to implement this.
>>  Any thoughts on that?
> 
> I don't think we really want to get into theorem-proving here, because
> it's slow.

Just to be clear, I'm saying we could use theorem-proving (if at all) just
for the default partition.

> Whatever we're going to do we should be able to do without
> that - keeping it in the form of btree-strategy + value.  It doesn't
> seem that hard.  Suppose we're asked to select partitions from tprt
> subject to (<, 1).  Well, we determine that some of the tprt_1
> partitions may be relevant, so we tell tprt_1 to select partitions
> subject to (>=, 1, <, 1).  We know to do that because we know that
> 1 < 5 and we know to include >= 1 because we haven't got any
> lower bound currently at all.  What's the problem?

Hmm, that's interesting.  With the approach that the patch currently
takes, (>= 1) wouldn't be passed down when selecting the partitions of
tprt_1.  The source of values (+ btree strategy) to use to select
partitions is the same original set of clauses for all partitioned tables
in the tree, as the patch currently implements it.  Nothing would get
added to that set (>= 1, as in this example), nor subtracted (such as
clauses that are trivially true).

I will think about this approach in general and to solve this problem in
particular.

> In some sense it's tempting to say that this case just doesn't matter
> very much; after all, subpartitioning on the same column used to
> partition at the top level is arguably lame.  But if we can get it
> right in a relatively straightforward manner then let's do it.

Yeah, I tend to agree.

Thanks for the input.

Regards,
Amit





Re: [HACKERS] [sqlsmith] Failed assertion in adjust_appendrel_attrs_mutator

2017-10-23 Thread Amit Langote
On 2017/10/24 0:22, Tom Lane wrote:
> Amit Langote <langote_amit...@lab.ntt.co.jp> writes:
>> On 2017/10/23 2:07, Tom Lane wrote:
>>> Hmm.  adjust_appendrel_attrs() thinks it's only used after conversion
>>> of sublinks to subplans, but this is a counterexample.  I wonder if
>>> that assumption was ever correct?  Or maybe we need to rethink what
>>> it does when recursing into RTE subqueries.
> 
>> Looking at the backtrace, the crashing SubLink seems to have been
>> referenced from a PlaceHolderVar that is in turn referenced by
>> joinaliasvars of a JOIN rte in query->rtable.
> 
> Right.  The core of the issue is that joinaliasvars lists don't get run
> through preprocess_expression, so (among other things) any SubLinks in
> them don't get expanded to subplans.

Ah, I missed that.

> Probably the reason we've not
> heard field complaints about this is that in a non-assert build there
> would be no observable bad effects --- the scan would simply ignore
> the subquery, and whether the joinaliasvars entry has been correctly
> mutated doesn't matter at all because it will never be used again.

I see.

>> I wonder if we shouldn't just ignore those (joinaliasvars in JOIN rtes)
>> while adjust_appendrel_attrs() is doing its job on a Query, just like we
>> ask to ignore subqueries by passing QTW_IGNORE_RC_SUBQUERIES to
>> query_tree_mutator()?
> 
> I don't really like this fix, because ISTM it's fixing one symptom rather
> than the root of the problem.

That's true.

> The root is that joinaliasvars lists
> diverge from the representation of expressions elsewhere in the tree,
> and not only with respect to SubLinks --- another example is that function
> calls with named arguments won't have been rearranged into executable
> form.  That could easily be a dangerous thing, if we allow arbitrary
> expression processing to get done on them.  Moreover, your patch is
> letting the divergence get even bigger: now the joinaliasvars lists don't
> even have the right varnos, making them certainly unusable for anything.

Hmm, yes.  Although, I only proposed that patch because I had convinced
myself that joinaliasvars lists weren't looked at anymore.

> So what I'm thinking is that we would be well advised to actually remove
> the untransformed joinaliasvars from the tree as soon as we're done with
> preprocessing expressions.  We'd drop them at the end of planning anyway
> (cf. add_rte_to_flat_rtable) so this is just getting rid of them a bit
> sooner, and it won't affect anything after the planner.
>
> In short, I'm thinking something more like the attached.

Yeah, that makes more sense.

Thanks,
Amit





Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-10-23 Thread Amit Langote
On 2017/10/24 1:15, Alvaro Herrera wrote:
> Robert Haas wrote:
>> On Mon, Oct 23, 2017 at 11:12 AM, Alvaro Herrera wrote:
>>> I started with Maksim's submitted code, and developed according to the
>>> ideas discussed in this thread.  Attached is a very WIP patch series for
>>> this feature.

Nice!

>>> Many things remain to be done before this is committable:  pg_dump
>>> support needs to be written.  ALTER INDEX ATTACH/DETACH not yet
>>> implemented.  No REINDEX support yet.  Docs not updated (but see the
>>> regression test as a guide for how this is supposed to work; see patch
>>> 0005).  CREATE INDEX CONCURRENTLY not done yet.
>>>
>>> I'm now working on the ability to build unique indexes (and unique
>>> constraints) on top of this.
>>
>> Cool.  Are you planning to do that by (a) only allowing the special
>> case where the partition key columns/expressions are included in the
>> indexed columns/expressions, (b) trying to make every insert to any
>> index check all of the indexes for uniqueness conflicts, or (c)
>> implementing global indexes?  Because (b) sounds complex - think about
>> attach operations, for example - and (c) sounds super-hard.  I'd
>> suggest doing (a) first, just on the basis of complexity.
> 
> Yes, I think (a) is a valuable thing to have -- not planning on doing
> (c) at all because I fear it'll be a huge time sink.  I'm not sure about
> (b), but it's not currently on my plan.

+1 to proceeding with (a) first.

>> I hope that you don't get so involved in making this unique index
>> stuff work that we don't get the cascading index feature, at least,
>> committed to v11.  That's already a considerable step forward in terms
>> of ease of use, and I'd really like to have it.
> 
> Absolutely -- I do plan to get this one finished regardless of unique
> indexes.

+1

Thanks,
Amit





Re: [HACKERS] [sqlsmith] Failed assertion in adjust_appendrel_attrs_mutator

2017-10-23 Thread Amit Langote
On 2017/10/23 2:07, Tom Lane wrote:
> Andreas Seltenreich writes:
>> testing master as of 7c981590c2, sqlsmith just triggered the following
>> assertion:
>> TRAP: FailedAssertion("!(!(((const Node*)(node))->type == T_SubLink))",
>> File: "prepunion.c", Line: 2231)
> 
> Hmm.  adjust_appendrel_attrs() thinks it's only used after conversion
> of sublinks to subplans, but this is a counterexample.  I wonder if
> that assumption was ever correct?  Or maybe we need to rethink what
> it does when recursing into RTE subqueries.

Looking at the backtrace, the crashing SubLink seems to have been
referenced from a PlaceHolderVar that is in turn referenced by
joinaliasvars of a JOIN rte in query->rtable.

I wonder if we shouldn't just ignore those (joinaliasvars in JOIN rtes)
while adjust_appendrel_attrs() is doing its job on a Query, just like we
ask to ignore subqueries by passing QTW_IGNORE_RC_SUBQUERIES to
query_tree_mutator()?  IOW, it doesn't look like anything critically
depends on the Vars referenced from joinaliasvars being translated at that
point in inheritance_planner(), but I may be missing something.

The attached patch to do the same, while didn't break any existing tests,
fixed the crash reported by OP.

Thoughts?

Thanks,
Amit
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 1c84a2cb28..4bdfe73d29 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1964,7 +1964,8 @@ adjust_appendrel_attrs(PlannerInfo *root, Node *node, int nappinfos,
 		newnode = query_tree_mutator((Query *) node,
 									 adjust_appendrel_attrs_mutator,
 									 (void *) &context,
-									 QTW_IGNORE_RC_SUBQUERIES);
+									 QTW_IGNORE_RC_SUBQUERIES |
+									 QTW_IGNORE_JOINALIASES);
 	for (cnt = 0; cnt < nappinfos; cnt++)
 	{
 		AppendRelInfo *appinfo = appinfos[cnt];


Re: [HACKERS] alter table doc fix

2017-10-22 Thread Amit Langote
On 2017/10/18 20:37, Alvaro Herrera wrote:
> Amit Langote wrote:
>> Hi.
>>
>> Noticed that an ALTER TABLE sub-command's name in Description (where it's
>> OWNER) differs from that in synopsis (where it's OWNER TO).  Attached
>> patch to make them match, if the difference is unintentional.
> 
> I agree -- pushed.

Thanks for committing.

> This paragraph
> 
>     The actions for identity columns (ADD GENERATED, SET etc., DROP
>     IDENTITY), as well as the actions TRIGGER, CLUSTER, OWNER, and
>     TABLESPACE never recurse to descendant tables; that is, they always
>     act as though ONLY were specified.  Adding a constraint recurses only
>     for CHECK constraints that are not marked NO INHERIT.
> 
> is a bit annoying, though I think it'd be worse if we "fix" it to be
> completely strict about the subcommands it refers to.

I didn't notice it in this paragraph before you pointed it out, but maybe,
as you say, there's not much point in trying to be strict here either.  If
we do fix it though, we might want to do something about TRIGGER and
CLUSTER too, because there are no sub-commands named just TRIGGER or
CLUSTER.

Thanks,
Amit





Re: [HACKERS] Block level parallel vacuum WIP

2017-10-22 Thread Amit Langote
On 2017/10/22 5:25, Thomas Munro wrote:
> On Sun, Oct 22, 2017 at 5:09 AM, Robert Haas wrote:
>> On Tue, Sep 19, 2017 at 3:31 AM, Masahiko Sawada wrote:
 Down at the bottom of the build log in the regression diffs file you can see:

 ! ERROR: cache lookup failed for relation 32893

 https://travis-ci.org/postgresql-cfbot/postgresql/builds/277165907
>>>
>>> Thank you for letting me know.
>>>
>>> Hmm, it's an interesting failure. I'll investigate it and post the new 
>>> patch.
>>
>> Did you ever find out what the cause of this problem was?
> 
> I wonder if it might have been the same issue that commit
> 19de0ab23ccba12567c18640f00b49f01471018d fixed a week or so later.

Hmm, 19de0ab23ccba seems to prevent the "cache lookup failed for relation
XXX" error in a different code path though (the code path handling manual
vacuum).  Not sure if the commit could have prevented that error being
emitted by DROP SCHEMA ... CASCADE, which seems to be what produced it in
this case.  Maybe I'm missing something though.

Thanks,
Amit





Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.

2017-10-17 Thread Amit Langote
On 2017/10/18 1:52, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
>> Robert Haas wrote:
>>> Implement table partitioning.
>>
>> Is it intentional that you can use ALTER TABLE OWNER TO on the parent
>> table, and that this does not recurse to modify the partitions' owners?
>> This doesn't seem to be mentioned in comments nor documentation, so it
>> seems an oversight to me.

Hmm, I would say that the new partitioning simply didn't modify the
behavior that existed for inheritance.

That said, I'm not sure if the lack of recursive application of ownership
change to descendant tables is unintentional.

> The alter table docs say that ONLY must be specified if one does not
> want to modify descendants, so I think this is a bug.

Just to clarify, if we do think of it as a bug, then it will apply to the
inheritance case as well, right?

Thanks,
Amit





[HACKERS] alter table doc fix

2017-10-17 Thread Amit Langote
Hi.

Noticed that an ALTER TABLE sub-command's name in Description (where it's
OWNER) differs from that in synopsis (where it's OWNER TO).  Attached
patch to make them match, if the difference is unintentional.

Thanks,
Amit
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 68393d70b4..b4b8dab911 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -713,7 +713,7 @@ ALTER TABLE [ IF EXISTS ] name
 
 
 
-OWNER
+OWNER TO
 
  
   This form changes the owner of the table, sequence, view, materialized view,


Re: [HACKERS] relkind check in DefineIndex

2017-10-15 Thread Amit Langote
On 2017/10/14 4:32, Robert Haas wrote:
> On Fri, Oct 13, 2017 at 12:38 PM, Alvaro Herrera wrote:
>> The relkind check in DefineIndex has grown into an ugly rats nest of
>> 'if' statements.  I propose to change it into a switch, as per the
>> attached.
> 
> wfm

+1

Thanks,
Amit





Re: [HACKERS] UPDATE of partition key

2017-10-15 Thread Amit Langote
Hi Amit.

On 2017/10/04 22:51, Amit Khandekar wrote:
> Main patch :
> update-partition-key_v20.patch

Guess you're already working on it but the patch needs a rebase.  A couple
of hunks in the patch to execMain.c and nodeModifyTable.c fail.

Meanwhile a few comments:

+void
+pull_child_partition_columns(Bitmapset **bitmapset,
+Relation rel,
+Relation parent)

Nitpick: don't we normally list the output argument(s) at the end?  Also,
"bitmapset" could be renamed to something that conveys what it contains?

+   if (partattno != 0)
+       child_keycols =
+           bms_add_member(child_keycols,
+                          partattno - FirstLowInvalidHeapAttributeNumber);
+   }
+   foreach(lc, partexprs)
+   {

Elsewhere (in quite a few places), we don't iterate over partexprs
separately like this, although I'm not saying it is bad, just different
from other places.

+ * the transition tuplestores can be built. Furthermore, if the transition
+ *  capture is happening for UPDATEd rows being moved to another partition due
+ *  partition-key change, then this function is called once when the row is
+ *  deleted (to capture OLD row), and once when the row is inserted to another
+ *  partition (to capture NEW row). This is done separately because DELETE and
+ *  INSERT happen on different tables.

Extra space at the beginning from the 2nd line onwards.

+   (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL)))

Is there some reason why a bitwise operator is used here?

+ * 'update_rri' has the UPDATE per-subplan result rels.

Could you explain why they are being received as input here?

+ * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects
+ * with one entry for every leaf partition (required to convert input tuple
+ * based on the root table's rowtype to a leaf partition's rowtype after
+ * tuple routing is done)

Could this be named leaf_tupconv_maps, maybe?  It perhaps makes clear that
they are maps needed for "tuple conversion".  And the other field holding
the reverse map as leaf_rev_tupconv_maps.  Either that or use underscores
to separate words, but then it gets too long I guess.


+   tuple = ConvertPartitionTupleSlot(mtstate,
+                                     mtstate->mt_perleaf_parentchild_maps[leaf_part_index],
+                                     ...

The 2nd line here seems to have gone over 80 characters.

ISTM, ConvertPartitionTupleSlot has a very specific job and a bit complex
interface.  I guess it could simply have the following interface:

static HeapTuple ConvertPartitionTuple(ModifyTableState *mtstate,
   HeapTuple tuple, bool is_update);

And figure out, based on the value of is_update, which map to use and
which slot to set *p_new_slot to (what is now "new_slot" argument).
You're getting mtstate here anyway, which contains all the information you
need here.  It seems better to make that (selecting which map and which
slot) part of the function's implementation if we're having this function
at all, imho.  Maybe I'm missing some details there, but my point still
remains that we should try to put more logic in that function instead of
it just do the mechanical tuple conversion.

+ * We have already checked partition constraints above, so skip them
+ * below.

How about: ", so skip checking here."?


ISTM, the newly introduced logic in ExecSetupTransitionCaptureState() to
try to reuse the per-subplan child-to-parent map as per-leaf
child-to-parent map could be simplified a bit.  I mean the following code:

+/*
+ * But for Updates, we can share the per-subplan maps with the per-leaf
+ * maps.
+ */
+update_rri_index = 0;
+update_rri = mtstate->resultRelInfo;
+if (mtstate->mt_nplans > 0)
+cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);

-/* Choose the right set of partitions */
-if (mtstate->mt_partition_dispatch_info != NULL)
+for (i = 0; i < numResultRelInfos; ++i)
+{


How about (pseudo-code):

 j = 0;
 for (i = 0; i < n_leaf_parts; i++)
 {
 if (j < n_subplans && leaf_rri[i]->oid == subplan_rri[j]->oid)
 {
 leaf_childparent_map[i] = subplan_childparent_map[j];
 j++;
 }
 else
 {
 leaf_childparent_map[i] = new map
 }
 }

I think the above would also be useful in ExecSetupPartitionTupleRouting()
where you've added similar code to try to reuse per-subplan ResultRelInfos.


In ExecInitModifyTable(), can we try to minimize the number of places
where update_tuple_routing_needed is being set.  Currently, it's being set
in 3 places:

+bool    update_tuple_routing_needed = node->part_cols_updated;

&

+/*
+ * If this is an UPDATE and a BEFORE UPDATE trigger is present,
we may
+ * need to do update tuple routing.
+ */
+if (resultRelInfo->ri_TrigDesc &&
+resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+   

Re: [HACKERS] v10 bottom-listed

2017-10-15 Thread Amit Langote
On 2017/10/13 22:58, Magnus Hagander wrote:
> On Fri, Oct 6, 2017 at 3:59 AM, Amit Langote <langote_amit...@lab.ntt.co.jp>
> wrote:
> 
>> On 2017/10/05 22:28, Erik Rijkers wrote:
>>> In the 'ftp' listing, v10 appears at the bottom:
>>>   https://www.postgresql.org/ftp/source/
>>>
>>> With all the other v10* directories at the top, we could get a lot of
>>> people installing wrong binaries...
>>>
>>> Maybe it can be fixed so that it appears at the top.
>>
>> Still see it at the bottom.  Maybe ping pgsql-www?
>>
> 
> Turns out there was already quite an ugly hack to deal with this, so I just
> made it a bit more ugly. Once the caches expire I believe sorting should be
> correct.

Saw that it's now listed all the way up, thanks! :)

Regards,
Amit





Re: [HACKERS] UPDATE of partition key

2017-10-12 Thread Amit Langote
On 2017/10/13 6:18, Robert Haas wrote:
> Is anybody still reviewing the main patch here?  (It would be good if
> the answer is "yes".)

I am going to try to look at the latest version over the weekend and early
next week.

Thanks,
Amit





Re: [HACKERS] Optimise default partition scanning while adding new partition

2017-10-12 Thread Amit Langote
On 2017/10/13 4:18, Robert Haas wrote:
> On Thu, Oct 5, 2017 at 9:29 PM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> Attached a patch to modify the INFO messages in check_default_allows_bound.
> 
> Committed.  However, I didn't see a reason to adopt the comment change
> you proposed, so I left that part out.

Okay, thanks.

Regards,
Amit





Re: [HACKERS] [POC] hash partitioning

2017-10-11 Thread Amit Langote
On 2017/09/30 1:53, Robert Haas wrote:
> On Thu, Sep 28, 2017 at 1:54 AM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> I looked into how satisfies_hash_partition() works and came up with an
>> idea that I think will make constraint exclusion work.  What if we emitted
>> the hash partition constraint in the following form instead:
>>
>> hash_partition_mod(hash_partition_hash(<key1-exthash>, <key2-exthash>),
>><modulus>) = <remainder>
>>
>> With that form, constraint exclusion seems to work as illustrated below:
>>
>> \d+ p0
>> <...>
>> Partition constraint:
>> (hash_partition_modulus(hash_partition_hash(hashint4extended(a,
>> '8816678312871386367'::bigint)), 4) = 0)
>>
>> -- note only p0 is scanned
>> explain select * from p where
>> hash_partition_modulus(hash_partition_hash(hashint4extended(a,
>> '8816678312871386367'::bigint)), 4) = 0;
> 
> What we actually want constraint exclusion to cover is SELECT * FROM p
> WHERE a = 525600;

I agree.

> As Amul says, nobody's going to enter a query in the form you have it
> here.  Life is too short to take time to put queries into bizarre
> forms.

Here too.  I was falsely thinking that satisfies_hash_partition() is
intended to be used for more than just enforcing the partition constraint
when data is directly inserted into a hash partition, or more precisely to
be used in the CHECK constraint of the table that is to be attached as a
hash partition.  Now, we ask users to add such a constraint to avoid the
constraint validation scan, because the system knows how to infer from the
constraint that the partition constraint is satisfied.  I observed however
that, unlike range and list partitioning, the hash partition's constraint
could only ever be implied because of structural equality (equal()'ness)
of the existing constraint expression and the partition constraint
expression.  For example, a more restrictive range or list qual implies
the partition constraint, but it requires performing btree operator based
proof.  Such a proof is impossible with the chosen structure of the hash
partitioning constraint, but it seems that that's OK.  That is, it's OK to
ask users to add the exact constraint (matching modulus and remainder
values in the call to satisfies_hash_partition() specified in the CHECK
constraint) to avoid the validation scan.
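
For example, with modulus 4 and remainder 0, something like the following
should let ATTACH PARTITION skip the validation scan (assuming the
satisfies_hash_partition() signature as committed, which takes the parent
table's OID first):

-- p is hash-partitioned on (a); p0 holds pre-loaded data
ALTER TABLE p0 ADD CONSTRAINT p0_hash_check
  CHECK (satisfies_hash_partition('p'::regclass, 4, 0, a));

-- the CHECK constraint matches the partition constraint, so no scan
ALTER TABLE p ATTACH PARTITION p0 FOR VALUES WITH (MODULUS 4, REMAINDER 0);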

Thanks,
Amit





Re: [HACKERS] v10 bottom-listed

2017-10-05 Thread Amit Langote
On 2017/10/05 22:28, Erik Rijkers wrote:
> In the 'ftp' listing, v10 appears at the bottom:
>   https://www.postgresql.org/ftp/source/
> 
> With all the other v10* directories at the top, we could get a lot of
> people installing wrong binaries...
> 
> Maybe it can be fixed so that it appears at the top.

Still see it at the bottom.  Maybe ping pgsql-www?

Thanks,
Amit





Re: [HACKERS] Optimise default partition scanning while adding new partition

2017-10-05 Thread Amit Langote
On 2017/10/06 2:25, Robert Haas wrote:
> On Tue, Sep 26, 2017 at 4:27 AM, Amit Langote wrote:
>> I guess we don't need to squash, as they could be seen as implementing
>> different features. Reordering the patches helps though.  So, apply them
>> in this order:
>>
>> 1. My patch to teach ValidatePartitionConstraints() to skip scanning
>>a partition's own partitions, which optimizes ATTACH PARTITION
>>command's partition constraint validation scan (this also covers the
>>case of scanning the default partition to validate its updated
>>constraint when attaching a new partition)
>>
>> 2. Jeevan's patch to teach check_default_allows_bound() to skip scanning
>>the default partition's own partitions, which covers the case of
>>scanning the default partition to validate its updated constraint when
>>adding a new partition using CREATE TABLE
>>
>> Attached 0001 and 0002 are ordered that way.
> 
> OK, I pushed all of this, spread out over 3 commits.  I reworked the
> test cases not to be entangled with the existing test cases, and also
> to test both of these highly-related features together.  Hopefully you
> like the result.

Thanks for committing.

I noticed that 6476b26115f3 doesn't use the same INFO message as
14f67a8ee28.  You can notice in the following example that the message
emitted (that the default partition scan is skipped) is different when a
new partition is added with CREATE TABLE and with ATTACH PARTITION,
respectively.

create table foo (a int, b char) partition by list (a);

-- default partition
create table food partition of foo default partition by list (b);

-- default partition's partition with the check constraint
create table fooda partition of food (check (not (a is not null and a = 1
and a = 2))) for values in ('a');

create table foo1 partition of foo for values in (1);
INFO:  partition constraint for table "fooda" is implied by existing
constraints
CREATE TABLE

create table foo2 (like foo);
alter table foo attach partition foo2 for values in (2);
INFO:  updated partition constraint for default partition "fooda" is
implied by existing constraints
ALTER TABLE

That's because it appears that you applied Jeevan's original patch.  The
revised version of his patch that I had last posted contained the new
message.  Actually, the revised patch addressed not only the case where
the default partition's child's scan is skipped, but also the case where
the scan of default partition itself is skipped.  As things stand now:

alter table food add constraint skip_check check (not (a is not null and
(a = any (array[1, 2]))));

create table foo1 partition of foo for values in (1);
INFO:  partition constraint for table "food" is implied by existing
constraints
CREATE TABLE

create table foo2 (like foo);
CREATE TABLE

alter table foo attach partition foo2 for values in (2);
INFO:  updated partition constraint for default partition "food" is
implied by existing constraints
ALTER TABLE


Attached a patch to modify the INFO messages in check_default_allows_bound.

Thanks,
Amit
From 1bd3b03bcd1f8af6cec3a22c40e96c9cde46e705 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Fri, 6 Oct 2017 10:23:24 +0900
Subject: [PATCH] Like c31e9d4bafd, but for check_default_allows_bound

---
 src/backend/catalog/partition.c   | 10 +++---
 src/test/regress/expected/alter_table.out |  4 ++--
 2 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 3a8306a055..1459fba778 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -920,7 +920,7 @@ check_default_allows_bound(Relation parent, Relation 
default_rel,
if (PartConstraintImpliedByRelConstraint(default_rel, 
def_part_constraints))
{
ereport(INFO,
-   (errmsg("partition constraint for table \"%s\" 
is implied by existing constraints",
+   (errmsg("updated partition constraint for 
default partition \"%s\" is implied by existing constraints",

RelationGetRelationName(default_rel))));
return;
}
@@ -956,16 +956,12 @@ check_default_allows_bound(Relation parent, Relation 
default_rel,
{
part_rel = heap_open(part_relid, NoLock);
 
-   /*
-* If the partition constraints on default partition 
child imply
-* that it will not contain any row that would belong 
to the new
-* partition, we can avoid scanning the child table.
-*/
+   /* Can we skip scanning this part_rel? */
 

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-03 Thread Amit Langote
On 2017/10/04 4:27, Robert Haas wrote:
> On Tue, Oct 3, 2017 at 8:57 AM, Ashutosh Bapat
>  wrote:
>>> Regarding nomenclature and my previous griping about wisdom, I was
>>> wondering about just calling this a "partition join" like you have in
>>> the regression test.  So the GUC would be enable_partition_join, you'd
>>> have generate_partition_join_paths(), etc.  Basically just delete
>>> "wise" throughout.
>>
>> Partition-wise join is standard term used in literature and in
>> documentation of other popular DBMSes, so partition_wise makes more
>> sense. But I am fine with partition_join as well. Do you want it
>> partition_join or partitionjoin like enable_mergejoin/enable_hashjoin
>> etc.?
> 
> Well, you're making me have second thoughts.  It's really just that
> partition_wise looks a little awkward to me, and maybe that's not
> enough reason to change anything.  I suppose if I commit it this way
> and somebody really hates it, it can always be changed later.  We're
> not getting a lot of input from anyone else at the moment.

FWIW, the name enable_partition_join seems enough to convey the core
feature, that is, I see "_wise" as redundant, even though I'm now quite
used to seeing "_wise" in the emails here and saying it out loud every now
and then.  Ashutosh may have a point though that users coming from other
databases might miss the "_wise". :)

Thanks,
Amit





Re: [HACKERS] path toward faster partition pruning

2017-10-02 Thread Amit Langote
Hi David.

Thanks a lot for your review comments and sorry it took me a while to reply.

On 2017/09/28 18:16, David Rowley wrote:
> On 27 September 2017 at 14:22, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> - 0001 includes refactoring that Dilip proposed upthread [1] (added him as
>>   an author).  I slightly tweaked his patch -- renamed the function
>>   get_matching_clause to match_clauses_to_partkey, similar to
>>   match_clauses_to_index.
> 
> Hi Amit,
> 
> I've made a pass over the 0001 patch while trying to get myself up to
> speed with the various developments that are going on in partitioning
> right now.
> 
> These are just my thoughts from reading over the patch. I understand
> that there's quite a bit up in the air right now about how all this is
> going to work, but I have some thoughts about that too, which I
> wouldn't mind some feedback on as my line of thought may be off.
> 
> Anyway, I will start with some small typos that I noticed while reading:
> 
> partition.c:
> 
> + * telling what kind of NullTest has been applies to the corresponding
> 
> "applies" should be "applied"

Will fix.

> plancat.c:
> 
> * might've occurred to satisfy the query.  Rest of the fields are set in
> 
> "Rest of the" should be "The remaining"

Will fix.

> And onto more serious stuff:
> 
> allpaths.c:
> 
> get_rel_partitions()
> 
> I wonder if this function does not belong in partition.c. I'd have
> expected a function to exist per partition type; RANGE and LIST, I'd
> imagine, should have their own function in partition.c to eliminate
> partitions
> which cannot possibly match, and return the remainder. I see people
> speaking of HASH partitioning, but we might one day also want
> something like RANDOM or ROUNDROBIN (I'm not really kidding, imagine
> routing records to be processed into foreign tables where you never
> need to query them again). It would be good if we could easily expand
> this list and not have to touch files all over the optimizer to do
> that. Of course, there would be other work to do in the executor to
> implement any new partitioning method too.

I think there will have to be at least some code in the optimizer.  That
is, the code to match the query to the table's partition keys and using
the constant values therein to then look up the table's partitions.  More
importantly, once the partitions (viz. their offsets in the table's
partition descriptor) have been looked up by partition.c, we must be able
to then map them to their planner data structures viz. their
AppendRelInfo, and subsequently RelOptInfo.  This last part will have to
be in the optimizer.  In fact, that was the role of get_rel_partitions in
the initial versions of the patch, when neither of the code for matching
keys and for pruning using constants was implemented.

One may argue that the first part, that is, matching clauses to match to
the partition key and subsequent lookup of partitions using constants
could both be implemented in partition.c, but it seems better to me to
keep at least the part of matching clauses to the partition keys within
the planner (just like matching clauses to indexes is done by the
planner).  Looking up partitions using constants cannot be done outside
partition.c anyway, so that's something we have to implement there.
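
To be a bit more concrete, the matching step could look roughly like the
sketch below (simplified, not the actual patch code; only the "key op
Const" shape is handled and commuting is not shown):

#include "postgres.h"
#include "nodes/nodes.h"
#include "nodes/primnodes.h"
#include "nodes/pg_list.h"
#include "utils/lsyscache.h"

/*
 * Does this OpExpr constrain the given partition key expression?  The
 * operator must belong to the partition key's operator family for its
 * semantics to be usable when looking up partitions.
 */
static bool
clause_matches_partkey(Node *partkey, OpExpr *opclause, Oid partopfamily)
{
    Node       *leftop = (Node *) linitial(opclause->args);
    Node       *rightop = (Node *) lsecond(opclause->args);

    return equal(leftop, partkey) &&
        IsA(rightop, Const) &&
        op_in_opfamily(opclause->opno, partopfamily);
}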

> I know there's mention of it somewhere about get_rel_partitions() not
> being so smart about WHERE partkey > 20 AND partkey > 10, the code
> does not understand what's more restrictive. I think you could
> probably follow the same rules here that are done in
> eval_const_expressions(). Over there I see that evaluate_function()
> will call anything that's not marked as volatile.

Hmm, unless I've missed it, I don't see in evaluate_function() anything
about determining which clause is more restrictive.  AFAIK, such
determination depends on the btree operator class semantics (at least in
the most common cases where, say, ">" means greater than in a sense that
btree code uses it), so I was planning to handle it the way btree code
handles it in _bt_preprocess_keys().  In fact, the existing code in
predtest.c, which makes decisions of the similar vein also relies on btree
semantics.  It's OK to do so, because partitioning methods that exist
today and for which we'd like to have such smarts use btree semantics to
partition the data.  Also, we won't need to optimize such cases for hash
partitioning anyway.
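
The kind of helper I have in mind is sketched below (hypothetical,
assuming both constants come from '<'/'<=' quals on the same key column
and 'cmpfn' is that column's btree comparison support function, e.g.
btint4cmp):

#include "postgres.h"
#include "fmgr.h"

/*
 * Of two upper-bound constants for the same partition key column, keep
 * the more restrictive (smaller) one.  A btree support proc 1 returns
 * a negative, zero, or positive int32.
 */
static Datum
more_restrictive_upper_bound(FmgrInfo *cmpfn, Oid collation,
                             Datum a, Datum b)
{
    int32       cmp;

    cmp = DatumGetInt32(FunctionCall2Coll(cmpfn, collation, a, b));

    return (cmp <= 0) ? a : b;
}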

> I'd imagine, for
> each partition key, you'd want to store a Datum with the minimum and
> maximum possible value based on the quals processed. If either the
> minimum or maximum is still set to NULL, then it's unbounded at that
> end. If you encounter partkey = Const, then minimum and maximum can be
> set to the value of that Const.

Re: [HACKERS] Commitfest 201709 is now closed

2017-10-02 Thread Amit Langote
On 2017/10/03 7:16, Michael Paquier wrote:
> On Tue, Oct 3, 2017 at 1:23 AM, Alexander Korotkov
>  wrote:
>> On Mon, Oct 2, 2017 at 6:57 PM, Tom Lane  wrote:
>>>
>>> Daniel Gustafsson  writes:
>>>> Thanks to everyone who participated, and to everyone who has responded
>>>> to my nagging via the CF app email function. This is clearly an awesome
>>>> community.
>>>
>>> And thanks to you for your hard work as CFM!  That's tedious and
>>> largely thankless work, but it's needed to keep things moving.
>>
>> +1,
>> Thank you very much, Daniel!  It was a pleasure working with you at
>> commitfest.
> 
> Thanks for doing a bunch of work, Daniel. This is a difficult task.

+1.  Thank you, Daniel!

Regards,
Amit





Re: [HACKERS] path toward faster partition pruning

2017-10-01 Thread Amit Langote
On 2017/09/30 1:28, Robert Haas wrote:
> On Thu, Sep 28, 2017 at 5:16 AM, David Rowley
>  wrote:
>> I'd imagine, for
>> each partition key, you'd want to store a Datum with the minimum and
>> maximum possible value based on the quals processed. If either the
>> minimum or maximum is still set to NULL, then it's unbounded at that
>> end. If you encounter partkey = Const, then minimum and maximum can be
>> set to the value of that Const. From there you can likely ignore any
>> other quals for that partition key, as if the query did contain
>> another qual with partkey = SomeOtherConst, then that would have
>> become a gating qual during the constant folding process. This way if
>> the user had written WHERE partkey >= 1 AND partkey <= 1 the
>> evaluation would end up the same as if they'd written WHERE partkey =
>> 1 as the minimum and maximum would be the same value in both cases,
>> and when those two values are the same then it would mean just one
>> theoretical binary search on a partition range to find the correct
>> partition instead of two.
> 
> I have not looked at the code submitted here in detail yet but I do
> think we should try to avoid wasting cycles in the
> presumably-quite-common case where equality is being tested.  The
> whole idea of thinking of this as minimum/maximum seems like it might
> be off precisely for that reason.

I agree.  Equality checks are going to be common enough to warrant them to
be handled specially, instead of implementing equality-pruning on top of
min/max framework.
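
Something like the following shape (made-up names and a simplified view
of the bound data structure, not actual code) is what I have in mind
for the all-equality case:

#include "postgres.h"
#include "catalog/partition.h"

/*
 * With equality constants for all partition key columns, a single
 * binary search over the sorted bound datums suffices, instead of the
 * two searches a [min, max] range would need.  'cmp' compares a stored
 * bound against the query's keys; the partition index or -1 is
 * returned.
 */
static int
partition_index_for_eqkeys(PartitionBoundInfo boundinfo, int nbounds,
                           int (*cmp) (Datum *bound, Datum *keys),
                           Datum *keys)
{
    int         lo = 0,
                hi = nbounds - 1;

    while (lo <= hi)
    {
        int         mid = lo + (hi - lo) / 2;
        int         c = cmp(boundinfo->datums[mid], keys);

        if (c == 0)
            return boundinfo->indexes[mid];
        else if (c < 0)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;
}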

Thanks,
Amit





Re: [HACKERS] Partitions: \d vs \d+

2017-09-28 Thread Amit Langote
On 2017/09/28 22:19, Maksim Milyutin wrote:
> I also noticed ambiguity in printing "No partition constraint" in
> non-verbose mode and "Partition constraint:..." in verbose one for
> partition tables regardless of the type of partition.
> Attached small patch removes any output about partition constraint in
> non-verbose mode.

Patch looks good.

So, we should be looking at partconstraintdef only when verbose is true,
because that's only when we set it to a valid value.  Now, if
partconstraintdef is NULL even after verbose is true, that means backend
returned that there exists no constraint for that partition, which I
thought would be true for a default partition (because the commit that
introduced default partitions also introduced "No partition constraint"),
but it's really not.

For example, \d and \d+ show contradictory outputs for a default partition.

create table p (a int) partition by list (a);
create table p1 partition of p for values in (1);
create table pd partition of p default;

\d pd
                Table "public.pd"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
Partition of: p DEFAULT
No partition constraint

\d+ pd
                                   Table "public.pd"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 a      | integer |           |          |         | plain   |              |
Partition of: p DEFAULT
Partition constraint: (NOT ((a IS NOT NULL) AND (a = ANY (ARRAY[1]))))


Perhaps, there is no case when "No partition constraint" should be output,
but I may be missing something.

Thanks,
Amit





Re: [HACKERS] Partitions: \d vs \d+

2017-09-28 Thread Amit Langote
On 2017/09/28 22:29, Jesper Pedersen wrote:
> On 09/28/2017 09:19 AM, Maksim Milyutin wrote:
>>> E.g. "No partition constraint" vs. "Partition constraint:
>>> satisfies_hash_partition(...)".
>>
>> I also noticed ambiguity in printing "No partition constraint" in
>> non-verbose mode and "Partition constraint:..." in verbose one for
>> partition tables regardless of the type of partition.
>> Attached small patch removes any output about partition constraint in
>> non-verbose mode.
>>
> 
> Yeah, that could be one way.
> 
> It should likely be backported to REL_10_STABLE, so the question is if we
> are too late in the release cycle to change that output.

I think the default partition commit [1] introduced some change around
that code, so the behavior is new in 11dev and I think it needs a fix like
the one that Maksim proposed.

When I check with REL_10_STABLE tip, I find things to be normal:

create table p (a int) partition by list (a);
create table p1 partition of p for values in (1);

\d p1
                Table "public.p1"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
Partition of: p FOR VALUES IN (1)

\d+ p1
                                   Table "public.p1"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 a      | integer |           |          |         | plain   |              |
Partition of: p FOR VALUES IN (1)
Partition constraint: ((a IS NOT NULL) AND (a = ANY (ARRAY[1])))

Thanks,
Amit

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blobdiff;f=src/bin/psql/describe.c;h=d22ec68431e231d9c781c2256a6030d66e0fd09d;hp=6fb9bdd063583fb8b60ad282aeb5256df67942e4;hb=6f6b99d1335be8ea1b74581fc489a97b109dd08a;hpb=2cf15ec8b1cb29bea149559700566a21a790b6d3





Re: [HACKERS] path toward faster partition pruning

2017-09-28 Thread Amit Langote
On 2017/09/28 13:58, Dilip Kumar wrote:
> On Wed, Sep 27, 2017 at 6:52 AM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
> 
> I was looking into the latest patch set, seems like we can reuse some
> more code between this path and runtime pruning[1]
> 
> + foreach(lc1, matchedclauses[i])
> + {
> + Expr   *clause = lfirst(lc1);
> + Const  *rightop = (Const *) get_rightop(clause);
> + Oid opno = ((OpExpr *) clause)->opno,
> + opfamily = rel->part_scheme->partopfamily[i];
> + StrategyNumber strategy;
> +
> + strategy = get_op_opfamily_strategy(opno, opfamily);
> + switch (strategy)
> + {
> + case BTLessStrategyNumber:
> + case BTLessEqualStrategyNumber:
> + if (need_next_max)
> + {
> + maxkeys[i] = rightop->constvalue;
> + if (!maxkey_set[i])
> + n_maxkeys++;
> + maxkey_set[i] = true;
> + max_incl = (strategy == BTLessEqualStrategyNumber);
> + }
> + if (strategy == BTLessStrategyNumber)
> + need_next_max = false;
> 
> I think the above logic is common between this patch and the runtime
> pruning.  I think we can make
> a reusable function.  Here we are preparing minkey and maxkey of Datum
> because we can directly fetch rightop->constvalue whereas for runtime
> pruning we are making minkeys and maxkeys of Expr because during
> planning time we don't have the values for the Param.  I think we can
> always make these minkey, maxkey array of Expr and later those can be
> processed in whatever way we want it.  So this path will fetch the
> constval out of Expr and runtime pruning will Eval that expression at
> runtime.

I think that makes sense.  In fact we could even move the minkey/maxkey
collection code to match_clauses_to_partkey() itself.  No need for a
different function and worrying about defining a separate interface for
the same.  We match clauses exactly because we want to extract the
constant or param values out of them.  No need to do the two activities
independently and in different places.
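
A sketch of the shared shape could be the following (hypothetical
helper; the real interface will no doubt differ):

#include "postgres.h"
#include "executor/executor.h"
#include "nodes/primnodes.h"

/*
 * Produce the Datum for one bounding key.  At plan time the Expr is a
 * Const and its value can be used directly; at run time (e.g. a
 * Param), it must be evaluated once the parameter values are known.
 */
static Datum
bounding_key_value(Expr *key, ExprContext *econtext, bool *isnull)
{
    if (IsA(key, Const))
    {
        Const      *c = (Const *) key;

        *isnull = c->constisnull;
        return c->constvalue;
    }

    return ExecEvalExprSwitchContext(ExecInitExpr(key, NULL),
                                     econtext, isnull);
}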

> Does this make sense or it will cause one level of extra processing
> for this path i.e converting the Expr array to CONST array?

Hm, it's not such a big cost to pay I'd think.

I will update the planner patch accordingly.

Thanks,
Amit





Re: [HACKERS] PartitionSchemaData & partcollation (Re: [COMMITTERS] pgsql: Associate partitioning information with each RelOptInfo.)

2017-09-28 Thread Amit Langote
On 2017/09/28 16:13, Ashutosh Bapat wrote:
> On Thu, Sep 28, 2017 at 11:47 AM, Amit Langote wrote:
>> On 2017/09/21 12:42, Robert Haas wrote:
>>> Associate partitioning information with each RelOptInfo.
>>>
>>> This is not used for anything yet, but it is necessary infrastructure
>>> for partition-wise join and for partition pruning without constraint
>>> exclusion.
>>>
>>> Ashutosh Bapat, reviewed by Amit Langote and with quite a few changes,
>>> mostly cosmetic, by me.  Additional review and testing of this patch
>>> series by Antonin Houska, Amit Khandekar, Rafia Sabih, Rajkumar
>>> Raghuwanshi, Thomas Munro, and Dilip Kumar.
>>
>> I noticed that this commit does not add partcollation field to
>> PartitionSchemeData, while it adds parttypcoll.  I think it'd be necessary
>> to have partcollation too, because partitioning would have used the same,
>> not parttypcoll.  For example, see the following code in
>> partition_rbound_datum_cmp:
>>
>> cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[i],
>>  key->partcollation[i],
>>  rb_datums[i],
>>  tuple_datums[i]));
>>
>> So, it would be wrong to use parttypcoll, if we are to use the collation
>> to match a clause with the partition bounds when doing partition-pruning.
>> Concretely, a clause's inputcollid should match partcollation for the
>> corresponding column, not the column's parttypcoll.
>>
>> Attached is a patch that adds the same.  I first thought of including it
>> in the partition-pruning patch set [1], but thought we could independently
>> fix this.
>>
> 
> I think PartitionSchemeData structure will grow as we need more
> information about partition key for various things. E.g. partsupfunc
> is not part of this structure right now, but it would be required to
> compare partition bound datums. Similarly partcollation. Please add
> this to partition pruning patchset. May be parttypcoll won't be used
> at all and we may want to remove it altogether.

Okay, I will post it with the partition-pruning patch set.

Thanks,
Amit





Re: [HACKERS] Use of RangeVar for partitioned tables in autovacuum

2017-09-28 Thread Amit Langote
Thanks Michael for working on this.

On 2017/09/27 11:28, Michael Paquier wrote:
> Hi all,
> 
> I have been looking more closely at the problem in $subject, that I
> have mentioned a couple of times, like here:
> https://www.postgresql.org/message-id/cab7npqqa37oune_rjzpmwc4exqalx9f27-ma_-rsfl_3mj+...@mail.gmail.com
> 
> As of HEAD, the RangeVar defined in calls of vacuum() is used for
> error reporting, only in two cases: for autoanalyze and autovacuum
> when reporting to users that a lock cannot be taken on a relation. The
> thing is that autovacuum has the good idea to call vacuum() on only
> one relation at the same time, with always a relation OID and a
> RangeVar defined, so the code currently on HEAD is basically fine with
> its error reporting, because get_rel_oids() will always work on the
> relation OID with its RangeVar, and because code paths with manual
> VACUUMs don't use the RangeVar for any reports.

Yeah, autovacuum_do_vac_analyze will never pass partitioned table OIDs to
vacuum, for example, because they're not RELKIND_RELATION relations.

> While v10 is safe, I think that this is wrong in the long-term and
> that may be a cause of bugs if people begin to generate reports for
> manual VACUUMs, which is plausible in my opinion. The patch attached,
> is what one solution would look like if we would like to make things
> more robust as long as manual VACUUM commands can only specify one
> relation at the same time. I have found that tracking the parent OID
> by tweaking a bit get_rel_oids() was the most elegant solution.

The patch basically looks good to me, FWIW.

> Please
> note that this range of problems is something that I think is better
> solved with the infrastructure that support for VACUUM with multiple
> relations includes (last version here
> https://www.postgresql.org/message-id/766556dd-aa3c-42f7-acf4-5dc97f41f...@amazon.com).
> So I don't think that the patch attached should be integrated, still I
> am attaching it to satisfy the curiosity of anybody looking at this
> message.

Yeah, after looking at the code a bit, it seems that the problem won't
really occur for the case we're trying to apply this patch for.

> In conclusion, I think that the open item of $subject should be
> removed from the list, and we should try to get the multi-VACUUM patch
> in to cover any future problems. I'll do so if there are no
> objections.

+1 to this, taking the previous +1 back [1]. :)

> One comment on top of vacuum() is wrong by the way: in the case of a
> manual VACUUM or ANALYZE, a NULL RangeVar is used if no relation is
> specified in the command.

Yep, but it doesn't ever get accessed per our observations.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/8d810dd9-5f64-a5f3-c016-a81f05528fa8%40lab.ntt.co.jp





Re: [HACKERS] PartitionSchemaData & partcollation (Re: [COMMITTERS] pgsql: Associate partitioning information with each RelOptInfo.)

2017-09-28 Thread Amit Langote
Sorry, I meant to say PartitionSchem"e"Data in subject.

Thanks,
Amit





[HACKERS] PartitionSchemaData & partcollation (Re: [COMMITTERS] pgsql: Associate partitioning information with each RelOptInfo.)

2017-09-28 Thread Amit Langote
On 2017/09/21 12:42, Robert Haas wrote:
> Associate partitioning information with each RelOptInfo.
> 
> This is not used for anything yet, but it is necessary infrastructure
> for partition-wise join and for partition pruning without constraint
> exclusion.
> 
> Ashutosh Bapat, reviewed by Amit Langote and with quite a few changes,
> mostly cosmetic, by me.  Additional review and testing of this patch
> series by Antonin Houska, Amit Khandekar, Rafia Sabih, Rajkumar
> Raghuwanshi, Thomas Munro, and Dilip Kumar.

I noticed that this commit does not add partcollation field to
PartitionSchemeData, while it adds parttypcoll.  I think it'd be necessary
to have partcollation too, because partitioning would have used the same,
not parttypcoll.  For example, see the following code in
partition_rbound_datum_cmp:

cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[i],
 key->partcollation[i],
 rb_datums[i],
 tuple_datums[i]));

So, it would be wrong to use parttypcoll, if we are to use the collation
to match a clause with the partition bounds when doing partition-pruning.
Concretely, a clause's inputcollid should match partcollation for the
corresponding column, not the column's parttypcoll.
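
Concretely, the check I have in mind for the pruning code is of roughly
this shape (a sketch; 'keyno' identifies the partition key column that
'opclause' is being matched against):

/*
 * A clause is usable for pruning on this key column only if its
 * inputcollid agrees with the collation actually used to partition,
 * i.e. partcollation, not parttypcoll.
 */
if (OidIsValid(opclause->inputcollid) &&
    opclause->inputcollid != part_scheme->partcollation[keyno])
    continue;           /* cannot use this clause for pruning */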

Attached is a patch that adds the same.  I first thought of including it
in the partition-pruning patch set [1], but thought we could independently
fix this.

Thoughts?

Thanks,
Amit

[1] https://commitfest.postgresql.org/14/1272/
From 4e25d503a00c5fb42919e73aae037eacb0164af6 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Wed, 27 Sep 2017 15:49:50 +0900
Subject: [PATCH 1/5] Add partcollation field to PartitionSchemeData

It copies PartitionKeyData.partcollation.  We need that in addition
to parttypcoll.
---
 src/backend/optimizer/util/plancat.c | 1 +
 src/include/nodes/relation.h | 7 ++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/util/plancat.c 
b/src/backend/optimizer/util/plancat.c
index cac46bedf9..1273f28069 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1897,6 +1897,7 @@ find_partition_scheme(PlannerInfo *root, Relation 
relation)
part_scheme->partopfamily = partkey->partopfamily;
part_scheme->partopcintype = partkey->partopcintype;
part_scheme->parttypcoll = partkey->parttypcoll;
+   part_scheme->partcollation = partkey->partcollation;
part_scheme->parttyplen = partkey->parttyplen;
part_scheme->parttypbyval = partkey->parttypbyval;
 
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 48e6012f7f..2adefd0873 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -342,6 +342,10 @@ typedef struct PlannerInfo
  * partition bounds. Since partition key data types and the opclass declared
  * input data types are expected to be binary compatible (per ResolveOpClass),
  * both of those should have same byval and length properties.
+ *
+ * Since partitioning might be using a collation for a given partition key
+ * column that is not same as the collation implied by column's type, store
+ * the same separately.
  */
 typedef struct PartitionSchemeData
 {
@@ -349,7 +353,8 @@ typedef struct PartitionSchemeData
int16   partnatts;  /* number of partition 
attributes */
Oid*partopfamily;   /* OIDs of operator families */
Oid*partopcintype;  /* OIDs of opclass declared 
input data types */
-   Oid*parttypcoll;/* OIDs of collations of 
partition keys. */
+   Oid*parttypcoll;/* OIDs of partition key type 
collation. */
+   Oid*partcollation;  /* OIDs of partitioning 
collation */
 
/* Cached information about partition key data types. */
int16  *parttyplen;
-- 
2.11.0




Re: [HACKERS] [POC] hash partitioning

2017-09-27 Thread Amit Langote
On 2017/09/27 22:41, Jesper Pedersen wrote:
> On 09/27/2017 03:05 AM, amul sul wrote:
>>>> Attached rebased patch, thanks.


>>> While reading through the patch I thought it would be better to keep
>>> MODULUS and REMAINDER in caps, if CREATE TABLE was in caps too in order to
>>> highlight that these are "keywords" for hash partition.
>>>
>>> Also updated some of the documentation.
>>>
>>>
>> Thanks a lot for the patch, included in the attached version.
>>
> 
> Thank you.
> 
> Based on [1] I have moved the patch to "Ready for Committer".

Thanks a lot Amul for working on this.  Like Jesper said, the patch looks
pretty good overall.  I was looking at the latest version with intent to
study certain things about hash partitioning the way patch implements it,
during which I noticed some things.

+  The modulus must be a positive integer, and the remainder must a

must be a

+  suppose you have a hash-partitioned table with 8 children, each of
which
+  has modulus 8, but find it necessary to increase the number of
partitions
+  to 16.

Might it be a good idea to say 8 "partitions" instead of "children" in the
first sentence?

+  each modulus-8 partition until none remain.  While this may still
involve
+  a large amount of data movement at each step, it is still better than
+  having to create a whole new table and move all the data at once.
+ 
+

I read the paragraph that ends with the above text and started wondering
if the example to redistribute data in hash partitions by detaching and
attaching with new modulus/remainder could be illustrated with an example?
Maybe in the examples section of the ALTER TABLE page?

+  Since hash operator class provide only equality, not ordering,
collation

Either "Since hash operator classes provide" or "Since hash operator class
provides"

Other than the above points, patch looks good.


By the way, I noticed a couple of things about hash partition constraints:

1. In get_qual_for_hash(), using
get_fn_expr_rettype(&key->partsupfunc[i]), which returns InvalidOid for
the lack of fn_expr being set to non-NULL value, causes funcrettype of the
FuncExpr being generated for hashing partition key columns to be set to
InvalidOid, which I think is wrong.  That is, the following if condition
in get_fn_expr_rettype() is always satisfied:

if (!flinfo || !flinfo->fn_expr)
return InvalidOid;

I think we could use get_func_rettype(key->partsupfunc[i].fn_oid)
instead.  Attached patch
hash-v21-set-funcexpr-funcrettype-correctly.patch, which applies on top
v21 of your patch.

2. It seems that the reason constraint exclusion doesn't work with hash
partitions as implemented by the patch is that predtest.c:
operator_predicate_proof() returns false even without looking into the
hash partition constraint, which is of the following form:

satisfies_hash_partition(<modulus>, <remainder>, <hashed-key-values>, ...)

because the above constraint expression doesn't translate into a binary
opclause (an OpExpr), which operator_predicate_proof() knows how to work
with.  So, false is returned at the beginning of that function by the
following code:

if (!is_opclause(predicate))
return false;

For example,

create table p (a int) partition by hash (a);
create table p0 partition of p for values with (modulus 4, remainder 0);
create table p1 partition of p for values with (modulus 4, remainder 1);
\d+ p0
<...>
Partition constraint: satisfies_hash_partition(4, 0, hashint4extended(a,
'8816678312871386367'::bigint))

-- both p0 and p1 scanned
explain select * from p where satisfies_hash_partition(4, 0,
hashint4extended(a, '8816678312871386367'::bigint));
                             QUERY PLAN
---------------------------------------------------------------------
 Append  (cost=0.00..96.50 rows=1700 width=4)
   ->  Seq Scan on p0  (cost=0.00..48.25 rows=850 width=4)
 Filter: satisfies_hash_partition(4, 0, hashint4extended(a,
'8816678312871386367'::bigint))
   ->  Seq Scan on p1  (cost=0.00..48.25 rows=850 width=4)
 Filter: satisfies_hash_partition(4, 0, hashint4extended(a,
'8816678312871386367'::bigint))
(5 rows)

-- both p0 and p1 scanned
explain select * from p where satisfies_hash_partition(4, 1,
hashint4extended(a, '8816678312871386367'::bigint));
                             QUERY PLAN
---------------------------------------------------------------------
 Append  (cost=0.00..96.50 rows=1700 width=4)
   ->  Seq Scan on p0  (cost=0.00..48.25 rows=850 width=4)
 Filter: satisfies_hash_partition(4, 1, hashint4extended(a,
'8816678312871386367'::bigint))
   ->  Seq Scan on p1  (cost=0.00..48.25 rows=850 width=4)
 Filter: satisfies_hash_partition(4, 1, hashint4extended(a,
'8816678312871386367'::bigint))
(5 rows)


I looked into how satisfies_hash_partition() works and came up with an
idea that I think will make constraint exclusion work.  What if we emitted
the hash partition constraint in the following form instead:

hash_partition_mod(hash_partition_hash(key1-exthash, key2-exthash),
<modulus>) = <remainder>

Re: [HACKERS] path toward faster partition pruning

2017-09-26 Thread Amit Langote
Hi Jesper.

Firstly, thanks for looking at the patch.

On 2017/09/26 22:00, Jesper Pedersen wrote:
> Hi Amit,
> 
> On 09/15/2017 04:50 AM, Amit Langote wrote:
>> On 2017/09/15 11:16, Amit Langote wrote:
>>> I will post rebased patches later today, although I think the overall
>>> design of the patch on the planner side of things is not quite there yet.
>>> Of course, your and others' feedback is greatly welcome.
>>
>> Rebased patches attached.  Because Dilip complained earlier today about
>> clauses of the form (const op var) not causing partition-pruning, I've
>> added code to commute the clause where it is required.  Some other
>> previously mentioned limitations remain -- no handling of OR clauses, no
>> elimination of redundant clauses for given partitioning column, etc.
>>
>> A note about 0001: this patch overlaps with
>> 0003-Canonical-partition-scheme.patch from the partitionwise-join patch
>> series that Ashutosh Bapat posted yesterday [1].  Because I implemented
>> the planner-portion of this patch based on what 0001 builds, I'm posting
>> it here.  It might actually turn out that we will review and commit
>> 0003-Canonical-partition-scheme.patch on that thread, but meanwhile apply
>> 0001 if you want to play with the later patches.  I would certainly like
>> to review  0003-Canonical-partition-scheme.patch myself, but won't be able
>> to immediately (see below).
>>
> 
> Could you share your thoughts on the usage of PartitionAppendInfo's
> min_datum_idx / max_datum_idx ? Especially in relation to hash partitions.
> 
> I'm looking at get_partitions_for_keys.

Sure.  You may have noticed that min_datum_idx and max_datum_idx in
PartitionAppendInfo are offsets in the PartitionDescData.boundinfo.datums
array, of datums that lie within the query-specified range (that is, using
=, >, >=, <, <= btree operators in the query). That array contains bounds
of all partitions sorted in the btree operator class defined order, at
least for list and range partitioning.  I haven't (yet) closely looked at
the composition of hash partition datums in PartitionBoundInfo, which
perhaps have different ordering properties (or maybe none) than list and
range partitioning datums.

Now, since they are offsets of datums in PartitionBoundInfo, not indexes
of partitions themselves, their utility outside partition.c might be
questionable.  But partition-wise join patch, for example, to determine if
two partitioned tables can be joined partition-wise, is going to check if
PartitionBoundInfos in RelOptInfos of two partitioned tables are
bound-to-bound equal [1].  Partition-pruning may select only a subset of
partitions of each of the joining partitioned tables.  Equi-join
requirement for partition-wise join means that the subset of partitions
will be same on both sides of the join.  My intent of having
min_datum_idx, max_datum_idx, along with contains_null_partition, and
contains_default_partition in PartitionAppendInfo is to have sort of a
cross check that those values end up being same on both sides of the join
after equi-join requirement has been satisfied.  That is,
get_partitions_for_keys will have chosen the same set of partitions for
both partitioned tables and hence will have set the same values for those
fields.

As mentioned above, that may be enough for list and range partitioning,
but since hash partition datums do not appear to have the same properties
as list and range datums [2], we may need an additional field(s) to
describe the hash partition selected by get_partitions_for_keys.  I guess
only one field will be enough, that will be the offset in the datums array
of the hash partition chosen for the query or -1 if query quals couldn't
conclusively determine one (maybe not all partition keys were specified to
be hashed or some or all used non-equality operator).

Hope that answers your question at least to some degree.  Now, there are
some points Robert mentioned in his reply that I will need to also
consider in the patch, which I'll go do now. :)

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFjFpRc4UdCYknBai9pBu2GA1h4nZVNPDmzgs4jOkqFamT1huA%40mail.gmail.com

[2] It appears that in the hash partitioning case, unlike list and range
partitioning, PartitionBoundInfo doesn't contain values that are
directly comparable to query-specified constants, but a pair (modulus,
remainder) for each partition.  We'll first hash *all* the key values
(mentioned in the query) using the partitioning hash machinery and
then determine the hash partition index by using
(hash % largest_modulus) as offset into the PartitionBoundInfo.indexes
array.
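
    A sketch of that final lookup (assuming, per the hash partitioning
    patch, that boundinfo->indexes has one slot per remainder up to the
    largest modulus, with -1 meaning no partition accepts that
    remainder):

    static int
    hash_partition_index(PartitionBoundInfo boundinfo,
                         int greatest_modulus, uint64 rowhash)
    {
        /* a direct lookup; no binary search as in the list/range cases */
        return boundinfo->indexes[rowhash % greatest_modulus];
    }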





Re: [HACKERS] path toward faster partition pruning

2017-09-26 Thread Amit Langote
Hi David,

On 2017/09/27 6:04, David Rowley wrote:
> On 25 September 2017 at 23:04, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> By the way, I'm now rebasing these patches on top of [1] and will try to
>> merge your refactoring patch in some appropriate way.  Will post more
>> tomorrow.
>>
>> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9140cf826
> 
> Yeah, I see 0001 conflicts with that. I'm going to set this to waiting
> on author while you're busy rebasing this.

Thanks for the reminder.  Just thought I'd say that while I'm actually
done rebasing itself (attaching rebased patches to try 'em out), I'm now
considering Robert's comments and will be busy for a bit revising things
based on those comments.

Some notes about the attached patches:

- 0001 includes refactoring that Dilip proposed upthread [1] (added him as
  an author).  I slightly tweaked his patch -- renamed the function
  get_matching_clause to match_clauses_to_partkey, similar to
  match_clauses_to_index.

- Code to set AppendPath's partitioned_rels in add_paths_to_append_rel()
  revised by 0a480502b09 (was originally introduced in d3cc37f1d80) is
  still revised to get partitioned_rels from a source that is not
  PlannerInfo.pcinfo_list.  With the new code, partitioned_rels won't
  contain RT indexes of the partitioned child tables that were pruned.


Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFiTN-tGnQzF_4QtbOHT-3hE%3DOvNaMfbbeRxa4UY0CQyF0G8gQ%40mail.gmail.com
From f20aebcad9b089434ba60cd4439fa1a9d55091b8 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Wed, 13 Sep 2017 18:24:55 +0900
Subject: [PATCH 1/4] WIP: planner-side changes for partition-pruning

Firstly, this adds a stub get_partitions_for_keys() in partition.c
with appropriate interface for the caller to specify bounding scan
keys, along with other information about the scan keys extracted
from the query, such as NULL-ness of the keys, inclusive-ness, etc.

More importantly, this implements the planner-side logic to extract
bounding scan keys to be passed to get_partitions_for_keys.  That is,
it will go through rel->baserestrictinfo and match individual clauses
to partition keys and construct lower bound and upper bound tuples,
which may cover only a prefix of a multi-column partition key.

A bunch of smarts are still missing when mapping the clause operands
with keys.  For example, code to match a clause specified as
(constant op var) doesn't exist.  Also, redundant keys are not
eliminated, for example, a combination of clauses a = 10 and a > 1
will cause the latter clause a > 1 to take over, resulting in
needless scanning of partitions containing values a > 1 and a < 10.

...constraint exclusion is still used, because
get_partitions_for_keys is just a stub...

Authors: Amit Langote, Dilip Kumar
---
 src/backend/catalog/partition.c   |  43 
 src/backend/optimizer/path/allpaths.c | 390 ++
 src/backend/optimizer/util/plancat.c  |  10 +
 src/backend/optimizer/util/relnode.c  |   7 +
 src/include/catalog/partition.h   |   9 +
 src/include/nodes/nodes.h |   1 +
 src/include/nodes/relation.h  |  66 ++
 7 files changed, 487 insertions(+), 39 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 1ab6dba7ae..ccf8a1fa67 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1335,6 +1335,49 @@ get_partition_dispatch_recurse(Relation rel, Relation 
parent,
}
 }
 
+/*
+ * get_partitions_for_keys
+ * Returns the list of indexes of rel's partitions that will need 
to be
+ * scanned given the bounding scan keys.
+ *
+ * Each value in the returned list can be used as an index into the oids array
+ * of the partition descriptor.
+ *
+ * Inputs:
+ * keynullness contains between 0 and (key->partnatts - 1) values, each
+ * telling what kind of NullTest has been applies to the corresponding
+ * partition key column.  minkeys represents the lower bound on the 
partition
+ * the key of the records that the query will return, while maxkeys
+ * represents upper bound.  min_inclusive and max_inclusive tell whether 
the
+ * bounds specified minkeys and maxkeys is inclusive, respectively.
+ *
+ * Other outputs:
+ * *min_datum_index will return the index in boundinfo->datums of the first
+ * datum that the query's bounding keys allow to be returned for the query.
+ * Similarly, *max_datum_index.  *null_partition_chosen returns whether
+ * the null partition will be scanned.
+ *
+ * TODO: Implement.
+ */
+List *
+get_partitions_for_keys(Relation rel,
+   NullTestType *keynullness,
+   Datum *minkeys, int n_minkeys, 
bool min_inclusive,
+   

Re: [HACKERS] path toward faster partition pruning

2017-09-26 Thread Amit Langote
On 2017/09/27 1:51, Robert Haas wrote:
> On Tue, Sep 26, 2017 at 10:57 AM, Jesper Pedersen
>  wrote:
>> One could advocate (*cough*) that the hash partition patch [1] should be
>> merged first in order to find other instances of where other CommitFest
>> entries doesn't account for hash partitions at the moment in their method
>> signatures; Beena noted something similar in [2]. I know that you said
>> otherwise [3], but this is CommitFest 1, so there is time for a revert
>> later, and hash partitions are already useful in internal testing.
> 
> Well, that's a fair point.  I was assuming that committing things in
> that order would cause me to win the "least popular committer" award
> at least for that day, but maybe not.  It's certainly not ideal to
> have to juggle that patch along and keep rebasing it over other
> changes when it's basically done, and just waiting on other
> improvements to land.  Anybody else wish to express an opinion?

FWIW, I tend to agree that it would be nice to get the hash partitioning
patch in, even with old constraint exclusion based partition-pruning not
working for hash partitions.  That way, it might be more clear what we
need to do in the partition-pruning patches to account for hash partitions.

Thanks,
Amit





Re: [HACKERS] path toward faster partition pruning

2017-09-26 Thread Amit Langote
On 2017/09/25 20:21, Dilip Kumar wrote:
> On Mon, Sep 25, 2017 at 3:34 PM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
> 
>> Thanks for looking at the patches and the comments.
> 
>> It's not clear to me whether get_rel_partitions() itself, as it is, is
>> callable from outside the planner, because its signature contains
>> RelOptInfo.  We have the RelOptInfo in the signature, because we want to
>> mark certain fields in it so that latter planning steps can use them.  So,
>> get_rel_partitions()'s job is not just to match clauses and find
>> partitions, but also to perform certain planner-specific tasks of
>> generating information that the later planning steps will want to use.
>> That may turn out to be unnecessary, but until we know that, let's not try
>> to export get_rel_partitions() itself out of the planner.
>>
>> OTOH, the function that your refactoring patch separates out to match
>> clauses to partition keys and extract bounding values seems reusable
>> outside the planner and we should export it in such a way that it can be
>> used in the executor.  Then, the hypothetical executor function that does
>> the pruning will first call the planner's clause-matching function,
>> followed by calling get_partitions_for_keys() in partition.c to get the
>> selected partitions.
> 
> Thanks for your reply.
> 
> Actually, we are still planning to call get_matching_clause at the
> optimizer time only.  We cannot use the get_rel_partitions function
> directly for runtime pruning because it does all the work (finding
> matching clauses, creating minkey and maxkey, and calling
> get_partitions_for_keys) during planning time itself.
>
> For runtime pruning, we are planning to first call the
> get_matching_clause function at optimizer time to identify the clauses
> that match the partition keys, but for the PARAM_EXEC case we cannot
> depend upon baserestrictinfo; instead we will get them from the join
> clauses.  That's the reason I have separated out get_matching_clause.
> But it will still be used during planning time.

I see.  So, in the run-time pruning case, only the work of extracting
bounding values is deferred to execution time.  Matching clauses with the
partition key still occurs during planning time.  Only that the clauses
that require run-time pruning are not those in rel->baserestrictinfo.

> After separating out the matching clause we will do somewhat similar
> processing what "get_rel_partitions" is doing. But, at optimizer time
> for PARAM we will not have Datum values for rightop, so we will keep
> track of the PARAM itself.

I guess information about which PARAMs map to which partition keys will be
kept in the plan somehow.

> And, finally at runtime when we get the PARAM value we can prepare
> minkey and maxkey and call get_partitions_for_keys function.

Note that get_partitions_for_keys() is not planner code, nor is it bound
with any other planning code.  It's callable from executor without much
change.  Maybe you already know that though.

Thanks,
Amit





Re: [HACKERS] Shaky coding for vacuuming partitioned relations

2017-09-26 Thread Amit Langote
On 2017/09/26 11:12, Michael Paquier wrote:
> On Tue, Sep 26, 2017 at 10:54 AM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> I think that's right, although, I don't see any new RangeVar created under
>> vacuum() at the moment.  Maybe, you're referring to the Nathan's patch
>> that perhaps does that.
> 
> Yes, you can check what it does here (last version):
> 766556dd-aa3c-42f7-acf4-5dc97f41f...@amazon.com

Thanks, will look.

Regards,
Amit





Re: [HACKERS] Shaky coding for vacuuming partitioned relations

2017-09-26 Thread Amit Langote
On 2017/09/26 11:14, Michael Paquier wrote:
> On Tue, Sep 26, 2017 at 10:55 AM, Amit Langote wrote:
>> On 2017/09/26 9:51, Michael Paquier wrote:
>>> On Tue, Sep 26, 2017 at 8:48 AM, Michael Paquier wrote:
>>>> Something like that looks like a good compromise for v10. I would
>>>> rather see a more complete fix with each relation name reported
>>>> correctly on HEAD though. The information provided would be useful for
>>>> users when using autovacuum when skipping a relation because no lock
>>>> could be taken on it.
>>>
>>> Actually, perhaps this should be tracked as an open item? A simple fix
>>> is leading to the path that no information is better than misleading
>>> information in this case.
>>
>> +1.
> 
> Let's track it then and spawn a separate thread with a patch. Do you
> want to work on it or should I? The solution proposed by Tom seems
> like the correct answer. I am adding an item for now, we could always
> link it to a thread later on.

I assume you mean the Tom's solution wherein we pass a null RangeVar for
tables that were not mentioned in the command (that is, for partitions of
a partitioned table that was mentioned in the command or for tables read
from pg_class when a user ran VACUUM without mentioning any table name).

Please feel free to come up with the patch for the same, if you have time.
I'll be glad to review.

> Let's also track the problem that has been reported on this thread.

Yes.  Just to be clear, the original problem this thread was started for
is that get_rel_oids() may emit a less user-friendly "cache lookup failed
for relation <OID>" error, which it shouldn't.  We should fix the locking
like the patch you posted does, to avoid having to come across this
syscache lookup error.
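
The pattern for that fix would be along these lines (a sketch, not
Michael's actual patch):

#include "postgres.h"
#include "storage/lmgr.h"
#include "utils/syscache.h"

/*
 * Take the lock first, then verify that the relation still exists
 * before any syscache lookup of its name, so that a concurrently
 * dropped relation results in a clean skip rather than an internal
 * "cache lookup failed" error.
 */
static bool
lock_and_check_exists(Oid relid, LOCKMODE lockmode)
{
    LockRelationOid(relid, lockmode);

    if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
    {
        UnlockRelationOid(relid, lockmode);
        return false;
    }
    return true;
}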

Thanks,
Amit





Re: [HACKERS] Optimise default partition scanning while adding new partition

2017-09-26 Thread Amit Langote
On 2017/09/16 1:57, Amit Langote wrote:
> On Sat, Sep 16, 2017 at 12:59 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> I believe the intended advantage of the current system is that if you
>> specify multiple operations in a single ALTER TABLE command, you only
>> do one scan rather than having a second scan per operation.  If that's
>> currently working, we probably don't want to make it stop working.
> 
> OK.
> 
> How about squash Jeevan's and my patch, so both
> check_default_allows_bound() and ValidatePartitionConstraints() know
> to scan default partition's children and there won't be any surprises
> in the regression test output as you found after applying just the
> Jeevan's patch.  Unfortunately, I'm not able to post such a patch
> right now.

I guess we don't need to squash, as they could be seen as implementing
different features. Reordering the patches helps though.  So, apply them
in this order:

1. My patch to teach ValidatePartitionConstraints() to skip scanning
   a partition's own partitions, which optimizes ATTACH PARTITION
   command's partition constraint validation scan (this also covers the
   case of scanning the default partition to validate its updated
   constraint when attaching a new partition)

2. Jeevan's patch to teach check_default_allows_bound() to skip scanning
   the default partition's own partitions, which covers the case of
   scanning the default partition to validate its updated constraint when
   adding a new partition using CREATE TABLE

Attached 0001 and 0002 are ordered that way.

In addition to implementing the features mentioned in 1 and 2 above, the
patches also modify the INFO message to mention "updated partition
constraint for default partition \"%s\"", instead of "partition constraint
for table \"%s\"", when the default partition is involved.  That's so that
it's consistent with the error message that would be emitted by either
check_default_allows_bound() or ATRewriteTable() when the scan finds a row
that violates updated default partition constraint, viz. the following:
"updated partition constraint for default partition \"%s\" would be
violated by some row"

Thoughts?

Thanks,
Amit
From 166cc082b9d652e186df4ef7b6c41976a22e55a8 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Mon, 7 Aug 2017 10:51:47 +0900
Subject: [PATCH 1/2] Teach ValidatePartitionConstraints to skip validation in
 more cases

In cases where the table being attached is a partitioned table and
the table itself does not have constraints that would allow validation
on the whole table to be skipped, we can still skip the validations
of individual partitions if they each happen to have the requisite
constraints.

Idea by Robert Haas.
---
 src/backend/commands/tablecmds.c  | 26 +++---
 src/test/regress/expected/alter_table.out | 17 +++--
 src/test/regress/sql/alter_table.sql  | 10 ++
 3 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 563bcda30c..2d4dcd7556 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -13635,9 +13635,14 @@ ValidatePartitionConstraints(List **wqueue, Relation 
scanrel,
 */
if (PartConstraintImpliedByRelConstraint(scanrel, partConstraint))
{
-   ereport(INFO,
-   (errmsg("partition constraint for table \"%s\" 
is implied by existing constraints",
-   
RelationGetRelationName(scanrel))));
+   if (!validate_default)
+   ereport(INFO,
+   (errmsg("partition constraint for table 
\"%s\" is implied by existing constraints",
+   
RelationGetRelationName(scanrel))));
+   else
+   ereport(INFO,
+   (errmsg("updated partition constraint 
for default partition \"%s\" is implied by existing constraints",
+   
RelationGetRelationName(scanrel))));
return;
}
 
@@ -13678,6 +13683,21 @@ ValidatePartitionConstraints(List **wqueue, Relation 
scanrel,
/* There can never be a whole-row reference here */
if (found_whole_row)
elog(ERROR, "unexpected whole-row reference 
found in partition key");
+
+   /* Can we skip scanning this part_rel? */
+   if (PartConstraintImpliedByRelConstraint(part_rel, 
my_partconstr))
+   {
+   if (!validate_defau

Re: [HACKERS] Setting pd_lower in GIN metapage

2017-09-26 Thread Amit Langote
On 2017/09/26 16:30, Michael Paquier wrote:
> On Tue, Sep 26, 2017 at 4:22 PM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>>> Except that small thing, the patches do their duty.
>>
>> Thanks, revised patches attached.
> 
> Cool, let's switch it back to a ready for committer status then.

Sure, thanks.

Regards,
Amit





Re: [HACKERS] Setting pd_lower in GIN metapage

2017-09-26 Thread Amit Langote
On 2017/09/26 12:17, Michael Paquier wrote:
> On Mon, Sep 25, 2017 at 3:48 PM, Amit Langote wrote:
>> So, ISTM, comments that the patches add should all say that setting the
>> meta pages' pd_lower to the correct value helps to pass those pages to
>> xlog.c as compressible standard layout pages, regardless of whether they
>> are actually passed that way.  Even if the patches do take care of the
>> latter as well.
>>
>> Did I miss something?
> 
> Not that I think of.

Thanks.

> Buffer  metabuffer;
> +   Page        metapage;
> SpGistMetaPageData *metadata;
> 
> metabuffer = ReadBuffer(index, SPGIST_METAPAGE_BLKNO);
> +   metapage = BufferGetPage(metabuffer);
> No need to define metapage here and to call BufferGetPage() as long as
> the lock on the buffer is not taken.

Ah, okay.

Moved those additions inside the if (ConditionalLockBuffer(metabuffer)) block.

> Except that small thing, the patches do their duty.

Thanks, revised patches attached.

Regards,
Amit
From e82e787bc9533977e73456b0782ed78354637bda Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Fri, 23 Jun 2017 11:20:41 +0900
Subject: [PATCH 1/5] Set pd_lower correctly in the GIN metapage.

Also tell xlog.c to treat the metapage like a standard page, so any
hole in it is compressed.
---
 src/backend/access/gin/ginfast.c   | 20 ++--
 src/backend/access/gin/gininsert.c |  4 ++--
 src/backend/access/gin/ginutil.c   | 17 -
 src/backend/access/gin/ginxlog.c   | 25 ++---
 4 files changed, 46 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 59e435465a..359de85dfc 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -399,6 +399,14 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
 	/*
 	 * Write metabuffer, make xlog entry
 	 */
+
+	/*
+	 * Set pd_lower just past the end of the metadata.  This is essential,
+	 * because without doing so, metadata will be lost if xlog.c compresses
+	 * the page.
+	 */
+	((PageHeader) metapage)->pd_lower =
+		((char *) metadata + sizeof(GinMetaPageData)) - (char *) metapage;
 	MarkBufferDirty(metabuffer);
 
 	if (needWal)
@@ -407,7 +415,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
 
 		memcpy(&data.metadata, metadata, sizeof(GinMetaPageData));
 
-		XLogRegisterBuffer(0, metabuffer, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(0, metabuffer, REGBUF_WILL_INIT | REGBUF_STANDARD);
 		XLogRegisterData((char *) &data, sizeof(ginxlogUpdateMeta));
 
 		recptr = XLogInsert(RM_GIN_ID, XLOG_GIN_UPDATE_META_PAGE);
@@ -572,6 +580,13 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		metadata->nPendingHeapTuples = 0;
 	}
 
+	/*
+	 * Set pd_lower just past the end of the metadata.  This is essential,
+	 * because without doing so, metadata will be lost if xlog.c compresses
+	 * the page.
+	 */
+	((PageHeader) metapage)->pd_lower =
+		((char *) metadata + sizeof(GinMetaPageData)) - (char *) metapage;
 	MarkBufferDirty(metabuffer);
 
 	for (i = 0; i < data.ndeleted; i++)
@@ -586,7 +601,8 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead,
 		XLogRecPtr	recptr;
 
 		XLogBeginInsert();
-		XLogRegisterBuffer(0, metabuffer, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(0, metabuffer,
+						   REGBUF_WILL_INIT | REGBUF_STANDARD);
 		for (i = 0; i < data.ndeleted; i++)
 			XLogRegisterBuffer(i + 1, buffers[i], REGBUF_WILL_INIT);
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 5378011f50..c9aa4ee147 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -348,7 +348,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		Page		page;
 
 		XLogBeginInsert();
-		XLogRegisterBuffer(0, MetaBuffer, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(0, MetaBuffer, REGBUF_WILL_INIT | REGBUF_STANDARD);
 		XLogRegisterBuffer(1, RootBuffer, REGBUF_WILL_INIT);
 
 		recptr = XLogInsert(RM_GIN_ID, XLOG_GIN_CREATE_INDEX);
@@ -447,7 +447,7 @@ ginbuildempty(Relation index)
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
 	MarkBufferDirty(MetaBuffer);
-	log_newpage_buffer

Re: [HACKERS] Setting pd_lower in GIN metapage

2017-09-26 Thread Amit Langote
On 2017/09/26 11:34, Amit Kapila wrote:
> On Mon, Sep 25, 2017 at 12:18 PM, Amit Langote wrote:
>> So, ISTM, comments that the patches add should all say that setting the
>> meta pages' pd_lower to the correct value helps to pass those pages to
>> xlog.c as compressible standard layout pages, regardless of whether they
>> are actually passed that way.  Even if the patches do take care of the
>> latter as well.
>>
>> Did I miss something?
>>
>> Looking at Amit K's updated patches for btree and hash, it seems that he
>> updated the comment to say that setting pd_lower to the correct value is
>> *essential*, because those pages are passed as REGBUF_STANDARD pages and
>> hence will be compressed.  That seems to contradict what I wrote above,
>> but I think that's not wrong, too.
>>
> 
> I think you are missing that there are many cases where we use only
> REGBUF_STANDARD for meta-pages (cf. hash index).  For btree, where the
> advantage is in fewer cases, I have explicitly stated those cases.

I do see that there are some places that use only REGBUF_STANDARD.  I also
understand that specifying this flag is necessary condition for
XLogRecordAssemble() to perform the hole compression, if it is to be
performed at all.  ISTM, the hole is compressed only if we write the FP
image.  However, reasons for why FP image needs to be written, AFAICS, are
independent of whether the hole is (should be) compressed or not.  Whether
compression should occur or not depends on whether the REGBUF_STANDARD
flag is passed, that is, whether a caller is sure that the page is of
standard layout.  The last part is only true if page's pd_lower is always
set correctly.

Perhaps, we should focus on just writing exactly why setting pd_lower to
the correct value is essential.  I think that's a good change, so I
adopted it from your patch.  The *only* reason why it's essential to set
pd_lower to the correct value, as I see it, is that xloginsert.c *might*
compress the hole in the page and its pd_lower better be correct if we
want compression to preserve the metadata content (in meta pages).  OTOH,
it's not clear to me why we should write in that comment *why* the
compression would occur or why the FP image would be written in the first
place.  Whether or not FP image is written has nothing to do with whether
we should set pd_lower correctly, does it?  Also, since we want to repeat
the same comment in multiple places where we now set pd_lower (gin, brin,
spgist patches have many more new places where meta page's pd_lower is
set), keeping it to the point would be good.
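
To spell out the mechanism being referred to, here is a simplified sketch
(not the verbatim code) of how XLogRecordAssemble() in xloginsert.c finds
the compressible "hole" in a page registered with REGBUF_STANDARD:

    /*
     * Assume the bytes between pd_lower and pd_upper are unused and can
     * be omitted from the full-page image.
     */
    if (regbuf->flags & REGBUF_STANDARD)
    {
        uint16      lower = ((PageHeader) page)->pd_lower;
        uint16      upper = ((PageHeader) page)->pd_upper;

        if (lower >= SizeOfPageHeaderData && upper > lower && upper <= BLCKSZ)
        {
            hole_offset = lower;            /* everything from here ... */
            hole_length = upper - lower;    /* ... for this many bytes is dropped */
        }
    }

If a meta page's pd_lower were left at the PageInit() default of
SizeOfPageHeaderData, the metadata itself would fall inside that hole and
be thrown away, which is exactly what the proposed comment warns about.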

But as you said a few times, we should leave it to the committer to decide
the exact text to write in these comments, which I think is fine.

> Do you still have confusion?

This is an area of PG code I don't have much experience with and is also a
bit complex, so sorry if I'm saying things that are not quite right.

Thanks,
Amit





Re: [HACKERS] moving some partitioning code to executor

2017-09-25 Thread Amit Langote
On 2017/09/14 16:13, Amit Langote wrote:
> Hi.
> 
> It seems to me that some of the code in partition.c is better placed
> somewhere under the executor directory.  There was even a suggestion
> recently [1] to introduce a execPartition.c to house some code around
> tuple-routing.
> 
> IMO, catalog/partition.c should present an interface for handling
> operations on a *single* partitioned table and avoid pretending to support
> any operations on the whole partition tree.  For example, the
> PartitionDispatch structure embeds some knowledge about the partition tree
> it's part of, which is useful when used for tuple-routing, because of the
> way it works now (lock and form ResultRelInfos of *all* leaf partitions
> before the first input row is processed).
> 
> So, let's move that structure, along with the code that creates and
> manipulates the same, out of partition.c/h and to execPartition.c/h.
> Attached patch attempts to do that.
> 
> While doing the same, I didn't move *all* of get_partition_for_tuple() out
> to execPartition.c, instead modified its signature as shown below:
> 
> -extern int get_partition_for_tuple(PartitionDispatch *pd,
> -TupleTableSlot *slot,
> -EState *estate,
> -PartitionDispatchData **failed_at,
> -TupleTableSlot **failed_slot);
> +extern int get_partition_for_tuple(Relation relation, Datum *values,
> +bool *isnull);
> 
> That way, we keep the core partition bound comparison logic inside
> partition.c and move rest of the stuff to its caller ExecFindPartition(),
> which includes navigating the enveloping PartitionDispatch's.
> 
> Thoughts?
> 
> PS: 0001 of the attached is the patch from [2] which is here to be applied
> on HEAD before applying the main patch (0002) itself

Since that 0001 patch was committed [1], here is the rebased patch.  Will
add this to the November commit-fest.
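
To visualize the proposed division of labor, here is a rough sketch of the
executor-side caller (details and error reporting elided; see
execPartition.c in the attached patch for the real thing):

    /*
     * Sketch only: the executor walks the PartitionDispatch array, while
     * partition.c merely compares one table's bounds against the key values.
     */
    parent = pd[0];
    for (;;)
    {
        Datum   values[PARTITION_MAX_KEYS];
        bool    isnull[PARTITION_MAX_KEYS];
        int     cur_index;

        /* ... extract partition key values of the tuple into values/isnull ... */
        cur_index = get_partition_for_tuple(parent->reldesc, values, isnull);

        if (cur_index < 0)
            break;                              /* no partition: error out */
        if (parent->indexes[cur_index] >= 0)
            return parent->indexes[cur_index];  /* found a leaf partition */

        parent = pd[-parent->indexes[cur_index]];   /* descend one level */
    }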

Thanks,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=77b6b5e9c
From 08b337436b5862b0ee593314864d6ba98f95e6c0 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Fri, 8 Sep 2017 19:07:38 +0900
Subject: [PATCH] Move certain partitioning code to the executor

---
 src/backend/catalog/partition.c| 438 +-
 src/backend/commands/copy.c|   1 +
 src/backend/executor/Makefile  |   2 +-
 src/backend/executor/execMain.c| 265 +---
 src/backend/executor/execPartition.c   | 559 +
 src/backend/executor/nodeModifyTable.c |   1 +
 src/include/catalog/partition.h|  48 +--
 src/include/executor/execPartition.h   |  65 
 src/include/executor/executor.h|  14 +-
 9 files changed, 708 insertions(+), 685 deletions(-)
 create mode 100644 src/backend/executor/execPartition.c
 create mode 100644 src/include/executor/execPartition.h

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 1ab6dba7ae..903c8c4def 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -147,8 +147,6 @@ static int32 partition_bound_cmp(PartitionKey key,
 static int partition_bound_bsearch(PartitionKey key,
 					   PartitionBoundInfo boundinfo,
 					   void *probe, bool probe_is_bound, bool *is_equal);
-static void get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids);
 
 /*
  * RelationBuildPartitionDesc
@@ -1193,148 +1191,6 @@ get_partition_qual_relid(Oid relid)
 	return result;
 }
 
-/*
- * RelationGetPartitionDispatchInfo
- *		Returns information necessary to route tuples down a partition tree
- *
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
- *
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids)
-{
-	List	   *pdlist = NIL;
-	PartitionDispatchData **pd;
-	ListCell   *lc;
-	int			i;
-
-	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
-	*num_parted = 0;
-	*leaf_part_oids = NIL;
-
-	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
-	*num_parted = list_length(pdlist);
-	pd = (PartitionDi

Re: [HACKERS] Shaky coding for vacuuming partitioned relations

2017-09-25 Thread Amit Langote
On 2017/09/26 9:51, Michael Paquier wrote:
> On Tue, Sep 26, 2017 at 8:48 AM, Michael Paquier
>  wrote:
>> On Mon, Sep 25, 2017 at 11:32 PM, Tom Lane  wrote:
>>> Yeah, I'd noticed that while reviewing the vacuum-multiple-tables patch.
>>> My thought about fixing it was to pass a null RangeVar when handling a
>>> table we'd identified through inheritance or pg_class-scanning, to
>>> indicate that this wasn't a table named in the original command.  This
>>> only works conveniently if you decide that it's appropriate to silently
>>> ignore relation_open failure on such table OIDs, but I think it is.
>>>
>>> Not sure about whether we ought to try to fix that in v10.  It's a
>>> mostly-cosmetic problem in what ought to be an infrequent corner case,
>>> so it might not be worth taking risks for post-RC1.  OTOH, I would
>>> not be surprised to get bug reports about it down the road.
>>
>> Something like that looks like a good compromise for v10. I would
>> rather see a more complete fix with each relation name reported
>> correctly on HEAD though. The information provided would be useful for
>> users when using autovacuum when skipping a relation because no lock
>> could be taken on it.
> 
> Actually, perhaps this should be tracked as an open item? A simple fix
> is leading to the path that no information is better than misleading
> information in this case.

+1.

Thanks,
Amit






Re: [HACKERS] Shaky coding for vacuuming partitioned relations

2017-09-25 Thread Amit Langote
On 2017/09/25 18:37, Michael Paquier wrote:
> On Mon, Sep 25, 2017 at 4:42 PM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> On 2017/09/25 12:10, Michael Paquier wrote:
>> Hmm, I'm not sure if we need to lock the partitions, too.  Locks taken by
>> find_all_inheritors() will be gone once the already-running transaction is
>> ended by the caller (vacuum()).  get_rel_oids() should just lock the table
>> mentioned in the command (the parent table), so that find_all_inheritors()
>> returns the correct partition list, as Tom perhaps alluded to when he said
>> "it would also make the find_all_inheritors call a lot more meaningful.",
>> but...
> 
> It is important to hold the lock for child tables until the end of the
> transaction actually to fill in correctly the RangeVar associated to
> each partition when scanning them.

I think that's right, although I don't see any new RangeVar created under
vacuum() at the moment.  Maybe you're referring to Nathan's patch, which
perhaps does that.

>> When vacuum() iterates that list and calls vacuum_rel() for each relation
>> in the list, it might be the case that a relation in that list is no
>> longer a partition of the parent.  So, one might say that it would get
>> vacuumed unnecessarily.  Perhaps that's fine?
> 
> I am personally fine with that. Any checks would prove to complicate
> the code for not much additional value.

Tend to agree here.

>>> I am not
>>> saying that this needs to be fixed in REL_10_STABLE at this stage
>>> though as this would require some refactoring similar to what the
>>> patch adding support for VACUUM with multiple relations does.
>>
>> I think it would be better to fix that independently somehow.  VACUUM on
>> partitioned tables is handled by calling vacuum_rel() on individual
>> partitions.  It would be nice if vacuum_rel() didn't have to refer to the
>> RangeVar.  Could we not use get_rel_name(relid) in place of
>> relation->relname?  I haven't seen the other patch yet though, so maybe
>> I'm missing something.
> 
> The RangeVar is used for error reporting, and once you reach
> vacuum_rel, the relation and its OID may be gone. So you need to save
> the information of the RangeVar beforehand when scanning the
> relations. That's one reason behind having the VacuumRelation stuff
> discussed on the thread for multiple table VACUUM, this way you store
> for each item to be vacuumed:
> - A RangeVar.
> - A list of columns.
> - An OID saved by vacuum deduced from the RangeVar.
> That's quite some refactoring, so my opinion is that this ship has
> sailed already for REL_10_STABLE, and that we had better live with
> that in 10.

Yeah, your simple patch seems enough for v10.

Thanks,
Amit





Re: [HACKERS] path toward faster partition pruning

2017-09-25 Thread Amit Langote
Hi Dilip.

Thanks for looking at the patches and for the comments.

On 2017/09/16 18:43, Dilip Kumar wrote:
> On Fri, Sep 15, 2017 at 2:20 PM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> On 2017/09/15 11:16, Amit Langote wrote:
> 
> Thanks for the updated patch.  I was going through the logic of
> get_rel_partitions in 0002 as almost similar functionality will be
> required by runtime partition pruning on which Beena is working.  The
> only difference is that here we are processing the
> "rel->baserestrictinfo" and in case of runtime pruning, we also need
> to process join clauses which are pushed down to appendrel.

Yeah, I agree with the point you seem to be making that
get_rel_partitions() covers a lot of functionality, which it would be nice
to break down into reusable function(s) with suitable signature(s) that
the executor will also be able to use.

Your proposed refactoring patch down-thread seems to be a good step in
that direction.  Thanks for working on it.

> So can we make some generic logic which can be used for both the patches.
> 
> So basically, we need to do two changes
> 
> 1. In get_rel_partitions instead of processing the
> "rel->baserestrictinfo" we can take clause list as input that way we
> can pass any clause list to this function.
> 
> 2. Don't call "get_partitions_for_keys" routine from the
> "get_rel_partitions", instead, get_rel_partitions can just prepare
> minkey, maxkey and the caller of the get_rel_partitions can call
> get_partitions_for_keys, because for runtime pruning we need to call
> get_partitions_for_keys at runtime.

It's not clear to me whether get_rel_partitions() itself, as it is, is
callable from outside the planner, because its signature contains
RelOptInfo.  We have the RelOptInfo in the signature, because we want to
mark certain fields in it so that latter planning steps can use them.  So,
get_rel_partitions()'s job is not just to match clauses and find
partitions, but also to perform certain planner-specific tasks of
generating information that the later planning steps will want to use.
That may turn out to be unnecessary, but until we know that, let's not try
to export get_rel_partitions() itself out of the planner.

OTOH, the function that your refactoring patch separates out to match
clauses to partition keys and extract bounding values seems reusable
outside the planner and we should export it in such a way that it can be
used in the executor.  Then, the hypothetical executor function that does
the pruning will first call the planner's clause-matching function,
followed by calling get_partitions_for_keys() in partition.c to get the
selected partitions.

We should be careful when designing the interface of the exported function
to make sure it's not bound to the planner.  Your patch still maintains
the RelOptInfo in the signature of the clause-matching function, which the
executor pruning function won't have access to.
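
Concretely (and only as an illustration of the kind of signature meant
here, not something from the posted patches), the exported function could
look like:

    /*
     * Hypothetical planner/executor-neutral interface: everything the
     * matching step needs comes from the relation's PartitionKey and a
     * plain list of clauses, with no RelOptInfo in sight.
     */
    extern List *match_clauses_to_partition_key(PartitionKey partkey,
                                                List *clauses,
                                                bool *constfalse);

The planner would pass the relation's baserestrictinfo clauses, while the
executor would pass whatever clauses remain to be evaluated at run time.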

> After these changes also there will be one problem that the
> get_partitions_for_keys is directly fetching the "rightop->constvalue"
> whereas, for runtime pruning, we need to store rightop itself and
> calculate the value at runtime by param evaluation,  I haven't yet
> thought how can we make this last part generic.

I don't think any code introduced by the patch in partition.c itself looks
inside OpExpr (or any type of clause for that matter).  That is, I don't
see where get_partitions_for_keys() is looking at rightop->constvalue.
All it receives to work with are arrays of Datums and some other relevant
information like inclusivity, nullness, etc.

By the way, I'm now rebasing these patches on top of [1] and will try to
merge your refactoring patch in some appropriate way.  Will post more
tomorrow.

Thanks,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9140cf826





Re: [HACKERS] Shaky coding for vacuuming partitioned relations

2017-09-25 Thread Amit Langote
On 2017/09/25 12:10, Michael Paquier wrote:
> On Sat, Sep 23, 2017 at 4:13 AM, Tom Lane  wrote:
>> Somebody inserted this into vacuum.c's get_rel_oids():
>>
>> tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
>> if (!HeapTupleIsValid(tuple))
>> elog(ERROR, "cache lookup failed for relation %u", relid);
>>
>> apparently without having read the very verbose comment two lines above,
>> which points out that we're not taking any lock on the target relation.
>> So, if that relation is concurrently being dropped, you're likely to
>> get "cache lookup failed for relation " rather than anything more
>> user-friendly.
> 
> This has been overlooked during the lookups of 3c3bb99, and by
> multiple people including me. elog() should never be things users can
> face, as well as cache lookups.

Agreed, that was an oversight.

>> A minimum-change fix would be to replace the elog() with an ereport
>> that produces the same "relation does not exist" error you'd have
>> gotten from RangeVarGetRelid, had the concurrent DROP TABLE committed
>> a few microseconds earlier.  But that feels like its's band-aiding
>> around the problem.
> 
> Yeah, that's not right. This is a cache lookup error at the end.

Also agreed that fixing the locking here would be a better solution.

>> What I'm wondering about is changing the RangeVarGetRelid call to take
>> ShareUpdateExclusiveLock rather than no lock.  That would protect the
>> syscache lookup, and it would also make the find_all_inheritors call
>> a lot more meaningful.
>>
>> If we're doing a VACUUM, the ShareUpdateExclusiveLock would be dropped
>> as soon as we close the caller's transaction, and then we'd acquire
>> the same or stronger lock inside vacuum_rel().  So that seems fine.
>> If we're doing an ANALYZE, then the lock would continue to be held
>> and analyze_rel would merely be acquiring it an extra time, so we'd
>> actually be removing a race-condition failure scenario for ANALYZE.
>> This would mean a few more cycles in lock management, but since this
>> only applies to a manual VACUUM or ANALYZE that specifies a table
>> name, I'm not too concerned about that.
> 
> I think that I am +1 on that. Testing such a thing I am not seeing
> anything wrong either. The call to find_all_inheritors should also use
> ShareUpdateExclusiveLock, and the lock on top of RangeVarGetRelid()
> needs to be reworked.
> 
> Attached is a proposal of patch.

Hmm, I'm not sure if we need to lock the partitions, too.  Locks taken by
find_all_inheritors() will be gone once the already-running transaction is
ended by the caller (vacuum()).  get_rel_oids() should just lock the table
mentioned in the command (the parent table), so that find_all_inheritors()
returns the correct partition list, as Tom perhaps alluded to when he said
"it would also make the find_all_inheritors call a lot more meaningful.",
but...

When vacuum() iterates that list and calls vacuum_rel() for each relation
in the list, it might be the case that a relation in that list is no
longer a partition of the parent.  So, one might say that it would get
vacuumed unnecessarily.  Perhaps that's fine?

>> Thoughts?
> 
> As long as I don't forget... Another thing currently on HEAD and
> REL_10_STABLE is that OIDs of partitioned tables are used, but the
> RangeVar of the parent is used for error reports. This leads to
> incorrect reports if a partition gets away in the middle of autovacuum
> as only information about the parent is reported to the user.

Oh, you're right.  The original RangeVar (corresponding to the table
mentioned in the command) refers to the parent table.

> I am not
> saying that this needs to be fixed in REL_10_STABLE at this stage
> though as this would require some refactoring similar to what the
> patch adding support for VACUUM with multiple relations does.

I think it would be better to fix that independently somehow.  VACUUM on
partitioned tables is handled by calling vacuum_rel() on individual
partitions.  It would be nice if vacuum_rel() didn't have to refer to the
RangeVar.  Could we not use get_rel_name(relid) in place of
relation->relname?  I haven't seen the other patch yet though, so maybe
I'm missing something.
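
What I have in mind is something like the following hedged sketch (not from
any posted patch), assuming we are fine with silently skipping a relation
that has vanished:

    char   *relname = get_rel_name(relid);

    if (relname == NULL)
        return false;       /* concurrently dropped; nothing to report */

    ereport(LOG,
            (errmsg("skipping vacuum of \"%s\" --- lock not available",
                    relname)));

get_rel_name() returns NULL when the pg_class entry no longer exists, which
seems like exactly the condition we want to detect here.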

Thanks,
Amit





Re: [HACKERS] Setting pd_lower in GIN metapage

2017-09-25 Thread Amit Langote
Hi.

Trying to catch up.

On 2017/09/25 13:43, Michael Paquier wrote:
> On Sun, Sep 24, 2017 at 2:25 PM, Amit Kapila  wrote:
>> Added and updated the comments for both btree and hash index patches.
> 
> I don't have real complaints about this patch, this looks fine to me.
> 
> +* Currently, the advantage of setting pd_lower is in limited cases like
> +* during wal_consistency_checking or while logging for unlogged relation
> +* as for all other purposes, we initialize the metapage.  Note, it also
> +* helps in page masking by allowing to mask unused space.
> I would have reworked this comment a bit, say like that:
> Setting pd_lower is useful for two cases which make use of WAL
> compressibility even if the meta page is initialized at replay:
> - Logging of init forks for unlogged relations.
> - wal_consistency_checking logs extra full-page writes, and this
> allows masking of the unused space of the page.
> 
> Now I often get complains that I suck at this exercise ;)

So, I think I understand the discussion so far and the arguments about
what we should write to explain why we're setting pd_lower to the correct
value.

Just to remind, I didn't actually start this thread [1] to address the
issue that the FPWs of meta pages written to WAL are not getting
compressed.  An external backup tool relies on pd_lower to give the
correct starting offset of the hole to compress, provided the page's other
fields suggest it has the standard page layout.  Since we discovered that
pd_lower is not set correctly in gin, brin, and spgist meta pages, I
created patches to do the same.  You (Michael) pointed out that that
actually helps compress their FPW images in WAL as well, so we began
considering that.  Also, you pointed out that WAL checking code masks
pages based on the respective masking functions' assumptions about the
page's layout properties, which the original patches forgot to consider.
So, we updated the patches to change the respective masking functions to
mask meta pages as pages with standard page layout, too.

But as Tom pointed out [2], WAL compressibility benefit cannot be obtained
unless we change how the meta page is passed to xlog.c to be written to
WAL.  So, we found out all the places that write/register the meta page to
WAL and changed the respective XLogRegisterBuffer calls to include the
REGBUF_STANDARD flag.  Some of these calls already passed
REGBUF_WILL_INIT, which would result in no FPW image to be written to the
WAL and so there would be no question of compressibility.  But, passing
REGBUF_STANDARD would still help the case where WAL checking is enabled,
because FPW image *is* written in that case.
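
In code form, the change pattern across the affected call sites is simply
(sketching from the attached patches):

    /* before: any FPW image of the meta page is written in full */
    XLogRegisterBuffer(0, metabuffer, REGBUF_WILL_INIT);

    /* after: xlog.c may treat the page as standard layout and omit the
     * pd_lower..pd_upper hole, e.g. when wal_consistency_checking forces
     * an image to be written despite REGBUF_WILL_INIT */
    XLogRegisterBuffer(0, metabuffer, REGBUF_WILL_INIT | REGBUF_STANDARD);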

So, ISTM, comments that the patches add should all say that setting the
meta pages' pd_lower to the correct value helps to pass those pages to
xlog.c as compressible standard layout pages, regardless of whether they
are actually passed that way.  Even if the patches do take care of the
latter as well.

Did I miss something?

Looking at Amit K's updated patches for btree and hash, it seems that he
updated the comment to say that setting pd_lower to the correct value is
*essential*, because those pages are passed as REGBUF_STANDARD pages and
hence will be compressed.  That seems to contradict what I wrote above,
but I think that's not wrong, too.  So, I updated the gin, brin, and
spgist patches to say the same, viz. the following:

   /*
* Set pd_lower just past the end of the metadata.  This is essential,
* because without doing so, metadata will be lost if xlog.c compresses
* the page.
*/

Attached updated patches.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/0d273805-0e9e-ec1a-cb84-d4da400b8f85%40lab.ntt.co.jp

[2] https://www.postgresql.org/message-id/22268.1504815869%40sss.pgh.pa.us
From 600caa054e98673428fc9595b13df54efe06a6a4 Mon Sep 17 00:00:00 2001
From: amit 
Date: Fri, 23 Jun 2017 11:20:41 +0900
Subject: [PATCH 1/5] Set pd_lower correctly in the GIN metapage.

Also tell xlog.c to treat the metapage like a standard page, so any
hole in it is compressed.
---
 src/backend/access/gin/ginfast.c   | 20 ++--
 src/backend/access/gin/gininsert.c |  4 ++--
 src/backend/access/gin/ginutil.c   | 17 -
 src/backend/access/gin/ginxlog.c   | 25 ++---
 4 files changed, 46 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 59e435465a..359de85dfc 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -399,6 +399,14 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
 	/*
 	 * Write metabuffer, make xlog entry
 	 */
+
+	/*
+	 * Set pd_lower just past the end of the metadata.  This is essential,
+	 * because without doing so, metadata will be lost if xlog.c compresses
+	 * the page.
+	 */
+

Re: [HACKERS] path toward faster partition pruning

2017-09-15 Thread Amit Langote
On Sat, Sep 16, 2017 at 4:04 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Sep 15, 2017 at 4:50 AM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> Rebased patches attached.  Because Dilip complained earlier today about
>> clauses of the form (const op var) not causing partition-pruning, I've
>> added code to commute the clause where it is required.  Some other
>> previously mentioned limitations remain -- no handling of OR clauses, no
>> elimination of redundant clauses for given partitioning column, etc.
>>
>> A note about 0001: this patch overlaps with
>> 0003-Canonical-partition-scheme.patch from the partitionwise-join patch
>> series that Ashutosh Bapat posted yesterday [1].
>
> It doesn't merely overlap; it's obviously a derivative work,

Yes it is.  I noted upthread [1] that most of these are derived
from Ashutosh's patch, on his suggestion.  I guess I should have
repeated that in this message too, sorry.

> and the
> commit message in your version should credit all the authors.

That was a mistake on my part, too.  Will be careful hereon.

Thanks,
Amit

[1] 
https://www.postgresql.org/message-id/0e829199-a43c-2a66-b966-89a0020a6cd4%40lab.ntt.co.jp




Re: [HACKERS] Optimise default partition scanning while adding new partition

2017-09-15 Thread Amit Langote
On Sat, Sep 16, 2017 at 12:59 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Sep 15, 2017 at 2:00 AM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> I wonder if we should call check_default_allows_bound() from
>> ATExecAttachPartition(), too, instead of validating updated default
>> partition constraint using ValidatePartitionConstraints()?  That is, call
>> the latter only to validate the partition constraint of the table being
>> attached and call check_default_allows_bound() to validate the updated
>> default partition constraint.  That way, INFO/ERROR messages related to
>> default partition constraint are consistent across the board.
>
> I believe the intended advantage of the current system is that if you
> specify multiple operations in a single ALTER TABLE command, you only
> do one scan rather than having a second scan per operation.  If that's
> currently working, we probably don't want to make it stop working.

OK.

How about we squash Jeevan's patch and mine, so that both
check_default_allows_bound() and ValidatePartitionConstraints() know
to scan the default partition's children, and there won't be any
surprises in the regression test output like the one you found after
applying just Jeevan's patch?  Unfortunately, I'm not able to post
such a patch right now.

Thanks,
Amit




Re: [HACKERS] no test coverage for ALTER FOREIGN DATA WRAPPER name HANDLER ...

2017-09-15 Thread Amit Langote
On Fri, Sep 15, 2017 at 9:20 PM, Robert Haas  wrote:
> On Thu, Sep 14, 2017 at 8:30 AM, Ashutosh Bapat
>  wrote:
>> LGTM. The patch applies cleanly on the current HEAD, compiles (small
>> bit in regress.c requires compilation), and make check passes. Marking
>> this as "ready for committer".
>
> Committed.

Thanks, closed in the CF app.

Regards,
Amit




Re: [HACKERS] path toward faster partition pruning

2017-09-15 Thread Amit Langote
On 2017/09/15 11:16, Amit Langote wrote:
> I will post rebased patches later today, although I think the overall
> design of the patch on the planner side of things is not quite there yet.
> Of course, your and others' feedback is greatly welcome.

Rebased patches attached.  Because Dilip complained earlier today about
clauses of the form (const op var) not causing partition-pruning, I've
added code to commute the clause where it is required.  Some other
previously mentioned limitations remain -- no handling of OR clauses, no
elimination of redundant clauses for given partitioning column, etc.

A note about 0001: this patch overlaps with
0003-Canonical-partition-scheme.patch from the partitionwise-join patch
series that Ashutosh Bapat posted yesterday [1].  Because I implemented
the planner-portion of this patch based on what 0001 builds, I'm posting
it here.  It might actually turn out that we will review and commit
0003-Canonical-partition-scheme.patch on that thread, but meanwhile apply
0001 if you want to play with the later patches.  I would certainly like
to review  0003-Canonical-partition-scheme.patch myself, but won't be able
to immediately (see below).

> Also, I must inform to all of those who're looking at this thread that I
> won't be able to respond to emails from tomorrow (9/16, Sat) until 9/23,
> Sat, due to some personal business.

To remind.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFiTN-skmaqeCVaoAHCBqe2DyfO3f6sgdtEjHWrUgi0kV1yPLQ%40mail.gmail.com
From ff9ccd8df6555cfca31e54e22293ac1613db327c Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Wed, 13 Sep 2017 18:24:55 +0900
Subject: [PATCH 1/5] Some optimizer data structures for partitioned rels

Nobody uses it though.
---
 src/backend/optimizer/util/plancat.c | 120 +++
 src/backend/optimizer/util/relnode.c |  20 ++
 src/include/nodes/nodes.h|   1 +
 src/include/nodes/relation.h | 135 +++
 src/include/optimizer/plancat.h  |   2 +
 5 files changed, 278 insertions(+)

diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index a1ebd4acc8..f7e3a1df5f 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -68,6 +68,8 @@ static List *get_relation_constraints(PlannerInfo *root,
 static List *build_index_tlist(PlannerInfo *root, IndexOptInfo *index,
 				  Relation heapRelation);
 static List *get_relation_statistics(RelOptInfo *rel, Relation relation);
+static void get_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
+							Relation relation);
 
 /*
  * get_relation_info -
@@ -420,6 +422,10 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 	/* Collect info about relation's foreign keys, if relevant */
 	get_relation_foreign_keys(root, rel, relation, inhparent);
 
+	/* Collect partitioning info, if relevant. */
+	if (relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		get_relation_partition_info(root, rel, relation);
+
 	heap_close(relation, NoLock);
 
 	/*
@@ -1802,3 +1808,117 @@ has_row_triggers(PlannerInfo *root, Index rti, CmdType event)
 	heap_close(relation, NoLock);
 	return result;
 }
+
+static void
+get_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
+							Relation relation)
+{
+	int			i;
+	ListCell   *l;
+	PartitionKey key = RelationGetPartitionKey(relation);
+	PartitionDesc partdesc = RelationGetPartitionDesc(relation);
+
+	rel->part_scheme = find_partition_scheme(root, relation);
+	rel->partexprs = (List **) palloc0(key->partnatts * sizeof(List *));
+
+	l = list_head(key->partexprs);
+	for (i = 0; i < key->partnatts; i++)
+	{
+		Expr	   *keyCol;
+
+		if (key->partattrs[i] != 0)
+		{
+			keyCol = (Expr *) makeVar(rel->relid,
+									  key->partattrs[i],
+									  key->parttypid[i],
+									  key->parttypmod[i],
+									  key->parttypcoll[i],
+									  0);
+		}
+		else
+		{
+			if (l == NULL)
+				elog(ERROR, "wrong number of partition key expressions");
+			keyCol = (Expr *) copyObject(lfirst(l));
+			l = lnext(l);
+		}
+
+		rel->partexprs[i] = list_make1(keyCol);
Re: [HACKERS] Setting pd_lower in GIN metapage

2017-09-15 Thread Amit Langote
On 2017/09/14 16:00, Michael Paquier wrote:
> On Wed, Sep 13, 2017 at 4:43 PM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> Sure, no problem.
> 
> OK, I took a closer look at all patches, but did not run any manual
> tests to test the compression except some stuff with
> wal_consistency_checking.

Thanks for the review.

> +   if (opaque->flags & GIN_DELETED)
> +   mask_page_content(page);
> +   else if (pagehdr->pd_lower != 0)
> +   mask_unused_space(page);
> [...]
> +   /* Mask the unused space, provided the page's pd_lower is set. */
> +   if (pagehdr->pd_lower != 0)
> mask_unused_space(page);
>
> For the masking functions, shouldn't those check use (pd_lower >
> SizeOfPageHeaderData) instead of 0? pd_lower gets set to a non-zero
> value on HEAD, so you would apply the masking even if the meta page is
> upgraded from an instance that did not enforce the value of pd_lower
> later on. Those conditions also definitely need comments. That will be
> a good reminder so as why it needs to be kept.

Agreed, done.
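
The gin masking check now looks along these lines (paraphrasing the
attached patches; the comment is the reminder asked for above):

    /*
     * Mask the unused space only if pd_lower appears to have been set
     * correctly; a meta page coming from an older instance that did not
     * enforce pd_lower may still carry the uninitialized value.
     */
    if (opaque->flags & GIN_DELETED)
        mask_page_content(page);
    else if (pagehdr->pd_lower > SizeOfPageHeaderData)
        mask_unused_space(page);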

> +*
> +* This won't be of any help unless we use option like REGBUF_STANDARD
> +* while registering the buffer for metapage as otherwise, it won't try to
> +* remove the hole while logging the full page image.
>  */
> This comment is in the btree code. But you actually add
> REGBUF_STANDARD. So I think that this could be just removed.
> 
>  * Set pd_lower just past the end of the metadata.  This is not essential
> -* but it makes the page look compressible to xlog.c.
> +* but it makes the page look compressible to xlog.c.  See
> +* _bt_initmetapage.
> This reference could be removed as well as _bt_initmetapage does not
> provide any information, the existing comment is wrong without your
> patch, and then becomes right with this patch.

Amit K's reply may have addressed these comments.

> After that I have spotted a couple of places for btree, hash and
> SpGist where the updates of pd_lower are not happening. Let's keep in
> mind that updated meta pages could come from an upgraded version, so
> we need to be careful here about all code paths updating meta pages,
> and WAL-logging them.
>
> It seems to me that an update of pd_lower is missing in _bt_getroot(),
> just before marking the buffer as dirty I think. And there is a second
> in _bt_unlink_halfdead_page(), a third in _bt_insertonpg() and finally
> one in _bt_newroot().

Amit K's reply about btree and hash should've resolved any doubts for
those index types.  About SP-Gist, see the comment below.

> For SpGist, I think that there are two missing: spgbuild() and spgGetCache().

spgbuild() calls SpGistInitMetapage() before marking the metapage buffer
dirty.  The latter already sets pd_lower correctly, so we don't need to do
it explicitly in spgbuild() itself.

spgGetCache() doesn't write the metapage, only reads it:

/* Last, get the lastUsedPages data from the metapage */
metabuffer = ReadBuffer(index, SPGIST_METAPAGE_BLKNO);
LockBuffer(metabuffer, BUFFER_LOCK_SHARE);

metadata = SpGistPageGetMeta(BufferGetPage(metabuffer));

if (metadata->magicNumber != SPGIST_MAGIC_NUMBER)
elog(ERROR, "index \"%s\" is not an SP-GiST index",
 RelationGetRelationName(index));

cache->lastUsedPages = metadata->lastUsedPages;

UnlockReleaseBuffer(metabuffer);

So, I think it won't be correct to set pd_lower here, no?

> For hash, hashbulkdelete(), _hash_vacuum_one_page(),
> _hash_addovflpage(), _hash_freeovflpage() and _hash_doinsert() are
> missing the shot, no? We could have a meta page of a hash index
> upgraded from PG10.

Amit K's reply. :)


Updated patch attached, which implements your suggested changes to the
masking functions.

By the way, as I noted on another unrelated thread, I will not be able to
respond to emails from tomorrow until 9/23.

Thanks,
Amit
From 6741334ce5f3261b5e5caffa3914d4e1485fe5d8 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Fri, 23 Jun 2017 11:20:41 +0900
Subject: [PATCH 1/3] Set pd_lower correctly in the GIN metapage.

Also tell xlog.c to treat the metapage like a standard page, so any
hole in it is compressed.
---
 src/backend/access/gin/ginfast.c   | 22 --
 src/backend/access/gin/gininsert.c |  4 ++--
 src/backend/access/gin/ginutil.c   | 19 ++-
 src/backend/access/gin/ginxlog.c   | 25 ++---
 4 files changed, 50 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 59e435465a..d96529cf72 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -399,6 +399,15 @@ ginHeapTupleFastInser

Re: [HACKERS] Race between SELECT and ALTER TABLE NO INHERIT

2017-09-15 Thread Amit Langote
On 2017/09/15 15:36, Amit Langote wrote:
> Because the
> parent is locked after the child, and with ShareUpdateExclusiveLock instead
> of AccessExclusiveLock, we observe this race condition when SELECTing from
> the parent.

Oops, I meant "parent table is locked with AccessShareLock instead of
AccessExclusiveLock..."

Thanks,
Amit





Re: [HACKERS] Race between SELECT and ALTER TABLE NO INHERIT

2017-09-15 Thread Amit Langote
Hi.

On 2017/08/28 18:28, Kyotaro HORIGUCHI wrote:
> << the following is another topic >>
> 
 BTW, in the partitioned table case, the parent is always locked first
 using an AccessExclusiveLock.  There are other considerations in that case
 such as needing to recreate the partition descriptor upon termination of
 inheritance (both the DETACH PARTITION and also DROP TABLE child cases).
>>>
>>> Apart from the degree of concurrency, if we keep parent->children
>>> order of locking, such recreation does not seem to be
>>> needed. Maybe I'm missing something.
>>
>> Sorry to have introduced that topic in this thread, but I will try to
>> explain anyway why things are the way they are currently:
>>
>> Once a table is no longer a partition of the parent (detached or dropped),
>> we must make sure that the next commands in the transaction don't see it
>> as one.  That information is currently present in the relcache
>> (rd_partdesc), which is used by a few callers, most notably the
>> tuple-routing code.  Next commands must recreate the entry so that the
>> correct thing happens based on the updated information.  More precisely,
>> we must invalidate the current entry.  RelationClearRelation() will either
>> delete the entry or rebuild it.  If it's being referenced somewhere, it
>> will be rebuilt.  The place holding the reference may also be looking at
>> the content of rd_partdesc, which we don't require them to make a copy of,
>> so we must preserve its content while other fields of RelationData are
>> being read anew from the catalog.  We don't have to preserve it if there
>> has been any change (partition added/dropped), but to make such a change
>> one would need to take a strong enough lock on the relation (parent).  We
>> assume here that anyone who wants to reference rd_partdesc takes at least
>> AccessShareLock lock on the relation, and anyone who wants to change its
>> content must take a lock that will conflict with it, so
>> AccessExclusiveLock.  Note that in all of this, we are only talking about
>> one relation, that is the parent, so parent -> child ordering of taking
>> locks may be irrelevant.
> 
> I think I understand this, anyway DropInherit and DropPartition
> is different-but-at-the-same-level operations so surely needs
> amendment for drop/detach cases. Is there already a solution? Or
> reproducing steps?

Sorry, I think I forgot to reply to this.  Since you seem to have chosen
the other solution (checking that child is still a child), maybe this
reply is a bit too late, but anyway.

DropInherit or NO INHERIT is seen primarily as changing a property of the
child table (which is the target table of the command), namely that it is
no longer a child of the parent, so we lock the child table to block
concurrent operations from considering it a child of the parent anymore.
Because the parent is locked after the child, and with
ShareUpdateExclusiveLock instead of AccessExclusiveLock, we observe this
race condition when SELECTing from the parent.

DropPartition or DETACH PARTITION is seen primarily as changing a property
of the parent table (which is the target table of the command), namely that
one of its partitions is removed, so we lock the parent.  Any concurrent
operations that rely on the parent's relcache to get the partition list
will wait for the session that is dropping the partition to finish, so
that they get fresh information from the relcache (or, more importantly,
do not end up with information obtained from the relcache going invalid
under them without notice).  Note that the lock on the partition/child is
also present and plays more or less the same role as it does in the
DropInherit case, but due to the different order of locking, the reported
race condition does not occur between SELECT on a partitioned table and
DROP/DETACH PARTITION.
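
A hedged illustration of the difference (lock levels as described above;
table names made up):

    -- session 1
    BEGIN;
    ALTER TABLE child NO INHERIT parent;    -- locks child first, then
                                            -- parent only weakly

    -- session 2, concurrently
    SELECT * FROM parent;                   -- its lock on parent does not
                                            -- conflict, so it may expand the
                                            -- inheritance tree while the NO
                                            -- INHERIT is still in progress

With DETACH PARTITION, the parent is locked first and exclusively, so
session 2's SELECT simply waits and never sees a half-changed partition
list.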


By the way, I will take a look at your patch when I come back from the
vacation.  Meanwhile, I noticed that it needs another rebase after
0a480502b092 [1].

Thanks,
Amit

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0a480502b092





Re: [HACKERS] Optimise default partition scanning while adding new partition

2017-09-15 Thread Amit Langote
On 2017/09/15 0:59, Robert Haas wrote:
> On Thu, Sep 14, 2017 at 4:03 AM, Jeevan Ladhe
>  wrote:
>> Thanks Amit for reviewing.
>>> Patch looks fine to me.  By the way, why don't we just say "Can we skip
>>> scanning part_rel?" in the comment before the newly added call to
>>> PartConstraintImpliedByRelConstraint()?  We don't need to repeat the
>>> explanation of what it does at the every place we call it.
>>
>> I have changed the comment.
>> PFA.
> 
> I'm probably missing something stupid, but why does this happen?
> 
>  ALTER TABLE list_parted2 ATTACH PARTITION part_7 FOR VALUES IN (7);
> -INFO:  partition constraint for table "list_parted2_def" is implied
> by existing constraints
>  ERROR:  partition constraint is violated by some row
> 
> Based on the regression test changes made up higher in the file, I'd
> expect that line to be replaced by two lines, one for
> list_parted2_def_p1 and another for list_parted2_def_p2, because at
> this point, list_parted2_def is a partitioned table with those two
> children, and they seem to have appropriate constraints.
> 
> list_parted2_def_p1 has
>  Check constraints:
>  "check_a" CHECK (a = ANY (ARRAY[21, 22]))
> 
> list_parted2_def_p2 has
>  Check constraints:
>  "check_a" CHECK (a = ANY (ARRAY[31, 32]))
> 
> Well, if a is 21 or 22 for the first, or 31 or 32 for the second, then
> it's not 7.  Or so I would think.

ISTM, Jeevan's patch modifies check_default_allows_bound(), which only
DefineRelation() calls as things stand today.  IOW, it doesn't handle the
ATTACH PARTITION case, so you're not seeing the lines for
list_parted2_def_p1 and list_parted2_def_p2, because ATExecAttachPartition
doesn't yet know how to skip the validation scan for default partition's
children.

I wonder if we should call check_default_allows_bound() from
ATExecAttachPartition(), too, instead of validating updated default
partition constraint using ValidatePartitionConstraints()?  That is, call
the latter only to validate the partition constraint of the table being
attached and call check_default_allows_bound() to validate the updated
default partition constraint.  That way, INFO/ERROR messages related to
default partition constraint are consistent across the board.  I hacked
that up in the attached 0001.  The patch also modifies the INFO message
that check_default_allows_bound() emits when the scan is skipped to make
it apparent that a default partition is involved.  (Sorry, this should've
been a review comment I should've posted before the default partition
patch was committed.)

Then apply 0002, which is Jeevan's patch.  With 0001 in place, Robert's
complaint is taken care of.

Then comes 0003, which is my patch to teach ValidatePartitionConstraints()
to skip child table scans using their existing constraint.

Thanks,
Amit
From 00765f2166c78902d0c92ed9d1a4dea033a50a12 Mon Sep 17 00:00:00 2001
From: amit 
Date: Fri, 15 Sep 2017 14:30:36 +0900
Subject: [PATCH 1/3] Teach ATExecAttachPartition to use
 check_default_allows_bound

Currently it schedules validation scan of the default partition
in the same way as it schedules the scan of the partition being
attached.  Instead, call check_default_allows_bound() to do the
scanning, so the handling is symmetric with DefineRelation().

Also, update the INFO message in check_default_allows_bound(), so
that it's clear from the message that the default partition is
involved.
---
 src/backend/catalog/partition.c   |  8 ++-
 src/backend/commands/tablecmds.c  | 40 ++-
 src/test/regress/expected/alter_table.out | 12 +-
 3 files changed, 20 insertions(+), 40 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 1ab6dba7ae..027b98d850 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -912,15 +912,11 @@ check_default_allows_bound(Relation parent, Relation default_rel,
 	def_part_constraints =
 		get_proposed_default_constraint(new_part_constraints);
 
-	/*
-	 * If the existing constraints on the default partition imply that it will
-	 * not contain any row that would belong to the new partition, we can
-	 * avoid scanning the default partition.
-	 */
+	/* Can we skip scanning default_rel? */
 	if (PartConstraintImpliedByRelConstraint(default_rel, def_part_constraints))
 	{
 		ereport(INFO,
-				(errmsg("partition constraint for table \"%s\" is implied by existing constraints",
+				(errmsg("updated partition constraint for default partition \"%s\" is implied by existing constraints",
 						RelationGetRelationName(default_rel))));
 		return;
 	}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 563bcda30c..ed4a2092b3 100644

Re: [HACKERS] path toward faster partition pruning

2017-09-14 Thread Amit Langote
Hi Dilip,

Thanks for looking at the patch.

On 2017/09/15 13:43, Dilip Kumar wrote:
> On Wed, Sep 6, 2017 at 4:08 PM, Amit Langote
>> [PATCH 2/5] WIP: planner-side changes for partition-pruning
>>
>> This patch adds a stub get_partitions_for_keys in partition.c with a
>> suitable interface for the caller to pass bounding keys extracted from the
>> query and other related information.
>>
>> Importantly, it implements the logic within the planner to match query's
>> scan keys to a parent table's partition key and form the bounding keys
>> that will be passed to partition.c to compute the list of partitions that
>> satisfy those bounds.
> 
> + Node   *leftop = get_leftop(clause);
> +
> + if (IsA(leftop, RelabelType))
> + leftop = (Node *) ((RelabelType *) leftop)->arg;
> +
> + if (equal(leftop, partkey))
> 
> It appears that this patch always assume that clause will be of form
> "var op const", but it can also be "const op var"
> 
> That's the reason in below example where in both the queries condition
> is same it can only prune in the first case but not in the second.
> 
> postgres=# explain select * from t where t.a < 2;
>QUERY PLAN
> 
>  Append  (cost=0.00..2.24 rows=1 width=8)
>->  Seq Scan on t1  (cost=0.00..2.24 rows=1 width=8)
>  Filter: (a < 2)
> (3 rows)
> 
> postgres=# explain select * from t where 2>t.a;
>QUERY PLAN
> 
>  Append  (cost=0.00..4.49 rows=2 width=8)
>->  Seq Scan on t1  (cost=0.00..2.24 rows=1 width=8)
>  Filter: (2 > a)
>->  Seq Scan on t2  (cost=0.00..2.25 rows=1 width=8)
>  Filter: (2 > a)
> (5 rows)

Yeah, there are a bunch of smarts still missing in that patch as it is.
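
The obvious fix is to commute such clauses before matching, roughly like
this (a sketch of the idea only, not code from the posted patches):

    /* turn (const op var) into (var commutator-op const) */
    Oid         expr_op = ((OpExpr *) clause)->opno;
    Oid         commutator = get_commutator(expr_op);

    if (OidIsValid(commutator))
    {
        OpExpr     *commuted = (OpExpr *) copyObject(clause);

        commuted->opno = commutator;
        commuted->opfuncid = InvalidOid;    /* filled in when needed */
        commuted->args = list_make2(lsecond(commuted->args),
                                    linitial(commuted->args));
        clause = (Expr *) commuted;
    }
    /* else: no commutator, so this clause cannot be used for pruning */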

Thanks,
Amit





Re: [HACKERS] path toward faster partition pruning

2017-09-14 Thread Amit Langote
On 2017/09/15 10:55, David Rowley wrote:
> On 21 August 2017 at 18:37, Amit Langote <langote_amit...@lab.ntt.co.jp> 
> wrote:
>> I've been working on implementing a way to perform plan-time
>> partition-pruning that is hopefully faster than the current method of
>> using constraint exclusion to prune each of the potentially many
>> partitions one-by-one.  It's not fully cooked yet though.
> 
> I'm interested in seeing improvements in this area, so I've put my
> name down to review this.

Great, thanks!

I will post rebased patches later today, although I think the overall
design of the patch on the planner side of things is not quite there yet.
Of course, your and others' feedback is greatly welcome.

Also, I must inform to all of those who're looking at this thread that I
won't be able to respond to emails from tomorrow (9/16, Sat) until 9/23,
Sat, due to some personal business.

Thanks,
Amit





Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-14 Thread Amit Langote
On 2017/09/15 4:43, Robert Haas wrote:
> On Thu, Sep 14, 2017 at 8:06 AM, Ashutosh Bapat
>  wrote:
>> I have few changes to multi-level expansion patch as per discussion in
>> earlier mails
> 
> OK, I have committed
> 0002-Multi-level-partitioned-table-expansion.patch with a few cosmetic
> changes.
> 
> Phew, getting that sorted out has been an astonishing amount of work.

Yeah, thanks to both of you.  Now on to other complicated stuff. :)

Regards,
Amit





Re: [HACKERS] expanding inheritance in partition bound order

2017-09-14 Thread Amit Langote
On 2017/09/15 1:37, Robert Haas wrote:
> On Thu, Sep 14, 2017 at 7:56 AM, Amit Khandekar <amitdkhan...@gmail.com> 
> wrote:
>> On 14 September 2017 at 06:43, Amit Langote
>>> langote_amit...@lab.ntt.co.jp> wrote:
>>> Attached updated patch.
>>
>> @@ -1222,151 +1209,130 @@ PartitionDispatch *
>>  RelationGetPartitionDispatchInfo(Relation rel,
>>  				 int *num_parted, List **leaf_part_oids)
>>  {
>> +	List	   *pdlist;
>>  	PartitionDispatchData **pd;
>>
>> +	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
>>
>> Above, pdlist is passed uninitialized. And then inside
>> get_partition_dispatch_recurse(), it is used here :
>> *pds = lappend(*pds, pd);
>>
>> 
>>
>> pg_indent says more alignments needed. For e.g. gettext_noop() call
>> below needs to be aligned:
>> pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
>> tupdesc,
>> gettext_noop("could not convert row type"));
>>
>> 
>>
>> Other than that, the patch looks good to me. I verified that the leaf
>> oids are ordered exaclty in the order of the UPDATE subplans output.
> 
> Committed with fixes for those issues and a few other cosmetic changes.

Thanks Amit for the review and Robert for committing.

Regards,
Amit





Re: [HACKERS] Re: [COMMITTERS] pgsql: Use MINVALUE/MAXVALUE instead of UNBOUNDED for range partition b

2017-09-14 Thread Amit Langote
On 2017/09/14 16:53, Dean Rasheed wrote:
> On 13 September 2017 at 10:05, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> Coincidentally, I just wrote the patch for canonicalizing stored values,
>> instead of erroring out.  Please see attached if that's what you were
>> thinking too.
>>
> 
> Looks reasonable to me, if we decide to go this way.
> One minor review comment --

Thanks for the review.

> it isn't really necessary to have the
> separate new boolean local variables as well as the datum kind
> variables. Just tracking the datum kind is sufficient and slightly
> simpler. That would also address a slight worry I have that your
> coding might result in a compiler warning about the kind variables
> being used uninitialised with some less intelligent compilers, not
> smart enough to work out that it can't actually happen.

Got rid of the boolean variables in the updated patch.

Thanks,
Amit
From 61bfaeb46623572f5adba0b624d00dfecb6ed495 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Wed, 13 Sep 2017 17:51:38 +0900
Subject: [PATCH] Canonicalize catalog representation of range partition bounds

Since the system effectively ignores values of the bound for
partition key columns following the first unbounded key, it's pointless
to accurately remember them in the system catalog.  Instead store
minvalue or maxvalue for all such columns, matching what the first
unbounded key is.  This makes \d output of range partitions look
more canonical.
---
 doc/src/sgml/ref/create_table.sgml |  6 +-
 src/backend/parser/parse_utilcmd.c | 30 --
 src/test/regress/expected/create_table.out | 20 +---
 src/test/regress/expected/insert.out   |  8 
 src/test/regress/sql/create_table.sql  |  6 ++
 5 files changed, 60 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/create_table.sgml 
b/doc/src/sgml/ref/create_table.sgml
index 824253de40..7d2eb30722 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -328,7 +328,11 @@ FROM ( { numeric_literal | MAXVALUE in a partition bound are ignored; so the bound
   (10, MINVALUE, 0) is equivalent to
   (10, MINVALUE, 10) and (10, MINVALUE, MINVALUE)
-  and (10, MINVALUE, MAXVALUE).
+  and (10, MINVALUE, MAXVALUE).  Moreover, the system will
+  store (10, MINVALUE, MINVALUE) into the system catalog if
+  the user specifies (10, MINVALUE, 0), and
+  (10, MAXVALUE, MAXVALUE) if the user specifies
+  (10, MAXVALUE, 0).
  
 
  
diff --git a/src/backend/parser/parse_utilcmd.c 
b/src/backend/parser/parse_utilcmd.c
index 655da02c10..ca983105cc 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -3381,6 +3381,8 @@ transformPartitionBound(ParseState *pstate, Relation 
parent,
   *cell2;
int i,
j;
+   PartitionRangeDatumKind lower_kind = 
PARTITION_RANGE_DATUM_VALUE,
+   
upper_kind = PARTITION_RANGE_DATUM_VALUE;
 
if (spec->strategy != PARTITION_STRATEGY_RANGE)
ereport(ERROR,
@@ -3426,7 +3428,17 @@ transformPartitionBound(ParseState *pstate, Relation 
parent,
coltype = get_partition_col_typid(key, i);
coltypmod = get_partition_col_typmod(key, i);
 
-   if (ldatum->value)
+   /*
+* If we've found the first infinite bound, use the 
same for
+* subsequent columns.
+*/
+   if (lower_kind != PARTITION_RANGE_DATUM_VALUE)
+   {
+   ldatum = copyObject(ldatum);/* don't 
scribble on input */
+   ldatum->kind = lower_kind;
+   ldatum->value = NULL;
+   }
+   else if (ldatum->value)
{
con = castNode(A_Const, ldatum->value);
value = transformPartitionBoundValue(pstate, 
con,
@@ -3439,8 +3451,20 @@ transformPartitionBound(ParseState *pstate, Relation 
parent,
ldatum = copyObject(ldatum);/* don't 
scribble on input */
ldatum->value = (Node *) value;
}
+   else
+   lower_kind = ldatum->kind;
 
-   if (rdatum->value)
+   /*
+* If we've found the first infinite bound, use the 
same for
+* subsequent colu

Re: [HACKERS] moving some partitioning code to executor

2017-09-14 Thread Amit Langote
Repeating links for better accessibility:

On 2017/09/14 16:13, Amit Langote wrote:
> [1]

https://www.postgresql.org/message-id/CA%2BTgmoafr%3DhUrM%3Dcbx-k%3DBDHOF2OfXaw95HQSNAK4mHBwmSjtw%40mail.gmail.com

> [2]

https://www.postgresql.org/message-id/7fe0007b-7ad1-a593-df11-ad05364ebce4%40lab.ntt.co.jp

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] moving some partitioning code to executor

2017-09-14 Thread Amit Langote
Hi.

It seems to me that some of the code in partition.c is better placed
somewhere under the executor directory.  There was even a suggestion
recently [1] to introduce an execPartition.c to house some code around
tuple-routing.

IMO, catalog/partition.c should present an interface for handling
operations on a *single* partitioned table and avoid pretending to support
any operations on the whole partition tree.  For example, the
PartitionDispatch structure embeds some knowledge about the partition tree
it's part of, which is useful when used for tuple-routing, because of the
way it works now (lock and form ResultRelInfos of *all* leaf partitions
before the first input row is processed).

So, let's move that structure, along with the code that creates and
manipulates the same, out of partition.c/h and to execPartition.c/h.
Attached patch attempts to do that.

While doing the same, I didn't move *all* of get_partition_for_tuple() out
to execPartition.c, instead modified its signature as shown below:

-extern int get_partition_for_tuple(PartitionDispatch *pd,
-TupleTableSlot *slot,
-EState *estate,
-PartitionDispatchData **failed_at,
-TupleTableSlot **failed_slot);
+extern int get_partition_for_tuple(Relation relation, Datum *values,
+bool *isnull);

That way, we keep the core partition bound comparison logic inside
partition.c and move rest of the stuff to its caller ExecFindPartition(),
which includes navigating the enveloping PartitionDispatches.
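
To sketch the resulting division of labor (a rough illustration only;
the actual routine in the attached patch differs in details such as
error reporting, tuple conversion, and partition key extraction):

    /* execPartition.c side (sketch): walk the PartitionDispatch array */
    int
    ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
                      TupleTableSlot *slot, EState *estate)
    {
        PartitionDispatch parent = pd[0];
        Datum       values[PARTITION_MAX_KEYS];
        bool        isnull[PARTITION_MAX_KEYS];

        for (;;)
        {
            int         cur_index;

            FormPartitionKeyDatum(parent, slot, estate, values, isnull);

            /* partition.c now keeps only the bound-comparison logic */
            cur_index = get_partition_for_tuple(parent->reldesc,
                                                values, isnull);

            if (cur_index < 0)
                elog(ERROR, "no partition found");  /* real code uses
                                                     * resultRelInfo to
                                                     * report details */
            if (parent->indexes[cur_index] >= 0)
                return parent->indexes[cur_index];  /* found a leaf */

            /* otherwise descend into the sub-partitioned table */
            parent = pd[-parent->indexes[cur_index]];
        }
    }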

Thoughts?

PS: 0001 of the attached is the patch from [2] which is here to be applied
on HEAD before applying the main patch (0002) itself

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoafr%3DhUrM%3Dcbx-k%3DBDHOF2OfXaw95HQSNAK4mHBwmSjtw%40mail.gmail.com

[2]
https://www.postgresql.org/message-id/7fe0007b-7ad1-a593-df11-ad05364ebce4%40lab.ntt.co.jp
From cb16fb6c89f60a8cf6b8f713ca9c831fe3824b2d Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Fri, 8 Sep 2017 17:35:10 +0900
Subject: [PATCH 1/2] Make RelationGetPartitionDispatch expansion order
 depth-first

This is so as it matches what the planner is doing with partitioning
inheritance expansion.  Matching with planner order helps because
it helps ease matching the executor's per-partition objects with
planner-created per-partition nodes.
---
 src/backend/catalog/partition.c | 242 
 1 file changed, 99 insertions(+), 143 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 73eff17202..ddb46a80cb 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -147,6 +147,8 @@ static int32 partition_bound_cmp(PartitionKey key,
 static int partition_bound_bsearch(PartitionKey key,
PartitionBoundInfo boundinfo,
void *probe, bool 
probe_is_bound, bool *is_equal);
+static void get_partition_dispatch_recurse(Relation rel, Relation parent,
+  List **pds, List 
**leaf_part_oids);
 
 /*
  * RelationBuildPartitionDesc
@@ -1192,21 +1194,6 @@ get_partition_qual_relid(Oid relid)
 }
 
 /*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
-   do\
-   {\
-   int i;\
-   for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
-   {\
-   (partoids) = lappend_oid((partoids), 
(rel)->rd_partdesc->oids[i]);\
-   (parents) = lappend((parents), (rel));\
-   }\
-   } while(0)
-
-/*
  * RelationGetPartitionDispatchInfo
  * Returns information necessary to route tuples down a partition 
tree
  *
@@ -1222,151 +1209,120 @@ PartitionDispatch *
 RelationGetPartitionDispatchInfo(Relation rel,
 int 
*num_parted, List **leaf_part_oids)
 {
+   List   *pdlist;
PartitionDispatchData **pd;
-   List   *all_parts = NIL,
-  *all_parents = NIL,
-  *parted_rels,
-  *parted_rel_parents;
-   ListCell   *lc1,
-  *lc2;
-   int i,
-   k,
-   offset;
+   ListCell *lc;
+   int i;
 
-   /*
-* We rely on the relcache to traverse the partition tree to build both
-* the leaf partition OIDs list and the array of PartitionDispatch 
objects
-* for the partitioned tables in the tree.  That means every partitioned
-* table in the tree must be locked, which is fine since we require the
-   

Re: [HACKERS] Optimise default partition scanning while adding new partition

2017-09-13 Thread Amit Langote
Hi Jeevan,

On 2017/09/12 18:22, Jeevan Ladhe wrote:
> Commit 6f6b99d1335be8ea1b74581fc489a97b109dd08a introduced default
> partitioning support. This commit added a new function
> check_default_allows_bound(),
> which checks if there exists a row in the default partition that would
> belong to
> the new partition being added. If it finds one, it throws an error. Before
> taking
> the decision to scan the default partition, this function checks if there
> are
> existing constraints on default partition that would imply the new partition
> constraints, if yes it skips scanning the default partition, otherwise it
> scans the
> default partition and its children(if any). But, while doing so the current
> code
> misses the fact that there can be constraints on the child of default
> partition
> such that they would imply the constraints of the new partition being added,
> and hence individual child scan can also be skipped.
> Attached is the patch which does this.
> 
> This is previously discussed in default partitioning thread[1], and decision
> was made that we can take this a separate patch rather than as a part of the
> default partitioning support.

Patch looks fine to me.  By the way, why don't we just say "Can we skip
scanning part_rel?" in the comment before the newly added call to
PartConstraintImpliedByRelConstraint()?  We don't need to repeat the
explanation of what it does at every place we call it.
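
To illustrate the case being optimized (hypothetical tables):

    CREATE TABLE p (a int) PARTITION BY LIST (a);
    CREATE TABLE p_def PARTITION OF p DEFAULT PARTITION BY RANGE (a);
    CREATE TABLE p_def_1 PARTITION OF p_def FOR VALUES FROM (20) TO (30);

    -- p_def_1's partition constraint (20 <= a < 30) already implies that
    -- it cannot contain a row with a = 5, so the scan of p_def_1 can be
    -- skipped when the following partition is added:
    CREATE TABLE p_1 PARTITION OF p FOR VALUES IN (5);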

> Amit Langote has a similar patch[2] for scanning the children of a
> partitioned
> table which is being attached as partition of another partitioned table.

I just posted my rebased patch there [1].

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/a83a0899-19f5-594c-9aac-3ba0f16989a1%40lab.ntt.co.jp



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A bug in mapping attributes in ATExecAttachPartition()

2017-09-13 Thread Amit Langote
On 2017/08/07 11:05, Amit Langote wrote:
> By the way, the bulk of 0004 is refactoring which, it seems, Jeevan's
> default partition patch set also includes as one of its patches [1].  It
> got a decent amount of review from Ashutosh.
> patch, so that the patch to add the new feature is its own tiny patch.
> 
> I also spotted a couple of comments referring to attachRel that we just
> recently renamed.
> 
> So, attached are:
> 
> 0001: s/attachRel/attachrel/g
> 0002: Refactoring to introduce a PartConstraintImpliedByRelConstraint
> 0003: Add the feature to skip the scan of individual leaf partitions

Since Jeevan Ladhe mentioned this patch [1] earlier this week, sending the
rebased patches here for consideration.  Actually there are only 2 patches
now, because 0002 above is rendered unnecessary by ecfe59e50fb [2].

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAOgcT0MWwG8WBw8frFMtRYHAgDD=tpt6u7wcso_l2k0kypm...@mail.gmail.com

[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ecfe59e50fb
From 55e1e14a821de541c2d24c152c193bf57eb91d43 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Mon, 7 Aug 2017 10:45:39 +0900
Subject: [PATCH 1/2] Typo: attachRel is now attachrel

---
 src/backend/commands/tablecmds.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 96354bdee5..563bcda30c 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -13779,7 +13779,7 @@ ATExecAttachPartition(List **wqueue, Relation rel, 
PartitionCmd *cmd)
 * Prevent circularity by seeing if rel is a partition of attachrel. (In
 * particular, this disallows making a rel a partition of itself.)
 *
-* We do that by checking if rel is a member of the list of attachRel's
+* We do that by checking if rel is a member of the list of attachrel's
 * partitions provided the latter is partitioned at all.  We want to 
avoid
 * having to construct this list again, so we request the strongest lock
 * on all partitions.  We need the strongest lock, because we may decide
-- 
2.11.0

From a7d0d781bd9e3730f90d902d0e09abf79962f872 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Mon, 7 Aug 2017 10:51:47 +0900
Subject: [PATCH 2/2] Teach ATExecAttachPartition to skip validation in more
 cases

In cases where the table being attached is a partitioned table and
the table itself does not have constraints that would allow validation
on the whole table to be skipped, we can still skip the validations
of individual partitions if they each happen to have the requisite
constraints.

Per an idea of Robert Haas', with code refactoring suggestions from
Ashutosh Bapat.
---
 src/backend/commands/tablecmds.c  | 10 ++
 src/test/regress/expected/alter_table.out | 13 +
 src/test/regress/sql/alter_table.sql  | 10 ++
 3 files changed, 33 insertions(+)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 563bcda30c..901eea7fe2 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -13678,6 +13678,16 @@ ValidatePartitionConstraints(List **wqueue, Relation 
scanrel,
/* There can never be a whole-row reference here */
if (found_whole_row)
elog(ERROR, "unexpected whole-row reference 
found in partition key");
+
+   /* Check if we can skip scanning this part_rel. */
+   if (PartConstraintImpliedByRelConstraint(part_rel, 
my_partconstr))
+   {
+   ereport(INFO,
+   (errmsg("partition constraint 
for table \"%s\" is implied by existing constraints",
+   
RelationGetRelationName(part_rel))));
+   heap_close(part_rel, NoLock);
+   continue;
+   }
}
 
/* Grab a work queue entry. */
diff --git a/src/test/regress/expected/alter_table.out 
b/src/test/regress/expected/alter_table.out
index 0478a8ac60..e3415837b6 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -3464,6 +3464,19 @@ ERROR:  updated partition constraint for default 
partition would be violated by
 -- should be ok after deleting the bad row
 DELETE FROM part5_def_p1 WHERE b = 'y';
 ALTER TABLE part_5 ATTACH PARTITION part5_p1 FOR VALUES IN ('y');
+-- If the partitioned table being attached does not have a constraint that
+-- would allow validation scan to be skipped, but an individual partition
+-- does, then the partition's validat

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-13 Thread Amit Langote
On 2017/09/14 7:43, Robert Haas wrote:
> On Wed, Sep 13, 2017 at 12:56 PM, Ashutosh Bapat
>  wrote:
>> I debugged what happens in case of query "select 1 from t1 union all
>> select 2 from t1;" with the current HEAD (without multi-level
>> expansion patch attached). It doesn't set partitioned_rels in Append
>> path that gets converted into Append plan. Remember t1 is a
>> multi-level partitioned table here with t1p1 as its immediate
>> partition and t1p1p1 as partition of t1p1. So, the
>> set_append_rel_pathlist() recurses once as shown in the following
>> stack trace.
> 
> Nice debugging.

+1.

> I spent some time today looking at this and I think
> it's a bug in v10, and specifically in add_paths_to_append_rel(),
> which only sets partitioned_rels correctly when the appendrel is a
> partitioned rel, and not when it's a subquery RTE with one or more
> partitioned queries beneath it.
> 
> Attached are two patches either one of which will fix it.  First, I
> wrote mechanical-partrels-fix.patch, which just mechanically
> propagates partitioned_rels lists from accumulated subpaths into the
> list used to construct the parent (Merge)AppendPath.  I wasn't entirely
> happy with that, because it ends up building multiple partitioned_rels
> lists for the same RelOptInfo.  That seems silly, but there's no
> principled way to avoid it; avoiding it amounts to hoping that all the
> paths for the same relation carry the same partitioned_rels list,
> which is uncomfortable.
> 
> So then I wrote pcinfo-for-subquery.patch.  That patch notices when an
> RTE_SUBQUERY appendrel is processed and accumulates the
> partitioned_rels of its immediate children; in case there can be
> multiple nested levels of subqueries before we get down to the actual
> partitioned rel, it also adds a PartitionedChildRelInfo for the
> subquery RTE, so that there's no need to walk the whole tree to build
> the partitioned_rels list at higher levels, just the immediate
> children.  I find this fix a lot more satisfying.  It adds less code
> and does no extra work in the common case.

I very much like pcinfo-for-subquery.patch, although I'm not sure if we
need to create PartitionedChildRelInfo for the sub-query parent RTE as the
patch teaches add_paths_to_append_rel() to do.  ISTM, nested UNION ALL
subqueries are flattened way before we get to add_paths_to_append_rel();
if it could not be flattened, there wouldn't be a call to
add_paths_to_append_rel() in the first place, because no AppendRelInfos
would be generated.  See what happens when is_simple_union_all_recurse()
returns false to flatten_simple_union_all() -- no AppendRelInfos will be
generated and added to root->append_rel_list in that case.

IOW, there won't be nested AppendRelInfos for nested UNION ALL sub-queries
like we're setting out to build for multi-level partitioned tables.
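
For instance, with t1 as the multi-level partitioned table in question,
both of the following end up as a single flat appendrel (no nested
AppendRelInfos), because flatten_simple_union_all() flattens the whole
set-operation tree in one go:

    SELECT 1 FROM t1 UNION ALL SELECT 2 FROM t1;

    SELECT 1 FROM t1
    UNION ALL
    (SELECT 2 FROM t1 UNION ALL SELECT 3 FROM t1);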

So, as things stand today, there can at most be one recursive call of
add_paths_to_append_rel() for a sub-query parent RTE, that is, if its child
sub-queries contain partitioned tables, but not more.  The other patch
(multi-level expansion of partitioned tables) will change that, but even
then we won't need the sub-query's own PartitionedChildRelInfo.

> Notice that the choice of fix we adopt has consequences for your
> 0001-Multi-level-partitioned-table-expansion.patch -- with
> mechanical-partrels-fix.patch, that patch could either associated all
> partitioned_rels with the top-parent or it could work level by level
> and everything would get properly assembled later.  But with
> pcinfo-for-subquery.patch, we need everything associated with the
> top-parent.  That doesn't seem like a problem to me, but it's
> something to note.

I think it's fine.

With 0001-Multi-level-partitioned-table-expansion.patch,
get_partitioned_child_rels() will get called even for non-root partitioned
tables, for which it won't find a valid pcinfo.  I think that patch must
also change its callers to stop Asserting that a valid pcinfo is returned.
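
In other words, something like this in the callers (just a sketch):

    partitioned_rels = get_partitioned_child_rels(root, rel->relid);

    /*
     * With multi-level expansion, non-root partitioned parents won't
     * have a PartitionedChildRelInfo of their own, so a NIL result is
     * no longer a can't-happen case.
     */
    if (partitioned_rels == NIL)
        return;     /* rel is not the root of a partition tree */

    /* was: Assert(list_length(partitioned_rels) >= 1); */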

Spotted a typo in pcinfo-for-subquery.patch:

+ * A plain relation will alread have

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] expanding inheritance in partition bound order

2017-09-13 Thread Amit Langote
On 2017/09/14 1:42, Robert Haas wrote:
> On Wed, Sep 13, 2017 at 6:02 AM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> It seems to me we don't really need the first patch all that much.  That
>> is, let's keep PartitionDispatchData the way it is for now, since we don't
>> really have any need for it besides tuple-routing (EIBO as committed didn't
>> need it for one).  So, let's forget about "decoupling
>> RelationGetPartitionDispatchInfo() from the executor" thing for now and
>> move on.
>>
>> So, attached is just the patch to make RelationGetPartitionDispatchInfo()
>> traverse the partition tree in depth-first manner to be applied on HEAD.
> 
> I like this patch.  Not only does it improve the behavior, but it's
> actually less code than we have now, and in my opinion, the new code
> is easier to understand, too.
> 
> A few suggestions:

Thanks for the review.

> - I think get_partition_dispatch_recurse() get a check_stack_depth()
> call just like expand_partitioned_rtentry() did, and for the same
> reasons: if the catalog contents are corrupted so that we have an
> infinite loop in the partitioning hierarchy, we want to error out, not
> crash.

Ah, missed that.  Done.

> - I think we should add a comment explaining that we're careful to do
> this in the same order as expand_partitioned_rtentry() so that callers
> can assume that the N'th entry in the leaf_part_oids array will also
> be the N'th child of an Append node.

Done.  Since the Append/ModifyTable may skip some leaf partitions due to
pruning, added a note about that too.

> + * For every partitioned table other than root, we must store a
> 
> other than the root
> 
> + * partitioned table.  The value multiplied back by -1 is returned as the
> 
> multiplied by -1, not multiplied back by -1
> 
> + * tables in the tree, using which, search is continued further down the
> + * partition tree.
> 
> Period after "in the tree".  Then continue: "This value is used to
> continue the search in the next level of the partition tree."

Fixed.

Attached updated patch.

Thanks,
Amit
From c2599d52267cc362e059452efe69ddd09b94c083 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Fri, 8 Sep 2017 17:35:10 +0900
Subject: [PATCH] Make RelationGetPartitionDispatch expansion order depth-first

This is so as it matches what the planner is doing with partitioning
inheritance expansion.  Matching with planner order helps because
it helps ease matching the executor's per-partition objects with
planner-created per-partition nodes.
---
 src/backend/catalog/partition.c | 252 +---
 1 file changed, 109 insertions(+), 143 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 73eff17202..36f52ddb98 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -147,6 +147,8 @@ static int32 partition_bound_cmp(PartitionKey key,
 static int partition_bound_bsearch(PartitionKey key,
PartitionBoundInfo boundinfo,
void *probe, bool 
probe_is_bound, bool *is_equal);
+static void get_partition_dispatch_recurse(Relation rel, Relation parent,
+  List **pds, List 
**leaf_part_oids);
 
 /*
  * RelationBuildPartitionDesc
@@ -1192,21 +1194,6 @@ get_partition_qual_relid(Oid relid)
 }
 
 /*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
-   do\
-   {\
-   int i;\
-   for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
-   {\
-   (partoids) = lappend_oid((partoids), 
(rel)->rd_partdesc->oids[i]);\
-   (parents) = lappend((parents), (rel));\
-   }\
-   } while(0)
-
-/*
  * RelationGetPartitionDispatchInfo
  * Returns information necessary to route tuples down a partition 
tree
  *
@@ -1222,151 +1209,130 @@ PartitionDispatch *
 RelationGetPartitionDispatchInfo(Relation rel,
 int 
*num_parted, List **leaf_part_oids)
 {
+   List   *pdlist;
PartitionDispatchData **pd;
-   List   *all_parts = NIL,
-  *all_parents = NIL,
-  *parted_rels,
-  *parted_rel_parents;
-   ListCell   *lc1,
-  *lc2;
-   int i,
-   k,
-   offset;
+   ListCell *lc;
+   int i;
 
-   /*

Re: [HACKERS] expanding inheritance in partition bound order

2017-09-13 Thread Amit Langote
On 2017/09/11 18:56, Amit Langote wrote:
> Attached updated patch does it that way for both partitioned table indexes
> and leaf partition indexes.  Thanks for pointing it out.

It seems to me we don't really need the first patch all that much.  That
is, let's keep PartitionDispatchData the way it is for now, since we don't
really have any need for it besides tuple-routing (EIBO as committed didn't
need it for one).  So, let's forget about "decoupling
RelationGetPartitionDispatchInfo() from the executor" thing for now and
move on.

So, attached is just the patch to make RelationGetPartitionDispatchInfo()
traverse the partition tree in depth-first manner to be applied on HEAD.
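
Schematically, the new traversal looks like below (the actual patch also
builds each table's PartitionDispatch, tuple-conversion map, and indexes
array along the way):

    static void
    get_partition_dispatch_recurse(Relation rel, Relation parent,
                                   List **pds, List **leaf_part_oids)
    {
        PartitionDesc partdesc = RelationGetPartitionDesc(rel);
        int         i;

        /* append rel's own PartitionDispatch to *pds here (omitted) */

        for (i = 0; i < partdesc->nparts; i++)
        {
            Oid         partrelid = partdesc->oids[i];

            if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
                *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
            else
            {
                /* recurse before moving to the next sibling: depth-first */
                Relation    partrel = heap_open(partrelid, NoLock);

                get_partition_dispatch_recurse(partrel, rel, pds,
                                               leaf_part_oids);
            }
        }
    }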

Thoughts?

Thanks,
Amit
From 1e99c776eda30c29fdb0e48570d6b3acd6b9a05d Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Fri, 8 Sep 2017 17:35:10 +0900
Subject: [PATCH] Make RelationGetPartitionDispatch expansion order depth-first

This is so as it matches what the planner is doing with partitioning
inheritance expansion.  Matching with planner order helps because
it helps ease matching the executor's per-partition objects with
planner-created per-partition nodes.
---
 src/backend/catalog/partition.c | 242 
 1 file changed, 99 insertions(+), 143 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 73eff17202..ddb46a80cb 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -147,6 +147,8 @@ static int32 partition_bound_cmp(PartitionKey key,
 static int partition_bound_bsearch(PartitionKey key,
PartitionBoundInfo boundinfo,
void *probe, bool 
probe_is_bound, bool *is_equal);
+static void get_partition_dispatch_recurse(Relation rel, Relation parent,
+  List **pds, List 
**leaf_part_oids);
 
 /*
  * RelationBuildPartitionDesc
@@ -1192,21 +1194,6 @@ get_partition_qual_relid(Oid relid)
 }
 
 /*
- * Append OIDs of rel's partitions to the list 'partoids' and for each OID,
- * append pointer rel to the list 'parents'.
- */
-#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \
-   do\
-   {\
-   int i;\
-   for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\
-   {\
-   (partoids) = lappend_oid((partoids), 
(rel)->rd_partdesc->oids[i]);\
-   (parents) = lappend((parents), (rel));\
-   }\
-   } while(0)
-
-/*
  * RelationGetPartitionDispatchInfo
  * Returns information necessary to route tuples down a partition 
tree
  *
@@ -1222,151 +1209,120 @@ PartitionDispatch *
 RelationGetPartitionDispatchInfo(Relation rel,
 int 
*num_parted, List **leaf_part_oids)
 {
+   List   *pdlist;
PartitionDispatchData **pd;
-   List   *all_parts = NIL,
-  *all_parents = NIL,
-  *parted_rels,
-  *parted_rel_parents;
-   ListCell   *lc1,
-  *lc2;
-   int i,
-   k,
-   offset;
+   ListCell *lc;
+   int i;
 
-   /*
-* We rely on the relcache to traverse the partition tree to build both
-* the leaf partition OIDs list and the array of PartitionDispatch 
objects
-* for the partitioned tables in the tree.  That means every partitioned
-* table in the tree must be locked, which is fine since we require the
-* caller to lock all the partitions anyway.
-*
-* For every partitioned table in the tree, starting with the root
-* partitioned table, add its relcache entry to parted_rels, while also
-* queuing its partitions (in the order in which they appear in the
-* partition descriptor) to be looked at later in the same loop.  This 
is
-* a bit tricky but works because the foreach() macro doesn't fetch the
-* next list element until the bottom of the loop.
-*/
-   *num_parted = 1;
-   parted_rels = list_make1(rel);
-   /* Root partitioned table has no parent, so NULL for parent */
-   parted_rel_parents = list_make1(NULL);
-   APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents);
-   forboth(lc1, all_parts, lc2, all_parents)
+   Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+
+   *num_parted = 0;
+   *leaf_part_oids = NIL;
+
+   get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
+   *num_parted = list_length(pdlist);
+   pd = (PartitionDispatchData **) palloc(*num_parted *
+      sizeof(PartitionDispatchData *));

Re: [HACKERS] Re: [COMMITTERS] pgsql: Use MINVALUE/MAXVALUE instead of UNBOUNDED for range partition b

2017-09-13 Thread Amit Langote
Hi Dean,

On 2017/09/13 17:51, Dean Rasheed wrote:
> Robert Haas  writes:
>> On Tue, Sep 12, 2017 at 9:58 AM, Alvaro Herrera  
>> wrote:
>>> Did anything happen on this, or did we just forget it completely?
>>
>> I forgot it.  :-(
>>
>> I really think we should fix this.
> 
> Ah, sorry. This was for me to follow up, and I dropped the ball.
> 
> Here's a patch restoring the original error checks (actually not the
> same coding as used in previous versions of the patch, because that
> would have allowed a MINVALUE after a MAXVALUE and vice versa).
> 
> A drawback to doing this is that we lose compatibility with syntaxes
> supported by other databases, which was part of the reason for
> choosing the terms MINVALUE and MAXVALUE in the first place.
> 
> So thinking about this afresh, my preference would actually be to just
> canonicalise the values stored rather than erroring out.
> 
> Thoughts?

Coincidentally, I just wrote the patch for canonicalizing stored values,
instead of erroring out.  Please see attached if that's what you were
thinking too.
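
For example, with the patch (table names made up):

    CREATE TABLE rp (a int, b int, c int) PARTITION BY RANGE (a, b, c);
    CREATE TABLE rp1 PARTITION OF rp
        FOR VALUES FROM (1, MINVALUE, 0) TO (10, MAXVALUE, 0);

    -- \d rp1 then shows the canonicalized bounds that were stored:
    --   Partition of: rp FOR VALUES FROM (1, MINVALUE, MINVALUE)
    --                              TO (10, MAXVALUE, MAXVALUE)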

Thanks,
Amit
From e5fc04c915cb98b0c56b234372ece4b9b1043033 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Wed, 13 Sep 2017 17:51:38 +0900
Subject: [PATCH] Canonicalize catalog representation of range partition bounds

Since the system effectively ignores values of the bound for
partition key columns following the first unbounded key, it's pointless
to accurately remember them in the system catalog.  Instead store
minvalue or maxvalue for all such columns, matching what the first
unbounded key is.  This makes \d output of range partitions look
more canonical.
---
 doc/src/sgml/ref/create_table.sgml |  6 +-
 src/backend/parser/parse_utilcmd.c | 30 --
 src/test/regress/expected/create_table.out |  6 +++---
 src/test/regress/expected/insert.out   |  8 
 4 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/create_table.sgml 
b/doc/src/sgml/ref/create_table.sgml
index 824253de40..7d2eb30722 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -328,7 +328,11 @@ FROM ( { numeric_literal | MAXVALUE in a partition bound are ignored; so the bound
   (10, MINVALUE, 0) is equivalent to
   (10, MINVALUE, 10) and (10, MINVALUE, MINVALUE)
-  and (10, MINVALUE, MAXVALUE).
+  and (10, MINVALUE, MAXVALUE).  Moreover, the system will
+  store (10, MINVALUE, MINVALUE) into the system catalog if
+  the user specifies (10, MINVALUE, 0), and
+  (10, MAXVALUE, MAXVALUE) if the user specifies
+  (10, MAXVALUE, 0).
  
 
  
diff --git a/src/backend/parser/parse_utilcmd.c 
b/src/backend/parser/parse_utilcmd.c
index 655da02c10..9086ac90ef 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -3381,6 +3381,10 @@ transformPartitionBound(ParseState *pstate, Relation 
parent,
   *cell2;
int i,
j;
+   PartitionRangeDatumKind first_lower_unbounded_kind,
+   
first_upper_unbounded_kind;
+   boollower_unbounded_found = false,
+   upper_unbounded_found = false;
 
if (spec->strategy != PARTITION_STRATEGY_RANGE)
ereport(ERROR,
@@ -3426,7 +3430,13 @@ transformPartitionBound(ParseState *pstate, Relation 
parent,
coltype = get_partition_col_typid(key, i);
coltypmod = get_partition_col_typmod(key, i);
 
-   if (ldatum->value)
+   if (lower_unbounded_found)
+   {
+   ldatum = copyObject(ldatum);
+   ldatum->kind = first_lower_unbounded_kind;
+   ldatum->value = NULL;
+   }
+   else if (ldatum->value)
{
con = castNode(A_Const, ldatum->value);
value = transformPartitionBoundValue(pstate, 
con,
@@ -3439,8 +3449,19 @@ transformPartitionBound(ParseState *pstate, Relation 
parent,
ldatum = copyObject(ldatum);/* don't 
scribble on input */
ldatum->value = (Node *) value;
}
+   else
+   {
+   lower_unbounded_found = true;
+   first_lower_unbounded_kind = ldatum->kind;
+   }
 
-   if (rdatum->value)
+   if (upper_unbounded_found)
+   {
+   rdatum = 

Re: [HACKERS] no test coverage for ALTER FOREIGN DATA WRAPPER name HANDLER ...

2017-09-13 Thread Amit Langote
On 2017/09/13 16:59, Ashutosh Bapat wrote:
> On Wed, Sep 13, 2017 at 1:27 PM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> On 2017/09/13 16:42, Ashutosh Bapat wrote:
>>> On Wed, Sep 13, 2017 at 7:49 AM, Amit Langote wrote:
>>>> In the attached updated patch, I created separate .source files in
>>>> src/test/regress/input and output directories called fdw_handler.source
>>>> and put the test_fdw_handler function definition there.  When I had
>>>> originally thought of it back when I wrote the patch, it seemed to be an
>>>> overkill, because we're just normally defining a single C function there
>>>> to be used in the newly added foreign_data tests.  In any case, we need to
>>>> go the .source file way, because that's the only way to refer to paths to
>>>> the .so library when defining C language functions.
>>>
>>> It still looks like an overkill to add a new file to define a dummy
>>> FDW handler. Why do we need to define a handler as a C function? Can't
>>> we define handler as a SQL function. If we could do that we could add
>>> the function definition in foreign_data.sql itself.
>>
>> I guess that's because the last time I tried to define the handler as a
>> SQL function, I couldn't:
>>
>> create function test_fdw_handler() returns fdw_handler as '' language sql;
>> ERROR:  SQL functions cannot return type fdw_handler
>>
>> fdw_handler is a pseudo-type, which neither SQL nor plpgsql function can
>> return.
>>
>> Am I missing something?
> 
> Ok. Maybe then create_function_1.sql is the right place. Just add it
> to the section of passing tests and annotate that it's testing
> creating an FDW handler. Sorry for jumping back and forth.

Alright, done.  Thanks for reviewing.

Regards,
Amit
From d0c28965b892ac76d83987e9bf35e8ab8fd62e11 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Wed, 10 May 2017 10:37:42 +0900
Subject: [PATCH] Add some FDW HANDLER DDL tests

---
 src/test/regress/expected/foreign_data.out   | 28 +++-
 src/test/regress/input/create_function_1.source  |  6 +
 src/test/regress/output/create_function_1.source |  5 +
 src/test/regress/regress.c   |  7 ++
 src/test/regress/sql/foreign_data.sql| 13 +++
 5 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/src/test/regress/expected/foreign_data.out 
b/src/test/regress/expected/foreign_data.out
index c6e558b07f..331f7a911f 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -89,6 +89,14 @@ CREATE FOREIGN DATA WRAPPER foo VALIDATOR 
postgresql_fdw_validator;
  postgresql | regress_foreign_data_user | -   | postgresql_fdw_validator | 
  | | 
 (3 rows)
 
+-- HANDLER related checks
+CREATE FUNCTION invalid_fdw_handler() RETURNS int LANGUAGE SQL AS 'SELECT 1;';
+CREATE FOREIGN DATA WRAPPER test_fdw HANDLER invalid_fdw_handler;  -- ERROR
+ERROR:  function invalid_fdw_handler must return type fdw_handler
+CREATE FOREIGN DATA WRAPPER test_fdw HANDLER test_fdw_handler HANDLER 
invalid_fdw_handler;  -- ERROR
+ERROR:  conflicting or redundant options
+CREATE FOREIGN DATA WRAPPER test_fdw HANDLER test_fdw_handler;
+DROP FOREIGN DATA WRAPPER test_fdw;
 -- ALTER FOREIGN DATA WRAPPER
 ALTER FOREIGN DATA WRAPPER foo; -- ERROR
 ERROR:  syntax error at or near ";"
@@ -188,18 +196,26 @@ ALTER FOREIGN DATA WRAPPER foo RENAME TO foo1;
 (3 rows)
 
 ALTER FOREIGN DATA WRAPPER foo1 RENAME TO foo;
+-- HANDLER related checks
+ALTER FOREIGN DATA WRAPPER foo HANDLER invalid_fdw_handler;  -- ERROR
+ERROR:  function invalid_fdw_handler must return type fdw_handler
+ALTER FOREIGN DATA WRAPPER foo HANDLER test_fdw_handler HANDLER anything;  -- 
ERROR
+ERROR:  conflicting or redundant options
+ALTER FOREIGN DATA WRAPPER foo HANDLER test_fdw_handler;
+WARNING:  changing the foreign-data wrapper handler can change behavior of 
existing foreign tables
+DROP FUNCTION invalid_fdw_handler();
 -- DROP FOREIGN DATA WRAPPER
 DROP FOREIGN DATA WRAPPER nonexistent;  -- ERROR
 ERROR:  foreign-data wrapper "nonexistent" does not exist
 DROP FOREIGN DATA WRAPPER IF EXISTS nonexistent;
 NOTICE:  foreign-data wrapper "nonexistent" does not exist, skipping
 \dew+
-List of foreign-data 
wrappers
-Name|   Owner   | Handler |Validator | 
Access privileges | FDW options  | Description 
-+---+-+--+---+--+-
- dummy  | regress_foreign_

Re: [HACKERS] no test coverage for ALTER FOREIGN DATA WRAPPER name HANDLER ...

2017-09-13 Thread Amit Langote
On 2017/09/13 16:42, Ashutosh Bapat wrote:
> On Wed, Sep 13, 2017 at 7:49 AM, Amit Langote wrote:
>> In the attached updated patch, I created separate .source files in
>> src/test/regress/input and output directories called fdw_handler.source
>> and put the test_fdw_handler function definition there.  When I had
>> originally thought of it back when I wrote the patch, it seemed to be an
>> overkill, because we're just normally defining a single C function there
>> to be used in the newly added foreign_data tests.  In any case, we need to
>> go the .source file way, because that's the only way to refer to paths to
>> the .so library when defining C language functions.
> 
> It still looks like an overkill to add a new file to define a dummy
> FDW handler. Why do we need to define a handler as a C function? Can't
> we define handler as a SQL function. If we could do that we could add
> the function definition in foreign_data.sql itself.

I guess that's because the last time I tried to define the handler as a
SQL function, I couldn't:

create function test_fdw_handler() returns fdw_handler as '' language sql;
ERROR:  SQL functions cannot return type fdw_handler

fdw_handler is a pseudo-type, which neither SQL nor plpgsql function can
return.
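
A C-language definition does work, which is what the test function uses
(paths as written in a .source file):

    CREATE FUNCTION test_fdw_handler()
    RETURNS fdw_handler
    AS '@libdir@/regress@DLSUFFIX@', 'test_fdw_handler'
    LANGUAGE C;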

Am I missing something?

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Setting pd_lower in GIN metapage

2017-09-13 Thread Amit Langote
On 2017/09/13 16:20, Michael Paquier wrote:
> On Wed, Sep 13, 2017 at 2:48 PM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> I updated the patches so that the metapage's pd_lower is set to the
>> correct value just before *every* point where we are about to insert a
>> full page image of the metapage into WAL.  That's in addition to doing the
>> same in various metapage init routines, which the original patch did
>> already anyway.  I guess this now ensures that wal_consistency_checking
>> masking of these metapages as standard layout pages always works, even for
>> pre-v11 indexes that were upgraded.
> 
> Please note that I do have plans to look at all the patches proposed
> on this thread for all the indexes next. No report for today though as
> those deal with many code paths so it requires some attention. I think
> I'll group the review for all index AMs into the same email if you
> don't mind, each patch deals with its own thing in its own
> src/backend/access/ path.

Sure, no problem.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-13 Thread Amit Langote
On 2017/09/13 16:21, Ashutosh Bapat wrote:
> On Wed, Sep 13, 2017 at 12:39 AM, Robert Haas  wrote:
>> locks taken from the executor are worthless because plancache.c will
>> always do the job for us.  I don't know of a case where we execute a
>> saved plan without going through the plan cache, but that doesn't mean
>> that there isn't one or that there couldn't be one in the future.
>> It's not the job of these partitioning patches to whack around the way
>> we do locking in general -- they should preserve the existing behavior
>> as much as possible.  If we want to get rid of the locking in the
>> executor altogether, that's a separate discussion where, I have a
>> feeling, there will prove to be better reasons for the way things are
>> than we are right now supposing.
>>
> 
> I agree that it's not the job of these patches to change the locking
> or even get rid of partitioned_rels. In order to continue returning
> partitioned_rels in Append paths esp. in the case of queries involving
> set operations and partitioned table e.g "select 1 from t1 union all
> select 2 from t1;" in which t1 is multi-level partitioned table, we
> need a fix in add_paths_to_append_rels(). The fix provided in [1] is
> correct but we will need a longer explanation of why we have to
> involve RTE_SUBQUERY with RELKIND_PARTITIONED_TABLE. The explanation
> is complicated. If we get rid of partitioned_rels, we don't need to
> fix that code in add_paths_to_append_rel().

Yeah, let's get on with setting partitioned_rels in AppendPath correctly
in this patch.  Ashutosh's suggested approach seems fine, although it
needlessly requires scanning root->pcinfo_list.  But it shouldn't be longer
than the number of partitioned tables in the query, so maybe that's fine
too.  At least, it doesn't require us to add code to
add_paths_to_append_rel() that can be pretty hard to wrap one's head around.

That said, we might someday need to look carefully at some of the things
that Robert mentioned, especially the order of locks taken by
AcquireExecutorLocks() in light of the EIBO patch getting committed.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-13 Thread Amit Langote
On 2017/09/12 19:56, Ashutosh Bapat wrote:
> I think the code here expects the original parent_rte and not the one
> we set around line 1169.
> 
> This isn't a bug right now, since both the parent_rte s have same
> content. But I am not sure if that will remain to be so. Here's patch
> to fix the thinko.

Instead of the new bool is_parent_partitioned, why not move the code to
set partitioned_rels to the block where you're now setting
is_parent_partitioned?

Also, since we know this isn't a bug at the moment but will turn into one
once we have step-wise expansion, why not include this fix in that patch
itself?

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Setting pd_lower in GIN metapage

2017-09-12 Thread Amit Langote
On 2017/09/13 13:05, Tom Lane wrote:
> Amit Langote <langote_amit...@lab.ntt.co.jp> writes:
>> On 2017/09/12 23:27, Amit Kapila wrote:
>>> I think one point which might be missed is that the patch needs to
>>> modify pd_lower for all usages of metapage, not only when it is first
>>> time initialized.
> 
>> Maybe I'm missing something, but isn't the metadata size fixed and hence
>> pd_lower won't change once it's initialized?  Maybe, it's not true for all
>> index types?
> 
> No, the point is that you might be dealing with an index recently
> pg_upgraded from v10 or before, which does not have the correct
> value for pd_lower on that page.  This has to be coped with.

Ah, got it.  Thanks for the explanation.

I updated the patches so that the metapage's pd_lower is set to the
correct value just before *every* point where we are about to insert a
full page image of the metapage into WAL.  That's in addition to doing the
same in various metapage init routines, which the original patch did
already anyway.  I guess this now ensures that wal_consistency_checking
masking of these metapages as standard layout pages always works, even for
pre-v11 indexes that were upgraded.

Also, we now pass the metapage buffer as containing a page of standard
layout to XLogRegisterBuffer(), so that any hole in it is compressed when
actually writing to WAL.
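
Concretely, each such site now follows this pattern (lines taken from
the GIN patch below):

    ((PageHeader) metapage)->pd_lower =
        ((char *) metadata + sizeof(GinMetaPageData)) - (char *) metapage;
    MarkBufferDirty(metabuffer);

    /* ... and when logging the change: */
    XLogRegisterBuffer(0, metabuffer, REGBUF_WILL_INIT | REGBUF_STANDARD);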

Thanks,
Amit
From 607b4ab062652e7ffc0f95338c9265b09be18b56 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Fri, 23 Jun 2017 11:20:41 +0900
Subject: [PATCH 1/3] Set pd_lower correctly in the GIN metapage.

Also tell xlog.c to treat the metapage like a standard page, so any
hole in it is compressed.
---
 src/backend/access/gin/ginfast.c   | 22 --
 src/backend/access/gin/gininsert.c |  4 ++--
 src/backend/access/gin/ginutil.c   | 19 ++-
 src/backend/access/gin/ginxlog.c   | 24 +---
 4 files changed, 49 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 59e435465a..d96529cf72 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -399,6 +399,15 @@ ginHeapTupleFastInsert(GinState *ginstate, 
GinTupleCollector *collector)
/*
 * Write metabuffer, make xlog entry
 */
+
+   /*
+* Set pd_lower just past the end of the metadata.  This is not 
essential
+* but it makes the page look compressible to xlog.c, because we pass 
the
+* buffer containing this page to XLogRegisterBuffer() as a page with
+* standard layout.
+*/
+   ((PageHeader) metapage)->pd_lower =
+   ((char *) metadata + sizeof(GinMetaPageData)) - (char 
*) metapage;
MarkBufferDirty(metabuffer);
 
if (needWal)
@@ -407,7 +416,7 @@ ginHeapTupleFastInsert(GinState *ginstate, 
GinTupleCollector *collector)
 
memcpy(&data.metadata, metadata, sizeof(GinMetaPageData));
 
-   XLogRegisterBuffer(0, metabuffer, REGBUF_WILL_INIT);
+   XLogRegisterBuffer(0, metabuffer, REGBUF_WILL_INIT | 
REGBUF_STANDARD);
XLogRegisterData((char *) &data, sizeof(ginxlogUpdateMeta));
 
recptr = XLogInsert(RM_GIN_ID, XLOG_GIN_UPDATE_META_PAGE);
@@ -572,6 +581,14 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber 
newHead,
metadata->nPendingHeapTuples = 0;
}
 
+   /*
+* Set pd_lower just past the end of the metadata.  This is not
+* essential but it makes the page look compressible to xlog.c,
+* because we pass the buffer containing this page to
+* XLogRegisterBuffer() as page with standard layout.
+*/
+   ((PageHeader) metapage)->pd_lower =
+   ((char *) metadata + sizeof(GinMetaPageData)) - (char 
*) metapage;
MarkBufferDirty(metabuffer);
 
for (i = 0; i < data.ndeleted; i++)
@@ -586,7 +603,8 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber 
newHead,
XLogRecPtr  recptr;
 
XLogBeginInsert();
-   XLogRegisterBuffer(0, metabuffer, REGBUF_WILL_INIT);
+   XLogRegisterBuffer(0, metabuffer,
+  REGBUF_WILL_INIT | 
REGBUF_STANDARD);
for (i = 0; i < data.ndeleted; i++)
XLogRegisterBuffer(i + 1, buffers[i], 
REGBUF_WILL_INIT);
 
diff --git a/src/backend/access/gin/gininsert.c 
b/src/backend/access/gin/gininsert.c
index 5378011f50..c9aa4ee147 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -348,7 +348,7 @@ ginbuild(Relation heap, Relation index, IndexInfo 
*indexInfo)
   

Re: [HACKERS] Race between SELECT and ALTER TABLE NO INHERIT

2017-09-12 Thread Amit Langote
On 2017/09/13 12:05, Simon Riggs wrote:
> On 26 June 2017 at 10:16, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
> 
>> BTW, in the partitioned table case, the parent is always locked first
>> using an AccessExclusiveLock.  There are other considerations in that case
>> such as needing to recreate the partition descriptor upon termination of
>> inheritance (both the DETACH PARTITION and also DROP TABLE child cases).
> 
> Is this requirement documented or in comments anywhere?

Yes.  See the last sentence in the description of PARTITION OF clause in
CREATE TABLE:

https://www.postgresql.org/docs/devel/static/sql-createtable.html#sql-createtable-partition

And, the 4th point in the list of differences between declarative
partitioning and inheritance:

https://www.postgresql.org/docs/devel/static/ddl-partitioning.html#ddl-partitioning-implementation-inheritance

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Setting pd_lower in GIN metapage

2017-09-12 Thread Amit Langote
Thanks for the review.

On 2017/09/12 23:27, Amit Kapila wrote:
> On Tue, Sep 12, 2017 at 3:51 PM, Amit Langote wrote:
>> I updated the patches for GIN, BRIN, and SP-GiST to include the following
>> changes:
>>
>> 1. Pass REGBUF_STANDARD flag when registering the metapage buffer
>>
> 
> I have looked into brin patch and it seems you have not considered all
> usages of meta page.  The structure BrinRevmap also contains a
> reference to meta page buffer and when that is modified (ex. in
> revmap_physical_extend), then also I think you need to consider using
> REGBUF_STANDARD flag.

Fixed.

>> Did I miss something from the discussion?
>>
> 
> I think one point which might be missed is that the patch needs to
> modify pd_lower for all usages of metapage, not only when it is first
> time initialized.

Maybe I'm missing something, but isn't the metadata size fixed and hence
pd_lower won't change once it's initialized?  Maybe, it's not true for all
index types?

Thanks,
Amit
From c73183871632b368e2662ff5a35bfb6b3eaaade1 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Fri, 23 Jun 2017 11:20:41 +0900
Subject: [PATCH 1/3] Set pd_lower correctly in the GIN metapage.

Also tell xlog.c to treat the metapage like a standard page, so any
hole in it is compressed.
---
 src/backend/access/gin/gininsert.c |  4 ++--
 src/backend/access/gin/ginutil.c   |  9 +
 src/backend/access/gin/ginxlog.c   | 24 +---
 3 files changed, 20 insertions(+), 17 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c 
b/src/backend/access/gin/gininsert.c
index 5378011f50..c9aa4ee147 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -348,7 +348,7 @@ ginbuild(Relation heap, Relation index, IndexInfo 
*indexInfo)
Pagepage;
 
XLogBeginInsert();
-   XLogRegisterBuffer(0, MetaBuffer, REGBUF_WILL_INIT);
+   XLogRegisterBuffer(0, MetaBuffer, REGBUF_WILL_INIT | 
REGBUF_STANDARD);
XLogRegisterBuffer(1, RootBuffer, REGBUF_WILL_INIT);
 
recptr = XLogInsert(RM_GIN_ID, XLOG_GIN_CREATE_INDEX);
@@ -447,7 +447,7 @@ ginbuildempty(Relation index)
START_CRIT_SECTION();
GinInitMetabuffer(MetaBuffer);
MarkBufferDirty(MetaBuffer);
-   log_newpage_buffer(MetaBuffer, false);
+   log_newpage_buffer(MetaBuffer, true);
GinInitBuffer(RootBuffer, GIN_LEAF);
MarkBufferDirty(RootBuffer);
log_newpage_buffer(RootBuffer, false);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 136ea27718..e926649dd2 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -374,6 +374,15 @@ GinInitMetabuffer(Buffer b)
metadata->nDataPages = 0;
metadata->nEntries = 0;
metadata->ginVersion = GIN_CURRENT_VERSION;
+
+   /*
+* Set pd_lower just past the end of the metadata.  This is not 
essential
+* but it makes the page look compressible to xlog.c, as long as the
+* buffer containing the page is passed to XLogRegisterBuffer() as a
+* REGBUF_STANDARD page.
+*/
+   ((PageHeader) page)->pd_lower =
+   ((char *) metadata + sizeof(GinMetaPageData)) - (char 
*) page;
 }
 
 /*
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index 7ba04e324f..f5c11b2d9a 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -514,7 +514,7 @@ ginRedoUpdateMetapage(XLogReaderState *record)
Assert(BufferGetBlockNumber(metabuffer) == GIN_METAPAGE_BLKNO);
metapage = BufferGetPage(metabuffer);
 
-   GinInitPage(metapage, GIN_META, BufferGetPageSize(metabuffer));
+   GinInitMetabuffer(metabuffer);
memcpy(GinPageGetMeta(metapage), &data->metadata, 
sizeof(GinMetaPageData));
PageSetLSN(metapage, lsn);
MarkBufferDirty(metabuffer);
@@ -656,7 +656,7 @@ ginRedoDeleteListPages(XLogReaderState *record)
Assert(BufferGetBlockNumber(metabuffer) == GIN_METAPAGE_BLKNO);
metapage = BufferGetPage(metabuffer);
 
-   GinInitPage(metapage, GIN_META, BufferGetPageSize(metabuffer));
+   GinInitMetabuffer(metabuffer);
 
memcpy(GinPageGetMeta(metapage), &data->metadata, 
sizeof(GinMetaPageData));
PageSetLSN(metapage, lsn);
@@ -768,6 +768,7 @@ void
 gin_mask(char *pagedata, BlockNumber blkno)
 {
Pagepage = (Page) pagedata;
+   PageHeader  pagehdr = (PageHeader) page;
GinPageOpaque opaque;
 
mask_page_lsn(page);
@@ -776,18 +777,11 @@ gin_mask(char *pagedata, BlockNumber blkno)
mask_page_hint_bits(page);
 
/*
-* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. 
Hence,
-* we need to apply masking for those pages.
+  

Re: [HACKERS] no test coverage for ALTER FOREIGN DATA WRAPPER name HANDLER ...

2017-09-12 Thread Amit Langote
On 2017/09/12 20:17, Ashutosh Bapat wrote:
> On Tue, Sep 12, 2017 at 2:27 PM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> Thanks Ashutosh for taking a look at this.
>>
>> On 2017/09/05 21:16, Ashutosh Bapat wrote:
>>> The patch needs a rebase.
>>
>> Attached rebased patch.
> 
> Thanks for rebased patch.

Thanks for the review.

> We could annotate each ERROR with an explanation as to why that's an
> error, but then this file doesn't do that for other commands, so maybe
> the patch is just fine.

Agreed.  Note that this patch is just about adding the tests, not
modifying foreigncmds.c to change error handling around HANDLER functions.

> Also, I am wondering whether we should create the new handler function
> in foreign.c similar to postgresql_fdw_validator(). The prologue has a
> caution
> 
> 606  * Caution: this function is deprecated, and is now meant only for testing
> 607  * purposes, because the list of options it knows about doesn't 
> necessarily
> 608  * square with those known to whichever libpq instance you might be using.
> 609  * Inquire of libpq itself, instead.
> 
> So, maybe we don't want to add it there. But adding the handler
> function in create_function_1 doesn't seem good. If that's the correct
> place, then at least it should be moved before " -- Things that
> shouldn't work:"; it doesn't belong to functions that don't work.

In the attached updated patch, I created separate .source files in
src/test/regress/input and output directories called fdw_handler.source
and put the test_fdw_handler function definition there.  When I had
originally thought of it back when I wrote the patch, it seemed to be an
overkill, because we're just normally defining a single C function there
to be used in the newly added foreign_data tests.  In any case, we need to
go the .source file way, because that's the only way to refer to paths to
the .so library when defining C language functions.

Thanks,
Amit
From 510987531bfdf22df0bc8eef27f232e580d415b1 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Wed, 10 May 2017 10:37:42 +0900
Subject: [PATCH] Add some FDW HANDLER DDL tests

---
 src/test/regress/expected/foreign_data.out | 28 ++--
 src/test/regress/input/fdw_handler.source  |  5 +
 src/test/regress/output/fdw_handler.source |  5 +
 src/test/regress/parallel_schedule |  2 +-
 src/test/regress/regress.c |  7 +++
 src/test/regress/serial_schedule   |  1 +
 src/test/regress/sql/.gitignore|  1 +
 src/test/regress/sql/foreign_data.sql  | 13 +
 8 files changed, 55 insertions(+), 7 deletions(-)
 create mode 100644 src/test/regress/input/fdw_handler.source
 create mode 100644 src/test/regress/output/fdw_handler.source

diff --git a/src/test/regress/expected/foreign_data.out 
b/src/test/regress/expected/foreign_data.out
index c6e558b07f..331f7a911f 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -89,6 +89,14 @@ CREATE FOREIGN DATA WRAPPER foo VALIDATOR 
postgresql_fdw_validator;
  postgresql | regress_foreign_data_user | -   | postgresql_fdw_validator | 
  | | 
 (3 rows)
 
+-- HANDLER related checks
+CREATE FUNCTION invalid_fdw_handler() RETURNS int LANGUAGE SQL AS 'SELECT 1;';
+CREATE FOREIGN DATA WRAPPER test_fdw HANDLER invalid_fdw_handler;  -- ERROR
+ERROR:  function invalid_fdw_handler must return type fdw_handler
+CREATE FOREIGN DATA WRAPPER test_fdw HANDLER test_fdw_handler HANDLER 
invalid_fdw_handler;  -- ERROR
+ERROR:  conflicting or redundant options
+CREATE FOREIGN DATA WRAPPER test_fdw HANDLER test_fdw_handler;
+DROP FOREIGN DATA WRAPPER test_fdw;
 -- ALTER FOREIGN DATA WRAPPER
 ALTER FOREIGN DATA WRAPPER foo; -- ERROR
 ERROR:  syntax error at or near ";"
@@ -188,18 +196,26 @@ ALTER FOREIGN DATA WRAPPER foo RENAME TO foo1;
 (3 rows)
 
 ALTER FOREIGN DATA WRAPPER foo1 RENAME TO foo;
+-- HANDLER related checks
+ALTER FOREIGN DATA WRAPPER foo HANDLER invalid_fdw_handler;  -- ERROR
+ERROR:  function invalid_fdw_handler must return type fdw_handler
+ALTER FOREIGN DATA WRAPPER foo HANDLER test_fdw_handler HANDLER anything;  -- ERROR
+ERROR:  conflicting or redundant options
+ALTER FOREIGN DATA WRAPPER foo HANDLER test_fdw_handler;
+WARNING:  changing the foreign-data wrapper handler can change behavior of existing foreign tables
+DROP FUNCTION invalid_fdw_handler();
 -- DROP FOREIGN DATA WRAPPER
 DROP FOREIGN DATA WRAPPER nonexistent;  -- ERROR
 ERROR:  foreign-data wrapper "nonexistent" does not exist
 DROP FOREIGN DATA WRAPPER IF EXISTS nonexistent;
 NOTICE:  foreign-data wrapper "nonexistent" does not exist, skipping
 \dew+
- 

Re: [HACKERS] dropping partitioned tables without CASCADE

2017-09-12 Thread Amit Langote
On 2017/09/06 19:14, Amit Langote wrote:
> On 2017/09/06 18:46, Rushabh Lathia wrote:
>> Okay, I have marked this as ready for committer.
> 
> Thanks Ashutosh and Rushabh for rebasing and improving the patch.  Looks
> good to me too.

The patch needed to be rebased after the default partitions patch went in,
so done.  Per the build status on http://commitfest.cputube.org :)

Thanks,
Amit
From 0ac21ff604b5dccf818f9d69c945ff845d1771bf Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Wed, 13 Sep 2017 09:56:34 +0900
Subject: [PATCH] Some enhancements for \d+ output of partitioned tables

---
 src/bin/psql/describe.c| 32 --
 src/test/regress/expected/create_table.out | 13 +++-
 src/test/regress/expected/foreign_data.out |  3 +++
 src/test/regress/expected/insert.out   | 17 
 src/test/regress/sql/create_table.sql  |  2 +-
 src/test/regress/sql/insert.sql|  4 
 6 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index d22ec68431..855e6870e9 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -2831,7 +2831,7 @@ describeOneTableDetails(const char *schemaname,
/* print child tables (with additional info if partitions) */
if (pset.sversion >= 10)
 			printfPQExpBuffer(&buf,
-							  "SELECT c.oid::pg_catalog.regclass, pg_catalog.pg_get_expr(c.relpartbound, c.oid)"
+							  "SELECT c.oid::pg_catalog.regclass, pg_get_expr(c.relpartbound, c.oid), c.relkind"
 							  " FROM pg_catalog.pg_class c, pg_catalog.pg_inherits i"
 							  " WHERE c.oid=i.inhrelid AND i.inhparent = '%s'"
 							  " ORDER BY c.oid::pg_catalog.regclass::pg_catalog.text;", oid);
@@ -2854,7 +2854,18 @@ describeOneTableDetails(const char *schemaname,
else
tuples = PQntuples(result);
 
-		if (!verbose)
+		/*
+		 * For a partitioned table with no partitions, always print the number
+		 * of partitions as zero, even when verbose output is expected.
+		 * Otherwise, we would not print the "Partitions" section for a
+		 * partitioned table without any partitions.
+		 */
+		if (tableinfo.relkind == RELKIND_PARTITIONED_TABLE && tuples == 0)
+		{
+			printfPQExpBuffer(&buf, _("Number of partitions: %d"), tuples);
+			printTableAddFooter(&cont, buf.data);
+		}
+		else if (!verbose)
{
/* print the number of child tables, if any */
if (tuples > 0)
@@ -2886,12 +2897,21 @@ describeOneTableDetails(const char *schemaname,
 			}
 			else
 			{
+				char	   *partitioned_note;
+
+				if (*(PQgetvalue(result, i, 2)) == RELKIND_PARTITIONED_TABLE)
+					partitioned_note = " is partitioned";
+				else
+					partitioned_note = "";
+
 				if (i == 0)
-					printfPQExpBuffer(&buf, "%s: %s %s",
-									  ct, PQgetvalue(result, i, 0), PQgetvalue(result, i, 1));
+					printfPQExpBuffer(&buf, "%s: %s %s%s",
+									  ct, PQgetvalue(result, i, 0), PQgetvalue(result, i, 1),
+									  partitioned_note);
 				else
-					printfPQExpBuffer(&buf, "%*s  %s %s",
-									  ctw, "", PQgetvalue(result, i, 0), PQgetvalue(result, i, 1));
+					printfPQExpBuffer(&buf, "%*s  %s %s%s",
+									  ctw, "", PQgetvalue(result, i, 0), PQgetvalue(result, i, 1),
+									  partitioned_note);
 			}
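
To illustrate the intended \d+ behavior, here is a hypothetical session
(output abridged; the exact formatting may differ):

CREATE TABLE p (a int) PARTITION BY LIST (a);
\d+ p
...
Partition key: LIST (a)
Number of partitions: 0

CREATE TABLE p1 PARTITION OF p FOR VALUES IN (1) PARTITION BY LIST (a);
\d+ p
...
Partitions: p1 FOR VALUES IN (1) is partitioned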
  

Re: [HACKERS] Setting pd_lower in GIN metapage

2017-09-12 Thread Amit Langote
On 2017/09/11 18:13, Michael Paquier wrote:
> On Mon, Sep 11, 2017 at 5:40 PM, Amit Langote wrote:
>> On 2017/09/10 15:22, Michael Paquier wrote:
>>> Coordinating efforts here would be nice. If you, Amit K, are taking
>>> care of a patch for btree and hash, would you, Amit L, write the part
>>> for GIN, BRIN and SpGist? This needs a careful lookup as many code
>>> paths need a lookup so it may take time. Please note that I don't mind
>>> double-checking this part if you don't have enough room to do so.
>>
>> Sorry, I didn't have time today to carefully go through the recent
>> discussion on this thread (starting with Tom's email wherein he said he
>> set the status of the patch to Waiting on Author).  I will try tomorrow.
> 
> Thanks for the update! Once you get to this point, please let me know
> if you would like to work on a more complete patch for brin, gin and
> spgist. If you don't have enough room, I am fine to produce something.

I updated the patches for GIN, BRIN, and SP-GiST to include the following
changes:

1. Pass the REGBUF_STANDARD flag when registering the metapage buffer

2. Pass true for page_std argument of log_newpage() or
   log_newpage_buffer() when using it to log the metapage

3. Update comment near where pd_lower is set in the respective metapages,
   similar to what Amit K did in his patches for btree and hash metapages.

Did I miss something from the discussion?

Thanks,
Amit
From bf291247e8f49d4558ab70fa335bd5a7bdda88b4 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Fri, 23 Jun 2017 11:20:41 +0900
Subject: [PATCH 1/3] Set pd_lower correctly in the GIN metapage.

Also tell xlog.c to treat the metapage like a standard page, so any
hole in it is compressed.
---
 src/backend/access/gin/gininsert.c |  4 ++--
 src/backend/access/gin/ginutil.c   |  9 +
 src/backend/access/gin/ginxlog.c   | 24 +---
 3 files changed, 20 insertions(+), 17 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 5378011f50..c9aa4ee147 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -348,7 +348,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
Pagepage;
 
XLogBeginInsert();
-   XLogRegisterBuffer(0, MetaBuffer, REGBUF_WILL_INIT);
+	XLogRegisterBuffer(0, MetaBuffer, REGBUF_WILL_INIT | REGBUF_STANDARD);
XLogRegisterBuffer(1, RootBuffer, REGBUF_WILL_INIT);
 
recptr = XLogInsert(RM_GIN_ID, XLOG_GIN_CREATE_INDEX);
@@ -447,7 +447,7 @@ ginbuildempty(Relation index)
START_CRIT_SECTION();
GinInitMetabuffer(MetaBuffer);
MarkBufferDirty(MetaBuffer);
-   log_newpage_buffer(MetaBuffer, false);
+   log_newpage_buffer(MetaBuffer, true);
GinInitBuffer(RootBuffer, GIN_LEAF);
MarkBufferDirty(RootBuffer);
log_newpage_buffer(RootBuffer, false);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 136ea27718..e926649dd2 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -374,6 +374,15 @@ GinInitMetabuffer(Buffer b)
metadata->nDataPages = 0;
metadata->nEntries = 0;
metadata->ginVersion = GIN_CURRENT_VERSION;
+
+	/*
+	 * Set pd_lower just past the end of the metadata.  This is not essential
+	 * but it makes the page look compressible to xlog.c, as long as the
+	 * buffer containing the page is passed to XLogRegisterBuffer() as a
+	 * REGBUF_STANDARD page.
+	 */
+	((PageHeader) page)->pd_lower =
+		((char *) metadata + sizeof(GinMetaPageData)) - (char *) page;
 }
 
 /*
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index 7ba04e324f..f5c11b2d9a 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -514,7 +514,7 @@ ginRedoUpdateMetapage(XLogReaderState *record)
Assert(BufferGetBlockNumber(metabuffer) == GIN_METAPAGE_BLKNO);
metapage = BufferGetPage(metabuffer);
 
-   GinInitPage(metapage, GIN_META, BufferGetPageSize(metabuffer));
+   GinInitMetabuffer(metabuffer);
 	memcpy(GinPageGetMeta(metapage), &data->metadata, sizeof(GinMetaPageData));
PageSetLSN(metapage, lsn);
MarkBufferDirty(metabuffer);
@@ -656,7 +656,7 @@ ginRedoDeleteListPages(XLogReaderState *record)
Assert(BufferGetBlockNumber(metabuffer) == GIN_METAPAGE_BLKNO);
metapage = BufferGetPage(metabuffer);
 
-   GinInitPage(metapage, GIN_META, BufferGetPageSize(metabuffer));
+   GinInitMetabuffer(metabuffer);
 
 	memcpy(GinPageGetMeta(metapage), &data->metadata, sizeof(GinMetaPageData));
PageSetLSN(metapage, lsn);
@@ -768,6 +768,7 @@ void
 gin_mask(cha

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Amit Langote
On 2017/09/12 18:49, Ashutosh Bapat wrote:
> On Tue, Sep 12, 2017 at 2:17 PM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>>
>> That said, I noticed that we might need to be careful about what the value
>> of the root parent's PlanRowMark's allMarkType field gets set to.  We need
>> to make sure that it reflects markType of all partitions in the tree,
>> including those that are not root parent's direct children.  Is that true
>> with the proposed implementation?
> 
> Yes. We include child's allMarkTypes into parent's allMarkTypes. So,
> top parent's PlanRowMarks should have all descendant's allMarkTypes,
> which is not happening in the patch right now. There are two ways to
> fix that.
> 
> 1. Pass top parent's PlanRowMark all the way down to the leaf
> partitions, so that current expand_single_inheritance_child() collects
> allMarkTypes of all children correctly. But this way, PlanRowMarks of
> intermediate parent does not reflect allMarkTypes of its children,
> only top root records that.
> 2. Pass immediate parent's PlanRowMark to
> expand_single_inheritance_child(), so that it records allMarkTypes of
> its children. In expand_partitioned_rtentry() have following sequence
> 
> expand_single_inheritance_child(root, parentrte, parentRTindex,
> parentrel, parentrc, childrel,
> appinfos, &childrte, &childRTindex,
> &childrc);
> 
> /* If this child is itself partitioned, recurse */
> if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
>{
> expand_partitioned_rtentry(root, childrte, childRTindex,
>childrel, childrc, lockmode, appinfos,
>partitioned_child_rels);
> 
> /* Include child's rowmark type in parent's allMarkTypes */
> parentrc->allMarkTypes |= childrc->allMarkTypes;
>}
> so that we push allMarkTypes up the hierarchy.
> 
> I like the second way, since every intermediate parent records
> allMarkTypes of its descendants.

I like the second way, too.

Thanks,
Amit





Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Amit Langote
On 2017/09/12 17:53, Ashutosh Bapat wrote:
> On Tue, Sep 12, 2017 at 1:42 PM, Amit Langote wrote:
>> So, we can remove partitioned_rels from (Merge)AppendPath and
>> (Merge)Append nodes and remove ExecLockNonLeafAppendTables().
> 
> Don't we need partitioned_rels from Append paths to be transferred to
> ModifyTable node or we have a different way of calculating
> nonleafResultRelations?

No, we don't transfer partitioned_rels from the Append path to the
ModifyTable node.  inheritance_planner(), which builds the ModifyTable
path for UPDATE/DELETE on a partitioned table, fetches partitioned_rels
from root->pcinfo_list itself and passes it to create_modifytable_path();
no Append path is involved in that case.  PlannedStmt.nonleafResultRelations
is built by concatenating the partitioned_rels lists of all ModifyTable
nodes appearing in the query.  It does not depend on the partitioned_rels
of Append or AppendPath.
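
For reference, the concatenation in set_plan_refs() looks roughly like
this (a simplified sketch, not the exact code):

case T_ModifyTable:
    ...
    /* accumulate partitioned result rels from every ModifyTable node */
    root->glob->nonleafResultRelations =
        list_concat(root->glob->nonleafResultRelations,
                    list_copy(splan->partitioned_rels));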

Thanks,
Amit





Re: [HACKERS] no test coverage for ALTER FOREIGN DATA WRAPPER name HANDLER ...

2017-09-12 Thread Amit Langote
Thanks Ashutosh for taking a look at this.

On 2017/09/05 21:16, Ashutosh Bapat wrote:
> The patch needs a rebase.

Attached rebased patch.

Thanks,
Amit
From 75bcb6ebcc00193cb0251fced994f03d205e9e7f Mon Sep 17 00:00:00 2001
From: amit 
Date: Wed, 10 May 2017 10:37:42 +0900
Subject: [PATCH] Add some FDW HANDLER DDL tests

---
 src/test/regress/expected/foreign_data.out   | 28 +++-
 src/test/regress/input/create_function_1.source  |  3 +++
 src/test/regress/output/create_function_1.source |  2 ++
 src/test/regress/regress.c   |  7 ++
 src/test/regress/sql/foreign_data.sql| 13 +++
 5 files changed, 47 insertions(+), 6 deletions(-)

diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index c6e558b07f..331f7a911f 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -89,6 +89,14 @@ CREATE FOREIGN DATA WRAPPER foo VALIDATOR postgresql_fdw_validator;
 postgresql | regress_foreign_data_user | -       | postgresql_fdw_validator |                   |                              | 
 (3 rows)
 
+-- HANDLER related checks
+CREATE FUNCTION invalid_fdw_handler() RETURNS int LANGUAGE SQL AS 'SELECT 1;';
+CREATE FOREIGN DATA WRAPPER test_fdw HANDLER invalid_fdw_handler;  -- ERROR
+ERROR:  function invalid_fdw_handler must return type fdw_handler
+CREATE FOREIGN DATA WRAPPER test_fdw HANDLER test_fdw_handler HANDLER invalid_fdw_handler;  -- ERROR
+ERROR:  conflicting or redundant options
+CREATE FOREIGN DATA WRAPPER test_fdw HANDLER test_fdw_handler;
+DROP FOREIGN DATA WRAPPER test_fdw;
 -- ALTER FOREIGN DATA WRAPPER
 ALTER FOREIGN DATA WRAPPER foo; -- ERROR
 ERROR:  syntax error at or near ";"
@@ -188,18 +196,26 @@ ALTER FOREIGN DATA WRAPPER foo RENAME TO foo1;
 (3 rows)
 
 ALTER FOREIGN DATA WRAPPER foo1 RENAME TO foo;
+-- HANDLER related checks
+ALTER FOREIGN DATA WRAPPER foo HANDLER invalid_fdw_handler;  -- ERROR
+ERROR:  function invalid_fdw_handler must return type fdw_handler
+ALTER FOREIGN DATA WRAPPER foo HANDLER test_fdw_handler HANDLER anything;  -- ERROR
+ERROR:  conflicting or redundant options
+ALTER FOREIGN DATA WRAPPER foo HANDLER test_fdw_handler;
+WARNING:  changing the foreign-data wrapper handler can change behavior of existing foreign tables
+DROP FUNCTION invalid_fdw_handler();
 -- DROP FOREIGN DATA WRAPPER
 DROP FOREIGN DATA WRAPPER nonexistent;  -- ERROR
 ERROR:  foreign-data wrapper "nonexistent" does not exist
 DROP FOREIGN DATA WRAPPER IF EXISTS nonexistent;
 NOTICE:  foreign-data wrapper "nonexistent" does not exist, skipping
 \dew+
-                                                    List of foreign-data wrappers
-    Name    |           Owner            | Handler |        Validator         | Access privileges |         FDW options          | Description 
-------------+----------------------------+---------+--------------------------+-------------------+------------------------------+-------------
- dummy      | regress_foreign_data_user  | -       | -                        |                   |                              | useless
- foo        | regress_test_role_super    | -       | -                        |                   | (b '3', c '4', a '2', d '5') | 
- postgresql | regress_foreign_data_user  | -       | postgresql_fdw_validator |                   |                              | 
+                                                         List of foreign-data wrappers
+    Name    |           Owner            |     Handler      |        Validator         | Access privileges |         FDW options          | Description 
+------------+----------------------------+------------------+--------------------------+-------------------+------------------------------+-------------
+ dummy      | regress_foreign_data_user  | -                | -                        |                   |                              | useless
+ foo        | regress_test_role_super    | test_fdw_handler | -                        |                   | (b '3', c '4', a '2', d '5') | 
+ postgresql | regress_foreign_data_user  | -                | postgresql_fdw_validator |                   |                              | 
 (3 rows)
 
 DROP ROLE regress_test_role_super;  -- ERROR
diff --git a/src/test/regress/input/create_function_1.source b/src/test/regress/input/create_function_1.source
index f2b1561cc2..669a5355b0 100644
--- a/src/test/regress/input/create_function_1.source
+++ b/src/test/regress/input/create_function_1.source
@@ -87,3 +87,6 @@ CREATE FUNCTION test1 (int) RETURNS int LANGUAGE C
 
 CREATE FUNCTION test1 (int) RETURNS int LANGUAGE internal
 AS 'nosuch';
+
+CREATE FUNCTION test_fdw_handler () RETURNS fdw_handler LANGUAGE C
+AS '@libdir@/regress@DLSUFFIX@', 'test_fdw_handler';
diff --git 

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Amit Langote
On 2017/09/12 16:39, Ashutosh Bapat wrote:
> On Tue, Sep 12, 2017 at 7:31 AM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>> On 2017/09/11 19:45, Ashutosh Bapat wrote:
>>> On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote wrote:
>>>> IMHO, we should make it the responsibility of the future patch to set a
>>>> child PlanRowMark's prti to the direct parent's RT index, when we actually
>>>> know that it's needed for something.  We clearly know today why we need to
>>>> pass the other objects like child RT entry, RT index, and Relation, so we
>>>> should limit this patch to pass only those objects to the recursive call.
>>>> That makes this patch a relatively easy to understand change.
>>>
>>> I think you are mixing two issues here 1. setting parent RTI in child
>>> PlanRowMark and 2. passing immediate parent's PlanRowMark to
>>> expand_single_inheritance_child().
>>>
>>> I have discussed 1 in my reply to Robert.
>>>
>>> About 2 you haven't given any particular comments to my reply. To me
>>> it looks like it's this patch that introduces the notion of
>>> multi-level expansion, so it's natural for this patch to pass
>>> PlanRowMark in cascaded fashion similar to other structures.
>>
>> You patch does 2 to be able to do 1, doesn't it?  That is, to be able to
>> set the child PlanRowMark's prti to the direct parent's RT index, you pass
>> the immediate parent's PlanRowMark to the recursive call of
>> expand_single_inheritance_child().
> 
> No. child PlanRowMark's prti is set to parentRTIndex, which is a
> separate argument and is used to also set parent_relid in
> AppendRelInfo.

OK.  So, to keep the old behavior (if we want to keep it at all), we'd
actually need a new argument rootParentRTindex, the old behavior being
that all child PlanRowMarks have the root parent's RT index as their prti.

It seems, though, that the new behavior, where prti is set to the direct
parent's RT index, is more or less harmless: whatever we set prti to, as
long as it's different from rti, we can consider it a child PlanRowMark.
So it might be fine to set prti to the direct parent's RT index.
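
For what it's worth, the places that care about prti only seem to test it
against rti, along the lines of the following (a simplified sketch of the
pattern in preprocess_targetlist()):

ListCell   *lc;

foreach(lc, root->rowMarks)
{
    PlanRowMark *rc = (PlanRowMark *) lfirst(lc);

    /* skip child rowmarks; identified by prti != rti */
    if (rc->rti != rc->prti)
        continue;
    ...
}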

That said, I noticed that we might need to be careful about what the value
of the root parent's PlanRowMark's allMarkType field gets set to.  We need
to make sure that it reflects markType of all partitions in the tree,
including those that are not root parent's direct children.  Is that true
with the proposed implementation?

Thanks,
Amit





Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Amit Langote
On 2017/09/12 16:55, Ashutosh Bapat wrote:
> On Tue, Sep 12, 2017 at 1:16 PM, Amit Langote wrote:
>> So I looked at this a bit closely and came to the conclusion that we may
>> not need to keep partitioned table RT indexes in the
>> (Merge)Append.partitioned_rels after all, as far as execution-time locking
>> is concerned.
>>
>> Consider two cases:
>>
>> 1. Plan is created and executed in the same transaction
>>
>> In this case, locks taken on the partitioned tables by the planner will
>> suffice.
>>
>> 2. Plan is executed in a different transaction from the one in which it
>>was created (a cached plan)
>>
>> In this case, AcquireExecutorLocks will lock all the relations in
>> PlannedStmt.rtable, which must include all partitioned tables of all
>> partition trees involved in the query.  Of those, it will lock the tables
>> whose RT indexes appear in PlannedStmt.nonleafResultRelations with
>> RowExclusiveLock mode.  PlannedStmt.nonleafResultRelations is a global
>> list of all partitioned table RT indexes obtained by concatenating
>> partitioned_rels lists of all ModifyTable nodes involved in the query
>> (set_plan_refs does that).  We need to distinguish nonleafResultRelations,
>> because we need to take the stronger lock on a given table before any
>> weaker one if it happens to appear in the query as a non-result relation
>> too, to avoid lock strength upgrade deadlock hazard.
>>
>> Moreover, because all the tables from plannedstmt->rtable, including the
>> partitioned tables, will be added to PlannedStmt.relationsOids, any
>> invalidation events affecting the partitioned tables (for example,
>> add/remove a partition) will cause the plan involving partitioned tables
>> to be recreated.
>>
>> In none of this do we rely on the partitioned table RT indexes appearing
>> in the (Merge)Append node itself.  Maybe, we should just remove
>> partitioned_rels from (Merge)AppendPath and (Merge)Append node in a
>> separate patch and move on.
>>
>> Thoughts?
> 
> Yes, I did the same analysis (to which I refer in my earlier reply to
> you). I too think we should just remove partitioned_rels from Append
> paths. But then the question is those are then transferred to
> ModifyTable node in create_modifytable_plan() and use it for something
> else. What should we do about that code? I don't think we are really
> using that list from ModifyTable node as well, so may be we could
> remove it from there as well. What do you think? Does that mean
> partitioned_rels isn't used at all in the code?

No, we cannot simply get rid of partitioned_rels altogether.  We'll need
to keep it in the ModifyTable node, because we *do* need the
nonleafResultRelations list in PlannedStmt to distinguish partitioned
table result relations, which set_plan_refs builds by concatenating
partitioned_rels lists of various ModifyTable nodes of the query.  The
PlannedStmt.nonleafResultRelations list actually has some use (which
parallels PlannedStmt.resultRelations), but partitioned_rels list in the
individual (Merge)Append, as it turns out, doesn't.

So, we can remove partitioned_rels from (Merge)AppendPath and
(Merge)Append nodes and remove ExecLockNonLeafAppendTables().

Thanks,
Amit





Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-12 Thread Amit Langote
On 2017/09/11 21:07, Ashutosh Bapat wrote:
> On Mon, Sep 11, 2017 at 5:19 PM, Robert Haas  wrote:
>> On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat
>>  wrote:
>>> So, all partitioned partitions are getting locked correctly. Am I
>>> missing something?
>>
>> That's not a valid test.  In that scenario, you're going to hold all
>> the locks acquired by the planner, all the locks acquired by the
>> rewriter, and all the locks acquired by the executor, but when using
>> prepared queries, it's possible to execute the plan after the planner
>> and rewriter locks are no longer held.
> 
> I see the same thing when I use prepare and execute

So I looked at this a bit closely and came to the conclusion that we may
not need to keep partitioned table RT indexes in the
(Merge)Append.partitioned_rels after all, as far as execution-time locking
is concerned.

Consider two cases:

1. Plan is created and executed in the same transaction

In this case, locks taken on the partitioned tables by the planner will
suffice.

2. Plan is executed in a different transaction from the one in which it
   was created (a cached plan)

In this case, AcquireExecutorLocks will lock all the relations in
PlannedStmt.rtable, which must include all partitioned tables of all
partition trees involved in the query.  Of those, it will lock the tables
whose RT indexes appear in PlannedStmt.nonleafResultRelations with
RowExclusiveLock mode.  PlannedStmt.nonleafResultRelations is a global
list of all partitioned table RT indexes obtained by concatenating
partitioned_rels lists of all ModifyTable nodes involved in the query
(set_plan_refs does that).  We need to distinguish nonleafResultRelations,
because we need to take the stronger lock on a given table before any
weaker one if it happens to appear in the query as a non-result relation
too, to avoid lock strength upgrade deadlock hazard.
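
For reference, AcquireExecutorLocks chooses lock modes roughly like this
(a simplified sketch, not the exact code):

ListCell   *lc;
int         rt_index = 0;

foreach(lc, plannedstmt->rtable)
{
    RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc);
    LOCKMODE    lockmode;

    rt_index++;
    if (rte->rtekind != RTE_RELATION)
        continue;

    /* result relations, leaf or not, get the stronger lock */
    if (list_member_int(plannedstmt->resultRelations, rt_index) ||
        list_member_int(plannedstmt->nonleafResultRelations, rt_index))
        lockmode = RowExclusiveLock;
    else
        lockmode = AccessShareLock;

    LockRelationOid(rte->relid, lockmode);
}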

Moreover, because all the tables from plannedstmt->rtable, including the
partitioned tables, will be added to PlannedStmt.relationsOids, any
invalidation events affecting the partitioned tables (for example,
add/remove a partition) will cause the plan involving partitioned tables
to be recreated.

In none of this do we rely on the partitioned table RT indexes appearing
in the (Merge)Append node itself.  Maybe, we should just remove
partitioned_rels from (Merge)AppendPath and (Merge)Append node in a
separate patch and move on.

Thoughts?

Thanks,
Amit





Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-11 Thread Amit Langote
On 2017/09/11 19:45, Ashutosh Bapat wrote:
> On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote wrote:
>> IMHO, we should make it the responsibility of the future patch to set a
>> child PlanRowMark's prti to the direct parent's RT index, when we actually
>> know that it's needed for something.  We clearly know today why we need to
>> pass the other objects like child RT entry, RT index, and Relation, so we
>> should limit this patch to pass only those objects to the recursive call.
>> That makes this patch a relatively easy to understand change.
> 
> I think you are mixing two issues here 1. setting parent RTI in child
> PlanRowMark and 2. passing immediate parent's PlanRowMark to
> expand_single_inheritance_child().
> 
> I have discussed 1 in my reply to Robert.
> 
> About 2 you haven't given any particular comments to my reply. To me
> it looks like it's this patch that introduces the notion of
> multi-level expansion, so it's natural for this patch to pass
> PlanRowMark in cascaded fashion similar to other structures.

Your patch does 2 to be able to do 1, doesn't it?  That is, to be able to
set the child PlanRowMark's prti to the direct parent's RT index, you pass
the immediate parent's PlanRowMark to the recursive call of
expand_single_inheritance_child().

All I am trying to say is that this patch's mission is to expand
inheritance step-wise to be able to do certain things in the *planner*
that weren't possible before.  The patch accomplishes that by creating
child AppendRelInfos such that its parent_relid field is set to the
immediate parent's RT index.  It's quite clear why we're doing so.  It's
not clear why we should do so for PlanRowMarks too.  Maybe it's fine as
long as nothing breaks.

>> If we go with your patch, partitioned tables won't get locked, for
>> example, in case of the following query (p is a partitioned table):
>>
>> select 1 from p union all select 2 from p;
>>
>> That's because the RelOptInfos for the two instances of p in the above
>> query are RELOPT_OTHER_MEMBER_REL, not RELOPT_BASEREL.  They are children
>> of the Append corresponding to the UNION ALL subquery RTE.  So,
>> partitioned_rels does not get set per your proposed code.
> 

[...]

> So, all partitioned partitions are getting locked correctly. Am I
> missing something?

I will reply to that separately, in response to your other email.

> Actually, the original problem that caused this discussion started
> with an assertion failure in get_partitioned_child_rels() as
> Assert(list_length(result) >= 1);
> 
> This assertion fails if result is NIL when an intermediate partitioned
> table is passed. May be we should assert (result == NIL ||
> list_length(result) == 1) and allow that function to be called even
> for intermediate partitioned partitions for which the function will
> return NIL. That will leave the code in add_paths_to_append_rel()
> simple. Thoughts?

Yeah, I guess that could work.  We'll just have to write comments to
describe why the Assert is written that way.
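
Something like this, perhaps (a sketch of the suggested Assert together
with the comment it would need):

/*
 * Intermediate partitioned partitions have no entry of their own in
 * root->pcinfo_list (only the root parent does), so the result may
 * legitimately be empty here.
 */
Assert(result == NIL || list_length(result) == 1);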

>> In create_lateral_join_info():
>>
>> +Assert(IS_SIMPLE_REL(brel));
>> +Assert(brte);
>>
>> The second Assert is either unnecessary or should be placed first.
> 
> simple_rte_array[] may have some NULL entries. Second assert makes
> sure that we aren't dealing with a NULL entry. Any particular reason
> to reorder the asserts?

Sorry, I missed that the 2nd Assert is on b"rte".  I thought it was b"rel".

>> The following comment could be made a bit clearer.
>>
>> + * In the case of table inheritance, the parent RTE is directly
>> linked
>> + * to every child table via an AppendRelInfo.  In the case of table
>> + * partitioning, the inheritance hierarchy is expanded one level at 
>> a
>> + * time rather than flattened.  Therefore, an other member rel
>> that is
>> + * a partitioned table may have children of its own, and must
>> + * therefore be marked with the appropriate lateral info so that
>> those
>> + * children eventually get marked also.
>>
>> How about: In the case of partitioned table inheritance, the original
>> parent RTE is linked, via AppendRelInfo, only to its immediate partitions.
>>  Partitions below the first level are accessible only via their immediate
>> parent's RelOptInfo, which would be of kind RELOPT_OTHER_MEMBER_REL, so
>> consider those as well.
> 
> I don't see much difference between those two. We usually do not use
> macros in comments, so usually comments mention "other member" rel.
> Let's leave this for the committer to judge.

Sure.

>> In expand_inherited_rtentry(), the following comment fragmen

Re: [HACKERS] expanding inheritance in partition bound order

2017-09-11 Thread Amit Langote
Hi Amit,

On 2017/09/11 16:16, Amit Khandekar wrote:
> Thanks Amit for the patch. I am still reviewing it, but meanwhile
> below are a few comments so far ...

Thanks for the review.

> +   next_parted_idx += (list_length(*pds) - next_parted_idx - 1);
> 
> I think this can be replaced just by :
> +   next_parted_idx = list_length(*pds) - 1;
> Or, how about removing this variable next_parted_idx altogether ?
> Instead, we can just do this :
> pd->indexes[i] = -(1 + list_length(*pds));

That seems like the simplest possible way to do it.

> + next_leaf_idx += (list_length(*leaf_part_oids) - next_leaf_idx);
> 
> Didn't understand why next_leaf_idx needs to be updated in case when
> the current partrelid is partitioned. I think it should be incremented
> only for leaf partitions, no ? Or for that matter, again, how about
> removing the variable 'next_leaf_idx' and doing this :
> *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
> pd->indexes[i] = list_length(*leaf_part_oids) - 1;

Yep.

Attached updated patch does it that way for both partitioned table indexes
and leaf partition indexes.  Thanks for pointing it out.
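
With those changes, the indexes[] encoding works out as below (a sketch;
"is_leaf" stands in for the actual check on the partition's relkind):

if (is_leaf)
{
    /* leaf partition: 0-based offset into *leaf_part_oids */
    *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
    pd->indexes[i] = list_length(*leaf_part_oids) - 1;
}
else
{
    /* sub-partitioned table: -(1 + offset into *pds), always negative */
    pd->indexes[i] = -(1 + list_length(*pds));
}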


> ---
> 
> * For every partitioned table in the tree, starting with the root
> * partitioned table, add its relcache entry to parted_rels, while also
> * queuing its partitions (in the order in which they appear in the
> * partition descriptor) to be looked at later in the same loop.  This is
> * a bit tricky but works because the foreach() macro doesn't fetch the
> * next list element until the bottom of the loop.
> 
> I think the above comment needs to be modified with something
> explaining the relevant changed code. For e.g. there is no
> parted_rels, and the "tricky" part was there earlier because of the
> list being iterated and at the same time being appended.
> 
> 

I think I forgot to update this comment.

> I couldn't see the existing comments like "Indexes corresponding to
> the internal partitions are multiplied by" anywhere in the patch. I
> think those comments are still valid, and important.

Again, I failed to keep this comment.  Anyway, I reworded the comments a
bit to describe more clearly what the code is doing.  I hope you find
them clearer too.

Thanks,
Amit
From 0e04cee14a5168e0652c2aa646c169789ae41e8e Mon Sep 17 00:00:00 2001
From: amit 
Date: Wed, 30 Aug 2017 10:02:05 +0900
Subject: [PATCH 1/2] Decouple RelationGetPartitionDispatchInfo() from executor

Currently, it and the structure it generates, viz. PartitionDispatch
objects, are too tightly coupled with the executor's tuple-routing code.
In particular, it is undesirable that it makes the caller responsible
for releasing resources such as executor tuple table slots,
tuple-conversion maps, etc.  That makes it harder to use in places other
than where it is currently being used.

After this refactoring, ExecSetupPartitionTupleRouting() now needs to
do some of the work that was previously done in
RelationGetPartitionDispatchInfo().
---
 src/backend/catalog/partition.c| 53 +++-
 src/backend/commands/copy.c| 32 +++--
 src/backend/executor/execMain.c| 88 ++
 src/backend/executor/nodeModifyTable.c | 37 +++---
 src/include/catalog/partition.h| 20 +++-
 src/include/executor/executor.h|  4 +-
 src/include/nodes/execnodes.h  | 40 +++-
 7 files changed, 181 insertions(+), 93 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 73eff17202..555b7c21c7 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1292,7 +1292,6 @@ RelationGetPartitionDispatchInfo(Relation rel,
Relationpartrel = lfirst(lc1);
Relationparent = lfirst(lc2);
PartitionKey partkey = RelationGetPartitionKey(partrel);
-   TupleDesc   tupdesc = RelationGetDescr(partrel);
PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
int j,
m;
@@ -1300,29 +1299,12 @@ RelationGetPartitionDispatchInfo(Relation rel,
 		pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
pd[i]->reldesc = partrel;
pd[i]->key = partkey;
-   pd[i]->keystate = NIL;
pd[i]->partdesc = partdesc;
if (parent != NULL)
-		{
-			/*
-			 * For every partitioned table other than root, we must store a
-			 * tuple table slot initialized with its tuple descriptor and a
-			 * tuple conversion map to convert a tuple from its parent's
-			 * rowtype to its own. That is to make sure that we are looking at
-			 * the correct row using the 

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-09-11 Thread Amit Langote
On 2017/09/11 16:23, Ashutosh Bapat wrote:
> On Sat, Sep 9, 2017 at 6:28 AM, Robert Haas  wrote:
>> I'm a bit suspicious about the fact that there are now executor
>> changes related to the PlanRowMarks.  If the rowmark's prti is now the
>> intermediate parent's RT index rather than the top-parent's RT index,
>> it'd seem like that'd matter somehow.  Maybe it doesn't, because the
>> code that cares about prti seems to only care about whether it's
>> different from rti.  But if that's true everywhere, then why even
>> change this?  I think we might be well off not to tinker with things
>> that don't need to be changed.
> 
> In the definition of ExecRowMark, I see
> Index   prti;   /* parent range table index, if child */
> It just says parent, by which I take as direct parent. For
> inheritance, which earlier flattened inheritance hierarchy, direct
> parent was top parent. So, probably nobody thought whether a parent is
> direct parent or top parent. But now that we have introduced that
> concept we need to interpret this comment anew. And I think
> interpreting it as direct parent is non-lossy.

But then we also don't have anything to say about why we're making that
change.  If you could describe what "non-lossy" means in this context,
then fine.  But the fact that we'd like to match what we're going to do
for AppendRelInfos does not seem like a sufficient explanation for this
change.

> If we set top parent's
> index, parent RTI in AppendRelInfo and PlanRowMark would not agree.
> So, it looks quite natural that we set the direct parent's index in
> PlanRowMark.

They would not agree, yes, but aren't they unrelated?  If we have a reason
for them to agree (for example, if row locking breaks in the inherited-table
case when they don't), then we should definitely make them agree.

Updating the comment for prti definition might be something that this
patch could (should?) do, but I'm not quite sure about that too.

Thanks,
Amit




