Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)

2015-01-27 Thread Amit Langote
On 27-01-2015 AM 05:46, Jim Nasby wrote:
> On 1/25/15 7:42 PM, Amit Langote wrote:
>> On 21-01-2015 PM 07:26, Amit Langote wrote:
>>> Ok, I will limit myself to focusing on following things at the moment:
>>>
>>> * Provide syntax in CREATE TABLE to declare partition key
>>
>> While working on this, I stumbled upon the question of how we deal with
>> any index definitions following from constraints defined in a CREATE
>> statement. I think we do not want to have a physical index created for a
>> table that is partitioned (in other words, has no heap of itself). As
>> the current mechanisms dictate, constraints like PRIMARY KEY, UNIQUE,
>> EXCLUSION CONSTRAINT are enforced as indexes. It seems there are really
>> two decisions to make here:
>>
>> 1) how do we deal with any index definitions (either explicit or
>> implicit following from constraints defined on it) - do we allow them by
>> marking them specially, say, in pg_index, as being mere
>> placeholders/templates or invent some other mechanism?
>>
>> 2) As a short-term solution, do we simply reject creating any indexes
>> (/any constraints that require them) on a table whose definition also
>> includes PARTITION ON clause? Instead define them on its partitions (or
>> any relations in hierarchy that are not further partitioned).
>>
>> Or maybe I'm missing something...
> 
> Wasn't the idea that the parent table in a partitioned table wouldn't
> actually have a heap of it's own? If there's no heap there can't be an
> index.
>

Yes, that's right. Perhaps, we should look at heap-less partitioned
relation thingy not so soon as you say below.

> That said, I think this is premature optimization that could be done later.

It seems so.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)

2015-01-26 Thread Jim Nasby

On 1/25/15 7:42 PM, Amit Langote wrote:

On 21-01-2015 PM 07:26, Amit Langote wrote:

Ok, I will limit myself to focusing on following things at the moment:

* Provide syntax in CREATE TABLE to declare partition key


While working on this, I stumbled upon the question of how we deal with
any index definitions following from constraints defined in a CREATE
statement. I think we do not want to have a physical index created for a
table that is partitioned (in other words, has no heap of itself). As
the current mechanisms dictate, constraints like PRIMARY KEY, UNIQUE,
EXCLUSION CONSTRAINT are enforced as indexes. It seems there are really
two decisions to make here:

1) how do we deal with any index definitions (either explicit or
implicit following from constraints defined on it) - do we allow them by
marking them specially, say, in pg_index, as being mere
placeholders/templates or invent some other mechanism?

2) As a short-term solution, do we simply reject creating any indexes
(/any constraints that require them) on a table whose definition also
includes PARTITION ON clause? Instead define them on its partitions (or
any relations in hierarchy that are not further partitioned).

Or maybe I'm missing something...


Wasn't the idea that the parent table in a partitioned table wouldn't actually 
have a heap of it's own? If there's no heap there can't be an index.

That said, I think this is premature optimization that could be done later.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)

2015-01-25 Thread Amit Langote
On 21-01-2015 PM 07:26, Amit Langote wrote:
> Ok, I will limit myself to focusing on following things at the moment:
> 
> * Provide syntax in CREATE TABLE to declare partition key

While working on this, I stumbled upon the question of how we deal with
any index definitions following from constraints defined in a CREATE
statement. I think we do not want to have a physical index created for a
table that is partitioned (in other words, has no heap of itself). As
the current mechanisms dictate, constraints like PRIMARY KEY, UNIQUE,
EXCLUSION CONSTRAINT are enforced as indexes. It seems there are really
two decisions to make here:

1) how do we deal with any index definitions (either explicit or
implicit following from constraints defined on it) - do we allow them by
marking them specially, say, in pg_index, as being mere
placeholders/templates or invent some other mechanism?

2) As a short-term solution, do we simply reject creating any indexes
(/any constraints that require them) on a table whose definition also
includes PARTITION ON clause? Instead define them on its partitions (or
any relations in hierarchy that are not further partitioned).

Or maybe I'm missing something...

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)

2015-01-21 Thread Amit Langote
On 21-01-2015 AM 01:42, Robert Haas wrote:
> On Mon, Jan 19, 2015 at 8:48 PM, Amit Langote
>  wrote:
 Specifically, do we regard a partitions as pg_inherits children of its
 partitioning parent?
>>>
>>> I don't think this is totally an all-or-nothing decision.  I think
>>> everyone is agreed that we need to not break things that work today --
>>> e.g. Merge Append.  What that implies for pg_inherits isn't altogether
>>> clear.
>>
>> One point is that an implementation may end up establishing the
>> parent-partition hierarchy somewhere other than (or in addition to)
>> pg_inherits. One intention would be to avoid tying partitioning scheme
>> to certain inheritance features that use pg_inherits. For example,
>> consider call sites of find_all_inheritors(). One notable example is
>> Append/MergeAppend which would be of interest to partitioning. We would
>> want to reuse that part of the infrastructure but we could might as well
>> write an equivalent, say find_all_partitions() which scans something
>> other than pg_inherits to get all partitions.
> 
> IMHO, there's little reason to avoid putting pg_inherits entries in
> for the partitions, and then this just works.  We can find other ways
> to make it work if that turns out to be better, but if we don't have
> one, there's no reason to complicate things.
> 

Ok, I will go forward and stick to pg_inherits approach for now. Perhaps
the concerns I am expressing have other solutions that don't require
abandoning pg_inherits approach altogether.

>> Agree that some concrete idea of internal representation should help
>> guide the catalog structure. If we are going to cache the partitioning
>> info in relcache (which we most definitely will), then we should try to
>> make sure to consider the scenario of having a lot of partitioned tables
>> with a lot of individual partitions. It looks like it would be similar
>> to a scenarios where there are a lot of inheritance hierarchies. But,
>> availability of partitioning feature would definitely cause these
>> numbers to grow larger. Perhaps this is an important point driving this
>> discussion.
> 
> Yeah, it would be good if the costs of supporting, say, 1000
> partitions were negligible.
> 
>> A primary question for me about partition-pruning is when do we do it?
>> Should we model it after relation_excluded_by_constraints() and hence
>> totally plan-time? But, the tone of the discussion is that we postpone
>> partition-pruning to execution-time and hence my perhaps misdirected
>> attempts to inject it into some executor machinery.
> 
> It's useful to prune partitions at plan time, because then you only
> have to do the work once.  But sometimes you don't know enough to do
> it at plan time, so it's useful to do it at execution time, too.
> Then, you can do it differently for every tuple based on the actual
> value you have.  There's no point in doing 999 unnecessary relation
> scans if we can tell which partition the actual run-time value must be
> in.  But I think execution-time pruning can be a follow-on patch.  If
> you don't restrict the scope of the first patch as much as possible,
> you're not going to have much luck getting this committed.
> 

Ok, I will limit myself to focusing on following things at the moment:

* Provide syntax in CREATE TABLE to declare partition key
* Provide syntax in CREATE TABLE to declare a table as partition of a
partitioned table and values it contains
* Arrange to have partition key and values stored in appropriate
catalogs (existing or new)
* Arrange to cache partitioning info of partitioned tables in relcache

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)

2015-01-20 Thread Robert Haas
On Mon, Jan 19, 2015 at 8:48 PM, Amit Langote
 wrote:
>>> Specifically, do we regard a partitions as pg_inherits children of its
>>> partitioning parent?
>>
>> I don't think this is totally an all-or-nothing decision.  I think
>> everyone is agreed that we need to not break things that work today --
>> e.g. Merge Append.  What that implies for pg_inherits isn't altogether
>> clear.
>
> One point is that an implementation may end up establishing the
> parent-partition hierarchy somewhere other than (or in addition to)
> pg_inherits. One intention would be to avoid tying partitioning scheme
> to certain inheritance features that use pg_inherits. For example,
> consider call sites of find_all_inheritors(). One notable example is
> Append/MergeAppend which would be of interest to partitioning. We would
> want to reuse that part of the infrastructure but we could might as well
> write an equivalent, say find_all_partitions() which scans something
> other than pg_inherits to get all partitions.

IMHO, there's little reason to avoid putting pg_inherits entries in
for the partitions, and then this just works.  We can find other ways
to make it work if that turns out to be better, but if we don't have
one, there's no reason to complicate things.

> Agree that some concrete idea of internal representation should help
> guide the catalog structure. If we are going to cache the partitioning
> info in relcache (which we most definitely will), then we should try to
> make sure to consider the scenario of having a lot of partitioned tables
> with a lot of individual partitions. It looks like it would be similar
> to a scenarios where there are a lot of inheritance hierarchies. But,
> availability of partitioning feature would definitely cause these
> numbers to grow larger. Perhaps this is an important point driving this
> discussion.

Yeah, it would be good if the costs of supporting, say, 1000
partitions were negligible.

> A primary question for me about partition-pruning is when do we do it?
> Should we model it after relation_excluded_by_constraints() and hence
> totally plan-time? But, the tone of the discussion is that we postpone
> partition-pruning to execution-time and hence my perhaps misdirected
> attempts to inject it into some executor machinery.

It's useful to prune partitions at plan time, because then you only
have to do the work once.  But sometimes you don't know enough to do
it at plan time, so it's useful to do it at execution time, too.
Then, you can do it differently for every tuple based on the actual
value you have.  There's no point in doing 999 unnecessary relation
scans if we can tell which partition the actual run-time value must be
in.  But I think execution-time pruning can be a follow-on patch.  If
you don't restrict the scope of the first patch as much as possible,
you're not going to have much luck getting this committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)

2015-01-19 Thread Amit Langote
On 20-01-2015 AM 10:48, Amit Langote wrote:
> On 17-01-2015 AM 02:34, Robert Haas wrote:
>> On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote
>>  wrote:
>>> * It is desirable to treat partitions as pg_class relations with perhaps
>>> a new relkind(s). We may want to choose an implementation where only the
>>> lowest level relations in a partitioning hierarchy have storage; those
>>> at the upper layers are mere placeholder relations though of course with
>>> associated constraints determined by partitioning criteria (with
>>> appropriate metadata entered into the additional catalogs).
>>
>> I think the storage-less parents need a new relkind precisely to
>> denote that they have no storage; I am not convinced that there's any
>> reason to change the relkind for the leaf nodes.  But that's been
>> proposed, so evidently someone thinks there's a reason to do it.
>>
> 
> Again, this remains partly tied to decisions we make regarding catalog
> structure.
> 
> I am not sure but wouldn't we ever need to tell from a pg_class entry
> that a leaf relation has partition bounds associated with it? One reason
> I can see that we may not need it is that we would rather use
> relispartitioned of a non-leaf relation to trigger finding all its
> partitions and their associated bounds; we don't need to know (or
> reserve a field for) that a relation has partition bounds associated
> with it. The bounds can be stored in pg_partition indexed by relid.
> Maybe relkind is not the right field for this anyway.
> 
> With that said, would we be comfortable with putting partition key into
> pg_class (maybe as a pg_node_tree also encapsulating opclass) so that if
> relispartitioned, also look for relpartkey?
> 

This paints a picture that our leaf relations would be plain old
relations. They are almost similar in all respects (how they are
planned, modified, maintained, ...). They just have an additional
property that the values they can contain are restricted by, say,
pg_partition.values; but it doesn't concern how they are planned.
Planning related changes are confined to upper layers of the hierarchy
instead. Kinda like saying instead of doing
relation_excluded_by_constraints(childrel), we'd instead say
prune_useless_partitions(&partitionedrel) possibly at some other site
than its counterpart. Guess that illustrates the point.

I am not sure again if we want to limit access to individual partitions
unless via some special syntax, then what that means for the above. We
have been discussing that. Such access limiting could (only) be
facilitated by a new relkind.

On the other hand, the non-leaf relations are slightly new kind of
relations in that they do not have storage (they could have a tablespace
which would be the default tablespace for its underlying partitions).
Obviously they do not have indexes pointing at them. Because they are
further partitioned, they are differently planned - most probably Append
with partition-pruning (almost like Append with constraint-exclusion but
supposedly quicker because of the explicit access to partition
definitions and perhaps execution-time). INSERT/COPY on these involve
routing tuple to the appropriate leaf relation.

Not surprisingly, this is almost similar to the picture that Alvaro had
presented modulo some differences.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)

2015-01-19 Thread Amit Langote
On 17-01-2015 AM 02:34, Robert Haas wrote:
> On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote
>  wrote:
>> * It has been repeatedly pointed out that we may want to decouple
>> partitioning from inheritance because implementing partitioning as an
>> extension of inheritance mechanism means that we have to keep all the
>> existing semantics which might limit what we want to do with the special
>> case of it which is partitioning; in other words, we would find
>> ourselves in difficult position where we have to inject a special case
>> code into a very generalized mechanism that is inheritance.
>> Specifically, do we regard a partitions as pg_inherits children of its
>> partitioning parent?
> 
> I don't think this is totally an all-or-nothing decision.  I think
> everyone is agreed that we need to not break things that work today --
> e.g. Merge Append.  What that implies for pg_inherits isn't altogether
> clear.
> 

One point is that an implementation may end up establishing the
parent-partition hierarchy somewhere other than (or in addition to)
pg_inherits. One intention would be to avoid tying partitioning scheme
to certain inheritance features that use pg_inherits. For example,
consider call sites of find_all_inheritors(). One notable example is
Append/MergeAppend which would be of interest to partitioning. We would
want to reuse that part of the infrastructure but we could might as well
write an equivalent, say find_all_partitions() which scans something
other than pg_inherits to get all partitions.

Now, we may not want to do that and instead add special case code to
prevent partitioning from fiddling with unnecessary inheritance features
in the code paths of inheritance. This seems like an important decision
to make.

>> * Syntax: do we want to make it similar to one of the many other
>> databases out there? Or we could invent our own?
> 
> Well, what I think we don't want is something that is *almost* like
> some other database but not quite.  I lean toward inventing our own
> since I'm not aware of something that we'd want to copy exactly.
> 
>> I wonder if we could add a clause like DISTRIBUTED BY to complement
>> PARTITION ON that represents a hash distributed/partitioned table (that
>> could be a syntax to support sharded tables maybe; we would definitely
>> want to move ahead in that direction I guess)
> 
> Maybe eventually, but let's not complicate things by worrying too much
> about that now.
> 

Agree that we may not want to mix the two too much at this point.

>> * Catalog: We would like to have a catalog structure suitable to
>> implement capabilities like multi-column partitioning, sub-partitioning
>> (with arbitrary number of levels in the hierarchy). I had suggested
>> that we create two new catalogs viz. pg_partitioned_rel,
>> pg_partition_def to store metadata about a partition key of a
>> partitioned relation and partition bound info of a partition,
>> respectively. Also, see the point about on-disk representation of
>> partition bounds
> 
> I'm not convinced that there is any benefit in spreading this
> information across two tables neither of which exist today.  If the
> representation of the partitioning scheme is going to be a node tree,
> then there's no point in taking what would otherwise have been a List
> and storing each element of it in a separate tuple. The overarching
> point here is that the system catalog structure should be whatever is
> most convenient for the system internals; I'm not sure we understand
> what that is yet.
> 

Agree that some concrete idea of internal representation should help
guide the catalog structure. If we are going to cache the partitioning
info in relcache (which we most definitely will), then we should try to
make sure to consider the scenario of having a lot of partitioned tables
with a lot of individual partitions. It looks like it would be similar
to a scenarios where there are a lot of inheritance hierarchies. But,
availability of partitioning feature would definitely cause these
numbers to grow larger. Perhaps this is an important point driving this
discussion.

I guess this remains tied to the decision we would like make regarding
inheritance (pg_inherits, etc.)

>> * It is desirable to treat partitions as pg_class relations with perhaps
>> a new relkind(s). We may want to choose an implementation where only the
>> lowest level relations in a partitioning hierarchy have storage; those
>> at the upper layers are mere placeholder relations though of course with
>> associated constraints determined by partitioning criteria (with
>> appropriate metadata entered into the additional catalogs).
> 
> I think the storage-less parents need a new relkind precisely to
> denote that they have no storage; I am not convinced that there's any
> reason to change the relkind for the leaf nodes.  But that's been
> proposed, so evidently someone thinks there's a reason to do it.
> 

Again, this remains partly tied to decisions we make regarding catalog
struct

Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)

2015-01-18 Thread Amit Langote
On 19-01-2015 PM 12:37, Ashutosh Bapat wrote:
> On Fri, Jan 16, 2015 at 11:04 PM, Robert Haas  wrote:
> 
>> On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote
>>  wrote:
>>
>>> I wonder if we could add a clause like DISTRIBUTED BY to complement
>>> PARTITION ON that represents a hash distributed/partitioned table (that
>>> could be a syntax to support sharded tables maybe; we would definitely
>>> want to move ahead in that direction I guess)
>>
>> Maybe eventually, but let's not complicate things by worrying too much
>> about that now.
>>
> 
> Instead we might want to specify which server (foreign or local) each of
> the partition go to, something like LOCATED ON clause for each of the
> partitions with default as local server.
> 

Given how things stand today, we do not allow DDL with the FDW
interface, unless I'm missing something. So, we are restricted to only
going the other way around, say,

CREATE FOREIGN TABLE partXX PARTITION OF parent SERVER ...;

assuming we like the proposed syntax -

CREATE TABLE child PARTITION OF parent;

I think this is also assuming we are relying on foreign table
inheritance. That is, both that partitioning is based on inheritance and
foreign tables support inheritance (which should be the case soon)

Still, I think Robert may be correct in that it would not be sooner that
we integrate foreign tables with partitioning scheme (I guess mostly the
syntax aspect of it).

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)

2015-01-18 Thread Ashutosh Bapat
On Fri, Jan 16, 2015 at 11:04 PM, Robert Haas  wrote:

> On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote
>  wrote:
> > * It has been repeatedly pointed out that we may want to decouple
> > partitioning from inheritance because implementing partitioning as an
> > extension of inheritance mechanism means that we have to keep all the
> > existing semantics which might limit what we want to do with the special
> > case of it which is partitioning; in other words, we would find
> > ourselves in difficult position where we have to inject a special case
> > code into a very generalized mechanism that is inheritance.
> > Specifically, do we regard a partitions as pg_inherits children of its
> > partitioning parent?
>
> I don't think this is totally an all-or-nothing decision.  I think
> everyone is agreed that we need to not break things that work today --
> e.g. Merge Append.  What that implies for pg_inherits isn't altogether
> clear.
>
> > * Syntax: do we want to make it similar to one of the many other
> > databases out there? Or we could invent our own?
>
> Well, what I think we don't want is something that is *almost* like
> some other database but not quite.  I lean toward inventing our own
> since I'm not aware of something that we'd want to copy exactly.
>
> > I wonder if we could add a clause like DISTRIBUTED BY to complement
> > PARTITION ON that represents a hash distributed/partitioned table (that
> > could be a syntax to support sharded tables maybe; we would definitely
> > want to move ahead in that direction I guess)
>
> Maybe eventually, but let's not complicate things by worrying too much
> about that now.
>

Instead we might want to specify which server (foreign or local) each of
the partition go to, something like LOCATED ON clause for each of the
partitions with default as local server.


>
> > * Catalog: We would like to have a catalog structure suitable to
> > implement capabilities like multi-column partitioning, sub-partitioning
> > (with arbitrary number of levels in the hierarchy). I had suggested
> > that we create two new catalogs viz. pg_partitioned_rel,
> > pg_partition_def to store metadata about a partition key of a
> > partitioned relation and partition bound info of a partition,
> > respectively. Also, see the point about on-disk representation of
> > partition bounds
>
> I'm not convinced that there is any benefit in spreading this
> information across two tables neither of which exist today.  If the
> representation of the partitioning scheme is going to be a node tree,
> then there's no point in taking what would otherwise have been a List
> and storing each element of it in a separate tuple. The overarching
> point here is that the system catalog structure should be whatever is
> most convenient for the system internals; I'm not sure we understand
> what that is yet.
>
> > * It is desirable to treat partitions as pg_class relations with perhaps
> > a new relkind(s). We may want to choose an implementation where only the
> > lowest level relations in a partitioning hierarchy have storage; those
> > at the upper layers are mere placeholder relations though of course with
> > associated constraints determined by partitioning criteria (with
> > appropriate metadata entered into the additional catalogs).
>
> I think the storage-less parents need a new relkind precisely to
> denote that they have no storage; I am not convinced that there's any
> reason to change the relkind for the leaf nodes.  But that's been
> proposed, so evidently someone thinks there's a reason to do it.
>
> > I am not
> > quite sure if each kind of the relations involved in the partitioning
> > scheme have separate namespaces and, if they are, how we implement that
>
> I am in favor of having all of the nodes in the hierarchy have names
> just as relations do today -- pg_class.relname.  Anything else seems
> to me to be complex to implement and of very marginal benefit.  But
> again, it's been proposed.
>
> > * In the initial implementation, we could just live with partitioning on
> > a set of columns (and not arbitrary expressions of them)
>
> Seems quite fair.
>
> > * We perhaps do not need multi-column LIST partitions as they are not
> > very widely used and may complicate the implementation
>
> I agree that the use case is marginal; but I'm not sure it needs to
> complicate the implementation much.  Depending on how the
> implementation shakes out, prohibiting it might come to seem like more
> of a wart than allowing it.
>
> > * There are a number of suggestions about how we represent partition
> > bounds (on-disk) - pg_node_tree, RECORD (a composite type or the rowtype
> > associated with the relation itself), etc. Important point to consider
> > here may be that partition key may contain more than one column
>
> Yep.
>
> > * How we represent partition definition in memory (for a given
> > partitioned relation) - important point to remember is that such a
> > representation should be efficient to iterate t

Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)

2015-01-16 Thread Robert Haas
On Wed, Jan 14, 2015 at 9:07 PM, Amit Langote
 wrote:
> * It has been repeatedly pointed out that we may want to decouple
> partitioning from inheritance because implementing partitioning as an
> extension of inheritance mechanism means that we have to keep all the
> existing semantics which might limit what we want to do with the special
> case of it which is partitioning; in other words, we would find
> ourselves in difficult position where we have to inject a special case
> code into a very generalized mechanism that is inheritance.
> Specifically, do we regard a partitions as pg_inherits children of its
> partitioning parent?

I don't think this is totally an all-or-nothing decision.  I think
everyone is agreed that we need to not break things that work today --
e.g. Merge Append.  What that implies for pg_inherits isn't altogether
clear.

> * Syntax: do we want to make it similar to one of the many other
> databases out there? Or we could invent our own?

Well, what I think we don't want is something that is *almost* like
some other database but not quite.  I lean toward inventing our own
since I'm not aware of something that we'd want to copy exactly.

> I wonder if we could add a clause like DISTRIBUTED BY to complement
> PARTITION ON that represents a hash distributed/partitioned table (that
> could be a syntax to support sharded tables maybe; we would definitely
> want to move ahead in that direction I guess)

Maybe eventually, but let's not complicate things by worrying too much
about that now.

> * Catalog: We would like to have a catalog structure suitable to
> implement capabilities like multi-column partitioning, sub-partitioning
> (with arbitrary number of levels in the hierarchy). I had suggested
> that we create two new catalogs viz. pg_partitioned_rel,
> pg_partition_def to store metadata about a partition key of a
> partitioned relation and partition bound info of a partition,
> respectively. Also, see the point about on-disk representation of
> partition bounds

I'm not convinced that there is any benefit in spreading this
information across two tables neither of which exist today.  If the
representation of the partitioning scheme is going to be a node tree,
then there's no point in taking what would otherwise have been a List
and storing each element of it in a separate tuple. The overarching
point here is that the system catalog structure should be whatever is
most convenient for the system internals; I'm not sure we understand
what that is yet.

> * It is desirable to treat partitions as pg_class relations with perhaps
> a new relkind(s). We may want to choose an implementation where only the
> lowest level relations in a partitioning hierarchy have storage; those
> at the upper layers are mere placeholder relations though of course with
> associated constraints determined by partitioning criteria (with
> appropriate metadata entered into the additional catalogs).

I think the storage-less parents need a new relkind precisely to
denote that they have no storage; I am not convinced that there's any
reason to change the relkind for the leaf nodes.  But that's been
proposed, so evidently someone thinks there's a reason to do it.

> I am not
> quite sure if each kind of the relations involved in the partitioning
> scheme have separate namespaces and, if they are, how we implement that

I am in favor of having all of the nodes in the hierarchy have names
just as relations do today -- pg_class.relname.  Anything else seems
to me to be complex to implement and of very marginal benefit.  But
again, it's been proposed.

> * In the initial implementation, we could just live with partitioning on
> a set of columns (and not arbitrary expressions of them)

Seems quite fair.

> * We perhaps do not need multi-column LIST partitions as they are not
> very widely used and may complicate the implementation

I agree that the use case is marginal; but I'm not sure it needs to
complicate the implementation much.  Depending on how the
implementation shakes out, prohibiting it might come to seem like more
of a wart than allowing it.

> * There are a number of suggestions about how we represent partition
> bounds (on-disk) - pg_node_tree, RECORD (a composite type or the rowtype
> associated with the relation itself), etc. Important point to consider
> here may be that partition key may contain more than one column

Yep.

> * How we represent partition definition in memory (for a given
> partitioned relation) - important point to remember is that such a
> representation should be efficient to iterate through or
> binary-searchable. Also see the points about tuple-routing and
> partition-pruning

Yep.

> * Overflow/catchall partition: it seems we do not want/need them. It
> might seem desirable for example in cases where a big transaction enters
> a large number of tuples all but one of which find a defined partition;
> we may not want to error out in such case but instead enter that erring
> tuple into the o

Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)

2015-01-14 Thread Amit Langote
On 06-01-2015 PM 03:40, Amit Langote wrote:
> 
> I agree that while we are discussing these points, we could also be
> discussing how we solve problems of existing partitioning implementation
> using whatever the above things end up being. Proposed approaches to
> solve those problems might be useful to drive the first step as well or
> perhaps that's how it should be done anyway.
> 

I realize the discussion has not quite brought us to *conclusions* so
far though surely there has been valuable input from people. Anyway,
starting a new thread with the summary of what has been (please note
that the order of listing the points does not necessarily connote the
priority):

* It has been repeatedly pointed out that we may want to decouple
partitioning from inheritance because implementing partitioning as an
extension of inheritance mechanism means that we have to keep all the
existing semantics which might limit what we want to do with the special
case of it which is partitioning; in other words, we would find
ourselves in difficult position where we have to inject a special case
code into a very generalized mechanism that is inheritance.
Specifically, do we regard a partitions as pg_inherits children of its
partitioning parent?

* Syntax: do we want to make it similar to one of the many other
databases out there? Or we could invent our own? I like the syntax that
Robert suggested that covers the cases of RANGE and LIST partitioning
without actually having to use those keywords explicitly; something like
the following:

CREATE TABLE parent PARTITION ON (column [ USING opclass ] [, ... ]);

CREATE TABLE child PARTITION OF parent_name
   FOR VALUES { (value, ...) [ TO (value, ...) ] }

So instead of making a hard distinction between range and list
partitioning, you can say:

CREATE TABLE child_name PARTITION OF parent_name FOR VALUES (3, 5, 7);

wherein, child is effectively a LIST partition

CREATE TABLE child PARTITION OF parent_name FOR VALUES (8) TO (12);

wherein, child is effectively a RANGE partition on one column

CREATE TABLE child PARTITION OF parent_name FOR VALUES(20, 120) TO (30,
130);

wherein, child is effectively a RANGE partition on two columns

I wonder if we could add a clause like DISTRIBUTED BY to complement
PARTITION ON that represents a hash distributed/partitioned table (that
could be a syntax to support sharded tables maybe; we would definitely
want to move ahead in that direction I guess)

* Catalog: We would like to have a catalog structure suitable to
implement capabilities like multi-column partitioning, sub-partitioning
(with arbitrary number of levels in the hierarchy). I had suggested
that we create two new catalogs viz. pg_partitioned_rel,
pg_partition_def to store metadata about a partition key of a
partitioned relation and partition bound info of a partition,
respectively. Also, see the point about on-disk representation of
partition bounds

* It is desirable to treat partitions as pg_class relations with perhaps
a new relkind(s). We may want to choose an implementation where only the
lowest level relations in a partitioning hierarchy have storage; those
at the upper layers are mere placeholder relations though of course with
associated constraints determined by partitioning criteria (with
appropriate metadata entered into the additional catalogs). I am not
quite sure if each kind of the relations involved in the partitioning
scheme have separate namespaces and, if they are, how we implement that

* In the initial implementation, we could just live with partitioning on
a set of columns (and not arbitrary expressions of them)

* We perhaps do not need multi-column LIST partitions as they are not
very widely used and may complicate the implementation

* There are a number of suggestions about how we represent partition
bounds (on-disk) - pg_node_tree, RECORD (a composite type or the rowtype
associated with the relation itself), etc. Important point to consider
here may be that partition key may contain more than one column

* How we represent partition definition in memory (for a given
partitioned relation) - important point to remember is that such a
representation should be efficient to iterate through or
binary-searchable. Also see the points about tuple-routing and
partition-pruning

* Overflow/catchall partition: it seems we do not want/need them. It
might seem desirable for example in cases where a big transaction enters
a large number of tuples all but one of which find a defined partition;
we may not want to error out in such case but instead enter that erring
tuple into the overflow partition instead. If we choose to implement
that, we would like to also implement the capability to move the tuples
into the appropriate partition once it's defined. Related is the notion
of automatically creating partitions if one is not already defined for a
just entered tuple; but there may be locking troubles if many concurrent
sessions try to do that

* Tuple-routing: based on the intern

Re: [HACKERS] On partitioning

2015-01-05 Thread Amit Langote
On 18-12-2014 AM 04:52, Robert Haas wrote:
> On Wed, Dec 17, 2014 at 1:53 PM, Josh Berkus  wrote:
>>
>> Sure.  But there's a big difference between "we're going to take these
>> steps and that problem will be fixable eventually" and "we're going to
>> retain features of the current partitioning system which make that
>> problem impossible to fix."  The drift of discussion on this thread
>> *sounded* like the latter, and I've been calling attention to the issue
>> in an effort to make sure that it's not.
>>
>> Last week, I wrote a longish email listing out the common problems users
>> have with our current partitioning as a way of benchmarking the plan for
>> new partitioning.  While some people responded to that post, absolutely
>> nobody discussed the list of issues I gave.  Is that because there's
>> universal agreement that I got the major issues right?  Seems doubtful.
> 
> I agreed with many of the things you listed but not all of them.
> However, I don't think it's realistic to burden whatever patch Amit
> writes with the duty of, for example, making global indexes work.
> That's a huge problem all of its own.  Now, conceivably, we could try
> to solve that as part of the next patch by insisting that the
> "partitions" have to really be block number ranges within a single
> relfilenode rather than separate relfilenodes as they are today ...
> but I think that's a bad design which we would likely regret bitterly.
> I also think that it would likely make what's being talked about here
> so complicated that it will never go anywhere.  I think it's better
> that we focus on solving one problem really well - storing metadata
> for partition boundaries in the catalog so that we can do efficient
> tuple routing and partition pruning - and leave the other problems for
> later.
> 

Yes, I think partitioning as a whole is a BIG enough project that we
need to tackle it as a series of steps each of which is a discussion of
its own. The first step might as well be discussing how we represent a
partitioned table. We have a number of design decisions to make during
this step itself and we would definitely want to reach a consensus on
these points.

Things like where we indicate if a table is partitioned (pg_class), what
the partition key looks like, where it is stored, what the partition
definition looks like, where it is stored, how we represent arbitrary
number of levels in partitioning hierarchy, how we implement that only
leaf level relations in a hierarchy have storage, what are implications
of all these choices, etc. Some of these points are being discussed.

I agree that while we are discussing these points, we could also be
discussing how we solve problems of existing partitioning implementation
using whatever the above things end up being. Proposed approaches to
solve those problems might be useful to drive the first step as well or
perhaps that's how it should be done anyway.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-17 Thread Robert Haas
On Wed, Dec 17, 2014 at 1:53 PM, Josh Berkus  wrote:
> On 12/16/2014 07:35 PM, Robert Haas wrote:
>> On Tue, Dec 16, 2014 at 9:01 PM, Josh Berkus  wrote:
>>> Anyway, what I'm saying is that I personally regard the inability to
>>> handle even moderately complex expressions a major failing of our
>>> existing partitioning scheme (possibly its worst single failing), and I
>>> would regard any new partitioning feature which didn't address that
>>> issue as suspect.
>>
>> I understand, but I think you need to be careful not to stonewall all
>> progress in the name of getting what you want.  Getting the
>> partitioning metadata into the system catalogs in a suitable format
>> will be a huge step forward regardless of whether it solves this
>> particular problem right away or not, because it will make it possible
>> to solve this problem in a highly-efficient way, which is quite hard
>> to do right now.
>
> Sure.  But there's a big difference between "we're going to take these
> steps and that problem will be fixable eventually" and "we're going to
> retain features of the current partitioning system which make that
> problem impossible to fix."  The drift of discussion on this thread
> *sounded* like the latter, and I've been calling attention to the issue
> in an effort to make sure that it's not.
>
> Last week, I wrote a longish email listing out the common problems users
> have with our current partitioning as a way of benchmarking the plan for
> new partitioning.  While some people responded to that post, absolutely
> nobody discussed the list of issues I gave.  Is that because there's
> universal agreement that I got the major issues right?  Seems doubtful.

I agreed with many of the things you listed but not all of them.
However, I don't think it's realistic to burden whatever patch Amit
writes with the duty of, for example, making global indexes work.
That's a huge problem all of its own.  Now, conceivably, we could try
to solve that as part of the next patch by insisting that the
"partitions" have to really be block number ranges within a single
relfilenode rather than separate relfilenodes as they are today ...
but I think that's a bad design which we would likely regret bitterly.
I also think that it would likely make what's being talked about here
so complicated that it will never go anywhere.  I think it's better
that we focus on solving one problem really well - storing metadata
for partition boundaries in the catalog so that we can do efficient
tuple routing and partition pruning - and leave the other problems for
later.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-17 Thread Josh Berkus
On 12/17/2014 11:19 AM, Heikki Linnakangas wrote:
> On 12/17/2014 08:53 PM, Josh Berkus wrote:
>> Last week, I wrote a longish email listing out the common problems users
>> have with our current partitioning as a way of benchmarking the plan for
>> new partitioning.  While some people responded to that post, absolutely
>> nobody discussed the list of issues I gave.  Is that because there's
>> universal agreement that I got the major issues right?  Seems doubtful.
> 
> That was a good list.

;-)

Ok, that made my morning.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-17 Thread Heikki Linnakangas

On 12/17/2014 08:53 PM, Josh Berkus wrote:

Last week, I wrote a longish email listing out the common problems users
have with our current partitioning as a way of benchmarking the plan for
new partitioning.  While some people responded to that post, absolutely
nobody discussed the list of issues I gave.  Is that because there's
universal agreement that I got the major issues right?  Seems doubtful.


That was a good list.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-17 Thread Josh Berkus
On 12/16/2014 07:35 PM, Robert Haas wrote:
> On Tue, Dec 16, 2014 at 9:01 PM, Josh Berkus  wrote:
>> Anyway, what I'm saying is that I personally regard the inability to
>> handle even moderately complex expressions a major failing of our
>> existing partitioning scheme (possibly its worst single failing), and I
>> would regard any new partitioning feature which didn't address that
>> issue as suspect.
> 
> I understand, but I think you need to be careful not to stonewall all
> progress in the name of getting what you want.  Getting the
> partitioning metadata into the system catalogs in a suitable format
> will be a huge step forward regardless of whether it solves this
> particular problem right away or not, because it will make it possible
> to solve this problem in a highly-efficient way, which is quite hard
> to do right now.

Sure.  But there's a big difference between "we're going to take these
steps and that problem will be fixable eventually" and "we're going to
retain features of the current partitioning system which make that
problem impossible to fix."  The drift of discussion on this thread
*sounded* like the latter, and I've been calling attention to the issue
in an effort to make sure that it's not.

Last week, I wrote a longish email listing out the common problems users
have with our current partitioning as a way of benchmarking the plan for
new partitioning.  While some people responded to that post, absolutely
nobody discussed the list of issues I gave.  Is that because there's
universal agreement that I got the major issues right?  Seems doubtful.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-16 Thread Robert Haas
On Tue, Dec 16, 2014 at 9:01 PM, Josh Berkus  wrote:
> On 12/16/2014 05:52 PM, Robert Haas wrote:
>> But in a more complicated case where the value there isn't known until
>> runtime, yeah, it scans everything.  I'm not sure what the best way to
>> fix that is.  If the partition bounds were stored in a structured way,
>> as we've been discussing, then the Append or Merge Append node could,
>> when initialized, check which partition the id = X qual routes to and
>> ignore the rest.  But that's more iffy with the current
>> representation, I think.
>
> Huh.  I was just testing:
>
> WHERE event_time BETWEEN timestamptz '2014-12-01' and ( timestamptz
> '2014-12-01' + interval '1 month')
>
> In that case, the expression above got folded to constants by the time
> Postgres did the index scans, but it scanned all partitions.  So somehow
> (timestamptz + interval) doesn't get constant-folded until after
> planning, at least not on 9.3.
>
> And of course this leaves out common patterns like "now() - interval '30
> days'" or "to_timestamp('20141201','MMDD')"
>
> Anyway, what I'm saying is that I personally regard the inability to
> handle even moderately complex expressions a major failing of our
> existing partitioning scheme (possibly its worst single failing), and I
> would regard any new partitioning feature which didn't address that
> issue as suspect.

I understand, but I think you need to be careful not to stonewall all
progress in the name of getting what you want.  Getting the
partitioning metadata into the system catalogs in a suitable format
will be a huge step forward regardless of whether it solves this
particular problem right away or not, because it will make it possible
to solve this problem in a highly-efficient way, which is quite hard
to do right now.

For example, we could (right now) write code that would do run-time
partition pruning by taking the final filter clause, with all values
substituted in, and re-checking for partitions that can be pruned via
constraint exclusion.  But that would be expensive and would often
fail to find anything useful.  Even in the best case where it works
out it's O(n) in the number of partitions, and will therefore perform
badly for large numbers of partitions (even, say, 1000).  But once the
partitioning metadata is stored in the catalog, we can implement this
as a binary search -- O(lg n) time -- and the constant factor should
be lower -- and it will be pretty easy to skip it in cases where it's
useless so that we don't waste cycles spinning our wheels.  Whether
the initial patch covers all the cases you care about or not, and it
probably won't, it will be a really big step towards making it
POSSIBLE to handle those cases.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-16 Thread Josh Berkus
On 12/16/2014 05:52 PM, Robert Haas wrote:
> But in a more complicated case where the value there isn't known until
> runtime, yeah, it scans everything.  I'm not sure what the best way to
> fix that is.  If the partition bounds were stored in a structured way,
> as we've been discussing, then the Append or Merge Append node could,
> when initialized, check which partition the id = X qual routes to and
> ignore the rest.  But that's more iffy with the current
> representation, I think.

Huh.  I was just testing:

WHERE event_time BETWEEN timestamptz '2014-12-01' and ( timestamptz
'2014-12-01' + interval '1 month')

In that case, the expression above got folded to constants by the time
Postgres did the index scans, but it scanned all partitions.  So somehow
(timestamptz + interval) doesn't get constant-folded until after
planning, at least not on 9.3.

And of course this leaves out common patterns like "now() - interval '30
days'" or "to_timestamp('20141201','MMDD')"

Anyway, what I'm saying is that I personally regard the inability to
handle even moderately complex expressions a major failing of our
existing partitioning scheme (possibly its worst single failing), and I
would regard any new partitioning feature which didn't address that
issue as suspect.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-16 Thread Robert Haas
On Tue, Dec 16, 2014 at 1:45 PM, Josh Berkus  wrote:
> Yes, I wasn't saying that expressions should be used when *creating* the
> partitions, which strikes me as a bad idea for several reasons.
> Expressions should be usable when SELECTing data from the partitions.
> Right now, they aren't, because the planner picks parttiions well before
> the rewrite phase which would reduce "extract (month from current_date)"
> to a constant.
>
> Right now, if you partition by an integer ID even, and do:
>
> SELECT * FROM partitioned_table WHERE ID = ( 3 + 4 )
>
> ... postgres will scan all partitions because ( 3 + 4 ) is an expression
> and isn't evaluated until after CE is done.

Well, actually, that case works fine:

rhaas=# create table partitioned_table (id integer, data text);
CREATE TABLE
rhaas=# create table child1 (check (id < 1000)) inherits (partitioned_table);
CREATE TABLE
rhaas=# create table child2 (check (id >= 1000)) inherits (partitioned_table);
CREATE TABLE
rhaas=# explain select * from partitioned_table where id = (3 + 4);
   QUERY PLAN

 Append  (cost=0.00..25.38 rows=7 width=36)
   ->  Seq Scan on partitioned_table  (cost=0.00..0.00 rows=1 width=36)
 Filter: (id = 7)
   ->  Seq Scan on child1  (cost=0.00..25.38 rows=6 width=36)
 Filter: (id = 7)
(5 rows)

The reason is that 3 + 4 gets constant-folded pretty early on in the process.

But in a more complicated case where the value there isn't known until
runtime, yeah, it scans everything.  I'm not sure what the best way to
fix that is.  If the partition bounds were stored in a structured way,
as we've been discussing, then the Append or Merge Append node could,
when initialized, check which partition the id = X qual routes to and
ignore the rest.  But that's more iffy with the current
representation, I think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-16 Thread Amit Langote
On 17-12-2014 AM 12:28, Claudio Freire wrote:
> On Tue, Dec 16, 2014 at 12:15 PM, Robert Haas  wrote:
>> I'm not really sure what you are getting here.  An "otherwise-good
>> expression" basically means a constant.  Index expressions have to be
>> things that always produce the same result given the same input,
>> because otherwise you might get a different result when searching the
>> index than you did when building it, and then you would fail to find
>> keys that are actually present.
> 
> I think the point is partitioning based on the result of an expression
> over row columns. 

Actually, in this case, I was thinking about a partition definition not
partition key definition. That is, using an expression as partition
value which has problems that I see.

> Or if it's not, it should be made anyway:
> 
> PARTITION BY LIST (extract(month from date_created) VALUES (1, 3, 6, 9, 12);
> 
> Or something like that.
> 

Such a thing seems very desirable though there are some tradeoffs
compared to having partitioning key be just attrnums. Or at least we can
start with that.

An arbitrary expression as partitioning key means that we have to
recompute such an expression for each input row. Think how inefficient
that may be when bulk-loading into a partitioned table during, say, a
COPY. Though there may be ways to fix that.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-16 Thread Amit Langote
On 17-12-2014 AM 12:15, Robert Haas wrote:
> On Mon, Dec 15, 2014 at 6:55 PM, Amit Langote
>  wrote:
>> Robert wrote:
>>> I would expect that to fail, just as it would fail if you tried to
>>> build an index using a volatile expression.
>>
>> Oops, wrong example, sorry. In case of an otherwise good expression?
> 
> I'm not really sure what you are getting here.  An "otherwise-good
> expression" basically means a constant.  Index expressions have to be
> things that always produce the same result given the same input,
> because otherwise you might get a different result when searching the
> index than you did when building it, and then you would fail to find
> keys that are actually present.  In the same way, partition boundaries
> also need to be constants.  Maybe you could allow expressions that can
> be constant-folded, but that's about it.  

Yeah, this is what I meant. Expressions that can be constant-folded.
Sorry, the example I chose was pretty lame. I was just thinking about
kind of stuff that something like pg_node_tree would be a good choice
for as on-disk representation of partition values. Though definitely it
wouldn't be to store arbitrary expressions that evaluate to different
values at different times.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-16 Thread Josh Berkus
On 12/15/2014 10:55 AM, Robert Haas wrote:
>> This means if a user puts arbitrary expressions in a partition definition, 
>> say,
>> >
>> > ... FOR VALUES  extract(month from current_date) TO extract(month from 
>> > current_date + interval '3 months'),
>> >
>> > we make sure that those expressions are pre-computed to literal values.
> I would expect that to fail, just as it would fail if you tried to
> build an index using a volatile expression.

Yes, I wasn't saying that expressions should be used when *creating* the
partitions, which strikes me as a bad idea for several reasons.
Expressions should be usable when SELECTing data from the partitions.
Right now, they aren't, because the planner picks parttiions well before
the rewrite phase which would reduce "extract (month from current_date)"
to a constant.

Right now, if you partition by an integer ID even, and do:

SELECT * FROM partitioned_table WHERE ID = ( 3 + 4 )

... postgres will scan all partitions because ( 3 + 4 ) is an expression
and isn't evaluated until after CE is done.

I don't think there's an easy way to do the expression rewrite while
we're still in planning, is there?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-16 Thread Claudio Freire
On Tue, Dec 16, 2014 at 12:15 PM, Robert Haas  wrote:
>  wrote:
>> Robert wrote:
>>> On Sun, Dec 14, 2014 at 9:12 PM, Amit Langote
>>>  wrote:
>>> > This means if a user puts arbitrary expressions in a partition 
>>> > definition, say,
>>> >
>>> > ... FOR VALUES  extract(month from current_date) TO extract(month from
>>> current_date + interval '3 months'),
>>> >
>>> > we make sure that those expressions are pre-computed to literal values.
>>>
>>> I would expect that to fail, just as it would fail if you tried to
>>> build an index using a volatile expression.
>>
>> Oops, wrong example, sorry. In case of an otherwise good expression?
>
> I'm not really sure what you are getting here.  An "otherwise-good
> expression" basically means a constant.  Index expressions have to be
> things that always produce the same result given the same input,
> because otherwise you might get a different result when searching the
> index than you did when building it, and then you would fail to find
> keys that are actually present.

I think the point is partitioning based on the result of an expression
over row columns. Or if it's not, it should be made anyway:

PARTITION BY LIST (extract(month from date_created) VALUES (1, 3, 6, 9, 12);

Or something like that.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-16 Thread Robert Haas
On Mon, Dec 15, 2014 at 6:55 PM, Amit Langote
 wrote:
> Robert wrote:
>> On Sun, Dec 14, 2014 at 9:12 PM, Amit Langote
>>  wrote:
>> > This means if a user puts arbitrary expressions in a partition definition, 
>> > say,
>> >
>> > ... FOR VALUES  extract(month from current_date) TO extract(month from
>> current_date + interval '3 months'),
>> >
>> > we make sure that those expressions are pre-computed to literal values.
>>
>> I would expect that to fail, just as it would fail if you tried to
>> build an index using a volatile expression.
>
> Oops, wrong example, sorry. In case of an otherwise good expression?

I'm not really sure what you are getting here.  An "otherwise-good
expression" basically means a constant.  Index expressions have to be
things that always produce the same result given the same input,
because otherwise you might get a different result when searching the
index than you did when building it, and then you would fail to find
keys that are actually present.  In the same way, partition boundaries
also need to be constants.  Maybe you could allow expressions that can
be constant-folded, but that's about it.  If you allow anything else,
then the partition boundary might "move" once it's been established
and then some of the data will be in the wrong partition.

What possible use case is there for defining partitions with
non-constant boundaries?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-15 Thread Amit Langote

Robert wrote:
> On Sun, Dec 14, 2014 at 9:12 PM, Amit Langote
>  wrote:
> > This means if a user puts arbitrary expressions in a partition definition, 
> > say,
> >
> > ... FOR VALUES  extract(month from current_date) TO extract(month from
> current_date + interval '3 months'),
> >
> > we make sure that those expressions are pre-computed to literal values.
> 
> I would expect that to fail, just as it would fail if you tried to
> build an index using a volatile expression.

Oops, wrong example, sorry. In case of an otherwise good expression?

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-15 Thread Robert Haas
On Sun, Dec 14, 2014 at 9:12 PM, Amit Langote
 wrote:
> This means if a user puts arbitrary expressions in a partition definition, 
> say,
>
> ... FOR VALUES  extract(month from current_date) TO extract(month from 
> current_date + interval '3 months'),
>
> we make sure that those expressions are pre-computed to literal values.

I would expect that to fail, just as it would fail if you tried to
build an index using a volatile expression.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-15 Thread Claudio Freire
On Mon, Dec 15, 2014 at 8:09 AM, José Luis Tallón
 wrote:
> On 12/15/2014 07:42 AM, Claudio Freire wrote:
>>
>> [snip]
>
>
>> If you do that, you start with empty partitions, and each insert updates
>> the BRIN tuple. Avoiding concurrency loss in this case would be tricky, but
>> in theory this could allow very general partition exclusion. In fact it
>> could even work with constraint exclusion right now: you'd have a
>> single-tuple BRIN index for each partition and benefit from it. But you
>> don't need to pay the price of updating BRIN indexes, as min-max tuples for
>> each partition can be produced while creating the partitions if the syntax
>> already provides the information. Then, it's just a matter of querying this
>> meta-data which just happens to have the form of a BRIN tuple for each
>> partition.
>
>
> Yup. Indeed this is the way I outlined in my previous e-mail.
>
> The only point being: Why bother with BRIN when we already have the range
> machinery, and it's trivial to add pointers to partitions from each range?

The part of BRIN I find useful is not its on-disk structure, but all
the execution machinery that checks quals against BRIN tuples. It's
not a trivial part of code, and is especially useful since it's
generalizable. New BRIN operator classes can be created and that's an
interesting power to have in partitioning as well.

Casting from ranges into min-max BRIN tuples seems quite doable, so
both range and list notation should work fine. But BRIN works also for
the generic "routing expression" some people seem to really want, and
dynamically updated BRIN meta-indexes seem to be the only efficient
solution for that.

BRIN lacks some features, as you noted, so it does need some love
before it's usable for this. But they're features BRIN itself would
find useful so you take out two ducks in one shot.

> I suggested that BRIN would solve a situation when the amount of partitions
> is huge (say, thousands) and we might need to be able to efficiently locate
> the appropriate partition. In this situation, a linear search might become
> prohibitive, or the data structure (a simple B-Tree, maybe) become too big
> to be worth keeping in memory. This is where being able to store the
> "partition index" on disk would be interesting.

BRIN also does a linear search, so it doesn't solve that. BRIN's only
power is that it can answer very fast whether some quals rule out a
partition.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-15 Thread José Luis Tallón

On 12/15/2014 07:42 AM, Claudio Freire wrote:

[snip]


If you do that, you start with empty partitions, and each insert 
updates the BRIN tuple. Avoiding concurrency loss in this case would 
be tricky, but in theory this could allow very general partition 
exclusion. In fact it could even work with constraint exclusion right 
now: you'd have a single-tuple BRIN index for each partition and 
benefit from it. But you don't need to pay the price of updating BRIN 
indexes, as min-max tuples for each partition can be produced while 
creating the partitions if the syntax already provides the 
information. Then, it's just a matter of querying this meta-data which 
just happens to have the form of a BRIN tuple for each partition.


Yup. Indeed this is the way I outlined in my previous e-mail.

The only point being: Why bother with BRIN when we already have the 
range machinery, and it's trivial to add pointers to partitions from 
each range?


I suggested that BRIN would solve a situation when the amount of 
partitions is huge (say, thousands) and we might need to be able to 
efficiently locate the appropriate partition. In this situation, a 
linear search might become prohibitive, or the data structure (a simple 
B-Tree, maybe) become too big to be worth keeping in memory. This is 
where being able to store the "partition index" on disk would be 
interesting.


Moreover, I guess that ---by using this approach 
(B-Tree[range]->partition_id and/or BRIN)--- we could efficiently answer 
the question "do we have any tuple with this key in some partition?" 
which AFAICS is pretty close to us having "global indexes".




Regards,

/ J.L.



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-15 Thread Amit Langote

Claudio Freire wrote:
> On Sun, Dec 14, 2014 at 11:12 PM, Amit Langote
>  wrote:
> >> On egress you need some direct way to compare the scan quals with the
> >> partitioning values.  I would imagine this to be similar to how scan
> >> quals are compared to the values stored in a BRIN index: each scan qual
> >> has a corresponding operator strategy and a scan key, and you can say
> >> "aye" or "nay" based on a small set of operations that can be run
> >> cheaply, again without any proof or running arbitrary expressions.
> >>
> >
> > My knowledge of this is far from being perfect, though to clear any
> confusions -
> >
> > As far as planning is concerned, I could not imagine how index access
> method way of pruning partitions could be made to work. Of course, I may
> be missing something.
> 
> Let me be overly verbose, don't take it as patronizing, just answering
> in lots of detail why this could be a good idea to try.
> 

Thanks for explaining. It helps.

> Normal indexes store a pointer for each key value of sorts. So B-Tree
> gets you a set of tids for each key, and so does GIN and hash.
> 
> But BRIN is different in that it does the mapping differently. BRIN
> stores a compact, approximate representation of the set of keys within
> a page range. It can tell with some degree of (in)accuracy whether a
> key or key range could be part of that page range or not. The way it
> does this is abstracted out, but at its core, it stores a "compressed"
> representation of the key set that takes a constant amount of bits to
> store, and no more, no matter how many elements. What changes as the
> element it represents grows, is its accuracy.
> 
> Currently, BRIN only supports min-max representations. It will store,
> for each page range, the minimum and maximum of some columns, and
> when
> you query it, you can compare range vs range, and discard whole page
> ranges.
> 
> Now, what are partitions, if not page ranges?

Yes, I can see a partition as a page range. The fixed summary info in BRIN's 
terms would be range bounds in case this is a rang partition, list of values in 
case this is a list partition and hash value in case this is a hash partition.

There is debate on the topic but each of these partitions also happens to be a 
separate relation. IIUC, BRIN is an access method for a relation (say, 
top-level partitioned relation) that comes into play in executor if that access 
method survives as preferred access method by the planner. I cannot see a way 
to generalize it further and make it support each block range as a separate 
relation and then use it for partition pruning in planner. This is assuming a 
partitioned relation is planned as an Append node which contains a list of 
plans for surviving partition relations based on, say, restrict quals.

I may be thinking of BRIN as a whole as not being generalized enough but I may 
be wrong. Could you point out if so?

> A BRIN tuple is a min-max pair. But BRIN in more general, it could use
> other data structures to hold that "compressed representation", if
> someone implemented them. Like bloom filters [0].
> 
> A BRIN index is a complex data structure because it has to account for
> physically growing tables, but all the complexities vanish when you
> fix a "block range" (the BR in BRIN) to a partition. Then, a mere
> array of BRIN tuples would suffice.
> 
> BRIN already contains the machinery to turn quals into something that
> filters out entire partitions, if you provide the BRIN tuples.
> 

IIUC, that machinery comes into play when, say, a Bitmap Heap scan starts, 
right?

> And you could even effectively matain a BRIN index for the partitions
> (just a BRIN tuple per partition, dynamically updated with every
> insertion).
> 
> If you do that, you start with empty partitions, and each insert
> updates the BRIN tuple. Avoiding concurrency loss in this case would
> be tricky, but in theory this could allow very general partition
> exclusion. In fact it could even work with constraint exclusion right
> now: you'd have a single-tuple BRIN index for each partition and
> benefit from it.
> 
> But you don't need to pay the price of updating BRIN indexes, as
> min-max tuples for each partition can be produced while creating the
> partitions if the syntax already provides the information. Then, it's
> just a matter of querying this meta-data which just happens to have
> the form of a BRIN tuple for each partition.
> 

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-14 Thread Claudio Freire
On Sun, Dec 14, 2014 at 11:12 PM, Amit Langote
 wrote:
>> On egress you need some direct way to compare the scan quals with the
>> partitioning values.  I would imagine this to be similar to how scan
>> quals are compared to the values stored in a BRIN index: each scan qual
>> has a corresponding operator strategy and a scan key, and you can say
>> "aye" or "nay" based on a small set of operations that can be run
>> cheaply, again without any proof or running arbitrary expressions.
>>
>
> My knowledge of this is far from being perfect, though to clear any 
> confusions -
>
> As far as planning is concerned, I could not imagine how index access method 
> way of pruning partitions could be made to work. Of course, I may be missing 
> something.

Let me be overly verbose, don't take it as patronizing, just answering
in lots of detail why this could be a good idea to try.

Normal indexes store a pointer for each key value of sorts. So B-Tree
gets you a set of tids for each key, and so does GIN and hash.

But BRIN is different in that it does the mapping differently. BRIN
stores a compact, approximate representation of the set of keys within
a page range. It can tell with some degree of (in)accuracy whether a
key or key range could be part of that page range or not. The way it
does this is abstracted out, but at its core, it stores a "compressed"
representation of the key set that takes a constant amount of bits to
store, and no more, no matter how many elements. What changes as the
element it represents grows, is its accuracy.

Currently, BRIN only supports min-max representations. It will store,
for each page range, the minimum and maximum of some columns, and when
you query it, you can compare range vs range, and discard whole page
ranges.

Now, what are partitions, if not page ranges?

A BRIN tuple is a min-max pair. But BRIN in more general, it could use
other data structures to hold that "compressed representation", if
someone implemented them. Like bloom filters [0].

A BRIN index is a complex data structure because it has to account for
physically growing tables, but all the complexities vanish when you
fix a "block range" (the BR in BRIN) to a partition. Then, a mere
array of BRIN tuples would suffice.

BRIN already contains the machinery to turn quals into something that
filters out entire partitions, if you provide the BRIN tuples.

And you could even effectively matain a BRIN index for the partitions
(just a BRIN tuple per partition, dynamically updated with every
insertion).

If you do that, you start with empty partitions, and each insert
updates the BRIN tuple. Avoiding concurrency loss in this case would
be tricky, but in theory this could allow very general partition
exclusion. In fact it could even work with constraint exclusion right
now: you'd have a single-tuple BRIN index for each partition and
benefit from it.

But you don't need to pay the price of updating BRIN indexes, as
min-max tuples for each partition can be produced while creating the
partitions if the syntax already provides the information. Then, it's
just a matter of querying this meta-data which just happens to have
the form of a BRIN tuple for each partition.

[0] http://en.wikipedia.org/wiki/Bloom_filter


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-14 Thread Amit Langote

Alvaro wrote:
> Claudio Freire wrote:
> 
> > Fair enough, but that's not the same as not requiring easy proofs. The
> > planner might not the one doing the proofs, but you still need proofs.
> >
> > Even if the proving method is hardcoded into the partitioning method,
> > as in the case of list or range partitioning, it's still a proof. With
> > arbitrary functions (which is what prompted me to mention proofs) you
> > can't do that. A function works very well for inserting, but not for
> > selecting.
> >
> > I could be wrong though. Maybe there's a way to turn SQL functions
> > into analyzable things? But it would still be very easy to shoot
> > yourself in the foot by writing one that is too complex.
> 
> Arbitrary SQL expressions (including functions) are not the thing to use
> for partitioning -- at least that's how I understand this whole
> discussion.  I don't think you want to do "proofs" as such -- they are
> expensive.
> 

This means if a user puts arbitrary expressions in a partition definition, say,

... FOR VALUES  extract(month from current_date) TO extract(month from 
current_date + interval '3 months'),

we make sure that those expressions are pre-computed to literal values. The 
exact time when that happens is open for discussion I guess. It could be either 
DDL time or, if feasible, during relation cache building when we compute the 
value from pg_node_tree of this expression which we may choose to store in the 
partition definition catalog. The former entails an obvious challenge of 
figuring out how we store the computed value into catalog (pg_node_tree of a 
Const?).

> To make this discussion a bit clearer, there are two things to
> distinguish: one is routing tuples, when an INSERT or COPY command
> references the partitioned table, into the individual partitions
> (ingress); the other is deciding which partitions to read when a SELECT
> query wants to read tuples from the partitioned table (egress).
> 
> On ingress, what you want is something like being able to do something
> on the tuple that tells you which partition it belongs into.  Ideally
> this is something much lighter than running an expression; if you can
> just apply an operator to the partitioning column values, that should be
> plenty fast.  This requires no proof.
> 

And I am thinking this's all executor stuff.

> On egress you need some direct way to compare the scan quals with the
> partitioning values.  I would imagine this to be similar to how scan
> quals are compared to the values stored in a BRIN index: each scan qual
> has a corresponding operator strategy and a scan key, and you can say
> "aye" or "nay" based on a small set of operations that can be run
> cheaply, again without any proof or running arbitrary expressions.
> 

My knowledge of this is far from being perfect, though to clear any confusions -

As far as planning is concerned, I could not imagine how index access method 
way of pruning partitions could be made to work. Of course, I may be missing 
something. 

When you say "scan qual has a corresponding operator strategy", I'd think that 
is a part of scan key in executor, no?

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-13 Thread Amit Langote
On Sun, Dec 14, 2014 at 1:40 AM, José Luis Tallón
 wrote:
> On 12/12/2014 05:43 AM, Amit Langote wrote:
>
> Amit: mind if I add the DB2 syntax for partitioning to the wiki, too?
>
> This might as well help with deciding the final form of partitioning
> (and define the first implementation boundaries, too)
>

Please go ahead.

Thanks,
Amit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-13 Thread Amit Langote
On Sun, Dec 14, 2014 at 1:57 AM, José Luis Tallón
 wrote:
> On 12/13/2014 03:09 AM, Alvaro Herrera wrote:
>>
>> [snip]
>> Arbitrary SQL expressions (including functions) are not the thing to use
>> for partitioning -- at least that's how I understand this whole
>> discussion.  I don't think you want to do "proofs" as such -- they are
>> expensive.
>
>
> Yup. Plus, it looks like (from reading Oracle's documentation) they end up
> converting the LESS THAN clauses into range lists internally.
> Anyone that can attest to this? (or just disprove it, if I'm wrong)
>
> I just suggested using the existing RangeType infrastructure for this ( <<,
>>> and && operators, specifically, might do the trick) before reading your
> mail citing BRIN.
> ... which might as well allow some interesting runtime optimizations
> when range partitioning is used and *a huge* number of partitions get
> defined --- I'm specifically thinking about massive OLTP with very deep
> (say, 5 years' worth) archival partitioning where it would be inconvenient
> to have the tuple routing information always in memory.
> I'm specifically suggesting some ( range_value -> partitionOID) mapping
> using a BRIN index for this --- it could be auto-created just like we do for
> primary keys.
>
> Just my 2c

Since we are keen on being able to reuse existing infrastructure, I
think this and RangeType, ArrayType stuff is worth thinking about
though I am afraid we may lose a certain level of generality of
expression we might very well be able to afford. Though that's
something difficult to definitely say without actually studying it a
little more detail which I haven't quite yet. We may be able to go
somewhere with it perhaps. And of course the original designers of the
infrastructure in question would be better able to vouch for it I
think.

Thanks,
Amit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-13 Thread José Luis Tallón

On 12/13/2014 05:57 PM, José Luis Tallón wrote:

On 12/13/2014 03:09 AM, Alvaro Herrera wrote:

[snip]
Arbitrary SQL expressions (including functions) are not the thing to use
for partitioning -- at least that's how I understand this whole
discussion.  I don't think you want to do "proofs" as such -- they are
expensive.


Yup. Plus, it looks like (from reading Oracle's documentation) they 
end up converting the LESS THAN clauses into range lists internally.

Anyone that can attest to this? (or just disprove it, if I'm wrong)

I just suggested using the existing RangeType infrastructure for this 
( <<, >> and && operators, specifically, might do the trick) before 
reading your mail citing BRIN.
... which might as well allow some interesting runtime 
optimizations when range partitioning is used and *a huge* number of 
partitions get defined --- I'm specifically thinking about massive 
OLTP with very deep (say, 5 years' worth) archival partitioning where 
it would be inconvenient to have the tuple routing information always 
in memory.
I'm specifically suggesting some ( range_value -> partitionOID) 
mapping using a BRIN index for this --- it could be auto-created just 
like we do for primary keys.


Reviewing the existing documentation on this topic I have stumbled on an 
e-mail by Simon Riggs from almost seven years ago

http://www.postgresql.org/message-id/1199296574.7260.149.ca...@ebony.site

 where he suggested a way of physically partitioning tables by using 
segments in a way that sounds to be quite close to what we are proposing 
here.


ISTM that the partitioning meta-data might very well be augmented a bit 
in the direction Simon pointed to, adding support for "effectively 
read-only" and/or "explicitly marked read-only" PARTITIONS (not segments 
in this case) for an additional optimization. We would need some syntax 
additions (ALTER PARTITION  SET READONLY) in this case.

This feature can be added later on, of course.


I'd like to explicitly remark the potentially performance-enhancing 
effect of fillfactor=100 (cfr. 
http://www.postgresql.org/docs/9.3/static/sql-createtable.html) and 
partitions marked "effectively read-only" (cfr. Simon's proposal) when 
coupled with "fullscan analyze" vs. the regular sample-based analyze 
that autovacuum performs.
When a partition consists of multiple *segments*, a generalization of 
the proposed BRIN index (to cover segments in addition to partitions) 
will further speed up scans.





Just for the record, allowing some partitions to be moved to foreign 
tables (i.e. foreign servers, via postgres_fdw) will multiply the 
usefullness of this "partitioned table wide" BRIN index  now 
becoming a real "global index".



Just my 2c


Thanks,

/ J.L.







--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-13 Thread David Fetter
On Fri, Dec 12, 2014 at 09:03:12AM -0500, Robert Haas wrote:

> Yeah, range and list partition definitions are very similar, but
> hash partition definitions are a different kettle of fish.  I don't
> think we really need hash partitioning for anything right away -
> it's pretty useless unless you've got, say, a way for the partitions
> to be foreign tables living on remote servers -

There's a patch enabling exactly this feature in the queue for 9.5.

https://commitfest.postgresql.org/action/patch_view?id=1386

> but we shouldn't pick a design that will make it really hard to add
> later.

Indeed not :)

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-13 Thread Jim Nasby

On 12/12/14, 3:48 PM, Robert Haas wrote:

On Fri, Dec 12, 2014 at 4:28 PM, Jim Nasby  wrote:

Sure.  Mind you, I'm not proposing that the syntax I just mooted is
actually for the best.  What I'm saying is that we need to talk about
it.


Frankly, if we're going to require users to explicitly define each partition
then I think the most appropriate API would be a function. Users will be
writing code to create new partitions as needed, and it's generally easier
to write code that calls a function as opposed to glomming a text string
together and passing that to EXECUTE.


I have very little idea what the API you're imagining would actually
look like from this description, but it sounds like a terrible idea.
We don't want to make this infinitely general.  We need a *fast* way
to go from a value (or list of values, one per partitioning column) to
a partition OID, and the way to get there is not to call arbitrary
user code.


You were talking about the syntax for partition creation/definition; that's the 
API I was referring to.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-13 Thread José Luis Tallón

On 12/13/2014 03:09 AM, Alvaro Herrera wrote:

[snip]
Arbitrary SQL expressions (including functions) are not the thing to use
for partitioning -- at least that's how I understand this whole
discussion.  I don't think you want to do "proofs" as such -- they are
expensive.


Yup. Plus, it looks like (from reading Oracle's documentation) they end 
up converting the LESS THAN clauses into range lists internally.

Anyone that can attest to this? (or just disprove it, if I'm wrong)

I just suggested using the existing RangeType infrastructure for this ( 
<<, >> and && operators, specifically, might do the trick) before 
reading your mail citing BRIN.
... which might as well allow some interesting runtime 
optimizations when range partitioning is used and *a huge* number of 
partitions get defined --- I'm specifically thinking about massive OLTP 
with very deep (say, 5 years' worth) archival partitioning where it 
would be inconvenient to have the tuple routing information always in 
memory.
I'm specifically suggesting some ( range_value -> partitionOID) mapping 
using a BRIN index for this --- it could be auto-created just like we do 
for primary keys.


Just my 2c


Thanks,

/ J.L.



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-13 Thread José Luis Tallón

On 12/12/2014 05:43 AM, Amit Langote wrote:

[snip]
In case of what we would have called a 'LIST' partition, this could look like

... FOR VALUES (val1, val2, val3, ...)

Assuming we only support partition key to contain only one column in such a 
case.


Hmmm….

[...] PARTITION BY LIST(col1 [, col2, ...])

just like we do for indexes would do.


and CREATE PARTITION child_name OF parent_name
FOR [VALUES] (val1a,val2a), (val1b,val2b), (val1c,val2c)
[IN tblspc_name]

just like we do for multi-valued inserts.


In case of what we would have called a 'RANGE' partition, this could look like

... FOR VALUES (val1min, val2min, ...) TO (val1max, val2max, ...)

How about BETWEEN ... AND ... ?


Unless I'm missing something obvious, we already have range types for 
this, don't we?


...   PARTITION BY RANGE (col)

CREATE PARTITION child_name OF parent_name
FOR [VALUES] '[val1min,val1max)', '[val2min,val2max)', 
'[val3min,val3max)'

[IN tblspc_name]

and I guess this should simplify a fully flexible implementation (if you 
can construct a RangeType for it, you can use that for partitioning).
This would substitute the ugly (IMHO) "VALUES LESS THAN" syntax with a 
more flexible one
(even though it might end up being converted into "less than" 
boundaries internally for implementation/optimization purposes)


In both cases we would need to allow for overflows / default partition 
different from the parent table.



Plus some ALTER PARTITION part_name TABLESPACE=tblspc_name


The main problem being that we are assuming named partitions here, which 
might not be that practical at all.



[snip]
I would include the noise keyword VALUES just for readability if 
anything. 


+1


FWIW, deviating from already "standard" syntax (Oracle-like --as 
implemented by PPAS for example-- or DB2-like) is quite 
counter-productive unless we have very good reasons for it... which 
doesn't mean that we have to do it exactly like they do (specially if we 
would like to go the incremental implementation route).


Amit: mind if I add the DB2 syntax for partitioning to the wiki, too?

This might as well help with deciding the final form of 
partitioning (and define the first implementation boundaries, too)



Thanks,

/ J.L.




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-12 Thread Claudio Freire
El 12/12/2014 23:09, "Alvaro Herrera"  escribió:
>
> Claudio Freire wrote:
>
> > Fair enough, but that's not the same as not requiring easy proofs. The
> > planner might not the one doing the proofs, but you still need proofs.
> >
> > Even if the proving method is hardcoded into the partitioning method,
> > as in the case of list or range partitioning, it's still a proof. With
> > arbitrary functions (which is what prompted me to mention proofs) you
> > can't do that. A function works very well for inserting, but not for
> > selecting.
> >
> > I could be wrong though. Maybe there's a way to turn SQL functions
> > into analyzable things? But it would still be very easy to shoot
> > yourself in the foot by writing one that is too complex.
>
> Arbitrary SQL expressions (including functions) are not the thing to use
> for partitioning -- at least that's how I understand this whole
> discussion.  I don't think you want to do "proofs" as such -- they are
> expensive.
>
> To make this discussion a bit clearer, there are two things to
> distinguish: one is routing tuples, when an INSERT or COPY command
> references the partitioned table, into the individual partitions
> (ingress); the other is deciding which partitions to read when a SELECT
> query wants to read tuples from the partitioned table (egress).
>
> On ingress, what you want is something like being able to do something
> on the tuple that tells you which partition it belongs into.  Ideally
> this is something much lighter than running an expression; if you can
> just apply an operator to the partitioning column values, that should be
> plenty fast.  This requires no proof.
>
> On egress you need some direct way to compare the scan quals with the
> partitioning values.  I would imagine this to be similar to how scan
> quals are compared to the values stored in a BRIN index: each scan qual
> has a corresponding operator strategy and a scan key, and you can say
> "aye" or "nay" based on a small set of operations that can be run
> cheaply, again without any proof or running arbitrary expressions.

Interesting that you mention BRIN. It does seem that it could be made to
work with BRIN's operator classes.

In fact, a partition-wide brin tuple could be stored per partition and that
in itself could be the definition for the partition.

Either preinitialized or dynamically updated. Would work even for arbitrary
routing functions, especially if the operator class to use is customizable.

I stand corrected.


Re: [HACKERS] On partitioning

2014-12-12 Thread Alvaro Herrera
Claudio Freire wrote:

> Fair enough, but that's not the same as not requiring easy proofs. The
> planner might not the one doing the proofs, but you still need proofs.
> 
> Even if the proving method is hardcoded into the partitioning method,
> as in the case of list or range partitioning, it's still a proof. With
> arbitrary functions (which is what prompted me to mention proofs) you
> can't do that. A function works very well for inserting, but not for
> selecting.
> 
> I could be wrong though. Maybe there's a way to turn SQL functions
> into analyzable things? But it would still be very easy to shoot
> yourself in the foot by writing one that is too complex.

Arbitrary SQL expressions (including functions) are not the thing to use
for partitioning -- at least that's how I understand this whole
discussion.  I don't think you want to do "proofs" as such -- they are
expensive.

To make this discussion a bit clearer, there are two things to
distinguish: one is routing tuples, when an INSERT or COPY command
references the partitioned table, into the individual partitions
(ingress); the other is deciding which partitions to read when a SELECT
query wants to read tuples from the partitioned table (egress).

On ingress, what you want is something like being able to do something
on the tuple that tells you which partition it belongs into.  Ideally
this is something much lighter than running an expression; if you can
just apply an operator to the partitioning column values, that should be
plenty fast.  This requires no proof.

On egress you need some direct way to compare the scan quals with the
partitioning values.  I would imagine this to be similar to how scan
quals are compared to the values stored in a BRIN index: each scan qual
has a corresponding operator strategy and a scan key, and you can say
"aye" or "nay" based on a small set of operations that can be run
cheaply, again without any proof or running arbitrary expressions.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-12 Thread Claudio Freire
On Fri, Dec 12, 2014 at 7:40 PM, Josh Berkus  wrote:
> On 12/12/2014 02:10 PM, Tom Lane wrote:
>> Actually, I'm not sure that's what we want.  I thought what we really
>> wanted here was to postpone partition-routing decisions to runtime,
>> so that the behavior would be efficient whether or not the decision
>> could be predetermined at plan time.
>>
>> This still leads to the same point Robert is making: the routing
>> decisions have to be cheap and fast.  But it's wrong to think of it
>> in terms of planner proofs.
>
> The other reason I'd really like to have the new partitioning taken out
> of the planner: expressions.
>
> Currently, if you have partitions with constraints on, day,
> "event_date", the following WHERE clause will NOT use CE and will scan
> all partitions:
>
> WHERE event_date BETWEEN ( '2014-12-11' - interval '1 month' ) and
> '2014-12-11'.
>
> This is despite the fact that the expression above gets rewritten to a
> constant by the time the query is executed; by then it's too late.  To
> say nothing of functions like to_timestamp(), now(), etc.
>
> As long as partitions need to be chosen at plan time, I don't see a good
> way to fix the expression problem.

Fair enough, but that's not the same as not requiring easy proofs. The
planner might not the one doing the proofs, but you still need proofs.

Even if the proving method is hardcoded into the partitioning method,
as in the case of list or range partitioning, it's still a proof. With
arbitrary functions (which is what prompted me to mention proofs) you
can't do that. A function works very well for inserting, but not for
selecting.

I could be wrong though. Maybe there's a way to turn SQL functions
into analyzable things? But it would still be very easy to shoot
yourself in the foot by writing one that is too complex.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-12 Thread Josh Berkus
On 12/12/2014 02:10 PM, Tom Lane wrote:
> Actually, I'm not sure that's what we want.  I thought what we really
> wanted here was to postpone partition-routing decisions to runtime,
> so that the behavior would be efficient whether or not the decision
> could be predetermined at plan time.
> 
> This still leads to the same point Robert is making: the routing
> decisions have to be cheap and fast.  But it's wrong to think of it
> in terms of planner proofs.

The other reason I'd really like to have the new partitioning taken out
of the planner: expressions.

Currently, if you have partitions with constraints on, day,
"event_date", the following WHERE clause will NOT use CE and will scan
all partitions:

WHERE event_date BETWEEN ( '2014-12-11' - interval '1 month' ) and
'2014-12-11'.

This is despite the fact that the expression above gets rewritten to a
constant by the time the query is executed; by then it's too late.  To
say nothing of functions like to_timestamp(), now(), etc.

As long as partitions need to be chosen at plan time, I don't see a good
way to fix the expression problem.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-12 Thread Claudio Freire
On Fri, Dec 12, 2014 at 7:10 PM, Tom Lane  wrote:
> Claudio Freire  writes:
>> On Fri, Dec 12, 2014 at 6:48 PM, Robert Haas  wrote:
>>> I have very little idea what the API you're imagining would actually
>>> look like from this description, but it sounds like a terrible idea.
>>> We don't want to make this infinitely general.  We need a *fast* way
>>> to go from a value (or list of values, one per partitioning column) to
>>> a partition OID, and the way to get there is not to call arbitrary
>>> user code.
>
>> I think this was mentioned upthread, but I'll repeat it anyway since
>> it seems to need repeating.
>
>> More than fast, you want it analyzable (by the planner). Ie: it has to
>> be easy to prove partition exclusion against a where clause.
>
> Actually, I'm not sure that's what we want.  I thought what we really
> wanted here was to postpone partition-routing decisions to runtime,
> so that the behavior would be efficient whether or not the decision
> could be predetermined at plan time.
>
> This still leads to the same point Robert is making: the routing
> decisions have to be cheap and fast.  But it's wrong to think of it
> in terms of planner proofs.

You'll need proofs whether at the planner or at the execution engine.

A sequential scan over a partition with a query like

select * from foo where date between X and Y

Would be ripe for that but at some point you need to prove that the
where clause excludes whole partitions. Be it at runtime (while
executing the sequential scan node) or planning time.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-12 Thread Tom Lane
Claudio Freire  writes:
> On Fri, Dec 12, 2014 at 6:48 PM, Robert Haas  wrote:
>> I have very little idea what the API you're imagining would actually
>> look like from this description, but it sounds like a terrible idea.
>> We don't want to make this infinitely general.  We need a *fast* way
>> to go from a value (or list of values, one per partitioning column) to
>> a partition OID, and the way to get there is not to call arbitrary
>> user code.

> I think this was mentioned upthread, but I'll repeat it anyway since
> it seems to need repeating.

> More than fast, you want it analyzable (by the planner). Ie: it has to
> be easy to prove partition exclusion against a where clause.

Actually, I'm not sure that's what we want.  I thought what we really
wanted here was to postpone partition-routing decisions to runtime,
so that the behavior would be efficient whether or not the decision
could be predetermined at plan time.

This still leads to the same point Robert is making: the routing
decisions have to be cheap and fast.  But it's wrong to think of it
in terms of planner proofs.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-12 Thread Claudio Freire
On Fri, Dec 12, 2014 at 6:48 PM, Robert Haas  wrote:
> On Fri, Dec 12, 2014 at 4:28 PM, Jim Nasby  wrote:
>>> Sure.  Mind you, I'm not proposing that the syntax I just mooted is
>>> actually for the best.  What I'm saying is that we need to talk about
>>> it.
>>
>> Frankly, if we're going to require users to explicitly define each partition
>> then I think the most appropriate API would be a function. Users will be
>> writing code to create new partitions as needed, and it's generally easier
>> to write code that calls a function as opposed to glomming a text string
>> together and passing that to EXECUTE.
>
> I have very little idea what the API you're imagining would actually
> look like from this description, but it sounds like a terrible idea.
> We don't want to make this infinitely general.  We need a *fast* way
> to go from a value (or list of values, one per partitioning column) to
> a partition OID, and the way to get there is not to call arbitrary
> user code.

I think this was mentioned upthread, but I'll repeat it anyway since
it seems to need repeating.

More than fast, you want it analyzable (by the planner). Ie: it has to
be easy to prove partition exclusion against a where clause.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-12 Thread Robert Haas
On Fri, Dec 12, 2014 at 4:28 PM, Jim Nasby  wrote:
>> Sure.  Mind you, I'm not proposing that the syntax I just mooted is
>> actually for the best.  What I'm saying is that we need to talk about
>> it.
>
> Frankly, if we're going to require users to explicitly define each partition
> then I think the most appropriate API would be a function. Users will be
> writing code to create new partitions as needed, and it's generally easier
> to write code that calls a function as opposed to glomming a text string
> together and passing that to EXECUTE.

I have very little idea what the API you're imagining would actually
look like from this description, but it sounds like a terrible idea.
We don't want to make this infinitely general.  We need a *fast* way
to go from a value (or list of values, one per partitioning column) to
a partition OID, and the way to get there is not to call arbitrary
user code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-12 Thread Jim Nasby

On 12/12/14, 8:03 AM, Robert Haas wrote:

On Thu, Dec 11, 2014 at 11:43 PM, Amit Langote
  wrote:

>In case of what we would have called a 'LIST' partition, this could look like
>
>... FOR VALUES (val1, val2, val3, ...)
>
>Assuming we only support partition key to contain only one column in such a 
case.
>
>In case of what we would have called a 'RANGE' partition, this could look like
>
>... FOR VALUES (val1min, val2min, ...) TO (val1max, val2max, ...)
>
>How about BETWEEN ... AND ... ?

Sure.  Mind you, I'm not proposing that the syntax I just mooted is
actually for the best.  What I'm saying is that we need to talk about
it.


Frankly, if we're going to require users to explicitly define each partition 
then I think the most appropriate API would be a function. Users will be 
writing code to create new partitions as needed, and it's generally easier to 
write code that calls a function as opposed to glomming a text string together 
and passing that to EXECUTE.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-12 Thread Robert Haas
On Thu, Dec 11, 2014 at 11:43 PM, Amit Langote
 wrote:
> In case of what we would have called a 'LIST' partition, this could look like
>
> ... FOR VALUES (val1, val2, val3, ...)
>
> Assuming we only support partition key to contain only one column in such a 
> case.
>
> In case of what we would have called a 'RANGE' partition, this could look like
>
> ... FOR VALUES (val1min, val2min, ...) TO (val1max, val2max, ...)
>
> How about BETWEEN ... AND ... ?

Sure.  Mind you, I'm not proposing that the syntax I just mooted is
actually for the best.  What I'm saying is that we need to talk about
it.

> I am not sure but perhaps RANGE and LIST as partitioning kinds may as well 
> just be noise keywords. We can parse those values into a parse node such that 
> we don’t have to care about whether they describe partition as being one kind 
> or the other. Say a List of something like,
>
> typedef struct PartitionColumnValue
> {
> NodeTagtype,
> Oid*partitionid,
> char   *partcolname,
> Node   *partrangelower,
> Node   *partrangeupper,
> List   *partlistvalues
> };
>
> Or we could still add a (char) partkind just to say which of the fields 
> matter.
>
> We don't need any defining values here for hash partitions if and when we add 
> support for the same. We would either be using a system-wide common hash 
> function or we could add something with partitioning key definition.

Yeah, range and list partition definitions are very similar, but hash
partition definitions are a different kettle of fish.  I don't think
we really need hash partitioning for anything right away - it's pretty
useless unless you've got, say, a way for the partitions to be foreign
tables living on remote servers - but we shouldn't pick a design that
will make it really hard to add later.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-11 Thread Amit Kapila
On Thu, Dec 11, 2014 at 8:42 PM, Robert Haas  wrote:
>
> On Thu, Dec 11, 2014 at 12:00 AM, Amit Kapila 
wrote:
> > Yeah either this way or what Josh has suggested upthread, the main
> > point was that if at all we want to support multi-column list
partitioning
> > then we need to have slightly different syntax, however I feel that we
> > can leave multi-column list partitioning for first version.
>
> Yeah, possibly.
>
> I think we could stand to have a lot more discussion about the syntax
> here.  So far the idea seems to be to copy what Oracle has, but it's
> not clear if we're going to have exactly what Oracle has or something
> subtly different.  I personally don't find the Oracle syntax very
> PostgreSQL-ish.

I share your concern w.r.t the difficulties it can create if we don't
do it carefully (one of the issue you have mentioned upthread about
pg_dump, other such things could cause problems, if not thought
of carefully from the beginning).  One more thing, on a quick check
it seems to me even DB2 uses some-thing similar to Oracle for
defining partitions

CREATE TABLE orders(id INT, shipdate DATE, …)
PARTITION BY RANGE(shipdate)
( PARTITION q4_05 STARTING MINVALUE,
  PARTITION q1_06 STARTING '1/1/2006',
  PARTITION q2_06 STARTING '4/1/2006',
  PARTITION q3_06 STARTING '7/1/2006',
  PARTITION q4_06 STARTING '10/1/2006'
  ENDING ‘12/31/2006' )

I don't think there is any pressing need for PostgreSQL to use
syntax similar to what some of the other databases use, however
it has an advantage for ease of migration and ease of use (as
people are already familiar with using such syntax).

> Stuff like "VALUES LESS THAN 500" doesn't sit
> especially well with me - less than according to which opclass?  Are
> we going to insist that partitioning must use the default btree
> opclass so that we can use that syntax?  That seems kind of lame.
>

Can't we simply specify the opclass along with column name while
specifying partition clause which I feel is something similar to we
already do in CREATE INDEX syntax.

CREATE TABLE sales
 ( invoice_no NUMBER,
   sale_year  INT NOT NULL,
   sale_month INT NOT NULL,
   sale_day   INT NOT NULL )
   PARTITION BY RANGE ( sale_year )
 ( PARTITION sales_q1 VALUES LESS THAN (1999)

Isn't the default operator class for a partition column would fit the
bill for this particular case as the operators required in this syntax
will be quite simple?

> There are lots of interesting things we could do here, e.g.:
>
> CREATE TABLE parent_name PARTITION ON (column [ USING opclass ] [, ... ]);
> CREATE TABLE child_name PARTITION OF parent_name
>FOR { (value, ...) [ TO (value, ...) ] } [, ...];
>

The only thing which slightly bothers me about this syntax is that
it makes apparent that partitions are separate tables and it would
be inconvenient if we choose to disallow some operations on
partitions.  I think it might be better we treat partitions as a way
to divide the large amount of data and users be only given the
option to specify boundaries to divide this data and storage mechanism
of partitions should be an internal detail (something like we do in
TOAST table case).  I am not sure which syntax users will be more
comfortable to use as I am seeing and using Oracle type syntax from
long time so my opinion could be biased in this case.  It would be really
helpful if others who need or use partitioning scheme can share their
inputs.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] On partitioning

2014-12-11 Thread Amit Langote

> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> On Thu, Dec 11, 2014 at 12:00 AM, Amit Kapila 
> wrote:
> > Yeah either this way or what Josh has suggested upthread, the main
> > point was that if at all we want to support multi-column list partitioning
> > then we need to have slightly different syntax, however I feel that we
> > can leave multi-column list partitioning for first version.
> 
> Yeah, possibly.
> 
> I think we could stand to have a lot more discussion about the syntax
> here.  So far the idea seems to be to copy what Oracle has, but it's
> not clear if we're going to have exactly what Oracle has or something
> subtly different.  I personally don't find the Oracle syntax very
> PostgreSQL-ish.  Stuff like "VALUES LESS THAN 500" doesn't sit
> especially well with me - less than according to which opclass?  Are
> we going to insist that partitioning must use the default btree
> opclass so that we can use that syntax?  That seems kind of lame.
> 

Syntax like VALUES LESS THAN 500 also means, we then have to go figure out 
what's that partition's lower bound based on upper bound of the previous one. 
Forget holes in the range if they matter. I expressed that concern elsewhere in 
favour of having available both a range's lower and upper bounds.

> There are lots of interesting things we could do here, e.g.:
> 
> CREATE TABLE parent_name PARTITION ON (column [ USING opclass ] [, ... ]);

So, no PARTITION BY [RANGE | LIST] clause huh?

What we are calling pg_partitioned_rel would obtain following bits of 
information from such a definition of a partitioned relation:

 * column(s) to partition on and respective opclass(es)
 * the level this partitioned relation lies in the partitioning hierarchy
 (determining its relkind and storage qualification)

By the way, I am not sure how we define a partitioning key on a partition (in 
other words, a subpartitioning key on the corresponding partitioned relation). 
Perhaps (only) via ALTER TABLE on a partition relation?

> CREATE TABLE child_name PARTITION OF parent_name
>FOR { (value, ...) [ TO (value, ...) ] } [, ...];
> 

So it's still a CREATE "TABLE" but the part 'PARTITION OF' turns this "table" 
into something having characteristics of a partition relation getting all kinds 
of new treatments at various places. It appears there is a redistribution of 
table-characteristics between a partitioned relation and its partition. We take 
away storage from the former and instead give it to the latter. On the other 
hand, the latter's data is only accessible through the former perhaps with 
escape routes for direct access via some special syntax attached to various 
access commands. We also stand to lose certain abilities with a partitioned 
relation such as not able to define a unique constraint (other than what 
partition key could potentially help ensure) or use it as target of foreign key 
constraint (just reiterating).

What we call pg_partition_def obtains following bits of information from such a 
definition of a partition relation:

 * parent relation (partitioned relation this is partition of)
 * partition kind (do we even want to keep carrying this 
 around as a separate field in catalog?)
 * values this partition holds

The last part being the most important.

In case of what we would have called a 'LIST' partition, this could look like

... FOR VALUES (val1, val2, val3, ...)

Assuming we only support partition key to contain only one column in such a 
case.

In case of what we would have called a 'RANGE' partition, this could look like

... FOR VALUES (val1min, val2min, ...) TO (val1max, val2max, ...)

How about BETWEEN ... AND ... ?

Here we allow a partition key to contain more than one column.

> So instead of making a hard distinction between range and list
> partitioning, you can say:
> 
> CREATE TABLE child_name PARTITION OF parent_name FOR (3), (5), (7);
> CREATE TABLE child2_name PARTITION OF parent_name FOR (8) TO (12);
> CREATE TABLE child2_name PARTITION OF parent_name FOR (20) TO (30),
> (120) TO (130);
> 

I would include the noise keyword VALUES just for readability if anything.

> Now that might be a crappy idea for various reasons, but the point is
> there are a lot of details to be hammered out with the syntax, and
> there are several ways we can go wrong.  If we choose an
> overly-limiting syntax, we're needlessly restricting what can be done.
> If we choose an overly-permissive syntax, we'll restrict the
> optimization opportunities.
> 

I am not sure but perhaps RANGE and LIST as partitioning kinds may as well just 
be noise keywords. We can parse those values into a parse node such that we 
don’t have to care about whether they describe partition as being one kind or 
the other. Say a List of something like,

typedef struct PartitionColumnValue
{
NodeTagtype,
Oid*partitionid,
char   *partcolname,
Node   *partrangelower,
Node   *partrangeup

Re: [HACKERS] On partitioning

2014-12-11 Thread Robert Haas
On Thu, Dec 11, 2014 at 12:00 AM, Amit Kapila  wrote:
> Yeah either this way or what Josh has suggested upthread, the main
> point was that if at all we want to support multi-column list partitioning
> then we need to have slightly different syntax, however I feel that we
> can leave multi-column list partitioning for first version.

Yeah, possibly.

I think we could stand to have a lot more discussion about the syntax
here.  So far the idea seems to be to copy what Oracle has, but it's
not clear if we're going to have exactly what Oracle has or something
subtly different.  I personally don't find the Oracle syntax very
PostgreSQL-ish.  Stuff like "VALUES LESS THAN 500" doesn't sit
especially well with me - less than according to which opclass?  Are
we going to insist that partitioning must use the default btree
opclass so that we can use that syntax?  That seems kind of lame.

There are lots of interesting things we could do here, e.g.:

CREATE TABLE parent_name PARTITION ON (column [ USING opclass ] [, ... ]);
CREATE TABLE child_name PARTITION OF parent_name
   FOR { (value, ...) [ TO (value, ...) ] } [, ...];

So instead of making a hard distinction between range and list
partitioning, you can say:

CREATE TABLE child_name PARTITION OF parent_name FOR (3), (5), (7);
CREATE TABLE child2_name PARTITION OF parent_name FOR (8) TO (12);
CREATE TABLE child2_name PARTITION OF parent_name FOR (20) TO (30),
(120) TO (130);

Now that might be a crappy idea for various reasons, but the point is
there are a lot of details to be hammered out with the syntax, and
there are several ways we can go wrong.  If we choose an
overly-limiting syntax, we're needlessly restricting what can be done.
If we choose an overly-permissive syntax, we'll restrict the
optimization opportunities.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-10 Thread Amit Kapila
On Wed, Dec 10, 2014 at 11:51 PM, Robert Haas  wrote:
>
> On Mon, Dec 8, 2014 at 10:59 PM, Amit Kapila 
wrote:
> > Yeah and also how would user specify the values, as an example
> > assume that table is partitioned on monthly_salary, so partition
> > definition would look:
> >
> > PARTITION BY LIST(monthly_salary)
> > (
> > PARTITION salary_less_than_thousand VALUES(300, 900),
> > PARTITION salary_less_than_two_thousand VALUES (500,1000,1500),
> > ...
> > )
> >
> > Now if user wants to define multi-column Partition based on
> > monthly_salary and annual_salary, how do we want him to
> > specify the values.  Basically how to distinguish which values
> > belong to first column key and which one's belong to second
> > column key.
>
> I assume you just add some parentheses.
>
> PARTITION BY LIST (colA, colB) (PARTITION VALUES ((valA1, valB1),
> (valA2, valB2), (valA3, valB3))
>
> Multi-column list partitioning may or may not be worth implementing,
> but the syntax is not a real problem.
>

Yeah either this way or what Josh has suggested upthread, the main
point was that if at all we want to support multi-column list partitioning
then we need to have slightly different syntax, however I feel that we
can leave multi-column list partitioning for first version.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] On partitioning

2014-12-10 Thread Amit Kapila
On Wed, Dec 10, 2014 at 7:52 PM, Alvaro Herrera 
wrote:
>
> Amit Langote wrote:
>
> > On Wed, Dec 10, 2014 at 12:46 PM, Amit Kapila 
wrote:
> > > On Tue, Dec 9, 2014 at 7:21 PM, Alvaro Herrera <
alvhe...@2ndquadrant.com>
> > > wrote:
>
> > >> FWIW in my original proposal I was rejecting some things that after
> > >> further consideration turn out to be possible to allow; for instance
> > >> directly referencing individual partitions in COPY.  We could allow
> > >> something like
> > >>
> > >> COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
> > >> or maybe
> > >> COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT
> > >>
> > > or
> > > COPY [TABLE] lineitems PARTITION FOR VALUE '2000-01-01'  TO STDOUT
> > > COPY [TABLE] lineitems PARTITION   TO STDOUT
> > >
> > > I think we should try to support operations on partitions via main
> > > table whereever it is required.
>
> Um, I think the only difference is that you added the noise word TABLE
> which we currently don't allow in COPY,

Yeah, we could eliminate TABLE keyword from this syntax, the reason
I have kept was for easier understanding of syntax, currently we don't have
concept of PARTITION in COPY syntax, but now if we want to introduce
such a concept, then it might be better to have TABLE keyword for the
purpose of syntax clarity.

> and that you added the
> possibility of using named partitions, about which see below.
>
> > We can also allow to explicitly name a partition
> >
> > COPY [TABLE ] lineitems PARTITION lineitems_2001 TO STDOUT;
>
> The problem with naming partitions is that the user has to pick names
> for every partition, which is tedious and doesn't provide any
> significant benefit.  The input I had from users of other partitioning
> systems was that they very much preferred not to name the partitions at
> all,

It seems to me both Oracle and DB2 supports named partitions, so even
though there are user's which don't prefer named partitions, I suspect
equal or more number of users will be there who will prefer for the sake
of migration and because they are already used to such a syntax.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] On partitioning

2014-12-10 Thread Robert Haas
On Wed, Dec 10, 2014 at 7:25 PM, Amit Langote
 wrote:
> In heap_create(), do we create storage for a top level partitioned table 
> (say, RELKIND_PARTITIONED_TABLE)? How about a partition that is further 
> sub-partitioned? We might allocate storage for a partition at some point and 
> then later choose to sub-partition it. In such a case, perhaps, we would have 
> to move existing data to the storage of subpartitions and deallocate the 
> partition's storage. In other words only leaf relations in a partition 
> hierarchy would have storage. Is there such a notion within code for some 
> other purpose or we'd have to invent it for partitioning scheme?

I think it would be advantageous to have storage only for the leaf
partitions, because then you don't need to waste time doing a
zero-block sequential scan of the root as part of the append-plan, an
annoyance of the current system.

We have no concept for this right now; in fact, right now, the relkind
fully determines whether a given relation has storage.  One idea is to
make the leaves relkind = 'r' and the interior notes some new relkind.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-10 Thread Amit Langote


> From: Robert Haas [mailto:robertmh...@gmail.com]
> On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund
>  wrote:
> >> I don't think that's mutually exclusive with the idea of
> >> partitions-as-tables.  I mean, you can add code to the ALTER TABLE
> >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
> >> wherever you want.
> >
> > That'll be a lot of places you'll need to touch. More fundamentally: Why
> > should we name something a table that's not one?
> 
> Well, I'm not convinced that it isn't one.  And adding a new relkind
> will involve a bunch of code churn, too.  But I don't much care to
> pre-litigate this: when someone has got a patch, we can either agree
> that the approach is OK or argue that it is problematic because X.  I
> think we need to hammer down the design in broad strokes first, and
> I'm not sure we're totally there yet.
> 

In heap_create(), do we create storage for a top level partitioned table (say, 
RELKIND_PARTITIONED_TABLE)? How about a partition that is further 
sub-partitioned? We might allocate storage for a partition at some point and 
then later choose to sub-partition it. In such a case, perhaps, we would have 
to move existing data to the storage of subpartitions and deallocate the 
partition's storage. In other words only leaf relations in a partition 
hierarchy would have storage. Is there such a notion within code for some other 
purpose or we'd have to invent it for partitioning scheme?

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-10 Thread Robert Haas
On Mon, Dec 8, 2014 at 10:59 PM, Amit Kapila  wrote:
> Yeah and also how would user specify the values, as an example
> assume that table is partitioned on monthly_salary, so partition
> definition would look:
>
> PARTITION BY LIST(monthly_salary)
> (
> PARTITION salary_less_than_thousand VALUES(300, 900),
> PARTITION salary_less_than_two_thousand VALUES (500,1000,1500),
> ...
> )
>
> Now if user wants to define multi-column Partition based on
> monthly_salary and annual_salary, how do we want him to
> specify the values.  Basically how to distinguish which values
> belong to first column key and which one's belong to second
> column key.

I assume you just add some parentheses.

PARTITION BY LIST (colA, colB) (PARTITION VALUES ((valA1, valB1),
(valA2, valB2), (valA3, valB3))

Multi-column list partitioning may or may not be worth implementing,
but the syntax is not a real problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-10 Thread Robert Haas
On Mon, Dec 8, 2014 at 5:05 PM, Jim Nasby  wrote:
> Agreed, but it's possible to keep a block/CTID interface while doing
> something different on the disk.

Objection: hand-waving.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-10 Thread Robert Haas
On Wed, Dec 10, 2014 at 9:22 AM, Alvaro Herrera
 wrote:
> The problem with naming partitions is that the user has to pick names
> for every partition, which is tedious and doesn't provide any
> significant benefit.  The input I had from users of other partitioning
> systems was that they very much preferred not to name the partitions at
> all, which is why I chose the PARTITION FOR VALUE syntax (not sure if
> this syntax is exactly what other systems use; it just seemed the
> natural choice.)

FWIW, Oracle does name partitions.  It generates the names
automatically if you don't care to specify them, and the partition
names for a given table live in their own namespace that is separate
from the toplevel object namespace.  For example:

CREATE TABLE sales
 ( invoice_no NUMBER,
   sale_year  INT NOT NULL,
   sale_month INT NOT NULL,
   sale_day   INT NOT NULL )
   STORAGE (INITIAL 100K NEXT 50K) LOGGING
   PARTITION BY RANGE ( sale_year, sale_month, sale_day)
 ( PARTITION sales_q1 VALUES LESS THAN ( 1999, 04, 01 )
TABLESPACE tsa STORAGE (INITIAL 20K, NEXT 10K),
   PARTITION sales_q2 VALUES LESS THAN ( 1999, 07, 01 )
TABLESPACE tsb,
   PARTITION sales_q3 VALUES LESS THAN ( 1999, 10, 01 )
TABLESPACE tsc,
   PARTITION sales q4 VALUES LESS THAN ( 2000, 01, 01 )
TABLESPACE tsd)
   ENABLE ROW MOVEMENT;

I don't think this practice has much to recommend it.  We're going to
need a way to refer to individual partitions by name, and I don't see
much benefit in making that name something other than what is stored
in pg_class.relname.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-10 Thread Alvaro Herrera
Amit Langote wrote:

> On Wed, Dec 10, 2014 at 12:46 PM, Amit Kapila  wrote:
> > On Tue, Dec 9, 2014 at 7:21 PM, Alvaro Herrera 
> > wrote:

> >> FWIW in my original proposal I was rejecting some things that after
> >> further consideration turn out to be possible to allow; for instance
> >> directly referencing individual partitions in COPY.  We could allow
> >> something like
> >>
> >> COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
> >> or maybe
> >> COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT
> >>
> > or
> > COPY [TABLE] lineitems PARTITION FOR VALUE '2000-01-01'  TO STDOUT
> > COPY [TABLE] lineitems PARTITION   TO STDOUT
> >
> > I think we should try to support operations on partitions via main
> > table whereever it is required.

Um, I think the only difference is that you added the noise word TABLE
which we currently don't allow in COPY, and that you added the
possibility of using named partitions, about which see below.

> We can also allow to explicitly name a partition
> 
> COPY [TABLE ] lineitems PARTITION lineitems_2001 TO STDOUT;

The problem with naming partitions is that the user has to pick names
for every partition, which is tedious and doesn't provide any
significant benefit.  The input I had from users of other partitioning
systems was that they very much preferred not to name the partitions at
all, which is why I chose the PARTITION FOR VALUE syntax (not sure if
this syntax is exactly what other systems use; it just seemed the
natural choice.)

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-09 Thread Amit Langote


On Wed, Dec 10, 2014 at 12:46 PM, Amit Kapila  wrote:
> On Tue, Dec 9, 2014 at 7:21 PM, Alvaro Herrera 
> wrote:
>>
>> Amit Kapila wrote:
>> > On Tue, Dec 9, 2014 at 1:42 AM, Robert Haas 
>> > wrote:
>> > > On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund 
>> > wrote:
>> > > >> I don't think that's mutually exclusive with the idea of
>> > > >> partitions-as-tables.  I mean, you can add code to the ALTER TABLE
>> > > >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR,
>> > > >> ...)
>> > > >> wherever you want.
>> > > >
>> > > > That'll be a lot of places you'll need to touch. More fundamentally:
>> > > > Why
>> > > > should we name something a table that's not one?
>> > >
>> > > Well, I'm not convinced that it isn't one.  And adding a new relkind
>> > > will involve a bunch of code churn, too.  But I don't much care to
>> > > pre-litigate this: when someone has got a patch, we can either agree
>> > > that the approach is OK or argue that it is problematic because X.  I
>> > > think we need to hammer down the design in broad strokes first, and
>> > > I'm not sure we're totally there yet.
>> >
>> > That's right, I think at this point defining the top level
>> > behaviour/design
>> > is very important to proceed, we can decide about the better
>> > implementation approach afterwards (may be once initial patch is ready,
>> > because it might not be a major work to do it either way).  So here's
>> > where
>> > we are on this point till now as per my understanding, I think that
>> > direct
>> > operations should be prohibited on partitions, you think that they
>> > should be
>> > allowed and Andres think that it might be better to allow direct
>> > operations
>> > on partitions for Read.
>>
>> FWIW in my original proposal I was rejecting some things that after
>> further consideration turn out to be possible to allow; for instance
>> directly referencing individual partitions in COPY.  We could allow
>> something like
>>
>> COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
>> or maybe
>> COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT
>>
> or
> COPY [TABLE] lineitems PARTITION FOR VALUE '2000-01-01'  TO STDOUT
> COPY [TABLE] lineitems PARTITION   TO STDOUT
>
> I think we should try to support operations on partitions via main
> table whereever it is required.
>

We can also allow to explicitly name a partition

COPY [TABLE ] lineitems PARTITION lineitems_2001 TO STDOUT;

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-09 Thread Amit Langote


On Wed, Dec 10, 2014 at 12:33 PM, Amit Kapila  wrote:
> On Tue, Dec 9, 2014 at 11:44 PM, Josh Berkus  wrote:
>> On 12/09/2014 12:17 AM, Amit Langote wrote:
>> >> Now if user wants to define multi-column Partition based on
>> >> > monthly_salary and annual_salary, how do we want him to
>> >> > specify the values.  Basically how to distinguish which values
>> >> > belong to first column key and which one's belong to second
>> >> > column key.
>> >> >
>> > Perhaps you are talking about "syntactic" difficulties that I totally
>> > missed in my other reply to this mail?
>> >
>> > Can we represent the same data by rather using a subpartitioning scheme?
>> > ISTM, semantics would remain the same.
>> >
>> > ... PARTITION BY (monthly_salary) SUBPARTITION BY (annual_salary)?
>>
>
> Using SUBPARTITION is not the answer for multi-column partition,
> I think if we have to support it for List partitioning then something
> on lines what Josh has mentioned below could workout, but I don't
> think it is important to support multi-column partition for List at this
> stage.  
>

Yeah, I realize multicolumn list partitioning and list-list composite 
partitioning are different things in many respects. And given how awkward 
multicolumn list partitioning is looking to implement, I also think we only 
allow single column in a list partition key.

>> ... or just use arrays.
>>
>> PARTITION BY LIST ( monthly_salary, annual_salary )
>> PARTITION salary_small VALUES ({[300,400],[5000,6000]})
>> ) 
>>
>> ... but that begs the question of how partition by list over two columns
>> (or more) would even work?  You'd need an a*b number of partitions, and
>> the user would be pretty much certain to miss a few value combinations.
>>  Maybe we should just restrict list partitioning to a single column for
>> a first release, and wait and see if people ask for more?
>>
>
> I also think we should not support multi-column list partition in first
> release.
>

Yes.

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-09 Thread Amit Kapila
On Tue, Dec 9, 2014 at 7:21 PM, Alvaro Herrera 
wrote:
>
> Amit Kapila wrote:
> > On Tue, Dec 9, 2014 at 1:42 AM, Robert Haas 
wrote:
> > > On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund 
> > wrote:
> > > >> I don't think that's mutually exclusive with the idea of
> > > >> partitions-as-tables.  I mean, you can add code to the ALTER TABLE
> > > >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR,
...)
> > > >> wherever you want.
> > > >
> > > > That'll be a lot of places you'll need to touch. More
fundamentally: Why
> > > > should we name something a table that's not one?
> > >
> > > Well, I'm not convinced that it isn't one.  And adding a new relkind
> > > will involve a bunch of code churn, too.  But I don't much care to
> > > pre-litigate this: when someone has got a patch, we can either agree
> > > that the approach is OK or argue that it is problematic because X.  I
> > > think we need to hammer down the design in broad strokes first, and
> > > I'm not sure we're totally there yet.
> >
> > That's right, I think at this point defining the top level
behaviour/design
> > is very important to proceed, we can decide about the better
> > implementation approach afterwards (may be once initial patch is ready,
> > because it might not be a major work to do it either way).  So here's
where
> > we are on this point till now as per my understanding, I think that
direct
> > operations should be prohibited on partitions, you think that they
should be
> > allowed and Andres think that it might be better to allow direct
operations
> > on partitions for Read.
>
> FWIW in my original proposal I was rejecting some things that after
> further consideration turn out to be possible to allow; for instance
> directly referencing individual partitions in COPY.  We could allow
> something like
>
> COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
> or maybe
> COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT
>
or
COPY [TABLE] lineitems PARTITION FOR VALUE '2000-01-01'  TO STDOUT
COPY [TABLE] lineitems PARTITION   TO STDOUT

I think we should try to support operations on partitions via main
table whereever it is required.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] On partitioning

2014-12-09 Thread Amit Kapila
On Tue, Dec 9, 2014 at 11:44 PM, Josh Berkus  wrote:
> On 12/09/2014 12:17 AM, Amit Langote wrote:
> >> Now if user wants to define multi-column Partition based on
> >> > monthly_salary and annual_salary, how do we want him to
> >> > specify the values.  Basically how to distinguish which values
> >> > belong to first column key and which one's belong to second
> >> > column key.
> >> >
> > Perhaps you are talking about "syntactic" difficulties that I totally
missed in my other reply to this mail?
> >
> > Can we represent the same data by rather using a subpartitioning
scheme? ISTM, semantics would remain the same.
> >
> > ... PARTITION BY (monthly_salary) SUBPARTITION BY (annual_salary)?
>

Using SUBPARTITION is not the answer for multi-column partition,
I think if we have to support it for List partitioning then something
on lines what Josh has mentioned below could workout, but I don't
think it is important to support multi-column partition for List at this
stage.

> ... or just use arrays.
>
> PARTITION BY LIST ( monthly_salary, annual_salary )
> PARTITION salary_small VALUES ({[300,400],[5000,6000]})
> ) 
>
> ... but that begs the question of how partition by list over two columns
> (or more) would even work?  You'd need an a*b number of partitions, and
> the user would be pretty much certain to miss a few value combinations.
>  Maybe we should just restrict list partitioning to a single column for
> a first release, and wait and see if people ask for more?
>

I also think we should not support multi-column list partition in first
release.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] On partitioning

2014-12-09 Thread Jim Nasby

On 12/8/14, 5:19 PM, Josh Berkus wrote:

On 12/08/2014 02:12 PM, Jim Nasby wrote:

On 12/8/14, 12:26 PM, Josh Berkus wrote:

4. Creation Locking Problem
high probability of lock pile-ups whenever a new partition is created on
demand due to multiple backends trying to create the partition at the
same time.
Not Addressed?


Do users actually try and create new partitions during DML? That sounds
doomed to failure in pretty much any system...


There is no question that it would be easier for users to create
partitions on demand automatically.  Particularly if you're partitioning
by something other than time.  For a particular case, consider users on
RDS, which has no cron jobs for creating new partitons; it's on demand
or manually.

It's quite possible that there is no good way to work out the locking
for on-demand partitions though, but *if* we're going to have a 2nd
partition system, I think it's important to at least discuss the
problems with on-demand creation.


Yeah, we should discuss it. Perhaps the right answer here may be our own job 
scheduler, something a lot of folks want anyway.


11. Hash Partitioning
Some users would prefer to partition into a fixed number of
hash-allocated partitions.
Not Addressed.


Though, you should be able to do that in either system if you bother to
define your own hash in a BEFORE trigger...


That doesn't do you any good with the SELECT query, unless you change
your middleware to add a hash(column) to every query.  Which would be
really hard to do for joins.


A. COPY/ETL then attach
In inheritance partitioning, you can easily build a partition outside
the master and then "attach" it, allowing for minimal disturbance of
concurrent users.  Could be addressed in the future.


How much of the desire for this is because our current "row routing"
solutions are very slow? I suspect that's the biggest reason, and
hopefully Alvaro's proposal mostly eliminates it.


That doesn't always work, though.  In some cases the partition is being
built using some fairly complex logic (think of partitions which are
based on matviews) and there's no fast way to create the new data.
Again, this is an acceptable casualty of an improved design, but if it
will be so, we should consciously decide that.


Is there an example you can give here? If the scheme is that complicated I'm 
failing to see how you're supposed to do things like partition elimination.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-09 Thread Josh Berkus
On 12/09/2014 12:17 AM, Amit Langote wrote:
>> Now if user wants to define multi-column Partition based on
>> > monthly_salary and annual_salary, how do we want him to
>> > specify the values.  Basically how to distinguish which values
>> > belong to first column key and which one's belong to second
>> > column key.
>> >
> Perhaps you are talking about "syntactic" difficulties that I totally missed 
> in my other reply to this mail?
> 
> Can we represent the same data by rather using a subpartitioning scheme? 
> ISTM, semantics would remain the same.
> 
> ... PARTITION BY (monthly_salary) SUBPARTITION BY (annual_salary)?

... or just use arrays.

PARTITION BY LIST ( monthly_salary, annual_salary )
PARTITION salary_small VALUES ({[300,400],[5000,6000]})
) 

... but that begs the question of how partition by list over two columns
(or more) would even work?  You'd need an a*b number of partitions, and
the user would be pretty much certain to miss a few value combinations.
 Maybe we should just restrict list partitioning to a single column for
a first release, and wait and see if people ask for more?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-09 Thread Alvaro Herrera
Amit Kapila wrote:
> On Tue, Dec 9, 2014 at 1:42 AM, Robert Haas  wrote:
> > On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund 
> wrote:
> > >> I don't think that's mutually exclusive with the idea of
> > >> partitions-as-tables.  I mean, you can add code to the ALTER TABLE
> > >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
> > >> wherever you want.
> > >
> > > That'll be a lot of places you'll need to touch. More fundamentally: Why
> > > should we name something a table that's not one?
> >
> > Well, I'm not convinced that it isn't one.  And adding a new relkind
> > will involve a bunch of code churn, too.  But I don't much care to
> > pre-litigate this: when someone has got a patch, we can either agree
> > that the approach is OK or argue that it is problematic because X.  I
> > think we need to hammer down the design in broad strokes first, and
> > I'm not sure we're totally there yet.
> 
> That's right, I think at this point defining the top level behaviour/design
> is very important to proceed, we can decide about the better
> implementation approach afterwards (may be once initial patch is ready,
> because it might not be a major work to do it either way).  So here's where
> we are on this point till now as per my understanding, I think that direct
> operations should be prohibited on partitions, you think that they should be
> allowed and Andres think that it might be better to allow direct operations
> on partitions for Read.

FWIW in my original proposal I was rejecting some things that after
further consideration turn out to be possible to allow; for instance
directly referencing individual partitions in COPY.  We could allow
something like

COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
or maybe
COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT

and this would emit the whole partition for year 2000 of table
lineitems, and only that (the value is just computed on the fly to fit
the partitioning constraints for that individual partition).  Then
pg_dump would be able to dump each and every partition separately.

In a similar way we could have COPY FROM allow input into individual
partitions so that such a dump can be restored in parallel for each
partition.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-09 Thread Alvaro Herrera
Josh Berkus wrote:

Hi,

> Pardon me for jumping into this late.  In general, I like Alvaro's
> approach.

Please don't call this "Alvaro's approach" as I'm not involved in this
anymore.  Amit Langote has taken ownership of it now.  While some
resemblance to what I originally proposed might remain, I haven't kept
track of how this has evolved and this might be a totally different
thing now.  Or not.

Anyway I just wanted to comment on a single point:

> 6. Unique Index Problem
> Cannot create a unique index across multiple partitions, which prevents
> the partitioned table from being FK'd.
> Not Addressed
> (but could be addressed in the future)

I think it's unlikely that we will ever create a unique index that spans
all the partitions, actually.  Even if there are some wild ideas on how
to implement such a thing, the number of difficult issues that no one
knows how to attack seems too large.  I would perhaps be thinking in
allowing foreign keys to be defined on column sets that are prefixed by
partition keys; unique indexes must exist on all partitions on the same
columns including the partition keys.  (Perhaps make an extra exception
that if a partition allows a single value for the partition column, that
column need not be part of the unique index.)

> 10. Scaling Problem
> Inheritance partitioning becomes prohibitively slow for the planner at
> somewhere between 100 and 500 partitions depending on various factors.
> No idea?

At least it was my intention to make the system scale to huge number of
partitions, but this requires some forward thinking (such as avoiding
loading the index list of all of them or evern opening all of them at
the planner stage) and I think would be defeated if we want to keep
all the generality of the inheritance-based approach.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-09 Thread Amit Langote


On Tue, Dec 9, 2014 at 12:59 PM, Amit Kapila  wrote:
> On Tue, Dec 9, 2014 at 8:08 AM, Amit Langote 
> wrote:
>> > From: Robert Haas [mailto:robertmh...@gmail.com]
>> > On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila 
>> > wrote:
>> > >> I guess you could list or hash partition on multiple columns, too.
>> > >
>> > > How would you distinguish values in list partition for multiple
>> > > columns? I mean for range partition, we are sure there will
>> > > be either one value for each column, but for list it could
>> > > be multiple and not fixed for each partition, so I think it will not
>> > > be easy to support the multicolumn partition key for list
>> > > partitions.
>> >
>> > I don't understand.  If you want to range partition on columns (a, b),
>> > you say that, say, tuples with (a, b) values less than (100, 200) go
>> > here and the rest go elsewhere.  For list partitioning, you say that,
>> > say, tuples with (a, b) values of EXACTLY (100, 200) go here and the
>> > rest go elsewhere.  I'm not sure how useful that is but it's not
>> > illogical.
>> >
>>
>> In case of list partitioning, 100 and 200 would respectively be one of the
>> values in lists of allowed values for a and b. I thought his concern is
>> whether this "list of values for each column in partkey" is as convenient to
>> store and manipulate as range partvalues.
>>
>
> Yeah and also how would user specify the values, as an example
> assume that table is partitioned on monthly_salary, so partition
> definition would look:
>
> PARTITION BY LIST(monthly_salary)
> (
> PARTITION salary_less_than_thousand VALUES(300, 900),
> PARTITION salary_less_than_two_thousand VALUES (500,1000,1500),
> ...
> )
>
> Now if user wants to define multi-column Partition based on
> monthly_salary and annual_salary, how do we want him to
> specify the values.  Basically how to distinguish which values
> belong to first column key and which one's belong to second
> column key.
>

Perhaps you are talking about "syntactic" difficulties that I totally missed in 
my other reply to this mail?

Can we represent the same data by rather using a subpartitioning scheme? ISTM, 
semantics would remain the same.

... PARTITION BY (monthly_salary) SUBPARTITION BY (annual_salary)?

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-08 Thread Amit Langote

On Tue, Dec 9, 2014 at 12:59 PM, Amit Kapila  wrote:
> On Tue, Dec 9, 2014 at 8:08 AM, Amit Langote 
> wrote:
>> > From: Robert Haas [mailto:robertmh...@gmail.com]
>> > I don't understand.  If you want to range partition on columns (a, b),
>> > you say that, say, tuples with (a, b) values less than (100, 200) go
>> > here and the rest go elsewhere.  For list partitioning, you say that,
>> > say, tuples with (a, b) values of EXACTLY (100, 200) go here and the
>> > rest go elsewhere.  I'm not sure how useful that is but it's not
>> > illogical.
>> >
>>
>> In case of list partitioning, 100 and 200 would respectively be one of the
>> values in lists of allowed values for a and b. I thought his concern is
>> whether this "list of values for each column in partkey" is as convenient to
>> store and manipulate as range partvalues.
>>
>
> Yeah and also how would user specify the values, as an example
> assume that table is partitioned on monthly_salary, so partition
> definition would look:
>
> PARTITION BY LIST(monthly_salary)
> (
> PARTITION salary_less_than_thousand VALUES(300, 900),
> PARTITION salary_less_than_two_thousand VALUES (500,1000,1500),
> ...
> )
>
> Now if user wants to define multi-column Partition based on
> monthly_salary and annual_salary, how do we want him to
> specify the values.  Basically how to distinguish which values
> belong to first column key and which one's belong to second
> column key.
>

Amit, in one of my earlier replies to your question of why we may not want to 
implement multi-column list partitioning (lack of user interest in the feature 
or possible complexity of the code), I tried to explain how that may work if we 
do choose to go that way. Basically, something we may call PartitionColumnValue 
should be such that above issue can be suitably sorted out.

For example, a partition defining/bounding value would be a pg_node_tree 
representation of List of one of the (say) following parse nodes as appropriate 
- 

typedef struct PartitionColumnValue
{
NodeTag type,
Oid *partitionid,
char*partcolname,
charpartkind,
Node*partrangelower,
Node*partrangeupper,
List*partlistvalues
};

OR separately,

typedef struct RangePartitionColumnValue
{
NodeTag type,
Oid *partitionid,
char*partcolname,
Node*partrangelower,
Node*partrangeupper
};

& 

typedef struct ListPartitionColumnValue
{
NodeTag type,
Oid *partitionid,
char*partcolname,
List*partlistvalues
};

Where a partition definition would look like

typedef struct PartitionDef
{
NodeTag type,
RangeVarpartition,
RangeVarparentrel,
char*kind,
Node*values,
List*options,
char*tablespacename
};

PartitionDef.values is an (ordered) List of PartitionColumnValue each of which 
corresponds to one column in the partition key in that order.

We should be able to devise a way to load the pg_node_tree representation of  
PartitionDef.values (on-disk pg_partition_def.partvalues) into relcache using a 
"suitable data structure" so that it becomes readily usable in variety of 
contexts that we are interested in using this information. 

Regards,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-08 Thread Amit Kapila
On Tue, Dec 9, 2014 at 1:42 AM, Robert Haas  wrote:
> On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund 
wrote:
> >> I don't think that's mutually exclusive with the idea of
> >> partitions-as-tables.  I mean, you can add code to the ALTER TABLE
> >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
> >> wherever you want.
> >
> > That'll be a lot of places you'll need to touch. More fundamentally: Why
> > should we name something a table that's not one?
>
> Well, I'm not convinced that it isn't one.  And adding a new relkind
> will involve a bunch of code churn, too.  But I don't much care to
> pre-litigate this: when someone has got a patch, we can either agree
> that the approach is OK or argue that it is problematic because X.  I
> think we need to hammer down the design in broad strokes first, and
> I'm not sure we're totally there yet.

That's right, I think at this point defining the top level behaviour/design
is very important to proceed, we can decide about the better
implementation approach afterwards (may be once initial patch is ready,
because it might not be a major work to do it either way).  So here's where
we are on this point till now as per my understanding, I think that direct
operations should be prohibited on partitions, you think that they should be
allowed and Andres think that it might be better to allow direct operations
on partitions for Read.

>
> >> - Direct access to individual partitions to bypass
> >> tuple-routing/query-planning overhead.
> >
> > I think that might be ok in some cases, but in general I'd be very wary
> > to allow that. I think it might be ok to allow direct read access, but
> > everything else I'd be opposed. I'd much rather go the route of allowing
> > to few things and then gradually opening up if required than the other
> > way round (as that pretty much will never happen because it'll break
> > deployed systems).
>
> Why?
>

Because I think it will be difficult for users to write/maintain more of
such
code, which is one of the complaints with previous system where user
needs to write triggers to route the tuple to appropriate partition.
I think in first step we should try to improve the tuple routing algorithm
so that it is not pain for users or atleast it should be at par with some of
the other competitive database systems and if we are not able
to come up with such an implementation, then may be we can think of
providing it as a special way for users to improve performance.

Another reason is that fundamentally partitions are managed internally
to divide the user data in a way so that access could be cheaper and we
take the specifications for defining the partitions from users and allowing
operations on internally managed objects could lead to user writing quite
some code to do what database actually does internally.  If we see that
TOAST table are internally used to manage large tuples, however we
don't want users to directly perform dml on those tables.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] On partitioning

2014-12-08 Thread Amit Kapila
On Tue, Dec 9, 2014 at 8:08 AM, Amit Langote 
wrote:
> > From: Robert Haas [mailto:robertmh...@gmail.com]
> > On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila 
> > wrote:
> > >> I guess you could list or hash partition on multiple columns, too.
> > >
> > > How would you distinguish values in list partition for multiple
> > > columns? I mean for range partition, we are sure there will
> > > be either one value for each column, but for list it could
> > > be multiple and not fixed for each partition, so I think it will not
> > > be easy to support the multicolumn partition key for list
> > > partitions.
> >
> > I don't understand.  If you want to range partition on columns (a, b),
> > you say that, say, tuples with (a, b) values less than (100, 200) go
> > here and the rest go elsewhere.  For list partitioning, you say that,
> > say, tuples with (a, b) values of EXACTLY (100, 200) go here and the
> > rest go elsewhere.  I'm not sure how useful that is but it's not
> > illogical.
> >
>
> In case of list partitioning, 100 and 200 would respectively be one of
the values in lists of allowed values for a and b. I thought his concern is
whether this "list of values for each column in partkey" is as convenient
to store and manipulate as range partvalues.
>

Yeah and also how would user specify the values, as an example
assume that table is partitioned on monthly_salary, so partition
definition would look:

PARTITION BY LIST(monthly_salary)
(
PARTITION salary_less_than_thousand VALUES(300, 900),
PARTITION salary_less_than_two_thousand VALUES (500,1000,1500),
...
)

Now if user wants to define multi-column Partition based on
monthly_salary and annual_salary, how do we want him to
specify the values.  Basically how to distinguish which values
belong to first column key and which one's belong to second
column key.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] On partitioning

2014-12-08 Thread Amit Langote

> From: Robert Haas [mailto:robertmh...@gmail.com]
> On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila 
> wrote:
> >> I guess you could list or hash partition on multiple columns, too.
> >
> > How would you distinguish values in list partition for multiple
> > columns? I mean for range partition, we are sure there will
> > be either one value for each column, but for list it could
> > be multiple and not fixed for each partition, so I think it will not
> > be easy to support the multicolumn partition key for list
> > partitions.
> 
> I don't understand.  If you want to range partition on columns (a, b),
> you say that, say, tuples with (a, b) values less than (100, 200) go
> here and the rest go elsewhere.  For list partitioning, you say that,
> say, tuples with (a, b) values of EXACTLY (100, 200) go here and the
> rest go elsewhere.  I'm not sure how useful that is but it's not
> illogical.
> 

In case of list partitioning, 100 and 200 would respectively be one of the 
values in lists of allowed values for a and b. I thought his concern is whether 
this "list of values for each column in partkey" is as convenient to store and 
manipulate as range partvalues. 

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-08 Thread Josh Berkus
On 12/08/2014 02:12 PM, Jim Nasby wrote:
> On 12/8/14, 12:26 PM, Josh Berkus wrote:
>> 4. Creation Locking Problem
>> high probability of lock pile-ups whenever a new partition is created on
>> demand due to multiple backends trying to create the partition at the
>> same time.
>> Not Addressed?
> 
> Do users actually try and create new partitions during DML? That sounds
> doomed to failure in pretty much any system...

There is no question that it would be easier for users to create
partitions on demand automatically.  Particularly if you're partitioning
by something other than time.  For a particular case, consider users on
RDS, which has no cron jobs for creating new partitons; it's on demand
or manually.

It's quite possible that there is no good way to work out the locking
for on-demand partitions though, but *if* we're going to have a 2nd
partition system, I think it's important to at least discuss the
problems with on-demand creation.

>> 11. Hash Partitioning
>> Some users would prefer to partition into a fixed number of
>> hash-allocated partitions.
>> Not Addressed.
> 
> Though, you should be able to do that in either system if you bother to
> define your own hash in a BEFORE trigger...

That doesn't do you any good with the SELECT query, unless you change
your middleware to add a hash(column) to every query.  Which would be
really hard to do for joins.

>> A. COPY/ETL then attach
>> In inheritance partitioning, you can easily build a partition outside
>> the master and then "attach" it, allowing for minimal disturbance of
>> concurrent users.  Could be addressed in the future.
> 
> How much of the desire for this is because our current "row routing"
> solutions are very slow? I suspect that's the biggest reason, and
> hopefully Alvaro's proposal mostly eliminates it.

That doesn't always work, though.  In some cases the partition is being
built using some fairly complex logic (think of partitions which are
based on matviews) and there's no fast way to create the new data.
Again, this is an acceptable casualty of an improved design, but if it
will be so, we should consciously decide that.

>> B. Catchall Partition
>> Many partitioning schemes currently contain a "catchall" partition which
>> accepts rows outside of the range of the partitioning scheme, due to bad
>> input data.  Probably not handled on purpose; Alvaro is proposing that
>> we reject these instead, or create the partitions on demand, which is a
>> legitimate approach.
>>
>> C. Asymmetric Partitioning / NULLs in partition column
>> This is the classic Active/Inactive By Month setup for partitions.
>> Could be addressed via special handling for NULL/infinity in the
>> partitioned column.
> 
> If we allowed for a "catchall partition" and supported normal
> inheritance/triggers on that partition then users could continue to do
> whatever they needed with data that didn't fit the "normal" partitioning
> pattern.

That sounds to me like it would fall under the heading of "impossible
levels of backwards-compatibility".


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-08 Thread Jim Nasby

On 12/8/14, 12:26 PM, Josh Berkus wrote:

4. Creation Locking Problem
high probability of lock pile-ups whenever a new partition is created on
demand due to multiple backends trying to create the partition at the
same time.
Not Addressed?


Do users actually try and create new partitions during DML? That sounds doomed 
to failure in pretty much any system...


6. Unique Index Problem
Cannot create a unique index across multiple partitions, which prevents
the partitioned table from being FK'd.
Not Addressed
(but could be addressed in the future)


And would be extremely useful even with simple inheritance, let alone 
partitioning...


9. Hibernate Problem
When using the trigger method, inserts into the master partition return
0, which Hibernate and some other ORMs regard as an insert failure.
Addressed.


It would be really nice to address this with regular inheritance too...


11. Hash Partitioning
Some users would prefer to partition into a fixed number of
hash-allocated partitions.
Not Addressed.


Though, you should be able to do that in either system if you bother to define 
your own hash in a BEFORE trigger...


A. COPY/ETL then attach
In inheritance partitioning, you can easily build a partition outside
the master and then "attach" it, allowing for minimal disturbance of
concurrent users.  Could be addressed in the future.


How much of the desire for this is because our current "row routing" solutions 
are very slow? I suspect that's the biggest reason, and hopefully Alvaro's proposal 
mostly eliminates it.


B. Catchall Partition
Many partitioning schemes currently contain a "catchall" partition which
accepts rows outside of the range of the partitioning scheme, due to bad
input data.  Probably not handled on purpose; Alvaro is proposing that
we reject these instead, or create the partitions on demand, which is a
legitimate approach.

C. Asymmetric Partitioning / NULLs in partition column
This is the classic Active/Inactive By Month setup for partitions.
Could be addressed via special handling for NULL/infinity in the
partitioned column.


If we allowed for a "catchall partition" and supported normal inheritance/triggers on 
that partition then users could continue to do whatever they needed with data that didn't fit the 
"normal" partitioning pattern.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-08 Thread Jim Nasby

On 12/8/14, 1:05 PM, Robert Haas wrote:

Besides, I haven't really seen anyone propose something that sounds
like a credible alternative.  If we could make partition objects
things that the storage layer needs to know about but the query
planner doesn't need to understand, that'd be maybe worth considering.
But I don't see any way that that's remotely feasible.  There are lots
of places that we assume that a heap consists of blocks number 0 up
through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits
and pieces of the way index vacuuming is handled, which in turn bleeds
into Hot Standby.  You can't just decide that now block numbers are
going to be replaced by some more complex structure, or even that
they're now going to be nonlinear, without breaking a huge amount of
stuff.


Agreed, but it's possible to keep a block/CTID interface while doing something 
different on the disk.

If you think about it, partitioning is really a hack anyway. It clutters up 
your logical set implementation with a bunch of physical details. What most 
people really want when they implement partitioning is simply data locality.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-08 Thread Robert Haas
On Mon, Dec 8, 2014 at 2:58 PM, Josh Berkus  wrote:
>> I think any new partitioning system should keep the good things about
>> the existing system, of which there are some, and not try to reinvent
>> the wheel.  The yard stick for a new system shouldn't be "is this
>> different enough?" but "does this solve the problems without creating
>> new ones?".
>
> It's unrealistic to assume that a new system would support all of the
> features of the existing inheritance partitioning without restriction.
>  In fact, I'd say that such a requirement amounts to saying "don't
> bother trying".
>
> For example, inheritance allows us to have different indexes,
> constraints, and even columns on partitions.  We can have overlapping
> partitions, and heterogenous multilevel partitioning (partition this
> customer by month but partition that customer by week).  We can even add
> triggers on individual partitions to reroute data away from a specific
> partition.   A requirement to support all of these peculiar uses of
> inheritance partitioning would doom any new partitioning project.

I don't think it has to be possible to support every use case that we
can support today; clearly, a part of the goal here is to be LESS
general so that we can be more performant.  But I think the urge to
change too many things at once had better be tempered by a clear-eyed
vision of what can reasonably be accomplished in one patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-08 Thread Robert Haas
On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund  wrote:
>> I don't think that's mutually exclusive with the idea of
>> partitions-as-tables.  I mean, you can add code to the ALTER TABLE
>> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
>> wherever you want.
>
> That'll be a lot of places you'll need to touch. More fundamentally: Why
> should we name something a table that's not one?

Well, I'm not convinced that it isn't one.  And adding a new relkind
will involve a bunch of code churn, too.  But I don't much care to
pre-litigate this: when someone has got a patch, we can either agree
that the approach is OK or argue that it is problematic because X.  I
think we need to hammer down the design in broad strokes first, and
I'm not sure we're totally there yet.

>> - Direct access to individual partitions to bypass
>> tuple-routing/query-planning overhead.
>
> I think that might be ok in some cases, but in general I'd be very wary
> to allow that. I think it might be ok to allow direct read access, but
> everything else I'd be opposed. I'd much rather go the route of allowing
> to few things and then gradually opening up if required than the other
> way round (as that pretty much will never happen because it'll break
> deployed systems).

Why?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-08 Thread Josh Berkus
On 12/08/2014 11:40 AM, Robert Haas wrote:
>> I don't thing its feasible to drop inheritance partitioning at this
>> point; too many user exploit a lot of peculiarities of that system which
>> wouldn't be supported by any other system.  So any new partitioning
>> system we're talking about would be *in addition* to the existing
>> system.  Hence my prior email trying to make sure that a new proposed
>> system is sufficiently different from the existing one to be worthwhile.
> 
> I think any new partitioning system should keep the good things about
> the existing system, of which there are some, and not try to reinvent
> the wheel.  The yard stick for a new system shouldn't be "is this
> different enough?" but "does this solve the problems without creating
> new ones?".

It's unrealistic to assume that a new system would support all of the
features of the existing inheritance partitioning without restriction.
 In fact, I'd say that such a requirement amounts to saying "don't
bother trying".

For example, inheritance allows us to have different indexes,
constraints, and even columns on partitions.  We can have overlapping
partitions, and heterogenous multilevel partitioning (partition this
customer by month but partition that customer by week).  We can even add
triggers on individual partitions to reroute data away from a specific
partition.   A requirement to support all of these peculiar uses of
inheritance partitioning would doom any new partitioning project.

>>> Besides, I haven't really seen anyone propose something that sounds
>>> like a credible alternative.  If we could make partition objects
>>> things that the storage layer needs to know about but the query
>>> planner doesn't need to understand, that'd be maybe worth considering.
>>> But I don't see any way that that's remotely feasible.
>>
>> On the other hand, as long as partitions exist exclusively at the
>> planner layer, we can't fix the existing major shortcomings of
>> inheritance partitioning, such as its inability to handle expressions.
>> Again, see previous.
> 
> Huh?

Explained in the other email I posted on this thread.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-08 Thread Andres Freund
On 2014-12-08 14:48:50 -0500, Robert Haas wrote:
> On Mon, Dec 8, 2014 at 2:39 PM, Andres Freund  wrote:
> >> I guess I'm in disagreement with you - and, perhaps - the majority on
> >> this point.  I think that ship has already sailed: partitions ARE
> >> tables.  We can try to make it less necessary for users to ever look
> >> at those tables as separate objects, and I think that's a good idea.
> >> But trying to go from a system where partitions are tables, which is
> >> what we have today, to a system where they are not seems like a bad
> >> idea to me.  If we make a major break from how things work today,
> >> we're going to end up having to reimplement stuff that already works.
> >
> > I don't think this makes much sense. That'd severely restrict our
> > ability to do stuff for a long time. Unless we can absolutely rely on
> > the fact that partitions have the same schema and such we'll rob
> > ourselves of significant optimization opportunities.
> 
> I don't think that's mutually exclusive with the idea of
> partitions-as-tables.  I mean, you can add code to the ALTER TABLE
> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
> wherever you want.

That'll be a lot of places you'll need to touch. More fundamentally: Why
should we name something a table that's not one?

> >> Besides, I haven't really seen anyone propose something that sounds
> >> like a credible alternative.  If we could make partition objects
> >> things that the storage layer needs to know about but the query
> >> planner doesn't need to understand, that'd be maybe worth considering.
> >> But I don't see any way that that's remotely feasible.  There are lots
> >> of places that we assume that a heap consists of blocks number 0 up
> >> through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits
> >> and pieces of the way index vacuuming is handled, which in turn bleeds
> >> into Hot Standby.  You can't just decide that now block numbers are
> >> going to be replaced by some more complex structure, or even that
> >> they're now going to be nonlinear, without breaking a huge amount of
> >> stuff.
> >
> > I think you're making a wrong fundamental assumption here. Just because
> > we define partitions to not be full relations doesn't mean we have to
> > treat them entirely separate. I don't see why a pg_class.relkind = 'p'
> > entry would be something actually problematic. That'd easily allow to
> > treat them differently in all the relevant places (all of ALTER TABLE,
> > DML et al) and still allow all of the current planner/executor
> > infrastructure. We can even allow direct SELECTs from individual
> > partitions if we want to - that's trivial to achieve.
> 
> We may just be using different words to talk about more-or-less the
> same thing, then.

That might be

> What I'm saying is that I want these things to keep working:

> - Indexes.

Nobody argued against that I think.

> - Merge append and any other inheritance-aware query planning
> techniques.

Same here.

> - Direct access to individual partitions to bypass
> tuple-routing/query-planning overhead.

I think that might be ok in some cases, but in general I'd be very wary
to allow that. I think it might be ok to allow direct read access, but
everything else I'd be opposed. I'd much rather go the route of allowing
to few things and then gradually opening up if required than the other
way round (as that pretty much will never happen because it'll break
deployed systems).

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-08 Thread Robert Haas
On Mon, Dec 8, 2014 at 2:39 PM, Andres Freund  wrote:
>> I guess I'm in disagreement with you - and, perhaps - the majority on
>> this point.  I think that ship has already sailed: partitions ARE
>> tables.  We can try to make it less necessary for users to ever look
>> at those tables as separate objects, and I think that's a good idea.
>> But trying to go from a system where partitions are tables, which is
>> what we have today, to a system where they are not seems like a bad
>> idea to me.  If we make a major break from how things work today,
>> we're going to end up having to reimplement stuff that already works.
>
> I don't think this makes much sense. That'd severely restrict our
> ability to do stuff for a long time. Unless we can absolutely rely on
> the fact that partitions have the same schema and such we'll rob
> ourselves of significant optimization opportunities.

I don't think that's mutually exclusive with the idea of
partitions-as-tables.  I mean, you can add code to the ALTER TABLE
path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...)
wherever you want.

>> Besides, I haven't really seen anyone propose something that sounds
>> like a credible alternative.  If we could make partition objects
>> things that the storage layer needs to know about but the query
>> planner doesn't need to understand, that'd be maybe worth considering.
>> But I don't see any way that that's remotely feasible.  There are lots
>> of places that we assume that a heap consists of blocks number 0 up
>> through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits
>> and pieces of the way index vacuuming is handled, which in turn bleeds
>> into Hot Standby.  You can't just decide that now block numbers are
>> going to be replaced by some more complex structure, or even that
>> they're now going to be nonlinear, without breaking a huge amount of
>> stuff.
>
> I think you're making a wrong fundamental assumption here. Just because
> we define partitions to not be full relations doesn't mean we have to
> treat them entirely separate. I don't see why a pg_class.relkind = 'p'
> entry would be something actually problematic. That'd easily allow to
> treat them differently in all the relevant places (all of ALTER TABLE,
> DML et al) and still allow all of the current planner/executor
> infrastructure. We can even allow direct SELECTs from individual
> partitions if we want to - that's trivial to achieve.

We may just be using different words to talk about more-or-less the
same thing, then.  What I'm saying is that I want these things to keep
working:

- Indexes.
- Merge append and any other inheritance-aware query planning techniques.
- Direct access to individual partitions to bypass
tuple-routing/query-planning overhead.

I am not necessarily saying that I have a problem with putting other
restrictions on partitions, like requiring them to have the same tuple
descriptor or the same ACLs as their parents.  Those kinds of details
bear discussion, but I'm not intrinsically opposed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-08 Thread Robert Haas
On Mon, Dec 8, 2014 at 2:30 PM, Josh Berkus  wrote:
> On 12/08/2014 11:05 AM, Robert Haas wrote:
>> I guess I'm in disagreement with you - and, perhaps - the majority on
>> this point.  I think that ship has already sailed: partitions ARE
>> tables.  We can try to make it less necessary for users to ever look
>> at those tables as separate objects, and I think that's a good idea.
>> But trying to go from a system where partitions are tables, which is
>> what we have today, to a system where they are not seems like a bad
>> idea to me.  If we make a major break from how things work today,
>> we're going to end up having to reimplement stuff that already works.
>
> I don't thing its feasible to drop inheritance partitioning at this
> point; too many user exploit a lot of peculiarities of that system which
> wouldn't be supported by any other system.  So any new partitioning
> system we're talking about would be *in addition* to the existing
> system.  Hence my prior email trying to make sure that a new proposed
> system is sufficiently different from the existing one to be worthwhile.

I think any new partitioning system should keep the good things about
the existing system, of which there are some, and not try to reinvent
the wheel.  The yard stick for a new system shouldn't be "is this
different enough?" but "does this solve the problems without creating
new ones?".

>> Besides, I haven't really seen anyone propose something that sounds
>> like a credible alternative.  If we could make partition objects
>> things that the storage layer needs to know about but the query
>> planner doesn't need to understand, that'd be maybe worth considering.
>> But I don't see any way that that's remotely feasible.
>
> On the other hand, as long as partitions exist exclusively at the
> planner layer, we can't fix the existing major shortcomings of
> inheritance partitioning, such as its inability to handle expressions.
> Again, see previous.

Huh?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-08 Thread Andres Freund
On 2014-12-08 14:05:52 -0500, Robert Haas wrote:
> On Sat, Dec 6, 2014 at 3:06 AM, Amit Kapila  wrote:
> > Sure, I don't feel we should not provide anyway to take dump
> > for individual partition but not at level of independent table.
> > May be something like --table 
> > --partition .
> >
> > In general, I think we should try to avoid exposing that partitions are
> > individual tables as that might hinder any future enhancement in that
> > area (example if we someone finds a different and better way to
> > arrange the partition data, then due to the currently exposed syntax,
> > we might feel blocked).
> 
> I guess I'm in disagreement with you - and, perhaps - the majority on
> this point.  I think that ship has already sailed: partitions ARE
> tables.  We can try to make it less necessary for users to ever look
> at those tables as separate objects, and I think that's a good idea.
> But trying to go from a system where partitions are tables, which is
> what we have today, to a system where they are not seems like a bad
> idea to me.  If we make a major break from how things work today,
> we're going to end up having to reimplement stuff that already works.

I don't think this makes much sense. That'd severely restrict our
ability to do stuff for a long time. Unless we can absolutely rely on
the fact that partitions have the same schema and such we'll rob
ourselves of significant optimization opportunities.

> Besides, I haven't really seen anyone propose something that sounds
> like a credible alternative.  If we could make partition objects
> things that the storage layer needs to know about but the query
> planner doesn't need to understand, that'd be maybe worth considering.
> But I don't see any way that that's remotely feasible.  There are lots
> of places that we assume that a heap consists of blocks number 0 up
> through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits
> and pieces of the way index vacuuming is handled, which in turn bleeds
> into Hot Standby.  You can't just decide that now block numbers are
> going to be replaced by some more complex structure, or even that
> they're now going to be nonlinear, without breaking a huge amount of
> stuff.

I think you're making a wrong fundamental assumption here. Just because
we define partitions to not be full relations doesn't mean we have to
treat them entirely separate. I don't see why a pg_class.relkind = 'p'
entry would be something actually problematic. That'd easily allow to
treat them differently in all the relevant places (all of ALTER TABLE,
DML et al) and still allow all of the current planner/executor
infrastructure. We can even allow direct SELECTs from individual
partitions if we want to - that's trivial to achieve.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-08 Thread Josh Berkus
On 12/08/2014 11:05 AM, Robert Haas wrote:
> I guess I'm in disagreement with you - and, perhaps - the majority on
> this point.  I think that ship has already sailed: partitions ARE
> tables.  We can try to make it less necessary for users to ever look
> at those tables as separate objects, and I think that's a good idea.
> But trying to go from a system where partitions are tables, which is
> what we have today, to a system where they are not seems like a bad
> idea to me.  If we make a major break from how things work today,
> we're going to end up having to reimplement stuff that already works.

I don't thing its feasible to drop inheritance partitioning at this
point; too many user exploit a lot of peculiarities of that system which
wouldn't be supported by any other system.  So any new partitioning
system we're talking about would be *in addition* to the existing
system.  Hence my prior email trying to make sure that a new proposed
system is sufficiently different from the existing one to be worthwhile.

> Besides, I haven't really seen anyone propose something that sounds
> like a credible alternative.  If we could make partition objects
> things that the storage layer needs to know about but the query
> planner doesn't need to understand, that'd be maybe worth considering.
> But I don't see any way that that's remotely feasible. 

On the other hand, as long as partitions exist exclusively at the
planner layer, we can't fix the existing major shortcomings of
inheritance partitioning, such as its inability to handle expressions.
Again, see previous.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-08 Thread Robert Haas
On Sat, Dec 6, 2014 at 3:06 AM, Amit Kapila  wrote:
> Sure, I don't feel we should not provide anyway to take dump
> for individual partition but not at level of independent table.
> May be something like --table 
> --partition .
>
> In general, I think we should try to avoid exposing that partitions are
> individual tables as that might hinder any future enhancement in that
> area (example if we someone finds a different and better way to
> arrange the partition data, then due to the currently exposed syntax,
> we might feel blocked).

I guess I'm in disagreement with you - and, perhaps - the majority on
this point.  I think that ship has already sailed: partitions ARE
tables.  We can try to make it less necessary for users to ever look
at those tables as separate objects, and I think that's a good idea.
But trying to go from a system where partitions are tables, which is
what we have today, to a system where they are not seems like a bad
idea to me.  If we make a major break from how things work today,
we're going to end up having to reimplement stuff that already works.

Besides, I haven't really seen anyone propose something that sounds
like a credible alternative.  If we could make partition objects
things that the storage layer needs to know about but the query
planner doesn't need to understand, that'd be maybe worth considering.
But I don't see any way that that's remotely feasible.  There are lots
of places that we assume that a heap consists of blocks number 0 up
through N: CTID pointers, index-to-heap pointers, nodeSeqScan, bits
and pieces of the way index vacuuming is handled, which in turn bleeds
into Hot Standby.  You can't just decide that now block numbers are
going to be replaced by some more complex structure, or even that
they're now going to be nonlinear, without breaking a huge amount of
stuff.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-08 Thread Robert Haas
On Mon, Dec 8, 2014 at 12:13 AM, Amit Langote
 wrote:
> So just to clarify, first and last destinations are considered "defined" if 
> you have something like:
>
> ...
> PARTITION p1 VALUES LESS THAN 10
> PARTITION p2 VALUES BETWEEN 10 AND 20
> PARTITION p3 VALUES GREATER THAN 20
> ...
>
> And "not defined" if:
>
> ...
> PARTITION p1 VALUES BETWEEN 10 AND 20
> ...

Yes.

>> For pg_dump --binary-upgrade, you need a statement like SELECT
>> binary_upgrade.set_next_toast_pg_class_oid('%d'::pg_catalog.oid) for
>> each pg_class entry.  So you can't easily have a single SQL statement
>> creating multiple such entries.
>
> Hmm, do you mean "pg_dump cannot emit" such a SQL or there shouldn't be one 
> in the first place?

I mean that the binary upgrade script needs to set the OID for every
pg_class object being restored, and it does that by stashing away up
to one (1) pg_class OID before each CREATE statement.  If a single
CREATE statement generates multiple pg_class entries, this method
doesn't work.

> Makes sense. This would double as a way to create subpartitions too? And that 
> would have to play well with any choice we end up making about how we treat 
> subpartitioning key (one of the points discussed above)

Yes, I think so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-08 Thread Robert Haas
On Sat, Dec 6, 2014 at 2:59 AM, Amit Kapila  wrote:
>> I guess you could list or hash partition on multiple columns, too.
>
> How would you distinguish values in list partition for multiple
> columns? I mean for range partition, we are sure there will
> be either one value for each column, but for list it could
> be multiple and not fixed for each partition, so I think it will not
> be easy to support the multicolumn partition key for list
> partitions.

I don't understand.  If you want to range partition on columns (a, b),
you say that, say, tuples with (a, b) values less than (100, 200) go
here and the rest go elsewhere.  For list partitioning, you say that,
say, tuples with (a, b) values of EXACTLY (100, 200) go here and the
rest go elsewhere.  I'm not sure how useful that is but it's not
illogical.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-08 Thread Josh Berkus
All,

Pardon me for jumping into this late.  In general, I like Alvaro's
approach.  However, I wanted to list the major shortcomings of the
existing replication system (based on complaints by PGX's users and on
IRC) and compare them to Alvaro's proposed implementation to make sure
that enough of them are addressed, and that the ones which aren't
addressed are not being addressed as a clear decision.  We can't address
*all* of the limitations of the current system, but let's make sure that
we're addressing enough of them to make implementing a 2nd partitioning
system worthwhile.

Where I have ? is because I'm not clear from Alvaro's proposal whether
they're addressed or not.

1.The Trigger Problem
the need to write triggers for INSERT/UPDATE/DELETE.
Addressed.

2. The Clutter Problem
cluttering up system views and dumps with hundreds of partitioned tables
Addressed.

3. Creation Problem
the need two write triggers and/or cron jobs to create new partitions
Addressed.

4. Creation Locking Problem
high probability of lock pile-ups whenever a new partition is created on
demand due to multiple backends trying to create the partition at the
same time.
Not Addressed?

5. Constant Problem
Since current partitioned query planning happens before the rewrite
phase, SELECTs do not use partition logic to evaluate even simple
expressions, let alone IMMUTABLE or STABLE functions.
Addressed??

6. Unique Index Problem
Cannot create a unique index across multiple partitions, which prevents
the partitioned table from being FK'd.
Not Addressed
(but could be addressed in the future)

7. JOIN Problem
Two partitioned tables being JOINed need to append and materialize
before the join, causing a very slow join under some circumstances, even
if both tables are partitioned on the same ranges.
Not Addressed?
(but could be addressed in the future)

8. COPY Problem
Cannot bulk-load into the Master, just into individual partitions.
Addressed.

9. Hibernate Problem
When using the trigger method, inserts into the master partition return
0, which Hibernate and some other ORMs regard as an insert failure.
Addressed.

10. Scaling Problem
Inheritance partitioning becomes prohibitively slow for the planner at
somewhere between 100 and 500 partitions depending on various factors.
No idea?

11. Hash Partitioning
Some users would prefer to partition into a fixed number of
hash-allocated partitions.
Not Addressed.

12. Extra Constraint Evaluation
Inheritance partitioning evaluates *all* constraints on the partitions,
whether they are part of the partitioning scheme or not.  This is way
expensive if those are, say, polygon comparisons.
Addressed.


Additionally, I believe that Alvaro's proposal will make the following
activities which are supported by partition-by-inheritance more
difficult or impossible.  Again, these are probably acceptable because
inheritance partitioning isn't going away.  However, we should
consciously decide that:

A. COPY/ETL then attach
In inheritance partitioning, you can easily build a partition outside
the master and then "attach" it, allowing for minimal disturbance of
concurrent users.  Could be addressed in the future.

B. Catchall Partition
Many partitioning schemes currently contain a "catchall" partition which
accepts rows outside of the range of the partitioning scheme, due to bad
input data.  Probably not handled on purpose; Alvaro is proposing that
we reject these instead, or create the partitions on demand, which is a
legitimate approach.

C. Asymmetric Partitioning / NULLs in partition column
This is the classic Active/Inactive By Month setup for partitions.
Could be addressed via special handling for NULL/infinity in the
partitioned column.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-07 Thread Amit Langote

From: Amit Kapila [mailto:amit.kapil...@gmail.com] 
> > > How would you distinguish values in list partition for multiple
> > > columns? I mean for range partition, we are sure there will
> > > be either one value for each column, but for list it could
> > > be multiple and not fixed for each partition, so I think it will not
> > > be easy to support the multicolumn partition key for list
> > > partitions.
>
> >Irrespective of difficulties of representing it using pg_node_tree, it seems 
> >to me that multicolumn list partitioning is not widely used.
> 
> So I think it is better to be clear why we are not planning to
> support it, is it that because it is not required by users or
> is it due to the reason that code seems to be tricky or is it due
> to both of the reasons.  It might help us if anyone raises this
> during the development of this patch or in general if someone
> requests such a feature.

Coming back to the how pg_node_tree representation for list partitions - 

For each column in a multicolumn list partition key, a value would look like a 
dumped Node for List of Consts (all allowed values in a given list partition). 
And the whole key would then be a List of such Nodes (a dump thereof). That's 
perhaps pretty verbose but I guess that's supposed to be only a catalog 
representation. During relcache building, we turn this back into a collection 
of structs to efficiently locate the partition of interest whatever the method 
of doing that ends up being (based on partition type). The relcache step 
ensures that we have decoupled the concern of quickly locating an interesting 
partition from its catalog representation.

Of course, there may be flaws in this picture and would only reveal themselves 
when actually trying to implement it or they can be pointed out in advance.

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-07 Thread Amit Kapila
On Mon, Dec 8, 2014 at 11:01 AM, Amit Langote 
wrote:
> From: Amit Kapila [mailto:amit.kapil...@gmail.com]
> Sent: Saturday, December 06, 2014 5:00 PM
> To: Robert Haas
> Cc: Amit Langote; Andres Freund; Alvaro Herrera; Bruce Momjian; Pg Hackers
> Subject: Re: [HACKERS] On partitioning
>
> On Fri, Dec 5, 2014 at 10:03 PM, Robert Haas 
wrote:
> > On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote
> >  wrote:
> >
> > > I wonder if your suggestion of pg_node_tree plays well here. This
then could be a list of CONSTs or some such... And I am thinking it's a
concern only for range partitions, no? (that is, a multicolumn partition
key)
> >
> > I guess you could list or hash partition on multiple columns, too.
> >
> > How would you distinguish values in list partition for multiple
> > columns? I mean for range partition, we are sure there will
> > be either one value for each column, but for list it could
> > be multiple and not fixed for each partition, so I think it will not
> > be easy to support the multicolumn partition key for list
> > partitions.
>
> Irrespective of difficulties of representing it using pg_node_tree, it
seems to me that multicolumn list partitioning is not widely used.

So I think it is better to be clear why we are not planning to
support it, is it that because it is not required by users or
is it due to the reason that code seems to be tricky or is it due
to both of the reasons.  It might help us if anyone raises this
during the development of this patch or in general if someone
requests such a feature.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] On partitioning

2014-12-07 Thread Amit Langote


From: pgsql-hackers-ow...@postgresql.org 
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Amit Kapila
Sent: Saturday, December 06, 2014 5:06 PM
To: Robert Haas
Cc: Amit Langote; Andres Freund; Alvaro Herrera; Bruce Momjian; Pg Hackers
Subject: Re: [HACKERS] On partitioning

On Fri, Dec 5, 2014 at 10:12 PM, Robert Haas  wrote:
> On Fri, Dec 5, 2014 at 2:18 AM, Amit Kapila  wrote:
> > Do we really need to support dml or pg_dump for individual partitions?
>
> I think we do.  It's quite reasonable for a DBA (or developer or
> whatever) to want to dump all the data that's in a single partition;
> for example, maybe they have the table partitioned, but also spread
> across several servers.  When the data on one machine grows too big,
> they want to dump that partition, move it to a new machine, and drop
> the partition from the old machine.  That needs to be easy and
> efficient.
>
> More generally, with inheritance, I've seen the ability to reference
> individual inheritance children be a real life-saver on any number of
> occasions.  Now, a new partitioning system that is not as clunky as
> constraint exclusion will hopefully be fast enough that people don't
> need to do it very often any more.  But I would be really cautious
> about removing the option.  That is the equivalent of installing a new
> fire suppression system and then boarding up the emergency exit.
> Yeah, you *hope* the new fire suppression system is good enough that
> nobody will ever need to go out that way any more.  But if you're
> wrong, people will die, so getting rid of it isn't prudent.  The
> stakes are not quite so high here, but the principle is the same.
>
> 
> Sure, I don't feel we should not provide anyway to take dump
> for individual partition but not at level of independent table.
> May be something like --table 
> --partition .
> 

This does sound cleaner.

> In general, I think we should try to avoid exposing that partitions are
> individual tables as that might hinder any future enhancement in that
> area (example if we someone finds a different and better way to
> arrange the partition data, then due to the currently exposed syntax,
> we might feel blocked). 

Sounds like a concern. I guess you are referring to whether we allow a 
partition relation to be included in the range table and then some other cases. 
In the former case we could allow referring to individual partitions by some 
additional syntax if it doesn’t end up looking too ugly or invite a bunch of 
other issues.

This seems to have been discussed a little bit upthread (for example, see "Open 
Questions" in Alvaro's original proposal and Hannu Krosing's reply). 

Regards,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-07 Thread Amit Langote


From: Amit Kapila [mailto:amit.kapil...@gmail.com] 
Sent: Saturday, December 06, 2014 5:00 PM
To: Robert Haas
Cc: Amit Langote; Andres Freund; Alvaro Herrera; Bruce Momjian; Pg Hackers
Subject: Re: [HACKERS] On partitioning

On Fri, Dec 5, 2014 at 10:03 PM, Robert Haas  wrote:
> On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote
>  wrote:
>
> > I wonder if your suggestion of pg_node_tree plays well here. This then 
> > could be a list of CONSTs or some such... And I am thinking it's a concern 
> > only for range partitions, no? (that is, a multicolumn partition key)
>
> I guess you could list or hash partition on multiple columns, too.
>
> How would you distinguish values in list partition for multiple
> columns? I mean for range partition, we are sure there will
> be either one value for each column, but for list it could
> be multiple and not fixed for each partition, so I think it will not
> be easy to support the multicolumn partition key for list
> partitions.

Irrespective of difficulties of representing it using pg_node_tree, it seems to 
me that multicolumn list partitioning is not widely used. It is used in 
combination with range or hash partitioning as composite partitioning. So, 
perhaps we need not worry about that.

Regards,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-07 Thread Amit Langote
Hi Robert,

> From: Robert Haas [mailto:robertmh...@gmail.com]
> On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote
>  wrote:
> >> So, we're going to support exactly two levels of partitioning?
> >> partitions with partissub=false and subpartitions with partissub=true?
> >>  Why not support only one level of partitioning here but then let the
> >> children have their own pg_partitioned_rel entries if they are
> >> subpartitioned?  That seems like a cleaner design and lets us support
> >> an arbitrary number of partitioning levels if we ever need them.
> >
> > Yeah, that's what I thought at some point in favour of dropping partissub
> altogether. However, not that this design solves it, there is one question - 
> if
> we would want to support defining for a table both partition key and sub-
> partition key in advance? That is, without having defined a first level 
> partition
> yet; in that case, what level do we associate sub-(sub-) partitioning key with
> or more to the point where do we keep it?
> 
> Do we really need to allow that?  I think you let people partition a
> toplevel table, and then partition its partitions once they've been
> created.  I'm not sure there's a good reason to associate the
> subpartitioning scheme with the toplevel table.  For one thing, that
> forces all subpartitions to be partitioned the same way - do we want
> to insist on that?  If we do, then I agree that we need to think a
> little harder here.
> 

To me, it sounds better if we insist on a uniform subpartitioning scheme across 
all partitions. It seems that's how it's done elsewhere. It would be 
interesting to hear what others think though.

> > That would be a default partition. That is, where the tuples that don't
> belong elsewhere (other defined partitions) go. VALUES clause of the
> definition for such a partition would look like:
> >
> > (a range partition) ... VALUES LESS THAN MAXVALUE
> > (a list partition) ... VALUES DEFAULT
> >
> > There has been discussion about whether there shouldn't be such a place
> for tuples to go. That is, it should generate an error if a tuple can't go
> anywhere (or support auto-creating a new one like in interval partitioning?)
> 
> I think Alvaro's response further down the thread is right on target.
> But to go into a bit more detail, let's consider the three possible
> cases:
> 
> - Hash partitioning.  Every key value gets hashed to some partition.
> The concept of an overflow or default partition doesn't even make
> sense.
> 
> - List partitioning.  Each key for which the user has defined a
> mapping gets sent to the corresponding partition.  The keys that
> aren't mapped anywhere can either (a) cause an error or (b) get mapped
> to some default partition.  It's probably useful to offer both
> behaviors.  But I don't think it requires a partitionisoverflow
> column, because you can represent it some other way, such as by making
> partitionvalues NULL, which is otherwise meaningless.
> 
> - Range partitioning.  In this case, what you've basically got is a
> list of partition bounds and a list of target partitions.   Suppose
> there are N partition bounds; then there will be N+1 targets.  Some of
> those targets can be undefined, meaning an attempt to insert a key
> with that value will error out.  For example, suppose the user defines
> a partition for values 1-3 and 10-13.  Then your list of partition
> bounds looks like this:
> 
> 1,3,10,13
> 
> And your list of destinations looks like this:
> 
> undefined,firstpartition,undefined,secondpartition,undefined
> 
> More commonly, the ranges will be contiguous, so that there are no
> gaps.  If you have everything <10 in the first partition, everything
> 10-20 in the second partition, and everything else in a third
> partition, then you have bounds 10,20 and destinations
> firstpartition,secondpartition,thirdpartition.  If you want values
> greater than 20 to error out, then you have bounds 10,20 and
> destinations firstpartition,secondpartition,undefined.
> 
> In none of this do you really have "an overflow partition".  Rather,
> the first and last destinations, if defined, catch everything that has
> a key lower than the lowest key or higher than the highest key.  If
> not defined, you error out.

So just to clarify, first and last destinations are considered "defined" if you 
have something like:

...
PARTITION p1 VALUES LESS THAN 10
PARTITION p2 VALUES BETWEEN 10 AND 20
PARTITION p3 VALUES GREATER THAN 20
...

And "not defined" if:

...
PARTITION p1 VALUES BETWEEN 10 AND 20
...

In the second case, because no explicit definitions for values less than 10 and 
greater than 20 are in place, rows with that value error out? If so, that makes 
sense. 

> 
> > I wonder if your suggestion of pg_node_tree plays well here. This then
> could be a list of CONSTs or some such... And I am thinking it's a concern 
> only
> for range partitions, no? (that is, a multicolumn partition key)
> 
> I guess you could list or hash partition on multiple colum

Re: [HACKERS] On partitioning

2014-12-06 Thread Amit Kapila
On Fri, Dec 5, 2014 at 10:12 PM, Robert Haas  wrote:
> On Fri, Dec 5, 2014 at 2:18 AM, Amit Kapila 
wrote:
> > Do we really need to support dml or pg_dump for individual partitions?
>
> I think we do.  It's quite reasonable for a DBA (or developer or
> whatever) to want to dump all the data that's in a single partition;
> for example, maybe they have the table partitioned, but also spread
> across several servers.  When the data on one machine grows too big,
> they want to dump that partition, move it to a new machine, and drop
> the partition from the old machine.  That needs to be easy and
> efficient.
>
> More generally, with inheritance, I've seen the ability to reference
> individual inheritance children be a real life-saver on any number of
> occasions.  Now, a new partitioning system that is not as clunky as
> constraint exclusion will hopefully be fast enough that people don't
> need to do it very often any more.  But I would be really cautious
> about removing the option.  That is the equivalent of installing a new
> fire suppression system and then boarding up the emergency exit.
> Yeah, you *hope* the new fire suppression system is good enough that
> nobody will ever need to go out that way any more.  But if you're
> wrong, people will die, so getting rid of it isn't prudent.  The
> stakes are not quite so high here, but the principle is the same.
>

Sure, I don't feel we should not provide anyway to take dump
for individual partition but not at level of independent table.
May be something like --table 
--partition .

In general, I think we should try to avoid exposing that partitions are
individual tables as that might hinder any future enhancement in that
area (example if we someone finds a different and better way to
arrange the partition data, then due to the currently exposed syntax,
we might feel blocked).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] On partitioning

2014-12-06 Thread Amit Kapila
On Fri, Dec 5, 2014 at 10:03 PM, Robert Haas  wrote:
> On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote
>  wrote:
>
> > I wonder if your suggestion of pg_node_tree plays well here. This then
could be a list of CONSTs or some such... And I am thinking it's a concern
only for range partitions, no? (that is, a multicolumn partition key)
>
> I guess you could list or hash partition on multiple columns, too.

How would you distinguish values in list partition for multiple
columns? I mean for range partition, we are sure there will
be either one value for each column, but for list it could
be multiple and not fixed for each partition, so I think it will not
be easy to support the multicolumn partition key for list
partitions.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] On partitioning

2014-12-05 Thread Robert Haas
On Fri, Dec 5, 2014 at 3:05 PM, Jim Nasby  wrote:
>> On what basis do you expect that?  Every time you use a view, you're
>> using a pg_node_tree.  Nobody's ever complained that having to reload
>> the pg_node_tree column was too slow, and I see no reason to suppose
>> that things would be any different here.
>>
>> I mean, we can certainly invent something new if there is a reason to
>> do so.  But you (and a few other people) seem to be trying pretty hard
>> to avoid using the massive amount of infrastructure that we already
>> have to do almost this exact thing, which puzzles the heck out of me.
>
> My concern is how to do the routing of incoming tuples. I'm assuming it'd be
> significantly faster to compare two tuples than to run each tuple through a
> bunch of nodetrees.

As I said before, that's a completely unrelated problem.

To quickly route tuples for range or list partitioning, you're going
to want to have an array of Datums in memory and bseach it.  That says
nothing about how they should be stored on disk.  Whatever the on-disk
representation looks like, the relcache is going to need to reassemble
it into an array that can be binary-searched.  As long as that's not
hard to do - and none of the proposals here would make it hard to do -
there's no reason to care about this from that point of view.

At least, not that I can see.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-05 Thread Jim Nasby

On 12/5/14, 2:02 PM, Robert Haas wrote:

On Fri, Dec 5, 2014 at 2:52 PM, Jim Nasby  wrote:

The other option would be to use some custom rowtype to store boundary
values and have a method that can form a boundary tuple from a real one.
Either way, I suspect this is better than frequently evaluating
pg_node_trees.


On what basis do you expect that?  Every time you use a view, you're
using a pg_node_tree.  Nobody's ever complained that having to reload
the pg_node_tree column was too slow, and I see no reason to suppose
that things would be any different here.

I mean, we can certainly invent something new if there is a reason to
do so.  But you (and a few other people) seem to be trying pretty hard
to avoid using the massive amount of infrastructure that we already
have to do almost this exact thing, which puzzles the heck out of me.


My concern is how to do the routing of incoming tuples. I'm assuming it'd be 
significantly faster to compare two tuples than to run each tuple through a 
bunch of nodetrees.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-05 Thread Robert Haas
On Fri, Dec 5, 2014 at 2:52 PM, Jim Nasby  wrote:
> The other option would be to use some custom rowtype to store boundary
> values and have a method that can form a boundary tuple from a real one.
> Either way, I suspect this is better than frequently evaluating
> pg_node_trees.

On what basis do you expect that?  Every time you use a view, you're
using a pg_node_tree.  Nobody's ever complained that having to reload
the pg_node_tree column was too slow, and I see no reason to suppose
that things would be any different here.

I mean, we can certainly invent something new if there is a reason to
do so.  But you (and a few other people) seem to be trying pretty hard
to avoid using the massive amount of infrastructure that we already
have to do almost this exact thing, which puzzles the heck out of me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-05 Thread Jim Nasby

On 12/5/14, 1:22 PM, Jim Nasby wrote:

On 12/5/14, 3:42 AM, Amit Langote wrote:

>  I think you are right.  I think in this case we need something similar
>to column pg_index.indexprs which is of type pg_node_tree(which
>seems to be already suggested by Robert). So may be we can proceed
>with this type and see if any one else has better idea.

One point raised about/against pg_node_tree was the values represented therein 
would turn out to be too generalized to be used with advantage during planning. 
But, it seems we could deserialize it in advance back to the internal form 
(like an array of a struct) as part of the cached relation data. This overhead 
would only be incurred in case of partitioned tables. Perhaps this is what 
Robert suggested elsewhere.


In order to store a composite type in a catalog, we would need to have one field that has 
the typid of the composite, and the field that stores the actual composite data would 
need to be a "dumb" varlena that stores the composite HeapTupleHeader.


On further thought; if we disallow NULL as a partition boundary, we don't need 
a separate rowtype; we could just use the one associated with the relation 
itself. Presumably that would make comparing tuples to the relation list a lot 
easier.

I was hung up on how that would work in the case of ALTER TABLE, but we'd have 
the same problem with using pg_node_tree: if you alter a table in such a way 
that *might* affect your partitioning, you have to do some kind of revalidation 
anyway.

The other option would be to use some custom rowtype to store boundary values 
and have a method that can form a boundary tuple from a real one. Either way, I 
suspect this is better than frequently evaluating pg_node_trees.

There may be one other option. If range partitions are defined in terms of an 
expression that is different for every partition (ie: (substr(product_key, 1, 
4), date_trunc('month', sales_date))) then we could use a hash of that 
expression to identify a partition. In other words, range partitioning becomes 
a special case of hash partitioning. I do think we need a programmatic means to 
identify the range of an individual partition and hash won't solve that, but 
the performance of that case isn't critical so we could use pretty much 
whatever we wanted to there.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-05 Thread Jim Nasby

On 12/5/14, 3:42 AM, Amit Langote wrote:

>  I think you are right.  I think in this case we need something similar
>to column pg_index.indexprs which is of type pg_node_tree(which
>seems to be already suggested by Robert). So may be we can proceed
>with this type and see if any one else has better idea.

One point raised about/against pg_node_tree was the values represented therein 
would turn out to be too generalized to be used with advantage during planning. 
But, it seems we could deserialize it in advance back to the internal form 
(like an array of a struct) as part of the cached relation data. This overhead 
would only be incurred in case of partitioned tables. Perhaps this is what 
Robert suggested elsewhere.


In order to store a composite type in a catalog, we would need to have one field that has 
the typid of the composite, and the field that stores the actual composite data would 
need to be a "dumb" varlena that stores the composite HeapTupleHeader.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-05 Thread Robert Haas
On Fri, Dec 5, 2014 at 3:11 AM, Amit Langote
 wrote:
>> I think you are right.  I think in this case we need something similar
>> to column pg_index.indexprs which is of type pg_node_tree(which
>> seems to be already suggested by Robert). So may be we can proceed
>> with this type and see if any one else has better idea.
>
> Yeah, with that, I was thinking we may be able to do something like dump a 
> Node that describes the range partition bounds or list of allowed values 
> (say, RangePartitionValues, ListPartitionValues).

That's exactly what the kind of thing I was thinking about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   >