Re: partitioning an existing table - efficient pg_dump

2017-12-30 Thread Justin Pryzby
On Sat, Dec 30, 2017 at 09:19:05AM -0500, Robert Blayzor wrote:
> On Dec 30, 2017, at 12:38 AM, Justin Pryzby  wrote:

> > BTW depending on your requirements, it may be possible to make pg_dump much
> > more efficient.  For our data, it's reasonable to assume that a table is
> > "final" if its constraints exclude data older than a few days ago, and it 
> > can
> > be permanently dumped and excluded from future, daily backups, which makes 
> > the
> > backups smaller and faster, and probably causes less cache churn, etc.  But 
> > I
> > imagine you might have different requirements, so that may be infeasible, or
> > you'd maybe have to track insertions, either via pg_stat_user_tables, or
> > at the application layer, and redump the relevant table.
> 
> The idea is to only keep a number of months available for searching over a 
> period of months. Those months could be 3 or more, up to a year, etc. But 
> being able to just drop an entire child table for pruning is very 
> attractive. Right now the average month's data is about 2-3 million rows. 
> Data is just inserted and then only searched, never updated.
> 
> I also like the idea of skipping all this older data in a pg_dump. We 
> archive records inserted into these tables daily into cold storage, i.e. 
> exported and compressed, so the data is saved cold. We dump the DB nightly 
> too, but it would probably make sense to skip anything outside of the newest 
> child table. Just not sure how to make that happen, yet.

For us, I classify the tables as "partitioned" or "not partitioned" and
subdivide "partitioned" into "recent" or "historic" based on table names; but
if you design it from scratch then you'd have the opportunity to keep a list of
partitioned tables, their associated date range, date of most recent insertion,
and most recent "final" backup.

This is the essence of it:
snap=...  # SELECT pg_export_snapshot(); keep that session open while dumping
pg_dump --snapshot "$snap" -T "$ptnreg" -f nonpartitioned.new
pg_dump --snapshot "$snap" -t "$recent" -f recent.new
loop over the historic partitioned tables and run a "final" pg_dump for any
  that has been INSERTed into more recently than it was last dumped.
remove any "final" pg_dump not included in any existing backup (assuming you
  keep multiple copies on different rotations).
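The classification and "dump only if inserted since the last final dump" decisions above can be sketched in a small maintenance script. This is a minimal illustration, not the thread author's actual code; the `<base>_YYYYMM` naming convention and the one-month "recent" window are assumptions:

```python
import re
from datetime import date

# Assumed naming convention: monthly partitions are named <base>_YYYYMM.
PART_RE = re.compile(r"^(?P<base>\w+)_(?P<yyyymm>\d{6})$")

def classify(table, today, recent_months=1):
    """Classify a table as 'nonpartitioned', 'recent', or 'historic'
    based purely on its name."""
    m = PART_RE.match(table)
    if not m:
        return "nonpartitioned"
    year, month = int(m["yyyymm"][:4]), int(m["yyyymm"][4:])
    # Age of the partition's month relative to today, in whole months.
    age = (today.year - year) * 12 + (today.month - month)
    return "recent" if age < recent_months else "historic"

def needs_final_dump(last_insert, last_dump):
    """A historic partition needs a fresh "final" dump if it has been
    INSERTed into more recently than it was last dumped."""
    return last_dump is None or (last_insert is not None
                                 and last_insert > last_dump)
```

The nonpartitioned and recent sets feed the two pg_dump invocations; the historic set goes through `needs_final_dump` one table at a time.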

Note that pg_dump -t/-T patterns are matched according to the same rules as
psql's \d commands (shell-style wildcards), not as egrep regular expressions.

Justin



Re: partitioning an existing table

2017-12-30 Thread Robert Blayzor
On Dec 30, 2017, at 12:38 AM, Justin Pryzby  wrote:
> For inheritance (available in and before PG10), the parent may be nonempty,
> which works fine, although someone else might find it unintuitive.  (Does the
> doc actually say "should" somewhere ?)

Well, it doesn't say "should", but it does say "normally":

"The parent table itself is normally empty; it exists just to represent the 
entire data set. …


> Just curious: are your constraints/indices on starting time or ending time?

Yes, the child tables will each hold strictly one month's worth of data.

CREATE TABLE table_201801
  (CHECK (ts >= DATE '2018-01-01' AND ts < DATE '2018-02-01'))
  INHERITS …


The application will insert directly into the child tables, so no need for 
triggers or rules.
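Since one child is created per month, the DDL above can be generated mechanically. A sketch, assuming the `<parent>_YYYYMM` naming shown above (the parent and column names passed in are placeholders):

```python
from datetime import date

def month_child_ddl(parent, ts_col, year, month):
    """Build the CREATE TABLE statement for one monthly child, with a CHECK
    constraint over the half-open range [month start, next month start)."""
    start = date(year, month, 1)
    # Roll over December into January of the following year.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    child = f"{parent}_{year:04d}{month:02d}"
    return (
        f"CREATE TABLE {child}\n"
        f"  (CHECK ({ts_col} >= DATE '{start}' AND {ts_col} < DATE '{end}'))\n"
        f"  INHERITS ({parent});"
    )
```

The half-open range (>= start, < next start) avoids any gap or overlap between adjacent months.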


> BTW depending on your requirements, it may be possible to make pg_dump much
> more efficient.  For our data, it's reasonable to assume that a table is
> "final" if its constraints exclude data older than a few days ago, and it can
> be permanently dumped and excluded from future, daily backups, which makes the
> backups smaller and faster, and probably causes less cache churn, etc.  But I
> imagine you might have different requirements, so that may be infeasible, or
> you'd maybe have to track insertions, either via pg_stat_user_tables, or at
> the application layer, and redump the relevant table.

The idea is to only keep a number of months available for searching over a 
period of months. Those months could be 3 or more, up to a year, etc. But being 
able to just drop an entire child table for pruning is very attractive. Right 
now the average month's data is about 2-3 million rows. Data is just inserted 
and then only searched, never updated.

I also like the idea of skipping all this older data in a pg_dump. We archive 
records inserted into these tables daily into cold storage, i.e. exported and 
compressed, so the data is saved cold. We dump the DB nightly too, but it would 
probably make sense to skip anything outside of the newest child table. Just 
not sure how to make that happen, yet.







Re: partitioning an existing table

2017-12-29 Thread Justin Pryzby
On Fri, Dec 29, 2017 at 11:37:56PM -0500, Robert Blayzor wrote:
> The docs claim that the master table "should" be empty. Is it possible to 
> just create child tables off an existing master table with data, then just 
> insert data into the new child tables?
> 
> The plan would be to keep data in the master table and purge it over time 
> until it's eventually empty, then drop the indexes as well.
> 
> Fully understanding that data needs to be placed in the right child tables. 
> Data outside of those child ranges would remain as “old data” in the master 
> table.
> 
> Just trying to gauge whether that's an acceptable migration of live data 
> from a single large table into partitioning. Think of it as a very large 
> table of cyclic data that ages out: new data goes into child tables while 
> data is removed from the master table over time.

For PG10 "partitions" (as in relkind='p') the parent is defined as empty
(actually has no underlying storage).

For inheritance (available in and before PG10), the parent may be nonempty,
which works fine, although someone else might find it unintuitive.  (Does the
doc actually say "should" somewhere ?)

You almost certainly want child tables to have constraints, to allow
constraint_exclusion (which is the only reason one child table is more "right"
than any other, besides the associated pruning/retention schedule).

Since you'll be running DELETE rather than DROP on the parent, you might
consider DELETE FROM ONLY, but it won't matter if your children's constraints
are usable with the DELETE's WHERE condition.
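Spelled out, the in-place purge looks something like this (the table and column names are my own illustration, not from the thread):

```sql
-- Remove aged-out rows from the parent's own storage only;
-- child tables are retired wholesale with DROP TABLE instead.
DELETE FROM ONLY parent_table WHERE ts < DATE '2017-01-01';
```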

Also, note that autoanalyze doesn't know to update the PARENT's statistics
when rows are INSERTed into its children or when children are DROPped.  So I'd
suggest running ANALYZE on each parent following DROP of its children (or on
some more frequent schedule, to account for newly inserted rows, too).  Perhaps
that should be added as a CAVEAT?
https://www.postgresql.org/docs/10/static/ddl-inherit.html#DDL-INHERIT-CAVEATS
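For example, as part of the monthly retention job (partition and parent names assumed):

```sql
DROP TABLE table_201701;   -- prune a whole month at once
ANALYZE parent_table;      -- refresh the parent's inheritance statistics
```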

Just curious: are your constraints/indices on starting time or ending time?

BTW depending on your requirements, it may be possible to make pg_dump much
more efficient.  For our data, it's reasonable to assume that a table is
"final" if its constraints exclude data older than a few days ago, and it can
be permanently dumped and excluded from future, daily backups, which makes the
backups smaller and faster, and probably causes less cache churn, etc.  But I
imagine you might have different requirements, so that may be infeasible, or
you'd maybe have to track insertions, either via pg_stat_user_tables, or at the
application layer, and redump the relevant table.
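A sketch of the pg_stat_user_tables approach; the dump_log bookkeeping table is hypothetical:

```sql
-- n_tup_ins is cumulative, so record its value at each "final" dump
-- (here in a hypothetical dump_log table) and redump tables that moved.
SELECT s.relname
  FROM pg_stat_user_tables s
  JOIN dump_log d USING (relname)
 WHERE s.n_tup_ins > d.n_tup_ins_at_dump;
```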

Justin



partitioning an existing table

2017-12-29 Thread Robert Blayzor
The docs claim that the master table "should" be empty. Is it possible to just 
create child tables off an existing master table with data, then just insert 
data into the new child tables?

The plan would be to keep data in the master table and purge it over time until 
it's eventually empty, then drop the indexes as well.

Fully understanding that data needs to be placed in the right child tables. 
Data outside of those child ranges would remain as “old data” in the master 
table.


Just trying to gauge whether that's an acceptable migration of live data from a 
single large table into partitioning. Think of it as a very large table of 
cyclic data that ages out: new data goes into child tables while data is 
removed from the master table over time.

--
inoc.net!rblayzor
XMPP: rblayzor.AT.inoc.net
PGP:  https://inoc.net/~rblayzor/