Re: [PERFORM] partitioning materialized views

2017-07-06 Thread Rick Otten
>
>
>> If you _can't_ do
>> that due to cloud restrictions, you'd actually be better off doing an
>> atomic swap.
>>
>> CREATE MATERIALIZED VIEW y AS ...;
>>
>> BEGIN;
>> ALTER MATERIALIZED VIEW x RENAME TO x_old;
>> ALTER MATERIALIZED VIEW y RENAME TO x;
>> DROP MATERIALIZED VIEW x_old;
>> COMMIT;
>>
> This is an interesting idea.  Thanks!  I'll ponder that one.
>
>
I don't think the downstream dependencies will let that work without
rebuilding them as well.  The DROP fails (without a CASCADE), and the
other views and matviews built on top of this one simply follow the
rename and keep pointing at x_old.
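
Roughly, with made-up object names, this is the failure mode I hit:

-- hypothetical objects, just to illustrate the dependency problem
CREATE MATERIALIZED VIEW x AS SELECT 1 AS id;
CREATE VIEW downstream AS SELECT * FROM x;
CREATE MATERIALIZED VIEW y AS SELECT 2 AS id;

BEGIN;
ALTER MATERIALIZED VIEW x RENAME TO x_old;
ALTER MATERIALIZED VIEW y RENAME TO x;
-- "downstream" follows the rename (dependencies track the object, not its
-- name), so this DROP errors out unless I add CASCADE, which would take
-- "downstream" down with it.
DROP MATERIALIZED VIEW x_old;
COMMIT;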


Re: [PERFORM] partitioning materialized views

2017-07-06 Thread Rick Otten
On Thu, Jul 6, 2017 at 11:25 AM, Shaun Thomas wrote:

> > I'm curious if I'm overlooking other possible architectures or tools
> > that might make this simpler to manage.
>
> One of the issues with materialized views is that they are based on
> views... For a concurrent update, it essentially performs a looped
> merge, which can be pretty ugly. That's the price you pay to be
> non-blocking. For this particular setup, I'd actually recommend using
> something like pglogical to just maintain a live copy of the remote
> table or wait for Postgres 10's logical replication.


Unfortunately the foreign database is Hadoop.  (As A Service)



> If you _can't_ do
> that due to cloud restrictions, you'd actually be better off doing an
> atomic swap.
>
> CREATE MATERIALIZED VIEW y AS ...;
>
> BEGIN;
> ALTER MATERIALIZED VIEW x RENAME TO x_old;
> ALTER MATERIALIZED VIEW y RENAME TO x;
> DROP MATERIALIZED VIEW x_old;
> COMMIT;
>
This is an interesting idea.  Thanks!  I'll ponder that one.



> You could still follow your partitioned plan if you don't want to
> update all of the data at once. Let's face it, 3-4 hours is still a
> ton of data transfer and calculation.
>
>
yup.


Re: [PERFORM] partitioning materialized views

2017-07-06 Thread Shaun Thomas
> I'm curious if I'm overlooking other possible architectures or tools that 
> might make this simpler to manage.

One of the issues with materialized views is that they are based on
views... For a concurrent update, it essentially performs a looped
merge, which can be pretty ugly. That's the price you pay to be
non-blocking. For this particular setup, I'd actually recommend using
something like pglogical to just maintain a live copy of the remote
table or wait for Postgres 10's logical replication. If you _can't_ do
that due to cloud restrictions, you'd actually be better off doing an
atomic swap.

CREATE MATERIALIZED VIEW y AS ...;

BEGIN;
ALTER MATERIALIZED VIEW x RENAME TO x_old;
ALTER MATERIALIZED VIEW y RENAME TO x;
DROP MATERIALIZED VIEW x_old;
COMMIT;

You could still follow your partitioned plan if you don't want to
update all of the data at once. Let's face it, 3-4 hours is still a
ton of data transfer and calculation.

-- 
Shaun M Thomas - 2ndQuadrant
PostgreSQL Training, Services and Support
shaun.tho...@2ndquadrant.com | www.2ndQuadrant.com




[PERFORM] partitioning materialized views

2017-07-06 Thread Rick Otten
I'm pondering approaches to partitioning large materialized views and was
hoping for some feedback and thoughts on it from the [perform] minds.

PostgreSQL 9.6.3 on Ubuntu 16.04 in the Google Cloud.

I have a foreign table with 250M or so rows and 50 or so columns, with a
UUID as the primary key.  Queries to the foreign table have high latency
(from several minutes to more than an hour to run).

If I create a materialized view of this FT, including indexes, it takes
about 3-4 hours.
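
For reference, the setup is roughly like this (names simplified; the real
table has ~50 columns and "id" stands in for the UUID primary key):

CREATE MATERIALIZED VIEW big_mv AS
    SELECT * FROM foreign_big_table;   -- foreign table, ~250M rows

-- REFRESH ... CONCURRENTLY requires a unique index:
CREATE UNIQUE INDEX big_mv_id_idx ON big_mv (id);
-- plus the other indexes the downstream queries need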

If I refresh the materialized view concurrently, it takes 4-5 DAYS.

When I run "refresh materialized view concurrently", it takes about an hour
to download the 250M rows and load them onto the SSD tempspace.  At that
point it flatlines a single core, drives I/O on the main tablespace up
pretty high, and stays that way until the refresh is complete.

In order to speed up the concurrent refreshes, I have it broken into 4
materialized views, manually partitioned (by date) with a "union all view"
in front of them.  Refreshing the data which is changing regularly (new
data, in one of the partitions) doesn't require refreshing the entire data
set.  This works fairly well, and I can refresh the most recent partition
in 1 - 2 hours (daily).
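
Roughly, the layout is (names and the date column are made up; the old
monolithic matview is gone and a plain view with the same name sits in
front of the partitions):

CREATE MATERIALIZED VIEW big_mv_p1 AS
    SELECT * FROM foreign_big_table
    WHERE created_at < '2016-01-01';
-- ... big_mv_p2 and big_mv_p3 cover the middle date ranges ...
CREATE MATERIALIZED VIEW big_mv_p4 AS
    SELECT * FROM foreign_big_table
    WHERE created_at >= '2017-01-01';
-- each partition gets its own unique index so it can be refreshed concurrently

CREATE VIEW big_mv AS
    SELECT * FROM big_mv_p1
    UNION ALL SELECT * FROM big_mv_p2
    UNION ALL SELECT * FROM big_mv_p3
    UNION ALL SELECT * FROM big_mv_p4;

-- the daily refresh only has to touch the newest partition:
REFRESH MATERIALIZED VIEW CONCURRENTLY big_mv_p4;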

However, sometimes I have to reach back in time and refresh the deeper
partitions.  This is taking 3 or more days to complete, even with the data
broken into 4 materialized views.  This approach lets me refresh all of
the partitions at the same time, which uses more cores (and more
tempspace) at once; I'd like to use as much of my system resources as
possible to get the refresh to finish faster.  Unfortunately I am finding
I need to refresh the deeper data more and more often (at least once per
week), and my table growth is going to jump from adding 3-5M rows per day
to adding 10-20M rows per day over the next month or two.  Waiting 3 or 4
days for the deeper data to be ready for consumption in PostgreSQL is no
longer acceptable to the business.

It doesn't look like partman supports partitioning materialized views.  It
also doesn't look like PG 10's new partitioning features will work with
materialized views (although I haven't tried it yet).   Citus DB also seems
to be unable to help in this scenario.

I could create new materialized views every time I need new data, and then
swap out the view that is in front of them.  There are other objects in the
database which have dependencies on that view. In my experiments so far,
"create and replace" seems to let me get away with this as long as the
columns don't change.
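
Something along these lines (hypothetical names again):

-- build the replacement alongside the current matview
CREATE MATERIALIZED VIEW big_mv_20170707 AS SELECT * FROM foreign_big_table;
CREATE UNIQUE INDEX ON big_mv_20170707 (id);

-- repoint the plain view that everything else depends on; "create or replace"
-- only works while the column list stays the same
CREATE OR REPLACE VIEW big_mv AS SELECT * FROM big_mv_20170707;

DROP MATERIALIZED VIEW big_mv_20170706;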

Alternatively, I could come up with a new partitioning scheme that lets me
more selectively run "refresh concurrently", and run more of those at the
same time.

I was leaning towards this latter solution.

Suppose I make a separate materialized view for each month of data.  At the
beginning of each month I would have to make a new materialized view, and
then add it into the "union all view" on the fly.

I would then need a "refresh all" script which refreshed as many of them
concurrently as I am willing to dedicate cores to.  And I need some handy
ways to selectively refresh specific months when I know data for a
particular month or set of months changed.
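
The "refresh all" part could probably be driven from the catalog, something
like this (just a sketch), with the output fed to as many parallel psql
sessions as I'm willing to dedicate cores to:

SELECT format('REFRESH MATERIALIZED VIEW CONCURRENTLY %I.%I;',
              schemaname, matviewname)
  FROM pg_matviews
 WHERE matviewname LIKE 'big_mv_p%'
 ORDER BY matviewname;

-- and the selective case is just running the statement for a given month:
REFRESH MATERIALIZED VIEW CONCURRENTLY big_mv_p2017_03;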

So, I actually have 2 of these 250M-row tables in the Foreign Database that
I want to do this with, and maybe more coming soon.

I'm curious if I'm overlooking other possible architectures or tools that
might make this simpler to manage.


Similarly, could I construct the "union all view" in front of the
partitions to be partition-aware, so that the query planner doesn't try to
look in every one of the materialized views behind it to find the rows I
want?   If I go with monthly partitions, I'll start with about 36
materialized views behind the main view.
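
One trick I'm considering for that (just a sketch, and as I understand it,
it depends on constraint_exclusion being left at 'partition' or set to 'on',
and on the query's date predicate being a plan-time constant): repeat the
date-range predicate inside each branch of the UNION ALL view, so the
planner can prove which branches can't match the query and skip them:

CREATE OR REPLACE VIEW big_mv AS
    SELECT * FROM big_mv_p2017_06
     WHERE created_at >= '2017-06-01' AND created_at < '2017-07-01'
    UNION ALL
    SELECT * FROM big_mv_p2017_07
     WHERE created_at >= '2017-07-01' AND created_at < '2017-08-01';
    -- ... one branch per month, ~36 of them to start ...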