Re: [PERFORM] Millions of tables
Greg, sorry for the resend: I had forgotten to include the list. On Wed, Oct 5, 2016 at 2:34 PM, Greg Spiegelberg wrote: > Data is not static. The 4M tables fall into one of two groups. > > Group A contains 2M tables. INSERT will occur ~100 times/day and maximum > number of records anticipated will be 200k. Periodic DELETE's will occur > removing "old" records. Age is something the client sets and I have no way > of saying 1 or 10k records will be removed. The ~100 times/day are per table, I assume. Also, I assume the DELETEs will remove records in batches (because the age criterion catches several records at once). > Group B contains the other 2M tables. Maximum records ~140k and UPSERT will > be the only mechanism used to populate and maintain. Periodic DELETE's may > run on these tables as well removing "old" records. So there will be inserts and updates. Either I missed it or you did not mention the criteria for placing a record in one of the 4M buckets. Can you shed light on what the criteria are? That would obviously suggest what indexing could be done. It would also be interesting to see the results of your tests with btree on really large tables, as Stephen had suggested. I know it is not the primary test you want to do, but I would rather first explore the "traditional" schema before I venture into the unknown of the multi-million-dollar, pardon, table schema. Kind regards -- [guy, jim, charlie].each {|him| remember.him do |as, often| as.you_can - without end} http://blog.rubybestpractices.com/ -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Millions of tables
On 10/5/16 7:34 AM, Greg Spiegelberg wrote: When you say "must do a vacuum of the entire database", are you saying the entire database must be vacuum'd as a whole per 2B transactions or all tables must be vacuum'd eventually at least once? All tables at least once. Prior to 9.6, that had to happen every 2B transactions. With 9.6 there's a freeze map, so if a page never gets dirtied between vacuum freeze runs then it doesn't need to be frozen. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
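[Editor's note] The "2B transactions" figure Jim cites can be roughed out numerically. The sketch below assumes one XID per record at the ~2,300 records/second estimate given elsewhere in the thread, which overstates XID burn if inserts are batched; it is back-of-envelope arithmetic, not the server's exact bookkeeping.

```python
# Rough arithmetic behind the "every 2B transactions" figure. PostgreSQL
# XIDs are 32-bit and compared circularly, so only about 2^31 transactions
# of "past" can exist before wraparound; autovacuum_freeze_max_age (default
# 200M) triggers anti-wraparound vacuums well before that horizon.
XID_SPACE = 2**32
WRAPAROUND_HORIZON = XID_SPACE // 2            # ~2.1B usable XIDs
AUTOVACUUM_FREEZE_MAX_AGE = 200_000_000        # default GUC value

def days_until_forced_freeze(xids_per_second: float) -> float:
    """Days until a freshly loaded table's XID age hits the forced-vacuum threshold."""
    return AUTOVACUUM_FREEZE_MAX_AGE / xids_per_second / 86_400

# At ~2,300 transactions/second (assumption: one XID per record):
print(round(days_until_forced_freeze(2_300), 1))   # → 1.0 (about a day)
```

At that rate the default threshold is reached in roughly a day, which is why batching inserts into fewer transactions matters so much at this scale.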
Re: [PERFORM] Millions of tables
On Fri, Sep 30, 2016 at 4:49 PM, Jim Nasby wrote: > On 9/29/16 6:11 AM, Alex Ignatov (postgrespro) wrote: > >> With millions of tables you have to set autovacuum_max_workers >> sky-high =). We have some situation when at thousands of tables >> autovacuum can’t vacuum all tables that need it. Simply it vacuums some >> of most modified table and never reach others. Only manual vacuum can >> help with this situation. With wraparound issue it can be a nightmare >> > > Specifically, autovac isn't going to start worrying about anti-wrap > vacuums until tables start hitting autovacuum_freeze_max_age (or > autovacuum_multixact_freeze_max_age). Any tables that hit that threshold > go to the front of the line for being vacuumed. (But keep in mind that > there is no universal line, just what each worker computes on its own when > it's started). > > Where things will completely fall apart for you is if a lot of tables all > have roughly the same relfrozenxid (or relminmxid), like they would > immediately after a large load. In that scenario you'll suddenly have loads > of work for autovac to do, all at the same time. That will make the > database, DBAs and you Very Unhappy (tm). > > Somehow, some way, you *must* do a vacuum of the entire database. Luckily > the freeze map in 9.6 means you'd only have to do that one time (assuming > the data really is static). In any older version, (auto)vacuum will need to > eventually *read everything in every table* at least once every ~2B > transactions. > Data is not static. The 4M tables fall into one of two groups. Group A contains 2M tables. INSERT will occur ~100 times/day and maximum number of records anticipated will be 200k. Periodic DELETE's will occur removing "old" records. Age is something the client sets and I have no way of saying 1 or 10k records will be removed. Group B contains the other 2M tables. Maximum records ~140k and UPSERT will be the only mechanism used to populate and maintain. 
Periodic DELETE's may run on these tables as well removing "old" records. Will a set of tables require vacuum'ing at the same time? Quite possibly but I have no way to say 2 or 200k tables will need it. When you say "must do a vacuum of the entire database", are you saying the entire database must be vacuum'd as a whole per 2B transactions or all tables must be vacuum'd eventually at least once? I want to be absolutely clear on what you're saying. > There is one potentially significant difference between autovac and manual > vacuums here; autovac treats toast tables as just another table, with their > own stats and their own freeze needs. If you're generating a lot of toast > records that might make a difference. > I do not anticipate TOAST entering the picture. No single column or record > 8KB or even approaching it. We have a few databases that (ab)use pg_toast and I want to avoid those complications. -Greg
Re: [PERFORM] Millions of tables
On Thu, Sep 29, 2016 at 4:11 AM, Alex Ignatov (postgrespro) < a.igna...@postgrespro.ru> wrote: > > > *From:* pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance- ow...@postgresql.org] *On Behalf Of * > > Thank you Terry. You get the gold star. :) I was waiting for that to > come up. > > > > Success means handling this condition. A whole database vacuum and > dump-restore is out of the question. Can a properly tuned autovacuum > prevent the situation? > > > > -Greg > > > > Hi! > > With millions of tables you have to set autovacuum_max_workers > sky-high =). We have some situation when at thousands of tables autovacuum > can’t vacuum all tables that need it. Simply it vacuums some of most > modified table and never reach others. > Any autovacuum worker should vacuum all tables in its assigned database which it perceives need vacuuming, as long as it can get the lock. Unless the worker is interrupted, for example by frequent database shutdowns, it should reach all tables in that database before it exits. Unless there is a bug, or you are constantly restarting the database before autovacuum can finish or doing something else to kill them off, what you describe should not happen. If it is a bug, we should fix it. Can you give more details? There is a known bug when you have multiple active databases in the same cluster. Once one database reaches the age where anti-wraparound vacuums kick in, then all future autovacuum workers are directed to that one database, starving all other databases of auto-vacuuming. But that doesn't sound like what you are describing. Cheers, Jeff
Re: [PERFORM] Millions of tables
On 9/29/16 6:11 AM, Alex Ignatov (postgrespro) wrote: With millions of tables you have to set autovacuum_max_workers sky-high =). We have some situation when at thousands of tables autovacuum can’t vacuum all tables that need it. Simply it vacuums some of most modified table and never reach others. Only manual vacuum can help with this situation. With wraparound issue it can be a nightmare Specifically, autovac isn't going to start worrying about anti-wrap vacuums until tables start hitting autovacuum_freeze_max_age (or autovacuum_multixact_freeze_max_age). Any tables that hit that threshold go to the front of the line for being vacuumed. (But keep in mind that there is no universal line, just what each worker computes on its own when it's started). Where things will completely fall apart for you is if a lot of tables all have roughly the same relfrozenxid (or relminmxid), like they would immediately after a large load. In that scenario you'll suddenly have loads of work for autovac to do, all at the same time. That will make the database, DBAs and you Very Unhappy (tm). Somehow, some way, you *must* do a vacuum of the entire database. Luckily the freeze map in 9.6 means you'd only have to do that one time (assuming the data really is static). In any older version, (auto)vacuum will need to eventually *read everything in every table* at least once every ~2B transactions. The only impact the number of tables is going to have on this is granularity. If you have a small number of large tables, you'll have (auto)vacuum processes that will need to run *uninterrupted* for a long time to move the freeze threshold on each table. If you have tons of small tables, you'll need tons of separate (auto)vacuums, but each one will run for a shorter interval, and if one gets interrupted it won't be as big a deal. 
There is one potentially significant difference between autovac and manual vacuums here; autovac treats toast tables as just another table, with their own stats and their own freeze needs. If you're generating a lot of toast records that might make a difference. When it comes to vacuum, you might find https://www.pgcon.org/2015/schedule/events/829.en.html useful. On a different topic... I didn't see anything in the thread about what you're storing, but with the row counts you're talking about I'm guessing it's something that's time-series. https://github.com/ElephantStack/ElephantStack is a project exploring the idea of using Postgres array types as a far more efficient way to store that kind of data; instead of an overhead of 24 bytes per row (plus indexes) arrays give you essentially zero overhead per row. There's no code yet, but a few of us have done testing on some real world data (see the google group referenced from the README). -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
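[Editor's note] The 24-bytes-per-row point can be sketched numerically. The ~24-byte heap tuple header is real PostgreSQL overhead, but the payload size and the 1000-values-per-array packing below are illustrative assumptions, not a description of how ElephantStack actually stores data.

```python
# Back-of-envelope: per-row header overhead vs. packing values into arrays.
ROW_HEADER = 24                    # heap tuple header, bytes (approximate)
PAYLOAD = 8                        # assumption: one bigint per reading

def one_row_per_reading(n: int) -> int:
    # conventional layout: every reading is its own heap tuple
    return n * (ROW_HEADER + PAYLOAD)

def array_packed(n: int, per_array: int = 1_000) -> int:
    # one heap row per 1000 readings, values packed into an int8[] column
    n_rows = -(-n // per_array)    # ceil division
    return n_rows * ROW_HEADER + n * PAYLOAD

n = 1_000_000
print(round(one_row_per_reading(n) / array_packed(n), 1))   # → 4.0x smaller
```

With an 8-byte payload the header dominates, so amortizing it across an array shrinks the heap roughly 4x before even counting index savings; wider rows would see a smaller ratio.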
Re: [PERFORM] Millions of tables
From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Greg Spiegelberg Sent: Tuesday, September 27, 2016 7:28 PM To: Terry Schmitt <tschm...@schmittworks.com> Cc: pgsql-performa. <pgsql-performance@postgresql.org> Subject: Re: [PERFORM] Millions of tables On Tue, Sep 27, 2016 at 10:15 AM, Terry Schmitt <tschm...@schmittworks.com> wrote: On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg <gspiegelb...@gmail.com> wrote: Hey all, Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time has said not to have millions of tables. I too have long believed it until recently. AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for PGDATA. Over the weekend, I created 8M tables with 16M indexes on those tables. Table creation initially took 0.018031 secs, average 0.027467 and after tossing out outliers (qty 5) the maximum creation time found was 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 seconds. Tables were created by a single process. Do note that table creation is done via plpgsql function as there are other housekeeping tasks necessary though minimal. No system tuning but here is a list of PostgreSQL knobs and switches: shared_buffers = 2GB work_mem = 48 MB max_stack_depth = 4 MB synchronous_commit = off effective_cache_size = 200 GB pg_xlog is on its own file system There are still some obvious problems. General DBA functions such as VACUUM and ANALYZE should not be done. Each will run forever and cause much grief. Backups are problematic in the traditional pg_dump and PITR space. Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in my test case) are no-no's. A system or database crash could take potentially hours to days to recover. There are likely other issues ahead. You may wonder, "why is Greg attempting such a thing?" I looked at DynamoDB, BigTable, and Cassandra. 
I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop". I looked at many others and ultimately the recommended use of each vendor was to have one table for all data. That overcomes the millions of tables problem, right? Problem with the "one big table" solution is I anticipate 1,200 trillion records. Random access is expected and the customer expects <30ms reads for a single record fetch. No data is loaded... yet. Table and index creation only. I am interested in the opinions of all including tests I may perform. If you had this setup, what would you capture / analyze? I have a job running preparing data. I did this on a much smaller scale (50k tables) and data load via function allowed close to 6,000 records/second. The schema has been simplified since and the last test reached just over 20,000 records/second with 300k tables. I'm not looking for alternatives yet but input to my test. Takers? I can't promise immediate feedback but will do my best to respond with results. TIA, -Greg I have not seen any mention of transaction ID wraparound in this thread yet. With the numbers that you are looking at, I could see this as a major issue. T Thank you Terry. You get the gold star. :) I was waiting for that to come up. Success means handling this condition. A whole database vacuum and dump-restore is out of the question. Can a properly tuned autovacuum prevent the situation? -Greg Hi! With millions of tables you have to set autovacuum_max_workers sky-high =). We have some situation when at thousands of tables autovacuum can’t vacuum all tables that need it. Simply it vacuums some of most modified table and never reach others. Only manual vacuum can help with this situation. With wraparound issue it can be a nightmare -- Alex Ignatov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [PERFORM] Millions of tables
On 26 September 2016 at 05:19, Greg Spiegelberg wrote: > I did look at PostgresXL and CitusDB. Both are admirable however neither > could support the need to read a random record consistently under 30ms. > It's a similar problem Cassandra and others have: network latency. At this > scale, to provide the ability to access any given record amongst trillions > it is imperative to know precisely where it is stored (system & database) > and read a relatively small index. I have other requirements that prohibit > use of any technology that is eventually consistent. Then XL is exactly what you need, since it does allow you to calculate exactly where the record is via hash and then access it, which makes the request just a single datanode task. XL is not the same as CitusDB. > I liken the problem to fishing. To find a particular fish of length, size, > color in a data lake you must accept the possibility of scanning the > entire lake. However, if all fish were in barrels where each barrel had a > particular kind of fish of specific length, size, color then the problem > is far simpler. The task of putting the fish in the appropriate barrel is quite hard. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
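[Editor's note] The single-datanode routing Simon describes can be illustrated with a toy hash-distribution sketch. The node list is hypothetical and crc32 stands in for Postgres-XL's real distribution hash and catalog metadata; this shows only the shape of the idea.

```python
# Toy sketch of hash-distribution routing: hash the distribution key, map it
# to a datanode, and a single-record lookup touches exactly one node.
import zlib

DATANODES = ["node0", "node1", "node2", "node3"]   # hypothetical cluster

def datanode_for(key: int) -> str:
    # stable hash of the key -> bucket -> node
    h = zlib.crc32(key.to_bytes(8, "little"))
    return DATANODES[h % len(DATANODES)]

# The same key always routes to the same single node, so the coordinator
# never has to fan the query out across the cluster:
assert datanode_for(42) == datanode_for(42)
assert datanode_for(42) in DATANODES
```

The key property is determinism: given the key, the coordinator computes the location instead of searching for it, which is what keeps a point lookup to one network hop.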
Re: [PERFORM] Millions of tables
Greg, * Greg Spiegelberg (gspiegelb...@gmail.com) wrote: > On Wed, Sep 28, 2016 at 11:27 AM, Stephen Frost wrote: > > * Greg Spiegelberg (gspiegelb...@gmail.com) wrote: > > > Bigger buckets mean a wider possibility of response times. Some buckets > > > may contain 140k records and some 100X more. > > > > Have you analyzed the depth of the btree indexes to see how many more > > pages need to be read to handle finding a row in 140k records vs. 14M > > records vs. 140M records? > > > > I suspect you'd find that the change in actual depth (meaning how many > > pages have to actually be read to find the row you're looking for) isn't > > very much and that your concern over the "wider possibility of response > > times" isn't well founded. > Excellent feedback! Um, how does one look at tree depth in PostgreSQL? > Oracle I know but have not done the same in PG. Pointers? CREATE EXTENSION pageinspect; SELECT * FROM bt_metap('indexname'); https://www.postgresql.org/docs/9.5/static/pageinspect.html Thanks! Stephen
Re: [PERFORM] Millions of tables
On Wed, Sep 28, 2016 at 11:27 AM, Stephen Frost wrote: > Greg, > > * Greg Spiegelberg (gspiegelb...@gmail.com) wrote: > > Bigger buckets mean a wider possibility of response times. Some buckets > > may contain 140k records and some 100X more. > > Have you analyzed the depth of the btree indexes to see how many more > pages need to be read to handle finding a row in 140k records vs. 14M > records vs. 140M records? > > I suspect you'd find that the change in actual depth (meaning how many > pages have to actually be read to find the row you're looking for) isn't > very much and that your concern over the "wider possibility of response > times" isn't well founded. > > Stephen, Excellent feedback! Um, how does one look at tree depth in PostgreSQL? Oracle I know but have not done the same in PG. Pointers? > Since you have a hard-set 30ms maximum for query response time, I would > suggest you work out how long it takes to read a cold page from your I/O > subsystem and then you can work through exactly how many page reads > could be done in that 30ms (or perhaps 20ms, to allow for whatever > overhead there will be in the rest of the system and as a buffer) and > then work that back to how deep the index can be based on that many page > reads and then how many records are required to create an index of that > depth. Of course, the page from the heap will also need to be read and > there's a bit of additional work to be done, but the disk i/o for cold > pages is almost certainly where most time will be spent. > > I suspect you'll discover that millions of tables is a couple orders of > magnitude off of how many you'd need to keep the number of page reads > below the threshold you work out based on your I/O. > > Of course, you would need a consistent I/O subsystem, or at least one > where you know the maximum possible latency to pull a cold page. > > Lastly, you'll want to figure out how to handle system crash/restart if > this system requires a high uptime. 
I expect you'd want to have at > least one replica and a setup which allows you to flip traffic to it > very quickly to maintain the 30ms response times. > I'm replicating via messaging. PG replication is fine for smaller db's but I don't trust networks and PG upgrade intricacies complicate matters. -Greg
Re: [PERFORM] Millions of tables
Greg, * Greg Spiegelberg (gspiegelb...@gmail.com) wrote: > Bigger buckets mean a wider possibility of response times. Some buckets > may contain 140k records and some 100X more. Have you analyzed the depth of the btree indexes to see how many more pages need to be read to handle finding a row in 140k records vs. 14M records vs. 140M records? I suspect you'd find that the change in actual depth (meaning how many pages have to actually be read to find the row you're looking for) isn't very much and that your concern over the "wider possibility of response times" isn't well founded. Since you have a hard-set 30ms maximum for query response time, I would suggest you work out how long it takes to read a cold page from your I/O subsystem and then you can work through exactly how many page reads could be done in that 30ms (or perhaps 20ms, to allow for whatever overhead there will be in the rest of the system and as a buffer) and then work that back to how deep the index can be based on that many page reads and then how many records are required to create an index of that depth. Of course, the page from the heap will also need to be read and there's a bit of additional work to be done, but the disk i/o for cold pages is almost certainly where most time will be spent. I suspect you'll discover that millions of tables is a couple orders of magnitude off of how many you'd need to keep the number of page reads below the threshold you work out based on your I/O. Of course, you would need a consistent I/O subsystem, or at least one where you know the maximum possible latency to pull a cold page. Lastly, you'll want to figure out how to handle system crash/restart if this system requires a high uptime. I expect you'd want to have at least one replica and a setup which allows you to flip traffic to it very quickly to maintain the 30ms response times. Thanks! Stephen
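[Editor's note] Stephen's two estimates (btree depth barely grows with row count; a 30ms budget covers many cold reads) can be roughed out numerically. The fanout and per-page latency below are assumptions for illustration, not measurements of Greg's rig.

```python
# Btree depth grows with log_fanout(rows), so 140k vs. 140M rows changes
# depth very little, and most of the 30ms budget remains for other work.
import math

FANOUT = 256            # keys per 8KB internal btree page, assumed
COLD_READ_MS = 0.5      # random cold page read on provisioned SSD, assumed

def btree_depth(rows: int) -> int:
    return max(1, math.ceil(math.log(rows, FANOUT)))

for rows in (140_000, 14_000_000, 140_000_000):
    print(rows, btree_depth(rows))     # depths come out 3, 3, 4 — barely moves

# Using 20ms of the 30ms budget, per Stephen's suggested headroom:
print(int(20 / COLD_READ_MS))          # → 40 cold page reads fit the budget
```

Going from 140k to 140M rows adds roughly one page read per lookup under these assumptions, which is why the "wider possibility of response times" concern is likely overstated.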
Re: [PERFORM] Millions of tables
On Wed, Sep 28, 2016 at 9:39 AM, Vitalii Tymchyshyn wrote: > Have you considered having many databases (e.g. 100) and possibly many > postgresql servers (e.g. 10) started on different ports? > This would give you 1000x less tables per db. > The system design already allows for many database servers. 40 is okay, 100 isn't terrible but if it's thousands then operations might lynch me. -Greg
Re: [PERFORM] Millions of tables
If going that route, why not just use plproxy? On Wed, Sep 28, 2016 at 11:39 AM, Vitalii Tymchyshyn wrote: > Have you considered having many databases (e.g. 100) and possibly many > postgresql servers (e.g. 10) started on different ports? > This would give you 1000x less tables per db.
Re: [PERFORM] Millions of tables
Have you considered having many databases (e.g. 100) and possibly many postgresql servers (e.g. 10) started on different ports? This would give you 1000x less tables per db.
Re: [PERFORM] Millions of tables
On Tue, Sep 27, 2016 at 10:15 AM, Terry Schmitt wrote: > > > On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg > wrote: > >> Hey all, >> >> Obviously everyone who's been in PostgreSQL or almost any RDBMS for a >> time has said not to have millions of tables. I too have long believed it >> until recently. >> >> AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) >> for PGDATA. Over the weekend, I created 8M tables with 16M indexes on >> those tables. Table creation initially took 0.018031 secs, average >> 0.027467 and after tossing out outliers (qty 5) the maximum creation time >> found was 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 >> seconds. Tables were created by a single process. Do note that table >> creation is done via plpgsql function as there are other housekeeping tasks >> necessary though minimal. >> >> No system tuning but here is a list of PostgreSQL knobs and switches: >> shared_buffers = 2GB >> work_mem = 48 MB >> max_stack_depth = 4 MB >> synchronous_commit = off >> effective_cache_size = 200 GB >> pg_xlog is on its own file system >> >> There are still some obvious problems. General DBA functions such as >> VACUUM and ANALYZE should not be done. Each will run forever and cause >> much grief. Backups are problematic in the traditional pg_dump and PITR >> space. Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing >> it in my test case) are no-no's. A system or database crash could take >> potentially hours to days to recover. There are likely other issues ahead. >> >> You may wonder, "why is Greg attempting such a thing?" I looked at >> DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, >> it's antiquated and don't get me started on "Hadoop". I looked at many >> others and ultimately the recommended use of each vendor was to have one >> table for all data. That overcomes the millions of tables problem, right? 
>> >> Problem with the "one big table" solution is I anticipate 1,200 trillion >> records. Random access is expected and the customer expects <30ms reads >> for a single record fetch. >> >> No data is loaded... yet. Table and index creation only. I am interested >> in the opinions of all including tests I may perform. If you had this >> setup, what would you capture / analyze? I have a job running preparing >> data. I did this on a much smaller scale (50k tables) and data load via >> function allowed close to 6,000 records/second. The schema has been >> simplified since and the last test reached just over 20,000 records/second with >> 300k tables. >> >> I'm not looking for alternatives yet but input to my test. Takers? >> >> I can't promise immediate feedback but will do my best to respond with >> results. >> >> TIA, >> -Greg >> > > I have not seen any mention of transaction ID wraparound in this > thread yet. With the numbers that you are looking at, I could see this as a > major issue. > > T > Thank you Terry. You get the gold star. :) I was waiting for that to come up. Success means handling this condition. A whole database vacuum and dump-restore is out of the question. Can a properly tuned autovacuum prevent the situation? -Greg
Re: [PERFORM] Millions of tables
On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg wrote: > Hey all, > > Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time > has said not to have millions of tables. I too have long believed it until > recently. > > AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for > PGDATA. Over the weekend, I created 8M tables with 16M indexes on those > tables. Table creation initially took 0.018031 secs, average 0.027467 and > after tossing out outliers (qty 5) the maximum creation time found was > 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 seconds. > Tables were created by a single process. Do note that table creation is > done via plpgsql function as there are other housekeeping tasks necessary > though minimal. > > No system tuning but here is a list of PostgreSQL knobs and switches: > shared_buffers = 2GB > work_mem = 48 MB > max_stack_depth = 4 MB > synchronous_commit = off > effective_cache_size = 200 GB > pg_xlog is on its own file system > > There are still some obvious problems. General DBA functions such as > VACUUM and ANALYZE should not be done. Each will run forever and cause > much grief. Backups are problematic in the traditional pg_dump and PITR > space. Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing > it in my test case) are no-no's. A system or database crash could take > potentially hours to days to recover. There are likely other issues ahead. > > You may wonder, "why is Greg attempting such a thing?" I looked at > DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, > it's antiquated and don't get me started on "Hadoop". I looked at many > others and ultimately the recommended use of each vendor was to have one > table for all data. That overcomes the millions of tables problem, right? > > Problem with the "one big table" solution is I anticipate 1,200 trillion > records. 
Random access is expected and the customer expects <30ms reads > for a single record fetch. > > No data is loaded... yet. Table and index creation only. I am interested > in the opinions of all including tests I may perform. If you had this > setup, what would you capture / analyze? I have a job running preparing > data. I did this on a much smaller scale (50k tables) and data load via > function allowed close to 6,000 records/second. The schema has been > simplified since and the last test reached just over 20,000 records/second with > 300k tables. > > I'm not looking for alternatives yet but input to my test. Takers? > > I can't promise immediate feedback but will do my best to respond with > results. > > TIA, > -Greg > I have not seen any mention of transaction ID wraparound in this thread yet. With the numbers that you are looking at, I could see this as a major issue. T
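[Editor's note] As a quick arithmetic check on the creation numbers quoted above (8M tables in 30h 31m 8.435049s from a single process), dividing the figures Greg gives:

```python
# Implied sustained table-creation rate from the totals in the post.
total_seconds = 30 * 3600 + 31 * 60 + 8.435049
tables = 8_000_000

print(round(tables / total_seconds, 1))          # → 72.8 tables/second
print(round(total_seconds / tables * 1000, 1))   # → 13.7 ms per table

# Note the implied 13.7 ms/table is about half the 0.027467s average quoted,
# so that average presumably reflects per-call timing of the plpgsql function
# (including its housekeeping) rather than total wall-clock divided by count.
```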
Re: [PERFORM] Millions of tables
On Tue, Sep 27, 2016 at 9:42 AM, Mike Sofen wrote: > *From:* Mike Sofen *Sent:* Tuesday, September 27, 2016 8:10 AM > > *From:* Greg Spiegelberg *Sent:* Monday, September 26, 2016 7:25 AM > I've gotten more responses than anticipated and have answered some > questions and gotten some insight but my challenge again is what should I > capture along the way to prove or disprove this storage pattern? > Alternatives to the storage pattern aside, I need ideas to test rig, > capture metrics and suggestions to tune it. > > > > In the next 24 hours, I will be sending ~1 trillion records to the test > database. Because of time to set up, I'd rather have things set up > properly the first go. > > > > Thanks! > > -Greg > > - > > Greg, I ran another quick test on a wider table than you’ve described, but > this time with 80 million rows, with core counts, ram and ssd storage > similar to what you’d have on that AWS EC2 instance. This table had 7 > columns (3 integers, 3 text, 1 timestamptz) with an average width of 157 > chars, one btree index on the pk int column. Using explain analyze, I > picked one id value out of the 80m and ran a select * where id = x. It did > an index scan, had a planning time of 0.077ms, and an execution time of > 0.254 seconds. I ran the query for a variety of widely spaced values (so > the data was uncached) and the timing never changed. This has been > mirroring my general experience with PG – very fast reads on indexed > queries. > > > > Summary: I think your buckets can be WAY bigger than you are envisioning > for the simple table design you’ve described. I’m betting you can easily > do 500 million rows per bucket before approaching anything close to the > 30ms max query time. > > > > Mike Sofen (Synthetic Genomics) > > > > Totally typo’d the execution time: it was 0.254 MILLISECONDS, not > SECONDS. Thus my comment about going up 10x in bucket size instead of > appearing to be right at the limit. Sorry! > > > I figured. 
:) Haven't ruled it out but the expectation of this implementation is to perform at worst 3X slower than memcache or Redis. Bigger buckets mean a wider possibility of response times. Some buckets may contain 140k records and some 100X more. -Greg
Re: [PERFORM] Millions of tables
On Tue, Sep 27, 2016 at 8:30 AM, Craig James wrote: > On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg > wrote: > >> Hey all, >> >> Obviously everyone who's been in PostgreSQL or almost any RDBMS for a >> time has said not to have millions of tables. I too have long believed it >> until recently. >> >> AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) >> for PGDATA. Over the weekend, I created 8M tables with 16M indexes on >> those tables. Table creation initially took 0.018031 secs, average >> 0.027467 and after tossing out outliers (qty 5) the maximum creation time >> found was 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 >> seconds. Tables were created by a single process. Do note that table >> creation is done via plpgsql function as there are other housekeeping tasks >> necessary though minimal. >> >> No system tuning but here is a list of PostgreSQL knobs and switches: >> shared_buffers = 2GB >> work_mem = 48 MB >> max_stack_depth = 4 MB >> synchronous_commit = off >> effective_cache_size = 200 GB >> pg_xlog is on its own file system >> >> There are still some obvious problems. General DBA functions such as >> VACUUM and ANALYZE should not be done. Each will run forever and cause >> much grief. Backups are problematic in the traditional pg_dump and PITR >> space. Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing >> it in my test case) are no-no's. A system or database crash could take >> potentially hours to days to recover. There are likely other issues ahead. >> >> You may wonder, "why is Greg attempting such a thing?" I looked at >> DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, >> it's antiquated and don't get me started on "Hadoop". I looked at many >> others and ultimately the recommended use of each vendor was to have one >> table for all data. That overcomes the millions of tables problem, right? 
>> >> Problem with the "one big table" solution is I anticipate 1,200 trillion >> records. Random access is expected and the customer expects <30ms reads >> for a single record fetch. >> > > You don't give enough details to fully explain the problem you're trying > to solve. > >- Will records ever be updated or deleted? If so, what percentage and >at what frequency? >- What specifically are you storing (e.g. list of integers, strings, >people's sex habits, ...)? Or more importantly, are these fixed- or >variable-sized records? >- Once the 1,200 trillion records are loaded, is that it? Or do more >data arrive, and if so, at what rate? >- Do your queries change, or is there a fixed set of queries? >- How complex are the joins? > > Excellent questions. 1a. Half of the 4M tables will contain ~140k records and UPDATE's will occur on roughly 100 records/day/table. No DELETE's on this first half. 1b. Second half of the 4M tables will contain ~200k records. Zero UPDATE's; however, DELETE's will occur on ~100 records/day/table. 2. All 4M tables contain 7 columns: (4) bigints, (2) timestamptz and (1) boolean. 2M of the tables will have a PKEY on (1) bigint column only. The second 2M tables have a PKEY on (bigint, timestamptz) and two additional indexes on different (bigint, timestamptz) column pairs. 3. The trillions-of-records load is just to push the system to find the maximum record load capability. Reality, 200M records / day or ~2,300/second average is the expectation once in production. 4. Queries are fixed and match the indexes laid down on the tables. Goal is <30ms/query. I have attempted queries with and without indexes. Without indexes the average query response varied between 20ms and 40ms whereas indexes respond within a much tighter range of 5ms to 9ms. Both query performance tests were done during data-ingest. 5. Zero JOIN's and I won't let it ever happen. However the 4M tables INHERIT a data grouping table. Test rig limits child tables to 1,000/parent. 
This was done to explore some other possible access patterns but they are secondary and if it doesn't work then either a) the requirement will be dropped or b) I may look at storing the data in the 1,000 child tables directly in the parent table and I'll need to re-run load & read tests. > The reason I ask these specific questions is because, as others have > pointed out, this might be a perfect case for a custom (non-relational) > database. Relational databases are general-purpose tools, sort of like a > Swiss-Army knife. A Swiss-Army knife does most things passably, but if you > want to carve wood, or butcher meat, or slice vegetables, you get a knife > meant for that specific task. > > I'm at a sizing phase. If 4M tables works I'll attempt 16M tables. If it points to only 2M or 1M then that's fine. The 4M table database is only a single cog in the storage service design. Anticipating ~40 of these databases, but it is dependent upon how many tables work.
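The bucket-placement question raised upthread can be made concrete. A minimal sketch, assuming hash-style routing of a bigint key to one of the 2M tables per group; the naming scheme and modulo rule here are illustrative assumptions, not Greg's actual criteria:

```python
# Hypothetical routing sketch: map a record's bigint key to one of the
# 2M bucket tables in a group. The table-name format and the modulo
# rule are assumptions for illustration only.
N_TABLES_PER_GROUP = 2_000_000

def bucket_table(key: int, group: str) -> str:
    """Return a deterministic bucket-table name for a bigint key."""
    idx = key % N_TABLES_PER_GROUP
    return f"{group}_bucket_{idx:07d}"

# The same key always lands in the same table, so a single-record fetch
# touches exactly one small table and one small index.
```

Whatever the real criteria turn out to be, the indexing question then reduces to: what does a single bucket table need to answer its fixed queries?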
Re: [PERFORM] Millions of tables
From: Mike Sofen Sent: Tuesday, September 27, 2016 8:10 AM From: Greg Spiegelberg Sent: Monday, September 26, 2016 7:25 AM I've gotten more responses than anticipated and have answered some questions and gotten some insight but my challenge again is what should I capture along the way to prove or disprove this storage pattern? Alternatives to the storage pattern aside, I need ideas to test rig, capture metrics and suggestions to tune it. In the next 24 hours, I will be sending ~1 trillion records to the test database. Because of time to set up, I'd rather have things set up properly the first go. Thanks! -Greg - Greg, I ran another quick test on a wider table than you’ve described, but this time with 80 million rows, with core counts, ram and ssd storage similar to what you’d have on that AWS EC2 instance. This table had 7 columns (3 integers, 3 text, 1 timestamptz) with an average width of 157 chars, one btree index on the pk int column. Using explain analyze, I picked one id value out of the 80m and ran a select * where id = x. It did an index scan, had a planning time of 0.077ms, and an execution time of 0.254 seconds. I ran the query for a variety of widely spaced values (so the data was uncached) and the timing never changed. This has been mirroring my general experience with PG – very fast reads on indexed queries. Summary: I think your buckets can be WAY bigger than you are envisioning for the simple table design you’ve described. I’m betting you can easily do 500 million rows per bucket before approaching anything close to the 30ms max query time. Mike Sofen (Synthetic Genomics) Totally typo’d the execution time: it was 0.254 MILLISECONDS, not SECONDS. Thus my comment about going up 10x in bucket size instead of appearing to be right at the limit. Sorry! Mike
Re: [PERFORM] Millions of tables
From: Greg Spiegelberg Sent: Monday, September 26, 2016 7:25 AM I've gotten more responses than anticipated and have answered some questions and gotten some insight but my challenge again is what should I capture along the way to prove or disprove this storage pattern? Alternatives to the storage pattern aside, I need ideas to test rig, capture metrics and suggestions to tune it. In the next 24 hours, I will be sending ~1 trillion records to the test database. Because of time to set up, I'd rather have things set up properly the first go. Thanks! -Greg - Greg, I ran another quick test on a wider table than you’ve described, but this time with 80 million rows, with core counts, ram and ssd storage similar to what you’d have on that AWS EC2 instance. This table had 7 columns (3 integers, 3 text, 1 timestamptz) with an average width of 157 chars, one btree index on the pk int column. Using explain analyze, I picked one id value out of the 80m and ran a select * where id = x. It did an index scan, had a planning time of 0.077ms, and an execution time of 0.254 seconds. I ran the query for a variety of widely spaced values (so the data was uncached) and the timing never changed. This has been mirroring my general experience with PG – very fast reads on indexed queries. Summary: I think your buckets can be WAY bigger than you are envisioning for the simple table design you’ve described. I’m betting you can easily do 500 million rows per bucket before approaching anything close to the 30ms max query time. Mike Sofen (Synthetic Genomics)
Re: [PERFORM] Millions of tables
On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelbergwrote: > Hey all, > > Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time > has said not to have millions of tables. I too have long believed it until > recently. > > AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for > PGDATA. Over the weekend, I created 8M tables with 16M indexes on those > tables. Table creation initially took 0.018031 secs, average 0.027467 and > after tossing out outliers (qty 5) the maximum creation time found was > 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 seconds. > Tables were created by a single process. Do note that table creation is > done via plpgsql function as there are other housekeeping tasks necessary > though minimal. > > No system tuning but here is a list of PostgreSQL knobs and switches: > shared_buffers = 2GB > work_mem = 48 MB > max_stack_depth = 4 MB > synchronous_commit = off > effective_cache_size = 200 GB > pg_xlog is on it's own file system > > There are some still obvious problems. General DBA functions such as > VACUUM and ANALYZE should not be done. Each will run forever and cause > much grief. Backups are problematic in the traditional pg_dump and PITR > space. Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing > it in my test case) are no-no's. A system or database crash could take > potentially hours to days to recover. There are likely other issues ahead. > > You may wonder, "why is Greg attempting such a thing?" I looked at > DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, > it's antiquated and don't get me started on "Hadoop". I looked at many > others and ultimately the recommended use of each vendor was to have one > table for all data. That overcomes the millions of tables problem, right? > > Problem with the "one big table" solution is I anticipate 1,200 trillion > records. 
Random access is expected and the customer expects <30ms reads > for a single record fetch. > You don't give enough details to fully explain the problem you're trying to solve. - Will records ever be updated or deleted? If so, what percentage and at what frequency? - What specifically are you storing (e.g. list of integers, strings, people's sex habits, ...)? Or more importantly, are these fixed- or variable-sized records? - Once the 1,200 trillion records are loaded, is that it? Or do more data arrive, and if so, at what rate? - Do your queries change, or is there a fixed set of queries? - How complex are the joins? The reason I ask these specific questions is because, as others have pointed out, this might be a perfect case for a custom (non-relational) database. Relational databases are general-purpose tools, sort of like a Swiss-Army knife. A Swiss-Army knife does most things passably, but if you want to carve wood, or butcher meat, or slice vegetables, you get a knife meant for that specific task. I've written several custom database-storage systems for very specific high-performance systems. It's generally a couple weeks of work, and you have a tailored performance and storage that's hard for a general-purpose relational system to match. The difficulty of building such a system depends a lot on the answers to the questions above. Craig > No data is loaded... yet Table and index creation only. I am interested > in the opinions of all including tests I may perform. If you had this > setup, what would you capture / analyze? I have a job running preparing > data. I did this on a much smaller scale (50k tables) and data load via > function allowed close to 6,000 records/second. The schema has been > simplified since and last test reach just over 20,000 records/second with > 300k tables. > > I'm not looking for alternatives yet but input to my test. Takers? > > I can't promise immediate feedback but will do my best to respond with > results. 
> > TIA, > -Greg > -- - Craig A. James Chief Technology Officer eMolecules, Inc. -
Re: [PERFORM] Millions of tables
Jeff Janes writes: > A problem is that those statistics are stored in one file (per database; it > used to be one file per cluster). With 8 million tables, that is going to > be a pretty big file. But the code pretty much assumes the file is going > to be pretty small, and so it has no compunction about commanding that it > be read and written, in its entirety, quite often. I don't know that anyone ever believed it would be small. But at the time the pgstats code was written, there was no good alternative to passing the data through files. (And I'm not sure we envisioned applications that would be demanding fresh data constantly, anyway.) Now that the DSM stuff exists and has been more or less shaken out, I wonder how practical it'd be to use a DSM segment to make the stats collector's data available to backends. You'd need a workaround for the fact that not all the DSM implementations support resize (although given the lack of callers of dsm_resize, one could be forgiven for wondering whether any of that code has been tested at all). But you could imagine abandoning one DSM segment and creating a new one of double the size anytime the hash tables got too big. regards, tom lane
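The abandon-and-recreate growth pattern Tom describes can be sketched in miniature. This is a pure-Python stand-in for illustration only; real DSM segments live in shared memory and are managed by PostgreSQL's dsm_create machinery:

```python
# Sketch of "abandon one DSM segment and create a new one of double the
# size": when a fixed-size segment fills, allocate a double-sized
# replacement, copy the live data across, and drop the old one.
class GrowableSegment:
    def __init__(self, size=1024):
        self.buf = bytearray(size)
        self.used = 0

    def append(self, data: bytes):
        while self.used + len(data) > len(self.buf):
            new = bytearray(len(self.buf) * 2)   # create one of double the size
            new[:self.used] = self.buf[:self.used]
            self.buf = new                        # old segment abandoned
        self.buf[self.used:self.used + len(data)] = data
        self.used += len(data)

seg = GrowableSegment(size=8)
seg.append(b"x" * 20)   # forces two doublings: 8 -> 16 -> 32
```

Doubling keeps the amortized copy cost linear, which is why it is the usual answer when an allocator cannot resize in place.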
Re: [PERFORM] Millions of tables
On Mon, Sep 26, 2016 at 5:53 AM, Greg Spiegelberg wrote: > >> >> I may need to understand autovacuum better. My impression was it > consulted statistics and performed vacuums one table at a time based on the > vacuum threshold formula on https://www.postgresql.org/docs/9.5/static/routine-vacuuming.html. > A problem is that those statistics are stored in one file (per database; it used to be one file per cluster). With 8 million tables, that is going to be a pretty big file. But the code pretty much assumes the file is going to be pretty small, and so it has no compunction about commanding that it be read and written, in its entirety, quite often. Cheers, Jeff
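Jeff's concern is easy to put in rough numbers. A back-of-envelope sketch, where the per-table entry size is an assumed figure for illustration only (the real pgstat entry layout differs):

```python
# Rough stats-file sizing. BYTES_PER_TABLE_ENTRY is an assumption,
# not the actual pgstat entry size.
BYTES_PER_TABLE_ENTRY = 150

def stats_file_mb(n_tables: int) -> float:
    """Estimated per-database stats file size in MiB."""
    return n_tables * BYTES_PER_TABLE_ENTRY / 1024 / 1024

# At 8M tables, even a modest entry size means a file around a
# gigabyte that gets read and written "in its entirety, quite often".
```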
Re: [PERFORM] Millions of tables
On Sun, Sep 25, 2016 at 8:50 PM, Greg Spiegelbergwrote: > Hey all, > > Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time > has said not to have millions of tables. I too have long believed it until > recently. > > AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for > PGDATA. Over the weekend, I created 8M tables with 16M indexes on those > tables. Table creation initially took 0.018031 secs, average 0.027467 and > after tossing out outliers (qty 5) the maximum creation time found was > 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 seconds. > Tables were created by a single process. Do note that table creation is > done via plpgsql function as there are other housekeeping tasks necessary > though minimal. > > No system tuning but here is a list of PostgreSQL knobs and switches: > shared_buffers = 2GB > work_mem = 48 MB > max_stack_depth = 4 MB > synchronous_commit = off > effective_cache_size = 200 GB > pg_xlog is on it's own file system > > There are some still obvious problems. General DBA functions such as > VACUUM and ANALYZE should not be done. Each will run forever and cause > much grief. Backups are problematic in the traditional pg_dump and PITR > space. Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing > it in my test case) are no-no's. A system or database crash could take > potentially hours to days to recover. There are likely other issues ahead. > > You may wonder, "why is Greg attempting such a thing?" I looked at > DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, > it's antiquated and don't get me started on "Hadoop". I looked at many > others and ultimately the recommended use of each vendor was to have one > table for all data. That overcomes the millions of tables problem, right? > > Problem with the "one big table" solution is I anticipate 1,200 trillion > records. 
Random access is expected and the customer expects <30ms reads > for a single record fetch. > > No data is loaded... yet Table and index creation only. I am interested > in the opinions of all including tests I may perform. If you had this > setup, what would you capture / analyze? I have a job running preparing > data. I did this on a much smaller scale (50k tables) and data load via > function allowed close to 6,000 records/second. The schema has been > simplified since and last test reach just over 20,000 records/second with > 300k tables. > > I'm not looking for alternatives yet but input to my test. Takers? > > I can't promise immediate feedback but will do my best to respond with > results. > > TIA, > -Greg > I've gotten more responses than anticipated and have answered some questions and gotten some insight but my challenge again is what should I capture along the way to prove or disprove this storage pattern? Alternatives to the storage pattern aside, I need ideas to test rig, capture metrics and suggestions to tune it. In the next 24 hours, I will be sending ~1 trillion records to the test database. Because of time to set up, I'd rather have things set up properly the first go. Thanks! -Greg
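On what to capture during the trillion-record load: one cheap, useful metric is per-operation latency, summarized the same way as the table-creation figures quoted above (average, plus maximum after discarding a handful of outliers). A minimal sketch:

```python
# Summarize captured per-operation latencies: overall average, and the
# maximum after tossing out the top few outliers (qty 5 above).
def summarize(durations, drop_outliers=5):
    ordered = sorted(durations)
    kept = ordered[:-drop_outliers] if drop_outliers else ordered
    return {
        "avg": sum(ordered) / len(ordered),
        "max_after_outliers": kept[-1],
    }

stats = summarize([0.02, 0.03, 0.02, 0.7, 0.9, 1.1, 1.2, 5.0, 0.66, 0.025, 0.028])
```

Capturing raw durations rather than running aggregates also lets the percentile cut-offs be chosen after the fact.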
Re: [PERFORM] Millions of tables
On 26 September 2016 at 20:51, Greg Spiegelberg wrote: > > An alternative if you exhaust or don't trust other options, use a foreign >> data wrapper to access your own custom storage. A single table at the PG >> level, you can shard the data yourself into 8 bazillion separate stores, in >> whatever structure suits your read and write operations (maybe reusing an >> embedded db engine, ordered flat file+log+index, whatever). >> >> > However even 8 bazillion FDW's may cause an "overflow" of relationships at > the loss of having an efficient storage engine acting more like a traffic > cop. In such a case, I would opt to put such logic in the app to directly > access the true storage over using FDW's. > I mean one fdw table, which shards internally to 8 bazillion stores on disk. It has the sharding key, can calculate exactly which store(s) need to be hit, and returns the rows, and to PostgreSQL it looks like 1 big table with 1.3 trillion rows. And if it doesn't do that in 30ms you get to blame yourself :) -- Stuart Bishop http://www.stuartbishop.net/
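The routing Stuart describes inside such an FDW can be sketched as a pure function of the sharding key. The store count, directory fanout, and path layout below are illustrative assumptions:

```python
# Hypothetical FDW-internal routing: from the sharding key, compute
# exactly which on-disk store holds the row. One PG-level table,
# millions of stores underneath.
N_STORES = 8_000_000

def store_path(shard_key: int, base="/data/stores") -> str:
    store = shard_key % N_STORES
    # two-level directory fanout so no directory holds millions of files
    return f"{base}/{store // 10_000:04d}/{store % 10_000:04d}.store"
```

Because the store is computed, not searched for, the lookup cost is constant regardless of how many stores exist, which is the whole appeal of the scheme.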
Re: [PERFORM] Millions of tables
On Mon, Sep 26, 2016 at 4:23 AM, Rick Ottenwrote: > Are the tables constantly being written to, or is this a mostly read > scenario? One architecture possibility, if the writes are not so > frequent, is to create just a handful of very big tables for writing, and > then make smaller tables as materialized views for reading. The vacuum and > bloat management could be done back a the big tables. The materialized > views could be refreshed or replaced during non-peak hours. The > materialized views could be on a different tablespace than the root > tables. They could also be structured to reflect real-world query patterns > which are sometimes different than the raw data storage engineering problem. > > Like any data warehouse, I expect 90%+ of the activity being writes but the necessity of low latency reads is an absolute must else the design doesn't get off the ground. Materialized views are neat for many cases but not this one. Current versions of the data must be available the moment after they are written. I am considering tablespaces however I have no way to properly size. One group of tables may contain 1M records and another 100M. Could be on the same file system but I'd have the same problems internal to PostgreSQL and the only thing overcome is millions of files in a single directory which is a file system selection problem. > With some logging you may be able to see that the data is not truly > randomly accessed, but rather clustered around just some of the millions of > tables. Then the engineering problem becomes "How do I service 90% of the > queries on these tables in 30ms ?" Rather than "How do I service 100% of > the queries 100% of the time in 30ms?" Knowing 90% of the queries hit just > a few hundred tables, makes the first question easier to answer. 
> > Similarly, if most of the columns are static and only a few columns are > actually changing, you could consider pulling the static stuff out of the > same table with the dynamic stuff and then look at joins in your queries. > The end goal is to be able to get solid indexes and tables that don't > change a lot so they can be tightly packed and cached. (less bloat, less > fragmentation, fewer disk accesses). > > With regards to consistent query performance, I think you need to get out > of AWS. That environment is terrible if you are going for consistency > unless you buy dedicated hardware, and then you are paying so much money it > is ridiculous. > > True about AWS and though it is possible hardware may be purchased AWS is the right place to start. 1) AWS is not IT and won't take months to approve budget for gear+deployment and more importantly 2) still in design phase and if deployed there is no way to predict true adoption meaning it'll start small. AWS is the right place for now. > Also I think having 10M rows in a table is not a problem for the query > times you are referring to. So instead of millions of tables, unless I'm > doing my math wrong, you probably only need thousands of tables. > > > > On Mon, Sep 26, 2016 at 5:43 AM, Stuart Bishop > wrote: > >> On 26 September 2016 at 11:19, Greg Spiegelberg >> wrote: >> >>> I did look at PostgresXL and CitusDB. Both are admirable however >>> neither could support the need to read a random record consistently under >>> 30ms. It's a similar problem Cassandra and others have: network latency. >>> At this scale, to provide the ability to access any given record amongst >>> trillions it is imperative to know precisely where it is stored (system & >>> database) and read a relatively small index. I have other requirements >>> that prohibit use of any technology that is eventually consistent. >>> >>> I liken the problem to fishing. 
To find a particular fish of length, >>> size, color in a data lake you must accept the possibility of scanning >>> the entire lake. However, if all fish were in barrels where each barrel >>> had a particular kind of fish of specific length, size, color then the >>> problem is far simpler. >>> >>> -Greg >>> >> >> My gut tells me that if you do solve the problem and get PostgreSQL (or >> anything) reading consistently at under 30ms with that many tables you will >> have solved one problem by creating another. >> >> You discounted Cassandra due to network latency, but are now trying a >> monolithic PostgreSQL setup. It might be worth trying a single node >> ScyllaDB or Cassandra deploy (no need for QUORUM or network overhead), >> perhaps using layered compaction so all your data gets broken out into >> 160MB chunks. And certainly wander over to the ScyllaDB mailing list, as >> they are very focused on performance problems like yours and should offer >> some insight even if a Cassandra style architecture cannot meet your >> requirements. >> >> An alternative if you exhaust or don't trust other options, use a foreign >> data wrapper to access your own custom
Re: [PERFORM] Millions of tables
On Mon, Sep 26, 2016 at 7:05 AM, Mike Sofenwrote: > *From:* Rick Otten *Sent:* Monday, September 26, 2016 3:24 AM > Are the tables constantly being written to, or is this a mostly read > scenario? > > > > With regards to consistent query performance, I think you need to get out > of AWS. That environment is terrible if you are going for consistency > unless you buy dedicated hardware, and then you are paying so much money it > is ridiculous. > > > > Also I think having 10M rows in a table is not a problem for the query > times you are referring to. So instead of millions of tables, unless I'm > doing my math wrong, you probably only need thousands of tables. > > -- > > Excellent thoughts: the read/write behavior will/should drive a lot of > the design; AWS does not guarantee consistency or latency; and 10m rows > is nothing to PG. > > > > Re AWS: we’re on it, at least for now. In my profiling of our > performance there, I consistently get low latencies…I just know that there > will be random higher latencies, but the statistical average will be low. > I just ran a quick test against a modest sized table on a modest sized EC2 > instance (m4.xlarge – 4 core/16gb ram, 3 tb ssd): the table has 15m rows > but is huge (it represents nearly 500m rows compressed in jsonb documents), > with 5 indexed key columns and a total of 12 columns. I queried for a > single, non-PK, indexed value using “select *” (so it included the json) > and it took 22ms, without the json it took 11ms. Especially with the > db/memory-optimized EC2 instances now available (with guaranteed IOPS), > performance against even 100m row tables should still stay within your > requirements. > > > > So Rick’s point about not needing millions of tables is right on. If > there’s a way to create table “clumps”, at least you’ll have a more modest > table count. > > > Absolutely! 
The 8M tables do "belong" to a larger group, and reducing the 8M tables to ~4000 is an option; however, the problem then becomes going from an anticipated 140k records/table to 140M-500M records/table. I'm concerned read access times will go out the window. It is on the docket to test. -Greg
Re: [PERFORM] Millions of tables
Consider the problem though. Random access to trillions of records with no guarantee any one will be fetched twice in a short time frame nullifies the effectiveness of a cache unless the cache is enormous. If such a cache were that big, 100's of TB's, I wouldn't be looking at on-disk storage options. :) -Greg On Mon, Sep 26, 2016 at 6:54 AM, Yves Dorfsman wrote: > Something that is not talked about at all in this thread is caching. A > bunch > of memcache servers in front of the DB should be able to help with the 30ms > constraint (doesn't have to be memcache, some caching technology). > > -- > http://yves.zioup.com > gpg: 4096R/32B0F416
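Greg's point can be quantified: under uniformly random access, the expected cache hit rate is roughly cache size divided by working-set size. The record size below is an assumed figure for illustration:

```python
# Why a cache doesn't help here: expected hit rate under uniformly
# random access is ~ cache_size / working_set_size.
RECORD_BYTES = 60                 # assumed: 4 bigints, 2 timestamptz, 1 bool + overhead
records = 1_200_000_000_000_000   # 1,200 trillion anticipated records
cache_bytes = 512 * 1024**3       # even a generous 512 GiB cache

hit_rate = cache_bytes / (records * RECORD_BYTES)
# on the order of 1e-5 -- effectively every read still goes to storage
```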
Re: [PERFORM] Millions of tables
On Mon, Sep 26, 2016 at 3:43 AM, Stuart Bishopwrote: > On 26 September 2016 at 11:19, Greg Spiegelberg > wrote: > >> I did look at PostgresXL and CitusDB. Both are admirable however neither >> could support the need to read a random record consistently under 30ms. >> It's a similar problem Cassandra and others have: network latency. At this >> scale, to provide the ability to access any given record amongst trillions >> it is imperative to know precisely where it is stored (system & database) >> and read a relatively small index. I have other requirements that prohibit >> use of any technology that is eventually consistent. >> >> I liken the problem to fishing. To find a particular fish of length, >> size, color in a data lake you must accept the possibility of scanning >> the entire lake. However, if all fish were in barrels where each barrel >> had a particular kind of fish of specific length, size, color then the >> problem is far simpler. >> >> -Greg >> > > My gut tells me that if you do solve the problem and get PostgreSQL (or > anything) reading consistently at under 30ms with that many tables you will > have solved one problem by creating another. > > Exactly why I am exploring. What are the trade offs? > You discounted Cassandra due to network latency, but are now trying a > monolithic PostgreSQL setup. It might be worth trying a single node > ScyllaDB or Cassandra deploy (no need for QUORUM or network overhead), > perhaps using layered compaction so all your data gets broken out into > 160MB chunks. And certainly wander over to the ScyllaDB mailing list, as > they are very focused on performance problems like yours and should offer > some insight even if a Cassandra style architecture cannot meet your > requirements. > > Cassandra performance, according to the experts I consulted, starts to fall off once the stored dataset exceeds ~3 TB. Much too small for my use case. 
Again, I do have other reasons for not using Cassandra and others namely deduplication of information referenced by my millions of tables. There are no guarantees in many outside of the RDBMS realm. > An alternative if you exhaust or don't trust other options, use a foreign > data wrapper to access your own custom storage. A single table at the PG > level, you can shard the data yourself into 8 bazillion separate stores, in > whatever structure suites your read and write operations (maybe reusing an > embedded db engine, ordered flat file+log+index, whatever). > > However even 8 bazillion FDW's may cause an "overflow" of relationships at the loss of having an efficient storage engine acting more like a traffic cop. In such a case, I would opt to put such logic in the app to directly access the true storage over using FDW's. -Greg
Re: [PERFORM] Millions of tables
From: Rick Otten Sent: Monday, September 26, 2016 3:24 AM Are the tables constantly being written to, or is this a mostly read scenario? With regards to consistent query performance, I think you need to get out of AWS. That environment is terrible if you are going for consistency unless you buy dedicated hardware, and then you are paying so much money it is ridiculous. Also I think having 10M rows in a table is not a problem for the query times you are referring to. So instead of millions of tables, unless I'm doing my math wrong, you probably only need thousands of tables. -- Excellent thoughts: the read/write behavior will/should drive a lot of the design; AWS does not guarantee consistency or latency; and 10m rows is nothing to PG. Re AWS: we’re on it, at least for now. In my profiling of our performance there, I consistently get low latencies…I just know that there will be random higher latencies, but the statistical average will be low. I just ran a quick test against a modest sized table on a modest sized EC2 instance (m4.xlarge – 4 core/16gb ram, 3 tb ssd): the table has 15m rows but is huge (it represents nearly 500m rows compressed in jsonb documents), with 5 indexed key columns and a total of 12 columns. I queried for a single, non-PK, indexed value using “select *” (so it included the json) and it took 22ms, without the json it took 11ms. Especially with the db/memory-optimized EC2 instances now available (with guaranteed IOPS), performance against even 100m row tables should still stay within your requirements. So Rick’s point about not needing millions of tables is right on. If there’s a way to create table “clumps”, at least you’ll have a more modest table count. Mike Sofen (Synthetic Genomics)
Re: [PERFORM] Millions of tables
Something that is not talked about at all in this thread is caching. A bunch of memcache servers in front of the DB should be able to help with the 30ms constraint (doesn't have to be memcache, some caching technology). -- http://yves.zioup.com gpg: 4096R/32B0F416
Re: [PERFORM] Millions of tables
Following list etiquette response inline ;) On Mon, Sep 26, 2016 at 2:28 AM, Álvaro Hernández Tortosawrote: > > > On 26/09/16 05:50, Greg Spiegelberg wrote: > >> Hey all, >> >> Obviously everyone who's been in PostgreSQL or almost any RDBMS for a >> time has said not to have millions of tables. I too have long believed it >> until recently. >> >> AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) >> for PGDATA. Over the weekend, I created 8M tables with 16M indexes on >> those tables. Table creation initially took 0.018031 secs, average >> 0.027467 and after tossing out outliers (qty 5) the maximum creation time >> found was 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 >> seconds. Tables were created by a single process. Do note that table >> creation is done via plpgsql function as there are other housekeeping tasks >> necessary though minimal. >> >> No system tuning but here is a list of PostgreSQL knobs and switches: >> shared_buffers = 2GB >> work_mem = 48 MB >> max_stack_depth = 4 MB >> synchronous_commit = off >> effective_cache_size = 200 GB >> pg_xlog is on it's own file system >> >> There are some still obvious problems. General DBA functions such as >> VACUUM and ANALYZE should not be done. Each will run forever and cause >> much grief. Backups are problematic in the traditional pg_dump and PITR >> space. Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing >> it in my test case) are no-no's. A system or database crash could take >> potentially hours to days to recover. There are likely other issues ahead. >> >> You may wonder, "why is Greg attempting such a thing?" I looked at >> DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, >> it's antiquated and don't get me started on "Hadoop". I looked at many >> others and ultimately the recommended use of each vendor was to have one >> table for all data. That overcomes the millions of tables problem, right? 
>> >> Problem with the "one big table" solution is I anticipate 1,200 trillion >> records. Random access is expected and the customer expects <30ms reads >> for a single record fetch. >> >> No data is loaded... yet Table and index creation only. I am interested >> in the opinions of all including tests I may perform. If you had this >> setup, what would you capture / analyze? I have a job running preparing >> data. I did this on a much smaller scale (50k tables) and data load via >> function allowed close to 6,000 records/second. The schema has been >> simplified since and last test reach just over 20,000 records/second with >> 300k tables. >> >> I'm not looking for alternatives yet but input to my test. Takers? >> >> I can't promise immediate feedback but will do my best to respond with >> results. >> >> TIA, >> -Greg >> > > Hi Greg. > > This is a problem (creating a large number of tables; really large > indeed) that we researched in my company a while ago. You might want to > read about it: https://www.pgcon.org/2013/schedule/events/595.en.html > > updatedb, funny. Thank you for the pointer. I had no intention of going to 1B tables. I may need to understand autovacuum better. My impression was it consulted statistics and performed vacuums one table at a time based on the vacuum threshold formula on https://www.postgresql.org/docs/9.5/static/routine-vacuuming.html. -Greg
Re: [PERFORM] Millions of tables
Are the tables constantly being written to, or is this a mostly read scenario? One architecture possibility, if the writes are not so frequent, is to create just a handful of very big tables for writing, and then make smaller tables as materialized views for reading. The vacuum and bloat management could be done back at the big tables. The materialized views could be refreshed or replaced during non-peak hours. The materialized views could be on a different tablespace than the root tables. They could also be structured to reflect real-world query patterns, which are sometimes different from the raw data storage engineering problem. With some logging you may be able to see that the data is not truly randomly accessed, but rather clustered around just some of the millions of tables. Then the engineering problem becomes "How do I service 90% of the queries on these tables in 30ms?" rather than "How do I service 100% of the queries 100% of the time in 30ms?" Knowing 90% of the queries hit just a few hundred tables makes the first question easier to answer. Similarly, if most of the columns are static and only a few columns are actually changing, you could consider pulling the static stuff out of the same table as the dynamic stuff and then look at joins in your queries. The end goal is to be able to get solid indexes and tables that don't change a lot so they can be tightly packed and cached (less bloat, less fragmentation, fewer disk accesses). With regards to consistent query performance, I think you need to get out of AWS. That environment is terrible if you are going for consistency unless you buy dedicated hardware, and then you are paying so much money it is ridiculous. Also I think having 10M rows in a table is not a problem for the query times you are referring to. So instead of millions of tables, unless I'm doing my math wrong, you probably only need thousands of tables. 
On Mon, Sep 26, 2016 at 5:43 AM, Stuart Bishop wrote: > On 26 September 2016 at 11:19, Greg Spiegelberg > wrote: > >> I did look at PostgresXL and CitusDB. Both are admirable however neither >> could support the need to read a random record consistently under 30ms. >> It's a similar problem Cassandra and others have: network latency. At this >> scale, to provide the ability to access any given record amongst trillions >> it is imperative to know precisely where it is stored (system & database) >> and read a relatively small index. I have other requirements that prohibit >> use of any technology that is eventually consistent. >> >> I liken the problem to fishing. To find a particular fish of length, >> size, color in a data lake you must accept the possibility of scanning >> the entire lake. However, if all fish were in barrels where each barrel >> had a particular kind of fish of specific length, size, color then the >> problem is far simpler. >> >> -Greg >> > > My gut tells me that if you do solve the problem and get PostgreSQL (or > anything) reading consistently at under 30ms with that many tables you will > have solved one problem by creating another. > > You discounted Cassandra due to network latency, but are now trying a > monolithic PostgreSQL setup. It might be worth trying a single node > ScyllaDB or Cassandra deploy (no need for QUORUM or network overhead), > perhaps using layered compaction so all your data gets broken out into > 160MB chunks. And certainly wander over to the ScyllaDB mailing list, as > they are very focused on performance problems like yours and should offer > some insight even if a Cassandra style architecture cannot meet your > requirements. > > An alternative if you exhaust or don't trust other options, use a foreign > data wrapper to access your own custom storage. 
A single table at the PG > level, you can shard the data yourself into 8 bazillion separate stores, in > whatever structure suits your read and write operations (maybe reusing an > embedded db engine, ordered flat file+log+index, whatever). > > -- > Stuart Bishop > http://www.stuartbishop.net/ >
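The earlier "service 90% of the queries" reframing becomes concrete with a little logging: count hits per table and take the smallest set that covers the target fraction. A sketch, where the table names and the one-entry-per-query log format are assumptions:

```python
from collections import Counter

def hot_tables(query_log, coverage=0.90):
    """Smallest set of tables serving `coverage` of all queries,
    given a log containing one table name per query."""
    counts = Counter(query_log)
    total = sum(counts.values())
    hot, served = [], 0
    for table, n in counts.most_common():
        hot.append(table)
        served += n
        if served / total >= coverage:
            break
    return hot

# Toy log: t1 dominates, so it alone already covers >90% of queries.
log = ["t1"] * 95 + ["t2"] * 3 + ["t3"] * 2
print(hot_tables(log))  # ['t1']
```

If the hot set really is a few hundred tables out of millions, those are the ones worth keeping tightly packed and cached.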
Re: [PERFORM] Millions of tables
On 26 September 2016 at 11:19, Greg Spiegelberg wrote: > I did look at PostgresXL and CitusDB. Both are admirable however neither > could support the need to read a random record consistently under 30ms. > It's a similar problem Cassandra and others have: network latency. At this > scale, to provide the ability to access any given record amongst trillions > it is imperative to know precisely where it is stored (system & database) > and read a relatively small index. I have other requirements that prohibit > use of any technology that is eventually consistent. > > I liken the problem to fishing. To find a particular fish of length, > size, color in a data lake you must accept the possibility of scanning > the entire lake. However, if all fish were in barrels where each barrel > had a particular kind of fish of specific length, size, color then the > problem is far simpler. > > -Greg > My gut tells me that if you do solve the problem and get PostgreSQL (or anything) reading consistently at under 30ms with that many tables you will have solved one problem by creating another. You discounted Cassandra due to network latency, but are now trying a monolithic PostgreSQL setup. It might be worth trying a single node ScyllaDB or Cassandra deploy (no need for QUORUM or network overhead), perhaps using layered compaction so all your data gets broken out into 160MB chunks. And certainly wander over to the ScyllaDB mailing list, as they are very focused on performance problems like yours and should offer some insight even if a Cassandra style architecture cannot meet your requirements. An alternative if you exhaust or don't trust other options, use a foreign data wrapper to access your own custom storage. A single table at the PG level, you can shard the data yourself into 8 bazillion separate stores, in whatever structure suits your read and write operations (maybe reusing an embedded db engine, ordered flat file+log+index, whatever). 
-- Stuart Bishop http://www.stuartbishop.net/
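Stuart's FDW-over-your-own-storage idea hinges on deterministic routing: given a key, compute which store holds it, with no scan and no network hop. A minimal sketch, where N_STORES and the key format are assumptions:

```python
import hashlib

N_STORES = 8_000_000  # "8 bazillion" separate stores; pick your N

def store_for(partition_key: str) -> int:
    """Deterministically route a partition key to one of N backing
    stores (files, embedded DBs, ...) behind a single FDW table."""
    h = hashlib.sha256(partition_key.encode()).digest()
    return int.from_bytes(h[:8], "big") % N_STORES

# The same key always lands in the same store, so a single-record
# fetch opens exactly one small store and one small index:
assert store_for("fish-42") == store_for("fish-42")
```

Note that with modulo hashing, changing N_STORES remaps nearly every key; consistent hashing would be the usual fix if the store count must grow.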
Re: [PERFORM] Millions of tables
On 26/09/16 05:50, Greg Spiegelberg wrote: Hey all, Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time has said not to have millions of tables. I too have long believed it until recently. AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for PGDATA. Over the weekend, I created 8M tables with 16M indexes on those tables. Table creation initially took 0.018031 secs, average 0.027467 and after tossing out outliers (qty 5) the maximum creation time found was 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 seconds. Tables were created by a single process. Do note that table creation is done via plpgsql function as there are other housekeeping tasks necessary though minimal. No system tuning but here is a list of PostgreSQL knobs and switches: shared_buffers = 2GB work_mem = 48 MB max_stack_depth = 4 MB synchronous_commit = off effective_cache_size = 200 GB pg_xlog is on its own file system There are still some obvious problems. General DBA functions such as VACUUM and ANALYZE should not be done. Each will run forever and cause much grief. Backups are problematic in the traditional pg_dump and PITR space. Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in my test case) are no-no's. A system or database crash could take potentially hours to days to recover. There are likely other issues ahead. You may wonder, "why is Greg attempting such a thing?" I looked at DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop". I looked at many others and ultimately the recommended use of each vendor was to have one table for all data. That overcomes the millions of tables problem, right? Problem with the "one big table" solution is I anticipate 1,200 trillion records. Random access is expected and the customer expects <30ms reads for a single record fetch. No data is loaded... yet. Table and index creation only. 
I am interested in the opinions of all including tests I may perform. If you had this setup, what would you capture / analyze? I have a job running preparing data. I did this on a much smaller scale (50k tables) and data load via function allowed close to 6,000 records/second. The schema has been simplified since and the last test reached just over 20,000 records/second with 300k tables. I'm not looking for alternatives yet but input to my test. Takers? I can't promise immediate feedback but will do my best to respond with results. TIA, -Greg Hi Greg. This is a problem (creating a large number of tables; really large indeed) that we researched in my company a while ago. You might want to read about it: https://www.pgcon.org/2013/schedule/events/595.en.html Cheers, Álvaro -- Álvaro Hernández Tortosa --- 8Kdata -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Millions of tables
-sorry for my last email, which was also not bottom posting- Hi Greg, On Mon, Sep 26, 2016 at 11:19 AM, Greg Spiegelberg wrote: > I did look at PostgresXL and CitusDB. Both are admirable however neither > could support the need to read a random record consistently under 30ms. > It's a similar problem Cassandra and others have: network latency. At this > scale, to provide the ability to access any given record amongst trillions > it is imperative to know precisely where it is stored (system & database) > and read a relatively small index. I have other requirements that prohibit > use of any technology that is eventually consistent. > Then you can get below 30ms, but how many processes might you have to run concurrently? This is something you should consider: a single machine has fewer than 50 hardware threads for Intel, 192 HT for POWER8, still far below millions compared with the number of tables (8 million). If you use indexes correctly you would not need sequential scans, since the index search runs in memory (index loaded into memory). Do you plan to query through the master table of the partition? That is actually quite slow, considering the millions of rules to check for every query. With 8 million tables you would require very big data storage for sure, and it would not fit mounted on a single machine unless you are planning to use IBM z machines. > I liken the problem to fishing. To find a particular fish of length, > size, color in a data lake you must accept the possibility of scanning > the entire lake. However, if all fish were in barrels where each barrel > had a particular kind of fish of specific length, size, color then the > problem is far simpler. > >
Re: [PERFORM] Millions of tables
On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg wrote: > Hey all, > > Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time > has said not to have millions of tables. I too have long believed it until > recently. > > AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for > PGDATA. Over the weekend, I created 8M tables with 16M indexes on those > tables. Table creation initially took 0.018031 secs, average 0.027467 and > after tossing out outliers (qty 5) the maximum creation time found was > 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 seconds. > Tables were created by a single process. Do note that table creation is > done via plpgsql function as there are other housekeeping tasks necessary > though minimal. > > No system tuning but here is a list of PostgreSQL knobs and switches: > shared_buffers = 2GB > work_mem = 48 MB > max_stack_depth = 4 MB > synchronous_commit = off > effective_cache_size = 200 GB > pg_xlog is on its own file system > > There are still some obvious problems. General DBA functions such as > VACUUM and ANALYZE should not be done. Each will run forever and cause > much grief. > Why would the auto versions of those cause less grief than the manual versions? > Backups are problematic in the traditional pg_dump and PITR space. > Is there a third option to those two spaces? File-system snapshots? > Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in > my test case) are no-no's. A system or database crash could take > potentially hours to days to recover. > Isn't that a show-stopper? > There are likely other issues ahead. > > You may wonder, "why is Greg attempting such a thing?" I looked at > DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, > it's antiquated and don't get me started on "Hadoop". I looked at many > others and ultimately the recommended use of each vendor was to have one > table for all data. 
That overcomes the millions of tables problem, right? > > Problem with the "one big table" solution is I anticipate 1,200 trillion > records. Random access is expected and the customer expects <30ms reads > for a single record fetch. > Sorry, I don't really follow. Whether you have 1 table or millions, eventually someone has to go get the data off the disk. Why would the number of tables make much of a difference to that fundamental? Also, how many tablespaces do you anticipate having? Can you get 120 petabytes of storage all mounted to one machine? > No data is loaded... yet. Table and index creation only. I am interested > in the opinions of all including tests I may perform. If you had this > setup, what would you capture / analyze? I have a job running preparing > data. I did this on a much smaller scale (50k tables) and data load via > function allowed close to 6,000 records/second. The schema has been > simplified since and the last test reached just over 20,000 records/second with > 300k tables. > > I'm not looking for alternatives yet but input to my test. Takers? > Go through and put one row (or 8kB worth of rows) into each of 8 million tables. The stats collector and the autovacuum process will start going nuts. Now, maybe you can deal with it. But maybe not. That is the first non-obvious thing I'd look at. Cheers, Jeff
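Jeff's suggested stress test, one row in each of 8M tables, is easy to script as batched SQL; the data_<i> table naming is an assumption standing in for Greg's real schema:

```python
def seed_one_row_each(n_tables=8_000_000, batch=10_000):
    """Yield batches of INSERT statements that put one row into every
    table, to surface stats-collector and autovacuum churn."""
    stmts = []
    for i in range(n_tables):
        stmts.append(f"INSERT INTO data_{i} VALUES (DEFAULT);")
        if len(stmts) == batch:
            yield "\n".join(stmts)
            stmts = []
    if stmts:
        yield "\n".join(stmts)

# Feed each batch to psql in one round trip; tiny demo with 3 tables:
print(next(seed_one_row_each(n_tables=3, batch=10)))
```

Watching pg_stat_activity and the autovacuum log while this runs would show whether the bookkeeping, not the I/O, is the first thing to fall over.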
Re: [PERFORM] Millions of tables
Hi Greg, Please follow the conventions of this mailing list, to avoid confusion - see bottom of this posting for further comments On 26/09/16 17:05, Greg Spiegelberg wrote: Precisely why I shared with the group. I must understand the risks involved. I need to explore: if it can be stable at this size, when does it become unstable? Aside from locking down user access to superuser, is there a way to prohibit database-wide VACUUM & ANALYZE? Certainly putting my trust in autovacuum :) which is something I have not yet fully explored how to best tune. Couple more numbers... ~231 GB is the size of PGDATA with 8M empty tables and 16M empty indexes. ~5% of inodes on the file system have been used. Sar data during the 8M table creation shows a very stable and regular I/O pattern. Not a blip worth mentioning. Another point worth mentioning, the tables contain a boolean, int8's and timestamptz's only. Nothing of variable size like bytea, text, json or xml. Each of the 8M tables will contain on the very high side between 140k and 200k records. The application also has a heads up as to which table contains which record. The searches come in saying "give me record X from partition key Y" where Y identifies the table and X is used in the filter on the table. Last point, add column will never be done. I can hear eyes rolling :) but the schema and its intended use is complete. You'll have to trust me on that one. -Greg On Sun, Sep 25, 2016 at 9:23 PM, Mike Sofen wrote: *From:*Greg Spiegelberg *Sent:* Sunday, September 25, 2016 7:50 PM … Over the weekend, I created 8M tables with 16M indexes on those tables. … A system or database crash could take potentially hours to days to recover. There are likely other issues ahead. You may wonder, "why is Greg attempting such a thing?" I looked at DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop". 
Problem with the "one big table" solution is I anticipate 1,200 trillion records. Random access is expected and the customer expects <30ms reads for a single record fetch. I'm not looking for alternatives yet but input to my test. _ Holy guacamole, batman! Ok, here’s my take: you’ve traded the risks/limitations of the known for the risks of the unknown. The unknown being, in the numerous places where postgres historical development may have cut corners, you may be the first to exercise those corners and flame out like the recent SpaceX rocket. Put it another way – you’re going to bet your career (perhaps) or a client’s future on an architectural model that just doesn’t seem feasible. I think you’ve got a remarkable design problem to solve, and am glad you’ve chosen to share that problem with us. And I do think it will boil down to this: it’s not that you CAN do it on Postgres (which you clearly can), but once in production, assuming things are actually stable, how will you handle the data management aspects like inevitable breakage, data integrity issues, backups, restores, user contention for resources, fault tolerance and disaster recovery. Just listing the tables will take forever. Add a column? Never. I do think the amount of testing you’ll need to do to prove that every normal data management function still works at that table count…that in itself is going to be not a lot of fun. This one hurts my head. Ironically, the most logical destination for this type of data may actually be Hadoop – auto-scale, auto-shard, fault tolerant, etc…and I’m not a Hadoopie. I am looking forward to hearing how this all plays out, it will be quite an adventure! All the best, Mike Sofen (Synthetic Genomics…on Postgres 9.5x) In this list, the convention is to post replies at the end (with some rare exceptions), or interspersed when appropriate, and to omit parts no longer relevant. 
The motivation for bottom posting like this is that people get to see the context before the reply, AND emails don't end up getting longer and longer as people reply at the beginning, forgetting to trim the now-irrelevant stuff at the end. Cheers, Gavin
Re: [PERFORM] Millions of tables
I did look at PostgresXL and CitusDB. Both are admirable however neither could support the need to read a random record consistently under 30ms. It's a similar problem Cassandra and others have: network latency. At this scale, to provide the ability to access any given record amongst trillions it is imperative to know precisely where it is stored (system & database) and read a relatively small index. I have other requirements that prohibit use of any technology that is eventually consistent. I liken the problem to fishing. To find a particular fish of length, size, color in a data lake you must accept the possibility of scanning the entire lake. However, if all fish were in barrels where each barrel had a particular kind of fish of specific length, size, color then the problem is far simpler. -Greg On Sun, Sep 25, 2016 at 9:04 PM, julyanto SUTANDANG wrote: > Dear Greg, > > Have you checked PostgresXL ? > with millions of tables, how do the apps choose which table is appropriate? > in my opinion, with that scale it should go with parallel query with > data sharing like what PostgresXL has done. > > Thanks, > > > Julyanto SUTANDANG > > Equnix Business Solutions, PT > (An Open Source and Open Mind Company) > www.equnix.co.id > Pusat Niaga ITC Roxy Mas Blok C2/42. Jl. KH Hasyim Ashari 125, Jakarta > Pusat > T: +6221 2282 F: +62216315281 M: +628164858028 > > > Caution: The information enclosed in this email (and any attachments) > may be legally privileged and/or confidential and is intended only for > the use of the addressee(s). No addressee should forward, print, copy, > or otherwise reproduce this message in any manner that would allow it > to be viewed by any individual not originally listed as a recipient. > If the reader of this message is not the intended recipient, you are > hereby notified that any unauthorized disclosure, dissemination, > distribution, copying or the taking of any action in reliance on the > information herein is strictly prohibited. 
If you have received this > communication in error, please immediately notify the sender and > delete this message.Unless it is made by the authorized person, any > views expressed in this message are those of the individual sender and > may not necessarily reflect the views of PT Equnix Business Solutions. > > > On Mon, Sep 26, 2016 at 9:50 AM, Greg Spiegelberg > wrote: > > Hey all, > > > > Obviously everyone who's been in PostgreSQL or almost any RDBMS for a > time > > has said not to have millions of tables. I too have long believed it > until > > recently. > > > > AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) > for > > PGDATA. Over the weekend, I created 8M tables with 16M indexes on those > > tables. Table creation initially took 0.018031 secs, average 0.027467 > and > > after tossing out outliers (qty 5) the maximum creation time found was > > 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 seconds. > > Tables were created by a single process. Do note that table creation is > > done via plpgsql function as there are other housekeeping tasks necessary > > though minimal. > > > > No system tuning but here is a list of PostgreSQL knobs and switches: > > shared_buffers = 2GB > > work_mem = 48 MB > > max_stack_depth = 4 MB > > synchronous_commit = off > > effective_cache_size = 200 GB > > pg_xlog is on it's own file system > > > > There are some still obvious problems. General DBA functions such as > VACUUM > > and ANALYZE should not be done. Each will run forever and cause much > grief. > > Backups are problematic in the traditional pg_dump and PITR space. Large > > JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in my > test > > case) are no-no's. A system or database crash could take potentially > hours > > to days to recover. There are likely other issues ahead. > > > > You may wonder, "why is Greg attempting such a thing?" I looked at > > DynamoDB, BigTable, and Cassandra. 
I like Greenplum but, let's face it, > > it's antiquated and don't get me started on "Hadoop". I looked at many > > others and ultimately the recommended use of each vendor was to have one > > table for all data. That overcomes the millions of tables problem, > right? > > > > Problem with the "one big table" solution is I anticipate 1,200 trillion > > records. Random access is expected and the customer expects <30ms reads > for > > a single record fetch. > > > > No data is loaded... yet. Table and index creation only. I am > interested in > > the opinions of all including tests I may perform. If you had this > setup, > > what would you capture / analyze? I have a job running preparing data. > I > > did this on a much smaller scale (50k tables) and data load via function > > allowed close to 6,000 records/second. The schema has been simplified > since > > and the last test reached just over 20,000 records/second with 300k tables. > > > > I'm not looking for alternatives yet but input to my test. Takers?
Re: [PERFORM] Millions of tables
Precisely why I shared with the group. I must understand the risks involved. I need to explore: if it can be stable at this size, when does it become unstable? Aside from locking down user access to superuser, is there a way to prohibit database-wide VACUUM & ANALYZE? Certainly putting my trust in autovacuum :) which is something I have not yet fully explored how to best tune. Couple more numbers... ~231 GB is the size of PGDATA with 8M empty tables and 16M empty indexes. ~5% of inodes on the file system have been used. Sar data during the 8M table creation shows a very stable and regular I/O pattern. Not a blip worth mentioning. Another point worth mentioning, the tables contain a boolean, int8's and timestamptz's only. Nothing of variable size like bytea, text, json or xml. Each of the 8M tables will contain on the very high side between 140k and 200k records. The application also has a heads up as to which table contains which record. The searches come in saying "give me record X from partition key Y" where Y identifies the table and X is used in the filter on the table. Last point, add column will never be done. I can hear eyes rolling :) but the schema and its intended use is complete. You'll have to trust me on that one. -Greg On Sun, Sep 25, 2016 at 9:23 PM, Mike Sofen wrote: > *From:* Greg Spiegelberg *Sent:* Sunday, September 25, 2016 7:50 PM > … Over the weekend, I created 8M tables with 16M indexes on those tables. > > … A system or database crash could take potentially hours to days to > recover. There are likely other issues ahead. > > > > You may wonder, "why is Greg attempting such a thing?" I looked at > DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, > it's antiquated and don't get me started on "Hadoop". Problem with the > "one big table" solution is I anticipate 1,200 trillion records. Random > access is expected and the customer expects <30ms reads for a single record > fetch. 
> > > > I'm not looking for alternatives yet but input to my test. > > _ > > > > Holy guacamole, batman! Ok, here’s my take: you’ve traded the > risks/limitations of the known for the risks of the unknown. The unknown > being, in the numerous places where postgres historical development may > have cut corners, you may be the first to exercise those corners and flame > out like the recent SpaceX rocket. > > > > Put it another way – you’re going to bet your career (perhaps) or a > client’s future on an architectural model that just doesn’t seem feasible. > I think you’ve got a remarkable design problem to solve, and am glad you’ve > chosen to share that problem with us. > > > > And I do think it will boil down to this: it’s not that you CAN do it on > Postgres (which you clearly can), but once in production, assuming things > are actually stable, how will you handle the data management aspects like > inevitable breakage, data integrity issues, backups, restores, user > contention for resources, fault tolerance and disaster recovery. Just > listing the tables will take forever. Add a column? Never. I do think > the amount of testing you’ll need to do to prove that every normal data > management function still works at that table count…that in itself is going > to be not a lot of fun. > > > > This one hurts my head. Ironically, the most logical destination for this > type of data may actually be Hadoop – auto-scale, auto-shard, fault > tolerant, etc…and I’m not a Hadoopie. > > > > I am looking forward to hearing how this all plays out, it will be quite > an adventure! All the best, > > > > Mike Sofen (Synthetic Genomics…on Postgres 9.5x) >
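Greg's access pattern, "give me record X from partition key Y," implies a pure client-side routing step before any SQL is issued, which is what keeps each lookup to one small table and one small index. A sketch with hypothetical table names and columns:

```python
def table_for(partition_key: int, n_tables: int = 8_000_000) -> str:
    """Map the application's partition key Y to one of the 8M tables."""
    return f"data_{partition_key % n_tables}"

def fetch_sql(record_id: int, partition_key: int) -> str:
    """Build the single-table query for 'record X from partition key Y'."""
    return (f"SELECT * FROM {table_for(partition_key)} "
            f"WHERE id = {record_id}")

print(fetch_sql(123, 42))  # SELECT * FROM data_42 WHERE id = 123
```

In a real client the record id would of course be a bound parameter rather than interpolated; the point is only that no catalog lookup or parent-table scan sits on the 30ms path.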
Re: [PERFORM] Millions of tables
From: Greg Spiegelberg Sent: Sunday, September 25, 2016 7:50 PM … Over the weekend, I created 8M tables with 16M indexes on those tables. … A system or database crash could take potentially hours to days to recover. There are likely other issues ahead. You may wonder, "why is Greg attempting such a thing?" I looked at DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop". Problem with the "one big table" solution is I anticipate 1,200 trillion records. Random access is expected and the customer expects <30ms reads for a single record fetch. I'm not looking for alternatives yet but input to my test. _ Holy guacamole, batman! Ok, here’s my take: you’ve traded the risks/limitations of the known for the risks of the unknown. The unknown being, in the numerous places where postgres historical development may have cut corners, you may be the first to exercise those corners and flame out like the recent SpaceX rocket. Put it another way – you’re going to bet your career (perhaps) or a client’s future on an architectural model that just doesn’t seem feasible. I think you’ve got a remarkable design problem to solve, and am glad you’ve chosen to share that problem with us. And I do think it will boil down to this: it’s not that you CAN do it on Postgres (which you clearly can), but once in production, assuming things are actually stable, how will you handle the data management aspects like inevitable breakage, data integrity issues, backups, restores, user contention for resources, fault tolerance and disaster recovery. Just listing the tables will take forever. Add a column? Never. I do think the amount of testing you’ll need to do to prove that every normal data management function still works at that table count…that in itself is going to be not a lot of fun. This one hurts my head. 
Ironically, the most logical destination for this type of data may actually be Hadoop – auto-scale, auto-shard, fault tolerant, etc…and I’m not a Hadoopie. I am looking forward to hearing how this all plays out, it will be quite an adventure! All the best, Mike Sofen (Synthetic Genomics…on Postgres 9.5x)
Re: [PERFORM] Millions of tables
Dear Greg, Have you checked PostgresXL ? with millions of tables, how do the apps choose which table is appropriate? in my opinion, with that scale it should go with parallel query with data sharing like what PostgresXL has done. Thanks, Julyanto SUTANDANG Equnix Business Solutions, PT (An Open Source and Open Mind Company) www.equnix.co.id Pusat Niaga ITC Roxy Mas Blok C2/42. Jl. KH Hasyim Ashari 125, Jakarta Pusat T: +6221 2282 F: +62216315281 M: +628164858028 On Mon, Sep 26, 2016 at 9:50 AM, Greg Spiegelberg wrote: > Hey all, > > Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time > has said not to have millions of tables. I too have long believed it until > recently. > > AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for > PGDATA. Over the weekend, I created 8M tables with 16M indexes on those > tables. Table creation initially took 0.018031 secs, average 0.027467 and > after tossing out outliers (qty 5) the maximum creation time found was > 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 seconds. 
> Tables were created by a single process. Do note that table creation is > done via plpgsql function as there are other housekeeping tasks necessary > though minimal. > > No system tuning but here is a list of PostgreSQL knobs and switches: > shared_buffers = 2GB > work_mem = 48 MB > max_stack_depth = 4 MB > synchronous_commit = off > effective_cache_size = 200 GB > pg_xlog is on its own file system > > There are still some obvious problems. General DBA functions such as VACUUM > and ANALYZE should not be done. Each will run forever and cause much grief. > Backups are problematic in the traditional pg_dump and PITR space. Large > JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in my test > case) are no-no's. A system or database crash could take potentially hours > to days to recover. There are likely other issues ahead. > > You may wonder, "why is Greg attempting such a thing?" I looked at > DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, > it's antiquated and don't get me started on "Hadoop". I looked at many > others and ultimately the recommended use of each vendor was to have one > table for all data. That overcomes the millions of tables problem, right? > > Problem with the "one big table" solution is I anticipate 1,200 trillion > records. Random access is expected and the customer expects <30ms reads for > a single record fetch. > > No data is loaded... yet. Table and index creation only. I am interested in > the opinions of all including tests I may perform. If you had this setup, > what would you capture / analyze? I have a job running preparing data. I > did this on a much smaller scale (50k tables) and data load via function > allowed close to 6,000 records/second. The schema has been simplified since > and the last test reached just over 20,000 records/second with 300k tables. > > I'm not looking for alternatives yet but input to my test. Takers? 
> > I can't promise immediate feedback but will do my best to respond with > results. > > TIA, > -Greg