Re: [PERFORM] Millions of tables
Greg, sorry for the resend: I had forgotten to include the list. On Wed, Oct 5, 2016 at 2:34 PM, Greg Spiegelberg wrote: > Data is not static. The 4M tables fall into one of two groups. > > Group A contains 2M tables. INSERT will occur ~100 times/day and maximum > number of records anticipated will be 200k. Periodic DELETE's will occur > removing "old" records. Age is something the client sets and I have no way > of saying 1 or 10k records will be removed. The ~100 times/day are per table, I assume. Also, I assume the DELETEs will remove records in batches (because the age criterion catches several records at once). > Group B contains the other 2M tables. Maximum records ~140k and UPSERT will > be the only mechanism used to populate and maintain. Periodic DELETE's may > run on these tables as well removing "old" records. So there will be inserts and updates. Either I missed it or you did not mention the criteria for placing a record in one of the 4M buckets. Can you shed light on what the criteria are? That would obviously suggest what indexing could be done. It would also be interesting to see the results of your tests with btree on really large tables, as Stephen had suggested. I know it is not the primary test you want to do, but I would rather first explore the "traditional" schema before I venture into the unknown of the multi-million-dollar, pardon, table schema. Kind regards -- [guy, jim, charlie].each {|him| remember.him do |as, often| as.you_can - without end} http://blog.rubybestpractices.com/ -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Millions of tables
On 10/5/16 7:34 AM, Greg Spiegelberg wrote: When you say "must do a vacuum of the entire database", are you saying the entire database must be vacuum'd as a whole per 2B transactions or all tables must be vacuum'd eventually at least once? All tables at least once. Prior to 9.6, that had to happen every 2B transactions. With 9.6 there's a freeze map, so if a page never gets dirtied between vacuum freeze runs then it doesn't need to be frozen. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
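[Editor's note] The "2B transactions" figure Jim cites can be roughed out numerically. The sketch below assumes one XID per record at the ~2,300 records/second estimate given elsewhere in the thread, which overstates XID burn if inserts are batched; it is back-of-envelope arithmetic, not the server's exact bookkeeping.

```python
# Rough arithmetic behind the "every 2B transactions" figure. PostgreSQL
# XIDs are 32-bit and compared circularly, so only about 2^31 transactions
# of "past" can exist before wraparound; autovacuum_freeze_max_age (default
# 200M) triggers anti-wraparound vacuums well before that horizon.
XID_SPACE = 2**32
WRAPAROUND_HORIZON = XID_SPACE // 2            # ~2.1B usable XIDs
AUTOVACUUM_FREEZE_MAX_AGE = 200_000_000        # default GUC value

def days_until_forced_freeze(xids_per_second: float) -> float:
    """Days until a freshly loaded table's XID age hits the forced-vacuum threshold."""
    return AUTOVACUUM_FREEZE_MAX_AGE / xids_per_second / 86_400

# At ~2,300 transactions/second (assumption: one XID per record):
print(round(days_until_forced_freeze(2_300), 1))   # → 1.0 (about a day)
```

At that rate the default threshold is reached in roughly a day, which is why batching inserts into fewer transactions matters so much at this scale.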
Re: [PERFORM] Millions of tables
On Fri, Sep 30, 2016 at 4:49 PM, Jim Nasby wrote: > On 9/29/16 6:11 AM, Alex Ignatov (postgrespro) wrote: > >> With millions of tables you have to set autovacuum_max_workers >> sky-high =). We have some situation when at thousands of tables >> autovacuum can’t vacuum all tables that need it. Simply it vacuums some >> of most modified table and never reach others. Only manual vacuum can >> help with this situation. With wraparound issue it can be a nightmare >> > > Specifically, autovac isn't going to start worrying about anti-wrap > vacuums until tables start hitting autovacuum_freeze_max_age (or > autovacuum_multixact_freeze_max_age). Any tables that hit that threshold > go to the front of the line for being vacuumed. (But keep in mind that > there is no universal line, just what each worker computes on its own when > it's started). > > Where things will completely fall apart for you is if a lot of tables all > have roughly the same relfrozenxid (or relminmxid), like they would > immediately after a large load. In that scenario you'll suddenly have loads > of work for autovac to do, all at the same time. That will make the > database, DBAs and you Very Unhappy (tm). > > Somehow, some way, you *must* do a vacuum of the entire database. Luckily > the freeze map in 9.6 means you'd only have to do that one time (assuming > the data really is static). In any older version, (auto)vacuum will need to > eventually *read everything in every table* at least once every ~2B > transactions. > Data is not static. The 4M tables fall into one of two groups. Group A contains 2M tables. INSERT will occur ~100 times/day and maximum number of records anticipated will be 200k. Periodic DELETE's will occur removing "old" records. Age is something the client sets and I have no way of saying 1 or 10k records will be removed. Group B contains the other 2M tables. Maximum records ~140k and UPSERT will be the only mechanism used to populate and maintain. 
Periodic DELETE's may run on these tables as well removing "old" records. Will a set of tables require vacuum'ing at the same time? Quite possibly but I have no way to say 2 or 200k tables will need it. When you say "must do a vacuum of the entire database", are you saying the entire database must be vacuum'd as a whole per 2B transactions or all tables must be vacuum'd eventually at least once? I want to be absolutely clear on what you're saying. > There is one potentially significant difference between autovac and manual > vacuums here; autovac treats toast tables as just another table, with their > own stats and their own freeze needs. If you're generating a lot of toast > records that might make a difference. > I do not anticipate TOAST entering the picture. No single column or record > 8KB or even approaching it. We have a few databases that (ab)use pg_toast and I want to avoid those complications. -Greg
Re: [PERFORM] Millions of tables
On Thu, Sep 29, 2016 at 4:11 AM, Alex Ignatov (postgrespro) < a.igna...@postgrespro.ru> wrote: > > > *From:* pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance- ow...@postgresql.org] *On Behalf Of * > > Thank you Terry. You get the gold star. :) I was waiting for that to > come up. > > > > Success means handling this condition. A whole database vacuum and > dump-restore is out of the question. Can a properly tuned autovacuum > prevent the situation? > > > > -Greg > > > > Hi! > > With millions of tables you have to set autovacuum_max_workers > sky-high =). We have some situation when at thousands of tables autovacuum > can’t vacuum all tables that need it. Simply it vacuums some of most > modified table and never reach others. > Any autovacuum worker should vacuum all tables in its assigned database which it perceives need vacuuming, as long as it can get the lock. Unless the worker is interrupted, for example by frequent database shutdowns, it should reach all tables in that database before it exits. Unless there is a bug, or you are constantly restarting the database before autovacuum can finish or doing something else to kill them off, what you describe should not happen. If it is a bug, we should fix it. Can you give more details? There is a known bug when you have multiple active databases in the same cluster. Once one database reaches the age where anti-wraparound vacuums kick in, then all future autovacuum workers are directed to that one database, starving all other databases of auto-vacuuming. But that doesn't sound like what you are describing. Cheers, Jeff
Re: [PERFORM] Millions of tables
On 9/29/16 6:11 AM, Alex Ignatov (postgrespro) wrote: With millions of tables you have to set autovacuum_max_workers sky-high =). We have some situation when at thousands of tables autovacuum can’t vacuum all tables that need it. Simply it vacuums some of most modified table and never reach others. Only manual vacuum can help with this situation. With wraparound issue it can be a nightmare Specifically, autovac isn't going to start worrying about anti-wrap vacuums until tables start hitting autovacuum_freeze_max_age (or autovacuum_multixact_freeze_max_age). Any tables that hit that threshold go to the front of the line for being vacuumed. (But keep in mind that there is no universal line, just what each worker computes on its own when it's started). Where things will completely fall apart for you is if a lot of tables all have roughly the same relfrozenxid (or relminmxid), like they would immediately after a large load. In that scenario you'll suddenly have loads of work for autovac to do, all at the same time. That will make the database, DBAs and you Very Unhappy (tm). Somehow, some way, you *must* do a vacuum of the entire database. Luckily the freeze map in 9.6 means you'd only have to do that one time (assuming the data really is static). In any older version, (auto)vacuum will need to eventually *read everything in every table* at least once every ~2B transactions. The only impact the number of tables is going to have on this is granularity. If you have a small number of large tables, you'll have (auto)vacuum processes that will need to run *uninterrupted* for a long time to move the freeze threshold on each table. If you have tons of small tables, you'll need tons of separate (auto)vacuums, but each one will run for a shorter interval, and if one gets interrupted it won't be as big a deal. 
There is one potentially significant difference between autovac and manual vacuums here; autovac treats toast tables as just another table, with their own stats and their own freeze needs. If you're generating a lot of toast records that might make a difference. When it comes to vacuum, you might find https://www.pgcon.org/2015/schedule/events/829.en.html useful. On a different topic... I didn't see anything in the thread about what you're storing, but with the row counts you're talking about I'm guessing it's something that's time-series. https://github.com/ElephantStack/ElephantStack is a project exploring the idea of using Postgres array types as a far more efficient way to store that kind of data; instead of an overhead of 24 bytes per row (plus indexes) arrays give you essentially zero overhead per row. There's no code yet, but a few of us have done testing on some real world data (see the google group referenced from the README). -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
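[Editor's note] The 24-bytes-per-row point can be sketched numerically. The ~24-byte heap tuple header is real PostgreSQL overhead, but the payload size and the 1000-values-per-array packing below are illustrative assumptions, not a description of how ElephantStack actually stores data.

```python
# Back-of-envelope: per-row header overhead vs. packing values into arrays.
ROW_HEADER = 24                    # heap tuple header, bytes (approximate)
PAYLOAD = 8                        # assumption: one bigint per reading

def one_row_per_reading(n: int) -> int:
    # conventional layout: every reading is its own heap tuple
    return n * (ROW_HEADER + PAYLOAD)

def array_packed(n: int, per_array: int = 1_000) -> int:
    # one heap row per 1000 readings, values packed into an int8[] column
    n_rows = -(-n // per_array)    # ceil division
    return n_rows * ROW_HEADER + n * PAYLOAD

n = 1_000_000
print(round(one_row_per_reading(n) / array_packed(n), 1))   # → 4.0x smaller
```

With an 8-byte payload the header dominates, so amortizing it across an array shrinks the heap roughly 4x before even counting index savings; wider rows would see a smaller ratio.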
Re: [PERFORM] Millions of tables
From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Greg Spiegelberg Sent: Tuesday, September 27, 2016 7:28 PM To: Terry Schmitt <tschm...@schmittworks.com> Cc: pgsql-performa. <pgsql-performance@postgresql.org> Subject: Re: [PERFORM] Millions of tables On Tue, Sep 27, 2016 at 10:15 AM, Terry Schmitt <tschm...@schmittworks.com> wrote: On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg <gspiegelb...@gmail.com> wrote: Hey all, Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time has said not to have millions of tables. I too have long believed it until recently. AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for PGDATA. Over the weekend, I created 8M tables with 16M indexes on those tables. Table creation initially took 0.018031 secs, average 0.027467 and after tossing out outliers (qty 5) the maximum creation time found was 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 seconds. Tables were created by a single process. Do note that table creation is done via plpgsql function as there are other housekeeping tasks necessary though minimal. No system tuning but here is a list of PostgreSQL knobs and switches: shared_buffers = 2GB work_mem = 48 MB max_stack_depth = 4 MB synchronous_commit = off effective_cache_size = 200 GB pg_xlog is on its own file system There are still some obvious problems. General DBA functions such as VACUUM and ANALYZE should not be done. Each will run forever and cause much grief. Backups are problematic in the traditional pg_dump and PITR space. Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in my test case) are no-no's. A system or database crash could take potentially hours to days to recover. There are likely other issues ahead. You may wonder, "why is Greg attempting such a thing?" I looked at DynamoDB, BigTable, and Cassandra. 
I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop". I looked at many others and ultimately the recommended use of each vendor was to have one table for all data. That overcomes the millions of tables problem, right? Problem with the "one big table" solution is I anticipate 1,200 trillion records. Random access is expected and the customer expects <30ms reads for a single record fetch. No data is loaded... yet. Table and index creation only. I am interested in the opinions of all including tests I may perform. If you had this setup, what would you capture / analyze? I have a job running preparing data. I did this on a much smaller scale (50k tables) and data load via function allowed close to 6,000 records/second. The schema has been simplified since and the last test reached just over 20,000 records/second with 300k tables. I'm not looking for alternatives yet but input to my test. Takers? I can't promise immediate feedback but will do my best to respond with results. TIA, -Greg I have not seen any mention of transaction ID wraparound in this thread yet. With the numbers that you are looking at, I could see this as a major issue. T Thank you Terry. You get the gold star. :) I was waiting for that to come up. Success means handling this condition. A whole database vacuum and dump-restore is out of the question. Can a properly tuned autovacuum prevent the situation? -Greg Hi! With millions of tables you have to set autovacuum_max_workers sky-high =). We have some situation when at thousands of tables autovacuum can’t vacuum all tables that need it. Simply it vacuums some of most modified table and never reach others. Only manual vacuum can help with this situation. With wraparound issue it can be a nightmare -- Alex Ignatov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [PERFORM] Millions of tables
On 26 September 2016 at 05:19, Greg Spiegelberg wrote: > I did look at PostgresXL and CitusDB. Both are admirable however neither > could support the need to read a random record consistently under 30ms. > It's a similar problem Cassandra and others have: network latency. At this > scale, to provide the ability to access any given record amongst trillions > it is imperative to know precisely where it is stored (system & database) > and read a relatively small index. I have other requirements that prohibit > use of any technology that is eventually consistent. Then XL is exactly what you need, since it does allow you to calculate exactly where the record is via hash and then access it, which makes the request just a single datanode task. XL is not the same as CitusDB. > I liken the problem to fishing. To find a particular fish of length, size, > color in a data lake you must accept the possibility of scanning the > entire lake. However, if all fish were in barrels where each barrel had a > particular kind of fish of specific length, size, color then the problem > is far simpler. The task of putting the fish in the appropriate barrel is quite hard. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
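[Editor's note] The single-datanode routing Simon describes can be illustrated with a toy hash-distribution sketch. The node list is hypothetical and crc32 stands in for Postgres-XL's real distribution hash and catalog metadata; this shows only the shape of the idea.

```python
# Toy sketch of hash-distribution routing: hash the distribution key, map it
# to a datanode, and a single-record lookup touches exactly one node.
import zlib

DATANODES = ["node0", "node1", "node2", "node3"]   # hypothetical cluster

def datanode_for(key: int) -> str:
    # stable hash of the key -> bucket -> node
    h = zlib.crc32(key.to_bytes(8, "little"))
    return DATANODES[h % len(DATANODES)]

# The same key always routes to the same single node, so the coordinator
# never has to fan the query out across the cluster:
assert datanode_for(42) == datanode_for(42)
assert datanode_for(42) in DATANODES
```

The key property is determinism: given the key, the coordinator computes the location instead of searching for it, which is what keeps a point lookup to one network hop.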
Re: [PERFORM] Millions of tables
Greg, * Greg Spiegelberg (gspiegelb...@gmail.com) wrote: > On Wed, Sep 28, 2016 at 11:27 AM, Stephen Frost wrote: > > * Greg Spiegelberg (gspiegelb...@gmail.com) wrote: > > > Bigger buckets mean a wider possibility of response times. Some buckets > > > may contain 140k records and some 100X more. > > > > Have you analyzed the depth of the btree indexes to see how many more > > pages need to be read to handle finding a row in 140k records vs. 14M > > records vs. 140M records? > > > > I suspect you'd find that the change in actual depth (meaning how many > > pages have to actually be read to find the row you're looking for) isn't > > very much and that your concern over the "wider possibility of response > > times" isn't well founded. > Excellent feedback! Um, how does one look at tree depth in PostgreSQL? > Oracle I know but have not done the same in PG. Pointers? CREATE EXTENSION pageinspect; SELECT * FROM bt_metap('indexname'); https://www.postgresql.org/docs/9.5/static/pageinspect.html Thanks! Stephen
Re: [PERFORM] Millions of tables
On Wed, Sep 28, 2016 at 11:27 AM, Stephen Frost wrote: > Greg, > > * Greg Spiegelberg (gspiegelb...@gmail.com) wrote: > > Bigger buckets mean a wider possibility of response times. Some buckets > > may contain 140k records and some 100X more. > > Have you analyzed the depth of the btree indexes to see how many more > pages need to be read to handle finding a row in 140k records vs. 14M > records vs. 140M records? > > I suspect you'd find that the change in actual depth (meaning how many > pages have to actually be read to find the row you're looking for) isn't > very much and that your concern over the "wider possibility of response > times" isn't well founded. > > Stephen, Excellent feedback! Um, how does one look at tree depth in PostgreSQL? Oracle I know but have not done the same in PG. Pointers? > Since you have a hard-set 30ms maximum for query response time, I would > suggest you work out how long it takes to read a cold page from your I/O > subsystem and then you can work through exactly how many page reads > could be done in that 30ms (or perhaps 20ms, to allow for whatever > overhead there will be in the rest of the system and as a buffer) and > then work that back to how deep the index can be based on that many page > reads and then how many records are required to create an index of that > depth. Of course, the page from the heap will also need to be read and > there's a bit of additional work to be done, but the disk i/o for cold > pages is almost certainly where most time will be spent. > > I suspect you'll discover that millions of tables is a couple orders of > magnitude off of how many you'd need to keep the number of page reads > below the threshold you work out based on your I/O. > > Of course, you would need a consistent I/O subsystem, or at least one > where you know the maximum possible latency to pull a cold page. > > Lastly, you'll want to figure out how to handle system crash/restart if > this system requires a high uptime. 
I expect you'd want to have at > least one replica and a setup which allows you to flip traffic to it > very quickly to maintain the 30ms response times. > I'm replicating via messaging. PG replication is fine for smaller db's but I don't trust networks and PG upgrade intricacies complicate matters. -Greg
Re: [PERFORM] Millions of tables
Greg, * Greg Spiegelberg (gspiegelb...@gmail.com) wrote: > Bigger buckets mean a wider possibility of response times. Some buckets > may contain 140k records and some 100X more. Have you analyzed the depth of the btree indexes to see how many more pages need to be read to handle finding a row in 140k records vs. 14M records vs. 140M records? I suspect you'd find that the change in actual depth (meaning how many pages have to actually be read to find the row you're looking for) isn't very much and that your concern over the "wider possibility of response times" isn't well founded. Since you have a hard-set 30ms maximum for query response time, I would suggest you work out how long it takes to read a cold page from your I/O subsystem and then you can work through exactly how many page reads could be done in that 30ms (or perhaps 20ms, to allow for whatever overhead there will be in the rest of the system and as a buffer) and then work that back to how deep the index can be based on that many page reads and then how many records are required to create an index of that depth. Of course, the page from the heap will also need to be read and there's a bit of additional work to be done, but the disk i/o for cold pages is almost certainly where most time will be spent. I suspect you'll discover that millions of tables is a couple orders of magnitude off of how many you'd need to keep the number of page reads below the threshold you work out based on your I/O. Of course, you would need a consistent I/O subsystem, or at least one where you know the maximum possible latency to pull a cold page. Lastly, you'll want to figure out how to handle system crash/restart if this system requires a high uptime. I expect you'd want to have at least one replica and a setup which allows you to flip traffic to it very quickly to maintain the 30ms response times. Thanks! Stephen
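[Editor's note] Stephen's two estimates (btree depth barely grows with row count; a 30ms budget covers many cold reads) can be roughed out numerically. The fanout and per-page latency below are assumptions for illustration, not measurements of Greg's rig.

```python
# Btree depth grows with log_fanout(rows), so 140k vs. 140M rows changes
# depth very little, and most of the 30ms budget remains for other work.
import math

FANOUT = 256            # keys per 8KB internal btree page, assumed
COLD_READ_MS = 0.5      # random cold page read on provisioned SSD, assumed

def btree_depth(rows: int) -> int:
    return max(1, math.ceil(math.log(rows, FANOUT)))

for rows in (140_000, 14_000_000, 140_000_000):
    print(rows, btree_depth(rows))     # depths come out 3, 3, 4 — barely moves

# Using 20ms of the 30ms budget, per Stephen's suggested headroom:
print(int(20 / COLD_READ_MS))          # → 40 cold page reads fit the budget
```

Going from 140k to 140M rows adds roughly one page read per lookup under these assumptions, which is why the "wider possibility of response times" concern is likely overstated.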
Re: [PERFORM] Millions of tables
On Wed, Sep 28, 2016 at 9:39 AM, Vitalii Tymchyshyn wrote: > Have you considered having many databases (e.g. 100) and possibly many > postgresql servers (e.g. 10) started on different ports? > This would give you 1000x less tables per db. > The system design already allows for many database servers. 40 is okay, 100 isn't terrible but if it's thousands then operations might lynch me. -Greg
Re: [PERFORM] Millions of tables
If going that route, why not just use plproxy? On Wed, Sep 28, 2016 at 11:39 AM, Vitalii Tymchyshyn wrote: > Have you considered having many databases (e.g. 100) and possibly many > postgresql servers (e.g. 10) started on different ports? > This would give you 1000x less tables per db.
Re: [PERFORM] Millions of tables
Have you considered having many databases (e.g. 100) and possibly many postgresql servers (e.g. 10) started on different ports? This would give you 1000x less tables per db.
Re: [PERFORM] Millions of tables
On Tue, Sep 27, 2016 at 10:15 AM, Terry Schmitt wrote: > > > On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg > wrote: > >> Hey all, >> >> Obviously everyone who's been in PostgreSQL or almost any RDBMS for a >> time has said not to have millions of tables. I too have long believed it >> until recently. >> >> AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) >> for PGDATA. Over the weekend, I created 8M tables with 16M indexes on >> those tables. Table creation initially took 0.018031 secs, average >> 0.027467 and after tossing out outliers (qty 5) the maximum creation time >> found was 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 >> seconds. Tables were created by a single process. Do note that table >> creation is done via plpgsql function as there are other housekeeping tasks >> necessary though minimal. >> >> No system tuning but here is a list of PostgreSQL knobs and switches: >> shared_buffers = 2GB >> work_mem = 48 MB >> max_stack_depth = 4 MB >> synchronous_commit = off >> effective_cache_size = 200 GB >> pg_xlog is on its own file system >> >> There are still some obvious problems. General DBA functions such as >> VACUUM and ANALYZE should not be done. Each will run forever and cause >> much grief. Backups are problematic in the traditional pg_dump and PITR >> space. Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing >> it in my test case) are no-no's. A system or database crash could take >> potentially hours to days to recover. There are likely other issues ahead. >> >> You may wonder, "why is Greg attempting such a thing?" I looked at >> DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, >> it's antiquated and don't get me started on "Hadoop". I looked at many >> others and ultimately the recommended use of each vendor was to have one >> table for all data. That overcomes the millions of tables problem, right? 
>> >> Problem with the "one big table" solution is I anticipate 1,200 trillion >> records. Random access is expected and the customer expects <30ms reads >> for a single record fetch. >> >> No data is loaded... yet. Table and index creation only. I am interested >> in the opinions of all including tests I may perform. If you had this >> setup, what would you capture / analyze? I have a job running preparing >> data. I did this on a much smaller scale (50k tables) and data load via >> function allowed close to 6,000 records/second. The schema has been >> simplified since and the last test reached just over 20,000 records/second with >> 300k tables. >> >> I'm not looking for alternatives yet but input to my test. Takers? >> >> I can't promise immediate feedback but will do my best to respond with >> results. >> >> TIA, >> -Greg >> > > I have not seen any mention of transaction ID wraparound in this > thread yet. With the numbers that you are looking at, I could see this as a > major issue. > > T > Thank you Terry. You get the gold star. :) I was waiting for that to come up. Success means handling this condition. A whole database vacuum and dump-restore is out of the question. Can a properly tuned autovacuum prevent the situation? -Greg
Re: [PERFORM] Millions of tables
On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg wrote: > Hey all, > > Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time > has said not to have millions of tables. I too have long believed it until > recently. > > AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for > PGDATA. Over the weekend, I created 8M tables with 16M indexes on those > tables. Table creation initially took 0.018031 secs, average 0.027467 and > after tossing out outliers (qty 5) the maximum creation time found was > 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 seconds. > Tables were created by a single process. Do note that table creation is > done via plpgsql function as there are other housekeeping tasks necessary > though minimal. > > No system tuning but here is a list of PostgreSQL knobs and switches: > shared_buffers = 2GB > work_mem = 48 MB > max_stack_depth = 4 MB > synchronous_commit = off > effective_cache_size = 200 GB > pg_xlog is on its own file system > > There are still some obvious problems. General DBA functions such as > VACUUM and ANALYZE should not be done. Each will run forever and cause > much grief. Backups are problematic in the traditional pg_dump and PITR > space. Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing > it in my test case) are no-no's. A system or database crash could take > potentially hours to days to recover. There are likely other issues ahead. > > You may wonder, "why is Greg attempting such a thing?" I looked at > DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, > it's antiquated and don't get me started on "Hadoop". I looked at many > others and ultimately the recommended use of each vendor was to have one > table for all data. That overcomes the millions of tables problem, right? > > Problem with the "one big table" solution is I anticipate 1,200 trillion > records. 
Random access is expected and the customer expects <30ms reads > for a single record fetch. > > No data is loaded... yet. Table and index creation only. I am interested > in the opinions of all including tests I may perform. If you had this > setup, what would you capture / analyze? I have a job running preparing > data. I did this on a much smaller scale (50k tables) and data load via > function allowed close to 6,000 records/second. The schema has been > simplified since and the last test reached just over 20,000 records/second with > 300k tables. > > I'm not looking for alternatives yet but input to my test. Takers? > > I can't promise immediate feedback but will do my best to respond with > results. > > TIA, > -Greg > I have not seen any mention of transaction ID wraparound in this thread yet. With the numbers that you are looking at, I could see this as a major issue. T
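[Editor's note] As a quick arithmetic check on the creation numbers quoted above (8M tables in 30h 31m 8.435049s from a single process), dividing the figures Greg gives:

```python
# Implied sustained table-creation rate from the totals in the post.
total_seconds = 30 * 3600 + 31 * 60 + 8.435049
tables = 8_000_000

print(round(tables / total_seconds, 1))          # → 72.8 tables/second
print(round(total_seconds / tables * 1000, 1))   # → 13.7 ms per table

# Note the implied 13.7 ms/table is about half the 0.027467s average quoted,
# so that average presumably reflects per-call timing of the plpgsql function
# (including its housekeeping) rather than total wall-clock divided by count.
```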
Re: [PERFORM] Millions of tables
On Tue, Sep 27, 2016 at 9:42 AM, Mike Sofen wrote: > *From:* Mike Sofen *Sent:* Tuesday, September 27, 2016 8:10 AM > > *From:* Greg Spiegelberg *Sent:* Monday, September 26, 2016 7:25 AM > I've gotten more responses than anticipated and have answered some > questions and gotten some insight but my challenge again is what should I > capture along the way to prove or disprove this storage pattern? > Alternatives to the storage pattern aside, I need ideas to test rig, > capture metrics and suggestions to tune it. > > > > In the next 24 hours, I will be sending ~1 trillion records to the test > database. Because of time to set up, I'd rather have things set up > properly the first go. > > > > Thanks! > > -Greg > > - > > Greg, I ran another quick test on a wider table than you’ve described, but > this time with 80 million rows, with core counts, ram and ssd storage > similar to what you’d have on that AWS EC2 instance. This table had 7 > columns (3 integers, 3 text, 1 timestamptz) with an average width of 157 > chars, one btree index on the pk int column. Using explain analyze, I > picked one id value out of the 80m and ran a select * where id = x. It did > an index scan, had a planning time of 0.077ms, and an execution time of > 0.254 seconds. I ran the query for a variety of widely spaced values (so > the data was uncached) and the timing never changed. This has been > mirroring my general experience with PG – very fast reads on indexed > queries. > > > > Summary: I think your buckets can be WAY bigger than you are envisioning > for the simple table design you’ve described. I’m betting you can easily > do 500 million rows per bucket before approaching anything close to the > 30ms max query time. > > > > Mike Sofen (Synthetic Genomics) > > > > Totally typo’d the execution time: it was 0.254 MILLISECONDS, not > SECONDS. Thus my comment about going up 10x in bucket size instead of > appearing to be right at the limit. Sorry! > > > I figured. 
:) Haven't ruled it out but the expectation of this implementation is to perform at worst 3X slower than memcache or Redis. Bigger buckets mean a wider possibility of response times. Some buckets may contain 140k records and some 100X more. -Greg
Re: [PERFORM] Millions of tables
On Tue, Sep 27, 2016 at 8:30 AM, Craig James wrote: > On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg > wrote: > >> Hey all, >> >> Obviously everyone who's been in PostgreSQL or almost any RDBMS for a >> time has said not to have millions of tables. I too have long believed it >> until recently. >> >> AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) >> for PGDATA. Over the weekend, I created 8M tables with 16M indexes on >> those tables. Table creation initially took 0.018031 secs, average >> 0.027467 and after tossing out outliers (qty 5) the maximum creation time >> found was 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 >> seconds. Tables were created by a single process. Do note that table >> creation is done via plpgsql function as there are other housekeeping tasks >> necessary though minimal. >> >> No system tuning but here is a list of PostgreSQL knobs and switches: >> shared_buffers = 2GB >> work_mem = 48 MB >> max_stack_depth = 4 MB >> synchronous_commit = off >> effective_cache_size = 200 GB >> pg_xlog is on its own file system >> >> There are still some obvious problems. General DBA functions such as >> VACUUM and ANALYZE should not be done. Each will run forever and cause >> much grief. Backups are problematic in the traditional pg_dump and PITR >> space. Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing >> it in my test case) are no-no's. A system or database crash could take >> potentially hours to days to recover. There are likely other issues ahead. >> >> You may wonder, "why is Greg attempting such a thing?" I looked at >> DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, >> it's antiquated and don't get me started on "Hadoop". I looked at many >> others and ultimately the recommended use of each vendor was to have one >> table for all data. That overcomes the millions of tables problem, right? 
>> >> Problem with the "one big table" solution is I anticipate 1,200 trillion >> records. Random access is expected and the customer expects <30ms reads >> for a single record fetch. >> > > You don't give enough details to fully explain the problem you're trying > to solve. > >- Will records ever be updated or deleted? If so, what percentage and >at what frequency? >- What specifically are you storing (e.g. list of integers, strings, >people's sex habits, ...)? Or more importantly, are these fixed- or >variable-sized records? >- Once the 1,200 trillion records are loaded, is that it? Or do more >data arrive, and if so, at what rate? >- Do your queries change, or is there a fixed set of queries? >- How complex are the joins? > > Excellent questions. 1a. Half of the 4M tables will contain ~140k records and UPDATE's will occur on roughly 100 records/day/table. No DELETE's on this first half. 1b. Second half of the 4M tables will contain ~200k records. Zero UPDATE's; however, DELETE's will occur on ~100 records/day/table. 2. All 4M tables contain 7 columns: (4) bigints, (2) timestamptz and (1) boolean. 2M of the tables will have a PKEY on (1) bigint column only. The second 2M tables have a PKEY on (bigint, timestamptz) and two additional indexes on different (bigint, timestamptz) column pairs. 3. The trillions-of-records load is just to push the system to find the maximum record load capability. Reality, 200M records / day or ~2,300/second average is the expectation once in production. 4. Queries are fixed and match the indexes laid down on the tables. Goal is <30ms/query. I have attempted queries with and without indexes. Without indexes the average query response varied between 20ms and 40ms whereas indexes respond within a much tighter range of 5ms to 9ms. Both query performance tests were done during data-ingest. 5. Zero JOIN's and I won't let it ever happen. However the 4M tables INHERIT a data grouping table. Test rig limits child tables to 1,000/parent. 
This was done to explore some other possible access patterns but they are secondary and if it doesn't work then either a) the requirement will be dropped or b) I may look at storing the data in the 1,000 child tables directly in the parent table and I'll need to re-run load & read tests. > The reason I ask these specific questions is because, as others have > pointed out, this might be a perfect case for a custom (non-relational) > database. Relational databases are general-purpose tools, sort of like a > Swiss-Army knife. A Swiss-Army knife does most things passably, but if you > want to carve wood, or butcher meat, or slice vegetables, you get a knife > meant for that specific task. > > I'm at a sizing phase. If 4M tables works I'll attempt 16M tables. If it points to only 2M or 1M then that's fine. The 4M table database is only a single cog in the storage service design. Anticipating ~40 of these databases, but it is dependent upon how many tables work.
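The bucket-placement question raised upthread can be made concrete. A minimal sketch, assuming hash-style routing of a bigint key to one of the 2M tables per group; the naming scheme and modulo rule here are illustrative assumptions, not Greg's actual criteria:

```python
# Hypothetical routing sketch: map a record's bigint key to one of the
# 2M bucket tables in a group. The table-name format and the modulo
# rule are assumptions for illustration only.
N_TABLES_PER_GROUP = 2_000_000

def bucket_table(key: int, group: str) -> str:
    """Return a deterministic bucket-table name for a bigint key."""
    idx = key % N_TABLES_PER_GROUP
    return f"{group}_bucket_{idx:07d}"

# The same key always lands in the same table, so a single-record fetch
# touches exactly one small table and one small index.
```

Whatever the real criteria turn out to be, the indexing question then reduces to: what does a single bucket table need to answer its fixed queries?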
Re: [PERFORM] Millions of tables
From: Mike Sofen Sent: Tuesday, September 27, 2016 8:10 AM From: Greg Spiegelberg Sent: Monday, September 26, 2016 7:25 AM I've gotten more responses than anticipated and have answered some questions and gotten some insight but my challenge again is what should I capture along the way to prove or disprove this storage pattern? Alternatives to the storage pattern aside, I need ideas to test rig, capture metrics and suggestions to tune it. In the next 24 hours, I will be sending ~1 trillion records to the test database. Because of time to set up, I'd rather have things set up properly the first go. Thanks! -Greg - Greg, I ran another quick test on a wider table than you’ve described, but this time with 80 million rows, with core counts, ram and ssd storage similar to what you’d have on that AWS EC2 instance. This table had 7 columns (3 integers, 3 text, 1 timestamptz) with an average width of 157 chars, one btree index on the pk int column. Using explain analyze, I picked one id value out of the 80m and ran a select * where id = x. It did an index scan, had a planning time of 0.077ms, and an execution time of 0.254 seconds. I ran the query for a variety of widely spaced values (so the data was uncached) and the timing never changed. This has been mirroring my general experience with PG – very fast reads on indexed queries. Summary: I think your buckets can be WAY bigger than you are envisioning for the simple table design you’ve described. I’m betting you can easily do 500 million rows per bucket before approaching anything close to the 30ms max query time. Mike Sofen (Synthetic Genomics) Totally typo’d the execution time: it was 0.254 MILLISECONDS, not SECONDS. Thus my comment about going up 10x in bucket size instead of appearing to be right at the limit. Sorry! Mike
Re: [PERFORM] Millions of tables
From: Greg Spiegelberg Sent: Monday, September 26, 2016 7:25 AM I've gotten more responses than anticipated and have answered some questions and gotten some insight but my challenge again is what should I capture along the way to prove or disprove this storage pattern? Alternatives to the storage pattern aside, I need ideas to test rig, capture metrics and suggestions to tune it. In the next 24 hours, I will be sending ~1 trillion records to the test database. Because of time to set up, I'd rather have things set up properly the first go. Thanks! -Greg - Greg, I ran another quick test on a wider table than you’ve described, but this time with 80 million rows, with core counts, ram and ssd storage similar to what you’d have on that AWS EC2 instance. This table had 7 columns (3 integers, 3 text, 1 timestamptz) with an average width of 157 chars, one btree index on the pk int column. Using explain analyze, I picked one id value out of the 80m and ran a select * where id = x. It did an index scan, had a planning time of 0.077ms, and an execution time of 0.254 seconds. I ran the query for a variety of widely spaced values (so the data was uncached) and the timing never changed. This has been mirroring my general experience with PG – very fast reads on indexed queries. Summary: I think your buckets can be WAY bigger than you are envisioning for the simple table design you’ve described. I’m betting you can easily do 500 million rows per bucket before approaching anything close to the 30ms max query time. Mike Sofen (Synthetic Genomics)
Re: [PERFORM] Millions of tables
On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelbergwrote: > Hey all, > > Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time > has said not to have millions of tables. I too have long believed it until > recently. > > AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for > PGDATA. Over the weekend, I created 8M tables with 16M indexes on those > tables. Table creation initially took 0.018031 secs, average 0.027467 and > after tossing out outliers (qty 5) the maximum creation time found was > 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 seconds. > Tables were created by a single process. Do note that table creation is > done via plpgsql function as there are other housekeeping tasks necessary > though minimal. > > No system tuning but here is a list of PostgreSQL knobs and switches: > shared_buffers = 2GB > work_mem = 48 MB > max_stack_depth = 4 MB > synchronous_commit = off > effective_cache_size = 200 GB > pg_xlog is on it's own file system > > There are some still obvious problems. General DBA functions such as > VACUUM and ANALYZE should not be done. Each will run forever and cause > much grief. Backups are problematic in the traditional pg_dump and PITR > space. Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing > it in my test case) are no-no's. A system or database crash could take > potentially hours to days to recover. There are likely other issues ahead. > > You may wonder, "why is Greg attempting such a thing?" I looked at > DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, > it's antiquated and don't get me started on "Hadoop". I looked at many > others and ultimately the recommended use of each vendor was to have one > table for all data. That overcomes the millions of tables problem, right? > > Problem with the "one big table" solution is I anticipate 1,200 trillion > records. 
Random access is expected and the customer expects <30ms reads > for a single record fetch. > You don't give enough details to fully explain the problem you're trying to solve. - Will records ever be updated or deleted? If so, what percentage and at what frequency? - What specifically are you storing (e.g. list of integers, strings, people's sex habits, ...)? Or more importantly, are these fixed- or variable-sized records? - Once the 1,200 trillion records are loaded, is that it? Or do more data arrive, and if so, at what rate? - Do your queries change, or is there a fixed set of queries? - How complex are the joins? The reason I ask these specific questions is because, as others have pointed out, this might be a perfect case for a custom (non-relational) database. Relational databases are general-purpose tools, sort of like a Swiss-Army knife. A Swiss-Army knife does most things passably, but if you want to carve wood, or butcher meat, or slice vegetables, you get a knife meant for that specific task. I've written several custom database-storage systems for very specific high-performance systems. It's generally a couple weeks of work, and you have a tailored performance and storage that's hard for a general-purpose relational system to match. The difficulty of building such a system depends a lot on the answers to the questions above. Craig > No data is loaded... yet Table and index creation only. I am interested > in the opinions of all including tests I may perform. If you had this > setup, what would you capture / analyze? I have a job running preparing > data. I did this on a much smaller scale (50k tables) and data load via > function allowed close to 6,000 records/second. The schema has been > simplified since and last test reach just over 20,000 records/second with > 300k tables. > > I'm not looking for alternatives yet but input to my test. Takers? > > I can't promise immediate feedback but will do my best to respond with > results. 
> > TIA, > -Greg > -- - Craig A. James Chief Technology Officer eMolecules, Inc. -
Re: [PERFORM] Millions of tables
Jeff Janes writes: > A problem is that those statistics are stored in one file (per database; it > used to be one file per cluster). With 8 million tables, that is going to > be a pretty big file. But the code pretty much assumes the file is going > to be pretty small, and so it has no compunction about commanding that it > be read and written, in its entirety, quite often. I don't know that anyone ever believed it would be small. But at the time the pgstats code was written, there was no good alternative to passing the data through files. (And I'm not sure we envisioned applications that would be demanding fresh data constantly, anyway.) Now that the DSM stuff exists and has been more or less shaken out, I wonder how practical it'd be to use a DSM segment to make the stats collector's data available to backends. You'd need a workaround for the fact that not all the DSM implementations support resize (although given the lack of callers of dsm_resize, one could be forgiven for wondering whether any of that code has been tested at all). But you could imagine abandoning one DSM segment and creating a new one of double the size anytime the hash tables got too big. regards, tom lane
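The abandon-and-recreate growth pattern Tom describes can be sketched in miniature. This is a pure-Python stand-in for illustration only; real DSM segments live in shared memory and are managed by PostgreSQL's dsm_create machinery:

```python
# Sketch of "abandon one DSM segment and create a new one of double the
# size": when a fixed-size segment fills, allocate a double-sized
# replacement, copy the live data across, and drop the old one.
class GrowableSegment:
    def __init__(self, size=1024):
        self.buf = bytearray(size)
        self.used = 0

    def append(self, data: bytes):
        while self.used + len(data) > len(self.buf):
            new = bytearray(len(self.buf) * 2)   # create one of double the size
            new[:self.used] = self.buf[:self.used]
            self.buf = new                        # old segment abandoned
        self.buf[self.used:self.used + len(data)] = data
        self.used += len(data)

seg = GrowableSegment(size=8)
seg.append(b"x" * 20)   # forces two doublings: 8 -> 16 -> 32
```

Doubling keeps the amortized copy cost linear, which is why it is the usual answer when an allocator cannot resize in place.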
Re: [PERFORM] Millions of tables
On Mon, Sep 26, 2016 at 5:53 AM, Greg Spiegelberg wrote: > >> >> I may need to understand autovacuum better. My impression was it > consulted statistics and performed vacuums one table at a time based on the > vacuum threshold formula on https://www.postgresql.org/docs/9.5/static/routine-vacuuming.html. > A problem is that those statistics are stored in one file (per database; it used to be one file per cluster). With 8 million tables, that is going to be a pretty big file. But the code pretty much assumes the file is going to be pretty small, and so it has no compunction about commanding that it be read and written, in its entirety, quite often. Cheers, Jeff
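Jeff's concern is easy to put in rough numbers. A back-of-envelope sketch, where the per-table entry size is an assumed figure for illustration only (the real pgstat entry layout differs):

```python
# Rough stats-file sizing. BYTES_PER_TABLE_ENTRY is an assumption,
# not the actual pgstat entry size.
BYTES_PER_TABLE_ENTRY = 150

def stats_file_mb(n_tables: int) -> float:
    """Estimated per-database stats file size in MiB."""
    return n_tables * BYTES_PER_TABLE_ENTRY / 1024 / 1024

# At 8M tables, even a modest entry size means a file around a
# gigabyte that gets read and written "in its entirety, quite often".
```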
Re: [PERFORM] Millions of tables
On Sun, Sep 25, 2016 at 8:50 PM, Greg Spiegelbergwrote: > Hey all, > > Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time > has said not to have millions of tables. I too have long believed it until > recently. > > AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for > PGDATA. Over the weekend, I created 8M tables with 16M indexes on those > tables. Table creation initially took 0.018031 secs, average 0.027467 and > after tossing out outliers (qty 5) the maximum creation time found was > 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 seconds. > Tables were created by a single process. Do note that table creation is > done via plpgsql function as there are other housekeeping tasks necessary > though minimal. > > No system tuning but here is a list of PostgreSQL knobs and switches: > shared_buffers = 2GB > work_mem = 48 MB > max_stack_depth = 4 MB > synchronous_commit = off > effective_cache_size = 200 GB > pg_xlog is on it's own file system > > There are some still obvious problems. General DBA functions such as > VACUUM and ANALYZE should not be done. Each will run forever and cause > much grief. Backups are problematic in the traditional pg_dump and PITR > space. Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing > it in my test case) are no-no's. A system or database crash could take > potentially hours to days to recover. There are likely other issues ahead. > > You may wonder, "why is Greg attempting such a thing?" I looked at > DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, > it's antiquated and don't get me started on "Hadoop". I looked at many > others and ultimately the recommended use of each vendor was to have one > table for all data. That overcomes the millions of tables problem, right? > > Problem with the "one big table" solution is I anticipate 1,200 trillion > records. 
Random access is expected and the customer expects <30ms reads > for a single record fetch. > > No data is loaded... yet Table and index creation only. I am interested > in the opinions of all including tests I may perform. If you had this > setup, what would you capture / analyze? I have a job running preparing > data. I did this on a much smaller scale (50k tables) and data load via > function allowed close to 6,000 records/second. The schema has been > simplified since and last test reach just over 20,000 records/second with > 300k tables. > > I'm not looking for alternatives yet but input to my test. Takers? > > I can't promise immediate feedback but will do my best to respond with > results. > > TIA, > -Greg > I've gotten more responses than anticipated and have answered some questions and gotten some insight but my challenge again is what should I capture along the way to prove or disprove this storage pattern? Alternatives to the storage pattern aside, I need ideas to test rig, capture metrics and suggestions to tune it. In the next 24 hours, I will be sending ~1 trillion records to the test database. Because of time to set up, I'd rather have things set up properly the first go. Thanks! -Greg
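On what to capture during the trillion-record load: one cheap, useful metric is per-operation latency, summarized the same way as the table-creation figures quoted above (average, plus maximum after discarding a handful of outliers). A minimal sketch:

```python
# Summarize captured per-operation latencies: overall average, and the
# maximum after tossing out the top few outliers (qty 5 above).
def summarize(durations, drop_outliers=5):
    ordered = sorted(durations)
    kept = ordered[:-drop_outliers] if drop_outliers else ordered
    return {
        "avg": sum(ordered) / len(ordered),
        "max_after_outliers": kept[-1],
    }

stats = summarize([0.02, 0.03, 0.02, 0.7, 0.9, 1.1, 1.2, 5.0, 0.66, 0.025, 0.028])
```

Capturing raw durations rather than running aggregates also lets the percentile cut-offs be chosen after the fact.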
Re: [PERFORM] Millions of tables
On 26 September 2016 at 20:51, Greg Spiegelberg wrote: > > An alternative if you exhaust or don't trust other options, use a foreign >> data wrapper to access your own custom storage. A single table at the PG >> level, you can shard the data yourself into 8 bazillion separate stores, in >> whatever structure suits your read and write operations (maybe reusing an >> embedded db engine, ordered flat file+log+index, whatever). >> >> > However even 8 bazillion FDW's may cause an "overflow" of relationships at > the loss of having an efficient storage engine acting more like a traffic > cop. In such a case, I would opt to put such logic in the app to directly > access the true storage over using FDW's. > I mean one fdw table, which shards internally to 8 bazillion stores on disk. It has the sharding key, can calculate exactly which store(s) need to be hit, and returns the rows, and to PostgreSQL it looks like 1 big table with 1.3 trillion rows. And if it doesn't do that in 30ms you get to blame yourself :) -- Stuart Bishop http://www.stuartbishop.net/
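The routing Stuart describes inside such an FDW can be sketched as a pure function of the sharding key. The store count, directory fanout, and path layout below are illustrative assumptions:

```python
# Hypothetical FDW-internal routing: from the sharding key, compute
# exactly which on-disk store holds the row. One PG-level table,
# millions of stores underneath.
N_STORES = 8_000_000

def store_path(shard_key: int, base="/data/stores") -> str:
    store = shard_key % N_STORES
    # two-level directory fanout so no directory holds millions of files
    return f"{base}/{store // 10_000:04d}/{store % 10_000:04d}.store"
```

Because the store is computed, not searched for, the lookup cost is constant regardless of how many stores exist, which is the whole appeal of the scheme.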
Re: [PERFORM] Millions of tables
On Mon, Sep 26, 2016 at 4:23 AM, Rick Ottenwrote: > Are the tables constantly being written to, or is this a mostly read > scenario? One architecture possibility, if the writes are not so > frequent, is to create just a handful of very big tables for writing, and > then make smaller tables as materialized views for reading. The vacuum and > bloat management could be done back a the big tables. The materialized > views could be refreshed or replaced during non-peak hours. The > materialized views could be on a different tablespace than the root > tables. They could also be structured to reflect real-world query patterns > which are sometimes different than the raw data storage engineering problem. > > Like any data warehouse, I expect 90%+ of the activity being writes but the necessity of low latency reads is an absolute must else the design doesn't get off the ground. Materialized views are neat for many cases but not this one. Current versions of the data must be available the moment after they are written. I am considering tablespaces however I have no way to properly size. One group of tables may contain 1M records and another 100M. Could be on the same file system but I'd have the same problems internal to PostgreSQL and the only thing overcome is millions of files in a single directory which is a file system selection problem. > With some logging you may be able to see that the data is not truly > randomly accessed, but rather clustered around just some of the millions of > tables. Then the engineering problem becomes "How do I service 90% of the > queries on these tables in 30ms ?" Rather than "How do I service 100% of > the queries 100% of the time in 30ms?" Knowing 90% of the queries hit just > a few hundred tables, makes the first question easier to answer. 
> > Similarly, if most of the columns are static and only a few columns are > actually changing, you could consider pulling the static stuff out of the > same table with the dynamic stuff and then look at joins in your queries. > The end goal is to be able to get solid indexes and tables that don't > change a lot so they can be tightly packed and cached. (less bloat, less > fragmentation, fewer disk accesses). > > With regards to consistent query performance, I think you need to get out > of AWS. That environment is terrible if you are going for consistency > unless you buy dedicated hardware, and then you are paying so much money it > is ridiculous. > > True about AWS and though it is possible hardware may be purchased AWS is the right place to start. 1) AWS is not IT and won't take months to approve budget for gear+deployment and more importantly 2) still in design phase and if deployed there is no way to predict true adoption meaning it'll start small. AWS is the right place for now. > Also I think having 10M rows in a table is not a problem for the query > times you are referring to. So instead of millions of tables, unless I'm > doing my math wrong, you probably only need thousands of tables. > > > > On Mon, Sep 26, 2016 at 5:43 AM, Stuart Bishop > wrote: > >> On 26 September 2016 at 11:19, Greg Spiegelberg >> wrote: >> >>> I did look at PostgresXL and CitusDB. Both are admirable however >>> neither could support the need to read a random record consistently under >>> 30ms. It's a similar problem Cassandra and others have: network latency. >>> At this scale, to provide the ability to access any given record amongst >>> trillions it is imperative to know precisely where it is stored (system & >>> database) and read a relatively small index. I have other requirements >>> that prohibit use of any technology that is eventually consistent. >>> >>> I liken the problem to fishing. 
To find a particular fish of length, >>> size, color in a data lake you must accept the possibility of scanning >>> the entire lake. However, if all fish were in barrels where each barrel >>> had a particular kind of fish of specific length, size, color then the >>> problem is far simpler. >>> >>> -Greg >>> >> >> My gut tells me that if you do solve the problem and get PostgreSQL (or >> anything) reading consistently at under 30ms with that many tables you will >> have solved one problem by creating another. >> >> You discounted Cassandra due to network latency, but are now trying a >> monolithic PostgreSQL setup. It might be worth trying a single node >> ScyllaDB or Cassandra deploy (no need for QUORUM or network overhead), >> perhaps using layered compaction so all your data gets broken out into >> 160MB chunks. And certainly wander over to the ScyllaDB mailing list, as >> they are very focused on performance problems like yours and should offer >> some insight even if a Cassandra style architecture cannot meet your >> requirements. >> >> An alternative if you exhaust or don't trust other options, use a foreign >> data wrapper to access your own custom
Re: [PERFORM] Millions of tables
On Mon, Sep 26, 2016 at 7:05 AM, Mike Sofenwrote: > *From:* Rick Otten *Sent:* Monday, September 26, 2016 3:24 AM > Are the tables constantly being written to, or is this a mostly read > scenario? > > > > With regards to consistent query performance, I think you need to get out > of AWS. That environment is terrible if you are going for consistency > unless you buy dedicated hardware, and then you are paying so much money it > is ridiculous. > > > > Also I think having 10M rows in a table is not a problem for the query > times you are referring to. So instead of millions of tables, unless I'm > doing my math wrong, you probably only need thousands of tables. > > -- > > Excellent thoughts: the read/write behavior will/should drive a lot of > the design; AWS does not guarantee consistency or latency; and 10m rows > is nothing to PG. > > > > Re AWS: we’re on it, at least for now. In my profiling of our > performance there, I consistently get low latencies…I just know that there > will be random higher latencies, but the statistical average will be low. > I just ran a quick test against a modest sized table on a modest sized EC2 > instance (m4.xlarge – 4 core/16gb ram, 3 tb ssd): the table has 15m rows > but is huge (it represents nearly 500m rows compressed in jsonb documents), > with 5 indexed key columns and a total of 12 columns. I queried for a > single, non-PK, indexed value using “select *” (so it included the json) > and it took 22ms, without the json it took 11ms. Especially with the > db/memory-optimized EC2 instances now available (with guaranteed IOPS), > performance against even 100m row tables should still stay within your > requirements. > > > > So Rick’s point about not needing millions of tables is right on. If > there’s a way to create table “clumps”, at least you’ll have a more modest > table count. > > > Absolutely! 
The 8M tables do "belong" to a larger group, and reducing the 8M tables to ~4000 is an option; however, the problem then becomes going from an anticipated 140k records/table to 140M-500M records/table. I'm concerned read access times will go out the window. It is on the docket to test. -Greg
Re: [PERFORM] Millions of tables
Consider the problem though. Random access to trillions of records with no guarantee any one will be fetched twice in a short time frame nullifies the effectiveness of a cache unless the cache is enormous. If such a cache were that big, 100's of TB's, I wouldn't be looking at on-disk storage options. :) -Greg On Mon, Sep 26, 2016 at 6:54 AM, Yves Dorfsman wrote: > Something that is not talked about at all in this thread is caching. A > bunch > of memcache servers in front of the DB should be able to help with the 30ms > constraint (doesn't have to be memcache, some caching technology). > > -- > http://yves.zioup.com > gpg: 4096R/32B0F416
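Greg's point can be quantified: under uniformly random access, the expected cache hit rate is roughly cache size divided by working-set size. The record size below is an assumed figure for illustration:

```python
# Why a cache doesn't help here: expected hit rate under uniformly
# random access is ~ cache_size / working_set_size.
RECORD_BYTES = 60                 # assumed: 4 bigints, 2 timestamptz, 1 bool + overhead
records = 1_200_000_000_000_000   # 1,200 trillion anticipated records
cache_bytes = 512 * 1024**3       # even a generous 512 GiB cache

hit_rate = cache_bytes / (records * RECORD_BYTES)
# on the order of 1e-5 -- effectively every read still goes to storage
```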
Re: [PERFORM] Millions of tables
On Mon, Sep 26, 2016 at 3:43 AM, Stuart Bishopwrote: > On 26 September 2016 at 11:19, Greg Spiegelberg > wrote: > >> I did look at PostgresXL and CitusDB. Both are admirable however neither >> could support the need to read a random record consistently under 30ms. >> It's a similar problem Cassandra and others have: network latency. At this >> scale, to provide the ability to access any given record amongst trillions >> it is imperative to know precisely where it is stored (system & database) >> and read a relatively small index. I have other requirements that prohibit >> use of any technology that is eventually consistent. >> >> I liken the problem to fishing. To find a particular fish of length, >> size, color in a data lake you must accept the possibility of scanning >> the entire lake. However, if all fish were in barrels where each barrel >> had a particular kind of fish of specific length, size, color then the >> problem is far simpler. >> >> -Greg >> > > My gut tells me that if you do solve the problem and get PostgreSQL (or > anything) reading consistently at under 30ms with that many tables you will > have solved one problem by creating another. > > Exactly why I am exploring. What are the trade offs? > You discounted Cassandra due to network latency, but are now trying a > monolithic PostgreSQL setup. It might be worth trying a single node > ScyllaDB or Cassandra deploy (no need for QUORUM or network overhead), > perhaps using layered compaction so all your data gets broken out into > 160MB chunks. And certainly wander over to the ScyllaDB mailing list, as > they are very focused on performance problems like yours and should offer > some insight even if a Cassandra style architecture cannot meet your > requirements. > > Cassandra performance, according to the experts I consulted, starts to fall off once the stored dataset exceeds ~3 TB. Much too small for my use case. 
Again, I do have other reasons for not using Cassandra and others namely deduplication of information referenced by my millions of tables. There are no guarantees in many outside of the RDBMS realm. > An alternative if you exhaust or don't trust other options, use a foreign > data wrapper to access your own custom storage. A single table at the PG > level, you can shard the data yourself into 8 bazillion separate stores, in > whatever structure suites your read and write operations (maybe reusing an > embedded db engine, ordered flat file+log+index, whatever). > > However even 8 bazillion FDW's may cause an "overflow" of relationships at the loss of having an efficient storage engine acting more like a traffic cop. In such a case, I would opt to put such logic in the app to directly access the true storage over using FDW's. -Greg
Re: [PERFORM] Millions of tables
From: Rick Otten Sent: Monday, September 26, 2016 3:24 AM Are the tables constantly being written to, or is this a mostly read scenario? With regards to consistent query performance, I think you need to get out of AWS. That environment is terrible if you are going for consistency unless you buy dedicated hardware, and then you are paying so much money it is ridiculous. Also I think having 10M rows in a table is not a problem for the query times you are referring to. So instead of millions of tables, unless I'm doing my math wrong, you probably only need thousands of tables. -- Excellent thoughts: the read/write behavior will/should drive a lot of the design; AWS does not guarantee consistency or latency; and 10m rows is nothing to PG. Re AWS: we’re on it, at least for now. In my profiling of our performance there, I consistently get low latencies…I just know that there will be random higher latencies, but the statistical average will be low. I just ran a quick test against a modest sized table on a modest sized EC2 instance (m4.xlarge – 4 core/16gb ram, 3 tb ssd): the table has 15m rows but is huge (it represents nearly 500m rows compressed in jsonb documents), with 5 indexed key columns and a total of 12 columns. I queried for a single, non-PK, indexed value using “select *” (so it included the json) and it took 22ms, without the json it took 11ms. Especially with the db/memory-optimized EC2 instances now available (with guaranteed IOPS), performance against even 100m row tables should still stay within your requirements. So Rick’s point about not needing millions of tables is right on. If there’s a way to create table “clumps”, at least you’ll have a more modest table count. Mike Sofen (Synthetic Genomics)
Re: [PERFORM] Millions of tables
Something that is not talked about at all in this thread is caching. A bunch of memcache servers in front of the DB should be able to help with the 30ms constraint (doesn't have to be memcache, some caching technology). -- http://yves.zioup.com gpg: 4096R/32B0F416
Re: [PERFORM] Millions of tables
Following list etiquette response inline ;) On Mon, Sep 26, 2016 at 2:28 AM, Álvaro Hernández Tortosawrote: > > > On 26/09/16 05:50, Greg Spiegelberg wrote: > >> Hey all, >> >> Obviously everyone who's been in PostgreSQL or almost any RDBMS for a >> time has said not to have millions of tables. I too have long believed it >> until recently. >> >> AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) >> for PGDATA. Over the weekend, I created 8M tables with 16M indexes on >> those tables. Table creation initially took 0.018031 secs, average >> 0.027467 and after tossing out outliers (qty 5) the maximum creation time >> found was 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 >> seconds. Tables were created by a single process. Do note that table >> creation is done via plpgsql function as there are other housekeeping tasks >> necessary though minimal. >> >> No system tuning but here is a list of PostgreSQL knobs and switches: >> shared_buffers = 2GB >> work_mem = 48 MB >> max_stack_depth = 4 MB >> synchronous_commit = off >> effective_cache_size = 200 GB >> pg_xlog is on it's own file system >> >> There are some still obvious problems. General DBA functions such as >> VACUUM and ANALYZE should not be done. Each will run forever and cause >> much grief. Backups are problematic in the traditional pg_dump and PITR >> space. Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing >> it in my test case) are no-no's. A system or database crash could take >> potentially hours to days to recover. There are likely other issues ahead. >> >> You may wonder, "why is Greg attempting such a thing?" I looked at >> DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, >> it's antiquated and don't get me started on "Hadoop". I looked at many >> others and ultimately the recommended use of each vendor was to have one >> table for all data. That overcomes the millions of tables problem, right? 
>> >> Problem with the "one big table" solution is I anticipate 1,200 trillion >> records. Random access is expected and the customer expects <30ms reads >> for a single record fetch. >> >> No data is loaded... yet Table and index creation only. I am interested >> in the opinions of all including tests I may perform. If you had this >> setup, what would you capture / analyze? I have a job running preparing >> data. I did this on a much smaller scale (50k tables) and data load via >> function allowed close to 6,000 records/second. The schema has been >> simplified since and last test reach just over 20,000 records/second with >> 300k tables. >> >> I'm not looking for alternatives yet but input to my test. Takers? >> >> I can't promise immediate feedback but will do my best to respond with >> results. >> >> TIA, >> -Greg >> > > Hi Greg. > > This is a problem (creating a large number of tables; really large > indeed) that we researched in my company a while ago. You might want to > read about it: https://www.pgcon.org/2013/schedule/events/595.en.html > > updatedb, funny. Thank you for the pointer. I had no intention of going to 1B tables. I may need to understand autovacuum better. My impression was it consulted statistics and performed vacuums one table at a time based on the vacuum threshold formula on https://www.postgresql.org/docs/9.5/static/routine-vacuuming.html. -Greg
Re: [PERFORM] Millions of tables
Are the tables constantly being written to, or is this a mostly read scenario? One architecture possibility, if the writes are not so frequent, is to create just a handful of very big tables for writing, and then make smaller tables as materialized views for reading. The vacuum and bloat management could be done back at the big tables. The materialized views could be refreshed or replaced during non-peak hours. The materialized views could be on a different tablespace than the root tables. They could also be structured to reflect real-world query patterns, which are sometimes different from the raw data storage engineering problem. With some logging you may be able to see that the data is not truly randomly accessed, but rather clustered around just some of the millions of tables. Then the engineering problem becomes "How do I service 90% of the queries on these tables in 30ms?" rather than "How do I service 100% of the queries 100% of the time in 30ms?" Knowing 90% of the queries hit just a few hundred tables makes the first question easier to answer. Similarly, if most of the columns are static and only a few columns are actually changing, you could consider pulling the static stuff out of the same table as the dynamic stuff and then look at joins in your queries. The end goal is to be able to get solid indexes and tables that don't change a lot so they can be tightly packed and cached (less bloat, less fragmentation, fewer disk accesses). With regards to consistent query performance, I think you need to get out of AWS. That environment is terrible if you are going for consistency unless you buy dedicated hardware, and then you are paying so much money it is ridiculous. Also I think having 10M rows in a table is not a problem for the query times you are referring to. So instead of millions of tables, unless I'm doing my math wrong, you probably only need thousands of tables. 
On Mon, Sep 26, 2016 at 5:43 AM, Stuart Bishop wrote: > On 26 September 2016 at 11:19, Greg Spiegelberg > wrote: > >> I did look at PostgresXL and CitusDB. Both are admirable however neither >> could support the need to read a random record consistently under 30ms. >> It's a similar problem Cassandra and others have: network latency. At this >> scale, to provide the ability to access any given record amongst trillions >> it is imperative to know precisely where it is stored (system & database) >> and read a relatively small index. I have other requirements that prohibit >> use of any technology that is eventually consistent. >> >> I liken the problem to fishing. To find a particular fish of length, >> size, color in a data lake you must accept the possibility of scanning >> the entire lake. However, if all fish were in barrels where each barrel >> had a particular kind of fish of specific length, size, color then the >> problem is far simpler. >> >> -Greg >> > > My gut tells me that if you do solve the problem and get PostgreSQL (or > anything) reading consistently at under 30ms with that many tables you will > have solved one problem by creating another. > > You discounted Cassandra due to network latency, but are now trying a > monolithic PostgreSQL setup. It might be worth trying a single node > ScyllaDB or Cassandra deploy (no need for QUORUM or network overhead), > perhaps using layered compaction so all your data gets broken out into > 160MB chunks. And certainly wander over to the ScyllaDB mailing list, as > they are very focused on performance problems like yours and should offer > some insight even if a Cassandra style architecture cannot meet your > requirements. > > An alternative if you exhaust or don't trust other options, use a foreign > data wrapper to access your own custom storage. 
A single table at the PG > level, you can shard the data yourself into 8 bazillion separate stores, in > whatever structure suits your read and write operations (maybe reusing an > embedded db engine, ordered flat file+log+index, whatever). > > -- > Stuart Bishop > http://www.stuartbishop.net/ >
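The earlier "service 90% of the queries" reframing becomes concrete with a little logging: count hits per table and take the smallest set that covers the target fraction. A sketch, where the table names and the one-entry-per-query log format are assumptions:

```python
from collections import Counter

def hot_tables(query_log, coverage=0.90):
    """Smallest set of tables serving `coverage` of all queries,
    given a log containing one table name per query."""
    counts = Counter(query_log)
    total = sum(counts.values())
    hot, served = [], 0
    for table, n in counts.most_common():
        hot.append(table)
        served += n
        if served / total >= coverage:
            break
    return hot

# Toy log: t1 dominates, so it alone already covers >90% of queries.
log = ["t1"] * 95 + ["t2"] * 3 + ["t3"] * 2
print(hot_tables(log))  # ['t1']
```

If the hot set really is a few hundred tables out of millions, those are the ones worth keeping tightly packed and cached.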
Re: [PERFORM] Millions of tables
On 26 September 2016 at 11:19, Greg Spiegelberg wrote: > I did look at PostgresXL and CitusDB. Both are admirable however neither > could support the need to read a random record consistently under 30ms. > It's a similar problem Cassandra and others have: network latency. At this > scale, to provide the ability to access any given record amongst trillions > it is imperative to know precisely where it is stored (system & database) > and read a relatively small index. I have other requirements that prohibit > use of any technology that is eventually consistent. > > I liken the problem to fishing. To find a particular fish of length, > size, color in a data lake you must accept the possibility of scanning > the entire lake. However, if all fish were in barrels where each barrel > had a particular kind of fish of specific length, size, color then the > problem is far simpler. > > -Greg > My gut tells me that if you do solve the problem and get PostgreSQL (or anything) reading consistently at under 30ms with that many tables you will have solved one problem by creating another. You discounted Cassandra due to network latency, but are now trying a monolithic PostgreSQL setup. It might be worth trying a single node ScyllaDB or Cassandra deploy (no need for QUORUM or network overhead), perhaps using layered compaction so all your data gets broken out into 160MB chunks. And certainly wander over to the ScyllaDB mailing list, as they are very focused on performance problems like yours and should offer some insight even if a Cassandra style architecture cannot meet your requirements. An alternative if you exhaust or don't trust other options, use a foreign data wrapper to access your own custom storage. A single table at the PG level, you can shard the data yourself into 8 bazillion separate stores, in whatever structure suits your read and write operations (maybe reusing an embedded db engine, ordered flat file+log+index, whatever). 
-- Stuart Bishop http://www.stuartbishop.net/
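Stuart's FDW-over-your-own-storage idea hinges on deterministic routing: given a key, compute which store holds it, with no scan and no network hop. A minimal sketch, where N_STORES and the key format are assumptions:

```python
import hashlib

N_STORES = 8_000_000  # "8 bazillion" separate stores; pick your N

def store_for(partition_key: str) -> int:
    """Deterministically route a partition key to one of N backing
    stores (files, embedded DBs, ...) behind a single FDW table."""
    h = hashlib.sha256(partition_key.encode()).digest()
    return int.from_bytes(h[:8], "big") % N_STORES

# The same key always lands in the same store, so a single-record
# fetch opens exactly one small store and one small index:
assert store_for("fish-42") == store_for("fish-42")
```

Note that with modulo hashing, changing N_STORES remaps nearly every key; consistent hashing would be the usual fix if the store count must grow.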
Re: [PERFORM] Millions of tables
On 26/09/16 05:50, Greg Spiegelberg wrote: Hey all, Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time has said not to have millions of tables. I too have long believed it until recently. AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for PGDATA. Over the weekend, I created 8M tables with 16M indexes on those tables. Table creation initially took 0.018031 secs, average 0.027467 and after tossing out outliers (qty 5) the maximum creation time found was 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 seconds. Tables were created by a single process. Do note that table creation is done via plpgsql function as there are other housekeeping tasks necessary though minimal. No system tuning but here is a list of PostgreSQL knobs and switches: shared_buffers = 2GB work_mem = 48 MB max_stack_depth = 4 MB synchronous_commit = off effective_cache_size = 200 GB pg_xlog is on its own file system There are still some obvious problems. General DBA functions such as VACUUM and ANALYZE should not be done. Each will run forever and cause much grief. Backups are problematic in the traditional pg_dump and PITR space. Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in my test case) are no-no's. A system or database crash could take potentially hours to days to recover. There are likely other issues ahead. You may wonder, "why is Greg attempting such a thing?" I looked at DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop". I looked at many others and ultimately the recommended use of each vendor was to have one table for all data. That overcomes the millions of tables problem, right? Problem with the "one big table" solution is I anticipate 1,200 trillion records. Random access is expected and the customer expects <30ms reads for a single record fetch. No data is loaded... yet. Table and index creation only. 
I am interested in the opinions of all including tests I may perform. If you had this setup, what would you capture / analyze? I have a job running preparing data. I did this on a much smaller scale (50k tables) and data load via function allowed close to 6,000 records/second. The schema has been simplified since and the last test reached just over 20,000 records/second with 300k tables. I'm not looking for alternatives yet but input to my test. Takers? I can't promise immediate feedback but will do my best to respond with results. TIA, -Greg Hi Greg. This is a problem (creating a large number of tables; really large indeed) that we researched in my company a while ago. You might want to read about it: https://www.pgcon.org/2013/schedule/events/595.en.html Cheers, Álvaro -- Álvaro Hernández Tortosa --- 8Kdata -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Millions of tables
-sorry for my last email, which was also not bottom posting- Hi Greg, On Mon, Sep 26, 2016 at 11:19 AM, Greg Spiegelberg wrote: > I did look at PostgresXL and CitusDB. Both are admirable however neither > could support the need to read a random record consistently under 30ms. > It's a similar problem Cassandra and others have: network latency. At this > scale, to provide the ability to access any given record amongst trillions > it is imperative to know precisely where it is stored (system & database) > and read a relatively small index. I have other requirements that prohibit > use of any technology that is eventually consistent. > Then you can get below 30ms, but how many processes might you have to run concurrently? This is something you should consider: a single machine has fewer than 50 hardware threads for Intel, 192 HT for POWER8, still far below millions compared with the number of tables (8 million). If you use indexes correctly you would not need sequential scans, since the index search runs in memory (index loaded into memory). Do you plan to query through the master table of the partition? That is actually quite slow, considering the millions of rules to check for every query. With 8 million tables you would require very big data storage for sure, and it would not fit mounted on a single machine unless you are planning to use IBM z machines. > I liken the problem to fishing. To find a particular fish of length, > size, color in a data lake you must accept the possibility of scanning > the entire lake. However, if all fish were in barrels where each barrel > had a particular kind of fish of specific length, size, color then the > problem is far simpler. > >
Re: [PERFORM] Millions of tables
On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg wrote: > Hey all, > > Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time > has said not to have millions of tables. I too have long believed it until > recently. > > AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for > PGDATA. Over the weekend, I created 8M tables with 16M indexes on those > tables. Table creation initially took 0.018031 secs, average 0.027467 and > after tossing out outliers (qty 5) the maximum creation time found was > 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 seconds. > Tables were created by a single process. Do note that table creation is > done via plpgsql function as there are other housekeeping tasks necessary > though minimal. > > No system tuning but here is a list of PostgreSQL knobs and switches: > shared_buffers = 2GB > work_mem = 48 MB > max_stack_depth = 4 MB > synchronous_commit = off > effective_cache_size = 200 GB > pg_xlog is on its own file system > > There are still some obvious problems. General DBA functions such as > VACUUM and ANALYZE should not be done. Each will run forever and cause > much grief. > Why would the auto versions of those cause less grief than the manual versions? > Backups are problematic in the traditional pg_dump and PITR space. > Is there a third option to those two spaces? File-system snapshots? > Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in > my test case) are no-no's. A system or database crash could take > potentially hours to days to recover. > Isn't that a show-stopper? > There are likely other issues ahead. > > You may wonder, "why is Greg attempting such a thing?" I looked at > DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, > it's antiquated and don't get me started on "Hadoop". I looked at many > others and ultimately the recommended use of each vendor was to have one > table for all data. 
That overcomes the millions of tables problem, right? > > Problem with the "one big table" solution is I anticipate 1,200 trillion > records. Random access is expected and the customer expects <30ms reads > for a single record fetch. > Sorry, I don't really follow. Whether you have 1 table or millions, eventually someone has to go get the data off the disk. Why would the number of tables make much of a difference to that fundamental? Also, how many tablespaces do you anticipate having? Can you get 120 petabytes of storage all mounted to one machine? > No data is loaded... yet. Table and index creation only. I am interested > in the opinions of all including tests I may perform. If you had this > setup, what would you capture / analyze? I have a job running preparing > data. I did this on a much smaller scale (50k tables) and data load via > function allowed close to 6,000 records/second. The schema has been > simplified since and the last test reached just over 20,000 records/second with > 300k tables. > > I'm not looking for alternatives yet but input to my test. Takers? > Go through and put one row (or 8kB worth of rows) into each of 8 million tables. The stats collector and the autovacuum process will start going nuts. Now, maybe you can deal with it. But maybe not. That is the first non-obvious thing I'd look at. Cheers, Jeff
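Jeff's suggested stress test, one row in each of 8M tables, is easy to script as batched SQL; the data_<i> table naming is an assumption standing in for Greg's real schema:

```python
def seed_one_row_each(n_tables=8_000_000, batch=10_000):
    """Yield batches of INSERT statements that put one row into every
    table, to surface stats-collector and autovacuum churn."""
    stmts = []
    for i in range(n_tables):
        stmts.append(f"INSERT INTO data_{i} VALUES (DEFAULT);")
        if len(stmts) == batch:
            yield "\n".join(stmts)
            stmts = []
    if stmts:
        yield "\n".join(stmts)

# Feed each batch to psql in one round trip; tiny demo with 3 tables:
print(next(seed_one_row_each(n_tables=3, batch=10)))
```

Watching pg_stat_activity and the autovacuum log while this runs would show whether the bookkeeping, not the I/O, is the first thing to fall over.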
Re: [PERFORM] Millions of tables
Hi Greg, Please follow the conventions of this mailing list, to avoid confusion - see bottom of this posting for further comments On 26/09/16 17:05, Greg Spiegelberg wrote: Precisely why I shared with the group. I must understand the risks involved. I need to explore: if it can be stable at this size, when does it become unstable? Aside from locking down user access to superuser, is there a way to prohibit database-wide VACUUM & ANALYZE? Certainly putting my trust in autovacuum :) which is something I have not yet fully explored how to best tune. Couple more numbers... ~231 GB is the size of PGDATA with 8M empty tables and 16M empty indexes. ~5% of inodes on the file system have been used. Sar data during the 8M table creation shows a very stable and regular I/O pattern. Not a blip worth mentioning. Another point worth mentioning, the tables contain a boolean, int8's and timestamptz's only. Nothing of variable size like bytea, text, json or xml. Each of the 8M tables will contain on the very high side between 140k and 200k records. The application also has a heads up as to which table contains which record. The searches come in saying "give me record X from partition key Y" where Y identifies the table and X is used in the filter on the table. Last point, add column will never be done. I can hear eyes rolling :) but the schema and its intended use is complete. You'll have to trust me on that one. -Greg On Sun, Sep 25, 2016 at 9:23 PM, Mike Sofen wrote: *From:*Greg Spiegelberg *Sent:* Sunday, September 25, 2016 7:50 PM … Over the weekend, I created 8M tables with 16M indexes on those tables. … A system or database crash could take potentially hours to days to recover. There are likely other issues ahead. You may wonder, "why is Greg attempting such a thing?" I looked at DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop". 
Problem with the "one big table" solution is I anticipate 1,200 trillion records. Random access is expected and the customer expects <30ms reads for a single record fetch. I'm not looking for alternatives yet but input to my test. _ Holy guacamole, batman! Ok, here’s my take: you’ve traded the risks/limitations of the known for the risks of the unknown. The unknown being, in the numerous places where postgres historical development may have cut corners, you may be the first to exercise those corners and flame out like the recent SpaceX rocket. Put it another way – you’re going to bet your career (perhaps) or a client’s future on an architectural model that just doesn’t seem feasible. I think you’ve got a remarkable design problem to solve, and am glad you’ve chosen to share that problem with us. And I do think it will boil down to this: it’s not that you CAN do it on Postgres (which you clearly can), but once in production, assuming things are actually stable, how will you handle the data management aspects like inevitable breakage, data integrity issues, backups, restores, user contention for resources, fault tolerance and disaster recovery. Just listing the tables will take forever. Add a column? Never. I do think the amount of testing you’ll need to do to prove that every normal data management function still works at that table count…that in itself is going to be not a lot of fun. This one hurts my head. Ironically, the most logical destination for this type of data may actually be Hadoop – auto-scale, auto-shard, fault tolerant, etc…and I’m not a Hadoopie. I am looking forward to hearing how this all plays out, it will be quite an adventure! All the best, Mike Sofen (Synthetic Genomics…on Postgres 9.5x) In this list, the convention is to post replies at the end (with some rare exceptions), or interspersed when appropriate, and to omit parts no longer relevant. 
The motivation for bottom posting like this is that people get to see the context before the reply, AND emails don't end up getting longer and longer as people reply at the beginning, forgetting to trim the now-irrelevant stuff at the end. Cheers, Gavin
Re: [PERFORM] Millions of tables
I did look at PostgresXL and CitusDB. Both are admirable however neither could support the need to read a random record consistently under 30ms. It's a similar problem Cassandra and others have: network latency. At this scale, to provide the ability to access any given record amongst trillions it is imperative to know precisely where it is stored (system & database) and read a relatively small index. I have other requirements that prohibit use of any technology that is eventually consistent. I liken the problem to fishing. To find a particular fish of length, size, color in a data lake you must accept the possibility of scanning the entire lake. However, if all fish were in barrels where each barrel had a particular kind of fish of specific length, size, color then the problem is far simpler. -Greg On Sun, Sep 25, 2016 at 9:04 PM, julyanto SUTANDANG wrote: > Dear Greg, > > Have you checked PostgresXL ? > with millions of tables, how do the apps choose which table is appropriate? > in my opinion, with that scale it should go with parallel query with > data sharing like what PostgresXL has done. > > Thanks, > > > Julyanto SUTANDANG > > Equnix Business Solutions, PT > (An Open Source and Open Mind Company) > www.equnix.co.id > Pusat Niaga ITC Roxy Mas Blok C2/42. Jl. KH Hasyim Ashari 125, Jakarta > Pusat > T: +6221 2282 F: +62216315281 M: +628164858028 > > > Caution: The information enclosed in this email (and any attachments) > may be legally privileged and/or confidential and is intended only for > the use of the addressee(s). No addressee should forward, print, copy, > or otherwise reproduce this message in any manner that would allow it > to be viewed by any individual not originally listed as a recipient. > If the reader of this message is not the intended recipient, you are > hereby notified that any unauthorized disclosure, dissemination, > distribution, copying or the taking of any action in reliance on the > information herein is strictly prohibited. 
If you have received this > communication in error, please immediately notify the sender and > delete this message.Unless it is made by the authorized person, any > views expressed in this message are those of the individual sender and > may not necessarily reflect the views of PT Equnix Business Solutions. > > > On Mon, Sep 26, 2016 at 9:50 AM, Greg Spiegelberg > wrote: > > Hey all, > > > > Obviously everyone who's been in PostgreSQL or almost any RDBMS for a > time > > has said not to have millions of tables. I too have long believed it > until > > recently. > > > > AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) > for > > PGDATA. Over the weekend, I created 8M tables with 16M indexes on those > > tables. Table creation initially took 0.018031 secs, average 0.027467 > and > > after tossing out outliers (qty 5) the maximum creation time found was > > 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 seconds. > > Tables were created by a single process. Do note that table creation is > > done via plpgsql function as there are other housekeeping tasks necessary > > though minimal. > > > > No system tuning but here is a list of PostgreSQL knobs and switches: > > shared_buffers = 2GB > > work_mem = 48 MB > > max_stack_depth = 4 MB > > synchronous_commit = off > > effective_cache_size = 200 GB > > pg_xlog is on it's own file system > > > > There are some still obvious problems. General DBA functions such as > VACUUM > > and ANALYZE should not be done. Each will run forever and cause much > grief. > > Backups are problematic in the traditional pg_dump and PITR space. Large > > JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in my > test > > case) are no-no's. A system or database crash could take potentially > hours > > to days to recover. There are likely other issues ahead. > > > > You may wonder, "why is Greg attempting such a thing?" I looked at > > DynamoDB, BigTable, and Cassandra. 
I like Greenplum but, let's face it, > > it's antiquated and don't get me started on "Hadoop". I looked at many > > others and ultimately the recommended use of each vendor was to have one > > table for all data. That overcomes the millions of tables problem, > right? > > > > Problem with the "one big table" solution is I anticipate 1,200 trillion > > records. Random access is expected and the customer expects <30ms reads > for > > a single record fetch. > > > > No data is loaded... yet. Table and index creation only. I am > interested in > > the opinions of all including tests I may perform. If you had this > setup, > > what would you capture / analyze? I have a job running preparing data. > I > > did this on a much smaller scale (50k tables) and data load via function > > allowed close to 6,000 records/second. The schema has been simplified > since > > and the last test reached just over 20,000 records/second with 300k tables. > > > > I'm not looking for alternatives yet but input to my test. Takers?
Re: [PERFORM] Millions of tables
Precisely why I shared with the group. I must understand the risks involved. I need to explore: if it can be stable at this size, when does it become unstable? Aside from locking down user access to superuser, is there a way to prohibit database-wide VACUUM & ANALYZE? Certainly putting my trust in autovacuum :) which is something I have not yet fully explored how to best tune. Couple more numbers... ~231 GB is the size of PGDATA with 8M empty tables and 16M empty indexes. ~5% of inodes on the file system have been used. Sar data during the 8M table creation shows a very stable and regular I/O pattern. Not a blip worth mentioning. Another point worth mentioning, the tables contain a boolean, int8's and timestamptz's only. Nothing of variable size like bytea, text, json or xml. Each of the 8M tables will contain on the very high side between 140k and 200k records. The application also has a heads up as to which table contains which record. The searches come in saying "give me record X from partition key Y" where Y identifies the table and X is used in the filter on the table. Last point, add column will never be done. I can hear eyes rolling :) but the schema and its intended use is complete. You'll have to trust me on that one. -Greg On Sun, Sep 25, 2016 at 9:23 PM, Mike Sofen wrote: > *From:* Greg Spiegelberg *Sent:* Sunday, September 25, 2016 7:50 PM > … Over the weekend, I created 8M tables with 16M indexes on those tables. > > … A system or database crash could take potentially hours to days to > recover. There are likely other issues ahead. > > > > You may wonder, "why is Greg attempting such a thing?" I looked at > DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, > it's antiquated and don't get me started on "Hadoop". Problem with the > "one big table" solution is I anticipate 1,200 trillion records. Random > access is expected and the customer expects <30ms reads for a single record > fetch. 
> > > > I'm not looking for alternatives yet but input to my test. > > _ > > > > Holy guacamole, batman! Ok, here’s my take: you’ve traded the > risks/limitations of the known for the risks of the unknown. The unknown > being, in the numerous places where postgres historical development may > have cut corners, you may be the first to exercise those corners and flame > out like the recent SpaceX rocket. > > > > Put it another way – you’re going to bet your career (perhaps) or a > client’s future on an architectural model that just doesn’t seem feasible. > I think you’ve got a remarkable design problem to solve, and am glad you’ve > chosen to share that problem with us. > > > > And I do think it will boil down to this: it’s not that you CAN do it on > Postgres (which you clearly can), but once in production, assuming things > are actually stable, how will you handle the data management aspects like > inevitable breakage, data integrity issues, backups, restores, user > contention for resources, fault tolerance and disaster recovery. Just > listing the tables will take forever. Add a column? Never. I do think > the amount of testing you’ll need to do to prove that every normal data > management function still works at that table count…that in itself is going > to be not a lot of fun. > > > > This one hurts my head. Ironically, the most logical destination for this > type of data may actually be Hadoop – auto-scale, auto-shard, fault > tolerant, etc…and I’m not a Hadoopie. > > > > I am looking forward to hearing how this all plays out, it will be quite > an adventure! All the best, > > > > Mike Sofen (Synthetic Genomics…on Postgres 9.5x) >
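Greg's access pattern, "give me record X from partition key Y," implies a pure client-side routing step before any SQL is issued, which is what keeps each lookup to one small table and one small index. A sketch with hypothetical table names and columns:

```python
def table_for(partition_key: int, n_tables: int = 8_000_000) -> str:
    """Map the application's partition key Y to one of the 8M tables."""
    return f"data_{partition_key % n_tables}"

def fetch_sql(record_id: int, partition_key: int) -> str:
    """Build the single-table query for 'record X from partition key Y'."""
    return (f"SELECT * FROM {table_for(partition_key)} "
            f"WHERE id = {record_id}")

print(fetch_sql(123, 42))  # SELECT * FROM data_42 WHERE id = 123
```

In a real client the record id would of course be a bound parameter rather than interpolated; the point is only that no catalog lookup or parent-table scan sits on the 30ms path.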
Re: [PERFORM] Millions of tables
From: Greg Spiegelberg Sent: Sunday, September 25, 2016 7:50 PM … Over the weekend, I created 8M tables with 16M indexes on those tables. … A system or database crash could take potentially hours to days to recover. There are likely other issues ahead. You may wonder, "why is Greg attempting such a thing?" I looked at DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, it's antiquated and don't get me started on "Hadoop". Problem with the "one big table" solution is I anticipate 1,200 trillion records. Random access is expected and the customer expects <30ms reads for a single record fetch. I'm not looking for alternatives yet but input to my test. _ Holy guacamole, batman! Ok, here’s my take: you’ve traded the risks/limitations of the known for the risks of the unknown. The unknown being, in the numerous places where postgres historical development may have cut corners, you may be the first to exercise those corners and flame out like the recent SpaceX rocket. Put it another way – you’re going to bet your career (perhaps) or a client’s future on an architectural model that just doesn’t seem feasible. I think you’ve got a remarkable design problem to solve, and am glad you’ve chosen to share that problem with us. And I do think it will boil down to this: it’s not that you CAN do it on Postgres (which you clearly can), but once in production, assuming things are actually stable, how will you handle the data management aspects like inevitable breakage, data integrity issues, backups, restores, user contention for resources, fault tolerance and disaster recovery. Just listing the tables will take forever. Add a column? Never. I do think the amount of testing you’ll need to do to prove that every normal data management function still works at that table count…that in itself is going to be not a lot of fun. This one hurts my head. 
Ironically, the most logical destination for this type of data may actually be Hadoop – auto-scale, auto-shard, fault tolerant, etc…and I’m not a Hadoopie. I am looking forward to hearing how this all plays out, it will be quite an adventure! All the best, Mike Sofen (Synthetic Genomics…on Postgres 9.5x)
Re: [PERFORM] Millions of tables
Dear Greg, Have you checked PostgresXL ? with millions of tables, how do the apps choose which table is appropriate? in my opinion, with that scale it should go with parallel query with data sharing like what PostgresXL has done. Thanks, Julyanto SUTANDANG Equnix Business Solutions, PT (An Open Source and Open Mind Company) www.equnix.co.id Pusat Niaga ITC Roxy Mas Blok C2/42. Jl. KH Hasyim Ashari 125, Jakarta Pusat T: +6221 2282 F: +62216315281 M: +628164858028 On Mon, Sep 26, 2016 at 9:50 AM, Greg Spiegelberg wrote: > Hey all, > > Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time > has said not to have millions of tables. I too have long believed it until > recently. > > AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for > PGDATA. Over the weekend, I created 8M tables with 16M indexes on those > tables. Table creation initially took 0.018031 secs, average 0.027467 and > after tossing out outliers (qty 5) the maximum creation time found was > 0.66139 seconds. Total time 30 hours, 31 minutes and 8.435049 seconds. 
> Tables were created by a single process. Do note that table creation is > done via plpgsql function as there are other housekeeping tasks necessary > though minimal. > > No system tuning but here is a list of PostgreSQL knobs and switches: > shared_buffers = 2GB > work_mem = 48 MB > max_stack_depth = 4 MB > synchronous_commit = off > effective_cache_size = 200 GB > pg_xlog is on its own file system > > There are still some obvious problems. General DBA functions such as VACUUM > and ANALYZE should not be done. Each will run forever and cause much grief. > Backups are problematic in the traditional pg_dump and PITR space. Large > JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in my test > case) are no-no's. A system or database crash could take potentially hours > to days to recover. There are likely other issues ahead. > > You may wonder, "why is Greg attempting such a thing?" I looked at > DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it, > it's antiquated and don't get me started on "Hadoop". I looked at many > others and ultimately the recommended use of each vendor was to have one > table for all data. That overcomes the millions of tables problem, right? > > Problem with the "one big table" solution is I anticipate 1,200 trillion > records. Random access is expected and the customer expects <30ms reads for > a single record fetch. > > No data is loaded... yet. Table and index creation only. I am interested in > the opinions of all including tests I may perform. If you had this setup, > what would you capture / analyze? I have a job running preparing data. I > did this on a much smaller scale (50k tables) and data load via function > allowed close to 6,000 records/second. The schema has been simplified since > and the last test reached just over 20,000 records/second with 300k tables. > > I'm not looking for alternatives yet but input to my test. Takers? 
> > I can't promise immediate feedback but will do my best to respond with > results. > > TIA, > -Greg