Re: [HACKERS] [PERFORM] pg_dump and thousands of schemas
Tom Lane-2 wrote:
> Denis <socsam@...> writes:
>> Tom Lane-2 wrote:
>>> Hmmm ... so the problem here isn't that you've got 2600 schemas, it's that
>>> you've got 183924 tables. That's going to take some time no matter what.
>>
>> I wonder why pg_dump has to deal with all these 183924 tables if I specified
>> to dump only one schema: pg_dump -n schema_name, or even just one table:
>> pg_dump -t 'schema_name.comments'?
>
> It has to know about all the tables even if it's not going to dump them all,
> for purposes such as dependency analysis.
>
>> We have a web application where we create a schema with a number of tables
>> in it for each customer. This architecture was chosen to ease the process of
>> backup/restoring data.
>
> I find that argument fairly dubious, but in any case you should not imagine
> that hundreds of thousands of tables are going to be cost-free.
>
> regards, tom lane

Still I can't understand why pg_dump has to know about all the tables. For example, I have this simple table:

CREATE TABLE CLog (
    fromUser integer,
    toUser integer,
    message character varying(2048) NOT NULL,
    dateSend timestamp without time zone NOT NULL
);

It has no foreign keys, it doesn't use partitioning, and it has no relations to any other table. Why does pg_dump have to gather information about ALL the tables in the database just to dump this one table?
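To illustrate the dependency-analysis point, here is a rough sketch (not pg_dump's actual internal query; "db_name" stands in for the real database name) of the kind of catalog scan that has to consider every relation in the database before even a single-schema or single-table dump can be produced:

# Enumerate every table, sequence and view in the database - the starting
# point for working out dependencies, before any -n/-t filtering happens.
psql -d db_name -c "
  SELECT c.oid, n.nspname, c.relname, c.relkind
  FROM pg_catalog.pg_class c
  JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
  WHERE c.relkind IN ('r', 'S', 'v');"

With 183924 tables, even a metadata pass of this shape is far from free.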
[PERFORM] Thousands databases or schemas
We have a web application where we create a schema or a database with a number of tables in it for each customer. Now we have about 2600 clients.

The problem we met using a separate DB for each client is that creation of a new DB can take up to 2 minutes, which is absolutely unacceptable. Using schemas instead (one DB with a number of schemas containing similar tables) solved this problem (schemas are created in a couple of seconds), but created two other blocking points:

1. Sometimes creation of a new table in a schema takes up to 5 seconds. When we have to create up to 40 tables in a schema, this takes way too much time.

2. pg_dump -n schema_name db_name takes from 30 to 60 seconds, no matter how much data is in the schema. Even a dump of the table structure only takes at least 30 seconds. According to this thread http://postgresql.1045698.n5.nabble.com/pg-dump-and-thousands-of-schemas-td5709766i60.html, pg_dump always analyses ALL the tables in the DB, i.e. in my case more than 100 000 tables.

I know you guys will ask why we selected this particular application architecture. It was chosen to ease the process of backing up and restoring data and to isolate clients' data from each other. Sometimes clients ask us to restore data for the last month or roll back to last week's state. This task is easy to accomplish when the client's data is isolated in a schema/DB. If we put all the clients' data in one table, operations of this kind will be much harder to perform: we would have to restore a huge DB with enormously large tables in it just to find the requested data. Sometimes the client doesn't even remember the exact date; he or she just says "I lost my data somewhere between Tuesday and Friday last week", and I have to restore backups for several days. If I had one huge table instead of small tables it would be a nightmare!

Different clients have different activity rates, and we can select different backup strategies accordingly. This would be impossible if we kept all the clients' data in one table. Besides all of the above, the probability of massive data corruption (if an error in our web application occurs) is much higher.

P.S. Not to start a holy war, but FYI: in a similar project where we used MySQL, we now have about 6000 DBs and everything works like a charm.
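A minimal sketch of the per-schema backup approach described above (the "client_%" schema naming pattern and "db_name" are assumptions for illustration, not the actual production setup): each client's schema gets its own archive, so one client can be restored without touching the others.

# Dump every client schema into its own custom-format archive.
for schema in $(psql -At -d db_name -c \
    "SELECT nspname FROM pg_namespace WHERE nspname LIKE 'client_%'"); do
  pg_dump -n "$schema" -Fc -f "backup_${schema}_$(date +%F).dump" db_name
done

# Restore a single client's schema (e.g. after dropping the damaged one).
pg_restore -d db_name "backup_client_0042_2012-11-01.dump"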
Re: [PERFORM] Thousands databases or schemas
On Thu, Nov 8, 2012 at 1:36 AM, Denis <soc...@gmail.com> wrote:
> P.S. Not to start a holy war, but FYI: in a similar project where we used
> MySQL, we now have about 6000 DBs and everything works like a charm.

You seem to have answered your own question here. If my recollection of a previous discussion about many schemas and pg_dump performance is accurate, I suspect you are going to be told that you've got a data architecture that is fairly incompatible with PostgreSQL's architecture, and that you've specifically ruled out a solution that would play to PostgreSQL's strengths.
Re: [PERFORM] Thousands databases or schemas
Samuel Gendler wrote:
> On Thu, Nov 8, 2012 at 1:36 AM, Denis <socsam@...> wrote:
>> P.S. Not to start a holy war, but FYI: in a similar project where we used
>> MySQL, we now have about 6000 DBs and everything works like a charm.
>
> You seem to have answered your own question here. If my recollection of a
> previous discussion about many schemas and pg_dump performance is accurate,
> I suspect you are going to be told that you've got a data architecture that
> is fairly incompatible with PostgreSQL's architecture, and that you've
> specifically ruled out a solution that would play to PostgreSQL's strengths.

Ok guys, it was not my intention to hurt anyone's feelings by mentioning MySQL. Sorry about that. There simply was a project with a similar architecture built using MySQL. When we started the current project, I made the decision to give PostgreSQL a try. Now I see that the same architecture is not applicable if PostgreSQL is used.

I would recommend refreshing the info here: http://wiki.postgresql.org/wiki/FAQ. There is a question "What is the maximum size for a row, a table, and a database?". Please add there information on the maximum number of DBs, and of tables per DB, with which PostgreSQL continues to work properly.

PS: the easiest solution in my case is to create 500 DBs initially (like app_template_[0-500]) and create up to 500 schemas in each of them. That allows for 250 000 possible clients in total, which should be enough. The question is: can you see any possible pitfalls of this solution?
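A minimal sketch of the pre-created pool idea from the PS above (the counts and naming are the poster's; the commands are only an illustration): the slow CREATE DATABASE happens once up front, so onboarding a new client only needs a fast CREATE SCHEMA.

# Create the pool of 500 databases once.
for i in $(seq 0 499); do
  createdb "app_template_${i}"
done

# Later, when a new client signs up, pick a database with spare capacity
# and create their schema there (client id 12345 is a made-up example).
psql -d app_template_7 -c "CREATE SCHEMA client_12345;"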
Re: [PERFORM] Thousands databases or schemas
Hello

2012/11/8 Denis <soc...@gmail.com>:
> [...]
> PS: the easiest solution in my case is to create 500 DBs initially (like
> app_template_[0-500]) and create up to 500 schemas in each of them. That
> allows for 250 000 possible clients in total, which should be enough. The
> question is: can you see any possible pitfalls of this solution?

We use about 2000 databases per warehouse, and it is working well, but pg_dumpall doesn't work well in this environment, so we use different backup methods.

Regards

Pavel
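One possible shape of such a "different backup method" for a cluster with thousands of databases (an assumption about the approach, not Pavel's actual scripts): dump the cluster-wide globals once, then each database separately, instead of one monolithic pg_dumpall run.

# Roles, tablespaces and other cluster-wide objects.
pg_dumpall --globals-only > globals.sql

# One custom-format archive per database.
for db in $(psql -At -c "SELECT datname FROM pg_database WHERE NOT datistemplate"); do
  pg_dump -Fc -f "${db}_$(date +%F).dump" "$db"
done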
Re: [PERFORM] Thousands databases or schemas
On 08/11/12 09:36, Denis wrote:
> We have a web application where we create a schema or a database with a
> number of tables in it for each customer. Now we have about 2600 clients.
>
> The problem we met using a separate DB for each client is that creation of
> a new DB can take up to 2 minutes, which is absolutely unacceptable. Using
> schemas instead (one DB with a number of schemas containing similar tables)
> solved this problem (schemas are created in a couple of seconds), but
> created two other blocking points:
>
> 1. Sometimes creation of a new table in a schema takes up to 5 seconds.
> When we have to create up to 40 tables in a schema, this takes way too much
> time.
>
> 2. pg_dump -n schema_name db_name takes from 30 to 60 seconds, no matter
> how much data is in the schema. Even a dump of the table structure only
> takes at least 30 seconds. According to this thread
> http://postgresql.1045698.n5.nabble.com/pg-dump-and-thousands-of-schemas-td5709766i60.html,
> pg_dump always analyses ALL the tables in the DB, i.e. in my case more than
> 100 000 tables.

The obvious solution would be to write your own version of pg_dump which only examines the tables within a schema. You can even start with the source of the standard pg_dump!

Alternatively, you could eliminate the per-customer schemas/tables entirely and add an extra 'customer' key column to each table. You would then modify pg_dump to dump only the parts of each table matching a given customer id.

Mark Thornton
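A hedged sketch of the shared-table layout Mark is describing (columns borrowed from the CLog example earlier in the thread plus the extra customer key; table, column and file names are illustrative): per-customer extraction then becomes a filtered COPY rather than a per-schema pg_dump.

# One shared table for all customers, keyed by customer_id.
psql -d db_name -c "
CREATE TABLE clog (
    customer_id integer NOT NULL,
    fromUser    integer,
    toUser      integer,
    message     character varying(2048) NOT NULL,
    dateSend    timestamp without time zone NOT NULL
);
CREATE INDEX clog_customer_idx ON clog (customer_id);"

# Export a single customer's rows without touching anyone else's data.
psql -d db_name -c "\copy (SELECT * FROM clog WHERE customer_id = 42) TO 'clog_42.csv' CSV"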
Re: [PERFORM] HT on or off for E5-26xx ?
On 11/07/2012 09:16 PM, David Boreham wrote:
> bash-4.1$ /usr/pgsql-9.2/bin/pgbench -T 600 -j 48 -c 48

Unfortunately, without -S you're not really testing the processors. A regular pgbench run can fluctuate more than that due to writes and checkpoints.

For what it's worth, our X5675s perform about 40-50% better with HT enabled. Not the 2x you might expect from doubling the number of processors, but it definitely didn't make things worse.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
stho...@optionshouse.com
Re: [PERFORM] HT on or off for E5-26xx ?
On 11/8/2012 6:58 AM, Shaun Thomas wrote:
> On 11/07/2012 09:16 PM, David Boreham wrote:
>> bash-4.1$ /usr/pgsql-9.2/bin/pgbench -T 600 -j 48 -c 48
>
> Unfortunately, without -S you're not really testing the processors. A
> regular pgbench run can fluctuate more than that due to writes and
> checkpoints.

Hmm ... my goal was to test with a workload close to our application's (which is heavy OLTP, with small transactions, and hence sensitive to I/O commit rate). The hypothesis I was testing was that enabling HT positively degrades performance (which in my case it does not). To be honest, I wasn't really testing the additional benefit from HT, just observing that it is non-negative :)

If I have time I can run the select-only test for you and post the results. The DB fits into memory, so it will be a good CPU test.
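For reference, a minimal sketch of the select-only comparison being discussed (the database name and scale factor are assumptions; the duration and client counts are the ones already used in the thread). The -S flag keeps pgbench on read-only SELECTs, so the result reflects CPU and memory rather than commit I/O.

createdb bench                        # test database (name is an assumption)
pgbench -i -s 100 bench               # populate it; scale chosen to stay in RAM

# SELECT-only run: repeat once with HT enabled and once with it disabled.
pgbench -S -T 600 -j 48 -c 48 bench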
Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2
On 07-11-2012 22:58, Tom Lane wrote:
> Rodrigo Rosenfeld Rosas <rr.ro...@gmail.com> writes:
>> Ok, I could finally strip part of my database schema that will allow you
>> to run the explain query and reproduce the issue. There is a simple SQL
>> dump in plain format that you can restore both on 9.1 and 9.2, and an
>> example EXPLAIN query so that you can see the difference between both
>> versions. Please keep me up to date with regards to any progress. Let me
>> know if the commit above fixed this issue.
>
> AFAICT, HEAD and 9.2 branch tip plan this query a bit faster than 9.1 does.

Great! What is the estimate for the 9.2.2 release?

> It does appear that the problem is the same one fixed in that recent
> commit: the problem is you've got N join clauses all involving t.id and so
> there are lots of redundant ways to use the index on t.id.

And what is the reason why fast.sql performs much better than slow.sql? Is it possible to optimize the planner so that both fast.sql and slow.sql finish in about the same time?

> I've got to say though that this is one of the most bizarre database
> schemas I've ever seen.

Merlin seems to share your opinion on that. I'd love to try a different database design when I have a chance. What would you guys suggest for handling my application requirements?

The only reason it is bizarre is that I have no idea how to simplify our database design much using relational databases, and hstore doesn't sound like a reasonable option for our requirements either.

The only other option I can think of is to stop splitting transaction_condition into many tables (one for each data type). Then I'd need to include all possible columns in transaction_condition, and I'm not sure whether it would perform better or what the implications would be for database size, since most columns would be NULL in each record.

This also introduces another issue: I would need a trigger to check that a record is valid upon insertion, to avoid creating records with all columns set to NULL, for instance. Currently each separate table that stores the values has NOT NULL constraints, among others, to prevent this kind of problem. Triggers are more complicated to maintain, especially because we're used to using an ORM (except for this particular case, where I generate the SQL query manually instead of using the ORM). Also, we migrate the database using standalone_migrations: https://github.com/thuss/standalone-migrations. If we change a single line in the trigger code, it won't be easy to see which line changed in the commit that introduces the change, because we would have to create a separate migration that alters the trigger with all of its code repeated.

> It seems to be sort of an unholy combination of EAV and a star schema. A
> star schema might not actually be a bad model for what you're trying to do,
> but what you want for that is one big fact table and a collection of *small*
> detail tables you join to it (small meaning just one entry per possible
> value). The way this is set up, you need to join two or three tables before
> you can even join to the main fact table - and those tables don't even have
> the virtue of being small. That's never going to perform well.

If I understand correctly, you're suggesting that I drop transaction_condition(id, transaction_id, type_id) and replace condition_boolean_value(id, condition_id, value) with condition_boolean_value(id, transaction_id, type_id, value), repeating the same idea for the other tables. Is that right? Would that perform much better?
If you think so, I could try this approach when I find some time. But I'd also need to denormalize other related tables that I didn't send in the schema dump. For instance, the document snippets also have a condition_id column: each field value (transaction_condition) can have multiple contract snippets, in a table called condition_document_excerpt(id, document_id, condition_id, position). I'd need to remove condition_id from it and append transaction_id and type_id, just like the values tables. No big deal if this would speed up our queries. Am I missing something?
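A hedged sketch of the denormalization discussed above (column types, constraints and "db_name" are assumptions, since the full schema isn't shown here): each per-datatype value table and the excerpt table reference the transaction and field type directly, removing the extra join through transaction_condition.

psql -d db_name -c "
CREATE TABLE condition_boolean_value (
    id             serial PRIMARY KEY,
    transaction_id integer NOT NULL,
    type_id        integer NOT NULL,
    value          boolean NOT NULL,
    UNIQUE (transaction_id, type_id)   -- assumes one value per field per transaction
);
CREATE TABLE condition_document_excerpt (
    id             serial PRIMARY KEY,
    document_id    integer NOT NULL,
    transaction_id integer NOT NULL,
    type_id        integer NOT NULL,
    position       integer NOT NULL
);"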
Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2
Rodrigo Rosenfeld Rosas wrote:
> On 07-11-2012 22:58, Tom Lane wrote:
>> Rodrigo Rosenfeld Rosas <rr.ro...@gmail.com> writes:
>>> Ok, I could finally strip part of my database schema that will allow you
>>> to run the explain query and reproduce the issue. There is a simple SQL
>>> dump in plain format that you can restore both on 9.1 and 9.2, and an
>>> example EXPLAIN query so that you can see the difference between both
>>> versions. Please keep me up to date with regards to any progress. Let me
>>> know if the commit above fixed this issue.
>>
>> AFAICT, HEAD and 9.2 branch tip plan this query a bit faster than 9.1 does.
>
> Great! What is the estimate for the 9.2.2 release?

It hasn't been announced, but you can grab a snapshot right now from ftp.postgresql.org if you want.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2
On 08-11-2012 13:38, Alvaro Herrera wrote:
> Rodrigo Rosenfeld Rosas wrote:
>> Great! What is the estimate for the 9.2.2 release?
>
> It hasn't been announced, but you can grab a snapshot right now from
> ftp.postgresql.org if you want.

Thank you, Álvaro, but I prefer to use the official Debian packages instead, since they are easier to manage and better integrated with our OS. I rolled back to 9.1 this morning and it is working fine, so I'm not in any rush. I just want an estimate of when I should try upgrading to 9.2 from experimental again, once 9.2.2 is released.

Cheers,
Rodrigo.
Re: [PERFORM] Thousands databases or schemas
On 11/08/2012 09:29 PM, Denis wrote:
> Ok guys, it was not my intention to hurt anyone's feelings by mentioning
> MySQL. Sorry about that.

It's pretty silly to be upset by someone mentioning another DB product. I wouldn't worry.

> There simply was a project with a similar architecture built using MySQL.
> When we started the current project, I made the decision to give PostgreSQL
> a try.

It's certainly interesting that MySQL currently scales to much larger table counts than PostgreSQL appears to. I'd like to see if this can be improved down the track. Various people are doing work on PostgreSQL scaling and performance, so with luck huge table counts will come into play there. If nothing else, supporting large table counts is important when dealing with very large amounts of data in partitioned tables.

I think I saw mention of better performance with higher table counts in 9.3 on -hackers, too.

> I would recommend refreshing the info here: http://wiki.postgresql.org/wiki/FAQ.
> There is a question "What is the maximum size for a row, a table, and a
> database?". Please add there information on the maximum number of DBs, and
> of tables per DB, with which PostgreSQL continues to work properly.

Yeah, a number of people have been thrown by that. Technical limitations aren't the same as practical limitations, and in some cases the practical limitations are lower. The trouble is: how do you put a number to it when what actually happens is a slow and gradual drop in performance? And when one person's "performs adequately" is another's "way too slow"?

--
Craig Ringer