Re: [HACKERS] [PERFORM] pg_dump and thousands of schemas

2012-11-08 Thread Denis
Tom Lane wrote:
 Denis <socsam@> writes:
 Tom Lane wrote:
 Hmmm ... so the problem here isn't that you've got 2600 schemas, it's
 that you've got 183924 tables.  That's going to take some time no matter
 what.
 
 I wonder why pg_dump has to deal with all these 183924 tables if I
 specified to dump only one schema (pg_dump -n schema_name), or even just
 one table (pg_dump -t 'schema_name.comments')?
 
 It has to know about all the tables even if it's not going to dump them
 all, for purposes such as dependency analysis.
 
 We have a web application where we create a schema with a number of
 tables
 in it for each customer. This architecture was chosen to ease the process
 of
 backup/restoring data.
 
 I find that argument fairly dubious, but in any case you should not
 imagine that hundreds of thousands of tables are going to be cost-free.
 
   regards, tom lane
 
 

Still I can't understand why pg_dump has to know about all the tables. For
example, I have a simple table:
CREATE TABLE CLog (
fromUser integer,
toUser integer,
message character varying(2048) NOT NULL,
dateSend timestamp without time zone NOT NULL
);
It has no foreign keys, it doesn't use partitioning, and it has no relations
to any other table. Why does pg_dump have to gather information about ALL the
tables in the database just to dump this one table?
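
A rough way to gauge the scale of what pg_dump's catalog pass has to cover is
to count the objects it must at least look at (illustrative queries only, not
pg_dump's actual internal ones):

SELECT count(*) FROM pg_class WHERE relkind = 'r';  -- every ordinary table considered
SELECT count(*) FROM pg_depend;                     -- dependency edges used for dump ordering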





[PERFORM] Thousands databases or schemas

2012-11-08 Thread Denis
We have a web application where we create a schema or a database with a
number of tables in it for each customer. Now we have about 2600 clients.

The problem we ran into using a separate DB for each client is that creating
a new DB can take up to 2 minutes, which is absolutely unacceptable. Using
schemas instead (one DB with a number of schemas containing similar tables
in it) solved this problem (schemas are created in a couple of seconds), but
created two other blocking points:
1. Sometimes creating a new table in a schema takes up to 5 seconds. Since we
have to create up to 40 tables in a schema, this takes way too much time.
2. pg_dump -n schema_name db_name takes from 30 to 60 seconds, no matter how
much data the schema contains. Even a dump of the table structure alone takes
at least 30 seconds. Judging by this thread
http://postgresql.1045698.n5.nabble.com/pg-dump-and-thousands-of-schemas-td5709766i60.html,
pg_dump always analyses ALL the tables in the DB, i.e. in my case more than
100 000 tables (a sketch of both operations is below).
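
For concreteness, a minimal sketch of the per-customer provisioning and the
per-schema dumps described above (all names here are hypothetical):

psql app_db -c 'CREATE SCHEMA customer_2601'                   # a couple of seconds
pg_dump -n customer_2601 -F c -f customer_2601.dump app_db     # per-schema backup (30-60 s here)
pg_dump -n customer_2601 -s app_db > customer_2601_schema.sql  # structure-only dump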

I know you guys will ask me why we selected this particular application
architecture.
This architecture was chosen to ease the process of backing up and restoring
data and to isolate clients' data from each other. Sometimes clients ask us to
restore data for the last month or roll back to last week's state. This task
is easy to accomplish when the client's data is isolated in a schema/DB. If
we put all the clients' data in one table, operations of this kind will be
much harder to perform. We would have to restore a huge DB with enormously
large tables in it to find the requested data. Sometimes the client doesn't
even remember the exact date; he or she just says "I lost my data somewhere
between Tuesday and Friday last week" and I have to restore backups for
several days. If I have one huge table instead of small tables it will be a
nightmare!

Different clients have different activity rates, and we can select different
backup strategies accordingly. This would be impossible if we kept all the
clients' data in one table.
Besides all of the above, the probability of massive data corruption (if an
error in our web application occurs) is much higher with a single shared
table.
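
As an illustration of such per-client backup schedules (schema names and
frequencies are purely hypothetical), a crontab could dump an active client
nightly and a quiet one weekly:

0 2 * * *  pg_dump -n customer_0007 -F c -f /backups/customer_0007_$(date +\%F).dump app_db
0 3 * * 0  pg_dump -n customer_0150 -F c -f /backups/customer_0150_$(date +\%F).dump app_db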


P.S.
Not to start a holy war, but FYI: in a similar project where we used MySQL,
we now have about 6000 DBs and everything works like a charm.





Re: [PERFORM] Thousands databases or schemas

2012-11-08 Thread Samuel Gendler
On Thu, Nov 8, 2012 at 1:36 AM, Denis <soc...@gmail.com> wrote:


 P.S.
 Not to start a holy war, but FYI: in a similar project where we used MySQL,
 we now have about 6000 DBs and everything works like a charm.


You seem to have answered your own question here.  If my recollection of a
previous discussion about many schemas and pg_dump performance is accurate,
I suspect you are going to be told that you've got a data architecture that
is fairly incompatible with postgresql's architecture and you've
specifically ruled out a solution that would play to postgresql's strengths.


Re: [PERFORM] Thousands databases or schemas

2012-11-08 Thread Denis
Samuel Gendler wrote:
 On Thu, Nov 8, 2012 at 1:36 AM, Denis <socsam@> wrote:
 

 P.S.
 Not to start a holy war, but FYI: in a similar project where we used MySQL,
 we now have about 6000 DBs and everything works like a charm.

 
 You seem to have answered your own question here.  If my recollection of a
 previous discussion about many schemas and pg_dump performance is
 accurate,
 I suspect you are going to be told that you've got a data architecture
 that
 is fairly incompatible with postgresql's architecture and you've
 specifically ruled out a solution that would play to postgresql's
 strengths.

Ok guys, it was not my intention to hurt anyone's feelings by mentioning
MySQL. Sorry about that. There simply was a project with a similar
architecture built using MySQL. When we started the current project, I made
the decision to give PostgreSQL a try. Now I see that the same architecture
is not applicable when PostgreSQL is used.

I would recommend refreshing the info at
http://wiki.postgresql.org/wiki/FAQ. There is a question "What is the
maximum size for a row, a table, and a database?". Please also add
information there on the maximum number of DBs, and the maximum number of
tables one DB can contain, while PostgreSQL continues to work properly.

PS: the easiest solution in my case is to initially create 500 DBs (like
app_template_[0-500]) and create up to 500 schemas in each of them. This
will allow for 250 000 possible clients in total, which should be enough.
The question is: can you see any possible pitfalls of this solution?
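
One way to picture that layout (naming is purely hypothetical; nothing above
prescribes it) is to derive both the database and the schema from the client
id:

SELECT 12345 / 500 AS db_index,      -- => 24,  i.e. database app_template_24
       12345 % 500 AS schema_index;  -- => 345, i.e. schema customer_345 inside it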





Re: [PERFORM] Thousands databases or schemas

2012-11-08 Thread Pavel Stehule
Hello

2012/11/8 Denis <soc...@gmail.com>:
 Samuel Gendler wrote:
 On Thu, Nov 8, 2012 at 1:36 AM, Denis <socsam@> wrote:


 P.S.
 Not to start a holy war, but FYI: in a similar project where we used MySQL,
 we now have about 6000 DBs and everything works like a charm.


 You seem to have answered your own question here.  If my recollection of a
 previous discussion about many schemas and pg_dump performance is
 accurate,
 I suspect you are going to be told that you've got a data architecture
 that
 is fairly incompatible with postgresql's architecture and you've
 specifically ruled out a solution that would play to postgresql's
 strengths.

 Ok guys, it was not my intention to hurt anyone's feelings by mentioning
 MySQL. Sorry about that. There simply was a project with a similar
 architecture built using MySQL. When we started the current project, I made
 the decision to give PostgreSQL a try. Now I see that the same architecture
 is not applicable when PostgreSQL is used.

 I would recommend refreshing the info at
 http://wiki.postgresql.org/wiki/FAQ. There is a question "What is the
 maximum size for a row, a table, and a database?". Please also add
 information there on the maximum number of DBs, and the maximum number of
 tables one DB can contain, while PostgreSQL continues to work properly.

 PS: the easiest solution in my case is to initially create 500 DBs (like
 app_template_[0-500]) and create up to 500 schemas in each of them. This
 will allow for 250 000 possible clients in total, which should be enough.
 The question is: can you see any possible pitfalls of this solution?


We use about 2000 databases per warehouse, and it works well, but
pg_dumpall doesn't work well in this environment, so we use different
backup methods.
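
For example, one common alternative to pg_dumpall in a setup like this (a
sketch only, not necessarily the method used here) is to dump the globals
once and then each database separately:

pg_dumpall --globals-only > globals.sql
for db in $(psql -At -c "SELECT datname FROM pg_database WHERE NOT datistemplate"); do
    pg_dump -F c -f "${db}.dump" "$db"
done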

Regards

Pavel







Re: [PERFORM] Thousands databases or schemas

2012-11-08 Thread Mark Thornton

On 08/11/12 09:36, Denis wrote:

We have a web application where we create a schema or a database with a
number of tables in it for each customer. Now we have about 2600 clients.

The problem we ran into using a separate DB for each client is that creating
a new DB can take up to 2 minutes, which is absolutely unacceptable. Using
schemas instead (one DB with a number of schemas containing similar tables
in it) solved this problem (schemas are created in a couple of seconds), but
created two other blocking points:
1. Sometimes creating a new table in a schema takes up to 5 seconds. Since we
have to create up to 40 tables in a schema, this takes way too much time.
2. pg_dump -n schema_name db_name takes from 30 to 60 seconds, no matter how
much data the schema contains. Even a dump of the table structure alone takes
at least 30 seconds. Judging by this thread
http://postgresql.1045698.n5.nabble.com/pg-dump-and-thousands-of-schemas-td5709766i60.html,
pg_dump always analyses ALL the tables in the DB, i.e. in my case more than
100 000 tables.
The obvious solution would be to write your own version of pg_dump which
only examines the tables within a schema. You can even start from the
source of the standard pg_dump! Alternatively, you could eliminate the
per-customer schemas/tables, add an extra 'customer' key column to each
table, and then modify pg_dump to dump only the parts of each table
matching a given customer id.
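
A rough sketch of that second layout (hypothetical table and column names),
using plain COPY for the per-customer export rather than a modified pg_dump:

CREATE TABLE comments (
    customer_id integer NOT NULL,
    id          bigserial,
    body        text NOT NULL,
    PRIMARY KEY (customer_id, id)
);

-- Export one customer's rows, one COPY per table:
COPY (SELECT * FROM comments WHERE customer_id = 42) TO STDOUT WITH CSV;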


Mark Thornton





Re: [PERFORM] HT on or off for E5-26xx ?

2012-11-08 Thread Shaun Thomas

On 11/07/2012 09:16 PM, David Boreham wrote:


bash-4.1$ /usr/pgsql-9.2/bin/pgbench -T 600 -j 48 -c 48


Unfortunately, without -S you're not really testing the processors. A
regular pgbench run can fluctuate by more than that due to writing and
checkpoints.
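
For reference, the select-only variant of the earlier command would look
something like this (-S restricts pgbench to SELECTs, which makes the run far
more CPU-bound):

bash-4.1$ /usr/pgsql-9.2/bin/pgbench -S -T 600 -j 48 -c 48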


For what it's worth, our X5675s perform about 40-50% better with HT
enabled. Not the 2x you might expect from doubling the number of
processors, but it definitely didn't make things worse.


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
stho...@optionshouse.com





Re: [PERFORM] HT on or off for E5-26xx ?

2012-11-08 Thread David Boreham

On 11/8/2012 6:58 AM, Shaun Thomas wrote:

On 11/07/2012 09:16 PM, David Boreham wrote:


bash-4.1$ /usr/pgsql-9.2/bin/pgbench -T 600 -j 48 -c 48


Unfortunately, without -S you're not really testing the processors. A
regular pgbench run can fluctuate by more than that due to writing and
checkpoints.
Hmm... my goal was to test with a workload close to our application's
(which is heavy OLTP, small transactions and hence sensitive to I/O
commit rate).
The hypothesis I was testing was that enabling HT actively degrades
performance (which in my case it does not). To be honest, I wasn't really
testing the additional benefit from HT, rather observing that it is
non-negative :)


If I have time I can run the select-only test for you and post the 
results. The DB fits into memory so it will be a good CPU test.







Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-08 Thread Rodrigo Rosenfeld Rosas

On 07-11-2012 22:58, Tom Lane wrote:

Rodrigo Rosenfeld Rosas <rr.ro...@gmail.com> writes:

Ok, I could finally strip part of my database schema that will allow you
to run the explain query and reproduce the issue.
There is a simple SQL dump in plain format that you can restore both on
9.1 and 9.2 and an example EXPLAIN query so that you can see the
difference between both versions.
Please keep me up to date with regards to any progress. Let me know if
the commit above fixed this issue.

AFAICT, HEAD and 9.2 branch tip plan this query a bit faster than 9.1
does.


Great! What is the estimated date for the 9.2.2 release?


   It does appear that the problem is the same one fixed in that
recent commit: the problem is you've got N join clauses all involving
t.id and so there are lots of redundant ways to use the index on t.id.
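
To make that concrete, the query shape being described is roughly the
following (hypothetical table and column names, not the actual schema):

SELECT t.id
FROM transactions t
JOIN condition_boolean_value b ON b.transaction_id = t.id
JOIN condition_string_value  s ON s.transaction_id = t.id
JOIN condition_number_value  n ON n.transaction_id = t.id;
-- Each extra join clause on t.id gives the planner another redundant way to
-- use the index on t.id, which is what inflated planning time before the fix.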


And what is the reason fast.sql performs much better than slow.sql?
Is it possible to optimize the planner so that both fast.sql and
slow.sql finish in about the same time?



I've got to say though that this is one of the most bizarre database
schemas I've ever seen.


Merlin seems to share your opinion on that. I'd love to try a different 
database design when I have a chance.


What would you guys suggest me for handling my application requirements?

The only reason it is bizarre is that I have no idea how to simplify our
database design much while using a relational database. And pstore doesn't
sound like a reasonable option for our requirements either.


The only other option I can think of is to stop splitting
transaction_condition into many tables (one for each data type). Then I'd
need to include all possible columns in transaction_condition, and I'm
not sure whether it would perform better or what the implications would be
for database size, since most columns would be null for each record. This
also introduces another issue: I would need to create a trigger to check
that the record is valid upon insertion, to avoid creating records with all
columns set to NULL, for instance. Currently each separate table that stores
the values has not-null constraints, among others, to prevent this kind of
problem. Triggers are more complicated to maintain, especially because we're
used to using an ORM (except for this particular case, where I generate the
SQL query manually instead of using an ORM).


Also, we migrate the database using standalone_migrations:

https://github.com/thuss/standalone-migrations

If we change a single line in the trigger code it won't be easy to see
which line has changed in the commit that introduces the change, because
we would have to create a separate migration that alters the trigger with
all of its code repeated.



   It seems to be sort of an unholy combination of
EAV and a star schema.  A star schema might not actually be a bad model
for what you're trying to do, but what you want for that is one big fact
table and a collection of *small* detail tables you join to it (small
meaning just one entry per possible value).  The way this is set up, you
need to join two or three tables before you can even join to the main
fact table - and those tables don't even have the virtue of being small.
That's never going to perform well.


If I understand correctly, you're suggesting that I drop
transaction_condition(id, transaction_id, type_id), replace
condition_boolean_value(id, condition_id, value) with
condition_boolean_value(id, transaction_id, type_id, value), and repeat
the same idea for the other tables.


Is that right? Would that perform much better? If you think so, I could
try this approach when I find some time. But I'd also need to
denormalize other related tables I didn't send in the schema dump. For
instance, the document snippets also have a condition_id column. Each
field value (transaction_condition) can have multiple contract snippets
in a table called condition_document_excerpt(id, document_id,
condition_id, position). I'd need to remove condition_id from it and
append transaction_id and type_id, just like in the value tables. No big
deal if this would speed up our queries.
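
A minimal sketch of that restructuring (column names taken from this thread;
the types and the uniqueness constraint are guesses, purely illustrative):

CREATE TABLE condition_boolean_value (
    id             bigserial PRIMARY KEY,
    transaction_id integer NOT NULL,
    type_id        integer NOT NULL,
    value          boolean NOT NULL,
    UNIQUE (transaction_id, type_id)  -- assumes one value per field per transaction
);

CREATE TABLE condition_document_excerpt (
    id             bigserial PRIMARY KEY,
    document_id    integer NOT NULL,
    transaction_id integer NOT NULL,
    type_id        integer NOT NULL,
    "position"     integer NOT NULL
);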


Am I missing something?


Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-08 Thread Alvaro Herrera
Rodrigo Rosenfeld Rosas wrote:
 On 07-11-2012 22:58, Tom Lane wrote:
 Rodrigo Rosenfeld Rosas <rr.ro...@gmail.com> writes:
 Ok, I could finally strip part of my database schema that will allow you
 to run the explain query and reproduce the issue.
 There is a simple SQL dump in plain format that you can restore both on
 9.1 and 9.2 and an example EXPLAIN query so that you can see the
 difference between both versions.
 Please keep me up to date with regards to any progress. Let me know if
 the commit above fixed this issue.
 AFAICT, HEAD and 9.2 branch tip plan this query a bit faster than 9.1
 does.
 
 Great! What is the estimated date for the 9.2.2 release?

Hasn't been announced, but you can grab a snapshot right now from
ftp.postgresql.org if you want.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-08 Thread Rodrigo Rosenfeld Rosas

On 08-11-2012 13:38, Alvaro Herrera wrote:

Rodrigo Rosenfeld Rosas wrote:

On 07-11-2012 22:58, Tom Lane wrote:

Rodrigo Rosenfeld Rosas <rr.ro...@gmail.com> writes:

Ok, I could finally strip part of my database schema that will allow you
to run the explain query and reproduce the issue.
There is a simple SQL dump in plain format that you can restore both on
9.1 and 9.2 and an example EXPLAIN query so that you can see the
difference between both versions.
Please keep me up to date with regards to any progress. Let me know if
the commit above fixed this issue.

AFAICT, HEAD and 9.2 branch tip plan this query a bit faster than 9.1
does.

Great! What is the estimated date for the 9.2.2 release?

Hasn't been announced, but you can grab a snapshot right now from
ftp.postgresql.org if you want.


Thank you, Álvaro, but I prefer to use the official Debian packages instead,
since they are easier to manage and better integrated with our OS.


For now I rolled back to 9.1 this morning and it is working fine, so I'm
not in any rush. I just want an estimate of when I should try upgrading to
9.2 from experimental again after 9.2.2 is released.


Cheers,
Rodrigo.





Re: [PERFORM] Thousands databases or schemas

2012-11-08 Thread Craig Ringer
On 11/08/2012 09:29 PM, Denis wrote:
 Ok guys, it was not my intention to hurt anyone's feelings by mentioning
 MySQL. Sorry about that.
It's pretty silly to be upset by someone mentioning another DB product.
I wouldn't worry.
 There simply was a project with a similar
 architecture built using MySQL. When we started the current project, I
 made the decision to give PostgreSQL a try.
It's certainly interesting that MySQL currently scales to much larger
table counts than PostgreSQL appears to.

I'd like to see if this can be improved down the track. Various people
are doing work on PostgreSQL scaling and performance, so with luck huge
table counts will come into play there. If nothing else, supporting
large table counts is important when dealing with very large amounts of
data in partitioned tables.

I think I saw mention of better performance with higher table counts in
9.3 in -hackers, too.

 I would recommend refreshing the info at
 http://wiki.postgresql.org/wiki/FAQ. There is a question "What is the
 maximum size for a row, a table, and a database?". Please also add
 information there on the maximum number of DBs, and the maximum number of
 tables one DB can contain, while PostgreSQL continues to work properly.
Yeah, a number of people have been thrown by that. Technical limitations
aren't the same as practical limitations, and in some cases the
practical limitations are lower.

The trouble is: how do you put a number on it when what happens is a slow
and gradual drop in performance? And when one person's "performs
adequately" is another's "way too slow"?

--
Craig Ringer

