Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-10 Thread aasat
Please try LIMIT 1 in the EXISTS subquery:

 exists(select id from condition_document_excerpt where
condition_id=c1686.id LIMIT 1) as v1686_has_reference
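
For context, a sketch of how this EXISTS column sits in a query (the surrounding column list and joins are not shown in this message; c1686 is the alias used in the original query):

```sql
-- Sketch only: EXISTS already stops at the first matching row, so the
-- explicit LIMIT 1 acts as an extra hint for the planner here.
SELECT EXISTS (SELECT id
               FROM condition_document_excerpt
               WHERE condition_id = c1686.id
               LIMIT 1) AS v1686_has_reference
  FROM transaction_condition c1686;
```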



--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/Query-completed-in-1s-in-PG-9-1-and-700s-in-PG-9-2-tp5730899p5731021.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-08 Thread Rodrigo Rosenfeld Rosas

On 07-11-2012 22:58, Tom Lane wrote:

Rodrigo Rosenfeld Rosas <rr.ro...@gmail.com> writes:

Ok, I could finally strip part of my database schema that will allow you
to run the explain query and reproduce the issue.
There is a simple SQL dump in plain format that you can restore both on
9.1 and 9.2 and an example EXPLAIN query so that you can see the
difference between both versions.
Please keep me up to date with regards to any progress. Let me know if
the commit above fixed this issue.

AFAICT, HEAD and 9.2 branch tip plan this query a bit faster than 9.1
does.


Great! What is the estimate for 9.2.2 release?


   It does appear that the problem is the same one fixed in that
recent commit: the problem is you've got N join clauses all involving
t.id and so there are lots of redundant ways to use the index on t.id.


And what is the reason why fast.sql performs much better than slow.sql? 
Is it possible to optimize the planner so that both fast.sql and 
slow.sql finish in about the same time?



I've got to say though that this is one of the most bizarre database
schemas I've ever seen.


Merlin seems to share your opinion on that. I'd love to try a different 
database design when I have a chance.


What would you guys suggest me for handling my application requirements?

The only reason it is bizarre is that I have no idea how to simplify our 
database design much while using a relational database. And pstore 
doesn't sound like a reasonable option for our requirements either.


The only other option I can think of is to stop splitting 
transaction_condition into many tables (one for each data type). Then I'd 
need to include all possible columns in transaction_condition, and I'm 
not sure whether it would perform better or what the implications would 
be for the database size, since most columns would be NULL for each 
record. This also introduces another issue: I would need to create 
a trigger to check that the record is valid upon insertion, to avoid 
creating records with all columns set to NULL, for instance. Currently 
each separate table that stores the values has not-null constraints, 
among others, to prevent this kind of problem. Triggers are more 
complicated to maintain, especially because we're used to using an ORM 
(except for this particular case, where I generate the SQL query manually 
instead of using an ORM).


Also, we migrate the database using standalone_migrations:

https://github.com/thuss/standalone-migrations

If we change a single line in the trigger code, it won't be easy to see 
which line has changed in the commit that introduces the change, because 
we would have to create a separate migration that re-creates the trigger 
with all of its code repeated.



   It seems to be sort of an unholy combination of
EAV and a star schema.  A star schema might not actually be a bad model
for what you're trying to do, but what you want for that is one big fact
table and a collection of *small* detail tables you join to it (small
meaning just one entry per possible value).  The way this is set up, you
need to join two or three tables before you can even join to the main
fact table - and those tables don't even have the virtue of being small.
That's never going to perform well.


If I understand correctly, you're suggesting that I drop 
transaction_condition(id, transaction_id, type_id), replace 
condition_boolean_value(id, condition_id, value) with 
condition_boolean_value(id, transaction_id, type_id, value), and repeat 
the same idea for the other tables.


Is that right? Would that perform much better? If you think so, I could 
try this approach when I find some time. But I'd also need to 
denormalize other related tables I didn't send in the schema dump. For 
instance, the document snippets also have a condition_id column. Each 
field value (transaction_condition) can have multiple contract snippets 
in a table called condition_document_excerpt(id, document_id, 
condition_id, position). I'd need to remove condition_id from it and 
append transaction_id and type_id, just like the values tables. No big 
deal if this would speed up our queries.


Am I missing something?
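
For concreteness, a sketch of what this restructuring might look like for the boolean table (names taken from this thread; this is an assumption about the intended shape, not a tested design):

```sql
-- Before: values reference the per-field row in transaction_condition.
--   transaction_condition(id, transaction_id, type_id)
--   condition_boolean_value(id, condition_id, value)
-- After: the value table carries (transaction_id, type_id) directly,
-- so one join level disappears.
CREATE TABLE condition_boolean_value (
    id             serial  PRIMARY KEY,
    transaction_id integer NOT NULL,
    type_id        integer NOT NULL,
    value          boolean NOT NULL,
    UNIQUE (transaction_id, type_id)  -- mirrors the unique index cited earlier
);
```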


Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-08 Thread Alvaro Herrera
Rodrigo Rosenfeld Rosas wrote:
 On 07-11-2012 22:58, Tom Lane wrote:
 Rodrigo Rosenfeld Rosas <rr.ro...@gmail.com> writes:
 Ok, I could finally strip part of my database schema that will allow you
 to run the explain query and reproduce the issue.
 There is a simple SQL dump in plain format that you can restore both on
 9.1 and 9.2 and an example EXPLAIN query so that you can see the
 difference between both versions.
 Please keep me up to date with regards to any progress. Let me know if
 the commit above fixed this issue.
 AFAICT, HEAD and 9.2 branch tip plan this query a bit faster than 9.1
 does.
 
 Great! What is the estimate for 9.2.2 release?

Hasn't been announced, but you can grab a snapshot right now from
ftp.postgresql.org if you want.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-08 Thread Rodrigo Rosenfeld Rosas

On 08-11-2012 13:38, Alvaro Herrera wrote:

Rodrigo Rosenfeld Rosas wrote:

On 07-11-2012 22:58, Tom Lane wrote:

Rodrigo Rosenfeld Rosas <rr.ro...@gmail.com> writes:

Ok, I could finally strip part of my database schema that will allow you
to run the explain query and reproduce the issue.
There is a simple SQL dump in plain format that you can restore both on
9.1 and 9.2 and an example EXPLAIN query so that you can see the
difference between both versions.
Please keep me up to date with regards to any progress. Let me know if
the commit above fixed this issue.

AFAICT, HEAD and 9.2 branch tip plan this query a bit faster than 9.1
does.

Great! What is the estimate for 9.2.2 release?

Hasn't been announced, but you can grab a snapshot right now from
ftp.postgresql.org if you want.


Thank you, Álvaro, but I prefer to use the official Debian packages 
instead, since they are easier to manage and better integrated with our OS.


For now I have rolled back to 9.1 this morning and it is working fine, 
so I'm not in any rush. I just want an estimate of when I should try 
upgrading to 9.2 from experimental again, after 9.2.2 is released.


Cheers,
Rodrigo.





Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-07 Thread Rodrigo Rosenfeld Rosas

On 06-11-2012 19:11, Merlin Moncure wrote:

On Tue, Nov 6, 2012 at 1:45 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

Rodrigo Rosenfeld Rosas <rr.ro...@gmail.com> writes:

On 06-11-2012 17:24, Tom Lane wrote:

Can you put together a self-contained test case to duplicate these
results?  I'm prepared to believe there's some sort of planner
regression involved here, but we'll never find it without a test case.

I'd love to, but I'm afraid I won't have time to do this any time soon.
Maybe on Sunday. I'll see if I can get a script to generate the database
on Sunday and hope for it to replicate the issue.
Would you mind if I coded it using Ruby? (Can you run Ruby code on your
computer?) I mean, for filling it with some sample data.

No objection.

hm, wouldn't timing how long it takes to generate a raw EXPLAIN (that is,
without ANALYZE) give a rough estimate of planning time?  better to
rule it out before the OP goes to the trouble...


This was a great guess! Congrats, Merlin:

PG 9.1 is on port 5433, PG 9.2 on port 5432:

time psql -p 5433 -f slow-explain-only.sql db_name > /dev/null

real    0m0.284s
user    0m0.068s
sys     0m0.012s

time psql -p 5432 -f slow-explain-only.sql db_name > /dev/null

real    2m10.409s
user    0m0.056s
sys     0m0.016s

time psql -p 5433 -f fast-explain-only.sql db_name > /dev/null

real    0m0.264s
user    0m0.064s
sys     0m0.020s

time psql -p 5432 -f fast-explain-only.sql db_name > /dev/null

real    12m25.084s
user    0m0.052s
sys     0m0.020s


This is great news because it makes it easier for me to provide a 
test case, since the results were the same in my test database (which is 
mostly empty):


time psql -p 5432 -f fast-explain-only.sql db_test > /dev/null

real    6m0.414s
user    0m0.064s
sys     0m0.024s

I'm in Brazil, which is 3 hours behind NY, where my client is. Later, 
when they start their day, I'll ask them whether I can send our plain 
database schema to make it even easier. Otherwise, if they prefer me to 
create another database schema or to drop the unrelated tables first, 
I'll do that. Maybe they would be afraid of SQL injection attacks, 
although I believe our applications are currently free of errors of this 
nature.


Thank you so much for narrowing down the real problem with 9.2.

After this regression is fixed in 9.2, I'd like to know whether it would 
be possible to optimize the planner so that slow.sql could perform as 
well as fast.sql. I believe the unique index on (transaction_id, type_id) 
helps slow.sql perform better, but if the planner could be smart 
enough to understand that slow.sql and fast.sql are equivalent, I'd 
prefer to use slow.sql instead of fast.sql, as it reads better, is 
easier to maintain and write tests for, and reduces our database log files.


Cheers,
Rodrigo.





Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-07 Thread Rodrigo Rosenfeld Rosas

On 06-11-2012 19:48, Merlin Moncure wrote:

On Tue, Nov 6, 2012 at 12:57 PM, Rodrigo Rosenfeld Rosas
<rr.ro...@gmail.com> wrote:

I would strongly consider investigation of hstore type along with
gist/gin index.
select * from company_transaction where contract_attributes @>
'State=>Delaware, Paid=>Y';
etc


I'm not very familiar with hstore yet, but this was one of the reasons I
wanted to migrate to PG 9.2. Still, I won't be able to migrate the
application quickly to use hstore.

sure -- it's a major change.  note though that 9.1 hstore has
everything you need.


Great to know.


Also, I'm not sure if hstore allows us to be as flexible as we currently are
(c1 and (c2 or c3 and not (c4 and c5))). c == condition

you're not gated from that functionality, although making complicated
expressions might require some thought and defeat some or all of the GiST
optimization. That said, nothing is keeping you from doing:

where fields @> 'c1=>true, c2=>45' and not (fields @> 'c3=>false, c4=>xyz');

range searches would completely bypass GiST, so that:
select * from foo where attributes -> 'somekey' between 'value1' and 'value2';

would work but would be brute force.  Still, with a little bit of
thought, you should be able to optimize most common cases, and when it
boils down to a straight filter (a and b and c) you'll get a query that
is orders of magnitude faster.


Then I'm not sure whether hstore would speed up anything, because except 
for boolean fields most types won't use the equality (=) operator.


For instance, for numeric types (number, percent, currency) and dates it 
is more usual to use something like (<), (>) or BETWEEN than (=). For 
strings we use the ILIKE operator instead of (=).
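
To make this concrete, the hstore idiom under discussion would look roughly like this (a sketch assuming an hstore column named contract_attributes, per Merlin's example; only the containment predicates can use the GIN index, which is exactly the limitation for range and ILIKE searches):

```sql
-- Sketch: equality-style lookups are indexable via containment (@>);
-- values extracted with -> are text, so ranges fall back to brute force.
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE INDEX company_transaction_attrs_gin
    ON company_transaction USING gin (contract_attributes);

-- indexable containment query:
SELECT * FROM company_transaction
WHERE contract_attributes @> 'State=>Delaware, Paid=>Y';

-- non-indexable range query (sequential filter over extracted values):
SELECT * FROM company_transaction
WHERE (contract_attributes -> 'Price')::numeric BETWEEN 100 AND 200;
```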



Barring that, I would then consider complete elimination of integer

proxies for your variables.  They make your query virtually impossible
to read/write, and they don't help.

I'm not sure I understood what you're talking about. The template is
dynamic and contains lots of information for each field, like type (number,
percent, string, date, etc.), parent_id (self-referencing), aggregator_id
(also self-referencing) and several other columns. But the values associate
the field id (type_id) and the transaction id in a unique way (see the unique
index in my first message of the thread). Then I need different tables to
store the actual values because we're using SQL instead of MongoDB or
something else. The table that stores the value depends on the field type.

Well, that's probably a mistake.  It's probably better to have a
single table with a text field (which is basically a variant) and a
'type' column storing the type of it if you need special handling down
the line.


This would require tons of run-time conversions that would not be 
indexable (dates, numbers, etc.). I thought that approach would be much 
slower. The user can also sort the results by any field, and the sort 
operation could become too slow with all those run-time conversions 
in place.
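
One hedged middle ground for the indexability concern: with a single text-valued table, partial expression indexes can keep the common comparisons indexable per data type. A sketch (table and column names are illustrative, not from the schema dump):

```sql
CREATE TABLE condition_value (
    id             serial  PRIMARY KEY,
    transaction_id integer NOT NULL,
    type_id        integer NOT NULL,
    data_type      text    NOT NULL,  -- e.g. 'string' | 'number' | 'date'
    value          text    NOT NULL
);

-- The text-to-numeric cast is immutable, so it is allowed in an index
-- expression; queries filtering on (value)::numeric with the matching
-- data_type predicate can then use the index instead of converting at
-- run time.  (Date casts depend on DateStyle and are NOT immutable, so
-- dates would need an ISO-8601 text representation, which at least
-- sorts correctly as text.)
CREATE INDEX condition_value_num_idx
    ON condition_value (((value)::numeric))
    WHERE data_type = 'number';
```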



One thing I'm sure of is that abstracting type behind
type_id is doing nothing but creating needless extra work.


You said that in the other message, and I asked for an example when I 
told you why I need a separate table for storing all field data. I still 
don't understand what you mean, which is why I asked for an example. I 
guess the main problem here is terminology, because when I joined this 
project I had the same problems I think you're having in understanding 
the query.


Currently there is a "condition_type" table that actually should be 
called "contract_fields", as it contains the possible fields to be 
extracted from some contract, using our clients' terminology. In this 
table we find the label of the field, its actual data type (string, 
currency, date, etc.), among several other database fields.


So, "type_id" should actually be called "field_id" or 
"contract_field_id". It doesn't hold only the data type.


Then we have a table called "transaction_condition", which I would call 
"field_value" or "transaction_field_value" (I simplified before, since 
a transaction can have multiple contracts but the field is actually part 
of a transaction, not really of some contract; we have a "references" 
table that joins the contract and the position (paragraph, etc.) in the 
contract to the transaction).


So I can see two options here. We could either have a column of each 
type in "transaction_condition" (or "field_value", as I would call it) 
and create an index for each column, or we could have different tables 
to store the values. It wasn't me who decided which approach to take 
some years ago when this database was designed (I had not joined the 
project by then). But I'm not sure which approach I would have taken 
either. I would probably perform some benchmarks first before deciding 
which one to choose.


But I guess you're seeing a third approach I'm unable to understand, 
although I'd love to understand your proposal. Could you please provide 
some example?



You're doing all kinds of acrobatics to fight the 

Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-07 Thread Merlin Moncure
On Wed, Nov 7, 2012 at 5:16 AM, Rodrigo Rosenfeld Rosas
<rr.ro...@gmail.com> wrote:
 On 06-11-2012 19:11, Merlin Moncure wrote:

 On Tue, Nov 6, 2012 at 1:45 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

 Rodrigo Rosenfeld Rosas <rr.ro...@gmail.com> writes:

 On 06-11-2012 17:24, Tom Lane wrote:

 Can you put together a self-contained test case to duplicate these
 results?  I'm prepared to believe there's some sort of planner
 regression involved here, but we'll never find it without a test case.

 I'd love to, but I'm afraid I won't have time to do this any time soon.
 Maybe on Sunday. I'll see if I can get a script to generate the database
 on Sunday and hope for it to replicate the issue.
 Would you mind if I coded it using Ruby? (can you run Ruby code in your
 computer?) I mean, for filling with some sample data.

 No objection.

 hm, wouldn't timing the time to generate a raw EXPLAIN (that is,
 without ANALYZE) give a rough estimate of planning time?   better to
 rule it out before OP goes to the trouble...

 This was a great guess! Congrats, Merlin:

Heh -- that was Tom's guess, not mine.  What this does is confirm the
planner regression, and that elevates the importance of Tom's request
for sample data so we (he) can fix it.

merlin




Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-07 Thread Tom Lane
Merlin Moncure <mmonc...@gmail.com> writes:
 On Wed, Nov 7, 2012 at 5:16 AM, Rodrigo Rosenfeld Rosas
 <rr.ro...@gmail.com> wrote:
 This was a great guess! Congrats, Merlin:

 Heh -- that was Tom's guess, not mine.  What this does is confirm the
 planner regression and that elevates the importance of Tom's request
 to get sample data so we (he) can fix it.

Well, the fact that it's a planner runtime problem and not a
quality-of-plan problem is new information (I'd been assuming the
latter).  Given that, it's possible it's already fixed:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=ca2d6a6cef5740b29406980eb8d21d44da32634b
but I'd still want to see a test case to be sure.  In any case,
it's not clear what's the critical difference between the fast and
slow versions of the query.

regards, tom lane




Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-07 Thread Rodrigo Rosenfeld Rosas

On 07-11-2012 14:58, Tom Lane wrote:

Merlin Moncure <mmonc...@gmail.com> writes:

On Wed, Nov 7, 2012 at 5:16 AM, Rodrigo Rosenfeld Rosas
<rr.ro...@gmail.com> wrote:

This was a great guess! Congrats, Merlin:

Heh -- that was Tom's guess, not mine.  What this does is confirm the
planner regression and that elevates the importance of Tom's request
to get sample data so we (he) can fix it.

Well, the fact that it's a planner runtime problem and not a
quality-of-plan problem is new information (I'd been assuming the
latter).  Given that, it's possible it's already fixed:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=ca2d6a6cef5740b29406980eb8d21d44da32634b
but I'd still want to see a test case to be sure.  In any case,
it's not clear what's the critical difference between the fast and
slow versions of the query.

regards, tom lane


Ok, I could finally strip part of my database schema that will allow you 
to run the explain query and reproduce the issue.


There is a simple SQL dump in plain format that you can restore both on 
9.1 and 9.2 and an example EXPLAIN query so that you can see the 
difference between both versions.


Please keep me up to date with regards to any progress. Let me know if 
the commit above fixed this issue.


Thanks in advance,

Rodrigo.



use_case.tar.bz2
Description: application/tbz



Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-07 Thread Tom Lane
Rodrigo Rosenfeld Rosas <rr.ro...@gmail.com> writes:
 Ok, I could finally strip part of my database schema that will allow you 
 to run the explain query and reproduce the issue.

 There is a simple SQL dump in plain format that you can restore both on 
 9.1 and 9.2 and an example EXPLAIN query so that you can see the 
 difference between both versions.

 Please keep me up to date with regards to any progress. Let me know if 
 the commit above fixed this issue.

AFAICT, HEAD and 9.2 branch tip plan this query a bit faster than 9.1
does.  It does appear that the problem is the same one fixed in that
recent commit: the problem is you've got N join clauses all involving
t.id and so there are lots of redundant ways to use the index on t.id.

I've got to say though that this is one of the most bizarre database
schemas I've ever seen.  It seems to be sort of an unholy combination of
EAV and a star schema.  A star schema might not actually be a bad model
for what you're trying to do, but what you want for that is one big fact
table and a collection of *small* detail tables you join to it (small
meaning just one entry per possible value).  The way this is set up, you
need to join two or three tables before you can even join to the main
fact table - and those tables don't even have the virtue of being small.
That's never going to perform well.

regards, tom lane




[PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-06 Thread Rodrigo Rosenfeld Rosas
Hello, this is my first message to this list, so sorry if this is not 
the right place to discuss this or if some data is missing from this 
message.


I'll gladly send any data you request that would help us understand 
this issue. I don't believe I'm allowed to share the actual database 
dump, but other than that I can provide any further details you might 
ask for.


I can't understand why PG 9.2 performs so differently from PG 9.1.

I tested these queries on my Debian unstable amd64 box after restoring 
the same database dump this morning in both PG 9.1 (Debian unstable 
repository) and PG 9.2 (Debian experimental repository) with the same 
settings:


https://gist.github.com/3f1f3aad3847155e1e35

Ignore all lines like the one below, because it makes no difference in 
my tests if I remove them or any other column from the SELECT clause:


  exists(select id from condition_document_excerpt where 
condition_id=c1686.id) as v1686_has_reference,


The results below are pretty much the same if you assume SELECT 1 FROM 



I have proper indices created for all tables, and the query is fast in 
both PG versions when I don't use many conditions in the WHERE clause.


fast.sql returns the same data as slow.sql, but it completes much faster 
in my tests with PG 9.1.


So here are the completion times for each query on each PG version:

Query    | PG 9.1         | PG 9.2 |
---------|----------------|--------|
fast.sql | 650 ms (0.65s) | 690s   |
slow.sql | 419s           | 111s   |


For the curious, the results would be very similar to slow.sql if I used 
inner joins with the conditions inside the WHERE moved to the ON 
clause of the inner join, instead of the left outer join + global WHERE 
approach. But I don't have this option anyway, because this query is 
generated dynamically and not all my queries are ALL-like queries.


Here are the relevant indices (id is SERIAL primary key in all tables):

CREATE UNIQUE INDEX transaction_condition_transaction_id_type_id_idx
  ON transaction_condition
  USING btree
  (transaction_id, type_id);
CREATE INDEX index_transaction_condition_on_transaction_id
  ON transaction_condition
  USING btree
  (transaction_id);
CREATE INDEX index_transaction_condition_on_type_id
  ON transaction_condition
  USING btree
  (type_id);

CREATE INDEX acquirer_target_names
  ON company_transaction
  USING btree
  (acquiror_company_name COLLATE pg_catalog."default",
   target_company_name COLLATE pg_catalog."default");

CREATE INDEX index_company_transaction_on_target_company_name
  ON company_transaction
  USING btree
  (target_company_name COLLATE pg_catalog."default");
CREATE INDEX index_company_transaction_on_date
  ON company_transaction
  USING btree
  (date);
CREATE INDEX index_company_transaction_on_edit_status
  ON company_transaction
  USING btree
  (edit_status COLLATE pg_catalog."default");

CREATE UNIQUE INDEX index_condition_boolean_value_on_condition_id
  ON condition_boolean_value
  USING btree
  (condition_id);
CREATE INDEX index_condition_boolean_value_on_value_and_condition_id
  ON condition_boolean_value
  USING btree
  (value COLLATE pg_catalog."default", condition_id);

CREATE UNIQUE INDEX index_condition_option_value_on_condition_id
  ON condition_option_value
  USING btree
  (condition_id);
CREATE INDEX index_condition_option_value_on_value_id_and_condition_id
  ON condition_option_value
  USING btree
  (value_id, condition_id);


CREATE INDEX index_condition_option_label_on_type_id_and_position
  ON condition_option_label
  USING btree
  (type_id, position);
CREATE INDEX index_condition_option_label_on_type_id_and_value
  ON condition_option_label
  USING btree
  (type_id, value COLLATE pg_catalog."default");


CREATE UNIQUE INDEX index_condition_string_value_on_condition_id
  ON condition_string_value
  USING btree
  (condition_id);
CREATE INDEX index_condition_string_value_on_value_and_condition_id
  ON condition_string_value
  USING btree
  (value COLLATE pg_catalog."default", condition_id);


Please let me know of any suggestions on how to get similar results 
in PG 9.2, as well as to understand why fast.sql performs so much 
better than slow.sql on PG 9.1.


Best,
Rodrigo.


Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-06 Thread Merlin Moncure
On Tue, Nov 6, 2012 at 11:11 AM, Rodrigo Rosenfeld Rosas
<rr.ro...@gmail.com> wrote:
 Hello, this is my first message to this list, so sorry if this is not the
 right place to discuss this or if some data is missing from this message.

 I'll gladly send any data you request that would help us to understand this
 issue. I don't believe I'm allowed to share the actual database dump, but
 other than that I can provide much more details you might ask for.

 I can't understand why PG 9.2 performs so differently from PG 9.1.

 I tested these queries in my Debian unstable amd64 box after restoring the
 same database dump this morning in both PG 9.1 (Debian unstable repository)
 and PG9.2 (Debian experimental repository) with same settings:

 https://gist.github.com/3f1f3aad3847155e1e35

 Ignore all lines like the line below because it doesn't make any difference
 on my tests if I just remove them or any other column from the SELECT
 clause:

   exists(select id from condition_document_excerpt where
 condition_id=c1686.id) as v1686_has_reference,

 The results below are pretty much the same if you assume SELECT 1 FROM
 

 I have proper indices created for all tables and the query  is fast in both
 PG versions when I don't use many conditions in the WHERE clause.

 fast.sql returns the same data as slow.sql but it returns much faster in my
 tests with PG 9.1.

 So here are the completion times for each query on each PG version:

 Query   | PG 9.1 | PG 9.2 |
 ---
 fast.sql| 650 ms (0.65s) | 690s   |
 slow.sql| 419s   | 111s   |


 For the curious, the results would be very similar to slow.sql if I use
 inner joins with the conditions inside the WHERE moved to the ON clause of
 the inner join instead of the left outer join + global WHERE approach. But I
 don't have this option anyway because this query is generated dynamically
 and not all my queries are ALL-like queries.

 Here are the relevant indices (id is SERIAL primary key in all tables):

 CREATE UNIQUE INDEX transaction_condition_transaction_id_type_id_idx
   ON transaction_condition
   USING btree
   (transaction_id, type_id);
 CREATE INDEX index_transaction_condition_on_transaction_id
   ON transaction_condition
   USING btree
   (transaction_id);
 CREATE INDEX index_transaction_condition_on_type_id
   ON transaction_condition
   USING btree
   (type_id);

 CREATE INDEX acquirer_target_names
   ON company_transaction
   USING btree
   (acquiror_company_name COLLATE pg_catalog.default, target_company_name
 COLLATE pg_catalog.default);
 CREATE INDEX index_company_transaction_on_target_company_name
   ON company_transaction
   USING btree
   (target_company_name COLLATE pg_catalog.default);
 CREATE INDEX index_company_transaction_on_date
   ON company_transaction
   USING btree
   (date);
 CREATE INDEX index_company_transaction_on_edit_status
   ON company_transaction
   USING btree
   (edit_status COLLATE pg_catalog.default);

 CREATE UNIQUE INDEX index_condition_boolean_value_on_condition_id
   ON condition_boolean_value
   USING btree
   (condition_id);
 CREATE INDEX index_condition_boolean_value_on_value_and_condition_id
   ON condition_boolean_value
   USING btree
   (value COLLATE pg_catalog.default, condition_id);

 CREATE UNIQUE INDEX index_condition_option_value_on_condition_id
   ON condition_option_value
   USING btree
   (condition_id);
 CREATE INDEX index_condition_option_value_on_value_id_and_condition_id
   ON condition_option_value
   USING btree
   (value_id, condition_id);


 CREATE INDEX index_condition_option_label_on_type_id_and_position
   ON condition_option_label
   USING btree
   (type_id, position);
 CREATE INDEX index_condition_option_label_on_type_id_and_value
   ON condition_option_label
   USING btree
   (type_id, value COLLATE pg_catalog.default);


 CREATE UNIQUE INDEX index_condition_string_value_on_condition_id
   ON condition_string_value
   USING btree
   (condition_id);
 CREATE INDEX index_condition_string_value_on_value_and_condition_id
   ON condition_string_value
   USING btree
   (value COLLATE pg_catalog.default, condition_id);


 Please let me know of any suggestions on how to try to get similar results
 in PG 9.2 as well as to understand why fast.sql performs so much better than
 slow.sql on PG 9.1.


need EXPLAIN ANALYZE for 9.1 vs 9.2.  Use this site to post the info:
http://explain.depesz.com/

looking at your query -- it's a fair possibility that the root cause
of your issue is your database schema and organization.  It's hard to
tell for sure, but it looks like you might have dived head first into
the EAV anti-pattern -- deconstructing your data to such a degree that
accurate statistics and query plans are difficult or impossible.  I
mean this in the most constructive way possible naturally.  If that is
indeed the case a good plan is going to be sheer luck as the database
is essentially guessing.

Problem could also be no statistics (run ANALYZE to test) or some

Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-06 Thread Tom Lane
Merlin Moncure <mmonc...@gmail.com> writes:
 Problem could also be no statistics (run ANALYZE to test) or some
 other configuration problem (like index locale), or a bona fide
 regression.

I'm wondering about join_collapse_limit in particular --- if that wasn't
cranked up in the 9.1 installation, it would be pure luck if you got a
good query plan for an example like this.  Maybe that and/or other
parameter settings didn't get transposed to the 9.2 installation.
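
A quick way to test this hypothesis, session by session (sketch; 20 is an arbitrary illustrative value, and both parameters default to 8):

```sql
-- Compare plan quality and planning time before and after raising
-- the collapse limits on the slow installation.
SHOW join_collapse_limit;        -- default is 8
SET join_collapse_limit = 20;    -- illustrative value
SET from_collapse_limit = 20;
-- then re-run EXPLAIN on the problem query in the same session
```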

regards, tom lane




Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-06 Thread k...@rice.edu
Hi Rodrigo,

It looks like a lot of joins, and 9.2 does some optimizations that
internally add additional joins. Did you try raising the
join_collapse_limit and maybe the from_collapse_limit from their
default values of 8?

Regards,
Ken

On Tue, Nov 06, 2012 at 03:11:58PM -0200, Rodrigo Rosenfeld Rosas wrote:
 Hello, this is my first message to this list, so sorry if this is
 not the right place to discuss this or if some data is missing from
 this message.
 
 I'll gladly send any data you request that would help us to
 understand this issue. I don't believe I'm allowed to share the
 actual database dump, but other than that I can provide much more
 details you might ask for.
 
 I can't understand why PG 9.2 performs so differently from PG 9.1.
 
 I tested these queries in my Debian unstable amd64 box after
 restoring the same database dump this morning in both PG 9.1 (Debian
 unstable repository) and PG9.2 (Debian experimental repository) with
 same settings:
 
 https://gist.github.com/3f1f3aad3847155e1e35
 
 Ignore all lines like the line below because it doesn't make any
 difference on my tests if I just remove them or any other column
 from the SELECT clause:
 
   exists(select id from condition_document_excerpt where
 condition_id=c1686.id) as v1686_has_reference,
 
 The results below are pretty much the same if you assume SELECT 1
 FROM 
 
 I have proper indices created for all tables and the query  is fast
 in both PG versions when I don't use many conditions in the WHERE
 clause.
 
 fast.sql returns the same data as slow.sql but it returns much
 faster in my tests with PG 9.1.
 
 So here are the completion times for each query on each PG version:
 
 Query    | PG 9.1         | PG 9.2 |
 ---------|----------------|--------|
 fast.sql | 650 ms (0.65s) | 690s   |
 slow.sql | 419s           | 111s   |
 
 
 For the curious, the results would be very similar to slow.sql if I
 use inner joins with the conditions inside the WHERE moved to the
 ON clause of the inner join instead of the left outer join +
 global WHERE approach. But I don't have this option anyway because
 this query is generated dynamically and not all my queries are
 ALL-like queries.
 
 Here are the relevant indices (id is SERIAL primary key in all tables):
 
 CREATE UNIQUE INDEX transaction_condition_transaction_id_type_id_idx
   ON transaction_condition
   USING btree
   (transaction_id, type_id);
 CREATE INDEX index_transaction_condition_on_transaction_id
   ON transaction_condition
   USING btree
   (transaction_id);
 CREATE INDEX index_transaction_condition_on_type_id
   ON transaction_condition
   USING btree
   (type_id);
 
 CREATE INDEX acquirer_target_names
   ON company_transaction
   USING btree
   (acquiror_company_name COLLATE pg_catalog.default,
 target_company_name COLLATE pg_catalog.default);
 CREATE INDEX index_company_transaction_on_target_company_name
   ON company_transaction
   USING btree
   (target_company_name COLLATE pg_catalog.default);
 CREATE INDEX index_company_transaction_on_date
   ON company_transaction
   USING btree
   (date);
 CREATE INDEX index_company_transaction_on_edit_status
   ON company_transaction
   USING btree
   (edit_status COLLATE pg_catalog.default);
 
 CREATE UNIQUE INDEX index_condition_boolean_value_on_condition_id
   ON condition_boolean_value
   USING btree
   (condition_id);
 CREATE INDEX index_condition_boolean_value_on_value_and_condition_id
   ON condition_boolean_value
   USING btree
   (value COLLATE pg_catalog.default, condition_id);
 
 CREATE UNIQUE INDEX index_condition_option_value_on_condition_id
   ON condition_option_value
   USING btree
   (condition_id);
 CREATE INDEX index_condition_option_value_on_value_id_and_condition_id
   ON condition_option_value
   USING btree
   (value_id, condition_id);
 
 
 CREATE INDEX index_condition_option_label_on_type_id_and_position
   ON condition_option_label
   USING btree
   (type_id, position);
 CREATE INDEX index_condition_option_label_on_type_id_and_value
   ON condition_option_label
   USING btree
   (type_id, value COLLATE pg_catalog.default);
 
 
 CREATE UNIQUE INDEX index_condition_string_value_on_condition_id
   ON condition_string_value
   USING btree
   (condition_id);
 CREATE INDEX index_condition_string_value_on_value_and_condition_id
   ON condition_string_value
   USING btree
   (value COLLATE pg_catalog.default, condition_id);
 
 
 Please let me know of any suggestions on how to try to get similar
 results in PG 9.2 as well as to understand why fast.sql performs so
 much better than slow.sql on PG 9.1.
 
 Best,
 Rodrigo.




Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-06 Thread Rodrigo Rosenfeld Rosas

Hi Merlin,

Em 06-11-2012 15:22, Merlin Moncure escreveu:

On Tue, Nov 6, 2012 at 11:11 AM, Rodrigo Rosenfeld Rosas
rr.ro...@gmail.com  wrote:

[Rodrigo's original message quoted in full -- snipped; identical to the first message in this thread]


need explain analyze for 9.1 vs 9.2.  use this site:
http://explain.depesz.com/ to post info.


http://explain.depesz.com/s/ToX (fast on 9.1)
http://explain.depesz.com/s/65t (fast on 9.2)
http://explain.depesz.com/s/gZm (slow on 9.1)
http://explain.depesz.com/s/END (slow on 9.2 - funny that the generated 
URL was END while this was my last explain :D )



looking at your query -- it's a fair possibility that the root cause
of your issue is your database schema and organization.  It's hard to
tell for sure, but it looks like you might have dived head first into
the EAV anti-pattern -- deconstructing your data to such a degree that
accurate statistics and query plans are difficult or impossible. [...]

Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-06 Thread Rodrigo Rosenfeld Rosas
I've raised both to 25 in PG 9.2 and reloaded the server. Didn't make 
any difference. :(


Thanks for the suggestion anyway.

Cheers,
Rodrigo.

Em 06-11-2012 16:08, k...@rice.edu escreveu:

Hi Rodrigo,

It looks like a lot of joins and 9.2 does some optimizations that
internally add additional joins. Did you try raising the
join_collapse_limit and maybe the from_collapse_limit from the
default values of 8?

Regards,
Ken

On Tue, Nov 06, 2012 at 03:11:58PM -0200, Rodrigo Rosenfeld Rosas wrote:

[original message quoted in full -- snipped; identical to the first message in this thread]






Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-06 Thread Rodrigo Rosenfeld Rosas

Em 06-11-2012 15:36, Tom Lane escreveu:

Merlin Moncure mmonc...@gmail.com writes:

Problem could also be no statistics (run ANALYZE to test) or some
other configuration problem (like index locale), or a bona fide
regression.

I'm wondering about join_collapse_limit in particular --- if that wasn't
cranked up in the 9.1 installation, it would be pure luck if you got a
good query plan for an example like this.


I tried increasing it from 8 to 25 and it didn't make any difference.

   Maybe that and/or other
parameter settings didn't get transposed to the 9.2 installation.


diff /etc/postgresql/9.[12]/main/postgresql.conf

41c41
< data_directory = '/var/lib/postgresql/9.1/main'   # use data in another directory
---
> data_directory = '/var/lib/postgresql/9.2/main'   # use data in another directory
43c43
< hba_file = '/etc/postgresql/9.1/main/pg_hba.conf'   # host-based authentication file
---
> hba_file = '/etc/postgresql/9.2/main/pg_hba.conf'   # host-based authentication file
45c45
< ident_file = '/etc/postgresql/9.1/main/pg_ident.conf'   # ident configuration file
---
> ident_file = '/etc/postgresql/9.2/main/pg_ident.conf'   # ident configuration file
49c49
< external_pid_file = '/var/run/postgresql/9.1-main.pid'   # write an extra PID file
---
> external_pid_file = '/var/run/postgresql/9.2-main.pid'   # write an extra PID file
63c63
< port = 5433   # (change requires restart)
---
> port = 5432   # (change requires restart)
556a557,558
> ssl_cert_file = '/etc/ssl/certs/ssl-cert-snakeoil.pem'
> ssl_key_file = '/etc/ssl/private/ssl-cert-snakeoil.key'

Any other idea?




Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-06 Thread Merlin Moncure
On Tue, Nov 6, 2012 at 12:09 PM, Rodrigo Rosenfeld Rosas
rr.ro...@gmail.com wrote:
 http://explain.depesz.com/s/ToX (fast on 9.1)
 http://explain.depesz.com/s/65t (fast on 9.2)
 http://explain.depesz.com/s/gZm (slow on 9.1)
 http://explain.depesz.com/s/END (slow on 9.2 - funny that the generated URL
 was END while this was my last explain :D )

Hm -- looking at your 'slow' 9.2 query, it is reporting that the query
took 3 seconds (reported times are in milliseconds). How are you
timing the data? What happens when you run "explain analyze
your_query" from psql (as in, how long does it take)?

 Let me explain how the application works, how the database was designed and
 hopefully you'll be able to guide me in the correct way to design the
 database for this use case.

 Our application will present a big contract to some attorneys. There is
 currently a dynamic template with around 800 fields to be extracted from
 each contract in our system. These fields can be of different types
 (boolean, string, number, currency, percents, fixed options, dates,
 time-spans and so on). There is a fields tree that is maintained by the
 application editors. The application will allow the attorneys to read the
 contracts and highlight parts of the contract where they extracted each
 field from and associate each field with its value interpreted by the
 attorney and store the reference to what paragraphs in the contract
 demonstrate where the value came from.

 Then there is an interface that will allow clients to search for
 transactions based on its associated contracts and those ~800 fields. For
 the particular query above, 14 of the 800 fields have been searched by this
 particular user (most of them were boolean ones plus a few options and a
 string field). Usually the queries perform much better when less than 10
 fields are used in the criteria. But our client wants us to handle up to 20
 fields in a single query or they won't close the deal and this is a really
 important client to us.

 So, for the time being my only plan is to rollback to PG 9.1 and replace my
 query builder that currently generate queries like slow.sql and change it to
 generate the queries like fast.sql but I'm pretty sure this approach should
 be avoided. I just don't know any other alternative for the time being.

 What database design would you recommend me for this use case?

I would strongly consider investigation of hstore type along with
gist/gin index.
select * from company_transaction where contract_attributes @>
'State=>Delaware, Paid=>Y';
etc
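A minimal sketch of that suggestion, assuming the hstore extension is available (table and key names here are illustrative, borrowed from the example above):

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

-- Store all extracted fields for a transaction in one hstore column.
ALTER TABLE company_transaction ADD COLUMN contract_attributes hstore;

-- A GIN index makes containment searches on many keys fast.
CREATE INDEX company_transaction_attrs_idx
    ON company_transaction USING gin (contract_attributes);

-- "all of these key/value pairs are present" uses the @> containment operator
SELECT *
  FROM company_transaction
 WHERE contract_attributes @> 'State=>Delaware, Paid=>Y'::hstore;
```

With this shape, adding another search field tightens the hstore literal instead of adding another join.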

Barring that, I would then consider complete elimination of integer
proxies for your variables.  They make your query virtually impossible
to read/write, and they don't help.

merlin




Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-06 Thread Rodrigo Rosenfeld Rosas

Em 06-11-2012 16:42, Merlin Moncure escreveu:

On Tue, Nov 6, 2012 at 12:09 PM, Rodrigo Rosenfeld Rosas
rr.ro...@gmail.com  wrote:

http://explain.depesz.com/s/ToX (fast on 9.1)
http://explain.depesz.com/s/65t (fast on 9.2)
http://explain.depesz.com/s/gZm (slow on 9.1)
http://explain.depesz.com/s/END (slow on 9.2 - funny that the generated URL
was END while this was my last explain :D )

Hm -- looking at your 'slow' 9.2 query, it is reporting that the query
took 3 seconds  (reported times are in milliseconds).  How are you
timing the data?  What happens when you run explain analyze
your_query  from psql (as in, how long does it take)?


The times I reported in the tables of my first message were the times
reported by pgAdmin3 (compiled from source).


But I get similar time when I run like this:

time psql -p 5432 -f slow.sql db_name > slow-9.2-again.explain

real1m56.353s
user0m0.068s
sys 0m0.020s

slow-9.2-again.explain: http://explain.depesz.com/s/zF1


[explanation of the application and schema -- snipped; quoted in full earlier in the thread]

What database design would you recommend me for this use case?

I would strongly consider investigation of hstore type along with
gist/gin index.
select * from company_transaction where contract_attributes @>
'State=>Delaware, Paid=>Y';
etc


I'm not very familiar with hstore yet. It was one of the reasons I
wanted to migrate to PG 9.2, but I won't be able to migrate the
application to hstore quickly.


Also, I'm not sure if hstore allows us to be as flexible as we currently 
are (c1 and (c2 or c3 and not (c4 and c5))). c == condition



Barring that, I would then consider complete elimination of integer
proxies for your variables.  They make your query virtually impossible
to read/write, and they don't help.


I'm not sure if I understood what you're talking about. The template is 
dynamic and contains lots of information for each field, like type 
(number, percent, string, date, etc), parent_id (auto-referencing), 
aggregator_id (also auto-referencing) and several other columns. But the 
values associate the field id (type_id) and the transaction id in a 
unique way (see unique index in my first message of the thread). Then I 
need different tables to store the actual value because we're using SQL 
instead of MongoDB or something else. The table that stores the value 
depend on the field type.


Maybe it would help me to understand if you could provide some example 
for the design you're proposing.


Thank you very much,
Rodrigo.



Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-06 Thread Tom Lane
Rodrigo Rosenfeld Rosas rr.ro...@gmail.com writes:
 Em 06-11-2012 16:42, Merlin Moncure escreveu:
 Hm -- looking at your 'slow' 9.2 query, it is reporting that the query
 took 3 seconds  (reported times are in milliseconds).  How are you
 timing the data?  What happens when you run explain analyze
 your_query  from psql (as in, how long does it take)?

 The time I reported in the tables of my first message were the time 
 reported by pgAdmin3 (compiled from source).

 But I get similar time when I run like this:

 time psql -p 5432 -f slow.sql db_name > slow-9.2-again.explain

 real1m56.353s
 user0m0.068s
 sys 0m0.020s

 slow-9.2-again.explain: http://explain.depesz.com/s/zF1

But that again shows only five seconds runtime.  If you repeat the query
several dozen times in a row, run the same way each time, do you get
consistent timings?

Can you put together a self-contained test case to duplicate these
results?  I'm prepared to believe there's some sort of planner
regression involved here, but we'll never find it without a test case.

regards, tom lane




Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-06 Thread Rodrigo Rosenfeld Rosas

Em 06-11-2012 17:24, Tom Lane escreveu:

Rodrigo Rosenfeld Rosas rr.ro...@gmail.com writes:

Em 06-11-2012 16:42, Merlin Moncure escreveu:

Hm -- looking at your 'slow' 9.2 query, it is reporting that the query
took 3 seconds  (reported times are in milliseconds).  How are you
timing the data?  What happens when you run explain analyze
your_query   from psql (as in, how long does it take)?

The time I reported in the tables of my first message were the time
reported by pgAdmin3 (compiled from source).
But I get similar time when I run like this:
time psql -p 5432 -f slow.sql db_name > slow-9.2-again.explain
real1m56.353s
user0m0.068s
sys 0m0.020s
slow-9.2-again.explain: http://explain.depesz.com/s/zF1

But that again shows only five seconds runtime.  If you repeat the query
several dozen times in a row, run the same way each time, do you get
consistent timings?


Yes, the timings are consistent here.


Can you put together a self-contained test case to duplicate these
results?  I'm prepared to believe there's some sort of planner
regression involved here, but we'll never find it without a test case.


I'd love to, but I'm afraid I won't have time to do this any time soon. 
Maybe on Sunday. I'll see if I can get a script to generate the database 
on Sunday and hope for it to replicate the issue.


Would you mind if I coded it using Ruby? (can you run Ruby code in your 
computer?) I mean, for filling with some sample data.


Right now I need to concentrate on getting a working solution for 9.1 
and downgrade the database and work in several other requested fixes. 
That is why I'm out of time for writing this test case right now... I'll 
try to find some time on Sunday and will post here if I can replicate.


Thank you so much!

Rodrigo.





Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-06 Thread Tom Lane
Rodrigo Rosenfeld Rosas rr.ro...@gmail.com writes:
 Em 06-11-2012 17:24, Tom Lane escreveu:
 Can you put together a self-contained test case to duplicate these
 results?  I'm prepared to believe there's some sort of planner
 regression involved here, but we'll never find it without a test case.

 I'd love to, but I'm afraid I won't have time to do this any time soon. 
 Maybe on Sunday. I'll see if I can get a script to generate the database 
 on Sunday and hope for it to replicate the issue.

 Would you mind if I coded it using Ruby? (can you run Ruby code in your 
 computer?) I mean, for filling with some sample data.

No objection.

regards, tom lane




Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-06 Thread Merlin Moncure
On Tue, Nov 6, 2012 at 1:45 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Rodrigo Rosenfeld Rosas rr.ro...@gmail.com writes:
 Em 06-11-2012 17:24, Tom Lane escreveu:
 Can you put together a self-contained test case to duplicate these
 results?  I'm prepared to believe there's some sort of planner
 regression involved here, but we'll never find it without a test case.

 I'd love to, but I'm afraid I won't have time to do this any time soon.
 Maybe on Sunday. I'll see if I can get a script to generate the database
 on Sunday and hope for it to replicate the issue.

 Would you mind if I coded it using Ruby? (can you run Ruby code in your
 computer?) I mean, for filling with some sample data.

 No objection.

hm, wouldn't timing how long it takes to generate a raw EXPLAIN (that
is, without ANALYZE) give a rough estimate of planning time?  Better to
rule it out before the OP goes to the trouble...
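In psql that check might look like this (a rough sketch: EXPLAIN without ANALYZE plans the query but does not execute it, so its wall-clock time approximates planning cost):

```sql
\timing on
-- Plan only, no execution: if this alone takes ~100s, planning is the problem.
EXPLAIN SELECT 1;          -- substitute the body of slow.sql here
-- Planning plus execution, for comparison:
EXPLAIN ANALYZE SELECT 1;  -- substitute the body of slow.sql here
```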

merlin




Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-06 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 hm, wouldn't timing the time to generate a raw EXPLAIN (that is,
 without ANALYZE) give a rough estimate of planning time?   better to
 rule it out before OP goes to the trouble...

Well, we still wouldn't know *why* there was a problem ...

regards, tom lane




Re: [PERFORM] Query completed in 1s in PG 9.1 and ~ 700s in PG 9.2

2012-11-06 Thread Merlin Moncure
On Tue, Nov 6, 2012 at 12:57 PM, Rodrigo Rosenfeld Rosas
rr.ro...@gmail.com wrote:
 I would strongly consider investigation of hstore type along with
 gist/gin index.
 select * from company_transaction where contract_attributes @>
 'State=>Delaware, Paid=>Y';
 etc


 I'm not very familiar with hstore yet but this was one of the reasons I
 wanted to migrate to PG 9.2 but I won't be able to migrate the application
 quickly to use hstore.

sure -- it's a major change.  note though that 9.1 hstore has
everything you need.

 Also, I'm not sure if hstore allows us to be as flexible as we currently are
 (c1 and (c2 or c3 and not (c4 and c5))). c == condition

you're not gated from that functionality, although making complicated
expressions might require some thought and may defeat some or all of the
GiST optimization. That said, nothing is keeping you from doing:

where fields @ 'c1=true, c2=45' and not (fields @ 'c3=false, c4=xyz');

Range searches would completely bypass GiST, so that:
select * from foo where attributes -> 'somekey' between 'value1' and 'value2';

would work but would be brute force.  Still, with a little bit of
thought, you should be able to optimize most common cases, and when it
boils down to a straight filter (a and b and c) you'll get an order of
magnitude faster query.
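One standard workaround for the range case, not spelled out above, is a plain btree expression index on the extracted key (table and key names reuse Merlin's hypothetical example; note that -> returns text, so the comparison is textual, not numeric):

```sql
-- Expression index so a range filter on one hstore key can use btree
-- instead of a full scan. Comparison follows text ordering.
CREATE INDEX foo_somekey_idx ON foo ((attributes -> 'somekey'));

SELECT *
  FROM foo
 WHERE (attributes -> 'somekey') BETWEEN 'value1' AND 'value2';
```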

 Barring that, I would then consider complete elimination of integer
 proxies for your variables.  They make your query virtually impossible
 to read/write, and they don't help.

 I'm not sure if I understood what you're talking about. The template is
 dynamic and contains lots of information for each field, like type (number,
 percent, string, date, etc), parent_id (auto-referencing), aggregator_id
 (also auto-referencing) and several other columns. But the values associate
 the field id (type_id) and the transaction id in a unique way (see unique
 index in my first message of the thread). Then I need different tables to
 store the actual value because we're using SQL instead of MongoDB or
 something else. The table that stores the value depend on the field type.

Well, that's probably a mistake.  It's probably better to have a
single table with a text field (which is basically a variant) and a
'type' column storing the type of it if you need special handling down
the line.  One thing I'm sure of is that abstracting type behind
type_id is doing nothing but creating needless extra work.  You're
doing all kinds of acrobatics to fight the schema by hiding it under
various layers of abstraction.
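A sketch of the single-table variant described above (all names are hypothetical; the point is one value table holding the value as text plus a type tag, instead of one table per type):

```sql
-- One value table replacing condition_boolean_value, condition_string_value,
-- condition_option_value, etc. The value is stored as text (a "variant");
-- the field_type column says how to interpret it when needed.
CREATE TABLE condition_value (
    condition_id integer PRIMARY KEY,  -- FK to transaction_condition(id)
    field_type   text    NOT NULL,     -- 'boolean', 'string', 'number', ...
    value        text
);

-- Type-specific handling happens at query time, e.g. for a numeric field:
SELECT condition_id
  FROM condition_value
 WHERE field_type = 'number'
   AND value::numeric BETWEEN 10 AND 20;
```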

merlin

