Re: [PERFORM] Many-to-many performance problem

2016-06-10 Thread Alex Ignatov


On 10.06.2016 16:04, Rowan Seymour wrote:
In our Django app we have messages (currently about 7 million in table 
msgs_message) and labels (about 300), and a join table to associate 
messages with labels (about 500,000 in msgs_message_labels). Not sure 
you'll need them, but here are the relevant table schemas:


CREATE TABLE msgs_message
(
id INTEGER PRIMARY KEY NOT NULL,
type VARCHAR NOT NULL,
text TEXT NOT NULL,
is_archived BOOLEAN NOT NULL,
created_on TIMESTAMP WITH TIME ZONE NOT NULL,
contact_id INTEGER NOT NULL,
org_id INTEGER NOT NULL,
case_id INTEGER,
backend_id INTEGER NOT NULL,
is_handled BOOLEAN NOT NULL,
is_flagged BOOLEAN NOT NULL,
is_active BOOLEAN NOT NULL,
has_labels BOOLEAN NOT NULL,
CONSTRAINT 
msgs_message_contact_id_5c8e3f216c115643_fk_contacts_contact_id 
FOREIGN KEY (contact_id) REFERENCES contacts_contact (id),
CONSTRAINT msgs_message_org_id_81a0adfcc99151d_fk_orgs_org_id 
FOREIGN KEY (org_id) REFERENCES orgs_org (id),
CONSTRAINT msgs_message_case_id_51998150f9629c_fk_cases_case_id 
FOREIGN KEY (case_id) REFERENCES cases_case (id)
);
CREATE UNIQUE INDEX msgs_message_backend_id_key ON msgs_message 
(backend_id);

CREATE INDEX msgs_message_6d82f13d ON msgs_message (contact_id);
CREATE INDEX msgs_message_9cf869aa ON msgs_message (org_id);
CREATE INDEX msgs_message_7f12ca67 ON msgs_message (case_id);

CREATE TABLE msgs_message_labels
(
id INTEGER PRIMARY KEY NOT NULL,
message_id INTEGER NOT NULL,
label_id INTEGER NOT NULL,
CONSTRAINT 
msgs_message_lab_message_id_1dfa44628fe448dd_fk_msgs_message_id 
FOREIGN KEY (message_id) REFERENCES msgs_message (id),
CONSTRAINT 
msgs_message_labels_label_id_77cbdebd8d255b7a_fk_msgs_label_id FOREIGN 
KEY (label_id) REFERENCES msgs_label (id)
);
CREATE UNIQUE INDEX msgs_message_labels_message_id_label_id_key ON 
msgs_message_labels (message_id, label_id);
CREATE INDEX msgs_message_labels_4ccaa172 ON msgs_message_labels 
(message_id);
CREATE INDEX msgs_message_labels_abec2aca ON msgs_message_labels 
(label_id);


Users can search for messages, and they are returned page by page in 
reverse chronological order. There are several partial multi-column 
indexes on the message table, but the one used for the example queries 
below is


CREATE INDEX msgs_inbox ON msgs_message(org_id, created_on DESC)
WHERE is_active = TRUE AND is_handled = TRUE AND is_archived = FALSE 
AND has_labels = TRUE;


So a typical query for the latest page of messages looks like 
(https://explain.depesz.com/s/G9ew):


SELECT "msgs_message".*
FROM "msgs_message"
WHERE ("msgs_message"."org_id" = 7
AND "msgs_message"."is_active" = true
AND "msgs_message"."is_handled" = true
AND "msgs_message"."has_labels" = true
AND "msgs_message"."is_archived" = false
AND "msgs_message"."created_on" < '2016-06-10T07:11:06.381000 
+00:00'::timestamptz

) ORDER BY "msgs_message"."created_on" DESC LIMIT 50

But users can also search for messages that have one or more labels, 
leading to queries that look like:


SELECT DISTINCT "msgs_message".*
FROM "msgs_message"
INNER JOIN "msgs_message_labels" ON ( "msgs_message"."id" = 
"msgs_message_labels"."message_id" )

WHERE ("msgs_message"."org_id" = 7
AND "msgs_message"."is_active" = true
AND "msgs_message"."is_handled" = true
AND "msgs_message_labels"."label_id" IN (127, 128, 135, 136, 137, 
138, 140, 141, 143, 144)

AND "msgs_message"."has_labels" = true
AND "msgs_message"."is_archived" = false
AND "msgs_message"."created_on" < '2016-06-10T07:11:06.381000 
+00:00'::timestamptz

) ORDER BY "msgs_message"."created_on" DESC LIMIT 50

Most of the time, this query performs like
https://explain.depesz.com/s/ksOC (~15ms). It's no longer using the
msgs_inbox index, but it's plenty fast. However, sometimes it performs
like https://explain.depesz.com/s/81c (67000ms).


And if you run it again, it'll be fast again. Am I correct in 
interpreting that second explain as being slow because 
msgs_message_pkey isn't cached? It looks like it read from that index 
3556 times, and each time took 18.559 (?) ms, and that adds up 
to 65,996 ms. The database server says it has lots of free memory, so is
there something I should be doing to keep that index in memory?


Generally speaking, is there a good strategy for optimising queries 
like these which involve two tables?


  * I tried moving the label references into an int array on
msgs_message, and then using btree_gin to create a multi-column
index involving the array column, but that doesn't appear to be
very useful for these ordered queries because it's not an ordered
index.
  * I tried adding created_on to msgs_message_labels table but I
couldn't find a way of avoiding the in-memory sort.
  * Have thought about dynamically creating partial indexes for each
label using an array column on msgs_message to hold label ids, and
index condition like WHERE label_ids && ARRAY[123], but not sure what
other problems I'll run into with hundreds of indexes on the same table.

Re: [PERFORM] Many-to-many performance problem

2016-06-10 Thread Tom Lane
Yves Dorfsman  writes:
> On 2016-06-10 08:13, Tom Lane wrote:
>> It looks like everything is fine as long as all the data the query needs
>> is already in PG's shared buffers.  As soon as it has to go to disk,
>> you're hurting, because disk reads seem to be taking ~10ms on average.

> ->  Index Scan using msgs_message_pkey on msgs_message  
> (cost=0.43..8.04 rows=1 width=47) (actual time=18.550..18.559 rows=0 
> loops=3556)
>   Index Cond: (id = msgs_message_labels.message_id)
>   Filter: (is_active AND is_handled AND has_labels 
> AND (NOT is_archived) AND (created_on < '2016-06-10 
> 07:11:06.381+00'::timestamp with time zone) AND (org_id = 7))
>   Rows Removed by Filter: 1
>   Buffers: shared hit=11032 read=3235 dirtied=5

> Do you mean that it reads the index from disk? Or that it looks things up in 
> the index, and fetches data from disk (based on that lookup)?

The "reads" here might be either index pages or table pages; we can't tell
from EXPLAIN's statistics.  It's probably a good bet that more of them are
table pages than index pages though, just because the index should be a
lot smaller than the table and more fully represented in cache.

As for the numbers, we see that 18.559 * 3556 = 65995 ms were spent in
this indexscan plan node, versus negligible time for the same plan node
when no reads happened.  So we can blame pretty much all that time on
the 3235 disk reads, giving an average per read of just over 20ms.  Some
of the other plan nodes show lower averages, though, so I was conservative
and said "~10 ms".

regards, tom lane


Re: [PERFORM] Many-to-many performance problem

2016-06-10 Thread Yves Dorfsman

I thought this was a really interesting case, and would love to learn from it,
so please bear with me if my questions are naive.

On 2016-06-10 08:13, Tom Lane wrote:
> Rowan Seymour  writes:
>> Most of the time, this query performs like https://explain.depesz.com/s/ksOC
>> (~15ms). It's no longer using the msgs_inbox index, but it's plenty fast.
>> However, sometimes it performs like
>> https://explain.depesz.com/s/81c (67000ms).
>> And if you run it again, it'll be fast again.
> 
> It looks like everything is fine as long as all the data the query needs
> is already in PG's shared buffers.  As soon as it has to go to disk,
> you're hurting, because disk reads seem to be taking ~10ms on average.


->  Index Scan using msgs_message_pkey on msgs_message  
(cost=0.43..8.04 rows=1 width=47) (actual time=18.550..18.559 rows=0 loops=3556)
  Index Cond: (id = msgs_message_labels.message_id)
  Filter: (is_active AND is_handled AND has_labels AND 
(NOT is_archived) AND (created_on < '2016-06-10 07:11:06.381+00'::timestamp 
with time zone) AND (org_id = 7))
  Rows Removed by Filter: 1
  Buffers: shared hit=11032 read=3235 dirtied=5

Do you mean that it reads the index from disk? Or that it looks things up in 
the index, and fetches data from disk (based on that lookup)?
Is the 18ms from the Buffers: read=3235? That's 3235 rows read from disk?

-- 
http://yves.zioup.com
gpg: 4096R/32B0F416 



Re: [PERFORM] Many-to-many performance problem

2016-06-10 Thread Tom Lane
Rowan Seymour  writes:
> Most of the time, this query performs like https://explain.depesz.com/s/ksOC
> (~15ms). It's no longer using the msgs_inbox index, but it's plenty fast.
> However, sometimes it performs like
> https://explain.depesz.com/s/81c (67000ms).
> And if you run it again, it'll be fast again.

It looks like everything is fine as long as all the data the query needs
is already in PG's shared buffers.  As soon as it has to go to disk,
you're hurting, because disk reads seem to be taking ~10ms on average.

> Server is an Amazon RDS instance with default settings and Postgres 9.3.10,

And I think you just explained your problem.  Did you spring for adequate
guaranteed IOPS on this instance?  If not, you need to, or else live with
erratic performance.  If you did, you have a beef to raise with AWS that
you're not getting the performance you paid for.

You might be able to ameliorate matters by raising shared_buffers, but
unless your database has stopped growing, that approach has a limited future.
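
A quick way to see how far the current setting is from covering the working
set (a sketch using standard catalog functions; the relation names are the
ones from the schema at the top of the thread):

SHOW shared_buffers;
SELECT pg_size_pretty(pg_total_relation_size('msgs_message'))        AS msgs_message_total,
       pg_size_pretty(pg_total_relation_size('msgs_message_labels')) AS msgs_message_labels_total,
       pg_size_pretty(pg_database_size(current_database()))          AS whole_database;

Note that on RDS, shared_buffers is changed through the DB parameter group
rather than by editing postgresql.conf.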

regards, tom lane


[PERFORM] Many-to-many performance problem

2016-06-10 Thread Rowan Seymour
In our Django app we have messages (currently about 7 million in table
msgs_message) and labels (about 300), and a join table to associate
messages with labels (about 500,000 in msgs_message_labels). Not sure
you'll need them, but here are the relevant table schemas:

CREATE TABLE msgs_message
(
id INTEGER PRIMARY KEY NOT NULL,
type VARCHAR NOT NULL,
text TEXT NOT NULL,
is_archived BOOLEAN NOT NULL,
created_on TIMESTAMP WITH TIME ZONE NOT NULL,
contact_id INTEGER NOT NULL,
org_id INTEGER NOT NULL,
case_id INTEGER,
backend_id INTEGER NOT NULL,
is_handled BOOLEAN NOT NULL,
is_flagged BOOLEAN NOT NULL,
is_active BOOLEAN NOT NULL,
has_labels BOOLEAN NOT NULL,
CONSTRAINT
msgs_message_contact_id_5c8e3f216c115643_fk_contacts_contact_id FOREIGN KEY
(contact_id) REFERENCES contacts_contact (id),
CONSTRAINT msgs_message_org_id_81a0adfcc99151d_fk_orgs_org_id FOREIGN
KEY (org_id) REFERENCES orgs_org (id),
CONSTRAINT msgs_message_case_id_51998150f9629c_fk_cases_case_id FOREIGN
KEY (case_id) REFERENCES cases_case (id)
);
CREATE UNIQUE INDEX msgs_message_backend_id_key ON msgs_message
(backend_id);
CREATE INDEX msgs_message_6d82f13d ON msgs_message (contact_id);
CREATE INDEX msgs_message_9cf869aa ON msgs_message (org_id);
CREATE INDEX msgs_message_7f12ca67 ON msgs_message (case_id);

CREATE TABLE msgs_message_labels
(
id INTEGER PRIMARY KEY NOT NULL,
message_id INTEGER NOT NULL,
label_id INTEGER NOT NULL,
CONSTRAINT
msgs_message_lab_message_id_1dfa44628fe448dd_fk_msgs_message_id FOREIGN KEY
(message_id) REFERENCES msgs_message (id),
CONSTRAINT
msgs_message_labels_label_id_77cbdebd8d255b7a_fk_msgs_label_id FOREIGN KEY
(label_id) REFERENCES msgs_label (id)
);
CREATE UNIQUE INDEX msgs_message_labels_message_id_label_id_key ON
msgs_message_labels (message_id, label_id);
CREATE INDEX msgs_message_labels_4ccaa172 ON msgs_message_labels
(message_id);
CREATE INDEX msgs_message_labels_abec2aca ON msgs_message_labels (label_id);

Users can search for messages, and they are returned page by page in
reverse chronological order. There are several partial multi-column indexes
on the message table, but the one used for the example queries below is

CREATE INDEX msgs_inbox ON msgs_message(org_id, created_on DESC)
WHERE is_active = TRUE AND is_handled = TRUE AND is_archived = FALSE AND
has_labels = TRUE;

So a typical query for the latest page of messages looks like (
https://explain.depesz.com/s/G9ew):

SELECT "msgs_message".*
FROM "msgs_message"
WHERE ("msgs_message"."org_id" = 7
AND "msgs_message"."is_active" = true
AND "msgs_message"."is_handled" = true
AND "msgs_message"."has_labels" = true
AND "msgs_message"."is_archived" = false
AND "msgs_message"."created_on" < '2016-06-10T07:11:06.381000
+00:00'::timestamptz
) ORDER BY "msgs_message"."created_on" DESC LIMIT 50

But users can also search for messages that have one or more labels,
leading to queries that look like:

SELECT DISTINCT "msgs_message".*
FROM "msgs_message"
INNER JOIN "msgs_message_labels" ON ( "msgs_message"."id" =
"msgs_message_labels"."message_id" )
WHERE ("msgs_message"."org_id" = 7
AND "msgs_message"."is_active" = true
AND "msgs_message"."is_handled" = true
AND "msgs_message_labels"."label_id" IN (127, 128, 135, 136, 137, 138,
140, 141, 143, 144)
AND "msgs_message"."has_labels" = true
AND "msgs_message"."is_archived" = false
AND "msgs_message"."created_on" < '2016-06-10T07:11:06.381000
+00:00'::timestamptz
) ORDER BY "msgs_message"."created_on" DESC LIMIT 50

Most of the time, this query performs like https://explain.depesz.com/s/ksOC
(~15ms). It's no longer using the msgs_inbox index, but it's plenty fast.
However, sometimes it performs like
https://explain.depesz.com/s/81c (67000ms).

And if you run it again, it'll be fast again. Am I correct in interpreting
that second explain as being slow because msgs_message_pkey isn't cached?
It looks like it read from that index 3556 times, and each time took 18.559
(?) ms, and that adds up to 65,996 ms. The database server says it has lots
of free memory, so is there something I should be doing to keep that index
in memory?
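
One way to check how much of that index (and of the two tables) is actually
resident in shared_buffers is the contrib pg_buffercache view. A sketch,
assuming the extension can be installed on the instance:

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- number of 8 kB buffers of each relation currently held in shared_buffers
SELECT c.relname,
       count(*)                        AS buffers,
       pg_size_pretty(count(*) * 8192) AS cached
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
 AND b.reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database())
WHERE c.relname IN ('msgs_message', 'msgs_message_pkey', 'msgs_message_labels')
GROUP BY c.relname
ORDER BY buffers DESC;

Keep in mind that the "shared hit" counts in the plan only cover PostgreSQL's
own shared_buffers; a "read" may still be served quickly from the OS page
cache or slowly from disk, which is what separates the fast runs from the
slow one.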

Generally speaking, is there a good strategy for optimising queries like
these which involve two tables?

   - I tried moving the label references into an int array on msgs_message,
   and then using btree_gin to create a multi-column index involving the array
   column (sketched below), but that doesn't appear to be very useful for these
   ordered queries because it's not an ordered index.
   - I tried adding created_on to msgs_message_labels table but I couldn't
   find a way of avoiding the in-memory sort.
   - Have thought about dynamically creating partial indexes for each label
   using an array column on msgs_message to hold label ids, and index
   condition like WHERE label_ids && ARRAY[123], but not sure what other
   problems I'll run into with hundreds of indexes on the same table.
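
For concreteness, the array-plus-btree_gin experiment from the first bullet
might look roughly like this (the column and index names here are
illustrative, not the ones actually used):

ALTER TABLE msgs_message ADD COLUMN label_ids integer[] NOT NULL DEFAULT '{}';
CREATE EXTENSION IF NOT EXISTS btree_gin;
CREATE INDEX msgs_message_org_labels_gin ON msgs_message
    USING gin (org_id, label_ids)
    WHERE is_active = TRUE AND is_handled = TRUE AND is_archived = FALSE;

-- a label search then filters with the array overlap operator instead of joining:
-- ... WHERE org_id = 7 AND label_ids && ARRAY[127, 128, 135] ...

As the first bullet says, a GIN index returns matching rows in no particular
order, so the ORDER BY created_on DESC LIMIT 50 still has to sort whatever
matches before returning the page.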