Re: [PERFORM] Unexpected expensive index scan

2016-09-30 Thread Jim Nasby

On 9/28/16 1:11 PM, Jake Nielsen wrote:

Beautiful! After changing the random_page_cost to 1.0 the original query
went from ~3.5s to ~35ms. This is exactly the kind of insight I was
fishing for in the original post. I'll keep in mind that the query
planner is very tunable and has these sorts of hardware-related
trade-offs in the future. I can't thank you enough!


Be careful with setting random_page_cost to exactly 1... that tells the 
planner that an index scan has nearly the same cost as a sequential 
scan, which is absolutely never the case, even with the database in 
memory. 1.1 or maybe even 1.01 is probably a safer bet.


Also note that you can set those parameters within a single session, as 
well as within a single transaction. So if you need to force a different 
setting for a single query, you could always do


BEGIN;
SET LOCAL random_page_cost = 1;
SELECT ...
COMMIT; (or rollback...)
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Unexpected expensive index scan

2016-09-28 Thread Jake Nielsen
On Wed, Sep 28, 2016 at 6:04 AM, Tom Lane  wrote:

> [ Please don't re-quote the entire damn thread in each followup. Have
> some respect for your readers' time, and assume that they have already
> seen the previous traffic, or could go look it up if they haven't.
> The point of quoting at all is just to quickly remind people where we
> are in the discussion. ]
>

Sorry, understood.


>
> If you say "well yeah, but it seems to perform fine when I force
> it to use that index anyway", the answer may be that you need to
> adjust random_page_cost.  The default value is OK for tables that
> are mostly sitting on spinning rust, but if your database is
> RAM-resident or SSD-resident you probably want a value closer to 1.
>

Ahhh, this could absolutely be the key right here. I could totally see why
it would make sense for the planner to do what it's doing given that it's
weighting sequential access more favorably than random access.

Beautiful! After changing the random_page_cost to 1.0 the original query
went from ~3.5s to ~35ms. This is exactly the kind of insight I was fishing
for in the original post. I'll keep in mind that the query planner is very
tunable and has these sorts of hardware-related trade-offs in the future. I
can't thank you enough!

Cheers!


Re: [PERFORM] Unexpected expensive index scan

2016-09-28 Thread Tom Lane
[ Please don't re-quote the entire damn thread in each followup. Have
some respect for your readers' time, and assume that they have already
seen the previous traffic, or could go look it up if they haven't.
The point of quoting at all is just to quickly remind people where we
are in the discussion. ]

Jake Nielsen  writes:
> So... it seems that setting the userId to one that has less rows in the
> table results in the index actually being used...
> EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM SyncerEvent WHERe userId =
> '57d35db7353b0d627c0e592f' AND ID > 12468 ORDER BY ID LIMIT 4000;

It looks from the numbers floating around in this thread that the
userId used in your original query actually matches about 50% of
the table.  That would make it unsurprising that the planner doesn't
want to use an index.  A rule of thumb is that a seqscan is going
to be cheaper than an indexscan if your query retrieves, or even
just has to fetch, more than a few percent of the table.

Now, given the existence of an index on (userID, ID) --- in that
order --- I would expect the planner to want to use that index
for a query shaped exactly as you show above.  Basically, it knows
that that just requires starting at the ('57d35db7353b0d627c0e592f',
12468) position in the index and scanning forward for 4000 index
entries; no extraneous table rows will be fetched at all.  If you
increased the LIMIT enough, it'd go over to a seqscan-and-sort to
avoid doing so much random access to the table, but I'd think the
crossover point for that is well above 4000 out of 3.3M rows.

However, as soon as you add any other unindexable conditions,
the situation changes because rows that fail the additional
conditions represent useless fetches.  Now, instead of fetching
4000 rows using the index, it's fetching 4000 times some multiplier.

It's hard to tell for sure given the available info, but I think
that the extra inequalities in your original query reject a pretty
sizable proportion of rows, resulting in the indexscan approach
needing to fetch a great deal more than 4000 rows, making it look
to be more expensive than a seqscan.

I'm not sure why it's preferring the pkey index to the one on
(userID, ID), but possibly that has something to do with that
index being better correlated to the physical table order, resulting
in a prediction of less random I/O when using that index.

So the bottom line is that given your data statistics, there may
well be no really good plan for your original query.  It just
requires fetching a lot of rows, and indexes can't help very much.

If you say "well yeah, but it seems to perform fine when I force
it to use that index anyway", the answer may be that you need to
adjust random_page_cost.  The default value is OK for tables that
are mostly sitting on spinning rust, but if your database is
RAM-resident or SSD-resident you probably want a value closer to 1.

regards, tom lane


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Unexpected expensive index scan

2016-09-27 Thread Jake Nielsen
On Tue, Sep 27, 2016 at 6:24 PM, Jake Nielsen 
wrote:

>
>
> On Tue, Sep 27, 2016 at 6:03 PM, Jake Nielsen 
> wrote:
>
>>
>> On Tue, Sep 27, 2016 at 5:41 PM, Mike Sofen  wrote:
>>
>>> *From:* Jake Nielsen*Sent:* Tuesday, September 27, 2016 5:22 PM
>>>
>>>
>>> the query
>>>
>>> SELECT * FROM SyncerEvent WHERE ID > 12468 AND propogatorId NOT IN
>>> ('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"') AND conflicted != 1 AND
>>> userId = '57dc984f1c87461c0967e228' ORDER BY ID LIMIT 4000;^
>>>
>>>
>>>
>>> On Tue, Sep 27, 2016 at 5:02 PM, Jake Nielsen 
>>> wrote:
>>>
>>> I've got a query that takes a surprisingly long time to run, and I'm
>>> having a really rough time trying to figure it out.
>>>
>>>
>>>
>>> Before I get started, here are the specifics of the situation:
>>>
>>>
>>>
>>> Here is the table that I'm working with (apologies for spammy indices,
>>> I've been throwing shit at the wall)
>>>
>>> Table "public.syncerevent"
>>>
>>> Column|  Type   |Modifiers
>>>
>>>
>>> --+-+---
>>> ---
>>>
>>>  id   | bigint  | not null default nextval('syncerevent_id_seq'::
>>> regclass)
>>>
>>>  userid   | text|
>>>
>>>  event| text|
>>>
>>>  eventid  | text|
>>>
>>>  originatorid | text|
>>>
>>>  propogatorid | text|
>>>
>>>  kwargs   | text|
>>>
>>>  conflicted   | integer |
>>>
>>> Indexes:
>>>
>>> "syncerevent_pkey" PRIMARY KEY, btree (id)
>>>
>>> "syncereventidindex" UNIQUE, btree (eventid)
>>>
>>> "anothersyncereventidindex" btree (userid)
>>>
>>> "anothersyncereventidindexwithascending" btree (userid, id)
>>>
>>> "asdfasdgasdf" btree (userid, id DESC)
>>>
>>> "syncereventuseridhashindex" hash (userid)
>>>
>>>
>>>
>>> To provide some context, as per the wiki,
>>>
>>> there are 3,290,600 rows in this table.
>>>
>>> It gets added to frequently, but never deleted from.
>>>
>>> The "kwargs" column often contains mid-size JSON strings (roughly 30K
>>> characters on average)
>>>
>>> As of right now, the table has 53 users in it. About 20% of those have a
>>> negligible number of events, but the rest of the users have a fairly even
>>> smattering.
>>>
>>>
>>>
>>> EXPLAIN (ANALYZE, BUFFERS) says:
>>>
>>>
>>>   QUERY PLAN
>>>
>>>
>>> 
>>> 
>>> --
>>>
>>>  Limit  (cost=0.43..1218.57 rows=4000 width=615) (actual
>>> time=3352.390..3403.572 rows=4000 loops=1)  Buffers: shared hit=120244
>>> read=160198
>>>
>>>->  Index Scan using syncerevent_pkey on syncerevent
>>> (cost=0.43..388147.29 rows=1274560 width=615) (actual
>>> time=3352.386..3383.100 rows=4000 loops=1)
>>>
>>>  Index Cond: (id > 12468)
>>>
>>>  Filter: ((propogatorid <> 
>>> '"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"'::text)
>>> AND (conflicted <> 1) AND (userid = '57dc984f1c87461c0967e228'::text))
>>>
>>>  Rows Removed by Filter: 1685801
>>>
>>>  Buffers: shared hit=120244 read=160198
>>>
>>>  Planning time: 0.833 ms
>>>
>>>  Execution time: 3407.633 ms
>>>
>>> (9 rows)
>>>
>>> If it matters/interests you, here is my underlying confusion:
>>>
>>> From some internet sleuthing, I've decided that having a table per user
>>> (which would totally make this problem a non-issue) isn't a great idea.
>>> Because there is a file per table, having a table per user would not scale.
>>> My next thought was partial indexes (which would also totally help), but
>>> since there is also a table per index, this really doesn't side-step the
>>> problem. My rough mental model says: If there exists a way that a
>>> table-per-user scheme would make this more efficient, then there should
>>> also exist an index that could achieve the same effect (or close enough to
>>> not matter). I would think that "userid = '57dc984f1c87461c0967e228'" could
>>> utilize at least one of the two indexes on the userId column, but clearly
>>> I'm not understanding something.
>>>
>>> Any help in making this query more efficient would be greatly
>>> appreciated, and any conceptual insights would be extra awesome.
>>>
>>> Thanks for reading.
>>>
>>> -Jake
>>>
>>> --
>>>
>>>
>>>
>>> This stands out:  WHERE ID > 12468 AND propogatorId NOT IN
>>> ('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"')
>>>
>>> As does this from the analyze:  Rows Removed by Filter: 1685801
>>>
>>>
>>>
>>> The propogaterid is practically the only column NOT indexed and it’s
>>> used in a “not in”.  It looks like it’s having to do a table scan for all
>>> the rows above the id cutoff to see if any meet the filter requirement.
>>> “not in” can be very expensive.  An index might help on this column.  Have
>>> you tried that?
>>>
>>>

Re: [PERFORM] Unexpected expensive index scan

2016-09-27 Thread Jake Nielsen
On Tue, Sep 27, 2016 at 6:03 PM, Jake Nielsen 
wrote:

>
> On Tue, Sep 27, 2016 at 5:41 PM, Mike Sofen  wrote:
>
>> *From:* Jake Nielsen*Sent:* Tuesday, September 27, 2016 5:22 PM
>>
>>
>> the query
>>
>> SELECT * FROM SyncerEvent WHERE ID > 12468 AND propogatorId NOT IN
>> ('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"') AND conflicted != 1 AND
>> userId = '57dc984f1c87461c0967e228' ORDER BY ID LIMIT 4000;^
>>
>>
>>
>> On Tue, Sep 27, 2016 at 5:02 PM, Jake Nielsen 
>> wrote:
>>
>> I've got a query that takes a surprisingly long time to run, and I'm
>> having a really rough time trying to figure it out.
>>
>>
>>
>> Before I get started, here are the specifics of the situation:
>>
>>
>>
>> Here is the table that I'm working with (apologies for spammy indices,
>> I've been throwing shit at the wall)
>>
>> Table "public.syncerevent"
>>
>> Column|  Type   |Modifiers
>>
>>
>> --+-+---
>> ---
>>
>>  id   | bigint  | not null default nextval('syncerevent_id_seq'::
>> regclass)
>>
>>  userid   | text|
>>
>>  event| text|
>>
>>  eventid  | text|
>>
>>  originatorid | text|
>>
>>  propogatorid | text|
>>
>>  kwargs   | text|
>>
>>  conflicted   | integer |
>>
>> Indexes:
>>
>> "syncerevent_pkey" PRIMARY KEY, btree (id)
>>
>> "syncereventidindex" UNIQUE, btree (eventid)
>>
>> "anothersyncereventidindex" btree (userid)
>>
>> "anothersyncereventidindexwithascending" btree (userid, id)
>>
>> "asdfasdgasdf" btree (userid, id DESC)
>>
>> "syncereventuseridhashindex" hash (userid)
>>
>>
>>
>> To provide some context, as per the wiki,
>>
>> there are 3,290,600 rows in this table.
>>
>> It gets added to frequently, but never deleted from.
>>
>> The "kwargs" column often contains mid-size JSON strings (roughly 30K
>> characters on average)
>>
>> As of right now, the table has 53 users in it. About 20% of those have a
>> negligible number of events, but the rest of the users have a fairly even
>> smattering.
>>
>>
>>
>> EXPLAIN (ANALYZE, BUFFERS) says:
>>
>>
>>   QUERY PLAN
>>
>>
>> 
>> 
>> --
>>
>>  Limit  (cost=0.43..1218.57 rows=4000 width=615) (actual
>> time=3352.390..3403.572 rows=4000 loops=1)  Buffers: shared hit=120244
>> read=160198
>>
>>->  Index Scan using syncerevent_pkey on syncerevent
>> (cost=0.43..388147.29 rows=1274560 width=615) (actual
>> time=3352.386..3383.100 rows=4000 loops=1)
>>
>>  Index Cond: (id > 12468)
>>
>>  Filter: ((propogatorid <> 
>> '"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"'::text)
>> AND (conflicted <> 1) AND (userid = '57dc984f1c87461c0967e228'::text))
>>
>>  Rows Removed by Filter: 1685801
>>
>>  Buffers: shared hit=120244 read=160198
>>
>>  Planning time: 0.833 ms
>>
>>  Execution time: 3407.633 ms
>>
>> (9 rows)
>>
>> If it matters/interests you, here is my underlying confusion:
>>
>> From some internet sleuthing, I've decided that having a table per user
>> (which would totally make this problem a non-issue) isn't a great idea.
>> Because there is a file per table, having a table per user would not scale.
>> My next thought was partial indexes (which would also totally help), but
>> since there is also a table per index, this really doesn't side-step the
>> problem. My rough mental model says: If there exists a way that a
>> table-per-user scheme would make this more efficient, then there should
>> also exist an index that could achieve the same effect (or close enough to
>> not matter). I would think that "userid = '57dc984f1c87461c0967e228'" could
>> utilize at least one of the two indexes on the userId column, but clearly
>> I'm not understanding something.
>>
>> Any help in making this query more efficient would be greatly
>> appreciated, and any conceptual insights would be extra awesome.
>>
>> Thanks for reading.
>>
>> -Jake
>>
>> --
>>
>>
>>
>> This stands out:  WHERE ID > 12468 AND propogatorId NOT IN
>> ('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"')
>>
>> As does this from the analyze:  Rows Removed by Filter: 1685801
>>
>>
>>
>> The propogaterid is practically the only column NOT indexed and it’s used
>> in a “not in”.  It looks like it’s having to do a table scan for all the
>> rows above the id cutoff to see if any meet the filter requirement.  “not
>> in” can be very expensive.  An index might help on this column.  Have you
>> tried that?
>>
>>
>>
>> Your rowcounts aren’t high enough to require partitioning or any other
>> changes to your table that I can see right now.
>>
>>
>>
>> Mike Sofen  (Synthetic Genomics)
>>
>>
>>
>
> Thanks Mike, that's true, I hadn't thought of non-indexed 

Re: [PERFORM] Unexpected expensive index scan

2016-09-27 Thread Jake Nielsen
On Tue, Sep 27, 2016 at 5:41 PM, Mike Sofen  wrote:

> *From:* Jake Nielsen*Sent:* Tuesday, September 27, 2016 5:22 PM
>
>
> the query
>
> SELECT * FROM SyncerEvent WHERE ID > 12468 AND propogatorId NOT IN
> ('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"') AND conflicted != 1 AND
> userId = '57dc984f1c87461c0967e228' ORDER BY ID LIMIT 4000;^
>
>
>
> On Tue, Sep 27, 2016 at 5:02 PM, Jake Nielsen 
> wrote:
>
> I've got a query that takes a surprisingly long time to run, and I'm
> having a really rough time trying to figure it out.
>
>
>
> Before I get started, here are the specifics of the situation:
>
>
>
> Here is the table that I'm working with (apologies for spammy indices,
> I've been throwing shit at the wall)
>
> Table "public.syncerevent"
>
> Column|  Type   |Modifiers
>
>
> --+-+---
> ---
>
>  id   | bigint  | not null default nextval('syncerevent_id_seq'::
> regclass)
>
>  userid   | text|
>
>  event| text|
>
>  eventid  | text|
>
>  originatorid | text|
>
>  propogatorid | text|
>
>  kwargs   | text|
>
>  conflicted   | integer |
>
> Indexes:
>
> "syncerevent_pkey" PRIMARY KEY, btree (id)
>
> "syncereventidindex" UNIQUE, btree (eventid)
>
> "anothersyncereventidindex" btree (userid)
>
> "anothersyncereventidindexwithascending" btree (userid, id)
>
> "asdfasdgasdf" btree (userid, id DESC)
>
> "syncereventuseridhashindex" hash (userid)
>
>
>
> To provide some context, as per the wiki,
>
> there are 3,290,600 rows in this table.
>
> It gets added to frequently, but never deleted from.
>
> The "kwargs" column often contains mid-size JSON strings (roughly 30K
> characters on average)
>
> As of right now, the table has 53 users in it. About 20% of those have a
> negligible number of events, but the rest of the users have a fairly even
> smattering.
>
>
>
> EXPLAIN (ANALYZE, BUFFERS) says:
>
>
>   QUERY PLAN
>
>
> 
> 
> --
>
>  Limit  (cost=0.43..1218.57 rows=4000 width=615) (actual
> time=3352.390..3403.572 rows=4000 loops=1)  Buffers: shared hit=120244
> read=160198
>
>->  Index Scan using syncerevent_pkey on syncerevent
> (cost=0.43..388147.29 rows=1274560 width=615) (actual
> time=3352.386..3383.100 rows=4000 loops=1)
>
>  Index Cond: (id > 12468)
>
>  Filter: ((propogatorid <> 
> '"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"'::text)
> AND (conflicted <> 1) AND (userid = '57dc984f1c87461c0967e228'::text))
>
>  Rows Removed by Filter: 1685801
>
>  Buffers: shared hit=120244 read=160198
>
>  Planning time: 0.833 ms
>
>  Execution time: 3407.633 ms
>
> (9 rows)
>
> If it matters/interests you, here is my underlying confusion:
>
> From some internet sleuthing, I've decided that having a table per user
> (which would totally make this problem a non-issue) isn't a great idea.
> Because there is a file per table, having a table per user would not scale.
> My next thought was partial indexes (which would also totally help), but
> since there is also a table per index, this really doesn't side-step the
> problem. My rough mental model says: If there exists a way that a
> table-per-user scheme would make this more efficient, then there should
> also exist an index that could achieve the same effect (or close enough to
> not matter). I would think that "userid = '57dc984f1c87461c0967e228'" could
> utilize at least one of the two indexes on the userId column, but clearly
> I'm not understanding something.
>
> Any help in making this query more efficient would be greatly appreciated,
> and any conceptual insights would be extra awesome.
>
> Thanks for reading.
>
> -Jake
>
> --
>
>
>
> This stands out:  WHERE ID > 12468 AND propogatorId NOT IN
> ('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"')
>
> As does this from the analyze:  Rows Removed by Filter: 1685801
>
>
>
> The propogaterid is practically the only column NOT indexed and it’s used
> in a “not in”.  It looks like it’s having to do a table scan for all the
> rows above the id cutoff to see if any meet the filter requirement.  “not
> in” can be very expensive.  An index might help on this column.  Have you
> tried that?
>
>
>
> Your rowcounts aren’t high enough to require partitioning or any other
> changes to your table that I can see right now.
>
>
>
> Mike Sofen  (Synthetic Genomics)
>
>
>

Thanks Mike, that's true, I hadn't thought of non-indexed columns forcing a
scan. Unfortunately, just to test this out, I tried pulling out the more
suspect parts of the query, and it still seems to want to do an index scan:


EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM SyncerEvent WHERE userId =

Re: [PERFORM] Unexpected expensive index scan

2016-09-27 Thread Mike Sofen
From: Jake NielsenSent: Tuesday, September 27, 2016 5:22 PM


the query

SELECT * FROM SyncerEvent WHERE ID > 12468 AND propogatorId NOT IN 
('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"') AND conflicted != 1 AND userId = 
'57dc984f1c87461c0967e228' ORDER BY ID LIMIT 4000;^

 

On Tue, Sep 27, 2016 at 5:02 PM, Jake Nielsen  > wrote:

I've got a query that takes a surprisingly long time to run, and I'm having a 
really rough time trying to figure it out.

 

Before I get started, here are the specifics of the situation:

 

Here is the table that I'm working with (apologies for spammy indices, I've 
been throwing shit at the wall)

Table "public.syncerevent"

Column|  Type   |Modifiers  
   

--+-+--

 id   | bigint  | not null default 
nextval('syncerevent_id_seq'::regclass)

 userid   | text| 

 event| text| 

 eventid  | text| 

 originatorid | text| 

 propogatorid | text| 

 kwargs   | text| 

 conflicted   | integer | 

Indexes:

"syncerevent_pkey" PRIMARY KEY, btree (id)

"syncereventidindex" UNIQUE, btree (eventid)

"anothersyncereventidindex" btree (userid)

"anothersyncereventidindexwithascending" btree (userid, id)

"asdfasdgasdf" btree (userid, id DESC)

"syncereventuseridhashindex" hash (userid)

 

To provide some context, as per the wiki, 

there are 3,290,600 rows in this table. 

It gets added to frequently, but never deleted from. 

The "kwargs" column often contains mid-size JSON strings (roughly 30K 
characters on average)

As of right now, the table has 53 users in it. About 20% of those have a 
negligible number of events, but the rest of the users have a fairly even 
smattering.

 

EXPLAIN (ANALYZE, BUFFERS) says:

  QUERY 
PLAN  

--

 Limit  (cost=0.43..1218.57 rows=4000 width=615) (actual 
time=3352.390..3403.572 rows=4000 loops=1)  Buffers: shared hit=120244 
read=160198

   ->  Index Scan using syncerevent_pkey on syncerevent  (cost=0.43..388147.29 
rows=1274560 width=615) (actual time=3352.386..3383.100 rows=4000 loops=1)

 Index Cond: (id > 12468)

 Filter: ((propogatorid <> 
'"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"'::text) AND (conflicted <> 1) AND 
(userid = '57dc984f1c87461c0967e228'::text))

 Rows Removed by Filter: 1685801

 Buffers: shared hit=120244 read=160198

 Planning time: 0.833 ms

 Execution time: 3407.633 ms

(9 rows)

If it matters/interests you, here is my underlying confusion:

>From some internet sleuthing, I've decided that having a table per user (which 
>would totally make this problem a non-issue) isn't a great idea. Because there 
>is a file per table, having a table per user would not scale. My next thought 
>was partial indexes (which would also totally help), but since there is also a 
>table per index, this really doesn't side-step the problem. My rough mental 
>model says: If there exists a way that a table-per-user scheme would make this 
>more efficient, then there should also exist an index that could achieve the 
>same effect (or close enough to not matter). I would think that "userid = 
>'57dc984f1c87461c0967e228'" could utilize at least one of the two indexes on 
>the userId column, but clearly I'm not understanding something.

Any help in making this query more efficient would be greatly appreciated, and 
any conceptual insights would be extra awesome.

Thanks for reading.

-Jake

--

 

This stands out:  WHERE ID > 12468 AND propogatorId NOT IN 
('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"')

As does this from the analyze:  Rows Removed by Filter: 1685801

 

The propogaterid is practically the only column NOT indexed and it’s used in a 
“not in”.  It looks like it’s having to do a table scan for all the rows above 
the id cutoff to see if any meet the filter requirement.  “not in” can be very 
expensive.  An index might help on this column.  Have you tried that?

 

Your rowcounts aren’t high enough to require partitioning or any other changes 
to your table that I can see right now.

 

Mike Sofen  (Synthetic Genomics)

 



Re: [PERFORM] Unexpected expensive index scan

2016-09-27 Thread Jake Nielsen
Herp, forgot to include the query:

SELECT * FROM SyncerEvent WHERE ID > 12468 AND propogatorId NOT IN
('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"') AND conflicted != 1 AND
userId = '57dc984f1c87461c0967e228' ORDER BY ID LIMIT 4000;^

On Tue, Sep 27, 2016 at 5:02 PM, Jake Nielsen 
wrote:

> I've got a query that takes a surprisingly long time to run, and I'm
> having a really rough time trying to figure it out.
>
> Before I get started, here are the specifics of the situation:
>
> Here is the table that I'm working with (apologies for spammy indices,
> I've been throwing shit at the wall)
>
> Table "public.syncerevent"
>
> Column|  Type   |Modifiers
>
>
> --+-+---
> ---
>
>  id   | bigint  | not null default nextval('syncerevent_id_seq'::
> regclass)
>
>  userid   | text|
>
>  event| text|
>
>  eventid  | text|
>
>  originatorid | text|
>
>  propogatorid | text|
>
>  kwargs   | text|
>
>  conflicted   | integer |
>
> Indexes:
>
> "syncerevent_pkey" PRIMARY KEY, btree (id)
>
> "syncereventidindex" UNIQUE, btree (eventid)
>
> "anothersyncereventidindex" btree (userid)
>
> "anothersyncereventidindexwithascending" btree (userid, id)
>
> "asdfasdgasdf" btree (userid, id DESC)
>
> "syncereventuseridhashindex" hash (userid)
>
> To provide some context, as per the wiki,
> there are 3,290,600 rows in this table.
> It gets added to frequently, but never deleted from.
> The "kwargs" column often contains mid-size JSON strings (roughly 30K
> characters on average)
> As of right now, the table has 53 users in it. About 20% of those have a
> negligible number of events, but the rest of the users have a fairly even
> smattering.
>
> EXPLAIN (ANALYZE, BUFFERS) says:
>
>
>   QUERY PLAN
>
>
> 
> 
> --
>
>  Limit  (cost=0.43..1218.57 rows=4000 width=615) (actual
> time=3352.390..3403.572 rows=4000 loops=1)
>
>Buffers: shared hit=120244 read=160198
>
>->  Index Scan using syncerevent_pkey on syncerevent
> (cost=0.43..388147.29 rows=1274560 width=615) (actual
> time=3352.386..3383.100 rows=4000 loops=1)
>
>  Index Cond: (id > 12468)
>
>  Filter: ((propogatorid <> 
> '"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"'::text)
> AND (conflicted <> 1) AND (userid = '57dc984f1c87461c0967e228'::text))
>
>  Rows Removed by Filter: 1685801
>
>  Buffers: shared hit=120244 read=160198
>
>  Planning time: 0.833 ms
>
>  Execution time: 3407.633 ms
>
> (9 rows)
>
>
> The postgres verison is: PostgreSQL 9.5.2 on x86_64-pc-linux-gnu, compiled
> by gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16), 64-bit
>
>
> This query has gotten slower over time.
>
> The postgres server is running on a db.m3.medium RDS instance on Amazon.
>
> (3.75GB of ram)
>
> (~3 GHz processor, single core)
>
> I ran VACUUM, and ANALYZEd this table just prior to running the EXPLAIN
> command.
>
> Here are the server settings:
>
>  name   | current_setting
>   | source
>
>
>
>
>  application_name   | psql
>  | client
>
>  archive_command| 
> /etc/rds/dbbin/pgscripts/rds_wal_archive
> %p   | configuration file
>
>  archive_mode   | on
>  | configuration file
>
>  archive_timeout| 5min
>  | configuration file
>
>  autovacuum_analyze_scale_factor| 0.05
>| configuration file
>
>  autovacuum_naptime | 30s
>   | configuration file
>
>  autovacuum_vacuum_scale_factor | 0.1
>   | configuration file
>
>  checkpoint_completion_target   | 0.9
>   | configuration file
>
>  client_encoding| UTF8
>  | client
>
>  effective_cache_size   | 1818912kB
>   | configuration file
>
>  fsync  | on
>  | configuration file
>
>  full_page_writes   | on
>  | configuration file
>
>  hot_standby| off
>   | configuration file
>
>  listen_addresses   | *
>   | command line
>
>  lo_compat_privileges   | off
>   | configuration file
>
>  log_checkpoints| on
>  | configuration file
>
>  log_directory  | /rdsdbdata/log/error
>
> Sorry for the formatting, I'm not sure of the best way to format this data
> on a mailing list.
>
>
> If it matters/interests you, here is my underlying confusion:
>
> From some internet sleuthing, I've decided 

[PERFORM] Unexpected expensive index scan

2016-09-27 Thread Jake Nielsen
I've got a query that takes a surprisingly long time to run, and I'm having
a really rough time trying to figure it out.

Before I get started, here are the specifics of the situation:

Here is the table that I'm working with (apologies for spammy indices, I've
been throwing shit at the wall)

Table "public.syncerevent"

Column|  Type   |Modifiers


--+-+--

 id   | bigint  | not null default
nextval('syncerevent_id_seq'::regclass)

 userid   | text|

 event| text|

 eventid  | text|

 originatorid | text|

 propogatorid | text|

 kwargs   | text|

 conflicted   | integer |

Indexes:

"syncerevent_pkey" PRIMARY KEY, btree (id)

"syncereventidindex" UNIQUE, btree (eventid)

"anothersyncereventidindex" btree (userid)

"anothersyncereventidindexwithascending" btree (userid, id)

"asdfasdgasdf" btree (userid, id DESC)

"syncereventuseridhashindex" hash (userid)

To provide some context, as per the wiki,
there are 3,290,600 rows in this table.
It gets added to frequently, but never deleted from.
The "kwargs" column often contains mid-size JSON strings (roughly 30K
characters on average)
As of right now, the table has 53 users in it. About 20% of those have a
negligible number of events, but the rest of the users have a fairly even
smattering.

EXPLAIN (ANALYZE, BUFFERS) says:


  QUERY PLAN


--

 Limit  (cost=0.43..1218.57 rows=4000 width=615) (actual
time=3352.390..3403.572 rows=4000 loops=1)

   Buffers: shared hit=120244 read=160198

   ->  Index Scan using syncerevent_pkey on syncerevent
(cost=0.43..388147.29 rows=1274560 width=615) (actual
time=3352.386..3383.100 rows=4000 loops=1)

 Index Cond: (id > 12468)

 Filter: ((propogatorid <>
'"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"'::text) AND (conflicted <> 1)
AND (userid = '57dc984f1c87461c0967e228'::text))

 Rows Removed by Filter: 1685801

 Buffers: shared hit=120244 read=160198

 Planning time: 0.833 ms

 Execution time: 3407.633 ms

(9 rows)


The postgres verison is: PostgreSQL 9.5.2 on x86_64-pc-linux-gnu, compiled
by gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16), 64-bit


This query has gotten slower over time.

The postgres server is running on a db.m3.medium RDS instance on Amazon.

(3.75GB of ram)

(~3 GHz processor, single core)

I ran VACUUM, and ANALYZEd this table just prior to running the EXPLAIN
command.

Here are the server settings:

 name   | current_setting
| source




 application_name   | psql
 | client

 archive_command|
/etc/rds/dbbin/pgscripts/rds_wal_archive %p   | configuration file

 archive_mode   | on
 | configuration file

 archive_timeout| 5min
 | configuration file

 autovacuum_analyze_scale_factor| 0.05
 | configuration file

 autovacuum_naptime | 30s
| configuration file

 autovacuum_vacuum_scale_factor | 0.1
| configuration file

 checkpoint_completion_target   | 0.9
| configuration file

 client_encoding| UTF8
 | client

 effective_cache_size   | 1818912kB
| configuration file

 fsync  | on
 | configuration file

 full_page_writes   | on
 | configuration file

 hot_standby| off
| configuration file

 listen_addresses   | *
| command line

 lo_compat_privileges   | off
| configuration file

 log_checkpoints| on
 | configuration file

 log_directory  | /rdsdbdata/log/error

Sorry for the formatting, I'm not sure of the best way to format this data
on a mailing list.


If it matters/interests you, here is my underlying confusion:

>From some internet sleuthing, I've decided that having a table per user
(which would totally make this problem a non-issue) isn't a great idea.
Because there is a file per table, having a table per user would not scale.
My next thought was partial indexes (which would also totally help), but
since there is also a table per index, this really doesn't side-step the
problem. My rough mental model says: If there exists a way that a
table-per-user scheme would make this more efficient, then there should
also exist an index that could achieve the same effect (or close enough to
not matter). I would think that "userid = '57dc984f1c87461c0967e228'" could