[PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Alessandro Gagliardi
So, here's the query:

SELECT private, COUNT(block_id) FROM blocks WHERE created > 'yesterday' AND
shared IS FALSE GROUP BY private

What confuses me is that though this is a largish table (millions of rows)
with constant writes, the query is over indexed columns of types timestamp
and boolean so I would expect it to be very fast. The clause where created
> 'yesterday' is there mostly to speed it up, but apparently it doesn't
help much.

Here's the *Full Table and Index Schema*:

CREATE TABLE blocks
(
  block_id character(24) NOT NULL,
  user_id character(24) NOT NULL,
  created timestamp with time zone,
  locale character varying,
  shared boolean,
  private boolean,
  moment_type character varying NOT NULL,
  user_agent character varying,
  inserted timestamp without time zone NOT NULL DEFAULT now(),
  networks character varying[],
  lnglat point,
  CONSTRAINT blocks_pkey PRIMARY KEY (block_id )
)

WITH (
  OIDS=FALSE
);

CREATE INDEX blocks_created_idx
  ON blocks
  USING btree
  (created DESC NULLS LAST);

CREATE INDEX blocks_lnglat_idx
  ON blocks
  USING gist
  (lnglat );

CREATE INDEX blocks_networks_idx
  ON blocks
  USING btree
  (networks );

CREATE INDEX blocks_private_idx
  ON blocks
  USING btree
  (private );

CREATE INDEX blocks_shared_idx
  ON blocks
  USING btree
  (shared );

Here are the results from *EXPLAIN ANALYZE*:

HashAggregate  (cost=156619.01..156619.02 rows=2 width=26) (actual
time=43131.154..43131.156 rows=2 loops=1)
  ->  Seq Scan on blocks  (cost=0.00..156146.14 rows=472871 width=26)
(actual time=274.881..42124.505 rows=562888 loops=1)
        Filter: ((shared IS FALSE) AND (created > '2012-01-29
00:00:00+00'::timestamp with time zone))
Total runtime: 43131.221 ms
I'm using *Postgres version*: 9.0.5 (courtesy of Heroku)

As for *History:* I've only recently started using this query, so there
really isn't any.

As for *Hardware*: I'm using Heroku's Ronin setup which involves 1.7 GB
Cache. Beyond that I don't really know.

As for *Maintenance Setup*: I let Heroku handle that, so again, I don't
really know. FWIW though, vacuuming should not really be an issue (as I
understand it) since I don't really do any updates or deletions. It's
pretty much all inserts and selects.

As for *WAL Configuration*: I'm afraid I don't even know what that is. The
query is normally run from a Python web server, though the above explain
was run using pgAdmin3; I doubt that's relevant.

As for *GUC Settings*: Again, I don't know what this is. Whatever Heroku
defaults to is what I'm using.

Thank you in advance!
-Alessandro Gagliardi


Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Claudio Freire
On Mon, Jan 30, 2012 at 4:13 PM, Alessandro Gagliardi
alessan...@path.com wrote:
 So, here's the query:

 SELECT private, COUNT(block_id) FROM blocks WHERE created > 'yesterday' AND
 shared IS FALSE GROUP BY private

 What confuses me is that though this is a largish table (millions of rows)
 with constant writes, the query is over indexed columns of types timestamp
 and boolean so I would expect it to be very fast. The clause where created >
 'yesterday' is there mostly to speed it up, but apparently it doesn't help
 much.

The number of rows touched is ~0.5M, and is correctly estimated, which
would lead me to believe PG estimates the index plan to be slower.

You could try executing set enable_seqscan=false; first and then
running your query with explain analyze again. You'll probably get an index
scan, and you'll see both how it performs and how PG thought it would
perform. Any mismatch between the two probably means you'll have to
change the planner tunables (the x_tuple_cost ones) to better match
your hardware.
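
A minimal sketch of that debugging sequence (using the query from the
original post; SET only lasts for the current session):

SET enable_seqscan = false;

EXPLAIN ANALYZE
SELECT private, COUNT(block_id)
FROM blocks
WHERE created > 'yesterday'
  AND shared IS FALSE
GROUP BY private;

SET enable_seqscan = true;  -- re-enable afterwards: a diagnostic, not a fix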


 As for Hardware: I'm using Heroku's Ronin setup which involves 1.7 GB
 Cache. Beyond that I don't really know.
<snip>
 As for GUC Settings: Again, I don't know what this is. Whatever Heroku
 defaults to is what I'm using.

And there's your problem. Without knowing/understanding those, you
won't get anywhere. I don't know what Heroku is, but you should find
out both hardware details and PG configuration details.
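
Even without filesystem access you can read the PG side from pg_settings;
a sketch of the kind of check I mean:

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'effective_cache_size',
               'random_page_cost', 'seq_page_cost');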

 As for Maintenance Setup: I let Heroku handle that, so I again, I don't
 really know. FWIW though, vacuuming should not really be an issue (as I
 understand it) since I don't really do any updates or deletions. It's pretty
 much all inserts and selects.

Maintenance also includes analyzing the table, to gather stats that
feed the optimizer, and it's very important to keep the stats
accurate. You can do it manually - just perform an ANALYZE. However,
the plan doesn't show any serious mismatch between expected and actual
rowcounts, which suggests stats aren't your problem.
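
For example (pg_stat_user_tables is standard; 'blocks' is the table from
your schema):

ANALYZE blocks;  -- refresh planner statistics for this table

SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'blocks';  -- when were stats last gathered?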



Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Alessandro Gagliardi
Well that was a *lot* faster:

HashAggregate  (cost=156301.82..156301.83 rows=2 width=26) (actual
time=2692.806..2692.807 rows=2 loops=1)
  ->  Bitmap Heap Scan on blocks  (cost=14810.54..155828.95 rows=472871
width=26) (actual time=289.828..1593.893 rows=575186 loops=1)
        Recheck Cond: (created > '2012-01-29 00:00:00+00'::timestamp with
time zone)
        Filter: (shared IS FALSE)
        ->  Bitmap Index Scan on blocks_created_idx  (cost=0.00..14786.89
rows=550404 width=0) (actual time=277.407..277.407 rows=706663 loops=1)
              Index Cond: (created > '2012-01-29 00:00:00+00'::timestamp
with time zone)
Total runtime: 2693.107 ms

To answer your (non-)question about Heroku, it's a cloud service, so I
don't host PostgreSQL myself. I'm not sure how much I can mess with things
like GUC since I don't even have access to the postgres database on the
server. I am a long-time SQL user but new to Postgres, so I welcome
suggestions on where to start with that sort of thing.
Setting enable_seqscan=false made a huge difference, so I think I'll start
there.

Thank you very much!
-Alessandro

On Mon, Jan 30, 2012 at 11:24 AM, Claudio Freire klaussfre...@gmail.com wrote:

 On Mon, Jan 30, 2012 at 4:13 PM, Alessandro Gagliardi
 alessan...@path.com wrote:
  So, here's the query:
 
  SELECT private, COUNT(block_id) FROM blocks WHERE created > 'yesterday'
 AND
  shared IS FALSE GROUP BY private
 
  What confuses me is that though this is a largish table (millions of
 rows)
  with constant writes, the query is over indexed columns of types
 timestamp
  and boolean so I would expect it to be very fast. The clause where
 created >
  'yesterday' is there mostly to speed it up, but apparently it doesn't
 help
  much.

 The number of rows touched is ~0.5M, and is correctly estimated, which
 would lead me to believe PG estimates the index plan to be slower.

 You could try executing set enable_seqscan=false; first and then
 running your query with explain analyze again. You'll probably get an index
 scan, and you'll see both how it performs and how PG thought it would
 perform. Any mismatch between the two probably means you'll have to
 change the planner tunables (the x_tuple_cost ones) to better match
 your hardware.


  As for Hardware: I'm using Heroku's Ronin setup which involves 1.7 GB
  Cache. Beyond that I don't really know.
 <snip>
  As for GUC Settings: Again, I don't know what this is. Whatever Heroku
  defaults to is what I'm using.

 And there's your problem. Without knowing/understanding those, you
 won't get anywhere. I don't know what Heroku is, but you should find
 out both hardware details and PG configuration details.

  As for Maintenance Setup: I let Heroku handle that, so I again, I don't
  really know. FWIW though, vacuuming should not really be an issue (as I
  understand it) since I don't really do any updates or deletions. It's
 pretty
  much all inserts and selects.

 Maintenance also includes analyzing the table, to gather stats that
 feed the optimizer, and it's very important to keep the stats
 accurate. You can do it manually - just perform an ANALYZE. However,
 the plan doesn't show any serious mismatch between expected and actual
 rowcounts, which suggests stats aren't your problem.



Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Claudio Freire
On Mon, Jan 30, 2012 at 5:35 PM, Alessandro Gagliardi
alessan...@path.com wrote:
 To answer your (non-)question about Heroku, it's a cloud service, so I don't
 host PostgreSQL myself. I'm not sure how much I can mess with things like
 GUC since I don't even have access to the postgres database on the server.
 I am a long-time SQL user but new to Postgres, so I welcome suggestions on
 where to start with that sort of thing. Setting enable_seqscan=false made a
 huge difference, so I think I'll start there.

It's not a good idea to abuse the enable_* settings; they're
for debugging, not for general use. In particular, disabling sequential
scans everywhere can have a disastrous effect on performance.

It sounds as if PG had a misconfigured effective_cache_size. What does
"show effective_cache_size" tell you?



Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Alessandro Gagliardi
Hm. Well, it looks like setting enable_seqscan=false is session-specific,
so it seems like I can use it with this query alone; but it sounds like
even if that works, it's a bad practice. (Is that true?)

My effective_cache_size is 153kB

On Mon, Jan 30, 2012 at 12:50 PM, Claudio Freire klaussfre...@gmail.com wrote:

 On Mon, Jan 30, 2012 at 5:35 PM, Alessandro Gagliardi
 alessan...@path.com wrote:
  To answer your (non-)question about Heroku, it's a cloud service, so I
 don't
  host PostgreSQL myself. I'm not sure how much I can mess with things like
  GUC since I don't even have access to the postgres database on the
 server.
  I am a long-time SQL user but new to Postgres, so I welcome suggestions on
  where to start with that sort of thing. Setting enable_seqscan=false
 made a
  huge difference, so I think I'll start there.

 It's not a good idea to abuse the enable_* settings; they're
 for debugging, not for general use. In particular, disabling sequential
 scans everywhere can have a disastrous effect on performance.

 It sounds as if PG had a misconfigured effective_cache_size. What does
 "show effective_cache_size" tell you?



Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Claudio Freire
On Mon, Jan 30, 2012 at 5:55 PM, Alessandro Gagliardi
alessan...@path.com wrote:
 Hm. Well, it looks like setting enable_seqscan=false is session-specific, so
 it seems like I can use it with this query alone; but it sounds like even if
 that works, it's a bad practice. (Is that true?)

Yep

 My effective_cache_size is 153kB

Um... barring some really bizarre GUC setting, I cannot imagine how it
could be preferring the sequential scan.
Maybe some of the more knowledgeable folks have a hint.

In the meantime, you can use the seqscan setting on that query alone.
Be sure to use it on that query alone - ie, re-enable it afterwards,
or discard the connection.
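
One way to keep it tightly scoped (a sketch, assuming you can wrap the
query in a transaction) is SET LOCAL, which reverts automatically:

BEGIN;
SET LOCAL enable_seqscan = off;
SELECT private, COUNT(block_id)
FROM blocks
WHERE created > 'yesterday'
  AND shared IS FALSE
GROUP BY private;
COMMIT;  -- enable_seqscan reverts to its previous value here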



Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Fernando Hevia
On Mon, Jan 30, 2012 at 17:35, Alessandro Gagliardi alessan...@path.com wrote:

 Well that was a *lot* faster:

 HashAggregate  (cost=156301.82..156301.83 rows=2 width=26) (actual
 time=2692.806..2692.807 rows=2 loops=1)
   ->  Bitmap Heap Scan on blocks  (cost=14810.54..155828.95 rows=472871
 width=26) (actual time=289.828..1593.893 rows=575186 loops=1)
       Recheck Cond: (created > '2012-01-29 00:00:00+00'::timestamp
 with time zone)
       Filter: (shared IS FALSE)
       ->  Bitmap Index Scan on blocks_created_idx  (cost=0.00..14786.89
 rows=550404 width=0) (actual time=277.407..277.407 rows=706663 loops=1)
         Index Cond: (created > '2012-01-29 00:00:00+00'::timestamp
 with time zone)
 Total runtime: 2693.107 ms


Are you sure the new timing isn't owed to cached data? If I am reading it
correctly, the latest explain you posted shows the Index Scan shouldn't have
made a difference, as it reports that pretty much all rows in the table have
created > 'yesterday'.
If the number of rows with created > 'yesterday' is significant (~ over
25% with default config) a full scan will be chosen and it will probably be
the better choice too.
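
A quick way to check that fraction (a sketch; note it scans the whole
table itself, so run it off-peak):

SELECT
  count(*) AS total_rows,
  count(CASE WHEN created > 'yesterday' THEN 1 END) AS recent_rows,
  round(100.0 * count(CASE WHEN created > 'yesterday' THEN 1 END)
        / count(*), 1) AS pct_recent
FROM blocks;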


Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Josh Berkus
On 1/30/12 12:59 PM, Claudio Freire wrote:
 On Mon, Jan 30, 2012 at 5:55 PM, Alessandro Gagliardi
 alessan...@path.com wrote:
 Hm. Well, it looks like setting enable_seqscan=false is session-specific, so
 it seems like I can use it with this query alone; but it sounds like even if
 that works, it's a bad practice. (Is that true?)
 
 Yep

The issue with that is that the enable_seqscan setting is not limited to
that one table in that query, and won't change over time.  So by all
means use it as a hotfix right now, but it's not a long-term solution to
your problem.

 
 My effective_cache_size is 153kB

That seems appropriate for the Ronin; I'll test one out and see what
random_page_cost is set to as well, possibly Heroku needs to adjust the
basic template for the Ronin.  For Heroku, we want to favor index scans
a bit more than you would on regular hardware because the underlying
storage is Amazon, which has good seeks but crap throughput.

You can do SHOW random_page_cost yourself right now, too.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Alessandro Gagliardi
On Mon, Jan 30, 2012 at 1:25 PM, Josh Berkus j...@agliodbs.com wrote:

 You can do SHOW random_page_cost yourself right now, too.

 4

I also tried SHOW seq_page_cost and that's 1.

Looking at
http://www.postgresql.org/docs/current/static/runtime-config-query.html#GUC-RANDOM-PAGE-COST
I wonder if I should try reducing random_page_cost?

Something that might help when it comes to advice on performance tuning is
that this database is used only for analytics. It's essentially a partial
replication of a production (document-oriented) database. So a lot of
normal operations that might employ a series of sequential fetches may not
actually be the norm in my case. Rather, I'm doing a lot of counts on data
that is typically randomly distributed.

Thanks,

-Alessandro


Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Scott Marlowe
On Mon, Jan 30, 2012 at 2:39 PM, Alessandro Gagliardi
alessan...@path.com wrote:
 On Mon, Jan 30, 2012 at 1:25 PM, Josh Berkus j...@agliodbs.com wrote:

 You can do SHOW random_page_cost yourself right now, too.

 4

 I also tried SHOW seq_page_cost and that's 1.

 Looking
 at http://www.postgresql.org/docs/current/static/runtime-config-query.html#GUC-RANDOM-PAGE-COST
 I wonder if I should try reducing random_page_cost?

 Something that might help when it comes to advice on performance tuning is
 that this database is used only for analytics. It's essentially a partial
 replication of a production (document-oriented) database. So a lot of normal
 operations that might employ a series of sequential fetches may not actually
 be the norm in my case. Rather, I'm doing a lot of counts on data that is
 typically randomly distributed.

Yes, try lowering it.  Generally speaking, random page cost should
always be >= seq page cost.  Start with a number between 1.5 and 2.0
and see if that helps.  You can make it sticky for
your user or database with alter user or alter database...
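
For example (the database and user names here are hypothetical):

ALTER DATABASE analytics SET random_page_cost = 1.5;  -- per-database
ALTER ROLE reporting SET random_page_cost = 1.5;      -- per-user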



Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Alessandro Gagliardi
Pretty sure. I just ran the same query twice in a row
with enable_seqscan=true and the actual time was on the order of 42
seconds both times. With enable_seqscan=false, it was on the order of 3
seconds. Going back to enable_seqscan=true, it's back to 42 seconds. Unless
you're saying that enable_seqscan is determining whether or not the data is
being cached...

On Mon, Jan 30, 2012 at 1:13 PM, Fernando Hevia fhe...@gmail.com wrote:

 On Mon, Jan 30, 2012 at 17:35, Alessandro Gagliardi 
 alessan...@path.com wrote:

 Well that was a *lot* faster:

 HashAggregate  (cost=156301.82..156301.83 rows=2 width=26) (actual
 time=2692.806..2692.807 rows=2 loops=1)
   ->  Bitmap Heap Scan on blocks  (cost=14810.54..155828.95 rows=472871
 width=26) (actual time=289.828..1593.893 rows=575186 loops=1)
       Recheck Cond: (created > '2012-01-29 00:00:00+00'::timestamp
 with time zone)
       Filter: (shared IS FALSE)
       ->  Bitmap Index Scan on blocks_created_idx
  (cost=0.00..14786.89 rows=550404 width=0) (actual time=277.407..277.407
 rows=706663 loops=1)
         Index Cond: (created > '2012-01-29 00:00:00+00'::timestamp
 with time zone)
 Total runtime: 2693.107 ms


 Are you sure the new timing isn't owed to cached data? If I am reading it
 correctly, the latest explain you posted shows the Index Scan shouldn't have
 made a difference, as it reports that pretty much all rows in the table have
 created > 'yesterday'.
 If the number of rows with created > 'yesterday' is significant (~ over
 25% with default config) a full scan will be chosen and it will probably be
 the better choice too.





Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Alessandro Gagliardi
I set random_page_cost to 2 (with enable_seqscan on) and get the same
performance I got with enable_seqscan off.
So far so good. Now I just need to figure out how to set it globally. :-/

On Mon, Jan 30, 2012 at 1:45 PM, Scott Marlowe scott.marl...@gmail.com wrote:

 On Mon, Jan 30, 2012 at 2:39 PM, Alessandro Gagliardi
 alessan...@path.com wrote:
  Looking
  at
 http://www.postgresql.org/docs/current/static/runtime-config-query.html#GUC-RANDOM-PAGE-COST
  I wonder if I should try reducing random_page_cost?
 

 Yes, try lowering it.  Generally speaking, random page cost should
 always be >= seq page cost.  Start with a number between 1.5 and 2.0
 and see if that helps.  You can make it sticky for
 your user or database with alter user or alter database...



Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Scott Marlowe
On Mon, Jan 30, 2012 at 2:55 PM, Alessandro Gagliardi
alessan...@path.com wrote:
 I set random_page_cost to 2 (with enable_seqscan on) and get the same
 performance I got with enable_seqscan off.
 So far so good. Now I just need to figure out how to set it globally. :-/

alter database set random_page_cost=2.0;



Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Scott Marlowe
On Mon, Jan 30, 2012 at 3:19 PM, Scott Marlowe scott.marl...@gmail.com wrote:
 On Mon, Jan 30, 2012 at 2:55 PM, Alessandro Gagliardi
 alessan...@path.com wrote:
 I set random_page_cost to 2 (with enable_seqscan on) and get the same
 performance I got with enable_seqscan off.
 So far so good. Now I just need to figure out how to set it globally. :-/

 alter database set random_page_cost=2.0;

That should be:

alter database dbnamegoeshere set random_page_cost=2.0;



-- 
To understand recursion, one must first understand recursion.



Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Alessandro Gagliardi
Got it (with a little bit of klutzing around). :) Thanks!

On Mon, Jan 30, 2012 at 2:24 PM, Scott Marlowe scott.marl...@gmail.com wrote:

 On Mon, Jan 30, 2012 at 3:19 PM, Scott Marlowe scott.marl...@gmail.com
 wrote:
  On Mon, Jan 30, 2012 at 2:55 PM, Alessandro Gagliardi
  alessan...@path.com wrote:
  I set random_page_cost to 2 (with enable_seqscan on) and get the same
  performance I got with enable_seqscan off.
  So far so good. Now I just need to figure out how to set it globally.
 :-/
 
  alter database set random_page_cost=2.0;

 That should be:

 alter database dbnamegoeshere set random_page_cost=2.0;



 --
 To understand recursion, one must first understand recursion.



Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Josh Berkus

 Looking at
 http://www.postgresql.org/docs/current/static/runtime-config-query.html#GUC-RANDOM-PAGE-COST
 I wonder if I should try reducing random_page_cost?

Yes, and I should speak to Heroku about reducing it by default.  RPC
represents the ratio between the cost of a sequential lookup of a single
row vs. the cost of a random lookup.  On standard spinning media on a
dedicated server 4.0 is a pretty good estimate of this.  However, you
are running on shared storage in a cloud, which has different math.

 Something that might help when it comes to advice on performance tuning is
 that this database is used only for analytics. It's essentially a partial
 replication of a production (document-oriented) database. So a lot of
 normal operations that might employ a series of sequential fetches may not
 actually be the norm in my case. Rather, I'm doing a lot of counts on data
 that is typically randomly distributed.

In that case, you might consider increasing default_statistics_target to
1000 and ANALYZEing your whole database.  That increases the sample size
for the database statistics collector, and most of the time will result
in somewhat better plans on large tables and data with skewed
distributions.  This is not something which Heroku would do as standard,
since most of their users are doing basic transactional webapps.
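
A sketch of that (the database name is hypothetical; the per-column form
targets only the columns this query filters on):

ALTER DATABASE analytics SET default_statistics_target = 1000;
ANALYZE;  -- re-samples every table in the current database

ALTER TABLE blocks ALTER COLUMN created SET STATISTICS 1000;  -- narrower
ANALYZE blocks;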

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
