Re: [PERFORM] Simple join doesn't use index

2013-01-29 Thread Alex Vinnik
On Mon, Jan 28, 2013 at 6:55 PM, Filip Rembiałkowski plk.zu...@gmail.comwrote:


 On Mon, Jan 28, 2013 at 5:43 PM, Alex Vinnik alvinni...@gmail.com wrote:

 It sure turned out that default settings are not a good fit.


 do you know pgtune?
 it's a good tool for starters, if you want a fast postgres and don't
 really want to learn what's behind the scenes.

Yeah.. I came across pgtune but noticed that latest version dated
2009-10-29 http://pgfoundry.org/frs/?group_id=1000416 which is kind of
outdated. Tar file has settings for pg 8.3. Is still relevant?



 random_page_cost=1 might be not what you really want.
 it would mean that random reads are as fast as as sequential reads, which
 probably is true only for SSD

What randon_page_cost would be more appropriate for EC2 EBS Provisioned
volume that can handle 2,000 IOPS?




 Filip




Re: [PERFORM] Simple join doesn't use index

2013-01-29 Thread Alex Vinnik
Setting work_mem to 64MB triggers in memory sort but look what happens with
views look up. PG goes through all records there Seq Scan on views
instead of using visitor_id index and I have only subset of real data to
play around. Can imagine what cost would be running it against bigger
dataset. Something else is in play here that makes planner to take this
route. Any ideas how to gain more insight into planner's inner workings?

QUERY PLAN
Sort  (cost=960280.46..960844.00 rows=225414 width=8) (actual
time=23328.040..23537.126 rows=209401 loops=1)
  Sort Key: visits.id, views.id
  Sort Method: quicksort  Memory: 15960kB
  -  Hash Join  (cost=8089.16..940238.66 rows=225414 width=8) (actual
time=6622.072..22995.890 rows=209401 loops=1)
Hash Cond: (views.visit_id = visits.id)
-  Seq Scan on views  (cost=0.00..831748.05 rows=8724205 width=8)
(actual time=0.093..10552.306 rows=6995893 loops=1)
-  Hash  (cost=6645.51..6645.51 rows=115492 width=4) (actual
time=307.389..307.389 rows=131311 loops=1)
  Buckets: 16384  Batches: 1  Memory Usage: 4617kB
  -  Index Scan using visits_created_at_index on visits
 (cost=0.00..6645.51 rows=115492 width=4) (actual time=0.040..163.151
rows=131311 loops=1)
Index Cond: ((created_at = '2013-01-15
00:00:00'::timestamp without time zone) AND (created_at  '2013-01-16
00:00:00'::timestamp without time zone))
Total runtime: 23733.045 ms


On Mon, Jan 28, 2013 at 8:31 PM, Merlin Moncure mmonc...@gmail.com wrote:

 On Mon, Jan 28, 2013 at 5:43 PM, Alex Vinnik alvinni...@gmail.com wrote:
  It sure turned out that default settings are not a good fit. Setting
  random_page_cost to 1.0 made query to run in 2.6 seconds and I clearly
 see
  that indexes are being used in explain plan and IO utilization is close
 to
  0.
 
  QUERY PLAN
  Sort  (cost=969787.23..970288.67 rows=200575 width=8) (actual
  time=2176.045..2418.162 rows=241238 loops=1)
Sort Key: visits.id, views.id
Sort Method: external sort  Disk: 4248kB
-  Nested Loop  (cost=0.00..950554.81 rows=200575 width=8) (actual
  time=0.048..1735.357 rows=241238 loops=1)
  -  Index Scan using visits_created_at_index on visits
  (cost=0.00..5459.16 rows=82561 width=4) (actual time=0.032..178.591
  rows=136021 loops=1)
Index Cond: ((created_at = '2012-12-15
 00:00:00'::timestamp
  without time zone) AND (created_at  '2012-12-16 00:00:00'::timestamp
  without time zone))
  -  Index Scan using views_visit_id_index on views
  (cost=0.00..11.33 rows=12 width=8) (actual time=0.004..0.006 rows=2
  loops=136021)
Index Cond: (visit_id = visits.id)
  Total runtime: 2635.169 ms
 
  However I noticed that sorting is done using disk(external sort  Disk:
  4248kB) which prompted me to take a look at work_mem. But it turned out
  that small increase to 4MB from default 1MB turns off index usage and
 query
  gets x10 slower. IO utilization jumped to 100% from literally nothing. so
  back to square one...
 
  QUERY PLAN
  Sort  (cost=936642.75..937144.19 rows=200575 width=8) (actual
  time=33200.762..33474.443 rows=241238 loops=1)
Sort Key: visits.id, views.id
Sort Method: external merge  Disk: 4248kB
-  Hash Join  (cost=6491.17..917410.33 rows=200575 width=8) (actual
  time=7156.498..32723.221 rows=241238 loops=1)
  Hash Cond: (views.visit_id = visits.id)
  -  Seq Scan on views  (cost=0.00..832189.95 rows=8768395
 width=8)
  (actual time=0.100..12126.342 rows=8200704 loops=1)
  -  Hash  (cost=5459.16..5459.16 rows=82561 width=4) (actual
  time=353.683..353.683 rows=136021 loops=1)
Buckets: 16384  Batches: 2 (originally 1)  Memory Usage:
  4097kB
-  Index Scan using visits_created_at_index on visits
  (cost=0.00..5459.16 rows=82561 width=4) (actual time=0.032..175.051
  rows=136021 loops=1)
  Index Cond: ((created_at = '2012-12-15
  00:00:00'::timestamp without time zone) AND (created_at  '2012-12-16
  00:00:00'::timestamp without time zone))
  Total runtime: 33698.000 ms
 
  Basically PG is going through all views again and not using Index Scan
  using views_visit_id_index on views. Looks like setting work_mem
 confuses
  planner somehow. Any idea what can be done to do sorting in memory. I
  suspect it should make query even more faster. Thanks -Alex

 hm, what happens when you set work_mem a fair amount higher? (say,
 64mb).   You can set it for one session by going set work_mem='64mb';
  as opposed to the entire server in postgresql.conf.

 merlin



Re: [PERFORM] Simple join doesn't use index

2013-01-29 Thread Filip Rembiałkowski
On Tue, Jan 29, 2013 at 8:24 AM, Alex Vinnik alvinni...@gmail.com wrote:

 On Mon, Jan 28, 2013 at 6:55 PM, Filip Rembiałkowski 
 plk.zu...@gmail.comwrote:


 do you know pgtune?
 it's a good tool for starters, if you want a fast postgres and don't
 really want to learn what's behind the scenes.

 Yeah.. I came across pgtune but noticed that latest version dated
 2009-10-29 http://pgfoundry.org/frs/?group_id=1000416 which is kind of
 outdated. Tar file has settings for pg 8.3. Is still relevant?


Yes, I'm sure it will not do anything bad to your config.



 random_page_cost=1 might be not what you really want.
 it would mean that random reads are as fast as as sequential reads, which
 probably is true only for SSD

 What randon_page_cost would be more appropriate for EC2 EBS Provisioned
 volume that can handle 2,000 IOPS?



I'd say: don't guess. Measure.
Use any tool that can test sequential disk block reads versus random disk
block reads.
bonnie++ is quite popular.



Filip


Re: [PERFORM] Simple join doesn't use index

2013-01-29 Thread Merlin Moncure
On Tue, Jan 29, 2013 at 8:41 AM, Alex Vinnik alvinni...@gmail.com wrote:
 Setting work_mem to 64MB triggers in memory sort but look what happens with
 views look up. PG goes through all records there Seq Scan on views instead
 of using visitor_id index and I have only subset of real data to play
 around. Can imagine what cost would be running it against bigger dataset.
 Something else is in play here that makes planner to take this route. Any
 ideas how to gain more insight into planner's inner workings?

did you set effective_cache_seize as noted upthread?

merlin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Simple join doesn't use index

2013-01-29 Thread Ben Chobot
On Jan 29, 2013, at 6:24 AM, Alex Vinnik wrote:

 random_page_cost=1 might be not what you really want. 
 it would mean that random reads are as fast as as sequential reads, which 
 probably is true only for SSD
 What randon_page_cost would be more appropriate for EC2 EBS Provisioned 
 volume that can handle 2,000 IOPS? 

For EC2 Provisioned IOPS volumes - not standard EBS - random_page_cost=1 is 
exactly what you want.



Re: [PERFORM] Simple join doesn't use index

2013-01-29 Thread Alex Vinnik
On Tue, Jan 29, 2013 at 11:39 AM, Ben Chobot be...@silentmedia.com wrote:

 On Jan 29, 2013, at 6:24 AM, Alex Vinnik wrote:

 random_page_cost=1 might be not what you really want.
 it would mean that random reads are as fast as as sequential reads, which
 probably is true only for SSD

 What randon_page_cost would be more appropriate for EC2 EBS Provisioned
 volume that can handle 2,000 IOPS?


 For EC2 Provisioned IOPS volumes - not standard EBS - random_page_cost=1
 is exactly what you want.

 Well... after some experimentation it turned out that random_page_cost=0.6
gives me fast query

QUERY PLAN
Sort  (cost=754114.96..754510.46 rows=158199 width=8) (actual
time=1839.324..2035.405 rows=209401 loops=1)
  Sort Key: visits.id, views.id
  Sort Method: quicksort  Memory: 15960kB
  -  Nested Loop  (cost=0.00..740453.38 rows=158199 width=8) (actual
time=0.048..1531.592 rows=209401 loops=1)
-  Index Scan using visits_created_at_index on visits
 (cost=0.00..5929.82 rows=115492 width=4) (actual time=0.032..161.488
rows=131311 loops=1)
  Index Cond: ((created_at = '2013-01-15 00:00:00'::timestamp
without time zone) AND (created_at  '2013-01-16 00:00:00'::timestamp
without time zone))
-  Index Scan using views_visit_id_index on views
 (cost=0.00..6.26 rows=10 width=8) (actual time=0.003..0.005 rows=2
loops=131311)
  Index Cond: (visit_id = visits.id)
Total runtime: 2234.142 ms

random_page_cost=0.7 slows it down 16 times

Sort  (cost=804548.42..804943.92 rows=158199 width=8) (actual
time=37011.337..37205.449 rows=209401 loops=1)
  Sort Key: visits.id, views.id
  Sort Method: quicksort  Memory: 15960kB
  -  Merge Join  (cost=15871.37..790886.85 rows=158199 width=8) (actual
time=35673.602..36714.056 rows=209401 loops=1)
Merge Cond: (visits.id = views.visit_id)
-  Sort  (cost=15824.44..16113.17 rows=115492 width=4) (actual
time=335.486..463.085 rows=131311 loops=1)
  Sort Key: visits.id
  Sort Method: quicksort  Memory: 12300kB
  -  Index Scan using visits_created_at_index on visits
 (cost=0.00..6113.04 rows=115492 width=4) (actual time=0.034..159.326
rows=131311 loops=1)
Index Cond: ((created_at = '2013-01-15
00:00:00'::timestamp without time zone) AND (created_at  '2013-01-16
00:00:00'::timestamp without time zone))
-  Index Scan using views_visit_id_visit_buoy_index on views
 (cost=0.00..757596.22 rows=6122770 width=8) (actual time=0.017..30765.316
rows=5145902 loops=1)
Total runtime: 37407.174 ms

I am totally puzzled now...


Re: [PERFORM] Simple join doesn't use index

2013-01-29 Thread Merlin Moncure
On Tue, Jan 29, 2013 at 12:59 PM, Alex Vinnik alvinni...@gmail.com wrote:



 On Tue, Jan 29, 2013 at 11:39 AM, Ben Chobot be...@silentmedia.com wrote:

 On Jan 29, 2013, at 6:24 AM, Alex Vinnik wrote:

 random_page_cost=1 might be not what you really want.
 it would mean that random reads are as fast as as sequential reads, which
 probably is true only for SSD

 What randon_page_cost would be more appropriate for EC2 EBS Provisioned
 volume that can handle 2,000 IOPS?


 For EC2 Provisioned IOPS volumes - not standard EBS - random_page_cost=1
 is exactly what you want.

 Well... after some experimentation it turned out that random_page_cost=0.6
 gives me fast query

 QUERY PLAN
 Sort  (cost=754114.96..754510.46 rows=158199 width=8) (actual
 time=1839.324..2035.405 rows=209401 loops=1)
   Sort Key: visits.id, views.id
   Sort Method: quicksort  Memory: 15960kB
   -  Nested Loop  (cost=0.00..740453.38 rows=158199 width=8) (actual
 time=0.048..1531.592 rows=209401 loops=1)
 -  Index Scan using visits_created_at_index on visits
 (cost=0.00..5929.82 rows=115492 width=4) (actual time=0.032..161.488
 rows=131311 loops=1)
   Index Cond: ((created_at = '2013-01-15 00:00:00'::timestamp
 without time zone) AND (created_at  '2013-01-16 00:00:00'::timestamp
 without time zone))
 -  Index Scan using views_visit_id_index on views  (cost=0.00..6.26
 rows=10 width=8) (actual time=0.003..0.005 rows=2 loops=131311)
   Index Cond: (visit_id = visits.id)
 Total runtime: 2234.142 ms

 random_page_cost=0.7 slows it down 16 times

 Sort  (cost=804548.42..804943.92 rows=158199 width=8) (actual
 time=37011.337..37205.449 rows=209401 loops=1)
   Sort Key: visits.id, views.id
   Sort Method: quicksort  Memory: 15960kB
   -  Merge Join  (cost=15871.37..790886.85 rows=158199 width=8) (actual
 time=35673.602..36714.056 rows=209401 loops=1)
 Merge Cond: (visits.id = views.visit_id)
 -  Sort  (cost=15824.44..16113.17 rows=115492 width=4) (actual
 time=335.486..463.085 rows=131311 loops=1)
   Sort Key: visits.id
   Sort Method: quicksort  Memory: 12300kB
   -  Index Scan using visits_created_at_index on visits
 (cost=0.00..6113.04 rows=115492 width=4) (actual time=0.034..159.326
 rows=131311 loops=1)
 Index Cond: ((created_at = '2013-01-15
 00:00:00'::timestamp without time zone) AND (created_at  '2013-01-16
 00:00:00'::timestamp without time zone))

 -  Index Scan using views_visit_id_visit_buoy_index on views
 (cost=0.00..757596.22 rows=6122770 width=8) (actual time=0.017..30765.316
 rows=5145902 loops=1)

Something is awry here. pg is doing an index scan via
views_visit_id_visit_buoy_index with no matching condition.  What's
the definition of that index? The reason why the random_page_cost
adjustment is working is that you are highly penalizing sequential
type scans so that the database is avoiding the merge (sort A, sort B,
stepwise compare).

SQL server is doing a nestloop/index scan, just like the faster pg
plan, but is a bit faster because it's parallelizing.

 merlin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Simple join doesn't use index

2013-01-29 Thread Jeff Janes
On Mon, Jan 28, 2013 at 3:43 PM, Alex Vinnik alvinni...@gmail.com wrote:
 It sure turned out that default settings are not a good fit. Setting
 random_page_cost to 1.0 made query to run in 2.6 seconds and I clearly see
 that indexes are being used in explain plan and IO utilization is close to
 0.

 QUERY PLAN
 Sort  (cost=969787.23..970288.67 rows=200575 width=8) (actual
 time=2176.045..2418.162 rows=241238 loops=1)
   Sort Key: visits.id, views.id
   Sort Method: external sort  Disk: 4248kB

What query are you running?  The query you originally showed us should
not be doing this sort in the first place.

Cheers,

Jeff


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Simple join doesn't use index

2013-01-29 Thread Alex Vinnik
On Tue, Jan 29, 2013 at 2:06 PM, Jeff Janes jeff.ja...@gmail.com wrote:

Sort Key: visits.id, views.id
Sort Method: external sort  Disk: 4248kB

 What query are you running?  The query you originally showed us should
 not be doing this sort in the first place.

 Cheers,

 Jeff


Here is the query

select visits.id, views.id
from visits join views on visits.id = views.visit_id
where visits.created_at = '01/15/2013' and visits.created_at  '01/16/2013'
order by visits.id, views.id;

Original query didn't have order by clause

Here query plan w/o order by
Merge Join  (cost=18213.46..802113.80 rows=182579 width=8) (actual
time=13.693..145469.499 rows=209401 loops=1)
  Merge Cond: (visits.id = views.visit_id)
  -  Sort  (cost=18195.47..18523.91 rows=131373 width=4) (actual
time=335.496..464.929 rows=131311 loops=1)
Sort Key: visits.id
Sort Method: quicksort  Memory: 12300kB
-  Index Scan using visits_created_at_index on visits
 (cost=0.00..7026.59 rows=131373 width=4) (actual time=0.037..162.047
rows=131311 loops=1)
  Index Cond: ((created_at = '2013-01-15 00:00:00'::timestamp
without time zone) AND (created_at  '2013-01-16 00:00:00'::timestamp
without time zone))
  -  Index Scan using views_visit_id_visit_buoy_index on views
 (cost=0.00..766120.99 rows=6126002 width=8) (actual
time=18.960..140565.130 rows=4014837 loops=1)
Total runtime: 145664.274 ms


Re: [PERFORM] Simple join doesn't use index

2013-01-29 Thread Alex Vinnik
index definition
CREATE INDEX views_visit_id_visit_buoy_index ON views USING btree
(visit_id, visit_buoy)



On Tue, Jan 29, 2013 at 1:35 PM, Merlin Moncure mmonc...@gmail.com wrote:

 On Tue, Jan 29, 2013 at 12:59 PM, Alex Vinnik alvinni...@gmail.com
 wrote:
 
 
 
  On Tue, Jan 29, 2013 at 11:39 AM, Ben Chobot be...@silentmedia.com
 wrote:
 
  On Jan 29, 2013, at 6:24 AM, Alex Vinnik wrote:
 
  random_page_cost=1 might be not what you really want.
  it would mean that random reads are as fast as as sequential reads,
 which
  probably is true only for SSD
 
  What randon_page_cost would be more appropriate for EC2 EBS Provisioned
  volume that can handle 2,000 IOPS?
 
 
  For EC2 Provisioned IOPS volumes - not standard EBS - random_page_cost=1
  is exactly what you want.
 
  Well... after some experimentation it turned out that
 random_page_cost=0.6
  gives me fast query
 
  QUERY PLAN
  Sort  (cost=754114.96..754510.46 rows=158199 width=8) (actual
  time=1839.324..2035.405 rows=209401 loops=1)
Sort Key: visits.id, views.id
Sort Method: quicksort  Memory: 15960kB
-  Nested Loop  (cost=0.00..740453.38 rows=158199 width=8) (actual
  time=0.048..1531.592 rows=209401 loops=1)
  -  Index Scan using visits_created_at_index on visits
  (cost=0.00..5929.82 rows=115492 width=4) (actual time=0.032..161.488
  rows=131311 loops=1)
Index Cond: ((created_at = '2013-01-15
 00:00:00'::timestamp
  without time zone) AND (created_at  '2013-01-16 00:00:00'::timestamp
  without time zone))
  -  Index Scan using views_visit_id_index on views
  (cost=0.00..6.26
  rows=10 width=8) (actual time=0.003..0.005 rows=2 loops=131311)
Index Cond: (visit_id = visits.id)
  Total runtime: 2234.142 ms
 
  random_page_cost=0.7 slows it down 16 times
 
  Sort  (cost=804548.42..804943.92 rows=158199 width=8) (actual
  time=37011.337..37205.449 rows=209401 loops=1)
Sort Key: visits.id, views.id
Sort Method: quicksort  Memory: 15960kB
-  Merge Join  (cost=15871.37..790886.85 rows=158199 width=8) (actual
  time=35673.602..36714.056 rows=209401 loops=1)
  Merge Cond: (visits.id = views.visit_id)
  -  Sort  (cost=15824.44..16113.17 rows=115492 width=4) (actual
  time=335.486..463.085 rows=131311 loops=1)
Sort Key: visits.id
Sort Method: quicksort  Memory: 12300kB
-  Index Scan using visits_created_at_index on visits
  (cost=0.00..6113.04 rows=115492 width=4) (actual time=0.034..159.326
  rows=131311 loops=1)
  Index Cond: ((created_at = '2013-01-15
  00:00:00'::timestamp without time zone) AND (created_at  '2013-01-16
  00:00:00'::timestamp without time zone))

  -  Index Scan using views_visit_id_visit_buoy_index on views
  (cost=0.00..757596.22 rows=6122770 width=8) (actual time=0.017..30765.316
  rows=5145902 loops=1)

 Something is awry here. pg is doing an index scan via
 views_visit_id_visit_buoy_index with no matching condition.  What's
 the definition of that index? The reason why the random_page_cost
 adjustment is working is that you are highly penalizing sequential
 type scans so that the database is avoiding the merge (sort A, sort B,
 stepwise compare).

 SQL server is doing a nestloop/index scan, just like the faster pg
 plan, but is a bit faster because it's parallelizing.

  merlin



Re: [PERFORM] Simple join doesn't use index

2013-01-29 Thread Jeff Janes
On Mon, Jan 28, 2013 at 4:55 PM, Filip Rembiałkowski
plk.zu...@gmail.com wrote:

 On Mon, Jan 28, 2013 at 5:43 PM, Alex Vinnik alvinni...@gmail.com wrote:

 It sure turned out that default settings are not a good fit.


 do you know pgtune?
 it's a good tool for starters, if you want a fast postgres and don't really
 want to learn what's behind the scenes.

 random_page_cost=1 might be not what you really want.
 it would mean that random reads are as fast as as sequential reads, which
 probably is true only for SSD

Or that the reads are cached and coming from RAM, which is almost
surely the case here.

Cheers,

Jeff


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance