Re: [PERFORM] Window functions, partitioning, and sorting performance

2014-08-21 Thread Marti Raudsepp
On Thu, Aug 21, 2014 at 4:29 PM, Eli Naeher enae...@gmail.com wrote:
 Clearly the bulk of the time is spent sorting the rows in the original
 table, and then again sorting the results of the subselect. But I'm afraid I
 don't really know what to do with this information. Is there any way I can
 speed this up?

Sort Method: external merge  Disk: 120976kB

The obvious first step is to bump up work_mem to avoid the disk-based
sort. Try setting it to something like 256MB in your session and see
how the query performs then. This may also allow the planner to choose
a HashAggregate instead of a sort.
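
For example (a sketch; this is session-level only, and 256MB is a
starting point rather than a recommendation):

SET work_mem = '256MB';
-- Re-run the query with EXPLAIN ANALYZE and check whether the
-- "Sort Method" line still reports an external merge on disk.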

It's not always straightforward how to tune this correctly; it depends
on your hardware, concurrency, and query complexity. Here's some advice:
https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server#work_mem_maintainance_work_mem

Also you could create an index on (route, direction, stop, stop_time)
to avoid the inner sort entirely.
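
Something like this (a sketch; the index name is made up, and I'm
assuming your events table is named stop_event):

CREATE INDEX stop_event_route_dir_stop_time_idx
    ON stop_event (route, direction, stop, stop_time);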

And it seems that you could move the INNER JOIN on stop to the outer
query as well; I'm not sure whether that will change much.

Try these, and if it's still problematic, report back with a new EXPLAIN ANALYZE.

Regards,
Marti




Re: [PERFORM] Window functions, partitioning, and sorting performance

2014-08-21 Thread Shaun Thomas

On 08/21/2014 08:29 AM, Eli Naeher wrote:


With around 1.2 million rows, this takes 20 seconds to run. 1.2 million
rows is only about a week's worth of data, so I'd like to figure out a
way to make this faster.


Well, you'll probably be able to reduce the run time a bit, but even 
with really good hardware and all in-memory processing, you're not going 
to see significant run-time improvements with that many rows. This is 
one of the problems that reporting-specific structures, such as fact 
tables, were designed to address.


Repeatedly aggregating the same week/month/year worth of several 
million rows will just take longer with each iteration as the data size 
increases. You need to maintain up-to-date aggregates of the metrics 
you actually want to measure, so that you're only reading the few 
hundred rows you introduce every update period. You can then retrieve 
those kinds of results in a few milliseconds.
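
As a rough sketch (table and column names invented for illustration), a
daily rollup might look like this:

CREATE TABLE stop_event_daily (
    day       date    NOT NULL,
    route     integer NOT NULL,
    direction integer NOT NULL,
    stop      integer NOT NULL,
    n_events  bigint  NOT NULL,
    PRIMARY KEY (day, route, direction, stop)
);

-- Run once per update period, rolling up only the newly arrived rows:
INSERT INTO stop_event_daily
SELECT stop_time::date, route, direction, stop, count(*)
  FROM stop_event
 WHERE stop_time >= date '2014-08-20'
   AND stop_time <  date '2014-08-21'
 GROUP BY 1, 2, 3, 4;

Reports then read from the small rollup table instead of re-scanning
millions of raw rows.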


--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com



Re: [PERFORM] Window functions, partitioning, and sorting performance

2014-08-21 Thread Eli Naeher
Upping work_mem did roughly halve the time, but after thinking about
Shaun's suggestion, I figured it's better to calculate this stuff once and
then store it. So here is how the table looks now:

                        Table "public.stop_event"
       Column        |            Type             |        Modifiers
---------------------+-----------------------------+----------------------------------------------------------
 stop_time           | timestamp without time zone | not null
 stop                | integer                     | not null
 bus                 | integer                     | not null
 direction           | integer                     | not null
 route               | integer                     | not null
 id                  | bigint                      | not null default nextval('stop_event_id_seq'::regclass)
 previous_stop_event | bigint                      |
Indexes:
    "stop_event_pkey" PRIMARY KEY, btree (id)
    "stop_event_previous_stop_event_idx" btree (previous_stop_event)
Foreign-key constraints:
    "stop_event_direction_id_fkey" FOREIGN KEY (direction) REFERENCES direction(id)
    "stop_event_previous_stop_event_fkey" FOREIGN KEY (previous_stop_event) REFERENCES stop_event(id)
    "stop_event_route_fkey" FOREIGN KEY (route) REFERENCES route(id)
    "stop_event_stop" FOREIGN KEY (stop) REFERENCES stop(id)
Referenced by:
    TABLE "stop_event" CONSTRAINT "stop_event_previous_stop_event_fkey" FOREIGN KEY (previous_stop_event) REFERENCES stop_event(id)

previous_stop_event simply references the previous (by stop_time) stop
event for the combination of stop, route, and direction. I have
successfully populated this column for my existing test data. However, when
I try to do a test self-join using it, Postgres does two seq scans across
the whole table, even though I have indexes on both id and
previous_stop_event: http://explain.depesz.com/s/ctck. Any idea why those
indexes are not being used?
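
For reference, here is a sketch of one way a column like this can be
populated with a window function (illustrative only, not necessarily
the exact statement I used):

UPDATE stop_event se
   SET previous_stop_event = prev.prev_id
  FROM (SELECT id,
               lag(id) OVER (PARTITION BY stop, route, direction
                             ORDER BY stop_time) AS prev_id
          FROM stop_event) prev
 WHERE prev.id = se.id
   AND prev.prev_id IS NOT NULL;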

Thank you again,
-Eli


Re: [PERFORM] Window functions, partitioning, and sorting performance

2014-08-21 Thread Eli Naeher
Oops, I forgot to include the test self-join query I'm using. It is simply:

SELECT se1.stop_time AS curr, se2.stop_time AS prev
FROM stop_event se1
JOIN stop_event se2 ON se1.previous_stop_event = se2.id;



Re: [PERFORM] Window functions, partitioning, and sorting performance

2014-08-21 Thread Marti Raudsepp
On Thu, Aug 21, 2014 at 7:19 PM, Eli Naeher enae...@gmail.com wrote:
 However, when I try to do a
 test self-join using it, Postgres does two seq scans across the whole table,
 even though I have indexes on both id and previous_stop_event:
 http://explain.depesz.com/s/ctck. Any idea why those indexes are not being
 used?

Because the planner thinks seq scan+hash join is going to be faster
than incurring the overhead of index scans for other kinds of plans.

You can try out alternative plan types by running 'set
enable_hashjoin=off' in your session. If an alternative plan does turn
out to be faster, it usually means you haven't set the planner tunables
right (random_page_cost, effective_cache_size, and possibly
cpu_tuple_cost).
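
For example, in one session (the cost values at the end are
illustrative starting points, not recommendations):

SET enable_hashjoin = off;
EXPLAIN ANALYZE
SELECT se1.stop_time AS curr, se2.stop_time AS prev
  FROM stop_event se1
  JOIN stop_event se2 ON se1.previous_stop_event = se2.id;
RESET enable_hashjoin;

-- If the index plan wins, adjust the cost settings instead, e.g.:
SET random_page_cost = 1.5;        -- lower for fast or well-cached disks
SET effective_cache_size = '4GB';  -- roughly the RAM available for caching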

Regards,
Marti

