Re: [PERFORM] Random Page Cost and Planner

2010-05-26 Thread David Jarvis
Hi, Rob.

I tried bumping the effective_cache_size. It made no difference.

My latest attempt at forcing PostgreSQL to use the indexes involved two
loops: one to loop over the stations, the other to extract the station data
from the measurement table. The outer loop executes in 1.5 seconds. The
inner loop does a full table scan for each record in the outer loop:

  FOR station IN
SELECT
  sc.station_id,
  sc.taken_start,
  sc.taken_end
FROM
  climate.city c,
  climate.station s,
  climate.station_category sc
WHERE
  c.id = city_id AND
  earth_distance(
ll_to_earth(c.latitude_decimal,c.longitude_decimal),
ll_to_earth(s.latitude_decimal,s.longitude_decimal)) / 1000 <=
radius AND
  s.elevation BETWEEN elevation1 AND elevation2 AND
  s.applicable AND
  sc.station_id = s.id AND
  sc.category_id = category_id AND
  extract(YEAR FROM sc.taken_start) >= year1 AND
  extract(YEAR FROM sc.taken_end) <= year2
ORDER BY
  sc.station_id
  LOOP
RAISE NOTICE 'B.1. % % %', station.station_id, station.taken_start,
station.taken_end;

FOR measure IN
  SELECT
extract(YEAR FROM m.taken) AS year,
avg(m.amount) AS amount
  FROM
climate.measurement m
  WHERE
*m.station_id = station.station_id AND
m.taken BETWEEN station.taken_start AND station.taken_end AND
m.category_id = category_id*
  GROUP BY
extract(YEAR FROM m.taken)
LOOP
  RAISE NOTICE '  B.2. % %', measure.year, measure.amount;
END LOOP;
  END LOOP;

I thought that the bold lines would have prompted the planner to use the
indexes. The values used for the inner query:

NOTICE:  B.1. 754 1980-08-01 2001-11-30

When I run the query manually, using constants, it executes in ~25
milliseconds:

SELECT
  extract(YEAR FROM m.taken) AS year,
  avg(m.amount) AS amount
FROM
  climate.measurement m
WHERE
  m.station_id = 754 AND
  m.taken BETWEEN '1980-08-01'::date AND '2001-11-30'::date AND
  m.category_id = 7
GROUP BY
  extract(YEAR FROM m.taken)

With 106 rows in the outer loop, that works out to ~2.65 seconds, which is
better than the 5 seconds I get when everything is cached and a tremendous
improvement over the ~85 seconds from cold.

I do not understand why the below query uses a full table scan (executes in
~13 seconds):

SELECT
  extract(YEAR FROM m.taken) AS year,
  avg(m.amount) AS amount
FROM
  climate.measurement m
WHERE
  *m.station_id = station.station_id AND*
  *m.taken BETWEEN station.taken_start AND station.taken_end AND*
  *m.category_id = category_id*
GROUP BY
  extract(YEAR FROM m.taken)

Moreover, what can I do to solve the problem?

Thanks again!

Dave


Re: [PERFORM] Random Page Cost and Planner

2010-05-26 Thread Alexey Klyukin

On May 26, 2010, at 6:50 AM, David Jarvis wrote:
 
 That said, when using the following condition, the query is fast (1 second):
 
 extract(YEAR FROM sc.taken_start) >= 1963 AND
 extract(YEAR FROM sc.taken_end) <= 2009 AND
 
 ->  Index Scan using measurement_013_stc_idx on
 measurement_013 m  (cost=0.00..511.00 rows=511 width=15) (actual
 time=0.018..3.601 rows=3356 loops=104)
   Index Cond: ((m.station_id = sc.station_id) AND
 (m.taken >= sc.taken_start) AND (m.taken <= sc.taken_end) AND (m.category_id
 = 7))
 
 This condition makes it slow (13 seconds on first run, 8 seconds thereafter):
 
 extract(YEAR FROM sc.taken_start) >= 1900 AND
 extract(YEAR FROM sc.taken_end) <= 2009 AND
 
   Filter: (category_id = 7)
 ->  Seq Scan on measurement_013 m  (cost=0.00..359704.80
 rows=18118464 width=15) (actual time=0.008..4025.692 rows=18118395 loops=1)
 
 At this point, I'm tempted to write a stored procedure that iterates over 
 each station category for all the years of each station. My guess is that the 
 planner's estimate for the number of rows that will be returned by 
 extract(YEAR FROM sc.taken_start) >= 1900 is incorrect and so it chooses a
 full table scan for all rows. 

Nope, it appears that the planner estimate is correct (it estimates 18118464 vs
18118395 actual rows). I think what's happening there is that 18M rows is a large
enough share of the total table that it makes sense to scan it sequentially
(eliminating random access costs). Try SET enable_seqscan = false and repeat
the query - there is a chance that the index scan would be even slower.

 The part I am having trouble with is convincing PG to use the index for the 
 station ID and the date range for when the station was active. Each station 
 has a unique ID; the data in the measurement table is ordered by measurement 
 date then by station.
 
 Should I add a clustered index by station then by date?
 
 Any other suggestions are very much appreciated.

Is it necessary to get the data as far back as 1900 all the time? Maybe there is a
possibility to aggregate results from past years if they are constant.

Regards,
--
Alexey Klyukin al...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
 




Re: [PERFORM] shared_buffers advice

2010-05-26 Thread Cédric Villemain
2010/5/24 Konrad Garus konrad.ga...@gmail.com:
 2010/3/11 Paul McGarry p...@paulmcgarry.com:

 I'm basically wondering how the postgresql cache (ie shared_buffers)
 and the OS page_cache interact. The general advice seems to be to
 assign 1/4 of RAM to shared buffers.

 I don't have a good knowledge of the internals but I'm wondering if
 this will effectively mean that roughly the same amount of RAM being
 used for the OS page cache will be used for redundantly caching
 something that Postgres is caching as well?

 I have a similar problem but I can't see an answer in this thread.

 Our dedicated server has 16 GB RAM. Among other settings
 shared_buffers is 2 GB, effective_cache_size is 12 GB.

 Do shared_buffers duplicate contents of OS page cache? If so, how do I
 know if 25% RAM is the right value for me? Actually it would not seem
 to be true - the less redundancy the better.

When a block is requested for the first time (usually 8 kB from
postgres, so in fact 2 blocks at the OS level), you do have 'duplicate'
buffers.
But, depending on your workload, it is not so bad: those 2 blocks
should not be requested again for some time (because they are in the
postgresql shared buffers) and should be evicted by the OS in favor of
newly requested blocks.
Again it depends on your workload: if you have a case where you
refresh the shared buffers a lot, then you will have more blocks in the
2 caches at the same time.

You can try the pgfincore extension to grab stats from the OS cache and/or
patch postgresql if you want real stats ;)
pg_buffercache is provided with postgresql and delivers very useful information:
http://www.postgresql.org/docs/8.4/interactive/pgbuffercache.html
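
For example, a query along the lines of the one in the pg_buffercache docs
shows which relations currently occupy the most shared buffers (a sketch,
assuming the contrib module is installed in your database):

SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
  JOIN pg_class c ON b.relfilenode = c.relfilenode
WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                            WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;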


 Another question - is there a tool or built-in statistic that tells
 when/how often/how much a table is read from disk? I mean physical
 read, not a pull from the OS cache into shared_buffers.

 --
 Konrad Garus





-- 
Cédric Villemain   2ndQuadrant
http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support



Re: [PERFORM] Random Page Cost and Planner

2010-05-26 Thread tv

 Hi, Tom.

 Yes, that is what happened, making the tests rather meaningless, and giving
 me the false impression that the indexes were being used. They were, but
 only because of cached results. When multiple users make different queries,
 the performance will return to ~80s per query.

 I also tried Kevin's suggestion, which had no noticeable effect:
 effective_cache_size = 512MB

 That said, when using the following condition, the query is fast (1
 second):

 extract(YEAR FROM sc.taken_start) >= 1963 AND
 extract(YEAR FROM sc.taken_end) <= 2009 AND

 ->  Index Scan using measurement_013_stc_idx on
 measurement_013 m  (cost=0.00..511.00 rows=511 width=15) (actual
 time=0.018..3.601 rows=3356 loops=104)
   Index Cond: ((m.station_id = sc.station_id) AND
 (m.taken >= sc.taken_start) AND (m.taken <= sc.taken_end) AND
 (m.category_id
 = 7))

 This condition makes it slow (13 seconds on first run, 8 seconds
 thereafter):

 *extract(YEAR FROM sc.taken_start) >= 1900 AND*
 *extract(YEAR FROM sc.taken_end) <= 2009 AND*

   Filter: (category_id = 7)
 ->  Seq Scan on measurement_013 m
 (cost=0.00..359704.80 rows=18118464 width=15) (actual time=0.008..4025.692
 rows=18118395 loops=1)

 At this point, I'm tempted to write a stored procedure that iterates over
 each station category for all the years of each station. My guess is that
 the planner's estimate for the number of rows that will be returned by
 *extract(YEAR
 FROM sc.taken_start) >= 1900* is incorrect and so it chooses a full table
 scan for all rows. Even though the lower bound appears to be a constant
 value of 1900, the average year a station started collecting data was 44
 years ago (1965), and did so for an average of 21.4 years.

 The part I am having trouble with is convincing PG to use the index for the
 station ID and the date range for when the station was active. Each station
 has a unique ID; the data in the measurement table is ordered by measurement
 date then by station.

Well, don't forget indexes may not be the best way to evaluate the query -
if the selectivity is low (the query returns a large portion of the table)
the sequential scan is actually faster. The problem is that using an index
means you have to read the index blocks too, and then the table blocks, and
this is mostly random access. So the belief that using indexes will make the
query run faster could be false.

And this is what happens in the queries above - the first query covers
years 1963-2009, while the second one covers 1900-2009. Given the fact that
this table contains ~40M rows, the first query returns about 0.01% (3k
rows) while the second one returns almost 50% of the data (18M rows). So I
doubt this can be improved using an index ...

But you can try that by setting enable_seqscan=off, or by properly setting
the random_page_cost / seq_page_cost variables (so that the plan with
indexes is cheaper than the sequential scan). You can do that within the
session (e.g. use SET enable_seqscan=off) so that you won't harm other
sessions.
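
For instance, using the constants from the manual query earlier in this
thread (just a sketch - the cost values below are starting points to
experiment with, not recommendations):

SET enable_seqscan = off;

EXPLAIN ANALYZE
SELECT
  extract(YEAR FROM m.taken) AS year,
  avg(m.amount) AS amount
FROM
  climate.measurement m
WHERE
  m.station_id = 754 AND
  m.taken BETWEEN '1980-08-01'::date AND '2001-11-30'::date AND
  m.category_id = 7
GROUP BY
  extract(YEAR FROM m.taken);

RESET enable_seqscan;

-- or, instead of disabling seq scans outright, make random I/O look cheaper:
SET random_page_cost = 2.0;  -- default is 4.0
SET seq_page_cost = 1.0;     -- the default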

 Should I add a clustered index by station then by date?

 Any other suggestions are very much appreciated.

Well, the only thing that crossed my mind is partitioning with properly
defined constraints and constraint_exclusion=on. I'd recommend partitioning
by time (each year a separate partition) but you'll have to investigate
that on your own (depends on your use-cases).
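
A minimal sketch of one such yearly child table, assuming the measurement
schema from this thread (the table and index names here are made up):

CREATE TABLE climate.measurement_y1980 (
  CHECK (taken >= DATE '1980-01-01' AND taken <= DATE '1980-12-31')
) INHERITS (climate.measurement);

CREATE INDEX measurement_y1980_stc_idx
  ON climate.measurement_y1980 (station_id, taken, category_id);

-- required so the planner can skip children whose CHECK constraint
-- cannot match the query's date range
SET constraint_exclusion = on;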

BTW the effective_cache_size mentioned in the previous posts is just an
'informational parameter' - it does not increase the amount of memory
allocated by PostgreSQL. It merely informs PostgreSQL of the expected disk
cache size maintained by the OS (Linux), so that PostgreSQL may estimate
the chance that the requested data are actually cached (and won't be read
from the disk).

regards
Tomas






Re: [PERFORM] Random Page Cost and Planner

2010-05-26 Thread Kevin Grittner
David Jarvis thanga...@gmail.com wrote:
 
 It sounds as though the active portion of your database is pretty
 much cached in RAM.  True?
 
 I would not have thought so; there are seven tables, each with 39
 to 43 million rows
 
 The machine has 4GB of RAM
 
In that case, modifying seq_page_cost or setting random_page_cost
below something in the range of 1.5 to 2 is probably not going to be
a good choice for the mix as a whole.
 
 effective_cache_size = 256MB
 
This should probably be set to something on the order of 3GB.  This
will help the optimizer make more intelligent choices about when use
of the index will be a win.
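 
For example (this is only a planner hint - it does not allocate memory - so
it is safe to try per session before changing postgresql.conf):
 
SET effective_cache_size = '3GB';
SHOW effective_cache_size;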
 
 It would tend to be better than random access to 43 million rows,
 at least if you need to go to disk for many of them.
 
 I thought that the index would take care of this?
 
When the index can limit the number of rows to a fraction of the 43
million rows, using it is a win.  The trick is to accurately model
the relative costs of different aspects of running the query, so
that when the various plans are compared, the one which looks the
cheapest actually *is*.  Attempting to force any particular plan
through other means is risky.
 
 I will be trying various other indexes. I've noticed now that
 sometimes the results are very quick and sometimes very slow. For
 the query I posted, it would be great to know what would be the
 best indexes to use. I have a suspicion that that's going to
 require trial and many errors.
 
Yeah, there's no substitute for testing your actual software against
the actual data.  Be careful, though -- as previously mentioned
caching can easily distort results, particularly when you run the
same query, all by itself (with no competing queries) multiple
times.  You'll get your best information if you can simulate a
more-or-less realistic load, and try that with various settings and
indexes.  The cache turnover and resource contention involved in
production can influence performance, and are hard to estimate any
other way.
 
-Kevin



Re: [PERFORM] performance of temporary vs. regular tables

2010-05-26 Thread Joachim Worringen

On 25.05.2010 12:41, Andres Freund wrote:

On Tuesday 25 May 2010 11:00:24 Joachim Worringen wrote:

Thanks. So, the Write-Ahead-Logging (being used or not) does not matter?

It does matter quite significantly in my experience, both from an I/O and a
CPU overhead perspective.


O.k., looks as if I have to find out for myself... I'll let you know
if possible.


 Joachim





[PERFORM] PostgreSQL Function Language Performance: C vs PL/PGSQL

2010-05-26 Thread Eliot Gable
I have been Googling for answers on this for a while, and have not been able
to find anything satisfactory.

Imagine that you have a stored procedure which is currently written using
PL/PGSQL. This stored procedure performs lots of long, complex SQL queries
(95% SELECT statements, 5% INSERT or UPDATE) and these queries are
interspersed with some minor math and some control logic, along with some
logging through the use of RAISE. Each logging statement is inside an
IF/THEN which just checks a boolean flag to determine if logging is turned
on. The function returns a set of cursors to several different result sets.
The function is 50%-60% SQL queries and the rest is logging, control logic,
and a little bit of math.
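
To make the shape of the function concrete, here is a stripped-down sketch
(the names and the table are made up; this is not the real function):

CREATE OR REPLACE FUNCTION example_report(p_debug boolean)
RETURNS SETOF refcursor AS $$
DECLARE
  c_results refcursor := 'results';
  v_total   numeric;
BEGIN
  -- one of the many SELECT statements
  SELECT sum(t.amount) INTO v_total FROM some_table t;

  -- logging guarded by a boolean flag
  IF p_debug THEN
    RAISE NOTICE 'total = %', v_total;
  END IF;

  -- a little math, then a cursor over a result set
  OPEN c_results FOR
    SELECT * FROM some_table t WHERE t.amount > v_total / 100;
  RETURN NEXT c_results;
END;
$$ LANGUAGE plpgsql;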

Would a query such as this obtain any performance improvement by being
re-written using C?

Are there specific cases where writing a function in C would be highly
desirable versus using PL/PGSQL (aside from simply gaining access to
functionality not present in PL/PGSQL)?

Are there specific cases where writing a function in C would be slower than
writing the equivalent in PL/PGSQL?

Basically, I am looking for some guidelines based primarily on performance
of when I should use C to write a function versus using PL/PGSQL.

Can anybody quantify any of the performance differences between doing a
particular task in C versus doing the same thing in PL/PGSQL? For example,
performing a SELECT query or executing a certain number of lines of control
logic (primarily IF/THEN, but an occasional loop included)? How about
assignments or basic math like
addition/subtraction/multiplication/division?

When executing SQL queries inside a C-based function, is there any way to
have all of the SQL queries pre-planned through the compilation process,
definition of the function, and loading of the .so file similar to PL/PGSQL?
Would I get better performance writing each SQL query as a stored procedure
and then call these stored procedures from within a C-based function which
does the logging, math, control logic, and builds the result sets and
cursors?

Thanks in advance for any answers anyone can provide to these questions.


Re: [PERFORM] PostgreSQL Function Language Performance: C vs PL/PGSQL

2010-05-26 Thread Stephen Frost
* Eliot Gable (egable+pgsql-performa...@gmail.com) wrote:
 Would a query such as this obtain any performance improvement by being
 re-written using C?

I wouldn't expect the queries called by the pl/pgsql function to be much
faster if called through SPI from C instead.  I think the question you
need to answer is- how long does the pl/pgsql code take vs. the overall
time the function takes as a whole?  You could then consider that your
'max benefit' (or pretty close to it) which could be gained by rewriting
it in C.

 Are there specific cases where writing a function in C would be highly
 desirable versus using PL/PGSQL (aside from simply gaining access to
 functionality not present in PL/PGSQL)?

Cases where a function is called over and over again, or there are loops
which go through tons of data, or there's a lot of data processing to be
done.

 Are there specific cases where writing a function in C would be slower than
 writing the equivalent in PL/PGSQL?

Probably not- provided the C code is written correctly.  You can
certainly screw that up (eg: not preparing a query in C and having PG
replan it every time would probably chew up any advantage C has over
pl/pgsql, in a simple function).

 Basically, I am looking for some guidelines based primarily on performance
 of when I should use C to write a function versus using PL/PGSQL.

Realize that C functions have a lot of other issues associated with them -
typically they're much larger foot-guns, for one; for another, C is an
untrusted language because it can do all kinds of bad things, so you
have to be a superuser to create them.

 Can anybody quantify any of the performance differences between doing a
 particular task in C versus doing the same thing in PL/PGSQL? For example,
 performing a SELECT query or executing a certain number of lines of control
 logic (primarily IF/THEN, but an occasional loop included)? How about
 assignments or basic math like
 addition/subtraction/multiplication/division?

Actually performing a SELECT through SPI vs. calling it from pl/pgsql
probably won't result in that much difference, presuming most of the
time there is in the actual query itself.  Assignments, basic math,
control logic, etc, will all be faster in C.  You need to figure out if
that work is taking enough time to justify the switch though.

 When executing SQL queries inside a C-based function, is there any way to
 have all of the SQL queries pre-planned through the compilation process,
 definition of the function, and loading of the .so file similar to PL/PGSQL?

You might be able to do that when the module is loaded, but I'm not 100%
sure..  Depends on if you can start using SPI in _PG_init..  I think
there was some discussion about that recently but I'm not sure what the
answer was.

 Would I get better performance writing each SQL query as a stored procedure
 and then call these stored procedures from within a C-based function which
 does the logging, math, control logic, and builds the result sets and
 cursors?

Uhh, I'd guess 'no' to that one.

Thanks,

Stephen




Re: [PERFORM] PostgreSQL Function Language Performance: C vs PL/PGSQL

2010-05-26 Thread Eliot Gable
Thanks for the quick follow-up. So, you are saying that if I can do SPI in
_PG_init, then I could prepare all my queries there and they would be
prepared once for the entire function when it is loaded? That would
certainly achieve what I want. Does anybody know whether I can do SPI in
_PG_init?

The function gets called a lot, but not in the same transaction. It is only
called once per transaction.





-- 
Eliot Gable

We do not inherit the Earth from our ancestors: we borrow it from our
children. ~David Brower

I decided the words were too conservative for me. We're not borrowing from
our children, we're stealing from them--and it's not even considered to be a
crime. ~David Brower

Esse oportet ut vivas, non vivere ut edas. (Thou shouldst eat to live; not
live to eat.) ~Marcus Tullius Cicero


Re: [PERFORM] Random Page Cost and Planner

2010-05-26 Thread David Jarvis
Hi, Alexey.

Is it necessary to get the data as far back as 1900 all the time? Maybe there is
 a possibility to aggregate results from past years if they are constant.


This I have done. I created another table (station_category) that associates
stations with when they started to take measurements and when they stopped
(based on the data in the measurement table). For example:

station_id; category_id; taken_start; taken_end
1;4;1984-07-01;1996-11-30
1;5;1984-07-01;1996-11-30
1;6;1984-07-01;1996-11-10
1;7;1984-07-01;1996-10-31

This means that station 1 has data for categories 4 through 7. The
measurement table returns 3865 rows for station 1 and category 7 (this uses
an index and took 7 seconds cold):

station_id; taken; amount
1;1984-07-01;0.00
1;1984-07-02;0.00
1;1984-07-03;0.00
1;1984-07-04;0.00

The station_category table is basically another index.

Would explicitly sorting the measurement table (273M rows) by station then
by date help?
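
By explicitly sorting I mean something along these lines, one partition at a
time (a sketch only - the index name is made up, and I understand CLUSTER
locks and rewrites the table):

CREATE INDEX measurement_013_station_taken_idx
  ON climate.measurement_013 (station_id, taken);

CLUSTER climate.measurement_013 USING measurement_013_station_taken_idx;
ANALYZE climate.measurement_013;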

Dave


Re: [PERFORM] PostgreSQL Function Language Performance: C vs PL/PGSQL

2010-05-26 Thread Stephen Frost
* Eliot Gable (egable+pgsql-performa...@gmail.com) wrote:
 Thanks for the quick follow-up. So, you are saying that if I can do SPI in
 _PG_init, then I could prepare all my queries there and they would be
 prepared once for the entire function when it is loaded? That would
 certainly achieve what I want. Does anybody know whether I can do SPI in
 _PG_init?

Unless you're using EXECUTE in your pl/pgsql, the queries in your
pl/pgsql function are already getting prepared on the first call of the
function for a given backend connection..  If you're using EXECUTE in
pl/pgsql then your problem might be planning time.  Moving that to C
isn't going to change things as much as you might hope if you still have
to plan the query every time you call it..

 The function gets called a lot, but not in the same transaction. It is only
 called once per transaction.

That's not really relevant..  Is it called a lot from the same
backend/database connection?  If so, and if you're using regular SELECT
statements and the like (not EXECUTE), then they're getting prepared the
first time they're used and that is kept across transactions.
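
As a rough illustration of the difference (hypothetical function and table
names, not your actual code):

CREATE OR REPLACE FUNCTION station_avg(p_station integer)
RETURNS numeric AS $$
DECLARE
  v_amount numeric;
BEGIN
  -- static statement: prepared on first use in a backend, and the plan is
  -- reused across transactions on that connection
  SELECT avg(m.amount) INTO v_amount
  FROM measurement m
  WHERE m.station_id = p_station;

  -- dynamic statement: parsed and planned again on every execution
  EXECUTE 'SELECT avg(m.amount) FROM measurement m WHERE m.station_id = $1'
    INTO v_amount USING p_station;

  RETURN v_amount;
END;
$$ LANGUAGE plpgsql;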

Thanks,

Stephen




Re: [PERFORM] PostgreSQL Function Language Performance: C vs PL/PGSQL

2010-05-26 Thread Eliot Gable
Ah, that clears things up. Yes, the connections are more or less persistent.
I have a connection manager which doles connections out to the worker
threads and reclaims them when the workers are done with them. It
dynamically adds new connections based on load. Each worker obtains a
connection from the connection manager, performs a transaction which
involves executing the function and pulling back the results from the
cursors, then releases the connection back to the connection manager for
other workers to use. So, this means that even when written in C, the SQL
queries will be planned and cached on each connection after the first
execution. So, I guess the question just becomes whether using SPI in C has
any extra overhead versus using PL/PGSQL which might make it slower for
performing queries. Since PostgreSQL is written in C, I assume there is no
such additional overhead. I assume that the PL/PGSQL implementation at its
heart also uses SPI to perform those executions. Is that a fair statement?





-- 
Eliot Gable

We do not inherit the Earth from our ancestors: we borrow it from our
children. ~David Brower

I decided the words were too conservative for me. We're not borrowing from
our children, we're stealing from them--and it's not even considered to be a
crime. ~David Brower

Esse oportet ut vivas, non vivere ut edas. (Thou shouldst eat to live; not
live to eat.) ~Marcus Tullius Cicero


Re: [PERFORM] PostgreSQL Function Language Performance: C vs PL/PGSQL

2010-05-26 Thread Stephen Frost
* Eliot Gable (egable+pgsql-performa...@gmail.com) wrote:
 Since PostgreSQL is written in C, I assume there is no
 such additional overhead. I assume that the PL/PGSQL implementation at its
 heart also uses SPI to perform those executions. Is that a fair statement?

Right, but I also wouldn't expect a huge improvement either, unless
you're calling these queries a ton, or the queries that you're calling
from the pl/pgsql are pretty short-lived.

Don't get me wrong, C is going to be faster, but it depends on exactly
what's going on as to whether it's going to be an overall improvement of, say,
10%, or a 10-fold improvement. :)

Thanks,

Stephen




Re: [PERFORM] performance of temporary vs. regular tables

2010-05-26 Thread Grzegorz Jaśkiewicz
WAL matters for performance, hence it is advisable to have it on a
separate drive :)



Re: [PERFORM] PostgreSQL Function Language Performance: C vs PL/PGSQL

2010-05-26 Thread Craig James

On 5/26/10 9:47 AM, Stephen Frost wrote:

* Eliot Gable (egable+pgsql-performa...@gmail.com) wrote:

Since PostgreSQL is written in C, I assume there is no
such additional overhead. I assume that the PL/PGSQL implementation at its
heart also uses SPI to perform those executions. Is that a fair statement?


Right, but I also wouldn't expect a huge improvement either, unless
you're calling these queries a ton, or the queries that you're calling
from the pl/pgsql are pretty short-lived.

Don't get me wrong, C is going to be faster, but it depends on exactly
what's going on as to whether it's going to be an overall improvement of, say,
10%, or a 10-fold improvement. :)


Or a 0.1% improvement, which is more likely.  Or that the PL/PGSQL version is 
even faster than the C version, because if you do any string regexp in your 
function, Perl has extremely efficient algorithms, probably better than you 
have time to write in C.

We use Perl extensively and have never had any complaints.  The database
activity completely dominates all queries, and the performance of Perl has
never even been noticeable.

We use C functions for a few things, and it is a big nuisance.  Every time
you upgrade Postgres or your OS, there's a chance the recompile will fail 
because of changed header files.  Any bugs in your code crash Postgres itself.  
We avoid C as much as possible (and I love C, been doing it since 1984).

Craig



Re: [PERFORM] Random Page Cost and Planner

2010-05-26 Thread tv
 Hi,

 And this is what happens in the queries above - the first query covers
 years 1963-2009, while the second one covers 1900-2009. Given the fact
 this table contains ~40m rows, the first query returns about 0.01% (3k
 rows) while the second one returns almost 50% of the data (18m rows). So I
 doubt this might be improved using an index ...


 I don't think that's what I'm doing.

 There are two tables involved: station_category (sc) and measurement (m).

 The first part of the query:

  extract(YEAR FROM sc.taken_start) >= 1900 AND
  extract(YEAR FROM sc.taken_end) <= 2009 AND

 That is producing a limit on the station_category table. There are, as far
 as I can tell, no stations that have been taking weather readings for 110
 years. Most of them have a lifespan of 24 years. The above condition just
 makes sure that I don't get data before 1900 or after 2009.



OK, I admit I'm a little bit confused by the query, especially by these
rows:

sc.taken_start >= '1900-01-01'::date AND
sc.taken_end <= '1996-12-31'::date AND
m.taken BETWEEN sc.taken_start AND sc.taken_end AND

Which seems to me a little bit convoluted. Well, I think I understand
what that means - give me all stations for a given city, collecting the
category of data at a certain time. But I'm afraid this makes the planning
much more difficult, as the select from measurements depends on the data
returned by other parts of the query (rows from category).

See this http://explain.depesz.com/s/H1 and this
http://explain.depesz.com/s/GGx

I guess the planner is confused in the second case - it believes it has to
read a lot more data from the measurement table, and so chooses the
sequential scan. The question is whether this is the right decision (I
believe it is not).

How many rows does the query return without the group by clause? About
14 in both cases, right?

 by time (each year a separate partition) but you'll have to investigate
 that on your own (depends on your use-cases).


 I cannot partition by time. First, there are 7 categories, which would mean
 770 partitions if I did it by year -- 345000 rows per partition. This will
 grow in the future. I have heard there are troubles with having lots of
 child tables (too many files for the operating system). Second, the user has
 the ability to pick arbitrary day ranges for arbitrary year spans.

 There's a year wrapping issue that I won't explain because I never get
 it
 right the first time. ;-)

OK, I hadn't noticed the table is already partitioned by category_id, and
I didn't mean to partition by (taken, category_id) - that would produce a
lot of partitions. Yes, that might cause problems related to the number of
files, but that's rather a filesystem-related issue.

I'd rather expect issues related to RULEs or triggers (not sure which of
them you use to redirect the data into partitions). But when partitioning
by time (and not by category_id) the number of partitions will be much
lower, and you don't have to keep all of the rules active - all you need is
a rule for the current year (and maybe the next one).
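
For example, a redirect rule for the current year's partition might look
roughly like this (the partition and rule names are hypothetical):

CREATE RULE measurement_insert_y2010 AS
  ON INSERT TO climate.measurement
  WHERE (NEW.taken >= DATE '2010-01-01' AND NEW.taken <= DATE '2010-12-31')
  DO INSTEAD
    INSERT INTO climate.measurement_y2010 VALUES (NEW.*);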

I'm not sure what you mean by 'year wrapping issue' but I think it might
work quite well - right now the problem is that PostgreSQL decides to scan
the whole partition (all data for a given category_id).

regards
Tomas




Re: [PERFORM] Random Page Cost and Planner

2010-05-26 Thread David Jarvis
Hi, Kevin.

below something in the range of 1.5 to 2 is probably not going to be
 a good choice for the mix as a whole.


Good to know; thanks.


 This should probably be set to something on the order of 3GB.  This
 will help the optimizer make more intelligent choices about when use
 of the index will be a win.


I'll try this.


 times.  You'll get your best information if you can simulate a
 more-or-less realistic load, and try that with various settings and


I have no idea what a realistic load will be. The system is still in
development and not open to the general public. I also don't know how much
publicity the system will receive when finished. Could be a few hundred
hits, could be over ten thousand.

I want the system to be ready for the latter case, which means it needs to
return data for many different query parameters (date span, elevation, year,
radius, etc.) in under two seconds.


 indexes.  The cache turnover and resource contention involved in
 production can influence performance, and are hard to estimate any
 other way.


Another person suggested to take a look at the data.

I ran a query to see if it makes sense to split the data by year. The
trouble is that there are 110 years and 7 categories. The data is already
filtered into child tables by category (that is logical because reporting on
two different categories is nonsensical -- it is meaningless to report on
snow depth *and* temperature: we already know it needs to be cold for snow).

count;decade start; decade end; min date; max date
3088;1990;2000;1990-01-01;2009-12-31
2925;1980;2000;1980-01-01;2009-12-31
2752;2000;2000;2000-01-01;2009-12-31
2487;1970;1970;1970-01-01;1979-12-31
2391;1980;1990;1980-02-01;1999-12-31
2221;1980;1980;1980-01-01;1989-12-31
1934;1960;2000;1960-01-01;2009-12-31
1822;1960;1960;1960-01-01;1969-12-31
1659;1970;1980;1970-01-01;1989-12-31
1587;1960;1970;1960-01-01;1979-12-31
1524;1970;2000;1970-01-01;2009-12-31

The majority of data collected by weather stations is between 1960 and 2009,
which makes sense because transistor technology would have made for
(relatively) inexpensive automated monitoring stations. Or maybe there were
more people and more taxes collected thus a bigger budget for weather study.
Either way. ;-)

The point is the top three decades (1990, 1980, 2000) have the most data,
giving me a few options:

   - Split the seven tables twice more: before 1960 and after 1960.
   - Split the seven tables by decade.

The first case gives 14 tables. The second case gives 102 tables (at 2.5M
rows per table) as there are about 17 decades in total. This seems like a
manageable number of tables as the data might eventually span 22 decades,
which would be 132 tables.

Even though the users will be selecting 1900 to 2009, most of the stations
themselves will be within the 1960 - 2009 range, with the majority of those
active between 1980 and 2009.

Would splitting by decade improve the speed?

Thank you very much.

Dave


Re: [PERFORM] Random Page Cost and Planner

2010-05-26 Thread David Jarvis
Hi,

sc.taken_end <= '1996-12-31'::date AND
 m.taken BETWEEN sc.taken_start AND sc.taken_end AND

 category of data at a certain time. But I'm afraid this makes the planning
 much more difficult, as the select from measurements depends on the data
 returned by other parts of the query (rows from category).


Right. Users can select 1900 - 2009. Station data hardly ever spans that
range.

The *station_category* is used to create a unique key into the measurement
data for every station: station_id, category_id, and taken_start. The
measurement data should be contiguous until taken_end.

I thought that that combination would be a pointer to the exact spot in the
measurement table  where the data starts, which should be ridiculously fast
to find.

See this http://explain.depesz.com/s/H1 and this
 http://explain.depesz.com/s/GGx


I was getting some red lines when I looked at a different plan. It's a great
site.

How many rows does the query return without the group by clause? About
 14 in both cases, right?


SELECT
  *
FROM
  climate.measurement m
WHERE
  m.station_id = 5148 AND
  m.taken BETWEEN '1900-08-01'::date AND '2009-12-31'::date AND
  m.category_id = 1

5397 rows (10 seconds cold; 0.5 seconds hot); estimated too high by 2275
rows?

http://explain.depesz.com/s/uq

 OK, I hadn't noticed the table is already partitioned by category_id and
 I didn't mean to partition by (taken, category_id) - that would produce a
 lot of partitions. Yes, that might cause problems related to number of
 files, but that's rather a filesystem related issue.


Constrained as:

  CONSTRAINT measurement_013_category_id_ck CHECK (category_id = 7)


 I'd rather expect issues related to RULEs or triggers (not sure which of
 them you use to redirect the data into partitions). But when partitioning


I created seven child tables of measurement. Each of these has a constraint
by category_id. This makes it extremely fast to select the correct
partition.
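
For example, a quick EXPLAIN can confirm that only the matching child gets
scanned (a sketch; assumes constraint_exclusion is enabled):

EXPLAIN
SELECT count(*)
FROM climate.measurement m
WHERE m.category_id = 7
  AND m.taken BETWEEN DATE '1980-01-01' AND DATE '1980-12-31';

-- the plan should reference only measurement_013 (the category 7 child),
-- not all seven children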


 I'm not sure what you mean by 'year wrapping issue' but I think it might
 work quite well - right now the problem is PostgreSQL decides to scan the
 whole partition (all data for a given category_id).


I'll give it another try. :-)

*Use Case #1*
User selects: Mar 22 to Dec 22
User selects: 1900 to 2009

Result: Query should average *9 months* of climate data per year between Mar
22 and Dec 22 of Year.

*Use Case #2*
User selects: Dec 22 to Mar 22
User selects: 1900 to 2009

Result: Query should average *3 months* of climate data per year between Dec
22 of Year and Mar 22 of Year+1.

So if a user selects 1950 to *1960*:

   - first case should average between 1950 and *1960*; and
   - second case should average between 1950 and *1961*.

Dave


Re: [PERFORM] Random Page Cost and Planner

2010-05-26 Thread David Jarvis
I was told to try OVERLAPS instead of checking years. The query is now:

  SELECT
extract(YEAR FROM m.taken) AS year,
avg(m.amount) as amount
  FROM
climate.city c,
climate.station s,
climate.station_category sc,
climate.measurement m
  WHERE
c.id = 5148 AND
earth_distance(
  ll_to_earth(c.latitude_decimal,c.longitude_decimal),
  ll_to_earth(s.latitude_decimal,s.longitude_decimal)) / 1000 <= 30 AND
s.elevation BETWEEN 0 AND 3000 AND
s.applicable = TRUE AND
sc.station_id = s.id AND
sc.category_id = 7 AND
*(sc.taken_start, sc.taken_end) OVERLAPS ('1900-01-01'::date,
'2009-12-31'::date) AND*
m.station_id = s.id AND
m.taken BETWEEN sc.taken_start AND sc.taken_end AND
m.category_id = sc.category_id
  GROUP BY
extract(YEAR FROM m.taken)
  ORDER BY
extract(YEAR FROM m.taken)

25 seconds from cold, no full table scan:

http://explain.depesz.com/s/VV5

Much better than 85 seconds, but still an order of magnitude too slow.

I was thinking of changing the *station_category* table to use the
measurement table's primary key, instead of keying off date, as converting
the dates for comparison strikes me as a bit of overhead. Also, I can
remove the / 1000 by changing the Earth's radius to kilometres (from
metres), but a constant division shouldn't be significant.

I really appreciate all your patience and help over the last sixteen days
trying to optimize this database and these queries.

Dave