Re: [PERFORM] Analysis Function

2010-06-16 Thread David Jarvis
Fair enough. How about something like make_timestamp? It's at least shorter and easier than construct :-) Agreed. Dave

Re: [PERFORM] Analysis Function

2010-06-14 Thread David Jarvis
Does it make sense to allow minutes when hours isn't specified? Or For time, 00 seems a reasonable default for all values; clearly document the defaults. Also, having a default makes the code simpler than level n plus level n-1. (Not to mention explaining it.) ;-) SELECT to_timestamp( minutes

Re: [PERFORM] Analysis Function

2010-06-13 Thread David Jarvis
Hi, We had a little chat about this with Magnus. It's pretty surprising that there's no built-in function to do this, we should consider adding one. I agree; you should be able to create a timestamp or a date from integer values. Others, apparently, have written code. The implementation I did

Re: [PERFORM] Analysis Function

2010-06-13 Thread David Jarvis
Hi, It's not immediately obvious what the default value of timezone will be? The system's locale, like now(); documentation can clarify. By named parameter, I meant default value. You could construct a timestamp variable using: construct_timestamp( year := 1900, hour := 1 ) When I read

Re: [PERFORM] Analysis Function

2010-06-11 Thread David Jarvis
Hi, To avoid string concatenation using dates, I figured I could write a C function: #include postgres.h #include fmgr.h #include utils/date.h #include utils/nabstime.h #ifdef PG_MODULE_MAGIC PG_MODULE_MAGIC; #endif Datum dateserial (PG_FUNCTION_ARGS); PG_FUNCTION_INFO_V1 (dateserial); Datum

Re: [PERFORM] Analysis Function

2010-06-11 Thread David Jarvis
Hi, Tom. extract(YEAR FROM m.taken) I thought that returned a double precision? Dave

Re: [PERFORM] Analysis Function

2010-06-11 Thread David Jarvis
Hi, I added an explicit cast in the SQL: dateserial(extract(YEAR FROM m.taken)::int,'||p_month1||','||p_day1||') d1, dateserial(extract(YEAR FROM m.taken)::int,'||p_month2||','||p_day2||') d2 The function now takes three integer parameters; there was no performance loss. Thank

Re: [PERFORM] Analysis Function

2010-06-10 Thread David Jarvis
Hi, I found a slow part of the query: SELECT * date(extract(YEAR FROM m.taken)||'-1-1') d1,* * date(extract(YEAR FROM m.taken)||'-1-31') d2* FROM climate.city c, climate.station s, climate.station_category sc, climate.measurement m WHERE c.id = 5148 AND ... Date extraction is 3.2

Re: [PERFORM] Analysis Function

2010-06-10 Thread David Jarvis
Hi, Andy. I assume you are doing this in a loop? Many Many Many times? cuz: Yes. Here are the variations I have benchmarked (times are best of three): Variation #0 -no date field- Explain: http://explain.depesz.com/s/Y9R Time: 2.2s Variation #1 date('1960-1-1') Explain:

Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-06-02 Thread David Jarvis
Sorry, Alvaro. I was contemplating using a GIN or GiST index as a way of optimizing the query. Instead, I found that re-inserting the data in order of station ID (the primary look-up column) and then CLUSTER'ing on the station ID, taken date, and category index increased the speed by an order of

Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-06-02 Thread David Jarvis
Hi, Hmm, that's nice, though I cannot but wonder whether the exclusive lock required by CLUSTER is going to be a problem in the long run. Not an issue; the inserts are one-time (or very rare; at most: once a year). Hm, keep in mind that if the station clause alone is not selective enough,

Re: [PERFORM] Random Page Cost and Planner

2010-05-27 Thread David Jarvis
Hi, Bryan. I was just about to reply to the thread, thanks for asking. Clustering was key. After rebooting the machine (just to make sure absolutely nothing was cached), I immediately ran a report on Toronto: 5.25 seconds! Here's what I did: 1. Created a new set of tables that matched the

Re: [PERFORM] Random Page Cost and Planner

2010-05-27 Thread David Jarvis
Salut, Cédric. I wonder what the plan will be if you replace sc.taken_* in : m.taken BETWEEN sc.taken_start AND sc.taken_end by values. It might help the planner... That is a fairly important restriction. I will try making it * (year1||'-01-01')::date*, but I have no constant value for it --

Re: [PERFORM] Random Page Cost and Planner

2010-05-27 Thread David Jarvis
Hi, Bryan. Thanks for the notes. I thought about using a prepared statement, but I cannot find any examples of using a PREPARE statement from within a function, and don't really feel like tinkering around to figure it out. Performance is at the point where the Java/PHP bridge and JasperReports

Re: [PERFORM] Random Page Cost and Planner

2010-05-26 Thread David Jarvis
Hi, Rob. I tried bumping the effective_cache_size. It made no difference. My latest attempt at forcing PostgreSQL to use the indexes involved two loops: one to loop over the stations, the other to extract the station data from the measurement table. The outer loop executes in 1.5 seconds. The

Re: [PERFORM] Random Page Cost and Planner

2010-05-26 Thread David Jarvis
Hi, Alexey. Is it necessary to get the data as far as 1900 all the time ? Maybe there is a possibility to aggregate results from the past years if they are constant. This I have done. I created another table (station_category) that associates stations with when they started to take

Re: [PERFORM] Random Page Cost and Planner

2010-05-26 Thread David Jarvis
Hi, Kevin. below something in the range of 1.5 to 2 is probably not going to be a good choice for the mix as a whole. Good to know; thanks. This should probably be set to something on the order of 3GB. This will help the optimizer make more intelligent choices about when use of the index

Re: [PERFORM] Random Page Cost and Planner

2010-05-26 Thread David Jarvis
Hi, sc.taken_end = '1996-12-31'::date AND m.taken BETWEEN sc.taken_start AND sc.taken_end AND category of data at a certain time. But I'm afraid this makes the planning much more difficult, as the select from measurements depend on the data returned by other parts of the query (rows from

Re: [PERFORM] Random Page Cost and Planner

2010-05-26 Thread David Jarvis
I was told to try OVERLAPS instead of checking years. The query is now: SELECT extract(YEAR FROM m.taken) AS year, avg(m.amount) as amount FROM climate.city c, climate.station s, climate.station_category sc, climate.measurement m WHERE c.id = 5148 AND

Re: [PERFORM] Random Page Cost and Planner

2010-05-25 Thread David Jarvis
Hi, I changed the date comparison to be based on year alone: extract(YEAR FROM sc.taken_start) = 1900 AND extract(YEAR FROM sc.taken_end) = 2009 AND The indexes are now always used; if someone wants to explain why using the numbers works (a constant) but using a date (another constant?)

Re: [PERFORM] Random Page Cost and Planner

2010-05-25 Thread David Jarvis
Hi, Kevin. Thanks for the response. It sounds as though the active portion of your database is pretty much cached in RAM. True? I would not have thought so; there are seven tables, each with 39 to 43 million rows as: CREATE TABLE climate.measurement ( id bigserial NOT NULL, taken date

Re: [PERFORM] Random Page Cost and Planner

2010-05-25 Thread David Jarvis
Hi, Tom. Yes, that is what happened, making the tests rather meaningless, and giving me the false impression that the indexes were being used. They were but only because of cached results. When multiple users making different queries, the performance will return to ~80s per query. I also tried

[PERFORM] Random Page Cost and Planner

2010-05-24 Thread David Jarvis
Hi, I wrote a query (see below) that extracts climate data from weather stations within a given radius of a city using the dates for which those weather stations actually have data. The query uses the measurement table's only index: CREATE UNIQUE INDEX measurement_001_stc_idx ON

Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-23 Thread David Jarvis
Hi, The problem is now solved (in theory). Well, it's not the functions per se that's the problem, it's the lack of a useful index on the expression. The measurement table indexes (on date and weather station) were not being used because the only given date ranges (e.g., 1900 - 2009) were

Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-22 Thread David Jarvis
Hi, CREATE INDEX measurement_01_001_y_idx ON climate.measurement_01_001 USING btree (date_part('year'::text, taken)); Is that equivalent to what you suggest? No. It is not the same function, so Postgres has no way to know it produces the same results (if it does). This is what I

Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-22 Thread David Jarvis
Hi, certainly understand that you wouldn't want to partition by year. It Definitely not. does strike me that perhaps you could partition by day ranges, but you'd I don't think that will work; users can choose any day range, with the most common as Jan 1 - Dec 31, followed by seasonal

Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-21 Thread David Jarvis
Hi, Yeb. This is starting to go back to the design I used with MySQL: - YEAR_REF - Has year and station - MONTH_REF - Has month, category, and yea referencer - MEASUREMENT - Has month reference, amount, and day Normalizing by date parts was fast. Partitioning the tables by year won't

Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
Hi, I have posted an image of the user inputs here: http://i.imgur.com/MUkuZ.png The problem is that I am given a range of days (Dec 22 - Mar 22) over a range of years (1900 - 2009) and the range of days can span from one year to the next. This is not the same as saying Dec 22, 1900 to Mar 22,

Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
When using MySQL, the performance was okay (~5 seconds per query) using: date( concat_ws( '-', y.year, m.month, d.day ) ) between -- Start date. date( concat_ws( '-', y.year, $P{Month1}, $P{Day1} ) ) AND -- End date. Calculated by checking to see if the end date wraps -- into

Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
What if I were to have the application pass in two sets of date ranges? For the condition of Dec 22 to Mar 22: Dec 22 would become: - Dec 22 - Dec 31 Mar 22 would become: - Jan 1 - Mar 22 The first range would always be for the current year; the second range would always be for the

Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
I was hoping to eliminate this part of the query: (cast(extract( YEAR FROM m.taken ) + greatest( -1 * sign( (extract( YEAR FROM m.taken )||'-12-31')::date - (extract( YEAR FROM m.taken )||'-01-01')::date ), 0 ) AS text)||'-12-31')::date That uses

Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
Hi, I was still referring to the measurement table. You have an index on stationid, but still seem to be getting a sequential scan. Maybe the planner does not realise that you are selecting a small number of stations. Posting an EXPLAIN ANALYSE would really help here. Here is the result from

Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
The greatest() expression reduces to either the current year (year + 0) or the next year (year + 1) by taking the sign of the difference in start/end days. This allows me to derive an end date, such as: Dec 22, 1900 to Mar 22, 1901 Then I check if the measured date falls between those two dates.

Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
Hi, ~300 million measurements ~12000 stations (not 7 as I mentioned before) ~5500 cities some serious data tho, at least. Basically, PG is sequentially scanning through all of the tables in your partitioning setup. What is constraint_exclusion set to? What version of PG is this? Do the

Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
Hi, check (taken = '1913-12-01' and taken = '1913-12-31') I don't think I want to constrain by year, for a few reasons: 1. There are a lot of years -- over 110. 2. There will be more years added (both in the future for 2010 and in the past as I get data from other sources). Currently I

Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
Hi, Something in here really smells fishy to me. Those extract's above are working on values which are from the table.. Why aren't you using these functions to figure out how to construct the actual dates based on the values provided by the *user*..? Because I've only been using PostgreSQL

Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
I took out the date conditions: SELECT m.* FROM climate.measurement m WHERE m.category_id = 1 and m.station_id = 2043 This uses the station indexes: Result (cost=0.00..21781.18 rows=8090 width=28) - Append (cost=0.00..21781.18 rows=8090 width=28) - Seq Scan on measurement

Re: [PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-20 Thread David Jarvis
Hi, (An EXPLAIN ANALYSE would be better here). Look at the expected number of stations Nested Loop (cost=0.00..994.94 rows=4046 width=4) (actual time=0.053..41.173 rows=78 loops=1) Join Filter: ((6371.009::double precision * sqrt((pow(radians(((c.latitude_decimal -

[PERFORM] Optimize date query for large child tables: GiST or GIN?

2010-05-19 Thread David Jarvis
Hi, I recently switched to PostgreSQL from MySQL so that I can use PL/R for data analysis. The query in MySQL form (against a more complex table structure) takes ~5 seconds to run. The query in PostgreSQL I have yet to let finish, as it takes over a minute. I think I have the correct table