[PERFORM] efficient data reduction (and deduping)

2012-03-01 Thread Alessandro Gagliardi
Hi folks, I have a system that racks up about 40M log lines per day. I'm able to COPY the log files into a PostgreSQL table that looks like this: CREATE TABLE activity_unlogged ( user_id character(24) NOT NULL, client_ip inet, hr_timestamp timestamp without time zone, locale character
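A minimal sketch of the hourly reduction being asked about, assuming the truncated column list above and using DISTINCT ON; this is an illustration, not the exact query from the thread:

  -- Keep the most recent row per user per hour; columns beyond the excerpt are hypothetical.
  SELECT DISTINCT ON (user_id, date_trunc('hour', hr_timestamp))
         user_id, client_ip, hr_timestamp
    FROM activity_unlogged
   ORDER BY user_id, date_trunc('hour', hr_timestamp), hr_timestamp DESC;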

Re: [PERFORM] efficient data reduction (and deduping)

2012-03-01 Thread Alessandro Gagliardi
in that assumption? On Thu, Mar 1, 2012 at 10:40 AM, Peter van Hardenberg p...@pvh.ca wrote: On Thu, Mar 1, 2012 at 10:27 AM, Alessandro Gagliardi alessan...@path.com wrote: Now, I want to reduce that data to get the last activity that was performed by each user in any given hour. It should fit

Re: [PERFORM] efficient data reduction (and deduping)

2012-03-01 Thread Alessandro Gagliardi
, Mar 1, 2012 at 10:51 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Alessandro Gagliardi alessan...@path.com wrote: hr_timestamp timestamp without time zone, In addition to the responses which more directly answer your question, I feel I should point out

Re: [PERFORM] efficient data reduction (and deduping)

2012-03-01 Thread Alessandro Gagliardi
Hah! Yeah, that might work. Except that I suck at grep. :( Perhaps that's a weakness I should remedy. On Thu, Mar 1, 2012 at 10:35 AM, Craig James cja...@emolecules.com wrote: On Thu, Mar 1, 2012 at 10:27 AM, Alessandro Gagliardi alessan...@path.com wrote: Hi folks, I have a system

Re: [PERFORM] efficient data reduction (and deduping)

2012-03-01 Thread Alessandro Gagliardi
Interesting solution. If I'm not mistaken, this does solve the problem of having two entries for the same user at the exact same time (which violates my pk constraint), but it does so by leaving both of them out (since there is no au1.hr_timestamp > au2.hr_timestamp in that case). Is that right? On
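For comparison, a hedged sketch of the usual anti-join form of that self-join (not the exact statement quoted in the thread): a row is kept only if no other row for the same user and hour has a strictly later timestamp, so rows that tie on hr_timestamp both survive here, whereas the variant discussed above drops both.

  SELECT au1.*
    FROM activity_unlogged au1
   WHERE NOT EXISTS (
           SELECT 1 FROM activity_unlogged au2
            WHERE au2.user_id = au1.user_id
              AND date_trunc('hour', au2.hr_timestamp) = date_trunc('hour', au1.hr_timestamp)
              AND au2.hr_timestamp > au1.hr_timestamp
         );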

Re: [PERFORM] efficient data reduction (and deduping)

2012-03-01 Thread Alessandro Gagliardi
Ah, yes, that makes sense. Thank you! On Thu, Mar 1, 2012 at 11:39 AM, Claudio Freire klaussfre...@gmail.com wrote: On Thu, Mar 1, 2012 at 4:35 PM, Alessandro Gagliardi alessan...@path.com wrote: Interesting solution. If I'm not mistaken, this does solve the problem of having two entries

[PERFORM] set autovacuum=off

2012-02-23 Thread Alessandro Gagliardi
I have a database where I virtually never delete and almost never do updates. (The updates might change in the future but for now it's okay to assume they never happen.) As such, it seems like it might be worth it to set autovacuum=off or at least make it so vacuuming hardly ever occurs. Actually,

Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Alessandro Gagliardi
I should have been more clear. I virtually never delete or do updates, but I insert *a lot*. So the table does change quite a bit, but only in one direction. I was unable to disable autovacuum universally (due to the cant_change_runtime_param error) but I was able to disable it on individual
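For reference, a sketch of the per-table form that works without touching postgresql.conf (the table name is a placeholder):

  -- Disable autovacuum for one insert-only table; re-enable the same way with true.
  ALTER TABLE my_insert_only_table SET (autovacuum_enabled = false);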

Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Alessandro Gagliardi
a database that is being used in this way. Any cache sizes I should be messing with? Etc. Thank you, -Alessandro On Thu, Feb 23, 2012 at 9:45 AM, Thom Brown t...@linux.com wrote: On 23 February 2012 17:35, Alessandro Gagliardi alessan...@path.com wrote: I should have been more clear. I virtually

Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Alessandro Gagliardi
On Thu, Feb 23, 2012 at 10:01 AM, Steve Crawford scrawf...@pinpointresearch.com wrote: The documentation has information like "This parameter can only be set in the postgresql.conf file or on the server command line." that will tell you in advance which settings will fail when you attempt
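A hedged way to check the same thing from SQL instead of the docs: the pg_settings view reports each parameter's context, and anything marked sighup or postmaster cannot be changed per session.

  SELECT name, setting, context
    FROM pg_settings
   WHERE name IN ('autovacuum', 'random_page_cost', 'work_mem');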

Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Alessandro Gagliardi
, Peter van Hardenberg p...@pvh.ca wrote: On Thu, Feb 23, 2012 at 10:38 AM, Alessandro Gagliardi alessan...@path.com wrote: around the same time as disabling auto-vacuum, so that could account for the coincidental speed up). I'm not sure what else I could be doing wrong. It's definitely

Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Alessandro Gagliardi
On Thu, Feb 23, 2012 at 11:07 AM, Andy Colson a...@squeakycode.net wrote: That depends on if you have triggers that are doing selects. But in general you are correct, analyze won't help inserts. I do have some, actually. I have a couple of trigger functions like: CREATE OR REPLACE FUNCTION
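The function body is cut off above; purely as an illustration of the kind of BEFORE INSERT trigger that issues a SELECT (names are illustrative, not taken from the thread), which is why statistics and index maintenance can still matter for an insert-only workload:

  CREATE OR REPLACE FUNCTION skip_duplicate_row() RETURNS trigger AS $$
  BEGIN
    -- Hypothetical lookup performed on every insert.
    PERFORM 1 FROM seen_its
     WHERE user_id = NEW.user_id AND moment_id = NEW.moment_id;
    IF FOUND THEN
      RETURN NULL;  -- drop the duplicate silently
    END IF;
    RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER seen_its_dedup BEFORE INSERT ON seen_its
    FOR EACH ROW EXECUTE PROCEDURE skip_duplicate_row();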

Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Alessandro Gagliardi
On Thu, Feb 23, 2012 at 11:26 AM, Steve Crawford scrawf...@pinpointresearch.com wrote: You need to rethink things a bit. Databases can fail in all sorts of ways and can slow down during bursts of activity, data dumps, etc. You may need to investigate some form of intermediate buffering.

Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Alessandro Gagliardi
On Thu, Feb 23, 2012 at 1:37 PM, Steve Crawford scrawf...@pinpointresearch.com wrote: It's possible that you might get a nice boost by wrapping the inserts into a transaction: begin; insert into...; insert into...; insert into...; ... commit; This only requires all that disk-intensive
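A minimal sketch of that batching pattern (values are placeholders):

  -- One transaction per batch pays the commit/WAL-flush cost once per batch
  -- instead of once per row.
  BEGIN;
  INSERT INTO seen_its (user_id, moment_id, created) VALUES ('u1', 'm1', now());
  INSERT INTO seen_its (user_id, moment_id, created) VALUES ('u2', 'm2', now());
  -- ... many more rows ...
  COMMIT;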

Re: [PERFORM] set autovacuum=off

2012-02-23 Thread Alessandro Gagliardi
On Thu, Feb 23, 2012 at 1:11 PM, Peter van Hardenberg p...@heroku.com wrote: My hunch is still that your issue is lock contention. How would I check that? I tried looking at pg_locks but I don't know what to look for. We have many customers who do much more than this throughput, though I'm
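A hedged starting point for the pg_locks question: requested-but-not-granted locks are the sign of contention, and the pid column can then be matched against pg_stat_activity (whose column names differ across versions) to see who is blocking whom.

  SELECT locktype, relation::regclass AS relation, mode, pid, granted
    FROM pg_locks
   WHERE NOT granted;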

Re: [PERFORM] Indexes and Primary Keys on Rapidly Growing Tables

2012-02-21 Thread Alessandro Gagliardi
: On 2/20/12 2:06 PM, Alessandro Gagliardi wrote: . But first I just want to know if people think that this might be a viable solution or if I'm barking up the wrong tree. Batching is usually helpful for inserts, especially if there's a unique key on a very large table involved. I

Re: [PERFORM] Indexes and Primary Keys on Rapidly Growing Tables

2012-02-21 Thread Alessandro Gagliardi
True. I implemented the SAVEPOINTs solution across the board. We'll see what kind of difference it makes. If it's fast enough, I may be able to do without that. On Tue, Feb 21, 2012 at 3:53 PM, Samuel Gendler sgend...@ideasculptor.com wrote: On Tue, Feb 21, 2012 at 9:59 AM, Alessandro Gagliardi
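A sketch of the SAVEPOINT pattern referred to here, assuming the goal is to keep a batch transaction alive when an individual insert hits a duplicate key (the client decides whether to release or roll back each savepoint):

  BEGIN;
  SAVEPOINT one_row;
  INSERT INTO seen_its (user_id, moment_id, created) VALUES ('u1', 'm1', now());
  -- on success:            RELEASE SAVEPOINT one_row;
  -- on a unique violation: ROLLBACK TO SAVEPOINT one_row;  -- the batch survives
  -- repeat per row, then
  COMMIT;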

Re: [PERFORM] Why so slow?

2012-02-20 Thread Alessandro Gagliardi
Ah, that did make a big difference! It went from taking 10x as long to taking only 1.5x as long (about what I would have expected, if not better.) Thank you! On Fri, Feb 17, 2012 at 9:29 PM, Ants Aasma ants.aa...@eesti.ee wrote: On Feb 17, 2012 8:35 PM, Alessandro Gagliardi alessan...@path.com

[PERFORM] Indexes and Primary Keys on Rapidly Growing Tables

2012-02-20 Thread Alessandro Gagliardi
be a viable solution or if I'm barking up the wrong tree. Thanks! -Alessandro On Fri, Feb 17, 2012 at 10:34 AM, Alessandro Gagliardi alessan...@path.com wrote: CREATE TABLE seen_its ( user_id character(24) NOT NULL, moment_id character(24) NOT NULL, created timestamp without time zone

[PERFORM] Why so slow?

2012-02-17 Thread Alessandro Gagliardi
Comparing SELECT DISTINCT(user_id) FROM blocks JOIN seen_its USING (user_id) WHERE seen_its.created BETWEEN (now()::date - interval '8 days')::timestamp AND now()::date::timestamp to SELECT DISTINCT(user_id) FROM seen_its WHERE created BETWEEN (now()::date - interval '8 days')::timestamp AND

Re: [PERFORM] Why so slow?

2012-02-17 Thread Alessandro Gagliardi
, Alessandro Gagliardi wrote: Comparing SELECT DISTINCT(user_id) FROM blocks JOIN seen_its USING (user_id) WHERE seen_its.created BETWEEN (now()::date - interval '8 days')::timestamp AND now()::date::timestamp to SELECT DISTINCT(user_id) FROM seen_its WHERE created BETWEEN (now()::date

Re: [PERFORM] timestamp with time zone

2012-02-10 Thread Alessandro Gagliardi
at 10:19 PM, Tom Lane t...@sss.pgh.pa.us wrote: Alessandro Gagliardi alessan...@path.com writes: Still slow as mud: http://explain.depesz.com/s/Zfn Now I've got indices on created, timezone, created at time zone timezone, and (created at time zone timezone)::date. Clearly the problem isn't

[PERFORM] timestamp with time zone

2012-02-09 Thread Alessandro Gagliardi
Here's my query: SELECT foursquare.name, foursquare.city, COUNT(moment_id) AS popularity FROM foursq_categories JOIN foursquare USING (foursq_id) JOIN places USING (foursq_id) JOIN blocks USING (block_id) WHERE primary AND (created at time zone timezone)::date = 'yesterday' AND (country =

Re: [PERFORM] timestamp with time zone

2012-02-09 Thread Alessandro Gagliardi
INDEX blocks_private_idx ON blocks USING btree (private ); CREATE INDEX blocks_shared_idx ON blocks USING btree (shared ); CREATE INDEX blocks_timezone_idx ON blocks USING btree (timezone ); On Thu, Feb 9, 2012 at 11:46 AM, Tom Lane t...@sss.pgh.pa.us wrote: Alessandro Gagliardi

Re: [PERFORM] timestamp with time zone

2012-02-09 Thread Alessandro Gagliardi
(or blocks_created_at_timezone_idx). How do I make that happen? On Thu, Feb 9, 2012 at 12:15 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Alessandro Gagliardi alessan...@path.com wrote: (Actually, I originally did try one on (created at time zone timezone)::date but couldn't figure out how to phrase
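The phrasing problem is most likely the extra parentheses an expression index requires; a hedged sketch, assuming created is timestamp with time zone so the cast to date is immutable:

  CREATE INDEX blocks_created_at_timezone_idx
      ON blocks (((created AT TIME ZONE timezone)::date));

For the planner to use it, the WHERE clause has to repeat the same expression, e.g. (created AT TIME ZONE timezone)::date = 'yesterday'.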

Re: [PERFORM] From Simple to Complex

2012-02-02 Thread Alessandro Gagliardi
On Thu, Feb 2, 2012 at 6:52 AM, Merlin Moncure mmonc...@gmail.com wrote: also, is effective_cache_size set to a reasonable value? Yeah, it's 153kB

Re: [PERFORM] From Simple to Complex

2012-02-01 Thread Alessandro Gagliardi
a memory limit and paging out. Is that right? On Tue, Jan 31, 2012 at 3:43 PM, Alessandro Gagliardi alessan...@path.com wrote: I just got a pointer on presenting EXPLAIN ANALYZE in a more human friendly fashion (thanks, Agent M!): http://explain.depesz.com/s/A9S From this it looks like the bottleneck

Re: [PERFORM] From Simple to Complex

2012-02-01 Thread Alessandro Gagliardi
Scan. I've already lowered random_page_cost to 2. Maybe I should lower it to 1.5? Actually 60K should be plenty for my purposes anyway. On Wed, Feb 1, 2012 at 10:35 AM, Scott Marlowe scott.marl...@gmail.com wrote: On Wed, Feb 1, 2012 at 11:19 AM, Alessandro Gagliardi alessan...@path.com wrote
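A sketch of how that experiment is usually staged (the value and database name are placeholders, not a recommendation):

  -- Session-level first, then EXPLAIN ANALYZE the query under test and compare.
  SET random_page_cost = 1.5;
  -- If it helps and permissions allow, persist it for one database:
  ALTER DATABASE mydb SET random_page_cost = 1.5;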

Re: [PERFORM] From Simple to Complex

2012-02-01 Thread Alessandro Gagliardi
On Wed, Feb 1, 2012 at 11:04 AM, Bob Lunney bob_lun...@yahoo.com wrote: Possibly. What does psql show work_mem; say? 100MB

[PERFORM] From Simple to Complex

2012-01-31 Thread Alessandro Gagliardi
cool. But that's definitely secondary compared to getting the above query to run faster. Thank you very much for any help! -Alessandro Gagliardi -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref

Re: [PERFORM] From Simple to Complex

2012-01-31 Thread Alessandro Gagliardi
I should have realized that as I exploited that limitation in three of my tables. Gradually adding those indices now; will report on what kind of difference it makes On Tue, Jan 31, 2012 at 1:22 PM, Alessandro Gagliardi alessan...@path.com wrote: My slow query today is somewhat more complex

Re: [PERFORM] From Simple to Complex

2012-01-31 Thread Alessandro Gagliardi
loops=1) Total runtime: 73511.072 ms Please let me know if there is any way to make this more efficient. Thank you, -Alessandro On Tue, Jan 31, 2012 at 2:10 PM, Alessandro Gagliardi alessan...@path.com wrote: Looks like I missed a key sentence in http://www.postgresql.org/docs/9.0/static/ddl

Re: [PERFORM] From Simple to Complex

2012-01-31 Thread Alessandro Gagliardi
That was a bit faster, but still very slow. Here's the EXPLAIN: http://explain.depesz.com/s/ZdF On Tue, Jan 31, 2012 at 2:53 PM, Alessandro Gagliardi alessan...@path.com wrote: I changed the query a bit so the results would not change over the course of the day to: SELECT relname, emotion

[PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Alessandro Gagliardi
using pgAdmin3, though I doubt that's relevant. As for *GUC Settings*: Again, I don't know what this is. Whatever Heroku defaults to is what I'm using. Thank you in advance! -Alessandro Gagliardi

Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Alessandro Gagliardi
30, 2012 at 4:13 PM, Alessandro Gagliardi alessan...@path.com wrote: So, here's the query: SELECT private, COUNT(block_id) FROM blocks WHERE created > 'yesterday' AND shared IS FALSE GROUP BY private What confuses me is that though this is a largish table (millions of rows

Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Alessandro Gagliardi
klaussfre...@gmail.com wrote: On Mon, Jan 30, 2012 at 5:35 PM, Alessandro Gagliardi alessan...@path.com wrote: To answer your (non-)question about Heroku, it's a cloud service, so I don't host PostgreSQL myself. I'm not sure how much I can mess with things like GUC since I don't even have access

Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Alessandro Gagliardi
On Mon, Jan 30, 2012 at 1:25 PM, Josh Berkus j...@agliodbs.com wrote: You can do SHOW random_page_cost yourself right now, too. 4 I also tried SHOW seq_page_cost and that's 1. Looking at http://www.postgresql.org/docs/current/static/runtime-config-query.html#GUC-RANDOM-PAGE-COST I wonder if I

Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Alessandro Gagliardi
that enable_seqscan is determining whether or not the data is being cached On Mon, Jan 30, 2012 at 1:13 PM, Fernando Hevia fhe...@gmail.com wrote: On Mon, Jan 30, 2012 at 17:35, Alessandro Gagliardi alessan...@path.com wrote: Well that was a *lot* faster: HashAggregate (cost=156301.82..156301.83 rows=2
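A hedged sketch of the experiment implied here, separating the planner's choice from caching: force the index plan for one session and use BUFFERS to see how much of the data is served from shared buffers (query taken from earlier in the thread).

  SET enable_seqscan = off;  -- session-local, for testing only
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT private, COUNT(block_id)
    FROM blocks
   WHERE created > 'yesterday' AND shared IS FALSE
   GROUP BY private;
  SET enable_seqscan = on;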

Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Alessandro Gagliardi
, Alessandro Gagliardi alessan...@path.com wrote: Looking at http://www.postgresql.org/docs/current/static/runtime-config-query.html#GUC-RANDOM-PAGE-COST I wonder if I should try reducing random_page_cost? Yes, try lowering it. Generally speaking, random page cost should always be >= seq

Re: [PERFORM] Why should such a simple query over indexed columns be so slow?

2012-01-30 Thread Alessandro Gagliardi
Got it (with a little bit of klutzing around). :) Thanks! On Mon, Jan 30, 2012 at 2:24 PM, Scott Marlowe scott.marl...@gmail.com wrote: On Mon, Jan 30, 2012 at 3:19 PM, Scott Marlowe scott.marl...@gmail.com wrote: On Mon, Jan 30, 2012 at 2:55 PM, Alessandro Gagliardi alessan...@path.com