Re: [PERFORM] Duplicate deletion optimizations

2012-01-09 Thread antoine

Hello,

Thanks for your numerous and complete answers!

For those who have asked for more information about the process and 
hardware:


The goal of the process is to compute data from a NoSQL cluster and 
write the results to a PostgreSQL database. This process is triggered every 
5 minutes for the latest 5 minutes of data. 80% of the data can be written to 
the database with a simple COPY, which is the fastest solution we found for 
bulk insertion. But for some data, duplicates are possible (though very 
unusual), and the new data must replace the old data in the database. I'm 
looking for the fastest solution to do this upsert.


About the hardware:

The PostgreSQL database runs on a KVM virtual machine, configured with 
8GB of RAM and 4 cores of an L5640 CPU. The hypervisor has two 7.2k 
standard SAS disks in Linux software RAID 1. The disks are shared by 
VMs, but obviously this PostgreSQL VM doesn't share its hypervisor with 
any other write-intensive VM.


Also, this database is dedicated to storing the data produced by the 
process, so I'm completely free in its configuration and tuning. I also 
plan to add a replicated slave database for read operations, and maybe 
partition the data, if needed.


If I summarize your solutions:

 - Adding an ORDER BY to my initial query can help the planner 
use the index.
 - Temporary tables, with a COPY of the new data into the temporary table and 
a merge of the data (you proposed different ways to do the merge).
 - Using an EXISTS clause in the delete (but this was not recommended by another 
reply).
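As a sketch, the temporary-table variant could look roughly like this (the staging-table name, the ON COMMIT DROP clause, and the file name are my assumptions, not something proposed verbatim in the replies):

```sql
begin;

-- staging table shaped like the target; dropped automatically at commit
create temp table staging (like stats_5mn) on commit drop;

-- bulk-load the new batch with COPY (the fast path mentioned above)
\copy staging (t_value, t_record, output_id, count) from 'batch.data'

-- replace the count for triplets that already exist in the target
update stats_5mn s
   set count = st.count
  from staging st
 where (s.t_value, s.t_record, s.output_id)
     = (st.t_value, st.t_record, st.output_id);

-- insert the rows that are genuinely new
insert into stats_5mn (t_value, t_record, output_id, count)
select st.t_value, st.t_record, st.output_id, st.count
  from staging st
 where not exists (select 1
                     from stats_5mn s
                    where (s.t_value, s.t_record, s.output_id)
                        = (st.t_value, st.t_record, st.output_id));
commit;
```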


I'll try your ideas this week, and I'll give you results.

Antoine.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Duplicate deletion optimizations

2012-01-08 Thread Pierre C

That's almost identical to my tables.


You explained your problem very well ;)


I certainly will. Many thanks for those great lines of SQL!


You're welcome!
Strangely, I didn't receive the mail I posted to the list (I received yours  
though).




Re: [PERFORM] Duplicate deletion optimizations

2012-01-08 Thread Strange, John W
Are your stats updated on the table after you added the index?

- run the bad query with explain verbose (you should send this anyway)
- check the difference between the expected rows and the actual rows
- make sure that your work_mem is high enough if you are sorting; if not, you'll 
see the sort write out a temp file, which will be slow
- if there is a difference, analyze the table and rerun the query to see if you 
get the expected results
- I do believe having COUNT(*) > 1 will never use an index, but someone more 
experienced can comment here.
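Spelled out against the table from the original post, those checks might look like the following (the work_mem value is just an illustrative example):

```sql
-- refresh planner statistics after creating the index
analyze stats_5mn;

-- rerun with full instrumentation; compare the rows=... estimates against
-- the actual row counts, and watch for "external merge  Disk: ..." sorts
explain (analyze, verbose, buffers)
select min(id) from stats_5mn
group by t_value, t_record, output_id
having count(*) > 1;

show work_mem;              -- current sort memory budget
set work_mem = '256MB';     -- session-local bump, to let the sort fit in memory
```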


-Original Message-
From: pgsql-performance-ow...@postgresql.org 
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of anto...@inaps.org
Sent: Friday, January 06, 2012 8:36 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Duplicate deletion optimizations

Hello,

I've a table with approximately 50 million rows with a schema like
this:

 id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
 t_value integer NOT NULL DEFAULT 0,
 t_record integer NOT NULL DEFAULT 0,
 output_id integer NOT NULL DEFAULT 0,
 count bigint NOT NULL DEFAULT 0,
 CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)

Every 5 minutes, a process has to insert a few thousand rows into this table, 
but sometimes the process has to insert an already existing row (based on the 
values of the triplet (t_value, t_record, output_id)). In this case, the row 
must be updated with the new count value. I've tried some solutions given in 
this Stack Overflow question [1] but the insertion rate is always too low for my 
needs.

So, I've decided to do it in two steps:

  - I insert all my new data with a COPY command
  - When it's done, I run a delete query to remove the oldest duplicates

Right now, my delete query looks like this:

 SELECT min(id) FROM stats_5mn
 GROUP BY t_value, t_record, output_id
 HAVING count(*) > 1;

The duration of the query on my test machine with approx. 16 million rows is 
~18s.
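Presumably this grouping query is used as the subquery of the actual delete; a sketch of the full statement (my reconstruction, not the original post's exact SQL):

```sql
-- remove the oldest row (lowest id) of each duplicated triplet
delete from stats_5mn
 where id in (select min(id)
                from stats_5mn
               group by t_value, t_record, output_id
              having count(*) > 1);
```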

To reduce this duration, I've tried to add an index on my triplet:

 CREATE INDEX test
   ON stats_5mn
   USING btree
   (t_value , t_record , output_id );

By default, the PostgreSQL planner doesn't want to use my index and does a 
sequential scan [2], but if I force it with SET enable_seqscan = off, the 
index is used [3] and the query duration drops to ~5s.


My questions:

  - Why does the planner refuse to use my index?
  - Is there a better method for my problem?


Thanks in advance for your help,
Antoine Millet.


[1]
http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql
 
http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different

[2] http://explain.depesz.com/s/UzW :
 GroupAggregate  (cost=1167282.380..1294947.770 rows=762182 width=20)
   (actual time=20067.661..20067.661 rows=0 loops=1)
   Filter: (five(*) > 1)
   ->  Sort  (cost=1167282.380..1186336.910 rows=7621814 width=20)
       (actual time=15663.549..17463.458 rows=7621805 loops=1)
         Sort Key: delta, kilo, four
         Sort Method:  external merge  Disk: 223512kB
         ->  Seq Scan on three  (cost=0.000..139734.140 rows=7621814 width=20)
             (actual time=0.041..2093.434 rows=7621805 loops=1)

[3] http://explain.depesz.com/s/o9P :

 GroupAggregate  (cost=0.000..11531349.190 rows=762182 width=20)
   (actual time=5307.734..5307.734 rows=0 loops=1)
   Filter: (five(*) > 1)
   ->  Index Scan using charlie on three  (cost=0.000..11422738.330
       rows=7621814 width=20) (actual time=0.046..2062.952 rows=7621805 loops=1)



Re: [PERFORM] Duplicate deletion optimizations

2012-01-08 Thread Jochen Erwied
Saturday, January 7, 2012, 1:21:02 PM you wrote:

   where t_imp.id is null and test.id=t_imp.id;
   =
   where t_imp.id is not null and test.id=t_imp.id;

You're right, I overlooked that one. But the increase in query execution time
is - maybe not completely - surprisingly minimal.

Because the query updating the id-column of t_imp fetches all rows from
test to be updated, they are already cached, and the second query is run 
completely from cache. I suppose you will get a severe performance hit when 
the table cannot be cached...

I ran the loop again, after 30 minutes I'm at about 3-5 seconds per loop,
as long as the server isn't doing something else. Under load it's at about
10-20 seconds, with a ratio of 40% updates, 60% inserts.

 and a partial index on matching rows might help (should be tested):

  (after the first update)
  create index t_imp_ix on t_imp(t_value,t_record,output_id) where t_imp.id is 
 not null;

I don't think this will help much since t_imp is scanned sequentially
anyway, so creating an index is just unneeded overhead.

-- 
Jochen Erwied |   home: joc...@erwied.eu +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: j...@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erw...@vodafone.de   +49-173-5404164


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Duplicate deletion optimizations

2012-01-07 Thread Jochen Erwied
Friday, January 6, 2012, 4:21:06 PM you wrote:

 Every 5 minutes, a process has to insert a few thousand rows into this
 table, but sometimes the process has to insert an already existing row
 (based on the values of the triplet (t_value, t_record, output_id)). In this
 case, the row must be updated with the new count value. I've tried some
 solutions given in this Stack Overflow question [1] but the insertion rate
 is always too low for my needs.

I did check the following in a loop, starting with an empty table, and
inserting/updating 5 random unique entries. After 15 minutes I've got
about 10 million records, each loop takes about 3 seconds. After 30 minutes
the table contains approx. 18 million entries, time per loop only slightly
increased. After 90 minutes the database has about 30 million entries. The
speed has dropped to about 15-20 seconds per loop, but the server is doing
lots of other queries in parallel, so with an unloaded server the updates
should still take less than 10 seconds.

The generator runs in perl, and generates records for a maximum of 100 
million different entries:

use strict;

srand time;
my $i = 0;
open FD, ">data.in";
for (1..5)
{
$i += rand(2000);
print FD sprintf("%d\t%d\t%d\t%d\n", $i/65536, ($i/256)%255, $i%255, 
rand(1000));
}
close FD;

The SQL-script looks like this:

\timing on
begin;
create temp table t_imp(id bigint,t_value integer,t_record integer,output_id 
integer,count bigint);
\copy t_imp (t_value, t_record, output_id, count) from 'data.in'
--an index is not really needed, table is in memory anyway
--create index t_imp_ix on t_imp(t_value,t_record,output_id);

-- find matching rows
update t_imp
   set id=test.id
   from test
   where 
(t_imp.t_value,t_imp.t_record,t_imp.output_id)=(test.t_value,test.t_record,test.output_id);
-- update matching rows using primary key
update test
   set count=t_imp.count
   from t_imp
   where t_imp.id is null and test.id=t_imp.id;
-- insert missing rows
insert into test(t_value,t_record,output_id,count)
   select t_value,t_record,output_id,count
  from t_imp
  where id is null;
commit;

Advantages of this solution:

- all updates are done in-place, no index modifications (except for the 
  inserts, of course)
- big table only gets inserts
- no dead tuples from deletes
- completely avoids sequential scans on the big table

Tested on my home server (8GB RAM, 3GB shared memory, Dual-Xeon 5110, 1.6 
GHz, table and indices stored on a SSD)

Table statistics:

relid | 14332525
schemaname| public
relname   | test
seq_scan  | 8
seq_tup_read  | 111541821
idx_scan  | 149240169
idx_tup_fetch | 117901695
n_tup_ins | 30280175
n_tup_upd | 0
n_tup_del | 0
n_tup_hot_upd | 0
n_live_tup| 30264431
n_dead_tup| 0
last_vacuum   |
last_autovacuum   |
last_analyze  |
last_autoanalyze  | 2012-01-07 12:38:49.593651+01
vacuum_count  | 0
autovacuum_count  | 0
analyze_count | 0
autoanalyze_count | 31

The sequential scans were from some 'select count(*)' in between.

HTH.

-- 
Jochen Erwied |   home: joc...@erwied.eu +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: j...@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erw...@vodafone.de   +49-173-5404164




Re: [PERFORM] Duplicate deletion optimizations

2012-01-07 Thread Pierre C



It's a fairly tricky problem. I have a number of sensors producing
energy data about every 5 minutes, but at random times between 1 and
15 minutes. I can't change that as that's the way the hardware of the
sensors works. These feed into another unit, which accumulates them
and forwards them in batches over the Internet to my PostgreSQL
database server every few minutes (again at random times outside my
control and with random batch sizes). To make things worse, if the
Internet connection between the unit and the database server fails, it
will send the latest data first to provide a quick update to the
current values and then send the backlog of stored values. Thus, data
do not always arrive in correct time order.


I'm stuck home with flu, so I'm happy to help ;)

I'll build an example setup to make it clearer...

-- A list of all sensors
create table sensors( sensor_id integer primary key );
insert into sensors select generate_series(1,100);

-- A table to contain raw sensor data
create table log(
  sensor_id integer not null references sensors(sensor_id),
  time integer not null,
  value float not null
);

-- Fill it up with test data
insert into log
select sensor_id, time, time from (
  select distinct sensor_id,
(n+random()*10)::INTEGER as time
  from generate_series(0,50000,5) n
   cross join sensors
) d;

-- index it
alter table log add primary key( time, sensor_id );
create index log_sensor_time on log( sensor_id, time );

select * from log where sensor_id=1 order by time;
 sensor_id | time  | value
---+---+---
 1 |12 |12
 1 |14 |14
 1 |21 |21
 1 |29 |29
 1 |30 |30
 (...)
 1 | 49996 | 49996
 1 | 50001 | 50001

-- create a table which will contain the time ticks
-- which will be used as x-axis for interpolation
-- (in this example, one tick every 10 time units)

create table ticks( time integer primary key,
   check( time%10 = 0 ) );
insert into ticks select
  generate_series( 0, (select max(time) from log), 10 );

-- create interpolated values table
create table interp(
  sensor_id integer not null references sensors( sensor_id ),
  time  integer not null references ticks( time ),
  value float,
  distance  integer not null
);

-- fill interpolated values table
-- (pretty slow)

insert into interp
select
sensor_id,
t.time,
start_value +  
(end_value-start_value)*(t.time-start_time)/(end_time-start_time),

greatest( t.time - start_time, end_time-t.time )
  from
(select
  sensor_id,
  lag(time) over (partition by sensor_id order by time) as start_time,
  time as end_time,
  lag(value) over (partition by sensor_id order by time) as  
start_value,

  value as end_value
from log
) as l
  join ticks t on (t.time >= start_time and t.time < end_time);

-- alternate query if you don't like the ticks table (same result):
insert into interp
select
sensor_id,
time,
start_value +  
(end_value-start_value)*(time-start_time)/(end_time-start_time),

greatest( time - start_time, end_time-time )
  from
  (select
*,
generate_series( ((start_time+9)/10)*10, ((end_time-1)/10)*10, 10 ) AS  
time

from
  (select
sensor_id,
lag(time) over (partition by sensor_id order by time) as  
start_time,

time as end_time,
lag(value) over (partition by sensor_id order by time) as  
start_value,

value as end_value
  from log
  ) as l
) l;

alter table interp add primary key( time,sensor_id );
create index interp_sensor_time on interp( sensor_id, time );

For each interval in the log table that contains a time tick, this query  
generates the interpolated data at that tick.


Note that the distance field represents the distance (in time) between  
the interpolated value and the farthest real data point that was used to  
calculate it. Therefore, it can be used as a measure of the quality of the  
interpolated point; if the distance is greater than some threshold, the  
value might not be that precise.
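As a usage sketch, a consumer could filter on this distance to skip dubious points (the threshold of 30 time units is an arbitrary assumption):

```sql
-- keep only interpolated points whose supporting real samples are close in time
select sensor_id, time, value
  from interp
 where distance <= 30
 order by sensor_id, time;
```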


Now, suppose we receive a bunch of data. The data isn't ordered according  
to time.

There are two possibilities :

- the new data starts right where we left off (ie, just after the last  
time for each sensor in table log)
- the new data starts later in time, and we want to process the results  
right away, expecting to receive, at some later point, older data to fill  
the holes


The second one is hairier, so let's do that.

Anyway, let's create a packet:

-- A table to contain raw sensor data
create temporary table packet(
  sensor_id integer not null,
  time integer not null,
  value float not null
);

-- Fill it up with test data
insert into packet
select sensor_id, time, time from (
  select distinct sensor_id,
(n+random()*10)::INTEGER as time
  from generate_series(50200,50400) n
   cross join sensors
) d;

Note that I deliberately inserted a hole: the log table 

Re: [PERFORM] Duplicate deletion optimizations

2012-01-07 Thread Marc Mamin
Yes, but it should become a bit slower if you fix your code :-)

  where t_imp.id is null and test.id=t_imp.id;
  =
  where t_imp.id is not null and test.id=t_imp.id;

and a partial index on matching rows might help (should be tested):

 (after the first update)
 create index t_imp_ix on t_imp(t_value,t_record,output_id) where t_imp.id is 
not null;

regards,
Marc Mamin

-Original Message-
From: pgsql-performance-ow...@postgresql.org on behalf of Jochen Erwied
Sent: Sat 1/7/2012 12:57
To: anto...@inaps.org
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Duplicate deletion optimizations
 



Re: [PERFORM] Duplicate deletion optimizations

2012-01-07 Thread Misa Simic
If a solution with a temp table is acceptable - I think the steps could be
reduced:

• copy to temp_imp (the temp table does not have an id column)

• update live set count = temp_imp.count from temp_imp using (col1,col2,col3)

• insert into live from temp_imp where (col1, col2, col3) does not exist in
live
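A sketch of those reduced steps in actual SQL (the table and column names are placeholders; the live table's key is assumed to be (col1, col2, col3)):

```sql
begin;
create temp table temp_imp
  (col1 integer, col2 integer, col3 integer, count bigint);
\copy temp_imp (col1, col2, col3, count) from 'data.in'

-- step 2: overwrite counts of triplets that already exist
update live
   set count = temp_imp.count
  from temp_imp
 where (live.col1, live.col2, live.col3)
     = (temp_imp.col1, temp_imp.col2, temp_imp.col3);

-- step 3: insert the triplets not yet present
insert into live (col1, col2, col3, count)
select t.col1, t.col2, t.col3, t.count
  from temp_imp t
 where not exists (select 1 from live l
                    where (l.col1, l.col2, l.col3) = (t.col1, t.col2, t.col3));
commit;
```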

Kind Regards,

Misa

Sent from my Windows Phone
From: Jochen Erwied
Sent: 07/01/2012 12:58
To: anto...@inaps.org
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Duplicate deletion optimizations


Re: [PERFORM] Duplicate deletion optimizations

2012-01-07 Thread Jochen Erwied
Saturday, January 7, 2012, 3:02:10 PM you wrote:

 • insert into live from temp where col1, col2 and col3 not exists in
 live

'not exists' is something I'm trying to avoid, even if the optimizer is 
able to handle it. 

-- 
Jochen Erwied |   home: joc...@erwied.eu +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: j...@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erw...@vodafone.de   +49-173-5404164





Re: [PERFORM] Duplicate deletion optimizations

2012-01-07 Thread Jeff Janes
On Fri, Jan 6, 2012 at 6:35 AM,  anto...@inaps.org wrote:
 Hello,

 I've a table with approximately 50 million rows with a schema like this:

    id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
    t_value integer NOT NULL DEFAULT 0,
    t_record integer NOT NULL DEFAULT 0,
    output_id integer NOT NULL DEFAULT 0,
    count bigint NOT NULL DEFAULT 0,
    CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)

 Every 5 minutes, a process has to insert a few thousand rows into this
 table, but sometimes the process has to insert an already existing row (based
 on the values of the triplet (t_value, t_record, output_id)). In this case, the
 row must be updated with the new count value. I've tried some solutions given
 in this Stack Overflow question [1] but the insertion rate is always too low
 for my needs.

What are your needs?  It should take no special hardware or coding to
be able to manage a few thousand rows over 5 minutes.


 So, I've decided to do it in two times:

  - I insert all my new data with a COPY command
  - When it's done, I run a delete query to remove oldest duplicates

 Right now, my delete query look like this:

    SELECT min(id) FROM stats_5mn
    GROUP BY t_value, t_record, output_id
     HAVING count(*) > 1;

 The duration of the query on my test machine with approx. 16 million rows is
 ~18s.

 To reduce this duration, I've tried to add an index on my triplet:

    CREATE INDEX test
      ON stats_5mn
      USING btree
      (t_value , t_record , output_id );

 By default, the PostgreSQL planner doesn't want to use my index and do a
 sequential
 scan [2], but if I force it with SET enable_seqscan = off, the index is
 used [3]
 and query duration is lowered to ~5s.


 My questions:

  - Why does the planner refuse to use my index?

It thinks that using the index will be about 9 times more expensive
than the full scan.  Probably your settings for seq_page_cost and
random_page_cost are such that the planner thinks that nearly every
buffer read is going to be from disk.  But in reality (in this case)
your data is all in memory.  So the planner is mis-estimating.  (It
would help verify this if you did your EXPLAIN ANALYZE with BUFFERS as
well).  But before trying to fix this by tweaking settings, will the
real case always be like your test case?  If the data stops being all
in memory, either because the problem size increases or because you
have to compete for buffer space with other things going on, then
using the index scan could be catastrophic.
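The checks above can be tried directly; for example (the cost values below are only illustrative starting points, to be tuned against your own hardware):

```sql
-- Show buffer hits vs. disk reads to see whether the data really is in memory:
EXPLAIN (ANALYZE, BUFFERS)
SELECT min(id) FROM stats_5mn
GROUP BY t_value, t_record, output_id
HAVING count(*) > 1;

-- If nearly every page is a buffer hit, telling the planner that random
-- page access is cheap may make it choose the index on its own:
SET random_page_cost = 1.1;        -- default is 4.0; illustrative value
SET effective_cache_size = '6GB';  -- roughly the RAM available for caching
```

If the buffer counters show mostly reads rather than hits, leave the cost settings alone: the planner's caution is then justified.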

Cheers,

Jeff

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Duplicate deletion optimizations

2012-01-07 Thread Misa Simic
It was not a query... Just a statement that some index values in one table
do not exist in another...

So query could be with:
• WHERE (col1,col2,col3) NOT IN
• WHERE NOT EXISTS
• LEFT JOIN live USING (col1,col2,col3) WHERE live.id IS NULL

whatever anyone prefers, or whatever gives better results... But I
think which is better is more a matter of personal feeling than a real
difference...
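Spelled out, and assuming a staging table `temp` and a target table `live` keyed on (col1, col2, col3) as in the message above, the three variants could look like this (a sketch only):

```sql
-- 1) NOT IN (beware: behaves surprisingly if any of the columns can be NULL)
INSERT INTO live
SELECT * FROM temp
WHERE (col1, col2, col3) NOT IN (SELECT col1, col2, col3 FROM live);

-- 2) NOT EXISTS
INSERT INTO live
SELECT * FROM temp t
WHERE NOT EXISTS (
    SELECT 1 FROM live l
    WHERE l.col1 = t.col1 AND l.col2 = t.col2 AND l.col3 = t.col3
);

-- 3) LEFT JOIN anti-join
INSERT INTO live
SELECT t.* FROM temp t
LEFT JOIN live l USING (col1, col2, col3)
WHERE l.id IS NULL;
```

All three express the same anti-join; the planner often produces similar plans for 2) and 3), while 1) can degrade badly with NULLs or large subquery results.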

Sent from my Windows Phone
From: Jochen Erwied
Sent: 07/01/2012 15:18
To: Misa Simic
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Duplicate deletion optimizations
Saturday, January 7, 2012, 3:02:10 PM you wrote:

 • insert into live from temp where col1, col2 and col3 not exists in
 live

'not exists' is something I'm trying to avoid, even if the optimizer is
able to handle it.

-- 
Jochen Erwied |   home: joc...@erwied.eu +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: j...@mbs-software.de  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erw...@vodafone.de   +49-173-5404164



Re: [PERFORM] Duplicate deletion optimizations

2012-01-07 Thread Marc Eberhard
Hi Pierre!

On 7 January 2012 12:20, Pierre C li...@peufeu.com wrote:
 I'm stuck home with flu, so I'm happy to help ;)
[...]
 I'll build an example setup to make it clearer...
[...]

That's almost identical to my tables. :-)

 Note that the distance field represents the distance (in time) between the
 interpolated value and the farthest real data point that was used to
 calculate it. Therefore, it can be used as a measure of the quality of the
 interpolated point ; if the distance is greater than some threshold, the
 value might not be that precise.

Nice idea!

 Although this query is huge, it's very fast, since it doesn't hit the big
 tables with any seq scans (hence the max() and min() tricks to use the
 indexes instead).

And it can easily be tamed by putting parts of it into stored pl/pgsql functions.

 I love how postgres can blast that huge pile of SQL in, like, 50 ms...

Yes, indeed. It's incredibly fast. Brilliant!

 If there is some overlap between packet data and data already in the log,
 you might get some division by zero errors, in this case you'll need to
 apply a DISTINCT somewhere (or simply replace the UNION ALL with a UNION,
 which might be wiser anyway...)

I do have a unique constraint on the actual table to prevent duplicate
data in case of retransmission after a failed connect. It's easy
enough to delete the rows from packet that already exist in the main
table with a short one line SQL delete statement before the
interpolation and merge.

 Tada.

:-)

 Enjoy !

I certainly will. Many thanks for those great lines of SQL!

Hope you recover from your flu quickly!

All the best,
Marc



Re: [PERFORM] Duplicate deletion optimizations

2012-01-06 Thread antoine

On Fri, 06 Jan 2012 15:35:36 +0100, anto...@inaps.org wrote:

Hello,

I've a table with approximately 50 million rows with a schema like 
this:


id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
t_value integer NOT NULL DEFAULT 0,
t_record integer NOT NULL DEFAULT 0,
output_id integer NOT NULL DEFAULT 0,
count bigint NOT NULL DEFAULT 0,
CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)

Every 5 minutes, a process has to insert a few thousand rows in
this table,
but sometimes the process has to insert an already existing row
(based on
values in the triplet (t_value, t_record, output_id)). In this case,
the row
must be updated with the new count value. I've tried some solutions
given on this
stackoverflow question [1] but the insertion rate is always too low
for my needs.

So, I've decided to do it in two steps:

 - I insert all my new data with a COPY command
 - When it's done, I run a delete query to remove oldest duplicates

Right now, my delete query looks like this:

SELECT min(id) FROM stats_5mn
GROUP BY t_value, t_record, output_id
HAVING count(*) > 1;


Correction:

DELETE FROM stats_5mn WHERE id IN (
SELECT min(id) FROM stats_5mn
GROUP BY t_value, t_record, output_id
HAVING count(*) > 1
);

Sorry :-)



The duration of the query on my test machine with approx. 16 million
rows is ~18s.

To reduce this duration, I've tried to add an index on my triplet:

CREATE INDEX test
  ON stats_5mn
  USING btree
  (t_value , t_record , output_id );

By default, the PostgreSQL planner doesn't want to use my index and
does a sequential
scan [2], but if I force it with SET enable_seqscan = off, the
index is used [3]
and query duration is lowered to ~5s.


My questions:

 - Why does the planner refuse to use my index?
 - Is there a better method for my problem?


Thanks in advance for your help,
Antoine Millet.


[1]

http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql


http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different

[2] http://explain.depesz.com/s/UzW :
GroupAggregate  (cost=1167282.380..1294947.770 rows=762182
width=20) (actual time=20067.661..20067.661 rows=0 loops=1)
Filter: (five(*) > 1)
  ->  Sort  (cost=1167282.380..1186336.910 rows=7621814 width=20)
(actual time=15663.549..17463.458 rows=7621805 loops=1)
  Sort Key: delta, kilo, four
  Sort Method:  external merge  Disk: 223512kB
    ->  Seq Scan on three  (cost=0.000..139734.140
rows=7621814 width=20) (actual time=0.041..2093.434 rows=7621805
loops=1)

[3] http://explain.depesz.com/s/o9P :
GroupAggregate  (cost=0.000..11531349.190 rows=762182 width=20)
(actual time=5307.734..5307.734 rows=0 loops=1)
Filter: (five(*) > 1)
  ->  Index Scan using charlie on three
(cost=0.000..11422738.330 rows=7621814 width=20) (actual
time=0.046..2062.952 rows=7621805 loops=1)





Re: [PERFORM] Duplicate deletion optimizations

2012-01-06 Thread Samuel Gendler
On Fri, Jan 6, 2012 at 6:35 AM, anto...@inaps.org wrote:

 Hello,

 I've a table with approximately 50 million rows with a schema like this:

    id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
t_value integer NOT NULL DEFAULT 0,
t_record integer NOT NULL DEFAULT 0,
output_id integer NOT NULL DEFAULT 0,
count bigint NOT NULL DEFAULT 0,
CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)

 Every 5 minutes, a process has to insert a few thousand rows in this
 table,
 but sometimes the process has to insert an already existing row (based on
 values in the triplet (t_value, t_record, output_id)). In this case, the row
 must be updated with the new count value. I've tried some solutions given
 on this
 stackoverflow question [1] but the insertion rate is always too low for my
 needs.

 So, I've decided to do it in two steps:

  - I insert all my new data with a COPY command
  - When it's done, I run a delete query to remove oldest duplicates

 Right now, my delete query looks like this:

SELECT min(id) FROM stats_5mn
GROUP BY t_value, t_record, output_id
    HAVING count(*) > 1;

 The duration of the query on my test machine with approx. 16 million rows
 is ~18s.


Have you considered doing the insert by doing a bulk insert into a temp
table and then pulling rows that don't exist across to the final table in
one query and updating rows that do exist in another query?  I did a very
brief scan of the SO thread and didn't see it suggested.  Something like
this:

update stats_5mn set count = count + t.count
from temp_table t
where stats_5mn.t_value = t.t_value and stats_5mn.t_record = t.t_record and
stats_5mn.output_id = t.output_id;

insert into stats_5mn
select * from temp_table t
where not exists (
select 1 from stats_5mn s
where s.t_value = t.t_value and s.t_record = t.t_record and s.output_id =
t.output_id
);

drop table temp_table;

Note - you must do the update before the insert because doing it the other
way around will cause every row you just inserted to also be updated.

I'm not sure it'd be markedly faster, but you'd at least be able to retain
a unique constraint on the triplet, if desired.  And, to my eye, the logic
is easier to comprehend.  The different query structure may make better use
of your index, but I imagine that it is not using it currently because your
db isn't configured to accurately reflect the real cost of index use vs
sequential scan, so it is incorrectly determining the cost of looking up
7.5 million rows.  Its estimate of the row count is correct, so the
estimate of the cost must be the problem.  We'd need to know more about
your current config and hardware specs to be able to even start making
suggestions about config changes to correct the problem.
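Putting the two queries above into one complete batch cycle, the whole upsert might look roughly like this (a sketch only; the table and column names follow the thread, and the COPY source path is hypothetical):

```sql
BEGIN;

-- Staging table vanishes automatically at COMMIT
CREATE TEMP TABLE temp_table (LIKE stats_5mn INCLUDING DEFAULTS)
    ON COMMIT DROP;

-- Bulk-load the new 5-minute batch (hypothetical server-side path)
COPY temp_table (t_value, t_record, output_id, count)
    FROM '/path/to/batch.csv' WITH CSV;

-- 1) Update rows that already exist (must come first)
UPDATE stats_5mn SET count = stats_5mn.count + t.count
FROM temp_table t
WHERE stats_5mn.t_value  = t.t_value
  AND stats_5mn.t_record = t.t_record
  AND stats_5mn.output_id = t.output_id;

-- 2) Insert rows that don't exist yet
INSERT INTO stats_5mn (t_value, t_record, output_id, count)
SELECT t.t_value, t.t_record, t.output_id, t.count
FROM temp_table t
WHERE NOT EXISTS (
    SELECT 1 FROM stats_5mn s
    WHERE s.t_value  = t.t_value
      AND s.t_record = t.t_record
      AND s.output_id = t.output_id
);

COMMIT;
```

Wrapping both statements in one transaction also addresses the atomicity concern raised further down the thread: either the whole batch lands, or none of it does.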


Re: [PERFORM] Duplicate deletion optimizations

2012-01-06 Thread Marc Eberhard
Hi Samuel!

On 6 January 2012 20:02, Samuel Gendler sgend...@ideasculptor.com wrote:
 Have you considered doing the insert by doing a bulk insert into a temp
 table and then pulling rows that don't exist across to the final table in
 one query and updating rows that do exist in another query?  I did a very
 brief scan of the SO thread and didn't see it suggested.  Something like
 this:

 update stats_5mn set count = count + t.count
 from temp_table t
 where stats_5mn.t_value = t.t_value and stats_5mn.t_record = t.t_record and
 stats_5mn.output_id = t.output_id;

 insert into stats_5mn
 select * from temp_table t
 where not exists (
 select 1 from stats_5mn s
 where s.t_value = t.t_value and s.t_record = t.t_record and s.output_id =
 t.output_id
 );

 drop table temp_table;

Am I right to assume that the update/insert needs to be placed into a
begin / end transaction block if such batch uploads might happen
concurrently? Doesn't seem to be the case for this question here, but
I like the solution and wonder if it works under more general
circumstances.

What's the overhead of creating and dropping a temporary table? Is it
only worth doing this for a large number of inserted/updated elements?
What if the number of inserts/updates is only a dozen at a time for a
large table (10M entries)?

Thanks,
Marc



Re: [PERFORM] Duplicate deletion optimizations

2012-01-06 Thread Samuel Gendler
On Fri, Jan 6, 2012 at 12:22 PM, Marc Eberhard eberhar...@googlemail.com wrote:

 Hi Samuel!

 On 6 January 2012 20:02, Samuel Gendler sgend...@ideasculptor.com wrote:
  Have you considered doing the insert by doing a bulk insert into a temp
  table and then pulling rows that don't exist across to the final table in
  one query and updating rows that do exist in another query?  I did a very
  brief scan of the SO thread and didn't see it suggested.  Something like
  this:
 
  update stats_5mn set count = count + t.count
  from temp_table t
  where stats_5mn.t_value = t.t_value and stats_5mn.t_record = t.t_record and
  stats_5mn.output_id = t.output_id;
 
  insert into stats_5mn
  select * from temp_table t
  where not exists (
  select 1 from stats_5mn s
  where s.t_value = t.t_value and s.t_record = t.t_record and s.output_id =
  t.output_id
  );
 
  drop table temp_table;

 Am I right to assume that the update/insert needs to be placed into a
 begin / end transaction block if such batch uploads might happen
 concurrently? Doesn't seem to be the case for this question here, but
 I like the solution and wonder if it works under more general
 circumstances.


yes, assuming you are concerned about making the insertion atomic.
 Obviously, a failure in the second query after success in the 1st query
would be problematic outside of a transaction, since any attempt to repeat
the entire operation would result in repeated updates.


 What's the overhead of creating and dropping a temporary table? Is it
 only worth doing this for a large number of inserted/updated elements?
 What if the number of inserts/updates is only a dozen at a time for a
 large table (10M entries)?


pretty minimal, but enough that doing a handful of rows at a time probably
wouldn't be worth it.  You'd surely get index usage on a plain insert in
such a case, so I'd probably just use an upsert stored proc for doing small
numbers of rows - unless you are doing large numbers of inserts, just a few
at a time.  In that case, I'd try to accumulate them and then do them in
bulk.  Those are tough questions to answer without a specific context.  My
real answer is 'try it and see.'  You'll always get an answer that is
specific to your exact circumstance that way.

By the way, there is definitely a difference between creating a temp table
and creating a table temporarily.  See the postgres docs about temp tables
for specifics, but many databases treat temp tables differently from
ordinary tables, so it is worth understanding what those differences are.
 Temp tables are automatically dropped when a connection (or transaction)
is closed.  Temp table names are local to the connection, so multiple
connections can each create a temp table with the same name without
conflict, which is convenient. I believe they are also created in a
specific tablespace on disk, etc.
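The automatic cleanup mentioned above can also be requested explicitly at creation time; for instance (a sketch, reusing the thread's staging-table idea):

```sql
-- ON COMMIT DROP ties the staging table's lifetime to the transaction;
-- ON COMMIT DELETE ROWS would instead keep the table but truncate it.
BEGIN;
CREATE TEMP TABLE temp_table (
    t_value   integer NOT NULL,
    t_record  integer NOT NULL,
    output_id integer NOT NULL,
    count     bigint  NOT NULL
) ON COMMIT DROP;
-- ... COPY into temp_table, then merge into stats_5mn ...
COMMIT;  -- temp_table no longer exists after this point
```

Because temp table names are connection-local, several batch loaders can run this concurrently without name clashes.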


Re: [PERFORM] Duplicate deletion optimizations

2012-01-06 Thread Marc Eberhard
On 6 January 2012 20:38, Samuel Gendler sgend...@ideasculptor.com wrote:
 On Fri, Jan 6, 2012 at 12:22 PM, Marc Eberhard eberhar...@googlemail.com
 wrote:
 On 6 January 2012 20:02, Samuel Gendler sgend...@ideasculptor.com wrote:
  Have you considered doing the insert by doing a bulk insert into a temp
  table and then pulling rows that don't exist across to the final table
  in
  one query and updating rows that do exist in another query?  I did a
  very
  brief scan of the SO thread and didn't see it suggested.  Something like
  this:
 
  update stats_5mn set count = count + t.count
  from temp_table t
  where stats_5mn.t_value = t.t_value and stats_5mn.t_record = t.t_record and
  stats_5mn.output_id = t.output_id;
 
  insert into stats_5mn
  select * from temp_table t
  where not exists (
  select 1 from stats_5mn s
  where s.t_value = t.t_value and s.t_record = t.t_record and s.output_id
  =
  t.output_id
  );
 
  drop table temp_table;

 Am I right to assume that the update/insert needs to be placed into a
 begin / end transaction block if such batch uploads might happen
 concurrently? Doesn't seem to be the case for this question here, but
 I like the solution and wonder if it works under more general
 circumstances.


 yes, assuming you are concerned about making the insertion atomic.
  Obviously, a failure in the second query after success in the 1st query
 would be problematic outside of a transaction, since any attempt to repeat
 the entire operation would result in repeated updates.

True, but I was more concerned about concurrency, where a second
upsert inserts an element between update/insert from the first. That
would then skip the element in the first upsert as it is neither
updated (doesn't exist at that point in time) nor inserted (does
exist at that later point). Or would that be impossible anyway?
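One conservative way to rule that race out (my own suggestion; the thread doesn't settle the question) is to serialise concurrent batch writers with an explicit lock:

```sql
BEGIN;
-- Blocks other sessions' INSERT/UPDATE/DELETE on the table until COMMIT,
-- and is self-exclusive, so two batches cannot interleave; plain
-- SELECTs are unaffected.
LOCK TABLE stats_5mn IN SHARE ROW EXCLUSIVE MODE;
-- ... UPDATE existing rows, then INSERT missing rows from the staging table ...
COMMIT;  -- lock released
```

The price is that concurrent batches queue up behind each other, which is usually acceptable for a 5-minute loading job.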

 What's the overhead of creating and dropping a temporary table? Is it
 only worth doing this for a large number of inserted/updated elements?
 What if the number of inserts/updates is only a dozen at a time for a
 large table (10M entries)?

 pretty minimal, but enough that doing a handful of rows at a time probably
 wouldn't be worth it.  You'd surely get index usage on a plain insert in
 such a case, so I'd probably just use an upsert stored proc for doing small
 numbers of rows - unless you are doing large numbers of inserts, just a few
 at a time.  In that case, I'd try to accumulate them and then do them in
 bulk.  Those are tough questions to answer without a specific context.  My
 real answer is 'try it and see.'  You'll always get an answer that is
 specific to your exact circumstance that way.

It's a fairly tricky problem. I have a number of sensors producing
energy data about every 5 minutes, but at random times between 1 and
15 minutes. I can't change that as that's the way the hardware of the
sensors works. These feed into another unit, which accumulates them
and forwards them in batches over the Internet to my PostgreSQL
database server every few minutes (again at random times outside my
control and with random batch sizes). To make things worse, if the
Internet connection between the unit and the database server fails, it
will send the latest data first to provide a quick update to the
current values and then send the backlog of stored values. Thus, data
do not always arrive in correct time order.

At the moment I only look at the latest data for each sensor and these
should be as close to real time as possible. Thus, collecting data for
some time to get a larger size for a batch update isn't preferable.
What I want to do, and this is where the upsert problem starts, is to
build a table with energy values at fixed times. These should be
calculated as a linear interpolation between the nearest reported
values from the sensors. Please note each sensor is reporting a
measured energy value (not instant power), which always increases
monotonically with time. To compare the performance of the different
devices that are measured, I need to have the energy values at the
same time and not at the random times when the sensors report. This
also allows the calculation of average power for the devices by taking
the difference of the energy values over longer periods, like 30
minutes.

What I simply haven't got my head around is how to do this in an
efficient way. When new values arrive, the table of interpolated
values needs to be updated. For some times, there will already be
values in the table, but for other times there won't. Thus, the
upsert. If there was a communication failure, the reported sensor
times will go backwards as the most recent is transmitted first until
the backlog is cleared. In that case the interpolation table will be
populated with intermediate values from the first insert with the
latest timestamp and then these values will be refined by the backlog
data as they trickle in. Under normal circumstances, reported
timestamps will be monotonically increasing and the interpolation
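A linear interpolation at a fixed grid timestamp, between a sensor's nearest earlier and later readings, could be sketched roughly like this (table and column names are invented for illustration; `:sid` and `:grid_ts` stand for the sensor id and grid timestamp):

```sql
-- readings(sensor_id, ts, energy): hypothetical raw sensor table.
-- Returns no row if there is not yet a reading on both sides of :grid_ts.
SELECT b.energy
     + (a.energy - b.energy)
     * EXTRACT(EPOCH FROM (:grid_ts - b.ts))
     / EXTRACT(EPOCH FROM (a.ts - b.ts)) AS energy_interp
FROM (SELECT ts, energy FROM readings
      WHERE sensor_id = :sid AND ts <= :grid_ts
      ORDER BY ts DESC LIMIT 1) b,   -- nearest point at/before
     (SELECT ts, energy FROM readings
      WHERE sensor_id = :sid AND ts >  :grid_ts
      ORDER BY ts ASC  LIMIT 1) a;   -- nearest point after
```

With an index on (sensor_id, ts), each of the two subqueries is a cheap index lookup, so recomputing grid points as backlog data trickle in stays fast.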

Re: [PERFORM] Duplicate deletion optimizations

2012-01-06 Thread Misa Simic
hi,

Maybe these thoughts could help

1) Adding an ORDER BY on those three columns to your SELECT min query
could force index usage...

2) or

DELETE FROM table
WHERE EXISTS(SELECT id FROM table t WHERE t.id > table.id AND t.col1 =
table.col1 AND t.col2 = table.col2 AND t.col3 = table.col3)
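The first suggestion, applied to the query from the original post, would look like this (a sketch; whether the planner actually switches to the index still depends on its cost estimates):

```sql
-- The ORDER BY matches the column order of the three-column index,
-- which can make an index scan look cheaper than sort + seq scan.
SELECT min(id) FROM stats_5mn
GROUP BY t_value, t_record, output_id
HAVING count(*) > 1
ORDER BY t_value, t_record, output_id;
```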



Sent from my Windows Phone
From: anto...@inaps.org
Sent: 06/01/2012 15:36
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Duplicate deletion optimizations
Hello,

I've a table with approximately 50 million rows with a schema like
this:

 id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
 t_value integer NOT NULL DEFAULT 0,
 t_record integer NOT NULL DEFAULT 0,
 output_id integer NOT NULL DEFAULT 0,
 count bigint NOT NULL DEFAULT 0,
 CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)

Every 5 minutes, a process has to insert a few thousand rows in
this table,
but sometimes the process has to insert an already existing row (based
on
values in the triplet (t_value, t_record, output_id)). In this case, the
row
must be updated with the new count value. I've tried some solutions
given on this
stackoverflow question [1] but the insertion rate is always too low for
my needs.

So, I've decided to do it in two steps:

  - I insert all my new data with a COPY command
  - When it's done, I run a delete query to remove oldest duplicates

Right now, my delete query looks like this:

 SELECT min(id) FROM stats_5mn
 GROUP BY t_value, t_record, output_id
 HAVING count(*) > 1;

The duration of the query on my test machine with approx. 16 million
rows is ~18s.

To reduce this duration, I've tried to add an index on my triplet:

 CREATE INDEX test
   ON stats_5mn
   USING btree
   (t_value , t_record , output_id );

By default, the PostgreSQL planner doesn't want to use my index and does
a sequential
scan [2], but if I force it with SET enable_seqscan = off, the index
is used [3]
and query duration is lowered to ~5s.


My questions:

  - Why does the planner refuse to use my index?
  - Is there a better method for my problem?


Thanks in advance for your help,
Antoine Millet.


[1]
http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql

http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different

[2] http://explain.depesz.com/s/UzW :
 GroupAggregate  (cost=1167282.380..1294947.770 rows=762182
width=20) (actual time=20067.661..20067.661 rows=0 loops=1)
 Filter: (five(*) > 1)
   ->  Sort  (cost=1167282.380..1186336.910 rows=7621814 width=20)
(actual time=15663.549..17463.458 rows=7621805 loops=1)
   Sort Key: delta, kilo, four
   Sort Method:  external merge  Disk: 223512kB
     ->  Seq Scan on three  (cost=0.000..139734.140 rows=7621814
width=20) (actual time=0.041..2093.434 rows=7621805 loops=1)

[3] http://explain.depesz.com/s/o9P :
 GroupAggregate  (cost=0.000..11531349.190 rows=762182 width=20)
(actual time=5307.734..5307.734 rows=0 loops=1)
 Filter: (five(*) > 1)
   ->  Index Scan using charlie on three  (cost=0.000..11422738.330
rows=7621814 width=20) (actual time=0.046..2062.952 rows=7621805
loops=1)
