Re: [PERFORM] Deleting certain duplicates

2004-04-20 Thread Greg Sabino Mullane

 
 So I was thinking maybe of doing the deletion in chunks, perhaps based on
 reception time.
 Are there any suggestions for a better way to do this, such as using multiple
 queries to delete selectively a week at a time based on the reception_time?
 I would say there are a lot of duplicate entries between mid-March and the
 first week of April.
 
You are on the right track, in that dividing up the table will help. However,
you cannot divide on reception_time: that is the one column that differs
between otherwise-duplicate rows, so splitting on it could put duplicates into
different chunks. Analyze your data and divide on a column with a fairly
uniform distribution over the time period in question. Then copy a segment
out, clean it up, and put it back in. Make sure there is an index on the
column in question, of course.
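 
To find such a column, it may help to look at the value distribution first;
a quick sketch (note that this scans the whole table):
 
SELECT units, count(*) AS n
FROM forecastelement
GROUP BY units
ORDER BY n DESC;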
 
For example, if about 1/10 of the table has a units value of '12', you could
do something like this:
 
CREATE INDEX units_dev ON forecastelement (units);
 
CREATE TEMPORARY TABLE units_temp AS SELECT * FROM forecastelement WHERE units='12';
 
CREATE INDEX units_oid_index ON units_temp(oid);
 
(Delete the duplicate rows from units_temp, using your previous query or something else.)
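 
For instance, your earlier min(oid) query should be workable here, scoped to
the much smaller temp table; a sketch, with the column list from your schema:
 
DELETE FROM units_temp WHERE oid NOT IN
  (SELECT min(oid) FROM units_temp GROUP BY
   version,origin,timezone,region_id,wx_element,value,flag,units,
   valid_time,issue_time,next_forecast);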
 
DELETE FROM forecastelement WHERE units='12';
 
INSERT INTO forecastelement SELECT * FROM units_temp;
 
DELETE FROM units_temp;
 
Repeat as needed until all rows are done. Subsequent runs can be done by doing a
 
INSERT INTO units_temp SELECT * FROM forecastelement WHERE units='...'
 
and skipping the CREATE INDEX steps.
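 
In other words, a later chunk might look like this ('6' is just a
hypothetical second units value):
 
INSERT INTO units_temp SELECT * FROM forecastelement WHERE units='6';
(dedupe units_temp as before)
DELETE FROM forecastelement WHERE units='6';
INSERT INTO forecastelement SELECT * FROM units_temp;
DELETE FROM units_temp;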
 
On the other hand, your original deletion query may work as is, with the
addition of an oid index. Perhaps try an EXPLAIN on it first.
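 
Something along these lines, for example (the index name here is just a
suggestion, and the GROUP BY list is the one from your original query):
 
CREATE INDEX forecastelement_oid_idx ON forecastelement (oid);
 
EXPLAIN DELETE FROM forecastelement WHERE oid NOT IN
  (SELECT min(oid) FROM forecastelement GROUP BY
   version,origin,timezone,region_id,wx_element,value,flag,units,
   valid_time,issue_time,next_forecast);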
 
--
Greg Sabino Mullane [EMAIL PROTECTED]
PGP Key: 0x14964AC8 200404200706
 





Re: [PERFORM] Deleting certain duplicates

2004-04-13 Thread Rajesh Kumar Mallah
Shea,Dan [CIS] wrote:

The index is
Indexes:
   forecastelement_rwv_idx btree (region_id, wx_element, valid_time)
-----Original Message-----
From: Shea,Dan [CIS] [mailto:[EMAIL PROTECTED]
Sent: Monday, April 12, 2004 10:39 AM
To: Postgres Performance
Subject: [PERFORM] Deleting certain duplicates
We have a large database which recently increased dramatically due to a
change in our insert program allowing all entries.
PWFPM_DEV=# select relname,relfilenode,reltuples from pg_class where relname
= 'forecastelement';
     relname     | relfilenode |  reltuples
-----------------+-------------+-------------
 forecastelement |   361747866 | 4.70567e+08

     Column     |            Type             | Modifiers
----------------+-----------------------------+-----------
 version        | character varying(99)       |
 origin         | character varying(10)       |
 timezone       | character varying(99)       |
 region_id      | character varying(20)       |
 wx_element     | character varying(99)       |
 value          | character varying(99)       |
 flag           | character(3)                |
 units          | character varying(99)       |
 valid_time     | timestamp without time zone |
 issue_time     | timestamp without time zone |
 next_forecast  | timestamp without time zone |
 reception_time | timestamp without time zone |
The program is supposed to check, via a select statement, that all fields but
the reception_time are unique, and only then insert the row. Due to an error
in a change, reception_time was included in the select that checks for
duplicates. The reception_time is set by the program that creates the .dat
file to be inserted.
Essentially, this allowed every duplicate entry to be inserted.

I tried the delete query below.
PWFPM_DEV=# delete from forecastelement where oid not in (select min(oid)
from forecastelement group by
version,origin,timezone,region_id,wx_element,value,flag,units,valid_time,issue_time,next_forecast);
It ran for 3 days, creating what I assume is an index in pgsql_tmp for the
GROUP BY.
The query ended up failing with "ERROR: write failed".
Well the long weekend is over and we do not have the luxury of trying this
again. 
So I was thinking maybe of doing the deletion in chunks, perhaps based on
reception time.
 

It's more of an SQL question, though.

To deduplicate on the basis of

version, origin, timezone, region_id, wx_element, value, flag, units,
valid_time, issue_time, next_forecast

you could do this:

begin work;
create temporary table temp_table as
  select distinct on
    (version, origin, timezone, region_id, wx_element, value, flag, units,
     valid_time, issue_time, next_forecast) *
  from forecastelement;
truncate table forecastelement;
drop index forecastelement_rwv_idx;
insert into forecastelement select * from temp_table;
commit;
create index forecastelement_rwv_idx on forecastelement (region_id, wx_element, valid_time);
analyze forecastelement;

Note that DISTINCT ON keeps only one row out of all the rows that share the
same values in the specified columns. Kindly go through the DISTINCT ON
documentation before trying the queries.
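
A minimal illustration of that behaviour (the table and values here are
made up):

create temporary table t (a int, b int);
insert into t values (1, 10);
insert into t values (1, 20);
insert into t values (2, 30);
-- returns one row for a=1 and one for a=2; without an ORDER BY it is
-- unspecified which of the two a=1 rows is kept
select distinct on (a) a, b from t;

If which duplicate survives matters, add an ORDER BY starting with the
DISTINCT ON columns, e.g. "select distinct on (a) a, b from t order by a, b;"
to keep the lowest b for each a.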

regds
mallah.
Are there any suggestions for a better way to do this, such as using multiple
queries to delete selectively a week at a time based on the reception_time?
I would say there are a lot of duplicate entries between mid-March and the
first week of April.


 





Re: [PERFORM] Deleting certain duplicates

2004-04-12 Thread Shea,Dan [CIS]
The index is
Indexes:
forecastelement_rwv_idx btree (region_id, wx_element, valid_time)

-----Original Message-----
From: Shea,Dan [CIS] [mailto:[EMAIL PROTECTED]
Sent: Monday, April 12, 2004 10:39 AM
To: Postgres Performance
Subject: [PERFORM] Deleting certain duplicates


We have a large database which recently increased dramatically due to a
change in our insert program allowing all entries.
PWFPM_DEV=# select relname,relfilenode,reltuples from pg_class where relname
= 'forecastelement';
     relname     | relfilenode |  reltuples
-----------------+-------------+-------------
 forecastelement |   361747866 | 4.70567e+08

     Column     |            Type             | Modifiers
----------------+-----------------------------+-----------
 version        | character varying(99)       |
 origin         | character varying(10)       |
 timezone       | character varying(99)       |
 region_id      | character varying(20)       |
 wx_element     | character varying(99)       |
 value          | character varying(99)       |
 flag           | character(3)                |
 units          | character varying(99)       |
 valid_time     | timestamp without time zone |
 issue_time     | timestamp without time zone |
 next_forecast  | timestamp without time zone |
 reception_time | timestamp without time zone |

The program is supposed to check, via a select statement, that all fields but
the reception_time are unique, and only then insert the row. Due to an error
in a change, reception_time was included in the select that checks for
duplicates. The reception_time is set by the program that creates the .dat
file to be inserted.
Essentially, this allowed every duplicate entry to be inserted.

I tried the delete query below.
PWFPM_DEV=# delete from forecastelement where oid not in (select min(oid)
from forecastelement group by
version,origin,timezone,region_id,wx_element,value,flag,units,valid_time,issue_time,next_forecast);
It ran for 3 days, creating what I assume is an index in pgsql_tmp for the
GROUP BY.
The query ended up failing with "ERROR: write failed".
Well the long weekend is over and we do not have the luxury of trying this
again. 
So I was thinking maybe of doing the deletion in chunks, perhaps based on
reception time.
Are there any suggestions for a better way to do this, such as using multiple
queries to delete selectively a week at a time based on the reception_time?
I would say there are a lot of duplicate entries between mid-March and the
first week of April.


