[PERFORM] faster INSERT with possible pre-existing row?

2005-07-26 Thread Dan Harris
I am working on a process that will be inserting tens of millions of
rows and need this to be as quick as possible.


The catch is that for each row I could potentially insert, I need to
look and see if the relationship is already there, to prevent multiple
entries. Currently I am doing a SELECT before doing the INSERT, but I
recognize the speed penalty in doing two operations. I wonder if there
is some way I can say "insert this record only if it doesn't exist
already". To see if it exists, I would need to compare 3 fields instead
of just enforcing a primary key.
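
A minimal sketch of that two-step pattern (table and column names here
are made up):

-- check first...
SELECT 1 FROM mytable WHERE a = 42 AND b = 'x' AND c = 7;
-- ...then insert only if the SELECT returned no row
INSERT INTO mytable (a, b, c) VALUES (42, 'x', 7);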


Even a small gain per record, just a few percent, compounded over the
whole load could be a significant reduction.


Thanks for any ideas you might have.

-Dan



Re: [PERFORM] faster INSERT with possible pre-existing row?

2005-07-26 Thread John A Meinel

Dan Harris wrote:
> I am working on a process that will be inserting tens of millions of
> rows and need this to be as quick as possible.
> [snip]
> I wonder if there is some way I can say "insert this record only if
> it doesn't exist already". To see if it exists, I would need to
> compare 3 fields instead of just enforcing a primary key.



You could insert all of your data into a temporary table, and then do:

INSERT INTO final_table
SELECT * FROM temp_table t
 WHERE NOT EXISTS
   (SELECT 1 FROM final_table f
     WHERE f.id = t.id AND f.path = t.path AND f.y = t.y);


Or you could load it into the temporary table, and then:
DELETE FROM temp_table WHERE EXISTS (SELECT 1 FROM final_table WHERE id...);

And then do a plain INSERT INTO.

I can't say what the specific performance increases would be, but 
temp_table could certainly be an actual TEMP table (meaning it only 
exists during the connection), and you could easily do a COPY into that 
table to load it up quickly, without having to check any constraints.
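
Putting those pieces together, a minimal sketch (assuming a hypothetical
final_table(id, path, y); if the input file can itself contain
duplicates, add DISTINCT to the SELECT):

CREATE TEMP TABLE temp_table AS SELECT * FROM final_table LIMIT 0;

COPY temp_table FROM '/path/to/data.copy';  -- hypothetical server-side file

INSERT INTO final_table
SELECT t.* FROM temp_table t
 WHERE NOT EXISTS
   (SELECT 1 FROM final_table f
     WHERE f.id = t.id AND f.path = t.path AND f.y = t.y);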


Just a thought,
John
=:-





Re: [PERFORM] faster INSERT with possible pre-existing row?

2005-07-26 Thread Luke Lonergan
John,

On 7/26/05 9:56 AM, John A Meinel [EMAIL PROTECTED] wrote:

> You could insert all of your data into a temporary table, and then do:
> INSERT INTO final_table SELECT * FROM temp_table WHERE NOT EXISTS (...);
> [snip]
> temp_table could certainly be an actual TEMP table (meaning it only
> exists during the connection), and you could easily do a COPY into that
> table to load it up quickly, without having to check any constraints.

Yah - that's a typical approach, and it would be excellent if the COPY
bypassed WAL for the temp table load.  This is something we discussed in
bizgres development a while back.  I think we should do this for sure -
it would nearly double the temp table load rate, and the subsequent temp
table delete *should* be fast enough (?)  Any performance tests you've
done on that delete/subselect operation?

- Luke





Re: [PERFORM] faster INSERT with possible pre-existing row?

2005-07-26 Thread Sven Willenberger
On Tue, 2005-07-26 at 10:50 -0600, Dan Harris wrote:
> I am working on a process that will be inserting tens of millions of
> rows and need this to be as quick as possible.
> [snip]
> I wonder if there is some way I can say "insert this record only if
> it doesn't exist already". To see if it exists, I would need to
> compare 3 fields instead of just enforcing a primary key.

Perhaps a trigger:

CREATE FUNCTION verify_unique() RETURNS TRIGGER AS $func$
BEGIN
    -- look for an existing row with the same three fields
    PERFORM a, b, c FROM table1 WHERE a = NEW.a AND b = NEW.b AND c = NEW.c;
    IF FOUND THEN
        RETURN NULL;  -- returning NULL from a BEFORE trigger skips the insert
    END IF;
    RETURN NEW;
END;
$func$ LANGUAGE plpgsql STABLE;

CREATE TRIGGER verify_unique BEFORE INSERT ON table1 FOR EACH ROW
EXECUTE PROCEDURE verify_unique();
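
With the trigger in place, a duplicate insert is silently skipped rather
than raising an error; for example (assuming integer columns):

INSERT INTO table1 (a, b, c) VALUES (1, 2, 3);  -- inserted
INSERT INTO table1 (a, b, c) VALUES (1, 2, 3);  -- skipped; trigger returned NULL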

Triggers are fired on COPY commands too, so if table1 can stay cached
and you have an index on table1(a,b,c), the results should be fairly
decent. I would be interested in seeing the difference in timing between
this approach and the temp table approach.




Re: [PERFORM] faster INSERT with possible pre-existing row?

2005-07-26 Thread Matthew Nuzum
On 7/26/05, Dan Harris [EMAIL PROTECTED] wrote:
> I am working on a process that will be inserting tens of millions of
> rows and need this to be as quick as possible.
> [snip]
> To see if it exists, I would need to compare 3 fields instead of just
> enforcing a primary key.

I struggled with this for a while. At first I tried stored procedures
and triggers, but it took very long (over 24 hours for my dataset).
After several iterations of rewriting it, first into C# and then into
Python, I got the whole process down to under 30 min.

My scenario is this:
I want to normalize log data. For example, for the IP address in a log
entry, I need to look up the unique id of the IP address, or if the IP
address is new, insert it and then return the newly created entry.
Multiple processes use the data, but only one process, run daily,
actually changes it. Because this one process knows that the data is
static, it selects the tables into in-memory hash tables (C#) or
Dictionaries (Python) and then does the lookups there. It is *super*
fast, but it uses a *lot* of ram. ;-)

To limit the ram, I wrote a version of the python code that uses gdbm
files instead of dictionaries. This requires a newer version of Python
(one that lets a gdbm db work just like a dictionary) but makes life
easier in case someone is using my software on a lower-end machine.
This doubled the time of the lookups from about 15 minutes to 30,
bringing the whole process to about 45 minutes.
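
For comparison, the stored-procedure version of that lookup-or-insert
(the kind of thing that proved too slow here) might look roughly like
this; the names are made up and the original code isn't shown:

CREATE FUNCTION get_ip_id(p_add text) RETURNS integer AS $func$
DECLARE
    v_id integer;
BEGIN
    SELECT ip_id INTO v_id FROM ipaddress WHERE add = p_add;
    IF NOT FOUND THEN
        -- new address: insert it, then fetch the generated id
        INSERT INTO ipaddress (add) VALUES (p_add);
        SELECT ip_id INTO v_id FROM ipaddress WHERE add = p_add;
    END IF;
    RETURN v_id;
END;
$func$ LANGUAGE plpgsql;

Called once per log row, the per-call and per-query overhead adds up,
which is why batching the lookups (in memory or in a temp table) wins.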

-- 
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] faster INSERT with possible pre-existing row?

2005-07-26 Thread John A Meinel

Matthew Nuzum wrote:

> On 7/26/05, Dan Harris [EMAIL PROTECTED] wrote:
>> I am working on a process that will be inserting tens of millions of
>> rows and need this to be as quick as possible.
>> [snip]
>
> I struggled with this for a while. At first I tried stored procedures
> and triggers, but it took very long (over 24 hours for my dataset).
> After several iterations of rewriting it, first into C# and then into
> Python, I got the whole process down to under 30 min.
> [snip]
> This doubled the time of the lookups from about 15 minutes to 30,
> bringing the whole process to about 45 minutes.



Did you ever try the temp table approach? You could:

COPY all records into temp_table, leaving the ip_id column empty, then:

-- Fill in ids for addresses which already exist
UPDATE temp_table SET ip_id =
    (SELECT ip_id FROM ipaddress WHERE ipaddress.add = temp_table.add)
 WHERE EXISTS (SELECT 1 FROM ipaddress WHERE ipaddress.add = temp_table.add);
-- Create new entries (DISTINCT guards against repeats within temp_table)
INSERT INTO ipaddress(add)
SELECT DISTINCT add FROM temp_table WHERE ip_id IS NULL;
-- Fill in ids for the rest
UPDATE temp_table SET ip_id =
    (SELECT ip_id FROM ipaddress WHERE ipaddress.add = temp_table.add)
 WHERE ip_id IS NULL;

This would let the database do all of the updating work in bulk on its
side, rather than you pulling all the data out and doing it locally.


An alternative would be something like:

CREATE TEMP TABLE new_ids (address text, ip_id int);
-- COPY all potentially new addresses into that table, then:

-- Delete all entries which already exist
DELETE FROM new_ids WHERE EXISTS
    (SELECT 1 FROM ipaddresses
      WHERE ipaddresses.add = new_ids.address);
-- Now create the new entries (DISTINCT guards against repeats)
INSERT INTO ipaddresses(add) SELECT DISTINCT address FROM new_ids;

-- At this point you are guaranteed to have all addresses existing in
-- the database

If you then insert your full data into the final table, leaving only the
ip_id column null, and you have a partial index covering the rows where
ip_id IS NULL, you could use the command:


UPDATE final_table SET ip_id =
(SELECT ip_id FROM ipaddresses WHERE add=final_table.add)
WHERE ip_id IS NULL;
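
The partial index mentioned above might be declared like this (a sketch,
with a made-up name):

CREATE INDEX final_table_needs_ip ON final_table (add)
    WHERE ip_id IS NULL;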

You could also do this in a temporary table, before bulk inserting into 
the final table.


I don't know what you have tried, but I know that Dan easily has 36M
rows. So I don't think he wants to pull all of that locally and build an
in-memory hash just to insert 100 rows or so.


Also, for your situation, if you do keep a local cache, you could 
certainly save the cache between runs, and use a temp table to determine 
what new ids you need to add to it. Then you wouldn't have to pull the 
complete set each time. You just pull new values for entries you haven't 
added yet.
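
A sketch of that incremental pull (made-up names; the dump of the local
cache is hypothetical):

CREATE TEMP TABLE cached (add text);
COPY cached FROM '/path/to/cache-dump';  -- addresses already cached locally
SELECT i.ip_id, i.add
  FROM ipaddresses i
 WHERE NOT EXISTS (SELECT 1 FROM cached c WHERE c.add = i.add);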


John
=:-




Re: [PERFORM] faster INSERT with possible pre-existing row?

2005-07-26 Thread Mark Lewis
Easier and faster than doing the custom trigger is to simply define a
unique index and let the DB enforce the constraint with an index lookup,
something like:

create unique index happy_index ON happy_table(col1, col2, col3);

That should run faster than the custom trigger, but not as fast as the
temp table solution suggested elsewhere because it will need to do an
index lookup for each row.  With this solution, it is important that
your shared_buffers are set high enough that the happy_index can be kept
in memory, otherwise performance will drop precipitously.  Also, if you
are increasing the size of the table by a large percentage, you will
want to ANALYZE periodically, as an optimal plan for a small table may
be a disaster for a large table, and PostgreSQL won't switch plans
unless you run ANALYZE.
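
One caveat: with only the unique index, a duplicate INSERT raises an
error and aborts the current transaction. A sketch of one way to absorb
that (made-up names, integer columns assumed; plpgsql exception trapping
needs 8.0 or later):

CREATE FUNCTION insert_if_new(v1 int, v2 int, v3 int) RETURNS boolean AS $func$
BEGIN
    INSERT INTO happy_table (col1, col2, col3) VALUES (v1, v2, v3);
    RETURN true;       -- inserted
EXCEPTION WHEN unique_violation THEN
    RETURN false;      -- duplicate; ignored
END;
$func$ LANGUAGE plpgsql;

The exception block sets up a savepoint on every call, so this is slower
than a clean bulk load, but one duplicate won't abort the whole batch.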

-- Mark

On Tue, 2005-07-26 at 14:51 -0500, John A Meinel wrote:
> Did you ever try the temp table approach? You could:
> [snip]
> This would let the database do all of the updating work in bulk on its
> side, rather than you pulling all the data out and doing it locally.
> [snip]
> John



Re: [PERFORM] faster INSERT with possible pre-existing row?

2005-07-26 Thread Christopher Kings-Lynne
Insert into a temp table, then use INSERT INTO ... SELECT FROM to copy
into the proper table only the rows that don't already have a matching
relationship.


Chris

Dan Harris wrote:
> I am working on a process that will be inserting tens of millions of
> rows and need this to be as quick as possible.
> [snip]
> Thanks for any ideas you might have.
>
> -Dan


