Re: [PERFORM] How filesystems matter with PostgreSQL

2010-06-06 Thread Craig Ringer

On 06/06/10 14:51, Ron Mayer wrote:

Jon Schewe wrote:


OK, so if I want the 15 minute speed, I need to give up safety (OK in
this case as this is just research testing), or see if I can tune
postgres better.


Depending on your app, one more possibility would be to see if you
can re-factor the application so it can do multiple writes in parallel
rather than waiting for each one to complete.   If I understand right,
then many transactions could potentially be handled by a single fsync.


By using a commit delay, yes (see postgresql.conf). You do open up the 
risk of losing transactions committed within the commit delay period, 
but you don't risk corruption the way you do by turning fsync off.
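
For reference, a minimal sketch of the settings involved (the values are 
purely illustrative, and the exact trade-off should be checked against 
postgresql.conf and the docs for your version):

-- per session or in postgresql.conf: report commits before the WAL flush;
-- a crash may lose the last few commits, but cannot corrupt the database
SET synchronous_commit = off;
-- postgresql.conf: let a single WAL flush cover several concurrent commits
-- commit_delay    = 1000   -- microseconds to wait before flushing
-- commit_siblings = 5      -- only wait if this many other transactions are active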


Sometimes you can also batch work into bigger transactions. The classic 
example here is the usual long stream of individual auto-committed 
INSERTs, which when wrapped in an explicit transaction can be vastly 
quicker.
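
As a rough sketch (the table and values are made up for illustration), the 
difference is one WAL flush per row versus one per batch:

begin;
insert into samples (t, v) values ('2010-06-06 08:00:00', 1);
insert into samples (t, v) values ('2010-06-06 08:00:01', 2);
-- ... thousands more single-row inserts ...
commit;  -- the whole batch shares a single commit/fsync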


--
Craig Ringer



Re: [PERFORM] How to insert a bulk of data with unique-violations very fast

2010-06-06 Thread Andy Colson

On 06/01/2010 10:03 AM, Torsten Zühlsdorff wrote:

Hello,

I have a set of unique data of about 150,000,000 rows. Regularly I get a
list of data that contains many times more rows than the already stored set,
often around 2,000,000,000 rows. Within these rows are many duplicates, and
often the complete set of already stored data.
I want to store only those entries which are not already stored, and I do
not want to store duplicates. Example:

Already stored set:
a,b,c

Given set:
a,b,a,c,d,a,c,d,b

Expected set after import:
a,b,c,d

I am now looking for a faster way to do the import. At the moment I import
the new data with COPY into a table 'import', then I remove the duplicates
and insert every row which is not already known. After that, 'import' is
truncated.

Is there a faster way? Should I just insert every row and ignore it if the
unique constraint fails?

Here is the simplified table schema; in real life it is partitioned:
test=# \d urls
                          Table "public.urls"
 Column |  Type   |                       Modifiers
--------+---------+--------------------------------------------------------
 url_id | integer | not null default nextval('urls_url_id_seq'::regclass)
 url    | text    | not null
Indexes:
    "urls_url" UNIQUE, btree (url)
    "urls_url_id" btree (url_id)

Thanks for every hint or advice! :)

Greetings from Germany,
Torsten


I do this with a stored procedure. I do not care about speed because my db is 
really small and I only insert a few records a month, so I don't know how fast 
this is, but here is my function:

CREATE FUNCTION addentry(idate timestamp without time zone, ilevel integer)
RETURNS character varying
AS $$
declare
    tmp integer;
begin
    insert into blood(adate, alevel) values (idate, ilevel);
    return 'ok';
exception
    when unique_violation then
        -- adate is already present: check whether the stored level matches
        select into tmp alevel from blood where adate = idate;
        if tmp <> ilevel then
            return idate || ' levels differ!';
        else
            return 'ok, already in table';
        end if;
end; $$
LANGUAGE plpgsql;


Use it like: select * from addentry('2010-06-06 8:00:00', 130);

I do an extra check that if the dates match, the levels match too, but you 
wouldn't have to. There is a unique index on adate.

-Andy




Re: [PERFORM] How to insert a bulk of data with unique-violations very fast

2010-06-06 Thread Torsten Zühlsdorff

Cédric Villemain wrote:


I think you need to have a look at pgloader. It does COPY with error
handling. Very effective.


Thanks for this advice. I will have a look at it.

Greetings from Germany,
Torsten



Re: [PERFORM] How to insert a bulk of data with unique-violations very fast

2010-06-06 Thread Torsten Zühlsdorff

Scott Marlowe wrote:


I have a set of unique data of about 150,000,000 rows. Regularly I get a
list of data that contains many times more rows than the already stored
set, often around 2,000,000,000 rows. Within these rows are many duplicates,
and often the complete set of already stored data.
I want to store only those entries which are not already stored, and I do
not want to store duplicates. Example:

The standard method in pgsql is to load the data into a temp table,
then insert where not exists in the old table.

Sorry, I didn't get it. I've googled some examples, but none of them matches
my case. Every example I found was a single insert which should be done or
ignored if the row is already stored.

But in my case I have a bulk of rows with duplicates. Either your tip doesn't
match my case or I didn't understand it correctly. Can you provide a simple
example?


create table main (id int primary key, info text);
create table loader (id int, info text);
insert into main values (1,'abc'),(2,'def'),(3,'ghi');
insert into loader values (1,'abc'),(4,'xyz');
select * from main;
 id | info
----+------
  1 | abc
  2 | def
  3 | ghi
(3 rows)

select * from loader;
 id | info
----+------
  1 | abc
  4 | xyz
(2 rows)

insert into main select * from loader except select * from main;
select * from main;
 id | info
----+------
  1 | abc
  2 | def
  3 | ghi
  4 | xyz
(4 rows)

Note that for the where not exists to work the fields would need to be
all the same, or you'd need a more complex query.  If the info field
here was different you'd get an error and no insert / update.  For that
case you might want to use where not in:

insert into main select * from loader where id not in (select id from main);
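
A NOT EXISTS formulation is a third option, equivalent here to the NOT IN
query (just a sketch using the same example tables):

insert into main
select l.* from loader l
where not exists (select 1 from main m where m.id = l.id);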


Thank you very much for your example. Now I've got it :)

I've tested your example on a small set of my rows. While testing I've 
stumbled over a difference in SQL formulation. Using except seems to be 
a little slower than the more complex where not in (subquery) group by. 
Here is my example:


CREATE TABLE tseq (value text);
INSERT INTO tseq VALUES ('a') , ('b'), ('c');
CREATE UNIQUE INDEX tseq_unique on tseq (value);
CREATE TEMP TABLE tmpseq(value text);
INSERT INTO tmpseq VALUES ('a') , ('b'), ('c');
INSERT INTO tmpseq VALUES ('a') , ('b'), ('c');
INSERT INTO tmpseq VALUES ('a') , ('b'), ('d');
INSERT INTO tmpseq VALUES ('d') , ('b'), ('d');
SELECT * FROM tseq;
 value
---
 a
 b
 c
(3 rows)

SELECT * FROM tmpseq;
 value
---
 a
 b
 c
 a
 b
 c
 a
 b
 d
 d
 b
 d
(12 rows)

VACUUM VERBOSE ANALYSE;

explain analyze SELECT value FROM tmpseq except SELECT value FROM tseq;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 HashSetOp Except  (cost=0.00..2.34 rows=4 width=2) (actual time=0.157..0.158 rows=1 loops=1)
   ->  Append  (cost=0.00..2.30 rows=15 width=2) (actual time=0.012..0.126 rows=15 loops=1)
         ->  Subquery Scan "*SELECT* 1"  (cost=0.00..1.24 rows=12 width=2) (actual time=0.009..0.060 rows=12 loops=1)
               ->  Seq Scan on tmpseq  (cost=0.00..1.12 rows=12 width=2) (actual time=0.004..0.022 rows=12 loops=1)
         ->  Subquery Scan "*SELECT* 2"  (cost=0.00..1.06 rows=3 width=2) (actual time=0.006..0.018 rows=3 loops=1)
               ->  Seq Scan on tseq  (cost=0.00..1.03 rows=3 width=2) (actual time=0.003..0.009 rows=3 loops=1)
 Total runtime: 0.216 ms
(7 rows)

explain analyze SELECT value FROM tmpseq WHERE value NOT IN (SELECT value FROM tseq) GROUP BY value;
                                            QUERY PLAN
------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=2.20..2.22 rows=2 width=2) (actual time=0.053..0.055 rows=1 loops=1)
   ->  Seq Scan on tmpseq  (cost=1.04..2.19 rows=6 width=2) (actual time=0.038..0.043 rows=3 loops=1)
         Filter: (NOT (hashed SubPlan 1))
         SubPlan 1
           ->  Seq Scan on tseq  (cost=0.00..1.03 rows=3 width=2) (actual time=0.004..0.009 rows=3 loops=1)
 Total runtime: 0.105 ms
(6 rows)

My question: is this general behavior or just an effect of the small test 
case?


Greetings from Germany,
Torsten



Re: [PERFORM] How to insert a bulk of data with unique-violations very fast

2010-06-06 Thread Scott Marlowe
On Sun, Jun 6, 2010 at 6:02 AM, Torsten Zühlsdorff
f...@meisterderspiele.de wrote:
 Scott Marlowe wrote:
 Thank you very much for your example. Now I've got it :)

 I've tested your example on a small set of my rows. While testing I've
 stumbled over a difference in SQL formulation. Using except seems to be a
 little slower than the more complex where not in (subquery) group by. Here
 is my example:

Yeah, to get a good idea you need a more realistic example.  Build
some tables with millions of rows using generate_series() and then
test against those.
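
For example, something along these lines (only a sketch; the table sizes and
the value pattern are made up):

-- a few million unique values standing in for the already stored data
CREATE TABLE tseq AS
  SELECT 'val_' || g AS value FROM generate_series(1, 5000000) g;
CREATE UNIQUE INDEX tseq_unique ON tseq (value);
-- a larger import set with plenty of duplicates and overlap
CREATE TEMP TABLE tmpseq AS
  SELECT 'val_' || (random() * 10000000)::int AS value
  FROM generate_series(1, 20000000);
VACUUM ANALYZE;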



Re: [PERFORM] How to insert a bulk of data with unique-violations very fast

2010-06-06 Thread Pierre C

Since you have lots of data you can use parallel loading.

Split your data into several files and then do:

CREATE TEMPORARY TABLE loader1 ( ... )
COPY loader1 FROM ...

Use a TEMPORARY TABLE for this: you don't need crash recovery, since if  
something blows up you can just COPY it again... and it will be much faster  
because no WAL will be written.


If your disks are fast, COPY is CPU-bound, so if you can do one COPY process  
per core and avoid writing WAL, it will scale.


This doesn't solve the other half of your problem (removing the  
duplicates), which isn't easy to parallelize, but it will make the COPY  
part a lot faster.


Note that you can have one core process the INSERT / duplicate removal  
while the others are handling COPY and filling temp tables, so if you  
pipeline it, you could save some time.
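
For the non-parallel half, each loader table could then be merged with
something like this (a sketch; it assumes the urls table from the original
post, and the merges should run one after another to avoid unique violations
between loader tables):

INSERT INTO urls (url)
SELECT DISTINCT l.url
FROM loader1 l
WHERE NOT EXISTS (SELECT 1 FROM urls u WHERE u.url = l.url);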


Does your data contain a lot of duplicates, or are they rare? What  
percentage?

