Re: [HACKERS] [PERFORM] Slow count(*) again...

2011-02-04 Thread Torsten Zühlsdorff

Mladen Gogala schrieb:

Well, the problem will not go away.  As I've said before, all other 
databases have that feature and none of the reasons listed here 
convinced me that everybody else has a crappy optimizer.  The problem 
may go away altogether if people stop using PostgreSQL.


A common problem among programmers is that they want a solution they 
already know for a problem they already know, even if it is the worst 
solution they could choose.


There are many ways to solve a given problem, and you even have time to 
evaluate them before your application gets released.


Also: if you rely so heavily on hints, then use a database which 
supports hints. A basic mantra in every training I have given is: use 
the tool/technique/people that best fit the needs of the project. 
There are many databases out there - for every project, choose the one 
which fits best!


Greetings from Germany,
Torsten
--
http://www.dddbl.de - a database layer which abstracts the work with 8 
different database systems, separates queries from the applications and 
can automatically evaluate the query results.


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Using more than one index per table

2010-07-24 Thread Torsten Zühlsdorff

Craig James schrieb:


The problem is that Google ranks pages based on inbound links, so
older versions of Postgres *always* come up before the latest version
in page ranking.


Since 2009 you can deal with this by defining the canonical version.
(http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html) 



This is a really cool feature, but it's not what we need.  The 
canonical refers to the URL, not the web page.  It's only supposed to 
be used if you have multiple URLs that are actually the *same* page; the 
canonical URL tells Google to use only this URL for this page.


But in our case, the Postgres manuals for each release have different 
URLs *and* different content, so the canonical URL isn't the right 
solution.


This is true, but the content is allowed to change a little. Of course 
there is no published percentage of allowed change, but it can be quite 
large. I've used this feature for some clients who push their content to 
very different websites, and it does work.
Most of the documentation's content doesn't change much between 
releases. In most cases the canonical will work the way I suggest.


In the case of big changes, even the recommendation to use a "current" 
version won't work. It's true that Google ranks pages based on inbound 
links, but there are more than 200 other factors which influence the 
rankings. Most people do not know that changing most of a site's content 
makes its inbound links useless for a long time. After big changes in 
the documentation, the "current" entry will be dropped for some months 
and the old entries will appear instead. But note that every single page 
of the documentation is ranked on its own. From my experience I would 
expect the canonical approach to give better results than the "current" 
approach.


But the canonical is not the best solution in my opinion. I often edit 
the URLs of some documentation pages because I need them for a specific 
PostgreSQL version. The documentation clearly lacks a version switch. 
Combined with a big note that the displayed documentation is not for 
the current PostgreSQL version, that would be the best compromise in my 
opinion.


Greetings from Germany,
Torsten

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Using more than one index per table

2010-07-23 Thread Torsten Zühlsdorff

Craig James schrieb:


  A useful trick to know is that if you replace the version number
with current, you'll get to the latest version most of the time
(sometimes the name of the page is changed between versions, too, but
this isn't that frequent).


The docs pages could perhaps benefit from an auto-generated note saying:

The current version of Pg is 8.4. This documentation is for version
8.2. Click [here] for documentation on the current version.

... or something to that effect. It'd be a nice (and more user-friendly)
alternative to url twiddling when searches reveal docs for an old
version, and might help push the /current/ pages up in search rank too.


In addition, why not use symlinks so that the current version is simply 
called current, as in


   http://www.postgresql.org/docs/current/static/sql-insert.html

If you google for postgres insert, you get this:

  http://www.postgresql.org/docs/8.1/static/sql-insert.html

The problem is that Google ranks pages based on inbound links, so older 
versions of Postgres *always* come up before the latest version in page 
ranking.  


Since 2009 you can deal with this by defining the canonical version.
(http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html)


Greetings from Germany,
Torsten

--
http://www.dddbl.de - a database layer which abstracts the work with 8 
different database systems, separates queries from the applications and 
can automatically evaluate the query results.


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to insert a bulk of data with unique-violations very fast

2010-06-09 Thread Torsten Zühlsdorff

Pierre C schrieb:


Within the data to import, most rows have 20 to 50 duplicates. 
Sometimes many more, sometimes fewer.


In that case (source data has lots of redundancy), after importing the 
data chunks in parallel, you can run a first pass of de-duplication on 
the chunks, also in parallel, something like :


CREATE TEMP TABLE foo_1_dedup AS SELECT DISTINCT * FROM foo_1;

or you could compute some aggregates, counts, etc. Same as before, no 
WAL needed, and you can use all your cores in parallel.


 From what you say this should reduce the size of your imported data by 
a lot (and hence the time spent in the non-parallel operation).


Thank you very much for this advice. I've tried it in another project 
with similar import problems. It really sped up the import.
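
For anyone following along, here is a minimal sketch of the per-chunk flow as 
I applied it (the target table urls and its url column are from my original 
post, the chunk names follow Pierre's example, and the final insert step is my 
own assumption, run for one chunk at a time):

-- in each loading session, after COPYing the chunk into foo_1:
CREATE TEMP TABLE foo_1_dedup AS SELECT DISTINCT * FROM foo_1;

-- then, one chunk at a time (this part is not parallel):
INSERT INTO urls (url)
SELECT d.url
FROM foo_1_dedup d
WHERE NOT EXISTS (SELECT 1 FROM urls u WHERE u.url = d.url);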


Thanks everyone for your time and help!

Greetings,
Torsten
--
http://www.dddbl.de - a database layer which abstracts the work with 8 
different database systems, separates queries from the applications and 
can automatically evaluate the query results.


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to insert a bulk of data with unique-violations very fast

2010-06-07 Thread Torsten Zühlsdorff

Pierre C schrieb:

Since you have lots of data you can use parallel loading.

Split your data in several files and then do :

CREATE TEMPORARY TABLE loader1 ( ... )
COPY loader1 FROM ...

Use a TEMPORARY TABLE for this : you don't need crash-recovery since if 
something blows up, you can COPY it again... and it will be much faster 
because no WAL will be written.
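
A filled-in version of this skeleton could look roughly like the following 
(the single url column and the file path are placeholders for illustration 
only, not part of the original mail):

CREATE TEMPORARY TABLE loader1 (url text);
-- each COPY runs in its own session, one session per core and per chunk file
COPY loader1 (url) FROM '/tmp/urls_chunk_1.csv';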


That's good advice, thank you :)

If your disk is fast, COPY is cpu-bound, so if you can do 1 COPY process 
per core, and avoid writing WAL, it will scale.


This doesn't solve the other half of your problem (removing the 
duplicates) which isn't easy to parallelize, but it will make the COPY 
part a lot faster.


Note that you can have 1 core process the INSERT / removing duplicates 
while the others are handling COPY and filling temp tables, so if you 
pipeline it, you could save some time.


Does your data contain a lot of duplicates, or are they rare ? What 
percentage ?


Within the data to import, most rows have 20 to 50 duplicates. Sometimes 
many more, sometimes fewer.


But over 99.1% of the rows to import are already known. This percentage 
keeps growing, because there is a finite number of distinct rows I am 
interested in.


In my particular case I'm collecting domain names. So far it is purely a 
private interest and runs on normal PC hardware. I'm collecting them 
by crawling known sites and checking them for new hosts. Maybe later I 
will build an expired-domain service or a reverse-IP database or 
something like that. But for now I'm just interested in the connections 
between the sites and in the structure people choose for domain names.


(Before someone asks: so far I have nearly more rows than there are 
domains in existence, because I also collect subdomains of all levels 
and do not delete entries.)


Thanks everyone for your advice. This will help me a lot!

Greetings from Germany,
Torsten
--
http://www.dddbl.de - a database layer which abstracts the work with 8 
different database systems, separates queries from the applications and 
can automatically evaluate the query results.


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to insert a bulk of data with unique-violations very fast

2010-06-06 Thread Torsten Zühlsdorff

Cédric Villemain schrieb:


I think you need to have a look at pgloader. It does COPY with error
handling. Very effective.


Thanks for this advice. I will have a look at it.

Greetings from Germany,
Torsten

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How to insert a bulk of data with unique-violations very fast

2010-06-06 Thread Torsten Zühlsdorff

Scott Marlowe schrieb:


I have a set of unique data of about 150,000,000 rows. Regularly I get
a list of data which contains several times as many rows as the already
stored set, often around 2,000,000,000 rows. Within these rows are many
duplicates, and often the already stored data as well.
I want to store only the entries which are not within the already stored
set. I also do not want to store duplicates. Example:

The standard method in pgsql is to load the data into a temp table,
then insert where not exists in the old table.

Sorry, I didn't get it. I've googled some examples, but none match my
case. Every example I found was a single insert which should be done or
ignored if the row is already stored.

But in my case I have a bulk of rows with duplicates. Either your tip
doesn't match my case or I didn't understand it correctly. Can you provide a
simple example?


create table main (id int primary key, info text);
create table loader (id int, info text);
insert into main values (1,'abc'),(2,'def'),(3,'ghi');
insert into loader values (1,'abc'),(4,'xyz');
select * from main;
 id | info
----+------
  1 | abc
  2 | def
  3 | ghi
(3 rows)

select * from loader;
 id | info
----+------
  1 | abc
  4 | xyz
(2 rows)

insert into main select * from loader except select * from main;
select * from main;
 id | info
----+------
  1 | abc
  2 | def
  3 | ghi
  4 | xyz
(4 rows)

Note that for the where not exists to work the fields would need to be
all the same, or you'd need a more complex query.  If the info field
here were different you'd get an error and no insert / update.  For that
case you might want to use where not in:

insert into main select * from loader where id not in (select id from main);
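
For completeness, the "where not exists" form mentioned at the top would look
something like this (a sketch against the same main/loader tables):

insert into main
select l.id, l.info
from loader l
where not exists (select 1 from main m where m.id = l.id);

Like the not in form, it matches only on id, so a loader row whose id already
exists is simply skipped even if its info differs.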


Thank you very much for your example. Now I've got it :)

I've tested your example on a small set of my rows. While testing I 
stumbled over a difference between the SQL formulations. Using EXCEPT 
seems to be a little slower than the more complex WHERE ... NOT IN 
(subquery) with GROUP BY. Here is my example:


CREATE TABLE tseq (value text);
INSERT INTO tseq VALUES ('a') , ('b'), ('c');
CREATE UNIQUE INDEX tseq_unique on tseq (value);
CREATE TEMP TABLE tmpseq(value text);
INSERT INTO tmpseq VALUES ('a') , ('b'), ('c');
INSERT INTO tmpseq VALUES ('a') , ('b'), ('c');
INSERT INTO tmpseq VALUES ('a') , ('b'), ('d');
INSERT INTO tmpseq VALUES ('d') , ('b'), ('d');
SELECT * from tseq;
 value
-------
 a
 b
 c
(3 rows)

SELECT * from tmpseq;
 value
-------
 a
 b
 c
 a
 b
 c
 a
 b
 d
 d
 b
 d
(12 rows)

VACUUM VERBOSE ANALYSE;

explain analyze SELECT value FROM tmpseq except SELECT value FROM tseq;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 HashSetOp Except  (cost=0.00..2.34 rows=4 width=2) (actual time=0.157..0.158 rows=1 loops=1)
   ->  Append  (cost=0.00..2.30 rows=15 width=2) (actual time=0.012..0.126 rows=15 loops=1)
         ->  Subquery Scan "*SELECT* 1"  (cost=0.00..1.24 rows=12 width=2) (actual time=0.009..0.060 rows=12 loops=1)
               ->  Seq Scan on tmpseq  (cost=0.00..1.12 rows=12 width=2) (actual time=0.004..0.022 rows=12 loops=1)
         ->  Subquery Scan "*SELECT* 2"  (cost=0.00..1.06 rows=3 width=2) (actual time=0.006..0.018 rows=3 loops=1)
               ->  Seq Scan on tseq  (cost=0.00..1.03 rows=3 width=2) (actual time=0.003..0.009 rows=3 loops=1)
 Total runtime: 0.216 ms
(7 rows)

explain analyze SELECT value FROM tmpseq WHERE value NOT IN (SELECT value FROM tseq) GROUP BY value;
                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=2.20..2.22 rows=2 width=2) (actual time=0.053..0.055 rows=1 loops=1)
   ->  Seq Scan on tmpseq  (cost=1.04..2.19 rows=6 width=2) (actual time=0.038..0.043 rows=3 loops=1)
         Filter: (NOT (hashed SubPlan 1))
         SubPlan 1
           ->  Seq Scan on tseq  (cost=0.00..1.03 rows=3 width=2) (actual time=0.004..0.009 rows=3 loops=1)
 Total runtime: 0.105 ms
(6 rows)

My question: is this a general behavior, or just an effect of the small 
test case?


Greetings from Germany,
Torsten

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] How to insert a bulk of data with unique-violations very fast

2010-06-02 Thread Torsten Zühlsdorff

Hello,

I have a set of unique data of about 150,000,000 rows. Regularly I 
get a list of data which contains several times as many rows as the 
already stored set, often around 2,000,000,000 rows. Within these rows 
are many duplicates, and often the already stored data as well.
I want to store only the entries which are not within the already stored 
set. I also do not want to store duplicates. Example:


Already stored set:
a,b,c

Given set:
a,b,a,c,d,a,c,d,b

Expected set after import:
a,b,c,d

I am now looking for a faster way to do the import. At the moment I 
import the new data with COPY into a table 'import'. Then I remove the 
duplicates and insert every row which is not already known. After that, 
'import' is truncated.


Is there a faster way? Should I just insert every row and ignore it if 
the unique constraint fails?


Here is the simplified table schema; in real life it is partitioned:
test=# \d urls
                              Table "public.urls"
 Column |  Type   |                       Modifiers
--------+---------+--------------------------------------------------------
 url_id | integer | not null default nextval('urls_url_id_seq'::regclass)
 url    | text    | not null
Indexes:
    "urls_url" UNIQUE, btree (url)
    "urls_url_id" btree (url_id)
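
For clarity, the current import roughly does the following (a reconstruction 
of the description above; the file path is just a placeholder, and the 
staging table 'import' is assumed to have a single url column):

COPY import (url) FROM '/path/to/new_urls.csv';

INSERT INTO urls (url)
SELECT DISTINCT i.url
FROM import i
WHERE NOT EXISTS (SELECT 1 FROM urls u WHERE u.url = i.url);

TRUNCATE import;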

Thanks for every hint or advice! :)

Greetings from Germany,
Torsten
--
http://www.dddbl.de - a database layer which abstracts the work with 8 
different database systems, separates queries from the applications and 
can automatically evaluate the query results.


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Are folks running 8.4 in production environments? and 8.4 and slon 1.2?

2009-10-17 Thread Torsten Zühlsdorff

Tory M Blue schrieb:


Any issues, has it baked long enough, is it time for us 8.3 folks to deal
with the pain and upgrade?


I've upgraded all my databases to 8.4. The pain was not so big, and the 
new -j parameter for pg_restore is fantastic. I really like the new 
features around PL/pgSQL. Everything is stable and fast.


Greetings from Germany,
Torsten
--
http://www.dddbl.de - a database layer which abstracts the work with 8 
different database systems, separates queries from the applications and 
can automatically evaluate the query results.


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Why is vacuum_freeze_min_age 100m?

2009-08-12 Thread Torsten Zühlsdorff

Tom Lane schrieb:

Josh Berkus j...@agliodbs.com writes:

I've just been tweaking some autovac settings for a large database, and
came to wonder: why does vacuum_max_freeze_age default to such a high
number?  What's the logic behind that?


(1) not destroying potentially useful forensic evidence too soon;
(2) there's not really much to be gained by reducing it.


If there is not really much to be gained by changing the value, why not 
remove the parameter?
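
For anyone who would rather experiment with it than remove it, a minimal 
sketch (the table name is only a placeholder):

SHOW vacuum_freeze_min_age;   -- current default, measured in transactions

-- per-table override for autovacuum via the matching storage parameter:
ALTER TABLE some_big_table SET (autovacuum_freeze_min_age = 10000000);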


Greetings from Germany,
Torsten

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance