Re: [PERFORM] No hash join across partitioned tables?

2010-10-16 Thread Alvaro Herrera
Excerpts from Samuel Gendler's message of Sat Oct 16 02:35:46 -0300 2010:

 An issue with automatically analyzing the entire hierarchy is 'abstract'
 table definitions.  I've got a set of tables for storing the same data at
 different granularities of aggregation.  Within each granularity, I've got
 partitions, but because the set of columns is identical for each
 granularity, I've got an abstract table definition that is inherited by
 everything.  I don't need or want statistics kept on that table because I
 never query across the abstract table, only the parent table of each
 aggregation granularity.

Hmm, I think you'd be better served by using LIKE instead of regular
inheritance.
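
For illustration, a minimal sketch of the difference (table names are
hypothetical, not from the thread): LIKE copies only the column
definitions, so the template never participates in query planning or
statistics, while INHERITS creates an ongoing parent/child relationship.

CREATE TABLE measurement_template (
    ts    timestamptz NOT NULL,
    value numeric     NOT NULL
);

-- Inheritance: the child is included whenever the parent is queried.
CREATE TABLE measurement_hourly () INHERITS (measurement_template);

-- LIKE: a one-time copy of the column definitions, no lasting tie.
CREATE TABLE measurement_daily (LIKE measurement_template INCLUDING DEFAULTS);

The per-granularity parents could be created with LIKE from the
template, while their partitions would still use INHERITS.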

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



[PERFORM] Select count(*), the sequel

2010-10-16 Thread Mladen Gogala
There was some doubt as to the speed of doing select count(*) in
PostgreSQL and Oracle. To that end, I copied most of the Oracle table I
used before into Postgres. Although the copy wasn't complete, the
resulting table is already significantly larger than the table it was
copied from. The result still shows that Oracle is significantly faster:

Oracle result:

SQL> alter system flush buffer_cache;

System altered.

SQL> select /*+ full(NO) noparallel */ count(*) from ni_occurrence no;

  COUNT(*)
----------
 402062638

Elapsed: 00:03:16.45



Hints are necessary because the Oracle table is declared as parallel and
I didn't want the PK index to be used for counting. Oracle has a good
habit of using PKs for counting, if available.



SQL> select bytes/1048576 as MB
  2  from user_segments
  3  where segment_name='NI_OCCURRENCE';

        MB
----------
     35329

Elapsed: 00:00:00.85
SQL>

So, Oracle stores 402 million records in 35 GB and counts them in 3
minutes 16.45 seconds. The very same table was partially copied to
Postgres; the copy died with ORA-01555 "snapshot too old" sometime this
morning. I ran vacuumdb -f -z on the database after the copy completed
and the results are below.


mgogala=# select count(*) from ni_occurrence;
   count
-----------
 382400476
(1 row)

Time: 221716.466 ms
mgogala=#
mgogala=# select 221/60::real;
 ?column?
----------
     3.68
(1 row)

Time: 0.357 ms
mgogala=#
mgogala=# select pg_size_pretty(pg_table_size('ni_occurrence'));
 pg_size_pretty
----------------
 46 GB
(1 row)

Time: 0.420 ms
mgogala=#

The database wasn't restarted and no caches were flushed, so the
comparison was done with a serious advantage for PostgreSQL. Postgres
needed 3.68 minutes to complete the count, which is about the same as
Oracle, but still somewhat slower. Also, I am worried about the sizes:
the Postgres table is 11 GB larger than the original, despite having
less data. That was an unfair and unbalanced comparison, because
Oracle's cache was flushed and Oracle was artificially restrained to a
full table scan without the aid of parallelism. Here is the same query,
with no hints and autotrace on, which shows what happens if I turn the
hints off:


SQL> select count(*) from ni_occurrence no;

  COUNT(*)
----------
 402062638

Elapsed: 00:00:52.61

Execution Plan
----------------------------------------------------------
Plan hash value: 53476935

----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name                  | Rows  | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                       |     1 | 54001  (19)| 00:01:08 |        |      |            |
|   1 |  SORT AGGREGATE           |                       |     1 |            |          |        |      |            |
|   2 |   PX COORDINATOR          |                       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)    | :TQ1                  |     1 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE        |                       |     1 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR    |                       |   402M| 54001  (19)| 00:01:08 |  Q1,00 | PCWC |            |
|   6 |       INDEX FAST FULL SCAN| IDX_NI_OCCURRENCE_PID |   402M| 54001  (19)| 00:01:08 |  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------

It took just 52 seconds to count everything, but Oracle didn't even scan
the table: it scanned a unique index, in parallel. That is the
algorithmic advantage that forced me to restrict the execution plan with
hints. My conclusion is that the speed of the full scan is OK, about the
same as Oracle's. There are, however, three significant algorithmic
advantages on Oracle's side:


1) Oracle can use indexes to compute count(*).
2) Oracle can use parallelism.
3) Oracle can use indexes in combination with parallel processing.
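
On the PostgreSQL side, the usual workaround when an estimate is good
enough is to read the planner's statistics rather than scan anything; a
minimal sketch (this returns the row count as of the last
ANALYZE/autovacuum run, an approximation rather than an exact count):

select reltuples::bigint as approximate_rows
from pg_class
where relname = 'ni_occurrence';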



Here is the table description:

SQL> desc ni_occurrence
 Name                                      Null?    Type
 ----------------------------------------- -------- ------------------
 ID                                        NOT NULL NUMBER(22)
 PERMANENT_ID                              NOT NULL VARCHAR2(12)
 CALL_LETTERS                              NOT NULL VARCHAR2(5)
 AIRDATE                                   NOT NULL DATE
 DURATION                                  NOT NULL NUMBER(4)
 PROGRAM_TITLE                                      VARCHAR2(360)
 COST                                               NUMBER(15)
 ASSETID                                            NUMBER(12)
 MARKET_ID                                          NUMBER
 GMT_TIME                                           DATE
 ORIG_ST_OCC_ID                                     NUMBER
 EPISODE                                            VARCHAR2(450)
 IMPRESSIONS

Re: [PERFORM] UUID performance as primary key

2010-10-16 Thread Craig James

On 10/15/10 6:58 PM, Navkirat Singh wrote:

 I am interested in finding out the pros/cons of using UUID as a
 primary key field. My requirement states that UUID would be perfect
 in my case as I will be having many small databases which will link
 up to a global database using the UUID. Hence, the need for a unique
 key across all databases.


You left out one piece of information: How many keys per second do you need?

We put a sequence in the global database that all secondary databases use to 
get their IDs.  It means an extra connect/disconnect (a pooler can minimize 
this), so if you're issuing thousands of IDs per second, this isn't a good 
idea.  But for a small-ish number of IDs per second, it gets you the benefit of 
a universal ID without the size of the UUID field.
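
A minimal sketch of that arrangement using the contrib dblink module
(connection string and names are illustrative): a single sequence lives
in the global database, and each secondary database fetches its next ID
over dblink.

-- In the global database:
CREATE SEQUENCE global_id_seq;

-- From a secondary database:
SELECT id
FROM dblink('host=globalhost dbname=global',
            'SELECT nextval(''global_id_seq'')') AS t(id bigint);

This round trip is where the extra connect/disconnect cost comes from,
and why a pooler helps.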

Craig (the other one)



Re: [PERFORM] Select count(*), the sequel

2010-10-16 Thread Kenneth Marshall
Hi,

Interesting data points. The number of rows that you managed to
insert into PostgreSQL before Oracle gave up the ghost is 95%
of the rows in the Oracle version of the database. To count 5%
fewer rows, it took PostgreSQL 24 seconds longer. Adjusting
for the missing rows, that is 52 seconds longer for the entire table,
or 18% longer than the full table scan in Oracle. This seems to
be well within the table layout size differences, possibly due
to the fillfactor used -- not really bad at all. Now the timings
due to algorithm changes are interesting, as they indicate the room
for improvement available from such changes. A parallel sequential
full-table scan in PostgreSQL could provide the same speedup.
Currently that is not possible, but development continues
apace...

In fact, developing such features in PostgreSQL could end up
being less expensive long-term than licensing Oracle RAC. I think
the point that you have helped make is that PostgreSQL performs
very well for many use cases that have typically been relegated
to expensive commercial databases such as Oracle, DB2, ...

Regards,
Ken

On Sat, Oct 16, 2010 at 12:53:50PM -0400, Mladen Gogala wrote:
 There was some doubt as to the speed of doing select count(*) in
 PostgreSQL and Oracle. To that end, I copied most of the Oracle table I
 used before into Postgres. Although the copy wasn't complete, the
 resulting table is already significantly larger than the table it was
 copied from. The result still shows that Oracle is significantly faster:

 Oracle result:

 SQL> alter system flush buffer_cache;

 System altered.

 SQL> select /*+ full(NO) noparallel */ count(*) from ni_occurrence no;

   COUNT(*)
 ----------
  402062638

 Elapsed: 00:03:16.45


 Hints are necessary because the Oracle table is declared as parallel and
 I didn't want the PK index to be used for counting. Oracle has a good
 habit of using PKs for counting, if available.


 SQL> select bytes/1048576 as MB
   2  from user_segments
   3  where segment_name='NI_OCCURRENCE';

         MB
 ----------
      35329

 Elapsed: 00:00:00.85
 SQL>

 So, Oracle stores 402 million records in 35 GB and counts them in 3
 minutes 16.45 seconds. The very same table was partially copied to
 Postgres; the copy died with ORA-01555 "snapshot too old" sometime this
 morning. I ran vacuumdb -f -z on the database after the copy completed
 and the results are below.

 mgogala=# select count(*) from ni_occurrence;
    count
 -----------
  382400476
 (1 row)

 Time: 221716.466 ms
 mgogala=#
 mgogala=# select 221/60::real;
  ?column?
 ----------
      3.68
 (1 row)

 Time: 0.357 ms
 mgogala=#
 mgogala=# select pg_size_pretty(pg_table_size('ni_occurrence'));
  pg_size_pretty
 ----------------
  46 GB
 (1 row)

 Time: 0.420 ms
 mgogala=#

 The database wasn't restarted and no caches were flushed, so the
 comparison was done with a serious advantage for PostgreSQL. Postgres
 needed 3.68 minutes to complete the count, which is about the same as
 Oracle, but still somewhat slower. Also, I am worried about the sizes:
 the Postgres table is 11 GB larger than the original, despite having
 less data. That was an unfair and unbalanced comparison, because
 Oracle's cache was flushed and Oracle was artificially restrained to a
 full table scan without the aid of parallelism. Here is the same query,
 with no hints and autotrace on, which shows what happens if I turn the
 hints off:

 SQL> select count(*) from ni_occurrence no;

   COUNT(*)
 ----------
  402062638

 Elapsed: 00:00:52.61

 Execution Plan
 ----------------------------------------------------------
 Plan hash value: 53476935

 ----------------------------------------------------------------------------------------------------------------------
 | Id  | Operation                 | Name                  | Rows  | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
 ----------------------------------------------------------------------------------------------------------------------
 |   0 | SELECT STATEMENT          |                       |     1 | 54001  (19)| 00:01:08 |        |      |            |
 |   1 |  SORT AGGREGATE           |                       |     1 |            |          |        |      |            |
 |   2 |   PX COORDINATOR          |                       |       |            |          |        |      |            |
 |   3 |    PX SEND QC (RANDOM)    | :TQ1                  |     1 |            |          |  Q1,00 | P->S | QC (RAND)  |
 |   4 |     SORT AGGREGATE        |                       |     1 |            |          |  Q1,00 | PCWP |            |
 |   5 |      PX BLOCK ITERATOR    |                       |   402M| 54001  (19)| 00:01:08 |  Q1,00 | PCWC |            |
 |   6 |       INDEX FAST FULL SCAN| IDX_NI_OCCURRENCE_PID |   402M| 54001  (19)| 00:01:08 |  Q1,00 | PCWP |            |
 ----------------------------------------------------------------------------------------------------------------------

 It took just 52 seconds to count everything, but Oracle didn't 

Re: [PERFORM] UUID performance as primary key

2010-10-16 Thread Merlin Moncure
On Fri, Oct 15, 2010 at 10:59 PM, Craig Ringer
cr...@postnewspapers.com.au wrote:
 On 16/10/2010 9:58 AM, Navkirat Singh wrote:

 Hi Guys,

 I am interested in finding out the pros/cons of using UUID as a primary
 key field. My requirement states that UUID would be perfect in my case as I
 will be having many small databases which will link up to a global database
 using the UUID. Hence, the need for a unique key across all databases. It
 would be extremely helpful if someone could help me figure this out, as it
 is critical for my project.

 Pro: No need for (serverid,serverseq) pair primary keys or hacks with
 modulus based key generation. Doesn't set any pre-determined limit on how
 many servers/databases may be in a cluster.

 Con: Slower than modulo key generation approach, uses more storage. Foreign
 key relationships may be slower too.

 Overall, UUIDs seem to be a favoured approach. The other way people seem to
 do this is by assigning a unique instance id to each server/database out of
 a maximum n instances decided at setup time. Every key generation sequence
 increments by n whenever it generates a key, with an offset of the
 server/database id. That way, if n=100, server 1 will generate primary keys
 001, 101, 201, 301, ..., server 2 will generate primary keys 002, 102, 202,
 302, ... and so on.
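
A minimal sketch of that increment-by-n scheme (n = 100 here; the
sequence name is illustrative):

-- On server/database 1:
CREATE SEQUENCE foo_id_seq START WITH 1 INCREMENT BY 100;
-- On server/database 2:
CREATE SEQUENCE foo_id_seq START WITH 2 INCREMENT BY 100;

nextval('foo_id_seq') then yields 1, 101, 201, ... on server 1 and
2, 102, 202, ... on server 2.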

 That works great until you need more than 100 instances, at which point
 you're really, REALLY boned. In really busy systems it also limits the total
 amount of primary key space - but with BIGINT primary keys, that's unlikely
 to be something you need to worry about.

 The composite primary key (serverid,sequenceid) approach avoids the need for
 a pre-defined maximum number of servers, but can be slow to index and can
 require more storage, especially because of tuple headers.

 I have no firsthand experience with any of these approaches so I can't offer
 you a considered opinion. I know that the MS-SQL crowd at least strongly
 prefer UUIDs, but they have very strong in-database UUID support. MySQL
 folks seem to mostly favour the modulo primary key generation approach. I
 don't see much discussion of the issue here - I get the impression Pg
 doesn't see heavy use in sharded environments.

I think your analysis is right on the money except for one thing: the
composite approach doesn't need server_id as part of the key and could
be left off the index.  In fact, it can be left off the table
completely since the value is static for the entire database.  You
obviously can't check RI between databases so storing the value
everywhere is of no value.  server_id only matters when comparing data
from one database to another, which will rarely happen inside a
particular client database (and if it does, you'd have to store the
foreign server_id).

Any 'master' database that did control operations would of course have
to store server_id for each row but I suspect that's not where the
bulk of the data would be.  Ditto any application code...it would have
to do something like this:

select server_id(), foo_id from foo where ..

server_id() is of course an immutable function.  Since you are not
managing 2 billion+ servers, this will be an 'int', or even a
smallint.  I think this approach is stronger than the UUID approach in
every way.  Even stronger would be to not use surrogate keys at all,
but to involve whatever makes the decision that routes data between
databases as part of a more natural key (no way to know for sure if
this works for the OP with the available info).
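
A minimal sketch of such a function (the constant 42 is an illustrative
instance id; each database would be created with its own):

CREATE FUNCTION server_id() RETURNS int AS
$$ SELECT 42 $$
LANGUAGE sql IMMUTABLE;

-- The instance id is attached at query time instead of stored per row:
SELECT server_id(), foo_id FROM foo;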

I personally dislike sequence hacks.

merlin



Re: [PERFORM] Stored procedure declared as VOLATILE = no good optimization is done

2010-10-16 Thread Merlin Moncure
On Fri, Oct 15, 2010 at 10:31 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Tatsuo Ishii is...@postgresql.org writes:
 So can I say if a function is marked IMMUTABLE, then it should never
 modify database? Is there any counter example?
 It seems if above is correct, I can say STABLE functions should never
 modify databases as well.

 Both of those things are explicitly stated here:
 http://developer.postgresql.org/pgdocs/postgres/xfunc-volatility.html

Ok, being pedantic here, but:

I think more interesting is *why* the 'immutable shall not modify the
database' requirement is there.  IOW, suppose you ignore the warnings
in the docs and force immutability onto a function that writes (via the
function loophole) to the database -- why exactly is this a bad idea?
The reasoning given in the documentation explains a problematic
symptom of doing so, but gives little technical reasoning as to why it
should never be done.

One reason why writing to the database breaks immutability is that
writing to the database depends on resources that can change after the
fact.  Function immutability also pertains to failure: if a function
errors (or not) with a given set of inputs, it should always do so.  If
you write to a table, you could violate a constraint from one call to
the next, or the table may not even be there at all...
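
A minimal sketch of that failure mode, using the loophole mentioned
above (table and function names are illustrative):

CREATE TABLE call_log (v int UNIQUE);

-- The write lives in an ordinary VOLATILE function...
CREATE FUNCTION log_value(x int) RETURNS int AS $$
    INSERT INTO call_log(v) VALUES ($1);
    SELECT $1;
$$ LANGUAGE sql VOLATILE;

-- ...wrapped in a function forced to IMMUTABLE.  Calling it a second
-- time with the same input violates the UNIQUE constraint: identical
-- inputs, different outcome, the exact thing immutability rules out.
CREATE FUNCTION log_and_double(x int) RETURNS int AS $$
    SELECT log_value($1) * 2;
$$ LANGUAGE sql IMMUTABLE;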

Writing to the database means you are influencing other systems, and
via constraints they are influencing you, which makes it wrong by
definition.  That said, if you were writing to, say, a table with no
meaningful constraints, this actually wouldn't be so bad, as long as you
could also deal with the other big issue with immutability, namely that
there is no 1:1 correspondence between when the function is logically
evaluated and when it is executed.  This more or less eliminates
logging (at least outside of debugging purposes), the only thing I can
figure you could usefully do on a table with no enforceable constraints.
Also, a big use case for immutable functions is to allow their use in
indexing, and it would be just crazy (again, debugging purposes aside)
to write to a table on index evaluation.
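
By contrast, a minimal sketch of the legitimate indexing use case
(function and table are illustrative): a genuinely immutable function
may back an expression index precisely because re-evaluating it at any
time yields the same answer.

-- Depends only on its arguments: safe to mark IMMUTABLE.
CREATE FUNCTION initials(first_name text, last_name text) RETURNS text AS $$
    SELECT upper(substr($1, 1, 1) || substr($2, 1, 1));
$$ LANGUAGE sql IMMUTABLE;

CREATE TABLE person (first_name text, last_name text);

-- Accepted only because initials() is IMMUTABLE:
CREATE INDEX person_initials_idx ON person (initials(first_name, last_name));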

merlin
