Re: [PERFORM] No hash join across partitioned tables?
Excerpts from Samuel Gendler's message of Sat Oct 16 02:35:46 -0300 2010:

> An issue with automatically analyzing the entire hierarchy is 'abstract' table definitions. I've got a set of tables for storing the same data at different granularities of aggregation. Within each granularity, I've got partitions, but because the set of columns is identical for each granularity, I've got an abstract table definition that is inherited by everything. I don't need or want statistics kept on that table, because I never query the abstract table itself, only the parent table of each aggregation granularity.

Hmm, I think you'd be better served by using LIKE instead of regular inheritance.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
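For reference, the LIKE clause clones a table's column definitions without making the new table part of an inheritance hierarchy, so the template never appears in anyone's query plans or statistics. A minimal sketch (table names are hypothetical, not from the thread):

```sql
-- Abstract column template; never queried or ANALYZEd directly.
CREATE TABLE measurements_template (
    metric_id   integer NOT NULL,
    measured_at timestamptz NOT NULL,
    value       double precision
);

-- Each aggregation granularity copies the columns instead of inheriting,
-- so no statistics are ever kept on the template itself:
CREATE TABLE measurements_hourly (
    LIKE measurements_template INCLUDING DEFAULTS INCLUDING CONSTRAINTS
);
CREATE TABLE measurements_daily (
    LIKE measurements_template INCLUDING DEFAULTS INCLUDING CONSTRAINTS
);
```

Unlike INHERITS, later changes to the template do not propagate to the copies, which is usually what you want for a pure column template.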
[PERFORM] Select count(*), the sequel
There was some doubt about the speed of doing select count(*) in PostgreSQL and Oracle. To that end, I copied most of the Oracle table I used before to Postgres. Although the copy wasn't complete, the resulting table is already significantly larger than the table it was copied from. The result still shows that Oracle is significantly faster.

Oracle result:

SQL> alter system flush buffer_cache;

System altered.

SQL> select /*+ full(NO) noparallel */ count(*) from ni_occurrence no;

  COUNT(*)
----------
 402062638

Elapsed: 00:03:16.45

Hints are necessary because the Oracle table is declared as parallel and I didn't want the PK index to be used for counting. Oracle has a good habit of using PKs for counting, if available.

SQL> select bytes/1048576 as MB
  2  from user_segments
  3  where segment_name='NI_OCCURRENCE';

        MB
----------
     35329

Elapsed: 00:00:00.85

So, Oracle stores 402 million records in 35GB and counts them in 3 minutes 16.45 seconds. The very same table was partially copied to Postgres; the copy died with ORA-01555 "snapshot too old" sometime this morning. I ran vacuumdb -f -z on the database after the copy completed, and the results are below.

mgogala=# select count(*) from ni_occurrence;
   count
-----------
 382400476
(1 row)

Time: 221716.466 ms

mgogala=# select 221/60::real;
 ?column?
----------
     3.68
(1 row)

mgogala=# select pg_size_pretty(pg_table_size('ni_occurrence'));
 pg_size_pretty
----------------
 46 GB
(1 row)

The database wasn't restarted and no caches were flushed, so the comparison was done with a serious advantage for PostgreSQL. Postgres needed 3.68 minutes to complete the count, which is about the same as Oracle, but still somewhat slower. Also, I am worried about the sizes: the Postgres table is 11GB larger than the original, despite having less data. That was an unfair and unbalanced comparison, because Oracle's cache was flushed and Oracle was artificially restrained to the full table scan without the aid of parallelism.
Here is the same result, with no hints and autotrace on, which shows what happens if I turn the hints off:

SQL> select count(*) from ni_occurrence no;

  COUNT(*)
----------
 402062638

Elapsed: 00:00:52.61

Execution Plan
--------------
Plan hash value: 53476935

| Id | Operation                  | Name                  | Rows | Cost (%CPU)| Time     | TQ    |IN-OUT| PQ Distrib |
|  0 | SELECT STATEMENT           |                       |    1 | 54001 (19) | 00:01:08 |       |      |            |
|  1 |  SORT AGGREGATE            |                       |    1 |            |          |       |      |            |
|  2 |   PX COORDINATOR           |                       |      |            |          |       |      |            |
|  3 |    PX SEND QC (RANDOM)     | :TQ1                  |    1 |            |          | Q1,00 | P->S | QC (RAND)  |
|  4 |     SORT AGGREGATE         |                       |    1 |            |          | Q1,00 | PCWP |            |
|  5 |      PX BLOCK ITERATOR     |                       | 402M | 54001 (19) | 00:01:08 | Q1,00 | PCWC |            |
|  6 |       INDEX FAST FULL SCAN | IDX_NI_OCCURRENCE_PID | 402M | 54001 (19) | 00:01:08 | Q1,00 | PCWP |            |

It took just 52 seconds to count everything, but Oracle didn't even scan the table: it scanned a unique index, in parallel. That is the algorithmic advantage that forced me to restrict the execution plan with hints. My conclusion is that the speed of the full scan is OK, about the same as Oracle's. There are, however, three significant algorithmic advantages on Oracle's side:

1) Oracle can use indexes to calculate select count(*).
2) Oracle can use parallelism.
3) Oracle can use indexes in combination with parallel processing.

Here are the descriptions:

SQL> desc ni_occurrence
 Name             Null?     Type
 ---------------- --------- --------------
 ID               NOT NULL  NUMBER(22)
 PERMANENT_ID     NOT NULL  VARCHAR2(12)
 CALL_LETTERS     NOT NULL  VARCHAR2(5)
 AIRDATE          NOT NULL  DATE
 DURATION         NOT NULL  NUMBER(4)
 PROGRAM_TITLE              VARCHAR2(360)
 COST                       NUMBER(15)
 ASSETID                    NUMBER(12)
 MARKET_ID                  NUMBER
 GMT_TIME                   DATE
 ORIG_ST_OCC_ID             NUMBER
 EPISODE                    VARCHAR2(450)
 IMPRESSIONS
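As an aside, the usual PostgreSQL workaround when an exact count is not required is to read the planner's row estimate from the system catalogs, which returns in milliseconds regardless of table size (sketch uses the table name from the thread; the figure is only as fresh as the last VACUUM/ANALYZE):

```sql
-- Approximate row count from statistics maintained by (auto)vacuum/analyze;
-- no table or index scan is performed at all:
SELECT reltuples::bigint AS approximate_rows
FROM pg_class
WHERE relname = 'ni_occurrence';
```

This does not address the exact-count case being benchmarked here, but for monitoring and sanity checks it sidesteps the full scan entirely.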
Re: [PERFORM] UUID performance as primary key
On 10/15/10 6:58 PM, Navkirat Singh wrote:
> I am interested in finding out the pros/cons of using UUID as a primary key field. My requirement states that UUID would be perfect in my case as I will be having many small databases which will link up to a global database using the UUID. Hence, the need for a unique key across all databases.

You left out one piece of information: how many keys per second do you need?

We put a sequence in the global database that all secondary databases use to get their IDs. It means an extra connect/disconnect (a pooler can minimize this), so if you're issuing thousands of IDs per second, this isn't a good idea. But for a small-ish number of IDs per second, it gets you the benefit of a universal ID without the size of the UUID field.

Craig (the other one)
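The central-sequence approach Craig describes can be sketched with the dblink extension; the sequence name, host, and credentials below are illustrative assumptions, not from the thread:

```sql
-- On the global database:
CREATE SEQUENCE global_id_seq;

-- On each secondary database (requires the dblink extension installed;
-- connection string details here are hypothetical):
CREATE OR REPLACE FUNCTION next_global_id() RETURNS bigint AS $$
    SELECT id
    FROM dblink('host=global-db dbname=ids user=idclient',
                'SELECT nextval(''global_id_seq'')')
         AS t(id bigint);
$$ LANGUAGE sql;
```

Each call costs a network round trip to the global database, which is exactly why a connection pooler matters once the ID rate climbs into the thousands per second.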
Re: [PERFORM] Select count(*), the sequel
Hi,

Interesting data points. The number of rows that you managed to insert into PostgreSQL before Oracle gave up the ghost is 95% of the rows in the Oracle version of the database. To count 5% fewer rows, it took PostgreSQL 24 seconds longer. Or, adjusting for the missing rows, 52 seconds longer for the entire table, i.e. 18% longer than the full table scan in Oracle. This seems to be well within the table layout size differences, possibly due to the fillfactor used -- not really bad at all.

Now the timings due to algorithm changes are interesting, as they indicate the room for improvement from those types of changes. A parallel sequential full-table scan in PostgreSQL could provide the same speed-up. Currently that is not possible... but development continues apace. In fact, developing such functions in PostgreSQL could end up being less expensive long-term than licensing Oracle RAC. I think the point that you have helped make is that PostgreSQL performs very well for many use cases that have typically been relegated to expensive commercial databases such as Oracle, DB2, ...

Regards,
Ken

On Sat, Oct 16, 2010 at 12:53:50PM -0400, Mladen Gogala wrote:
> There was some doubt about the speed of doing select count(*) in PostgreSQL and Oracle. To that end, I copied most of the Oracle table I used before to Postgres. Although the copy wasn't complete, the resulting table is already significantly larger than the table it was copied from. The result still shows that Oracle is significantly faster. Oracle result: SQL> alter system flush buffer_cache; System altered. SQL> select /*+ full(NO) noparallel */ count(*) from ni_occurrence no; COUNT(*): 402062638. Elapsed: 00:03:16.45. Hints are necessary because the Oracle table is declared as parallel and I didn't want the PK index to be used for counting. Oracle has a good habit of using PKs for counting, if available.
> [remainder of quoted message snipped]
Re: [PERFORM] UUID performance as primary key
On Fri, Oct 15, 2010 at 10:59 PM, Craig Ringer cr...@postnewspapers.com.au wrote:
> On 16/10/2010 9:58 AM, Navkirat Singh wrote:
>> Hi Guys,
>> I am interested in finding out the pros/cons of using UUID as a primary key field. My requirement states that UUID would be perfect in my case as I will be having many small databases which will link up to a global database using the UUID. Hence, the need for a unique key across all databases. It would be extremely helpful if someone could help me figure this out, as it is critical for my project.
>
> Pro: No need for (serverid, serverseq) pair primary keys or hacks with modulus-based key generation. Doesn't set any pre-determined limit on how many servers/databases may be in a cluster.
>
> Con: Slower than the modulo key generation approach, uses more storage. Foreign key relationships may be slower too.
>
> Overall, UUIDs seem to be a favoured approach. The other way people seem to do this is by assigning a unique instance id to each server/database out of a maximum of n instances decided at setup time. Every key generation sequence increments by n whenever it generates a key, with an offset of the server/database id. That way, if n=100, server 1 will generate primary keys 001, 101, 201, 301, ..., server 2 will generate primary keys 002, 102, 202, 302, ... and so on.
>
> That works great until you need more than 100 instances, at which point you're really, REALLY boned. In really busy systems it also limits the total amount of primary key space - but with BIGINT primary keys, that's unlikely to be something you need to worry about.
>
> The composite primary key (serverid, sequenceid) approach avoids the need for a pre-defined maximum number of servers, but can be slow to index and can require more storage, especially because of tuple headers.
>
> I have no firsthand experience with any of these approaches so I can't offer you a considered opinion. I know that the MS-SQL crowd at least strongly prefer UUIDs, but they have very strong in-database UUID support.
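The increment-by-n scheme Craig describes maps directly onto ordinary sequences; a sketch with n=100 and an illustrative table name:

```sql
-- On server 1 of a maximum of 100 servers:
CREATE SEQUENCE foo_id_seq START WITH 1 INCREMENT BY 100;
-- On server 2, the same sequence would instead be:
--   CREATE SEQUENCE foo_id_seq START WITH 2 INCREMENT BY 100;

CREATE TABLE foo (
    foo_id  bigint PRIMARY KEY DEFAULT nextval('foo_id_seq'),
    payload text
);
```

Server 1 then generates 1, 101, 201, ... while server 2 generates 2, 102, 202, ..., so the key spaces can never collide -- until, as noted, you need server number 101.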
> MySQL folks seem to mostly favour the modulo primary key generation approach. I don't see much discussion of the issue here - I get the impression Pg doesn't see heavy use in sharded environments.

I think your analysis is right on the money except for one thing: the composite approach doesn't need server_id as part of the key, and it could be left off the index. In fact, it can be left off the table completely, since the value is static for the entire database. You obviously can't check RI between databases, so storing the value everywhere is of no value. server_id only matters when comparing data from one database to another, which will rarely happen inside a particular client database (and if it does, you'd have to store the foreign server_id). Any 'master' database that did control operations would of course have to store server_id for each row, but I suspect that's not where the bulk of the data would be. Ditto any application code... it would have to do something like this:

select server_id(), foo_id from foo where ...

server_id() is of course an immutable function. Since you are not managing 2 billion+ servers, this will be an 'int', or even a smallint.

I think this approach is stronger than the UUID approach in every way. Even stronger would be to not use surrogate keys at all, but to involve whatever makes the decision that routes data between databases as part of a more natural key (no way to know for sure if this works for the OP with the available info). I personally dislike sequence hacks.

merlin
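The server_id() idea above can be sketched as a one-line SQL function; the value 42 is of course a per-installation assumption:

```sql
-- Each database hard-codes its own identifier. Declaring it IMMUTABLE
-- lets the planner fold the call to a constant at plan time:
CREATE OR REPLACE FUNCTION server_id() RETURNS int AS $$
    SELECT 42;
$$ LANGUAGE sql IMMUTABLE;

-- Application code then tags rows on the way out:
-- SELECT server_id(), foo_id FROM foo WHERE ...;
```

Since the value never changes for the life of the database, IMMUTABLE is an honest declaration here, unlike the problematic cases discussed in the volatility thread below.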
Re: [PERFORM] Stored procedure declared as VOLATILE = no good optimization is done
On Fri, Oct 15, 2010 at 10:31 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Tatsuo Ishii is...@postgresql.org writes:
>> So can I say if a function is marked IMMUTABLE, then it should never modify the database? Is there any counter example? It seems if the above is correct, I can say STABLE functions should never modify databases as well.
>
> Both of those things are explicitly stated here:
> http://developer.postgresql.org/pgdocs/postgres/xfunc-volatility.html

Ok, being pedantic here, but: I think more interesting is *why* the 'immutable shall not modify the database' requirement is there. IOW, suppose you ignore the warnings in the docs and force immutability on a function that writes (via the function loophole) to the database -- why exactly is this a bad idea? The reasoning given in the documentation explains a problematic symptom of doing so, but gives little technical reasoning as to why it should never be done.

One reason why writing to the database breaks immutability is that writing to the database depends on resources that can change after the fact: function immutability also pertains to failure -- if a function errors (or not) with a set of inputs, it should always do so. If you write to a table, you could violate a constraint from one call to the next, or the table may not even be there at all... Writing to the database means you are influencing other systems, and via constraints they are influencing you, so it is wrong by definition.

That said, if you were writing to, say, a table with no meaningful constraints, this actually wouldn't be so bad, as long as you can also deal with the other big issue with immutability: namely, that there is no 1:1 correspondence between when the function is logically evaluated and when it is executed. This more or less eliminates logging (at least outside of debugging purposes), the only thing I can figure you could usefully do on a table with no enforceable constraints.
Also, a big use case for immutable functions is to allow their use in indexing, and it would be just crazy (again, debugging purposes aside) to write to a table on index evaluation.

merlin
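The indexing use case is the standard motivation for IMMUTABLE: expression indexes require it, because index entries are computed at write time rather than at query time. A sketch with illustrative names:

```sql
-- An immutable wrapper function is required before the expression
-- can appear in an index definition:
CREATE OR REPLACE FUNCTION normalize_title(t text) RETURNS text AS $$
    SELECT lower(trim(t));
$$ LANGUAGE sql IMMUTABLE;

CREATE TABLE articles (
    article_id bigint PRIMARY KEY,
    title      text NOT NULL
);

-- Index entries are computed from the function whenever a row is
-- inserted or updated, never at SELECT time:
CREATE INDEX articles_title_idx ON articles (normalize_title(title));
```

If normalize_title() secretly wrote to a table, that write would fire at unpredictable moments -- index builds, row updates, REINDEX -- which is exactly the hazard described above.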