Re: [PERFORM] Performance penalty when using WITH
On Tue, Aug 2, 2011 at 11:48 PM, Merlin Moncure mmonc...@gmail.com wrote:
> On Sat, Jul 30, 2011 at 8:10 AM, Robert Klemme shortcut...@googlemail.com wrote:
>> On Thu, Jul 28, 2011 at 11:00 PM, Li Jin l...@tripadvisor.com wrote:
>>> I met with the problem that when I was using a WITH clause to reuse a
>>> subquery, I got a huge performance penalty because of the query
>>> planner. Here are the details. The original query is:
>>>
>>>   EXPLAIN ANALYZE
>>>   WITH latest_identities AS (
>>>       SELECT DISTINCT ON (memberid) memberid, username, changedate
>>>       FROM t_username_history
>>>       WHERE memberid IN (SELECT memberid FROM t_member
>>>                          WHERE firstname || ' ' || substring(lastname,1,1) = 'Eddie T')
>>>       ORDER BY memberid, changedate DESC
>>>   )
>>
>> Another observation: that criterion looks suspicious to me. I would
>> expect any RDBMS to be better able to optimize this:
>>
>>   WHERE firstname = 'Eddie' AND lastname LIKE 'T%'
>>
>> I know it's semantically not the same, but I would assume this is good
>> enough for the common use case. Plus, if there is an index on
>> (firstname, lastname), then that could be used.
>
> disagree. just one of the ways that could be stymied would be to change
> the function behind the '||' operator.

I don't understand what you mean. Can you please elaborate?

To explain my point a bit: I meant that by querying the individual fields
separately, instead of applying a criterion to a function of the two, the
RDBMS has a better chance to use indexes and come up with a better plan
for this part of the query.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
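Since WITH acted as an optimization fence in the PostgreSQL versions current at the time of this thread, one common workaround was to inline the CTE as a plain subquery so the planner can push predicates into it. A sketch by way of illustration (not posted in the thread; the outer SELECT list was elided in the original and stays elided here):

    -- Hypothetical inlined form of the quoted query; table and column
    -- names come from the thread, the outer SELECT list was not shown.
    SELECT ...
    FROM (
        SELECT DISTINCT ON (memberid) memberid, username, changedate
        FROM t_username_history
        WHERE memberid IN (SELECT memberid FROM t_member
                           WHERE firstname || ' ' || substring(lastname, 1, 1) = 'Eddie T')
        ORDER BY memberid, changedate DESC
    ) AS latest_identities
    ...

Whether this helps depends on the plan; EXPLAIN ANALYZE on both forms is the only real test.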
Re: [PERFORM] Re: GROUP BY with reasonable timings in PLAN but unreasonable execution time
On Fri, Jul 8, 2011 at 9:33 PM, Clem Dickey dicke...@us.ibm.com wrote:
> a. The join cost estimators could have been given more information.
>
> The functions which estimate JOIN selectivity (e.g. the chance that
> tuples will match in an equijoin) use data produced by ANALYZE. But
> SELECT .. GROUP BY does not propagate ANALYZE data from the columns of
> its input relation to its output relation. That is too bad, because the
> column value statistics (number of unique values) would have improved
> selectivity estimates for all three join plans (merge join, nested
> loop, and hash join).

Yeah, I've had this same thought. In fact, I think it would probably be an
improvement to pass through not just the number of unique values but also
the MCVs and frequencies of the non-GROUP-BY columns. Of course, for the
grouping columns, we ought to let n_distinct = -1 pop out. Granted, the
GROUP BY might totally change the data distribution, so relying on the
input column statistics to be meaningful could be totally wrong, but on
average it seems more likely to give a useful answer than a blind stab in
the dark. I haven't gotten around to doing anything about this, but it
seems like a good idea.

> b. The merge join cost estimator did a poor job with the data it was
> given.
>
> In function eqjoinsel_inner there are two cases: (1) ANALYZE data is
> available for both sides of the join, and (2) ANALYZE data is missing
> for one or both sides. Due to the GROUP BY processing described above,
> ANALYZE data was available for t but not for "SELECT * FROM t GROUP
> BY". The logic in that case is "use the column with the most distinct
> values" to estimate selectivity. The default number of distinct values
> for a column with no data (DEFAULT_NUM_DISTINCT) is 200.
>
> In my join the number of distinct values was:
>
>   col   in GROUP BY   in table t
>   j     200           1
>   k     200           1
>   x     200           10
>   y     200           1000
>   z     200           30
>
> In 4 of the 5 columns the default value had more distinct values, and
> the combined selectivity (the chance that two arbitrary rows would have
> a join match) was (1/200)^4 * 1/1000. Very small.
>
> The error is, IMO, that the code does not distinguish known numbers
> from default numbers. A comment in the code acknowledges this: "XXX Can
> we be smarter if we have an MCV list for just one side?" But it
> concludes "It seems that if we assume equal distribution for the other
> side, we end up with the same answer anyway." I don't think that is the
> case. Preferring a known value, where one exists, would provide a
> better estimate of the actual range of the data. Indeed,
> var_eq_non_const in the same file (used by the nested loop join
> estimator) does essentially that.

I'm not sure I understand what you're getting at here, unless the idea is
to make get_variable_numdistinct() somehow indicate to the caller whether
it had to punt. That might be worth doing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
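To make Clem's arithmetic concrete, here is a small sketch (not PostgreSQL's actual code) of the per-column rule he describes: with no MCV data, eqjoinsel_inner effectively charges 1/max(nd1, nd2) per join column and multiplies the columns together.

```python
# Sketch of the selectivity arithmetic described above. The
# "column with the most distinct values" rule gives each equijoin
# column a selectivity of 1/max(nd1, nd2); per-column estimates
# are then multiplied together.

def column_selectivity(nd_groupby: int, nd_table: int) -> float:
    """Selectivity for one equijoin column under the rule above."""
    return 1.0 / max(nd_groupby, nd_table)

DEFAULT_NUM_DISTINCT = 200  # fallback when ANALYZE data is missing

# (assumed distinct count for the GROUP BY side, known count in t)
columns = {
    "j": (DEFAULT_NUM_DISTINCT, 1),
    "k": (DEFAULT_NUM_DISTINCT, 1),
    "x": (DEFAULT_NUM_DISTINCT, 10),
    "y": (DEFAULT_NUM_DISTINCT, 1000),
    "z": (DEFAULT_NUM_DISTINCT, 30),
}

combined = 1.0
for nd_groupby, nd_table in columns.values():
    combined *= column_selectivity(nd_groupby, nd_table)

# Four columns fall back to 1/200; only y's known 1/1000 wins out:
# combined = (1/200)^4 * 1/1000, a vanishingly small estimate.
print(combined)
```

The known counts (1, 1, 10, 30) lose to the 200-row default on four of five columns, which is exactly why the row estimate collapses.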
Re: [PERFORM] Performance penalty when using WITH
On Wed, Aug 3, 2011 at 3:27 PM, Li Jin l...@tripadvisor.com wrote:
> Robert, I've built an index on the expression
> firstname || ' ' || substring(lastname,1,1). I believe this is the best
> index for this particular query. Correct me if I am wrong.

Maybe, maybe not. Difficult to tell from a distance. I would have an index
on (firstname, lastname). You could try that and look at the plan for the
other query. That's the only ultimate test which will give you hard facts.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
Re: [PERFORM] Performance penalty when using WITH
On Wed, Aug 3, 2011 at 2:18 AM, Robert Klemme shortcut...@googlemail.com wrote:
>> Another observation: that criterion looks suspicious to me. I would
>> expect any RDBMS to be better able to optimize this:
>>
>>   WHERE firstname = 'Eddie' AND lastname LIKE 'T%'
>>
>> I know it's semantically not the same, but I would assume this is good
>> enough for the common use case. Plus, if there is an index on
>> (firstname, lastname), then that could be used.
>
> disagree. just one of the ways that could be stymied would be to change
> the function behind the '||' operator.

I don't understand what you mean. Can you please elaborate?

To explain my point a bit: I meant that by querying the individual fields
separately, instead of applying a criterion to a function of the two, the
RDBMS has a better chance to use indexes and come up with a better plan
for this part of the query.

Yes, but you're assuming that it is safe and generally advantageous to do
that. Both assumptions, I think, are false. The || operator is trivially
hacked:

  create or replace function funky_concat(l text, r text) returns text as
  $$
    select textcat(textcat($1, 'abc'), $2);
  $$ language sql immutable;

  update pg_operator set oprcode = 'funky_concat' where oid = 654;

  postgres=# select 'a' || 'b';
   ?column?
  ----------
   aabcb
  (1 row)

Also, even ignoring the above, it's not free to have the database try to
analyze every instance of the || operator to see if it can be decomposed
into boolean field operations.

merlin
[PERFORM] Postgres performance on Linux and Windows
I have done some testing for my application (a work in progress): I
executed the same SQL scripts and queries on a real physical 64-bit
Windows 7 machine and on a virtualized 64-bit CentOS 6 machine. Both
database servers are tuned, the physical one having 8 GB RAM and 4 cores,
the virtualized one having 2 GB RAM and 2 virtual cores. The virtualized
server crushed the physical server in performance on both the DDL and DML
scripts.

My question is simple: does PostgreSQL perform better on Linux than on
Windows, and how much faster is it in your tests?

Thank you for your time.
Re: [PERFORM] Postgres performance on Linux and Windows
On 8/3/2011 11:37 AM, Dusan Misic wrote:
> My question is simple. Does PostgreSQL perform better on Linux than on
> Windows and how much is it faster in your tests?

Given the exact same hardware, I think PG will perform better on Linux.
Your question "how much faster" is really dependent on usage.

If you're CPU bound, then I'd bet they perform the same. You are CPU
bound, after all, and on the exact same hardware it should be the same. If
you have lots of clients with lots of IO, I think Linux would perform
better, but it's hard to say how much. I can't recall anyone posting
benchmarks from the exact same hardware.

Comparing Windows on metal vs. Linux on a VM is like comparing apples to
Missouri. If your test was IO bound and the VM server was write caching,
that's why your VM won so handily... but I'd hate to see a power failure.
It would be interesting to compare Windows on metal vs. Windows on a VM,
though. (I have compared Linux on metal vs. Linux on a VM, but the
hardware specs were different: dual AMD64 with 4 SATA disks in software
RAID 10 vs. an 8-core Intel box with 6-disk SCSI hardware RAID. Linux on
metal won every time.)

I think in the long run, running the system you know best will be a win.
If you don't know Linux well and run into problems, how much time/money
will you spend fixing it, compared to Windows? If you have to have the
absolute fastest system, Linux on metal is the way to go.

(This is all speculation and personal opinion; I have no numbers to back
anything up.)

-Andy
Re: [PERFORM] Postgres performance on Linux and Windows
Thank you Andy for your answer. That is exactly what I had expected, but
it is better to consult with experts on this matter. Again, thank you.

Dusan

On Aug 3, 2011 7:05 PM, Andy Colson a...@squeakycode.net wrote:
> Given the exact same hardware, I think PG will perform better on Linux.
> [...]
Re: [PERFORM] Postgres performance on Linux and Windows
Dusan Misic promi...@gmail.com wrote:
> My question is simple. Does PostgreSQL perform better on Linux than on
> Windows and how much is it faster in your tests?

We tested this quite a while back (on 8.0 and 8.1) with identical hardware
and identical databases running matching versions of PostgreSQL. On both
saturation stress tests and load balancing a real live web site between
PostgreSQL on Windows and Linux, Linux came out about 40% faster. Who
knows what the number would be today, with current PostgreSQL, Linux, and
Windows? Anyway, perhaps it's a useful data point for you.

BTW, I wrote a tiny Java program to push data in both directions as fast
as possible over our network to check for networking problems (it showed
up half-duplex legs pretty dramatically), and when everything was on one
switch it ran 30% faster when both ends were Linux than when both ends
were Windows. I found it interesting that with one end on Linux and one on
Windows, it split the difference. So this is not unique to PostgreSQL.

-Kevin
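Kevin's Java tool wasn't posted; a minimal sketch of the same idea in Python (an echo server plus a client that pumps data for a fixed interval and reports throughput) might look like this. Host, port, chunk size, and duration are all arbitrary choices; on a real network you would run the two halves on separate machines rather than over loopback.

```python
# Minimal bidirectional throughput probe: one side echoes everything
# back, the other sends for a fixed interval and counts bytes moved.
import socket
import threading
import time

CHUNK = 16 * 1024  # bytes per send; small enough to avoid buffer deadlock

def echo_server(listener: socket.socket) -> None:
    """Accept one connection and echo bytes back until the peer closes."""
    conn, _ = listener.accept()
    with conn:
        while True:
            data = conn.recv(CHUNK)
            if not data:
                break
            conn.sendall(data)

def measure_throughput(host: str, port: int, seconds: float) -> int:
    """Send for `seconds`, draining the echo after each chunk;
    return total bytes sent (== bytes echoed back)."""
    payload = b"x" * CHUNK
    sent = 0
    with socket.create_connection((host, port)) as s:
        deadline = time.monotonic() + seconds
        while time.monotonic() < deadline:
            s.sendall(payload)
            sent += len(payload)
            remaining = len(payload)
            while remaining:  # drain the echo of this chunk
                remaining -= len(s.recv(min(remaining, CHUNK)))
    return sent

# Loopback demo; replace "127.0.0.1" with a real host across the network.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]
threading.Thread(target=echo_server, args=(srv,), daemon=True).start()

n = measure_throughput("127.0.0.1", port, seconds=0.5)
print(f"~{n / 0.5 / 1e6:.1f} MB/s in each direction")
```

Over loopback this mostly measures the kernel; its point, as in Kevin's test, is comparing the same probe across OS pairs on identical hardware.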
Re: [PERFORM] Performance penalty when using WITH
On Wed, Aug 3, 2011 at 6:24 PM, Merlin Moncure mmonc...@gmail.com wrote:
> On Wed, Aug 3, 2011 at 2:18 AM, Robert Klemme shortcut...@googlemail.com wrote:
>> To explain my point a bit: I meant that by querying the individual
>> fields separately, instead of applying a criterion to a function of
>> the two, the RDBMS has a better chance to use indexes and come up with
>> a better plan for this part of the query.
>
> Yes, but you're assuming that it is safe and generally advantageous to
> do that. Both assumptions I think are false.

I am not sure why you say I assume this is _safe_. I said it is "good
enough for the common use case". And it is certainly good enough for this
particular query.

As for "generally advantageous": I'd say that an index on raw column
values is usually useful for more queries than an index on a specific
function. That's why I'd say an index on column values is generally more
versatile, and I would prefer it. Of course you might achieve orders of
magnitude of speedup for individual queries with an index on a function
tailored to that particular query, but if you need to do that for multiple
queries, you pay a higher penalty for updates.

> The || operator is trivially hacked:
>
>   create or replace function funky_concat(l text, r text) returns text as
>   $$
>     select textcat(textcat($1, 'abc'), $2);
>   $$ language sql immutable;
>
>   update pg_operator set oprcode = 'funky_concat' where oid = 654;
>
> Also, even ignoring the above, it's not free to have the database try
> to analyze every instance of the || operator to see if it can be
> decomposed into boolean field operations.

Even with your hacked operator you would need an index on the expression
to make it efficient. That could be done with the original || as well. But
my point was to query

  WHERE a = 'foo' AND b LIKE 'b%'

instead of

  WHERE a || ' ' || substring(b, 1, 1) = 'foo b'

to use an index on (a, b). That index would also be useful for queries
like

  WHERE a = 'foo'
  WHERE a LIKE 'fo%'
  WHERE a = 'foo' AND b = 'bar'

and probably also

  WHERE a > 'foo'
  WHERE a > 'foo' AND b LIKE 'b%'
  WHERE a > 'foo' AND b = 'bar'

Kind regards

robert

PS: Sorry for the earlier duplicate. Gmail had a hiccup.

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
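A sketch of the two indexing approaches being compared, assuming firstname and lastname are plain text columns (statements are illustrative, not from the thread):

    -- Composite index on the raw columns: versatile, serves many
    -- predicate shapes (equality, prefix LIKE, range on firstname).
    CREATE INDEX t_member_name_idx ON t_member (firstname, lastname);

    -- Usable by the rewritten criterion (prefix LIKE matching may
    -- additionally need text_pattern_ops or the C locale):
    SELECT memberid FROM t_member
    WHERE firstname = 'Eddie' AND lastname LIKE 'T%';

    -- Expression index: exact match for the original criterion, but
    -- only for that one expression, and one more index to maintain.
    CREATE INDEX t_member_expr_idx
        ON t_member ((firstname || ' ' || substring(lastname, 1, 1)));

    SELECT memberid FROM t_member
    WHERE firstname || ' ' || substring(lastname, 1, 1) = 'Eddie T';

As Robert says, comparing EXPLAIN ANALYZE output for both is the only ultimate test.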
Re: [PERFORM] Parameters for PostgreSQL
Hello,

>> The most important spec has been omitted. What's the storage subsystem?
>
> We have storage on SAN, RAID 5.

>>> We are suing weblogic.
>>           ^
>> Best. Typo. Ever. I hear most people who use it want to, you're just
>> brave enough to do it :-P

I wish I could make a few millions that way.

Thank you for all the replies. The first step is, of course, to migrate
the data. I am working with ora2pg for that. I assume creating files with
'COPY' to use as input for PostgreSQL is the right approach? We don't have
many stored procedures or packages, so that part should be OK.
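For reference, loading ora2pg's COPY output is straightforward; a minimal sketch, assuming it produced tab-delimited data in a file named member.copy for a hypothetical t_member table (names and path are illustrative):

    -- Server-side load (the file must be readable by the server process):
    COPY t_member (memberid, firstname, lastname)
    FROM '/path/to/member.copy';

    -- Or client-side via psql, so the file can stay on your workstation:
    -- \copy t_member FROM 'member.copy'

COPY is generally much faster than row-by-row INSERTs for bulk migration, which is why ora2pg's COPY-file output is the usual approach.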
[PERFORM] Need to tune for Heavy Write
Dear all,

Over the last few days I have researched a lot on PostgreSQL performance
tuning due to the slow speed of my server. My application selects data
from a MySQL database (about 10 rows), processes it, and inserts it into 2
Postgres tables, using about 45 connections.

I set my parameters in postgresql.conf as below (OS: Ubuntu, RAM: 16 GB,
Postgres: 8.4.2):

  max_connections = 80
  shared_buffers = 2048MB
  work_mem = 32MB
  maintenance_work_mem = 512MB
  fsync = off
  full_page_writes = off
  synchronous_commit = off
  checkpoint_segments = 32
  checkpoint_completion_target = 0.7
  effective_cache_size = 4096MB

After this, I moved my pg_xlog directory to a separate directory outside
the data directory by symlinking. The application issues insert statements
through PostgreSQL connections only.

Please let me know if I am missing any other important configuration.

Thanks
Re: [PERFORM] Parameters for PostgreSQL
On 04/08/11 11:42, Jayadevan M wrote:
>> The most important spec has been omitted. What's the storage subsystem?
>
> We have storage on SAN, RAID 5.

RAID 5? That's *really* not ideal for database workloads, either Pg or
Oracle, unless your RAID 5 storage backend has enough battery-backed write
cache to keep huge amounts of writes in RAM and reorder them really
effectively.

I hope each RAID 5 LUN is only across a few disks and is layered with
RAID 1, though. RAID 5 becomes less reliable than using a single disk when
used with too many HDDs, because the probability of a double-disk failure
becomes greater than that of a single standalone disk failing. After being
bitten by that a few times, these days I'm using RAID 6 in most cases
where RAID 10 isn't practical.

In any case, "SAN" can be anything from a Linux box running an iSCSI
target on top of a RAID 5 `md' software RAID volume on four 5400 RPM HDDs,
right up to a giant hundreds-of-fast-disks monster filer full of dedicated
ASICs and great gobs of battery-backed write cache DRAM. Are you able to
be any more specific about what you're dealing with?
Re: [PERFORM] Parameters for PostgreSQL
I think RAID 10 is the best among all the RAID levels.

Thanks

Craig Ringer wrote:
> RAID 5? That's *really* not ideal for database workloads, either Pg or
> Oracle, unless your RAID 5 storage backend has enough battery-backed
> write cache to keep huge amounts of writes in RAM and reorder them
> really effectively.
> [...]