Re: [PERFORM] Performance penalty when using WITH

2011-08-03 Thread Robert Klemme
On Tue, Aug 2, 2011 at 11:48 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Sat, Jul 30, 2011 at 8:10 AM, Robert Klemme
 shortcut...@googlemail.com wrote:
 On Thu, Jul 28, 2011 at 11:00 PM, Li Jin l...@tripadvisor.com wrote:
 I met with the problem that when I was using a WITH clause to reuse a
 subquery, I got a huge performance penalty because of query planner.
 Here are the details, the original query is
 EXPLAIN ANALYZE WITH latest_identities AS
 (
     SELECT DISTINCT ON (memberid) memberid, username, changedate
     FROM t_username_history
     WHERE memberid IN (SELECT memberid FROM t_member WHERE firstname || ' '
 || substring(lastname,1,1) = 'Eddie T')
     ORDER BY memberid, changedate DESC
 )

 Another observation: That criterion looks suspicious to me. I would
 expect any RDBMS to be better able to optimize this:

 WHERE firstname = 'Eddie' AND lastname like 'T%'

 I know it's semantically not the same but I would assume this is good
 enough for the common use case.  Plus, if there is an index on
 (firstname, lastname) then that could be used.

 disagree. just one of the ways that could be stymied would be to change
 the function behind the '||' operator.

I don't understand what you mean.  Can you please elaborate?

To explain my point a bit: I meant that by querying individual fields
separately instead of applying a criterion on a function of the two,
the RDBMS has a better chance to use indexes and come up with a better
plan for this part of the query.
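
To make this concrete (a sketch only; t_member is the table from the
original query, the index name is made up):

CREATE INDEX t_member_name_idx ON t_member (firstname, lastname);

-- sargable form: both predicates can use the composite index
SELECT memberid FROM t_member
WHERE firstname = 'Eddie' AND lastname LIKE 'T%';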

Kind regards

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/



Re: [PERFORM] Re: GROUP BY with reasonable timings in PLAN but unreasonable execution time

2011-08-03 Thread Robert Haas
On Fri, Jul 8, 2011 at 9:33 PM, Clem Dickey dicke...@us.ibm.com wrote:
 a. The Join cost estimators could have been given more information

 The functions which estimate JOIN selectivity (e.g. the chance that tuples
 will match in an equijoin) use data produced by ANALYZE. But
 the SELECT .. GROUP BY does not propagate ANALYZE data from the columns of
 its input relation to its output relation. That is too bad, because the
 column value statistics (number of unique values) would have improved
 selectivity estimates for all three join plans (merge join, nested loop, and
 hash join).

Yeah, I've had this same thought.  In fact, I think that it would
probably be an improvement to pass through not just the number of
unique values but the MCVs and frequencies of the non-GROUP-BY
columns.  Of course, for the grouping columns, we ought to let
n_distinct = -1 pop out.  Granted, the GROUP BY might totally change
the data distribution, so relying on the input column statistics to be
meaningful could be totally wrong, but on average it seems more likely
to give a useful answer than a blind stab in the dark.  I haven't
gotten around to doing anything about this, but it seems like a good
idea.

 b. the Merge Join cost estimator did a poor job with the data it was given:

 In function eqjoinsel_inner there are two cases (1) ANALYZE data is
 available for both sides of the join and (2) ANALYZE data is missing for one
 or both sides. Due to the GROUP BY processing described above, ANALYZE data
 was available for t but not for SELECT * FROM t GROUP BY 

 The logic in that case is to use the column with the most distinct values to
 estimate selectivity. The default number of distinct values for a column
 with no data (DEFAULT_NUM_DISTINCT) is 200. In my join the number of values
 was:

 col  in GROUP BY   in table t
 j      200            1
 k      200            1
 x      200           10
 y      200         1000
 z      200           30

 In 4 of the 5 columns the default value had more distinct values, and the
 combined selectivity (chance that two arbitrary rows would have a join
 match) was (1/200)^4 * 1/1000. Very small. The error is, IMO, that the code
 does not distinguish known numbers from default numbers. A comment in the
 code acknowledges this:

 XXX Can we be smarter if we have an MCV list for just one side?

 But it concludes

 It seems that if we assume equal distribution for the other side, we end up
 with the same answer anyway.

 I don't think that is the case. Preferring a known value, where one exists,
 would provide a better estimate of the actual range of the data. Indeed, the
 var_eq_non_const function in the same file (used by the nested loop join estimator)
 does essentially that.

I'm not sure I understand what you're getting at here, unless the idea
is to make get_variable_numdistinct() somehow indicate to the caller
whether it had to punt.  That might be worth doing.
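
A minimal way to reproduce the effect (a sketch; the table and data are
made up to roughly match Clem's distinct-value counts):

CREATE TABLE t (j int, k int, x int, y int, z int);
INSERT INTO t
SELECT 1, 1, i % 10, i % 1000, i % 30
FROM generate_series(1, 100000) i;
ANALYZE t;

-- Statistics exist for t, but not for the output of the GROUP BY
-- subquery, so the join selectivity estimate falls back to defaults
-- (DEFAULT_NUM_DISTINCT = 200 per column) and the row estimate comes
-- out far too low.
EXPLAIN
SELECT *
FROM t
JOIN (SELECT j, k, x, y, z FROM t GROUP BY j, k, x, y, z) g
  USING (j, k, x, y, z);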

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PERFORM] Performance penalty when using WITH

2011-08-03 Thread Robert Klemme
On Wed, Aug 3, 2011 at 3:27 PM, Li Jin l...@tripadvisor.com wrote:
 Robert,
 I've built an index on this expression firstname || ' ' ||
 substring(lastname,1,1). I believe this is the best index for this
 particular query. Correct me if I am wrong.

Maybe, maybe not.  Difficult to tell from a distance.  I would have an
index on (firstname, lastname).  You could try that and look at the
plan for the other query.  That's the only ultimate test which will
give you hard facts.
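
If you do try it, a sketch of the comparison (index names are made up):

CREATE INDEX t_member_expr_idx
  ON t_member ((firstname || ' ' || substring(lastname, 1, 1)));
CREATE INDEX t_member_name_idx ON t_member (firstname, lastname);

EXPLAIN ANALYZE SELECT memberid FROM t_member
WHERE firstname || ' ' || substring(lastname, 1, 1) = 'Eddie T';

EXPLAIN ANALYZE SELECT memberid FROM t_member
WHERE firstname = 'Eddie' AND lastname LIKE 'T%';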

Kind regards

robert


-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/



Re: [PERFORM] Performance penalty when using WITH

2011-08-03 Thread Merlin Moncure
On Wed, Aug 3, 2011 at 2:18 AM, Robert Klemme
shortcut...@googlemail.com wrote:
 Another observation: That criterion looks suspicious to me. I would
 expect any RDBMS to be better able to optimize this:

 WHERE firstname = 'Eddie' AND lastname like 'T%'

 I know it's semantically not the same but I would assume this is good
 enough for the common use case.  Plus, if there is an index on
 (firstname, lastname) then that could be used.

 disagree. just one of the ways that could be stymied would be to change
 the function behind the '||' operator.

 I don't understand what you mean.  Can you please elaborate?

 To explain my point a bit: I meant that by querying individual fields
 separately instead of applying a criterion on a function of the two
 the RDBMS has a better chance to use indexes and come up with a better
 plan for this part of the query.

Yes, but you're assuming that it is safe and generally advantageous to
do that.  Both assumptions, I think, are false.

The || operator is trivially hacked:
create or replace function funky_concat(l text, r text) returns text as
$$
  select textcat(textcat($1, 'abc'), $2);
$$ language sql immutable ;

update pg_operator set oprcode = 'funky_concat' where oid = 654;

postgres=# select 'a' || 'b';
 ?column? 
----------
 aabcb
(1 row)

Also even ignoring the above it's not free to have the database try
and analyze every instance of the || operator to see if it can be
decomposed to boolean field operations.

merlin



[PERFORM] Postgres performance on Linux and Windows

2011-08-03 Thread Dusan Misic
I have done some testing for my application (WIP): I executed the same SQL
script and queries on a real physical 64-bit Windows 7 machine and on a
virtualized 64-bit CentOS 6 machine.

Both database servers are tuned, with the real machine having 8 GB RAM and
4 cores and the virtualized one having 2 GB RAM and 2 virtual cores.

The virtualized server crushed the real physical server in performance in
both DDL and DML scripts.

My question is simple. Does PostgreSQL perform better on Linux than on
Windows, and how much faster is it in your tests?

Thank you for your time.


Re: [PERFORM] Postgres performance on Linux and Windows

2011-08-03 Thread Andy Colson

On 8/3/2011 11:37 AM, Dusan Misic wrote:

I had done some testing for my application (WIP) and I had executed same
SQL script and queries on real physical 64-bit Windows 7 and on
virtualized 64-bit CentOS 6.

Both database servers are tuned with real having 8 GB RAM and 4 cores,
virtualized having 2 GB RAM and 2 virtual cores.

Virtualized server crushed real physical server in performance in both
DDL and DML scripts.

My question is simple. Does PostgreSQL perform better on Linux than on
Windows and how much is it faster in your tests?

Thank you for your time.



Given the exact same hardware, I think PG will perform better on Linux.

Your question "how much faster" is really dependent on usage.  If you're 
cpu bound then I'd bet they perform the same.  You are cpu bound after 
all, and on the exact same hardware, it should be the same.


If you have lots of clients, with lots of IO, I think linux would 
perform better, but hard to say how much.  I can't recall anyone posting 
benchmarks from the exact same hardware.


Comparing windows on metal vs linux on vm is like comparing apples to 
Missouri.  If your test was IO bound, and the VM server was write 
caching, that's why your VM won so well... but I'd hate to see a power 
failure.


It would be interesting to compare windows on metal vs windows on vm 
though.  (I have done linux on metal vs linux on vm, but the hardware 
specs were different: dual amd64 with 4 SATA disks in software RAID 10 vs 
an intel 8-core something with 6-disk SCSI hardware RAID. Linux on 
metal won every time.)


I think in the long run, running the system you know best will be a 
win.  If you don't know linux much and run into problems, how much 
time/money will you spend fixing it, compared to windows?


If you have to have the absolute fastest system, Linux on metal is 
the way to go.


(This is all speculation and personal opinion, I have no numbers to back 
anything up)


-Andy



Re: [PERFORM] Postgres performance on Linux and Windows

2011-08-03 Thread Dusan Misic
Thank you Andy for your answer.

That is exactly what I had expected, but it is better to consult with
experts on this matter.

Again, thank you.

Dusan
 On Aug 3, 2011 7:05 PM, Andy Colson a...@squeakycode.net wrote:
 On 8/3/2011 11:37 AM, Dusan Misic wrote:
 I had done some testing for my application (WIP) and I had executed same
 SQL script and queries on real physical 64-bit Windows 7 and on
 virtualized 64-bit CentOS 6.

 Both database servers are tuned with real having 8 GB RAM and 4 cores,
 virtualized having 2 GB RAM and 2 virtual cores.

 Virtualized server crushed real physical server in performance in both
 DDL and DML scripts.

 My question is simple. Does PostgreSQL perform better on Linux than on
 Windows and how much is it faster in your tests?

 Thank you for your time.


 Given the exact same hardware, I think PG will perform better on Linux.

 Your question how much faster is really dependent on usage. If you're
 cpu bound then I'd bet they perform the same. You are cpu bound after
 all, and on the exact same hardware, it should be the same.

 If you have lots of clients, with lots of IO, I think linux would
 perform better, but hard to say how much. I can't recall anyone posting
 benchmarks from the exact same hardware.

 Comparing windows on metal vs linux on vm is like comparing apples to
 Missouri. If your test was io bound, and the vmserver was write
 caching, that's why your vm won so well... but I'd hate to see a power
 failure.

 It would be interesting to compare windows on metal vs windows on vm
 though. (Which, I have done linux on metal vs linux on vm, but the
 hardware specs were different (dual amd64 4 sata software raid10 vs
 intel 8-core something with 6-disk scsi hardware raid), but linux on
 metal won every time.)

 I think in the long run, running the system you are best at, will be a
 win. If you don't know linux much, and run into problems, how much
 time/money will you spend fixing it. Compared to windows.

 If you have to have the fastest, absolute, system. Linux on metal is
 the way to go.

 (This is all speculation and personal opinion, I have no numbers to back
 anything up)

 -Andy


Re: [PERFORM] Postgres performance on Linux and Windows

2011-08-03 Thread Kevin Grittner
Dusan Misic promi...@gmail.com wrote:
 
 My question is simple. Does PostgreSQL perform better on Linux
 than on Windows and how much is it faster in your tests?
 
We tested this quite a while back (on 8.0 and 8.1) with identical
hardware and identical databases running in matching versions of
PostgreSQL.  On both saturation stress tests and load balancing a
real live web site between PostgreSQL on Windows and Linux, Linux
came out about 40% faster.  Who knows what the number would be
today, with current PostgreSQL, Linux, and Windows?  Anyway, perhaps
it's a useful data point for you.
 
BTW, I wrote a tiny Java program to push data in both directions as
fast as possible over our network to check for networking problems
(it really showed up half-duplex legs pretty dramatically), and when
everything was on one switch it ran 30% faster if both ends were
Linux than when both ends were Windows. I found it interesting that
with one end on Linux and one on Windows, it split the difference. 
So this is not unique to PostgreSQL.
 
-Kevin



Re: [PERFORM] Performance penalty when using WITH

2011-08-03 Thread Robert Klemme
On Wed, Aug 3, 2011 at 6:24 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Wed, Aug 3, 2011 at 2:18 AM, Robert Klemme
 shortcut...@googlemail.com wrote:
 Another observation: That criterion looks suspicious to me. I would
 expect any RDBMS to be better able to optimize this:

 WHERE firstname = 'Eddie' AND lastname like 'T%'

 I know it's semantically not the same but I would assume this is good
 enough for the common use case.  Plus, if there is an index on
 (firstname, lastname) then that could be used.

 disagree. just one of the ways that could be stymied would be to change
 the function behind the '||' operator.

 I don't understand what you mean.  Can you please elaborate?

 To explain my point a bit: I meant that by querying individual fields
 separately instead of applying a criterion on a function of the two
 the RDBMS has a better chance to use indexes and come up with a better
 plan for this part of the query.

 Yes, but you're assuming that it is safe and generally advantageous to
 do that.  Both assumptions I think are false.

I am not sure why you say I assume this is _safe_.  I said it is good
enough for the common use case.  And it is certainly good enough for
this particular query.

As for "generally advantageous", I'd say that an index on raw
column values is usually useful for more queries than an index on a
specific function.  That's why I'd say generally an index on column
values is more versatile and I would prefer it.  Of course you might
achieve orders of magnitude of speedup for individual queries with an
index on a function tailored to that particular query but if you need
to do that for multiple queries you pay a higher penalty for updates.

 The || operator is trivially hacked:
 create or replace function funky_concat(l text, r text) returns text as
 $$
  select textcat(textcat($1, 'abc'), $2);
 $$ language sql immutable ;

 update pg_operator set oprcode = 'funky_concat' where oid = 654;

 postgres=# select 'a' || 'b';
  ?column? 
 ----------
  aabcb
 (1 row)

 Also even ignoring the above it's not free to have the database try
 and analyze every instance of the || operator to see if it can be
 decomposed to boolean field operations.

Even with your hacked operator you would need an index on the
expression to make it efficient.  That could be done with the original
|| as well.  But my point was to query

WHERE a = 'foo' and b like 'b%'
instead of WHERE a || ' ' || substring(b, 1, 1) = 'foo b'

to use an index on (a,b).  That index would also be useful for queries like

WHERE a = 'foo'
WHERE a like 'fo%'
WHERE a = 'foo' and b = 'bar'

and probably also

WHERE a > 'foo'
WHERE a > 'foo' and b like 'b%'
WHERE a > 'foo' and b = 'bar'
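
A minimal sketch of such an index (table and index names are made up;
note that under a non-C locale the LIKE-prefix forms additionally need
a text_pattern_ops index to use the b-tree):

CREATE INDEX t_a_b_idx ON t (a, b);
-- only needed for LIKE 'fo%' prefix matching in non-C locales:
CREATE INDEX t_a_b_pattern_idx ON t (a text_pattern_ops, b text_pattern_ops);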

Kind regards

robert


PS: Sorry for the earlier duplicate.  Gmail had a hiccup.

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/



Re: [PERFORM] Parameters for PostgreSQL

2011-08-03 Thread Jayadevan M
Hello,

 The most important spec has been omitted. What's the storage subsystem?
We have storage on SAN, RAID 5.
 
  We are suing weblogic.
   ^
 Best. Typo. Ever.
 
 I hear most people who use it want to, you're just brave enough to do it 
:-P
I wish I could make a few million that way.


Thank you for all the replies. The first step is, of course, to migrate 
the data. I am working with ora2pg for that. I assume creating files in 
'COPY' format to use as input for PostgreSQL is the right approach? We 
don't have many stored procedures or packages, so that part should be OK.
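
What I have in mind is a file like this (a sketch; table and rows are 
illustrative, values are tab-separated, and the stream ends with a 
backslash-period line):

COPY t_member (memberid, firstname, lastname) FROM stdin;
1	Eddie	Torres
2	Eddie	Tan
\.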






[PERFORM] Need to tune for Heavy Write

2011-08-03 Thread Adarsh Sharma

Dear all,

Over the last few days, I have researched a lot on PostgreSQL performance 
tuning due to the slow speed of my server.
My application selects about 10 rows from a MySQL database, processes 
them, and inserts into 2 Postgres tables, making about 45 connections.


I set my postgresql parameters in postgresql.conf as below: ( OS : 
Ubuntu, RAM : 16 GB, Postgres : 8.4.2 )


max_connections= 80
shared_buffers= 2048MB
work_mem = 32MB
maintenance_work_mem = 512MB
fsync=off 
full_page_writes=off 
synchronous_commit=off 
checkpoint_segments = 32
checkpoint_completion_target = 0.7  
effective_cache_size = 4096MB
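
To confirm these values are actually active after a reload, I check them
with (a sketch):

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'fsync',
               'synchronous_commit', 'checkpoint_segments');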



After this I changed my pg_xlog directory to a separate directory outside 
the data directory by symlinking it.



My application issues insert statements through PostgreSQL connections only.

Please let me know if I am missing any other important configuration.



Thanks





Re: [PERFORM] Parameters for PostgreSQL

2011-08-03 Thread Craig Ringer
On 04/08/11 11:42, Jayadevan M wrote:
 Hello,

 The most important spec has been omitted. What's the storage subsystem?
 We have storage on SAN, RAID 5.

RAID 5? That's *really* not ideal for database workloads, either Pg or
Oracle, unless your RAID 5 storage backend has enough battery-backed
write cache to keep huge amounts of writes in RAM and reorder them
really effectively.

I hope each RAID 5 LUN is only across a few disks and is layered with
RAID 1, though. RAID 5 becomes less reliable than using a single disk
when used with too many HDDs, because the probability of a double-disk
failure becomes greater than that of a single standalone disk failing.
After being bitten by that a few times, these days I'm using RAID 6 in
most cases where RAID 10 isn't practical.

In any case, a SAN can be anything from a Linux box running an iSCSI
target on top of a RAID 5 `md' software RAID volume on four 5400RPM
HDDs, right up to a giant hundreds-of-fast-disks monster filer full of
dedicated ASICs and great gobs of battery backed write cache DRAM. Are
you able to be any more specific about what you're dealing with?

  
   We are suing weblogic.
^
  Best. Typo. Ever.
 
  I hear most people who use it want to, you're just brave enough to
 do it :-P
 I wish I could make a few millions that way.


 Thank you for all the replies. The first step is, of course, to
 migrate the data. I am working with ora2pg for that. I assume creating
 files with 'COPY' to work as input for PostgreSQL is the right
 approach? We don't have many stored procedures or packages. So that
 part should be OK.




Re: [PERFORM] Parameters for PostgreSQL

2011-08-03 Thread Adarsh Sharma

I think RAID 10 is the best among all the RAID levels.


Thanks


Craig Ringer wrote:

On 04/08/11 11:42, Jayadevan M wrote:

Hello,

The most important spec has been omitted. What's the storage subsystem?
We have storage on SAN, RAID 5.


RAID 5? That's *really* not ideal for database workloads, either Pg or 
Oracle, unless your RAID 5 storage backend has enough battery-backed 
write cache to keep huge amounts of writes in RAM and reorder them 
really effectively.


I hope each RAID 5 LUN is only across a few disks and is layered with 
RAID 1, though. RAID 5 becomes less reliable than using a single disk 
when used with too many HDDs, because the probability of a double-disk 
failure becomes greater than that of a single standalone disk failing. 
After being bitten by that a few times, these days I'm using RAID 6 in 
most cases where RAID 10 isn't practical.


In any case, SAN can be anything from a Linux box running an iSCSI 
target on top of a RAID 5 `md' software RAID volume on four 5400RPM 
HDDs, right up to a giant hundreds-of-fast-disks monster filer full of 
dedicated ASICs and great gobs of battery backed write cache DRAM. Are 
you able to be any more specific about what you're dealing with?


 
  We are suing weblogic.

   ^
 Best. Typo. Ever.

 I hear most people who use it want to, you're just brave enough to 
do it :-P

I wish I could make a few millions that way.


Thank you for all the replies. The first step is, of course, to 
migrate the data. I am working with ora2pg for that. I assume 
creating files with 'COPY' to work as input for PostgreSQL is the 
right approach? We don't have many stored procedures or packages. So 
that part should be OK.