Re: [PERFORM] reducing random_page_cost from 4 to 2 to force index scan

2011-04-26 Thread Merlin Moncure
On Tue, Apr 26, 2011 at 4:37 PM, Kevin Grittner wrote:
> Sok Ann Yap  wrote:
>
>> So, index scan wins by a very small margin over sequential scan
>> after the tuning. I am a bit puzzled because index scan is more
>> than 3000 times faster in this case, but the estimated costs are
>> about the same. Did I do something wrong?
>
> Tuning is generally needed to get the best performance from PostgreSQL.
> Needing to reduce random_page_cost is not unusual in situations
> where a good portion of the active data is in cache (between
> shared_buffers and the OS cache).  Please show us your overall
> configuration and give a description of the hardware (how many of
> what kind of cores, how much RAM, what sort of storage system).  The
> configuration part can be obtained by running the query on this page
> and pasting the result into your next post:
>
> http://wiki.postgresql.org/wiki/Server_Configuration
>
> There are probably some other configuration adjustments you could do
> to ensure that good plans are chosen.

The very first thing to check is effective_cache_size; make sure it is
set to a reasonable value.
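
For example (the value is illustrative; size it to roughly shared_buffers
plus the OS page cache):

SET effective_cache_size = '6GB';  -- per session, to test plan changes
-- or set it in postgresql.conf; a reload is enough, no restart needed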

merlin


Re: [PERFORM] reducing random_page_cost from 4 to 2 to force index scan

2011-04-26 Thread Kevin Grittner
Sok Ann Yap  wrote:
 
> So, index scan wins by a very small margin over sequential scan
> after the tuning. I am a bit puzzled because index scan is more
> than 3000 times faster in this case, but the estimated costs are
> about the same. Did I do something wrong?
 
Tuning is generally needed to get the best performance from PostgreSQL.
Needing to reduce random_page_cost is not unusual in situations
where a good portion of the active data is in cache (between
shared_buffers and the OS cache).  Please show us your overall
configuration and give a description of the hardware (how many of
what kind of cores, how much RAM, what sort of storage system).  The
configuration part can be obtained by running the query on this page
and pasting the result into your next post:
 
http://wiki.postgresql.org/wiki/Server_Configuration
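
For reference, the query on that page is along these lines (check the wiki
for the current version):

SELECT version();
SELECT name, current_setting(name), source
  FROM pg_settings
  WHERE source NOT IN ('default', 'override');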
 
There are probably some other configuration adjustments you could do
to ensure that good plans are chosen.
 
-Kevin


[PERFORM] reducing random_page_cost from 4 to 2 to force index scan

2011-04-26 Thread Sok Ann Yap
Hi,

I am using PostgreSQL 9.0. There is a salutations table with 44 rows,
and a contacts table with more than a million rows. The contacts table
has a nullable (only 0.002% null) salutation_id column, referencing
salutations.id.

With this query:

SELECT
    salutations.id,
    salutations.name,
    salutations.description,
    EXISTS (
        SELECT 1
        FROM contacts
        WHERE salutations.id = contacts.salutation_id
    ) AS in_use
FROM salutations

I have to reduce random_page_cost from 4 to 2 to force index scan.
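
Settings for each run below were changed per session, e.g.:

SET random_page_cost = 2;   -- down from the default of 4
SET enable_seqscan = off;   -- second run only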

EXPLAIN ANALYSE output with random_page_cost = 4:

 Seq Scan on salutations  (cost=0.00..50.51 rows=44 width=229) (actual time=0.188..3844.037 rows=44 loops=1)
   SubPlan 1
     ->  Seq Scan on contacts  (cost=0.00..64578.41 rows=57906 width=0) (actual time=87.358..87.358 rows=1 loops=44)
           Filter: ($0 = salutation_id)
 Total runtime: 3844.113 ms

EXPLAIN ANALYSE output with random_page_cost = 4, enable_seqscan = 0:

 Seq Scan on salutations  (cost=100.00..195.42 rows=44 width=229) (actual time=0.053..0.542 rows=44 loops=1)
   SubPlan 1
     ->  Index Scan using ix_contacts_salutation_id on contacts  (cost=0.00..123682.07 rows=57906 width=0) (actual time=0.011..0.011 rows=1 loops=44)
           Index Cond: ($0 = salutation_id)
 Total runtime: 0.592 ms

EXPLAIN ANALYSE output with random_page_cost = 2:

 Seq Scan on salutations  (cost=0.00..48.87 rows=44 width=229) (actual time=0.053..0.541 rows=44 loops=1)
   SubPlan 1
     ->  Index Scan using ix_contacts_salutation_id on contacts  (cost=0.00..62423.45 rows=57906 width=0) (actual time=0.011..0.011 rows=1 loops=44)
           Index Cond: ($0 = salutation_id)
 Total runtime: 0.594 ms

So, index scan wins by a very small margin over sequential scan after
the tuning. I am a bit puzzled because index scan is more than 3000
times faster in this case, but the estimated costs are about the same.
Did I do something wrong?

Regards,
Yap


Re: [PERFORM] Performance

2011-04-26 Thread Tomas Vondra
On 26.4.2011 07:35, Robert Haas wrote:
> On Apr 13, 2011, at 6:19 PM, Tomas Vondra  wrote:
>> Yes, I've had some lectures on non-linear programming so I'm aware that
>> this won't work if the cost function has multiple extremes (valleys /
>> hills etc.), but I somehow suppose that's not the case for cost estimates.
> 
> I think that supposition might turn out to be incorrect, though. Probably
> what will happen on simple queries is that a small change will make no
> difference, and a large enough change will cause a plan change.  On
> complex queries it will approach continuous variation but why
> shouldn't there be local minima?

Aaaah, damn! I was not talking about cost estimates - those obviously do
not have this feature, as you've pointed out (thanks!).

I was talking about the 'response time' I mentioned when describing the
autotuning using real workload. The idea is to change the costs a bit
and then measure the average response time - if the overall performance
improved, do another step in the same direction. Etc.

I wonder if there are cases where an increase of random_page_cost would
hurt performance, and another increase would improve it ... And I'm not
talking about individual queries, I'm talking about overall performance.
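
A minimal sketch of one step of such a measure-and-step loop, assuming a
hypothetical run_workload() function that replays a representative sample
of the real queries:

DO $$
DECLARE
    t0   timestamptz;
    t_lo interval;
    t_hi interval;
BEGIN
    -- try a slightly lower cost; is_local = true reverts it at commit
    PERFORM set_config('random_page_cost', '1.8', true);
    t0 := clock_timestamp();
    PERFORM run_workload();                -- hypothetical workload replay
    t_lo := clock_timestamp() - t0;

    -- try a slightly higher cost
    PERFORM set_config('random_page_cost', '2.2', true);
    t0 := clock_timestamp();
    PERFORM run_workload();
    t_hi := clock_timestamp() - t0;

    -- step in the direction of whichever run was faster
    RAISE NOTICE 'rpc=1.8: %, rpc=2.2: %', t_lo, t_hi;
END
$$;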

regards
Tomas


Re: [PERFORM] Time to put theory to the test?

2011-04-26 Thread Josh Berkus
J,

> Long story short, every single PostgreSQL machine survived the failure
> with *zero* data corruption.  I had a few issues with SQL Server
> machines, and virtually every MySQL machine has required data cleanup
> and table scans and tweaks to get it back to "production" status.

Can I quote you on this?   I'll need name/company.

And, thank you for posting that regardless ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: [PERFORM] Time to put theory to the test?

2011-04-26 Thread Kenneth Marshall
On Tue, Apr 26, 2011 at 09:58:49AM -0500, Kevin Grittner wrote:
> J Sisson  wrote:
> > Rob Wultsch  wrote:
> >> Tip from someone that manages thousands of MySQL servers: Use
> >> InnoDB when using MySQL.
> > 
> > Granted, my knowledge of PostgreSQL (and even MSSQL) far surpasses
> > my knowledge of MySQL, but if InnoDB has such amazing benefits as
> > being crash safe, and even speed increases in some instances, why
> > isn't InnoDB default?
>  
> Because it's not as fast as the unsafe MyISAM implementation for most
> benchmarks.
>  
> There is one minor gotcha in InnoDB (unless it's been fixed since
> 2008): the release of locks is not atomic with the persistence of
> the data in the write-ahead log (which makes it S2PL but not SS2PL).
> So it is possible for another connection to see data that won't be
> there after crash recovery. This is justified as an optimization.
> Personally, I would prefer not to see data from other transactions
> until it has actually been successfully committed.
>  
> -Kevin
> 

In addition, their fulltext indexing only works with MyISAM tables.

Ken


Re: [PERFORM] Time to put theory to the test?

2011-04-26 Thread Magnus Hagander
On Tue, Apr 26, 2011 at 17:51, Tom Lane  wrote:
> J Sisson  writes:
>> Granted, my knowledge of PostgreSQL (and even MSSQL) far surpasses my
>> knowledge of MySQL, but if InnoDB has such amazing benefits as being
>> crash safe, and even speed increases in some instances, why isn't
>> InnoDB default?
>
> It *is* default in the most recent versions (5.5 and up).  They saw
> the light eventually.  I wonder whether being bought out by Oracle
> had something to do with that attitude adjustment ...

Oracle has owned InnoDB for quite some time. MySQL didn't want to make
themselves dependent on an Oracle-controlled technology. That argument
certainly went away when Oracle bought them - and I think that was the
main reason. Not the "Oracle mindset" or anything like that...

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [PERFORM] Time to put theory to the test?

2011-04-26 Thread Tom Lane
J Sisson  writes:
> Granted, my knowledge of PostgreSQL (and even MSSQL) far surpasses my
> knowledge of MySQL, but if InnoDB has such amazing benefits as being
> crash safe, and even speed increases in some instances, why isn't
> InnoDB default?

It *is* default in the most recent versions (5.5 and up).  They saw
the light eventually.  I wonder whether being bought out by Oracle
had something to do with that attitude adjustment ...

regards, tom lane


Re: [PERFORM] Time to put theory to the test?

2011-04-26 Thread gnuoytr


 Original message 
>Date: Tue, 26 Apr 2011 09:13:17 -0500
>From: pgsql-performance-ow...@postgresql.org (on behalf of J Sisson)
>Subject: Re: [PERFORM] Time to put theory to the test?  
>To: Rob Wultsch 
>Cc: "pgsql-performance@postgresql.org" 
>
>On Mon, Apr 25, 2011 at 10:04 PM, Rob Wultsch  wrote:
>> Tip from someone that manages thousands of MySQL servers: Use InnoDB
>> when using MySQL.
>
>Granted, my knowledge of PostgreSQL (and even MSSQL) far surpasses my
>knowledge of MySQL, but if InnoDB has such amazing benefits as being
>crash safe, and even speed increases in some instances, why isn't
>InnoDB default? 

Because it is, recently:
http://dev.mysql.com/doc/refman/5.5/en/innodb-default-se.html



[PERFORM] optimizing a cpu-heavy query

2011-04-26 Thread Joel Reymont
Folks,

I'm trying to optimize the following query, which computes the KL
divergence [1]. As you can see, the distance function operates on vectors
of 150 floats.

The query takes 12 minutes to run on an idle (apart from pgsql) EC2 m1.large
instance with 2 million documents in the docs table. The CPU is pegged at 100%
during this time. I need to be able to both process concurrent distance queries
and otherwise use the database.

I have the option of moving this distance calculation off of PG but are there 
other options?

Is there anything clearly wrong that I'm doing here?

Would it speed things up to make the float array a custom data type backed by C 
code?

Thanks in advance, Joel

[1] http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

---

CREATE DOMAIN topics AS float[150];
CREATE DOMAIN doc_id AS varchar(64);

CREATE TABLE docs
(
 id  serial,
 doc_id  doc_id NOT NULL PRIMARY KEY,
 topics  topics NOT NULL
);

CREATE OR REPLACE FUNCTION docs_within_distance(vec topics, threshold float)
RETURNS TABLE(id doc_id, distance float) AS $$
BEGIN
    RETURN QUERY
        SELECT *
        FROM (SELECT doc_id,
                     (SELECT sum(vec[i] * ln(vec[i] / topics[i]))
                      FROM generate_subscripts(topics, 1) AS i
                      WHERE topics[i] > 0) AS distance
              FROM docs) AS tab
        WHERE tab.distance <= threshold;
END;
$$ LANGUAGE plpgsql;
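
The distance calculation by itself, as a plain SQL function, in case that's
easier to experiment with (a sketch, untested; positional parameters because
9.0 SQL functions can't reference parameters by name, and IMMUTABLE because
the result depends only on the inputs):

CREATE OR REPLACE FUNCTION kl_divergence(float[], float[])
RETURNS float AS $$
    -- sum over the components where the stored topic weight is positive
    SELECT sum($1[i] * ln($1[i] / $2[i]))
    FROM generate_subscripts($2, 1) AS i
    WHERE $2[i] > 0
$$ LANGUAGE sql IMMUTABLE;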


--
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
-++---
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
-++---





[PERFORM] tuning on ec2

2011-04-26 Thread Joel Reymont
I'm running pgsql on an m1.large EC2 instance with 7.5 GB of available memory.

The free command shows 7 GB of free+cached. My understanding from the docs is
that I should dedicate 1.75 GB to shared_buffers (25% of RAM) and set
effective_cache_size to 7 GB.
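
Concretely, I read that as something like this in postgresql.conf (my
numbers; please correct me if that's off):

shared_buffers = 1792MB          # 1.75 GB, ~25% of RAM
effective_cache_size = 7GB       # free + cached, as reported by free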

Is this correct? I'm running 64-bit Ubuntu 10.10, e.g. 

Linux ... 2.6.35-28-virtual #50-Ubuntu SMP Fri Mar 18 19:16:26 UTC 2011 x86_64 
GNU/Linux

Thanks, Joel

--
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
-++---
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
-++---





Re: [PERFORM] Time to put theory to the test?

2011-04-26 Thread Scott Marlowe
On Tue, Apr 26, 2011 at 8:13 AM, J Sisson  wrote:
> On Mon, Apr 25, 2011 at 10:04 PM, Rob Wultsch  wrote:
>> Tip from someone that manages thousands of MySQL servers: Use InnoDB
>> when using MySQL.
>
> Granted, my knowledge of PostgreSQL (and even MSSQL) far surpasses my
> knowledge of MySQL, but if InnoDB has such amazing benefits as being
> crash safe, and even speed increases in some instances, why isn't
> InnoDB default?  I suppose the real issue is that I prefer software
> that gives me safe defaults that I can adjust towards the "unsafe" end
> as far as I'm comfortable with, rather than starting off in la-la land
> and working back towards sanity.

Because for many read-heavy workloads MyISAM is still faster.  Note
that even if you use InnoDB tables, your system catalogs are stored in
MyISAM.  The Drizzle project aims to fix such things, but I'd assume
they're still some way from full production-ready status.


Re: [PERFORM] Time to put theory to the test?

2011-04-26 Thread Kevin Grittner
J Sisson  wrote:
> Rob Wultsch  wrote:
>> Tip from someone that manages thousands of MySQL servers: Use
>> InnoDB when using MySQL.
> 
> Granted, my knowledge of PostgreSQL (and even MSSQL) far surpasses
> my knowledge of MySQL, but if InnoDB has such amazing benefits as
> being crash safe, and even speed increases in some instances, why
> isn't InnoDB default?
 
Because it's not as fast as the unsafe MyISAM implementation for most
benchmarks.
 
There is one minor gotcha in InnoDB (unless it's been fixed since
2008): the release of locks is not atomic with the persistence of
the data in the write-ahead log (which makes it S2PL but not SS2PL).
So it is possible for another connection to see data that won't be
there after crash recovery. This is justified as an optimization.
Personally, I would prefer not to see data from other transactions
until it has actually been successfully committed.
 
-Kevin


Re: [PERFORM] Time to put theory to the test?

2011-04-26 Thread J Sisson
On Mon, Apr 25, 2011 at 10:04 PM, Rob Wultsch  wrote:
> Tip from someone that manages thousands of MySQL servers: Use InnoDB
> when using MySQL.

Granted, my knowledge of PostgreSQL (and even MSSQL) far surpasses my
knowledge of MySQL, but if InnoDB has such amazing benefits as being
crash safe, and even speed increases in some instances, why isn't
InnoDB default?  I suppose the real issue is that I prefer software
that gives me safe defaults that I can adjust towards the "unsafe" end
as far as I'm comfortable with, rather than starting off in la-la land
and working back towards sanity.

I'll concede that the issues we had with MySQL were self-inflicted, since we
were using MyISAM.  Thanks for pointing this out.  Time to go get my
knowledge of MySQL up to par with my knowledge of PostgreSQL...


Re: [PERFORM] Performance

2011-04-26 Thread Claudio Freire
On Tue, Apr 26, 2011 at 7:30 AM, Robert Haas  wrote:
> On Apr 14, 2011, at 2:49 AM, Claudio Freire  wrote:
>> This particular factor is not about an abstract and opaque "Workload"
>> the server can't know about. It's about cache hit rate, and the server
>> can indeed measure that.
>
> The server can and does measure hit rates for the PG buffer pool, but to my 
> knowledge there is no clear-cut way for PG to know whether read() is 
> satisfied from the OS cache or a drive cache or the platter.

Isn't latency an indicator?

If you plot latencies, you should see three clearly separated clusters:
OS cache (microseconds), drive cache (slightly slower), and platter
(the long tail).

I think I had seen a study of sorts somewhere[0]...

Ok, that link is about sequential/random access, but I distinctly
remember one about caches and CAV...

[0] http://blogs.sun.com/brendan/entry/heat_map_analytics
