Re: [PERFORM] reducing random_page_cost from 4 to 2 to force index scan

2011-04-28 Thread Sok Ann Yap
On Thu, Apr 28, 2011 at 7:23 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Sok Ann Yap sok...@gmail.com wrote:

 Anyway, the overhead of issuing 44 extra queries means I am still
 better off sticking with the original query and tuning PostgreSQL to
 choose the index scan.

 Maybe, but what is *best* for you is to tune PostgreSQL so that your
 costs are accurately modeled, at which point it will automatically
 pick the best plan for most or all of your queries without you
 needing to worry about it.

 If you set your effective_cache_size to the sum of shared_buffers
 and what your OS reports as cache after you've been running a while,
 that will help the optimizer know what size index fits in RAM, and
 will tend to encourage index use.  If the active portion of your
 data is heavily cached, you might want to set random_page_cost and
 seq_page_cost to the same value, and make that value somewhere in
 the 0.05 to 0.1 range.  If you have moderate caching, using 1 and 2
 can be good.
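
 As a rough sketch, the heavily cached case might look like this (the
 values are illustrative only, not prescriptive):

     -- assuming shared_buffers plus the OS cache totals about 3GB
     SET effective_cache_size = '3GB';
     -- equal, low page costs when reads rarely touch the disk
     SET seq_page_cost = 0.1;
     SET random_page_cost = 0.1;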

 If you're still not getting reasonable plans, please post again with
 more information about your hardware along with the query and its
 EXPLAIN ANALYZE output.

 -Kevin


I understand the need to tune PostgreSQL properly for my use case.
What I am curious about is, for the data set I have, under what
circumstances (hardware/workload/cache status/etc) would a sequential
scan really be faster than an index scan for that particular query?

To simulate a scenario when nothing is cached, I stopped PostgreSQL,
dropped all system caches (sync; echo 3 > /proc/sys/vm/drop_caches),
restarted PostgreSQL, and ran the query. A sequential scan run took
13.70 seconds, while an index scan run took 0.34 seconds, which is
still 40 times faster.

Also, I tried increasing effective_cache_size from 512MB to 3GB (the
database size is 2+GB), and the planner still favors the sequential
scan; the estimated costs did not change at all.
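
One way to see that, as a sketch (using a per-session SET instead of
editing postgresql.conf):

    SET effective_cache_size = '3GB';
    EXPLAIN
    SELECT
        salutations.id,
        salutations.name,
        EXISTS (
            SELECT 1
            FROM contacts
            WHERE salutations.id = contacts.salutation_id
        ) AS in_use
    FROM salutations;
    -- the cost=... figures in the plan match the 512MB run exactly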



[PERFORM] VX_CONCURRENT flag on vxfs( 5.1 or later) for performance for postgresql?

2011-04-28 Thread HSIEN-WEN CHU
Dear all,

According to the VxFS documentation, when database files are on a VxFS
filesystem, performance can be significantly improved by setting the
VX_CONCURRENT cache advisory on the file.

My question is: has anyone tested this?


#include <sys/fs/vx_ioctl.h>
ioctl(fd, VX_SETCACHE, VX_CONCURRENT);


Regards

HSIEN WEN



Re: [PERFORM] reducing random_page_cost from 4 to 2 to force index scan

2011-04-28 Thread Sok Ann Yap
On Wed, Apr 27, 2011 at 8:40 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Sok Ann Yap  wrote:
 Kevin Grittner  wrote:

 Please show us your overall configuration and give a description
 of the hardware (how many of what kind of cores, how much RAM,
 what sort of storage system).

 Here's the configuration (this is just a low end laptop):

 version | PostgreSQL 9.0.4 on x86_64-pc-linux-gnu,
 compiled by GCC x86_64-pc-linux-gnu-gcc (Gentoo 4.5.2 p1.0,
 pie-0.4.5) 4.5.2, 64-bit
 checkpoint_segments | 16
 default_statistics_target | 1

 Usually overkill.  If this didn't help, you should probably change it
 back.

 effective_cache_size | 512MB
 lc_collate | en_US.UTF-8
 lc_ctype | en_US.UTF-8
 listen_addresses | *
 log_destination | syslog
 log_min_duration_statement | 0
 maintenance_work_mem | 256MB
 max_connections | 100

 You probably don't need this many connections.

 max_stack_depth | 2MB
 port | 5432
 random_page_cost | 4
 server_encoding | UTF8
 shared_buffers | 256MB
 silent_mode | on
 TimeZone | Asia/Kuala_Lumpur
 wal_buffers | 1MB
 work_mem | 32MB
 (20 rows)

 It's hard to recommend other changes without knowing the RAM on the
 system.  Knowing how many of what kind of CPUs you have would help, too.

 The thing is, the query I posted was fairly simple (I think), and
 PostgreSQL should be able to choose the 3000+ times faster index
 scan with the default random_page_cost of 4.

 It picks the plan with the lowest estimated cost.  If it's not
 picking the best plan, that's usually an indication that you need to
 adjust cost factors so that estimates better model the actual costs.

 If I need to reduce it to 2 when using a 5.4k rpm slow disk, what
 is random_page_cost = 4 good for?

 It's good for large databases with a lot of physical disk I/O.  In
 fact, in some of those cases, it needs to be higher.  In your test,
 the numbers indicate that everything was cached in RAM.  That makes
 the effective cost very low.

 Also, the odds are that you have more total cache space between the
 shared_buffers and the OS cache than the effective_cache_size
 setting, so the optimizer doesn't expect the number of cache hits
 you're getting on index usage.

 -Kevin


Thanks for the tips and explanation. I wrongly assumed the
random_page_cost value was independent of caching.

Now, let's go back to the original query:

SELECT
    salutations.id,
    salutations.name,
    EXISTS (
        SELECT 1
        FROM contacts
        WHERE salutations.id = contacts.salutation_id
    ) AS in_use
FROM salutations

If I split up the query, i.e. running this once:

SELECT
    salutations.id,
    salutations.name
FROM salutations

and then running this 44 times, once for each row:

SELECT
    EXISTS (
        SELECT 1
        FROM contacts
        WHERE contacts.salutation_id = ?
    ) AS in_use

I can see that PostgreSQL smartly picks the best plan for each value:
for common salutations (Madam, Ms, etc.), it does a sequential scan,
while for salutations that are rarely used or not used at all, it does
an index scan.
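
As a sketch, the per-value difference shows up in EXPLAIN ANALYZE (the
two salutation ids here are hypothetical stand-ins for a common and a
rarely used value):

    -- common salutation: the short sequential scan wins
    EXPLAIN ANALYZE
    SELECT
        EXISTS (
            SELECT 1
            FROM contacts
            WHERE contacts.salutation_id = 1
        ) AS in_use;

    -- rarely used salutation: the index scan wins
    EXPLAIN ANALYZE
    SELECT
        EXISTS (
            SELECT 1
            FROM contacts
            WHERE contacts.salutation_id = 44
        ) AS in_use;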

Anyway, the overhead of issuing 44 extra queries means I am still
better off sticking with the original query and tuning PostgreSQL to
choose the index scan.



Re: [PERFORM] Performance

2011-04-28 Thread Sethu Prasad
Just want to share these DBT threads:

http://archives.postgresql.org/pgsql-performance/2011-04/msg00145.php
http://sourceforge.net/mailarchive/forum.php?forum_name=osdldbt-general&max_rows=25&style=nested&viewmonth=201104



On Wed, Apr 27, 2011 at 11:55 PM, Greg Smith g...@2ndquadrant.com wrote:

 Tomas Vondra wrote:

 Hmmm, just wondering - what would be needed to build such 'workload
 library'? Building it from scratch is not feasible IMHO, but I guess
 people could provide their own scripts (as simple as 'set up a bunch
 of tables, fill them with data, run some queries') and there's a pile of
 such examples in the pgsql-performance list.



 The easiest place to start is by re-using the work already done by the TPC
 for benchmarking commercial databases.  There are ports of the TPC workloads
 to PostgreSQL available in the DBT-2, DBT-3, and DBT-5 tests; see
 http://wiki.postgresql.org/wiki/Category:Benchmarking for initial
 information on those (the page on TPC-H is quite relevant too).  I'd like to
 see all three of those DBT tests running regularly, as well as two tests
 it's possible to simulate with pgbench or sysbench:  an in-cache read-only
 test, and a write-as-fast-as-possible test.

 The main problem with re-using posts from this list for workload testing is
 getting an appropriately sized data set for them that stays relevant.  The
 nature of this sort of benchmark always includes some notion of the size of
 the database, and you get different results based on how large things are
 relative to RAM and the database parameters.  That said, some sort of
 systematic collection of hard queries would also be a very useful project
 for someone to take on.

 People show up regularly who want to play with the optimizer in some way.
  It's still possible to do that by targeting specific queries you want to
 accelerate, where it's obvious (or, more likely, hard but still
 straightforward) how to do better.  But I don't think any of these proposed
 exercises in adjusting the caching model or the default optimizer parameters
 is going anywhere without some sort of benchmarking framework for
 evaluating the results.  And the TPC tests are a reasonable place to start.
  They're a good mixed set of queries, and improving results on those does
 turn into a real commercial benefit to PostgreSQL in the future too.


 --
 Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
 PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
 PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books





Re: [PERFORM] Order of tables

2011-04-28 Thread Heikki Linnakangas

On 28.04.2011 12:20, Rishabh Kumar Jain wrote:

How must the tables be ordered in the table list of a FROM statement?


There is no difference in performance, if that's what you mean. (If not,
then the pgsql-novice or pgsql-sql mailing list would be more appropriate.)
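
A quick way to convince yourself, as a sketch with two hypothetical
tables a and b:

    EXPLAIN SELECT * FROM a, b WHERE a.id = b.a_id;
    EXPLAIN SELECT * FROM b, a WHERE a.id = b.a_id;

Both orderings produce the same plan; the planner reorders the FROM list
itself (for lists of up to from_collapse_limit items).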


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PERFORM] reducing random_page_cost from 4 to 2 to force index scan

2011-04-28 Thread Jeff Janes
On Wed, Apr 27, 2011 at 5:19 PM, Sok Ann Yap sok...@gmail.com wrote:

 I understand the need to tune PostgreSQL properly for my use case.
 What I am curious about is, for the data set I have, under what
 circumstances (hardware/workload/cache status/etc) would a sequential
 scan really be faster than an index scan for that particular query?


The sequential scan on contacts can be terminated as soon as the first
matching row is found.  If each block of the contacts table contains
one example of each salutation, then the inner sequential scan will
always be very short, and faster than an index scan.

I can engineer this to be the case by populating the table like this:

insert into contacts select (generate_series%44+1)::int from
generate_series(1,100);

Here I get the seq scan being 2.6ms while the index scan is 5.6ms.
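
For reference, a minimal end-to-end sketch of that test (the table and
index definitions here are my assumptions; only the INSERT above is from
the actual setup):

    CREATE TABLE salutations (id int PRIMARY KEY, name text);
    CREATE TABLE contacts (salutation_id int);
    CREATE INDEX contacts_salutation_id_idx ON contacts (salutation_id);

    INSERT INTO salutations
        SELECT g, 'salutation ' || g FROM generate_series(1, 44) g;
    INSERT INTO contacts
        SELECT (generate_series % 44 + 1)::int FROM generate_series(1, 100);

    ANALYZE salutations;
    ANALYZE contacts;

    -- compare the plan chosen for the inner EXISTS probe
    EXPLAIN ANALYZE
    SELECT
        salutations.id,
        EXISTS (
            SELECT 1
            FROM contacts
            WHERE salutations.id = contacts.salutation_id
        ) AS in_use
    FROM salutations;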

Predicting how far the inner scan needs to go would be quite
difficult, and I don't know how the system will do it.

However, when I create and populate simple tables based on your
description, I get the index scan being the lower estimated cost.  So
the tables I built are not sufficient to study the matter in detail.



Cheers,

Jeff



Re: [PERFORM] VX_CONCURRENT flag on vxfs( 5.1 or later) for performance for postgresql?

2011-04-28 Thread Greg Smith

On 04/27/2011 11:33 PM, HSIEN-WEN CHU wrote:

According to the VxFS documentation, when database files are on a VxFS
filesystem, performance can be significantly improved by setting the
VX_CONCURRENT cache advisory on the file.


That won't improve performance, and it's not safe either.  VX_CONCURRENT 
switches the filesystem to use direct I/O.  That's usually slower for 
PostgreSQL.  And it introduces some requirements for both block 
alignment and the application avoiding overlapping writes.  PostgreSQL 
doesn't do either, so I wouldn't expect it to be compatible with 
VX_CONCURRENT.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


