Re: [PERFORM] Shouldn't we have a way to avoid "risky" plans?

2011-03-24 Thread Віталій Тимчишин
2011/3/23 Tom Lane 

> Claudio Freire  writes:
> > On Wed, Mar 23, 2011 at 5:29 PM, Josh Berkus  wrote:
> >> On 3/23/11 10:35 AM, Claudio Freire wrote:
> >>>  *  consider plan bailout: execute a tempting plan, if it takes too
> >>> long or its effective cost raises well above the expected cost, bail
> >>> to a safer plan
>
> >> That would actually solve this particular case.  It would still require
> >> us to have some definition of "safer" though.
>
> > In my head, safer = better worst-case performance.
>
> If the planner starts operating on the basis of worst case rather than
> expected-case performance, the complaints will be far more numerous than
> they are today.
>
This can be GUC-controllable, like plan_safety=0..1 with a low default value.
This can influence the costs of plans where the cost changes dramatically with
small table changes and/or the statistics are uncertain. It can also be used as
a direct "hint" for such dangerous queries by changing the GUC for a
session/single query.


-- 
Best regards,
 Vitalii Tymchyshyn


Re: [PERFORM] pg9.0.3 explain analyze running very slow compared to a different box with much less configuration

2011-03-24 Thread Achilleas Mantzios
You might take a look here:
http://archives.postgresql.org/pgsql-admin/2011-01/msg00050.php
My problem had to do with the speed of gettimeofday. You might want to adjust
the way your box reads time from the hardware clock.

On Thursday 24 March 2011 04:04:21, DM wrote:
> Hi All,
> 
> pg9.0.3 explain analyze is running very slowly compared to an old box with a
> much smaller configuration.
> 
> But the actual query performs much better than on the old server.
> 
> old Server===
> OS: CentOS release 5.4 (Final)
> Linux Server 2.6.18-164.6.1.el5 #1 SMP Tue Nov 3 16:12:36 EST 2009 x86_64
> x86_64 x86_64 GNU/Linux
> 
> RAM - 16GB
> CPU - 8 Core
> disk - 300GB
> RAID10 on the disk
> 
> Postgresql 9.0.3
> 
> Postgres Config:
> shared_buffers = 6GB
> work_mem = 32MB
> maintenance_work_mem = 512MB
> effective_cache_size = 12GB
> 
> #explain analyze select * from photo;
>  QUERY PLAN
> 
>  Seq Scan on photo  (cost=0.00..8326849.24 rows=395405824 width=168) (actual
> time=5.632..157757.284 rows=395785382 loops=1)
>  Total runtime: 187443.850 ms
> (2 rows)
> 
> newServer===
> 
> CentOS release 5.4 (Final)
> Linux Server 2.6.18-164.6.1.el5 #1 SMP Tue Nov 3 16:12:36 EST 2009 x86_64
> x86_64 x86_64 GNU/Linux
> 
> RAM - 64GB
> CPU - 12 Core
> disk - 1TB
> RAID10 on the disk
> 
> Postgresql 9.0.3
> Postgres Config:
> shared_buffers = 16GB
> work_mem = 32MB
> maintenance_work_mem = 1024MB
> effective_cache_size = 12GB
> 
> 
> # explain analyze select * from photo;
> QUERY PLAN
> 
>  Seq Scan on photo  (cost=0.00..8326810.24 rows=395579424 width=165) (actual
> time=0.051..316879.928 rows=395648020 loops=1)
>  Total runtime: 605703.206 ms
> (2 rows)
> 
> 
> I read other articles about the same issue but could not find the exact
> solution.
> 
> 
> I ran gettimeofday() on both machines and got the below results:
> 
> Results:
> 
> *[Old Server]# time /tmp/gtod*
> 
> real  0m0.915s
> 
> user  0m0.914s
> 
> sys   0m0.001s
> 
> *[New Server]#  time /tmp/gtod*
> 
> real  0m7.542s
> 
> user  0m7.540s
> 
> sys   0m0.001s
> 
> 
> I am not sure how to fix this issue; any help would be greatly appreciated.
> 
> 
> Thanks
> 
> Deepak
> 



-- 
Achilleas Mantzios

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] buffercache/bgwriter

2011-03-24 Thread Uwe Bartels
Hi Cédric,

OK, that sounds promising. But all of these improvements are for the postgres
developers; as an administrator there is nothing I can do right now.

Thanks for your suggestions. I think for batch jobs other than just COPY they
could speed up the process quite a bit, given that right now the backend process
has to do all (or 50%) of the writes.

It would also be good to see how many buffers were written by backend
processes grouped by Buffer Access Strategy - to better distinguish evil
backend writes from wanted backend writes.
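
(For reference, the counters quoted further down in this thread come from the
pg_stat_bgwriter view, i.e. something like:

select checkpoints_timed, checkpoints_req, buffers_checkpoint,
       buffers_clean, maxwritten_clean, buffers_backend, buffers_alloc
from pg_stat_bgwriter;

A per-strategy breakdown as wished for above would need additional counters in
that view.)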

Best Regards,
Uwe

On 23 March 2011 21:23, Cédric Villemain
wrote:

> 2011/3/23 Uwe Bartels :
> > On 23 March 2011 16:36, Jeff Janes  wrote:
> >>
> >> On Wed, Mar 23, 2011 at 6:19 AM, Jochen Erwied
> >>  wrote:
> >> > Wednesday, March 23, 2011, 1:51:31 PM you wrote:
> >> >
> >> > [rearranged for quoting]
> >> >
> >> >> background writer stats
> >> >>  checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc
> >> >> -------------------+-----------------+--------------------+---------------+------------------+-----------------+---------------
> >> >>                  3 |               0 |              99754 |             0 |                0 |          115307 |        246173
> >> >> (1 row)
> >> >
> >> > buffers_clean = 0 ?!
> >> >
> >> >> But I don't understand how postgres is unable to fetch a free buffer.
> >> >> Does any body have an idea?
> >> >
> >> > Somehow looks like the bgwriter is completely disabled. How are the
> >> > relevant settings in your postgresql.conf?
> >>
> >> I suspect the work load is entirely bulk inserts, and is using a
> >> Buffer Access Strategy.  By design, bulk inserts generally write out
> >> their own buffers.
> >>
> >> Cheers,
> >>
> >> Jeff
> >
> > Yes. that's true. We are converting databases from one schema into
> another
> > with a lot of computing in between.
> > But most of the written data is accessed soon for other conversions.
> > OK. That sounds very simple and thus trustable ;).
>
> yes, it is.
>
> >
> > So everything is fine and there is no need/potential for optimization?
> >
>
> There is probably room for improvement; without thinking about it too much, I
> would suggest:
>
> * review the bufferstrategy to increase the buffer size for the pool when
> there are a lot of free buffers
> * have a bgwriter working just behind the seqscan (and probably a
> bigger pool of buffers anyway)
> * do not use the special bufferstrategy when the buffer cache has
> more than X% of free pages
> * add more :)
>
> I believe it should be possible to make a good improvement for special cases
> that are easily identifiable, like yours.
>
> --
> Cédric Villemain   2ndQuadrant
> http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
>


Re: [PERFORM] pg9.0.3 explain analyze running very slow compared to a different box with much less configuration

2011-03-24 Thread Marti Raudsepp
On Thu, Mar 24, 2011 at 11:11, Achilleas Mantzios
 wrote:
> My problem had to do with the speed of gettimeofday. You might want to do 
> some special setting regarding
> your box's way of reading time for the hw clock.

Just for extra info, on x86, TSC is usually the "fast" timeofday
implementation. On recent CPUs in single-socket configurations, TSC
should always be available, regardless of any power management. I
don't know about multi-socket. If you want to know whether your kernel
is using tsc, run:

cat /sys/devices/system/clocksource/clocksource0/current_clocksource

On older CPUs, you often had to disable some sort of power management
in order to get a stable TSC -- the "ondemand" scaling governor is the
top suspect. Disabling this is distro-specific. You have to reboot to
get the kernel to re-test TSC. Unfortunately disabling power
management later at boot doesn't help you, you have to prevent it from
activating at all.
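
For example, on many distros the active governor can be checked and switched
through sysfs (a sketch; the boot service that enables "ondemand" varies by
distro and has to be disabled as well for this to stick):

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# switch every core to the fixed-frequency governor
for g in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
  echo performance > $g
done

As noted above, doing this after boot is not enough by itself: the kernel only
re-tests TSC at boot, so the governor must be kept from ever kicking in.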

For debugging, grepping dmesg for tsc or clocksource is often helpful.
On machines with unstable TSC you'll see output like this:

[0.00] Fast TSC calibration using PIT
[0.164068] checking TSC synchronization [CPU#0 -> CPU#1]: passed.
[0.196730] Switching to clocksource tsc
[0.261347] Marking TSC unstable due to TSC halts in idle
[0.261536] Switching to clocksource acpi_pm

If you just want to get repeatable timings, you can force both
machines to use the hpet clocksource:
echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource

Marti

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] pg9.0.3 explain analyze running very slow compared to a different box with much less configuration

2011-03-24 Thread Achilleas Mantzios
On Thursday 24 March 2011 13:39:19, Marti Raudsepp wrote:
> On Thu, Mar 24, 2011 at 11:11, Achilleas Mantzios
>  wrote:
> > My problem had to do with the speed of gettimeofday. You might want to do 
> > some special setting regarding
> > your box's way of reading time for the hw clock.
> 
> Just for extra info, on x86, TSC is usually the "fast" timeofday
> implementation. On recent CPUs in single-socket configurations, TSC
> should always be available, regardless of any power management. I
> don't know about multi-socket. If you want to know whether your kernel
> is using tsc, run:
> 

That's what I am experiencing as well; on two of my FreeBSD boxes (work/home) I
get:

phenom ii X4 :
==
% sysctl -a | grep -i timecounter
kern.timecounter.tick: 1
kern.timecounter.choice: TSC(-100) HPET(900) ACPI-fast(1000) i8254(0) 
dummy(-100)
kern.timecounter.hardware: TSC
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.i8254.counter: 1960
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.ACPI-fast.mask: 4294967295
kern.timecounter.tc.ACPI-fast.counter: 3642319843
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.quality: 1000
kern.timecounter.tc.HPET.mask: 4294967295
kern.timecounter.tc.HPET.counter: 1160619197
kern.timecounter.tc.HPET.frequency: 14318180
kern.timecounter.tc.HPET.quality: 900
kern.timecounter.tc.TSC.mask: 4294967295
kern.timecounter.tc.TSC.counter: 2788277817
kern.timecounter.tc.TSC.frequency: 3400155810
kern.timecounter.tc.TSC.quality: -100
kern.timecounter.smp_tsc: 0
kern.timecounter.invariant_tsc: 1

Pentium 4
==
% sysctl -a | grep -i timecounter
kern.timecounter.tick: 1
kern.timecounter.choice: TSC(800) ACPI-fast(1000) i8254(0) dummy(-100)
kern.timecounter.hardware: ACPI-fast
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.i8254.counter: 13682
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.ACPI-fast.mask: 16777215
kern.timecounter.tc.ACPI-fast.counter: 6708142
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.quality: 1000
kern.timecounter.tc.TSC.mask: 4294967295
kern.timecounter.tc.TSC.counter: 3109326068
kern.timecounter.tc.TSC.frequency: 2663194296
kern.timecounter.tc.TSC.quality: 800
kern.timecounter.smp_tsc: 0
kern.timecounter.invariant_tsc: 0

TSC, it seems, outperforms the rest of the clocks in terms of frequency.
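
For reference, the active timecounter on FreeBSD can be switched at runtime (a
sketch; whether TSC is a good choice depends on the quality and invariant_tsc
values shown above):

% sysctl kern.timecounter.hardware=TSC

and made persistent by putting kern.timecounter.hardware=TSC in /etc/sysctl.conf.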

> cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> 
> On older CPUs, you often had to disable some sort of power management
> in order to get a stable TSC -- the "ondemand" scaling governor is the
> top suspect. Disabling this is distro-specific. You have to reboot to
> get the kernel to re-test TSC. Unfortunately disabling power
> management later at boot doesn't help you, you have to prevent it from
> activating at all.
> 
> For debugging, grepping dmesg for tsc or clocksource is often helpful.
> On machines with unstable TSC you'll see output like this:
> 
> [0.00] Fast TSC calibration using PIT
> [0.164068] checking TSC synchronization [CPU#0 -> CPU#1]: passed.
> [0.196730] Switching to clocksource tsc
> [0.261347] Marking TSC unstable due to TSC halts in idle
> [0.261536] Switching to clocksource acpi_pm
> 
> If you just want to get repeatable timings, you can force both
> machines to use the hpet clocksource:
> echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
> 
> Marti
> 



-- 
Achilleas Mantzios

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] maintenance_work_mem + create index

2011-03-24 Thread Uwe Bartels
Hi,

I see my application creating temporary files while creating an index.
LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp7076.0", size 779853824
STATEMENT:  CREATE INDEX IDX_LPA_LINKID ON NNDB.LPA (LINK_ID);

So I checked this again and afterwards raised maintenance_work_mem step by
step up to 64GB.
I logged in via psql and ran the following statements:
set maintenance_work_mem = '64GB';
CREATE INDEX IDX_LPA_LINKID ON NNDB.LPA (LINK_ID);

But I still get that evil message in the log file about a temporary file being
created.
I also raised work_mem in my session up to 32GB, again without changing the
behavior.

According to the postgres docs this is supposed to help:
http://www.postgresql.org/docs/8.4/static/populate.html#POPULATE-WORK-MEM
Any ideas?

I'm running postgres 8.4 64 bit on Linux from an enterprisedb package.
# file /var/lib/pgsql/bin/postgres
/var/lib/pgsql/bin/postgres: ELF 64-bit LSB executable, x86-64, version 1
(SYSV), for GNU/Linux 2.4.0, dynamically linked (uses shared libs), not
stripped


Best Regards,
Uwe


Re: [PERFORM] maintenance_work_mem + create index

2011-03-24 Thread Stephen Frost
Uwe,

* Uwe Bartels (uwe.bart...@gmail.com) wrote:
> So I checked this again and raised afterwards maintenance_work_mem step by
> step up 64GB.
> I logged in via psql, run the following statements
> set maintenance_work_mem = '64GB';

I believe maintenance_work_mem suffers from the same problem that
work_mem has, specifically that PG still won't allocate more than
1GB of memory for any single operation.

Thanks,

Stephen




Re: [PERFORM] maintenance_work_mem + create index

2011-03-24 Thread Uwe Bartels
OK, I didn't know that. Thanks for sharing that information.
Can anybody confirm whether maintenance_work_mem has this limitation as well?

Does anybody know of a way around that on Linux?
Or is there a dynamic way to put $PGDATA/base/pgsql_tmp into RAM without
reserving the memory completely, the way a ram disk would?

Best Regards,
Uwe

On 24 March 2011 15:13, Stephen Frost  wrote:

> Uwe,
>
> * Uwe Bartels (uwe.bart...@gmail.com) wrote:
> > So I checked this again and raised afterwards maintenance_work_mem step
> by
> > step up 64GB.
> > I logged in via psql, run the following statements
> > set maintenance_work_mem = '64GB';
>
> I believe maintenance_work_mem suffers from the same problem that
> work_mem has, specifically that PG still won't allocate more than
> 1GB of memory for any single operation.
>
>Thanks,
>
>Stephen
>
>
>


Re: [PERFORM] maintenance_work_mem + create index

2011-03-24 Thread Shaun Thomas

On 03/24/2011 09:40 AM, Uwe Bartels wrote:


Does anybody know of a solution out of that on Linux?
Or is there a dynamic way to put $PGDATA/base/pgsql_tmp into RAM without
blocking it completely like a ram disk?


We put this in our startup script just before starting the actual database:

for x in $(find ${PGDATA}/base -mindepth 1 -maxdepth 1 -type d); do
  nDBNum=${x##*/}
  sDir=${DBSHM}/${nDBNum}

  if [ ! -d "$sDir" ]; then
su -c "mkdir $sDir" - $PGUSER
  fi
done

Where PGDATA, DBSHM, and PGUSER are all set in 
/etc/sysconfig/postgresql. But DBSHM defaults to /dev/shm/pgsql_tmp on 
our Linux box.


Basically what this does is ensure a directory exists for each of your 
databases in shared memory. Then all we did was symlink the pgsql_tmp 
folder to point to those shared-memory directories. Many systems default 
to allowing up to half of total RAM to be used this way, so we're not at 
any risk with 64GB on our main nodes.
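
The symlink step itself isn't shown; for the default tablespace the general
idea (a sketch, not necessarily the exact layout used here) is something like:

mv ${PGDATA}/base/pgsql_tmp ${PGDATA}/base/pgsql_tmp.disk   # if it already exists
ln -s ${DBSHM} ${PGDATA}/base/pgsql_tmp

after which sort/temp files land on tmpfs-backed memory instead of disk.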


We already run a custom init.d script anyway because we needed something 
LSB compatible for Pacemaker. I highly recommend it. :)


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
stho...@peak6.com


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] maintenance_work_mem + create index

2011-03-24 Thread Uwe Bartels
OK, that sounds promising. On my machine this looks similar.
I'll try this.

Thanks,
Uwe


On 24 March 2011 16:14, Shaun Thomas  wrote:

> On 03/24/2011 09:40 AM, Uwe Bartels wrote:
>
>  Does anybody know of a solution out of that on Linux?
>> Or is there a dynamic way to put $PGDATA/base/pgsql_tmp into RAM without
>> blocking it completely like a ram disk?
>>
>
> We put this in our startup script just before starting the actual database:
>
> for x in $(find ${PGDATA}/base -mindepth 1 -maxdepth 1 -type d); do
>  nDBNum=${x##*/}
>  sDir=${DBSHM}/${nDBNum}
>
>  if [ ! -d "$sDir" ]; then
>su -c "mkdir $sDir" - $PGUSER
>  fi
> done
>
> Where PGDATA, DBSHM, and PGUSER are all set in /etc/sysconfig/postgresql.
> But DBSHM defaults to /dev/shm/pgsql_tmp on our Linux box.
>
> Basically what this does is ensures a directory exists for each of your
> databases in shared memory. Then all we did was symlink the pgsql_tmp folder
> to point to those shared-memory directories. Many systems default so that up
> to half of total RAM can be used this way, so we're not at any risk with
> 64GB on our main nodes.
>
> We already run a custom init.d script anyway because we needed something
> LSB compatible for Pacemaker. I highly recommend it. :)
>
> --
> Shaun Thomas
> OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
> 312-676-8870
> stho...@peak6.com
>


Re: [PERFORM] maintenance_work_mem + create index

2011-03-24 Thread Shaun Thomas

On 03/24/2011 10:28 AM, Uwe Bartels wrote:


OK. sounds promising. On my machine this looks similar.
I'll try this.


I just realized I may have implied that DBSHM automatically defaults to 
/dev/shm/pgsql_tmp. It doesn't. I also have this at the very top of our 
/etc/init.d/postgresql script:


if [ -f /etc/sysconfig/postgresql ]; then
  source /etc/sysconfig/postgresql
fi

DBSHM=${DBSHM:-/dev/shm/pgsql_tmp}
PGDATA=${PGDATA:-"/db/data/pgdata"}
PGUSER=${PGUSER:-postgres}

DBSHM doesn't exist by default, and the other vars will probably be empty 
unless you set them in the sysconfig file. What I meant was that /dev/shm 
itself automatically exists on our Linux box and we make use of it. :)


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
stho...@peak6.com


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] maintenance_work_mem + create index

2011-03-24 Thread Euler Taveira de Oliveira

On 24-03-2011 11:40, Uwe Bartels wrote:

Or is there a dynamic way to put $PGDATA/base/pgsql_tmp into RAM without
blocking it completely like a ram disk?


Create a tablespace in a ram disk and set temp_tablespaces.
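
A minimal sketch (mount point, size and names are just placeholders, and the
directory has to be recreated and re-owned after each reboot since tmpfs is
volatile):

# as root
mkdir -p /mnt/pg_ram_tmp
mount -t tmpfs -o size=16g tmpfs /mnt/pg_ram_tmp
chown postgres:postgres /mnt/pg_ram_tmp

-- then as a postgres superuser
CREATE TABLESPACE ram_tmp LOCATION '/mnt/pg_ram_tmp';
SET temp_tablespaces = 'ram_tmp';   -- per session, or set it in postgresql.conf

On 8.4 temp_tablespaces covers temporary files used for sorts as well as
temporary tables, which is what matters for the CREATE INDEX case above.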


--
  Euler Taveira de Oliveira
  http://www.timbira.com/

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Shouldn't we have a way to avoid "risky" plans?

2011-03-24 Thread Merlin Moncure
2011/3/24 Віталій Тимчишин :
> 2011/3/23 Tom Lane 
>>
>> Claudio Freire  writes:
>> > On Wed, Mar 23, 2011 at 5:29 PM, Josh Berkus  wrote:
>> >> On 3/23/11 10:35 AM, Claudio Freire wrote:
>> >>>  *  consider plan bailout: execute a tempting plan, if it takes too
>> >>> long or its effective cost raises well above the expected cost, bail
>> >>> to a safer plan
>>
>> >> That would actually solve this particular case.  It would still require
>> >> us to have some definition of "safer" though.
>>
>> > In my head, safer = better worst-case performance.
>>
>> If the planner starts operating on the basis of worst case rather than
>> expected-case performance, the complaints will be far more numerous than
>> they are today.
>>
> This can se GUC-controllable. Like plan_safety=0..1 with low default value.
> This can influence costs of plans where cost changes dramatically with small
> table changes and/or statistics is uncertain. Also this can be used as
> direct "hint" for such dangerous queries by changing GUC for session/single
> query.

ISTM if you add statistics miss and 'risk margin' to the things the
planner would have to consider while generating a plan, you are
greatly increasing the number of plan paths that would have to be
considered for any non trivial query.

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Shouldn't we have a way to avoid "risky" plans?

2011-03-24 Thread Nathan Boley
>> This can se GUC-controllable. Like plan_safety=0..1 with low default value.
>> This can influence costs of plans where cost changes dramatically with small
>> table changes and/or statistics is uncertain. Also this can be used as
>> direct "hint" for such dangerous queries by changing GUC for session/single
>> query.
>
> ISTM if you add statistics miss and 'risk margin' to the things the
> planner would have to consider while generating a plan, you are
> greatly increasing the number of plan paths that would have to be
> considered for any non trivial query.


FWIW, these ideas are well established in the statistical community.

Currently, we are essentially using "maximum likelihood estimators".
We estimate a bunch of selectivities by choosing what is most likely,
plug them in to our objective function, and then minimize based upon
the plugged in values. In a wide variety of cases, MLE's can be shown
to be "asymptotically" optimal. That is, as our sample distribution
approaches the true distribution, the best we can possibly do is to
use the MLE. This is pretty sensible - if we actually knew all of the
selectivities then the results aren't really random anymore. However,
they often perform very poorly with small sample sizes - particularly
if the loss function is very sensitive to relatively small
fluctuations in the parameter estimates ( which postgres certainly is
- switching from a hash join to a nest-loop can be disastrous ).

Using the estimate that minimizes the "worst-case" performance is
precisely a minimax estimator. There, the goal is to minimize the risk
function ( iow, plan cost ) under the worst possible conditions.
Wikipedia has a pretty good treatment - just think "plan cost"
whenever you see "risk".

Another approach, that hasn't been suggested yet, is some Bayesian
update method. There, rather than calculating a specific parameter
value ( like ndistinct ), you try to store the entire distribution and
choose the plan that minimizes cost averaged over all of the possible
parameter values.
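
Restating that in symbols, before the numbers below: for a candidate plan p
whose unknown selectivity s takes the values s_i with probability P(s_i),

    cost_bayes(p)   = \sum_i P(s_i) * cost(p | s = s_i)
    cost_minimax(p) = \max_i cost(p | s = s_i)

and the planner picks the plan that minimizes whichever criterion is in use.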

Example: ( please excuse the unrealistic numbers )

For instance, rather than estimate the selectivity of the join (
relation1.col1 = relation2.col1 ) to be 0.01, we would say it is 0.1
w/ probability 0.2 and 0.001 with probability 0.8. So, here is how we
would choose the plan now:

cost( nestloop | selectivity = 0.01 ) = 1
cost( hashjoin | selectivity = 0.01 ) = 2
cost( mergejoin | selectivity = 0.01 ) = 50

Here would be the bayesian approach:

cost( nestloop | selectivity = 0.001 ) = 0.1
cost( hashjoin | selectivity = 0.001 ) = 1
cost( mergejoin | selectivity = 0.001 ) = 50

cost( nestloop | selectivity = 0.1 ) = 10
cost( hashjoin | selectivity = 0.1 ) = 3
cost( mergejoin | selectivity = 0.1 ) = 50

So, the bayesian costs are:

nestloop: 0.1*0.8 + 10*0.2 = 2.08
hashjoin: 1.0*0.8 + 3*0.2 = 1.4
mergejoin: 50*0.8 + 50*0.2 = 50

so the hashjoin would be chosen.

For completeness, the minimax costs would be:

nestloop: max( 0.1, 10 )
hashjoin: max( 1, 3   )
mergejoin: max( 50, 50 )

So, again, the hashjoin is chosen.

I obviously have a bias towards the Bayesian approach, but it's not
because I expect it to necessarily perform a whole lot better but,
rather, it reduces to the other two approaches. If we want the current
behavior, then simply store the MLE selectivity with probability 1. If
we want the minimax estimate, choose the worst possible value. Or
anything in between.

Also, ( not that I have even close to the experience / expertise to
make this claim - so take this with a grain of salt ) it seems that
the code changes would be substantial but pretty straightforward and
easy to localize. Rather than passing a selectivity, pass a pair of
arrays with selectivities and probabilities. Initially, we could keep
the current estimates ( every passed array would be of length one )
and then make changes as problems appear ( like Josh's ).

I hope my little estimation procedure tutorial has been a little
helpful, please feel free to contact me off list if you have
questions/want references.

Best,
Nathan Boley

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Shouldn't we have a way to avoid "risky" plans?

2011-03-24 Thread Claudio Freire
On Thu, Mar 24, 2011 at 5:30 PM, Nathan Boley  wrote:
> Another approach, that hasn't been suggested yet, is some Bayesian
> update method. There, rather than calculating a specific parameter
> value ( like ndistinct ), you try to store the entire distribution and
> choose the plan that minimizes cost averaged over all of the possible
> parameter values.

I've done similar stuff for work; you don't have to go all the way to
storing complete probability distributions, since usually a simple
likelihood range is enough.

In essence, instead of having a scalar MLE for plan cost, you
implement a "ranged" estimator, that estimates the most-likely range
of plan costs, with mean and standard deviation from mean.

This essentially gives a risk value, since risky plans will have very
large standard deviations from the mean.

> Also, ( not that I have even close to the experience / expertise to
> make this claim - so take this with a grain of salt ) it seems that
> the code changes would be substantial but pretty straightforward and
> easy to localize. Rather than passing a selectivity, pass a pair of
> arrays with selectivities and probabilities.

If you approximate the probability distributions as I outlined above,
it's even simpler. Approximate, but simpler - and since you retain the
original cost estimations in the form of mean cost values, you can
easily tune the GEQO to perform as it currently does (ignore variance)
or with a safety margin (account for variance).
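
One concrete way to express that knob (a sketch; k stands in for the proposed
GUC-style safety factor):

    rank(p) = cost_mean(p) + k * cost_stddev(p)

where k = 0 reproduces today's behavior and larger k increasingly penalizes
high-variance, i.e. risky, plans.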


About issues like these being uncommon - I disagree.

I routinely have to work around query inefficiencies because GEQO does
something odd - and since postgres gives me too few tools to tweak
plans (increase statistics, use subqueries, rephrase joins, no direct
tool before CTEs which are rather new), it becomes an art form, and it
becomes very unpredictable and an administrative burden. Out of the
blue, statistics change, queries that worked fine start to perform
poorly, and sites go down.

If GEQO could detect unsafe plans and work around them automatically,
it would be a major improvement.

Granted, though, this should be approached with care.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] pg9.0.3 explain analyze running very slow compared to a different box with much less configuration

2011-03-24 Thread DM
Thank you for your research and for posting it; when I first encountered this
issue your posting gave me great insight.

gettimeofday() on my new box is slow. After further research we found that when
we set ACPI=off we got good clock performance and explain analyze gave
approximately the right values, but hyperthreading is then off.

Could you guide me on how to set the current_clocksource parameter to tsc?


Thanks
Deepak
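
For reference, building on the commands Marti posted earlier, a sketch of how
tsc is usually selected on Linux (whether the kernel accepts it depends on it
considering TSC stable, which is why the power-management settings matter):

# list what the kernel will accept, and what is active now
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
cat /sys/devices/system/clocksource/clocksource0/current_clocksource

# switch at runtime (not persistent across reboots)
echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource

# or request it at boot by appending a kernel parameter, e.g. clocksource=tsc,
# to the kernel line in the bootloader configuration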

On Thu, Mar 24, 2011 at 5:07 AM, Achilleas Mantzios <
ach...@matrix.gatewaynet.com> wrote:

> On Thursday 24 March 2011 13:39:19, Marti Raudsepp wrote:
> > On Thu, Mar 24, 2011 at 11:11, Achilleas Mantzios
> >  wrote:
> > > My problem had to do with the speed of gettimeofday. You might want to
> do some special setting regarding
> > > your box's way of reading time for the hw clock.
> >
> > Just for extra info, on x86, TSC is usually the "fast" timeofday
> > implementation. On recent CPUs in single-socket configurations, TSC
> > should always be available, regardless of any power management. I
> > don't know about multi-socket. If you want to know whether your kernel
> > is using tsc, run:
> >
>
> That's what i am experiencing as well, in two of my FreeBSD boxes
> (work/home) i get:
>
> phenom ii X4 :
> ==
> % sysctl -a | grep -i timecounter
> kern.timecounter.tick: 1
> kern.timecounter.choice: TSC(-100) HPET(900) ACPI-fast(1000) i8254(0)
> dummy(-100)
> kern.timecounter.hardware: TSC
> kern.timecounter.stepwarnings: 0
> kern.timecounter.tc.i8254.mask: 65535
> kern.timecounter.tc.i8254.counter: 1960
> kern.timecounter.tc.i8254.frequency: 1193182
> kern.timecounter.tc.i8254.quality: 0
> kern.timecounter.tc.ACPI-fast.mask: 4294967295
> kern.timecounter.tc.ACPI-fast.counter: 3642319843
> kern.timecounter.tc.ACPI-fast.frequency: 3579545
> kern.timecounter.tc.ACPI-fast.quality: 1000
> kern.timecounter.tc.HPET.mask: 4294967295
> kern.timecounter.tc.HPET.counter: 1160619197
> kern.timecounter.tc.HPET.frequency: 14318180
> kern.timecounter.tc.HPET.quality: 900
> kern.timecounter.tc.TSC.mask: 4294967295
> kern.timecounter.tc.TSC.counter: 2788277817
> kern.timecounter.tc.TSC.frequency: 3400155810
> kern.timecounter.tc.TSC.quality: -100
> kern.timecounter.smp_tsc: 0
> kern.timecounter.invariant_tsc: 1
>
> Pentium 4
> ==
> % sysctl -a | grep -i timecounter
> kern.timecounter.tick: 1
> kern.timecounter.choice: TSC(800) ACPI-fast(1000) i8254(0) dummy(-100)
> kern.timecounter.hardware: ACPI-fast
> kern.timecounter.stepwarnings: 0
> kern.timecounter.tc.i8254.mask: 65535
> kern.timecounter.tc.i8254.counter: 13682
> kern.timecounter.tc.i8254.frequency: 1193182
> kern.timecounter.tc.i8254.quality: 0
> kern.timecounter.tc.ACPI-fast.mask: 16777215
> kern.timecounter.tc.ACPI-fast.counter: 6708142
> kern.timecounter.tc.ACPI-fast.frequency: 3579545
> kern.timecounter.tc.ACPI-fast.quality: 1000
> kern.timecounter.tc.TSC.mask: 4294967295
> kern.timecounter.tc.TSC.counter: 3109326068
> kern.timecounter.tc.TSC.frequency: 2663194296
> kern.timecounter.tc.TSC.quality: 800
> kern.timecounter.smp_tsc: 0
> kern.timecounter.invariant_tsc: 0
>
> TSC, it seems, outperform the rest of clocks in terms of frequency.
>
> > cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> >
> > On older CPUs, you often had to disable some sort of power management
> > in order to get a stable TSC -- the "ondemand" scaling governor is the
> > top suspect. Disabling this is distro-specific. You have to reboot to
> > get the kernel to re-test TSC. Unfortunately disabling power
> > management later at boot doesn't help you, you have to prevent it from
> > activating at all.
> >
> > For debugging, grepping dmesg for tsc or clocksource is often helpful.
> > On machines with unstable TSC you'll see output like this:
> >
> > [0.00] Fast TSC calibration using PIT
> > [0.164068] checking TSC synchronization [CPU#0 -> CPU#1]: passed.
> > [0.196730] Switching to clocksource tsc
> > [0.261347] Marking TSC unstable due to TSC halts in idle
> > [0.261536] Switching to clocksource acpi_pm
> >
> > If you just want to get repeatable timings, you can force both
> > machines to use the hpet clocksource:
> > echo hpet >
> /sys/devices/system/clocksource/clocksource0/current_clocksource
> >
> > Marti
> >
>
>
>
> --
> Achilleas Mantzios
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>


[PERFORM] Why Index is not used

2011-03-24 Thread Adarsh Sharma

Dear all,

Today a query was run internally from my application by more than 
10 connections.


But the query performed very badly. The data sizes of the tables are as follows:

pdc_uima=#  select pg_size_pretty(pg_total_relation_size('clause2'));
pg_size_pretty

5858 MB
(1 row)

pdc_uima=#  select pg_size_pretty(pg_total_relation_size('svo2'));  
pg_size_pretty


4719 MB
(1 row)


I explained the query after creating the indexes:

pdc_uima=# explain select c.clause, s.* from clause2 c, svo2 s where 
c.clause_id=s.clause_id and s.doc_id=c.source_id and c.

pdc_uima-# sentence_id=s.sentence_id ;
                                  QUERY PLAN
------------------------------------------------------------------------------

Merge Join  (cost=5673831.05..34033959.87 rows=167324179 width=2053)
  Merge Cond: ((s.clause_id = c.clause_id) AND (s.doc_id = c.source_id) 
AND (s.sentence_id = c.sentence_id))
  ->  Index Scan using idx_svo2 on svo2 s  (cost=0.00..24489343.65 
rows=27471560 width=1993)

  ->  Materialize  (cost=5673828.74..6071992.29 rows=31853084 width=72)
->  Sort  (cost=5673828.74..5753461.45 rows=31853084 width=72)
  Sort Key: c.clause_id, c.source_id, c.sentence_id
  ->  Seq Scan on clause2 c  (cost=0.00..770951.84 
rows=31853084 width=72)




Indexes are :

CREATE INDEX idx_clause  ON clause2  USING btree  (clause_id, source_id, 
sentence_id);
CREATE INDEX idx_svo2  ON svo2  USING btree (clause_id, doc_id, 
sentence_id);


I don't know why it does not use an index scan for the clause2 table.

Any suggestions to tune the query?


Thanks & best Regards,
Adarsh Sharma

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Why Index is not used

2011-03-24 Thread Andreas Kretschmer
Adarsh Sharma  wrote:

> Dear all,
>
> Today I got to run a query internally from my application by more than  
> 10 connections.
>
> But The query performed very badly. A the data size of tables are as :
>
> pdc_uima=#  select pg_size_pretty(pg_total_relation_size('clause2'));
> pg_size_pretty
> 
> 5858 MB
> (1 row)
>
> pdc_uima=#  select pg_size_pretty(pg_total_relation_size('svo2'));   
> pg_size_pretty
> 
> 4719 MB
> (1 row)
>
>
> I explain the query as after making the  indexes as :
>
> pdc_uima=# explain select c.clause, s.* from clause2 c, svo2 s where  
> c.clause_id=s.clause_id and s.doc_id=c.source_id and c.
> pdc_uima-# sentence_id=s.sentence_id ;
>  QUERY PLAN   
>
> --
> Merge Join  (cost=5673831.05..34033959.87 rows=167324179 width=2053)
>   Merge Cond: ((s.clause_id = c.clause_id) AND (s.doc_id = c.source_id)  
> AND (s.sentence_id = c.sentence_id))
>   ->  Index Scan using idx_svo2 on svo2 s  (cost=0.00..24489343.65  
> rows=27471560 width=1993)
>   ->  Materialize  (cost=5673828.74..6071992.29 rows=31853084 width=72)
> ->  Sort  (cost=5673828.74..5753461.45 rows=31853084 width=72)
>   Sort Key: c.clause_id, c.source_id, c.sentence_id
>   ->  Seq Scan on clause2 c  (cost=0.00..770951.84  
> rows=31853084 width=72)
>
>
>
> Indexes are :
>
> CREATE INDEX idx_clause  ON clause2  USING btree  (clause_id, source_id,  
> sentence_id);
> CREATE INDEX idx_svo2  ON svo2  USING btree (clause_id, doc_id,  
> sentence_id);
>
> I don't know why it not uses the index scan for clause2 table.

How many rows contains clause2? The planner expected 167324179 returning
rows, can you run the same explain with ANALYSE to see the real amount
of returning rows?


Andreas
-- 
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect.  (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly."   (unknown)
Kaufbach, Saxony, Germany, Europe.  N 51.05082°, E 13.56889°

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Why Index is not used

2011-03-24 Thread Adarsh Sharma


Thanks Andreas. I was about to post the output, but it takes quite a lot of time.

Below is the output of the explain analyze command:
pdc_uima=# explain analyze select c.clause, s.* from clause2 c, svo2 s 
where c.clause_id=s.clause_id and s.doc_id=c.source_id and c.

pdc_uima-# sentence_id=s.sentence_id ;

                                  QUERY PLAN

Merge Join  (cost=5673831.05..34033959.87 rows=167324179 width=2053) 
(actual time=216281.162..630721.636 rows=30473117 loops=1)
  Merge Cond: ((s.clause_id = c.clause_id) AND (s.doc_id = c.source_id) 
AND (s.sentence_id = c.sentence_id))
  ->  Index Scan using idx_svo2 on svo2 s  (cost=0.00..24489343.65 
rows=27471560 width=1993) (actual time=0.130..177599.310 rows=27471560 
loops=1)
  ->  Materialize  (cost=5673828.74..6071992.29 rows=31853084 width=72) 
(actual time=216280.596..370507.452 rows=52037763 loops=1)
->  Sort  (cost=5673828.74..5753461.45 rows=31853084 width=72) 
(actual time=216280.591..324707.956 rows=31853083 loops=1)

  Sort Key: c.clause_id, c.source_id, c.sentence_id
  Sort Method:  external merge  Disk: 2616520kB
  ->  Seq Scan on clause2 c  (cost=0.00..770951.84 
rows=31853084 width=72) (actual time=0.025..25018.665 rows=31853083 loops=1)

Total runtime: 647804.037 ms
(9 rows)


Thanks, Adarsh

Andreas Kretschmer wrote:

Adarsh Sharma  wrote:

  

Dear all,

Today I got to run a query internally from my application by more than  
10 connections.


But The query performed very badly. A the data size of tables are as :

pdc_uima=#  select pg_size_pretty(pg_total_relation_size('clause2'));
pg_size_pretty

5858 MB
(1 row)

pdc_uima=#  select pg_size_pretty(pg_total_relation_size('svo2'));   
pg_size_pretty


4719 MB
(1 row)


I explain the query as after making the  indexes as :

pdc_uima=# explain select c.clause, s.* from clause2 c, svo2 s where  
c.clause_id=s.clause_id and s.doc_id=c.source_id and c.

pdc_uima-# sentence_id=s.sentence_id ;
 QUERY PLAN   
   
--

Merge Join  (cost=5673831.05..34033959.87 rows=167324179 width=2053)
  Merge Cond: ((s.clause_id = c.clause_id) AND (s.doc_id = c.source_id)  
AND (s.sentence_id = c.sentence_id))
  ->  Index Scan using idx_svo2 on svo2 s  (cost=0.00..24489343.65  
rows=27471560 width=1993)

  ->  Materialize  (cost=5673828.74..6071992.29 rows=31853084 width=72)
->  Sort  (cost=5673828.74..5753461.45 rows=31853084 width=72)
  Sort Key: c.clause_id, c.source_id, c.sentence_id
  ->  Seq Scan on clause2 c  (cost=0.00..770951.84  
rows=31853084 width=72)




Indexes are :

CREATE INDEX idx_clause  ON clause2  USING btree  (clause_id, source_id,  
sentence_id);
CREATE INDEX idx_svo2  ON svo2  USING btree (clause_id, doc_id,  
sentence_id);


I don't know why it not uses the index scan for clause2 table.



How many rows contains clause2? The planner expected 167324179 returning
rows, can you run the same explain with ANALYSE to see the real amount
of returning rows?


Andreas
  




Re: [PERFORM] Why Index is not used

2011-03-24 Thread Chetan Suttraway
On Fri, Mar 25, 2011 at 12:05 PM, Adarsh Sharma wrote:

> Dear all,
>
> Today I got to run a query internally from my application by more than 10
> connections.
>
> But The query performed very badly. A the data size of tables are as :
>
> pdc_uima=#  select pg_size_pretty(pg_total_relation_size('clause2'));
> pg_size_pretty
> 
> 5858 MB
> (1 row)
>
> pdc_uima=#  select pg_size_pretty(pg_total_relation_size('svo2'));
>  pg_size_pretty
> 
> 4719 MB
> (1 row)
>
>
> I explain the query as after making the  indexes as :
>
> pdc_uima=# explain select c.clause, s.* from clause2 c, svo2 s where
> c.clause_id=s.clause_id and s.doc_id=c.source_id and c.
> pdc_uima-# sentence_id=s.sentence_id ;
> QUERY PLAN
>
> --
> Merge Join  (cost=5673831.05..34033959.87 rows=167324179 width=2053)
>  Merge Cond: ((s.clause_id = c.clause_id) AND (s.doc_id = c.source_id) AND
> (s.sentence_id = c.sentence_id))
>  ->  Index Scan using idx_svo2 on svo2 s  (cost=0.00..24489343.65
> rows=27471560 width=1993)
>  ->  Materialize  (cost=5673828.74..6071992.29 rows=31853084 width=72)
>->  Sort  (cost=5673828.74..5753461.45 rows=31853084 width=72)
>  Sort Key: c.clause_id, c.source_id, c.sentence_id
>  ->  Seq Scan on clause2 c  (cost=0.00..770951.84 rows=31853084
> width=72)
>
>
>
> Indexes are :
>
> CREATE INDEX idx_clause  ON clause2  USING btree  (clause_id, source_id,
> sentence_id);
> CREATE INDEX idx_svo2  ON svo2  USING btree (clause_id, doc_id,
> sentence_id);
>
> I don't know why it not uses the index scan for clause2 table.
>
>
In this case, there are no predicates or filters on the individual tables
(maybe something like c.source_id=10), so either of the two tables will have
to go for a simple scan.

Are you expecting a seq scan on svo2 and an index scan on clause2?
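
For illustration, the kind of filtered query being described would look like
this (the value 10 is purely a placeholder):

explain select c.clause, s.* from clause2 c, svo2 s
 where c.clause_id = s.clause_id
   and s.doc_id = c.source_id
   and c.sentence_id = s.sentence_id
   and c.source_id = 10;

With a selective predicate like that, the planner can restrict one table first
and has a reason to probe the other one through its index, rather than
merge-joining both tables in full.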

-- 
Regards,
Chetan Suttraway
EnterpriseDB, The Enterprise PostgreSQL company.