[PERFORM] query problem

2004-10-13 Thread Robin Ericsson
Hi,

I sent this to general earlier but I was redirected to performance.

The query have been running ok for quite some time, but after I did a
vacuum on the database, it's very very slow. This IN-query is only 2
ids. Before the problem that in was a subselect-query returning around
6-7 ids. The tables included in the query are described in database.txt.

status=# select count(id) from data;
  count
-
 1577621
(1 row)

status=# select count(data_id) from data_values;
  count
-
 9680931
(1 row)

I did run a new explain analyze on the query and found the attached
result. The obvious problem I see is a full index scan in
"idx_dv_data_id". I tried dropping and adding the index again, thats why
is't called "idx_data_values_data_id" in the dump.

status=# EXPLAIN ANALYZE
status-# SELECT
status-# data.entered,
status-# data.machine_id,
status-# datatemplate_intervals.template_id,
status-# data_values.value
status-# FROM
status-# data, data_values, datatemplate_intervals
status-# WHERE
status-# datatemplate_intervals.id = data_values.template_id AND
status-# data_values.data_id = data.id AND
status-# data.machine_id IN (2,3) AND
status-# current_timestamp::timestamp - interval '60 seconds' <
data.entered;



Regards,
Robin


-- 
Robin Ericsson <[EMAIL PROTECTED]>
Profecta HB
 Hash Join  (cost=28646.01..274260.15 rows=555706 width=24) (actual 
time=102323.087..102323.196 rows=5 loops=1)
   Hash Cond: ("outer".template_id = "inner".id)
   ->  Merge Join  (cost=28644.09..265922.62 rows=555706 width=24) (actual 
time=102322.632..102322.709 rows=5 loops=1)
 Merge Cond: ("outer".data_id = "inner".id)
 ->  Index Scan using idx_dv_data_id on data_values  (cost=0.00..205034.19 
rows=9580032 width=16) (actual time=17.503..86263.130 
rows=9596747 loops=1)
 ->  Sort  (cost=28644.09..28870.83 rows=90697 width=16) (actual 
time=0.829..0.835 rows=1 loops=1)
   Sort Key: data.id
   ->  Index Scan using idx_d_entered on data  (cost=0.00..20202.81 
rows=90697 width=16) (actual time=0.146..0.185 rows=1 loops=1)
 Index Cond: 'now'::text)::timestamp(6) with time 
zone)::timestamp without time zone - '00:01:00'::interval) < entered)
 Filter: ((machine_id = 2) OR (machine_id = 3))
   ->  Hash  (cost=1.74..1.74 rows=74 width=8) (actual time=0.382..0.382 rows=0 
loops=1)
 ->  Seq Scan on datatemplate_intervals  (cost=0.00..1.74 rows=74 width=8) 
(actual time=0.024..0.250 rows=74 loops=1)
 Total runtime: 102323.491 ms
(13 rows)

status=# \d data
  Table "public.data"
   Column   |Type | Modifiers
+-+
 id | integer | not null default nextval('data_seq'::text)
 updated| timestamp without time zone |
 entered| timestamp without time zone |
 machine_id | integer |
 datas  | character varying(512)[]|
Indexes:
"data_pkey" primary key, btree (id)
"idx_d_entered" btree (entered)
"idx_d_machine_id" btree (machine_id)
Foreign-key constraints:
"machine_id" FOREIGN KEY (machine_id) REFERENCES machines(id)
Triggers:
data_datestamp BEFORE INSERT OR UPDATE ON data FOR EACH ROW EXECUTE PROCEDURE 
datestamp_e()

status=# \d data_values
  Table "public.data_values"
   Column|Type | Modifiers
-+-+---
 updated | timestamp without time zone |
 entered | timestamp without time zone |
 data_id | integer |
 template_id | integer |
 value   | character varying(512)  |
Indexes:
"idx_data_values_data_id" btree (data_id)
"idx_dv_template_id" btree (template_id)
Foreign-key constraints:
"data_id" FOREIGN KEY (data_id) REFERENCES data(id)
"template_id" FOREIGN KEY (template_id) REFERENCES datatemplate_intervals(id)
Triggers:
data_values_datestamp BEFORE INSERT OR UPDATE ON data_values FOR EACH ROW EXECUTE 
PROCEDURE 
datestamp_e()

status=# \d datatemplate_intervals
  Table "public.datatemplate_intervals"
   Column|Type |  Modifiers 
-+-+--
 id  | integer | not null default 
nextval('datatemplate_intervals_seq'::text)
 updated | timestamp without time zone |
 entered | timestamp without time zone |
 machine_id  | integer |
 template_id | integer |
 interval| integer |
Indexes:
"datatemplate_intervals_pkey" primary key, btree (id)
"idx_di_machine_id" btree (machine_id)
"idx_di_template_id" btree

Re: [PERFORM] query problem

2004-10-13 Thread ken
On Wed, 2004-10-13 at 02:21, Robin Ericsson wrote:
> Hi,
> 
> I sent this to general earlier but I was redirected to performance.
> 
> The query have been running ok for quite some time, but after I did a
> vacuum on the database, it's very very slow. 

Did you do a VACUUM FULL ANALYZE on the database or just a VACUUM?  It
looks like your statistics in your query are all off which ANALYZE
should fix.



> This IN-query is only 2
> ids. Before the problem that in was a subselect-query returning around
> 6-7 ids. The tables included in the query are described in database.txt.
> 
> status=# select count(id) from data;
>   count
> -
>  1577621
> (1 row)
> 
> status=# select count(data_id) from data_values;
>   count
> -
>  9680931
> (1 row)
> 
> I did run a new explain analyze on the query and found the attached
> result. The obvious problem I see is a full index scan in
> "idx_dv_data_id". I tried dropping and adding the index again, thats why
> is't called "idx_data_values_data_id" in the dump.
> 
> status=# EXPLAIN ANALYZE
> status-# SELECT
> status-# data.entered,
> status-# data.machine_id,
> status-# datatemplate_intervals.template_id,
> status-# data_values.value
> status-# FROM
> status-# data, data_values, datatemplate_intervals
> status-# WHERE
> status-# datatemplate_intervals.id = data_values.template_id AND
> status-# data_values.data_id = data.id AND
> status-# data.machine_id IN (2,3) AND
> status-# current_timestamp::timestamp - interval '60 seconds' <
> data.entered;
> 
> 
> 
> Regards,
> Robin
> 


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] query problem

2004-10-13 Thread Tom Lane
Robin Ericsson <[EMAIL PROTECTED]> writes:
> I sent this to general earlier but I was redirected to performance.

Actually, I think I suggested that you consult the pgsql-performance
archives, where this type of problem has been hashed out before.
See for instance this thread:
http://archives.postgresql.org/pgsql-performance/2004-07/msg00169.php
particularly
http://archives.postgresql.org/pgsql-performance/2004-07/msg00175.php
http://archives.postgresql.org/pgsql-performance/2004-07/msg00184.php
http://archives.postgresql.org/pgsql-performance/2004-07/msg00185.php
which show three different ways of getting the planner to do something
sane with an index range bound like "now() - interval".

regards, tom lane

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] query problem

2004-10-13 Thread Robin Ericsson
On Wed, 2004-10-13 at 11:03 -0400, Tom Lane wrote:
> Robin Ericsson <[EMAIL PROTECTED]> writes:
> > I sent this to general earlier but I was redirected to performance.
> 
> Actually, I think I suggested that you consult the pgsql-performance
> archives, where this type of problem has been hashed out before.
> See for instance this thread:
> http://archives.postgresql.org/pgsql-performance/2004-07/msg00169.php
> particularly
> http://archives.postgresql.org/pgsql-performance/2004-07/msg00175.php
> http://archives.postgresql.org/pgsql-performance/2004-07/msg00184.php
> http://archives.postgresql.org/pgsql-performance/2004-07/msg00185.php
> which show three different ways of getting the planner to do something
> sane with an index range bound like "now() - interval".

Using exact timestamp makes the query go back as it should in speed (see
explain below). However I still have the problem using a stored
procedure or even using the "ago"-example from above.




regards,
Robin

status=# explain analyse
status-# SELECT
status-# data.entered,
status-# data.machine_id,
status-# datatemplate_intervals.template_id,
status-# data_values.value
status-# FROM
status-# data, data_values, datatemplate_intervals
status-# WHERE
status-# datatemplate_intervals.id =
data_values.template_id AND
status-# data_values.data_id = data.id AND
status-# data.machine_id IN (SELECT machine_id FROM
machine_group_xref WHERE group_id = 1) AND
status-# '2004-10-13 17:47:36.902062' < data.entered
status-# ;

QUERY PLAN
--
 Hash Join  (cost=3.09..481.28 rows=777 width=24) (actual
time=0.637..1.804 rows=57 loops=1)
   Hash Cond: ("outer".template_id = "inner".id)
   ->  Nested Loop  (cost=1.17..467.71 rows=776 width=24) (actual
time=0.212..1.012 rows=57 loops=1)
 ->  Hash IN Join  (cost=1.17..9.56 rows=146 width=16) (actual
time=0.165..0.265 rows=9 loops=1)
   Hash Cond: ("outer".machine_id = "inner".machine_id)
   ->  Index Scan using idx_d_entered on data
(cost=0.00..6.14 rows=159 width=16) (actual time=0.051..0.097 rows=10
loops=1)
 Index Cond: ('2004-10-13
17:47:36.902062'::timestamp without time zone < entered)
   ->  Hash  (cost=1.14..1.14 rows=11 width=4) (actual
time=0.076..0.076 rows=0 loops=1)
 ->  Seq Scan on machine_group_xref
(cost=0.00..1.14 rows=11 width=4) (actual time=0.017..0.054 rows=11
loops=1)
   Filter: (group_id = 1)
 ->  Index Scan using idx_data_values_data_id on data_values
(cost=0.00..3.07 rows=5 width=16) (actual time=0.018..0.047 rows=6
loops=9)
   Index Cond: (data_values.data_id = "outer".id)
   ->  Hash  (cost=1.74..1.74 rows=74 width=8) (actual time=0.382..0.382
rows=0 loops=1)
 ->  Seq Scan on datatemplate_intervals  (cost=0.00..1.74
rows=74 width=8) (actual time=0.024..0.248 rows=74 loops=1)
 Total runtime: 2.145 ms
(15 rows)



---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


[PERFORM] Free PostgreSQL Training, Philadelphia, Oct 30

2004-10-13 Thread Aaron Mulder
All,
My company (Chariot Solutions) is sponsoring a day of free
PostgreSQL training by Bruce Momjian (one of the core PostgreSQL
developers).  The day is split into 2 sessions (plus a Q&A session):

 * Mastering PostgreSQL Administration
 * PostgreSQL Performance Tuning

Registration is required, and space is limited.  The location is
Malvern, PA (suburb of Philadelphia) and it's on Saturday Oct 30.  For
more information or to register, see

http://chariotsolutions.com/postgresql.jsp

Thanks,
Aaron

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


[PERFORM] Opteron vs RHAT

2004-10-13 Thread Chris Browne
[EMAIL PROTECTED] (Matt Clark) writes:
>>As for "vendor support" for Opteron, that sure looks like a
>>trainwreck...  If you're going through IBM, then they won't want to
>>respond to any issues if you're not running a "bog-standard" RHAS/RHES
>>release from Red Hat.  And that, on Opteron, is preposterous, because
>>there's plenty of the bits of Opteron support that only ever got put
>>in Linux 2.6, whilst RHAT is still back in the 2.4 days.
>
> To be fair, they have backported a boatload of 2.6 features to their kernel:
> http://www.redhat.com/software/rhel/kernel26/
>
> And that page certainly isn't an exhaustive list...

To be fair, we keep on actually running into things that _can't_ be
backported, like fibrechannel drivers that were written to take
advantage of changes in the SCSI support in 2.6.

This sort of thing will be particularly problematic with Opteron,
where the porting efforts for AMD64 have taken place alongside the
creation of 2.6.
-- 
let name="cbbrowne" and tld="cbbrowne.com" in String.concat "@" [name;tld];;
http://www.ntlug.org/~cbbrowne/linuxxian.html
A VAX is virtually a computer, but not quite.

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Free PostgreSQL Training, Philadelphia, Oct 30

2004-10-13 Thread Bryan Encina
> All,
>   My company (Chariot Solutions) is sponsoring a day of free
> PostgreSQL training by Bruce Momjian (one of the core PostgreSQL
> developers).  The day is split into 2 sessions (plus a Q&A session):
>
>  * Mastering PostgreSQL Administration
>  * PostgreSQL Performance Tuning
>
>   Registration is required, and space is limited.  The location is
> Malvern, PA (suburb of Philadelphia) and it's on Saturday Oct 30.  For
> more information or to register, see
>
> http://chariotsolutions.com/postgresql.jsp
>
> Thanks,
>   Aaron

Wow, that's good stuff, too bad there's no one doing stuff like that in the
Los Angeles area.

-b


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


[Fwd: Re: [GENERAL] [PERFORM] query problem]

2004-10-13 Thread Robin Ericsson
Sent this to wrong list.

 Forwarded Message 
From: Robin Ericsson <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: Re: [GENERAL] [PERFORM] query problem
Date: Wed, 13 Oct 2004 18:27:20 +0200
On Wed, 2004-10-13 at 18:01 +0200, Robin Ericsson wrote:
> Using exact timestamp makes the query go back as it should in speed (see
> explain below). However I still have the problem using a stored
> procedure or even using the "ago"-example from above.

Well, changing ago() to use timestamp without time zone it goes ok in
the query. This query now takes ~2ms.

SELECT
data.entered,
data.machine_id,
datatemplate_intervals.template_id,
data_values.value
FROM
data, data_values, datatemplate_intervals
WHERE
datatemplate_intervals.id = data_values.template_id AND
data_values.data_id = data.id AND
data.machine_id IN (SELECT machine_id FROM machine_group_xref
WHERE group_id = 1) AND
ago('60 seconds') < data.entered

Using it in this procedure.
select * from get_current_machine_status('60 seconds', 1);
takes ~100s. Maybe there's some obvious wrong I do about it?

CREATE TYPE public.mstatus_holder AS
   (entered timestamp,
machine_id int4,
template_id int4,
value varchar);
CREATE OR REPLACE FUNCTION public.get_current_machine_status(interval,
int4)
  RETURNS SETOF mstatus_holder AS
'
SELECT
data.entered,
data.machine_id,
datatemplate_intervals.template_id,
data_values.value
FROM
data, data_values, datatemplate_intervals
WHERE
datatemplate_intervals.id = data_values.template_id AND
data_values.data_id = data.id AND
data.machine_id IN (SELECT machine_id FROM machine_group_xref WHERE
group_id = $2) AND
ago($1) < data.entered
'
  LANGUAGE 'sql' VOLATILE;


Regards,
Robin



---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings
-- 
Robin Ericsson <[EMAIL PROTECTED]>
Profecta HB


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Opteron vs RHAT

2004-10-13 Thread Matt Clark
> >>trainwreck...  If you're going through IBM, then they won't want to 
> >>respond to any issues if you're not running a 
> "bog-standard" RHAS/RHES 
> >>release from Red Hat.  
...> To be fair, we keep on actually running into things that 
> _can't_ be backported, like fibrechannel drivers that were 
> written to take advantage of changes in the SCSI support in 2.6.

I thought IBM had good support for SUSE?  I don't know why I thought that...


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


[PERFORM] futex results with dbt-3

2004-10-13 Thread Mark Wong
Hi guys,

I have some DBT-3 (decision support) results using Gavin's original
futex patch fix.  It's on out 8-way Pentium III Xeon systems
in our STP environment.  Definitely see some overall throughput
performance on the tests, about 15% increase, but not change with
respect to the number of context switches.

Perhaps it doesn't really address what's causing the incredible number
of context switches, but still helps.  I think I'm seeing what Gavin
has, that it seems to solves some concurrency problems on x86 platforms.

Here's results without futexes:
http://khack.osdl.org/stp/298114/

Results with futexes:
http://khack.osdl.org/stp/298115/

Mark

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Free PostgreSQL Training, Philadelphia, Oct 30

2004-10-13 Thread Aaron Glenn
On Wed, 13 Oct 2004 09:23:47 -0700, Bryan Encina
<[EMAIL PROTECTED]> wrote:
>
> Wow, that's good stuff, too bad there's no one doing stuff like that in the
> Los Angeles area.
>
> -b

That makes two of us. Hanging out with Tom, Bruce, and others at OSCON
2002 was one of the most informative and fun times I've had. That and
I could really stand to brush up on my Postgres basics


aaron.glenn

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Free PostgreSQL Training, Philadelphia, Oct 30

2004-10-13 Thread Josh Berkus
Aaron,

> That makes two of us. Hanging out with Tom, Bruce, and others at OSCON
> 2002 was one of the most informative and fun times I've had. That and
> I could really stand to brush up on my Postgres basics

You're thinking of Jan.   Tom wasn't at OSCON.  ;-)

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-13 Thread Jan Wieck
On 10/9/2004 7:20 AM, Kevin Brown wrote:
Christopher Browne wrote:
Increasing the number of cache buffers _is_ likely to lead to some
slowdowns:
 - Data that passes through the cache also passes through kernel
   cache, so it's recorded twice, and read twice...
Even worse, memory that's used for the PG cache is memory that's not
available to the kernel's page cache.  Even if the overall memory
Which underlines my previous statement, that a PG shared cache much 
larger than the high-frequently accessed data portion of the DB is 
counterproductive. Double buffering (kernel-disk-buffer plus shared 
buffer) only makes sense for data that would otherwise cause excessive 
memory copies in and out of the shared buffer. After that, in only 
lowers the memory available for disk buffers.

Jan
usage in the system isn't enough to cause some paging to disk, most
modern kernels will adjust the page/disk cache size dynamically to fit
the memory demands of the system, which in this case means it'll be
smaller if running programs need more memory for their own use.
This is why I sometimes wonder whether or not it would be a win to use
mmap() to access the data and index files -- doing so under a truly
modern OS would surely at the very least save a buffer copy (from the
page/disk cache to program memory) because the OS could instead
direcly map the buffer cache pages directly to the program's memory
space.
Since PG often has to have multiple files open at the same time, and
in a production database many of those files will be rather large, PG
would have to limit the size of the mmap()ed region on 32-bit
platforms, which means that things like the order of mmap() operations
to access various parts of the file can become just as important in
the mmap()ed case as it is in the read()/write() case (if not more
so!).  I would imagine that the use of mmap() on a 64-bit platform
would be a much, much larger win because PG would most likely be able
to mmap() entire files and let the OS work out how to order disk reads
and writes.
The biggest problem as I see it is that (I think) mmap() would have to
be made to cooperate with malloc() for virtual address space.  I
suspect issues like this have already been worked out by others,
however...


--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #
---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly


Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-13 Thread Jan Wieck
On 10/14/2004 12:22 AM, Greg Stark wrote:
Jan Wieck <[EMAIL PROTECTED]> writes:
Which would require that shared memory is not allowed to be swapped out, and
that is allowed in Linux by default IIRC, not to completely distort the entire
test.
Well if it's getting swapped out then it's clearly not being used effectively.
Is it really that easy if 3 different cache algorithms (PG cache, kernel 
buffers and swapping) are competing for the same chips?

Jan
There are APIs to bar swapping out pages and the tests could be run without
swap. I suggested it only as an experiment though, there are lots of details
between here and having it be a good configuration for production use.

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #
---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Free PostgreSQL Training, Philadelphia, Oct 30

2004-10-13 Thread Bruce Momjian
Josh Berkus wrote:
> Aaron,
> 
> > That makes two of us. Hanging out with Tom, Bruce, and others at OSCON
> > 2002 was one of the most informative and fun times I've had. That and
> > I could really stand to brush up on my Postgres basics
> 
> You're thinking of Jan.   Tom wasn't at OSCON.  ;-)

Ah, but he said 2002 and I think Tom was there that year.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-13 Thread Greg Stark
Jan Wieck <[EMAIL PROTECTED]> writes:

> Which would require that shared memory is not allowed to be swapped out, and
> that is allowed in Linux by default IIRC, not to completely distort the entire
> test.

Well if it's getting swapped out then it's clearly not being used effectively.

There are APIs to bar swapping out pages and the tests could be run without
swap. I suggested it only as an experiment though, there are lots of details
between here and having it be a good configuration for production use.

-- 
greg


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-13 Thread Jan Wieck
On 10/13/2004 11:52 PM, Greg Stark wrote:
Jan Wieck <[EMAIL PROTECTED]> writes:
On 10/8/2004 10:10 PM, Christopher Browne wrote:
> [EMAIL PROTECTED] (Josh Berkus) wrote:
>> I've been trying to peg the "sweet spot" for shared memory using
>> OSDL's equipment.  With Jan's new ARC patch, I was expecting that
>> the desired amount of shared_buffers to be greatly increased.  This
>> has not turned out to be the case.
> That doesn't surprise me.
Neither does it surprise me.
There's been some speculation that having a large shared buffers be about 50%
of your RAM is pessimal as it guarantees the OS cache is merely doubling up on
all the buffers postgres is keeping. I wonder whether there's a second sweet
spot where the postgres cache is closer to the total amount of RAM.
Which would require that shared memory is not allowed to be swapped out, 
and that is allowed in Linux by default IIRC, not to completely distort 
the entire test.

Jan
--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #
---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match


Re: [PERFORM] futex results with dbt-3

2004-10-13 Thread Neil Conway
On Thu, 2004-10-14 at 04:57, Mark Wong wrote: 
> I have some DBT-3 (decision support) results using Gavin's original
> futex patch fix.

I sent an initial description of the futex patch to the mailing lists
last week, but it never appeared (from talking to Marc I believe it
exceeded the size limit on -performance). In any case, the "futex patch"
uses the Linux 2.6 futex API to implement PostgreSQL spinlocks. The hope
is that using futexes will lead to better performance when there is
contention for spinlocks (e.g. on a busy SMP system). The original patch
was written by Stephen Hemminger at OSDL; Gavin and myself have done a
bunch of additional bugfixing and optimization, as well as added IA64
support.

I've attached a WIP copy of the patch to this email (it supports x86,
x86-64 (untested) and IA64 -- more architectures can be added at
request). I'll post a longer writeup when I submit the patch to
-patches.

> Definitely see some overall throughput performance on the tests, about
> 15% increase, but not change with respect to the number of context
> switches.

I'm glad to see that there is a performance improvement; in my own
testing on an 8-way P3 system provided by OSDL, I saw a similar
improvement in pgbench performance (50 concurrent clients, 1000
transactions each, scale factor 75; without the patch, TPS/sec was
between 180 and 185, with the patch TPS/sec was between 200 and 215).

As for context switching, there was some earlier speculation that the
patch might improve or even resolve the "CS storm" issue that some
people have experienced with SMP Xeon P4 systems. I don't think we have
enough evidence to answer this one way or the other at this point.

-Neil

Index: src/backend/storage/lmgr/s_lock.c
===
RCS file: /var/lib/cvs/pgsql/src/backend/storage/lmgr/s_lock.c,v
retrieving revision 1.32
diff -c -r1.32 s_lock.c
*** src/backend/storage/lmgr/s_lock.c	30 Aug 2004 23:47:20 -	1.32
--- src/backend/storage/lmgr/s_lock.c	13 Oct 2004 06:23:26 -
***
*** 15,26 
   */
  #include "postgres.h"
  
  #include 
- #include 
  
  #include "storage/s_lock.h"
  #include "miscadmin.h"
  
  /*
   * s_lock_stuck() - complain about a stuck spinlock
   */
--- 15,49 
   */
  #include "postgres.h"
  
+ #ifdef S_LOCK_TEST
+ #undef Assert
+ #define Assert(cond) DoAssert(cond, #cond, __FILE__, __LINE__)
+ 
+ #define DoAssert(cond, text, file, line)		\
+ 	if (!(cond))\
+ 	{			\
+ 		printf("ASSERTION FAILED! [%s], file = %s, line = %d\n", \
+ 			   text, file, line);\
+ 		abort();\
+ 	}
+ #endif
+ 
  #include 
  
  #include "storage/s_lock.h"
  #include "miscadmin.h"
  
+ #ifdef S_LOCK_TEST
+ #define LOCK_TEST_MSG()			\
+ 	do			\
+ 	{			\
+ 		fprintf(stdout, "*");	\
+ 		fflush(stdout);			\
+ 	} while (0);
+ #else
+ #define LOCK_TEST_MSG()
+ #endif
+ 
  /*
   * s_lock_stuck() - complain about a stuck spinlock
   */
***
*** 38,43 
--- 61,131 
  #endif
  }
  
+ #ifdef HAVE_FUTEX
+ /*
+  * futex_lock_contended() is similar to s_lock() for the normal TAS
+  * implementation of spinlocks. When this function is invoked, we have
+  * failed to immediately acquire the spinlock, so we should spin some
+  * number of times attempting to acquire the lock before invoking
+  * sys_futex() to have the kernel wake us up later. "val" is the
+  * current value of the mutex we saw when we tried to acquire it; it
+  * may have changed since then, of course.
+  */
+ void
+ futex_lock_contended(volatile slock_t *lock, slock_t val,
+ 	 const char *file, int line)
+ {
+ 	int loop_count = 0;
+ 
+ #define MAX_LOCK_WAIT		30
+ #define SPINS_BEFORE_WAIT	100
+ 
+ 	Assert(val != FUTEX_UNLOCKED);
+ 
+ 	if (val == FUTEX_LOCKED_NOWAITER)
+ 		val = atomic_exchange(lock, FUTEX_LOCKED_WAITER);
+ 
+ 	while (val != FUTEX_UNLOCKED)
+ 	{
+ 		static struct timespec delay = { .tv_sec = MAX_LOCK_WAIT,
+ 		 .tv_nsec = 0 };
+ 
+ 		LOCK_TEST_MSG();
+ 
+ #if defined(__i386__) || defined(__x86_64__)
+ 		/* See spin_delay() */
+ 		__asm__ __volatile__(" rep; nop\n");
+ #endif
+ 
+ 		/*
+ 		 * XXX: This code is derived from the Drepper algorithm, which
+ 		 * doesn't spin (why, I'm not sure). We should actually change
+ 		 * the lock status to "lock, with waiters" just before we wait
+ 		 * on the futex, not before we begin looping (that avoids a
+ 		 * system call when the lock is released).
+ 		 */
+ 
+ 		/* XXX: worth using __builtin_expect() here? */
+ 		if (++loop_count >= SPINS_BEFORE_WAIT)
+ 		{
+ 			loop_count = 0;
+ 			if (sys_futex(lock, FUTEX_OP_WAIT,
+ 		  FUTEX_LOCKED_WAITER, &delay))
+ 			{
+ if (errno == ETIMEDOUT)
+ 	s_lock_stuck(lock, file, line);
+ 			}
+ 		}
+ 
+ 		/*
+ 		 * Do a non-locking test before asserting the bus lock.
+ 		 */
+ 		if (*lock == FUTEX_UNLOCKED)
+ 			val = atomic_exchange(lock, FUTEX_LOCKED_WAITER);
+ 	}
+ }
+ 
+ #else
  
  /*
   * s_lock(lock) - platform-independent p

Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-13 Thread Greg Stark
Jan Wieck <[EMAIL PROTECTED]> writes:

> On 10/8/2004 10:10 PM, Christopher Browne wrote:
> 
> > [EMAIL PROTECTED] (Josh Berkus) wrote:
> >> I've been trying to peg the "sweet spot" for shared memory using
> >> OSDL's equipment.  With Jan's new ARC patch, I was expecting that
> >> the desired amount of shared_buffers to be greatly increased.  This
> >> has not turned out to be the case.
> > That doesn't surprise me.
> 
> Neither does it surprise me.

There's been some speculation that having a large shared buffers be about 50%
of your RAM is pessimal as it guarantees the OS cache is merely doubling up on
all the buffers postgres is keeping. I wonder whether there's a second sweet
spot where the postgres cache is closer to the total amount of RAM.

That configuration would have disadvantages for servers running other jobs
besides postgres. And I was led to believe earlier that postgres starts each
backend with a fairly fresh slate as far as the ARC algorithm, so it wouldn't
work well for a postgres server that had lots of short to moderate life
sessions.

But if it were even close it could be interesting. Reading the data with
O_DIRECT and having a single global cache could be interesting experiments. I
know there are arguments against each of these, but ...

I'm still pulling for an mmap approach to eliminate postgres's buffer cache
entirely in the long term, but it seems like slim odds now. But one way or the
other having two layers of buffering seems like a waste.

-- 
greg


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]