[PERFORM] query problem
Hi,

I sent this to general earlier but I was redirected to performance.

The query has been running OK for quite some time, but after I did a vacuum
on the database it's very, very slow. This IN-query is only 2 ids; before
the problem the IN was a subselect returning around 6-7 ids. The tables
included in the query are described in database.txt.

status=# select count(id) from data;
  count
---------
 1577621
(1 row)

status=# select count(data_id) from data_values;
  count
---------
 9680931
(1 row)

I did run a new explain analyze on the query and found the attached result.
The obvious problem I see is a full index scan on "idx_dv_data_id". I tried
dropping and adding the index again; that's why it's called
"idx_data_values_data_id" in the dump.

status=# EXPLAIN ANALYZE
status-# SELECT
status-#         data.entered,
status-#         data.machine_id,
status-#         datatemplate_intervals.template_id,
status-#         data_values.value
status-# FROM
status-#         data, data_values, datatemplate_intervals
status-# WHERE
status-#         datatemplate_intervals.id = data_values.template_id AND
status-#         data_values.data_id = data.id AND
status-#         data.machine_id IN (2,3) AND
status-#         current_timestamp::timestamp - interval '60 seconds' < data.entered;

Regards,
Robin
--
Robin Ericsson <[EMAIL PROTECTED]>
Profecta HB

 Hash Join  (cost=28646.01..274260.15 rows=555706 width=24) (actual time=102323.087..102323.196 rows=5 loops=1)
   Hash Cond: ("outer".template_id = "inner".id)
   ->  Merge Join  (cost=28644.09..265922.62 rows=555706 width=24) (actual time=102322.632..102322.709 rows=5 loops=1)
         Merge Cond: ("outer".data_id = "inner".id)
         ->  Index Scan using idx_dv_data_id on data_values  (cost=0.00..205034.19 rows=9580032 width=16) (actual time=17.503..86263.130 rows=9596747 loops=1)
         ->  Sort  (cost=28644.09..28870.83 rows=90697 width=16) (actual time=0.829..0.835 rows=1 loops=1)
               Sort Key: data.id
               ->  Index Scan using idx_d_entered on data  (cost=0.00..20202.81 rows=90697 width=16) (actual time=0.146..0.185 rows=1 loops=1)
                     Index Cond: (((('now'::text)::timestamp(6) with time zone)::timestamp without time zone - '00:01:00'::interval) < entered)
                     Filter: ((machine_id = 2) OR (machine_id = 3))
   ->  Hash  (cost=1.74..1.74 rows=74 width=8) (actual time=0.382..0.382 rows=0 loops=1)
         ->  Seq Scan on datatemplate_intervals  (cost=0.00..1.74 rows=74 width=8) (actual time=0.024..0.250 rows=74 loops=1)
 Total runtime: 102323.491 ms
(13 rows)

status=# \d data
                                Table "public.data"
   Column   |            Type             |                 Modifiers
------------+-----------------------------+--------------------------------------------
 id         | integer                     | not null default nextval('data_seq'::text)
 updated    | timestamp without time zone |
 entered    | timestamp without time zone |
 machine_id | integer                     |
 datas      | character varying(512)[]    |
Indexes:
    "data_pkey" primary key, btree (id)
    "idx_d_entered" btree (entered)
    "idx_d_machine_id" btree (machine_id)
Foreign-key constraints:
    "machine_id" FOREIGN KEY (machine_id) REFERENCES machines(id)
Triggers:
    data_datestamp BEFORE INSERT OR UPDATE ON data FOR EACH ROW EXECUTE PROCEDURE datestamp_e()

status=# \d data_values
                   Table "public.data_values"
   Column    |            Type             | Modifiers
-------------+-----------------------------+-----------
 updated     | timestamp without time zone |
 entered     | timestamp without time zone |
 data_id     | integer                     |
 template_id | integer                     |
 value       | character varying(512)      |
Indexes:
    "idx_data_values_data_id" btree (data_id)
    "idx_dv_template_id" btree (template_id)
Foreign-key constraints:
    "data_id" FOREIGN KEY (data_id) REFERENCES data(id)
    "template_id" FOREIGN KEY (template_id) REFERENCES datatemplate_intervals(id)
Triggers:
    data_values_datestamp BEFORE INSERT OR UPDATE ON data_values FOR EACH ROW EXECUTE PROCEDURE datestamp_e()

status=# \d datatemplate_intervals
                          Table "public.datatemplate_intervals"
   Column    |            Type             |                       Modifiers
-------------+-----------------------------+---------------------------------------------------------------
 id          | integer                     | not null default nextval('datatemplate_intervals_seq'::text)
 updated     | timestamp without time zone |
 entered     | timestamp without time zone |
 machine_id  | integer                     |
 template_id | integer                     |
 interval    | integer                     |
Indexes:
    "datatemplate_intervals_pkey" primary key, btree (id)
    "idx_di_machine_id" btree (machine_id)
    "idx_di_template_id" btree
Re: [PERFORM] query problem
On Wed, 2004-10-13 at 02:21, Robin Ericsson wrote:
> Hi,
>
> I sent this to general earlier but I was redirected to performance.
>
> The query has been running OK for quite some time, but after I did a
> vacuum on the database it's very, very slow.

Did you do a VACUUM FULL ANALYZE on the database, or just a VACUUM? It
looks like the statistics behind your query plan are all off, which
ANALYZE should fix.

> This IN-query is only 2 ids; before the problem the IN was a subselect
> returning around 6-7 ids. The tables included in the query are described
> in database.txt.
>
> status=# select count(id) from data;
>   count
> ---------
>  1577621
> (1 row)
>
> status=# select count(data_id) from data_values;
>   count
> ---------
>  9680931
> (1 row)
>
> I did run a new explain analyze on the query and found the attached
> result. The obvious problem I see is a full index scan on
> "idx_dv_data_id". I tried dropping and adding the index again; that's
> why it's called "idx_data_values_data_id" in the dump.
>
> status=# EXPLAIN ANALYZE
> status-# SELECT
> status-#         data.entered,
> status-#         data.machine_id,
> status-#         datatemplate_intervals.template_id,
> status-#         data_values.value
> status-# FROM
> status-#         data, data_values, datatemplate_intervals
> status-# WHERE
> status-#         datatemplate_intervals.id = data_values.template_id AND
> status-#         data_values.data_id = data.id AND
> status-#         data.machine_id IN (2,3) AND
> status-#         current_timestamp::timestamp - interval '60 seconds' < data.entered;
>
> Regards,
> Robin

---(end of broadcast)---
TIP 6: Have you searched our list archives? http://archives.postgresql.org
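[Editor's note] For reference, a minimal sketch of the suggestion above, using the
table names from Robin's schema (a plain VACUUM reclaims dead space but does not
refresh planner statistics; the ANALYZE step does):

    VACUUM ANALYZE data;
    VACUUM ANALYZE data_values;
    VACUUM ANALYZE datatemplate_intervals;

    -- or database-wide, including the space-reclaiming FULL pass asked about above:
    VACUUM FULL ANALYZE;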
Re: [PERFORM] query problem
Robin Ericsson <[EMAIL PROTECTED]> writes:
> I sent this to general earlier but I was redirected to performance.

Actually, I think I suggested that you consult the pgsql-performance
archives, where this type of problem has been hashed out before.
See for instance this thread:
http://archives.postgresql.org/pgsql-performance/2004-07/msg00169.php
particularly
http://archives.postgresql.org/pgsql-performance/2004-07/msg00175.php
http://archives.postgresql.org/pgsql-performance/2004-07/msg00184.php
http://archives.postgresql.org/pgsql-performance/2004-07/msg00185.php
which show three different ways of getting the planner to do something
sane with an index range bound like "now() - interval".

			regards, tom lane

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
       joining column's datatypes do not match
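[Editor's note] The simplest of those approaches, and the one Robin confirms further
down the thread, is to compute the cutoff in the application and send it as a plain
literal, so the planner sees a constant it can estimate against. A sketch (the literal
timestamp is only an example value):

    SELECT data.entered, data.machine_id,
           datatemplate_intervals.template_id, data_values.value
    FROM data, data_values, datatemplate_intervals
    WHERE datatemplate_intervals.id = data_values.template_id
      AND data_values.data_id = data.id
      AND data.machine_id IN (2, 3)
      -- computed by the client as "now minus 60 seconds", then sent as a literal
      AND data.entered > '2004-10-13 17:47:36'::timestamp;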
Re: [PERFORM] query problem
On Wed, 2004-10-13 at 11:03 -0400, Tom Lane wrote:
> Robin Ericsson <[EMAIL PROTECTED]> writes:
> > I sent this to general earlier but I was redirected to performance.
>
> Actually, I think I suggested that you consult the pgsql-performance
> archives, where this type of problem has been hashed out before.
> See for instance this thread:
> http://archives.postgresql.org/pgsql-performance/2004-07/msg00169.php
> particularly
> http://archives.postgresql.org/pgsql-performance/2004-07/msg00175.php
> http://archives.postgresql.org/pgsql-performance/2004-07/msg00184.php
> http://archives.postgresql.org/pgsql-performance/2004-07/msg00185.php
> which show three different ways of getting the planner to do something
> sane with an index range bound like "now() - interval".

Using an exact timestamp brings the query back to its normal speed (see
explain below). However, I still have the problem when using a stored
procedure, or even the "ago"-example from the threads above.

regards,
Robin

status=# explain analyse
status-# SELECT
status-#         data.entered,
status-#         data.machine_id,
status-#         datatemplate_intervals.template_id,
status-#         data_values.value
status-# FROM
status-#         data, data_values, datatemplate_intervals
status-# WHERE
status-#         datatemplate_intervals.id = data_values.template_id AND
status-#         data_values.data_id = data.id AND
status-#         data.machine_id IN (SELECT machine_id FROM machine_group_xref WHERE group_id = 1) AND
status-#         '2004-10-13 17:47:36.902062' < data.entered
status-# ;
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=3.09..481.28 rows=777 width=24) (actual time=0.637..1.804 rows=57 loops=1)
   Hash Cond: ("outer".template_id = "inner".id)
   ->  Nested Loop  (cost=1.17..467.71 rows=776 width=24) (actual time=0.212..1.012 rows=57 loops=1)
         ->  Hash IN Join  (cost=1.17..9.56 rows=146 width=16) (actual time=0.165..0.265 rows=9 loops=1)
               Hash Cond: ("outer".machine_id = "inner".machine_id)
               ->  Index Scan using idx_d_entered on data  (cost=0.00..6.14 rows=159 width=16) (actual time=0.051..0.097 rows=10 loops=1)
                     Index Cond: ('2004-10-13 17:47:36.902062'::timestamp without time zone < entered)
               ->  Hash  (cost=1.14..1.14 rows=11 width=4) (actual time=0.076..0.076 rows=0 loops=1)
                     ->  Seq Scan on machine_group_xref  (cost=0.00..1.14 rows=11 width=4) (actual time=0.017..0.054 rows=11 loops=1)
                           Filter: (group_id = 1)
         ->  Index Scan using idx_data_values_data_id on data_values  (cost=0.00..3.07 rows=5 width=16) (actual time=0.018..0.047 rows=6 loops=9)
               Index Cond: (data_values.data_id = "outer".id)
   ->  Hash  (cost=1.74..1.74 rows=74 width=8) (actual time=0.382..0.382 rows=0 loops=1)
         ->  Seq Scan on datatemplate_intervals  (cost=0.00..1.74 rows=74 width=8) (actual time=0.024..0.248 rows=74 loops=1)
 Total runtime: 2.145 ms
(15 rows)

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
[PERFORM] Free PostgreSQL Training, Philadelphia, Oct 30
All,

My company (Chariot Solutions) is sponsoring a day of free PostgreSQL
training by Bruce Momjian (one of the core PostgreSQL developers). The
day is split into 2 sessions (plus a Q&A session):

 * Mastering PostgreSQL Administration
 * PostgreSQL Performance Tuning

Registration is required, and space is limited. The location is Malvern,
PA (suburb of Philadelphia) and it's on Saturday Oct 30. For more
information or to register, see

http://chariotsolutions.com/postgresql.jsp

Thanks,
Aaron

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
[PERFORM] Opteron vs RHAT
[EMAIL PROTECTED] (Matt Clark) writes:
>> As for "vendor support" for Opteron, that sure looks like a
>> trainwreck... If you're going through IBM, then they won't want to
>> respond to any issues if you're not running a "bog-standard" RHAS/RHES
>> release from Red Hat. And that, on Opteron, is preposterous, because
>> there's plenty of the bits of Opteron support that only ever got put
>> in Linux 2.6, whilst RHAT is still back in the 2.4 days.
>
> To be fair, they have backported a boatload of 2.6 features to their kernel:
> http://www.redhat.com/software/rhel/kernel26/
>
> And that page certainly isn't an exhaustive list...

To be fair, we keep on actually running into things that _can't_ be
backported, like fibrechannel drivers that were written to take advantage
of changes in the SCSI support in 2.6. This sort of thing will be
particularly problematic with Opteron, where the porting efforts for AMD64
have taken place alongside the creation of 2.6.

--
let name="cbbrowne" and tld="cbbrowne.com" in String.concat "@" [name;tld];;
http://www.ntlug.org/~cbbrowne/linuxxian.html
A VAX is virtually a computer, but not quite.

---(end of broadcast)---
TIP 8: explain analyze is your friend
Re: [PERFORM] Free PostgreSQL Training, Philadelphia, Oct 30
> All,
> My company (Chariot Solutions) is sponsoring a day of free
> PostgreSQL training by Bruce Momjian (one of the core PostgreSQL
> developers). The day is split into 2 sessions (plus a Q&A session):
>
>  * Mastering PostgreSQL Administration
>  * PostgreSQL Performance Tuning
>
> Registration is required, and space is limited. The location is
> Malvern, PA (suburb of Philadelphia) and it's on Saturday Oct 30. For
> more information or to register, see
>
> http://chariotsolutions.com/postgresql.jsp
>
> Thanks,
> Aaron

Wow, that's good stuff, too bad there's no one doing stuff like that in
the Los Angeles area.

-b

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
       subscribe-nomail command to [EMAIL PROTECTED] so that your message
       can get through to the mailing list cleanly
[Fwd: Re: [GENERAL] [PERFORM] query problem]
Sent this to wrong list.

-------- Forwarded Message --------
From: Robin Ericsson <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: Re: [GENERAL] [PERFORM] query problem
Date: Wed, 13 Oct 2004 18:27:20 +0200

On Wed, 2004-10-13 at 18:01 +0200, Robin Ericsson wrote:
> Using an exact timestamp brings the query back to its normal speed (see
> explain below). However, I still have the problem when using a stored
> procedure, or even the "ago"-example from the threads above.

Well, after changing ago() to use timestamp without time zone, the plain
query is OK again; it now takes ~2 ms.

SELECT
        data.entered,
        data.machine_id,
        datatemplate_intervals.template_id,
        data_values.value
FROM
        data, data_values, datatemplate_intervals
WHERE
        datatemplate_intervals.id = data_values.template_id AND
        data_values.data_id = data.id AND
        data.machine_id IN (SELECT machine_id FROM machine_group_xref WHERE group_id = 1) AND
        ago('60 seconds') < data.entered

However, calling it through this procedure

select * from get_current_machine_status('60 seconds', 1);

takes ~100 s. Is there something obviously wrong in how I'm doing it?

CREATE TYPE public.mstatus_holder AS (
        entered timestamp,
        machine_id int4,
        template_id int4,
        value varchar
);

CREATE OR REPLACE FUNCTION public.get_current_machine_status(interval, int4)
RETURNS SETOF mstatus_holder AS '
        SELECT
                data.entered,
                data.machine_id,
                datatemplate_intervals.template_id,
                data_values.value
        FROM
                data, data_values, datatemplate_intervals
        WHERE
                datatemplate_intervals.id = data_values.template_id AND
                data_values.data_id = data.id AND
                data.machine_id IN (SELECT machine_id FROM machine_group_xref WHERE group_id = $2) AND
                ago($1) < data.entered
' LANGUAGE 'sql' VOLATILE;

Regards,
Robin

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings

--
Robin Ericsson <[EMAIL PROTECTED]>
Profecta HB

---(end of broadcast)---
TIP 6: Have you searched our list archives? http://archives.postgresql.org
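[Editor's note] One likely explanation (an assumption on my part, not something
stated in the thread) is that inside the SQL function the query is planned with
$1 and $2 as unknown parameters, so the planner is back to guessing the selectivity
of the entered-range condition, just as it did before the exact-timestamp fix. A
hedged, untested sketch of an era-appropriate workaround is to rewrite the function
in plpgsql and build the statement with EXECUTE, so it is planned against the actual
values on each call (it reuses Robin's ago() helper and mstatus_holder type;
everything else is illustrative):

    CREATE OR REPLACE FUNCTION public.get_current_machine_status(interval, int4)
    RETURNS SETOF mstatus_holder AS '
    DECLARE
        rec RECORD;
    BEGIN
        -- Interpolate the cutoff timestamp and group id as literals so the
        -- planner can pick the index on data.entered with a sane row estimate.
        FOR rec IN EXECUTE
            ''SELECT data.entered, data.machine_id, ''
            || ''datatemplate_intervals.template_id, data_values.value ''
            || ''FROM data, data_values, datatemplate_intervals ''
            || ''WHERE datatemplate_intervals.id = data_values.template_id ''
            || ''AND data_values.data_id = data.id ''
            || ''AND data.machine_id IN (SELECT machine_id ''
            || ''FROM machine_group_xref WHERE group_id = '' || $2 || '') ''
            || ''AND '' || quote_literal(ago($1)::text) || '' < data.entered''
        LOOP
            RETURN NEXT rec;
        END LOOP;
        RETURN;
    END;
    ' LANGUAGE plpgsql;

The trade-off is that EXECUTE gives up plan caching in exchange for estimates based
on the real values, which is usually the right trade for a query this selective.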
Re: [PERFORM] Opteron vs RHAT
> >> trainwreck... If you're going through IBM, then they won't want to
> >> respond to any issues if you're not running a "bog-standard" RHAS/RHES
> >> release from Red Hat.
...
> To be fair, we keep on actually running into things that _can't_ be
> backported, like fibrechannel drivers that were written to take
> advantage of changes in the SCSI support in 2.6.

I thought IBM had good support for SUSE? I don't know why I thought that...

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
[PERFORM] futex results with dbt-3
Hi guys,

I have some DBT-3 (decision support) results using Gavin's original futex
patch fix. It's on our 8-way Pentium III Xeon systems in our STP
environment. We definitely see some overall throughput improvement on the
tests, about a 15% increase, but no change with respect to the number of
context switches. Perhaps it doesn't really address what's causing the
incredible number of context switches, but it still helps. I think I'm
seeing what Gavin has seen, namely that it seems to solve some concurrency
problems on x86 platforms.

Here's results without futexes: http://khack.osdl.org/stp/298114/
Results with futexes: http://khack.osdl.org/stp/298115/

Mark

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] Free PostgreSQL Training, Philadelphia, Oct 30
On Wed, 13 Oct 2004 09:23:47 -0700, Bryan Encina <[EMAIL PROTECTED]> wrote:
>
> Wow, that's good stuff, too bad there's no one doing stuff like that in the
> Los Angeles area.
>
> -b

That makes two of us. Hanging out with Tom, Bruce, and others at OSCON 2002
was one of the most informative and fun times I've had. That and I could
really stand to brush up on my Postgres basics

aaron.glenn

---(end of broadcast)---
TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Free PostgreSQL Training, Philadelphia, Oct 30
Aaron,

> That makes two of us. Hanging out with Tom, Bruce, and others at OSCON
> 2002 was one of the most informative and fun times I've had. That and
> I could really stand to brush up on my Postgres basics

You're thinking of Jan. Tom wasn't at OSCON. ;-)

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings
Re: [PERFORM] First set of OSDL Shared Mem scalability results, some
On 10/9/2004 7:20 AM, Kevin Brown wrote:
> Christopher Browne wrote:
>> Increasing the number of cache buffers _is_ likely to lead to some
>> slowdowns:
>> - Data that passes through the cache also passes through kernel
>>   cache, so it's recorded twice, and read twice...
>
> Even worse, memory that's used for the PG cache is memory that's not
> available to the kernel's page cache. Even if the overall memory

Which underlines my previous statement, that a PG shared cache much larger
than the frequently accessed data portion of the DB is counterproductive.
Double buffering (kernel disk buffer plus shared buffer) only makes sense
for data that would otherwise cause excessive memory copies in and out of
the shared buffer. Beyond that, it only lowers the memory available for
disk buffers.

Jan

> usage in the system isn't enough to cause some paging to disk, most
> modern kernels will adjust the page/disk cache size dynamically to fit
> the memory demands of the system, which in this case means it'll be
> smaller if running programs need more memory for their own use.
>
> This is why I sometimes wonder whether or not it would be a win to use
> mmap() to access the data and index files -- doing so under a truly
> modern OS would surely at the very least save a buffer copy (from the
> page/disk cache to program memory) because the OS could instead directly
> map the buffer cache pages to the program's memory space.
>
> Since PG often has to have multiple files open at the same time, and in
> a production database many of those files will be rather large, PG would
> have to limit the size of the mmap()ed region on 32-bit platforms, which
> means that things like the order of mmap() operations to access various
> parts of the file can become just as important in the mmap()ed case as
> it is in the read()/write() case (if not more so!). I would imagine that
> the use of mmap() on a 64-bit platform would be a much, much larger win
> because PG would most likely be able to mmap() entire files and let the
> OS work out how to order disk reads and writes.
>
> The biggest problem as I see it is that (I think) mmap() would have to
> be made to cooperate with malloc() for virtual address space. I suspect
> issues like this have already been worked out by others, however...

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
       subscribe-nomail command to [EMAIL PROTECTED] so that your message
       can get through to the mailing list cleanly
Re: [PERFORM] First set of OSDL Shared Mem scalability results, some
On 10/14/2004 12:22 AM, Greg Stark wrote:
> Jan Wieck <[EMAIL PROTECTED]> writes:
>> Which would require that shared memory is not allowed to be swapped
>> out, and that is allowed in Linux by default IIRC, not to completely
>> distort the entire test.
>
> Well if it's getting swapped out then it's clearly not being used
> effectively.

Is it really that easy if 3 different cache algorithms (PG cache, kernel
buffers and swapping) are competing for the same chips?

Jan

> There are APIs to bar swapping out pages and the tests could be run
> without swap. I suggested it only as an experiment though, there are
> lots of details between here and having it be a good configuration for
> production use.

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 8: explain analyze is your friend
Re: [PERFORM] Free PostgreSQL Training, Philadelphia, Oct 30
Josh Berkus wrote:
> Aaron,
>
> > That makes two of us. Hanging out with Tom, Bruce, and others at OSCON
> > 2002 was one of the most informative and fun times I've had. That and
> > I could really stand to brush up on my Postgres basics
>
> You're thinking of Jan. Tom wasn't at OSCON. ;-)

Ah, but he said 2002 and I think Tom was there that year.

--
  Bruce Momjian                    |  http://candle.pha.pa.us
  [EMAIL PROTECTED]                |  (610) 359-1001
  + If your life is a hard drive,  |  13 Roberts Road
  + Christ can be your backup.     |  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
       joining column's datatypes do not match
Re: [PERFORM] First set of OSDL Shared Mem scalability results, some
Jan Wieck <[EMAIL PROTECTED]> writes:
> Which would require that shared memory is not allowed to be swapped out,
> and that is allowed in Linux by default IIRC, not to completely distort
> the entire test.

Well if it's getting swapped out then it's clearly not being used
effectively.

There are APIs to bar swapping out pages and the tests could be run
without swap. I suggested it only as an experiment though, there are lots
of details between here and having it be a good configuration for
production use.

--
greg

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
       (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [PERFORM] First set of OSDL Shared Mem scalability results, some
On 10/13/2004 11:52 PM, Greg Stark wrote:
> Jan Wieck <[EMAIL PROTECTED]> writes:
>> On 10/8/2004 10:10 PM, Christopher Browne wrote:
>>> [EMAIL PROTECTED] (Josh Berkus) wrote:
>>>> I've been trying to peg the "sweet spot" for shared memory using
>>>> OSDL's equipment. With Jan's new ARC patch, I was expecting that
>>>> the desired amount of shared_buffers to be greatly increased. This
>>>> has not turned out to be the case.
>>> That doesn't surprise me.
>> Neither does it surprise me.
>
> There's been some speculation that having a large shared buffers be about
> 50% of your RAM is pessimal as it guarantees the OS cache is merely
> doubling up on all the buffers postgres is keeping. I wonder whether
> there's a second sweet spot where the postgres cache is closer to the
> total amount of RAM.

Which would require that shared memory is not allowed to be swapped out,
and that is allowed in Linux by default IIRC, not to completely distort
the entire test.

Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
       joining column's datatypes do not match
Re: [PERFORM] futex results with dbt-3
On Thu, 2004-10-14 at 04:57, Mark Wong wrote:
> I have some DBT-3 (decision support) results using Gavin's original
> futex patch fix.

I sent an initial description of the futex patch to the mailing lists last
week, but it never appeared (from talking to Marc I believe it exceeded
the size limit on -performance). In any case, the "futex patch" uses the
Linux 2.6 futex API to implement PostgreSQL spinlocks. The hope is that
using futexes will lead to better performance when there is contention for
spinlocks (e.g. on a busy SMP system). The original patch was written by
Stephen Hemminger at OSDL; Gavin and myself have done a bunch of
additional bugfixing and optimization, as well as added IA64 support.

I've attached a WIP copy of the patch to this email (it supports x86,
x86-64 (untested) and IA64 -- more architectures can be added at request).
I'll post a longer writeup when I submit the patch to -patches.

> We definitely see some overall throughput improvement on the tests,
> about a 15% increase, but no change with respect to the number of
> context switches.

I'm glad to see that there is a performance improvement; in my own testing
on an 8-way P3 system provided by OSDL, I saw a similar improvement in
pgbench performance (50 concurrent clients, 1000 transactions each, scale
factor 75; without the patch, TPS/sec was between 180 and 185, with the
patch TPS/sec was between 200 and 215).

As for context switching, there was some earlier speculation that the
patch might improve or even resolve the "CS storm" issue that some people
have experienced with SMP Xeon P4 systems. I don't think we have enough
evidence to answer this one way or the other at this point.

-Neil

Index: src/backend/storage/lmgr/s_lock.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/storage/lmgr/s_lock.c,v
retrieving revision 1.32
diff -c -r1.32 s_lock.c
*** src/backend/storage/lmgr/s_lock.c	30 Aug 2004 23:47:20 -	1.32
--- src/backend/storage/lmgr/s_lock.c	13 Oct 2004 06:23:26 -
***************
*** 15,26 ****
   */
  #include "postgres.h"
  
  #include
- #include
  
  #include "storage/s_lock.h"
  #include "miscadmin.h"
  
  /*
   * s_lock_stuck() - complain about a stuck spinlock
   */
--- 15,49 ----
   */
  #include "postgres.h"
  
+ #ifdef S_LOCK_TEST
+ #undef Assert
+ #define Assert(cond) DoAssert(cond, #cond, __FILE__, __LINE__)
+ 
+ #define DoAssert(cond, text, file, line) \
+ 	if (!(cond))						\
+ 	{									\
+ 		printf("ASSERTION FAILED! [%s], file = %s, line = %d\n", \
+ 			   text, file, line);		\
+ 		abort();						\
+ 	}
+ #endif
+ 
+ #include
  
  #include "storage/s_lock.h"
  #include "miscadmin.h"
  
+ #ifdef S_LOCK_TEST
+ #define LOCK_TEST_MSG() \
+ 	do \
+ 	{ \
+ 		fprintf(stdout, "*"); \
+ 		fflush(stdout); \
+ 	} while (0);
+ #else
+ #define LOCK_TEST_MSG()
+ #endif
+ 
  /*
   * s_lock_stuck() - complain about a stuck spinlock
   */
***************
*** 38,43 ****
--- 61,131 ----
  #endif
  }
  
+ #ifdef HAVE_FUTEX
+ /*
+  * futex_lock_contended() is similar to s_lock() for the normal TAS
+  * implementation of spinlocks. When this function is invoked, we have
+  * failed to immediately acquire the spinlock, so we should spin some
+  * number of times attempting to acquire the lock before invoking
+  * sys_futex() to have the kernel wake us up later. "val" is the
+  * current value of the mutex we saw when we tried to acquire it; it
+  * may have changed since then, of course.
+  */
+ void
+ futex_lock_contended(volatile slock_t *lock, slock_t val,
+ 					 const char *file, int line)
+ {
+ 	int		loop_count = 0;
+ 
+ #define MAX_LOCK_WAIT 30
+ #define SPINS_BEFORE_WAIT 100
+ 
+ 	Assert(val != FUTEX_UNLOCKED);
+ 
+ 	if (val == FUTEX_LOCKED_NOWAITER)
+ 		val = atomic_exchange(lock, FUTEX_LOCKED_WAITER);
+ 
+ 	while (val != FUTEX_UNLOCKED)
+ 	{
+ 		static struct timespec delay = { .tv_sec = MAX_LOCK_WAIT,
+ 										 .tv_nsec = 0 };
+ 
+ 		LOCK_TEST_MSG();
+ 
+ #if defined(__i386__) || defined(__x86_64__)
+ 		/* See spin_delay() */
+ 		__asm__ __volatile__(" rep; nop\n");
+ #endif
+ 
+ 		/*
+ 		 * XXX: This code is derived from the Drepper algorithm, which
+ 		 * doesn't spin (why, I'm not sure). We should actually change
+ 		 * the lock status to "lock, with waiters" just before we wait
+ 		 * on the futex, not before we begin looping (that avoids a
+ 		 * system call when the lock is released).
+ 		 */
+ 
+ 		/* XXX: worth using __builtin_expect() here? */
+ 		if (++loop_count >= SPINS_BEFORE_WAIT)
+ 		{
+ 			loop_count = 0;
+ 			if (sys_futex(lock, FUTEX_OP_WAIT,
+ 						  FUTEX_LOCKED_WAITER, &delay))
+ 			{
+ 				if (errno == ETIMEDOUT)
+ 					s_lock_stuck(lock, file, line);
+ 			}
+ 		}
+ 
+ 		/*
+ 		 * Do a non-locking test before asserting the bus lock.
+ 		 */
+ 		if (*lock == FUTEX_UNLOCKED)
+ 			val = atomic_exchange(lock, FUTEX_LOCKED_WAITER);
+ 	}
+ }
+ 
+ #else
  /*
   * s_lock(lock) - platform-independent p
Re: [PERFORM] First set of OSDL Shared Mem scalability results, some
Jan Wieck <[EMAIL PROTECTED]> writes:
> On 10/8/2004 10:10 PM, Christopher Browne wrote:
> > [EMAIL PROTECTED] (Josh Berkus) wrote:
> >> I've been trying to peg the "sweet spot" for shared memory using
> >> OSDL's equipment. With Jan's new ARC patch, I was expecting that
> >> the desired amount of shared_buffers to be greatly increased. This
> >> has not turned out to be the case.
> > That doesn't surprise me.
> Neither does it surprise me.

There's been some speculation that having a large shared buffers be about
50% of your RAM is pessimal as it guarantees the OS cache is merely
doubling up on all the buffers postgres is keeping. I wonder whether
there's a second sweet spot where the postgres cache is closer to the
total amount of RAM.

That configuration would have disadvantages for servers running other jobs
besides postgres. And I was led to believe earlier that postgres starts
each backend with a fairly fresh slate as far as the ARC algorithm, so it
wouldn't work well for a postgres server that had lots of short to
moderate life sessions. But if it were even close it could be interesting.

Reading the data with O_DIRECT and having a single global cache could be
interesting experiments. I know there are arguments against each of these,
but ...

I'm still pulling for an mmap approach to eliminate postgres's buffer
cache entirely in the long term, but it seems like slim odds now. But one
way or the other having two layers of buffering seems like a waste.

--
greg

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
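[Editor's note] For anyone who wants to experiment with the shared_buffers sweet spot
discussed in this thread, here is a minimal sketch of the knob involved; the numbers
are illustrative assumptions, not recommendations from the thread (on 7.4/8.0,
shared_buffers is a count of 8 kB buffers, and changing it requires a server restart):

    -- From psql, check what the server is currently using:
    SHOW shared_buffers;

    -- postgresql.conf (illustrative value only): 25000 buffers is roughly
    -- 200 MB at 8 kB per buffer, i.e. around 10% of a 2 GB machine rather
    -- than the 50%-of-RAM case argued to be pessimal above.
    -- shared_buffers = 25000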