Re: [HACKERS] Rendezvous/Bonjour broken in 8.1 beta

2005-09-11 Thread Sandy McArthur
FYI: The DNSServiceDiscovery.h is Mac OS X specific and only kept
around for Mac OS X 10.2 and older compatiability. Apple has moved to
a cross platform impelentation with the dns_sd.h API available for OS
X, Win32, and most any posix type system.

Any reference to DNSServiceDiscovery.h in pgsql should go away (unless
you want to keep support for pre-10.3 Mac OS X) and be replaced with
the newer dns_sd.h interfaces.

On 9/8/05, Andrew Dunstan [EMAIL PROTECTED] wrote:
 It gets a lot worse. The Windows SDK does not export the same API. None
 of the DNSfoo things we refer to in postmaster.c are defined in the
 supplied header file (not to mention other idiocies I was able to
 negotiate past). Apparently Apple have never heard of code portability.

-- 
Sandy McArthur

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] FAQ/HTML standard?

2005-09-11 Thread Jochem van Dieten
On 9/11/05, Bruno Wolff III wrote:
 On Sat, Sep 10, 2005 at 14:31:06 -0400, Andrew Dunstan wrote:
 
 XHTML is simply a minimal reformulation of HTML in XML, and even uses
 the HTML 4.01 definitions for its semantics. Given that, it's hard to
 see why it should be considered a bad thing.
 
 Here is the article:
 http://www.hixie.ch/advocacy/xhtml

I prefer standards over opinions:

quote
5.1. Internet Media Type

XHTML Documents which follow the guidelines set forth in Appendix C,
HTML Compatibility Guidelines may be labeled with the Internet Media
Type text/html [RFC2854], as they are compatible with most HTML
browsers. Those documents, and any other document conforming to this
specification, may also be labeled with the Internet Media Type
application/xhtml+xml as defined in [RFC3236]. (..)
/quote http://www.w3.org/TR/xhtml1/

So just follow the compatibility guidelines (we want people to be able
to read the FAQ anyway) and use text/html.

Jochem

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


[HACKERS] random system table corruption ...

2005-09-11 Thread Hans-Jürgen Schönig
in the past we have faced a couple of problems with corrupted system 
tables. this seems to be a version independent problem which occurs on 
hackers' from time to time.
i have checked a broken file and i have seen that the corrupted page has 
actually been zeroed out.


my question is: are there any options to implement something which makes 
system tables more robust? the problem is: the described error happens 
only once i an while and cannot be reproduced. maybe there is a way to 
add some more sanity checks before the page is actually written.


any suggestions?

best regards,

hans

--
Cybertec Geschwinde  Schönig GmbH
Schöngrabern 134; A-2020 Hollabrunn
Tel: +43/1/205 10 35 / 340
www.postgresql.at, www.cybertec.at

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] random system table corruption ...

2005-09-11 Thread Martijn van Oosterhout
On Sun, Sep 11, 2005 at 01:12:34PM +0200, Hans-Jürgen Schönig wrote:
 in the past we have faced a couple of problems with corrupted system 
 tables. this seems to be a version independent problem which occurs on 
 hackers' from time to time.
 i have checked a broken file and i have seen that the corrupted page has 
 actually been zeroed out.

Near as I can tell, the only times pages are zeroed out is if
zero_damaged_pages is set (destroying the evidence) or during WAL
recovery.

 my question is: are there any options to implement something which makes 
 system tables more robust? the problem is: the described error happens 
 only once i an while and cannot be reproduced. maybe there is a way to 
 add some more sanity checks before the page is actually written.

Well, the most common causes are dodgy memory. Other than that I guess
you could arrange for bgwriter to check the pages it is writing. I
imagine it already does check the header, checking the data requires
knowledge about the actual table and attributes. And about the only
thing that says I'm broken is a varlena value with a long value.

As they say, the only thing sure would be to have a backup. the only
thing I can imagine being really useful would be a restore mode where
you feed it the schema so it can reconstruct the pg_class and
pg_attribute just enough for you to dump it to reconstruct
everything...

You know, VACUUM FREEZE BACKUP on pg_catalog, physically copy the
datafiles and offer the option to blat your catalog with an old one...
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.


pgp7YJB3n07fv.pgp
Description: PGP signature


[HACKERS] Benchmarks?

2005-09-11 Thread Martin Ehmsen
Hi,

I have tried to post the following question in the general
malinglist, but haven't got any respons. I really hope
some pgsql-developer can help.
Btw. this is my first mail to this list, so please bear
with me.

I'm a computer science student doing a thesis on paging
algorithms, both from a theoretical and practical viewpoint.
I'm currently looking for some standard way of testing the
performance of paging alogrithms. But I haven't been able
to find any paging benchmarks anywhere (both in the
literature and on the web).

I would therefore like to know if the developers of
postgresql are doing any kind of paging benchmarks? And in
case they do, I would like to see them.
Or there might be someone who has a good idea how to
produce real-life sequences of requests, that I might
test the algorithms against.

Thanks in advance,
Martin

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


[HACKERS] problem for o/p

2005-09-11 Thread rajinder ruprai
sir,
i'am getting different o/p for the same program code as well as the data base is the same .twice the output is being printed on some machines whereas correct o/p on some.the program code is
create or replace function f() return int4 as'
c1 cursor for select empname from emp;
e1 emp.ename%type;
begin
open c1;
loop
fetch c1 into e1;
exit when not found;
raise notice ' 'emp name %' ', e1;
end loop;
close c1;
return 1;
end;
'language 'plpgsql';


		 
Yahoo! India Matrimony: Find your partner now.

Re: [HACKERS] FAQ/HTML standard?

2005-09-11 Thread Robert Treat
On Saturday 10 September 2005 12:10, Andrew Dunstan wrote:
 Is there an HTML standard that we try to follow in our HTML docs such as
 FAQs?

 If there isn't an explicit standard, may I suggest that we adopt XHTML
 1.0 as the standard?


Really the FAQ files need to be able to validate when viewed through the 
postgresql website, where we use: 
!DOCTYPE html PUBLIC -//W3C//DTD XHTML 1.0 Strict//EN
http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd;
I have corrected some markup in the past, but by no means would I say this has 
been done thoroughly wrt the FAQ's. 

FWIW I'm not against changing the source to docbook if someone wanted to be 
really ambitious and convert all of the FAQ's to such a scheme (including the 
text only ones) as long as we had a way to easily build them out of cvs and 
into the website. I imagine my other web cohorts would probably think 
similarly.  

-- 
Robert Treat
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


[HACKERS] -fPIC

2005-09-11 Thread Peter Eisentraut
So far, we have tended to use -fpic to compile position-independent code 
until we have received some sort of overflow that forced the use of 
-fPIC.  Since 8.0, the makefiles to build shared libraries are also 
available to external modules through the pgxs system, so we cannot 
exactly check anymore what the fpic vs. fPIC requirement of each 
conceivable module is.  I have just received confirmation that PL/Java 
needs -fPIC to compile on Alpha and S/390 on Linux, so we need to make 
at least that change, but maybe it's more prudent to change to -fPIC 
across the board now.  Comments?

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] initdb profiles

2005-09-11 Thread Greg Stark

Tom Lane [EMAIL PROTECTED] writes:

 It'd be nice to get out from under the fixed-size-shmem restriction, but
 I don't know any very portable way to do that.  

Without knowing that part of the code at all it seems to me the logical
approach would be to make the fsm steal its pages out of the shared buffers
allocation. That is, you specify a total amount of shared memory to allocate
and Postgres decides how much of it to use for shared buffers and how much for
fsm.

-- 
greg


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] random system table corruption ...

2005-09-11 Thread Tom Lane
=?ISO-8859-1?Q?Hans-J=FCrgen_Sch=F6nig?= [EMAIL PROTECTED] writes:
 in the past we have faced a couple of problems with corrupted system 
 tables. this seems to be a version independent problem which occurs on 
 hackers' from time to time.
 i have checked a broken file and i have seen that the corrupted page has 
 actually been zeroed out.

That sounds to me like a hardware problem --- disk or disk controller
momentarily writing zeroes instead of what it should write.   Have you
seen this on more than one physical machine?  Do you have any evidence
for the implication that it only happens to system tables and not user
tables?

Also, you don't have zero_damaged_pages turned on by any chance?

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] -fPIC

2005-09-11 Thread Tom Lane
Peter Eisentraut [EMAIL PROTECTED] writes:
 So far, we have tended to use -fpic to compile position-independent code 
 until we have received some sort of overflow that forced the use of 
 -fPIC.  Since 8.0, the makefiles to build shared libraries are also 
 available to external modules through the pgxs system, so we cannot 
 exactly check anymore what the fpic vs. fPIC requirement of each 
 conceivable module is.  I have just received confirmation that PL/Java 
 needs -fPIC to compile on Alpha and S/390 on Linux, so we need to make 
 at least that change, but maybe it's more prudent to change to -fPIC 
 across the board now.  Comments?

PL/Java is bigger than the whole backend?

The reason for -fpic vs -fPIC (on the machines where it makes any
difference at all) is that the former is faster.  I'm not real thrilled
by the prospect that a bloated add-on should get to dictate an
across-the-board slowdown even on installations where it will never
be used.

I think the correct answer is for PL/Java to do s/-fpic/-fPIC/ on
CFLAGS in its Makefile, rather than trying to force the same on
everything else.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] random system table corruption ...

2005-09-11 Thread Hans-Jürgen Schönig

Tom Lane wrote:

=?ISO-8859-1?Q?Hans-J=FCrgen_Sch=F6nig?= [EMAIL PROTECTED] writes:

in the past we have faced a couple of problems with corrupted system 
tables. this seems to be a version independent problem which occurs on 
hackers' from time to time.
i have checked a broken file and i have seen that the corrupted page has 
actually been zeroed out.



That sounds to me like a hardware problem --- disk or disk controller
momentarily writing zeroes instead of what it should write.   Have you
seen this on more than one physical machine?  Do you have any evidence
for the implication that it only happens to system tables and not user
tables?

Also, you don't have zero_damaged_pages turned on by any chance?

regards, tom lane



tom,

well, there is some evidence that this is not a hardware related issue.
we have only seen this problem from time to time but it happened on 
different machines. it cannot be reproduced. it can even happen when 
somebody runs a script which has been called million times before.
in my current scenario the page header only consists of 0x00 bytes and 
therefore the page checks fails when reading the system table.


i have never seen this in data files up to now (at least not when the 
hardware was still intact).


did anybody face similar problems? maybe on sun?
by the way: currently the broken system is running PostgreSQL 7.4 but as 
I said - we have also seen that on 8.0 once.


best regards,

hans


--
Cybertec Geschwinde  Schönig GmbH
Schöngrabern 134; A-2020 Hollabrunn
Tel: +43/1/205 10 35 / 340
www.postgresql.at, www.cybertec.at

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] initdb profiles

2005-09-11 Thread Andrew Dunstan



Tom Lane wrote:


The thought behind my suggestion was that the current max_fsm_pages
default of 2 pages is enough to track free space in a database of
maybe a few hundred megabytes.  The other defaults are sized
appropriately for machines with about that much in main memory.  This
doesn't seem to add up :-(.  The default max_fsm_pages probably should
be about ten times bigger just to bring it in balance with the other
defaults ... after that we could talk about increasing the defaults
across-the-board.


 



Ok, how about this? I based the numbers on your 10*current suggestion 
and some linear scaling


When we test connection currently, we use shared bufferes if n*5. We 
could add in a setting of max_fsm_pages = n * 1000 in line with the 
arithmetic - not sure if it's worth it.


When we test n shared buffers, let's add in a max_fsm_pages setting of n 
* 200.


Another alternative I thought might be better would be that instead of 
fixing the default max_fsm_pages at 2, we set the default at a fixed 
ratio (say 200:1) to shared_buffers. Not sure how easy that is to do via 
the GUC mechanism.


Lastly, I would suggest that we increase the limits we try modestly - 
adding in 400, 350, 300, 250, 200, and 150 to the number of cennections 
tried, and perhaps 3000, 2500, 2000 and 1500 to number of buffers tried.


These numbers aren't entirely plucked out of the air. The number of 
connections is picked to match then number of clients a default apache 
setup can have under a hybrid MPM, and the number of shared buffers is 
picked to be somewhat less than the 10% on a modest machine that Peter 
thought would be too much.


cheers

andrew



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] -fPIC

2005-09-11 Thread Peter Eisentraut
Tom Lane wrote:
 PL/Java is bigger than the whole backend?

No, it's not, but the backend is not compiled as position-independent.

 The reason for -fpic vs -fPIC (on the machines where it makes any
 difference at all) is that the former is faster.

I don't doubt that, but out of curiosity, considering that everyone else 
is using libtool, and libtool always uses -fPIC, what kind of impact 
does this *really* have?

 I think the correct answer is for PL/Java to do s/-fpic/-fPIC/ on
 CFLAGS in its Makefile, rather than trying to force the same on
 everything else.

That would certainly work, but is that the kind of interface we want to 
offer?  In the extreme case, a module could end up redefining a great 
deal of the shared library knowledge that it was supposed to not have 
to care about.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] initdb profiles

2005-09-11 Thread Jim C. Nasby
On Sun, Sep 11, 2005 at 12:15:01PM -0400, Greg Stark wrote:
  It'd be nice to get out from under the fixed-size-shmem restriction, but
  I don't know any very portable way to do that.  
 
 Without knowing that part of the code at all it seems to me the logical
 approach would be to make the fsm steal its pages out of the shared buffers
 allocation. That is, you specify a total amount of shared memory to allocate
 and Postgres decides how much of it to use for shared buffers and how much for
 fsm.

FWIW, I know this is how DB2 does things, and I think Oracle's the same.
We probably still want some kind of limit so it doesn't blow the buffer
cache completely out.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] random system table corruption ...

2005-09-11 Thread Alvaro Herrera
On Sun, Sep 11, 2005 at 01:12:34PM +0200, Hans-Jürgen Schönig wrote:
 in the past we have faced a couple of problems with corrupted system 
 tables. this seems to be a version independent problem which occurs on 
 hackers' from time to time.
 i have checked a broken file and i have seen that the corrupted page has 
 actually been zeroed out.

IIRC the XFS filesystem zeroes out pages that it recovers from the
journal but did not have a fsync on them (AFAIK XFS journals only
metadata, so page creation but not the content itself).  I don't think
this would be applicable to your case, because we do fsync modified
files on checkpoint, and rewrite them completely from WAL images after
that.  But I thought I'd mention it.

-- 
Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com
Just treat us the way you want to be treated + some extra allowance
 for ignorance.(Michael Brusser)

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] -fPIC

2005-09-11 Thread Kurt Roeckx
On Sun, Sep 11, 2005 at 05:49:40PM +0200, Peter Eisentraut wrote:
 So far, we have tended to use -fpic to compile position-independent code 
 until we have received some sort of overflow that forced the use of 
 -fPIC.  Since 8.0, the makefiles to build shared libraries are also 
 available to external modules through the pgxs system, so we cannot 
 exactly check anymore what the fpic vs. fPIC requirement of each 
 conceivable module is.  I have just received confirmation that PL/Java 
 needs -fPIC to compile on Alpha and S/390 on Linux, so we need to make 
 at least that change, but maybe it's more prudent to change to -fPIC 
 across the board now.  Comments?

Can we avoid those relocation by not exporting variables and
function that shouldn't be exported and marking them static?  Or
is static already being used properly?


Kurt


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] random system table corruption ...

2005-09-11 Thread Hans-Jürgen Schönig

Alvaro Herrera wrote:

On Sun, Sep 11, 2005 at 01:12:34PM +0200, Hans-Jürgen Schönig wrote:

in the past we have faced a couple of problems with corrupted system 
tables. this seems to be a version independent problem which occurs on 
hackers' from time to time.
i have checked a broken file and i have seen that the corrupted page has 
actually been zeroed out.



IIRC the XFS filesystem zeroes out pages that it recovers from the
journal but did not have a fsync on them (AFAIK XFS journals only
metadata, so page creation but not the content itself).  I don't think
this would be applicable to your case, because we do fsync modified
files on checkpoint, and rewrite them completely from WAL images after
that.  But I thought I'd mention it.




alvora,

thanks a lot.
we have some reports about sun systems.
meanwhile i got the impression that the filesystem might be doing 
something wrong. i have seen that the page is not completely zeroed out. 
at some strange positions there are 2 bytes of crap (i have overlooked 
that at first glance). the first couple hundreds of bytes are crap, 
however. very strange ...


best regards,

hans

--
Cybertec Geschwinde  Schönig GmbH
Schöngrabern 134; A-2020 Hollabrunn
Tel: +43/1/205 10 35 / 340
www.postgresql.at, www.cybertec.at

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] -fPIC

2005-09-11 Thread Stephen Frost
* Peter Eisentraut ([EMAIL PROTECTED]) wrote:
 Tom Lane wrote:
  The reason for -fpic vs -fPIC (on the machines where it makes any
  difference at all) is that the former is faster.
 
 I don't doubt that, but out of curiosity, considering that everyone else 
 is using libtool, and libtool always uses -fPIC, what kind of impact 
 does this *really* have?

I certainly wouldn't assume something done in libtool is necessairly the
'smart' approach, ever.

  I think the correct answer is for PL/Java to do s/-fpic/-fPIC/ on
  CFLAGS in its Makefile, rather than trying to force the same on
  everything else.
 
 That would certainly work, but is that the kind of interface we want to 
 offer?  In the extreme case, a module could end up redefining a great 
 deal of the shared library knowledge that it was supposed to not have 
 to care about.

I don't think it's all that sane to expect a generalized build system to
support every possible library compilation requirement...

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] problem for o/p

2005-09-11 Thread Michael Fuhr
On Sat, Sep 10, 2005 at 12:11:52PM +0100, rajinder ruprai wrote:
 i'am getting different o/p for the same program code as well as
 the data base is the same .twice the output is being printed on
 some machines whereas correct o/p on some.the program code is
[...]
 raise notice ' 'emp name %' ', e1;

Where do the database server's logs go?  You might be seeing duplicate
output because the server's logs are sent to the same terminal as the
client's output.

BTW, this doesn't belong on pgsql-hackers.

-- 
Michael Fuhr

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


[HACKERS] Spinlocks, yet again: a new test case

2005-09-11 Thread Tom Lane
I've recently been taking another look at the test case shown here:
http://archives.postgresql.org/pgsql-performance/2004-04/msg00280.php
This is the infamous context swap storm problem that we've hacking
at for so long.  After the 8.1 buffer manager redesign, the problem
of contention for the BufMgrLock is pretty much gone, and so the
test case no longer exhibits a lot of semop()s.  However ... performance
of CVS tip on a multiprocessor still pretty much sucks :-(

I started looking at this because of a discussion with a Red Hat
customer, who was saying that the futex spinlock patch shown here
http://archives.postgresql.org/pgsql-performance/2004-10/msg00185.php
had helped him materially with a context-swap-storm problem on 8.0.3.
That didn't make a lot of sense to me, because we had determined that
the futex patch didn't help much for this problem (which is one reason
why it never got applied).  So I started digging, and I've found a few
interesting things that we still need to fix.

First off, the RH customer had found that the futex patch improved
his query performance even though the context swap rate actually went
*up* materially.  So I realized that we shouldn't simply be looking
at the CS rate and saying ugh, we have to get that down.  It's
important to find out how much work is really getting done, too.
Accordingly, I generated a modified version of the test case with
a shorter query that is actually meant to complete in a reasonable
amount of time, and I looked at how long it took to complete multiple
instances of this query running in parallel.  Ideally, if you have
N cpus you should be able to run up to N copies of the query in the
same amount of time it takes for one copy, and then, say, 2*N copies
ought to take about twice as long as N copies.  We do actually seem
to scale pretty linearly beyond N ... but below that, not.

The files for the updated test case are attached if anyone else wants
to try it.  They are:

test_setup.sql  Run this to create the test tables

test_run.sqlA version of the test query that will run
pretty nearly forever (until you kill it)

test_run_small.sql  A one-minute-or-so test query

startn.sh   A shell script to run N copies of
test_run_small.sql in parallel.

You need this one-time setup:

createdb test
psql -f test_setup.sql test

To perform testing, do time ./startn.sh N for N=1,2,... up to however
many CPUs you have, and maybe a run with more queries than that to make
sure performance doesn't drop off above that.  Note the total time to
complete the script for each N.  You can also watch the CPU utilization
and context swap rates using vmstat 1 or equivalent, and/or strace
some of the backends to see what they're doing.

I did my testing on a 4-way Opteron box at Red Hat, using kernel
2.6.12-1.1398_FC4smp.  8.0.3 completely sucks in this test:

N, runtime: 1 45s   2 119s  4 507s  (didn't try 8)

With 4 processes running, it shows about 40K context swaps/sec
and barely 15% CPU utilization!  strace shows plenty of semop()s
as expected, but also a fair number of delaying select() calls
from s_lock().  I saw about one select for every hundred semops,
but that was still enough to have 30 or so selects per second
(per process).  Since each select sleeps 10msec, that's about a
third of the potential CPU usage being discarded in spinlock delays.

(Even more interesting, this test case behaves much worse on Opteron
than on other SMP boxes.  I had previously done similar tests on
a 4-way Xeon with similar-vintage kernel, and saw context swap rates
over 140K/sec with about 70% CPU load.  I don't have test_run_small
results from that session, but clearly the spinlock code is behaving
worse on the Opterons.)

CVS tip eliminates essentially all the semops, as we hoped, but
scalability is still not great:

N, runtime: 1 36s   2 61s   4 105s  8 198s

At 4 processes, CPU use is only about 30%; the rate of select() delays
has gone up to 50/sec or more, so even though we fixed the bufmgr itself,
we are still losing a lot of potential CPU cycles to spinlock waits.

(Note: I later realized that the CVS-tip code was accidentally built
with -O1, not -O2 as the 8.0.3 code had been, so these two sets of
numbers are not completely comparable: with -O2 the single-query time
drops to 31s in CVS tip.  But the main point is all the time that's
disappearing into spinlock delays.)

I have some proposals for fixing this stuff, which I'll put into a
separate message since this one is pretty long already.  What I wanted
to accomplish here is to document the test case and the problem, and
that's done.

Anyone have SMP boxes of other types that they can try this on?

regards, tom lane

drop table test_data;

create table test_data(f1 int);

insert into test_data values (random() * 100);
insert into test_data select random() * 100 from test_data;
insert into 

[HACKERS] Spinlocks, yet again: analysis and proposed patches

2005-09-11 Thread Tom Lane
The test case I just posted shows that our spinlock code, which we had
thought largely done, is once again becoming a performance bottleneck.
It's time to resurrect some of the ideas we kicked around in early
2002, and didn't pursue because we decided spinlocks weren't our major
performance problem.

I started investigating this by trying to understand why the futex
spinlock patch had helped the Red Hat customer I mentioned, when it
hadn't done anything much in last year's experiments.  It turns out
that this customer was using an 8-way Opteron, and the futex patch
helps a lot more on that architecture than others.  Some random
experimentation eventually turned up part of the reason: the x86 TAS
assembly code we are using is pessimal for x86_64.  Specifically,
s_lock.h uses this for both architectures:

/* Use a non-locking test before asserting the bus lock */
__asm__ __volatile__(
cmpb$0,%1\n
jne1f\n
lock \n
xchgb%0,%1   \n
1: \n

and it turns out that deleting the cmpb and jne makes things go
significantly faster on Opterons.  So the non-locking test is
not a win at all on x86_64.  As best I can tell from testing,
removing it wins because it causes the rate of spinlock delays
to drop significantly.  Why this should be is not completely
obvious.  I speculate that the Opterons are capable of pulling
in a copy of the cache line that contains the spinlock and then
running 100 iterations of the cmpb test without ever noticing
that the processor holding the lock has now released it.  Without
the cmpb, they are forced by the locking xchgb test to actually
look at the real state of the spinlock each time.

I kinda suspect that the cmpb test is a no-op or loss on all
Intelish processors: it can only be a win if you expect a lot
of contention for the spin lock, but in percentage terms we still
have a very low conflict rate, so in most executions of the TAS
macro, the cmpb is just wasted cycles.  Bottom line: we definitely
don't want it for x86_64, and maybe not at all, but we need more
research to decide the latter.

The second reason that the futex patch is helping is that when
a spinlock delay does occur, it allows the delaying process to be
awoken almost immediately, rather than delaying 10 msec or more
as the existing code does.  However, given that we are only expecting
the spinlock to be held for a couple dozen instructions, using the
kernel futex mechanism is huge overkill --- the in-kernel overhead
to manage the futex state is almost certainly several orders of
magnitude more than the delay we actually want.

I looked into several other methods of doing the spinlock delay
instead.  I think all of these were suggested at one point or
another in our earlier discussions of spinlocks:

1. Use sched_yield() if available: it does just what we want,
ie, yield the processor without forcing any useless time delay
before we can be rescheduled.  This doesn't exist everywhere
but it exists in recent Linuxen, so I tried it.  It made for a
beautiful improvement in my test case numbers: CPU utilization
went to 100% and the context swap rate to almost nil.  Unfortunately,
I also saw fairly frequent stuck spinlock panics when running
more queries than there were processors --- this despite increasing
NUM_DELAYS to 1 in s_lock.c.  So I don't trust sched_yield
anymore.  Whatever it's doing in Linux 2.6 isn't what you'd expect.
(I speculate that it's set up to only yield the processor to other
processes already affiliated to that processor.  In any case, it
is definitely capable of getting through 1 yields without
running the guy who's holding the spinlock.)

2. Reduce the length of the select() delays in s_lock.  The current code
delays in quanta of 10msec, but on recent Linuxen (and I think other
platforms too) the scheduling quantum is 1msec, so we can get the
processor back faster if we ask for a 1msec delay.  I tried this and it
is definitely a win on Linux; it doesn't seem to hurt anything on older
systems either, they just round the delay up to 10msec like before.
So I think we should do this, even though it's only a partial answer.

3. Modify the spin loop to do a little more work between TAS() tests.
In the existing code there are only about half a dozen instructions
total in the normal spin loop, most of them very fast, with the result
that the spinning processor does its best to monopolize the system bus
with locked probe instructions.  This is obviously not good, as it'll
interfere with the ability of the spinlock holder to complete its work
and release the lock.  (The bulk of the spinlock uses are for LWLocks,
and with the current data structures the LWLock's own state is usually
going to be in the same cache line as the spinlock proper, so that
cache grabs for the spinlock hurt the LWLock state updating as well
as the eventual spinlock release.)  I modified the code to use an
integer modulo operation as 

Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches

2005-09-11 Thread Kurt Roeckx
On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
 
 I kinda suspect that the cmpb test is a no-op or loss on all
 Intelish processors: it can only be a win if you expect a lot
 of contention for the spin lock, but in percentage terms we still
 have a very low conflict rate, so in most executions of the TAS
 macro, the cmpb is just wasted cycles.  Bottom line: we definitely
 don't want it for x86_64, and maybe not at all, but we need more
 research to decide the latter.

I think an important question is wether this is for x86_64 in
general, of opteron specific.  It could be that it's not the same
on Intel's EM64Ts.

One reason this might behave differently for opterons is that
it's a cc-NUMA instead of the normal SMP.

Something else to consider is the OS you're using.  I've been
told that Linux isn't that good in NUMA and FreeBSD might be
better.


Kurt


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Spinlocks, yet again: a new test case

2005-09-11 Thread Gavin Sherry
On Sun, 11 Sep 2005, Tom Lane wrote:

 The files for the updated test case are attached if anyone else wants
 to try it.  They are:

   test_setup.sql  Run this to create the test tables

   test_run.sqlA version of the test query that will run
   pretty nearly forever (until you kill it)

   test_run_small.sql  A one-minute-or-so test query

   startn.sh   A shell script to run N copies of
   test_run_small.sql in parallel.

 You need this one-time setup:

   createdb test
   psql -f test_setup.sql test

 To perform testing, do time ./startn.sh N for N=1,2,... up to however


With HEAD on a 4-way Xeon, I'm seeing:

1: 56  2: 120 4: 230

Removing the the non-locking test and branch, which you suggested
down-thread:

1: 55: 2: 111: 4: 207

--

Gavin

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] statement logging / extended query protocol issues

2005-09-11 Thread Oliver Jowett
Simon Riggs wrote:

 Are we sure there is just 3 cases?

I haven't exhaustively checked, but I think those are the main cases.

 Even if case (3) is not that common, I still want to know it is
 occurring, to see what effect or overhead it has.

I don't want it to be more verbose than the other cases when I set
log_statement = all.

 We'll only see the output for case (3) when someone has programmed
 things that way by using setFetchSize.

Can we put extra output in this case into log_statement = verbose only
please?

 (1)
 jdbc parse
 jdbc bind
 jdbc execute
 LOG:  statement: SELECT * from pg_proc

 Notice that the parse of the unnamed statement does *not* now generate a
 log record.

What about the syntax error case?

 (2)
 jdbc parse S_1
 LOG:  statement: PREPARE S_1 AS SELECT * from pg_proc
 (perhaps this should be logged at BIND time, just like the
 optimization?)
 
 jdbc bind S_1
 jdbc execute
 LOG:  statement: EXECUTE unnamed [PREPARE:  SELECT * from pg_proc]

I do not like logging queries that the driver never sent (the driver
sends neither PREPARE nor EXECUTE).

I also don't see why it's useful to log the statement and portal names.

Can we reword this to what I suggested previously?

  LOG: parse statement: SELECT * from pg_proc
  LOG: execute statement: SELECT * from pg_proc

 (3)
 jdbc prepare S_2
 LOG:  statement: PREPARE S_2 AS SELECT * from pg_proc
 
 jdbc bind S_2 to C_2
 jdbc execute C_2
 LOG:  statement: EXECUTE C_2 ROWS 42 [PREPARE:  SELECT * from pg_proc]
 jdbc next (after cache has run out on 42nd row)
 v3 protocol sends E for Execute, execution halts at 49 rows for this set
 of bind parameters
 LOG:  statement: FETCH C_2 ROWS 7

Again I do not like logging synthetic queries that the driver never sent
(PREPARE / EXECUTE / FETCH). BTW, if you do it this way, you could get
the bizarre PREPARE S_2 AS PREPARE xyz AS SELECT .. result if the
application used PREPARE itself.

I think that logging the second and subsequent Executes is not normally
useful and shouldn't happen when log_statement = all. In that case you
don't need to log the portal name either.

So for the normal case:

  LOG: parse statement: SELECT * from pg_proc
  LOG: execute statement: SELECT * from pg_proc

and for the verbose case perhaps something like:

  LOG: parse statement: SELECT * from pg_proc
  LOG: execute statement (C_2, 42 rows): SELECT * from pg_proc
  LOG: fetch statement results (C_2, 7 rows)

-O

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches

2005-09-11 Thread Tom Lane
Kurt Roeckx [EMAIL PROTECTED] writes:
 On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
 I kinda suspect that the cmpb test is a no-op or loss on all
 Intelish processors:

 I think an important question is wether this is for x86_64 in
 general, of opteron specific.  It could be that it's not the same
 on Intel's EM64Ts.

Good point --- anyone have one to try?

 Something else to consider is the OS you're using.  I've been
 told that Linux isn't that good in NUMA and FreeBSD might be
 better.

It's hard to see how the OS could affect behavior at the level of
processor cache operations --- unless they did something truly
spectacularly stupid, like mark main memory non-cacheable.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches

2005-09-11 Thread Stephen Frost
* Tom Lane ([EMAIL PROTECTED]) wrote:
 Kurt Roeckx [EMAIL PROTECTED] writes:
  On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
  I kinda suspect that the cmpb test is a no-op or loss on all
  Intelish processors:
 
  I think an important question is wether this is for x86_64 in
  general, of opteron specific.  It could be that it's not the same
  on Intel's EM64Ts.
 
 Good point --- anyone have one to try?

I've got one I can test on.  I need to upgrade the kernel and some other
things on it though (it's running 2.6.8 atm, and an older
Debian/unstable which I should probably bring up to current).

I'll work on it starting now and post results once I get some.

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] -fPIC

2005-09-11 Thread Greg Stark
Peter Eisentraut [EMAIL PROTECTED] writes:

  The reason for -fpic vs -fPIC (on the machines where it makes any
  difference at all) is that the former is faster.
 
 I don't doubt that, but out of curiosity, considering that everyone else 
 is using libtool, and libtool always uses -fPIC, what kind of impact 
 does this *really* have?

Incidentally, Debian uses -fPIC as a matter of policy everywhere too. I think
the problem is that it's hard to know in advance whether -fpic is going to
cause a problem and mixing -fpic and -fPIC libraries is a problem. So it's
safer to just compile all the libraries with -fPIC.

-- 
greg


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


[HACKERS] counting disk access from index seek operation -- how to?

2005-09-11 Thread huaxin zhang
I need a way to tell how many pages loaded from disk for a particular
index seek operation.

What I did is to set a global flag to true before calling the
following statement
(inside index_getnext() in /backend/access/indexam.c)

  found = DatumGetBool(FunctionCall2(scan-fn_getnext,
   PointerGetDatum(scan),
  
Int32GetDatum(direction)));


then for each access to disk, I increment a counter until the above call is
finished and set back the global flag.

The number of page IOs is not even matching from what I got from 
select * from pg_stat_all_indexes  ---  (is there anything I need to
set in postgresql.conf?)

also, the number of paged IOs for a given index seek is always less
than the total
page loads. Is it because of the statement (following the above in 
/backend/access/indexam.c)?

if (heap_release_fetch(scan-heapRelation, scan-xs_snapshot,  

  heapTuple, scan-xs_cbuf, true,
   scan-xs_pgstat_info))

If I am running in a single user mode, is there a way to avoid using
extra page IO in
the above statement? It seems to me the extra page IO is caused by
comparing snapshots...

thanks

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] -fPIC

2005-09-11 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 ... mixing -fpic and -fPIC libraries is a problem.

Is it?  I would think having two options would be essentially unworkable
if so.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches

2005-09-11 Thread Stephen Frost
* Tom Lane ([EMAIL PROTECTED]) wrote:
 My proposal therefore is to do #2, #3, and #4, and to modify the TAS
 assembly code at least on x86_64.  Together, these changes bring
 my test case on a 4-way Opteron from
 
 N, runtime:   1 36s   2 61s   4 105s  8 198s

em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12:

N, runtime: 1 31s   2 47s   4 86s   8 159s

 to
 
 N, runtime:   1 35s   2 48s   4 58s   8 105s

N, runtime: 1 32s   2 53s   4 90s   8 169s

CPU utilization is definitely higher when running with the patch though.
Hope this helps, happy to do additional testing if necessary.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches

2005-09-11 Thread Tom Lane
Stephen Frost [EMAIL PROTECTED] writes:
 em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12:

 N, runtime:   1 31s   2 47s   4 86s   8 159s

 N, runtime:   1 32s   2 53s   4 90s   8 169s

Er, which (or both) of the two patches did you apply here?

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Spinlocks, yet again: a new test case

2005-09-11 Thread Mark Kirkwood

Tom Lane wrote:

Anyone have SMP boxes of other types that they can try this on?



For those of us running antiques:

2x PIII 1Ghz 2G
Running on FreeBSD 6.0beta4 (non-debug kernel)

8.0.3:

N runtime: 1 158s  2 271s  4 567s

8.1beta1 (2005-08-28):

N runtime: 1 85s   2 139s  4 220s


Wow - a huge difference between the versions, I thought the handbrake 
must be on for 8.0.3, but apparently not, all relevant parameters are 
the same...


regards

Mark





---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches

2005-09-11 Thread Stephen Frost
* Tom Lane ([EMAIL PROTECTED]) wrote:
 Stephen Frost [EMAIL PROTECTED] writes:
  em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12:
 
  N, runtime: 1 31s   2 47s   4 86s   8 159s
 
  N, runtime: 1 32s   2 53s   4 90s   8 169s
 
 Er, which (or both) of the two patches did you apply here?

Applied both, sorry that wasn't clear.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches

2005-09-11 Thread Tom Lane
Stephen Frost [EMAIL PROTECTED] writes:
 * Tom Lane ([EMAIL PROTECTED]) wrote:
 Er, which (or both) of the two patches did you apply here?

 Applied both, sorry that wasn't clear.

Thanks.  If you've got the time, could you try the two patches
separately and see what you get?

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches

2005-09-11 Thread Qingqing Zhou

Tom Lane [EMAIL PROTECTED] wrote

 My proposal therefore is to do #2, #3, and #4, and to modify the TAS
 assembly code at least on x86_64.  Together, these changes bring
 my test case on a 4-way Opteron from


Some changes are based on tests and heuristics, so can we make them into the 
configure script so the choice could be made automatically?

Regards,
Qingqing 



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches

2005-09-11 Thread Stephen Frost
* Tom Lane ([EMAIL PROTECTED]) wrote:
 Stephen Frost [EMAIL PROTECTED] writes:
  * Tom Lane ([EMAIL PROTECTED]) wrote:
  Er, which (or both) of the two patches did you apply here?
 
  Applied both, sorry that wasn't clear.
 
 Thanks.  If you've got the time, could you try the two patches
 separately and see what you get?

Sure.

CVS Head:

N, runtime: 1 31s   2 47s   4 86s   8 159s

With just slock-no-cmpb.patch:

N, runtime: 1 32s   2 39s   4 82s   8 167s

With just spin-delay.patch

N, runtime: 1 32s   2 52s   4 94s   8 164s

With both:

N, runtime: 1 32s   2 53s   4 90s   8 169s

Hope that helps,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches

2005-09-11 Thread Greg Stark

Tom Lane [EMAIL PROTECTED] writes:

  Something else to consider is the OS you're using.  I've been
  told that Linux isn't that good in NUMA and FreeBSD might be
  better.
 
 It's hard to see how the OS could affect behavior at the level of
 processor cache operations --- unless they did something truly
 spectacularly stupid, like mark main memory non-cacheable.

Well it could schedule processes on processors in ways that force less than
optimal memory usage patterns.

But maybe you should tell the Altix folk with their 32-processor 384Gb NUMA
machines what you've been told about Linux not being that good in NUMA.
Really, these kind of cargo cult anecdotes are pretty pointless.

-- 
greg


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches

2005-09-11 Thread Stephen Frost
* Stephen Frost ([EMAIL PROTECTED]) wrote:
  Thanks.  If you've got the time, could you try the two patches
  separately and see what you get?
 
 Sure.
[...]

Just to be clear- this was from a completely default 'make install' 
using the Debian configure options from 8.0.3 (which aren't that 
particularly interesting really- nls, integer-datetimes, debug, 
disable-rpath, tcl, perl, python, pam, krb5, openssl, gnu-ld, 
enable-thread-safety).  If it'd be useful for the test I can adjust 
whichever parameters are appropriate (work_mem, cache_size, etc).

There's absolutely nothing else going on except for these test, a few
shells, top, etc, on the box.

Enjoy,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches

2005-09-11 Thread Tom Lane
Qingqing Zhou [EMAIL PROTECTED] writes:
 Some changes are based on tests and heuristics, so can we make them into the 
 configure script so the choice could be made automatically?

It's a bit premature to propose that, when we don't yet know if the
suggested changes are a win or loss under any conditions, let alone
how to automatically tell the difference.  Right now we are in the
find-out-the-facts mode ...

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches

2005-09-11 Thread Mark Kirkwood

Tom Lane wrote:


I attach two proposed patches: the first removes the cmpb/jne from
the x86 TAS assembly code, and the second one does the s_lock changes
enumerated as points #2, #3, #4.  The first one in particular needs
more testing to see if it hurts performance on any non-Opteron x86
chips.  (If so, we'd just make it conditional to x86_64.)



2x PIII 1G 2G Freebsd 6.0Beta4

8.1beta1 (2005-08-28):

N runtime: 1 85s   2 139s  4 220s


8.1beta1 (2005-08-28) + patch 1 (s_lock.h only)

N runtime: 1 89s   2 137s  4 229s


8.1beta1 (2005-08-28) + patch 2

N runtime: 1 84s   2 108s  4 214s



Observe the interesting little speed improvement for patch 2 with 2 
processes (seems to be repeatable).


Let me know if you want to see vmstat output for any of these.

regards

Mark



---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org