Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-10-07 Thread Andrew Sullivan
On Sat, Sep 18, 2004 at 06:06:05AM -0400, Jan Wieck wrote:
 On 9/17/2004 7:32 PM, Tom Lane wrote:
 over time.  I'm wondering about DNS lookup results in particular.
 
 Except for one localhost, one /tmp/.s.PGSQL... and the 543x lookup 
 during the postmaster start, all lookups are IP addresses with 
 AI_NUMERICHOST set. And we have checked with tcpdump that the box really 
 does not issue DNS lookups.

Just for the sake of posterity, it appears that this is actually a
libc problem on AIX.  In particular, there's a patched libc fileset
which was released to solve a problem where getaddrinfo() returns an
error on valid input.  IBM's AIX support was unwilling to give us
libraries with debug symbols built in, but they did point me at a new
fileset for libc.  We've been running a test load which fairly
consistently produced sig 11s before, and haven't seen one since.  So
we don't have a perfect explanation, but it looks like this is the
cause.

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]
The plural of anecdote is not data.
--Roger Brinner

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-09-20 Thread Andrew Sullivan
On Fri, Sep 17, 2004 at 07:32:30PM -0400, Tom Lane wrote:

 involve consulting DNS?  If so, try to correlate the crash probability
 with changes in your DNS zone contents ...

No changes.  The systems in question have no access to DNS. 
/etc/hosts only.

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]
The fact that technology doesn't work is no bar to success in the marketplace.
--Philip Greenspun



Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-09-19 Thread Jan Wieck
On 9/17/2004 7:32 PM, Tom Lane wrote:
 Jan Wieck [EMAIL PROTECTED] writes:
 The problem comes and goes. So either I can cause a coredump just on the 
 snap by running a shellscript that does 100 psql -c "select version()" 
 calls, or it is next to impossible to crash it at all.

 Hmm, that's really bizarre.  It seems like the only satisfactory
 explanation for that would involve some external condition that varies
 over time.  I'm wondering about DNS lookup results in particular.
 What values are you asking getaddrinfo to look up, and might those
 involve consulting DNS?  If so, try to correlate the crash probability
 with changes in your DNS zone contents ...
 			regards, tom lane

Except for one localhost, one /tmp/.s.PGSQL... and the 543x lookup 
during the postmaster start, all lookups are IP addresses with 
AI_NUMERICHOST set. And we have checked with tcpdump that the box really 
does not issue DNS lookups.

Jan
--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #


Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-09-17 Thread Jan Wieck
On 4/19/2004 1:18 PM, Jan Wieck wrote:
 Tom Lane wrote:
 Andrew Sullivan [EMAIL PROTECTED] writes:
 On Thu, Apr 15, 2004 at 07:52:59PM -0400, Tom Lane wrote:
 I can see from your trace that you are using the getaddrinfo code from
 libc, but where is configure finding a header that declares struct
 addrinfo?

 Hrm, I can't seem to tell.  I see this in config.log, but it isn't
 telling me where it found it.  Am I looking in the wrong place?

 What you'd need to do is determine which system headers are being
 #include'd by that config test, and then look through them to find
 struct addrinfo.

 Judging by gdb's structure printing, the crashed postgres instance used 
 the non-43 compatible 64-bit version of the structure. What I don't 
 really get is that the whole exercise seems to have scribbled over the 
 stack. The hints pointer originating from the on-stack structure in 
 parse_hba is somehow pointing into the blue.

This issue is still not closed and it is hitting us more and more. So I 
would like to add some more of what we have done in the hope to get some 
more ideas.

The "scribbled over the stack" part turned out not to be true. The stack 
dump is fine if compiled with -O0. The problem persists in 7.4.5.

I have tried to isolate the getaddrinfo() calls by writing a program 
that does the getaddrinfo() calls done during PM startup, then keeps 
100-200 child processes in a fork()/wait() loop and every child process 
does the same getaddrinfo() calls a starting backend would perform 
during the pg_hba parsing. This program does not crash.
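
A minimal sketch of that kind of isolation harness, for readers who want to
try the same experiment. This is not the program described above: the names
numeric_lookup() and run_children() are hypothetical, it performs only one
AI_NUMERICHOST lookup per child, and it runs the children once instead of
keeping 100-200 of them cycling.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1       /* make <netdb.h> declare getaddrinfo() */
#endif

#include <assert.h>
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * One numeric-only lookup, similar in spirit to what a starting backend
 * does for a pg_hba.conf address.  Returns the getaddrinfo() result code.
 */
int numeric_lookup(const char *addr)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int rc;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_flags = AI_NUMERICHOST;    /* never consult a resolver */

    rc = getaddrinfo(addr, NULL, &hints, &res);
    if (rc == 0)
        freeaddrinfo(res);
    return rc;
}

/*
 * Fork nchildren processes; each does the lookup and exits 0 on success.
 * Returns how many children exited normally with status 0.  A child that
 * is killed by a signal (SIGSEGV, for instance) is not counted.
 */
int run_children(int nchildren, const char *addr)
{
    int started = 0;
    int ok = 0;

    assert(nchildren > 0);

    while (started < nchildren)
    {
        pid_t pid = fork();

        if (pid < 0)
            break;              /* fork failed: stop spawning */
        if (pid == 0)
            _exit(numeric_lookup(addr) == 0 ? 0 : 1);
        started++;
    }

    /* Reap exactly the children that were started. */
    while (started-- > 0)
    {
        int status;

        if (wait(&status) > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}
```

Calling run_children(200, "10.0.0.1") in a loop from main() and reporting
whenever the return value is below 200 would approximate the fork()/wait()
loop; the address here is only a placeholder.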

So far we did not get a libc from IBM that has debug symbols. So I only 
know that getaddrinfo() calls getaddrinfo2(), which calls memmove() and 
that one crashes with a SIGSEGV. All the call arguments to getaddrinfo() 
look absolutely fine. I hope to get that libc any time soon to see what 
exactly that memmove tries to access.

The problem comes and goes. So either I can cause a coredump just on the 
snap by running a shellscript that does 100 psql -c "select version()" 
calls, or it is next to impossible to crash it at all.

There are numerous reports on the net about getaddrinfo() causing grief 
on AIX, and it seems to be IPv6 related. For the moment we intend to 
replace the call with a slightly limited implementation using 
inet_aton() in getaddrinfo_all() whenever AI_NUMERICHOST is set. This 
will lose us the IPv6 support, as hba.c can't parse those pg_hba.conf 
lines any more. So it is not a satisfactory workaround for PostgreSQL. 
But I will make that patch available tomorrow night in the event someone 
else finds it useful.
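
For illustration only, here is roughly what an inet_aton()-based stand-in
could look like. This is a sketch under assumptions and not the patch
mentioned above: the function names numeric_ipv4_addrinfo() and
numeric_ipv4_freeaddrinfo() are hypothetical, service names are ignored, and
only one result entry is produced.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1       /* expose inet_aton() on glibc */
#endif

#include <arpa/inet.h>
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical stand-in for getaddrinfo() when AI_NUMERICHOST is set.
 * Accepts only IPv4 numeric addresses as understood by inet_aton(), so
 * IPv6 literals are rejected.  Returns 0 or an EAI_* code.
 */
int numeric_ipv4_addrinfo(const char *host, const struct addrinfo *hints,
                          struct addrinfo **result)
{
    struct in_addr addr;
    struct addrinfo *ai;
    struct sockaddr_in *sin;

    assert(result != NULL);
    *result = NULL;

    if (host == NULL || inet_aton(host, &addr) == 0)
        return EAI_NONAME;      /* not a numeric IPv4 address */

    /* One zeroed allocation holds both the addrinfo and its sockaddr. */
    ai = calloc(1, sizeof(*ai) + sizeof(*sin));
    if (ai == NULL)
        return EAI_MEMORY;

    sin = (struct sockaddr_in *) (ai + 1);
    sin->sin_family = AF_INET;
    sin->sin_addr = addr;

    ai->ai_flags = AI_NUMERICHOST;
    ai->ai_family = AF_INET;
    ai->ai_socktype = hints ? hints->ai_socktype : 0;
    ai->ai_protocol = hints ? hints->ai_protocol : 0;
    ai->ai_addrlen = sizeof(*sin);
    ai->ai_addr = (struct sockaddr *) sin;

    *result = ai;
    return 0;
}

/*
 * Results of numeric_ipv4_addrinfo() must be released with this function
 * and not with freeaddrinfo(), because libc did not allocate them.
 */
void numeric_ipv4_freeaddrinfo(struct addrinfo *ai)
{
    free(ai);
}
```

The sketch makes the trade-off visible: a line such as "::1" fails with
EAI_NONAME, which is the loss of IPv6 support described above.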

Jan
--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #


Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-09-17 Thread Tom Lane
Jan Wieck [EMAIL PROTECTED] writes:
 The problem comes and goes. So either I can cause a coredump just on the 
 snap by running a shellscript that does 100 psql -c "select version()" 
 calls, or it is next to impossible to crash it at all.

Hmm, that's really bizarre.  It seems like the only satisfactory
explanation for that would involve some external condition that varies
over time.  I'm wondering about DNS lookup results in particular.
What values are you asking getaddrinfo to look up, and might those
involve consulting DNS?  If so, try to correlate the crash probability
with changes in your DNS zone contents ...

regards, tom lane



Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-06-18 Thread Zeugswetter Andreas SB SD

  My only guess is that getaddrinfo in your libc has a bug somehow that is
  corrupting the stack (hence the improper backtrace), then crashing.
 
 It could be libc on AIX, I suppose, but it strikes me as sort of odd
 that nobody else ever sees this.  Unless nobody else is using AIX
 5.1, which is of course possible.

I can confirm that AIX 4.3.2 getaddrinfo is at least a bit *funny*. 
getaddrinfo seems not to honour nsorder and only does DNS, even though the manual says:
 Should there be any discrepancies between this description and the POSIX description,
 the POSIX description takes precedence.
The function does return multiple entries; often the first is not the best.

Log is:
LOG:  could not translate service 5432 to address: Host not found
WARNING:  could not create listen socket for *
LOG:  could not bind socket for statistics collector: Can't assign requested address
LOG:  disabling statistics collector for lack of working socket

This area probably needs a fix/workaround on AIX :-(
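
To see how many entries getaddrinfo() hands back for the listen-socket
lookup, and in what order, something along these lines can help. It is a
sketch only: count_listen_addresses() is a hypothetical helper, and the
hints used here (AI_PASSIVE, SOCK_STREAM, AF_UNSPEC) are an assumption about
what a wildcard listen lookup looks like, not a copy of the postmaster code.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1       /* make <netdb.h> declare getaddrinfo() */
#endif

#include <assert.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Look up the wildcard listen addresses for a service and walk the
 * ai_next list.  Each entry is printed to "out" (if not NULL) in the
 * order libc returned it.  Returns the number of entries, or -1 if
 * getaddrinfo() itself failed.
 */
int count_listen_addresses(const char *service, FILE *out)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    struct addrinfo *ai;
    char host[128];
    int count = 0;

    assert(service != NULL);

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_PASSIVE;        /* NULL host means "any address" */

    if (getaddrinfo(NULL, service, &hints, &res) != 0)
        return -1;

    for (ai = res; ai != NULL; ai = ai->ai_next)
    {
        /* NI_NUMERICHOST keeps this from doing a reverse lookup. */
        if (out != NULL &&
            getnameinfo(ai->ai_addr, ai->ai_addrlen, host, sizeof(host),
                        NULL, 0, NI_NUMERICHOST) == 0)
            fprintf(out, "entry %d: family=%d address=%s\n",
                    count, ai->ai_family, host);
        count++;
    }

    freeaddrinfo(res);
    return count;
}
```

Calling count_listen_addresses("5432", stderr) on the affected box and on a
healthy one would show whether the two disagree about the number or order
of entries.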

Andreas



Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-06-18 Thread Andrew Sullivan
On Thu, Jun 17, 2004 at 06:06:12PM -0400, Bruce Momjian wrote:
 
 When you say init directory, what do you mean?  /bin?

No.  The place where the init scripts (which cause postgres to start)
live.

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]
In the future this spectacle of the middle classes shocking the avant-
garde will probably become the textbook definition of Postmodernism. 
--Brad Holland



Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-06-18 Thread Christopher Browne
Quoth [EMAIL PROTECTED] (Bruce Momjian):
 Andrew Sullivan wrote:
 On Thu, Jun 17, 2004 at 01:12:10PM -0400, Bruce Momjian wrote:
  
  Well, the bad news is that this backtrace isn't very useful. 
 
 No kidding.  It's pretty frustrating.
 
  My only guess is that getaddrinfo in your libc has a bug somehow that is
  corrupting the stack (hence the improper backtrace), then crashing.
 
 It could be libc on AIX, I suppose, but it strikes me as sort of odd
 that nobody else ever sees this.  Unless nobody else is using AIX
 5.1, which is of course possible.
 
 One hypothesis is that this is happening at start up time (this
 core dump didn't show up in the data/ area, but in the init
 directory, however, which makes that theory a little suspect).

 When you say init directory, what do you mean?  /bin?

No, it's a directory with various init-like scripts.

In premium hosting environments, root access is restricted to the
site operators, so PostgreSQL doesn't get started up from /etc/init.d.

Instead, PostgreSQL and other services get invoked by custom init
scripts in a custom init directory.
-- 
let name=cbbrowne and tld=ntlug.org in name ^ @ ^ tld;;
http://www.ntlug.org/~cbbrowne/sap.html
I am a bomb technician. If you see me running, try to keep up...



Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-06-17 Thread Andrew Sullivan
On Mon, May 10, 2004 at 11:59:40AM -0400, Andrew Sullivan wrote:
 
 On the weekend, we ran a set of tests on the offending system to see
 if we could re-create it.  We set up the triggering conditions just
 as they'd been when it happened, and alas, no segfault.  So although
 this was pretty much regularly reproducible when it actually
 happened, it's now a note to the Journal of Irreproducible Results. 
 I hate when that happens.

I hate it even more when the symptom comes back inexplicably.  We had
it again.  For the record, here's what gdb says (there are some
high-bit characters in here; dunno how they'll come through in mail):

(gdb) bt
#0  0xd01d7778 in memmove () from /usr/lib/libc.a(shr.o)
#1  0xd0326e1c in getaddrinfo2 () from /usr/lib/libc.a(shr.o)
#2  0xd0327b6c in getaddrinfo () from /usr/lib/libc.a(shr.o)
#3  0x10058668 in WriteControlFile () at xlog.c:2121
#4  0x101f8f78 in init_execution_state (src=0x202acd8c , 
argOidVect=0x7308710b, nargs=4, rettype=539520040, haspolyarg=-104 '\230')
at functions.c:121
#5  0x101f9304 in init_sql_fcache (finfo=0xdeadbeef) at functions.c:250
#6  0x101fa57c in set_tz (tz=0x7308710b Address 0x7308710b out of bounds)
at variable.c:261
#7  0x101fa9a4 in assign_timezone (value=0x202ad398 , doit=-1 'ÿ', 
interactive=-8 'ø') at variable.c:584
#8  0x1000466c in PostgresMain (argc=1, argv=0x2002cf38, username=0x1 )
at postgres.c:2560
#9  0x100040b0 in PostgresMain (argc=537240896, argv=0xdeadbeef, 
username=0xdeadbeef Address 0xdeadbeef out of bounds) at postgres.c:2307
#10 0x10002530 in exec_parse_message (query_string=0x2a24 , 
stmt_name=0x5 , paramTypes=0x0, numParams=0) at postgres.c:1216
#11 0x10001f84 in exec_simple_query (
query_string=0x2005a540 'ÿ' repeats 40 times) at postgres.c:980
#12 0x15f0 in main (argc=1, argv=0xdeadbeef) at main.c:228


-- 
Andrew Sullivan  | [EMAIL PROTECTED]
I remember when computers were frustrating because they *did* exactly what 
you told them to.  That actually seems sort of quaint now.
--J.D. Baldwin



Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-06-17 Thread Bruce Momjian
Andrew Sullivan wrote:
 On Mon, May 10, 2004 at 11:59:40AM -0400, Andrew Sullivan wrote:
  
  On the weekend, we ran a set of tests on the offending system to see
  if we could re-create it.  We set up the triggering conditions just
  as they'd been when it happened, and alas, no segfault.  So although
  this was pretty much regularly reproducible when it actually
  happened, it's now a note to the Journal of Irreproducible Results. 
  I hate when that happens.
 
 I hate it even more when the symptom comes back inexplicably.  We had
 it again.  For the record, here's what gdb says (there are some
 high-bit characters in here; dunno how they'll come through in mail):
 
 (gdb) bt
 #0  0xd01d7778 in memmove () from /usr/lib/libc.a(shr.o)
 #1  0xd0326e1c in getaddrinfo2 () from /usr/lib/libc.a(shr.o)
 #2  0xd0327b6c in getaddrinfo () from /usr/lib/libc.a(shr.o)
 #3  0x10058668 in WriteControlFile () at xlog.c:2121
 #4  0x101f8f78 in init_execution_state (src=0x202acd8c , 
 argOidVect=0x7308710b, nargs=4, rettype=539520040, haspolyarg=-104 '\230')
 at functions.c:121
 #5  0x101f9304 in init_sql_fcache (finfo=0xdeadbeef) at functions.c:250
 #6  0x101fa57c in set_tz (tz=0x7308710b Address 0x7308710b out of bounds)
 at variable.c:261
 #7  0x101fa9a4 in assign_timezone (value=0x202ad398 , doit=-1 'ÿ', 
 interactive=-8 'ø') at variable.c:584
 #8  0x1000466c in PostgresMain (argc=1, argv=0x2002cf38, username=0x1 )
 at postgres.c:2560
 #9  0x100040b0 in PostgresMain (argc=537240896, argv=0xdeadbeef, 
 username=0xdeadbeef Address 0xdeadbeef out of bounds) at postgres.c:2307
 #10 0x10002530 in exec_parse_message (query_string=0x2a24 , 
 stmt_name=0x5 , paramTypes=0x0, numParams=0) at postgres.c:1216
 #11 0x10001f84 in exec_simple_query (
 query_string=0x2005a540 'ÿ' repeats 40 times) at postgres.c:980
 #12 0x15f0 in main (argc=1, argv=0xdeadbeef) at main.c:228

Well, the bad news is that this backtrace isn't very useful.  It states
the query you sent was 40 0xff's, and it says you called
assign_timezone(), which called set_tz(), which then shows it calling
init_sql_fcache() (impossible), which later calls WriteControlFile()
(impossible), which calls getaddrinfo() (impossible).

My only guess is that getaddrinfo() in your libc has a bug somehow that is
corrupting the stack (hence the improper backtrace), then crashing.

As to the cause, I assume this is not reproducible, right?  Is there
something unusual about your DNS setup or something that might have
changed recently that caused getaddrinfo() to do something new?

Of course, the memmove() might be causing the problem and the
getaddrinfo() is a corrupt part of the backtrace too.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-06-17 Thread Andrew Sullivan
On Thu, Jun 17, 2004 at 01:12:10PM -0400, Bruce Momjian wrote:
 
 Well, the bad news is that this backtrace isn't very useful. 

No kidding.  It's pretty frustrating.

 My only guess is that getaddrinfo in your libc has a bug somehow that is
 corrupting the stack (hence the improper backtrace), then crashing.

It could be libc on AIX, I suppose, but it strikes me as sort of odd
that nobody else ever sees this.  Unless nobody else is using AIX
5.1, which is of course possible.

One hypothesis is that this is happening at start up time (this core
dump didn't show up in the data/ area, but in the init directory,
however, which makes that theory a little suspect).

 As to the cause, I assume this is not reproducible, right?  Is there

Well, it's reproduced itself a few times, but it isn't reproducible at
will, and we have no clue what is causing it.

 something unusual about your DNS setup or something that might have
 changed recently that caused getaddrinfo() to do something new?

Nothing has changed recently, but we started having this not long
after promoting an RS/6000 to production on AIX 5.1.  Before that we
were all-Solaris.  We have never managed to tickle this on a test
machine.  It's pretty tough to guess what might be going on, at least
for me.  If there are any AIX gurus around, I'd sure like to talk to
them.  (I do have a budget to pay such gurus, BTW!)

 Of course, the memmove() might be causing the problem and the
 getaddrinfo is a corrupt part of the backtrace too.

Yeah, which is why it's so frustrating.  If I could see what it was
doing when it did it, I'd be able to tell.  But without knowing why
it's happening, there's no way to sit up for 6 weeks while I wait for
it to happen.

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]
This work was visionary and imaginative, and goes to show that visionary
and imaginative work need not end up well. 
--Dennis Ritchie



Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-06-17 Thread Bruce Momjian
Andrew Sullivan wrote:
 On Thu, Jun 17, 2004 at 01:12:10PM -0400, Bruce Momjian wrote:
  
  Well, the bad news is that this backtrace isn't very useful. 
 
 No kidding.  It's pretty frustrating.
 
  My only guess is that getaddrinfo in your libc has a bug somehow that is
  corrupting the stack (hence the improper backtrace), then crashing.
 
 It could be libc on AIX, I suppose, but it strikes me as sort of odd
 that nobody else ever sees this.  Unless nobody else is using AIX
 5.1, which is of course possible.
 
 One hypothesis is that this is happening at start up time (this core
 dump didn't show up in the data/ area, but in the init directory,
 however, which makes that theory a little suspect).

When you say init directory, what do you mean?  /bin?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-05-10 Thread Andrew Sullivan
On Wed, Apr 28, 2004 at 03:56:55PM -0400, Andrew Sullivan wrote:
 On Mon, Apr 26, 2004 at 03:19:21PM -0400, Bruce Momjian wrote:
  
  Has this been resolved?

 it elsewhere.  I've been trying some alternative approaches to
 causing it today, and so far no luck.

On the weekend, we ran a set of tests on the offending system to see
if we could re-create it.  We set up the triggering conditions just
as they'd been when it happened, and alas, no segfault.  So although
this was pretty much regularly reproducible when it actually
happened, it's now a note to the Journal of Irreproducible Results. 
I hate when that happens.

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]



Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-04-28 Thread Andrew Sullivan
On Mon, Apr 26, 2004 at 03:19:21PM -0400, Bruce Momjian wrote:
 
 Has this been resolved?

Not as far as I know.  Unfortunately, the problem happened in an
environment I Can't Play With, and I haven't been able to reproduce
it elsewhere.  I've been trying some alternative approaches to
causing it today, and so far no luck.

Jan is, AFAIK, similarly mystified about what happened.

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]



Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-04-26 Thread Bruce Momjian

Has this been resolved?

---

Andrew Sullivan wrote:
 On Mon, Apr 19, 2004 at 11:18:07AM -0400, Tom Lane wrote:
  
  What you'd need to do is determine which system headers are being
  #include'd by that config test, and then look through them to find
  struct addrinfo.
 
 Well, I have this in /usr/include/netdb.h:
 
 struct addrinfo {
         int              ai_flags;      /* AI_PASSIVE, AI_CANONNAME, AI_NUMERICHOST */
         int              ai_family;     /* PF_xxx */
         int              ai_socktype;   /* SOCK_xxx */
         int              ai_protocol;   /* 0 or IPPROTO_xxx */
         size_t           ai_addrlen;    /* length of ai_addr */
         char            *ai_canonname;  /* canonical name for hostname */
         struct sockaddr *ai_addr;       /* binary address */
         struct addrinfo *ai_next;       /* next structure in list */
 };
 
 Using the cpp trick that Alvaro Herrera suggested, I see that file
 mentioned in the output, and this a little way along:
 
 struct addrinfo {
         int              ai_flags;
         int              ai_family;
         int              ai_socktype;
         int              ai_protocol;
         size_t           ai_addrlen;
         char            *ai_canonname;
         struct sockaddr *ai_addr;
         struct addrinfo *ai_next;
 };
 
 So it looks like that must be the one.  Dunno if this helps.
 
 A
 
 -- 
 Andrew Sullivan  | [EMAIL PROTECTED]
 
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073



Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-04-19 Thread Tom Lane
Andrew Sullivan [EMAIL PROTECTED] writes:
 On Thu, Apr 15, 2004 at 07:52:59PM -0400, Tom Lane wrote:
 I can see from your trace that you are using the getaddrinfo code from
 libc, but where is configure finding a header that declares struct
 addrinfo?

 Hrm, I can't seem to tell.  I see this in config.log, but it isn't
 telling me where it found it.  Am I looking in the wrong place?

What you'd need to do is determine which system headers are being
#include'd by that config test, and then look through them to find
struct addrinfo.

A shortcut is just to grep through /usr/include and its subdirectories
for addrinfo.  If you only find one definition, then you don't really
need to worry too much.  But if there's more than one you need to
determine which is getting used.

regards, tom lane



Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-04-19 Thread Alvaro Herrera
On Mon, Apr 19, 2004 at 11:18:07AM -0400, Tom Lane wrote:

 A shortcut is just to grep through /usr/include and its subdirectories
 for addrinfo.  If you only find one definition, then you don't really
 need to worry too much.  But if there's more than one you need to
 determine which is getting used.

Maybe an easier way is to examine the output of cpp src/include/c.h.

-- 
Alvaro Herrera (alvherre[a]dcc.uchile.cl)
En las profundidades de nuestro inconsciente hay una obsesiva necesidad
de un universo lógico y coherente. Pero el universo real se halla siempre
un paso más allá de la lógica (Irulan)



Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-04-19 Thread Jan Wieck
Tom Lane wrote:

 Andrew Sullivan [EMAIL PROTECTED] writes:
 On Thu, Apr 15, 2004 at 07:52:59PM -0400, Tom Lane wrote:
 I can see from your trace that you are using the getaddrinfo code from
 libc, but where is configure finding a header that declares struct
 addrinfo?

 Hrm, I can't seem to tell.  I see this in config.log, but it isn't
 telling me where it found it.  Am I looking in the wrong place?

 What you'd need to do is determine which system headers are being
 #include'd by that config test, and then look through them to find
 struct addrinfo.

Judging by gdb's structure printing, the crashed postgres instance used 
the non-43 compatible 64-bit version of the structure. What I don't 
really get is that the whole exercise seems to have scribbled over the 
stack. The hints pointer originating from the on-stack structure in 
parse_hba is somehow pointing into the blue.

Jan

 A shortcut is just to grep through /usr/include and its subdirectories
 for addrinfo.  If you only find one definition, then you don't really
 need to worry too much.  But if there's more than one you need to
 determine which is getting used.
 			regards, tom lane


--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #


Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-04-19 Thread Andrew Sullivan
On Mon, Apr 19, 2004 at 11:18:07AM -0400, Tom Lane wrote:
 
 What you'd need to do is determine which system headers are being
 #include'd by that config test, and then look through them to find
 struct addrinfo.

Well, I have this in /usr/include/netdb.h:

struct addrinfo {
        int              ai_flags;      /* AI_PASSIVE, AI_CANONNAME, AI_NUMERICHOST */
        int              ai_family;     /* PF_xxx */
        int              ai_socktype;   /* SOCK_xxx */
        int              ai_protocol;   /* 0 or IPPROTO_xxx */
        size_t           ai_addrlen;    /* length of ai_addr */
        char            *ai_canonname;  /* canonical name for hostname */
        struct sockaddr *ai_addr;       /* binary address */
        struct addrinfo *ai_next;       /* next structure in list */
};

Using the cpp trick that Alvaro Herrera suggested, I see that file
mentioned in the output, and this a little way along:

struct addrinfo {
        int              ai_flags;
        int              ai_family;
        int              ai_socktype;
        int              ai_protocol;
        size_t           ai_addrlen;
        char            *ai_canonname;
        struct sockaddr *ai_addr;
        struct addrinfo *ai_next;
};

So it looks like that must be the one.  Dunno if this helps.

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]



[HACKERS] signal 11 on AIX: 7.4.2

2004-04-15 Thread Andrew Sullivan
We've had a backend crash with sig 11 during connection.  My guess is
there's something up with (maybe) the IPv6 support on AIX.  I seem to
recall something similar recently, but I can't find the post in the
archives.  Suggestions?


oxrslive=# SELECT version();
                                   version
------------------------------------------------------------------------------
 PostgreSQL 7.4.2 on powerpc-ibm-aix5.1.0.0, compiled by GCC 2.9-aix51-020209
(1 row)

GNU gdb 6.0
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as powerpc-ibm-aix5.1.0.0...
Core was generated by `postgres'.
Program terminated with signal 11, Segmentation fault.
#0  0xd01d7778 in memmove () from /usr/lib/libc.a(shr.o)
(gdb) bt
#0  0xd01d7778 in memmove () from /usr/lib/libc.a(shr.o)
#1  0xd0326e1c in getaddrinfo2 () from /usr/lib/libc.a(shr.o)
#2  0xd0327b6c in getaddrinfo () from /usr/lib/libc.a(shr.o)
#3  0x1005860c in getaddrinfo_all (hostname=0x34e0 , 
servname=0x74696f Address 0x74696f out of bounds, hintp=0xf03a2e80, 
result=0x74696f) at ip.c:78
#4  0x101f9330 in parse_hba (line=0x202ae198, port=0x202a6988, 
found_p=0x2ff1f810 , error_p=0x2ff1f811 ) at hba.c:669
#5  0x101f96bc in check_hba (port=0x202a6988) at hba.c:793
#6  0x101fa934 in hba_getauthmethod (port=0x202b6f3c) at hba.c:1574
#7  0x101fad5c in ClientAuthentication (port=0x202a6988) at auth.c:415
#8  0x10004674 in BackendFork (port=0x202a6988) at postmaster.c:2444
#9  0x100040b8 in BackendStartup (port=0x202a6988) at postmaster.c:2207
#10 0x10002538 in ServerLoop () at postmaster.c:1119
#11 0x10001f8c in PostmasterMain (argc=1, argv=0x20270698) at postmaster.c:897
#12 0x15f0 in main (argc=1, argv=0x2ff22b8c) at main.c:214
(gdb) 

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]



Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-04-15 Thread Andrew Sullivan
On Thu, Apr 15, 2004 at 01:07:33PM -0400, Andrew Sullivan wrote:
 We've had a backend crash with sig 11 during connection.  

By the way, I failed to mention, but sig 11 is segfault on AIX.

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]



Re: [HACKERS] signal 11 on AIX: 7.4.2

2004-04-15 Thread Tom Lane
Andrew Sullivan [EMAIL PROTECTED] writes:
 We've had a backend crash with sig 11 during connection.  My guess is
 there's something up with (maybe) the IPv6 support on AIX.

 (gdb) bt
 #0  0xd01d7778 in memmove () from /usr/lib/libc.a(shr.o)
 #1  0xd0326e1c in getaddrinfo2 () from /usr/lib/libc.a(shr.o)
 #2  0xd0327b6c in getaddrinfo () from /usr/lib/libc.a(shr.o)
 #3  0x1005860c in getaddrinfo_all (hostname=0x34e0 , 
 servname=0x74696f Address 0x74696f out of bounds, hintp=0xf03a2e80, 
 result=0x74696f) at ip.c:78
 #4  0x101f9330 in parse_hba (line=0x202ae198, port=0x202a6988, 
 found_p=0x2ff1f810 , error_p=0x2ff1f811 ) at hba.c:669

Hm, a crash inside the system-supplied getaddrinfo routine would suggest
that there's something wrong with the values we are passing into it.
The most likely bet is that we don't agree with libc about the layout of
struct addrinfo.  The configure script goes out of its way to be
paranoid about this, because we've seen it get confused by add-on
libbind installations (see also the head comment in
src/include/getaddrinfo.h) ... but I'll bet that AIX has found another
way to trip it up.
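
One way to test the layout theory is to have the compiler report what it
believes struct addrinfo looks like. The helper below is a hypothetical
sketch (addrinfo_layout() is not part of PostgreSQL); it would need to be
built with the same compiler, flags and include path as the postgres binary
for the numbers to mean anything.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1       /* make <netdb.h> declare struct addrinfo */
#endif

#include <assert.h>
#include <netdb.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Byte offset of one struct addrinfo member, as the compiler sees it. */
#define OFF(field) ((unsigned long) offsetof(struct addrinfo, field))

/*
 * Format the compile-time size and member offsets of struct addrinfo
 * into buf.  Returns what snprintf() returns: the number of characters
 * the full description needs.
 */
int addrinfo_layout(char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);

    return snprintf(buf, buflen,
                    "sizeof=%lu ai_flags=%lu ai_family=%lu ai_socktype=%lu "
                    "ai_protocol=%lu ai_addrlen=%lu ai_canonname=%lu "
                    "ai_addr=%lu ai_next=%lu",
                    (unsigned long) sizeof(struct addrinfo),
                    OFF(ai_flags), OFF(ai_family), OFF(ai_socktype),
                    OFF(ai_protocol), OFF(ai_addrlen), OFF(ai_canonname),
                    OFF(ai_addr), OFF(ai_next));
}
```

Comparing this output with what gdb's "ptype struct addrinfo" shows inside
the core file would reveal whether the member order or the width of
ai_addrlen differs from what the postgres build assumed.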

I can see from your trace that you are using the getaddrinfo code from
libc, but where is configure finding a header that declares struct
addrinfo?

regards, tom lane
