Re: HAProxy 1.4.2 on z/Linux - segfaults LIST_DEL(&s->list); and hangs

2010-03-21 Thread Dariusz Suchojad

Willy Tarreau wrote:


>> What would you consider a good indicator of its reliability? Would
>> running flawlessly for a week straight be enough testing?


> The fact that it runs a lot longer than the previous run is a natural
> indicator of reliability. However, it's not an indicator of correctness.


I sure agree that it isn't any proof of its correctness but I can only 
say that it's been running for more than 40 hours now and I don't see 
any problems. I'll spare you the details of how many times the backend 
servers crashed in that time ;-)


> Whatever we spot, I'll keep in mind that we can get it to crash on
> your machine in 31-bit mode. If I ever come across a vicious bug
> that could explain that, I'd be happy to ask you to give it a try.

And I'll be happy to give it a go, provided I still have access to that
platform. Just in case you ever need it, you can run Debian (or, I
imagine, any other distribution which supports s390/s390x) under the
Hercules VM; here's a very nice HOWTO:
http://www.josefsipek.net/docs/s390-linux/hercules-s390.html. I haven't
tried running HAProxy on it, but I guess there shouldn't be any issues.



>>> Last, are you aware of any version that has worked reliably on
>>> your platform?


>> Not really, it's the first time we're using HAProxy on that platform.


> OK, so I hope it works well for this first time :-)


Cheers!

--
Dariusz Suchojad



Re: HAProxy 1.4.2 on z/Linux - segfaults LIST_DEL(&s->list); and hangs

2010-03-21 Thread Willy Tarreau
Hi Dariusz,

On Mon, Mar 22, 2010 at 03:54:13AM +0100, Dariusz Suchojad wrote:
> Willy Tarreau wrote:
>
>>> What would you consider a good indicator of its reliability? Would
>>> running flawlessly for a week straight be enough testing?
>
>> The fact that it runs a lot longer than the previous run is a natural
>> indicator of reliability. However, it's not an indicator of correctness.
>
> I sure agree that it isn't any proof of its correctness but I can only
> say that it's been running for more than 40 hours now and I don't see
> any problems. I'll spare you the details of how many times the backend
> servers crashed in that time ;-)

OK so now I'm confident that it is the 31-bit mode that triggers the
problem.

>> Whatever we spot, I'll keep in mind that we can get it to crash on
>> your machine in 31-bit mode. If I ever come across a vicious bug
>> that could explain that, I'd be happy to ask you to give it a try.
>
> And I'll be happy to give it a go, provided I still have access to that
> platform. Just in case you ever need it, you can run Debian (or, I
> imagine, any other distribution which supports s390/s390x) under the
> Hercules VM; here's a very nice HOWTO:
> http://www.josefsipek.net/docs/s390-linux/hercules-s390.html. I haven't
> tried running HAProxy on it, but I guess there shouldn't be any issues.

Oh, I've never heard of this VM. That's excellent. And Josef has put
up a very nice howto! I'll probably try it someday, at least to
satisfy my curiosity :-)

Cheers,
Willy




Re: HAProxy 1.4.2 on z/Linux - segfaults LIST_DEL(&s->list); and hangs

2010-03-19 Thread Dariusz Suchojad

Willy Tarreau wrote:

Hi,


> There's something easy you can do to check if it's that: in
> src/stream_sock.c, there's only one recv() call. Simply check
> that the max value is within bounds:
>
> +   if (max < 0 || max > b->size)
> +           abort();
>     ret = recv(fd, b->r, max, 0);
>
> If you believe you can reproduce it, doing it under strace could
> immensely help: strace -tt -s 200 -o trace.log haproxy -[args].


The odd-looking numbers got me thinking, and in the meantime I have
modified the Makefile and compiled HAProxy with CPU set to custom and
ARCH set to s390x (it's a 64-bit system). I'm not sure how that's
related, but z/Linux (the s390 one) can also be a 31-bit system
http://www.zjournal.com/index.cfm?section=article&aid=1033 and maybe the
default Makefile & gcc somehow got confused by that?
Anyway, things look better now: it's been 2 hours and about 1M
messages have been processed so far. I'll let it run over the weekend
and we'll see how stable it is.


Here's what the pools look like now:

Dumping pools usage.
  - Pool pipe (32 bytes) : 0 allocated (0 bytes), 0 used, 2 users [SHARED]
  - Pool capture (64 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool task (144 bytes) : 7 allocated (1008 bytes), 5 used, 1 users [SHARED]
  - Pool hdr_idx (832 bytes) : 5 allocated (4160 bytes), 3 used, 2 users [SHARED]
  - Pool requri (1024 bytes) : 5 allocated (5120 bytes), 2 used, 1 users [SHARED]
  - Pool session (1344 bytes) : 5 allocated (6720 bytes), 3 used, 1 users [SHARED]
  - Pool buffer (16512 bytes) : 10 allocated (165120 bytes), 6 used, 1 users [SHARED]

Total: 7 pools, 182128 bytes allocated, 108368 used.

Too bad I didn't take a snapshot of those when everything was fine 
initially but I really didn't expect any problems would arise.


Assuming there aren't any problems, would you still like me to strace
it? It would have to wait till next week - I'll need to ask the sysadmin
to install strace for me.


What would you consider a good indicator of its reliability? Would
running flawlessly for a week straight be enough testing?



> Also, do you see any build warnings? It's possible that we have
> one type wrong somewhere which is different on your platform. I
> once got caught by unsigned chars on PPC for instance.


There are indeed some warnings during compilation:

gcc -Iinclude -Iebtree -Wall   -g   -DTPROXY -DCONFIG_HAP_CRYPT 
-DENABLE_POLL -DENABLE_EPOLL -DENABLE_SEPOLL -DNETFILTER 
-DUSE_GETSOCKNAME  -DCONFIG_HAPROXY_VERSION=\"1.4.2\" 
-DCONFIG_HAPROXY_DATE=\"2010/03/17\" -c -o src/dumpstats.o src/dumpstats.c

src/dumpstats.c: In function ‘stats_dump_full_sess_to_buffer’:
src/dumpstats.c:2469: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘long int’
src/dumpstats.c:2469: warning: format ‘%d’ expects type ‘int’, but argument 6 has type ‘long int’
src/dumpstats.c:2469: warning: format ‘%d’ expects type ‘int’, but argument 7 has type ‘long int’
src/dumpstats.c:2499: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘long int’
src/dumpstats.c:2499: warning: format ‘%d’ expects type ‘int’, but argument 6 has type ‘long int’
src/dumpstats.c:2499: warning: format ‘%d’ expects type ‘int’, but argument 7 has type ‘long int’



> Last, are you aware of any version that has worked reliably on
> your platform?


Not really, it's the first time we're using HAProxy on that platform.

Thanks!

--
Dariusz Suchojad