Re: HAProxy 1.4.2 on z/Linux - segfaults LIST_DEL(&s->list); and hangs

2010-03-21 Thread Dariusz Suchojad

Willy Tarreau wrote:


>> What would you consider a good indicator of its reliability? Would
>> running flawlessly for a week straight be enough of testing?
>
> The fact that it runs a lot longer than previous run is a natural
> indicator of reliability. However it's not an indicator of correctness.


I sure agree that it isn't any proof of its correctness but I can only 
say that it's been running for more than 40 hours now and I don't see 
any problems. I'll spare you the details of how many times the backend 
servers crashed in that time ;-)


> Whatever we spot, I'll keep in mind that we can get it to crash on
> your machine in 31-bit mode. If ever I come across a vicious bug
> that could explain that, I'd be happy to ask you to give it a try.

And I'll be happy to give it a go if I only still have access to that 
platform. Just in case you ever need it, you can run Debian (or, I 
imagine, any other distribution which supports s390/s390x) under the 
Hercules VM, here's a very nice HOWTO 
http://www.josefsipek.net/docs/s390-linux/hercules-s390.html. I haven't 
tried using HAProxy on it though but I guess there shouldn't be any issues.



>>> Last, are you aware of any version that has worked reliably on
>>> your platform ?
>>
>> Not really, it's the first time we're using HAProxy on that platform.
>
> OK so I wish you that it works well for this first time :-)


Cheers!

--
Dariusz Suchojad



Re: HAProxy 1.4.2 on z/Linux - segfaults LIST_DEL(&s->list); and hangs

2010-03-21 Thread Willy Tarreau
Hi Dariusz,

On Mon, Mar 22, 2010 at 03:54:13AM +0100, Dariusz Suchojad wrote:
> Willy Tarreau wrote:
>
> >> What would you consider a good indicator of its reliability? Would
> >> running flawlessly for a week straight be enough of testing?
> >
> > The fact that it runs a lot longer than previous run is a natural
> > indicator of reliability. However it's not an indicator of correctness.
>
> I sure agree that it isn't any proof of its correctness but I can only
> say that it's been running for more than 40 hours now and I don't see
> any problems. I'll spare you the details of how many times the backend
> servers crashed in that time ;-)

OK so now I'm confident that it is the 31-bit mode that triggers the
problem.

> > Whatever we spot, I'll keep in mind that we can get it to crash on
> > your machine in 31-bit mode. If ever I come across a vicious bug
> > that could explain that, I'd be happy to ask you to give it a try.
>
> And I'll be happy to give it a go if I only still have access to that
> platform. Just in case you ever need it, you can run Debian (or, I
> imagine, any other distribution which supports s390/s390x) under the
> Hercules VM, here's a very nice HOWTO
> http://www.josefsipek.net/docs/s390-linux/hercules-s390.html. I haven't
> tried using HAProxy on it though but I guess there shouldn't be any issues.

Oh I've never heard of this VM. That's excellent. And Josef has put
up a very nice howto ! I'll probably try it someday, at least to
satisfy my curiosity :-)

Cheers,
Willy




HAProxy 1.4.2 on z/Linux - segfaults LIST_DEL(&s->list); and hangs

2010-03-19 Thread Dariusz Suchojad


Hello,

first off - thank you very much for such a wonderful piece of software! 
I'm using HAProxy on Linux/x86 and it's been a wonderful experience so far.


Things are a bit different on z/Linux, using s390 architecture. HAProxy 
was running very nicely for several hours and then suddenly started to 
segfault and hang after I had to restart it.


Here's some background information:

$ uname -a
Linux vlbt12 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:58 EDT 2008 s390x s390x s390x GNU/Linux

$

$ ./src/haproxy-1.4.2/haproxy -vv
HA-Proxy version 1.4.2 2010/03/17
Copyright 2000-2010 Willy Tarreau w...@1wt.eu

Build options :
  TARGET  = linux26
  CPU = generic
  CC  = gcc
  CFLAGS  = -O2 -g
  OPTIONS =

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200

Encrypted password support via crypt(3): yes

Available polling systems :
 sepoll : pref=400,  test result OK
  epoll : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 4 (4 usable), will use sepoll.

$

Configuration file:

global
    log 127.0.0.1:514 local0 debug

defaults
    log global
    option httpclose
    timeout connect 5000
    timeout client 5000
    timeout server 5000

backend http-plain-bck
    mode http
    balance roundrobin
    server httpplain01 127.0.0.1:7081 check inter 2s rise 2 fall 2
    server httpplain02 127.0.0.1:7082 check inter 2s rise 2 fall 2
    option httpchk GET /http-check

frontend plain
    bind 0.0.0.0:37080
    mode http
    maxconn 300
    default_backend http-plain-bck
    option httplog


The client application is SoapUI http://soapui.com - they're using an 
embedded Jakarta Commons HTTP client 3.0.1 underneath. The servers are 
IBM's proprietary Message Brokers. I'm sending HTTP POST data which is 
then proxied over two backends.


I've read http://www.formilux.org/archives/haproxy/0910/2562.html but I 
think my case is a bit different as I have no problems with starting 
HAProxy. And unfortunately I'm not able to use Valgrind as it seems not 
to be supported on that architecture.


When the 'hang' happens it's just that - the client waits for several
dozen seconds and I'm able to send the SIGQUIT signal. Here's the
output from HAProxy upon receiving it:


$ ./src/haproxy-1.4.2/haproxy -V -f ./problematic.conf
Available polling systems :
 sepoll : pref=400,  test result OK
  epoll : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 4 (4 usable), will use sepoll.
Using sepoll() as the polling mechanism.

Dumping pools usage.
  - Pool pipe (32 bytes) : 0 allocated (0 bytes), 0 used, 2 users [SHARED]
  - Pool capture (64 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool task (144 bytes) : 6 allocated (864 bytes), -19793676 used, 1 users [SHARED]
  - Pool hdr_idx (832 bytes) : 4 allocated (3328 bytes), -19793678 used, 2 users [SHARED]
  - Pool requri (1024 bytes) : 3 allocated (3072 bytes), 2 used, 1 users [SHARED]
  - Pool session (1344 bytes) : 3 allocated (4032 bytes), -19793678 used, 1 users [SHARED]
  - Pool buffer (16512 bytes) : 8 allocated (132096 bytes), -39587356 used, 1 users [SHARED]

Total: 7 pools, 143392 bytes allocated, 9081850944 used.

I'm not sure how to interpret it but the numbers look a bit odd, don't 
they? 143392 bytes have been allocated yet 9081850944 have been used?
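
One reading that reproduces the odd total exactly: assume each pool's size and "used" counter are 32-bit unsigned values, that the per-pool product size * used therefore wraps modulo 2^32, and that the wrapped products are summed into a 64-bit total. The sketch below is only that arithmetic applied to the figures in the dump above. The struct and function names are made up for illustration, and the field widths are an assumption, not something taken from the HAProxy source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical row type mirroring one line of the "Dumping pools usage" output. */
struct pool_row {
    const char *name;
    uint32_t size;   /* object size in bytes */
    int32_t used;    /* "used" count exactly as printed, possibly negative */
};

/* The seven rows from the dump taken while HAProxy was hung. */
const struct pool_row hung_dump[] = {
    { "pipe",       32,         0 },
    { "capture",    64,         0 },
    { "task",      144, -19793676 },
    { "hdr_idx",   832, -19793678 },
    { "requri",   1024,         2 },
    { "session",  1344, -19793678 },
    { "buffer",  16512, -39587356 },
};

/*
 * Assumption: each product is computed in 32-bit unsigned arithmetic
 * (so it wraps modulo 2^32) and only then added to a 64-bit total.
 */
uint64_t pool_used_total(const struct pool_row *rows, size_t n)
{
    uint64_t total = 0;

    assert(rows != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* Reinterpret the printed signed count as the unsigned counter. */
        uint32_t product = rows[i].size * (uint32_t)rows[i].used;
        total += product;
    }
    return total;
}
```

Under those assumptions, pool_used_total(hung_dump, 7) comes out as 9081850944, the "used" figure on the Total line, which would suggest the negative counts are wrapped (decremented below zero) counters rather than a printing glitch.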


And here's what syslog shows just before I send SIGQUIT

Mar 19 19:29:22 localhost haproxy[22945]: 172.150.21.62:4447 [19/Mar/2010:19:29:22.448] plain http-plain-bck/httpplain02 0/0/0/0/1 404 428 - - ---- 2/2/2/1/0 0/0 POST /customer/profile HTTP/1.1
Mar 19 19:29:22 localhost haproxy[22945]: 172.150.21.62:4448 [19/Mar/2010:19:29:22.449] plain http-plain-bck/httpplain01 0/0/0/0/1 404 428 - - ---- 1/1/1/0/0 0/0 POST /customer/profile HTTP/1.1
Mar 19 19:29:22 localhost haproxy[22945]: 172.150.21.62:4449 [19/Mar/2010:19:29:22.480] plain http-plain-bck/httpplain02 0/0/0/0/1 404 428 - - ---- 2/2/2/1/0 0/0 POST /customer/profile HTTP/1.1
Mar 19 19:29:22 localhost haproxy[22945]: 172.150.21.62:4450 [19/Mar/2010:19:29:22.481] plain http-plain-bck/httpplain01 0/0/0/0/1 404 428 - - ---- 1/1/1/0/0 0/0 POST /customer/profile HTTP/1.1
Mar 19 19:29:22 localhost haproxy[22945]: 172.150.21.62:4453 [19/Mar/2010:19:29:22.494] plain plain/NOSRV -1/-1/-1/-1/0 400 187 - - CR-- 3/3/0/0/0 0/0 BADREQ
Mar 19 19:29:22 localhost haproxy[22945]: 172.150.21.62:4452 [19/Mar/2010:19:29:22.489] plain http-plain-bck/httpplain01 0/0/0/31/32 404 428 - - ---- -18798/-18798/2/0/0 0/0 POST /customer/profile HTTP/1.1


HTTP 404 response code is perfectly fine in this case.

But it's not always going into a hung state, sometimes it simply 
segfaults after processing a couple of requests - here's what gets 
logged in syslog when it happens:


Mar 19 20:01:34 localhost haproxy[23143]: Proxy http-plain-bck started.
Mar 19 20:01:34 localhost 

Re: HAProxy 1.4.2 on z/Linux - segfaults LIST_DEL(&s->list); and hangs

2010-03-19 Thread Dariusz Suchojad

Willy Tarreau wrote:

> Hi,
>
> There's something easy you can do to check if it's that : in
> src/stream_sock.c, there's only one recv() call. Simply check
> that the max value is within bounds :
>
> +   if (max < 0 || max > b->size)
> +       abort();
>     ret = recv(fd, b->r, max, 0);
>
> If you believe you can reproduce it, doing it under strace could
> immensely help : strace -tt -s 200 -o trace.log haproxy -[args].
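
For reference, here is a standalone sketch of the bounds check suggested above. The struct is a cut-down stand-in holding only the two fields the snippet mentions (r and size), not HAProxy's real buffer definition, and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Cut-down stand-in for the buffer used in the snippet above (assumption). */
struct buffer {
    char *r;            /* where the next received byte will be written */
    unsigned int size;  /* total capacity of the buffer in bytes */
};

/* True when max is a sane length for a recv() into a buffer of this size. */
bool recv_len_in_bounds(int max, unsigned int size)
{
    return max >= 0 && (unsigned int)max <= size;
}

/*
 * recv() wrapper that aborts, leaving a core dump to inspect, instead of
 * reading with a negative or oversized length.
 */
ssize_t checked_recv(int fd, struct buffer *b, int max)
{
    if (!recv_len_in_bounds(max, b->size))
        abort();
    return recv(fd, b->r, (size_t)max, 0);
}
```

Aborting rather than returning an error is deliberate here: the point of the check is to stop at the first bad length so the core file shows where it came from.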


The odd-looking numbers got me thinking and in the meantime I have
modified the Makefile and compiled HAProxy with CPU set to custom and
ARCH set to s390x (it's a 64bit system) - I'm not sure how that's
related but z/Linux (s390 one) can also be a 31bit system
http://www.zjournal.com/index.cfm?section=article&aid=1033 and maybe the
default Makefile and gcc somehow got confused by that?
Anyway, things look better now, it's been 2 hours and about 1M
messages have been processed so far. I'll let it run over the weekend
and we'll see how stable it is.
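
A quick way to confirm which of the two modes a given build targets is to compile a tiny probe with the same compiler and flags as HAProxy and look at the pointer and long widths. This is a generic C sketch, not something from the HAProxy tree, and the function names are made up. On a 64-bit s390x build one would expect 64 for both; a 31-bit s390 build uses 4-byte pointers and longs, so it should report 32.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Width of a data pointer, in bits, for the ABI this file was compiled for. */
unsigned int pointer_bits(void)
{
    return (unsigned int)(sizeof(void *) * CHAR_BIT);
}

/* Width of long, in bits, for the same ABI. */
unsigned int long_bits(void)
{
    return (unsigned int)(sizeof(long) * CHAR_BIT);
}

/* Print both widths on one line, e.g. to stdout from a small main(). */
void print_abi_widths(FILE *out)
{
    fprintf(out, "pointer: %u bits, long: %u bits\n",
            pointer_bits(), long_bits());
}
```

Calling print_abi_widths(stdout) from a one-line main() built once with the default flags and once with the ARCH setting would show whether the two builds really differ.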


Here's what the pools look like now:

Dumping pools usage.
  - Pool pipe (32 bytes) : 0 allocated (0 bytes), 0 used, 2 users [SHARED]
  - Pool capture (64 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool task (144 bytes) : 7 allocated (1008 bytes), 5 used, 1 users [SHARED]
  - Pool hdr_idx (832 bytes) : 5 allocated (4160 bytes), 3 used, 2 users [SHARED]
  - Pool requri (1024 bytes) : 5 allocated (5120 bytes), 2 used, 1 users [SHARED]
  - Pool session (1344 bytes) : 5 allocated (6720 bytes), 3 used, 1 users [SHARED]
  - Pool buffer (16512 bytes) : 10 allocated (165120 bytes), 6 used, 1 users [SHARED]

Total: 7 pools, 182128 bytes allocated, 108368 used.

Too bad I didn't take a snapshot of those when everything was fine 
initially but I really didn't expect any problems would arise.


Assuming there aren't any problems, would you still like me to strace
it? It would have to wait till next week - I'll need to ask the sysadmin
to install strace for me.


What would you consider a good indicator of its reliability? Would 
running flawlessly for a week straight be enough of testing?



> Also, do you see any build warning ? It's possible that we have
> one type wrong somewhere which is different on your platform. I
> once got caught by unsigned chars on PPC for instance.


There are indeed some warnings during compilation:

gcc -Iinclude -Iebtree -Wall   -g   -DTPROXY -DCONFIG_HAP_CRYPT 
-DENABLE_POLL -DENABLE_EPOLL -DENABLE_SEPOLL -DNETFILTER 
-DUSE_GETSOCKNAME  -DCONFIG_HAPROXY_VERSION=\"1.4.2\" 
-DCONFIG_HAPROXY_DATE=\"2010/03/17\" -c -o src/dumpstats.o src/dumpstats.c

src/dumpstats.c: In function ‘stats_dump_full_sess_to_buffer’:
src/dumpstats.c:2469: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘long int’
src/dumpstats.c:2469: warning: format ‘%d’ expects type ‘int’, but argument 6 has type ‘long int’
src/dumpstats.c:2469: warning: format ‘%d’ expects type ‘int’, but argument 7 has type ‘long int’
src/dumpstats.c:2499: warning: format ‘%d’ expects type ‘int’, but argument 5 has type ‘long int’
src/dumpstats.c:2499: warning: format ‘%d’ expects type ‘int’, but argument 6 has type ‘long int’
src/dumpstats.c:2499: warning: format ‘%d’ expects type ‘int’, but argument 7 has type ‘long int’
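
These warnings may matter more on a 64-bit target than they look: long is 8 bytes there, so passing one where %d expects a 4-byte int is undefined behaviour. The usual fix is either %ld or an explicit cast to int. The sketch below shows the %ld form using made-up helpers; it is not the code at src/dumpstats.c:2469, which isn't shown in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats three long values, pairing each with %ld. */
int format_long_triple(char *dst, size_t len, long a, long b, long c)
{
    /* Writing "%d/%d/%d" here would draw the same gcc warning as above. */
    return snprintf(dst, len, "%ld/%ld/%ld", a, b, c);
}

/* Convenience wrapper around a static buffer; not reentrant. */
const char *long_triple_str(long a, long b, long c)
{
    static char buf[80];

    format_long_triple(buf, sizeof buf, a, b, c);
    return buf;
}
```

Building with -Wall, as in the command line above, is enough for gcc to flag any remaining mismatches of this kind.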



> Last, are you aware of any version that has worked reliably on
> your platform ?


Not really, it's the first time we're using HAProxy on that platform.

Thanks!

--
Dariusz Suchojad