RES: RES: RES: RES: RES: RES: RES: RES: RES: High CPU Usage (HaProxy)
> OK. The last point could slightly help in reducing the number of calls to
> kqueue and aggregate more events at once. But FreeBSD's kqueue is really
> fast, so that should not change much. You really need to be able to pin
> the processes to certain CPUs, as well as the interrupts. Unfortunately I
> cannot be of any help here :-(

But do you believe CPU pinning will really make all this difference? I know
how to do it using pthreads, since I am used to it; just a few lines of code
can do it.
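For reference, pinning a process to a core on FreeBSD really does take only a
few lines, e.g. with cpuset_setaffinity(2). A minimal sketch (the target core
number is an arbitrary example, not something taken from this setup):

/* pin_self.c - sketch: pin the calling process to one CPU on FreeBSD.
 * The target core (2) is an arbitrary example value.
 */
#include <sys/param.h>
#include <sys/cpuset.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cpuset_t mask;

    CPU_ZERO(&mask);            /* start with an empty CPU set */
    CPU_SET(2, &mask);          /* allow only CPU core #2      */

    /* apply the mask to the current process (id -1 means "myself") */
    if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
                           sizeof(mask), &mask) != 0) {
        perror("cpuset_setaffinity");
        exit(1);
    }

    printf("process now pinned to CPU 2\n");
    return 0;
}

The same thing can be done per thread with pthread_setaffinity_np() from
<pthread_np.h>, which is presumably what the pthread route mentioned above
would use.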
RES: RES: RES: RES: RES: RES: RES: RES: High CPU Usage (HaProxy)
Hey, Willy. I've switched to haproxy 1.5 (the last one available on the
website), but the results didn't change much. However, I haven't yet tried
running all the proxies in one single process to check the difference.

-----Original Message-----
From: Fred Pedrisa [mailto:fredhp...@hotmail.com]
Sent: Tuesday, November 5, 2013 13:33
To: 'Willy Tarreau'
Cc: 'Lukas Tribus'; 'haproxy@formilux.org'
Subject: RES: RES: RES: RES: RES: RES: RES: RES: High CPU Usage (HaProxy)

> OK. Do you know if you have a single or multiple interrupts on your NICs,
> and if they're delivered to a single core, multiple cores, or floating
> around more or less randomly ?

This is managed by FreeBSD; it currently has multiple queues and IRQ
balancing with MSI-X.

> It seems that your numbers below tend to confirm this model. I still don't
> know why you have that high a context switch rate. Are you running with
> more processes than CPUs ? Also it looks like the system is mostly spending
> its time idling. Is it that haproxy is on the same CPU as the network's
> interrupts ? Then maybe it could make sense to start multiple processes and
> pin them to specific CPU cores, and do the same with the interrupts.
> Delivering 500-byte messages between two NICs via userspace incurs a high
> overhead and everything which could be saved must be saved (including CPU
> cache misses).

Yes, we have 40 processes running and 16 physical cores, so I suppose this is
more than the number of physical cores available, right ? However, in FreeBSD
we can't do that IRQ assignment like we can on Linux (as far as I know).

>> We are speaking about 100Kpps (input) and 140Kpps (output)
>> 'approximately'.
>
> OK, so probably about 30k msg/s in each direction with their respective
> ACKs. That just makes me think it could possibly do better since we can do
> better with HTTP messages. Do you have enough concurrent connections to
> fill the wire and ensure that the system never waits for either a client or
> a server ? I'm assuming that's OK given the values assigned to the file
> descriptors in your latest email, which were up to 1428. With such numbers
> and such small messages, it can make sense to use multiple processes if
> that's not the case yet.

In theory yes, the connections are quick, because they are pure TCP
applications and, in other cases, HTTP websites, but behind pure tcp mode
instead of http mode (not in all cases though).

Fred
Re: RES: RES: RES: RES: RES: RES: RES: High CPU Usage (HaProxy)
Hello Fred,

[ first, please avoid top-posting, this is very cumbersome for replying in
context afterwards, and tends to pollute subscribers' mailboxes with overly
large emails ]

>> Also, can you confirm that this is a real machine and that we're not
>> troubleshooting a VM ?
>
> Yes, this is a 'real machine', running FreeBSD 9 x64. It is a Xeon E5-2650
> Dual (so we have 16 physical cores to use here and 32 threads).

OK. Do you know if you have a single or multiple interrupts on your NICs,
and if they're delivered to a single core, multiple cores, or floating
around more or less randomly ?

>> That said, assuming you're dealing with 300 Mbps (about 40 MB/s) and say
>> 500 bytes per message, this turns into 80k messages per second, which
>> require :
>>   - 2 recvfrom()
>>   - 1 getsockopt() (we can remove this one, 1.5 doesn't have it)
>>   - 1 sendto()
>> So 4 syscalls per message, resulting in 320k syscalls per second. It can
>> start to represent some CPU usage. But there's more. Such small messages
>> are transferred using TCP_NODELAY, meaning that a TCP PUSH is set on each
>> outgoing packet and that each of them is immediately ACKed. So you get
>> 80kpps per side in each direction, resulting in 320kpps as well. If you
>> have a firewall running on the system, it might take its share of the load
>> as well, which is possibly attributed to the sending process on outgoing
>> messages. That said, even with that in mind, I still consider that the
>> system load is high for the workload. Could you please share the output of
>> vmstat 1 (just take the first 10 lines) ?
>
> Here is the vmstat 1 result :
>
>  procs      memory      page                    disks     faults         cpu
>  r b w     avm    fre   flt  re  pi  po    fr  sr da0 pa0    in     sy     cs us sy id
>  7 0 0   4818M    35G   643   0   0   0   714   0   0   0  4977   1364   5996  8 25 67
>  3 0 0   4818M    35G   224   0   0   0   174   0   0   0 42698 355001 170303  8 22 71
>  3 0 0   4818M    35G   177   0   0   0   174   0   0   0 28715 383061 138108  7 23 69
>  4 0 0   4818M    35G   173   0   0   0   174   0   0   0 28342 375281 138067  8 24 69
>  5 0 0   4818M    35G   185   0   0   0   174   0   0   0 32900 372294 148576  7 21 71
>  5 0 0   4818M    35G   372   0   0   0   174   0   0   0 29112 364030 138826  7 25 68

It seems that your numbers below tend to confirm this model. I still don't
know why you have that high a context switch rate. Are you running with more
processes than CPUs ? Also it looks like the system is mostly spending its
time idling. Is it that haproxy is on the same CPU as the network's
interrupts ? Then maybe it could make sense to start multiple processes and
pin them to specific CPU cores, and do the same with the interrupts.
Delivering 500-byte messages between two NICs via userspace incurs a high
overhead and everything which could be saved must be saved (including CPU
cache misses).

> We are speaking about 100Kpps (input) and 140Kpps (output) 'approximately'.

OK, so probably about 30k msg/s in each direction with their respective
ACKs. That just makes me think it could possibly do better, since we can do
better with HTTP messages. Do you have enough concurrent connections to fill
the wire and ensure that the system never waits for either a client or a
server ? I'm assuming that's OK given the values assigned to the file
descriptors in your latest email, which were up to 1428. With such numbers
and such small messages, it can make sense to use multiple processes if
that's not the case yet.

Best regards,
Willy
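To make the syscall accounting above concrete, here is a minimal sketch in C
of the per-message recv/send pattern that shows up in Fred's truss output
further down the thread. It is not haproxy's code, just an illustration
assuming already-connected, non-blocking sockets; the helper name and the
exact calls are illustrative only:

/* Sketch of one forwarding step of a generic TCP proxy, mirroring the
 * recvfrom()/sendto() pattern in the truss trace (the 8030-byte read size
 * and the MSG_DONTWAIT flag, 0x80, are taken from that trace).
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <errno.h>

static int forward_once(int src_fd, int dst_fd)
{
    char buf[8030];
    ssize_t in, more, out;

    /* 1st recvfrom(): pick up the message waiting on the source socket */
    in = recv(src_fd, buf, sizeof(buf), 0);
    if (in <= 0)
        return -1;

    /* 2nd recvfrom(): try to fill the rest of the buffer; on a non-blocking
     * socket this usually just returns EAGAIN ("Resource temporarily
     * unavailable"), as seen in the trace */
    more = recv(src_fd, buf + in, sizeof(buf) - (size_t)in, 0);
    if (more > 0)
        in += more;
    else if (more < 0 && errno != EAGAIN)
        return -1;

    /* (the getsockopt() seen in the trace, apparently an error check,
     * is the call that 1.5 no longer makes) */

    /* sendto(): with TCP_NODELAY each small message leaves as its own
     * packet and is ACKed immediately, hence roughly 4 syscalls and 4
     * packets per relayed message */
    out = send(dst_fd, buf, (size_t)in, MSG_DONTWAIT);
    return (out == in) ? 0 : -1;
}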
Re: RES: RES: RES: RES: RES: RES: RES: High CPU Usage (HaProxy)
On 5 November 2013 11:16, Willy Tarreau w...@1wt.eu wrote:

>> It is a Xeon E5-2650 Dual (so we have 16 physical cores to use here and
>> 32 threads).
>
> OK. Do you know if you have a single or multiple interrupts on your NICs,
> and if they're delivered to a single core, multiple cores, or floating
> around more or less randomly ?

[snip]

> I still don't know why you have that high a context switch rate. Are you
> running with more processes than CPUs ?

Fred is running with at least 30 separate haproxy processes (as per his top
output in message-id col129-ds31e074947100ad71da09cb0...@phx.gbl) and 16
real (32 H/T) cores. I haven't seen a mail in this thread where Fred has
shown that his problems persist after moving to a single haproxy instance.
/wood-for-the-trees :-)

Jonathan
RES: RES: RES: RES: RES: RES: RES: RES: High CPU Usage (HaProxy)
> OK. Do you know if you have a single or multiple interrupts on your NICs,
> and if they're delivered to a single core, multiple cores, or floating
> around more or less randomly ?

This is managed by FreeBSD; it currently has multiple queues and IRQ
balancing with MSI-X.

> It seems that your numbers below tend to confirm this model. I still don't
> know why you have that high a context switch rate. Are you running with
> more processes than CPUs ? Also it looks like the system is mostly spending
> its time idling. Is it that haproxy is on the same CPU as the network's
> interrupts ? Then maybe it could make sense to start multiple processes and
> pin them to specific CPU cores, and do the same with the interrupts.
> Delivering 500-byte messages between two NICs via userspace incurs a high
> overhead and everything which could be saved must be saved (including CPU
> cache misses).

Yes, we have 40 processes running and 16 physical cores, so I suppose this is
more than the number of physical cores available, right ? However, in FreeBSD
we can't do that IRQ assignment like we can on Linux (as far as I know).

>> We are speaking about 100Kpps (input) and 140Kpps (output)
>> 'approximately'.
>
> OK, so probably about 30k msg/s in each direction with their respective
> ACKs. That just makes me think it could possibly do better since we can do
> better with HTTP messages. Do you have enough concurrent connections to
> fill the wire and ensure that the system never waits for either a client or
> a server ? I'm assuming that's OK given the values assigned to the file
> descriptors in your latest email, which were up to 1428. With such numbers
> and such small messages, it can make sense to use multiple processes if
> that's not the case yet.

In theory yes, the connections are quick, because they are pure TCP
applications and, in other cases, HTTP websites, but behind pure tcp mode
instead of http mode (not in all cases though).

Fred
Re: RES: RES: RES: RES: RES: RES: RES: RES: High CPU Usage (HaProxy)
On 5 Nov 2013, at 19:33, Fred Pedrisa fredhp...@hotmail.com wrote:

> However, in FreeBSD we can't do that IRQ assignment like we can on Linux
> (as far as I know).

JFYI: you can assign IRQs to CPUs via cpuset -x irq (I can't tell you
whether it is "like on linux" or not, though).
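A rough sketch of the same operation done programmatically, under the
assumption that cpuset(1) -x maps to cpuset_setaffinity(2) with
CPU_WHICH_IRQ; the IRQ number and core below are made-up example values, and
this needs root:

/* Sketch: bind one interrupt to one CPU on FreeBSD, roughly what
 * `cpuset -l 2 -x 264` does. IRQ 264 and CPU 2 are example values only;
 * real IRQ numbers can be listed with vmstat -i.
 */
#include <sys/param.h>
#include <sys/cpuset.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cpuset_t mask;
    int irq = 264;               /* example IRQ number */

    CPU_ZERO(&mask);
    CPU_SET(2, &mask);           /* deliver this IRQ on CPU core #2 only */

    if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_IRQ, irq,
                           sizeof(mask), &mask) != 0) {
        perror("cpuset_setaffinity(CPU_WHICH_IRQ)");
        exit(1);
    }

    printf("IRQ %d bound to CPU 2\n", irq);
    return 0;
}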
RES: RES: RES: RES: RES: RES: RES: High CPU Usage (HaProxy)
Hello, Willy.

Yes, this is a 'real machine', running FreeBSD 9 x64. It is a Xeon E5-2650
Dual (so we have 16 physical cores to use here and 32 threads).

We are speaking about 100Kpps (input) and 140Kpps (output) 'approximately'.

Here is the vmstat 1 result :

 procs      memory      page                    disks     faults         cpu
 r b w     avm    fre   flt  re  pi  po    fr  sr da0 pa0    in     sy     cs us sy id
 7 0 0   4818M    35G   643   0   0   0   714   0   0   0  4977   1364   5996  8 25 67
 3 0 0   4818M    35G   224   0   0   0   174   0   0   0 42698 355001 170303  8 22 71
 3 0 0   4818M    35G   177   0   0   0   174   0   0   0 28715 383061 138108  7 23 69
 4 0 0   4818M    35G   173   0   0   0   174   0   0   0 28342 375281 138067  8 24 69
 5 0 0   4818M    35G   185   0   0   0   174   0   0   0 32900 372294 148576  7 21 71
 5 0 0   4818M    35G   372   0   0   0   174   0   0   0 29112 364030 138826  7 25 68
 4 0 0   4818M    35G   159   0   0   0   174   0   0   0 34102 368835 150530  9 22 70
 4 0 0   4818M    35G   362   0   0   0   174   0   0   0 39928 366139 165853  8 21 71
 3 0 0   4818M    35G   220   0   0   0   174   0   0   0 39195 371933 163533  8 21 71
 6 0 0   4818M    35G   262   0   0   0   174   0   0   0 42681 354697 172687  8 21 71

-----Original Message-----
From: Willy Tarreau [mailto:w...@1wt.eu]
Sent: Monday, October 28, 2013 20:58
To: Fred Pedrisa
Cc: 'Lukas Tribus'; haproxy@formilux.org
Subject: Re: RES: RES: RES: RES: RES: RES: High CPU Usage (HaProxy)

Hello Fred,

On Mon, Oct 28, 2013 at 10:02:15AM -0200, Fred Pedrisa wrote:
> Hello, Willy.
>
> As you said, take a look :
>
> getsockopt(0x12e,0x,0x1007,0x7fffdb94,0x7fffdb90,0x0) = 0 (0x0)
> sendto(302,\^D\0\^V0\0\0^z\M-L-\a\0d8\0\0...,926,0x80,NULL,0x0) = 926 (0x39e)
> recvfrom(682,\^S\0W0\0\0\M-,\^?\M-L-\^P\0\^E@...,8030,0x0,NULL,0x0) = 988 (0x3dc)
> recvfrom(682,0x801f3545c,7042,0x0,0x0,0x0) ERR#35 'Resource temporarily unavailable'
> getsockopt(0x2a9,0x,0x1007,0x7fffdb94,0x7fffdb90,0x0) = 0 (0x0)
> sendto(681,\^S\0W0\0\0\M-,\^?\M-L-\^P\0\^E@...,988,0x80,NULL,0x0) = 988 (0x3dc)
> recvfrom(1428,\^N\0!\M-0\0\0\M-\\M^_\M-H-\^AoU...,8030,0x0,NULL,0x0) = 444 (0x1bc)
> recvfrom(1428,0x8011b523c,7586,0x0,0x0,0x0) ERR#35 'Resource temporarily unavailable'
> getsockopt(0x593,0x,0x1007,0x7fffdb94,0x7fffdb90,0x0) = 0 (0x0)
> sendto(1427,\^N\0!\M-0\0\0\M-\\M^_\M-H-\^AoU...,444,0x80,NULL,0x0) = 444 (0x1bc)
> recvfrom(201,\b\0\\0\0\0\M-=\M-]\M-G-\^O\0\0...,8030,0x0,NULL,0x0) = 2627 (0xa43)
> recvfrom(201,0x800ec5ac3,5403,0x0,0x0,0x0) ERR#35 'Resource temporarily unavailable'
> getsockopt(0xbf,0x,0x1007,0x7fffdb94,0x7fffdb90,0x0) = 0 (0x0)
> sendto(191,\b\0\\0\0\0\M-=\M-]\M-G-\^O\0\0...,2627,0x80,NULL,0x0) = 2627 (0xa43)
> recvfrom(888,\^S\0W0\0\0\M-,\^?\M-L-\^P\0\^E@...,8030,0x0,NULL,0x0) = 1226 (0x4ca)
> recvfrom(888,0x801ee354a,6804,0x0,0x0,0x0) ERR#35 'Resource temporarily unavailable'
> getsockopt(0x377,0x,0x1007,0x7fffdb94,0x7fffdb90,0x0) = 0 (0x0)
> sendto(887,\^S\0W0\0\0\M-,\^?\M-L-\^P\0\^E@...,1226,0x80,NULL,0x0) = 1226 (0x4ca)
> recvfrom(674,\f\0\M-=\M-0\0\0\M^K}\M-#-d\r\0...,8030,0x0,NULL,0x0) = 982 (0x3d6)
> recvfrom(674,0x800f6f456,7048,0x0,0x0,0x0) ERR#35 'Resource temporarily unavailable'
> getsockopt(0x2a1,0x,0x1007,0x7fffdb94,0x7fffdb90,0x0) = 0 (0x0)
> sendto(673,\f\0\M-=\M-0\0\0\M^K}\M-#-d\r\0...,982,0x80,NULL,0x0) = 982 (0x3d6)
> recvfrom(1032,\^S\0W0\0\0\M-,\^?\M-L-\^P\0\^E@...,8030,0x0,NULL,0x0) = 1205 (0x4b5)
> recvfrom(1032,0x801ddb535,6825,0x0,0x0,0x0) ERR#35 'Resource temporarily unavailable'
> getsockopt(0x407,0x,0x1007,0x7fffdb94,0x7fffdb90,0x0) = 0 (0x0)
> sendto(1031,\^S\0W0\0\0\M-,\^?\M-L-\^P\0\^E@...,1205,0x80,NULL,0x0) = 1205 (0x4b5)
> recvfrom(1339,\v\0tpDa\^A\^DV \0\0\^A\M^R\M^K...,8030,0x0,NULL,0x0) = 68 (0x44)
> recvfrom(1339,0x8011790c4,7962,0x0,0x0,0x0) ERR#35 'Resource temporarily unavailable'
> getsockopt(0x53c,0x,0x1007,0x7fffdb94,0x7fffdb90,0x0) = 0 (0x0)
> sendto(1340,\v\0tpDa\^A\^DV \0\0\^A\M^R\M^K...,68,0x80,NULL,0x0) = 68 (0x44)
> recvfrom(913,\v\0tpj\M-h\^A\^D\M-Q\^]\0\0\^A...,8030,0x0,NULL,0x0) = 108 (0x6c)
> recvfrom(913,0x8019090ec,7922,0x0,0x0,0x0) ERR#35 'Resource temporarily unavailable'
> getsockopt(0x392,0x,0x1007,0x7fffdb94,0x7fffdb90,0x0) = 0 (0x0)
> sendto(914,\v\0tpj\M-h\^A\^D\M-Q\^]\0\0\^A...,108,0x80,NULL,0x0) = 108 (0x6c)
> recvfrom(166,\^D\0\^V0\0\0\M-$\M^@\M-L-\^T\0p...,8030,0x0,NULL,0x0) = 643 (0x283)
> recvfrom(166,0x800f13303,7387,0x0,0x0,0x0) ERR#35 'Resource temporarily unavailable'
>
> So yes, a lot of recv/send calls as you said before.

Yes, but they're not all that small. The average size looks like 0.5 or 1 kB.

That said, assuming you're dealing with 300 Mbps (about 40 MB/s) and say
500 bytes per message, this turns into 80k messages per