Re: [rsyslog] Rsyslog stops relaying messages
On Tue, 6 Dec 2016, Arik Mitschang wrote:

> Hi David,
>
>> the problem is probably not on the system that stops forwarding messages, but rather on the system they are forwarding the messages to.
>>
>> When the queues fill up, unless you have configured rsyslog to throw away messages, it will stop accepting any new messages as it can't put them in the queue. This is "working as designed" (one of these days I've got to sit down and finish writing my "how to make your logs unreliable" article :-)
>
> There are several reasons I do not think this is the case:
>
> We have multiple relays downstream connected to upstream relays, and see messages come through these other paths when this situation occurs.
>
> Also, a frequent solution to the problem is to restart the stuck process (and only that one), where we see the messages flush through upstream relays when shutting down, implying they are not holding back messages.
>
> Finally, we do have impstats enabled. It is going through the main queue, but this actually allows us to probe the status of the queue. We have Nagios alert when no stats messages come in within a fixed time window. Before getting stuck, messages in the queue are at maximum (actually we see 700k in the main queue, which is set at 1M), then we see no more stats from only the stuck relays; others keep pushing stats and reflect the reduction in message throughput in their main queue sizes.

Do you have the stats messages configured to go through the main queue (like any other message)? Or do you have them set to use a separate queue so that they will get through even if the main queue is blocked?

Can you configure one to write to either a separate queue (i.e. a ruleset with its own queue) or to a file so that we can see what the stats look like when things break?

On my system I created a 'high priority' ruleset with its own queue for the stats to go through that bypassed my intermediate relays and delivered directly to my central servers, so that if anything happened to the main queue, I would still get the stats data. I also had this write to the local disk and send stats to my monitoring system.

If the stats messages are queued and sent after the restart, what do they show during the time when you have trouble? Do they show any of the actions being suspended?

David Lang

___
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT.
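A minimal sketch of the kind of setup described in this message: impstats bound to its own ruleset with a dedicated queue, writing a local copy and forwarding directly to a central server. The ruleset name, file path, hostname, port, interval, and queue size are placeholders, not values from the thread. Parameter names follow the rsyslog 8.x impstats, omfile, omrelp, and queue documentation as best understood here, so check them against the installed version.

```
# Emit counters every 60 seconds and hand them to a dedicated ruleset
# instead of the default one. The ruleset name "stats" is a placeholder.
module(load="impstats" interval="60" severity="7" ruleset="stats")

# Skip this line if omrelp is already loaded elsewhere in the config.
module(load="omrelp")

# The ruleset has its own queue, so stats do not wait behind the main queue.
ruleset(name="stats" queue.type="LinkedList" queue.size="10000") {
    # Local copy, readable even if every forwarding path is stuck.
    action(type="omfile" file="/var/log/rsyslog-stats.log")

    # Direct delivery to a central server, bypassing intermediate relays.
    # Hostname and port are hypothetical. The action queue keeps a stalled
    # peer from holding up the file write above.
    action(type="omrelp" target="central.example.com" port="2514"
           queue.type="LinkedList" action.resumeRetryCount="-1")
}
```

With something like this in place, the local file should keep receiving queue sizes and any action suspension counters during an incident, which is the data being asked for here.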
Re: [rsyslog] Rsyslog stops relaying messages
Hi David,

> the problem is probably not on the system that stops forwarding messages, but rather on the system they are forwarding the messages to.
>
> When the queues fill up, unless you have configured rsyslog to throw away messages, it will stop accepting any new messages as it can't put them in the queue. This is "working as designed" (one of these days I've got to sit down and finish writing my "how to make your logs unreliable" article :-)

There are several reasons I do not think this is the case:

We have multiple relays downstream connected to upstream relays, and see messages come through these other paths when this situation occurs.

Also, a frequent solution to the problem is to restart the stuck process (and only that one), where we see the messages flush through upstream relays when shutting down, implying they are not holding back messages.

Finally, we do have impstats enabled. It is going through the main queue, but this actually allows us to probe the status of the queue. We have Nagios alert when no stats messages come in within a fixed time window. Before getting stuck, messages in the queue are at maximum (actually we see 700k in the main queue, which is set at 1M), then we see no more stats from only the stuck relays; others keep pushing stats and reflect the reduction in message throughput in their main queue sizes.

> what version are you running, there have been some Unicode related fixes in the last few versions.

We have a mix of systems, but in general rsyslog is at least version 8.22, with a number of systems being 8.23.

> A couple things to do would be
>
> 1. make sure you have impstats enabled, and since you are having problems delivering messages, make sure it either uses a different ruleset (with a queue) or writes a file to disk so that you don't risk the pstats data getting stuck as well.

As above, we have it enabled, but it is going through the default ruleset at the moment. I can look into getting it into a different ruleset, and both write maybe a day's worth to disk and send upstream.

> 2. as a debugging tool, consider writing the logs to disk before forwarding them. You don't need to keep a very long history of them, but seeing the message that rsyslog was trying to send could be very helpful

Will look into this as well, though I suspect it cannot be done anytime real soon. It would be nice to see what message was being processed, though it is possible the issue would prevent its writing to disk as well, if it gets stuck in the main queue and not the omrelp action...

> 3. look at the systems receiving the messages to see if anything odd happens there around the time that things start failing.

As above, I don't believe it is the receiving systems, but if anything still stands out let me know.

Thanks for all your input.

Arik

P.S. I only got the digest, not the original response, so apologies if this does not get properly inserted in the thread.
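A sketch of suggestion 2 from this exchange (write each message to disk before forwarding it), assuming a relay ruleset that the receiving input is bound to. The ruleset name, file path, target host, port, and queue size are hypothetical. The parameter names are the documented omfile, omrelp, and queue ones as far as known here, and should be verified against the running version.

```
# Hypothetical relay ruleset: keep a short local trace, then forward.
ruleset(name="relay") {
    # Written first, so the tail of this file shows what rsyslog was
    # handling when forwarding stopped. Only recent history is needed,
    # so rotate it aggressively with an external tool.
    action(type="omfile" file="/var/log/relay-debug.log")

    # Forwarding action with its own queue, so a stalled RELP peer does
    # not also stall the file write above. Target and port are placeholders.
    action(type="omrelp" target="upstream.example.com" port="2514"
           queue.type="LinkedList" queue.size="100000"
           action.resumeRetryCount="-1")
}
```

As noted in the message, this only helps if the stall is in the omrelp action. If messages are stuck before they leave the main queue, the file action never sees them either, which would itself narrow down where the problem is.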
Re: [rsyslog] Rsyslog stops relaying messages
On Fri, 2 Dec 2016, Arik Mitschang wrote:

> Hi Rsyslog users,
>
> We have been periodically experiencing an issue with our rsyslog setup where some RELP relay nodes appear to fill up their queue and stop processing any messages. Our log flow is essentially made up of a number of "clients" that send messages over RELP to one or more "relay" layers, which finally send to a number of rsyslog processes that index messages in elasticsearch, for example:
>
>   client -> relay -> relay (x2) -> indexer (x2) -> elasticsearch
>
> Each relay sends to at least 2 rsyslog servers, balancing messages between them using config like:
>
>   if $$uptime % 2 == 0 then {
>     RELP A
>   } else if $$uptime % 1 == 0 then {
>     RELP B
>   }
>
> During peak times we are pushing about 25000 messages per second on each of the busiest relays and indexers (limited by the indexing operation). The "relays" do not write their queue to disk.
>
> The problem has always been that one or more "relays" simply stop forwarding; inspection of the process shows memory usage higher than the others, as the queue is full. Normally, restarting the rsyslog process clears the queue and resumes normal processing.
>
> This looks like a bug, perhaps triggered by some badly formed or badly encoded incoming message (noting this is also a largely Japanese environment), but I was curious if anyone here has experienced something similar, knows where to look, or has suggestions for how to get useful information to report about this.
>
> I appreciate any help you can give, thanks,

The problem is probably not on the system that stops forwarding messages, but rather on the system they are forwarding the messages to.

When the queues fill up, unless you have configured rsyslog to throw away messages, it will stop accepting any new messages as it can't put them in the queue. This is "working as designed" (one of these days I've got to sit down and finish writing my "how to make your logs unreliable" article :-)

What version are you running? There have been some Unicode related fixes in the last few versions.

A couple things to do would be:

1. Make sure you have impstats enabled, and since you are having problems delivering messages, make sure it either uses a different ruleset (with a queue) or writes a file to disk so that you don't risk the pstats data getting stuck as well.

2. As a debugging tool, consider writing the logs to disk before forwarding them. You don't need to keep a very long history of them, but seeing the message that rsyslog was trying to send could be very helpful.

3. Look at the systems receiving the messages to see if anything odd happens there around the time that things start failing.

David Lang
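For reference, one documented way to have rsyslog "throw away messages" instead of letting a queue fill completely is the discard watermark on the queue. The sketch below applies it to the main queue. The 1M size matches the figure mentioned elsewhere in the thread, while the watermark and severity threshold are illustrative assumptions, not recommendations from the posters.

```
# Illustrative values only. Once the main queue holds 950000 messages,
# newly arriving messages of severity 5 (notice) or less important
# (info, debug) are discarded instead of being enqueued.
main_queue(
    queue.size="1000000"
    queue.discardMark="950000"
    queue.discardSeverity="5"
)
```

This trades completeness for liveness: the relay keeps accepting input under back-pressure, but low-priority messages are lost while the queue is above the mark, so it is a deliberate choice rather than a fix for the stall itself.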