Re: [rsyslog] Rsyslog stops relaying messages

2016-12-05 Thread David Lang

On Tue, 6 Dec 2016, Arik Mitschang wrote:


> Hi David,
>
>> the problem is probably not on the system that stops forwarding messages, but
>> rather on the system they are forwarding the messages to.
>>
>> When the queues fill up, unless you have configured rsyslog to throw away
>> messages, it will stop accepting any new messages as it can't put them in the
>> queue. This is "working as designed" (one of these days I've got to sit down and
>> finish writing my "how to make your logs unreliable" article :-)
>
> There are several reasons I do not think this is the case:
>
> We have multiple relays downstream connected to upstream relays, and see
> messages come through these other paths when this situation occurs.
>
> Also, a frequent solution to the problem is to restart the stuck process
> (and only that one), where we see the messages flush through upstream
> relays when shutting down, implying they are not holding back messages.
>
> Finally, we do have impstats enabled. It is going through the main queue,
> but this actually allows us to probe the status of the queue. We have
> Nagios alert when no stats messages arrive within a fixed time
> window. Before getting stuck, messages in the queue are at their maximum
> (actually we see 700k in the main queue, which is set at 1M), then we see
> no more stats from only the stuck relays; the others keep pushing stats and
> reflect the reduction in message throughput in their main queue sizes.


do you have the stats messages configured to go through the main queue (like any 
other message)? or do you have them set to use a separate queue so that they 
will get through even if the main queue is blocked?


can you configure one to write to either a separate queue (i.e. ruleset with 
its own queue) or to a file so that we can see what the stats look like when 
things break? On my system I created a 'high priority' ruleset with its own 
queue for the stats to go through that bypassed my intermediate relays and 
delivered directly to my central servers so that if anything happened to the 
main queue, I would still get the stats data. I also had this write to the local 
disk and send stats to my monitoring system.
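
A minimal sketch of such a dedicated stats path might look like the following. The ruleset name, file path, host and port are placeholders, and the interval and queue size are arbitrary assumptions rather than tuned values.

```
# Sketch only: names, paths, host, port and sizes are placeholders.
module(load="omrelp")

# Emit stats every 60 seconds as syslog messages bound to a dedicated
# ruleset, so they do not pass through the main queue.
module(load="impstats" interval="60" format="json" ruleset="stats")

# The ruleset has its own in-memory queue, independent of the main queue.
# First action: local copy, readable even when forwarding is stuck.
# Second action: direct delivery to a central server, bypassing the relays.
ruleset(name="stats" queue.type="LinkedList" queue.size="10000") {
    action(type="omfile" file="/var/log/rsyslog-stats.log")
    action(type="omrelp" target="central.example.com" port="2514")
}
```

With the local file in place, the counters for the main queue and for each action can be read from disk even while the relay itself has stopped forwarding.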


If the stats messages are queued and sent after the restart, what do they show 
during the time when you have trouble? do they show any of the actions being 
suspended?


David Lang
___
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.


Re: [rsyslog] Rsyslog stops relaying messages

2016-12-05 Thread Arik Mitschang
Hi David,

> the problem is probably not on the system that stops forwarding messages, but 
> rather on the system they are forwarding the messages to.
> 
> When the queues fill up, unless you have configured rsyslog to throw away 
> messages, it will stop accepting any new messages as it can't put them in the 
> queue. This is "working as designed" (one of these days I've got to sit down 
> and finish writing my "how to make your logs unreliable" article :-)

There are several reasons I do not think this is the case:

We have multiple relays downstream connected to upstream relays, and see
messages come through these other paths when this situation occurs.

Also, a frequent solution to the problem is to restart the stuck process
(and only that one), where we see the messages flush through upstream
relays when shutting down, implying they are not holding back messages.

Finally, we do have impstats enabled. It is going through the main queue,
but this actually allows us to probe the status of the queue. We have
Nagios alert when no stats messages arrive within a fixed time
window. Before getting stuck, messages in the queue are at their maximum
(actually we see 700k in the main queue, which is set at 1M), then we see
no more stats from only the stuck relays; the others keep pushing stats and
reflect the reduction in message throughput in their main queue sizes.

> what version are you running, there have been some unicode related fixes in 
> the last few versions.

We have a mix of systems but in general rsyslog is at least version
8.22, with a number of systems being 8.23.

> A couple things to do would be
> 
> 1. make sure you have impstats enabled, and since you are having problems 
> delivering messages, make sure it either uses a different ruleset (with a 
> queue) or writes a file to disk so that you don't risk the pstats data getting 
> stuck as well.

As above, we have it enabled, but it is going through the default
ruleset at the moment. I can look into getting it into a different
ruleset and both write maybe a day's worth to disk as well as sending
upstream.

> 2. as a debugging tool, consider writing the logs to disk before forwarding 
> them. You don't need to keep a very long history of them, but seeing the 
> message that rsyslog was trying to send could be very helpful

Will look into this as well, though I suspect it cannot be done anytime
real soon. It would be nice to see what message was being processed,
though it is possible the issue would prevent its writing to disk as
well, if it gets stuck in the main queue and not the omrelp action...

> 3. look at the systems receiving the messages to see if anything odd happens 
> there around the time that things start failing.

As above, I don't believe it is the receiving systems, but if anything
still stands out let me know. Thanks for all your input.

Arik

P.S. I only got the digest, not the original response, so apologies if
this does not get properly inserted in the thread.



Re: [rsyslog] Rsyslog stops relaying messages

2016-12-01 Thread David Lang

On Fri, 2 Dec 2016, Arik Mitschang wrote:


> Hi Rsyslog users,
>
> We have been periodically experiencing an issue with our rsyslog setup
> where some RELP relay nodes appear to fill up their queue and stop
> processing any messages.
>
> Our log flow essentially is made up of a number of "clients" that send
> messages over RELP to one or more "relay" layers which finally send to a
> number of rsyslog processes which index messages in elasticsearch, for
> example:
>
> client -> relay -> relay (x2) -> indexer (x2) -> elasticsearch
>
> Each relay sends to at least 2 rsyslog servers balancing messages
> between them using config like:
>
> if $$uptime % 2 == 0 then {
>     RELP A
> }
> else if $$uptime % 1 == 0 then {
>     RELP B
> }
>
> During peak times we are pushing about 25000 messages per second on each
> of the most busy relays and indexers (limited by the indexing
> operation). The "relays" do not write queue to disk.
>
> The problem has always been that one or more "relays" simply stops
> forwarding, inspection of the process shows memory usage higher than
> others as the queue is full. Normally, restarting the rsyslog process
> clears the queue and resumes normal processing.
>
> This looks like a bug, and perhaps gets triggered by some badly formed
> or encoded incoming message or something (noting this is also a largely
> Japanese environment), but I was curious if anyone here has experienced
> similar or knows where to look or any suggestions how to get useful
> information to report about this.
>
> I appreciate any help you can give, thanks,


the problem is probably not on the system that stops forwarding messages, but 
rather on the system they are forwarding the messages to.


When the queues fill up, unless you have configured rsyslog to throw away 
messages, it will stop accepting any new messages as it can't put them in the 
queue. This is "working as designed" (one of these days I've got to sit down and 
finish writing my "how to make your logs unreliable" article :-)
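
For illustration, "throwing away messages" is controlled by queue parameters. The following is only a sketch for an in-memory main queue; every value in it is an arbitrary assumption, not a recommendation, and enabling it trades blocking for message loss.

```
# Sketch only: all values are illustrative assumptions.
# discardMark/discardSeverity: once the queue holds 950000 messages, start
#   discarding messages of severity 6 (info) and less important ones.
# timeoutEnqueue: wait 0 ms for free space when the queue is full, then
#   drop the message instead of blocking the input.
main_queue(
    queue.type="LinkedList"
    queue.size="1000000"
    queue.discardMark="950000"
    queue.discardSeverity="6"
    queue.timeoutEnqueue="0"
)
```

Without settings of this kind, a full queue pushes back on the inputs, which is the behaviour described above.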


what version are you running, there have been some unicode related fixes in the 
last few versions.


A couple things to do would be

1. make sure you have impstats enabled, and since you are having problems 
delivering messages, make sure it either uses a different ruleset (with a queue) 
or writes a file to disk so that you don't risk the pstats data getting stuck as 
well.


2. as a debugging tool, consider writing the logs to disk before forwarding 
them. You don't need to keep a very long history of them, but seeing the message 
that rsyslog was trying to send could be very helpful


3. look at the systems receiving the messages to see if anything odd happens 
there around the time that things start failing.
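
Suggestion 2 could be sketched roughly as follows, assuming a relay that forwards with omrelp. The template name, file path, target host and port are placeholders.

```
# Sketch only: template name, file path, host and port are placeholders.
module(load="omrelp")

# Write the message exactly as received, one per line.
template(name="RawMsg" type="string" string="%rawmsg%\n")

# Debug copy on local disk first. Rotate it externally (e.g. with
# logrotate), since only a short history is needed.
action(type="omfile" file="/var/log/relay-debug.log" template="RawMsg")

# Then forward as before.
action(type="omrelp" target="upstream.example.com" port="2514")
```

Because the file action runs before the forwarding action, the last lines of the file show what was being handled around the time forwarding stopped.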


David Lang