Re: [rsyslog] Identifying/quantifying log-loss or proving no-loss in a log-store
On Tue, 16 Feb 2016, singh.janmejay wrote:

> @David: As of now, I am thinking of end-of-the-day style measurement
> (basically, report the number of messages lost at a good-enough
> granularity, say host x severity).
>
> I am thinking of this as something independent of the frequency of
> outages and unrelated to maintenance windows. I'm thinking of it as a
> report that captures the extent of loss, where one can pull down several
> months of this data and verify loss was never beyond an acceptable
> level, and compare it across days when the load profile was very
> different (the day when too many circuit-breakers engaged, etc.).
>
> I haven't thought this through, but a reset may not be required.
> Basically, let the counter count up and wrap around (as long as
> wrap-around is well-defined behavior which is accounted for during
> measurement).

I have my central server produce a daily report of how many logs it got from each source[1], and my significant traffic generators generate a similar report. I can then spot-check them, put them on the same graph, etc.

David Lang

[1] Well, actually, what I do is a bit fancier, with redundancies, because I haven't cleaned things up yet :-)

My first step is to make a file that collects 'useful' info about the logs that arrive:

    $template sources,"%hostname% %fromhost-ip% %programname% %timegenerated:::date-rfc3339% %$.len%\n"
    set $.len = strlen($rawmsg);
    /var/log/sources-messages;sources

This gives me a one-line-per-message file on which I can easily do things like

    cut -f 1 -d ' ' sources-messages | sort | uniq -c

to get a per-host log count, or

    cut -f 2 -d ' ' sources-messages | sort | uniq -c

to get a report of the relay servers that send me logs. Rotate this file on a regular basis and you have the ability to get stats for arbitrary time windows.

I'm slowly tweaking this to run things through SEC and have SEC produce per-minute stats that summarize the data, making it much faster to summarize. I also have SEC dumping some of these stats to my monitoring system.
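The per-host tally that the cut/sort/uniq pipelines produce could equally be computed in a short script; a minimal sketch, assuming the one-line-per-message layout written by the template above (hostname is the first space-separated field):

```python
from collections import Counter

def per_host_counts(lines):
    """Tally messages per hostname from sources-messages style lines,
    where the hostname is the first space-separated field."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts

# usage: per_host_counts(open("/var/log/sources-messages"))
```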
You can do similar stuff with the pstats output: set it to a reasonable granularity, capture in your monitoring system the count that the sender claims it is sending, and then capture the count that you see on the other end. Compare the two, and if there is a significant difference, alert.

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT.
Re: [rsyslog] Identifying/quantifying log-loss or proving no-loss in a log-store
2016-02-16 12:27 GMT+01:00 singh.janmejay:
> @Thomas: This is not about testing and quantifying loss during a test.
> It's about quantifying it during normal operation.
> [...]
> @David: As of now, I am thinking of end-of-the-day style measurement
> (basically, report the number of messages lost at a good-enough
> granularity, say host x severity).
> [...]

I just wanted to push in a link to an upcoming new feature:
https://github.com/rsyslog/rsyslog/pull/764

Rainer
Re: [rsyslog] Identifying/quantifying log-loss or proving no-loss in a log-store
@Thomas: This is not about testing and quantifying loss during a test. It's about quantifying it during normal operation. I see it as a choice between:

A. deploy the strongest protocol at every system boundary, test each one rigorously (and each change rigorously) to identify or bound loss under test conditions, and expect nothing unexpected to show up in production
B. do the former, and also measure loss in production to identify that something unexpected happened
C. deploy efficient protocols at all system boundaries and measure loss (as long as loss stays within an acceptable level, the deployment benefits from all the efficiency gains)

I am talking in the context of C.

If/when loss rises above the acceptable level, one needs to debug and fix the problem. Both B and C provide the data required to identify the situation(s) when such debugging needs to happen.

The approach of stamping on one end and measuring on the other treats all intermediate hops as a black box. For instance, it can be used to quantify losses in the face of frequent machine failures, downtime-free maintenance, etc.

@David: As of now, I am thinking of end-of-the-day style measurement (basically, report the number of messages lost at a good-enough granularity, say host x severity).

I am thinking of this as something independent of the frequency of outages and unrelated to maintenance windows. I'm thinking of it as a report that captures the extent of loss, where one can pull down several months of this data and verify loss was never beyond an acceptable level, and compare it across days when the load profile was very different (the day when too many circuit-breakers engaged, etc.).

I haven't thought this through, but a reset may not be required. Basically, let the counter count up and wrap around (as long as wrap-around is well-defined behavior which is accounted for during measurement).
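The wrap-around accounting mentioned above is straightforward as long as readings are taken more often than the counter can wrap; a minimal sketch, assuming an unsigned 32-bit counter and at most one wrap between readings:

```python
def counter_delta(prev, curr, width=32):
    """Delta between two readings of a free-running counter that wraps
    at 2**width. Modular subtraction handles the wrap-around case, but
    assumes the counter wrapped at most once between the two readings."""
    modulus = 1 << width
    return (curr - prev) % modulus
```

With this, the counters never need to be reset; each measurement period simply reports the modular difference between consecutive readings.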
On Sat, Feb 13, 2016 at 5:13 AM, David Lang wrote:
> if you don't reset the counters, they keep increasing, so over time the
> error due to the slew becomes a very minor component.
> [...]
> I'll point out that generating/checking a monotonic sequence number
> destroys parallelism, and so it can seriously hurt performance.
> [...]
Re: [rsyslog] Identifying/quantifying log-loss or proving no-loss in a log-store
On Sat, 13 Feb 2016, singh.janmejay wrote:

> The ideal solution would be one that identifies host, log-source and
> time of loss, along with an accurate number of messages lost.
>
> pstats makes sense, but correlating data from stats across a large
> number of machines will be difficult (some machines may send stats
> slightly delayed, which may skew aggregation, etc.).

if you don't reset the counters, they keep increasing, so over time the error due to the slew becomes a very minor component.

> One approach I can think of: slap a stream-identifier and
> sequence-number on each received message, then find gaps in the
> sequence numbers for a session-id on the other side (as a query over
> the log-store, etc.).

I'll point out that generating/checking a monotonic sequence number destroys parallelism, and so it can seriously hurt performance.

Are you trying to detect problems 'on the fly' as they happen, or at the end of the hour/day, saying 'hey, there was a problem at some point'?

How frequent do you think problems are? I would suggest that you run some stress tests on your equipment/network and push things until you do have problems, so you can track when they happen. I expect that you will find that they don't start happening until you have much higher loads than you expect (at least after a bit of tuning), and this can make it so that the most invasive solutions aren't needed.

David Lang

> Large issues such as a producer suddenly going silent can be detected
> using macro mechanisms (like pstats).
>
> On Sat, Feb 13, 2016 at 2:56 AM, David Lang wrote:
> [...]
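The gap-finding query described in the quoted message could be sketched as follows (illustrative only; the stream-identifier/sequence-number fields are assumptions about what the sender would stamp on each message, not an rsyslog feature):

```python
from collections import defaultdict

def find_gaps(records):
    """Given (stream_id, seq_no) pairs pulled from the log-store,
    report per-stream gaps as (stream_id, first_missing_seq, next_seen_seq).
    Sorting per stream tolerates out-of-order arrival within a stream."""
    by_stream = defaultdict(list)
    for stream_id, seq in records:
        by_stream[stream_id].append(seq)
    gaps = []
    for stream_id, seqs in by_stream.items():
        seqs.sort()
        for prev, curr in zip(seqs, seqs[1:]):
            if curr > prev + 1:
                gaps.append((stream_id, prev + 1, curr))
    return gaps
```

As David notes, generating such sequence numbers serializes the send path; this sketch only shows the detection side, run offline against the stored logs.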
Re: [rsyslog] Identifying/quantifying log-loss or proving no-loss in a log-store
Hi,

I don't understand this conversation. If you are sending messages via UDP, there are thousands of $reasons why you can lose a message. Even if you are using TCP, you can lose messages if you don't configure rsyslog to run "in reliable mode". Because of these different $reasons you cannot compare with data from different setups (change one piece, e.g. different network hardware, updated router software..., and you could face different problems).

So why should one waste time quantifying log loss in such an unreliable setup? As said above: even if you have found a working setup (you tested everything, fine-tuned everything, quantified everything), changing _anything_ could change _everything_.

If you really cared, wouldn't you install a local rsyslog daemon (Ra) on the application server and make sure your application uses syslog() to log data (so it is guaranteed that this call will hang/fail if syslog cannot read/accept the message)? In the next step you would have to make sure that rsyslog is running in a reliable mode, i.e. with queues etc., to ensure that rsyslog won't ever throw away a message. From this instance you would then send your data to the next hop (Rr) via RELP. You would also have to configure your Rr instance the same way, to ensure that it will never throw away a message.

If you ever see message loss in such a setup, there must be a bug.

-Thomas
[rsyslog] Identifying/quantifying log-loss or proving no-loss in a log-store
Inviting ideas.

Has anyone tried to quantify log-loss (number of lines lost per day per sender, etc.) for a log-store?

Let us consider the following setup:
- An environment has several application nodes. Each app node hands over its logs to a local Rsyslog daemon (let us call it Ra, Rsyslog-application).
- The environment has one or more Rsyslog receiver nodes (let us call them Rr, Rsyslog-receiver).
- Rr(s) write received logs to a log-store.

The problem statement is: quantify log-loss (defined as messages that are successfully handed over to Ra, but can't be found in the log-store) in log-events lost per day per host.

Log-events may be lost for any reason (in the pipe, or after being written to the log-store). It doesn't matter which of the intermediate systems lost logs, as long as the loss is bounded (by any empirical figure, say less than 0.1%).

-- 
Regards,
Janmejay
http://codehunk.wordpress.com
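The bound in the problem statement reduces to simple arithmetic once per-day counts exist on both ends; a minimal sketch (function names are illustrative, and 0.001 is the 0.1% figure from the post above):

```python
def loss_rate(sent, stored):
    """Fraction of messages handed to Ra that never reached the log-store."""
    if sent == 0:
        return 0.0
    return (sent - stored) / sent

def within_bound(sent, stored, bound=0.001):
    """True when the day's loss stays at or below the empirical bound."""
    return loss_rate(sent, stored) <= bound
```

For example, losing 500 of 1,000,000 messages (0.05%) stays within the 0.1% bound, while losing 10 of 1,000 (1%) does not.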
Re: [rsyslog] Identifying/quantifying log-loss or proving no-loss in a log-store
yes. The easiest way I found to do that is to have a control system and send two streams of data to two or more different destinations.

In the case of rsyslog processing a large message volume over UDP, the loss has always been noticeable.

On Fri, Feb 12, 2016 at 11:35 PM, singh.janmejay wrote:
> Inviting ideas.
>
> Has anyone tried to quantify log-loss (number of lines lost per day
> per sender, etc.) for a log-store?
> [...]
Re: [rsyslog] Identifying/quantifying log-loss or proving no-loss in a log-store
On Sat, 13 Feb 2016, Andre wrote:

> The easiest way I found to do that is to have a control system and send
> two streams of data to two or more different destinations.
>
> In the case of rsyslog processing a large message volume over UDP, the
> loss has always been noticeable.

this depends on your setup. I was able to send UDP logs at gig-E wire speed with no losses, but it required tuning the receiving system to not do DNS lookups, have sufficient RAM for buffering, etc.

I never was able to get my hands on 10G equipment to push up from there.

David Lang
Re: [rsyslog] Identifying/quantifying log-loss or proving no-loss in a log-store
On Fri, 12 Feb 2016, singh.janmejay wrote:

> Inviting ideas.
>
> Has anyone tried to quantify log-loss (number of lines lost per day per
> sender, etc.) for a log-store?
>
> Let us consider the following setup:
> - An environment has several application nodes. Each app node hands
>   over its logs to a local Rsyslog daemon (let us call it Ra,
>   Rsyslog-application).
> - The environment has one or more Rsyslog receiver nodes (let us call
>   them Rr, Rsyslog-receiver).
> - Rr(s) write received logs to a log-store.
>
> The problem statement is: quantify log-loss (defined as messages that
> are successfully handed over to Ra, but can't be found in the
> log-store) in log-events lost per day per host.
>
> Log-events may be lost for any reason (in the pipe, or after being
> written to the log-store). It doesn't matter which of the intermediate
> systems lost logs, as long as the loss is bounded (by any empirical
> figure, say less than 0.1%).

I have done so for benchmark/acceptance tests, but not as an ongoing process on a live system.

pstats will give you a lot of what you want to start with (how many items were sent on one system, so that you can look on other systems, find how many were received, and correlate the two).

Can you go into more detail about what you are trying to prove?

David Lang
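The sender/receiver correlation suggested above could be sketched as follows (illustrative; the per-host counter dictionaries are assumed to have been extracted from pstats output on each end, and the tolerance corresponds to the 0.1% figure from the original post):

```python
def check_loss(sender_counts, receiver_counts, tolerance=0.001):
    """Compare per-host submitted vs. received message counters and
    flag hosts whose shortfall exceeds the tolerance fraction.
    Returns a list of (host, sent, received) tuples to alert on."""
    alerts = []
    for host, sent in sender_counts.items():
        received = receiver_counts.get(host, 0)
        if sent > 0 and (sent - received) / sent > tolerance:
            alerts.append((host, sent, received))
    return alerts
```

Because the counters keep increasing when never reset, this comparison can be run on raw cumulative values, and (as noted elsewhere in the thread) any slew between the two snapshots shrinks relative to the totals over time.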
Re: [rsyslog] Identifying/quantifying log-loss or proving no-loss in a log-store
The ideal solution would be one that identifies host, log-source and time of loss, along with an accurate number of messages lost.

pstats makes sense, but correlating data from stats across a large number of machines will be difficult (some machines may send stats slightly delayed, which may skew aggregation, etc.).

One approach I can think of: slap a stream-identifier and sequence-number on each received message, then find gaps in the sequence numbers for a session-id on the other side (as a query over the log-store, etc.).

Large issues such as a producer suddenly going silent can be detected using macro mechanisms (like pstats).

On Sat, Feb 13, 2016 at 2:56 AM, David Lang wrote:
> this depends on your setup. I was able to send UDP logs at gig-E wire
> speed with no losses, but it required tuning the receiving system to
> not do DNS lookups, have sufficient RAM for buffering, etc.
> [...]

-- 
Regards,
Janmejay
http://codehunk.wordpress.com