Re: [rsyslog] Identifying/quantifying log-loss or proving no-loss in a log-store

2016-02-16 Thread David Lang

On Tue, 16 Feb 2016, singh.janmejay wrote:


@David: As of now, I am thinking of an end-of-the-day style measurement
(basically, report the number of messages lost at a good-enough
granularity, say host x severity).

I am thinking of this as something independent of the frequency of outages
and unrelated to maintenance windows. I'm thinking of it as a report
that captures the extent of loss, where one can pull down several months
of this data and verify loss was never beyond an acceptable level, and
compare it across days when the load profile was very different (e.g. the
day when too many circuit-breakers engaged).

I haven't thought through this, but a reset may not be required.
Basically, let the counter count up and wrap around (as long as
wrap-around is well-defined behavior which is accounted for during
measurement).


I have my central server produce a daily report of how many logs it got from 
each source[1], and my significant traffic generators generate a similar report. 
I can then spot check them, or put them on the same graph, etc.



David Lang

[1] Well, actually, what I do is a bit fancier, with redundancies, because I 
haven't cleaned things up yet :-)


The first thing I do is make a file that collects 'useful' info about logs that 
arrive


$template sources,"%hostname% %fromhost-ip% %programname% %timegenerated:::date-rfc3339% %$.len%\n"

set $.len = strlen($rawmsg);
/var/log/sources-messages;sources

This gives me a one-line-per-message file that I can easily do things like

cut -f 1 -d ' ' sources-messages |sort |uniq -c

to get a per-host log count

or

cut -f 2 -d ' ' sources-messages |sort |uniq -c

to get a report of the relay servers that send me logs


rotate this file on a regular basis, and you have the ability to get stats for 
arbitrary time windows


I'm slowly tweaking this to run things through SEC and have SEC produce 
per-minute stats that are summaries of the data, making it much faster to 
summarize. I also have SEC dumping some of these stats to my monitoring system.


you can do similar stuff with the pstats output: set it to a reasonable 
granularity and capture the count that the sender claims they are sending in 
your monitoring system, and then capture the count that you see on the other 
end. compare the two and if there is a significant difference, alert.
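A minimal sketch of that comparison step (Python; the function shape and the 1% threshold are illustrative assumptions, not anything rsyslog provides):

```python
# Sketch: compare the count a sender claims to have submitted against the
# count observed on the receiving end for one host over one interval, and
# flag a significant gap. The 1% default threshold is an arbitrary example.
def check_loss(sent, received, max_loss_ratio=0.01):
    """Return (loss_count, alert?) for one host over one interval."""
    lost = max(sent - received, 0)  # timing skew can make received > sent
    ratio = lost / sent if sent else 0.0
    return lost, ratio > max_loss_ratio

# 10 messages missing out of 100000 is 0.01% -- below the example threshold.
lost, alert = check_loss(sent=100000, received=99990)
```

The monitoring system would run this per host per interval and alert only when the second element is true.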



___
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.


Re: [rsyslog] Identifying/quantifying log-loss or proving no-loss in a log-store

2016-02-16 Thread Rainer Gerhards
2016-02-16 12:27 GMT+01:00 singh.janmejay :

> @Thomas: This is not about testing and quantifying loss during a test.
> It's about quantifying it during normal operation. I see it as a choice
> between:
> A. deploy the strongest protocol at every system-boundary and test
> each one rigorously and each change rigorously to identify or bound
> loss in test conditions, and expect nothing unexpected to show up in
> production
> B. do the former and measure loss in production to identify that
> something unexpected happened
> C. deploy efficient protocols at all system-boundaries and measure
> loss (as long as loss stays within an acceptable level, deployment
> benefits from all the efficiency gains)
>
> I am talking in the context of C.
>
> If/when loss is above the acceptable level, one needs to debug and fix the
> problem. Both B and C provide the data required to identify
> situation(s) when such debugging needs to happen.
>
> The approach of stamping on one end and measuring on the other treats
> all intermediate hops as a blackbox. For instance, it can be used to
> quantify losses in face of frequent machine failures or down-time free
> maintenance etc.
>
> @David: As of now, I am thinking of an end-of-the-day style measurement
> (basically, report the number of messages lost at a good-enough
> granularity, say host x severity).
>
> I am thinking of this as something independent of the frequency of outages
> and unrelated to maintenance windows. I'm thinking of it as a report
> that captures the extent of loss, where one can pull down several months
> of this data and verify loss was never beyond an acceptable level, and
> compare it across days when the load profile was very different (e.g. the
> day when too many circuit-breakers engaged).
>
>
I just wanted to push in a link to an upcoming new feature:

https://github.com/rsyslog/rsyslog/pull/764

Rainer

> I haven't thought through this, but a reset may not be required.
> Basically, let the counter count up and wrap around (as long as
> wrap-around is well-defined behavior which is accounted for during
> measurement).
>
>
> On Sat, Feb 13, 2016 at 5:13 AM, David Lang  wrote:
> > On Sat, 13 Feb 2016, singh.janmejay wrote:
> >
> >> The ideal solution would be one that identifies host, log-source and
> >> time of loss along with accurate number of messages lost.
> >>
> >> pstats makes sense, but correlating data from stats across large
> >> number of machines will be difficult (some machines may send stats
> >> slightly delayed which may skew aggregation etc).
> >
> >
> > if you don't reset the counters, they keep increasing, so over time the
> > error due to the slew becomes a very minor component.
> >
> >
> >> One approach I can think of: slap a stream-identifier and
> >> sequence-number on each received message, then find gaps in sequence
> >> number for a session-id on the other side (as a query over log-store
> >> etc).
> >
> >
> > I'll point out that generating/checking a monotonic sequence number
> > destroys parallelism, and so it can seriously hurt performance.
> >
> > Are you trying to detect problems 'on the fly' as they happen? or at the
> > end of the hour/day saying 'hey, there was a problem at some point'
> >
> > how frequent do you think problems are? I would suggest that you run some
> > stress tests on your equipment/network and push things until you do have
> > problems, so you can track when they happen. I expect that you will find
> > that they don't start happening until you have much higher loads than you
> > expect (at least after a bit of tuning), and this can make it so that the
> > most invasive solutions aren't needed.
> >
> > David Lang
> >
> >
> >> Large issues such as producer suddenly going silent can be detected
> >> using macro mechanisms (like pstats).
> >>
> >> On Sat, Feb 13, 2016 at 2:56 AM, David Lang  wrote:
> >>>
> >>> On Sat, 13 Feb 2016, Andre wrote:
> >>>
> 
>  The easiest way I found to do that is to have a control system and send
>  two streams of data to two or more different destinations.
> 
>  In the case of rsyslog processing a large message volume over UDP, the
>  loss has always been noticeable.
> >>>
> >>>
> >>>
> >>> this depends on your setup. I was able to send UDP logs at gig-E wire
> >>> speed with no losses, but it required tuning the receiving system to not
> >>> do DNS lookups, have sufficient RAM for buffering, etc
> >>>
> >>>
> >>> I never was able to get my hands on 10G equipment to push up from there.
> >>>
> >>> David Lang
> >>>

Re: [rsyslog] Identifying/quantifying log-loss or proving no-loss in a log-store

2016-02-16 Thread singh.janmejay
@Thomas: This is not about testing and quantifying loss during a test.
It's about quantifying it during normal operation. I see it as a choice
between:
A. deploy the strongest protocol at every system-boundary and test
each one rigorously and each change rigorously to identify or bound
loss in test conditions, and expect nothing unexpected to show up in
production
B. do the former and measure loss in production to identify that
something unexpected happened
C. deploy efficient protocols at all system-boundaries and measure
loss (as long as loss stays within an acceptable level, deployment
benefits from all the efficiency gains)

I am talking in the context of C.

If/when loss is above the acceptable level, one needs to debug and fix the
problem. Both B and C provide the data required to identify
situation(s) when such debugging needs to happen.

The approach of stamping on one end and measuring on the other treats
all intermediate hops as a blackbox. For instance, it can be used to
quantify losses in face of frequent machine failures or down-time free
maintenance etc.

@David: As of now, I am thinking of an end-of-the-day style measurement
(basically, report the number of messages lost at a good-enough
granularity, say host x severity).

I am thinking of this as something independent of the frequency of outages
and unrelated to maintenance windows. I'm thinking of it as a report
that captures the extent of loss, where one can pull down several months
of this data and verify loss was never beyond an acceptable level, and
compare it across days when the load profile was very different (e.g. the
day when too many circuit-breakers engaged).

I haven't thought through this, but a reset may not be required.
Basically, let the counter count up and wrap around (as long as
wrap-around is well-defined behavior which is accounted for during
measurement).
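A wrap-around-tolerant delta could be sketched like this (Python; the 64-bit counter width and the at-most-one-wrap-per-interval assumption are mine, not anything rsyslog guarantees):

```python
# Sketch: per-interval delta of a monotonically increasing counter that may
# wrap around. Assumes an unsigned 64-bit counter and at most one wrap per
# measurement interval (both assumptions for illustration).
COUNTER_BITS = 64
MODULUS = 1 << COUNTER_BITS

def counter_delta(previous, current, modulus=MODULUS):
    """Return messages counted since `previous`, accounting for wrap-around."""
    return (current - previous) % modulus

# Normal case, and a counter that wrapped from near the maximum past zero:
assert counter_delta(10, 25) == 15
assert counter_delta(MODULUS - 5, 3) == 8
```

The modulo arithmetic makes the wrap a non-event for the measurement, which is the "well-defined behavior" condition mentioned above.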


On Sat, Feb 13, 2016 at 5:13 AM, David Lang  wrote:
> On Sat, 13 Feb 2016, singh.janmejay wrote:
>
>> The ideal solution would be one that identifies host, log-source and
>> time of loss along with accurate number of messages lost.
>>
>> pstats makes sense, but correlating data from stats across large
>> number of machines will be difficult (some machines may send stats
>> slightly delayed which may skew aggregation etc).
>
>
> if you don't reset the counters, they keep increasing, so over time the
> error due to the slew becomes a very minor component.
>
>
>> One approach I can think of: slap a stream-identifier and
>> sequence-number on each received message, then find gaps in sequence
>> number for a session-id on the other side (as a query over log-store
>> etc).
>
>
> I'll point out that generating/checking a monotonic sequence number destroys
> parallelism, and so it can seriously hurt performance.
>
> Are you trying to detect problems 'on the fly' as they happen? or at the end
> of the hour/day saying 'hey, there was a problem at some point'
>
> how frequent do you think problems are? I would suggest that you run some
> stress tests on your equipment/network and push things until you do have
> problems, so you can track when they happen. I expect that you will find
> that they don't start happening until you have much higher loads than you
> expect (at least after a bit of tuning), and this can make it so that the
> most invasive solutions aren't needed.
>
> David Lang
>
>
>> Large issues such as producer suddenly going silent can be detected
>> using macro mechanisms (like pstats).
>>
>> On Sat, Feb 13, 2016 at 2:56 AM, David Lang  wrote:
>>>
>>> On Sat, 13 Feb 2016, Andre wrote:
>>>

 The easiest way I found to do that is to have a control system and send
 two
 streams of data to two or more different destinations.

 In the case of rsyslog processing a large message volume over UDP, the loss
 has always been noticeable.
>>>
>>>
>>>
>>> this depends on your setup. I was able to send UDP logs at gig-E wire
>>> speed
>>> with no losses, but it required tuning the receiving system to not do DNS
>>> lookups, have sufficient RAM for buffering, etc
>>>
>>>
>>> I never was able to get my hands on 10G equipment to push up from there.
>>>
>>> David Lang
>>>
>>
>>
>>
>>
>>

Re: [rsyslog] Identifying/quantifying log-loss or proving no-loss in a log-store

2016-02-16 Thread David Lang

On Sat, 13 Feb 2016, singh.janmejay wrote:


The ideal solution would be one that identifies host, log-source and
time of loss along with accurate number of messages lost.

pstats makes sense, but correlating data from stats across large
number of machines will be difficult (some machines may send stats
slightly delayed which may skew aggregation etc).


if you don't reset the counters, they keep increasing, so over time the error 
due to the slew becomes a very minor component.




One approach I can think of: slap a stream-identifier and
sequence-number on each received message, then find gaps in sequence
number for a session-id on the other side (as a query over log-store
etc).


I'll point out that generating/checking a monotonic sequence number destroys 
parallelism, and so it can seriously hurt performance.


Are you trying to detect problems 'on the fly' as they happen? or at the end of 
the hour/day saying 'hey, there was a problem at some point'


how frequent do you think problems are? I would suggest that you run some stress 
tests on your equipment/network and push things until you do have problems, so 
you can track when they happen. I expect that you will find that they don't 
start happening until you have much higher loads than you expect (at least after 
a bit of tuning), and this can make it so that the most invasive solutions 
aren't needed.


David Lang


Large issues such as producer suddenly going silent can be detected
using macro mechanisms (like pstats).

On Sat, Feb 13, 2016 at 2:56 AM, David Lang  wrote:

On Sat, 13 Feb 2016, Andre wrote:



The easiest way I found to do that is to have a control system and send
two
streams of data to two or more different destinations.

In case of rsyslog processing a large message volume UDP the loss has
always been noticeable.



this depends on your setup. I was able to send UDP logs at gig-E wire speed
with no losses, but it required tuning the receiving system to not do DNS
lookups, have sufficient RAM for buffering, etc


I never was able to get my hands on 10G equipment to push up from there.

David Lang









Re: [rsyslog] Identifying/quantifying log-loss or proving no-loss in a log-store

2016-02-16 Thread Thomas D.
Hi,

I don't understand this conversation. If you are sending messages via
UDP there are thousands of $reasons why you can lose a message.

Even if you are using TCP you can lose messages if you don't configure
rsyslog to run "in reliable mode".

Because of these different $reasons you cannot compare data from
different setups (change one piece, e.g. different network hardware,
updated router software..., and you could face different problems).

So why should one waste time quantifying log loss in such an
unreliable setup?

As said above: even if you have found a working setup (you have
tested everything, fine-tuned everything, quantified everything),
changing _anything_ could change _everything_.

If you really cared, wouldn't you install a local rsyslog daemon
(Ra) on the application server and make sure your application uses
syslog() to log data (so it is guaranteed that this call hangs/fails if
syslog cannot read/accept the message)?
In the next step you would have to make sure that rsyslog is running in
a reliable mode, i.e. with queues etc., to ensure that rsyslog won't ever
throw away a message.
From this instance you would then send your data to the next hop (Rr) via
RELP. You would also have to configure your Rr instance the same way to
ensure that it never throws away a message.
If you ever see message loss in such a setup, there must be a bug.
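A minimal sketch of the forwarding action on such an Ra instance might look like this (rsyslog RainerScript; the target name, port, and queue sizes are placeholder assumptions, not recommendations):

```
# Sketch only: forward everything to the receiver Rr via RELP with a
# disk-assisted queue, so messages survive restarts and outages.
# "rr.example.com", the port, and the queue parameters are placeholders.
module(load="omrelp")
action(type="omrelp"
       target="rr.example.com" port="2514"
       queue.type="LinkedList"
       queue.filename="fwd_rr"          # setting a filename enables disk assistance
       queue.maxDiskSpace="1g"
       queue.saveOnShutdown="on"
       action.resumeRetryCount="-1")    # retry forever, never drop
```

The Rr side would pair this with imrelp and an equally conservative queue on its own outputs.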


-Thomas


[rsyslog] Identifying/quantifying log-loss or proving no-loss in a log-store

2016-02-12 Thread singh.janmejay
Inviting ideas.

Has anyone tried to quantify log-loss (number of lines lost per day
per sender, etc.) for a log-store?

Let us consider the following setup:
- An environment has several application nodes. Each app node
hands over its logs to a local rsyslog daemon (let us call it Ra,
Rsyslog-application).
- The environment has one or more rsyslog receiver nodes (let us call
them Rr, Rsyslog-receiver).
- Rr(s) write received logs to a log-store.

The problem statement is: quantify log-loss (defined as messages that
are successfully handed over to Ra, but can't be found in the log-store)
in log-events lost per day per host.

Log-events may be lost for any reason (in the pipe, or after
being written to the log-store). It doesn't matter which of the
intermediate systems lost logs, as long as the loss is bounded (by some
empirical figure, say less than 0.1%).

-- 
Regards,
Janmejay
http://codehunk.wordpress.com


Re: [rsyslog] Identifying/quantifying log-loss or proving no-loss in a log-store

2016-02-12 Thread Andre
yes.

The easiest way I found to do that is to have a control system and send two
streams of data to two or more different destinations.

In the case of rsyslog processing a large message volume over UDP, the loss
has always been noticeable.

On Fri, Feb 12, 2016 at 11:35 PM, singh.janmejay 
wrote:

> Inviting ideas.
>
> Has anyone tried to quantify log-loss (number of lines lost per day
> per sender, etc.) for a log-store?
>
> Let us consider the following setup:
> - An environment has several application nodes. Each app node
> hands over its logs to a local rsyslog daemon (let us call it Ra,
> Rsyslog-application).
> - The environment has one or more rsyslog receiver nodes (let us call
> them Rr, Rsyslog-receiver).
> - Rr(s) write received logs to a log-store.
>
> The problem statement is: quantify log-loss (defined as messages that
> are successfully handed over to Ra, but can't be found in the log-store)
> in log-events lost per day per host.
>
> Log-events may be lost for any reason (in the pipe, or after
> being written to the log-store). It doesn't matter which of the
> intermediate systems lost logs, as long as the loss is bounded (by some
> empirical figure, say less than 0.1%).
>
> --
> Regards,
> Janmejay
> http://codehunk.wordpress.com


Re: [rsyslog] Identifying/quantifying log-loss or proving no-loss in a log-store

2016-02-12 Thread David Lang

On Sat, 13 Feb 2016, Andre wrote:



The easiest way I found to do that is to have a control system and send two
streams of data to two or more different destinations.

In the case of rsyslog processing a large message volume over UDP, the loss
has always been noticeable.


this depends on your setup. I was able to send UDP logs at gig-E wire speed with 
no losses, but it required tuning the receiving system to not do DNS lookups, 
have sufficient RAM for buffering, etc



I never was able to get my hands on 10G equipment to push up from there.

David Lang


Re: [rsyslog] Identifying/quantifying log-loss or proving no-loss in a log-store

2016-02-12 Thread David Lang

On Fri, 12 Feb 2016, singh.janmejay wrote:


Inviting ideas.

Has anyone tried to quantify log-loss (number of lines lost per day
per sender, etc.) for a log-store?

Let us consider the following setup:
- An environment has several application nodes. Each app node
hands over its logs to a local rsyslog daemon (let us call it Ra,
Rsyslog-application).
- The environment has one or more rsyslog receiver nodes (let us call
them Rr, Rsyslog-receiver).
- Rr(s) write received logs to a log-store.

The problem statement is: quantify log-loss (defined as messages that
are successfully handed over to Ra, but can't be found in the log-store)
in log-events lost per day per host.

Log-events may be lost for any reason (in the pipe, or after
being written to the log-store). It doesn't matter which of the
intermediate systems lost logs, as long as the loss is bounded (by some
empirical figure, say less than 0.1%).


I have done so for benchmark/acceptance tests, but not as an ongoing process on 
a live system.


pstats will give you a lot of what you want to start with (how many items were 
sent on one system, so that you can look on other systems and find how many 
were received, to correlate the two).


can you go into more detail about what you are trying to prove?

David Lang


Re: [rsyslog] Identifying/quantifying log-loss or proving no-loss in a log-store

2016-02-12 Thread singh.janmejay
The ideal solution would be one that identifies host, log-source and
time of loss along with accurate number of messages lost.

pstats makes sense, but correlating data from stats across large
number of machines will be difficult (some machines may send stats
slightly delayed which may skew aggregation etc).

One approach I can think of: slap a stream-identifier and
sequence-number on each received message, then find gaps in sequence
number for a session-id on the other side (as a query over log-store
etc).
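That gap-finding query could be sketched offline like this (Python; the (stream-id, sequence-number) record format is a hypothetical illustration, since rsyslog does not stamp such sequence numbers by itself):

```python
# Sketch: given (stream_id, sequence_number) pairs recovered from the
# log-store, report missing sequence ranges per stream.
from collections import defaultdict

def find_gaps(records):
    """records: iterable of (stream_id, seq) pairs.
    Returns {stream_id: [(lo, hi), ...]} where lo..hi (inclusive) are
    sequence numbers that never showed up in the log-store."""
    by_stream = defaultdict(list)
    for stream, seq in records:
        by_stream[stream].append(seq)
    gaps = {}
    for stream, seqs in by_stream.items():
        seqs.sort()
        missing = [(a + 1, b - 1) for a, b in zip(seqs, seqs[1:]) if b - a > 1]
        if missing:
            gaps[stream] = missing
    return gaps

# Example: stream "ra-1" is missing sequence numbers 3 and 4.
print(find_gaps([("ra-1", 1), ("ra-1", 2), ("ra-1", 5)]))
```

In practice this would run as a query or batch job over the log-store, per stream/session, at the end of the day.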

Large issues such as producer suddenly going silent can be detected
using macro mechanisms (like pstats).

On Sat, Feb 13, 2016 at 2:56 AM, David Lang  wrote:
> On Sat, 13 Feb 2016, Andre wrote:
>
>>
>> The easiest way I found to do that is to have a control system and send
>> two
>> streams of data to two or more different destinations.
>>
>> In the case of rsyslog processing a large message volume over UDP, the loss
>> has always been noticeable.
>
>
> this depends on your setup. I was able to send UDP logs at gig-E wire speed
> with no losses, but it required tuning the receiving system to not do DNS
> lookups, have sufficient RAM for buffering, etc
>
>
> I never was able to get my hands on 10G equipment to push up from there.
>
> David Lang
>



-- 
Regards,
Janmejay
http://codehunk.wordpress.com