Re: [DISCUSS] Generic Syslog Parsing capability for parsers

2018-03-26 Thread Ali Nazemian
Just adding more details regarding what different parts are:

There are three stages here that need to be understood:
1- pre-parsing
2- chain of parsing (wrapping one type of message in another format)
3- post-parsing aka normalization

The pre-parsing stage is where we identify which log format we have
received. Sometimes logs arrive aggregated and we cannot segregate the
feeds without checking the format of each log. Currently, we address this
by consuming every message in multiple parsers, which wastes compute.

The chain of parsers is fairly clear, so I won't go into the details.

Post-parsing is where we normalize the different formats to a single data
model based on different criteria (e.g. tenant).

For example, we may receive Syslog and WEF (Windows event format)
aggregated. First, we want to decide which parser should consume WEF and
which one consumes Syslog. Then, within the WEF parser we have DHCP, DNS,
Application logs etc. We need to send each to the next layer to assign the
right data model, and at the end we need to normalize everything to a
single format based on some criteria (e.g. tenant name).
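
As a rough illustration of the three stages above, here is a minimal Python sketch. It is not Metron code: the function names, the detection heuristics (a leading `<Event` for WEF, a leading `<PRI>` for Syslog), and the per-tenant field map are all assumptions made for the example.

```python
import re

# Hypothetical per-tenant field renames applied during normalization.
TENANT_FIELD_MAPS = {"acme": {"source_format": "format"}}

SYSLOG_PRI = re.compile(r"<\d{1,3}>")
WEF_CHANNEL = re.compile(r"<Channel>([^<]+)</Channel>")


def detect_format(raw):
    """Stage 1, pre-parsing: decide which parser should see this message."""
    text = raw.lstrip()
    if text.startswith("<Event"):
        return "wef"
    if SYSLOG_PRI.match(text):
        return "syslog"
    return "unknown"


def parse_wef(raw):
    """Stage 2: the WEF parser tags the inner log type (DHCP, DNS, ...)."""
    match = WEF_CHANNEL.search(raw)
    log_type = match.group(1).lower() if match else "unknown"
    return {"source_format": "wef", "log_type": log_type, "raw": raw}


def parse_syslog(raw):
    """Stage 2: strip the leading <PRI> and keep the rest as the message."""
    message = SYSLOG_PRI.sub("", raw.lstrip(), count=1)
    return {"source_format": "syslog", "log_type": "syslog", "raw": message}


PARSERS = {"wef": parse_wef, "syslog": parse_syslog}


def normalize(record, tenant):
    """Stage 3, post-parsing: map fields onto the tenant's data model."""
    field_map = TENANT_FIELD_MAPS.get(tenant, {})
    normalized = {field_map.get(key, key): value for key, value in record.items()}
    normalized["tenant"] = tenant
    return normalized


def process(raw, tenant):
    """Run one message through all three stages; None if no parser matches."""
    parser = PARSERS.get(detect_format(raw))
    if parser is None:
        return None
    return normalize(parser(raw), tenant)
```

The point of the sketch is that each message is inspected once and handed to exactly one parser, rather than being consumed by every parser in turn.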

Regards,
Ali


Re: [DISCUSS] Generic Syslog Parsing capability for parsers

2018-03-20 Thread zeo...@gmail.com
So I've kept my ear to the ground regarding this topic for a while now, and
had some conversations a year or so ago about the idea as well.  At the
very least, I think having the concept of a pre-parser is a good one, if
not chaining an arbitrary number of parsers together.  I see this as an
important way to reduce the complexity of implementing new parsers and
getting more community involvement/contributions.

Syslog headers are a solid use case to start with because a lot of
implementations fail to properly implement it on the sending side, at least
in the real world scenarios that I've seen.  Having a way to extend the
parser to easily handle incorrect implementations of syslog would be great,
but anything that can pre-parse or trim the syslog headers to make parsing
further along in the pipeline more simple would help.
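
A minimal sketch of such a header trimmer in Python, assuming a BSD-style (RFC 3164-like) header of `<PRI>Mmm dd hh:mm:ss host`. Treating the PRI as optional and falling back to the untouched line are illustrative choices for sloppy senders, not an existing Metron feature.

```python
import re

# PRI is optional here because some senders omit it (an assumed leniency).
HEADER = re.compile(
    r"^(?:<(?P<pri>\d{1,3})>)?"
    r"(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<host>\S+)\s"
)


def trim_syslog_header(line):
    """Return (header_fields, remainder); leave the line alone if no header."""
    match = HEADER.match(line)
    if match is None:
        return {}, line
    header = {k: v for k, v in match.groupdict().items() if v is not None}
    return header, line[match.end():]
```

The downstream parser then only ever sees the remainder, so it does not need to know anything about syslog.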

Another idea that would be attractive would be the ability to do
opportunistic parsing given an ordered list of parsers and some criteria
for successful parsing (which I admittedly am not sure how to solve) which
(at least in my mind) would require similar logic to parser chaining.  In
some highly decentralized organizations this would be helpful as it takes
the configuration effort off of the team sending the logs (and thus makes
them more willing to send logs _at all_) and pushes it onto the team
parsing and/or storing them.
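
To make the idea concrete, here is a small Python sketch of opportunistic parsing. Everything in it is hypothetical: the two candidate parsers, the three-column CSV layout, and in particular the success criterion (a non-empty `timestamp` field), which is only a stand-in for the unsolved part.

```python
import json


def try_json(raw):
    """Candidate parser: JSON objects only."""
    try:
        value = json.loads(raw)
    except ValueError:
        return None
    return value if isinstance(value, dict) else None


def try_csv(raw):
    """Candidate parser: exactly three comma-separated columns (assumed)."""
    parts = raw.split(",")
    if len(parts) != 3:
        return None
    return dict(zip(("timestamp", "host", "message"), parts))


def looks_successful(record, required=("timestamp",)):
    """Placeholder criterion: the record exists and has the required fields."""
    return record is not None and all(record.get(name) for name in required)


def parse_opportunistically(raw, parsers):
    """Try each (name, parser) in order; return the first acceptable result."""
    for name, parser in parsers:
        record = parser(raw)
        if looks_successful(record):
            return name, record
    return None, None
```

For example, `parse_opportunistically(line, [("json", try_json), ("csv", try_csv)])` tries JSON first and only falls back to CSV when JSON fails or yields a record without a timestamp.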

I'm not suggesting we attempt to crack that second nut here, but I would
love to see that use case kept in mind during discussions.

TL;DR:  +1

Jon

Re: [DISCUSS] Generic Syslog Parsing capability for parsers

2018-03-20 Thread Otto Fowler
I think the chaining of parsers, or the ability to compose parsers, is a
good idea, but with reference to the PR mentioned, I would have some number
of StellarChainLinks as opposed to re-implementing Stellar in chainlinks.
Although it is NiFi-y.  But since I write Processors too, that is fine.




Re: [DISCUSS] Generic Syslog Parsing capability for parsers

2018-03-20 Thread Simon Elliston Ball
It seems like parser chaining is becoming a hot topic on the repo too, with
https://github.com/apache/metron/pull/969#partial-pull-merging


I would like to discuss the option of configuring parsers to operate on the
output of other parsers, and how we might architect it. This may also give us
the opportunity to be more efficient in scenarios where people have large
numbers of sources, and so use up a lot of slots for lower volume parsers.

I have a bunch of ideas around this, but am more keen to hear what everyone
else thinks at this stage. How should we go about fixing parser config so that
it’s clearer (removing the need for people to reinvent the parser wheel, as
we’ve seen in a few places) and also more concise and powerful (consolidating
the parsing of transports such as syslog and content such as application logs,
or types of device logs)?

If this can lead to a more efficient way of handling both the syslog problem, 
and the kind of problem that leads to switching between grok statements in 
something like our ASA parser then all the better. I suspect that there might 
also be a case for multi-level chaining here too, since some things are 
embedded in multiple transports, or might have complex fields that want 
‘sub-parsing’.
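
One way to picture multi-level chaining is as a list of links, each of which parses a single field of the previous link's output. The Python sketch below is illustrative only: `link`, `run_chain`, the toy transport splitter, and the sample message are all invented for the example and do not reflect PR 969 or any Metron API.

```python
import json


def link(field, parser, prefix=""):
    """Build a chain link that replaces `field` with the parser's output."""
    def apply(record):
        out = dict(record)
        value = out.pop(field)
        for key, parsed in parser(value).items():
            out[prefix + key] = parsed
        return out
    return apply


def run_chain(raw, links):
    """Feed the raw message through each link in order."""
    record = {"raw": raw}
    for step in links:
        record = step(record)
    return record


def split_transport(raw):
    """Toy transport layer: 'host body' separated by the first space."""
    host, _, body = raw.partition(" ")
    return {"host": host, "body": body}


def parse_kv(text):
    """Toy sub-parser for 'key=value key=value' fields."""
    return dict(pair.split("=", 1) for pair in text.split())


CHAIN = [
    link("raw", split_transport),                   # transport
    link("body", json.loads),                       # content
    link("details", parse_kv, prefix="details."),   # sub-parsing one field
]
```

Running `run_chain('fw01 {"action": "deny", "details": "src=10.0.0.1 dst=10.0.0.2"}', CHAIN)` peels the transport, parses the JSON content, then sub-parses the `details` field, all within one pass over the message.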

Of course one of the key values of Metron is its speed, so maybe formalising 
some of the microbenchmarking approaches a few of us have been working on might 
help here too. I’ve got a few bits of micro-benching infrastructure around CEF 
and ASA, and I believe there’s also been some work to load and perf test things 
like enrichment that might be leveraged.
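
For a sense of what a formalised micro-benchmark could look like, here is a small Python sketch built on the standard `timeit` module. It is not the CEF/ASA infrastructure mentioned above; the `microbench` helper and its defaults are assumptions for illustration.

```python
import json
import timeit


def microbench(parser, samples, repeat=5, number=1000):
    """Return the best observed cost per message, in microseconds."""
    timings = timeit.repeat(
        lambda: [parser(sample) for sample in samples],
        repeat=repeat,
        number=number,
    )
    # Take the minimum: slower runs mostly reflect noise from the machine.
    return min(timings) / (number * len(samples)) * 1e6


if __name__ == "__main__":
    sample_messages = ['{"action": "deny", "src": "10.0.0.1"}'] * 100
    print(f"{microbench(json.loads, sample_messages):.2f} us/message")
```

Running the same harness over a single parser and over a chained equivalent would give a like-for-like number for the cost of chaining.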

Thoughts on a dev board? 

Simon




[DISCUSS] Generic Syslog Parsing capability for parsers

2018-03-20 Thread Otto Fowler
I entered METRON-1453 a little while ago while working on PR#579.

"We have several parsers now, with many imaginable that are based on
syslog, where the format is SYSLOG HEADER MESSAGE.

With message being in a different format. It would be great if we had a way
to generically handle syslog headers, such that ANY parser data could come
over syslog.

Either you could have a custom parser, or configure CSV or JSON such that
they could be the payload, such that you can handle JSON over syslog by
configuration only."

The idea would be that the parser bolt would use the configuration to
trigger parsing the incoming message as syslog formatted, and pass the
message part to the parser, and put the syslog parts in the message(s)
after parsing.
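
A minimal Python sketch of that flow, for JSON over syslog. The config keys (`syslogWrapped`, `payloadParser`) and the `syslog.*` field names are hypothetical, and the header pattern only accepts RFC 5424 messages whose structured data is the nil value `-`.

```python
import json
import re

# RFC 5424 header with nil ("-") structured data only; a simplification.
HEADER = re.compile(
    r"^<(?P<pri>\d{1,3})>(?P<version>\d+) (?P<timestamp>\S+) (?P<host>\S+) "
    r"(?P<app>\S+) (?P<procid>\S+) (?P<msgid>\S+) - (?P<msg>.*)$",
    re.DOTALL,
)

# Hypothetical registry of payload parsers selectable by configuration.
PAYLOAD_PARSERS = {
    "json": json.loads,
    "csv": lambda text: {"columns": text.split(",")},
}


def parse(raw, config):
    """Parse `raw`, unwrapping a syslog header first if the config asks for it."""
    payload_parser = PAYLOAD_PARSERS[config["payloadParser"]]
    if not config.get("syslogWrapped"):
        return payload_parser(raw)
    match = HEADER.match(raw)
    if match is None:
        raise ValueError("message is not syslog formatted")
    record = dict(payload_parser(match.group("msg")))
    pri = int(match.group("pri"))
    # Put the syslog parts into the parsed message.
    record.update({
        "syslog.facility": pri // 8,
        "syslog.severity": pri % 8,
        "syslog.timestamp": match.group("timestamp"),
        "syslog.host": match.group("host"),
        "syslog.app": match.group("app"),
    })
    return record
```

With `{"syslogWrapped": True, "payloadParser": "json"}` the JSON parser itself is unchanged; only the configuration says that a syslog header comes first.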

As part of this I did some work on parsing syslog, using both grok and a
DSL that I wrote from the spec: https://github.com/ottobackwards/grok-v-antlr

The DSL is slower, but grok cannot handle multiple structured data entries,
and the DSL can. I’m not good enough at grok to fix it so that it is
functionally equivalent. Another option would be to write a third parser…
It is also possible that the DSL could be improved for speed of course.
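
For reference, "multiple structured data entries" means several bracketed SD-ELEMENTs back to back, as RFC 5424 allows. The Python sketch below shows one way to walk them with a repeated regex match. It is neither the grok pattern nor the DSL from the repository above, and it does not enforce every rule in the spec (for example, name length limits).

```python
import re

NAME = r'[^\s\]="]+'
VALUE = r'(?:[^"\\\]]|\\.)*'
SD_ELEMENT = re.compile(
    r'\[(?P<id>' + NAME + r')(?P<params>(?: ' + NAME + r'="' + VALUE + r'")*)\]'
)
SD_PARAM = re.compile(r'(?P<name>' + NAME + r')="(?P<value>' + VALUE + r')"')


def unescape(value):
    """Undo the three escapes RFC 5424 defines inside values: \\" \\\\ \\]"""
    return re.sub(r'\\(["\\\]])', r'\1', value)


def parse_structured_data(sd):
    """Return {sd_id: {param: value}} for every element in the SD field."""
    if sd == "-":  # nil value: no structured data
        return {}
    elements = {}
    position = 0
    while position < len(sd):
        match = SD_ELEMENT.match(sd, position)
        if match is None:
            raise ValueError(f"malformed structured data at offset {position}")
        elements[match.group("id")] = {
            param.group("name"): unescape(param.group("value"))
            for param in SD_PARAM.finditer(match.group("params"))
        }
        position = match.end()
    return elements
```

The loop is what handles the "multiple entries" case: each pass consumes one `[...]` element and continues from where the previous one ended.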

Thoughts?