Re: [DISCUSS] Real-time processing engine: Storm, Spark, Flink or Cloud Native

2019-05-10 Thread Ali Nazemian
Oops. Turns out I totally missed this email.

Thanks, Mike, for your reply. Spark's native support for Kubernetes was
added very recently, and it is not yet at a stage where it can provide all
the aforementioned features. There is no doubt that Spark is a powerful
tool and that it has been widely used for similar use cases over the last
few years. However, when we look at the features Spark provides and try to
map them to Metron's high-level architecture, it is hard to believe that
Spark will add much value to this architecture for event processing (no
doubt about the batch side of it, though). When we compare it with more
lightweight frameworks for event-driven data processing pipelines and
cloud-native architectures, all the features Spark targets on the real-time
side can be covered by the architecture natively (without getting help from
the framework). Things like fault tolerance, reliability, back pressure,
at-least-once guarantees, etc. can all be provided very easily. The only
difference is that you get the full support of Kubernetes features out of
the box, instead of waiting for the technology to evolve and maybe, in two
years, reach the point where you can truly have things like self-healing,
change isolation, auto-scalability, etc. with Spark. You can have them all
right now just by looking at this problem from a different angle.
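
To make the auto-scalability point concrete, here is a minimal sketch of the
kind of replica calculation an autoscaler performs, using the ingestion rates
mentioned in the quoted message below (4k eps at peak, under 1 eps off-peak).
The per-replica target of 500 eps and the replica bounds are assumptions
chosen purely for illustration, and the function name is hypothetical.

```python
import math


def replicas_for(eps: float,
                 target_eps_per_replica: float = 500.0,  # assumed capacity per replica
                 min_replicas: int = 1,                   # assumed lower bound
                 max_replicas: int = 20) -> int:          # assumed upper bound
    """Return how many parser replicas are wanted for a given ingestion rate."""
    # Enough replicas to keep each one at or below its target rate.
    wanted = math.ceil(eps / target_eps_per_replica)
    # Clamp to the configured bounds so the deployment never drops to zero
    # and never grows without limit during a log flood.
    return max(min_replicas, min(max_replicas, wanted))


peak = replicas_for(4000)       # 4k eps during peak time
off_peak = replicas_for(0.5)    # less than 1 eps at night
flood = replicas_for(50_000)    # unpredictable burst, e.g. during a DDoS
```

With these assumed numbers the same sensor needs 8 replicas at peak and 1
off-peak, and a flood is capped at the configured maximum.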

Cheers,
Ali

On Fri, Apr 12, 2019 at 3:54 AM Michael Miklavcic <
michael.miklav...@gmail.com> wrote:

> Hi Ali,
>
> Thank you for taking the time to share your experiences with us. I've been
> thinking about this a while now and wanted to take some time for reflection
> before responding. I need to kick out a proper dev list DISCUSS thread on
> this, but if you've seen a couple of the recent refactoring PR's, you are
> right that we've been looking to decouple ourselves from Storm and open up
> the possibility of onboarding another processing engine. The most obvious
> candidate here, imho, is Apache Spark. Getting right to the meat of your
> discussion points, I don't think this is an either/or proposition between
> Kubernetes and Spark. I see this as an AND proposition. The reality is that
> Spark offers quite a bit from a job scaling, redundancy, and efficiency
> perspective. Not to mention, the capabilities it provides purely from a
> data transformation and processing engine perspective. The real roadmap, at
> least in my mind, would be for us to onboard Spark and then leverage
> Kubernetes at some point to enable some of the features that you describe -
> vertical and horizontal elasticity, in particular. In addition to that,
> Helm could provide some compelling features for managing that container
> application deployment story. Expect a discussion from me very soon about
> more specific ideas as to what I think our integration with Spark can and
> should look like in the near future with Metron. We have nearly completed
> decoupling our core infrastructure from Storm at this point, which opens us
> up to a number of possibilities going forward.
>
> Best,
> Mike Miklavcic
>
>
> On Thu, Apr 4, 2019 at 1:35 AM Ali Nazemian  wrote:
>
> > Hi All,
> >
> > As far as I understand, there is a plan to change the real-time engine of
> > Metron due to some issues that users and developers have been facing with
> > it. I would like to explain some critical issues that customers have been
> > facing, to clarify for the development team what the best approach could
> > be for the future of Metron. Based on the experience we have had with
> > Metron, there are two important issues that cause lots of problems on
> > both the technology and business sides:
> >
> > - Infrastructure cost
> > - Operational complexity
> >
> > We have had lots of issues minimizing infrastructure cost. We have also
> > spent significant time tuning infrastructure to reduce the cost. However,
> > regardless of what was done, we were not able to manage our cost
> > properly. The main reason is that the rate of log ingestion fluctuates
> > heavily: we were receiving 4k eps on a sensor during peak time and less
> > than 1 eps off-peak (e.g. during the night). The problem is that you want
> > an environment that can easily *scale up* and *scale down* based on your
> > ingestion traffic. Not to mention that there have been situations where
> > we could not even predict the ingestion rate, because a cyber attack
> > caused the source devices to generate lots of logs. A DDoS, for example,
> > is one scenario in which lots of logs are generated.
> >
> > When it comes to operational complexity, we have had lots of issues
> > managing sensors and tuning different parameters based on the traffic we
> > receive. We have also had lots of failures for different reasons, and
> > we spent a fair amount of time writing scripts that simulate a
> > *self-healing* feature at a very basic 

Re: [VOTE] Metron Release Candidate 0.7.1-RC2

2019-05-10 Thread Michael Miklavcic
+1 binding

Validated same as Nick.

Mike

On Thu, May 9, 2019 at 5:54 PM Nick Allen  wrote:

> +1 binding
>
> I validated the release tarball, ran the full test suite and validated the
> CentOS 6 development environment.  Everything looks solid.  Let's ship it.
>
> On Wed, May 8, 2019 at 6:50 PM Justin Leet  wrote:
>
> > This is a call to vote on releasing Apache Metron 0.7.1
> >
> > Full list of changes in this release:
> > https://dist.apache.org/repos/dist/dev/metron/0.7.1-RC2/CHANGES
> > The tag to be voted upon is:
> > apache-metron_0.7.1-rc2
> >
> > The source archives being voted upon can be found here:
> >
> >
> https://dist.apache.org/repos/dist/dev/metron/0.7.1-RC2/apache-metron_0.7.1-rc2.tar.gz
> >
> > Other release files, signatures and digests can be found here:
> > https://dist.apache.org/repos/dist/dev/metron/0.7.1-RC2/
> >
> > The release artifacts are signed with the following key:
> > https://dist.apache.org/repos/dist/release/metron/KEYS
> > Please vote on releasing this package as Apache Metron 0.7.1-RC2
> >
> > When voting, please list the actions taken to verify the release.
> >
> > Recommended build validation and verification instructions are posted
> > here:
> > https://cwiki.apache.org/confluence/display/METRON/Verifying+Builds
> >
> > This vote will be open until 7pm EDT on Monday May 13 2019, to account
> > for the weekend.
> >
> > [ ] +1 Release this package as Apache Metron 0.7.1-RC2
> >
> > [ ] 0 No opinion
> >
> > [ ] -1 Do not release this package because...
> >
>


Re: [DISCUSS] JsonMapParser original string functionality

2019-05-10 Thread Otto Fowler
The original string would be the string specified as the message body, thus
each message in the chain produced would just be the bytes passed in, from
a specific field in the incoming message.



On May 10, 2019 at 19:55:28, Simon Elliston Ball (
si...@simonellistonball.com) wrote:

My understanding is that chaining preserves (correctly to my mind) the
original original string.

In other words: unless the message strategy is raw message, the original
string is just passed through. Original string therefore comes from outside
Metron, and is preserved throughout Metron processes, allowing for
recreation of original form for forensics and evidentiary purposes.

Simon

> On 11 May 2019, at 00:10, Otto Fowler  wrote:
>
> What about parser chaining? Should the original string be from kafka, or
> the last parsed?
>
>
> On May 10, 2019 at 19:03:39, Simon Elliston Ball (
> si...@simonellistonball.com) wrote:
>
> The only scenario I can think of where a parser might treat original string
> differently, or even need to know about it would be different encoding
> locales. For example, if the string were to be encoded in a locale specific
> to the device and choose the encoding based on metadata or parsed content,
> then that could merit pushing it down. The other edge might be when you
> have binary data that does not go down to an original string well (eg a
> netflow parser).
>
> That said, that’s a highly unlikely edge case that could be handled by
> workarounds.
>
> I’m definitely a +1 on Nick’s idea of pulling original string up to the
> runner. Right now we’re pretty inconsistent in how it’s done, so that would
> help.
>
> Simon
>
> Sent from my iPhone
>
> On 10 May 2019, at 23:10, Nick Allen  wrote:
>
>>> I suppose we could always allow this to be overridden, also.
>>
>> I like an on/off switch for the "original string" functionality. If on,
>> you get the original string in pristine condition. If off, no original
>> string is appended for those who care more about storage space.
>>
>> I can't think of a reason where one kind of parser would have a different
>> original string mechanism than the others. If something like that does
>> come up, the parser can create its own original string by just naming it
>> something different and then turning "off" the switch that you described.
>>
>>
>>
>> On Fri, May 10, 2019 at 5:53 PM Michael Miklavcic <
>> michael.miklav...@gmail.com> wrote:
>>
>>> I think that's an excellent idea. Can anyone think of a situation where we
>>> wouldn't want to add this the same way for all parsers? I suppose we could
>>> always allow this to be overridden, also.
>>>
>>>> On Fri, May 10, 2019 at 3:43 PM Nick Allen  wrote:
>>>>
>>>> I think maintaining the integrity of the original data makes a lot of sense
>>>> for any parser. And ideally the original string should be what came out of
>>>> Kafka with only the minimally necessary processing.
>>>>
>>>> With that in mind, we could solve this one level up. Instead of relying on
>>>> each parser to do this right, we could have the ParserRunner and
>>>> specifically the ParserRunnerImpl [1] handle this round-abouts here
>>>> <https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158>
>>>> [1].
>>>> It has the raw message data and can append the original string to each
>>>> message it gets back from the parsers.
>>>>
>>>> Just another approach to consider.
>>>>
>>>> --
>>>> [1]
>>>> https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158
>>>>
>>>> On Fri, May 10, 2019 at 4:11 PM Otto Fowler  wrote:
>>>>
>>>>> +1
>>>>>
>>>>>
>>>>> On May 10, 2019 at 13:57:55, Michael Miklavcic (
>>>>> michael.miklav...@gmail.com)
>>>>> wrote:
>>>>>
>>>>> When adding the capability for parsing messages in the JsonMapParser using
>>>>> JSON Path expressions the original behavior for managing original strings
>>>>> was changed.
>>>>>
>>>>> https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java#L192
>>>>>
>>>>> A couple issues have been reported recently regarding this change:
>>>>>
>>>>> 1. We're losing the actual original string, which is a legal issue for
>>>>> data lineage for some customers
>>>>> 2. Even for the degenerate case with no sub-messages created, the
>>>>> original sub-message string is modified because of the
>>>>> serialization/deserialization process with Jackson/JsonSimple. The fields
>>>>> are reordered bc the content is normalized.
>>>>>
>>>>> I looked at options for preserving formatting, but am unable to find a
>>>>> method that allows you to both parse, then query the 
Re: [DISCUSS] JsonMapParser original string functionality

2019-05-10 Thread Simon Elliston Ball
My understanding is that chaining preserves (correctly to my mind) the original 
original string.

In other words: unless the message strategy is raw message, the original string 
is just passed through. Original string therefore comes from outside Metron, 
and is preserved throughout Metron processes, allowing for recreation of 
original form for forensics and evidentiary purposes.

Simon

> On 11 May 2019, at 00:10, Otto Fowler  wrote:
> 
> What about parser chaining?   Should the original string be from kafka, or
> the last parsed?
> 
> 
> On May 10, 2019 at 19:03:39, Simon Elliston Ball (
> si...@simonellistonball.com) wrote:
> 
> The only scenario I can think of where a parser might treat original string
> differently, or even need to know about it would be different encoding
> locales. For example, if the string were to be encoded in a locale specific
> to the device and choose the encoding based on metadata or parsed content,
> then that could merit pushing it down. The other edge might be when you
> have binary data that does not go down to an original string well (eg a
> netflow parser).
> 
> That said, that’s a highly unlikely edge case that could be handled by
> workarounds.
> 
> I’m definitely a +1 on Nick’s idea of pulling original string up to the
> runner. Right now we’re pretty inconsistent in how it’s done, so that would
> help.
> 
> Simon
> 
> Sent from my iPhone
> 
> On 10 May 2019, at 23:10, Nick Allen  wrote:
> 
>>> I suppose we could always allow this to be overridden, also.
>> 
>> I like an on/off switch for the "original string" functionality. If on,
>> you get the original string in pristine condition. If off, no original
>> string is appended for those who care more about storage space.
>> 
>> I can't think of a reason where one kind of parser would have a different
>> original string mechanism than the others. If something like that does
>> come up, the parser can create its own original string by just naming it
>> something different and then turning "off" the switch that you described.
>> 
>> 
>> 
>> On Fri, May 10, 2019 at 5:53 PM Michael Miklavcic <
>> michael.miklav...@gmail.com> wrote:
>> 
>>> I think that's an excellent idea. Can anyone think of a situation where we
>>> wouldn't want to add this the same way for all parsers? I suppose we could
>>> always allow this to be overridden, also.
>>>
>>>> On Fri, May 10, 2019 at 3:43 PM Nick Allen  wrote:
>>>>
>>>> I think maintaining the integrity of the original data makes a lot of sense
>>>> for any parser. And ideally the original string should be what came out of
>>>> Kafka with only the minimally necessary processing.
>>>>
>>>> With that in mind, we could solve this one level up. Instead of relying on
>>>> each parser to do this right, we could have the ParserRunner and
>>>> specifically the ParserRunnerImpl [1] handle this round-abouts here
>>>> <https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158>
>>>> [1].
>>>> It has the raw message data and can append the original string to each
>>>> message it gets back from the parsers.
>>>>
>>>> Just another approach to consider.
>>>>
>>>> --
>>>> [1]
>>>> https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158
>>>>
>>>> On Fri, May 10, 2019 at 4:11 PM Otto Fowler  wrote:
>>>>
>>>>> +1
>>>>>
>>>>>
>>>>> On May 10, 2019 at 13:57:55, Michael Miklavcic (
>>>>> michael.miklav...@gmail.com)
>>>>> wrote:
>>>>>
>>>>> When adding the capability for parsing messages in the JsonMapParser using
>>>>> JSON Path expressions the original behavior for managing original strings
>>>>> was changed.
>>>>>
>>>>> https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java#L192
>>>>>
>>>>> A couple issues have been reported recently regarding this change:
>>>>>
>>>>> 1. We're losing the actual original string, which is a legal issue for
>>>>> data lineage for some customers
>>>>> 2. Even for the degenerate case with no sub-messages created, the
>>>>> original sub-message string is modified because of the
>>>>> serialization/deserialization process with Jackson/JsonSimple. The fields
>>>>> are reordered bc the content is normalized.
>>>>>
>>>>> I looked at options for preserving formatting, but am unable to find a
>>>>> method that allows you to both parse, then query the original message and
>>>>> then also obtain the raw string matches without the normalizing from
>>>>> ser/deserialization.
>>>>>
>>>>> I'd like to propose that we add a configuration option for this parser that
>>>>> allows the user 

Re: [DISCUSS] JsonMapParser original string functionality

2019-05-10 Thread Otto Fowler
What about parser chaining?   Should the original string be from kafka, or
the last parsed?


On May 10, 2019 at 19:03:39, Simon Elliston Ball (
si...@simonellistonball.com) wrote:

The only scenario I can think of where a parser might treat original string
differently, or even need to know about it would be different encoding
locales. For example, if the string were to be encoded in a locale specific
to the device and choose the encoding based on metadata or parsed content,
then that could merit pushing it down. The other edge might be when you
have binary data that does not go down to an original string well (eg a
netflow parser).

That said, that’s a highly unlikely edge case that could be handled by
workarounds.

I’m definitely a +1 on Nick’s idea of pulling original string up to the
runner. Right now we’re pretty inconsistent in how it’s done, so that would
help.

Simon

Sent from my iPhone

On 10 May 2019, at 23:10, Nick Allen  wrote:

>> I suppose we could always allow this to be overridden, also.
>
> I like an on/off switch for the "original string" functionality. If on,
> you get the original string in pristine condition. If off, no original
> string is appended for those who care more about storage space.
>
> I can't think of a reason where one kind of parser would have a different
> original string mechanism than the others. If something like that does
> come up, the parser can create its own original string by just naming it
> something different and then turning "off" the switch that you described.
>
>
>
> On Fri, May 10, 2019 at 5:53 PM Michael Miklavcic <
> michael.miklav...@gmail.com> wrote:
>
>> I think that's an excellent idea. Can anyone think of a situation where we
>> wouldn't want to add this the same way for all parsers? I suppose we could
>> always allow this to be overridden, also.
>>
>>> On Fri, May 10, 2019 at 3:43 PM Nick Allen  wrote:
>>>
>>> I think maintaining the integrity of the original data makes a lot of sense
>>> for any parser. And ideally the original string should be what came out of
>>> Kafka with only the minimally necessary processing.
>>>
>>> With that in mind, we could solve this one level up. Instead of relying on
>>> each parser to do this right, we could have the ParserRunner and
>>> specifically the ParserRunnerImpl [1] handle this round-abouts here
>>> <https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158>
>>> [1].
>>> It has the raw message data and can append the original string to each
>>> message it gets back from the parsers.
>>>
>>> Just another approach to consider.
>>>
>>> --
>>> [1]
>>> https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158
>>>
>>> On Fri, May 10, 2019 at 4:11 PM Otto Fowler  wrote:
>>>
>>>> +1
>>>>
>>>>
>>>> On May 10, 2019 at 13:57:55, Michael Miklavcic (
>>>> michael.miklav...@gmail.com)
>>>> wrote:
>>>>
>>>> When adding the capability for parsing messages in the JsonMapParser using
>>>> JSON Path expressions the original behavior for managing original strings
>>>> was changed.
>>>>
>>>> https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java#L192
>>>>
>>>> A couple issues have been reported recently regarding this change:
>>>>
>>>> 1. We're losing the actual original string, which is a legal issue for
>>>> data lineage for some customers
>>>> 2. Even for the degenerate case with no sub-messages created, the
>>>> original sub-message string is modified because of the
>>>> serialization/deserialization process with Jackson/JsonSimple. The fields
>>>> are reordered bc the content is normalized.
>>>>
>>>> I looked at options for preserving formatting, but am unable to find a
>>>> method that allows you to both parse, then query the original message and
>>>> then also obtain the raw string matches without the normalizing from
>>>> ser/deserialization.
>>>>
>>>> I'd like to propose that we add a configuration option for this parser that
>>>> allows the user to toggle which approach they'd like to use. My personal
>>>> preference based on feedback I've gotten from multiple customers is that
>>>> the default should be the older approach which takes the raw original
>>>> string. It's arguable that this change in contract is a regression, so the
>>>> default should be the earlier behavior. Any sub-messages would then have a
>>>> copy of that raw original string, not just the sub-message original string.
>>>> Enabling the flag would enable the current sub-message original string
>>>> functionality.
>>>>
>>>> Mike
>>>>
>>>
>>


Re: [DISCUSS] JsonMapParser original string functionality

2019-05-10 Thread Nick Allen
>  I suppose we could always allow this to be overridden, also.

I like an on/off switch for the "original string" functionality.  If on,
you get the original string in pristine condition.  If off, no original
string is appended for those who care more about storage space.

I can't think of a reason where one kind of parser would have a different
original string mechanism than the others.  If something like that does
come up, the parser can create its own original string by just naming it
something different and then turning "off" the switch that you described.
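
As a rough illustration of the switch described above, here is a minimal
sketch in Python (not Metron's Java, and not its actual API). The function
name `finalize`, the parameter `add_original_string`, and the field name
`device_original` are hypothetical; `original_string` is assumed as the
field name for the sake of the example.

```python
from typing import Dict, List


def finalize(raw: str,
             messages: List[Dict[str, object]],
             add_original_string: bool = True) -> List[Dict[str, object]]:
    """Append the untouched input to every parsed message, unless switched off."""
    if not add_original_string:
        # Switch off: nothing is appended, which saves storage space.
        return messages
    # Switch on: every message carries the pristine input string.
    return [{**message, "original_string": raw} for message in messages]


# Switch on (default): the pristine string is attached.
on = finalize("<14>raw syslog line", [{"severity": 14}])

# Switch off: a parser with special needs stores its own variant under a
# different (hypothetical) field name and nothing else is added.
off = finalize("<14>raw syslog line",
               [{"severity": 14, "device_original": "<14>raw"}],
               add_original_string=False)
```

With the switch off, a parser that needs its own mechanism keeps full control
of what it stores, which matches the escape hatch described above.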



On Fri, May 10, 2019 at 5:53 PM Michael Miklavcic <
michael.miklav...@gmail.com> wrote:

> I think that's an excellent idea. Can anyone think of a situation where we
> wouldn't want to add this the same way for all parsers? I suppose we could
> always allow this to be overridden, also.
>
> On Fri, May 10, 2019 at 3:43 PM Nick Allen  wrote:
>
> > I think maintaining the integrity of the original data makes a lot of sense
> > for any parser. And ideally the original string should be what came out of
> > Kafka with only the minimally necessary processing.
> >
> > With that in mind, we could solve this one level up.  Instead of relying on
> > each parser to do this right, we could have the ParserRunner and
> > specifically the ParserRunnerImpl [1] handle this round-abouts here
> > <https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158>
> > [1].
> > It has the raw message data and can append the original string to each
> > message it gets back from the parsers.
> >
> > Just another approach to consider.
> >
> > --
> > [1]
> > https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158
> >
> > On Fri, May 10, 2019 at 4:11 PM Otto Fowler 
> > wrote:
> >
> > > +1
> > >
> > >
> > > On May 10, 2019 at 13:57:55, Michael Miklavcic (
> > > michael.miklav...@gmail.com)
> > > wrote:
> > >
> > > When adding the capability for parsing messages in the JsonMapParser using
> > > JSON Path expressions the original behavior for managing original strings
> > > was changed.
> > >
> > > https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java#L192
> > >
> > > A couple issues have been reported recently regarding this change:
> > >
> > > 1. We're losing the actual original string, which is a legal issue for
> > > data lineage for some customers
> > > 2. Even for the degenerate case with no sub-messages created, the
> > > original sub-message string is modified because of the
> > > serialization/deserialization process with Jackson/JsonSimple. The fields
> > > are reordered bc the content is normalized.
> > >
> > > I looked at options for preserving formatting, but am unable to find a
> > > method that allows you to both parse, then query the original message and
> > > then also obtain the raw string matches without the normalizing from
> > > ser/deserialization.
> > >
> > > I'd like to propose that we add a configuration option for this parser that
> > > allows the user to toggle which approach they'd like to use. My personal
> > > preference based on feedback I've gotten from multiple customers is that
> > > the default should be the older approach which takes the raw original
> > > string. It's arguable that this change in contract is a regression, so the
> > > default should be the earlier behavior. Any sub-messages would then have a
> > > copy of that raw original string, not just the sub-message original string.
> > > Enabling the flag would enable the current sub-message original string
> > > functionality.
> > >
> > > Mike
> > >
> >
>


Re: [DISCUSS] JsonMapParser original string functionality

2019-05-10 Thread Michael Miklavcic
I think that's an excellent idea. Can anyone think of a situation where we
wouldn't want to add this the same way for all parsers? I suppose we could
always allow this to be overridden, also.

On Fri, May 10, 2019 at 3:43 PM Nick Allen  wrote:

> I think maintaining the integrity of the original data makes a lot of sense
> for any parser. And ideally the original string should be what came out of
> Kafka with only the minimally necessary processing.
>
> With that in mind, we could solve this one level up.  Instead of relying on
> each parser to do this right, we could have the ParserRunner and
> specifically the ParserRunnerImpl [1] handle this round-abouts here
> <
> https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158
> >
> [1].
> It has the raw message data and can append the original string to each
> message it gets back from the parsers.
>
> Just another approach to consider.
>
> --
> [1]
>
> https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158
>
> On Fri, May 10, 2019 at 4:11 PM Otto Fowler 
> wrote:
>
> > +1
> >
> >
> > On May 10, 2019 at 13:57:55, Michael Miklavcic (
> > michael.miklav...@gmail.com)
> > wrote:
> >
> > When adding the capability for parsing messages in the JsonMapParser using
> > JSON Path expressions the original behavior for managing original strings
> > was changed.
> >
> > https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java#L192
> >
> > A couple issues have been reported recently regarding this change:
> >
> > 1. We're losing the actual original string, which is a legal issue for
> > data lineage for some customers
> > 2. Even for the degenerate case with no sub-messages created, the
> > original sub-message string is modified because of the
> > serialization/deserialization process with Jackson/JsonSimple. The fields
> > are reordered bc the content is normalized.
> >
> > I looked at options for preserving formatting, but am unable to find a
> > method that allows you to both parse, then query the original message and
> > then also obtain the raw string matches without the normalizing from
> > ser/deserialization.
> >
> > I'd like to propose that we add a configuration option for this parser that
> > allows the user to toggle which approach they'd like to use. My personal
> > preference based on feedback I've gotten from multiple customers is that
> > the default should be the older approach which takes the raw original
> > string. It's arguable that this change in contract is a regression, so the
> > default should be the earlier behavior. Any sub-messages would then have a
> > copy of that raw original string, not just the sub-message original string.
> > Enabling the flag would enable the current sub-message original string
> > functionality.
> >
> > Mike
> >
>


Re: [DISCUSS] JsonMapParser original string functionality

2019-05-10 Thread Nick Allen
I think maintaining the integrity of the original data makes a lot of sense
for any parser. And ideally the original string should be what came out of
Kafka with only the minimally necessary processing.

With that in mind, we could solve this one level up.  Instead of relying on
each parser to do this right, we could have the ParserRunner and
specifically the ParserRunnerImpl [1] handle this round-abouts here

[1].
It has the raw message data and can append the original string to each
message it gets back from the parsers.
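
To illustrate the idea, here is a minimal sketch in Python rather than
Metron's Java. It is not the ParserRunnerImpl code; `run_parser` and
`toy_parser` are hypothetical names, and `original_string` is assumed as the
field name. The point is only that the runner, which holds the raw bytes,
stamps every message after parsing, so individual parsers never need to.

```python
import json
from typing import Callable, Dict, List

# A parser takes raw bytes and may return several messages.
Parser = Callable[[bytes], List[Dict[str, object]]]


def run_parser(raw: bytes, parser: Parser) -> List[Dict[str, object]]:
    """Hypothetical runner step: parse first, then attach the raw input."""
    original = raw.decode("utf-8")
    messages = parser(raw)
    for message in messages:
        # Done once here instead of separately inside every parser.
        message["original_string"] = original
    return messages


def toy_parser(raw: bytes) -> List[Dict[str, object]]:
    """Toy parser that emits one message per element of a JSON array."""
    return [{"value": value} for value in json.loads(raw)]


results = run_parser(b"[1, 2]", toy_parser)
```

Both resulting messages carry the same untouched input, regardless of what
the parser did internally.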

Just another approach to consider.

--
[1]
https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158

On Fri, May 10, 2019 at 4:11 PM Otto Fowler  wrote:

> +1
>
>
> On May 10, 2019 at 13:57:55, Michael Miklavcic (
> michael.miklav...@gmail.com)
> wrote:
>
> When adding the capability for parsing messages in the JsonMapParser using
> JSON Path expressions the original behavior for managing original strings
> was changed.
>
>
> https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java#L192
>
> A couple issues have been reported recently regarding this change:
>
> 1. We're losing the actual original string, which is a legal issue for
> data lineage for some customers
> 2. Even for the degenerate case with no sub-messages created, the
> original sub-message string is modified because of the
> serialization/deserialization process with Jackson/JsonSimple. The fields
> are reordered bc the content is normalized.
>
> I looked at options for preserving formatting, but am unable to find a
> method that allows you to both parse, then query the original message and
> then also obtain the raw string matches without the normalizing from
> ser/deserialization.
>
> I'd like to propose that we add a configuration option for this parser that
> allows the user to toggle which approach they'd like to use. My personal
> preference based on feedback I've gotten from multiple customers is that
> the default should be the older approach which takes the raw original
> string. It's arguable that this change in contract is a regression, so the
> default should be the earlier behavior. Any sub-messages would then have a
> copy of that raw original string, not just the sub-message original string.
> Enabling the flag would enable the current sub-message original string
> functionality.
>
> Mike
>
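
The two behaviors discussed in the quoted proposal can be sketched as
follows. This is an illustration in Python, not the JSONMapParser code: the
function name, the `sub_message_original` flag, and the `original_string`
field name are assumptions, and `sort_keys=True` merely stands in for
whatever normalization the Jackson/JsonSimple round trip performs.

```python
import json
from typing import Dict, List


def split_messages(raw: str, list_key: str,
                   sub_message_original: bool = False) -> List[Dict[str, object]]:
    """Split a JSON document into sub-messages and attach an original string."""
    document = json.loads(raw)
    messages = []
    for sub in document[list_key]:
        message = dict(sub)
        if sub_message_original:
            # Current behavior: re-serialize the sub-message, which loses the
            # original key order and whitespace.
            message["original_string"] = json.dumps(sub, sort_keys=True)
        else:
            # Proposed default: every sub-message keeps the raw input as-is.
            message["original_string"] = raw
        messages.append(message)
    return messages


raw = '{"events": [{"b": 1,  "a": 2}, {"d": 3, "c": 4}]}'
default = split_messages(raw, "events")
flagged = split_messages(raw, "events", sub_message_original=True)
```

In the default case every sub-message holds a byte-for-byte copy of `raw`.
With the flag set, the first sub-message's original string becomes
`{"a": 2, "b": 1}`, which appears nowhere in the input, so it cannot serve as
evidence of what was actually received.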


Re: [VOTE] Metron Release Candidate 0.7.1-RC2

2019-05-10 Thread Nick Allen
I really enjoyed the retro, 3-digit vibe on that one.

On Fri, May 10, 2019 at 4:38 PM Michael Miklavcic <
michael.miklav...@gmail.com> wrote:

> "METRON-685" - wow, that one was a long time coming.
>
> On Thu, May 9, 2019 at 5:54 PM Nick Allen  wrote:
>
> > +1 binding
> >
> > I validated the release tarball, ran the full test suite and validated the
> > CentOS 6 development environment.  Everything looks solid.  Let's ship it.
> >
> > On Wed, May 8, 2019 at 6:50 PM Justin Leet  wrote:
> >
> > > This is a call to vote on releasing Apache Metron 0.7.1
> > >
> > > Full list of changes in this release:
> > > https://dist.apache.org/repos/dist/dev/metron/0.7.1-RC2/CHANGES
> > > The tag to be voted upon is:
> > > apache-metron_0.7.1-rc2
> > >
> > > The source archives being voted upon can be found here:
> > > https://dist.apache.org/repos/dist/dev/metron/0.7.1-RC2/apache-metron_0.7.1-rc2.tar.gz
> > >
> > > Other release files, signatures and digests can be found here:
> > > https://dist.apache.org/repos/dist/dev/metron/0.7.1-RC2/
> > >
> > > The release artifacts are signed with the following key:
> > > https://dist.apache.org/repos/dist/release/metron/KEYS
> > > Please vote on releasing this package as Apache Metron 0.7.1-RC2
> > >
> > > When voting, please list the actions taken to verify the release.
> > >
> > > Recommended build validation and verification instructions are posted
> > > here:
> > > https://cwiki.apache.org/confluence/display/METRON/Verifying+Builds
> > >
> > > This vote will be open until 7pm EDT on Monday, May 13, 2019, to account
> > > for the weekend.
> > >
> > > [ ] +1 Release this package as Apache Metron 0.7.1-RC2
> > >
> > > [ ] 0 No opinion
> > >
> > > [ ] -1 Do not release this package because...
> > >
> >
>


Re: [VOTE] Metron Release Candidate 0.7.1-RC2

2019-05-10 Thread Michael Miklavcic
"METRON-685" - wow, that one was a long time coming.

On Thu, May 9, 2019 at 5:54 PM Nick Allen  wrote:

> +1 binding
>
> I validated the release tarball, ran the full test suite and validated the
> CentOS 6 development environment.  Everything looks solid.  Let's ship it.
>
> On Wed, May 8, 2019 at 6:50 PM Justin Leet  wrote:
>
> > This is a call to vote on releasing Apache Metron 0.7.1
> >
> > Full list of changes in this release:
> > https://dist.apache.org/repos/dist/dev/metron/0.7.1-RC2/CHANGES
> > The tag to be voted upon is:
> > apache-metron_0.7.1-rc2
> >
> > The source archives being voted upon can be found here:
> > https://dist.apache.org/repos/dist/dev/metron/0.7.1-RC2/apache-metron_0.7.1-rc2.tar.gz
> >
> > Other release files, signatures and digests can be found here:
> > https://dist.apache.org/repos/dist/dev/metron/0.7.1-RC2/
> >
> > The release artifacts are signed with the following key:
> > https://dist.apache.org/repos/dist/release/metron/KEYS
> > Please vote on releasing this package as Apache Metron 0.7.1-RC2
> >
> > When voting, please list the actions taken to verify the release.
> >
> > Recommended build validation and verification instructions are posted
> > here:
> > https://cwiki.apache.org/confluence/display/METRON/Verifying+Builds
> >
> > This vote will be open until 7pm EDT on Monday, May 13, 2019, to account
> > for the weekend.
> >
> > [ ] +1 Release this package as Apache Metron 0.7.1-RC2
> >
> > [ ] 0 No opinion
> >
> > [ ] -1 Do not release this package because...
> >
>


Re: [DISCUSS] JsonMapParser original string functionality

2019-05-10 Thread Otto Fowler
+1


On May 10, 2019 at 13:57:55, Michael Miklavcic (michael.miklav...@gmail.com)
wrote:

When adding the capability for parsing messages in the JsonMapParser using
JSON Path expressions the original behavior for managing original strings
was changed.

https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java#L192

A couple issues have been reported recently regarding this change:

1. We're losing the actual original string, which is a legal issue for
data lineage for some customers
2. Even for the degenerate case with no sub-messages created, the
original sub-message string is modified because of the
serialization/deserialization process with Jackson/JsonSimple. The fields
are reordered because the content is normalized.

I looked at options for preserving formatting, but am unable to find a
method that allows you to both parse, then query the original message and
then also obtain the raw string matches without the normalizing from
ser/deserialization.

I'd like to propose that we add a configuration option for this parser that
allows the user to toggle which approach they'd like to use. My personal
preference based on feedback I've gotten from multiple customers is that
the default should be the older approach which takes the raw original
string. It's arguable that this change in contract is a regression, so the
default should be the earlier behavior. Any sub-messages would then have a
copy of that raw original string, not just the sub-message original string.
Enabling the flag would enable the current sub-message original string
functionality.

Mike


[DISCUSS] JsonMapParser original string functionality

2019-05-10 Thread Michael Miklavcic
When adding the capability for parsing messages in the JsonMapParser using
JSON Path expressions the original behavior for managing original strings
was changed.

https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java#L192

A couple issues have been reported recently regarding this change:

   1. We're losing the actual original string, which is a legal issue for
   data lineage for some customers
   2. Even for the degenerate case with no sub-messages created, the
   original sub-message string is modified because of the
   serialization/deserialization process with Jackson/JsonSimple. The fields
   are reordered because the content is normalized.

I looked at options for preserving formatting, but am unable to find a
method that allows you to both parse, then query the original message and
then also obtain the raw string matches without the normalizing from
ser/deserialization.
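
To make the second issue concrete, here is a minimal sketch of the round-trip problem. It uses Python's standard json module as a stand-in, not Jackson or JsonSimple, so the exact normalization differs from what the parser does; the sample message is made up. The sort_keys=True call only imitates an object model that does not preserve key order, which is an assumption about how the reordering arises.

```python
import json

# Hypothetical raw message: odd spacing, "b" before "a", exponent notation.
raw = '{"b": 1.0e3,   "a": "value"}'

# Parse, then serialize again, as a parse-then-rewrite pipeline would.
parsed = json.loads(raw)
round_tripped = json.dumps(parsed)

# Imitate a serializer backed by an unordered or sorted map (assumption).
reordered = json.dumps(parsed, sort_keys=True)

# The data is equivalent, but neither output matches the raw string.
print(raw)
print(round_tripped)
print(reordered)
```

The parsed content is the same in every case, yet whitespace, number formatting and key order are not guaranteed to survive, which is why a re-serialized string cannot serve as the original for lineage purposes.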

I'd like to propose that we add a configuration option for this parser that
allows the user to toggle which approach they'd like to use. My personal
preference based on feedback I've gotten from multiple customers is that
the default should be the older approach which takes the raw original
string. It's arguable that this change in contract is a regression, so the
default should be the earlier behavior. Any sub-messages would then have a
copy of that raw original string, not just the sub-message original string.
Enabling the flag would enable the current sub-message original string
functionality.
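
A rough sketch of how the proposed toggle could behave, written in Python for brevity rather than as a patch to the Java parser. The function name, the use_sub_message_original flag and the sample message are all hypothetical; the original_string field name is assumed here for illustration. With the flag off (the proposed default) every sub-message carries the untouched raw string; with it on, each sub-message carries its own re-serialized string, as in the current behavior.

```python
import json
from typing import Any, Dict, List

ORIGINAL_STRING = "original_string"  # assumed field name, for illustration


def attach_original_string(
    raw_message: str,
    sub_messages: List[Dict[str, Any]],
    use_sub_message_original: bool = False,  # hypothetical config flag
) -> List[Dict[str, Any]]:
    """Return copies of the sub-messages with an original-string field set."""
    results = []
    for sub in sub_messages:
        out = dict(sub)  # copy so the caller's data is not mutated
        if use_sub_message_original:
            # Current behavior: re-serialized sub-message (normalized).
            out[ORIGINAL_STRING] = json.dumps(sub)
        else:
            # Proposed default: the raw string exactly as received.
            out[ORIGINAL_STRING] = raw_message
        results.append(out)
    return results


# Hypothetical message with two sub-messages under "foo".
raw = '{"foo": [{"x":1},{"x":2}]}'
subs = json.loads(raw)["foo"]

default_mode = attach_original_string(raw, subs)
flag_enabled = attach_original_string(raw, subs, use_sub_message_original=True)
```

In the default mode both sub-messages share the same raw string, which keeps lineage intact at the cost of duplicating the full message on each sub-message.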

Mike