[DISCUSS] deprecate misleading install methods and docs?

2019-10-29 Thread Simon Elliston Ball
Following many discussions on the user and dev lists in the past, a number of 
users seem to have problems with the old Ansible methods for installing Metron on AWS. 

I am not aware of anyone who is maintaining this area (please shout if you are 
willing to take on bringing this up to date) and we have a lot of outdated 
documentation in both the source tree and the wiki around older, now broken 
install methods. 

My proposal is that we consolidate the multitude of deployment methods and, for
the methods outside of the Ambari MPack and full-dev installs, either:
* remove them,
* mark them as deprecated, or
* move them to contrib.

Does anyone have any thoughts about how we can clean this up and reduce the 
number of options that seem to be confusing new users coming to the platform? I 
am happy as long as the Ambari method currently used by the distributor (who, 
as most of you know, I work for, in the interest of full disclosure) remains, 
and full-dev remains as is to avoid disruption to the development process. I have 
no strong opinions on any of the other deployment methods, other than that 
their existence seems to be hindering new community members. 

Thoughts?
Simon



Re: Threat Intel hailataxii

2019-10-29 Thread Simon Elliston Ball
Looks to me like your discovery server is not working properly, hence the 
failure message. This could be a temporary connectivity issue, but if it’s 
repeatable I would look into your opentaxii config. 

Simon 

> On 29 Oct 2019, at 13:23, Thiago Rahal Disposti  
> wrote:
> 
> 
> Does anyone know what's going on with the Hail a Taxii server?
> 
> I'm getting a service temporarily unavailable response for more than 3 weeks 
> now.
> 
> 
> 
> 
> 
> Thanks,
> Thiago Rahal


Re: [DISCUSS] HDP 3.1 Upgrade and release strategy

2019-08-27 Thread Simon Elliston Ball
Not sure it’s in the scope of the project to handle the HDP upgrade as
well. I would scope it to Metron config only, and point at the extensive
upgrade capability of Ambari, rather than us trying to recreate the way the
distribution works.

Simon

On Tue, 27 Aug 2019 at 22:23, Otto Fowler  wrote:

> If anyone can think of the things that need to be backed up, please
> comment the jira.
>
>
>
>
> On August 27, 2019 at 17:07:20, Otto Fowler (ottobackwa...@gmail.com)
> wrote:
>
> Good idea METRON-2239 [blocker].
>
>
>
> On August 27, 2019 at 16:30:13, Simon Elliston Ball (
> si...@simonellistonball.com) wrote:
>
> You could always submit a Jira :)
>
> On Tue, 27 Aug 2019 at 21:27, Otto Fowler  wrote:
>
>> You are right, that is much better than backup_metron_configs.sh.
>>
>>
>>
>> On August 27, 2019 at 16:05:38, Simon Elliston Ball (
>> si...@simonellistonball.com) wrote:
>>
>> You can do this with zk_load_configs and Ambari’s blueprint api, so we
>> kinda already do.
>>
>> Simon
>>
>> On Tue, 27 Aug 2019 at 20:19, Otto Fowler 
>> wrote:
>>
>>> Maybe we need some automated method to backup configurations and restore
>>> them.
>>>
>>>
>>>
>>> On August 27, 2019 at 14:46:58, Michael Miklavcic (
>>> michael.miklav...@gmail.com) wrote:
>>>
>>> > Once you back up your metron configs, the same configs that worked on
>>> the previous version will continue to work on the version running on HDP
>>> 3.x.  If there is any discrepancy between the two or additional settings
>>> will be required, those will be documented in the release notes.  From the
>>> Metron perspective, this upgrade would be no different than simply
>>> upgrading to the new Metron version.
>>>
>>> This upgrade cannot be performed the same way we've done it in the past.
>>> A number of platform upgrades, including OS, are required:
>>>
>>>1. Requires the OS to be updated on all nodes because there are no
>>>Centos6 RPMs provided in HDP 3.1. Must bump to Centos7.
>>>2. The final new HBase code will not run on HDP 2.6
>>>3. The MPack changes for new Ambari are not backwards compatible
>>>4. YARN and HDFS/MR are also at risk to be backwards incompatible
>>>
>>>
>>> On Tue, Aug 27, 2019 at 12:39 PM Michael Miklavcic <
>>> michael.miklav...@gmail.com> wrote:
>>>
>>>> Adding the dev list back into the thread (a reply-all was missed).
>>>>
>>>> On Tue, Aug 27, 2019 at 10:49 AM James Sirota 
>>>> wrote:
>>>>
>>>>> I agree with Simon.  HDP 2.x platform is rapidly approaching EOL and
>>>>> everyone will likely need to migrate by end of year.  Doing this platform
>>>>> upgrade sooner will give everyone visibility into what Metron on HDP 3.x
>>>>> looks like so they have time to plan and upgrade at their own pace.
>>>>> Feature-wise, the Metron application itself will be unchanged.  It is
>>>>> merely the platform underneath that is changing.  HDP itself can be
>>>>> upgraded per instructions here:
>>>>> https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/release-notes/content/upgrading_parent.html
>>>>>
>>>>> Once you back up your metron configs, the same configs that worked on
>>>>> the previous version will continue to work on the version running on HDP
>>>>> 3.x.  If there is any discrepancy between the two or additional settings
>>>>> will be required, those will be documented in the release notes.  From the
>>>>> Metron perspective, this upgrade would be no different than simply
>>>>> upgrading to the new Metron version.
>>>>>
>>>>> James
>>>>>
>>>>>
>>>>> 27.08.2019, 07:09, "Simon Elliston Ball" :
>>>>>
>>>>> Something worth noting here is that HDP 2.6.5 is quite old and
>>>>> approaching EoL rapidly, so the issue of upgrade is urgent. I am aware of
>>>>> a large number of users who require this upgrade ASAP, and in fact am aware
>>>>> of zero users who wish to remain on HDP 2.
>>>>>
>>>>> Perhaps those users who want to stay on the old platform can stick
>>>>> their hands up and raise concerns, but this move will likely have to 
>>>>> happen
>>>>> very soon.

Re: [DISCUSS] HDP 3.1 Upgrade and release strategy

2019-08-27 Thread Simon Elliston Ball
You could always submit a Jira :)

On Tue, 27 Aug 2019 at 21:27, Otto Fowler  wrote:

> You are right, that is much better than backup_metron_configs.sh.
>
>
>
>
> On August 27, 2019 at 16:05:38, Simon Elliston Ball (
> si...@simonellistonball.com) wrote:
>
> You can do this with zk_load_configs and Ambari’s blueprint api, so we
> kinda already do.
>
> Simon
>
> On Tue, 27 Aug 2019 at 20:19, Otto Fowler  wrote:
>
>> Maybe we need some automated method to backup configurations and restore
>> them.
>>
>>
>>
>> On August 27, 2019 at 14:46:58, Michael Miklavcic (
>> michael.miklav...@gmail.com) wrote:
>>
>> > Once you back up your metron configs, the same configs that worked on
>> the previous version will continue to work on the version running on HDP
>> 3.x.  If there is any discrepancy between the two or additional settings
>> will be required, those will be documented in the release notes.  From the
>> Metron perspective, this upgrade would be no different than simply
>> upgrading to the new Metron version.
>>
>> This upgrade cannot be performed the same way we've done it in the past.
>> A number of platform upgrades, including OS, are required:
>>
>>1. Requires the OS to be updated on all nodes because there are no
>>Centos6 RPMs provided in HDP 3.1. Must bump to Centos7.
>>2. The final new HBase code will not run on HDP 2.6
>>3. The MPack changes for new Ambari are not backwards compatible
>>4. YARN and HDFS/MR are also at risk to be backwards incompatible
>>
>>
>> On Tue, Aug 27, 2019 at 12:39 PM Michael Miklavcic <
>> michael.miklav...@gmail.com> wrote:
>>
>>> Adding the dev list back into the thread (a reply-all was missed).
>>>
>>> On Tue, Aug 27, 2019 at 10:49 AM James Sirota 
>>> wrote:
>>>
>>>> I agree with Simon.  HDP 2.x platform is rapidly approaching EOL and
>>>> everyone will likely need to migrate by end of year.  Doing this platform
>>>> upgrade sooner will give everyone visibility into what Metron on HDP 3.x
>>>> looks like so they have time to plan and upgrade at their own pace.
>>>> Feature-wise, the Metron application itself will be unchanged.  It is
>>>> merely the platform underneath that is changing.  HDP itself can be
>>>> upgraded per instructions here:
>>>> https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/release-notes/content/upgrading_parent.html
>>>>
>>>> Once you back up your metron configs, the same configs that worked on
>>>> the previous version will continue to work on the version running on HDP
>>>> 3.x.  If there is any discrepancy between the two or additional settings
>>>> will be required, those will be documented in the release notes.  From the
>>>> Metron perspective, this upgrade would be no different than simply
>>>> upgrading to the new Metron version.
>>>>
>>>> James
>>>>
>>>>
>>>> 27.08.2019, 07:09, "Simon Elliston Ball" :
>>>>
>>>> Something worth noting here is that HDP 2.6.5 is quite old and
>>>> approaching EoL rapidly, so the issue of upgrade is urgent. I am aware of a
>>>> large number of users who require this upgrade ASAP, and in fact am aware
>>>> of zero users who wish to remain on HDP 2.
>>>>
>>>> Perhaps those users who want to stay on the old platform can stick
>>>> their hands up and raise concerns, but this move will likely have to happen
>>>> very soon.
>>>>
>>>> Simon
>>>>
>>>> On Tue, 27 Aug 2019 at 15:04, Otto Fowler 
>>>> wrote:
>>>>
>>>> Although we had the discussion, and some great ideas where passed
>>>> around, I do not believe we came to some kind of consensus on what 1.0
>>>> should look like. So that discussion would have to be picked up again so
>>>> that we could know where we are at, and make it an actual thing if we were
>>>> going to make this a 1.0 release.
>>>>
>>>> I believe that the issues raised in that discussion gating 1.0 are
>>>> still largely applicable, including upgrade.
>>>>
>>>> Right now we have *ZERO* HDP 3.1 users. We will go from that to *only*
>>>> supporting 3.1 work and releases. So every user and deployment we currently
>>>> have will feel real pain, have to slay real dragons to move forward with
>>>> metron.
>>>>

Re: [DISCUSS] HDP 3.1 Upgrade and release strategy

2019-08-27 Thread Simon Elliston Ball
You can do this with zk_load_configs and Ambari’s blueprint api, so we
kinda already do.
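
If anyone wants a more automated backup than driving zk_load_configs by hand,
here's a rough, untested sketch that walks the config znodes with Curator and
dumps them to local files (the /metron/topology root and the output layout are
assumptions for illustration, not the behaviour of any existing Metron script):

// Sketch only: recursively dump every znode under an assumed Metron config root
// to local files, as a poor man's backup before an upgrade.
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZkConfigBackupSketch {

  // Assumed default config root; adjust to your deployment.
  private static final String METRON_ROOT = "/metron/topology";

  public static void main(String[] args) throws Exception {
    String zkQuorum = args.length > 0 ? args[0] : "localhost:2181";
    Path outputDir = Paths.get(args.length > 1 ? args[1] : "metron-config-backup");
    Files.createDirectories(outputDir);

    CuratorFramework client =
        CuratorFrameworkFactory.newClient(zkQuorum, new ExponentialBackoffRetry(1000, 3));
    try {
      client.start();
      dump(client, METRON_ROOT, outputDir);
    } finally {
      client.close();
    }
  }

  private static void dump(CuratorFramework client, String zkPath, Path outputDir)
      throws Exception {
    byte[] data = client.getData().forPath(zkPath);
    if (data != null && data.length > 0) {
      // Flatten the znode path into a file name, e.g. metron_topology_global.
      Files.write(outputDir.resolve(zkPath.substring(1).replace('/', '_')), data);
    }
    for (String child : client.getChildren().forPath(zkPath)) {
      dump(client, zkPath + "/" + child, outputDir);
    }
  }
}

Pair that with an export of the Ambari blueprint (a GET of
/api/v1/clusters/<name>?format=blueprint) and you have both halves of the backup.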

Simon

On Tue, 27 Aug 2019 at 20:19, Otto Fowler  wrote:

> Maybe we need some automated method to backup configurations and restore
> them.
>
>
>
>
> On August 27, 2019 at 14:46:58, Michael Miklavcic (
> michael.miklav...@gmail.com) wrote:
>
> > Once you back up your metron configs, the same configs that worked on
> the previous version will continue to work on the version running on HDP
> 3.x.  If there is any discrepancy between the two or additional settings
> will be required, those will be documented in the release notes.  From the
> Metron perspective, this upgrade would be no different than simply
> upgrading to the new Metron version.
>
> This upgrade cannot be performed the same way we've done it in the past. A
> number of platform upgrades, including OS, are required:
>
>1. Requires the OS to be updated on all nodes because there are no
>Centos6 RPMs provided in HDP 3.1. Must bump to Centos7.
>2. The final new HBase code will not run on HDP 2.6
>3. The MPack changes for new Ambari are not backwards compatible
>4. YARN and HDFS/MR are also at risk to be backwards incompatible
>
>
> On Tue, Aug 27, 2019 at 12:39 PM Michael Miklavcic <
> michael.miklav...@gmail.com> wrote:
>
>> Adding the dev list back into the thread (a reply-all was missed).
>>
>> On Tue, Aug 27, 2019 at 10:49 AM James Sirota  wrote:
>>
>>> I agree with Simon.  HDP 2.x platform is rapidly approaching EOL and
>>> everyone will likely need to migrate by end of year.  Doing this platform
>>> upgrade sooner will give everyone visibility into what Metron on HDP 3.x
>>> looks like so they have time to plan and upgrade at their own pace.
>>> Feature-wise, the Metron application itself will be unchanged.  It is
>>> merely the platform underneath that is changing.  HDP itself can be
>>> upgraded per instructions here:
>>> https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/release-notes/content/upgrading_parent.html
>>>
>>> Once you back up your metron configs, the same configs that worked on
>>> the previous version will continue to work on the version running on HDP
>>> 3.x.  If there is any discrepancy between the two or additional settings
>>> will be required, those will be documented in the release notes.  From the
>>> Metron perspective, this upgrade would be no different than simply
>>> upgrading to the new Metron version.
>>>
>>> James
>>>
>>>
>>> 27.08.2019, 07:09, "Simon Elliston Ball" :
>>>
>>> Something worth noting here is that HDP 2.6.5 is quite old and
>>> approaching EoL rapidly, so the issue of upgrade is urgent. I am aware of a
>>> large number of users who require this upgrade ASAP, and in fact am aware
>>> of zero users who wish to remain on HDP 2.
>>>
>>> Perhaps those users who want to stay on the old platform can stick their
>>> hands up and raise concerns, but this move will likely have to happen very
>>> soon.
>>>
>>> Simon
>>>
>>> On Tue, 27 Aug 2019 at 15:04, Otto Fowler 
>>> wrote:
>>>
>>> Although we had the discussion, and some great ideas where passed
>>> around, I do not believe we came to some kind of consensus on what 1.0
>>> should look like. So that discussion would have to be picked up again so
>>> that we could know where we are at, and make it an actual thing if we were
>>> going to make this a 1.0 release.
>>>
>>> I believe that the issues raised in that discussion gating 1.0 are still
>>> largely applicable, including upgrade.
>>>
>>> Right now we have *ZERO* HDP 3.1 users. We will go from that to *only*
>>> supporting 3.1 work and releases. So every user and deployment we currently
>>> have will feel real pain, have to slay real dragons to move forward with
>>> metron.
>>>
>>> With regards to support for older versions, the “backward capability”
>>> that has been mentioned, I would not say that I have any specific plan for
>>> that in mind. What I would say rather, is that I believe that we must be
>>> explicit, setting expectations correctly and clearly with regards to our
>>> intent while demonstrating that we have thought through the situation. That
>>> discussion has not happened, at least I do not believe that the prior dev
>>> thread really handles it in context.
>>>
>>> Depending on the upgrade situation for going to 3.1, it may b

Re: Good first issues to get started with?

2019-05-29 Thread Simon Elliston Ball
Welcome to Metron Jim!

Something we’ve always struggled to get is good sample data to build the
tests that lead to new parsers and use cases. Those areas are good places
to start for sure, but there is also a slew of useful things that could be
added as Stellar functions, particularly common things like date
processing and some of the network-related function libraries. That’s also
a good place to dig in and start extending some of the code.
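
To give a flavour of what a new Stellar function involves, here is a minimal,
illustrative sketch following the BaseStellarFunction pattern; the
DATE_TO_ISO8601 name and behaviour are made up for the example, not an agreed
addition:

// Illustrative sketch of a new Stellar function; name and behaviour are examples only.
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.util.List;

import org.apache.metron.stellar.dsl.BaseStellarFunction;
import org.apache.metron.stellar.dsl.Stellar;

@Stellar(
    namespace = "DATE",
    name = "TO_ISO8601",   // hypothetical function for the example
    description = "Formats an epoch-millis timestamp as an ISO-8601 string.",
    params = {"epochMillis - timestamp in milliseconds since the epoch"},
    returns = "The ISO-8601 formatted timestamp, or null if the input is null."
)
public class DateToIso8601 extends BaseStellarFunction {

  @Override
  public Object apply(List<Object> args) {
    Object arg = args.get(0);
    if (arg == null) {
      return null;
    }
    // Instants render in UTC, so output is stable regardless of worker timezone.
    return DateTimeFormatter.ISO_INSTANT.format(Instant.ofEpochMilli(((Number) arg).longValue()));
  }
}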

I guess it really depends on what you’re interested in. We’re also very
open to new issues and ideas as a community, so if there is something you
have in mind, don’t hesitate to start a discuss thread here, or raise a new
JIRA.

Welcome again, and looking forward to your contributions.

Btw: if you’re not already on it, it might be worth getting set up on the ASF
Slack channel for Metron, for more dev chat and help, but the important
stuff tends to remain on this list.

Simon

On Wed, 29 May 2019 at 22:39, Jim Spring  wrote:

> Hi all,
>
> I was curious if there are some recommended issues that one might consider
> as good first contributions to Metron?  I’ve combed through the issues list
> a bit and things that I was interested in already had some work going on.
>
> Thanks
> -jim spring
>
-- 
--
simon elliston ball
@sireb


Re: Build Failed for 0.7.2

2019-05-22 Thread Simon Elliston Ball
See https://github.com/apache/metron/pull/1419 which fixes this issue and
will likely make it to master pretty soon.

FYI: probably only necessary to post things like this to one list, not to
add them to random JIRAs.

Simon

On Wed, 22 May 2019 at 06:41, Farrukh Naveed Anjum 
wrote:

> Requires: /bin/bash
> Checking for unpackaged file(s): /usr/lib/rpm/check-files
> /root/BUILDROOT/metron-0.7.2-root
> error: Installed (but unpackaged) file(s) found:
>/usr/metron/0.7.2/config/zookeeper/parsers/leef.json
> Macro %_prerelease has empty body
> Macro %_prerelease has empty body
> Installed (but unpackaged) file(s) found:
>/usr/metron/0.7.2/config/zookeeper/parsers/leef.json
>
>
> RPM build errors:
> RPM build errors encountered
> [ERROR] Command execution failed.
> org.apache.commons.exec.ExecuteException: Process exited with an error: 1
> (Exit value: 1)
> at
>
> org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404)
> at
> org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166)
> at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:764)
> at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:711)
> at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:289)
> at
>
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
> at
>
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
> at
>
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
> at
>
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
> at
>
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
> at
>
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
> at
>
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
> at
>
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
> at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
> at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
> at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
> at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
>
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
> at
>
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
>
>
> --
> *Best Regards*
> Farrukh Naveed Anjum
> *M:* +92 321 5083954 (WhatsApp Enabled)
> *W:* https://www.farrukh.cc/
>


-- 
--
simon elliston ball
@sireb


Re: [DISCUSS] JsonMapParser original string functionality

2019-05-10 Thread Simon Elliston Ball
My understanding is that chaining preserves (correctly to my mind) the original 
original string.

In other words: unless the message strategy is raw message, the original string 
is just passed through. The original string therefore comes from outside Metron, 
and is preserved throughout Metron processing, allowing for recreation of the 
original form for forensic and evidentiary purposes.

Simon

> On 11 May 2019, at 00:10, Otto Fowler  wrote:
> 
> What about parser chaining?   Should the original string be from kafka, or
> the last parsed?
> 
> 
> On May 10, 2019 at 19:03:39, Simon Elliston Ball (
> si...@simonellistonball.com) wrote:
> 
> The only scenario I can think of where a parser might treat original string
> differently, or even need to know about it would be different encoding
> locales. For example, if the string were to be encoded in a locale specific
> to the device and choose the encoding based on metadata or parsed content,
> then that could merit pushing it down. The other edge might be when you
> have binary data that does not go down to an original string well (eg a
> netflow parser).
> 
> That said, that’s a highly unlikely edge case that could be handled by
> workarounds.
> 
> I’m a definitely +1 on Nick’s idea of pulling original string up to the
> runner. Right now we’re pretty inconsistent in how it’s done, so that would
> help.
> 
> Simon
> 
> Sent from my iPhone
> 
> On 10 May 2019, at 23:10, Nick Allen  wrote:
> 
>>> I suppose we could always allow this to be overridden, also.
>> 
>> I like an on/off switch for the "original string" functionality. If on,
>> you get the original string in pristine condition. If off, no original
>> string is appended for those who care more about storage space.
>> 
>> I can't think of a reason where one kind of parser would have a different
>> original string mechanism than the others. If something like that does
>> come up, the parser can create its own original string by just naming it
>> something different and then turning "off" the switch that you described.
>> 
>> 
>> 
>> On Fri, May 10, 2019 at 5:53 PM Michael Miklavcic <
>> michael.miklav...@gmail.com> wrote:
>> 
>>> I think that's an excellent idea. Can anyone think of a situation where
> we
>>> wouldn't want to add this the same way for all parsers? I suppose we
> could
>>> always allow this to be overridden, also.
>>> 
>>>> On Fri, May 10, 2019 at 3:43 PM Nick Allen  wrote:
>>>> 
>>>> I think maintaining the integrity of the original data makes a lot of
>>> sense
>>>> for any parser. And ideally the original string should be what came out
>>> of
>>>> Kafka with only the minimally necessary processing.
>>>> 
>>>> With that in mind, we could solve this one level up. Instead of relying
>>> on
>>>> each parser to do this right, we could have the ParserRunner and
>>>> specifically the ParserRunnerImpl [1] handle this round-abouts here
>>>> <
>>>> 
>>> 
> https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158
>>>>> 
>>>> [1].
>>>> It has the raw message data and can append the original string to each
>>>> message it gets back from the parsers.
>>>> 
>>>> Just another approach to consider.
>>>> 
>>>> --
>>>> [1]
>>>> 
>>>> 
>>> 
> https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158
>>>> 
>>>> On Fri, May 10, 2019 at 4:11 PM Otto Fowler 
>>>> wrote:
>>>> 
>>>>> +1
>>>>> 
>>>>> 
>>>>> On May 10, 2019 at 13:57:55, Michael Miklavcic (
>>>>> michael.miklav...@gmail.com)
>>>>> wrote:
>>>>> 
>>>>> When adding the capability for parsing messages in the JsonMapParser
>>>> using
>>>>> JSON Path expressions the original behavior for managing original
>>> strings
>>>>> was changed.
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
> https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/js

Re: [DISCUSS] JsonMapParser original string functionality

2019-05-10 Thread Simon Elliston Ball
The only scenario I can think of where a parser might treat the original string 
differently, or even need to know about it, would be different encoding locales. 
For example, if the string were encoded in a locale specific to the device and 
the parser chose the encoding based on metadata or parsed content, then that 
could merit pushing it down. The other edge might be when you have binary data 
that does not map down to an original string well (e.g. a netflow parser).

That said, that’s a highly unlikely edge case that could be handled by 
workarounds. 

I’m definitely +1 on Nick’s idea of pulling the original string up to the runner. 
Right now we’re pretty inconsistent in how it’s done, so that would help.
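
To make that concrete, something along these lines is what I have in mind: a
rough illustrative sketch only, not the actual ParserRunnerImpl code (the field
name and the toggle are assumptions):

// Illustrative sketch: the runner stamps every parsed message with the pristine
// raw bytes from Kafka, so individual parsers no longer have to manage it.
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.json.simple.JSONObject;

public class OriginalStringSketch {

  private static final String ORIGINAL_STRING_FIELD = "original_string";

  @SuppressWarnings("unchecked")
  public static void appendOriginalString(List<JSONObject> parsedMessages,
                                          byte[] rawMessage,
                                          boolean appendOriginal) {
    if (!appendOriginal) {
      return; // the proposed off switch, for those who care more about storage space
    }
    String original = new String(rawMessage, StandardCharsets.UTF_8);
    for (JSONObject message : parsedMessages) {
      // Every sub-message gets a copy of the same raw original string.
      message.put(ORIGINAL_STRING_FIELD, original);
    }
  }
}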

Simon 

Sent from my iPhone

On 10 May 2019, at 23:10, Nick Allen  wrote:

>> I suppose we could always allow this to be overridden, also.
> 
> I like an on/off switch for the "original string" functionality.  If on,
> you get the original string in pristine condition.  If off, no original
> string is appended for those who care more about storage space.
> 
> I can't think of a reason where one kind of parser would have a different
> original string mechanism than the others.  If something like that does
> come up, the parser can create its own original string by just naming it
> something different and then turning "off" the switch that you described.
> 
> 
> 
> On Fri, May 10, 2019 at 5:53 PM Michael Miklavcic <
> michael.miklav...@gmail.com> wrote:
> 
>> I think that's an excellent idea. Can anyone think of a situation where we
>> wouldn't want to add this the same way for all parsers? I suppose we could
>> always allow this to be overridden, also.
>> 
>>> On Fri, May 10, 2019 at 3:43 PM Nick Allen  wrote:
>>> 
>>> I think maintaining the integrity of the original data makes a lot of
>> sense
>>> for any parser. And ideally the original string should be what came out
>> of
>>> Kafka with only the minimally necessary processing.
>>> 
>>> With that in mind, we could solve this one level up.  Instead of relying
>> on
>>> each parser to do this right, we could have the ParserRunner and
>>> specifically the ParserRunnerImpl [1] handle this round-abouts here
>>> <
>>> 
>> https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158
 
>>> [1].
>>> It has the raw message data and can append the original string to each
>>> message it gets back from the parsers.
>>> 
>>> Just another approach to consider.
>>> 
>>> --
>>> [1]
>>> 
>>> 
>> https://github.com/apache/metron/blob/1b6ef88c79d60022542cda7e9abbea7e720773cc/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/ParserRunnerImpl.java#L149-L158
>>> 
>>> On Fri, May 10, 2019 at 4:11 PM Otto Fowler 
>>> wrote:
>>> 
 +1
 
 
 On May 10, 2019 at 13:57:55, Michael Miklavcic (
 michael.miklav...@gmail.com)
 wrote:
 
 When adding the capability for parsing messages in the JsonMapParser
>>> using
 JSON Path expressions the original behavior for managing original
>> strings
 was changed.
 
 
 
>>> 
>> https://github.com/apache/metron/blob/master/metron-platform/metron-parsing/metron-parsers-common/src/main/java/org/apache/metron/parsers/json/JSONMapParser.java#L192
 
 A couple issues have been reported recently regarding this change:
 
 1. We're losing the actual original string, which is a legal issue for
 data lineage for some customers
 2. Even for the degenerate case with no sub-messages created, the
 original sub-message string is modified because of the
 serialization/deserialization process with Jackson/JsonSimple. The
>> fields
 are reordered bc the content is normalized.
 
 I looked at options for preserving formatting, but am unable to find a
 method that allows you to both parse, then query the original message
>> and
 then also obtain the raw string matches without the normalizing from
 ser/deserialization.
 
 I'd like to propose that we add a configuration option for this parser
>>> that
 allows the user to toggle which approach they'd like to use. My
>> personal
 preference based on feedback I've gotten from multiple customers is
>> that
 the default should be the older approach which takes the raw original
 string. It's arguable that this change in contract is a regression, so
>>> the
 default should be the earlier behavior. Any sub-messages would then
>> have
>>> a
 copy of that raw original string, not just the sub-message original
>>> string.
 Enabling the flag would enable the current sub-message original string
 functionality.
 
 Mike
 
>>> 
>> 


Re: [DISCUSS] Upgrading HBase and Kafka support

2019-03-08 Thread Simon Elliston Ball
The Docker option sounds like a much better and cleaner approach for integration 
testing (closer to the real thing too). My one question would be whether this would 
significantly increase test run time, and whether that would need Travis 
changes? 

Either way, the Docker option sounds best.
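
For what it's worth, a minimal sketch of what the Kafka side might look like
with something like the Testcontainers library (not something we use today; the
image tag and test are just examples):

// Sketch only: a real Kafka broker in Docker for an integration test, in place
// of the in-memory KafkaComponent.
import static org.junit.Assert.assertNotNull;

import org.junit.ClassRule;
import org.junit.Test;
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.utility.DockerImageName;

public class KafkaContainerSketchTest {

  // One broker per test class; Testcontainers starts and stops the container.
  @ClassRule
  public static KafkaContainer kafka =
      new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:5.4.3"));

  @Test
  public void brokerIsReachable() {
    // Producers/consumers in the test would point at this bootstrap address.
    assertNotNull(kafka.getBootstrapServers());
  }
}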

Simon

> On 8 Mar 2019, at 16:38, Michael Miklavcic  
> wrote:
> 
> I'm -1 on #1 unless there's some desperately compelling reason to go that
> route. It would be a regression in our test coverage, and at that point
> it's really just duplicating our unit tests as opposed to checking our
> integration.
> 
> I'm good with 3. Gating factors for a successful implementation would be
> that as a developer I can:
> 
>   1. Run it in my IDE without having to do anything extra (the beauty of
>   the in-mem component is that @BeforeClass spins it up automatically - we
>   should keep doing something along those lines)
>   2. Run it via Maven cli
>   3. Run it in Travis as part of our normal build
> 
> It's probably worth looking at Kafka's testing infrastructure straight from
> the source - https://github.com/apache/kafka/blob/trunk/tests/README.md.
> They leverage Docker containers now for system tests.
> 
> Best,
> Mike
> 
> 
>> On Fri, Mar 8, 2019 at 7:47 AM Ryan Merriman  wrote:
>> 
>> I have been researching the effort involved to upgrade to HDP 3.  Along the
>> way I've found a couple challenging issues that we will need to solve, both
>> involving our integration testing strategy.
>> 
>> The first issue is Kafka.  We are moving from 0.10.0 to 2.0.0 and there
>> have been significant changes to the API.  This creates an issue in the
>> KafkaComponent class, which we use as an in-memory Kafka server in
>> integration tests.  Most of the classes that were previously used have gone
>> away, and to the best of my knowledge, were not supported as public APIs.
>> I also don't see any publicly documented APIs to replace them.
>> 
>> The second issue is HBase.  We are moving from 1.1.2 to 2.0.2 so another
>> significant change.  This creates an issue in the MockHTable class
>> becausethe HTableInterface class has changed to Table, essentially
>> requiring that MockHTable be rewritten to conform to the new interface.
>> It's my opinion that this class is complicated and difficult to maintain as
>> it is anyways.
>> 
>> These 2 issues have the potential to add a significant amount of work to
>> upgrading Metron to HDP 3.  I want to take a step back and review our
>> options before we move forward.  Here are some initial thoughts I had on
>> how to approach this.  For HBase:
>> 
>>   1. Update MockHTable to work with the new HBase API.  We would continue
>>   using a mock server approach for HBase.
>>   2. Research replacing MockHTable with an in-memory HBase server.
>>   3. Replace MockHTable with a Docker container running HBase.
>> 
>> For Kafka:
>> 
>>   1. Replace KafkaComponent with a mock server implementation.
>>   2. Update KafkaComponent to work with the new API.  We would probably
>>   need to leverage some internal Kafka classes.  I do not see a testing
>> API
>>   documented publicly.
>>   3. Replace KafkaComponent with a Docker container running Kafka.
>> 
>> What other options are there?  Whatever we choose I think we should follow
>> a similar approach for both (mock servers, in memory servers, Docker, other
>> options I'm not thinking of).
>> 
>> This will not shock anyone but I would be in favor of Docker containers.
>> They have the advantage of classpath isolation, easy upgrades, and accurate
>> integration testing.  The downside is we will have to adjusts our tests and
>> travis script to incorporate these Docker containers into our build
>> process.  We have discussed this at length in the past and it has generally
>> stalled for various reasons.  Maybe if we move a few services at a time it
>> might be more palatable?  As for the other 2 approaches, I think if either
>> worked well we wouldn't be having this discussion.  Mock servers are hard
>> to maintain and I don't see in memory testing classes documented in
>> javadocs for either service.
>> 
>> Thoughts?
>> 


Re: [DISCUSS] Knox SSO feature branch review and features

2018-11-16 Thread Simon Elliston Ball
It is included, yes, but it is not started out of the box by default. We would
also probably tweak the blueprint to change its bootstrap LDIF file a bit
to have more sensible user names for our defaults, but that's a pretty simple
load. A bit of default blueprint config is what we need there, not anything
'code' related per se.

Simon

On Fri, 16 Nov 2018 at 15:59, Otto Fowler  wrote:

> That does sound good Simon, I think I miss understood that the default
> LDAP was standard with KNOX/ambari and not something we would be doing
> ourselves.
>
>
> On November 16, 2018 at 10:54:48, Simon Elliston Ball (
> si...@simonellistonball.com) wrote:
>
> I think there is a lot to be said for defaulting to Knox on... we also
> that
> way get some 'secure by default' - at least ssl by default. The do-nothing
> I think you're proposing would be around the authentication, right? Knox
> does ship with a demo LDAP server we could have some defaults (kinda like
> we do with the dev spring profile today) which might be a good way of
> achieving a similar effect, and that's configured by default by Ambari.
> Would that meet the need, or do you think we should provide a "yeah I'm
> sure we don't need to authenticate that connection, let them in" identity
> provider for Knox? That way we would have to have them impersonate a known
> user for the REST api to work, but you would get the seemless, no auth
> access.
>
> To be honest, I'm a fan of the first option where we give people a nice
> simple, no external config and just use some sensible default users in the
> demo LDAP instance that Knox owns, and then default to Knox on to give us
> our nice single access point.
>
> Simon
>
>
> On Fri, 16 Nov 2018 at 15:35, Otto Fowler 
> wrote:
>
> > Those are all valid points. I think it is ( was ) worth discussion at
> > lease a little.
> >
> > WRT Knox and defaults:
> >
> > I have in the past used “do-nothing” implementations as default
> > placeholders for functionality
> > that needed extensive per customer configuration, or configuration
> outside
> > the responsibility of the product.
> >
> > Would it be simpler if we ALWAYS used Knox, but defaulted to a KNOX
> > configuration with “do-nothing” providers
> > for auth etc. The users would then configure the providers ( based on
> the
> > provider(s) we support ) at a later time.
> >
> > We could write the providers, as everyone has pointed out how extensible
> > KNOX is ;)
> >
> > Would that be a valid way to simplify the issue?
> > What would the fallout of that be?
> >
> >
> >
> > On November 16, 2018 at 09:20:53, Ryan Merriman (merrim...@gmail.com)
> > wrote:
> >
> > Most of the research I've done around adding Metron as a Knox service is
> > based on how other projects do it. The documentation is not easy to
> follow
> > so I learned by reading other service definition files. The assumption
> > that we are doing things drastically different is false.
> >
> > I completely agree with Simon. Why would we want to be dependent on
> Knox's
> > release cycle? How does that benefit us? It may reduce some operational
> > complexity but it makes our install process more complicated because we
> > require a certain version of Knox (who knows when that gets released).
> > What do we do in the meantime? I would also like to point out that
> Metron
> > is inherently different than other Hadoop stack services. We are a
> > full-blown application with multiple UIs so the way we expose services
> > through Knox may be a little different.
> >
> > I think this will be easier to discuss when we can all see what is
> actually
> > involved. I am working on a PR that adds Metron as a Knox service and
> will
> > have that out soon. That should give everyone more context.
> >
> > On Fri, Nov 16, 2018 at 7:39 AM Simon Elliston Ball <
> > si...@simonellistonball.com> wrote:
> >
> > > You could say the same thing about Ambari, but that provides mpacks.
> Knox
> > > is also designed to be extensible through Knox service stacks since
> they
> > > realized they can’t support every project. The challenge is that the
> docs
> > > have not made it as easy as they could for the ecosystem to plug into
> > Knox,
> > > which has led to some confusion around this being a recommended
> pattern
> > > (which it is).
> > >
> > > The danger of trying to get your bits into Knox is that that ties you
> to
> > > their release cycle (a problem Ambari

Re: Running MAAS in batch

2018-11-16 Thread Simon Elliston Ball
Your model is really just a function that you wrap in a REST service in
order to deploy it in MaaS. In the case of something like Spark, you would
just wrap it in a UDF instead of wrapping it in a REST service; at that
point, applying it in batch is just a case of a simple dataframe query.
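
As a rough illustration (in Java, with made-up model logic, paths and field
names), scoring already-indexed data that way might look something like this:

// Sketch only: wrap the same scoring function a MaaS REST endpoint would expose
// as a Spark UDF, and apply it to messages already indexed to HDFS.
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class BatchScoringSketch {

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("batch-scoring").getOrCreate();

    // The "model" is just a function; this one is a trivial stand-in for the real thing.
    UDF1<String, Double> score = domain -> domain != null && domain.length() > 20 ? 0.9 : 0.1;
    spark.udf().register("score", score, DataTypes.DoubleType);

    // Input path and field names are assumptions about how the data was indexed.
    Dataset<Row> messages = spark.read().json("hdfs:///apps/metron/indexing/indexed/dns");
    Dataset<Row> scored = messages.withColumn("model_score", callUDF("score", col("domain")));

    scored.write().json("hdfs:///tmp/dns_scored");
    spark.stop();
  }
}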

On Fri, 16 Nov 2018 at 15:51, deepak kumar  wrote:

> Simon,
> Can you elaborate more on this:
> '
>
> *wrapped up in a batch engine like Spark to take advantage of more
> efficient "mass" scoring.*
> '
> How the mass model wrapped in spark  can take advantage of mass scoring?
>
> Thanks
> Deepak
>
> On Fri, Nov 16, 2018 at 9:15 PM Otto Fowler 
> wrote:
>
>> That may be the best MAAS explanation I’ve seen Simon.
>>
>>
>> On November 16, 2018 at 10:28:57, Simon Elliston Ball (
>> si...@simonellistonball.com) wrote:
>>
>> MaaS is designed to wrap model inference (scoring) an event at a time,
>> via a REST api. As such, running it batch doesn't make a lot of sense,
>> since each message would be processed individually. Most of the models
>> you're likely to run in MaaS however, are also likely to be easily
>> batchable, and are probably better wrapped up in a batch engine like Spark
>> to take advantage of more efficient "mass" scoring.
>>
>> Simon
>>
>> On Fri, 16 Nov 2018 at 15:18, deepak kumar  wrote:
>>
>>> Hi All
>>> Right now MAAS supports running the model against real time events being
>>> streamed into metron platform.
>>> Is there any way to run the models deployed in MAAS on the batch events
>>> / data that have been indexed into hdfs ?
>>> If anyone have tried this batch model , please share some insights.
>>> Thanks
>>> Deepak.
>>>
>>>
>>
>> --
>> --
>> simon elliston ball
>> @sireb
>>
>>

-- 
--
simon elliston ball
@sireb


Re: [DISCUSS] Knox SSO feature branch review and features

2018-11-16 Thread Simon Elliston Ball
I think there is a lot to be said for defaulting to Knox on... that way we
also get some 'secure by default' - at least SSL by default. The do-nothing
I think you're proposing would be around the authentication, right? Knox
does ship with a demo LDAP server, so we could have some defaults (kinda like
we do with the dev Spring profile today), which might be a good way of
achieving a similar effect, and that's configured by default by Ambari.
Would that meet the need, or do you think we should provide a "yeah I'm
sure we don't need to authenticate that connection, let them in" identity
provider for Knox? That way we would have to have them impersonate a known
user for the REST API to work, but you would get the seamless, no-auth
access.

To be honest, I'm a fan of the first option, where we give people something
nice and simple with no external config, just use some sensible default users in the
demo LDAP instance that Knox owns, and then default to Knox on to give us
our nice single access point.

Simon


On Fri, 16 Nov 2018 at 15:35, Otto Fowler  wrote:

> Those are all valid points.  I think it is ( was ) worth discussion at
> lease a little.
>
> WRT Knox and defaults:
>
> I have in the past used “do-nothing” implementations as default
> placeholders for functionality
> that needed extensive per customer configuration, or configuration outside
> the responsibility of the product.
>
> Would it be simpler if we ALWAYS used Knox, but defaulted to a KNOX
> configuration with “do-nothing” providers
> for auth etc.  The users would then configure the providers ( based on the
> provider(s) we support ) at a later time.
>
> We could write the providers, as everyone has pointed out how extensible
> KNOX is ;)
>
> Would that be a valid way to simplify the issue?
> What would the fallout of that be?
>
>
>
> On November 16, 2018 at 09:20:53, Ryan Merriman (merrim...@gmail.com)
> wrote:
>
> Most of the research I've done around adding Metron as a Knox service is
> based on how other projects do it. The documentation is not easy to follow
> so I learned by reading other service definition files. The assumption
> that we are doing things drastically different is false.
>
> I completely agree with Simon. Why would we want to be dependent on Knox's
> release cycle? How does that benefit us? It may reduce some operational
> complexity but it makes our install process more complicated because we
> require a certain version of Knox (who knows when that gets released).
> What do we do in the meantime? I would also like to point out that Metron
> is inherently different than other Hadoop stack services. We are a
> full-blown application with multiple UIs so the way we expose services
> through Knox may be a little different.
>
> I think this will be easier to discuss when we can all see what is actually
> involved. I am working on a PR that adds Metron as a Knox service and will
> have that out soon. That should give everyone more context.
>
> On Fri, Nov 16, 2018 at 7:39 AM Simon Elliston Ball <
> si...@simonellistonball.com> wrote:
>
> > You could say the same thing about Ambari, but that provides mpacks. Knox
> > is also designed to be extensible through Knox service stacks since they
> > realized they can’t support every project. The challenge is that the docs
> > have not made it as easy as they could for the ecosystem to plug into
> Knox,
> > which has led to some confusion around this being a recommended pattern
> > (which it is).
> >
> > The danger of trying to get your bits into Knox is that that ties you to
> > their release cycle (a problem Ambari has felt hard, hence their
> community
> > is moving away from the everything inside model towards everything is an
> > mpack).
> >
> > A number of implementations of Knox also use the approach Ryan is
> > suggesting for their own organization specific end points, so it’s not
> like
> > this is an uncommon, or anti-pattern, it’s more the way Knox is designed
> to
> > work in the future, than the legacy of it only being able to handle a
> > subset of Hadoop projects.
> >
> > Knox remains optional In our scenario, but we keep control over the
> > shipping of things like rewrite rules, which allows Metron to control its
> > release destiny should things like url patterns in the ui need to change
> > (with a new release of angular / new module / new rest endpoint etc)
> > instead of making a Metron release dependent on a Knox release.
> >
> > Imagine how we would have done with the Ambari side if we’d had to wait
> > for them to release every time we needed to change something in the
> > mpack... we don’t want that happ

Re: Running MAAS in batch

2018-11-16 Thread Simon Elliston Ball
MaaS is designed to wrap model inference (scoring) an event at a time, via
a REST API. As such, running it in batch doesn't make a lot of sense, since
each message would be processed individually. Most of the models you're
likely to run in MaaS, however, are also likely to be easily batchable, and
are probably better wrapped up in a batch engine like Spark to take
advantage of more efficient "mass" scoring.

Simon

On Fri, 16 Nov 2018 at 15:18, deepak kumar  wrote:

> Hi All
> Right now MAAS supports running the model against real time events being
> streamed into metron platform.
> Is there any way to run the models deployed in MAAS on the batch events /
> data that have been indexed into hdfs ?
> If anyone have tried this batch model , please share some insights.
> Thanks
> Deepak.
>
>

-- 
--
simon elliston ball
@sireb


Re: [DISCUSS] Knox SSO feature branch review and features

2018-11-16 Thread Simon Elliston Ball
>>>>>>>> 
>>>>>>>>> However there will be some backwards compatibility issues we
>>> would
>>>>>> need
>>>>>>> to
>>>>>>>>> think through. I think it will be a challenge exposing the UIs
>>>>>> through
>>>>>>>>> both the Knox url and legacy urls at the same time. I'm also
>> not
>>>>>> clear
>>>>>>> on
>>>>>>>>> how one would use Knox with REST set to legacy JDBC-based
>>>>>>> authentication.
>>>>>>>>> As far as I know Knox does not support JDBC so there would be
>> a
>>>>>> mismatch
>>>>>>>>> between Knox and REST. Knox does have the ability to pass
>> along
>>>> basic
>>>>>>>>> authentication headers so LDAP in REST would work. We could
>>>> initially
>>>>>>>>> make
>>>>>>>>> Knox an optional feature that requires setup with the help of
>>> some
>>>>>>>>> documentation (like Kerberos) while keeping the system the way
>>> it
>>>> is
>>>>>> now
>>>>>>>>> by
>>>>>>>>> default. I imagine we'll deprecate JDBC-based authentication
>> at
>>>> some
>>>>>>>>> point
>>>>>>>>> so that may be a good time to switch.
>>>>>>>>> 
>>>>>>>>> What do people think about this approach? Concerns? Are there
>>> any
>>>>>> huge
>>>>>>>>> holes in this I'm not thinking about?
>>>>>>>>> 
>>>>>>>>> I want to highlight that the work Simon did in his feature
>>> branch
>>>> was
>>>>>>>>> crucial to better understanding this. I am pretty sure we'll
>> end
>>>> up
>>>>>>>>> reusing a lot code from that branch.
>>>>>>>>> 
>>>>>>>>> On Thu, Sep 27, 2018 at 6:30 PM Michael Miklavcic <
>>>>>>>>> michael.miklav...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Apparently, I hit send on my last email before finishing my
>>>> synopsis
>>>>>>>>> (per
>>>>>>>>>> @Otto's Q in Slack). To summarize, based on my current
>>>>>> understanding I
>>>>>>>>>> believe that each of the feature branch changes I've outline
>>>> above
>>>>>> are
>>>>>>>>>> units of work that are related, yet should be executed on
>>>>>>> independently.
>>>>>>>>>> Knox SSO in its own feature branch. Migrating technologies
>>> like
>>>>>> NodeJs
>>>>>>>>> or
>>>>>>>>>> migrating the auth DB to LDAP seem like they belong in their
>>> own
>>>>>>>>> separate
>>>>>>>>>> PR's or feature branches.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Mike
>>>>>>>>>> 
>>>>>>>>>> On Thu, Sep 27, 2018 at 4:08 PM Casey Stella <
>>>> ceste...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> I'm coming in late to the game here, but for my mind a
>>> feature
>>>>>>> branch
>>>>>>>>>>> should involve the minimum architectural change to
>>> accomplish
>>>> a
>>>>>>> given
>>>>>>>>>>> feature.
>>>>>>>>>>> The feature in question is SSO integration. It seems to me
>>>> that
>>>>>> the
>>>>>>>>>>> operative question is can we do the feature without making
>>> the
>>>>>> OTHER
>>>>>>>>>>> architectural change
>>>>>>>>>>> (e.g. migrating from expressjs to spring boot + zuul). I
>>> would
>>>>>>> argue
>>>>>>>>>> that
>>>>>

Re: [DISCUSS] Deprecating MySQL

2018-11-13 Thread Simon Elliston Ball
We went over the HBase user settings thing in extensive discussions at the 
time. Storing an arbitrary blob of JSON which is only ever accessed by a single 
key (username) was concluded to be a key-value problem, not a relational 
problem. HBase was concluded to be massive overkill as a key-value store in 
this use case, unless it was already there and ready to go, which in the case of 
Metron it is, for enrichments, threat intel and profiles. Hence it ended up in 
HBase, as a conveniently present data store that matched the usage patterns. 
See 
https://lists.apache.org/thread.html/145b3b8ffd8c3aa5bbfc3b93f550fc67e71737819b19bc525a2f2ce2@%3Cdev.metron.apache.org%3E
 and METRON-1337 for discussion.
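
To illustrate the access pattern (the table, column family and qualifier names
below are made up for the example, not the actual Metron schema):

// Sketch only: the user settings blob fetched as a single value keyed by username.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class UserSettingsLookupSketch {

  public static String fetchSettings(String username) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("user_settings"))) {
      // One row per user; the whole settings blob lives in a single cell.
      Result result = table.get(new Get(Bytes.toBytes(username)));
      byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("settings"));
      return value == null ? null : Bytes.toString(value);
    }
  }
}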

Simon

> On 13 Nov 2018, at 18:50, Michael Miklavcic  
> wrote:
> 
> Thanks for the write up Simon. I don't think I see any major problems with
> deprecating the general sql store. However, just to clarify, Metron does
> NOT require any specific backing store. It's 100% JPA, which means anything
> that can be configured with the Spring properties we expose. I think the
> most opinionated thing we do there is ship an extremely basic table
> creation script for h2 and mysql as a simple example for schema. As an
> example, we simply use H2 in full dev, which is entirely in-memory and spun
> up automatically from configuration. The recent work by Justin Leet removes
> the need to use a SQL store at all if you choose LDAP -
> https://github.com/apache/metron/pull/1246. I'll let him comment further on
> this, but I think there is one small change that could be made via a toggle
> in Ambari that would even eliminate the user from seeing JDBC settings
> altogether during install if they choose LDAP. Again, I think I'm on board
> with deprecating the SQL backing store as I pointed this out on the Knox
> thread as well, but I just wanted to make sure everyone has an accurate
> picture of the current state.
> 
> I had to double check on the HBase config you mentioned, but it does appear
> that we use it for the Alerts UI. I don't think I realized we were storing
> config there instead of the Zookeeper store we use for other system
> configuration. Ironically enough, I think that it probably makes more sense
> than the current auth info to store in a traditional sql store, however
> it's in HBase currently so it's a non-issue wrt SQL/JPA either way, as you
> pointed out.
> 
> Whatever architectural changes we choose to add here, I think we need to
> emphasize pluggability regardless of the specific implementation. That is
> to say, I don't think we should make a hard requirement on Knox, in order
> to get LDAP, in order to deprecate an optional general SQL backing store.
> It makes sensible defaults if that's where we want to go, which is the way
> we have done things for most of the successful features I've seen in
> Metron. Provide all the options should a user desire them, but abstract
> away the complexity in the UIs.
> 
> Best,
> Mike
> 
> 
> On Tue, Nov 13, 2018 at 5:42 AM Simon Elliston Ball <
> si...@simonellistonball.com> wrote:
> 
>> I've been coming across a number of organisations who are blocked from
>> installing Metron by the MySQL auth database.
>> 
>> The main problems with our MySQL default are:
>> 
>> * What? Un-encrypted passwords?!? - which frankly is embarrassing in a
>> security platform and usually where the deployment conversation ends for me
>> * MySQL install varies from platform to platform
>> * An additional database to manage, backup, etc. so now I have to talk to a
>> DBA
>> * Harder to maintain HA for this without externalising and fighting against
>> our defaults
>> * There are a lot of dependencies for just storing a table of users
>> (Eclipse Link, JPA, the MySQL server and the need to get clients installed
>> and pushed separately because of licence requirements)
>> * Organisations don't want to have to manage yet another user source of
>> truth since this leads to operational complexity.
>> 
>> In short, managing our own user store makes very little sense to operations
>> users.
>> 
>> Some of these (licence and inconsistency for example) could be solved by
>> changing our default DB to something like Postgres, which has easier terms
>> to deal with. We could start encrypting passwords, but there would still be
>> a lot of dependencies to store users, which is a problem much better solved
>> by LDAP.
>> 
>> Now that we have the option to use LDAP for user storage, I would suggest
>> that we deprecate and ultimately remove all the RDBMS and ORM dependencies,
>> which significantly reduces our dependencies and s

[DISCUSS] Deprecating MySQL

2018-11-13 Thread Simon Elliston Ball
I've been coming across a number of organisations who are blocked from
installing Metron by the MySQL auth database.

The main problems with our MySQL default are:

* What? Un-encrypted passwords?!? - which frankly is embarrassing in a
security platform and usually where the deployment conversation ends for me
* MySQL install varies from platform to platform
* An additional database to manage, backup, etc. so now I have to talk to a
DBA
* Harder to maintain HA for this without externalising and fighting against
our defaults
* There are a lot of dependencies for just storing a table of users
(Eclipse Link, JPA, the MySQL server and the need to get clients installed
and pushed separately because of licence requirements)
* Organisations don't want to have to manage yet another user source of
truth since this leads to operational complexity.

In short, managing our own user store makes very little sense to operations
users.

Some of these (licence and inconsistency for example) could be solved by
changing our default DB to something like Postgres, which has easier terms
to deal with. We could start encrypting passwords, but there would still be
a lot of dependencies to store users, which is a problem much better solved
by LDAP.

Now that we have the option to use LDAP for user storage, I would suggest
that we deprecate and ultimately remove all the RDBMS and ORM dependencies,
which significantly reduces our dependencies and simplifies deployment and
long term management of Metron clusters.

So I propose that we deprecate the RDBMS use in the next Apache release,
and then strip out the RDBMS stuff in the following release. We would continue to
use LDAP for users and HBase for non-LDAPy user settings (as we currently
do). We should also provide a small demo LDAP for full dev. Since we are
looking at adding Knox into the stack, that project provides a convenient
mini-LDAP demo service which would do this job without the need to add
additional components.

Thoughts? Anyone relying on MySQL for users (if so, are you aware that your
passwords are all plaintext? How do you currently handle the shortcomings
and admin overhead?) Any objections?

Simon


Re: [DISCUSS] Knox SSO feature branch review and features

2018-11-12 Thread Simon Elliston Ball
What you're looking for is an OUT rewrite rule, and a filter rule on
content-type. It's not spectacularly well documented, but
https://knox.apache.org/books/knox-1-0-0/dev-guide.html#Rewrite+Provider
and specifically
https://knox.apache.org/books/knox-1-0-0/dev-guide.html#Rewrite+Steps is a
starting point. There are some reasonable examples in Knox itself for the
webhdfs service, which uses this mechanism:
https://github.com/apache/knox/blob/master/gateway-service-webhdfs/src/main/resources/org/apache/knox/gateway/hdfs/WebHdfsDeploymentContributor/rewrite.xml


Hope that helps. It's not well doc-ed sadly, and not massively flexible,
but should work. I suspect from my previous experiments with this you may
also need to build this file as part of the UI builds, so it is aware of
the bundle names generated, because the Knox matching rules don't have
proper back reference capabilities.

I did a POC of this some time back in March, which I might be able to dig out
if it would help.

Simon

On Mon, 12 Nov 2018 at 14:59, Ryan Merriman  wrote:

> I'm just coming up to speed on Knox so maybe rewriting assets links are
> trivial.  If anyone has a good example of how to do that or can point to
> some documentation, please share.
>
> On Mon, Nov 12, 2018 at 8:54 AM Simon Elliston Ball <
> si...@simonellistonball.com> wrote:
>
> > Doing the Knox proxy work first certainly does make a lot of sense vs the
> > SSO first approach, so I'm in favour of this. It bypasses all the
> anti-CORS
> > proxying stuff the other solution needed by being on the same URL space.
> >
> > Is there are reason we're not re-writing the asset link URLs in Knox? We
> > should have a reverse content rewrite rule to avoid that problem and make
> > it entirely transparent whether there is Knox or not. We shouldn't be
> > changing anything about the UI services themselves. If the rewrite
> service
> > is complete, there is no change to base ref in the UI code, Knox would
> > effectively apply it by content filtering. Note also that the gateway URL
> > is configurable and likely to vary from Knox to Knox, so baking it into
> the
> > ng build will break non-full-dev builds. (e.g. gateway/default could well
> > be gateway/xyz).
> >
> > I would also like to discuss removing the JDBC auth, because it's a set
> of
> > plaintext passwords in a mysql DB... it introduces a problematic
> dependency
> > (mysql) a ton of java dependencies we could cut out (JPA, eclipselink)
> and
> > opens up a massive security hole. I personally know of several
> > organisations who are blocked from using Metron by the presence of the
> JDBC
> > authentication method in its current form.
> >
> > Simon
> >
> > On Mon, 12 Nov 2018 at 14:36, Ryan Merriman  wrote:
> >
> > > Let me clarify on exposing both legacy and Knox URLs at the same time.
> > The
> > > base urls will look something like this:
> > >
> > > Legacy REST - http://node1:8082/api/v1
> > > Legacy Alerts UI - http://node1:4201:/alerts-list
> > >
> > > Knox REST - https://node1:8443/gateway/default/metron/api/v1
> > > Knox Alerts UI -
> > > https://node1:8443/gateway/default/metron-alerts-ui/alerts-list
> > >
> > > If Knox were turned on and the alerts UI deployed as is, it would not
> > > work.  This is because static assets are referenced with
> > > http://node1:4201/assets/some-asset.js which does not include the
> > correct
> > > context path to the alerts UI in knox.  To make it work, you have to
> set
> > > the base ref to "/gateway/default/metron-alerts-ui" so that static
> assets
> > > are referenced at
> > >
> https://node1:8443/gateway/default/metron-alerts-ui/assets/some-asset.js
> > .
> > > When you do that, the legacy alerts UI will no longer work.  I guess
> the
> > > point I'm trying to make is that we would have to switch between them
> or
> > > have 2 separate application running.  I imagine most users only need
> one
> > or
> > > the other running so probably not an issue.
> > >
> > > Jon, the primary upgrade consideration I see is with authentication.
> To
> > be
> > > able to use Knox, you would have to upgrade to LDAP-based
> authentication
> > if
> > > you were still using JDBC-based authentication in REST.  The urls would
> > > also change obviously.
> > >
> > > On Sun, Nov 11, 2018 at 6:38 PM zeo...@gmail.com 
> > wrote:
> > >
> > > > Phew, that was quite the thread to catch up on.
> > > >
> >

Re: [DISCUSS] Knox SSO feature branch review and features

2018-11-12 Thread Simon Elliston Ball
> >> > On Thu, Sep 27, 2018 at 4:08 PM Casey Stella 
> > > >> wrote:
> > > >> >
> > > >> > > I'm coming in late to the game here, but for my mind a feature
> > > branch
> > > >> > > should involve the minimum architectural change to accomplish a
> > > given
> > > >> > > feature.
> > > >> > > The feature in question is SSO integration.  It seems to me that
> > the
> > > >> > > operative question is can we do the feature without making the
> > OTHER
> > > >> > > architectural change
> > > >> > > (e.g. migrating from expressjs to spring boot + zuul).  I would
> > > argue
> > > >> > that
> > > >> > > if we WANT to do that, then it should be a separate feature
> > branch.
> > > >> > >
> > > >> > > Thus, I leave with a question: is there a way to accomplish this
> > > >> feature
> > > >> > > without ripping out expressjs?
> > > >> > >
> > > >> > >- If so and it is feasible, I would argue that we should
> > decouple
> > > >> this
> > > >> > >into a separate feature branch.
> > > >> > >- If so and it is infeasible, I'd like to hear an argument as
> > to
> > > >> the
> > > >> > >infeasibility and let's decide given that
> > > >> > >- If it is not possible, then I'd argue that we should keep
> > them
> > > >> > coupled
> > > >> > >and move this through as-is.
> > > >> > >
> > > >> > > On a side-note, it feels a bit weird that we're narrowing to a
> > > bundled
> > > >> > > proxy, rather than having that be a pluggable thing.  I'm not
> > super
> > > >> > > knowledgeable in this space, so I apologize
> > > >> > > in advance if this is naive, but isn't this a pluggable,
> external
> > > >> > component
> > > >> > > (e.g. nginx)?
> > > >> > >
> > > >> > > On Thu, Sep 27, 2018 at 5:05 PM Michael Miklavcic <
> > > >> > > michael.miklav...@gmail.com> wrote:
> > > >> > >
> > > >> > > > I've spent some more time reading through Simon's response and
> > the
> > > >> > added
> > > >> > > > sequence diagram. This is definitely helpful - thank you
> Simon.
> > > >> > > >
> > > >> > > > I need to redact my initial list:
> > > >> > > >
> > > >> > > >1. Node migrated to Spring Boot, expressjs migrated to a
> > > >> > > >non-JS/non-NodeJs proxying mechanism (ie Zuul in this case)
> > > >> > > >2. JDBC removed completely in favor of LDAP
> > > >> > > >3. Knox/SSO
> > > >> > > >
> > > >> > > > I'm a bit conflicted on the best way to move forward and would
> > > like
> > > >> > some
> > > >> > > > thoughts from other community members on this. I think an
> > argument
> > > >> can
> > > >> > be
> > > >> > > > made that 1 and 2 are independent of 3, and should/could
> really
> > be
> > > >> > > > independent PR's against master.
> > > >> > > >
> > > >> > > > The need for a replacement for expressjs (Zuul in this case)
> is
> > an
> > > >> > > artifact
> > > >> > > > that our request/response cycle for REST calls is a simple
> > matter
> > > of
> > > >> > > > forwarding with some additional headers for authentication.
> > > There's
> > > >> a
> > > >> > > > JSESSIONID managed by the client browser in our current
> > > >> architecture,
> > > >> > for
> > > >> > > > example. You login to the alerts or the management UI which
> > > >> forwards a
> > > >> > > > request to REST, which looks up credentials in a backend
> > database,
> > > >> and
> > > >> > > > passes the results back up the chain.

Re: Revert PR #1218

2018-10-23 Thread Simon Elliston Ball
Would it not make more sense to fix the bug on the DAO side, and roll
forward? I suspect what we need to do is add a stage in the update
capability to configure the key field used for update, or, worst case, have a
pre-query to look up the internal ID in the relatively rare scenario where
we escalate / modify indexed docs. Seems like a simple new ticket, rather
than a complex roll-back and roll-forward later. As long as we get the
follow-on in before an Apache release we should be fine, no?

Simon

On Tue, 23 Oct 2018 at 19:58, Nick Allen  wrote:

> Hi Guys -
>
> @rmerriman tracked down some problems that were introduced with my PR
> #1218.  Thanks to him for finding this.  The change was intended to improve
> Elasticsearch write performance by allowing Elasticsearch to set its own
> document ID.
>
> The problem is that if you then go to the Alerts UI and escalate an alert,
> it will create a duplicate alert in the index, rather than updating the
> existing alert. I've been looking at how to fix the problem and the scope
> of the fix is larger than I'd like to handle as a follow-on.  There are
> some prerequisites I'd like to tackle before introducing this change.
>
> I am going to revert the change on master, which will introduce an
> additional commit that is an "undo" of the original commit.  I will then
> open a separate PR that introduces this new functionality.
>
> https://github.com/apache/metron/pull/1218
>
> Thanks
>


-- 
--
simon elliston ball
@sireb


Re: [DISCUSS] Knox SSO feature branch review and features

2018-09-19 Thread Simon Elliston Ball
To clarify some of this I've put some documentation into
https://github.com/apache/metron/pull/1203 under METRON-1755 (
https://issues.apache.org/jira/browse/METRON-1755). Hopefully the diagrams
there should make it clearer.

Simon

On Tue, 18 Sep 2018 at 14:17, Simon Elliston Ball <
si...@simonellistonball.com> wrote:

> Hi Mike,
>
> Some good points here which could do with some clarification. I suspect
> the architecture documentation could be clearer and fill in some of these
> gaps, and I'll have a look at working on that and providing some diagrams.
>
> The short version is that the Zuul proxy gateway has been added to replace
> the Nodejs express proxy used to gateway the REST api calls in the current
> hosts. This is done in both cases to avoid CORS restrictions by allowing
> the same host that serves the UI files to proxy calls to the API.
>
> The choice of Zuul was partly a pragmatic one (it's the one that's there
> in the box as it were with Spring Boot, which we use for the REST API, via
> the Spring Cloud Netflix project which wraps a bunch of related pieces into
> Spring). The choice of Spring Boot to host the UIs themselves was similarly
> for parity with the REST host, to simplify the stack (we remove the
> occasionally problematic need to install nodejs on target servers, which is
> outside of the regular OS and HDP stacks we support).
>
> Arguably, the Zuul proxy is not necessary if we force everything through a
> Knox instance, since Knox would provide a single endpoint. We probably
> however don't want to force Knox and SSL, hence using Zuul to keep it
> closer to our current architecture. Zuul does some other nice things, which
> might help us in future, so it's really about laying down some options for
> potentially doing micro-services style things at a later date. I'm not
> saying we have to, or even should go that way, it will just make life
> easier later if we decide to. It will also help us if we want to add HA,
> circuit breaking etc to the architecture at a later date. That said, I
> regret that I ever said the word micro-services, since it's caused
> confusion. Just think of it as a proxy to deal with the CORS problem.
>
> Zuul is implemented as a set of filters, but we are not using it for its
> authentication filtering. We're using it as a proxy. Shiro is an
> authentication framework, and could arguably used to provide the security
> piece, but frankly wrapping shiro as a replacement for Spring Security in a
> Spring application seemed like it will make life a lot harder. This could
> be done, but it's not the native happy path, and would pull in additional
> dependencies that duplicate functionality that's already embedded in Spring
> Security.
>
> The version of Knox used is the default from HDP. The link version you
> mention is a docs link. I'll update it to be the older version, which is
> the same and we can decide if we want to maintain the freshness of it when
> we look to upgrade underlying patterns. Either way, the content is the
> same.
>
> I did consider other hosting mechanisms, including Undertow a
>
> If you have a different suggestion to using the Spring default ways of
> doing things, or we want to use a framework other than Spring for this,
> then maybe we could change to that, but the route chosen here is definitely
> the easy path in the context of the decision made to use Spring in metron
> rest, and if anything opens up our choices while minimising, in fact
> reducing, our dependency management overhead.
>
> I hope that explains some of the thinking behind the choices made, but the
> guiding principles I followed were:
> * Don't fight the framework if you don't have to
> * Reduce the need for additional installation pieces and third party repos
> * Minimize dependencies we would have to manage
> * Avoid excessive change of the architecture, or forcing users to adopt
> Knox if they didn't want the SSL overhead.
>
> Simon
>
>
> On Tue, 18 Sep 2018 at 02:46, Michael Miklavcic <
> michael.miklav...@gmail.com> wrote:
>
>> Thanks for the write-up Ryan, this is a great start. I have some further
>> questions based on your feedback and in addition to my initial thread.
>>
>> Just for clarification, what version of Knox are we using? HDP 2.6.5,
>> which
>> is what we currently run full dev against, supports 0.12.0.
>>
>> https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_release-notes/content/comp_versions.html
>> .
>> I see references to Knox 1.1.0 (latest) in this committed PR -
>>
>> https://github.com/apache/metron/pull//files#diff-70b412194819f3cb829566f05d77c1a6R122
>> .
>

Re: [DISCUSS] PCAP data for testing and development

2018-09-19 Thread Simon Elliston Ball
Isn't this what the pcap_replay role is for? We should be able to install
that role on full-dev and get the example.pcap file we currently ship to
replay and capture. It's not on by default in full dev because it's heavy
for most use cases, but should make it easy to load some sample pcap data
through the pcap topology.

Maybe we should have a method that, instead of replaying continuously, does a
single loop and then stops: that would load this data while keeping the CPU
weight down, and still provide data for testing the UI functionality around PCAP.
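
Under the hood that would just be a one-shot tcpreplay invocation rather than
the endless loop the role runs today, something along these lines (the
interface, rate and pcap path are assumptions about how the role lays things
out):

# replay the bundled sample capture exactly once, at a modest rate
tcpreplay --intf1=eth0 --loop=1 --mbps=10 /opt/pcap-replay/example.pcap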

Simon

On Wed, 19 Sep 2018 at 12:56, Tibor Meller  wrote:

> Hi all,
>
> I would like to start a discussion on the possible ways to provide PCAP
> data for the full dev.
> The full dev VM after a rebuild contains no PCAP data. Currently,
> I'm uploading binaries manually. This makes development slower and testing
> problematic as well. I think a more desired outcome would be
> something similar to what we have in the Alert tab, which is to have some
> pcap data available right after starting the VM.
>
> Do you guys think uploading pcap sample date as part of the
> ansible provisioning step would be a good approach?
> Or sensor stubs for pcap would be a better way?
>
> I would be curious about your thoughts!
>
> Thanks,
> Tibor
>


-- 
--
simon elliston ball
@sireb


Re: [DISCUSS] Knox SSO feature branch review and features

2018-09-18 Thread Simon Elliston Ball
uch as moving from ES
> 2.x
> > > to
> > >5.6.x and upgrading Angular 6.
> > >4. Introduction of Netflix's Zuul.
> > >https://issues.apache.org/jira/browse/METRON-1665.
> > >   - > "The UIs currently proxy to the REST API to avoid CORS
> issues,
> > >   this will be achieved with Zuul."
> > >   - Can we elaborate more on where or how CORS is a problem with
> our
> > >   existing architecture, how Zuul will help solve that, and how it
> > > fits with
> > >   Knox? Wouldn't this be handled by Knox? Since Larry McCay chimed
> in
> > > with
> > >   interest on the original SSO thread about the FB, I'm hoping he
> is
> > > also
> > >   willing to chime in on this as well.
> > >   - This looks like it has the potential to be a rather large piece
> > of
> > >   fundamental infrastructure (as it's also pertinent to
> > microservices)
> > > to
> > >   pull into the platform, and I'd like to be sure the community is
> > > aware of
> > >   and is OK with the implications.
> > >5. > "The proposal is to use a spring boot application, allowing us
> to
> > >harmonize the security implementation across the UI static servers
> and
> > > the
> > >REST layer, and to provide a routing platform for later
> > microservices."
> > > -
> > >https://issues.apache.org/jira/browse/METRON-1665.
> > >   - Microservices is a pretty loaded term. I know there had been
> some
> > >   discussion a while back during the PCAP feature branch start,
> but I
> > > don't
> > >   recall ever reaching a consensus on it. More detail in this
> thread
> > -
> > >
> > >
> >
> https://lists.apache.org/thread.html/1db7c6fa1b0f364f8c03520db9989b4f7a446de82eb4d9786055048c@%3Cdev.metron.apache.org%3E
> > > .
> > >   Can we get some clarification on what is meant by microservices
> > > in the case
> > >   of this FB and relevant PR's, what that architecture looks like,
> > and
> > > how
> > >   it's achieved with the proposed changes in this PR/FB? It seems
> > Zuul
> > > is
> > >   also pertinent to this discussion, but there are many ways to
> > > skin this cat
> > >   so I don't want to presume -
> > >
> > > https://blog.heroku.com/using_netflix_zuul_to_proxy_your_microservices
> > >   6. Zuul, Spring Boot, and microservices -  Closely related to
> > point 5
> > >above. It seems that we weren't quite ready for this when it was
> > > brought up
> > >in May, or at the very least we had some concern of what direction
> to
> > > go.
> > >What is the operational impact, mpack impact, and how we propose to
> > > manage
> > >it with Kerberos, etc.?
> > >
> > >
> >
> https://lists.apache.org/thread.html/c19904681e6a6d9ea3131be3d1a65b24447dca31b4aff588b263fd87@%3Cdev.metron.apache.org%3E
> > >
> > > There is a lot to like in this feature branch, imo. Great feature
> > addition
> > > with Knox and SSO. Introduction of LDAP support for authentication for
> > > Metron UI's. Simplification/unification of our server hosting
> > > infrastructure. I'm hoping we can flesh out some of the details pointed
> > out
> > > above a bit more and get this feature through. Great work so far!
> > >
> > > Best,
> > > Mike Miklavcic
> > >
> >
>


-- 
--
simon elliston ball
@sireb


Re: [DISCUSS] Contributing a General Purpose Regex Parser

2018-08-27 Thread Simon Elliston Ball
+1

This looks like it would be a great contribution. It might be worth having
a look at this in the context of REGEX_ROUTING, which does a similar thing,
but requires a very large number of disparate sensor configs. In that
context, I would say this provides a good means of handling things like
server syslog sources in particular.

It would be great to see a JIRA and PR on this. Discussion around any
configuration specifics is probably easier around some code.

Also, it would be really interesting to hear any performance comparisons
between something like this and a complex Grok pattern for instance, or
the approach taken in the default ASA parser, which is really quite similar
to this, but more 'coded in'.

Simon

On Mon, 27 Aug 2018 at 11:28,  wrote:

> Hello,
>
>
>
> We have implemented a general purpose regex parser for Metron that we are
> interested in contributing back to the community.
>
>
>
> While the Metron Grok parser provides some regex based capability today,
> the intention of this general purpose regex parser is to:
>
>1. Allow for more advanced parsing scenarios (specifically, dealing with
>multiple regex lines for devices that contain several log formats within
>them)
>2. Give users and developers of Metron additional options for parsing
>3. With the new parser chaining and regex routing feature available in
>Metron, this gives some additional flexibility to logically separate a
> flow
>by:
>   1. Regex routing to segregate logs at a device level and handle
>   envelope unwrapping
>   2. This general purpose regex parser to parse an entire device type
>   that contains multiple log formats within the single device (for
> example,
>   RHEL logs)
>
>
>
>  At  a high level control flow is like this:
>
> 1. Identify the record type if incoming raw message.
>
> 2. Find and apply the regular expression of corresponding record type to
> extract the fields (using named groups).
>
> 3. Apply the message header regex to extract the fields in the header part
> of the message (using named groups).
>
>
> The parser config uses the following structure:
>
>"recordTypeRegex": "(?(?<=\\s)\\b(kernel|syslog)\\b(?=\\[|:))"
>
>"messageHeaderRegex": "(?(?<=^<)
>
> \\d{1,4}(?=>)).*?(?(?<=>)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?(?<=\\s).*?(?=\\s))
> ",
>
>"fields": [
>
>   {
>
> "recordType": "kernel",
>
> "regex": ".*(?(?<=\\]|\\w\\:).*?(?=$))"
>
>   },
>
>   {
>
> "recordType": "syslog",
>
> "regex":
>
> ".*(?(?<=PID\\s=\\s).*?(?=\\sLine)).*(?(?<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?.*?(?=\")).*(?(?<=\").*?(?=$))"
>
>   }
>
> ]
>
>
>
> Where:
>
>- recordTypeRegex is used to distinctly identify a record type. It
>inputs a valid regular expression and may also have named groups, which
>would be extracted into fields.
>- messageHeaderRegex is used to specify a regular expression to extract
>fields from a message part which is common across all the messages (i.e,
>syslog fields, standard headers)
>- fields: json list of objects containing recordType and regex. The
>expression that is evaluated is based on the output of the
> recordTypeRegex
>- Note: recordTypeRegex and messageHeaderRegex could be specified as
>lists also (as a JSON array), where the list will be evaluated in order
>until a matching regular expression is found.
>
>
>
>
>
> If there are no objections to having this type of Parser within Metron, we
> will open a JIRA/PR for code review.
>
> *Jagdeep Singh*
>


-- 
--
simon elliston ball
@sireb


Re: [DISCUSS] Getting to a 1.0 release

2018-08-15 Thread Simon Elliston Ball
Agreed, should we add TDE by default, and get the Ranger policies on by
default? That leaves securing data at rest in Kafka, which would have to be
built into the consumers and producers to encrypt what lands in the on-disk
Kafka topics. Does that seem necessary to people? It would have performance
implications for sure.
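
For the TDE part, "on by default" would really just mean creating an encryption
zone over the indexed data at install time, something like this sketch (the key
name and path are assumptions; the real path should come from the indexing
config):

# create a key in the KMS, then wrap the indexed data in an encryption zone
hadoop key create metron_key
hdfs crypto -createZone -keyName metron_key -path /apps/metron/indexing/indexed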

Simon

> On 15 Aug 2018, at 21:26, Otto Fowler  wrote:
> 
> Well, I look at it like this.
> 
> The Secure Vault was part of the original metron pitch, and many may have 
> used that as part of their evaluations.
> “Look, it is going to have a security vault type thing, it is on the roadmap”.
> 
> Regardless of the implementation, conceptually, security of data at rest is 
> important, and is a major outstanding item or the core metron proposition.
> 
> 
> 
> 
>> On August 15, 2018 at 16:03:19, Simon Elliston Ball 
>> (si...@simonellistonball.com) wrote:
>> 
>> That’s going back a way. I always saw that concept as begin about the 
>> formats, e.g. Orc, and meta data around it plus the data service api to get 
>> at it. I’m all for that too, but think it needs more thought than the ticket 
>> captures. 
>> 
>> Simon
>> 
>> On 15 Aug 2018, at 20:53, Otto Fowler  wrote:
>> 
>>> https://issues.apache.org/jira/browse/METRON-343
>>> 
>>>> On August 15, 2018 at 15:47:24, Simon Elliston Ball 
>>>> (si...@simonellistonball.com) wrote:
>>>> 
>>>> What would you see as secure? I’ve seen people use TDE for the HDFS store, 
>>>> but it’s harder to encrypt storage with solr / es. Something I was 
>>>> thinking of doing to follow up on the Knox Feature was to add Ranger 
>>>> integration for securing and auditing configs, and potentially extending 
>>>> to the index destinations. Do you think that would cover the secure 
>>>> storage concept?
>>>> 
>>>> Simon
>>>> 
>>>> > On 15 Aug 2018, at 20:39, Otto Fowler  wrote:
>>>> >
>>>> > Secure storage off the top of my head
>>>> >
>>>> > On August 15, 2018 at 14:49:26, zeo...@gmail.com (zeo...@gmail.com) 
>>>> > wrote:
>>>> >
>>>> > So, as has been discussed in a few
>>>> > <
>>>> > https://lists.apache.org/thread.html/0445cd8f94dfb844cd5a23ac3eeca04c9f44c9d8f269c6ef12cb3598@%3Cdev.metron.apache.org%3E>
>>>> >
>>>> > other
>>>> > <
>>>> > https://lists.apache.org/thread.html/427a20c22207e84331b94e8ead9a4172a22577d26eb581c0e564d0dc@%3Cdev.metron.apache.org%3E>
>>>> >
>>>> > recent dev list threads, I would like to discuss what a Metron 1.0 
>>>> > release
>>>> > looks like.
>>>> >
>>>> > In order to kick off the conversation, I would like to make a few
>>>> > suggestions regarding "what 1.0 means to me," but I'm very interested to
>>>> > hear everybody else's opinions.
>>>> >
>>>> > In order to go 1.0 I believe we should have:
>>>> > 1. A clear, supported method of upgrading from one version of Metron to 
>>>> > the
>>>> > next. We have attempted
>>>> > <https://github.com/apache/metron/blob/master/Upgrading.md> to make this
>>>> > easier in the past, but it is currently not
>>>> > <
>>>> > https://github.com/apache/metron/tree/master/metron-deployment/packaging/ambari/metron-mpack#limitations>
>>>> >
>>>> > supported
>>>> > <
>>>> > https://github.com/apache/metron/tree/master/metron-deployment/packaging/ambari/elasticsearch-mpack#limitations>
>>>> >
>>>> > .
>>>> > 2. Authentication for all of the UIs and APIs should be secure and 
>>>> > support
>>>> > SSO. I believe this is in progress via METRON-1663
>>>> > <https://issues.apache.org/jira/browse/METRON-1663>.
>>>> > 3. Each of our personas
>>>> > <
>>>> > https://cwiki.apache.org/confluence/display/METRON/Metron+User+Personas+And+Benefits>
>>>> >
>>>> > should
>>>> > be well documented, understood, and supported.
>>>> > - The current state of documentation is, in my opinion, inadequate and I
>>>> > admit I am partially to blame for this. I suggest we define a strict
>>>> > approach for documentation, align to it (such as perhaps migrating all

Re: [DISCUSS] Getting to a 1.0 release

2018-08-15 Thread Simon Elliston Ball
+1 to that. That’s more the TDE bit; we would also need Kafka SSL, and the Knox
work (METRON-1663 adds SSL to all the UI and REST API endpoints).
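
By "Kafka SSL" I mean the standard client-side settings on our producers and
consumers, roughly the following (paths and passwords are obviously
placeholders):

security.protocol=SSL
ssl.truststore.location=/etc/security/kafka.client.truststore.jks
ssl.truststore.password=changeit
ssl.keystore.location=/etc/security/kafka.client.keystore.jks
ssl.keystore.password=changeit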

Simon

> On 15 Aug 2018, at 21:03, Otto Fowler  wrote:
> 
> https://issues.apache.org/jira/browse/METRON-106
> At least making sure it is met and closing it
> 
> 
> 
>> On August 15, 2018 at 15:53:02, Otto Fowler (ottobackwa...@gmail.com) wrote:
>> 
>> https://issues.apache.org/jira/browse/METRON-343
>> 
>>> On August 15, 2018 at 15:47:24, Simon Elliston Ball 
>>> (si...@simonellistonball.com) wrote:
>>> 
>>> What would you see as secure? I’ve seen people use TDE for the HDFS store, 
>>> but it’s harder to encrypt storage with solr / es. Something I was thinking 
>>> of doing to follow up on the Knox Feature was to add Ranger integration for 
>>> securing and auditing configs, and potentially extending to the index 
>>> destinations. Do you think that would cover the secure storage concept?
>>> 
>>> Simon
>>> 
>>> > On 15 Aug 2018, at 20:39, Otto Fowler  wrote:
>>> >
>>> > Secure storage off the top of my head
>>> >
>>> > On August 15, 2018 at 14:49:26, zeo...@gmail.com (zeo...@gmail.com) wrote:
>>> >
>>> > So, as has been discussed in a few
>>> > <
>>> > https://lists.apache.org/thread.html/0445cd8f94dfb844cd5a23ac3eeca04c9f44c9d8f269c6ef12cb3598@%3Cdev.metron.apache.org%3E>
>>> >
>>> > other
>>> > <
>>> > https://lists.apache.org/thread.html/427a20c22207e84331b94e8ead9a4172a22577d26eb581c0e564d0dc@%3Cdev.metron.apache.org%3E>
>>> >
>>> > recent dev list threads, I would like to discuss what a Metron 1.0 release
>>> > looks like.
>>> >
>>> > In order to kick off the conversation, I would like to make a few
>>> > suggestions regarding "what 1.0 means to me," but I'm very interested to
>>> > hear everybody else's opinions.
>>> >
>>> > In order to go 1.0 I believe we should have:
>>> > 1. A clear, supported method of upgrading from one version of Metron to 
>>> > the
>>> > next. We have attempted
>>> > <https://github.com/apache/metron/blob/master/Upgrading.md> to make this
>>> > easier in the past, but it is currently not
>>> > <
>>> > https://github.com/apache/metron/tree/master/metron-deployment/packaging/ambari/metron-mpack#limitations>
>>> >
>>> > supported
>>> > <
>>> > https://github.com/apache/metron/tree/master/metron-deployment/packaging/ambari/elasticsearch-mpack#limitations>
>>> >
>>> > .
>>> > 2. Authentication for all of the UIs and APIs should be secure and support
>>> > SSO. I believe this is in progress via METRON-1663
>>> > <https://issues.apache.org/jira/browse/METRON-1663>.
>>> > 3. Each of our personas
>>> > <
>>> > https://cwiki.apache.org/confluence/display/METRON/Metron+User+Personas+And+Benefits>
>>> >
>>> > should
>>> > be well documented, understood, and supported.
>>> > - The current state of documentation is, in my opinion, inadequate and I
>>> > admit I am partially to blame for this. I suggest we define a strict
>>> > approach for documentation, align to it (such as perhaps migrating all
>>> > useful wiki documentation to git), and enforce it.
>>> > - I would consider METRON-1699
>>> > <https://issues.apache.org/jira/browse/METRON-1699> as a critical item for
>>> > a Security Data Scientist, but it is currently not clear to me where the
>>> > line exists between some of the other personas, or that each persona has
>>> > been sufficiently implemented.
>>> > 4. A performance tuning guide should be available for all of the main
>>> > components, whether as an independent document or as a part of a larger
>>> > document.
>>> > 5. Simple data ingest.
>>> > - Similar to the ongoing conversation for NiFi integration
>>> > <
>>> > https://lists.apache.org/thread.html/d7bb4d32c8c42bd40b2f26973f989bcba16010a672fd8a533a5544bf@%3Cdev.metron.apache.org%3E>,
>>> >
>>> > we should be able to say that we have broken down the barriers to getting
>>> > data into a Metron cluster in easy and efficient ways. In addition to
>>> > NiFi, having support for other popular tools such as beats
>>> > <https://www.elastic.co/products/beats>, fluentd 
>>> > <https://www.fluentd.org/>,
>>> >
>>> > etc.
>>> > - Parsers should be pluggable, with independent tests and the ability to
>>> > make versioned modifications with roll-backs.
>>> >
>>> > What else? Are any of these items not necessary for a 1.0?
>>> >
>>> > Jon
>>> > --
>>> >
>>> > Jon


Re: [DISCUSS] Getting to a 1.0 release

2018-08-15 Thread Simon Elliston Ball
That’s going back a way. I always saw that concept as being about the formats,
e.g. ORC, and the metadata around it, plus the data service API to get at it. I’m
all for that too, but think it needs more thought than the ticket captures.

Simon

> On 15 Aug 2018, at 20:53, Otto Fowler  wrote:
> 
> https://issues.apache.org/jira/browse/METRON-343
> 
>> On August 15, 2018 at 15:47:24, Simon Elliston Ball 
>> (si...@simonellistonball.com) wrote:
>> 
>> What would you see as secure? I’ve seen people use TDE for the HDFS store, 
>> but it’s harder to encrypt storage with solr / es. Something I was thinking 
>> of doing to follow up on the Knox Feature was to add Ranger integration for 
>> securing and auditing configs, and potentially extending to the index 
>> destinations. Do you think that would cover the secure storage concept? 
>> 
>> Simon 
>> 
>> > On 15 Aug 2018, at 20:39, Otto Fowler  wrote: 
>> > 
>> > Secure storage off the top of my head 
>> > 
>> > On August 15, 2018 at 14:49:26, zeo...@gmail.com (zeo...@gmail.com) wrote: 
>> > 
>> > So, as has been discussed in a few 
>> > < 
>> > https://lists.apache.org/thread.html/0445cd8f94dfb844cd5a23ac3eeca04c9f44c9d8f269c6ef12cb3598@%3Cdev.metron.apache.org%3E>
>> >  
>> > 
>> > other 
>> > < 
>> > https://lists.apache.org/thread.html/427a20c22207e84331b94e8ead9a4172a22577d26eb581c0e564d0dc@%3Cdev.metron.apache.org%3E>
>> >  
>> > 
>> > recent dev list threads, I would like to discuss what a Metron 1.0 release 
>> > looks like. 
>> > 
>> > In order to kick off the conversation, I would like to make a few 
>> > suggestions regarding "what 1.0 means to me," but I'm very interested to 
>> > hear everybody else's opinions. 
>> > 
>> > In order to go 1.0 I believe we should have: 
>> > 1. A clear, supported method of upgrading from one version of Metron to 
>> > the 
>> > next. We have attempted 
>> > <https://github.com/apache/metron/blob/master/Upgrading.md> to make this 
>> > easier in the past, but it is currently not 
>> > < 
>> > https://github.com/apache/metron/tree/master/metron-deployment/packaging/ambari/metron-mpack#limitations>
>> >  
>> > 
>> > supported 
>> > < 
>> > https://github.com/apache/metron/tree/master/metron-deployment/packaging/ambari/elasticsearch-mpack#limitations>
>> >  
>> > 
>> > . 
>> > 2. Authentication for all of the UIs and APIs should be secure and support 
>> > SSO. I believe this is in progress via METRON-1663 
>> > <https://issues.apache.org/jira/browse/METRON-1663>. 
>> > 3. Each of our personas 
>> > < 
>> > https://cwiki.apache.org/confluence/display/METRON/Metron+User+Personas+And+Benefits>
>> >  
>> > 
>> > should 
>> > be well documented, understood, and supported. 
>> > - The current state of documentation is, in my opinion, inadequate and I 
>> > admit I am partially to blame for this. I suggest we define a strict 
>> > approach for documentation, align to it (such as perhaps migrating all 
>> > useful wiki documentation to git), and enforce it. 
>> > - I would consider METRON-1699 
>> > <https://issues.apache.org/jira/browse/METRON-1699> as a critical item for 
>> > a Security Data Scientist, but it is currently not clear to me where the 
>> > line exists between some of the other personas, or that each persona has 
>> > been sufficiently implemented. 
>> > 4. A performance tuning guide should be available for all of the main 
>> > components, whether as an independent document or as a part of a larger 
>> > document. 
>> > 5. Simple data ingest. 
>> > - Similar to the ongoing conversation for NiFi integration 
>> > < 
>> > https://lists.apache.org/thread.html/d7bb4d32c8c42bd40b2f26973f989bcba16010a672fd8a533a5544bf@%3Cdev.metron.apache.org%3E>,
>> >  
>> > 
>> > we should be able to say that we have broken down the barriers to getting 
>> > data into a Metron cluster in easy and efficient ways. In addition to 
>> > NiFi, having support for other popular tools such as beats 
>> > <https://www.elastic.co/products/beats>, fluentd 
>> > <https://www.fluentd.org/>, 
>> > 
>> > etc. 
>> > - Parsers should be pluggable, with independent tests and the ability to 
>> > make versioned modifications with roll-backs. 
>> > 
>> > What else? Are any of these items not necessary for a 1.0? 
>> > 
>> > Jon 
>> > -- 
>> > 
>> > Jon 


Re: [DISCUSS] Getting to a 1.0 release

2018-08-15 Thread Simon Elliston Ball
What would you see as secure? I’ve seen people use TDE for the HDFS store, but
it’s harder to encrypt storage with Solr/ES. Something I was thinking of
doing to follow up on the Knox feature was to add Ranger integration for
securing and auditing configs, and potentially extending that to the index
destinations. Do you think that would cover the secure storage concept?

Simon

> On 15 Aug 2018, at 20:39, Otto Fowler  wrote:
> 
> Secure storage off the top of my head
> 
> On August 15, 2018 at 14:49:26, zeo...@gmail.com (zeo...@gmail.com) wrote:
> 
> So, as has been discussed in a few
> <
> https://lists.apache.org/thread.html/0445cd8f94dfb844cd5a23ac3eeca04c9f44c9d8f269c6ef12cb3598@%3Cdev.metron.apache.org%3E>
> 
> other
> <
> https://lists.apache.org/thread.html/427a20c22207e84331b94e8ead9a4172a22577d26eb581c0e564d0dc@%3Cdev.metron.apache.org%3E>
> 
> recent dev list threads, I would like to discuss what a Metron 1.0 release
> looks like.
> 
> In order to kick off the conversation, I would like to make a few
> suggestions regarding "what 1.0 means to me," but I'm very interested to
> hear everybody else's opinions.
> 
> In order to go 1.0 I believe we should have:
> 1. A clear, supported method of upgrading from one version of Metron to the
> next. We have attempted
>  to make this
> easier in the past, but it is currently not
> <
> https://github.com/apache/metron/tree/master/metron-deployment/packaging/ambari/metron-mpack#limitations>
> 
> supported
> <
> https://github.com/apache/metron/tree/master/metron-deployment/packaging/ambari/elasticsearch-mpack#limitations>
> 
> .
> 2. Authentication for all of the UIs and APIs should be secure and support
> SSO. I believe this is in progress via METRON-1663
> .
> 3. Each of our personas
> <
> https://cwiki.apache.org/confluence/display/METRON/Metron+User+Personas+And+Benefits>
> 
> should
> be well documented, understood, and supported.
> - The current state of documentation is, in my opinion, inadequate and I
> admit I am partially to blame for this. I suggest we define a strict
> approach for documentation, align to it (such as perhaps migrating all
> useful wiki documentation to git), and enforce it.
> - I would consider METRON-1699
>  as a critical item for
> a Security Data Scientist, but it is currently not clear to me where the
> line exists between some of the other personas, or that each persona has
> been sufficiently implemented.
> 4. A performance tuning guide should be available for all of the main
> components, whether as an independent document or as a part of a larger
> document.
> 5. Simple data ingest.
> - Similar to the ongoing conversation for NiFi integration
> <
> https://lists.apache.org/thread.html/d7bb4d32c8c42bd40b2f26973f989bcba16010a672fd8a533a5544bf@%3Cdev.metron.apache.org%3E>,
> 
> we should be able to say that we have broken down the barriers to getting
> data into a Metron cluster in easy and efficient ways. In addition to
> NiFi, having support for other popular tools such as beats
> , fluentd ,
> 
> etc.
> - Parsers should be pluggable, with independent tests and the ability to
> make versioned modifications with roll-backs.
> 
> What else? Are any of these items not necessary for a 1.0?
> 
> Jon
> -- 
> 
> Jon


Slack Channel

2018-08-15 Thread Simon Elliston Ball
Hello dev team, may I please join your slack channel :)


Re: [ANNOUNCE] - Apache Metron Slack channel

2018-08-15 Thread Simon Elliston Ball
Since this is committers only, would it make more sense to stick to IRC? Or
is exclusivity the idea?

On 15 August 2018 at 16:09, Nick Allen  wrote:

> Thanks for the instructions!
>
> On Wed, Aug 15, 2018 at 10:22 AM, Michael Miklavcic <
> michael.miklav...@gmail.com> wrote:
>
> > The Metron community has a Slack channel available for communication
> > (similar to the existing IRC channel, only on Slack).
> >
> > To join:
> >
> >1. Go to slack.com.
> >2. For organization/group, you'll enter "the-asf"
> >3. Use your Apache email for your login
> >4. Click "Channels" and look for #metron (Created by ottO June 15,
> 2018)
> >
> > Best
> > Mike Miklavcic
> >
>



-- 
--
simon elliston ball
@sireb


Re: Change field separator in Metron to make it Hive and ORC friendly

2018-08-14 Thread Simon Elliston Ball
The challenge with making it configurable is that every query, every profile,
every analytic, template, pre-installed dashboard and use case built by any
third party who wanted to extend Metron would have to honour the configuration
and parameterize every query they run. My worry is that that would render some
engines totally incompatible with many installs (as opposed to just needing an
escape character, as you would with Hive now) and would prevent a lot of tools
from participating in the Metron ecosystem.

I think this is something where we need to make a good decision and stick to it 
to allow the ecosystem to build on a known foundation. 

Dots are not great because Hive uses them as a separator, underscore collides with
our existing convention, and hyphen collides with a number of other common log
formats, so it’s not an easy one to have an opinion on, but I do think we 
should have an opinion rather than forcing every user to make the hard choice 
to exclude others from sharing. 

Perhaps the flat key-value structure is the real question here; given the
progress in the underlying index engines, it may not be the panacea it once was.

Simon

Sent from my iPhone

> On 14 Aug 2018, at 11:42, deepak kumar  wrote:
> 
> I agree Ali.
> May be it can be configuration parameter.
> 
>> On Tue, Aug 14, 2018 at 3:24 PM Ali Nazemian  
>> wrote:
>> 
>> Hi Simon,
>> 
>> We have temporarily decided to just change it with "_" for HDFS to avoid
>> all the headaches of the bugs and issues that can be raised by using
>> unsupported separators for ORC/Hive and Spark. However, I am not quite
>> confident with "_" as an option for the community as it becomes similar to
>> normal Metron separator. Maybe it would be nice to have an ability to
>> change the separator to any other character and let users decide what they
>> want to use.
>> 
>> Cheers,
>> Ali
>> 
>> On Tue, Aug 14, 2018 at 12:14 AM Simon Elliston Ball <
>> si...@simonellistonball.com> wrote:
>> 
>>> Do you have any suggestions for what would make sense as a delimiter?
>>> 
>>>> On 9 August 2018 at 05:57, Ali Nazemian  wrote:
>>>> 
>>>> Hi All,
>>>> 
>>>> I was wondering if we can change the field separators in Metron to be
>>> able
>>>> to make it Hive/ORC friendly. I could find the following PR, but
>> neither
>>>> dot nor colon is very Hive and ORC friendly and they will cause some
>>>> issues. Hence, I wanted to see if it is possible to change the field
>>>> separator to something else or even give users an ability to define
>> what
>>>> separator to be used to make the data model consistent across
>>> Elasticsearch
>>>> and HDFS.
>>>> 
>>>> https://github.com/apache/metron/pull/1022
>>>> 
>>>> Cheers,
>>>> Ali
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> --
>>> simon elliston ball
>>> @sireb
>>> 
>> 
>> 
>> --
>> A.Nazemian
>> 


Re: Change field separator in Metron to make it Hive and ORC friendly

2018-08-13 Thread Simon Elliston Ball
Do you have any suggestions for what would make sense as a delimiter?

On 9 August 2018 at 05:57, Ali Nazemian  wrote:

> Hi All,
>
> I was wondering if we can change the field separators in Metron to be able
> to make it Hive/ORC friendly. I could find the following PR, but neither
> dot nor colon is very Hive and ORC friendly and they will cause some
> issues. Hence, I wanted to see if it is possible to change the field
> separator to something else or even give users an ability to define what
> separator to be used to make the data model consistent across Elasticsearch
> and HDFS.
>
> https://github.com/apache/metron/pull/1022
>
> Cheers,
> Ali
>



-- 
--
simon elliston ball
@sireb


Re: [DISCUSS] Metron Parsers in Nifi

2018-08-13 Thread Simon Elliston Ball
Yep, I'm wondering whether our parser interface should have the ability to
create schema either like that, or well, that, which would be helpful
within Metron as well.

@Otto, the one thing missing from the record reader api, is that if you
don't emit any records at all for a flow file, it errors, which is not
strictly speaking an error, but yeah, we can certainly control things like
filtering errors aside from this. I would say this was a nifi bug
(debatably) which should be fixed on that side.

Simon

On 13 August 2018 at 14:29, Otto Fowler  wrote:

> Also,  If we are doing the record readers, we can have a reader for a
> parser type and explicitly set the schema, as seen here :
> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-
> services/nifi-record-serialization-services-bundle/
> nifi-record-serialization-services/src/main/java/org/apache/nifi/syslog/
> Syslog5424Reader.java
>
>
>
> On August 13, 2018 at 09:26:50, Otto Fowler (ottobackwa...@gmail.com)
> wrote:
>
> If we can do the record readers ourselves ( with the parsers inside them )
> we can handle the returns.
> I’ll be doing the net flow 5 readers once the net flow 5 processor PR (
> not mine ) is in.
>
> I don’t think having a generic class loading parsers foo and having to
> manage all that is preferable to having
> an archetype and explicit parsers.
>
> Nifi processors and readers are self documenting, and this approach will
> make that not possible, as another consideration.
>
>
>
> On August 13, 2018 at 06:50:09, Simon Elliston Ball (
> si...@simonellistonball.com) wrote:
>
> Maybe the edge use case will clarify the config issue a little. The reason
> I would want to be able to push Metron parsers into NiFi would be so I can
> pre-parse and filter on the edge to save bandwidth from remote locations. I
> would expect to be able to parse at the edge and use NiFi to prioritise or
filter on the Metron-ready data, then push through to a 'NoOp' parser in
> Metron. For this to happen, we would absolutely not want to connect to
> Zookeeper, so I'm +1 on Otto's suggestion that the config be embeddable in
> NiFi properties. We cannot assume ZK connectivity from NiFi.
>
> I can also see a scenario where NiFi might make it easier to chain parsers,
> which is where it overlaps more with Metron. This is more about the fact
that NiFi makes it a lot easier to configure and manage complex multi-step
flows than Metron, and is far more intuitive for users from a design and
> monitoring perspective. My main concern around using NiFi in this way is
> about the load on the content repository. We are looking at a lot of
> content level transformation here. You could argue that the same load is
> taken off Kafka in the chaining scenario, but there is still a chance for a
> user to accidentally create a lot of disk access if they go over the top
> with NiFi.
>
I see this as potentially a chance to make the Metron Parser interface
> compatible with NiFi Record Readers. Then both communities could benefit
> from sharing each other's parsers.
>
> In terms of the NAR approach, I would say we have a base bundle of the NiFi
> bits (https://github.com/simonellistonball/metron/tree/nifi already has
> this for stellar, enrichments and an opinionated publisher, it also has a
> readme with some discussion around this
> https://github.com/simonellistonball/metron/tree/nifi/nifi-metron-bundle).
> We can then use other nar dependencies to side load parser classes into the
> record reader. We would then need to do some fancy property validation in
> NiFi to ensure the classes were available.
>
Also, Record Readers are much, much faster. The only problem I've found with
them is that they error on empty output, which was a problem for me writing
a NetFlow 9 reader (template-only records need to live in the NiFi cache, but
not be emitted).
>
> In terms of the schema objection, I'm not sure why schema focus is a
> problem. Our parsers have implicit schema and the output schema formats
> used in NiFi are very flexible and could be "just a map". That said, we
> could also take the opportunity to introduce a method to the parser
> interface to emit traits to contribute the bits of schema that a parser
> produces. This would ultimately lead to us being able to generate output
schemas (ES, Solr, Hive, whatever), which would take a lot of the pain out of
sensor setup.
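
To make that last idea concrete, I am imagining something as small as this.
It is purely hypothetical: the interface does not exist today, and the names
are made up for illustration:

import java.util.Map;

public interface DeclaresOutputFields {
  // very rough field typing, just enough to generate ES/Solr templates,
  // Hive DDL or NiFi RecordSchemas from a parser's declared output
  enum FieldType { STRING, LONG, DOUBLE, BOOLEAN, TIMESTAMP, IP }

  // field name -> type for everything this parser can emit
  Map<String, FieldType> outputFields();
}

A parser that implements it costs nothing in the Storm path, but gives NiFi
(and our own index template setup) enough information to build a schema from.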
>
> Simon
>
> On 9 August 2018 at 16:42, Otto Fowler  wrote:
>
> > I would say that
> >
> > - For each configuration parameter we want to pull in, it should be
> > explicitly configured through a property as well as through a controller
> > service that acce

Re: [DISCUSS] Metron Parsers in Nifi

2018-08-13 Thread Simon Elliston Ball
e to
> > > > >> see Stellar more readily available in NiFi in general.
> > > > >>
> > > > >> Re: the ControllerService, I see this as a way to maintain
> Metron's
> > > > use of
> > > > >> ZK as the source of config truth. Users could definitely be using
> > > NiFi
> > > > and
> > > > >> Storm in tandem (parse in NiFi + enrich and index from Storm, for
> > > > >> example). Using the ControllerService gives us a ZK instance as
> the
> > > > single
> > > > >> source of truth. That way we aren't forcing users to go to two
> > > > different
> > > > >> places to manage configs. This also lets us leverage our existing
> > > > scripts
> > > > >> and our existing infrastructure around configs and their
> management
> > > and
> > > > >> validation very easily. It also gives users a way to port from
> NiFi
> > > to
> > > > >> Storm or vice-versa without having to migrate configs as well. We
> > > could
> > > > >> also provide the option to configure the Processor itself with the
> > > data
> > > > >> (just don't set up a controller service and provide the json or
> > > > whatever as
> > > > >> one of our properties).
> > > > >>
> > > > >> On Tue, Aug 7, 2018 at 10:12 AM Otto Fowler <
> > ottobackwa...@gmail.com
> > > >
> > > > >> wrote:
> > > > >>
> > > > >>> I think this is a good idea. As I mentioned in the other thread
> > I’ve
> > > > >>> been doing a lot of work on Nifi recently.
> > > > >>> I think the important thing is that what is done should be done
> the
> > > > NiFi
> > > > >>> way, not bolting the Metron composition
> > > > >>> onto Nifi. Think of it like the Tao of Unix, the parsers and
> > > > components
> > > > >>> should be single purpose and simple, allowing
> > > > >>> exceptional flexibility in composition.
> > > > >>>
> > > > >>> Comments inline.
> > > > >>>
> > > > >>> On August 7, 2018 at 09:27:01, Justin Leet (
> justinjl...@gmail.com)
> > > > wrote:
> > > > >>>
> > > > >>> Hi all,
> > > > >>>
> > > > >>> There's interest in being able to run Metron parsers in NiFi,
> > rather
> > > > than
> > > > >>>
> > > > >>> inside Storm. I dug into this a bit, and have some thoughts on
> how
> > > we
> > > > >>> could
> > > > >>> go about this. I'd love feedback on this, along with anything
> we'd
> > > > >>> consider must haves as well as future enhancements.
> > > > >>>
> > > > >>> 1. Separate metron-parsers into metron-parsers-common and
> > > metron-storm
> > > > >>> and create metron-parsers-nifi. For this code to be reusable
> across
> > > > >>> platforms (NiFi, Storm, and anything else in the future), we'll
> > need
> > > > to
> > > > >>> decouple our parsers and Storm.
> > > > >>>
> > > > >>> +1. The “parsing code” should be a library that implements an
> > > > interface
> > > > >>> ( another library ).
> > > > >>>
> > > > >>> The Processors and the Storm things can share them.
> > > > >>>
> > > > >>> - There's also some nice fringe benefits around refactoring our
> > code
> > > > >>> to be substantially more clear and understandable; something
> > > > >>> which came up
> > > > >>> while allowing for parser aggregation.
> > > > >>> 2. Create a MetronProcessor that can run our parsers.
> > > > >>> - I took a look at how RecordReader could be leveraged (e.g.
> > > > >>> CSVRecordReader), but this is pretty tightly tied into schemas
> > > > >>> and is meant
> > > > >>> to be used by ControllerServices, which are then used by
> > Processors.
> > > > >>> There's friction involved there in terms of schemas, but also in
> > > > terms of
> > > > >>>
> > > > >>> access to ZK configs and things like parser chaining. We might
> > > > >>> be able to
> > > > >>> leverage it, but it seems like it'd be fairly shoehorned in
> > > > >>> without getting
> > > > >>> the schema and other benefits.
> > > > >>>
> > > > >>> We won’t have to provide our ‘no schema processors’ ( grok, csv,
> > > json
> > > > ).
> > > > >>>
> > > > >>> All the remaining processors DO have schemas that we know about.
> We
> > > > can
> > > > >>> just provide the avro schemas the same way we provide the ES
> > > schemas.
> > > > >>>
> > > > >>> The “parsing” should not be conflated with the transform/stellar
> in
> > > > >>> NiFi. We should make that separate. Running Stellar over Records
> > > > would be
> > > > >>> the best thing.
> > > > >>>
> > > > >>> - This Processor would work similarly to Storm: bytes[] in ->
> JSON
> > > > >>> out.
> > > > >>> - There is a Processor
> > > > >>> <
> > > > >>>
> > > >
> > >
> > https://github.com/apache/nifi/blob/master/nifi-nar-
> bundles/nifi-standard-bundle/nifi-standard-processors/src/
> main/java/org/apache/nifi/processors/standard/JoltTransformJSON.java
> > > > >>> >
> > > > >>> that
> > > > >>> handles loading other JARs that we can model a
> > > > >>> MetronParserProcessor off of
> > > > >>> that handles classpath/classloader issues (basically just sets
> up a
> > > > >>> classloader specific to what's being loaded and swaps out the
> > > Thread's
> > > > >>> loader when it calls to outside resources).
> > > > >>>
> > > > >>> There should be no reason to load modules outside the NAR. Why do
> > > you
> > > > >>> expect to? If each Metron Processor equiv of a Metron Storm
> Parser
> > > is
> > > > just
> > > > >>> parsing to json it shouldn’t need much.And we could package them
> in
> > > > the
> > > > >>> NAR. I would suggest we have a Processor per Parser to allow for
> > > > >>> specialization. It should all be in the nar.
> > > > >>>
> > > > >>> The Stellar Processor, if you would support the works would
> > possibly
> > > > need
> > > > >>> this.
> > > > >>>
> > > > >>> 3. Create a MetronZkControllerService to supply our configs to
> our
> > > > >>> processors.
> > > > >>> - This is a pretty established NiFi pattern for being able to
> > > provide
> > > > >>> access to other services needed by a Processor (e.g. databases or
> > > > large
> > > > >>> configurations files).
> > > > >>> - The same controller service can be used by all Processors to
> > > manage
> > > > >>> configs in a consistent manner.
> > > > >>>
> > > > >>> I think controller services would make sense where needed, I’m
> just
> > > > not
> > > > >>> sure what you imagine them being needed for?
> > > > >>>
> > > > >>> If the user has NiFi, and a Registry etc, are you saying you
> > imagine
> > > > them
> > > > >>> using Metron + ZK to manage configurations? Or to be using BOTH
> > > storm
> > > > >>> processors and Nifi Processors?
> > > > >>>
> > > > >>> At that point, we can just NAR our controller service and parser
> > > > processor
> > > > >>>
> > > > >>> up as needed, deploy them to NiFi, and let the user provide a
> > config
> > > > for
> > > > >>> where their custom parsers can be provided (i.e. their parser
> jar).
> > > > This
> > > > >>> would be 3 nars (processor, controller-service, and
> > > > controller-service-api
> > > > >>>
> > > > >>> in order to bind the other two together).
> > > > >>>
> > > > >>> Once deployed, our ability to use parsers should fit well into
> the
> > > > >>> standard
> > > > >>> NiFi workflow:
> > > > >>>
> > > > >>> 1. Create a MetronZkControllerService.
> > > > >>> 2. Configure the service to point at zookeeper.
> > > > >>> 3. Create a MetronParser.
> > > > >>> 4. Configure it to use the controller service + parser jar
> location
> > > +
> > > > >>> any other needed configs.
> > > > >>> 5. Use the outputs as needed downstream (either writing out to
> > Kafka
> > > > or
> > > > >>> feeding into more MetronParsers, etc.)
> > > > >>>
> > > > >>> Chaining parsers should ideally become a matter of chaining
> > > > MetronParsers
> > > > >>>
> > > > >>> (and making sure the enveloping configs carry through properly).
> > For
> > > > >>> parser
> > > > >>> aggregation, I'd just avoid it entirely until we know it's needed
> > in
> > > > NiFi.
> > > > >>>
> > > > >>> Justin
> > > >
> > > > ---
> > > > Thank you,
> > > >
> > > > James Sirota
> > > > PMC- Apache Metron
> > > > jsirota AT apache DOT org
> > > >
> > > >
> > >
> >
> >
>



-- 
--
simon elliston ball
@sireb


Knox SSO feature branch PRs: a quick demo

2018-08-01 Thread Simon Elliston Ball
I've recently put in a number of PRs on the Knox feature branch, and
thought it might be useful to post a quick 'sprint demo' style explanation
of what the various PRs and functionality entail:
https://youtu.be/9OJz6hg0N1I

Hope this helps with the review process. There are a couple of areas that
need a little follow-on improvement (Ambari MPack cosmetic oddness, mainly).
Any thoughts and assistance on that would be very greatly appreciated.

Simon


Re: [DISCUSS] Batch Profiler

2018-07-30 Thread Simon Elliston Ball
Good points Otto +1 to all that.

On the Spark question, we should definitely be more deliberate about it. We
currently have an implicit dependency on Spark through the Zeppelin
notebooks. Most implementations I've seen of Metron also have some sort of
Spark work built around them. The current full dev HDP build is the latest
2.6.5 version available, even though the profile is still named 2.5.3. I'm not sure
we should take on jumping to 3.0 just yet for this effort. With the current
version we get Spark 2.3.0 by default, which would seem to do.

On point two... yes, this does seem very much like a first step in the
direction of being able to replace Storm, but I would say that probably
deserves its very own feature branch. I would say we want to use things
like the structured streaming capability for this, which may remove the
need for some of the custom batch writers we have in Metron, delegating
those capabilities to Spark. My one concern around here would be the fact
that Spark Continuous Triggers are still alpha grade, so we would have to
take some micro-batch latency with a move to Spark. Realistically we have
this issue anyway in Storm world, because we have to do batching there
too.
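
To illustrate the trade-off, a structured streaming job over one of our Kafka
topics looks roughly like this. It is a sketch only: the broker, topic and sink
are placeholders, and Trigger.Continuous is the experimental piece I am
referring to (the default is micro-batch):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;

public class StreamingSketch {
  public static void main(String[] args) throws Exception {
    // needs the spark-sql-kafka-0-10 package on the classpath
    SparkSession spark = SparkSession.builder()
        .appName("metron-streaming-poc").getOrCreate();

    // read a Metron topic straight off Kafka
    Dataset<Row> events = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "node1:6667")
        .option("subscribe", "indexing")
        .load();

    events.selectExpr("CAST(value AS STRING) AS json")
        .writeStream()
        .format("console")                       // stand-in for a real writer
        .trigger(Trigger.Continuous("1 second")) // alpha in Spark 2.3
        .start()
        .awaitTermination();
  }
}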

I wonder whether it's worth considering an existing Spark 'host' such as
Apache Livy for managing jobs (not sure if that would actually add any
value), and I'm particularly keen on being able to use things like Spark to
query historical data under our current DAOs to drive the UI.

Simon


On 30 July 2018 at 14:50, Otto Fowler  wrote:

> I think the feature branch is a good idea, but what is in the feature
> branch or feature branches will have to shake out.
>
> I agree in concept with what you have in the jira, but I have two points.
>
>1. We will need a break down of introducing Spark to the stack
>   - required version due to HDP support
>   - do we want to update HDP support before this?
>   - Spark tuning/defaults
>   - Spark configuration support / UI etc
>   - more….
>2.
>
>When I read this, it seems like a Lambda architecture approach. Should
>we, as part of this start exploring the possibility to replacing storm
> with
>spark streaming such that we do not have to maintain separate streaming
> vs.
>batch codebases?
>3. This mechanism would be used in the future for telemetry ‘replay’.
>That would mean that ( IMHO )
>   - we should understand that case as well for this
>   - build this capability out such that it is generic enough that a
>   second use will not warrant a re-write or huge refactor
>
> I think this breaks down to a few sets of functionality:
>
>-
>
>Base support for deployment, management or spark
>-
>
>Metron services for triggering, and monitoring of Apache Spark ( on
>demand and constant ), maybe rest stuff like the caps
>-
>
>UI / Stellar base support
>-
>
>Build out of Batch Profiler service on top of that
>-
>
>Build out of replay service on top of that ( plus all the replay stuff
>that needs to also be done - like are you replacing data or having two
>sets…. trial runs etc )
>-
>
>
>-
>
>profit
>
>
>
>
> On July 27, 2018 at 11:29:51, Nick Allen (n...@nickallen.org) wrote:
>
> Hi Everyone -
>
> A while back I opened up a discuss thread around the general idea of a
> Batch Profiler [1]. I'd like to start making progress on a first draft of
> that functionality.
>
> I created METRON-1699 [2] which outlines the general approach and ideas.
> If you're interested, review that JIRA and let me know if you have
> feedback. I will be adding sub-tasks to that JIRA as I make progress and
> can separate it into logical bits for review.
>
> I would like this effort to use a feature branch as it will take a number
> of PRs to get a first cut on the functionality. Pending no disagreement, I
> will create the feature branch based on METRON-1699.
>
> [1]
> https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4e
> e601041fb47bfc97acb6825083@%3Cdev...
>
> <
> https://lists.apache.org/thread.html/d28d18cc9358f5d9c276c7c304ff4e
> e601041fb47bfc97acb6825083@%3Cdev.metron.apache.org%3E>
>
> [2] https://issues.apache.org/jira/browse/METRON-1699
>



-- 
--
simon elliston ball
@sireb


Re: Security Feature Branch?

2018-07-12 Thread Simon Elliston Ball
It certainly might make for an interesting automated response if Metron sent 
alerts to Knox to block IPs, for example, or users in the case of strange 
behavior (though I wonder if that would generally be done upstream in the 
authentication source, e.g. blocking an AD account).

Pushing Knox audit events to Metron would also provide an interesting data 
source for things like brute force detection, or strange access patterns. I’ve 
seen people doing similar things with Ranger. 

So far the only real issue I’ve seen with Knox itself is replacing 
certificates. I was hoping to do that by providing a blueprint setting in 
Ambari with a proper cert, but that’s probably a question for dev@ambari. Great 
to see so many projects working together! 

Simon 

> On 12 Jul 2018, at 17:10, larry mccay  wrote:
> 
> Glad to see this work being done!
> Please feel free to reach out to Knox dev@ list for any assistance and
> potentially review.
> 
> Only sort of related, I have been thinking about another integration
> between Knox and Metron wherein possible threat details can be communicated
> to Knox to take action on at authentication/authorization time.
> Knox could also potentially push interesting events like possible brute
> force login attempts to Metron.
> Some bi-directional pub-sub model?
> 
> Thoughts?
> 
>> On Thu, Jul 12, 2018 at 11:57 AM, Casey Stella  wrote:
>> 
>> I added the feature branch: feature/METRON-1663-knoxsso
>> 
>> https://git-wip-us.apache.org/repos/asf?p=metron.git;a=
>> shortlog;h=refs/heads/feature/METRON-1663-knoxsso
>> 
>> On Thu, Jul 12, 2018 at 11:13 AM Otto Fowler 
>> wrote:
>> 
>>> I think I understand what you are saying very very very well Simon.  I am
>>> not sure what would be different about your submittal from other
>> submittals
>>> where that argument failed.
>>> 
>>> On July 12, 2018 at 11:07:02, Simon Elliston Ball (
>>> si...@simonellistonball.com) wrote:
>>> 
>>> Agreed Otto, the challenge is that essentially each change cuts across
>>> dependencies in every component. I could break it down into the changes
>> for
>>> making SSO work, and the changes for making it install, and the changes
>> for
>>> making full-dev work, but that would mean violating our policy that
>> testing
>>> should be done for each PR on full dev, hence the one PR one unit
>> approach.
>>> Does that work, or do we want to review on the basis of a series of
>>> untestable bits, and then a final working build PR that pulls it
>> together?
>>> 
>>> Simon
>>> 
>>>> On 12 July 2018 at 16:00, Otto Fowler  wrote:
>>>> 
>>>> Our policy in the past on such things is to require that they are
>> broken
>>>> into small reviewable chunks on a feature branch, even if the end to
>> end
>>>> working version was more ‘usable’.
>>>> 
>>>> 
>>>> 
>>>> On July 12, 2018 at 10:51:30, Simon Elliston Ball (
>>>> si...@simonellistonball.com) wrote:
>>>> 
>>>> I've been doing some work on getting the Metron UIs and REST layers to
>>> work
>>>> with Apache KnoxSSO, and LDAP authentication, to remove the need to
>> store
>>>> passwords in MySQL, allow AD integration, secure up our authentication
>>>> points. I'm also working in a Knox service to allow the gateway to
>>> provide
>>>> full SSL for the interfaces and avoid all the proxying and CORS things
>> we
>>>> have to worry about.
>>>> 
>>>> This has ended up being a pretty chunky piece of work which involves
>> very
>>>> significant changes to the UIs, REST layer, and introduces Knox to the
>>>> blueprint, as well as messing with the full-dev build scripts, and
>> adding
>>>> ansible roles.
>>>> 
>>>> As such, in-order to make it a bit more reviewable, would it be better
>> to
>>>> contribute it to a feature branch? It could arguably be broken into a
>>>> series of PRs, but at least some parts of full dev would be broken
>>> between
>>>> most of the logical steps, since it's all kinda co-dependent, so it's
>>>> easier to look at as a unit.
>>>> 
>>>> Simon
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> --
>>> simon elliston ball
>>> @sireb
>>> 
>> 


Re: Security Feature Branch?

2018-07-12 Thread Simon Elliston Ball
Agreed Otto, the challenge is that essentially each change cuts across
dependencies in every component. I could break it down into the changes for
making SSO work, and the changes for making it install, and the changes for
making full-dev work, but that would mean violating our policy that testing
should be done for each PR on full dev, hence the one PR one unit approach.
Does that work, or do we want to review on the basis of a series of
untestable bits, and then a final working build PR that pulls it together?

Simon

On 12 July 2018 at 16:00, Otto Fowler  wrote:

> Our policy in the past on such things is to require that they are broken
> into small reviewable chunks on a feature branch, even if the end to end
> working version was more ‘usable’.
>
>
>
> On July 12, 2018 at 10:51:30, Simon Elliston Ball (
> si...@simonellistonball.com) wrote:
>
> I've been doing some work on getting the Metron UIs and REST layers to
> work
> with Apache KnoxSSO, and LDAP authentication, to remove the need to store
> passwords in MySQL, allow AD integration, secure up our authentication
> points. I'm also working in a Knox service to allow the gateway to provide
> full SSL for the interfaces and avoid all the proxying and CORS things we
> have to worry about.
>
> This has ended up being a pretty chunky piece of work which involves very
> significant changes to the UIs, REST layer, and introduces Knox to the
> blueprint, as well as messing with the full-dev build scripts, and adding
> ansible roles.
>
> As such, in-order to make it a bit more reviewable, would it be better to
> contribute it to a feature branch? It could arguably be broken into a
> series of PRs, but at least some parts of full dev would be broken between
> most of the logical steps, since it's all kinda co-dependent, so it's
> easier to look at as a unit.
>
> Simon
>
>


-- 
--
simon elliston ball
@sireb


Security Feature Branch?

2018-07-12 Thread Simon Elliston Ball
I've been doing some work on getting the Metron UIs and REST layers to work
with Apache KnoxSSO, and LDAP authentication, to remove the need to store
passwords in MySQL, allow AD integration, and secure our authentication
points. I'm also working on a Knox service to allow the gateway to provide
full SSL for the interfaces and avoid all the proxying and CORS things we
have to worry about.

This has ended up being a pretty chunky piece of work which involves very
significant changes to the UIs, REST layer, and introduces Knox to the
blueprint, as well as messing with the full-dev build scripts, and adding
ansible roles.

As such, in order to make it a bit more reviewable, would it be better to
contribute it to a feature branch? It could arguably be broken into a
series of PRs, but at least some parts of full dev would be broken between
most of the logical steps, since it's all kinda co-dependent, so it's
easier to look at as a unit.

Simon


Re: Performance comparison between Grok and Java regex

2018-07-11 Thread Simon Elliston Ball
A streaming token parser might well get you good performance for that format... 
maybe something like an ANTLR grammar or even a simple scanner. Regex is not
the only pattern :) 
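
As a rough illustration of the scanner idea (this is plain Java, not tied to
any Metron parser interface, and the record layout is just the AD sample
quoted below), a hand-rolled key=value tokenizer avoids regex entirely:

import java.util.LinkedHashMap;
import java.util.Map;

public class KeyValueScanner {

  // Walks the record once: each '=' marks a key (the token immediately before it)
  // and the value runs up to the start of the next key. Free text before the
  // first key (e.g. the leading timestamp) is simply ignored.
  public static Map<String, String> parse(String record) {
    Map<String, String> fields = new LinkedHashMap<>();
    String pendingKey = null;
    int valueStart = -1;
    int i = 0;
    while (i < record.length()) {
      int eq = record.indexOf('=', i);
      if (eq < 0) {
        break;
      }
      int keyStart = eq;
      while (keyStart > 0 && !Character.isWhitespace(record.charAt(keyStart - 1))) {
        keyStart--;
      }
      if (pendingKey != null) {
        fields.put(pendingKey, record.substring(valueStart, keyStart).trim());
      }
      pendingKey = record.substring(keyStart, eq);
      valueStart = eq + 1;
      i = eq + 1;
    }
    if (pendingKey != null) {
      fields.put(pendingKey, record.substring(valueStart).trim());
    }
    return fields;
  }

  public static void main(String[] args) {
    String sample = "EventCode=4625 EventType=0 Type=Information "
        + "ComputerName=dc1.ad.ecorp.com TaskCategory=Logon OpCode=Info";
    parse(sample).forEach((k, v) -> System.out.println(k + " -> " + v));
  }
}

The obvious limitation is that a value containing a token with an '=' in it
would be split incorrectly, so a production parser would want to know the
expected field names, but it shows why a purpose-built scanner can beat a
large compiled regex.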

It would also be great to see such a parser contributed back to the community 
if possible, and I'm sure we would be happy to help maintain and improve it in 
the open source.

Simon

> On 11 Jul 2018, at 16:26, Muhammed Irshad  wrote:
> 
> Otto Fowler,
> 
> Yes, I am Ok with the trade-offs. In case of Active Directory log records
> can I parse it using non-regex custom parser ? I think we need one pattern
> matching library right as it is plain text thing ? One of the dummy AD
> record of my use case would be like this below.
> 
> 
> 12/02/2017 05:14:43 PM LogName=Security SourceName=Microsoft Windows
> security auditing. EventCode=4625 EventType=0 Type=Information ComputerName=
> dc1.ad.ecorp.com TaskCategory=Logon OpCode=Info
> RecordNumber=95055509895231650867 Keywords=Audit Success Message=An account
> failed to log on. Subject: Security ID: NULL SID Account Name: - Account
> Domain: - Logon ID: 0x0 Logon Type: 3 Account For Which Logon Failed:
> Security ID: NULL SID Account Name: K1560365938U$ Account Domain: ECORP
> Failure Information: Failure Reason: Unknown user name or bad password.
> Status: 0xC06D Sub Status: 0xC06A Network Information: Workstation
> Name: K1560365938U Source Network Address: 192.168.151.95 Source Port:
> 53176 Detailed Authentification Information: Logon Process: NtLmSsp
> Authentification Package: NTLM Transited Services: - Package Name (NTLM
> ONLY): - Key Length: 0 This event is generated when a logon request fails.
> It is generated on the computer where access was attempted. The Subject
> fields indicate the account on the local system which requested the logon.
> This is most commonly a service such as the Server service, or a local
> process such as Winlogon.exe or Services.exe. The Logon Type field
> indicates the kind of logon that was requested. The most common types are 2
> (interactive) and 3 (network). The Process Information fields indicate
> which account and process on the system requested the logon. The Network
> Information fields indicate where a remote logon request originated.
> Workstation name is not always available and may be left blank in some
> cases. The authentication information fields provide detailed information
> about this specific logon request. Transited services indicate which
> intermediate services have participated in this logon request. Package name
> indicates which sub-protocol was used among the NTLM protocols
> 
> On Wed, Jul 11, 2018 at 8:44 PM, Otto Fowler 
> wrote:
> 
>> I am not saying it is faster, just giving some info.
>> 
>> Also, that part of the documentation is not referring to regex v. grok,
>> but grok verses a custom non-regex parser, at least as I read it.
>> 
>> If you have the ability to build, deploy, test and maintain a custom
>> parser ( unless you will be submitting it to the project? ), then in most
>> cases where performance
>> is the top issue ( or rather throughput ) you are most likely going to get
>> better results that way.  Accepting that you are ok with the tradeoffs.
>> 
>> If you have 10M mps parsing might night be your bottleneck.
>> 
>> 
>> 
>> 
>> 
>> On July 11, 2018 at 11:01:19, Muhammed Irshad (irshadkt@gmail.com)
>> wrote:
>> 
>> Otto Fowler,
>> 
>> Thanks for the reply. I saw it uses same Java regex under the hood. I got
>> bit sceptic by seeing this open issue
>>  in java-grok which
>> says
>> grok is much slower when compared with pure regex. The fix is not
>> available
>> yet in metron as it need few changes in the API and issue to be closed. As
>> data volume is so huge in my requirement I had to double check and confirm
>> before I go with one. Also metron documentation
>> > metron-parsers/index.html>
>> itself says the below statement under Parser Adapter section.
>> 
>> "Grok parser adapters are designed primarly for someone who is not a Java
>> coder for quickly standing up a parser adapter for lower velocity
>> topologies. Grok relies on Regex for message parsing, which is much slower
>> than purpose-built Java parsers, but is more extensible. Grok parsers are
>> defined via a config file and the topplogy does not need to be recombiled
>> in order to make changes to them."
>> 
>> On Wed, Jul 11, 2018 at 8:01 PM, Otto Fowler 
>> wrote:
>> 
>>> Java-Grok IS java regex. It is just a DSL over Java regex. It takes grok
>>> expressions ( that can reference other expressions and be compound ) and
>>> parses/resolves them and then builds one big regex out of them.
>>> Also, Groks, once parsed / used are re-used, so at that point they are
>>> like compiled regex’s.
>>> 
>>> That is not to say that that takes 0 time, but it may help you to
>>> understand.
>>> 
>>> https://github.com/thekrakken/java

Re: Architectural reason to split in 4 topologies / impact on the kafka ressources

2018-06-25 Thread Simon Elliston Ball
> > >>  > For example, why the parsing and enrichment topologies have not
> been
> > >>  > merged? Would it not be possible when you parse the message to
> > directly
> > >>  > enricht it?
> > >>  >
> > >>  > Im asking that because splitting in several topologies means that
> > all of
> > >>  > the topologies read/write to Kafka, which produce a bigger load on
> > the
> > >>  > kafka cluster and then a need for way more infrastructure/servers.
> > The
> > >>  cost
> > >>  > is especially true when we speak about TBs of data ingested every
> > day.
> > >>  >
> > >>  > Im sure there were a very good reason, I was just curious.
> > >>  >
> > >>  > Thanks,
> > >>  > Michel
> > >>  >
> >
> > ---
> > Thank you,
> >
> > James Sirota
> > PMC- Apache Metron
> > jsirota AT apache DOT org
> >
> >
>



-- 
--
simon elliston ball
@sireb


Re: Writing enrichment data directly from NiFi with PutHBaseJSON

2018-06-13 Thread Simon Elliston Ball
Not convinced we should be writing Jiras against the Metron project, or the
NiFi project, if we don't know where it's actually going to end up, to be
honest. In any case, working code:
https://github.com/simonellistonball/metron/tree/nifi/nifi-metron-bundle
which is currently in a Metron fork, for no particular reason. Also, it
needs proper tests, docs and all that jazz, but at PoC grade it works,
scales, and is moderately robust as long as HBase doesn't fall over too
much.

Simon

On 13 June 2018 at 15:24, Otto Fowler  wrote:

> Do we even have a jira?  If not maybe Carolyn et. al. can write one up that
> lays out some
> requirements and context.
>
>
> On June 13, 2018 at 10:04:27, Casey Stella (ceste...@gmail.com) wrote:
>
> no, sadly we do not.
>
> On Wed, Jun 13, 2018 at 10:01 AM Carolyn Duby 
> wrote:
>
> > Agreed….Streaming enrichments is the right solution for DNS data.
> >
> > Do we have a web service for writing enrichments?
> >
> > Carolyn Duby
> > Solutions Engineer, Northeast
> > cd...@hortonworks.com
> > +1.508.965.0584
> >
> > Join my team!
> > Enterprise Account Manager – Boston - http://grnh.se/wepchv1
> > Solutions Engineer – Boston - http://grnh.se/8gbxy41
> > Need Answers? Try https://community.hortonworks.com <
> > https://community.hortonworks.com/answers/index.html>
> >
> >
> >
> >
> >
> >
> >
> >
> > On 6/13/18, 6:25 AM, "Charles Joynt" 
> > wrote:
> >
> > >Regarding why I didn't choose to load data with the flatfile loader
> > script...
> > >
> > >I want to be able to SEND enrichment data to Metron rather than have to
> > set up cron jobs to PULL data. At the moment I'm trying to prove that the
> > process works with a simple data source. In the future we will want
> > enrichment data in Metron that comes from systems (e.g. HR databases)
> that
> > I won't have access to, hence will need someone to be able to send us the
> > data.
> > >
> > >> Carolyn: just call the flat file loader from a script processor...
> > >
> > >I didn't believe that would work in my environment. I'm pretty sure the
> > script has dependencies on various Metron JARs, not least for the row id
> > hashing algorithm. I suppose this would require at least a partial
> install
> > of Metron alongside NiFi, and would introduce additional work on the NiFi
> > cluster for any Metron upgrade. In some (enterprise) environments there
> > might be separation of ownership between NiFi and Metron.
> > >
> > >I also prefer not to have a Java app calling a bash script which calls a
> > new java process, with logs or error output that might just get swallowed
> > up invisibly. Somewhere down the line this could hold up effective
> > troubleshooting.
> > >
> > >> Simon: I have actually written a stellar processor, which applies
> > stellar to all FlowFile attributes...
> > >
> > >Gulp.
> > >
> > >> Simon: what didn't you like about the flatfile loader script?
> > >
> > >The flatfile loader script has worked fine for me when prepping
> > enrichment data in test systems, however it was a bit of a chore to get
> the
> > JSON configuration files set up, especially for "wide" data sources that
> > may have 15-20 fields, e.g. Active Directory.
> > >
> > >More broadly speaking, I want to embrace the streaming data paradigm and
> > tried to avoid batch jobs. With the DNS example, you might imagine a
> future
> > where the enrichment data is streamed based on DHCP registrations, DNS
> > update events, etc. In principle this could reduce the window of time
> where
> > we might enrich a data source with out-of-date data.
> > >
> > >Charlie
> > >
> > >-Original Message-
> > >From: Carolyn Duby [mailto:cd...@hortonworks.com]
> > >Sent: 12 June 2018 20:33
> > >To: dev@metron.apache.org
> > >Subject: Re: Writing enrichment data directly from NiFi with
> PutHBaseJSON
> > >
> > >I like the streaming enrichment solutions but it depends on how you are
> > getting the data in. If you get the data in a csv file just call the flat
> > file loader from a script processor. No special Nifi required.
> > >
> > >If the enrichments don’t arrive in bulk, the streaming solution is
> better.
> > >
> > >Thanks
> > >Carolyn Duby
> > >Solutions Engineer, Northeast
> > >cd...@hortonworks.com
> > >+1.508.96

Re: Writing enrichment data directly from NiFi with PutHBaseJSON

2018-06-13 Thread Simon Elliston Ball
That’s where something like the NiFi solution would come in... 

With the PutEnrichment processor and a ProcessHttpRequest processor, you do 
have a web service for loading enrichments.

We could probably also create a REST service endpoint for it, which would make 
some sense, but there is a nice multi-source, queuing, and lineage element to 
the NiFi solution.

Simon 

> On 13 Jun 2018, at 15:04, Casey Stella  wrote:
> 
> no, sadly we do not.
> 
>> On Wed, Jun 13, 2018 at 10:01 AM Carolyn Duby  wrote:
>> 
>> Agreed….Streaming enrichments is the right solution for DNS data.
>> 
>> Do we have a web service for writing enrichments?
>> 
>> Carolyn Duby
>> Solutions Engineer, Northeast
>> cd...@hortonworks.com
>> +1.508.965.0584
>> 
>> Join my team!
>> Enterprise Account Manager – Boston - http://grnh.se/wepchv1
>> Solutions Engineer – Boston - http://grnh.se/8gbxy41
>> Need Answers? Try https://community.hortonworks.com <
>> https://community.hortonworks.com/answers/index.html>
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On 6/13/18, 6:25 AM, "Charles Joynt" 
>> wrote:
>> 
>>> Regarding why I didn't choose to load data with the flatfile loader
>> script...
>>> 
>>> I want to be able to SEND enrichment data to Metron rather than have to
>> set up cron jobs to PULL data. At the moment I'm trying to prove that the
>> process works with a simple data source. In the future we will want
>> enrichment data in Metron that comes from systems (e.g. HR databases) that
>> I won't have access to, hence will need someone to be able to send us the
>> data.
>>> 
>>>> Carolyn: just call the flat file loader from a script processor...
>>> 
>>> I didn't believe that would work in my environment. I'm pretty sure the
>> script has dependencies on various Metron JARs, not least for the row id
>> hashing algorithm. I suppose this would require at least a partial install
>> of Metron alongside NiFi, and would introduce additional work on the NiFi
>> cluster for any Metron upgrade. In some (enterprise) environments there
>> might be separation of ownership between NiFi and Metron.
>>> 
>>> I also prefer not to have a Java app calling a bash script which calls a
>> new java process, with logs or error output that might just get swallowed
>> up invisibly. Somewhere down the line this could hold up effective
>> troubleshooting.
>>> 
>>>> Simon: I have actually written a stellar processor, which applies
>> stellar to all FlowFile attributes...
>>> 
>>> Gulp.
>>> 
>>>> Simon: what didn't you like about the flatfile loader script?
>>> 
>>> The flatfile loader script has worked fine for me when prepping
>> enrichment data in test systems, however it was a bit of a chore to get the
>> JSON configuration files set up, especially for "wide" data sources that
>> may have 15-20 fields, e.g. Active Directory.
>>> 
>>> More broadly speaking, I want to embrace the streaming data paradigm and
>> tried to avoid batch jobs. With the DNS example, you might imagine a future
>> where the enrichment data is streamed based on DHCP registrations, DNS
>> update events, etc. In principle this could reduce the window of time where
>> we might enrich a data source with out-of-date data.
>>> 
>>> Charlie
>>> 
>>> -Original Message-
>>> From: Carolyn Duby [mailto:cd...@hortonworks.com]
>>> Sent: 12 June 2018 20:33
>>> To: dev@metron.apache.org
>>> Subject: Re: Writing enrichment data directly from NiFi with PutHBaseJSON
>>> 
>>> I like the streaming enrichment solutions but it depends on how you are
>> getting the data in.  If you get the data in a csv file just call the flat
>> file loader from a script processor.  No special Nifi required.
>>> 
>>> If the enrichments don’t arrive in bulk, the streaming solution is better.
>>> 
>>> Thanks
>>> Carolyn Duby
>>> Solutions Engineer, Northeast
>>> cd...@hortonworks.com
>>> +1.508.965.0584
>>> 
>>> Join my team!
>>> Enterprise Account Manager – Boston - http://grnh.se/wepchv1 Solutions
>> Engineer – Boston - http://grnh.se/8gbxy41 Need Answers? Try
>> https://community.hortonworks.com <
>> https://community.hortonworks.com/answers/index.html>
>>> 
>>> 
>>> On 6/12/18, 1:08 PM, "Simon Elliston Ball" 
>> wrote:
>>>

Re: Writing enrichment data directly from NiFi with PutHBaseJSON

2018-06-12 Thread Simon Elliston Ball
d get from the flatfile_loader.sh
> > script. A colleague of mine has already loaded some DNS data using
> > that script, so I am using that as a reference.
> >
> > I have implemented a flow in NiFi which takes JSON data from a HTTP
> > listener and routes it to a PutHBaseJSON processor. The flow is
> > working, in the sense that data is successfully written to HBase, but
> > despite (naively) specifying "Row Identifier Encoding Strategy =
> > Binary", the results in HBase don't look correct. Comparing the output
> > from HBase scan commands I
> > see:
> >
> > flatfile_loader.sh produced:
> >
> > ROW:
> > \xFF\xFE\xCB\xB8\xEF\x92\xA3\xD9#xC\xF9\xAC\x0Ap\x1E\x00\x05whois\x00\
> > x0E192.168.0.198
> > CELL: column=data:v, timestamp=1516896203840,
> > value={"clientname":"server.domain.local","clientip":"192.168.0.198"}
> >
> > PutHBaseJSON produced:
> >
> > ROW:  server.domain.local
> > CELL: column=dns:v, timestamp=1527778603783,
> > value={"name":"server.domain.local","type":"A","data":"192.168.0.198"}
> >
> > From source JSON:
> >
> >
> > {"k":"server.domain.local","v":{"name":"server.domain.local","type":"A
> > ","data":"192.168.0.198"}}
> >
> > I know that there are some differences in column family / field names,
> > but my worry is the ROW id. Presumably I need to encode my row key,
> > "k" in the JSON data, in a way that matches how the flatfile_loader.sh
> script did it.
> >
> > Can anyone explain how I might convert my Id to the correct format?
> > -or-
> > Does this matter-can Metron use the human-readable ROW ids?
> >
> > Charlie Joynt
> >
> > --
> > G-RESEARCH believes the information provided herein is reliable. While
> > every care has been taken to ensure accuracy, the information is
> > furnished to the recipients with no warranty as to the completeness
> > and accuracy of its contents and on condition that any errors or
> > omissions shall not be made the basis of any claim, demand or cause of
> action.
> > The information in this email is intended only for the named recipient.
> > If you are not the intended recipient please notify us immediately and
> > do not copy, distribute or take action based on this e-mail.
> > All messages sent to and from this e-mail address will be logged by
> > G-RESEARCH and are subject to archival storage, monitoring, review and
> > disclosure.
> > G-RESEARCH is the trading name of Trenchant Limited, 5th Floor,
> > Whittington House, 19-30 Alfred Place, London WC1E 7EA.
> > Trenchant Limited is a company registered in England with company
> > number 08127121.
> > --
> >
>



-- 
--
simon elliston ball
@sireb


Re: Writing enrichment data directly from NiFi with PutHBaseJSON

2018-06-05 Thread Simon Elliston Ball
Also, the bundle would be part of the Metron project I expect, so the NiFi 
release shouldn’t matter much now that NiFi can version individual processors 
independently.

Simon 

> On 5 Jun 2018, at 20:14, Casey Stella  wrote:
> 
> I agree with Simon here, the benefit of providing NiFi tooling is to enable 
> NiFi to use our infrastructure (e.g. our parsers, MaaS, stellar enrichments, 
> etc).  This would tie it to Metron pretty closely.
> 
>> On Tue, Jun 5, 2018 at 3:12 PM Otto Fowler  wrote:
>> Nifi releases more often then Metron does, that might be an issue.
>> 
>> 
>> On June 5, 2018 at 14:07:22, Simon Elliston Ball (
>> si...@simonellistonball.com) wrote:
>> 
>> To be honest, I would expect this to be heavily linked to the Metron
>> releases, since it's going to use other metron classes and dependencies to
>> ensure compatibility. For example, a Stellar NiFi processor will be linked
>> to Metron's stellar-common, the enrichment loader will depend on key
>> construction code from metron-enrichment (and should align to it). I was
>> also considering an opinionated PublishMetron which linked to the Metron
>> kafka, and hid some of the dances you have to do to make the readMetadata
>> functions to work (i.e. some sugar around our mild abuse of kafka keys,
>> which prevents people hurting their kafka by choosing the wrong
>> partitioner).
>> 
>> To that extent, I think the releases belong with Metron releases, though of
>> course that does increase our release and test burden.
>> 
>> On 5 June 2018 at 10:55, Otto Fowler  wrote:
>> 
>> > Similar to Bro, we may need to release out of cycle.
>> >
>> >
>> >
>> > On June 5, 2018 at 13:17:55, Simon Elliston Ball (
>> > si...@simonellistonball.com) wrote:
>> >
>> > Do you mean in the sense of a separate module, or are you suggesting we
>> go
>> > as far as a sub-project?
>> >
>> > On 5 June 2018 at 10:08, Otto Fowler  wrote:
>> >
>> > > If we do that, we should have it as a separate component maybe.
>> > >
>> > >
>> > > On June 5, 2018 at 12:42:57, Simon Elliston Ball (
>> > > si...@simonellistonball.com) wrote:
>> > >
>> > > @otto, well, of course we would use the record api... it's great.
>> > >
>> > > @casey, I have actually written a stellar processor, which applies
>> > stellar
>> > > to all FlowFile attributes outputting the resulting stellar variable
>> > space
>> > > to either attributes or as json in the content.
>> > >
>> > > Is it worth us creating an nifi-metron-bundle. Happy to kick that off,
>> > > since I'm half way there.
>> > >
>> > > Simon
>> > >
>> > >
>> > >
>> > > On 5 June 2018 at 08:41, Otto Fowler  wrote:
>> > >
>> > > > We have jiras about ‘diverting’ and reading from nifi flows already
>> > > >
>> > > >
>> > > > On June 5, 2018 at 11:11:45, Casey Stella (ceste...@gmail.com) wrote:
>> > > >
>> > > > I'd be in strong support of that, Simon. I think we should have some
>> > > other
>> > > > NiFi components in Metron to enable users to interact with our
>> > > > infrastructure from NiFi (e.g. being able to transform via stellar,
>> > > etc).
>> > > >
>> > > > On Tue, Jun 5, 2018 at 10:32 AM Simon Elliston Ball <
>> > > > si...@simonellistonball.com> wrote:
>> > > >
>> > > > > Do we, the community, think it would be a good idea to create a
>> > > > > PutMetronEnrichment NiFi processor for this use case? It seems a
>> > > number
>> > > > of
>> > > > > people want to use NiFi to manage and schedule loading of
>> > enrichments
>> > > for
>> > > > > example.
>> > > > >
>> > > > > Simon
>> > > > >
>> > > > > On 5 June 2018 at 06:56, Casey Stella  wrote:
>> > > > >
>> > > > > > The problem, as you correctly diagnosed, is the key in HBase. We
>> > > > > construct
>> > > > > > the key very specifically in Metron, so it's unlikely to work out
>> > of
>> > > > the
>> > > > > > box with the NiFi processor unfortunately. The key that we use is
>> > > > formed
>> >

Re: Writing enrichment data directly from NiFi with PutHBaseJSON

2018-06-05 Thread Simon Elliston Ball
To be honest, I would expect this to be heavily linked to the Metron
releases, since it's going to use other Metron classes and dependencies to
ensure compatibility. For example, a Stellar NiFi processor will be linked
to Metron's stellar-common, and the enrichment loader will depend on key
construction code from metron-enrichment (and should align to it). I was
also considering an opinionated PublishMetron which linked to the Metron
Kafka dependencies, and hid some of the dances you have to do to make the
readMetadata functions work (i.e. some sugar around our mild abuse of Kafka
keys, which prevents people hurting their Kafka cluster by choosing the wrong
partitioner).

To that extent, I think the releases belong with Metron releases, though of
course that does increase our release and test burden.

On 5 June 2018 at 10:55, Otto Fowler  wrote:

> Similar to Bro, we may need to release out of cycle.
>
>
>
> On June 5, 2018 at 13:17:55, Simon Elliston Ball (
> si...@simonellistonball.com) wrote:
>
> Do you mean in the sense of a separate module, or are you suggesting we go
> as far as a sub-project?
>
> On 5 June 2018 at 10:08, Otto Fowler  wrote:
>
> > If we do that, we should have it as a separate component maybe.
> >
> >
> > On June 5, 2018 at 12:42:57, Simon Elliston Ball (
> > si...@simonellistonball.com) wrote:
> >
> > @otto, well, of course we would use the record api... it's great.
> >
> > @casey, I have actually written a stellar processor, which applies
> stellar
> > to all FlowFile attributes outputting the resulting stellar variable
> space
> > to either attributes or as json in the content.
> >
> > Is it worth us creating an nifi-metron-bundle. Happy to kick that off,
> > since I'm half way there.
> >
> > Simon
> >
> >
> >
> > On 5 June 2018 at 08:41, Otto Fowler  wrote:
> >
> > > We have jiras about ‘diverting’ and reading from nifi flows already
> > >
> > >
> > > On June 5, 2018 at 11:11:45, Casey Stella (ceste...@gmail.com) wrote:
> > >
> > > I'd be in strong support of that, Simon. I think we should have some
> > other
> > > NiFi components in Metron to enable users to interact with our
> > > infrastructure from NiFi (e.g. being able to transform via stellar,
> > etc).
> > >
> > > On Tue, Jun 5, 2018 at 10:32 AM Simon Elliston Ball <
> > > si...@simonellistonball.com> wrote:
> > >
> > > > Do we, the community, think it would be a good idea to create a
> > > > PutMetronEnrichment NiFi processor for this use case? It seems a
> > number
> > > of
> > > > people want to use NiFi to manage and schedule loading of
> enrichments
> > for
> > > > example.
> > > >
> > > > Simon
> > > >
> > > > On 5 June 2018 at 06:56, Casey Stella  wrote:
> > > >
> > > > > The problem, as you correctly diagnosed, is the key in HBase. We
> > > > construct
> > > > > the key very specifically in Metron, so it's unlikely to work out
> of
> > > the
> > > > > box with the NiFi processor unfortunately. The key that we use is
> > > formed
> > > > > here in the codebase:
> > > > > https://github.com/cestella/incubator-metron/blob/master/
> > > > > metron-platform/metron-enrichment/src/main/java/org/
> > > > > apache/metron/enrichment/converter/EnrichmentKey.java#L51
> > > > >
> > > > > To put that in english, consider the following:
> > > > >
> > > > > - type - The enrichment type
> > > > > - indicator - the indicator to use
> > > > > - hash(*) - A murmur 3 128bit hash function
> > > > >
> > > > > the key is hash(indicator) + type + indicator
> > > > >
> > > > > This hash prefixing is a standard practice in hbase key design
> that
> > > > allows
> > > > > the keys to be uniformly distributed among the regions and
> prevents
> > > > > hotspotting. Depending on how the PutHBaseJSON processor works, if
> > you
> > > > can
> > > > > construct the key and pass it in, then you might be able to either
> > > > > construct the key in NiFi or write a processor to construct the
> key.
> > > > > Ultimately though, what Carolyn said is true..the easiest approach
> > is
> > > > > probably using the flatfile loader.
> > > > > If you do get this working in NiFi, however, do please let

Re: Writing enrichment data directly from NiFi with PutHBaseJSON

2018-06-05 Thread Simon Elliston Ball
Do you mean in the sense of a separate module, or are you suggesting we go
as far as a sub-project?

On 5 June 2018 at 10:08, Otto Fowler  wrote:

> If we do that, we should have it as a separate component maybe.
>
>
> On June 5, 2018 at 12:42:57, Simon Elliston Ball (
> si...@simonellistonball.com) wrote:
>
> @otto, well, of course we would use the record api... it's great.
>
> @casey, I have actually written a stellar processor, which applies stellar
> to all FlowFile attributes outputting the resulting stellar variable space
> to either attributes or as json in the content.
>
> Is it worth us creating an nifi-metron-bundle. Happy to kick that off,
> since I'm half way there.
>
> Simon
>
>
>
> On 5 June 2018 at 08:41, Otto Fowler  wrote:
>
> > We have jiras about ‘diverting’ and reading from nifi flows already
> >
> >
> > On June 5, 2018 at 11:11:45, Casey Stella (ceste...@gmail.com) wrote:
> >
> > I'd be in strong support of that, Simon. I think we should have some
> other
> > NiFi components in Metron to enable users to interact with our
> > infrastructure from NiFi (e.g. being able to transform via stellar,
> etc).
> >
> > On Tue, Jun 5, 2018 at 10:32 AM Simon Elliston Ball <
> > si...@simonellistonball.com> wrote:
> >
> > > Do we, the community, think it would be a good idea to create a
> > > PutMetronEnrichment NiFi processor for this use case? It seems a
> number
> > of
> > > people want to use NiFi to manage and schedule loading of enrichments
> for
> > > example.
> > >
> > > Simon
> > >
> > > On 5 June 2018 at 06:56, Casey Stella  wrote:
> > >
> > > > The problem, as you correctly diagnosed, is the key in HBase. We
> > > construct
> > > > the key very specifically in Metron, so it's unlikely to work out of
> > the
> > > > box with the NiFi processor unfortunately. The key that we use is
> > formed
> > > > here in the codebase:
> > > > https://github.com/cestella/incubator-metron/blob/master/
> > > > metron-platform/metron-enrichment/src/main/java/org/
> > > > apache/metron/enrichment/converter/EnrichmentKey.java#L51
> > > >
> > > > To put that in english, consider the following:
> > > >
> > > > - type - The enrichment type
> > > > - indicator - the indicator to use
> > > > - hash(*) - A murmur 3 128bit hash function
> > > >
> > > > the key is hash(indicator) + type + indicator
> > > >
> > > > This hash prefixing is a standard practice in hbase key design that
> > > allows
> > > > the keys to be uniformly distributed among the regions and prevents
> > > > hotspotting. Depending on how the PutHBaseJSON processor works, if
> you
> > > can
> > > > construct the key and pass it in, then you might be able to either
> > > > construct the key in NiFi or write a processor to construct the key.
> > > > Ultimately though, what Carolyn said is true..the easiest approach
> is
> > > > probably using the flatfile loader.
> > > > If you do get this working in NiFi, however, do please let us know
> > and/or
> > > > consider contributing it back to the project as a PR :)
> > > >
> > > >
> > > >
> > > > On Fri, Jun 1, 2018 at 6:26 AM Charles Joynt <
> > > > charles.jo...@gresearch.co.uk>
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I work as a Dev/Ops Data Engineer within the security team at a
> > company
> > > > in
> > > > > London where we are in the process of implementing Metron. I have
> > been
> > > > > tasked with implementing feeds of network environment data into
> HBase
> > > so
> > > > > that this data can be used as enrichment sources for our security
> > > events.
> > > > > First-off I wanted to pull in DNS data for an internal domain.
> > > > >
> > > > > I am assuming that I need to write data into HBase in such a way
> that
> > > it
> > > > > exactly matches what I would get from the flatfile_loader.sh
> script.
> > A
> > > > > colleague of mine has already loaded some DNS data using that
> script,
> > > so
> > > > I
> > > > > am using that as a reference.
> > > > >
> > > > > I

Re: Writing enrichment data directly from NiFi with PutHBaseJSON

2018-06-05 Thread Simon Elliston Ball
@otto, well, of course we would use the record API... it's great.

@casey, I have actually written a Stellar processor, which applies Stellar
to all FlowFile attributes, outputting the resulting Stellar variable space
to either attributes or as JSON in the content.

Is it worth us creating a nifi-metron-bundle? Happy to kick that off,
since I'm half way there.

Simon



On 5 June 2018 at 08:41, Otto Fowler  wrote:

> We have jiras about ‘diverting’ and reading from nifi flows already
>
>
> On June 5, 2018 at 11:11:45, Casey Stella (ceste...@gmail.com) wrote:
>
> I'd be in strong support of that, Simon. I think we should have some other
> NiFi components in Metron to enable users to interact with our
> infrastructure from NiFi (e.g. being able to transform via stellar, etc).
>
> On Tue, Jun 5, 2018 at 10:32 AM Simon Elliston Ball <
> si...@simonellistonball.com> wrote:
>
> > Do we, the community, think it would be a good idea to create a
> > PutMetronEnrichment NiFi processor for this use case? It seems a number
> of
> > people want to use NiFi to manage and schedule loading of enrichments for
> > example.
> >
> > Simon
> >
> > On 5 June 2018 at 06:56, Casey Stella  wrote:
> >
> > > The problem, as you correctly diagnosed, is the key in HBase. We
> > construct
> > > the key very specifically in Metron, so it's unlikely to work out of
> the
> > > box with the NiFi processor unfortunately. The key that we use is
> formed
> > > here in the codebase:
> > > https://github.com/cestella/incubator-metron/blob/master/
> > > metron-platform/metron-enrichment/src/main/java/org/
> > > apache/metron/enrichment/converter/EnrichmentKey.java#L51
> > >
> > > To put that in english, consider the following:
> > >
> > > - type - The enrichment type
> > > - indicator - the indicator to use
> > > - hash(*) - A murmur 3 128bit hash function
> > >
> > > the key is hash(indicator) + type + indicator
> > >
> > > This hash prefixing is a standard practice in hbase key design that
> > allows
> > > the keys to be uniformly distributed among the regions and prevents
> > > hotspotting. Depending on how the PutHBaseJSON processor works, if you
> > can
> > > construct the key and pass it in, then you might be able to either
> > > construct the key in NiFi or write a processor to construct the key.
> > > Ultimately though, what Carolyn said is true..the easiest approach is
> > > probably using the flatfile loader.
> > > If you do get this working in NiFi, however, do please let us know
> and/or
> > > consider contributing it back to the project as a PR :)
> > >
> > >
> > >
> > > On Fri, Jun 1, 2018 at 6:26 AM Charles Joynt <
> > > charles.jo...@gresearch.co.uk>
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > I work as a Dev/Ops Data Engineer within the security team at a
> company
> > > in
> > > > London where we are in the process of implementing Metron. I have
> been
> > > > tasked with implementing feeds of network environment data into HBase
> > so
> > > > that this data can be used as enrichment sources for our security
> > events.
> > > > First-off I wanted to pull in DNS data for an internal domain.
> > > >
> > > > I am assuming that I need to write data into HBase in such a way that
> > it
> > > > exactly matches what I would get from the flatfile_loader.sh script.
> A
> > > > colleague of mine has already loaded some DNS data using that script,
> > so
> > > I
> > > > am using that as a reference.
> > > >
> > > > I have implemented a flow in NiFi which takes JSON data from a HTTP
> > > > listener and routes it to a PutHBaseJSON processor. The flow is
> > working,
> > > in
> > > > the sense that data is successfully written to HBase, but despite
> > > (naively)
> > > > specifying "Row Identifier Encoding Strategy = Binary", the results
> in
> > > > HBase don't look correct. Comparing the output from HBase scan
> > commands I
> > > > see:
> > > >
> > > > flatfile_loader.sh produced:
> > > >
> > > > ROW:
> > > > \xFF\xFE\xCB\xB8\xEF\x92\xA3\xD9#xC\xF9\xAC\x0Ap\x1E\x00\
> > > x05whois\x00\x0E192.168.0.198
> > > > CELL: column=data:v, timestamp=1516896203840,
> > > > value={"cli

Re: [DISCUSS] Field conversions

2018-06-05 Thread Simon Elliston Ball
Yes, anything using Elasticsearch would need the field names changed. That said, 
people who are on such an old version (now EOL) will need to bite the bullet on 
ES compatibility at some point.

Simon 

> On 5 Jun 2018, at 17:17, Otto Fowler  wrote:
> 
> Are there consequences with Kibana as well?  queries, visualizations,
> templates they may have?
> 
> 
> On June 5, 2018 at 12:03:44, Nick Allen (n...@nickallen.org) wrote:
> 
> I just don't know if telling users to do a bulk upgrade of their indices is
> sufficient enough of an upgrade path. I would expect some to have
> downstream processes dependent on those field names, which would also need
> to be updated.
> 
> Although, we could tell users to do any field name conversions that they
> depend on using parser transformations; rather than the
> `FieldNameConverter` abstractions. I *think* that would be a valid upgrade
> path where we could just revert #1022.
> 
>> On Tue, Jun 5, 2018 at 10:34 AM, Nick Allen  wrote:
>> 
>> I am in favor of removing the `FieldNameConverter` abstraction as an end
>> state. Although, I don't agree with Simon that we could have just done
>> that directly without providing a backwards compatible solution as was
> done
>> in #1022. There are too many touch points that rely on that conversion
> and
>> users who expect fields to land in their indices named a certain way (no
>> matter what version of ES they are running). If I am wrong and there is a
>> better approach that works, then we should just revert #1022.
>> 
>> On Tue, Jun 5, 2018 at 9:37 AM, Simon Elliston Ball <
>> si...@simonellistonball.com> wrote:
>> 
>>> I would definitely agree that the transformation should be removed. We
>>> have
>>> now however added a complex generic solution in the backend, which is
>>> going
>>> to be noop for most people. This was done I believe for the sake of
>>> backward compatibility. I would argue however, that there is no need to
>>> support ES 2.3, and therefore no need to support de-dotting
>>> transformations. This does seem somewhat over-engineered to me, though
> it
>>> does save people re-indexing on upgrades. I suspect in reality that this
>>> is
>>> a rare edge case, and that we would do far better to settle on one
>>> solution
>>> (the dotted version, not the colons, to my mind)
>>> 
>>> Simon
>>> 
>>>> On 5 June 2018 at 06:29, Ryan Merriman  wrote:
>>>> 
>>>> I agree completely. I will leave this thread open for a day or two to
>>> give
>>>> others a chance to weigh in. If no one opposes, I will creates Jiras
>>> for
>>>> removing field transformations and transforming existing data.
>>>> 
>>>> On Tue, Jun 5, 2018 at 8:21 AM, Casey Stella 
>>> wrote:
>>>> 
>>>>> Well, on write it is a transformation, on read it's a translation.
>>> This
>>>> is
>>>>> to say that you're providing a mapping on read to translate field
>>> names
>>>>> given the index you're using. The other approach that I was
>>> considering
>>>>> last night is a field transformation REST call which translates
> field
>>>> names
>>>>> that the UI could call. So, the UI would pass 'source.type' to the
>>> field
>>>>> translation service and in Solr it'd return source.type and in ES
> it'd
>>>>> return source:type. Underneath the hood the service would use the
>>> same
>>>>> transformation as the writer uses. That's another way to skin this
>>> cat.
>>>>> 
>>>>> Ultimately, I think we should just ditch this field transformation
>>>>> business, as Laurens said, as long as we have a utility to transform
>>>>> existing data.
>>>>> 
>>>>> On Tue, Jun 5, 2018 at 8:54 AM Ryan Merriman 
>>>> wrote:
>>>>> 
>>>>>> Having 2 different patterns for configuring field name
>>> transformations
>>>> on
>>>>>> read vs write is confusing to me. I agree with both of you that
>>>>>> normalizing on '.' and not having to do the translation at all
>>> would be
>>>>>> ideal. Like you both suggested, we would need some utility or
>>> script
>>>> to
>>>>>> convert preexisting data to match this format. There could also be
>>>> some
>>>>>> adjustme

Re: Writing enrichment data directly from NiFi with PutHBaseJSON

2018-06-05 Thread Simon Elliston Ball
Do we, the community, think it would be a good idea to create a
PutMetronEnrichment NiFi processor for this use case? It seems a number of
people want to use NiFi to manage and schedule loading of enrichments for
example.
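
For anyone picking this up, here is a minimal sketch of the row key
construction such a processor would need, assuming Guava's murmur3 and
following the hash + type + indicator layout Casey describes below. This is
illustrative only: the authoritative serialization (including the extra
length bytes you can see in Charlie's scan output) lives in
metron-enrichment's EnrichmentKey, which a real PutMetronEnrichment processor
should depend on and reuse rather than re-implement.

import java.nio.charset.StandardCharsets;

import com.google.common.hash.Hashing;

public class EnrichmentRowKeySketch {

  // Prefix with a murmur3 128-bit hash of the indicator so rows spread evenly
  // across regions, then append the enrichment type and the indicator itself.
  // NOTE: simplified sketch; Metron's EnrichmentKey is the real thing.
  public static byte[] rowKey(String type, String indicator) {
    byte[] hash = Hashing.murmur3_128()
        .hashString(indicator, StandardCharsets.UTF_8)
        .asBytes();
    byte[] typeBytes = type.getBytes(StandardCharsets.UTF_8);
    byte[] indicatorBytes = indicator.getBytes(StandardCharsets.UTF_8);
    byte[] key = new byte[hash.length + typeBytes.length + indicatorBytes.length];
    System.arraycopy(hash, 0, key, 0, hash.length);
    System.arraycopy(typeBytes, 0, key, hash.length, typeBytes.length);
    System.arraycopy(indicatorBytes, 0, key, hash.length + typeBytes.length,
        indicatorBytes.length);
    return key;
  }

  public static void main(String[] args) {
    System.out.println(rowKey("whois", "192.168.0.198").length + " byte key");
  }
}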

Simon

On 5 June 2018 at 06:56, Casey Stella  wrote:

> The problem, as you correctly diagnosed, is the key in HBase.  We construct
> the key very specifically in Metron, so it's unlikely to work out of the
> box with the NiFi processor unfortunately.  The key that we use is formed
> here in the codebase:
> https://github.com/cestella/incubator-metron/blob/master/
> metron-platform/metron-enrichment/src/main/java/org/
> apache/metron/enrichment/converter/EnrichmentKey.java#L51
>
> To put that in english, consider the following:
>
>- type - The enrichment type
>- indicator - the indicator to use
>- hash(*) - A murmur 3 128bit hash function
>
> the key is hash(indicator) + type + indicator
>
> This hash prefixing is a standard practice in hbase key design that allows
> the keys to be uniformly distributed among the regions and prevents
> hotspotting.  Depending on how the PutHBaseJSON processor works, if you can
> construct the key and pass it in, then you might be able to either
> construct the key in NiFi or write a processor to construct the key.
> Ultimately though, what Carolyn said is true..the easiest approach is
> probably using the flatfile loader.
> If you do get this working in NiFi, however, do please let us know and/or
> consider contributing it back to the project as a PR :)
>
>
>
> On Fri, Jun 1, 2018 at 6:26 AM Charles Joynt <
> charles.jo...@gresearch.co.uk>
> wrote:
>
> > Hello,
> >
> > I work as a Dev/Ops Data Engineer within the security team at a company
> in
> > London where we are in the process of implementing Metron. I have been
> > tasked with implementing feeds of network environment data into HBase so
> > that this data can be used as enrichment sources for our security events.
> > First-off I wanted to pull in DNS data for an internal domain.
> >
> > I am assuming that I need to write data into HBase in such a way that it
> > exactly matches what I would get from the flatfile_loader.sh script. A
> > colleague of mine has already loaded some DNS data using that script, so
> I
> > am using that as a reference.
> >
> > I have implemented a flow in NiFi which takes JSON data from a HTTP
> > listener and routes it to a PutHBaseJSON processor. The flow is working,
> in
> > the sense that data is successfully written to HBase, but despite
> (naively)
> > specifying "Row Identifier Encoding Strategy = Binary", the results in
> > HBase don't look correct. Comparing the output from HBase scan commands I
> > see:
> >
> > flatfile_loader.sh produced:
> >
> > ROW:
> > \xFF\xFE\xCB\xB8\xEF\x92\xA3\xD9#xC\xF9\xAC\x0Ap\x1E\x00\
> x05whois\x00\x0E192.168.0.198
> > CELL: column=data:v, timestamp=1516896203840,
> > value={"clientname":"server.domain.local","clientip":"192.168.0.198"}
> >
> > PutHBaseJSON produced:
> >
> > ROW:  server.domain.local
> > CELL: column=dns:v, timestamp=1527778603783,
> > value={"name":"server.domain.local","type":"A","data":"192.168.0.198"}
> >
> > From source JSON:
> >
> >
> > {"k":"server.domain.local","v":{"name":"server.domain.local"
> ,"type":"A","data":"192.168.0.198"}}
> >
> > I know that there are some differences in column family / field names,
> but
> > my worry is the ROW id. Presumably I need to encode my row key, "k" in
> the
> > JSON data, in a way that matches how the flatfile_loader.sh script did
> it.
> >
> > Can anyone explain how I might convert my Id to the correct format?
> > -or-
> > Does this matter-can Metron use the human-readable ROW ids?
> >
> > Charlie Joynt
> >
> > --
> > G-RESEARCH believes the information provided herein is reliable. While
> > every care has been taken to ensure accuracy, the information is
> furnished
> > to the recipients with no warranty as to the completeness and accuracy of
> > its contents and on condition that any errors or omissions shall not be
> > made the basis of any claim, demand or cause of action.
> > The information in this email is intended only for the named recipient.
> > If you are not the intended recipient please notify us immediately and do
> > not copy, distribute or take action based on this e-mail.
> > All messages sent to and from this e-mail address will be logged by
> > G-RESEARCH and are subject to archival storage, monitoring, review and
> > disclosure.
> > G-RESEARCH is the trading name of Trenchant Limited, 5th Floor,
> > Whittington House, 19-30 Alfred Place, London WC1E 7EA.
> > Trenchant Limited is a company registered in England with company number
> > 08127121.
> > --
> >
>



-- 
--
simon elliston ball
@sireb


Re: [DISCUSS] Field conversions

2018-06-05 Thread Simon Elliston Ball
+1 to that. It's a simple problem to solve if you have it, and with a
little docs help I imagine we'll be fine.

On 5 June 2018 at 06:58, Casey Stella  wrote:

> To be clear, I'm not even suggesting that we create any tooling here.  I'd
> say just a reference to the ES docs and a call-out in Upgrading.md would
> suffice as long as we have some strong reason to believe it'll work.  As
> far as I'm concerned, the sooner we're out of the business of transforming
> fields, the better.
>
> On Tue, Jun 5, 2018 at 9:49 AM Justin Leet  wrote:
>
> > ES does have some docs around how this gets handled in upgrades:
> >
> > https://www.elastic.co/guide/en/elasticsearch/reference/2.
> 4/dots-in-names.html
> >
> > Might be worth taking a look to see what conflicts we'd have going from
> 2.x
> > to 5.x and figuring out where to go from there.
> >
> > On Tue, Jun 5, 2018 at 9:46 AM, Simon Elliston Ball <
> > si...@simonellistonball.com> wrote:
> >
> > > I guess in principal you could use
> > > https://www.elastic.co/guide/en/elasticsearch/reference/
> > > current/docs-reindex.html#docs-reindex-change-name
> > > to reindex with the new fields. It wouldn't be hard to script up a bit
> of
> > > python to help users out with that, or of course to leave that as an
> > > exercise to the reader. It would be nice to have a script that read and
> > > transformed fields for templates and indices to replace the colons with
> > > dots in ES.
> > >
> > > Simon
> > >
> > > On 5 June 2018 at 06:40, Casey Stella  wrote:
> > >
> > > > +1 to that, Simon.  Do we have a sense of if there are utilities
> > provided
> > > > by ES to do this kind of migration transformation easily?
> > > >
> > > > On Tue, Jun 5, 2018 at 9:37 AM Simon Elliston Ball <
> > > > si...@simonellistonball.com> wrote:
> > > >
> > > > > I would definitely agree that the transformation should be removed.
> > We
> > > > have
> > > > > now however added a complex generic solution in the backend, which
> is
> > > > going
> > > > > to be noop for most people. This was done I believe for the sake of
> > > > > backward compatibility. I would argue however, that there is no
> need
> > to
> > > > > support ES 2.3, and therefore no need to support de-dotting
> > > > > transformations. This does seem somewhat over-engineered to me,
> > though
> > > it
> > > > > does save people re-indexing on upgrades. I suspect in reality that
> > > this
> > > > is
> > > > > a rare edge case, and that we would do far better to settle on one
> > > > solution
> > > > > (the dotted version, not the colons, to my mind)
> > > > >
> > > > > Simon
> > > > >
> > > > > On 5 June 2018 at 06:29, Ryan Merriman 
> wrote:
> > > > >
> > > > > > I agree completely.  I will leave this thread open for a day or
> two
> > > to
> > > > > give
> > > > > > others a chance to weigh in.  If no one opposes, I will creates
> > Jiras
> > > > for
> > > > > > removing field transformations and transforming existing data.
> > > > > >
> > > > > > On Tue, Jun 5, 2018 at 8:21 AM, Casey Stella  >
> > > > wrote:
> > > > > >
> > > > > > > Well, on write it is a transformation, on read it's a
> > translation.
> > > > > This
> > > > > > is
> > > > > > > to say that you're providing a mapping on read to translate
> field
> > > > names
> > > > > > > given the index you're using.  The other approach that I was
> > > > > considering
> > > > > > > last night is a field transformation REST call which translates
> > > field
> > > > > > names
> > > > > > > that the UI could call.  So, the UI would pass 'source.type' to
> > the
> > > > > field
> > > > > > > translation service and in Solr it'd return source.type and in
> ES
> > > > it'd
> > > > > > > return source:type.  Underneath the hood the service would use
> > the
> > > > same
> > > > > > > 

Re: [DISCUSS] Field conversions

2018-06-05 Thread Simon Elliston Ball
I guess in principle you could use
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html#docs-reindex-change-name
to reindex with the new field names. It wouldn't be hard to script up a bit of
Python to help users out with that, or of course to leave that as an
exercise to the reader. It would be nice to have a script that read and
transformed fields for templates and indices to replace the colons with
dots in ES.
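
As a strawman, a minimal Java sketch of kicking off one such rename through
the _reindex API with the low-level REST client (index names and the single
renamed field are placeholders, and the same request is just as easy to issue
from Python or curl):

import java.util.Collections;

import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class RenameColonFieldsSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder host and index names -- adjust for a real cluster.
    try (RestClient client = RestClient.builder(
        new HttpHost("localhost", 9200, "http")).build()) {
      String body =
          "{"
        + "  \"source\": { \"index\": \"squid_index_2018.06.05.01\" },"
        + "  \"dest\":   { \"index\": \"squid_index_2018.06.05.01_dotted\" },"
        + "  \"script\": {"
        + "    \"lang\": \"painless\","
        // One line per colon-separated field; on older 5.x the key is "inline"
        // rather than "source".
        + "    \"source\": \"ctx._source.put('source.type', ctx._source.remove('source:type'))\""
        + "  }"
        + "}";
      Response response = client.performRequest(
          "POST", "/_reindex", Collections.emptyMap(),
          new NStringEntity(body, ContentType.APPLICATION_JSON));
      System.out.println(EntityUtils.toString(response.getEntity()));
    }
  }
}

A real script would obviously enumerate every colon-separated field for the
sensor (and fix up the index templates afterwards), but the shape of the
request is the interesting bit.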

Simon

On 5 June 2018 at 06:40, Casey Stella  wrote:

> +1 to that, Simon.  Do we have a sense of if there are utilities provided
> by ES to do this kind of migration transformation easily?
>
> On Tue, Jun 5, 2018 at 9:37 AM Simon Elliston Ball <
> si...@simonellistonball.com> wrote:
>
> > I would definitely agree that the transformation should be removed. We
> have
> > now however added a complex generic solution in the backend, which is
> going
> > to be noop for most people. This was done I believe for the sake of
> > backward compatibility. I would argue however, that there is no need to
> > support ES 2.3, and therefore no need to support de-dotting
> > transformations. This does seem somewhat over-engineered to me, though it
> > does save people re-indexing on upgrades. I suspect in reality that this
> is
> > a rare edge case, and that we would do far better to settle on one
> solution
> > (the dotted version, not the colons, to my mind)
> >
> > Simon
> >
> > On 5 June 2018 at 06:29, Ryan Merriman  wrote:
> >
> > > I agree completely.  I will leave this thread open for a day or two to
> > give
> > > others a chance to weigh in.  If no one opposes, I will creates Jiras
> for
> > > removing field transformations and transforming existing data.
> > >
> > > On Tue, Jun 5, 2018 at 8:21 AM, Casey Stella 
> wrote:
> > >
> > > > Well, on write it is a transformation, on read it's a translation.
> > This
> > > is
> > > > to say that you're providing a mapping on read to translate field
> names
> > > > given the index you're using.  The other approach that I was
> > considering
> > > > last night is a field transformation REST call which translates field
> > > names
> > > > that the UI could call.  So, the UI would pass 'source.type' to the
> > field
> > > > translation service and in Solr it'd return source.type and in ES
> it'd
> > > > return source:type.  Underneath the hood the service would use the
> same
> > > > transformation as the writer uses.  That's another way to skin this
> > cat.
> > > >
> > > > Ultimately, I think we should just ditch this field transformation
> > > > business, as Laurens said, as long as we have a utility to transform
> > > > existing data.
> > > >
> > > > On Tue, Jun 5, 2018 at 8:54 AM Ryan Merriman 
> > > wrote:
> > > >
> > > > > Having 2 different patterns for configuring field name
> > transformations
> > > on
> > > > > read vs write is confusing to me.  I agree with both of you that
> > > > > normalizing on '.' and not having to do the translation at all
> would
> > be
> > > > > ideal.  Like you both suggested, we would need some utility or
> script
> > > to
> > > > > convert preexisting data to match this format.  There could also be
> > > some
> > > > > adjustments a user would need to make in the UI but I feel like we
> > > could
> > > > > document around that.  Are there any objections to doing it this
> way?
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jun 4, 2018 at 4:30 PM, Laurens Vets 
> > > wrote:
> > > > >
> > > > > > ES 2.x support officially ended 4 months ago (
> > > > > > https://www.elastic.co/support/eol), so why still support ':' at
> > > all?
> > > > :)
> > > > > > Additionally, 2.x isn't even supported at all on the last 2
> Ubuntu
> > > LTS
> > > > > > releases (16.04 & 18.05).
> > > > > >
> > > > > > Therefor, move everything to use '.' and provide a
> > conversion/upgrade
> > > > > > script to change '.' to ':'?
> > > > > >
> > > > > >
> > > > > > On 2018-06-04 13:55, Ryan Merriman wrote:
> > > > > >
> > > >

Re: [DISCUSS] Field conversions

2018-06-05 Thread Simon Elliston Ball
I would definitely agree that the transformation should be removed. We have
now however added a complex generic solution in the backend, which is going
to be noop for most people. This was done I believe for the sake of
backward compatibility. I would argue however, that there is no need to
support ES 2.3, and therefore no need to support de-dotting
transformations. This does seem somewhat over-engineered to me, though it
does save people re-indexing on upgrades. I suspect in reality that this is
a rare edge case, and that we would do far better to settle on one solution
(the dotted version, not the colons, to my mind)

Simon

On 5 June 2018 at 06:29, Ryan Merriman  wrote:

> I agree completely.  I will leave this thread open for a day or two to give
> others a chance to weigh in.  If no one opposes, I will creates Jiras for
> removing field transformations and transforming existing data.
>
> On Tue, Jun 5, 2018 at 8:21 AM, Casey Stella  wrote:
>
> > Well, on write it is a transformation, on read it's a translation.  This
> is
> > to say that you're providing a mapping on read to translate field names
> > given the index you're using.  The other approach that I was considering
> > last night is a field transformation REST call which translates field
> names
> > that the UI could call.  So, the UI would pass 'source.type' to the field
> > translation service and in Solr it'd return source.type and in ES it'd
> > return source:type.  Underneath the hood the service would use the same
> > transformation as the writer uses.  That's another way to skin this cat.
> >
> > Ultimately, I think we should just ditch this field transformation
> > business, as Laurens said, as long as we have a utility to transform
> > existing data.
> >
> > On Tue, Jun 5, 2018 at 8:54 AM Ryan Merriman 
> wrote:
> >
> > > Having 2 different patterns for configuring field name transformations
> on
> > > read vs write is confusing to me.  I agree with both of you that
> > > normalizing on '.' and not having to do the translation at all would be
> > > ideal.  Like you both suggested, we would need some utility or script
> to
> > > convert preexisting data to match this format.  There could also be
> some
> > > adjustments a user would need to make in the UI but I feel like we
> could
> > > document around that.  Are there any objections to doing it this way?
> > >
> > >
> > >
> > > On Mon, Jun 4, 2018 at 4:30 PM, Laurens Vets 
> wrote:
> > >
> > > > ES 2.x support officially ended 4 months ago (
> > > > https://www.elastic.co/support/eol), so why still support ':' at
> all?
> > :)
> > > > Additionally, 2.x isn't even supported at all on the last 2 Ubuntu
> LTS
> > > > releases (16.04 & 18.05).
> > > >
> > > > Therefor, move everything to use '.' and provide a conversion/upgrade
> > > > script to change '.' to ':'?
> > > >
> > > >
> > > > On 2018-06-04 13:55, Ryan Merriman wrote:
> > > >
> > > >> We've been dealing with a reoccurring challenge in Metron.  It is
> > common
> > > >> for various fields to contain '.' characters for the purpose of
> making
> > > >> them
> > > >> more readable, namespacing, etc.  At one point we only supported
> > > >> Elasticsearch 2.3 which did not allow dots and forced us to use ':'
> > > >> instead.  This limitation does not exist in later versions of
> > > >> Elasticsearch
> > > >> or Solr.
> > > >>
> > > >> Now we're in a situation where we need to allow a user to use either
> > one
> > > >> because they may still be using ES 2.3 or have data with ':'
> > characters
> > > in
> > > >> field names.  We've attempted to make this configurable in a couple
> > > >> different PRs:
> > > >>
> > > >> https://github.com/apache/metron/pull/1022
> > > >> https://github.com/apache/metron/pull/1010
> > > >> https://github.com/apache/metron/pull/1038
> > > >>
> > > >> The approaches taken in these are not consistent and fall short in
> > > >> different ways.  The first (METRON-1569 Allow user to change field
> > name
> > > >> conversion when indexing) only applies to indexing and not querying.
> > > The
> > > >> others only apply to a single field which does not scale well.  Now
> we
> > > >> have
> > > >> an issue with another field in
> > > >> https://issues.apache.org/jira/browse/METRON-1600.  Rather than
> > > >> continuing
> > > >> with a patchwork of different fixes I want to attempt to design a
> > > >> system-wide solution.
> > > >>
> > > >> My first thought is to expand
> > > https://github.com/apache/metron/pull/1022
> > > >> to
> > > >> apply globally.  However this is not trivial and would require
> > > significant
> > > >> changes.  It would also make https://github.com/apache/
> > metron/pull/1010
> > > >> obsolete and we might end up having to revert all of it.
> > > >>
> > > >> Does anyone have any ideas or opinions?  I am still researching
> > > solutions
> > > >> but would love some guidance from the community.
> > > >>
> > > >
> > >
> >
>



-- 
--
simon elliston ball
@sireb


Re: [DISCUSS] parser ES + Solr schema abstraction

2018-05-23 Thread Simon Elliston Ball
There is certainly a lot of value in the idea of tagging the data with a config 
version of some sort for traceability. That is probably a per-topology concern, and 
it would give us detailed lineage as data passes through. Maybe something like the 
NiFi provenance approach and a link to a lineage store like Atlas would make sense 
(in our case that's simpler than the NiFi use case of course, since we have a fixed 
set of topologies).

My other use for schema versions is around preserving backward compatibility for 
schema in stores that need to think harder about schema evolution, such as columnar 
formats in HDFS (ORC or Parquet for example), so I think we need some means of 
storing and retrieving schema versions.

I'm proposing that the versions be created on the basis of config changes. So the 
process would be: a config change triggers schema inference, which triggers a diff 
against the old schema, which optionally triggers a net new version.
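
As a minimal sketch of that diff-and-version step, assuming the inferred schema is just a field-to-type map (the names below are made up, not the real parser interface):

def diff_schemas(old, new):
    """Compare two inferred field->type maps and describe what changed."""
    added = {f: t for f, t in new.items() if f not in old}
    removed = {f: old[f] for f in old if f not in new}
    retyped = {f: (old[f], new[f]) for f in new if f in old and old[f] != new[f]}
    return added, removed, retyped

def next_version(current_version, old, new):
    """Only cut a new schema version when the config change had a schema impact."""
    added, removed, retyped = diff_schemas(old, new)
    return current_version if not (added or removed or retyped) else current_version + 1

# Hypothetical example: an enrichment added by a config change introduces a field.
old_schema = {"ip_src_addr": "ip", "timestamp": "long"}
new_schema = dict(old_schema, **{"enrichments.geo.ip_src_addr.country": "keyword"})
print(next_version(3, old_schema, new_schema))   # -> 4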

Does that make sense?

Simon 

> On 22 May 2018, at 19:33, Otto Fowler  wrote:
> 
> I’ve also talked with J. Zeolla conceptually storing data in hdfs relative to 
> the version of the schema to produced it, but that may not matter….
> 
> So Simon, do you mean that as part of taking a configuration change ( either 
> startup or live while running ) we ‘update’ the metadata/schema, or 
> re-evaluate and then save/version it?
> maybe the data should have a field about the config/schema version that it 
> was generated with….
> 
> 
> 
> 
>> On May 22, 2018 at 13:56:23, Simon Elliston Ball 
>> (si...@simonellistonball.com) wrote:
>> 
>> Absolutely. I would agree with that as an approach. 
>> 
>> I would also suggest we discuss where schemas and versions should be stored. 
>> Atlas? The NiFi schema repo abstraction (which limits us to Avro to express 
>> schema).
>> 
>> What I would like to see would be a change to parser interfaces that emits 
>> field types, ditto the enrichment stages, and then detect changes from that.
>> 
>> The other issue to consider is forward and back compatibility on versions. 
>> For example, if we want to output ORC schema (I really think we should, 
>> because the current JSON on HDFS format is huge and slow), we need to 
>> consider the schema output history, since ORC will allow scheme evolution to 
>> an extent (adding fields) but not to others (removing or reordering fields). 
>> This can be resolved by sensible versioning and history aware schema 
>> generation.
>> 
>> Simon
>> 
>> 
>>> On 22 May 2018 at 15:23, Otto Fowler  wrote:
>>> Yes Simon, when I say ‘whatever we would call the complete parse/enrich 
>>> path’ that is what I was referring to.
>>> 
>>> I would think the flow would be:
>>> 
>>> Save or deploy sensor configurations 
>>> -> check if there is a difference in the configurations from last to new 
>>> version
>>> -> if there is a difference that effects the ‘schema’ in any configuration
>>> -> build master schema from configurations 
>>> -> version, store, deploy
>>> 
>>> or something.  I’m sure there are things about clean slate deploy vs. new 
>>> version deploy.
>>> 
>>>> On May 22, 2018 at 09:59:06, Simon Elliston Ball 
>>>> (si...@simonellistonball.com) wrote:
>>>> 
>>>> What I would really like to see is not a full end-to-end schema, but units
>>>> that contribute schema. I don't want to see a parser, enrichment, indexing
>>>> config as one package because in any given deployment for any given sensor,
>>>> I may have a different set of enrichments, and so need a different output
>>>> template.
>>>> 
>>>> What I would propose would be parsers and enrichments contribute partial
>>>> schema (potentially expressed as avro, but the important thing is just a
>>>> map of fields to types) which can then be composed, and have the metron
>>>> platform handle creating ES templates / solr schema / Hive Hcat schema /
>>>> A.N.Other index's schema meta data as the composite of those pieces. So, a
>>>> parser would contribute a set of fields, the fieldTransformations on the
>>>> sensor would contribute some fields, and each enrichment block would
>>>> contribute some fields, at which point we have enough schema definition to
>>>> generate all the required artefacts for whatever storage it ends up in.
>>>> 
>>>> Essentially, composable partial schema units from each component, which add
>>>> up at the end.
>>>> 
>>>> Does that make sense?
>>>> 
>&

Re: [DISCUSS] parser ES + Solr schema abstraction

2018-05-22 Thread Simon Elliston Ball
Absolutely. I would agree with that as an approach.

I would also suggest we discuss where schemas and versions should be
stored. Atlas? The NiFi schema repo abstraction (which limits us to Avro to
express schema).

What I would like to see would be a change to parser interfaces that emits
field types, ditto the enrichment stages, and then detect changes from that.

The other issue to consider is forward and back compatibility on versions.
For example, if we want to output ORC schema (I really think we should,
because the current JSON on HDFS format is huge and slow), we need to
consider the schema output history, since ORC will allow schema evolution
to an extent (adding fields) but not to others (removing or reordering
fields). This can be resolved by sensible versioning and history aware
schema generation.
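
To illustrate the kind of history-aware check I mean for ORC (treating the rule as "append only, keep order and types" — a simplification, and the field lists are hypothetical):

def is_orc_compatible(old_fields, new_fields):
    """Append-only evolution check: existing columns keep their position and type."""
    if len(new_fields) < len(old_fields):
        return False                                       # columns were removed
    for (old_name, old_type), (new_name, new_type) in zip(old_fields, new_fields):
        if old_name != new_name or old_type != new_type:
            return False                                   # reordered or retyped
    return True

# Hypothetical ordered (name, type) schema versions.
v1 = [("ip_src_addr", "string"), ("timestamp", "bigint")]
v2 = v1 + [("threat_triage_score", "double")]              # additive, fine
v3 = [("timestamp", "bigint"), ("ip_src_addr", "string")]  # reordered, not fine
print(is_orc_compatible(v1, v2), is_orc_compatible(v1, v3))   # True False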

Simon


On 22 May 2018 at 15:23, Otto Fowler  wrote:

> Yes Simon, when I say ‘whatever we would call the complete parse/enrich
> path’ that is what I was referring to.
>
> I would think the flow would be:
>
> Save or deploy sensor configurations
> -> check if there is a difference in the configurations from last to new
> version
> -> if there is a difference that effects the ‘schema’ in any configuration
> -> build master schema from configurations
> -> version, store, deploy
>
> or something.  I’m sure there are things about clean slate deploy vs. new
> version deploy.
>
> On May 22, 2018 at 09:59:06, Simon Elliston Ball (
> si...@simonellistonball.com) wrote:
>
> What I would really like to see is not a full end-to-end schema, but units
> that contribute schema. I don't want to see a parser, enrichment, indexing
> config as one package because in any given deployment for any given
> sensor,
> I may have a different set of enrichments, and so need a different output
> template.
>
> What I would propose would be parsers and enrichments contribute partial
> schema (potentially expressed as avro, but the important thing is just a
> map of fields to types) which can then be composed, and have the metron
> platform handle creating ES templates / solr schema / Hive Hcat schema /
> A.N.Other index's schema meta data as the composite of those pieces. So, a
> parser would contribute a set of fields, the fieldTransformations on the
> sensor would contribute some fields, and each enrichment block would
> contribute some fields, at which point we have enough schema definition to
> generate all the required artefacts for whatever storage it ends up in.
>
> Essentially, composable partial schema units from each component, which
> add
> up at the end.
>
> Does that make sense?
>
> Simon
>
>
> On 22 May 2018 at 14:10, Otto Fowler  wrote:
>
> > We have discussed in the past as part of 777 ( moment of silence…. ) the
> > idea that parsers/sensors ( or whatever we would call the complete
> > parse/enrich path ) could define a their ES or Solr schemas so that
> > they can be ‘installed’ as part of metron and remove the requirement for
> a
> > separate install by the system or by the user of a specific index
> template
> > or equivalent.
> >
> > Nifi has settled on Avro schemas to describe their ‘record’ based data,
> and
> > it makes me wonder if we might want to think of using Avro as a
> universal
> > schema or the base for one such that we can define a schema and apply it
> to
> > either ES or Solr.
> >
> > Thoughts?
> >
>
>
>
> --
> --
> simon elliston ball
> @sireb
>
>


-- 
--
simon elliston ball
@sireb


Re: [DISCUSS] parser ES + Solr schema abstraction

2018-05-22 Thread Simon Elliston Ball
What I would really like to see is not a full end-to-end schema, but units
that contribute schema. I don't want to see a parser, enrichment, indexing
config as one package because in any given deployment for any given sensor,
I may have a different set of enrichments, and so need a different output
template.

What I would propose would be parsers and enrichments contribute partial
schema (potentially expressed as avro, but the important thing is just a
map of fields to types) which can then be composed, and have the metron
platform handle creating ES templates / solr schema / Hive Hcat schema /
A.N.Other index's schema meta data as the composite of those pieces. So, a
parser would contribute a set of fields, the fieldTransformations on the
sensor would contribute some fields, and each enrichment block would
contribute some fields, at which point we have enough schema definition to
generate all the required artefacts for whatever storage it ends up in.

Essentially, composable partial schema units from each component, which add
up at the end.
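
To make that a little more concrete, a toy sketch of the composition (the partial field maps, the conflict rule and the template shape are all assumptions rather than real interfaces):

import json

def compose_schema(*partials):
    """Merge partial field->type maps; later components may not change a declared type."""
    merged = {}
    for partial in partials:
        for field, field_type in partial.items():
            if field in merged and merged[field] != field_type:
                raise ValueError("type conflict on " + field)
            merged[field] = field_type
    return merged

def to_es_template(index_pattern, fields):
    """Render the composite map as a minimal ES index template body."""
    return {
        "index_patterns": [index_pattern],
        "mappings": {"properties": {f: {"type": t} for f, t in fields.items()}},
    }

parser_fields = {"ip_src_addr": "ip", "timestamp": "date"}
transform_fields = {"full_hostname": "keyword"}
enrichment_fields = {"enrichments.geo.ip_src_addr.country": "keyword"}

composite = compose_schema(parser_fields, transform_fields, enrichment_fields)
print(json.dumps(to_es_template("bro_index*", composite), indent=2))

The same composite map could just as easily be rendered as a Solr managed-schema or an HCatalog definition.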

Does that make sense?

Simon


On 22 May 2018 at 14:10, Otto Fowler  wrote:

> We have discussed in the past as part of 777 ( moment of silence…. ) the
> idea that parsers/sensors ( or whatever we would call the complete
> parse/enrich path ) could define a their ES or Solr schemas so that
> they can be ‘installed’ as part of metron and remove the requirement for a
> separate install by the system or by the user of a specific index template
> or equivalent.
>
> Nifi has settled on Avro schemas to describe their ‘record’ based data, and
> it makes me wonder if we might want to think of using Avro as a universal
> schema or the base for one such that we can define a schema and apply it to
> either ES or Solr.
>
> Thoughts?
>



-- 
--
simon elliston ball
@sireb


Re: [DISCUSS] Pcap panel architecture

2018-05-11 Thread Simon Elliston Ball
> > > This endpoint will allow a user to download raw pcap results for
> > the
> > > > > given
> > > > > > page.
> > > > > >
> > > > > > DELETE /api/v1/pcap/
> > > > > >
> > > > > > This endpoint will delete pcap query results. Not sure yet how
> > this
> > > > fits
> > > > > > in with our broader cleanup strategy.
> > > > > >
> > > > > > This should get us started. What did I miss and what would you
> > > change
> > > > > > about these? I did not include much detail related to security,
> > > > cleanup
> > > > > > strategy, or underlying implementation details but these are
> items
> > we
> > > > > > should discuss at some point.
> > > > > >
> > > > > > On Tue, May 8, 2018 at 5:38 PM, Michael Miklavcic <
> > > > > > michael.miklav...@gmail.com> wrote:
> > > > > >
> > > > > > > Sweet! That's great news. The pom changes are a lot simpler
> than
> > I
> > > > > > > expected. Very nice.
> > > > > > >
> > > > > > > On Tue, May 8, 2018 at 4:35 PM, Ryan Merriman <
> > merrim...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Finally figured it out. Commit is here:
> > > > > > > > https://github.com/merrimanr/incubator-metron/commit/
> > > > > > > > 22fe5e9ff3c167b42ebeb7a9f1000753a409aff1
> > > > > > > >
> > > > > > > > It came down to figuring out the right combination of maven
> > > > > > dependencies
> > > > > > > > and passing in the HDP version to REST as a Java system
> > property.
> > > > I
> > > > > > also
> > > > > > > > included some HDFS setup tasks. I tested this in full dev and
> > > can
> > > > > now
> > > > > > > > successfully run a pcap query and get results. All you should
> > > have
> > > > > to
> > > > > > do
> > > > > > > > is generate some pcap data first.
> > > > > > > >
> > > > > > > > On Tue, May 8, 2018 at 4:17 PM, Michael Miklavcic <
> > > > > > > > michael.miklav...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > @Ryan - pulled your branch and experimented with a few
> > things.
> > > In
> > > > > > doing
> > > > > > > > so,
> > > > > > > > > it dawned on me that by adding the yarn and hadoop
> classpath,
> > > you
> > > > > > > > probably
> > > > > > > > > didn't introduce a new classpath issue, rather you probably
> > > just
> > > > > > moved
> > > > > > > > onto
> > > > > > > > > the next classpath issue, ie hbase per your exception about
> > > hbase
> > > > > > jaxb.
> > > > > > > > > Anyhow, I put up a branch with some pom changes worth
> trying
> > in
> > > > > > > > conjunction
> > > > > > > > > with invoking the rest app startup via "/usr/bin/yarn jar"
> > > > > > > > >
> > > > > > > > > https://github.com/mmiklavc/metron/tree/ryan-rest-test
> > > > > > > > >
> > > > > > > > > https://github.com/mmiklavc/metron/commit/
> > > > > > > 5ca23580fc6e043fafae2327c80b65
> > > > > > > > > b20ca1c0c9
> > > > > > > > >
> > > > > > > > > Mike
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, May 8, 2018 at 7:44 AM, Simon Elliston Ball <
> > > > > > > > > si...@simonellistonball.com> wrote:
> > > > > > > > >
> > > > > > > > > > That would be a step closer to something more like a
> > > > > micro-service
> > > > > > > > > > architecture. However, I would want to make sure we think
> > > about
> > > >

Re: [DISCUSS] Release?

2018-05-09 Thread Simon Elliston Ball
Definitely +1, with the Solr pieces going in too, does it make sense to
bump the version to 0.5?

On 9 May 2018 at 16:18, Michael Miklavcic 
wrote:

> +1
>
> On Wed, May 9, 2018 at 9:13 AM, Casey Stella  wrote:
>
> > Is it about time for a release?  I know we got some substantial
> performance
> > changes in since the last release.  I think we might have a justification
> > for a release.
> >
> > Casey
> >
>



-- 
--
simon elliston ball
@sireb


Re: [DISCUSS] Pcap panel architecture

2018-05-08 Thread Simon Elliston Ball
> > >
> > > > > Mike, I can start a feature branch and experiment with merging
> > > metron-api
> > > > > into metron-rest. That should allow us to collaborate on any issues
> > or
> > > > > challenges. Also, can you expand on your idea to manage external
> > > > > dependencies as a special module? That seems like a very attractive
> > > > option
> > > > > to me.
> > > > >
> > > > > On Fri, May 4, 2018 at 8:39 AM, Otto Fowler <
> ottobackwa...@gmail.com>
> >
> > > > > wrote:
> > > > >
> > > > > > From my response on the other thread, but applicable to the
> > backend
> > > > > stuff:
> > > > > >
> > > > > > "The PCAP Query seems more like PCAP Report to me. You are
> > > generating a
> > > > > > report based on parameters.
> > > > > > That report is something that takes some time and external
> process
> > to
> > > > > > generate… ie you have to wait for it.
> > > > > >
> > > > > > I can almost imagine a flow where you:
> > > > > >
> > > > > > * Are in the AlertUI
> > > > > > * Ask to generate a PCAP report based on some selected
> > > > alerts/meta-alert,
> > > > > > possibly picking from on or more report ‘templates’
> > > > > > that have query options etc
> > > > > > * The report request is ‘queued’, that is dispatched to be be
> > > > > > executed/generated
> > > > > > * You as a user have a ‘queue’ of your report results, and when
> > the
> > > > > report
> > > > > > is done it is queued there
> > > > > > * We ‘monitor’ the report/queue press through the yarn rest (
> > report
> > > > > > info/meta has the yarn details )
> > > > > > * You can select the report from your queue and view it either in
> > a
> > > new
> > > > > UI
> > > > > > or custom component
> > > > > > * You can then apply a different ‘view’ to the report or work
> with
> > > the
> > > > > > report data
> > > > > > * You can print / save etc
> > > > > > * You can associate the report with the alerts ( again in the
> > report
> > > > info
> > > > > > ) with…. a ‘case’ or ‘ticket’ or investigation something or other
> > > > > >
> > > > > >
> > > > > > We can introduce extensibility into the report templates, report
> > > views
> > > > (
> > > > > > thinks that work with the json data of the report )
> > > > > >
> > > > > > Something like that.”
> > > > > >
> > > > > > Maybe we can do :
> > > > > >
> > > > > > template -> query parameters -> script => yarn info
> > > > > > yarn info + query info + alert context + yarn status => report
> > info
> > > ->
> > > > > > stored in a user’s ‘report queue’
> > > > > > report persistence added to report info
> > > > > > metron-rest -> api to monitor the queue, read results ( page ),
> > etc
> > > etc
> > > > > >
> > > > > >
> > > > > > On May 4, 2018 at 09:23:39, Ryan Merriman (merrim...@gmail.com)
> > > wrote:
> > > > > >
> > > > > > I started a separate thread on Pcap UI considerations and user
> > > > > > requirements
> > > > > > at Otto's request. This should help us keep these two related but
> > > > > separate
> > > > > > discussions focused.
> > > > > >
> > > > > > On Fri, May 4, 2018 at 7:19 AM, Michel Sumbul <
> > > michelsum...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > (Youhouuu my first reply on this kind of mail chain^^)
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > If I may, I would like to share my view on the following 3
> > points.
> > > > > > >
> > > > > > > - Backend:
> > > > > > >
> > > > > > > The current metron-api is totally seperate, it will be logic
> for
> > me
> > > > to
> > > > > > have
> > > > > > > it at the same place as the others rest api. Especially when
> > more
> > > > > > security
> > > > > > > will be added, it will not be needed to do the job twice.
> > > > > > > The current implementation send back a pcap object which still
> > need
> > > > to
> > > > > > be
> > > > > > > decoded. In the opensoc, the decoding was done with tshard on
> > the
> > > > > > frontend.
> > > > > > > It will be good to have this decoding happening directly on the
> > > > backend
> > > > > > to
> > > > > > > not create a load on frontend. An option will be to install
> > tshark
> > > on
> > > > > > the
> > > > > > > rest server and to use to convert the pcap to xml and then to a
> > > json
> > > > > > that
> > > > > > > will be send to the frontend.
> > > > > > >
> > > > > > > I tried to start directly the map/reduce job to search over all
> > the
> > > > > pcap
> > > > > > > data from the rest server and as Ryan mention it, we had
> > trouble. I
> > > > > will
> > > > > > > try to find back the error.
> > > > > > >
> > > > > > > Then in the POC, what we tried is to use the pcap_query script
> > and
> > > > this
> > > > > > > work fine. I just modified it that he sends back directly the
> > > job_id
> > > > of
> > > > > > > yarn and not waiting that the job is finished. Then it will
> > allow
> > > the
> > > > > UI
> > > > > > > and the rest server to know what the status of the research by
> > > > querying
> > > > > > the
> > > > > > > yarn rest api. This will allow the UI and the rest server to be
> > > async
> > > > > > > without any blocking phase. What do you think about that?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Having the job submitted directly from the code of the rest
> > server
> > > > will
> > > > > > be
> > > > > > > perfect, but it will need a lot of investigation I think (but
> > I'm
> > > not
> > > > > > the
> > > > > > > expert so I might be completely wrong ^^).
> > > > > > >
> > > > > > > We know that the pcap_query scritp work fine so why not calling
> > it?
> > > > Is
> > > > > > it
> > > > > > > that bad? (maybe stupid question, but I really don’t see a lot
> > of
> > > > > > drawback)
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > - Front end:
> > > > > > >
> > > > > > > Adding the the pcap search to the alert UI is, I think, the
> > easiest
> > > > way
> > > > > > to
> > > > > > > move forward. But indeed, it will then be the “Alert UI and
> > > > pcapquery”.
> > > > > > > Maybe the name of the UI should just change to something like
> > > > > > “Monitoring &
> > > > > > > Investigation UI” ?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Is there any roadmap or plan for the different UI? I mean did
> > you
> > > > > > already
> > > > > > > had discussion on how you see the ui evolving with the new
> > feature
> > > > that
> > > > > > > will come in the future?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > - Microservices:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > What do you mean exactly by microservices? Is it to separate
> all
> > > the
> > > > > > > features in different projects? Or something like having the
> > > > different
> > > > > > > components in container like kubernet? (again maybe stupid
> > > question,
> > > > > but
> > > > > > I
> > > > > > > don’t clearly understand what you mean J )
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Michel
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> >
>
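
For what it's worth, the flow Michel describes above (hand the YARN job id straight back and let the UI/REST layer poll for status) is simple to prototype against the ResourceManager REST API; a rough sketch, with the RM address and the idea that pcap_query returns an application id both being assumptions:

import time
import requests

RM_URL = "http://resourcemanager:8088"      # assumption: default YARN RM REST port

def poll_pcap_job(application_id, interval_secs=5):
    """Poll the ResourceManager until the pcap query job reaches a terminal state."""
    while True:
        resp = requests.get(RM_URL + "/ws/v1/cluster/apps/" + application_id)
        resp.raise_for_status()
        app = resp.json()["app"]
        if app["state"] in ("FINISHED", "FAILED", "KILLED"):
            return app["state"], app.get("finalStatus")
        time.sleep(interval_secs)

# Hypothetical usage once pcap_query hands back the submitted application id:
# state, final_status = poll_pcap_job("application_1525789012345_0007")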



-- 
--
simon elliston ball
@sireb


Re: Streaming Machine Learning use case

2018-05-08 Thread Simon Elliston Ball
Do you mean Apache SAMOA? I'm not sure of the status of that project, and
it doesn't look particularly lively (last real activity on the lists was 2
months ago, last commits, 7 months ago).

That said, there seem to be some interesting algorithms implemented in
there. The VHT algorithm and the clustering may be relevant, though we have
other efficient means of streaming clustering already in Metron. I would
also argue that we'd be better off looking at algorithms in Spark for
things like frequent pattern mining, though there the FP growth algorithm
is of course primarily a batch implementation.

Are there any SAMOA algorithms in particular that you think would be
relevant to Metron use cases?

Simon


On 8 May 2018 at 07:29, Ali Nazemian  wrote:

> Hi all,
>
> I was wondering if someone has used Metron with any streaming ML framework
> such as SAMOA? I know that Metron provides Machine Learning separately via
> MAAS. However, it is hard to manage it from operational perspective
> especially if we want to have a pretty dynamic and evolving model. SAMOA
> seems to be a very slow project (or maybe even dead). However, it looks
> very close from the integration point of view with Metron, so I wanted to
> see if anyone had tried SAMOA in practice and especially with Metron use
> cases.
>
> Regards,
> Ali
>



-- 
--
simon elliston ball
@sireb


Re: GeoLite deprecating legacy DBs

2018-04-13 Thread Simon Elliston Ball
Don’t we already use the GeoLite2 database? Mine are all 
/apps/metron/geo/default/GeoLite2-City.mmdb.gz downloaded from 
http://geolite.maxmind.com/download/geoip/database/GeoLite2-City.mmdb.gz which 
seems to match the new format page. 
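
A quick way to sanity check a local copy is to open it with the geoip2 python library (the library choice and the gunzipped local file path are just for illustration):

import geoip2.database

# Assumes GeoLite2-City.mmdb.gz has been pulled down and gunzipped locally.
with geoip2.database.Reader("GeoLite2-City.mmdb") as reader:
    response = reader.city("8.8.8.8")       # any public IP works as a smoke test
    print(response.country.iso_code)        # e.g. 'US'
    print(response.city.name)
    print(response.location.latitude, response.location.longitude)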

Am I missing something Jon, or are you referring to the old geo enrichment?

Simon


> On 13 Apr 2018, at 10:27, zeo...@gmail.com  wrote:
> 
> Looks like we will need to update the Geo DBs that we use for enrichment.
> 
> 
> Updated versions of the GeoLite Legacy databases are now only available to
> redistribution license customers, although anyone can continue to download
> the March 2018 GeoLite Legacy builds. Starting January 2, 2019, the last
> build will be removed from our website. GeoLite Legacy database users will
> need to switch to the GeoLite2 or commercial GeoIP databases and update
> their integrations by January 2, 2019.
> 
> New Database Format Available: ... For our latest database format, please
> see our GeoLite2 Databases.
> 
> https://dev.maxmind.com/geoip/legacy/geolite/
> 
> Jon
> -- 
> 
> Jon



Re: [DISCUSS] Time to remove github updates from dev?

2018-04-04 Thread Simon Elliston Ball
I would say we should also update our website with subscription information.

Simon

> On 4 Apr 2018, at 18:51, Nick Allen  wrote:
> 
> https://lists.apache.org/list.html?iss...@metron.apache.org​
> 
> On Tue, Mar 20, 2018 at 5:06 PM, Otto Fowler 
> wrote:
> 
>> How about a link?
>> 
>> 
>> 
>> On March 19, 2018 at 08:16:46, Andre (andre-li...@fucs.org) wrote:
>> 
>> Folks,
>> 
>> All rejoice. This has been finally implemented.
>> 
>> Cheers
>> 
>> On 7 Feb 2018 08:33, "Andre"  wrote:
>> 
>>> All,
>>> 
>>> Turns out the process is simpler:
>>> 
>>> A PMC member must create the lists using the self-management potal:
>>> 
>>> 
>>> selfserve.apache.org
>>> 
>>> 
>>> Once this is done someone can update the INFRA-15988 ticket and the folks
>>> will execute the changes.
>>> 
>>> 
>>> 
>>> On Wed, Jan 31, 2018 at 12:15 AM, Otto Fowler 
>>> wrote:
>>> 
 We could also just skip ‘b’ and go directly to ‘c’ like apache-commons
 and have
 commits@ issues@.
 
 
 
 
 On January 30, 2018 at 08:03:37, Andre (andre-li...@fucs.org) wrote:
 
 James,
 
 Give nobody opposed, I would suggest one of the PMCs contact the INFRA
>> to
 get this actioned.
 
 They would need to assist with:
 
 1. Creation of the new "issues" list
 2. redirect both GitHub and JIRA integrations to the new list
 
 Cheers
 
 On Sat, Jan 27, 2018 at 9:40 AM, James Sirota 
>> wrote:
 
> Should we file an infra ticket on this?
> 
> 19.01.2018, 13:56, "zeo...@gmail.com" :
>> I would give that +1 as well.
>> 
>> Jon
>> 
>> On Fri, Jan 19, 2018 at 3:32 PM Casey Stella 
 wrote:
>> 
>>> I could get behind that.
>>> 
>>> On Fri, Jan 19, 2018 at 3:31 PM, Andre 
>> wrote:
>>> 
 Folks,
 
 May I suggest Metron follows the NiFi mailing list strategy (we
>> got
 inspired by another project but I don't recall the name) and
>> remove
> the
 github comments from the dev list?
 
 Within NiFi we have both the dev and the issues lists. dev is for
> humans,
 issues is for JIRA and github commits.[1]
 
 This allows the list thread list to be cleaner and is
>> particularly
>>> helpful
 for those reading the list from a list aggregation service.
 
 Cheers
 
 
 [1] https://lists.apache.org/list.html?iss...@nifi.apache.org
 
>> 
>> --
>> 
>> Jon
> 
> ---
> Thank you,
> 
> James Sirota
> PMC- Apache Metron
> jsirota AT apache DOT org
> 
> 
 
 
>>> 
>> 



Re: [DISCUSS] Generic Syslog Parsing capability for parsers

2018-03-20 Thread Simon Elliston Ball
It seems like parser chaining is becoming a hot topic on the repo too with 
https://github.com/apache/metron/pull/969#partial-pull-merging 


I would like to discuss the option, and how we might architect, of configuring 
parsers to operate on the output of parsers. This may also give us the 
opportunity to be more efficient in scenarios where people have large numbers 
of sources, and so use up a lot of slots for lower volume parsers for example.

I have a bunch of ideas around this, but am more keen to hear what everyone 
else thinks at this stage. How should we go about fixing parser config so that 
it’s clearer (removing the need for people to reinvent the parser wheel as 
we’ve seen in a few places) and also more concise and powerful (consolidating 
the parsing of transports such as syslog and content such as application logs, 
or types of device logs). 

If this can lead to a more efficient way of handling both the syslog problem, 
and the kind of problem that leads to switching between grok statements in 
something like our ASA parser then all the better. I suspect that there might 
also be a case for multi-level chaining here too, since some things are 
embedded in multiple transports, or might have complex fields that want 
‘sub-parsing’.
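
To make the chaining idea concrete, a very rough sketch of "strip the syslog header, hand the MSG part to the inner parser, merge the header fields back in" (the RFC 5424 regex is simplified and the JSON inner parser is just a stand-in for whatever the config points at):

import json
import re

# Simplified RFC 5424 header: <PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID SD MSG
SYSLOG_5424 = re.compile(
    r"^<(?P<pri>\d{1,3})>(?P<version>\d)\s"
    r"(?P<timestamp>\S+)\s(?P<hostname>\S+)\s"
    r"(?P<appname>\S+)\s(?P<procid>\S+)\s(?P<msgid>\S+)\s"
    r"(?P<sd>-|\[.*\])\s(?P<msg>.*)$"
)

def parse_json_payload(msg):
    """Stand-in for the chained inner parser (JSON, CSV, grok, ...)."""
    return json.loads(msg)

def parse_syslog_then_chain(line, inner_parser):
    match = SYSLOG_5424.match(line)
    if not match:
        raise ValueError("not an RFC 5424 line")
    header = {k: v for k, v in match.groupdict().items() if k != "msg"}
    message = inner_parser(match.group("msg"))
    message.update({"syslog." + k: v for k, v in header.items()})
    return message

line = '<14>1 2018-03-20T21:47:00Z host01 myapp 1234 ID47 - {"ip_src_addr":"10.0.0.1"}'
print(parse_syslog_then_chain(line, parse_json_payload))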

Of course one of the key values of Metron is its speed, so maybe formalising 
some of the microbenchmarking approaches a few of us have been working on might 
help here too. I’ve got a few bits of micro-benching infrastructure around CEF 
and ASA, and I believe there’s also been some work to load and perf test things 
like enrichment that might be leveraged.

Thoughts on a dev board? 

Simon

> On 20 Mar 2018, at 21:47, Otto Fowler  wrote:
> 
> I entered METRON–1453  a
> little while ago while working on the PR#579
> .
> 
> "We have several parsers now, with many imaginable that are based on
> syslog, where the format is SYSLOG HEADER MESSAGE.
> 
> With message being in a different format. It would be great is we had a way
> to generically handle syslog headers, such that ANY parser data could come
> over syslog.
> 
> Either you could have a custom parser, or configure CSV or JSON such that
> they could be the payload, such that you can handle JSON over syslog by
> configuration only."
> 
> The idea would be that the parser bolt would use the configuration to
> trigger parsing the incoming message as syslog formatted, and pass the
> message part to the parser, and put the syslog parts in the message(s)
> after parsing.
> 
> As part of this I did some work on parsing syslog, using both grok and a
> DSL that I did from the spec : https://github.com/ottobackwards/grok-v-antlr
> 
> The DSL is slower, but grok cannot handle multiple structured data entries,
> and the DSL can. I’m not good enough at grok to fix it so that it is
> functionally equivalent. Another option would be to write a third parser…
> It is also possible that the DSL could be improved for speed of course.
> 
> Thoughts?



Re: [DISCUSS] Time to remove github updates from dev?

2018-03-19 Thread Simon Elliston Ball
Should we not add the new lists to the website?

Simon

> On 19 Mar 2018, at 14:02, Casey Stella  wrote:
> 
> +1
> 
> 
> On Mon, Mar 19, 2018 at 8:16 AM Andre  wrote:
> 
>> Folks,
>> 
>> All rejoice. This has been finally implemented.
>> 
>> Cheers
>> 
>> On 7 Feb 2018 08:33, "Andre"  wrote:
>> 
>>> All,
>>> 
>>> Turns out the process is simpler:
>>> 
>>> A PMC member must create the lists using the self-management potal:
>>> 
>>> 
>>> selfserve.apache.org
>>> 
>>> 
>>> Once this is done someone can update the INFRA-15988 ticket and the folks
>>> will execute the changes.
>>> 
>>> 
>>> 
>>> On Wed, Jan 31, 2018 at 12:15 AM, Otto Fowler 
>>> wrote:
>>> 
 We could also just skip ‘b’ and go directly to ‘c’ like apache-commons
 and have
 commits@ issues@.
 
 
 
 
 On January 30, 2018 at 08:03:37, Andre (andre-li...@fucs.org) wrote:
 
 James,
 
 Give nobody opposed, I would suggest one of the PMCs contact the INFRA
>> to
 get this actioned.
 
 They would need to assist with:
 
 1. Creation of the new "issues" list
 2. redirect both GitHub and JIRA integrations to the new list
 
 Cheers
 
 On Sat, Jan 27, 2018 at 9:40 AM, James Sirota 
 wrote:
 
> Should we file an infra ticket on this?
> 
> 19.01.2018, 13:56, "zeo...@gmail.com" :
>> I would give that +1 as well.
>> 
>> Jon
>> 
>> On Fri, Jan 19, 2018 at 3:32 PM Casey Stella 
 wrote:
>> 
>>> I could get behind that.
>>> 
>>> On Fri, Jan 19, 2018 at 3:31 PM, Andre 
 wrote:
>>> 
 Folks,
 
 May I suggest Metron follows the NiFi mailing list strategy (we
 got
 inspired by another project but I don't recall the name) and
 remove
> the
 github comments from the dev list?
 
 Within NiFi we have both the dev and the issues lists. dev is for
> humans,
 issues is for JIRA and github commits.[1]
 
 This allows the list thread list to be cleaner and is
>> particularly
>>> helpful
 for those reading the list from a list aggregation service.
 
 Cheers
 
 
 [1] https://lists.apache.org/list.html?iss...@nifi.apache.org
 
>> 
>> --
>> 
>> Jon
> 
> ---
> Thank you,
> 
> James Sirota
> PMC- Apache Metron
> jsirota AT apache DOT org
> 
> 
 
 
>>> 
>> 



Re: [DISCUSS] community view/roadmap of threat intel

2018-02-19 Thread Simon Elliston Ball
Agreed, reputation and confidence are not really encoded formally in the data model, 
but I would expect most people are using them to weight the results of the threat 
intel now that we have threat triage scores built on Stellar expressions. 

There is definitely scope here to provide at least a recommended formal model 
for this, which may feed into some of the discussions about schema and traits 
elsewhere on the list (anyone remember back to the last time we talked about 
that?!) 
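
Even something as small as agreeing that every feed gets normalised to (indicator, type, source, confidence) before it hits the loader would help; a hedged sketch, with the feed URL, JSON shape and output columns all made up, and deliberately stopping short of assuming the loader's extractor config format:

import csv
import requests

FEED_URL = "https://intel.example.com/feed.json"   # hypothetical aggregator endpoint

def fetch_indicators():
    """Pull a hypothetical JSON feed and normalise each entry."""
    for entry in requests.get(FEED_URL, timeout=30).json():
        yield {
            "indicator": entry["value"],             # an IP, domain, hash...
            "type": entry.get("type", "unknown"),
            "source": entry.get("source", "example-feed"),
            "confidence": float(entry.get("confidence", 0.5)),
        }

def write_loader_csv(path):
    """Write rows in a simple CSV the bulk loader could then be pointed at."""
    with open(path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=["indicator", "type", "source", "confidence"])
        writer.writeheader()
        for row in fetch_indicators():
            writer.writerow(row)

write_loader_csv("threat_intel.csv")

Confidence then becomes just another column that triage rules can weight on, rather than something implicit in which feed an indicator happened to come from.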

Otto, I'm not sure I see using NiFi as breaking an application boundary, or the 
necessity of everything being in Storm. OK, it brings in another component, but it 
also gives us things like scheduling of web API polling for threat feeds. Most 
implementations of Metron I've been involved in have NiFi on the side anyway to get 
things into Kafka. I'd love to hear if people have a strong objection to bringing it 
into scope. 

What I was thinking was writing something like a MetronThreatIntelProcessor owned in 
Metron, and published as a NAR by the Metron project. That would load NiFi flow file 
content directly into Metron's HBase tables using the Metron loader code and config 
format. It would be combined with something like a StixProcessor (which I personally 
think should be a StixRecordReader in the new NiFi, btw) or whatever parser, fetcher, 
tailer etc. is appropriate for the feed. 

Btw, I’ve also got early stage implementations of things like Stellar in NiFi 
which would be the starting point for building something like that. 

To address the bulk vs incremental side, we could use the same mechanism to handle 
both, but that would very much suggest moving to the record reader based APIs. That 
should be fine at the hundreds-of-gigabytes scale in NiFi. Does anyone have use cases 
that still sit at the terabytes end, where the existing bulk MapReduce approach would 
be the better fit? 

Simon


> On 19 Feb 2018, at 14:26, Otto Fowler  wrote:
> 
> There are a couple of use cases here for getting the data.
> 
> When you _can_ or want to ingest and duplicate the external store
> 
> 1. Bulk Loading ( from a clean empty state )
> 2. Tailing the feed afterwards
> 
> When you can’t
> 
> 3.  Calling the api ( most likely web ) for reputation or some other thing
> 
> 
> Right now, I *think* we’d use our bulk loader for 1.  I am not sure it can
> be configured for 2.
> NiFi *could* do it, if you wrote your Taxii client such that it was
> stateful and could resume
> after restarts etc and pickup from the right place.
> 
> Right now, we only ingest indicators as raw data.  I do not believe we
> support the reputation and confidence stuff.
> Also, the issue of which version of stix/taxii we support will need to be
> considered.
> 
> I think the idea of a ‘tailing’ topology per service where required would
> be worth looking into, such a topology
> would be transform and index (with a new hbase indexer ) only with no
> enrichment.  We also may want to explore indexing
> enrichments to SEARCH stores or both SEACH and BATCH.
> 
> Like Simon says, there is NiFi, but I would want to consider a metron
> topology because this is a metron managed store,
> and having nifi write to metron’s indicator store, or other threat store is
> wrong I think.  It breaks the application boundary .
> 
> You should take a look at what jiras we currently have, and we can talk
> about what what needs to happen, create the jiras
> and get it rolling.
> 
> I would imagine down the like, that we would support bulk load as we have
> now ‘out of the box’.  And have a new mpack
> for optional threat intel flows available.
> 
> ottO
> 
> On February 19, 2018 at 07:47:39, Andre (andre-li...@fucs.org) wrote:
> 
> Simon,
> 
> I have coded but not merged a STIX / TAXII processor for NiFi that would
> work perfectly fine with this.
> 
> 
> But I will take the opportunity to touch the following points:
> 
> 
> 1. Threat Intel is more frequently than not based on API lookups (e.g.
> VirusTotal, RBLs and correlated, Umbrella's top million, etc). How are
> those going to be consistently managed?
> 
> 2. Threat feeds are frequently classified in regards to confidence but
> today the default Metron schema seems to lack any similar concept? Do we
> have plans to address it?
> 
> 3. Atemporal matching - Given the use of big data technologies it seems to
> me Metron should be able to look into past enrichment data in order to
> classify traffic. I am not sure this is possible today?
> 
> 
> Cheers
> 
> 
> On Mon, Feb 19, 2018 at 8:48 PM, Simon Elliston Ball <
> si...@simonellistonball.com> wrote:
> 
>> Would it make sense to lean on something like Apache NiFi for this? It
>> seems a good fit to handle getting data from wherever (web service, poll,
>

Re: [DISCUSS] community view/roadmap of threat intel

2018-02-19 Thread Simon Elliston Ball
Would it make sense to lean on something like Apache NiFi for this? It seems a good 
fit for getting data from wherever it lives (web services, polling, push, streams, 
etc.). If we were to build a processor which encapsulated the threat intel loader 
logic, that would provide a granular route to update threat intel entries in a more 
streaming manner. We could of course do the same thing in code with Storm topologies, 
but I would wonder whether threat intel feeds have enough volume to require that. 

Simon

> On 16 Feb 2018, at 07:11, Ali Nazemian  wrote:
> 
> I think one of the challenges is where the scope of threat intel ends from
> the Metron roadmap? Does it gonna relly on supporting a standard format and
> a loader to send it to HBase for the later threat intel use cases?
> 
> In my opinion, it would be better to have a separate topology (sort of
> similar to the profiler approach) to get the feeds (maybe from Kafka) and
> load it into HBase frequently based on what criteria we want to have. Maybe
> we need to have some normalizations for the threat feeds (either aggregated
> or single feed) as an example (or any other transformation by using
> Stellar). Maybe we need to tailor row_key in a way that can be utilised
> based on the threat intel look up we want to have further from the
> enrichment topology. The problem I see with different loaders in Metron is
> we can mostly use them for the purpose of POC, but if you want to build an
> actual use case for a production platform then it will be out of the
> flexibility of a loader, so we will end up feeding data to HBase based on
> our use case.
> 
> In this case, maybe it won't be very important we want to use an aggregator
> X or aggregator Y, we can integrate it with Metron based on integration
> points.
> 
> Cheers,
> Ali
> 
> On Wed, Feb 14, 2018 at 11:28 PM, Simon Elliston Ball <
> si...@simonellistonball.com> wrote:
> 
>> We used to install soltra edge in the old ansible builds (which have
>> thankfully now been pared back in the interests of stability in full dev).
>> Soltra has not been a good option since they went proprietary, so since
>> then we’ve included opentaxii (BSD 3) as a discovery and aggregator.
>> 
>> Most of the challenges are around licensing. Hippocampe is part of The
>> Hive Project, which is AGPL, which is an apache category X license so can’t
>> be included.
>> 
>> Mindmeld is much better license-wise (Apache 2) so would be well worth
>> community consideration. I kinda like it as a framework, but
>> 
>> I for one would be very pleased to hear a broader community discussion
>> around which platforms we should have integrations with via the threat
>> intel loader, or even through a direct to hbase streaming connector.
>> 
>> Simon
>> 
>>> On 14 Feb 2018, at 03:13, Ali Nazemian  wrote:
>>> 
>>> Hi All,
>>> 
>>> I would like to understand Metron community view on Threat Intel
>>> aggregators as well as the roadmap of threat intelligence and threat
>>> hunting. There are some open source options available regarding threat
>>> intel aggregator such as Minemeld, Hippocampe, etc. Is there any plan to
>>> build that as a part of Metron in future? Is there any specific
>> aggregator
>>> you think would be more aligned with Metron roadmap?
>>> 
>>> Cheers,
>>> Ali
>> 
>> 
> 
> 
> -- 
> A.Nazemian



Re: [DISCUSS] community view/roadmap of threat intel

2018-02-14 Thread Simon Elliston Ball
We used to install soltra edge in the old ansible builds (which have thankfully 
now been pared back in the interests of stability in full dev). Soltra has not 
been a good option since they went proprietary, so since then we’ve included 
opentaxii (BSD 3) as a discovery and aggregator. 

Most of the challenges are around licensing. Hippocampe is part of The Hive 
Project, which is AGPL, which is an apache category X license so can’t be 
included. 

Mindmeld is much better license-wise (Apache 2) so would be well worth 
community consideration. I kinda like it as a framework, but 

I for one would be very pleased to hear a broader community discussion around 
which platforms we should have integrations with via the threat intel loader, 
or even through a direct to hbase streaming connector. 

Simon

> On 14 Feb 2018, at 03:13, Ali Nazemian  wrote:
> 
> Hi All,
> 
> I would like to understand Metron community view on Threat Intel
> aggregators as well as the roadmap of threat intelligence and threat
> hunting. There are some open source options available regarding threat
> intel aggregator such as Minemeld, Hippocampe, etc. Is there any plan to
> build that as a part of Metron in future? Is there any specific aggregator
> you think would be more aligned with Metron roadmap?
> 
> Cheers,
> Ali



Re: Disable Metron parser output writer entirely

2018-02-05 Thread Simon Elliston Ball
I expect the performance would be dire. If you really wanted to do something 
like this, a custom writer might make sense. KAFKA_PUT is really meant for 
debugging use cases only. It’s a very non-stellar construct (non-expression, no 
return, side-effect dependent…) Also, it creates a producer for every call, so 
you are definitely not going to get performance out of it. 
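
The producer-per-call cost is easy to see outside Storm; a rough micro-benchmark sketch (broker address, topic and message count are placeholders, and this is obviously not the Stellar code path itself):

import time
from kafka import KafkaProducer

BROKERS = "node1:6667"     # placeholder
COUNT = 1000

def producer_per_message():
    """Roughly what a per-call producer does: connect, send, flush, close every time."""
    for i in range(COUNT):
        producer = KafkaProducer(bootstrap_servers=BROKERS)
        producer.send("perf_test", b"message %d" % i)
        producer.flush()
        producer.close()

def shared_producer():
    """What a long-lived writer does: one connection reused for every message."""
    producer = KafkaProducer(bootstrap_servers=BROKERS)
    for i in range(COUNT):
        producer.send("perf_test", b"message %d" % i)
    producer.flush()
    producer.close()

for fn in (producer_per_message, shared_producer):
    start = time.time()
    fn()
    print(fn.__name__, round(time.time() - start, 2), "seconds")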

Simon

> On 5 Feb 2018, at 06:32, Ali Nazemian  wrote:
> 
> What about the performance difference?
> 
> On Fri, Feb 2, 2018 at 10:41 PM, Otto Fowler 
> wrote:
> 
>> You cannot.
>> 
>> 
>> 
>> On February 1, 2018 at 23:51:28, Ali Nazemian (alinazem...@gmail.com)
>> wrote:
>> 
>> Hi All,
>> 
>> I am trying to investigate whether we can disable a Metron parser output
>> writer entirely and manage it via KAFKA_PUT Stellar function instead.
>> First, is it possible via configuration? Second, will be any performance
>> difference between normal Kafka writer and the Stellar version of it
>> (KAFKA_PUT).
>> 
>> Regards,
>> Ali
>> 
>> 
> 
> 
> -- 
> A.Nazemian



Re: [DISCUSS] Persistence store for user profile settings

2018-02-02 Thread Simon Elliston Ball
Glad you agree with me that this isn’t HBase scale… it’s clearly not. I would 
never suggest introducing HBase for something like this, but since it’s there.

On the idea of using the Ambari RDBMS on the same basis of it already being there, I 
see your point. That said, it can be Postgres, SQL Server, MySQL, MariaDB, Oracle… 
various. Yes, we have an ORM, but those are not nearly as magic as they claim, and 
upgrade / schema evolution of an RDBMS often involves some sort of platform-dependent 
SQL migration in my experience. I would suggest that supporting that range of options 
is not a good idea for us. The Ambari project also pretty much reserves the right to 
blow away that infrastructure in upgrades (which is fair enough). So relying on there 
being an RDBMS owned by another component is not something I would call a clean 
choice. 
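
To show how thin the HBase path is for this key => value access pattern, a sketch using happybase (the table, column family and settings shape are made up, and this is not a proposal for the actual REST implementation):

import json
import happybase

connection = happybase.Connection("hbase-master")     # assumption: HBase Thrift gateway
table = connection.table("user_settings")             # hypothetical table with cf 'cf'

def save_settings(username, settings):
    """Store the whole settings blob under the username rowkey."""
    table.put(username.encode(), {b"cf:settings": json.dumps(settings).encode()})

def load_settings(username):
    row = table.row(username.encode())
    blob = row.get(b"cf:settings")
    return json.loads(blob) if blob else {}

save_settings("analyst1", {"facetFields": ["ip_src_addr"], "savedSearches": []})
print(load_settings("analyst1"))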

Simon

> On 2 Feb 2018, at 13:50, Nick Allen  wrote:
> 
> I fall marginally on the side of an RDBMS.  There is definitely a case to
> be made on both sides, but I'll point out a few things for the RDBMS.
> 
> 
> (1) Flexibility.  Using an RDBMS is going to provide us with much greater
> flexibility going forward.  We really don't know what the specific use
> cases will be, but I am willing to bet they are user-focused (preferences,
> etc).  The type of use cases that most web applications use an RDBMS for.
> 
> 
>> If anything I would like to see the current RDBMS dependency come out...
> 
> (2) Don't we already have an RDBMS requirement for Ambari?  That's a
> dependency that we do not control.
> 
> 
>> ... hbase seems a good option (because we already have it there, it would
> be kinda crazy at this scale if we didn’t already have it)
> 
> (3) In this scenario, the RDBMS would not scale proportionally with the
> amount of telemetry, it would scale based on usage; primarily the number of
> users.  This is not "big data" scale.  I don't think we can make the case
> for HBase based on scale here.
> 
> 
>> We would also end up with, as Mike points out, a whole new disk
> deployment patterns and a bunch of additional DBA ops process requirements
> for every install.
> 
> (4) Most users that need HA/DR (and other 'advanced stuff'), are
> enterprises and organizations that are already very familiar with RDBMS
> solutions and have the infrastructure in place to manage those.  For users
> that don't need HA/DR, just use the DB that gets spun-up with Ambari.
> 
> 
> 
> 
> 
> On Fri, Feb 2, 2018 at 7:17 AM Simon Elliston Ball <
> si...@simonellistonball.com> wrote:
> 
>> Introducing a RDBMS to the stack seems unnecessary for this.
>> 
>> If we consider the data access patterns for user profiles, we are unlikely
>> to query into them, or indeed do anything other than look them up, or write
>> them out by a username key. To that end, using an ORM to translate a a
>> nested config object into a load of tables seems to introduce complexity
>> and brittleness we then have to take away through relying on relational
>> consistency models. We would also end up with, as Mike points out, a whole
>> new disk deployment patterns and a bunch of additional DBA ops process
>> requirements for every install.
>> 
>> Since the access pattern is almost entirely key => value, hbase seems a
>> good option (because we already have it there, it would be kinda crazy at
>> this scale if we didn’t already have it) or arguably zookeeper, but that
>> might be at the other end of the scale argument. I’d even go as far as to
>> suggest files on HDFS to keep it simple.
>> 
>> Simon
>> 
>>> On 1 Feb 2018, at 23:24, Michael Miklavcic 
>> wrote:
>>> 
>>> Personally, I'd be in favor of something like Maria DB as an open source
>>> repo. Or any other ansi sql store. On the positive side, it should mesh
>>> seamlessly with ORM tools. And the schema for this should be pretty
>>> vanilla, I'd imagine. I might even consider skipping ORM for straight
>> JDBC
>>> and simple command scripts in Java for something this small. I'm not
>>> worried so much about migrations of this sort. Large scale DBs can get
>>> involved with major schema changes, but thats usually when the datastore
>> is
>>> a massive set of tables with complex relationships, at least in my
>>> experience.
>>> 
>>> We could also use hbase, which probably wouldn't be that hard either, but
>>> there may be more boilerplate to write for the client as compared to
>>> standard SQL. But I'm assuming we could reuse a fair amount of existing
>>> code from our enrichments

Re: [DISCUSS] Persistence store for user profile settings

2018-02-02 Thread Simon Elliston Ball
Couldn’t agree with you more Otto! On the perms / ACLs / AXOs / groups / users 
etc concerns though, there are other Apache projects (such as Ranger) which 
have already done a lot of the hard thinking and architecture / data structure 
/ admin ui and persistence pieces for us, so I’d say we lean on them before 
designing our own approach to IAM. 

Simon

> On 2 Feb 2018, at 13:22, Otto Fowler  wrote:
> 
> Fair enough,  I don’t have a preference.  I think my point is that we need to 
> understand the use cases we can think of more, especially if we are going to 
> be having permissions, grouping and crud around that, and preloading, before 
> just throwing everything in RDBMS -or- HBASE.
> 
> 
> 
> On February 2, 2018 at 08:08:24, Simon Elliston Ball 
> (si...@simonellistonball.com <mailto:si...@simonellistonball.com>) wrote:
> 
>> True, and that is a requirement I’ve heard a lot (standard views or field 
>> sets in shared sets of saved search for example). That would definitely rule 
>> out sticking with the current approach (browser local storage, per Casey’s 
>> suggestion below). 
>> 
>> That said, I’m not sure that changes my views on RDBMS. There is an argument 
>> that a single query from RDBMS could return a set of group prefs with a user 
>> overlay, but that’s not that much better than pulling groups and overwriting 
>> the maps clientside with user, from the key value store. We’re not talking 
>> about huge amounts of preference data here. I could be swayed the other way 
>> if we were to use the RDBMS as a canonical store for user and group 
>> information (we use it for users right now, in a really not great way) but I 
>> would much rather see us plugin to the Hadoop ecosystem and use something 
>> like Ranger to sync users, or an LDAP source directly for user and group 
>> data, because I suspect no one wants to have to administer a separate user 
>> database for Metron and open up the result IAM security hole we currently 
>> have (on that, let’s at least stop storing plain text passwords!) /rant. 
>> 
>> If anything I would like to see the current RDBMS dependency come out to 
>> reduce the overall complexity, unless we have a use case that genuinely 
>> benefits from a normalised data structure, or from SQL access patterns. 
>> 
>> In short, I would still go with LDAP / Ranger or users and groups, and 
>> instead of adding an RDBMS, using group prefs and user prefs in the existing 
>> KV store (HBase) to reduce the operational maintenance burden on the 
>> platform. 
>> 
>> Simon
>> 
>>> On 2 Feb 2018, at 12:50, Otto Fowler >> <mailto:ottobackwa...@gmail.com>> wrote:
>>> 
>>> It is not uncommon to want to have ‘shared’ preferences or setups.   Think 
>>> of shared dashboards or queries vs. personal version in jira.  Would RDBMS 
>>> help with that?
>>> 
>>> 
>>> 
>>> On February 2, 2018 at 07:17:04, Simon Elliston Ball 
>>> (si...@simonellistonball.com <mailto:si...@simonellistonball.com>) wrote:
>>> 
>>>> Introducing a RDBMS to the stack seems unnecessary for this. 
>>>> 
>>>> If we consider the data access patterns for user profiles, we are unlikely 
>>>> to query into them, or indeed do anything other than look them up, or 
>>>> write them out by a username key. To that end, using an ORM to translate a 
>>>> a nested config object into a load of tables seems to introduce complexity 
>>>> and brittleness we then have to take away through relying on relational 
>>>> consistency models. We would also end up with, as Mike points out, a whole 
>>>> new disk deployment patterns and a bunch of additional DBA ops process 
>>>> requirements for every install. 
>>>> 
>>>> Since the access pattern is almost entirely key => value, hbase seems a 
>>>> good option (because we already have it there, it would be kinda crazy at 
>>>> this scale if we didn’t already have it) or arguably zookeeper, but that 
>>>> might be at the other end of the scale argument. I’d even go as far as to 
>>>> suggest files on HDFS to keep it simple.  
>>>> 
>>>> Simon 
>>>> 
>>>> > On 1 Feb 2018, at 23:24, Michael Miklavcic >>> > <mailto:michael.miklav...@gmail.com>> wrote: 
>>>> >  
>>>> > Personally, I'd be in favor of something like Maria DB as an open source 
>>>> > repo. Or any other ansi sql store. On the positive side, it should mesh 
>>>> > seamlessl

Re: [DISCUSS] Persistence store for user profile settings

2018-02-02 Thread Simon Elliston Ball
True, and that is a requirement I’ve heard a lot (standard views or field sets 
in shared sets of saved searches, for example). That would definitely rule out 
sticking with the current approach (browser local storage, per Casey’s 
suggestion below). 

That said, I’m not sure that changes my views on RDBMS. There is an argument 
that a single query from an RDBMS could return a set of group prefs with a user 
overlay, but that’s not much better than pulling the group maps from the key 
value store and overwriting them client-side with the user’s values. We’re not 
talking about huge amounts of preference data here. I could be swayed the other 
way if we were to use the RDBMS as a canonical store for user and group 
information (we use it for users right now, in a really not great way), but I 
would much rather see us plug in to the Hadoop ecosystem and use something like 
Ranger to sync users, or an LDAP source directly, for user and group data, 
because I suspect no one wants to administer a separate user database for Metron 
and open up the resulting IAM security hole we currently have (on that, let’s at 
least stop storing plain text passwords!) /rant. 

If anything I would like to see the current RDBMS dependency come out to reduce 
the overall complexity, unless we have a use case that genuinely benefits from 
a normalised data structure, or from SQL access patterns. 

In short, I would still go with LDAP / Ranger for users and groups, and instead 
of adding an RDBMS, use group prefs and user prefs in the existing KV store 
(HBase) to reduce the operational maintenance burden on the platform. 
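
To make the client-side overlay concrete, here is a minimal sketch of the kind 
of resolution I mean, once the group and user preference maps have been read 
from the KV store. This is a hypothetical helper, not anything in the Metron 
codebase today:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PreferenceOverlay {
  // Apply group defaults in order, then let the user's own settings win on any
  // key they have explicitly set.
  public static Map<String, Object> resolve(List<Map<String, Object>> groupPrefs,
                                            Map<String, Object> userPrefs) {
    Map<String, Object> resolved = new HashMap<>();
    for (Map<String, Object> group : groupPrefs) {
      resolved.putAll(group);   // later groups override earlier ones
    }
    resolved.putAll(userPrefs); // the user's overlay always wins
    return resolved;
  }
}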

Simon

> On 2 Feb 2018, at 12:50, Otto Fowler  wrote:
> 
> It is not uncommon to want to have ‘shared’ preferences or setups.   Think of 
> shared dashboards or queries vs. personal version in jira.  Would RDBMS help 
> with that?
> 
> 
> 
> On February 2, 2018 at 07:17:04, Simon Elliston Ball 
> (si...@simonellistonball.com <mailto:si...@simonellistonball.com>) wrote:
> 
>> Introducing a RDBMS to the stack seems unnecessary for this. 
>> 
>> If we consider the data access patterns for user profiles, we are unlikely 
>> to query into them, or indeed do anything other than look them up, or write 
>> them out by a username key. To that end, using an ORM to translate a a 
>> nested config object into a load of tables seems to introduce complexity and 
>> brittleness we then have to take away through relying on relational 
>> consistency models. We would also end up with, as Mike points out, a whole 
>> new disk deployment patterns and a bunch of additional DBA ops process 
>> requirements for every install. 
>> 
>> Since the access pattern is almost entirely key => value, hbase seems a good 
>> option (because we already have it there, it would be kinda crazy at this 
>> scale if we didn’t already have it) or arguably zookeeper, but that might be 
>> at the other end of the scale argument. I’d even go as far as to suggest 
>> files on HDFS to keep it simple.  
>> 
>> Simon 
>> 
>> > On 1 Feb 2018, at 23:24, Michael Miklavcic > > <mailto:michael.miklav...@gmail.com>> wrote: 
>> >  
>> > Personally, I'd be in favor of something like Maria DB as an open source 
>> > repo. Or any other ansi sql store. On the positive side, it should mesh 
>> > seamlessly with ORM tools. And the schema for this should be pretty 
>> > vanilla, I'd imagine. I might even consider skipping ORM for straight JDBC 
>> > and simple command scripts in Java for something this small. I'm not 
>> > worried so much about migrations of this sort. Large scale DBs can get 
>> > involved with major schema changes, but thats usually when the datastore 
>> > is 
>> > a massive set of tables with complex relationships, at least in my 
>> > experience. 
>> >  
>> > We could also use hbase, which probably wouldn't be that hard either, but 
>> > there may be more boilerplate to write for the client as compared to 
>> > standard SQL. But I'm assuming we could reuse a fair amount of existing 
>> > code from our enrichments. One additional reason in favor of hbase might 
>> > be 
>> > data replication. For a SQL instance we'd probably recommend a RAID store 
>> > or backup procedure, but we get that pretty easy with hbase too. 
>> >  
>> > On Feb 1, 2018 2:45 PM, "Casey Stella" > > <mailto:ceste...@gmail.com>> wrote: 
>> >  
>> >> So, I'll answer your question with some questions: 
>> >>  
>> >> - No matter the data store we use upgrading will take some care, right? 
>> >> - Do we currently depend

Re: [DISCUSS] Persistence store for user profile settings

2018-02-02 Thread Simon Elliston Ball
Introducing a RDBMS to the stack seems unnecessary for this.

If we consider the data access patterns for user profiles, we are unlikely to 
query into them, or indeed do anything other than look them up, or write them 
out, by a username key. To that end, using an ORM to translate a nested config 
object into a load of tables seems to introduce complexity and brittleness we 
then have to take away through relying on relational consistency models. We 
would also end up with, as Mike points out, a whole new set of disk deployment 
patterns and a bunch of additional DBA ops process requirements for every 
install.

Since the access pattern is almost entirely key => value, hbase seems a good 
option (because we already have it there, it would be kinda crazy at this scale 
if we didn’t already have it) or arguably zookeeper, but that might be at the 
other end of the scale argument. I’d even go as far as to suggest files on HDFS 
to keep it simple. 
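
For what it’s worth, the amount of client code the key => value approach needs 
is tiny. A rough sketch with the standard HBase client (table and column names 
here are made up for illustration, not a schema proposal):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class UserProfileStore {
  // Hypothetical table and column names, purely for illustration.
  private static final TableName TABLE = TableName.valueOf("user_settings");
  private static final byte[] CF = Bytes.toBytes("prefs");
  private static final byte[] COL = Bytes.toBytes("profile_json");

  public static void save(Connection conn, String username, String profileJson) throws Exception {
    try (Table table = conn.getTable(TABLE)) {
      Put put = new Put(Bytes.toBytes(username));          // row key is just the username
      put.addColumn(CF, COL, Bytes.toBytes(profileJson));  // whole profile stored as one JSON blob
      table.put(put);
    }
  }

  public static String load(Connection conn, String username) throws Exception {
    try (Table table = conn.getTable(TABLE)) {
      Result result = table.get(new Get(Bytes.toBytes(username)));
      byte[] value = result.getValue(CF, COL);
      return value == null ? null : Bytes.toString(value);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      save(conn, "alice", "{\"facetFields\":[\"ip_src_addr\",\"source:type\"]}");
      System.out.println(load(conn, "alice"));
    }
  }
}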

Simon

> On 1 Feb 2018, at 23:24, Michael Miklavcic  
> wrote:
> 
> Personally, I'd be in favor of something like Maria DB as an open source
> repo. Or any other ansi sql store. On the positive side, it should mesh
> seamlessly with ORM tools. And the schema for this should be pretty
> vanilla, I'd imagine. I might even consider skipping ORM for straight JDBC
> and simple command scripts in Java for something this small. I'm not
> worried so much about migrations of this sort. Large scale DBs can get
> involved with major schema changes, but thats usually when the datastore is
> a massive set of tables with complex relationships, at least in my
> experience.
> 
> We could also use hbase, which probably wouldn't be that hard either, but
> there may be more boilerplate to write for the client as compared to
> standard SQL. But I'm assuming we could reuse a fair amount of existing
> code from our enrichments. One additional reason in favor of hbase might be
> data replication. For a SQL instance we'd probably recommend a RAID store
> or backup procedure, but we get that pretty easy with hbase too.
> 
> On Feb 1, 2018 2:45 PM, "Casey Stella"  wrote:
> 
>> So, I'll answer your question with some questions:
>> 
>>   - No matter the data store we use upgrading will take some care, right?
>>   - Do we currently depend on a RDBMS anywhere?  I want to say that we do
>>   in the REST layer already, right?
>>   - If we don't use a RDBMs, what's the other option?  What are the pros
>>   and cons?
>>   - Have we considered non-server offline persistent solutions (e.g.
>>   https://www.html5rocks.com/en/features/storage)?
>> 
>> 
>> 
>> On Thu, Feb 1, 2018 at 9:11 AM, Ryan Merriman  wrote:
>> 
>>> There is currently a PR up for review that allows a user to configure and
>>> save the list of facet fields that appear in the left column of the
>> Alerts
>>> UI:  https://github.com/apache/metron/pull/853.  The REST layer has ORM
>>> support which means we can store those in a relational database.
>>> 
>>> However I'm not 100% sure this is the best place to keep this.  As we add
>>> more use cases like this the backing tables in the RDBMS will need to be
>>> managed.  This could make upgrading more tedious and error-prone.  Is
>> there
>>> are a better way to store this, assuming we can leverage a component
>> that's
>>> already included in our stack?
>>> 
>>> Ryan
>>> 
>> 



Re: When things change in hdfs, how do we know

2018-01-31 Thread Simon Elliston Ball
I take it your service would just be a thin daemon along the lines of the PoC 
you linked, which makes a lot of sense, delegating the actual notification to 
the zookeeper bits we already have.

That makes sense to me. One other question would be around the availability of 
that service (which is not exactly critical, but it would be nice to be able to 
run it HA). As far as I can see it’s not likely to be stateful, and as long as 
there is some sort of de-dupe you could have two or more running. Is that worth 
chewing over, or do we just need one running and accept occasional outages of a 
rarely firing, non-critical service?
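
To show how thin that daemon could be, here is a rough sketch of the 
INotify-to-ZooKeeper loop (hostnames, paths and the znode are placeholders, and 
this is not the PoC code itself): it tails the NameNode event stream and just 
touches a znode when a watched path changes, leaving the actual reconfiguration 
to whatever is already watching ZooKeeper through Curator. Note that, as far as 
I know, reading the inotify stream needs HDFS superuser privileges, which is 
another argument for running this as a separate service rather than inside the 
topologies.

import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSInotifyEventInputStream;
import org.apache.hadoop.hdfs.client.HdfsAdmin;
import org.apache.hadoop.hdfs.inotify.Event;
import org.apache.hadoop.hdfs.inotify.EventBatch;

public class HdfsChangeNotifier {
  public static void main(String[] args) throws Exception {
    String watchedPrefix = "/apps/metron/patterns/";   // hypothetical HDFS path to watch
    String notifyZnode = "/metron/hdfs_changes";       // hypothetical znode to touch

    CuratorFramework zk = CuratorFrameworkFactory
        .newClient("zookeeper:2181", new ExponentialBackoffRetry(1000, 3));
    zk.start();

    HdfsAdmin admin = new HdfsAdmin(URI.create("hdfs://namenode:8020"), new Configuration());
    DFSInotifyEventInputStream events = admin.getInotifyEventStream();

    while (true) {
      EventBatch batch = events.take();                // blocks until the NameNode has new events
      for (Event event : batch.getEvents()) {
        // Only file-close events handled here for brevity; a real daemon would watch renames etc.
        if (event.getEventType() == Event.EventType.CLOSE) {
          String path = ((Event.CloseEvent) event).getPath();
          if (path.startsWith(watchedPrefix)) {
            if (zk.checkExists().forPath(notifyZnode) == null) {
              zk.create().creatingParentsIfNeeded().forPath(notifyZnode);
            }
            // Touching the znode is enough: Curator watchers fire and re-read their config.
            zk.setData().forPath(notifyZnode, path.getBytes(StandardCharsets.UTF_8));
          }
        }
      }
    }
  }
}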

Simon

> On 31 Jan 2018, at 17:24, Otto Fowler  wrote:
> 
> No,
> 
> I would propose a new Ambari Service, the did the notify->zookeeper stuff.
> Did you not see my awesome ascii art diagram?
> 
> 
> 
> 
> On January 31, 2018 at 11:51:51, Casey Stella (ceste...@gmail.com) wrote:
> 
> Well, it'll be one listener per worker and if you have a lot of workers,
> it's going to be a bad time probably.
> 
> On Wed, Jan 31, 2018 at 11:50 AM, Otto Fowler 
> wrote:
> 
>> I don’t think the Unstable means the implementation will crash.  I think
>> it means
>> it is a newish-api, and there should be 1 listeners.
>> 
>> Having 1 listener shouldn’t be an issue.
>> 
>> 
>> 
>> On January 31, 2018 at 11:45:54, Casey Stella (ceste...@gmail.com) wrote:
>> 
>> Hmm, I have heard this feedback before. Perhaps a more low-key approach
>> would be either a static timer that checked or a timer bolt that sent a
>> periodic timer and the parser bolt reconfigured the parser (or indeed we
>> added a Reloadable interface with a 'reload' method). We could be smart
>> also and only set up the topology with the timer bolt if the parser
>> actually implemented the Reloadable interface. Just some thoughts that
>> might be easy and avoid instability.
>> 
>> On Tue, Jan 30, 2018 at 3:42 PM, Otto Fowler 
>> wrote:
>> 
>>> It is still @unstable, but the jiras :
>>> https://issues.apache.org/jira/browse/HDFS-8940?jql=
>>> project%20%3D%20HDFS%20AND%20status%20in%20(Open%2C%20%
>>> 22In%20Progress%22)%20AND%20text%20~%20%22INotify%22
>>> that I see are stall from over the summer.
>>> 
>>> They also seem geared to scale or changing the filter object not the api.
>>> 
>>> 
>>> 
>>> On January 30, 2018 at 14:19:56, JJ Meyer (jjmey...@gmail.com) wrote:
>>> 
>>> Hello all,
>>> 
>>> I had created a NiFi processor a long time back that used the inotify
>> API.
>>> One thing I noticed while working with it is that it is marked with the
>>> `Unstable` annotation. It may be worth checking if anymore work is going
>> on
>>> with it and if it will impact this (if it hasn't already been looked
>> into).
>>> 
>>> Thanks,
>>> JJ
>>> 
>>> On Mon, Jan 29, 2018 at 7:27 AM, Otto Fowler 
>>> wrote:
>>> 
 I have updated the jira as well
 
 
 On January 29, 2018 at 08:22:34, Otto Fowler (ottobackwa...@gmail.com)
 wrote:
 
 https://github.com/ottobackwards/hdfs-inotify-zookeeper
 
>>> 
>> 
>> 



Re: Enrichment and indexing routing mechanism

2018-01-29 Thread Simon Elliston Ball
Flow is:

Parser (including the parser class, and all transformations, including stellar 
transformations) -> Kafka (enrichments) 

Kafka (enrichments) -> Enrichment topology with all it’s Stellary goodness -> 
Kafka (indexing) 

Kafka (indexing) -> Indexing topologies (ES / Solr / HDFS) configured based on 
the indexing config named the same as source.type -> wherever the indexer tells 
it to be.
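
To illustrate the last hop: each indexing topology looks up a config named 
after source.type. For a sensor with source.type "bro", the indexing config 
would look roughly like this (shape from memory, so treat it as a sketch rather 
than a reference):

{
  "elasticsearch": {
    "index": "bro",
    "batchSize": 5,
    "enabled": true
  },
  "hdfs": {
    "index": "bro",
    "batchSize": 25,
    "enabled": true
  }
}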

Simon

> On 29 Jan 2018, at 11:53, Ali Nazemian  wrote:
> 
> Thanks, Simon. When will it apply for the enrichment? Is that after parser
> and post-parser Stellar implementation? I am trying to understand If I
> change it in post-parser Stellar, will it be overwritten at the last step
> of Parser topology or not?
> 
> Cheers,
> Ali
> 
> On Mon, Jan 29, 2018 at 8:55 PM, Simon Elliston Ball <
> si...@simonellistonball.com> wrote:
> 
>> Yes, it is.
>> 
>> Sent from my iPhone
>> 
>>> On 29 Jan 2018, at 09:33, Ali Nazemian  wrote:
>>> 
>>> Hi All,
>>> 
>>> I was wondering how the routing mechanism works in Metron currently. Can
>>> somebody please explain how Enrichment Storm topology understands a
>> single
>>> event is related to which Metron feed? What about indexing? is that based
>>> on "source.type" field?
>>> 
>>> Cheers,
>>> Ali
>> 
> 
> 
> 
> -- 
> A.Nazemian



Re: Enrichment and indexing routing mechanism

2018-01-29 Thread Simon Elliston Ball
Yes, it is.

Sent from my iPhone

> On 29 Jan 2018, at 09:33, Ali Nazemian  wrote:
> 
> Hi All,
> 
> I was wondering how the routing mechanism works in Metron currently. Can
> somebody please explain how Enrichment Storm topology understands a single
> event is related to which Metron feed? What about indexing? is that based
> on "source.type" field?
> 
> Cheers,
> Ali


Re: [DISCUSS] Update Metron Elasticsearch index names to metron_

2018-01-26 Thread Simon Elliston Ball
+1 on this. The idea of a default broad matching template should also include 
an order entry to avoid conflicts with more specific templates, and we should 
then document the need for a higher order value in all per-source index 
templates. 
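
As a sketch of the ordering idea (values are illustrative), the broad template 
would match everything at a low order:

{
  "template": "metron_*",
  "order": 0,
  "mappings": {
    "metron_doc": {}
  }
}

and each per-source template (say "template": "metron_bro_index*") would 
declare a higher order, e.g. "order": 10, so that its more specific settings 
win wherever the two conflict.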

In terms of production migration, I think we may want to provide some detailed 
documentation in the upgrade guide on this, because there will be people with a 
lot of existing indices that will be difficult to handle. We may also need some 
tooling, but I expect docs would do the job. What do people think about 
migration?

Simon

> 
> One other benefit of this revised approach - we can more effectively use
> index template patterns to specify our base set of Metron property types.
> Call me crazy, but I think we should be able to do something like:
> 
> 
> 
> {
>  *"template": "metron_*",*
>  "mappings": {
>"metron_doc": {
>  "dynamic_templates": [
>  {
>"geo_location_point": {
>  "match": "enrichments:geo:*:location_point",
>  "match_mapping_type": "*",
>  "mapping": {
>"type": "geo_point"
>  }
>}
>  },
>  {
>"geo_country": {
>  "match": "enrichments:geo:*:country",
>  "match_mapping_type": "*",
>  "mapping": {
>"type": "keyword"
>  }
>}
>  },
>  {
>"geo_city": {
>  "match": "enrichments:geo:*:city",
>  "match_mapping_type": "*",
>  "mapping": {
>"type": "keyword"
>  }
>}
>  },
>  {
>"geo_location_id": {
>  "match": "enrichments:geo:*:locID",
>  "match_mapping_type": "*",
>  "mapping": {
>"type": "keyword"
>  }
>}
>  },
>  {
>"geo_dma_code": {
>  "match": "enrichments:geo:*:dmaCode",
>  "match_mapping_type": "*",
>  "mapping": {
>"type": "keyword"
>  }
>}
>  },
>  {
>"geo_postal_code": {
>  "match": "enrichments:geo:*:postalCode",
>  "match_mapping_type": "*",
>  "mapping": {
>"type": "keyword"
>  }
>}
>  },
>  {
>"geo_latitude": {
>  "match": "enrichments:geo:*:latitude",
>  "match_mapping_type": "*",
>  "mapping": {
>"type": "float"
>  }
>}
>  },
>  {
>"geo_longitude": {
>  "match": "enrichments:geo:*:longitude",
>  "match_mapping_type": "*",
>  "mapping": {
>"type": "float"
>  }
>}
>  },
>  {
>"timestamps": {
>  "match": "*:ts",
>  "match_mapping_type": "*",
>  "mapping": {
>"type": "date",
>"format": "epoch_millis"
>  }
>}
>  },
>  {
>"threat_triage_score": {
>  "mapping": {
>"type": "float"
>  },
>  "match": "threat:triage:*score",
>  "match_mapping_type": "*"
>}
>  },
>  {
>"threat_triage_reason": {
>  "mapping": {
>"type": "text",
>"fielddata": "true"
>  },
>  "match": "threat:triage:rules:*:reason",
>  "match_mapping_type": "*"
>}
>  },
>  {
>"threat_triage_name": {
>  "mapping": {
>"type": "text",
>"fielddata": "true"
>  },
>  "match": "threat:triage:rules:*:name",
>  "match_mapping_type": "*"
>}
>  }
> 
> ]}}
> 
> That means that for every new sensor we bring on board we can skip
> adding that boiler plate mapping config to every new template.
> 
> 
> 
> On Wed, Jan 24, 2018 at 6:34 PM, Michael Miklavcic <
> michael.miklav...@gmail.com> wrote:
> 
>> I hear you Ali. I think this type of change would actually ease issues
>> with downtime because it offers an easy path to migrating existing indices.
>> I'd have to review the specifics in the ES docs again, but I believe you
>> could duplicate the old indexes and migrate them to "metron_" in advance of
>> the upgrade, and then consume new data to the new index pattern/name after
>> the upgrade. That should be pretty seamless, I think. I guess it depends on
>> how you're using ES.
>> 
>> On Wed, Jan 24, 2018 at 4:08 PM, Ali Nazemian 
>> wrote:
>> 
>>> Hi All,
>>> 
>>> I just wanted to say it would be great if we can be careful with these
>>> type
>>> of changes. From the development point of view, it is just a few lines of
>>> code which can provide multiple advantages, but for live large-scale
>>> Metron
>>> platforms, some of these changes might be really expensive to address with
>>> zero-downtime.
>>> 
>>> Cheers,
>>> Ali
>>> 
>>> On Thu, Jan 25, 2018 at 9:29 AM, Otto Fowler 
>>> wrote:
>>> 
 +1
 
 
 On January 24, 2018 at 16:28:42, Nick Allen (n...@nickallen.org) wrote:
 
 +1 to a standard prefix for all Metron indices. I've had the same

Re: Metron User Community Meeting Call

2018-01-26 Thread Simon Elliston Ball
This is going to be a really exciting call. Looking forward to seeing how the 
GCR Canary sings :) 

I’m going to volunteer https://hortonworks.zoom.us/my/simonellistonball as a 
location for the meeting.

I would also support the idea of a quick poll on what people are doing with 
Metron, and maybe, if anyone wants to volunteer at the end of the meeting, it 
would be great to have an open mic on use cases. 

Talk to you all Wednesday. 

Simon

> On 26 Jan 2018, at 22:10, Seal, Steve  wrote:
> 
> HI all,
>  
> I have several people on my team that are looking forward to hearing about 
> Ahmed’s work. 
>  
> Steve
>  
>  
> From: Daniel Schafer [mailto:daniel.scha...@sstech.us] 
> Sent: Friday, January 26, 2018 5:05 PM
> To: u...@metron.apache.org; dev@metron.apache.org
> Subject: Re: Metron User Community Meeting Call
>  
> My team members and me would like to join as well.
> We can provide Zoom Meeting login if necessary.
>  
> Thanks
>  
> Daniel
> 7134806608 
>  
> From: Ahmed Shah  >
> Reply-To: "u...@metron.apache.org " 
> mailto:u...@metron.apache.org>>
> Date: Friday, January 26, 2018 at 2:06 PM
> To: "dev@metron.apache.org " 
> mailto:dev@metron.apache.org>>, 
> "u...@metron.apache.org " 
> mailto:u...@metron.apache.org>>
> Subject: Re: Metron User Community Meeting Call
>  
> Looking forward to presenting!
>  
> Just a thought...
> In advanced should we create a Google Forms to collect survey data on who is 
> using Metron, how they are using it, ext.. and present the results to the 
> group? 
>  
> -Ahmed
> ___
> Ahmed Shah (PMP, M. Eng.)
> Cybersecurity Analyst & Developer 
> GCR - Cybersecurity Operations Center
> Carleton University - cugcr.com 
> 
>  
> 
> From: Andrew Psaltis  >
> Sent: January 26, 2018 1:53 PM
> To: dev@metron.apache.org 
> Subject: Re: Metron User Community Meeting Call
>  
> Count me in. Very interested to hear about Ahmed's journey.
> 
> On Fri, Jan 26, 2018 at 8:58 AM, Kyle Richardson  >
> wrote:
> 
> > Thanks! I'll be there. Excited to hear Ahmed's successes and challenges.
> >
> > -Kyle
> >
> > On Thu, Jan 25, 2018 at 7:44 PM zeo...@gmail.com  
> > mailto:zeo...@gmail.com>> wrote:
> >
> > > Thanks Otto, I'm in to attend at that time/place.
> > >
> > > Jon
> > >
> > > On Thu, Jan 25, 2018, 14:45 Otto Fowler  > > > wrote:
> > >
> > >> I would like to propose a Metron user community meeting. I propose that
> > >> we set the meeting next week, and will throw out Wednesday, January
> > 31st at
> > >> 09:30AM PST, 12:30 on the East Coast and 5:30 in London Towne. This
> > meeting
> > >> will be held over a web-ex, the details of which will be included in the
> > >> actual meeting notice.
> > >> Topics
> > >>
> > >> We have a volunteer for a community member presentation:
> > >>
> > >> Ahmed Shah (PMP, M. Eng.) Cybersecurity Analyst & Developer GCR -
> > >> Cybersecurity Operations Center Carleton University - cugcr.com 
> > >> 
> > >>
> > >> Ahmed would like to talk to the community about
> > >>
> > >>-
> > >>
> > >>Who the GCR group is
> > >>-
> > >>
> > >>How they use Metron 0.4.1
> > >>-
> > >>
> > >>Walk through their dashboards, UI management screen, nifi
> > >>-
> > >>
> > >>Challenges we faced up until now
> > >>
> > >> I would like to thank Ahmed for stepping forward for this meeting.
> > >>
> > >> If you have something you would like to present or talk about please
> > >> reply here! Maybe we can have people ask for “A better explanation of
> > >> feature X” type things?
> > >> Metron User Community Meetings
> > >>
> > >> User Community Meetings are a means for realtime discussion of
> > >> experiences with Apache Metron, or demonstration of how the community is
> > >> using or will be using Apache Metron.
> > >>
> > >> These meetings are geared towards:
> > >>
> > >>-
> > >>
> > >>Demonstrations and knowledge sharing as opposed to technical
> > >>discussion or implementation details from members of the Apache
> > Metron
> > >>Community
> > >>-
> > >>
> > >>Existing Feature demonstrations
> > >>-
> > >>
> > >>Proposed Feature demonstrations
> > >>-
> > >>
> > >>Community

Re: When things change in hdfs, how do we know

2018-01-26 Thread Simon Elliston Ball
Interesting, so you have an INotify listener to filter events, and then on 
given changes, propagate a notification to zookeeper, which then triggers the 
reconfiguration event via the curator client in Metron. I kinda like it given 
our existing zookeeper methods. 

Simon

> On 26 Jan 2018, at 13:27, Otto Fowler  wrote:
> 
> https://github.com/ottobackwards/hdfs-inotify-zookeeper 
> <https://github.com/ottobackwards/hdfs-inotify-zookeeper>
> 
> Working on a poc
> 
> 
> 
> On January 26, 2018 at 07:41:44, Simon Elliston Ball 
> (si...@simonellistonball.com <mailto:si...@simonellistonball.com>) wrote:
> 
>> Should we consider using the Inotify interface to trigger reconfiguration, 
>> in same way we trigger config changes in curator? We also need to fix 
>> caching and lifecycle in the Grok parser to make the zookeeper changes 
>> propagate pattern changes while we’re at it.  
>> 
>> Simon 
>> 
>> > On 26 Jan 2018, at 03:16, Casey Stella > > <mailto:ceste...@gmail.com>> wrote: 
>> >  
>> > Right now you have to restart the parser topology. 
>> >  
>> > On Thu, Jan 25, 2018 at 10:15 PM, Otto Fowler > > <mailto:ottobackwa...@gmail.com>> 
>> > wrote: 
>> >  
>> >> At the moment, when a grok file or something changes in HDFS, how do we 
>> >> know? Do we have to restart the parser topology to pick it up? 
>> >> Just trying to clarify for myself. 
>> >>  
>> >> ottO 
>> >> 



Re: When things change in hdfs, how do we know

2018-01-26 Thread Simon Elliston Ball
Should we consider using the INotify interface to trigger reconfiguration, in 
the same way we trigger config changes via Curator? We also need to fix caching 
and lifecycle in the Grok parser so that ZooKeeper changes actually propagate 
pattern changes, while we’re at it. 

Simon

> On 26 Jan 2018, at 03:16, Casey Stella  wrote:
> 
> Right now you have to restart the parser topology.
> 
> On Thu, Jan 25, 2018 at 10:15 PM, Otto Fowler 
> wrote:
> 
>> At the moment, when a grok file or something changes in HDFS, how do we
>> know?  Do we have to restart the parser topology to pick it up?
>> Just trying to clarify for myself.
>> 
>> ottO
>> 



Re: Metron nested object

2018-01-11 Thread Simon Elliston Ball
I’m all for adding extra stores, especially once we have separated indexing 
topologies.

Druid (and therefore a UI based on Superset) seems an obvious logical store to 
me. That said, the schema management starts to feel like it needs some thought 
once we have a wide enough range of schema-sensitive stores (though I guess 
Druid is no different from ES in that regard).

Simon 

> On 11 Jan 2018, at 20:34, Andre  wrote:
> 
> Simon,
> 
> With the risk of sounding like an heretic:
> 
> Is there any particular reason Metron still considers ES as the
> "default"[1] fast access data store?
> 
> Sometimes I wonder if we wouldn't be better off leveraging schema evolution
> friendly formats with UIs like SuperSets?
> 
> Probably not as fast as ES but at least it would be one less development
> front to handle.
> 
> Keen to hear your thoughts
> 
> 
> Cheers
> 
> 
> 
> [1] I appreciate the architecture is flexible...
> [-] Apologies for the delay but I suspect my previous message got stuck in
> moderation
> 
> On Fri, Dec 22, 2017 at 3:59 AM, Simon Elliston Ball <
> si...@simonellistonball.com> wrote:
> 
>> Correct, nested objects in lucene indexes lead to sub-documents, which
>> leads to a massive drop in ingest and query rates, this is why the JSONMap
>> parser for example deliberately flattens the Metorn JSON object. Before
>> this decision was made, very early versions of OpenSOC nested enrichments
>> for example, but performance became a challenge.
>> 
>> Simon
>> 
>> 
>>> On 21 Dec 2017, at 13:57, Ali Nazemian  wrote:
>>> 
>>> So Metron enrichment and indexer are not nested aware? Is there any plan
>> to
>>> add that to Metron in future?
>>> 
>>> Cheers,
>>> Ali
>>> 
>>> On Fri, Dec 22, 2017 at 12:46 AM, Otto Fowler 
>>> wrote:
>>> 
>>>> I believe right now you have to flatten.
>>>> The jsonMap parser does this.
>>>> 
>>>> 
>>>> On December 21, 2017 at 08:28:13, Ali Nazemian (alinazem...@gmail.com)
>>>> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> 
>>>> We have recently faced some data sources that generate data in a nested
>>>> format. For example, AWS Cloudtrail generates data in the following JSON
>>>> format:
>>>> 
>>>> {
>>>> 
>>>> "Records": [
>>>> 
>>>> {
>>>> 
>>>> "eventVersion": *"2.0"*,
>>>> 
>>>> "userIdentity": {
>>>> 
>>>> "type": *"IAMUser"*,
>>>> 
>>>> "principalId": *"EX_PRINCIPAL_ID"*,
>>>> 
>>>> "arn": *"arn:aws:iam::123456789012:user/Alice"*,
>>>> 
>>>> "accessKeyId": *"EXAMPLE_KEY_ID"*,
>>>> 
>>>> "accountId": *"123456789012"*,
>>>> 
>>>> "userName": *"Alice"*
>>>> 
>>>> },
>>>> 
>>>> "eventTime": *"2014-03-07T21:22:54Z"*,
>>>> 
>>>> "eventSource": *"ec2.amazonaws.com <http://ec2.amazonaws.com>"*,
>>>> 
>>>> "eventName": *"StartInstances"*,
>>>> 
>>>> "awsRegion": *"us-east-2"*,
>>>> 
>>>> "sourceIPAddress": *"205.251.233.176"*,
>>>> 
>>>> "userAgent": *"ec2-api-tools 1.6.12.2"*,
>>>> 
>>>> "requestParameters": {
>>>> 
>>>> "instancesSet": {
>>>> 
>>>> "items": [
>>>> 
>>>> {
>>>> 
>>>> "instanceId": *"i-ebeaf9e2"*
>>>> 
>>>> }
>>>> 
>>>> ]
>>>> 
>>>> }
>>>> 
>>>> },
>>>> 
>>>> "responseElements": {
>>>> 
>>>> "instancesSet": {
>>>> 
>>>> "items": [
>>>> 
>>>> {
>>>> 
>>>> "instanceId": *"i-ebeaf9e2"*,
>>>> 
>>>> "currentState": {
>>>> 
>>>> "code": 0,
>>>> 
>>>> "name": *"pending"*
>>>> 
>>>> },
>>>> 
>>>> "previousState": {
>>>> 
>>>> "code": 80,
>>>> 
>>>> "name": *"stopped"*
>>>> 
>>>> }
>>>> 
>>>> }
>>>> 
>>>> ]
>>>> 
>>>> }
>>>> 
>>>> }
>>>> 
>>>> }
>>>> 
>>>> ]
>>>> 
>>>> }
>>>> 
>>>> 
>>>> We are able to make this as a flat JSON file. However, a nested object
>> is
>>>> supported by data backends in Metron (ES, ORC, etc.), so I was wondering
>>>> whether with the current version of Metron we are able to index nested
>>>> documents or we have to make it flat?
>>>> 
>>>> 
>>>> 
>>>> Cheers,
>>>> 
>>>> Ali
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> A.Nazemian
>> 
>> 


Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Simon Elliston Ball
There is some really cool stuff happening here, if only I’d been allowed to see 
the lists over Christmas... :)

A few thoughts...

I like Otto’s generalisation of the problem to include specific local Stellar 
objects in a cache loaded from a store (HDFS seems a natural, but not the only, 
place; maybe even a web service / local microservicey object provider!?). That 
said, I suspect that’s a good platform optimisation approach. Should we look at 
this as a separate piece of work, given it extends beyond the scope of the 
summarisation concept, and ultimately use it as a back-end to feed the 
summarising engine proposed here for the enrichment loader?

On the more specific use case, one thing I would comment on is the 
configuration approach. The iteration loop (state_{init|update|merge}) should be 
consistent with the way we handle things like the profiler config, since it’s 
the same approach to data handling. 
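
Purely as a sketch of the shape I mean (the key names here are my guess at the 
proposal, not a spec), a profiler-style iteration loop in the extractor config 
might look something like:

{
  "extractor": "CSV",
  "config": {
    "columns": { "rank": 0, "domain": 1 },
    "value_transform": { "domain": "DOMAIN_REMOVE_TLD(domain)" },
    "value_filter": "LENGTH(domain) > 0",
    "state_init": "BLOOM_INIT()",
    "state_update": { "state": "BLOOM_ADD(state, domain)" },
    "state_merge": "BLOOM_MERGE(states)",
    "separator": ","
  }
}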

The other thing that seems to have crept in here is the interface to something 
like Spark, which, again, I am really very keen on seeing happen. That said, I'm 
not sure how that would happen in this context, unless you’re talking about 
pushing to something like Livy for example (eminently sensible for things like 
cross-instance caching and faster RPC-ish access to an existing Spark context, 
which seems to be what Casey is driving at with the Spark piece). 

To address the question of text manipulation in Stellar / Metron enrichment 
ingest etc., we already have this outside the context of the issues here. I 
would argue that, yes, we don’t want too many paths for this, and that maybe our 
parser approach might be heavily related to text-based ingest. I would say the 
scope worth dealing with here, though, is not really text manipulation but 
summarisation, which is not well served by existing CLI tools like awk / sed 
and friends.

Simon

> On 3 Jan 2018, at 15:48, Nick Allen  wrote:
> 
>> Even with 5 threads, it takes an hour for the full Alexa 1m, so I  think
> this will impact performance
> 
> What exactly takes an hour?  Adding 1M entries to a bloom filter?  That
> seems really high, unless I am not understanding something.
> 
> 
> 
> 
> 
> 
> On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella  wrote:
> 
>> Thanks for the feedback, Nick.
>> 
>> Regarding "IMHO, I'd rather not reinvent the wheel for text manipulation."
>> 
>> I would argue that we are not reinventing the wheel for text manipulation
>> as the extractor config exists already and we are doing a similar thing in
>> the flatfile loader (in fact, the code is reused and merely extended).
>> Transformation operations are already supported in our codebase in the
>> extractor config, this PR has just added some hooks for stateful
>> operations.
>> 
>> Furthermore, we will need a configuration object to pass to the REST call
>> if we are ever to create a UI around importing data into hbase or creating
>> these summary objects.
>> 
>> Regarding your example:
>> $ cat top-1m.csv | awk -F, '{print $2}' | sed '/^$/d' | stellar -i
>> 'DOMAIN_REMOVE_TLD(_)' | stellar -i 'BLOOM_ADD(_)'
>> 
>> I'm very sympathetic to this type of extension, but it has some issues:
>> 
>>   1. This implies a single-threaded addition to the bloom filter.
>>  1. Even with 5 threads, it takes an hour for the full alexa 1m, so I
>>  think this will impact performance
>>  2. There's not a way to specify how to merge across threads if we do
>>  make a multithread command line option
>>   2. This restricts these kinds of operations to roles with heavy unix CLI
>>   knowledge, which isn't often the types of people who would be doing this
>>   type of operation
>>   3. What if we need two variables passed to stellar?
>>   4. This approach will be harder to move to Hadoop.  Eventually we will
>>   want to support data on HDFS being processed by Hadoop (similar to
>> flatfile
>>   loader), so instead of -m LOCAL being passed for the flatfile summarizer
>>   you'd pass -m SPARK and the processing would happen on the cluster
>>  1. This is particularly relevant in this case as it's a
>>  embarrassingly parallel problem in general
>> 
>> In summary, while this a CLI approach is attractive, I prefer the extractor
>> config solution because it is the solution with the smallest iteration
>> that:
>> 
>>   1. Reuses existing metron extraction infrastructure
>>   2. Provides the most solid base for the extensions that will be sorely
>>   needed soon (and will keep it in parity with the flatfile loader)
>>   3. Provides the most solid base for a future UI extension in the
>>   management UI to support both summarization and loading
>> 
>> 
>> 
>> 
>> On Tue, Dec 26, 2017 at 11:27 AM, Nick Allen  wrote:
>> 
>>> First off, I really do like the typosquatting use case and a lot of what
>>> you have described.
>>> 
 We need a way to generate the summary sketches from flat data for this
>> to
 work.
 ​..​
 
>>> 
>>> I took this quote directly from your use cas

Re: Metron nested object

2017-12-21 Thread Simon Elliston Ball
Correct, nested objects in Lucene indexes lead to sub-documents, which leads to 
a massive drop in ingest and query rates; this is why the JSONMap parser, for 
example, deliberately flattens the Metron JSON object. Before this decision was 
made, very early versions of OpenSOC did nest enrichments, but performance 
became a challenge. 
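
As an aside, the flattening itself is cheap; a rough sketch of the idea (not 
the actual JSONMap parser code, and arrays are left alone for brevity) that 
turns {"a": {"b": 1}} into {"a.b": 1}:

import java.util.HashMap;
import java.util.Map;

public class Flattener {
  // Recursively flatten nested maps into dotted keys so Lucene never sees sub-documents.
  public static Map<String, Object> flatten(Map<String, Object> nested) {
    Map<String, Object> flat = new HashMap<>();
    flatten("", nested, flat);
    return flat;
  }

  @SuppressWarnings("unchecked")
  private static void flatten(String prefix, Map<String, Object> nested, Map<String, Object> flat) {
    for (Map.Entry<String, Object> entry : nested.entrySet()) {
      String key = prefix.isEmpty() ? entry.getKey() : prefix + "." + entry.getKey();
      if (entry.getValue() instanceof Map) {
        flatten(key, (Map<String, Object>) entry.getValue(), flat);
      } else {
        flat.put(key, entry.getValue());
      }
    }
  }
}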

Simon


> On 21 Dec 2017, at 13:57, Ali Nazemian  wrote:
> 
> So Metron enrichment and indexer are not nested aware? Is there any plan to
> add that to Metron in future?
> 
> Cheers,
> Ali
> 
> On Fri, Dec 22, 2017 at 12:46 AM, Otto Fowler 
> wrote:
> 
>> I believe right now you have to flatten.
>> The jsonMap parser does this.
>> 
>> 
>> On December 21, 2017 at 08:28:13, Ali Nazemian (alinazem...@gmail.com)
>> wrote:
>> 
>> Hi all,
>> 
>> 
>> We have recently faced some data sources that generate data in a nested
>> format. For example, AWS Cloudtrail generates data in the following JSON
>> format:
>> 
>> {
>> 
>> "Records": [
>> 
>> {
>> 
>> "eventVersion": *"2.0"*,
>> 
>> "userIdentity": {
>> 
>> "type": *"IAMUser"*,
>> 
>> "principalId": *"EX_PRINCIPAL_ID"*,
>> 
>> "arn": *"arn:aws:iam::123456789012:user/Alice"*,
>> 
>> "accessKeyId": *"EXAMPLE_KEY_ID"*,
>> 
>> "accountId": *"123456789012"*,
>> 
>> "userName": *"Alice"*
>> 
>> },
>> 
>> "eventTime": *"2014-03-07T21:22:54Z"*,
>> 
>> "eventSource": *"ec2.amazonaws.com "*,
>> 
>> "eventName": *"StartInstances"*,
>> 
>> "awsRegion": *"us-east-2"*,
>> 
>> "sourceIPAddress": *"205.251.233.176"*,
>> 
>> "userAgent": *"ec2-api-tools 1.6.12.2"*,
>> 
>> "requestParameters": {
>> 
>> "instancesSet": {
>> 
>> "items": [
>> 
>> {
>> 
>> "instanceId": *"i-ebeaf9e2"*
>> 
>> }
>> 
>> ]
>> 
>> }
>> 
>> },
>> 
>> "responseElements": {
>> 
>> "instancesSet": {
>> 
>> "items": [
>> 
>> {
>> 
>> "instanceId": *"i-ebeaf9e2"*,
>> 
>> "currentState": {
>> 
>> "code": 0,
>> 
>> "name": *"pending"*
>> 
>> },
>> 
>> "previousState": {
>> 
>> "code": 80,
>> 
>> "name": *"stopped"*
>> 
>> }
>> 
>> }
>> 
>> ]
>> 
>> }
>> 
>> }
>> 
>> }
>> 
>> ]
>> 
>> }
>> 
>> 
>> We are able to make this as a flat JSON file. However, a nested object is
>> supported by data backends in Metron (ES, ORC, etc.), so I was wondering
>> whether with the current version of Metron we are able to index nested
>> documents or we have to make it flat?
>> 
>> 
>> 
>> Cheers,
>> 
>> Ali
>> 
>> 
> 
> 
> -- 
> A.Nazemian



Re: Metron - Emailing Alerts

2017-12-13 Thread Simon Elliston Ball
That makes a lot of sense, especially if you wanted the detail in the email as 
well. We could definitely use some good “reporting of alerts” functionality 
that would make something like that work. What do people think?

Simon

> On 13 Dec 2017, at 21:52, James Sirota  wrote:
> 
> I think there may be gaps in doing it with the profiler.  You can record 
> stats and counts of different alert types, and maybe even alert ids, but you 
> can't cross-correlate these IDs to the alert body.  At least not in the 
> profiler.  I was thinking about emailing something that looks like a zeppelin 
> report.  You would run it in a cron, export to PDF, and send that out as a 
> summary.  It can be a simple list of alerts that match your rule, or it can 
> have aggregations, graphics, metrics, KPI screens, etc.  That would be the 
> feature that I would want to discuss and flesh out
> 
> Thanks,
> James 
> 
> 13.12.2017, 14:26, "Simon Elliston Ball" :
>> We can already do that with profiles I would have thought. Create a profile 
>> that only picks alerts and then base your emails only from the alert events 
>> produced by that profile. Would that create the right batching mechanism (at 
>> a cost of possible higher latency than you might get with a more specific 
>> alert batcher?)
>> 
>> Simon
>> 
>>>  On 13 Dec 2017, at 21:23, James Sirota  wrote:
>>> 
>>>  I agree with Simon. If you email each alert individually you will be 
>>> overwhelmed. I think a better idea would be to email alert summaries 
>>> periodically, which is more manageable. This is probably a feature worthy 
>>> of consideration for Metron.
>>> 
>>>  13.12.2017, 12:19, "Simon Elliston Ball" :
>>>>  Metron generates alerts onto a Kafka queue, which can be used to 
>>>> integrate with Alert management tools, usually some sort of existing alert 
>>>> aggregation tool.
>>>> 
>>>>  An alternative approach common with this is to have a tool like Apache 
>>>> NiFi attach to the Metron alert feed and send email.
>>>> 
>>>>  The solution here would be to have Metron generate alerts (by adding the 
>>>> is_alert: true flag in the enrichment process) and possibly other flags 
>>>> like alert_email for example, and then have NiFi use ConsumeKafka and then 
>>>> filter out the alert only messages in NiFi to use the PutEmail processor 
>>>> (probably with a ControlRate before it too).
>>>> 
>>>>  Something I would caution is that email is not a great way to manage or 
>>>> send alerts at the volume likely to occur in network monitoring tools. A 
>>>> spike in network traffic can lead to a very large number of emails, which 
>>>> tends to then cause you bigger problems. As such we usually find people 
>>>> want some sort of buffering or aggregation of alerts, hence the use of a 
>>>> an alert management or ticketing solution in front.
>>>> 
>>>>  Simon
>>>> 
>>>>>   On 13 Dec 2017, at 19:06, Ahmed Shah  
>>>>> wrote:
>>>>> 
>>>>>   Hello,
>>>>>   Just wondering if Metron has a feature to email alerts based on rules 
>>>>> that a user defines.
>>>>> 
>>>>>   Example:
>>>>>   Rule A: Email the user 1...@1.com whenever ip_src_addr=100.2.10.*
>>>>>   Rule B: Email the user 1...@1.com whenever payload contains "critical"
>>>>> 
>>>>>   If not, does anyone have any recommendations on where to code these 
>>>>> rules in the Metron stack that uses attributes from the GROK parser?
>>>>> 
>>>>>   -Ahmed
>>>>>   ___
>>>>>   Ahmed Shah (PMP, M. Eng.)
>>>>>   Cybersecurity Analyst & Developer
>>>>>   GCR - Cybersecurity Operations Center
>>>>>   Carleton University - cugcr.com<https://cugcr.com/tiki/lce/index.php>
>>> 
>>>  ---
>>>  Thank you,
>>> 
>>>  James Sirota
>>>  PMC- Apache Metron
>>>  jsirota AT apache DOT org
> 
> --- 
> Thank you,
> 
> James Sirota
> PMC- Apache Metron
> jsirota AT apache DOT org



Re: Metron - Emailing Alerts

2017-12-13 Thread Simon Elliston Ball
We can already do that with profiles, I would have thought. Create a profile 
that only picks alerts and then base your emails only on the alert events 
produced by that profile. Would that create the right batching mechanism (at 
the cost of possibly higher latency than you might get with a more specific 
alert batcher)?

Simon 

> On 13 Dec 2017, at 21:23, James Sirota  wrote:
> 
> I agree with Simon.  If you email each alert individually you will be 
> overwhelmed.  I think a better idea would be to email alert summaries 
> periodically, which is more manageable.  This is probably a feature worthy of 
> consideration for Metron. 
> 
> 13.12.2017, 12:19, "Simon Elliston Ball" :
>> Metron generates alerts onto a Kafka queue, which can be used to integrate 
>> with Alert management tools, usually some sort of existing alert aggregation 
>> tool.
>> 
>> An alternative approach common with this is to have a tool like Apache NiFi 
>> attach to the Metron alert feed and send email.
>> 
>> The solution here would be to have Metron generate alerts (by adding the 
>> is_alert: true flag in the enrichment process) and possibly other flags like 
>> alert_email for example, and then have NiFi use ConsumeKafka and then filter 
>> out the alert only messages in NiFi to use the PutEmail processor (probably 
>> with a ControlRate before it too).
>> 
>> Something I would caution is that email is not a great way to manage or send 
>> alerts at the volume likely to occur in network monitoring tools. A spike in 
>> network traffic can lead to a very large number of emails, which tends to 
>> then cause you bigger problems. As such we usually find people want some 
>> sort of buffering or aggregation of alerts, hence the use of a an alert 
>> management or ticketing solution in front.
>> 
>> Simon
>> 
>>>  On 13 Dec 2017, at 19:06, Ahmed Shah  wrote:
>>> 
>>>  Hello,
>>>  Just wondering if Metron has a feature to email alerts based on rules that 
>>> a user defines.
>>> 
>>>  Example:
>>>  Rule A: Email the user 1...@1.com whenever ip_src_addr=100.2.10.*
>>>  Rule B: Email the user 1...@1.com whenever payload contains "critical"
>>> 
>>>  If not, does anyone have any recommendations on where to code these rules 
>>> in the Metron stack that uses attributes from the GROK parser?
>>> 
>>>  -Ahmed
>>>  ___
>>>  Ahmed Shah (PMP, M. Eng.)
>>>  Cybersecurity Analyst & Developer
>>>  GCR - Cybersecurity Operations Center
>>>  Carleton University - cugcr.com<https://cugcr.com/tiki/lce/index.php>
> 
> --- 
> Thank you,
> 
> James Sirota
> PMC- Apache Metron
> jsirota AT apache DOT org


Re: Metron - Emailing Alerts

2017-12-13 Thread Simon Elliston Ball
Metron generates alerts onto a Kafka queue, which can be used to integrate with 
Alert management tools, usually some sort of existing alert aggregation tool.

An alternative approach common with this is to have a tool like Apache NiFi 
attach to the Metron alert feed and send email. 

The solution here would be to have Metron generate alerts (by adding the 
is_alert: true flag in the enrichment process), and possibly other flags like 
alert_email, and then have NiFi use ConsumeKafka, filter for the alert 
messages, and route them to the PutEmail processor (probably with a ControlRate 
processor before it too).
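
As a sketch of the flagging side (the field names and address are illustrative, 
and the exact enrichment config syntax should be checked against the docs 
rather than taken from here), a Stellar enrichment could set the flags that 
NiFi then routes on:

{
  "enrichment": {
    "fieldMap": {
      "stellar": {
        "config": {
          "is_alert": "STARTS_WITH(ip_src_addr, '100.2.10.')",
          "alert_email": "'analyst@example.com'"
        }
      }
    }
  }
}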

Something I would caution is that email is not a great way to manage or send 
alerts at the volume likely to occur in network monitoring tools. A spike in 
network traffic can lead to a very large number of emails, which tends to then 
cause you bigger problems. As such we usually find people want some sort of 
buffering or aggregation of alerts, hence the use of an alert management or 
ticketing solution in front.

Simon

> On 13 Dec 2017, at 19:06, Ahmed Shah  wrote:
> 
> Hello,
> Just wondering if Metron has a feature to email alerts based on rules that a 
> user defines.
> 
> Example:
> Rule A: Email the user 1...@1.com whenever ip_src_addr=100.2.10.*
> Rule B: Email the user 1...@1.com whenever payload contains "critical"
> 
> If not, does anyone have any recommendations on where to code these rules in 
> the Metron stack that uses attributes from the GROK parser?
> 
> 
> -Ahmed
> ___
> Ahmed Shah (PMP, M. Eng.)
> Cybersecurity Analyst & Developer
> GCR - Cybersecurity Operations Center
> Carleton University - cugcr.com



Re: [DISCUSS] Community Meetings

2017-12-13 Thread Simon Elliston Ball
Good points, Larry. We would need to get consent from everyone on the call to 
record it, in order to properly comply with regulations in some countries. We 
would definitely need someone to step up as note taker. 

Something else to think about is the intended audience. Previously we’ve had 
meetings like this which have been very detailed and Dev@-focused (which is a 
great thing) but have rather alienated participants in User@ land. We need to 
make it clear what level we’re talking at, to be inclusive. 

Simon

> On 13 Dec 2017, at 00:44, larry mccay  wrote:
> 
> Not sure about posting the recordings - you will need to check and make
> sure that doesn't violate anything.
> 
> Just a friendly reminder...
> It is important that meetings have notes and a summary that is sent out
> describing topics to be decided on the mailing list.
> No decisions can be made in the community meeting itself - this gives
> others in other timezones and commitments review and voice in the decisions.
> 
> If it didn't happen on the mailing lists then it didn't happen. :)
> 
> 
> On Tue, Dec 12, 2017 at 1:39 PM, Simon Elliston Ball <
> si...@simonellistonball.com> wrote:
> 
>> Yes, I do.
>> 
>> I suspect the best bet will be to post recordings somewhere on the
>> apache.org <http://apache.org/> metron site.
>> 
>> Simon
>> 
>>> On 12 Dec 2017, at 18:36, Otto Fowler  wrote:
>>> 
>>> Excellent, do you have the > 40 min + record option?
>>> 
>>> 
>>> On December 12, 2017 at 13:19:55, Simon Elliston Ball (
>>> si...@simonellistonball.com) wrote:
>>> 
>>> Happy to volunteer a zoom room. That seems to have worked for most in the
>>> past.
>>> 
>>> Simon
>>> 
>>>> On 12 Dec 2017, at 18:09, Otto Fowler  wrote:
>>>> 
>>>> Thanks! I think I’d like something hosted though.
>>>> 
>>>> 
>>>> On December 12, 2017 at 11:18:52, Ahmed Shah (
>> ahmeds...@cmail.carleton.ca)
>>> 
>>>> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> wrt "- How are we going to host it"...
>>>> 
>>>> I've used BigBlueButton as an end user at our University.
>>>> 
>>>> It is LGPL open source.
>>>> 
>>>> https://bigbluebutton.org/
>>>> https://bigbluebutton.org/developers/
>>>> 
>>>> 
>>>> -Ahmed
>>>> 
>>>> ___
>>>> Ahmed Shah (PMP, M. Eng.)
>>>> Cybersecurity Analyst & Developer
>>>> GCR - Cybersecurity Operations Center
>>>> Carleton University - cugcr.com<https://cugcr.com/tiki/lce/index.php>
>>>> 
>>>> 
>>>> 
>>>> From: Otto Fowler 
>>>> Sent: December 11, 2017 4:41 PM
>>>> To: dev@metron.apache.org
>>>> Subject: [DISCUSS] Community Meetings
>>>> 
>>>> I think that we all want to have regular community meetings. We may be
>>>> better able to keep to a regular schedule with these meetings if we
>>> spread
>>>> out the responsibility for them from James and Casey, both of whom have
>> a
>>>> lot on their plate already.
>>>> 
>>>> I would be willing to coordinate and run the meetings, and would welcome
>>>> anyone else who wants to help when they can.
>>>> 
>>>> The only issue for me is I do not have a web-ex account that I can use
>> to
>>>> hold the meeting. So I’ll need some recommendations for a suitable
>>>> alternative. I have not been able to find an Apache Friendly
>> alternative,
>>>> in the same way that Atlassian is apache friendly.
>>>> 
>>>> 
>>>> So - from what I can see we need to:
>>>> 
>>>> - Talk through who is going to do it
>>>> - How are we going to host it
>>>> - When are we going to do it
>>>> 
>>>> Anything else?
>>>> 
>>>> ottO
>> 
>> 


Re: [DISCUSS] Community Meetings

2017-12-12 Thread Simon Elliston Ball
Yes, I do. 

I suspect the best bet will be to post recordings somewhere on the apache.org 
Metron site.

Simon

> On 12 Dec 2017, at 18:36, Otto Fowler  wrote:
> 
> Excellent, do you have the > 40 min + record option?
> 
> 
> On December 12, 2017 at 13:19:55, Simon Elliston Ball (
> si...@simonellistonball.com) wrote:
> 
> Happy to volunteer a zoom room. That seems to have worked for most in the
> past.
> 
> Simon
> 
>> On 12 Dec 2017, at 18:09, Otto Fowler  wrote:
>> 
>> Thanks! I think I’d like something hosted though.
>> 
>> 
>> On December 12, 2017 at 11:18:52, Ahmed Shah (ahmeds...@cmail.carleton.ca)
> 
>> wrote:
>> 
>> Hello,
>> 
>> wrt "- How are we going to host it"...
>> 
>> I've used BigBlueButton as an end user at our University.
>> 
>> It is LGPL open source.
>> 
>> https://bigbluebutton.org/
>> https://bigbluebutton.org/developers/
>> 
>> 
>> -Ahmed
>> 
>> ___
>> Ahmed Shah (PMP, M. Eng.)
>> Cybersecurity Analyst & Developer
>> GCR - Cybersecurity Operations Center
>> Carleton University - cugcr.com<https://cugcr.com/tiki/lce/index.php>
>> 
>> 
>> 
>> From: Otto Fowler 
>> Sent: December 11, 2017 4:41 PM
>> To: dev@metron.apache.org
>> Subject: [DISCUSS] Community Meetings
>> 
>> I think that we all want to have regular community meetings. We may be
>> better able to keep to a regular schedule with these meetings if we
> spread
>> out the responsibility for them from James and Casey, both of whom have a
>> lot on their plate already.
>> 
>> I would be willing to coordinate and run the meetings, and would welcome
>> anyone else who wants to help when they can.
>> 
>> The only issue for me is I do not have a web-ex account that I can use to
>> hold the meeting. So I’ll need some recommendations for a suitable
>> alternative. I have not been able to find an Apache Friendly alternative,
>> in the same way that Atlassian is apache friendly.
>> 
>> 
>> So - from what I can see we need to:
>> 
>> - Talk through who is going to do it
>> - How are we going to host it
>> - When are we going to do it
>> 
>> Anything else?
>> 
>> ottO



Re: [DISCUSS] Community Meetings

2017-12-12 Thread Simon Elliston Ball
Happy to volunteer a zoom room. That seems to have worked for most in the past.

Simon

> On 12 Dec 2017, at 18:09, Otto Fowler  wrote:
> 
> Thanks!  I think I’d like something hosted though.
> 
> 
> On December 12, 2017 at 11:18:52, Ahmed Shah (ahmeds...@cmail.carleton.ca)
> wrote:
> 
> Hello,
> 
> wrt "- How are we going to host it"...
> 
> I've used BigBlueButton as an end user at our University.
> 
> It is LGPL open source.
> 
> https://bigbluebutton.org/
> https://bigbluebutton.org/developers/
> 
> 
> -Ahmed
> 
> ___
> Ahmed Shah (PMP, M. Eng.)
> Cybersecurity Analyst & Developer
> GCR - Cybersecurity Operations Center
> Carleton University - cugcr.com
> 
> 
> 
> From: Otto Fowler 
> Sent: December 11, 2017 4:41 PM
> To: dev@metron.apache.org
> Subject: [DISCUSS] Community Meetings
> 
> I think that we all want to have regular community meetings. We may be
> better able to keep to a regular schedule with these meetings if we spread
> out the responsibility for them from James and Casey, both of whom have a
> lot on their plate already.
> 
> I would be willing to coordinate and run the meetings, and would welcome
> anyone else who wants to help when they can.
> 
> The only issue for me is I do not have a web-ex account that I can use to
> hold the meeting. So I’ll need some recommendations for a suitable
> alternative. I have not been able to find an Apache Friendly alternative,
> in the same way that Atlassian is apache friendly.
> 
> 
> So - from what I can see we need to:
> 
> - Talk through who is going to do it
> - How are we going to host it
> - When are we going to do it
> 
> Anything else?
> 
> ottO



Re: Wiki Docs links seem wrong

2017-12-07 Thread Simon Elliston Ball
Awesome, many thanks!

> On 7 Dec 2017, at 13:08, Kyle Richardson  wrote:
> 
> Fixed.
> 
> -Kyle
> 
> On Thu, Dec 7, 2017 at 7:20 AM, Simon Elliston Ball <
> si...@simonellistonball.com> wrote:
> 
>> https://cwiki.apache.org/confluence/display/METRON/
>> Metron+User+Guide+-+per+release <https://cwiki.apache.org/
>> confluence/display/METRON/Metron+User+Guide+-+per+release>
>> 
>> The links don’t seem to correspond to the versions on this page. Would be
>> happy to fix, but I don’t have wiki perms.
>> 
>> Simon



Wiki Docs links seem wrong

2017-12-07 Thread Simon Elliston Ball
https://cwiki.apache.org/confluence/display/METRON/Metron+User+Guide+-+per+release
 


The links don’t seem to correspond to the versions on this page. Would be happy 
to fix, but I don’t have wiki perms. 

Simon

Re: DISCUSS: Quick change to parser config

2017-12-04 Thread Simon Elliston Ball
Personally I suspect that temporary variables are a different thing, as is the 
assignment PR. They might be useful for intermediate steps in a parser, but then 
we’re potentially getting more complex than a parser wants to be. I am warming 
to the idea of temporary variables though. 

In terms of the removal, I like the idea of the COMPLETE transformation to 
express a projection. That makes the output interface of the Metron object more 
explicit in a parser, which makes governance much easier. 

Do we think this is a good consensus? Shall I ticket it (I might even code it!) 
in the transformation form proposed? 

Simon

> On 4 Dec 2017, at 17:21, Casey Stella  wrote:
> 
> So, just chiming in here.  It seems to me that we have a problem with
> extraneous fields in a couple of different ways:
> 
> * Temporary Variables
> 
> I think that the problem of temporary variables is one beyond just the
> parser.  What I'd like to see is the Stellar field transformations operate
> similar to the enrichment field transformations in that they are no longer
> a map (this is useful beyond this case for having multiple assignments for
> a variable) and having a special assignment indicator which would indicate
> a temporary variable (e.g. ^= instead of :=).  This would clean up some of
> the usecases in enrichments as well.  Combine this with the assumption that
> all non-temporary fields are included in output for the field
> transformation if it is not specified and I think we have something that is
> sensible and somewhat backwards compatible.  To wit:
> {
>  "fieldTransformations": [
>{
>  "transformation": "STELLAR",
>  "config": [
>"ipSrc ^= TRIM(raw_ip_src)"
>"ip_src_addr := ipSrc"
>  ]
>}
>  ]
> }
> 
> * Extraneous Fields from the Parser
> 
> For these, we do currently have a REMOVE field transformation, but I'd be
> ok with a PROJECT or COMPLETE field transformation to provide a whitelist.
> That might look like:
> {
>  "fieldTransformations": [
>{
>  "transformation": "STELLAR",
>  "config": [
>"ipSrc ^= TRIM(raw_ip_src)"
>"ip_src_addr := ipSrc"
>  ]
>},
> {
>  "transformation": "COMPLETE",
>  "output" : [ "ip_src_addr", "ip_dst_addr", "message"]
>}
>  ]
> }
> 
> I think having these two treated separately makes sense because sometimes
> you will want COMPLETE and sometimes not.  Also, this fits within the core
> abstraction that we already have.
> 
> On Thu, Nov 30, 2017 at 8:21 PM, Simon Elliston Ball <
> si...@simonellistonball.com <mailto:si...@simonellistonball.com>> wrote:
> 
>> Hmmm… Actually, I kinda like that.
>> 
>> May want a little refactoring in the back for clarity.
>> 
>> My question about whether we could ever imagine this ‘cleanup policy’
>> applying to other transforms would sway me to the field rather than
>> transformation name approach though.
>> 
>> Simon
>> 
>>> On 1 Dec 2017, at 01:17, Otto Fowler  wrote:
>>> 
>>> Or, we can create new transformation types
>>> STELLAR_COMPLETE, which may be more in line with the original design.
>>> 
>>> 
>>> 
>>> On November 30, 2017 at 20:14:46, Otto Fowler (ottobackwa...@gmail.com
>> <mailto:ottobackwa...@gmail.com <mailto:ottobackwa...@gmail.com>>) wrote:
>>> 
>>>> I would suggest that instead of explicitly having “complete”, we have
>> “operation”:”complete”
>>>> 
>>>> Such that we can have multiple transformations, each with a different
>> “operation”.
>>>> No operation would be the status quo ante, if we can do it so that we
>> don’t get errors with old configs and the keep same behavior.
>>>> 
>>>> {
>>>> "fieldTransformations": [
>>>> {
>>>> "transformation": "STELLAR",
>>>> “operation": “complete",
>>>> "output": ["ip_src_addr", "ip_dst_addr"],
>>>> "config": {
>>>> "ip_src_addr": "ipSrc",
>>>> "ip_dest_addr": "ipDst"
>>>> } ,
>>>> {
>>>> "transformation": "STELLAR",
>>>> “operation": “SomeOtherThing",
>>>> "output": [“foo", “bar"],
>>>> "config": {
>>>> “foo": “TO_UPPER(foo)",
>>>

Re: DISCUSS: Quick change to parser config

2017-11-30 Thread Simon Elliston Ball
Hmmm… Actually, I kinda like that. 

May want a little refactoring in the back for clarity. 

My question about whether we could ever imagine this ‘cleanup policy’ applying 
to other transforms would sway me to the field rather than transformation name 
approach though. 

Simon

> On 1 Dec 2017, at 01:17, Otto Fowler  wrote:
> 
> Or, we can create new transformation types
> STELLAR_COMPLETE, which may be more in line with the original design.
> 
> 
> 
> On November 30, 2017 at 20:14:46, Otto Fowler (ottobackwa...@gmail.com 
> <mailto:ottobackwa...@gmail.com>) wrote:
> 
>> I would suggest that instead of explicitly having “complete”, we have 
>> “operation”:”complete”
>> 
>> Such that we can have multiple transformations, each with a different 
>> “operation”.
>> No operation would be the status quo ante, if we can do it so that we don’t 
>> get errors with old configs and the keep same behavior.
>> 
>> { 
>> "fieldTransformations": [ 
>> { 
>> "transformation": "STELLAR", 
>> “operation": “complete", 
>> "output": ["ip_src_addr", "ip_dst_addr"], 
>> "config": { 
>> "ip_src_addr": "ipSrc", 
>> "ip_dest_addr": "ipDst" 
>> } ,
>> { 
>> "transformation": "STELLAR", 
>> “operation": “SomeOtherThing", 
>> "output": [“foo", “bar"], 
>> "config": { 
>> “foo": “TO_UPPER(foo)", 
>> “bar": “TO_LOWER(bar)" 
>> } 
>> } 
>> ] 
>> } 
>> 
>> 
>> Sorry for the junk examples, but hopefully it makes sense.
>> 
>> 
>> 
>> 
>> 
>> On November 30, 2017 at 20:00:06, Simon Elliston Ball 
>> (si...@simonellistonball.com <mailto:si...@simonellistonball.com>) wrote:
>> 
>>> I’m looking at the way parser config works, and transformation of field 
>>> from their native names in, for example the ASA or CEF parsers, into a 
>>> standard data model.
>>> 
>>> At the moment I would do something like this:
>>> 
>>> assuming I have fields [ipSrc, ipDst, pointlessExtraStuff, message] I might 
>>> have:
>>> 
>>> {
>>> "fieldTransformations": [
>>> {
>>> "transformation": "STELLAR",
>>> "output": ["ip_src_addr", "ip_dst_addr", "message"],
>>> "config": {
>>> "ip_src_addr": "ipSrc",
>>> "ip_dest_addr": "ipDst"
>>> }
>>> }
>>> ]
>>> }
>>> 
>>> which leave me with the field set:
>>> [ipSrc, ipDst, pointlessExtraStuff, message, ip_src_addr, ip_dest_addr]
>>> 
>>> unless I go with:-
>>> 
>>> {
>>> "fieldTransformations": [
>>> {
>>> "transformation": "STELLAR",
>>> "output": ["ip_src_addr", "ip_dst_addr", "message"],
>>> "config": {
>>> "ip_src_addr": "ipSrc",
>>> "ip_dest_addr": "ipDst",
>>> "pointlessExtraStuff": null,
>>> "ipSrc": null,
>>> "ipDst": null
>>> }
>>> }
>>> ]
>>> }
>>> 
>>> which seems a little over verbose.
>>> 
>>> Do you think it would be valuable to add a switch of some sort on the 
>>> transformation to make it “complete”, i.e. to only preserve fields which 
>>> are explicitly set.
>>> 
>>> To my mind, this breaks a principal of mutability, but gives us much much 
>>> cleaner mapping of data.
>>> 
>>> I would propose something like:
>>> 
>>> {
>>> "fieldTransformations": [
>>> {
>>> "transformation": "STELLAR",
>>> "complete": true,
>>> "output": ["ip_src_addr", "ip_dst_addr", "message"],
>>> "config": {
>>> "ip_src_addr": "ipSrc",
>>> "ip_dest_addr": "ipDst"
>>> }
>>> }
>>> ]
>>> }
>>> 
>>> which would give me the set ["ip_src_addr", "ip_dst_addr", "message”] 
>>> effectively making the nulling in my previous example implicit.
>>> 
>>> Thoughts?
>>> 
>>> Also, in the second scenario, if ‘output' were to be empty would we assume 
>>> that the output field set should be ["ip_src_addr", “ip_dst_addr”]?
>>> 
>>> Simon



Re: DISCUSS: Quick change to parser config

2017-11-30 Thread Simon Elliston Ball
Do you have any thoughts on what these other operations might be? 

What I’m imagining is something that basically specifies a policy on how to 
handle things that the transformation block does not explicitly handle. Right 
now, we just leave them alone and they flow through. 

Would “policy”: “explicit” or “policy”: “onlyExplicit” make sense and give us 
the flexibility we need? 
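
Something like this, purely as a sketch (the “policy” key and its values are 
made up here, nothing is implemented):

{
  "fieldTransformations": [
    {
      "transformation": "STELLAR",
      "policy": "onlyExplicit",
      "output": ["ip_src_addr", "ip_dst_addr", "message"],
      "config": {
        "ip_src_addr": "ipSrc",
        "ip_dest_addr": "ipDst"
      }
    }
  ]
}

with no “policy” key meaning today’s pass-through behaviour, so existing 
configs would keep working.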

To my mind “operation” implies further transformation, which would just be 
another block, no? 

Maybe it’s just semantic pedantry on my part… would we see this sort of policy 
logic applying to other transformations? It doesn’t really make sense for 
“remove”, and well… who cares about any of the other legacy transforms now we 
have Stellar :) 

Simon

> On 1 Dec 2017, at 01:14, Otto Fowler  wrote:
> 
> I would suggest that instead of explicitly having “complete”, we have 
> “operation”:”complete”
> 
> Such that we can have multiple transformations, each with a different 
> “operation”.
> No operation would be the status quo ante, if we can do it so that we don’t 
> get errors with old configs and the keep same behavior.
> 
> { 
> "fieldTransformations": [ 
> { 
> "transformation": "STELLAR", 
> “operation": “complete", 
> "output": ["ip_src_addr", "ip_dst_addr"], 
> "config": { 
> "ip_src_addr": "ipSrc", 
> "ip_dest_addr": "ipDst" 
> } ,
> { 
> "transformation": "STELLAR", 
> “operation": “SomeOtherThing", 
> "output": [“foo", “bar"], 
> "config": { 
> “foo": “TO_UPPER(foo)", 
> “bar": “TO_LOWER(bar)" 
> } 
> } 
> ] 
> } 
> 
> 
> Sorry for the junk examples, but hopefully it makes sense.
> 
> 
> 
> 
> 
> On November 30, 2017 at 20:00:06, Simon Elliston Ball 
> (si...@simonellistonball.com <mailto:si...@simonellistonball.com>) wrote:
> 
>> I’m looking at the way parser config works, and transformation of field from 
>> their native names in, for example the ASA or CEF parsers, into a standard 
>> data model.  
>> 
>> At the moment I would do something like this:  
>> 
>> assuming I have fields [ipSrc, ipDst, pointlessExtraStuff, message] I might 
>> have: 
>> 
>> { 
>> "fieldTransformations": [ 
>> { 
>> "transformation": "STELLAR", 
>> "output": ["ip_src_addr", "ip_dst_addr", "message"], 
>> "config": { 
>> "ip_src_addr": "ipSrc", 
>> "ip_dest_addr": "ipDst" 
>> } 
>> } 
>> ] 
>> } 
>> 
>> which leave me with the field set:  
>> [ipSrc, ipDst, pointlessExtraStuff, message, ip_src_addr, ip_dest_addr] 
>> 
>> unless I go with:- 
>> 
>> { 
>> "fieldTransformations": [ 
>> { 
>> "transformation": "STELLAR", 
>> "output": ["ip_src_addr", "ip_dst_addr", "message"], 
>> "config": { 
>> "ip_src_addr": "ipSrc", 
>> "ip_dest_addr": "ipDst", 
>> "pointlessExtraStuff": null, 
>> "ipSrc": null, 
>> "ipDst": null 
>> } 
>> } 
>> ] 
>> } 
>> 
>> which seems a little over verbose.  
>> 
>> Do you think it would be valuable to add a switch of some sort on the 
>> transformation to make it “complete”, i.e. to only preserve fields which are 
>> explicitly set.  
>> 
>> To my mind, this breaks a principal of mutability, but gives us much much 
>> cleaner mapping of data.  
>> 
>> I would propose something like: 
>> 
>> { 
>> "fieldTransformations": [ 
>> { 
>> "transformation": "STELLAR", 
>> "complete": true, 
>> "output": ["ip_src_addr", "ip_dst_addr", "message"], 
>> "config": { 
>> "ip_src_addr": "ipSrc", 
>> "ip_dest_addr": "ipDst" 
>> } 
>> } 
>> ] 
>> } 
>> 
>> which would give me the set ["ip_src_addr", "ip_dst_addr", "message”] 
>> effectively making the nulling in my previous example implicit.  
>> 
>> Thoughts?  
>> 
>> Also, in the second scenario, if ‘output' were to be empty would we assume 
>> that the output field set should be ["ip_src_addr", “ip_dst_addr”]?  
>> 
>> Simon



DISCUSS: Quick change to parser config

2017-11-30 Thread Simon Elliston Ball
I’m looking at the way parser config works, and transformation of field from 
their native names in, for example the ASA or CEF parsers, into a standard data 
model. 

At the moment I would do something like this: 

assuming I have fields [ipSrc, ipDst, pointlessExtraStuff, message] I might 
have:

{
  "fieldTransformations": [
{
  "transformation": "STELLAR",
  "output": ["ip_src_addr", "ip_dst_addr", "message"],
  "config": {
"ip_src_addr": "ipSrc",
"ip_dest_addr": "ipDst"
  }
}
  ]
}

which leaves me with the field set: 
[ipSrc, ipDst, pointlessExtraStuff, message, ip_src_addr, ip_dest_addr]

unless I go with:-

{
  "fieldTransformations": [
{
  "transformation": "STELLAR",
  "output": ["ip_src_addr", "ip_dst_addr", "message"],
  "config": {
"ip_src_addr": "ipSrc",
"ip_dest_addr": "ipDst",
"pointlessExtraStuff": null,
"ipSrc": null,
"ipDst": null
  }
}
  ]
}

which seems a little overly verbose. 

Do you think it would be valuable to add a switch of some sort on the 
transformation to make it “complete”, i.e. to only preserve fields which are 
explicitly set? 

To my mind, this breaks a principle of immutability, but gives us a much, much 
cleaner mapping of the data. 

I would propose something like:

{
  "fieldTransformations": [
{
  "transformation": "STELLAR",
  "complete": true,
  "output": ["ip_src_addr", "ip_dst_addr", "message"],
  "config": {
"ip_src_addr": "ipSrc",
"ip_dest_addr": "ipDst"
  }
}
  ]
}

which would give me the set ["ip_src_addr", "ip_dst_addr", "message"], 
effectively making the nulling in my previous example implicit. 

Thoughts? 

Also, in the second scenario, if ‘output’ were to be empty, would we assume that 
the output field set should be ["ip_src_addr", "ip_dst_addr"]? 
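
To make that concrete, here is a sketch of that second reading (nothing 
decided, field names purely illustrative):

{
  "fieldTransformations": [
    {
      "transformation": "STELLAR",
      "complete": true,
      "config": {
        "ip_src_addr": "ipSrc",
        "ip_dest_addr": "ipDst"
      }
    }
  ]
}

i.e. with ‘output’ omitted, the surviving field set would be exactly the fields 
assigned in the config block.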

Simon



Re: [DISCUSS] NPM / Node Problems

2017-11-27 Thread Simon Elliston Ball
Well, that’s good news on that issue. Reproducing the problem is half way to 
solving it, right? 

I would still say there are some systemic things going on that have manifested 
in a variety of ways on both the user and dev lists, so it’s worth us having a 
good look at a more robust approach to node dependencies (both the npm ones and 
the native ones). 

Simon

> On 27 Nov 2017, at 13:30, Otto Fowler  wrote:
> 
> I can reproduce the failure in out ansible docker build container, which is 
> also centos.
> The issue is building our node on centos in all these cases.
> 
> 
> 
> On November 27, 2017 at 07:02:51, Simon Elliston Ball 
> (si...@simonellistonball.com <mailto:si...@simonellistonball.com>) wrote:
> 
>> Thinking about this, doesn’t our build plugin explicitly install it’s own 
>> node? So actually all the node version things may be a red herring, since 
>> this is under our control through the pom. Not sure if we actually 
>> exercising this control. It seems that some of the errors people report are 
>> more to do with compilation failures for native node modules, which is 
>> doesn’t pin (i.e. things like system library dependencies). I’m not sure 
>> what we have in the dependency tree that requires complex native 
>> dependencies, but this might just be one of those node things we could doc 
>> around.  
>> 
>> This scenario would be fixed by standardising the build container. 
>> 
>> Yarn’s big thing is that it enables faster dependency resolution and local 
>> caching, right? It does not seem to address any of the problems we see, but 
>> sure, it’s the shiny new dependency system for node modules, which might 
>> make npm less horrible to deal with, so worth looking into.  
>> 
>> The other issue that I’ve seen people run into a lot is flat out download 
>> errors. This could be helped by finding our versions, maybe with yarn, but 
>> let’s face it, package-lock.json could also do that with npm, albeit with a 
>> slightly slower algorithm. However, short of bundling and hosting deps 
>> ourselves, I suspect the download errors are beyond our control, and 
>> certainly beyond the scope of this project (fix maven, fix npm, fix all the 
>> node hosting servers…) 
>> 
>> Simon 
>> 
>> 
>> > On 27 Nov 2017, at 07:28, RaghuMitra Kandikonda > > <mailto:raghumitra@gmail.com>> wrote: 
>> >  
>> > Looking at some of the build failure emails and past experience i 
>> > would suggest having a node & npm version check in our build scripts 
>> > and moving dependency management to yarn. 
>> >  
>> > We need not restrict the build to a specific version of node & npm but 
>> > we can surely suggest a min version required to build UI successfully. 
>> >  
>> > -Raghu 
>> >  
>> >  
>> >  
>> > On Fri, Nov 24, 2017 at 10:21 PM, Simon Elliston Ball 
>> > mailto:si...@simonellistonball.com>> wrote: 
>> >> Agreeing with Nick, it seems like the main reason people are building 
>> >> themselves, and hitting all these environmental issues, is that we do not 
>> >> as a project produce binary release artefacts (the rpms which users could 
>> >> just install) and instead leave that for the commercial distributors to 
>> >> do. 
>> >>  
>> >> Yarn may help with some of the dependency version issues we’re having, 
>> >> but not afaik with the core missing library headers / build tools / node 
>> >> and npm version issue, those would seem to fit a documentation fix and 
>> >> improvements to platform-info to flag the problems, so this can then be a 
>> >> pre-flight check tool as well as a diagnostic tool. 
>> >>  
>> >> Another option I would put on the table is to standardise our build 
>> >> environment, so that the non-java bits are run in a standard docker image 
>> >> or something fo the sort, that way we can take control of all the 
>> >> environmental and OS dependent pieces, much as we do right now with the 
>> >> rpm build sections of the mpack build. 
>> >>  
>> >> The challenge here will be adding the relevant maven support. At the 
>> >> moment we’re relying on the maven npm and node build plugins, this would 
>> >> likely need replacing with something custom and a challenge to support to 
>> >> go dow this route. 
>> >>  
>> >> Perhaps the real answer here is to push people who are just kicking the 
>> >> tyres towards a binary distribu

Re: [DISCUSS] NPM / Node Problems

2017-11-27 Thread Simon Elliston Ball
Thinking about this, doesn’t our build plugin explicitly install its own node? 
So actually all the node version things may be a red herring, since this is 
under our control through the pom. I’m not sure we are actually exercising this 
control. It seems that some of the errors people report are more to do with 
compilation failures for native node modules, which the plugin doesn’t pin 
(i.e. things like system library dependencies). I’m not sure what we have in 
the dependency tree that requires complex native dependencies, but this might 
just be one of those node things we could doc around. 
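
One cheap guard either way would be an engines block in the UI package.json 
files, so a stale node/npm at least complains early. A sketch only — the 
version numbers here are made up, not a proposal for specific minimums:

{
  "engines": {
    "node": ">=6.9.0",
    "npm": ">=3.10.0"
  }
}

As far as I know npm only warns on an engines mismatch unless engine-strict is 
set, whereas yarn fails the install by default.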

This scenario would be fixed by standardising the build container.

Yarn’s big thing is that it enables faster dependency resolution and local 
caching, right? It does not seem to address any of the problems we see, but 
sure, it’s the shiny new dependency system for node modules, which might make 
npm less horrible to deal with, so worth looking into. 

The other issue that I’ve seen people run into a lot is flat out download 
errors. This could be helped by pinning our versions, maybe with yarn, but 
let’s face it, package-lock.json could also do that with npm, albeit with a 
slightly slower algorithm. However, short of bundling and hosting deps 
ourselves, I suspect the download errors are beyond our control, and certainly 
beyond the scope of this project (fix maven, fix npm, fix all the node hosting 
servers…)

Simon


> On 27 Nov 2017, at 07:28, RaghuMitra Kandikonda  
> wrote:
> 
> Looking at some of the build failure emails and past experience i
> would suggest having a node & npm version check in our build scripts
> and moving dependency management to yarn.
> 
> We need not restrict the build to a specific version of node & npm but
> we can surely suggest a min version required to build UI successfully.
> 
> -Raghu
> 
> 
> 
> On Fri, Nov 24, 2017 at 10:21 PM, Simon Elliston Ball
>  wrote:
>> Agreeing with Nick, it seems like the main reason people are building 
>> themselves, and hitting all these environmental issues, is that we do not as 
>> a project produce binary release artefacts (the rpms which users could just 
>> install) and instead leave that for the commercial distributors to do.
>> 
>> Yarn may help with some of the dependency version issues we’re having, but 
>> not afaik with the core missing library headers / build tools / node and npm 
>> version issue, those would seem to fit a documentation fix and improvements 
>> to platform-info to flag the problems, so this can then be a pre-flight 
>> check tool as well as a diagnostic tool.
>> 
>> Another option I would put on the table is to standardise our build 
>> environment, so that the non-java bits are run in a standard docker image or 
>> something fo the sort, that way we can take control of all the environmental 
>> and OS dependent pieces, much as we do right now with the rpm build sections 
>> of the mpack build.
>> 
>> The challenge here will be adding the relevant maven support. At the moment 
>> we’re relying on the maven npm and node build plugins, this would likely 
>> need replacing with something custom and a challenge to support to go dow 
>> this route.
>> 
>> Perhaps the real answer here is to push people who are just kicking the 
>> tyres towards a binary distribution, or at least rpm artefacts as part of 
>> the Apache release to give them a head start for a happy path on a known 
>> good OS environment.
>> 
>> Simon
>> 
>>> On 24 Nov 2017, at 16:01, Nick Allen  wrote:
>>> 
>>> Yes, it is a problem.  I think you've identified a couple important things
>>> that we could address in parallel.  I see these as challenges we need to
>>> solve for the dev community.
>>> 
>>> (1) NPM is causing us some major headaches.  Which version do we require?
>>> How do I install that version (on Mac, Windows, Linux)?  Does YARN help
>>> here at all?
>>> 
>>> (2) Can we automate the prerequisite checks that we currently do manually
>>> with `platform-info.sh`?  An automated check could run and fail as part of
>>> the build or deployment process.
>>> 
>>> 
>>> 
>>> More importantly though is that users should not have to build Metron at
>>> all.  They should not have to worry about installing NPM and the rest of
>>> the development tooling.   Here are some options that are not mutually
>>> exclusive.
>>> 
>>> 
>>> (1) Create an image in Atlas that has Metron fully installed.  A new user
>>> could run single node Metron on their laptop with a single command and the
>>> only prereqs would be Vagrant and

Re: Using Storm Resource Aware Scheduler

2017-11-26 Thread Simon Elliston Ball
The multi-tenancy through meta-data method mentioned is designed to solve 
exactly that problem and has been in the project for some time now. The goal 
would be to have one topology per data schema and use the key to communicate 
tenant meta-data. See 
https://archive.apache.org/dist/metron/0.4.1/site-book/metron-platform/metron-parsers/index.html#Metadata 
for details.
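
Roughly, per sensor, that looks something like this (a sketch from memory — the 
exact option names are on the page above, so treat this as illustrative rather 
than definitive):

{
  "parserClassName": "org.apache.metron.parsers.GrokParser",
  "sensorTopic": "tenant_a_squid",
  "readMetadata": true,
  "mergeMetadata": true,
  "parserConfig": {}
}

The metadata carried on the kafka key then shows up on the message as 
metron.metadata.* fields (if I remember the default prefix right), which 
enrichments and indexing can key off per tenant.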

The storm issue you mention is something for the storm project to look at, so 
we can’t really comment on their behalf here, but yeah, it will be nice to have 
storm do some of the tuning for us at some point. 

Note that the UI already has the tuning parameters you’re talking about in the 
latest version, so there is no need for the new JIRA 
(https://issues.apache.org/jira/browse/METRON-1330). It should be closed as a 
duplicate of https://issues.apache.org/jira/browse/METRON-1161. 

Simon

> On 26 Nov 2017, at 02:15, Ali Nazemian  wrote:
> 
> Oops, I didn't know that. Happy Thanksgiving.
> 
> Thanks, Otto and Simon.
> 
> As you are aware of our use cases, with the current limitations of
> multi-tenancy support, we are creating a feed per tenant per device.
> Sometimes the amount of traffic we are receiving per each tenant and per
> each device is way less than dedicating one storm slot for it. Therefore, I
> was hoping to make it at least theoretically possible to tune resources
> more wisely, but it is not going to be easy at all. This is probably a use
> case that storm auto-scaling mechanism would be very nice to have.
> 
> https://issues.apache.org/jira/browse/STORM-594
> 
> On the other side, I can recall there was a PR to address multi-tenancy by
> adding meta-data to Kafka topic. However, I lost track of that feature, so
> maybe this situation can be tackled at another level by merging different
> parsers.
> 
> I will create a Jira ticket to add an ability in UI to tune Metron parser
> feeds at Storm level. Right now it is a little hard to maintain tuning
> configurations per each parser, and as soon as somebody restarts them from
> Management-UI/Ambari, it will be overwritten.
> 
> 
> Cheers,
> Ali
> 
> On Sat, Nov 25, 2017 at 3:36 AM, Simon Elliston Ball <
> si...@simonellistonball.com> wrote:
> 
>> Implementing the resource aware scheduler would be decidedly non-trivial.
>> Every topology will need additional configuration to tune for things like
>> memory sizes, which is not going to buy you much change. So, at the
>> micro-tuning level of parser this doesn’t make a lot of sense.
>> 
>> However, it may be relevant to consider separate tuning for parsers in
>> general vs the core enrichment and indexing topologies (potentially also
>> for separate indexing topologies when this comes in) and the resource
>> scheduler could provide a theoretical benefit there.
>> 
>> Specifying resource requirements per parser topology might sound like a
>> good idea, but if your parsers are working the way they should, they should
>> be using a small amount of memory as their default size, and achieving
>> additional resource use by multiplying workers and executors (to get higher
>> usage per slot) and balance the load that way. To be honest, the only
>> difference you’re going to get from the RAS is to add a bunch of tuning
>> parameters which allow slightly different granularity of units for things
>> like memory.
>> 
>> The other RAS feature which might be a good add is prioritisation of
>> different parser topologies, but again, this is probably not something you
>> want to push hard on unless you are severely limited in resources (in which
>> case, why not just add another node, it will be cheaper than spending all
>> that time micro-tuning the resource requirements for each data feed).
>> 
>> Right now we do allow a lot of micro tuning of parallelism around things
>> like the count of executor threads, which is achieves roughly the
>> equivalent of the cpu based limits in the RAS.
>> 
>> TL;DR:
>> 
>> If you’re not using resource pools for different users and using the idea
>> that prioritisation can lead to arbitrary kills, all you’re getting is a
>> slightly different way of tuning knobs that already exist, but you would
>> get a slightly different granularity. Also, we would have to rewrite all
>> the topology code to add the config endpoints for CPU and memory estimates.
>> 
>> Simon
>> 
>>> On 24 Nov 2017, at 07:56, Ali Nazemian  wrote:
>>> 

Re: [DISCUSS] NPM / Node Problems

2017-11-24 Thread Simon Elliston Ball
Agreeing with Nick, it seems like the main reason people are building Metron 
themselves, and hitting all these environmental issues, is that we do not as a 
project produce binary release artefacts (the rpms which users could just 
install) and instead leave that for the commercial distributors to do. 

Yarn may help with some of the dependency version issues we’re having, but not 
afaik with the core missing library headers / build tools / node and npm 
version issue, those would seem to fit a documentation fix and improvements to 
platform-info to flag the problems, so this can then be a pre-flight check tool 
as well as a diagnostic tool. 

Another option I would put on the table is to standardise our build 
environment, so that the non-java bits are run in a standard docker image or 
something of the sort, that way we can take control of all the environmental 
and OS dependent pieces, much as we do right now with the rpm build sections of 
the mpack build. 

The challenge here will be adding the relevant maven support. At the moment 
we’re relying on the maven npm and node build plugins; these would likely need 
replacing with something custom, which would be a challenge to support if we go 
down this route. 

Perhaps the real answer here is to push people who are just kicking the tyres 
towards a binary distribution, or at least rpm artefacts as part of the Apache 
release to give them a head start for a happy path on a known good OS 
environment. 

Simon

> On 24 Nov 2017, at 16:01, Nick Allen  wrote:
> 
> Yes, it is a problem.  I think you've identified a couple important things
> that we could address in parallel.  I see these as challenges we need to
> solve for the dev community.
> 
> (1) NPM is causing us some major headaches.  Which version do we require?
> How do I install that version (on Mac, Windows, Linux)?  Does YARN help
> here at all?
> 
> (2) Can we automate the prerequisite checks that we currently do manually
> with `platform-info.sh`?  An automated check could run and fail as part of
> the build or deployment process.
> 
> 
> 
> More importantly though is that users should not have to build Metron at
> all.  They should not have to worry about installing NPM and the rest of
> the development tooling.   Here are some options that are not mutually
> exclusive.
> 
> 
> (1) Create an image in Atlas that has Metron fully installed.  A new user
> could run single node Metron on their laptop with a single command and the
> only prereqs would be Vagrant and Virtualbox.  We could cut new images for
> each Metron release.  Or selectively cut new dev images from master as we
> see fit.
> 
> (2) Distribute the Metron RPMs (and the MPack tarball?) so that users can
> install Metron on a cluster without having to build it.
> 
> 
> 
> 
> 
> 
> On Fri, Nov 24, 2017 at 10:11 AM, Otto Fowler 
> wrote:
> 
>> It seems like it is getting *very* common for people to have trouble
>> building recently. Errors with NPM and Node seen common, with fixes ranging
>> from updating c/c++ libs to the version of npm/node.
>> 
>> There has to be a better way to do this.
>> 
>>   -
>> 
>>   Are we out of date or missing requirements in our documentation?
>>   -
>> 
>>   Does our documentation need to be updated for building?
>>   -
>> 
>>   Is there a better way in maven to check the versions required for some
>>   of these things and fail faster with a better message?
>>   -
>> 
>>   Are we building correctly or are we asking for trouble?
>> 
>> The ability to build metron is pretty important, and it seems that people
>> are having a lot of trouble related to the new technologies in alerts and
>> config ui.
>> 



Re: Using Storm Resource Aware Scheduler

2017-11-24 Thread Simon Elliston Ball
Implementing the resource aware scheduler would be decidedly non-trivial. Every 
topology would need additional configuration to tune for things like memory 
sizes, which is not going to buy you much change. So, at the level of 
micro-tuning individual parsers this doesn’t make a lot of sense. 

However, it may be relevant to consider separate tuning for parsers in general 
vs the core enrichment and indexing topologies (potentially also for separate 
indexing topologies when this comes in) and the resource scheduler could 
provide a theoretical benefit there.

Specifying resource requirements per parser topology might sound like a good 
idea, but if your parsers are working the way they should, they should be using 
a small amount of memory as their default size, and achieving additional 
resource use by multiplying workers and executors (to get higher usage per 
slot) and balance the load that way. To be honest, the only difference you’re 
going to get from the RAS is to add a bunch of tuning parameters which allow 
slightly different granularity of units for things like memory.

The other RAS feature which might be a good add is prioritisation of different 
parser topologies, but again, this is probably not something you want to push 
hard on unless you are severely limited in resources (in which case, why not 
just add another node, it will be cheaper than spending all that time 
micro-tuning the resource requirements for each data feed).

Right now we do allow a lot of micro tuning of parallelism around things like 
the count of executor threads, which achieves roughly the equivalent of the 
CPU-based limits in the RAS. 
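
For reference, most of those knobs already live in the sensor parser config, so 
a small feed can be squeezed down (or a big one expanded) without the RAS at 
all. A sketch — the exact field names may vary by version, so check the parser 
docs before relying on them:

{
  "parserClassName": "org.apache.metron.parsers.GrokParser",
  "sensorTopic": "squid",
  "numWorkers": 1,
  "numAckers": 1,
  "spoutParallelism": 1,
  "parserParallelism": 2,
  "stormConfig": {
    "topology.worker.childopts": "-Xmx512m"
  },
  "parserConfig": {}
}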

TL;DR: 

If you’re not using resource pools for different users, or relying on 
prioritisation (which can lead to arbitrary kills), all you’re getting is a 
slightly different way of tuning knobs that already exist, albeit at a 
different granularity. Also, we would have to rewrite all the topology code to 
add the config endpoints for CPU and memory estimates. 

Simon

> On 24 Nov 2017, at 07:56, Ali Nazemian  wrote:
> 
> Any help regarding this question would be appreciated.
> 
> 
> On Thu, Nov 23, 2017 at 8:57 AM, Ali Nazemian  wrote:
> 
>> 30 mins average of CPU load by checking Ambari.
>> 
>> On 23 Nov. 2017 00:51, "Otto Fowler"  wrote:
>> 
>> How are you measuring the utilization?
>> 
>> 
>> On November 22, 2017 at 08:12:51, Ali Nazemian (alinazem...@gmail.com)
>> wrote:
>> 
>> Hi all,
>> 
>> 
>> One of the issues that we are dealing with is the fact that not all of
>> the Metron feeds have the same type of resource requirements. For example,
>> we have some feeds that even a single Strom slot is way more than what it
>> needs. We thought we could make it more utilised in total by limiting at
>> least the amount of available heap space per feed to the parser topology
>> worker. However, since Storm scheduler relies on available slots, it is
>> very hard and almost impossible to utilise the cluster in the scenario
>> that
>> there will be lots of different topologies with different requirements
>> running at the same time. Therefore, on a daily basis, we can see that for
>> example one of the Storm hosts is 120% utilised and another is 20%
>> utilised! I was wondering whether we can address this situation by using
>> Storm Resource Aware scheduler or not.
>> 
>> P.S: it would be very nice to have a functionality to tune Storm
>> topology-related parameters per feed in the GUI (for example in Management
>> UI).
>> 
>> 
>> Regards,
>> Ali
>> 
>> 
>> 
> 
> 
> -- 
> A.Nazemian



Re: analytics exchange platform

2017-11-15 Thread Simon Elliston Ball
The analytics exchange concept is not really part of Apache Metron, but some 
commercial offerings include it. In terms of Metron itself, are you maybe 
thinking about Model as a Service: 
http://metron.apache.org/current-book/metron-analytics/metron-maas-service/index.html
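
By way of illustration, the usual pattern is to deploy a model via the MaaS CLI 
and then call it from Stellar in an enrichment, along these lines (a rough 
sketch based on the example in the docs above — the model name and the 
domain_without_subdomains field are illustrative, so double check there):

{
  "enrichment": {
    "fieldMap": {
      "stellar": {
        "config": {
          "is_malicious": "MAP_GET('is_malicious', MAAS_MODEL_APPLY(MAAS_GET_ENDPOINT('dga'), {'host' : domain_without_subdomains}))"
        }
      }
    }
  }
}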
 


Simon

> On 15 Nov 2017, at 00:54, Satish Abburi  wrote:
> 
>  
> Any pointers to this? We are looking to deploy few analytics packages on top 
> of Metron platform.
>  
> Thanks,
> Satish 
>  


