Re: [DISCUSS] Real-time processing engine: Storm, Spark, Flink or Cloud Native

2019-05-10 Thread Ali Nazemian
Oops. Turns out I totally missed this email.

Thanks, Mike, for your reply. Spark's native support for Kubernetes was
added only recently and is not yet at a stage where it can provide all of
the aforementioned features. There is no doubt that Spark is a powerful
tool, and it has been widely used for similar use cases over the last few
years. However, when we look at the features Spark provides and try to map
them onto Metron's high-level architecture, it is hard to believe that
Spark will bring much added value to this architecture on the
event-processing side (no doubt about the batch side of it, though). When
we compare that with more lightweight frameworks for event-driven data
processing pipelines and cloud-native architectures, you can see that
everything Spark targets on the real-time side can be covered natively by
your architecture (without help from your framework). Features like fault
tolerance, reliability, back pressure, at-least-once guarantees, etc. can
all be provided very easily. The only difference is that you get full
support for Kubernetes features out of the box, instead of waiting for the
technology to evolve and, perhaps in two years, reach the point where you
can truly have self-healing, change isolation, auto-scalability, etc. with
Spark, whereas you can have them all right now just by looking at this
problem from a different angle.
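
The at-least-once guarantee mentioned above boils down to "acknowledge
only after successful processing". A minimal, self-contained Python
sketch of the pattern (a toy in-memory queue, not any real broker API):

```python
def deliver_at_least_once(queue, process, ack_head):
    """Toy at-least-once loop: a message is acknowledged (removed from
    the queue) only after it is processed successfully, so a failure
    before the ack leads to redelivery -- duplicates are possible,
    but nothing is lost. (A real consumer would also cap retries.)"""
    delivered = []
    while queue:
        msg = queue[0]
        try:
            process(msg)
        except Exception:
            continue          # not acked: stays queued for redelivery
        ack_head(queue)       # remove only after success
        delivered.append(msg)
    return delivered

attempts = {}

def flaky_process(msg):
    """Fails the first attempt at 'b' to simulate a transient error."""
    attempts[msg] = attempts.get(msg, 0) + 1
    if msg == "b" and attempts[msg] == 1:
        raise RuntimeError("transient failure")

queue = ["a", "b", "c"]
delivered = deliver_at_least_once(queue, flaky_process, lambda q: q.pop(0))
# 'b' is attempted twice (a duplicate attempt), but every message arrives
```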

Cheers,
Ali

On Fri, Apr 12, 2019 at 3:54 AM Michael Miklavcic <
michael.miklav...@gmail.com> wrote:

> Hi Ali,
>
> Thank you for taking the time to share your experiences with us. I've been
> thinking about this a while now and wanted to take some time for reflection
> before responding. I need to kick out a proper dev list DISCUSS thread on
> this, but if you've seen a couple of the recent refactoring PRs, you are
> right that we've been looking to decouple ourselves from Storm and open up
> the possibility of onboarding another processing engine. The most obvious
> candidate here, imho, is Apache Spark. Getting right to the meat of your
> discussion points, I don't think this is an either/or proposition between
> Kubernetes and Spark. I see this as an AND proposition. The reality is that
> Spark offers quite a bit from a job scaling, redundancy, and efficiency
> perspective. Not to mention, the capabilities it provides purely from a
> data transformation and processing engine perspective. The real roadmap, at
> least in my mind, would be for us to onboard Spark and then leverage
> Kubernetes at some point to enable some of the features that you describe -
> vertical and horizontal elasticity, in particular. In addition to that,
> Helm could provide some compelling features for managing that container
> application deployment story. Expect a discussion from me very soon about
> more specific ideas as to what I think our integration with Spark can and
> should look like in the near future with Metron. We have nearly completed
> decoupling our core infrastructure from Storm at this point, which opens us
> up to a number of possibilities going forward.
>
> Best,
> Mike Miklavcic
>
>
> On Thu, Apr 4, 2019 at 1:35 AM Ali Nazemian  wrote:
>
> > Hi All,
> >
> > As far as I understood, there is a plan to change the real-time engine
> > of Metron due to some issues that users and developers have been facing
> > with it. I would like to explain some critical issues that customers
> > have been facing, to clarify for the development team what the best
> > approach could be for the future of Metron. Based on the experience we
> > have had with Metron, there are two important issues that cause lots of
> > problems from both the technology and business perspectives:
> >
> > - Infrastructure cost
> > - Operational complexity
> >
> > We have had lots of issues minimizing infrastructure cost. We have also
> > spent significant time tuning the infrastructure to reduce cost.
> > However, regardless of what had been done, we were not able to manage
> > our cost properly. The main reason is that the rate of log ingestion
> > fluctuates heavily. We were receiving 4k eps on a sensor during peak
> > time and less than 1 eps off-peak (e.g. during the night). The problem
> > is that you want an environment that can easily *scale up* and *scale
> > down* based on your ingestion traffic. Not to mention that there have
> > been situations where we could not even predict the ingestion rate,
> > such as during a cyber attack where lots of logs are generated by the
> > source devices. For example, DDoS might be one of the scenarios in
> > which lots of logs are generated.
> >
> > When it comes to operational compl

[DISCUSS] Real-time processing engine: Storm, Spark, Flink or Cloud Native

2019-04-04 Thread Ali Nazemian
Hi All,

As far as I understood, there is a plan to change the real-time engine of
Metron due to some issues that users and developers have been facing with
it. I would like to explain some critical issues that customers have been
facing, to clarify for the development team what the best approach could
be for the future of Metron. Based on the experience we have had with
Metron, there are two important issues that cause lots of problems from
both the technology and business perspectives:

- Infrastructure cost
- Operational complexity

We have had lots of issues minimizing infrastructure cost. We have also
spent significant time tuning the infrastructure to reduce cost. However,
regardless of what had been done, we were not able to manage our cost
properly. The main reason is that the rate of log ingestion fluctuates
heavily. We were receiving 4k eps on a sensor during peak time and less
than 1 eps off-peak (e.g. during the night). The problem is that you want
an environment that can easily *scale up* and *scale down* based on your
ingestion traffic. Not to mention that there have been situations where we
could not even predict the ingestion rate, such as during a cyber attack
where lots of logs are generated by the source devices. For example, DDoS
might be one of the scenarios in which lots of logs are generated.

When it comes to operational complexity, we have had lots of issues
managing sensors and tuning different parameters based on the traffic we
receive. We have also had lots of failures, for different reasons, and we
spent a fair amount of time writing scripts that could simulate a
*self-healing* capability at a very basic level. In a production use case,
we need to be able to respond to different situations very quickly. For
example, if a service is down, bring it up automatically; or if a new
sensor is onboarded, make sure that there won't be any risk to other
services. We also had lots of discussions about how to create different
processes or automated tests to make sure nothing can go wrong. However,
this forced us to create lots of platforms to test things from different
aspects, which increased our cost even more. We didn't have the capability
of provisioning a short-lived environment once a PR is submitted. We
really missed the ability to *provision an environment very quickly*. We
also needed the capability to isolate different sensors and different use
cases entirely, not only in the parser topology, but also in the
enrichment and indexing topologies. We needed a good mechanism for
*change isolation*.

I understand that the requirements of running an application in the cloud
are different from running it on-premises. However, the majority of them
are quite the same when it comes to running Metron in production.

We have recently delivered a data processing pipeline project using a more
cloud-native architecture, and we found out how similar the concerns were
and how easily Kubernetes helped us manage these problems by providing
native support for scaling up and down, self-healing, provisioning a
short-lived environment very quickly, and isolating our changes via canary
and blue-green deployments. Of course, following the twelve-factor
principles was important for us in managing those concerns. We used Spring
Cloud Stream to create an event-driven data processing pipeline for this,
along with some complementary frameworks provided by Pivotal. What has
come to my mind is: if other customers' experiences of using Metron in
production were similar to ours, and they have had the same sort of
concerns, could migrating from Storm to an event-driven pipeline help all
users have a better experience running Metron in production? Of course, I
have not been across other users' challenges, so I cannot answer that; it
is just an idea.
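
To make the canary idea above concrete: the routing decision itself is
tiny. A hedged sketch only (the function name `canary_route` is invented,
and a real deployment would make this decision at the service mesh or
ingress layer, not in application code):

```python
def canary_route(request, stable, canary, canary_fraction, rng):
    """Send roughly `canary_fraction` of traffic to the new version;
    everything else stays on the known-good one."""
    handler = canary if rng() < canary_fraction else stable
    return handler(request)

# Deterministic demo: feed fixed "random" samples around a 10% canary share.
calls = {"stable": 0, "canary": 0}
stable = lambda r: calls.__setitem__("stable", calls["stable"] + 1)
canary = lambda r: calls.__setitem__("canary", calls["canary"] + 1)

samples = [0.05, 0.50, 0.95, 0.07, 0.80, 0.30, 0.60, 0.12, 0.99, 0.02]
it = iter(samples)
for _ in samples:
    canary_route("event", stable, canary, 0.1, lambda: next(it))
# 3 of the 10 samples fall below 0.1, so 3 requests hit the canary
```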

There is no doubt that we can have all these features by using Spark as
well in future, but it requires more time to build the integration and some
of these functionalities are not going to be available very soon. It is
just a thought that the Metron architecture is already Event-Driven at some
stages and state-less by nature. Which makes it a good fit for using an
event-driven pipeline to deploy it on containers.

Cheers,
Ali


Re: [DISCUSS] Handling dropped messages in REGEX_SELECT with Kafka topic routing

2019-01-07 Thread Ali Nazemian
Just one thing to bear in mind: publishing an error may cause some
operational challenges, as it fills up the error topic, as well as the
Storm logs, which may not be necessary. Wearing a Metron user hat,
dropping the message with a debug/trace-level log specifying that the
event was filtered out makes sense. I guess if we want to make this really
fancy, having the flexibility to decide what happens next would be nice,
as options 2 and 3 would be required in some special cases (it makes
things a bit more complex, though). Of course, the default can be the drop
with the ack.
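
For what it's worth, that "decide what happens next" flexibility could be
as small as a per-sensor policy setting. A sketch only -- the policy and
topic names below are invented for illustration, not real Metron config
options:

```python
# Hypothetical policies for a message that REGEX_SELECT fails to match;
# DROP_AND_ACK reflects the proposed default.
DROP_AND_ACK, PUBLISH_ERROR, ROUTE_TO_DEFAULT = "drop", "error", "default"

def handle_unmatched(message, policy, default_topic="enrichments"):
    """Return (topic, message) for where the unmatched message goes,
    or (None, None) when it is simply dropped and acked."""
    if policy == PUBLISH_ERROR:
        return ("indexing_error", message)   # visible, but noisy
    if policy == ROUTE_TO_DEFAULT:
        return (default_topic, message)
    return (None, None)                      # drop + ack, maybe DEBUG-log it
```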

Cheers,
Ali

On Thu, Dec 20, 2018 at 8:18 AM Michael Miklavcic <
michael.miklav...@gmail.com> wrote:

> Completely agreed on the acking. The reason I posed the question to begin
> with was because, while I believe dropping+acking is the correct
> functionality, I could see a few alternative patterns for handling this:
>
>1. Require filtering to be handled by the message filter infrastructure
>and publish an error to the error queue if field transformations such as
>REGEX_SELECT violate this by dropping messages.
>2. Default records to be written to enrichments, or handle per my
>comments in #1
>3. Default records to be written to the topic defined by outputTopic
>(non-default version of #2)
>
> At any rate, we should fix the acking problem and then the dropped messages
> pattern makes sense to me. I've created a Jira to track it -
> https://issues.apache.org/jira/browse/METRON-1948.
>
> On Wed, Dec 19, 2018 at 12:43 PM Casey Stella  wrote:
>
> > We absolutely should be acking the dropped messages otherwise they'll be
> in
> > a replay loop.  Not acking is a flat-out bug IMO.
> >
> > On Wed, Dec 19, 2018 at 2:37 PM Michael Miklavcic <
> > michael.miklav...@gmail.com> wrote:
> >
> > > When a message is filtered by the message filtering mechanism, we
> > > explicitly drop the message (and presumably ack it in Storm), as
> > explained
> > > here -
> > >
> > >
> >
> https://github.com/apache/metron/tree/master/metron-platform/metron-parsing#filtered
> > > .
> > > When using the REGEX_SELECT field transformation (see here -
> > >
> > >
> >
> https://github.com/apache/metron/tree/master/metron-platform/metron-parsing#fieldtransformation-configuration
> > > )
> > > with the kafka.topicField option for parser-chaining, it's unclear to
> me
> > > whether we expect the same behavior (drop message, ack it). The
> > > interpretation I get from this example in the parser-chaining doc
> > >
> > >
> >
> https://github.com/apache/metron/tree/master/use-cases/parser_chaining#the-pix_syslog_router-parser
> > > suggests to me that the approach we take for messages with message
> > > filtering is the correct one, however in testing an example with
> dropped
> > > messages, we appear not to ack those dropped messages.
> > >
> > > Before I go creating a fix I thought it best to summarize and confirm
> my
> > > expectations on this functionality. Messages from a REGEX_SELECT that
> > don't
> > > match a pattern, and therefore don't get a value assigned to their
> output
> > > topic value, should be dropped and acked.
> > >
> > > *Example:*
> > > {
> > >   "parserClassName": "org.apache.metron.parsers.GrokParser",
> > >   "sensorTopic": "myInTopic",
> > >   ...
> > >   "parserConfig": {
> > >     ...,
> > >     "kafka.topicField": "output_topic"
> > >   },
> > >   "fieldTransformations": [
> > >     {
> > >       "input": [ "message" ],
> > >       "output": [ "output_topic" ],
> > >       "transformation": "REGEX_SELECT",
> > >       "config": {
> > >         "world": "^Hello "
> > >       }
> > >     },
> > >     ...
> > >   ]
> > > }
> > >
> > > *Input Records:*
> > > "...sshd[32469]: Hello..."
> > > "...sshd[30432]: Bye..."
> > >
> > > *Output:*
> > > Kafka topic = "world" (as determined by the REGEX_SELECT pattern match
> > that
> > > sets the "output_topic" property used by kafka.topicField)
> > > 1 record present
> > > contents of that record = our record with "Hello" in it
> > > 1 record is dropped ("Bye" record) and will not be forwarded any
> further
> > > through the pipeline.
> > >
> >
>
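
The expected behavior in the example above can be simulated in a few
lines of plain Python (an illustration of the semantics only, not
Metron's actual implementation):

```python
import re

def regex_select(message, patterns, topic_field="output_topic"):
    """Mimics REGEX_SELECT + kafka.topicField routing: the first matching
    pattern names the output topic; no match means drop (and ack)."""
    for topic, pattern in patterns.items():
        if re.search(pattern, message["message"]):
            out = dict(message)
            out[topic_field] = topic
            return out
    return None  # dropped and acked, never forwarded

patterns = {"world": "^Hello "}
routed = regex_select({"message": "Hello operator"}, patterns)
dropped = regex_select({"message": "Bye operator"}, patterns)
```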


-- 
A.Nazemian


Re: [DISCUSS] Recurrent Large Indexing Error Messages

2018-12-05 Thread Ali Nazemian
I think if you look at indexing error management, it is pretty similar to
the parser and enrichment error use cases. It is even more common to
expect something to end up in the error topics. I think a wider,
independent job could take care of error management. A separate topology
could be added later on to manage error logs and create
alerts/notifications separately; it could even be integrated with log
feeder and log search.
The scenario of sending a solution's operational logs back into the same
solution is a bit weird and not enterprise-friendly. Normally the platform
operations team is a separate team with different objectives, and they
probably have a separate monitoring/notification solution in place
already.

I don't think it is the end of the world if this part is left to be
managed by users. So I prefer option 2 in the short term; a long-term
solution can be discussed separately.

Cheers,
Ali

On Sat, 20 Oct. 2018, 05:20 Nick Allen wrote:

> I want to discuss solutions for the problem that I have described in
> METRON-1832; Recurrent Large Indexing Error Messages. I feel this is a very
> easy trap to fall into when using the default settings that currently come
> with Metron.
>
>
> ## Problem
>
>
> https://issues.apache.org/jira/browse/METRON-1832
>
>
> If any index destination like HDFS, Elasticsearch, or Solr goes down while
> the Indexing topology is running, an error message is created and sent back
> to the user-defined error topic.  By default, this is defined to also be
> the 'indexing' topic.
>
> The Indexing topology then consumes this error message and attempts to
> write it again. If the index destination is still down, another error
> occurs and another error message is created that encapsulates the original
> error message.  That message is then sent to the 'indexing' topic, which is
> later consumed, yet again, by the Indexing topology.
>
> These error messages will continue to be recycled and grow larger and
> larger as each new error message encapsulates all previous error messages
> in the "raw_message" field.
>
> Once the index destination recovers, one giant error message will finally
> be written that contains massively duplicated, useless information which
> can further negatively impact performance of the index destination.
>
> Also, the escape characters '\' continually compound on one another,
> leading to long strings of '\' characters in the error message.
>
>
> ## Background
>
> There was some discussion on how to handle this on the original PR #453
> https://github.com/apache/metron/pull/453.
>
> ## Solutions
>
> (1) The first, easiest option is to just do nothing.  There was already a
> discussion around this and this is the solution that we landed on in #453.
>
> Pros: Really easy; do nothing.
>
> Cons: Intermittent problems with ES/Solr can easily create very large error
> messages that can significantly impact both search and ingest performance.
>
>
> (2) Change the default indexing error topic to 'indexing_errors' to avoid
> recycling error messages. Nothing will consume from the 'indexing_errors'
> topic, thus preventing a cycle.
>
> Pros: Simple, easy change that prevents the cycle.
>
> Cons: Recoverable indexing errors are not visible to users as they will
> never be indexed in ES/Solr.
>
> (3) Add logic to limit the number of times a message can be 'recycled'
> through the Indexing topology.  This effectively sets a maximum number of retry
> attempts.  If a message fails N times, then write the message to a separate
> unrecoverable, error topic.
>
> Pros: Recoverable errors are visible to users in ES/Solr.
>
> Cons: More complex.  Users still need to check the unrecoverable, error
> topic for potential problems.
>
> (4) Do not further encapsulate error messages in the 'raw_message' field.
> If an error message fails, don't encapsulate it in another error message.
> Just push it to the error topic as-is.  Could add a field that indicates
> how many times the message has failed.
>
> Pros: Prevents giant error messages from being created from recoverable
> errors.
>
> Cons: Extended outages would still cause the Indexing topology to
> repeatedly recycle these error messages, which would ultimately exhaust
> resources in Storm.
>
>
>
> What other ways can we solve this?
>
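
The snowballing described above is easy to demonstrate: each failed write
wraps the previous error message, whole, into `raw_message`, and every
level of JSON escaping multiplies the quotes and backslashes. A
standalone illustration (not Metron's actual error schema):

```python
import json

def wrap_error(previous):
    """Each indexing failure wraps the prior message verbatim, the way
    the recycled error path does with its 'raw_message' field."""
    return {"error_type": "indexing_error",
            "raw_message": json.dumps(previous)}

msg = {"message": 'original "event"'}
sizes = []
for _ in range(5):
    msg = wrap_error(msg)
    sizes.append(len(json.dumps(msg)))
# sizes grow super-linearly: every pass re-escapes all prior quotes and
# backslashes, producing the long '\' runs mentioned in the thread
```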


Re: Authorization for Configuration

2018-12-03 Thread Ali Nazemian
Would it be based on the operation as well, like being able to read or
modify? And is this scenario valid from the user-experience perspective: a
user may be authorised to change the indexing/enrichment config (because
there is only one topology for them), but because he/she doesn't have
sufficient privilege to apply parser configs, may not be able to modify
those through the Management UI? I personally think it would be better to
keep it simple, so that it stays consistent across all the configs. For
example, very few roles (groups) able to either read configs or also
modify them. Per-sensor access management might be something that can be
added later on. I guess when it comes to the CLI, it may become even more
confusing to have per-sensor access for parsers but whole-system access
for the rest.
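
The pluggable check described later in this thread -- an interface that
takes the operation and context (user and other info) and returns
true/false -- might look like the following sketch. The class and rule
names are invented; a real implementation would be Java hooked into
ConfigurationsUtils, with Ranger supplying the policy store:

```python
class Authorizer:
    """Pluggable check: operation + context in, boolean decision out."""
    def authorize(self, user, groups, operation, resource):
        raise NotImplementedError

class SimpleGroupAuthorizer(Authorizer):
    def __init__(self, rules):
        # rules: {(group, operation): set of resources}; "*" means all
        self.rules = rules

    def authorize(self, user, groups, operation, resource):
        return any(
            resource in allowed or "*" in allowed
            for group in groups
            for allowed in [self.rules.get((group, operation), set())]
        )

# Coarse roles, as suggested above: admins write anything, analysts may
# write only one hypothetical parser config.
auth = SimpleGroupAuthorizer({
    ("admins", "write_config"): {"*"},
    ("analysts", "write_config"): {"parser/bro"},
})
```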

On Sat, Dec 1, 2018 at 1:36 AM Justin Leet  wrote:

> To start with, I'm thinking just the configuration, in particular anything
> that touches the ConfigurationsUtils.  I think for the first pass, we could
> leave reading from configs out of it and focus on who can write configs
> (since that's more disruptive to the system).
>
> To the best of my knowledge, this is:
>
>- Configuration through the Management UI
>- Configuration through the CLI (e.g. the push script + Stellar + any
>other scripts that have to do it).
>
> In terms of how fine-grained, I would essentially expect this to be at the
> users+groups level and the individual sensor / topology level (e.g. parsers
> can be authorized individually by users/group, indexing is on the whole).
> Essentially, I don't propose trying to rearchitect things to try to have
> multiple indexing topologies for complete separation, etc.  Just to take
> the existing setup and be able to apply authorization on top of it.
>
> I would expect anything covered by the ZK configs to be done as part of
> this effort (possibly in stages).  As noted, I would expect this to be a
> feature branch rather than piecemeal replacement.
>
> @Mike Yeah, I agree. The Jira for that doesn't exist yet, pretty much
> pending this exact conversation winding down.
>
> On Tue, Nov 20, 2018 at 12:08 PM Michael Miklavcic <
> michael.miklav...@gmail.com> wrote:
>
> > Justin, It probably makes sense to provide a list of these configuration
> > items as subtasks on the FB Jira so that we can crosscheck what entry
> > points have been implemented against the test scripts. Do you think this
> > will impact streaming enrichments or the profiler at all? That is to say,
> > as Ali asked, just how far are you looking to take the fine grained auth
> > scope for this?
> >
> > M
> >
> > On Mon, Nov 19, 2018 at 11:37 PM Ali Nazemian 
> > wrote:
> >
> > > Hi Justin,
> > >
> > > By configuration do you mean the sensor-related configurations only?
> > > Are you limiting the scope of this activity to the Management UI, or
> > > to the Alerts UI as well? For example, defining different roles
> > > (pre-defined or customizable) and fine-grained integration with Ranger?
> > >
> > > Cheers,
> > > Ali
> > >
> > > On Wed, Nov 14, 2018 at 1:25 AM Justin Leet 
> > wrote:
> > >
> > > > Hi all,
> > > >
> > > > Right now, our various configs can be modified by anyone with access
> to
> > > the
> > > > various scripts. I'd like to start a discussion around building out
> > some
> > > > authorization to be able to add some more fine grained controls
> around
> > > > this.
> > > >
> > > > Other projects have some variants on how to accomplish this.
> > Typically,
> > > > this follows a pattern of calling out to an interface/class that takes
> > in
> > > > the operation and context (user and other info) and returns
> true/false
> > if
> > > > something is authorized.
> > > >
> > > > In my mind, what we would need out of this is
> > > >
> > > >- Ability to apply fine grained permissions
> > > >- The various scripts and UI should flow through this authorization
> > framework.
> > > >I believe most (if not all) of our configuration flows through
> > > >ConfigurationsUtils.  Anything that doesn't should either be
> hooked
> > in
> > > > or
> > > >refactored to flow through the same codepaths.
> > > >- Pluggability. We shouldn't force only one authorizer.
> > > >
> > > > In particular, I'm proposing we use Apache Ranger
> > > > <https://ranger.ap

Re: [DISCUSS] Deprecating MySQL

2018-11-19 Thread Ali Nazemian
It's a great move toward LDAP integration, and hopefully Ranger
integration afterwards. Does it need to support LDAP and AD separately?

Cheers,
Ali

On Sat, Nov 17, 2018 at 3:29 AM Otto Fowler  wrote:

> I would like to understand the work required to move our JDBC support ( or
> adapt the current support to the abstraction ) to /contrib.
> We could default and only officially support LDAP, but have the /contrib (
> or /extension_examples ) have a “this is how you would support jdbc for
> auth “ project.
>
>
>
> On November 15, 2018 at 15:01:10, Michael Miklavcic (
> michael.miklav...@gmail.com) wrote:
>
> Yes, makes sense. +1 to that.
>
> On Thu, Nov 15, 2018 at 12:54 PM James Sirota  wrote:
>
> > To clarify my position, I don't have a problem with mySql or any other
> > projects relying on it. mySql in itself is not an issue. What I don't
> > want is for a customer to be presented with an option to choose and
> > configure two options for authenticating the UI, which I think is
> > needless. It adds complexity for not much value. Since LDAP is clearly
> > the better way to go that should be what we support without explicitly
> > giving a user an option to switch to JDBC. A user can still do so by
> > extending our abstractions if that is what they choose to do, but this
> would
> > not be officially supported by us. We would not be providing a config or
> > an mPack to do this. A user would have to do it on their own.
> >
> > James
> >
> >
> >
> > 15.11.2018, 12:15, "Michael Miklavcic" :
> > > Incidentally, even without the Metron piece in the picture, what is the
> > > answer for Ambari's database dependency? Which uses a SQL data store.
> > Does
> > > this actually solve the problem of "customers won't install Metron bc
> SQL
> > > store?" or are there other issues we need to address?
> > >
> > > On Thu, Nov 15, 2018 at 9:30 AM James Sirota 
> wrote:
> > >
> > >> Hi Guys,
> > >>
> > >> My opinion on this, as is with Knox SSO, is that the code should be
> > >> pluggable to support JDBC, but we should not continue to support the
> > >> concrete implementation and expose it to users via a setting. This is
> a
> > >> fairly minor feature and the added complexity of supporting switching
> > >> between JDBC and LDAP is simply not worth it. We need to strike a
> > balance
> > >> between ease of use and capabilities/extensibility. For features that
> > are
> > >> worth it such as with analytics and stream processing, the extra
> > capability
> > >> is worth the added complexity in configuration. But for this, it is
> > not.
> > >> So let's keep JDBC around for a release to allow users to migrate to
> > LDAP,
> > >> deprecate it, and move on.
> > >>
> > >> Thanks,
> > >> James
> > >>
> > >> 13.11.2018, 16:03, "Simon Elliston Ball"  > >:
> > >> > We went over the hbase user settings thing on extensive discussions
> > at
> > >> the time. Storing an arbitrary blob of JSON which is only ever
> > accessed by
> > >> a single key (username) was concluded to be a key value problem, not a
> > >> relational problem. Hbase was concluded to be massive overkill as a
> key
> > >> value store in this usecase, unless it was already there and ready to
> > go,
> > >> which in the case of Metron, it is, for enrichments, threat intel and
> > >> profiles. Hence it ended up in Hbase, as a conveniently present data
> > store
> > >> that matched the usage patterns. See
> > >>
> >
>
> https://lists.apache.org/thread.html/145b3b8ffd8c3aa5bbfc3b93f550fc67e71737819b19bc525a2f2ce2@%3Cdev.metron.apache.org%3E
> > >> and METRON-1337 for discussion.
> > >> >
> > >> > Simon
> > >> >
> > >> >> On 13 Nov 2018, at 18:50, Michael Miklavcic <
> > >> michael.miklav...@gmail.com> wrote:
> > >> >>
> > >> >> Thanks for the write up Simon. I don't think I see any major
> > problems
> > >> with
> > >> >> deprecating the general sql store. However, just to clarify, Metron
> > >> does
> > >> >> NOT require any specific backing store. It's 100% JPA, which means
> > >> anything
> > >> >> that can be configured with the Spring properties we expose. I
> think
> > >> the
> > >> >> most opinionated thing we do there is ship an extremely basic table
> > >> >> creation script for h2 and mysql as a simple example for schema. As
> > an
> > >> >> example, we simply use H2 in full dev, which is entirely in-memory
> > and
> > >> spun
> > >> >> up automatically from configuration. The recent work by Justin Leet
> > >> removes
> > >> >> the need to use a SQL store at all if you choose LDAP -
> > >> >> https://github.com/apache/metron/pull/1246. I'll let him comment
> > >> further on
> > >> >> this, but I think there is one small change that could be made via
> a
> > >> toggle
> > >> >> in Ambari that would even eliminate the user from seeing JDBC
> > settings
> > >> >> altogether during install if they choose LDAP. Again, I think I'm
> on
> > >> board
> > >> >> with deprecating the SQL backing store as I pointed this out on the
> > >> Knox
> > >> >> thread as well, but I j

Re: [DISCUSS] Deprecate split-join enrichment topology in favor of unified enrichment topology

2018-11-19 Thread Ali Nazemian
Hi,

One thing to point out here: a few timestamp fields that exist in the
split-join enrichment topology haven't made it into the unified one. For
example, there is no threat intel bolt timestamp. There might be some
SLA-related use cases for these timestamp fields, so it would be nice to
have them before deprecating the split-join topology. Generally speaking,
it makes sense to deprecate the split-join topology, though.

Cheers,
Ali

On Fri, Nov 16, 2018 at 3:40 AM James Sirota  wrote:

> This is excellent work, Mike and long overdue.  Thanks for doing this
>
> 05.11.2018, 16:46, "Michael Miklavcic" :
> > The PR has now been merged into master and closed.
> >
> > https://issues.apache.org/jira/browse/METRON-1855
> >
> > On Sat, Nov 3, 2018 at 6:47 PM Michael Miklavcic <
> > michael.miklav...@gmail.com> wrote:
> >
> >>  PR is out here - https://github.com/apache/metron/pull/1252
> >>
> >>  I made the unified enrichment topology the new default and marked the
> >>  split-join topology as deprecated in various parts of the
> documentation. I
> >>  think we should have a release with the deprecation notes and new
> default
> >>  and then move to remove split-join entirely shortly thereafter.
> >>
> >>  Best,
> >>  Mike
> >>
> >>  On Fri, Nov 2, 2018 at 5:47 AM Mohan Venkateshaiah <
> >>  mvenkatesha...@hortonworks.com> wrote:
> >>
> >>>  +1 (non-binding)
> >>>
> >>>  Thanks
> >>>  Mohan DV
> >>>
> >>>  On 11/2/18, 3:29 PM, "zeo...@gmail.com"  wrote:
> >>>
> >>>  +1 totally agree.
> >>>
> >>>  Jon
> >>>
> >>>  On Fri, Nov 2, 2018, 1:31 AM Anand Subramanian <
> >>>  asubraman...@hortonworks.com>
> >>>  wrote:
> >>>
> >>>  > Piling on my +1 (non-binding) as well.
> >>>  >
> >>>  > On 11/2/18, 4:41 AM, "Ryan Merriman" 
> wrote:
> >>>  >
> >>>  > +1
> >>>  >
> >>>  > On Thu, Nov 1, 2018 at 5:38 PM Casey Stella  >>>  >
> >>>  > wrote:
> >>>  >
> >>>  > > +1
> >>>  > > On Thu, Nov 1, 2018 at 18:34 Nick Allen 
> >>>  wrote:
> >>>  > >
> >>>  > > > +1
> >>>  > > >
> >>>  > > > On Thu, Nov 1, 2018, 6:27 PM Justin Leet <
> >>>  justinjl...@gmail.com>
> >>>  > wrote:
> >>>  > > >
> >>>  > > > > +1, I haven't seen any case where the split-join topology
> >>>  isn't
> >>>  > made
> >>>  > > > > obsolete by the unified topology.
> >>>  > > > >
> >>>  > > > > On Thu, Nov 1, 2018 at 6:17 PM Michael Miklavcic <
> >>>  > > > > michael.miklav...@gmail.com> wrote:
> >>>  > > > >
> >>>  > > > > > Fellow Metronians,
> >>>  > > > > >
> >>>  > > > > > We've had the unified enrichment topology around for a
> >>>  number
> >>>  > of
> >>>  > > months
> >>>  > > > > > now, it has proved itself stable, and there is yet to
> >>>  be a
> >>>  > time that
> >>>  > > I
> >>>  > > > > have
> >>>  > > > > > seen the split-join topology outperform the unified
> >>>  one. Here
> >>>  > are
> >>>  > > some
> >>>  > > > > > simple reasons to deprecate the split-join topology.
> >>>  > > > > >
> >>>  > > > > > 1. Unified topology performs better.
> >>>  > > > > > 2. The configuration, especially for performance
> >>>  tuning is
> >>>  > much,
> >>>  > > > much
> >>>  > > > > > simpler in the unified model.
> >>>  > > > > > 3. The footprint within the cluster is smaller.
> >>>  > > > > > 4. One of the first activities for any install is
> >>>  that we
> >>>  > spend
> >>>  > > time
> >>>  > > > > > instructing users to switch to the unified topology.
> >>>  > > > > > 5. One less moving part to maintain.
> >>>  > > > > >
> >>>  > > > > > I'd like to recommend that we deprecate the split-join
> >>>  > topology and
> >>>  > > > make
> >>>  > > > > > the unified enrichment topology the new default.
> >>>  > > > > >
> >>>  > > > > > Best,
> >>>  > > > > > Mike
> >>>  > > > > >
> >>>  > > > >
> >>>  > > >
> >>>  > >
> >>>  >
> >>>  >
> >>>  > --
> >>>
> >>>  Jon Zeolla
>
> ---
> Thank you,
>
> James Sirota
> PMC- Apache Metron
> jsirota AT apache DOT org
>
>

-- 
A.Nazemian


Re: [DISCUSS] Authorization for Configuration

2018-11-19 Thread Ali Nazemian
Hi Justin,

By configuration do you mean the sensor-related configurations only? Are
you limiting the scope of this activity to the Management UI, or to the
Alerts UI as well? For example, defining different roles (pre-defined or
customizable) and fine-grained integration with Ranger?

Cheers,
Ali

On Thu, Nov 15, 2018 at 1:16 AM Justin Leet  wrote:

> Hi all,
>
> Sorry for the second copy of this email, I forgot the  [DISCUSS] tag on the
> original.  Otherwise, this has the same content.
>
> Right now, our various configs can be modified by anyone with access to the
> various scripts. I'd like to start a discussion around building out some
> authorization to be able to add some more fine grained controls around
> this.
>
> Other projects have some variants on how to accomplish this.  Typically,
> this follows a pattern of calling out to an interface/class that takes in
> the operation and context (user and other info) and returns true/false if
> something is authorized.
>
> In my mind, what we would need out of this is
>
>- Ability to apply fine grained permissions
>- The various scripts and UI should flow through this authorization framework.
>I believe most (if not all) of our configuration flows through
>ConfigurationsUtils.  Anything that doesn't should either be hooked in
> or
>refactored to flow through the same codepaths.
>- Pluggability. We shouldn't force only one authorizer.
>
> In particular, I'm proposing we use Apache Ranger
>  as a supported authorization framework,
> implementing it alongside the authorization framework to validate what we
> build. In my mind, the main catch with Ranger is that, based on my
> understanding, we won't be able to restrict users with direct access to
> ZooKeeper via its CLI (e.g. Ranger can't mirror its ACLs down into ZK's
> ACLs).  I believe this is a reasonable restriction, especially as the
> management UI gets improved to handle more of the configuration burden and
> the number of users with access to ZK CLI begins to decrease.  Users can
> still add ZK ACLs separately to enforce that access there.
>
> For anyone not familiar with Ranger, essentially you build a plugin that
> hooks into the existing component's authorization framework (e.g. for
> Storm, the plugin runs through the IAuthorizer
> <
> https://storm.apache.org/releases/1.2.2/javadocs/org/apache/storm/security/auth/IAuthorizer.html
> >
> interface,
> for Yarn it runs through YarnAuthorizationProvider
> <
> http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-common/apidocs/org/apache/hadoop/yarn/security/YarnAuthorizationProvider.html
> >).
> Additionally, Ranger provides auditing capabilities for this authorization
> and has plugins for a good portion of our stack already (so users can
> already set up ACLs on HDFS, Storm, etc.). Check out the Ranger GitHub
>  for a list of the plugins they have
> built in.
>
> What this means for Metron is building out an authorization setup similar
> to Storm or Yarn or whatever we choose. We'll want this anyway, to allow
> our solution to be pluggable.  At that point, we build an implementation of
> the authorizer compatible with Ranger along with the plugin.
>
> I think this could probably be a fairly small feature branch, which I'm
> suggesting primarily to do the Ranger implementation alongside the general
> authorization work to validate what's being built.  I think the main
> tasking would be something similar to:
>
>- Build out pluggable authorization for our configs.
>- This includes testing (and possibly doing something similar to Storm,
>where they have some testing IAuthorizers, e.g. NoopAuthorizer,
>DenyAuthorizer, etc.)
>- Ensure that all the code paths consistently flow through this
>Authorization.
>- Build a Ranger compatible version of this.
>- Define the Ranger plugin for this.
>- Make sure auditing is defined.
>- Integration testing (particularly with Kerberos. After all, if they
>want to do authorization and auditing, they're almost certainly using
>Kerberos).
>
> Is there anything missing that we'd need or want for this?  Are there any
> other concerns we'd want to make sure are taken care of?
>
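To make the pluggable-authorizer idea above concrete, here is a purely hypothetical sketch. None of these class or method names exist in Metron; they are modeled loosely on the Storm IAuthorizer pattern Justin references, and are only meant to show the shape of the contract:

```java
// Hypothetical sketch only: these names do not exist in Metron.
import java.util.Set;

public class AuthorizerSketch {

    // The pluggable contract: given a principal and an operation on a config
    // path, return true/false -- mirroring Storm's IAuthorizer style.
    interface ConfigurationAuthorizer {
        boolean isAuthorized(String principal, String operation, String configPath);
    }

    // A trivial testing authorizer, analogous to Storm's DenyAuthorizer.
    static class DenyAuthorizer implements ConfigurationAuthorizer {
        @Override
        public boolean isAuthorized(String principal, String operation, String configPath) {
            return false;
        }
    }

    // A simple allow-list implementation: named admins may do anything,
    // everyone else is read-only.
    static class AllowListAuthorizer implements ConfigurationAuthorizer {
        private final Set<String> admins;

        AllowListAuthorizer(Set<String> admins) {
            this.admins = admins;
        }

        @Override
        public boolean isAuthorized(String principal, String operation, String configPath) {
            return admins.contains(principal) || "READ".equals(operation);
        }
    }

    public static void main(String[] args) {
        ConfigurationAuthorizer auth = new AllowListAuthorizer(Set.of("metron_admin"));
        System.out.println(auth.isAuthorized("metron_admin", "WRITE", "/metron/sensors/bro")); // true
        System.out.println(auth.isAuthorized("analyst", "WRITE", "/metron/sensors/bro"));      // false
        System.out.println(auth.isAuthorized("analyst", "READ", "/metron/sensors/bro"));       // true
    }
}
```

A Ranger-backed version would be just another implementation of the same interface, delegating the allow/deny decision to the Ranger plugin and emitting the corresponding audit event.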


-- 
A.Nazemian


Re: [DISCUSS] Slack Channel Use

2018-10-23 Thread Ali Nazemian
I would expect Slack to be used for dev-related discussions rather than
user Q&A. It is common to expect the mailing list to be used for knowledge
sharing, so that answers remain accessible to other users as well. It is a
trade-off, and most other Apache projects have accepted the risk of keeping
user-related discussions out of Slack/IRC. In practice, though, a mix of
questions still ends up on Slack. I have even heard recently that people
thought Metron was more or less dead just because the mailing list is not
so active anymore!

Cheers,
Ali

On Tue, Oct 23, 2018 at 8:23 AM Casey Stella  wrote:

> Agreed, the benefit of the mailing list is that it’s searchable by ponymail
> and the major search engines.
> On Mon, Oct 22, 2018 at 17:18 Nick Allen  wrote:
>
> > I don't know that it is the same kind of searchable.  Is it being indexed
> > by the major search engines?  I have never used a search engine and
> > uncovered the answer to my problem in a Slack archive.
> >
> > On Mon, Oct 22, 2018 at 5:05 PM Otto Fowler 
> > wrote:
> >
> > > According to Greg Stein, an infra admin on the NiFi slack, the ASF
> slack
> > > that metron is in IS the standard plan, not the free one and is
> > searchable
> > > past 10,000 messages.
> > >
> > >
> > >
> > > On October 22, 2018 at 15:35:51, Michael Miklavcic (
> > > michael.miklav...@gmail.com) wrote:
> > >
> > > ...From an archival and broader reach point of view, I do think there's
> > > something to be said about using the mailing list. It's also easier to
> > link
> > > to Q/A threads from the mailing list archives and do searches...
> > >
> > >
> >
> https://lists.apache.org/thread.html/1aa85bc13d41e04a1f85c3100c2b803abe35d79b54062bbeaab83ace@%3Cdev.metron.apache.org%3E
> > >
> > > How very Inception.
> > >
> > >
> > > On Mon, Oct 22, 2018 at 1:32 PM Michael Miklavcic <
> > > michael.miklav...@gmail.com> wrote:
> > >
> > > > I just want to point out that we currently have 32 members in the
> > Metron
> > > > Slack channel which I personally think is a great sign. This is good
> > from
> > > a
> > > > community perspective and helps foster interactive sessions where
> > > required.
> > > > From an archival and broader reach point of view, I do think there's
> > > > something to be said about using the mailing list. It's also easier
> to
> > > link
> > > > to Q/A threads from the mailing list archives and do searches. As
> > such, I
> > > > would also go along with Nick's suggestion and urge members to prefer
> > the
> > > > user/dev list where possible.
> > > >
> > > > On Mon, Oct 22, 2018 at 10:51 AM Justin Leet 
> > > > wrote:
> > > >
> > > >> If we want to push more discussion to the dev list, my obvious
> follow
> > up
> > > >> question then is "What are we hoping to get out of Slack/irc/other
> > > >> interactive medium?". What discussion would we even want on there,
> if
> > we
> > > >> can't have decisions and don't want usage/support?
> > > >>
> > > >> On Mon, Oct 22, 2018 at 12:44 PM Casey Stella 
> > > wrote:
> > > >>
> > > >> > I am of 2 minds, but I tend to agree. On the one hand, it's
> > definitely
> > > >> the
> > > >> > preference that we use the mailing lists for the reasons you
> stated
> > > (and
> > > >> > also because not everyone has access to slack generally). On the
> > other
> > > >> > hand, I think an interactive medium like Slack has a lot of
> > advantages
> > > >> in
> > > >> > terms of user satisfaction. Ultimately, though, we may satisfy 1
> > user
> > > >> at
> > > >> > the cost of not persisting the discussion and satisfying many
> users.
> > > >> >
> > > >> > I'll go along with a specific preference to drive more discussion
> to
> > > the
> > > >> > mailing list.
> > > >> >
> > > >> > Casey
> > > >> >
> > > >> > On Mon, Oct 22, 2018 at 12:18 PM Nick Allen 
> > > wrote:
> > > >> >
> > > >> > > It seems that we are seeing a lot of Metron usage and support
> > > >> questions
> > > >> > on
> > > >> > > the Slack Channel.
> > > >> > > These are questions that previously would have been directed to
> > the
> > > >> User
> > > >> > or
> > > >> > > Dev mailing lists. Since this is occurring in the Slack Channel,
> > the
> > > >> > > conversations are not archived.
> > > >> > >
> > > >> > > In my opinion, this is not good for the Metron community. Having
> > > this
> > > >> > > persisted in a discoverable form (like a mailing list archive)
> not
> > > >> only
> > > >> > > helps support current users, but also helps *potential* users
> > > >> understand
> > > >> > > how Metron is being used.
> > > >> > >
> > > >> > > Does anyone else agree or disagree? At a minimum, I feel we need
> > to
> > > >> do
> > > >> > > something to direct these conversations back to the mailing
> list.
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>


-- 
A.Nazemian


Re: HCP in Cloud infrastructures such as AWS , GCP, AZURE

2018-10-23 Thread Ali Nazemian
Depending on your security model, you may have some challenges integrating
Ranger with your cloud storage, especially if you are thinking of using TDE
for encryption at rest. Otherwise, using Metron in that way should be quite
feasible. You may face some performance issues depending on the required
SLA, but the cost savings will most probably convince you to decouple
storage from compute.

Cheers,
Ali

On Tue, Oct 23, 2018 at 2:57 AM deepak kumar  wrote:

> Thanks Carolyn.
> Is there any defined reference architecture to refer to?
>
> Thanks
> Deepak
>
> On Mon, Oct 22, 2018 at 8:23 PM Carolyn Duby 
> wrote:
>
> >
> > Hive 3.0 works well with block stores.  You can either add it to your
> > Metron cluster or spin up an ephemeral cluster with Cloudbreak:
> >
> > 1. Metron streams into HDFS in JSON.
> > 2. Compact daily with Spark into ORC format and store in block store (S3,
> > ADLS, etc).
> > 3. Query ORC in block store using external Hive 3.0 tables in HDP 3 using
> > LLAP.
> > 4. If querying externally from block store is too slow, try adding more
> > LLAP cache or load data into HDFS prior to analysis.
> >
> > If you are using the Metron Alerts UI, you will need solr which works
> well
> > only on fast disk.   To keep costs down, reduce the context stored in
> Solr
> > using the following techniques:
> > 1. Only index the fields you might search on.
> > 2. Reduce the formats you store in Solr to only those you will want to
> see
> > in the Alerts UI.
> > 3. Reduce the length of time you store data in Solr.
> >
> > Thanks
> > Carolyn Duby
> > Solutions Engineer, Northeast
> > cd...@hortonworks.com
> > +1.508.965.0584
> >
> > Join my team!
> > Enterprise Account Manager – Boston - http://grnh.se/wepchv1
> > Solutions Engineer – Boston - http://grnh.se/8gbxy41
> > Need Answers? Try https://community.hortonworks.com <
> > https://community.hortonworks.com/answers/index.html>
> >
> >
> >
> >
> >
> >
> >
> >
> > On 10/19/18, 7:18 AM, "deepak kumar"  wrote:
> >
> > >Hi All
> > >I have a quick question around HCP deployments in cloud infra such as
> AWS.
> > >I am planning to run persistent cluster for all event streaming and
> > >processing.
> > >And then run transient cluster such as AWS EMR to run batch loads on the
> > >data ingested from persistent cluster.
> > >Have anyone tried this model ?
> > >Since data volume is going to be humongous ,cloud is charging lot of
> money
> > >for data io and storage.
> > >Keeping this in mind , what could be the best cloud deployment of hcp
> > >components assuming there is going to be ingest rate of 10TB per day .
> > >
> > >Thanks in advance.
> > >
> > >
> > >Regards,
> > >Deepak
> >
>


-- 
A.Nazemian


Re: [DISCUSS] Internal Metron fields

2018-09-12 Thread Ali Nazemian
Totally agree with replacing the dot with something else. We have had a lot
of trouble using either dots or colons with ORC, whether via Hive or Spark.
Although we have replaced it with an underscore, that may not be a good
choice either, as it can be confused with the underscores already used in
internal field names.

Cheers,
Ali

On Wed, Sep 12, 2018 at 8:18 AM James Sirota  wrote:

> I propose that we just disallow having dots in the field name.  Dots seem
> to have a special meaning and as we keep adding data stores we may run into
> some unintended behavior.  We should have logic in our code to check for it
> and either auto-correct it (replace with underscores?) or at least throw an
> error or a warning.
>
> Thanks,
> James
>
> 07.09.2018, 16:33, "Ryan Merriman" :
> > Internal means it’s not configurable, doesn’t contain our default
> separator (dots) and is namespaced with metron. We can definitely improve
> on DRY but there’s more to it than that. For example, having 2 different
> versions of this field name (ES and Solr) adds a significant amount of
> complexity for no real benefit.
> >
> >>  On Sep 7, 2018, at 5:12 PM, Michael Miklavcic <
> michael.miklav...@gmail.com> wrote:
> >>
> >>  Can you elaborate on what you mean by "convert to internal?" From your
> >>  description, it looks like the challenge is from our violations of DRY
> when
> >>  it comes to constants referencing those keys, which would be
> eliminated by
> >>  refactoring.
> >>
> >>>  On Fri, Sep 7, 2018, 3:50 PM Ryan Merriman 
> wrote:
> >>>
> >>>  I recently worked on a PR that involved changing the default behavior
> of
> >>>  the ElasticsearchWriter to store data using field names with the
> default
> >>>  Metron separator, dots. One of the unfortunate consequences of this is
> >>>  that although dots are allowed in more recent versions of ES, it
> changes
> >>>  how these fields are stored. Having a dot in a field name causes ES to
> >>>  treat it as an object field type. We're not quite comfortable with
> this
> >>>  because it could introduce unforeseen side effects that may not be
> >>>  obvious. Here's the PR: https://github.com/apache/metron/pull/1181
> >>>
> >>>  As I worked through it I noticed there are a couple fields that
> include
> >>>  separators where it's not actually necessary. They are not nested by
> >>>  nature and are internal to Metron. The fact that they are internal
> means
> >>>  they show up in constants and are hardcoded in several different
> places.
> >>>  That made the work in the PR above much harder and tedious than it
> should
> >>>  have been. There are 2 in particular that I had to deal with:
> source:type
> >>>  and threat:triage:score in metaalerts.
> >>>
> >>>  Is it worth considering converting these to internal Metron fields so
> that
> >>>  they stay constant and this isn't a problem in the future? I could see
> >>>  these fields following the same pattern as 'metron_alert'. However
> this
> >>>  would cause pain when upgrading because existing data would need to be
> >>>  updated with these new fields.
> >>>
> >>>  Just an idea. Curious if other have an opinion on the subject.
>
> ---
> Thank you,
>
> James Sirota
> PMC- Apache Metron
> jsirota AT apache DOT org
>
>
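The Elasticsearch behavior discussed in this thread — a dot in a field name causing ES to treat it as an object field — can be illustrated with a small standalone sketch. This is not Metron or ES code; it just shows how a flat dotted key such as "source:type" written with dots ("source.type") effectively becomes a nested object:

```java
// Illustration only: a dotted field name like "source.type" is interpreted
// as a nested object ({"source": {"type": ...}}) rather than one flat field.
import java.util.HashMap;
import java.util.Map;

public class DottedFieldDemo {

    @SuppressWarnings("unchecked")
    static Map<String, Object> nest(Map<String, Object> flat) {
        Map<String, Object> root = new HashMap<>();
        for (Map.Entry<String, Object> e : flat.entrySet()) {
            String[] parts = e.getKey().split("\\.");
            Map<String, Object> current = root;
            // Walk/create one nested map per dot-separated segment.
            for (int i = 0; i < parts.length - 1; i++) {
                current = (Map<String, Object>) current
                        .computeIfAbsent(parts[i], k -> new HashMap<String, Object>());
            }
            current.put(parts[parts.length - 1], e.getValue());
        }
        return root;
    }

    public static void main(String[] args) {
        Map<String, Object> flat = new HashMap<>();
        flat.put("source.type", "bro");
        flat.put("ip_src_addr", "10.0.0.1");
        // "source.type" ends up as a nested object: source={type=bro},
        // while "ip_src_addr" stays a single flat field.
        System.out.println(nest(flat));
    }
}
```

This is why a dotted internal field name that is hardcoded in several places (source:type, threat:triage:score) becomes painful: every store that assigns structural meaning to the separator can silently change how the field is stored.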

-- 
A.Nazemian


Re: [ANNOUNCE] - Apache Metron Slack channel

2018-08-26 Thread Ali Nazemian
Can I be invited as well?

On Thu, Aug 16, 2018 at 4:37 AM Otto Fowler  wrote:

> Done
>
>
> On August 15, 2018 at 14:22:45, Vets, Laurens (laur...@daemon.be) wrote:
>
> Could I be invited?
>
> On 15-Aug-18 09:48, Michael Miklavcic wrote:
> > + Metron user list
> >
> > On Wed, Aug 15, 2018 at 10:38 AM Michael Miklavcic <
> > michael.miklav...@gmail.com> wrote:
> >
> >> Turns out we are able to invite folks on an ad-hoc basis. See
> instructions
> >> here -
> >> https://cwiki.apache.org/confluence/display/METRON/Community+Resources
> >>
> >>
> >> On Wed, Aug 15, 2018 at 9:23 AM Michael Miklavcic <
> >> michael.miklav...@gmail.com> wrote:
> >>
> >>> It's another option with different features. I imagine many people will
> >>> use both.
> >>>
> >>> On Wed, Aug 15, 2018, 9:14 AM Simon Elliston Ball <
> >>> si...@simonellistonball.com> wrote:
> >>>
>  Since this is committers only, would it make more sense to stick to
> IRC?
>  Or
>  is exclusivity the idea?
> 
>  On 15 August 2018 at 16:09, Nick Allen  wrote:
> 
> > Thanks for the instructions!
> >
> > On Wed, Aug 15, 2018 at 10:22 AM, Michael Miklavcic <
> > michael.miklav...@gmail.com> wrote:
> >
> >> The Metron community has a Slack channel available for communication
> >> (similar to the existing IRC channel, only on Slack).
> >>
> >> To join:
> >>
> >> 1. Go to slack.com.
> >> 2. For organization/group, you'll enter "the-asf"
> >> 3. Use your Apache email for your login
> >> 4. Click "Channels" and look for #metron (Created by ottO June 15,
> > 2018)
> >> Best
> >> Mike Miklavcic
> >>
> 
> 
>  --
>  --
>  simon elliston ball
>  @sireb
> 
>


-- 
A.Nazemian


Re: [DISCUSS] Getting to a 1.0 release

2018-08-26 Thread Ali Nazemian
One thing we could imagine for v1.0 is the ability to extend Metron by
adding more pipelines to it — for example, integrating with other endpoints
more easily from the Storm side. What if we wanted to create additional
topologies that write files directly in ORC rather than plain HDFS, or
index to Druid? What if we wanted to build automated security response and
move toward SOAR? All of these integrations can be done even now, and other
users may have done them already. However, providing a clear extension
point would make it easier to contribute other pipelines back to the
community. By adding this level of extensibility, I think the Metron
community will grow much faster through the additional integrations that
become available.

Cheers,
Ali

On Mon, Aug 20, 2018 at 10:50 PM Casey Stella  wrote:

> I completely agree, Mike.  Our docs are either very high level or very low
> level (and possibly stale) and, worse, aren't aimed at the actors that
> you've stated.
> I think that the HBase project does a good job of providing coherent and
> useable documentation in their "HBase Book" (see
> https://hbase.apache.org/book.html).
> It's not actor-specific, but it is coherent advice for the practical
> practitioner of HBase (both admin and developer) and speaks with one
> voice.  I think Metron's need
> is a bit different, but at the minimum some coherent docs that speaks with
> one voice and has a coherent pitch about what Metron is used for and what
> it isn't used for
> is well needed.
>
> On Sat, Aug 18, 2018 at 1:00 PM Michael Miklavcic <
> michael.miklav...@gmail.com> wrote:
>
> > Apologies for any spelling mishaps as I'm writing from my phone.
> >
> > I'm for improving our docs. I'd like to see us guide our various profiles
> > of user towards the specific documentation for the abstraction levels
> > they'll be most interested in working from. I think we should have
> platform
> > docs about how we're a broadly useful, extensible streaming analytics
> > platform for cyber security as well as docs that emphasize more narrow
> and
> > specific use cases.
> >
> > Personally, I think I see 3 potential tiers or classifications of docs.
> > These are just observations and ideas I had, not necessarily a
> prescription
> > for organizing docs:
> > - Low level tool instructions, eg
> > - how do I run the pcap toplogy and then query with the CLI and UI?
> > - Platform docs about building on top of Metron, e.g.
> > - writing custom parsers, enrichment, and threat Intel (imho we
> should
> > start to take a more opinionated view of leveraging Stellar as this
> > extension point rather than implementing new parser classes in Java)
> > - using the profiler for constructing outlier analysis use cases
> > - using MAAS for building and deploying models for use in enrichment
> > - Docs around more specific use cases that solve specific as opposed to
> > more general problems, similar to those we have in the use-cases folder.
> >
> > I think one of our challenges currently is that our docs could be better
> > tailored to the "actors" we've talked about in the past. An individual
> SOC
> > analyst will have a very different set of interests than would a reseller
> > that wants to build on top of our platform to expose new modules and
> > functionality to those SOC analyst. And we can, and do, currently support
> > both.
> >
> >
> > On Sat, Aug 18, 2018, 9:34 AM Nick Allen  wrote:
> >
> > > Yes, I imagine just a separate top level directory which would contain
> > the
> > > docs.
> > >
> > > We would need someone to survey what doc tools are out there and
> provide
> > > some advice.
> > >
> > > Maybe we could look around at other open source projects that have done
> > > their docs well and emulate them.
> > >
> > > On Sat, Aug 18, 2018, 10:57 AM Kyle Richardson <
> > kylerichards...@gmail.com>
> > > wrote:
> > >
> > > > +1 to separating developer docs and user docs. How should we approach
> > > that.
> > > > Have a separate doc book? I haven’t had a ton of time to contribute
> to
> > > code
> > > > lately but I’d be happy to help write some of these.
> > > >
> > > > On Sat, Aug 18, 2018 at 9:48 AM Nick Allen 
> wrote:
> > > >
> > > > > Personally, I think the state of our docs and web presence is an
> > > > inhibitor
> > > > > to growing the Metron community.  Unless we can offer concise,
> > > compelling
> > > > > answers to the basic questions (What can I do with Metron?  Who
> does
> > it
> > > > > help? How do I do that?), potential users and contributors are
> unable
> > > to
> > > > > see the value of Metron.
> > > > >
> > > > >
> > > > >
> > > > > On Sat, Aug 18, 2018 at 9:42 AM, Nick Allen 
> > > wrote:
> > > > >
> > > > > > I'd like to see us focus on improving our docs before a version
> > 1.0.
> > > > > > Right now we just stitch together a bunch of READMEs, which is a
> > > great
> > > > > > stride from where we started, but is not ideal.
> > > > 

Re: Change field separator in Metron to make it Hive and ORC friendly

2018-08-15 Thread Ali Nazemian
Hi Simon,

I think it is a hard trade-off. Even now, without any ability to customise
the separator or Metron's internal field names, Metron users need to put a
mapping in place at the integration layer (at least, that is what we are
doing :) ). Every organisation/user may need to follow different policies
for different reasons, not to mention specific technology limitations
(e.g. Hive). The question is whether we consider Elasticsearch/Solr and
HDFS (as data storage) coupled with Metron or not. Metron components can
freely use a Metron-specific data model, but for the data model at rest it
would be better to decouple from the Metron data model, to make integration
with other tools more flexible — which means that wherever the data model
touches storage at rest, a mapping layer would be required. Certainly, that
doesn't mean every Metron user must provide a mapping; we can, but we don't
have to. It simply becomes more flexible for integration to have a
consistent data model across the endpoints (Elasticsearch/Solr and HDFS).
The problem we are facing is that, in addition to a separate mapping for
Elasticsearch, we have to put a different mapping in place for ORC as well.
If it were at least consistent across Elasticsearch and HDFS, we could have
a single mapping for an application that consumes from both. Therefore, if
we exclude the data model in transit, a mapping at Metron REST (to serve
the Alerts UI) and a mapping at Metron indexing (ES/Solr and HDFS) would be
sufficient. Even now, by changing the separator at index time, we are doing
the same thing — we are not changing the data model in transit.

Cheers,
Ali



On Tue, Aug 14, 2018 at 9:11 PM Simon Elliston Ball <
si...@simonellistonball.com> wrote:

> The challenge with making it configurable is that every query, every
> profile, every analytic, template, pre-installed dashboard and use case
> built by any third party who wanted to extend metron would have to honour
> the configuration and paramaterize every query they run. My worry is that
> that would render some engines totally incompatible with many installs (as
> opposed to just needing an escape character as you would with hive now) and
> would prevent a lot of tools participating in the metron eco-system.
>
> I think this is something where we need to make a good decision and stick
> to it to allow the ecosystem to build on a known foundation.
>
> Dots are not great because hive uses them to separate, underscore collides
> with our existing  convention, and hyphen collides with a number of other
> common log formats, so it’s not an easy one to have an opinion on, but I do
> think we should have an opinion rather than forcing every user to make the
> hard choice to exclude others from sharing.
>
> Perhaps the flat key value structure is the real question here, and given
> progress in the underlying index engines may not be the panacea it once was.
>
> Simon
>
> Sent from my iPhone
>
> > On 14 Aug 2018, at 11:42, deepak kumar  wrote:
> >
> > I agree Ali.
> > May be it can be configuration parameter.
> >
> >> On Tue, Aug 14, 2018 at 3:24 PM Ali Nazemian 
> wrote:
> >>
> >> Hi Simon,
> >>
> >> We have temporarily decided to just change it with "_" for HDFS to avoid
> >> all the headaches of the bugs and issues that can be raised by using
> >> unsupported separators for ORC/Hive and Spark. However, I am not quite
> >> confident with "_" as an option for the community as it becomes similar
> to
> >> normal Metron separator. Maybe it would be nice to have an ability to
> >> change the separator to any other character and let users decide what
> they
> >> want to use.
> >>
> >> Cheers,
> >> Ali
> >>
> >> On Tue, Aug 14, 2018 at 12:14 AM Simon Elliston Ball <
> >> si...@simonellistonball.com> wrote:
> >>
> >>> Do you have any suggestions for what would make sense as a delimiter?
> >>>
> >>>> On 9 August 2018 at 05:57, Ali Nazemian 
> wrote:
> >>>>
> >>>> Hi All,
> >>>>
> >>>> I was wondering if we can change the field separators in Metron to be
> >>> able
> >>>> to make it Hive/ORC friendly. I could find the following PR, but
> >> neither
> >>>> dot nor colon is very Hive and ORC friendly and they will cause some
> >>>> issues. Hence, I wanted to see if it is possible to change the field
> >>>> separator to something else or even give users an ability to define
> >> what
> >>>> separator to be used to make the data model consistent across
> >>> Elasticsearch
> >>>> and HDFS.
> >>>>
> >>>> https://github.com/apache/metron/pull/1022
> >>>>
> >>>> Cheers,
> >>>> Ali
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> --
> >>> simon elliston ball
> >>> @sireb
> >>>
> >>
> >>
> >> --
> >> A.Nazemian
> >>
>
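As a concrete illustration of the mapping discussed in this thread, here is a hypothetical sketch (not Metron code; the class and method names are invented) of rewriting Metron's dot/colon-separated field names into Hive/ORC-safe names with an underscore separator — including the collision concern with fields that already contain underscores:

```java
// Hypothetical sketch of the separator mapping discussed in this thread.
public class HiveSafeFieldNames {

    // Replace Metron's '.' and ':' separators with '_' so the field name is
    // legal as a Hive/ORC column name without escaping.
    static String toHiveSafe(String metronField) {
        return metronField.replace('.', '_').replace(':', '_');
    }

    public static void main(String[] args) {
        System.out.println(toHiveSafe("source.type"));         // source_type
        System.out.println(toHiveSafe("threat:triage:score")); // threat_triage_score
        // The ambiguity raised above: after mapping, this collides with a
        // field name that already used underscores natively.
        System.out.println(toHiveSafe("source_type"));         // source_type (collision!)
    }
}
```

Because the mapping is lossy ("source.type" and "source_type" converge on the same column name), a real implementation would need either a reversible encoding or a per-deployment mapping dictionary — part of why choosing or configuring the separator is such a hard trade-off.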


-- 
A.Nazemian


Re: Change field separator in Metron to make it Hive and ORC friendly

2018-08-14 Thread Ali Nazemian
Hi Simon,

We have temporarily decided to just replace it with "_" for HDFS, to avoid
all the headaches from the bugs and issues that can arise from using
unsupported separators with ORC/Hive and Spark. However, I am not quite
confident in "_" as an option for the community, as it looks similar to the
normal Metron separator. Maybe it would be nice to be able to change the
separator to any other character and let users decide what they want to
use.

Cheers,
Ali

On Tue, Aug 14, 2018 at 12:14 AM Simon Elliston Ball <
si...@simonellistonball.com> wrote:

> Do you have any suggestions for what would make sense as a delimiter?
>
> On 9 August 2018 at 05:57, Ali Nazemian  wrote:
>
> > Hi All,
> >
> > I was wondering if we can change the field separators in Metron to be
> able
> > to make it Hive/ORC friendly. I could find the following PR, but neither
> > dot nor colon is very Hive and ORC friendly and they will cause some
> > issues. Hence, I wanted to see if it is possible to change the field
> > separator to something else or even give users an ability to define what
> > separator to be used to make the data model consistent across
> Elasticsearch
> > and HDFS.
> >
> > https://github.com/apache/metron/pull/1022
> >
> > Cheers,
> > Ali
> >
>
>
>
> --
> --
> simon elliston ball
> @sireb
>


-- 
A.Nazemian


Change field separator in Metron to make it Hive and ORC friendly

2018-08-08 Thread Ali Nazemian
Hi All,

I was wondering if we can change the field separators in Metron to make
them Hive/ORC friendly. I found the following PR, but neither dot nor
colon is Hive/ORC friendly, and they will cause some issues. Hence, I
wanted to see if it is possible to change the field separator to something
else, or even give users the ability to define which separator is used, to
keep the data model consistent across Elasticsearch and HDFS.

https://github.com/apache/metron/pull/1022

Cheers,
Ali


Re: Using Java Rest Client instead of Transport Client for Elasticsearch

2018-07-01 Thread Ali Nazemian
It looks like it's possible to use X-Pack credentials with the REST clients.

https://www.elastic.co/guide/en/elasticsearch/client/java-rest/master/java-rest-high-getting-started-initialization.html
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/master/java-rest-low-usage-initialization.html



For example:

final CredentialsProvider credentialsProvider = new BasicCredentialsProvider();
credentialsProvider.setCredentials(AuthScope.ANY,
    new UsernamePasswordCredentials("user", "password"));

RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200))
    .setHttpClientConfigCallback(new RestClientBuilder.HttpClientConfigCallback() {
        @Override
        public HttpAsyncClientBuilder customizeHttpClient(
                HttpAsyncClientBuilder httpClientBuilder) {
            return httpClientBuilder.setDefaultCredentialsProvider(credentialsProvider);
        }
    });

RestHighLevelClient client = new RestHighLevelClient(builder);


Cheers,
Ali

On Thu, Jun 14, 2018 at 2:28 PM Ali Nazemian  wrote:

> Hi Michael and Casey,
>
> It looks like ES believe Java Rest Client is mature enough to be pushed to
> different products at this stage. However, I haven't used it personally. I
> will share the question regarding x-pack support with Elasticsearch
> engineers to see if there is any issue regarding that.
>
> Cheers,
> Ali
>
> On Thu, Jun 14, 2018 at 4:26 AM, Michael Miklavcic <
> michael.miklav...@gmail.com> wrote:
>
>> I think there is some level of support for auth via their REST api, but I
>> don't see anything specific to X-Pack as you mentioned. However, the major
>> reason we did not adopt it at the time of upgrade was because a number of
>> features were not available to REST yet and the effort to simultaneously
>> upgrade ES and migrate the API to REST was an effort decidedly too large
>> for the scope of the PR at the time.
>>
>>
>>
>> On Wed, Jun 13, 2018 at 8:39 AM, Casey Stella  wrote:
>>
>> > It was my understanding was that ES x-pack only supports the transport
>> > client (e.g.
>> > https://www.elastic.co/guide/en/x-pack/current/java-clients.html).  I
>> > think
>> > that was a major reason why we chose to go that route.  I might be wrong
>> > though.
>> >
>> > On Wed, Jun 13, 2018 at 10:30 AM Ali Nazemian 
>> > wrote:
>> >
>> > > Hi All,
>> > >
>> > >
>> > > I have noticed that the recommendation from Elasticsearch team is
>> changed
>> > > to use Java Rest Client instead of Transport one. The rationale
>> behind it
>> > > looks convincing and it can also help Metron to be more decoupled from
>> > > Elasticsearch roadmap, so Metron users can upgrade Elasticsearch with
>> > > minimum dependency to Metron support.
>> > >
>> > >
>> > > https://www.elastic.co/blog/state-of-the-official-
>> > elasticsearch-java-clients
>> > >
>> > > P.S: Transport client will be deprecated in ES 7 and will be removed
>> > > completely on 8.
>> > >
>> > >
>> > > Regards,
>> > > Ali
>> > >
>> >
>>
>
>
>
> --
> A.Nazemian
>


-- 
A.Nazemian


Re: Using Java Rest Client instead of Transport Client for Elasticsearch

2018-06-13 Thread Ali Nazemian
Hi Michael and Casey,

It looks like ES believes the Java REST Client is mature enough to be pushed
into different products at this stage. However, I haven't used it personally.
I will raise the question about X-Pack support with the Elasticsearch
engineers to see if there are any issues with that.

Cheers,
Ali

On Thu, Jun 14, 2018 at 4:26 AM, Michael Miklavcic <
michael.miklav...@gmail.com> wrote:

> I think there is some level of support for auth via their REST api, but I
> don't see anything specific to X-Pack as you mentioned. However, the major
> reason we did not adopt it at the time of upgrade was because a number of
> features were not available to REST yet and the effort to simultaneously
> upgrade ES and migrate the API to REST was an effort decidedly too large
> for the scope of the PR at the time.
>
>
>
> On Wed, Jun 13, 2018 at 8:39 AM, Casey Stella  wrote:
>
> > It was my understanding was that ES x-pack only supports the transport
> > client (e.g.
> > https://www.elastic.co/guide/en/x-pack/current/java-clients.html).  I
> > think
> > that was a major reason why we chose to go that route.  I might be wrong
> > though.
> >
> > On Wed, Jun 13, 2018 at 10:30 AM Ali Nazemian 
> > wrote:
> >
> > > Hi All,
> > >
> > >
> > > I have noticed that the recommendation from Elasticsearch team is
> changed
> > > to use Java Rest Client instead of Transport one. The rationale behind
> it
> > > looks convincing and it can also help Metron to be more decoupled from
> > > Elasticsearch roadmap, so Metron users can upgrade Elasticsearch with
> > > minimum dependency to Metron support.
> > >
> > >
> > > https://www.elastic.co/blog/state-of-the-official-
> > elasticsearch-java-clients
> > >
> > > P.S: Transport client will be deprecated in ES 7 and will be removed
> > > completely on 8.
> > >
> > >
> > > Regards,
> > > Ali
> > >
> >
>



-- 
A.Nazemian


Using Java Rest Client instead of Transport Client for Elasticsearch

2018-06-13 Thread Ali Nazemian
Hi All,


I have noticed that the recommendation from the Elasticsearch team has changed
to using the Java REST Client instead of the Transport one. The rationale
behind it looks convincing, and it can also help Metron become more decoupled
from the Elasticsearch roadmap, so Metron users can upgrade Elasticsearch with
minimal dependency on Metron support.

https://www.elastic.co/blog/state-of-the-official-elasticsearch-java-clients

P.S: The Transport client will be deprecated in ES 7 and removed
completely in 8.


Regards,
Ali


Re: Streaming Machine Learning use case

2018-05-09 Thread Ali Nazemian
Hi Simon,

That's correct, Apache SAMOA. Not any specific algorithm at this stage.
Just the idea of being able to use streaming supervised learning without
worrying about the training cycle is interesting to me. The fact that it is
close to Metron from a technology perspective made me wonder whether
anyone uses it.

Cheers,
Ali


On Tue, 8 May 2018, 22:56 Simon Elliston Ball, 
wrote:

> Do you mean Apache SAMOA? I'm not sure of the status of that project, and
> it doesn't look particularly lively (last real activity on the lists was 2
> months ago, last commits, 7 months ago).
>
> That said, there seem to be some interesting algorithms implemented in
> there. The VHT algorithm and the clustering may be relevant, though we have
> other efficient means of streaming clustering already in Metron. I would
> also argue that we'd be better off looking at algorithms in Spark for
> things like frequent pattern mining, though there the FP growth algorithm
> is of course primarily a batch implementation.
>
> Are there any SAMOA algorithms in particular that you think would be
> relevant to Metron use cases?
>
> Simon
>
>
> On 8 May 2018 at 07:29, Ali Nazemian  wrote:
>
> > Hi all,
> >
> > I was wondering if someone has used Metron with any streaming ML
> framework
> > such as SAMOA? I know that Metron provides Machine Learning separately
> via
> > MAAS. However, it is hard to manage it from operational perspective
> > especially if we want to have a pretty dynamic and evolving model. SAMOA
> > seems to be a very slow project (or maybe even dead). However, it looks
> > very close from the integration point of view with Metron, so I wanted to
> > see if anyone had tried SAMOA in practice and especially with Metron use
> > cases.
> >
> > Regards,
> > Ali
> >
>
>
>
> --
> --
> simon elliston ball
> @sireb
>


Streaming Machine Learning use case

2018-05-07 Thread Ali Nazemian
Hi all,

I was wondering if anyone has used Metron with a streaming ML framework
such as SAMOA? I know that Metron provides machine learning separately via
MAAS. However, it is hard to manage from an operational perspective,
especially if we want a fairly dynamic and evolving model. SAMOA seems to
be a very slow project (or maybe even dead). However, it looks very close
to Metron from an integration point of view, so I wanted to see if anyone
has tried SAMOA in practice, especially with Metron use cases.

Regards,
Ali


Re: [DISCUSS] Generic Syslog Parsing capability for parsers

2018-03-26 Thread Ali Nazemian
Just adding more details regarding what the different parts are:

There are three stages here that need to be understood:
1- pre-parsing
2- chain of parsing (wrapping one type of message in another format)
3- post-parsing, aka normalization

The pre-parsing stage is where we need to identify which specific log format
we have received. Sometimes we receive logs aggregated, and we cannot
segregate feeds without checking the format of the logs. Currently, we have
addressed this by consuming the message in multiple parsers, which means we
are wasting compute.

The chain of parsers is fairly clear, so I won't go into the details.

Post-parsing is where we need to normalize different formats to a single data
model based on different criteria (e.g. tenant).

For example, we may receive Syslog and WEF (Windows Event Format)
aggregated. First, we want to specify which parser should consume WEF
and which one consumes Syslog. Then, in the WEF parser we have DHCP, DNS,
application logs, etc. We need to send them to the next layer to assign the
right data model, and at the end we need to normalize to a single format
based on some criteria (e.g. tenant name).
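
As a sketch of the pre-parsing stage described above, a hypothetical router could inspect each raw line and decide which downstream parser should consume it, instead of having every parser consume the whole aggregated feed. The detection rules below are assumptions for illustration, not Metron's actual logic:

```java
public class LogFormatRouter {

    // Hypothetical pre-parser: look at a raw log line and return the
    // name of the parser that should consume it. A real implementation
    // would route to a per-format Kafka topic instead of wasting compute
    // by sending everything to every parser.
    public static String route(String rawLog) {
        if (rawLog.matches("^<\\d{1,3}>.*")) {
            return "syslog";   // RFC 3164/5424 priority header, e.g. "<34>..."
        } else if (rawLog.startsWith("{") || rawLog.contains("EventID")) {
            return "wef";      // assumed marker for a Windows event payload
        }
        return "unknown";
    }
}
```

The same idea extends to a second level of chaining: the "wef" branch could itself dispatch to DHCP, DNS or application-log parsers before normalization.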

Regards,
Ali

On Wed, Mar 21, 2018 at 9:49 AM, zeo...@gmail.com  wrote:

> So I've kept my ear to the ground regarding this topic for a while now, and
> had some conversations a year or so ago about the idea as well.  At the
> very least, I think having the concept of a pre-parser is a good one, if
> not chaining an arbitrary number of parsers together.  I see this as an
> important way to reduce the complexity of implementing new parsers and
> getting more community involvement/contributions.
>
> Syslog headers are a solid use case to start with because a lot of
> implementations fail to properly implement it on the sending side, at least
> in the real world scenarios that I've seen.  Having a way to extend the
> parser to easily handle incorrect implementations of syslog would be great,
> but anything that can pre-parse or trim the syslog headers to make parsing
> further along in the pipeline more simple would help.
>
> Another idea that would be attractive would be the ability to do
> opportunistic parsing given an ordered list of parsers and some criteria
> for successful parsing (which I admittedly am not sure how to solve) which
> (at least in my mind) would require similar logic to parser chaining.  In
> some highly decentralized organizations this would be helpful as it takes
> the configuration effort off of the team sending the logs (and thus makes
> them more willing to send logs _at all_) and pushes it onto the team
> parsing and/or storing them.
>
> I'm not suggesting we attempt to crack that second nut here, I would love
> to see that use case in mind during discussions.
>
> TL;DR:  +1
>
> Jon
>
> On Tue, Mar 20, 2018 at 6:14 PM Otto Fowler 
> wrote:
>
> > I think the chaining of parsers, or ability to compose parsers is a good
> > idea, but with reference to the pr mentioned, I would have some number of
> > StellarChainLinks as opposed re-implementing stellar in chainlinks.
> > Although it is NiFi-y.  But since I write Processors too, that is fine.
> >
> >
> > On March 20, 2018 at 18:05:12, Simon Elliston Ball (
> > si...@simonellistonball.com) wrote:
> >
> > It seems like parser chaining is becomes a hot topic on the repo too with
> > https://github.com/apache/metron/pull/969#partial-pull-merging <
> > https://github.com/apache/metron/pull/969#partial-pull-merging>
> >
> > I would like to discuss the option, and how we might architect, of
> > configuring parsers to operate on the output of parsers. This may also
> give
> > us the opportunity to be more efficient in scenarios where people have
> > large numbers of sources, and so use up a lot of slots for lower volume
> > parsers for example.
> >
> > I have a bunch of ideas around this, but am more keen to hear what
> everyone
> > else thinks at this stage. How should we go about fixing parser config so
> > that it’s clearer (removing the need for people to reinvent the parser
> > wheel as we’ve seen in a few places) and also more concise and powerful
> > (consolidating the parsing of transports such as syslog and content such
> as
> > application logs, or types of device logs).
> >
> > If this can lead to a more efficient way of handling both the syslog
> > problem, and the kind of problem that leads to switching between grok
> > statements in something like our ASA parser then all the better. I
> suspect
> > that there might also be a case for multi-level chaining here too, since
> > some things are embedded in multiple transports, or might have complex
> > fields that want ‘sub-parsing’.
> >
> > Of course one of the key values of Metron is its speed, so maybe
> > formalising some of the microbenchmarking approaches a few of us have
> been
> > working on might help here too. I’ve got a few bits of micro-benching
> > infrastructure around CEF and ASA, and I believe there’s also been some
> > work to load and perf test things l

Re: ES mpack to include more ES 5 stack properties

2018-02-21 Thread Ali Nazemian
Hi Otto,

A Jira ticket is created to address that.

https://issues.apache.org/jira/browse/METRON-1459

Cheers,
Ali

On Tue, Feb 20, 2018 at 1:10 AM, Otto Fowler 
wrote:

> I don’t think there are right now.  I would recommend entering jira issues
> for what you haven in mind
>
>
> On February 19, 2018 at 01:02:32, Ali Nazemian (alinazem...@gmail.com)
> wrote:
>
> Hi All,
>
> Is there any plan to include more ES 5+ specific properties to
> Metron mpack? For example, if we want to use dedicated nodes for Master
> Nodes, Data Nodes, Ingestion Nodes and ML Nodes and different
> configurations for them, how can we proceed? It may be out of the scope of
> the current mpack, so I just wanted to understand whether it is a right
> expectation or we are moving to spend more on Solr as an indexer.
>
> Cheers,
> Ali
>
>


-- 
A.Nazemian


Re: [DISCUSS] community view/roadmap of threat intel

2018-02-21 Thread Ali Nazemian
ith something like a StixProcessor (which I personally think
> > should be a StixRecordReader in new NiFi btw) or a whatever parser,
> > fetcher, tailer etc.
> >
> > Btw, I’ve also got early stage implementations of things like Stellar in
> > NiFi which would be the starting point for building something like that.
> >
> > To address the bulk vs incremental side, we could use the same mechanism
> > to handle both, but that would very much suggest moving to the record
> > reader based apis. That should be fine at the O(100s gigabytes) scale in
> > NiFi. Does anyone have any use cases that would still seem like they’d be
> > in the terabytes / existing bulk map reduce approach end?
> >
> > Simon
> >
> >
> > > On 19 Feb 2018, at 14:26, Otto Fowler  wrote:
> > >
> > > There are a couple of use cases here for getting the data.
> > >
> > > When you _can_ or want to ingest and duplicate the external store
> > >
> > > 1. Bulk Loading ( from a clean empty state )
> > > 2. Tailing the feed afterwards
> > >
> > > When you can’t
> > >
> > > 3. Calling the api ( most likely web ) for reputation or some other
> > thing
> > >
> > >
> > > Right now, I *think* we’d use our bulk loader for 1. I am not sure it
> > can
> > > be configured for 2.
> > > NiFi *could* do it, if you wrote your Taxii client such that it was
> > > stateful and could resume
> > > after restarts etc and pickup from the right place.
> > >
> > > Right now, we only ingest indicators as raw data. I do not believe we
> > > support the reputation and confidence stuff.
> > > Also, the issue of which version of stix/taxii we support will need to
> > be
> > > considered.
> > >
> > > I think the idea of a ‘tailing’ topology per service where required
> > would
> > > be worth looking into, such a topology
> > > would be transform and index (with a new hbase indexer ) only with no
> > > enrichment. We also may want to explore indexing
> > > enrichments to SEARCH stores or both SEACH and BATCH.
> > >
> > > Like Simon says, there is NiFi, but I would want to consider a metron
> > > topology because this is a metron managed store,
> > > and having nifi write to metron’s indicator store, or other threat
> store
> > is
> > > wrong I think. It breaks the application boundary .
> > >
> > > You should take a look at what jiras we currently have, and we can talk
> > > about what what needs to happen, create the jiras
> > > and get it rolling.
> > >
> > > I would imagine down the like, that we would support bulk load as we
> > have
> > > now ‘out of the box’. And have a new mpack
> > > for optional threat intel flows available.
> > >
> > > ottO
> > >
> > > On February 19, 2018 at 07:47:39, Andre (andre-li...@fucs.org) wrote:
> > >
> > > Simon,
> > >
> > > I have coded but not merged a STIX / TAXII processor for NiFi that
> would
> > > work perfectly fine with this.
> > >
> > >
> > > But I will take the opportunity to touch the following points:
> > >
> > >
> > > 1. Threat Intel is more frequently than not based on API lookups (e.g.
> > > VirusTotal, RBLs and correlated, Umbrella's top million, etc). How are
> > > those going to be consistently managed?
> > >
> > > 2. Threat feeds are frequently classified in regards to confidence but
> > > today the default Metron schema seems to lack any similar concept? Do
> we
> > > have plans to address it?
> > >
> > > 3. Atemporal matching - Given the use of big data technologies it seems
> > to
> > > me Metron should be able to look into past enrichment data in order to
> > > classify traffic. I am not sure this is possible today?
> > >
> > >
> > > Cheers
> > >
> > >
> > > On Mon, Feb 19, 2018 at 8:48 PM, Simon Elliston Ball <
> > > si...@simonellistonball.com> wrote:
> > >
> > >> Would it make sense to lean on something like Apache NiFi for this? It
> > >> seems a good fit to handle getting data from wherever (web service,
> > poll,
> > >> push etc, streams etc). If we were to build a processor which
> > > encapsulated
> > >> the threat intel loader logic, that would provide a granular route to
> > >> update threat intel entries in a more st

ES mpack to include more ES 5 stack properties

2018-02-18 Thread Ali Nazemian
Hi All,

Is there any plan to include more ES 5+ specific properties in the
Metron mpack? For example, if we want to use dedicated nodes for master,
data, ingest and ML roles, each with different configurations, how can we
proceed? It may be out of the scope of the current mpack, so I just wanted
to understand whether this is a reasonable expectation, or whether we are
moving to invest more in Solr as an indexer.

Cheers,
Ali


Re: [DISCUSS] community view/roadmap of threat intel

2018-02-15 Thread Ali Nazemian
I think one of the challenges is where the scope of threat intel ends in
the Metron roadmap. Is it going to rely on supporting a standard format and
a loader that sends it to HBase for the later threat intel use cases?

In my opinion, it would be better to have a separate topology (similar to
the profiler approach) to get the feeds (maybe from Kafka) and load them
into HBase frequently, based on whatever criteria we want. Maybe we need
some normalization of the threat feeds (either aggregated or single feeds),
for example, or some other transformation using Stellar. Maybe we need to
tailor the row key in a way that can be utilised by the threat intel
lookups we want to perform later from the enrichment topology. The problem
I see with the different loaders in Metron is that we can mostly use them
for POCs; if you want to build an actual use case for a production
platform, it will be beyond the flexibility of a loader, so we will end up
feeding data into HBase based on our use case.
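
As a sketch of what tailoring the row key could look like, the layout below (a short hash-salt prefix, then type, then indicator) is an assumption for illustration, not Metron's actual key format; the salt spreads writes across HBase regions while the type + indicator suffix keeps point lookups deterministic.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ThreatIntelRowKey {

    // Build a salted row key for an indicator. The two-hex-digit salt
    // is derived from a digest of the key itself, so the same
    // (type, indicator) pair always maps to the same row.
    public static String rowKey(String indicatorType, String indicator) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(
                (indicatorType + ":" + indicator).getBytes(StandardCharsets.UTF_8));
            String salt = String.format("%02x", digest[0] & 0xff);
            return salt + ":" + indicatorType + ":" + indicator;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A lookup from the enrichment topology would recompute the same key from the event's field value and do a single HBase get.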

In this case, maybe it won't be very important whether we use aggregator X
or aggregator Y; we can integrate it with Metron based on the integration
points.

Cheers,
Ali

On Wed, Feb 14, 2018 at 11:28 PM, Simon Elliston Ball <
si...@simonellistonball.com> wrote:

> We used to install soltra edge in the old ansible builds (which have
> thankfully now been pared back in the interests of stability in full dev).
> Soltra has not been a good option since they went proprietary, so since
> then we’ve included opentaxii (BSD 3) as a discovery and aggregator.
>
> Most of the challenges are around licensing. Hippocampe is part of The
> Hive Project, which is AGPL, which is an apache category X license so can’t
> be included.
>
> Mindmeld is much better license-wise (Apache 2) so would be well worth
> community consideration. I kinda like it as a framework, but
>
> I for one would be very pleased to hear a broader community discussion
> around which platforms we should have integrations with via the threat
> intel loader, or even through a direct to hbase streaming connector.
>
> Simon
>
> > On 14 Feb 2018, at 03:13, Ali Nazemian  wrote:
> >
> > Hi All,
> >
> > I would like to understand Metron community view on Threat Intel
> > aggregators as well as the roadmap of threat intelligence and threat
> > hunting. There are some open source options available regarding threat
> > intel aggregator such as Minemeld, Hippocampe, etc. Is there any plan to
> > build that as a part of Metron in future? Is there any specific
> aggregator
> > you think would be more aligned with Metron roadmap?
> >
> > Cheers,
> > Ali
>
>


-- 
A.Nazemian


[DISCUSS] community view/roadmap of threat intel

2018-02-13 Thread Ali Nazemian
Hi All,

I would like to understand the Metron community's view on threat intel
aggregators, as well as the roadmap for threat intelligence and threat
hunting. There are some open-source threat intel aggregators available,
such as Minemeld, Hippocampe, etc. Is there any plan to build one into
Metron in the future? Is there any specific aggregator you think would be
more aligned with the Metron roadmap?

Cheers,
Ali


Re: Disable Metron parser output writer entirely

2018-02-05 Thread Ali Nazemian
Thanks, Simon.

On 5 Feb. 2018 22:00, "Simon Elliston Ball" 
wrote:

> I expect the performance would be dire. If you really wanted to do
> something like this, a custom writer might make sense. KAFKA_PUT is really
> meant for debugging use cases only. It’s a very non-stellar construct
> (non-expression, no return, side-effect dependent…) Also, it creates a
> producer for every call, so your are definitely not going to get
> performance out of it.
>
> Simon
>
> > On 5 Feb 2018, at 06:32, Ali Nazemian  wrote:
> >
> > What about the performance difference?
> >
> > On Fri, Feb 2, 2018 at 10:41 PM, Otto Fowler 
> > wrote:
> >
> >> You cannot.
> >>
> >>
> >>
> >> On February 1, 2018 at 23:51:28, Ali Nazemian (alinazem...@gmail.com)
> >> wrote:
> >>
> >> Hi All,
> >>
> >> I am trying to investigate whether we can disable a Metron parser output
> >> writer entirely and manage it via KAFKA_PUT Stellar function instead.
> >> First, is it possible via configuration? Second, will be any performance
> >> difference between normal Kafka writer and the Stellar version of it
> >> (KAFKA_PUT).
> >>
> >> Regards,
> >> Ali
> >>
> >>
> >
> >
> > --
> > A.Nazemian
>
>


Re: Disable Metron parser output writer entirely

2018-02-04 Thread Ali Nazemian
What about the performance difference?

On Fri, Feb 2, 2018 at 10:41 PM, Otto Fowler 
wrote:

> You cannot.
>
>
>
> On February 1, 2018 at 23:51:28, Ali Nazemian (alinazem...@gmail.com)
> wrote:
>
> Hi All,
>
> I am trying to investigate whether we can disable a Metron parser output
> writer entirely and manage it via KAFKA_PUT Stellar function instead.
> First, is it possible via configuration? Second, will be any performance
> difference between normal Kafka writer and the Stellar version of it
> (KAFKA_PUT).
>
> Regards,
> Ali
>
>


-- 
A.Nazemian


Disable Metron parser output writer entirely

2018-02-01 Thread Ali Nazemian
Hi All,

I am trying to investigate whether we can disable a Metron parser's output
writer entirely and manage it via the KAFKA_PUT Stellar function instead.
First, is this possible via configuration? Second, will there be any
performance difference between the normal Kafka writer and the Stellar
version (KAFKA_PUT)?

Regards,
Ali


Re: Enrichment and indexing routing mechanism

2018-01-29 Thread Ali Nazemian
No, I haven't yet. I have checked the source code, and it seems I should be
able to modify source.type via Stellar in the parser config.

Regards,
Ali

On 29 Jan. 2018 23:37, "Otto Fowler"  wrote:

The source.type is set before the stellar transformations, so I think if
you change it in stellar
it should work.

Have you tried and failed?


On January 29, 2018 at 07:22:23, Ali Nazemian (alinazem...@gmail.com) wrote:

Yes, exactly.

On Mon, Jan 29, 2018 at 11:15 PM, Otto Fowler 
wrote:

> Are you trying to change the source.type to generate multiple sources
from
> a single feed?
>
>
> On January 29, 2018 at 07:08:57, Simon Elliston Ball (
> si...@simonellistonball.com) wrote:
>
> Flow is:
>
> Parser (including the parser class, and all transformations, including
> stellar transformations) -> Kafka (enrichments)
>
> Kafka (enrichments) -> Enrichment topology with all it’s Stellary
goodness
> -> Kafka (indexing)
>
> Kafka (indexing) -> Indexing topologies (ES / Solr / HDFS) configured
based
> on the indexing config named the same as source.type -> wherever the
> indexer tells it to be.
>
> Simon
>
> > On 29 Jan 2018, at 11:53, Ali Nazemian  wrote:
> >
> > Thanks, Simon. When will it apply for the enrichment? Is that after
> parser
> > and post-parser Stellar implementation? I am trying to understand If I
> > change it in post-parser Stellar, will it be overwritten at the last
step
> > of Parser topology or not?
> >
> > Cheers,
> > Ali
> >
> > On Mon, Jan 29, 2018 at 8:55 PM, Simon Elliston Ball <
> > si...@simonellistonball.com> wrote:
> >
> >> Yes, it is.
> >>
> >> Sent from my iPhone
> >>
> >>> On 29 Jan 2018, at 09:33, Ali Nazemian  wrote:
> >>>
> >>> Hi All,
> >>>
> >>> I was wondering how the routing mechanism works in Metron currently.
> Can
> >>> somebody please explain how Enrichment Storm topology understands a
> >> single
> >>> event is related to which Metron feed? What about indexing? is that
> based
> >>> on "source.type" field?
> >>>
> >>> Cheers,
> >>> Ali
> >>
> >
> >
> >
> > --
> > A.Nazemian
>



-- 
A.Nazemian


Re: Enrichment and indexing routing mechanism

2018-01-29 Thread Ali Nazemian
And I am trying to understand: if I set a post-parser Stellar transformation
to change the value of "source.type", will it impact enrichment routing, or
will it be overwritten by an internal method?

On Mon, Jan 29, 2018 at 11:22 PM, Ali Nazemian 
wrote:

> Yes, exactly.
>
> On Mon, Jan 29, 2018 at 11:15 PM, Otto Fowler 
> wrote:
>
>> Are you trying to change the source.type to generate multiple sources from
>> a single feed?
>>
>>
>> On January 29, 2018 at 07:08:57, Simon Elliston Ball (
>> si...@simonellistonball.com) wrote:
>>
>> Flow is:
>>
>> Parser (including the parser class, and all transformations, including
>> stellar transformations) -> Kafka (enrichments)
>>
>> Kafka (enrichments) -> Enrichment topology with all it’s Stellary goodness
>> -> Kafka (indexing)
>>
>> Kafka (indexing) -> Indexing topologies (ES / Solr / HDFS) configured
>> based
>> on the indexing config named the same as source.type -> wherever the
>> indexer tells it to be.
>>
>> Simon
>>
>> > On 29 Jan 2018, at 11:53, Ali Nazemian  wrote:
>> >
>> > Thanks, Simon. When will it apply for the enrichment? Is that after
>> parser
>> > and post-parser Stellar implementation? I am trying to understand If I
>> > change it in post-parser Stellar, will it be overwritten at the last
>> step
>> > of Parser topology or not?
>> >
>> > Cheers,
>> > Ali
>> >
>> > On Mon, Jan 29, 2018 at 8:55 PM, Simon Elliston Ball <
>> > si...@simonellistonball.com> wrote:
>> >
>> >> Yes, it is.
>> >>
>> >> Sent from my iPhone
>> >>
>> >>> On 29 Jan 2018, at 09:33, Ali Nazemian  wrote:
>> >>>
>> >>> Hi All,
>> >>>
>> >>> I was wondering how the routing mechanism works in Metron currently.
>> Can
>> >>> somebody please explain how Enrichment Storm topology understands a
>> >> single
>> >>> event is related to which Metron feed? What about indexing? is that
>> based
>> >>> on "source.type" field?
>> >>>
>> >>> Cheers,
>> >>> Ali
>> >>
>> >
>> >
>> >
>> > --
>> > A.Nazemian
>>
>
>
>
> --
> A.Nazemian
>



-- 
A.Nazemian


Re: Enrichment and indexing routing mechanism

2018-01-29 Thread Ali Nazemian
Yes, exactly.

On Mon, Jan 29, 2018 at 11:15 PM, Otto Fowler 
wrote:

> Are you trying to change the source.type to generate multiple sources from
> a single feed?
>
>
> On January 29, 2018 at 07:08:57, Simon Elliston Ball (
> si...@simonellistonball.com) wrote:
>
> Flow is:
>
> Parser (including the parser class, and all transformations, including
> stellar transformations) -> Kafka (enrichments)
>
> Kafka (enrichments) -> Enrichment topology with all it’s Stellary goodness
> -> Kafka (indexing)
>
> Kafka (indexing) -> Indexing topologies (ES / Solr / HDFS) configured based
> on the indexing config named the same as source.type -> wherever the
> indexer tells it to be.
>
> Simon
>
> > On 29 Jan 2018, at 11:53, Ali Nazemian  wrote:
> >
> > Thanks, Simon. When will it apply for the enrichment? Is that after
> parser
> > and post-parser Stellar implementation? I am trying to understand If I
> > change it in post-parser Stellar, will it be overwritten at the last step
> > of Parser topology or not?
> >
> > Cheers,
> > Ali
> >
> > On Mon, Jan 29, 2018 at 8:55 PM, Simon Elliston Ball <
> > si...@simonellistonball.com> wrote:
> >
> >> Yes, it is.
> >>
> >> Sent from my iPhone
> >>
> >>> On 29 Jan 2018, at 09:33, Ali Nazemian  wrote:
> >>>
> >>> Hi All,
> >>>
> >>> I was wondering how the routing mechanism works in Metron currently.
> Can
> >>> somebody please explain how Enrichment Storm topology understands a
> >> single
> >>> event is related to which Metron feed? What about indexing? is that
> based
> >>> on "source.type" field?
> >>>
> >>> Cheers,
> >>> Ali
> >>
> >
> >
> >
> > --
> > A.Nazemian
>



-- 
A.Nazemian


Re: Enrichment and indexing routing mechanism

2018-01-29 Thread Ali Nazemian
Thanks, Simon. When does it apply for enrichment? Is that after the parser
and post-parser Stellar transformations? I am trying to understand: if I
change it in the post-parser Stellar, will it be overwritten at the last
step of the parser topology or not?

Cheers,
Ali

On Mon, Jan 29, 2018 at 8:55 PM, Simon Elliston Ball <
si...@simonellistonball.com> wrote:

> Yes, it is.
>
> Sent from my iPhone
>
> > On 29 Jan 2018, at 09:33, Ali Nazemian  wrote:
> >
> > Hi All,
> >
> > I was wondering how the routing mechanism works in Metron currently. Can
> > somebody please explain how Enrichment Storm topology understands a
> single
> > event is related to which Metron feed? What about indexing? is that based
> > on "source.type" field?
> >
> > Cheers,
> > Ali
>



-- 
A.Nazemian


Enrichment and indexing routing mechanism

2018-01-29 Thread Ali Nazemian
Hi All,

I was wondering how the routing mechanism currently works in Metron. Can
somebody please explain how the enrichment Storm topology determines which
Metron feed a single event belongs to? What about indexing? Is that based
on the "source.type" field?

Cheers,
Ali


Re: [DISCUSS] Update Metron Elasticsearch index names to metron_

2018-01-24 Thread Ali Nazemian
Hi All,

I just wanted to say it would be great if we could be careful with these
types of changes. From a development point of view, it is just a few lines
of code that can provide multiple advantages, but for live, large-scale
Metron platforms, some of these changes might be really expensive to
address with zero downtime.

Cheers,
Ali

On Thu, Jan 25, 2018 at 9:29 AM, Otto Fowler 
wrote:

> +1
>
>
> On January 24, 2018 at 16:28:42, Nick Allen (n...@nickallen.org) wrote:
>
> +1 to a standard prefix for all Metron indices. I've had the same thought
> myself and you laid out the advantages well.
>
>
>
>
>
> On Wed, Jan 24, 2018 at 3:47 PM zeo...@gmail.com  wrote:
>
> > I agree with having a metron_ prefix for ES indexes, and the timing.
> >
> > Jon
> >
> > On Wed, Jan 24, 2018 at 3:20 PM Michael Miklavcic <
> > michael.miklav...@gmail.com> wrote:
> >
> > > With the completion of https://github.com/apache/metron/pull/840
> > > (METRON-939: Upgrade ElasticSearch and Kibana), we have the makings for
> a
> > > major release rev of Metron in the upcoming release (currently slotted
> to
> > > 0.4.3, I believe). Since there are non-backwards compatible changes
> > > pertaining to ES indexing, it seems like a good opportunity to revisit
> > our
> > > index naming standards.
> > >
> > > I propose we add a simple prefix "metron_" to all Metron indexes. There
> > are
> > > numerous reasons for doing so
> > >
> > > - removes the likelihood of index name collisions when we perform
> > > operations on index wildcard names, e.g. "enrichment_*, indexing_*,
> > > etc.".
> > > - ie, this allows us to be more friendly in a multi-tenant ES
> > > environment for relatively low engineering cost.
> > > - simplifies the Kibana dashboard a bit. We currently needed to
> > create a
> > > special index pattern in order to accommodate multi-index pattern
> > > matching
> > > across all metron-specific indexes. Using metron_* would be much
> > simpler
> > > and less prone to error.
> > > - easier for customers to debug and identify Metron-specific indexes
> > and
> > > associated data
> > >
> > >
> > > The reason for making these changes now is that we already have
> breaking
> > > changes with ES. Leveraging existing indexed data rather than deleting
> > > indexes and starting from scractch already requires a
> > re-indexing/migration
> > > step, so there is no additional effort on the part of users if they
> > choose
> > > to attempt a migration. It further makes sense with our current work
> > > towards upgrading Solr.
> > >
> > > We already have a battery of integration and manual tests after the ES
> > > upgrade work that can be leveraged to validate the changes.
> > >
> > > Mike Miklavcic
> > >
> >
> >
> > --
> >
> > Jon
> >
>



-- 
A.Nazemian


Re: Metron Alert UI and zero-down time Elasticsearch re-index

2018-01-14 Thread Ali Nazemian
It would be great if we could get some help on this issue.

Cheers,
Ali

On Sat, Jan 6, 2018 at 12:33 PM, Ali Nazemian  wrote:

> Hi James,
>
> Due to changes in the field format, I want to create a new index with the
> new format. Create an alias to refer to both new and old index. Then, copy
> all the documents from the old index to the new index and use the alias to
> search through Metron Alert UI and Kibana to avoid any downtime. Handling
> it in Kibana is easy. However, Metron Alert UI shows duplicate documents. I
> want to limit Metron Alert UI somehow to read alias instead of both
> underneath indices (old index and new index).
>
> P.S: all of your messages in the mailing list end up in my spam for some
> reason!
>
> Cheers,
> Ali
>
> On Thu, Jan 4, 2018 at 5:48 PM, James Sirota  wrote:
>
>> Hi Ali, I am not sure I understand what you are trying to do.  Are you
>> trying to change the name on the old index, add it to the alias, and then
>> re-index and give the new index the name of the old index?
>>
>> 01.01.2018, 22:30, "Ali Nazemian" :
>> > Hi All,
>> >
>> > We are using an older version of Metron Alert-UI (Received in Oct 2017)
>> > which sends search queries to ES directly without using Metron Rest
>> API. We
>> > wanted to run a zero-downtime ES reindex process by using ES aliasing.
>> > However, I am not sure how it will impact the search part of Alert-UI
>> > because we need to change it to refer to the alias instead of the old
>> index
>> > name. Please advise how it can be covered in the older version of Metron
>> > Alert-UI.
>> >
>> > Regards,
>> > Ali
>>
>> ---
>> Thank you,
>>
>> James Sirota
>> PMC- Apache Metron
>> jsirota AT apache DOT org
>>
>
>
>
> --
> A.Nazemian
>



-- 
A.Nazemian


Re: Metron Alert UI and zero-down time Elasticsearch re-index

2018-01-05 Thread Ali Nazemian
Hi James,

Due to changes in the field format, I want to create a new index with the
new format, create an alias that refers to both the new and the old index,
and then copy all the documents from the old index to the new one, using
the alias to search through the Metron Alert UI and Kibana so there is no
downtime. Handling this in Kibana is easy. However, the Metron Alert UI
shows duplicate documents. I want to somehow limit the Metron Alert UI to
read from the alias instead of from both underlying indices (old and new).
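For context, that workflow maps onto three Elasticsearch API payloads.
The sketch below only builds the request bodies; the index names
(alerts_old, alerts_new) and alias name are made up, and nothing is sent
to a cluster:

```python
import json

old_index, new_index, alias = "alerts_old", "alerts_new", "alerts"

# 1) Alias over both indices so searches keep working during the copy.
add_both = {"actions": [
    {"add": {"index": old_index, "alias": alias}},
    {"add": {"index": new_index, "alias": alias}},
]}

# 2) Body for the _reindex API: copy old documents into the new index.
reindex_body = {"source": {"index": old_index},
                "dest": {"index": new_index}}

# 3) Atomic alias swap once the copy completes, so readers move from
#    the old index to the new one in a single step.
swap = {"actions": [
    {"remove": {"index": old_index, "alias": alias}},
    {"add": {"index": new_index, "alias": alias}},
]}

payloads = [json.dumps(p) for p in (add_both, reindex_body, swap)]
```

Note that while the alias covers both indices (step 1), any document
already copied exists under both, which is consistent with the duplicates
the Alert UI shows; keeping the alias on one index at a time (read from
the old index during the copy, then swap atomically) avoids the
duplication.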

P.S: all of your messages in the mailing list end up in my spam for some
reason!

Cheers,
Ali

On Thu, Jan 4, 2018 at 5:48 PM, James Sirota  wrote:

> Hi Ali, I am not sure I understand what you are trying to do.  Are you
> trying to change the name on the old index, add it to the alias, and then
> re-index and give the new index the name of the old index?
>
> 01.01.2018, 22:30, "Ali Nazemian" :
> > Hi All,
> >
> > We are using an older version of Metron Alert-UI (Received in Oct 2017)
> > which sends search queries to ES directly without using Metron Rest API.
> We
> > wanted to run a zero-downtime ES reindex process by using ES aliasing.
> > However, I am not sure how it will impact the search part of Alert-UI
> > because we need to change it to refer to the alias instead of the old
> index
> > name. Please advise how it can be covered in the older version of Metron
> > Alert-UI.
> >
> > Regards,
> > Ali
>
> ---
> Thank you,
>
> James Sirota
> PMC- Apache Metron
> jsirota AT apache DOT org
>



-- 
A.Nazemian


Metron Alert UI and zero-down time Elasticsearch re-index

2018-01-01 Thread Ali Nazemian
Hi All,


We are using an older version of the Metron Alert UI (received in Oct 2017)
which sends search queries to ES directly without using the Metron REST
API. We want to run a zero-downtime ES reindex using ES aliasing. However,
I am not sure how it will impact the search side of the Alert UI, because
we need to change it to refer to the alias instead of the old index name.
Please advise how this can be handled in the older version of the Metron
Alert UI.

Regards,
Ali


Re: Metron nested object

2017-12-21 Thread Ali Nazemian
So Metron enrichment and indexing are not nested-aware? Is there any plan
to add that to Metron in the future?

Cheers,
Ali

On Fri, Dec 22, 2017 at 12:46 AM, Otto Fowler 
wrote:

> I believe right now you have to flatten.
> The jsonMap parser does this.
>
>
> On December 21, 2017 at 08:28:13, Ali Nazemian (alinazem...@gmail.com)
> wrote:
>
> Hi all,
>
>
> We have recently faced some data sources that generate data in a nested
> format. For example, AWS Cloudtrail generates data in the following JSON
> format:
>
> {
>   "Records": [
>     {
>       "eventVersion": "2.0",
>       "userIdentity": {
>         "type": "IAMUser",
>         "principalId": "EX_PRINCIPAL_ID",
>         "arn": "arn:aws:iam::123456789012:user/Alice",
>         "accessKeyId": "EXAMPLE_KEY_ID",
>         "accountId": "123456789012",
>         "userName": "Alice"
>       },
>       "eventTime": "2014-03-07T21:22:54Z",
>       "eventSource": "ec2.amazonaws.com",
>       "eventName": "StartInstances",
>       "awsRegion": "us-east-2",
>       "sourceIPAddress": "205.251.233.176",
>       "userAgent": "ec2-api-tools 1.6.12.2",
>       "requestParameters": {
>         "instancesSet": {
>           "items": [
>             {
>               "instanceId": "i-ebeaf9e2"
>             }
>           ]
>         }
>       },
>       "responseElements": {
>         "instancesSet": {
>           "items": [
>             {
>               "instanceId": "i-ebeaf9e2",
>               "currentState": {
>                 "code": 0,
>                 "name": "pending"
>               },
>               "previousState": {
>                 "code": 80,
>                 "name": "stopped"
>               }
>             }
>           ]
>         }
>       }
>     }
>   ]
> }
>
>
> We are able to make this as a flat JSON file. However, a nested object is
> supported by data backends in Metron (ES, ORC, etc.), so I was wondering
> whether with the current version of Metron we are able to index nested
> documents or we have to make it flat?
>
>
>
> Cheers,
>
> Ali
>
>


-- 
A.Nazemian


Metron nested object

2017-12-21 Thread Ali Nazemian
Hi all,


We have recently encountered some data sources that generate data in a
nested format. For example, AWS CloudTrail generates data in the following
JSON format:

{
  "Records": [
    {
      "eventVersion": "2.0",
      "userIdentity": {
        "type": "IAMUser",
        "principalId": "EX_PRINCIPAL_ID",
        "arn": "arn:aws:iam::123456789012:user/Alice",
        "accessKeyId": "EXAMPLE_KEY_ID",
        "accountId": "123456789012",
        "userName": "Alice"
      },
      "eventTime": "2014-03-07T21:22:54Z",
      "eventSource": "ec2.amazonaws.com",
      "eventName": "StartInstances",
      "awsRegion": "us-east-2",
      "sourceIPAddress": "205.251.233.176",
      "userAgent": "ec2-api-tools 1.6.12.2",
      "requestParameters": {
        "instancesSet": {
          "items": [
            {
              "instanceId": "i-ebeaf9e2"
            }
          ]
        }
      },
      "responseElements": {
        "instancesSet": {
          "items": [
            {
              "instanceId": "i-ebeaf9e2",
              "currentState": {
                "code": 0,
                "name": "pending"
              },
              "previousState": {
                "code": 80,
                "name": "stopped"
              }
            }
          ]
        }
      }
    }
  ]
}


We are able to flatten this into a flat JSON document. However, nested
objects are supported by the data backends Metron uses (ES, ORC, etc.), so
I was wondering whether, with the current version of Metron, we are able
to index nested documents, or whether we have to flatten them.
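For reference, a minimal flattening step of the kind mentioned above could
look like the sketch below (illustrative only; Metron's jsonMap parser
does its own flattening, and the dotted-key convention here is just one
possible choice):

```python
def flatten(obj, prefix="", sep="."):
    """Recursively flatten nested dicts and lists into dotted keys."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        # List positions become numeric key segments.
        items = ((str(i), v) for i, v in enumerate(obj))
    else:
        return {prefix: obj}
    flat = {}
    for key, value in items:
        full = f"{prefix}{sep}{key}" if prefix else key
        flat.update(flatten(value, full, sep))
    return flat

record = {
    "userIdentity": {"type": "IAMUser", "userName": "Alice"},
    "responseElements": {
        "instancesSet": {"items": [{"instanceId": "i-ebeaf9e2"}]}
    },
}
flat = flatten(record)
# flat["userIdentity.userName"] -> "Alice"
# flat["responseElements.instancesSet.items.0.instanceId"] -> "i-ebeaf9e2"
```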



Cheers,

Ali


Re: Heterogeneous indexing batch size for different Metron feeds

2017-12-11 Thread Ali Nazemian
>can handle given its current tuning configuration and then work up from
>there. I just went through this the other day and I started by lowering
> the
>setting to a conservative 1000.
>- From this point, you can work on tuning ES or HDFS, whichever the
>culprit is. Realistically, you're going to need to do a few tuning
> cycles
>to tune that end point, increase the upstream throughput, tune again,
> and
>repeat until you get the rate of throughput necessary to handle your
> live
>feeds.
>
> The other recommendation I have that goes hand in hand with the approach
> above is to look at the lag offsets for your indexing topic [1]. This is
> invaluable, and it's really, really easy to use from the CLI. There are 2
> ways to approach this problem:
>
>1. live sensor data streaming with the indexing topology keeping pace
>e2e. e.g. bro sensor to bro topic to enrichment topic to indexing topic.
>You would expect to see your kafka partition consumer offsets
> maintaining a
>fairly consistent lag after a few minutes of activity and stabilizing.
>2. reading a large quantity of data already landed in the indexing topic
>(start indexing topology with Kafka offset = EARLIEST). The goal would
> be
>to see your offset lag catch up eventually. This allows you to stress
> test
>your indexing topology different from a normal streaming experience.
> One,
>it simplifies the moving parts (no parsers, enrichments are live at this
>point - you're replaying data already in the topic). But two, you can
>easily tune the levers outlined above, iterate with each change in rapid
>succession, and record your results.
>
> 1.
> https://github.com/apache/metron/blob/master/metron-
> platform/Performance-tuning-guide.md
>
> Sample command without Kerberos enabled (see link [1] for more detail with
> Kerberos):
>
> watch -n 10 -d ${KAFKA_HOME}/bin/kafka-consumer-groups.sh \
> --describe \
> --group indexing \
> --bootstrap-server $BROKERLIST \
> --new-consumer
>
> Hope this helps.
>
> Cheers,
> Michael Miklavcic
>
>
> On Sun, Dec 10, 2017 at 5:38 AM, Ali Nazemian 
> wrote:
>
> > This seems not the same as our observations. Whenever there are some
> > messages in the indexing or enrichments backlog, the new configurations
> (at
> > least related to the batch size) won't be applied to the new messages. It
> > will remain as the previous state until it processes all the old
> messages.
> > This scenario can be produced very easily.
> >
> > Create a feed with an inefficient batch size to create a backlog on
> > indexing topic. Then change the batch size to an effective value and wait
> > to see how long it will take to process the backlog. Based on our
> > observations, it takes a while to process messages in a back-log even if
> > you fix the batch size. It feels batch size changes are not synchronised
> > instantly.
> >
> > On Thu, Dec 7, 2017 at 11:45 PM, Otto Fowler 
> > wrote:
> >
> > > We use TreeCache
> > > <https://curator.apache.org/apidocs/org/apache/curator/
> > framework/recipes/cache/TreeCache.html>
> > > .
> > >
> > > When the configuration is updated in zookeeper, the configuration
> object
> > > in the bolt is updated. This configuration is read on each message, so
> I
> > > think from what I see new configurations should get picked up for the
> > next
> > > message.
> > >
> > > I could be wrong though.
> > >
> > >
> > >
> > >
> > > On December 7, 2017 at 06:47:15, Ali Nazemian (alinazem...@gmail.com)
> > > wrote:
> > >
> > > Thank you very much. Unfortunately, reproducing all the situations are
> > > very costly for us at this moment. We are kind of avoiding to hit that
> > > issue by using the same batch size for all the feeds. Hopefully, with
> the
> > > new PR Casey provided for the segregation of ES and HDFS, it will be
> very
> > > much clear to tune them.
> > >
> > > Do you know how the synchronization of indexing config will happen with
> > > the topology? Does the topology gets synchronised by pulling the last
> > > configs from ZK based on some background mechanism or it is based on an
> > > update trigger? As I mentioned, based on our observation it looks like
> > the
> > > synchronization doesn't work until all the old messages in Kafka queue
> > get
> > > processed based on the old indexing configs.
> > >
> > > Regards,
> >

Re: Heterogeneous indexing batch size for different Metron feeds

2017-12-10 Thread Ali Nazemian
This does not match our observations. Whenever there are messages in the
indexing or enrichment backlog, the new configuration (at least the batch
size) is not applied to new messages; the previous settings remain in
effect until all the old messages have been processed. This scenario can
be reproduced very easily.

Create a feed with an inefficient batch size to build a backlog on the
indexing topic. Then change the batch size to an effective value and watch
how long it takes to process the backlog. Based on our observations, it
takes a while to work through a backlog even after fixing the batch size.
It feels like batch-size changes are not synchronised instantly.
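For reference, the writer behaviour described elsewhere in this thread
(per-sensor batch lists flushed when a size threshold is hit, plus a
tick-driven timeout flush) can be sketched as follows. All names are
hypothetical; this is not the actual bolt code:

```python
import time

class SensorBatcher:
    """Per-sensor batch lists flushed on size or timeout (a sketch,
    not Metron's actual indexing bolt)."""

    def __init__(self, batch_size, timeout_secs, writer):
        self.batch_size = batch_size
        self.timeout_secs = timeout_secs
        self.writer = writer          # callable(sensor, messages)
        self.batches = {}             # sensor -> (start_time, [messages])

    def on_message(self, sensor, message, now=None):
        now = time.monotonic() if now is None else now
        _started, msgs = self.batches.setdefault(sensor, (now, []))
        msgs.append(message)
        # Size check happens per message, for that sensor's list only.
        if len(msgs) >= self.batch_size:
            self._flush(sensor)

    def on_tick(self, now=None):
        """A tick triggers a timeout check across ALL sensors' batches."""
        now = time.monotonic() if now is None else now
        for sensor, (started, msgs) in list(self.batches.items()):
            if msgs and now - started >= self.timeout_secs:
                self._flush(sensor)

    def _flush(self, sensor):
        _, msgs = self.batches.pop(sensor)
        self.writer(sensor, msgs)

flushed = []
b = SensorBatcher(batch_size=2, timeout_secs=5,
                  writer=lambda s, m: flushed.append((s, m)))
b.on_message("bro", {"id": 1}, now=0)
b.on_message("bro", {"id": 2}, now=1)   # size threshold hit -> flush
b.on_tick(now=10)                        # nothing left to time out
```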

On Thu, Dec 7, 2017 at 11:45 PM, Otto Fowler 
wrote:

> We use TreeCache
> <https://curator.apache.org/apidocs/org/apache/curator/framework/recipes/cache/TreeCache.html>
> .
>
> When the configuration is updated in zookeeper, the configuration object
> in the bolt is updated. This configuration is read on each message, so I
> think from what I see new configurations should get picked up for the next
> message.
>
> I could be wrong though.
>
>
>
>
> On December 7, 2017 at 06:47:15, Ali Nazemian (alinazem...@gmail.com)
> wrote:
>
> Thank you very much. Unfortunately, reproducing all the situations are
> very costly for us at this moment. We are kind of avoiding to hit that
> issue by using the same batch size for all the feeds. Hopefully, with the
> new PR Casey provided for the segregation of ES and HDFS, it will be very
> much clear to tune them.
>
> Do you know how the synchronization of indexing config will happen with
> the topology? Does the topology gets synchronised by pulling the last
> configs from ZK based on some background mechanism or it is based on an
> update trigger? As I mentioned, based on our observation it looks like the
> synchronization doesn't work until all the old messages in Kafka queue get
> processed based on the old indexing configs.
>
> Regards,
> Ali
>
> On Thu, Dec 7, 2017 at 12:33 AM, Otto Fowler 
> wrote:
>
>> Sorry,
>> We flush for timeouts on every storm ‘tick’ message, not on every message.
>>
>>
>>
>> On December 6, 2017 at 08:29:51, Otto Fowler (ottobackwa...@gmail.com)
>> wrote:
>>
>> I have looked at it.
>>
>> We maintain batch lists for each sensor which gather messages to index.
>> When we get a message that puts it over the batch size the messages are
>> flushed and written to the target.
>> There is also a timeout component, where the batch would be flushed based
>> on timeout.
>>
>> While batch size checking occurs on a per sensor-message receipt basis,
>> each message, regardless of sensor will trigger a check of the batch
>> timeout for all the lists.
>>
>> At least that is what I think I see.
>>
>> Without understanding what the failures are for it is hard to see what
>> the issue is.
>>
>> Do we have timing issues where all the lists are timing out all the time
>> causing some kind of cascading failure for example?
>> Does the number of sensors matter?  For example if only one sensor
>> topology is running with batch setup X, is everything fine?  Do failures
>> start after adding Nth additional sensor?
>>
>> Hopefully someone else on the list may have an idea.
>> That code does not have any logging to speak of… well debug / trace
>> logging that would help here either.
>>
>>
>>
>> On December 6, 2017 at 08:18:01, Ali Nazemian (alinazem...@gmail.com)
>> wrote:
>>
>> Everything looks normal except the high number of failed tuples. Do you
>> know how the indexing batch size works? Based on our observations it seems
>> it doesn't update the messages that are in enrichments and indexing topics.
>>
>> On Thu, Dec 7, 2017 at 12:13 AM, Otto Fowler 
>> wrote:
>>
>>> What do you see in the storm ui for the indexing topology?
>>>
>>>
>>> On December 6, 2017 at 07:10:17, Ali Nazemian (alinazem...@gmail.com)
>>> wrote:
>>>
>>> Both hdfs and Elasticsearch batch sizes. There is no error in the logs.
>>> It mpacts topology error rate and cause almost 90% error rate on indexing
>>> tuples.
>>>
>>> On 6 Dec. 2017 00:20, "Otto Fowler"  wrote:
>>>
>>> Where are you seeing the errors?  Screenshot?
>>>
>>>
>>> On December 5, 2017 at 08:03:46, Otto Fowler (ottobackwa...@gmail.com)
>>> wrote:
>>>
>>> Which of the indexing options are you changing the batch size for?
>>> HDFS?  Elasticsearch?  Both?

Re: Heterogeneous indexing batch size for different Metron feeds

2017-12-07 Thread Ali Nazemian
Thank you very much. Unfortunately, reproducing all these situations is
very costly for us at the moment. We are avoiding the issue by using the
same batch size for all the feeds. Hopefully, with the new PR Casey
provided for segregating ES and HDFS, it will be much clearer how to tune
them.

Do you know how the indexing config is synchronised with the topology?
Does the topology get synchronised by pulling the latest configs from ZK
via some background mechanism, or is it driven by an update trigger? As I
mentioned, based on our observations it looks like the synchronisation
doesn't take effect until all the old messages in the Kafka queue have
been processed with the old indexing configs.
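For what it's worth, the watch-and-cache pattern Otto described elsewhere
in the thread (a config object updated when ZooKeeper changes, read per
message) can be sketched in plain Python; threading stands in for
Curator's TreeCache here, and every name is hypothetical:

```python
import threading

class ConfigCache:
    """Shared config object: a watcher updates it, and the bolt reads it
    on every message (a sketch, not Curator's actual TreeCache API)."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._config = dict(initial)

    def on_update(self, new_config):
        # In Metron this would be driven by a ZooKeeper watch firing
        # when the indexing config node changes.
        with self._lock:
            self._config = dict(new_config)

    def current(self):
        with self._lock:
            return dict(self._config)

cache = ConfigCache({"batchSize": 100})

def batch_size_for_next_message():
    # The config is read per message, so an update is visible to the
    # very next message -- but messages already buffered in a batch or
    # sitting in the Kafka backlog were handled under the old value.
    return cache.current()["batchSize"]

before = batch_size_for_next_message()
cache.on_update({"batchSize": 5})   # watcher fires on a config change
after = batch_size_for_next_message()
```

Under this pattern the new value applies to the next message read, which
suggests the backlog delay observed above comes from batches already
buffered rather than from the config propagation itself.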

Regards,
Ali

On Thu, Dec 7, 2017 at 12:33 AM, Otto Fowler 
wrote:

> Sorry,
> We flush for timeouts on every storm ‘tick’ message, not on every message.
>
>
>
> On December 6, 2017 at 08:29:51, Otto Fowler (ottobackwa...@gmail.com)
> wrote:
>
> I have looked at it.
>
> We maintain batch lists for each sensor which gather messages to index.
> When we get a message that puts it over the batch size the messages are
> flushed and written to the target.
> There is also a timeout component, where the batch would be flushed based
> on timeout.
>
> While batch size checking occurs on a per sensor-message receipt basis,
> each message, regardless of sensor will trigger a check of the batch
> timeout for all the lists.
>
> At least that is what I think I see.
>
> Without understanding what the failures are for it is hard to see what the
> issue is.
>
> Do we have timing issues where all the lists are timing out all the time
> causing some kind of cascading failure for example?
> Does the number of sensors matter?  For example if only one sensor
> topology is running with batch setup X, is everything fine?  Do failures
> start after adding Nth additional sensor?
>
> Hopefully someone else on the list may have an idea.
> That code does not have any logging to speak of… well debug / trace
> logging that would help here either.
>
>
>
> On December 6, 2017 at 08:18:01, Ali Nazemian (alinazem...@gmail.com)
> wrote:
>
> Everything looks normal except the high number of failed tuples. Do you
> know how the indexing batch size works? Based on our observations it seems
> it doesn't update the messages that are in enrichments and indexing topics.
>
> On Thu, Dec 7, 2017 at 12:13 AM, Otto Fowler 
> wrote:
>
>> What do you see in the storm ui for the indexing topology?
>>
>>
>> On December 6, 2017 at 07:10:17, Ali Nazemian (alinazem...@gmail.com)
>> wrote:
>>
>> Both hdfs and Elasticsearch batch sizes. There is no error in the logs.
>> It mpacts topology error rate and cause almost 90% error rate on indexing
>> tuples.
>>
>> On 6 Dec. 2017 00:20, "Otto Fowler"  wrote:
>>
>> Where are you seeing the errors?  Screenshot?
>>
>>
>> On December 5, 2017 at 08:03:46, Otto Fowler (ottobackwa...@gmail.com)
>> wrote:
>>
>> Which of the indexing options are you changing the batch size for?
>> HDFS?  Elasticsearch?  Both?
>>
>> Can you give an example?
>>
>>
>>
>> On December 5, 2017 at 02:09:29, Ali Nazemian (alinazem...@gmail.com)
>> wrote:
>>
>> No specific error in the logs. I haven't enabled debug/trace, though.
>>
>> On Tue, Dec 5, 2017 at 11:54 AM, Otto Fowler 
>> wrote:
>>
>>> My first thought is what are the errors when you get a high error rate?
>>>
>>>
>>> On December 4, 2017 at 19:34:29, Ali Nazemian (alinazem...@gmail.com)
>>> wrote:
>>>
>>> Any thoughts?
>>>
>>> On Sun, Dec 3, 2017 at 11:27 PM, Ali Nazemian 
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > We have noticed recently that no matter what batch size we use for
>>> Metron
>>> > indexing feeds, as long as we start using different batch size for
>>> > different Metron feeds, indexing topology throughput will start
>>> dropping
>>> > due to the high error rate! So I was wondering whether based on the
>>> current
>>> > indexing topology design, we have to choose the same batch size for
>>> all the
>>> > feeds or not. Otherwise, throughout will be dropped. I assume since it
>>> is
>>> > acceptable to use different batch sizes for different feeds, it is not
>>> > expected by design.
>>> >
>>> > Moreover, I have noticed in practice that even if we change the batch
>>> > size, it will not affect the messages that are already in enrichments
>>> or
>>> > indexing topics, and it will only affect the new messages that are
>>> coming
>>> > to the parser. Therefore, we need to let all the messages pass the
>>> indexing
>>> > topology so that we can change the batch size!
>>> >
>>> > It would be great if we can have more details regarding the design of
>>> this
>>> > section so we can understand our observations are based on the design
>>> or
>>> > some kind of bug.
>>> >
>>> > Regards,
>>> > Ali
>>> >
>>>
>>>
>>>
>>> --
>>> A.Nazemian
>>>
>>>
>>
>>
>> --
>> A.Nazemian
>>
>>
>>
>
>
> --
> A.Nazemian
>
>


-- 
A.Nazemian


Re: Heterogeneous indexing batch size for different Metron feeds

2017-12-06 Thread Ali Nazemian
Everything looks normal except the high number of failed tuples. Do you
know how the indexing batch size works? Based on our observations, it
seems changes do not apply to messages already in the enrichments and
indexing topics.

On Thu, Dec 7, 2017 at 12:13 AM, Otto Fowler 
wrote:

> What do you see in the storm ui for the indexing topology?
>
>
> On December 6, 2017 at 07:10:17, Ali Nazemian (alinazem...@gmail.com)
> wrote:
>
> Both hdfs and Elasticsearch batch sizes. There is no error in the logs. It
> mpacts topology error rate and cause almost 90% error rate on indexing
> tuples.
>
> On 6 Dec. 2017 00:20, "Otto Fowler"  wrote:
>
> Where are you seeing the errors?  Screenshot?
>
>
> On December 5, 2017 at 08:03:46, Otto Fowler (ottobackwa...@gmail.com)
> wrote:
>
> Which of the indexing options are you changing the batch size for?  HDFS?
> Elasticsearch?  Both?
>
> Can you give an example?
>
>
>
> On December 5, 2017 at 02:09:29, Ali Nazemian (alinazem...@gmail.com)
> wrote:
>
> No specific error in the logs. I haven't enabled debug/trace, though.
>
> On Tue, Dec 5, 2017 at 11:54 AM, Otto Fowler 
> wrote:
>
>> My first thought is what are the errors when you get a high error rate?
>>
>>
>> On December 4, 2017 at 19:34:29, Ali Nazemian (alinazem...@gmail.com)
>> wrote:
>>
>> Any thoughts?
>>
>> On Sun, Dec 3, 2017 at 11:27 PM, Ali Nazemian 
>> wrote:
>>
>> > Hi,
>> >
>> > We have noticed recently that no matter what batch size we use for
>> Metron
>> > indexing feeds, as long as we start using different batch size for
>> > different Metron feeds, indexing topology throughput will start dropping
>> > due to the high error rate! So I was wondering whether based on the
>> current
>> > indexing topology design, we have to choose the same batch size for all
>> the
>> > feeds or not. Otherwise, throughout will be dropped. I assume since it
>> is
>> > acceptable to use different batch sizes for different feeds, it is not
>> > expected by design.
>> >
>> > Moreover, I have noticed in practice that even if we change the batch
>> > size, it will not affect the messages that are already in enrichments or
>> > indexing topics, and it will only affect the new messages that are
>> coming
>> > to the parser. Therefore, we need to let all the messages pass the
>> indexing
>> > topology so that we can change the batch size!
>> >
>> > It would be great if we can have more details regarding the design of
>> this
>> > section so we can understand our observations are based on the design or
>> > some kind of bug.
>> >
>> > Regards,
>> > Ali
>> >
>>
>>
>>
>> --
>> A.Nazemian
>>
>>
>
>
> --
> A.Nazemian
>
>
>


-- 
A.Nazemian


Re: Heterogeneous indexing batch size for different Metron feeds

2017-12-06 Thread Ali Nazemian
Both HDFS and Elasticsearch batch sizes. There is no error in the logs. It
impacts the topology error rate, causing an almost 90% failure rate on
indexing tuples.

On 6 Dec. 2017 00:20, "Otto Fowler"  wrote:

Where are you seeing the errors?  Screenshot?


On December 5, 2017 at 08:03:46, Otto Fowler (ottobackwa...@gmail.com)
wrote:

Which of the indexing options are you changing the batch size for?  HDFS?
Elasticsearch?  Both?

Can you give an example?



On December 5, 2017 at 02:09:29, Ali Nazemian (alinazem...@gmail.com) wrote:

No specific error in the logs. I haven't enabled debug/trace, though.

On Tue, Dec 5, 2017 at 11:54 AM, Otto Fowler 
wrote:

> My first thought is what are the errors when you get a high error rate?
>
>
> On December 4, 2017 at 19:34:29, Ali Nazemian (alinazem...@gmail.com)
> wrote:
>
> Any thoughts?
>
> On Sun, Dec 3, 2017 at 11:27 PM, Ali Nazemian 
> wrote:
>
> > Hi,
> >
> > We have noticed recently that no matter what batch size we use for Metron
> > indexing feeds, as long as we start using different batch size for
> > different Metron feeds, indexing topology throughput will start dropping
> > due to the high error rate! So I was wondering whether based on the
> current
> > indexing topology design, we have to choose the same batch size for all
> the
> > feeds or not. Otherwise, throughout will be dropped. I assume since it is
> > acceptable to use different batch sizes for different feeds, it is not
> > expected by design.
> >
> > Moreover, I have noticed in practice that even if we change the batch
> > size, it will not affect the messages that are already in enrichments or
> > indexing topics, and it will only affect the new messages that are coming
> > to the parser. Therefore, we need to let all the messages pass the
> indexing
> > topology so that we can change the batch size!
> >
> > It would be great if we can have more details regarding the design of
> this
> > section so we can understand our observations are based on the design or
> > some kind of bug.
> >
> > Regards,
> > Ali
> >
>
>
>
> --
> A.Nazemian
>
>


--
A.Nazemian


Re: Heterogeneous indexing batch size for different Metron feeds

2017-12-04 Thread Ali Nazemian
No specific error in the logs. I haven't enabled debug/trace, though.

On Tue, Dec 5, 2017 at 11:54 AM, Otto Fowler 
wrote:

> My first thought is what are the errors when you get a high error rate?
>
>
> On December 4, 2017 at 19:34:29, Ali Nazemian (alinazem...@gmail.com)
> wrote:
>
> Any thoughts?
>
> On Sun, Dec 3, 2017 at 11:27 PM, Ali Nazemian 
> wrote:
>
> > Hi,
> >
> > We have noticed recently that no matter what batch size we use for
> Metron
> > indexing feeds, as long as we start using different batch size for
> > different Metron feeds, indexing topology throughput will start dropping
> > due to the high error rate! So I was wondering whether based on the
> current
> > indexing topology design, we have to choose the same batch size for all
> the
> > feeds or not. Otherwise, throughout will be dropped. I assume since it
> is
> > acceptable to use different batch sizes for different feeds, it is not
> > expected by design.
> >
> > Moreover, I have noticed in practice that even if we change the batch
> > size, it will not affect the messages that are already in enrichments or
> > indexing topics, and it will only affect the new messages that are
> coming
> > to the parser. Therefore, we need to let all the messages pass the
> indexing
> > topology so that we can change the batch size!
> >
> > It would be great if we can have more details regarding the design of
> this
> > section so we can understand our observations are based on the design or
> > some kind of bug.
> >
> > Regards,
> > Ali
> >
>
>
>
> --
> A.Nazemian
>
>


-- 
A.Nazemian


Re: Heterogeneous indexing batch size for different Metron feeds

2017-12-04 Thread Ali Nazemian
Any thoughts?

On Sun, Dec 3, 2017 at 11:27 PM, Ali Nazemian  wrote:

> Hi,
>
> We have noticed recently that no matter what batch size we use for Metron
> indexing feeds, as long as we start using different batch size for
> different Metron feeds, indexing topology throughput will start dropping
> due to the high error rate! So I was wondering whether based on the current
> indexing topology design, we have to choose the same batch size for all the
> feeds or not. Otherwise, throughout will be dropped. I assume since it is
> acceptable to use different batch sizes for different feeds, it is not
> expected by design.
>
> Moreover, I have noticed in practice that even if we change the batch
> size, it will not affect the messages that are already in enrichments or
> indexing topics, and it will only affect the new messages that are coming
> to the parser. Therefore, we need to let all the messages pass the indexing
> topology so that we can change the batch size!
>
> It would be great if we can have more details regarding the design of this
> section so we can understand our observations are based on the design or
> some kind of bug.
>
> Regards,
> Ali
>



-- 
A.Nazemian


Heterogeneous indexing batch size for different Metron feeds

2017-12-03 Thread Ali Nazemian
Hi,

We have noticed recently that no matter what batch sizes we use for the
Metron indexing feeds, as soon as we start using different batch sizes for
different feeds, the indexing topology throughput starts dropping due to a
high error rate! So I was wondering whether, based on the current indexing
topology design, we have to choose the same batch size for all the feeds;
otherwise, throughput drops. Since the configuration accepts different
batch sizes per feed, I assume this behaviour is not by design.

Moreover, I have noticed in practice that even if we change the batch size,
it will not affect the messages that are already in the enrichment or
indexing topics; it will only affect the new messages coming into the
parser. Therefore, we need to let all the existing messages pass through
the indexing topology before we can change the batch size.

It would be great if we could get more details regarding the design of this
part so we can understand whether our observations reflect the design or
some kind of bug.
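
For reference, this is roughly how we set per-feed batch sizes in the
per-sensor indexing JSON; a sketch only (the feed name, writers, and values
here are illustrative):

```json
{
  "elasticsearch": { "index": "bro", "batchSize": 100, "enabled": true },
  "hdfs":          { "index": "bro", "batchSize": 500, "enabled": true }
}
```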

Regards,
Ali


Re: [DISCUSS] Are/how are you using the ES data pruner?

2017-11-27 Thread Ali Nazemian
Sorry, Michael. I am having some issues sharing any code right now. It
seems we need to go through internal verification for anything we want to
share. BTW, the Curator script I mentioned is very simple and nothing
special; it isn't worth waiting for.

Cheers,
Ali

On Tue, Nov 28, 2017 at 9:56 AM, Michael Miklavcic <
michael.miklav...@gmail.com> wrote:

> It's a worthy mention. Our existing pruner wouldn't be able to handle Solr
> without modification, so we'd either need something native to Solr or
> something custom.
>
> Mike
>
> On Mon, Nov 27, 2017 at 3:46 PM, James Sirota  wrote:
>
> > One thing to keep in mind, as we will be introducing Solr shortly, is to
> > find if something similar to curator exists for Solr.  But we'll cross
> that
> > bridge when we get there
> >
> > 22.11.2017, 22:58, "Ali Nazemian" :
> > > Sure. I will have a chat internally and come back to you shortly. It
> > > was quick and dirty work, really, just a temporary fix. However, it
> > > might be a good starting point.
> > >
> > > On Thu, Nov 23, 2017 at 3:31 PM, Michael Miklavcic <
> > > michael.miklav...@gmail.com> wrote:
> > >
> > >>  Thanks Ali, that's good feedback. Would you be willing to share any
> of
> > your
> > >>  Curator calls/config and use cases with the community? I'd love to
> add
> > it
> > >>  to a document around ES pruning in the short term, and maybe we could
> > look
> > >>  at how to build this into indexing at some point.
> > >>
> > >>  Cheers,
> > >>  Mike
> > >>
> > >>  On Nov 22, 2017 8:53 PM, "Ali Nazemian" 
> wrote:
> > >>
> > >>  > We tried to use it, but we ran into the same issue: it was not
> > >>  > documented, and we hit some problems with it. It also was not
> > >>  > exactly what we wanted, so we decided to build something from
> > >>  > scratch using Elasticsearch Curator. We wanted the ability to
> > >>  > manage different prune mechanisms for different feeds: a hard
> > >>  > threshold to remove an index and a soft threshold to close it.
> > >>  > Maybe that could be a feature to add to the per-feed indexing
> > >>  > JSON config file.
> > >>  >
> > >>  > Cheers,
> > >>  > Ali
> > >>  >
> > >>  > On Thu, Nov 23, 2017 at 12:20 PM, Michael Miklavcic <
> > >>  > michael.miklav...@gmail.com> wrote:
> > >>  >
> > >>  > > From what I can tell, the data pruner isn't documented anywhere,
> > so I'm
> > >>  > > curious if anybody is using this, and if so, how are you using
> it?
> > >>  > >
> > >>  > > -
> > >>  > > https://github.com/apache/metron/blob/master/metron-
> > >>  > > platform/metron-data-management/README.md
> > >>  > > -
> > >>  > > https://github.com/apache/metron/blob/master/metron-
> > >>  > > platform/metron-data-management/src/main/java/org/
> > >>  > > apache/metron/dataloads/bulk/ElasticsearchDataPrunerRunner.java
> > >>  > > -
> > >>  > > https://github.com/apache/metron/blob/master/metron-
> > >>  > > platform/metron-data-management/src/main/java/org/
> > >>  > > apache/metron/dataloads/bulk/DataPruner.java
> > >>  > >
> > >>  > > It looks to me that it allows you to specify the start date and a
> > >>  number
> > >>  > of
> > >>  > > days for lookback from the start date to purge along with a regex
> > >>  pattern
> > >>  > > to match the index name. It also does not look like it has any
> > built-in
> > >>  > > scheduling semantics, so I assume this was a cron job. I think
> that
> > >>  about
> > >>  > > covers it. Anything I've missed?
> > >>  > >
> > >>  > > I'm adding a quick doc write-up to METRON-939 (
> > >>  > > https://github.com/apache/metron/pull/840) for using Curator to
> > prune
> > >>  > > indices from Elasticsearch. It is desirable to make sure I've
> > covered
> > >>  > > existing use cases.
> > >>  > >
> > >>  > > Best,
> > >>  > > Mike
> > >>  > >
> > >>  >
> > >>  >
> > >>  >
> > >>  > --
> > >>  > A.Nazemian
> > >>  >
> > >
> > > --
> > > A.Nazemian
> >
> > ---
> > Thank you,
> >
> > James Sirota
> > PMC- Apache Metron
> > jsirota AT apache DOT org
> >
>



-- 
A.Nazemian


Re: Using Storm Resource Aware Scheduler

2017-11-26 Thread Ali Nazemian
Sounds great, Simon. We will work on refactoring our design to align with
the metadata feature. As long as we can use the same parser, there is no
technical reason we cannot use the same feed. However, I need to look into
the details to understand how complex it would be to merge different
tenants at this point. Hopefully, it shouldn't be too complex.
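
For anyone else following along, my understanding is that the metadata
approach is enabled per sensor in the parser config; the flag names below
are as I understand them from the metadata docs Simon links, and the topic
name is illustrative:

```json
{
  "parserClassName": "org.apache.metron.parsers.GrokParser",
  "sensorTopic": "shared_syslog",
  "readMetadata": true,
  "mergeMetadata": true
}
```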

BTW, I don't have permission to close this ticket, so I have just linked it
as a duplicate of the main ticket, as you mentioned.

Cheers,
Ali

On Mon, Nov 27, 2017 at 9:06 AM, Simon Elliston Ball <
si...@simonellistonball.com> wrote:

> The multi-tenancy through meta-data method mentioned is designed to solve
> exactly that problem and has been in the project for some time now. The
> goal would be to have one topology per data schema and use the key to
> communicate tenant meta-data. See https://archive.apache.org/
> dist/metron/0.4.1/site-book/metron-platform/metron-
> parsers/index.html#Metadata <https://archive.apache.org/
> dist/metron/0.4.1/site-book/metron-platform/metron-
> parsers/index.html#Metadata> for details.
>
> The storm issue you mention is something for the storm project to look at,
> so we can’t really comment on their behalf here, but yeah, it will be nice
> to have storm do some of the tuning for us at some point.
>
> Note that the UI already has the tuning parameters you’re talking about in
> the latest version, so there is no need for the new JIRA (
> https://issues.apache.org/jira/browse/METRON-1330 <
> https://issues.apache.org/jira/browse/METRON-1330>). It should be closed
> as a duplicate of https://issues.apache.org/jira/browse/METRON-1161 <
> https://issues.apache.org/jira/browse/METRON-1161>.
>
> Simon
>
> > On 26 Nov 2017, at 02:15, Ali Nazemian  wrote:
> >
> > Oops, I didn't know that. Happy Thanksgiving.
> >
> > Thanks, Otto and Simon.
> >
> > As you are aware of our use cases, with the current limitations of
> > multi-tenancy support, we are creating a feed per tenant per device.
> > Sometimes the amount of traffic we receive for a given tenant and
> > device is far too small to justify dedicating a full Storm slot to it.
> > Therefore, I was hoping to make it at least theoretically possible to
> > tune resources more wisely, but it is not going to be easy at all. This
> > is probably a use case where a Storm auto-scaling mechanism would be
> > very nice to have.
> >
> > https://issues.apache.org/jira/browse/STORM-594
> >
> > On another note, I recall there was a PR to address multi-tenancy by
> > adding meta-data to the Kafka topic. However, I lost track of that
> > feature, so maybe this situation can be tackled at another level by
> > merging different parsers.
> >
> > I will create a Jira ticket to add the ability to tune Metron parser
> > feeds at the Storm level in the UI. Right now it is a little hard to
> > maintain tuning configurations per parser, and as soon as somebody
> > restarts them from the Management UI/Ambari, they will be overwritten.
> >
> >
> > Cheers,
> > Ali
> >
> > On Sat, Nov 25, 2017 at 3:36 AM, Simon Elliston Ball <
> > si...@simonellistonball.com> wrote:
> >
> >> Implementing the resource aware scheduler would be decidedly
> non-trivial.
> >> Every topology will need additional configuration to tune for things
> like
> >> memory sizes, which is not going to buy you much change. So, at the
> >> micro-tuning level of parser this doesn’t make a lot of sense.
> >>
> >> However, it may be relevant to consider separate tuning for parsers in
> >> general vs the core enrichment and indexing topologies (potentially also
> >> for separate indexing topologies when this comes in) and the resource
> >> scheduler could provide a theoretical benefit there.
> >>
> >> Specifying resource requirements per parser topology might sound like a
> >> good idea, but if your parsers are working the way they should, they
> should
> >> be using a small amount of memory as their default size, and achieving
> >> additional resource use by multiplying workers and executors (to get
> higher
> >> usage per slot) and balance the load that way. To be honest, the only
> >> difference you’re going to get from the RAS is to add a bunch of tuning
> >> parameters which allow slightly different granularity of units for
> things
> >> like memory.
> >>
> >> The other RAS feature which might be a good add is prioritisation of
> >> different parser topologies, but again, this is probably not somethin

Re: Using Storm Resource Aware Scheduler

2017-11-25 Thread Ali Nazemian
Oops, I didn't know that. Happy Thanksgiving.

Thanks, Otto and Simon.

As you are aware of our use cases, with the current limitations of
multi-tenancy support, we are creating a feed per tenant per device.
Sometimes the amount of traffic we receive for a given tenant and device is
far too small to justify dedicating a full Storm slot to it. Therefore, I
was hoping to make it at least theoretically possible to tune resources
more wisely, but it is not going to be easy at all. This is probably a use
case where a Storm auto-scaling mechanism would be very nice to have.

https://issues.apache.org/jira/browse/STORM-594
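
For reference, my understanding is that RAS is switched on cluster-wide and
then tuned per topology; the config keys below are from the Storm RAS
documentation as I understand it, and the values are purely illustrative:

```yaml
# storm.yaml: switch the cluster to the Resource Aware Scheduler
storm.scheduler: "org.apache.storm.scheduler.resource.ResourceAwareScheduler"

# Per-component defaults, overridable in each topology's own config
topology.component.resources.onheap.memory.mb: 128.0
topology.component.cpu.pcore.percent: 10.0

# Cap on worker heap, so small feeds don't reserve a full-size slot
topology.worker.max.heap.size.mb: 512.0
```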

On another note, I recall there was a PR to address multi-tenancy by adding
meta-data to the Kafka topic. However, I lost track of that feature, so
maybe this situation can be tackled at another level by merging different
parsers.

I will create a Jira ticket to add the ability to tune Metron parser feeds
at the Storm level in the UI. Right now it is a little hard to maintain
tuning configurations per parser, and as soon as somebody restarts them
from the Management UI/Ambari, they will be overwritten.


Cheers,
Ali

On Sat, Nov 25, 2017 at 3:36 AM, Simon Elliston Ball <
si...@simonellistonball.com> wrote:

> Implementing the resource aware scheduler would be decidedly non-trivial.
> Every topology will need additional configuration to tune for things like
> memory sizes, which is not going to buy you much change. So, at the
> micro-tuning level of parser this doesn’t make a lot of sense.
>
> However, it may be relevant to consider separate tuning for parsers in
> general vs the core enrichment and indexing topologies (potentially also
> for separate indexing topologies when this comes in) and the resource
> scheduler could provide a theoretical benefit there.
>
> Specifying resource requirements per parser topology might sound like a
> good idea, but if your parsers are working the way they should, they should
> be using a small amount of memory as their default size, and achieving
> additional resource use by multiplying workers and executors (to get higher
> usage per slot) and balance the load that way. To be honest, the only
> difference you’re going to get from the RAS is to add a bunch of tuning
> parameters which allow slightly different granularity of units for things
> like memory.
>
> The other RAS feature which might be a good add is prioritisation of
> different parser topologies, but again, this is probably not something you
> want to push hard on unless you are severely limited in resources (in which
> case, why not just add another node, it will be cheaper than spending all
> that time micro-tuning the resource requirements for each data feed).
>
> Right now we do allow a lot of micro-tuning of parallelism around things
> like the count of executor threads, which achieves roughly the equivalent
> of the CPU-based limits in the RAS.
>
> TL;DR:
>
> If you’re not using resource pools for different users and using the idea
> that prioritisation can lead to arbitrary kills, all you’re getting is a
> slightly different way of tuning knobs that already exist, but you would
> get a slightly different granularity. Also, we would have to rewrite all
> the topology code to add the config endpoints for CPU and memory estimates.
>
> Simon
>
> > On 24 Nov 2017, at 07:56, Ali Nazemian  wrote:
> >
> > Any help regarding this question would be appreciated.
> >
> >
> > On Thu, Nov 23, 2017 at 8:57 AM, Ali Nazemian 
> wrote:
> >
> >> 30 mins average of CPU load by checking Ambari.
> >>
> >> On 23 Nov. 2017 00:51, "Otto Fowler"  wrote:
> >>
> >> How are you measuring the utilization?
> >>
> >>
> >> On November 22, 2017 at 08:12:51, Ali Nazemian (alinazem...@gmail.com)
> >> wrote:
> >>
> >> Hi all,
> >>
> >>
> >> One of the issues that we are dealing with is the fact that not all of
> >> the Metron feeds have the same type of resource requirements. For
> example,
> >> we have some feeds that even a single Storm slot is way more than what
> it
> >> needs. We thought we could make it more utilised in total by limiting at
> >> least the amount of available heap space per feed to the parser topology
> >> worker. However, since Storm scheduler relies on available slots, it is
> >> very hard and almost impossible to utilise the cluster in the scenario
> >> that
> >> there will be lots of different topologies with different requirements
> >> running at the same time. Therefore, on a daily basis, we can see that
> for
> >> example one of the Storm hosts is 120% utilised and another is 20%
> >> utilised! I was wondering whether we can address this situation by using
> >> Storm Resource Aware scheduler or not.
> >>
> >> P.S: it would be very nice to have a functionality to tune Storm
> >> topology-related parameters per feed in the GUI (for example in
> Management
> >> UI).
> >>
> >>
> >> Regards,
> >> Ali
> >>
> >>
> >>
> >
> >
> > --
> > A.Nazemian
>
>


-- 
A.Nazemian


Re: Using Storm Resource Aware Scheduler

2017-11-23 Thread Ali Nazemian
Any help regarding this question would be appreciated.


On Thu, Nov 23, 2017 at 8:57 AM, Ali Nazemian  wrote:

> 30 mins average of CPU load by checking Ambari.
>
> On 23 Nov. 2017 00:51, "Otto Fowler"  wrote:
>
> How are you measuring the utilization?
>
>
> On November 22, 2017 at 08:12:51, Ali Nazemian (alinazem...@gmail.com)
> wrote:
>
> Hi all,
>
>
> One of the issues that we are dealing with is the fact that not all of
> the Metron feeds have the same type of resource requirements. For example,
> we have some feeds that even a single Storm slot is way more than what it
> needs. We thought we could make it more utilised in total by limiting at
> least the amount of available heap space per feed to the parser topology
> worker. However, since Storm scheduler relies on available slots, it is
> very hard and almost impossible to utilise the cluster in the scenario
> that
> there will be lots of different topologies with different requirements
> running at the same time. Therefore, on a daily basis, we can see that for
> example one of the Storm hosts is 120% utilised and another is 20%
> utilised! I was wondering whether we can address this situation by using
> Storm Resource Aware scheduler or not.
>
> P.S: it would be very nice to have a functionality to tune Storm
> topology-related parameters per feed in the GUI (for example in Management
> UI).
>
>
> Regards,
> Ali
>
>
>


-- 
A.Nazemian


Re: [DISCUSS] Are/how are you using the ES data pruner?

2017-11-22 Thread Ali Nazemian
Sure. I will have a chat internally and come back to you shortly. It was a
quick and dirty work actually just to fix this temporarily. However, it
might be a good starting point.

On Thu, Nov 23, 2017 at 3:31 PM, Michael Miklavcic <
michael.miklav...@gmail.com> wrote:

> Thanks Ali, that's good feedback. Would you be willing to share any of your
> Curator calls/config and use cases with the community? I'd love to add it
> to a document around ES pruning in the short term, and maybe we could look
> at how to build this into indexing at some point.
>
> Cheers,
> Mike
>
> On Nov 22, 2017 8:53 PM, "Ali Nazemian"  wrote:
>
> > We tried to use it, but we ran into the same issue: it was not
> > documented, and we hit some problems with it. It also was not exactly
> > what we wanted, so we decided to build something from scratch using
> > Elasticsearch Curator. We wanted the ability to manage different prune
> > mechanisms for different feeds: a hard threshold to remove an index and
> > a soft threshold to close it. Maybe that could be a feature to add to
> > the per-feed indexing JSON config file.
> >
> > Cheers,
> > Ali
> >
> > On Thu, Nov 23, 2017 at 12:20 PM, Michael Miklavcic <
> > michael.miklav...@gmail.com> wrote:
> >
> > > From what I can tell, the data pruner isn't documented anywhere, so I'm
> > > curious if anybody is using this, and if so, how are you using it?
> > >
> > >-
> > >https://github.com/apache/metron/blob/master/metron-
> > > platform/metron-data-management/README.md
> > >-
> > >https://github.com/apache/metron/blob/master/metron-
> > > platform/metron-data-management/src/main/java/org/
> > > apache/metron/dataloads/bulk/ElasticsearchDataPrunerRunner.java
> > >-
> > >https://github.com/apache/metron/blob/master/metron-
> > > platform/metron-data-management/src/main/java/org/
> > > apache/metron/dataloads/bulk/DataPruner.java
> > >
> > > It looks to me that it allows you to specify the start date and a
> number
> > of
> > > days for lookback from the start date to purge along with a regex
> pattern
> > > to match the index name. It also does not look like it has any built-in
> > > scheduling semantics, so I assume this was a cron job. I think that
> about
> > > covers it. Anything I've missed?
> > >
> > > I'm adding a quick doc write-up to METRON-939 (
> > > https://github.com/apache/metron/pull/840) for using Curator to prune
> > > indices from Elasticsearch. It is desirable to make sure I've covered
> > > existing use cases.
> > >
> > > Best,
> > > Mike
> > >
> >
> >
> >
> > --
> > A.Nazemian
> >
>



-- 
A.Nazemian


Re: [DISCUSS] Are/how are you using the ES data pruner?

2017-11-22 Thread Ali Nazemian
We tried to use it, but we ran into the same issue: it was not documented,
and we hit some problems with it. It also was not exactly what we wanted,
so we decided to build something from scratch using Elasticsearch Curator.
We wanted the ability to manage different prune mechanisms for different
feeds: a hard threshold to remove an index and a soft threshold to close
it. Maybe that could be a feature to add to the per-feed indexing JSON
config file.
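
To give a rough idea of the shape of it (without sharing our actual
config), a Curator actions file for this pattern would look something like
the following; the index prefix, timestring, and thresholds are
illustrative:

```yaml
actions:
  1:
    action: close
    description: "Soft threshold: close indices older than 30 days"
    options:
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: bro_index_
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 30
  2:
    action: delete_indices
    description: "Hard threshold: delete indices older than 90 days"
    options:
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: bro_index_
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 90
```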

Cheers,
Ali

On Thu, Nov 23, 2017 at 12:20 PM, Michael Miklavcic <
michael.miklav...@gmail.com> wrote:

> From what I can tell, the data pruner isn't documented anywhere, so I'm
> curious if anybody is using this, and if so, how are you using it?
>
>-
>https://github.com/apache/metron/blob/master/metron-
> platform/metron-data-management/README.md
>-
>https://github.com/apache/metron/blob/master/metron-
> platform/metron-data-management/src/main/java/org/
> apache/metron/dataloads/bulk/ElasticsearchDataPrunerRunner.java
>-
>https://github.com/apache/metron/blob/master/metron-
> platform/metron-data-management/src/main/java/org/
> apache/metron/dataloads/bulk/DataPruner.java
>
> It looks to me that it allows you to specify the start date and a number of
> days for lookback from the start date to purge along with a regex pattern
> to match the index name. It also does not look like it has any built-in
> scheduling semantics, so I assume this was a cron job. I think that about
> covers it. Anything I've missed?
>
> I'm adding a quick doc write-up to METRON-939 (
> https://github.com/apache/metron/pull/840) for using Curator to prune
> indices from Elasticsearch. It is desirable to make sure I've covered
> existing use cases.
>
> Best,
> Mike
>



-- 
A.Nazemian


Re: Using Storm Resource Aware Scheduler

2017-11-22 Thread Ali Nazemian
30 mins average of CPU load by checking Ambari.

On 23 Nov. 2017 00:51, "Otto Fowler"  wrote:

How are you measuring the utilization?


On November 22, 2017 at 08:12:51, Ali Nazemian (alinazem...@gmail.com)
wrote:

Hi all,


One of the issues that we are dealing with is the fact that not all of the
Metron feeds have the same type of resource requirements. For example, we
have some feeds for which even a single Storm slot is way more than what
they need. We thought we could improve overall utilisation by at least
limiting the amount of heap space available to each feed's parser topology
worker. However, since the Storm scheduler relies on available slots, it is
very hard, almost impossible, to utilise the cluster well when there are
lots of different topologies with different requirements running at the
same time. Therefore, on a daily basis we can see that, for example, one of
the Storm hosts is 120% utilised and another is 20% utilised! I was
wondering whether we can address this situation by using the Storm Resource
Aware Scheduler or not.

P.S.: it would be very nice to have functionality for tuning Storm
topology-related parameters per feed in the GUI (for example, in the
Management UI).


Regards,
Ali


Using Storm Resource Aware Scheduler

2017-11-22 Thread Ali Nazemian
Hi all,


One of the issues that we are dealing with is the fact that not all of the
Metron feeds have the same type of resource requirements. For example, we
have some feeds for which even a single Storm slot is way more than what
they need. We thought we could improve overall utilisation by at least
limiting the amount of heap space available to each feed's parser topology
worker. However, since the Storm scheduler relies on available slots, it is
very hard, almost impossible, to utilise the cluster well when there are
lots of different topologies with different requirements running at the
same time. Therefore, on a daily basis we can see that, for example, one of
the Storm hosts is 120% utilised and another is 20% utilised! I was
wondering whether we can address this situation by using the Storm Resource
Aware Scheduler or not.

P.S.: it would be very nice to have functionality for tuning Storm
topology-related parameters per feed in the GUI (for example, in the
Management UI).


Regards,
Ali


Re: Metron 0.4.2 release date

2017-10-11 Thread Ali Nazemian
Hi Michael,

My entire concern regarding the ES 5.x upgrade is being able to use the
Alert UI and Metron indexing on ES 5.x. Basically, as a proof of concept,
we have started using ES 5.x in parallel, using a NiFi flow for the Metron
indexing, to investigate Elasticsearch's capabilities for graph
visualisation/analytics as well as anomaly detection. Clearly, we don't
expect to go to production in this state, and I wanted to plan for the
version of Metron that will support it. However, I am not expecting ES 5.x
to be included as part of the Ambari mpack; as long as Metron can support
it, I am fine.

Regards,
Ali

On Tue, Oct 10, 2017 at 3:33 AM, Michael Miklavcic <
michael.miklav...@gmail.com> wrote:

> Hey Ali, I'm currently deep in ES 5.x migration work. If you have any
> specific concerns or requests that have not yet been covered on the mailing
> list, please feel free to comment.
>
> Best,
> Mike
>
> On Mon, Oct 9, 2017 at 9:05 AM, James Sirota  wrote:
>
> > I would expect ES 5.x support to be in the next version of Metron
> >
> > 08.10.2017, 18:18, "zeo...@gmail.com" :
> > > There's an ongoing conversation regarding client support in Metron here
> > > <https://lists.apache.org/thread.html/0c5a837c901dd057420dd8c6b673dc
> > 33ba88a8d97545d5b58856cfe8@%3Cdev.metron.apache.org%3E>
> > > .
> > >
> > > Jon
> > >
> > > On Sun, Oct 8, 2017 at 9:02 PM Ali Nazemian 
> > wrote:
> > >
> > >>  Hi Jon,
> > >>
> > >>  For the Elasticsearch, I am looking for the support from the client
> > side
> > >>  rather than a full Metron mpack that includes ES 5.x. As long as
> Metron
> > >>  Alert-UI and indexing can support ES 5, I am fine. Is that the scope
> of
> > >>  Metron-939?
> > >>
> > >>  Cheers,
> > >>  Ali
> > >>
> > >>  On Mon, Oct 9, 2017 at 11:04 AM, zeo...@gmail.com 
> > >>  wrote:
> > >>
> > >>  > As of right now I'm not aware of any discussions regarding a next
> > >>  release,
> > >>  > and I believe the METRON-777 features are at least a few months out
> > from
> > >>  > being reviewed and merged in (There is a fair amount of work in
> > chunking
> > >>  it
> > >>  > up to be reviewed, then work to review and merge it in). ES 5.x is
> > also
> > >>  in
> > >>  > progress but not even open as a PR yet, let alone in master and
> > ready to
> > >>  be
> > >>  > included in a release. I'm really looking forward to those
> > >>  > changes/improvements as well, but I wouldn't be expecting them in
> the
> > >>  next
> > >>  > few months, and trying to look further into the future than that at
> > this
> > >>  > point would be difficult.
> > >>  >
> > >>  > That said, if anybody else has a more detailed timeline in mind, I
> > would
> > >>  > love to hear more.
> > >>  >
> > >>  > Jon
> > >>  >
> > >>  > On Sun, Oct 8, 2017, 09:05 Ali Nazemian 
> > wrote:
> > >>  >
> > >>  > > Hi all,
> > >>  > >
> > >>  > > I was wondering when Metron 0.4.2 will be released and whether it
> > >>  > includes
> > >>  > > Metron-777 and Elasticsearch 5.x or not?
> > >>  > >
> > >>  > > Cheers,
> > >>  > > Ali
> > >>  > >
> > >>  > --
> > >>  >
> > >>  > Jon
> > >>  >
> > >>
> > >>  --
> > >>  A.Nazemian
> > > --
> > >
> > > Jon
> >
> > ---
> > Thank you,
> >
> > James Sirota
> > PPMC- Apache Metron (Incubating)
> > jsirota AT apache DOT org
> >
>



-- 
A.Nazemian


Re: Metron 0.4.2 release date

2017-10-08 Thread Ali Nazemian
Hi Jon,

For Elasticsearch, I am looking for support on the client side rather than
a full Metron mpack that includes ES 5.x. As long as the Metron Alert UI
and indexing can support ES 5, I am fine. Is that the scope of METRON-939?

Cheers,
Ali

On Mon, Oct 9, 2017 at 11:04 AM, zeo...@gmail.com  wrote:

> As of right now I'm not aware of any discussions regarding a next release,
> and I believe the METRON-777 features are at least a few months out from
> being reviewed and merged in (There is a fair amount of work in chunking it
> up to be reviewed, then work to review and merge it in).  ES 5.x is also in
> progress but not even open as a PR yet, let alone in master and ready to be
> included in a release.  I'm really looking forward to those
> changes/improvements as well, but I wouldn't be expecting them in the next
> few months, and trying to look further into the future than that at this
> point would be difficult.
>
> That said, if anybody else has a more detailed timeline in mind, I would
> love to hear more.
>
> Jon
>
> On Sun, Oct 8, 2017, 09:05 Ali Nazemian  wrote:
>
> > Hi all,
> >
> > I was wondering when Metron 0.4.2 will be released and whether it
> includes
> > Metron-777 and Elasticsearch 5.x or not?
> >
> > Cheers,
> > Ali
> >
> --
>
> Jon
>



-- 
A.Nazemian


Metron 0.4.2 release date

2017-10-08 Thread Ali Nazemian
Hi all,

I was wondering when Metron 0.4.2 will be released and whether it includes
Metron-777 and Elasticsearch 5.x or not?

Cheers,
Ali


Elasticsearch 5.x upgrade

2017-07-16 Thread Ali Nazemian
Hi all,

I've heard there is a plan to upgrade Elasticsearch from 2.x to 5.x for
Metron and the Ambari mpack. I was wondering when that will happen. Is
there any part of Metron's Elasticsearch indexing that will be impacted by
this upgrade, like any change to the way bulk indexing is done?

Cheers,
Ali


Re: UI pivotting / aggregation backend

2017-07-08 Thread Ali Nazemian
Given that some people prefer Solr and some prefer Elasticsearch, having an
abstraction layer over Solr and Elasticsearch would be really great.
However, I haven't seen any framework out there that can provide the
required level of search abstraction on top of Solr and Elasticsearch,
though I guess there should be one. Something like Apache Calcite, but more
specific to search queries. Without that, there is too much implementation
work.
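
For what it's worth, the multi-level bucketing described in the quoted
thread maps fairly directly onto nested terms aggregations in the ES 2.4
DSL; roughly the following, with illustrative field names:

```json
{
  "size": 0,
  "query": { "term": { "alert_severity": "high" } },
  "aggs": {
    "by_user": {
      "terms": { "field": "user", "size": 10 },
      "aggs": {
        "by_dst_ip": {
          "terms": { "field": "ip_dst_addr", "size": 10 },
          "aggs": {
            "by_severity": { "terms": { "field": "alert_severity" } }
          }
        }
      }
    }
  }
}
```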



On Fri, Jul 7, 2017 at 6:48 PM, Casey Stella  wrote:

> I just want to chime in and support the notion of an abstraction layer
> between the UI and the indexed stores.  I think that having an API that
> people can conform to is going to be important as people want to plug in
> their own backing indices in the future.
>
> Casey
>
> On Thu, Jul 6, 2017 at 2:11 PM, Justin Leet  wrote:
>
> > I wanted to bring up some stuff on the backend of our UI, and get
> > thoughts (+ things I overlooked, etc.).  There's also a couple points at
> > the end that merit discussion about how we handle things, since it gets
> > into how we handle our ES templates (since we generally want to aggregate
> > on raw fields, not analyzed ones).
> >
> > To set the use case a bit, when we're looking through alerts in the UI,
> > we're going to want to be able to start pivoting and grouping in the UI.
> >
> > For example, given a list of alerts, we may want to follow an ordering of
> > groupings like so:
> >
> > All Alerts
> > --> Bucketed by User
> > --> Then further by Destination IP
> > --> Then further by Severity
> >
> > The stuff I expect we'll want to be able to do:
> > * Pivot through multiple layers (as in the example above).
> > * Get counts within each bucket (Do we have a lot of high severity
> alerts?
> > Mostly medium? etc?)
> > * Get a subset of fields (I assume we don't want every entire doc that
> > comes back in the bucket)
> > * Pagination (if I have > X docs, show me X and let me retrieve more as
> > needed)
> > * Sorting within a bucket (I may want to sort by time, by userid, etc.)
> > * Filtering (Be able to do this stuff while only showing high severity
> > alerts)
> >
> > In terms of actually implementing this, to the best of my limited
> knowledge
> > (and playing around with ES looking into this), this seems like pretty
> > doable stuff, out of the box. See:
> > https://www.elastic.co/guide/en/elasticsearch/reference/2.
> > 4/search-aggregations-bucket-terms-aggregation.html
> >
> > There are two main pain points I see in this:
> > * Actually constructing these queries.  I don't know that we've
> explicitly
> > said we want a layer of abstraction between the UI and the real time
> store,
> > but I strongly suggest we have one.  Theoretically, we should be able to
> > support (at least) Solr and ES in the UI, not just one.  Unfortunately,
> > since they aren't the same syntax, this means we have two impls, and I'd
> > personally like to see an abstraction that delegates appropriately.
> >
> > * Aggregations in ES function post analysis. This means that we'll
> > typically want the raw field value to be able to aggregated on.  In ES
> > implementation, this means a "not_analyzed" field. Glancing (incredibly)
> > briefly through our templates, we do have some string values that are
> > analyzed (and I have no idea if they're generally relevant to this UI or
> > not, I just didn't look).  I'm also assuming Stellar enrichments are
> > analyzed right now.  I'm also unsure what happens to metadata (
> > https://github.com/apache/metron/pull/621)  Essentially the question is:
> > "How do we handle this, particularly since we're a pretty dynamic
> > system?"
> >
>



-- 
A.Nazemian
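To make the bucketed-query discussion above concrete: the requirements Nick lists (terms buckets with counts, a sorted and paginated field subset per bucket, and a severity filter) map onto a single ES 2.4 request body using a terms aggregation with a top_hits sub-aggregation. A sketch in Python, where the field names (`alert_severity`, `timestamp`, `ip_src_addr`) are illustrative placeholders, not Metron's actual template fields:

```python
import json

def build_grouped_alert_query(group_by, severity=None, page_size=10):
    """Sketch of an ES 2.4-style request body that buckets alerts by a field,
    counts each bucket, and returns a sorted, paginated subset of fields per
    bucket via a top_hits sub-aggregation. Field names are illustrative."""
    query = {"match_all": {}}
    if severity is not None:
        # Filtering: only bucket documents matching the given severity.
        query = {"bool": {"filter": [{"term": {"alert_severity": severity}}]}}
    return {
        "size": 0,  # we only want the aggregations, not top-level hits
        "query": query,
        "aggs": {
            "by_field": {
                # Terms aggregation gives one bucket per distinct value,
                # each with a doc_count ("counts within each bucket").
                "terms": {"field": group_by},
                "aggs": {
                    "docs": {
                        "top_hits": {
                            "size": page_size,                 # per-bucket pagination
                            "sort": [{"timestamp": "desc"}],   # per-bucket sorting
                            # field subset instead of whole docs:
                            "_source": ["ip_src_addr", "alert_severity"],
                        }
                    }
                },
            }
        },
    }

body = build_grouped_alert_query("ip_src_addr", severity="high")
print(json.dumps(body, indent=2))
```

An abstraction layer of the kind proposed would emit this body for the ES implementation and translate the same logical request into Solr's facet syntax for a Solr implementation.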


Re: Post-parsing and Enrichment test framework

2017-07-08 Thread Ali Nazemian
Hi Nick,

Something like GetProfileTest is exactly what I am looking for. Although
following this test case is good enough at this step, it would be great if
a test framework could be implemented to make that easier. It is probably
not a very critical requirement, but it would be nice to have.

Cheers,
Ali

On Sat, Jul 8, 2017 at 1:02 AM, Nick Allen  wrote:

> >
> > Is there any other approach to check
> > that through writing Java test-cases? Writing test-cases would be easier
> > for keeping track of changes.
>
>
> While the Shell is great, it does not serve as an automated, repeatable
> test case.
>
> An alternative approach along these lines, is to create your own JUnit test
> cases that leverage a Stellar executor to execute arbitrary expressions and
> validate the result.  This is what we do in any unit tests for Stellar
> functions.  For example, see `GetProfileTest` that tests the Profiler's
> `PROFILE_GET` function.
>
> Do you think these examples get you 80% there?
>
>
>
>
>
>
>
> On Fri, Jul 7, 2017 at 10:54 AM, Nick Allen  wrote:
>
> > For experimenting or validating specific Stellar expressions, the Stellar
> > Shell is perfect.  To do this, you just have to remember that when your
> > Stellar expressions execute all of the fields of the message are
> in-scope.
> >
> > For example, here is a quick session where I mock-up some logic that
> sends
> > a message to Triage if a hypothetical "count" field is greater than 22.
> In
> > this example, I expect my telemetry to look-like the following.
> >
> > {
> >   "ip_src_addr": "10.0.0.2",
> >   "ip_dst_addr": "10.0.0.3",
> >   "ip_src_port": "22",
> >   "ip_dst_port": "12345",
> >   "source.type": "bro",
> >   "count": "22"
> > }
> >
> >
> > Like I said, when my Stellar expression executes each of the fields from
> > the message are in-scope as variables.  To replicate this in the shell,
> all
> > I have to do is create those variables as I would expect them to exist in
> > the telemetry.
> >
> > [Stellar]>>>
> > [Stellar]>>> ip_src_addr := "10.0.0.2"
> > [Stellar]>>> ip_dst_addr := "10.0.0.3"
> > [Stellar]>>> ip_src_port := 22
> > [Stellar]>>> ip_dst_port := 12345
> > [Stellar]>>> source.type := "bro"
> > [Stellar]>>> count := 22
> > [Stellar]>>> is_alert := if count > 22 then true else false
> > [Stellar]>>> is_alert
> >
> > false
> >
> > This session helped me validate the `is_alert` expression that I will add
> > as an enrichment expression.
> >
> > Hope that answered at least some of your questions.
> >
> >
> >
> >
> > On Tue, Jul 4, 2017 at 10:23 AM, Ali Nazemian 
> > wrote:
> >
> >> Hi Simon,
> >>
> >> Yeah, it does, but we are looking for a way to mock a specific message
> >> and check some post-parse/enrichment stuff. Is that achievable via
> >> Stellar shell? Right now we are checking that either through end-to-end
> >> testing, or changing flux files to check them section by section.
> >> Unfortunately, both approaches are time-consuming. We are using the
> >> Stellar shell only for checking the validity of Stellar functions one by
> >> one right now.
> >>
> >> Suppose there is an approach by which we can define a JSON object as the
> >> output of a parser. Then, we can apply a set of post-parsing and
> >> enrichment processes to that JSON object and check the output. Is that
> >> achievable via Stellar shell? Do you have any sample that we can follow
> >> to understand Stellar shell capabilities for this scenario? Is there any
> >> other approach to check that through writing Java test-cases? Writing
> >> test-cases would be easier for keeping track of changes.
> >>
> >> Cheers,
> >> Ali
> >>
> >>
> >> On Wed, Jul 5, 2017 at 12:06 AM, Simon Elliston Ball <
> >> si...@simonellistonball.com> wrote:
> >>
> >> > You should probably use the Stellar REPL (../metron/bin/stellar -z
> $ZK)
> >> > which gives you a kind of Stellar playground.
> >> >
> >> > Simon
> >> >
> >> > > On 4 Jul 2017, at 15:02, Ali Nazemian 
> wrote:
> >> > >
> >> > > Hi all,
> >> > >
> >> > > I was wondering if there is a test framework we can use for Stellar
> >> > > post-parsing and enrichment use cases. It is very time-consuming to
> >> > verify
> >> > > use cases end-to-end. Therefore, I am looking for a way of mocking
> use
> >> > > cases step by step to speed up our development.
> >> > >
> >> > > Regards,
> >> > > Ali
> >> >
> >> >
> >>
> >>
> >> --
> >> A.Nazemian
> >>
> >
> >
>



-- 
A.Nazemian
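The `GetProfileTest` pattern Nick describes (run an arbitrary expression through an executor with the message fields in scope, then assert on the result) is easy to see in miniature. A sketch of the same shape in Python, with `evaluate` as a stand-in for the Stellar executor; the real Metron tests run Stellar through a Java executor, not Python's `eval`:

```python
def evaluate(expression, variables):
    """Stand-in for a Stellar executor: evaluates an expression string with
    the message fields in scope as variables. Illustrative only; the actual
    JUnit tests would drive a StellarProcessor-style executor instead."""
    return eval(expression, {"__builtins__": {}}, dict(variables))

# Mock a telemetry message, as in the shell session above.
message = {
    "ip_src_addr": "10.0.0.2",
    "ip_src_port": 22,
    "count": 22,
}

# Assert on enrichment logic without standing up a topology:
# this is the automated, repeatable counterpart of the REPL session.
assert evaluate("count > 22", message) is False
assert evaluate("ip_src_port == 22", message) is True
print("expression checks passed")
```

Each assertion plays the role of one REPL interaction, so the checks live in the test suite and track changes over time.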


Re: Post-parsing and Enrichment test framework

2017-07-04 Thread Ali Nazemian
Hi Simon,

Yeah, it does, but we are looking for a way to mock a specific message and
check some post-parse/enrichment stuff. Is that achievable via Stellar
shell? Right now we are checking that either through end-to-end testing, or
changing flux files to check them section by section. Unfortunately, both
approaches are time-consuming. We are using the Stellar shell only for
checking the validity of Stellar functions one by one right now.

Suppose there is an approach by which we can define a JSON object as the
output of a parser. Then, we can apply a set of post-parsing and enrichment
processes to that JSON object and check the output. Is that achievable via
Stellar shell? Do you have any sample that we can follow to understand
Stellar shell capabilities for this scenario? Is there any other approach
to check that through writing Java test-cases? Writing test-cases would be
easier for keeping track of changes.

Cheers,
Ali


On Wed, Jul 5, 2017 at 12:06 AM, Simon Elliston Ball <
si...@simonellistonball.com> wrote:

> You should probably use the Stellar REPL (../metron/bin/stellar -z $ZK)
> which gives you a kind of Stellar playground.
>
> Simon
>
> > On 4 Jul 2017, at 15:02, Ali Nazemian  wrote:
> >
> > Hi all,
> >
> > I was wondering if there is a test framework we can use for Stellar
> > post-parsing and enrichment use cases. It is very time-consuming to
> verify
> > use cases end-to-end. Therefore, I am looking for a way of mocking use
> > cases step by step to speed up our development.
> >
> > Regards,
> > Ali
>
>


-- 
A.Nazemian


Post-parsing and Enrichment test framework

2017-07-04 Thread Ali Nazemian
Hi all,

I was wondering if there is a test framework we can use for Stellar
post-parsing and enrichment use cases. It is very time-consuming to verify
use cases end-to-end. Therefore, I am looking for a way of mocking use
cases step by step to speed up our development.

Regards,
Ali


Re: performance benchmarks on the asa parser

2017-06-09 Thread Ali Nazemian
Simon,

I have read all the emails and now I understand what you are saying.
However, I couldn't understand the effect of the predictability of latency
on enrichments.

On Fri, Jun 9, 2017 at 2:45 PM, Ali Nazemian  wrote:

> Hi Simon,
>
> We have noticed those issues as well. Can you share the changes you have
> made, so we can merge them with our version? We have implemented about
> 40-50 more ciscotags so far. It would be great if we could optimize it and
> contribute back to the community. However, we may end up reimplementing it
> as a Java parser.
>
> Cheers,
> Ali
>
> On Fri, Jun 9, 2017 at 12:55 PM, Simon Elliston Ball <
> si...@simonellistonball.com> wrote:
>
>> I thought about compile-on-first-use and cache as an approach, but
>> decided it would reduce the predictability of latency for a message, which
>> is important in the metron enrichment context. As you say, we could end up
>> growing a large number of Groks, but if the load of compile is all pushed
>> to the (hopefully very rare) topology restart event, it feels like the
>> performance trade off there is a good one, though the memory usage tradeoff
>> could start to bite if we’re getting into the hundreds I guess.
>>
>> Simon
>>
>>
>> > On 9 Jun 2017, at 03:32, Kyle Richardson 
>> wrote:
>> >
>> > I like the pre-compile idea. One concern is I see the number of grok
>> objects growing over time. This parser does not account for nearly all of
>> the possible ASA message types, currently only the most common ones. Is
>> there a middle ground implementation where we can compile on first use of a
>> grok and then hold in memory? Avoids the up front burden but should also
>> boost performance.
>> >
>> > -Kyle
>> >
>> >> On Jun 8, 2017, at 8:56 PM, Simon Elliston Ball <
>> si...@simonellistonball.com> wrote:
>> >>
>> >> The changes are pretty simple (pre-compile the grok, duh). Most other
>> >> grok parsers just use a single expression, which is already pre-compiled
>> >> (/checks assumption in code), so really it's just the ASA one because of
>> >> its strange two-stage grok.
>> >>
>> >> Shame, it would have been nice to find some more low hanging fruit.
>> >>
>> >> Simon
>> >>
>> >>> On 9 Jun 2017, at 01:52, Otto Fowler  wrote:
>> >>>
>> >>> Are these changes that all grok parsers can benefit from?  Are your
>> changes to the base classes that they use or asa only?
>> >>>
>> >>>
>> >>>
>> >>>> On June 8, 2017 at 20:49:49, Simon Elliston Ball (
>> si...@simonellistonball.com <mailto:si...@simonellistonball.com>) wrote:
>> >>>>
>> >>>> I got mildly interested in parser performance as a result of some
>> recent work on tuning, and did some very quick benchmarking with Perfidix on
>> the ASA parser (which I hadn’t really cared about enough due to relatively
>> low volume previously).
>> >>>>
>> >>>> That said, it’s not exactly perf optimised. 3 runs of 1000
>> iterations on my laptop as a micro-benchmark in Perfidix (I know,
>> scientific, right), with some changes (basically pushing all the grok
>> statements up to pre-compile in init). The parser currently uses one grok
>> to do the syslog bit and figure out which grok it needs for the second
>> half, so this makes for a large number of Grok objects upfront, which I
>> think we can live with.
>> >>>>
>> >>>> Do you think we should do this benchmarking properly, and extend?
>> Anyone have thoughts about how to build parser benchmarks in to our test
>> suite properly?
>> >>>>
>> >>>> Also, since these are showing approx 20 times improvement on the P95
>> interval, do we think it's worth the memory (not measured, but 39 Grok
>> objects hanging around)? If so I'll get it JIRAed up and push my new
>> version.
>> >>>>
>> >>>> Run results:-
>> >>>>
>> >>>> Base line (current master as is)
>> >>>> |= Benchmark ==
>> |
>> >>>> | - | unit | sum | min | max | avg | stddev | conf95 | runs |
>> >>>> |= TimeMeter
>> ==|
>> >>>> |. AsaBenchmark ..
>> ...

Re: performance benchmarks on the asa parser

2017-06-08 Thread Ali Nazemian
Hi Simon,

We have noticed those issues as well. Can you share the changes you have
made, so we can merge them with our version? We have implemented about
40-50 more ciscotags so far. It would be great if we could optimize it and
contribute back to the community. However, we may end up reimplementing it
as a Java parser.

Cheers,
Ali

On Fri, Jun 9, 2017 at 12:55 PM, Simon Elliston Ball <
si...@simonellistonball.com> wrote:

> I thought about compile of first use and cache as an approach, but decided
> it would reduce the predictability of latency for a message, which is
> important in the metron enrichment context. As you say, we could end up
> growing a large number of Groks, but if the load of compile is all pushed
> to the (hopefully very rare) topology restart event, it feels like the
> performance trade off there is a good one, though the memory usage tradeoff
> could start to bite if we’re getting into the hundreds I guess.
>
> Simon
>
>
> > On 9 Jun 2017, at 03:32, Kyle Richardson 
> wrote:
> >
> > I like the pre-compile idea. One concern is I see the number of grok
> objects growing over time. This parser does not account for nearly all of
> the possible ASA message types, currently only the most common ones. Is
> there a middle ground implementation where we can compile on first use of a
> grok and then hold in memory? Avoids the up front burden but should also
> boost performance.
> >
> > -Kyle
> >
> >> On Jun 8, 2017, at 8:56 PM, Simon Elliston Ball <
> si...@simonellistonball.com> wrote:
> >>
> >> The changes are pretty simple (pre-compile the grok, duh). Most other
> >> grok parsers just use a single expression, which is already pre-compiled
> >> (/checks assumption in code), so really it's just the ASA one because of
> >> its strange two-stage grok.
> >>
> >> Shame, it would have been nice to find some more low hanging fruit.
> >>
> >> Simon
> >>
> >>> On 9 Jun 2017, at 01:52, Otto Fowler  wrote:
> >>>
> >>> Are these changes that all grok parsers can benefit from?  Are your
> changes to the base classes that they use or asa only?
> >>>
> >>>
> >>>
>  On June 8, 2017 at 20:49:49, Simon Elliston Ball (
> si...@simonellistonball.com ) wrote:
> 
>  I got mildly interested in parser performance as a result of some
> recent work on tuning, and did some very quick benchmarking with Perfidix on
> the ASA parser (which I hadn’t really cared about enough due to relatively
> low volume previously).
> 
>  That said, it’s not exactly perf optimised. 3 runs of 1000 iterations
> on my laptop as a micro-benchmark in Perfidix (I know, scientific, right),
> with some changes (basically pushing all the grok statements up to
> pre-compile in init). The parser currently uses one grok to do the syslog
> bit and figure out which grok it needs for the second half, so this makes
> for a large number of Grok objects upfront, which I think we can live with.
> 
>  Do you think we should do this benchmarking properly, and extend?
> Anyone have thoughts about how to build parser benchmarks in to our test
> suite properly?
> 
>  Also, since these are showing approx 20 times improvement on the P95
> interval, do we think it's worth the memory (not measured, but 39 Grok
> objects hanging around)? If so I'll get it JIRAed up and push my new
> version.
> 
>  Run results:-
> 
>  Base line (current master as is)
>  |= Benchmark ==
> |
>  | - | unit | sum | min | max | avg | stddev | conf95 | runs |
>  |= TimeMeter
> ==|
>  |. AsaBenchmark ..
> .|
>  | parserBenchmark | ms | 5597.98 | 04.90 | 159.02 | 05.60 | 04.89 |
> [05.01-06.20] | 1000.00 |
>  | parserBenchmark | ms | 5503.91 | 04.82 | 149.60 | 05.50 | 04.59 |
> [05.00-05.90] | 1000.00 |
>  | parserBenchmark | ms | 5620.90 | 04.80 | 152.83 | 05.62 | 04.71 |
> [04.98-06.73] | 1000.00 |
>  |===
> ===|
> 
>  Syslog element of Grok pulled out and pre-compiled
> 
>  |= Benchmark ==
> |
>  | - | unit | sum | min | max | avg | stddev | conf95 | runs |
>  |= TimeMeter
> ==|
>  |. AsaBenchmark ..
> .|
>  | parserBenchmark | ms | 4299.91 | 03.29 | 120.06 | 04.30 | 03.89 |
> [03.36-07.10] | 1000.00 |
>  | parserBenchmark | ms | 4206.98 | 03.31 | 129.41 | 04.21 | 04.07 |
> [03.46-05.44] | 1000.00 |
>  | parserBenchmark | ms | 3843.05 | 03.28 | 119.39 | 03.84 | 03.79 |
> [03.33-04.55] | 1000.00 |
>
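The two strategies debated in this thread, pre-compiling every Grok at init versus Kyle's compile-on-first-use cache, can be sketched with plain regexes. The tag-to-pattern table below is illustrative, not the real ASA grok set:

```python
import re

# Hypothetical per-tag patterns, standing in for the ASA grok statements.
PATTERNS = {
    "ASA-6-302013": r"Built inbound TCP connection (?P<conn_id>\d+)",
    "ASA-6-302021": r"Teardown ICMP connection for faddr (?P<faddr>\S+)",
}

# Eager approach (what the patch does): compile everything at init, paying
# the full cost once at topology start for predictable per-message latency.
COMPILED = {tag: re.compile(p) for tag, p in PATTERNS.items()}

# Lazy middle ground (Kyle's suggestion): compile on first use and cache,
# trading an unpredictable first-hit latency for a lighter startup.
_cache = {}
def get_pattern(tag):
    if tag not in _cache:
        _cache[tag] = re.compile(PATTERNS[tag])
    return _cache[tag]

m = get_pattern("ASA-6-302013").search(
    "%ASA-6-302013: Built inbound TCP connection 416661250 for outside")
print(m.group("conn_id"))  # → 416661250
```

With either strategy, every message after warm-up hits an already-compiled pattern; the difference is only where the compile cost lands, which is Simon's predictability argument.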

Re: [Discuss] Cyber Security Asset Management for Metron

2017-05-24 Thread Ali Nazemian
Agreed on having a separate discussion/proposal. Having a graph database
from the design perspective is one thing and having a stable and
high-performance implementation of it is another thing. I have used
different graph databases for multiple projects so far. It is very good on
paper, but we should be careful about the implementation.

The good point about using Titan for this purpose is that it comes with a
native TinkerPop implementation, which will be helpful for OLAP with Spark
directly and which we can use out of the box. However, there were lots of
issues regarding the stability of Titan (we were working on making it
stable for 8 months!). I am not sure whether they have been fixed as part
of JanusGraph. I know Atlas team members are involved in JanusGraph
development. The fact that they are using HBase as a backend would also be
helpful, so we may need to share the conversation with them and use some of
their experience.

Anyway, I was wondering whether anybody has done anything regarding this,
so I can be aligned with that work and avoid any re-work.

Cheers,
Ali

On Thu, May 25, 2017 at 4:21 AM, Otto Fowler 
wrote:

> We should have a discussion or a proposal on what should go in the graph
> vs. what should go
> in other stores.
>
>
> On May 24, 2017 at 14:09:59, zeo...@gmail.com (zeo...@gmail.com) wrote:
>
> I would be very interested in a graph db that could leverage the
> ip_src_addr and ip_dst_addr fields in a broad sense (who is talking to who,
> visualize top talkers, etc.). In order to be very useful it would need to
> have the ability to apply filters (IPs, ports, connection durations, bytes
> transferred, etc.) and to narrow down certain time-based windows. I
> probably have an environment where I could test this at semi-scale (a
> couple billion messages per day) and flesh out some of the performance
> concerns if this turns into something. Even if it was very early in
> development, as I frequently rebuild that environment from scratch for
> testing things.
>
> Jon
>
> On Wed, May 24, 2017 at 12:46 PM Nick Allen  wrote:
>
> > I think the addition of a graph capability would be very powerful. I know
> > many who would love the idea, but I know of no implementations that have
> > occurred.
> >
> > It might be good to discuss in the community specific use cases that
> would
> > be enabled by a graph database. That might help to flesh out the
> technical
> > aspects of it.
> >
> >
> >
> >
> >
> > On Wed, May 24, 2017 at 10:08 AM, Ali Nazemian 
> > wrote:
> >
> > > Hi all,
> > >
> > > We are going to design and develop an asset database for Metron. For
> this
> > > purpose, I have been thinking of a graph schema model to map assets as
> > > Nodes and provide relations as Edges. This can be extended to event
> level
> > > to have a particular relation to assets as well as an event to event
> > > relation. Regarding technology, I was thinking of using Titan Graph
> > > Database (probably JanusGraph) and using HBase and Elasticsearch/Solr
> as
> > > backends. However, there might be a performance issue regarding this
> > > decision if we want to use lots of Composite Indices. The problem we
> > > will be facing is that Titan creates a separate column family for each
> > > Composite Index, which HBase does not handle well. Basically, it would
> > > be better to use Cassandra for this purpose.
> > >
> > > I would like to understand what work has been done already regarding
> > > this problem and what the roadmap will be, so I can make sure we will
> > > follow the same strategy.
> > >
> > > Regards,
> > > Ali
> > >
> >
> --
>
> Jon
>



-- 
A.Nazemian


[Discuss] Cyber Security Asset Management for Metron

2017-05-24 Thread Ali Nazemian
Hi all,

We are going to design and develop an asset database for Metron. For this
purpose, I have been thinking of a graph schema model to map assets as
Nodes and provide relations as Edges. This can be extended to event level
to have a particular relation to assets as well as an event to event
relation. Regarding technology, I was thinking of using Titan Graph
Database (probably JanusGraph) and using HBase and Elasticsearch/Solr as
backends. However, there might be a performance issue regarding this
decision if we want to use lots of Composite Indices. The problem we will
be facing is that Titan creates a separate column family for each
Composite Index, which HBase does not handle well. Basically, it would be
better to use Cassandra for this purpose.

I would like to understand what work has been done already regarding this
problem and what the roadmap will be, so I can make sure we will follow the
same strategy.

Regards,
Ali
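Backend choice aside, the schema Ali sketches (assets and events as nodes, relations as edges) can be prototyped with plain structures before committing to Titan/JanusGraph. A minimal property-graph sketch in Python, with node and property names invented for illustration:

```python
# Minimal property-graph sketch: assets and events as nodes, relations as edges.
nodes = {}   # node id -> property map
edges = []   # (src_id, edge_label, dst_id)

def add_node(node_id, **props):
    nodes[node_id] = props

def add_edge(src, label, dst):
    edges.append((src, label, dst))

add_node("host-1", kind="asset", ip="10.0.0.2")
add_node("host-2", kind="asset", ip="10.0.0.3")
add_node("evt-99", kind="event", type="ssh_login")

add_edge("host-1", "connected_to", "host-2")   # asset-to-asset relation
add_edge("evt-99", "observed_on", "host-1")    # event-to-asset relation

# A lookup like "who talks to host-2?" is a filtered edge scan here; in
# Titan/JanusGraph it would be an index-backed Gremlin traversal instead.
talkers = [s for (s, label, d) in edges if label == "connected_to" and d == "host-2"]
print(talkers)  # → ['host-1']
```

A prototype at this level makes it possible to settle the "what goes in the graph vs. other stores" question Otto raises before any composite-index or column-family decisions are locked in.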


Elasticsearch Indexing timestamp

2017-05-09 Thread Ali Nazemian
Hi all,

I was wondering whether there is an index-time timestamp field which I can
enable through some configuration. I want to capture the index time for
each event coming through our platform.

There was a _timestamp field before Elasticsearch 2 which was handy. Since
there is no such feature in Elasticsearch anymore, we have to capture it on
the Metron indexing side. I have had a look at the Metron Elasticsearch
module, but I couldn't find any part which can be enabled for such a use
case.

Cheers,
Ali
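For what it's worth, on Elasticsearch 5 and later the removed `_timestamp` behaviour can be approximated with an ingest pipeline that stamps each document at index time; a sketch, with the pipeline and field names chosen for illustration:

```
PUT _ingest/pipeline/index-time
{
  "processors": [
    {
      "set": {
        "field": "index_time",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}
```

On the ES versions Metron targeted at the time of this email, the equivalent would be adding the field on the Metron indexing side, as Ali suggests.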


Re: Normalization topology or separate normalization bolt for parsing topology

2017-05-02 Thread Ali Nazemian
Hi Nick,

I am happy to continue the development using the current architecture and
embed the pre-parsing steps in the parser code. However, this would work
against the goal of contributing back to the Metron community to expand the
range of supported devices. Clearly, a generic parser would be useful for
the community, not a parser that is highly customised for our noisy
environment. I was looking to decouple parsing and normalisation in order
to implement a generic parser which can be used by others as well.

I think this is more of a strategic decision which can increase the number
of generic parsers that will be contributed back to the community in
future. Ideally, official Metron developers would focus on Metron features
instead of developing generic parsers.

Thanks,
Ali

On Wed, May 3, 2017 at 3:03 AM, Nick Allen  wrote:

> Yes, and currently that normalization step is the Parsers.
>
> I am not saying the message has to be entirely clear and well-defined.  But
> there are a minimum set of expectations that you must have of any data that
> you're ingesting.   Once it meets that "minimum set", the parser should be
> able to ingest and normalize the message.  Any oddities beyond that
> "minimum set" can be handled with Stellar either post-Parsing or in
> Enrichment.
>
> It is, of course, a judgement call as to what that minimum set is for you.
> You would just need a Parser that matches your definition of "minimum set".
>
> My main point here is that I am not seeing a need to re-architect
> anything.  I think we have the right tools, IMHO.
>
>
>
>
>
>
>
>
>
> On Tue, May 2, 2017 at 10:33 AM, Ali Nazemian 
> wrote:
>
> > Hi Nick,
> >
> > The date could be corrupted for any reason, and sometimes we haven't got
> > any control over the device. Obviously, it is not a big deal if we lose a
> > <166> severity message, but it could be a different situation for a <161>
> > severity message or an actual critical threat. However, I have mentioned
> > those defects as examples to point out the importance of having a
> > normalisation step in the Metron processing chain.
> >
> > I still think there is no guarantee of an entirely clean and well-defined
> > message in a real-world use case. If we recognise this situation as a
> > problem, then finding a high-performance and flexible solution is not
> > very hard.
> >
> > Cheers,
> > Ali
> >
> > On Tue, May 2, 2017 at 11:24 PM, Nick Allen  wrote:
> >
> > > Before worrying about how to ingest this 'noisy' data, I would want to
> > > better understand root cause.  If you cannot even get a valid date
> > format,
> > > are you sure the data can be trusted?
> > >
> > > Rather than bending over backwards to try to ingest it, I would first
> > make
> > > sure the telemetry is not totally bogus to begin with.  Maybe it is
> > better
> > > that the data is dropped in cases like this.
> > >
> > > IMHO, that is how I would tackle a problem like this.  Not all data can
> > be
> > > trusted.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Thu, Apr 27, 2017 at 9:54 AM, Ali Nazemian 
> > > wrote:
> > >
> > > > Are you sure? The syslog_host name is way more complicated than
> > something
> > > > that can be a coincidence. I need to double check with one of the
> > > security
> > > > device experts, but I thought it is some kind of noises.
> > > >
> > > > Yes, we do have more use cases that seem to be corrupted. For
> example,
> > > > having duplicate IP addresses or corrupted date format. Please have a
> > > look
> > > > at the following message. At least I am sure the date format is
> > corrupted
> > > > in this one.
> > > >
> > > > <166>*Jan 22:42:12* hostname : %ASA-6-302013: Built inbound TCP
> > > connection
> > > > 416661250 for outside:*x.x.x.x/p1* *x.x.x.x/p1* to
> inside:*y.y.y.y/p2*
> > > > *y.y.y.y/p2*
> > > >
> > > > Cheers,
> > > > Ali
> > > >
> > > > On Thu, Apr 27, 2017 at 10:00 PM, Simon Elliston Ball <
> > > > si...@simonellistonball.com> wrote:
> > > >
> > > > > In that instance, you're looking at valid syslog which should be
> > parsed
> > > > as
> > > > > such. The repeat host is not really a host in syslog terms, it's an
> > > > > ap

Re: Normalization topology or separate normalization bolt for parsing topology

2017-05-02 Thread Ali Nazemian
Hi Nick,

The date could be corrupted for any reason, and sometimes we haven't got
any control over the device. Obviously, it is not a big deal if we lose a
<166> severity message, but it could be a different situation for a <161>
severity message or an actual critical threat. However, I have mentioned
those defects as examples to point out the importance of having a
normalisation step in the Metron processing chain.

I still think there is no guarantee of an entirely clean and well-defined
message in a real-world use case. If we recognise this situation as a
problem, then finding a high-performance and flexible solution is not very
hard.

Cheers,
Ali

On Tue, May 2, 2017 at 11:24 PM, Nick Allen  wrote:

> Before worrying about how to ingest this 'noisy' data, I would want to
> better understand root cause.  If you cannot even get a valid date format,
> are you sure the data can be trusted?
>
> Rather than bending over backwards to try to ingest it, I would first make
> sure the telemetry is not totally bogus to begin with.  Maybe it is better
> that the data is dropped in cases like this.
>
> IMHO, that is how I would tackle a problem like this.  Not all data can be
> trusted.
>
>
>
>
>
>
>
> On Thu, Apr 27, 2017 at 9:54 AM, Ali Nazemian 
> wrote:
>
> > Are you sure? The syslog_host name is way more complicated than something
> > that can be a coincidence. I need to double-check with one of the
> > security device experts, but I thought it was some kind of noise.
> >
> > Yes, we do have more use cases that seem to be corrupted. For example,
> > having duplicate IP addresses or a corrupted date format. Please have a
> > look at the following message. At least I am sure the date format is
> > corrupted in this one.
> >
> > <166>*Jan 22:42:12* hostname : %ASA-6-302013: Built inbound TCP
> connection
> > 416661250 for outside:*x.x.x.x/p1* *x.x.x.x/p1* to inside:*y.y.y.y/p2*
> > *y.y.y.y/p2*
> >
> > Cheers,
> > Ali
> >
> > On Thu, Apr 27, 2017 at 10:00 PM, Simon Elliston Ball <
> > si...@simonellistonball.com> wrote:
> >
> > > In that instance, you're looking at valid syslog which should be parsed
> > as
> > > such. The repeat host is not really a host in syslog terms, it's an
> > > application name header which happens to be the same. This is
> definitely
> > a
> > > parser bug which should be handled, esp since the header is perfectly
> RFC
> > > compliant.
> > >
> > > Do you have any other such cases? My view is that parsers should be
> > > written to handle more cases, so they should extract all the fields
> > > they can from malformed logs, rather than throwing exceptions, but
> > > that's more about the way we write parsers than having some kind of
> > > pre-clean.
> > >
> > > Simon
> > >
> > > Sent from my iPad
> > >
> > > > On 27 Apr 2017, at 08:04, Ali Nazemian 
> wrote:
> > > >
> > > > I do agree there is a fair amount of overhead for using another bolt
> > for
> > > > this purpose. I am not pointing to the way of implementation. It
> might
> > > be a
> > > > way of implementation to segregate two extension points without
> adding
> > > > overhead; I haven't thought about it yet. However, the main issue is
> > > > sometimes the type of noise is something that generates an exception
> on
> > > the
> > > > parsing side. For example, have a look at the following log:
> > > >
> > > > <166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown ICMP
> > > > connection for faddr x.x.x.x/0 gaddr y.y.y.y/0 laddr k.k.k.k/0
> > > > (ryanmar)
> > > >
> > > > Clearly duplicate syslog_host throws an exception on parsing, so how
> > > > are we going to deal with that at post-parse transformation? It
> cannot
> > > > pass the parsing. This is only a single example of cases that might
> > > > affect the production data. Unless Stellar transformation is
> something
> > > > that can be done at pre-parse and for the entire message.
> > > >
> > > >
> > > > On Thu, Apr 27, 2017 at 11:14 AM, Simon Elliston Ball <
> > > > si...@simonellistonball.com> wrote:
> > > >
> > > >> Ali,
> > > >>
> > > >> Sounds very much like what you’re talking about when you say
> > > >> normalization, and what I would understand it as, is the process
> > > fulfi

Re: Normalization topology or separate normalization bolt for parsing topology

2017-04-27 Thread Ali Nazemian
Are you sure? The syslog_host name is way more complicated than something
that can be a coincidence. I need to double-check with one of the security
device experts, but I thought it was some kind of noise.

Yes, we do have more use cases that seem to be corrupted. For example,
having duplicate IP addresses or a corrupted date format. Please have a look
at the following message. At least I am sure the date format is corrupted
in this one.

<166>*Jan 22:42:12* hostname : %ASA-6-302013: Built inbound TCP connection
416661250 for outside:*x.x.x.x/p1* *x.x.x.x/p1* to inside:*y.y.y.y/p2*
*y.y.y.y/p2*

Cheers,
Ali
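A pre-parse normalisation step of the kind discussed in this thread, collapsing the duplicated hostname token and flagging a timestamp with no day-of-month, might look roughly like this; the two rules are illustrative, not a complete ASA cleaner:

```python
import re

# Illustrative pre-parse cleanup for noisy ASA syslog, run before the parser.
# Matches "<pri>Mon DD HH:MM:SS host host " and captures the repeated token.
DUP_HOST = re.compile(r"^(<\d+>\S+ +\d+ [\d:]+ )(\S+) \2 ")
# A syslog timestamp missing its day-of-month, e.g. "<166>Jan 22:42:12 ".
BAD_DATE = re.compile(r"^<\d+>[A-Z][a-z]{2} \d{2}:\d{2}:\d{2} ")

def normalize(line):
    """Return (cleaned line, malformed_date flag)."""
    line = DUP_HOST.sub(r"\1\2 ", line)      # drop the duplicated hostname
    malformed_date = bool(BAD_DATE.match(line))
    return line, malformed_date

msg = ("<166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: "
       "Teardown ICMP connection")
clean, bad = normalize(msg)
print(clean)
print(bad)
```

Such rules could live either in a dedicated pre-parse step or, as Simon argues, inside a more tolerant parser; the sketch is only meant to show how small the cleanup logic itself is.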

On Thu, Apr 27, 2017 at 10:00 PM, Simon Elliston Ball <
si...@simonellistonball.com> wrote:

> In that instance, you're looking at valid syslog which should be parsed as
> such. The repeat host is not really a host in syslog terms, it's an
> application name header which happens to be the same. This is definitely a
> parser bug which should be handled, esp since the header is perfectly RFC
> compliant.
>
> Do you have any other such cases? My view is that parsers should be
> written to handle more cases, so they should extract all the fields they
> can from malformed logs, rather than throwing exceptions, but that's more
> about the way we write parsers than having some kind of pre-clean.
>
> Simon
>
> Sent from my iPad
>
> > On 27 Apr 2017, at 08:04, Ali Nazemian  wrote:
> >
> > I do agree there is a fair amount of overhead in using another bolt for
> > this purpose. I am not prescribing a particular implementation; there
> > might be a way to segregate the two extension points without adding
> > overhead, but I haven't thought about it yet. However, the main issue is
> > that sometimes the type of noise is something that generates an exception
> > on the parsing side. For example, have a look at the following log:
> >
> > <166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown ICMP
> > connection for faddr x.x.x.x/0 gaddr y.y.y.y/0 laddr k.k.k.k/0
> > (ryanmar)
> >
> > Clearly duplicate syslog_host throws an exception on parsing, so how
> > are we going to deal with that at post-parse transformation? It cannot
> > pass the parsing. This is only a single example of cases that might
> > affect the production data. Unless Stellar transformation is something
> > that can be done at pre-parse and for the entire message.
> >
> >
> > On Thu, Apr 27, 2017 at 11:14 AM, Simon Elliston Ball <
> > si...@simonellistonball.com> wrote:
> >
> >> Ali,
> >>
> >> What you're talking about when you say normalization, as I understand
> >> it, sounds very much like the process fulfilled by Stellar field
> >> transformation in the parser config. Agreed that some
> of
> >> these will be general, based on common metron standard schema, but
> others
> >> will be organisation specific (custom fields overloaded with different
> >> meanings for instance in CEF, for example). These are very much one of
> the
> >> reasons we have the stellar transformation step. I don’t think that
> should
> >> be moved to a separate bolt to be honest, because that comes with a fair
> >> amount of overhead, but logically it is in the parser config rather than
> >> the parser, so seems to serve this purpose in the post-parse transform,
> no?
> >>
> >> Simon
> >>
> >>
> >>
> >>> On 27 Apr 2017, at 02:08, Ali Nazemian  wrote:
> >>>
> >>> Hi Simon,
> >>>
> >>> The reason I am asking for a specific normalisation step is due to the
> >> fact
> >>> that normalisation is not a general use case which can be used by other
> >>> users. It is completely bounded to our application. The way we have
> fixed
> >>> it, for now, is to add a normalisation step to the parser and clear the
> >>> incoming data so the parser step can work on that, but I don't like it.
> >>> There is no point of creating a parser that can handle all of the
> >> possible
> >>> noises that can exist in the production data. Even if it is possible to
> >>> predict every kind of noise in production data there is no point for
> >> Metron
> >>> community to focus on building a general purpose parser for a specific
> >>> device while they can spend that time on developing a cool feature.
> Even
> >> if
> >>> it is possible to predict noises and it is acceptable for the community
> >> to
> >>> spend their time on creating that kin

Re: Normalization topology or separate normalization bolt for parsing topology

2017-04-27 Thread Ali Nazemian
I do agree there is a fair amount of overhead in using another bolt for
this purpose. I am not pointing at any particular implementation; there may
be a way to segregate the two extension points without adding overhead, but
I haven't thought about it yet. However, the main issue is that sometimes
the noise is something that generates an exception on the parsing side. For
example, have a look at the following log:

<166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown ICMP
connection for faddr x.x.x.x/0 gaddr y.y.y.y/0 laddr k.k.k.k/0
(ryanmar)

The duplicated syslog_host clearly throws an exception during parsing, so
how are we going to deal with it in a post-parse transformation? The
message cannot get past the parser. This is only a single example of the
cases that might affect production data, unless Stellar transformation is
something that can be applied pre-parse and to the entire message.
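As an illustration of the kind of pre-parse clean-up being discussed, the duplicated hostname token in the ASA line above could be collapsed before the message ever reaches the parser. This is a hypothetical sketch in plain Java, not Metron code; the class name and pattern are invented for the example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-parse clean-up: collapse a duplicated syslog hostname
// token ("<pri>Apr 12 03:19:12 hostname hostname %ASA-...") down to one
// occurrence before the line is handed to the real parser.
public class SyslogPreClean {
    // Matches a "<166>Apr 12 03:19:12 " style prefix, then two identical
    // whitespace-delimited tokens (the duplicated hostname).
    private static final Pattern DUP_HOST =
        Pattern.compile("^(<\\d+>\\w{3} +\\d+ \\d{2}:\\d{2}:\\d{2} )(\\S+) \\2 ");

    public static String normalize(String raw) {
        Matcher m = DUP_HOST.matcher(raw);
        // Keep the prefix and a single copy of the hostname; pass clean
        // lines through untouched.
        return m.find() ? m.replaceFirst("$1$2 ") : raw;
    }

    public static void main(String[] args) {
        String noisy = "<166>Apr 12 03:19:12 hostname hostname %ASA-6-302021: Teardown ICMP connection";
        // prints: <166>Apr 12 03:19:12 hostname %ASA-6-302021: Teardown ICMP connection
        System.out.println(normalize(noisy));
    }
}
```

The same idea generalises: each production-specific noise pattern becomes one small, testable normalisation rule that runs ahead of the community parser.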


On Thu, Apr 27, 2017 at 11:14 AM, Simon Elliston Ball <
si...@simonellistonball.com> wrote:

> Ali,
>
> Sounds very much like what you’re talking about when you say
> normalization, and what I would understand it as, is the process fulfilled
> by stellar field transformation in the parser config. Agreed that some of
> these will be general, based on common metron standard schema, but others
> will be organisation specific (custom fields overloaded with different
> meanings for instance in CEF, for example). These are very much one of the
> reasons we have the stellar transformation step. I don’t think that should
> be moved to a separate bolt to be honest, because that comes with a fair
> amount of overhead, but logically it is in the parser config rather than
> the parser, so seems to serve this purpose in the post-parse transform, no?
>
> Simon
>
>
>
> > On 27 Apr 2017, at 02:08, Ali Nazemian  wrote:
> >
> > Hi Simon,
> >
> > The reason I am asking for a specific normalisation step is due to the
> fact
> > that normalisation is not a general use case which can be used by other
> > users. It is completely bounded to our application. The way we have fixed
> > it, for now, is to add a normalisation step to the parser and clear the
> > incoming data so the parser step can work on that, but I don't like it.
> > There is no point of creating a parser that can handle all of the
> possible
> > noises that can exist in the production data. Even if it is possible to
> > predict every kind of noise in production data there is no point for
> Metron
> > community to focus on building a general purpose parser for a specific
> > device while they can spend that time on developing a cool feature. Even
> if
> > it is possible to predict noises and it is acceptable for the community
> to
> > spend their time on creating that kind of parser why every Metron user
> need
> > that extra normalisation? A user data might be clear at the first step
> and
> > obviously, it only decreases the total throughput without any use for
> that
> > specific user.
> >
> > Imagine there is an additional bolt for normalisation and there is a
> > mechanism to customise the normalisation without changing the general
> > parser for a specific device. We can have a general parser as a common
> > parser for that device and leave the normalisation development to users.
> > However, it is very important to provide the normalisation step as fast
> as
> > possible.
> >
> > Cheers,
> > Ali
> >
> > On Thu, Apr 27, 2017 at 12:05 AM, Casey Stella 
> wrote:
> >
> >> Yeah, we definitely don't want to rewrite parsing in Stellar.  I would
> >> expect the job of the parser, however, to handle structural issues.  In
> my
> >> mind, parsing is about transforming structures into fields and the role
> of
> >> the field transformations are to transform values.  There's obvious
> overlap
> >> there wherein parsers may do some normalizations/transformations (i.e.
> look
> >> how grok handles timestamps), but it almost always gets us into trouble
> >> when parsers do even moderately complex value transformations.
> >>
> >> As I type this, though, I think I see your point.  What you really want
> is
> >> to chain parsers, have a pre-parser to bring you 80% of the way there
> and
> >> hammer out all the structural issues so you might be able to use a more
> >> generic parser down the chain.  I have often thought that maybe we
> should
> >> expose parsers as Stellar functions which take raw data and emit whole
> >> messages.  This would allow us to compose parsers, so imagine the above
> >> example where you've written a s
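Casey's idea of exposing parsers as composable functions could be sketched roughly as follows. This is purely illustrative; Metron does not actually expose parsers this way, and all names here are invented:

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of "parsers as composable functions": a pre-parser
// fixes structural noise, then a generic parser turns the cleaned line
// into fields.
public class ParserChain {
    public static Function<String, Map<String, Object>> chain(
            Function<String, String> preParser,
            Function<String, Map<String, Object>> parser) {
        // Run the structural clean-up first, then the real parser.
        return preParser.andThen(parser);
    }

    public static void main(String[] args) {
        // Toy pre-parser: collapse "token token " duplicates.
        Function<String, String> dedupeHost =
            s -> s.replaceAll("(\\S+) \\1 ", "$1 ");
        // Toy parser: just records the (cleaned) original string.
        Function<String, Map<String, Object>> toyParser =
            s -> Map.of("original_string", s);
        Map<String, Object> msg =
            chain(dedupeHost, toyParser).apply("ts host host %ASA-6: msg");
        // prints: ts host %ASA-6: msg
        System.out.println(msg.get("original_string"));
    }
}
```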

Re: Normalization topology or separate normalization bolt for parsing topology

2017-04-26 Thread Ali Nazemian
lies and
> will
> > likely side-step a lot of your issues.
> >
> > Simon
> >
> > > On 26 Apr 2017, at 14:37, Casey Stella  wrote:
> > >
> > > Ok, that's another story.  Hmm, we don't generally pre-parse because
> we
> > > try to not assume any particular format there (i.e. it could be
> strings,
> > > could be byte arrays).  Maybe the right answer is to pass the raw,
> > > non-normalized data (best effort type of thing) through the parser and
> do
> > > the normalization post-parse... or is there a problem with that?
> > >
> > > On Wed, Apr 26, 2017 at 9:33 AM, Ali Nazemian 
> > wrote:
> > >
> > >> Hi Casey,
> > >>
> > >> It is actually pre-parse process, not a post-parse one. These type of
> > >> noises affect the position of an attribute for example and give us
> > parsing
> > >> exception. The timestamp example was not a good one because that is
> > >> actually a post-parse exception.
> > >>
> > >> On Wed, Apr 26, 2017 at 11:28 PM, Casey Stella 
> > wrote:
> > >>
> > >>> So, further transformation post-parse was one of the motivating
> reasons
> > >> for
> > >>> Stellar (to do that transformation post-parse).  Is there a
> capability
> > >> that
> > >>> it's lacking that we can add to fit your usecase?
> > >>>
> > >>> On Wed, Apr 26, 2017 at 9:24 AM, Ali Nazemian  >
> > >>> wrote:
> > >>>
> > >>>> I've created a Jira ticket regarding this feature.
> > >>>>
> > >>>> https://issues.apache.org/jira/browse/METRON-893
> > >>>>
> > >>>>
> > >>>> On Wed, Apr 26, 2017 at 11:11 PM, Ali Nazemian <
> alinazem...@gmail.com
> > >
> > >>>> wrote:
> > >>>>
> > >>>>> Currently, we are using normal regex at the Java source code to
> > >> handle
> > >>>>> those situations. However, it would be nice to have a separate bolt
> > >> and
> > >>>>> deal with them separately. Yeah, I can create a Jira issue
> regarding
> > >>>> that.
> > >>>>> The main reason I am asking for such a feature is the fact that
> lack
> > >> of
> > >>>>> such a feature makes the process of creating some parser for the
> > >>>> community
> > >>>>> a little painful for us. We need to maintain two different
> versions,
> > >>> one
> > >>>>> for community another for the internal use case. Clearly, noise is
> an
> > >>>>> inevitable part of real world use cases.
> > >>>>>
> > >>>>> Cheers,
> > >>>>> Ali
> > >>>>>
> > >>>>> On Wed, Apr 26, 2017 at 11:04 PM, Otto Fowler <
> > >> ottobackwa...@gmail.com
> > >>>>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> Are you doing this cleansing all in the parser or are you using
> any
> > >>>>>> Stellar to do it?
> > >>>>>> Can you create a jira?
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> On April 26, 2017 at 08:59:16, Ali Nazemian (
> alinazem...@gmail.com)
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>> Hi all,
> > >>>>>>
> > >>>>>>
> > >>>>>> We are facing certain use cases in Metron production that happen
> to
> > >> be
> > >>>>>> related to noisy stream. For example, a wrong timestamp, duplicate
> > >>>>>> hostname/IP address, etc. To deal with the normalization we have
> > >> added
> > >>>> an
> > >>>>>> additional step for the corresponding parsers to do the data
> > >> cleaning.
> > >>>>>> Clearly, parsing is a standard factor which is mostly related to
> the
> > >>>>>> device
> > >>>>>> that is generating the data and can be used for the same type of
> > >>> device
> > >>>>>> everywhere, but normalization is very production dependent and
> there
> > >>> is
> > >>>>>> no
> > >>>>>> point of mixing normalization with parsing. It would be nice to
> > >> have a
> > >>>>>> separate bolt in a parsing topologies to dedicate to production
> > >>>>>> related cleaning process. In that case, everybody can easily
> > >> contribute
> > >>>> to
> > >>>>>> Metron community with additional parsers without being worried
> about
> > >>>>>> mixing
> > >>>>>> parsers and data cleaning process.
> > >>>>>>
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>>
> > >>>>>> Ali
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>> --
> > >>>>> A.Nazemian
> > >>>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> A.Nazemian
> > >>>>
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >> A.Nazemian
> > >>
> >
> >
>



-- 
A.Nazemian


Re: Normalization topology or separate normalization bolt for parsing topology

2017-04-26 Thread Ali Nazemian
Having a Stellar function for normalization would actually be very cool.

Casey, how are you going to deal with normalization after parsing if the
noise affects the parsing itself? Sometimes the incoming data simply do
not look the way they have to.
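For reference, the post-parse Stellar transformation Casey mentions lives in the sensor's parser config under fieldTransformations. A minimal illustrative fragment (the sensor name, parser class, and field names are hypothetical):

```json
{
  "parserClassName": "org.apache.metron.parsers.GrokParser",
  "sensorTopic": "asa",
  "fieldTransformations": [
    {
      "transformation": "STELLAR",
      "output": ["hostname"],
      "config": {
        "hostname": "TO_LOWER(TRIM(hostname))"
      }
    }
  ]
}
```

This only operates on fields the parser has already produced, which is exactly the sticking point here: it cannot help when the noise prevents parsing in the first place.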

On Wed, Apr 26, 2017 at 11:37 PM, Casey Stella  wrote:

> Ok, that's another story.  Hmm, we don't generally pre-parse because we
> try to not assume any particular format there (i.e. it could be strings,
> could be byte arrays).  Maybe the right answer is to pass the raw,
> non-normalized data (best effort type of thing) through the parser and do
> the normalization post-parse... or is there a problem with that?
>
> On Wed, Apr 26, 2017 at 9:33 AM, Ali Nazemian 
> wrote:
>
> > Hi Casey,
> >
> > It is actually pre-parse process, not a post-parse one. These type of
> > noises affect the position of an attribute for example and give us
> parsing
> > exception. The timestamp example was not a good one because that is
> > actually a post-parse exception.
> >
> > On Wed, Apr 26, 2017 at 11:28 PM, Casey Stella 
> wrote:
> >
> > > So, further transformation post-parse was one of the motivating reasons
> > for
> > > Stellar (to do that transformation post-parse).  Is there a capability
> > that
> > > it's lacking that we can add to fit your usecase?
> > >
> > > On Wed, Apr 26, 2017 at 9:24 AM, Ali Nazemian 
> > > wrote:
> > >
> > > > I've created a Jira ticket regarding this feature.
> > > >
> > > > https://issues.apache.org/jira/browse/METRON-893
> > > >
> > > >
> > > > On Wed, Apr 26, 2017 at 11:11 PM, Ali Nazemian <
> alinazem...@gmail.com>
> > > > wrote:
> > > >
> > > > > Currently, we are using normal regex at the Java source code to
> > handle
> > > > > those situations. However, it would be nice to have a separate bolt
> > and
> > > > > deal with them separately. Yeah, I can create a Jira issue
> regarding
> > > > that.
> > > > > The main reason I am asking for such a feature is the fact that
> lack
> > of
> > > > > such a feature makes the process of creating some parser for the
> > > > community
> > > > > a little painful for us. We need to maintain two different
> versions,
> > > one
> > > > > for community another for the internal use case. Clearly, noise is
> an
> > > > > inevitable part of real world use cases.
> > > > >
> > > > > Cheers,
> > > > > Ali
> > > > >
> > > > > On Wed, Apr 26, 2017 at 11:04 PM, Otto Fowler <
> > ottobackwa...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > >> Hi,
> > > > >>
> > > > >> Are you doing this cleansing all in the parser or are you using
> any
> > > > >> Stellar to do it?
> > > > >> Can you create a jira?
> > > > >>
> > > > >>
> > > > >>
> > > > >> On April 26, 2017 at 08:59:16, Ali Nazemian (
> alinazem...@gmail.com)
> > > > >> wrote:
> > > > >>
> > > > >> Hi all,
> > > > >>
> > > > >>
> > > > >> We are facing certain use cases in Metron production that happen
> to
> > be
> > > > >> related to noisy stream. For example, a wrong timestamp, duplicate
> > > > >> hostname/IP address, etc. To deal with the normalization we have
> > added
> > > > an
> > > > >> additional step for the corresponding parsers to do the data
> > cleaning.
> > > > >> Clearly, parsing is a standard factor which is mostly related to
> the
> > > > >> device
> > > > >> that is generating the data and can be used for the same type of
> > > device
> > > > >> everywhere, but normalization is very production dependent and
> there
> > > is
> > > > >> no
> > > > >> point of mixing normalization with parsing. It would be nice to
> > have a
> > > > >> separate bolt in a parsing topologies to dedicate to production
> > > > >> related cleaning process. In that case, everybody can easily
> > contribute
> > > > to
> > > > >> Metron community with additional parsers without being worried
> about
> > > > >> mixing
> > > > >> parsers and data cleaning process.
> > > > >>
> > > > >>
> > > > >> Regards,
> > > > >>
> > > > >> Ali
> > > > >>
> > > > >>
> > > > >
> > > > >
> > > > > --
> > > > > A.Nazemian
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > A.Nazemian
> > > >
> > >
> >
> >
> >
> > --
> > A.Nazemian
> >
>



-- 
A.Nazemian


Re: Normalization topology or separate normalization bolt for parsing topology

2017-04-26 Thread Ali Nazemian
Hi Casey,

It is actually a pre-parse process, not a post-parse one. These types of
noise affect, for example, the position of an attribute and give us a
parsing exception. The timestamp example was not a good one, because that
is actually a post-parse exception.

On Wed, Apr 26, 2017 at 11:28 PM, Casey Stella  wrote:

> So, further transformation post-parse was one of the motivating reasons for
> Stellar (to do that transformation post-parse).  Is there a capability that
> it's lacking that we can add to fit your usecase?
>
> On Wed, Apr 26, 2017 at 9:24 AM, Ali Nazemian 
> wrote:
>
> > I've created a Jira ticket regarding this feature.
> >
> > https://issues.apache.org/jira/browse/METRON-893
> >
> >
> > On Wed, Apr 26, 2017 at 11:11 PM, Ali Nazemian 
> > wrote:
> >
> > > Currently, we are using normal regex at the Java source code to handle
> > > those situations. However, it would be nice to have a separate bolt and
> > > deal with them separately. Yeah, I can create a Jira issue regarding
> > that.
> > > The main reason I am asking for such a feature is the fact that lack of
> > > such a feature makes the process of creating some parser for the
> > community
> > > a little painful for us. We need to maintain two different versions,
> one
> > > for community another for the internal use case. Clearly, noise is an
> > > inevitable part of real world use cases.
> > >
> > > Cheers,
> > > Ali
> > >
> > > On Wed, Apr 26, 2017 at 11:04 PM, Otto Fowler  >
> > > wrote:
> > >
> > >> Hi,
> > >>
> > >> Are you doing this cleansing all in the parser or are you using any
> > >> Stellar to do it?
> > >> Can you create a jira?
> > >>
> > >>
> > >>
> > >> On April 26, 2017 at 08:59:16, Ali Nazemian (alinazem...@gmail.com)
> > >> wrote:
> > >>
> > >> Hi all,
> > >>
> > >>
> > >> We are facing certain use cases in Metron production that happen to be
> > >> related to noisy stream. For example, a wrong timestamp, duplicate
> > >> hostname/IP address, etc. To deal with the normalization we have added
> > an
> > >> additional step for the corresponding parsers to do the data cleaning.
> > >> Clearly, parsing is a standard factor which is mostly related to the
> > >> device
> > >> that is generating the data and can be used for the same type of
> device
> > >> everywhere, but normalization is very production dependent and there
> is
> > >> no
> > >> point of mixing normalization with parsing. It would be nice to have a
> > >> separate bolt in a parsing topologies to dedicate to production
> > >> related cleaning process. In that case, everybody can easily contribute
> > to
> > >> Metron community with additional parsers without being worried about
> > >> mixing
> > >> parsers and data cleaning process.
> > >>
> > >>
> > >> Regards,
> > >>
> > >> Ali
> > >>
> > >>
> > >
> > >
> > > --
> > > A.Nazemian
> > >
> >
> >
> >
> > --
> > A.Nazemian
> >
>



-- 
A.Nazemian


Re: Normalization topology or separate normalization bolt for parsing topology

2017-04-26 Thread Ali Nazemian
I've created a Jira ticket regarding this feature.

https://issues.apache.org/jira/browse/METRON-893


On Wed, Apr 26, 2017 at 11:11 PM, Ali Nazemian 
wrote:

> Currently, we are using normal regex at the Java source code to handle
> those situations. However, it would be nice to have a separate bolt and
> deal with them separately. Yeah, I can create a Jira issue regarding that.
> The main reason I am asking for such a feature is the fact that lack of
> such a feature makes the process of creating some parser for the community
> a little painful for us. We need to maintain two different versions, one
> for community another for the internal use case. Clearly, noise is an
> inevitable part of real world use cases.
>
> Cheers,
> Ali
>
> On Wed, Apr 26, 2017 at 11:04 PM, Otto Fowler 
> wrote:
>
>> Hi,
>>
>> Are you doing this cleansing all in the parser or are you using any
>> Stellar to do it?
>> Can you create a jira?
>>
>>
>>
>> On April 26, 2017 at 08:59:16, Ali Nazemian (alinazem...@gmail.com)
>> wrote:
>>
>> Hi all,
>>
>>
>> We are facing certain use cases in Metron production that happen to be
>> related to noisy stream. For example, a wrong timestamp, duplicate
>> hostname/IP address, etc. To deal with the normalization we have added an
>> additional step for the corresponding parsers to do the data cleaning.
>> Clearly, parsing is a standard factor which is mostly related to the
>> device
>> that is generating the data and can be used for the same type of device
>> everywhere, but normalization is very production dependent and there is
>> no
>> point of mixing normalization with parsing. It would be nice to have a
>> separate bolt in a parsing topologies to dedicate to production
>> related cleaning process. In that case, everybody can easily contribute to
>> Metron community with additional parsers without being worried about
>> mixing
>> parsers and data cleaning process.
>>
>>
>> Regards,
>>
>> Ali
>>
>>
>
>
> --
> A.Nazemian
>



-- 
A.Nazemian


Re: Normalization topology or separate normalization bolt for parsing topology

2017-04-26 Thread Ali Nazemian
Currently, we are using plain regexes in the Java source code to handle
those situations. However, it would be nice to have a separate bolt and
deal with them there. Yes, I can create a Jira issue regarding that.
The main reason I am asking for such a feature is that its absence makes
contributing parsers to the community a little painful for us: we need to
maintain two different versions, one for the community and another for the
internal use case. Clearly, noise is an inevitable part of real-world use
cases.

Cheers,
Ali

On Wed, Apr 26, 2017 at 11:04 PM, Otto Fowler 
wrote:

> Hi,
>
> Are you doing this cleansing all in the parser or are you using any
> Stellar to do it?
> Can you create a jira?
>
>
>
> On April 26, 2017 at 08:59:16, Ali Nazemian (alinazem...@gmail.com) wrote:
>
> Hi all,
>
>
> We are facing certain use cases in Metron production that happen to be
> related to noisy stream. For example, a wrong timestamp, duplicate
> hostname/IP address, etc. To deal with the normalization we have added an
> additional step for the corresponding parsers to do the data cleaning.
> Clearly, parsing is a standard factor which is mostly related to the
> device
> that is generating the data and can be used for the same type of device
> everywhere, but normalization is very production dependent and there is no
> point of mixing normalization with parsing. It would be nice to have a
> separate bolt in a parsing topologies to dedicate to production
> related cleaning process. In that case, everybody can easily contribute to
> Metron community with additional parsers without being worried about
> mixing
> parsers and data cleaning process.
>
>
> Regards,
>
> Ali
>
>


-- 
A.Nazemian


Normalization topology or separate normalization bolt for parsing topology

2017-04-26 Thread Ali Nazemian
Hi all,


We are facing certain use cases in Metron production that happen to be
related to noisy streams: for example, a wrong timestamp, a duplicated
hostname/IP address, etc. To deal with the normalization, we have added an
additional step to the corresponding parsers to do the data cleaning.
Clearly, parsing is a standard step which is mostly tied to the device
generating the data and can be reused for the same type of device
everywhere, whereas normalization is very production dependent, so there is
no point in mixing normalization with parsing. It would be nice to have a
separate bolt in the parsing topologies dedicated to the production-related
cleaning process. In that case, everybody could easily contribute
additional parsers to the Metron community without worrying about mixing
parsers with the data-cleaning process.
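A dependency-free sketch of the proposed split, where production-specific cleaning is a pluggable chain that runs ahead of the generic parser (class and method names are invented for illustration; a real implementation would live in its own Storm bolt):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: a chain of production-specific normalizers that runs
// before the (community-maintained) parser, so the parser code stays generic.
public class NormalizerChain {
    private final List<UnaryOperator<String>> steps = new ArrayList<>();

    public NormalizerChain add(UnaryOperator<String> step) {
        steps.add(step);
        return this;
    }

    public String apply(String raw) {
        String msg = raw;
        // Apply each site-specific cleaning rule in order.
        for (UnaryOperator<String> step : steps) {
            msg = step.apply(msg);
        }
        return msg;
    }

    public static void main(String[] args) {
        NormalizerChain chain = new NormalizerChain()
            // Site-specific noise fix: collapse "hostname hostname" before %ASA.
            .add(s -> s.replaceAll("(\\S+) \\1 %ASA", "$1 %ASA"));
        // prints: <166>Apr 12 hostname %ASA-6-302021: ...
        System.out.println(chain.apply("<166>Apr 12 hostname hostname %ASA-6-302021: ..."));
    }
}
```

Each deployment registers only the rules its own data needs, so clean feeds pay no normalisation cost and the community parser never accumulates site-specific workarounds.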


Regards,

Ali


Re: So we graduated...

2017-04-20 Thread Ali Nazemian
That's great! Congratulations, everybody.

On Fri, Apr 21, 2017 at 12:54 PM, Michael Miklavcic <
michael.miklav...@gmail.com> wrote:

> Congrats all
>
> On Apr 20, 2017 8:38 PM, "zeo...@gmail.com"  wrote:
>
> > Well done everybody!  Congrats
> >
> > Jon
> >
> > On Thu, Apr 20, 2017 at 8:55 PM Matt Foley  wrote:
> >
> > > Really exciting!  Congrats to the founding team!
> > > --Matt
> > >
> > >
> > > On 4/20/17, 4:02 PM, "Houshang Livian" 
> wrote:
> > >
> > > Congratulations Team. Great work!
> > >
> > >
> > >
> > >
> > > On 4/20/17, 2:55 PM, "larry mccay"  wrote:
> > >
> > > >Wonderful news and well deserved!
> > > >This community has embraced and committed to the Apache way so
> > > quickly.
> > > >
> > > >
> > > >On Thu, Apr 20, 2017 at 5:39 PM, Kyle Richardson <
> > > kylerichards...@gmail.com>
> > > >wrote:
> > > >
> > > >> That's awesome! Congratulations everyone. Looking forward to the
> > > official
> > > >> announcement on Monday.
> > > >>
> > > >> -Kyle
> > > >>
> > > >> > On Apr 20, 2017, at 5:15 PM, David Lyle  >
> > > wrote:
> > > >> >
> > > >> > Outstanding! Great work everyone. Building a TLP worthy
> > community
> > > is
> > > >> > difficult and worthy work, congratulations all!
> > > >> >
> > > >> > -D...
> > > >> >
> > > >> >> On Thu, Apr 20, 2017 at 5:12 PM, Casey Stella <
> > > ceste...@gmail.com>
> > > >> wrote:
> > > >> >>
> > > >> >> For anyone paying attention to incubator-general, it will
> come
> > > as no
> > > >> >> surprise that we graduated as of last night's board meeting.
> > We
> > > have a
> > > >> >> press released queued up and planned for monday along with a
> PR
> > > >> (METRON-687
> > > >> >> at https://github.com/apache/incubator-metron/pull/539).
> > > >> >>
> > > >> >> It escaped my notice that the graduation was talked about on
> > > >> >> incubator-general; otherwise I'd have sent this email earlier
> > > and been
> > > >> less
> > > >> >> cagey in 687's description.  Even so, I'd like to ask that
> > > everyone
> > > >> keep it
> > > >> >> to themselves until monday morning after the press release
> gets
> > > out the
> > > >> >> door.  I know the cat is out of the bag, but it'd be nice to
> > > have a bit
> > > >> of
> > > >> >> an embargo.
> > > >> >>
> > > >> >> Thanks!
> > > >> >>
> > > >> >> Casey
> > > >> >>
> > > >>
> > >
> > >
> > >
> > > --
> >
> > Jon
> >
>



-- 
A.Nazemian