Re: [DISCUSS] Alternatives to split/join enrichment

2018-02-22 Thread Casey Stella
FYI, the PR for this is up at https://github.com/apache/metron/pull/940
For those interested, please comment on the actual implementation there.

On Thu, Feb 22, 2018 at 12:43 PM, Casey Stella  wrote:

> So, these are good questions, as usual Otto :)
>
> > how does this affect the distribution of work through the cluster, and
> resiliency of the topologies?
>
> This moves us to a data parallelism scheme rather than a task parallelism
> scheme.  This, in effect, means that we will not be distributing the
> partial enrichments across the network for a given message, but rather
> distributing the messages across the network for *full* enrichment.  So,
> the bundle of work is the same, but we're not concentrating capabilities in
> specific workers.  Then again, as soon as we moved to stellar enrichments
> and sub-groups where you can interact with hbase or geo from within
> stellar, we sorta abandoned specialization.  Resiliency shouldn't be
> affected and, indeed, it should be easier to reason about.  We ack after
> every bolt in the new scheme rather than avoid acking until we join and ack
> the original tuple.  In fact, I'm still not convinced there's not a bug
> somewhere in that join bolt that makes it so we don't ack the right tuple.
>
> > Is anyone else doing it like this?
>
> The stormy way of doing this is to specialize in the bolts and join, no
> doubt, in a fan-out/fan-in pattern.  I do not think it's unheard of,
> though, to use a threadpool.  It's slightly peculiar inasmuch as storm has
> its own threading model, but it is an embarrassingly parallel task and the
> main shift is trading the unit of parallelism from enrichment task to
> message, with the gain of fewer network hops.  That being said, as long as
> you're not emitting from a different thread than the one you are receiving on,
> there's no technical limitation.
>
> > Can we have multiple thread pools and group tasks together ( or separate
> them ) wrt hbase?
>
> We could, but I think we might consider starting with just a simple static
> threadpool that we configure at the topology level (e.g. multiple worker
> threads in the same process can share the same threadpool).  I think as
> the trend of moving everything to stellar continues, we may end up in a
> situation where we don't have a coherent or clear way to differentiate
> between thread pools like we do now.
>
> > Also, how are we to measure the effect?
>
> Well, some of the benefits here are at an architectural/feature level, the
> most exciting of which is that this approach opens up avenues for stellar
> subgroups to depend on each other.  Slightly less exciting, but still nice
> is the fact that this normalizes us with *other* streaming technologies and
> the decoupling work done as part of the PR (soon to be released) will make
> it easy to transition if we so desire.  Beyond that, for performance,
> someone will have to run some performance tests or try it out in a
> situation where they're having some enrichment performance issues.  Until
> we do that, I think we should probably just keep it as a parallel approach
> that you can swap out if you so desire.
>
> On Thu, Feb 22, 2018 at 11:48 AM, Otto Fowler 
> wrote:
>
>> This sounds worth exploring.  A couple of questions:
>>
>> * how does this affect the distribution of work through the cluster, and
>> resiliency of the topologies?
>> * Is anyone else doing it like this?
>> * Can we have multiple thread pools and group tasks together ( or
>> separate them ) wrt hbase?
>>
>>
>>
>> On February 22, 2018 at 11:32:39, Casey Stella (ceste...@gmail.com)
>> wrote:
>>
>> Hi all,
>>
>> I've been thinking and working on something that I wanted to get some
>> feedback on. The way that we do our enrichments, the split/join
>> architecture was created to effectively parallelize enrichments in a
>> storm-like way, in contrast to OpenSoc.
>>
>> There are some good parts to this architecture:
>>
>> - It works, enrichments are done in parallel
>> - You can tune individual enrichments differently
>> - It's very storm-like
>>
>> There are also some deficiencies:
>>
>> - It's hard to reason about
>> - Understanding the latency of enriching a message requires looking
>> at multiple bolts that each give summary statistics
>> - The join bolt's cache is really hard to reason about when performance
>> tuning
>> - During spikes in traffic, you can overload the join bolt's cache
>> and drop messages if you aren't careful
>> - In general, it's hard to associate a cache size and a duration kept
>> in cache with throughput and latency
>> - There are a lot of network hops per message
>> - Right now we are stuck at 2 stages of transformations being done
>> (enrichment and threat intel). It's very possible that you might want
>> stellar enrichments to depend on the output of other stellar enrichments.
>> In order to implement this in split/join you'd have to create a cycle in
>> the storm topology
>>
>> I propose a change. I propose that we move to a model where we do
>> enrichments in a single bolt in parallel using a static threadpool.

Re: [DISCUSS] Alternatives to split/join enrichment

2018-02-22 Thread Casey Stella
So, these are good questions, as usual Otto :)

> how does this affect the distribution of work through the cluster, and
resiliency of the topologies?

This moves us to a data parallelism scheme rather than a task parallelism
scheme.  This, in effect, means that we will not be distributing the
partial enrichments across the network for a given message, but rather
distributing the messages across the network for *full* enrichment.  So,
the bundle of work is the same, but we're not concentrating capabilities in
specific workers.  Then again, as soon as we moved to stellar enrichments
and sub-groups where you can interact with hbase or geo from within
stellar, we sorta abandoned specialization.  Resiliency shouldn't be
affected and, indeed, it should be easier to reason about.  We ack after
every bolt in the new scheme rather than avoid acking until we join and ack
the original tuple.  In fact, I'm still not convinced there's not a bug
somewhere in that join bolt that makes it so we don't ack the right tuple.

> Is anyone else doing it like this?

The stormy way of doing this is to specialize in the bolts and join, no
doubt, in a fan-out/fan-in pattern.  I do not think it's unheard of,
though, to use a threadpool.  It's slightly peculiar inasmuch as storm has
its own threading model, but it is an embarrassingly parallel task and the
main shift is trading the unit of parallelism from enrichment task to
message, with the gain of fewer network hops.  That being said, as long as
you're not emitting from a different thread than the one you are receiving on,
there's no technical limitation.
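
To make that concrete, here is a rough sketch (not the PR code) of what that
pattern looks like in a bolt.  The Enrichment interface, the "message" field
name, and the pool sizing are made up for illustration; the important bit is
that the futures are joined inside execute(), so the emit and the ack still
happen on the bolt's own executor thread.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Illustrative only: run every enrichment for a message in parallel on a pool,
// but join, emit and ack on the bolt's own executor thread.
public class UnifiedEnrichmentBolt extends BaseRichBolt {

  // Hypothetical stand-in for one independent enrichment (geo, hbase, a stellar subgroup, ...)
  public interface Enrichment extends java.io.Serializable {
    Map<String, Object> enrich(Map<String, Object> message) throws Exception;
  }

  private final List<Enrichment> enrichments;
  private transient ExecutorService pool;
  private transient OutputCollector collector;

  public UnifiedEnrichmentBolt(List<Enrichment> enrichments) {
    this.enrichments = enrichments;
  }

  @Override
  public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
    // placeholder: in the proposal this would be the single static pool shared by
    // every executor in the worker JVM (see the holder sketched further down)
    this.pool = Executors.newFixedThreadPool(10);
  }

  @Override
  public void execute(Tuple tuple) {
    @SuppressWarnings("unchecked")
    Map<String, Object> message = (Map<String, Object>) tuple.getValueByField("message");
    try {
      // fan out: one task per enrichment on the shared pool
      List<Future<Map<String, Object>>> futures = new ArrayList<>();
      for (final Enrichment e : enrichments) {
        Callable<Map<String, Object>> task = () -> e.enrich(message);
        futures.add(pool.submit(task));
      }
      // fan in: block right here, still on the executor thread, and merge the results
      for (Future<Map<String, Object>> f : futures) {
        message.putAll(f.get());
      }
      // emit and ack from the receiving thread, once per message
      collector.emit(tuple, new Values(message));
      collector.ack(tuple);
    } catch (Exception ex) {
      collector.reportError(ex);
      collector.fail(tuple);
    }
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("message"));
  }
}

The blocking f.get() calls are what keep Storm happy: no emits ever come from
the pool's threads.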

> Can we have multiple thread pools and group tasks together ( or separate
them ) wrt hbase?

We could, but I think we might consider starting with just a simple static
threadpool that we configure at the topology level (e.g. multiple worker
threads in the same process can share the same threadpool).  I think as
the trend of moving everything to stellar continues, we may end up in a
situation where we don't have a coherent or clear way to differentiate
between thread pools like we do now.
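
Again just a sketch of what "configure at the topology level" could look like;
the holder class and the config key name below are made up for illustration,
not what the PR does.  The first executor in a worker JVM to call getInstance()
builds the pool from the topology config, and every later caller in that JVM
gets the same instance.

import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative worker-wide pool holder: one static pool per worker JVM,
// sized from a topology-level config value.
public class EnrichmentThreadPool {
  // hypothetical setting, e.g. pushed in through the flux properties
  public static final String SIZE_CONF = "enrichment.threadpool.num.threads";

  private static volatile ExecutorService instance = null;

  private EnrichmentThreadPool() { }

  public static ExecutorService getInstance(Map stormConf) {
    if (instance == null) {
      synchronized (EnrichmentThreadPool.class) {
        if (instance == null) {
          Object size = stormConf.get(SIZE_CONF);
          int numThreads = size == null ? 10 : Integer.parseInt(size.toString());
          instance = Executors.newFixedThreadPool(numThreads);
        }
      }
    }
    return instance;
  }
}

A per-enrichment-type pool (the hbase grouping idea) would just be a map of
these keyed by name, but as above, it's not clear that distinction stays
meaningful once everything moves to stellar.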

> Also, how are we to measure the effect?

Well, some of the benefits here are at an architectural/feature level, the
most exciting of which is that this approach opens up avenues for stellar
subgroups to depend on each other.  Slightly less exciting, but still nice
is the fact that this normalizes us with *other* streaming technologies and
the decoupling work done as part of the PR (soon to be released) will make
it easy to transition if we so desire.  Beyond that, for performance,
someone will have to run some performance tests or try it out in a
situation where they're having some enrichment performance issues.  Until
we do that, I think we should probably just keep it as a parallel approach
that you can swap out if you so desire.

On Thu, Feb 22, 2018 at 11:48 AM, Otto Fowler 
wrote:

> This sounds worth exploring.  A couple of questions:
>
> * how does this affect the distribution of work through the cluster, and
> resiliency of the topologies?
> * Is anyone else doing it like this?
> * Can we have multiple thread pools and group tasks together ( or separate
> them ) wrt hbase?
>
>
>
> On February 22, 2018 at 11:32:39, Casey Stella (ceste...@gmail.com) wrote:
>
> Hi all,
>
> I've been thinking and working on something that I wanted to get some
> feedback on. The way that we do our enrichments, the split/join
> architecture was created to effectively parallelize enrichments in a
> storm-like way, in contrast to OpenSoc.
>
> There are some good parts to this architecture:
>
> - It works, enrichments are done in parallel
> - You can tune individual enrichments differently
> - It's very storm-like
>
> There are also some deficiencies:
>
> - It's hard to reason about
> - Understanding the latency of enriching a message requires looking
> at multiple bolts that each give summary statistics
> - The join bolt's cache is really hard to reason about when performance
> tuning
> - During spikes in traffic, you can overload the join bolt's cache
> and drop messages if you aren't careful
> - In general, it's hard to associate a cache size and a duration kept
> in cache with throughput and latency
> - There are a lot of network hops per message
> - Right now we are stuck at 2 stages of transformations being done
> (enrichment and threat intel). It's very possible that you might want
> stellar enrichments to depend on the output of other stellar enrichments.
> In order to implement this in split/join you'd have to create a cycle in
> the storm topology
>
> I propose a change. I propose that we move to a model where we do
> enrichments in a single bolt in parallel using a static threadpool (e.g.
> multiple workers in the same process would share the threadpool). In all
> other ways, this would be backwards compatible: a transparent drop-in for
> the existing enrichment topology.
>
> There are some pros/cons about this too:
>
> - Pro
>    - Easier to reason about from an individual message perspective

Re: [DISCUSS] Alternatives to split/join enrichment

2018-02-22 Thread Otto Fowler
Also, how are we to measure the effect?  Not to get all six sigma ;)


On February 22, 2018 at 11:48:41, Otto Fowler (ottobackwa...@gmail.com)
wrote:

This sounds worth exploring.  A couple of questions:

* how does this affect the distribution of work through the cluster, and
resiliency of the topologies?
* Is anyone else doing it like this?
* Can we have multiple thread pools and group tasks together ( or separate
them ) wrt hbase?



On February 22, 2018 at 11:32:39, Casey Stella (ceste...@gmail.com) wrote:

Hi all,

I've been thinking and working on something that I wanted to get some
feedback on. The way that we do our enrichments, the split/join
architecture was created to effectively parallelize enrichments in a
storm-like way, in contrast to OpenSoc.

There are some good parts to this architecture:

- It works, enrichments are done in parallel
- You can tune individual enrichments differently
- It's very storm-like

There are also some deficiencies:

- It's hard to reason about
- Understanding the latency of enriching a message requires looking
at multiple bolts that each give summary statistics
- The join bolt's cache is really hard to reason about when performance
tuning
- During spikes in traffic, you can overload the join bolt's cache
and drop messages if you aren't careful
- In general, it's hard to associate a cache size and a duration kept
in cache with throughput and latency
- There are a lot of network hops per message
- Right now we are stuck at 2 stages of transformations being done
(enrichment and threat intel). It's very possible that you might want
stellar enrichments to depend on the output of other stellar enrichments.
In order to implement this in split/join you'd have to create a cycle in
the storm topology

I propose a change. I propose that we move to a model where we do
enrichments in a single bolt in parallel using a static threadpool (e.g.
multiple workers in the same process would share the threadpool). In all
other ways, this would be backwards compatible: a transparent drop-in for
the existing enrichment topology.

There are some pros/cons about this too:

- Pro
   - Easier to reason about from an individual message perspective
   - Architecturally decoupled from Storm
      - This sets us up if we want to consider other streaming technologies
   - Fewer bolts
      - spout -> enrichment bolt -> threatintel bolt -> output bolt
   - Way fewer network hops per message
      - currently 2n+1 where n is the number of enrichments used (if using
        stellar subgroups, each subgroup is a hop)
   - Easier to reason about from a performance perspective
      - We trade cache size and eviction timeout for threadpool size
   - We set ourselves up to have stellar subgroups with dependencies
      - i.e. stellar subgroups that depend on the output of other subgroups
      - If we do this, we can shrink the topology to just spout ->
        enrichment/threat intel -> output
- Con
   - We can no longer tune stellar enrichments independently of HBase
     enrichments
      - To be fair, with enrichments moving to stellar, this is the case in
        the split/join approach too
   - No idea about performance


What I propose is to submit a PR that will deliver an alternative,
completely backwards compatible topology for enrichment that you can use by
adjusting the start_enrichment_topology.sh script to use
remote-unified.yaml instead of remote.yaml. If we live with it for a while
and have some good experiences with it, maybe we can consider retiring the
old enrichment topology.
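
For anyone who wants to picture the unified shape concretely: the real
topology would be defined in the flux file (remote-unified.yaml), but a toy
Java wiring of spout -> enrichment -> threat intel -> output looks roughly
like the following.  The component names, parallelism numbers, and the
pass-through bolts are placeholders for illustration, not Metron classes.

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Toy wiring of the unified shape: spout -> enrichment -> threat intel -> output.
public class UnifiedShapeTopology {

  // Placeholder that just forwards its input; stands in for the real bolts.
  public static class PassThroughBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      collector.emit(new Values(tuple.getValue(0)));
    }
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("message"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    // in Metron this would be the kafka spout; a test spout keeps the sketch self-contained
    builder.setSpout("kafkaSpout", new TestWordSpout(), 1);
    builder.setBolt("enrichmentBolt", new PassThroughBolt(), 4)
           .localOrShuffleGrouping("kafkaSpout");
    builder.setBolt("threatIntelBolt", new PassThroughBolt(), 4)
           .localOrShuffleGrouping("enrichmentBolt");
    builder.setBolt("outputBolt", new PassThroughBolt(), 2)
           .localOrShuffleGrouping("threatIntelBolt");

    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("unified-enrichment-shape", new Config(), builder.createTopology());
    Thread.sleep(10_000);
    cluster.shutdown();
  }
}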

Thoughts? Keep me honest; if I have over- or understated the issues for
split/join or missed some important architectural issue let me know. I'm
going to submit a PR to this effect by the EOD today so things will be more
obvious.


Re: [DISCUSS] Alternatives to split/join enrichment

2018-02-22 Thread Otto Fowler
This sounds worth exploring.  A couple of questions:

* how does this affect the distribution of work through the cluster, and
resiliency of the topologies?
* Is anyone else doing it like this?
* Can we have multiple thread pools and group tasks together ( or separate
them ) wrt hbase?



On February 22, 2018 at 11:32:39, Casey Stella (ceste...@gmail.com) wrote:

Hi all,

I've been thinking and working on something that I wanted to get some
feedback on. The way that we do our enrichments, the split/join
architecture was created to effectively parallelize enrichments in a
storm-like way, in contrast to OpenSoc.

There are some good parts to this architecture:

- It works, enrichments are done in parallel
- You can tune individual enrichments differently
- It's very storm-like

There are also some deficiencies:

- It's hard to reason about
- Understanding the latency of enriching a message requires looking
at multiple bolts that each give summary statistics
- The join bolt's cache is really hard to reason about when performance
tuning
- During spikes in traffic, you can overload the join bolt's cache
and drop messages if you aren't careful
- In general, it's hard to associate a cache size and a duration kept
in cache with throughput and latency
- There are a lot of network hops per message
- Right now we are stuck at 2 stages of transformations being done
(enrichment and threat intel). It's very possible that you might want
stellar enrichments to depend on the output of other stellar enrichments.
In order to implement this in split/join you'd have to create a cycle in
the storm topology

I propose a change. I propose that we move to a model where we do
enrichments in a single bolt in parallel using a static threadpool (e.g.
multiple workers in the same process would share the threadpool). In all
other ways, this would be backwards compatible: a transparent drop-in for
the existing enrichment topology.

There are some pros/cons about this too:

- Pro
   - Easier to reason about from an individual message perspective
   - Architecturally decoupled from Storm
      - This sets us up if we want to consider other streaming technologies
   - Fewer bolts
      - spout -> enrichment bolt -> threatintel bolt -> output bolt
   - Way fewer network hops per message
      - currently 2n+1 where n is the number of enrichments used (if using
        stellar subgroups, each subgroup is a hop)
   - Easier to reason about from a performance perspective
      - We trade cache size and eviction timeout for threadpool size
   - We set ourselves up to have stellar subgroups with dependencies
      - i.e. stellar subgroups that depend on the output of other subgroups
      - If we do this, we can shrink the topology to just spout ->
        enrichment/threat intel -> output
- Con
   - We can no longer tune stellar enrichments independently of HBase
     enrichments
      - To be fair, with enrichments moving to stellar, this is the case in
        the split/join approach too
   - No idea about performance


What I propose is to submit a PR that will deliver an alternative,
completely backwards compatible topology for enrichment that you can use by
adjusting the start_enrichment_topology.sh script to use
remote-unified.yaml instead of remote.yaml. If we live with it for a while
and have some good experiences with it, maybe we can consider retiring the
old enrichment topology.

Thoughts? Keep me honest; if I have over- or understated the issues for
split/join or missed some important architectural issue let me know. I'm
going to submit a PR to this effect by the EOD today so things will be more
obvious.