Re: [Bro-Dev] Writing SumStats plugin

2018-08-07 Thread Jim Mellander
It seems that there's some inconsistency in SumStats plugin usage and
implementation.  There appear to be 2 classes of plugins with differing
calling mechanisms and action:

   1. Item to be measured is in the Key, and the measurement is in
      Observation
      1. These include Average, Last X Observations, Max, Min, Sample,
         Standard Deviation, Sum, Unique, Variance
         1. These are exact measurements.
         2. Some of these have dependencies: StdDev depends on Variance,
            which depends on Average
   2. Item to be measured is in Observation, and the measurement is
      implicitly 1, and the Key is generally null
      1. These include HyperLogLog (number of Unique), TopK (top count)
         1. These are probabilistic data structures.

The Key is not passed to the plugin, but is used to allocate a table that
includes, among other things, the processed observations.  Both classes
call the epoch_result function once per key at the end of the epoch.  Since
class 2 plugins often use a null key, there is only one call to
epoch_result, and a special function is used to extract the results from
the probabilistic data structure (
https://www.bro.org/current/exercises/sumstats/sumstats-5.bro).  The
epoch_finished function is called when all keys have been returned to
finish up.  This is unneeded with this sort of class 2 plugin, since all
the work can be done in the sole call to epoch_result.  Multiple keys could
be used with class 2 plugins, which allows for groupings (
https://www.bro.org/current/exercises/sumstats/sumstats-4.bro).
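
To make the calling pattern concrete, here is a small Python model (not Bro
code) of the epoch lifecycle as described above: a result slot is allocated
per key, epoch_result runs once per key, and epoch_finished runs once at the
end.  The names run_epoch, class1 and class2 are hypothetical, and the exact
set() count merely stands in for a probabilistic structure such as
HyperLogLog.  It mirrors only the description in this message, not the
actual SumStats implementation.

```python
from collections import defaultdict

def run_epoch(observations, epoch_result, epoch_finished):
    """Toy model of one epoch: a result slot is allocated per distinct key."""
    table = defaultdict(list)          # one entry per distinct key
    for key, obs in observations:
        table[key].append(obs)         # the "plugin" sees obs, never the key
    for key, collected in table.items():
        epoch_result(key, collected)   # called once per key
    epoch_finished()                   # called once, after all keys
    return len(table)                  # number of per-key slots allocated

calls = []

# Class 1 style: the measured item is the key, the observation is a number.
class1 = [("10.0.0.1", 3), ("10.0.0.2", 5), ("10.0.0.1", 4)]
slots1 = run_epoch(class1,
                   lambda k, v: calls.append((k, sum(v))),
                   lambda: calls.append("finished"))

# Class 2 style: a single null key, the measured item is the observation.
# len(set(v)) is an exact stand-in for a probabilistic unique count.
class2 = [("", "a.example"), ("", "b.example"), ("", "a.example")]
slots2 = run_epoch(class2,
                   lambda k, v: calls.append((k, len(set(v)))),
                   lambda: calls.append("finished"))
```

In this model the class 1 run allocates one slot per address, while the
class 2 run allocates a single slot and makes a single epoch_result call,
which is why epoch_finished has nothing left to do there.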

I have a use case where I want to pass both a key and measurement to a
plugin maintaining a probabilistic data store [1].  I don't want to
allocate a table for each key, since many/most will not be reflected in the
final results.  Since the Observation is a record containing both a string
& a number, a hack would be to coerce the key to a string, and pass both in
the Observation to a class 2 plugin, with a null key - which is what I am
doing currently.
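
A rough Python sketch of that workaround, under stated assumptions: the
Observation class below only imitates a record with one string and one
number, pack() and NULL_KEY are hypothetical names, and WeightedCounter is
an exact stand-in for the probabilistic store, which is not shown here.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Observation:
    text: str     # stands in for the string field of the Observation record
    num: float    # stands in for the numeric field

NULL_KEY = ""     # hypothetical stand-in for the null SumStats key

def pack(key_fields, measurement):
    """Coerce the would-be key to a string and carry it in the observation."""
    return NULL_KEY, Observation(text="|".join(key_fields), num=measurement)

class WeightedCounter:
    """Exact stand-in for the probabilistic store: keeps a total per item."""
    def __init__(self):
        self.totals = Counter()

    def observe(self, obs):
        self.totals[obs.text] += obs.num

store = {}  # framework-side table: one slot per key
samples = [(("10.0.0.1", "80"), 2.0),
           (("10.0.0.2", "443"), 1.0),
           (("10.0.0.1", "80"), 3.0)]
for fields, n in samples:
    key, obs = pack(fields, n)
    store.setdefault(key, WeightedCounter()).observe(obs)
```

Only one table slot is ever allocated, no matter how many distinct packed
keys arrive; the cost is that the key has to be parsed back out of the
string when results are extracted.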

It would be nice to have a conversation on how to unify these two classes
of plugins.  A few thoughts on this:

   - Pass Key to the plugins - maybe Key could be added to the Observation
   structure.
   - Provide a mechanism to *not* allocate the table structure with every
   new Key (this and the previous can possibly be done with some hackiness
   with the normalize_key function in the reducer record)
   - Some sort of epoch_result factory function that by default just
   performs the class 1 plugin behavior.  For class 2 plugins, the function
   would feed the results one by one into epoch_result.

Incidentally, I think there's a bug in the observe() function.

These two lines are run in the loop through the reducers:

   if ( r?$normalize_key )
      key = r$normalize_key(copy(key));

which has the effect of modifying the key for subsequent iterations, rather
than just for the one reducer it applies to.  The fix is easy and obvious.
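
For illustration, here is a Python model of the problem (not the Bro source
itself).  The function names and the dict-based reducers are hypothetical;
observe_fixed shows one possible fix, namely normalizing into a per-reducer
local instead of overwriting the loop-wide key.

```python
import copy

def observe_buggy(reducers, key):
    """Mimics the quoted loop: the normalized key overwrites `key`."""
    seen = []
    for r in reducers:
        if r.get("normalize_key"):
            key = r["normalize_key"](copy.copy(key))
        seen.append((r["name"], key))   # later reducers inherit the change
    return seen

def observe_fixed(reducers, orig_key):
    """One possible fix: keep the original key and normalize per reducer."""
    seen = []
    for r in reducers:
        normalize = r.get("normalize_key")
        key = normalize(copy.copy(orig_key)) if normalize else orig_key
        seen.append((r["name"], key))
    return seen

reducers = [
    {"name": "r1", "normalize_key": lambda k: k.lower()},
    {"name": "r2"},  # no normalizer: should see the original key
]

buggy = observe_buggy(reducers, "Host.Example")
fixed = observe_fixed(reducers, "Host.Example")
```

In the buggy variant r2 receives the lowercased key even though it defines
no normalizer; in the fixed variant it receives the key as passed in.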

Jim


[1] Implementation of algorithms 4&5 (with enhancements) of
https://arxiv.org/pdf/1705.07001.pdf



On Thu, Aug 2, 2018 at 4:44 PM, Jim Mellander wrote:

> Hi all:
>
> I'm thinking of writing a SumStats plugin, probably with the initial
> implementation in bro scriptland, with a re-implementation as BIFs if
> initial tests are successful.
>
> From examining several plugins, it appears that I need to:
>
>    - Add NAME of my plugin as an enum to Calculation
>    - Add optional tunables to Reducer
>    - Add my data structure to ResultVal
>    - In register_observe_plugins, register the function to take an
>      observation.
>    - In init_result_val_hook, add code to initialize data structure.
>    - In compose_resultvals_hook, add code to merge multiple data
>      structures
>    - Create function to extract from data structure either at
>      epoch_result, or epoch_finished
>
> Anything else I should be aware of?
>
> Thanks in advance,
>
> Jim
>
>
>
>
>
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


Re: [Bro-Dev] Broker::publish API

2018-08-07 Thread Jon Siwek
On Mon, Aug 6, 2018 at 3:00 PM Robin Sommer wrote:

> Overall I have to say I found it pretty hard to follow this all
> because we don't have much consistency right now in how scripts
> structure their communication. That's not surprising, given that we're
> just starting to use all this, but it suggests that we have room for
> improvement in our abstractions. :)

How much is due to new API usage and how much is due to things mainly
being a direct port of old communication patterns (which I guess were
written by various people over extended lengths of time, so
inconsistencies are to be expected)?  Or due to being a mishmash of both
old and new?

- Jon


Re: [Bro-Dev] Broker::publish API

2018-08-07 Thread Jon Siwek
On Mon, Aug 6, 2018 at 1:57 PM Robin Sommer wrote:

> I have another question about this specific case: we use relay_rr()
> only for sending Intel::insert_indicator. Intel::remove_indicator gets
> published normally through auto_publish(). Why the difference?

Potentially no reason other than that no one reviewed whether it had
potential to be optimized in a similar way.  That is, I first ported
scripts in a direct fashion without trying to change too much
structurally about communication patterns or doing any optimization,
except in cases where a change was specifically talked about.  I only
recall that Justin had called out Intel::insert_indicator, so it got
changed.

- Jon


Re: [Bro-Dev] Broker::publish API

2018-08-07 Thread Jan Grashöfer
To be honest, I have somehow lost track of the discussion. As far as I
can recall, it's about simplifying the API in the light of multi-hop
routing, which is not fully functional yet.

Regarding multi-hop routing, I am not even sure what the actual goal is
that we are currently aiming at. However, from a conceptual perspective,
I think "routing" needs either routing algorithms or strict conventions
for how the network that messages are routed through is structured. So,
what would a "deep cluster" look like and what kind of message flows do
we expect in there?

Some comments on the observations:

On 06/08/18 21:50, Robin Sommer wrote:
>  - The main topics are bro/cluster/ and
>    bro/cluster/node/. For these we wouldn't have a problem
>    with loops if we enabled automatic, topic-driven forwarding as
>    far as I can see.

How does forwarding work if I add another node type? Do we assume a 
certain cluster structure here? If yes: Is that a valid assumption?

>  - bro/cluster/broadcast seems to be the main case with a looping
>    problem, because everybody subscribes to it. It's hardly used
>    though. (bro/config/change is used similarly though).

The topic concept is a multicast scheme, isn't it? Having a broadcast
functionality on top of that feels odd. However, it's limited to the
cluster topic. This leads me to the question of which domains we operate
on. If I think of messages, I start to think about a cluster, but that
might be only one domain of application. I think it would be good to
define the layers of abstraction more precisely here.
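
To illustrate why a topic that everybody subscribes to is loop-prone under
naive forwarding, here is a toy Python simulation. It is not Broker's
forwarding logic: the flood() function, the three-node full mesh, the hop
cap and the dedup flag are all assumptions made for the sketch.

```python
from collections import deque

def flood(peers, subscribers, origin, dedup, max_hops=6):
    """Forward a message to every subscribed peer except the one it came from.

    Returns the number of deliveries made before forwarding stops, or before
    max_hops cuts off a message that would otherwise keep circulating.
    """
    deliveries = 0
    seen = {origin}
    queue = deque([(origin, None, 0)])      # (node, came_from, hops)
    while queue:
        node, came_from, hops = queue.popleft()
        if hops == max_hops:
            continue                        # hop cap reached, drop it
        for peer in peers[node]:
            if peer == came_from or peer not in subscribers:
                continue
            if dedup and peer in seen:
                continue                    # peer already has the message
            seen.add(peer)
            deliveries += 1
            queue.append((peer, node, hops + 1))
    return deliveries

# Hypothetical full mesh in which every node subscribes to the broadcast topic.
peers = {
    "manager": ["proxy", "worker"],
    "proxy": ["manager", "worker"],
    "worker": ["manager", "proxy"],
}
everyone = set(peers)

looping = flood(peers, everyone, "manager", dedup=False)
deduped = flood(peers, everyone, "manager", dedup=True)
```

Without deduplication the message keeps circulating until the hop cap stops
it; with it, each of the two other nodes receives the message exactly once.
Either a convention about the cluster structure or some such per-message
state is needed to get the second behavior.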

>  - There are a couple of script-specific topics where I'm wondering
>    if these could switch to using bro/cluster/ instead
>    (bro/intel/*, bro/irc/dcc_transfer_update). In other words: when
>    clusterizing scripts, prefer not to introduce new topics.

From my understanding, this would mean going back to the old
communication patterns. What's the point of having topics if we don't
use them?

>  - There are a lot of checks in publishing code of the type "if I am
>    (not) of node type X".

That's something I would have expected. I don't think this is
necessarily an indicator of bad design. Having these kinds of checks
means that roles are somehow fixed and responsibilities are explicitly
codified.

>  - Pools are used for two different things: 1. the known-* scripts
>    pick a proxy to process and log the information; whereas 2. the
>    Intel scripts pick a proxy just as a relay to broadcast stuff
>    out, reducing load. That 1st application is a good one, but the
>    2nd feels like it should be handled differently.

I think we should be careful about introducing too many abstractions.
Communication patterns tend to be complex, and the more of the complexity
is hidden, the easier it will be to generate misunderstandings. For
example, in the case of the intel framework, proxy nodes might at some
point be able to implement more logic than just relaying. Having the
relay abstraction would then mean dealing with two different levels of
abstraction regarding intel on proxy nodes.

> Overall I have to say I found it pretty hard to follow this all
> because we don't have much consistency right now in how scripts
> structure their communication. That's not surprising, given that we're
> just starting to use all this, but it suggests that we have room for
> improvement in our abstractions. :)

I totally agree here! I think it could help to come up with some more 
use cases to identify the best abstractions.

Jan