Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-16 Thread Casey Stella

The problem with throwing up a warning is that this is a sensor-specific
configuration and the Indexing topology does not know at topology start
time all of the sensors.  Furthermore, you can start a new sensor in the
middle of a running topology.

I'd suggest a compromise and have the indexing topology use the defaults
for writing and log a warning to the Storm UI about the sensor not being
explicitly stated and recommend remedying that immediately.
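
A minimal sketch of this compromise, assuming each sensor maps to a dictionary of per-writer configs. The names `DEFAULT_WRITER_CONFIG` and `resolve_writer_config`, and the default values themselves, are hypothetical and not taken from the Metron code base; the real topology would surface the warning through Storm rather than Python's `logging` module.

```python
import logging

logger = logging.getLogger("indexing")

# Hypothetical defaults; the real values would come from the platform.
DEFAULT_WRITER_CONFIG = {"enabled": True, "batchSize": 1}


def resolve_writer_config(sensor, writer, sensor_configs):
    """Return the effective config for one writer of one sensor.

    Falls back to the defaults, with a warning, when the sensor has no
    explicit entry for the writer.
    """
    explicit = sensor_configs.get(sensor, {}).get(writer)
    if explicit is None:
        logger.warning(
            "No explicit %s writer config for sensor '%s'; using defaults. "
            "Please add one.", writer, sensor)
        return dict(DEFAULT_WRITER_CONFIG)
    # Explicit settings override the defaults key by key.
    return {**DEFAULT_WRITER_CONFIG, **explicit}
```

Because the lookup happens per sensor at resolution time, a sensor started in the middle of a running topology would take the same path as one known at startup.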

On Mon, Jan 16, 2017 at 4:01 PM, James Sirota  wrote:

> The explicit on/off seems like a good option to have.  This way I don't
> have to completely remove the config block in order for me to test
> something.  I think if the config for the writer is unspecified we should
> throw up a warning.
>
> 16.01.2017, 08:54, "Nick Allen" :
> >>  To recap, what I am +1 on is Nick's proposed syntax with the following
> >>  modifications:
> >>  1. An explicit enabled field
> >>  2. A default on for unspecified to match current semantics
> >
> > I'm +1 on all of this.
> >
> > On Sat, Jan 14, 2017 at 10:51 AM, Casey Stella 
> wrote:
> >
> >>  I'm +1 on an explicit enabled property and a filter (or when)
> property. I
> >>  think we are zeroing in on a decent design, so that is good.
> >>
> >>  To recap, what I am +1 on is Nick's proposed syntax with the following
> >>  modifications:
> >>  1. An explicit enabled field
> >>  2. A default on for unspecified to match current semantics
> >>
> >>  Casey
> >>  On Sat, Jan 14, 2017 at 10:45 zeo...@gmail.com 
> wrote:
> >>
> >>  > This has the additional benefit of doing something like below when
> you
> >>  want
> >>  > to temporarily disable the hdfs writer, but don't want to remove the
> >>  > settings. This removes the need to store the path and batchSize (and
> >>  many
> >>  > additional settings) somewhere else so they can be brought back in
> when
> >>  you
> >>  > want to re-enable it, which is a nice workflow attribute for the end
> >>  user:
> >>  >
> >>  > {
> >>  > 'elasticsearch': {
> >>  > 'enabled': 'true',
> >>  > 'index': 'foo',
> >>  > 'batchSize': 100,
> >>  > },
> >>  > 'hdfs': {
> >>  > 'enabled': 'false',
> >>  > 'path': '/foo/bar/...',
> >>  > 'batchSize': 100,
> >>  > },
> >>  > 'solr': {
> >>  > 'enabled': 'false'
> >>  > }
> >>  > }
> >>  >
> >>  > Jon
> >>  >
> >>  > On Sat, Jan 14, 2017 at 9:24 AM zeo...@gmail.com 
> >>  wrote:
> >>  >
> >>  > > I similarly have a concern there because I prefer being as
> explicit as
> >>  > > possible, which makes things easier to pick up for new users.
> Using my
> >>  > > example from earlier this could look like specifying while(false),
> but
> >>  an
> >>  > > even better and more obvious approach may be to use
> enabled(false). So
> >>  > the
> >>  > > current simple default would be:
> >>  > >
> >>  > > {
> >>  > > 'elasticsearch': { 'enabled': 'true' },
> >>  > > 'hdfs': { 'enabled': 'true' },
> >>  > > 'solr': { 'enabled': 'false' }
> >>  > > }
> >>  > >
> >>  > > And to use ES with some overrides but not HDFS or solr it would
> look
> >>  > like:
> >>  > >
> >>  > > {
> >>  > > 'elasticsearch': {
> >>  > > 'enabled': 'true',
> >>  > > 'index': 'foo',
> >>  > > 'batchSize': 100
> >>  > > },
> >>  > > 'hdfs': {
> >>  > > 'enabled': 'false'
> >>  > > },
> >>  > > 'solr': {
> >>  > > 'enabled': 'false'
> >>  > > }
> >>  > > }
> >>  > >
> >>  > > Jon
> >>  > >
> >>  > > On Fri, Jan 13, 2017 at 10:21 PM Casey Stella 
> >>  > wrote:
> >>  > >
> >>  > > One thing that I thought of that I very strenuously do not like in
> Nick's
> >>  > > proposal is that if a writer config is not specified then it is
> turned
> >>  > off
> >>  > > (I think; if I misunderstood let me know). In the situation where
> we
> >>  > have a
> >>  > > new sensor, right now if there are no index config and no
> enrichment
> >>  > > config, it still passes through to the index using defaults. In
> this
> >>  new
> >>  > > scheme it would not. This changes the default semantics for the
> system
> >>  > and
> >>  > > I think it changes it for the worse.
> >>  > >
> >>  > > I would strongly prefer an on-by-default indexing config as we have
> now.
> >>  > > On Fri, Jan 13, 2017 at 17:13 Casey Stella 
> wrote:
> >>  > >
> >>  > > > One thing that I really like about Nick's suggestion is that it
> >>  allows
> >>  > > > writer-specific configs in a clear and simple way. It is more
> >>  complex
> >>  > > for
> >>  > > > the default case (all writers write to indices named the same
> thing
> >>  > with
> >>  > > a
> >>  > > > fixed batch size), which I do not like, but maybe it's worth the
> >>  > > compromise
> >>  > > > to make it less complex for the advanced case.
> >>  > > >
> >>  > > > Thanks a lot for the suggestion, Nick, it's interesting; I'm
> >>  beginning
> >>  > > to
> >>  > > > lean your way.
> >>  > > >
> >>  > > > On Fri, Jan 13, 2017 at 2:51 PM,

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-16 Thread James Sirota

The explicit on/off seems like a good option to have.  This way I don't have to 
completely remove the config block in order for me to test something.  I think 
if the config for the writer is unspecified we should throw up a warning.  

16.01.2017, 08:54, "Nick Allen" :
>>  To recap, what I am +1 on is Nick's proposed syntax with the following
>>  modifications:
>>  1. An explicit enabled field
>>  2. A default on for unspecified to match current semantics
>
> I'm +1 on all of this.
>
> On Sat, Jan 14, 2017 at 10:51 AM, Casey Stella  wrote:
>
>>  I'm +1 on an explicit enabled property and a filter (or when) property. I
>>  think we are zeroing in on a decent design, so that is good.
>>
>>  To recap, what I am +1 on is Nick's proposed syntax with the following
>>  modifications:
>>  1. An explicit enabled field
>>  2. A default on for unspecified to match current semantics
>>
>>  Casey
>>  On Sat, Jan 14, 2017 at 10:45 zeo...@gmail.com  wrote:
>>
>>  > This has the additional benefit of doing something like below when you
>>  want
>>  > to temporarily disable the hdfs writer, but don't want to remove the
>>  > settings. This removes the need to store the path and batchSize (and
>>  many
>>  > additional settings) somewhere else so they can be brought back in when
>>  you
>>  > want to re-enable it, which is a nice workflow attribute for the end
>>  user:
>>  >
>>  > {
>>  > 'elasticsearch': {
>>  > 'enabled': 'true',
>>  > 'index': 'foo',
>>  > 'batchSize': 100,
>>  > },
>>  > 'hdfs': {
>>  > 'enabled': 'false',
>>  > 'path': '/foo/bar/...',
>>  > 'batchSize': 100,
>>  > },
>>  > 'solr': {
>>  > 'enabled': 'false'
>>  > }
>>  > }
>>  >
>>  > Jon
>>  >
>>  > On Sat, Jan 14, 2017 at 9:24 AM zeo...@gmail.com 
>>  wrote:
>>  >
>>  > > I similarly have a concern there because I prefer being as explicit as
>>  > > possible, which makes things easier to pick up for new users. Using my
>>  > > example from earlier this could look like specifying while(false), but
>>  an
>>  > > even better and more obvious approach may be to use enabled(false). So
>>  > the
>>  > > current simple default would be:
>>  > >
>>  > > {
>>  > > 'elasticsearch': { 'enabled': 'true' },
>>  > > 'hdfs': { 'enabled': 'true' },
>>  > > 'solr': { 'enabled': 'false' }
>>  > > }
>>  > >
>>  > > And to use ES with some overrides but not HDFS or solr it would look
>>  > like:
>>  > >
>>  > > {
>>  > > 'elasticsearch': {
>>  > > 'enabled': 'true',
>>  > > 'index': 'foo',
>>  > > 'batchSize': 100
>>  > > },
>>  > > 'hdfs': {
>>  > > 'enabled': 'false'
>>  > > },
>>  > > 'solr': {
>>  > > 'enabled': 'false'
>>  > > }
>>  > > }
>>  > >
>>  > > Jon
>>  > >
>>  > > On Fri, Jan 13, 2017 at 10:21 PM Casey Stella 
>>  > wrote:
>>  > >
>>  > > One thing that I thought of that I very strenuously do not like in Nick's
>>  > > proposal is that if a writer config is not specified then it is turned
>>  > off
>>  > > (I think; if I misunderstood let me know). In the situation where we
>>  > have a
>>  > > new sensor, right now if there are no index config and no enrichment
>>  > > config, it still passes through to the index using defaults. In this
>>  new
>>  > > scheme it would not. This changes the default semantics for the system
>>  > and
>>  > > I think it changes it for the worse.
>>  > >
>>  > > I would strongly prefer an on-by-default indexing config as we have now.
>>  > > On Fri, Jan 13, 2017 at 17:13 Casey Stella  wrote:
>>  > >
>>  > > > One thing that I really like about Nick's suggestion is that it
>>  allows
>>  > > > writer-specific configs in a clear and simple way. It is more
>>  complex
>>  > > for
>>  > > > the default case (all writers write to indices named the same thing
>>  > with
>>  > > a
>>  > > > fixed batch size), which I do not like, but maybe it's worth the
>>  > > compromise
>>  > > > to make it less complex for the advanced case.
>>  > > >
>>  > > > Thanks a lot for the suggestion, Nick, it's interesting; I'm
>>  beginning
>>  > > to
>>  > > > lean your way.
>>  > > >
>>  > > > On Fri, Jan 13, 2017 at 2:51 PM, zeo...@gmail.com 
>>  > > > wrote:
>>  > > >
>>  > > > I like the suggestions you made, Nick. The only thing I would add is
>>  > > that
>>  > > > it's also nice to see an explicit when(false), as people newer to the
>>  > > > platform may not know where to expect configs for the different
>>  > writers.
>>  > > > Being able to do it either way, which I think is already assumed in
>>  > your
>>  > > > model, would make sense. I would just suggest that, if we support
>>  but
>>  > > are
>>  > > > disabling a writer, that the platform inserts a default when(false)
>>  to
>>  > be
>>  > > > explicit.
>>  > > >
>>  > > > Jon
>>  > > >
>>  > > > On Fri, Jan 13, 2017 at 11:59 AM Casey Stella 
>>  > > wrote:
>>  > > >
>>  > > > > Let me noodle on this over the weekend. Your syntax

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-16 Thread Casey Stella

hahaha :)

On Mon, Jan 16, 2017 at 10:49 AM, Nick Allen  wrote:

> I don't quite support it for #1 and #2, but you absolutely sold me on #3.
> Good sell.  +1
>
>
> On Mon, Jan 16, 2017 at 10:46 AM, Casey Stella  wrote:
>
> > Well, I like it for a couple of reasons:
> >
> >- It's explicit and clear that the writer is on or off
> >- It enables people to keep their writer config in the file without
> >having the writer on (so I don't have to adjust the when clause to
> > "false"
> >- It enables us to not have to execute a stellar statement for "off"
> >writers.
> >
> >
> >
> > On Mon, Jan 16, 2017 at 10:40 AM, Nick Allen  wrote:
> >
> > > I'm all for a compromise here.  Sounds like we're getting close.
> > >
> > > Just one thing.  Can you layout the reasoning for having 'enabled' and
> > > 'when'?  I don't follow the reasoning, but maybe I am missing
> something.
> > >
> > > On Sat, Jan 14, 2017 at 12:13 PM, Kyle Richardson <
> > > kylerichards...@gmail.com
> > > > wrote:
> > >
> > > > I'm +1 on the current proposal. I like Nick's syntax and agree with
> > Jon's
> > > > enabled property. I also like the idea of a path property for HDFS.
> > > >
> > > > -Kyle
> > > >
> > > > > On Jan 14, 2017, at 10:51 AM, Casey Stella 
> > wrote:
> > > > >
> > > > > I'm +1 on an explicit enabled property and a filter (or when)
> > > property. I
> > > > > think we are zeroing in on a decent design, so that is good.
> > > > >
> > > > > To recap, what I am +1 on is Nick's proposed syntax with the
> > following
> > > > > modifications:
> > > > > 1. An explicit enabled field
> > > > > 2. A default on for unspecified to match current semantics
> > > > >
> > > > > Casey
> > > > >> On Sat, Jan 14, 2017 at 10:45 zeo...@gmail.com 
> > > > wrote:
> > > > >>
> > > > >> This has the additional benefit of doing something like below when
> > you
> > > > want
> > > > >> to temporarily disable the hdfs writer, but don't want to remove
> the
> > > > >> settings.  This removes the need to store the path and batchSize
> > (and
> > > > many
> > > > >> additional settings) somewhere else so they can be brought back in
> > > when
> > > > you
> > > > >> want to re-enable it, which is a nice workflow attribute for the
> end
> > > > user:
> > > > >>
> > > > >> {
> > > > >>   'elasticsearch': {
> > > > >>  'enabled': 'true',
> > > > >>  'index': 'foo',
> > > > >>  'batchSize': 100,
> > > > >>},
> > > > >>   'hdfs': {
> > > > >>  'enabled': 'false',
> > > > >>  'path': '/foo/bar/...',
> > > > >>  'batchSize': 100,
> > > > >>},
> > > > >>   'solr': {
> > > > >>  'enabled': 'false'
> > > > >>}
> > > > >> }
> > > > >>
> > > > >> Jon
> > > > >>
> > > > >>> On Sat, Jan 14, 2017 at 9:24 AM zeo...@gmail.com <
> zeo...@gmail.com
> > >
> > > > wrote:
> > > > >>>
> > > > >>> I similarly have a concern there because I prefer being as
> explicit
> > > as
> > > > >>> possible, which makes things easier to pick up for new users.
> > Using
> > > my
> > > > >>> example from earlier this could look like specifying
> while(false),
> > > but
> > > > an
> > > > >>> even better and more obvious approach may be to use
> enabled(false).
> > > So
> > > > >> the
> > > > >>> current simple default would be:
> > > > >>>
> > > > >>> {
> > > > >>>   'elasticsearch': { 'enabled': 'true' },
> > > > >>>   'hdfs': { 'enabled': 'true' },
> > > > > >>>   'solr': { 'enabled': 'false' }
> > > > >>> }
> > > > >>>
> > > > >>> And to use ES with some overrides but not HDFS or solr it would
> > look
> > > > >> like:
> > > > >>>
> > > > >>> {
> > > > >>>   'elasticsearch': {
> > > > >>>  'enabled': 'true',
> > > > >>>  'index': 'foo',
> > > > >>>  'batchSize': 100
> > > > >>>},
> > > > >>>   'hdfs': {
> > > > >>>  'enabled': 'false'
> > > > >>>},
> > > > >>>   'solr': {
> > > > >>>  'enabled': 'false'
> > > > >>>}
> > > > >>> }
> > > > >>>
> > > > >>> Jon
> > > > >>>
> > > > >>> On Fri, Jan 13, 2017 at 10:21 PM Casey Stella <
> ceste...@gmail.com>
> > > > >> wrote:
> > > > >>>
> > > > > >>> One thing that I thought of that I very strenuously do not like in
> > > Nick's
> > > > >>> proposal is that if a writer config is not specified then it is
> > > turned
> > > > >> off
> > > > >>> (I think; if I misunderstood let me know). In the situation where
> > we
> > > > >> have a
> > > > >>> new sensor, right now if there are no index config and no
> > enrichment
> > > > >>> config, it still passes through to the index using defaults. In
> > this
> > > > new
> > > > >>> scheme it would not. This changes the default semantics for the
> > > system
> > > > >> and
> > > > >>> I think it changes it for the worse.
> > > > >>>
> > > > > >>> I would strongly prefer an on-by-default indexing config as we
> have
> > > now.
> > > >  On Fri, Jan 13, 2017 at 17:13 Casey Stella 
> > > > wrote:
> > > > 
> > > >

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-16 Thread Casey Stella

Well, I like it for a couple of reasons:

   - It's explicit and clear that the writer is on or off
   - It enables people to keep their writer config in the file without
   having the writer on (so I don't have to adjust the when clause to "false")
   - It enables us to not have to execute a Stellar statement for "off"
   writers.
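
The third point can be sketched as follows. This is only an illustration under stated assumptions: `should_write` is a hypothetical helper, the `when` entry is modeled as a plain Python callable standing in for a Stellar statement, and `enabled` is a boolean rather than the quoted `'true'`/`'false'` strings used in the examples in this thread.

```python
def should_write(message, writer_config):
    """Decide whether a writer handles a message.

    'enabled' is checked first, so the 'when' filter (standing in for a
    Stellar statement) is never evaluated for a disabled writer.
    """
    if not writer_config.get("enabled", True):  # unspecified means on
        return False
    when = writer_config.get("when")
    # No filter configured means every message is written.
    return True if when is None else bool(when(message))
```

With this ordering, turning a writer off is a one-field change and the filter expression can stay in the config untouched.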



On Mon, Jan 16, 2017 at 10:40 AM, Nick Allen  wrote:

> I'm all for a compromise here.  Sounds like we're getting close.
>
> Just one thing.  Can you layout the reasoning for having 'enabled' and
> 'when'?  I don't follow the reasoning, but maybe I am missing something.
>
> On Sat, Jan 14, 2017 at 12:13 PM, Kyle Richardson <
> kylerichards...@gmail.com
> > wrote:
>
> > I'm +1 on the current proposal. I like Nick's syntax and agree with Jon's
> > enabled property. I also like the idea of a path property for HDFS.
> >
> > -Kyle
> >
> > > On Jan 14, 2017, at 10:51 AM, Casey Stella  wrote:
> > >
> > > I'm +1 on an explicit enabled property and a filter (or when)
> property. I
> > > think we are zeroing in on a decent design, so that is good.
> > >
> > > To recap, what I am +1 on is Nick's proposed syntax with the following
> > > modifications:
> > > 1. An explicit enabled field
> > > 2. A default on for unspecified to match current semantics
> > >
> > > Casey
> > >> On Sat, Jan 14, 2017 at 10:45 zeo...@gmail.com 
> > wrote:
> > >>
> > >> This has the additional benefit of doing something like below when you
> > want
> > >> to temporarily disable the hdfs writer, but don't want to remove the
> > >> settings.  This removes the need to store the path and batchSize (and
> > many
> > >> additional settings) somewhere else so they can be brought back in
> when
> > you
> > >> want to re-enable it, which is a nice workflow attribute for the end
> > user:
> > >>
> > >> {
> > >>   'elasticsearch': {
> > >>  'enabled': 'true',
> > >>  'index': 'foo',
> > >>  'batchSize': 100,
> > >>},
> > >>   'hdfs': {
> > >>  'enabled': 'false',
> > >>  'path': '/foo/bar/...',
> > >>  'batchSize': 100,
> > >>},
> > >>   'solr': {
> > >>  'enabled': 'false'
> > >>}
> > >> }
> > >>
> > >> Jon
> > >>
> > >>> On Sat, Jan 14, 2017 at 9:24 AM zeo...@gmail.com 
> > wrote:
> > >>>
> > >>> I similarly have a concern there because I prefer being as explicit
> as
> > >>> possible, which makes things easier to pick up for new users.  Using
> my
> > >>> example from earlier this could look like specifying while(false),
> but
> > an
> > >>> even better and more obvious approach may be to use enabled(false).
> So
> > >> the
> > >>> current simple default would be:
> > >>>
> > >>> {
> > >>>   'elasticsearch': { 'enabled': 'true' },
> > >>>   'hdfs': { 'enabled': 'true' },
> > >>>   'solr': { 'enabled': 'false' }
> > >>> }
> > >>>
> > >>> And to use ES with some overrides but not HDFS or solr it would look
> > >> like:
> > >>>
> > >>> {
> > >>>   'elasticsearch': {
> > >>>  'enabled': 'true',
> > >>>  'index': 'foo',
> > >>>  'batchSize': 100
> > >>>},
> > >>>   'hdfs': {
> > >>>  'enabled': 'false'
> > >>>},
> > >>>   'solr': {
> > >>>  'enabled': 'false'
> > >>>}
> > >>> }
> > >>>
> > >>> Jon
> > >>>
> > >>> On Fri, Jan 13, 2017 at 10:21 PM Casey Stella 
> > >> wrote:
> > >>>
> > >>> One thing that I thought of that I very strenuously do not like in
> Nick's
> > >>> proposal is that if a writer config is not specified then it is
> turned
> > >> off
> > >>> (I think; if I misunderstood let me know). In the situation where we
> > >> have a
> > >>> new sensor, right now if there are no index config and no enrichment
> > >>> config, it still passes through to the index using defaults. In this
> > new
> > >>> scheme it would not. This changes the default semantics for the
> system
> > >> and
> > >>> I think it changes it for the worse.
> > >>>
> > >>> I would strongly prefer an on-by-default indexing config as we have
> now.
> >  On Fri, Jan 13, 2017 at 17:13 Casey Stella 
> > wrote:
> > 
> >  One thing that I really like about Nick's suggestion is that it
> allows
> >  writer-specific configs in a clear and simple way.  It is more
> complex
> > >>> for
> >  the default case (all writers write to indices named the same thing
> > >> with
> > >>> a
> >  fixed batch size), which I do not like, but maybe it's worth the
> > >>> compromise
> >  to make it less complex for the advanced case.
> > 
> >  Thanks a lot for the suggestion, Nick, it's interesting;  I'm
> > beginning
> > >>> to
> >  lean your way.
> > 
> >  On Fri, Jan 13, 2017 at 2:51 PM, zeo...@gmail.com  >
> >  wrote:
> > 
> >  I like the suggestions you made, Nick.  The only thing I would add
> is
> > >>> that
> >  it's also nice to see an explicit when(false), as people newer to
> the
> >  platform may

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-16 Thread Casey Stella

Yeah, as far as I'm concerned, it should, at least for the current state of
affairs.
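
The upgrade scenario in the quoted question below can be illustrated with a short sketch. The helper `enabled_writers` and the writer name `newindexer` are hypothetical; the only assumption carried over from the thread is that a writer with no config entry defaults to on.

```python
def enabled_writers(known_writers, sensor_config):
    """List the writers that are on for a sensor.

    A writer missing from the config, or present without an 'enabled'
    field, defaults to on.
    """
    return [
        name for name in known_writers
        if sensor_config.get(name, {}).get("enabled", True)
    ]
```

Under that assumption, a config written before an upgrade that only mentions `elasticsearch` and `hdfs` would start writing to a newly added indexer as soon as it appears in the list of known writers, which is the trade-off being accepted here.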

On Mon, Jan 16, 2017 at 10:39 AM, Nick Allen  wrote:

> Just one question around "default on."  How would that behave when we add
> new indexers?  Would those also be default on?
>
> If I have a running system, then upgrade and a new indexer is added, would
> that default to on and change the behavior of my system?  Maybe this is a
> fair trade-off that does not happen too frequently.
>
> On Sat, Jan 14, 2017 at 10:51 AM, Casey Stella  wrote:
>
> > I'm +1 on an explicit enabled property and a filter (or when) property. I
> > think we are zeroing in on a decent design, so that is good.
> >
> > To recap, what I am +1 on is Nick's proposed syntax with the following
> > modifications:
> > 1. An explicit enabled field
> > 2. A default on for unspecified to match current semantics
> >
> > Casey
> > On Sat, Jan 14, 2017 at 10:45 zeo...@gmail.com  wrote:
> >
> > > This has the additional benefit of doing something like below when you
> > want
> > > to temporarily disable the hdfs writer, but don't want to remove the
> > > settings.  This removes the need to store the path and batchSize (and
> > many
> > > additional settings) somewhere else so they can be brought back in when
> > you
> > > want to re-enable it, which is a nice workflow attribute for the end
> > user:
> > >
> > > {
> > >'elasticsearch': {
> > >   'enabled': 'true',
> > >   'index': 'foo',
> > >   'batchSize': 100,
> > > },
> > >'hdfs': {
> > >   'enabled': 'false',
> > >   'path': '/foo/bar/...',
> > >   'batchSize': 100,
> > > },
> > >'solr': {
> > >   'enabled': 'false'
> > > }
> > > }
> > >
> > > Jon
> > >
> > > On Sat, Jan 14, 2017 at 9:24 AM zeo...@gmail.com 
> > wrote:
> > >
> > > > I similarly have a concern there because I prefer being as explicit
> as
> > > > possible, which makes things easier to pick up for new users.  Using
> my
> > > > example from earlier this could look like specifying while(false),
> but
> > an
> > > > even better and more obvious approach may be to use enabled(false).
> So
> > > the
> > > > current simple default would be:
> > > >
> > > > {
> > > >'elasticsearch': { 'enabled': 'true' },
> > > >'hdfs': { 'enabled': 'true' },
> > > > >'solr': { 'enabled': 'false' }
> > > > }
> > > >
> > > > And to use ES with some overrides but not HDFS or solr it would look
> > > like:
> > > >
> > > > {
> > > >'elasticsearch': {
> > > >   'enabled': 'true',
> > > >   'index': 'foo',
> > > >   'batchSize': 100
> > > > },
> > > >'hdfs': {
> > > >   'enabled': 'false'
> > > > },
> > > >'solr': {
> > > >   'enabled': 'false'
> > > > }
> > > > }
> > > >
> > > > Jon
> > > >
> > > > On Fri, Jan 13, 2017 at 10:21 PM Casey Stella 
> > > wrote:
> > > >
> > > > One thing that I thought of that I very strenuously do not like in
> Nick's
> > > > proposal is that if a writer config is not specified then it is
> turned
> > > off
> > > > (I think; if I misunderstood let me know). In the situation where we
> > > have a
> > > > new sensor, right now if there are no index config and no enrichment
> > > > config, it still passes through to the index using defaults. In this
> > new
> > > > scheme it would not. This changes the default semantics for the
> system
> > > and
> > > > I think it changes it for the worse.
> > > >
> > > > I would strongly prefer an on-by-default indexing config as we have
> now.
> > > > On Fri, Jan 13, 2017 at 17:13 Casey Stella 
> wrote:
> > > >
> > > > > One thing that I really like about Nick's suggestion is that it
> > allows
> > > > > writer-specific configs in a clear and simple way.  It is more
> > complex
> > > > for
> > > > > the default case (all writers write to indices named the same thing
> > > with
> > > > a
> > > > > fixed batch size), which I do not like, but maybe it's worth the
> > > > compromise
> > > > > to make it less complex for the advanced case.
> > > > >
> > > > > Thanks a lot for the suggestion, Nick, it's interesting;  I'm
> > beginning
> > > > to
> > > > > lean your way.
> > > > >
> > > > > On Fri, Jan 13, 2017 at 2:51 PM, zeo...@gmail.com <
> zeo...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > I like the suggestions you made, Nick.  The only thing I would add
> is
> > > > that
> > > > > it's also nice to see an explicit when(false), as people newer to
> the
> > > > > platform may not know where to expect configs for the different
> > > writers.
> > > > > Being able to do it either way, which I think is already assumed in
> > > your
> > > > > model, would make sense.  I would just suggest that, if we support
> > but
> > > > are
> > > > > disabling a writer, that the platform inserts a default when(false)
> > to
> > > be
> > > > > explicit.
> > > > >
> > > > > Jon
> > > > >
> > > > > On Fri, Jan

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-16 Thread Nick Allen

I'm all for a compromise here.  Sounds like we're getting close.

Just one thing.  Can you lay out the reasoning for having 'enabled' and
'when'?  I don't follow the reasoning, but maybe I am missing something.

On Sat, Jan 14, 2017 at 12:13 PM, Kyle Richardson  wrote:

> I'm +1 on the current proposal. I like Nick's syntax and agree with Jon's
> enabled property. I also like the idea of a path property for HDFS.
>
> -Kyle
>
> > On Jan 14, 2017, at 10:51 AM, Casey Stella  wrote:
> >
> > I'm +1 on an explicit enabled property and a filter (or when) property. I
> > think we are zeroing in on a decent design, so that is good.
> >
> > To recap, what I am +1 on is Nick's proposed syntax with the following
> > modifications:
> > 1. An explicit enabled field
> > 2. A default on for unspecified to match current semantics
> >
> > Casey
> >> On Sat, Jan 14, 2017 at 10:45 zeo...@gmail.com 
> wrote:
> >>
> >> This has the additional benefit of doing something like below when you
> want
> >> to temporarily disable the hdfs writer, but don't want to remove the
> >> settings.  This removes the need to store the path and batchSize (and
> many
> >> additional settings) somewhere else so they can be brought back in when
> you
> >> want to re-enable it, which is a nice workflow attribute for the end
> user:
> >>
> >> {
> >>   'elasticsearch': {
> >>  'enabled': 'true',
> >>  'index': 'foo',
> >>  'batchSize': 100,
> >>},
> >>   'hdfs': {
> >>  'enabled': 'false',
> >>  'path': '/foo/bar/...',
> >>  'batchSize': 100,
> >>},
> >>   'solr': {
> >>  'enabled': 'false'
> >>}
> >> }
> >>
> >> Jon
> >>
> >>> On Sat, Jan 14, 2017 at 9:24 AM zeo...@gmail.com 
> wrote:
> >>>
> >>> I similarly have a concern there because I prefer being as explicit as
> >>> possible, which makes things easier to pick up for new users.  Using my
> >>> example from earlier this could look like specifying while(false), but
> an
> >>> even better and more obvious approach may be to use enabled(false).  So
> >> the
> >>> current simple default would be:
> >>>
> >>> {
> >>>   'elasticsearch': { 'enabled': 'true' },
> >>>   'hdfs': { 'enabled': 'true' },
> >>>   'solr': { 'enabled': 'false' }
> >>> }
> >>>
> >>> And to use ES with some overrides but not HDFS or solr it would look
> >> like:
> >>>
> >>> {
> >>>   'elasticsearch': {
> >>>  'enabled': 'true',
> >>>  'index': 'foo',
> >>>  'batchSize': 100
> >>>},
> >>>   'hdfs': {
> >>>  'enabled': 'false'
> >>>},
> >>>   'solr': {
> >>>  'enabled': 'false'
> >>>}
> >>> }
> >>>
> >>> Jon
> >>>
> >>> On Fri, Jan 13, 2017 at 10:21 PM Casey Stella 
> >> wrote:
> >>>
> >>> One thing that I thought of that I very strenuously do not like in Nick's
> >>> proposal is that if a writer config is not specified then it is turned
> >> off
> >>> (I think; if I misunderstood let me know). In the situation where we
> >> have a
> >>> new sensor, right now if there are no index config and no enrichment
> >>> config, it still passes through to the index using defaults. In this
> new
> >>> scheme it would not. This changes the default semantics for the system
> >> and
> >>> I think it changes it for the worse.
> >>>
> >>> I would strongly prefer an on-by-default indexing config as we have now.
>  On Fri, Jan 13, 2017 at 17:13 Casey Stella 
> wrote:
> 
>  One thing that I really like about Nick's suggestion is that it allows
>  writer-specific configs in a clear and simple way.  It is more complex
> >>> for
>  the default case (all writers write to indices named the same thing
> >> with
> >>> a
>  fixed batch size), which I do not like, but maybe it's worth the
> >>> compromise
>  to make it less complex for the advanced case.
> 
>  Thanks a lot for the suggestion, Nick, it's interesting;  I'm
> beginning
> >>> to
>  lean your way.
> 
>  On Fri, Jan 13, 2017 at 2:51 PM, zeo...@gmail.com 
>  wrote:
> 
>  I like the suggestions you made, Nick.  The only thing I would add is
> >>> that
>  it's also nice to see an explicit when(false), as people newer to the
>  platform may not know where to expect configs for the different
> >> writers.
>  Being able to do it either way, which I think is already assumed in
> >> your
>  model, would make sense.  I would just suggest that, if we support but
> >>> are
>  disabling a writer, that the platform inserts a default when(false) to
> >> be
>  explicit.
> 
>  Jon
> 
>  On Fri, Jan 13, 2017 at 11:59 AM Casey Stella 
> >>> wrote:
> 
> > Let me noodle on this over the weekend.  Your syntax is looking less
> > onerous to me and I like the following statement from Otto: "In the
> >>> end,
> > each write destination ‘type’ will need it’s own configuration.  This
>

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-14 Thread Kyle Richardson

I'm +1 on the current proposal. I like Nick's syntax and agree with Jon's 
enabled property. I also like the idea of a path property for HDFS.

-Kyle

> On Jan 14, 2017, at 10:51 AM, Casey Stella  wrote:
> 
> I'm +1 on an explicit enabled property and a filter (or when) property. I
> think we are zeroing in on a decent design, so that is good.
> 
> To recap, what I am +1 on is Nick's proposed syntax with the following
> modifications:
> 1. An explicit enabled field
> 2. A default on for unspecified to match current semantics
> 
> Casey
>> On Sat, Jan 14, 2017 at 10:45 zeo...@gmail.com  wrote:
>> 
>> This has the additional benefit of doing something like below when you want
>> to temporarily disable the hdfs writer, but don't want to remove the
>> settings.  This removes the need to store the path and batchSize (and many
>> additional settings) somewhere else so they can be brought back in when you
>> want to re-enable it, which is a nice workflow attribute for the end user:
>> 
>> {
>>   'elasticsearch': {
>>  'enabled': 'true',
>>  'index': 'foo',
>>  'batchSize': 100,
>>},
>>   'hdfs': {
>>  'enabled': 'false',
>>  'path': '/foo/bar/...',
>>  'batchSize': 100,
>>},
>>   'solr': {
>>  'enabled': 'false'
>>}
>> }
>> 
>> Jon
>> 
>>> On Sat, Jan 14, 2017 at 9:24 AM zeo...@gmail.com  wrote:
>>> 
>>> I similarly have a concern there because I prefer being as explicit as
>>> possible, which makes things easier to pick up for new users.  Using my
>>> example from earlier this could look like specifying while(false), but an
>>> even better and more obvious approach may be to use enabled(false).  So
>> the
>>> current simple default would be:
>>> 
>>> {
>>>   'elasticsearch': { 'enabled': 'true' },
>>>   'hdfs': { 'enabled': 'true' },
>>>   'solr': { 'enabled': 'false' }
>>> }
>>> 
>>> And to use ES with some overrides but not HDFS or solr it would look
>> like:
>>> 
>>> {
>>>   'elasticsearch': {
>>>  'enabled': 'true',
>>>  'index': 'foo',
>>>  'batchSize': 100
>>>},
>>>   'hdfs': {
>>>  'enabled': 'false'
>>>},
>>>   'solr': {
>>>  'enabled': 'false'
>>>}
>>> }
>>> 
>>> Jon
>>> 
>>> On Fri, Jan 13, 2017 at 10:21 PM Casey Stella 
>> wrote:
>>> 
>>> One thing that I thought of that I very strenuously do not like in Nick's
>>> proposal is that if a writer config is not specified then it is turned
>> off
>>> (I think; if I misunderstood let me know). In the situation where we
>> have a
>>> new sensor, right now if there are no index config and no enrichment
>>> config, it still passes through to the index using defaults. In this new
>>> scheme it would not. This changes the default semantics for the system
>> and
>>> I think it changes it for the worse.
>>> 
>>> I would strongly prefer a on-by-default indexing config as we have now.
 On Fri, Jan 13, 2017 at 17:13 Casey Stella  wrote:
 
 One thing that I really like about Nick's suggestion is that it allows
 writer-specific configs in a clear and simple way.  It is more complex
>>> for
 the default case (all writers write to indices named the same thing
>> with
>>> a
 fixed batch size), which I do not like, but maybe it's worth the
>>> compromise
 to make it less complex for the advanced case.
 
 Thanks a lot for the suggestion, Nick, it's interesting;  I'm beginning
>>> to
 lean your way.
 
 On Fri, Jan 13, 2017 at 2:51 PM, zeo...@gmail.com 
 wrote:
 
 I like the suggestions you made, Nick.  The only thing I would add is
>>> that
 it's also nice to see an explicit when(false), as people newer to the
 platform may not know where to expect configs for the different
>> writers.
 Being able to do it either way, which I think is already assumed in
>> your
 model, would make sense.  I would just suggest that, if we support but
>>> are
 disabling a writer, that the platform inserts a default when(false) to
>> be
 explicit.
 
 Jon
 
 On Fri, Jan 13, 2017 at 11:59 AM Casey Stella 
>>> wrote:
 
> Let me noodle on this over the weekend.  Your syntax is looking less
> onerous to me and I like the following statement from Otto: "In the
>>> end,
> each write destination ‘type’ will need it’s own configuration.  This
>>> is
 an
> extension point."
> 
> I may come around to your way of thinking.
> 
> On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler <
>> ottobackwa...@gmail.com
 
> wrote:
> 
>> In the end, each write destination ‘type’ will need it’s own
>> configuration.  This is an extension point.
>> {
>> HDFS:{
>> outputAdapters:[
>> {name: avro,
>> settings:{
>> avro stuff….
>> when:{
>> },
>> {
>> name: sequence file,
>> …..
>> 
>> or some such.
>> 
>> 
>>

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-14 Thread Casey Stella

I'm +1 on an explicit enabled property and a filter (or when) property. I
think we are zeroing in on a decent design, so that is good.

To recap, what I am +1 on is Nick's proposed syntax with the following
modifications:
1. An explicit enabled field
2. Default to on when a writer is unspecified, to match current semantics
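
A minimal Python sketch of how those two modifications could combine, assuming the per-sensor indexing config has been parsed into a dict keyed by writer name. The function name `is_writer_enabled` and the choice to accept both boolean and string flag values are illustrative assumptions, not Metron code:

```python
def is_writer_enabled(sensor_config, writer):
    """Return True unless the writer is explicitly disabled.

    sensor_config is a dict keyed by writer name (hypothetical shape).
    A missing writer block, or a block without 'enabled', defaults to on,
    which mirrors the current pass-through behaviour.
    """
    writer_config = (sensor_config or {}).get(writer, {})
    enabled = writer_config.get("enabled", True)
    # The examples in this thread spell the flag as a string, so accept both.
    if isinstance(enabled, str):
        return enabled.strip().lower() != "false"
    return bool(enabled)


config = {
    "elasticsearch": {"enabled": "true", "index": "foo"},
    "solr": {"enabled": "false"},
}
for name in ("elasticsearch", "solr", "hdfs"):
    # 'hdfs' is not mentioned in the config, so it falls back to enabled.
    print(name, is_writer_enabled(config, name))
```

Treating anything other than an explicit "false" as enabled is one possible reading of "default on"; a stricter parser could reject unknown values instead.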

Casey
On Sat, Jan 14, 2017 at 10:45 zeo...@gmail.com  wrote:

> This has the additional benefit of doing something like below when you want
> to temporarily disable the hdfs writer, but don't want to remove the
> settings.  This removes the need to store the path and batchSize (and many
> additional settings) somewhere else so they can be brought back in when you
> want to re-enable it, which is a nice workflow attribute for the end user:
>
> {
>'elasticsearch': {
>   'enabled': 'true',
>   'index': 'foo',
>   'batchSize': 100,
> },
>'hdfs': {
>   'enabled': 'false',
>   'path': '/foo/bar/...',
>   'batchSize': 100,
> },
>'solr': {
>   'enabled': 'false'
> }
> }
>
> Jon
>
> On Sat, Jan 14, 2017 at 9:24 AM zeo...@gmail.com  wrote:
>
> > I similarly have a concern there because I prefer being as explicit as
> > possible, which makes things easier to pick up for new users.  Using my
> > example from earlier this could look like specifying while(false), but an
> > even better and more obvious approach may be to use enabled(false).  So
> the
> > current simple default would be:
> >
> > {
> >'elasticsearch': { 'enabled': 'true' },
> >'hdfs': { 'enabled': 'true' },
> >'solr': { enabled': 'false' }
> > }
> >
> > And to use ES with some overrides but not HDFS or solr it would look
> like:
> >
> > {
> >'elasticsearch': {
> >   'enabled': 'true',
> >   'index': 'foo',
> >   'batchSize': 100
> > },
> >'hdfs': {
> >   'enabled': 'false'
> > },
> >'solr': {
> >   'enabled': 'false'
> > }
> > }
> >
> > Jon
> >
> > On Fri, Jan 13, 2017 at 10:21 PM Casey Stella 
> wrote:
> >
> > One thing that I thought of that I very strenuous do not like in Nick's
> > proposal is that if a writer config is not specified then it is turned
> off
> > (I think; if I misunderstood let me know). In the situation where we
> have a
> > new sensor, right now if there are no index config and no enrichment
> > config, it still passes through to the index using defaults. In this new
> > scheme it would not. This changes the default semantics for the system
> and
> > I think it changes it for the worse.
> >
> > I would strongly prefer a on-by-default indexing config as we have now.
> > On Fri, Jan 13, 2017 at 17:13 Casey Stella  wrote:
> >
> > > One thing that I really like about Nick's suggestion is that it allows
> > > writer-specific configs in a clear and simple way.  It is more complex
> > for
> > > the default case (all writers write to indices named the same thing
> with
> > a
> > > fixed batch size), which I do not like, but maybe it's worth the
> > compromise
> > > to make it less complex for the advanced case.
> > >
> > > Thanks a lot for the suggestion, Nick, it's interesting;  I'm beginning
> > to
> > > lean your way.
> > >
> > > On Fri, Jan 13, 2017 at 2:51 PM, zeo...@gmail.com 
> > > wrote:
> > >
> > > I like the suggestions you made, Nick.  The only thing I would add is
> > that
> > > it's also nice to see an explicit when(false), as people newer to the
> > > platform may not know where to expect configs for the different
> writers.
> > > Being able to do it either way, which I think is already assumed in
> your
> > > model, would make sense.  I would just suggest that, if we support but
> > are
> > > disabling a writer, that the platform inserts a default when(false) to
> be
> > > explicit.
> > >
> > > Jon
> > >
> > > On Fri, Jan 13, 2017 at 11:59 AM Casey Stella 
> > wrote:
> > >
> > > > Let me noodle on this over the weekend.  Your syntax is looking less
> > > > onerous to me and I like the following statement from Otto: "In the
> > end,
> > > > each write destination ‘type’ will need it’s own configuration.  This
> > is
> > > an
> > > > extension point."
> > > >
> > > > I may come around to your way of thinking.
> > > >
> > > > On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler <
> ottobackwa...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > In the end, each write destination ‘type’ will need it’s own
> > > > > configuration.  This is an extension point.
> > > > > {
> > > > > HDFS:{
> > > > > outputAdapters:[
> > > > > {name: avro,
> > > > > settings:{
> > > > > avro stuff….
> > > > > when:{
> > > > > },
> > > > > {
> > > > >  name: sequence file,
> > > > >  …..
> > > > >
> > > > > or some such.
> > > > >
> > > > >
> > > > > On January 13, 2017 at 11:51:15, Nick Allen (n...@nickallen.org)
> > > wrote:
> > > > >
> > > > > I will add also that instead of global overrides, like index, we
> > should
> > > > use
> >

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-14 Thread zeo...@gmail.com

This has the additional benefit of letting you do something like the example
below when you want to temporarily disable the hdfs writer without removing
its settings.  You no longer need to store the path and batchSize (and many
additional settings) somewhere else so they can be restored when you
re-enable the writer, which is a nice workflow attribute for the end user:

{
   'elasticsearch': {
      'enabled': 'true',
      'index': 'foo',
      'batchSize': 100
   },
   'hdfs': {
      'enabled': 'false',
      'path': '/foo/bar/...',
      'batchSize': 100
   },
   'solr': {
      'enabled': 'false'
   }
}
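
The same workflow, sketched in Python under the assumption that the config is held as a plain dict. The helper name `set_writer_enabled` and the sample path are hypothetical; the point is only that flipping the flag leaves every other setting in place:

```python
import copy


def set_writer_enabled(sensor_config, writer, enabled):
    """Return a copy of the config with one writer toggled.

    Only the 'enabled' flag changes; every other setting for the writer
    (path, batchSize, ...) stays put so it can be re-enabled later.
    """
    updated = copy.deepcopy(sensor_config)
    # String flags are used here to match the examples in this thread.
    updated.setdefault(writer, {})["enabled"] = "true" if enabled else "false"
    return updated


config = {
    "elasticsearch": {"enabled": "true", "index": "foo", "batchSize": 100},
    "hdfs": {"enabled": "true", "path": "/example/path", "batchSize": 100},
}
disabled = set_writer_enabled(config, "hdfs", False)
print(disabled["hdfs"])   # path and batchSize are still present
print(config["hdfs"])     # the original config is left untouched
```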

Jon

On Sat, Jan 14, 2017 at 9:24 AM zeo...@gmail.com  wrote:

> I similarly have a concern there because I prefer being as explicit as
> possible, which makes things easier to pick up for new users.  Using my
> example from earlier this could look like specifying while(false), but an
> even better and more obvious approach may be to use enabled(false).  So the
> current simple default would be:
>
> {
>'elasticsearch': { 'enabled': 'true' },
>'hdfs': { 'enabled': 'true' },
>'solr': { enabled': 'false' }
> }
>
> And to use ES with some overrides but not HDFS or solr it would look like:
>
> {
>'elasticsearch': {
>   'enabled': 'true',
>   'index': 'foo',
>   'batchSize': 100
> },
>'hdfs': {
>   'enabled': 'false'
> },
>'solr': {
>   'enabled': 'false'
> }
> }
>
> Jon
>
> On Fri, Jan 13, 2017 at 10:21 PM Casey Stella  wrote:
>
> One thing that I thought of that I very strenuous do not like in Nick's
> proposal is that if a writer config is not specified then it is turned off
> (I think; if I misunderstood let me know). In the situation where we have a
> new sensor, right now if there are no index config and no enrichment
> config, it still passes through to the index using defaults. In this new
> scheme it would not. This changes the default semantics for the system and
> I think it changes it for the worse.
>
> I would strongly prefer a on-by-default indexing config as we have now.
> On Fri, Jan 13, 2017 at 17:13 Casey Stella  wrote:
>
> > One thing that I really like about Nick's suggestion is that it allows
> > writer-specific configs in a clear and simple way.  It is more complex
> for
> > the default case (all writers write to indices named the same thing with
> a
> > fixed batch size), which I do not like, but maybe it's worth the
> compromise
> > to make it less complex for the advanced case.
> >
> > Thanks a lot for the suggestion, Nick, it's interesting;  I'm beginning
> to
> > lean your way.
> >
> > On Fri, Jan 13, 2017 at 2:51 PM, zeo...@gmail.com 
> > wrote:
> >
> > I like the suggestions you made, Nick.  The only thing I would add is
> that
> > it's also nice to see an explicit when(false), as people newer to the
> > platform may not know where to expect configs for the different writers.
> > Being able to do it either way, which I think is already assumed in your
> > model, would make sense.  I would just suggest that, if we support but
> are
> > disabling a writer, that the platform inserts a default when(false) to be
> > explicit.
> >
> > Jon
> >
> > On Fri, Jan 13, 2017 at 11:59 AM Casey Stella 
> wrote:
> >
> > > Let me noodle on this over the weekend.  Your syntax is looking less
> > > onerous to me and I like the following statement from Otto: "In the
> end,
> > > each write destination ‘type’ will need it’s own configuration.  This
> is
> > an
> > > extension point."
> > >
> > > I may come around to your way of thinking.
> > >
> > > On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler  >
> > > wrote:
> > >
> > > > In the end, each write destination ‘type’ will need it’s own
> > > > configuration.  This is an extension point.
> > > > {
> > > > HDFS:{
> > > > outputAdapters:[
> > > > {name: avro,
> > > > settings:{
> > > > avro stuff….
> > > > when:{
> > > > },
> > > > {
> > > >  name: sequence file,
> > > >  …..
> > > >
> > > > or some such.
> > > >
> > > >
> > > > On January 13, 2017 at 11:51:15, Nick Allen (n...@nickallen.org)
> > wrote:
> > > >
> > > > I will add also that instead of global overrides, like index, we
> should
> > > use
> > > > configuration key names that are more appropriate to the output.
> > > >
> > > > For example, does 'index' really make sense for HDFS? Or would 'path'
> > be
> > > > more appropriate?
> > > >
> > > > {
> > > > 'elasticsearch': {
> > > > 'index': 'foo',
> > > > 'batchSize': 1
> > > > },
> > > > 'hdfs': {
> > > > 'path': '/foo/bar/...',
> > > > 'batchSize': 100
> > > > }
> > > > }
> > > >
> > > > Ok, I've said my peace. Thanks for the effort in summarizing all
> this,
> > > > Casey.
> > > >
> > > >
> > > > On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen 
> > wrote:
> > > >
> > > > > Nick's concerns about my suggestion were that it was overly complex
> >

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-14 Thread zeo...@gmail.com

I similarly have a concern there because I prefer being as explicit as
possible, which makes things easier to pick up for new users.  Using my
example from earlier this could look like specifying when(false), but an
even better and more obvious approach may be to use enabled(false).  So the
current simple default would be:

{
   'elasticsearch': { 'enabled': 'true' },
   'hdfs': { 'enabled': 'true' },
   'solr': { 'enabled': 'false' }
}

And to use ES with some overrides but not HDFS or solr it would look like:

{
   'elasticsearch': {
      'enabled': 'true',
      'index': 'foo',
      'batchSize': 100
   },
   'hdfs': {
      'enabled': 'false'
   },
   'solr': {
      'enabled': 'false'
   }
}
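
One way to read the two examples together is as defaults plus per-sensor overrides. A minimal Python sketch of that merge follows; `DEFAULTS` and `effective_config` are hypothetical names, and the default values simply copy the "current simple default" above:

```python
# Defaults taken from the "current simple default" example in this message.
DEFAULTS = {
    "elasticsearch": {"enabled": "true"},
    "hdfs": {"enabled": "true"},
    "solr": {"enabled": "false"},
}


def effective_config(overrides, defaults=DEFAULTS):
    """Merge per-sensor overrides on top of the defaults, writer by writer."""
    merged = {}
    for writer in sorted(set(defaults) | set(overrides)):
        # Keys in the override block win; unspecified keys fall back.
        merged[writer] = {**defaults.get(writer, {}), **overrides.get(writer, {})}
    return merged


overrides = {
    "elasticsearch": {"index": "foo", "batchSize": 100},
    "hdfs": {"enabled": "false"},
}
for writer, settings in effective_config(overrides).items():
    print(writer, settings)
```

With this shape the sensor config only needs to list what differs from the defaults, while the merged result is still fully explicit.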

Jon

On Fri, Jan 13, 2017 at 10:21 PM Casey Stella  wrote:

> One thing that I thought of that I very strenuous do not like in Nick's
> proposal is that if a writer config is not specified then it is turned off
> (I think; if I misunderstood let me know). In the situation where we have a
> new sensor, right now if there are no index config and no enrichment
> config, it still passes through to the index using defaults. In this new
> scheme it would not. This changes the default semantics for the system and
> I think it changes it for the worse.
>
> I would strongly prefer a on-by-default indexing config as we have now.
> On Fri, Jan 13, 2017 at 17:13 Casey Stella  wrote:
>
> > One thing that I really like about Nick's suggestion is that it allows
> > writer-specific configs in a clear and simple way.  It is more complex
> for
> > the default case (all writers write to indices named the same thing with
> a
> > fixed batch size), which I do not like, but maybe it's worth the
> compromise
> > to make it less complex for the advanced case.
> >
> > Thanks a lot for the suggestion, Nick, it's interesting;  I'm beginning
> to
> > lean your way.
> >
> > On Fri, Jan 13, 2017 at 2:51 PM, zeo...@gmail.com 
> > wrote:
> >
> > I like the suggestions you made, Nick.  The only thing I would add is
> that
> > it's also nice to see an explicit when(false), as people newer to the
> > platform may not know where to expect configs for the different writers.
> > Being able to do it either way, which I think is already assumed in your
> > model, would make sense.  I would just suggest that, if we support but
> are
> > disabling a writer, that the platform inserts a default when(false) to be
> > explicit.
> >
> > Jon
> >
> > On Fri, Jan 13, 2017 at 11:59 AM Casey Stella 
> wrote:
> >
> > > Let me noodle on this over the weekend.  Your syntax is looking less
> > > onerous to me and I like the following statement from Otto: "In the
> end,
> > > each write destination ‘type’ will need it’s own configuration.  This
> is
> > an
> > > extension point."
> > >
> > > I may come around to your way of thinking.
> > >
> > > On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler  >
> > > wrote:
> > >
> > > > In the end, each write destination ‘type’ will need it’s own
> > > > configuration.  This is an extension point.
> > > > {
> > > > HDFS:{
> > > > outputAdapters:[
> > > > {name: avro,
> > > > settings:{
> > > > avro stuff….
> > > > when:{
> > > > },
> > > > {
> > > >  name: sequence file,
> > > >  …..
> > > >
> > > > or some such.
> > > >
> > > >
> > > > On January 13, 2017 at 11:51:15, Nick Allen (n...@nickallen.org)
> > wrote:
> > > >
> > > > I will add also that instead of global overrides, like index, we
> should
> > > use
> > > > configuration key names that are more appropriate to the output.
> > > >
> > > > For example, does 'index' really make sense for HDFS? Or would 'path'
> > be
> > > > more appropriate?
> > > >
> > > > {
> > > > 'elasticsearch': {
> > > > 'index': 'foo',
> > > > 'batchSize': 1
> > > > },
> > > > 'hdfs': {
> > > > 'path': '/foo/bar/...',
> > > > 'batchSize': 100
> > > > }
> > > > }
> > > >
> > > > Ok, I've said my peace. Thanks for the effort in summarizing all
> this,
> > > > Casey.
> > > >
> > > >
> > > > On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen 
> > wrote:
> > > >
> > > > > Nick's concerns about my suggestion were that it was overly complex
> > and
> > > > >> hard to grok and that we could dispense with backwards
> compatibility
> > > and
> > > > >> make people do a bit more work on the default case for the
> benefits
> > > of a
> > > > >> simpler advanced case. (Nick, make sure I don't misstate your
> > > position)
> > > > >
> > > > >
> > > > > I will add is that in my mind, the majority case would be a user
> > > > > specifying the outputs, but not things like 'batchSize' or 'when'.
> I
> > > > think
> > > > > in the majority case, the user would accept whatever the default
> > batch
> > > > size
> > > > > is.
> > > > >
> > > > > Here are alternatives suggestions for all the examples that you
> > > provided
> > > > > previously.
> > > > >
> > > > > Base Case
> > > > >
> > > > > - The user

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella

One thing I thought of that I very strenuously do not like in Nick's
proposal is that if a writer config is not specified then it is turned off
(I think; if I misunderstood let me know). In the situation where we have a
new sensor, right now if there is no index config and no enrichment
config, it still passes through to the index using defaults. In this new
scheme it would not. This changes the default semantics for the system and
I think it changes it for the worse.

I would strongly prefer an on-by-default indexing config as we have now.
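
To make the difference concrete, here is a small Python sketch contrasting the two semantics for a sensor with no indexing config at all. The writer list and the `active_writers` function are hypothetical and only illustrate the behaviour described above:

```python
KNOWN_WRITERS = ("elasticsearch", "hdfs")  # hypothetical set of writers


def active_writers(sensor_config, on_by_default):
    """List the writers that would receive data for one sensor.

    on_by_default=True  models the current behaviour (unspecified means on).
    on_by_default=False models the reading where unspecified means off.
    """
    active = []
    for writer in KNOWN_WRITERS:
        block = sensor_config.get(writer)
        if block is None:
            # No config block for this writer: the policy decides.
            if on_by_default:
                active.append(writer)
            continue
        if str(block.get("enabled", "true")).lower() != "false":
            active.append(writer)
    return active


new_sensor = {}  # a brand new sensor with no indexing config
print(active_writers(new_sensor, on_by_default=True))
print(active_writers(new_sensor, on_by_default=False))
```

Under the second policy the new sensor writes nowhere until someone adds a config, which is the change in default semantics being objected to.
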
On Fri, Jan 13, 2017 at 17:13 Casey Stella  wrote:

> One thing that I really like about Nick's suggestion is that it allows
> writer-specific configs in a clear and simple way.  It is more complex for
> the default case (all writers write to indices named the same thing with a
> fixed batch size), which I do not like, but maybe it's worth the compromise
> to make it less complex for the advanced case.
>
> Thanks a lot for the suggestion, Nick, it's interesting;  I'm beginning to
> lean your way.
>
> On Fri, Jan 13, 2017 at 2:51 PM, zeo...@gmail.com 
> wrote:
>
> I like the suggestions you made, Nick.  The only thing I would add is that
> it's also nice to see an explicit when(false), as people newer to the
> platform may not know where to expect configs for the different writers.
> Being able to do it either way, which I think is already assumed in your
> model, would make sense.  I would just suggest that, if we support but are
> disabling a writer, that the platform inserts a default when(false) to be
> explicit.
>
> Jon
>
> On Fri, Jan 13, 2017 at 11:59 AM Casey Stella  wrote:
>
> > Let me noodle on this over the weekend.  Your syntax is looking less
> > onerous to me and I like the following statement from Otto: "In the end,
> > each write destination ‘type’ will need it’s own configuration.  This is
> an
> > extension point."
> >
> > I may come around to your way of thinking.
> >
> > On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler 
> > wrote:
> >
> > > In the end, each write destination ‘type’ will need it’s own
> > > configuration.  This is an extension point.
> > > {
> > > HDFS:{
> > > outputAdapters:[
> > > {name: avro,
> > > settings:{
> > > avro stuff….
> > > when:{
> > > },
> > > {
> > >  name: sequence file,
> > >  …..
> > >
> > > or some such.
> > >
> > >
> > > On January 13, 2017 at 11:51:15, Nick Allen (n...@nickallen.org)
> wrote:
> > >
> > > I will add also that instead of global overrides, like index, we should
> > use
> > > configuration key names that are more appropriate to the output.
> > >
> > > For example, does 'index' really make sense for HDFS? Or would 'path'
> be
> > > more appropriate?
> > >
> > > {
> > > 'elasticsearch': {
> > > 'index': 'foo',
> > > 'batchSize': 1
> > > },
> > > 'hdfs': {
> > > 'path': '/foo/bar/...',
> > > 'batchSize': 100
> > > }
> > > }
> > >
> > > Ok, I've said my peace. Thanks for the effort in summarizing all this,
> > > Casey.
> > >
> > >
> > > On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen 
> wrote:
> > >
> > > > Nick's concerns about my suggestion were that it was overly complex
> and
> > > >> hard to grok and that we could dispense with backwards compatibility
> > and
> > > >> make people do a bit more work on the default case for the benefits
> > of a
> > > >> simpler advanced case. (Nick, make sure I don't misstate your
> > position)
> > > >
> > > >
> > > > I will add is that in my mind, the majority case would be a user
> > > > specifying the outputs, but not things like 'batchSize' or 'when'. I
> > > think
> > > > in the majority case, the user would accept whatever the default
> batch
> > > size
> > > > is.
> > > >
> > > > Here are alternatives suggestions for all the examples that you
> > provided
> > > > previously.
> > > >
> > > > Base Case
> > > >
> > > > - The user must always specify the 'outputs' for clarity.
> > > > - Uses default index name, batch size and when = true.
> > > >
> > > > {
> > > > 'elasticsearch': {},
> > > > 'hdfs': {}
> > > > }
> > > >
> > > >
> > > > Writer-non-specific Case
> > > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-non-specific-case>
> > > >
> > > > - There are no global overrides, as in Casey's proposal.
> > > > - Easier to grok IMO.
> > > >
> > > > {
> > > > 'elasticsearch': {
> > > > 'index': 'foo',
> > > > 'batchSize': 100
> > > > },
> > > > 'hdfs': {
> > > > 'index': 'foo',
> > > > 'batchSize': 100
> > > > }
> > > > }
> > > >
> > > >
> > > > Writer-specific case without filters
> > > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-without-filters>
> > > >
> > > > {
> > > > 'elasticsearch': {
> > > > 'index': 'foo',
> > > > 'batchSize': 1
> > > > },
> > > > 'hdfs': {
> > > > 'index': 'foo',
> > > > 'batchSize': 100
> > > > }
> > > > }
> > > >
> > > >
>

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella

One thing that I really like about Nick's suggestion is that it allows
writer-specific configs in a clear and simple way.  It is more complex for
the default case (all writers write to indices named the same thing with a
fixed batch size), which I do not like, but maybe it's worth the compromise
to make it less complex for the advanced case.

Thanks a lot for the suggestion, Nick, it's interesting;  I'm beginning to
lean your way.

On Fri, Jan 13, 2017 at 2:51 PM, zeo...@gmail.com  wrote:

> I like the suggestions you made, Nick.  The only thing I would add is that
> it's also nice to see an explicit when(false), as people newer to the
> platform may not know where to expect configs for the different writers.
> Being able to do it either way, which I think is already assumed in your
> model, would make sense.  I would just suggest that, if we support but are
> disabling a writer, that the platform inserts a default when(false) to be
> explicit.
>
> Jon
>
> On Fri, Jan 13, 2017 at 11:59 AM Casey Stella  wrote:
>
> > Let me noodle on this over the weekend.  Your syntax is looking less
> > onerous to me and I like the following statement from Otto: "In the end,
> > each write destination ‘type’ will need it’s own configuration.  This is
> an
> > extension point."
> >
> > I may come around to your way of thinking.
> >
> > On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler 
> > wrote:
> >
> > > In the end, each write destination ‘type’ will need it’s own
> > > configuration.  This is an extension point.
> > > {
> > > HDFS:{
> > > outputAdapters:[
> > > {name: avro,
> > > settings:{
> > > avro stuff….
> > > when:{
> > > },
> > > {
> > >  name: sequence file,
> > >  …..
> > >
> > > or some such.
> > >
> > >
> > > On January 13, 2017 at 11:51:15, Nick Allen (n...@nickallen.org)
> wrote:
> > >
> > > I will add also that instead of global overrides, like index, we should
> > use
> > > configuration key names that are more appropriate to the output.
> > >
> > > For example, does 'index' really make sense for HDFS? Or would 'path'
> be
> > > more appropriate?
> > >
> > > {
> > > 'elasticsearch': {
> > > 'index': 'foo',
> > > 'batchSize': 1
> > > },
> > > 'hdfs': {
> > > 'path': '/foo/bar/...',
> > > 'batchSize': 100
> > > }
> > > }
> > >
> > > Ok, I've said my peace. Thanks for the effort in summarizing all this,
> > > Casey.
> > >
> > >
> > > On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen 
> wrote:
> > >
> > > > Nick's concerns about my suggestion were that it was overly complex
> and
> > > >> hard to grok and that we could dispense with backwards compatibility
> > and
> > > >> make people do a bit more work on the default case for the benefits
> > of a
> > > >> simpler advanced case. (Nick, make sure I don't misstate your
> > position)
> > > >
> > > >
> > > > I will add is that in my mind, the majority case would be a user
> > > > specifying the outputs, but not things like 'batchSize' or 'when'. I
> > > think
> > > > in the majority case, the user would accept whatever the default
> batch
> > > size
> > > > is.
> > > >
> > > > Here are alternatives suggestions for all the examples that you
> > provided
> > > > previously.
> > > >
> > > > Base Case
> > > >
> > > > - The user must always specify the 'outputs' for clarity.
> > > > - Uses default index name, batch size and when = true.
> > > >
> > > > {
> > > > 'elasticsearch': {},
> > > > 'hdfs': {}
> > > > }
> > > >
> > > >
> > > > > Writer-non-specific Case
> > > > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-non-specific-case>
> > > >
> > > > - There are no global overrides, as in Casey's proposal.
> > > > - Easier to grok IMO.
> > > >
> > > > {
> > > > 'elasticsearch': {
> > > > 'index': 'foo',
> > > > 'batchSize': 100
> > > > },
> > > > 'hdfs': {
> > > > 'index': 'foo',
> > > > 'batchSize': 100
> > > > }
> > > > }
> > > >
> > > >
> > > > > Writer-specific case without filters
> > > > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-without-filters>
> > > >
> > > > {
> > > > 'elasticsearch': {
> > > > 'index': 'foo',
> > > > 'batchSize': 1
> > > > },
> > > > 'hdfs': {
> > > > 'index': 'foo',
> > > > 'batchSize': 100
> > > > }
> > > > }
> > > >
> > > >
> > > > > Writer-specific case with filters
> > > > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-with-filters>
> > > >
> > > > - Instead of having to say when=false, just don't configure HDFS
> > > >
> > > > {
> > > > 'elasticsearch': {
> > > > 'index': 'foo',
> > > > 'batchSize': 100,
> > > > 'when': 'exists(field1)'
> > > > }
> > > > }
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella 
> > > wrote:
> > > >
> > > >> Dave,
> > > >> For the benefit of posterity and people who might not be as deeply
> > > >>

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread zeo...@gmail.com

I like the suggestions you made, Nick.  The only thing I would add is that
it's also nice to see an explicit when(false), as people newer to the
platform may not know where to expect configs for the different writers.
Being able to do it either way, which I think is already assumed in your
model, would make sense.  I would just suggest that, if we support a writer
but it is disabled, the platform insert a default when(false) to be
explicit.
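
A rough Python sketch of what that insertion could look like, assuming Nick's model where an omitted writer is off. The list of supported writers, the function name `make_explicit`, and the `'when': 'false'` spelling are all assumptions made for illustration:

```python
SUPPORTED_WRITERS = ("elasticsearch", "hdfs", "solr")  # hypothetical list


def make_explicit(sensor_config):
    """Return a config that names every supported writer.

    Writers the user configured are copied as-is; writers the user left out
    get an explicit 'when': 'false' so the disabled state is visible.
    """
    explicit = {}
    for writer in SUPPORTED_WRITERS:
        if writer in sensor_config:
            explicit[writer] = dict(sensor_config[writer])
        else:
            explicit[writer] = {"when": "false"}
    return explicit


user_config = {"elasticsearch": {"index": "foo", "batchSize": 100}}
for writer, settings in make_explicit(user_config).items():
    print(writer, settings)
```

A newer user reading the stored config would then see all three writers and where their settings belong, even though only one was configured by hand.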

Jon

On Fri, Jan 13, 2017 at 11:59 AM Casey Stella  wrote:

> Let me noodle on this over the weekend.  Your syntax is looking less
> onerous to me and I like the following statement from Otto: "In the end,
> each write destination ‘type’ will need it’s own configuration.  This is an
> extension point."
>
> I may come around to your way of thinking.
>
> On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler 
> wrote:
>
> > In the end, each write destination ‘type’ will need it’s own
> > configuration.  This is an extension point.
> > {
> > HDFS:{
> > outputAdapters:[
> > {name: avro,
> > settings:{
> > avro stuff….
> > when:{
> > },
> > {
> >  name: sequence file,
> >  …..
> >
> > or some such.
> >
> >
> > On January 13, 2017 at 11:51:15, Nick Allen (n...@nickallen.org) wrote:
> >
> > I will add also that instead of global overrides, like index, we should
> use
> > configuration key names that are more appropriate to the output.
> >
> > For example, does 'index' really make sense for HDFS? Or would 'path' be
> > more appropriate?
> >
> > {
> > 'elasticsearch': {
> > 'index': 'foo',
> > 'batchSize': 1
> > },
> > 'hdfs': {
> > 'path': '/foo/bar/...',
> > 'batchSize': 100
> > }
> > }
> >
> > Ok, I've said my piece. Thanks for the effort in summarizing all this,
> > Casey.
> >
> >
> > On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen  wrote:
> >
> > > Nick's concerns about my suggestion were that it was overly complex and
> > >> hard to grok and that we could dispense with backwards compatibility
> and
> > >> make people do a bit more work on the default case for the benefits
> of a
> > >> simpler advanced case. (Nick, make sure I don't misstate your
> position)
> > >
> > >
> > > I will add is that in my mind, the majority case would be a user
> > > specifying the outputs, but not things like 'batchSize' or 'when'. I
> > think
> > > in the majority case, the user would accept whatever the default batch
> > size
> > > is.
> > >
> > > Here are alternatives suggestions for all the examples that you
> provided
> > > previously.
> > >
> > > Base Case
> > >
> > > - The user must always specify the 'outputs' for clarity.
> > > - Uses default index name, batch size and when = true.
> > >
> > > {
> > > 'elasticsearch': {},
> > > 'hdfs': {}
> > > }
> > >
> > >
> > > Writer-non-specific Case
> > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-non-specific-case>
> > >
> > > - There are no global overrides, as in Casey's proposal.
> > > - Easier to grok IMO.
> > >
> > > {
> > > 'elasticsearch': {
> > > 'index': 'foo',
> > > 'batchSize': 100
> > > },
> > > 'hdfs': {
> > > 'index': 'foo',
> > > 'batchSize': 100
> > > }
> > > }
> > >
> > >
> > > Writer-specific case without filters
> > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-without-filters>
> > >
> > > {
> > > 'elasticsearch': {
> > > 'index': 'foo',
> > > 'batchSize': 1
> > > },
> > > 'hdfs': {
> > > 'index': 'foo',
> > > 'batchSize': 100
> > > }
> > > }
> > >
> > >
> > > Writer-specific case with filters
> > > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-with-filters>
> > >
> > > - Instead of having to say when=false, just don't configure HDFS
> > >
> > > {
> > > 'elasticsearch': {
> > > 'index': 'foo',
> > > 'batchSize': 100,
> > > 'when': 'exists(field1)'
> > > }
> > > }
> > >
> > >
> > >
> > >
> > >
> > > On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella 
> > wrote:
> > >
> > >> Dave,
> > >> For the benefit of posterity and people who might not be as deeply
> > >> entangled in the system as we have been, I'll recap things and
> hopefully
> > >> answer your question in the process.
> > >>
> > >> Historically the index configuration is split between the enrichment
> > >> configs and the global configs.
> > >>
> > >> - The global configs really control configs that apply to all sensors.
> > >> Historically this has been stuff like index connection strings, etc.
> > >> - The sensor-specific configs control things that vary by sensor.
> > >>
> > >> As of Metron-652 (in review currently), we moved the sensor specific
> > >> configs from the enrichment configs. The proposal here is to increase the
> > >> granularity of the sensor specific files to make them support index
> > >> writer-specific configs. Right now in the indexing topology, we have 2
> > >> writers (fixed): ES/Solr and

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella

Let me noodle on this over the weekend.  Your syntax is looking less
onerous to me and I like the following statement from Otto: "In the end,
each write destination ‘type’ will need its own configuration.  This is an
extension point."

I may come around to your way of thinking.

On Fri, Jan 13, 2017 at 11:57 AM, Otto Fowler 
wrote:

> In the end, each write destination ‘type’ will need its own
> configuration.  This is an extension point.
> {
> HDFS:{
> outputAdapters:[
> {name: avro,
> settings:{
> avro stuff….
> when:{
> },
> {
>  name: sequence file,
>  …..
>
> or some such.
>
>
> On January 13, 2017 at 11:51:15, Nick Allen (n...@nickallen.org) wrote:
>
> I will add also that instead of global overrides, like index, we should use
> configuration key names that are more appropriate to the output.
>
> For example, does 'index' really make sense for HDFS? Or would 'path' be
> more appropriate?
>
> {
> 'elasticsearch': {
> 'index': 'foo',
> 'batchSize': 1
> },
> 'hdfs': {
> 'path': '/foo/bar/...',
> 'batchSize': 100
> }
> }
>
> Ok, I've said my piece. Thanks for the effort in summarizing all this,
> Casey.
>
>
> On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen  wrote:
>
> >> Nick's concerns about my suggestion were that it was overly complex and
> >> hard to grok and that we could dispense with backwards compatibility and
> >> make people do a bit more work on the default case for the benefits of a
> >> simpler advanced case. (Nick, make sure I don't misstate your position)
> >
> >
> > I will add that, in my mind, the majority case would be a user
> > specifying the outputs, but not things like 'batchSize' or 'when'. I
> > think in the majority case, the user would accept whatever the default
> > batch size is.
> >
> > Here are alternative suggestions for all the examples that you provided
> > previously.
> >
> > Base Case
> >
> > - The user must always specify the 'outputs' for clarity.
> > - Uses default index name, batch size and when = true.
> >
> > {
> > 'elasticsearch': {},
> > 'hdfs': {}
> > }
> >
> >
> > Writer-non-specific Case
> > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-non-specific-case>
> >
> > - There are no global overrides, as in Casey's proposal.
> > - Easier to grok IMO.
> >
> > {
> > 'elasticsearch': {
> > 'index': 'foo',
> > 'batchSize': 100
> > },
> > 'hdfs': {
> > 'index': 'foo',
> > 'batchSize': 100
> > }
> > }
> >
> >
> > Writer-specific case without filters
> > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-without-filters>
> >
> > {
> > 'elasticsearch': {
> > 'index': 'foo',
> > 'batchSize': 1
> > },
> > 'hdfs': {
> > 'index': 'foo',
> > 'batchSize': 100
> > }
> > }
> >
> >
> > Writer-specific case with filters
> > <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-with-filters>
> >
> > - Instead of having to say when=false, just don't configure HDFS
> >
> > {
> > 'elasticsearch': {
> > 'index': 'foo',
> > 'batchSize': 100,
> > 'when': 'exists(field1)'
> > }
> > }
> >
> >
> >
> >
> >
> > On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella 
> wrote:
> >
> >> Dave,
> >> For the benefit of posterity and people who might not be as deeply
> >> entangled in the system as we have been, I'll recap things and hopefully
> >> answer your question in the process.
> >>
> >> Historically the index configuration is split between the enrichment
> >> configs and the global configs.
> >>
> >> - The global configs really controls configs that apply to all sensors.
> >> Historically this has been stuff like index connection strings, etc.
> >> - The sensor-specific configs which control things that vary by sensor.
> >>
> >> As of Metron-652 (in review currently), we moved the sensor specific
> >> configs from the enrichment configs. The proposal here is to increase
> >> the granularity of the sensor specific files to make them support index
> >> writer-specific configs. Right now in the indexing topology, we have 2
> >> writers (fixed): ES/Solr and HDFS.
> >>
> >> The proposed configuration would allow you to either specify a blanket
> >> sensor-level config for the index name and batchSize and/or override at
> >> the
> >> writer level, thereby supporting a couple of use-cases:
> >>
> >> - Turning off certain index writers (e.g. HDFS)
> >> - Filtering the messages written to certain index writers
> >>
> >> The two competing configs between Nick and I are as follows:
> >>
> >> - I want to make sure we keep the old sensor-specific defaults with
> >> writer-specific overrides available
> >> - Nick thought we could simplify the permutations by making the
> >> indexing
> >> config only the writer-level configs.
> >>
> >> My concerns about Nick's suggestion were that the default and majority
> >> case, specifying the index and the batchSize for all writers (the one we
> >> support now) would require

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread zeo...@gmail.com

Hmm, I'm not sure I agree that in most cases users would accept the default
batch size, especially in sizeable environments.

In search tiers like ES it is very important, and should be tuned to the
specific data that you're sending because it depends on the number of
bytes, not necessarily number of messages.

This makes me wonder if a separate enhancement would be to allow number of
entries OR size of entries to be the batch size.  If that is the case, I
could see a sane default using number of bytes to send to search being more
static.  However, I don't know how realistic that is in Storm.
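
A rough sketch of what "number of entries OR size of entries" batching could look like. This is plain Python rather than the Storm/Java code the writers actually live in, and the class name, the parameter names and the use of serialized JSON length as the size measure are all assumptions made for illustration, not anything that exists in Metron:

```python
import json


class SizeOrCountBatcher:
    """Hypothetical batcher: flushes when either the message count or the
    accumulated serialized size reaches its limit, whichever comes first."""

    def __init__(self, max_messages=None, max_bytes=None):
        if max_messages is None and max_bytes is None:
            raise ValueError("set max_messages, max_bytes, or both")
        self.max_messages = max_messages
        self.max_bytes = max_bytes
        self._buffer = []
        self._bytes = 0

    def add(self, message):
        """Buffer one message (a dict). Returns the batch to write if a
        limit was reached, otherwise None."""
        self._buffer.append(message)
        # Assumption: size is measured as the UTF-8 length of the JSON form.
        self._bytes += len(json.dumps(message).encode("utf-8"))
        full_by_count = (self.max_messages is not None
                         and len(self._buffer) >= self.max_messages)
        full_by_bytes = (self.max_bytes is not None
                         and self._bytes >= self.max_bytes)
        if full_by_count or full_by_bytes:
            return self.flush()
        return None

    def flush(self):
        """Return whatever is buffered and reset the counters."""
        batch, self._buffer, self._bytes = self._buffer, [], 0
        return batch
```

With something like this, a search writer could be given only a byte limit and the HDFS writer only a message count. Whether that maps cleanly onto how the bolts ack tuples is a separate question.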

Jon

On Fri, Jan 13, 2017 at 11:51 AM Nick Allen  wrote:

> >
> > Nick's concerns about my suggestion were that it was overly complex and
> > hard to grok and that we could dispense with backwards compatibility and
> > make people do a bit more work on the default case for the benefits of a
> > simpler advanced case. (Nick, make sure I don't misstate your position)
>
>
> I will add that, in my mind, the majority case would be a user specifying
> the outputs, but not things like 'batchSize' or 'when'.  I think in the
> majority case, the user would accept whatever the default batch size is.
>
> Here are alternative suggestions for all the examples that you provided
> previously.
>
> Base Case
>
>    - The user must always specify the 'outputs' for clarity.
>    - Uses default index name, batch size and when = true.
>
> {
>    'elasticsearch': {},
>    'hdfs': {}
> }
>
> Writer-non-specific Case
> <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-non-specific-case>
>
>    - There are no global overrides, as in Casey's proposal.
>    - Easier to grok IMO.
>
> {
>    'elasticsearch': {
>       'index': 'foo',
>       'batchSize': 100
>    },
>    'hdfs': {
>       'index': 'foo',
>       'batchSize': 100
>    }
> }
>
> Writer-specific case without filters
> <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-without-filters>
>
> {
>    'elasticsearch': {
>       'index': 'foo',
>       'batchSize': 1
>    },
>    'hdfs': {
>       'index': 'foo',
>       'batchSize': 100
>    }
> }
>
> Writer-specific case with filters
> <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-with-filters>
>
>    - Instead of having to say when=false, just don't configure HDFS
>
> {
>    'elasticsearch': {
>       'index': 'foo',
>       'batchSize': 100,
>       'when': 'exists(field1)'
>    }
> }
>
>
>
>
>
> On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella  wrote:
>
> > Dave,
> > For the benefit of posterity and people who might not be as deeply
> > entangled in the system as we have been, I'll recap things and hopefully
> > answer your question in the process.
> >
> > Historically the index configuration is split between the enrichment
> > configs and the global configs.
> >
> >    - The global configs really control configs that apply to all sensors.
> >    Historically this has been stuff like index connection strings, etc.
> >    - The sensor-specific configs control things that vary by sensor.
> >
> > As of Metron-652 (in review currently), we moved the sensor specific
> > configs from the enrichment configs.  The proposal here is to increase
> > the granularity of the sensor specific files to make them support index
> > writer-specific configs.  Right now in the indexing topology, we have 2
> > writers (fixed): ES/Solr and HDFS.
> >
> > The proposed configuration would allow you to either specify a blanket
> > sensor-level config for the index name and batchSize and/or override at
> the
> > writer level, thereby supporting a couple of use-cases:
> >
> >- Turning off certain index writers (e.g. HDFS)
> >- Filtering the messages written to certain index writers
> >
> > The two competing configs between Nick and I are as follows:
> >
> >- I want to make sure we keep the old sensor-specific defaults with
> >writer-specific overrides available
> >- Nick thought we could simplify the permutations by making the
> indexing
> >config only the writer-level configs.
> >
> > My concerns about Nick's suggestion were that the default and majority
> > case, specifying the index and the batchSize for all writers (the one we
> > support now) would require more configuration.
> >
> > Nick's concerns about my suggestion were that it was overly complex and
> > hard to grok and that we could dispense with backwards compatibility and
> > make people do a bit more work on the default case for the benefits of a
> > simpler advanced case. (Nick, make sure I don't misstate your position).
> >
> > Casey
> >
> >
> > On Fri, Jan 13, 2017 at 10:54 AM, David Lyle 
> wrote:
> >
> > > Casey,
> > >
> > > Can you give me a level set of what your thinking is now? I think it's
> > > global control of all index types + overrides on a per-type

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Otto Fowler

In the end, each write destination ‘type’ will need its own
configuration.  This is an extension point.
{
HDFS:{
outputAdapters:[
{name: avro,
settings:{
avro stuff….
when:{
},
{
 name: sequence file,
 …..

or some such.


On January 13, 2017 at 11:51:15, Nick Allen (n...@nickallen.org) wrote:

I will add also that instead of global overrides, like index, we should use
configuration key names that are more appropriate to the output.

For example, does 'index' really make sense for HDFS? Or would 'path' be
more appropriate?

{
'elasticsearch': {
'index': 'foo',
'batchSize': 1
},
'hdfs': {
'path': '/foo/bar/...',
'batchSize': 100
}
}

Ok, I've said my piece. Thanks for the effort in summarizing all this,
Casey.


On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen  wrote:

>> Nick's concerns about my suggestion were that it was overly complex and
>> hard to grok and that we could dispense with backwards compatibility and
>> make people do a bit more work on the default case for the benefits of a
>> simpler advanced case. (Nick, make sure I don't misstate your position)
>
>
> I will add that, in my mind, the majority case would be a user
> specifying the outputs, but not things like 'batchSize' or 'when'. I
> think in the majority case, the user would accept whatever the default
> batch size is.
>
> Here are alternative suggestions for all the examples that you provided
> previously.
>
> Base Case
>
> - The user must always specify the 'outputs' for clarity.
> - Uses default index name, batch size and when = true.
>
> {
> 'elasticsearch': {},
> 'hdfs': {}
> }
>
>
> Writer-non-specific Case
> <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-non-specific-case>
>
> - There are no global overrides, as in Casey's proposal.
> - Easier to grok IMO.
>
> {
> 'elasticsearch': {
> 'index': 'foo',
> 'batchSize': 100
> },
> 'hdfs': {
> 'index': 'foo',
> 'batchSize': 100
> }
> }
>
>
> Writer-specific case without filters
> <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-without-filters>
>
> {
> 'elasticsearch': {
> 'index': 'foo',
> 'batchSize': 1
> },
> 'hdfs': {
> 'index': 'foo',
> 'batchSize': 100
> }
> }
>
>
> Writer-specific case with filters
> <https://gist.github.com/nickwallen/489735b65cdb38aae6e45cec7633a0a1#writer-specific-case-with-filters>
>
> - Instead of having to say when=false, just don't configure HDFS
>
> {
> 'elasticsearch': {
> 'index': 'foo',
> 'batchSize': 100,
> 'when': 'exists(field1)'
> }
> }
>
>
>
>
>
> On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella 
wrote:
>
>> Dave,
>> For the benefit of posterity and people who might not be as deeply
>> entangled in the system as we have been, I'll recap things and hopefully
>> answer your question in the process.
>>
>> Historically the index configuration is split between the enrichment
>> configs and the global configs.
>>
>> - The global configs really controls configs that apply to all sensors.
>> Historically this has been stuff like index connection strings, etc.
>> - The sensor-specific configs which control things that vary by sensor.
>>
>> As of Metron-652 (in review currently), we moved the sensor specific
>> configs from the enrichment configs. The proposal here is to increase
>> the granularity of the sensor specific files to make them support index
>> writer-specific configs. Right now in the indexing topology, we have 2
>> writers (fixed): ES/Solr and HDFS.
>>
>> The proposed configuration would allow you to either specify a blanket
>> sensor-level config for the index name and batchSize and/or override at
>> the
>> writer level, thereby supporting a couple of use-cases:
>>
>> - Turning off certain index writers (e.g. HDFS)
>> - Filtering the messages written to certain index writers
>>
>> The two competing configs between Nick and I are as follows:
>>
>> - I want to make sure we keep the old sensor-specific defaults with
>> writer-specific overrides available
>> - Nick thought we could simplify the permutations by making the
>> indexing
>> config only the writer-level configs.
>>
>> My concerns about Nick's suggestion were that the default and majority
>> case, specifying the index and the batchSize for all writers (the one we
>> support now) would require more configuration.
>>
>> Nick's concerns about my suggestion were that it was overly complex and
>> hard to grok and that we could dispense with backwards compatibility and
>> make people do a bit more work on the default case for the benefits of a
>> simpler advanced case. (Nick, make sure I don't misstate your position).
>>
>> Casey
>>
>>
>> On Fri, Jan 13, 2017 at 10:54 AM, David Lyle 
>> wrote:
>>
>> > Casey,
>> >
>> > Can you give me a level set of what your thinking is now? I think it's
>> > global control of all index types + overrides on a per-type basis. Fwiw,
>> > I'm totally for that, but I want to make sure I'm not imposing my
>> > preconceived

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread zeo...@gmail.com

I think Simon has a very valid suggestion.  Additionally, I have two
questions.  For the following config:

{
  "index" : "foo"
 ,"batchSize" : 100
}

Are now all logs going to the same index?  I read this as a writer-specific
override of the sensor-specific defaults to use an index name of foo* (in
HDFS that's foo, in ES that's foo-${timestamp}).  If that's true, would
something like this work?

{
  "batchSize" : 100
 ,"writerConfig" :
   {
     "elasticsearch" : {
       "when" : "exists(field1)",
       "index" : "+foo"
     }
   }
}

How I read this is, set a default batchSize of 100, but for each index
(holding to the sensor-specific defaults), specify an override for
elasticsearch to send to the index foo when field1 exists.  The result in
my mind would be that the sensor-specific default and foo both get this log
line, if field1 exists.
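
To make that reading concrete, here is a small Python sketch of how sensor-level defaults and a per-writer override block might be merged. The function name and the sensor name "mysensor" are hypothetical. The fallbacks (index named after the sensor, batchSize of 1) follow the base case Casey described in this thread, and a default "when" of "true" is an assumption:

```python
def effective_config(sensor, config, writer):
    """Hypothetical merge: sensor-level values act as defaults and anything
    under writerConfig[writer] overrides them for that writer only."""
    settings = {
        # Fallbacks taken from the base case discussed in the thread.
        "index": config.get("index", sensor),
        "batchSize": config.get("batchSize", 1),
        # Assumption: with no filter given, the writer writes every message.
        "when": "true",
    }
    settings.update(config.get("writerConfig", {}).get(writer, {}))
    return settings


config = {
    "batchSize": 100,
    "writerConfig": {
        "elasticsearch": {"when": "exists(field1)", "index": "+foo"},
    },
}

es_settings = effective_config("mysensor", config, "elasticsearch")
hdfs_settings = effective_config("mysensor", config, "hdfs")
```

Under this sketch HDFS keeps the sensor default index and picks up the shared batchSize of 100, while elasticsearch gets the filter and the "+foo" index. The merge treats "+foo" as an opaque string. What the plus sign means is left to whoever interprets the index.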

Of course the syntax I used for "+foo" is probably not optimal, but just
illustrative that it's appending an additional index to send to, as opposed
to overwriting the destination index (if you didn't add the +).  In fact,
the more I look at it, this appears to be a bad approach but I'm struggling
to think of an exact, cleaner solution to suggest offhand.  Something that
does: if exists(field1), then index += foo.
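
For illustration only, the append-versus-overwrite idea could be sketched like this in Python. The function name is made up, the "when" filter is passed in as an ordinary Python predicate standing in for a Stellar expression such as exists(field1), and the behaviour when the filter does not match is my interpretation rather than anything agreed in the thread:

```python
def target_indices(default_index, writer_config, message, when):
    """Hypothetical routing: a leading '+' on the configured index means
    'also write to this index'; without it the index replaces the default."""
    index = writer_config.get("index", default_index)
    matches = when(message)
    if index.startswith("+"):
        # Append semantics: the default index always gets the message and
        # the extra index gets it only when the filter matches.
        return [default_index, index[1:]] if matches else [default_index]
    # Overwrite semantics: the filter decides whether the message is written.
    return [index] if matches else []


def has_field1(message):
    # Stand-in for the Stellar filter exists(field1).
    return "field1" in message
```

A message carrying field1 would then be routed to both the sensor default index and foo, and a message without it to the sensor default only.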

Also, as previously discussed, this could easily be a follow-on enhancement.

Jon

On Fri, Jan 13, 2017 at 11:18 AM David Lyle  wrote:

Thanks Casey!

I think I had the right of it, but wanted to make sure.

I'm +1 on defaults in global with overrides in sensor-specific. At least in
the first iteration. I (like Otto) suspect we'll have a few go-arounds on
this.

-D...

On Fri, Jan 13, 2017 at 11:09 AM, Otto Fowler 
wrote:

> This is an excellent point
>
>
> On January 13, 2017 at 10:54:07, Simon Elliston Ball (
> si...@simonellistonball.com) wrote:
>
> Something else to consider here is the possibility of multiple indices
> within a given target technology.
>
> For example, if I’m indexing data from a given sensor into, say, Solr, I
> may want it filtered differently into two different indices. This would
> enable me to create different ‘views’ which could have different security
> settings applied in that backend. This would be useful for multi-tenant
> installs, and for differing data privilege levels within an organisation.
> You could argue that this is more a concern for filtering of the results
> coming out of an index, but currently this is a lot harder than using
> something like the Ranger Solr authorisation plugin to control access at
> an index by index granularity.
>
> Essentially, the indexer topology then becomes a filter and router, which
> argues for it being a separate step, before the process which actually
> writes out to each platform. It may also make sense to have a concept of a
> routing key built up by earlier enrichment to allow shuffle control in
> Storm, rather than a full Stellar statement for routing, to avoid
> overhead.
>
> Simon
>
> > On 13 Jan 2017, at 07:44, Casey Stella  wrote:
> >
> > I am suggesting that, yes. The configs are essentially the same as
yours,
> > except there is an override specified at the top level. Without that, in
> > order to specify both HDFS and ES have batch sizes of 100, you have to
> > explicitly configure each. It's less that I'm trying to have backwards
> > compatibility and more that I'm trying to make the majority case easy:
> both
> > writers write everything to a specified index name with a specified
batch
> > size (which is what we have now). Beyond that, I want to allow for
> > specifying an override for the config on a writer-by-writer basis for
> those
> > who need it.
> >
> > On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen  wrote:
> >
> >> Are you saying we support all of these variants? I realize you are
> trying
> >> to have some backwards compatibility, but this also makes it harder for
> a
> >> user to grok (for me at least).
> >>
> >> Personally I like my original example as there are fewer
sub-structures,
> >> like 'writerConfig', which makes the whole thing simpler and easier to
> >> grok. But maybe others will think your proposal is just as easy to
grok.
> >>
> >>
> >>
> >> On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella 
> wrote:
> >>
> >>> Ok, so here's what I'm thinking based on the discussion:
> >>>
> >>> - Keeping the configs that we have now (batchSize and index) as
> >> defaults
> >>> for the unspecified writer-specific case
> >>> - Adding the config Nick suggested
> >>>
> >>> *Base Case*:
> >>> {
> >>> }
> >>>
> >>> - all writers write all messages
> >>> - index named the same as the sensor for all writers
> >>> - batchSize of 1 for all writers
> >>>
> >>> *Writer-non-specific case*:
> >>> {
> >>> "index" : "foo"
> >>> ,"batchSize" : 100
> >>> }
> >>>
> >>> - All writers write all messages
>

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Nick Allen

>
> Nick's concerns about my suggestion were that it was overly complex and
> hard to grok and that we could dispense with backwards compatibility and
> make people do a bit more work on the default case for the benefits of a
> simpler advanced case. (Nick, make sure I don't misstate your position)


I will add that, in my mind, the majority case would be a user specifying
the outputs, but not things like 'batchSize' or 'when'.  I think in the
majority case, the user would accept whatever the default batch size is.

Here are alternative suggestions for all the examples that you provided
previously.

Base Case

   - The user must always specify the 'outputs' for clarity.
   - Uses default index name, batch size and when = true.

{
   'elasticsearch': {},
   'hdfs': {}
}

Writer-non-specific Case

   - There are no global overrides, as in Casey's proposal.
   - Easier to grok IMO.

{
   'elasticsearch': {
      'index': 'foo',
      'batchSize': 100
   },
   'hdfs': {
      'index': 'foo',
      'batchSize': 100
   }
}

Writer-specific case without filters

{
   'elasticsearch': {
      'index': 'foo',
      'batchSize': 1
   },
   'hdfs': {
      'index': 'foo',
      'batchSize': 100
   }
}

Writer-specific case with filters

   - Instead of having to say when=false, just don't configure HDFS

{
   'elasticsearch': {
      'index': 'foo',
      'batchSize': 100,
      'when': 'exists(field1)'
   }
}
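
As a rough illustration of the semantics in these examples, here is a Python sketch in which a writer is active only if it has a block in the config, and an empty block picks up defaults. The function name and the sensor name "mysensor" are hypothetical. The default index (the sensor name) and default batchSize of 1 are borrowed from the base case described earlier in the thread and are assumptions here:

```python
def resolve_writers(sensor, config, default_batch_size=1):
    """Hypothetical resolver for the writer-keyed syntax: writers absent from
    the config are simply not returned, which is how a writer is turned off."""
    resolved = {}
    for writer, overrides in config.items():
        settings = {
            "index": sensor,                  # assumed default: sensor name
            "batchSize": default_batch_size,  # assumed default batch size
            "when": "true",                   # default filter: write everything
        }
        settings.update(overrides)
        resolved[writer] = settings
    return resolved


base_case = resolve_writers("mysensor", {"elasticsearch": {}, "hdfs": {}})

es_only = resolve_writers("mysensor", {
    "elasticsearch": {
        "index": "foo",
        "batchSize": 100,
        "when": "exists(field1)",
    },
})
```

In the second call there is no 'hdfs' key in the result at all, so nothing would be written to HDFS for that sensor.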





On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella  wrote:

> Dave,
> For the benefit of posterity and people who might not be as deeply
> entangled in the system as we have been, I'll recap things and hopefully
> answer your question in the process.
>
> Historically the index configuration is split between the enrichment
> configs and the global configs.
>
>- The global configs really controls configs that apply to all sensors.
>Historically this has been stuff like index connection strings, etc.
>- The sensor-specific configs which control things that vary by sensor.
>
> As of Metron-652 (in review currently), we moved the sensor specific
> configs from the enrichment configs.  The proposal here is to increase the
> granularity of the sensor specific files to make them support index
> writer-specific configs.  Right now in the indexing topology, we have 2
> writers (fixed): ES/Solr and HDFS.
>
> The proposed configuration would allow you to either specify a blanket
> sensor-level config for the index name and batchSize and/or override at the
> writer level, thereby supporting a couple of use-cases:
>
>- Turning off certain index writers (e.g. HDFS)
>- Filtering the messages written to certain index writers
>
> The two competing configs between Nick and I are as follows:
>
>- I want to make sure we keep the old sensor-specific defaults with
>writer-specific overrides available
>- Nick thought we could simplify the permutations by making the indexing
>config only the writer-level configs.
>
> My concerns about Nick's suggestion were that the default and majority
> case, specifying the index and the batchSize for all writers (the one we
> support now) would require more configuration.
>
> Nick's concerns about my suggestion were that it was overly complex and
> hard to grok and that we could dispense with backwards compatibility and
> make people do a bit more work on the default case for the benefits of a
> simpler advanced case. (Nick, make sure I don't misstate your position).
>
> Casey
>
>
> On Fri, Jan 13, 2017 at 10:54 AM, David Lyle  wrote:
>
> > Casey,
> >
> > Can you give me a level set of what your thinking is now? I think it's
> > global control of all index types + overrides on a per-type basis. Fwiw,
> > I'm totally for that, but I want to make sure I'm not imposing my
> > preconceived notions on your consensus-driven ones.
> >
> > -D
> >
> > On Fri, Jan 13, 2017 at 10:44 AM, Casey Stella 
> wrote:
> >
> > > I am suggesting that, yes.  The configs are essentially the same as
> > > yours, except there is an override specified at the top level.  Without
> > > that, in order to specify both HDFS and ES have batch sizes of 100, you
> > > have to explicitly configure each.  It's less that I'm trying to have
> > > backwards compatibility and more that I'm trying to make the majority
> > > case easy: both writers write everything to a specified index name with
> > > a specified batch size (which is what we have now).  Beyond that, I want
> > > to allow for specifying an override for the config on a
> > > writer-by-writer basis for those who need it.
> > >
> > > On Fri, Jan 13,

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Nick Allen

I will add also that instead of global overrides, like index, we should use
configuration key names that are more appropriate to the output.

For example, does 'index' really make sense for HDFS?  Or would 'path' be
more appropriate?

{
   'elasticsearch': {
      'index': 'foo',
      'batchSize': 1
   },
   'hdfs': {
      'path': '/foo/bar/...',
      'batchSize': 100
   }
}
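
One way to picture output-appropriate key names is a per-writer set of allowed keys, so that 'index' on HDFS gets flagged instead of silently accepted. The Python below is only a sketch. The key sets and the function name are assumptions drawn from the examples in this thread, not an actual Metron schema:

```python
# Hypothetical per-writer schemas: which keys each output understands.
ALLOWED_KEYS = {
    "elasticsearch": {"index", "batchSize", "when"},
    "hdfs": {"path", "batchSize", "when"},
}


def unknown_keys(config):
    """Return {writer: [keys that writer does not understand]}.
    An unrecognized writer name is reported with all of its keys."""
    problems = {}
    for writer, settings in config.items():
        allowed = ALLOWED_KEYS.get(writer, set())
        extra = sorted(set(settings) - allowed)
        if extra or writer not in ALLOWED_KEYS:
            problems[writer] = extra
    return problems


ok = unknown_keys({
    "elasticsearch": {"index": "foo", "batchSize": 1},
    "hdfs": {"path": "/foo/bar/...", "batchSize": 100},
})
flagged = unknown_keys({"hdfs": {"index": "foo", "batchSize": 100}})
```

The first config passes cleanly, while the second reports 'index' as a key the HDFS writer does not know.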

Ok, I've said my piece.  Thanks for the effort in summarizing all this,
Casey.


On Fri, Jan 13, 2017 at 11:42 AM, Nick Allen  wrote:

>> Nick's concerns about my suggestion were that it was overly complex and
>> hard to grok and that we could dispense with backwards compatibility and
>> make people do a bit more work on the default case for the benefits of a
>> simpler advanced case. (Nick, make sure I don't misstate your position)
>
>
> I will add that, in my mind, the majority case would be a user
> specifying the outputs, but not things like 'batchSize' or 'when'.  I think
> in the majority case, the user would accept whatever the default batch size
> is.
>
> Here are alternative suggestions for all the examples that you provided
> previously.
>
> Base Case
>
>    - The user must always specify the 'outputs' for clarity.
>    - Uses default index name, batch size and when = true.
>
> {
>    'elasticsearch': {},
>    'hdfs': {}
> }
>
>
> Writer-non-specific Case
>
>    - There are no global overrides, as in Casey's proposal.
>    - Easier to grok IMO.
>
> {
>    'elasticsearch': {
>       'index': 'foo',
>       'batchSize': 100
>    },
>    'hdfs': {
>       'index': 'foo',
>       'batchSize': 100
>    }
> }
>
>
> Writer-specific case without filters
>
> {
>    'elasticsearch': {
>       'index': 'foo',
>       'batchSize': 1
>    },
>    'hdfs': {
>       'index': 'foo',
>       'batchSize': 100
>    }
> }
>
>
> Writer-specific case with filters
>
>    - Instead of having to say when=false, just don't configure HDFS
>
> {
>    'elasticsearch': {
>       'index': 'foo',
>       'batchSize': 100,
>       'when': 'exists(field1)'
>    }
> }
>
>
>
>
>
> On Fri, Jan 13, 2017 at 11:06 AM, Casey Stella  wrote:
>
>> Dave,
>> For the benefit of posterity and people who might not be as deeply
>> entangled in the system as we have been, I'll recap things and hopefully
>> answer your question in the process.
>>
>> Historically the index configuration is split between the enrichment
>> configs and the global configs.
>>
>>- The global configs really controls configs that apply to all sensors.
>>Historically this has been stuff like index connection strings, etc.
>>- The sensor-specific configs which control things that vary by sensor.
>>
>> As of Metron-652 (in review currently), we moved the sensor specific
>> configs from the enrichment configs.  The proposal here is to increase the
>> granularity of the sensor specific files to make them support index
>> writer-specific configs.  Right now in the indexing topology, we have 2
>> writers (fixed): ES/Solr and HDFS.
>>
>> The proposed configuration would allow you to either specify a blanket
>> sensor-level config for the index name and batchSize and/or override at
>> the
>> writer level, thereby supporting a couple of use-cases:
>>
>>- Turning off certain index writers (e.g. HDFS)
>>- Filtering the messages written to certain index writers
>>
>> The two competing configs between Nick and I are as follows:
>>
>>- I want to make sure we keep the old sensor-specific defaults with
>>writer-specific overrides available
>>- Nick thought we could simplify the permutations by making the
>> indexing
>>config only the writer-level configs.
>>
>> My concerns about Nick's suggestion were that the default and majority
>> case, specifying the index and the batchSize for all writers (the one we
>> support now) would require more configuration.
>>
>> Nick's concerns about my suggestion were that it was overly complex and
>> hard to grok and that we could dispense with backwards compatibility and
>> make people do a bit more work on the default case for the benefits of a
>> simpler advanced case. (Nick, make sure I don't misstate your position).
>>
>> Casey
>>
>>
>> On Fri, Jan 13, 2017 at 10:54 AM, David Lyle 
>> wrote:
>>
>> > Casey,
>> >
>> > Can you give me a level set of what your thinking is now? I think it's
>> > global control of all index types + overrides on a per-type basis. Fwiw,
>> > I'm totally for that, but I want to make sure I'm not imposing my
>> > preconceived notions on your consensus-driven ones.
>> >
>> > -D
>> >
>> > On Fri, Jan 13, 2017 at 10:44 AM, Casey Stella

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread David Lyle

Thanks Casey!

I think I had the right of it, but wanted to make sure.

I'm +1 on defaults in global with overrides in sensor-specific. At least in
the first iteration. I (like Otto) suspect we'll have a few go-arounds on
this.

-D...


On Fri, Jan 13, 2017 at 11:09 AM, Otto Fowler 
wrote:

> This is an excellent point
>
>
> On January 13, 2017 at 10:54:07, Simon Elliston Ball (
> si...@simonellistonball.com) wrote:
>
> Something else to consider here is the possibility of multiple indices
> within a given target technology.
>
> For example, if I’m indexing data from a given sensor into, say solr, I
> may want it filtered differently into two different indices. This would
> enable me to create different ‘views’ which could have different security
> settings applied in that backend. This would be useful for multi-tenant
> installs, and for differing data privilege levels within an organisation.
> You could argue that this is more a concern for filtering of the results
> coming out of an index, but currently this is a lot harder than using
> something like the ranger solr authorisation plugin to control access at an
> index by index granularity.
>
> Essentially, the indexer topology then becomes a filter and router, which
> argues for it being a separate step, before the process which actually
> writes out to each platform. It may also make sense to have a concept of a
> routing key built up by earlier enrichment to allow shuffle control in
> storm, rather than a full stellar statement for routing, to avoid overhead.
>
> Simon
>
> > On 13 Jan 2017, at 07:44, Casey Stella  wrote:
> >
> > I am suggesting that, yes. The configs are essentially the same as yours,
> > except there is an override specified at the top level. Without that, in
> > order to specify both HDFS and ES have batch sizes of 100, you have to
> > explicitly configure each. It's less that I'm trying to have backwards
> > compatibility and more that I'm trying to make the majority case easy:
> both
> > writers write everything to a specified index name with a specified batch
> > size (which is what we have now). Beyond that, I want to allow for
> > specifying an override for the config on a writer-by-writer basis for
> those
> > who need it.
> >
> > On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen  wrote:
> >
> >> Are you saying we support all of these variants? I realize you are
> trying
> >> to have some backwards compatibility, but this also makes it harder for
> a
> >> user to grok (for me at least).
> >>
> >> Personally I like my original example as there are fewer sub-structures,
> >> like 'writerConfig', which makes the whole thing simpler and easier to
> >> grok. But maybe others will think your proposal is just as easy to grok.
> >>
> >>
> >>
> >> On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella 
> wrote:
> >>
> >>> Ok, so here's what I'm thinking based on the discussion:
> >>>
> >>> - Keeping the configs that we have now (batchSize and index) as
> >> defaults
> >>> for the unspecified writer-specific case
> >>> - Adding the config Nick suggested
> >>>
> >>> *Base Case*:
> >>> {
> >>> }
> >>>
> >>> - all writers write all messages
> >>> - index named the same as the sensor for all writers
> >>> - batchSize of 1 for all writers
> >>>
> >>> *Writer-non-specific case*:
> >>> {
> >>> "index" : "foo"
> >>> ,"batchSize" : 100
> >>> }
> >>>
> >>> - All writers write all messages
> >>> - index is named "foo", different from the sensor for all writers
> >>> - batchSize is 100 for all writers
> >>>
> >>> *Writer-specific case without filters*
> >>> {
> >>> "index" : "foo"
> >>> ,"batchSize" : 1
> >>> , "writerConfig" :
> >>> {
> >>> "elasticsearch" : {
> >>> "batchSize" : 100
> >>> }
> >>> }
> >>> }
> >>>
> >>> - All writers write all messages
> >>> - index is named "foo", different from the sensor for all writers
> >>> - batchSize is 1 for HDFS and 100 for elasticsearch writers
> >>> - NOTE: I could override the index name too
> >>>
> >>> *Writer-specific case with filters*
> >>> {
> >>> "index" : "foo"
> >>> ,"batchSize" : 1
> >>> , "writerConfig" :
> >>> {
> >>> "elasticsearch" : {
> >>> "batchSize" : 100,
> >>> "when" : "exists(field1)"
> >>> },
> >>> "hdfs" : {
> >>> "when" : "false"
> >>> }
> >>> }
> >>> }
> >>>
> >>> - ES writer writes messages which have field1, HDFS doesn't
> >>> - index is named "foo", different from the sensor for all writers
> >>> - batchSize is 100 for elasticsearch writers
> >>>
> >>> Thoughts?
> >>>
> >>> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> >>> wrote:
> >>>
>  For larger installations you need to control what is indexed so you
> >> don’t
>  end up with a nasty elastic search situation and so you can mine the
> >> data
>  later for reports and training ml models.
> 
>  Thanks
>  Carolyn
> 
> 
> 
> 
>  On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
>

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Otto Fowler

This is an excellent point


On January 13, 2017 at 10:54:07, Simon Elliston Ball 
(si...@simonellistonball.com) wrote:

Something else to consider here is the possibility of multiple indices within 
a given target technology.  

For example, if I’m indexing data from a given sensor into, say solr, I may 
want it filtered differently into two different indices. This would enable me 
to create different ‘views’ which could have different security settings 
applied in that backend. This would be useful for multi-tenant installs, and 
for differing data privilege levels within an organisation. You could argue 
that this is more a concern for filtering of the results coming out of an 
index, but currently this is a lot harder than using something like the ranger 
solr authorisation plugin to control access at an index by index granularity.  

Essentially, the indexer topology then becomes a filter and router, which 
argues for it being a separate step, before the process which actually writes 
out to each platform. It may also make sense to have a concept of a routing key 
built up by earlier enrichment to allow shuffle control in storm, rather than a 
full stellar statement for routing, to avoid overhead.  

Simon  

> On 13 Jan 2017, at 07:44, Casey Stella  wrote:  
>  
> I am suggesting that, yes. The configs are essentially the same as yours,  
> except there is an override specified at the top level. Without that, in  
> order to specify both HDFS and ES have batch sizes of 100, you have to  
> explicitly configure each. It's less that I'm trying to have backwards  
> compatibility and more that I'm trying to make the majority case easy: both  
> writers write everything to a specified index name with a specified batch  
> size (which is what we have now). Beyond that, I want to allow for  
> specifying an override for the config on a writer-by-writer basis for those  
> who need it.  
>  
> On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen  wrote:  
>  
>> Are you saying we support all of these variants? I realize you are trying  
>> to have some backwards compatibility, but this also makes it harder for a  
>> user to grok (for me at least).  
>>  
>> Personally I like my original example as there are fewer sub-structures,  
>> like 'writerConfig', which makes the whole thing simpler and easier to  
>> grok. But maybe others will think your proposal is just as easy to grok.  
>>  
>>  
>>  
>> On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella  wrote:  
>>  
>>> Ok, so here's what I'm thinking based on the discussion:  
>>>  
>>> - Keeping the configs that we have now (batchSize and index) as  
>> defaults  
>>> for the unspecified writer-specific case  
>>> - Adding the config Nick suggested  
>>>  
>>> *Base Case*:  
>>> {  
>>> }  
>>>  
>>> - all writers write all messages  
>>> - index named the same as the sensor for all writers  
>>> - batchSize of 1 for all writers  
>>>  
>>> *Writer-non-specific case*:  
>>> {  
>>> "index" : "foo"  
>>> ,"batchSize" : 100  
>>> }  
>>>  
>>> - All writers write all messages  
>>> - index is named "foo", different from the sensor for all writers  
>>> - batchSize is 100 for all writers  
>>>  
>>> *Writer-specific case without filters*  
>>> {  
>>> "index" : "foo"  
>>> ,"batchSize" : 1  
>>> , "writerConfig" :  
>>> {  
>>> "elasticsearch" : {  
>>> "batchSize" : 100  
>>> }  
>>> }  
>>> }  
>>>  
>>> - All writers write all messages  
>>> - index is named "foo", different from the sensor for all writers  
>>> - batchSize is 1 for HDFS and 100 for elasticsearch writers  
>>> - NOTE: I could override the index name too  
>>>  
>>> *Writer-specific case with filters*  
>>> {  
>>> "index" : "foo"  
>>> ,"batchSize" : 1  
>>> , "writerConfig" :  
>>> {  
>>> "elasticsearch" : {  
>>> "batchSize" : 100,  
>>> "when" : "exists(field1)"  
>>> },  
>>> "hdfs" : {  
>>> "when" : "false"  
>>> }  
>>> }  
>>> }  
>>>  
>>> - ES writer writes messages which have field1, HDFS doesn't  
>>> - index is named "foo", different from the sensor for all writers  
>>> - batchSize is 100 for elasticsearch writers  
>>>  
>>> Thoughts?  
>>>  
>>> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby   
>>> wrote:  
>>>  
 For larger installations you need to control what is indexed so you  
>> don’t  
 end up with a nasty elastic search situation and so you can mine the  
>> data  
 later for reports and training ml models.  
  
 Thanks  
 Carolyn  
  
  
  
  
 On 1/13/17, 9:40 AM, "Casey Stella"  wrote:  
  
> OH that's a good idea!  
>  
> On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen   
>> wrote:  
>  
>> I like the "Index Filtering" option based on the flexibility that it  
>> provides. Should each output (HDFS, ES, etc) have its own  
>>> configuration  
>> settings? For example, aren't things like

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella

Simon,

Great thought.  I had considered it, but didn't want to bite off all that
as part of a PR.  I thought baby-steps for the moment would be best.
Perhaps this deserves its own JIRA and discussion?

On Fri, Jan 13, 2017 at 10:53 AM, Simon Elliston Ball <
si...@simonellistonball.com> wrote:

> Something else to consider here is the possibility of multiple indices
> within a given target technology.
>
> For example, if I’m indexing data from a given sensor into, say solr, I
> may want it filtered differently into two different indices. This would
> enable me to create different ‘views’ which could have different security
> settings applied in that backend. This would be useful for multi-tenant
> installs, and for differing data privilege levels within an organisation.
> You could argue that this is more a concern for filtering of the results
> coming out of an index, but currently this is a lot harder than using
> something like the ranger solr authorisation plugin to control access at an
> index by index granularity.
>
> Essentially, the indexer topology then becomes a filter and router, which
> argues for it being a separate step, before the process which actually
> writes out to each platform. It may also make sense to have a concept of a
> routing key built up by earlier enrichment to allow shuffle control in
> storm, rather than a full stellar statement for routing, to avoid overhead.
>
> Simon
>
> > On 13 Jan 2017, at 07:44, Casey Stella  wrote:
> >
> > I am suggesting that, yes.  The configs are essentially the same as
> yours,
> > except there is an override specified at the top level.  Without that, in
> > order to specify both HDFS and ES have batch sizes of 100, you have to
> > explicitly configure each.  It's less that I'm trying to have backwards
> > compatibility and more that I'm trying to make the majority case easy:
> both
> > writers write everything to a specified index name with a specified batch
> > size (which is what we have now).  Beyond that, I want to allow for
> > specifying an override for the config on a writer-by-writer basis for
> those
> > who need it.
> >
> > On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen  wrote:
> >
> >> Are you saying we support all of these variants?  I realize you are
> trying
> >> to have some backwards compatibility, but this also makes it harder for
> a
> >> user to grok (for me at least).
> >>
> >> Personally I like my original example as there are fewer sub-structures,
> >> like 'writerConfig', which makes the whole thing simpler and easier to
> >> grok.  But maybe others will think your proposal is just as easy to
> grok.
> >>
> >>
> >>
> >> On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella 
> wrote:
> >>
> >>> Ok, so here's what I'm thinking based on the discussion:
> >>>
> >>>   - Keeping the configs that we have now (batchSize and index) as
> >> defaults
> >>>   for the unspecified writer-specific case
> >>>   - Adding the config Nick suggested
> >>>
> >>> *Base Case*:
> >>> {
> >>> }
> >>>
> >>>   - all writers write all messages
> >>>   - index named the same as the sensor for all writers
> >>>   - batchSize of 1 for all writers
> >>>
> >>> *Writer-non-specific case*:
> >>> {
> >>>  "index" : "foo"
> >>> ,"batchSize" : 100
> >>> }
> >>>
> >>>   - All writers write all messages
> >>>   - index is named "foo", different from the sensor for all writers
> >>>   - batchSize is 100 for all writers
> >>>
> >>> *Writer-specific case without filters*
> >>> {
> >>>  "index" : "foo"
> >>> ,"batchSize" : 1
> >>> , "writerConfig" :
> >>>   {
> >>>  "elasticsearch" : {
> >>>   "batchSize" : 100
> >>> }
> >>>   }
> >>> }
> >>>
> >>>   - All writers write all messages
> >>>   - index is named "foo", different from the sensor for all writers
> >>>   - batchSize is 1 for HDFS and 100 for elasticsearch writers
> >>>   - NOTE: I could override the index name too
> >>>
> >>> *Writer-specific case with filters*
> >>> {
> >>>  "index" : "foo"
> >>> ,"batchSize" : 1
> >>> , "writerConfig" :
> >>>   {
> >>>  "elasticsearch" : {
> >>>   "batchSize" : 100,
> >>>   "when" : "exists(field1)"
> >>> },
> >>>  "hdfs" : {
> >>> "when" : "false"
> >>>  }
> >>>   }
> >>> }
> >>>
> >>>   - ES writer writes messages which have field1, HDFS doesn't
> >>>   - index is named "foo", different from the sensor for all writers
> >>>   - batchSize is 100 for elasticsearch writers
> >>>
> >>> Thoughts?
> >>>
> >>> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> >>> wrote:
> >>>
>  For larger installations you need to control what is indexed so you
> >> don’t
>  end up with a nasty elastic search situation and so you can mine the
> >> data
>  later for reports and training ml models.
> 
>  Thanks

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella

Dave,
For the benefit of posterity and people who might not be as deeply
entangled in the system as we have been, I'll recap things and hopefully
answer your question in the process.

Historically the index configuration is split between the enrichment
configs and the global configs.

   - The global configs control settings that apply to all sensors.
   Historically this has been stuff like index connection strings, etc.
   - The sensor-specific configs control things that vary by sensor.

As of Metron-652 (in review currently), we moved the sensor-specific
configs out of the enrichment configs.  The proposal here is to increase the
granularity of the sensor-specific files to make them support index
writer-specific configs.  Right now in the indexing topology, we have 2
writers (fixed): ES/Solr and HDFS.

The proposed configuration would allow you to either specify a blanket
sensor-level config for the index name and batchSize and/or override at the
writer level, thereby supporting a couple of use-cases:

   - Turning off certain index writers (e.g. HDFS)
   - Filtering the messages written to certain index writers

The two competing positions, mine and Nick's, are as follows:

   - I want to make sure we keep the old sensor-specific defaults with
   writer-specific overrides available
   - Nick thought we could simplify the permutations by making the indexing
   config only the writer-level configs.

My concerns about Nick's suggestion were that the default and majority
case, specifying the index and the batchSize for all writers (the one we
support now), would require more configuration.

Nick's concerns about my suggestion were that it was overly complex and
hard to grok and that we could dispense with backwards compatibility and
make people do a bit more work on the default case for the benefits of a
simpler advanced case. (Nick, make sure I don't misstate your position).
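
To make the two use-cases above concrete, here is a minimal sketch of how an
indexing step might decide which writers receive a message. It is only an
illustration: the `matches` function is a toy stand-in for a Stellar
predicate (it understands just "true", "false" and "exists(field)" clauses
joined by " or "), and the names `matches`, `writers_for` and the writer
names are hypothetical, not Metron code.

```python
import re
from typing import Any, Dict, List, Sequence

_EXISTS = re.compile(r"^exists\((\w+)\)$")


def matches(when: str, message: Dict[str, Any]) -> bool:
    """Toy stand-in for a Stellar predicate; NOT real Stellar.

    Supports 'true', 'false' and 'exists(field)' clauses joined by ' or '.
    """
    for clause in when.split(" or "):
        clause = clause.strip()
        if clause == "true":
            return True
        if clause == "false":
            continue
        found = _EXISTS.match(clause)
        if found is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        if found.group(1) in message:
            return True
    return False


def writers_for(
    message: Dict[str, Any],
    writer_config: Dict[str, Dict[str, Any]],
    writers: Sequence[str] = ("elasticsearch", "hdfs"),  # assumed writer names
) -> List[str]:
    """Return the writers whose 'when' filter accepts the message.

    A writer with no 'when' entry writes everything, which matches the
    current behaviour described in the thread.
    """
    return [
        writer
        for writer in writers
        if matches(writer_config.get(writer, {}).get("when", "true"), message)
    ]


# The "writer-specific case with filters" from the quoted proposal.
filtered = {
    "elasticsearch": {"batchSize": 100, "when": "exists(field1)"},
    "hdfs": {"when": "false"},
}
selected = writers_for({"field1": "x"}, filtered)
```

With this shape, turning a writer off is just a filter that never matches,
which is why an explicit on/off switch and a "when" filter can coexist.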

Casey


On Fri, Jan 13, 2017 at 10:54 AM, David Lyle  wrote:

> Casey,
>
> Can you give me a level set of what your thinking is now? I think it's
> global control of all index types + overrides on a per-type basis. Fwiw,
> I'm totally for that, but I want to make sure I'm not imposing my
> preconceived notions on your consensus-driven ones.
>
> -D
>
> On Fri, Jan 13, 2017 at 10:44 AM, Casey Stella  wrote:
>
> > I am suggesting that, yes.  The configs are essentially the same as
> yours,
> > except there is an override specified at the top level.  Without that, in
> > order to specify both HDFS and ES have batch sizes of 100, you have to
> > explicitly configure each.  It's less that I'm trying to have backwards
> > compatibility and more that I'm trying to make the majority case easy:
> both
> > writers write everything to a specified index name with a specified batch
> > size (which is what we have now).  Beyond that, I want to allow for
> > specifying an override for the config on a writer-by-writer basis for
> those
> > who need it.
> >
> > On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen  wrote:
> >
> > > Are you saying we support all of these variants?  I realize you are
> > trying
> > > to have some backwards compatibility, but this also makes it harder
> for a
> > > user to grok (for me at least).
> > >
> > > Personally I like my original example as there are fewer
> sub-structures,
> > > like 'writerConfig', which makes the whole thing simpler and easier to
> > > grok.  But maybe others will think your proposal is just as easy to
> grok.
> > >
> > >
> > >
> > > On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella 
> > wrote:
> > >
> > > > Ok, so here's what I'm thinking based on the discussion:
> > > >
> > > >- Keeping the configs that we have now (batchSize and index) as
> > > defaults
> > > >for the unspecified writer-specific case
> > > >- Adding the config Nick suggested
> > > >
> > > > *Base Case*:
> > > > {
> > > > }
> > > >
> > > >- all writers write all messages
> > > >- index named the same as the sensor for all writers
> > > >- batchSize of 1 for all writers
> > > >
> > > > *Writer-non-specific case*:
> > > > {
> > > >   "index" : "foo"
> > > >  ,"batchSize" : 100
> > > > }
> > > >
> > > >- All writers write all messages
> > > >- index is named "foo", different from the sensor for all writers
> > > >- batchSize is 100 for all writers
> > > >
> > > > *Writer-specific case without filters*
> > > > {
> > > >   "index" : "foo"
> > > >  ,"batchSize" : 1
> > > >  , "writerConfig" :
> > > >{
> > > >   "elasticsearch" : {
> > > >"batchSize" : 100
> > > >  }
> > > >}
> > > > }
> > > >
> > > >- All writers write all messages
> > > >- index is named "foo", different from the sensor for all writers
> > > >- batchSize is 1 for HDFS and 100 for elasticsearch writers
> > > >- NOTE: I could override the index

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread David Lyle

Casey,

Can you give me a level set of what your thinking is now? I think it's
global control of all index types + overrides on a per-type basis. Fwiw,
I'm totally for that, but I want to make sure I'm not imposing my
preconceived notions on your consensus-driven ones.

-D

On Fri, Jan 13, 2017 at 10:44 AM, Casey Stella  wrote:

> I am suggesting that, yes.  The configs are essentially the same as yours,
> except there is an override specified at the top level.  Without that, in
> order to specify both HDFS and ES have batch sizes of 100, you have to
> explicitly configure each.  It's less that I'm trying to have backwards
> compatibility and more that I'm trying to make the majority case easy: both
> writers write everything to a specified index name with a specified batch
> size (which is what we have now).  Beyond that, I want to allow for
> specifying an override for the config on a writer-by-writer basis for those
> who need it.
>
> On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen  wrote:
>
> > Are you saying we support all of these variants?  I realize you are
> trying
> > to have some backwards compatibility, but this also makes it harder for a
> > user to grok (for me at least).
> >
> > Personally I like my original example as there are fewer sub-structures,
> > like 'writerConfig', which makes the whole thing simpler and easier to
> > grok.  But maybe others will think your proposal is just as easy to grok.
> >
> >
> >
> > On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella 
> wrote:
> >
> > > Ok, so here's what I'm thinking based on the discussion:
> > >
> > >- Keeping the configs that we have now (batchSize and index) as
> > defaults
> > >for the unspecified writer-specific case
> > >- Adding the config Nick suggested
> > >
> > > *Base Case*:
> > > {
> > > }
> > >
> > >- all writers write all messages
> > >- index named the same as the sensor for all writers
> > >- batchSize of 1 for all writers
> > >
> > > *Writer-non-specific case*:
> > > {
> > >   "index" : "foo"
> > >  ,"batchSize" : 100
> > > }
> > >
> > >- All writers write all messages
> > >- index is named "foo", different from the sensor for all writers
> > >- batchSize is 100 for all writers
> > >
> > > *Writer-specific case without filters*
> > > {
> > >   "index" : "foo"
> > >  ,"batchSize" : 1
> > >  , "writerConfig" :
> > >{
> > >   "elasticsearch" : {
> > >"batchSize" : 100
> > >  }
> > >}
> > > }
> > >
> > >- All writers write all messages
> > >- index is named "foo", different from the sensor for all writers
> > >- batchSize is 1 for HDFS and 100 for elasticsearch writers
> > >- NOTE: I could override the index name too
> > >
> > > *Writer-specific case with filters*
> > > {
> > >   "index" : "foo"
> > >  ,"batchSize" : 1
> > >  , "writerConfig" :
> > >{
> > >   "elasticsearch" : {
> > >"batchSize" : 100,
> > >"when" : "exists(field1)"
> > >  },
> > >   "hdfs" : {
> > >  "when" : "false"
> > >   }
> > >}
> > > }
> > >
> > >- ES writer writes messages which have field1, HDFS doesn't
> > >- index is named "foo", different from the sensor for all writers
> > > >- batchSize is 100 for elasticsearch writers
> > >
> > > Thoughts?
> > >
> > > On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> > > wrote:
> > >
> > > > For larger installations you need to control what is indexed so you
> > don’t
> > > > end up with a nasty elastic search situation and so you can mine the
> > data
> > > > later for reports and training ml models.
> > > >
> > > > Thanks
> > > > Carolyn
> > > >
> > > >
> > > >
> > > >
> > > > On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
> > > >
> > > > >OH that's a good idea!
> > > > >
> > > > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen 
> > wrote:
> > > > >
> > > > >> I like the "Index Filtering" option based on the flexibility that
> it
> > > > >> provides.  Should each output (HDFS, ES, etc) have its own
> > > configuration
> > > > >> settings?  For example, aren't things like batching handled
> > separately
> > > > for
> > > > >> HDFS versus Elasticsearch?
> > > > >>
> > > > >> Something along the lines of...
> > > > >>
> > > > >> {
> > > > >>   "hdfs" : {
> > > > >> "when": "exists(field1)",
> > > > >> "batchSize": 100
> > > > >>   },
> > > > >>
> > > > >>   "elasticsearch" : {
> > > > >> "when": "true",
> > > > >> "batchSize": 1000,
> > > > >> "index": "squid"
> > > > >>   }
> > > > >> }
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella  >
> > > > wrote:
> > > > >>
> > > > >> > Yeah, I

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Simon Elliston Ball

Something else to consider here is the possibility of multiple indices within 
a given target technology.

For example, if I’m indexing data from a given sensor into, say solr, I may 
want it filtered differently into two different indices. This would enable me 
to create different ‘views’ which could have different security settings 
applied in that backend. This would be useful for multi-tenant installs, and 
for differing data privilege levels within an organisation. You could argue 
that this is more a concern for filtering of the results coming out of an 
index, but currently this is a lot harder than using something like the ranger 
solr authorisation plugin to control access at an index by index granularity. 

Essentially, the indexer topology then becomes a filter and router, which 
argues for it being a separate step, before the process which actually writes 
out to each platform. It may also make sense to have a concept of a routing key 
built up by earlier enrichment to allow shuffle control in storm, rather than a 
full stellar statement for routing, to avoid overhead.
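
As a rough sketch of the routing-key idea, assume an earlier enrichment step
has already stamped each message with a `routing_key` field, so the indexing
side only does a table lookup instead of evaluating a full statement per
message. The field name, the table contents and the function `routes_for`
are all hypothetical and only illustrate the shape of the idea.

```python
from typing import Any, Dict, List, Tuple

Route = Tuple[str, str]  # (writer name, index name)

# Hypothetical routing table: one sensor fanned out to several indices in
# the same backend, for example one index per tenant.
ROUTES: Dict[str, List[Route]] = {
    "tenant_a": [("solr", "squid_tenant_a")],
    "tenant_b": [("solr", "squid_tenant_b"), ("hdfs", "squid")],
}

# Assumed fallback for messages without a recognised routing key.
DEFAULT_ROUTES: List[Route] = [("solr", "squid")]


def routes_for(message: Dict[str, Any]) -> List[Route]:
    """Look up the (writer, index) pairs for a message by its routing key."""
    return ROUTES.get(message.get("routing_key"), DEFAULT_ROUTES)
```

Because the key is a plain value, the same field could also be used to group
messages before they reach the writers, which is the shuffle control
mentioned above.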

Simon

> On 13 Jan 2017, at 07:44, Casey Stella  wrote:
> 
> I am suggesting that, yes.  The configs are essentially the same as yours,
> except there is an override specified at the top level.  Without that, in
> order to specify both HDFS and ES have batch sizes of 100, you have to
> explicitly configure each.  It's less that I'm trying to have backwards
> compatibility and more that I'm trying to make the majority case easy: both
> writers write everything to a specified index name with a specified batch
> size (which is what we have now).  Beyond that, I want to allow for
> specifying an override for the config on a writer-by-writer basis for those
> who need it.
> 
> On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen  wrote:
> 
>> Are you saying we support all of these variants?  I realize you are trying
>> to have some backwards compatibility, but this also makes it harder for a
>> user to grok (for me at least).
>> 
>> Personally I like my original example as there are fewer sub-structures,
>> like 'writerConfig', which makes the whole thing simpler and easier to
>> grok.  But maybe others will think your proposal is just as easy to grok.
>> 
>> 
>> 
>> On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella  wrote:
>> 
>>> Ok, so here's what I'm thinking based on the discussion:
>>> 
>>>   - Keeping the configs that we have now (batchSize and index) as
>> defaults
>>>   for the unspecified writer-specific case
>>>   - Adding the config Nick suggested
>>> 
>>> *Base Case*:
>>> {
>>> }
>>> 
>>>   - all writers write all messages
>>>   - index named the same as the sensor for all writers
>>>   - batchSize of 1 for all writers
>>> 
>>> *Writer-non-specific case*:
>>> {
>>>  "index" : "foo"
>>> ,"batchSize" : 100
>>> }
>>> 
>>>   - All writers write all messages
>>>   - index is named "foo", different from the sensor for all writers
>>>   - batchSize is 100 for all writers
>>> 
>>> *Writer-specific case without filters*
>>> {
>>>  "index" : "foo"
>>> ,"batchSize" : 1
>>> , "writerConfig" :
>>>   {
>>>  "elasticsearch" : {
>>>   "batchSize" : 100
>>> }
>>>   }
>>> }
>>> 
>>>   - All writers write all messages
>>>   - index is named "foo", different from the sensor for all writers
>>>   - batchSize is 1 for HDFS and 100 for elasticsearch writers
>>>   - NOTE: I could override the index name too
>>> 
>>> *Writer-specific case with filters*
>>> {
>>>  "index" : "foo"
>>> ,"batchSize" : 1
>>> , "writerConfig" :
>>>   {
>>>  "elasticsearch" : {
>>>   "batchSize" : 100,
>>>   "when" : "exists(field1)"
>>> },
>>>  "hdfs" : {
>>> "when" : "false"
>>>  }
>>>   }
>>> }
>>> 
>>>   - ES writer writes messages which have field1, HDFS doesn't
>>>   - index is named "foo", different from the sensor for all writers
>>>   - batchSize is 100 for elasticsearch writers
>>> 
>>> Thoughts?
>>> 
>>> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
>>> wrote:
>>> 
 For larger installations you need to control what is indexed so you
>> don’t
 end up with a nasty elastic search situation and so you can mine the
>> data
 later for reports and training ml models.
 
 Thanks
 Carolyn
 
 
 
 
 On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
 
> OH that's a good idea!
> 
> On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen 
>> wrote:
> 
>> I like the "Index Filtering" option based on the flexibility that it
>> provides.  Should each output (HDFS, ES, etc) have its own
>>> configuration
>> settings?  For example, aren't things like batching handled
>> separately
 for
>> HDFS versus

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Nick Allen

Yep, that makes sense, Casey.  I understand multiline is still just the
same when statement.  I was more responding to Otto's concern about dealing
with 50 whens.

In regards to multiline, I don't know if adding that is worth the potential
confusion.  I prefer very simple configs that are stupid simple to grok.  I
don't have a strong opinion on multiline though, so could go either way.
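
For reference, the multiline form amounts to very little code. A minimal
sketch, assuming "when" may arrive either as a single string or as a list of
string fragments that are joined with a space (the function name
`normalize_when` is made up for illustration):

```python
from typing import List, Union


def normalize_when(when: Union[str, List[str]]) -> str:
    """Collapse a multiline 'when' (a list of fragments) into one statement."""
    if isinstance(when, str):
        return when
    # Fragments are trimmed and joined with a single space.
    return " ".join(part.strip() for part in when)


statement = normalize_when(
    [
        "exists(field1) or",
        "exists(field2) or",
        "exists(field3)",
    ]
)
```

Either form ends up as the same single statement, so the writer still sees
exactly one "when" per output.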




On Fri, Jan 13, 2017 at 10:38 AM, Casey Stella  wrote:

> Nick, Yep, that's the example I showed.  I'm just suggesting that the "when"
> use the multiline JSON trick here.  A single "when"
> statement with a couple of "or"s.
> So:
> "when" : [ "exists(field1) or"
>  , "exists(field2) or"
>  , "exists(field3)"
>  ]
> would resolve to "exists(field1) or exists(field2) or exists(field3)", a
> single stellar statement behind the scenes because the array is joined with
> a space into a single string.
>
> On Fri, Jan 13, 2017 at 10:34 AM, Nick Allen  wrote:
>
> > I was thinking there would only be one 'when' for each output.  So if we
> > have Elasticsearch and HDFS, you would have only 2 'when's.  Each when
> > statement could be as simple or complex as you need.
> >
> > On Fri, Jan 13, 2017 at 10:08 AM, Otto Fowler 
> > wrote:
> >
> > > How does it look with 50 whens
> > >
> > >
> > > On January 13, 2017 at 10:02:02, Casey Stella (ceste...@gmail.com)
> > wrote:
> > >
> > > Ok, so here's what I'm thinking based on the discussion:
> > >
> > > - Keeping the configs that we have now (batchSize and index) as
> defaults
> > > for the unspecified writer-specific case
> > > - Adding the config Nick suggested
> > >
> > > *Base Case*:
> > > {
> > > }
> > >
> > > - all writers write all messages
> > > - index named the same as the sensor for all writers
> > > - batchSize of 1 for all writers
> > >
> > > *Writer-non-specific case*:
> > > {
> > > "index" : "foo"
> > > ,"batchSize" : 100
> > > }
> > >
> > > - All writers write all messages
> > > - index is named "foo", different from the sensor for all writers
> > > - batchSize is 100 for all writers
> > >
> > > *Writer-specific case without filters*
> > > {
> > > "index" : "foo"
> > > ,"batchSize" : 1
> > > , "writerConfig" :
> > > {
> > > "elasticsearch" : {
> > > "batchSize" : 100
> > > }
> > > }
> > > }
> > >
> > > - All writers write all messages
> > > - index is named "foo", different from the sensor for all writers
> > > - batchSize is 1 for HDFS and 100 for elasticsearch writers
> > > - NOTE: I could override the index name too
> > >
> > > *Writer-specific case with filters*
> > > {
> > > "index" : "foo"
> > > ,"batchSize" : 1
> > > , "writerConfig" :
> > > {
> > > "elasticsearch" : {
> > > "batchSize" : 100,
> > > "when" : "exists(field1)"
> > > },
> > > "hdfs" : {
> > > "when" : "false"
> > > }
> > > }
> > > }
> > >
> > > - ES writer writes messages which have field1, HDFS doesn't
> > > - index is named "foo", different from the sensor for all writers
> > > > - batchSize is 100 for elasticsearch writers
> > >
> > > Thoughts?
> > >
> > > On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> > > wrote:
> > >
> > > > For larger installations you need to control what is indexed so you
> > don’t
> > > > end up with a nasty elastic search situation and so you can mine the
> > data
> > > > later for reports and training ml models.
> > > >
> > > > Thanks
> > > > Carolyn
> > > >
> > > >
> > > >
> > > >
> > > > On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
> > > >
> > > > >OH that's a good idea!
> > > > >
> > > > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen 
> > wrote:
> > > > >
> > > > >> I like the "Index Filtering" option based on the flexibility that
> it
> > > > >> provides. Should each output (HDFS, ES, etc) have its own
> > > configuration
> > > > >> settings? For example, aren't things like batching handled
> > separately
> > > > for
> > > > >> HDFS versus Elasticsearch?
> > > > >>
> > > > >> Something along the lines of...
> > > > >>
> > > > >> {
> > > > >> "hdfs" : {
> > > > >> "when": "exists(field1)",
> > > > >> "batchSize": 100
> > > > >> },
> > > > >>
> > > > >> "elasticsearch" : {
> > > > >> "when": "true",
> > > > >> "batchSize": 1000,
> > > > >> "index": "squid"
> > > > >> }
> > > > >> }
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella  >
> > > > wrote:
> > > > >>
> > > > >> > Yeah, I tend to like the first option too. Any opposition to
> that
> > > > from
> > > > >> > anyone?
> > > > >> >
> > > > >> > The points brought up are good ones and I think that it may be
> > worth
> > > a
> > > > >> > broader discussion of the requirements of indexing in a separate
> > dev
> > > > list
> > > > >> > thread. Maybe a list of desires with coherent use-cases
> justifying
> > > > them
> > > > >>

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella

I am suggesting that, yes.  The configs are essentially the same as yours,
except there is an override specified at the top level.  Without that, in
order to specify both HDFS and ES have batch sizes of 100, you have to
explicitly configure each.  It's less that I'm trying to have backwards
compatibility and more that I'm trying to make the majority case easy: both
writers write everything to a specified index name with a specified batch
size (which is what we have now).  Beyond that, I want to allow for
specifying an override for the config on a writer-by-writer basis for those
who need it.
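
A minimal sketch of the lookup order I have in mind, assuming the fallbacks
from the base case (index named after the sensor, batchSize of 1). The
function name `effective_config` is hypothetical and this is not the actual
implementation, just the precedence: built-in fallback, then sensor-level
value, then writer-level override.

```python
from typing import Any, Dict


def effective_config(
    sensor: str, sensor_config: Dict[str, Any], writer: str
) -> Dict[str, Any]:
    """Resolve the index and batchSize a given writer would use."""
    # Built-in fallbacks, taken from the base case in the proposal.
    resolved: Dict[str, Any] = {"index": sensor, "batchSize": 1}
    # Sensor-level values apply to every writer.
    for key in ("index", "batchSize"):
        if key in sensor_config:
            resolved[key] = sensor_config[key]
    # Writer-level entries under "writerConfig" win over the sensor level.
    resolved.update(sensor_config.get("writerConfig", {}).get(writer, {}))
    return resolved


# Majority case: one top-level setting covers both writers.
top_level = {"batchSize": 100}

# The same outcome when only writer-level settings exist.
per_writer = {
    "writerConfig": {
        "elasticsearch": {"batchSize": 100},
        "hdfs": {"batchSize": 100},
    }
}

same = all(
    effective_config("squid", top_level, w)
    == effective_config("squid", per_writer, w)
    for w in ("elasticsearch", "hdfs")
)
```

Both configs resolve to the same values; the top-level form just says it
once, which is the majority case I want to keep easy.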

On Fri, Jan 13, 2017 at 10:39 AM, Nick Allen  wrote:

> Are you saying we support all of these variants?  I realize you are trying
> to have some backwards compatibility, but this also makes it harder for a
> user to grok (for me at least).
>
> Personally I like my original example as there are fewer sub-structures,
> like 'writerConfig', which makes the whole thing simpler and easier to
> grok.  But maybe others will think your proposal is just as easy to grok.
>
>
>
> On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella  wrote:
>
> > Ok, so here's what I'm thinking based on the discussion:
> >
> >- Keeping the configs that we have now (batchSize and index) as
> defaults
> >for the unspecified writer-specific case
> >- Adding the config Nick suggested
> >
> > *Base Case*:
> > {
> > }
> >
> >- all writers write all messages
> >- index named the same as the sensor for all writers
> >- batchSize of 1 for all writers
> >
> > *Writer-non-specific case*:
> > {
> >   "index" : "foo"
> >  ,"batchSize" : 100
> > }
> >
> >- All writers write all messages
> >- index is named "foo", different from the sensor for all writers
> >- batchSize is 100 for all writers
> >
> > *Writer-specific case without filters*
> > {
> >   "index" : "foo"
> >  ,"batchSize" : 1
> >  , "writerConfig" :
> >{
> >   "elasticsearch" : {
> >"batchSize" : 100
> >  }
> >}
> > }
> >
> >- All writers write all messages
> >- index is named "foo", different from the sensor for all writers
> >- batchSize is 1 for HDFS and 100 for elasticsearch writers
> >- NOTE: I could override the index name too
> >
> > *Writer-specific case with filters*
> > {
> >   "index" : "foo"
> >  ,"batchSize" : 1
> >  , "writerConfig" :
> >{
> >   "elasticsearch" : {
> >"batchSize" : 100,
> >"when" : "exists(field1)"
> >  },
> >   "hdfs" : {
> >  "when" : "false"
> >   }
> >}
> > }
> >
> >- ES writer writes messages which have field1, HDFS doesn't
> >- index is named "foo", different from the sensor for all writers
> > >- batchSize is 100 for elasticsearch writers
> >
> > Thoughts?
> >
> > On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> > wrote:
> >
> > > For larger installations you need to control what is indexed so you
> don’t
> > > end up with a nasty elastic search situation and so you can mine the
> data
> > > later for reports and training ml models.
> > >
> > > Thanks
> > > Carolyn
> > >
> > >
> > >
> > >
> > > On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
> > >
> > > >OH that's a good idea!
> > > >
> > > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen 
> wrote:
> > > >
> > > >> I like the "Index Filtering" option based on the flexibility that it
> > > >> provides.  Should each output (HDFS, ES, etc) have its own
> > configuration
> > > >> settings?  For example, aren't things like batching handled
> separately
> > > for
> > > >> HDFS versus Elasticsearch?
> > > >>
> > > >> Something along the lines of...
> > > >>
> > > >> {
> > > >>   "hdfs" : {
> > > >> "when": "exists(field1)",
> > > >> "batchSize": 100
> > > >>   },
> > > >>
> > > >>   "elasticsearch" : {
> > > >> "when": "true",
> > > >> "batchSize": 1000,
> > > >> "index": "squid"
> > > >>   }
> > > >> }
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella 
> > > wrote:
> > > >>
> > > >> > Yeah, I tend to like the first option too.  Any opposition to that
> > > from
> > > >> > anyone?
> > > >> >
> > > >> > The points brought up are good ones and I think that it may be
> > worth a
> > > >> > broader discussion of the requirements of indexing in a separate
> dev
> > > list
> > > >> > thread.  Maybe a list of desires with coherent use-cases
> justifying
> > > them
> > > >> so
> > > >> > we can think about how this stuff should work and where the
> natural
> > > >> > extension points should be.  Afterall, we need to toe the line
> > between
> > > >> > engineering and overengineering for features nobody will want.
> > > >> >
> > > >> > I'm not

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Nick Allen

Are you saying we support all of these variants?  I realize you are trying
to have some backwards compatibility, but this also makes it harder for a
user to grok (for me at least).

Personally I like my original example as there are fewer sub-structures,
like 'writerConfig', which makes the whole thing simpler and easier to
grok.  But maybe others will think your proposal is just as easy to grok.



On Fri, Jan 13, 2017 at 10:01 AM, Casey Stella  wrote:

> Ok, so here's what I'm thinking based on the discussion:
>
>- Keeping the configs that we have now (batchSize and index) as defaults
>for the unspecified writer-specific case
>- Adding the config Nick suggested
>
> *Base Case*:
> {
> }
>
>- all writers write all messages
>- index named the same as the sensor for all writers
>- batchSize of 1 for all writers
>
> *Writer-non-specific case*:
> {
>   "index" : "foo"
>  ,"batchSize" : 100
> }
>
>- All writers write all messages
>- index is named "foo", different from the sensor for all writers
>- batchSize is 100 for all writers
>
> *Writer-specific case without filters*
> {
>   "index" : "foo"
>  ,"batchSize" : 1
>  , "writerConfig" :
>{
>   "elasticsearch" : {
>"batchSize" : 100
>  }
>}
> }
>
>- All writers write all messages
>- index is named "foo", different from the sensor for all writers
>- batchSize is 1 for HDFS and 100 for elasticsearch writers
>- NOTE: I could override the index name too
>
> *Writer-specific case with filters*
> {
>   "index" : "foo"
>  ,"batchSize" : 1
>  , "writerConfig" :
>{
>   "elasticsearch" : {
>"batchSize" : 100,
>"when" : "exists(field1)"
>  },
>   "hdfs" : {
>  "when" : "false"
>   }
>}
> }
>
>- ES writer writes messages which have field1, HDFS doesn't
>- index is named "foo", different from the sensor for all writers
>- batchSize is 100 for elasticsearch writers
>
> Thoughts?
>
> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> wrote:
>
> > For larger installations you need to control what is indexed so you don’t
> > end up with a nasty elastic search situation and so you can mine the data
> > later for reports and training ml models.
> >
> > Thanks
> > Carolyn
> >
> >
> >
> >
> > On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
> >
> > >OH that's a good idea!
> > >
> > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen  wrote:
> > >
> > >> I like the "Index Filtering" option based on the flexibility that it
> > >> provides.  Should each output (HDFS, ES, etc) have its own
> configuration
> > >> settings?  For example, aren't things like batching handled separately
> > for
> > >> HDFS versus Elasticsearch?
> > >>
> > >> Something along the lines of...
> > >>
> > >> {
> > >>   "hdfs" : {
> > >> "when": "exists(field1)",
> > >> "batchSize": 100
> > >>   },
> > >>
> > >>   "elasticsearch" : {
> > >> "when": "true",
> > >> "batchSize": 1000,
> > >> "index": "squid"
> > >>   }
> > >> }
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella 
> > wrote:
> > >>
> > >> > Yeah, I tend to like the first option too.  Any opposition to that
> > from
> > >> > anyone?
> > >> >
> > >> > The points brought up are good ones and I think that it may be
> worth a
> > >> > broader discussion of the requirements of indexing in a separate dev
> > list
> > >> > thread.  Maybe a list of desires with coherent use-cases justifying
> > them
> > >> so
> > >> > we can think about how this stuff should work and where the natural
> > >> > extension points should be.  Afterall, we need to toe the line
> between
> > >> > engineering and overengineering for features nobody will want.
> > >> >
> > >> > I'm not sure about the extensions to the standard fields.  I'm torn
> > >> between
> > >> > the notions that we should have no standard fields vs we should
> have a
> > >> > boatload of standard fields (with most of them empty).  I exchange
> > >> > positions fairly regularly on that question. ;)  It may be worth a
> dev
> > >> list
> > >> > discussion to lay out how you imagine an extension of standard
> fields
> > and
> > >> > how it might look as implemented in Metron.
> > >> >
> > >> > Casey
> > >> >
> > >> > Casey
> > >> >
> > >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> > >> > kylerichards...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > I'll second my preference for the first option. I think the
> ability
> > to
> > >> > use
> > >> > > Stellar filters to customize indexing would be a big win.
> > >> > >
> > >> > > I'm glad Matt brought up the point about data lake and CEP. I
> think
> > >> this
> > >> > is
> > >> > > a really important use case

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella

Nick, yep, that's the example I showed.  I'm just suggesting that the "when"
use the multiline JSON trick: a single "when" statement with a couple of
"or"s.  So:
"when" : [ "exists(field1) or"
 , "exists(field2) or"
 , "exists(field3)"
 ]
would resolve to "exists(field1) or exists(field2) or exists(field3)", a
single Stellar statement behind the scenes, because the array is joined with
a space into a single string.
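
The string-or-list handling described here can be sketched as follows.  This
is a minimal illustration, assuming the config has already been parsed from
JSON; the helper name normalize_when is hypothetical and does not come from
Metron.

```python
# Hypothetical sketch: accept "when" as either a single string or a list of
# strings, and collapse a list into one statement by joining with a space.

def normalize_when(when):
    """Return the filter as a single statement string."""
    if isinstance(when, str):
        return when
    return " ".join(when)


multiline = [
    "exists(field1) or",
    "exists(field2) or",
    "exists(field3)",
]

print(normalize_when(multiline))
print(normalize_when("exists(field1)"))
```

The list form exists only for readability in the JSON file; after joining,
the filter is one statement no matter how many lines it was split across.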

On Fri, Jan 13, 2017 at 10:34 AM, Nick Allen  wrote:

> I was thinking there would only be one 'when' for each output.  So if we
> have Elasticsearch and HDFS, you would have only 2 'when's.  Each when
> statement could be as simple or complex as you need.
>
> On Fri, Jan 13, 2017 at 10:08 AM, Otto Fowler 
> wrote:
>
> > How does it look with 50 whens
> >
> >
> > On January 13, 2017 at 10:02:02, Casey Stella (ceste...@gmail.com)
> wrote:
> >
> > Ok, so here's what I'm thinking based on the discussion:
> >
> > - Keeping the configs that we have now (batchSize and index) as defaults
> > for the unspecified writer-specific case
> > - Adding the config Nick suggested
> >
> > *Base Case*:
> > {
> > }
> >
> > - all writers write all messages
> > - index named the same as the sensor for all writers
> > - batchSize of 1 for all writers
> >
> > *Writer-non-specific case*:
> > {
> > "index" : "foo"
> > ,"batchSize" : 100
> > }
> >
> > - All writers write all messages
> > - index is named "foo", different from the sensor for all writers
> > - batchSize is 100 for all writers
> >
> > *Writer-specific case without filters*
> > {
> > "index" : "foo"
> > ,"batchSize" : 1
> > , "writerConfig" :
> > {
> > "elasticsearch" : {
> > "batchSize" : 100
> > }
> > }
> > }
> >
> > - All writers write all messages
> > - index is named "foo", different from the sensor for all writers
> > - batchSize is 1 for HDFS and 100 for elasticsearch writers
> > - NOTE: I could override the index name too
> >
> > *Writer-specific case with filters*
> > {
> > "index" : "foo"
> > ,"batchSize" : 1
> > , "writerConfig" :
> > {
> > "elasticsearch" : {
> > "batchSize" : 100,
> > "when" : "exists(field1)"
> > },
> > "hdfs" : {
> > "when" : "false"
> > }
> > }
> > }
> >
> > - ES writer writes messages which have field1, HDFS doesn't
> > - index is named "foo", different from the sensor for all writers
> > - batchSize is 100 for elasticsearch writers
> >
> > Thoughts?
> >
> > On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> > wrote:
> >
> > > For larger installations you need to control what is indexed so you
> don’t
> > > end up with a nasty elastic search situation and so you can mine the
> data
> > > later for reports and training ml models.
> > >
> > > Thanks
> > > Carolyn
> > >
> > >
> > >
> > >
> > > On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
> > >
> > > >OH that's a good idea!
> > > >
> > > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen 
> wrote:
> > > >
> > > >> I like the "Index Filtering" option based on the flexibility that it
> > > >> provides. Should each output (HDFS, ES, etc) have its own
> > configuration
> > > >> settings? For example, aren't things like batching handled
> separately
> > > for
> > > >> HDFS versus Elasticsearch?
> > > >>
> > > >> Something along the lines of...
> > > >>
> > > >> {
> > > >> "hdfs" : {
> > > >> "when": "exists(field1)",
> > > >> "batchSize": 100
> > > >> },
> > > >>
> > > >> "elasticsearch" : {
> > > >> "when": "true",
> > > >> "batchSize": 1000,
> > > >> "index": "squid"
> > > >> }
> > > >> }
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella 
> > > wrote:
> > > >>
> > > >> > Yeah, I tend to like the first option too. Any opposition to that
> > > from
> > > >> > anyone?
> > > >> >
> > > >> > The points brought up are good ones and I think that it may be
> worth
> > a
> > > >> > broader discussion of the requirements of indexing in a separate
> dev
> > > list
> > > >> > thread. Maybe a list of desires with coherent use-cases justifying
> > > them
> > > >> so
> > > >> > we can think about how this stuff should work and where the
> natural
> > > >> > extension points should be. Afterall, we need to toe the line
> > between
> > > >> > engineering and overengineering for features nobody will want.
> > > >> >
> > > >> > I'm not sure about the extensions to the standard fields. I'm torn
> > > >> between
> > > >> > the notions that we should have no standard fields vs we should
> have
> > a
> > > >> > boatload of standard fields (with most of them empty). I exchange
> > > >> > positions fairly regularly on that question. ;) It may be worth a
> > dev
> > > >> list
> > > >> > discussion to lay out how you imagine an extension of standard
> > fields
> > > and
> > > >> > how it might look as implemented in Metron.
> > > >> >
> > > >> > Casey
> > >

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Nick Allen

I was thinking there would only be one 'when' for each output.  So if we
have Elasticsearch and HDFS, you would have only 2 'when's.  Each when
statement could be as simple or complex as you need.
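
As a rough sketch of "one 'when' per output": each writer carries exactly one
filter, and a message goes to every writer whose filter accepts it.  The
Python callables below merely stand in for Stellar expressions, and the name
writers_for is hypothetical; none of this is Metron code.

```python
# Hypothetical sketch: one filter per writer, however many writers there are.
# Plain Python callables stand in for Stellar statements here.

def writers_for(message, filters):
    """Return the names of the writers whose single filter accepts the message."""
    return [name for name, accepts in filters.items() if accepts(message)]


filters = {
    "elasticsearch": lambda m: "field1" in m,  # stands in for exists(field1)
    "hdfs": lambda m: False,                   # stands in for "false"
}

print(writers_for({"field1": "x"}, filters))
print(writers_for({"other": "y"}, filters))
```

The number of filters grows with the number of writers, not with the number
of conditions; extra conditions go inside a writer's one statement.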

On Fri, Jan 13, 2017 at 10:08 AM, Otto Fowler 
wrote:

> How does it look with 50 whens
>
>
> On January 13, 2017 at 10:02:02, Casey Stella (ceste...@gmail.com) wrote:
>
> Ok, so here's what I'm thinking based on the discussion:
>
> - Keeping the configs that we have now (batchSize and index) as defaults
> for the unspecified writer-specific case
> - Adding the config Nick suggested
>
> *Base Case*:
> {
> }
>
> - all writers write all messages
> - index named the same as the sensor for all writers
> - batchSize of 1 for all writers
>
> *Writer-non-specific case*:
> {
> "index" : "foo"
> ,"batchSize" : 100
> }
>
> - All writers write all messages
> - index is named "foo", different from the sensor for all writers
> - batchSize is 100 for all writers
>
> *Writer-specific case without filters*
> {
> "index" : "foo"
> ,"batchSize" : 1
> , "writerConfig" :
> {
> "elasticsearch" : {
> "batchSize" : 100
> }
> }
> }
>
> - All writers write all messages
> - index is named "foo", different from the sensor for all writers
> - batchSize is 1 for HDFS and 100 for elasticsearch writers
> - NOTE: I could override the index name too
>
> *Writer-specific case with filters*
> {
> "index" : "foo"
> ,"batchSize" : 1
> , "writerConfig" :
> {
> "elasticsearch" : {
> "batchSize" : 100,
> "when" : "exists(field1)"
> },
> "hdfs" : {
> "when" : "false"
> }
> }
> }
>
> - ES writer writes messages which have field1, HDFS doesn't
> - index is named "foo", different from the sensor for all writers
> - batchSize is 100 for elasticsearch writers
>
> Thoughts?
>
> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> wrote:
>
> > For larger installations you need to control what is indexed so you don’t
> > end up with a nasty elastic search situation and so you can mine the data
> > later for reports and training ml models.
> >
> > Thanks
> > Carolyn
> >
> >
> >
> >
> > On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
> >
> > >OH that's a good idea!
> > >
> > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen  wrote:
> > >
> > >> I like the "Index Filtering" option based on the flexibility that it
> > >> provides. Should each output (HDFS, ES, etc) have its own
> configuration
> > >> settings? For example, aren't things like batching handled separately
> > for
> > >> HDFS versus Elasticsearch?
> > >>
> > >> Something along the lines of...
> > >>
> > >> {
> > >> "hdfs" : {
> > >> "when": "exists(field1)",
> > >> "batchSize": 100
> > >> },
> > >>
> > >> "elasticsearch" : {
> > >> "when": "true",
> > >> "batchSize": 1000,
> > >> "index": "squid"
> > >> }
> > >> }
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella 
> > wrote:
> > >>
> > >> > Yeah, I tend to like the first option too. Any opposition to that
> > from
> > >> > anyone?
> > >> >
> > >> > The points brought up are good ones and I think that it may be worth
> a
> > >> > broader discussion of the requirements of indexing in a separate dev
> > list
> > >> > thread. Maybe a list of desires with coherent use-cases justifying
> > them
> > >> so
> > >> > we can think about how this stuff should work and where the natural
> > >> > extension points should be. Afterall, we need to toe the line
> between
> > >> > engineering and overengineering for features nobody will want.
> > >> >
> > >> > I'm not sure about the extensions to the standard fields. I'm torn
> > >> between
> > >> > the notions that we should have no standard fields vs we should have
> a
> > >> > boatload of standard fields (with most of them empty). I exchange
> > >> > positions fairly regularly on that question. ;) It may be worth a
> dev
> > >> list
> > >> > discussion to lay out how you imagine an extension of standard
> fields
> > and
> > >> > how it might look as implemented in Metron.
> > >> >
> > >> > Casey
> > >> >
> > >> > Casey
> > >> >
> > >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> > >> > kylerichards...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > I'll second my preference for the first option. I think the
> ability
> > to
> > >> > use
> > >> > > Stellar filters to customize indexing would be a big win.
> > >> > >
> > >> > > I'm glad Matt brought up the point about data lake and CEP. I
> think
> > >> this
> > >> > is
> > >> > > a really important use case that we need to consider. Take a
> simple
> > >> > > example... If I have data coming in from 3 different firewall
> > vendors
> > >> > and 2
> > >> > > different web proxy/url filtering vendors and I want to be able to
> > >> > analyze
> > >> > > that data set, I need the data to be indexed all together (likely
> in
> > >> > HDFS)
> > >> > > and to have a normalized schema such that IP address, URL, and
> user
> >

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Otto Fowler

We also need to account for the complexity of the statements.


On January 13, 2017 at 10:27:51, Otto Fowler (ottobackwa...@gmail.com)
wrote:

As with most things, we are best off trying something and iterating.  I just
think we should be aware from the beginning (have tests, etc.) of how it
works when there are many filters.


On January 13, 2017 at 10:11:47, Casey Stella (ceste...@gmail.com) wrote:

I imagined one stellar statement and if you wanted an "or" in there, you
could put it there.  I was also planning on doing the JSON trick of
accepting either a string or list of strings to let you do multiline.  e.g.
"when" : [ "exists(field1) or"
 , "exists(field2) or"
 , "exists(field3)"
 ]

Think that's a bad idea?

Casey

On Fri, Jan 13, 2017 at 10:08 AM, Otto Fowler 
wrote:

> How does it look with 50 whens
>
>
> On January 13, 2017 at 10:02:02, Casey Stella (ceste...@gmail.com) wrote:
>
> Ok, so here's what I'm thinking based on the discussion:
>
> - Keeping the configs that we have now (batchSize and index) as defaults
> for the unspecified writer-specific case
> - Adding the config Nick suggested
>
> *Base Case*:
> {
> }
>
> - all writers write all messages
> - index named the same as the sensor for all writers
> - batchSize of 1 for all writers
>
> *Writer-non-specific case*:
> {
> "index" : "foo"
> ,"batchSize" : 100
> }
>
> - All writers write all messages
> - index is named "foo", different from the sensor for all writers
> - batchSize is 100 for all writers
>
> *Writer-specific case without filters*
> {
> "index" : "foo"
> ,"batchSize" : 1
> , "writerConfig" :
> {
> "elasticsearch" : {
> "batchSize" : 100
> }
> }
> }
>
> - All writers write all messages
> - index is named "foo", different from the sensor for all writers
> - batchSize is 1 for HDFS and 100 for elasticsearch writers
> - NOTE: I could override the index name too
>
> *Writer-specific case with filters*
> {
> "index" : "foo"
> ,"batchSize" : 1
> , "writerConfig" :
> {
> "elasticsearch" : {
> "batchSize" : 100,
> "when" : "exists(field1)"
> },
> "hdfs" : {
> "when" : "false"
> }
> }
> }
>
> - ES writer writes messages which have field1, HDFS doesn't
> - index is named "foo", different from the sensor for all writers
> - batchSize is 100 for elasticsearch writers
>
> Thoughts?
>
> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> wrote:
>
> > For larger installations you need to control what is indexed so you don’t
> > end up with a nasty elastic search situation and so you can mine the data
> > later for reports and training ml models.
> >
> > Thanks
> > Carolyn
> >
> >
> >
> >
> > On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
> >
> > >OH that's a good idea!
> > >
> > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen  wrote:
> > >
> > >> I like the "Index Filtering" option based on the flexibility that it
> > >> provides. Should each output (HDFS, ES, etc) have its own
> configuration
> > >> settings? For example, aren't things like batching handled separately
> > for
> > >> HDFS versus Elasticsearch?
> > >>
> > >> Something along the lines of...
> > >>
> > >> {
> > >> "hdfs" : {
> > >> "when": "exists(field1)",
> > >> "batchSize": 100
> > >> },
> > >>
> > >> "elasticsearch" : {
> > >> "when": "true",
> > >> "batchSize": 1000,
> > >> "index": "squid"
> > >> }
> > >> }
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella 
> > wrote:
> > >>
> > >> > Yeah, I tend to like the first option too. Any opposition to that
> > from
> > >> > anyone?
> > >> >
> > >> > The points brought up are good ones and I think that it may be
> worth a
> > >> > broader discussion of the requirements of indexing in a separate dev
> > list
> > >> > thread. Maybe a list of desires with coherent use-cases justifying
> > them
> > >> so
> > >> > we can think about how this stuff should work and where the natural
> > >> > extension points should be. Afterall, we need to toe the line
> between
> > >> > engineering and overengineering for features nobody will want.
> > >> >
> > >> > I'm not sure about the extensions to the standard fields. I'm torn
> > >> between
> > >> > the notions that we should have no standard fields vs we should
> have a
> > >> > boatload of standard fields (with most of them empty). I exchange
> > >> > positions fairly regularly on that question. ;) It may be worth a
> dev
> > >> list
> > >> > discussion to lay out how you imagine an extension of standard
> fields
> > and
> > >> > how it might look as implemented in Metron.
> > >> >
> > >> > Casey
> > >> >
> > >> > Casey
> > >> >
> > >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> > >> > kylerichards...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > I'll second my preference for the first option. I think the
> ability
> > to
> > >> > use
> > >> > > Stellar filters to customize indexing would be a big win.
> > >> > >
> > >> > >

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Otto Fowler

As with most things, we are best off trying something and iterating.  I just
think we should be aware from the beginning (have tests, etc.) of how it
works when there are many filters.


On January 13, 2017 at 10:11:47, Casey Stella (ceste...@gmail.com) wrote:

I imagined one stellar statement and if you wanted an "or" in there, you
could put it there.  I was also planning on doing the JSON trick of
accepting either a string or list of strings to let you do multiline.  e.g.
"when" : [ "exists(field1) or"
 , "exists(field2) or"
 , "exists(field3)"
 ]

Think that's a bad idea?

Casey

On Fri, Jan 13, 2017 at 10:08 AM, Otto Fowler 
wrote:

> How does it look with 50 whens
>
>
> On January 13, 2017 at 10:02:02, Casey Stella (ceste...@gmail.com) wrote:
>
> Ok, so here's what I'm thinking based on the discussion:
>
> - Keeping the configs that we have now (batchSize and index) as defaults
> for the unspecified writer-specific case
> - Adding the config Nick suggested
>
> *Base Case*:
> {
> }
>
> - all writers write all messages
> - index named the same as the sensor for all writers
> - batchSize of 1 for all writers
>
> *Writer-non-specific case*:
> {
> "index" : "foo"
> ,"batchSize" : 100
> }
>
> - All writers write all messages
> - index is named "foo", different from the sensor for all writers
> - batchSize is 100 for all writers
>
> *Writer-specific case without filters*
> {
> "index" : "foo"
> ,"batchSize" : 1
> , "writerConfig" :
> {
> "elasticsearch" : {
> "batchSize" : 100
> }
> }
> }
>
> - All writers write all messages
> - index is named "foo", different from the sensor for all writers
> - batchSize is 1 for HDFS and 100 for elasticsearch writers
> - NOTE: I could override the index name too
>
> *Writer-specific case with filters*
> {
> "index" : "foo"
> ,"batchSize" : 1
> , "writerConfig" :
> {
> "elasticsearch" : {
> "batchSize" : 100,
> "when" : "exists(field1)"
> },
> "hdfs" : {
> "when" : "false"
> }
> }
> }
>
> - ES writer writes messages which have field1, HDFS doesn't
> - index is named "foo", different from the sensor for all writers
> - batchSize is 100 for elasticsearch writers
>
> Thoughts?
>
> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> wrote:
>
> > For larger installations you need to control what is indexed so you don’t
> > end up with a nasty elastic search situation and so you can mine the data
> > later for reports and training ml models.
> >
> > Thanks
> > Carolyn
> >
> >
> >
> >
> > On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
> >
> > >OH that's a good idea!
> > >
> > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen  wrote:
> > >
> > >> I like the "Index Filtering" option based on the flexibility that it
> > >> provides. Should each output (HDFS, ES, etc) have its own
> configuration
> > >> settings? For example, aren't things like batching handled separately
> > for
> > >> HDFS versus Elasticsearch?
> > >>
> > >> Something along the lines of...
> > >>
> > >> {
> > >> "hdfs" : {
> > >> "when": "exists(field1)",
> > >> "batchSize": 100
> > >> },
> > >>
> > >> "elasticsearch" : {
> > >> "when": "true",
> > >> "batchSize": 1000,
> > >> "index": "squid"
> > >> }
> > >> }
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella 
> > wrote:
> > >>
> > >> > Yeah, I tend to like the first option too. Any opposition to that
> > from
> > >> > anyone?
> > >> >
> > >> > The points brought up are good ones and I think that it may be
> worth a
> > >> > broader discussion of the requirements of indexing in a separate dev
> > list
> > >> > thread. Maybe a list of desires with coherent use-cases justifying
> > them
> > >> so
> > >> > we can think about how this stuff should work and where the natural
> > >> > extension points should be. Afterall, we need to toe the line
> between
> > >> > engineering and overengineering for features nobody will want.
> > >> >
> > >> > I'm not sure about the extensions to the standard fields. I'm torn
> > >> between
> > >> > the notions that we should have no standard fields vs we should
> have a
> > >> > boatload of standard fields (with most of them empty). I exchange
> > >> > positions fairly regularly on that question. ;) It may be worth a
> dev
> > >> list
> > >> > discussion to lay out how you imagine an extension of standard
> fields
> > and
> > >> > how it might look as implemented in Metron.
> > >> >
> > >> > Casey
> > >> >
> > >> > Casey
> > >> >
> > >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> > >> > kylerichards...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > I'll second my preference for the first option. I think the
> ability
> > to
> > >> > use
> > >> > > Stellar filters to customize indexing would be a big win.
> > >> > >
> > >> > > I'm glad Matt brought up the point about data lake and CEP. I
> think
> > >> this
> > >> > is
> > >> > > a really important use case that we

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella

I imagined one stellar statement and if you wanted an "or" in there, you
could put it there.  I was also planning on doing the JSON trick of
accepting either a string or list of strings to let you do multiline.  e.g.
"when" : [ "exists(field1) or"
 , "exists(field2) or"
 , "exists(field3)"
 ]

Think that's a bad idea?

Casey

On Fri, Jan 13, 2017 at 10:08 AM, Otto Fowler 
wrote:

> How does it look with 50 whens
>
>
> On January 13, 2017 at 10:02:02, Casey Stella (ceste...@gmail.com) wrote:
>
> Ok, so here's what I'm thinking based on the discussion:
>
> - Keeping the configs that we have now (batchSize and index) as defaults
> for the unspecified writer-specific case
> - Adding the config Nick suggested
>
> *Base Case*:
> {
> }
>
> - all writers write all messages
> - index named the same as the sensor for all writers
> - batchSize of 1 for all writers
>
> *Writer-non-specific case*:
> {
> "index" : "foo"
> ,"batchSize" : 100
> }
>
> - All writers write all messages
> - index is named "foo", different from the sensor for all writers
> - batchSize is 100 for all writers
>
> *Writer-specific case without filters*
> {
> "index" : "foo"
> ,"batchSize" : 1
> , "writerConfig" :
> {
> "elasticsearch" : {
> "batchSize" : 100
> }
> }
> }
>
> - All writers write all messages
> - index is named "foo", different from the sensor for all writers
> - batchSize is 1 for HDFS and 100 for elasticsearch writers
> - NOTE: I could override the index name too
>
> *Writer-specific case with filters*
> {
> "index" : "foo"
> ,"batchSize" : 1
> , "writerConfig" :
> {
> "elasticsearch" : {
> "batchSize" : 100,
> "when" : "exists(field1)"
> },
> "hdfs" : {
> "when" : "false"
> }
> }
> }
>
> - ES writer writes messages which have field1, HDFS doesn't
> - index is named "foo", different from the sensor for all writers
> - batchSize is 100 for elasticsearch writers
>
> Thoughts?
>
> On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby 
> wrote:
>
> > For larger installations you need to control what is indexed so you
> don’t
> > end up with a nasty elastic search situation and so you can mine the
> data
> > later for reports and training ml models.
> >
> > Thanks
> > Carolyn
> >
> >
> >
> >
> > On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
> >
> > >OH that's a good idea!
> > >
> > >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen 
> wrote:
> > >
> > >> I like the "Index Filtering" option based on the flexibility that it
> > >> provides. Should each output (HDFS, ES, etc) have its own
> configuration
> > >> settings? For example, aren't things like batching handled separately
> > for
> > >> HDFS versus Elasticsearch?
> > >>
> > >> Something along the lines of...
> > >>
> > >> {
> > >> "hdfs" : {
> > >> "when": "exists(field1)",
> > >> "batchSize": 100
> > >> },
> > >>
> > >> "elasticsearch" : {
> > >> "when": "true",
> > >> "batchSize": 1000,
> > >> "index": "squid"
> > >> }
> > >> }
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella 
> > wrote:
> > >>
> > >> > Yeah, I tend to like the first option too. Any opposition to that
> > from
> > >> > anyone?
> > >> >
> > >> > The points brought up are good ones and I think that it may be
> worth a
> > >> > broader discussion of the requirements of indexing in a separate
> dev
> > list
> > >> > thread. Maybe a list of desires with coherent use-cases justifying
> > them
> > >> so
> > >> > we can think about how this stuff should work and where the natural
> > >> > extension points should be. Afterall, we need to toe the line
> between
> > >> > engineering and overengineering for features nobody will want.
> > >> >
> > >> > I'm not sure about the extensions to the standard fields. I'm torn
> > >> between
> > >> > the notions that we should have no standard fields vs we should
> have a
> > >> > boatload of standard fields (with most of them empty). I exchange
> > >> > positions fairly regularly on that question. ;) It may be worth a
> dev
> > >> list
> > >> > discussion to lay out how you imagine an extension of standard
> fields
> > and
> > >> > how it might look as implemented in Metron.
> > >> >
> > >> > Casey
> > >> >
> > >> > Casey
> > >> >
> > >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> > >> > kylerichards...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > I'll second my preference for the first option. I think the
> ability
> > to
> > >> > use
> > >> > > Stellar filters to customize indexing would be a big win.
> > >> > >
> > >> > > I'm glad Matt brought up the point about data lake and CEP. I
> think
> > >> this
> > >> > is
> > >> > > a really important use case that we need to consider. Take a
> simple
> > >> > > example... If I have data coming in from 3 different firewall
> > vendors
> > >> > and 2
> > >> > > different web proxy/url filtering vendors and I want to be able
> to
> > >> > analyze
> > >> > > that

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella

Ok, so here's what I'm thinking based on the discussion:

   - Keeping the configs that we have now (batchSize and index) as defaults
   for the unspecified writer-specific case
   - Adding the config Nick suggested

*Base Case*:
{
}

   - all writers write all messages
   - index named the same as the sensor for all writers
   - batchSize of 1 for all writers

*Writer-non-specific case*:
{
  "index" : "foo"
 ,"batchSize" : 100
}

   - All writers write all messages
   - index is named "foo", different from the sensor for all writers
   - batchSize is 100 for all writers

*Writer-specific case without filters*
{
  "index" : "foo"
 ,"batchSize" : 1
 ,"writerConfig" :
   {
     "elasticsearch" : {
       "batchSize" : 100
     }
   }
}

   - All writers write all messages
   - index is named "foo", different from the sensor for all writers
   - batchSize is 1 for HDFS and 100 for elasticsearch writers
   - NOTE: I could override the index name too

*Writer-specific case with filters*
{
  "index" : "foo"
 ,"batchSize" : 1
 ,"writerConfig" :
   {
     "elasticsearch" : {
       "batchSize" : 100,
       "when" : "exists(field1)"
     },
     "hdfs" : {
       "when" : "false"
     }
   }
}

   - ES writer writes only messages which have field1; HDFS writes nothing
   - index is named "foo", different from the sensor for all writers
   - batchSize is 100 for elasticsearch writers
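
The four cases above boil down to a three-level precedence: built-in
defaults, then the top-level settings, then the writer-specific block. A
minimal sketch of that resolution follows, written in Python only for
illustration. The function name resolve_writer_config and the DEFAULT_*
constants are hypothetical and not part of Metron; the "when" strings are
carried through unevaluated.

```python
from typing import Any, Dict

# Hypothetical defaults, taken from the "Base Case" described above.
DEFAULT_BATCH_SIZE = 1
DEFAULT_WHEN = "true"  # unspecified filter: write every message


def resolve_writer_config(sensor: str, config: Dict[str, Any], writer: str) -> Dict[str, Any]:
    """Merge defaults, top-level settings, and writer-specific overrides."""
    # Built-in defaults: index named after the sensor, batchSize of 1.
    resolved = {"index": sensor, "batchSize": DEFAULT_BATCH_SIZE, "when": DEFAULT_WHEN}
    # Top-level (writer-non-specific) settings override the defaults.
    for key in ("index", "batchSize"):
        if key in config:
            resolved[key] = config[key]
    # Writer-specific settings override everything else.
    resolved.update(config.get("writerConfig", {}).get(writer, {}))
    return resolved


if __name__ == "__main__":
    config = {
        "index": "foo",
        "batchSize": 1,
        "writerConfig": {
            "elasticsearch": {"batchSize": 100, "when": "exists(field1)"},
            "hdfs": {"when": "false"},
        },
    }
    for writer in ("elasticsearch", "hdfs"):
        print(writer, resolve_writer_config("squid", config, writer))
```

With an empty config this yields the base case (index "squid", batchSize 1,
no filtering) for every writer.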

Thoughts?

On Fri, Jan 13, 2017 at 9:44 AM, Carolyn Duby  wrote:

> For larger installations you need to control what is indexed so you don’t
> end up with a nasty elastic search situation and so you can mine the data
> later for reports and training ml models.
>
> Thanks
> Carolyn
>
>
>
>
> On 1/13/17, 9:40 AM, "Casey Stella"  wrote:
>
> >OH that's a good idea!
> >
> >On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen  wrote:
> >
> >> I like the "Index Filtering" option based on the flexibility that it
> >> provides.  Should each output (HDFS, ES, etc) have its own configuration
> >> settings?  For example, aren't things like batching handled separately
> for
> >> HDFS versus Elasticsearch?
> >>
> >> Something along the lines of...
> >>
> >> {
> >>   "hdfs" : {
> >> "when": "exists(field1)",
> >> "batchSize": 100
> >>   },
> >>
> >>   "elasticsearch" : {
> >> "when": "true",
> >> "batchSize": 1000,
> >> "index": "squid"
> >>   }
> >> }
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella 
> wrote:
> >>
> >> > Yeah, I tend to like the first option too.  Any opposition to that
> from
> >> > anyone?
> >> >
> >> > The points brought up are good ones and I think that it may be worth a
> >> > broader discussion of the requirements of indexing in a separate dev
> list
> >> > thread.  Maybe a list of desires with coherent use-cases justifying
> them
> >> so
> >> > we can think about how this stuff should work and where the natural
> >> > extension points should be.  Afterall, we need to toe the line between
> >> > engineering and overengineering for features nobody will want.
> >> >
> >> > I'm not sure about the extensions to the standard fields.  I'm torn
> >> between
> >> > the notions that we should have no standard fields vs we should have a
> >> > boatload of standard fields (with most of them empty).  I exchange
> >> > positions fairly regularly on that question. ;)  It may be worth a dev
> >> list
> >> > discussion to lay out how you imagine an extension of standard fields
> and
> >> > how it might look as implemented in Metron.
> >> >
> >> > Casey
> >> >
> >> > Casey
> >> >
> >> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> >> > kylerichards...@gmail.com>
> >> > wrote:
> >> >
> >> > > I'll second my preference for the first option. I think the ability
> to
> >> > use
> >> > > Stellar filters to customize indexing would be a big win.
> >> > >
> >> > > I'm glad Matt brought up the point about data lake and CEP. I think
> >> this
> >> > is
> >> > > a really important use case that we need to consider. Take a simple
> >> > > example... If I have data coming in from 3 different firewall
> vendors
> >> > and 2
> >> > > different web proxy/url filtering vendors and I want to be able to
> >> > analyze
> >> > > that data set, I need the data to be indexed all together (likely in
> >> > HDFS)
> >> > > and to have a normalized schema such that IP address, URL, and user
> >> name
> >> > > (to take a few) can be easily queried and aggregated. I can also
> >> envision
> >> > > scenarios where I would want to index data based on attributes other
> >> than
> >> > > sensor, business unit or subsidiary for example.
> >> > >
> >> > > I've been wanted to propose extending our 7 standard fields to
> include
> >> > > things like URL and user. Is there community

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread zeo...@gmail.com

Darn it Nick, you beat me to the punch.  =)  YES, please.  I think I
discussed this a while back in my ES tuning conversations, but that's
_super_ important.  I have this documented here

under Elasticsearch > On Installation > 4.

Also, I'm a huge fan of option one.  Here's how that would pan out in my
environment almost immediately:

I typically don't want to store POST data, so as it currently sits that
means I don't write the details of POSTs at all.  However, it may make sense
for me to collect the POST data off the wire and pass it through Metron;
once it gets through enrichment/threat intel, I store it only if one of the
IPs is noteworthy, and otherwise it gets tossed.  That gives me
a nice mix of privacy/security for my user population but also the
information I need to respond to possible incidents.  This could look like
holding onto POSTs being used to manipulate web shells (known bad sources,
known compromised hosts (as a very short term IR information gathering
procedure)), or users POSTing their creds to a plaintext phishing site.
Happy to port this discussion to a separate thread.
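
As a rough illustration of the POST use case above, the filter could reduce
to a predicate like the one below. This is only a sketch in Python: the
field names "method" and "is_alert" are assumptions made for the example,
and in the proposal under discussion the rule would be a Stellar statement
in the indexing config rather than Python code.

```python
def should_index(message: dict) -> bool:
    """Keep non-POST traffic; keep POSTs only when threat intel flagged them."""
    # "method" and "is_alert" are hypothetical field names for this sketch.
    if message.get("method") != "POST":
        return True
    return message.get("is_alert") == "true"


if __name__ == "__main__":
    print(should_index({"method": "GET"}))
    print(should_index({"method": "POST"}))
    print(should_index({"method": "POST", "is_alert": "true"}))
```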

Regarding fields - I'm for slightly more standardization, without going
overboard.  This could be a long discussion, but in summary my opinion is
that user is a very sane field to add, and URL is slightly less so.  Again,
I'm thinking about this in the context of what data I have going into my
cluster (and that I know others are sending to theirs), which may contrast
with a more general infosec population.

Jon

On Fri, Jan 13, 2017 at 9:40 AM Casey Stella  wrote:

> OH that's a good idea!
>
> On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen  wrote:
>
> > I like the "Index Filtering" option based on the flexibility that it
> > provides.  Should each output (HDFS, ES, etc) have its own configuration
> > settings?  For example, aren't things like batching handled separately
> for
> > HDFS versus Elasticsearch?
> >
> > Something along the lines of...
> >
> > {
> >   "hdfs" : {
> > "when": "exists(field1)",
> > "batchSize": 100
> >   },
> >
> >   "elasticsearch" : {
> > "when": "true",
> > "batchSize": 1000,
> > "index": "squid"
> >   }
> > }
> >
> >
> >
> >
> >
> >
> >
> >
> > On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella 
> wrote:
> >
> > > Yeah, I tend to like the first option too.  Any opposition to that from
> > > anyone?
> > >
> > > The points brought up are good ones and I think that it may be worth a
> > > broader discussion of the requirements of indexing in a separate dev
> list
> > > thread.  Maybe a list of desires with coherent use-cases justifying
> them
> > so
> > > we can think about how this stuff should work and where the natural
> > > extension points should be.  Afterall, we need to toe the line between
> > > engineering and overengineering for features nobody will want.
> > >
> > > I'm not sure about the extensions to the standard fields.  I'm torn
> > between
> > > the notions that we should have no standard fields vs we should have a
> > > boatload of standard fields (with most of them empty).  I exchange
> > > positions fairly regularly on that question. ;)  It may be worth a dev
> > list
> > > discussion to lay out how you imagine an extension of standard fields
> and
> > > how it might look as implemented in Metron.
> > >
> > > Casey
> > >
> > > Casey
> > >
> > > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> > > kylerichards...@gmail.com>
> > > wrote:
> > >
> > > > I'll second my preference for the first option. I think the ability
> to
> > > use
> > > > Stellar filters to customize indexing would be a big win.
> > > >
> > > > I'm glad Matt brought up the point about data lake and CEP. I think
> > this
> > > is
> > > > a really important use case that we need to consider. Take a simple
> > > > example... If I have data coming in from 3 different firewall vendors
> > > and 2
> > > > different web proxy/url filtering vendors and I want to be able to
> > > analyze
> > > > that data set, I need the data to be indexed all together (likely in
> > > HDFS)
> > > > and to have a normalized schema such that IP address, URL, and user
> > name
> > > > (to take a few) can be easily queried and aggregated. I can also
> > envision
> > > > scenarios where I would want to index data based on attributes other
> > than
> > > > sensor, business unit or subsidiary for example.
> > > >
> > > > I've been wanted to propose extending our 7 standard fields to
> include
> > > > things like URL and user. Is there community interest/support for
> > moving
> > > in
> > > > that direction? If so, I'll start a new thread.
> > > >
> > > > Thanks!
> > > >
> > > > -Kyle
> > > >
> > > > On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley 
> wrote:
> > > >
> > > > > Ah, I see.  If overriding the default index name

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Carolyn Duby

For larger installations you need to control what is indexed so you don’t end
up with a nasty Elasticsearch situation and so you can mine the data later for
reports and training ML models.

Thanks
Carolyn




On 1/13/17, 9:40 AM, "Casey Stella"  wrote:

>OH that's a good idea!
>
>On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen  wrote:
>
>> I like the "Index Filtering" option based on the flexibility that it
>> provides.  Should each output (HDFS, ES, etc) have its own configuration
>> settings?  For example, aren't things like batching handled separately for
>> HDFS versus Elasticsearch?
>>
>> Something along the lines of...
>>
>> {
>>   "hdfs" : {
>> "when": "exists(field1)",
>> "batchSize": 100
>>   },
>>
>>   "elasticsearch" : {
>> "when": "true",
>> "batchSize": 1000,
>> "index": "squid"
>>   }
>> }
>>
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella  wrote:
>>
>> > Yeah, I tend to like the first option too.  Any opposition to that from
>> > anyone?
>> >
>> > The points brought up are good ones and I think that it may be worth a
>> > broader discussion of the requirements of indexing in a separate dev list
>> > thread.  Maybe a list of desires with coherent use-cases justifying them
>> so
>> > we can think about how this stuff should work and where the natural
>> > extension points should be.  Afterall, we need to toe the line between
>> > engineering and overengineering for features nobody will want.
>> >
>> > I'm not sure about the extensions to the standard fields.  I'm torn
>> between
>> > the notions that we should have no standard fields vs we should have a
>> > boatload of standard fields (with most of them empty).  I exchange
>> > positions fairly regularly on that question. ;)  It may be worth a dev
>> list
>> > discussion to lay out how you imagine an extension of standard fields and
>> > how it might look as implemented in Metron.
>> >
>> > Casey
>> >
>> > Casey
>> >
>> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
>> > kylerichards...@gmail.com>
>> > wrote:
>> >
>> > > I'll second my preference for the first option. I think the ability to
>> > use
>> > > Stellar filters to customize indexing would be a big win.
>> > >
>> > > I'm glad Matt brought up the point about data lake and CEP. I think
>> this
>> > is
>> > > a really important use case that we need to consider. Take a simple
>> > > example... If I have data coming in from 3 different firewall vendors
>> > and 2
>> > > different web proxy/url filtering vendors and I want to be able to
>> > analyze
>> > > that data set, I need the data to be indexed all together (likely in
>> > HDFS)
>> > > and to have a normalized schema such that IP address, URL, and user
>> name
>> > > (to take a few) can be easily queried and aggregated. I can also
>> envision
>> > > scenarios where I would want to index data based on attributes other
>> than
>> > > sensor, business unit or subsidiary for example.
>> > >
>> > > I've been wanted to propose extending our 7 standard fields to include
>> > > things like URL and user. Is there community interest/support for
>> moving
>> > in
>> > > that direction? If so, I'll start a new thread.
>> > >
>> > > Thanks!
>> > >
>> > > -Kyle
>> > >
>> > > On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley  wrote:
>> > >
>> > > > Ah, I see.  If overriding the default index name allows using the
>> same
>> > > > name for multiple sensors, then the goal can be achieved.
>> > > > Thanks,
>> > > > --Matt
>> > > >
>> > > >
>> > > > On 1/12/17, 3:30 PM, "Casey Stella"  wrote:
>> > > >
>> > > > Oh, you could!  Let's say you have a syslog parser with data from
>> > > > sources 1
>> > > > 2 and 3.  You'd end up with one kafka queue with 3 parsers
>> attached
>> > > to
>> > > > that
>> > > > queue, each picking part the messages from source 1, 2 and 3.
>> > They'd
>> > > > go
>> > > > through separate enrichment and into the indexing topology.  In
>> the
>> > > > indexing topology, you could specify the same index name "syslog"
>> > and
>> > > > all
>> > > > of the messages go into the same index for CEP querying if so
>> > > desired.
>> > > >
>> > > > On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley 
>> > > wrote:
>> > > >
>> > > > > Syslog is hell on parsers – I know, I worked at LogLogic in a
>> > > > previous
>> > > > > life.  It makes perfect sense to route different lines from
>> > syslog
>> > > > through
>> > > > > different appropriate parsers.  But a lot of what the parsers
>> do
>> > is
>> > > > > identify consistent subsets of metadata and annotate it – eg,
>> > > > src_ip_addr,
>> > > > > event timestamps, etc.  Once those metadata are annotated and
>> > > > available
>> > > > > with common field names, why doesn’t it make sense to index the
>> > > > messages
>> > > > > together, for CEP querying?  I think Splunk has

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Otto Fowler

I prefer option 1 with Stellar, although I’m concerned that in a real-world
scenario the number of filters and rules might be large, so the structure of
the rule expressions will need some thought for maintainability, etc.


On January 12, 2017 at 15:52:03, Casey Stella (ceste...@gmail.com) wrote:

As of METRON-652, we
will have decoupled the indexing configuration from the enrichment
configuration. As an immediate follow-up to that, I'd like to provide the
ability to turn off and on writers via the configs. I'd like to get some
community feedback on how the functionality should work, if y'all are
amenable. :)


As of now, we have 3 possible writers which can be used in the indexing
topology:

- Solr
- Elasticsearch
- HDFS

HDFS is always used; Elasticsearch or Solr is used depending on how you
start the indexing topology.

A couple of proposals come to mind immediately:

*Index Filtering*

You would be able to specify a filter as defined by a Stellar statement
(likely a reuse of the StellarFilter that exists in the Parsers) which
would allow you to indicate on a message-by-message basis whether or not to
write the message.

The semantics of this would be as follows:

- Default (i.e. unspecified) is to pass everything through (hence
backwards compatible with the current default config).
- Messages for which the associated Stellar statement evaluates to true
for the writer type will be written; all others will not.


Sample indexing config which would write out no messages to HDFS and write
out to Elasticsearch only messages containing a field called "field1":
{
  "index" : "squid"
 ,"batchSize" : 100
 ,"filters" : {
    "HDFS" : "false"
   ,"ES" : "exists(field1)"
  }
}
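
The filtering semantics can be sketched as follows. This is an illustration
only, in Python: the lambdas stand in for the Stellar statements "false" and
"exists(field1)", which a real implementation would evaluate with Stellar,
and writers_for is a hypothetical helper name.

```python
from typing import Callable, Dict, List

Message = Dict[str, object]
Predicate = Callable[[Message], bool]

# Stand-ins for the Stellar statements "false" and "exists(field1)".
FILTERS: Dict[str, Predicate] = {
    "HDFS": lambda msg: False,
    "ES": lambda msg: "field1" in msg,
}


def writers_for(message: Message, writers: List[str], filters: Dict[str, Predicate]) -> List[str]:
    """Return the writers that should receive this message."""
    # A writer with no filter passes everything through, which keeps the
    # behavior backwards compatible with the current default config.
    return [w for w in writers if filters.get(w, lambda m: True)(message)]


if __name__ == "__main__":
    print(writers_for({"field1": "x"}, ["HDFS", "ES"], FILTERS))
    print(writers_for({"other": "y"}, ["HDFS", "ES"], FILTERS))
```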

*Index On/Off Switch*

A simpler solution would be to just provide a list of writers to write
messages. The semantics would be as follows:

- If the list is unspecified, then the default is to write all messages
for every writer in the indexing topology
- If the list is specified, then a writer will write all messages if and
only if it is named in the list.

Sample indexing config which turns off HDFS and keeps on Elasticsearch:
{
  "index" : "squid"
 ,"batchSize" : 100
 ,"writers" : [ "ES" ]
}
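
For comparison, the on/off semantics amount to a simple membership check. A
minimal sketch in Python, where active_writers is a hypothetical helper name
and "available" stands for whatever writers the indexing topology was
started with:

```python
from typing import Any, Dict, List


def active_writers(available: List[str], config: Dict[str, Any]) -> List[str]:
    """Apply the optional "writers" list to the writers in the topology."""
    listed = config.get("writers")
    if listed is None:
        # List unspecified: every writer writes all messages.
        return list(available)
    # List specified: a writer writes if and only if it is named in the list.
    return [w for w in available if w in listed]


if __name__ == "__main__":
    config = {"index": "squid", "batchSize": 100, "writers": ["ES"]}
    print(active_writers(["HDFS", "ES"], config))
    print(active_writers(["HDFS", "ES"], {}))
```

Note that under this reading an explicitly empty list turns every writer
off, which differs from leaving the list out.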

Thanks in advance for the feedback! Also, if you have any other, better
ideas than the ones presented here, let me know too.

Best,

Casey

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella

OH that's a good idea!

On Fri, Jan 13, 2017 at 9:39 AM, Nick Allen  wrote:

> I like the "Index Filtering" option based on the flexibility that it
> provides.  Should each output (HDFS, ES, etc) have its own configuration
> settings?  For example, aren't things like batching handled separately for
> HDFS versus Elasticsearch?
>
> Something along the lines of...
>
> {
>   "hdfs" : {
> "when": "exists(field1)",
> "batchSize": 100
>   },
>
>   "elasticsearch" : {
> "when": "true",
> "batchSize": 1000,
> "index": "squid"
>   }
> }
>
>
>
>
>
>
>
>
> On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella  wrote:
>
> > Yeah, I tend to like the first option too.  Any opposition to that from
> > anyone?
> >
> > The points brought up are good ones and I think that it may be worth a
> > broader discussion of the requirements of indexing in a separate dev list
> > thread.  Maybe a list of desires with coherent use-cases justifying them
> so
> > we can think about how this stuff should work and where the natural
> > extension points should be.  Afterall, we need to toe the line between
> > engineering and overengineering for features nobody will want.
> >
> > I'm not sure about the extensions to the standard fields.  I'm torn
> between
> > the notions that we should have no standard fields vs we should have a
> > boatload of standard fields (with most of them empty).  I exchange
> > positions fairly regularly on that question. ;)  It may be worth a dev
> list
> > discussion to lay out how you imagine an extension of standard fields and
> > how it might look as implemented in Metron.
> >
> > Casey
> >
> > Casey
> >
> > On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> > kylerichards...@gmail.com>
> > wrote:
> >
> > > I'll second my preference for the first option. I think the ability to
> > use
> > > Stellar filters to customize indexing would be a big win.
> > >
> > > I'm glad Matt brought up the point about data lake and CEP. I think
> this
> > is
> > > a really important use case that we need to consider. Take a simple
> > > example... If I have data coming in from 3 different firewall vendors
> > and 2
> > > different web proxy/url filtering vendors and I want to be able to
> > analyze
> > > that data set, I need the data to be indexed all together (likely in
> > HDFS)
> > > and to have a normalized schema such that IP address, URL, and user
> name
> > > (to take a few) can be easily queried and aggregated. I can also
> envision
> > > scenarios where I would want to index data based on attributes other
> than
> > > sensor, business unit or subsidiary for example.
> > >
> > > I've been wanted to propose extending our 7 standard fields to include
> > > things like URL and user. Is there community interest/support for
> moving
> > in
> > > that direction? If so, I'll start a new thread.
> > >
> > > Thanks!
> > >
> > > -Kyle
> > >
> > > On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley  wrote:
> > >
> > > > Ah, I see.  If overriding the default index name allows using the
> same
> > > > name for multiple sensors, then the goal can be achieved.
> > > > Thanks,
> > > > --Matt
> > > >
> > > >
> > > > On 1/12/17, 3:30 PM, "Casey Stella"  wrote:
> > > >
> > > > Oh, you could!  Let's say you have a syslog parser with data from
> > > > sources 1
> > > > 2 and 3.  You'd end up with one kafka queue with 3 parsers
> attached
> > > to
> > > > that
> > > > queue, each picking part the messages from source 1, 2 and 3.
> > They'd
> > > > go
> > > > through separate enrichment and into the indexing topology.  In
> the
> > > > indexing topology, you could specify the same index name "syslog"
> > and
> > > > all
> > > > of the messages go into the same index for CEP querying if so
> > > desired.
> > > >
> > > > On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley 
> > > wrote:
> > > >
> > > > > Syslog is hell on parsers – I know, I worked at LogLogic in a
> > > > previous
> > > > > life.  It makes perfect sense to route different lines from
> > syslog
> > > > through
> > > > > different appropriate parsers.  But a lot of what the parsers
> do
> > is
> > > > > identify consistent subsets of metadata and annotate it – eg,
> > > > src_ip_addr,
> > > > > event timestamps, etc.  Once those metadata are annotated and
> > > > available
> > > > > with common field names, why doesn’t it make sense to index the
> > > > messages
> > > > > together, for CEP querying?  I think Splunk has illustrated
> this
> > > > model.
> > > > >
> > > > > On 1/12/17, 3:00 PM, "Casey Stella" 
> wrote:
> > > > >
> > > > > yeah, I mean, honestly, I think the approach that we've
> taken
> > > for
> > > > > sources
> > > > > which aggregate different types of data is to provide
> filters
> > > at
> > > > the
> > > > > parser
> > > > > level and have

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Nick Allen

I like the "Index Filtering" option based on the flexibility that it
provides.  Should each output (HDFS, ES, etc) have its own configuration
settings?  For example, aren't things like batching handled separately for
HDFS versus Elasticsearch?

Something along the lines of...

{
  "hdfs" : {
    "when": "exists(field1)",
    "batchSize": 100
  },

  "elasticsearch" : {
    "when": "true",
    "batchSize": 1000,
    "index": "squid"
  }
}

On Fri, Jan 13, 2017 at 9:10 AM, Casey Stella  wrote:

> Yeah, I tend to like the first option too.  Any opposition to that from
> anyone?
>
> The points brought up are good ones and I think that it may be worth a
> broader discussion of the requirements of indexing in a separate dev list
> thread.  Maybe a list of desires with coherent use-cases justifying them so
> we can think about how this stuff should work and where the natural
> extension points should be.  Afterall, we need to toe the line between
> engineering and overengineering for features nobody will want.
>
> I'm not sure about the extensions to the standard fields.  I'm torn between
> the notions that we should have no standard fields vs we should have a
> boatload of standard fields (with most of them empty).  I exchange
> positions fairly regularly on that question. ;)  It may be worth a dev list
> discussion to lay out how you imagine an extension of standard fields and
> how it might look as implemented in Metron.
>
> Casey
>
> Casey
>
> On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson <
> kylerichards...@gmail.com>
> wrote:
>
> > I'll second my preference for the first option. I think the ability to
> use
> > Stellar filters to customize indexing would be a big win.
> >
> > I'm glad Matt brought up the point about data lake and CEP. I think this
> is
> > a really important use case that we need to consider. Take a simple
> > example... If I have data coming in from 3 different firewall vendors
> and 2
> > different web proxy/url filtering vendors and I want to be able to
> analyze
> > that data set, I need the data to be indexed all together (likely in
> HDFS)
> > and to have a normalized schema such that IP address, URL, and user name
> > (to take a few) can be easily queried and aggregated. I can also envision
> > scenarios where I would want to index data based on attributes other than
> > sensor, business unit or subsidiary for example.
> >
> > I've been wanted to propose extending our 7 standard fields to include
> > things like URL and user. Is there community interest/support for moving
> in
> > that direction? If so, I'll start a new thread.
> >
> > Thanks!
> >
> > -Kyle
> >
> > On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley  wrote:
> >
> > > Ah, I see.  If overriding the default index name allows using the same
> > > name for multiple sensors, then the goal can be achieved.
> > > Thanks,
> > > --Matt
> > >
> > >
> > > On 1/12/17, 3:30 PM, "Casey Stella"  wrote:
> > >
> > > Oh, you could!  Let's say you have a syslog parser with data from
> > > sources 1
> > > 2 and 3.  You'd end up with one kafka queue with 3 parsers attached
> > to
> > > that
> > > queue, each picking part the messages from source 1, 2 and 3.
> They'd
> > > go
> > > through separate enrichment and into the indexing topology.  In the
> > > indexing topology, you could specify the same index name "syslog"
> and
> > > all
> > > of the messages go into the same index for CEP querying if so
> > desired.
> > >
> > > On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley 
> > wrote:
> > >
> > > > Syslog is hell on parsers – I know, I worked at LogLogic in a
> > > previous
> > > > life.  It makes perfect sense to route different lines from
> syslog
> > > through
> > > > different appropriate parsers.  But a lot of what the parsers do
> is
> > > > identify consistent subsets of metadata and annotate it – eg,
> > > src_ip_addr,
> > > > event timestamps, etc.  Once those metadata are annotated and
> > > available
> > > > with common field names, why doesn’t it make sense to index the
> > > messages
> > > > together, for CEP querying?  I think Splunk has illustrated this
> > > model.
> > > >
> > > > On 1/12/17, 3:00 PM, "Casey Stella"  wrote:
> > > >
> > > > yeah, I mean, honestly, I think the approach that we've taken
> > for
> > > > sources
> > > > which aggregate different types of data is to provide filters
> > at
> > > the
> > > > parser
> > > > level and have multiple parser topologies (with different,
> > > possibly
> > > > mutually exclusive filters) running.  This would be a
> > completely
> > > > separate
> > > > sensor.  Imagine a syslog data source that aggregates and you
> > > want to
> > > > pick
> > > > apart certain pieces of messages.  This is why the initial
> > > thought and
> > > >

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-13 Thread Casey Stella

Yeah, I tend to like the first option too.  Any opposition to that from
anyone?

The points brought up are good ones and I think that it may be worth a
broader discussion of the requirements of indexing in a separate dev list
thread.  Maybe a list of desires with coherent use-cases justifying them so
we can think about how this stuff should work and where the natural
extension points should be.  After all, we need to toe the line between
engineering and overengineering for features nobody will want.

I'm not sure about the extensions to the standard fields.  I'm torn between
the notions that we should have no standard fields vs we should have a
boatload of standard fields (with most of them empty).  I exchange
positions fairly regularly on that question. ;)  It may be worth a dev list
discussion to lay out how you imagine an extension of standard fields and
how it might look as implemented in Metron.

Casey

On Thu, Jan 12, 2017 at 9:58 PM, Kyle Richardson 
wrote:

> I'll second my preference for the first option. I think the ability to use
> Stellar filters to customize indexing would be a big win.
>
> I'm glad Matt brought up the point about data lake and CEP. I think this is
> a really important use case that we need to consider. Take a simple
> example... If I have data coming in from 3 different firewall vendors and 2
> different web proxy/url filtering vendors and I want to be able to analyze
> that data set, I need the data to be indexed all together (likely in HDFS)
> and to have a normalized schema such that IP address, URL, and user name
> (to take a few) can be easily queried and aggregated. I can also envision
> scenarios where I would want to index data based on attributes other than
> sensor, business unit or subsidiary for example.
>
> I've been wanted to propose extending our 7 standard fields to include
> things like URL and user. Is there community interest/support for moving in
> that direction? If so, I'll start a new thread.
>
> Thanks!
>
> -Kyle
>
> On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley  wrote:
>
> > Ah, I see.  If overriding the default index name allows using the same
> > name for multiple sensors, then the goal can be achieved.
> > Thanks,
> > --Matt
> >
> >
> > On 1/12/17, 3:30 PM, "Casey Stella"  wrote:
> >
> > Oh, you could!  Let's say you have a syslog parser with data from
> > sources 1
> > 2 and 3.  You'd end up with one kafka queue with 3 parsers attached
> to
> > that
> > queue, each picking part the messages from source 1, 2 and 3.  They'd
> > go
> > through separate enrichment and into the indexing topology.  In the
> > indexing topology, you could specify the same index name "syslog" and
> > all
> > of the messages go into the same index for CEP querying if so
> desired.
> >
> > On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley 
> wrote:
> >
> > > Syslog is hell on parsers – I know, I worked at LogLogic in a
> > previous
> > > life.  It makes perfect sense to route different lines from syslog
> > through
> > > different appropriate parsers.  But a lot of what the parsers do is
> > > identify consistent subsets of metadata and annotate it – eg,
> > src_ip_addr,
> > > event timestamps, etc.  Once those metadata are annotated and
> > available
> > > with common field names, why doesn’t it make sense to index the
> > messages
> > > together, for CEP querying?  I think Splunk has illustrated this
> > model.
> > >
> > > On 1/12/17, 3:00 PM, "Casey Stella"  wrote:
> > >
> > > yeah, I mean, honestly, I think the approach that we've taken
> for
> > > sources
> > > which aggregate different types of data is to provide filters
> at
> > the
> > > parser
> > > level and have multiple parser topologies (with different,
> > possibly
> > > mutually exclusive filters) running.  This would be a
> completely
> > > separate
> > > sensor.  Imagine a syslog data source that aggregates and you
> > want to
> > > pick
> > > apart certain pieces of messages.  This is why the initial
> > thought and
> > > architecture was one index per sensor.
> > >
> > > On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley 
> > wrote:
> > >
> > > > I’m thinking that CEP (Complex Event Processing) is contrary
> > to the
> > > idea
> > > > of silo-ing data per sensor.
> > > > Now it’s true that some of those sensors are already
> > aggregating
> > > data from
> > > > multiple sources, so maybe I’m wrong here.
> > > > But it just seems to me that the “data lake” insights come
> from
> > > being able
> > > > to make decisions over the whole mass of data rather than
> just
> > > vertical
> > > > slices of it.
> > > >
> > > > On 1/12/17, 2:15 PM, "Casey Stella"

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-12 Thread Kyle Richardson

I'll second my preference for the first option. I think the ability to use
Stellar filters to customize indexing would be a big win.

I'm glad Matt brought up the point about data lake and CEP. I think this is
a really important use case that we need to consider. Take a simple
example... If I have data coming in from 3 different firewall vendors and 2
different web proxy/url filtering vendors and I want to be able to analyze
that data set, I need the data to be indexed all together (likely in HDFS)
and to have a normalized schema such that IP address, URL, and user name
(to take a few) can be easily queried and aggregated. I can also envision
scenarios where I would want to index data based on attributes other than
sensor, such as business unit or subsidiary.

I've been wanting to propose extending our 7 standard fields to include
things like URL and user. Is there community interest/support for moving in
that direction? If so, I'll start a new thread.

Thanks!

-Kyle

On Thu, Jan 12, 2017 at 6:51 PM, Matt Foley  wrote:

> Ah, I see.  If overriding the default index name allows using the same
> name for multiple sensors, then the goal can be achieved.
> Thanks,
> --Matt
>
>
> On 1/12/17, 3:30 PM, "Casey Stella"  wrote:
>
> Oh, you could!  Let's say you have a syslog parser with data from sources 1,
> 2, and 3.  You'd end up with one Kafka queue with 3 parsers attached to that
> queue, each picking apart the messages from source 1, 2, and 3.  They'd go
> through separate enrichment and into the indexing topology.  In the
> indexing topology, you could specify the same index name "syslog" and all
> of the messages go into the same index for CEP querying if so desired.
>
> On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley  wrote:
>
> > Syslog is hell on parsers – I know, I worked at LogLogic in a previous
> > life.  It makes perfect sense to route different lines from syslog through
> > different appropriate parsers.  But a lot of what the parsers do is
> > identify consistent subsets of metadata and annotate it – eg, src_ip_addr,
> > event timestamps, etc.  Once those metadata are annotated and available
> > with common field names, why doesn’t it make sense to index the messages
> > together, for CEP querying?  I think Splunk has illustrated this model.
> >
> > On 1/12/17, 3:00 PM, "Casey Stella"  wrote:
> >
> > yeah, I mean, honestly, I think the approach that we've taken for
> > sources
> > which aggregate different types of data is to provide filters at
> the
> > parser
> > level and have multiple parser topologies (with different,
> possibly
> > mutually exclusive filters) running.  This would be a completely
> > separate
> > sensor.  Imagine a syslog data source that aggregates and you
> want to
> > pick
> > apart certain pieces of messages.  This is why the initial
> thought and
> > architecture was one index per sensor.
> >
> > On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley 
> wrote:
> >
> > > I’m thinking that CEP (Complex Event Processing) is contrary
> to the
> > idea
> > > of silo-ing data per sensor.
> > > Now it’s true that some of those sensors are already
> aggregating
> > data from
> > > multiple sources, so maybe I’m wrong here.
> > > But it just seems to me that the “data lake” insights come from
> > being able
> > > to make decisions over the whole mass of data rather than just
> > vertical
> > > slices of it.
> > >
> > > On 1/12/17, 2:15 PM, "Casey Stella" 
> wrote:
> > >
> > > Hey Matt,
> > >
> > > Thanks for the comment!
> > > 1. At the moment, we only have one index name, the default
> of
> > which is
> > > the
> > > sensor name but that's entirely up to the user.  This is
> sensor
> > > specific,
> > > so it'd be a separate config for each sensor.  If we want
> to
> > build
> > > multiple
> > > indices per sensor, we'd have to think carefully about how
> to do
> > that
> > > and
> > > would be a bigger undertaking.  I guess I can see the use,
> though
> > > (redirect
> > > messages to one index vs another based on a predicate for
> a given
> > > sensor).
> > > Anyway, not where I was originally thinking that this
> discussion
> > would
> > > go,
> > > but it's an interesting point.
> > >
> > > 2. I hadn't thought through the implementation quite yet,
> but we
> > don't
> > > actually have a splitter bolt in that topology, just a
> spout
> > that goes
> > > to
> >

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-12 Thread Matt Foley

Ah, I see.  If overriding the default index name allows using the same name for 
multiple sensors, then the goal can be achieved.
Thanks,
--Matt


On 1/12/17, 3:30 PM, "Casey Stella"  wrote:

Oh, you could!  Let's say you have a syslog parser with data from sources 1,
2, and 3.  You'd end up with one Kafka queue with 3 parsers attached to that
queue, each picking apart the messages from source 1, 2, and 3.  They'd go
through separate enrichment and into the indexing topology.  In the
indexing topology, you could specify the same index name "syslog" and all
of the messages go into the same index for CEP querying if so desired.

On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley  wrote:

> Syslog is hell on parsers – I know, I worked at LogLogic in a previous
> life.  It makes perfect sense to route different lines from syslog through
> different appropriate parsers.  But a lot of what the parsers do is
> identify consistent subsets of metadata and annotate it – eg, src_ip_addr,
> event timestamps, etc.  Once those metadata are annotated and available
> with common field names, why doesn’t it make sense to index the messages
> together, for CEP querying?  I think Splunk has illustrated this model.
>
> On 1/12/17, 3:00 PM, "Casey Stella"  wrote:
>
> yeah, I mean, honestly, I think the approach that we've taken for
> sources
> which aggregate different types of data is to provide filters at the
> parser
> level and have multiple parser topologies (with different, possibly
> mutually exclusive filters) running.  This would be a completely
> separate
> sensor.  Imagine a syslog data source that aggregates and you want to
> pick
> apart certain pieces of messages.  This is why the initial thought and
> architecture was one index per sensor.
>
> On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley  wrote:
>
> > I’m thinking that CEP (Complex Event Processing) is contrary to the
> idea
> > of silo-ing data per sensor.
> > Now it’s true that some of those sensors are already aggregating
> data from
> > multiple sources, so maybe I’m wrong here.
> > But it just seems to me that the “data lake” insights come from
> being able
> > to make decisions over the whole mass of data rather than just
> vertical
> > slices of it.
> >
> > On 1/12/17, 2:15 PM, "Casey Stella"  wrote:
> >
> > Hey Matt,
> >
> > Thanks for the comment!
> > 1. At the moment, we only have one index name, the default of
> which is
> > the
> > sensor name but that's entirely up to the user.  This is sensor
> > specific,
> > so it'd be a separate config for each sensor.  If we want to
> build
> > multiple
> > indices per sensor, we'd have to think carefully about how to do
> that
> > and
> > would be a bigger undertaking.  I guess I can see the use, though
> > (redirect
> > messages to one index vs another based on a predicate for a given
> > sensor).
> > Anyway, not where I was originally thinking that this discussion
> would
> > go,
> > but it's an interesting point.
> >
> > 2. I hadn't thought through the implementation quite yet, but we
> don't
> > actually have a splitter bolt in that topology, just a spout
> that goes
> > to
> > the elasticsearch writer and also to the hdfs writer.
> >
> > On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley 
> wrote:
> >
> > > Casey, good to have controls like this.  Couple questions:
> > >
> > > 1. Regarding the “index” : “squid” name/value pair, is the
> index name
> > > expected to always be a sensor name?  Or is the given json
> structure
> > > subordinate to a sensor name in zookeeper?  Or can we build
> arbitrary
> > > indexes with this new specification, independent of sensor?
> Should
> > there
> > > actually be a list of “indexes”, ie
> > > { “indexes” : [
> > > {“index” : “name1”,
> > > …
> > > },
> > > {“index” : “name2”,
> > > …
> > > } ]
> > > }
> > >
> > > 2. Would the filtering / writer selection logic take place in
> the
> > indexing
> > > topology splitter bolt?  Seems like that would have the
> smallest
> > impact on
> > > current

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-12 Thread Casey Stella

Oh, you could!  Let's say you have a syslog parser with data from sources 1,
2, and 3.  You'd end up with one Kafka queue with 3 parsers attached to that
queue, each picking apart the messages from source 1, 2, and 3.  They'd go
through separate enrichment and into the indexing topology.  In the
indexing topology, you could specify the same index name "syslog" and all
of the messages go into the same index for CEP querying if so desired.
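
A rough sketch of that routing, written in Python rather than Metron's own
code. The sensor names and the index_for / group_by_index helpers are
hypothetical; only the "index" and "batchSize" keys come from the sample
config in this thread, and the fallback to the sensor name is the default
described elsewhere in the discussion.

```python
from collections import defaultdict

# Hypothetical per-sensor indexing configs: three parsers reading the same
# syslog Kafka topic, each configured with the same index name.
SENSOR_CONFIGS = {
    "syslog_src1": {"index": "syslog", "batchSize": 100},
    "syslog_src2": {"index": "syslog", "batchSize": 100},
    "syslog_src3": {"index": "syslog", "batchSize": 100},
    "squid": {"batchSize": 100},  # no "index" key: falls back to the sensor name
}


def index_for(sensor, configs=SENSOR_CONFIGS):
    """Return the index a sensor writes to, defaulting to the sensor name."""
    return configs.get(sensor, {}).get("index", sensor)


def group_by_index(messages, configs=SENSOR_CONFIGS):
    """Group (sensor, message) pairs by the index they would be written to."""
    grouped = defaultdict(list)
    for sensor, message in messages:
        grouped[index_for(sensor, configs)].append(message)
    return dict(grouped)
```

With configs like these, messages from all three syslog sensors land in one
"syslog" index, while squid keeps its own.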

On Thu, Jan 12, 2017 at 6:27 PM, Matt Foley  wrote:

> Syslog is hell on parsers – I know, I worked at LogLogic in a previous
> life.  It makes perfect sense to route different lines from syslog through
> different appropriate parsers.  But a lot of what the parsers do is
> identify consistent subsets of metadata and annotate it – eg, src_ip_addr,
> event timestamps, etc.  Once those metadata are annotated and available
> with common field names, why doesn’t it make sense to index the messages
> together, for CEP querying?  I think Splunk has illustrated this model.
>
> On 1/12/17, 3:00 PM, "Casey Stella"  wrote:
>
> yeah, I mean, honestly, I think the approach that we've taken for
> sources
> which aggregate different types of data is to provide filters at the
> parser
> level and have multiple parser topologies (with different, possibly
> mutually exclusive filters) running.  This would be a completely
> separate
> sensor.  Imagine a syslog data source that aggregates and you want to
> pick
> apart certain pieces of messages.  This is why the initial thought and
> architecture was one index per sensor.
>
> On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley  wrote:
>
> > I’m thinking that CEP (Complex Event Processing) is contrary to the
> idea
> > of silo-ing data per sensor.
> > Now it’s true that some of those sensors are already aggregating
> data from
> > multiple sources, so maybe I’m wrong here.
> > But it just seems to me that the “data lake” insights come from
> being able
> > to make decisions over the whole mass of data rather than just
> vertical
> > slices of it.
> >
> > On 1/12/17, 2:15 PM, "Casey Stella"  wrote:
> >
> > Hey Matt,
> >
> > Thanks for the comment!
> > 1. At the moment, we only have one index name, the default of
> which is
> > the
> > sensor name but that's entirely up to the user.  This is sensor
> > specific,
> > so it'd be a separate config for each sensor.  If we want to
> build
> > multiple
> > indices per sensor, we'd have to think carefully about how to do
> that
> > and
> > would be a bigger undertaking.  I guess I can see the use, though
> > (redirect
> > messages to one index vs another based on a predicate for a given
> > sensor).
> > Anyway, not where I was originally thinking that this discussion
> would
> > go,
> > but it's an interesting point.
> >
> > 2. I hadn't thought through the implementation quite yet, but we
> don't
> > actually have a splitter bolt in that topology, just a spout
> that goes
> > to
> > the elasticsearch writer and also to the hdfs writer.
> >
> > On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley 
> wrote:
> >
> > > Casey, good to have controls like this.  Couple questions:
> > >
> > > 1. Regarding the “index” : “squid” name/value pair, is the
> index name
> > > expected to always be a sensor name?  Or is the given json
> structure
> > > subordinate to a sensor name in zookeeper?  Or can we build
> arbitrary
> > > indexes with this new specification, independent of sensor?
> Should
> > there
> > > actually be a list of “indexes”, ie
> > > { “indexes” : [
> > > {“index” : “name1”,
> > > …
> > > },
> > > {“index” : “name2”,
> > > …
> > > } ]
> > > }
> > >
> > > 2. Would the filtering / writer selection logic take place in
> the
> > indexing
> > > topology splitter bolt?  Seems like that would have the
> smallest
> > impact on
> > > current implementation, no?
> > >
> > > Sorry if these are already answered in PR-415, I haven’t had
> time to
> > > review that one yet.
> > > Thanks,
> > > --Matt
> > >
> > >
> > > On 1/12/17, 12:55 PM, "Michael Miklavcic" <
> > michael.miklav...@gmail.com>
> > > wrote:
> > >
> > > I like the flexibility and expressibility of the first
> option
> > with
> > > Stellar
> > > filters.
> > >
> > > M
> > >
> > > On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella <
> > ceste...@gmail.com>
> >

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-12 Thread Matt Foley

Syslog is hell on parsers – I know, I worked at LogLogic in a previous life.  
It makes perfect sense to route different lines from syslog through different 
appropriate parsers.  But a lot of what the parsers do is identify consistent 
subsets of metadata and annotate it – e.g., src_ip_addr, event timestamps, etc.
Once those metadata are annotated and available with common field names, why 
doesn’t it make sense to index the messages together, for CEP querying?  I 
think Splunk has illustrated this model. 

On 1/12/17, 3:00 PM, "Casey Stella"  wrote:

yeah, I mean, honestly, I think the approach that we've taken for sources
which aggregate different types of data is to provide filters at the parser
level and have multiple parser topologies (with different, possibly
mutually exclusive filters) running.  This would be a completely separate
sensor.  Imagine a syslog data source that aggregates and you want to pick
apart certain pieces of messages.  This is why the initial thought and
architecture was one index per sensor.

On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley  wrote:

> I’m thinking that CEP (Complex Event Processing) is contrary to the idea
> of silo-ing data per sensor.
> Now it’s true that some of those sensors are already aggregating data from
> multiple sources, so maybe I’m wrong here.
> But it just seems to me that the “data lake” insights come from being able
> to make decisions over the whole mass of data rather than just vertical
> slices of it.
>
> On 1/12/17, 2:15 PM, "Casey Stella"  wrote:
>
> Hey Matt,
>
> Thanks for the comment!
> 1. At the moment, we only have one index name, the default of which is
> the
> sensor name but that's entirely up to the user.  This is sensor
> specific,
> so it'd be a separate config for each sensor.  If we want to build
> multiple
> indices per sensor, we'd have to think carefully about how to do that
> and
> would be a bigger undertaking.  I guess I can see the use, though
> (redirect
> messages to one index vs another based on a predicate for a given
> sensor).
> Anyway, not where I was originally thinking that this discussion would
> go,
> but it's an interesting point.
>
> 2. I hadn't thought through the implementation quite yet, but we don't
> actually have a splitter bolt in that topology, just a spout that goes
> to
> the elasticsearch writer and also to the hdfs writer.
>
> On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley  wrote:
>
> > Casey, good to have controls like this.  Couple questions:
> >
> > 1. Regarding the “index” : “squid” name/value pair, is the index name
> > expected to always be a sensor name?  Or is the given json structure
> > subordinate to a sensor name in zookeeper?  Or can we build arbitrary
> > indexes with this new specification, independent of sensor?  Should
> there
> > actually be a list of “indexes”, ie
> > { “indexes” : [
> > {“index” : “name1”,
> > …
> > },
> > {“index” : “name2”,
> > …
> > } ]
> > }
> >
> > 2. Would the filtering / writer selection logic take place in the
> indexing
> > topology splitter bolt?  Seems like that would have the smallest
> impact on
> > current implementation, no?
> >
> > Sorry if these are already answered in PR-415, I haven’t had time to
> > review that one yet.
> > Thanks,
> > --Matt
> >
> >
> > On 1/12/17, 12:55 PM, "Michael Miklavcic" <
> michael.miklav...@gmail.com>
> > wrote:
> >
> > I like the flexibility and expressibility of the first option
> with
> > Stellar
> > filters.
> >
> > M
> >
> > On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella <
> ceste...@gmail.com>
> > wrote:
> >
> > > As of METRON-652  > incubator-metron/pull/415>, we
> > > will have decoupled the indexing configuration from the
> enrichment
> > > configuration.  As an immediate follow-up to that, I'd like to
> > provide the
> > > ability to turn off and on writers via the configs.  I'd like
> to get
> > some
> > > community feedback on how the functionality should work, if
> y'all are
> > > amenable. :)
> > >
> > >
> > > As of now, we have 3 possible writers which can be used in the
> > indexing
> > > topology:
>

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-12 Thread Casey Stella

yeah, I mean, honestly, I think the approach that we've taken for sources
which aggregate different types of data is to provide filters at the parser
level and have multiple parser topologies (with different, possibly
mutually exclusive filters) running.  This would be a completely separate
sensor.  Imagine a syslog data source that aggregates and you want to pick
apart certain pieces of messages.  This is why the initial thought and
architecture was one index per sensor.
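
To make the parser-level filtering concrete, here is a small Python sketch.
It is not Metron code: the sensor names and the substring checks are made up
for illustration, and real parser filters would be configured in Metron (for
example as Stellar filters) rather than as Python lambdas.

```python
# Hypothetical parser-level filters for one aggregated syslog topic. Each
# parser topology is treated as its own sensor and keeps only its lines.
PARSER_FILTERS = {
    "syslog_sshd": lambda line: " sshd[" in line,
    "syslog_sudo": lambda line: " sudo:" in line,
}


def sensors_accepting(line, filters=PARSER_FILTERS):
    """Return the sensors whose filter accepts this raw line."""
    return [sensor for sensor, accept in filters.items() if accept(line)]
```

Because the filters here are mutually exclusive, each raw line is picked up
by at most one sensor, which is what makes one index per sensor a natural
default.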

On Thu, Jan 12, 2017 at 5:55 PM, Matt Foley  wrote:

> I’m thinking that CEP (Complex Event Processing) is contrary to the idea
> of silo-ing data per sensor.
> Now it’s true that some of those sensors are already aggregating data from
> multiple sources, so maybe I’m wrong here.
> But it just seems to me that the “data lake” insights come from being able
> to make decisions over the whole mass of data rather than just vertical
> slices of it.
>
> On 1/12/17, 2:15 PM, "Casey Stella"  wrote:
>
> Hey Matt,
>
> Thanks for the comment!
> 1. At the moment, we only have one index name, the default of which is
> the
> sensor name but that's entirely up to the user.  This is sensor
> specific,
> so it'd be a separate config for each sensor.  If we want to build
> multiple
> indices per sensor, we'd have to think carefully about how to do that
> and
> would be a bigger undertaking.  I guess I can see the use, though
> (redirect
> messages to one index vs another based on a predicate for a given
> sensor).
> Anyway, not where I was originally thinking that this discussion would
> go,
> but it's an interesting point.
>
> 2. I hadn't thought through the implementation quite yet, but we don't
> actually have a splitter bolt in that topology, just a spout that goes
> to
> the elasticsearch writer and also to the hdfs writer.
>
> On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley  wrote:
>
> > Casey, good to have controls like this.  Couple questions:
> >
> > 1. Regarding the “index” : “squid” name/value pair, is the index name
> > expected to always be a sensor name?  Or is the given json structure
> > subordinate to a sensor name in zookeeper?  Or can we build arbitrary
> > indexes with this new specification, independent of sensor?  Should
> there
> > actually be a list of “indexes”, ie
> > { “indexes” : [
> > {“index” : “name1”,
> > …
> > },
> > {“index” : “name2”,
> > …
> > } ]
> > }
> >
> > 2. Would the filtering / writer selection logic take place in the
> indexing
> > topology splitter bolt?  Seems like that would have the smallest
> impact on
> > current implementation, no?
> >
> > Sorry if these are already answered in PR-415, I haven’t had time to
> > review that one yet.
> > Thanks,
> > --Matt
> >
> >
> > On 1/12/17, 12:55 PM, "Michael Miklavcic" <
> michael.miklav...@gmail.com>
> > wrote:
> >
> > I like the flexibility and expressibility of the first option
> with
> > Stellar
> > filters.
> >
> > M
> >
> > On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella <
> ceste...@gmail.com>
> > wrote:
> >
> > > As of METRON-652  > incubator-metron/pull/415>, we
> > > will have decoupled the indexing configuration from the
> enrichment
> > > configuration.  As an immediate follow-up to that, I'd like to
> > provide the
> > > ability to turn off and on writers via the configs.  I'd like
> to get
> > some
> > > community feedback on how the functionality should work, if
> y'all are
> > > amenable. :)
> > >
> > >
> > > As of now, we have 3 possible writers which can be used in the
> > indexing
> > > topology:
> > >
> > >- Solr
> > >- Elasticsearch
> > >- HDFS
> > >
> > > HDFS is always used, elasticsearch or solr is used depending
> on how
> > you
> > > start the indexing topology.
> > >
> > > A couple of proposals come to mind immediately:
> > >
> > > *Index Filtering*
> > >
> > > You would be able to specify a filter as defined by a stellar
> > statement
> > > (likely a reuse of the StellarFilter that exists in the
> Parsers)
> > which
> > > would allow you to indicate on a message-by-message basis
> whether or
> > not to
> > > write the message.
> > >
> > > The semantics of this would be as follows:
> > >
> > >- Default (i.e. unspecified) is to pass everything through
> (hence
> > >backwards compatible with the current default config).
> > >- Messages which have the associated stellar

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-12 Thread Matt Foley

I’m thinking that CEP (Complex Event Processing) is contrary to the idea of 
silo-ing data per sensor.
Now it’s true that some of those sensors are already aggregating data from 
multiple sources, so maybe I’m wrong here.
But it just seems to me that the “data lake” insights come from being able to 
make decisions over the whole mass of data rather than just vertical slices of 
it.

On 1/12/17, 2:15 PM, "Casey Stella"  wrote:

Hey Matt,

Thanks for the comment!
1. At the moment, we only have one index name, the default of which is the
sensor name but that's entirely up to the user.  This is sensor specific,
so it'd be a separate config for each sensor.  If we want to build multiple
indices per sensor, we'd have to think carefully about how to do that and
would be a bigger undertaking.  I guess I can see the use, though (redirect
messages to one index vs another based on a predicate for a given sensor).
Anyway, not where I was originally thinking that this discussion would go,
but it's an interesting point.

2. I hadn't thought through the implementation quite yet, but we don't
actually have a splitter bolt in that topology, just a spout that goes to
the elasticsearch writer and also to the hdfs writer.

On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley  wrote:

> Casey, good to have controls like this.  Couple questions:
>
> 1. Regarding the “index” : “squid” name/value pair, is the index name
> expected to always be a sensor name?  Or is the given json structure
> subordinate to a sensor name in zookeeper?  Or can we build arbitrary
> indexes with this new specification, independent of sensor?  Should there
> actually be a list of “indexes”, ie
> { “indexes” : [
> {“index” : “name1”,
> …
> },
> {“index” : “name2”,
> …
> } ]
> }
>
> 2. Would the filtering / writer selection logic take place in the indexing
> topology splitter bolt?  Seems like that would have the smallest impact on
> current implementation, no?
>
> Sorry if these are already answered in PR-415, I haven’t had time to
> review that one yet.
> Thanks,
> --Matt
>
>
> On 1/12/17, 12:55 PM, "Michael Miklavcic" 
> wrote:
>
> I like the flexibility and expressibility of the first option with
> Stellar
> filters.
>
> M
>
> On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella 
> wrote:
>
> > As of METRON-652  incubator-metron/pull/415>, we
> > will have decoupled the indexing configuration from the enrichment
> > configuration.  As an immediate follow-up to that, I'd like to
> provide the
> > ability to turn off and on writers via the configs.  I'd like to get
> some
> > community feedback on how the functionality should work, if y'all are
> > amenable. :)
> >
> >
> > As of now, we have 3 possible writers which can be used in the
> indexing
> > topology:
> >
> >- Solr
> >- Elasticsearch
> >- HDFS
> >
> > HDFS is always used, elasticsearch or solr is used depending on how
> you
> > start the indexing topology.
> >
> > A couple of proposals come to mind immediately:
> >
> > *Index Filtering*
> >
> > You would be able to specify a filter as defined by a stellar
> statement
> > (likely a reuse of the StellarFilter that exists in the Parsers)
> which
> > would allow you to indicate on a message-by-message basis whether or
> not to
> > write the message.
> >
> > The semantics of this would be as follows:
> >
> >- Default (i.e. unspecified) is to pass everything through (hence
> >backwards compatible with the current default config).
> >- Messages which have the associated stellar statement evaluate
> to true
> >for the writer type will be written, otherwise not.
> >
> >
> > Sample indexing config which would write out no messages to HDFS and
> write
> > out only messages containing a field called "field1":
> > {
> >"index" : "squid"
> >   ,"batchSize" : 100
> >   ,"filters" : {
> >   "HDFS" : "false"
> >  ,"ES" : "exists(field1)"
> >  }
> > }
> >
> > *Index On/Off Switch*
> >
> > A simpler solution would be to just provide a list of writers to
> write
> > messages.  The semantics would be as follows:
> >
> >- If the list is unspecified, then

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-12 Thread Casey Stella

Hey Matt,

Thanks for the comment!
1. At the moment, we only have one index name, the default of which is the
sensor name, but that's entirely up to the user.  This is sensor specific,
so it'd be a separate config for each sensor.  If we want to build multiple
indices per sensor, we'd have to think carefully about how to do that, and
it would be a bigger undertaking.  I guess I can see the use, though (redirect
messages to one index vs another based on a predicate for a given sensor).
Anyway, not where I was originally thinking that this discussion would go,
but it's an interesting point.

2. I hadn't thought through the implementation quite yet, but we don't
actually have a splitter bolt in that topology, just a spout that goes to
the Elasticsearch writer and also to the HDFS writer.
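
For point 1, the "redirect messages based on a predicate" idea could look
roughly like the Python sketch below. Nothing like this exists in the config
today; the rule list, the index names, and the route helper are all
hypothetical, and the field names are only examples.

```python
# Hypothetical rules for one sensor: the first matching predicate picks the
# index; messages matching no rule go to the default (the sensor name).
SQUID_RULES = [
    ("squid_errors", lambda msg: int(msg.get("code", 0)) >= 400),
    ("squid_internal", lambda msg: str(msg.get("ip_dst_addr", "")).startswith("10.")),
]


def route(sensor, message, rules):
    """Return the index name for a message, falling back to the sensor name."""
    for index_name, predicate in rules:
        if predicate(message):
            return index_name
    return sensor
```

First-match-wins is just one possible policy; a real design would also have
to decide whether a message may go to several indices at once.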

On Thu, Jan 12, 2017 at 4:52 PM, Matt Foley  wrote:

> Casey, good to have controls like this.  Couple questions:
>
> 1. Regarding the “index” : “squid” name/value pair, is the index name
> expected to always be a sensor name?  Or is the given json structure
> subordinate to a sensor name in zookeeper?  Or can we build arbitrary
> indexes with this new specification, independent of sensor?  Should there
> actually be a list of “indexes”, ie
> { “indexes” : [
> {“index” : “name1”,
> …
> },
> {“index” : “name2”,
> …
> } ]
> }
>
> 2. Would the filtering / writer selection logic take place in the indexing
> topology splitter bolt?  Seems like that would have the smallest impact on
> current implementation, no?
>
> Sorry if these are already answered in PR-415, I haven’t had time to
> review that one yet.
> Thanks,
> --Matt
>
>
> On 1/12/17, 12:55 PM, "Michael Miklavcic" 
> wrote:
>
> I like the flexibility and expressibility of the first option with
> Stellar
> filters.
>
> M
>
> On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella 
> wrote:
>
> > As of METRON-652  incubator-metron/pull/415>, we
> > will have decoupled the indexing configuration from the enrichment
> > configuration.  As an immediate follow-up to that, I'd like to
> provide the
> > ability to turn off and on writers via the configs.  I'd like to get
> some
> > community feedback on how the functionality should work, if y'all are
> > amenable. :)
> >
> >
> > As of now, we have 3 possible writers which can be used in the
> indexing
> > topology:
> >
> >- Solr
> >- Elasticsearch
> >- HDFS
> >
> > HDFS is always used, elasticsearch or solr is used depending on how
> you
> > start the indexing topology.
> >
> > A couple of proposals come to mind immediately:
> >
> > *Index Filtering*
> >
> > You would be able to specify a filter as defined by a stellar
> statement
> > (likely a reuse of the StellarFilter that exists in the Parsers)
> which
> > would allow you to indicate on a message-by-message basis whether or
> not to
> > write the message.
> >
> > The semantics of this would be as follows:
> >
> >- Default (i.e. unspecified) is to pass everything through (hence
> >backwards compatible with the current default config).
> >- Messages which have the associated stellar statement evaluate
> to true
> >for the writer type will be written, otherwise not.
> >
> >
> > Sample indexing config which would write out no messages to HDFS and
> write
> > out only messages containing a field called "field1":
> > {
> >"index" : "squid"
> >   ,"batchSize" : 100
> >   ,"filters" : {
> >   "HDFS" : "false"
> >  ,"ES" : "exists(field1)"
> >  }
> > }
> >
> > *Index On/Off Switch*
> >
> > A simpler solution would be to just provide a list of writers to
> write
> > messages.  The semantics would be as follows:
> >
> >- If the list is unspecified, then the default is to write all
> messages
> >for every writer in the indexing topology
> >- If the list is specified, then a writer will write all messages
> if and
> >only if it is named in the list.
> >
> > Sample indexing config which turns off HDFS and keeps on
> Elasticsearch:
> > {
> >"index" : "squid"
> >   ,"batchSize" : 100
> >   ,"writers" : [ "ES" ]
> > }
> >
> > Thanks in advance for the feedback!  Also, if you have any other,
> better
> > ideas than the ones presented here, let me know too.
> >
> > Best,
> >
> > Casey
> >
>
>
>
>
>
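
The two proposals quoted above can be compared side by side with a short
Python sketch. This is an illustration of the stated semantics only, not the
Metron implementation: the helper names are hypothetical, the set of running
writers is assumed to be HDFS and ES, and plain Python callables stand in for
the Stellar statements.

```python
WRITERS = ("HDFS", "ES")  # writers assumed to be running in the topology


def writers_by_filter(message, config, writers=WRITERS):
    """Index Filtering: a writer with no filter passes everything through."""
    filters = config.get("filters", {})
    return [w for w in writers if filters.get(w, lambda _msg: True)(message)]


def writers_by_list(config, writers=WRITERS):
    """Index On/Off Switch: an unspecified list means every writer is on."""
    if "writers" not in config:
        return list(writers)
    return [w for w in writers if w in config["writers"]]


# Python callables stand in for the Stellar statements "false" and
# "exists(field1)" from the first sample config.
filter_config = {
    "index": "squid",
    "batchSize": 100,
    "filters": {
        "HDFS": lambda msg: False,
        "ES": lambda msg: "field1" in msg,
    },
}
switch_config = {"index": "squid", "batchSize": 100, "writers": ["ES"]}
```

The filter form decides per message, while the list form decides once per
sensor config; both fall back to "write everything" when the key is absent,
which keeps existing configs working.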

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-12 Thread Matt Foley

Casey, good to have controls like this.  Couple questions:

1. Regarding the "index" : "squid" name/value pair, is the index name expected
to always be a sensor name?  Or is the given JSON structure subordinate to a
sensor name in ZooKeeper?  Or can we build arbitrary indexes with this new
specification, independent of sensor?  Should there actually be a list of
"indexes", i.e.
{ "indexes" : [
    {"index" : "name1",
     …
    },
    {"index" : "name2",
     …
    } ]
}

2. Would the filtering / writer selection logic take place in the indexing 
topology splitter bolt?  Seems like that would have the smallest impact on 
current implementation, no?

Sorry if these are already answered in PR-415, I haven’t had time to review 
that one yet.
Thanks,
--Matt


On 1/12/17, 12:55 PM, "Michael Miklavcic"  wrote:

I like the flexibility and expressibility of the first option with Stellar
filters.

M

On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella  wrote:

> As of METRON-652, we
> will have decoupled the indexing configuration from the enrichment
> configuration.  As an immediate follow-up to that, I'd like to provide the
> ability to turn off and on writers via the configs.  I'd like to get some
> community feedback on how the functionality should work, if y'all are
> amenable. :)
>
>
> As of now, we have 3 possible writers which can be used in the indexing
> topology:
>
>- Solr
>- Elasticsearch
>- HDFS
>
> HDFS is always used, elasticsearch or solr is used depending on how you
> start the indexing topology.
>
> A couple of proposals come to mind immediately:
>
> *Index Filtering*
>
> You would be able to specify a filter as defined by a stellar statement
> (likely a reuse of the StellarFilter that exists in the Parsers) which
> would allow you to indicate on a message-by-message basis whether or not to
> write the message.
>
> The semantics of this would be as follows:
>
>- Default (i.e. unspecified) is to pass everything through (hence
>backwards compatible with the current default config).
>- Messages which have the associated stellar statement evaluate to true
>for the writer type will be written, otherwise not.
>
>
> Sample indexing config which would write no messages to HDFS and write to
> Elasticsearch only those messages containing a field called "field1":
> {
>    "index" : "squid"
>   ,"batchSize" : 100
>   ,"filters" : {
>       "HDFS" : "false"
>      ,"ES" : "exists(field1)"
>    }
> }
>
> *Index On/Off Switch*
>
> A simpler solution would be to just provide a list of writers to write
> messages.  The semantics would be as follows:
>
>    - If the list is unspecified, then the default is to write all messages
>    for every writer in the indexing topology.
>    - If the list is specified, then a writer will write all messages if and
>    only if it is named in the list.
>
> Sample indexing config which turns off HDFS and keeps Elasticsearch on:
> {
>    "index" : "squid"
>   ,"batchSize" : 100
>   ,"writers" : [ "ES" ]
> }
>
> Thanks in advance for the feedback!  Also, if you have any other, better
> ideas than the ones presented here, let me know too.
>
> Best,
>
> Casey
>

Re: [DISCUSS] Turning off indexing writers feature discussion

2017-01-12 Thread Michael Miklavcic

I like the flexibility and expressibility of the first option with Stellar
filters.

M

On Thu, Jan 12, 2017 at 1:51 PM, Casey Stella  wrote:

> As of METRON-652, we
> will have decoupled the indexing configuration from the enrichment
> configuration.  As an immediate follow-up to that, I'd like to provide the
> ability to turn off and on writers via the configs.  I'd like to get some
> community feedback on how the functionality should work, if y'all are
> amenable. :)
>
>
> As of now, we have 3 possible writers which can be used in the indexing
> topology:
>
>    - Solr
>    - Elasticsearch
>    - HDFS
>
> HDFS is always used; Elasticsearch or Solr is used depending on how you
> start the indexing topology.
>
> A couple of proposals come to mind immediately:
>
> *Index Filtering*
>
> You would be able to specify a filter as defined by a Stellar statement
> (likely a reuse of the StellarFilter that exists in the Parsers) which
> would allow you to indicate on a message-by-message basis whether or not to
> write the message.
>
> The semantics of this would be as follows:
>
>    - Default (i.e. unspecified) is to pass everything through (hence
>    backwards compatible with the current default config).
>    - A message is written by a given writer only if the Stellar statement
>    associated with that writer type evaluates to true for that message.
>
>
> Sample indexing config which would write no messages to HDFS and write to
> Elasticsearch only those messages containing a field called "field1":
> {
>    "index" : "squid"
>   ,"batchSize" : 100
>   ,"filters" : {
>       "HDFS" : "false"
>      ,"ES" : "exists(field1)"
>    }
> }
>
> *Index On/Off Switch*
>
> A simpler solution would be to just provide a list of writers to write
> messages.  The semantics would be as follows:
>
>    - If the list is unspecified, then the default is to write all messages
>    for every writer in the indexing topology.
>    - If the list is specified, then a writer will write all messages if and
>    only if it is named in the list.
>
> Sample indexing config which turns off HDFS and keeps Elasticsearch on:
> {
>    "index" : "squid"
>   ,"batchSize" : 100
>   ,"writers" : [ "ES" ]
> }
>
> Thanks in advance for the feedback!  Also, if you have any other, better
> ideas than the ones presented here, let me know too.
>
> Best,
>
> Casey
>

[DISCUSS] Turning off indexing writers feature discussion

2017-01-12 Thread Casey Stella

As of METRON-652, we
will have decoupled the indexing configuration from the enrichment
configuration.  As an immediate follow-up to that, I'd like to provide the
ability to turn off and on writers via the configs.  I'd like to get some
community feedback on how the functionality should work, if y'all are
amenable. :)


As of now, we have 3 possible writers which can be used in the indexing
topology:

   - Solr
   - Elasticsearch
   - HDFS

HDFS is always used; Elasticsearch or Solr is used depending on how you
start the indexing topology.

A couple of proposals come to mind immediately:

*Index Filtering*

You would be able to specify a filter as defined by a Stellar statement
(likely a reuse of the StellarFilter that exists in the Parsers) which
would allow you to indicate on a message-by-message basis whether or not to
write the message.

The semantics of this would be as follows:

   - Default (i.e. unspecified) is to pass everything through (hence
   backwards compatible with the current default config).
   - A message is written by a given writer only if the Stellar statement
   associated with that writer type evaluates to true for that message.


Sample indexing config which would write no messages to HDFS and write to
Elasticsearch only those messages containing a field called "field1":
{
   "index" : "squid"
  ,"batchSize" : 100
  ,"filters" : {
      "HDFS" : "false"
     ,"ES" : "exists(field1)"
   }
}
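
The filter semantics above can be sketched in a few lines of Python. This is
only an illustration, not Metron code: writers_for and squid_filters are
hypothetical names, and plain Python predicates stand in for evaluating the
Stellar statements "false" and "exists(field1)" (the "in" check merely
approximates what exists would do).

```python
from typing import Any, Callable, Dict, List

Message = Dict[str, Any]
Predicate = Callable[[Message], bool]


def writers_for(message: Message,
                writers: List[str],
                filters: Dict[str, Predicate]) -> List[str]:
    """Return the writers that should receive this message."""
    selected = []
    for writer in writers:
        predicate = filters.get(writer)
        # No filter configured for this writer: pass everything through,
        # which keeps the behaviour backwards compatible.
        if predicate is None or predicate(message):
            selected.append(writer)
    return selected


# Hypothetical stand-ins for the Stellar statements in the sample config.
squid_filters: Dict[str, Predicate] = {
    "HDFS": lambda msg: False,          # stands in for "false"
    "ES": lambda msg: "field1" in msg,  # approximates "exists(field1)"
}
```

Under these assumptions, a message carrying field1 is selected for ES only, a
message without it is selected for no writer, and a sensor with no filters
configured is selected for every writer in the topology.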

*Index On/Off Switch*

A simpler solution would be to just provide a list of writers to write
messages.  The semantics would be as follows:

   - If the list is unspecified, then the default is to write all messages
   for every writer in the indexing topology
   - If the list is specified, then a writer will write all messages if and
   only if it is named in the list.

Sample indexing config which turns off HDFS and keeps Elasticsearch on:
{
   "index" : "squid"
  ,"batchSize" : 100
  ,"writers" : [ "ES" ]
}
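
For comparison, a minimal Python sketch of the on/off list semantics. Again
this is an assumption-laden illustration rather than Metron code:
enabled_writers is a hypothetical helper, and the writer names passed in for
the topology are taken from the sample config above.

```python
import json
from typing import Any, Dict, List


def enabled_writers(sensor_config: Dict[str, Any],
                    topology_writers: List[str]) -> List[str]:
    """Return the writers that should write messages for this sensor."""
    listed = sensor_config.get("writers")
    if listed is None:
        # List unspecified: every writer in the topology writes everything.
        return list(topology_writers)
    # List specified: a writer writes if and only if it is named in the list.
    return [w for w in topology_writers if w in listed]


# The sample config from above, parsed from standard JSON.
squid_config = json.loads(
    '{"index": "squid", "batchSize": 100, "writers": ["ES"]}'
)
```

One consequence worth noting in this sketch: an explicitly empty list turns
every writer off, which is different from leaving the list out entirely.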

Thanks in advance for the feedback!  Also, if you have any other, better
ideas than the ones presented here, let me know too.

Best,

Casey
