[DISCUSS] Recurrent Large Indexing Error Messages

2018-10-19 Thread Nick Allen
I want to discuss solutions for the problem that I have described in
METRON-1832: Recurrent Large Indexing Error Messages. I feel this is a very
easy trap to fall into when using the default settings that currently come
with Metron.


## Problem


https://issues.apache.org/jira/browse/METRON-1832


If any index destination like HDFS, Elasticsearch, or Solr goes down while
the Indexing topology is running, an error message is created and sent back
to the user-defined error topic.  By default, this is defined to also be
the 'indexing' topic.

The Indexing topology then consumes this error message and attempts to
write it again. If the index destination is still down, another error
occurs and another error message is created that encapsulates the original
error message.  That message is then sent to the 'indexing' topic, which is
later consumed, yet again, by the Indexing topology.

These error messages will continue to be recycled and grow larger and
larger as each new error message encapsulates all previous error messages
in the "raw_message" field.

Once the index destination recovers, one giant error message will finally
be written that contains massively duplicated, useless information which
can further negatively impact performance of the index destination.

Also, the escape characters '\' compound with each level of encapsulation,
leading to long strings of '\' characters in the error message.
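
To make the growth concrete, here is a minimal Python sketch of the cycle.
It is not Metron code: the 'error_type' field, the sample telemetry, and the
helper names are assumptions for illustration; only the 'raw_message' field
comes from the description above. Each pass serializes the previous message
into 'raw_message', so both the size and the runs of '\' grow on every
failure.

```python
import json
import re


def wrap_as_error(message: str) -> str:
    """Encapsulate a failed message in a new error message.

    The field names are illustrative; only "raw_message" comes from the
    thread.
    """
    return json.dumps({"error_type": "indexer_error", "raw_message": message})


def longest_backslash_run(text: str) -> int:
    """Length of the longest run of consecutive backslashes in text."""
    runs = re.findall(r"\\+", text)
    return max((len(run) for run in runs), default=0)


# A made-up telemetry message containing one pair of escaped quotes.
original = json.dumps({"ip_src_addr": "10.0.0.1", "note": 'say "hi"'})

sizes = []
backslash_runs = []
message = original
for attempt in range(5):
    # Each failed write re-wraps the previous (already wrapped) message.
    message = wrap_as_error(message)
    sizes.append(len(message))
    backslash_runs.append(longest_backslash_run(message))

for attempt, (size, run) in enumerate(zip(sizes, backslash_runs), start=1):
    print(f"failure {attempt}: {size} chars, longest backslash run {run}")
```

In this sketch the longest run of '\' roughly doubles on every pass, which
is the compounding described above.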


## Background

There was some discussion on how to handle this on the original PR #453
https://github.com/apache/metron/pull/453.

## Solutions

(1) The first, easiest option is to just do nothing.  There was already a
discussion around this and this is the solution that we landed on in #453.

Pros: Really easy; do nothing.

Cons: Intermittent problems with ES/Solr can easily create very large error
messages that can significantly impact both search and ingest performance.


(2) Change the default indexing error topic to 'indexing_errors' to avoid
recycling error messages. Nothing will consume from the 'indexing_errors'
topic, thus preventing a cycle.

Pros: Simple, easy change that prevents the cycle.

Cons: Recoverable indexing errors are not visible to users as they will
never be indexed in ES/Solr.

(3) Add logic to limit the number of times a message can be 'recycled'
through the Indexing topology.  This effectively sets a maximum number of
retry attempts.  If a message fails N times, then write the message to a
separate, unrecoverable error topic.

Pros: Recoverable errors are visible to users in ES/Solr.

Cons: More complex.  Users still need to check the unrecoverable error
topic for potential problems.
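
A rough Python sketch of the routing decision this option implies. The
'failed_count' field, the 'indexing_unrecoverable' topic name, and the value
of N are all hypothetical; nothing like this exists in Metron today as far
as this thread is concerned.

```python
# Hypothetical names and limits, for illustration only.
ERROR_TOPIC = "indexing"
UNRECOVERABLE_TOPIC = "indexing_unrecoverable"
MAX_RETRIES = 3


def route_failed_message(message: dict, max_retries: int = MAX_RETRIES) -> tuple:
    """Pick the topic for a message that just failed to index.

    Returns (topic, updated_message). The message carries a running
    failure count; once it reaches max_retries the message is diverted to
    the unrecoverable topic instead of being recycled.
    """
    failures = message.get("failed_count", 0) + 1
    updated = dict(message, failed_count=failures)
    topic = UNRECOVERABLE_TOPIC if failures >= max_retries else ERROR_TOPIC
    return topic, updated


# Simulate one message failing four times in a row.
msg = {"raw_message": "{}"}
for _ in range(4):
    topic, msg = route_failed_message(msg)
    print(msg["failed_count"], topic)
```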

(4) Do not further encapsulate error messages in the 'raw_message' field.
If an error message fails, don't encapsulate it in another error message.
Just push it to the error topic as-is.  Could add a field that indicates
how many times the message has failed.

Pros: Prevents giant error messages from being created from recoverable
errors.

Cons: Extended outages would still cause the Indexing topology to
repeatedly recycle these error messages, which would ultimately exhaust
resources in Storm.
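
A minimal Python sketch of this option, under stated assumptions: error
messages are recognized by a 'source.type' of 'error', and the counter field
is called 'failed_count'. Both names are illustrative choices here, not a
confirmed part of Metron's error schema.

```python
import json


def to_error_message(message: dict) -> dict:
    """Build the message to send to the error topic after a failed write.

    A message that is already an error is passed through as-is, with only
    its failure counter incremented. Anything else is wrapped exactly once.
    """
    if message.get("source.type") == "error":
        return dict(message, failed_count=message.get("failed_count", 1) + 1)
    return {
        "source.type": "error",
        "raw_message": json.dumps(message, sort_keys=True),
        "failed_count": 1,
    }


# The payload stays the same size no matter how often the write fails.
error = to_error_message({"ip_src_addr": "10.0.0.1"})
for _ in range(4):
    error = to_error_message(error)
print(error["failed_count"], len(error["raw_message"]))
```

This keeps the message size constant, but as noted in the cons it does
nothing to stop the recycling itself.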



What other ways can we solve this?


Re: [DISCUSS] Stellar REST client

2018-10-19 Thread Otto Fowler
I believe the issue of introducing and supporting higher-latency
enrichments is a systemic one and should be solved as such,
with the REST client and other higher-latency enrichments built on top of
that framework.





Re: [DISCUSS] Stellar REST client

2018-10-19 Thread Ryan Merriman
Thanks Casey, good questions.

As far as the verbs go, just thinking we might want to support calls other
than GET at some point.  For the use case stated (enriching messages from
3rd party services) GET is all we need.  Probably a moot point anyway,
since every HTTP library will support the different HTTP verbs.

Agreed on the caching.  I will defer to those that are more familiar with
the Stellar internals on what the correct approach is.

I was thinking the same thing with regards to the client libraries.  Apache
HttpComponents is probably the safest choice but OkHttp looks nice and
could reduce effort and complexity as long as it meets our requirements.


Re: [DISCUSS] Stellar REST client

2018-10-19 Thread Casey Stella
I think it makes a lot of sense.  A couple of questions:

   - What actions do you see the REST verbs corresponding to?  I would
   understand GET (which is in effect "evaluate an expression", right?), but
   I'm not sure about the others.
   - We should probably be careful about caching Stellar expressions.  Not
   all Stellar expressions are deterministic (e.g. PROFILE_GET may not be, as
   the lookback window is bound to the current time).  Ultimately, I think we
   should probably bake whether a function is deterministic into Stellar so
   that *Stellar* can cache where appropriate (e.g. if every part of an
   expression is deterministic, then pull from the cache; otherwise recompute).
   All of this to say, if you're going to make it configurable, IMO we should
   make it a configuration that the user passes in at request time so they
   have control over whether the expression is safe to cache.
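
A rough Python sketch of the "cache only if every part is deterministic"
idea. Everything here is an assumption for illustration: Stellar does not
expose a determinism flag per this thread, the registry contents are made
up, and pulling function names out with a regex stands in for walking a
real parse tree.

```python
import re

# Hypothetical registry: function name -> whether it is deterministic.
DETERMINISTIC = {"TO_UPPER": True, "DOMAIN_TO_TLD": True, "PROFILE_GET": False}


def is_cacheable(expression: str, registry: dict = DETERMINISTIC) -> bool:
    """True only if every function called is known to be deterministic."""
    names = re.findall(r"\b([A-Z_][A-Z0-9_]*)\s*\(", expression)
    return all(registry.get(name, False) for name in names)


class CachingEvaluator:
    """Wraps an evaluate(expression, variables) callable with a cache."""

    def __init__(self, evaluate, registry: dict = DETERMINISTIC):
        self._evaluate = evaluate
        self._registry = registry
        self._cache = {}

    def __call__(self, expression: str, variables: dict):
        if not is_cacheable(expression, self._registry):
            # Non-deterministic (or unknown) functions are always recomputed.
            return self._evaluate(expression, variables)
        # Assumes variable values are hashable.
        key = (expression, tuple(sorted(variables.items())))
        if key not in self._cache:
            self._cache[key] = self._evaluate(expression, variables)
        return self._cache[key]
```

Unknown functions are treated as non-deterministic here, which errs on the
side of recomputing rather than serving a stale result.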

Without more compelling reasons to not do so, I'd suggest we use
HttpComponents as it's another Apache project and under active
development/support.  I'd also be ok with OkHttp if it's actively
maintained.



[DISCUSS] Stellar REST client

2018-10-19 Thread Ryan Merriman
I want to open up discussion around adding a Stellar REST client function.
There are services available to enrich security telemetry and they are
commonly exposed through a REST interface.  The primary purpose of this
discuss thread is to collect requirements from the community and agree on a
general architectural approach.

At a minimum I see a Stellar REST client supporting:

   - Common HTTP verbs including GET, POST, DELETE, etc
   - Option to provide headers and request parameters as needed
   - Support for basic authentication
   - Proper request and error handling (we can discuss further how this
   should work)
   - SSL support
   - Option to use a proxy server (including authentication)
   - JSON format

In addition to these functional requirements, I would also propose we
include these performance requirements:

   - Provide a configurable caching layer
   - Provide a mechanism for pooling connections
   - Provide clear documentation and guidance on how to properly use this
   feature since there is a significant risk of introducing latency issues
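
As a starting point for discussion, here is a minimal Python sketch of what
a configurable caching layer with a request timeout could look like. It is
only an illustration of the idea, not a proposed design: the class name, the
'ttl_seconds' and 'max_entries' knobs, and the eviction policy are all
assumptions, and the real feature would be a Java library wrapped in a
Stellar function.

```python
import json
import time
import urllib.request


def http_get_json(url: str, timeout_seconds: float = 1.0):
    """Default fetcher: a plain GET with a hard timeout, parsed as JSON."""
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        return json.loads(response.read().decode("utf-8"))


class CachedRestClient:
    """GET client with a small time-to-live cache keyed by URL."""

    def __init__(self, fetch=http_get_json, ttl_seconds: float = 60.0,
                 max_entries: int = 1000, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._max_entries = max_entries
        self._clock = clock
        self._cache = {}  # url -> (time stored, value)

    def get(self, url: str):
        now = self._clock()
        entry = self._cache.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh, skip the network call
        value = self._fetch(url)
        if url not in self._cache and len(self._cache) >= self._max_entries:
            # Evict the oldest inserted entry to bound memory use.
            self._cache.pop(next(iter(self._cache)))
        self._cache[url] = (now, value)
        return value
```

The fetcher and clock are injectable so the cache behaviour can be exercised
without a network, for example 'CachedRestClient(fetch=my_stub)'.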

What else would you like to see included?

I think the primary architectural decision we need to make (based on the
agreed upon requirements of course) is an appropriate Java HTTP/REST client
library.  Ideally we choose a library that supports everything we need
OOTB.  I think the majority of the work for this feature will involve
wrapping this library in a Stellar function and exposing the configuration
knobs through Metron's configuration interface (Ambari, Zookeeper, etc).  I
have done some very light research and here is my initial list:

   - Apache HttpComponents - https://hc.apache.org/
      - Has support for all of the features listed above as far as I can tell
      - Doesn't introduce a large number of new dependencies (am I wrong here?)
      - Is sort of included already (we will need to upgrade from httpclient)
      - Lower level
   - Google HTTP Client Library for Java -
   https://developers.google.com/api-client-library/java/google-http-java-client/
      - Higher level API with pluggable components
      - Introduces dependencies (we've had issues with Guava in the past)
   - Netflix Ribbon - https://github.com/Netflix/ribbon
      - Has a lot of nice features that may be useful in the future
      - Introduces dependencies (including Guava)
      - Hasn't been committed to in the last 5-6 months
   - Unirest - https://github.com/Kong/unirest-java
      - Lightweight API built on top of HttpComponents
      - Pluggable serialization library (Jackson is an issue for us so this
      is nice)
      - Also has not received a commit in a while
   - OkHttp - http://square.github.io/okhttp/
      - Good documentation and looks easy to use
      - Actively maintained

Obviously we have a lot of choices.  I think it comes down to balancing the
tradeoff between ease of use (HttpComponents will likely require the most
work since it is lower level) and capability.  Introducing additional
dependencies is something we should also be mindful of because of our
shading practices.

This should get us started.  Let me know what you think!


HCP in Cloud infrastructures such as AWS, GCP, AZURE

2018-10-19 Thread deepak kumar
Hi All,
I have a quick question about HCP deployments on cloud infrastructure such
as AWS.
I am planning to run a persistent cluster for all event streaming and
processing, and then run a transient cluster such as AWS EMR to run batch
loads on the data ingested by the persistent cluster.
Has anyone tried this model?
Since the data volume is going to be huge, the cloud provider will charge a
lot of money for data I/O and storage.
Keeping this in mind, what would be the best cloud deployment of HCP
components, assuming an ingest rate of 10 TB per day?

Thanks in advance.


Regards,
Deepak