Re: Ignite not friendly for Monitoring

2018-01-16 Thread Dmitriy Setrakyan
I've assigned version 2.5 to the ticket. Let's try to make progress on this
before then.

On Tue, Jan 16, 2018 at 12:03 PM, Denis Magda  wrote:

> Serge,
>
> Thanks for taking over this. I think we’re moving in the right direction with
> your proposal:
>
> * I would add a top-level domain for “Integrations”. All the integrations
> with Kafka, Spark, Storm, etc. should go there.
>
> * The number of second-level domains can grow over time within a top-level
> layer. Let’s reserve a decent range for this possible growth.
>
> * I guess external adapters should go under “Integrations”, which sounds
> better to me.
>
> * Agree that this ticket should be used to track the progress in JIRA:
> https://issues.apache.org/jira/browse/IGNITE-3690 <
> https://issues.apache.org/jira/browse/IGNITE-3690>
>
>
> On top of this, this effort has to be tested using a 3rd party tool such
> as DynaTrace or Nagios. If the tools can pick up and analyze our logs to
> automate classic DevOps tasks, then the goal will be achieved. Can you
> include this as a required task for QA?
>
> —
> Denis
>
> > On Jan 15, 2018, at 7:48 AM, Serge Puchnin 
> wrote:
> >
> > Igniters,
> >
> > It's the right idea!
> >
> > Let's try to revitalize it and move it forward.
> >
> > As a first step, I would like to propose a list of top-level domains.
> >
> > -- the phase 1
> >1. UnExpected, UnKnown
> >2. Cluster and Topology
> >Discovery
> >Segmentation
> >Node Startup
> >Communication
> >Queue
> >Activate, startup process
> >Base line topology
> >Marshaller
> >Metadata
> >Topology Validate
> >3. Cache and Storage
> >Partition map exchange
> >Balancing
> >Long-running transactions
> >Checkpoint
> >Create cache
> >Destroy cache
> >Data loading & streaming
> >4. SQL
> >Long-running queries
> >Parsing
> >Queries
> >Scan Queries
> >SqlLine
> >5. Compute
> >Deployment
> >spi.checkpoint
> >spi.collision
> >Job Schedule
> >
> > -- the phase 2
> >6. Service
> >7. Security
> >8. ML
> >9. External Adapters
> >10. WebConsole
> >11. Vendor Specific
> >GG
> >
> >
> > For every second-level domain, the plan is to reserve one hundred error
> > codes. The sum across second-level domains (rounded up to the next thousand)
> > gives us the code count for the top-level domain.
> >
> > Every error code has a severity level:
> >
> > Critical (Red) - the system is not operational;
> > Warning (Yellow) - the system is operational but health is degraded;
> > Info - informational only.
> >
> > Each code also gets a two- or three-letter prefix. This makes it easier to
> > find an issue without complex grep rules (something like grep -E
> > "10[2][5-9][0-5][0-9]|10[3][0-5][0-6][0-9]" * to find codes between
> > 102500 and 103569).
> >
> >
> > Domains from the first phase look well defined, but those from the second
> > are vague. Initially, we can focus only on the first phase.
> >
> > Please share your thoughts on the proposed design.
> >
> > Serge.
> >
> >
> >
> > --
> > Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
>
>


Re: Ignite not friendly for Monitoring

2018-01-16 Thread Denis Magda
Serge,

Thanks for taking over this. I think we’re moving in the right direction with
your proposal:

* I would add a top-level domain for “Integrations”. All the integrations with 
Kafka, Spark, Storm, etc. should go there.

* The number of second-level domains can grow over time within a top-level
layer. Let’s reserve a decent range for this possible growth.

* I guess external adapters should go under “Integrations”, which sounds better
to me.

* Agree that this ticket should be used to track the progress in JIRA: 
https://issues.apache.org/jira/browse/IGNITE-3690 



On top of this, this effort has to be tested using a 3rd party tool such as
DynaTrace or Nagios. If the tools can pick up and analyze our logs to automate
classic DevOps tasks, then the goal will be achieved. Can you include this as a
required task for QA?

—
Denis

> On Jan 15, 2018, at 7:48 AM, Serge Puchnin  wrote:
> 
> Igniters, 
> 
> It's the right idea!
> 
> Let's try to revitalize it and move it forward.
> 
> As a first step, I would like to propose a list of top-level domains.
> 
> -- the phase 1
>1. UnExpected, UnKnown
>2. Cluster and Topology 
>Discovery
>Segmentation 
>Node Startup
>Communication
>Queue
>Activate, startup process
>Base line topology
>Marshaller
>Metadata
>Topology Validate
>3. Cache and Storage
>Partition map exchange
>Balancing
>Long-running transactions
>Checkpoint
>Create cache
>Destroy cache
>Data loading & streaming
>4. SQL
>Long-running queries
>Parsing
>Queries
>Scan Queries
>SqlLine
>5. Compute
>Deployment
>spi.checkpoint
>spi.collision
>Job Schedule
> 
> -- the phase 2
>6. Service
>7. Security
>8. ML
>9. External Adapters 
>10. WebConsole
>11. Vendor Specific 
>GG
> 
> 
> For every second-level domain, the plan is to reserve one hundred error
> codes. The sum across second-level domains (rounded up to the next thousand)
> gives us the code count for the top-level domain.
> 
> Every error code has a severity level:
> 
> Critical (Red) - the system is not operational; 
> Warning (Yellow) - the system is operational but health is degraded; 
> Info - informational only.
> 
> Each code also gets a two- or three-letter prefix. This makes it easier to
> find an issue without complex grep rules (something like grep -E
> "10[2][5-9][0-5][0-9]|10[3][0-5][0-6][0-9]" * to find codes between 102500
> and 103569).
> 
> 
> Domains from the first phase look well defined, but those from the second
> are vague. Initially, we can focus only on the first phase.
> 
> Please share your thoughts on the proposed design.
> 
> Serge.
> 
> 
> 
> --
> Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/



Re: Ignite not friendly for Monitoring

2018-01-16 Thread Serge Puchnin
This might be the right ticket:
https://issues.apache.org/jira/browse/IGNITE-3690

I'm going to update it once the domain list is agreed upon by the
community.




--
Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/


Re: Ignite not friendly for Monitoring

2018-01-16 Thread Dmitriy Setrakyan
Is there a Jira ticket for it?

On Mon, Jan 15, 2018 at 7:48 AM, Serge Puchnin 
wrote:

> Igniters,
>
> It's the right idea!
>
> Let's try to revitalize it and move it forward.
>
> As a first step, I would like to propose a list of top-level domains.
>
> -- the phase 1
> 1. Unexpected, Unknown
> 2. Cluster and Topology
> Discovery
> Segmentation
> Node Startup
> Communication
> Queue
> Activate, startup process
> Baseline topology
> Marshaller
> Metadata
> Topology validation
> 3. Cache and Storage
> Partition map exchange
> Balancing
> Long-running transactions
> Checkpoint
> Create cache
> Destroy cache
> Data loading & streaming
> 4. SQL
> Long-running queries
> Parsing
> Queries
> Scan Queries
> SqlLine
> 5. Compute
> Deployment
> spi.checkpoint
> spi.collision
> Job Schedule
>
> -- the phase 2
> 6. Service
> 7. Security
> 8. ML
> 9. External Adapters
> 10. WebConsole
> 11. Vendor Specific
> GG
>
>
> For every second-level domain, the plan is to reserve one hundred error
> codes. The sum across second-level domains (rounded up to the next thousand)
> gives us the code count for the top-level domain.
>
> Every error code has a severity level:
>
> Critical (Red) - the system is not operational;
> Warning (Yellow) - the system is operational but health is degraded;
> Info - informational only.
>
> Each code also gets a two- or three-letter prefix. This makes it easier to
> find an issue without complex grep rules (something like grep -E
> "10[2][5-9][0-5][0-9]|10[3][0-5][0-6][0-9]" * to find codes between 102500
> and 103569).
>
>
> Domains from the first phase look well defined, but those from the second
> are vague. Initially, we can focus only on the first phase.
>
> Please share your thoughts on the proposed design.
>
> Serge.
>
>
>
> --
> Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
>


Re: Ignite not friendly for Monitoring

2018-01-15 Thread Serge Puchnin
Igniters, 

It's the right idea!

Let's try to revitalize it and move it forward.

As a first step, I would like to propose a list of top-level domains.

-- the phase 1
1. Unexpected, Unknown
2. Cluster and Topology 
Discovery
Segmentation 
Node Startup
Communication
Queue
Activate, startup process
Baseline topology
Marshaller
Metadata
Topology validation
3. Cache and Storage
Partition map exchange
Balancing
Long-running transactions
Checkpoint
Create cache
Destroy cache
Data loading & streaming
4. SQL
Long-running queries
Parsing
Queries
Scan Queries
SqlLine
5. Compute
Deployment
spi.checkpoint
spi.collision
Job Schedule

-- the phase 2
6. Service
7. Security
8. ML
9. External Adapters 
10. WebConsole
11. Vendor Specific 
GG


For every second-level domain, the plan is to reserve one hundred error
codes. The sum across second-level domains (rounded up to the next thousand)
gives us the code count for the top-level domain.

Every error code has a severity level:

Critical (Red) - the system is not operational; 
Warning (Yellow) - the system is operational but health is degraded; 
Info - informational only.

Each code also gets a two- or three-letter prefix. This makes it easier to
find an issue without complex grep rules (something like grep -E
"10[2][5-9][0-5][0-9]|10[3][0-5][0-6][0-9]" * to find codes between 102500
and 103569).
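For illustration only, here is a minimal sketch in Java of how such a code
could carry its domain prefix and severity (all names are hypothetical, not a
committed API):

    // Illustration only: hypothetical names, not a committed Ignite API.
    import java.util.regex.Pattern;

    public class ErrorCodeSketch {
        enum Severity { CRITICAL, WARNING, INFO }

        static final class ErrorCode {
            final String prefix;    // two- or three-letter domain prefix, e.g. "CL" for Cluster
            final int code;         // e.g. 102500, drawn from the domain's reserved range
            final Severity severity;

            ErrorCode(String prefix, int code, Severity severity) {
                this.prefix = prefix;
                this.code = code;
                this.severity = severity;
            }

            // Renders a log marker such as "CL-102500 [WARNING]".
            String marker() {
                return prefix + "-" + code + " [" + severity + "]";
            }
        }

        public static void main(String[] args) {
            ErrorCode nodeLeft = new ErrorCode("CL", 102500, Severity.WARNING);
            System.out.println(nodeLeft.marker() + " Node left the cluster");

            // With a prefix, a monitoring rule stays trivial compared to the
            // numeric-range regex above:
            Pattern p = Pattern.compile("\\bCL-\\d{6}\\b");
            System.out.println(p.matcher(nodeLeft.marker()).find()); // true
        }
    }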


Domains from the first phase look well defined, but those from the second
are vague. Initially, we can focus only on the first phase.

Please share your thoughts on the proposed design.

Serge.



--
Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/


Re: Ignite not friendly for Monitoring

2017-08-28 Thread Vladimir Ozerov
Dima,

Please see the latest comments in the ticket [1]. There is a special
specification called SQLSTATE governing what error codes are thrown from
SQL operations [2]. It applies to both JDBC and ODBC. Apart from the
standard codes, a database vendor can add its own codes as a separate field,
or even extend the standard's error codes. However, as a first iteration
we should start respecting the SQLSTATE spec without adding our own
Ignite-specific error codes.

[1] https://issues.apache.org/jira/browse/IGNITE-5620
[2]
https://www.ibm.com/support/knowledgecenter/en/SSEPEK_10.0.0/codes/src/tpc/db2z_sqlstatevalues.html#db2z_sqlstatevalues__code07
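For reference, SQLSTATE is already surfaced through the standard JDBC API, so
once the drivers respect the spec, a client can branch on the five-character
code with no Ignite-specific machinery. A minimal sketch (the connection URL
and query are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class SqlStateDemo {
        public static void main(String[] args) {
            // Placeholder URL; substitute whatever JDBC driver/endpoint is in use.
            try (Connection conn = DriverManager.getConnection("jdbc:ignite://127.0.0.1/");
                 Statement stmt = conn.createStatement()) {
                stmt.execute("SELECT * FROM missing_table");
            } catch (SQLException e) {
                // Five-character SQLSTATE: the first two chars are the class,
                // the remaining three the subclass (e.g. "42000" for syntax errors).
                String sqlState = e.getSQLState();
                int vendorCode = e.getErrorCode(); // vendor-specific numeric code, if any
                System.err.println("SQLSTATE=" + sqlState
                    + " class=" + (sqlState == null ? "?" : sqlState.substring(0, 2))
                    + " vendorCode=" + vendorCode);
            }
        }
    }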

On Mon, Aug 28, 2017 at 3:23 PM, Dmitriy Setrakyan 
wrote:

> On Mon, Aug 28, 2017 at 1:22 AM, Vladimir Ozerov 
> wrote:
>
> > IGNITE-5620 is about error codes thrown from drivers. This is a completely
> > different story, as every driver has a specification with its own specific
> > error codes. There is no common denominator.
> >
>
> Vova, I am not sure I understand. I would expect that drivers should
> provide the same SQL error codes as the underlying database. Perhaps
> drivers have their own custom codes for errors in the driver itself, not in
> SQL.
>
> Can you please clarify?
>
>
> >
> > On Thu, Aug 17, 2017 at 11:10 PM, Denis Magda  wrote:
> >
> > > Vladimir,
> > >
> > > I would disagree. In IGNITE-5620 we’re going to introduce some constant
> > > error codes and prepare a sheet that will elaborate on every error. That’s
> > > part of a bigger endeavor in which the whole platform should be covered by
> > > special unique IDs for errors, warnings, and events.
> > >
> > > Now, we need to agree at least on the IDs range for SQL.
> > >
> > > —
> > > Denis
> > >
> > > > On Aug 15, 2017, at 11:10 PM, Vladimir Ozerov 
> > > wrote:
> > > >
> > > > Denis,
> > > >
> > > > IGNITE-5620 is a completely different thing. Let's not mix cluster
> > > > monitoring and parser errors.
> > > >
> > > > On Wed, Aug 16, 2017 at 2:57, Denis Magda :
> > > >
> > > >> Alexey,
> > > >>
> > > >> I didn't know that an improvement such as consistent IDs for errors and
> > > >> events can be used as an integration point with the DevOps tools. Thanks
> > > >> for sharing your experience with us.
> > > >>
> > > >> Would you step in as an architect for this task and create a JIRA ticket
> > > >> with all the required information?
> > > >>
> > > >> In general, we’ve already planned to do something around this starting
> > > >> with SQL:
> > > >> https://issues.apache.org/jira/browse/IGNITE-5620 <
> > > >> https://issues.apache.org/jira/browse/IGNITE-5620>
> > > >>
> > > >> It makes sense to consider your input before the work on IGNITE-5620 is
> > > >> started.
> > > >>
> > > >> —
> > > >> Denis
> > > >>
> > > >>> On Aug 15, 2017, at 10:56 AM, Alexey Kukushkin <
> > > >> alexeykukush...@yahoo.com.INVALID> wrote:
> > > >>>
> > > >>> Hi Alexey,
> > > >>> A nice thing about delegating alerting to 3rd party enterprise systems
> > > >>> is that those systems already deal with lots of things, including
> > > >>> distributed apps.
> > > >>> What is needed from Ignite is to consistently write to log files (again,
> > > >>> that means stable event IDs, proper event granularity, no repetition,
> > > >>> documentation). It would be the 3rd party monitoring system's
> > > >>> responsibility to monitor log files on all nodes, filter, aggregate,
> > > >>> process, visualize, and notify on events.
> > > >>> How a monitoring tool would deal with an event like "node left":
> > > >>> The only thing needed from Ignite is to write an entry like the one
> > > >>> below to log files on all Ignite servers. In this example, 3300 identifies
> > > >>> this "node left" event and will never change in the future even if the
> > > >>> text description changes:
> > > >>> [2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the
> > > >>> cluster
> > > >>> Then we document somewhere on the web that Ignite has event 3300 and it
> > > >>> means a node left the cluster. Maybe provide documentation on how to deal
> > > >>> with it. Some examples: Oracle Web Cache events:
> > > >>> https://docs.oracle.com/cd/B14099_19/caching.1012/b14046/event.htm#sthref2393
> > > >>> MS SQL Server events:
> > > >>> https://msdn.microsoft.com/en-us/library/cc645603(v=sql.105).aspx
> > > >>> That is all for Ignite! Everything else is handled by the specific
> > > >>> monitoring system configured by DevOps on the customer side.
> > > >>> Based on Ignite documentation similar to the above, the DevOps team of a
> > > >>> company where Ignite is going to be used will configure their monitoring
> > > >>> system to understand Ignite events. Consider the "node left" event as an
> > > >>> example.
> > > >>> - This event is output on every node, but DevOps do not want to be
> > > >>> notified many times. To address this, 

Re: Ignite not friendly for Monitoring

2017-08-28 Thread Dmitriy Setrakyan
On Mon, Aug 28, 2017 at 1:22 AM, Vladimir Ozerov 
wrote:

> IGNITE-5620 is about error codes thrown from drivers. This is a completely
> different story, as every driver has a specification with its own specific
> error codes. There is no common denominator.
>

Vova, I am not sure I understand. I would expect that drivers should
provide the same SQL error codes as the underlying database. Perhaps
drivers have their own custom codes for errors in the driver itself, not in
SQL.

Can you please clarify?


>
> On Thu, Aug 17, 2017 at 11:10 PM, Denis Magda  wrote:
>
> > Vladimir,
> >
> > I would disagree. In IGNITE-5620 we’re going to introduce some constant
> > error codes and prepare a sheet that will elaborate on every error. That’s
> > part of a bigger endeavor in which the whole platform should be covered by
> > special unique IDs for errors, warnings, and events.
> >
> > Now, we need to agree at least on the IDs range for SQL.
> >
> > —
> > Denis
> >
> > > On Aug 15, 2017, at 11:10 PM, Vladimir Ozerov 
> > wrote:
> > >
> > > Denis,
> > >
> > > IGNITE-5620 is a completely different thing. Let's not mix cluster
> > > monitoring and parser errors.
> > >
> > > On Wed, Aug 16, 2017 at 2:57, Denis Magda :
> > >
> > >> Alexey,
> > >>
> > >> I didn't know that an improvement such as consistent IDs for errors and
> > >> events can be used as an integration point with the DevOps tools. Thanks
> > >> for sharing your experience with us.
> > >>
> > >> Would you step in as an architect for this task and create a JIRA ticket
> > >> with all the required information?
> > >>
> > >> In general, we’ve already planned to do something around this starting
> > >> with SQL:
> > >> https://issues.apache.org/jira/browse/IGNITE-5620 <
> > >> https://issues.apache.org/jira/browse/IGNITE-5620>
> > >>
> > >> It makes sense to consider your input before the work on IGNITE-5620 is
> > >> started.
> > >>
> > >> —
> > >> Denis
> > >>
> > >>> On Aug 15, 2017, at 10:56 AM, Alexey Kukushkin <
> > >> alexeykukush...@yahoo.com.INVALID> wrote:
> > >>>
> > >>> Hi Alexey,
> > >>> A nice thing about delegating alerting to 3rd party enterprise systems
> > >>> is that those systems already deal with lots of things, including
> > >>> distributed apps.
> > >>> What is needed from Ignite is to consistently write to log files (again,
> > >>> that means stable event IDs, proper event granularity, no repetition,
> > >>> documentation). It would be the 3rd party monitoring system's
> > >>> responsibility to monitor log files on all nodes, filter, aggregate,
> > >>> process, visualize, and notify on events.
> > >>> How a monitoring tool would deal with an event like "node left":
> > >>> The only thing needed from Ignite is to write an entry like the one
> > >>> below to log files on all Ignite servers. In this example, 3300 identifies
> > >>> this "node left" event and will never change in the future even if the
> > >>> text description changes:
> > >>> [2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the
> > >>> cluster
> > >>> Then we document somewhere on the web that Ignite has event 3300 and it
> > >>> means a node left the cluster. Maybe provide documentation on how to deal
> > >>> with it. Some examples: Oracle Web Cache events:
> > >>> https://docs.oracle.com/cd/B14099_19/caching.1012/b14046/event.htm#sthref2393
> > >>> MS SQL Server events:
> > >>> https://msdn.microsoft.com/en-us/library/cc645603(v=sql.105).aspx
> > >>> That is all for Ignite! Everything else is handled by the specific
> > >>> monitoring system configured by DevOps on the customer side.
> > >>> Based on Ignite documentation similar to the above, the DevOps team of a
> > >>> company where Ignite is going to be used will configure their monitoring
> > >>> system to understand Ignite events. Consider the "node left" event as an
> > >>> example.
> > >>> - This event is output on every node, but DevOps do not want to be
> > >>> notified many times. To address this, they will build an "Ignite model"
> > >>> where there will be a parent-child dependency between the components
> > >>> "Ignite Cluster" and "Ignite Node". For example, this is how you do it in
> > >>> Nagios:
> > >>> https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/dependencies.html
> > >>> and this is how you do it in Microsoft SCSM:
> > >>> https://docs.microsoft.com/en-us/system-center/scsm/auth-classes. Then
> > >>> DevOps will configure "node left" monitors in SCSM (or "checks" in
> > >>> Nagios) for the parent "Ignite Cluster" and child "Ignite Service"
> > >>> components. State change (OK -> WARNING) and notification (email, SMS,
> > >>> whatever) will be configured only for the "Ignite Cluster"'s "node left"
> > >>> monitor.
> > >>> - Now suppose a node left. The "node left" monitor (that uses a log file
> > >>> monitoring plugin) on "Ignite Node" will detect the event and pass it to
> > >>> the parent. This will trigger 

Re: Ignite not friendly for Monitoring

2017-08-28 Thread Vladimir Ozerov
IGNITE-5620 is about error codes thrown from drivers. This is a completely
different story, as every driver has a specification with its own specific
error codes. There is no common denominator.

On Thu, Aug 17, 2017 at 11:10 PM, Denis Magda  wrote:

> Vladimir,
>
> I would disagree. In IGNITE-5620 we’re going to introduce some constant
> error codes and prepare a sheet that will elaborate on every error. That’s
> part of a bigger endeavor in which the whole platform should be covered by
> special unique IDs for errors, warnings, and events.
>
> Now, we need to agree at least on the IDs range for SQL.
>
> —
> Denis
>
> > On Aug 15, 2017, at 11:10 PM, Vladimir Ozerov 
> wrote:
> >
> > Denis,
> >
> > IGNITE-5620 is a completely different thing. Let's not mix cluster
> > monitoring and parser errors.
> >
> > On Wed, Aug 16, 2017 at 2:57, Denis Magda :
> >
> >> Alexey,
> >>
> >> I didn't know that an improvement such as consistent IDs for errors and
> >> events can be used as an integration point with the DevOps tools. Thanks
> >> for sharing your experience with us.
> >>
> >> Would you step in as an architect for this task and create a JIRA ticket
> >> with all the required information?
> >>
> >> In general, we’ve already planned to do something around this starting
> >> with SQL:
> >> https://issues.apache.org/jira/browse/IGNITE-5620 <
> >> https://issues.apache.org/jira/browse/IGNITE-5620>
> >>
> >> It makes sense to consider your input before the work on IGNITE-5620 is
> >> started.
> >>
> >> —
> >> Denis
> >>
> >>> On Aug 15, 2017, at 10:56 AM, Alexey Kukushkin <
> >> alexeykukush...@yahoo.com.INVALID> wrote:
> >>>
> >>> Hi Alexey,
> >>> A nice thing about delegating alerting to 3rd party enterprise systems
> >>> is that those systems already deal with lots of things, including
> >>> distributed apps.
> >>> What is needed from Ignite is to consistently write to log files (again,
> >>> that means stable event IDs, proper event granularity, no repetition,
> >>> documentation). It would be the 3rd party monitoring system's
> >>> responsibility to monitor log files on all nodes, filter, aggregate,
> >>> process, visualize, and notify on events.
> >>> How a monitoring tool would deal with an event like "node left":
> >>> The only thing needed from Ignite is to write an entry like the one below
> >>> to log files on all Ignite servers. In this example, 3300 identifies this
> >>> "node left" event and will never change in the future even if the text
> >>> description changes:
> >>> [2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the
> >>> cluster
> >>> Then we document somewhere on the web that Ignite has event 3300 and it
> >>> means a node left the cluster. Maybe provide documentation on how to deal
> >>> with it. Some examples: Oracle Web Cache events:
> >>> https://docs.oracle.com/cd/B14099_19/caching.1012/b14046/event.htm#sthref2393
> >>> MS SQL Server events:
> >>> https://msdn.microsoft.com/en-us/library/cc645603(v=sql.105).aspx
> >>> That is all for Ignite! Everything else is handled by the specific
> >>> monitoring system configured by DevOps on the customer side.
> >>> Based on Ignite documentation similar to the above, the DevOps team of a
> >>> company where Ignite is going to be used will configure their monitoring
> >>> system to understand Ignite events. Consider the "node left" event as an
> >>> example.
> >>> - This event is output on every node, but DevOps do not want to be
> >>> notified many times. To address this, they will build an "Ignite model"
> >>> where there will be a parent-child dependency between the components "Ignite
> >>> Cluster" and "Ignite Node". For example, this is how you do it in Nagios:
> >>> https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/dependencies.html
> >>> and this is how you do it in Microsoft SCSM:
> >>> https://docs.microsoft.com/en-us/system-center/scsm/auth-classes. Then
> >>> DevOps will configure "node left" monitors in SCSM (or "checks" in
> >>> Nagios) for the parent "Ignite Cluster" and child "Ignite Service"
> >>> components. State change (OK -> WARNING) and notification (email, SMS,
> >>> whatever) will be configured only for the "Ignite Cluster"'s "node left"
> >>> monitor.
> >>> - Now suppose a node left. The "node left" monitor (that uses a log file
> >>> monitoring plugin) on "Ignite Node" will detect the event and pass it to
> >>> the parent. This will trigger an "Ignite Cluster" state change from OK to
> >>> WARNING and send a notification. No more notifications will be sent unless
> >>> the "Ignite Cluster" state is reset back to OK, which happens either
> >>> manually, on timeout, or automatically on "node joined".
> >>> This was just FYI. We, Ignite developers, do not care about how
> >>> monitoring works - that is the responsibility of the customer's DevOps. Our
> >>> responsibility is consistent event logging.
> >>> Thank you!
> >>>
> >>>
> >>> Best regards, Alexey
> >>>
> >>>
> >>> On Tuesday, August 15, 2017, 6:16:25 PM 

Re: Ignite not friendly for Monitoring

2017-08-16 Thread Vladimir Ozerov
Denis,

IGNITE-5620 is a completely different thing. Let's not mix cluster
monitoring and parser errors.

On Wed, Aug 16, 2017 at 2:57, Denis Magda :

> Alexey,
>
> I didn't know that an improvement such as consistent IDs for errors and
> events can be used as an integration point with the DevOps tools. Thanks
> for sharing your experience with us.
>
> Would you step in as an architect for this task and create a JIRA ticket
> with all the required information?
>
> In general, we’ve already planned to do something around this starting
> with SQL:
> https://issues.apache.org/jira/browse/IGNITE-5620 <
> https://issues.apache.org/jira/browse/IGNITE-5620>
>
> It makes sense to consider your input before the work on IGNITE-5620 is
> started.
>
> —
> Denis
>
> > On Aug 15, 2017, at 10:56 AM, Alexey Kukushkin <
> alexeykukush...@yahoo.com.INVALID> wrote:
> >
> > Hi Alexey,
> > A nice thing about delegating alerting to 3rd party enterprise systems
> > is that those systems already deal with lots of things, including
> > distributed apps.
> > What is needed from Ignite is to consistently write to log files (again,
> > that means stable event IDs, proper event granularity, no repetition,
> > documentation). It would be the 3rd party monitoring system's responsibility
> > to monitor log files on all nodes, filter, aggregate, process, visualize,
> > and notify on events.
> > How a monitoring tool would deal with an event like "node left":
> > The only thing needed from Ignite is to write an entry like the one below
> > to log files on all Ignite servers. In this example, 3300 identifies this
> > "node left" event and will never change in the future even if the text
> > description changes:
> > [2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the
> > cluster
> > Then we document somewhere on the web that Ignite has event 3300 and it
> > means a node left the cluster. Maybe provide documentation on how to deal
> > with it. Some examples: Oracle Web Cache events:
> > https://docs.oracle.com/cd/B14099_19/caching.1012/b14046/event.htm#sthref2393
> > MS SQL Server events:
> > https://msdn.microsoft.com/en-us/library/cc645603(v=sql.105).aspx
> > That is all for Ignite! Everything else is handled by the specific
> > monitoring system configured by DevOps on the customer side.
> > Based on Ignite documentation similar to the above, the DevOps team of a
> > company where Ignite is going to be used will configure their monitoring
> > system to understand Ignite events. Consider the "node left" event as an example.
> > - This event is output on every node, but DevOps do not want to be
> > notified many times. To address this, they will build an "Ignite model"
> > where there will be a parent-child dependency between the components "Ignite
> > Cluster" and "Ignite Node". For example, this is how you do it in Nagios:
> > https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/dependencies.html
> > and this is how you do it in Microsoft SCSM:
> > https://docs.microsoft.com/en-us/system-center/scsm/auth-classes. Then
> > DevOps will configure "node left" monitors in SCSM (or "checks" in
> > Nagios) for the parent "Ignite Cluster" and child "Ignite Service" components.
> > State change (OK -> WARNING) and notification (email, SMS, whatever) will
> > be configured only for the "Ignite Cluster"'s "node left" monitor.
> > - Now suppose a node left. The "node left" monitor (that uses a log file
> > monitoring plugin) on "Ignite Node" will detect the event and pass it to the
> > parent. This will trigger an "Ignite Cluster" state change from OK to WARNING
> > and send a notification. No more notifications will be sent unless the "Ignite
> > Cluster" state is reset back to OK, which happens either manually, on
> > timeout, or automatically on "node joined".
> > This was just FYI. We, Ignite developers, do not care about how
> > monitoring works - that is the responsibility of the customer's DevOps. Our
> > responsibility is consistent event logging.
> > Thank you!
> >
> >
> > Best regards, Alexey
> >
> >
> > On Tuesday, August 15, 2017, 6:16:25 PM GMT+3, Alexey Kuznetsov <
> akuznet...@apache.org> wrote:
> >
> > Alexey,
> >
> > How are you going to deal with the distributed nature of an Ignite cluster?
> > And how do you propose to handle node restarts/stops?
> >
> > On Tue, Aug 15, 2017 at 9:12 PM, Alexey Kukushkin <
> > alexeykukush...@yahoo.com.invalid> wrote:
> >
> >> Hi Denis,
> >> Monitoring tools simply watch event logs for patterns (regex in the case of
> >> unstructured logs like text files). A stable (not changing in new releases)
> >> event ID identifying a specific issue would be such a pattern.
> >> We need to introduce such event IDs according to the principles I
> >> described in my previous mail.
> >> Best regards, Alexey
> >>
> >>
> >> On Tuesday, August 15, 2017, 4:53:05 AM GMT+3, Denis Magda <
> >> dma...@apache.org> wrote:
> >>
> >> Hello Alexey,
> >>
> >> Thanks for the detailed input.
> >>
> >> Assuming that Ignite supported the suggested event-based model, how can
> >> it be integrated with the mentioned tools 

Re: Ignite not friendly for Monitoring

2017-08-15 Thread Denis Magda
Alexey,

I didn't know that an improvement such as consistent IDs for errors and events
can be used as an integration point with the DevOps tools. Thanks for sharing
your experience with us.

Would you step in as an architect for this task and create a JIRA ticket with
all the required information?

In general, we’ve already planned to do something around this starting with SQL:
https://issues.apache.org/jira/browse/IGNITE-5620 


It makes sense to consider your input before the work on IGNITE-5620 is started.

—
Denis

> On Aug 15, 2017, at 10:56 AM, Alexey Kukushkin 
>  wrote:
> 
> Hi Alexey,
> A nice thing about delegating alerting to 3rd party enterprise systems is
> that those systems already deal with lots of things, including distributed
> apps.
> What is needed from Ignite is to consistently write to log files (again, that
> means stable event IDs, proper event granularity, no repetition,
> documentation). It would be the 3rd party monitoring system's responsibility to
> monitor log files on all nodes, filter, aggregate, process, visualize, and
> notify on events.
> How a monitoring tool would deal with an event like "node left":
> The only thing needed from Ignite is to write an entry like the one below to
> log files on all Ignite servers. In this example, 3300 identifies this "node
> left" event and will never change in the future even if the text description
> changes:
> [2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the cluster
> Then we document somewhere on the web that Ignite has event 3300 and it means
> a node left the cluster. Maybe provide documentation on how to deal with it.
> Some examples: Oracle Web Cache events:
> https://docs.oracle.com/cd/B14099_19/caching.1012/b14046/event.htm#sthref2393
> MS SQL Server events:
> https://msdn.microsoft.com/en-us/library/cc645603(v=sql.105).aspx
> That is all for Ignite! Everything else is handled by the specific monitoring
> system configured by DevOps on the customer side.
> Based on Ignite documentation similar to the above, the DevOps team of a company
> where Ignite is going to be used will configure their monitoring system to
> understand Ignite events. Consider the "node left" event as an example.
> - This event is output on every node, but DevOps do not want to be notified
> many times. To address this, they will build an "Ignite model" where there
> will be a parent-child dependency between the components "Ignite Cluster" and
> "Ignite Node". For example, this is how you do it in Nagios:
> https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/dependencies.html
> and this is how you do it in Microsoft SCSM:
> https://docs.microsoft.com/en-us/system-center/scsm/auth-classes. Then DevOps
> will configure "node left" monitors in SCSM (or "checks" in Nagios) for the
> parent "Ignite Cluster" and child "Ignite Service" components. State change
> (OK -> WARNING) and notification (email, SMS, whatever) will be configured
> only for the "Ignite Cluster"'s "node left" monitor.
> - Now suppose a node left. The "node left" monitor (that uses a log file
> monitoring plugin) on "Ignite Node" will detect the event and pass it to the
> parent. This will trigger an "Ignite Cluster" state change from OK to WARNING
> and send a notification. No more notifications will be sent unless the "Ignite
> Cluster" state is reset back to OK, which happens either manually, on timeout,
> or automatically on "node joined".
> This was just FYI. We, Ignite developers, do not care about how monitoring
> works - that is the responsibility of the customer's DevOps. Our responsibility
> is consistent event logging.
> Thank you!
> 
> 
> Best regards, Alexey
> 
> 
> On Tuesday, August 15, 2017, 6:16:25 PM GMT+3, Alexey Kuznetsov 
>  wrote:
> 
> Alexey,
> 
> How are you going to deal with the distributed nature of an Ignite cluster?
> And how do you propose to handle node restarts/stops?
> 
> On Tue, Aug 15, 2017 at 9:12 PM, Alexey Kukushkin <
> alexeykukush...@yahoo.com.invalid> wrote:
> 
>> Hi Denis,
>> Monitoring tools simply watch event logs for patterns (regex in the case of
>> unstructured logs like text files). A stable (not changing in new releases)
>> event ID identifying a specific issue would be such a pattern.
>> We need to introduce such event IDs according to the principles I
>> described in my previous mail.
>> Best regards, Alexey
>> 
>> 
>> On Tuesday, August 15, 2017, 4:53:05 AM GMT+3, Denis Magda <
>> dma...@apache.org> wrote:
>> 
>> Hello Alexey,
>> 
>> Thanks for the detailed input.
>> 
>> Assuming that Ignite supported the suggested event-based model, how can
>> it be integrated with the mentioned tools like DynaTrace or Nagios? Is this all
>> we need?
>> 
>> —
>> Denis
>> 
>>> On Aug 14, 2017, at 5:02 AM, Alexey Kukushkin wrote:
>>> 
>>> Igniters,
>>> While preparing some Ignite materials for Administrators I 

Re: Ignite not friendly for Monitoring

2017-08-15 Thread Alexey Kukushkin
Hi Alexey,
A nice thing about delegating alerting to 3rd party enterprise systems is that
those systems already deal with lots of things, including distributed apps.
What is needed from Ignite is to consistently write to log files (again, that
means stable event IDs, proper event granularity, no repetition,
documentation). It would be the 3rd party monitoring system's responsibility to
monitor log files on all nodes, filter, aggregate, process, visualize, and
notify on events.
How a monitoring tool would deal with an event like "node left":
The only thing needed from Ignite is to write an entry like the one below to log
files on all Ignite servers. In this example, 3300 identifies this "node left"
event and will never change in the future even if the text description changes:
[2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the cluster
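To show how stable the monitoring side stays, here is a minimal Java sketch of
the rule a log watcher could use (the ID 3300 and the layout come from the
example above; the class and pattern are hypothetical):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class NodeLeftWatcher {
        // Matches: [2017-09-01 10:00:14] [WARN] 3300 Node ... left the cluster
        static final Pattern NODE_LEFT =
            Pattern.compile("^\\[[^\\]]+\\] \\[WARN\\] 3300 (.*)$");

        public static void main(String[] args) {
            String line =
                "[2017-09-01 10:00:14] [WARN] 3300 Node DF2345F-XCVDS4-34ETJH left the cluster";
            Matcher m = NODE_LEFT.matcher(line);
            if (m.matches()) {
                // The numeric ID is the contract; the free-text tail may change
                // between releases without breaking this rule.
                System.out.println("node-left detected: " + m.group(1));
            }
        }
    }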
Then we document somewhere on the web that Ignite has event 3300 and it means a
node left the cluster. Maybe provide documentation on how to deal with it. Some
examples: Oracle Web Cache events:
https://docs.oracle.com/cd/B14099_19/caching.1012/b14046/event.htm#sthref2393
MS SQL Server events:
https://msdn.microsoft.com/en-us/library/cc645603(v=sql.105).aspx
That is all for Ignite! Everything else is handled by the specific monitoring
system configured by DevOps on the customer side.
Based on Ignite documentation similar to the above, the DevOps team of a company
where Ignite is going to be used will configure their monitoring system to
understand Ignite events. Consider the "node left" event as an example.
- This event is output on every node, but DevOps do not want to be notified many
times. To address this, they will build an "Ignite model" where there will be a
parent-child dependency between the components "Ignite Cluster" and "Ignite Node".
For example, this is how you do it in Nagios:
https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/dependencies.html
and this is how you do it in Microsoft SCSM:
https://docs.microsoft.com/en-us/system-center/scsm/auth-classes. Then DevOps
will configure "node left" monitors in SCSM (or "checks" in Nagios) for the
parent "Ignite Cluster" and child "Ignite Service" components. State change (OK
-> WARNING) and notification (email, SMS, whatever) will be configured only for
the "Ignite Cluster"'s "node left" monitor.
- Now suppose a node left. The "node left" monitor (that uses a log file
monitoring plugin) on "Ignite Node" will detect the event and pass it to the
parent. This will trigger an "Ignite Cluster" state change from OK to WARNING and
send a notification. No more notifications will be sent unless the "Ignite
Cluster" state is reset back to OK, which happens either manually, on timeout, or
automatically on "node joined".
This was just FYI. We, Ignite developers, do not care about how monitoring
works - that is the responsibility of the customer's DevOps. Our responsibility
is consistent event logging.
Thank you!


Best regards, Alexey


On Tuesday, August 15, 2017, 6:16:25 PM GMT+3, Alexey Kuznetsov 
 wrote:

Alexey,

How are you going to deal with the distributed nature of an Ignite cluster?
And how do you propose to handle node restarts/stops?

On Tue, Aug 15, 2017 at 9:12 PM, Alexey Kukushkin <
alexeykukush...@yahoo.com.invalid> wrote:

> Hi Denis,
> Monitoring tools simply watch event logs for patterns (regex in the case of
> unstructured logs like text files). A stable (not changing in new releases)
> event ID identifying a specific issue would be such a pattern.
> We need to introduce such event IDs according to the principles I
> described in my previous mail.
> Best regards, Alexey
>
>
> On Tuesday, August 15, 2017, 4:53:05 AM GMT+3, Denis Magda <
> dma...@apache.org> wrote:
>
> Hello Alexey,
>
> Thanks for the detailed input.
>
> Assuming that Ignite supported the suggested event-based model, how can
> it be integrated with the mentioned tools like DynaTrace or Nagios? Is this all
> we need?
>
> —
> Denis
>
> > On Aug 14, 2017, at 5:02 AM, Alexey Kukushkin wrote:
> >
> > Igniters,
> > While preparing some Ignite materials for Administrators, I found Ignite
> > is not friendly for such a critical DevOps practice as monitoring.
> > TL;DR: I think Ignite lacks structured descriptions of abnormal events,
> > with references to event IDs in the logs that do not change as new versions
> > are released.
> > MORE DETAILS
> > I call an application “monitoring friendly” if it allows DevOps to:
> > 1. immediately receive a notification (email, SMS, etc.)
> > 2. understand what the problem is without involving developers
> > 3. provide automated recovery action.
> >
> > Large enterprises do not implement custom solutions. They usually use
> > tools like DynaTrace, Nagios, SCOM, etc. to monitor all apps in the
> > enterprise consistently. All such tools have a similar architecture, providing
> > a dashboard showing apps as “green/yellow/red”, and numerous “connectors”
> > to look for events in text 

Re: Ignite not friendly for Monitoring

2017-08-15 Thread Alexey Kukushkin
Hi Denis,
Monitoring tools simply watch event logs for patterns (regex in the case of
unstructured logs like text files). A stable (not changing in new releases)
event ID identifying a specific issue would be such a pattern.
We need to introduce such event IDs according to the principles I described in 
my previous mail.
Best regards, Alexey


On Tuesday, August 15, 2017, 4:53:05 AM GMT+3, Denis Magda  
wrote:

Hello Alexey,

Thanks for the detailed input.

Assuming that Ignite supported the suggested event-based model, how can it be
integrated with the mentioned tools like DynaTrace or Nagios? Is this all we need?

—
Denis
 
> On Aug 14, 2017, at 5:02 AM, Alexey Kukushkin 
>  wrote:
> 
> Igniters,
> While preparing some Ignite materials for Administrators, I found Ignite is
> not friendly for such a critical DevOps practice as monitoring.
> TL;DR: I think Ignite lacks structured descriptions of abnormal events, with
> references to event IDs in the logs that do not change as new versions are released.
> MORE DETAILS
> I call an application “monitoring friendly” if it allows DevOps to:
> 1. immediately receive a notification (email, SMS, etc.)
> 2. understand what the problem is without involving developers
> 3. provide automated recovery action.
> 
> Large enterprises do not implement custom solutions. They usually use tools 
> like DynaTrace, Nagios, SCOM, etc. to monitor all apps in the enterprise 
> consistently. All such tools have similar architecture providing a dashboard 
> showing apps as “green/yellow/red”, and numerous “connectors” to look for 
> events in text logs, ESBs, database tables, etc.
> 
> For each app DevOps build a “health model” - a diagram displaying the app’s 
> “manageable” components and the app boundaries. A “manageable” component is 
> something that can be started/stopped/configured in isolation. “System 
> boundary” is a list of external apps that the monitored app interacts with.
> 
> The main attribute of a manageable component is a list of “operationally 
> significant events”. Those are the events that DevOps can do something with. 
> For example, “failed to connect to cache store” is significant, while “user 
> input validation failed” is not.
> 
> Events shall be as specific as possible so that DevOps do not spend time on
> further analysis. For example, a “database failure” event is not good. There
> should be “database connection failure”, “invalid database schema”, “database 
> authentication failure”, etc. events.  
> 
> An “event” is NOT the same as an exception occurring in the code. Events identify
> a specific problem from the DevOps point of view. For example, even if a
> “connection to cache store failed” exception might be thrown from several
> places in the code, it is still the same event. On the other hand, even if
> SqlServerConnectionTimeout and OracleConnectionTimeout exceptions might be
> caught in the same place, those are different events, since MS SQL Server and
> Oracle are usually handled by different DevOps groups in large enterprises!
> 
> The operationally significant event IDs must be stable: they must not change 
> from one release to another. This is like a contract between developers and 
> DevOps.
> 
> This should be the developer’s responsibility to publish and maintain a table 
> with attributes:
> 
> - Event ID
> - Severity: Critical (Red) - the system is not operational; Warning (Yellow) 
> - the system is operational but health is degraded; None - informational only.
> - Description: concise but enough for DevOps to act without developer’s help
> - Recovery actions: what DevOps shall do to fix the issue without developer’s 
> help. DevOps might create automated recovery scripts based on this 
> information.
> 
> For example:
> 10100 - Critical - Could not connect to ZooKeeper to discover nodes - 1)
> Open the Ignite configuration and find the ZooKeeper connection string 2) Make
> sure ZooKeeper is running
> 10200 - Warning - Ignite node left the cluster.
> 
> Back to Ignite: it looks to me like we do not design for operations as described
> above. We have no event IDs: our logging is subject to change in new versions,
> so any patterns DevOps might use to detect significant events would stop
> working after an upgrade.
> 
> If I am not the only one who has such concerns, then we might open a ticket
> to address this.
> 
> 
> Best regards, Alexey


Re: Ignite not friendly for Monitoring

2017-08-14 Thread Denis Magda
Hello Alexey,

Thanks for the detailed input.

Assuming that Ignite supported the suggested event-based model, how can it be
integrated with the mentioned tools like DynaTrace or Nagios? Is this all we need?

—
Denis
 
> On Aug 14, 2017, at 5:02 AM, Alexey Kukushkin 
>  wrote:
> 
> Igniters,
> While preparing some Ignite materials for Administrators, I found Ignite is
> not friendly for such a critical DevOps practice as monitoring.
> TL;DR: I think Ignite lacks structured descriptions of abnormal events, with
> references to event IDs in the logs that do not change as new versions are released.
> MORE DETAILS
> I call an application “monitoring friendly” if it allows DevOps to:
> 1. immediately receive a notification (email, SMS, etc.)
> 2. understand what the problem is without involving developers
> 3. provide automated recovery action.
> 
> Large enterprises do not implement custom solutions. They usually use tools 
> like DynaTrace, Nagios, SCOM, etc. to monitor all apps in the enterprise 
> consistently. All such tools have similar architecture providing a dashboard 
> showing apps as “green/yellow/red”, and numerous “connectors” to look for 
> events in text logs, ESBs, database tables, etc.
> 
> For each app DevOps build a “health model” - a diagram displaying the app’s 
> “manageable” components and the app boundaries. A “manageable” component is 
> something that can be started/stopped/configured in isolation. “System 
> boundary” is a list of external apps that the monitored app interacts with.
> 
> The main attribute of a manageable component is a list of “operationally 
> significant events”. Those are the events that DevOps can do something with. 
> For example, “failed to connect to cache store” is significant, while “user 
> input validation failed” is not.
> 
> Events shall be as specific as possible so that DevOps do not spend time on
> further analysis. For example, a “database failure” event is not good. There
> should be “database connection failure”, “invalid database schema”, “database 
> authentication failure”, etc. events.  
> 
> An “event” is NOT the same as an exception occurring in the code. Events identify
> a specific problem from the DevOps point of view. For example, even if a
> “connection to cache store failed” exception might be thrown from several
> places in the code, it is still the same event. On the other hand, even if
> SqlServerConnectionTimeout and OracleConnectionTimeout exceptions might be
> caught in the same place, those are different events, since MS SQL Server and
> Oracle are usually handled by different DevOps groups in large enterprises!
> 
> The operationally significant event IDs must be stable: they must not change 
> from one release to another. This is like a contract between developers and 
> DevOps.
> 
> This should be the developer’s responsibility to publish and maintain a table 
> with attributes:
> 
> - Event ID
> - Severity: Critical (Red) - the system is not operational; Warning (Yellow) 
> - the system is operational but health is degraded; None - informational only.
> - Description: concise but enough for DevOps to act without developer’s help
> - Recovery actions: what DevOps shall do to fix the issue without developer’s 
> help. DevOps might create automated recovery scripts based on this 
> information.
> 
> For example:
> 10100 - Critical - Could not connect to ZooKeeper to discover nodes - 1)
> Open the Ignite configuration and find the ZooKeeper connection string 2) Make
> sure ZooKeeper is running
> 10200 - Warning - Ignite node left the cluster.
> 
> Back to Ignite: it looks to me like we do not design for operations as described
> above. We have no event IDs: our logging is subject to change in new versions,
> so any patterns DevOps might use to detect significant events would stop
> working after an upgrade.
> 
> If I am not the only one who has such concerns, then we might open a ticket
> to address this.
> 
> 
> Best regards, Alexey



Ignite not friendly for Monitoring

2017-08-14 Thread Alexey Kukushkin
Igniters,
While preparing some Ignite materials for Administrators, I found Ignite is not
friendly for such a critical DevOps practice as monitoring.
TL;DR: I think Ignite lacks structured descriptions of abnormal events, with
references to event IDs in the logs that do not change as new versions are released.
MORE DETAILS
I call an application “monitoring friendly” if it allows DevOps to:
1. immediately receive a notification (email, SMS, etc.)
2. understand what the problem is without involving developers
3. provide automated recovery action.

Large enterprises do not implement custom solutions. They usually use tools 
like DynaTrace, Nagios, SCOM, etc. to monitor all apps in the enterprise 
consistently. All such tools have a similar architecture, providing a dashboard
showing apps as “green/yellow/red”, and numerous “connectors” to look for
events in text logs, ESBs, database tables, etc.

For each app, DevOps builds a “health model” - a diagram displaying the app’s
“manageable” components and the app boundaries. A “manageable” component is 
something that can be started/stopped/configured in isolation. “System 
boundary” is a list of external apps that the monitored app interacts with.

The main attribute of a manageable component is a list of “operationally 
significant events”. Those are the events that DevOps can do something with. 
For example, “failed to connect to cache store” is significant, while “user 
input validation failed” is not.

Events shall be as specific as possible so that DevOps do not spend time on
further analysis. For example, a “database failure” event is not good. There
should be “database connection failure”, “invalid database schema”, “database 
authentication failure”, etc. events.  

An “event” is NOT the same as an exception occurring in the code. Events identify
a specific problem from the DevOps point of view. For example, even if a
“connection to cache store failed” exception might be thrown from several
places in the code, it is still the same event. On the other hand, even if
SqlServerConnectionTimeout and OracleConnectionTimeout exceptions might be
caught in the same place, those are different events, since MS SQL Server and
Oracle are usually handled by different DevOps groups in large enterprises!

The operationally significant event IDs must be stable: they must not change 
from one release to another. This is like a contract between developers and 
DevOps.

This should be the developer’s responsibility to publish and maintain a table 
with attributes:
 
- Event ID
- Severity: Critical (Red) - the system is not operational; Warning (Yellow) - 
the system is operational but health is degraded; None - informational only.
- Description: concise but enough for DevOps to act without developer’s help
- Recovery actions: what DevOps shall do to fix the issue without developer’s 
help. DevOps might create automated recovery scripts based on this information.

For example:
10100 - Critical - Could not connect to ZooKeeper to discover nodes - 1) Open
the Ignite configuration and find the ZooKeeper connection string 2) Make sure
ZooKeeper is running
10200 - Warning - Ignite node left the cluster.
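A sketch of what publishing such a table could look like in code; the IDs,
descriptions, and recovery texts below are the illustrative ones from this
message, not an existing Ignite catalog:

    import java.util.HashMap;
    import java.util.Map;

    public class EventCatalog {
        enum Severity { CRITICAL, WARNING, NONE }

        static final class OpsEvent {
            final int id;
            final Severity severity;
            final String description;
            final String recovery;

            OpsEvent(int id, Severity severity, String description, String recovery) {
                this.id = id;
                this.severity = severity;
                this.description = description;
                this.recovery = recovery;
            }
        }

        // The stable part of the contract: IDs never change across releases.
        static final Map<Integer, OpsEvent> CATALOG = new HashMap<>();
        static {
            CATALOG.put(10100, new OpsEvent(10100, Severity.CRITICAL,
                "Could not connect to ZooKeeper to discover nodes",
                "Check the ZooKeeper connection string in the Ignite configuration; "
                    + "make sure ZooKeeper is running"));
            CATALOG.put(10200, new OpsEvent(10200, Severity.WARNING,
                "Ignite node left the cluster",
                "Investigate the node; the cluster remains operational"));
        }

        // Emits a log line whose leading numeric ID is the machine-matchable part.
        static String format(int id, String details) {
            OpsEvent e = CATALOG.get(id);
            return "[" + e.severity + "] " + e.id + " " + e.description
                + (details.isEmpty() ? "" : ": " + details);
        }

        public static void main(String[] args) {
            // -> [WARNING] 10200 Ignite node left the cluster: node DF2345F-XCVDS4-34ETJH
            System.out.println(format(10200, "node DF2345F-XCVDS4-34ETJH"));
        }
    }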

Back to Ignite: it looks to me like we do not design for operations as described
above. We have no event IDs: our logging is subject to change in new versions, so
any patterns DevOps might use to detect significant events would stop
working after an upgrade.

If I am not the only one who has such concerns, then we might open a ticket to
address this.


Best regards, Alexey