[jira] [Comment Edited] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.

2016-02-22 Thread Dongwon Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158023#comment-15158023
 ] 

Dongwon Kim edited comment on FLINK-1502 at 2/23/16 4:44 AM:
-

Let's consider the following scenario:
||Sessions||Node1||Node2||Node3||
|Session1|TM1|TM2|TM3|
|Session2|TM2|TM3|TM1|
|Session3|TM3|TM2|TM1|

After Session1 is finished, Ganglia adds the following metric to the list of 
metrics for Node1:
- Flink.taskmanager.1.gc_time 

After Session2 is finished, Ganglia adds the following metric to the list of 
metrics for Node1:
- Flink.taskmanager.2.gc_time

After Session3 is finished, Ganglia adds the following metric to the list of 
metrics for Node1:
- Flink.taskmanager.3.gc_time

At this point, Ganglia has three metrics for each node.
The problem gets worse if the user has to launch many more TaskManagers.
For example, 500 TaskManagers over multiple sessions will end up creating 
500 metrics for each host.

Wouldn't it be better to assign indexes to TaskManagers scoped to each host?
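
For illustration, here is a minimal Java sketch (a hypothetical helper, not 
Flink code) contrasting the current per-session naming with the host-scoped 
indexing suggested above. Ganglia already groups metrics by host, so only the 
metric name itself matters:

{code:java}
// Hypothetical sketch of the two naming schemes; not Flink code.
public class MetricNames {

    /** Current behavior: the TaskManager ID changes with every session,
     *  so each session adds a new metric to the host's list. */
    static String perSessionName(String taskManagerId, String metric) {
        return "Flink.taskmanager." + taskManagerId + "." + metric;
    }

    /** Proposal: index TaskManagers per host, so the same names are reused across sessions. */
    static String hostScopedName(int indexOnHost, String metric) {
        return "Flink.taskmanager." + indexOnHost + "." + metric;
    }

    public static void main(String[] args) {
        // Three sessions on Node1 -> three distinct metrics under the current scheme ...
        for (String id : new String[] {"1", "2", "3"}) {
            System.out.println(perSessionName(id, "gc_time"));
        }
        // ... but always the same single metric if the index is scoped to the host.
        System.out.println(hostScopedName(1, "gc_time"));
    }
}
{code}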


was (Author: eastcirclek):
Let's consider the following scenario:
||Sessions||Node1||Node2||Node3||
|Session1|TM1|TM2|TM3|
|Session2|TM2|TM3|TM1|
|Session3|TM3|TM2|TM1|

After Session1 is finished, Node1 has the following metrics:
- cluster.MyCluster.taskmanager.1.gc_time 

After Session2 is finished, Node1 has the following metrics:
- cluster.MyCluster.taskmanager.1.gc_time 
- cluster.MyCluster.taskmanager.2.gc_time

After Session3 is finished, Node1 has the following metrics:
- cluster.MyCluster.taskmanager.1.gc_time 
- cluster.MyCluster.taskmanager.2.gc_time
- cluster.MyCluster.taskmanager.3.gc_time

At this point, a user has to check which of the three metrics above belongs to 
the current session.
The problem gets worse if the user has to launch many more TaskManagers.
For example, 500 TaskManagers over multiple sessions will end up creating 
500 metrics for each host.

Wouldn't it be better to assign indexes to TaskManagers scoped to each host?

p.s.
I'm going to start without considering multiple TaskManagers on the same node 
as we haven't yet reached a consensus.
But I think we still need to develop this discussion further.

> Expose metrics to graphite, ganglia and JMX.
> 
>
> Key: FLINK-1502
> URL: https://issues.apache.org/jira/browse/FLINK-1502
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager, TaskManager
>Affects Versions: 0.9
>Reporter: Robert Metzger
>Assignee: Dongwon Kim
>Priority: Minor
> Fix For: pre-apache
>
>
> The metrics library allows to expose collected metrics easily to other 
> systems such as graphite, ganglia or Java's JVM (VisualVM).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.

2016-02-22 Thread Jamie Grier (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158090#comment-15158090
 ] 

Jamie Grier edited comment on FLINK-1502 at 2/23/16 1:30 AM:
-

[~eastcirclek] Let's define our terms to  make sure we're talking about the 
same thing.

*Session*: A single instance of a Job Manager and some # of TaskManagers 
working together.   A session can be created "on-the-fly" for a single job or 
it can be a long-running thing.  Multiple jobs can start, run, and finish in 
the same session.  Think of the "yarn-session.sh" command.  This creates a 
session outside of any particular job.  This is also what I've meant when I've 
said "cluster".  A Yarn session is a "cluster" that we've spun up for some 
length of time on Yarn.  Another example of a cluster would be a standalone 
install of Flink on some # of machines.

*Job*: A single batch or streaming job that runs on a Flink cluster.

In the above scenario, and if your definition of sessions agrees with mine, 
you would instead have the following. Note that I've named the cluster 
according to the "session" name you've given, because in this case each session 
is really a different (ad-hoc) cluster. When you run a job directly using just 
"flink run -ytm ..." on YARN, you are spinning up an ad-hoc cluster for your job.

After Session 1 is finished, Node 1 would have the following metrics:

- cluster.session1.taskmanager.1.gc_time

After Session 3 is finished, Node 1 would have the following metrics:

- cluster.session1.taskmanager.1.gc_time 
- cluster.session2.taskmanager.2.gc_time
- cluster.session3.taskmanager.3.gc_time

There are many metrics in this case because that's exactly what you want. 
These are JVM-scoped metrics we are talking about, and those are 3 different 
JVMs, not the same one, so it makes total sense for them to have these different 
names/scopes. These metrics have nothing to do with each other and it doesn't 
matter which host they are from. They are scoped to the cluster (or session) 
and the logical TaskManager index, not the host.

The above should not be confused with any host-level metrics we want to report. 
Host-level metrics would be scoped simply by the hostname, so they wouldn't 
grow either.

One more example, hopefully to clarify.  Let's say I spun up a long-running 
cluster (or session) using yarn-session.sh -tm 3.  Now we have a Flink cluster 
running on YARN with no jobs running and three TaskManagers.  We then run three 
different jobs one after another on this cluster.  The metrics would still 
simply be:

- cluster.yarn-session.taskmanager.1.gc_time
- cluster.yarn-session.taskmanager.2.gc_time
- cluster.yarn-session.taskmanager.3.gc_time

No matter how many jobs you ran, this list would not grow, which is natural 
because there have only been 3 TaskManagers. Now, if one of these TaskManagers 
were to fail and be restarted, it would assume the same name -- that's the point 
of using "logical" indexes, so the set of metric names in that case still would 
not be larger than the above.
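
As a minimal sketch of this scheme (assuming the Dropwizard/Codahale metrics 
library already mentioned in this issue; the cluster name and logical index 
below are placeholders), the metric name is derived only from the cluster name 
and the logical TaskManager index, so a restarted TaskManager re-registers under 
exactly the same name:

{code:java}
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class TaskManagerMetricNames {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();

        // Placeholders: the cluster/session name would come from configuration,
        // the logical index from whoever assigns TaskManager indexes.
        String clusterName = "yarn-session";
        int logicalIndex = 1;

        // "cluster.yarn-session.taskmanager.1.gc_time" -- stable across restarts
        // as long as the TaskManager keeps its logical index.
        String name = MetricRegistry.name("cluster", clusterName,
                "taskmanager", Integer.toString(logicalIndex), "gc_time");

        registry.register(name, (Gauge<Long>) () ->
                ManagementFactory.getGarbageCollectorMXBeans().stream()
                        .mapToLong(GarbageCollectorMXBean::getCollectionTime)
                        .sum());
    }
}
{code}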

In the initial case you described above, if you didn't want lots of different 
metrics over time, you could also just give all of your sessions the same name. 
Your metrics are growing because you're spinning up many different clusters 
(sessions) over time with different names each time. If you used the same name 
for the cluster (session) every time, this metrics-namespace growth would not 
occur.

I hope at least some of that made sense ;)  This is getting a bit hard to 
describe this way. We could also sync via Hangouts or something if that is easier.




was (Author: jgrier):
[~eastcirclek] Let's define our terms to  make sure we're talking about the 
same thing.

*Session*: A single instance of a Job Manager and some # of TaskManagers 
working together.   A session can be created "on-the-fly" for a single job or 
it can be a long-running thing.  Multiple jobs can start, run, and finish in 
the same session.  Think of the "yarn-session.sh" command.  This creates a 
session outside of any particular job.  This is also what I've meant when I've 
said "cluster".  A Yarn session is a "cluster" that we've spun up for some 
length of time on Yarn.  Another example of a cluster would be a standalone 
install of Flink on some # of machines.

*Job*: A single batch or streaming job that runs on a Flink cluster.

In the above scenario, and if your definition of sessions agrees with mine, 
you would instead have the following. Note that I've named the cluster 
according to the "session" name you've given, because in this case each session 
is really a different (ad-hoc) cluster. When you run a job directly using just 
"flink run -ytm ..." on YARN, you are spinning up an ad-hoc cluster for your job.

After Session 1 is finished, Node 1 would have the following metrics:

- cluster.session1.taskmanager.1.gc_time

After 

[jira] [Comment Edited] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.

2016-02-22 Thread Dongwon Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158023#comment-15158023
 ] 

Dongwon Kim edited comment on FLINK-1502 at 2/23/16 12:43 AM:
--

Let's consider the following scenario:
||Sessions||Node1||Node2||Node3||
|Session1|TM1|TM2|TM3|
|Session2|TM2|TM3|TM1|
|Session3|TM3|TM2|TM1|

After Session1 is finished, Node1 has the following metrics:
- cluster.MyCluster.taskmanager.1.gc_time 

After Session2 is finished, Node1 has the following metrics:
- cluster.MyCluster.taskmanager.1.gc_time 
- cluster.MyCluster.taskmanager.2.gc_time

After Session3 is finished, Node1 has the following metrics:
- cluster.MyCluster.taskmanager.1.gc_time 
- cluster.MyCluster.taskmanager.2.gc_time
- cluster.MyCluster.taskmanager.3.gc_time

At this point, a user has to check which of the three metrics above belongs to 
the current session.
The problem gets worse if the user has to launch many more TaskManagers.
For example, 500 TaskManagers over multiple sessions will end up creating 
500 metrics for each host.

Wouldn't it be better to assign indexes to TaskManagers scoped to each host?

p.s.
I'm going to start without considering multiple TaskManagers on the same node 
as we haven't yet reached a consensus.
But I think we still need to develop this discussion further.


was (Author: eastcirclek):
Let's consider the following scenario:
||Sessions||Node1||Node2||Node3||
|Session1|TM1|TM2|TM3|
|Session2|TM2|TM3|TM1|
|Session3|TM3|TM2|TM1|

After Session1 is finished, Node1 has the following metrics:
- cluster.MyCluster.taskmanager.1.gc_time 

After Session2 is finished, Node1 has the following metrics:
- cluster.MyCluster.taskmanager.1.gc_time 
- cluster.MyCluster.taskmanager.2.gc_time

After Session3 is finished, Node1 has the following metrics:
- cluster.MyCluster.taskmanager.1.gc_time 
- cluster.MyCluster.taskmanager.2.gc_time
- cluster.MyCluster.taskmanager.3.gc_time

At this point, a user has to check which of the three metrics above belongs to 
the current session.
The problem gets worse if the user has to launch many more TaskManagers.
For example, 500 TaskManagers over multiple sessions will end up with 500 
metrics for each host.

Wouldn't it be better to assign indexes to TaskManagers scoped to each host?

p.s.
I'm going to start without considering multiple TaskManagers on the same node 
as we haven't yet reached a consensus.
But I think we still need to develop this discussion further.

> Expose metrics to graphite, ganglia and JMX.
> 
>
> Key: FLINK-1502
> URL: https://issues.apache.org/jira/browse/FLINK-1502
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager, TaskManager
>Affects Versions: 0.9
>Reporter: Robert Metzger
>Assignee: Dongwon Kim
>Priority: Minor
> Fix For: pre-apache
>
>
> The metrics library allows to expose collected metrics easily to other 
> systems such as graphite, ganglia or Java's JVM (VisualVM).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.

2016-02-22 Thread Dongwon Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158023#comment-15158023
 ] 

Dongwon Kim edited comment on FLINK-1502 at 2/23/16 12:41 AM:
--

Let's consider the following scenario:
||Sessions||Node1||Node2||Node3||
|Session1|TM1|TM2|TM3|
|Session2|TM2|TM3|TM1|
|Session3|TM3|TM2|TM1|

After Session1 is finished, Node1 has the following metrics:
- cluster.MyCluster.taskmanager.1.gc_time 

After Session2 is finished, Node1 has the following metrics:
- cluster.MyCluster.taskmanager.1.gc_time 
- cluster.MyCluster.taskmanager.2.gc_time

After Session3 is finished, Node1 has the following metrics:
- cluster.MyCluster.taskmanager.1.gc_time 
- cluster.MyCluster.taskmanager.2.gc_time
- cluster.MyCluster.taskmanager.3.gc_time

At this point, a user has to check which of the three metrics above belongs to 
the current session.
The problem gets worse if the user has to launch many more TaskManagers.
For example, 500 TaskManagers over multiple sessions will end up with 500 
metrics for each host.

Wouldn't it be better to assign indexes to TaskManagers scoped to each host?

p.s.
I'm going to start without considering multiple TaskManagers on the same node 
as we haven't yet reached a consensus.
But I think we still need to develop this discussion further.


was (Author: eastcirclek):
Let's consider the following scenario:

           |  Node1 (N1)  |   N2   |   N3
-------------------------------------------
Session1   |     TM1      |  TM2   |  TM3
Session2   |     TM2      |  TM3   |  TM1
Session3   |     TM3      |  TM2   |  TM1

After Session1 is finished, Node1 has the following metrics:
- cluster.MyCluster.taskmanager.1.gc_time 

After Session2 is finished, Node1 has the following metrics:
- cluster.MyCluster.taskmanager.1.gc_time 
- cluster.MyCluster.taskmanager.2.gc_time

After Session3 is finished, Node1 has the following metrics:
- cluster.MyCluster.taskmanager.1.gc_time 
- cluster.MyCluster.taskmanager.2.gc_time
- cluster.MyCluster.taskmanager.3.gc_time

At this point, a user has to check which of the three metrics above belongs to 
the current session.
The problem gets worse if the user has to launch many more TaskManagers.
For example, 500 TaskManagers over multiple sessions will end up with 500 
metrics for each host.

Wouldn't it be better to assign indexes to TaskManagers scoped to each host?

p.s.
I'm going to start without considering multiple TaskManagers on the same node 
as we haven't yet reached a consensus.
But I think we still need to develop this discussion further.

> Expose metrics to graphite, ganglia and JMX.
> 
>
> Key: FLINK-1502
> URL: https://issues.apache.org/jira/browse/FLINK-1502
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager, TaskManager
>Affects Versions: 0.9
>Reporter: Robert Metzger
>Assignee: Dongwon Kim
>Priority: Minor
> Fix For: pre-apache
>
>
> The metrics library allows to expose collected metrics easily to other 
> systems such as graphite, ganglia or Java's JVM (VisualVM).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.

2016-02-18 Thread Jamie Grier (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152917#comment-15152917
 ] 

Jamie Grier edited comment on FLINK-1502 at 2/18/16 7:21 PM:
-

To be clear, what I meant here is to have the indexes assigned to the 
TaskManagers scoped to the *entire* cluster, not a particular host like what 
you're describing here. So, for example, if you spun up a Flink cluster with 
10 TaskManagers running on 10 different hosts, the TaskManagers would be given 
a unique *index* on the *cluster*. Literally, TaskManager[1-10]. Use this to 
scope the metrics, e.g.:

cluster.MyCluster.taskmanager.1.gc_time
cluster.MyCluster.taskmanager.2.gc_time
...
...
cluster.MyCluster.taskmanager.10.gc_time

It doesn't matter which hosts they are on. These are 10 unique JVMs on some 
set of hosts.




was (Author: jgrier):
To be clear, what I meant here is to have the indexes assigned to the 
TaskManagers scoped to the *entire* cluster, not a particular host like what 
you're describing here. So, for example, if you spun up a Flink cluster with 
10 TaskManagers running on 10 different hosts, the TaskManagers would be given 
a unique INDEX on the cluster. Literally, TaskManager[1-10]. Use this to 
scope the metrics, e.g.:

cluster.MyCluster.taskmanager.1.gc_time
cluster.MyCluster.taskmanager.2.gc_time
...
...
cluster.MyCluster.taskmanager.10.gc_time

It doesn't matter which hosts they are on. These are 10 unique JVMs on some 
set of hosts.



> Expose metrics to graphite, ganglia and JMX.
> 
>
> Key: FLINK-1502
> URL: https://issues.apache.org/jira/browse/FLINK-1502
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager, TaskManager
>Affects Versions: 0.9
>Reporter: Robert Metzger
>Assignee: Dongwon Kim
>Priority: Minor
> Fix For: pre-apache
>
>
> The metrics library allows to expose collected metrics easily to other 
> systems such as graphite, ganglia or Java's JVM (VisualVM).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.

2016-02-17 Thread Dongwon Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151346#comment-15151346
 ] 

Dongwon Kim edited comment on FLINK-1502 at 2/18/16 2:30 AM:
-

To [~StephanEwen], [~mxm], [~jgrier], 
First of all, sorry for the late response.

We just need to make each TaskManager report its metrics to 
JMX/Ganglia/Graphite as you guys suggested.
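
For illustration, a per-TaskManager reporter could look roughly like the sketch 
below (assuming the Dropwizard metrics-ganglia module; the Ganglia host, port, 
and the example metric are placeholders, not an agreed-upon design):

{code:java}
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.ganglia.GangliaReporter;
import info.ganglia.gmetric4j.gmetric.GMetric;
import info.ganglia.gmetric4j.gmetric.GMetric.UDPAddressingMode;
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class TaskManagerGangliaReporter {
    public static void main(String[] args) throws IOException {
        MetricRegistry registry = new MetricRegistry();

        // Placeholder Ganglia endpoint; in practice this would come from configuration.
        GMetric ganglia = new GMetric("ganglia.example.com", 8649,
                UDPAddressingMode.MULTICAST, 1);

        GangliaReporter reporter = GangliaReporter.forRegistry(registry)
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build(ganglia);

        // Every metric registered on this TaskManager's registry is reported periodically.
        reporter.start(1, TimeUnit.MINUTES);

        registry.counter("records_processed").inc();  // example metric
    }
}
{code}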

To [~mxm], 
The main problem with such a design is that a newly launched TaskManager is 
given a randomly generated UUID, which will create too many Ganglia metrics, as 
[~jgrier] mentioned above.
I think [~jgrier]'s solution is quite simple yet viable:

cluster..taskmanager.1.gc_time
cluster..taskmanager.2.gc_time

To that end, we need to open a new issue to assign such IDs to TaskManagers 
running on the same host.
One concern is that we need to do such numbering even when only one TaskManager 
is running on each node, like .taskmanager.1.gc_time.
I'm okay with it, but users could find the numbering quite ugly.

What do you guys think?


was (Author: eastcirclek):
To [~StephanEwen], [~mxm], [~jgrier], 
First of all, sorry for the late response.

We just need to make each TaskManager report its metrics to 
JMX/Ganglia/Graphite as you guys suggested.

To [~mxm], 
The main problem with such a design is that a newly launched TaskManager is 
given a randomly generated UUID, which will create too many Ganglia metrics, as 
[~jgrier] mentioned above.
I think [~jgrier]'s solution is quite simple yet viable:

cluster..taskmanager.1.gc_time
cluster..taskmanager.2.gc_time

To that end, we need to open a new issue to assign such IDs to TaskManagers 
running on the same host.
One concern is that, even when only one TaskManager is running on each node, we 
need to do such numbering (e.g. .taskmanager.1.gc_time).
I'm okay with it, but users could find the numbering quite ugly.

What do you guys think?

> Expose metrics to graphite, ganglia and JMX.
> 
>
> Key: FLINK-1502
> URL: https://issues.apache.org/jira/browse/FLINK-1502
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager, TaskManager
>Affects Versions: 0.9
>Reporter: Robert Metzger
>Assignee: Dongwon Kim
>Priority: Minor
> Fix For: pre-apache
>
>
> The metrics library allows to expose collected metrics easily to other 
> systems such as graphite, ganglia or Java's JVM (VisualVM).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.

2016-02-17 Thread Dongwon Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151346#comment-15151346
 ] 

Dongwon Kim edited comment on FLINK-1502 at 2/17/16 11:02 PM:
--

To [~StephanEwen], [~mxm], [~jgrier], 
First of all, sorry for the late response.

We just need to make each TaskManager report its metrics to 
JMX/Ganglia/Graphite as you guys suggested.

To [~mxm], 
The main problem with such a design is that a newly launched TaskManager is 
given a randomly generated UUID, which will create too many Ganglia metrics, as 
[~jgrier] mentioned above.
I think [~jgrier]'s solution is quite simple yet viable:

cluster..taskmanager.1.gc_time
cluster..taskmanager.2.gc_time

To that end, we need to open a new issue to assign such IDs to TaskManagers 
running on the same host.
One concern is that, even when only one TaskManager is running on each node, we 
need to do such numbering (e.g. .taskmanager.1.gc_time).
I'm okay with it, but users could find the numbering quite ugly.

What do you guys think?


was (Author: eastcirclek):
To [~StephanEwen], [~mxm], [~jgrier], 
First of all, sorry for the late response.

We just need to make each TaskManager report its metrics to 
JMX/Ganglia/Graphite as you guys suggested.

To [~mxm], 
the main problem with such a design is that a newly launched TaskManager is 
given a randomly generated UUID, which will create too many Ganglia metrics, as 
[~jgrier] mentioned above.
I think [~jgrier]'s solution is quite simple yet viable:

cluster..taskmanager.1.gc_time
cluster..taskmanager.2.gc_time

To that end, we need to open a new issue to assign such IDs to TaskManagers 
running on the same host.
One concern is that, even when only one TaskManager is running on each node, we 
need to do such numbering (e.g. .taskmanager.1.gc_time).
I'm okay with it, but users could find the numbering quite ugly.

What do you guys think?

> Expose metrics to graphite, ganglia and JMX.
> 
>
> Key: FLINK-1502
> URL: https://issues.apache.org/jira/browse/FLINK-1502
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager, TaskManager
>Affects Versions: 0.9
>Reporter: Robert Metzger
>Assignee: Dongwon Kim
>Priority: Minor
> Fix For: pre-apache
>
>
> The metrics library allows to expose collected metrics easily to other 
> systems such as graphite, ganglia or Java's JVM (VisualVM).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.

2016-02-17 Thread Dongwon Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151346#comment-15151346
 ] 

Dongwon Kim edited comment on FLINK-1502 at 2/17/16 10:57 PM:
--

To [~StephanEwen], [~mxm], [~jgrier], 
First of all, sorry for the late response.

We just need to make each TaskManager report its metrics to 
JMX/Ganglia/Graphite as you guys suggested.

To [~mxm], 
the main problem with such a design is that a newly launched TaskManager is 
given a randomly generated UUID, which will create too many Ganglia metrics, as 
[~jgrier] mentioned above.
I think [~jgrier]'s solution is quite simple yet viable:

cluster..taskmanager.1.gc_time
cluster..taskmanager.2.gc_time

To that end, we need to open a new issue to assign such IDs to TaskManagers 
running on the same host.
One concern is that, even when only one TaskManager is running on each node, we 
need to do such numbering (e.g. .taskmanager.1.gc_time).
I'm okay with it, but users could find the numbering quite ugly.

What do you guys think?


was (Author: eastcirclek):
To [~StephanEwen], [~mxm], [~jgrier], 
First of all, sorry for the late response.

We just need to make each TaskManager report its metrics to 
JMX/Ganglia/Graphite as you guys suggested.

To [~mxm], 
the main problem with such a design is that a newly launched TaskManager is 
given a randomly generated UUID, which will create too many Ganglia metrics, as 
[~jgrier] mentioned above.
I think [~jgrier]'s solution is quite simple yet viable:

cluster..taskmanager.1.gc_time
cluster..taskmanager.2.gc_time

To that end, we need to open a new issue to assign such IDs to TaskManagers 
running on the same host.
One concern is that, even when only one TaskManager is running on each node, we 
need to do such numbering (e.g. .taskmanager.1.gc_time).
I'm okay with it, but users could find the numbering quite ugly.

What do you guys think?

> Expose metrics to graphite, ganglia and JMX.
> 
>
> Key: FLINK-1502
> URL: https://issues.apache.org/jira/browse/FLINK-1502
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager, TaskManager
>Affects Versions: 0.9
>Reporter: Robert Metzger
>Assignee: Dongwon Kim
>Priority: Minor
> Fix For: pre-apache
>
>
> The metrics library allows to expose collected metrics easily to other 
> systems such as graphite, ganglia or Java's JVM (VisualVM).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.

2016-02-16 Thread Jamie Grier (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149109#comment-15149109
 ] 

Jamie Grier edited comment on FLINK-1502 at 2/16/16 7:09 PM:
-

Is there no way to refer to a TaskManager by index in order to solve this 
problem? It would be nice if we didn't have to send all the metrics through 
the JobManager but rather just report them via JMX locally on each host. I 
think I understand the problem you are describing, but wouldn't just having a 
logical index for each TaskManager solve it? I would like to avoid having to 
send the metrics through a central node if possible, as I would like to see the 
total number of metrics go up dramatically as we instrument the code more and 
more and give users more insight into how Flink is running.

Maybe we can collaborate on this. I want a general way to instrument both 
Flink code and user code and make those metrics available easily via JMX at a 
minimum, and maybe directly in Graphite and Ganglia. Once the metrics are 
available in JMX, there are many tools for integrating with other metrics and 
alerting systems.
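
As a minimal sketch of that idea (assuming Dropwizard metrics 3.x; the Graphite 
endpoint and the metric prefix are placeholders), a registry exposed via JMX can 
additionally be shipped to Graphite by attaching a second reporter:

{code:java}
import com.codahale.metrics.JmxReporter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;
import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

public class LocalReporters {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();

        // Expose everything in the registry as MBeans; visible in VisualVM/JConsole.
        JmxReporter jmx = JmxReporter.forRegistry(registry).build();
        jmx.start();

        // Optionally also push the same metrics to Graphite every 10 seconds.
        Graphite graphite = new Graphite(new InetSocketAddress("graphite.example.com", 2003));
        GraphiteReporter graphiteReporter = GraphiteReporter.forRegistry(registry)
                .prefixedWith("cluster.MyCluster.taskmanager.1")  // placeholder prefix
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build(graphite);
        graphiteReporter.start(10, TimeUnit.SECONDS);

        registry.counter("records_processed").inc();  // example metric
    }
}
{code}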


was (Author: jgrier):
Is there no way to refer to a TaskManager by index in order to solve this 
problem? It would be nice if we didn't have to send all the metrics through 
the JobManager but rather just report them via JMX locally on each host. I 
think I understand the problem you are describing, but would just having a 
logical index for each TaskManager solve this problem? I would like to avoid 
having to send the metrics through a central node if possible, as I would like 
to see the total number of metrics go up dramatically as we instrument the code 
more and more.

> Expose metrics to graphite, ganglia and JMX.
> 
>
> Key: FLINK-1502
> URL: https://issues.apache.org/jira/browse/FLINK-1502
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager, TaskManager
>Affects Versions: 0.9
>Reporter: Robert Metzger
>Assignee: Dongwon Kim
>Priority: Minor
> Fix For: pre-apache
>
>
> The metrics library allows to expose collected metrics easily to other 
> systems such as graphite, ganglia or Java's JVM (VisualVM).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.

2016-02-11 Thread Dongwon Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15144128#comment-15144128
 ] 

Dongwon Kim edited comment on FLINK-1502 at 2/12/16 6:29 AM:
-

Before deciding on the design, we should take into consideration an environment 
in which a user can launch multiple TaskManager instances on a single machine 
(this is my local development environment), while Ganglia is usually set up to 
run a single monitoring daemon on each machine. This could become a common case 
sooner or later, once Flink is capable of dynamic runtime scaling under YARN or 
Mesos (Spark already supports dynamic runtime scaling by executing multiple 
smaller executors per node and killing some of them when underloaded). 

What could be a problem in such an environment is that, if each of two 
TaskManagers running on a cluster node reports its metrics to Ganglia as if it 
were the only Flink daemon running on the node, Ganglia shows two different 
metrics in a single graph without aggregating them. The graph can be sawtooth 
shaped, in my experience. A workaround would be to distinguish metrics from the 
two TaskManagers by appending TaskManager IDs to the name of each metric when 
reporting to Ganglia. The workaround, however, will generate too many Ganglia 
metrics (and RRD files, each corresponding to a Ganglia metric) because 
TaskManagers are given a randomly generated ID whenever they are newly launched.

That being said, I propose an initial plan as follows (a rough sketch of the 
aggregation step follows below):
- The JobManager takes responsibility for reporting TaskManagers' metrics to 
Ganglia/Graphite. Note that TaskManagers already send metrics through heartbeat 
messages to the JobManager. 
- I want the JobManager to aggregate metrics from TaskManagers running on the 
same node. I'm not sure whether this decision is good enough, because different 
TaskManagers running on the same node could exhibit different runtime behaviors.
- After aggregating the values of a metric from the different TaskManagers 
running on a cluster node, the JobManager reports the aggregated value of the 
metric to Ganglia under the hostname. 
- By doing that, Ganglia will end up with a single Ganglia metric.
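
A rough, hypothetical sketch of that per-host aggregation step (plain Java, not 
Flink code; averaging is just one possible aggregation and the names are made up):

{code:java}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class HostAggregator {

    /** Latest value heartbeated by each TaskManager, keyed by "host.metric". */
    private final Map<String, Map<String, Double>> perTaskManager = new HashMap<>();

    /** Record the latest value reported by one TaskManager on one host. */
    public void report(String host, String taskManagerId, String metric, double value) {
        perTaskManager
                .computeIfAbsent(host + "." + metric, k -> new HashMap<>())
                .put(taskManagerId, value);
    }

    /** Aggregate (here: average) all TaskManager values for one host-level metric. */
    public double aggregate(String host, String metric) {
        Map<String, Double> values =
                perTaskManager.getOrDefault(host + "." + metric, Collections.emptyMap());
        return values.values().stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    public static void main(String[] args) {
        HostAggregator agg = new HostAggregator();
        agg.report("node1", "tm-a", "gc_time", 120.0);
        agg.report("node1", "tm-b", "gc_time", 80.0);
        // A single value per host is what would be reported to Ganglia: 100.0
        System.out.println(agg.aggregate("node1", "gc_time"));
    }
}
{code}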


was (Author: eastcirclek):
Before deciding on the design, we should take into consideration an environment 
in which a user can launch multiple TaskManager instances on a single machine 
(this is my local development environment), while Ganglia is usually set up to 
run a single monitoring daemon on each machine. This could become a common case 
sooner or later, once Flink is capable of dynamic runtime scaling under YARN or 
Mesos (Spark already supports dynamic runtime scaling by executing multiple 
smaller executors per node and killing some of them when underloaded). 

What could be a problem in such an environment is that, if each of two 
TaskManagers running on a cluster node reports its metrics to Ganglia as if it 
were the only Flink daemon running on the node, Ganglia shows two different 
metrics in a single graph without aggregating them. The graph can be sawtooth 
shaped, in my experience. A workaround would be to distinguish metrics from the 
two TaskManagers by appending TaskManager IDs to the name of each metric when 
reporting to Ganglia. The workaround, however, will generate too many Ganglia 
metrics (and RRD files, each corresponding to a Ganglia metric) on the Ganglia 
master node, because TaskManagers are given a randomly generated ID whenever 
they are newly launched.

That being said, I propose an initial plan as follows:
- The JobManager takes responsibility for reporting TaskManagers' metrics to 
Ganglia/Graphite. Note that TaskManagers already send metrics through heartbeat 
messages to the JobManager. 
- I want the JobManager to aggregate metrics from TaskManagers running on the 
same node. I'm not sure whether this decision is good enough, because different 
TaskManagers running on the same node could exhibit different runtime behaviors.
- After aggregating the values of a metric from the different TaskManagers 
running on a cluster node, the JobManager reports the aggregated value of the 
metric to Ganglia under the hostname. 
- By doing that, Ganglia will end up with a single Ganglia metric.

> Expose metrics to graphite, ganglia and JMX.
> 
>
> Key: FLINK-1502
> URL: https://issues.apache.org/jira/browse/FLINK-1502
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager, TaskManager
>Affects Versions: 0.9
>Reporter: Robert Metzger
>Assignee: Dongwon Kim
>Priority: Minor
> Fix For: pre-apache
>
>
> The metrics library allows to expose collected metrics easily to other 
> systems such as graphite, ganglia or Java's JVM (VisualVM).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.

2016-02-11 Thread Dongwon Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15144128#comment-15144128
 ] 

Dongwon Kim edited comment on FLINK-1502 at 2/12/16 6:31 AM:
-

Before deciding on the design, we should take into consideration an environment 
in which a user can launch multiple TaskManager instances on a single machine 
(this is my local development environment), while Ganglia is usually set up to 
run a single monitoring daemon on each machine. This could become a common case 
sooner or later, once Flink is capable of dynamic runtime scaling under YARN or 
Mesos (Spark already supports dynamic runtime scaling by executing multiple 
smaller executors per node and killing some of them when underloaded). 

What could be a problem in such an environment is that, if each of two 
TaskManagers running on a cluster node reports its metrics to Ganglia as if it 
were the only Flink daemon running on the node, Ganglia shows two different 
metrics in a single graph without aggregating them. The graph can be sawtooth 
shaped, in my experience. A workaround would be to distinguish metrics from the 
two TaskManagers by appending TaskManager IDs to the name of each metric when 
reporting to Ganglia. The workaround, however, will generate too many Ganglia 
metrics (and RRD files, each corresponding to a Ganglia metric) because 
TaskManagers are given a randomly generated ID whenever they are newly launched.

That being said, I propose an initial plan as follows:
- The JobManager takes responsibility for reporting TaskManagers' metrics to 
Ganglia/Graphite. Note that TaskManagers already send metrics through heartbeat 
messages to the JobManager. 
- I want the JobManager to aggregate metrics from TaskManagers running on the 
same node. I'm not sure whether this decision is good enough, because different 
TaskManagers running on the same node could exhibit different runtime behaviors.
- After aggregating the values of a metric from the different TaskManagers 
running on a cluster node, the JobManager reports the aggregated value of the 
metric to Ganglia under the hostname. 
- By doing that, Ganglia will end up with a single Ganglia metric for each 
Flink metric.


was (Author: eastcirclek):
Before deciding on the design, we should take into consideration an environment 
in which a user can launch multiple TaskManager instances on a single machine 
(this is my local development environment), while Ganglia is usually set up to 
run a single monitoring daemon on each machine. This could become a common case 
sooner or later, once Flink is capable of dynamic runtime scaling under YARN or 
Mesos (Spark already supports dynamic runtime scaling by executing multiple 
smaller executors per node and killing some of them when underloaded). 

What could be a problem in such an environment is that, if each of two 
TaskManagers running on a cluster node reports its metrics to Ganglia as if it 
were the only Flink daemon running on the node, Ganglia shows two different 
metrics in a single graph without aggregating them. The graph can be sawtooth 
shaped, in my experience. A workaround would be to distinguish metrics from the 
two TaskManagers by appending TaskManager IDs to the name of each metric when 
reporting to Ganglia. The workaround, however, will generate too many Ganglia 
metrics (and RRD files, each corresponding to a Ganglia metric) because 
TaskManagers are given a randomly generated ID whenever they are newly launched.

That being said, I propose an initial plan as follows:
- The JobManager takes responsibility for reporting TaskManagers' metrics to 
Ganglia/Graphite. Note that TaskManagers already send metrics through heartbeat 
messages to the JobManager. 
- I want the JobManager to aggregate metrics from TaskManagers running on the 
same node. I'm not sure whether this decision is good enough, because different 
TaskManagers running on the same node could exhibit different runtime behaviors.
- After aggregating the values of a metric from the different TaskManagers 
running on a cluster node, the JobManager reports the aggregated value of the 
metric to Ganglia under the hostname. 
- By doing that, Ganglia will end up with a single Ganglia metric.

> Expose metrics to graphite, ganglia and JMX.
> 
>
> Key: FLINK-1502
> URL: https://issues.apache.org/jira/browse/FLINK-1502
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager, TaskManager
>Affects Versions: 0.9
>Reporter: Robert Metzger
>Assignee: Dongwon Kim
>Priority: Minor
> Fix For: pre-apache
>
>
> The metrics library allows to expose collected metrics easily to other 
> systems such as graphite, ganglia or Java's JVM (VisualVM).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (FLINK-1502) Expose metrics to graphite, ganglia and JMX.

2016-02-08 Thread Maximilian Michels (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15136704#comment-15136704
 ] 

Maximilian Michels edited comment on FLINK-1502 at 2/8/16 8:37 AM:
---

I was working on this in FLINK-3170 but priorities have shifted a bit so I 
haven't completed the work yet.

After I tied the initial collection of metrics to the existing runtime, I 
realized that it would be better to build an abstraction for publishing the 
metrics. What I did was to replace the accumulator {{HashMap}} with a custom 
{{TaskAccumulator}} type. In the runtime, the actual {{TaskAccumulator}} 
implementation can then trigger publishing of the metrics while the job is 
running. It would suffice to register the accumulators once and then have them 
pulled in by the MBeanServer of the JVM.

This approach wouldn't touch too many runtime classes or introduce extra 
synchronization between the runtime thread and a metrics thread. All 
non-job-related metrics, which are published through the TaskManagers (and 
heartbeated to the JobManager), can be exposed much more easily.  
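
A hedged sketch of the "register once, pull via JMX" idea (not the FLINK-3170 
code; the {{TaskAccumulators}} class and the use of Dropwizard metrics below are 
assumptions made for illustration):

{code:java}
import com.codahale.metrics.Gauge;
import com.codahale.metrics.JmxReporter;
import com.codahale.metrics.MetricRegistry;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

/** Register each accumulator once as a Gauge; the JVM's MBeanServer pulls
 *  current values on demand, so the runtime thread never blocks on reporting. */
public class TaskAccumulators {

    private final MetricRegistry registry = new MetricRegistry();
    private final ConcurrentMap<String, LongAdder> counters = new ConcurrentHashMap<>();

    public TaskAccumulators() {
        JmxReporter.forRegistry(registry).build().start();
    }

    /** Registered once per name; subsequent add() calls need no extra synchronization. */
    public LongAdder counter(String name) {
        return counters.computeIfAbsent(name, n -> {
            LongAdder adder = new LongAdder();
            registry.register(n, (Gauge<Long>) adder::sum);
            return adder;
        });
    }

    public static void main(String[] args) {
        TaskAccumulators acc = new TaskAccumulators();
        acc.counter("records_processed").add(42);  // hot path: just an add, no publishing
    }
}
{code}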


was (Author: mxm):
I was working on this in FLINK-3170 but priorities have shifted a bit so I 
haven't completed the work yet.

After I tied the initial collection of metrics to the existing runtime, I 
realized that it would be better to build an abstraction for publishing the 
metrics. What I did was to replace the accumulator {{HashMap}}s with a custom 
{{TaskAccumulator}} type. In the runtime, the actual {{TaskAccumulator}} 
implementation can then trigger publishing of the metrics while the job is 
running. It would suffice to register the accumulators once and then have them 
pulled in by the MBeanServer of the JVM.

This approach wouldn't touch too many runtime classes or introduce extra 
synchronization between the runtime thread and a metrics thread. All 
non-job-related metrics, which are published through the TaskManagers (and 
heartbeated to the JobManager), can be exposed much more easily.  

> Expose metrics to graphite, ganglia and JMX.
> 
>
> Key: FLINK-1502
> URL: https://issues.apache.org/jira/browse/FLINK-1502
> Project: Flink
>  Issue Type: Sub-task
>  Components: JobManager, TaskManager
>Affects Versions: 0.9
>Reporter: Robert Metzger
>Priority: Minor
> Fix For: pre-apache
>
>
> The metrics library allows to expose collected metrics easily to other 
> systems such as graphite, ganglia or Java's JVM (VisualVM).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)