Re: Partition map exchange metrics

2019-07-26 Thread Pavel Kovalenko
Nikolay, Looks like the final resolution. +1. On Fri, 26 Jul 2019 at 12:08, Nikolay Izhikov wrote: > Pavel. > > > I just want to add that currentPmeTime is also useful for alerting systems, not > > only for visual observation > > Fully agree. > > Let me make it as clear as I can. > In the end we should have 4

Re: Partition map exchange metrics

2019-07-26 Thread Nikolay Izhikov
Pavel. > I just want to add that currentPmeTime is also useful for alerting systems, not > only for visual observation Fully agree. Let me make it as clear as I can. In the end we should have 4 metrics: `CurrentPMEDuration` - existing metric, shows current PME duration.

Re: Partition map exchange metrics

2019-07-25 Thread Pavel Kovalenko
Nikolay, Okay, sounds reasonable. I just want to add that currentPmeTime is also useful for alerting systems, not only for visual observation. If the time becomes too long and exceeds some threshold, firing an appropriate alert can help detect a critical problem early. On Thu, 25 Jul 2019 at 21.12,
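The threshold alert described here could be sketched as follows. This is only an illustration: the class `PmeDurationAlert`, its method `onSample`, and the idea of feeding it periodically sampled values of the current-duration metric are assumptions, not part of Ignite or of any monitoring product.

```java
/**
 * Hypothetical alert rule over periodically sampled values of a
 * "current PME duration" metric (0 means no PME in progress).
 * Not an Ignite API; names are made up for illustration.
 */
final class PmeDurationAlert {
    private final long thresholdMs;
    private boolean firing;

    PmeDurationAlert(long thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    /**
     * Feeds one sampled metric value.
     * Returns true only for the first sample that exceeds the threshold,
     * so one long PME raises one alert rather than one per sample.
     */
    boolean onSample(long currentDurationMs) {
        boolean over = currentDurationMs > thresholdMs;
        boolean fireNow = over && !firing;
        firing = over; // re-arms once the metric drops back under the threshold
        return fireNow;
    }
}
```

With a threshold of 1000 ms, samples of 500, 1500 and 2500 would raise a single alert at the 1500 sample; the rule re-arms after the metric returns to 0.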

Re: Partition map exchange metrics

2019-07-25 Thread Nikolay Izhikov
I think the exact time should be obtained from logs, shouldn't it? On Thu, 25 Jul 2019 at 20:00, Pavel Kovalenko wrote: > Nikolay, > > Yes, I had a chance to see HistogramMetric and moreover reviewed it) My > question was mostly about what exactly we will track in the histogram. > If we use a histogram, do you know

Re: Partition map exchange metrics

2019-07-25 Thread Pavel Kovalenko
Nikolay, Yes, I had a chance to see HistogramMetric and moreover reviewed it) My question was mostly about what exactly we will track in the histogram. If we use a histogram, do you know how we can find the exact time, e.g. when a PME with time > 1s happened? On Thu, 25 Jul 2019 at 19:24, Nikolay Izhikov wrote: >

Re: Partition map exchange metrics

2019-07-25 Thread Nikolay Izhikov
Pavel, Did you have a chance to see the HistogramMetric source? It is in master now. Looking at the source would be better than my explanation) We should count PME processes that block operations for some amount of time. For example [less than 50, less than 250, less than 1000, more than 1000] millis. On Thu, 25
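A minimal sketch of the bucketing described here, assuming bounds of 50, 250 and 1000 ms. The class `BlockingDurationHistogram` is a hypothetical stand-in written for illustration; it is not Ignite's `HistogramMetric`, and the treatment of values that fall exactly on a bound is an assumption.

```java
/**
 * Hypothetical count-per-bucket histogram of blocking PME durations.
 * A stand-in for illustration only, not Ignite's HistogramMetric.
 */
final class BlockingDurationHistogram {
    /** Upper bounds in milliseconds, ascending, e.g. 50, 250, 1000. */
    private final long[] bounds;

    /** One counter per bucket, plus a final overflow bucket. */
    private final java.util.concurrent.atomic.AtomicLongArray counts;

    BlockingDurationHistogram(long... bounds) {
        this.bounds = bounds.clone();
        this.counts = new java.util.concurrent.atomic.AtomicLongArray(bounds.length + 1);
    }

    /** Records one finished blocking PME. */
    void record(long durationMs) {
        int i = 0;
        // Assumption: a value equal to a bound goes to the next bucket,
        // so bucket i holds durations strictly below bounds[i].
        while (i < bounds.length && durationMs >= bounds[i])
            i++;
        counts.incrementAndGet(i);
    }

    /** Returns a copy of the per-bucket counters. */
    long[] snapshot() {
        long[] res = new long[counts.length()];
        for (int i = 0; i < res.length; i++)
            res[i] = counts.get(i);
        return res;
    }
}
```

A histogram of this shape only counts how many exchanges fell into each bucket. It keeps no timestamps, so it cannot say when a particular long PME happened.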

Re: Partition map exchange metrics

2019-07-25 Thread Pavel Kovalenko
Nikolay, Could you please explain in more detail what the structure of the PME histogram will be? On Thu, 25 Jul 2019 at 11:56, Nikolay Izhikov wrote: > Hello, Nikita. > > I think > > > 1. The totalCacheOperationsBlockedDuration metric that will accumulate > > all blocking durations that happen after node starts. >

Re: Partition map exchange metrics

2019-07-25 Thread Nikolay Izhikov
Hello, Nikita. I think > 1. The totalCacheOperationsBlockedDuration metric that will accumulate > all blocking durations that happen after node starts. No, we don't need it. > 2. Blocking duration histogram. Based on the HistogramMetric class. Yes, we need it. On Thu, 25/07/2019 at 11:50 +0300,

Re: Partition map exchange metrics

2019-07-25 Thread Nikita Amelchev
Igniters, Everyone wants to see the cacheOperationsBlockedDuration metric that will show the current blocking duration or 0 if there is no blocking right now. Do we need the following metrics? It seems one of them will be superfluous. 1. The totalCacheOperationsBlockedDuration metric that will accumulate

Re: Partition map exchange metrics

2019-07-24 Thread Nikolay Izhikov
Guys. I think we should go with the 2 metrics * current PME duration (resets on finish) This metric is required for alerting (or automatic actions) on long PME. * PME duration histogram (value added to metrics on PME finish) This metric is required for

Re: Partition map exchange metrics

2019-07-24 Thread Ivan Rakov
Nikita and Maxim, What if we just update the behaviour of the current getCurrentPmeDuration metric to show durations only for blocking PMEs? Keep it as a long value and rename it to getCacheOperationsBlockedDuration. No other changes will be required. WDYT? I agree with these two metrics. I also think

Re: Re[2]: Partition map exchange metrics

2019-07-24 Thread Maxim Muzafarov
Folks, +1 to Anton's post. What if we just update the behaviour of the current getCurrentPmeDuration metric to show durations only for blocking PMEs? Keep it as a long value and rename it to getCacheOperationsBlockedDuration. No other changes will be required. WDYT? On Wed, 24 Jul 2019 at 14:02, Nikita

Re: Re[2]: Partition map exchange metrics

2019-07-24 Thread Nikita Amelchev
Nikolay, The cacheOperationsBlockedDuration metric will show the current blocking duration or 0 if there is no blocking right now. The totalCacheOperationsBlockedDuration metric will accumulate all blocking durations that happen after the node starts. On Wed, 24 Jul 2019 at 13:35, Nikolay Izhikov wrote: > >
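The difference between the two metrics can be sketched as below. The class `BlockedOperationsTracker` and its callback names are hypothetical, and passing the clock value in explicitly is a choice made for this illustration; none of it is the code from the actual pull request.

```java
/**
 * Hypothetical tracker behind the two proposed metrics.
 * Names are illustrative; this is not code from Ignite.
 */
final class BlockedOperationsTracker {
    private static final long NOT_BLOCKED = -1L;

    /** Start time of the current blocking period, or NOT_BLOCKED. */
    private final java.util.concurrent.atomic.AtomicLong blockedSinceMs =
        new java.util.concurrent.atomic.AtomicLong(NOT_BLOCKED);

    /** Sum of all finished blocking periods since node start. */
    private final java.util.concurrent.atomic.AtomicLong totalMs =
        new java.util.concurrent.atomic.AtomicLong();

    /** Called when a PME starts blocking cache operations. */
    void onBlockingStarted(long nowMs) {
        blockedSinceMs.set(nowMs);
    }

    /** Called when cache operations resume; accumulates the finished period. */
    void onBlockingFinished(long nowMs) {
        long start = blockedSinceMs.getAndSet(NOT_BLOCKED);
        if (start != NOT_BLOCKED)
            totalMs.addAndGet(nowMs - start);
    }

    /** cacheOperationsBlockedDuration: current blocking duration, or 0 if not blocked. */
    long currentBlockedDuration(long nowMs) {
        long start = blockedSinceMs.get();
        return start == NOT_BLOCKED ? 0 : nowMs - start;
    }

    /** totalCacheOperationsBlockedDuration: accumulated finished blocking time. */
    long totalBlockedDuration() {
        return totalMs.get();
    }
}
```

In this sketch the current value drops back to 0 as soon as blocking ends, while the total only grows, so a monitoring system that samples rarely can still see that blocking happened between two samples.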

Re: Re[2]: Partition map exchange metrics

2019-07-24 Thread Nikolay Izhikov
Nikita, What is the difference between those two metrics? On Wed, 24 Jul 2019 at 12:45, Nikita Amelchev wrote: > Igniters, thanks for comments. > > From the discussion it can be seen that we need only two metrics for now: > - cacheOperationsBlockedDuration (long) > - totalCacheOperationsBlockedDuration

Re: Re[2]: Partition map exchange metrics

2019-07-24 Thread Nikita Amelchev
Igniters, thanks for comments. From the discussion it can be seen that we need only two metrics for now: - cacheOperationsBlockedDuration (long) - totalCacheOperationsBlockedDuration (long) I will prepare a PR shortly. On Wed, 24 Jul 2019 at 09:11, Zhenya Stanilovsky wrote: > > +1 with

Re[2]: Partition map exchange metrics

2019-07-24 Thread Zhenya Stanilovsky
+1 to Anton's decisions. >Wednesday, 24 July 2019, 8:44 +03:00 from Anton Vinogradov: > >Folks, > >It looks like we're trying to implement "extended debug" instead of >"monitoring". >It should not be interesting for a real admin what phase of PME is in >progress and so on. >The interesting metrics are >-

Re: Partition map exchange metrics

2019-07-23 Thread Anton Vinogradov
Folks, It looks like we're trying to implement "extended debug" instead of "monitoring". It should not be interesting for a real admin what phase of PME is in progress and so on. The interesting metrics are - total blocked time (will be used for real SLA counting) - are we blocked right now (shows we

Re: Partition map exchange metrics

2019-07-23 Thread Ivan Rakov
Folks, let me step in. Nikita, thanks for your suggestions! 1. initialVersion. Topology version that initiates the exchange. 2. initTime. Time PME was started. 3. initEvent. Event that triggered PME. 4. partitionReleaseTime. Time when a node has finished waiting for all updates and

Re: Partition map exchange metrics

2019-07-23 Thread Pavel Kovalenko
Nikita, I agree with the total blocking duration metric, but I still don't understand why the instant value indicating that operations are blocked should be a boolean. The duration since blocking started looks more appropriate and useful. It gives more information while the semantics stay the same.

Re: Partition map exchange metrics

2019-07-23 Thread Nikita Amelchev
Folks, All previous suggestions have some disadvantages. There can be several exchanges between two metric updates, and a fast exchange can overwrite a previous long exchange. We can introduce a metric of total blocking duration that will accumulate at the end of the exchange. So, users will get actual

Re: Partition map exchange metrics

2019-07-22 Thread Anton Vinogradov
Nikolay, I still see no reason to replace boolean with long. On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov wrote: > Anton. > > 1. The value is exported based on SPI settings, not at the moment it changes. > > 2. Clock synchronisation - if we export start time, we should also export > node local

Re: Partition map exchange metrics

2019-07-22 Thread Nikolay Izhikov
Anton. 1. The value is exported based on SPI settings, not at the moment it changes. 2. Clock synchronisation - if we export start time, we should also export node local timestamp. On Mon, 22 Jul 2019 at 8:33, Anton Vinogradov wrote: > Folks, > > What's the reason for duration counting? > AFAIU, it's a

Re: Partition map exchange metrics

2019-07-21 Thread Anton Vinogradov
Folks, What's the reason for duration counting? AFAIU, it's a monitoring system feature to count the durations. Since the monitoring system checks metrics periodically, it will know the duration from its own log. On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko wrote: > Nikita, > > Yes, I mean duration
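One way to read this argument is that a monitoring system can estimate blocked time from periodic samples of a boolean metric. The sketch below assumes a fixed poll interval and a hypothetical helper `estimateBlockedMs`; it is an illustration of the idea, not how any particular monitoring system computes durations.

```java
/**
 * Hypothetical estimate of blocked time from periodic samples of a
 * boolean "operations blocked" metric. Illustration only.
 */
final class PolledBlockingEstimate {
    private PolledBlockingEstimate() {
        // Static helper only.
    }

    /**
     * Each sample that reads true is counted as one full poll interval
     * of blocking, so the estimate has poll-interval resolution.
     */
    static long estimateBlockedMs(boolean[] samples, long pollIntervalMs) {
        long blockedSamples = 0;
        for (boolean blocked : samples) {
            if (blocked)
                blockedSamples++;
        }
        return blockedSamples * pollIntervalMs;
    }
}
```

The estimate is only as fine as the poll interval: a blocking period that starts and ends between two polls produces no true sample and is counted as 0.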

Re: Partition map exchange metrics

2019-07-19 Thread Pavel Kovalenko
Nikita, Yes, I mean duration, not timestamp. For the metric name, I suggest "cacheOperationsBlockingDuration"; I think it more clearly represents what is blocked during PME. We can also combine both the timestamp "cacheOperationsBlockingStartTs" and the duration to have better correlation when cache operations

Re: Partition map exchange metrics

2019-07-19 Thread Nikita Amelchev
Pavel, The main purpose of this metric is >> how much time we wait for resuming cache operations Seems I misunderstood you. Do you mean timestamp or duration here? >> What do you think if we change the boolean value of metric to a long value >> that represents time in milliseconds when

Re: Partition map exchange metrics

2019-07-19 Thread Pavel Kovalenko
Nikita, I think getCurrentPmeDuration doesn't show useful information. The main PME side effect for end-users is blocking cache operations. Not all of the PME time blocks them. What information does the "timeSinceOperationsBlocked" timestamp give an end-user? For what analysis can it be used, and how? On Fri,

Re: Partition map exchange metrics

2019-07-19 Thread Nikita Amelchev
Hi Pavel, This time can already be obtained from the getCurrentPmeDuration and the new isOperationsBlockedByPme metrics. As an alternative solution, I can rework the recently added getCurrentPmeDuration metric (not released yet). It seems useless for users in the case of non-blocking PME. Let's name it

Re: Partition map exchange metrics

2019-07-19 Thread Pavel Kovalenko
Hi Nikita, Thank you for working on this. What do you think if we change the boolean value of metric to a long value that represents time in milliseconds when operations were blocked? Since we have not only JMX and now metrics are periodically exported to some backend it can give a more clear

Re: Partition map exchange metrics

2019-07-19 Thread Nikita Amelchev
Anton, Nikolay, Thanks for the support. For now, we have the getCurrentPmeDuration() metric that does not show the influence on the cluster correctly. PME can happen without blocking operations. For example, client node join/leave events. I suggest adding a new metric - isOperationsBlockedByPme(). Together,

Re: Partition map exchange metrics

2019-07-16 Thread Nikolay Izhikov
I think an administrator of an Ignite cluster should be able to monitor all Ignite processes, including non-blocking PME. On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote: > BTW, > Found PME metric - getCurrentPmeDuration(). > Seems, it shows exactly PME time and not so useful because of this. >

Re: Partition map exchange metrics

2019-07-16 Thread Anton Vinogradov
BTW, Found a PME metric - getCurrentPmeDuration(). It seems to show exactly the PME time and is not so useful because of this. The goal is to show exactly the blocking period. When PME causes no blocking, it's a good PME and I see no reason to have monitoring related to it :) On Tue, Jul 16, 2019 at 2:50 PM

Re: Partition map exchange metrics

2019-07-16 Thread Nikolay Izhikov
Anton. Why do we need to postpone implementation of these metrics? For now, implementation of a new metric is very simple. I think we can implement these metrics as a single contribution. On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote: > Nikita, > > Looks like all we need now is a 1 simple

Re: Partition map exchange metrics

2019-07-16 Thread Anton Vinogradov
Nikita, Looks like all we need now is 1 simple metric: are operations blocked? Just a true or false. Let's start with this. All other metrics can be extracted from logs now and can be implemented later. On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov wrote: > +1. > > Nikita, please, go

Re: Partition map exchange metrics

2019-07-16 Thread Nikolay Izhikov
+1. Nikita, please, go ahead. On Tue, 16 Jul 2019 at 11:45, Nikita Amelchev wrote: > Hello, Igniters. > > I suggest adding some useful metrics about the partition map exchange > (PME). For now, the duration of PME stages is available only in log files > and cannot be obtained using JMX or other external

Partition map exchange metrics

2019-07-16 Thread Nikita Amelchev
Hello, Igniters. I suggest adding some useful metrics about the partition map exchange (PME). For now, the duration of PME stages is available only in log files and cannot be obtained using JMX or other external tools. [1] I made the list of local node metrics that help to understand the actual