Nikolay,
Looks like final resolution. +1.
Fri, 26 Jul 2019 at 12:08, Nikolay Izhikov:
> Pavel.
>
> > I just want to add that currentPmeTime is also useful for alerting systems,
> > not only for eye observing
>
> Fully agree.
>
> Let me make it as clear as I can.
> In the end we should have 4
Pavel.
> I just want to add that currentPmeTime is also useful for alerting systems, not
> only for eye observing
Fully agree.
Let me make it as clear as I can.
In the end we should have 4 metrics:
`CurrentPMEDuration` - existing metric, shows current PME duration.
Nikolay,
Okay, sounds reasonable.
I just want to add that currentPmeTime is also useful for alerting systems, not
only for eye observing. If the time becomes too long and exceeds some
threshold, firing an appropriate alert can help detect a critical
problem early.
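The threshold check described above could look roughly like the sketch below. The class `PmeDurationAlert`, its supplier argument, and the 1000 ms threshold in the usage are hypothetical names and values chosen for illustration; none of them come from Ignite.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper for illustration only; not part of Ignite. */
final class PmeDurationAlert {
    /** Supplies the current duration in milliseconds; assumed to be 0 when nothing is in progress. */
    private final LongSupplier currentDurationMs;

    /** Duration above which the alert should fire (an operator-chosen value). */
    private final long thresholdMs;

    PmeDurationAlert(LongSupplier currentDurationMs, long thresholdMs) {
        this.currentDurationMs = currentDurationMs;
        this.thresholdMs = thresholdMs;
    }

    /** Returns true when the sampled duration strictly exceeds the threshold. */
    boolean shouldFire() {
        return currentDurationMs.getAsLong() > thresholdMs;
    }
}
```

An external monitoring system would call something like `shouldFire()` on every poll, with the supplier reading the exported metric value.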
On Thu, 25 Jul 2019 at 21.12,
I think the exact time should be obtained from the logs, shouldn't it?
Thu, 25 Jul 2019, 20:00 Pavel Kovalenko:
> Nikolay,
>
> Yes, I had a chance to look at HistogramMetric and, moreover, reviewed it) My
> question was mostly about what exactly we will track in the histogram.
> If we use histogram do you know
Nikolay,
Yes, I had a chance to look at HistogramMetric and, moreover, reviewed it) My
question was mostly about what exactly we will track in the histogram.
If we use a histogram, do you know how we can find the exact time, e.g. when a PME
with time > 1s happened?
Thu, 25 Jul 2019 at 19:24, Nikolay Izhikov:
>
Pavel
Did you have a chance to look at the HistogramMetric source?
It is in master now.
Looking at the source would be better than my explanation)
We should count PME processes that block operations for some amount of
time. For example [less than 50, less than 250, less than 1000, more than
1000] millis.
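A minimal sketch of the bucket counting described above, assuming upper bounds of 50, 250 and 1000 ms. The class `BlockingDurationHistogram` and its methods are hypothetical; this is not the HistogramMetric source, and the handling of values exactly on a bound (they go to the next bucket) is an assumption.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical illustration only; not Ignite's HistogramMetric. */
final class BlockingDurationHistogram {
    /** Ascending upper bounds in milliseconds, e.g. 50, 250, 1000. */
    private final long[] bounds;

    /** One counter per bound plus one overflow bucket for everything at or above the last bound. */
    private final AtomicLongArray counts;

    BlockingDurationHistogram(long... bounds) {
        this.bounds = bounds.clone();
        this.counts = new AtomicLongArray(bounds.length + 1);
    }

    /** Records one finished blocking period into the first bucket whose bound is greater than the value. */
    void record(long durationMs) {
        int idx = 0;
        while (idx < bounds.length && durationMs >= bounds[idx])
            idx++;
        counts.incrementAndGet(idx);
    }

    /** Returns a copy of the current bucket counts. */
    long[] snapshot() {
        long[] res = new long[counts.length()];
        for (int i = 0; i < res.length; i++)
            res[i] = counts.get(i);
        return res;
    }
}
```

A structure like this keeps only counts per bucket, so it cannot say when a particular slow PME happened.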
Thu, 25
Nikolay,
Could you please explain in more detail what the structure of the PME histogram will be?
Thu, 25 Jul 2019 at 11:56, Nikolay Izhikov:
> Hello, Nikita.
>
> I think
>
> > 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> > all blocking durations that happen after node starts.
>
Hello, Nikita.
I think
> 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> all blocking durations that happen after node starts.
No, we don't need it.
> 2. Blocking duration histogram. Based on the HistogramMetric class.
Yes, we need it.
On Thu, 25/07/2019 at 11:50 +0300,
Igniters,
Everyone wants to see the cacheOperationsBlockedDuration metric, which will
show the current blocking duration, or 0 if there is no blocking right now.
Do we need the following metrics? It seems one of them will be superfluous.
1. The totalCacheOperationsBlockedDuration metric that will accumulate
Guys.
I think we should go with the 2 metrics
* current PME duration (resets on finish)
This metric is required for alerting (or automatic actions) on long
PME.
* PME duration histogram (value added to metrics on PME finish)
This metric is required for
Nikita and Maxim,
What if we just update the behaviour of the current getCurrentPmeDuration metric
to show durations only for blocking PMEs?
Keep it as a long value and rename it to getCacheOperationsBlockedDuration.
No other changes will be required.
WDYT?
I agree with these two metrics. I also think
Folks,
+1 to Anton's post.
What if we just update the behaviour of the current getCurrentPmeDuration metric
to show durations only for blocking PMEs?
Keep it as a long value and rename it to getCacheOperationsBlockedDuration.
No other changes will be required.
WDYT?
On Wed, 24 Jul 2019 at 14:02, Nikita
Nikolay,
The cacheOperationsBlockedDuration metric will show the current blocking
duration, or 0 if there is no blocking right now.
The totalCacheOperationsBlockedDuration metric will accumulate all
blocking durations that happen after the node starts.
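The difference between the two values can be sketched as follows. The class `BlockingTracker` and its method names are hypothetical, and timestamps are passed in explicitly so the example stays deterministic; it also assumes the total is only updated when blocking ends.

```java
/** Hypothetical illustration of a "current" versus a "total" blocking duration; not Ignite code. */
final class BlockingTracker {
    /** Timestamp (ms) at which blocking started, or -1 when operations are not blocked. */
    private long blockedSinceMs = -1;

    /** Sum of all finished blocking periods since the tracker was created. */
    private long totalMs;

    /** Marks the start of a blocking period; ignored if one is already in progress. */
    synchronized void onBlockStart(long nowMs) {
        if (blockedSinceMs < 0)
            blockedSinceMs = nowMs;
    }

    /** Marks the end of a blocking period and adds its length to the total. */
    synchronized void onBlockEnd(long nowMs) {
        if (blockedSinceMs >= 0) {
            totalMs += nowMs - blockedSinceMs;
            blockedSinceMs = -1;
        }
    }

    /** Current blocking duration, or 0 if there is no blocking right now. */
    synchronized long currentBlockedDuration(long nowMs) {
        return blockedSinceMs < 0 ? 0 : nowMs - blockedSinceMs;
    }

    /** Accumulated duration of all finished blocking periods. */
    synchronized long totalBlockedDuration() {
        return totalMs;
    }
}
```

The current value drops back to 0 once blocking ends, while the total only grows, so a short exchange cannot hide an earlier long one in the total.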
Wed, 24 Jul 2019 at 13:35, Nikolay Izhikov:
>
>
Nikita
What is the difference between those two metrics?
Wed, 24 Jul 2019, 12:45 Nikita Amelchev:
> Igniters, thanks for comments.
>
> From the discussion it can be seen that we need only two metrics for now:
> - cacheOperationsBlockedDuration (long)
> - totalCacheOperationsBlockedDuration
Igniters, thanks for comments.
From the discussion it can be seen that we need only two metrics for now:
- cacheOperationsBlockedDuration (long)
- totalCacheOperationsBlockedDuration (long)
I will prepare a PR shortly.
Wed, 24 Jul 2019 at 09:11, Zhenya Stanilovsky:
>
> +1 with
+1 to Anton's decisions.
>Wednesday, 24 July 2019, 8:44 +03:00 from Anton Vinogradov:
>
>Folks,
>
>It looks like we're trying to implement "extended debug" instead of
>"monitoring".
>A real admin should not care which phase of PME is in
>progress and so on.
>The metrics of interest are
>-
Folks,
It looks like we're trying to implement "extended debug" instead of
"monitoring".
A real admin should not care which phase of PME is in
progress and so on.
The metrics of interest are
- total blocked time (will be used for real SLA counting)
- are we blocked right now (shows we
Folks, let me step in.
Nikita, thanks for your suggestions!
1. initialVersion. Topology version that initiates the exchange.
2. initTime. Time PME was started.
3. initEvent. Event that triggered PME.
4. partitionReleaseTime. Time when a node has finished waiting for all
updates and
Nikita,
I agree with the total blocking duration metric, but
I still don't understand why the instant value indicating that operations are
blocked should be boolean.
The duration since blocking started looks more appropriate and useful.
It gives more information while the semantics stay the same.
Folks,
All previous suggestions have some disadvantages. There can be several
exchanges between two metric updates, and a fast exchange can overwrite
a previous long exchange.
We can introduce a metric of total blocking duration that will
accumulate at the end of the exchange. So, users will get actual
Nikolay,
Still see no reason to replace boolean with long.
On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov wrote:
> Anton.
>
> 1. The value is exported based on SPI settings, not at the moment it changes.
>
> 2. Clock synchronisation - if we export start time, we should also export
> node local
Anton.
1. The value is exported based on SPI settings, not at the moment it changes.
2. Clock synchronisation - if we export the start time, we should also export
the node-local timestamp.
Mon, 22 Jul 2019, 8:33 Anton Vinogradov:
> Folks,
>
> What's the reason for duration counting?
> AFAIU, it's a
Folks,
What's the reason for duration counting?
AFAIU, counting durations is a monitoring system feature.
Since the monitoring system checks metrics periodically, it will know the
duration from its own log.
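As a rough sketch of that idea, a monitoring system polling a boolean "blocked" flag could estimate the blocked time as shown below. The class `PolledBlockingEstimator` and the fixed polling period are assumptions made for illustration, not an existing monitoring API.

```java
/** Hypothetical estimator that derives blocked time from periodic boolean samples. */
final class PolledBlockingEstimator {
    /** Interval between two polls, in milliseconds (assumed to be constant). */
    private final long periodMs;

    /** Estimated blocked time accumulated so far. */
    private long estimatedMs;

    PolledBlockingEstimator(long periodMs) {
        this.periodMs = periodMs;
    }

    /** Called once per poll; attributes a whole polling period to blocking when the flag is set. */
    void sample(boolean blocked) {
        if (blocked)
            estimatedMs += periodMs;
    }

    /** Returns the estimate; its resolution is limited to the polling period. */
    long estimatedBlockedMs() {
        return estimatedMs;
    }
}
```

With this approach, a block that starts and ends between two polls is never observed, and every observed block is rounded to a whole polling period.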
On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko wrote:
> Nikita,
>
> Yes, I mean duration
Nikita,
Yes, I mean duration, not timestamp. For the metric name, I suggest
"cacheOperationsBlockingDuration"; I think it more clearly represents what is
blocked during PME.
We can also combine both timestamp "cacheOperationsBlockingStartTs" and
duration to have better correlation when cache operations
Pavel,
The main purpose of this metric is
>> how much time we wait for resuming cache operations
Seems I misunderstood you. Do you mean timestamp or duration here?
>> What do you think if we change the boolean value of metric to a long value
>> that represents time in milliseconds when
Nikita,
I think getCurrentPmeDuration doesn't show useful information. The main PME
side effect for end users is blocking cache operations. Not all PME time
blocks them.
What information does the "timeSinceOperationsBlocked" timestamp give to an
end user? For what analysis can it be used, and how?
Fri,
Hi Pavel,
This time can already be obtained from the getCurrentPmeDuration and
new isOperationsBlockedByPme metrics.
As an alternative solution, I can rework the recently added
getCurrentPmeDuration metric (not released yet). It seems useless for
users in the case of non-blocking PME.
Let's name it
Hi Nikita,
Thank you for working on this. What do you think if we change the boolean
value of metric to a long value that represents time in milliseconds when
operations were blocked?
Since we have not only JMX and now metrics are periodically exported to
some backend it can give a more clear
Anton, Nikolay,
Thanks for the support.
For now, we have the getCurrentPmeDuration() metric, which does not show the
influence on the cluster correctly. A PME can happen without blocking
operations, for example on client node join/leave events.
I suggest adding a new metric - isOperationsBlockedByPme(). Together,
I think the administrator of an Ignite cluster should be able to monitor all Ignite
processes, including non-blocking PME.
On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> BTW,
> Found a PME metric - getCurrentPmeDuration().
> It seems to show exactly the PME time, and is not so useful because of this.
>
BTW,
Found a PME metric - getCurrentPmeDuration().
It seems to show exactly the PME time, and is not so useful because of this.
The goal is to show exactly the blocking period.
When a PME causes no blocking, it's a good PME and I see no reason to have
monitoring related to it :)
On Tue, Jul 16, 2019 at 2:50 PM
Anton.
Why do we need to postpone the implementation of these metrics?
For now, the implementation of a new metric is very simple.
I think we can implement these metrics as a single contribution.
On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> Nikita,
>
> Looks like all we need now is 1 simple
Nikita,
Looks like all we need now is 1 simple metric: are operations blocked?
Just a true or false.
Let's start with this.
All other metrics can be extracted from the logs for now and can be implemented
later.
On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov
wrote:
> +1.
>
> Nikita, please, go
+1.
Nikita, please, go ahead.
Tue, 16 Jul 2019, 11:45 Nikita Amelchev:
> Hello, Igniters.
>
> I suggest adding some useful metrics about the partition map exchange
> (PME). For now, the duration of PME stages is available only in log files
> and cannot be obtained using JMX or other external
Hello, Igniters.
I suggest adding some useful metrics about the partition map exchange
(PME). For now, the duration of PME stages is available only in log files
and cannot be obtained using JMX or other external tools. [1]
I made the list of local node metrics that help to understand the
actual