Re: Extended logging for rebalance performance analysis

2020-07-27 Thread ткаленко кирилл
Discussed in personal correspondence with Stas; we decided to improve the message:
Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1, 
partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB,
topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, 
rebalanceId=1]

into:
Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1, 
partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, avgSpeed=5,9 
KB/sec,
histPartitions=1, histEntries=30, histBytesRcvd=1 KB,
fullPartitions=1, fullEntries=30, fullBytesRcvd=3 KB,
topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, 
rebalanceId=1]

Where:
partitions=2 - total number of partitions received
entries=60 - total number of entries received
duration=8ms - time from the first demand message to the moment this message is written to the log
bytesRcvd=5,9 KB - total number of bytes received, rendered in B, KB, MB, or GB

avgSpeed=5,9 KB/sec - bytesRcvd divided by duration, in KB/sec

histPartitions=1 - total number of partitions received in historical mode
histEntries=30 - total number of entries received in historical mode
histBytesRcvd=1 KB - total number of bytes received in historical mode, rendered in B, KB, MB, or GB

fullPartitions=1 - total number of partitions received in full mode
fullEntries=30 - total number of entries received in full mode
fullBytesRcvd=3 KB - total number of bytes received in full mode, rendered in B, KB, MB, or GB
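
For illustration only - a minimal sketch (mine, not the actual Ignite code; the
class and method names are made up) of how avgSpeed can be derived from
bytesRcvd and duration:

import java.util.Locale;

/** Sketch: derive avgSpeed (KB/sec) from bytesRcvd and duration. */
final class RebalanceSpeed {
    /**
     * @param bytesRcvd Total number of bytes received from the supplier.
     * @param durationMs Time from the first demand message to the log output, in ms.
     * @return Average speed in KB/sec, or 0 if the duration is too small to measure.
     */
    static double avgSpeedKBSec(long bytesRcvd, long durationMs) {
        if (durationMs <= 0)
            return 0; // Guard against sub-millisecond rebalances.

        return (bytesRcvd / 1024.0) / (durationMs / 1000.0);
    }

    public static void main(String[] args) {
        // Values from the sample message above: bytesRcvd=5,9 KB (~6041 B), duration=8ms.
        System.out.printf(Locale.US, "avgSpeed=%.1f KB/sec%n", avgSpeedKBSec(6_041, 8));
    }
}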

27.07.2020, 11:50, "ткаленко кирилл" :
> Discussed in personal correspondence with Stas; we decided to improve the
> message:
> Completed rebalancing [grp=grp0, 
> supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1,
>     partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB,
>     topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, 
> rebalanceId=1]
>
> into:
> Completed rebalancing [grp=grp0, 
> supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1,
>     partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, avgSpeed=5,9 
> KB/sec,
>     histPartitions=1, histEntries=30, histBytesRcvd=1 KB,
>     fullPartitions=1, fullEntries=30, fullBytesRcvd=3 KB,
>     topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, 
> rebalanceId=1]
>
> Where:
> partitions=2 - total number of partitions received
> entries=60 - total number of entries received
> duration=8ms - time from the first demand message to the moment this message is
> written to the log
> bytesRcvd=5,9 KB - total number of bytes received, rendered in B, KB, MB, or GB
>
> avgSpeed=5,9 KB/sec - bytesRcvd divided by duration, in KB/sec
>
> histPartitions=1 - total number of partitions received in historical mode
> histEntries=30 - total number of entries received in historical mode
> histBytesRcvd=1 KB - total number of bytes received in historical mode,
> rendered in B, KB, MB, or GB
>
> fullPartitions=1 - total number of partitions received in full mode
> fullEntries=30 - total number of entries received in full mode
> fullBytesRcvd=3 KB - total number of bytes received in full mode, rendered in B, KB, MB, or GB
>
> 03.07.2020, 17:21, "ткаленко кирилл" :
>> Sorry, forget.
>>
>> [1] - 
>> org.apache.ignite.internal.processors.cache.CacheGroupsMetricsRebalanceTest#testCacheGroupRebalance
>>
>> 03.07.2020, 17:20, "ткаленко кирилл" :
>>>  Hi, Stan!
>>>
>>>  I don't understand you yet.
>>>
>>>  Now you can use the metrics, as was done in the test [1]. Or can you tell me
>>> where to do this - for example, when rebalancing completes for all groups?
>>>
>>>  Here is what is now available in the logs after the additions:
>>>  1) Which group is rebalanced and which type of rebalance is used.
>>>  Starting rebalance routine [grp0, topVer=AffinityTopologyVersion 
>>> [topVer=4, minorTopVer=0], supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1, 
>>> fullPartitions=[4, 7], histPartitions=[], rebalanceId=1]
>>>
>>>  2) Completion of rebalancing from one of the suppliers.
>>>  Completed rebalancing [grp=grp0, 
>>> supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1, partitions=2, entries=60, 
>>> duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4, 
>>> minorTopVer=0], progress=1/3, rebalanceId=1]
>>>
>>>  3) Completion of the entire rebalance.
>>>  Completed rebalance chain: [rebalanceId=1, partitions=116, entries=400, 
>>> duration=41ms, bytesRcvd=40,4 KB]
>>>
>>>  These messages have a common parameter rebalanceId=1.
>>>
>>>  03.07.2020, 16:48, "Stanislav Lukyanov" :
>    On 3 Jul 2020, at 09:51, ткаленко кирилл  wrote:
>
>    To calculate the average value, you can use the existing metrics 
> "RebalancingStartTime", "RebalancingLastCancelledTime", 
> "RebalancingEndTime", "RebalancingPartitionsLeft", 
> "RebalancingReceivedKeys" and "RebalancingReceivedBytes".

   You can calculate it, and I believe that this is the first thing anyone 
 would do when reading these logs and metrics.
   If that's an essential thing then maybe it should be available out of 
 the box?

>    This also works correctly with the historical rebalance.
>    Now we can see rebalance type for each group and for each supplier in logs.

Re: Extended logging for rebalance performance analysis

2020-07-24 Thread Stanislav Lukyanov
Hi Kirill,

I think it would be helpful to add to the log line
Completed rebalancing [grp=grp0, 
supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1, partitions=2, entries=60, 
duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4, 
minorTopVer=0], progress=1/3, rebalanceId=1]
a few more metrics:
avgSpeed=, partitionsNum=N, histPartitionsNum=M, 
histBytesRcvd=X KB, histDuration=Yms, 
histAvgSpeed=, fullPartitionsNum=K, fullBytesRcvd=X 
KB, fullDuration=Yms, fullAvgSpeed=

I believe these additional metrics are good in that a user can easily set up 
monitoring based on them, and they give a lot of useful info about rebalance 
performance in general. No need to look for the "Starting rebalance" message, 
everything is in one place.
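
As a purely illustrative sketch of such monitoring (assuming the extended
"Completed rebalancing" format proposed in this thread; the regex and class
name are mine, not part of the patch):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: extract rebalance fields from the extended log line. */
final class RebalanceLogParser {
    // Matches the fields proposed above; tolerates ',' as the decimal separator.
    private static final Pattern COMPLETED = Pattern.compile(
        "Completed rebalancing \\[grp=(\\S+), supplier=(\\S+), " +
        ".*avgSpeed=([\\d.,]+) KB/sec.*");

    public static void main(String[] args) {
        String line = "Completed rebalancing [grp=grp0, supplier=3f2ae7cf, " +
            "partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, " +
            "avgSpeed=5,9 KB/sec, histPartitions=1, histEntries=30]";

        Matcher m = COMPLETED.matcher(line);

        if (m.matches()) {
            double avgSpeed = Double.parseDouble(m.group(3).replace(',', '.'));

            // Feed this into the monitoring system of your choice.
            System.out.println("grp=" + m.group(1) + ", avgSpeed=" + avgSpeed + " KB/sec");
        }
    }
}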
If any of these are hard to add then we can skip them (I suspect histDuration 
and fullDuration can't be measured in practice but don't know for sure). Let's 
add what we can.

Thanks,
Stan

> On 3 Jul 2020, at 17:21, ткаленко кирилл  wrote:
> 
> Sorry, forget.
> 
> [1] - 
> org.apache.ignite.internal.processors.cache.CacheGroupsMetricsRebalanceTest#testCacheGroupRebalance
> 
> 03.07.2020, 17:20, "ткаленко кирилл" :
>> Hi, Stan!
>> 
>> I don't understand you yet.
>> 
>> Now you can use the metrics, as was done in the test [1]. Or can you tell me
>> where to do this - for example, when rebalancing completes for all groups?
>> 
>> Here is what is now available in the logs after the additions:
>> 1) Which group is rebalanced and which type of rebalance is used.
>> Starting rebalance routine [grp0, topVer=AffinityTopologyVersion [topVer=4, 
>> minorTopVer=0], supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1, 
>> fullPartitions=[4, 7], histPartitions=[], rebalanceId=1]
>> 
>> 2) Completion of rebalancing from one of the suppliers.
>> Completed rebalancing [grp=grp0, 
>> supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1, partitions=2, entries=60, 
>> duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4, 
>> minorTopVer=0], progress=1/3, rebalanceId=1]
>> 
>> 3) Completion of the entire rebalance.
>> Completed rebalance chain: [rebalanceId=1, partitions=116, entries=400, 
>> duration=41ms, bytesRcvd=40,4 KB]
>> 
>> These messages have a common parameter rebalanceId=1.
>> 
>> 03.07.2020, 16:48, "Stanislav Lukyanov" :
   On 3 Jul 2020, at 09:51, ткаленко кирилл  wrote:
 
   To calculate the average value, you can use the existing metrics 
 "RebalancingStartTime", "RebalancingLastCancelledTime", 
 "RebalancingEndTime", "RebalancingPartitionsLeft", 
 "RebalancingReceivedKeys" and "RebalancingReceivedBytes".
>>> 
>>>  You can calculate it, and I believe that this is the first thing anyone 
>>> would do when reading these logs and metrics.
>>>  If that's an essential thing then maybe it should be available out of the 
>>> box?
>>> 
   This also works correctly with the historical rebalance.
   Now we can see rebalance type for each group and for each supplier in 
 logs. I don't think we should duplicate this information.
 
   [2020-07-03 09:49:31,481][INFO 
 ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance 
 routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3, 
 minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a5957710, 
 fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]
>>> 
>>>  I'm talking about adding info on how much data has been transferred during 
>>> rebalance.
>>>  When rebalance completes I'd like to know how much data has been 
>>> transferred, was it historical or full, what was the average rebalance 
>>> speed.
>>> 
>>>  There are two reasons for having all that.
>>> 
>>>  First, it helps to analyze the issues by searching the logs and looking 
>>> for anomalies.
>>> 
>>>  Second, this makes it possible to automate alerts: e.g. if you know your typical
>>> historical rebalance speed, you can trigger an alert if it drops below that.
>>> 
   03.07.2020, 02:49, "Stanislav Lukyanov" :
>   Kirill,
> 
>   I've looked through the patch.
>   Looks good, but it feels like the first thing someone will try to do 
> given bytesRcvd and duration is to divide one by another to get an 
> average speed.
>   Do you think it's reasonable to also add it to the logs? Maybe even to 
> the metrics?
> 
>   Also, this works with historical rebalance, right? Can we specify how 
> much data was transferred via historical or full rebalance from each 
> supplier?
>   Maybe even estimate transfer speed in entries and bytes for each 
> rebalance type?
> 
>   Thanks,
>   Stan
> 
>>On 29 Jun 2020, at 11:50, Ivan Rakov  wrote:
>> 
>>+1 to Alex G.
>> 
>>From my experience, the most interesting cases with Ignite rebalancing
>>happen exactly in production. Given that we already have
>>detailed rebalancing logging, adding info about rebalance performance looks
>>like a reasonable improvement.

Re: Extended logging for rebalance performance analysis

2020-07-03 Thread ткаленко кирилл
Sorry, forget.

[1] - 
org.apache.ignite.internal.processors.cache.CacheGroupsMetricsRebalanceTest#testCacheGroupRebalance

03.07.2020, 17:20, "ткаленко кирилл" :
> Hi, Stan!
>
> I don't understand you yet.
>
> Now you can use the metrics, as was done in the test [1]. Or can you tell me
> where to do this - for example, when rebalancing completes for all groups?
>
> Here is what is now available in the logs after the additions:
> 1) Which group is rebalanced and which type of rebalance is used.
> Starting rebalance routine [grp0, topVer=AffinityTopologyVersion [topVer=4, 
> minorTopVer=0], supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1, 
> fullPartitions=[4, 7], histPartitions=[], rebalanceId=1]
>
> 2) Completion of rebalancing from one of the suppliers.
> Completed rebalancing [grp=grp0, 
> supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1, partitions=2, entries=60, 
> duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4, 
> minorTopVer=0], progress=1/3, rebalanceId=1]
>
> 3) Completion of the entire rebalance.
> Completed rebalance chain: [rebalanceId=1, partitions=116, entries=400, 
> duration=41ms, bytesRcvd=40,4 KB]
>
> These messages have a common parameter rebalanceId=1.
>
> 03.07.2020, 16:48, "Stanislav Lukyanov" :
>>>   On 3 Jul 2020, at 09:51, ткаленко кирилл  wrote:
>>>
>>>   To calculate the average value, you can use the existing metrics 
>>> "RebalancingStartTime", "RebalancingLastCancelledTime", 
>>> "RebalancingEndTime", "RebalancingPartitionsLeft", 
>>> "RebalancingReceivedKeys" and "RebalancingReceivedBytes".
>>
>>  You can calculate it, and I believe that this is the first thing anyone 
>> would do when reading these logs and metrics.
>>  If that's an essential thing then maybe it should be available out of the 
>> box?
>>
>>>   This also works correctly with the historical rebalance.
>>>   Now we can see rebalance type for each group and for each supplier in 
>>> logs. I don't think we should duplicate this information.
>>>
>>>   [2020-07-03 09:49:31,481][INFO 
>>> ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance 
>>> routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3, 
>>> minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a5957710, 
>>> fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]
>>
>>  I'm talking about adding info on how much data has been transferred during 
>> rebalance.
>>  When rebalance completes I'd like to know how much data has been 
>> transferred, was it historical or full, what was the average rebalance speed.
>>
>>  There are two reasons for having all that.
>>
>>  First, it helps to analyze the issues by searching the logs and looking for 
>> anomalies.
>>
>>  Second, this makes it possible to automate alerts: e.g. if you know your typical
>> historical rebalance speed, you can trigger an alert if it drops below that.
>>
>>>   03.07.2020, 02:49, "Stanislav Lukyanov" :
   Kirill,

   I've looked through the patch.
   Looks good, but it feels like the first thing someone will try to do 
 given bytesRcvd and duration is to divide one by another to get an average 
 speed.
   Do you think it's reasonable to also add it to the logs? Maybe even to 
 the metrics?

   Also, this works with historical rebalance, right? Can we specify how 
 much data was transferred via historical or full rebalance from each 
 supplier?
   Maybe even estimate transfer speed in entries and bytes for each 
 rebalance type?

   Thanks,
   Stan

>    On 29 Jun 2020, at 11:50, Ivan Rakov  wrote:
>
>    +1 to Alex G.
>
>    From my experience, the most interesting cases with Ignite rebalancing
>    happen exactly in production. Given that we already have detailed
>    rebalancing logging, adding info about rebalance performance looks like
>    a reasonable improvement. With new logs we'll be able to detect and
>    investigate situations when rebalance is slow due to uneven suppliers
>    distribution or network issues.
>    The option to disable the feature at runtime shouldn't be used often, but
>    it will keep us on the safe side in case something goes wrong.
>    The format described in
>    https://issues.apache.org/jira/browse/IGNITE-12080 looks
>    good to me.
>
>    On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл 
>    wrote:
>
>>    Hello, Alexey!
>>
>>    Currently there is no way to disable/enable it, but it seems that the
>>    logs will not be overloaded, since Alexei Scherbakov's proposal seems
>>    reasonable and compact. Of course, we could add disabling/enabling of
>>    statistics collection via JMX, for example.
>>
>>    23.06.2020, 18:47, "Alexey Goncharuk" :
>>>    Hello Maxim, folks,
>>>
>>>    Wed, 6 May 2020 at 21:01, Maxim Muzafarov :
>>>
    We won't do performance analysis on the production environment.

Re: Extended logging for rebalance performance analysis

2020-07-03 Thread ткаленко кирилл
Hi, Stan!

I don't understand you yet.

Now you can use the metrics, as was done in the test [1]. Or can you tell me
where to do this - for example, when rebalancing completes for all groups?

Here is what is now available in the logs after the additions:
1) Which group is rebalanced and which type of rebalance is used.
Starting rebalance routine [grp0, topVer=AffinityTopologyVersion [topVer=4, 
minorTopVer=0], supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1, 
fullPartitions=[4, 7], histPartitions=[], rebalanceId=1]

2) Completion of rebalancing from one of the suppliers.
Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1, 
partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, 
topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, 
rebalanceId=1]

3) Completion of the entire rebalance.
Completed rebalance chain: [rebalanceId=1, partitions=116, entries=400, 
duration=41ms, bytesRcvd=40,4 KB]

These messages have a common parameter rebalanceId=1.

03.07.2020, 16:48, "Stanislav Lukyanov" :
>>  On 3 Jul 2020, at 09:51, ткаленко кирилл  wrote:
>>
>>  To calculate the average value, you can use the existing metrics 
>> "RebalancingStartTime", "RebalancingLastCancelledTime", 
>> "RebalancingEndTime", "RebalancingPartitionsLeft", "RebalancingReceivedKeys" 
>> and "RebalancingReceivedBytes".
>
> You can calculate it, and I believe that this is the first thing anyone would 
> do when reading these logs and metrics.
> If that's an essential thing then maybe it should be available out of the box?
>
>>  This also works correctly with the historical rebalance.
>>  Now we can see rebalance type for each group and for each supplier in logs. 
>> I don't think we should duplicate this information.
>>
>>  [2020-07-03 09:49:31,481][INFO 
>> ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance 
>> routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3, 
>> minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a5957710, 
>> fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]
>
> I'm talking about adding info on how much data has been transferred during 
> rebalance.
> When rebalance completes I'd like to know how much data has been transferred, 
> was it historical or full, what was the average rebalance speed.
>
> There are two reasons for having all that.
>
> First, it helps to analyze the issues by searching the logs and looking for 
> anomalies.
>
> Second, this makes it possible to automate alerts: e.g. if you know your typical
> historical rebalance speed, you can trigger an alert if it drops below that.
>
>>  03.07.2020, 02:49, "Stanislav Lukyanov" :
>>>  Kirill,
>>>
>>>  I've looked through the patch.
>>>  Looks good, but it feels like the first thing someone will try to do given 
>>> bytesRcvd and duration is to divide one by another to get an average speed.
>>>  Do you think it's reasonable to also add it to the logs? Maybe even to the 
>>> metrics?
>>>
>>>  Also, this works with historical rebalance, right? Can we specify how much 
>>> data was transferred via historical or full rebalance from each supplier?
>>>  Maybe even estimate transfer speed in entries and bytes for each rebalance 
>>> type?
>>>
>>>  Thanks,
>>>  Stan
>>>
   On 29 Jun 2020, at 11:50, Ivan Rakov  wrote:

   +1 to Alex G.

   From my experience, the most interesting cases with Ignite rebalancing
   happen exactly in production. Given that we already have
   detailed rebalancing logging, adding info about rebalance performance 
 looks
   like a reasonable improvement. With new logs we'll be able to detect and
   investigate situations when rebalance is slow due to uneven suppliers
   distribution or network issues.
   The option to disable the feature at runtime shouldn't be used often, but it
   will keep us on the safe side in case something goes wrong.
   The format described in
   https://issues.apache.org/jira/browse/IGNITE-12080 looks
   good to me.

   On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл 
   wrote:

>   Hello, Alexey!
>
>   Currently there is no way to disable/enable it, but it seems that the
>   logs will not be overloaded, since Alexei Scherbakov's proposal seems
>   reasonable and compact. Of course, we could add disabling/enabling of
>   statistics collection via JMX, for example.
>
>   23.06.2020, 18:47, "Alexey Goncharuk" :
>>   Hello Maxim, folks,
>>
>>   Wed, 6 May 2020 at 21:01, Maxim Muzafarov :
>>
>>>   We won't do performance analysis on the production environment. Each
>>>   time we need performance analysis it will be done on a test
>>>   environment with verbose logging enabled. Thus I suggest moving these
>>>   changes to a separate `profiling` module and extend the logging much
>>>   more without any size limitations. The same as these [2] [3]
>>>   activities do.
>>
>>   I strongly disagree with this statement.

Re: Extended logging for rebalance performance analysis

2020-07-03 Thread Stanislav Lukyanov


> On 3 Jul 2020, at 09:51, ткаленко кирилл  wrote:
> 
> To calculate the average value, you can use the existing metrics 
> "RebalancingStartTime", "RebalancingLastCancelledTime", "RebalancingEndTime", 
> "RebalancingPartitionsLeft", "RebalancingReceivedKeys" and 
> "RebalancingReceivedBytes".

You can calculate it, and I believe that this is the first thing anyone would 
do when reading these logs and metrics.
If that's an essential thing then maybe it should be available out of the box?

> 
> This also works correctly with the historical rebalance.
> Now we can see rebalance type for each group and for each supplier in logs. I 
> don't think we should duplicate this information.
> 
> [2020-07-03 09:49:31,481][INFO 
> ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance 
> routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3, 
> minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a5957710, 
> fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]

I'm talking about adding info on how much data has been transferred during 
rebalance.
When rebalance completes I'd like to know how much data has been transferred, 
was it historical or full, what was the average rebalance speed.

There are two reasons for having all that.

First, it helps to analyze the issues by searching the logs and looking for 
anomalies.

Second, this makes it possible to automate alerts: e.g. if you know your typical 
historical rebalance speed, you can trigger an alert if it drops below that.
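
As an illustration of that kind of alert - a hedged sketch with made-up names
and a made-up baseline (the threshold policy is a design choice, not part of
the patch):

/** Sketch: naive alert on a rebalance speed drop, given a known baseline. */
final class RebalanceSpeedAlert {
    private final double baselineKBSec;  // Typical historical rebalance speed.
    private final double alertFraction;  // E.g. 0.5 -> alert below 50% of baseline.

    RebalanceSpeedAlert(double baselineKBSec, double alertFraction) {
        this.baselineKBSec = baselineKBSec;
        this.alertFraction = alertFraction;
    }

    /** @return {@code true} if the observed avgSpeed warrants an alert. */
    boolean shouldAlert(double observedKBSec) {
        return observedKBSec < baselineKBSec * alertFraction;
    }
}

For example, new RebalanceSpeedAlert(700, 0.5).shouldAlert(250) returns true,
while shouldAlert(600) returns false.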

> 
> 03.07.2020, 02:49, "Stanislav Lukyanov" :
>> Kirill,
>> 
>> I've looked through the patch.
>> Looks good, but it feels like the first thing someone will try to do given 
>> bytesRcvd and duration is to divide one by another to get an average speed.
>> Do you think it's reasonable to also add it to the logs? Maybe even to the 
>> metrics?
>> 
>> Also, this works with historical rebalance, right? Can we specify how much 
>> data was transferred via historical or full rebalance from each supplier?
>> Maybe even estimate transfer speed in entries and bytes for each rebalance 
>> type?
>> 
>> Thanks,
>> Stan
>> 
>>>  On 29 Jun 2020, at 11:50, Ivan Rakov  wrote:
>>> 
>>>  +1 to Alex G.
>>> 
>>>  From my experience, the most interesting cases with Ignite rebalancing
>>>  happen exactly in production. Given that we already have
>>>  detailed rebalancing logging, adding info about rebalance performance looks
>>>  like a reasonable improvement. With new logs we'll be able to detect and
>>>  investigate situations when rebalance is slow due to uneven suppliers
>>>  distribution or network issues.
>>>  The option to disable the feature at runtime shouldn't be used often, but it
>>>  will keep us on the safe side in case something goes wrong.
>>>  The format described in
>>>  https://issues.apache.org/jira/browse/IGNITE-12080 looks
>>>  good to me.
>>> 
>>>  On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл 
>>>  wrote:
>>> 
  Hello, Alexey!
 
  Currently there is no way to disable/enable it, but it seems that the
  logs will not be overloaded, since Alexei Scherbakov's proposal seems
  reasonable and compact. Of course, we could add disabling/enabling of
  statistics collection via JMX, for example.
 
  23.06.2020, 18:47, "Alexey Goncharuk" :
>  Hello Maxim, folks,
> 
>  Wed, 6 May 2020 at 21:01, Maxim Muzafarov :
> 
>>  We won't do performance analysis on the production environment. Each
>>  time we need performance analysis it will be done on a test
>>  environment with verbose logging enabled. Thus I suggest moving these
>>  changes to a separate `profiling` module and extend the logging much
>>  more without any size limitations. The same as these [2] [3]
>>  activities do.
> 
>  I strongly disagree with this statement. I am not sure who is meant here
>  by 'we', but I see a strong momentum in increasing observability tooling
>  that helps people to understand what exactly happens in the production
>  environment [1]. Not everybody can afford two identical environments for
>  testing. We should make sure users have enough information to understand
>  the root cause after the incident happened, and not force them to
  reproduce
>  it, let alone make them add another module to the classpath and restart
  the
>  nodes.
>  I think having this functionality in the core module with the ability to
>  disable/enable it is the right approach. Having the information printed
  to
>  log is ok, having it in an event that can be sent to a monitoring/tracing
>  subsystem is even better.
> 
>  Kirill, can we enable and disable this feature at runtime to avoid
>  restarting the very same nodes?
> 
>  [1]
  https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/



Re: Extended logging for rebalance performance analysis

2020-07-03 Thread ткаленко кирилл
LongMetric startTime = mreg.findMetric("RebalancingStartTime");
LongMetric lastCancelledTime = mreg.findMetric("RebalancingLastCancelledTime");
LongMetric endTime = mreg.findMetric("RebalancingEndTime");
LongMetric partitionsLeft = mreg.findMetric("RebalancingPartitionsLeft");
LongMetric receivedKeys = mreg.findMetric("RebalancingReceivedKeys");
LongMetric receivedBytes = mreg.findMetric("RebalancingReceivedBytes");
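
For illustration, a hedged sketch of the average-speed calculation on top of
the snippet above (it assumes the rebalance has already completed, so
RebalancingEndTime is set, that LongMetric.value() returns the raw value, and
that the timestamps are in milliseconds):

// Sketch: average rebalance speed in KB/sec from the metrics above.
long durationMs = endTime.value() - startTime.value();

double avgSpeedKBSec = durationMs > 0
    ? (receivedBytes.value() / 1024.0) / (durationMs / 1000.0)
    : 0;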

Hi, Stas!

To calculate the average value, you can use the existing metrics 
"RebalancingStartTime", "RebalancingLastCancelledTime", "RebalancingEndTime", 
"RebalancingPartitionsLeft", "RebalancingReceivedKeys" and 
"RebalancingReceivedBytes".

This also works correctly with the historical rebalance.
Now we can see rebalance type for each group and for each supplier in logs. I 
don't think we should duplicate this information.

[2020-07-03 09:49:31,481][INFO 
][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance 
routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3, 
minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a5957710, 
fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]

03.07.2020, 02:49, "Stanislav Lukyanov" :
> Kirill,
>
> I've looked through the patch.
> Looks good, but it feels like the first thing someone will try to do given 
> bytesRcvd and duration is to divide one by another to get an average speed.
> Do you think it's reasonable to also add it to the logs? Maybe even to the 
> metrics?
>
> Also, this works with historical rebalance, right? Can we specify how much 
> data was transferred via historical or full rebalance from each supplier?
> Maybe even estimate transfer speed in entries and bytes for each rebalance 
> type?
>
> Thanks,
> Stan
>
>>  On 29 Jun 2020, at 11:50, Ivan Rakov  wrote:
>>
>>  +1 to Alex G.
>>
>>  From my experience, the most interesting cases with Ignite rebalancing
>>  happen exactly in production. Given that we already have
>>  detailed rebalancing logging, adding info about rebalance performance looks
>>  like a reasonable improvement. With new logs we'll be able to detect and
>>  investigate situations when rebalance is slow due to uneven suppliers
>>  distribution or network issues.
>>  The option to disable the feature at runtime shouldn't be used often, but it
>>  will keep us on the safe side in case something goes wrong.
>>  The format described in
>>  https://issues.apache.org/jira/browse/IGNITE-12080 looks
>>  good to me.
>>
>>  On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл 
>>  wrote:
>>
>>>  Hello, Alexey!
>>>
>>>  Currently there is no way to disable/enable it, but it seems that the
>>>  logs will not be overloaded, since Alexei Scherbakov's proposal seems
>>>  reasonable and compact. Of course, we could add disabling/enabling of
>>>  statistics collection via JMX, for example.
>>>
>>>  23.06.2020, 18:47, "Alexey Goncharuk" :
  Hello Maxim, folks,

  Wed, 6 May 2020 at 21:01, Maxim Muzafarov :

>  We won't do performance analysis on the production environment. Each
>  time we need performance analysis it will be done on a test
>  environment with verbose logging enabled. Thus I suggest moving these
>  changes to a separate `profiling` module and extend the logging much
>  more without any size limitations. The same as these [2] [3]
>  activities do.

  I strongly disagree with this statement. I am not sure who is meant here
  by 'we', but I see a strong momentum in increasing observability tooling
  that helps people to understand what exactly happens in the production
  environment [1]. Not everybody can afford two identical environments for
  testing. We should make sure users have enough information to understand
  the root cause after the incident happened, and not force them to
>>>  reproduce
  it, let alone make them add another module to the classpath and restart
>>>  the
  nodes.
  I think having this functionality in the core module with the ability to
  disable/enable it is the right approach. Having the information printed
>>>  to
  log is ok, having it in an event that can be sent to a monitoring/tracing
  subsystem is even better.

  Kirill, can we enable and disable this feature at runtime to avoid
  restarting the very same nodes?

  [1]
>>>  https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/


Re: Extended logging for rebalance performance analysis

2020-07-02 Thread Stanislav Lukyanov
Kirill,

I've looked through the patch.
Looks good, but it feels like the first thing someone will try to do given 
bytesRcvd and duration is to divide one by another to get an average speed.
Do you think it's reasonable to also add it to the logs? Maybe even to the 
metrics?

Also, this works with historical rebalance, right? Can we specify how much data 
was transferred via historical or full rebalance from each supplier? 
Maybe even estimate transfer speed in entries and bytes for each rebalance type?

Thanks,
Stan

> On 29 Jun 2020, at 11:50, Ivan Rakov  wrote:
> 
> +1 to Alex G.
> 
> From my experience, the most interesting cases with Ignite rebalancing
> happen exactly in production. Given that we already have
> detailed rebalancing logging, adding info about rebalance performance looks
> like a reasonable improvement. With new logs we'll be able to detect and
> investigate situations when rebalance is slow due to uneven suppliers
> distribution or network issues.
> The option to disable the feature at runtime shouldn't be used often, but it
> will keep us on the safe side in case something goes wrong.
> The format described in
> https://issues.apache.org/jira/browse/IGNITE-12080 looks
> good to me.
> 
> On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл 
> wrote:
> 
>> Hello, Alexey!
>> 
>> Currently there is no way to disable/enable it, but it seems that the
>> logs will not be overloaded, since Alexei Scherbakov's proposal seems
>> reasonable and compact. Of course, we could add disabling/enabling of
>> statistics collection via JMX, for example.
>> 
>> 23.06.2020, 18:47, "Alexey Goncharuk" :
>>> Hello Maxim, folks,
>>> 
>>> Wed, 6 May 2020 at 21:01, Maxim Muzafarov :
>>> 
 We won't do performance analysis on the production environment. Each
 time we need performance analysis it will be done on a test
 environment with verbose logging enabled. Thus I suggest moving these
 changes to a separate `profiling` module and extend the logging much
 more without any size limitations. The same as these [2] [3]
 activities do.
>>> 
>>> I strongly disagree with this statement. I am not sure who is meant here
>>> by 'we', but I see a strong momentum in increasing observability tooling
>>> that helps people to understand what exactly happens in the production
>>> environment [1]. Not everybody can afford two identical environments for
>>> testing. We should make sure users have enough information to understand
>>> the root cause after the incident happened, and not force them to
>> reproduce
>>> it, let alone make them add another module to the classpath and restart
>> the
>>> nodes.
>>> I think having this functionality in the core module with the ability to
>>> disable/enable it is the right approach. Having the information printed
>> to
>>> log is ok, having it in an event that can be sent to a monitoring/tracing
>>> subsystem is even better.
>>> 
>>> Kirill, can we enable and disable this feature at runtime to avoid
>>> restarting the very same nodes?
>>> 
>>> [1]
>> https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/
>> 



Re: Extended logging for rebalance performance analysis

2020-06-29 Thread Ivan Rakov
+1 to Alex G.

From my experience, the most interesting cases with Ignite rebalancing
happen exactly in production. Given that we already have
detailed rebalancing logging, adding info about rebalance performance looks
like a reasonable improvement. With new logs we'll be able to detect and
investigate situations when rebalance is slow due to uneven suppliers
distribution or network issues.
The option to disable the feature at runtime shouldn't be used often, but it
will keep us on the safe side in case something goes wrong.
The format described in
https://issues.apache.org/jira/browse/IGNITE-12080 looks
good to me.

On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл 
wrote:

> Hello, Alexey!
>
> Currently there is no way to disable/enable it, but it seems that the
> logs will not be overloaded, since Alexei Scherbakov's proposal seems
> reasonable and compact. Of course, we could add disabling/enabling of
> statistics collection via JMX, for example.
>
> 23.06.2020, 18:47, "Alexey Goncharuk" :
> > Hello Maxim, folks,
> >
> > Wed, 6 May 2020 at 21:01, Maxim Muzafarov :
> >
> >>  We won't do performance analysis on the production environment. Each
> >>  time we need performance analysis it will be done on a test
> >>  environment with verbose logging enabled. Thus I suggest moving these
> >>  changes to a separate `profiling` module and extend the logging much
> >>  more without any size limitations. The same as these [2] [3]
> >>  activities do.
> >
> >  I strongly disagree with this statement. I am not sure who is meant here
> > by 'we', but I see a strong momentum in increasing observability tooling
> > that helps people to understand what exactly happens in the production
> > environment [1]. Not everybody can afford two identical environments for
> > testing. We should make sure users have enough information to understand
> > the root cause after the incident happened, and not force them to
> reproduce
> > it, let alone make them add another module to the classpath and restart
> the
> > nodes.
> > I think having this functionality in the core module with the ability to
> > disable/enable it is the right approach. Having the information printed
> to
> > log is ok, having it in an event that can be sent to a monitoring/tracing
> > subsystem is even better.
> >
> > Kirill, can we enable and disable this feature at runtime to avoid
> > restarting the very same nodes?
> >
> > [1]
> https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/
>


Re: Extended logging for rebalance performance analysis

2020-06-23 Thread ткаленко кирилл
Hello, Alexey!

Currently there is no way to disable/enable it, but it seems that the logs
will not be overloaded, since Alexei Scherbakov's proposal seems reasonable and
compact. Of course, we could add disabling/enabling of statistics collection via
JMX, for example.
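
For illustration only - a minimal sketch of such a JMX toggle using the
standard javax.management API (the bean name and the flag are assumptions,
not something that exists in the patch):

// RebalanceStatsToggleMBean.java
public interface RebalanceStatsToggleMBean {
    boolean isStatisticsEnabled();

    void setStatisticsEnabled(boolean on);
}

// RebalanceStatsToggle.java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicBoolean;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class RebalanceStatsToggle implements RebalanceStatsToggleMBean {
    private static final AtomicBoolean enabled = new AtomicBoolean(true);

    /** Checked by the statistics collection code before aggregating anything. */
    public static boolean statisticsEnabled() {
        return enabled.get();
    }

    @Override public boolean isStatisticsEnabled() {
        return enabled.get();
    }

    @Override public void setStatisticsEnabled(boolean on) {
        enabled.set(on);
    }

    /** Registers the bean so the flag can be flipped from a JMX console. */
    public static void register() throws Exception {
        MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

        srv.registerMBean(new RebalanceStatsToggle(),
            new ObjectName("org.apache.ignite:type=RebalanceStatsToggle"));
    }
}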

23.06.2020, 18:47, "Alexey Goncharuk" :
> Hello Maxim, folks,
>
> Wed, 6 May 2020 at 21:01, Maxim Muzafarov :
>
>>  We won't do performance analysis on the production environment. Each
>>  time we need performance analysis it will be done on a test
>>  environment with verbose logging enabled. Thus I suggest moving these
>>  changes to a separate `profiling` module and extend the logging much
>>  more without any size limitations. The same as these [2] [3]
>>  activities do.
>
>  I strongly disagree with this statement. I am not sure who is meant here
> by 'we', but I see a strong momentum in increasing observability tooling
> that helps people to understand what exactly happens in the production
> environment [1]. Not everybody can afford two identical environments for
> testing. We should make sure users have enough information to understand
> the root cause after the incident happened, and not force them to reproduce
> it, let alone make them add another module to the classpath and restart the
> nodes.
> I think having this functionality in the core module with the ability to
> disable/enable it is the right approach. Having the information printed to
> log is ok, having it in an event that can be sent to a monitoring/tracing
> subsystem is even better.
>
> Kirill, can we enable and disable this feature at runtime to avoid
> restarting the very same nodes?
>
> [1] https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/


Re: Extended logging for rebalance performance analysis

2020-06-23 Thread Alexey Goncharuk
Hello Maxim, folks,

Wed, 6 May 2020 at 21:01, Maxim Muzafarov :

> We won't do performance analysis on the production environment. Each
> time we need performance analysis it will be done on a test
> environment with verbose logging enabled. Thus I suggest moving these
> changes to a separate `profiling` module and extend the logging much
> more without any size limitations. The same as these [2] [3]
> activities do.
>

 I strongly disagree with this statement. I am not sure who is meant here
by 'we', but I see a strong momentum in increasing observability tooling
that helps people to understand what exactly happens in the production
environment [1]. Not everybody can afford two identical environments for
testing. We should make sure users have enough information to understand
the root cause after the incident happened, and not force them to reproduce
it, let alone make them add another module to the classpath and restart the
nodes.
I think having this functionality in the core module with the ability to
disable/enable it is the right approach. Having the information printed to
log is ok, having it in an event that can be sent to a monitoring/tracing
subsystem is even better.

Kirill, can we enable and disable this feature at runtime to avoid restarting
the very same nodes?

[1] https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/


Re: Extended logging for rebalance performance analysis

2020-06-23 Thread Alexei Scherbakov
Hi, Kirill.

Looks good to me.

Tue, 23 Jun 2020 at 18:05, ткаленко кирилл :

> Hello, Alexey!
>
> I suggest that we decide what we can do within ticket [1].
>
> Add "rebalanceId" and "topVer" related to rebalance to all messages.
>
> Add statistical information to a log message:
> [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing
> [rebalanceId=1, grp=cache1,
> supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, partitions=5, entries=100,
> duration=12ms,
> bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> progress=1/2]
>
> Add a message to log after rebalancing for all cache groups is complete:
> [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
> [rebalanceId=2, partitions=10, entries=200, duration=50ms, bytesRcvd=10M]
>
> Any comments or suggestions?
>
> [1] - https://issues.apache.org/jira/browse/IGNITE-12080
>
> 20.05.2020, 23:08, "ткаленко кирилл" :
> > Hello, Alexey! Unfortunately, my response was delayed.
> >
> > Point 2: We can do as you suggested; I think it is still worth
> specifying how many partitions were obtained.
> >
> > [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
> > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, partitions=5,
> entries=100, duration=12ms,
> > bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> > progress=1/2]
> >
> > Point 3: Is it "rebalanceId"?
> >
> > Point 5: I think we can output a summary for each supplier, so the reader
> doesn't have to keep it in mind.
> >
> > [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
> > [rebalanceId=2, [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1,
> partitions=5, entries=100, duration=12ms, bytesRcvd=5M],
> > [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba2, partitions=5,
> entries=100, duration=12ms, bytesRcvd=5M]]
> >
> > I can add "rebalanceId" to each message that you gave above.
> >
> > A detailed message will help us understand how well the suppliers
> were selected.
> >
> > 06.05.2020, 22:08, "Alexei Scherbakov" :
> >>  Hello.
> >>
> >>  Let's look at existing rebalancing log for a single group:
> >>
> >>  [2020-05-06 20:56:36,999][INFO ][...] Rebalancing scheduled
> >>  [order=[ignite-sys-cache, cache1, cache2, default],
> >>  top=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> >>  evt=DISCOVERY_CUSTOM_EVT, node=9d9edb7b-eb01-47a1-8ff9-fef715d2]
> >>  ...
> >>  [2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1,
> >>  mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1,
> >>  partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3,
> >>  minorTopVer=1]]
> >>  [2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1,
> >>  mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50,
> >>  partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3,
> >>  minorTopVer=1]]
> >>  [2020-05-06 20:56:37,036][INFO ][...] Starting rebalance routine
> [cache1,
> >>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> >>  supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, fullPartitions=[1, 5,
> 7, 9,
> >>  11, 13, 15, 23, 27, 29, 31], histPartitions=[]]
> >>  [2020-05-06 20:56:37,037][INFO ][...] Starting rebalance routine
> [cache1,
> >>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> >>  supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, fullPartitions=[6, 8,
> 10,
> >>  16, 18, 20, 22, 24, 26, 28], histPartitions=[]]
> >>  [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing
> [grp=cache1,
> >>  supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1,
> >>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=1/2]
> >>  [2020-05-06 20:56:37,046][INFO ][...] Completed (final) rebalancing
> >>  [grp=cache1, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50,
> >>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=2/2]
> >>  [2020-05-06 20:56:37,048][INFO ][...] Completed rebalance future:
> >>  RebalanceFuture [grp=CacheGroupContext [grp=cache1],
> >>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> rebalanceId=2,
> >>  routines=2]
> >>
> > From these logs I can already get answers to 1 and 4.
> > The logs look concise and easy to read and understand, and should
> > remain that way.
> >>
> >>  But I think some proposed improvements can be done here without harm.
> >>
> >>  2. OK, let's add it to supplier info per cache with additional info:
> >>
> >>  [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing
> [grp=cache1,
> >>  supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, entries=100,
> duration=12ms,
> >>  bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> >>  progress=1/2]
> >>
> >>  3. This information is already printed to log.
> >>
> >>  5. OK, let's add a line which is printed after all groups in the chain
> are
> >>  rebalanced or cancelled with a summary:
> >>
> >>  [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
> >  [rebalanceId=2, entries=200, duration=50ms, bytesRcvd=10M]

Re: Extended logging for rebalance performance analysis

2020-06-23 Thread ткаленко кирилл
Hello, Alexey!

I suggest that we decide what we can do within ticket [1].

Add "rebalanceId" and "topVer" related to rebalance to all messages.

Add statistical information to a log message:
[2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [rebalanceId=1, 
grp=cache1,
supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, partitions=5, entries=100, 
duration=12ms,
bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
progress=1/2]

Add a message to log after rebalancing for all cache groups is complete:
[2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
[rebalanceId=2, partitions=10, entries=200, duration=50ms, bytesRcvd=10M]

Any comments or suggestions?

[1] - https://issues.apache.org/jira/browse/IGNITE-12080

20.05.2020, 23:08, "ткаленко кирилл" :
> Hello, Alexey! Unfortunately, my response was delayed.
>
> Point 2: We can do as you suggested; I think it is still worth specifying
> how many partitions were obtained.
>
> [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
> supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, partitions=5, entries=100, 
> duration=12ms,
> bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> progress=1/2]
>
> Point 3: Is it "rebalanceId"?
>
> Point 5: I think we can output a summary for each supplier, so the reader
> doesn't have to keep it in mind.
>
> [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
> [rebalanceId=2, [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, partitions=5, 
> entries=100, duration=12ms, bytesRcvd=5M],
> [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba2, partitions=5, entries=100, 
> duration=12ms, bytesRcvd=5M]]
>
> I can add "rebalanceId" to each message that you gave above.
>
> A detailed message will help us understand how well the suppliers were 
> selected.
>
> 06.05.2020, 22:08, "Alexei Scherbakov" :
>>  Hello.
>>
>>  Let's look at existing rebalancing log for a single group:
>>
>>  [2020-05-06 20:56:36,999][INFO ][...] Rebalancing scheduled
>>  [order=[ignite-sys-cache, cache1, cache2, default],
>>  top=AffinityTopologyVersion [topVer=3, minorTopVer=1],
>>  evt=DISCOVERY_CUSTOM_EVT, node=9d9edb7b-eb01-47a1-8ff9-fef715d2]
>>  ...
>>  [2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1,
>>  mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1,
>>  partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3,
>>  minorTopVer=1]]
>>  [2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1,
>>  mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50,
>>  partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3,
>>  minorTopVer=1]]
>>  [2020-05-06 20:56:37,036][INFO ][...] Starting rebalance routine [cache1,
>>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
>>  supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, fullPartitions=[1, 5, 7, 9,
>>  11, 13, 15, 23, 27, 29, 31], histPartitions=[]]
>>  [2020-05-06 20:56:37,037][INFO ][...] Starting rebalance routine [cache1,
>>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
>>  supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, fullPartitions=[6, 8, 10,
>>  16, 18, 20, 22, 24, 26, 28], histPartitions=[]]
>>  [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
>>  supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1,
>>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=1/2]
>>  [2020-05-06 20:56:37,046][INFO ][...] Completed (final) rebalancing
>>  [grp=cache1, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50,
>>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=2/2]
>>  [2020-05-06 20:56:37,048][INFO ][...] Completed rebalance future:
>>  RebalanceFuture [grp=CacheGroupContext [grp=cache1],
>>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], rebalanceId=2,
>>  routines=2]
>>
> > From these logs I can already get answers to 1 and 4.
> > The logs look concise and easy to read and understand, and should
> > remain that way.
>>
>>  But I think some proposed improvements can be done here without harm.
>>
>>  2. OK, let's add it to supplier info per cache with additional info:
>>
>>  [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
>>  supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, entries=100, duration=12ms,
>>  bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
>>  progress=1/2]
>>
>>  3. This information is already printed to log.
>>
>>  5. OK, let's add a line which is printed after all groups in the chain are
>>  rebalanced or cancelled with a summary:
>>
>>  [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
>>  [rebalanceId=2, entries=200, duration=50ms, bytesRcvd=10M]
>>
>>  Another thing I would suggest: add rebalanceId to all messages for ease of
> > grepping and to distinguish multiple rebalances for the same topology version.
>>
>>  Regarding the "detailed" message - I wasn't able to figure out any case
> > where it is useful. Not sure if we need it at all.

Re: Extended logging for rebalance performance analysis

2020-05-26 Thread Alexei Scherbakov
Hi, Kirill.

2. Ok for me.

3. We already have a log message about rebalance cancellation for a specific
rebalanceId and a log message about a new rebalancing starting with the next
rebalanceId.

5. We already have summary information per group for each supplier, added in
p. 2.
I would avoid duplication here or put it under debug logging.

Information about selected suppliers can already be obtained from logs:
[2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1,
mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1,
partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3,
minorTopVer=1]]
[2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1,
mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50,
partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3,
minorTopVer=1]]

so I still do not see any value in the "detailed" message.

I think this is enough to understand the rebalancing flow using proper grepping
by topVer and rebalanceId.

All additional aggregation should be done by external tools using
rebalancing metrics.
I'm on the same page with Maxim Muzafarov here.



Wed, 20 May 2020 at 23:08, ткаленко кирилл :

> Hello, Alexey! Unfortunately, my response was delayed.
>
> Point 2: We can do as you suggested; I think it is still worth specifying
> how many partitions were obtained.
>
> [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
> supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, partitions=5, entries=100,
> duration=12ms,
> bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> progress=1/2]
>
> Point 3: Is it "rebalanceId"?
>
> Point 5: I think we can output a summary for each supplier, so the reader
> doesn't have to keep it in mind.
>
> [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
> [rebalanceId=2, [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1,
> partitions=5, entries=100, duration=12ms, bytesRcvd=5M],
> [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba2, partitions=5, entries=100,
> duration=12ms, bytesRcvd=5M]]
>
> I can add "rebalanceId" to each message that you gave above.
>
> A detailed message will help us understand how well the suppliers
> were selected.
>
> 06.05.2020, 22:08, "Alexei Scherbakov" :
> > Hello.
> >
> > Let's look at existing rebalancing log for a single group:
> >
> > [2020-05-06 20:56:36,999][INFO ][...] Rebalancing scheduled
> > [order=[ignite-sys-cache, cache1, cache2, default],
> > top=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> > evt=DISCOVERY_CUSTOM_EVT, node=9d9edb7b-eb01-47a1-8ff9-fef715d2]
> > ...
> > [2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1,
> > mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1,
> > partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3,
> > minorTopVer=1]]
> > [2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1,
> > mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50,
> > partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3,
> > minorTopVer=1]]
> > [2020-05-06 20:56:37,036][INFO ][...] Starting rebalance routine [cache1,
> > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, fullPartitions=[1, 5, 7,
> 9,
> > 11, 13, 15, 23, 27, 29, 31], histPartitions=[]]
> > [2020-05-06 20:56:37,037][INFO ][...] Starting rebalance routine [cache1,
> > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> > supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, fullPartitions=[6, 8, 10,
> > 16, 18, 20, 22, 24, 26, 28], histPartitions=[]]
> > [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
> > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1,
> > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=1/2]
> > [2020-05-06 20:56:37,046][INFO ][...] Completed (final) rebalancing
> > [grp=cache1, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50,
> > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=2/2]
> > [2020-05-06 20:56:37,048][INFO ][...] Completed rebalance future:
> > RebalanceFuture [grp=CacheGroupContext [grp=cache1],
> > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], rebalanceId=2,
> > routines=2]
> >
> > From these logs I can already get answers to 1 and 4.
> > The logs look concise and easy to read and understand, and should
> > remain that way.
> >
> > But I think some proposed improvements can be done here without harm.
> >
> > 2. OK, let's add it to supplier info per cache with additional info:
> >
> > [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
> > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, entries=100,
> duration=12ms,
> > bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> > progress=1/2]
> >
> > 3. This information is already printed to log.
> >
> > 5. OK, let's add a line which is printed after all groups in the chain
> are
> > rebalanced or cancelled with a summary:
> >
> > [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
> > [rebalanceId=2, entries=200, duration=50ms, bytesRcvd=10M]

Re: Extended logging for rebalance performance analysis

2020-05-20 Thread ткаленко кирилл
Hello, Alexey! Unfortunately, my response was delayed.

Point 2: We can do as you suggested; I think it is still worth specifying how
many partitions were obtained.

[2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, partitions=5, entries=100, 
duration=12ms,
bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
progress=1/2]

Point 3: Is it "rebalanceId"?

Point 5: I think we can output a summary for each supplier, so the reader
doesn't have to keep it in mind.

[2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
[rebalanceId=2, [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, partitions=5, 
entries=100, duration=12ms, bytesRcvd=5M], 
[supplier=94a3fcbc-18d5-4c64-b0ab-4313aba2, partitions=5, entries=100, 
duration=12ms, bytesRcvd=5M]]

I can add "rebalanceId" to each message that you gave above.

A detailed message will help us understand how well the suppliers were
selected.

06.05.2020, 22:08, "Alexei Scherbakov" :
> Hello.
>
> Let's look at existing rebalancing log for a single group:
>
> [2020-05-06 20:56:36,999][INFO ][...] Rebalancing scheduled
> [order=[ignite-sys-cache, cache1, cache2, default],
> top=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> evt=DISCOVERY_CUSTOM_EVT, node=9d9edb7b-eb01-47a1-8ff9-fef715d2]
> ...
> [2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1,
> mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1,
> partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3,
> minorTopVer=1]]
> [2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1,
> mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50,
> partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3,
> minorTopVer=1]]
> [2020-05-06 20:56:37,036][INFO ][...] Starting rebalance routine [cache1,
> topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, fullPartitions=[1, 5, 7, 9,
> 11, 13, 15, 23, 27, 29, 31], histPartitions=[]]
> [2020-05-06 20:56:37,037][INFO ][...] Starting rebalance routine [cache1,
> topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, fullPartitions=[6, 8, 10,
> 16, 18, 20, 22, 24, 26, 28], histPartitions=[]]
> [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
> supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1,
> topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=1/2]
> [2020-05-06 20:56:37,046][INFO ][...] Completed (final) rebalancing
> [grp=cache1, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50,
> topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=2/2]
> [2020-05-06 20:56:37,048][INFO ][...] Completed rebalance future:
> RebalanceFuture [grp=CacheGroupContext [grp=cache1],
> topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], rebalanceId=2,
> routines=2]
>
> From these logs I can already get answers to 1 and 4.
> The logs look concise and easy to read and understand, and should
> remain that way.
>
> But I think some proposed improvements can be done here without harm.
>
> 2. OK, let's add it to supplier info per cache with additional info:
>
> [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
> supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, entries=100, duration=12ms,
> bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> progress=1/2]
>
> 3. This information is already printed to log.
>
> 5. OK, let's add a line which is printed after all groups in the chain are
> rebalanced or cancelled with a summary:
>
> [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
> [rebalanceId=2, entries=200, duration=50ms, bytesRcvd=10M]
>
> Another thing I would suggest: add rebalanceId to all messages for ease of
> grepping and to distinguish multiple rebalances for the same topology version.
>
> Regarding the "detailed" message - I wasn't able to figure out any case
> where it is useful. Not sure if we need it at all.
> Kirill Tkalenko, can you give me an example?
>
Wed, 6 May 2020 at 21:01, Maxim Muzafarov :
>
>>  Kirill,
>>
>>  Thank you for raising this topic. It's true that the rebalance process
>>  still requires additional information for analyzing issues. Please,
>>  don't think that I'm against your changes :-)
>>
>>  * My short answer. *
>>
>>  We won't do performance analysis on the production environment. Each
>>  time we need performance analysis it will be done on a test
>>  environment with verbose logging enabled. Thus I suggest moving these
>>  changes to a separate `profiling` module and extending the logging much
>>  more without any size limitations. The same as these [2] [3]
>>  activities do.
>>
>>  Let's keep the `core` module as simple as possible.
>>  Let's design the right API for accessing rebalance internals for
>>  profiling tools.
>>  Can you, please, remove all changes from your PR [6] which are not
>>  related to 

Re: Extended logging for rebalance performance analysis

2020-05-20 Thread ткаленко кирилл
Hi, Maxim! Sorry for the delayed answer.

The production environment may not be ideal, and problems may occur in it. I 
only suggest adding a little logging to understand some of the cases that may 
arise during operation. In addition, this functionality can be disabled by a 
system property. As for the variable renamings, they were done to improve code 
readability.

We can do it via metrics, but it is unclear whether we need to store this for 
all suppliers and for how long. In my case, I do a simple aggregation and 
clear it after the output.
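
For illustration, a minimal sketch of such an aggregation (all class and field 
names here are hypothetical, not the actual PR code): counters are accumulated 
per supplier while rebalancing runs and cleared right after the summary is 
written to the log.

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical per-supplier accumulator: aggregate while rebalancing, clear after output. */
class RebalanceStatsAggregator {
    /** Per-supplier counters. */
    static class SupplierStats {
        final LongAdder partitions = new LongAdder();
        final LongAdder entries = new LongAdder();
        final LongAdder bytes = new LongAdder();
    }

    private final Map<UUID, SupplierStats> bySupplier = new ConcurrentHashMap<>();

    /** Called for each received supply message. */
    void onSupplyMessage(UUID supplierId, int parts, long entries, long bytes) {
        SupplierStats s = bySupplier.computeIfAbsent(supplierId, id -> new SupplierStats());

        s.partitions.add(parts);
        s.entries.add(entries);
        s.bytes.add(bytes);
    }

    /** Builds the summary for the log and clears the accumulated state. */
    String summaryAndClear() {
        StringBuilder sb = new StringBuilder("Supplier statistics:");

        bySupplier.forEach((id, s) -> sb.append(" [supplier=").append(id)
            .append(", partitions=").append(s.partitions.sum())
            .append(", entries=").append(s.entries.sum())
            .append(", bytesRcvd=").append(s.bytes.sum()).append(']'));

        bySupplier.clear(); // Nothing is stored after the output.

        return sb.toString();
    }
}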


06.05.2020, 21:01, "Maxim Muzafarov" :
> Kirill,
>
> Thank you for raising this topic. It's true that the rebalance process
> still requires additional information for analyzing issues. Please,
> don't think that I'm against your changes :-)
>
> * My short answer. *
>
> We won't do performance analysis on the production environment. Each
> time we need performance analysis it will be done on a test
> environment with verbose logging enabled. Thus I suggest moving these
> changes to a separate `profiling` module and extending the logging much
> more without any size limitations. The same as these [2] [3]
> activities do.
>
> Let's keep the `core` module as simple as possible.
> Let's design the right API for accessing rebalance internals for
> profiling tools.
> Can you, please, remove all changes from your PR [6] which are not
> related to the proposed topic? (e.g. variable renamings).
>
> * The long answer. *
>
> Here are my thoughts on this. There are two different types of issues
> in the rebalance process. The first case must be covered by daily
> monitoring subsystems, the second case must be covered by additional
> profiling tools.
> 1. errors during the rebalancing (e.g. rebalance does not happen when required)
> 2. performance rebalancing issues (e.g. rebalance is slow)
>
> Daily monitoring tools (JMX, Logging) are always turned on and
> shouldn't require additional system resources themselves. Since these
> metrics must be lightweight, no internal aggregation machinery is
> used on them. All these metrics are collected from each node
> independently. Please, take a look at this issue [1] which covers most
> of your questions mentioned above.
>
> For all available metrics, we can configure LogExporterSpi, so they
> will be available in logs.
>
>>  1)How long was the rebalance (divided by supplier)?
>
> rebalancingStartTime, rebalancingEndTime already exist for cache groups [4].
> We can add the same for suppliers.
>
>>  2)How many records and bytes per supplier were rebalanced?
>
> We already have rebalancingReceivedKeys, rebalancingReceivedBytes [4]
> We will have rebalancingExpectedKeys [5].
> We can add a new metric per cache keys, per supplier.
>
>>  3)How many times did the rebalance restart?
>
> rebalancingLastCancelledTime [4] metric exists.
> Do we need to keep historical data on it?
>
>>  4)Which partitions were rebalanced and from which nodes did they receive 
>> them?
>
> Let's print this information to the log before the rebalance process
> starts. This will be helpful information and does not require a lot of
> changes.
>
>>  5)When did the rebalance for all cache groups end?
>
> This metric may be aggregated from rebalancingEndTime [4] by pulling
> it from all nodes for all caches.
>
> Performance rebalancing issues are related to profiling tools. They
> may require additional system resources and definitely require a
> dedicated environment for tests. We can't do performance analysis on
> production environments due to performance impact.
> I see some disadvantages of adding such tools to production code:
> - verbose logging may affect performance.
> - the problematic process may become even worse if an automatic
> threshold suddenly turns on.
> - new code changes will require additional efforts to keep logging up-to-date.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-12183
> [2] https://issues.apache.org/jira/browse/IGNITE-12666
> [3] 
> https://cwiki.apache.org/confluence/display/IGNITE/Cluster+performance+profiling+tool
> [4] https://issues.apache.org/jira/browse/IGNITE-12193
> [5] https://issues.apache.org/jira/browse/IGNITE-12194
> [6] https://github.com/apache/ignite/pull/7705/files
>
> On Wed, 6 May 2020 at 19:50, Ivan Rakov  wrote:
>>  Hi,
>>
>>  IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
>>  > for the rebalance duration of a cache group, after which the partition
>>  > distribution is output; set in milliseconds, default value is 10 minutes.
>>
>>   Does it mean that if the rebalancing process took less than 10 minutes,
>>  only a short version of the message (with supplier statistics) will show up?
>>
>>  In general, I have no objections.
>>
>>  On Mon, May 4, 2020 at 10:38 AM ткаленко кирилл 
>>  wrote:
>>
>>  > Hi, Igniters!
>>  >
>>  > I'd like to share a new small feature in AI [1].
>>  >
>>  > Current rebalance logging does not allow you to quickly answer the following
>>  > questions:
>>  > 1)How long was the rebalance (divided by supplier)?
>>  > 2)How many 

Re: Extended logging for rebalance performance analysis

2020-05-07 Thread ткаленко кирилл
Hi! Yes, that's exactly it.

06.05.2020, 19:50, "Ivan Rakov" :
> Hi,
>
> IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
>>  for the rebalance duration of a cache group, after which the partition
>>  distribution is output; set in milliseconds, default value is 10 minutes.
>
>  Does it mean that if the rebalancing process took less than 10 minutes,
> only a short version of the message (with supplier statistics) will show up?
>
> In general, I have no objections.
>
> On Mon, May 4, 2020 at 10:38 AM ткаленко кирилл 
> wrote:
>
>>  Hi, Igniters!
>>
>>  I'd like to share a new small feature in AI [1].
>>
>>  Current rebalance logging does not allow you to quickly answer the following
>>  questions:
>>  1)How long was the rebalance (divided by supplier)?
>>  2)How many records and bytes per supplier were rebalanced?
>>  3)How many times did the rebalance restart?
>>  4)Which partitions were rebalanced and from which nodes did they receive
>>  them?
>>  5)When did the rebalance for all cache groups end?
>>
>>  What you can see in logs now:
>>
>>  1)Starting rebalance with order of cache groups.
>>  Rebalancing scheduled [order=[ignite-sys-cache, grp1, grp0],
>>  top=AffinityTopologyVersion [topVer=2, minorTopVer=0], force=false,
>>  evt=NODE_JOINED, node=c2146a04-dc23-4bc9-870d-dfbb55c1]
>>
>>  2)Start rebalance of cache group from a specific supplier, specifying
>>  partition ids and mode - historical or full.
>>  Starting rebalance routine [ignite-sys-cache,
>>  topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0],
>>  supplier=8c525892-703b-4fc4-b28b-b2f13970, fullPartitions=[0-99],
>>  histPartitions=[]]
>>
>>  3)Getting partial or complete partitions of cache group.
>>  Completed rebalancing [grp=ignite-sys-cache,
>>  supplier=8c525892-703b-4fc4-b28b-b2f13970,
>>  topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=1/2]
>>  Completed (final) rebalancing [grp=ignite-sys-cache,
>>  supplier=c2146a04-dc23-4bc9-870d-dfbb55c1,
>>  topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=2/2]
>>
>>  4)End rebalance of cache group.
>>  Completed rebalance future: RebalanceFuture [grp=CacheGroupContext
>>  [grp=ignite-sys-cache], topVer=AffinityTopologyVersion [topVer=2,
>>  minorTopVer=0], rebalanceId=1, routines=1, receivedBytes=1200,
>>  receivedKeys=0, partitionsLeft=0, startTime=1588519707607, endTime=-1,
>>  lastCancelledTime=-1]
>>
>>  Rebalance statistics:
>>
>>  To speed up rebalance analysis, statistics will be output for each cache
>>  group and in total for all cache groups.
>>  If the rebalance duration for a cache group is greater than the threshold
>>  value, the partition distribution is output.
>>  Statistics will allow you to analyze the rebalance duration for each supplier
>>  to understand which of them has been transmitting data for the longest time.
>>
>>  System properties are used to output statistics:
>>
>>  IGNITE_QUIET - to output statistics, the value must be false;
>>  IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
>>  for the rebalance duration of a cache group, after which the partition
>>  distribution is output; set in milliseconds, default value is 10 minutes.
>>
>>  Statistics examples:
>>
>>  Successful full and historical rebalance of a cache group, without
>>  partition distribution.
>>  Rebalance information per cache group (successful rebalance): [id=3181548,
>>  name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13
>>  10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0, p=5,
>>  d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - entries, b -
>>  bytes, d - duration, h - historical, nodeId mapping
>>  (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
>>  [1=rebalancing.RebalanceStatisticsTest0]
>>  Rebalance information per cache group (successful rebalance): [id=3181547,
>>  name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13
>>  15:01:44,116, d=116 ms, restarted=0] Supplier statistics: [nodeId=0, hp=10,
>>  he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - entries, b -
>>  bytes, d - duration, h - historical, nodeId mapping
>>  (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
>>
>>  Successful full and historical rebalance of a cache group, with partition
>>  distribution.
>>  Rebalance information per cache group (successful rebalance): [id=3181548,
>>  name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13
>>  10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0, p=5,
>>  d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - entries, b -
>>  bytes, d - duration, h - historical, nodeId mapping
>>  (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
>>  [1=rebalancing.RebalanceStatisticsTest0] Rebalance duration was greater
>>  than 5 ms, printing detailed information about partitions distribution
>>  (threshold can be changed by setting number of milliseconds into
>>  IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD) 0 =

Re: Extended logging for rebalance performance analysis

2020-05-06 Thread Alexei Scherbakov
Hello.

Let's look at existing rebalancing log for a single group:

[2020-05-06 20:56:36,999][INFO ][...] Rebalancing scheduled
[order=[ignite-sys-cache, cache1, cache2, default],
top=AffinityTopologyVersion [topVer=3, minorTopVer=1],
evt=DISCOVERY_CUSTOM_EVT, node=9d9edb7b-eb01-47a1-8ff9-fef715d2]
...
[2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1,
mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1,
partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3,
minorTopVer=1]]
[2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1,
mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50,
partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3,
minorTopVer=1]]
[2020-05-06 20:56:37,036][INFO ][...] Starting rebalance routine [cache1,
topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, fullPartitions=[1, 5, 7, 9,
11, 13, 15, 23, 27, 29, 31], histPartitions=[]]
[2020-05-06 20:56:37,037][INFO ][...] Starting rebalance routine [cache1,
topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, fullPartitions=[6, 8, 10,
16, 18, 20, 22, 24, 26, 28], histPartitions=[]]
[2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1,
topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=1/2]
[2020-05-06 20:56:37,046][INFO ][...] Completed (final) rebalancing
[grp=cache1, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50,
topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=2/2]
[2020-05-06 20:56:37,048][INFO ][...] Completed rebalance future:
RebalanceFuture [grp=CacheGroupContext [grp=cache1],
topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], rebalanceId=2,
routines=2]

From these logs I can already get answers to 1 and 4.
The logs look concise and easy to read and understand, and should
remain that way.

But I think some proposed improvements can be done here without harm.

2. OK, let's add it to supplier info per cache with additional info:

[2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, entries=100, duration=12ms,
bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
progress=1/2]

3. This information is already printed to the log.

5. OK, let's add a summary line which is printed after all groups in the
chain are rebalanced or cancelled:

[2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
[rebalanceId=2, entries=200, duration=50ms, bytesRcvd=10M]

Another thing I would suggest: add rebalanceId to all messages for ease of
grepping and to distinguish multiple rebalances for the same topology version.

Regarding the "detailed" message - I wasn't able to figure out any case
where it is useful. Not sure if we need it at all.
Kirill Tkalenko, can you give me an example?

Wed, 6 May 2020 at 21:01, Maxim Muzafarov :

> Kirill,
>
>
> Thank you for raising this topic. It's true that the rebalance process
> still requires additional information for analyzing issues. Please,
> don't think that I'm against your changes :-)
>
> * My short answer. *
>
> We won't do performance analysis on the production environment. Each
> time we need performance analysis it will be done on a test
> environment with verbose logging enabled. Thus I suggest moving these
> changes to a separate `profiling` module and extending the logging much
> more without any size limitations. The same as these [2] [3]
> activities do.
>
> Let's keep the `core` module as simple as possible.
> Let's design the right API for accessing rebalance internals for
> profiling tools.
> Can you, please, remove all changes from your PR [6] which are not
> related to the proposed topic? (e.g. variable renamings).
>
>
> * The long answer. *
>
> Here are my thoughts on this. There are two different types of issues
> in the rebalance process. The first case must be covered by daily
> monitoring subsystems, the second case must be covered by additional
> profiling tools.
> 1. errors during the rebalancing (e.g. rebalance does not happen when required)
> 2. performance rebalancing issues (e.g. rebalance is slow)
>
> Daily monitoring tools (JMX, Logging) are always turned on and
> shouldn't require additional system resources themselves. Since these
> metrics must be lightweight, no internal aggregation machinery is
> used on them. All these metrics are collected from each node
> independently. Please, take a look at this issue [1] which covers most
> of your questions mentioned above.
>
> For all available metrics, we can configure LogExporterSpi, so they
> will be available in logs.
>
> > 1)How long was the rebalance (divided by supplier)?
> rebalancingStartTime, rebalancingEndTime already exist for cache groups
> [4].
> We can add the same for suppliers.
>
> > 2)How many records and bytes per supplier were rebalanced?
> We 

Re: Extended logging for rebalance performance analysis

2020-05-06 Thread Maxim Muzafarov
Kirill,


Thank you for raising this topic. It's true that the rebalance process
still requires additional information for analyzing issues. Please,
don't think that I'm against your changes :-)

* My short answer. *

We won't do performance analysis on the production environment. Each
time we need performance analysis it will be done on a test
environment with verbose logging enabled. Thus I suggest moving these
changes to a separate `profiling` module and extending the logging much
more without any size limitations. The same as these [2] [3]
activities do.

Let's keep the `core` module as simple as possible.
Let's design the right API for accessing rebalance internals for
profiling tools.
Can you, please, remove all changes from your PR [6] which are not
related to the proposed topic? (e.g. variable renamings).


* The long answer. *

Here are my thoughts on this. There are two different types of issues
in the rebalance process. The first case must be covered by daily
monitoring subsystems, the second case must be covered by additional
profiling tools.
1. errors during the rebalancing (e.g. rebalance does not happen when required)
2. performance rebalancing issues (e.g. rebalance is slow)

Daily monitoring tools (JMX, Logging) are always turned on and
shouldn't require additional system resources themselves. Since these
metrics must be lightweight, no internal aggregation machinery is
used on them. All these metrics are collected from each node
independently. Please, take a look at this issue [1] which covers most
of your questions mentioned above.

For all available metrics, we can configure LogExporterSpi, so they
will be available in logs.
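
For example, a minimal sketch of such a configuration (assuming Ignite 2.8+, 
where LogExporterSpi and setMetricExporterSpi are available; the 30-second 
period is an arbitrary choice for illustration):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.metric.log.LogExporterSpi;

public class LogMetricsStartup {
    public static void main(String[] args) {
        // Dump all metric registries to the node log every 30 seconds.
        LogExporterSpi logExporter = new LogExporterSpi();
        logExporter.setPeriod(30_000);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setMetricExporterSpi(logExporter);

        Ignition.start(cfg);
    }
}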

> 1)How long was the rebalance (divided by supplier)?
rebalancingStartTime, rebalancingEndTime already exist for cache groups [4].
We can add the same for suppliers.

> 2)How many records and bytes per supplier were rebalanced?
We already have rebalancingReceivedKeys, rebalancingReceivedBytes [4]
We will have rebalancingExpectedKeys [5].
We can add a new metric per cache keys, per supplier.
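
To read these metrics from outside the node, a minimal sketch using plain JMX 
(assuming the node also exports metrics via JmxMetricExporterSpi; the 
"cacheGroups" name filter and the Rebalancing* attribute names below are 
assumptions about the registry layout, not a documented contract):

import java.lang.management.ManagementFactory;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class RebalanceMetricsDump {
    public static void main(String[] args) throws Exception {
        MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

        // Scan all registered beans and keep only cache group metric registries.
        for (ObjectName name : srv.queryNames(new ObjectName("*:*"), null)) {
            if (!name.getCanonicalName().contains("cacheGroups"))
                continue;

            // Print only rebalancing-related attributes, e.g. RebalancingReceivedBytes.
            for (MBeanAttributeInfo attr : srv.getMBeanInfo(name).getAttributes()) {
                if (attr.getName().startsWith("Rebalancing"))
                    System.out.println(name + ": " + attr.getName() + " = "
                        + srv.getAttribute(name, attr.getName()));
            }
        }
    }
}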

> 3)How many times did the rebalance restart?
rebalancingLastCancelledTime [4] metric exists.
Do we need to keep historical data on it?

> 4)Which partitions were rebalanced and from which nodes did they receive them?
Let's print this information to the log before the rebalance process
starts. This will be helpful information and does not require a lot of
changes.

> 5)When did the rebalance for all cache groups end?
This metric may be aggregated from rebalancingEndTime [4] by pulling
it from all nodes for all caches.
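
For illustration, a minimal sketch of such an aggregation (readLocalEndTime() 
is a hypothetical local reader, e.g. backed by JMX or the metric registry; the 
maximum over all nodes approximates when rebalancing ended cluster-wide):

import java.util.Collection;
import org.apache.ignite.Ignite;
import org.apache.ignite.lang.IgniteCallable;

public class RebalanceEndAggregator {
    /** Pulls the local rebalancingEndTime from every node and takes the maximum. */
    public static long clusterRebalanceEndTime(Ignite ignite) {
        IgniteCallable<Long> job = RebalanceEndAggregator::readLocalEndTime;

        // One value per node in the cluster group.
        Collection<Long> endTimes = ignite.compute().broadcast(job);

        return endTimes.stream().mapToLong(Long::longValue).max().orElse(-1L);
    }

    /** Hypothetical local reader of the rebalancingEndTime metric. */
    private static long readLocalEndTime() {
        return -1L; // Placeholder: read the metric on the local node here.
    }
}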


Performance rebalancing issues are related to profiling tools. They
may require additional system resources and definitely require a
dedicated environment for tests. We can't do performance analysis on
production environments due to performance impact.
I see some disadvantages of adding such tools to production code:
- verbose logging may affect performance.
- the problematic process may become even worse if an automatic
threshold suddenly turns on.
- new code changes will require additional efforts to keep logging up-to-date.


[1] https://issues.apache.org/jira/browse/IGNITE-12183
[2] https://issues.apache.org/jira/browse/IGNITE-12666
[3] 
https://cwiki.apache.org/confluence/display/IGNITE/Cluster+performance+profiling+tool
[4] https://issues.apache.org/jira/browse/IGNITE-12193
[5] https://issues.apache.org/jira/browse/IGNITE-12194
[6] https://github.com/apache/ignite/pull/7705/files

On Wed, 6 May 2020 at 19:50, Ivan Rakov  wrote:
>
> Hi,
>
> IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
> > for the rebalance duration of a cache group, after which the partition
> > distribution is output; set in milliseconds, default value is 10 minutes.
>
>  Does it mean that if the rebalancing process took less than 10 minutes,
> only a short version of the message (with supplier statistics) will show up?
>
> In general, I have no objections.
>
>
> On Mon, May 4, 2020 at 10:38 AM ткаленко кирилл 
> wrote:
>
> > Hi, Igniters!
> >
> > I'd like to share a new small feature in AI [1].
> >
> > Current rebalance logging does not allow you to quickly answer the following
> > questions:
> > 1)How long was the rebalance (divided by supplier)?
> > 2)How many records and bytes per supplier were rebalanced?
> > 3)How many times did the rebalance restart?
> > 4)Which partitions were rebalanced and from which nodes did they receive
> > them?
> > 5)When did the rebalance for all cache groups end?
> >
> > What you can see in logs now:
> >
> > 1)Starting rebalance with order of cache groups.
> > Rebalancing scheduled [order=[ignite-sys-cache, grp1, grp0],
> > top=AffinityTopologyVersion [topVer=2, minorTopVer=0], force=false,
> > evt=NODE_JOINED, node=c2146a04-dc23-4bc9-870d-dfbb55c1]
> >
> > 2)Start rebalance of cache group from a specific supplier, specifying
> > partition ids and mode - historical or full.
> > Starting rebalance routine [ignite-sys-cache,
> > topVer=AffinityTopologyVersion [topVer=2, 

Re: Extended logging for rebalance performance analysis

2020-05-06 Thread Ivan Rakov
Hi,

IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
> for the rebalance duration of a cache group, after which the partition
> distribution is output; set in milliseconds, default value is 10 minutes.

 Does it mean that if the rebalancing process took less than 10 minutes,
only a short version of the message (with supplier statistics) will show up?

In general, I have no objections.


On Mon, May 4, 2020 at 10:38 AM ткаленко кирилл 
wrote:

> Hi, Igniters!
>
> I'd like to share a new small feature in AI [1].
>
> Current rebalance logging does not allow you to quickly answer the following
> questions:
> 1)How long was the rebalance (divided by supplier)?
> 2)How many records and bytes per supplier were rebalanced?
> 3)How many times did the rebalance restart?
> 4)Which partitions were rebalanced and from which nodes did they receive
> them?
> 5)When did the rebalance for all cache groups end?
>
> What you can see in logs now:
>
> 1)Starting rebalance with order of cache groups.
> Rebalancing scheduled [order=[ignite-sys-cache, grp1, grp0],
> top=AffinityTopologyVersion [topVer=2, minorTopVer=0], force=false,
> evt=NODE_JOINED, node=c2146a04-dc23-4bc9-870d-dfbb55c1]
>
> 2)Start rebalance of cache group from a specific supplier, specifying
> partition ids and mode - historical or full.
> Starting rebalance routine [ignite-sys-cache,
> topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0],
> supplier=8c525892-703b-4fc4-b28b-b2f13970, fullPartitions=[0-99],
> histPartitions=[]]
>
> 3)Getting partial or complete partitions of cache group.
> Completed rebalancing [grp=ignite-sys-cache,
> supplier=8c525892-703b-4fc4-b28b-b2f13970,
> topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=1/2]
> Completed (final) rebalancing [grp=ignite-sys-cache,
> supplier=c2146a04-dc23-4bc9-870d-dfbb55c1,
> topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=2/2]
>
> 4)End rebalance of cache group.
> Completed rebalance future: RebalanceFuture [grp=CacheGroupContext
> [grp=ignite-sys-cache], topVer=AffinityTopologyVersion [topVer=2,
> minorTopVer=0], rebalanceId=1, routines=1, receivedBytes=1200,
> receivedKeys=0, partitionsLeft=0, startTime=1588519707607, endTime=-1,
> lastCancelledTime=-1]
>
> Rebalance statistics:
>
> To speed up rebalance analysis, statistics will be output for each cache
> group and in total for all cache groups.
> If the rebalance duration for a cache group is greater than the threshold
> value, the partition distribution is output.
> Statistics will allow you to analyze the rebalance duration for each supplier
> to understand which of them has been transmitting data for the longest time.
>
> System properties are used to output statistics:
>
> IGNITE_QUIET - to output statistics, the value must be false;
> IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
> for the rebalance duration of a cache group, after which the partition
> distribution is output; set in milliseconds, default value is 10 minutes.
>
> Statistics examples:
>
> Successful full and historical rebalance of a cache group, without
> partition distribution.
> Rebalance information per cache group (successful rebalance): [id=3181548,
> name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13
> 10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0, p=5,
> d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - entries, b -
> bytes, d - duration, h - historical, nodeId mapping
> (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> [1=rebalancing.RebalanceStatisticsTest0]
> Rebalance information per cache group (successful rebalance): [id=3181547,
> name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13
> 15:01:44,116, d=116 ms, restarted=0] Supplier statistics: [nodeId=0, hp=10,
> he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - entries, b -
> bytes, d - duration, h - historical, nodeId mapping
> (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
>
> Successful full and historical rebalance of a cache group, with partition
> distribution.
> Rebalance information per cache group (successful rebalance): [id=3181548,
> name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13
> 10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0, p=5,
> d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - entries, b -
> bytes, d - duration, h - historical, nodeId mapping
> (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> [1=rebalancing.RebalanceStatisticsTest0] Rebalance duration was greater
> than 5 ms, printing detailed information about partitions distribution
> (threshold can be changed by setting number of milliseconds into
> IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD) 0 =
> [0,bu,su],[1,bu],[2,pr,su] 1 = [0,bu,su],[1,bu],[2,pr,su] 2 =
> [0,bu,su],[1,bu],[2,pr,su] 3 = [0,bu,su],[1,bu],[2,pr,su] 4 =
> [0,bu,su],[1,bu],[2,pr,su] 5 = [0,bu,su],[1,bu],[2,pr,su] 6 =
> [0,bu,su],[1,bu],[2,pr,su] 7 = 

Extended logging for rebalance performance analysis

2020-05-04 Thread ткаленко кирилл
Hi, Igniters!

I'd like to share a new small feature in AI [1].

Current rebalance logging does not allow you to quickly answer the following 
questions:
1)How long was the rebalance (divided by supplier)?
2)How many records and bytes per supplier were rebalanced?
3)How many times did the rebalance restart?
4)Which partitions were rebalanced and from which nodes did they receive them?
5)When did the rebalance for all cache groups end?

What you can see in logs now:

1)Starting rebalance with order of cache groups.
Rebalancing scheduled [order=[ignite-sys-cache, grp1, grp0], 
top=AffinityTopologyVersion [topVer=2, minorTopVer=0], force=false, 
evt=NODE_JOINED, node=c2146a04-dc23-4bc9-870d-dfbb55c1]

2)Start rebalance of cache group from a specific supplier, specifying partition 
ids and mode - historical or full.
Starting rebalance routine [ignite-sys-cache, topVer=AffinityTopologyVersion 
[topVer=2, minorTopVer=0], supplier=8c525892-703b-4fc4-b28b-b2f13970, 
fullPartitions=[0-99], histPartitions=[]]

3)Getting partial or complete partitions of cache group.
Completed rebalancing [grp=ignite-sys-cache, 
supplier=8c525892-703b-4fc4-b28b-b2f13970, topVer=AffinityTopologyVersion 
[topVer=5, minorTopVer=0], progress=1/2]
Completed (final) rebalancing [grp=ignite-sys-cache, 
supplier=c2146a04-dc23-4bc9-870d-dfbb55c1, topVer=AffinityTopologyVersion 
[topVer=5, minorTopVer=0], progress=2/2]

4)End rebalance of cache group.
Completed rebalance future: RebalanceFuture [grp=CacheGroupContext 
[grp=ignite-sys-cache], topVer=AffinityTopologyVersion [topVer=2, 
minorTopVer=0], rebalanceId=1, routines=1, receivedBytes=1200, receivedKeys=0, 
partitionsLeft=0, startTime=1588519707607, endTime=-1, lastCancelledTime=-1]

Rebalance statistics:

To speed up rebalance analysis, statistics will be output for each cache group 
and in total for all cache groups.
If the rebalance duration for a cache group is greater than the threshold 
value, the partition distribution is output.
Statistics will allow you to analyze the rebalance duration for each supplier 
to understand which of them has been transmitting data for the longest time.

System properties are used to output statistics:

IGNITE_QUIET - to output statistics, the value must be false;
IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold for the 
rebalance duration of a cache group, after which the partition distribution is 
output; set in milliseconds, default value is 10 minutes.
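
For example, a minimal sketch of enabling the statistics output (the 5-minute 
threshold is an arbitrary value; the properties must be set before the node 
starts, usually as JVM options like -DIGNITE_QUIET=false):

public class RebalanceStatsSetup {
    public static void main(String[] args) {
        // Statistics are printed only when IGNITE_QUIET is false.
        System.setProperty("IGNITE_QUIET", "false");

        // Print the partition distribution for cache groups whose rebalance
        // took longer than 5 minutes (default is 10 minutes), in milliseconds.
        System.setProperty("IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD",
            String.valueOf(5 * 60 * 1000));

        // ... start the node after the properties are set, e.g. Ignition.start(cfg).
    }
}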

Statistics examples:

Successful full and historical rebalance of a cache group, without partition 
distribution.
Rebalance information per cache group (successful rebalance): [id=3181548, 
name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13 
10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0, p=5, d=10 
ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - entries, b - bytes, d 
- duration, h - historical, nodeId mapping (nodeId=id,consistentId) 
[0=rebalancing.RebalanceStatisticsTest1] 
[1=rebalancing.RebalanceStatisticsTest0] 
Rebalance information per cache group (successful rebalance): [id=3181547, 
name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13 
15:01:44,116, d=116 ms, restarted=0] Supplier statistics: [nodeId=0, hp=10, 
he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - entries, b - bytes, d 
- duration, h - historical, nodeId mapping (nodeId=id,consistentId) 
[0=rebalancing.RebalanceStatisticsTest0]

Successful full and historical rebalance of a cache group, with partition 
distribution.
Rebalance information per cache group (successful rebalance): [id=3181548, 
name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13 
10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0, p=5, d=10 
ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - entries, b - bytes, d 
- duration, h - historical, nodeId mapping (nodeId=id,consistentId) 
[0=rebalancing.RebalanceStatisticsTest1] 
[1=rebalancing.RebalanceStatisticsTest0] Rebalance duration was greater than 5 
ms, printing detailed information about partitions distribution (threshold can 
be changed by setting number of milliseconds into 
IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD) 0 = 
[0,bu,su],[1,bu],[2,pr,su] 1 = [0,bu,su],[1,bu],[2,pr,su] 2 = 
[0,bu,su],[1,bu],[2,pr,su] 3 = [0,bu,su],[1,bu],[2,pr,su] 4 = 
[0,bu,su],[1,bu],[2,pr,su] 5 = [0,bu,su],[1,bu],[2,pr,su] 6 = 
[0,bu,su],[1,bu],[2,pr,su] 7 = [0,bu,su],[1,bu],[2,pr,su] 8 = 
[0,bu,su],[1,bu],[2,pr,su] 9 = [0,bu,su],[1,bu],[2,pr,su] Aliases: pr - 
primary, bu - backup, su - supplier node, h - historical, nodeId mapping 
(nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1] 
[1=rebalancing.RebalanceStatisticsTest2] 
[2=rebalancing.RebalanceStatisticsTest0] 
Rebalance information per cache group (successful rebalance): [id=3181547, 
name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13 
15:01:44,116, d=116 ms, restarted=0] Supplier statistics: [nodeId=0, hp=10, 
he=300, hb=30267, d=116 ms]