Re: Extended logging for rebalance performance analysis
Discussed in personal correspondence with Stas; we decided to improve the message:

Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1,
partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB,
topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3,
rebalanceId=1]

into:

Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1,
partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, avgSpeed=5,9 KB/sec,
histPartitions=1, histEntries=30, histBytesRcvd=1 KB,
fullPartitions=1, fullEntries=30, fullBytesRcvd=3 KB,
topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3,
rebalanceId=1]

Where:
partitions=2 - total number of partitions received
entries=60 - total number of entries received
duration=8ms - time from the first demand message to the output of this log message
bytesRcvd=5,9 KB - total number of bytes received (formatted as B, KB, MB or GB)
avgSpeed=5,9 KB/sec - bytesRcvd / duration, in KB/sec
histPartitions=1 - number of partitions received by historical mode
histEntries=30 - number of entries received by historical mode
histBytesRcvd=1 KB - number of bytes received by historical mode (B, KB, MB or GB)
fullPartitions=1 - number of partitions received by full mode
fullEntries=30 - number of entries received by full mode
fullBytesRcvd=3 KB - number of bytes received by full mode (B, KB, MB or GB)
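For illustration, the avgSpeed value is simply bytesRcvd divided by duration. Below is a minimal sketch of that computation and of the B/KB/MB/GB formatting; the class and method names (RebalanceSpeed, avgSpeedBytesPerSec, humanReadable) are my own, not the actual Ignite code, and the actual log uses a locale-dependent decimal separator (hence "5,9 KB"):

```java
import java.util.Locale;

// Illustrative sketch only: how avgSpeed could be derived from bytesRcvd
// and duration. Helper names are assumptions, not the Ignite implementation.
class RebalanceSpeed {
    /** Average speed in bytes/sec; duration is clamped to 1 ms to avoid division by zero. */
    static long avgSpeedBytesPerSec(long bytesRcvd, long durationMs) {
        return bytesRcvd * 1000L / Math.max(durationMs, 1L);
    }

    /** Formats a byte count as B, KB, MB or GB, mirroring the log output style. */
    static String humanReadable(long bytes) {
        if (bytes < 1024)
            return bytes + " B";
        if (bytes < 1024L * 1024)
            return String.format(Locale.ROOT, "%.1f KB", bytes / 1024.0);
        if (bytes < 1024L * 1024 * 1024)
            return String.format(Locale.ROOT, "%.1f MB", bytes / (1024.0 * 1024));
        return String.format(Locale.ROOT, "%.1f GB", bytes / (1024.0 * 1024 * 1024));
    }

    public static void main(String[] args) {
        // Roughly 5.9 KB received in 8 ms, as in the example log line.
        long bytes = 6042, durMs = 8;
        System.out.println("avgSpeed=" + humanReadable(avgSpeedBytesPerSec(bytes, durMs)) + "/sec");
    }
}
```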
Re: Extended logging for rebalance performance analysis
Hi Kirill,

I think it would be helpful to add to the log line

Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1,
partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB,
topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3,
rebalanceId=1]

a few more metrics: avgSpeed=, partitionsNum=N, histPartitionsNum=M,
histBytesRcvd=X KB, histDuration=Yms, histAvgSpeed=, fullPartitionsNum=K,
fullBytesRcvd=X KB, fullDuration=Yms, fullAvgSpeed=

I believe these additional metrics are good in that a user can easily set up
monitoring based on them, and they give a lot of useful info about rebalance
performance in general. No need to look for the "Starting rebalance" message;
everything is in one place.

If any of these are hard to add, then we can skip them (I suspect histDuration
and fullDuration can't be measured in practice, but I don't know for sure).
Let's add what we can.

Thanks,
Stan
Re: Extended logging for rebalance performance analysis
Sorry, forgot:

[1] org.apache.ignite.internal.processors.cache.CacheGroupsMetricsRebalanceTest#testCacheGroupRebalance
Re: Extended logging for rebalance performance analysis
Hi, Stan!

I don't understand you yet.

Now you can use the metrics as it was done in the test [1]. Or can you tell me
where to do this, for example when completing rebalancing for all groups?

See what is now available and added in the logs:

1) Which group is rebalanced and which type of rebalance:
Starting rebalance routine [grp0, topVer=AffinityTopologyVersion [topVer=4,
minorTopVer=0], supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1,
fullPartitions=[4, 7], histPartitions=[], rebalanceId=1]

2) Completion of rebalancing from one of the suppliers:
Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a1,
partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB,
topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3,
rebalanceId=1]

3) Completion of the entire rebalance:
Completed rebalance chain: [rebalanceId=1, partitions=116, entries=400,
duration=41ms, bytesRcvd=40,4 KB]

These messages share a common parameter, rebalanceId=1.
Re: Extended logging for rebalance performance analysis
> On 3 Jul 2020, at 09:51, ткаленко кирилл wrote:
>
> To calculate the average value, you can use the existing metrics
> "RebalancingStartTime", "RebalancingLastCancelledTime", "RebalancingEndTime",
> "RebalancingPartitionsLeft", "RebalancingReceivedKeys" and
> "RebalancingReceivedBytes".

You can calculate it, and I believe that this is the first thing anyone would
do when reading these logs and metrics. If that's an essential thing, then
maybe it should be available out of the box?

> This also works correctly with the historical rebalance.
> Now we can see the rebalance type for each group and for each supplier in
> the logs. I don't think we should duplicate this information.
>
> [2020-07-03 09:49:31,481][INFO
> ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance
> routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3,
> minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a5957710,
> fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]

I'm talking about adding info on how much data has been transferred during
rebalance. When rebalance completes, I'd like to know how much data has been
transferred, whether it was historical or full, and what the average rebalance
speed was.

There are two reasons for having all that.

First, it helps to analyze issues by searching the logs and looking for
anomalies.

Second, it allows alerts to be automated: e.g. if you know your typical
historical rebalance speed, you can trigger an alert if it drops below that.
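The alerting idea mentioned in this message can be sketched as a simple threshold check. The names and the 50% threshold below are illustrative assumptions, not anything from the patch:

```java
// Illustrative sketch of alerting on rebalance speed: compare the observed
// historical rebalance speed against a known-typical baseline and flag a drop.
// All names and thresholds here are assumptions for demonstration.
class RebalanceAlert {
    /** @return true if the observed speed dropped below the given fraction of the baseline. */
    static boolean shouldAlert(long observedBytesPerSec, long baselineBytesPerSec, double minFraction) {
        return observedBytesPerSec < baselineBytesPerSec * minFraction;
    }

    public static void main(String[] args) {
        long baseline = 50L * 1024 * 1024; // typical historical rebalance speed: 50 MB/sec
        long observed = 10L * 1024 * 1024; // currently observed: 10 MB/sec

        // Fire an alert when the speed drops below half of typical.
        if (shouldAlert(observed, baseline, 0.5))
            System.out.println("ALERT: historical rebalance speed below 50% of baseline");
    }
}
```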
Re: Extended logging for rebalance performance analysis
LongMetric startTime = mreg.findMetric("RebalancingStartTime");
LongMetric lastCancelledTime = mreg.findMetric("RebalancingLastCancelledTime");
LongMetric endTime = mreg.findMetric("RebalancingEndTime");
LongMetric partitionsLeft = mreg.findMetric("RebalancingPartitionsLeft");
LongMetric receivedKeys = mreg.findMetric("RebalancingReceivedKeys");
LongMetric receivedBytes = mreg.findMetric("RebalancingReceivedBytes");

Hi, Stas!

To calculate the average value, you can use the existing metrics
"RebalancingStartTime", "RebalancingLastCancelledTime", "RebalancingEndTime",
"RebalancingPartitionsLeft", "RebalancingReceivedKeys" and
"RebalancingReceivedBytes".

This also works correctly with the historical rebalance.
Now we can see the rebalance type for each group and for each supplier in the
logs. I don't think we should duplicate this information.

[2020-07-03 09:49:31,481][INFO
][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance
routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3,
minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a5957710,
fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]
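As a reference for the "calculate the average yourself" approach, here is a hedged sketch of the computation from the values those metrics expose. The helper is illustrative and not part of the Ignite API; it assumes the end-time value is -1 while rebalancing is still in progress:

```java
// Illustrative helper: average rebalance speed from the metric values
// RebalancingStartTime, RebalancingEndTime and RebalancingReceivedBytes.
// NOT Ignite API; assumes endTimeMs == -1 while rebalancing is in progress.
class RebalanceAvg {
    /** @return average speed in bytes/sec, or -1 if rebalancing has not finished yet. */
    static long avgBytesPerSec(long startTimeMs, long endTimeMs, long receivedBytes) {
        if (endTimeMs < 0 || endTimeMs <= startTimeMs)
            return -1;

        return receivedBytes * 1000L / (endTimeMs - startTimeMs);
    }

    public static void main(String[] args) {
        // Roughly 40 KB received over 41 ms, as in the "Completed rebalance chain" example.
        System.out.println(avgBytesPerSec(0, 41, 41_359) + " bytes/sec");
    }
}
```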
Re: Extended logging for rebalance performance analysis
Kirill,

I've looked through the patch.
Looks good, but it feels like the first thing someone will try to do, given
bytesRcvd and duration, is to divide one by the other to get an average speed.
Do you think it's reasonable to also add it to the logs? Maybe even to the
metrics?

Also, this works with historical rebalance, right? Can we specify how much
data was transferred via historical or full rebalance from each supplier?
Maybe even estimate the transfer speed in entries and bytes for each rebalance
type?

Thanks,
Stan
Re: Extended logging for rebalance performance analysis
+1 to Alex G.

From my experience, the most interesting cases with Ignite rebalancing happen
exactly in production. Given that we already have detailed rebalancing
logging, adding info about rebalance performance looks like a reasonable
improvement. With the new logs we'll be able to detect and investigate
situations when rebalance is slow due to uneven supplier distribution or
network issues.

The option to disable the feature at runtime shouldn't be needed often, but it
will keep us on the safe side in case something goes wrong.

The format described in https://issues.apache.org/jira/browse/IGNITE-12080
looks good to me.
Re: Extended logging for rebalance performance analysis
Hello, Alexey! Currently there is no way to disable / enable it, but it seems that the logs will not be overloaded, since Alexei Scherbakov's offer seems reasonable and compact. Of course, you can add disabling / enabling statistics collection via JMX, for example. 23.06.2020, 18:47, "Alexey Goncharuk" : > Hello Maxim, folks, > > Wed, 6 May 2020 at 21:01, Maxim Muzafarov : > >> We won't do performance analysis on the production environment. Each >> time we need performance analysis it will be done on a test >> environment with verbose logging enabled. Thus I suggest moving these >> changes to a separate `profiling` module and extend the logging much >> more without any size limitations. The same as these [2] [3] >> activities do. > > I strongly disagree with this statement. I am not sure who is meant here > by 'we', but I see a strong momentum in increasing observability tooling > that helps people to understand what exactly happens in the production > environment [1]. Not everybody can afford two identical environments for > testing. We should make sure users have enough information to understand > the root cause after the incident happened, and not force them to reproduce > it, let alone make them add another module to the classpath and restart the > nodes. > I think having this functionality in the core module with the ability to > disable/enable it is the right approach. Having the information printed to > log is ok, having it in an event that can be sent to a monitoring/tracing > subsystem is even better. > > Kirill, can we enable and disable this feature in runtime to avoid the very > same nodes restart? > > [1] https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/
Re: Extended logging for rebalance performance analysis
Hello Maxim, folks, Wed, 6 May 2020 at 21:01, Maxim Muzafarov : > We won't do performance analysis on the production environment. Each > time we need performance analysis it will be done on a test > environment with verbose logging enabled. Thus I suggest moving these > changes to a separate `profiling` module and extend the logging much > more without any size limitations. The same as these [2] [3] > activities do. > I strongly disagree with this statement. I am not sure who is meant here by 'we', but I see a strong momentum in increasing observability tooling that helps people to understand what exactly happens in the production environment [1]. Not everybody can afford two identical environments for testing. We should make sure users have enough information to understand the root cause after the incident happened, and not force them to reproduce it, let alone make them add another module to the classpath and restart the nodes. I think having this functionality in the core module with the ability to disable/enable it is the right approach. Having the information printed to log is ok, having it in an event that can be sent to a monitoring/tracing subsystem is even better. Kirill, can we enable and disable this feature in runtime to avoid the very same nodes restart? [1] https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/
Re: Extended logging for rebalance performance analysis
Hi, Kirill. Looks good to me. Tue, 23 Jun 2020 at 18:05, ткаленко кирилл : > Hello, Alexey! > > I suggest that we decide what we can do within ticket [1]. > > Add "rebalanceId" and "topVer" related to rebalance to all messages. > > Add statistical information to a log message: > [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing > [rebalanceId=1, grp=cache1, > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, partitions=5, entries=100, > duration=12ms, > bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], > progress=1/2] > > Add a message to log after rebalancing for all cache groups is complete: > [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain: > [rebalanceId=2, partitions=10, entries=200, duration=50ms, bytesRcvd=10M] > > Any comments or suggestions? > > [1] - https://issues.apache.org/jira/browse/IGNITE-12080 > > 20.05.2020, 23:08, "ткаленко кирилл" : > > Hello, Alexey! Unfortunately, my response was delayed. > > > > Point 2: You can do as you suggested, I think it is still worth > specifying how many partitions were obtained. > > > > [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1, > > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, partitions=5, > entries=100, duration=12ms, > > bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], > > progress=1/2] > > > > Point 3: is it "rebalanceId"? > > > > Point 5: I think we can output a summary for each supplier, so that we > don't have to keep it all in mind. > > > > [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain: > > [rebalanceId=2, [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, > partitions=5, entries=100, duration=12ms, bytesRcvd=5M], > > [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba2, partitions=5, > entries=100, duration=12ms, bytesRcvd=5M]] > > > > I can add "rebalanceId" to each message that you gave above. > > > > A detailed message will help us understand whether the suppliers > were selected correctly.
> > > > 06.05.2020, 22:08, "Alexei Scherbakov" : > >> Hello. > >> > >> Let's look at existing rebalancing log for a single group: > >> > >> [2020-05-06 20:56:36,999][INFO ][...] Rebalancing scheduled > >> [order=[ignite-sys-cache, cache1, cache2, default], > >> top=AffinityTopologyVersion [topVer=3, minorTopVer=1], > >> evt=DISCOVERY_CUSTOM_EVT, node=9d9edb7b-eb01-47a1-8ff9-fef715d2] > >> ... > >> [2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1, > >> mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, > >> partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3, > >> minorTopVer=1]] > >> [2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1, > >> mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, > >> partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3, > >> minorTopVer=1]] > >> [2020-05-06 20:56:37,036][INFO ][...] Starting rebalance routine > [cache1, > >> topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], > >> supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, fullPartitions=[1, 5, > 7, 9, > >> 11, 13, 15, 23, 27, 29, 31], histPartitions=[]] > >> [2020-05-06 20:56:37,037][INFO ][...] Starting rebalance routine > [cache1, > >> topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], > >> supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, fullPartitions=[6, 8, > 10, > >> 16, 18, 20, 22, 24, 26, 28], histPartitions=[]] > >> [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing > [grp=cache1, > >> supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, > >> topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=1/2] > >> [2020-05-06 20:56:37,046][INFO ][...] Completed (final) rebalancing > >> [grp=cache1, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, > >> topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=2/2] > >> [2020-05-06 20:56:37,048][INFO ][...] 
Completed rebalance future: > >> RebalanceFuture [grp=CacheGroupContext [grp=cache1], > >> topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], > rebalanceId=2, > >> routines=2] > >> > >> From these logs I can already get answers to 1 and 4. > >> The logs look concise and easy to read and understand, and should > >> remain that way. > >> > >> But I think some proposed improvements can be done here without harm. > >> > >> 2. OK, let's add it to supplier info per cache with additional info: > >> > >> [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing > [grp=cache1, > >> supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, entries=100, > duration=12ms, > >> bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], > >> progress=1/2] > >> > >> 3. This information is already printed to log. > >> > >> 5. OK, let's add a line which is printed after all groups in the chain > are > >> rebalanced or cancelled with a summary: > >> > >> [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain: > >> [rebalanceId=2, entries=200, duration=50ms,
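The "Completed rebalance chain" summary discussed above is an aggregation of per-supplier numbers. A minimal sketch of how such an aggregation could look (this is illustrative, not Ignite source; the class and field names are my assumptions):

```java
import java.util.List;

/** Illustrative sketch (not Ignite code): aggregating per-supplier rebalance
 *  statistics into the proposed "Completed rebalance chain" summary line. */
public class RebalanceChainSummary {
    /** Per-supplier statistics, mirroring the fields quoted in this thread. */
    static class SupplierStats {
        final String supplier;
        final int partitions;
        final long entries;
        final long durationMs;
        final long bytesRcvdMb;

        SupplierStats(String supplier, int partitions, long entries, long durationMs, long bytesRcvdMb) {
            this.supplier = supplier;
            this.partitions = partitions;
            this.entries = entries;
            this.durationMs = durationMs;
            this.bytesRcvdMb = bytesRcvdMb;
        }
    }

    /** Builds a summary string shaped like the proposed chain message. */
    static String chainSummary(long rebalanceId, List<SupplierStats> stats) {
        int partitions = 0;
        long entries = 0, duration = 0, bytesMb = 0;

        for (SupplierStats s : stats) {
            partitions += s.partitions;
            entries += s.entries;
            // Suppliers rebalance concurrently, so the chain duration is
            // approximated here as the max; the real value is wall-clock time.
            duration = Math.max(duration, s.durationMs);
            bytesMb += s.bytesRcvdMb;
        }

        return String.format(
            "Completed rebalance chain: [rebalanceId=%d, partitions=%d, entries=%d, duration=%dms, bytesRcvd=%dM]",
            rebalanceId, partitions, entries, duration, bytesMb);
    }
}
```

Note the duration caveat: in the quoted example the chain duration (50ms) exceeds each supplier's 12ms precisely because it is measured as wall-clock time across the whole chain, which a simple max can only approximate.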
Re: Extended logging for rebalance performance analysis
Hello, Alexey! I suggest that we decide what we can do within ticket [1]. Add "rebalanceId" and "topVer" related to rebalance to all messages. Add statistical information to a log message: [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [rebalanceId=1, grp=cache1, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, partitions=5, entries=100, duration=12ms, bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=1/2] Add a message to log after rebalancing for all cache groups is complete: [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain: [rebalanceId=2, partitions=10, entries=200, duration=50ms, bytesRcvd=10M] Any comments or suggestions? [1] - https://issues.apache.org/jira/browse/IGNITE-12080 20.05.2020, 23:08, "ткаленко кирилл" : > Hello, Alexey! Unfortunately, my response was delayed. > > Point 2: You can do as you suggested, I think it is still worth specifying > how many partitions were obtained. > > [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1, > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, partitions=5, entries=100, > duration=12ms, > bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], > progress=1/2] > > Point 3: is it "rebalanceId"? > > Point 5: I think we can output a summary for each supplier, so that we don't > have to keep it all in mind. > > [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain: > [rebalanceId=2, [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, partitions=5, > entries=100, duration=12ms, bytesRcvd=5M], > [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba2, partitions=5, entries=100, > duration=12ms, bytesRcvd=5M]] > > I can add "rebalanceId" to each message that you gave above. > > A detailed message will help us understand whether the suppliers were > selected correctly. > > 06.05.2020, 22:08, "Alexei Scherbakov" : >> Hello. >> >> Let's look at existing rebalancing log for a single group: >> >> [2020-05-06 20:56:36,999][INFO ][...] 
Rebalancing scheduled >> [order=[ignite-sys-cache, cache1, cache2, default], >> top=AffinityTopologyVersion [topVer=3, minorTopVer=1], >> evt=DISCOVERY_CUSTOM_EVT, node=9d9edb7b-eb01-47a1-8ff9-fef715d2] >> ... >> [2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1, >> mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, >> partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3, >> minorTopVer=1]] >> [2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1, >> mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, >> partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3, >> minorTopVer=1]] >> [2020-05-06 20:56:37,036][INFO ][...] Starting rebalance routine [cache1, >> topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], >> supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, fullPartitions=[1, 5, 7, 9, >> 11, 13, 15, 23, 27, 29, 31], histPartitions=[]] >> [2020-05-06 20:56:37,037][INFO ][...] Starting rebalance routine [cache1, >> topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], >> supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, fullPartitions=[6, 8, 10, >> 16, 18, 20, 22, 24, 26, 28], histPartitions=[]] >> [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1, >> supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, >> topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=1/2] >> [2020-05-06 20:56:37,046][INFO ][...] Completed (final) rebalancing >> [grp=cache1, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, >> topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=2/2] >> [2020-05-06 20:56:37,048][INFO ][...] Completed rebalance future: >> RebalanceFuture [grp=CacheGroupContext [grp=cache1], >> topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], rebalanceId=2, >> routines=2] >> >> From these logs I can already get answers to 1 and 4. >> The logs look concise and easy to read and understand, and should >> remain that way. 
>> >> But I think some proposed improvements can be done here without harm. >> >> 2. OK, let's add it to supplier info per cache with additional info: >> >> [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1, >> supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, entries=100, duration=12ms, >> bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], >> progress=1/2] >> >> 3. This information is already printed to log. >> >> 5. OK, let's add a line which is printed after all groups in the chain are >> rebalanced or cancelled with a summary: >> >> [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain: >> [rebalanceId=2, entries=200, duration=50ms, bytesRcvd=10M] >> >> Another thing I would suggest: add rebalanceId to all messages for ease of >> grepping and to distinguish multiple rebalances for the same topology version. >> >> Regarding the "detailed" message - I wasn't able to figure out any case >> where it is useful. Not sure if we
Re: Extended logging for rebalance performance analysis
Hi, Kirill. 2. Ok for me. 3. We already have a log message about rebalance cancellation for a specific rebalanceId and a log message that a new rebalancing has started with the next rebalanceId. 5. We already have summary information per group for each supplier added in p.2. I would avoid duplication here or put it under debug logging. Information about selected suppliers can already be obtained from logs: [2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1, mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1]] [2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1, mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1]] so I still do not see any value in the "detailed" message. I think this is enough to understand the rebalancing flow using proper grepping by topVer and rebalanceId. All additional aggregation should be done by external tools using rebalancing metrics. I'm on the same page with Maxim Muzafarov here. Wed, 20 May 2020 at 23:08, ткаленко кирилл : > Hello, Alexey! Unfortunately, my response was delayed. > > Point 2: You can do as you suggested, I think it is still worth specifying > how many partitions were obtained. > > [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1, > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, partitions=5, entries=100, > duration=12ms, > bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], > progress=1/2] > > Point 3: is it "rebalanceId"? > > Point 5: I think we can output a summary for each supplier, so that we don't > have to keep it all in mind. > > [2020-05-06 20:56:36,999][INFO ][...] 
Completed rebalance chain: > [rebalanceId=2, [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, > partitions=5, entries=100, duration=12ms, bytesRcvd=5M], > [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba2, partitions=5, entries=100, > duration=12ms, bytesRcvd=5M]] > > I can add "rebalanceId" to each message that you gave above. > > A detailed message will help us understand whether the suppliers > were selected correctly. > > 06.05.2020, 22:08, "Alexei Scherbakov" : > > Hello. > > > > Let's look at existing rebalancing log for a single group: > > > > [2020-05-06 20:56:36,999][INFO ][...] Rebalancing scheduled > > [order=[ignite-sys-cache, cache1, cache2, default], > > top=AffinityTopologyVersion [topVer=3, minorTopVer=1], > > evt=DISCOVERY_CUSTOM_EVT, node=9d9edb7b-eb01-47a1-8ff9-fef715d2] > > ... > > [2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1, > > mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, > > partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3, > > minorTopVer=1]] > > [2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1, > > mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, > > partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3, > > minorTopVer=1]] > > [2020-05-06 20:56:37,036][INFO ][...] Starting rebalance routine [cache1, > > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], > > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, fullPartitions=[1, 5, 7, > 9, > > 11, 13, 15, 23, 27, 29, 31], histPartitions=[]] > > [2020-05-06 20:56:37,037][INFO ][...] Starting rebalance routine [cache1, > > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], > > supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, fullPartitions=[6, 8, 10, > > 16, 18, 20, 22, 24, 26, 28], histPartitions=[]] > > [2020-05-06 20:56:37,044][INFO ][...] 
Completed rebalancing [grp=cache1, > > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, > > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=1/2] > > [2020-05-06 20:56:37,046][INFO ][...] Completed (final) rebalancing > > [grp=cache1, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, > > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=2/2] > > [2020-05-06 20:56:37,048][INFO ][...] Completed rebalance future: > > RebalanceFuture [grp=CacheGroupContext [grp=cache1], > > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], rebalanceId=2, > > routines=2] > > > > From these logs I can already get answers to 1 and 4. > > The logs look concise and easy to read and understand, and should > > remain that way. > > > > But I think some proposed improvements can be done here without harm. > > > > 2. OK, let's add it to supplier info per cache with additional info: > > > > [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1, > > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, entries=100, > duration=12ms, > > bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], > > progress=1/2] > > > > 3. This information is already printed to log. > > > > 5. OK, let's add a line which is printed after all groups in the chain > are > > rebalanced or cancelled with a summary: > > > > [2020-05-06 20:56:36,999][INFO
Re: Extended logging for rebalance performance analysis
Hello, Alexey! Unfortunately, my response was delayed. Point 2: You can do as you suggested, I think it is still worth specifying how many partitions were obtained. [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, partitions=5, entries=100, duration=12ms, bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=1/2] Point 3: is it "rebalanceId"? Point 5: I think we can output a summary for each supplier, so that we don't have to keep it all in mind. [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain: [rebalanceId=2, [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, partitions=5, entries=100, duration=12ms, bytesRcvd=5M], [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba2, partitions=5, entries=100, duration=12ms, bytesRcvd=5M]] I can add "rebalanceId" to each message that you gave above. A detailed message will help us understand whether the suppliers were selected correctly. 06.05.2020, 22:08, "Alexei Scherbakov" : > Hello. > > Let's look at existing rebalancing log for a single group: > > [2020-05-06 20:56:36,999][INFO ][...] Rebalancing scheduled > [order=[ignite-sys-cache, cache1, cache2, default], > top=AffinityTopologyVersion [topVer=3, minorTopVer=1], > evt=DISCOVERY_CUSTOM_EVT, node=9d9edb7b-eb01-47a1-8ff9-fef715d2] > ... > [2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1, > mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, > partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3, > minorTopVer=1]] > [2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1, > mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, > partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3, > minorTopVer=1]] > [2020-05-06 20:56:37,036][INFO ][...] 
Starting rebalance routine [cache1, > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, fullPartitions=[1, 5, 7, 9, > 11, 13, 15, 23, 27, 29, 31], histPartitions=[]] > [2020-05-06 20:56:37,037][INFO ][...] Starting rebalance routine [cache1, > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], > supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, fullPartitions=[6, 8, 10, > 16, 18, 20, 22, 24, 26, 28], histPartitions=[]] > [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1, > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=1/2] > [2020-05-06 20:56:37,046][INFO ][...] Completed (final) rebalancing > [grp=cache1, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=2/2] > [2020-05-06 20:56:37,048][INFO ][...] Completed rebalance future: > RebalanceFuture [grp=CacheGroupContext [grp=cache1], > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], rebalanceId=2, > routines=2] > > From these logs I can already get answers to 1 and 4. > The logs look concise and easy to read and understand, and should > remain that way. > > But I think some proposed improvements can be done here without harm. > > 2. OK, let's add it to supplier info per cache with additional info: > > [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1, > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, entries=100, duration=12ms, > bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], > progress=1/2] > > 3. This information is already printed to log. > > 5. OK, let's add a line which is printed after all groups in the chain are > rebalanced or cancelled with a summary: > > [2020-05-06 20:56:36,999][INFO ][...] 
Completed rebalance chain: > [rebalanceId=2, entries=200, duration=50ms, bytesRcvd=10M] > > Another thing I would suggest: add rebalanceId to all messages for ease of > grepping and to distinguish multiple rebalances for the same topology version. > > Regarding the "detailed" message - I wasn't able to figure out any case > where it is useful. Not sure if we need it at all. > Kirill Tkalenko, can you give me an example? > > Wed, 6 May 2020 at 21:01, Maxim Muzafarov : > >> Kirill, >> >> Thank you for raising this topic. It's true that the rebalance process >> still requires additional information for analyzing issues. Please, >> don't think that I'm against your changes :-) >> >> * My short answer. * >> >> We won't do performance analysis on the production environment. Each >> time we need performance analysis it will be done on a test >> environment with verbose logging enabled. Thus I suggest moving these >> changes to a separate `profiling` module and extend the logging much >> more without any size limitations. The same as these [2] [3] >> activities do. >> >> Let's keep the `core` module as simple as possible. >> Let's design the right API for accessing rebalance internals for >> profiling tools. >> Can you, please, remove all changes from your PR [6] which are not >> related to
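Alexei's suggestion to put rebalanceId in every message and grep by it can be illustrated with a tiny filter. This is a hypothetical helper, not Ignite code; the sample lines mimic the log format quoted in this thread:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper showing why a rebalanceId in every message makes log
 *  filtering trivial: keep only the lines belonging to one rebalance session. */
public class RebalanceLogFilter {
    private static final Pattern REBALANCE_ID = Pattern.compile("rebalanceId=(\\d+)");

    /** Returns the lines whose rebalanceId matches the given id. */
    static List<String> byRebalanceId(List<String> lines, long id) {
        List<String> res = new ArrayList<>();

        for (String line : lines) {
            Matcher m = REBALANCE_ID.matcher(line);

            if (m.find() && Long.parseLong(m.group(1)) == id)
                res.add(line);
        }

        return res;
    }
}
```

In practice the same effect comes from `grep 'rebalanceId=2'` over the node log; the point is that once every rebalance message carries the id, one rebalance session can be isolated even when several sessions interleave on the same topology version.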
Re: Extended logging for rebalance performance analysis
Hi, Maxim! Unfortunately, I delayed my answer. The production environment may not be ideal and problems may occur in it. I only suggest adding a little logging to understand some of the cases that may occur during operation. In addition, this functionality can be disabled by the system property. The variables were renamed to improve code readability. We can do it via metrics, but whether we need to store it for all suppliers, and for how long, is unclear. In my case, I do a simple aggregation and clear it after output. 06.05.2020, 21:01, "Maxim Muzafarov" : > Kirill, > > Thank you for raising this topic. It's true that the rebalance process > still requires additional information for analyzing issues. Please, > don't think that I'm against your changes :-) > > * My short answer. * > > We won't do performance analysis on the production environment. Each > time we need performance analysis it will be done on a test > environment with verbose logging enabled. Thus I suggest moving these > changes to a separate `profiling` module and extend the logging much > more without any size limitations. The same as these [2] [3] > activities do. > > Let's keep the `core` module as simple as possible. > Let's design the right API for accessing rebalance internals for > profiling tools. > Can you, please, remove all changes from your PR [6] which are not > related to the proposed topic? (e.g. variable renamings). > > * The long answer. * > > Here are my thoughts on this. There are two different types of issues > in the rebalance process. The first case must be covered by daily > monitoring subsystems, the second case must be covered by additional > profiling tools. > 1. errors during the rebalancing (e.g. rebalance does not happen when required) > 2. performance rebalancing issues (e.g. rebalance is slow) > > Daily monitoring tools (JMX, Logging) are always turned on and > shouldn't require additional system resources themselves. 
Since these > metrics must be lightweight, any internal aggregation machinery is not > used on them. All these metrics are collected from each node > independently. Please, take a look at this issue [1] which covers most > of your questions mentioned above. > > For all available metrics, we can configure LogExporterSpi, so they > will be available in logs. > >> 1) How long was the rebalance (divided by supplier)? > > rebalancingStartTime, rebalancingEndTime already exist for cache groups [4]. > We can add the same for suppliers. > >> 2) How many records and bytes per supplier were rebalanced? > > We already have rebalancingReceivedKeys, rebalancingReceivedBytes [4] > We will have rebalancingExpectedKeys [5]. > We can add a new metric per cache keys, per supplier. > >> 3) How many times did the rebalance restart? > > rebalancingLastCancelledTime [4] metric exists. > Do we need to keep historical data on it? > >> 4) Which partitions were rebalanced and from which nodes did they receive >> them? > > Let's print this information to log before the rebalance process > starts. This will be helpful information and does not require a lot of > changes. > >> 5) When did rebalance for all cache groups end? > > This metric may be aggregated from rebalancingEndTime [4] by pulling > it from all nodes for all caches. > > Performance rebalancing issues are related to profiling tools. They > may require additional system resources and definitely require a > dedicated environment for tests. We can't do performance analysis on > production environments due to performance impact. > I see some disadvantages of adding such tools to production code: > - verbose logging may affect performance. > - the problematic process may become even worse if an automatic > threshold suddenly turns on. > - new code changes will require additional efforts to keep logging up-to-date. 
> > [1] https://issues.apache.org/jira/browse/IGNITE-12183 > [2] https://issues.apache.org/jira/browse/IGNITE-12666 > [3] > https://cwiki.apache.org/confluence/display/IGNITE/Cluster+performance+profiling+tool > [4] https://issues.apache.org/jira/browse/IGNITE-12193 > [5] https://issues.apache.org/jira/browse/IGNITE-12194 > [6] https://github.com/apache/ignite/pull/7705/files > > On Wed, 6 May 2020 at 19:50, Ivan Rakov wrote: >> Hi, >> >> IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold >> > for the duration of a cache group rebalance after which the partitions distribution is >> > output, set in milliseconds, default value is 10 minutes. >> >> Does it mean that if the rebalancing process took less than 10 minutes, >> only a short version of the message (with supplier statistics) will show up? >> >> In general, I have no objections. >> >> On Mon, May 4, 2020 at 10:38 AM ткаленко кирилл >> wrote: >> >> > Hi, Igniters! >> > >> > I'd like to share a new small feature in AI [1]. >> > >> > Current rebalance logging does not allow you to quickly answer the following >> > questions: >> > 1) How long was the rebalance (divided by supplier)? >> > 2) How many
Re: Extended logging for rebalance performance analysis
Hi, yes, that's right. 06.05.2020, 19:50, "Ivan Rakov" : > Hi, > > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold >> for the duration of a cache group rebalance after which the partitions distribution is >> output, set in milliseconds, default value is 10 minutes. > > Does it mean that if the rebalancing process took less than 10 minutes, > only a short version of the message (with supplier statistics) will show up? > > In general, I have no objections. > > On Mon, May 4, 2020 at 10:38 AM ткаленко кирилл > wrote: > >> Hi, Igniters! >> >> I'd like to share a new small feature in AI [1]. >> >> Current rebalance logging does not allow you to quickly answer the following >> questions: >> 1) How long was the rebalance (divided by supplier)? >> 2) How many records and bytes per supplier were rebalanced? >> 3) How many times did the rebalance restart? >> 4) Which partitions were rebalanced and from which nodes did they receive >> them? >> 5) When did the rebalance for all cache groups end? >> >> What you can see in logs now: >> >> 1) Starting rebalance with the order of cache groups. >> Rebalancing scheduled [order=[ignite-sys-cache, grp1, grp0], >> top=AffinityTopologyVersion [topVer=2, minorTopVer=0], force=false, >> evt=NODE_JOINED, node=c2146a04-dc23-4bc9-870d-dfbb55c1] >> >> 2) Start of rebalance of a cache group from a specific supplier, specifying >> partition ids and mode - historical or full. >> Starting rebalance routine [ignite-sys-cache, >> topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], >> supplier=8c525892-703b-4fc4-b28b-b2f13970, fullPartitions=[0-99], >> histPartitions=[]] >> >> 3) Getting partial or complete partitions of a cache group. 
>> Completed rebalancing [grp=ignite-sys-cache, >> supplier=8c525892-703b-4fc4-b28b-b2f13970, >> topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=1/2] >> Completed (final) rebalancing [grp=ignite-sys-cache, >> supplier=c2146a04-dc23-4bc9-870d-dfbb55c1, >> topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=2/2] >> >> 4) End of rebalance of a cache group. >> Completed rebalance future: RebalanceFuture [grp=CacheGroupContext >> [grp=ignite-sys-cache], topVer=AffinityTopologyVersion [topVer=2, >> minorTopVer=0], rebalanceId=1, routines=1, receivedBytes=1200, >> receivedKeys=0, partitionsLeft=0, startTime=1588519707607, endTime=-1, >> lastCancelledTime=-1] >> >> Rebalance statistics: >> >> To speed up rebalance analysis, statistics will be output for each cache >> group and in total for all cache groups. >> If the rebalance duration for a cache group is greater than the threshold value, >> the partition distribution is output. >> Statistics will allow you to analyze the rebalance duration for each supplier, >> to understand which of them has been transmitting data for the longest time. >> >> System properties are used to output statistics: >> >> IGNITE_QUIET - to output statistics, the value must be false; >> IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold >> for the duration of a cache group rebalance after which the partitions distribution is >> output, set in milliseconds, default value is 10 minutes. >> >> Statistics examples: >> >> Successful full and historical rebalance of a cache group, without >> partitions distribution. 
>> Rebalance information per cache group (successful rebalance): [id=3181548, >> name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13 >> 10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0, p=5, >> d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - entries, b - >> bytes, d - duration, h - historical, nodeId mapping >> (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1] >> [1=rebalancing.RebalanceStatisticsTest0] >> Rebalance information per cache group (successful rebalance): [id=3181547, >> name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13 >> 15:01:44,116, d=116 ms, restarted=0] Supplier statistics: [nodeId=0, hp=10, >> he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - entries, b - >> bytes, d - duration, h - historical, nodeId mapping >> (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0] >> >> Successful full and historical rebalance of group cache, with partitions >> distribution. >> Rebalance information per cache group (successful rebalance): [id=3181548, >> name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13 >> 10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0, p=5, >> d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - entries, b - >> bytes, d - duration, h - historical, nodeId mapping >> (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1] >> [1=rebalancing.RebalanceStatisticsTest0] Rebalance duration was greater >> than 5 ms, printing detailed information about partitions distribution >> (threshold can be changed by setting number of milliseconds into >> IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD) 0 =
Re: Extended logging for rebalance performance analysis
Hello.

Let's look at the existing rebalancing log for a single group:

[2020-05-06 20:56:36,999][INFO ][...] Rebalancing scheduled [order=[ignite-sys-cache, cache1, cache2, default], top=AffinityTopologyVersion [topVer=3, minorTopVer=1], evt=DISCOVERY_CUSTOM_EVT, node=9d9edb7b-eb01-47a1-8ff9-fef715d2]
...
[2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1, mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1]]
[2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1, mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1]]
[2020-05-06 20:56:37,036][INFO ][...] Starting rebalance routine [cache1, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, fullPartitions=[1, 5, 7, 9, 11, 13, 15, 23, 27, 29, 31], histPartitions=[]]
[2020-05-06 20:56:37,037][INFO ][...] Starting rebalance routine [cache1, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, fullPartitions=[6, 8, 10, 16, 18, 20, 22, 24, 26, 28], histPartitions=[]]
[2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=1/2]
[2020-05-06 20:56:37,046][INFO ][...] Completed (final) rebalancing [grp=cache1, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa50, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=2/2]
[2020-05-06 20:56:37,048][INFO ][...] Completed rebalance future: RebalanceFuture [grp=CacheGroupContext [grp=cache1], topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], rebalanceId=2, routines=2]

From these logs I can already get answers to questions 1 and 4. The logs look concise, easy to read and understand, and should remain that way.

But I think some of the proposed improvements can be done here without harm.

2. OK, let's add it to the per-group supplier info with additional fields:
[2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba1, entries=100, duration=12ms, bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=1/2]

3. This information is already printed to the log.

5. OK, let's add a line which is printed after all groups in the chain are rebalanced or cancelled, with a summary:
[2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain: [rebalanceId=2, entries=200, duration=50ms, bytesRcvd=10M]

Another thing I would suggest: add rebalanceId to all messages for ease of grepping and to distinguish multiple rebalances for the same topology version.

Regarding the "detailed" message - I wasn't able to figure out any case where it is useful. Not sure if we need it at all. Kirill Tkalenko, can you give me an example?

Wed, 6 May 2020 at 21:01, Maxim Muzafarov :
> Kirill,
>
> Thank you for raising this topic. It's true that the rebalance process
> still requires additional information for analyzing issues. Please,
> don't think that I'm against your changes :-)
>
> * My short answer. *
>
> We won't do performance analysis on the production environment. Each
> time we need performance analysis, it will be done on a test
> environment with verbose logging enabled. Thus I suggest moving these
> changes to a separate `profiling` module and extending the logging much
> more, without any size limitations. The same as these [2] [3]
> activities do.
>
> Let's keep the `core` module as simple as possible.
> Let's design the right API for accessing rebalance internals for
> profiling tools.
> Can you, please, remove all changes from your PR [6] which are not
> related to the proposed topic? (e.g. variable renamings)
>
> * The long answer. *
>
> Here are my thoughts on this. There are two different types of issues
> in the rebalance process. The first case must be covered by daily
> monitoring subsystems, the second case must be covered by additional
> profiling tools.
> 1. Errors during the rebalancing (e.g. a rebalance does not happen when
> required).
> 2. Performance rebalancing issues (e.g. a rebalance is slow).
>
> Daily monitoring tools (JMX, logging) are always turned on and
> shouldn't require additional system resources themselves. Since these
> metrics must be lightweight, no internal aggregation machinery is used
> on them. All these metrics are collected from each node independently.
> Please take a look at this issue [1], which covers most of your
> questions mentioned above.
>
> For all available metrics, we can configure LogExporterSpi, so they
> will be available in logs.
>
> > 1) How long did the rebalance take (divided by supplier)?
> rebalancingStartTime and rebalancingEndTime already exist for cache
> groups [4].
> We can add the same for suppliers.
>
> > 2) How many records and bytes per supplier were rebalanced?
> We
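The extended "Completed rebalancing" message proposed earlier in the thread adds an avgSpeed field derived as bytesRcvd divided by duration. A small sketch of that derivation; the class and method names are illustrative, not Ignite API:

```java
// A sketch (names are illustrative, not Ignite API): the proposed avgSpeed
// field is bytesRcvd divided by the rebalance duration, reported in KB/sec.
public class RebalanceAvgSpeed {
    /** Average rebalance speed in KB/sec; a zero duration yields zero to avoid division by zero. */
    public static double avgSpeedKbSec(long bytesRcvd, long durationMs) {
        if (durationMs <= 0)
            return 0.0;
        return (bytesRcvd / 1024.0) / (durationMs / 1000.0);
    }

    public static void main(String[] args) {
        // bytesRcvd=5M and duration=12ms, as in the example message above.
        System.out.println(avgSpeedKbSec(5L * 1024 * 1024, 12));
    }
}
```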
Re: Extended logging for rebalance performance analysis
Kirill,

Thank you for raising this topic. It's true that the rebalance process still requires additional information for analyzing issues. Please, don't think that I'm against your changes :-)

* My short answer. *

We won't do performance analysis on the production environment. Each time we need performance analysis, it will be done on a test environment with verbose logging enabled. Thus I suggest moving these changes to a separate `profiling` module and extending the logging much more, without any size limitations. The same as these [2] [3] activities do.

Let's keep the `core` module as simple as possible.
Let's design the right API for accessing rebalance internals for profiling tools.
Can you, please, remove all changes from your PR [6] which are not related to the proposed topic? (e.g. variable renamings)

* The long answer. *

Here are my thoughts on this. There are two different types of issues in the rebalance process. The first case must be covered by daily monitoring subsystems, the second case must be covered by additional profiling tools.
1. Errors during the rebalancing (e.g. a rebalance does not happen when required).
2. Performance rebalancing issues (e.g. a rebalance is slow).

Daily monitoring tools (JMX, logging) are always turned on and shouldn't require additional system resources themselves. Since these metrics must be lightweight, no internal aggregation machinery is used on them. All these metrics are collected from each node independently. Please take a look at this issue [1], which covers most of your questions mentioned above.

For all available metrics, we can configure LogExporterSpi, so they will be available in logs.

> 1) How long did the rebalance take (divided by supplier)?
rebalancingStartTime and rebalancingEndTime already exist for cache groups [4].
We can add the same for suppliers.

> 2) How many records and bytes per supplier were rebalanced?
We already have rebalancingReceivedKeys and rebalancingReceivedBytes [4].
We will have rebalancingExpectedKeys [5].
We can add a new metric per cache keys, per supplier.

> 3) How many times did the rebalance restart?
The rebalancingLastCancelledTime [4] metric exists. Do we need to keep historical data on it?

> 4) Which partitions were rebalanced, and from which nodes were they received?
Let's print this information to the log before the rebalance process starts. This will be helpful information and does not require a lot of changes.

> 5) When did the rebalance for all cache groups end?
This metric may be aggregated from rebalancingEndTime [4] by pulling it from all nodes for all caches.

Performance rebalancing issues are related to profiling tools. They may require additional system resources and definitely require a dedicated environment for tests. We can't do performance analysis on production environments due to the performance impact. I see some disadvantages of adding such tools to production code:
- verbose logging may affect performance;
- the problematic process may become even worse if an automatic threshold suddenly turns on;
- new code changes will require additional effort to keep the logging up to date.

[1] https://issues.apache.org/jira/browse/IGNITE-12183
[2] https://issues.apache.org/jira/browse/IGNITE-12666
[3] https://cwiki.apache.org/confluence/display/IGNITE/Cluster+performance+profiling+tool
[4] https://issues.apache.org/jira/browse/IGNITE-12193
[5] https://issues.apache.org/jira/browse/IGNITE-12194
[6] https://github.com/apache/ignite/pull/7705/files

On Wed, 6 May 2020 at 19:50, Ivan Rakov wrote:
>
> Hi,
>
> > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
> > duration of a cache group rebalance after which the partition
> > distribution is output, set in milliseconds; the default value is
> > 10 minutes.
>
> Does it mean that if the rebalancing process took less than 10 minutes,
> only a short version of the message (with supplier statistics) will show up?
>
> In general, I have no objections.
>
> On Mon, May 4, 2020 at 10:38 AM ткаленко кирилл wrote:
>
> > Hi, Igniters!
> >
> > I'd like to share a new small feature in AI [1].
> >
> > Current rebalance logging does not allow you to quickly answer the
> > following questions:
> > 1) How long did the rebalance take (divided by supplier)?
> > 2) How many records and bytes per supplier were rebalanced?
> > 3) How many times did the rebalance restart?
> > 4) Which partitions were rebalanced, and from which nodes were they
> > received?
> > 5) When did the rebalance for all cache groups end?
> >
> > What you can see in logs now:
> >
> > 1) Start of the rebalance, with the order of cache groups:
> > Rebalancing scheduled [order=[ignite-sys-cache, grp1, grp0],
> > top=AffinityTopologyVersion [topVer=2, minorTopVer=0], force=false,
> > evt=NODE_JOINED, node=c2146a04-dc23-4bc9-870d-dfbb55c1]
> >
> > 2) Start of a cache group rebalance from a specific supplier, specifying
> > partition ids and mode - historical or full:
> > Starting rebalance routine [ignite-sys-cache,
> > topVer=AffinityTopologyVersion [topVer=2,
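Wiring up the LogExporterSpi mentioned above could look roughly like this. This is a sketch assuming the Ignite 2.8+ metrics framework; the one-minute export period is illustrative, not a recommendation:

```java
// Sketch: configure LogExporterSpi so that rebalance metrics
// (rebalancingStartTime, rebalancingReceivedBytes, ...) are written to the
// node log periodically. Assumes the Ignite 2.8+ metrics framework.
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.metric.log.LogExporterSpi;

public class MetricsLogConfig {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        LogExporterSpi logExporter = new LogExporterSpi();
        logExporter.setPeriod(60_000); // dump metrics to the log every minute

        cfg.setMetricExporterSpi(logExporter);

        Ignition.start(cfg);
    }
}
```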
Re: Extended logging for rebalance performance analysis
Hi,

> IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
> duration of a cache group rebalance after which the partition distribution
> is output, set in milliseconds; the default value is 10 minutes.

Does it mean that if the rebalancing process took less than 10 minutes, only a short version of the message (with supplier statistics) will show up?

In general, I have no objections.

On Mon, May 4, 2020 at 10:38 AM ткаленко кирилл wrote:

> Hi, Igniters!
>
> I'd like to share a new small feature in AI [1].
>
> Current rebalance logging does not allow you to quickly answer the
> following questions:
> 1) How long did the rebalance take (divided by supplier)?
> 2) How many records and bytes per supplier were rebalanced?
> 3) How many times did the rebalance restart?
> 4) Which partitions were rebalanced, and from which nodes were they
> received?
> 5) When did the rebalance for all cache groups end?
>
> What you can see in logs now:
>
> 1) Start of the rebalance, with the order of cache groups:
> Rebalancing scheduled [order=[ignite-sys-cache, grp1, grp0],
> top=AffinityTopologyVersion [topVer=2, minorTopVer=0], force=false,
> evt=NODE_JOINED, node=c2146a04-dc23-4bc9-870d-dfbb55c1]
>
> 2) Start of a cache group rebalance from a specific supplier, specifying
> partition ids and mode - historical or full:
> Starting rebalance routine [ignite-sys-cache,
> topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0],
> supplier=8c525892-703b-4fc4-b28b-b2f13970, fullPartitions=[0-99],
> histPartitions=[]]
>
> 3) Receiving partial or complete partitions of a cache group:
> Completed rebalancing [grp=ignite-sys-cache,
> supplier=8c525892-703b-4fc4-b28b-b2f13970,
> topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=1/2]
> Completed (final) rebalancing [grp=ignite-sys-cache,
> supplier=c2146a04-dc23-4bc9-870d-dfbb55c1,
> topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=2/2]
>
> 4) End of a cache group rebalance:
> Completed rebalance future: RebalanceFuture [grp=CacheGroupContext
> [grp=ignite-sys-cache], topVer=AffinityTopologyVersion [topVer=2,
> minorTopVer=0], rebalanceId=1, routines=1, receivedBytes=1200,
> receivedKeys=0, partitionsLeft=0, startTime=1588519707607, endTime=-1,
> lastCancelledTime=-1]
>
> Rebalance statistics:
>
> To speed up rebalance analysis, statistics will be output for each cache
> group and in total for all cache groups.
> If the rebalance duration for a cache group is greater than the threshold
> value, the partition distribution is output as well.
> The statistics will allow you to analyze the rebalance duration for each
> supplier, to understand which of them has been transmitting data for the
> longest time.
>
> System properties are used to output statistics:
>
> IGNITE_QUIET - to output statistics, the value must be false;
> IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
> duration of a cache group rebalance after which the partition distribution
> is output, set in milliseconds; the default value is 10 minutes.
>
> Statistics examples:
>
> Successful full and historical rebalance of a cache group, without
> partition distribution:
> Rebalance information per cache group (successful rebalance): [id=3181548,
> name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13
> 10:55:16,127, d=10 ms, restarted=0]
> Supplier statistics: [nodeId=0, p=5, d=10 ms] [nodeId=1, p=5, d=10 ms]
> Aliases: p - partitions, e - entries, b - bytes, d - duration,
> h - historical, nodeId mapping (nodeId=id,consistentId)
> [0=rebalancing.RebalanceStatisticsTest1]
> [1=rebalancing.RebalanceStatisticsTest0]
> Rebalance information per cache group (successful rebalance): [id=3181547,
> name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13
> 15:01:44,116, d=116 ms, restarted=0]
> Supplier statistics: [nodeId=0, hp=10, he=300, hb=30267, d=116 ms]
> Aliases: p - partitions, e - entries, b - bytes, d - duration,
> h - historical, nodeId mapping (nodeId=id,consistentId)
> [0=rebalancing.RebalanceStatisticsTest0]
>
> Successful full and historical rebalance of a cache group, with partition
> distribution:
> Rebalance information per cache group (successful rebalance): [id=3181548,
> name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13
> 10:55:16,127, d=10 ms, restarted=0]
> Supplier statistics: [nodeId=0, p=5, d=10 ms] [nodeId=1, p=5, d=10 ms]
> Aliases: p - partitions, e - entries, b - bytes, d - duration,
> h - historical, nodeId mapping (nodeId=id,consistentId)
> [0=rebalancing.RebalanceStatisticsTest1]
> [1=rebalancing.RebalanceStatisticsTest0]
> Rebalance duration was greater than 5 ms, printing detailed information
> about partitions distribution (threshold can be changed by setting the
> number of milliseconds into
> IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD)
> 0 = [0,bu,su],[1,bu],[2,pr,su]
> 1 = [0,bu,su],[1,bu],[2,pr,su]
> 2 = [0,bu,su],[1,bu],[2,pr,su]
> 3 = [0,bu,su],[1,bu],[2,pr,su]
> 4 = [0,bu,su],[1,bu],[2,pr,su]
> 5 = [0,bu,su],[1,bu],[2,pr,su]
> 6 = [0,bu,su],[1,bu],[2,pr,su]
> 7 =
Extended logging for rebalance performance analysis
Hi, Igniters!

I'd like to share a new small feature in AI [1].

Current rebalance logging does not allow you to quickly answer the following questions:
1) How long did the rebalance take (divided by supplier)?
2) How many records and bytes per supplier were rebalanced?
3) How many times did the rebalance restart?
4) Which partitions were rebalanced, and from which nodes were they received?
5) When did the rebalance for all cache groups end?

What you can see in logs now:

1) Start of the rebalance, with the order of cache groups:
Rebalancing scheduled [order=[ignite-sys-cache, grp1, grp0], top=AffinityTopologyVersion [topVer=2, minorTopVer=0], force=false, evt=NODE_JOINED, node=c2146a04-dc23-4bc9-870d-dfbb55c1]

2) Start of a cache group rebalance from a specific supplier, specifying partition ids and mode - historical or full:
Starting rebalance routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], supplier=8c525892-703b-4fc4-b28b-b2f13970, fullPartitions=[0-99], histPartitions=[]]

3) Receiving partial or complete partitions of a cache group:
Completed rebalancing [grp=ignite-sys-cache, supplier=8c525892-703b-4fc4-b28b-b2f13970, topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=1/2]
Completed (final) rebalancing [grp=ignite-sys-cache, supplier=c2146a04-dc23-4bc9-870d-dfbb55c1, topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=2/2]

4) End of a cache group rebalance:
Completed rebalance future: RebalanceFuture [grp=CacheGroupContext [grp=ignite-sys-cache], topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], rebalanceId=1, routines=1, receivedBytes=1200, receivedKeys=0, partitionsLeft=0, startTime=1588519707607, endTime=-1, lastCancelledTime=-1]

Rebalance statistics:

To speed up rebalance analysis, statistics will be output for each cache group and in total for all cache groups.
If the rebalance duration for a cache group is greater than the threshold value, the partition distribution is output as well.
The statistics will allow you to analyze the rebalance duration for each supplier, to understand which of them has been transmitting data for the longest time.

System properties are used to output statistics:

IGNITE_QUIET - to output statistics, the value must be false;
IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold duration of a cache group rebalance after which the partition distribution is output, set in milliseconds; the default value is 10 minutes.

Statistics examples:

Successful full and historical rebalance of a cache group, without partition distribution:
Rebalance information per cache group (successful rebalance): [id=3181548, name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13 10:55:16,127, d=10 ms, restarted=0]
Supplier statistics: [nodeId=0, p=5, d=10 ms] [nodeId=1, p=5, d=10 ms]
Aliases: p - partitions, e - entries, b - bytes, d - duration, h - historical, nodeId mapping (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1] [1=rebalancing.RebalanceStatisticsTest0]
Rebalance information per cache group (successful rebalance): [id=3181547, name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13 15:01:44,116, d=116 ms, restarted=0]
Supplier statistics: [nodeId=0, hp=10, he=300, hb=30267, d=116 ms]
Aliases: p - partitions, e - entries, b - bytes, d - duration, h - historical, nodeId mapping (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]

Successful full and historical rebalance of a cache group, with partition distribution:
Rebalance information per cache group (successful rebalance): [id=3181548, name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13 10:55:16,127, d=10 ms, restarted=0]
Supplier statistics: [nodeId=0, p=5, d=10 ms] [nodeId=1, p=5, d=10 ms]
Aliases: p - partitions, e - entries, b - bytes, d - duration, h - historical, nodeId mapping (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1] [1=rebalancing.RebalanceStatisticsTest0]
Rebalance duration was greater than 5 ms, printing detailed information about partitions distribution (threshold can be changed by setting the number of milliseconds into IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD)
0 = [0,bu,su],[1,bu],[2,pr,su]
1 = [0,bu,su],[1,bu],[2,pr,su]
2 = [0,bu,su],[1,bu],[2,pr,su]
3 = [0,bu,su],[1,bu],[2,pr,su]
4 = [0,bu,su],[1,bu],[2,pr,su]
5 = [0,bu,su],[1,bu],[2,pr,su]
6 = [0,bu,su],[1,bu],[2,pr,su]
7 = [0,bu,su],[1,bu],[2,pr,su]
8 = [0,bu,su],[1,bu],[2,pr,su]
9 = [0,bu,su],[1,bu],[2,pr,su]
Aliases: pr - primary, bu - backup, su - supplier node, h - historical, nodeId mapping (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1] [1=rebalancing.RebalanceStatisticsTest2] [2=rebalancing.RebalanceStatisticsTest0]
Rebalance information per cache group (successful rebalance): [id=3181547, name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13 15:01:44,116, d=116 ms, restarted=0]
Supplier statistics: [nodeId=0, hp=10, he=300, hb=30267, d=116 ms]
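The threshold behaviour described above (print the partition distribution only when the rebalance took longer than the configured number of milliseconds) can be sketched as follows. The helper class is hypothetical and only illustrates resolving the property with its 10-minute default, not Ignite's actual code:

```java
// Sketch: resolve IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD
// with its 10-minute default, as described in the statistics section above.
public class DistributionThreshold {
    static final long DFLT_THRESHOLD_MS = 10 * 60 * 1000; // 10 minutes

    /** Threshold in ms above which the partition distribution is printed. */
    public static long thresholdMs() {
        String v = System.getProperty("IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD");
        if (v == null)
            return DFLT_THRESHOLD_MS;
        return Long.parseLong(v);
    }

    public static void main(String[] args) {
        System.out.println(thresholdMs()); // 600000 when the property is unset
    }
}
```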