Re: Choosing historical rebalance heuristics

2020-07-17 Thread Alexei Scherbakov
Hi.

The proposed approach to improve rebalancing efficiency looks a little bit
naive to me.

A mature solution to this problem should include computing a "cost" for
each rebalancing type.

The cost formula should take various aspects into consideration, such as the
number of pages in the memory cache, page replacement, disk read speed, the
number of indexes, etc.

As a short-term solution, the proposed heuristic looks acceptable.

My suggestions for an implementation:

1. Implement the cost framework right now.
The rebalance cost is measured as the sum of the costs of all rebalancing
groups, and this sum should be minimized.
Rebalancing behavior should depend only on the outcome of the cost formula,
without the need to modify other code.
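
For illustration, a minimal sketch of what such a framework could look like;
the class, its inputs and the weights are hypothetical placeholders, not
existing Ignite code:

import java.util.EnumMap;
import java.util.Map;

public class RebalanceCostSketch {
    enum RebalanceType { FULL, HISTORICAL }

    /** Inputs a cost formula could take into account (illustrative only). */
    static class CostInput {
        long partSize;       // rows currently in the partition
        long walUpdates;     // updates that would be replayed from WAL history
        long walBytesToScan; // WAL bytes that have to be read to find them
        int indexes;         // number of indexes maintained during rebalance
    }

    /** Placeholder weights; a real formula would be calibrated and pluggable. */
    static long cost(RebalanceType type, CostInput in) {
        switch (type) {
            case FULL:
                // Transfer every row and rebuild indexes from scratch.
                return in.partSize * (1L + in.indexes);
            case HISTORICAL:
                // Replay the WAL delta, paying for the WAL scan as well.
                return in.walUpdates * (1L + in.indexes) + in.walBytesToScan / 4096;
            default:
                throw new IllegalArgumentException(type.toString());
        }
    }

    /** Pick the cheapest rebalance type for a single rebalancing group. */
    static RebalanceType choose(CostInput in) {
        Map<RebalanceType, Long> costs = new EnumMap<>(RebalanceType.class);

        for (RebalanceType t : RebalanceType.values())
            costs.put(t, cost(t, in));

        return costs.entrySet().stream().min(Map.Entry.comparingByValue()).get().getKey();
    }
}

In this shape the rest of the rebalancing code would only call choose(...) per
group and log both computed costs; choosing the minimum per group minimizes the
total sum.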

2. Avoid unnecessary WAL scanning.
Ideally, we should scan only the segments that contain the required updates.
We can build a tracking data structure for each partition, adjust it on each
checkpoint and on WAL archive removal, and then use it for efficient scanning
during WAL iteration.
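
A minimal sketch of such a per-partition tracking structure, assuming it is
fed from the checkpoint and archive-removal hooks (class and method names are
made up for illustration, not the actual Ignite API):

import java.util.Map;
import java.util.NavigableSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListSet;

public class PartitionWalTracker {
    /** partitionId -> ordered set of WAL segment indexes containing updates for it. */
    private final Map<Integer, NavigableSet<Long>> segmentsByPart = new ConcurrentHashMap<>();

    /** Called on checkpoint with the partitions dirtied since the previous one. */
    public void onCheckpoint(long segmentIdx, Iterable<Integer> updatedParts) {
        for (int partId : updatedParts)
            segmentsByPart.computeIfAbsent(partId, p -> new ConcurrentSkipListSet<>())
                .add(segmentIdx);
    }

    /** Called when old segments are removed from the WAL archive. */
    public void onArchiveRemoved(long lastRemovedSegmentIdx) {
        for (NavigableSet<Long> segments : segmentsByPart.values())
            segments.headSet(lastRemovedSegmentIdx, true).clear();
    }

    /** Segments that actually have to be scanned to rebalance the given partition. */
    public NavigableSet<Long> segmentsToScan(int partId, long fromSegmentIdx) {
        NavigableSet<Long> segments = segmentsByPart.get(partId);

        return segments == null
            ? new ConcurrentSkipListSet<>()
            : segments.tailSet(fromSegmentIdx, true);
    }
}

The historical iterator could then walk only the segments returned by
segmentsToScan(...) instead of the whole archive.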

3. Avoid any calculations during the PME sync phase. The coordinator should
only collect the available WAL history, switch lagging OWNING partitions to
MOVING, and send all of this to the nodes.
After PME, each node should calculate the cost and choose the most efficient
rebalancing for its local data.

4. We should log the reason (i.e. the computed cost) for choosing a specific
rebalancing type.

I'm planning to implement tombstones in the near future. This should remove
the requirement to clear a partition before full rebalancing in most cases,
improving its cost.

This can be further enhanced later using some kind of Merkle trees,
file-based rebalancing, or something else.


Re: Choosing historical rebalance heuristics

2020-07-16 Thread Ivan Rakov
>
>  I think we can modify the heuristic as follows:
> 1) Exclude partitions by threshold (IGNITE_PDS_WAL_REBALANCE_THRESHOLD,
> reduced to 500).
> 2) Select a partition for historical rebalance only when the difference
> between its counters is less than the partition size.

Agreed, let's go this way.


Re: Choosing historical rebalance heuristics

2020-07-16 Thread Vladislav Pyatkov
I completely forgot about another argument in favor of using historical
rebalance wherever possible: when the cluster decides to use full rebalance,
the demander nodes have to clear their non-empty partitions first.
This can take a long time; in some cases it may be comparable with the time
of the rebalance itself.
This is another point in favor of the heuristic proposed earlier in the thread.


Re: Choosing historical rebalance heuristics

2020-07-15 Thread Vladislav Pyatkov
Ivan,

I agree with the combined approach: a threshold for small partitions and an
update count for partitions that outgrow it.
This helps to avoid historical rebalance for partitions that are updated
infrequently.

Reading a big piece of WAL (more than 100 GB) can only happen when a client
has configured it that way intentionally.
There is no doubt that we can read it; otherwise the WAL space would not have
been configured to be that large.

I don't see a connection between the iterator optimization and the issue in
the atomic protocol.
Reordering in the WAL that happens within a checkpoint where the counter did
not change is an extremely rare case, and the issue cannot be solved for the
generic case here anyway; it should be fixed within the bounds of the protocol.

I think we can modify the heuristic as follows:
1) Exclude partitions by threshold (IGNITE_PDS_WAL_REBALANCE_THRESHOLD,
reduced to 500).
2) Select a partition for historical rebalance only when the difference
between its counters is less than the partition size.

Also, implement the mentioned optimization for the historical iterator, which
may reduce the time spent reading a large WAL interval.
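
A rough sketch of the combined check from points 1) and 2), with purely
illustrative names and no claim to match the actual Ignite implementation:

public class HistoricalRebalanceHeuristic {
    /** Lowered threshold that still excludes tiny partitions (point 1). */
    private final long walRebalanceThreshold;

    public HistoricalRebalanceHeuristic(long walRebalanceThreshold) {
        this.walRebalanceThreshold = walRebalanceThreshold;
    }

    /**
     * @param partSize Full size (rows) of the partition.
     * @param supplierCntr Highest update counter known on the supplier node.
     * @param demanderCntr Update counter at which the demander node stopped.
     * @param walHistoryAvailable Whether the WAL history still covers demanderCntr.
     */
    public boolean useHistorical(long partSize, long supplierCntr, long demanderCntr,
        boolean walHistoryAvailable) {
        // Point 1: tiny partitions (or missing history) always go through full rebalance.
        if (!walHistoryAvailable || partSize < walRebalanceThreshold)
            return false;

        // Point 2: the WAL delta must be smaller than the partition itself.
        long updatesToReplay = supplierCntr - demanderCntr;

        return updatesToReplay < partSize;
    }
}

With the proposed default this would be constructed as
new HistoricalRebalanceHeuristic(500).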


-- 
Vladislav Pyatkov


Re: Choosing historical rebalance heuristics

2020-07-15 Thread Ivan Rakov
Hi Vladislav,

Thanks for raising this topic.
The currently present IGNITE_PDS_WAL_REBALANCE_THRESHOLD (default is 500_000)
is controversial. Assuming that the default number of partitions is 1024, a
cache should contain a really huge amount of data in order to make WAL delta
rebalancing possible. In fact, it's currently disabled for most production
cases, which makes rebalancing of persistent caches unreasonably long.

I think your approach [1] makes much more sense than the current heuristic;
let's move forward with the proposed solution.

Though, there are some other corner cases, e.g. this one:
- The configured size of the WAL archive is big (>100 GB)
- The cache has small partitions (e.g. 1000 entries)
- Updates are infrequent (e.g. ~100 in the whole WAL history of any node)
- There is another cache with very frequent updates which allocates >99% of
the WAL
In such a scenario we may need to iterate over >100 GB of WAL in order to
fetch <1% of the needed updates. Even though the amount of network traffic is
still optimized, it would be more effective to transfer partitions with
~1000 entries fully instead of reading >100 GB of WAL.

I want to highlight that your heuristic definitely makes the situation
better, but due to possible corner cases we should keep a fallback lever to
restrict or limit historical rebalance as before. Probably, it would be handy
to keep the IGNITE_PDS_WAL_REBALANCE_THRESHOLD property with a low default
value (1000, 500 or even 0) and apply your heuristic only to partitions above
that size.

Regarding case [2]: it looks like an improvement that can mitigate some
corner cases (including the one I have described). I'm ok with it as long as
it takes data update reordering on backup nodes into account. We don't track
skipped updates for atomic caches; as a result, detecting the absence of
updates between two checkpoint markers with the same partition counter can
produce a false positive.

--
Best Regards,
Ivan Rakov



Choosing historical rebalance heuristics

2020-07-14 Thread Vladislav Pyatkov
Hi guys,

I want to implement a more honest heuristic for historical rebalance.
Currently, a cluster decides whether to use historical rebalance based only
on the partition size. This threshold is better known by the name of the
property IGNITE_PDS_WAL_REBALANCE_THRESHOLD.
It can prevent historical rebalance when a partition is too small, but not
when the WAL contains more updates than the partition size; in that case
historical rebalance can still be chosen.

My idea for the implementation is to estimate the amount of data that will be
transferred over the network. In other words, if rebalancing requires
replaying a part of the WAL that contains N updates in order to recover a
partition that holds M rows in total on another node, historical rebalance
should be chosen only in the case where N < M (and the WAL history must be
present as well).
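
As a minimal illustration of this comparison (the names below are made up,
and it ignores the small-partition threshold discussed later in the thread):

public class HistoricalVsFullEstimate {
    /**
     * @param supplierUpdateCntr Highest update counter on the node that keeps the history.
     * @param demanderUpdateCntr Update counter at which the joining node stopped.
     * @param partitionFullSize Total number of rows (M) in the partition.
     * @param walHistoryPresent Whether the WAL history still reaches demanderUpdateCntr.
     */
    public static boolean preferHistorical(long supplierUpdateCntr, long demanderUpdateCntr,
        long partitionFullSize, boolean walHistoryPresent) {
        long updatesInWal = supplierUpdateCntr - demanderUpdateCntr; // N
        long rowsToSendFully = partitionFullSize;                    // M

        return walHistoryPresent && updatesInWal < rowsToSendFully;  // N < M
    }
}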

This approach is easy to implement, because the coordinator node already has
the partition sizes and the counter intervals. But even then the cluster may
still find only a few updates in a very long WAL history. I assume it is
possible to work around this if the historical rebalance iterator does not
handle checkpoints that contain no updates of the particular cache. A
checkpoint can be skipped if the counters for the cache (maybe even for
specific partitions) did not change between it and the next one.
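
A small sketch of that skip condition, assuming the update counters stored in
two consecutive checkpoint markers are available (hypothetical names, not the
actual iterator code):

import java.util.Map;

public class CheckpointSkipFilter {
    /**
     * @param cntrsAtCheckpoint     partitionId -> update counter stored in the checkpoint marker.
     * @param cntrsAtNextCheckpoint partitionId -> update counter stored in the next marker.
     * @param rebalancedParts       Partitions requested for historical rebalance.
     * @return {@code true} if the WAL between the two checkpoints can be skipped.
     */
    public static boolean canSkip(Map<Integer, Long> cntrsAtCheckpoint,
        Map<Integer, Long> cntrsAtNextCheckpoint, Iterable<Integer> rebalancedParts) {
        for (int partId : rebalancedParts) {
            long before = cntrsAtCheckpoint.getOrDefault(partId, 0L);
            long after = cntrsAtNextCheckpoint.getOrDefault(partId, 0L);

            // Something changed for this partition, so this interval must be read.
            if (after != before)
                return false;
        }

        return true;
    }
}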

The ticket for the historical rebalance iterator improvement is [2].

I want to hear the community's view on the thoughts above.
Maybe someone has another opinion?

[1]: https://issues.apache.org/jira/browse/IGNITE-13253
[2]: https://issues.apache.org/jira/browse/IGNITE-13254

-- 
Vladislav Pyatkov