[jira] [Updated] (IGNITE-19133) Increase partitions count upper bound

2023-04-07 Thread Kirill Tkalenko (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-19133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirill Tkalenko updated IGNITE-19133:
-
Description: 
h3. Problem

Data partitioning is used to distribute data (hopefully) evenly across the
cluster and to provide the necessary operation parallelism for the end user. As
a rule of thumb, one may consider allocating 256 partitions per Ignite node in
order to achieve that.

This rule only scales up to a certain point. Imagine a cluster of 1000 nodes,
with a table that has 3 replicas of each partition (the ability to lose 1
backup). With the current limit of 65500 partitions, the maximum number of
partitions per node would be {{{}65500*3/1000 ~= 196{}}}. This is the limit of
our scalability, according to the aforementioned rule. To provide 256
partitions per node, the user would have to:
 * either increase the number of backups, which proportionally increases the
required storage space (affects cost),
 * or increase the total number of partitions to about 85 thousand, which is
not possible right now (a quick arithmetic check follows the list).
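
For reference, here is a quick check of the numbers above as a tiny,
self-contained snippet; the 256-per-node target and the 65500 limit come from
the text, the rest is plain arithmetic:
{code:java}
public class PartitionMath {
    public static void main(String[] args) {
        int nodes = 1000;
        int replicas = 3;
        int partitionLimit = 65500; // current upper bound on partitions
        int targetPerNode = 256;    // rule-of-thumb target per node

        // Partitions per node achievable under the current limit:
        System.out.println(partitionLimit * replicas / nodes);        // 196

        // Total partitions needed to give every node ~256 of them:
        System.out.println((long) nodes * targetPerNode / replicas);  // 85333
    }
}
{code}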

h3. What's the reason for the current limit

Disclaimer: I'm not the one who designed it, so my thoughts may be speculative 
in some sense.

The short answer is: the number of partitions needs to fit into 2 bytes.

Long answer: in the current implementation we have a 1-to-1 correspondence
between the logical partition id and the physical partition id. We use the same
value both in affinity and in the physical file name. This makes the system
simpler, and I believe that simplicity is the real explanation of the
restriction.

Why does it have to be 2 bytes, and not 3, for example? The key is the
structure of page identifiers in data regions:
{code:java}
+---------------------------+----------------+---------------------+-----------------+
| rotation/item id (1 byte) | flags (1 byte) | partition (2 bytes) | index (4 bytes) |
+---------------------------+----------------+---------------------+-----------------+{code}
The idea was to fit it into 8 bytes. Good idea, in my opinion. Making it bigger 
doesn't feel right.
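
To make the layout above concrete, here is a minimal encode/decode sketch that
matches the diagram; the field order follows the table, but the class and
method names are illustrative only and are not the actual Ignite page id code:
{code:java}
public final class PageIdLayoutSketch {
    // Bits, high to low: item id (8) | flags (8) | partition (16) | index (32).
    static long pageId(int itemId, int flags, int partitionId, int pageIndex) {
        return ((long) (itemId & 0xFF) << 56)
             | ((long) (flags & 0xFF) << 48)
             | ((long) (partitionId & 0xFFFF) << 32)
             | (pageIndex & 0xFFFF_FFFFL);
    }

    static int partitionId(long pageId) {
        return (int) ((pageId >>> 32) & 0xFFFF); // only 2 bytes of room
    }

    static int pageIndex(long pageId) {
        return (int) pageId; // low 4 bytes
    }
}
{code}
The 2-byte partition field is what bounds the partition count (65500 is just
under 2^16): it cannot address more partitions without growing the 8-byte page
id.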
h3. Proposed solution

As mentioned, there are two components to the problem:
 # One to one correspondence between partition ids.
 # Hard limit in a single data region, caused by the page id layout.

There's not much we can do with component #2, because the implications are 
unpredictable, and the amount of code we would need to fix is astonishing.
h4. More reasonable restrictions

This leads us to the following problem: no single Ignite node can have more
than 65500 partitions for a table (or distribution zone). So, imagine the
situation:
 * the user has a cluster with 3 nodes
 * the user tries to create a distribution zone with 3 nodes, 3 replicas for
each partition and 100k partitions

While this is absurd, the configuration is still "valid", but it leads to 100k
partitions per node, which is impossible.

Such zone configurations must be banned. Such a restriction doesn't seem
unreasonable: if a user wants to create so many partitions on such a small
cluster, they really don't understand what they're doing.

This naturally gives us a minimum number of nodes for a given number of
partitions and replicas, expressed by the following formula (assuming that you
can't have 2 replicas of the same partition on the same Ignite node):
{code:java}
nodes >= max(replicas, ceil(partitions * replicas / 65500))
{code}
This estimation is imprecise, because it assumes a perfect distribution. In
reality, rendezvous affinity is uneven, so the real value must be checked when
the user configures the number of nodes for a specific distribution zone.
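
As a sanity check, here is the bound evaluated for a few configurations; the
helper below only restates the formula (both constraints have to hold at once,
hence the max), it is not an existing Ignite API:
{code:java}
public class MinNodesCheck {
    static final int MAX_PARTITIONS_PER_NODE = 65500;

    // Lower bound on the node count for a zone, assuming a perfect distribution.
    static long minNodes(long partitions, long replicas) {
        long byCapacity = (partitions * replicas + MAX_PARTITIONS_PER_NODE - 1)
                / MAX_PARTITIONS_PER_NODE;     // ceil(partitions * replicas / 65500)
        return Math.max(replicas, byCapacity); // replicas of one partition never co-locate
    }

    public static void main(String[] args) {
        System.out.println(minNodes(100_000, 3)); // 5 -> a 3-node cluster is not enough
        System.out.println(minNodes(85_000, 3));  // 4
        System.out.println(minNodes(10, 3));      // 3 -> replicas dominate for small zones
    }
}
{code}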
h4. Ties to rebalance

For this question I would probably need some assistance. During affinity
reassignment, each node may store more partitions than are stated in any single
distribution. What do I mean by this:
 * imagine a node having partitions 1, 2, and 3
 * after the reassignment, the node has partitions 3, 4 and 5

Each individual distribution states that the node only has 3 partitions, while
during the rebalance it may store all 5: sending 1 and 2 to some node and
receiving 4 and 5 from a different one.

Multiply that by a big factor, and it is possible to have a situation where the
local number of partitions exceeds 65500. The only way to beat it, in my
opinion, is to lower the hard limit in the affinity function to 32xxx per node,
leaving space for partitions in the MOVING state.
h4. Mapping partition ids

With that being said, all that's left is to map logical partition ids from the
range 0..N (where N is unlimited) to physical ids from the range 0..65500.

Such a mapping is a local entity, encapsulated deep inside the storage engine.
The simplest way to do so is to have a HashMap \{ logical -> physical } and to
increase the physical partition id by 1 every time you add a new partition to
the node. If the {{values()}} set is not contiguous, one may occupy a gap; it's
not too hard to implement (see the sketch below).
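
A minimal sketch of such a mapping, with gap reuse; the class and method names
are illustrative and this is not actual Ignite code:
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

/** Maps unbounded logical partition ids onto the limited physical range. */
public class PartitionIdMapper {
    private static final int MAX_PHYSICAL_PARTITIONS = 65500;

    private final Map<Integer, Integer> logicalToPhysical = new HashMap<>();
    private final TreeSet<Integer> freePhysicalIds = new TreeSet<>();
    private int nextPhysicalId;

    /** Returns the physical id for a logical partition, allocating one if absent. */
    public synchronized int physicalId(int logicalId) {
        return logicalToPhysical.computeIfAbsent(logicalId, id -> {
            if (!freePhysicalIds.isEmpty())
                return freePhysicalIds.pollFirst(); // occupy a gap in values() first
            if (nextPhysicalId >= MAX_PHYSICAL_PARTITIONS)
                throw new IllegalStateException("No free physical partition ids on this node");
            return nextPhysicalId++;
        });
    }

    /** Releases the physical id when the partition leaves this node. */
    public synchronized void remove(int logicalId) {
        Integer physical = logicalToPhysical.remove(logicalId);
        if (physical != null)
            freePhysicalIds.add(physical);
    }
}
{code}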

Of course, this correspondence 
