Understood, thanks!

On Sun, 6 Nov 2022, 21:33 Jeremy McMillan, <[email protected]> wrote:
> Think of each AZ as being a massive piece of server hardware running VMs or workloads for you. When hardware (or an infrastructure maintenance process) fails, assume everything in one AZ is lost at the same time.
>
> On Sun, Nov 6, 2022, 09:58 Surinder Mehra <[email protected]> wrote:
>
>> That's partially true. The whole exercise of configuring AZ as a backup filter is because we want to handle AZ-level failure.
>>
>> Anyway, thanks for the inputs. Will figure out further steps.
>>
>> On Sun, 6 Nov 2022, 20:55 Jeremy McMillan, <[email protected]> wrote:
>>
>>> Don't configure 2 backups when you only have two failure domains.
>>>
>>> You're worried about node-level failure, but you're telling Ignite to worry about AZ-level failure.
>>>
>>> On Sat, Nov 5, 2022, 21:57 Surinder Mehra <[email protected]> wrote:
>>>
>>>> Yeah, I think there is a misunderstanding. Although I figured out my answers from our discussion, I will try one final attempt to clarify my point about 2X space for node 3.
>>>>
>>>> Node setup:
>>>> Node 1 and node 2 placed in AZ1
>>>> Node 3 placed in AZ2
>>>>
>>>> Since I am using AZ as the backup filter, as I mentioned in my first message, the backup of node 1 cannot be placed on node 2 and the backup of node 2 cannot be placed on node 1, because they are in the same AZ. This simply means their backups would go to node 3, which is in another AZ. Hence node 3 space = (node 3 primary partitions + node 1 backup partitions + node 2 backup partitions).
>>>>
>>>> Wouldn't this mean node 3 needs 2X space compared to node 1 and node 2? Assuming the backup partitions of node 3 would be equally distributed between the other two nodes, those two would need almost the same space.
>>>>
>>>> On Tue, 1 Nov 2022, 23:30 Jeremy McMillan, <[email protected]> wrote:
>>>>
>>>>> On Tue, Nov 1, 2022 at 10:02 AM Surinder Mehra <[email protected]> wrote:
>>>>>
>>>>>> Even if we have 2 copies of data, the primary and backup copy would be stored in different AZs. My question remains valid in this case as well.
>>>>>
>>>>> I think additional backup copies in the same AZ are superfluous if we start with the assumption that multiple concurrent failures are most likely to affect resources in the same AZ. A second node failure, if that's your failure budget, is likely to corrupt all the backup copies in the second AZ.
>>>>>
>>>>> If you only have two AZs available in some data centers/deployments, but you need 3-way redundancy on certain caches/tables, then using the AZ node attribute for backup filtering is too coarse-grained. Using AZ is a general-case best practice which gives your cluster the best chance of surviving multiple hardware failures in AWS, because AWS pools hardware resources in AZs. Maybe you just need three AZs? Maybe AZ isn't the correct failure domain for your use case?
>>>>>
>>>>>> Do we have to ensure nodes in two AZs are always present, or does Ignite have a way to indicate it couldn't create backups? Silently skipping backups is not a desirable state.
>>>>>
>>>>> Do you use synchronous or asynchronous backups?
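As a side note on that question, a minimal sketch of where the backup count and the synchronous/asynchronous behaviour are set on a cache follows; the cache name and the choice of FULL_SYNC are illustrative assumptions, not settings taken from this thread.

    import org.apache.ignite.cache.CacheWriteSynchronizationMode;
    import org.apache.ignite.configuration.CacheConfiguration;

    public class BackupSyncConfigSketch {
        public static CacheConfiguration<Integer, String> cacheConfig() {
            CacheConfiguration<Integer, String> cacheCfg = new CacheConfiguration<>("myCache");

            // One primary copy plus two backup copies of every partition.
            cacheCfg.setBackups(2);

            // FULL_SYNC: a write completes only after all backups are updated (synchronous backups).
            // PRIMARY_SYNC (the default): the write returns once the primary is updated; backups
            // catch up asynchronously.
            cacheCfg.setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC);

            return cacheCfg;
        }
    }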
>>>>> https://ignite.apache.org/docs/2.11.1/configuring-caches/configuring-backups#synchronous-and-asynchronous-backups
>>>>>
>>>>> You can periodically poll the caches' configurations or hook a cluster state event, re-compare the cache backup configuration against the enumerated available AZs, and raise an exception or log a message or whatever to detect the issue as soon as the AZ count drops below the minimum. This might also serve as a fuzzy warning-condition detection point for proactive infrastructure operations. If you count all of the nodes in each AZ, you can detect and track AZ load imbalances as the ratio between the smallest AZ node count and the average AZ node count. (A sketch of such a check appears below, after this quoted message.)
>>>>>
>>>>>> 2. In my original message, with 2 nodes (node 1 and node 2) in AZ1 and the 3rd node in the second AZ, backups of node 1 and node 2 would be placed on node 3 in AZ2. It would mean it needs 2X space to store backups. Just trying to ensure my understanding is correct.
>>>>>
>>>>> If you have three nodes, you divide your total footprint by three to get the minimum node capacity.
>>>>>
>>>>> If you have 2 backups, that is one primary copy plus two more backup copies, so you multiply your total footprint by 3.
>>>>>
>>>>> If you multiply, say, 32GB by three for redundancy, that would be 96GB of total space needed for the sum of all nodes' footprints.
>>>>>
>>>>> If you divide the 96GB storage commitment among three nodes, then each node must have a minimum of 32GB. That's what we started with as a nominal data footprint, so 1x, not 2x. Node 1 will need to accommodate backups from node 2 and node 3. Node 2 will need to accommodate backups from node 1 and node 3. Each node has one primary and two backup partition copies for each partition of each cache with two backups.
>>>>>
>>>>>> Hope my queries are clear to you now.
>>>>>
>>>>> I still don't understand your operational goals, so I feel like we may be dancing around a misunderstanding.
>>>>>
>>>>>> On Tue, 1 Nov 2022, 20:19 Surinder Mehra, <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks for your reply. Let me try to answer your 2 questions below.
>>>>>>>
>>>>>>> 1. I understand that it sacrifices the backups in case it can't place them appropriately. The question is, is it possible to fail the deployment rather than risk having only a single copy of the data present in the cluster? If this only copy goes down, we will have downtime, as the data won't be present in the cluster. We should rather throw an error when enough hardware is not present than risk a data-unavailability issue during business activity.
>>>>>>>
>>>>>>> 2. Why do we want 3 copies of data? It's a design choice. We want to ensure that even if 2 nodes go down, we still have a 3rd present to serve the data.
>>>>>>>
>>>>>>> Hope I answered your question.
>>>>>>>
>>>>>>> On Tue, 1 Nov 2022, 19:40 Jeremy McMillan, <[email protected]> wrote:
>>>>>>>
>>>>>>>> This question is a design question.
>>>>>>>>
>>>>>>>> What kinds of fault states do you expect to tolerate? What is your failure budget?
>>>>>>>>
>>>>>>>> Why are you trying to make more than 2 copies of the data distributed across only two failure domains?
>>>>>>>>
>>>>>>>> Also, "fail fast" means discovering your implementation defects faster than your release cycle, not how fast you can cause data loss.
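Referring to the monitoring idea quoted above, here is a minimal sketch of such a check: it counts server nodes per AZ, compares the distinct AZ count against backups + 1, and reports the min/average AZ node-count ratio. The attribute name "AVAILABILITY_ZONE", the cache name, and the plain stderr/stdout logging are assumptions for illustration, not anything prescribed in this thread.

    import java.util.Map;
    import java.util.stream.Collectors;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.configuration.CacheConfiguration;

    public class AzCoverageCheckSketch {
        /** Warns when the distinct AZ count cannot hold one primary plus all configured backups. */
        public static void checkAzCoverage(Ignite ignite, String azAttr, String cacheName) {
            // Count server nodes per availability zone, using the user attribute each node advertises.
            Map<String, Long> nodesPerAz = ignite.cluster().forServers().nodes().stream()
                .collect(Collectors.groupingBy(n -> (String) n.attribute(azAttr), Collectors.counting()));

            // Copies required per partition: one primary plus the configured backups.
            int copies = ignite.cache(cacheName)
                .getConfiguration(CacheConfiguration.class)
                .getBackups() + 1;

            if (nodesPerAz.size() < copies)
                System.err.printf("WARNING: only %d AZ(s) present, but %d copies configured for cache %s%n",
                    nodesPerAz.size(), copies, cacheName);

            // AZ load imbalance: ratio of the smallest AZ node count to the average AZ node count.
            double avg = nodesPerAz.values().stream().mapToLong(Long::longValue).average().orElse(0);
            long min = nodesPerAz.values().stream().mapToLong(Long::longValue).min().orElse(0);
            if (avg > 0)
                System.out.printf("AZ balance ratio (min/avg): %.2f%n", min / avg);

            // Instead of polling, this could also be driven from a discovery event listener
            // (EVT_NODE_JOINED / EVT_NODE_LEFT), provided those event types are enabled via
            // IgniteConfiguration#setIncludeEventTypes.
        }
    }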
>>>>>>>> On Tue, Nov 1, 2022, 09:01 Surinder Mehra <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Gentle reminder.
>>>>>>>>>
>>>>>>>>> One additional question: we have observed that if the number of available AZs is less than the backup count, Ignite skips creating the backups. Is this the correct understanding? If yes, how can we fail fast if backups cannot be placed due to the AZ limitation?
>>>>>>>>>
>>>>>>>>> On Mon, Oct 31, 2022 at 6:30 PM Surinder Mehra <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> As per the link attached, to ensure primary and backup partitions are not stored on the same node, we used the AWS AZ as a backup filter, and now I can see that if I start two Ignite nodes on the same machine, primary partitions are evenly distributed but backups are always zero, which is expected. (A configuration sketch for this setup appears at the end of this thread.)
>>>>>>>>>>
>>>>>>>>>> https://www.gridgain.com/docs/latest/installation-guide/aws/multiple-availability-zone-aws
>>>>>>>>>>
>>>>>>>>>> My question is: what would happen if AZ-1 has 2 machines, AZ-2 has 1 machine, and the Ignite cluster has only 3 nodes, each machine having one Ignite node?
>>>>>>>>>>
>>>>>>>>>> Node1[AZ1] - keys 1-100
>>>>>>>>>> Node2[AZ1] - keys 101-200
>>>>>>>>>> Node3[AZ2] - keys 201-300
>>>>>>>>>>
>>>>>>>>>> In the above scenario, if the backup count is 2, how would the backup partitions be distributed?
>>>>>>>>>>
>>>>>>>>>> 1. Would it mean node 3 will have 2 backup copies of the primary partitions of nodes 1 and 2?
>>>>>>>>>> 2. If we have a 4-node cluster with 2 nodes in each AZ, would backup copies also be placed on different nodes? (In other words, does the backup filter also apply to how backup copies are placed on nodes?)
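For reference against the GridGain link quoted above, here is a minimal sketch of the AZ-based backup filter setup; the attribute name "AVAILABILITY_ZONE", the zone value, and the cache name are illustrative assumptions rather than values from this thread.

    import java.util.Collections;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.CacheMode;
    import org.apache.ignite.cache.affinity.rendezvous.ClusterNodeAttributeAffinityBackupFilter;
    import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
    import org.apache.ignite.configuration.CacheConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class AzBackupFilterSketch {
        public static void main(String[] args) {
            String azAttr = "AVAILABILITY_ZONE";

            IgniteConfiguration cfg = new IgniteConfiguration();
            // Each node sets its own zone, e.g. "us-east-1a" on the AZ1 machines and
            // "us-east-1b" on the AZ2 machine.
            cfg.setUserAttributes(Collections.singletonMap(azAttr, "us-east-1a"));

            // Backup copies of a partition are only placed on nodes whose AZ attribute differs
            // from the nodes already holding copies of that partition.
            RendezvousAffinityFunction aff = new RendezvousAffinityFunction();
            aff.setAffinityBackupFilter(new ClusterNodeAttributeAffinityBackupFilter(azAttr));

            CacheConfiguration<Integer, String> cacheCfg = new CacheConfiguration<>("myCache");
            cacheCfg.setCacheMode(CacheMode.PARTITIONED);
            cacheCfg.setBackups(2); // 3 copies total: 1 primary + 2 backups
            cacheCfg.setAffinity(aff);

            cfg.setCacheConfiguration(cacheCfg);

            try (Ignite ignite = Ignition.start(cfg)) {
                ignite.getOrCreateCache("myCache");
            }
        }
    }

With a filter like this, a backup that cannot be placed in a different AZ simply ends up not being assigned, which matches the "backups are always zero" observation when both test nodes ran in the same zone.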
