2018-11-06 13:34:48 UTC - Aaron Claypoole: @Aaron Claypoole has joined the channel ----
2018-11-06 21:49:08 UTC - Aniket: In the case of active-standby replication, when the active DC goes down and producers/consumers are moved to the fallback DC, is there any information on how offsets are determined? Is there a possibility of message loss if the offset is moved ahead during this time? There can be a few messages in the source DC that have not yet been replicated to the fallback DC while the offset moves ahead. ----
2018-11-06 21:49:10 UTC - Aniket: <https://streaml.io/blog/geo-replication-patterns-practices> ----
2018-11-06 22:02:52 UTC - Matteo Merli: @Aniket the subscriptions (which maintain offsets) are local to one particular region, so there’s no information being passed on at this point. ----
2018-11-06 22:30:19 UTC - Aniket: Ok, then in that case, when the DC goes down and the producer and consumer are moved to the fallback DC, does the consumer start from the first message? ----
2018-11-06 22:30:35 UTC - Aniket: That can lead to a lot of extra work. ----
2018-11-06 22:30:47 UTC - Matteo Merli: Depends how you configure the consumers. ----
2018-11-06 22:30:58 UTC - Aniket: Ok ----
2018-11-06 22:31:15 UTC - Matteo Merli: One option is to manually reset (by time) the subscription in the fallback DC. ----
2018-11-06 22:32:08 UTC - Aniket: Yes, I was thinking of that option, but there is a limitation with that as well (unless we keep some extra state). ----
2018-11-06 22:32:22 UTC - Aniket: DC1 goes down, so we move the consumer to DC2 ----
2018-11-06 22:32:30 UTC - Aniket: and also the producer. ----
2018-11-06 22:32:57 UTC - Aniket: The consumer will consume the existing replicated events and also start consuming new events from the producer in DC2. ----
2018-11-06 22:33:31 UTC - Matteo Merli: In case of a DC failover, it’s typically better to initiate it manually ----
2018-11-06 22:34:00 UTC - Aniket: When the source DC comes back online, we will have to: 1. move the consumer and producer back to the source DC, 2. replicate produced events from the fallback DC to the source DC, 3. process the events from prior to the failover first. ----
2018-11-06 22:34:09 UTC - Matteo Merli: so, before doing that, you can reset subscriptions to ~10 mins earlier and then fail over ----
2018-11-06 22:34:18 UTC - Aniket: > In case of a DC failover, it’s typically better to initiate it manually ----
2018-11-06 22:34:19 UTC - Aniket: ok ----
2018-11-06 22:34:45 UTC - Matteo Merli: The other option is to always have consumers in both DCs, and just move the producers. ----
2018-11-06 22:35:29 UTC - Aniket: Yeah, but the requirement might be to keep the redundant work as low as possible. ----
2018-11-06 22:35:53 UTC - Matteo Merli: sure ----
2018-11-06 22:36:19 UTC - Aniket: I am also worried about message loss when disaster recovery happens ----
2018-11-06 22:36:35 UTC - Aniket: if I try to avoid redundant work at the time of failover. ----
2018-11-06 22:37:46 UTC - Matteo Merli: With async replication, the messages in flight between DC1 and DC2 might either arrive out of order (when DC1 comes back) or be lost, if it doesn’t come back. ----
2018-11-06 22:37:59 UTC - Aniket: ok ----
2018-11-06 22:38:15 UTC - Matteo Merli: If you want to avoid that, then you should consider using sync replication. ----
2018-11-06 22:38:27 UTC - Aniket: But it will be committed to DC1 first and then replicated to DC2. So, technically, my understanding is it will still be there in DC1. ----
2018-11-06 22:38:38 UTC - Aniket: Yes, I am talking about sync replication. ----
2018-11-06 22:39:17 UTC - Matteo Merli: With sync replication, each DC is seen as a logical “rack” of machines. ----
2018-11-06 22:39:33 UTC - Aniket: right ----
2018-11-06 22:39:48 UTC - Matteo Merli: When you configure 3 replicas, the 3 replicas will be placed on nodes in different DCs. ----
2018-11-06 22:39:58 UTC - Matteo Merli: There’s no failover from the client’s perspective. ----
2018-11-06 22:40:26 UTC - Matteo Merli: The service will be up, even if one region is unavailable. ----
2018-11-06 22:41:33 UTC - Aniket: Yes, I understand ----
2018-11-06 22:41:46 UTC - Aniket: my concern is specifically with offsets ----
2018-11-06 22:42:13 UTC - Aniket: because 1. it can cause redundant processing of messages, 2. it can cause loss of messages ----
2018-11-06 22:42:39 UTC - Matteo Merli: With sync replication, from Pulsar’s perspective it is a single “cluster”, spanning multiple DCs. ----
2018-11-06 22:42:56 UTC - Matteo Merli: The subscription (and offset) is then consistent in this case. ----
2018-11-06 22:43:20 UTC - Matteo Merli: The consumer will be automatically redirected to an available broker, ----
2018-11-06 22:43:35 UTC - Matteo Merli: restarting from the next unacked message. ----
2018-11-06 22:44:33 UTC - Aniket: Suppose messages 1 to 2M are created in the main DC, and 1.5M of them are replicated to the failover DC. The consumer consumes 500K messages in the main DC. Now there is an outage. At this moment the offset for the topic in the main DC is 500K, so when the consumer + producer fail over to the failover DC: if the offset is not present, the consumer will start from message 1; if the offset is replicated as well, it will start at message 500,001. But if the producer is producing events in the failover DC for the same topic, it can effectively overwrite the remaining 500K messages that were not replicated from the main DC, and the offset can move forward. Maybe I am confused or have some misunderstanding. ----
2018-11-06 22:45:37 UTC - Matteo Merli: If it’s “sync” replication, there’s no failover so to speak, just that some machines in the cluster are not available. ----
2018-11-06 22:45:46 UTC - Aniket: i see, ok ----
2018-11-06 22:45:51 UTC - Aniket: I understand ----
2018-11-06 22:46:14 UTC - Aniket: Can you point me to some documentation that explains this more, so that I don’t disturb you with more questions? ----
2018-11-06 22:46:56 UTC - Aniket: <https://streaml.io/blog/apache-pulsar-geo-replication> ----
2018-11-06 22:47:03 UTC - Aniket: I came across this ----
2018-11-06 22:58:45 UTC - Aniket: Thanks for your feedback and answers to my questions :slightly_smiling_face: ----
2018-11-06 22:58:58 UTC - Aniket: appreciate your help ----
2018-11-06 23:27:44 UTC - Matteo Merli: I’m afraid we don’t have a ready-made tutorial (though we should). The idea is to deploy the cluster spanning 3 DCs and then use the rack-aware placement policy for bookies, to make sure data is stored in all the DCs. Take a look at the `pulsar-admin bookies` command to set that up. ----
2018-11-06 23:41:03 UTC - Aniket: Ok, I will check it out, thank you ----
2018-11-07 08:41:40 UTC - David Tinker: I am busy testing Pulsar 2.2.0 using 2 consumers on the same subscription configured for failover. It seems that sometimes the "inactive" consumer doesn't get activated when the "active" consumer shuts down. Mostly it works, but sometimes it doesn't. I can see from /admin/v2/.../topic/stats that the live consumer is connected, but I don't get any 'becameActive' notification or messages, and the backlog builds up. I am using the Java client. Any ideas? I suspect I can work around this by periodically restarting my "inactive" consumers. ----
2018-11-07 08:44:22 UTC - David Tinker:
```
{
  "msgRateIn" : 0.26666650042232587,
  ...
  "publishers" : [ { ... } ],
  "subscriptions" : {
    "gammon" : {
      "msgRateOut" : 0.0,
      "msgThroughputOut" : 0.0,
      "msgRateRedeliver" : 0.0,
      "msgBacklog" : 419,
      "blockedSubscriptionOnUnackedMsgs" : false,
      "unackedMessages" : 0,
      "type" : "Failover",
      "msgRateExpired" : 0.0,
      "consumers" : [ {
        "msgRateOut" : 0.0,
        "msgThroughputOut" : 0.0,
        "msgRateRedeliver" : 0.0,
        "consumerName" : "gammon-e1d",
        "availablePermits" : 1000,
        "unackedMessages" : 0,
        "blockedConsumerOnUnackedMsgs" : false,
        "metadata" : { },
        "address" : "...",
        "clientVersion" : "2.2.0",
        "connectedSince" : "2018-11-07T08:18:28.751Z"
      } ]
    }
  }
}
```
----
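[Editor's note] The failover handoff David describes can be reproduced from the command line with the Pulsar CLI tools, without writing Java code. This is a rough sketch; the topic and subscription names (`my-topic`, `my-sub`) are placeholders, not names from the conversation:

```shell
# Start two consumers on the same subscription in Failover mode,
# e.g. in two separate terminals. The broker elects one as active;
# the other stays idle until the active one disconnects.
pulsar-client consume persistent://public/default/my-topic \
    -s my-sub -t Failover -n 100    # terminal 1
pulsar-client consume persistent://public/default/my-topic \
    -s my-sub -t Failover -n 100    # terminal 2

# Kill the consumer in terminal 1, then check which consumer the
# broker now considers active for the subscription:
pulsar-admin topics stats persistent://public/default/my-topic
# Inspect "subscriptions" -> "my-sub" -> "consumers" in the output;
# after the handoff, msgRateOut should shift to the surviving consumer.
```

In the Java client, the `ConsumerEventListener` interface (set via `ConsumerBuilder.consumerEventListener`) is what should deliver the `becameActive`/`becameInactive` callbacks mentioned above, so logging inside those callbacks is a useful way to confirm whether the broker ever notified the standby consumer.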

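[Editor's note] The two operational steps discussed in the thread, rewinding a subscription by time before failing consumers over, and tagging bookies with rack (DC) info so the rack-aware placement policy spreads replicas across DCs for sync replication, can be sketched with `pulsar-admin` roughly as follows. All hostnames, topic, subscription, group, and rack names are placeholders:

```shell
# 1. Before failing consumers over to the fallback DC, rewind the
#    subscription ~10 minutes so in-flight messages are re-read:
pulsar-admin topics reset-cursor persistent://public/default/my-topic \
    --subscription my-sub --time 10m

# 2. For sync replication, record which DC ("rack") each bookie lives
#    in, so the rack-aware placement policy puts the 3 replicas of
#    each entry in different DCs:
pulsar-admin bookies set-bookie-rack \
    --bookie bookie1.dc1.example.com:3181 \
    --hostname bookie1.dc1.example.com:3181 \
    --group default --rack dc1
pulsar-admin bookies set-bookie-rack \
    --bookie bookie1.dc2.example.com:3181 \
    --hostname bookie1.dc2.example.com:3181 \
    --group default --rack dc2

# Verify the resulting bookie-to-rack mapping:
pulsar-admin bookies racks-placement
```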