[ https://issues.apache.org/jira/browse/IMPALA-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zoram Thanga resolved IMPALA-2642. ---------------------------------- Resolution: Fixed Fix Version/s: Impala 2.12.0 {noformat} commit f97634fcf9a2e48d307f19595e70bdcee9c1980e Author: Zoram Thanga <zo...@cloudera.com> Date: Tue Jan 16 12:01:09 2018 -0800 IMPALA-2642: Fix a potential deadlock in statestore The statestored can deadlock if the number of subscribers has reached STATESTORE_MAX_SUBSCRIBERS, because the DoSubscriberUpdate() method calls OfferUpdate(), while holding subscribers_lock_, which also tries to take the same lock in this situation. Fix the issue by moving out the call to acquire subscribers_lock_ from OfferUpdate(), and depend on the callers to take it. We also make the maximum number of statestore subscribers a start-up time tuneable, to allow us to test the limit more easily. Testing: The problem is easily reproduced by lowering the value of STATESTORE_MAX_SUBSCRIBERS to 3, and then launching a mini cluster with 3 impalads. Without the fix, the statestored becomes completely deadlocked. A new EE test has been added to exercise this scenario. The test verifies that statestored correctly rejects new subscription requests when the limit it reached. Change-Id: I5d49dede221ce1f50ec299643b5532c61f93f0c6 Reviewed-on: http://gerrit.cloudera.org:8080/9038 Reviewed-by: Sailesh Mukil <sail...@cloudera.com> Tested-by: Impala Public Jenkins {noformat} > Potential deadlock in statestore error path > ------------------------------------------- > > Key: IMPALA-2642 > URL: https://issues.apache.org/jira/browse/IMPALA-2642 > Project: IMPALA > Issue Type: Bug > Components: Distributed Exec > Affects Versions: Impala 2.3.0 > Reporter: Henry Robinson > Assignee: Zoram Thanga > Priority: Minor > Labels: hang, statestore > Fix For: Impala 2.12.0 > > > I just noticed this while reading the statestore code: {{OfferUpdate()}} > takes {{subscribers_lock_}} if the update queue is full, but in one place > that lock has already been taken by the caller: > {code} > { > lock_guard<mutex> l(subscribers_lock_); > // ... snip ... > if (state == FailureDetector::FAILED) { > if (is_heartbeat) { > // TODO: Consider if a metric to track the number of failures would > be useful. > LOG(INFO) << "Subscriber '" << subscriber->id() << "' has failed, > disconnected " > << "or re-registered (last known registration ID: " << > update.second > << ")"; > UnregisterSubscriber(subscriber.get()); > } > } else { > // Schedule the next message. > VLOG(3) << "Next " << (is_heartbeat ? "heartbeat" : "update") << " > deadline for: " > << subscriber->id() << " is in " << deadline_ms << "ms"; > // vvvvvvvvvvvvvvvvvv oops vvvvvvvvvvvvvvvvvv > OfferUpdate(make_pair(deadline_ms, subscriber->id()), is_heartbeat ? > &subscriber_heartbeat_threadpool_ : > &subscriber_topic_update_threadpool_); > } > } > {code} > It's not as scary as it sounds, because if the update queue has > 10000 > entries there's something wrong anyway, but we should fix this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)