[jira] [Commented] (IMPALA-12757) TSAN flags lock-order-inversion during internal-server-test
[ https://issues.apache.org/jira/browse/IMPALA-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17811419#comment-17811419 ]

ASF subversion and git services commented on IMPALA-12757:
----------------------------------------------------------

Commit f3ac2ddbfef0d7cd359b7c9ae47d424791327c6d in impala's branch refs/heads/master from Michael Smith
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=f3ac2ddbf ]

IMPALA-12747: Atomic update of execution state

QueryDriver owns instances of ClientRequestState and TExecRequest. The ClientRequestState is used to track execution state of the client-facing side of a query. TExecRequest encapsulates context about the query produced by the planner.

When a QueryDriver is created, it creates an instance of ClientRequestState, but has not yet executed planning. It would create an empty TExecRequest and pass a pointer to it to ClientRequestState, then update the content of TExecRequest when RunFrontendPlanner is called from ImpalaServer::ExecuteInternal.

Updating TExecRequest was not atomic, so it was possible other operations - like producing a QueryStateRecord for /queries in the web UI - would try to read the content of TExecRequest while updating. This caused TSAN errors and occasional crashes in internal-server-test, which runs concurrent requests and examines them through calls to /queries.

Changes ClientRequestState to
- Provide a static placeholder for TExecRequest during creation that represents an empty context for an UNKNOWN statement type (default initialized in Thrift).
- Make all references to TExecRequest const so its content cannot be updated in a non-thread-safe manner.
- ClientRequestState uses an AtomicPtr which is updated atomically when the filled TExecRequest is available. QueryDriver does not publicly expose access to TExecRequest, so we can ensure its use is thread-safe without atomics.
ClientRequestState::exec_request() will return either a reference to the static placeholder or the value provided after - which is never changed - so this reference will always be valid for the lifetime of the ClientRequestState.

Updates user_has_profile_access to be AtomicBool for the same reason.

Reverts tsan-suppressions for IMPALA-12660 so we get TSAN coverage. Adds suppression for a lock-order-inversion bug (IMPALA-12757) that was uncovered after fixing this data race.

Testing:
- InternalServerTest.SimultaneousMultipleQueriesOneSession would fail after ~10 test runs. Ran 90 times without failure.
- Passed TSAN run of backend tests.

Change-Id: I9a967c5c84b6a401f8f5764373f6cd7ee807545f
Reviewed-on: http://gerrit.cloudera.org:8080/20956
Reviewed-by: Jason Fehr
Reviewed-by: Riza Suminto
Tested-by: Impala Public Jenkins

> TSAN flags lock-order-inversion during internal-server-test
> -----------------------------------------------------------
>
>                 Key: IMPALA-12757
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12757
>             Project: IMPALA
>          Issue Type: Bug
>    Affects Versions: Impala 4.4.0
>            Reporter: Michael Smith
>            Priority: Major
>
> internal-server-test has a tight loop starting queries and fetching /queries.
> That's led to identifying several latent threading issues. Once IMPALA-12747
> is addressed, TSAN identifies a new error:
> {code}
> $ run-jvm-binary.sh ./be/build/debug/service/internal-server-test
> I20240125 10:44:44.971467 152733 openssl_util.cc:110] FIPS mode is disabled.
> Picked up JAVA_TOOL_OPTIONS: -Dsun.java.command=internal-server-test
> 24/01/25 10:44:45 WARN fs.FileSystem: Cannot load filesystem:
> java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem:
> Provider org.apache.hadoop.hive.ql.io.NullScanFileSystem not found
> 24/01/25 10:44:45 WARN fs.FileSystem: Cannot load filesystem:
> java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem:
> Provider org.apache.hadoop.hive.ql.io.ProxyLocalFileSystem not found
> 24/01/25 10:44:45 INFO util.JvmPauseMonitor: Starting JVM pause monitor
> [==========] Running 10 tests from 1 test case.
> [----------] Global test environment set-up.
> [----------] 10 tests from InternalServerTest
> [ RUN      ] InternalServerTest.QueryTimeout
> [       OK ] InternalServerTest.QueryTimeout (10236 ms)
> [ RUN      ] InternalServerTest.InvalidQueryOption
> [       OK ] InternalServerTest.InvalidQueryOption (76 ms)
> [ RUN      ] InternalServerTest.MultipleQueriesMultipleSessions
> /home/michael/Impala/be/src/service/internal-server-test.cc:289: Failure
> Value of: status_.ok()
> Actual: false
> Expected: true
> Error: Failed due to unreachable impalad(s): michaelsmith-22742:27000
> [  FAILED  ] InternalServerTest.MultipleQueriesMultipleSessions (17225 ms)
> [ RUN      ] InternalServerTest.RetryFailedQuery
> [       OK ] InternalServerTest.RetryFailedQuery (1206 ms)
> [ RUN      ] InternalServerTest.MultipleQueriesOneSession
> ^C
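
The publish-once pattern described in the IMPALA-12747 commit message above can be sketched roughly as follows. This is a minimal illustration, not the code from the Impala change: ExecRequest, RequestState, Driver, set_exec_request, RunPlanner and RunDemo are hypothetical stand-ins for TExecRequest, ClientRequestState and QueryDriver, and std::atomic is used here in place of Impala's own AtomicPtr wrapper.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for the Thrift-generated TExecRequest.
struct ExecRequest {
  std::string stmt_type = "UNKNOWN";  // default-initialized "empty" context
};

// Hypothetical stand-in for ClientRequestState.
class RequestState {
 public:
  // Safe to call from any thread: returns either the static placeholder or
  // the published request. Both are const and never modified afterwards.
  const ExecRequest& exec_request() const {
    return *exec_request_.load(std::memory_order_acquire);
  }

  // Called once by the owner after planning. The pointee must outlive this
  // object and must not be modified after it is published.
  void set_exec_request(const ExecRequest* req) {
    exec_request_.store(req, std::memory_order_release);
  }

 private:
  static const ExecRequest& Placeholder() {
    static const ExecRequest kUnknown;  // shared by all instances
    return kUnknown;
  }

  std::atomic<const ExecRequest*> exec_request_{&Placeholder()};
};

// Hypothetical stand-in for QueryDriver: it owns the filled request and
// never exposes a mutable reference to it.
class Driver {
 public:
  RequestState& state() { return state_; }

  void RunPlanner(std::string stmt_type) {
    auto req = std::make_unique<ExecRequest>();
    req->stmt_type = std::move(stmt_type);    // fill while still private
    planned_ = std::move(req);                // freeze as const
    state_.set_exec_request(planned_.get());  // publish atomically
  }

 private:
  // Declared before state_ so that it is destroyed after state_.
  std::unique_ptr<const ExecRequest> planned_;
  RequestState state_;
};

// Small driver function for the sketch: reads before and after publishing.
std::string RunDemo() {
  Driver driver;
  assert(driver.state().exec_request().stmt_type == "UNKNOWN");
  driver.RunPlanner("QUERY");
  return driver.state().exec_request().stmt_type;
}
```

In this sketch a reader never sees a half-filled request, because the request is only written while it is still private to RunPlanner, and the single pointer store is the only thing readers can observe. The sketch assumes RunPlanner is called at most once per Driver. A second call would free a request that readers may still reference.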
[jira] [Commented] (IMPALA-12757) TSAN flags lock-order-inversion during internal-server-test
[ https://issues.apache.org/jira/browse/IMPALA-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17811064#comment-17811064 ]

Michael Smith commented on IMPALA-12757:

https://github.com/google/sanitizers/issues/814 might also be related. I haven't worked through the circumstances enough to tell yet.

> TSAN flags lock-order-inversion during internal-server-test
> -----------------------------------------------------------
>
>                 Key: IMPALA-12757
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12757
>             Project: IMPALA
>          Issue Type: Bug
>    Affects Versions: Impala 4.4.0
>            Reporter: Michael Smith
>            Priority: Major
>
> internal-server-test has a tight loop starting queries and fetching /queries.
> That's led to identifying several latent threading issues. Once IMPALA-12747
> is addressed, TSAN identifies a new error:
> {code}
> $ run-jvm-binary.sh ./be/build/debug/service/internal-server-test
> I20240125 10:44:44.971467 152733 openssl_util.cc:110] FIPS mode is disabled.
> Picked up JAVA_TOOL_OPTIONS: -Dsun.java.command=internal-server-test
> 24/01/25 10:44:45 WARN fs.FileSystem: Cannot load filesystem:
> java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem:
> Provider org.apache.hadoop.hive.ql.io.NullScanFileSystem not found
> 24/01/25 10:44:45 WARN fs.FileSystem: Cannot load filesystem:
> java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem:
> Provider org.apache.hadoop.hive.ql.io.ProxyLocalFileSystem not found
> 24/01/25 10:44:45 INFO util.JvmPauseMonitor: Starting JVM pause monitor
> [==========] Running 10 tests from 1 test case.
> [----------] Global test environment set-up.
> [----------] 10 tests from InternalServerTest
> [ RUN      ] InternalServerTest.QueryTimeout
> [       OK ] InternalServerTest.QueryTimeout (10236 ms)
> [ RUN      ] InternalServerTest.InvalidQueryOption
> [       OK ] InternalServerTest.InvalidQueryOption (76 ms)
> [ RUN      ] InternalServerTest.MultipleQueriesMultipleSessions
> /home/michael/Impala/be/src/service/internal-server-test.cc:289: Failure
> Value of: status_.ok()
> Actual: false
> Expected: true
> Error: Failed due to unreachable impalad(s): michaelsmith-22742:27000
> [  FAILED  ] InternalServerTest.MultipleQueriesMultipleSessions (17225 ms)
> [ RUN      ] InternalServerTest.RetryFailedQuery
> [       OK ] InternalServerTest.RetryFailedQuery (1206 ms)
> [ RUN      ] InternalServerTest.MultipleQueriesOneSession
> ^C==================
> WARNING: ThreadSanitizer: thread leak (pid=152733)
> Thread T1086 (tid=154103, finished) created by main thread at:
> #0 pthread_create (internal-server-test+0x203c383)
> #1 boost::thread::start_thread_noexcept()
> (internal-server-test+0x3bcd3fd)
> #2 boost::thread::thread std::char_traits, std::allocator > const&, > std::__cxx11::basic_string, std::allocator > > const&, boost::function, impala::ThreadDebugInfo const*, > impala::Promise*), > std::__cxx11::basic_string, std::allocator > >, std::__cxx11::basic_string, > std::allocator >, boost::function, impala::ThreadDebugInfo*, > impala::Promise*>(void > (*)(std::__cxx11::basic_string, > std::allocator > const&, std::__cxx11::basic_string std::char_traits, std::allocator > const&, boost::function ()>, impala::ThreadDebugInfo const*, impala::Promise (impala::PromiseMode)0>*), std::__cxx11::basic_string std::char_traits, std::allocator >, > std::__cxx11::basic_string, std::allocator > >, boost::function, impala::ThreadDebugInfo*, impala::Promise (impala::PromiseMode)0>*)
> /home/michael/Impala/toolchain/toolchain-packages-gcc10.4.0/boost-1.74.0-p1/include/boost/thread/detail/thread.hpp:424:13
> (internal-server-test+0x2d84514)
> #3 impala::Thread::StartThread(std::__cxx11::basic_string
> std::char_traits, std::allocator > const&, > std::__cxx11::basic_string, std::allocator > > const&, boost::function const&, std::unique_ptr std::default_delete >*, bool)
> /home/michael/Impala/be/src/util/thread.cc:317:13
> (internal-server-test+0x2d8091c)
> #4 impala::Status impala::Thread::Create (impala::ClientRequestState::*)(), > impala::ClientRequestState*>(std::__cxx11::basic_string std::char_traits, std::allocator > const&, > std::__cxx11::basic_string, std::allocator > > const&, void (impala::ClientRequestState::* const&)(), > impala::ClientRequestState* const&, std::unique_ptr std::default_delete >*, bool)
> /home/michael/Impala/be/src/util/thread.h:81:12
> (internal-server-test+0x2b382f7)
> #5 impala::ClientRequestState::WaitAsync()
> /home/michael/Impala/be/src/service/client-request-state.cc:1126:10
> (internal-server-test+0x2b2f9f4)
> #6 impala::ImpalaServer::WaitForResults(impala::TUniqueId&)
> /home/michael/Impala/be/src/service/internal-server.cc:156:3
> (internal-server-test+0x2ad148e)
> #7 non-virtual thunk to
> impa