[jira] [Commented] (IMPALA-12757) TSAN flags lock-order-inversion during internal-server-test

2024-01-26 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17811419#comment-17811419
 ] 

ASF subversion and git services commented on IMPALA-12757:
--

Commit f3ac2ddbfef0d7cd359b7c9ae47d424791327c6d in impala's branch 
refs/heads/master from Michael Smith
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=f3ac2ddbf ]

IMPALA-12747: Atomic update of execution state

QueryDriver owns instances of ClientRequestState and TExecRequest. The
ClientRequestState is used to track execution state of the client-facing
side of a query. TExecRequest encapsulates context about the query
produced by the planner.

When a QueryDriver is created, it creates an instance of
ClientRequestState before planning has run. Previously it also created
an empty TExecRequest, passed a pointer to it to ClientRequestState, and
then updated the content of that TExecRequest when RunFrontendPlanner
was called from ImpalaServer::ExecuteInternal.

Updating TExecRequest was not atomic, so other operations, such as
producing a QueryStateRecord for /queries in the web UI, could read the
content of TExecRequest while it was being updated. This caused TSAN
errors and occasional crashes in internal-server-test, which runs
concurrent requests and examines them through calls to /queries.

Changes ClientRequestState to
- Provide a static placeholder for TExecRequest during creation that
  represents an empty context for an UNKNOWN statement type (default
  initialized in Thrift).
- Make all references to TExecRequest const so its content cannot be
  updated in a non-thread-safe manner.
- Use an AtomicPtr which is updated atomically when the filled
  TExecRequest is available.

QueryDriver does not publicly expose access to TExecRequest, so we can
ensure its use is thread-safe without atomics.

ClientRequestState::exec_request() returns a reference to either the
static placeholder or the value provided later, which is never changed
once set, so this reference is always valid for the lifetime of the
ClientRequestState.
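
The placeholder-then-publish pattern described above can be sketched
roughly as follows. This is a minimal illustration, not the code from the
commit: ExecRequest, RequestState, SetExecRequest, and PublishDemo are
hypothetical stand-ins, and std::atomic is used here in place of Impala's
AtomicPtr.

```cpp
#include <atomic>
#include <cassert>
#include <string>

// Hypothetical stand-in for the Thrift-generated TExecRequest.
struct ExecRequest {
  std::string stmt_type = "UNKNOWN";  // default state, like an empty context
  std::string query_text;
};

// Hypothetical stand-in for ClientRequestState.
class RequestState {
 public:
  // Start out pointing at the shared, never-modified placeholder.
  RequestState() : exec_request_(&kPlaceholder) {}

  // Readers only ever get a const reference: either the placeholder or the
  // published request. Neither object is modified after it becomes visible.
  const ExecRequest& exec_request() const {
    return *exec_request_.load(std::memory_order_acquire);
  }

  // Publish a fully filled request with a single atomic pointer store.
  // Assumption: the caller owns `filled`, stops modifying it before this
  // call, and keeps it alive at least as long as this RequestState.
  void SetExecRequest(const ExecRequest* filled) {
    exec_request_.store(filled, std::memory_order_release);
  }

 private:
  static const ExecRequest kPlaceholder;
  std::atomic<const ExecRequest*> exec_request_;
};

const ExecRequest RequestState::kPlaceholder{};

// Sequential walk-through of the lifecycle; in a server, exec_request()
// could be called from other threads at any point in this sequence.
bool PublishDemo() {
  ExecRequest planned;  // owned by the caller (the QueryDriver role)
  RequestState state;

  const ExecRequest& before = state.exec_request();
  bool placeholder_seen = before.stmt_type == "UNKNOWN";

  // Fill the request privately, then publish it in one step.
  planned.stmt_type = "QUERY";
  planned.query_text = "select 1";
  state.SetExecRequest(&planned);

  const ExecRequest& after = state.exec_request();
  // The earlier reference still refers to the untouched placeholder.
  return placeholder_seen && before.stmt_type == "UNKNOWN" &&
         after.stmt_type == "QUERY";
}
```

A reader never observes a half-filled request in this sketch because the
only shared write is the pointer store; the request itself is filled
before it is published and is treated as read-only afterwards.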

Updates user_has_profile_access to be AtomicBool for the same reason.

Reverts tsan-suppressions for IMPALA-12660 so we get TSAN coverage. Adds
suppression for a lock-order-inversion bug (IMPALA-12757) that was
uncovered after fixing this data race.
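
For readers unfamiliar with the term, a lock-order inversion means two
code paths acquire the same pair of locks in opposite orders. The sketch
below is a generic illustration only and does not reflect the locks
involved in IMPALA-12757; Counter, Move, and RunTransfers are hypothetical
names. It shows one common remedy: acquiring both mutexes through
std::lock, so the order in which callers pass them does not matter.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

struct Counter {
  std::mutex mu;
  int value = 0;
};

// Moves `amount` between two distinct counters while holding both mutexes.
// Locking from.mu and then to.mu by hand would invert the order whenever
// two threads call Move(a, b) and Move(b, a). std::lock instead acquires
// the pair with a deadlock-avoidance algorithm.
void Move(Counter& from, Counter& to, int amount) {
  std::lock(from.mu, to.mu);
  std::lock_guard<std::mutex> hold_from(from.mu, std::adopt_lock);
  std::lock_guard<std::mutex> hold_to(to.mu, std::adopt_lock);
  from.value -= amount;
  to.value += amount;
}

// Runs transfers in both directions concurrently and returns the total,
// which should be unchanged by the transfers.
int RunTransfers() {
  Counter a, b;
  a.value = 1000;
  std::thread t1([&] { for (int i = 0; i < 1000; ++i) Move(a, b, 1); });
  std::thread t2([&] { for (int i = 0; i < 1000; ++i) Move(b, a, 1); });
  t1.join();
  t2.join();
  return a.value + b.value;
}
```

Another common approach is to define a fixed global order for the locks
and always acquire them in that order.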

Testing:
- InternalServerTest.SimultaneousMultipleQueriesOneSession would fail
  after ~10 test runs. Ran 90 times without failure.
- Passed TSAN run of backend tests.

Change-Id: I9a967c5c84b6a401f8f5764373f6cd7ee807545f
Reviewed-on: http://gerrit.cloudera.org:8080/20956
Reviewed-by: Jason Fehr 
Reviewed-by: Riza Suminto 
Tested-by: Impala Public Jenkins 


> TSAN flags lock-order-inversion during internal-server-test
> ---
>
> Key: IMPALA-12757
> URL: https://issues.apache.org/jira/browse/IMPALA-12757
> Project: IMPALA
>  Issue Type: Bug
>Affects Versions: Impala 4.4.0
>Reporter: Michael Smith
>Priority: Major
>
> internal-server-test has a tight loop starting queries and fetching /queries. 
> That's led to identifying several latent threading issues. Once IMPALA-12747 
> is addressed, TSAN identifies a new error:
> {code}
> $ run-jvm-binary.sh ./be/build/debug/service/internal-server-test
> I20240125 10:44:44.971467 152733 openssl_util.cc:110] FIPS mode is disabled.
> Picked up JAVA_TOOL_OPTIONS: -Dsun.java.command=internal-server-test
> 24/01/25 10:44:45 WARN fs.FileSystem: Cannot load filesystem: 
> java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: 
> Provider org.apache.hadoop.hive.ql.io.NullScanFileSystem not found
> 24/01/25 10:44:45 WARN fs.FileSystem: Cannot load filesystem: 
> java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: 
> Provider org.apache.hadoop.hive.ql.io.ProxyLocalFileSystem not found
> 24/01/25 10:44:45 INFO util.JvmPauseMonitor: Starting JVM pause monitor
> [==] Running 10 tests from 1 test case.
> [--] Global test environment set-up.
> [--] 10 tests from InternalServerTest
> [ RUN  ] InternalServerTest.QueryTimeout
> [   OK ] InternalServerTest.QueryTimeout (10236 ms)
> [ RUN  ] InternalServerTest.InvalidQueryOption
> [   OK ] InternalServerTest.InvalidQueryOption (76 ms)
> [ RUN  ] InternalServerTest.MultipleQueriesMultipleSessions
> /home/michael/Impala/be/src/service/internal-server-test.cc:289: Failure
> Value of: status_.ok()
>   Actual: false
> Expected: true
> Error: Failed due to unreachable impalad(s): michaelsmith-22742:27000
> [  FAILED  ] InternalServerTest.MultipleQueriesMultipleSessions (17225 ms)
> [ RUN  ] InternalServerTest.RetryFailedQuery
> [   OK ] InternalServerTest.RetryFailedQuery (1206 ms)
> [ RUN  ] InternalServerTest.MultipleQueriesOneSession
> ^C

[jira] [Commented] (IMPALA-12757) TSAN flags lock-order-inversion during internal-server-test

2024-01-25 Thread Michael Smith (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17811064#comment-17811064
 ] 

Michael Smith commented on IMPALA-12757:


https://github.com/google/sanitizers/issues/814 might also be related. I 
haven't worked through the circumstances enough to tell yet.

> TSAN flags lock-order-inversion during internal-server-test
> ---
>
> Key: IMPALA-12757
> URL: https://issues.apache.org/jira/browse/IMPALA-12757
> Project: IMPALA
>  Issue Type: Bug
>Affects Versions: Impala 4.4.0
>Reporter: Michael Smith
>Priority: Major
>
> internal-server-test has a tight loop starting queries and fetching /queries. 
> That's led to identifying several latent threading issues. Once IMPALA-12747 
> is addressed, TSAN identifies a new error:
> {code}
> $ run-jvm-binary.sh ./be/build/debug/service/internal-server-test
> I20240125 10:44:44.971467 152733 openssl_util.cc:110] FIPS mode is disabled.
> Picked up JAVA_TOOL_OPTIONS: -Dsun.java.command=internal-server-test
> 24/01/25 10:44:45 WARN fs.FileSystem: Cannot load filesystem: 
> java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: 
> Provider org.apache.hadoop.hive.ql.io.NullScanFileSystem not found
> 24/01/25 10:44:45 WARN fs.FileSystem: Cannot load filesystem: 
> java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: 
> Provider org.apache.hadoop.hive.ql.io.ProxyLocalFileSystem not found
> 24/01/25 10:44:45 INFO util.JvmPauseMonitor: Starting JVM pause monitor
> [==] Running 10 tests from 1 test case.
> [--] Global test environment set-up.
> [--] 10 tests from InternalServerTest
> [ RUN  ] InternalServerTest.QueryTimeout
> [   OK ] InternalServerTest.QueryTimeout (10236 ms)
> [ RUN  ] InternalServerTest.InvalidQueryOption
> [   OK ] InternalServerTest.InvalidQueryOption (76 ms)
> [ RUN  ] InternalServerTest.MultipleQueriesMultipleSessions
> /home/michael/Impala/be/src/service/internal-server-test.cc:289: Failure
> Value of: status_.ok()
>   Actual: false
> Expected: true
> Error: Failed due to unreachable impalad(s): michaelsmith-22742:27000
> [  FAILED  ] InternalServerTest.MultipleQueriesMultipleSessions (17225 ms)
> [ RUN  ] InternalServerTest.RetryFailedQuery
> [   OK ] InternalServerTest.RetryFailedQuery (1206 ms)
> [ RUN  ] InternalServerTest.MultipleQueriesOneSession
> ^C==
> WARNING: ThreadSanitizer: thread leak (pid=152733)
>   Thread T1086 (tid=154103, finished) created by main thread at:
> #0 pthread_create  (internal-server-test+0x203c383)
> #1 boost::thread::start_thread_noexcept()  
> (internal-server-test+0x3bcd3fd)
> #2 boost::thread::thread std::char_traits, std::allocator > const&, 
> std::__cxx11::basic_string, std::allocator 
> > const&, boost::function, impala::ThreadDebugInfo const*, 
> impala::Promise*), 
> std::__cxx11::basic_string, std::allocator 
> >, std::__cxx11::basic_string, 
> std::allocator >, boost::function, impala::ThreadDebugInfo*, 
> impala::Promise*>(void 
> (*)(std::__cxx11::basic_string, 
> std::allocator > const&, std::__cxx11::basic_string std::char_traits, std::allocator > const&, boost::function ()>, impala::ThreadDebugInfo const*, impala::Promise (impala::PromiseMode)0>*), std::__cxx11::basic_string std::char_traits, std::allocator >, 
> std::__cxx11::basic_string, std::allocator 
> >, boost::function, impala::ThreadDebugInfo*, impala::Promise (impala::PromiseMode)0>*) 
> /home/michael/Impala/toolchain/toolchain-packages-gcc10.4.0/boost-1.74.0-p1/include/boost/thread/detail/thread.hpp:424:13
>  (internal-server-test+0x2d84514)
> #3 impala::Thread::StartThread(std::__cxx11::basic_string std::char_traits, std::allocator > const&, 
> std::__cxx11::basic_string, std::allocator 
> > const&, boost::function const&, std::unique_ptr std::default_delete >*, bool) 
> /home/michael/Impala/be/src/util/thread.cc:317:13 
> (internal-server-test+0x2d8091c)
> #4 impala::Status impala::Thread::Create (impala::ClientRequestState::*)(), 
> impala::ClientRequestState*>(std::__cxx11::basic_string std::char_traits, std::allocator > const&, 
> std::__cxx11::basic_string, std::allocator 
> > const&, void (impala::ClientRequestState::* const&)(), 
> impala::ClientRequestState* const&, std::unique_ptr std::default_delete >*, bool) 
> /home/michael/Impala/be/src/util/thread.h:81:12 
> (internal-server-test+0x2b382f7)
> #5 impala::ClientRequestState::WaitAsync() 
> /home/michael/Impala/be/src/service/client-request-state.cc:1126:10 
> (internal-server-test+0x2b2f9f4)
> #6 impala::ImpalaServer::WaitForResults(impala::TUniqueId&) 
> /home/michael/Impala/be/src/service/internal-server.cc:156:3 
> (internal-server-test+0x2ad148e)
> #7 non-virtual thunk to 
> impa