[jira] [Commented] (KUDU-1520) Possible race between alter schema lock release and tablet shutdown

2017-08-25 Thread Adar Dembo (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142351#comment-16142351
 ] 

Adar Dembo commented on KUDU-1520:
--

Yes. I looked through the AlterSchema/AlterSchemaState/TransactionDriver code 
and I think it's still vulnerable to this race.

> Possible race between alter schema lock release and tablet shutdown
> ---
>
> Key: KUDU-1520
> URL: https://issues.apache.org/jira/browse/KUDU-1520
> Project: Kudu
> Issue Type: Bug
> Components: tablet
> Affects Versions: 0.9.1
> Reporter: Adar Dembo
>
> I've been running a new stress test that hammers a cluster with concurrent
> alter and delete table requests, and one of my test runs failed with the following:
> {noformat}
> F0707 18:59:34.311122   373 rw_semaphore.h:145] Check failed: base::subtle::NoBarrier_Load(_) == kWriteFlag (0 vs. 2147483648)
> *** Check failure stack trace: ***
> @ 0x7f86cd37df5d  google::LogMessage::Fail() at ??:0
> @ 0x7f86cd37fe5d  google::LogMessage::SendToLog() at ??:0
> @ 0x7f86cd37da99  google::LogMessage::Flush() at ??:0
> @ 0x7f86cd3808ff  google::LogMessageFatal::~LogMessageFatal() at ??:0
> @ 0x7f86d4f77c78  kudu::rw_semaphore::unlock() at ??:0
> @ 0x7f86d3728de0  std::unique_lock<>::unlock() at ??:0
> @ 0x7f86d3727192  std::unique_lock<>::~unique_lock() at ??:0
> @ 0x7f86d3725582  kudu::tablet::AlterSchemaTransactionState::~AlterSchemaTransactionState() at ??:0
> @ 0x7f86d37255be  kudu::tablet::AlterSchemaTransactionState::~AlterSchemaTransactionState() at ??:0
> @ 0x7f86d4f68dce  std::default_delete<>::operator()() at ??:0
> @ 0x7f86d4f670b9  std::unique_ptr<>::~unique_ptr() at ??:0
> @ 0x7f86d374510e  kudu::tablet::AlterSchemaTransaction::~AlterSchemaTransaction() at ??:0
> @ 0x7f86d374514a  kudu::tablet::AlterSchemaTransaction::~AlterSchemaTransaction() at ??:0
> @ 0x7f86d373f532  kudu::DefaultDeleter<>::operator()() at ??:0
> @ 0x7f86d373df4a  kudu::internal::gscoped_ptr_impl<>::~gscoped_ptr_impl() at ??:0
> @ 0x7f86d373d552  gscoped_ptr<>::~gscoped_ptr() at ??:0
> @ 0x7f86d373d580  kudu::tablet::TransactionDriver::~TransactionDriver() at ??:0
> @ 0x7f86d3740ab4  kudu::RefCountedThreadSafe<>::DeleteInternal() at ??:0
> @ 0x7f86d3740405  kudu::DefaultRefCountedThreadSafeTraits<>::Destruct() at ??:0
> @ 0x7f86d373f928  kudu::RefCountedThreadSafe<>::Release() at ??:0
> @ 0x7f86d373e769  scoped_refptr<>::~scoped_refptr() at ??:0
> @ 0x7f86d37397cd  kudu::tablet::TabletPeer::SubmitAlterSchema() at ??:0
> @ 0x7f86d4f4e070  kudu::tserver::TabletServiceAdminImpl::AlterSchema() at ??:0
> @ 0x7f86d27a4e92  _ZZN4kudu7tserver26TabletServerAdminServiceIfC4ERK13scoped_refptrINS_12MetricEntityEEENKUlPKN6google8protobuf7MessageEPS9_PNS_3rpc10RpcContextEE1_clESB_SC_SF_ at ??:0
> @ 0x7f86d27a5d96  _ZNSt17_Function_handlerIFvPKN6google8protobuf7MessageEPS2_PN4kudu3rpc10RpcContextEEZNS6_7tserver26TabletServerAdminServiceIfC4ERK13scoped_refptrINS6_12MetricEntityEEEUlS4_S5_S9_E1_E9_M_invokeERKSt9_Any_dataS4_S5_S9_ at ??:0
> @ 0x7f86d22ce6e4  std::function<>::operator()() at ??:0
> @ 0x7f86d22ce19b  kudu::rpc::GeneratedServiceIf::Handle() at ??:0
> @ 0x7f86d22d0a97  kudu::rpc::ServicePool::RunThread() at ??:0
> @ 0x7f86d22d1d45  boost::_mfi::mf0<>::operator()() at ??:0
> @ 0x7f86d22d1b6c  boost::_bi::list1<>::operator()<>() at ??:0
> @ 0x7f86d22d1a61  boost::_bi::bind_t<>::operator()() at ??:0
> @ 0x7f86d22d1998  boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
> {noformat}
> After looking through the code a bit, I suspect this happened because, in the 
> event of failure, the AlterSchema transaction releases the tablet's schema 
> lock implicitly (i.e. when AlterSchemaTransactionState is destroyed) _after_ 
> the transaction itself is removed from the driver's TransactionTracker. Thus, 
> the WaitForAllToFinish() performed during the tablet shutdown process thinks 
> all the transactions are done and proceeds to free tablet state. Later, the 
> last reference to the transaction is released (in 
> TabletPeer::SubmitAlterSchema), the transaction is destroyed, and we try to 
> unlock a lock whose memory has already been freed.
> If this analysis is correct, the broken invariant is: once the transaction 
> has been released from the tracker, it may no longer access any tablet state.
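
The ordering that the analysis above asks for can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Kudu code: FakeTablet, FakeTracker, FakeAlterState and RunOrderedShutdown are hypothetical stand-ins for the tablet, the TransactionTracker, AlterSchemaTransactionState and the failure path, and a plain std::mutex stands in for the schema lock. The point is only that the schema lock is dropped explicitly before the transaction is released from the tracker, so the later destructor has nothing left to unlock.

```cpp
#include <cassert>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the tablet state that shutdown frees.
struct FakeTablet {
  std::mutex schema_lock;
};

// Hypothetical stand-in for the tracker: counts in-flight transactions.
class FakeTracker {
 public:
  void Add() {
    std::lock_guard<std::mutex> l(mu_);
    ++in_flight_;
  }
  void Release() {
    std::lock_guard<std::mutex> l(mu_);
    --in_flight_;
    cv_.notify_all();
  }
  // Blocks until every added transaction has been released.
  void WaitForAllToFinish() {
    std::unique_lock<std::mutex> l(mu_);
    cv_.wait(l, [this] { return in_flight_ == 0; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int in_flight_ = 0;
};

// Hypothetical stand-in for the transaction state: takes the schema lock
// on construction and would otherwise release it only in its destructor.
class FakeAlterState {
 public:
  explicit FakeAlterState(FakeTablet* tablet)
      : schema_lock_(tablet->schema_lock) {}
  void ReleaseSchemaLock() {
    if (schema_lock_.owns_lock()) schema_lock_.unlock();
  }
  bool holds_schema_lock() const { return schema_lock_.owns_lock(); }

 private:
  std::unique_lock<std::mutex> schema_lock_;
};

// Runs one failed "alter" against a concurrent shutdown using the safe
// ordering. Returns true if the tablet was freed and nothing touched it later.
bool RunOrderedShutdown() {
  auto tablet = std::make_unique<FakeTablet>();
  FakeTracker tracker;
  tracker.Add();
  auto state = std::make_unique<FakeAlterState>(tablet.get());

  // Shutdown waits for the tracker to drain, then frees tablet state.
  std::thread shutdown([&] {
    tracker.WaitForAllToFinish();
    tablet.reset();
  });

  state->ReleaseSchemaLock();           // 1. stop touching tablet state
  assert(!state->holds_schema_lock());
  tracker.Release();                    // 2. only now let shutdown proceed
  shutdown.join();

  // The last reference goes away after the tablet is gone; the destructor
  // is harmless because the lock was already released in step 1.
  state.reset();
  return tablet == nullptr;
}
```

Calling RunOrderedShutdown() from a main() is enough to exercise it. Swapping steps 1 and 2 would recreate the reported hazard, since shutdown could then free the mutex while FakeAlterState still owns it.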



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-1520) Possible race between alter schema lock release and tablet shutdown

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142272#comment-16142272
 ] 

Jean-Daniel Cryans commented on KUDU-1520:
--

[~adar] is this still an issue?



[jira] [Commented] (KUDU-1520) Possible race between alter schema lock release and tablet shutdown

2016-12-04 Thread Adar Dembo (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15721237#comment-15721237
 ] 

Adar Dembo commented on KUDU-1520:
--

Yes, it was merged as master-stress-test.

I don't see this issue in the flaky test dashboard (at least not for 
master-stress-test), but I'd be surprised if it had been accidentally fixed.






[jira] [Commented] (KUDU-1520) Possible race between alter schema lock release and tablet shutdown

2016-12-04 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15721136#comment-15721136
 ] 

Todd Lipcon commented on KUDU-1520:
---

[~adar] -- is the above-mentioned stress test checked in nowadays? Would be 
good to know if the issue is still around, and if so which test can be used to 
repro.

> Possible race between alter schema lock release and tablet shutdown
> ---
>
> Key: KUDU-1520
> URL: https://issues.apache.org/jira/browse/KUDU-1520
> Project: Kudu
> Issue Type: Bug
> Components: tablet
> Affects Versions: 0.9.1
> Reporter: Adar Dembo
> Priority: Critical


