[jira] [Updated] (KUDU-3570) Use-after-free and data race in MajorDeltaCompactionOp when AlterTablet is running

2024-04-26 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3570:

Code Review: https://gerrit.cloudera.org/#/c/21362/

> Use-after-free and data race in MajorDeltaCompactionOp when AlterTablet is 
> running
> --
>
> Key: KUDU-3570
> URL: https://issues.apache.org/jira/browse/KUDU-3570
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Reporter: Alexey Serbin
>Priority: Major
>
> Running {{alter_table-randomized-test}} under TSAN produced 
> heap-use-after-free and data race warnings like the ones below, indicating 
> that the corresponding conditions might be hit when a major delta compaction 
> (MajorDeltaCompactionOp) maintenance operation runs while the table is being 
> altered.
> In addition to the TSAN warnings, running {{alter_table-randomized-test}} in 
> DEBUG/ASAN/TSAN builds would crash with SIGABRT and fatal messages like the 
> ones below due to a triggered DCHECK constraint.  In a RELEASE build that 
> condition might lead to unexpected behavior (e.g., silent data corruption, a 
> crash, etc.) when {{Schema::num_columns()}} or other methods are called on a 
> corrupted {{Schema}} object.
> The triggered DCHECK crashes the process with SIGABRT, reporting funny size 
> numbers:
> {noformat}
> F20240426 14:25:15.006683 245509 schema.h:584] Check failed: cols_.size() == 
> name_to_index_.size() (5270498306772959232 vs. 643461730718517486) 
> *** Check failure stack trace: ***
> @ 0x7f2006677390  google::LogMessage::Flush()
> @ 0x7f200667c4cb  google::LogMessageFatal::~LogMessageFatal()
> @   0x4eefff  kudu::Schema::num_columns()
> @ 0x7f200dd18529  kudu::tablet::DeltaPreparer<>::Start()
> @ 0x7f200dcde94f  kudu::tablet::DeltaFileIterator<>::PrepareBatch()
> @ 0x7f200dd07d81  kudu::tablet::DeltaIteratorMerger::PrepareBatch()
> @ 0x7f200dd012b1  
> kudu::tablet::MajorDeltaCompaction::FlushRowSetAndDeltas()
> @ 0x7f200dd02fa1  kudu::tablet::MajorDeltaCompaction::Compact()
> @ 0x7f200dc1f85d  
> kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds()
> @ 0x7f200dc1f504  kudu::tablet::DiskRowSet::MajorCompactDeltaStores()
> @ 0x7f200dae1cc3  kudu::tablet::Tablet::CompactWorstDeltas()
> @ 0x7f200db74cd7  kudu::tablet::MajorDeltaCompactionOp::Perform()
> @ 0x7f2007498827  kudu::MaintenanceManager::LaunchOp()
> @ 0x7f200749c773  
> kudu::MaintenanceManager::RunSchedulerThread()::$_3::operator()()
> {noformat}
> TSAN warning on use-after-free:
> {noformat}
> WARNING: ThreadSanitizer: heap-use-after-free (pid=3392364)
>   Read of size 8 at 0x7b4400100060 by thread T23 (mutexes: write 
> M236855935862339888, write M206456689917755456):
> #0 std::__1::vector std::__1::allocator >::size() const 
> thirdparty/installed/tsan/include/c++/v1/vector:658:46 (kudu+0x4ee25b)
> #1 kudu::Schema::num_columns() const src/kudu/common/schema.h:584:5 
> (kudu+0x4eef50)
> #2 
> kudu::tablet::DeltaPreparer
>  >::Start(unsigned long, int) src/kudu/tablet/delta_store.cc:204:46 
> (libtablet.so+0x578488)
> #3 
> kudu::tablet::DeltaFileIterator<(kudu::tablet::DeltaType)0>::PrepareBatch(unsigned
>  long, int) src/kudu/tablet/deltafile.cc:608:13 (libtablet.so+0x53e8ae)
> #4 kudu::tablet::DeltaIteratorMerger::PrepareBatch(unsigned long, int) 
> src/kudu/tablet/delta_iterator_merger.cc:66:5 (libtablet.so+0x567ce0)
> #5 
> kudu::tablet::MajorDeltaCompaction::FlushRowSetAndDeltas(kudu::fs::IOContext 
> const*) src/kudu/tablet/delta_compaction.cc:155:5 (libtablet.so+0x561210)
> #6 kudu::tablet::MajorDeltaCompaction::Compact(kudu::fs::IOContext 
> const*) src/kudu/tablet/delta_compaction.cc:340:3 (libtablet.so+0x562f00)
> #7 
> kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds(std::__1::vector  std::__1::allocator > const&, kudu::fs::IOContext const*, 
> kudu::tablet::HistoryGcOpts) src/kudu/tablet/diskrowset.cc:588:3 
> (libtablet.so+0x47f7bc)
> #8 kudu::tablet::DiskRowSet::MajorCompactDeltaStores(kudu::fs::IOContext 
> const*, kudu::tablet::HistoryGcOpts) src/kudu/tablet/diskrowset.cc:572:10 
> (libtablet.so+0x47f463)
> #9 
> kudu::tablet::Tablet::CompactWorstDeltas(kudu::tablet::RowSet::DeltaCompactionType)
>  src/kudu/tablet/tablet.cc:2881:5 (libtablet.so+0x341c92)
> #10 kudu::tablet::MajorDeltaCompactionOp::Perform() 
> src/kudu/tablet/tablet_mm_ops.cc:364:3 (libtablet.so+0x3d4ca6)
> #11 kudu::MaintenanceManager::LaunchOp(kudu::MaintenanceOp*) 
> src/kudu/util/maintenance_manager.cc:640:9 (libkudu_util.so+0x38c826)
> #12 kudu::MaintenanceManager::RunSchedulerThread()::$_3::operator()() 
> const src/kudu/util/maintenance_manager.cc:422:5 (libkudu_util.so+0x390772)
> ...
>   Previous write of size 8 at 

[jira] [Updated] (KUDU-3570) Use-after-free and data race in MajorDeltaCompactionOp when AlterTablet is running

2024-04-26 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3570:

Status: In Review  (was: Open)

> Use-after-free and data race in MajorDeltaCompactionOp when AlterTablet is 
> running
> --
>
> Key: KUDU-3570
> URL: https://issues.apache.org/jira/browse/KUDU-3570
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Reporter: Alexey Serbin
>Priority: Major
>
> Running {{alter_table-randomized-test}} under TSAN produced 
> heap-use-after-free and data race warnings like the ones below, indicating 
> that the corresponding conditions might be hit when a major delta compaction 
> (MajorDeltaCompactionOp) maintenance operation runs while the table is being 
> altered.
> In addition to the TSAN warnings, running {{alter_table-randomized-test}} in 
> DEBUG/ASAN/TSAN builds would crash with SIGABRT and fatal messages like the 
> ones below due to a triggered DCHECK constraint.  In a RELEASE build that 
> condition might lead to unexpected behavior (e.g., silent data corruption, a 
> crash, etc.) when {{Schema::num_columns()}} or other methods are called on a 
> corrupted {{Schema}} object.
> The triggered DCHECK crashes the process with SIGABRT, reporting funny size 
> numbers:
> {noformat}
> F20240426 14:25:15.006683 245509 schema.h:584] Check failed: cols_.size() == 
> name_to_index_.size() (5270498306772959232 vs. 643461730718517486) 
> *** Check failure stack trace: ***
> @ 0x7f2006677390  google::LogMessage::Flush()
> @ 0x7f200667c4cb  google::LogMessageFatal::~LogMessageFatal()
> @   0x4eefff  kudu::Schema::num_columns()
> @ 0x7f200dd18529  kudu::tablet::DeltaPreparer<>::Start()
> @ 0x7f200dcde94f  kudu::tablet::DeltaFileIterator<>::PrepareBatch()
> @ 0x7f200dd07d81  kudu::tablet::DeltaIteratorMerger::PrepareBatch()
> @ 0x7f200dd012b1  
> kudu::tablet::MajorDeltaCompaction::FlushRowSetAndDeltas()
> @ 0x7f200dd02fa1  kudu::tablet::MajorDeltaCompaction::Compact()
> @ 0x7f200dc1f85d  
> kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds()
> @ 0x7f200dc1f504  kudu::tablet::DiskRowSet::MajorCompactDeltaStores()
> @ 0x7f200dae1cc3  kudu::tablet::Tablet::CompactWorstDeltas()
> @ 0x7f200db74cd7  kudu::tablet::MajorDeltaCompactionOp::Perform()
> @ 0x7f2007498827  kudu::MaintenanceManager::LaunchOp()
> @ 0x7f200749c773  
> kudu::MaintenanceManager::RunSchedulerThread()::$_3::operator()()
> {noformat}
> TSAN warning on use-after-free:
> {noformat}
> WARNING: ThreadSanitizer: heap-use-after-free (pid=3392364)
>   Read of size 8 at 0x7b4400100060 by thread T23 (mutexes: write 
> M236855935862339888, write M206456689917755456):
> #0 std::__1::vector std::__1::allocator >::size() const 
> thirdparty/installed/tsan/include/c++/v1/vector:658:46 (kudu+0x4ee25b)
> #1 kudu::Schema::num_columns() const src/kudu/common/schema.h:584:5 
> (kudu+0x4eef50)
> #2 
> kudu::tablet::DeltaPreparer
>  >::Start(unsigned long, int) src/kudu/tablet/delta_store.cc:204:46 
> (libtablet.so+0x578488)
> #3 
> kudu::tablet::DeltaFileIterator<(kudu::tablet::DeltaType)0>::PrepareBatch(unsigned
>  long, int) src/kudu/tablet/deltafile.cc:608:13 (libtablet.so+0x53e8ae)
> #4 kudu::tablet::DeltaIteratorMerger::PrepareBatch(unsigned long, int) 
> src/kudu/tablet/delta_iterator_merger.cc:66:5 (libtablet.so+0x567ce0)
> #5 
> kudu::tablet::MajorDeltaCompaction::FlushRowSetAndDeltas(kudu::fs::IOContext 
> const*) src/kudu/tablet/delta_compaction.cc:155:5 (libtablet.so+0x561210)
> #6 kudu::tablet::MajorDeltaCompaction::Compact(kudu::fs::IOContext 
> const*) src/kudu/tablet/delta_compaction.cc:340:3 (libtablet.so+0x562f00)
> #7 
> kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds(std::__1::vector  std::__1::allocator > const&, kudu::fs::IOContext const*, 
> kudu::tablet::HistoryGcOpts) src/kudu/tablet/diskrowset.cc:588:3 
> (libtablet.so+0x47f7bc)
> #8 kudu::tablet::DiskRowSet::MajorCompactDeltaStores(kudu::fs::IOContext 
> const*, kudu::tablet::HistoryGcOpts) src/kudu/tablet/diskrowset.cc:572:10 
> (libtablet.so+0x47f463)
> #9 
> kudu::tablet::Tablet::CompactWorstDeltas(kudu::tablet::RowSet::DeltaCompactionType)
>  src/kudu/tablet/tablet.cc:2881:5 (libtablet.so+0x341c92)
> #10 kudu::tablet::MajorDeltaCompactionOp::Perform() 
> src/kudu/tablet/tablet_mm_ops.cc:364:3 (libtablet.so+0x3d4ca6)
> #11 kudu::MaintenanceManager::LaunchOp(kudu::MaintenanceOp*) 
> src/kudu/util/maintenance_manager.cc:640:9 (libkudu_util.so+0x38c826)
> #12 kudu::MaintenanceManager::RunSchedulerThread()::$_3::operator()() 
> const src/kudu/util/maintenance_manager.cc:422:5 (libkudu_util.so+0x390772)
> ...
>   Previous write of size 8 at 0x7b4400100060 by thread T158:

[jira] [Updated] (KUDU-3570) Use-after-free and data race in MajorDeltaCompactionOp when AlterTablet is running

2024-04-26 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3570:

Description: 
Running {{alter_table-randomized-test}} under TSAN produced heap-use-after-free 
and data race warnings like the ones below, indicating that the corresponding 
conditions might be hit when a major delta compaction (MajorDeltaCompactionOp) 
maintenance operation runs while the table is being altered.

In addition to the TSAN warnings, running {{alter_table-randomized-test}} in 
DEBUG/ASAN/TSAN builds would crash with SIGABRT and fatal messages like the 
ones below due to a triggered DCHECK constraint.  In a RELEASE build that 
condition might lead to unexpected behavior (e.g., silent data corruption, a 
crash, etc.) when {{Schema::num_columns()}} or other methods are called on a 
corrupted {{Schema}} object.
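
The essence of the race can be illustrated with a minimal standalone sketch 
(hypothetical names, not the actual Kudu code): one thread caches a raw pointer 
to a {{Schema}} owned by a {{std::shared_ptr}}, while another thread swaps that 
{{shared_ptr}}, destroying the object under the reader:
{noformat}
#include <memory>
#include <thread>
#include <vector>

struct Schema {
  std::vector<int> cols_;
  // Reads freed memory if *this has already been destroyed.
  size_t num_columns() const { return cols_.size(); }
};

std::shared_ptr<Schema> g_schema = std::make_shared<Schema>();

int main() {
  // "Maintenance" thread: caches a raw pointer instead of taking
  // shared ownership, so nothing keeps the old Schema alive.
  std::thread compaction([] {
    const Schema* s = g_schema.get();
    for (int i = 0; i < 1000; ++i) {
      (void)s->num_columns();  // potential heap-use-after-free
    }
  });
  // "Alter" thread: replaces the schema; the old Schema is destroyed
  // here because the reader holds no counted reference to it.
  std::thread alter([] { g_schema = std::make_shared<Schema>(); });
  compaction.join();
  alter.join();
  return 0;
}
{noformat}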

The triggered DCHECK crashes the process with SIGABRT, reporting funny size 
numbers:
{noformat}
F20240426 14:25:15.006683 245509 schema.h:584] Check failed: cols_.size() == 
name_to_index_.size() (5270498306772959232 vs. 643461730718517486) 
*** Check failure stack trace: ***
@ 0x7f2006677390  google::LogMessage::Flush()
@ 0x7f200667c4cb  google::LogMessageFatal::~LogMessageFatal()
@   0x4eefff  kudu::Schema::num_columns()
@ 0x7f200dd18529  kudu::tablet::DeltaPreparer<>::Start()
@ 0x7f200dcde94f  kudu::tablet::DeltaFileIterator<>::PrepareBatch()
@ 0x7f200dd07d81  kudu::tablet::DeltaIteratorMerger::PrepareBatch()
@ 0x7f200dd012b1  
kudu::tablet::MajorDeltaCompaction::FlushRowSetAndDeltas()
@ 0x7f200dd02fa1  kudu::tablet::MajorDeltaCompaction::Compact()
@ 0x7f200dc1f85d  
kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds()
@ 0x7f200dc1f504  kudu::tablet::DiskRowSet::MajorCompactDeltaStores()
@ 0x7f200dae1cc3  kudu::tablet::Tablet::CompactWorstDeltas()
@ 0x7f200db74cd7  kudu::tablet::MajorDeltaCompactionOp::Perform()
@ 0x7f2007498827  kudu::MaintenanceManager::LaunchOp()
@ 0x7f200749c773  
kudu::MaintenanceManager::RunSchedulerThread()::$_3::operator()()
{noformat}

TSAN warning on use-after-free:
{noformat}
WARNING: ThreadSanitizer: heap-use-after-free (pid=3392364)
  Read of size 8 at 0x7b4400100060 by thread T23 (mutexes: write 
M236855935862339888, write M206456689917755456):
#0 std::__1::vector >::size() const 
thirdparty/installed/tsan/include/c++/v1/vector:658:46 (kudu+0x4ee25b)
#1 kudu::Schema::num_columns() const src/kudu/common/schema.h:584:5 
(kudu+0x4eef50)
#2 
kudu::tablet::DeltaPreparer
 >::Start(unsigned long, int) src/kudu/tablet/delta_store.cc:204:46 
(libtablet.so+0x578488)
#3 
kudu::tablet::DeltaFileIterator<(kudu::tablet::DeltaType)0>::PrepareBatch(unsigned
 long, int) src/kudu/tablet/deltafile.cc:608:13 (libtablet.so+0x53e8ae)
#4 kudu::tablet::DeltaIteratorMerger::PrepareBatch(unsigned long, int) 
src/kudu/tablet/delta_iterator_merger.cc:66:5 (libtablet.so+0x567ce0)
#5 
kudu::tablet::MajorDeltaCompaction::FlushRowSetAndDeltas(kudu::fs::IOContext 
const*) src/kudu/tablet/delta_compaction.cc:155:5 (libtablet.so+0x561210)
#6 kudu::tablet::MajorDeltaCompaction::Compact(kudu::fs::IOContext const*) 
src/kudu/tablet/delta_compaction.cc:340:3 (libtablet.so+0x562f00)
#7 
kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds(std::__1::vector > const&, kudu::fs::IOContext const*, 
kudu::tablet::HistoryGcOpts) src/kudu/tablet/diskrowset.cc:588:3 
(libtablet.so+0x47f7bc)
#8 kudu::tablet::DiskRowSet::MajorCompactDeltaStores(kudu::fs::IOContext 
const*, kudu::tablet::HistoryGcOpts) src/kudu/tablet/diskrowset.cc:572:10 
(libtablet.so+0x47f463)
#9 
kudu::tablet::Tablet::CompactWorstDeltas(kudu::tablet::RowSet::DeltaCompactionType)
 src/kudu/tablet/tablet.cc:2881:5 (libtablet.so+0x341c92)
#10 kudu::tablet::MajorDeltaCompactionOp::Perform() 
src/kudu/tablet/tablet_mm_ops.cc:364:3 (libtablet.so+0x3d4ca6)
#11 kudu::MaintenanceManager::LaunchOp(kudu::MaintenanceOp*) 
src/kudu/util/maintenance_manager.cc:640:9 (libkudu_util.so+0x38c826)
#12 kudu::MaintenanceManager::RunSchedulerThread()::$_3::operator()() const 
src/kudu/util/maintenance_manager.cc:422:5 (libkudu_util.so+0x390772)
...
  Previous write of size 8 at 0x7b4400100060 by thread T158:
#0 operator delete(void*) 
thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_new_delete.cpp:126
 (kudu+0x4dd4e9)
#1 std::__1::_DeallocateCaller::__do_call(void*) 
thirdparty/installed/tsan/include/c++/v1/new:334:12 (kudu+0x4e9389)
#2 std::__1::_DeallocateCaller::__do_deallocate_handle_size(void*, unsigned 
long) thirdparty/installed/tsan/include/c++/v1/new:292:12 (kudu+0x4e9329)
#3 std::__1::_DeallocateCaller::__do_deallocate_handle_size_align(void*, 
unsigned long, unsigned long) 
thirdparty/installed/tsan/include/c++/v1/new:268:14 (libtablet.so+0x35fe42)
#4 

[jira] [Created] (KUDU-3570) Use-after-free and data race in MajorDeltaCompactionOp when AlterTablet is running

2024-04-26 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3570:
---

 Summary: Use-after-free and data race in MajorDeltaCompactionOp 
when AlterTablet is running
 Key: KUDU-3570
 URL: https://issues.apache.org/jira/browse/KUDU-3570
 Project: Kudu
  Issue Type: Bug
  Components: tserver
Reporter: Alexey Serbin


Running {{alter_table-randomized-test}} under TSAN produced heap-use-after-free 
and data race warnings like the ones below, indicating that the corresponding 
conditions might be hit when a major delta compaction (MajorDeltaCompactionOp) 
maintenance operation runs while the table is being altered.

In addition to the TSAN warnings, running {{alter_table-randomized-test}} in 
DEBUG/ASAN/TSAN builds would crash with SIGABRT and fatal messages like the 
ones below due to a triggered DCHECK constraint.  In a RELEASE build that 
condition might lead to silent data corruption or a crash.

The triggered DCHECK crashes the process with SIGABRT, reporting funny size 
numbers:
{noformat}
F20240426 14:25:15.006683 245509 schema.h:584] Check failed: cols_.size() == 
name_to_index_.size() (5270498306772959232 vs. 643461730718517486) 
*** Check failure stack trace: ***
@ 0x7f2006677390  google::LogMessage::Flush()
@ 0x7f200667c4cb  google::LogMessageFatal::~LogMessageFatal()
@   0x4eefff  kudu::Schema::num_columns()
@ 0x7f200dd18529  kudu::tablet::DeltaPreparer<>::Start()
@ 0x7f200dcde94f  kudu::tablet::DeltaFileIterator<>::PrepareBatch()
@ 0x7f200dd07d81  kudu::tablet::DeltaIteratorMerger::PrepareBatch()
@ 0x7f200dd012b1  
kudu::tablet::MajorDeltaCompaction::FlushRowSetAndDeltas()
@ 0x7f200dd02fa1  kudu::tablet::MajorDeltaCompaction::Compact()
@ 0x7f200dc1f85d  
kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds()
@ 0x7f200dc1f504  kudu::tablet::DiskRowSet::MajorCompactDeltaStores()
@ 0x7f200dae1cc3  kudu::tablet::Tablet::CompactWorstDeltas()
@ 0x7f200db74cd7  kudu::tablet::MajorDeltaCompactionOp::Perform()
@ 0x7f2007498827  kudu::MaintenanceManager::LaunchOp()
@ 0x7f200749c773  
kudu::MaintenanceManager::RunSchedulerThread()::$_3::operator()()
{noformat}

TSAN warning on use-after-free:
{noformat}
WARNING: ThreadSanitizer: heap-use-after-free (pid=3392364)
  Read of size 8 at 0x7b4400100060 by thread T23 (mutexes: write 
M236855935862339888, write M206456689917755456):
#0 std::__1::vector >::size() const 
thirdparty/installed/tsan/include/c++/v1/vector:658:46 (kudu+0x4ee25b)
#1 kudu::Schema::num_columns() const src/kudu/common/schema.h:584:5 
(kudu+0x4eef50)
#2 
kudu::tablet::DeltaPreparer
 >::Start(unsigned long, int) src/kudu/tablet/delta_store.cc:204:46 
(libtablet.so+0x578488)
#3 
kudu::tablet::DeltaFileIterator<(kudu::tablet::DeltaType)0>::PrepareBatch(unsigned
 long, int) src/kudu/tablet/deltafile.cc:608:13 (libtablet.so+0x53e8ae)
#4 kudu::tablet::DeltaIteratorMerger::PrepareBatch(unsigned long, int) 
src/kudu/tablet/delta_iterator_merger.cc:66:5 (libtablet.so+0x567ce0)
#5 
kudu::tablet::MajorDeltaCompaction::FlushRowSetAndDeltas(kudu::fs::IOContext 
const*) src/kudu/tablet/delta_compaction.cc:155:5 (libtablet.so+0x561210)
#6 kudu::tablet::MajorDeltaCompaction::Compact(kudu::fs::IOContext const*) 
src/kudu/tablet/delta_compaction.cc:340:3 (libtablet.so+0x562f00)
#7 
kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds(std::__1::vector > const&, kudu::fs::IOContext const*, 
kudu::tablet::HistoryGcOpts) src/kudu/tablet/diskrowset.cc:588:3 
(libtablet.so+0x47f7bc)
#8 kudu::tablet::DiskRowSet::MajorCompactDeltaStores(kudu::fs::IOContext 
const*, kudu::tablet::HistoryGcOpts) src/kudu/tablet/diskrowset.cc:572:10 
(libtablet.so+0x47f463)
#9 
kudu::tablet::Tablet::CompactWorstDeltas(kudu::tablet::RowSet::DeltaCompactionType)
 src/kudu/tablet/tablet.cc:2881:5 (libtablet.so+0x341c92)
#10 kudu::tablet::MajorDeltaCompactionOp::Perform() 
src/kudu/tablet/tablet_mm_ops.cc:364:3 (libtablet.so+0x3d4ca6)
#11 kudu::MaintenanceManager::LaunchOp(kudu::MaintenanceOp*) 
src/kudu/util/maintenance_manager.cc:640:9 (libkudu_util.so+0x38c826)
#12 kudu::MaintenanceManager::RunSchedulerThread()::$_3::operator()() const 
src/kudu/util/maintenance_manager.cc:422:5 (libkudu_util.so+0x390772)
...
  Previous write of size 8 at 0x7b4400100060 by thread T158:
#0 operator delete(void*) 
thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_new_delete.cpp:126
 (kudu+0x4dd4e9)
#1 std::__1::_DeallocateCaller::__do_call(void*) 
thirdparty/installed/tsan/include/c++/v1/new:334:12 (kudu+0x4e9389)
#2 std::__1::_DeallocateCaller::__do_deallocate_handle_size(void*, unsigned 
long) thirdparty/installed/tsan/include/c++/v1/new:292:12 (kudu+0x4e9329)
#3 std::__1::_DeallocateCaller::__do_deallocate_handle_size_align(void*, 
unsigned long, unsigned long) 

[jira] [Updated] (KUDU-3569) Data race in CFileSet::Iterator::OptimizePKPredicates()

2024-04-26 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3569:

Fix Version/s: 1.18.0
   Resolution: Fixed
   Status: Resolved  (was: In Review)

> Data race in CFileSet::Iterator::OptimizePKPredicates()
> ---
>
> Key: KUDU-3569
> URL: https://issues.apache.org/jira/browse/KUDU-3569
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.17.0
>Reporter: Alexey Serbin
>Priority: Major
> Fix For: 1.18.0
>
>
> Running {{alter_table-randomized-test}} under TSAN produced data race 
> warnings like the one below, indicating a race in 
> {{CFileSet::Iterator::OptimizePKPredicates()}}.  One actor was 
> {{tablet::AlterSchemaOp::Apply()}} initiated by AlterTable; the other was the 
> maintenance thread running a major delta compaction.  Apparently, the same 
> data race might also happen if the other concurrent actor were a thread 
> handling a scan request containing IN-list predicates optimized at the DRS 
> level.
> {noformat}
> WARNING: ThreadSanitizer: data race (pid=3919595)
>   Write of size 8 at 0x7b44000f4a20 by thread T7:
> #0 std::__1::__vector_base long>>::__destruct_at_end(unsigned long*) 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:429:12 
> (kudu+0x4d4080)
> #1 std::__1::__vector_base long>>::clear() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:371:29 
> (kudu+0x4d3f94)
> #2 std::__1::__vector_base long>>::~__vector_base() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:465:9 
> (kudu+0x4d3d4b)
> #3 std::__1::vector 
> >::~vector() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:557:5 
> (kudu+0x4d1261)
> #4 kudu::Schema::~Schema() 
> /root/Projects/kudu/src/kudu/common/schema.h:491:7 (kudu+0x4cc40f)
> #5 std::__1::__shared_ptr_emplace std::__1::allocator>::__on_zero_shared() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3503:23 
> (libtablet.so+0x389d45)
> #6 std::__1::__shared_count::__release_shared() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3341:9 
> (kudu+0x4d4d05)
> #7 std::__1::__shared_weak_count::__release_shared() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3383:27 
> (kudu+0x4d4ca9)
> #8 std::__1::shared_ptr::~shared_ptr() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:4098:19 
> (kudu+0x5303e8)
> #9 
> kudu::tablet::TabletMetadata::SetSchema(std::__1::shared_ptr 
> const&, unsigned int) 
> /root/Projects/kudu/src/kudu/tablet/tablet_metadata.cc:957:1 
> (libtablet.so+0x4d8882)
> #10 kudu::tablet::Tablet::AlterSchema(kudu::tablet::AlterSchemaOpState*) 
> /root/Projects/kudu/src/kudu/tablet/tablet.cc:1727:14 (libtablet.so+0x32720a)
> #11 kudu::tablet::AlterSchemaOp::Apply(kudu::consensus::CommitMsg**) 
> /root/Projects/kudu/src/kudu/tablet/ops/alter_schema_op.cc:127:3 
> (libtablet.so+0x4013f8)
> #12 kudu::tablet::OpDriver::ApplyTask() 
> /root/Projects/kudu/src/kudu/tablet/ops/op_driver.cc:527:21 
> (libtablet.so+0x40873a)
> ...
>   Previous read of size 8 at 0x7b44000f4a20 by thread T22 (mutexes: write 
> M799524414306809968, write M765184518688777856):
> #0 std::__1::vector 
> >::empty() const 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:664:41 
> (kudu+0x5ca926)
> #1 kudu::Schema::initialized() const 
> /root/Projects/kudu/src/kudu/common/schema.h:676:26 (kudu+0x5ca3fd)
> #2 kudu::Schema::key_byte_size() const 
> /root/Projects/kudu/src/kudu/common/schema.h:572:5 
> (libkudu_common.so+0x171eae)
> #3 kudu::EncodedKey::DecodeEncodedString(kudu::Schema const&, 
> kudu::Arena*, kudu::Slice const&, kudu::EncodedKey**) 
> /root/Projects/kudu/src/kudu/common/encoded_key.cc:60:76 
> (libkudu_common.so+0x171091)
> #4 
> kudu::tablet::CFileSet::Iterator::OptimizePKPredicates(kudu::ScanSpec*) 
> /root/Projects/kudu/src/kudu/tablet/cfile_set.cc:444:5 (libtablet.so+0x428934)
> #5 kudu::tablet::CFileSet::Iterator::Init(kudu::ScanSpec*) 
> /root/Projects/kudu/src/kudu/tablet/cfile_set.cc:410:3 (libtablet.so+0x4285d7)
> #6 kudu::MaterializingIterator::Init(kudu::ScanSpec*) 
> /root/Projects/kudu/src/kudu/common/generic_iterators.cc:1176:3 
> (libkudu_common.so+0x178872)
> #7 
> kudu::tablet::MajorDeltaCompaction::FlushRowSetAndDeltas(kudu::fs::IOContext 
> const*) /root/Projects/kudu/src/kudu/tablet/delta_compaction.cc:130:3 
> (libtablet.so+0x54ca30)
> #8 kudu::tablet::MajorDeltaCompaction::Compact(kudu::fs::IOContext 
> const*) /root/Projects/kudu/src/kudu/tablet/delta_compaction.cc:340:3 
> (libtablet.so+0x54ead0)
> #9 
> 

[jira] [Updated] (KUDU-3569) Data race in CFileSet::Iterator::OptimizePKPredicates()

2024-04-26 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3569:

Status: In Review  (was: Open)

> Data race in CFileSet::Iterator::OptimizePKPredicates()
> ---
>
> Key: KUDU-3569
> URL: https://issues.apache.org/jira/browse/KUDU-3569
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.17.0
>Reporter: Alexey Serbin
>Priority: Major
>
> Running {{alter_table-randomized-test}} under TSAN produced data race 
> warnings like the one below, indicating a race in 
> {{CFileSet::Iterator::OptimizePKPredicates()}}.  One actor was 
> {{tablet::AlterSchemaOp::Apply()}} initiated by AlterTable; the other was the 
> maintenance thread running a major delta compaction.  Apparently, the same 
> data race might also happen if the other concurrent actor were a thread 
> handling a scan request containing IN-list predicates optimized at the DRS 
> level.
> {noformat}
> WARNING: ThreadSanitizer: data race (pid=3919595)
>   Write of size 8 at 0x7b44000f4a20 by thread T7:
> #0 std::__1::__vector_base long>>::__destruct_at_end(unsigned long*) 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:429:12 
> (kudu+0x4d4080)
> #1 std::__1::__vector_base long>>::clear() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:371:29 
> (kudu+0x4d3f94)
> #2 std::__1::__vector_base long>>::~__vector_base() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:465:9 
> (kudu+0x4d3d4b)
> #3 std::__1::vector 
> >::~vector() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:557:5 
> (kudu+0x4d1261)
> #4 kudu::Schema::~Schema() 
> /root/Projects/kudu/src/kudu/common/schema.h:491:7 (kudu+0x4cc40f)
> #5 std::__1::__shared_ptr_emplace std::__1::allocator>::__on_zero_shared() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3503:23 
> (libtablet.so+0x389d45)
> #6 std::__1::__shared_count::__release_shared() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3341:9 
> (kudu+0x4d4d05)
> #7 std::__1::__shared_weak_count::__release_shared() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3383:27 
> (kudu+0x4d4ca9)
> #8 std::__1::shared_ptr::~shared_ptr() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:4098:19 
> (kudu+0x5303e8)
> #9 
> kudu::tablet::TabletMetadata::SetSchema(std::__1::shared_ptr 
> const&, unsigned int) 
> /root/Projects/kudu/src/kudu/tablet/tablet_metadata.cc:957:1 
> (libtablet.so+0x4d8882)
> #10 kudu::tablet::Tablet::AlterSchema(kudu::tablet::AlterSchemaOpState*) 
> /root/Projects/kudu/src/kudu/tablet/tablet.cc:1727:14 (libtablet.so+0x32720a)
> #11 kudu::tablet::AlterSchemaOp::Apply(kudu::consensus::CommitMsg**) 
> /root/Projects/kudu/src/kudu/tablet/ops/alter_schema_op.cc:127:3 
> (libtablet.so+0x4013f8)
> #12 kudu::tablet::OpDriver::ApplyTask() 
> /root/Projects/kudu/src/kudu/tablet/ops/op_driver.cc:527:21 
> (libtablet.so+0x40873a)
> ...
>   Previous read of size 8 at 0x7b44000f4a20 by thread T22 (mutexes: write 
> M799524414306809968, write M765184518688777856):
> #0 std::__1::vector 
> >::empty() const 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:664:41 
> (kudu+0x5ca926)
> #1 kudu::Schema::initialized() const 
> /root/Projects/kudu/src/kudu/common/schema.h:676:26 (kudu+0x5ca3fd)
> #2 kudu::Schema::key_byte_size() const 
> /root/Projects/kudu/src/kudu/common/schema.h:572:5 
> (libkudu_common.so+0x171eae)
> #3 kudu::EncodedKey::DecodeEncodedString(kudu::Schema const&, 
> kudu::Arena*, kudu::Slice const&, kudu::EncodedKey**) 
> /root/Projects/kudu/src/kudu/common/encoded_key.cc:60:76 
> (libkudu_common.so+0x171091)
> #4 
> kudu::tablet::CFileSet::Iterator::OptimizePKPredicates(kudu::ScanSpec*) 
> /root/Projects/kudu/src/kudu/tablet/cfile_set.cc:444:5 (libtablet.so+0x428934)
> #5 kudu::tablet::CFileSet::Iterator::Init(kudu::ScanSpec*) 
> /root/Projects/kudu/src/kudu/tablet/cfile_set.cc:410:3 (libtablet.so+0x4285d7)
> #6 kudu::MaterializingIterator::Init(kudu::ScanSpec*) 
> /root/Projects/kudu/src/kudu/common/generic_iterators.cc:1176:3 
> (libkudu_common.so+0x178872)
> #7 
> kudu::tablet::MajorDeltaCompaction::FlushRowSetAndDeltas(kudu::fs::IOContext 
> const*) /root/Projects/kudu/src/kudu/tablet/delta_compaction.cc:130:3 
> (libtablet.so+0x54ca30)
> #8 kudu::tablet::MajorDeltaCompaction::Compact(kudu::fs::IOContext 
> const*) /root/Projects/kudu/src/kudu/tablet/delta_compaction.cc:340:3 
> (libtablet.so+0x54ead0)
> #9 
> kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds(std::__1::vector  std::__1::allocator > const&, kudu::fs::IOContext const*, 

[jira] [Updated] (KUDU-3569) Data race in CFileSet::Iterator::OptimizePKPredicates()

2024-04-26 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3569:

Code Review: http://gerrit.cloudera.org:8080/21359

> Data race in CFileSet::Iterator::OptimizePKPredicates()
> ---
>
> Key: KUDU-3569
> URL: https://issues.apache.org/jira/browse/KUDU-3569
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.17.0
>Reporter: Alexey Serbin
>Priority: Major
>
> Running {{alter_table-randomized-test}} under TSAN produced data race 
> warnings like the one below, indicating a race in 
> {{CFileSet::Iterator::OptimizePKPredicates()}}.  One actor was 
> {{tablet::AlterSchemaOp::Apply()}} initiated by AlterTable; the other was the 
> maintenance thread running a major delta compaction.  Apparently, the same 
> data race might also happen if the other concurrent actor were a thread 
> handling a scan request containing IN-list predicates optimized at the DRS 
> level.
> {noformat}
> WARNING: ThreadSanitizer: data race (pid=3919595)
>   Write of size 8 at 0x7b44000f4a20 by thread T7:
> #0 std::__1::__vector_base long>>::__destruct_at_end(unsigned long*) 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:429:12 
> (kudu+0x4d4080)
> #1 std::__1::__vector_base long>>::clear() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:371:29 
> (kudu+0x4d3f94)
> #2 std::__1::__vector_base long>>::~__vector_base() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:465:9 
> (kudu+0x4d3d4b)
> #3 std::__1::vector 
> >::~vector() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:557:5 
> (kudu+0x4d1261)
> #4 kudu::Schema::~Schema() 
> /root/Projects/kudu/src/kudu/common/schema.h:491:7 (kudu+0x4cc40f)
> #5 std::__1::__shared_ptr_emplace std::__1::allocator>::__on_zero_shared() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3503:23 
> (libtablet.so+0x389d45)
> #6 std::__1::__shared_count::__release_shared() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3341:9 
> (kudu+0x4d4d05)
> #7 std::__1::__shared_weak_count::__release_shared() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3383:27 
> (kudu+0x4d4ca9)
> #8 std::__1::shared_ptr::~shared_ptr() 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:4098:19 
> (kudu+0x5303e8)
> #9 
> kudu::tablet::TabletMetadata::SetSchema(std::__1::shared_ptr 
> const&, unsigned int) 
> /root/Projects/kudu/src/kudu/tablet/tablet_metadata.cc:957:1 
> (libtablet.so+0x4d8882)
> #10 kudu::tablet::Tablet::AlterSchema(kudu::tablet::AlterSchemaOpState*) 
> /root/Projects/kudu/src/kudu/tablet/tablet.cc:1727:14 (libtablet.so+0x32720a)
> #11 kudu::tablet::AlterSchemaOp::Apply(kudu::consensus::CommitMsg**) 
> /root/Projects/kudu/src/kudu/tablet/ops/alter_schema_op.cc:127:3 
> (libtablet.so+0x4013f8)
> #12 kudu::tablet::OpDriver::ApplyTask() 
> /root/Projects/kudu/src/kudu/tablet/ops/op_driver.cc:527:21 
> (libtablet.so+0x40873a)
> ...
>   Previous read of size 8 at 0x7b44000f4a20 by thread T22 (mutexes: write 
> M799524414306809968, write M765184518688777856):
> #0 std::__1::vector 
> >::empty() const 
> /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:664:41 
> (kudu+0x5ca926)
> #1 kudu::Schema::initialized() const 
> /root/Projects/kudu/src/kudu/common/schema.h:676:26 (kudu+0x5ca3fd)
> #2 kudu::Schema::key_byte_size() const 
> /root/Projects/kudu/src/kudu/common/schema.h:572:5 
> (libkudu_common.so+0x171eae)
> #3 kudu::EncodedKey::DecodeEncodedString(kudu::Schema const&, 
> kudu::Arena*, kudu::Slice const&, kudu::EncodedKey**) 
> /root/Projects/kudu/src/kudu/common/encoded_key.cc:60:76 
> (libkudu_common.so+0x171091)
> #4 
> kudu::tablet::CFileSet::Iterator::OptimizePKPredicates(kudu::ScanSpec*) 
> /root/Projects/kudu/src/kudu/tablet/cfile_set.cc:444:5 (libtablet.so+0x428934)
> #5 kudu::tablet::CFileSet::Iterator::Init(kudu::ScanSpec*) 
> /root/Projects/kudu/src/kudu/tablet/cfile_set.cc:410:3 (libtablet.so+0x4285d7)
> #6 kudu::MaterializingIterator::Init(kudu::ScanSpec*) 
> /root/Projects/kudu/src/kudu/common/generic_iterators.cc:1176:3 
> (libkudu_common.so+0x178872)
> #7 
> kudu::tablet::MajorDeltaCompaction::FlushRowSetAndDeltas(kudu::fs::IOContext 
> const*) /root/Projects/kudu/src/kudu/tablet/delta_compaction.cc:130:3 
> (libtablet.so+0x54ca30)
> #8 kudu::tablet::MajorDeltaCompaction::Compact(kudu::fs::IOContext 
> const*) /root/Projects/kudu/src/kudu/tablet/delta_compaction.cc:340:3 
> (libtablet.so+0x54ead0)
> #9 
> kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds(std::__1::vector  std::__1::allocator > const&, 

[jira] [Updated] (KUDU-3569) Data race in CFileSet::Iterator::OptimizePKPredicates()

2024-04-25 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3569:

Description: 
Running {{alter_table-randomized-test}} under TSAN produced data race warnings 
like the one below, indicating a race in 
{{CFileSet::Iterator::OptimizePKPredicates()}}.  One actor was 
{{tablet::AlterSchemaOp::Apply()}} initiated by AlterTable; the other was the 
maintenance thread running a major delta compaction.  Apparently, the same data 
race might also happen if the other concurrent actor were a thread handling a 
scan request containing IN-list predicates optimized at the DRS level.
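
A common remedy for this class of race (a sketch with illustrative names, not 
necessarily the fix that was actually committed) is to have each reader grab 
its own counted reference to the schema under a lock once, and use that copy 
for the rest of the operation, so a concurrent writer can swap the schema 
without destroying an object that is still in use:
{noformat}
#include <memory>
#include <mutex>

struct Schema { /* columns, indexes, ... */ };

class TabletMetadata {
 public:
  // Writer side: replaces the schema under the lock.  The old Schema is
  // destroyed only after every reader releases its own shared_ptr copy.
  void SetSchema(std::shared_ptr<Schema> schema) {
    std::lock_guard<std::mutex> l(lock_);
    schema_ = std::move(schema);
  }

  // Reader side: takes a counted reference under the same lock.
  std::shared_ptr<Schema> schema() const {
    std::lock_guard<std::mutex> l(lock_);
    return schema_;
  }

 private:
  mutable std::mutex lock_;
  std::shared_ptr<Schema> schema_;
};

// A reader such as OptimizePKPredicates() would then start with
//   std::shared_ptr<Schema> s = metadata.schema();
// and use *s for the whole operation instead of re-reading the member.
{noformat}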

{noformat}
WARNING: ThreadSanitizer: data race (pid=3919595)
  Write of size 8 at 0x7b44000f4a20 by thread T7:
#0 std::__1::__vector_base>::__destruct_at_end(unsigned long*) 
/root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:429:12 
(kudu+0x4d4080)
#1 std::__1::__vector_base>::clear() 
/root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:371:29 
(kudu+0x4d3f94)
#2 std::__1::__vector_base>::~__vector_base() 
/root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:465:9 
(kudu+0x4d3d4b)
#3 std::__1::vector 
>::~vector() 
/root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:557:5 
(kudu+0x4d1261)
#4 kudu::Schema::~Schema() 
/root/Projects/kudu/src/kudu/common/schema.h:491:7 (kudu+0x4cc40f)
#5 std::__1::__shared_ptr_emplace>::__on_zero_shared() 
/root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3503:23 
(libtablet.so+0x389d45)
#6 std::__1::__shared_count::__release_shared() 
/root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3341:9 
(kudu+0x4d4d05)
#7 std::__1::__shared_weak_count::__release_shared() 
/root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3383:27 
(kudu+0x4d4ca9)
#8 std::__1::shared_ptr::~shared_ptr() 
/root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:4098:19 
(kudu+0x5303e8)
#9 
kudu::tablet::TabletMetadata::SetSchema(std::__1::shared_ptr 
const&, unsigned int) 
/root/Projects/kudu/src/kudu/tablet/tablet_metadata.cc:957:1 
(libtablet.so+0x4d8882)
#10 kudu::tablet::Tablet::AlterSchema(kudu::tablet::AlterSchemaOpState*) 
/root/Projects/kudu/src/kudu/tablet/tablet.cc:1727:14 (libtablet.so+0x32720a)
#11 kudu::tablet::AlterSchemaOp::Apply(kudu::consensus::CommitMsg**) 
/root/Projects/kudu/src/kudu/tablet/ops/alter_schema_op.cc:127:3 
(libtablet.so+0x4013f8)
#12 kudu::tablet::OpDriver::ApplyTask() 
/root/Projects/kudu/src/kudu/tablet/ops/op_driver.cc:527:21 
(libtablet.so+0x40873a)
...

  Previous read of size 8 at 0x7b44000f4a20 by thread T22 (mutexes: write 
M799524414306809968, write M765184518688777856):
#0 std::__1::vector 
>::empty() const 
/root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:664:41 
(kudu+0x5ca926)
#1 kudu::Schema::initialized() const 
/root/Projects/kudu/src/kudu/common/schema.h:676:26 (kudu+0x5ca3fd)
#2 kudu::Schema::key_byte_size() const 
/root/Projects/kudu/src/kudu/common/schema.h:572:5 (libkudu_common.so+0x171eae)
#3 kudu::EncodedKey::DecodeEncodedString(kudu::Schema const&, kudu::Arena*, 
kudu::Slice const&, kudu::EncodedKey**) 
/root/Projects/kudu/src/kudu/common/encoded_key.cc:60:76 
(libkudu_common.so+0x171091)
#4 kudu::tablet::CFileSet::Iterator::OptimizePKPredicates(kudu::ScanSpec*) 
/root/Projects/kudu/src/kudu/tablet/cfile_set.cc:444:5 (libtablet.so+0x428934)
#5 kudu::tablet::CFileSet::Iterator::Init(kudu::ScanSpec*) 
/root/Projects/kudu/src/kudu/tablet/cfile_set.cc:410:3 (libtablet.so+0x4285d7)
#6 kudu::MaterializingIterator::Init(kudu::ScanSpec*) 
/root/Projects/kudu/src/kudu/common/generic_iterators.cc:1176:3 
(libkudu_common.so+0x178872)
#7 
kudu::tablet::MajorDeltaCompaction::FlushRowSetAndDeltas(kudu::fs::IOContext 
const*) /root/Projects/kudu/src/kudu/tablet/delta_compaction.cc:130:3 
(libtablet.so+0x54ca30)
#8 kudu::tablet::MajorDeltaCompaction::Compact(kudu::fs::IOContext const*) 
/root/Projects/kudu/src/kudu/tablet/delta_compaction.cc:340:3 
(libtablet.so+0x54ead0)
#9 
kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds(std::__1::vector > const&, kudu::fs::IOContext const*, 
kudu::tablet::HistoryGcOpts) 
/root/Projects/kudu/src/kudu/tablet/diskrowset.cc:588:3 (libtablet.so+0x46b38c)
#10 kudu::tablet::DiskRowSet::MajorCompactDeltaStores(kudu::fs::IOContext 
const*, kudu::tablet::HistoryGcOpts) 
/root/Projects/kudu/src/kudu/tablet/diskrowset.cc:572:10 (libtablet.so+0x46b033)
#11 
kudu::tablet::Tablet::CompactWorstDeltas(kudu::tablet::RowSet::DeltaCompactionType)
 /root/Projects/kudu/src/kudu/tablet/tablet.cc:2881:5 (libtablet.so+0x32d832)
#12 kudu::tablet::MajorDeltaCompactionOp::Perform() 
/root/Projects/kudu/src/kudu/tablet/tablet_mm_ops.cc:364:3 
(libtablet.so+0x3c0846)
#13 

[jira] [Created] (KUDU-3569) Data race in CFileSet::Iterator::OptimizePKPredicates()

2024-04-25 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3569:
---

 Summary: Data race in CFileSet::Iterator::OptimizePKPredicates()
 Key: KUDU-3569
 URL: https://issues.apache.org/jira/browse/KUDU-3569
 Project: Kudu
  Issue Type: Bug
  Components: tserver
Affects Versions: 1.17.0
Reporter: Alexey Serbin


Running {{alter_table-randomized-test}} under TSAN produced data race warnings 
like the one below, indicating a race in 
{{CFileSet::Iterator::OptimizePKPredicates()}}.  One actor was 
{{tablet::AlterSchemaOp::Apply()}} initiated by AlterTable; the other was the 
maintenance thread running a major delta compaction.  Apparently, the same data 
race might also happen if the other concurrent actor were a thread handling a 
scan request containing IN-list predicates optimized at the DRS level.

{noformat}
WARNING: ThreadSanitizer: data race (pid=3919595)
  Write of size 8 at 0x7b44000f4a20 by thread T7:
#0 std::__1::__vector_base
 >::__destruct_at_end(unsigned long*) /root/Projects/kudu/thirdparty/installed/t
san/include/c++/v1/vector:429:12 (kudu+0x4d4080)
#1 std::__1::__vector_base
 >::clear() /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:
371:29 (kudu+0x4d3f94)
#2 std::__1::__vector_base
 >::~__vector_base() /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v
1/vector:465:9 (kudu+0x4d3d4b)
#3 std::__1::vector >::~ve
ctor() /root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:557:5
 (kudu+0x4d1261)
#4 kudu::Schema::~Schema() /root/Projects/kudu/src/kudu/common/schema.h:491:
7 (kudu+0x4cc40f)
#5 std::__1::__shared_ptr_emplace >::__on_zero_shared() /root/Projects/kudu/thirdparty/installed/tsan/includ
e/c++/v1/memory:3503:23 (libtablet.so+0x389d45)
#6 std::__1::__shared_count::__release_shared() 
/root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3341:9 
(kudu+0x4d4d05)
#7 std::__1::__shared_weak_count::__release_shared() 
/root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3383:27 
(kudu+0x4d4ca9)
#8 std::__1::shared_ptr::~shared_ptr() 
/root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/memory:4098:19 
(kudu+0x5303e8)
#9 
kudu::tablet::TabletMetadata::SetSchema(std::__1::shared_ptr 
const&, unsigned int) 
/root/Projects/kudu/src/kudu/tablet/tablet_metadata.cc:957:1 
(libtablet.so+0x4d8882)
#10 kudu::tablet::Tablet::AlterSchema(kudu::tablet::AlterSchemaOpState*) 
/root/Projects/kudu/src/kudu/tablet/tablet.cc:1727:14 (libtablet.so+0x32720a)
#11 kudu::tablet::AlterSchemaOp::Apply(kudu::consensus::CommitMsg**) 
/root/Projects/kudu/src/kudu/tablet/ops/alter_schema_op.cc:127:3 
(libtablet.so+0x4013f8)
#12 kudu::tablet::OpDriver::ApplyTask() 
/root/Projects/kudu/src/kudu/tablet/ops/op_driver.cc:527:21 
(libtablet.so+0x40873a)
...

  Previous read of size 8 at 0x7b44000f4a20 by thread T22 (mutexes: write 
M799524414306809968, write M765184518688777856):
#0 std::__1::vector 
>::empty() const 
/root/Projects/kudu/thirdparty/installed/tsan/include/c++/v1/vector:664:41 
(kudu+0x5ca926)
#1 kudu::Schema::initialized() const 
/root/Projects/kudu/src/kudu/common/schema.h:676:26 (kudu+0x5ca3fd)
#2 kudu::Schema::key_byte_size() const 
/root/Projects/kudu/src/kudu/common/schema.h:572:5 (libkudu_common.so+0x171eae)
#3 kudu::EncodedKey::DecodeEncodedString(kudu::Schema const&, kudu::Arena*, 
kudu::Slice const&, kudu::EncodedKey**) 
/root/Projects/kudu/src/kudu/common/encoded_key.cc:60:76 
(libkudu_common.so+0x171091)
#4 kudu::tablet::CFileSet::Iterator::OptimizePKPredicates(kudu::ScanSpec*) 
/root/Projects/kudu/src/kudu/tablet/cfile_set.cc:444:5 (libtablet.so+0x428934)
#5 kudu::tablet::CFileSet::Iterator::Init(kudu::ScanSpec*) 
/root/Projects/kudu/src/kudu/tablet/cfile_set.cc:410:3 (libtablet.so+0x4285d7)
#6 kudu::MaterializingIterator::Init(kudu::ScanSpec*) 
/root/Projects/kudu/src/kudu/common/generic_iterators.cc:1176:3 
(libkudu_common.so+0x178872)
#7 
kudu::tablet::MajorDeltaCompaction::FlushRowSetAndDeltas(kudu::fs::IOContext 
const*) /root/Projects/kudu/src/kudu/tablet/delta_compaction.cc:130:3 
(libtablet.so+0x54ca30)
#8 kudu::tablet::MajorDeltaCompaction::Compact(kudu::fs::IOContext const*) 
/root/Projects/kudu/src/kudu/tablet/delta_compaction.cc:340:3 
(libtablet.so+0x54ead0)
#9 
kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds(std::__1::vector > const&, kudu::fs::IOContext const*, 
kudu::tablet::HistoryGcOpts) 
/root/Projects/kudu/src/kudu/tablet/diskrowset.cc:588:3 (libtablet.so+0x46b38c)
#10 kudu::tablet::DiskRowSet::MajorCompactDeltaStores(kudu::fs::IOContext 
const*, kudu::tablet::HistoryGcOpts) 
/root/Projects/kudu/src/kudu/tablet/diskrowset.cc:572:10 (libtablet.so+0x46b033)
#11 
kudu::tablet::Tablet::CompactWorstDeltas(kudu::tablet::RowSet::DeltaCompactionType)
 /root/Projects/kudu/src/kudu/tablet/tablet.cc:2881:5 

[jira] [Updated] (KUDU-3568) TestRowSetCompactionSkipWithBudgetingConstraints fails when run on some nodes

2024-04-25 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3568:

Affects Version/s: 1.18.0

> TestRowSetCompactionSkipWithBudgetingConstraints fails when run on some nodes
> -
>
> Key: KUDU-3568
> URL: https://issues.apache.org/jira/browse/KUDU-3568
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.18.0
>Reporter: Alexey Serbin
>Priority: Major
> Attachments: test-failure.log.xz
>
>
> The {{TestCompaction.TestRowSetCompactionSkipWithBudgetingConstraints}} 
> scenario fails with an error like the one below when run on a machine with a 
> relatively large amount of memory (it might be just a Docker instance with 
> very little memory actually allocated, but with access to the {{/proc}} 
> filesystem of the host machine).  The full test log is attached.
> {noformat}
> src/kudu/tablet/compaction-test.cc:908: Failure
> Value of: JoinStrings(sink.logged_msgs(), "\n")
> Expected: has substring "removed from compaction input due to memory 
> constraints"
>   Actual: "I20240425 10:13:05.497732 3573764 compaction-test.cc:902] 
> CompactRowSetsOp complete. Timing: real 0.673s\tuser 0.669s\tsys 0.004s 
> Metrics: 
> {\"bytes_written\":4817,\"cfile_cache_hit\":90,\"cfile_cache_hit_bytes\":4310,\"cfile_cache_miss\":330,\"cfile_cache_miss_bytes\":3794180,\"cfile_init\":41,\"delta_iterators_relevant\":40,\"dirs.queue_time_us\":503,\"dirs.run_cpu_time_us\":338,\"dirs.run_wall_time_us\":1780,\"drs_written\":1,\"lbm_read_time_us\":1951,\"lbm_reads_lt_1ms\":494,\"lbm_write_time_us\":1767,\"lbm_writes_lt_1ms\":132,\"mutex_wait_us\":189,\"num_input_rowsets\":10,\"peak_mem_usage\":2147727,\"rows_written\":20,\"thread_start_us\":242,\"threads_started\":5}"
>  (of type std::string)
> {noformat}
> For extra information, below are the first 10 lines of the {{/proc/meminfo}} 
> file on a node where the test failed:
> {noformat}
> # cat /proc/meminfo  | head -10
> MemTotal:   527417196 kB
> MemFree:96640684 kB
> MemAvailable:   363590980 kB
> Buffers:15352304 kB
> Cached: 246687576 kB
> SwapCached:  1294016 kB
> Active: 214889608 kB
> Inactive:   189745504 kB
> Active(anon):   133110648 kB
> Inactive(anon): 16977280 kB
> {noformat}





[jira] [Assigned] (KUDU-3568) TestRowSetCompactionSkipWithBudgetingConstraints fails when run on some nodes

2024-04-25 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-3568:
---

Assignee: Ashwani Raina

> TestRowSetCompactionSkipWithBudgetingConstraints fails when run on some nodes
> -
>
> Key: KUDU-3568
> URL: https://issues.apache.org/jira/browse/KUDU-3568
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.18.0
>Reporter: Alexey Serbin
>Assignee: Ashwani Raina
>Priority: Major
> Attachments: test-failure.log.xz
>
>
> The {{TestCompaction.TestRowSetCompactionSkipWithBudgetingConstraints}} 
> scenario fails with an error like the one below when run on a machine with a 
> relatively large amount of memory (it might be just a Docker instance with 
> very little memory actually allocated, but with access to the {{/proc}} 
> filesystem of the host machine).  The full test log is attached.
> {noformat}
> src/kudu/tablet/compaction-test.cc:908: Failure
> Value of: JoinStrings(sink.logged_msgs(), "\n")
> Expected: has substring "removed from compaction input due to memory 
> constraints"
>   Actual: "I20240425 10:13:05.497732 3573764 compaction-test.cc:902] 
> CompactRowSetsOp complete. Timing: real 0.673s\tuser 0.669s\tsys 0.004s 
> Metrics: 
> {\"bytes_written\":4817,\"cfile_cache_hit\":90,\"cfile_cache_hit_bytes\":4310,\"cfile_cache_miss\":330,\"cfile_cache_miss_bytes\":3794180,\"cfile_init\":41,\"delta_iterators_relevant\":40,\"dirs.queue_time_us\":503,\"dirs.run_cpu_time_us\":338,\"dirs.run_wall_time_us\":1780,\"drs_written\":1,\"lbm_read_time_us\":1951,\"lbm_reads_lt_1ms\":494,\"lbm_write_time_us\":1767,\"lbm_writes_lt_1ms\":132,\"mutex_wait_us\":189,\"num_input_rowsets\":10,\"peak_mem_usage\":2147727,\"rows_written\":20,\"thread_start_us\":242,\"threads_started\":5}"
>  (of type std::string)
> {noformat}
> For extra information, below are the first 10 lines of the {{/proc/meminfo}} 
> file on a node where the test failed:
> {noformat}
> # cat /proc/meminfo  | head -10
> MemTotal:   527417196 kB
> MemFree:96640684 kB
> MemAvailable:   363590980 kB
> Buffers:15352304 kB
> Cached: 246687576 kB
> SwapCached:  1294016 kB
> Active: 214889608 kB
> Inactive:   189745504 kB
> Active(anon):   133110648 kB
> Inactive(anon): 16977280 kB
> {noformat}





[jira] [Created] (KUDU-3568) TestRowSetCompactionSkipWithBudgetingConstraints fails when run on some nodes

2024-04-25 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3568:
---

 Summary: TestRowSetCompactionSkipWithBudgetingConstraints fails 
when run on some nodes
 Key: KUDU-3568
 URL: https://issues.apache.org/jira/browse/KUDU-3568
 Project: Kudu
  Issue Type: Bug
Reporter: Alexey Serbin
 Attachments: test-failure.log.xz

The {{TestCompaction.TestRowSetCompactionSkipWithBudgetingConstraints}} 
scenario fails with an error like the one below when run on a machine with a 
relatively large amount of memory (it might be just a Docker instance with very 
little memory actually allocated, but with access to the {{/proc}} filesystem 
of the host machine).  The full test log is attached.
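
For context, the budgeting behavior the assertion depends on can be sketched 
roughly as follows (a hypothetical illustration with made-up names, not Kudu's 
actual implementation): rowsets are dropped from the compaction input, 
producing the expected log message, only when the estimated memory footprint 
exceeds a budget derived from the memory available on the host, so on a box 
reporting ~500 GiB the skipping branch is never taken:
{noformat}
#include <cstdint>
#include <iostream>
#include <vector>

struct RowSet {
  int64_t estimated_peak_mem_bytes;
};

// Keep accepting rowsets while the running estimate fits the budget;
// log and skip the rest.  With a huge budget (high-memory host) nothing
// is ever skipped, so the message the test expects never appears.
std::vector<RowSet> SelectWithinBudget(const std::vector<RowSet>& input,
                                       int64_t budget_bytes) {
  std::vector<RowSet> selected;
  int64_t total = 0;
  for (const auto& rs : input) {
    if (total + rs.estimated_peak_mem_bytes > budget_bytes) {
      std::cout << "rowset removed from compaction input due to "
                   "memory constraints" << std::endl;
      continue;
    }
    total += rs.estimated_peak_mem_bytes;
    selected.push_back(rs);
  }
  return selected;
}
{noformat}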

{noformat}
src/kudu/tablet/compaction-test.cc:908: Failure
Value of: JoinStrings(sink.logged_msgs(), "\n")
Expected: has substring "removed from compaction input due to memory 
constraints"
  Actual: "I20240425 10:13:05.497732 3573764 compaction-test.cc:902] 
CompactRowSetsOp complete. Timing: real 0.673s\tuser 0.669s\tsys 0.004s 
Metrics: 
{\"bytes_written\":4817,\"cfile_cache_hit\":90,\"cfile_cache_hit_bytes\":4310,\"cfile_cache_miss\":330,\"cfile_cache_miss_bytes\":3794180,\"cfile_init\":41,\"delta_iterators_relevant\":40,\"dirs.queue_time_us\":503,\"dirs.run_cpu_time_us\":338,\"dirs.run_wall_time_us\":1780,\"drs_written\":1,\"lbm_read_time_us\":1951,\"lbm_reads_lt_1ms\":494,\"lbm_write_time_us\":1767,\"lbm_writes_lt_1ms\":132,\"mutex_wait_us\":189,\"num_input_rowsets\":10,\"peak_mem_usage\":2147727,\"rows_written\":20,\"thread_start_us\":242,\"threads_started\":5}"
 (of type std::string)
{noformat}

For extra information, below are the first 10 lines of the {{/proc/meminfo}} 
file on a node where the test failed:
{noformat}
# cat /proc/meminfo  | head -10
MemTotal:   527417196 kB
MemFree:96640684 kB
MemAvailable:   363590980 kB
Buffers:15352304 kB
Cached: 246687576 kB
SwapCached:  1294016 kB
Active: 214889608 kB
Inactive:   189745504 kB
Active(anon):   133110648 kB
Inactive(anon): 16977280 kB
{noformat}





[jira] [Created] (KUDU-3567) Resource leakage related to HashedWheelTimer in AsyncKuduScanner

2024-04-24 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3567:
---

 Summary: Resource leakage related to HashedWheelTimer in 
AsyncKuduScanner
 Key: KUDU-3567
 URL: https://issues.apache.org/jira/browse/KUDU-3567
 Project: Kudu
  Issue Type: Bug
  Components: client, java
Affects Versions: 1.18.0
Reporter: Alexey Serbin


With KUDU-3498 implemented in 
[8683b8bdb|https://github.com/apache/kudu/commit/8683b8bdb675db96aac52d75a31d00232f7b9fb8],
resource leak reports like the one below now show up.

Overall, the way {{HashedWheelTimer}} is used to keep scanners alive directly 
contradicts the recommendation at [this documentation 
page|https://netty.io/4.1/api/io/netty/util/HashedWheelTimer.html]:
{quote}*Do not create many instances.*

HashedWheelTimer creates a new thread whenever it is instantiated and started. 
Therefore, you should make sure to create only one instance and share it across 
your application. One of the common mistakes, that makes your application 
unresponsive, is to create a new instance for every connection.
{quote}

A better way of implementing the keep-alive feature for scanner objects in the 
Kudu Java client would probably be to reuse the {{HashedWheelTimer}} instance 
of the corresponding {{AsyncKuduClient}} instead of creating a new timer (along 
with its thread) per {{AsyncKuduScanner}} object.  At the very least, each 
{{HashedWheelTimer}} instance should be properly released/shut down when its 
{{AsyncKuduScanner}} object is garbage-collected, to avoid leaking resources 
(a running thread?).
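
The sharing approach could look roughly like the sketch below (illustrative 
only, with made-up names; not the actual Kudu client code): the client owns a 
single timer, every scanner schedules its keep-alive timeouts on it, and the 
timer is stopped exactly once when the client shuts down:
{noformat}
import io.netty.util.HashedWheelTimer;
import io.netty.util.Timer;
import java.util.concurrent.TimeUnit;

class KeepAliveExample {
  // One timer per client, shared by all scanners; creates a single thread.
  private final Timer timer = new HashedWheelTimer(20, TimeUnit.MILLISECONDS);

  void openScanner() {
    // Instead of "new HashedWheelTimer()" per scanner, schedule the
    // scanner's keep-alive task on the shared timer.
    timer.newTimeout(
        timeout -> { /* send the keep-alive request here */ },
        15, TimeUnit.SECONDS);
  }

  void shutdown() {
    // Stop the shared timer exactly once, releasing its worker thread;
    // this is what a per-scanner timer never gets today.
    timer.stop();
  }
}
{noformat}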

For example, below is how the leak is reported when running 
{{TestKuduClient.testStrings}}:

{noformat}
23:04:57.774 [ERROR - main] (ResourceLeakDetector.java:327) LEAK: 
HashedWheelTimer.release() was not called before it's garbage-collected. See 
https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records:
Created at:
  io.netty.util.HashedWheelTimer.<init>(HashedWheelTimer.java:312)
  io.netty.util.HashedWheelTimer.<init>(HashedWheelTimer.java:251)
  io.netty.util.HashedWheelTimer.<init>(HashedWheelTimer.java:224)
  io.netty.util.HashedWheelTimer.<init>(HashedWheelTimer.java:203)
  io.netty.util.HashedWheelTimer.<init>(HashedWheelTimer.java:185)
  org.apache.kudu.client.AsyncKuduScanner.<init>(AsyncKuduScanner.java:296)
  org.apache.kudu.client.AsyncKuduScanner.<init>(AsyncKuduScanner.java:431)
  
org.apache.kudu.client.KuduScanner$KuduScannerBuilder.build(KuduScanner.java:260)
  org.apache.kudu.client.TestKuduClient.testStrings(TestKuduClient.java:692)
  sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  java.lang.reflect.Method.invoke(Method.java:498)
  
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
  
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
  
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
  
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
  
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
  java.util.concurrent.FutureTask.run(FutureTask.java:266)
  java.lang.Thread.run(Thread.java:748)
{noformat}





[jira] [Updated] (KUDU-3194) testReadDataFrameAtSnapshot(org.apache.kudu.spark.kudu.DefaultSourceTest) sometimes fails

2024-04-24 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3194:

Affects Version/s: 1.17.0

> testReadDataFrameAtSnapshot(org.apache.kudu.spark.kudu.DefaultSourceTest) 
> sometimes fails
> -
>
> Key: KUDU-3194
> URL: https://issues.apache.org/jira/browse/KUDU-3194
> Project: Kudu
>  Issue Type: Bug
>  Components: client, test
>Affects Versions: 1.13.0, 1.14.0, 1.15.0, 1.16.0, 1.17.0
>Reporter: Alexey Serbin
>Priority: Major
> Attachments: test-output-20201125.txt.xz, test-output.txt.xz
>
>
> The test scenario sometimes fails.
> {noformat}  
> Time: 55.485
> There was 1 failure:
> 1) testReadDataFrameAtSnapshot(org.apache.kudu.spark.kudu.DefaultSourceTest)
> java.lang.AssertionError: expected:<100> but was:<99>
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.failNotEquals(Assert.java:835)
>   at org.junit.Assert.assertEquals(Assert.java:647)
>   at org.junit.Assert.assertEquals(Assert.java:633)
>   at 
> org.apache.kudu.spark.kudu.DefaultSourceTest.testReadDataFrameAtSnapshot(DefaultSourceTest.scala:784)
> FAILURES!!!
> Tests run: 30,  Failures: 1
> {noformat}
> The full log is attached (RELEASE build); the relevant stack trace looks like 
> the following:
> {noformat}
> 23:53:48.683 [ERROR - main] (RetryRule.java:219) 
> org.apache.kudu.spark.kudu.DefaultSourceTest.testReadDataFrameAtSnapshot: 
> failed attempt 1
> java.lang.AssertionError: expected:<100> but was:<99> 
>   
>   at org.junit.Assert.fail(Assert.java:89) ~[junit-4.13.jar:4.13] 
>   
>   at org.junit.Assert.failNotEquals(Assert.java:835) ~[junit-4.13.jar:4.13]   
>   
>   at org.junit.Assert.assertEquals(Assert.java:647) ~[junit-4.13.jar:4.13]
>   
>   at org.junit.Assert.assertEquals(Assert.java:633) ~[junit-4.13.jar:4.13]
>   
>   at 
> org.apache.kudu.spark.kudu.DefaultSourceTest.testReadDataFrameAtSnapshot(DefaultSourceTest.scala:784)
>  ~[test/:?]
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:1.8.0_141] 
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[?:1.8.0_141]
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:1.8.0_141]
>   at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_141]  
>   
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>  ~[junit-4.13.jar:4.13]
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  ~[junit-4.13.jar:4.13]
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>  ~[junit-4.13.jar:4.13]
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  ~[junit-4.13.jar:4.13]
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 
> ~[junit-4.13.jar:4.13]
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> ~[junit-4.13.jar:4.13]
>   at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54) 
> ~[junit-4.13.jar:4.13]
>   at 
> org.apache.kudu.test.junit.RetryRule$RetryStatement.doOneAttempt(RetryRule.java:217)
>  [kudu-test-utils-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT]
>   at 
> org.apache.kudu.test.junit.RetryRule$RetryStatement.evaluate(RetryRule.java:234)
>  [kudu-test-utils-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT]
>   at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) 
> [junit-4.13.jar:4.13]
>   at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>  [junit-4.13.jar:4.13]
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) 
> [junit-4.13.jar:4.13]
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
>  [junit-4.13.jar:4.13]
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
>  [junit-4.13.jar:4.13]
>   at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) 
> [junit-4.13.jar:4.13]
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) 
> [junit-4.13.jar:4.13]
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) 
> [junit-4.13.jar:4.13]
>   at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) 
> [junit-4.13.jar:4.13]
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) 
> [junit-4.13.jar:4.13]
>   at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) 
> [junit-4.13.jar:4.13]
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:413) 
> [junit-4.13.jar:4.13]
>   at 
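The failing assertion compares an expected row count (100) against what a
snapshot read returns (99).  Below is a minimal sketch, using the Kudu C++
client, of the kind of read the test exercises (the test itself goes through
kudu-spark; the function and variable names here are illustrative only): a
READ_AT_SNAPSHOT scan pinned to a timestamp captured after N rows were written
should observe exactly N rows, so a 99-vs-100 result points at a
snapshot-consistency problem.
{noformat}
// Illustrative sketch only: count rows visible at a given snapshot timestamp
// using the Kudu C++ client.  The client handle, table name, and timestamp
// are assumed to be supplied by the caller.
#include <kudu/client/client.h>

#include <cstdint>
#include <string>

using kudu::Status;
using kudu::client::KuduClient;
using kudu::client::KuduScanBatch;
using kudu::client::KuduScanner;
using kudu::client::KuduTable;

Status CountRowsAtSnapshot(const kudu::client::sp::shared_ptr<KuduClient>& client,
                           const std::string& table_name,
                           uint64_t snapshot_micros,
                           int64_t* row_count) {
  kudu::client::sp::shared_ptr<KuduTable> table;
  KUDU_RETURN_NOT_OK(client->OpenTable(table_name, &table));
  KuduScanner scanner(table.get());
  // Pin the scan to the given snapshot timestamp.
  KUDU_RETURN_NOT_OK(scanner.SetReadMode(KuduScanner::READ_AT_SNAPSHOT));
  KUDU_RETURN_NOT_OK(scanner.SetSnapshotMicros(snapshot_micros));
  KUDU_RETURN_NOT_OK(scanner.Open());
  int64_t count = 0;
  KuduScanBatch batch;
  while (scanner.HasMoreRows()) {
    KUDU_RETURN_NOT_OK(scanner.NextBatch(&batch));
    count += batch.NumRows();
  }
  *row_count = count;
  return Status::OK();
}
{noformat}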

[jira] [Updated] (KUDU-3194) testReadDataFrameAtSnapshot(org.apache.kudu.spark.kudu.DefaultSourceTest) sometimes fails

2024-04-24 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3194:

Issue Type: Bug  (was: Improvement)

> testReadDataFrameAtSnapshot(org.apache.kudu.spark.kudu.DefaultSourceTest) 
> sometimes fails
> -
>
> Key: KUDU-3194
> URL: https://issues.apache.org/jira/browse/KUDU-3194
> Project: Kudu
>  Issue Type: Bug
>  Components: client, test
>Affects Versions: 1.13.0, 1.14.0, 1.15.0, 1.16.0
>Reporter: Alexey Serbin
>Priority: Major
> Attachments: test-output-20201125.txt.xz, test-output.txt.xz
>
>
> The test scenario sometimes fails.
> {noformat}  
> Time: 55.485
> There was 1 failure:
> 1) testReadDataFrameAtSnapshot(org.apache.kudu.spark.kudu.DefaultSourceTest)
> java.lang.AssertionError: expected:<100> but was:<99>
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.failNotEquals(Assert.java:835)
>   at org.junit.Assert.assertEquals(Assert.java:647)
>   at org.junit.Assert.assertEquals(Assert.java:633)
>   at 
> org.apache.kudu.spark.kudu.DefaultSourceTest.testReadDataFrameAtSnapshot(DefaultSourceTest.scala:784)
> FAILURES!!!
> Tests run: 30,  Failures: 1
> {noformat}
> The full log is attached (RELEASE build); the relevant stack trace looks like 
> the following:
> {noformat}
> 23:53:48.683 [ERROR - main] (RetryRule.java:219) 
> org.apache.kudu.spark.kudu.DefaultSourceTest.testReadDataFrameAtSnapshot: 
> failed attempt 1
> java.lang.AssertionError: expected:<100> but was:<99> 
>   
>   at org.junit.Assert.fail(Assert.java:89) ~[junit-4.13.jar:4.13] 
>   
>   at org.junit.Assert.failNotEquals(Assert.java:835) ~[junit-4.13.jar:4.13]   
>   
>   at org.junit.Assert.assertEquals(Assert.java:647) ~[junit-4.13.jar:4.13]
>   
>   at org.junit.Assert.assertEquals(Assert.java:633) ~[junit-4.13.jar:4.13]
>   
>   at 
> org.apache.kudu.spark.kudu.DefaultSourceTest.testReadDataFrameAtSnapshot(DefaultSourceTest.scala:784)
>  ~[test/:?]
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:1.8.0_141] 
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[?:1.8.0_141]
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:1.8.0_141]
>   at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_141]  
>   
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>  ~[junit-4.13.jar:4.13]
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  ~[junit-4.13.jar:4.13]
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>  ~[junit-4.13.jar:4.13]
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  ~[junit-4.13.jar:4.13]
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 
> ~[junit-4.13.jar:4.13]
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> ~[junit-4.13.jar:4.13]
>   at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54) 
> ~[junit-4.13.jar:4.13]
>   at 
> org.apache.kudu.test.junit.RetryRule$RetryStatement.doOneAttempt(RetryRule.java:217)
>  [kudu-test-utils-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT]
>   at 
> org.apache.kudu.test.junit.RetryRule$RetryStatement.evaluate(RetryRule.java:234)
>  [kudu-test-utils-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT]
>   at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) 
> [junit-4.13.jar:4.13]
>   at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>  [junit-4.13.jar:4.13]
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) 
> [junit-4.13.jar:4.13]
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
>  [junit-4.13.jar:4.13]
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
>  [junit-4.13.jar:4.13]
>   at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) 
> [junit-4.13.jar:4.13]
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) 
> [junit-4.13.jar:4.13]
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) 
> [junit-4.13.jar:4.13]
>   at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) 
> [junit-4.13.jar:4.13]
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) 
> [junit-4.13.jar:4.13]
>   at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) 
> [junit-4.13.jar:4.13]
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:413) 
> [junit-4.13.jar:4.13]
>   

[jira] [Updated] (KUDU-3566) Incorrect semantics for Prometheus-style histogram metrics

2024-04-19 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3566:

Status: In Review  (was: Open)

> Incorrect semantics for Prometheus-style histogram metrics
> --
>
> Key: KUDU-3566
> URL: https://issues.apache.org/jira/browse/KUDU-3566
> Project: Kudu
>  Issue Type: Bug
>  Components: master, tserver
>Affects Versions: 1.17.0
>Reporter: Alexey Serbin
>Priority: Major
>  Labels: metrics, observability
>
> Original KUDU-3375 implementation incorrectly exposes [summary-type 
> Prometheus metrics|https://prometheus.io/docs/concepts/metric_types/#summary] 
> as [histogram-type 
> ones|https://prometheus.io/docs/concepts/metric_types/#histogram] for data 
> collected by corresponding HDR histograms.  For example, below are snippets 
> from {{/metrics}} and {{/metrics_prometheus}} for statistics on ListMasters 
> RPC.
> The data exposed as Prometheus-style histogram metrics should have been 
> reported as summary metrics instead.
> JSON-style:
> {noformat}
> {   
> "name": "handler_latency_kudu_master_MasterService_ListMasters",  
>   "total_count": 26,
> "min": 152,
> "mean": 301.2692307692308,
> "percentile_75": 324,
> "percentile_95": 468,
> "percentile_99": 844,
> "percentile_99_9": 844,
> "percentile_99_99": 844,
> "max": 844,
> "total_sum": 7833
> }
> {noformat}
> Prometheus-style counterpart:
> {noformat}
> # HELP kudu_master_handler_latency_kudu_master_MasterService_ListMasters 
> Microseconds spent handling kudu.master.MasterService.ListMasters RPC requests
> # TYPE kudu_master_handler_latency_kudu_master_MasterService_ListMasters 
> histogram
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
>  le="0.75"} 324
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
>  le="0.95"} 468
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
>  le="0.99"} 844
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
>  le="0.999"} 844
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
>  le="0."} 844
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
>  le="+Inf"} 26
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_sum{unit_type="microseconds"}
>  7833
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_count{unit_type="microseconds"}
>  26
> {noformat}
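To illustrate the mismatch: for HDR-derived percentiles like the ones above,
correct Prometheus semantics would use the summary type with {{quantile}}
labels, rather than histogram {{le}} buckets (whose values must be cumulative
observation counts, not latency percentiles).  A hand-written sketch of what
summary-style output for the same data could look like (not actual Kudu
output):
{noformat}
# TYPE kudu_master_handler_latency_kudu_master_MasterService_ListMasters summary
kudu_master_handler_latency_kudu_master_MasterService_ListMasters{unit_type="microseconds",quantile="0.75"} 324
kudu_master_handler_latency_kudu_master_MasterService_ListMasters{unit_type="microseconds",quantile="0.95"} 468
kudu_master_handler_latency_kudu_master_MasterService_ListMasters{unit_type="microseconds",quantile="0.99"} 844
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_sum{unit_type="microseconds"} 7833
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_count{unit_type="microseconds"} 26
{noformat}
As a sanity check, 7833 / 26 ≈ 301.3 microseconds, matching the {{mean}} field
in the JSON snippet.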





[jira] [Updated] (KUDU-3566) Incorrect semantics for Prometheus-style histogram metrics

2024-04-19 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3566:

Code Review: https://gerrit.cloudera.org/#/c/21338/

> Incorrect semantics for Prometheus-style histogram metrics
> --
>
> Key: KUDU-3566
> URL: https://issues.apache.org/jira/browse/KUDU-3566
> Project: Kudu
>  Issue Type: Bug
>  Components: master, tserver
>Affects Versions: 1.17.0
>Reporter: Alexey Serbin
>Priority: Major
>  Labels: metrics, observability
>
> Original KUDU-3375 implementation incorrectly exposes [summary-type 
> Prometheus metrics|https://prometheus.io/docs/concepts/metric_types/#summary] 
> as [histogram-type 
> ones|https://prometheus.io/docs/concepts/metric_types/#histogram] for data 
> collected by corresponding HDR histograms.  For example, below are snippets 
> from {{/metrics}} and {{/metrics_prometheus}} for statistics on ListMasters 
> RPC.
> The data exposed as Prometheus-style histogram metrics should have been 
> reported as summary metrics instead.
> JSON-style:
> {noformat}
> {   
> "name": "handler_latency_kudu_master_MasterService_ListMasters",  
>   "total_count": 26,
> "min": 152,
> "mean": 301.2692307692308,
> "percentile_75": 324,
> "percentile_95": 468,
> "percentile_99": 844,
> "percentile_99_9": 844,
> "percentile_99_99": 844,
> "max": 844,
> "total_sum": 7833
> }
> {noformat}
> Prometheus-style counterpart:
> {noformat}
> # HELP kudu_master_handler_latency_kudu_master_MasterService_ListMasters 
> Microseconds spent handling kudu.master.MasterService.ListMasters RPC requests
> # TYPE kudu_master_handler_latency_kudu_master_MasterService_ListMasters 
> histogram
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
>  le="0.75"} 324
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
>  le="0.95"} 468
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
>  le="0.99"} 844
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
>  le="0.999"} 844
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
>  le="0."} 844
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
>  le="+Inf"} 26
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_sum{unit_type="microseconds"}
>  7833
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_count{unit_type="microseconds"}
>  26
> {noformat}





[jira] [Updated] (KUDU-3566) Incorrect semantics for Prometheus-style histogram metrics

2024-04-19 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3566:

Description: 
Original KUDU-3375 implementation incorrectly exposes [summary-type Prometheus 
metrics|https://prometheus.io/docs/concepts/metric_types/#summary] as 
[histogram-type 
ones|https://prometheus.io/docs/concepts/metric_types/#histogram] for data 
collected by corresponding HDR histograms.  For example, below are snippets 
from {{/metrics}} and {{/metrics_prometheus}} for statistics on ListMasters RPC.

The data exposed as Prometheus-style histogram metrics should have been 
reported as summary metrics instead.

JSON-style:
{noformat}
{   
"name": "handler_latency_kudu_master_MasterService_ListMasters",
"total_count": 26,
"min": 152,
"mean": 301.2692307692308,
"percentile_75": 324,
"percentile_95": 468,
"percentile_99": 844,
"percentile_99_9": 844,
"percentile_99_99": 844,
"max": 844,
"total_sum": 7833
}
{noformat}

Prometheus-style counterpart:
{noformat}
# HELP kudu_master_handler_latency_kudu_master_MasterService_ListMasters 
Microseconds spent handling kudu.master.MasterService.ListMasters RPC requests
# TYPE kudu_master_handler_latency_kudu_master_MasterService_ListMasters 
histogram
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
 le="0.75"} 324
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
 le="0.95"} 468
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
 le="0.99"} 844
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
 le="0.999"} 844
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
 le="0."} 844
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
 le="+Inf"} 26
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_sum{unit_type="microseconds"}
 7833
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_count{unit_type="microseconds"}
 26
{noformat}


  was:
Original KUDU-3375 implementation incorrectly exposes [summary-type Prometheus 
metrics|https://prometheus.io/docs/concepts/metric_types/#summary] as 
[histogram-type 
ones|https://prometheus.io/docs/concepts/metric_types/#histogram] for data 
collected by corresponding HDR histograms.  For example, below are snippets 
from {{/metrics}} and {{/metrics_prometheus}} for statistics on ListMasters RPC.

JSON-style:
{noformat}
{   
"name": "handler_latency_kudu_master_MasterService_ListMasters",
"total_count": 26,
"min": 152,
"mean": 301.2692307692308,
"percentile_75": 324,
"percentile_95": 468,
"percentile_99": 844,
"percentile_99_9": 844,
"percentile_99_99": 844,
"max": 844,
"total_sum": 7833
}
{noformat}

Prometheus-style counterpart:
{noformat}
# HELP kudu_master_handler_latency_kudu_master_MasterService_ListMasters 
Microseconds spent handling kudu.master.MasterService.ListMasters RPC requests
# TYPE kudu_master_handler_latency_kudu_master_MasterService_ListMasters 
histogram
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
 le="0.75"} 324
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
 le="0.95"} 468
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
 le="0.99"} 844
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
 le="0.999"} 844
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
 le="0."} 844
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
 le="+Inf"} 26
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_sum{unit_type="microseconds"}
 7833
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_count{unit_type="microseconds"}
 26
{noformat}



> Incorrect semantics for Prometheus-style histogram metrics
> --
>
> Key: KUDU-3566
> URL: https://issues.apache.org/jira/browse/KUDU-3566
> Project: Kudu
>  Issue Type: Bug
>  Components: master, tserver
>Affects Versions: 1.17.0
>Reporter: Alexey Serbin
>Priority: Major
>  Labels: metrics, observability
>
> Original KUDU-3375 implementation incorrectly exposes [summary-type 
> Prometheus metrics|https://prometheus.io/docs/concepts/metric_types/#summary] 
> as [histogram-type 
> ones|https://prometheus.io/docs/concepts/metric_types/#histogram] for data 
> collected by corresponding HDR 

[jira] [Updated] (KUDU-3566) Incorrect semantics for Prometheus-style histogram metrics

2024-04-19 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3566:

Summary: Incorrect semantics for Prometheus-style histogram metrics  (was: 
Incorrect semantics of Prometheus-style histogram-type metrics)

> Incorrect semantics for Prometheus-style histogram metrics
> --
>
> Key: KUDU-3566
> URL: https://issues.apache.org/jira/browse/KUDU-3566
> Project: Kudu
>  Issue Type: Bug
>  Components: master, tserver
>Affects Versions: 1.17.0
>Reporter: Alexey Serbin
>Priority: Major
>  Labels: metrics, observability
>
> Original KUDU-3375 implementation incorrectly exposes [summary-type 
> Prometheus metrics|https://prometheus.io/docs/concepts/metric_types/#summary] 
> as [histogram-type 
> ones|https://prometheus.io/docs/concepts/metric_types/#histogram] for data 
> collected by corresponding HDR histograms.  For example, below are snippets 
> from {{/metrics}} and {{/metrics_prometheus}} for statistics on ListMasters 
> RPC.
> JSON-style:
> {noformat}
> {   
> "name": "handler_latency_kudu_master_MasterService_ListMasters",  
>   "total_count": 26,
> "min": 152,
> "mean": 301.2692307692308,
> "percentile_75": 324,
> "percentile_95": 468,
> "percentile_99": 844,
> "percentile_99_9": 844,
> "percentile_99_99": 844,
> "max": 844,
> "total_sum": 7833
> }
> {noformat}
> Prometheus-style counterpart:
> {noformat}
> # HELP kudu_master_handler_latency_kudu_master_MasterService_ListMasters 
> Microseconds spent handling kudu.master.MasterService.ListMasters RPC requests
> # TYPE kudu_master_handler_latency_kudu_master_MasterService_ListMasters 
> histogram
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
>  le="0.75"} 324
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
>  le="0.95"} 468
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
>  le="0.99"} 844
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
>  le="0.999"} 844
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
>  le="0."} 844
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
>  le="+Inf"} 26
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_sum{unit_type="microseconds"}
>  7833
> kudu_master_handler_latency_kudu_master_MasterService_ListMasters_count{unit_type="microseconds"}
>  26
> {noformat}





[jira] [Created] (KUDU-3566) Incorrect semantics of Prometheus-style histogram-type metrics

2024-04-19 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3566:
---

 Summary: Incorrect semantics of Prometheus-style histogram-type 
metrics
 Key: KUDU-3566
 URL: https://issues.apache.org/jira/browse/KUDU-3566
 Project: Kudu
  Issue Type: Bug
  Components: master, tserver
Affects Versions: 1.17.0
Reporter: Alexey Serbin


Original KUDU-3375 implementation incorrectly exposes [summary-type Prometheus 
metrics|https://prometheus.io/docs/concepts/metric_types/#summary] as 
[histogram-type 
ones|https://prometheus.io/docs/concepts/metric_types/#histogram] for data 
collected by corresponding HDR histograms.  For example, below are snippets 
from {{/metrics}} and {{/metrics_prometheus}} for statistics on ListMasters RPC.

JSON-style:
{noformat}
{   
"name": "handler_latency_kudu_master_MasterService_ListMasters",
"total_count": 26,
"min": 152,
"mean": 301.2692307692308,
"percentile_75": 324,
"percentile_95": 468,
"percentile_99": 844,
"percentile_99_9": 844,
"percentile_99_99": 844,
"max": 844,
"total_sum": 7833
}
{noformat}

Prometheus-style counterpart:
{noformat}
# HELP kudu_master_handler_latency_kudu_master_MasterService_ListMasters 
Microseconds spent handling kudu.master.MasterService.ListMasters RPC requests
# TYPE kudu_master_handler_latency_kudu_master_MasterService_ListMasters 
histogram
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
 le="0.75"} 324
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
 le="0.95"} 468
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
 le="0.99"} 844
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
 le="0.999"} 844
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
 le="0."} 844
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_bucket{unit_type="microseconds",
 le="+Inf"} 26
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_sum{unit_type="microseconds"}
 7833
kudu_master_handler_latency_kudu_master_MasterService_ListMasters_count{unit_type="microseconds"}
 26
{noformat}






[jira] [Resolved] (KUDU-3518) node error when impala query

2024-04-16 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3518.
-
Fix Version/s: 1.18.0
   1.17.1
   Resolution: Fixed

> node error when impala query
> 
>
> Key: KUDU-3518
> URL: https://issues.apache.org/jira/browse/KUDU-3518
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.17.0
> Environment: centos7.9
>Reporter: Pain Sun
>Priority: Major
> Fix For: 1.18.0, 1.17.1
>
> Attachments: profile_error_1.17.txt, profile_success_1.16.txt, 
> profile_success_1.17.txt
>
>
> Scanning Kudu with Impala 4.3.0 hits a bug when reading a table with an 
> empty string in a primary key field.
> sql:
> select
>     count(distinct thirdnick)
> from
>     member.qyexternaluserdetailinfo_new
> where
>     (
>         mainshopnick = "xxx"
>         and ownercorpid in ("xxx", "")
>         and shoptype not in ("35", "56")
>         and isDelete = 0
>         and thirdnick != ""
>         and thirdnick is not null
>     );
>  
> The error is: ERROR: Unable to open scanner for node with id '1' for Kudu table 
> 'impala::member.qyexternaluserdetailinfo_new': Invalid argument: No such 
> column: shopnick
>  
> If the SQL is updated like this:
> select
>     count(distinct thirdnick)
> from
>     member.qyexternaluserdetailinfo_new
> where
>     (
>         mainshopnick = "xxx"
>         and ownercorpid in ("xxx", "")
>         and shopnick not in ('')
>         and shoptype not in ("35", "56")
>         and isDelete = 0
>         and thirdnick != ""
>         and thirdnick is not null
>     );
> there is no error.
>  
> This error appears in kudu-1.17.0, but kudu-1.16.0 is fine.
>  
> There are 100 items in this table; 28 of them have an empty string in a key 
> column.
> The table schema looks like this:
> +----------------+-----------+---------+-------------+------------+----------+---------------+---------------+---------------------+------------+
> | name           | type      | comment | primary_key | key_unique | nullable | default_value | encoding      | compression         | block_size |
> +----------------+-----------+---------+-------------+------------+----------+---------------+---------------+---------------------+------------+
> | mainshopnick   | string    |         | true        | true       | false    |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | shopnick       | string    |         | true        | true       | false    |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | ownercorpid    | string    |         | true        | true       | false    |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | shoptype       | string    |         | true        | true       | false    |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | clientid       | string    |         | true        | true       | false    |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | thirdnick      | string    |         | true        | true       | false    |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | id             | bigint    |         | true        | true       | false    |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | receivermobile | string    |         | false       |            | true     |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | thirdrealname  | string    |         | false       |            | true     |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | remark         | string    |         | false       |            | true     |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | createtime     | timestamp |         | false       |            | true     |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | updatetime     | timestamp |         | false       |            | true     |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | isdelete       | int       |         | false       |            | true     | 0             | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | buyernick      | string    |         | false       |            | true     |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> +----------------+-----------+---------+-------------+------------+----------+---------------+---------------+---------------------+------------+
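For reference, the failing query's predicates map to Kudu client predicates
roughly like the sketch below (Kudu C++ client; the table handle and whether
this exact predicate shape triggers the "No such column" error are assumptions
for illustration).  The key detail from the report is an IN-list that contains
an empty string on a primary-key column:
{noformat}
// Illustrative sketch only: build a scan whose IN-list predicate contains an
// empty string on a primary-key column, mirroring the
// `ownercorpid in ("xxx", "")` predicate from the failing query above.
#include <kudu/client/client.h>
#include <kudu/client/value.h>

#include <vector>

using kudu::Status;
using kudu::client::KuduPredicate;
using kudu::client::KuduScanner;
using kudu::client::KuduTable;
using kudu::client::KuduValue;

Status ScanWithEmptyStringInList(const kudu::client::sp::shared_ptr<KuduTable>& table) {
  std::vector<KuduValue*> values;
  values.push_back(KuduValue::CopyString("xxx"));
  values.push_back(KuduValue::CopyString(""));  // empty string in a PK column
  // NewInListPredicate() takes ownership of the values; the scanner takes
  // ownership of the returned predicate.
  KuduPredicate* pred = table->NewInListPredicate("ownercorpid", &values);
  KuduScanner scanner(table.get());
  KUDU_RETURN_NOT_OK(scanner.AddConjunctPredicate(pred));
  KUDU_RETURN_NOT_OK(scanner.Open());
  return Status::OK();
}
{noformat}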





[jira] [Updated] (KUDU-3565) Avoid scheduling multiple IO-heavy maintenance operations at once for the same tablet

2024-04-10 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3565:

Description: 
The current implementation of scheduling maintenance operations inadvertently 
induces much more disk IO demand than necessary.  The maintenance manager 
thread can schedule multiple IO-heavy operations _on the same tablet_ to run 
concurrently when there is more than one maintenance worker thread.  Sometimes 
this creates bottlenecks by saturating disk IO throughput, since the 
maintenance worker threads access data backed by the same set of drives when 
working on the same tablet.

Apparently, such behavior doesn't scale well when increasing the number of 
maintenance worker threads, especially when data directories are backed by 
spinning drives (HDD).

With four maintenance threads I saw logs like the ones below.  Notice that 
when four operations are scheduled to run almost at the same time, each takes 
more than 3 seconds to run.  Other instances of similar flush operations on 
the same tablet take less than 100ms to complete when no other flush operation 
on the same tablet is running.  Notice that not only is fdatasync_us higher 
for concurrently running operations, but some of them are now waiting on 
synchronization primitives, blocking the maintenance threads from doing other 
useful work, e.g. compacting/flushing data of other active tablets.  In the 
case of MRS/DMS flushes, the flusher that acquires the corresponding 
synchronization primitives as a writer blocks concurrently running 
insert/update/delete operations, since they acquire the same synchronization 
primitives as readers when checking for a primary key's presence.  So, 
long-running flush operations induce extra latency into concurrently running 
write workloads.

It's necessary to improve the scheduling algorithm for maintenance operations, 
differentiating operations not only by their score, but also taking into 
account their IO impact and the dependencies between them.  As the simplest 
approach, it's worth adding an option to enable the following behavior (a 
minimal sketch of such a guard follows the log snippet below):
* an IO-heavy operation (such as a flush, a merge compaction, etc.) is not 
scheduled while there is another one scheduled to run or already running on the 
same tablet

{noformat}
I0328 14:36:18.732758 1279870 maintenance_manager.cc:382] P 
db78de4070874e75be1482b02b5161e6: Scheduling 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf 
score=0.035118
I0328 14:36:18.816906 1278984 maintenance_manager.cc:603] P 
db78de4070874e75be1482b02b5161e6: 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 
0.084s user 0.017s sys 0.005s Metrics: 
{"bytes_written":33228,"delete_count":0,"fdatasync":3,"fdatasync_us":56207,"lbm_write_time_us":115,",lbm_writes_lt_1ms":6,"reinsert_count":0,"update_count":7500}
...
I0328 14:36:18.848975 1279870 maintenance_manager.cc:382] P 
db78de4070874e75be1482b02b5161e6: Scheduling 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035150
I0328 14:36:18.962837 1279870 maintenance_manager.cc:382] P 
db78de4070874e75be1482b02b5161e6: Scheduling 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035182
I0328 14:36:19.073328 1279870 maintenance_manager.cc:382] P 
db78de4070874e75be1482b02b5161e6: Scheduling 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035214
I0328 14:36:19.182384 1279870 maintenance_manager.cc:382] P 
db78de4070874e75be1482b02b5161e6: Scheduling 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035244
...
I0328 14:36:22.128937 1278984 maintenance_manager.cc:603] P 
db78de4070874e75be1482b02b5161e6: 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 
3.280s user 0.021s sys 0.005s Metrics: 
{"bytes_written":35275,"delete_count":0,"fdatasync":3,"fdatasync_us":167953,"lbm_write_time_us":124,"lbm_writes_lt_1ms":7,"reinsert_count":0,"update_count":7896}
I0328 14:36:22.241256 1278985 maintenance_manager.cc:603] P 
db78de4070874e75be1482b02b5161e6: 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 
3.278s user 0.021s sys 0.000s Metrics: 
{"bytes_written":28910,"delete_count":0,"fdatasync":3,"fdatasync_us":99705,"lbm_write_time_us":123,"lbm_writes_lt_1ms":6,"mutex_wait_us":3081560,"reinsert_count":0,"update_count":6381}
...
I0328 14:36:22.292019 1278986 maintenance_manager.cc:603] P 
db78de4070874e75be1482b02b5161e6: 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 
3.219s user 0.018s sys 0.003s Metrics: 
{"bytes_written":30826,"delete_count":0,"fdatasync":3,"fdatasync_us":84332,"lbm_write_time_us":65,"lbm_writes_lt_1ms":6,"mutex_wait_us":3106656,"reinsert_count":0,"update_count":6915}
I0328 14:36:22.341326 1278983 maintenance_manager.cc:603] P 
db78de4070874e75be1482b02b5161e6: 
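To make the proposed behavior concrete, here is a minimal illustrative sketch
(hypothetical names; not the actual MaintenanceManager code) of a per-tablet
guard the scheduler could consult before launching an IO-heavy operation:
{noformat}
// Illustrative sketch only: a registry of tablets that currently have an
// IO-heavy maintenance operation scheduled or running.
#include <mutex>
#include <string>
#include <unordered_set>

class IoHeavyOpGuard {
 public:
  // Returns true and records the tablet if no IO-heavy op is in flight for
  // it; returns false if one is already scheduled or running.
  bool TryAcquire(const std::string& tablet_id) {
    std::lock_guard<std::mutex> l(lock_);
    return busy_tablets_.insert(tablet_id).second;
  }

  // Called from the operation's completion path.
  void Release(const std::string& tablet_id) {
    std::lock_guard<std::mutex> l(lock_);
    busy_tablets_.erase(tablet_id);
  }

 private:
  std::mutex lock_;
  std::unordered_set<std::string> busy_tablets_;
};
{noformat}
With the option enabled, the scheduler would call TryAcquire() before
launching a flush or compaction for a tablet, skip the operation when it
returns false, and call Release() once the operation completes.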

[jira] [Created] (KUDU-3565) Avoid scheduling multiple IO-heavy maintenance operations at once for the same tablet

2024-04-10 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3565:
---

 Summary: Avoid scheduling multiple IO-heavy maintenance operations 
at once for the same tablet
 Key: KUDU-3565
 URL: https://issues.apache.org/jira/browse/KUDU-3565
 Project: Kudu
  Issue Type: Improvement
  Components: master, tserver
Affects Versions: 1.17.0, 1.16.0, 1.15.0, 1.14.0, 1.13.0, 1.11.1, 1.12.0, 
1.11.0, 1.10.1, 1.10.0, 1.9.0, 1.7.1, 1.8.0, 1.7.0, 1.6.0, 1.5.0, 1.4.0, 1.3.1, 
1.3.0, 1.2.0, 1.1.0, 1.0.1, 1.0.0
Reporter: Alexey Serbin


The current implementation of scheduling maintenance operations inadvertently 
induces much more disk IO demand than necessary.  The maintenance manager 
thread can schedule multiple IO-heavy operations _on the same tablet_ to run 
concurrently when there is more than one maintenance worker thread.  Sometimes 
this creates bottlenecks by saturating disk IO throughput, since the 
maintenance worker threads access data backed by the same set of drives when 
working on the same tablet.

Apparently, such behavior doesn't scale well when increasing the number of 
maintenance worker threads, especially when data directories are backed by 
spinning drives (HDD).

With four maintenance threads I saw logs like the ones below.  Notice that 
when four operations are scheduled to run almost at the same time, each takes 
more than 3 seconds to run.  Other instances of similar flush operations on 
the same tablet take less than 100ms to complete when no other flush operation 
on the same tablet is running.  Notice that not only is fdatasync_us higher 
for concurrently running operations, but some of them are now waiting on 
synchronization primitives, blocking the maintenance threads from doing other 
useful work, e.g. compacting/flushing data of other active tablets.  In the 
case of MRS/DMS flushes, the flusher that acquires the corresponding 
synchronization primitives as a writer blocks concurrently running 
insert/update/delete operations, since they acquire the same synchronization 
primitives as readers when checking for a primary key's presence.  So, 
long-running flush operations induce extra latency into concurrently running 
write workloads.

It's necessary to improve the scheduling algorithm for maintenance operations, 
differentiating operations not only by their score, but also taking into 
account their IO impact and the dependencies between them.  As the simplest 
approach, it's worth adding an option to enable the following behavior:
* an IO-heavy operation (such as a flush, a merge compaction, etc.) is not 
scheduled while there is another one scheduled to run or already running on the 
same tablet

{noformat}
I0328 14:36:18.732758 1279870 maintenance_manager.cc:382] P 
db78de4070874e75be1482b02b5161e6: Scheduling 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf 
score=0.035118
I0328 14:36:18.816906 1278984 maintenance_manager.cc:603] P 
db78de4070874e75be1482b02b5161e6: 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. 
Timing: real 0.084s user 0.017s sys 0.005s Metrics: 
{"bytes_written":33228,"delete_count":0,"fdatasync":3,"fdatasync_us":56207,"lbm_write_time_us":115,",lbm_writes_lt_1ms":6,"reinsert_count":0,"update_count":7500}
...
I0328 14:36:18.848975 1279870 maintenance_manager.cc:382] P 
db78de4070874e75be1482b02b5161e6: Scheduling 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035150
I0328 14:36:18.962837 1279870 maintenance_manager.cc:382] P 
db78de4070874e75be1482b02b5161e6: Scheduling 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035182
I0328 14:36:19.073328 1279870 maintenance_manager.cc:382] P 
db78de4070874e75be1482b02b5161e6: Scheduling 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035214
I0328 14:36:19.182384 1279870 maintenance_manager.cc:382] P 
db78de4070874e75be1482b02b5161e6: Scheduling 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde): perf score=0.035244
...
I0328 14:36:22.128937 1278984 maintenance_manager.cc:603] P 
db78de4070874e75be1482b02b5161e6: 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 
3.280s user 0.021s sys 0.005s Metrics: 
{"bytes_written":35275,"delete_count":0,"fdatasync":3,"fdatasync_us":167953,"lbm_write_time_us":124,"lbm_writes_lt_1ms":7,"reinsert_count":0,"update_count":7896}
I0328 14:36:22.241256 1278985 maintenance_manager.cc:603] P 
db78de4070874e75be1482b02b5161e6: 
FlushDeltaMemStoresOp(716b7bbfa9514d728b61f6109d473dde) complete. Timing: real 
3.278s user 0.021s sys 0.000s Metrics: 
{"bytes_written":28910,"delete_count":0,"fdatasync":3,"fdatasync_us":99705,"lbm_write_time_us":123,"lbm_writes_lt_1ms":6,"mutex_wait_us":3081560,"reinsert_count":0,"update_count":6381}
...
I0328 14:36:22.292019 1278986 maintenance_manager.cc:603] P 
db78de4070874e75be1482b02b5161e6: 

[jira] [Updated] (KUDU-3433) ClientTest.TestDeleteWithDeletedTableReserveSecondsWorks is flaky

2024-04-10 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3433:

Fix Version/s: 1.17.1

> ClientTest.TestDeleteWithDeletedTableReserveSecondsWorks is flaky
> -
>
> Key: KUDU-3433
> URL: https://issues.apache.org/jira/browse/KUDU-3433
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: Alexey Serbin
>Priority: Major
> Fix For: 1.18.0, 1.17.1
>
> Attachments: client-test.2.txt.xz, client-test.3.txt.xz, 
> client-test.4.txt.xz
>
>
> The {{TestDeleteWithDeletedTableReserveSecondsWorks}} in {{client-test}} 
> sometimes fails with the following message:
> {noformat}
> src/kudu/client/client-test.cc:5436: Failure  
> Value of: tables.empty()  
>   
>   Actual: false   
>   
> Expected: true
> {noformat}
> I'm attaching a full log for reference.





[jira] [Resolved] (KUDU-3433) ClientTest.TestDeleteWithDeletedTableReserveSecondsWorks is flaky

2024-04-09 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3433.
-
Fix Version/s: 1.18.0
   Resolution: Fixed

> ClientTest.TestDeleteWithDeletedTableReserveSecondsWorks is flaky
> -
>
> Key: KUDU-3433
> URL: https://issues.apache.org/jira/browse/KUDU-3433
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: Alexey Serbin
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: client-test.2.txt.xz, client-test.3.txt.xz, 
> client-test.4.txt.xz
>
>
> The {{TestDeleteWithDeletedTableReserveSecondsWorks}} in {{client-test}} 
> sometimes fails with the following message:
> {noformat}
> src/kudu/client/client-test.cc:5436: Failure  
> Value of: tables.empty()  
>   
>   Actual: false   
>   
> Expected: true
> {noformat}
> I'm attaching a full log for reference.





[jira] [Comment Edited] (KUDU-3545) codegen test fails on SLES with higher libgcc version

2024-04-09 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835086#comment-17835086
 ] 

Alexey Serbin edited comment on KUDU-3545 at 4/9/24 1:51 PM:
-

I haven't tried to track down the exact root cause behind the crash, but I 
suspect it's something described in KUDU-2068, i.e. ABI incompatibilities 
between GCC toolchains of different versions.

In essence, Kudu's third-party CLANG (used to generate {{precompiled.ll}}) 
picks up the latest GCC toolchain version available on the build machine, but 
the rest of Kudu is built with a toolchain of a different version (e.g., think 
of GCC7-based and GCC13-based toolchains on SLES15).  If there is an ABI 
incompatibility in the size of an STL-based type, or in anything else that's 
passed between the auto-generated code derived from {{precompiled.ll}} and the 
rest of the {{kudu-tserver}} runtime, there is a risk of either memory 
corruption or, if you are lucky, an immediate crash of the {{kudu-tserver}} 
process or even a crash of the {{codegen-test}}.

CLANG's behavior of picking up the latest available GCC toolchain that it can 
find is described in [its 
documentation|https://clang.llvm.org/docs/ClangCommandLineReference.html#dumping-preprocessor-state];
 see the paragraph for the {{\-\-gcc-toolchain}} option.  In newer versions of 
CLANG (starting with 16.0.0) there is a better alternative to the 
{{\-\-gcc-toolchain}} flag: {{\-\-gcc-install-dir}} (see [this e-mail 
thread|https://discourse.llvm.org/t/add-gcc-install-dir-deprecate-gcc-toolchain-and-remove-gcc-install-prefix/65091]
 for more details).  I guess we should employ this option once we upgrade 
Kudu's thirdparty LLVM (it's 11.0.0 as of April 2024) at least to 16.0.0 or 
newer.
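
For illustration, pinning the toolchain explicitly could look like the
following (the GCC install path here is hypothetical and distro-dependent):
{noformat}
# CLANG up to 15: point the driver at a specific GCC installation root.
clang++ --gcc-toolchain=/usr ...

# CLANG 16+: select the exact GCC installation directory instead.
clang++ --gcc-install-dir=/usr/lib64/gcc/x86_64-suse-linux/7 ...
{noformat}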


was (Author: aserbin):
I haven't tried to track down the exact root cause behind the crash, but I 
suspect the root cause is something described in KUDU-2068, i.e. ABI 
incompatibilities between GCC toolchains of different versions.

In essence, Kudu's third-party CLANG (used to generate {{precompiled.ll}}) 
picks up a toolchain of the latest version available at the build machine, but 
the rest of Kudu is built with a toolchain of different version (e.g., think of 
GCC7-based and GCC13-based toolchains on SLES15).  If there is an ABI 
incompatibility on the size of an STL-based type or anything else that's being 
passed between auto-generated code derived from {{precompiled.ll}} and the rest 
of the {{kudu-tserver}} runtime, there is a risk of either a memory corruption 
or, if you are lucky, an immediate crash of the {{kudu-tserver}} process or 
even a crash of the {{codegen-test}}.

The CLANG's behavior of picking up the latest available version of GCC 
toolchain that it can find is described in [its 
documentation|https://clang.llvm.org/docs/ClangCommandLineReference.html#dumping-preprocessor-state],
 see the paragraph for the {{\-\-gcc-toolchain}} option.  In newer versions of 
CLANG (starting with 16.0.0) there is a better alternative to the 
{{\-\-gcc-toolchain}} flag: {{\-\-gcc-install-dir}} (see [this e-mail 
thread|https://discourse.llvm.org/t/add-gcc-install-dir-deprecate-gcc-toolchain-and-remove-gcc-install-prefix/65091]
 for more details).  I guess we should employ this option once we upgrade 
Kudu's thirdparty LLVM at least to 16.0.0 version or newer (it's 11.0.0 as of 
April 2024).

> codegen test fails on SLES with higher libgcc version
> -
>
> Key: KUDU-3545
> URL: https://issues.apache.org/jira/browse/KUDU-3545
> Project: Kudu
>  Issue Type: Bug
>  Components: codegen
>Reporter: Ashwani Raina
>Assignee: Ashwani Raina
>Priority: Minor
>
> On SLES 15 with libgcc_s1-13.2.1+git7813-15.1.6.1.x86_64, 
> codegen-test fails with the following crash:
> {noformat}
> *** SIGABRT (@0x3162e) received by PID 202286 (TID 0x7f71d1bfe700) from PID 
> 202286; stack trace: ***
>     @     0x7f71d41f5910 (unknown)
>     @     0x7f71d2725d2b __GI_raise
>     @     0x7f71d27273e5 __GI_abort
>     @     0x7f71d28d78d7 (unknown)
>     @     0x7f71d28f1009 __deregister_frame
>     @     0x7f71d4d6c9e0 llvm::RTDyldMemoryManager::deregisterEHFrames()
>     @     0x7f71d4976b02 llvm::MCJIT::~MCJIT()
>     @     0x7f71d4977241 llvm::MCJIT::~MCJIT()
>     @     0x7f71d481c222 std::default_delete<>::operator()()
>     @     0x7f71d481c12d std::unique_ptr<>::~unique_ptr()
>     @     0x7f71d481bfaf kudu::codegen::JITWrapper::~JITWrapper()
>     @     0x7f71d4835f34 
> kudu::codegen::RowProjectorFunctions::~RowProjectorFunctions()
>     @     0x7f71d4835f50 
> kudu::codegen::RowProjectorFunctions::~RowProjectorFunctions()
>     @           0x46297c kudu::RefCountedThreadSafe<>::DeleteInternal()
>     @           0x45f3d1 

[jira] [Comment Edited] (KUDU-3545) codegen test fails on SLES with higher libgcc version

2024-04-08 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835086#comment-17835086
 ] 

Alexey Serbin edited comment on KUDU-3545 at 4/9/24 1:36 AM:
-

I haven't tried to track down the exact root cause behind the crash, but I 
suspect the root cause is something described in KUDU-2068, i.e. ABI 
incompatibilities between GCC toolchains of different versions.

In essence, Kudu's third-party CLANG (used to generate {{precompiled.ll}}) 
picks up a toolchain of the latest version available at the build machine, but 
the rest of Kudu is built with a toolchain of different version (e.g., think of 
GCC7-based and GCC13-based toolchains on SLES15).  If there is an ABI 
incompatibility on the size of an STL-based type or anything else that's being 
passed between auto-generated code derived from {{precompiled.ll}} and the rest 
of the {{kudu-tserver}} runtime, there is a risk of either a memory corruption 
or, if you are lucky, an immediate crash of the {{kudu-tserver}} process or 
even a crash of the {{codegen-test}}.

The CLANG's behavior of picking up the latest available version of GCC 
toolchain that it can find is described in [its 
documentation|https://clang.llvm.org/docs/ClangCommandLineReference.html#dumping-preprocessor-state],
 see the paragraph for the {{\-\-gcc-toolchain}} option.  In newer versions of 
CLANG (starting with 16.0.0) there is a better alternative to the 
{{\-\-gcc-toolchain}} flag: {{\-\-gcc-install-dir}} (see [this e-mail 
thread|https://discourse.llvm.org/t/add-gcc-install-dir-deprecate-gcc-toolchain-and-remove-gcc-install-prefix/65091]
 for more details).  I guess we should employ this option once we upgrade 
Kudu's thirdparty LLVM at least to 16.0.0 version or newer (it's 11.0.0 as of 
April 2024).


was (Author: aserbin):
I haven't tried to track down the exact root cause behind the crash, but I 
suspect the root cause is something described in KUDU-2068, i.e. ABI 
incompatibilities between GCC toolchains of different versions.

In essence, Kudu's third-party CLANG (used to generate {{precompiled.ll}}) 
picks up a toolchain of the latest version available at the build machine, but 
the rest of Kudu is built with a toolchain of different version (e.g., think of 
GCC7-based and GCC13-based toolchains on SLES15).  If there is an ABI 
incompatibility on the size of an STL-based type or anything else that's being 
passing between auto-generated code derived from {{precompiled.ll}} and the 
rest of the {{kudu-tserver}} runtime, there is a risk of either a memory 
corruption or, if you are lucky, an immediate crash of the {{kudu-tserver}} 
process or even a crash of the {{codegen-test}}.

The CLANG's behavior of picking up the latest available version of GCC 
toolchain that it can find is described in [its 
documentation|https://clang.llvm.org/docs/ClangCommandLineReference.html#dumping-preprocessor-state],
 see the paragraph for the {{\-\-gcc-toolchain}} option.  In newer versions of 
CLANG (starting with 16.0.0) there is a better alternative to the 
{{\-\-gcc-toolchain}} flag: {{\-\-gcc-install-dir}} (see [this e-mail 
thread|https://discourse.llvm.org/t/add-gcc-install-dir-deprecate-gcc-toolchain-and-remove-gcc-install-prefix/65091]
 for more details).  I guess we should employ this option once we upgrade 
Kudu's thirdparty LLVM at least to 16.0.0 version or newer (it's 11.0.0 as of 
April 2024).

> codegen test fails on SLES with higher libgcc version
> -
>
> Key: KUDU-3545
> URL: https://issues.apache.org/jira/browse/KUDU-3545
> Project: Kudu
>  Issue Type: Bug
>  Components: codegen
>Reporter: Ashwani Raina
>Priority: Minor
>
> On SLES 15 with libgcc_s1-13.2.1+git7813-15.1.6.1.x86_64, 
> codegen-test fails with the following crash:
> {noformat}
> *** SIGABRT (@0x3162e) received by PID 202286 (TID 0x7f71d1bfe700) from PID 
> 202286; stack trace: ***
>     @     0x7f71d41f5910 (unknown)
>     @     0x7f71d2725d2b __GI_raise
>     @     0x7f71d27273e5 __GI_abort
>     @     0x7f71d28d78d7 (unknown)
>     @     0x7f71d28f1009 __deregister_frame
>     @     0x7f71d4d6c9e0 llvm::RTDyldMemoryManager::deregisterEHFrames()
>     @     0x7f71d4976b02 llvm::MCJIT::~MCJIT()
>     @     0x7f71d4977241 llvm::MCJIT::~MCJIT()
>     @     0x7f71d481c222 std::default_delete<>::operator()()
>     @     0x7f71d481c12d std::unique_ptr<>::~unique_ptr()
>     @     0x7f71d481bfaf kudu::codegen::JITWrapper::~JITWrapper()
>     @     0x7f71d4835f34 
> kudu::codegen::RowProjectorFunctions::~RowProjectorFunctions()
>     @     0x7f71d4835f50 
> kudu::codegen::RowProjectorFunctions::~RowProjectorFunctions()
>     @           0x46297c kudu::RefCountedThreadSafe<>::DeleteInternal()
>     @           0x45f3d1 

[jira] [Updated] (KUDU-3545) codegen test fails on SLES with higher libgcc version

2024-04-08 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3545:

Description: 
On SLES 15 with libgcc_s1-13.2.1+git7813-15.1.6.1.x86_64, 
codegen-test fails with the following crash:
{noformat}
*** SIGABRT (@0x3162e) received by PID 202286 (TID 0x7f71d1bfe700) from PID 
202286; stack trace: ***
    @     0x7f71d41f5910 (unknown)
    @     0x7f71d2725d2b __GI_raise
    @     0x7f71d27273e5 __GI_abort
    @     0x7f71d28d78d7 (unknown)
    @     0x7f71d28f1009 __deregister_frame
    @     0x7f71d4d6c9e0 llvm::RTDyldMemoryManager::deregisterEHFrames()
    @     0x7f71d4976b02 llvm::MCJIT::~MCJIT()
    @     0x7f71d4977241 llvm::MCJIT::~MCJIT()
    @     0x7f71d481c222 std::default_delete<>::operator()()
    @     0x7f71d481c12d std::unique_ptr<>::~unique_ptr()
    @     0x7f71d481bfaf kudu::codegen::JITWrapper::~JITWrapper()
    @     0x7f71d4835f34 
kudu::codegen::RowProjectorFunctions::~RowProjectorFunctions()
    @     0x7f71d4835f50 
kudu::codegen::RowProjectorFunctions::~RowProjectorFunctions()
    @           0x46297c kudu::RefCountedThreadSafe<>::DeleteInternal()
    @           0x45f3d1 kudu::DefaultRefCountedThreadSafeTraits<>::Destruct()
    @           0x45acb0 kudu::RefCountedThreadSafe<>::Release()
    @     0x7f71d480c191 
kudu::codegen::CodeCache::EvictionCallback::EvictedEntry()
    @     0x7f71d3c5e4bb kudu::(anonymous namespace)::CacheShard<>::FreeEntry()
    @     0x7f71d3c60b31 kudu::(anonymous namespace)::CacheShard<>::Insert()
    @     0x7f71d3c5fb73 kudu::(anonymous namespace)::ShardedCache<>::Insert()
    @     0x7f71d480bab6 kudu::codegen::CodeCache::AddEntry()
    @     0x7f71d4811fea kudu::codegen::(anonymous 
namespace)::CompilationTask::RunWithStatus()
    @     0x7f71d4811a64 kudu::codegen::(anonymous 
namespace)::CompilationTask::Run()
    @     0x7f71d481288a 
_ZZN4kudu7codegen18CompilationManager19RequestRowProjectorEPKNS_6SchemaES4_PSt10unique_ptrINS0_12RowProjectorESt14default_deleteIS6_EEENKUlvE_clEv
    @     0x7f71d4813e72 
_ZNSt17_Function_handlerIFvvEZN4kudu7codegen18CompilationManager19RequestRowProjectorEPKNS1_6SchemaES6_PSt10unique_ptrINS2_12RowProjectorESt14default_deleteIS8_EEEUlvE_E9_M_invokeERKSt9_Any_data
    @           0x452430 std::function<>::operator()()
    @     0x7f71d3d98648 kudu::ThreadPool::DispatchThread()
    @     0x7f71d3d98ee9 _ZZN4kudu10ThreadPool12CreateThreadEvENKUlvE_clEv
    @     0x7f71d3d9a6a0 
_ZNSt17_Function_handlerIFvvEZN4kudu10ThreadPool12CreateThreadEvEUlvE_E9_M_invokeERKSt9_Any_data
    @           0x452430 std::function<>::operator()()
    @     0x7f71d3d89482 kudu::Thread::SuperviseThread()
    @     0x7f71d41e96ea start_thread
{noformat}


From the stack frame, it seems that __deregister_frame is probably being fed 
some invalid input that has already been de-initialised by the time 
__deregister_frame is called.
We seem to be hitting this assert:

[https://github.com/gcc-mirror/gcc/blob/65e2c932019b4e36d7c1d49952dc006fa7419a3d/libgcc/unwind-dw2-fde.c#L291C11-L291C11]

gcc_assert (in_shutdown || ob);

  was:
On a SLES 15 withlibgcc_s1-13.2.1+git7813-15.1.6.1.x86_64 version, 
codegen-test fails with following crash:
{noformat}

*** SIGABRT (@0x3162e) received by PID 202286 (TID 0x7f71d1bfe700) from PID 
202286; stack trace: ***

    @     0x7f71d41f5910 (unknown)

    @     0x7f71d2725d2b __GI_raise

    @     0x7f71d27273e5 __GI_abort

    @     0x7f71d28d78d7 (unknown)

    @     0x7f71d28f1009 __deregister_frame

    @     0x7f71d4d6c9e0 llvm::RTDyldMemoryManager::deregisterEHFrames()

    @     0x7f71d4976b02 llvm::MCJIT::~MCJIT()

    @     0x7f71d4977241 llvm::MCJIT::~MCJIT()

    @     0x7f71d481c222 std::default_delete<>::operator()()

    @     0x7f71d481c12d std::unique_ptr<>::~unique_ptr()

    @     0x7f71d481bfaf kudu::codegen::JITWrapper::~JITWrapper()

    @     0x7f71d4835f34 
kudu::codegen::RowProjectorFunctions::~RowProjectorFunctions()

    @     0x7f71d4835f50 
kudu::codegen::RowProjectorFunctions::~RowProjectorFunctions()

    @           0x46297c kudu::RefCountedThreadSafe<>::DeleteInternal()

    @           0x45f3d1 kudu::DefaultRefCountedThreadSafeTraits<>::Destruct()

    @           0x45acb0 kudu::RefCountedThreadSafe<>::Release()

    @     0x7f71d480c191 
kudu::codegen::CodeCache::EvictionCallback::EvictedEntry()

    @     0x7f71d3c5e4bb kudu::(anonymous namespace)::CacheShard<>::FreeEntry()

    @     0x7f71d3c60b31 kudu::(anonymous namespace)::CacheShard<>::Insert()

    @     0x7f71d3c5fb73 kudu::(anonymous namespace)::ShardedCache<>::Insert()

    @     0x7f71d480bab6 kudu::codegen::CodeCache::AddEntry()

    @     0x7f71d4811fea kudu::codegen::(anonymous 
namespace)::CompilationTask::RunWithStatus()

    @     0x7f71d4811a64 kudu::codegen::(anonymous 
namespace)::CompilationTask::Run()

    @     0x7f71d481288a 

[jira] [Updated] (KUDU-3545) codegen test fails on SLES with higher libgcc version

2024-04-08 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3545:

Description: 
On a SLES 15 withlibgcc_s1-13.2.1+git7813-15.1.6.1.x86_64 version, 
codegen-test fails with following crash:
{noformat}

*** SIGABRT (@0x3162e) received by PID 202286 (TID 0x7f71d1bfe700) from PID 
202286; stack trace: ***

    @     0x7f71d41f5910 (unknown)

    @     0x7f71d2725d2b __GI_raise

    @     0x7f71d27273e5 __GI_abort

    @     0x7f71d28d78d7 (unknown)

    @     0x7f71d28f1009 __deregister_frame

    @     0x7f71d4d6c9e0 llvm::RTDyldMemoryManager::deregisterEHFrames()

    @     0x7f71d4976b02 llvm::MCJIT::~MCJIT()

    @     0x7f71d4977241 llvm::MCJIT::~MCJIT()

    @     0x7f71d481c222 std::default_delete<>::operator()()

    @     0x7f71d481c12d std::unique_ptr<>::~unique_ptr()

    @     0x7f71d481bfaf kudu::codegen::JITWrapper::~JITWrapper()

    @     0x7f71d4835f34 
kudu::codegen::RowProjectorFunctions::~RowProjectorFunctions()

    @     0x7f71d4835f50 
kudu::codegen::RowProjectorFunctions::~RowProjectorFunctions()

    @           0x46297c kudu::RefCountedThreadSafe<>::DeleteInternal()

    @           0x45f3d1 kudu::DefaultRefCountedThreadSafeTraits<>::Destruct()

    @           0x45acb0 kudu::RefCountedThreadSafe<>::Release()

    @     0x7f71d480c191 
kudu::codegen::CodeCache::EvictionCallback::EvictedEntry()

    @     0x7f71d3c5e4bb kudu::(anonymous namespace)::CacheShard<>::FreeEntry()

    @     0x7f71d3c60b31 kudu::(anonymous namespace)::CacheShard<>::Insert()

    @     0x7f71d3c5fb73 kudu::(anonymous namespace)::ShardedCache<>::Insert()

    @     0x7f71d480bab6 kudu::codegen::CodeCache::AddEntry()

    @     0x7f71d4811fea kudu::codegen::(anonymous 
namespace)::CompilationTask::RunWithStatus()

    @     0x7f71d4811a64 kudu::codegen::(anonymous 
namespace)::CompilationTask::Run()

    @     0x7f71d481288a 
_ZZN4kudu7codegen18CompilationManager19RequestRowProjectorEPKNS_6SchemaES4_PSt10unique_ptrINS0_12RowProjectorESt14default_deleteIS6_EEENKUlvE_clEv

    @     0x7f71d4813e72 
_ZNSt17_Function_handlerIFvvEZN4kudu7codegen18CompilationManager19RequestRowProjectorEPKNS1_6SchemaES6_PSt10unique_ptrINS2_12RowProjectorESt14default_deleteIS8_EEEUlvE_E9_M_invokeERKSt9_Any_data

    @           0x452430 std::function<>::operator()()

    @     0x7f71d3d98648 kudu::ThreadPool::DispatchThread()

    @     0x7f71d3d98ee9 _ZZN4kudu10ThreadPool12CreateThreadEvENKUlvE_clEv

    @     0x7f71d3d9a6a0 
_ZNSt17_Function_handlerIFvvEZN4kudu10ThreadPool12CreateThreadEvEUlvE_E9_M_invokeERKSt9_Any_data

    @           0x452430 std::function<>::operator()()

    @     0x7f71d3d89482 kudu::Thread::SuperviseThread()

    @     0x7f71d41e96ea start_thread


+++



From the stack frame, it seems that __deregister_frame is probably being fed 
some invalid input that has already been de-initialised by the time 
__deregister_frame is called.
We seem to be hitting this assert:

[https://github.com/gcc-mirror/gcc/blob/65e2c932019b4e36d7c1d49952dc006fa7419a3d/libgcc/unwind-dw2-fde.c#L291C11-L291C11]

gcc_assert (in_shutdown || ob);

  was:
On a SLES 15 system with libgcc_s1-13.2.1+git7813-15.1.6.1.x86_64, 
codegen-test fails with the following crash:
+++

*** SIGABRT (@0x3162e) received by PID 202286 (TID 0x7f71d1bfe700) from PID 
202286; stack trace: ***

    @     0x7f71d41f5910 (unknown)

    @     0x7f71d2725d2b __GI_raise

    @     0x7f71d27273e5 __GI_abort

    @     0x7f71d28d78d7 (unknown)

    @     0x7f71d28f1009 __deregister_frame

    @     0x7f71d4d6c9e0 llvm::RTDyldMemoryManager::deregisterEHFrames()

    @     0x7f71d4976b02 llvm::MCJIT::~MCJIT()

    @     0x7f71d4977241 llvm::MCJIT::~MCJIT()

    @     0x7f71d481c222 std::default_delete<>::operator()()

    @     0x7f71d481c12d std::unique_ptr<>::~unique_ptr()

    @     0x7f71d481bfaf kudu::codegen::JITWrapper::~JITWrapper()

    @     0x7f71d4835f34 
kudu::codegen::RowProjectorFunctions::~RowProjectorFunctions()

    @     0x7f71d4835f50 
kudu::codegen::RowProjectorFunctions::~RowProjectorFunctions()

    @           0x46297c kudu::RefCountedThreadSafe<>::DeleteInternal()

    @           0x45f3d1 kudu::DefaultRefCountedThreadSafeTraits<>::Destruct()

    @           0x45acb0 kudu::RefCountedThreadSafe<>::Release()

    @     0x7f71d480c191 
kudu::codegen::CodeCache::EvictionCallback::EvictedEntry()

    @     0x7f71d3c5e4bb kudu::(anonymous namespace)::CacheShard<>::FreeEntry()

    @     0x7f71d3c60b31 kudu::(anonymous namespace)::CacheShard<>::Insert()

    @     0x7f71d3c5fb73 kudu::(anonymous namespace)::ShardedCache<>::Insert()

    @     0x7f71d480bab6 kudu::codegen::CodeCache::AddEntry()

    @     0x7f71d4811fea kudu::codegen::(anonymous 
namespace)::CompilationTask::RunWithStatus()

    @     0x7f71d4811a64 kudu::codegen::(anonymous 
namespace)::CompilationTask::Run()

    @     0x7f71d481288a 

[jira] [Commented] (KUDU-3545) codegen test fails on SLES with higher libgcc version

2024-04-08 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835086#comment-17835086
 ] 

Alexey Serbin commented on KUDU-3545:
-

I haven't tried to track down the exact root cause of the crash, but I 
suspect it's something described in KUDU-2068, i.e. ABI incompatibilities 
between GCC toolchains of different versions.

In essence, Kudu's third-party CLANG (used to generate {{precompiled.ll}}) 
picks up the latest GCC toolchain available on the build machine, but the rest 
of Kudu is built with a toolchain of a different version (e.g., think of 
GCC7-based and GCC13-based toolchains on SLES15).  If there is an ABI 
incompatibility in the size of an STL-based type or anything else that's 
passed between the auto-generated code derived from {{precompiled.ll}} and the 
rest of the {{kudu-tserver}} runtime, there is a risk of either memory 
corruption or, if you are lucky, an immediate crash of the {{kudu-tserver}} 
process or even a crash of {{codegen-test}}.

CLANG's behavior of picking up the latest available GCC toolchain it can find 
is described in [its 
documentation|https://clang.llvm.org/docs/ClangCommandLineReference.html#dumping-preprocessor-state],
 see the paragraph for the {{\-\-gcc-toolchain}} option.  In newer versions of 
CLANG (starting with 16.0.0) there is a better alternative to the 
{{\-\-gcc-toolchain}} flag: {{\-\-gcc-install-dir}} (see [this e-mail 
thread|https://discourse.llvm.org/t/add-gcc-install-dir-deprecate-gcc-toolchain-and-remove-gcc-install-prefix/65091]
 for more details).  I guess we should employ this option once we upgrade 
Kudu's thirdparty LLVM to version 16.0.0 or newer (it's at 11.0.0 as of April 
2024).
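
For illustration, the difference between the two flags could look like this 
(hypothetical invocations; the GCC installation paths are made up for a 
SLES-like layout):
{noformat}
# --gcc-toolchain: clang still resolves to the newest GCC installation it can
# find under the given prefix, e.g. a GCC 13 install on an up-to-date SLES 15.
clang++ --gcc-toolchain=/usr -c precompiled.cc

# --gcc-install-dir (CLANG 16.0.0 and newer): pin the exact GCC installation,
# so the generated IR matches the toolchain the rest of Kudu is built with.
clang++ --gcc-install-dir=/usr/lib64/gcc/x86_64-suse-linux/7 -c precompiled.cc
{noformat}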

> codegen test fails on SLES with higher libgcc version
> -
>
> Key: KUDU-3545
> URL: https://issues.apache.org/jira/browse/KUDU-3545
> Project: Kudu
>  Issue Type: Bug
>  Components: codegen
>Reporter: Ashwani Raina
>Priority: Minor
>
> On a SLES 15 system with libgcc_s1-13.2.1+git7813-15.1.6.1.x86_64, 
> codegen-test fails with the following crash:
> +++
> *** SIGABRT (@0x3162e) received by PID 202286 (TID 0x7f71d1bfe700) from PID 
> 202286; stack trace: ***
>     @     0x7f71d41f5910 (unknown)
>     @     0x7f71d2725d2b __GI_raise
>     @     0x7f71d27273e5 __GI_abort
>     @     0x7f71d28d78d7 (unknown)
>     @     0x7f71d28f1009 __deregister_frame
>     @     0x7f71d4d6c9e0 llvm::RTDyldMemoryManager::deregisterEHFrames()
>     @     0x7f71d4976b02 llvm::MCJIT::~MCJIT()
>     @     0x7f71d4977241 llvm::MCJIT::~MCJIT()
>     @     0x7f71d481c222 std::default_delete<>::operator()()
>     @     0x7f71d481c12d std::unique_ptr<>::~unique_ptr()
>     @     0x7f71d481bfaf kudu::codegen::JITWrapper::~JITWrapper()
>     @     0x7f71d4835f34 
> kudu::codegen::RowProjectorFunctions::~RowProjectorFunctions()
>     @     0x7f71d4835f50 
> kudu::codegen::RowProjectorFunctions::~RowProjectorFunctions()
>     @           0x46297c kudu::RefCountedThreadSafe<>::DeleteInternal()
>     @           0x45f3d1 kudu::DefaultRefCountedThreadSafeTraits<>::Destruct()
>     @           0x45acb0 kudu::RefCountedThreadSafe<>::Release()
>     @     0x7f71d480c191 
> kudu::codegen::CodeCache::EvictionCallback::EvictedEntry()
>     @     0x7f71d3c5e4bb kudu::(anonymous 
> namespace)::CacheShard<>::FreeEntry()
>     @     0x7f71d3c60b31 kudu::(anonymous namespace)::CacheShard<>::Insert()
>     @     0x7f71d3c5fb73 kudu::(anonymous namespace)::ShardedCache<>::Insert()
>     @     0x7f71d480bab6 kudu::codegen::CodeCache::AddEntry()
>     @     0x7f71d4811fea kudu::codegen::(anonymous 
> namespace)::CompilationTask::RunWithStatus()
>     @     0x7f71d4811a64 kudu::codegen::(anonymous 
> namespace)::CompilationTask::Run()
>     @     0x7f71d481288a 
> _ZZN4kudu7codegen18CompilationManager19RequestRowProjectorEPKNS_6SchemaES4_PSt10unique_ptrINS0_12RowProjectorESt14default_deleteIS6_EEENKUlvE_clEv
>     @     0x7f71d4813e72 
> _ZNSt17_Function_handlerIFvvEZN4kudu7codegen18CompilationManager19RequestRowProjectorEPKNS1_6SchemaES6_PSt10unique_ptrINS2_12RowProjectorESt14default_deleteIS8_EEEUlvE_E9_M_invokeERKSt9_Any_data
>     @           0x452430 std::function<>::operator()()
>     @     0x7f71d3d98648 kudu::ThreadPool::DispatchThread()
>     @     0x7f71d3d98ee9 _ZZN4kudu10ThreadPool12CreateThreadEvENKUlvE_clEv
>     @     0x7f71d3d9a6a0 
> _ZNSt17_Function_handlerIFvvEZN4kudu10ThreadPool12CreateThreadEvEUlvE_E9_M_invokeERKSt9_Any_data
>     @           0x452430 std::function<>::operator()()
>     @     0x7f71d3d89482 kudu::Thread::SuperviseThread()
>     @     0x7f71d41e96ea start_thread
> +++
> From the stack frame, it seems that __deregister_frame is probably being fed 
> some invalid input that is 

[jira] [Commented] (KUDU-3561) Too Many Warning Log: "Entity is not relevant to Prometheus"

2024-04-08 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834994#comment-17834994
 ] 

Alexey Serbin commented on KUDU-3561:
-

[~liumumumumumu],

Thank you very much for reporting the issue promptly -- it helps a lot!

It's nice to hear you find Kudu useful.  Thank you for the kind words :)

> Too Many Warning Log: "Entity is not relevant to Prometheus"
> 
>
> Key: KUDU-3561
> URL: https://issues.apache.org/jira/browse/KUDU-3561
> Project: Kudu
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 1.17.0
>Reporter: Liu Lin
>Assignee: Alexey Serbin
>Priority: Major
> Fix For: 1.18.0, 1.17.1
>
> Attachments: image-2024-03-30-22-54-28-735.png
>
>
> I encountered the same issue as KUDU-3549, so I recompiled Kudu using the 
> master branch.
> Now Prometheus is able to collect metrics from Kudu correctly through the 
> /metrics_prometheus endpoint.
> However, I noticed a large number of "Failed to write entity [xxx] as 
> Prometheus: Not found: Entity is not relevant to Prometheus" warning logs in 
> the Tserver's logs.
> !image-2024-03-30-22-54-28-735.png!
> Could you please confirm if this is a code issue? If not, how can I reduce 
> these logs?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KUDU-3563) Output tablet-level metrics in Prometheus format

2024-04-08 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834991#comment-17834991
 ] 

Alexey Serbin commented on KUDU-3563:
-

[~liumumumumumu],

Thank you very much for your interest in Kudu and reporting the issues promptly!

It's nice to hear that you found Kudu useful.  I hope the support for tablet 
level metrics in Kudu will be added soon (i.e. addressing this KUDU-3563 JIRA 
item).

Thank you for the kind words!

> Output tablet-level metrics in Prometheus format
> 
>
> Key: KUDU-3563
> URL: https://issues.apache.org/jira/browse/KUDU-3563
> Project: Kudu
>  Issue Type: Improvement
>  Components: master, server, tserver
>Reporter: Alexey Serbin
>Priority: Major
>  Labels: metrics, prometheus, supportability, visibility
>
> The request to support outputting Kudu metrics in Prometheus format is 
> tracked in [KUDU-3375|https://issues.apache.org/jira/browse/KUDU-3375].  The 
> [first take on 
> this|https://github.com/apache/kudu/commit/00efc6826ac9a1f5d10750296c7357790a041fec]
>  has taken care of the server-level metrics, ignoring all the tablet-level 
> metrics.
> In the scope of this JIRA item, it's necessary to output all the tablet-level 
> metrics in Prometheus format as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3564) Range specific hashing table when queried with InList predicate may lead to incorrect results

2024-04-07 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3564:

Affects Version/s: 1.16.0

> Range specific hashing table when queried with InList predicate may lead to 
> incorrect results
> -
>
> Key: KUDU-3564
> URL: https://issues.apache.org/jira/browse/KUDU-3564
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.16.0, 1.17.0
>Reporter: YifanZhang
>Priority: Major
> Fix For: 1.18.0, 1.17.1
>
>
> Reproduction steps copied from the Slack channel:
>  
> {code:sql}
> -- create the table and data in Impala:
> CREATE TABLE age_table
> (
> id BIGINT,
> name STRING,
> age INT,
> PRIMARY KEY(id,name,age)
> )
> PARTITION BY HASH (id) PARTITIONS 4,
> HASH (name) PARTITIONS 4,
> range (age)
> ( 
> PARTITION 30 <= VALUES < 60,
> PARTITION 60 <= VALUES < 90
> ) 
> STORED AS KUDU 
> TBLPROPERTIES ('kudu.num_tablet_replicas' = '1');
> ALTER TABLE age_table ADD RANGE PARTITION 90<= VALUES <120
> HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3;
> INSERT INTO age_table VALUES (3, 'alex', 50);
> INSERT INTO age_table VALUES (12, 'bob', 100);
> {code}
> Now, let's run a few queries using the {{kudu table scan}} CLI tool:
> {noformat}
> # This query produces wrong results: the expected row for 'bob' isn't 
> returned.
> # Note that the troublesome row is in the range partition with custom 
> (per-range) hash schema.
> $ sudo -u kudu kudu table scan  default.age_table -columns=id,age 
> -predicates='["AND", ["IN", "id", [12,20]]]'
> Total count 0 cost 0.0224966 seconds
> # This query produces correct results: the expected row for 'alex' is 
> returned.
> $ sudo -u kudu kudu table scan  default.age_table -columns=id,age 
> -predicates='["AND", ["IN", "id", [3,20]]]'
> (int64 id=3, int32 age=50)
> Total count 1 cost 0.0178102 seconds
> # However, predicates on the primary key columns seem to work as expected, 
> even for the rows in the range with custom hash schema.
> $ sudo -u kudu kudu table scan  default.age_table -columns=id,age 
> -predicates='["AND", ["=", "id", 12]]'
> (int64 id=12, int32 age=100)
> Total count 1 cost 0.0137217 seconds
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3564) Range specific hashing table when queried with InList predicate may lead to incorrect results

2024-04-07 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3564:

Fix Version/s: 1.18.0
   1.17.1
   Resolution: Fixed
   Status: Resolved  (was: In Review)

> Range specific hashing table when queried with InList predicate may lead to 
> incorrect results
> -
>
> Key: KUDU-3564
> URL: https://issues.apache.org/jira/browse/KUDU-3564
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: YifanZhang
>Priority: Major
> Fix For: 1.18.0, 1.17.1
>
>
> Reproduction steps copied from the Slack channel:
>  
> {code:sql}
> -- create the table and data in Impala:
> CREATE TABLE age_table
> (
> id BIGINT,
> name STRING,
> age INT,
> PRIMARY KEY(id,name,age)
> )
> PARTITION BY HASH (id) PARTITIONS 4,
> HASH (name) PARTITIONS 4,
> range (age)
> ( 
> PARTITION 30 <= VALUES < 60,
> PARTITION 60 <= VALUES < 90
> ) 
> STORED AS KUDU 
> TBLPROPERTIES ('kudu.num_tablet_replicas' = '1');
> ALTER TABLE age_table ADD RANGE PARTITION 90<= VALUES <120
> HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3;
> INSERT INTO age_table VALUES (3, 'alex', 50);
> INSERT INTO age_table VALUES (12, 'bob', 100);
> {code}
> Now, let's run a few queries using the {{kudu table scan}} CLI tool:
> {noformat}
> # This query produces wrong results: the expected row for 'bob' isn't 
> returned.
> # Note that the troublesome row is in the range partition with custom 
> (per-range) hash schema.
> $ sudo -u kudu kudu table scan  default.age_table -columns=id,age 
> -predicates='["AND", ["IN", "id", [12,20]]]'
> Total count 0 cost 0.0224966 seconds
> # This query produces correct results: the expected row for 'alex' is 
> returned.
> $ sudo -u kudu kudu table scan  default.age_table -columns=id,age 
> -predicates='["AND", ["IN", "id", [3,20]]]'
> (int64 id=3, int32 age=50)
> Total count 1 cost 0.0178102 seconds
> # However, predicates on the primary key columns seem to work as expected, 
> even for the rows in the range with custom hash schema.
> $ sudo -u kudu kudu table scan  default.age_table -columns=id,age 
> -predicates='["AND", ["=", "id", 12]]'
> (int64 id=12, int32 age=100)
> Total count 1 cost 0.0137217 seconds
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3564) Range specific hashing table when queried with InList predicate may lead to incorrect results

2024-04-05 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3564:

Description: 
Reproduction steps copied from the Slack channel:
 
{code:sql}
-- create the table and data in Impala:
CREATE TABLE age_table
(
id BIGINT,
name STRING,
age INT,
PRIMARY KEY(id,name,age)
)
PARTITION BY HASH (id) PARTITIONS 4,
HASH (name) PARTITIONS 4,
range (age)
( 
PARTITION 30 <= VALUES < 60,
PARTITION 60 <= VALUES < 90
) 
STORED AS KUDU 
TBLPROPERTIES ('kudu.num_tablet_replicas' = '1');

ALTER TABLE age_table ADD RANGE PARTITION 90<= VALUES <120
HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3;

INSERT INTO age_table VALUES (3, 'alex', 50);
INSERT INTO age_table VALUES (12, 'bob', 100);
{code}

Now, let's run a few queries using the {{kudu table scan}} CLI tool:
{noformat}
# This query produces wrong results: the expected row for 'bob' isn't returned.
# Note that the troublesome row is in the range partition with custom 
(per-range) hash schema.
$ sudo -u kudu kudu table scan  default.age_table -columns=id,age 
-predicates='["AND", ["IN", "id", [12,20]]]'
Total count 0 cost 0.0224966 seconds

# This query produces correct results: the expected row for 'alex' is returned.
$ sudo -u kudu kudu table scan  default.age_table -columns=id,age 
-predicates='["AND", ["IN", "id", [3,20]]]'
(int64 id=3, int32 age=50)
Total count 1 cost 0.0178102 seconds

# However, predicates on the primary key columns seem to work as expected, even 
for the rows in the range with custom hash schema.
$ sudo -u kudu kudu table scan  default.age_table -columns=id,age 
-predicates='["AND", ["=", "id", 12]]'
(int64 id=12, int32 age=100)
Total count 1 cost 0.0137217 seconds
{noformat}
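
To make the failure mode concrete, below is a minimal sketch (hypothetical 
code, not Kudu's actual pruning logic; {{HashBucket}} is a stand-in for Kudu's 
real bucketing function).  Pruning an IN-list value with the table-wide hash 
schema can pick a different bucket than the per-range schema the row was 
written with, so the scan skips the tablet that holds the row:
{code:cpp}
#include <cstdint>
#include <functional>
#include <iostream>

// Stand-in for Kudu's hash bucketing: bucket = hash(value) % num_buckets.
uint32_t HashBucket(int64_t value, int num_buckets) {
  return static_cast<uint32_t>(std::hash<int64_t>{}(value) % num_buckets);
}

int main() {
  const int64_t id = 12;      // the 'bob' row from the repro above
  const int table_wide = 4;   // table-wide schema: HASH (id) PARTITIONS 4
  const int per_range = 3;    // custom range schema: HASH (id) PARTITIONS 3

  // If the two bucket numbers differ, pruning with the table-wide schema
  // points the scanner at the wrong tablet within the custom range.
  std::cout << "bucket under the table-wide schema: "
            << HashBucket(id, table_wide) << '\n'
            << "bucket under the range's own schema: "
            << HashBucket(id, per_range) << '\n';
  return 0;
}
{code}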

  was:
Reproduction steps copied from the Slack channel:
 
{code:sql}
// create the table and data in Impala:
CREATE TABLE age_table
(
id BIGINT,
name STRING,
age INT,
PRIMARY KEY(id,name,age)
)
PARTITION BY HASH (id) PARTITIONS 4,
HASH (name) PARTITIONS 4,
range (age)
( 
PARTITION 30 <= VALUES < 60,
PARTITION 60 <= VALUES < 90
) 
STORED AS KUDU 
TBLPROPERTIES ('kudu.num_tablet_replicas' = '1');

ALTER TABLE age_table ADD RANGE PARTITION 90<= VALUES <120
HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3;


INSERT INTO age_table VALUES (3, 'alex', 50);
INSERT INTO age_table VALUES (12, 'bob', 100);

# This query produces wrong results: the expected row for 'bob' isn't returned.
# Note that the troublesome row is in the range partition with custom 
(per-range) hash schema.
sudo -u kudu kudu table scan  default.age_table -columns=id,age 
-predicates='["AND", ["IN", "id", [12,20]]]'
Total count 0 cost 0.0224966 seconds

# This query produces correct results: the expected row for 'alex' is returned.
sudo -u kudu kudu table scan  default.age_table -columns=id,age 
-predicates='["AND", ["IN", "id", [3,20]]]'
(int64 id=3, int32 age=50)
Total count 1 cost 0.0178102 seconds

# However, predicates on the primary key columns seem to work as expected, even 
for the rows in the range with custom hash schema.
sudo -u kudu kudu table scan  default.age_table -columns=id,age 
-predicates='["AND", ["=", "id", 12]]'
(int64 id=12, int32 age=100)
Total count 1 cost 0.0137217 seconds

{code}


> Range specific hashing table when queried with InList predicate may lead to 
> incorrect results
> -
>
> Key: KUDU-3564
> URL: https://issues.apache.org/jira/browse/KUDU-3564
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: YifanZhang
>Priority: Major
>
> Reproduction steps copied from the Slack channel:
>  
> {code:sql}
> -- create the table and data in Impala:
> CREATE TABLE age_table
> (
> id BIGINT,
> name STRING,
> age INT,
> PRIMARY KEY(id,name,age)
> )
> PARTITION BY HASH (id) PARTITIONS 4,
> HASH (name) PARTITIONS 4,
> range (age)
> ( 
> PARTITION 30 <= VALUES < 60,
> PARTITION 60 <= VALUES < 90
> ) 
> STORED AS KUDU 
> TBLPROPERTIES ('kudu.num_tablet_replicas' = '1');
> ALTER TABLE age_table ADD RANGE PARTITION 90<= VALUES <120
> HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3;
> INSERT INTO age_table VALUES (3, 'alex', 50);
> INSERT INTO age_table VALUES (12, 'bob', 100);
> {code}
> Now, let's run a few queries using the {{kudu table scan}} CLI tool:
> {noformat}
> # This query produces wrong results: the expected row for 'bob' isn't 
> returned.
> # Note that the troublesome row is in the range partition with custom 
> (per-range) hash schema.
> $ sudo -u kudu kudu table scan  default.age_table -columns=id,age 
> -predicates='["AND", ["IN", "id", [12,20]]]'
> Total count 0 cost 0.0224966 seconds
> # This query produces correct results: the expected row for 'alex' is 
> returned.
> $ sudo -u kudu kudu table scan  default.age_table -columns=id,age 
> -predicates='["AND", ["IN", "id", 

[jira] [Updated] (KUDU-3564) Range specific hashing table when queried with InList predicate may lead to incorrect results

2024-04-05 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3564:

Description: 
Reproduction steps copied from the Slack channel:
 
{code:sql}
// create the table and data in Impala:
CREATE TABLE age_table
(
id BIGINT,
name STRING,
age INT,
PRIMARY KEY(id,name,age)
)
PARTITION BY HASH (id) PARTITIONS 4,
HASH (name) PARTITIONS 4,
range (age)
( 
PARTITION 30 <= VALUES < 60,
PARTITION 60 <= VALUES < 90
) 
STORED AS KUDU 
TBLPROPERTIES ('kudu.num_tablet_replicas' = '1');

ALTER TABLE age_table ADD RANGE PARTITION 90<= VALUES <120
HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3;


INSERT INTO age_table VALUES (3, 'alex', 50);
INSERT INTO age_table VALUES (12, 'bob', 100);

// This query produces wrong results: the expected row for 'bob' isn't returned.
// Note that the troublesome row is in the range partition with custom 
(per-range) hash schema.
sudo -u kudu kudu table scan  default.age_table -columns=id,age 
-predicates='["AND", ["IN", "id", [12,20]]]'
Total count 0 cost 0.0224966 seconds

// This query produces correct results: the expected row for 'alex' is returned.
sudo -u kudu kudu table scan  default.age_table -columns=id,age 
-predicates='["AND", ["IN", "id", [3,20]]]'
(int64 id=3, int32 age=50)
Total count 1 cost 0.0178102 seconds

// However, predicates on the primary key columns seem to work as expected, 
even for the rows in the range with custom hash schema.
sudo -u kudu kudu table scan  default.age_table -columns=id,age 
-predicates='["AND", ["=", "id", 12]]'
(int64 id=12, int32 age=100)
Total count 1 cost 0.0137217 seconds

{code}

  was:
Reproduction steps copied from the Slack channel:
 
{code:sql}
// create the table and data in Impala:
CREATE TABLE age_table
(
id BIGINT,
name STRING,
age INT,
PRIMARY KEY(id,name,age)
)
PARTITION BY HASH (id) PARTITIONS 4,
HASH (name) PARTITIONS 4,
range (age)
( 
PARTITION 30 <= VALUES < 60,
PARTITION 60 <= VALUES < 90
) 
STORED AS KUDU 
TBLPROPERTIES ('kudu.num_tablet_replicas' = '1');

ALTER TABLE age_table ADD RANGE PARTITION 90<= VALUES <120
HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3;


insert into age_table values (3, 'alex', 50);
insert into age_table values (12, 'bob', 100);

// only the IN predicate cannot find data in the range with the custom hash schema:
sudo -u kudu kudu table scan  default.age_table -columns=id,age 
-predicates='["AND", ["IN", "id", [3,20]]]'
(int64 id=3, int32 age=50)
Total count 1 cost 0.0178102 seconds {code}


> Range specific hashing table when queried with InList predicate may lead to 
> incorrect results
> -
>
> Key: KUDU-3564
> URL: https://issues.apache.org/jira/browse/KUDU-3564
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: YifanZhang
>Priority: Major
>
> Reproduction steps copied from the Slack channel:
>  
> {code:sql}
> // create the table and data in Impala:
> CREATE TABLE age_table
> (
> id BIGINT,
> name STRING,
> age INT,
> PRIMARY KEY(id,name,age)
> )
> PARTITION BY HASH (id) PARTITIONS 4,
> HASH (name) PARTITIONS 4,
> range (age)
> ( 
> PARTITION 30 <= VALUES < 60,
> PARTITION 60 <= VALUES < 90
> ) 
> STORED AS KUDU 
> TBLPROPERTIES ('kudu.num_tablet_replicas' = '1');
> ALTER TABLE age_table ADD RANGE PARTITION 90<= VALUES <120
> HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3;
> INSERT INTO age_table VALUES (3, 'alex', 50);
> INSERT INTO age_table VALUES (12, 'bob', 100);
> // This query produces wrong results: the expected row for 'bob' isn't 
> returned.
> // Note that the troublesome row is in the range partition with custom 
> (per-range) hash schema.
> sudo -u kudu kudu table scan  default.age_table -columns=id,age 
> -predicates='["AND", ["IN", "id", [12,20]]]'
> Total count 0 cost 0.0224966 seconds
> // This query produces correct results: the expected row for 'alex' is 
> returned.
> sudo -u kudu kudu table scan  default.age_table -columns=id,age 
> -predicates='["AND", ["IN", "id", [3,20]]]'
> (int64 id=3, int32 age=50)
> Total count 1 cost 0.0178102 seconds
> // However, predicates on the primary key columns seem to work as expected, 
> even for the rows in the range with custom hash schema.
> sudo -u kudu kudu table scan  default.age_table -columns=id,age 
> -predicates='["AND", ["=", "id", 12]]'
> (int64 id=12, int32 age=100)
> Total count 1 cost 0.0137217 seconds
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3564) Range specific hashing table when queried with InList predicate may lead to incorrect results

2024-04-05 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3564:

Description: 
Reproduction steps copied from the Slack channel:
 
{code:sql}
// create the table and data in Impala:
CREATE TABLE age_table
(
id BIGINT,
name STRING,
age INT,
PRIMARY KEY(id,name,age)
)
PARTITION BY HASH (id) PARTITIONS 4,
HASH (name) PARTITIONS 4,
range (age)
( 
PARTITION 30 <= VALUES < 60,
PARTITION 60 <= VALUES < 90
) 
STORED AS KUDU 
TBLPROPERTIES ('kudu.num_tablet_replicas' = '1');

ALTER TABLE age_table ADD RANGE PARTITION 90<= VALUES <120
HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3;


INSERT INTO age_table VALUES (3, 'alex', 50);
INSERT INTO age_table VALUES (12, 'bob', 100);

# This query produces wrong results: the expected row for 'bob' isn't returned.
# Note that the troublesome row is in the range partition with custom 
(per-range) hash schema.
sudo -u kudu kudu table scan  default.age_table -columns=id,age 
-predicates='["AND", ["IN", "id", [12,20]]]'
Total count 0 cost 0.0224966 seconds

# This query produces correct results: the expected row for 'alex' is returned.
sudo -u kudu kudu table scan  default.age_table -columns=id,age 
-predicates='["AND", ["IN", "id", [3,20]]]'
(int64 id=3, int32 age=50)
Total count 1 cost 0.0178102 seconds

# However, predicates on the primary key columns seem to work as expected, even 
for the rows in the range with custom hash schema.
sudo -u kudu kudu table scan  default.age_table -columns=id,age 
-predicates='["AND", ["=", "id", 12]]'
(int64 id=12, int32 age=100)
Total count 1 cost 0.0137217 seconds

{code}

  was:
Reproduction steps copied from the Slack channel:
 
{code:sql}
// create the table and data in Impala:
CREATE TABLE age_table
(
id BIGINT,
name STRING,
age INT,
PRIMARY KEY(id,name,age)
)
PARTITION BY HASH (id) PARTITIONS 4,
HASH (name) PARTITIONS 4,
range (age)
( 
PARTITION 30 <= VALUES < 60,
PARTITION 60 <= VALUES < 90
) 
STORED AS KUDU 
TBLPROPERTIES ('kudu.num_tablet_replicas' = '1');

ALTER TABLE age_table ADD RANGE PARTITION 90<= VALUES <120
HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3;


INSERT INTO age_table VALUES (3, 'alex', 50);
INSERT INTO age_table VALUES (12, 'bob', 100);

// This query produces wrong results: the expected row for 'bob' isn't returned.
// Note that the troublesome row is in the range partition with custom 
(per-range) hash schema.
sudo -u kudu kudu table scan  default.age_table -columns=id,age 
-predicates='["AND", ["IN", "id", [12,20]]]'
Total count 0 cost 0.0224966 seconds

// This query produces correct results: the expected row for 'alex' is returned.
sudo -u kudu kudu table scan  default.age_table -columns=id,age 
-predicates='["AND", ["IN", "id", [3,20]]]'
(int64 id=3, int32 age=50)
Total count 1 cost 0.0178102 seconds

// However, predicates on the primary key columns seem to work as expected, 
even for the rows in the range with custom hash schema.
sudo -u kudu kudu table scan  default.age_table -columns=id,age 
-predicates='["AND", ["=", "id", 12]]'
(int64 id=12, int32 age=100)
Total count 1 cost 0.0137217 seconds

{code}


> Range specific hashing table when queried with InList predicate may lead to 
> incorrect results
> -
>
> Key: KUDU-3564
> URL: https://issues.apache.org/jira/browse/KUDU-3564
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: YifanZhang
>Priority: Major
>
> Reproduction steps copied from the Slack channel:
>  
> {code:sql}
> // create the table and data in Impala:
> CREATE TABLE age_table
> (
> id BIGINT,
> name STRING,
> age INT,
> PRIMARY KEY(id,name,age)
> )
> PARTITION BY HASH (id) PARTITIONS 4,
> HASH (name) PARTITIONS 4,
> range (age)
> ( 
> PARTITION 30 <= VALUES < 60,
> PARTITION 60 <= VALUES < 90
> ) 
> STORED AS KUDU 
> TBLPROPERTIES ('kudu.num_tablet_replicas' = '1');
> ALTER TABLE age_table ADD RANGE PARTITION 90<= VALUES <120
> HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3;
> INSERT INTO age_table VALUES (3, 'alex', 50);
> INSERT INTO age_table VALUES (12, 'bob', 100);
> # This query produces wrong results: the expected row for 'bob' isn't 
> returned.
> # Note that the troublesome row is in the range partition with custom 
> (per-range) hash schema.
> sudo -u kudu kudu table scan  default.age_table -columns=id,age 
> -predicates='["AND", ["IN", "id", [12,20]]]'
> Total count 0 cost 0.0224966 seconds
> # This query produces correct results: the expected row for 'alex' is 
> returned.
> sudo -u kudu kudu table scan  default.age_table -columns=id,age 
> -predicates='["AND", ["IN", "id", [3,20]]]'
> (int64 id=3, int32 age=50)
> Total count 1 cost 0.0178102 seconds
> # However, predicates on the primary key columns seem to work as expected, 
> even for the rows in the range 

[jira] [Updated] (KUDU-3564) Range specific hashing table when queried with InList predicate may lead to incorrect results

2024-04-05 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3564:

Status: In Review  (was: Open)

> Range specific hashing table when queried with InList predicate may lead to 
> incorrect results
> -
>
> Key: KUDU-3564
> URL: https://issues.apache.org/jira/browse/KUDU-3564
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: YifanZhang
>Priority: Major
>
> Reproduction steps copied from the Slack channel:
>  
> {code:sql}
> // create the table and data in Impala:
> CREATE TABLE age_table
> (
> id BIGINT,
> name STRING,
> age INT,
> PRIMARY KEY(id,name,age)
> )
> PARTITION BY HASH (id) PARTITIONS 4,
> HASH (name) PARTITIONS 4,
> range (age)
> ( 
> PARTITION 30 <= VALUES < 60,
> PARTITION 60 <= VALUES < 90
> ) 
> STORED AS KUDU 
> TBLPROPERTIES ('kudu.num_tablet_replicas' = '1');
> ALTER TABLE age_table ADD RANGE PARTITION 90<= VALUES <120
> HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3;
> insert into age_table values (3, 'alex', 50);
> insert into age_table values (12, 'bob', 100);
> // only the IN predicate cannot find data in the range with the custom hash schema:
> sudo -u kudu kudu table scan  default.age_table -columns=id,age 
> -predicates='["AND", ["IN", "id", [3,20]]]'
> (int64 id=3, int32 age=50)
> Total count 1 cost 0.0178102 seconds {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3564) Range specific hashing table when queried with InList predicate may lead to incorrect results

2024-04-05 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3564:

Code Review: https://gerrit.cloudera.org/#/c/21243/

> Range specific hashing table when queried with InList predicate may lead to 
> incorrect results
> -
>
> Key: KUDU-3564
> URL: https://issues.apache.org/jira/browse/KUDU-3564
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: YifanZhang
>Priority: Major
>
> Reproduction steps copied from the Slack channel:
>  
> {code:sql}
> // create the table and data in Impala:
> CREATE TABLE age_table
> (
> id BIGINT,
> name STRING,
> age INT,
> PRIMARY KEY(id,name,age)
> )
> PARTITION BY HASH (id) PARTITIONS 4,
> HASH (name) PARTITIONS 4,
> range (age)
> ( 
> PARTITION 30 <= VALUES < 60,
> PARTITION 60 <= VALUES < 90
> ) 
> STORED AS KUDU 
> TBLPROPERTIES ('kudu.num_tablet_replicas' = '1');
> ALTER TABLE age_table ADD RANGE PARTITION 90<= VALUES <120
> HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3;
> insert into age_table values (3, 'alex', 50);
> insert into age_table values (12, 'bob', 100);
> // only the IN predicate cannot find data in the range with the custom hash schema:
> sudo -u kudu kudu table scan  default.age_table -columns=id,age 
> -predicates='["AND", ["IN", "id", [3,20]]]'
> (int64 id=3, int32 age=50)
> Total count 1 cost 0.0178102 seconds {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3562) Inconsistency in 'available space' metrics when reserving more than 2GiB

2024-04-05 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3562:

Fix Version/s: 1.18.0
   1.17.1
   Resolution: Fixed
   Status: Resolved  (was: In Review)

> Inconsistency in 'available space' metrics when reserving more than 2GiB
> 
>
> Key: KUDU-3562
> URL: https://issues.apache.org/jira/browse/KUDU-3562
> Project: Kudu
>  Issue Type: Bug
>  Components: master, tserver
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
> Fix For: 1.18.0, 1.17.1
>
>
> When setting the {{\-\-fs_data_dirs_reserved_bytes}} and 
> {{\-\-fs_wal_dir_reserved_bytes}} flags to values greater than 2 GiB, the 
> {{wal_dir_space_available_bytes}} and {{data_dirs_space_available_bytes}} 
> metrics report incorrect numbers for both Kudu masters and tablet servers.  
> The metrics report much more space than is actually available.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3561) Too Many Warning Log: "Entity is not relevant to Prometheus"

2024-04-05 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3561:

Fix Version/s: 1.18.0
   1.17.1
   Resolution: Fixed
   Status: Resolved  (was: In Review)

> Too Many Warning Log: "Entity is not relevant to Prometheus"
> 
>
> Key: KUDU-3561
> URL: https://issues.apache.org/jira/browse/KUDU-3561
> Project: Kudu
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 1.17.0
>Reporter: Liu Lin
>Assignee: Alexey Serbin
>Priority: Major
> Fix For: 1.18.0, 1.17.1
>
> Attachments: image-2024-03-30-22-54-28-735.png
>
>
> I encountered the same issue as KUDU-3549, so I recompiled Kudu using the 
> master branch.
> Now Prometheus is able to collect metrics from Kudu correctly through the 
> /metrics_prometheus endpoint.
> However, I noticed a large number of "Failed to write entity [xxx] as 
> Prometheus: Not found: Entity is not relevant to Prometheus" warning logs in 
> the Tserver's logs.
> !image-2024-03-30-22-54-28-735.png!
> Could you please confirm if this is a code issue? If not, how can I reduce 
> these logs?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3563) Output tablet-level metrics in Prometheus format

2024-04-02 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3563:

Summary: Output tablet-level metrics in Prometheus format  (was: Output 
tablet-level metrics as well in Prometheus format)

> Output tablet-level metrics in Prometheus format
> 
>
> Key: KUDU-3563
> URL: https://issues.apache.org/jira/browse/KUDU-3563
> Project: Kudu
>  Issue Type: Improvement
>  Components: master, server, tserver
>Reporter: Alexey Serbin
>Priority: Major
>  Labels: metrics, prometheus, supportability, visibility
>
> The request to support outputting Kudu metrics in Prometheus format is 
> tracked in [KUDU-3375|https://issues.apache.org/jira/browse/KUDU-3375].  The 
> [first take on 
> this|https://github.com/apache/kudu/commit/00efc6826ac9a1f5d10750296c7357790a041fec]
>  has taken care of the server-level metrics, ignoring all the tablet-level 
> metrics.
> In the scope of this JIRA item, it's necessary to output all the tablet-level 
> metrics in Prometheus format as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KUDU-3563) Output tablet-level metrics as well in Prometheus format

2024-04-02 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3563:
---

 Summary: Output tablet-level metrics as well in Prometheus format
 Key: KUDU-3563
 URL: https://issues.apache.org/jira/browse/KUDU-3563
 Project: Kudu
  Issue Type: Improvement
  Components: master, server, tserver
Reporter: Alexey Serbin


The request to support outputting Kudu metrics in Prometheus format is tracked in 
[KUDU-3375|https://issues.apache.org/jira/browse/KUDU-3375].  The [first take 
on 
this|https://github.com/apache/kudu/commit/00efc6826ac9a1f5d10750296c7357790a041fec]
 has taken care of the server-level metrics, ignoring all the tablet-level 
metrics.

In the scope of this JIRA item, it's necessary to output all the tablet-level 
metrics in Prometheus format as well.
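
For illustration only (the metric and label names below are hypothetical, not 
a committed design), tablet-level metrics in the Prometheus exposition format 
would presumably carry the tablet id as a label, so that series from different 
tablets remain distinguishable:
{noformat}
kudu_tablet_rows_inserted{tablet_id="0f8f4920ba514624abc294c7c64725c1"} 12345
kudu_tablet_rows_inserted{tablet_id="a279f6e3715d437d935d0bd79788c591"} 678
{noformat}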



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3562) Inconsistency in 'available space' metrics when reserving more than 2GiB

2024-04-01 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3562:

Status: In Review  (was: Open)

> Inconsistency in 'available space' metrics when reserving more than 2GiB
> 
>
> Key: KUDU-3562
> URL: https://issues.apache.org/jira/browse/KUDU-3562
> Project: Kudu
>  Issue Type: Bug
>  Components: master, tserver
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>
> When setting the {{\-\-fs_data_dirs_reserved_bytes}} and 
> {{\-\-fs_wal_dir_reserved_bytes}} flags to values greater than 2 GiB, the 
> {{wal_dir_space_available_bytes}} and {{data_dirs_space_available_bytes}} 
> metrics report incorrect numbers for both Kudu masters and tablet servers.  
> The metrics report much more space than is actually available.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3562) Inconsistency in 'available space' metrics when reserving more than 2GiB

2024-04-01 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3562:

Code Review: https://gerrit.cloudera.org/#/c/21227/

> Inconsistency in 'available space' metrics when reserving more than 2GiB
> 
>
> Key: KUDU-3562
> URL: https://issues.apache.org/jira/browse/KUDU-3562
> Project: Kudu
>  Issue Type: Bug
>  Components: master, tserver
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>
> When setting the {{\-\-fs_data_dirs_reserved_bytes}} and 
> {{\-\-fs_wal_dir_reserved_bytes}} flags to values greater than 2 GiB, the 
> {{wal_dir_space_available_bytes}} and {{data_dirs_space_available_bytes}} 
> metrics report incorrect numbers for both Kudu masters and tablet servers.  
> The metrics report much more space than is actually available.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KUDU-3562) Inconsistency in 'available space' metrics when reserving more than 2GiB

2024-04-01 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3562:
---

 Summary: Inconsistency in 'available space' metrics when reserving 
more than 2GiB
 Key: KUDU-3562
 URL: https://issues.apache.org/jira/browse/KUDU-3562
 Project: Kudu
  Issue Type: Bug
  Components: master, tserver
Affects Versions: 1.17.0, 1.16.0
Reporter: Alexey Serbin
Assignee: Alexey Serbin


When setting the {{\-\-fs_data_dirs_reserved_bytes}} and 
{{\-\-fs_wal_dir_reserved_bytes}} flags to values greater than 2 GiB, the 
{{wal_dir_space_available_bytes}} and {{data_dirs_space_available_bytes}} 
metrics report incorrect numbers for both Kudu masters and tablet servers.  
The metrics report much more space than is actually available.
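
The symptom is consistent with a byte count being squeezed through 32-bit 
arithmetic somewhere on the reporting path.  A minimal sketch of that failure 
class (hypothetical, not the actual Kudu code):
{code:cpp}
#include <cstdint>
#include <iostream>

int main() {
  // Anything above INT32_MAX (2 GiB - 1 byte) no longer fits into int32_t.
  const int64_t reserved_bytes = 3LL * 1024 * 1024 * 1024;  // 3 GiB
  const int32_t truncated = static_cast<int32_t>(reserved_bytes);

  std::cout << "reserved as int64: " << reserved_bytes << '\n'
            << "after int32 truncation: " << truncated << '\n';
  // On typical two's-complement platforms this prints -1073741824: a negative
  // "reserved" value inflates the computed available space, matching the
  // symptom described above.
  return 0;
}
{code}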



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3561) Too Many Warning Log: "Entity is not relevant to Prometheus"

2024-04-01 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3561:

Affects Version/s: (was: 1.18.0)

> Too Many Warning Log: "Entity is not relevant to Prometheus"
> 
>
> Key: KUDU-3561
> URL: https://issues.apache.org/jira/browse/KUDU-3561
> Project: Kudu
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 1.17.0
>Reporter: Liu Lin
>Priority: Major
> Attachments: image-2024-03-30-22-54-28-735.png
>
>
> I encountered the same issue as KUDU-3549, so I recompiled Kudu using the 
> master branch.
> Now Prometheus is able to collect metrics from Kudu correctly through the 
> /metrics_prometheus endpoint.
> However, I noticed a large number of "Failed to write entity [xxx] as 
> Prometheus: Not found: Entity is not relevant to Prometheus" warning logs in 
> the Tserver's logs.
> !image-2024-03-30-22-54-28-735.png!
> Could you please confirm if this is a code issue? If not, how can I reduce 
> these logs?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KUDU-3561) Too Many Warning Log: "Entity is not relevant to Prometheus"

2024-04-01 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-3561:
---

Assignee: Alexey Serbin

> Too Many Warning Log: "Entity is not relevant to Prometheus"
> 
>
> Key: KUDU-3561
> URL: https://issues.apache.org/jira/browse/KUDU-3561
> Project: Kudu
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 1.17.0
>Reporter: Liu Lin
>Assignee: Alexey Serbin
>Priority: Major
> Attachments: image-2024-03-30-22-54-28-735.png
>
>
> I encountered the same issue as KUDU-3549, so I recompiled Kudu using the 
> master branch.
> Now Prometheus is able to collect metrics from Kudu correctly through the 
> /metrics_prometheus endpoint.
> However, I noticed a large number of "Failed to write entity [xxx] as 
> Prometheus: Not found: Entity is not relevant to Prometheus" warning logs in 
> the Tserver's logs.
> !image-2024-03-30-22-54-28-735.png!
> Could you please confirm if this is a code issue? If not, how can I reduce 
> these logs?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3561) Too Many Warning Log: "Entity is not relevant to Prometheus"

2024-04-01 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3561:

Code Review: https://gerrit.cloudera.org/#/c/21226/

> Too Many Warning Log: "Entity is not relevant to Prometheus"
> 
>
> Key: KUDU-3561
> URL: https://issues.apache.org/jira/browse/KUDU-3561
> Project: Kudu
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 1.17.0
>Reporter: Liu Lin
>Priority: Major
> Attachments: image-2024-03-30-22-54-28-735.png
>
>
> I encountered the same issue as KUDU-3549, so I recompiled Kudu using the 
> master branch.
> Now Prometheus is able to collect metrics from Kudu correctly through the 
> /metrics_prometheus endpoint.
> However, I noticed a large number of "Failed to write entity [xxx] as 
> Prometheus: Not found: Entity is not relevant to Prometheus" warning logs in 
> the Tserver's logs.
> !image-2024-03-30-22-54-28-735.png!
> Could you please confirm if this is a code issue? If not, how can I reduce 
> these logs?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3561) Too Many Warning Log: "Entity is not relevant to Prometheus"

2024-04-01 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3561:

Status: In Review  (was: In Progress)

> Too Many Warning Log: "Entity is not relevant to Prometheus"
> 
>
> Key: KUDU-3561
> URL: https://issues.apache.org/jira/browse/KUDU-3561
> Project: Kudu
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 1.17.0
>Reporter: Liu Lin
>Assignee: Alexey Serbin
>Priority: Major
> Attachments: image-2024-03-30-22-54-28-735.png
>
>
> I encountered the same issue as KUDU-3549, so I recompiled Kudu using the 
> master branch.
> Now Prometheus is able to collect metrics from Kudu correctly through the 
> /metrics_prometheus endpoint.
> However, I noticed a large number of "Failed to write entity [xxx] as 
> Prometheus: Not found: Entity is not relevant to Prometheus" warning logs in 
> the Tserver's logs.
> !image-2024-03-30-22-54-28-735.png!
> Could you please confirm if this is a code issue? If not, how can I reduce 
> these logs?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KUDU-3498) Scanner keeps alive in periodically

2024-03-28 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3498.
-
Fix Version/s: 1.18.0
   Resolution: Fixed

> Scanner keeps alive in periodically
> ---
>
> Key: KUDU-3498
> URL: https://issues.apache.org/jira/browse/KUDU-3498
> Project: Kudu
>  Issue Type: Improvement
>Reporter: Xixu Wang
>Priority: Major
> Fix For: 1.18.0
>
>
> Kudu caches the scanner id in the tablet server to continue reading. The 
> scanner expires if its idle time exceeds the configured scanner TTL. 
> Sometimes the client reads a batch of data and, if the batch is very large, 
> takes a long time to process it. When the client then reads the next batch 
> using the same scanner, the scanner may already have expired even though it 
> has sent keep-alive requests.
> Here is an example:
> /main/logs/sp/kudu/tserver/3.kudu.log.INFO.20230731-143052.2665:I0731 
> 14:57:19.307266  9280 scanners.cc:280] Expiring scanner id: 
> a279f6e3715d437d935d0bd79788c591, of tablet 0f8f4920ba514624abc294c7c64725c1, 
> after 184023 ms of inactivity, which is > TTL (18 ms).
> /main/logs/sp/kudu/tserver/kudu.log.INFO.20230731-143052.2665:I0731 
> 15:03:07.419070  9289 tablet_service.cc:2957] Scan: Not found: Scanner 
> a279f6e3715d437d935d0bd79788c591 not found (it may have expired): call 
> sequence id=10, remote=\{username='impala'} at host:26278
>  
> The client takes 9 minutes to process a batch of data, but by then the 
> scanner has already expired.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3486) Tserver: Too many tombstone tablet may lead to high memory usage.

2024-03-22 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3486:

Fix Version/s: 1.17.1

> Tserver: Too many tombstone tablet may lead to high memory usage.
> -
>
> Key: KUDU-3486
> URL: https://issues.apache.org/jira/browse/KUDU-3486
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.14.0
>Reporter: Song Jiacheng
>Priority: Minor
> Fix For: 1.18.0, 1.17.1
>
> Attachments: image-2023-07-06-15-59-44-181.png
>
>
> There are two kinds of tablet replica deletion: tombstone and delete. A 
> tombstone tablet replica might never be deleted since the delete-type 
> deletion could only occur when the tablet is deleted, and the requests will 
> be sent to the voters, not including the tombstone ones. 
> Here is an example:
> Tablet T:
> replica A
> replica B
> replica C
> After rebalance:
> replica A
> replica B
> replica C(Tombstone)
> replica D
> When the tablet T is deleted, A B D are deleted, and C exists forever.
> As the picture shows, the tablet had already been deleted at 3:00 am on 13th 
> Jun, but the tombstone replica still exists.
> !image-2023-07-06-15-59-44-181.png|width=568,height=261! 
> The data of a tombstone replica is deleted, but its metadata is persisted in 
> memory; the biggest part of it, the SchemaPB, can occupy a lot of memory.
> In some of our clusters, tombstone replicas on each tserver can reach 50k ~ 
> 100k, which takes about 10G.
> It would take too many resources to add a vector for each tablet storing the 
> history of tablet servers that used to hold a replica of the tablet. So I 
> think a periodic heartbeat might be a good way to solve the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KUDU-3486) Tserver: Too many tombstone tablet may lead to high memory usage.

2024-03-20 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3486.
-
Fix Version/s: 1.18.0
   Resolution: Fixed

> Tserver: Too many tombstone tablet may lead to high memory usage.
> -
>
> Key: KUDU-3486
> URL: https://issues.apache.org/jira/browse/KUDU-3486
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.14.0
>Reporter: Song Jiacheng
>Priority: Minor
> Fix For: 1.18.0
>
> Attachments: image-2023-07-06-15-59-44-181.png
>
>
> There are two kinds of tablet replica deletion: tombstone and delete. A 
> tombstone tablet replica might never be deleted since the delete-type 
> deletion could only occur when the tablet is deleted, and the requests will 
> be sent to the voters, not including the tombstone ones. 
> Here is an example:
> Tablet T:
> replica A
> replica B
> replica C
> After rebalance:
> replica A
> replica B
> replica C(Tombstone)
> replica D
> When the tablet T is deleted, A B D are deleted, and C exists forever.
> As the picture shows, the tablet had already been deleted at 3:00 am on 13th 
> Jun, but the tombstone replica still exists.
> !image-2023-07-06-15-59-44-181.png|width=568,height=261! 
> The data of a tombstone replica is deleted, but its metadata is persisted in 
> memory; the biggest part of it, the SchemaPB, can occupy a lot of memory.
> In some of our clusters, tombstone replicas on each tserver can reach 50k ~ 
> 100k, which takes about 10G.
> It would take too many resources to add a vector for each tablet storing the 
> history of tablet servers that used to hold a replica of the tablet. So I 
> think a periodic heartbeat might be a good way to solve the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KUDU-3496) Support for SPNEGO dedicated keytab

2024-03-20 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3496.
-
Fix Version/s: 1.18.0
   Resolution: Fixed

> Support for SPNEGO dedicated keytab
> ---
>
> Key: KUDU-3496
> URL: https://issues.apache.org/jira/browse/KUDU-3496
> Project: Kudu
>  Issue Type: Improvement
>Reporter: halim kim
>Priority: Minor
> Fix For: 1.18.0
>
>
> Same as Apache Impala : https://issues.apache.org/jira/browse/IMPALA-12318
> Supporting seperation of SPNEGO keytab and service keytab will be useful.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KUDU-3526) [java] Scanner should bound with a tserver in java client.

2024-03-20 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3526.
-
Fix Version/s: 1.18.0
   1.17.1
   Resolution: Fixed

> [java] Scanner should bound with a tserver in java client.
> --
>
> Key: KUDU-3526
> URL: https://issues.apache.org/jira/browse/KUDU-3526
> Project: Kudu
>  Issue Type: Bug
>Reporter: Song Jiacheng
>Priority: Major
> Fix For: 1.18.0, 1.17.1
>
>
> We always hit the "scanner not found" exception while using Presto + Kudu, 
> even though we have set the scanner timeout very high.
> It turns out the scanner is not bound to the tserver it first 
> communicates with. 
> Here is the code in scanNextRows of java client:
> {code:java}
> final ServerInfo info = 
> tablet.getReplicaSelectedServerInfo(nextRequest.getReplicaSelection(),
> location); {code}
> It is still trying to find the tserver by location and selection policy. So 
> if the leader changes and the next scan request is sent to the new leader, 
> the tserver will respond with the "scanner not found" exception.
> We should bind the scanner to the tserver, as is done in the C++ 
> client.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KUDU-3527) Fix BlockManagerTest.TestMetadataOkayDespiteFailure on rhel 8.8 graviton

2024-03-20 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3527.
-
Fix Version/s: 1.18.0
   Resolution: Fixed

> Fix BlockManagerTest.TestMetadataOkayDespiteFailure on rhel  8.8 graviton
> -
>
> Key: KUDU-3527
> URL: https://issues.apache.org/jira/browse/KUDU-3527
> Project: Kudu
>  Issue Type: Bug
>Reporter: Zoltan Martonka
>Assignee: Zoltan Martonka
>Priority: Major
> Fix For: 1.18.0
>
>
> BlockManagerTest.TestMetadataOkayDespiteFailure might fail on systems where 
> fs_block_size=64k.
> *Cause:*
> Currently tablets fail to load if a metadata file is missing but there is still 
> a non-empty ".data" file. If FLAGS_env_inject_eio is not zero, then there is 
> a chance that, when we delete the container file, we only delete the ".meta", 
> but leave the ".data" file.
> In the current test on systems with fs_block_size=4k deletion never occurs. 
> Changing to kNumAppends=64 will cause the test to randomly fail on x86 
> systems too, although only with a 2-3% chance (at least on my ubuntu20 
> machine).
> *Solution:*
> This test was not intended to test the file deletion itself (as it does not 
> do it on x86_64 or 4k ARM kernels). The deletion only occurs because 
> _FLAGS_log_container_max_size = 256 * 1024;_ is not "large enough".
> We should just set _FLAGS_log_block_manager_delete_dead_container = false;_ 
> to restore the original scope of the test.
> There is a separate issue for the root cause (which is not arm specific at 
> all):
> https://issues.apache.org/jira/browse/KUDU-3528



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KUDU-3558) Error out when ranger subprocess is started with kerberos disabled

2024-03-20 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3558.
-
Fix Version/s: 1.18.0
   Resolution: Fixed

> Error out when ranger subprocess is started with kerberos disabled
> --
>
> Key: KUDU-3558
> URL: https://issues.apache.org/jira/browse/KUDU-3558
> Project: Kudu
>  Issue Type: Bug
>Reporter: Ashwani Raina
>Assignee: Ashwani Raina
>Priority: Minor
> Fix For: 1.18.0
>
>
> Today, when a Kudu cluster (with authentication disabled) is started with 
> Kudu Ranger, the Ranger subprocess doesn't start because of a missing keytab 
> file.
> We do catch this error, but it happens pretty late in the Java subprocess 
> init routine, and it can be pretty confusing during investigation if one is 
> not looking at the right log file.
> The error message looks like this in the stderr log file:
> ++
> Exception in thread "main" 
> org.apache.kudu.subprocess.KuduSubprocessException: Kudu principal and Keytab 
> file must be provided when Kerberos is enabled in Ranger
> at 
> org.apache.kudu.subprocess.ranger.authorization.RangerKuduAuthorizer.init(RangerKuduAuthorizer.java:78)
> at 
> org.apache.kudu.subprocess.ranger.RangerProtocolHandler.(RangerProtocolHandler.java:45)
> at 
> org.apache.kudu.subprocess.ranger.RangerSubprocessMain.main(RangerSubprocessMain.java:39)
> ++
> This Jira will be used to detect this error early in the process and 
> log actionable information in the Kudu master logs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KUDU-2696) libgmock is linked into the kudu cli binary

2024-03-20 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-2696.
-
Fix Version/s: 1.18.0
   Resolution: Fixed

The kudu CLI tool can now be built without libgmock linked in.  To do so, it's 
necessary to run cmake with the extra flags {{NO_TESTS=1}} and 
{{KUDU_CLI_TEST_TOOL_ENABLED=0}}.

For details, see the following changelists:
* 
[7f9b74ff6|https://github.com/apache/kudu/commit/7f9b74ff665d3593e005e31a50d0029f4f2e2619]
* 
[83cec9c1e|https://github.com/apache/kudu/commit/83cec9c1ecdd17967bef39f9d990c9c31dc4302c]

For example, to have such a build in DEBUG configuration, configure the project 
with the following command line from the build directory:
{noformat}
../../build-support/enable_devtoolset.sh \
../../thirdparty/installed/common/bin/cmake \
-DCMAKE_BUILD_TYPE=debug \
-DNO_TESTS=1 \
-DKUDU_CLI_TEST_TOOL_ENABLED=0 \
../..
{noformat}
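After configuring and building with those flags, one can verify the result the
same way the issue description below demonstrates the problem (assuming a
Linux build tree):

{noformat}
$ ldd bin/kudu | grep mock   # expect no output
{noformat}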

> libgmock is linked into the kudu cli binary
> ---
>
> Key: KUDU-2696
> URL: https://issues.apache.org/jira/browse/KUDU-2696
> Project: Kudu
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.8.0
>Reporter: Mike Percy
>Priority: Minor
> Fix For: 1.18.0
>
>
> libgmock is linked into the kudu cli binary, even though we consider it a 
> test-only dependency. Possibly a configuration problem in our cmake files?
> {code:java}
> $ ldd build/dynclang/bin/kudu | grep mock
>  libgmock.so => 
> /home/mpercy/src/kudu/thirdparty/installed/uninstrumented/lib/libgmock.so 
> (0x7f01f1495000)
> {code}
> The gmock dependency does not appear in the server binaries, as expected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KUDU-3559) AutoRebalancerTest.TestMaxMovesPerServer is flaky

2024-02-28 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3559:
---

 Summary: AutoRebalancerTest.TestMaxMovesPerServer is flaky
 Key: KUDU-3559
 URL: https://issues.apache.org/jira/browse/KUDU-3559
 Project: Kudu
  Issue Type: Bug
  Components: test
Affects Versions: 1.17.0
Reporter: Alexey Serbin
 Attachments: auto_rebalancer-test.log.xz

The {{AutoRebalancerTest.TestMaxMovesPerServer}} scenario is flaky, sometimes 
failing with an error like the one below.  The full log is attached.

{noformat}
src/kudu/master/auto_rebalancer-test.cc:196: Failure
Expected equality of these values:
  0
  NumMovesScheduled(leader_idx, BalanceThreadType::LEADER_REBALANCE)
Which is: 1
src/kudu/util/test_util.cc:395: Failure
Failed
Timed out waiting for assertion to pass.
src/kudu/master/auto_rebalancer-test.cc:575: Failure
Expected: CheckNoLeaderMovesScheduled() doesn't generate new fatal failures in 
the current thread.
  Actual: it does.
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KUDU-3556) Auto-tune number of worker threads per Kudu server's RPC interface

2024-02-21 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3556:
---

 Summary: Auto-tune number of worker threads per Kudu server's RPC 
interface
 Key: KUDU-3556
 URL: https://issues.apache.org/jira/browse/KUDU-3556
 Project: Kudu
  Issue Type: Improvement
  Components: master, server, tserver
Reporter: Alexey Serbin


As of now, the value for the {{\-\-rpc_num_service_threads}} flag is hard-coded 
(but is different for masters vs tablet servers).  Manual tuning is required 
to utilize the available CPU cores/threads more efficiently on Kudu servers.  
It would be nice to have a mode where the number of RPC worker threads is 
adaptively configured to be 80% of all the available CPU cores/threads.  The 
exact utilization ratio of CPU cores for RPC workers might be a parameter 
defined by a new flag with a default value of 0.8.
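A minimal sketch of the proposed auto-tuning (the 0.8 ratio and the flag
semantics come from the description above; the function itself is
illustrative, not an existing API):

{noformat}
#include <algorithm>
#include <thread>

// Derive the RPC worker thread count from the available CPUs, reserving the
// rest for other work.  hardware_concurrency() may return 0 when the value
// isn't computable, so fall back to a single worker in that case.
int AutoTunedRpcWorkerCount(double cpu_utilization_ratio = 0.8) {
  const unsigned int num_cpus = std::thread::hardware_concurrency();
  if (num_cpus == 0) {
    return 1;
  }
  return std::max(1, static_cast<int>(num_cpus * cpu_utilization_ratio));
}
{noformat}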



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KUDU-2439) There's no way to safely clean up a KuduClient or Messenger

2024-02-08 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815877#comment-17815877
 ] 

Alexey Serbin commented on KUDU-2439:
-

Below is one more trace where the issue most likely caused a crash of the kudu 
CLI binary.  That was a TSAN build from sources in the 1.17.x branch, built 
from commit d9371b11fee23ae23d4a00f156d43f5a0beeb0c0:

{noformat}
/home/jenkins-slave/workspace/kudu-branch-1.17.x/0/src/kudu/tools/kudu-admin-test.cc:3085:
 Failure
Value of: s.ok()
  Actual: false
Expected: true
Runtime error: /tmp/dist-test-taskL_Beoo/build/tsan/bin/kudu: process exited on 
signal 11 (core dumped)
stdout: 
stderr: W20240208 05:41:40.006017  9002 flags.cc:413] Enabled unsafe flag: 
--openssl_security_level_override=0
W20240208 05:41:40.006431  9002 flags.cc:413] Enabled unsafe flag: 
--never_fsync=true
W20240208 05:41:40.991340  9002 thread.cc:633] rpc reactor (reactor) Time spent 
creating pthread: real 0.928s   user 0.001s sys 0.000s
W20240208 05:41:40.991601  9002 thread.cc:599] rpc reactor (reactor) Time spent 
starting thread: real 0.928suser 0.001s sys 0.000s
*** Aborted at 1707370901 (unix time) try "date -d @1707370901" if you are using
 GNU date ***
PC: @0x0 (unknown)
*** SIGSEGV (@0x18) received by PID 9002 (TID 0x7fd8d7c84700) from PID 24; 
stack trace: ***
@   0x443be0 __tsan::CallUserSignalHandler() at /home/jenkins-slave/
workspace/kudu-branch-1.17.x/0/src/kudu/tools/kudu-admin-test.cc:1900
@   0x4460c3 rtl_sigaction() at 
/home/jenkins-slave/workspace/kudu-branch-1.17.x/0/src/kudu/tools/kudu-admin-test.cc:2017
@ 0x7fd8e6686980 (unknown) at ??:0
@ 0x7fd8e668126a __GI___pthread_rwlock_unlock at ??:0
@   0x4713f3 __interceptor_pthread_rwlock_unlock at 
/home/jenkins-slave/workspace/kudu-branch-1.17.x/0/thirdparty/installed/tsan/include/c++/v1/functional:?
@ 0x7fd8e3f6e8a9 CRYPTO_THREAD_unlock at ??:0
@ 0x7fd8e3f0481f CRYPTO_free_ex_data at ??:0
@ 0x7fd8e3b39ae2 SSL_CTX_free at ??:0
@ 0x7fd8e5a27888 _ZNSt3__18__invokeIRPFvP10ssl_ctx_stEJS2_EEEDTclclsr3st
d3__1E7forwardIT_Efp_Espclsr3std3__1E7forwardIT0_Efp0_EEEOS6_DpOS7_ at ??:0
@ 0x7fd8e5a277f4 std::__1::__invoke_void_return_wrapper<>::__call<>() 
at ??:0
@ 0x7fd8e5a277a4 std::__1::__function::__alloc_func<>::operator()() at 
??:0
@ 0x7fd8e5a263fd std::__1::__function::__func<>::operator()() at ??:0
@ 0x7fd8e9cc17c7 std::__1::__function::__value_func<>::operator()() at 
??:0
@ 0x7fd8e9cc16fc std::__1::function<>::operator()() at ??:0
@ 0x7fd8e9cc160e std::__1::unique_ptr<>::reset() at ??:0
@ 0x7fd8e9cc120c std::__1::unique_ptr<>::~unique_ptr() at ??:0
@ 0x7fd8e9cc116a kudu::security::TlsContext::~TlsContext() at ??:0
@ 0x7fd8e9cc10bf std::__1::default_delete<>::operator()() at ??:0
@ 0x7fd8e9cc102e std::__1::unique_ptr<>::reset() at ??:0
@ 0x7fd8e9cb46ac std::__1::unique_ptr<>::~unique_ptr() at ??:0
@ 0x7fd8e9cb13cf kudu::rpc::Messenger::~Messenger() at ??:0
@ 0x7fd8e9cc046f std::__1::default_delete<>::operator()() at ??:0
@ 0x7fd8e9cc0213 std::__1::__shared_ptr_pointer<>::__on_zero_shared() 
at ??:0
@   0x4d2776 std::__1::__shared_count::__release_shared() at 
/home/jenkins-slave/workspace/kudu-branch-1.17.x/0/thirdparty/installed/tsan/include/c++/v1/vector:880
@   0x4d271a std::__1::__shared_weak_count::__release_shared() at 
/home/jenkins-slave/workspace/kudu-branch-1.17.x/0/thirdparty/installed/tsan/include/c++/v1/memory:2203
@   0x569a29 std::__1::shared_ptr<>::~shared_ptr() at ??:?
@ 0x7fd8e9cb3076 std::__1::shared_ptr<>::reset() at ??:0
@ 0x7fd8e9cde1ce kudu::rpc::ReactorThread::RunThread() at ??:0
@ 0x7fd8e9ce0eb2 kudu::rpc::ReactorThread::Init()::$_0::operator()() at 
??:0
@ 0x7fd8e9ce0e6a 
_ZNSt3__18__invokeIRZN4kudu3rpc13ReactorThread4InitEvE3$_0JEEEDTclclsr3std3__1E7forwardIT_Efp_Espclsr3std3__1E7forwardIT0_Efp0_EEEOS6_DpOS7_
 at ??:0
@ 0x7fd8e9ce0dfa std::__1::__invoke_void_return_wrapper<>::__call<>() 
at ??:0
@ 0x7fd8e9ce0dc2 std::__1::__function::__alloc_func<>::operator()() at 
??:0


I20240208 05:41:41.334489  6626 external_mini_cluster.cc:1554] Killing 
/tmp/dist-test-taskL_Beoo/build/tsan/bin/kudu with pid 8826
{noformat}

The full log is attached for convenience:  [^kudu-admin-test.7.txt.xz] 

> There's no way to safely clean up a KuduClient or Messenger
> ---
>
> Key: KUDU-2439
> URL: https://issues.apache.org/jira/browse/KUDU-2439
> Project: Kudu
>  Issue Type: Bug
>  Components: client, rpc
>Affects Versions: 1.8.0
>Reporter: Adar Dembo
>Priority: Major
> Attachments: 

[jira] [Updated] (KUDU-2439) There's no way to safely clean up a KuduClient or Messenger

2024-02-08 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2439:

Attachment: kudu-admin-test.7.txt.xz

> There's no way to safely clean up a KuduClient or Messenger
> ---
>
> Key: KUDU-2439
> URL: https://issues.apache.org/jira/browse/KUDU-2439
> Project: Kudu
>  Issue Type: Bug
>  Components: client, rpc
>Affects Versions: 1.8.0
>Reporter: Adar Dembo
>Priority: Major
> Attachments: kudu-admin-test.7.txt.xz
>
>
> KuduClient has shared ownership, and its only "shutdown" knob is to drop its 
> last ref. This drops the last ref on the underlying Messenger object, but 
> Messenger itself has a funky "internal" vs. "external" ref system, and 
> destroying the KuduClient only drops the last external ref. The Messenger is 
> only destroyed when the last internal ref is dropped, and that only happens 
> when outstanding reactor threads finish whatever processing they were busy 
> doing. So, there's no way for a user to "destroy this KuduClient and wait for 
> all outstanding resources to be cleaned up".
> Why is this important? For one, there's a known data race with outstanding 
> work done by the KuduClient's DnsResolver. For two, there's a similar issue 
> with OpenSSL. OpenSSL 1.1 registers an atexit() handler that tears down its 
> global library state. For this to be safe, all allocated OpenSSL resources 
> must have been released by the time atexit() handlers run (which is to say, 
> all OpenSSL resources must be released by the time main() returns or exit() 
> is called). Because we can't wait on the KuduClient to destroy itself and its 
> OpenSSL resources, applications using the KuduClient may run the atexit() 
> handler at an unsafe time.
> Here's the TSAN output for the OpenSSL data race; it's trivial to reproduce 
> via reactor-test once the appropriate suppression is removed from 
> tsan-suppressions.txt:
> {noformat}
> WARNING: ThreadSanitizer: data race (pid=7914)
> Write of size 1 at 0x7b104340 by main thread:
> #0 pthread_rwlock_destroy 
> /home/adar/Source/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1313
>  (reactor-test+0x48205e)
> #1 CRYPTO_THREAD_lock_free 
> /home/adar/openssl/openssl-1.1.0g/build_shared/../crypto/threads_pthread.c:95 
> (libcrypto.so.1.1+0x20d0e5)
> Previous atomic read of size 1 at 0x7b104340 by thread T16:
> #0 pthread_rwlock_wrlock 
> /home/adar/Source/kudu/thirdparty/src/llvm-6.0.0.src/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1352
>  (reactor-test+0x45ffe3)
> #1 CRYPTO_THREAD_write_lock 
> /home/adar/openssl/openssl-1.1.0g/build_shared/../crypto/threads_pthread.c:66 
> (libcrypto.so.1.1+0x20d08a)
> #2 std::__1::__function::__func std::__1::allocator, void 
> (ssl_ctx_st*)>::operator()(ssl_ctx_st*&&) 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/c++/v1/functional:1562:12
>  (libsecurity.so+0x56b14)
> #3 std::__1::function::operator()(ssl_ctx_st*) const 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/c++/v1/functional:1916:12
>  (libkrpc.so+0xb139d)
> #4 std::__1::unique_ptr 
> >::reset(ssl_ctx_st*) 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/c++/v1/memory:2598:7 
> (libkrpc.so+0xb0f9e)
> #5 std::__1::unique_ptr 
> >::~unique_ptr() 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/c++/v1/memory:2552 
> (libkrpc.so+0xb0f9e)
> #6 kudu::security::TlsContext::~TlsContext() 
> /home/adar/Source/kudu/build/tsan/../../src/kudu/security/tls_context.h:77 
> (libkrpc.so+0xb0f9e)
> #7 
> std::__1::default_delete::operator()(kudu::security::TlsContext*)
>  const 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/c++/v1/memory:2285:5 
> (libkrpc.so+0xab32c)
> #8 std::__1::unique_ptr std::__1::default_delete 
> >::reset(kudu::security::TlsContext*) 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/c++/v1/memory:2598 
> (libkrpc.so+0xab32c)
> #9 std::__1::unique_ptr std::__1::default_delete >::~unique_ptr() 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/c++/v1/memory:2552 
> (libkrpc.so+0xab32c)
> #10 kudu::rpc::Messenger::~Messenger() 
> /home/adar/Source/kudu/build/tsan/../../src/kudu/rpc/messenger.cc:433 
> (libkrpc.so+0xab32c)
> #11 
> std::__1::default_delete::operator()(kudu::rpc::Messenger*)
>  const 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/c++/v1/memory:2285:5 
> (libkrpc.so+0xb0ddb)
> #12 std::__1::__shared_ptr_pointer std::__1::default_delete, 
> std::__1::allocator >::__on_zero_shared() 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3586 
> (libkrpc.so+0xb0ddb)
> #13 std::__1::__shared_count::__release_shared() 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/c++/v1/memory:3490:9 
> 

[jira] [Resolved] (KUDU-3535) [tserver] Should clear log cache while tombstoning a replica.

2024-02-08 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3535.
-
Fix Version/s: 1.18.0
   1.17.1
   Resolution: Fixed

> [tserver] Should clear log cache while tombstoning a replica.
> -
>
> Key: KUDU-3535
> URL: https://issues.apache.org/jira/browse/KUDU-3535
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Reporter: Song Jiacheng
>Priority: Major
> Fix For: 1.18.0, 1.17.1
>
> Attachments: image-2023-12-20-15-04-14-249.png, 
> image-2023-12-20-15-05-02-970.png, image-2023-12-20-15-05-21-394.png
>
>
> The log cache of a replica still exists even if the replica has been 
> tombstoned.  The 2 pictures below show the problem.
> !image-2023-12-20-15-05-02-970.png|width=372,height=171!
> !image-2023-12-20-15-05-21-394.png|width=369,height=184!
> We should clear the log cache while deleting the replica with delete type 
> {{TABLET_DATA_TOMBSTONED}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KUDU-3549) String gauge exposed in prometheus format

2024-02-07 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815382#comment-17815382
 ] 

Alexey Serbin commented on KUDU-3549:
-

[~eub],

Thank you very much for reporting the issue!

The bug has been fixed.
The fix is available in both the 1.17.x and main branches of the Kudu git repo.
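For context, a minimal sketch of the kind of filtering such a fix involves
(class and method names here are hypothetical; the actual change is in the
Gerrit patch referenced earlier in this issue):

{noformat}
// Prometheus sample values must be numeric, so a string gauge cannot be
// represented as a Prometheus time series; emit nothing for it.
Status StringGauge::WriteAsPrometheus(PrometheusWriter* /*writer*/) const {
  return Status::OK();
}
{noformat}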

> String gauge exposed in prometheus format
> -
>
> Key: KUDU-3549
> URL: https://issues.apache.org/jira/browse/KUDU-3549
> Project: Kudu
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 1.17.0
>Reporter: YUBI LEE
>Assignee: Alexey Serbin
>Priority: Major
> Fix For: 1.18.0, 1.17.1
>
>
> According to KUDU-3375, "Kudu now exposes all its metrics except for string 
> gauges in Prometheus format via the embedded webserver's 
> `/metrics_prometheus` endpoint".
>  
>  * 
> [https://github.com/apache/kudu/blob/89e2715faf96afe0b67482166fda9c8699e8052f/docs/prior_release_notes.adoc?plain=1#L143-L145]
>  * https://issues.apache.org/jira/browse/KUDU-3375
>  
> However, with this commit 
> ([https://github.com/apache/kudu/commit/e65ea38a4860c007d93ada9c991bccec903a80b1]), 
> a string gauge related to clock_ntp_status is exposed.
>  
> {code:java}
> # HELP kudu_master_clock_ntp_status Output of ntp_adjtime()/ntp_gettime() 
> kernel API call
> # TYPE kudu_master_clock_ntp_status gauge
> kudu_master_clock_ntp_status{unit_type="state"} now:1706665936956760 
> maxerror:70013 status:ok {code}
> This prevents Prometheus operators from collecting Prometheus metrics for Kudu.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KUDU-3549) String gauge exposed in prometheus format

2024-02-07 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3549.
-
Fix Version/s: 1.18.0
   1.17.1
   Resolution: Fixed

> String gauge exposed in prometheus format
> -
>
> Key: KUDU-3549
> URL: https://issues.apache.org/jira/browse/KUDU-3549
> Project: Kudu
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 1.17.0
>Reporter: YUBI LEE
>Assignee: Alexey Serbin
>Priority: Major
> Fix For: 1.18.0, 1.17.1
>
>
> According to KUDU-3375, "Kudu now exposes all its metrics except for string 
> gauges in Prometheus format via the embedded webserver's 
> `/metrics_prometheus` endpoint".
>  
>  * 
> [https://github.com/apache/kudu/blob/89e2715faf96afe0b67482166fda9c8699e8052f/docs/prior_release_notes.adoc?plain=1#L143-L145]
>  * https://issues.apache.org/jira/browse/KUDU-3375
>  
> However, with this commit 
> ([https://github.com/apache/kudu/commit/e65ea38a4860c007d93ada9c991bccec903a80b1]), 
> a string gauge related to clock_ntp_status is exposed.
>  
> {code:java}
> # HELP kudu_master_clock_ntp_status Output of ntp_adjtime()/ntp_gettime() 
> kernel API call
> # TYPE kudu_master_clock_ntp_status gauge
> kudu_master_clock_ntp_status{unit_type="state"} now:1706665936956760 
> maxerror:70013 status:ok {code}
> This prevents Prometheus operators from collecting Prometheus metrics for Kudu.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KUDU-3433) ClientTest.TestDeleteWithDeletedTableReserveSecondsWorks is flaky

2024-02-07 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815380#comment-17815380
 ] 

Alexey Serbin commented on KUDU-3433:
-

The test has been failing from time to time.  The logs from one of the recent 
failures are attached: [^client-test.3.txt.xz]

> ClientTest.TestDeleteWithDeletedTableReserveSecondsWorks is flaky
> -
>
> Key: KUDU-3433
> URL: https://issues.apache.org/jira/browse/KUDU-3433
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: Alexey Serbin
>Priority: Major
> Attachments: client-test.2.txt.xz, client-test.3.txt.xz, 
> client-test.4.txt.xz
>
>
> The {{TestDeleteWithDeletedTableReserveSecondsWorks}} in {{client-test}} 
> sometimes fails with the following message:
> {noformat}
> src/kudu/client/client-test.cc:5436: Failure  
> Value of: tables.empty()  
>   
>   Actual: false   
>   
> Expected: true
> {noformat}
> I'm attaching a full log for reference.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3433) ClientTest.TestDeleteWithDeletedTableReserveSecondsWorks is flaky

2024-02-07 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3433:

Attachment: client-test.3.txt.xz

> ClientTest.TestDeleteWithDeletedTableReserveSecondsWorks is flaky
> -
>
> Key: KUDU-3433
> URL: https://issues.apache.org/jira/browse/KUDU-3433
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: Alexey Serbin
>Priority: Major
> Attachments: client-test.2.txt.xz, client-test.3.txt.xz, 
> client-test.4.txt.xz
>
>
> The {{TestDeleteWithDeletedTableReserveSecondsWorks}} in {{client-test}} 
> sometimes fails with the following message:
> {noformat}
> src/kudu/client/client-test.cc:5436: Failure  
> Value of: tables.empty()  
>   
>   Actual: false   
>   
> Expected: true
> {noformat}
> I'm attaching a full log for reference.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KUDU-3549) String gauge exposed in prometheus format

2024-02-05 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814526#comment-17814526
 ] 

Alexey Serbin edited comment on KUDU-3549 at 2/5/24 10:27 PM:
--

A patch to fix the issue is posted at https://gerrit.cloudera.org/#/c/20990/


was (Author: aserbin):
The fix is available at https://gerrit.cloudera.org/#/c/20990/

> String gauge exposed in prometheus format
> -
>
> Key: KUDU-3549
> URL: https://issues.apache.org/jira/browse/KUDU-3549
> Project: Kudu
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 1.17.0
>Reporter: YUBI LEE
>Assignee: Alexey Serbin
>Priority: Major
>
> According to KUDU-3375, "Kudu now exposes all its metrics except for string 
> gauges in Prometheus format via the embedded webserver's 
> `/metrics_prometheus` endpoint".
>  
>  * 
> [https://github.com/apache/kudu/blob/89e2715faf96afe0b67482166fda9c8699e8052f/docs/prior_release_notes.adoc?plain=1#L143-L145]
>  * https://issues.apache.org/jira/browse/KUDU-3375
>  
> However, with this commit 
> ([https://github.com/apache/kudu/commit/e65ea38a4860c007d93ada9c991bccec903a80b1]), 
> a string gauge related to clock_ntp_status is exposed.
>  
> {code:java}
> # HELP kudu_master_clock_ntp_status Output of ntp_adjtime()/ntp_gettime() 
> kernel API call
> # TYPE kudu_master_clock_ntp_status gauge
> kudu_master_clock_ntp_status{unit_type="state"} now:1706665936956760 
> maxerror:70013 status:ok {code}
> This prevents Prometheus operators from collecting Prometheus metrics for Kudu.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KUDU-3549) String gauge exposed in prometheus format

2024-02-05 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814526#comment-17814526
 ] 

Alexey Serbin commented on KUDU-3549:
-

The fix is available at https://gerrit.cloudera.org/#/c/20990/

> String gauge exposed in prometheus format
> -
>
> Key: KUDU-3549
> URL: https://issues.apache.org/jira/browse/KUDU-3549
> Project: Kudu
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 1.17.0
>Reporter: YUBI LEE
>Assignee: Alexey Serbin
>Priority: Major
>
> According to KUDU-3375, "Kudu now exposes all its metrics except for string 
> gauges in Prometheus format via the embedded webserver's 
> `/metrics_prometheus` endpoint".
>  
>  * 
> [https://github.com/apache/kudu/blob/89e2715faf96afe0b67482166fda9c8699e8052f/docs/prior_release_notes.adoc?plain=1#L143-L145]
>  * https://issues.apache.org/jira/browse/KUDU-3375
>  
> However, with this commit 
> ([https://github.com/apache/kudu/commit/e65ea38a4860c007d93ada9c991bccec903a80b1]), 
> a string gauge related to clock_ntp_status is exposed.
>  
> {code:java}
> # HELP kudu_master_clock_ntp_status Output of ntp_adjtime()/ntp_gettime() 
> kernel API call
> # TYPE kudu_master_clock_ntp_status gauge
> kudu_master_clock_ntp_status{unit_type="state"} now:1706665936956760 
> maxerror:70013 status:ok {code}
> This prevents Prometheus operators from collecting Prometheus metrics for Kudu.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KUDU-3549) String gauge exposed in prometheus format

2024-02-05 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-3549:
---

Assignee: Alexey Serbin

> String gauge exposed in prometheus format
> -
>
> Key: KUDU-3549
> URL: https://issues.apache.org/jira/browse/KUDU-3549
> Project: Kudu
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 1.17.0
>Reporter: YUBI LEE
>Assignee: Alexey Serbin
>Priority: Major
>
> According to KUDU-3375, "Kudu now exposes all its metrics except for string 
> gauges in Prometheus format via the embedded webserver's 
> `/metrics_prometheus` endpoint".
>  
>  * 
> [https://github.com/apache/kudu/blob/89e2715faf96afe0b67482166fda9c8699e8052f/docs/prior_release_notes.adoc?plain=1#L143-L145]
>  * https://issues.apache.org/jira/browse/KUDU-3375
>  
> However, with this commit 
> ([https://github.com/apache/kudu/commit/e65ea38a4860c007d93ada9c991bccec903a80b1]), 
> a string gauge related to clock_ntp_status is exposed.
>  
> {code:java}
> # HELP kudu_master_clock_ntp_status Output of ntp_adjtime()/ntp_gettime() 
> kernel API call
> # TYPE kudu_master_clock_ntp_status gauge
> kudu_master_clock_ntp_status{unit_type="state"} now:1706665936956760 
> maxerror:70013 status:ok {code}
> This prevents Prometheus operators from collecting Prometheus metrics for Kudu.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3549) String gauge exposed in prometheus format

2024-02-05 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3549:

Code Review: https://gerrit.cloudera.org/#/c/20990/

> String gauge exposed in prometheus format
> -
>
> Key: KUDU-3549
> URL: https://issues.apache.org/jira/browse/KUDU-3549
> Project: Kudu
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 1.17.0
>Reporter: YUBI LEE
>Priority: Major
>
> According to KUDU-3375, "Kudu now exposes all its metrics except for string 
> gauges in Prometheus format via the embedded webserver's 
> `/metrics_prometheus` endpoint".
>  
>  * 
> [https://github.com/apache/kudu/blob/89e2715faf96afe0b67482166fda9c8699e8052f/docs/prior_release_notes.adoc?plain=1#L143-L145]
>  * https://issues.apache.org/jira/browse/KUDU-3375
>  
> However, with this commit 
> ([https://github.com/apache/kudu/commit/e65ea38a4860c007d93ada9c991bccec903a80b1]), 
> a string gauge related to clock_ntp_status is exposed.
>  
> {code:java}
> # HELP kudu_master_clock_ntp_status Output of ntp_adjtime()/ntp_gettime() 
> kernel API call
> # TYPE kudu_master_clock_ntp_status gauge
> kudu_master_clock_ntp_status{unit_type="state"} now:1706665936956760 
> maxerror:70013 status:ok {code}
> This prevents Prometheus operators from collecting Prometheus metrics for Kudu.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KUDU-3550) Improve master RPC address comparison in the 'kudu hms check' CLI tool

2024-02-05 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3550:
---

 Summary: Improve master RPC address comparison in the 'kudu hms 
check' CLI tool
 Key: KUDU-3550
 URL: https://issues.apache.org/jira/browse/KUDU-3550
 Project: Kudu
  Issue Type: Improvement
  Components: CLI
Affects Versions: 1.17.0
Reporter: Alexey Serbin


The {{kudu hms check}} CLI tool treats RPC addresses with and without the 
default master RPC port 7051 as different ones, even if they are in fact the 
same.  That leads to bogus warnings when the tool is run with a list of masters 
that matches the one in the Kudu table's HMS entry, differing only in the 
presence/absence of the default master RPC port 7051.

For example, if the table 'db.my_table' had master1,master2,master3 as the list 
of Kudu masters, running {{kudu hms check 
master1:7051,master2:7051,master3:7051}} would output the following:

{noformat}
I0205 14:22:28.166532 1053705 tool_action_hms.cc:432] Skipping HMS table 
db.my_table with different masters specified: master1,master2,master3
{noformat}

It would be nice to make the tool compare master addresses in a more 
intelligent way, so that Kudu master RPC addresses differing only in the 
presence of the default master RPC port are treated as the same.
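A minimal sketch of the kind of normalization the tool might apply before
comparing address lists (illustrative only; IPv6 literals would need extra
handling):

{noformat}
#include <string>

// Append the default master RPC port when the address has none, so that
// "master1" and "master1:7051" compare equal.
std::string NormalizeMasterAddress(const std::string& addr) {
  static const std::string kDefaultPort = "7051";
  return addr.find(':') == std::string::npos ? addr + ":" + kDefaultPort
                                             : addr;
}
{noformat}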



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KUDU-3433) ClientTest.TestDeleteWithDeletedTableReserveSecondsWorks is flaky

2024-01-22 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17809704#comment-17809704
 ] 

Alexey Serbin commented on KUDU-3433:
-

The test has been failing from time to time.  Just recently it failed again 
when running binaries built in DEBUG configuration from this git changelist: 
[cdbe4577a|https://github.com/apache/kudu/commit/cdbe4577a91a171718ff485acc1dd1261e73a8d2],
 see the attached [^client-test.4.txt.xz] file.

> ClientTest.TestDeleteWithDeletedTableReserveSecondsWorks is flaky
> -
>
> Key: KUDU-3433
> URL: https://issues.apache.org/jira/browse/KUDU-3433
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: Alexey Serbin
>Priority: Major
> Attachments: client-test.2.txt.xz, client-test.4.txt.xz
>
>
> The {{TestDeleteWithDeletedTableReserveSecondsWorks}} in {{client-test}} 
> sometimes fails with the following message:
> {noformat}
> src/kudu/client/client-test.cc:5436: Failure  
> Value of: tables.empty()  
>   
>   Actual: false   
>   
> Expected: true
> {noformat}
> I'm attaching a full log for reference.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3433) ClientTest.TestDeleteWithDeletedTableReserveSecondsWorks is flaky

2024-01-22 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3433:

Attachment: client-test.4.txt.xz

> ClientTest.TestDeleteWithDeletedTableReserveSecondsWorks is flaky
> -
>
> Key: KUDU-3433
> URL: https://issues.apache.org/jira/browse/KUDU-3433
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: Alexey Serbin
>Priority: Major
> Attachments: client-test.2.txt.xz, client-test.4.txt.xz
>
>
> The {{TestDeleteWithDeletedTableReserveSecondsWorks}} in {{client-test}} 
> sometimes fails with the following message:
> {noformat}
> src/kudu/client/client-test.cc:5436: Failure  
> Value of: tables.empty()  
>   
>   Actual: false   
>   
> Expected: true
> {noformat}
> I'm attaching a full log for reference.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3538) Implement HTTP Service Discovery endpoint for Prometheus

2024-01-03 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3538:

Description: 
For better integration with Prometheus, it would be great to make Kudu 
provide the [Prometheus HTTP-based Service Discovery 
mechanism|https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config].

The SD endpoint should be served by the embedded webserver of a Kudu master.  
The endpoint should provide the list of URLs with Prometheus-enabled metric 
endpoints for all masters and tablet servers in a Kudu cluster.  The content 
must conform to the requirements outlined in the corresponding documentation at 
[https://prometheus.io/docs/prometheus/latest/http_sd|https://prometheus.io/docs/prometheus/latest/http_sd/]





  was:
For better integration with Prometheus, it would be great to make Kudu 
providing the [Prometheus HTTP-based Service Discovery 
mechanism|https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config].

The SD endpoint can be served by kudu-master processes.  The endpoint should 
output the list of URLs with Prometheus-enabled metric endpoints for all 
masters and tablet servers in a Kudu cluster.  The content must conform with 
the requirements outlined the corresponding documentation at 
[https://prometheus.io/docs/prometheus/latest/http_sd|https://prometheus.io/docs/prometheus/latest/http_sd/]






> Implement HTTP Service Discovery endpoint for Prometheus
> 
>
> Key: KUDU-3538
> URL: https://issues.apache.org/jira/browse/KUDU-3538
> Project: Kudu
>  Issue Type: Improvement
>  Components: master
>Reporter: Alexey Serbin
>Assignee: Marton Greber
>Priority: Major
>  Labels: Integration, metrics, observability, supportability
>
> For better integration with Prometheus, it would be great to make Kudu 
> provide the [Prometheus HTTP-based Service Discovery 
> mechanism|https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config].
> The SD endpoint should be served by the embedded webserver of a Kudu master.  
> The endpoint should provide the list of URLs with Prometheus-enabled metric 
> endpoints for all masters and tablet servers in a Kudu cluster.  The content 
> must conform to the requirements outlined in the corresponding documentation 
> at 
> [https://prometheus.io/docs/prometheus/latest/http_sd|https://prometheus.io/docs/prometheus/latest/http_sd/]
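For reference, an HTTP SD endpoint is expected to return a JSON list of target
groups shaped like the sketch below (hostnames and the label name are made up;
8051 and 8050 are the default master and tablet server webserver ports):

{noformat}
[
  {
    "targets": ["master-1.example.com:8051"],
    "labels": {"kudu_role": "master"}
  },
  {
    "targets": ["tserver-1.example.com:8050", "tserver-2.example.com:8050"],
    "labels": {"kudu_role": "tserver"}
  }
]
{noformat}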



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3538) Implement HTTP Service Discovery endpoint for Prometheus

2024-01-03 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3538:

Summary: Implement HTTP Service Discovery endpoint for Prometheus  (was: 
Implement HTTP Service Discovery for Prometheus)

> Implement HTTP Service Discovery endpoint for Prometheus
> 
>
> Key: KUDU-3538
> URL: https://issues.apache.org/jira/browse/KUDU-3538
> Project: Kudu
>  Issue Type: Improvement
>  Components: master
>Reporter: Alexey Serbin
>Assignee: Marton Greber
>Priority: Major
>  Labels: Integration, metrics, observability, supportability
>
> For better integration with Prometheus, it would be great to make Kudu 
> providing the [Prometheus HTTP-based Service Discovery 
> mechanism|https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config].
> The SD endpoint can be served by kudu-master processes.  The endpoint should 
> output the list of URLs with Prometheus-enabled metric endpoints for all 
> masters and tablet servers in a Kudu cluster.  The content must conform with 
> the requirements outlined the corresponding documentation at 
> [https://prometheus.io/docs/prometheus/latest/http_sd|https://prometheus.io/docs/prometheus/latest/http_sd/]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3538) Implement HTTP Service Discovery for Prometheus

2024-01-03 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3538:

Description: 
For better integration with Prometheus, it would be great to make Kudu 
providing the [Prometheus HTTP-based Service Discovery 
mechanism|https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config].

The SD endpoint can be served by kudu-master processes.  The endpoint should 
output the list of URLs with Prometheus-enabled metric endpoints for all 
masters and tablet servers in a Kudu cluster.  The content must conform with 
the requirements outlined the corresponding documentation at 
[https://prometheus.io/docs/prometheus/latest/http_sd|https://prometheus.io/docs/prometheus/latest/http_sd/]





  was:
For better integration with Prometheus, it would be great to make Kudu 
providing the [Prometheus HTTP-based Service Discovery 
mechanism|https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config].

The SD endpoint can be served by kudu-master processes.  The endpoint should 
output the list of URLs of with Prometheus-enabled metric endpoints for all 
master and tablet servers in a Kudu cluster.  The content must conform with the 
requirements outlined the corresponding documentation at 
[https://prometheus.io/docs/prometheus/latest/http_sd|https://prometheus.io/docs/prometheus/latest/http_sd/]






> Implement HTTP Service Discovery for Prometheus
> ---
>
> Key: KUDU-3538
> URL: https://issues.apache.org/jira/browse/KUDU-3538
> Project: Kudu
>  Issue Type: Improvement
>  Components: master
>Reporter: Alexey Serbin
>Assignee: Marton Greber
>Priority: Major
>  Labels: Integration, metrics, observability, supportability
>
> For better integration with Prometheus, it would be great to make Kudu 
> providing the [Prometheus HTTP-based Service Discovery 
> mechanism|https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config].
> The SD endpoint can be served by kudu-master processes.  The endpoint should 
> output the list of URLs with Prometheus-enabled metric endpoints for all 
> masters and tablet servers in a Kudu cluster.  The content must conform with 
> the requirements outlined the corresponding documentation at 
> [https://prometheus.io/docs/prometheus/latest/http_sd|https://prometheus.io/docs/prometheus/latest/http_sd/]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KUDU-3538) Implement HTTP Service Discovery for Prometheus

2024-01-02 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3538:
---

 Summary: Implement HTTP Service Discovery for Prometheus
 Key: KUDU-3538
 URL: https://issues.apache.org/jira/browse/KUDU-3538
 Project: Kudu
  Issue Type: Improvement
  Components: master
Reporter: Alexey Serbin


For better integration with Prometheus, it would be great to make Kudu 
providing the [Prometheus HTTP-based Service Discovery 
mechanism|https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config].

The SD endpoint can be served by kudu-master processes.  The endpoint should 
output the list of URLs of with Prometheus-enabled metric endpoints for all 
master and tablet servers in a Kudu cluster.  The content must conform with the 
requirements outlined the corresponding documentation at 
[https://prometheus.io/docs/prometheus/latest/http_sd|https://prometheus.io/docs/prometheus/latest/http_sd/]







--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3531) Limit the amount of resources used by tombstoned tablet replicas

2023-12-11 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3531:

Labels: scalability tablet  (was: scalability tserver)

> Limit the amount of resources used by tombstoned tablet replicas
> 
>
> Key: KUDU-3531
> URL: https://issues.apache.org/jira/browse/KUDU-3531
> Project: Kudu
>  Issue Type: Improvement
>Reporter: Alexey Serbin
>Priority: Major
>  Labels: scalability, tablet
>
> I came across a case where a tablet server had just about 2K live tablet 
> replicas, but it opened about 24K files in its WAL and data directories.  The 
> issue stems from the fact that tombstoned tablet replica's files are opened 
> by the FS manager the same as for a live replica, and those are kept open 
> even if they are never about to change.  It would be prudent to avoid keeping 
> tombstoned tablet replicas' files open, if possible: maybe, just read the 
> required information (last voted term and opId index?) and keep it in runtime 
> structures, but close corresponding files right after bootstrapping?
> Otherwise, this doesn't seem to scale well.
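A tiny sketch of the idea from the description (the struct and its fields are
illustrative, not existing Kudu structures): cache the consensus state needed
for future voting, and close the files right after bootstrap:

{noformat}
#include <cstdint>

// Hypothetical cached state for a tombstoned replica: enough to answer Raft
// vote requests without keeping its WAL and metadata files open.
struct TombstonedReplicaState {
  int64_t last_voted_term;
  int64_t last_logged_opid_index;
};
{noformat}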



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3531) Limit the amount of resources used by tombstoned tablet replicas

2023-12-11 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3531:

Labels: scalability tserver  (was: scalability)

> Limit the amount of resources used by tombstoned tablet replicas
> 
>
> Key: KUDU-3531
> URL: https://issues.apache.org/jira/browse/KUDU-3531
> Project: Kudu
>  Issue Type: Improvement
>Reporter: Alexey Serbin
>Priority: Major
>  Labels: scalability, tserver
>
> I came across a case where a tablet server had just about 2K live tablet 
> replicas, but it opened about 24K files in its WAL and data directories.  The 
> issue stems from the fact that tombstoned tablet replica's files are opened 
> by the FS manager the same as for a live replica, and those are kept open 
> even if they are never about to change.  It would be prudent to avoid keeping 
> tombstoned tablet replicas' files open, if possible: maybe, just read the 
> required information (last voted term and opId index?) and keep it in runtime 
> structures, but close corresponding files right after bootstrapping?
> Otherwise, this doesn't seem to scale well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3531) Limit the amount of resources used by tombstoned tablet replicas

2023-12-06 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3531:

Description: 
I came across a case where a tablet server had just about 2K live tablet 
replicas, but it opened about 24K files in its WAL and data directories.  The 
issue stems from the fact that tombstoned tablet replica's files are opened by 
the FS manager the same as for a live replica, and those are kept open even if 
they are never about to change.  It would be prudent to avoid keeping 
tombstoned tablet replicas' files open, if possible: maybe, just read the 
required information (last voted term and opId index?) and keep it in runtime 
structures, but close corresponding files right after bootstrapping?

Otherwise, this doesn't seem to scale well.

  was:
I came across a case where a tablet server has just about 2K tablet replicas, 
but it opened about 24K files in its WAL and data directories.  The issue stems 
from the fact that tobmstoned tablet replica's files are opened by the FS 
manager as well, and those are kept open even if they are never about to change 
or receive any Raft updates.  It would be prudent to avoid keeping tombstoned 
tablet replicas' files open, if possible: maybe, just read the require 
information (last voted term and opId index?) and keep it in runtime 
structures, but close corresponding files right after bootstrapping?

Otherwise, this doesn't seem to scale well.


> Limit the amount of resources used by tombstoned tablet replicas
> 
>
> Key: KUDU-3531
> URL: https://issues.apache.org/jira/browse/KUDU-3531
> Project: Kudu
>  Issue Type: Improvement
>Reporter: Alexey Serbin
>Priority: Major
>
> I came across a case where a tablet server had just about 2K live tablet 
> replicas, but it opened about 24K files in its WAL and data directories.  The 
> issue stems from the fact that tombstoned tablet replica's files are opened 
> by the FS manager the same as for a live replica, and those are kept open 
> even if they are never about to change.  It would be prudent to avoid keeping 
> tombstoned tablet replicas' files open, if possible: maybe, just read the 
> required information (last voted term and opId index?) and keep it in runtime 
> structures, but close corresponding files right after bootstrapping?
> Otherwise, this doesn't seem to scale well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KUDU-3531) Limit the amount of resources used by tombstoned tablet replicas

2023-12-06 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3531:
---

 Summary: Limit the amount of resources used by tombstoned tablet 
replicas
 Key: KUDU-3531
 URL: https://issues.apache.org/jira/browse/KUDU-3531
 Project: Kudu
  Issue Type: Improvement
Reporter: Alexey Serbin


I came across a case where a tablet server has just about 2K tablet replicas, 
but it opened about 24K files in its WAL and data directories.  The issue stems 
from the fact that tobmstoned tablet replica's files are opened by the FS 
manager as well, and those are kept open even if they are never about to change 
or receive any Raft updates.  It would be prudent to avoid keeping tombstoned 
tablet replicas' files open, if possible: maybe, just read the require 
information (last voted term and opId index?) and keep it in runtime 
structures, but close corresponding files right after bootstrapping?

Otherwise, this doesn't seem to scale well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KUDU-3530) Add guardrails to prevent inconsistencies on attempts to add multiple Kudu masters at once in a cluster

2023-12-05 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3530:
---

 Summary: Add guardrails to prevent inconsistencies on attempts to 
add multiple Kudu masters at once in a cluster
 Key: KUDU-3530
 URL: https://issues.apache.org/jira/browse/KUDU-3530
 Project: Kudu
  Issue Type: Improvement
  Components: master
Reporter: Alexey Serbin


There have been a few reports of inconsistencies in the system catalog tablet's 
Raft configuration upon trying to add multiple new masters at once into a Kudu 
cluster.  It seems the current implementation of the {{AddMaster}} RPC isn't 
thread-safe, since the Raft configuration of the system catalog tablet became 
corrupted after an attempt to add multiple extra masters at once (i.e. starting 
multiple of those to-be-added masters at once).  The original Kudu master 
reported an error like the one below upon the next restart:

{noformat}
Invalid argument: RunMasterServer() failed: Unable to initialize catalog 
manager: Failed to initialize sys tables async: on-disk master list (:0) and 
provided master list (m1.my.org:7051, m2.my.org:7051, m3.my.org:7051) differ by 
more than one address. Their symmetric difference is: :0, m1.my.org:7051, 
m2.my.org:7051, m3.my.org:7051
{noformat}

It would be great to have guardrails preventing such corruption.  
Essentially, we should enforce the one-new-master-at-a-time invariant which the 
current implementation implicitly assumes, but doesn't consistently enforce.
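A hypothetical sketch of such a guardrail (the function and its wiring are
illustrative, not the actual Kudu implementation; {{Status}} is Kudu's usual
status type): the RPC handler rejects a new request while another membership
change is still pending, making the invariant explicit:

{noformat}
// Reject concurrent master membership changes.
Status CheckAddMasterAllowed(bool membership_change_pending) {
  if (membership_change_pending) {
    return Status::IllegalState(
        "another master membership change is already in progress; "
        "new masters must be added one at a time");
  }
  return Status::OK();
}
{noformat}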



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KUDU-3525) Introduce kudu master leader_step_down CLI tool

2023-11-16 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3525:
---

 Summary: Introduce kudu master leader_step_down CLI tool
 Key: KUDU-3525
 URL: https://issues.apache.org/jira/browse/KUDU-3525
 Project: Kudu
  Issue Type: Improvement
  Components: CLI, master
Reporter: Alexey Serbin


Currently there is the {{kudu tablet leader_step_down}} CLI tool, but it 
doesn't work for the system catalog tablet.  However, in various scenarios it's 
useful to have a CLI tool to make the current leader master step down and 
trigger a Raft election for the system catalog tablet (see the invocation 
sketched after the list below).  To name just a few scenarios:
* Removing the second master when downsizing along the 2 -> 1 path (the second 
step of a 3 -> 2 -> 1 migration), if the master to be removed hosts the leader 
replica of the system catalog.  In such a case, the {{kudu master remove}} CLI 
tool doesn't work since it cannot make the current leader replica of the 
system tablet step down.
* Explicitly moving the system catalog leadership from one master node to 
another.  That might be useful in various scenarios, at least for test 
scenarios involving {{ExternalMiniCluster}}.
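By analogy with the existing tablet-level tool, the proposed tool might be
invoked as follows (illustrative syntax only, since the tool doesn't exist
yet):

{noformat}
$ kudu master leader_step_down <master addresses>
{noformat}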



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3520) File descriptor leak in Env::NewRWFile() when encryption-at-rest is enabled

2023-11-15 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3520:

Summary: File descriptor leak in Env::NewRWFile() when encryption-at-rest 
is enabled  (was: File descriptor leak in Env::NewRWFile() when 
ecryption-at-rest is enabled)

> File descriptor leak in Env::NewRWFile() when encryption-at-rest is enabled
> ---
>
> Key: KUDU-3520
> URL: https://issues.apache.org/jira/browse/KUDU-3520
> Project: Kudu
>  Issue Type: Bug
>  Components: fs, security, tserver
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Alexey Serbin
>Assignee: Attila Bukor
>Priority: Major
> Fix For: 1.18.0
>
>
> There is a file descriptor leak in {{Env::NewRWFile()}} on an error path when 
> encryption-at-rest is enabled.
> In the code below, if {{ReadEncryptionHeader()}} or 
> {{WriteEncryptionHeader()}} failed, the descriptor of the file opened by 
> {{DoOpen()}} would be leaked.
> {noformat}
> RETURN_NOT_OK(DoOpen(fname, opts.mode, &fd));
> EncryptionHeader eh;
> if (encrypt) {
>   DCHECK(encryption_key_);
>   if (size >= kEncryptionHeaderSize) {
> RETURN_NOT_OK(ReadEncryptionHeader(fd, fname, *encryption_key_, &eh));
>   } else {
> RETURN_NOT_OK(GenerateHeader(&eh));
> RETURN_NOT_OK(WriteEncryptionHeader(fd, fname, *encryption_key_, eh));
>   }
> }
> result->reset(new PosixRWFile(fname, fd, opts.sync_on_close, encrypt, 
> eh));
> {noformat}
> It's been evidenced in the wild when creating the metadata file for a tablet 
> during tablet copying failed with the error like below:
> {noformat}
> Runtime error: Couldn't create tablet metadata: Failed to write tablet 
> metadata d199a872b03848d695f067ed5c694835: Failed to initialize encryption: 
> error:0607B083:digital envelope routines:EVP_CipherInit_ex:no cipher 
> set:crypto/evp/evp_enc.c:170
> {noformat}
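A generic sketch of one way to plug this kind of leak (illustrative only, not
the actual Kudu fix): install an RAII guard right after {{DoOpen()}} succeeds,
and release it only once ownership of the descriptor has been transferred:

{noformat}
#include <unistd.h>

// Closes the wrapped descriptor on scope exit unless Release() was called.
class ScopedFdCloser {
 public:
  explicit ScopedFdCloser(int fd) : fd_(fd) {}
  ~ScopedFdCloser() {
    if (fd_ >= 0) ::close(fd_);
  }
  void Release() { fd_ = -1; }  // call after handing the fd to PosixRWFile
 private:
  int fd_;
};
{noformat}

With such a guard in place, an early RETURN_NOT_OK from ReadEncryptionHeader()
or WriteEncryptionHeader() no longer leaks the descriptor.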



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3524) The TestScannerKeepAlivePeriodicallyCrossServers scenario fails with SIGABRT

2023-11-13 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3524:

Summary: The TestScannerKeepAlivePeriodicallyCrossServers scenario fails 
with SIGABRT  (was: The {{TestScannerKeepAlivePeriodicallyCrossServers}} 
scenario fails with SIGABRT)

> The TestScannerKeepAlivePeriodicallyCrossServers scenario fails with SIGABRT
> 
>
> Key: KUDU-3524
> URL: https://issues.apache.org/jira/browse/KUDU-3524
> Project: Kudu
>  Issue Type: Bug
>Reporter: Alexey Serbin
>Priority: Major
>
> Running the newly added test scenario 
> {{TestScannerKeepAlivePeriodicallyCrossServers}} fails with SIGABRT when run 
> as follows on macOS (but I guess it's not macOS-specific) in a DEBUG 
> build:
> {noformat}
> ./bin/client-test --stress_cpu_threads=32 
> --gtest_filter='*TestScannerKeepAlivePeriodicallyCrossServers*'
> {noformat}
> The error message and the stack trace are below:
> {noformat}
> F20231113 12:21:13.431455 41195482 thread_restrictions.cc:79] Check failed: 
> LoadTLS()->wait_allowed Waiting is not allowed to be used on this thread to 
> prevent server-wide latency aberrations and deadlocks. Thread 41195482 (name: 
> "rpc reactor", category: "reactor")
> *** Check failure stack trace: ***
> Process 77090 stopped
> * thread #335, name = 'rpc reactor-41195482', stop reason = signal SIGABRT
> frame #0: 0x7fff205b890e libsystem_kernel.dylib`__pthread_kill + 10
> libsystem_kernel.dylib`__pthread_kill:
> ->  0x7fff205b890e <+10>: jae0x7fff205b8918; <+20>
> 0x7fff205b8910 <+12>: movq   %rax, %rdi
> 0x7fff205b8913 <+15>: jmp0x7fff205b2ab9; cerror_nocancel
> 0x7fff205b8918 <+20>: retq   
> Target 0: (client-test) stopped.
> (lldb) bt
> * thread #335, name = 'rpc reactor-41195482', stop reason = signal SIGABRT
>   * frame #0: 0x7fff205b890e libsystem_kernel.dylib`__pthread_kill + 10
> frame #1: 0x7fff205e75bd libsystem_pthread.dylib`pthread_kill + 263
> frame #2: 0x7fff2053c406 libsystem_c.dylib`abort + 125
> frame #3: 0x00010f64ebd8 
> libglog.1.dylib`google::LogMessage::SendToLog() [inlined] 
> google::LogMessage::Fail() at logging.cc:1946:3 [opt]
> frame #4: 0x00010f64ebd2 
> libglog.1.dylib`google::LogMessage::SendToLog(this=0x70001a95e108) at 
> logging.cc:1920:5 [opt]
> frame #5: 0x00010f64f47a 
> libglog.1.dylib`google::LogMessage::Flush(this=0x70001a95e108) at 
> logging.cc:1777:5 [opt]
> frame #6: 0x00010f65428f 
> libglog.1.dylib`google::LogMessageFatal::~LogMessageFatal(this=0x70001a95e108)
>  at logging.cc:2557:5 [opt]
> frame #7: 0x00010f650349 
> libglog.1.dylib`google::LogMessageFatal::~LogMessageFatal(this=) 
> at logging.cc:2556:37 [opt]
> frame #8: 0x00010e545473 
> libkudu_util.dylib`kudu::ThreadRestrictions::AssertWaitAllowed() at 
> thread_restrictions.cc:79:3
> frame #9: 0x00010013ebb9 
> client-test`kudu::CountDownLatch::Wait(this=0x70001a95e2a0) const at 
> countdown_latch.h:74:5
> frame #10: 0x00010a1749f5 
> libkrpc.dylib`kudu::Notification::WaitForNotification(this=0x70001a95e2a0)
>  const at notification.h:127:12
> frame #11: 0x00010a1748e9 
> libkrpc.dylib`kudu::rpc::Proxy::SyncRequest(this=0x00011317e9b8, 
> method="ScannerKeepAlive", req=0x70001a95e428, resp=0x70001a95e408, 
> controller=0x70001a95e458) at proxy.cc:259:8
> frame #12: 0x00010697220f 
> libtserver_service_proto.dylib`kudu::tserver::TabletServerServiceProxy::ScannerKeepAlive(this=0x00011317e9b8,
>  req=0x70001a95e428, resp=0x70001a95e408, 
> controller=0x70001a95e458) at tserver_service.proxy.cc:98:10
> frame #13: 0x00010525c5b6 
> libkudu_client.dylib`kudu::client::KuduScanner::Data::KeepAlive(this=0x00011290c700)
>  at scanner-internal.cc:664:3
> frame #14: 0x000105269e76 
> libkudu_client.dylib`kudu::client::KuduScanner::Data::StartKeepAlivePeriodically(this=0x000112899858)::$_0::operator()()
>  const at scanner-internal.cc:112:16
> frame #15: 0x000105269e30 
> libkudu_client.dylib`decltype(__f=0x000112899858)::$_0&>(fp)()) 
> std::__1::__invoke  long long, 
> std::__1::shared_ptr)::$_0&>(kudu::client::KuduScanner::Data::StartKeepAlivePeriodically(unsigned
>  long long, std::__1::shared_ptr)::$_0&) at 
> type_traits:3694:1
> frame #16: 0x000105269dd1 libkudu_client.dylib`void 
> std::__1::__invoke_void_return_wrapper true>::__call(kudu::client::KuduScanner::Data::StartKeepAlivePeriodically(unsigned
>  long long, std::__1::shared_ptr)::$_0&) at 
> __functional_base:348:9
> frame #17: 0x000105269d9d 
> libkudu_client.dylib`std::__1::__function::__alloc_func  long long, std::__1::shared_ptr)::$_0, 
> 

[jira] [Assigned] (KUDU-3524) The {{TestScannerKeepAlivePeriodicallyCrossServers}} scenario fails with SIGABRT

2023-11-13 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-3524:
---

Assignee: (was: Alexey Serbin)

> The {{TestScannerKeepAlivePeriodicallyCrossServers}} scenario fails with 
> SIGABRT
> 
>
> Key: KUDU-3524
> URL: https://issues.apache.org/jira/browse/KUDU-3524
> Project: Kudu
>  Issue Type: Bug
>Reporter: Alexey Serbin
>Priority: Major
>
> Running the newly added test scenario 
> {{TestScannerKeepAlivePeriodicallyCrossServers}} fails with SIGABRT when run 
> as follows on macOS (but I guess it's not macOS-specific) in a DEBUG 
> build:
> {noformat}
> ./bin/client-test --stress_cpu_threads=32 
> --gtest_filter='*TestScannerKeepAlivePeriodicallyCrossServers*'
> {noformat}
> The error message and the stack trace are below:
> {noformat}
> F20231113 12:21:13.431455 41195482 thread_restrictions.cc:79] Check failed: 
> LoadTLS()->wait_allowed Waiting is not allowed to be used on this thread to 
> prevent server-wide latency aberrations and deadlocks. Thread 41195482 (name: 
> "rpc reactor", category: "reactor")
> *** Check failure stack trace: ***
> Process 77090 stopped
> * thread #335, name = 'rpc reactor-41195482', stop reason = signal SIGABRT
> frame #0: 0x7fff205b890e libsystem_kernel.dylib`__pthread_kill + 10
> libsystem_kernel.dylib`__pthread_kill:
> ->  0x7fff205b890e <+10>: jae0x7fff205b8918; <+20>
> 0x7fff205b8910 <+12>: movq   %rax, %rdi
> 0x7fff205b8913 <+15>: jmp0x7fff205b2ab9; cerror_nocancel
> 0x7fff205b8918 <+20>: retq   
> Target 0: (client-test) stopped.
> (lldb) bt
> * thread #335, name = 'rpc reactor-41195482', stop reason = signal SIGABRT
>   * frame #0: 0x7fff205b890e libsystem_kernel.dylib`__pthread_kill + 10
> frame #1: 0x7fff205e75bd libsystem_pthread.dylib`pthread_kill + 263
> frame #2: 0x7fff2053c406 libsystem_c.dylib`abort + 125
> frame #3: 0x00010f64ebd8 
> libglog.1.dylib`google::LogMessage::SendToLog() [inlined] 
> google::LogMessage::Fail() at logging.cc:1946:3 [opt]
> frame #4: 0x00010f64ebd2 
> libglog.1.dylib`google::LogMessage::SendToLog(this=0x70001a95e108) at 
> logging.cc:1920:5 [opt]
> frame #5: 0x00010f64f47a 
> libglog.1.dylib`google::LogMessage::Flush(this=0x70001a95e108) at 
> logging.cc:1777:5 [opt]
> frame #6: 0x00010f65428f 
> libglog.1.dylib`google::LogMessageFatal::~LogMessageFatal(this=0x70001a95e108)
>  at logging.cc:2557:5 [opt]
> frame #7: 0x00010f650349 
> libglog.1.dylib`google::LogMessageFatal::~LogMessageFatal(this=) 
> at logging.cc:2556:37 [opt]
> frame #8: 0x00010e545473 
> libkudu_util.dylib`kudu::ThreadRestrictions::AssertWaitAllowed() at 
> thread_restrictions.cc:79:3
> frame #9: 0x00010013ebb9 
> client-test`kudu::CountDownLatch::Wait(this=0x70001a95e2a0) const at 
> countdown_latch.h:74:5
> frame #10: 0x00010a1749f5 
> libkrpc.dylib`kudu::Notification::WaitForNotification(this=0x70001a95e2a0)
>  const at notification.h:127:12
> frame #11: 0x00010a1748e9 
> libkrpc.dylib`kudu::rpc::Proxy::SyncRequest(this=0x00011317e9b8, 
> method="ScannerKeepAlive", req=0x70001a95e428, resp=0x70001a95e408, 
> controller=0x70001a95e458) at proxy.cc:259:8
> frame #12: 0x00010697220f 
> libtserver_service_proto.dylib`kudu::tserver::TabletServerServiceProxy::ScannerKeepAlive(this=0x00011317e9b8,
>  req=0x70001a95e428, resp=0x70001a95e408, 
> controller=0x70001a95e458) at tserver_service.proxy.cc:98:10
> frame #13: 0x00010525c5b6 
> libkudu_client.dylib`kudu::client::KuduScanner::Data::KeepAlive(this=0x00011290c700)
>  at scanner-internal.cc:664:3
> frame #14: 0x000105269e76 
> libkudu_client.dylib`kudu::client::KuduScanner::Data::StartKeepAlivePeriodically(this=0x000112899858)::$_0::operator()()
>  const at scanner-internal.cc:112:16
> frame #15: 0x000105269e30 
> libkudu_client.dylib`decltype(__f=0x000112899858)::$_0&>(fp)()) 
> std::__1::__invoke  long long, 
> std::__1::shared_ptr)::$_0&>(kudu::client::KuduScanner::Data::StartKeepAlivePeriodically(unsigned
>  long long, std::__1::shared_ptr)::$_0&) at 
> type_traits:3694:1
> frame #16: 0x000105269dd1 libkudu_client.dylib`void 
> std::__1::__invoke_void_return_wrapper true>::__call(kudu::client::KuduScanner::Data::StartKeepAlivePeriodically(unsigned
>  long long, std::__1::shared_ptr)::$_0&) at 
> __functional_base:348:9
> frame #17: 0x000105269d9d 
> libkudu_client.dylib`std::__1::__function::__alloc_func  long long, std::__1::shared_ptr)::$_0, 
> std::__1::allocator  long long, std::__1::shared_ptr)::$_0>, void 
> ()>::operator(this=0x000112899858)() at 

[jira] [Assigned] (KUDU-3524) The {{TestScannerKeepAlivePeriodicallyCrossServers}} scenario fails with SIGABRT

2023-11-13 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-3524:
---

Assignee: Alexey Serbin

> The {{TestScannerKeepAlivePeriodicallyCrossServers}} scenario fails with 
> SIGABRT
> 
>
> Key: KUDU-3524
> URL: https://issues.apache.org/jira/browse/KUDU-3524
> Project: Kudu
>  Issue Type: Bug
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>
> Running the newly added test scenario 
> {{TestScannerKeepAlivePeriodicallyCrossServers}} fails with SIGABRT when run 
> as follows on macOS (but I guess it's not macOS-specific) in a DEBUG build:
> {noformat}
> ./bin/client-test --stress_cpu_threads=32 
> --gtest_filter='*TestScannerKeepAlivePeriodicallyCrossServers*'
> {noformat}
> The error message and the stack trace are below:
> {noformat}
> F20231113 12:21:13.431455 41195482 thread_restrictions.cc:79] Check failed: 
> LoadTLS()->wait_allowed Waiting is not allowed to be used on this thread to 
> prevent server-wide latency aberrations and deadlocks. Thread 41195482 (name: 
> "rpc reactor", category: "reactor")
> *** Check failure stack trace: ***
> Process 77090 stopped
> * thread #335, name = 'rpc reactor-41195482', stop reason = signal SIGABRT
> frame #0: 0x7fff205b890e libsystem_kernel.dylib`__pthread_kill + 10
> libsystem_kernel.dylib`__pthread_kill:
> ->  0x7fff205b890e <+10>: jae    0x7fff205b8918 ; <+20>
> 0x7fff205b8910 <+12>: movq   %rax, %rdi
> 0x7fff205b8913 <+15>: jmp    0x7fff205b2ab9 ; cerror_nocancel
> 0x7fff205b8918 <+20>: retq   
> Target 0: (client-test) stopped.
> (lldb) bt
> * thread #335, name = 'rpc reactor-41195482', stop reason = signal SIGABRT
>   * frame #0: 0x7fff205b890e libsystem_kernel.dylib`__pthread_kill + 10
> frame #1: 0x7fff205e75bd libsystem_pthread.dylib`pthread_kill + 263
> frame #2: 0x7fff2053c406 libsystem_c.dylib`abort + 125
> frame #3: 0x00010f64ebd8 
> libglog.1.dylib`google::LogMessage::SendToLog() [inlined] 
> google::LogMessage::Fail() at logging.cc:1946:3 [opt]
> frame #4: 0x00010f64ebd2 
> libglog.1.dylib`google::LogMessage::SendToLog(this=0x70001a95e108) at 
> logging.cc:1920:5 [opt]
> frame #5: 0x00010f64f47a 
> libglog.1.dylib`google::LogMessage::Flush(this=0x70001a95e108) at 
> logging.cc:1777:5 [opt]
> frame #6: 0x00010f65428f 
> libglog.1.dylib`google::LogMessageFatal::~LogMessageFatal(this=0x70001a95e108)
>  at logging.cc:2557:5 [opt]
> frame #7: 0x00010f650349 
> libglog.1.dylib`google::LogMessageFatal::~LogMessageFatal(this=) 
> at logging.cc:2556:37 [opt]
> frame #8: 0x00010e545473 
> libkudu_util.dylib`kudu::ThreadRestrictions::AssertWaitAllowed() at 
> thread_restrictions.cc:79:3
> frame #9: 0x00010013ebb9 
> client-test`kudu::CountDownLatch::Wait(this=0x70001a95e2a0) const at 
> countdown_latch.h:74:5
> frame #10: 0x00010a1749f5 
> libkrpc.dylib`kudu::Notification::WaitForNotification(this=0x70001a95e2a0)
>  const at notification.h:127:12
> frame #11: 0x00010a1748e9 
> libkrpc.dylib`kudu::rpc::Proxy::SyncRequest(this=0x00011317e9b8, 
> method="ScannerKeepAlive", req=0x70001a95e428, resp=0x70001a95e408, 
> controller=0x70001a95e458) at proxy.cc:259:8
> frame #12: 0x00010697220f 
> libtserver_service_proto.dylib`kudu::tserver::TabletServerServiceProxy::ScannerKeepAlive(this=0x00011317e9b8,
>  req=0x70001a95e428, resp=0x70001a95e408, 
> controller=0x70001a95e458) at tserver_service.proxy.cc:98:10
> frame #13: 0x00010525c5b6 
> libkudu_client.dylib`kudu::client::KuduScanner::Data::KeepAlive(this=0x00011290c700)
>  at scanner-internal.cc:664:3
> frame #14: 0x000105269e76 
> libkudu_client.dylib`kudu::client::KuduScanner::Data::StartKeepAlivePeriodically(this=0x000112899858)::$_0::operator()()
>  const at scanner-internal.cc:112:16
> frame #15: 0x000105269e30 
> libkudu_client.dylib`decltype(__f=0x000112899858)::$_0&>(fp)()) 
> std::__1::__invoke  long long, 
> std::__1::shared_ptr)::$_0&>(kudu::client::KuduScanner::Data::StartKeepAlivePeriodically(unsigned
>  long long, std::__1::shared_ptr)::$_0&) at 
> type_traits:3694:1
> frame #16: 0x000105269dd1 libkudu_client.dylib`void 
> std::__1::__invoke_void_return_wrapper true>::__call(kudu::client::KuduScanner::Data::StartKeepAlivePeriodically(unsigned
>  long long, std::__1::shared_ptr)::$_0&) at 
> __functional_base:348:9
> frame #17: 0x000105269d9d 
> libkudu_client.dylib`std::__1::__function::__alloc_func  long long, std::__1::shared_ptr)::$_0, 
> std::__1::allocator  long long, std::__1::shared_ptr)::$_0>, void 
> 

[jira] [Created] (KUDU-3524) The {{TestScannerKeepAlivePeriodicallyCrossServers}} scenario fails with SIGABRT

2023-11-13 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3524:
---

 Summary: The {{TestScannerKeepAlivePeriodicallyCrossServers}} 
scenario fails with SIGABRT
 Key: KUDU-3524
 URL: https://issues.apache.org/jira/browse/KUDU-3524
 Project: Kudu
  Issue Type: Bug
Reporter: Alexey Serbin


Running the newly added test scenario 
{{TestScannerKeepAlivePeriodicallyCrossServers}} fails with SIGABRT when run as 
follows on macOS (but I guess it's not macOS-specific) in a DEBUG build:
{noformat}
./bin/client-test --stress_cpu_threads=32 
--gtest_filter='*TestScannerKeepAlivePeriodicallyCrossServers*'
{noformat}

The error message and the stack trace are below:
{noformat}
F20231113 12:21:13.431455 41195482 thread_restrictions.cc:79] Check failed: 
LoadTLS()->wait_allowed Waiting is not allowed to be used on this thread to 
prevent server-wide latency aberrations and deadlocks. Thread 41195482 (name: 
"rpc reactor", category: "reactor")
*** Check failure stack trace: ***
Process 77090 stopped
* thread #335, name = 'rpc reactor-41195482', stop reason = signal SIGABRT
frame #0: 0x7fff205b890e libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
->  0x7fff205b890e <+10>: jae    0x7fff205b8918 ; <+20>
0x7fff205b8910 <+12>: movq   %rax, %rdi
0x7fff205b8913 <+15>: jmp    0x7fff205b2ab9 ; cerror_nocancel
0x7fff205b8918 <+20>: retq   
Target 0: (client-test) stopped.
(lldb) bt
* thread #335, name = 'rpc reactor-41195482', stop reason = signal SIGABRT
  * frame #0: 0x7fff205b890e libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x7fff205e75bd libsystem_pthread.dylib`pthread_kill + 263
frame #2: 0x7fff2053c406 libsystem_c.dylib`abort + 125
frame #3: 0x00010f64ebd8 
libglog.1.dylib`google::LogMessage::SendToLog() [inlined] 
google::LogMessage::Fail() at logging.cc:1946:3 [opt]
frame #4: 0x00010f64ebd2 
libglog.1.dylib`google::LogMessage::SendToLog(this=0x70001a95e108) at 
logging.cc:1920:5 [opt]
frame #5: 0x00010f64f47a 
libglog.1.dylib`google::LogMessage::Flush(this=0x70001a95e108) at 
logging.cc:1777:5 [opt]
frame #6: 0x00010f65428f 
libglog.1.dylib`google::LogMessageFatal::~LogMessageFatal(this=0x70001a95e108)
 at logging.cc:2557:5 [opt]
frame #7: 0x00010f650349 
libglog.1.dylib`google::LogMessageFatal::~LogMessageFatal(this=) 
at logging.cc:2556:37 [opt]
frame #8: 0x00010e545473 
libkudu_util.dylib`kudu::ThreadRestrictions::AssertWaitAllowed() at 
thread_restrictions.cc:79:3
frame #9: 0x00010013ebb9 
client-test`kudu::CountDownLatch::Wait(this=0x70001a95e2a0) const at 
countdown_latch.h:74:5
frame #10: 0x00010a1749f5 
libkrpc.dylib`kudu::Notification::WaitForNotification(this=0x70001a95e2a0) 
const at notification.h:127:12
frame #11: 0x00010a1748e9 
libkrpc.dylib`kudu::rpc::Proxy::SyncRequest(this=0x00011317e9b8, 
method="ScannerKeepAlive", req=0x70001a95e428, resp=0x70001a95e408, 
controller=0x70001a95e458) at proxy.cc:259:8
frame #12: 0x00010697220f 
libtserver_service_proto.dylib`kudu::tserver::TabletServerServiceProxy::ScannerKeepAlive(this=0x00011317e9b8,
 req=0x70001a95e428, resp=0x70001a95e408, 
controller=0x70001a95e458) at tserver_service.proxy.cc:98:10
frame #13: 0x00010525c5b6 
libkudu_client.dylib`kudu::client::KuduScanner::Data::KeepAlive(this=0x00011290c700)
 at scanner-internal.cc:664:3
frame #14: 0x000105269e76 
libkudu_client.dylib`kudu::client::KuduScanner::Data::StartKeepAlivePeriodically(this=0x000112899858)::$_0::operator()()
 const at scanner-internal.cc:112:16
frame #15: 0x000105269e30 
libkudu_client.dylib`decltype(__f=0x000112899858)::$_0&>(fp)()) 
std::__1::__invoke)::$_0&>(kudu::client::KuduScanner::Data::StartKeepAlivePeriodically(unsigned
 long long, std::__1::shared_ptr)::$_0&) at 
type_traits:3694:1
frame #16: 0x000105269dd1 libkudu_client.dylib`void 
std::__1::__invoke_void_return_wrapper::__call(kudu::client::KuduScanner::Data::StartKeepAlivePeriodically(unsigned
 long long, std::__1::shared_ptr)::$_0&) at 
__functional_base:348:9
frame #17: 0x000105269d9d 
libkudu_client.dylib`std::__1::__function::__alloc_func)::$_0, 
std::__1::allocator)::$_0>, void 
()>::operator(this=0x000112899858)() at functional:1558:16
frame #18: 0x000105268ac9 
libkudu_client.dylib`std::__1::__function::__func)::$_0, 
std::__1::allocator)::$_0>, void 
()>::operator(this=0x000112899850)() at functional:1732:12
frame #19: 0x0001013ae082 
libtserver_test_util.dylib`std::__1::__function::__value_func::operator(this=0x000112899850)() const at functional:1885:16
frame #20: 0x0001013adee5 
libtserver_test_util.dylib`std::__1::function::operator(this=0x000112899850)() const at functional:2560:12
frame #21: 0x00010a16cd62 
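
The mechanism behind the fatal check above is roughly as follows (a minimal 
sketch with illustrative names, loosely mirroring 
{{kudu/util/thread_restrictions.h}}, not the exact internals): the rpc reactor 
marks its own thread as one where waiting is disallowed, and every blocking 
primitive asserts that waiting is allowed before sleeping.  The synchronous 
{{ScannerKeepAlive}} RPC issued from the reactor thread therefore trips the 
check.

{code:cpp}
#include <cassert>

// Each thread starts with waiting allowed; the rpc reactor turns it off
// for itself during startup.
thread_local bool tls_wait_allowed = true;

void DisallowWaiting() { tls_wait_allowed = false; }

// Blocking primitives (e.g. CountDownLatch::Wait()) call this before
// sleeping; in Kudu it's a fatal CHECK, hence the SIGABRT above.
void AssertWaitAllowed() {
  assert(tls_wait_allowed && "Waiting is not allowed on this thread");
}
{code}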

[jira] [Comment Edited] (KUDU-3523) st_blksize is not always equal to the filesystem block size

2023-11-07 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783824#comment-17783824
 ] 

Alexey Serbin edited comment on KUDU-3523 at 11/8/23 1:11 AM:
--

[~wangxixu],

Thank you for reporting the issue!  Indeed, as per 
[1|https://www.man7.org/linux/man-pages/man2/stat.2.html], the {{stat}} 
utility reports different 'block sizes' when run at the file level and at the 
filesystem level (that's why there is the {{\-f}} option).

That's right: st_blksize isn't supposed to always be equal to the filesystem 
block size, and that's so "by design", AFAIK.  The 'st_blksize' field stands 
for the IO block size, or "preferred" block size for efficient filesystem IO 
(a.k.a. the optimal IO transfer size hint); see 
[2|https://www.man7.org/linux/man-pages/man2/statx.2.html].

As for filesystem operations and {{LogBlockManager}} in Kudu, one crucial 
invariant is that 'st_blksize' is a multiple of the filesystem's block size.  
At least, that invariant is important in the scope of addressing 
[KUDU-620|https://issues.apache.org/jira/browse/KUDU-620] (that's where the 
{{PathInstanceMetadataPB::filesystem_block_size_bytes}} field originated).

In the scope of this JIRA, please feel free to add references to particular 
places where the difference between the IO block size and the filesystem block 
size might lead to inconsistencies.  I guess one of those is the misleading 
name of the {{PathInstanceMetadataPB::filesystem_block_size_bytes}} field.  
Probably, there are more places where the difference matters, and that could 
lead to issues in the actual functionality.

# 
[https://www.man7.org/linux/man-pages/man2/stat.2.html|https://www.man7.org/linux/man-pages/man2/stat.2.html]
# 
[https://www.man7.org/linux/man-pages/man2/statx.2.html|https://www.man7.org/linux/man-pages/man2/statx.2.html]
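
For a quick side-by-side look at the two 'block sizes' and the multiple 
invariant mentioned above, a sketch like the following (plain POSIX 
{{stat()}}/{{statvfs()}}, nothing Kudu-specific) can be used:

{code:cpp}
#include <sys/stat.h>
#include <sys/statvfs.h>

#include <cstdio>

int main(int argc, char** argv) {
  if (argc != 2) {
    std::fprintf(stderr, "usage: %s <path>\n", argv[0]);
    return 1;
  }
  struct stat st;
  struct statvfs vfs;
  if (stat(argv[1], &st) != 0 || statvfs(argv[1], &vfs) != 0) {
    return 1;
  }
  // st_blksize: the preferred ("IO") block size for efficient file IO.
  // f_frsize: the filesystem's fundamental block (fragment) size.
  std::printf("st_blksize=%ld f_frsize=%lu multiple=%s\n",
              static_cast<long>(st.st_blksize),
              static_cast<unsigned long>(vfs.f_frsize),
              st.st_blksize % vfs.f_frsize == 0 ? "yes" : "no");
  return 0;
}
{code}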


> st_blksize is not always equal to the filesystem block size
> --
>
> Key: KUDU-3523
> URL: https://issues.apache.org/jira/browse/KUDU-3523
> Project: Kudu
>  Issue Type: Bug
>Reporter: Xixu Wang
>Priority: Major
> Attachments: image-2023-11-06-15-42-46-082.png, 
> image-2023-11-06-15-45-11-819.png, image-2023-11-06-15-45-39-233.png, 
> image-2023-11-06-15-52-41-834.png
>
>
> In my aarch64 architecture system, the st_blksize is not equal to the real 
> filesystem block size. The st_blksize in my system is 65536 bytes, but the 
> block size of the filesystem is 4096 bytes. When writing data whose size is 
> less than 4096 bytes, the file's on-disk size is 4096 bytes, not 65536 bytes. 
> But Kudu uses st_blksize to decide the filesystem block size, which is not 
> always right.
>  
> *1. The test environment*
> Linux hybrid01 4.19.90-23.30.v2101.ky10.aarch64 #1 SMP Thu Dec 15 09:57:55 
> CST 2022 aarch64 aarch64 aarch64 GNU/Linux.  A docker container runs on it.
> *2. Create a file with an encryption header*
>  
> {code:java}
> const string kFile = JoinPathSegments(test_dir_, "encrypted_file");  
> unique_ptr<RWFile> rw;  
> RWFileOptions opts;  
> opts.is_sensitive = true;  
> 

[jira] [Commented] (KUDU-3523) st_blksize is not always equal to the filesystem block size

2023-11-07 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783824#comment-17783824
 ] 

Alexey Serbin commented on KUDU-3523:
-

[~wangxixu],

Thank you for reporting the issue!  Indeed, as per 
[1|https://www.man7.org/linux/man-pages/man2/stat.2.html], the {{stat}} 
utility reports different 'block sizes' when run at the file level and at the 
filesystem level (that's why there is the {{\-f}} option).

That's right: st_blksize isn't supposed to always be equal to the filesystem 
block size, and that's so "by design", AFAIK.  The 'st_blksize' field stands 
for the IO block size, or "preferred" block size for efficient filesystem IO 
(a.k.a. the optimal IO transfer size hint); see 
[2|https://www.man7.org/linux/man-pages/man2/statx.2.html].

As for filesystem operations and {{LogBlockManager}} in Kudu, one crucial 
invariant is that 'st_blksize' is a multiple of the filesystem's block size.  
At least, that invariant is important in the scope of addressing 
[KUDU-620|https://issues.apache.org/jira/browse/KUDU-620] (that's where the 
{{PathInstanceMetadataPB::filesystem_block_size_bytes}} field originated).

In the scope of this JIRA, please feel free to add references to particular 
places where the difference between the IO block size and the filesystem block 
size might lead to inconsistencies.  I guess one of those is the misleading 
name of the {{PathInstanceMetadataPB::filesystem_block_size_bytes}} field.  
Probably, there are more places where the difference matters, and that could 
lead to issues in the actual functionality.

# 
[https://www.man7.org/linux/man-pages/man2/stat.2.html|https://www.man7.org/linux/man-pages/man2/stat.2.html]
# 
[https://www.man7.org/linux/man-pages/man2/statx.2.html|https://www.man7.org/linux/man-pages/man2/statx.2.html]

> st_blksize is not always equal to the filesystem block size
> --
>
> Key: KUDU-3523
> URL: https://issues.apache.org/jira/browse/KUDU-3523
> Project: Kudu
>  Issue Type: Bug
>Reporter: Xixu Wang
>Priority: Major
> Attachments: image-2023-11-06-15-42-46-082.png, 
> image-2023-11-06-15-45-11-819.png, image-2023-11-06-15-45-39-233.png, 
> image-2023-11-06-15-52-41-834.png
>
>
> In my aarch64 architecture system, the st_blksize is not equal to the real 
> filesystem block size. The st_blksize in my system is 65536 bytes, but the 
> block size of the filesystem is 4096 bytes. When writing data whose size is 
> less than 4096 bytes, the file's on-disk size is 4096 bytes, not 65536 bytes. 
> But Kudu uses st_blksize to decide the filesystem block size, which is not 
> always right.
>  
> *1. The test environment*
> Linux hybrid01 4.19.90-23.30.v2101.ky10.aarch64 #1 SMP Thu Dec 15 09:57:55 
> CST 2022 aarch64 aarch64 aarch64 GNU/Linux.  A docker container runs on it.
> *2. Create a file with an encryption header*
>  
> {code:java}
> const string kFile = JoinPathSegments(test_dir_, "encrypted_file");  
> unique_ptr<RWFile> rw;  
> RWFileOptions opts;  
> opts.is_sensitive = true;  
> ASSERT_OK(env_->NewRWFile(opts, kFile, &rw));  
> uint64_t file_size = 0;  
> env_->GetFileSizeOnDisk(kFile, &file_size); {code}
> *3. stat the file*
>  
> The IO block size is 65536, which means st_blksize is 65536; the file's 
> logical size is 64 bytes.
> !image-2023-11-06-15-42-46-082.png!
> *4. The filesystem block size is 4096 bytes*
> !image-2023-11-06-15-45-39-233.png!
> *5. The file's on-disk size is 4096 bytes*
> !image-2023-11-06-15-52-41-834.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3521) Kudu servers sometimes crash when host clock is synchronized by PTPd

2023-11-04 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3521:

Fix Version/s: 1.16.1

> Kudu servers sometimes crash when host clock is synchronized by PTPd
> 
>
> Key: KUDU-3521
> URL: https://issues.apache.org/jira/browse/KUDU-3521
> Project: Kudu
>  Issue Type: Bug
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
> Fix For: 1.16.1, 1.18.0, 1.17.1
>
>
> This issue has been reported on the [\#kudu-general Slack 
> channel|https://getkudu.slack.com/archives/C0CPXJ3CH/p1698246065354269].  A 
> Kudu server of version 1.16.0 (not sure whether it was {{kudu-master}} or 
> {{kudu-tserver}}, but it doesn't matter) crashed with the following error:
> {noformat}
> F1024 22:32:06.866636 3323203 hybrid_clock.cc:452] Check failed: _s.ok() 
> unable to get current timestamp with error bound: Service unavailable: clock 
> error estimate (18446744073709551615us) too high (clock considered 
> synchronized by the kernel)
> {noformat}
> From the analysis of the [code in 
> hybrid_clock.cc|https://github.com/apache/kudu/blob/04fdbd0974f4418295d57c0daa4b67de3e777a43/src/kudu/clock/hybrid_clock.cc#L627-L705],
>  the only case this could happen is when {{t.maxerror}} turned out to be a negative 
> number (e.g., -1) in [this 
> code|https://github.com/apache/kudu/blob/aeaec84df536cbd9a55e5e09998d64a961f5d706/src/kudu/clock/system_ntp.cc#L176].
> Negative values of the {{timex::maxerror}} field have never been seen when 
> using ntpd or chronyd for clock synchronization, but it's necessary to update 
> the code to adapt to such situations: apparently, PTP might set the 
> {{maxerror}} field of the {{timex}} structure to a negative value and then 
> call {{adjtimex()}}.  That's obvious from [the PTPd's 
> code|https://github.com/ptpd/ptpd/blob/1ec9e650b03e6bd75dd3179fb5f09862ebdc54bf/src/dep/sys.c#L1969-L1984].
>   The essence of the issue is using unsigned integers for clock error in the 
> Kudu code, but {{timex.maxerror}} is a signed number, and at least PTPd sets 
> it to a negative number when calling {{adjtimex()}}.  Also, nowhere in [the 
> documentation for 
> adjtimex()|https://www.man7.org/linux/man-pages/man2/adjtimex.2.html] is it 
> stated that the {{maxerror}} field's value should be a non-negative number.
> As a side note, there was [a prior attempt to address this 
> issue|https://gerrit.cloudera.org/#/c/12149/], but not enough evidence was 
> presented for the RCA.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KUDU-3522) A tablet server starts in non-functional state when enabling data-at-rest encryption

2023-11-01 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3522:
---

 Summary: A tablet server starts in non-functional state when 
enabling data-at-rest encryption
 Key: KUDU-3522
 URL: https://issues.apache.org/jira/browse/KUDU-3522
 Project: Kudu
  Issue Type: Bug
  Components: security, tserver
Affects Versions: 1.17.0, 1.16.0
Reporter: Alexey Serbin


It's possible to configure a Kudu tablet server with the data-at-rest 
encryption feature enabled in such a way that the server runs in a 
non-functional state: the {{kudu-tserver}} process starts and runs with no 
visible issues, but it's not able to host any tablet replicas.

It's easy to fix/address the issue by adding an extra sanity check: when 
opening an already existing FS data directory structure, make sure the server 
encryption key isn't empty if the Kudu server is run with the 
{{\-\-encrypt_data_at_rest}} flag; see the sketch below.  There might be more 
alternatives around.
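
For illustration, such a sanity check might look like the sketch below (a 
hypothetical helper written in Kudu's {{Status}} idiom; this is not the actual 
patch):

{code:cpp}
#include <string>

#include "kudu/util/status.h"

// Hypothetical check to run when opening an already existing FS layout:
// fail fast instead of starting a tserver that cannot host any replicas.
kudu::Status CheckEncryptionConfig(bool encrypt_data_at_rest,
                                   const std::string& server_key) {
  if (encrypt_data_at_rest && server_key.empty()) {
    return kudu::Status::IllegalState(
        "--encrypt_data_at_rest is set, but the existing FS layout "
        "contains no server encryption key");
  }
  return kudu::Status::OK();
}
{code}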

The reproduction scenario for the issue is below.

# Start a tablet server without encryption-at-rest, making sure the tablet 
server starts and creates the directory structure on the file system.
 # Don't create any tables/ranges yet; it's necessary to make sure that not a 
single tablet replica is placed on the server yet.
 # Shut down the tablet server.
 # Update the configuration for the tablet server, enabling encryption-at-rest 
and specifying the key provider. For test purposes, it's enough to use the 
"default" key provider:
 {noformat}
--encrypt_data_at_rest=true
--encryption_key_provider=default
{noformat}
 # Start the tablet server.
 # Try to create a new tablet replica that would be placed on the tablet 
server.  That could be creating a new table, or moving a tablet replica from 
some other tablet server using the {{kudu tablet change_config move_replica}} 
CLI tool.
 # Check the logs of the Kudu master or the {{kudu}} CLI tool: there should be 
error messages like {{Failed to initialize encryption: error:0607B083:digital 
envelope routines:EVP_CipherInit_ex:no cipher set}}.
# No tablet replica can now be placed at the tablet server, while nothing 
suspicious can be found in the tablet server's log.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KUDU-3521) Kudu servers sometimes crash when host clock is synchronized by PTPd

2023-11-01 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3521.
-
Fix Version/s: 1.18.0
   1.17.1
   Resolution: Fixed

> Kudu servers sometimes crash when host clock is synchronized by PTPd
> 
>
> Key: KUDU-3521
> URL: https://issues.apache.org/jira/browse/KUDU-3521
> Project: Kudu
>  Issue Type: Bug
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
> Fix For: 1.18.0, 1.17.1
>
>
> This issue has been reported on the [\#kudu-general Slack 
> channel|https://getkudu.slack.com/archives/C0CPXJ3CH/p1698246065354269].  A 
> Kudu server of version 1.16.0 (not sure whether it was {{kudu-master}} or 
> {{kudu-tserver}}, but it doesn't matter) crashed with the following error:
> {noformat}
> F1024 22:32:06.866636 3323203 hybrid_clock.cc:452] Check failed: _s.ok() 
> unable to get current timestamp with error bound: Service unavailable: clock 
> error estimate (18446744073709551615us) too high (clock considered 
> synchronized by the kernel)
> {noformat}
> From the analysis of the [code in 
> hybrid_clock.cc|https://github.com/apache/kudu/blob/04fdbd0974f4418295d57c0daa4b67de3e777a43/src/kudu/clock/hybrid_clock.cc#L627-L705],
>  the only case this could happen is when {{t.maxerror}} turned out to be a negative 
> number (e.g., -1) in [this 
> code|https://github.com/apache/kudu/blob/aeaec84df536cbd9a55e5e09998d64a961f5d706/src/kudu/clock/system_ntp.cc#L176].
> Negative values of the {{timex::maxerror}} field have never been seen when 
> using ntpd or chronyd for clock synchronization, but it's necessary to update 
> the code to adapt to such situations: apparently, PTP might set the 
> {{maxerror}} field of the {{timex}} structure to a negative value and then 
> call {{adjtimex()}}.  That's obvious from [the PTPd's 
> code|https://github.com/ptpd/ptpd/blob/1ec9e650b03e6bd75dd3179fb5f09862ebdc54bf/src/dep/sys.c#L1969-L1984].
>   The essence of the issue is using unsigned integers for clock error in the 
> Kudu code, but {{timex.maxerror}} is a signed number, and at least PTPd sets 
> it to a negative number when calling {{adjtimex()}}.  Also, nowhere in [the 
> documentation for 
> adjtimex()|https://www.man7.org/linux/man-pages/man2/adjtimex.2.html] is it 
> stated that the {{maxerror}} field's value should be a non-negative number.
> As a side note, there was [a prior attempt to address this 
> issue|https://gerrit.cloudera.org/#/c/12149/], but not enough evidence was 
> presented for the RCA.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KUDU-3521) Kudu servers sometimes crash when host clock is synchronized by PTPd

2023-10-31 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-3521:
---

Assignee: Alexey Serbin

> Kudu servers sometimes crash when host clock is synchronized by PTPd
> 
>
> Key: KUDU-3521
> URL: https://issues.apache.org/jira/browse/KUDU-3521
> Project: Kudu
>  Issue Type: Bug
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>
> This issue has been reported on the [\#kudu-general Slack 
> channel|https://getkudu.slack.com/archives/C0CPXJ3CH/p1698246065354269].  A 
> Kudu server of version 1.16.0 (not sure whether it was {{kudu-master}} or 
> {{kudu-tserver}}, but it doesn't matter) crashed with the following error:
> {noformat}
> F1024 22:32:06.866636 3323203 hybrid_clock.cc:452] Check failed: _s.ok() 
> unable to get current timestamp with error bound: Service unavailable: clock 
> error estimate (18446744073709551615us) too high (clock considered 
> synchronized by the kernel)
> {noformat}
> From the analysis of the [code in 
> hybrid_clock.cc|https://github.com/apache/kudu/blob/04fdbd0974f4418295d57c0daa4b67de3e777a43/src/kudu/clock/hybrid_clock.cc#L627-L705],
>  the only case this could happen is when {{t.maxerror}} turned out to be a negative 
> number (e.g., -1) in [this 
> code|https://github.com/apache/kudu/blob/aeaec84df536cbd9a55e5e09998d64a961f5d706/src/kudu/clock/system_ntp.cc#L176].
> Negative values of the {{timex::maxerror}} field have never been seen when 
> using ntpd or chronyd for clock synchronization, but it's necessary to update 
> the code to adapt to such situations: apparently, PTP might set the 
> {{maxerror}} field of the {{timex}} structure to a negative value and then 
> call {{adjtimex()}}.  That's obvious from [the PTPd's 
> code|https://github.com/ptpd/ptpd/blob/1ec9e650b03e6bd75dd3179fb5f09862ebdc54bf/src/dep/sys.c#L1969-L1984].
>   The essence of the issue is using unsigned integers for clock error in the 
> Kudu code, but {{timex.maxerror}} is a signed number, and at least PTPd sets 
> it to a negative number when calling {{adjtimex()}}.  Also, nowhere in [the 
> documentation for 
> adjtimex()|https://www.man7.org/linux/man-pages/man2/adjtimex.2.html] is it 
> stated that the {{maxerror}} field's value should be a non-negative number.
> As a side note, there was [a prior attempt to address this 
> issue|https://gerrit.cloudera.org/#/c/12149/], but not enough evidence was 
> presented for the RCA.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3521) Kudu servers sometimes crash when host clock is synchronized by PTPd

2023-10-31 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3521:

Summary: Kudu servers sometimes crash when host clock is synchronized by 
PTPd  (was: Kudu servers sometimes crash when host clock is synchronized with 
PTP)

> Kudu servers sometimes crash when host clock is synchronized by PTPd
> 
>
> Key: KUDU-3521
> URL: https://issues.apache.org/jira/browse/KUDU-3521
> Project: Kudu
>  Issue Type: Bug
>Reporter: Alexey Serbin
>Priority: Major
>
> This issue has been reported on the [\#kudu-general Slack 
> channel|https://getkudu.slack.com/archives/C0CPXJ3CH/p1698246065354269].  A 
> Kudu server of version 1.16.0 (not sure whether it was {{kudu-master}} or 
> {{kudu-tserver}}, but it doesn't matter) crashed with the following error:
> {noformat}
> F1024 22:32:06.866636 3323203 hybrid_clock.cc:452] Check failed: _s.ok() 
> unable to get current timestamp with error bound: Service unavailable: clock 
> error estimate (18446744073709551615us) too high (clock considered 
> synchronized by the kernel)
> {noformat}
> From the analysis of the [code in 
> hybrid_clock.cc|https://github.com/apache/kudu/blob/04fdbd0974f4418295d57c0daa4b67de3e777a43/src/kudu/clock/hybrid_clock.cc#L627-L705],
>  the only case this could happen is when {{t.maxerror}} turned out to be a negative 
> number (e.g., -1) in [this 
> code|https://github.com/apache/kudu/blob/aeaec84df536cbd9a55e5e09998d64a961f5d706/src/kudu/clock/system_ntp.cc#L176].
> Negative values of the {{timex::maxerror}} field have never been seen when 
> using ntpd or chronyd for clock synchronization, but it's necessary to update 
> the code to adapt to such situations: apparently, PTP might set the 
> {{maxerror}} field of the {{timex}} structure to a negative value and then 
> call {{adjtimex()}}.  That's obvious from [the PTPd's 
> code|https://github.com/ptpd/ptpd/blob/1ec9e650b03e6bd75dd3179fb5f09862ebdc54bf/src/dep/sys.c#L1969-L1984].
>   The essence of the issue is using unsigned integers for clock error in the 
> Kudu code, but {{timex.maxerror}} is a signed number, and at least PTPd sets 
> it to a negative number when calling {{adjtimex()}}.  Also, nowhere in [the 
> documentation for 
> adjtimex()|https://www.man7.org/linux/man-pages/man2/adjtimex.2.html] is it 
> stated that the {{maxerror}} field's value should be a non-negative number.
> As a side note, there was [a prior attempt to address this 
> issue|https://gerrit.cloudera.org/#/c/12149/], but not enough evidence was 
> presented for the RCA.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KUDU-3521) Kudu servers sometimes crash when host clock is synchronized with PTP

2023-10-31 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3521:
---

 Summary: Kudu servers sometimes crash when host clock is 
synchronized with PTP
 Key: KUDU-3521
 URL: https://issues.apache.org/jira/browse/KUDU-3521
 Project: Kudu
  Issue Type: Bug
Reporter: Alexey Serbin


This issue has been reported on the [\#kudu-general Slack 
channel|https://getkudu.slack.com/archives/C0CPXJ3CH/p1698246065354269].  A 
Kudu server of version 1.16.0 (not sure whether it was {{kudu-master}} or 
{{kudu-tserver}}, but it doesn't matter) crashed with the following error:

{noformat}
F1024 22:32:06.866636 3323203 hybrid_clock.cc:452] Check failed: _s.ok() unable 
to get current timestamp with error bound: Service unavailable: clock error 
estimate (18446744073709551615us) too high (clock considered synchronized by 
the kernel)
{noformat}

From the analysis of the [code in 
hybrid_clock.cc|https://github.com/apache/kudu/blob/04fdbd0974f4418295d57c0daa4b67de3e777a43/src/kudu/clock/hybrid_clock.cc#L627-L705],
the only case this could happen is when {{t.maxerror}} turned out to be a 
negative number (e.g., -1) in [this 
code|https://github.com/apache/kudu/blob/aeaec84df536cbd9a55e5e09998d64a961f5d706/src/kudu/clock/system_ntp.cc#L176].

Negative values of the {{timex::maxerror}} field have never been seen when 
using ntpd or chronyd for clock synchronization, but it's necessary to update 
the code to adapt to such situations: apparently, PTP might set the 
{{maxerror}} field of the {{timex}} structure to a negative value and then call 
{{adjtimex()}}.  That's obvious from [the PTPd's 
code|https://github.com/ptpd/ptpd/blob/1ec9e650b03e6bd75dd3179fb5f09862ebdc54bf/src/dep/sys.c#L1969-L1984].
  The essence of the issue is using unsigned integers for clock error in the 
Kudu code, but {{timex.maxerror}} is a signed number, and at least PTPd sets it 
to a negative number when calling {{adjtimex()}}.  Also, nowhere in [the 
documentation for 
adjtimex()|https://www.man7.org/linux/man-pages/man2/adjtimex.2.html] is it 
stated that the {{maxerror}} field's value should be a non-negative number.

As a side note, there was [a prior attempt to address this 
issue|https://gerrit.cloudera.org/#/c/12149/], but not enough evidence was 
presented for the RCA.
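
A minimal sketch of the signed-to-unsigned hazard (the function name is 
illustrative, not Kudu's actual fix): {{timex::maxerror}} is a signed 
{{long}}, so a negative value has to be clamped before it's stored into an 
unsigned error-bound type, otherwise -1 becomes 18446744073709551615.

{code:cpp}
#include <sys/timex.h>

#include <cstdint>

// Reads the kernel's maximum clock error estimate (in microseconds),
// clamping negative values (as PTPd may set them) instead of letting the
// conversion to an unsigned type wrap around.
uint64_t MaxClockErrorMicros() {
  timex t = {};
  adjtimex(&t);                      // t.modes == 0: read-only query
  const long maxerror = t.maxerror;  // signed; may be negative under PTPd
  return maxerror < 0 ? 0 : static_cast<uint64_t>(maxerror);
}
{code}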



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KUDU-3520) File descriptor leak in Env::NewRWFile() when encryption-at-rest is enabled

2023-10-27 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3520:
---

 Summary: File descriptor leak in Env::NewRWFile() when 
encryption-at-rest is enabled
 Key: KUDU-3520
 URL: https://issues.apache.org/jira/browse/KUDU-3520
 Project: Kudu
  Issue Type: Bug
  Components: fs, security, tserver
Affects Versions: 1.17.0, 1.16.0
Reporter: Alexey Serbin


There is a file descriptor leak in {{Env::NewRWFile()}} on an error path when 
encryption-at-rest is enabled.

In the code below, if {{ReadEncryptionHeader()}} or {{WriteEncryptionHeader()}} 
failed, the descriptor of the file opened by {{DoOpen()}} would be leaked.

{noformat}
RETURN_NOT_OK(DoOpen(fname, opts.mode, &fd));
EncryptionHeader eh;
if (encrypt) {
  DCHECK(encryption_key_);
  if (size >= kEncryptionHeaderSize) {
    RETURN_NOT_OK(ReadEncryptionHeader(fd, fname, *encryption_key_, &eh));
  } else {
    RETURN_NOT_OK(GenerateHeader(&eh));
    RETURN_NOT_OK(WriteEncryptionHeader(fd, fname, *encryption_key_, eh));
  }
}
result->reset(new PosixRWFile(fname, fd, opts.sync_on_close, encrypt, eh));
{noformat}
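
One common way to plug this kind of leak is a small RAII guard that closes the 
descriptor on every early-return path unless ownership is handed off; a 
minimal sketch (the guard type is illustrative, not Kudu's actual fix):

{code:cpp}
#include <unistd.h>

// Closes the wrapped descriptor on destruction unless Release() was called.
class FdGuard {
 public:
  explicit FdGuard(int fd) : fd_(fd) {}
  ~FdGuard() {
    if (fd_ >= 0) close(fd_);
  }
  FdGuard(const FdGuard&) = delete;
  FdGuard& operator=(const FdGuard&) = delete;

  void Release() { fd_ = -1; }  // ownership transferred; don't close

 private:
  int fd_;
};
{code}

With such a guard constructed right after {{DoOpen()}} succeeds and released 
just before {{PosixRWFile}} takes ownership of {{fd}}, the {{RETURN_NOT_OK()}} 
error paths no longer leak the descriptor.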

It's been evidenced in the wild: creating the metadata file for a tablet 
during tablet copying failed with an error like the one below:

{noformat}
Runtime error: Couldn't create tablet metadata: Failed to write tablet metadata 
d199a872b03848d695f067ed5c694835: Failed to initialize encryption: 
error:0607B083:digital envelope routines:EVP_CipherInit_ex:no cipher 
set:crypto/evp/evp_enc.c:170
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KUDU-3363) impala gets wrong timestamp when scanning kudu timestamp with timezone

2023-10-25 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779605#comment-17779605
 ] 

Alexey Serbin commented on KUDU-3363:
-

Thank you for reporting the issue.

The root cause of the issue is explained in 
[IMPALA-12370|https://issues.apache.org/jira/browse/IMPALA-12370]; a 
work-around recipe is described there as well.

> impala gets wrong timestamp when scanning kudu timestamp with timezone
> -
>
> Key: KUDU-3363
> URL: https://issues.apache.org/jira/browse/KUDU-3363
> Project: Kudu
>  Issue Type: Bug
>  Components: impala
>Reporter: daicheng
>Priority: Major
> Attachments: image-2022-04-24-00-01-05-746.png, 
> image-2022-04-24-00-01-37-520.png, image-2022-04-24-00-03-14-467.png, 
> image-2022-04-24-00-04-16-240.png, image-2022-04-24-00-04-52-860.png, 
> image-2022-04-24-00-05-52-086.png, image-2022-04-24-00-07-09-776.png
>
>
> impala version is 3.1.0-cdh6.1
> !image-2022-04-24-00-01-37-520.png|width=504,height=37!
> I have set the system timezone to Asia/Shanghai:
> !image-2022-04-24-00-01-05-746.png|width=566,height=91!
> Here is the bug:
> *Step 1*
> I have a parquet file with two columns like below, and read it with 
> impala-shell and Spark (timezone=shanghai):
> !image-2022-04-24-00-03-14-467.png|width=666,height=101!
> !image-2022-04-24-00-04-16-240.png|width=551,height=214!
> Both results are exactly right.
> *Step 2*
> Create a Kudu table with impala-shell:
> CREATE TABLE default.test_{_}test{_}_test_time2 (id BIGINT,t 
> TIMESTAMP,PRIMARY KEY (id) ) STORED AS KUDU;
> Note: Kudu version 1.8.
> Then insert 2 rows into the table with Spark:
> !image-2022-04-24-00-04-52-860.png|width=577,height=176!
> *Step 3*
> Read it with Spark (timezone=shanghai); Spark reads the Kudu table with the 
> kudu-client API. Here is the result:
> !image-2022-04-24-00-05-52-086.png|width=747,height=246!
> The result is still exactly right.
> But reading it with impala-shell:
> !image-2022-04-24-00-07-09-776.png|width=701,height=118!
> the result shows timestamps 8 hours late.
> *Conclusion*
> It seems the Impala timezone setting didn't take effect when the Kudu column 
> type is timestamp, but it works fine with parquet files; I don't know why.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KUDU-3483) Flushing data in AUTO_FLUSH_BACKGROUND mode fails when the table's schema is changing

2023-10-25 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-3483.
-
Fix Version/s: 1.18.0
   Resolution: Fixed

Thank you for reporting and fixing the issue, [~wangxixu]!

> Flushing data in AUTO_FLUSH_BACKGROUND mode fails when the table's schema is 
> changing
> -
>
> Key: KUDU-3483
> URL: https://issues.apache.org/jira/browse/KUDU-3483
> Project: Kudu
>  Issue Type: Bug
>Reporter: Xixu Wang
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: image-2023-05-30-16-12-20-361.png
>
>
>  
> *1. The problem*
> Flushing multiple rows in AUTO_FLUSH_BACKGROUND mode may fail when the table 
> schema has changed. The following is the error message:
> !image-2023-05-30-16-12-20-361.png!
>  
> *2. How to reproduce the case*
> 1. Create a table with 2 columns.
> 2. Insert a row into this table in AUTO_FLUSH_BACKGROUND mode.
> 3. Add 3 new columns to this table.
> 4. Reopen this table.
> 5. Insert a row into this table in AUTO_FLUSH_BACKGROUND mode.
> 6. Flush the buffer.
> {code:java}
> KuduTable table = createTable(ImmutableList.of());
> // Add a row with addNullableDef=null
> final KuduSession session = client.newSession();
> session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);   
>  
> Insert insert = table.newInsert();
> PartialRow row = insert.getRow();
> row.addInt("c0", 101);
> row.addInt("c1", 101);
> session.apply(insert);
> // Add some new columns.
> client.alterTable(tableName, new AlterTableOptions()
>   .addColumn("addNonNull", Type.INT32, 100)
>   .addNullableColumn("addNullable", Type.INT32)
>   .addNullableColumn("addNullableDef", Type.INT32, 200));
> 
> // Reopen table for the new schema.
> table = client.openTable(tableName);
> assertEquals(5, table.getSchema().getColumnCount());
> Insert newinsert = table.newInsert();
> PartialRow newrow = newinsert.getRow();
> newrow.addInt("c0", 101);
> newrow.addInt("c1", 101);
> newrow.addInt("addNonNull", 101);
> newrow.addInt("addNullable", 101);
> newrow.setNull("addNullableDef");
> session.apply(newinsert);
> session.flush(); {code}
>  
> *3. Why this problem happens*
> In AUTO_FLUSH_BACKGROUND mode, an applied operation is first inserted into 
> the buffer. When the buffer is full or flush() is called, the client tries 
> to flush the buffered rows to the Kudu server. First, it groups the rows by 
> tablet id into batches; a batch may contain multiple rows belonging to the 
> same tablet. Then each batch is encoded into bytes. At this point the 
> encoder reads the table schema of the first row and uses it to decide the 
> format of the data. If two rows have different schemas but belong to the 
> same table (because the table was altered between the two inserts), encoding 
> fails with an array-index-out-of-bounds exception; see the sketch below.
>  
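> A hedged sketch of the failure mode (illustrative types, not the real 
> client internals): the encoder fixes the column count from the first row's 
> schema, so a row carrying a different number of cells gets indexed past its 
> end.
> {code:cpp}
> #include <cstddef>
> #include <cstdint>
> #include <vector>
> 
> struct Row { std::vector<int32_t> cells; };  // one cell per column
> 
> // Encodes every row using the first row's schema width (assumes a non-empty
> // batch); a row with fewer cells triggers an out-of-range error, analogous
> // to the ArrayIndexOutOfBoundsException seen in the Java client.
> void EncodeBatch(const std::vector<Row>& batch, std::vector<int32_t>* out) {
>   const std::size_t ncols = batch.front().cells.size();
>   for (const Row& r : batch) {
>     for (std::size_t i = 0; i < ncols; ++i) {
>       out->push_back(r.cells.at(i));  // throws std::out_of_range
>     }
>   }
> }
> {code}
>  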
> By the way, it is hard to trace the whole process, especially in the Kudu 
> tablet server; it would be better to log the downstream IP and client id.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

