[jira] [Commented] (KUDU-3402) Make tracing.html work on newer browsers

2023-06-23 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17736544#comment-17736544
 ] 

Todd Lipcon commented on KUDU-3402:
---

I think it's been deprecated and replaced by Perfetto (standalone trace 
viewer): 
https://chromium.googlesource.com/catapult/+/refs/heads/main/tracing/docs/perfetto.md

> Make tracing.html work on newer browsers
> 
>
> Key: KUDU-3402
> URL: https://issues.apache.org/jira/browse/KUDU-3402
> Project: Kudu
>  Issue Type: Improvement
>Reporter: Marton Greber
>Priority: Minor
>
> After starting up Kudu, navigate to one of the Master or Tablet Server 
> /tracing.html endpoint to see the tracing web UI. By using a recent version 
> of Chrome, one gets a blank screen. Pulling up the logs, one can see the root 
> cause: document.registerElement:
> {code:java}
> Uncaught TypeError: document.registerElement is not a function
>     at tracing.js:31:88
>     at tracing.js:31:448 {code}
> By checking the [MDN Web 
> Docs|https://developer.mozilla.org/en-US/docs/Web/API/Document/registerElement],
>  it can be seen that this feature has been deprecated. In Chrome it was 
> removed in version 80.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KUDU-3383) About strong consistency read from leader

2022-07-27 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572022#comment-17572022
 ] 

Todd Lipcon commented on KUDU-3383:
---

You might want to look at Yugabyte's implementation of leader leases (it's on 
top of a Kudu-derived source, so a similar approach ought to work for Kudu):

https://github.com/yugabyte/yugabyte-db/blob/master/src/yb/consensus/raft_consensus.cc#L180

> About strong consistency read from leader
> -
>
> Key: KUDU-3383
> URL: https://issues.apache.org/jira/browse/KUDU-3383
> Project: Kudu
>  Issue Type: Improvement
>Reporter: shenxingwuying
>Assignee: shenxingwuying
>Priority: Major
> Attachments: image-2022-07-20-23-14-34-519.png, 
> image-2022-07-20-23-17-40-718.png
>
>
> As described in https://issues.apache.org/jira/browse/KUDU-3382,
> I am talking about linearizable reads.
> h1. Background && Motivation
> Linearizable reads are a very developer-friendly feature, and Kudu could 
> support them. As far as I can tell, Kudu may not implement them yet.
> h1. Issue of linearizable reads from the leader
> We need to talk about this issue.
> The feature is especially important for KV systems, while Kudu is mainly 
> OLAP oriented. But in some scenarios the feature also provides advantages.
> Kudu implements reads via Scan, even when reading a single row: the client 
> sends a ScanRequest carrying a NewScanRequest and then sends 
> ContinueScanRequest. This feature is aimed at NewScanRequest.
> Kudu's Raft implementation uses a strong leader: the leader's state machine 
> is never older than the followers', and a follower that hits a heartbeat 
> timeout or receives a leader election request (leader transfer) can elect a 
> new leader and switch leaders.
> If Kudu needs linearizable reads, reading from the leader is not enough, 
> because two leaders may exist for a very small period of time.
> I provide two scenarios. The first one:
>  
> !image-2022-07-20-23-17-40-718.png!
>  
>  # A Raft group has 3 replicas: L1, F2, F3. Their states are steady during 
> term 1.
>  # After a network partition, F2 and F3 lose the leader's heartbeat; F3 
> starts an election and F2 votes for it.
>  # F3 becomes leader; we can call it L3. At this moment, there are 2 
> leaders: L1 (term 1) and L3 (term 2).
>  # This state persists until the network partition recovers. That time may 
> be short or long.
> While two leaders exist, reads are not linearizable. So Kudu should avoid 
> double leaders at all times; the corresponding cost is having no leader for 
> a small period of time. Kudu has to make this trade-off. Since users 
> usually need linearizability, I think Kudu should support it. The brief 
> unavailability while there is no leader can be absorbed by client-side 
> fault tolerance.
> As to whether reading from the leader is already linearizable, someone can 
> confirm it, or I can do an experiment.
> Kudu should avoid double leaders even for a very small period of time when 
> a network fault happens. I reviewed the code, and I think the problem 
> exists today.
> h1. Solution
> To avoid the double-leader problem, the leader should keep itself alive: if 
> a leader does not receive enough heartbeat responses within a period of 
> time, it should step down and then start another election, just like a 
> follower does. The leader's timeout should be less than the follower's 
> election timeout.
> Another scheme: before serving a read, the leader should send heartbeats to 
> two followers (a majority) to make sure it is still the valid leader.
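> A minimal sketch of the lease-style step-down check described above (all 
> names are hypothetical, not actual Kudu APIs): the leader remembers when a 
> majority last acknowledged it, and treats itself as valid only within a 
> lease interval shorter than the follower election timeout.
> {code:cpp}
> #include <chrono>
> 
> using Clock = std::chrono::steady_clock;
> 
> class LeaderLease {
>  public:
>   // Called whenever a majority of followers has acked a heartbeat round.
>   void RecordMajorityAck() { last_majority_ack_ = Clock::now(); }
> 
>   // Checked before serving a linearizable read (and periodically). The
>   // lease interval must be shorter than the follower election timeout so
>   // the old leader steps down before a new leader can be elected.
>   bool LeaseValid(std::chrono::milliseconds lease_interval) const {
>     return Clock::now() - last_majority_ack_ < lease_interval;
>   }
> 
>  private:
>   Clock::time_point last_majority_ack_ = Clock::now();
> };
> {code}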



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KUDU-3371) Use RocksDB to store LBM metadata

2022-05-25 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542192#comment-17542192
 ] 

Todd Lipcon commented on KUDU-3371:
---

Oh, actually it looks like I did put it on public GitHub. Some info on my work 
is here (including the link): https://issues.apache.org/jira/browse/KUDU-2204

> Use RocksDB to store LBM metadata
> -
>
> Key: KUDU-3371
> URL: https://issues.apache.org/jira/browse/KUDU-3371
> Project: Kudu
>  Issue Type: Improvement
>  Components: fs
>Reporter: Yingchun Lai
>Priority: Major
>
> h1. Motivation
> The current LBM container uses separate .data and .metadata files. The 
> .data file stores the real user data; we can use hole punching to reduce 
> disk space. The metadata, by contrast, is written as serialized protobuf 
> strings to a file in append-only mode. Each protobuf object is a 
> BlockRecordPB:
>  
> {code:java}
> message BlockRecordPB {
>   required BlockIdPB block_id = 1;  // int64
>   required BlockRecordType op_type = 2;  // CREATE or DELETE
>   required uint64 timestamp_us = 3;
>   optional int64 offset = 4; // Required for CREATE.
>   optional int64 length = 5; // Required for CREATE.
> } {code}
> That means each object is either of type CREATE or DELETE. To mark a block 
> as deleted, there will be 2 objects in the metadata: one of CREATE type and 
> the other of DELETE type.
> There are some weak points in the current LBM metadata storage mechanism:
> h2. 1. Disk space amplification
> The metadata's live-block rate may be very low; in the worst case only 1 
> block is alive (supposing it hasn't reached the runtime compaction 
> threshold) while all the other thousands of blocks are dead (i.e. exist as 
> CREATE-DELETE pairs).
> So the disk space amplification is very serious.
> h2. 2. Long bootstrap time
> In the Kudu server bootstrap stage, it has to replay all the metadata files 
> to find the alive blocks. In the worst case, we may replay thousands of 
> blocks from metadata but find that only a very few are alive.
> This wastes time in almost all cases, since a Kudu cluster in a production 
> environment typically runs for several months without bootstrapping, so the 
> LBM may be very loose.
> h2. 3. Metadata compaction
> To mitigate the issues above, there is a metadata compaction mechanism in 
> LBM, both at runtime and in the bootstrap stage.
> The one at runtime locks the container, and it's synchronous.
> The one in the bootstrap stage is synchronous too, and may make the 
> bootstrap time longer.
> h1. Optimization by using RocksDB
> h2. Storage design
>  * RocksDB instance: one RocksDB instance per data directory.
>  * Key: .
>  * Value: the same as before, i.e. the serialized protobuf string, stored 
> only for CREATE entries.
>  * Put/Delete: put the value to RocksDB when creating a block, delete it 
> from RocksDB when deleting a block.
>  * Scan: happens only in the bootstrap stage, to retrieve all blocks.
>  * DeleteRange: happens only when invalidating a container (see the sketch 
> after this list).
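> A minimal sketch of those four operations against the RocksDB C++ API (the 
> key scheme and literal values here are assumptions for illustration, not 
> the actual patch):
> {code:cpp}
> #include <cassert>
> #include <memory>
> #include <string>
> 
> #include "rocksdb/db.h"
> 
> int main() {
>   rocksdb::DB* raw_db = nullptr;
>   rocksdb::Options options;
>   options.create_if_missing = true;
>   // One RocksDB instance per data directory, as proposed above.
>   rocksdb::Status s = rocksdb::DB::Open(options, "/data1/lbm-meta", &raw_db);
>   assert(s.ok());
>   std::unique_ptr<rocksdb::DB> db(raw_db);
> 
>   // Assumed key scheme: container id + block id.
>   std::string key = "container42.block7";
> 
>   // Create block -> Put the serialized BlockRecordPB as the value.
>   s = db->Put(rocksdb::WriteOptions(), key, "<serialized BlockRecordPB>");
>   assert(s.ok());
> 
>   // Delete block -> Delete the key; no CREATE-DELETE pair accumulates.
>   s = db->Delete(rocksdb::WriteOptions(), key);
>   assert(s.ok());
> 
>   // Bootstrap -> Scan everything; only live blocks remain in the DB.
>   std::unique_ptr<rocksdb::Iterator> it(
>       db->NewIterator(rocksdb::ReadOptions()));
>   for (it->SeekToFirst(); it->Valid(); it->Next()) {
>     // Rebuild in-memory block metadata from it->key() / it->value().
>   }
> 
>   // Invalidate container -> DeleteRange over the container's key range.
>   s = db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
>                       "container42.", "container42/");
>   assert(s.ok());
>   return 0;
> }
> {code}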
> h2. Advantages
>  # Disk space amplification: there is still a disk space amplification 
> problem, but we can tune RocksDB to reach a balanced point. I trust that in 
> most cases RocksDB is better than an append-only file.
>  # Bootstrap time: since only valid blocks are left in RocksDB, bootstrap 
> may be much faster than before.
>  # Metadata compaction: we can leave this work to RocksDB; of course, 
> tuning is needed.
> h2. Test & benchmark
> I've been trying to use RocksDB to store LBM container metadata recently, 
> have finished most of the work, and did some benchmarking. It shows that 
> fs-module block read/write/delete performance is similar to, or a little 
> worse than, the old implementation, while the bootstrap time may be reduced 
> several times over.
> I'm not sure whether it is worth continuing this work, or whether anybody 
> knows of any previous discussion on this topic.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (KUDU-3371) Use RocksDB to store LBM metadata

2022-05-25 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542191#comment-17542191
 ] 

Todd Lipcon commented on KUDU-3371:
---

I think this is a reasonable idea. A long time ago I also tried using rocksdb 
to store consensus metadata and tablet metadata. I think part of that patch 
series ended up being uploaded by [~anjuwong] at 
https://gerrit.cloudera.org/#/c/10649/, though the rest is probably lost to my 
old work laptop.

> Use RocksDB to store LBM metadata
> -
>
> Key: KUDU-3371
> URL: https://issues.apache.org/jira/browse/KUDU-3371
> Project: Kudu
>  Issue Type: Improvement
>  Components: fs
>Reporter: Yingchun Lai
>Priority: Major
>
> h1. Motivation
> The current LBM container uses separate .data and .metadata files. The 
> .data file stores the real user data; we can use hole punching to reduce 
> disk space. The metadata, by contrast, is written as serialized protobuf 
> strings to a file in append-only mode. Each protobuf object is a 
> BlockRecordPB:
>  
> {code:java}
> message BlockRecordPB {
>   required BlockIdPB block_id = 1;  // int64
>   required BlockRecordType op_type = 2;  // CREATE or DELETE
>   required uint64 timestamp_us = 3;
>   optional int64 offset = 4; // Required for CREATE.
>   optional int64 length = 5; // Required for CREATE.
> } {code}
> That means each object is either of type CREATE or DELETE. To mark a block 
> as deleted, there will be 2 objects in the metadata: one of CREATE type and 
> the other of DELETE type.
> There are some weak points in the current LBM metadata storage mechanism:
> h2. 1. Disk space amplification
> The metadata's live-block rate may be very low; in the worst case only 1 
> block is alive (supposing it hasn't reached the runtime compaction 
> threshold) while all the other thousands of blocks are dead (i.e. exist as 
> CREATE-DELETE pairs).
> So the disk space amplification is very serious.
> h2. 2. Long bootstrap time
> In the Kudu server bootstrap stage, it has to replay all the metadata files 
> to find the alive blocks. In the worst case, we may replay thousands of 
> blocks from metadata but find that only a very few are alive.
> This wastes time in almost all cases, since a Kudu cluster in a production 
> environment typically runs for several months without bootstrapping, so the 
> LBM may be very loose.
> h2. 3. Metadata compaction
> To mitigate the issues above, there is a metadata compaction mechanism in 
> LBM, both at runtime and in the bootstrap stage.
> The one at runtime locks the container, and it's synchronous.
> The one in the bootstrap stage is synchronous too, and may make the 
> bootstrap time longer.
> h1. Optimization by using RocksDB
> h2. Storage design
>  * RocksDB instance: one RocksDB instance per data directory.
>  * Key: .
>  * Value: the same as before, i.e. the serialized protobuf string, stored 
> only for CREATE entries.
>  * Put/Delete: put the value to RocksDB when creating a block, delete it 
> from RocksDB when deleting a block.
>  * Scan: happens only in the bootstrap stage, to retrieve all blocks.
>  * DeleteRange: happens only when invalidating a container.
> h2. Advantages
>  # Disk space amplification: there is still a disk space amplification 
> problem, but we can tune RocksDB to reach a balanced point. I trust that in 
> most cases RocksDB is better than an append-only file.
>  # Bootstrap time: since only valid blocks are left in RocksDB, bootstrap 
> may be much faster than before.
>  # Metadata compaction: we can leave this work to RocksDB; of course, 
> tuning is needed.
> h2. Test & benchmark
> I've been trying to use RocksDB to store LBM container metadata recently, 
> have finished most of the work, and did some benchmarking. It shows that 
> fs-module block read/write/delete performance is similar to, or a little 
> worse than, the old implementation, while the bootstrap time may be reduced 
> several times over.
> I'm not sure whether it is worth continuing this work, or whether anybody 
> knows of any previous discussion on this topic.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (KUDU-2844) Avoid copying strings from dictionary or plain-encoded blocks

2020-08-26 Thread Todd Lipcon (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2844.
---
Fix Version/s: 1.13.0
   Resolution: Fixed

> Avoid copying strings from dictionary or plain-encoded blocks
> -
>
> Key: KUDU-2844
> URL: https://issues.apache.org/jira/browse/KUDU-2844
> Project: Kudu
>  Issue Type: Improvement
>  Components: cfile, perf
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Major
> Fix For: 1.13.0
>
> Attachments: fg.svg
>
>
> When scanning a plain or dictionary-encoded binary column, we currently loop 
> over each entry and copy the string into the destination RowBlock's arena. In 
> TPCH Q1, the scanner threads use a significant percentage of CPU doing this 
> copying, and it also increases CPU cache footprint which likely decreases 
> performance in downstream operations like predicate evaluation, merging, 
> result serialization, etc.
> Instead of doing this, we could "attach" the dictionary block (with 
> ref-counting) to the RowBlock and refer directly to the dictionary entry from 
> the RowBlock. When the RowBlock eventually is reset, we can drop the 
> reference. This should be safe because we never mutate indirect data in-place.
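> A minimal sketch of the attach/reset lifecycle (hypothetical names, not the 
> actual Kudu classes): the RowBlock holds a reference to each attached 
> dictionary block, and dropping the references on reset frees the dictionary 
> once no RowBlock points into it.
> {code:cpp}
> #include <memory>
> #include <utility>
> #include <vector>
> 
> struct DictBlock {
>   std::vector<char> data;  // decoded dictionary entries live here
> };
> 
> class RowBlock {
>  public:
>   // Keep the dictionary alive as long as any cell in this block refers
>   // into it; cells can then store pointers into dict->data, copy-free.
>   void AttachDict(std::shared_ptr<DictBlock> dict) {
>     attached_.push_back(std::move(dict));
>   }
> 
>   // Reset drops the references; the last holder frees the dictionary.
>   // Safe because indirect data is never mutated in place.
>   void Reset() { attached_.clear(); }
> 
>  private:
>   std::vector<std::shared_ptr<DictBlock>> attached_;
> };
> {code}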



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3178) An option to terminate connections which have been open for very long time

2020-07-30 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168086#comment-17168086
 ] 

Todd Lipcon commented on KUDU-3178:
---

Could we instead just keep track of the expiration time of the authn token when 
the connection is established, and before processing any RPC call, verify that 
the time has not passed? If it has passed we can send back an authentication 
error that looks just the same as if authentication failed during the initial 
handshake, which should trigger the client's existing "fetch new token" paths, 
right?
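
A rough sketch of that per-call check (hypothetical names, not the actual Kudu 
RPC internals):

{code:cpp}
#include <cstdint>
#include <ctime>

// Captured once at connection negotiation time.
struct ConnectionAuthState {
  int64_t authn_token_expiration_unix;  // seconds since the epoch
};

// Run before dispatching each inbound RPC call. Returning false means
// "reply with the same authentication error as a failed handshake", which
// should trigger the client's existing fetch-new-token path.
bool TokenStillValid(const ConnectionAuthState& auth) {
  return std::time(nullptr) < auth.authn_token_expiration_unix;
}
{code}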

> An option to terminate connections which have been open for very long time
> --
>
> Key: KUDU-3178
> URL: https://issues.apache.org/jira/browse/KUDU-3178
> Project: Kudu
>  Issue Type: Improvement
>  Components: master, security, tserver
>Reporter: Alexey Serbin
>Priority: Major
>
> A Kudu client can open a connection to {{kudu-master}} or {{kudu-tserver}} 
> and keep that connection open indefinitely by issuing some method at least 
> once every {{\-\-rpc_default_keepalive_time_ms}} interval (e.g., call the 
> {{Ping()}} method).  This means there isn't a limit on how long a client 
> can have access to the cluster once it's authenticated, unless the 
> {{kudu-master}} and {{kudu-tserver}} processes are restarted.  When 
> fine-grained authorization is enforced, this issue is fairly benign because 
> such lingering clients are unable to call any methods that require an authz 
> token to be provided.
> It would be nice to address this by providing an option to terminate 
> connections which were established a long time ago.  Both the maximum 
> connection lifetime and whether to terminate over-the-TTL connections 
> should be controlled by flags.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3162) Failed to compile kudu using openssl 1.1.1

2020-07-20 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161572#comment-17161572
 ] 

Todd Lipcon commented on KUDU-3162:
---

I think you want to set the OPENSSL_FOO_LIBRARY variables to point to 
libcrypto.so and libssl.so, not to the lib directory.
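
For example (using the reporter's paths; the exact shared-library file names 
are an assumption):

{code}
set(OPENSSL_SSL_LIBRARY
    "/home/xudonghui/kudu/native-toolchain/build/openssl-1.1.1/lib/libssl.so")
set(OPENSSL_CRYPTO_LIBRARY
    "/home/xudonghui/kudu/native-toolchain/build/openssl-1.1.1/lib/libcrypto.so")
{code}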

> Failed to compile kudu using openssl 1.1.1
> --
>
> Key: KUDU-3162
> URL: https://issues.apache.org/jira/browse/KUDU-3162
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.9.0
> Environment: CentOS 7.7.1908
>Reporter: Donghui Xu
>Priority: Minor
>
> When I downloaded the toolchain and the corresponding kudu code (hash is 
> 84086fe), the kudu compilation was successful. However, if the OpenSSL 
> version that kudu depends on is changed to 1.1.1 in CMakeLists.txt as follows:
> set(OPENSSL_ROOT_DIR 
> "/home/xudonghui/kudu/native-toolchain/build/openssl-1.1.1")
> set(OPENSSL_INCLUDE_DIR 
> "/home/xudonghui/kudu/native-toolchain/build/openssl-1.1.1/include")
> set(OPENSSL_SSL_LIBRARY 
> "/home/xudonghui/kudu/native-toolchain/build/openssl-1.1.1/lib")
> set(OPENSSL_CRYPTO_LIBRARY 
> "/home/xudonghui/kudu/native-toolchain/build/openssl-1.1.1/lib")
> find_package(OpenSSL 1.1.1 REQUIRED)
> The compilation error is as follows:
> /home/xudonghui/kudu/native-toolchain/build/openssl-1.1.1/lib: file not 
> recognized: Is a directory
> collect2: error: ld returned 1 exit status
> make[2]: *** [lib/exported/libkudu_client.so.0.1.0] Error 1
> make[1]: *** [src/kudu/client/CMakeFiles/kudu_client_exported.dir/all] Error 2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KUDU-3162) Failed to compile kudu using openssl 1.1.1

2020-07-20 Thread Todd Lipcon (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-3162.
---
Fix Version/s: n/a
   Resolution: Not A Problem

> Failed to compile kudu using openssl 1.1.1
> --
>
> Key: KUDU-3162
> URL: https://issues.apache.org/jira/browse/KUDU-3162
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.9.0
> Environment: CentOS 7.7.1908
>Reporter: Donghui Xu
>Priority: Minor
> Fix For: n/a
>
>
> When I downloaded the toolchain and the corresponding kudu code (hash is 
> 84086fe), the kudu compilation was successful. However, if the OpenSSL 
> version that kudu depends on is changed to 1.1.1 in CMakeLists.txt as follows:
> set(OPENSSL_ROOT_DIR 
> "/home/xudonghui/kudu/native-toolchain/build/openssl-1.1.1")
> set(OPENSSL_INCLUDE_DIR 
> "/home/xudonghui/kudu/native-toolchain/build/openssl-1.1.1/include")
> set(OPENSSL_SSL_LIBRARY 
> "/home/xudonghui/kudu/native-toolchain/build/openssl-1.1.1/lib")
> set(OPENSSL_CRYPTO_LIBRARY 
> "/home/xudonghui/kudu/native-toolchain/build/openssl-1.1.1/lib")
> find_package(OpenSSL 1.1.1 REQUIRED)
> The compilation error is as follows:
> /home/xudonghui/kudu/native-toolchain/build/openssl-1.1.1/lib: file not 
> recognized: Is a directory
> collect2: error: ld returned 1 exit status
> make[2]: *** [lib/exported/libkudu_client.so.0.1.0] Error 1
> make[1]: *** [src/kudu/client/CMakeFiles/kudu_client_exported.dir/all] Error 2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3158) Document recommendation to use a dedicated SSD for the WAL

2020-06-25 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145902#comment-17145902
 ] 

Todd Lipcon commented on KUDU-3158:
---

Do we have any substantial data showing that this is really a strong 
recommendation? I've always been hesitant to recommend it, lest people think 
it's a "requirement" and then avoid using Kudu because SSDs aren't available on 
their nodes. I think the vast majority of production clusters do _not_ use SSDs 
and are still successful, so we should make sure to outline what specific 
scenarios really need/benefit from SSDs.

> Document recommendation to use a dedicated SSD for the WAL
> --
>
> Key: KUDU-3158
> URL: https://issues.apache.org/jira/browse/KUDU-3158
> Project: Kudu
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Grant Henke
>Priority: Major
>
> It is a common deployment best practice to place the WAL on its own SSD to 
> maximize ingest throughput in a Kudu cluster. However, we don't clearly 
> document that recommendation in the places users would commonly look for it. 
> It is mentioned somewhat in the FAQ:
> https://kudu.apache.org/faq.html#how-does-kudu-store-its-data-is-the-underlying-data-storage-readable-without-going-through-kudu
> But it should probably be mentioned in these places:
> https://kudu.apache.org/docs/installation.html#prerequisites_and_requirements
> https://kudu.apache.org/docs/configuration.html#directory_configuration
> https://kudu.apache.org/docs/scaling_guide.html
> Alternatively a hardware/deployment guide might be useful, but that is a 
> larger undertaking.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3149) Lock contention between registering ops and computing maintenance op stats

2020-06-18 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139506#comment-17139506
 ] 

Todd Lipcon commented on KUDU-3149:
---

In your thread dump it looks like Thread 4 is waiting on another lock. Who's 
holding that lock?

> Lock contention between registering ops and computing maintenance op stats
> --
>
> Key: KUDU-3149
> URL: https://issues.apache.org/jira/browse/KUDU-3149
> Project: Kudu
>  Issue Type: Bug
>  Components: perf, tserver
>Reporter: Andrew Wong
>Priority: Major
>
> We saw a bunch of tablets bootstrapping extremely slowly, and many stuck 
> supposedly bootstrapping, but not showing up in the {{/tablets}} page, i.e. 
> we could only see INITIALIZED and RUNNING tablets, no BOOTSTRAPPING.
> Upon digging into the stacks, we saw a bunch waiting in:
> {code}
> TID 46583(tablet-open [wo):
> @ 0x7f1dd57147e0  (unknown)
> @ 0x7f1dd5713332  (unknown)
> @ 0x7f1dd570e5d8  (unknown)
> @ 0x7f1dd570e4a7  (unknown)
> @  0x23b4058  kudu::Mutex::Acquire()
> @  0x23980ff  kudu::MaintenanceManager::RegisterOp()
> @   0xb85374  
> kudu::tablet::TabletReplica::RegisterMaintenanceOps()
> @   0xa0055b  kudu::tserver::TSTabletManager::OpenTablet()
> @  0x23f994c  kudu::ThreadPool::DispatchThread()
> @  0x23f3f8b  kudu::Thread::SuperviseThread()
> @ 0x7f1dd570caa1  (unknown)
> @ 0x7f1dd3b18bcd  (unknown)
> {code}
> and upon further inspection, the lock being held is taken by the MM scheduler 
> thread here:
> {code}
> Thread 4 (Thread 0x7f1d7d358700 (LWP 46999)):
> #0  0x7f1dd5713334 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x7f1dd570e5d8 in _L_lock_854 () from /lib64/libpthread.so.0
> #2  0x7f1dd570e4a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x00b51f29 in 
> kudu::tablet::Tablet::UpdateCompactionStats(kudu::MaintenanceOpStats*) ()
> #4  0x00b7f435 in 
> kudu::tablet::CompactRowSetsOp::UpdateStats(kudu::MaintenanceOpStats*) ()
> #5  0x023956e4 in kudu::MaintenanceManager::FindBestOp() ()
> #6  0x02396af9 in 
> kudu::MaintenanceManager::FindAndLaunchOp(std::unique_lock*) ()
> #7  0x02397858 in kudu::MaintenanceManager::RunSchedulerThread() ()
> {code}
> It seems like we're holding the maintenance manager's {{lock_}} member for 
> the duration of computing stats, which contends with the registration of 
> maintenance manager ops. The scheduler thread is thus effectively blocking 
> the registration of many tablet replicas' ops, and blocking further 
> bootstrapping.
> A couple of things come to mind:
> - We could probably take a snapshot of the ops under lock and release the 
> lock_ when finding the best op to run (see the sketch below).
> - Additionally, we may want to consider disabling compactions entirely until 
> the initial set of tablets finishes bootstrapping.
> It's worth noting that it isn't the act of compacting that is contending 
> here, but rather the computation of the stats.
> As a workaround, we used the {{set_flag}} tool to disable compactions on the 
> node and noted significantly faster bootstrapping.
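> A minimal sketch of the snapshot-under-lock idea (hypothetical names, not 
> the actual MaintenanceManager code):
> {code:cpp}
> #include <mutex>
> #include <vector>
> 
> struct MaintenanceOp { /* stats, scores, ... */ };
> 
> class MaintenanceManager {
>  public:
>   MaintenanceOp* FindBestOp() {
>     std::vector<MaintenanceOp*> snapshot;
>     {
>       std::lock_guard<std::mutex> l(lock_);  // held only for the copy
>       snapshot = ops_;
>     }
>     // The expensive stat computation now happens outside the lock, so
>     // RegisterOp() callers (and thus tablet bootstrap) aren't blocked.
>     // Real code would also need to pin ops against concurrent
>     // unregistration while the snapshot is in use.
>     MaintenanceOp* best = nullptr;
>     for (MaintenanceOp* op : snapshot) {
>       // UpdateStats(op); compare scores; track the best op...
>     }
>     return best;
>   }
> 
>  private:
>   std::mutex lock_;
>   std::vector<MaintenanceOp*> ops_;
> };
> {code}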



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3131) test rw_mutex-test hangs sometimes if build_type is release

2020-06-03 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125186#comment-17125186
 ] 

Todd Lipcon commented on KUDU-3131:
---

I converted this to a Bug instead of a sub-task since it doesn't seem 
aarch64-related.

> test rw_mutex-test hangs sometimes if build_type is release
> ---
>
> Key: KUDU-3131
> URL: https://issues.apache.org/jira/browse/KUDU-3131
> Project: Kudu
>  Issue Type: Bug
>Reporter: huangtianhua
>Priority: Major
>
> Built and tested Kudu on aarch64; in release mode there is a test that 
> hangs sometimes (maybe a deadlock?). The console output is as follows:
> [==========] Running 2 tests from 1 test case.
> [----------] Global test environment set-up.
> [----------] 2 tests from Priorities/RWMutexTest
> [ RUN      ] Priorities/RWMutexTest.TestDeadlocks/0
> And it seems OK in debug mode.
> Now only this one test fails sometimes on aarch64. [~aserbin] [~adar], 
> would you please have a look at this, or give us some suggestions? Thanks 
> very much.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3131) test rw_mutex-test hangs sometimes if build_type is release

2020-06-03 Thread Todd Lipcon (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-3131:
--
Parent: (was: KUDU-3007)
Issue Type: Bug  (was: Sub-task)

> test rw_mutex-test hangs sometimes if build_type is release
> ---
>
> Key: KUDU-3131
> URL: https://issues.apache.org/jira/browse/KUDU-3131
> Project: Kudu
>  Issue Type: Bug
>Reporter: huangtianhua
>Priority: Major
>
> Built and tested Kudu on aarch64; in release mode there is a test that 
> hangs sometimes (maybe a deadlock?). The console output is as follows:
> [==========] Running 2 tests from 1 test case.
> [----------] Global test environment set-up.
> [----------] 2 tests from Priorities/RWMutexTest
> [ RUN      ] Priorities/RWMutexTest.TestDeadlocks/0
> And it seems OK in debug mode.
> Now only this one test fails sometimes on aarch64. [~aserbin] [~adar], 
> would you please have a look at this, or give us some suggestions? Thanks 
> very much.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3131) test rw_mutex-test hangs sometimes if build_type is release

2020-06-03 Thread Todd Lipcon (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-3131:
--
Component/s: (was: test)

> test rw_mutex-test hangs sometimes if build_type is release
> ---
>
> Key: KUDU-3131
> URL: https://issues.apache.org/jira/browse/KUDU-3131
> Project: Kudu
>  Issue Type: Sub-task
>Reporter: huangtianhua
>Priority: Major
>
> Built and tested Kudu on aarch64; in release mode there is a test that 
> hangs sometimes (maybe a deadlock?). The console output is as follows:
> [==========] Running 2 tests from 1 test case.
> [----------] Global test environment set-up.
> [----------] 2 tests from Priorities/RWMutexTest
> [ RUN      ] Priorities/RWMutexTest.TestDeadlocks/0
> And it seems OK in debug mode.
> Now only this one test fails sometimes on aarch64. [~aserbin] [~adar], 
> would you please have a look at this, or give us some suggestions? Thanks 
> very much.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3131) test rw_mutex-test hangs sometimes if build_type is release

2020-06-03 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125185#comment-17125185
 ] 

Todd Lipcon commented on KUDU-3131:
---

I don't think this should be classified as a test issue -- the test is finding 
a real bug that could cause deadlocks on a cluster.

> test rw_mutex-test hangs sometimes if build_type is release
> ---
>
> Key: KUDU-3131
> URL: https://issues.apache.org/jira/browse/KUDU-3131
> Project: Kudu
>  Issue Type: Sub-task
>Reporter: huangtianhua
>Priority: Major
>
> Built and tested Kudu on aarch64; in release mode there is a test that 
> hangs sometimes (maybe a deadlock?). The console output is as follows:
> [==========] Running 2 tests from 1 test case.
> [----------] Global test environment set-up.
> [----------] 2 tests from Priorities/RWMutexTest
> [ RUN      ] Priorities/RWMutexTest.TestDeadlocks/0
> And it seems OK in debug mode.
> Now only this one test fails sometimes on aarch64. [~aserbin] [~adar], 
> would you please have a look at this, or give us some suggestions? Thanks 
> very much.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KUDU-2077) Return data in Apache Arrow format

2020-06-02 Thread Todd Lipcon (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned KUDU-2077:
-

Assignee: (was: Todd Lipcon)

> Return data in Apache Arrow format
> --
>
> Key: KUDU-2077
> URL: https://issues.apache.org/jira/browse/KUDU-2077
> Project: Kudu
>  Issue Type: New Feature
>  Components: client, server
>Reporter: Andrew Wong
>Priority: Major
>  Labels: roadmap-candidate
> Fix For: 1.12.0
>
>
> Dan and I spent the hackathon tinkering with the Apache Arrow format. Arrow 
> is an in-memory columnar format designed to be the common data format for a 
> large number of projects, see [here|https://arrow.apache.org]. One place we 
> thought adding this would be particularly fitting is when sending results 
> back to the client, since this currently returns row-wise data. By returning 
> Arrow, this could open the door to simpler and faster integration with other 
> projects.
> The server-side changes can be localized to the tablet service and wire 
> protocol. We considered using Arrow more exhaustively throughout the server 
> codebase, but found that because Arrow and Kudu's own in-memory format (i.e. 
> that in kudu::ColumnBlock) are so similar, a simpler approach is to copy the 
> buffers from ColumnBlock to the scan response and build arrow::Arrays 
> client-side. A POC of the server-side changes can be found here: 
> https://github.com/danburkert/kudu/tree/arrow
> At the time of writing this, the arrow::Array type has a varying number of 
> arrow::Buffers, depending on the data type (e.g. one for null bitmaps, one 
> for data, etc). The ColumnBlock "buffers" (i.e. data, null_bitmap) should be 
> compatible with these Buffers with a couple of modifications:
> * The null-bitmaps in arrow are the complement of those used by Kudu
> * The RowBlock that owns the ColumnBlocks has a selection vector that needs 
> to be accounted for
> If the buffers are transferred over the wire (via sidecars or protobuf), they 
> should be able to be converted to Arrays via arrow::ArrayData or directly via 
> the Array constructors.
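> A minimal client-side sketch of that conversion for an int32 column 
> (function and variable names are assumptions, and the LSB bit order of 
> Kudu's null bitmap is assumed; this is not the POC's actual code):
> {code:cpp}
> #include <cstdint>
> #include <cstring>
> #include <memory>
> 
> #include "arrow/api.h"
> 
> // Build an Int32Array from a Kudu-style data buffer and null bitmap.
> // Kudu sets a bit for NULL cells while Arrow sets a bit for valid cells,
> // so the bitmap is complemented while copying.
> std::shared_ptr<arrow::Array> ToArrowInt32(
>     const std::shared_ptr<arrow::Buffer>& data,
>     const std::shared_ptr<arrow::Buffer>& kudu_null_bitmap,
>     int64_t num_rows) {
>   const int64_t nbytes = (num_rows + 7) / 8;
>   std::shared_ptr<arrow::Buffer> validity =
>       arrow::AllocateBuffer(nbytes).ValueOrDie();
>   uint8_t* out = validity->mutable_data();
>   std::memset(out, 0, nbytes);
>   for (int64_t i = 0; i < num_rows; i++) {
>     const bool is_null = (kudu_null_bitmap->data()[i / 8] >> (i % 8)) & 1;
>     if (!is_null) out[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
>   }
>   return arrow::MakeArray(
>       arrow::ArrayData::Make(arrow::int32(), num_rows, {validity, data}));
> }
> {code}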



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (KUDU-2077) Return data in Apache Arrow format

2020-06-02 Thread Todd Lipcon (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reopened KUDU-2077:
---

I wouldn't classify this as fixed quite yet. The referenced patches allow a 
caller to get data in a columnar format, but it doesn't correspond to the 
Arrow spec until the code in https://gerrit.cloudera.org/#/c/15661/ is added.

> Return data in Apache Arrow format
> --
>
> Key: KUDU-2077
> URL: https://issues.apache.org/jira/browse/KUDU-2077
> Project: Kudu
>  Issue Type: New Feature
>  Components: client, server
>Reporter: Andrew Wong
>Assignee: Todd Lipcon
>Priority: Major
>  Labels: roadmap-candidate
> Fix For: 1.12.0
>
>
> Dan and I spent the hackathon tinkering with the Apache Arrow format. Arrow 
> is an in-memory columnar format designed to be the common data format for a 
> large number of projects, see [here|https://arrow.apache.org]. One place we 
> thought adding this would be particularly fitting is when sending results 
> back to the client, since this currently returns row-wise data. By returning 
> Arrow, this could open the door to simpler and faster integration with other 
> projects.
> The server-side changes can be localized to the tablet service and wire 
> protocol. We considered using Arrow more exhaustively throughout the server 
> codebase, but found that because Arrow and Kudu's own in-memory format (i.e. 
> that in kudu::ColumnBlock) are so similar, a simpler approach is to copy the 
> buffers from ColumnBlock to the scan response and build arrow::Arrays 
> client-side. A POC of the server-side changes can be found here: 
> https://github.com/danburkert/kudu/tree/arrow
> At the time of writing this, the arrow::Array type has a varying number of 
> arrow::Buffers, depending on the data type (e.g. one for null bitmaps, one 
> for data, etc). The ColumnBlock "buffers" (i.e. data, null_bitmap) should be 
> compatible with these Buffers with a couple of modifications:
> * The null-bitmaps in arrow are the complement of those used by Kudu
> * The RowBlock that owns the ColumnBlocks has a selection vector that needs 
> to be accounted for
> If the buffers are transferred over the wire (via sidecars or protobuf), they 
> should be able to be converted to Arrays via arrow::ArrayData or directly via 
> the Array constructors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-1457) IPv6 support in kudu

2020-06-01 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17121155#comment-17121155
 ] 

Todd Lipcon commented on KUDU-1457:
---

It looks like [~helifu] is working on this at 
https://gerrit.cloudera.org/#/c/15996/ 

> IPv6 support in kudu
> 
>
> Key: KUDU-1457
> URL: https://issues.apache.org/jira/browse/KUDU-1457
> Project: Kudu
>  Issue Type: New Feature
>  Components: rpc, util
>Reporter: Manukranth Kolloju
>Assignee: Ritwik Yadav
>Priority: Major
>  Labels: roadmap-candidate
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> I worked on a patch to make kudu work in an IPv6 setup. It needed some 
> breaking changes. Before spending more time making it backwards compatible, 
> I wanted to get some feedback about whether something like this is 
> interesting to people here: https://github.com/cloudera/kudu/pull/6. 
> The tests etc. all pass, except for client_sample_test, which relies on the 
> 127.x.x.x space as opposed to ::1 in IPv6. 
> If we agree upon a uniform way to configure IPv6, I can get a patch up for 
> review. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3131) test rw_mutex-test hangs sometimes if build_type is release

2020-06-01 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17121125#comment-17121125
 ] 

Todd Lipcon commented on KUDU-3131:
---

I can confirm I'm able to reproduce this on Ubuntu 18.04 x86 as well 
(deadlocked after 17 iterations of rw_mutex-test).

Perhaps we should consider our own rwlock implementation, or at least file a 
backport request on the upstream Ubuntu bug tracker (there seems to be 
discussion in the above linked ticket).

> test rw_mutex-test hangs sometimes if build_type is release
> ---
>
> Key: KUDU-3131
> URL: https://issues.apache.org/jira/browse/KUDU-3131
> Project: Kudu
>  Issue Type: Sub-task
>Reporter: huangtianhua
>Priority: Major
>
> Built and tested Kudu on aarch64; in release mode there is a test that 
> hangs sometimes (maybe a deadlock?). The console output is as follows:
> [==========] Running 2 tests from 1 test case.
> [----------] Global test environment set-up.
> [----------] 2 tests from Priorities/RWMutexTest
> [ RUN      ] Priorities/RWMutexTest.TestDeadlocks/0
> And it seems OK in debug mode.
> Now only this one test fails sometimes on aarch64. [~aserbin] [~adar], 
> would you please have a look at this, or give us some suggestions? Thanks 
> very much.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3136) Add a script to check for system dependencies

2020-06-01 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17121121#comment-17121121
 ] 

Todd Lipcon commented on KUDU-3136:
---

There is some kind of preflight.py I did for build-thirdparty a long time ago. 
Maybe it needs updating.

> Add a script to check for system dependencies
> -
>
> Key: KUDU-3136
> URL: https://issues.apache.org/jira/browse/KUDU-3136
> Project: Kudu
>  Issue Type: Improvement
>Reporter: Grant Henke
>Priority: Major
>
> As a way to simplify the setup of development environments, we should add a 
> script that validates the base environment expectations (packages, configs, 
> etc.) and gives clear error messages. This should help identify problems up 
> front and prevent unusual development troubleshooting. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-1386) NaN float and double values are not handled correctly

2020-06-01 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17121120#comment-17121120
 ] 

Todd Lipcon commented on KUDU-1386:
---

Looks like it was abandoned; it's not clear whether it ever got done.

> NaN float and double values are not handled correctly
> -
>
> Key: KUDU-1386
> URL: https://issues.apache.org/jira/browse/KUDU-1386
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tserver
>Reporter: Dan Burkert
>Assignee: William Berkeley
>Priority: Minor
>  Labels: roadmap-candidate
>
> {{TypeInfo::Compare}} for {{float}} and {{double}} always returns 0 when 
> one of the arguments is a {{NaN}} value.  This results in equality 
> predicates never filtering {{NaN}} values.  {{TypeInfo::Compare}} should be 
> changed so that it doesn't assume that the data type is totally ordered.
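> A sketch of a NaN-aware comparison (hypothetical code, not Kudu's actual 
> TypeInfo implementation) that imposes a total order: NaN compares equal to 
> NaN and greater than every other value, so predicates can filter it.
> {code:cpp}
> #include <cmath>
> 
> int CompareFloat(float a, float b) {
>   const bool a_nan = std::isnan(a);
>   const bool b_nan = std::isnan(b);
>   if (a_nan || b_nan) {
>     // false sorts before true: non-NaN < NaN, and NaN == NaN.
>     return static_cast<int>(a_nan) - static_cast<int>(b_nan);
>   }
>   if (a < b) return -1;
>   if (a > b) return 1;
>   return 0;
> }
> {code}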



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-1954) Improve maintenance manager behavior in heavy write workload

2020-05-21 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113352#comment-17113352
 ] 

Todd Lipcon commented on KUDU-1954:
---

The incremental compaction design in Kudu ensures that any given compaction 
only reads <128MB of data (given the default budget configuration, which I 
wouldn't recommend changing). Do you have the logs showing a compaction that 
takes significantly longer than 10-20 seconds? Maybe we need to optimize some 
code paths there if you're seeing any lengthy compactions.

> Improve maintenance manager behavior in heavy write workload
> 
>
> Key: KUDU-1954
> URL: https://issues.apache.org/jira/browse/KUDU-1954
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf, tserver
>Affects Versions: 1.3.0
>Reporter: Todd Lipcon
>Priority: Major
> Attachments: mm-trace.png
>
>
> During the investigation in [this 
> doc|https://docs.google.com/document/d/1U1IXS1XD2erZyq8_qG81A1gZaCeHcq2i0unea_eEf5c/edit]
>  I found a few maintenance-manager-related issues during heavy writes:
> - we don't schedule flushes until we are already in "backpressure" realm, so 
> we spent most of our time doing backpressure
> - even if we configure N maintenance threads, we typically are only using 
> ~50% of those threads due to the scheduling granularity
> - when we do hit the "memory-pressure flush" threshold, all threads quickly 
> switch to flushing, which then brings us far beneath the threshold
> - long running compactions can temporarily starve flushes
> - high volume of writes can starve compactions



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KUDU-2432) isolate race creating directory via dist_test.py

2020-05-14 Thread Todd Lipcon (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2432.
---
Fix Version/s: n/a
 Assignee: Todd Lipcon
   Resolution: Fixed

I believe so.

> isolate race creating directory via dist_test.py
> 
>
> Key: KUDU-2432
> URL: https://issues.apache.org/jira/browse/KUDU-2432
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Reporter: Mike Percy
>Assignee: Todd Lipcon
>Priority: Major
> Fix For: n/a
>
> Attachments: logs.txt
>
>
> When running dist_test.py I have been getting a 1% failure rate due to the 
> following errors.
> I am not sure if this is new or related to a single bad machine.
> {code:java}
> failed to download task files: WARNING 123 isolateserver(1484): Adding 
> unknown file 7cf0792d18a9dbef867c9bce0c681b3def0510b6 to cache
> WARNING 126 isolateserver(1490): Added back 1 unknown files
> INFO 135 tools(106): Profiling: Section Setup took 0.045 seconds
> INFO 164 tools(106): Profiling: Section GetIsolateds took 0.029 seconds
> INFO 167 tools(106): Profiling: Section GetRest took 0.003 seconds
> INFO 175 isolateserver(1365): 1 ( 227022kb) added
> INFO 176 isolateserver(1369): 1642 ( 3864634kb) current
> INFO 176 isolateserver(1372): 0 ( 0kb) removed
> INFO 176 isolateserver(1375): 45627408kb free
> INFO 176 tools(106): Profiling: Section CleanupTrimming took 0.009 seconds
> INFO 177 isolateserver(1365): 1 ( 227022kb) added
> INFO 177 isolateserver(1369): 1642 ( 3864634kb) current
> INFO 177 isolateserver(1372): 0 ( 0kb) removed
> INFO 177 isolateserver(1375): 45627408kb free
> INFO 178 tools(106): Profiling: Section CleanupTrimming took 0.001 seconds
> INFO 178 isolateserver(381): Waiting for all threads to die...
> INFO 178 isolateserver(390): Done.
> Traceback (most recent call last):
> File "/swarming.client/isolateserver.py", line 2211, in 
> sys.exit(main(sys.argv[1:]))
> File "/swarming.client/isolateserver.py", line 2204, in main
> return dispatcher.execute(OptionParserIsolateServer(), args)
> File "/swarming.client/third_party/depot_tools/subcommand.py", line 242, in 
> execute
> return command(parser, args[1:])
> File "/swarming.client/isolateserver.py", line 2064, in CMDdownload
> require_command=False)
> File "/swarming.client/isolateserver.py", line 1827, in fetch_isolated
> create_directories(outdir, bundle.files)
> File "/swarming.client/isolateserver.py", line 212, in create_directories
> os.mkdir(os.path.join(base_directory, d))
> OSError: [Errno 17] File exists: '/tmp/dist-test-task_gm4pM/build'
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KUDU-2844) Avoid copying strings from dictionary or plain-encoded blocks

2020-04-23 Thread Todd Lipcon (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned KUDU-2844:
-

Assignee: Todd Lipcon

> Avoid copying strings from dictionary or plain-encoded blocks
> -
>
> Key: KUDU-2844
> URL: https://issues.apache.org/jira/browse/KUDU-2844
> Project: Kudu
>  Issue Type: Improvement
>  Components: cfile, perf
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Major
> Attachments: fg.svg
>
>
> When scanning a plain or dictionary-encoded binary column, we currently loop 
> over each entry and copy the string into the destination RowBlock's arena. In 
> TPCH Q1, the scanner threads use a significant percentage of CPU doing this 
> copying, and it also increases CPU cache footprint which likely decreases 
> performance in downstream operations like predicate evaluation, merging, 
> result serialization, etc.
> Instead of doing this, we could "attach" the dictionary block (with 
> ref-counting) to the RowBlock and refer directly to the dictionary entry from 
> the RowBlock. When the RowBlock eventually is reset, we can drop the 
> reference. This should be safe because we never mutate indirect data in-place.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3105) kudu_client based application reports 'Locking callback not initialized' error

2020-04-03 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074719#comment-17074719
 ] 

Todd Lipcon commented on KUDU-3105:
---

I ran into this last night while using conda on el7 to get a rather new version 
of python. The conda environment has openssl 1.1 in it, but my client was built 
outside of conda and has openssl 1.0.x from el7.

My initial attempt to fix this was to change Kudu to use dlsym to look for the 
OPENSSL_version_number() function and use that to determine runtime behavior. 
However, I hit a different related issue:
- the python client code uses 'import _ssl' to force Python to do its own 
OpenSSL initialization. In the conda environment, this linked against libssl.so 
from conda's lib directory (openssl 1.1). So, the python side inits openssl 1.1.
- the python C++ so file has a link to libssl.so.10, with the explicit version 
suffix in the file name, rather than just to 'libssl.so'. So, it still links to 
libssl _outside_ the environment, gets a not-initialized SSL, and can't make an 
SSL context.

Not sure the right fix here... seems like we could either get the kuduclient.so 
to link against libssl.so instead of libssl.so.10, or we could be a little more 
"fast and loose" about trying to auto-detect whether SSL is initialized.

> kudu_client based application reports 'Locking callback not initialized' error
> --
>
> Key: KUDU-3105
> URL: https://issues.apache.org/jira/browse/KUDU-3105
> Project: Kudu
>  Issue Type: Bug
>  Components: client, python, security
>Affects Versions: 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1
>Reporter: Alexey Serbin
>Priority: Major
>
> When using a kudu_client library compiled against OpenSSL 1.0.x with an 
> OpenSSL 1.1.x runtime, Kudu client applications might report a 'Runtime 
> error: Locking callback not initialized' error.
> For example, {{kudu-python}} based applications on RHEL/CentOS 7.7, if using 
> {{kudu-client}} of versions 1.9, 1.10, or 1.11 in a Python environment with 
> OpenSSL 1.1.1d, might report an error like below:
> {noformat}
> Traceback (most recent call last):
>   File "kudu-python-app.py", line 22, in 
> client = kudu.connect(host=args.masters, port=args.ports)
>   File "/opt/lib/python3.6/site-packages/kudu/__init__.py", line 96, in 
> connect
> rpc_timeout_ms=rpc_timeout_ms)
>   File "kudu/client.pyx", line 297, in kudu.client.Client.__cinit__
>   File "kudu/errors.pyx", line 62, in kudu.errors.check_status
> kudu.errors.KuduBadStatus: b'Runtime error: Locking callback not initialized'
> {noformat}
> The issue is that {{libkudu_client}} code compiled against OpenSSL 1.0.x 
> uses an initialization code path specific to OpenSSL 1.0.x, whose 
> post-condition requires thread-safety locking callbacks to be installed 
> once the initialization is done.  However, those functions do not install 
> the expected locking callbacks in OpenSSL 1.1.x, since OpenSSL takes a 
> different approach to locking callbacks as of version 1.1.0: the callbacks 
> are no longer required because the multi-threading model was revamped in 
> the newer versions of the library.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3102) tabletserver coredump in jsonwriter

2020-04-01 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17073133#comment-17073133
 ] 

Todd Lipcon commented on KUDU-3102:
---

Or, likely, an earlier bit of code had an out-of-bounds write which corrupted 
tcmalloc state. Does this repro frequently?

> tabletserver coredump in jsonwriter
> ---
>
> Key: KUDU-3102
> URL: https://issues.apache.org/jira/browse/KUDU-3102
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.10.1
>Reporter: Yingchun Lai
>Priority: Major
>
> A tserver coredump happened; the backtrace is as follows:
> {code:java}
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> Missing separate debuginfo for 
> /home/work/app/kudu/c3tst-dev/master/package/libcrypto.so.10
> Try: yum --enablerepo='*debug*' install 
> /usr/lib/debug/.build-id/35/93fa778645a59ea272dbbb59d318c60940e792.debug
> Core was generated by 
> `/home/work/app/kudu/c3tst-dev/master/package/kudu_master 
> -default_num_replicas='.
> Program terminated with signal 11, Segmentation fault.
> #0  GetStackTrace_x86 (result=0x7fbf7232fa00, max_depth=31, skip_count=0) at 
> /home/laiyingchun/kudu_xm/thirdparty/src/gperftools-2.6.90/src/stacktrace_x86-inl.h:328
> 328
> /home/laiyingchun/kudu_xm/thirdparty/src/gperftools-2.6.90/src/stacktrace_x86-inl.h:
>  No such file or directory.
> Missing separate debuginfos, use: debuginfo-install 
> cyrus-sasl-gssapi-2.1.26-20.el7_2.x86_64 
> cyrus-sasl-lib-2.1.26-20.el7_2.x86_64 cyrus-sasl-md5-2.1.26-20.el7_2.x86_64 
> cyrus-sasl-plain-2.1.26-20.el7_2.x86_64 glibc-2.17-157.el7_3.1.x86_64 
> keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 
> libcom_err-1.42.9-9.el7.x86_64 libdb-5.3.21-19.el7.x86_64 
> libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-6.el7.x86_64 
> ncurses-libs-5.9-13.20130511.el7.x86_64 
> nss-softokn-freebl-3.16.2.3-14.4.el7.x86_64 
> openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 
> zlib-1.2.7-17.el7.x86_64
> (gdb) bt
> #0  GetStackTrace_x86 (result=0x7fbf7232fa00, max_depth=31, skip_count=0) at 
> /home/laiyingchun/kudu_xm/thirdparty/src/gperftools-2.6.90/src/stacktrace_x86-inl.h:328
> #1  0x00b9992b in GetStackTrace (result=result@entry=0x7fbf7232fa00, 
> max_depth=max_depth@entry=31, skip_count=skip_count@entry=1) at 
> /home/laiyingchun/kudu_xm/thirdparty/src/gperftools-2.6.90/src/stacktrace.cc:295
> #2  0x00b8c14d in DoSampledAllocation (size=size@entry=16385) at 
> /home/laiyingchun/kudu_xm/thirdparty/src/gperftools-2.6.90/src/tcmalloc.cc:1169
> #3  0x0289f151 in do_malloc (size=16385) at 
> /home/laiyingchun/kudu_xm/thirdparty/src/gperftools-2.6.90/src/tcmalloc.cc:1361
> #4  do_allocate_full (size=16385) at 
> /home/laiyingchun/kudu_xm/thirdparty/src/gperftools-2.6.90/src/tcmalloc.cc:1751
> #5  tcmalloc::allocate_full_cpp_throw_oom (size=16385) at 
> /home/laiyingchun/kudu_xm/thirdparty/src/gperftools-2.6.90/src/tcmalloc.cc:1765
> #6  0x0289f2a7 in dispatch_allocate_full (size=<optimized out>) at 
> /home/laiyingchun/kudu_xm/thirdparty/src/gperftools-2.6.90/src/tcmalloc.cc:1774
> #7  malloc_fast_path (size=<optimized out>) at 
> /home/laiyingchun/kudu_xm/thirdparty/src/gperftools-2.6.90/src/tcmalloc.cc:1845
> #8  tc_new (size=<optimized out>) at 
> /home/laiyingchun/kudu_xm/thirdparty/src/gperftools-2.6.90/src/tcmalloc.cc:1969
> #9  0x7fbf79c785cd in std::__cxx11::basic_string<char, 
> std::char_traits<char>, std::allocator<char> >::reserve 
> (this=this@entry=0x7fbf7232fbb0, __res=<optimized out>)
> at 
> /home/laiyingchun/gcc-7.4.0-build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc:293
> #10 0x7fbf79c6be0b in std::__cxx11::basic_stringbuf<char, 
> std::char_traits<char>, std::allocator<char> >::overflow 
> (this=0x7fbf72330668, __c=83) at 
> /home/laiyingchun/gcc-7.4.0-build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/sstream.tcc:133
> #11 0x7fbf79c76b89 in std::basic_streambuf<char, std::char_traits<char> 
> >::xsputn (this=0x7fbf72330668,
> __s=0x6929232 
> "Service_RequestConsensusVote\",\"total_count\":1,\"min\":104,\"mean\":104.0,\"percentile_75\":104,\"percentile_95\":104,\"percentile_99\":104,\"percentile_99_9\":104,\"percentile_99_99\":104,\"max\":104,\"total_sum\":104}"...,
>  __n=250)
> at 
> /home/laiyingchun/gcc-7.4.0-build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/streambuf.tcc:98
> #12 0x7fbf79c66b62 in sputn (__n=250, __s=<optimized out>, 
> this=<optimized out>) at 
> /home/laiyingchun/gcc-7.4.0-build/x86_64-pc-linux-gnu/libstdc++-v3/include/streambuf:451
> #13 _M_write (__n=250, __s=<optimized out>, this=0x7fbf72330660) at 
> /home/laiyingchun/gcc-7.4.0-build/x86_64-pc-linux-gnu/libstdc++-v3/include/ostream:313
> #14 std::ostream::write (this=0x7fbf72330660,
> __s=0x6929200 
> 

[jira] [Commented] (KUDU-2432) isolate race creating directory via dist_test.py

2020-03-19 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063083#comment-17063083
 ] 

Todd Lipcon commented on KUDU-2432:
---

I pushed a fix for this: 
https://github.com/cloudera/dist_test/pull/new/kudu-2432

Testing in prod :)

> isolate race creating directory via dist_test.py
> 
>
> Key: KUDU-2432
> URL: https://issues.apache.org/jira/browse/KUDU-2432
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Reporter: Mike Percy
>Priority: Major
> Attachments: logs.txt
>
>
> When running dist_test.py I have been getting a 1% failure rate due to the 
> following errors.
> I am not sure if this is new or related to a single bad machine.
> {code:java}
> failed to download task files: WARNING 123 isolateserver(1484): Adding 
> unknown file 7cf0792d18a9dbef867c9bce0c681b3def0510b6 to cache
> WARNING 126 isolateserver(1490): Added back 1 unknown files
> INFO 135 tools(106): Profiling: Section Setup took 0.045 seconds
> INFO 164 tools(106): Profiling: Section GetIsolateds took 0.029 seconds
> INFO 167 tools(106): Profiling: Section GetRest took 0.003 seconds
> INFO 175 isolateserver(1365): 1 ( 227022kb) added
> INFO 176 isolateserver(1369): 1642 ( 3864634kb) current
> INFO 176 isolateserver(1372): 0 ( 0kb) removed
> INFO 176 isolateserver(1375): 45627408kb free
> INFO 176 tools(106): Profiling: Section CleanupTrimming took 0.009 seconds
> INFO 177 isolateserver(1365): 1 ( 227022kb) added
> INFO 177 isolateserver(1369): 1642 ( 3864634kb) current
> INFO 177 isolateserver(1372): 0 ( 0kb) removed
> INFO 177 isolateserver(1375): 45627408kb free
> INFO 178 tools(106): Profiling: Section CleanupTrimming took 0.001 seconds
> INFO 178 isolateserver(381): Waiting for all threads to die...
> INFO 178 isolateserver(390): Done.
> Traceback (most recent call last):
> File "/swarming.client/isolateserver.py", line 2211, in 
> sys.exit(main(sys.argv[1:]))
> File "/swarming.client/isolateserver.py", line 2204, in main
> return dispatcher.execute(OptionParserIsolateServer(), args)
> File "/swarming.client/third_party/depot_tools/subcommand.py", line 242, in 
> execute
> return command(parser, args[1:])
> File "/swarming.client/isolateserver.py", line 2064, in CMDdownload
> require_command=False)
> File "/swarming.client/isolateserver.py", line 1827, in fetch_isolated
> create_directories(outdir, bundle.files)
> File "/swarming.client/isolateserver.py", line 212, in create_directories
> os.mkdir(os.path.join(base_directory, d))
> OSError: [Errno 17] File exists: '/tmp/dist-test-task_gm4pM/build'
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-2432) isolate race creating directory via dist_test.py

2020-03-19 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063080#comment-17063080
 ] 

Todd Lipcon commented on KUDU-2432:
---

I looked into this a bit tonight since it's happening a lot lately. I sshed 
into one of the slaves that had had a failure and ran 'docker logs' on the 
dist-test slave container to get the full logs, and then grabbed the portion 
corresponding to a failed job. It looks like the issue is that a first attempt 
to download the files for the task failed with a "connection reset by peer" 
error. The retries seem to fail because the directory already exists from the 
first attempt. In other words, it's not a race, just broken retry logic. Will 
look at the code next.
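
A sketch of the retry-safe behavior (a rewrite of the {{create_directories}} 
helper shown in the traceback below; illustrative, not the actual fix):

{code:python}
import errno
import os


def create_directories(base_directory, dirs):
    # Treat an already-existing directory as success, so a retry after a
    # "connection reset by peer" failure doesn't die with EEXIST.
    for d in dirs:
        try:
            os.mkdir(os.path.join(base_directory, d))
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise
{code}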

> isolate race creating directory via dist_test.py
> 
>
> Key: KUDU-2432
> URL: https://issues.apache.org/jira/browse/KUDU-2432
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Reporter: Mike Percy
>Priority: Major
> Attachments: logs.txt
>
>
> When running dist_test.py I have been getting a 1% failure rate due to the 
> following errors.
> I am not sure if this is new or related to a single bad machine.
> {code:java}
> failed to download task files: WARNING 123 isolateserver(1484): Adding 
> unknown file 7cf0792d18a9dbef867c9bce0c681b3def0510b6 to cache
> WARNING 126 isolateserver(1490): Added back 1 unknown files
> INFO 135 tools(106): Profiling: Section Setup took 0.045 seconds
> INFO 164 tools(106): Profiling: Section GetIsolateds took 0.029 seconds
> INFO 167 tools(106): Profiling: Section GetRest took 0.003 seconds
> INFO 175 isolateserver(1365): 1 ( 227022kb) added
> INFO 176 isolateserver(1369): 1642 ( 3864634kb) current
> INFO 176 isolateserver(1372): 0 ( 0kb) removed
> INFO 176 isolateserver(1375): 45627408kb free
> INFO 176 tools(106): Profiling: Section CleanupTrimming took 0.009 seconds
> INFO 177 isolateserver(1365): 1 ( 227022kb) added
> INFO 177 isolateserver(1369): 1642 ( 3864634kb) current
> INFO 177 isolateserver(1372): 0 ( 0kb) removed
> INFO 177 isolateserver(1375): 45627408kb free
> INFO 178 tools(106): Profiling: Section CleanupTrimming took 0.001 seconds
> INFO 178 isolateserver(381): Waiting for all threads to die...
> INFO 178 isolateserver(390): Done.
> Traceback (most recent call last):
> File "/swarming.client/isolateserver.py", line 2211, in 
> sys.exit(main(sys.argv[1:]))
> File "/swarming.client/isolateserver.py", line 2204, in main
> return dispatcher.execute(OptionParserIsolateServer(), args)
> File "/swarming.client/third_party/depot_tools/subcommand.py", line 242, in 
> execute
> return command(parser, args[1:])
> File "/swarming.client/isolateserver.py", line 2064, in CMDdownload
> require_command=False)
> File "/swarming.client/isolateserver.py", line 1827, in fetch_isolated
> create_directories(outdir, bundle.files)
> File "/swarming.client/isolateserver.py", line 212, in create_directories
> os.mkdir(os.path.join(base_directory, d))
> OSError: [Errno 17] File exists: '/tmp/dist-test-task_gm4pM/build'
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-2432) isolate race creating directory via dist_test.py

2020-03-19 Thread Todd Lipcon (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-2432:
--
Attachment: logs.txt

> isolate race creating directory via dist_test.py
> 
>
> Key: KUDU-2432
> URL: https://issues.apache.org/jira/browse/KUDU-2432
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Reporter: Mike Percy
>Priority: Major
> Attachments: logs.txt
>
>
> When running dist_test.py I have been getting a 1% failure rate due to the 
> following errors.
> I am not sure if this is new or related to a single bad machine.
> {code:java}
> failed to download task files: WARNING 123 isolateserver(1484): Adding 
> unknown file 7cf0792d18a9dbef867c9bce0c681b3def0510b6 to cache
> WARNING 126 isolateserver(1490): Added back 1 unknown files
> INFO 135 tools(106): Profiling: Section Setup took 0.045 seconds
> INFO 164 tools(106): Profiling: Section GetIsolateds took 0.029 seconds
> INFO 167 tools(106): Profiling: Section GetRest took 0.003 seconds
> INFO 175 isolateserver(1365): 1 ( 227022kb) added
> INFO 176 isolateserver(1369): 1642 ( 3864634kb) current
> INFO 176 isolateserver(1372): 0 ( 0kb) removed
> INFO 176 isolateserver(1375): 45627408kb free
> INFO 176 tools(106): Profiling: Section CleanupTrimming took 0.009 seconds
> INFO 177 isolateserver(1365): 1 ( 227022kb) added
> INFO 177 isolateserver(1369): 1642 ( 3864634kb) current
> INFO 177 isolateserver(1372): 0 ( 0kb) removed
> INFO 177 isolateserver(1375): 45627408kb free
> INFO 178 tools(106): Profiling: Section CleanupTrimming took 0.001 seconds
> INFO 178 isolateserver(381): Waiting for all threads to die...
> INFO 178 isolateserver(390): Done.
> Traceback (most recent call last):
> File "/swarming.client/isolateserver.py", line 2211, in 
> sys.exit(main(sys.argv[1:]))
> File "/swarming.client/isolateserver.py", line 2204, in main
> return dispatcher.execute(OptionParserIsolateServer(), args)
> File "/swarming.client/third_party/depot_tools/subcommand.py", line 242, in 
> execute
> return command(parser, args[1:])
> File "/swarming.client/isolateserver.py", line 2064, in CMDdownload
> require_command=False)
> File "/swarming.client/isolateserver.py", line 1827, in fetch_isolated
> create_directories(outdir, bundle.files)
> File "/swarming.client/isolateserver.py", line 212, in create_directories
> os.mkdir(os.path.join(base_directory, d))
> OSError: [Errno 17] File exists: '/tmp/dist-test-task_gm4pM/build'
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3036) RPC size multiplication for DDL operations might hit maximum RPC size limit

2020-01-08 Thread Todd Lipcon (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-3036:
--
Issue Type: Bug  (was: Improvement)

> RPC size multiplication for DDL operations might hit maximum RPC size limit
> ---
>
> Key: KUDU-3036
> URL: https://issues.apache.org/jira/browse/KUDU-3036
> Project: Kudu
>  Issue Type: Bug
>  Components: master, rpc
>Reporter: Alexey Serbin
>Priority: Major
>  Labels: operability, scalability
>
> When a table uses multi-tier partitioning scheme, with large number of 
> partitions created, an {{AlterTable}} request that affects many 
> partitions/tablets turns into a much larger {{UpdateConsensus}} RPC when 
> leader master pushes the corresponding update on the system tablet to 
> follower masters.
> I did some testing for this use case.  With {{AlterTable}} RPC adding new 
> range partitions, I observed the following:
> * With range x 2 hash partitions, an incoming {{AlterTable}} RPC request of 
> 37070 bytes produced a corresponding {{UpdateConsensus}} request of 274278 
> bytes (~ 7x multiplication factor).
> * With range x 10 hash partitions, the same 37070-byte {{AlterTable}} RPC 
> request produced an {{UpdateConsensus}} request of 1365438 bytes (~ 36x 
> multiplication factor) when the leader master pushed the updates on the 
> system tablet to followers.
> With that, it's easy to hit the limit on the maximum RPC size (controlled via 
> the {{\-\-rpc_max_message_size}} flag) in case of larger Kudu clusters.  If 
> that happens, Kudu masters start continuous leader re-election cycle since 
> follower masters don't receive any Raft heartbeats from their leader: the 
> heartbeats are rejected at the lower RPC layer due to the maximum RPC size 
> limit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3032) key columns unnecessarily selected when predicate is converted to range

2019-12-20 Thread Todd Lipcon (Jira)
Todd Lipcon created KUDU-3032:
-

 Summary: key columns unnecessarily selected when predicate is 
converted to range
 Key: KUDU-3032
 URL: https://issues.apache.org/jira/browse/KUDU-3032
 Project: Kudu
  Issue Type: Bug
  Components: perf
Affects Versions: 1.11.1
Reporter: Todd Lipcon


When a predicate applies to leading primary key columns, the tablet service 
optimizes it into a range scan and removes those predicates from the ScanSpec. 
However, the current behavior (seemingly going back to when this was 
implemented) does not actually prevent those key columns from being read from 
disk. This has a negative performance impact, particularly when the keys are 
large or inefficient to decompress.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3032) key columns unnecessarily selected when predicate is converted to range

2019-12-20 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001233#comment-17001233
 ] 

Todd Lipcon commented on KUDU-3032:
---

The issue is that we seem to determine the "missing columns" for a scan prior 
to running the "OptimizeScanSpec()" function.

> key columns unnecessarily selected when predicate is converted to range
> ---
>
> Key: KUDU-3032
> URL: https://issues.apache.org/jira/browse/KUDU-3032
> Project: Kudu
>  Issue Type: Bug
>  Components: perf
>Affects Versions: 1.11.1
>Reporter: Todd Lipcon
>Priority: Major
>
> When a predicate applies to leading primary key columns, the tablet service 
> optimizes it into a range scan and removes those predicates from the 
> ScanSpec. However, the current behavior (seemingly going back to when this 
> was implemented) does not actually prevent those key columns from being read 
> from disk. This has a negative performance impact, particularly when the keys 
> are large or inefficient to decompress.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3030) Crash in tcmalloc stack unwinder

2019-12-19 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000454#comment-17000454
 ] 

Todd Lipcon commented on KUDU-3030:
---

I investigated this a bit. We do already install libunwind, and use that for 
things like '/stacks' and glog. However, it seems like tcmalloc has the 
following behavior:

- the autoconf script tries to check whether frame pointers are omitted by 
default (as they usually are on x86)
- it also checks if libunwind is present, and if so, compiles a libunwind-based 
stack walker
- at runtime, if both are present, it will prefer libunwind on systems where it 
has detected that frame pointers are omitted.

So, in theory, since we do install libunwind before building tcmalloc in 
thirdparty, we should be selecting libunwind by default. However, we set 
CXXFLAGS to '-fno-omit-frame-pointer' in our thirdparty build, which actually 
affects the {{configure}} script as well. So, when it tried to check whether 
frame pointers were omitted by default, it decided that they were _not_ 
omitted, and thus configured itself to prefer the fp-based unwinder.

A couple ways we can fix this:
(1) stop compiling tcmalloc with -fno-omit-frame-pointer. This should get it to 
prefer libunwind.
(2) add some capability in tcmalloc's configuration to force it to use 
libunwind even when built with no frame pointers of its own.
(3) at runtime, it seems like we could set TCMALLOC_STACKTRACE_METHOD=libunwind 
early at startup, and it would prefer libunwind.

If we find that the libunwind-based unwinder is too slow for the heap-sampling 
use case, we could also try to patch tcmalloc's FP unwinder to be more 
safe. One approach is to call write() on each address before reading it, since 
write() will return -EFAULT instead of crashing if the address is bad. Another 
approach would be to set a threadlocal while we're in the stack trace routine, 
and if we catch a SEGV with this threadlocal set, we could ignore it and abort 
the stack tracing.
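
As a concrete illustration of option (3), a minimal sketch, assuming tcmalloc 
consults the variable the first time it captures a stack trace:

{code}
#include <cstdlib>

int main(int argc, char** argv) {
  // Force tcmalloc's libunwind-based stack walker. This must happen very
  // early, before the first heap-sampling stack trace is taken.
  setenv("TCMALLOC_STACKTRACE_METHOD", "libunwind", /*overwrite=*/1);
  // ... rest of server startup ...
  return 0;
}
{code}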


> Crash in tcmalloc stack unwinder
> 
>
> Key: KUDU-3030
> URL: https://issues.apache.org/jira/browse/KUDU-3030
> Project: Kudu
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.11.0
>Reporter: Todd Lipcon
>Priority: Critical
>
> We recently saw a crash where the tcmalloc heap profiler was trying to unwind 
> the stack, and ended up accessing invalid memory. The issue here is that 
> tcmalloc is relying on frame pointers for heap unwinding, but this particular 
> stack trace was going through libstdc++, which was installed on the system 
> and doesn't have frame pointers. "usually" this works OK, but when we get 
> unlucky, we can crash.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3030) Crash in tcmalloc stack unwinder

2019-12-19 Thread Todd Lipcon (Jira)
Todd Lipcon created KUDU-3030:
-

 Summary: Crash in tcmalloc stack unwinder
 Key: KUDU-3030
 URL: https://issues.apache.org/jira/browse/KUDU-3030
 Project: Kudu
  Issue Type: Bug
  Components: build
Affects Versions: 1.11.0
Reporter: Todd Lipcon


We recently saw a crash where the tcmalloc heap profiler was trying to unwind 
the stack, and ended up accessing invalid memory. The issue here is that 
tcmalloc is relying on frame pointers for heap unwinding, but this particular 
stack trace was going through libstdc++, which was installed on the system and 
doesn't have frame pointers. "usually" this works OK, but when we get unlucky, 
we can crash.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3022) C++ Client does not retry CreateTable on "new table name is already reserved" error

2019-12-16 Thread Todd Lipcon (Jira)
Todd Lipcon created KUDU-3022:
-

 Summary: C++ Client does not retry CreateTable on "new table name 
is already reserved" error
 Key: KUDU-3022
 URL: https://issues.apache.org/jira/browse/KUDU-3022
 Project: Kudu
  Issue Type: Bug
  Components: client, master
Reporter: Todd Lipcon


If two callers try to create a table with the same name at the same time, one 
of them can fail with the error: "new table name foo is already reserved". The 
comments in catalog_manager.cc seem to indicate that this should trigger the 
client to retry, but it seems the retry is not taking place.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3019) Kudu client hangs in Deffered.join()

2019-12-12 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995364#comment-16995364
 ] 

Todd Lipcon commented on KUDU-3019:
---

What version are you seeing this on and what did you set the timeouts to, if 
not default? Timeouts are handled by the async code so no timeout should be 
needed on the join that converts the async call to sync.

Are you able to repro?

> Kudu client hangs in Deffered.join()
> 
>
> Key: KUDU-3019
> URL: https://issues.apache.org/jira/browse/KUDU-3019
> Project: Kudu
>  Issue Type: Bug
>Reporter: Zoltán Borók-Nagy
>Priority: Major
>
> In Impala we've seen the Kudu client hanging with the following stack trace:
> {noformat}
> Thread 53015: (state = BLOCKED)
>  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be 
> imprecise)
>  - java.lang.Object.wait() @bci=2, line=502 (Compiled frame)
>  - com.stumbleupon.async.Deferred.doJoin(boolean, long) @bci=77, line=1122 
> (Compiled frame)
>  - com.stumbleupon.async.Deferred.join() @bci=3, line=1006 (Compiled frame)
>  - 
> org.apache.kudu.client.KuduClient.joinAndHandleException(com.stumbleupon.async.Deferred)
>  @bci=1, line=340 (Compiled frame)
>  - org.apache.kudu.client.KuduClient.openTable(java.lang.String) @bci=10, 
> line=212 (Compiled frame)
>  - 
> org.apache.impala.planner.KuduScanNode.init(org.apache.impala.analysis.Analyzer)
>  @bci=32, line=115 (Compiled frame)
>  - 
> org.apache.impala.planner.SingleNodePlanner.createScanNode(org.apache.impala.analysis.TableRef,
>  org.apache.impala.analysis.AggregateInfo, 
> org.apache.impala.analysis.Analyzer) @bci=252, line=1312 (Compiled frame)
> ...{noformat}
> The client hangs in Deferred.join():
> [https://github.com/apache/kudu/blob/a8c6ea258c06407c1a4fef260da3a1cb70529bd9/java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java#L423]
> To at least mitigate the problem, maybe Deferred.join(long timeout) could be 
> used instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-1644) Simplify IN-list predicate values based on tablet partition key or rowset PK bounds

2019-12-03 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987613#comment-16987613
 ] 

Todd Lipcon commented on KUDU-1644:
---

Looks like those images are in a private jira. Can you upload them here?

> Simplify IN-list predicate values based on tablet partition key or rowset PK 
> bounds
> ---
>
> Key: KUDU-1644
> URL: https://issues.apache.org/jira/browse/KUDU-1644
> Project: Kudu
>  Issue Type: Sub-task
>  Components: perf, tablet
>Reporter: Dan Burkert
>Priority: Major
>
> When new scans are optimized by the tablet, the tablet's partition key bounds 
> aren't taken into account in order to remove predicates from the scan.  One 
> of the most important such optimizations is that IN-list predicates could 
> remove values based on the tablet's constraints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-38) bootstrap should not replay logs that are known to be fully flushed

2019-11-19 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-38?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1695#comment-1695
 ] 

Todd Lipcon commented on KUDU-38:
-

bq. Guaranteeing that every complete segment has a fully sync'ed index file 
makes for a nice invariant, but isn't it overkill for the task at hand? 
Couldn't we get away with sync'ing whichever index file contains the earliest 
anchored index at TabletMetadata flush time? I'm particularly concerned about 
the backwards compatibility implications: how do we establish this invariant 
after upgrading to a release including this fix? Or, how do we detect that it's 
not present in existing log index files?

I think we need to make sure that all prior indexes are also synced, because 
it's possible that there is a lagging peer that will still need to catch up 
from a very old record. The index file is what allows a leader to go find those 
old log entries and send them along. Without it, the old log segments aren't 
useful.

bq. I'm particularly concerned about the backwards compatibility implications: 
how do we establish this invariant after upgrading to a release including this 
fix? Or, how do we detect that it's not present in existing log index files?

Yep, we'd need to take that into account, eg by adding some new flag to the 
tablet metadata indicating that the indexes are durable or somesuch.

bq. Alternatively, what about forgoing the log index file and rather than 
storing the earliest anchored index in the TabletMetadata, storing the 
"physical index" (i.e. the LogIndexEntry corresponding to the anchor)?

Again per above, we can't be alive with an invalid index file, or else 
consensus won't be happy.


Ignoring the rest of your questions for a minute, let me throw out an 
alternative idea or two:

*Option 1:*

We could add a new separate piece of metadata next to the logs called a "sync 
point" or somesuch. (this could even be at a well-known offset in the existing 
log file or something). We can periodically wake up a background process in a 
log (eg when we see that the sync point is too far back) and then:
(1) look up the earliest durability-anchored offset
(2) msync the log indexes up to that point
(3) write that point to the special "sync point" metadata file. This is just an 
offset, so it can be written atomically and lazily flushed (it only moves 
forward)

At startup, if we see a sync point metadata file, we know we can start 
replaying (and reconstructing index) from that point, without having to 
reconstruct any earlier index entries. If we do this lazily (eg once every few 
seconds only on actively-written tablets) the performance overhead should be 
negligible.

We also need to think about how this interacts with tablet copy -- right now, a 
newly copied tablet relies on replaying the WALs from the beginning because it 
doesn't copy log indexes. We may need to change that.

*Option 2*: get rid of "log index"
This is the "nuke everything from orbit" option: the whole log index thing was 
convenient but it's somewhat annoying for a number of reasons: (1) the issues 
described here, (2) we are using mmapped IO which is dangerous since IO errors 
crash the process, (3) just another bit of code to worry about and transfer 
around in tablet copy, etc.

The alternative is to embed the index in the WAL itself. One sketch of an 
implementation would be something like:
- divide the WAL into fixed-size pages, each with a header.  The header would 
have term/index info and some kind of "continuation" flag for when entries span 
multiple pages. This is more or less the postgres WAL design
- this allows us to binary-search the WAL instead of having a separate index.
- we have to consider how truncations work -- I guess we would move to physical 
truncation.

Another possible idea would be to not use fixed-size pages, but instead embed a 
tree structure into the WAL itself. For example, it wouldn't be too tough to 
add a back-pointer from each entry to the previous entry to enable backward 
scanning. If we then take a skip-list like approach (n/2 nodes have a skip-1 
pointer, n/4 nodes have a skip-4 pointer, n/8 nodes have a skip-8 pointer, etc) 
then we can get logarithmic access time to past log entries. Again, need to 
consider truncation.

Either of these options have the advantage that we no longer need to worry 
about indexes, but we still do need to worry about figuring out where to start 
replaying from, and we could take the same strategy as the first suggestion for 
that.
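
To make Option 1 a bit more concrete, here is a rough sketch of the background 
task; every name below is hypothetical, not an existing Kudu API:

{code}
#include <cstdint>

// Hypothetical interface for the WAL, for illustration only.
class Log {
 public:
  virtual ~Log() = default;
  virtual int64_t EarliestDurabilityAnchoredOffset() = 0;     // step (1)
  virtual void MsyncIndexesUpTo(int64_t offset) = 0;          // step (2)
  virtual void WriteSyncPointAtomically(int64_t offset) = 0;  // step (3)
};

// Background task, woken every few seconds on actively written tablets.
void MaybeAdvanceSyncPoint(Log* log) {
  int64_t anchor = log->EarliestDurabilityAnchoredOffset();
  log->MsyncIndexesUpTo(anchor);
  // The sync point only moves forward and may lag the true durable point;
  // a stale value is safe because bootstrap merely replays more WAL than
  // strictly necessary.
  log->WriteSyncPointAtomically(anchor);
}
{code}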

> bootstrap should not replay logs that are known to be fully flushed
> ---
>
> Key: KUDU-38
> URL: https://issues.apache.org/jira/browse/KUDU-38
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tablet
>Affects Versions: M3
>Reporter: Todd Lipcon
>  

[jira] [Created] (KUDU-2989) SASL server fails when FQDN is greater than 63 characters long

2019-10-31 Thread Todd Lipcon (Jira)
Todd Lipcon created KUDU-2989:
-

 Summary: SASL server fails when FQDN is greater than 63 characters 
long
 Key: KUDU-2989
 URL: https://issues.apache.org/jira/browse/KUDU-2989
 Project: Kudu
  Issue Type: Bug
  Components: rpc, security
Affects Versions: 1.10.0
Reporter: Todd Lipcon


Currently, on the server side, Kudu doesn't explicitly pass the host's FQDN 
into the SASL library. Due to an upstream SASL bug 
(https://github.com/cyrusimap/cyrus-sasl/issues/583) the FQDN gets truncated 
when trying to determine the server's principal, in the case that the server's 
FQDN is longer than 64 characters.

This results in startup failures where the preflight checks fail due to not 
finding the appropriate keytab entry (after searching for a truncated host 
name).

To work around this, we should use our own code to compute the FQDN.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-2452) Prevent follower from causing pre-elections when UpdateConsensus is slow

2019-10-22 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957339#comment-16957339
 ] 

Todd Lipcon commented on KUDU-2452:
---

Another idea here that would be more complicated but have a much bigger 
positive impact would be to exploit the fact that most heartbeats are simple 
"lease renewals" with no new tablet-specific information. In other words, the 
tablet has no operations to replicate, and the safetime advancement is only due 
to the server-wide clock advancing. In this case, it is somewhat wasteful that 
we are actually sending such heartbeats once per tablet instead of once per 
server.

> Prevent follower from causing pre-elections when UpdateConsensus is slow
> 
>
> Key: KUDU-2452
> URL: https://issues.apache.org/jira/browse/KUDU-2452
> Project: Kudu
>  Issue Type: Improvement
>Affects Versions: 1.7.0
>Reporter: William Berkeley
>Priority: Major
>  Labels: stability
>
> Thanks to pre-elections (KUDU-1365), slow UpdateConsensus calls on a single 
> follower don't disturb the whole tablet by calling elections. However, 
> sometimes I see situations where one or more followers are constantly calling 
> pre-elections, and only rarely, if ever, overflowing their service queues. 
> Occasionally, in 3x replicated tablets, the followers will get "lucky" and 
> detect a leader failure at around the same time, and an election will happen.
> This background instability has caused bugs like KUDU-2343 that should be 
> rare to occur pretty frequently, plus the extra RequestConsensusVote RPCs add 
> a little more stress on the consensus service and on replicas' consensus 
> locks. It also spams the logs, since there's no generally no exponential 
> backoff for these pre-elections because there's a successful heartbeat in 
> between them.
> It seems like we can get into the situation where the average number of 
> in-flight consensus requests is constant over time, so on average we are 
> processing each heartbeat in less than the heartbeat interval, however some 
> heartbeats take longer. Since UpdateConsensus calls to a replica are 
> serialized, a few of these in a row trigger the failure detector, despite the 
> follower receiving every heartbeat in a timely manner and responding 
> successfully eventually (and on average in a timely manner).
> It'd be nice to prevent these worthless pre-elections. A couple of ideas:
> 1. Separately calculate a backoff for failed pre-elections, and reset it when 
> a pre-election succeeds or more generally when there's an election.
> 2. Don't count the time the follower is executing UpdateConsensus against the 
> failure detector. [~mpercy] suggested stopping the failure detector during 
> UpdateReplica() and resuming it when the function returns.
> 3. Move leader failure detection out-of-band of UpdateConsensus entirely.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-2975) Spread WAL across multiple data directories

2019-10-17 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953946#comment-16953946
 ] 

Todd Lipcon commented on KUDU-2975:
---

agreed this would be good to do. I think [~awong] had done some 
investigation/write-up of what would be required to do this at one point, right?

> Spread WAL across multiple data directories
> ---
>
> Key: KUDU-2975
> URL: https://issues.apache.org/jira/browse/KUDU-2975
> Project: Kudu
>  Issue Type: New Feature
>  Components: fs, tablet, tserver
>Reporter: LiFu He
>Priority: Major
> Attachments: network.png, tserver-WARNING.png, util.png
>
>
> Recently, we deployed a new kudu cluster and every node has 12 SSD. Then, we 
> created a big table and loaded data to it through flink.  We noticed that the 
> util of one SSD which is used to store WAL is 100% but others are free. So, 
> we suggest to spread WAL across multiple data directories.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-2932) Unix domain socket could speed up data transmission

2019-08-28 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917942#comment-16917942
 ] 

Todd Lipcon commented on KUDU-2932:
---

bq. kudu doesn't support data locality

What do you mean by that? I think it _does_ support data locality (Impala/Spark 
tasks are put on the same machine as the kudu tserver, usually).

bq. I think it's useful to support unix domain socket for computing 
engine(impala/spark) and storage engine(kudu) mixed deployment scenarios.

We actually did some work on this many years ago, and also tried using shared 
memory. The domain socket seemed to have a small improvement, but shared memory 
didn't seem to be worth the benefit. You can find some of the really old 
patches on gerrit: https://gerrit.cloudera.org/#/c/957/

> Unix domain socket could speed up data transmission
> ---
>
> Key: KUDU-2932
> URL: https://issues.apache.org/jira/browse/KUDU-2932
> Project: Kudu
>  Issue Type: New Feature
>Reporter: HeLifu
>Priority: Major
>
> Right now, kudu doesn't support data locality. So, I think it's useful to 
> support unix domain socket for computing engine(impala/spark) and storage 
> engine(kudu) mixed deployment scenarios.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (KUDU-2847) Optimize iteration over selection vector in SerializeRowBlock

2019-08-27 Thread Todd Lipcon (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2847.
---
Fix Version/s: 1.11.0
   Resolution: Fixed

> Optimize iteration over selection vector in SerializeRowBlock
> -
>
> Key: KUDU-2847
> URL: https://issues.apache.org/jira/browse/KUDU-2847
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf, tserver
>Reporter: Todd Lipcon
>Assignee: ZhangYao
>Priority: Major
> Fix For: 1.11.0
>
>
> Currently, SerializeRowBlock operates column-by-column, which means we have 
> to iterate over the selection bitmap once for each column. This code isn't 
> particularly well optimized -- in TPCH Q6, about 10% of CPU is spent in 
> BitmapFindFirst. We should look at alternate implementations here that better 
> amortize the bitmap iteration cost across all of the columns and generally 
> micro-optimize it.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (KUDU-2906) Don't allow elections when server clocks are too out of sync

2019-07-26 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894190#comment-16894190
 ] 

Todd Lipcon commented on KUDU-2906:
---

So in this case, the server's NTP reported that it was in TIME_OK status, but 
gave a badly incorrect time, with an incorrect maxerror?

Perhaps we could also do something to detect this kind of skew by making sure 
the monotonic clock and the wall clock advance at roughly the same rate (eg if 
the monotonic clock changed by 1 second, and the NTP clock changed by 3, it's a 
problem!)
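
A standalone sketch of that cross-check (illustrative only, not Kudu code):

{code}
#include <cstdio>
#include <ctime>
#include <unistd.h>

// Compare how far the wall clock and the monotonic clock advance over the
// same interval; a large mismatch suggests the wall clock is being slewed
// or stepped even though NTP claims TIME_OK.
int main() {
  timespec m0, w0, m1, w1;
  clock_gettime(CLOCK_MONOTONIC, &m0);
  clock_gettime(CLOCK_REALTIME, &w0);
  sleep(1);
  clock_gettime(CLOCK_MONOTONIC, &m1);
  clock_gettime(CLOCK_REALTIME, &w1);
  double mono = (m1.tv_sec - m0.tv_sec) + (m1.tv_nsec - m0.tv_nsec) / 1e9;
  double wall = (w1.tv_sec - w0.tv_sec) + (w1.tv_nsec - w0.tv_nsec) / 1e9;
  double ratio = wall / mono;
  if (ratio > 1.5 || ratio < 0.5) {
    fprintf(stderr, "clock skew: wall advanced %.3fs vs monotonic %.3fs\n",
            wall, mono);
  }
  return 0;
}
{code}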

> Don't allow elections when server clocks are too out of sync
> 
>
> Key: KUDU-2906
> URL: https://issues.apache.org/jira/browse/KUDU-2906
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.10.0
>Reporter: Andrew Wong
>Priority: Major
>
> In cases where machine clocks are not properly synchronized, if a tablet 
> replica is elected leader whose clock happens to be very far in the future 
> (greater than --max_clock_sync_error_usec=10 sec), it's possible that any 
> writes that goes to that tablet will be rejected by the followers, but 
> persisted to the leader's WAL.
> Then, upon fixing the clock on that machine, the replica may try to replay 
> the future op, but fail to replay it because the op timestamp is too far in 
> the future, with errors like:
> {code:java}
> F0715 12:03:09.369819  3500 tablet_bootstrap.cc:904] Check failed: _s.ok() 
> Bad status: Invalid argument: Tried to update clock beyond the max. 
> error.{code}
> Dumping a recovery WAL, I could see:
> {code:java}
> 130.138@6400743143334211584 REPLICATE NO_OP
> id { term: 130 index: 138 } timestamp: 6400743143334211584 op_type: NO_OP 
> noop_request { }
> COMMIT 130.138
> op_type: NO_OP commited_op_id { term: 130 index: 138 }
> 131.139@6400743925559676928 REPLICATE NO_OP
> id { term: 131 index: 139 } timestamp: 6400743925559676928 op_type: NO_OP 
> noop_request { }
> COMMIT 131.139
> op_type: NO_OP commited_op_id { term: 131 index: 139 }
> 132.140@11589864471731939930 REPLICATE NO_OP
> id { term: 132 index: 140 } timestamp: 11589864471731939930 op_type: NO_OP 
> noop_request { }{code}
> Note the drastic jump in timestamp.
> In this specific case, we verified that the replayed WAL wasn't that far 
> behind the recovery WAL, which had the future timestamps, so we could just 
> delete the recovery WAL and bootstrap from the replayed WAL.
> It would have been nice had those bad ops not been written at all, maybe by 
> preventing an election between such mismatched servers in the first place.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (KUDU-2673) Event timestamp support with kudu.

2019-07-17 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887317#comment-16887317
 ] 

Todd Lipcon commented on KUDU-2673:
---

I think adding some scanner pushdown capability for handling common bitemporal 
modeling patterns is reasonable (eg particular predicates or other 
constant-memory processing). However, I'm pretty strongly against adding 
HBase-like functionality to have user-specified timestamps, since we make a lot 
of physical layer optimizations around assumptions around time only moving 
forward, etc.

> Event timestamp support with kudu.
> --
>
> Key: KUDU-2673
> URL: https://issues.apache.org/jira/browse/KUDU-2673
> Project: Kudu
>  Issue Type: New Feature
>  Components: java, spark, tserver
>Reporter: yangz
>Priority: Major
>  Labels: features, roadmap-candidate
>
> Kudu has the ability to read historical data, but it is based on the 
> timestamp produced by Kudu's transaction and MVCC system. The timestamp Kudu 
> uses greatly weakens the usability.
> For our use case, we write data to Kudu from a data stream. We use range 
> partitioning by day.
> We want to get the hourly version from Kudu, so we need to read historical 
> data from Kudu.
> It is produced by the undo file. But when a user gives a timestamp, it means 
> the timestamp at which the event happened, associated with the data, not the 
> timestamp Kudu produced. So we need a way to set the event timestamp in the 
> Kudu system.
> Finally, we found a way to solve this problem.
> But our solution has two limits.
>  # We only update the table one row at a time, and each row carries a 
> timestamp with it.
>  # To get the right historical version of the data, we need the data stream 
> to send data in event-time order.
> Despite these problems, it has satisfied our current business needs.
>  
> Our implementation also partly solves the out-of-order event-time problem if 
> you only need the newest data, which does not read the undo file.
> For data sent into kudu, with t1 < t2:
> t1 upsert -> t2 upsert      ->    newest will be the t2 value
> t2 upsert -> t1 upsert      ->    (current kudu implementation) t1; our 
> implementation will give t2.
>  
> Maybe our solution is not the best for the problem, but I think Kudu 
> snapshot reads should support event time.
> Our solution is not complete for all use cases, but I hope it will be useful 
> for some cases in the community.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (KUDU-2894) How to modify the time zone of web ui

2019-07-15 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885839#comment-16885839
 ] 

Todd Lipcon commented on KUDU-2894:
---

I think fixing this would be quite a bit of work -- you'd need to plumb down 
the environmental context (browser client timezone) into the "tostring" 
methods, etc. Not against someone doing this, so long as it's easy to switch 
back and forth between TZ-aware stringification since it's useful to be able to 
see the "raw" value here rather than some adjusted value

> How to modify the time zone of web ui
> -
>
> Key: KUDU-2894
> URL: https://issues.apache.org/jira/browse/KUDU-2894
> Project: Kudu
>  Issue Type: Bug
>  Components: ui
>Affects Versions: 1.8.0
>Reporter: wxmimperio
>Priority: Major
> Attachments: image-2019-07-15-19-32-42-041.png, 
> image-2019-07-16-10-28-33-293.png
>
>
> !image-2019-07-15-19-32-42-041.png!
> I create partition with 2019-05-31 00:00:00——2019-06-01 00:00:00, it show 
> 2019-05-30 16:00:00——2019-05-31 16:00:00.
> 8 hours difference.
> However, the data time I inserted is correct, not the partition display.
> How to modify the time zone on the web interface?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (KUDU-2894) How to modify the time zone of web ui

2019-07-15 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885501#comment-16885501
 ] 

Todd Lipcon commented on KUDU-2894:
---

Note that the 'Z' suffix on the timestamps indicates that these are UTC times. 
Kudu's timestamp type is not timezone-aware, so we won't remember whatever time 
zone you were in when you created the partition. However, we could certainly 
_format_ timestamps in a timezone-aware manner. That gets a little tricky, 
though -- should we format it based on the timezone of the client browser? The 
server? What if different masters are in different time zones? How about when 
formatting time zones in logs? What if the timezone changes (eg as in the 
daylight savings time in the USA)? Would the formatting of a partition change 
in the logs when the time zone changes?

It seems like UTC formatting is the least error-prone way to stringify 
timestamps on the server side.
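
For what it's worth, the server-side convention is trivial to produce; a 
minimal sketch (illustrative only, not Kudu's actual formatting code):

{code}
#include <cstdio>
#include <ctime>

// Format a UNIX timestamp as UTC with a 'Z' suffix, the unambiguous
// server-side stringification discussed above.
void PrintUtc(time_t t) {
  char buf[32];
  struct tm tm_utc;
  gmtime_r(&t, &tm_utc);
  strftime(buf, sizeof(buf), "%Y-%m-%dT%H:%M:%SZ", &tm_utc);
  printf("%s\n", buf);
}
{code}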

> How to modify the time zone of web ui
> -
>
> Key: KUDU-2894
> URL: https://issues.apache.org/jira/browse/KUDU-2894
> Project: Kudu
>  Issue Type: Bug
>  Components: ui
>Affects Versions: 1.8.0
>Reporter: wxmimperio
>Priority: Major
> Attachments: image-2019-07-15-19-32-42-041.png
>
>
> !image-2019-07-15-19-32-42-041.png!
> I create partition with 2019-05-31 00:00:00——2019-06-01 00:00:00, it show 
> 2019-05-30 16:00:00——2019-05-31 16:00:00.
> 8 hours difference.
> However, the data time I inserted is correct, not the partition display.
> How to modify the time zone on the web interface?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (KUDU-2864) Support NOT predicates

2019-07-11 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883247#comment-16883247
 ] 

Todd Lipcon commented on KUDU-2864:
---

I think at some point as we add more complex predicates (like NOT or OR), we 
need to actually change KuduPredicate over to more of a traditional "expression 
tree" structure. eg we would have a NotPredicate class which wraps another 
Predicate.
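
A minimal sketch of that tree shape; {{Predicate}} and {{NotPredicate}} as 
written here are hypothetical, not the current client API:

{code}
#include <memory>

struct Row;  // stand-in for a decoded row

class Predicate {
 public:
  virtual ~Predicate() = default;
  virtual bool Evaluate(const Row& row) const = 0;
};

// NOT is just a node that wraps and negates another predicate.
class NotPredicate : public Predicate {
 public:
  explicit NotPredicate(std::unique_ptr<Predicate> child)
      : child_(std::move(child)) {}
  bool Evaluate(const Row& row) const override {
    return !child_->Evaluate(row);
  }
 private:
  std::unique_ptr<Predicate> child_;
};
{code}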

> Support NOT predicates
> --
>
> Key: KUDU-2864
> URL: https://issues.apache.org/jira/browse/KUDU-2864
> Project: Kudu
>  Issue Type: Sub-task
>Reporter: Grant Henke
>Priority: Major
>
> Support a NOT predicate that can be combined with other predicates.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (KUDU-2855) Lazy-create DeltaMemStores on first update

2019-07-11 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2855.
---
   Resolution: Fixed
Fix Version/s: 1.11.0

> Lazy-create DeltaMemStores on first update
> --
>
> Key: KUDU-2855
> URL: https://issues.apache.org/jira/browse/KUDU-2855
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf, tserver
>Reporter: Todd Lipcon
>Assignee: HeLifu
>Priority: Major
> Fix For: 1.11.0
>
>
> Currently DeltaTracker::DoOpen creates a DeltaMemStore for each DRS. If we 
> assume that most DRS don't have any deltas, this ends up wasting quite a bit 
> of memory. Looking at one TS in a production cluster, about 1GB of the ~14G 
> heap is being used by DMS. Of that, 464MB is data and the remainder is 
> overhead.
> This would likely improve other code paths too to fast-path out any 
> DMS-related code.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (KUDU-2888) Better encoding for dictionary code-words

2019-07-09 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881716#comment-16881716
 ] 

Todd Lipcon commented on KUDU-2888:
---

Attached is a little test file I wrote. NOTE: it has some weirdness where it 
reports the compression ratio of bitshuffle off by a factor of four, and maybe 
there are some perf problems too (I didn't spend a lot of time on it).

> Better encoding for dictionary code-words
> -
>
> Key: KUDU-2888
> URL: https://issues.apache.org/jira/browse/KUDU-2888
> Project: Kudu
>  Issue Type: Bug
>  Components: cfile, perf
>Reporter: Todd Lipcon
>Priority: Major
> Attachments: codec-test.py
>
>
> Currently we use bitshuffle for all ints, including dictionary codewords. For 
> dictionary codewords, we know the maximum possible value up-front, and we 
> also know that the ints will be non-negative and small. This set of 
> constraints makes it much better to use a specialized bitpacking algorithm 
> rather than a more generic compression like bitshuffle+lz4. Based on some 
> quick experiments I ran, we can probably get a several-fold decoding speedup 
> with no loss of compression by switching to a codec like simdbitpacking for 
> these codewords.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2888) Better encoding for dictionary code-words

2019-07-09 Thread Todd Lipcon (JIRA)
Todd Lipcon created KUDU-2888:
-

 Summary: Better encoding for dictionary code-words
 Key: KUDU-2888
 URL: https://issues.apache.org/jira/browse/KUDU-2888
 Project: Kudu
  Issue Type: Bug
  Components: cfile, perf
Reporter: Todd Lipcon
 Attachments: codec-test.py

Currently we use bitshuffle for all ints, including dictionary codewords. For 
dictionary codewords, we know the maximum possible value up-front, and we also 
know that the ints will be non-negative and small. This set of constraints 
makes it much better to use a specialized bitpacking algorithm rather than a 
more generic compression like bitshuffle+lz4. Based on some quick experiments I 
ran, we can probably get a several-fold decoding speedup with no loss of 
compression by switching to a codec like simdbitpacking for these codewords.
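
To illustrate why the known bound helps, here is a toy fixed-width bitpacker 
(just the underlying idea, not the simdbitpacking library): a dictionary of 
size D bounds every codeword below D, so ceil(log2(D)) bits per value suffice.

{code}
#include <cstdint>
#include <vector>

// Bits needed per codeword for a dictionary of 'dict_size' entries
// (assumes 1 <= dict_size < 2^31).
int BitsNeeded(uint32_t dict_size) {
  int bits = 1;
  while ((1u << bits) < dict_size) ++bits;
  return bits;
}

// Pack each codeword into exactly 'bits' bits, little-endian within words.
std::vector<uint64_t> Pack(const std::vector<uint32_t>& codewords, int bits) {
  std::vector<uint64_t> out((codewords.size() * bits + 63) / 64, 0);
  for (size_t i = 0; i < codewords.size(); ++i) {
    size_t pos = i * bits;
    out[pos / 64] |= static_cast<uint64_t>(codewords[i]) << (pos % 64);
    if (pos % 64 + bits > 64) {  // value straddles a word boundary
      out[pos / 64 + 1] |= codewords[i] >> (64 - pos % 64);
    }
  }
  return out;
}
{code}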



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2888) Better encoding for dictionary code-words

2019-07-09 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-2888:
--
Attachment: codec-test.py

> Better encoding for dictionary code-words
> -
>
> Key: KUDU-2888
> URL: https://issues.apache.org/jira/browse/KUDU-2888
> Project: Kudu
>  Issue Type: Bug
>  Components: cfile, perf
>Reporter: Todd Lipcon
>Priority: Major
> Attachments: codec-test.py
>
>
> Currently we use bitshuffle for all ints, including dictionary codewords. For 
> dictionary codewords, we know the maximum possible value up-front, and we 
> also know that the ints will be non-negative and small. This set of 
> constraints makes it much better to use a specialized bitpacking algorithm 
> rather than a more generic compression like bitshuffle+lz4. Based on some 
> quick experiments I ran, we can probably get a several-fold decoding speedup 
> with no loss of compression by switching to a codec like simdbitpacking for 
> these codewords.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-428) fine-grained authorization through Sentry integration

2019-07-08 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880772#comment-16880772
 ] 

Todd Lipcon commented on KUDU-428:
--

[~hahao] should we mark this as resolved?

> fine-grained authorization through Sentry integration
> -
>
> Key: KUDU-428
> URL: https://issues.apache.org/jira/browse/KUDU-428
> Project: Kudu
>  Issue Type: New Feature
>  Components: master, security, tserver
>Affects Versions: 1.2.0
>Reporter: Todd Lipcon
>Assignee: Hao Hao
>Priority: Critical
>  Labels: roadmap-candidate
>
> We need to support basic SQL-like access control:
> - grant/revoke on tables, columns
> - service-level grant/revoke
> - probably need some group/role mapping infrastructure as well



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-1644) Simplify IN-list predicate values based on tablet partition key or rowset PK bounds

2019-07-08 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-1644:
--
Component/s: perf

> Simplify IN-list predicate values based on tablet partition key or rowset PK 
> bounds
> ---
>
> Key: KUDU-1644
> URL: https://issues.apache.org/jira/browse/KUDU-1644
> Project: Kudu
>  Issue Type: Sub-task
>  Components: perf, tablet
>Reporter: Dan Burkert
>Priority: Major
>
> When new scans are optimized by the tablet, the tablet's partition key bounds 
> aren't taken into account in order to remove predicates from the scan.  One 
> of the most important such optimizations is that IN-list predicates could 
> remove values based on the tablet's constraints.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-2867) Optimize Timestamp varint decoding

2019-06-27 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2867.
---
   Resolution: Fixed
Fix Version/s: 1.11.0

> Optimize Timestamp varint decoding
> --
>
> Key: KUDU-2867
> URL: https://issues.apache.org/jira/browse/KUDU-2867
> Project: Kudu
>  Issue Type: Improvement
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Minor
> Fix For: 1.11.0
>
>
> In looking at a profile with some delta reads, I noticed ~11% of CPU going to 
> GetMemcmpableVarint, much of which is from Timestamp decoding. In practice, 
> any timestamp from HybridClock will need an 8-byte varint, so we can 
> fast-path that case. (unfortunately it's too much effort to change the data 
> format to not use a varint here)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-1644) Simplify IN-list predicate values based on tablet partition key or rowset PK bounds

2019-06-26 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873493#comment-16873493
 ] 

Todd Lipcon commented on KUDU-1644:
---

Worth noting there's also the case where the merging of the DRS bounds with the 
IN-list can fully eliminate the DRS (ie the predicate converts to 'None') in 
which case the rowset can be skipped. I don't know if we implement that 
optimization today but seems again like an easy win.

> Simplify IN-list predicate values based on tablet partition key or rowset PK 
> bounds
> ---
>
> Key: KUDU-1644
> URL: https://issues.apache.org/jira/browse/KUDU-1644
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tablet
>Reporter: Dan Burkert
>Priority: Major
>
> When new scans are optimized by the tablet, the tablet's partition key bounds 
> aren't taken into account in order to remove predicates from the scan.  One 
> of the most important such optimizations is that IN-list predicates could 
> remove values based on the tablet's constraints.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-1644) Simplify IN-list predicate values based on tablet partition key bounds

2019-06-26 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873487#comment-16873487
 ] 

Todd Lipcon commented on KUDU-1644:
---

This optimization can even happen at the DRS level, which is useful for IN-list 
on PK prefixes. For example, we may have {{PK IN ('foo', 'bar')}} but within a 
given DRS we know from the DRS row bounds that we have an implied predicate 
{{PK BETWEEN 'apple' and 'cat'}}. If we merge that implied predicate with the 
IN list, it simplifies to {{PK = 'bar'}} which can be evaluated as a range scan.

This isn't as general as the proposal in KUDU-2875, but in the common case of a 
well-compacted table, it could possibly have almost the same level benefit vs 
the naive full scan of the PK column.
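
A toy illustration of that merge (illustrative only, not the real optimizer 
code):

{code}
#include <algorithm>
#include <string>
#include <vector>

// Keep only IN-list values that fall inside the DRS's PK bounds
// [lower, upper]; an empty result means the whole rowset can be skipped.
std::vector<std::string> MergeInListWithBounds(
    std::vector<std::string> values,
    const std::string& lower, const std::string& upper) {
  values.erase(std::remove_if(values.begin(), values.end(),
                              [&](const std::string& v) {
                                return v < lower || v > upper;
                              }),
               values.end());
  return values;
}
{code}

With bounds ['apple', 'cat'] and {{PK IN ('foo', 'bar')}}, only 'bar' 
survives, so the predicate simplifies to {{PK = 'bar'}}.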

> Simplify IN-list predicate values based on tablet partition key bounds
> --
>
> Key: KUDU-1644
> URL: https://issues.apache.org/jira/browse/KUDU-1644
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tablet
>Reporter: Dan Burkert
>Priority: Major
>
> When new scans are optimized by the tablet, the tablet's partition key bounds 
> aren't taken into account in order to remove predicates from the scan.  One 
> of the most important such optimizations is that IN-list predicates could 
> remove values based on the tablet's constraints.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-1644) Simplify IN-list predicate values based on tablet partition key or rowset PK bounds

2019-06-26 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-1644:
--
Summary: Simplify IN-list predicate values based on tablet partition key or 
rowset PK bounds  (was: Simplify IN-list predicate values based on tablet 
partition key bounds)

> Simplify IN-list predicate values based on tablet partition key or rowset PK 
> bounds
> ---
>
> Key: KUDU-1644
> URL: https://issues.apache.org/jira/browse/KUDU-1644
> Project: Kudu
>  Issue Type: Sub-task
>  Components: tablet
>Reporter: Dan Burkert
>Priority: Major
>
> When new scans are optimized by the tablet, the tablet's partition key bounds 
> aren't taken into account in order to remove predicates from the scan.  One 
> of the most important such optimizations is that IN-list predicates could 
> remove values based on the tablet's constraints.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2875) Convert scans with IN-list predicates on primary key prefix to multiple scans with equality predicates

2019-06-26 Thread Todd Lipcon (JIRA)
Todd Lipcon created KUDU-2875:
-

 Summary: Convert scans with IN-list predicates on primary key 
prefix to multiple scans with equality predicates
 Key: KUDU-2875
 URL: https://issues.apache.org/jira/browse/KUDU-2875
 Project: Kudu
  Issue Type: Sub-task
  Components: perf, tserver
Reporter: Todd Lipcon


In the case that an IN predicate is applied to a prefix of the primary key (or 
to column N of a composite key with equality predicates present on all columns 
< N), the scan can be split. For example, consider 'entity IN (1, 3) AND 
timestamp > 10'. In this case, we can internally convert to two scans 
'entity = 1 and timestamp > 10' and 'entity = 3 and timestamp > 10'. These can 
then be evaluated efficiently using the index.
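
A toy sketch of the rewrite ({{ScanSpec}} here is an illustrative stand-in, 
not the real class):

{code}
#include <vector>

// Illustrative stand-in for a per-scan predicate set.
struct ScanSpec {
  int entity;           // equality predicate from one IN-list value
  long min_timestamp;   // remaining range predicate (exclusive lower bound)
};

// 'entity IN (values) AND timestamp > ts' becomes one index-friendly
// equality scan per IN-list value.
std::vector<ScanSpec> SplitInList(const std::vector<int>& values, long ts) {
  std::vector<ScanSpec> scans;
  scans.reserve(values.size());
  for (int v : values) {
    scans.push_back(ScanSpec{v, ts});
  }
  return scans;
}
{code}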



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2854) Short circuit predicates on dictionary-coded columns

2019-06-21 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869715#comment-16869715
 ] 

Todd Lipcon commented on KUDU-2854:
---

bq. But currently we don't have a quick way to judge if there is any delta for 
the whole column(cfile) or the whole data block(part of cfile). 

I think we could make some changes in DeltaTracker::WrapIterator and 
DeltaTracker::NewDeltaIterator so that, if there are no relevant DeltaFiles, 
and the only relevant DMS is empty, we could avoid wrapping the base iterator. 
This is the special case of "no deltas at all" which is a little different than 
"no deltas for a specific column". Still, that's a useful optimization (and 
common that we have no deltas). KUDU-2855 would also make this easier to 
implement.

bq.  Is there any way we can easily judge if a column contain deltas or if a 
data block contain deltas?

After DeltaIterator::PrepareBatch is called, we can use MayHaveDeltas() to see 
on a per-block basis whether there were any deltas. We can extend this method 
to be MayHaveDeltas(col_idx). Note that we already use this to determine 
whether we can push down predicates into the block decoder here in 
DeltaApplier::MaterializeColumn:

{code}
  // Data with updates cannot be evaluated at the decoder-level.
  if (delta_iter_->MayHaveDeltas()) {
    ctx->SetDecoderEvalNotSupported();
    RETURN_NOT_OK(base_iter_->MaterializeColumn(ctx));
    RETURN_NOT_OK(delta_iter_->ApplyUpdates(ctx->col_idx(), ctx->block(),
                                            *ctx->sel()));
  } else {
    RETURN_NOT_OK(base_iter_->MaterializeColumn(ctx));
  }
{code}
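
For illustration, the same code path with the proposed per-column check; 
{{MayHaveDeltas(col_idx)}} does not exist yet, this is a sketch of what the 
fast path could become:

{code}
  // Hypothetical: only fall back to the slow path if some delta in the
  // prepared batch touches *this* column.
  if (delta_iter_->MayHaveDeltas(ctx->col_idx())) {
    ctx->SetDecoderEvalNotSupported();
    RETURN_NOT_OK(base_iter_->MaterializeColumn(ctx));
    RETURN_NOT_OK(delta_iter_->ApplyUpdates(ctx->col_idx(), ctx->block(),
                                            *ctx->sel()));
  } else {
    // No deltas for this column: decoder-level predicate evaluation stays
    // enabled and ApplyUpdates is skipped entirely.
    RETURN_NOT_OK(base_iter_->MaterializeColumn(ctx));
  }
{code}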

> Short circuit predicates on dictionary-coded columns
> 
>
> Key: KUDU-2854
> URL: https://issues.apache.org/jira/browse/KUDU-2854
> Project: Kudu
>  Issue Type: Improvement
>  Components: cfile, perf, tserver
>Reporter: Todd Lipcon
>Priority: Major
>
> In the common case that a column has no updates in a given DRS, if we see 
> that no entries in the dictionary match the predicate, we can short circuit 
> at a few layers:
> - we can store a flag in the cfile footer that indicates that all blocks are 
> dict-coded (ie there are no fallbacks). In that case, we can skip the whole 
> rowset
> - if a cfile is partially dict-encoded, we can skip any dict-coded blocks 
> without decoding the dictionary words



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2872) client_symbol-test fails with devtoolset-7 (gcc 7)

2019-06-20 Thread Todd Lipcon (JIRA)
Todd Lipcon created KUDU-2872:
-

 Summary: client_symbol-test fails with devtoolset-7 (gcc 7)
 Key: KUDU-2872
 URL: https://issues.apache.org/jira/browse/KUDU-2872
 Project: Kudu
  Issue Type: Bug
  Components: build
Affects Versions: 1.10.0
Reporter: Todd Lipcon


Compiling with devtoolset-7 (gcc 7.3.1) produces a client library with some bad 
symbols:

Found bad symbol 'operator delete(void*, unsigned long)'
Found bad symbol 'transaction clone for std::logic_error::what() const'
... 65 other symbols like the above

It seems these have something to do with transactional memory support in the 
libstdc++ used in devtoolset-7. Seems likely we need to mark these symbols 
hidden in the linker script.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2871) TLS 1.3 not supported by krpc

2019-06-20 Thread Todd Lipcon (JIRA)
Todd Lipcon created KUDU-2871:
-

 Summary: TLS 1.3 not supported by krpc
 Key: KUDU-2871
 URL: https://issues.apache.org/jira/browse/KUDU-2871
 Project: Kudu
  Issue Type: Bug
  Components: rpc, security
Reporter: Todd Lipcon


The TLS negotiation in our RPC protocol assumes a whole number of round trips 
between client and server. For TLS 1.3, the exchange has 1.5 round trips (the 
client is the last sender rather than the server) which breaks negotiation. 
Most tests thus fail with OpenSSL 1.1.1.

We should temporarily disable TLS 1.3 and then fix RPC to support this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2869) Internal compiler error compiling with devtoolset-7 on el7

2019-06-19 Thread Todd Lipcon (JIRA)
Todd Lipcon created KUDU-2869:
-

 Summary: Internal compiler error compiling with devtoolset-7 on el7
 Key: KUDU-2869
 URL: https://issues.apache.org/jira/browse/KUDU-2869
 Project: Kudu
  Issue Type: Bug
Affects Versions: 1.10.0
Reporter: Todd Lipcon


Trying to compile 1.10.0-rc0 with devtoolset-7 I hit the following error:
{code}
In file included from 
/data/1/todd/apache-kudu-1.10.0/src/kudu/gutil/macros.h:14:0,
 from 
/data/1/todd/apache-kudu-1.10.0/src/kudu/common/types.h:34,
 from 
/data/1/todd/apache-kudu-1.10.0/src/kudu/common/columnblock.h:28,
 from 
/data/1/todd/apache-kudu-1.10.0/src/kudu/common/columnblock.cc:18:
/data/1/todd/apache-kudu-1.10.0/src/kudu/gutil/port.h: In substitution of 
‘template::type* 
, bool USE_REINTERPRET> T UnalignedLoad(const void*) [with T = 
__int128; typename port_internal::enable_if_numeric::type*  = 
; bool USE_REINTERPRET = ]’:
/data/1/todd/apache-kudu-1.10.0/src/kudu/common/types.h:361:55:   required from 
here
/data/1/todd/apache-kudu-1.10.0/src/kudu/gutil/port.h:1196:73: internal 
compiler error: unexpected expression ‘LoadByReinterpretCast<__int128>’ of kind 
template_id_expr
  bool USE_REINTERPRET = port_internal::LoadByReinterpretCast()>
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-2092) Pull in krb5_is_config_principal() when running against older kerberos versions

2019-06-18 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2092.
---
   Resolution: Won't Fix
Fix Version/s: n/a

It looks like Impala's copy of krpc added this in IMPALA-4669 
(f51c4435c9b9ce04e5476d7b54f442e8b72db878) but then in IMPALA-7006 updated to a 
new version of krpc which didn't include the workaround. I guess Impala no 
longer needs to support krb5 versions as old as the one in EL5.

> Pull in krb5_is_config_principal() when running against older kerberos 
> versions
> ---
>
> Key: KUDU-2092
> URL: https://issues.apache.org/jira/browse/KUDU-2092
> Project: Kudu
>  Issue Type: Improvement
>  Components: security
>Affects Versions: 1.4.0
>Reporter: Sailesh Mukil
>Priority: Major
> Fix For: n/a
>
>
> On kerberos versions < krb5-1.8, the function krb5_is_config_principal() does 
> not exist. Since our code uses that function, and we dynamically link against 
> kerberos, we would be unable to build on systems that have these old kerberos 
> versions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-1028) Be more graceful about clock unsynch errors

2019-06-18 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-1028.
---
   Resolution: Duplicate
Fix Version/s: n/a

> Be more graceful about clock unsynch errors
> ---
>
> Key: KUDU-1028
> URL: https://issues.apache.org/jira/browse/KUDU-1028
> Project: Kudu
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: Feature Complete
>Reporter: David Alves
>Priority: Major
> Fix For: n/a
>
>
> We should likely refuse to execute the txns but not crash outright.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-1028) Be more graceful about clock unsynch errors

2019-06-18 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866864#comment-16866864
 ] 

Todd Lipcon commented on KUDU-1028:
---

I think we can probably close this since KUDU-1578 allows us to ride over 
interruptions for a while and KUDU-2242 added a wait at startup.

> Be more graceful about clock unsynch errors
> ---
>
> Key: KUDU-1028
> URL: https://issues.apache.org/jira/browse/KUDU-1028
> Project: Kudu
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: Feature Complete
>Reporter: David Alves
>Priority: Major
>
> We should likely refuse to execute the txns but not crash outright.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KUDU-1067) Add metrics for tablet row count, size on disk

2019-06-18 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned KUDU-1067:
-

Assignee: (was: Todd Lipcon)

> Add metrics for tablet row count, size on disk
> --
>
> Key: KUDU-1067
> URL: https://issues.apache.org/jira/browse/KUDU-1067
> Project: Kudu
>  Issue Type: Improvement
>  Components: impala, metrics, ops-tooling
>Affects Versions: Private Beta
>Reporter: Todd Lipcon
>Priority: Major
>
> Would be nice to expose these metrics up to CM. If we expose tablet-level 
> metrics, CM will aggregate by table for us.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-2381) Optimize DeltaMemStore for case of no matching deltas

2019-06-18 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2381.
---
   Resolution: Fixed
Fix Version/s: 1.11.0

> Optimize DeltaMemStore for case of no matching deltas
> -
>
> Key: KUDU-2381
> URL: https://issues.apache.org/jira/browse/KUDU-2381
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf, tablet
>Reporter: Todd Lipcon
>Assignee: ZhangYao
>Priority: Major
> Fix For: 1.11.0
>
>
> Currently in a scan workload which scans 280 columns I see DeltaMemStore 
> iteration taking up a significant amount of CPU in the scan, despite the fact 
> that the dataset has no updates. Of 1.6sec in 
> MaterializingIterator::NextBlock, we spent 0.61s in DMSIterator::PrepareBatch 
> and 0.14s in DMSIterator::MayHaveDeltas. So, about 46% of our time here is on 
> wasted work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2867) Optimize Timestamp varint decoding

2019-06-18 Thread Todd Lipcon (JIRA)
Todd Lipcon created KUDU-2867:
-

 Summary: Optimize Timestamp varint decoding
 Key: KUDU-2867
 URL: https://issues.apache.org/jira/browse/KUDU-2867
 Project: Kudu
  Issue Type: Improvement
Reporter: Todd Lipcon
Assignee: Todd Lipcon


In looking at a profile with some delta reads, I noticed ~11% of CPU going to 
GetMemcmpableVarint, much of which is from Timestamp decoding. In practice, any 
timestamp from HybridClock will need an 8-byte varint, so we can fast-path that 
case. (unfortunately it's too much effort to change the data format to not use 
a varint here)
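
A sketch of the fast path. The 0xFF-prefix layout below is an assumption
(sqlite4-style memcmpable varints mark a full 8-byte big-endian payload with a
leading 0xFF byte); the actual on-disk encoding may differ:

{code}
#include <cstdint>
#include <cstring>

// Returns true and decodes in one unaligned load if the value was encoded at
// full width; returns false to fall back to the general byte-by-byte decoder.
inline bool TryFastDecodeMemcmpableVarint64(const uint8_t** p,
                                            const uint8_t* end,
                                            uint64_t* out) {
  const uint8_t* s = *p;
  if (end - s >= 9 && s[0] == 0xFF) {  // assumed full-width marker
    uint64_t be;
    std::memcpy(&be, s + 1, 8);        // single unaligned 8-byte load
    *out = __builtin_bswap64(be);      // big-endian payload -> host order
    *p = s + 9;
    return true;
  }
  return false;
}
{code}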



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-2836) Maybe wrong memory size used to detect pressure

2019-06-18 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2836.
---
   Resolution: Fixed
Fix Version/s: 1.11.0

> Maybe wrong memory size used to detect pressure
> ---
>
> Key: KUDU-2836
> URL: https://issues.apache.org/jira/browse/KUDU-2836
> Project: Kudu
>  Issue Type: Improvement
>  Components: tserver
>Reporter: Yingchun Lai
>Assignee: Yingchun Lai
>Priority: Critical
> Fix For: 1.11.0
>
> Attachments: 选区_313.jpg
>
>
> One of my tserver, totally 128G memory, gflags: 
> {code:java}
> -memory_limit_hard_bytes=107374182475 (100G)  
> -memory_limit_soft_percentage=85 -memory_pressure_percentage=80{code}
> Memory used about 95%, "top" result like:
> {code:java}
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 8359 work 20 0 0.326t 0.116t 81780 S 727.9 94.6 230228:10 kudu_tablet_ser
> {code}
> That is kudu_tablet_server process used about 116G memory.
> On mem-trackers page, I find the "Total consumption" value is about 65G, much 
> lower than 116G.
> Then I logged in to the server and read the code to check whether the 
> memory-freeing operations work correctly. Unfortunately, the memory pressure 
> detection function (process_memory::UnderMemoryPressure) doesn't report that 
> it's under pressure, because the tcmalloc function GetNumericProperty(const 
> char* property, size_t* value) with the parameter 
> "generic.current_allocated_bytes" doesn't return the memory use reported by 
> the OS.
> [https://gperftools.github.io/gperftools/tcmalloc.html]
> {quote}
> |{{generic.current_allocated_bytes}}|Number of bytes used by the application. 
> This will not typically match the memory use reported by the OS, because it 
> does not include TCMalloc overhead or memory fragmentation.|
> {quote}
> This situation may mean that operations that would free memory are not 
> scheduled promptly; the OS may then run out of memory and kill the tserver 
> with an OOM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2866) CFileSet::Iterator::FinishBatch takes a lot of CPU for selective wide table scans

2019-06-18 Thread Todd Lipcon (JIRA)
Todd Lipcon created KUDU-2866:
-

 Summary: CFileSet::Iterator::FinishBatch takes a lot of CPU for 
selective wide table scans
 Key: KUDU-2866
 URL: https://issues.apache.org/jira/browse/KUDU-2866
 Project: Kudu
  Issue Type: Improvement
  Components: perf, tablet
Reporter: Todd Lipcon
Assignee: Todd Lipcon


Scanning a wide table with a predicate that doesn't ever match showed ~10% CPU 
usage in CFileSet::Iterator::FinishBatch. Looking at the assembly, it seems that 
the cost was in iterating over the vector<bool> indicating which columns had 
been prepared. In the case of a selective predicate, only one of the 200+ 
columns was prepared, and the iteration was quite slow. Instead of using the 
bitmap, we can just keep a list of the prepared column iterators.
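
Roughly what I have in mind (names illustrative, not the actual CFileSet code):

{code}
#include <vector>

struct ColumnIterator;                        // stand-in per-column iterator
inline void FinishColumn(ColumnIterator*) {}  // placeholder for real cleanup

// PrepareBatch records which iterators it actually prepared; FinishBatch then
// touches only those instead of scanning 200+ prepared flags.
struct BatchStateSketch {
  std::vector<ColumnIterator*> prepared_iters;

  void FinishBatch() {
    for (ColumnIterator* it : prepared_iters) {
      FinishColumn(it);  // 1 of 200+ columns when the predicate is selective
    }
    prepared_iters.clear();
  }
};
{code}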



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2847) Optimize iteration over selection vector in SerializeRowBlock

2019-06-17 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866241#comment-16866241
 ] 

Todd Lipcon commented on KUDU-2847:
---

After optimizing predicate evaluation (KUDU-2846), this is now about 20% of the 
scanner thread CPU for TPCH Q6.

> Optimize iteration over selection vector in SerializeRowBlock
> -
>
> Key: KUDU-2847
> URL: https://issues.apache.org/jira/browse/KUDU-2847
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf, tserver
>Reporter: Todd Lipcon
>Priority: Major
>
> Currently, SerializeRowBlock operates column-by-column, which means we have 
> to iterate over the selection bitmap once for each column. This code isn't 
> particularly well optimized -- in TPCH Q6, about 10% of CPU is spent in 
> BitmapFindFirst. We should look at alternate implementations here that better 
> amortize the bitmap iteration cost across all of the columns and generally 
> micro-optimize it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2846) Special case predicate evaluation for SIMD support

2019-06-17 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866238#comment-16866238
 ] 

Todd Lipcon commented on KUDU-2846:
---

After the optimization was committed, TPCH Q6 is down to about 8% of CPU spent 
in predicate evaluation. Looking at perf output, it seems like the compiler we 
usually use for release builds isn't actually using SIMD instructions -- it's 
still way faster than it used to be since it avoids branches, but we could 
probably get another 2x by hand-coding this stuff.

> Special case predicate evaluation for SIMD support
> --
>
> Key: KUDU-2846
> URL: https://issues.apache.org/jira/browse/KUDU-2846
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf, tserver
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Major
> Fix For: 1.10.0
>
>
> In the common case of predicate evaluation on primitive types, we can likely 
> improve performance as follows:
> - doing the comparisons while ignoring nullability and selectedness (any null 
> or unselected cells may have junk data, which causes a junk comparison result)
> - take the resulting bitmask of comparison results and use bitwise ops to 
> account for null/unselected cells to ensure that those result in a 'false' 
> comparison
> For some types of comparisons this can result in SIMD operations. For others, 
> at least, this will remove most branches from the path. This should speed up 
> queries like TPCH Q6 which spends 25% of its time in predicate evaluation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-1235) Add Get API

2019-06-17 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-1235:
--
Component/s: perf

> Add Get API
> ---
>
> Key: KUDU-1235
> URL: https://issues.apache.org/jira/browse/KUDU-1235
> Project: Kudu
>  Issue Type: New Feature
>  Components: client, perf, tablet, tserver
>Reporter: Binglin Chang
>Assignee: Binglin Chang
>Priority: Major
>  Labels: kudu-roadmap
> Attachments: perf-get.svg, perf-scan-opt.svg, perf-scan.svg
>
>
> A Get API is more user friendly and efficient if users just want a primary 
> key lookup.
> I set up a cluster and tested get/scan of a single row using YCSB; initial 
> tests show better performance for get.
> {noformat}
> kudu_workload:
> recordcount=100
> operationcount=100
> workload=com.yahoo.ycsb.workloads.CoreWorkload
> readallfields=false
> readproportion=1
> updateproportion=0
> scanproportion=0
> insertproportion=0
> requestdistribution=uniform
> use_get_api=false
> load:
> ./bin/ycsb load kudu -P workloads/kudu_workload -p sync_ops=false -p 
> pre_split_num_tablets=1 -p table_name=ycsb_wiki_example -p 
> masterQuorum='c3-kudu-tst-st01.bj:32600' -threads 100
> read test:
> ./bin/ycsb run kudu -P workloads/kudu_workload -p 
> masterQuorum='c3-kudu-tst-st01.bj:32600' -threads 100
> {noformat}
> Get API:
> [OVERALL], RunTime(ms), 21304.0
> [OVERALL], Throughput(ops/sec), 46939.54187007135
> [CLEANUP], Operations, 100.0
> [CLEANUP], AverageLatency(us), 423.57
> [CLEANUP], MinLatency(us), 24.0
> [CLEANUP], MaxLatency(us), 19327.0
> [CLEANUP], 95thPercentileLatency(us), 52.0
> [CLEANUP], 99thPercentileLatency(us), 18815.0
> [READ], Operations, 100.0
> [READ], AverageLatency(us), 2065.185152
> [READ], MinLatency(us), 134.0
> [READ], MaxLatency(us), 92159.0
> [READ], 95thPercentileLatency(us), 2391.0
> [READ], 99thPercentileLatency(us), 6359.0
> [READ], Return=0, 100
> Scan API:
> [OVERALL], RunTime(ms), 38259.0
> [OVERALL], Throughput(ops/sec), 26137.6408165399
> [CLEANUP], Operations, 100.0
> [CLEANUP], AverageLatency(us), 47.32
> [CLEANUP], MinLatency(us), 16.0
> [CLEANUP], MaxLatency(us), 1837.0
> [CLEANUP], 95thPercentileLatency(us), 41.0
> [CLEANUP], 99thPercentileLatency(us), 158.0
> [READ], Operations, 100.0
> [READ], AverageLatency(us), 3595.825249
> [READ], MinLatency(us), 139.0
> [READ], MaxLatency(us), 3139583.0
> [READ], 95thPercentileLatency(us), 3775.0
> [READ], 99thPercentileLatency(us), 7659.0
> [READ], Return=0, 100



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-1438) [java client] Upgrade to Netty 4

2019-06-17 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865880#comment-16865880
 ] 

Todd Lipcon commented on KUDU-1438:
---

Are we at risk of CVEs getting filed against Netty 3 with no one supporting it? 
That might increase the priority of this a bit.

> [java client] Upgrade to Netty 4
> 
>
> Key: KUDU-1438
> URL: https://issues.apache.org/jira/browse/KUDU-1438
> Project: Kudu
>  Issue Type: Improvement
>  Components: client
>Reporter: Jean-Daniel Cryans
>Priority: Major
>  Labels: kudu-roadmap
>
> Netty 4 promises better performance for certain workloads, it was an effort 
> mostly led by Twitter. See their blog post about it: 
> https://blog.twitter.com/2013/netty-4-at-twitter-reduced-gc-overhead
> asynchbase has a pull request for this, so our work should be similar: 
> https://github.com/OpenTSDB/asynchbase/pull/116/commits



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-81) rpc-test TestConnectionKeepalive failure

2019-06-17 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865864#comment-16865864
 ] 

Todd Lipcon commented on KUDU-81:
-

I seem to recall actually seeing this in the last year or two. I don't think 
it's fixed.

> rpc-test TestConnectionKeepalive failure
> 
>
> Key: KUDU-81
> URL: https://issues.apache.org/jira/browse/KUDU-81
> Project: Kudu
>  Issue Type: Bug
>  Components: rpc, test
>Affects Versions: M4
>Reporter: Todd Lipcon
>Priority: Trivial
>  Labels: flaky
>
> Saw this fail once:
> {code}
> /var/lib/jenkins/workspace/kudu-test/BUILD_TYPE/LEAKCHECK/label/centos6-kudu/src/rpc/rpc-test.cc:155:
>  Failure
> Value of: metrics.num_server_connections_
>   Actual: 1
> Expected: 0
> Server should have 0 server connections
> {code}
> Probably just a timing issue in the test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-364) UBSAN error in stopwatch

2019-06-17 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865861#comment-16865861
 ] 

Todd Lipcon commented on KUDU-364:
--

I haven't seen it in a while -- guessing it was a Linux kernel bug and we've 
mostly been running on newer kernels in the last few years? Let's close as 
Cannot Reproduce and someone can reopen it if we see it on the flaky test 
dashboard.

> UBSAN error in stopwatch
> 
>
> Key: KUDU-364
> URL: https://issues.apache.org/jira/browse/KUDU-364
> Project: Kudu
>  Issue Type: Bug
>  Components: util
>Affects Versions: M4.5
>Reporter: Todd Lipcon
>Priority: Minor
>
> Got the following error in a run of 
> TabletServerTest.TestCreateTablet_TabletExists:
> /var/lib/jenkins/workspace/kudu-test/BUILD_TYPE/ASAN/label/centos6-kudu/src/util/stopwatch.h:192:43:
>  runtime error: signed integer overflow: 18446744073 * 10 cannot be 
> represented in type 'long'



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-364) UBSAN error in stopwatch

2019-06-17 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-364.
--
   Resolution: Cannot Reproduce
Fix Version/s: n/a

> UBSAN error in stopwatch
> 
>
> Key: KUDU-364
> URL: https://issues.apache.org/jira/browse/KUDU-364
> Project: Kudu
>  Issue Type: Bug
>  Components: util
>Affects Versions: M4.5
>Reporter: Todd Lipcon
>Priority: Minor
> Fix For: n/a
>
>
> Got the following error in a run of 
> TabletServerTest.TestCreateTablet_TabletExists:
> /var/lib/jenkins/workspace/kudu-test/BUILD_TYPE/ASAN/label/centos6-kudu/src/util/stopwatch.h:192:43:
>  runtime error: signed integer overflow: 18446744073 * 10 cannot be 
> represented in type 'long'



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-705) YCSB OOMEs when running with lots of threads

2019-06-17 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-705.
--
   Resolution: Cannot Reproduce
Fix Version/s: n/a

Let's close this since it's been 4 years and who knows how much has changed 
since then.

> YCSB OOMEs when running with lots of threads
> 
>
> Key: KUDU-705
> URL: https://issues.apache.org/jira/browse/KUDU-705
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: Private Beta
>Reporter: Todd Lipcon
>Priority: Major
> Fix For: n/a
>
>
> I tried running a YCSB workload against a 100-node cluster with 64 threads. 
> It OOMEd pretty quickly with "GC overhead limit exceeded". Haven't 
> investigated where the memory usage went, yet, but we should probably have 
> some docs on memory requirements and buffering, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KUDU-887) Test and stabilize DeleteTable

2019-06-17 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned KUDU-887:


Assignee: (was: Todd Lipcon)

I think some of the stress tests suggested above would still be a good idea. 
I'd consider it relatively low priority, but I'd guess we'd find a bug or two 
here if we stressed it.

> Test and stabilize DeleteTable
> --
>
> Key: KUDU-887
> URL: https://issues.apache.org/jira/browse/KUDU-887
> Project: Kudu
>  Issue Type: Task
>  Components: master, test
>Affects Versions: Private Beta
>Reporter: Todd Lipcon
>Priority: Major
>
> DeleteTable isn't very well tested right now, especially in the replicated 
> case. We should add some stress tests in this area.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-2846) Special case predicate evaluation for SIMD support

2019-06-17 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2846.
---
   Resolution: Fixed
 Assignee: Todd Lipcon
Fix Version/s: 1.10.0

Calling this complete for now. A few ideas for future work while this is fresh 
in my mind:
- for string equality, we could likely vectorize the portion where we check 
length for equality
- for dictionary-coded columns, we can probably vectorize the predicate 
evaluation against the dictionary code words (KUDU-2854)
- for small IN lists, we can probably implement them more efficiently 
(KUDU-2853 has some ideas)


> Special case predicate evaluation for SIMD support
> --
>
> Key: KUDU-2846
> URL: https://issues.apache.org/jira/browse/KUDU-2846
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf, tserver
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Major
> Fix For: 1.10.0
>
>
> In the common case of predicate evaluation on primitive types, we can likely 
> improve performance as follows:
> - doing the comparisons while ignoring nullability and selectedness (any null 
> or unselected cells may have junk data, which causes a junk comparison result)
> - take the resulting bitmask of comparison results and use bitwise ops to 
> account for null/unselected cells to ensure that those result in a 'false' 
> comparison
> For some types of comparisons this can result in SIMD operations. For others, 
> at least, this will remove most branches from the path. This should speed up 
> queries like TPCH Q6 which spends 25% of its time in predicate evaluation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2856) Track memory and potentially limit count of open CFileReaders

2019-06-14 Thread Todd Lipcon (JIRA)
Todd Lipcon created KUDU-2856:
-

 Summary: Track memory and potentially limit count of open 
CFileReaders
 Key: KUDU-2856
 URL: https://issues.apache.org/jira/browse/KUDU-2856
 Project: Kudu
  Issue Type: Improvement
  Components: tserver
Reporter: Todd Lipcon


Currently, CFileReaders are lazy-initted on first access. But, even a 
non-initted CFileReader takes some memory (104 bytes currently), and initted 
ones never get de-initted. So, for long-running tservers with lots of data, in 
the limit, all cfiles will eventually be initted.

In one production tserver I'm looking at, there are 24 million CFileReaders 
(2385MB of heap) of which about 3.9 million are initted (about 1.1GB of heap). 
If we project this out and assume that eventually all CFileReaders get initted, 
we'd be up to about 7GB of heap for initted CFileReaders.

We should consider whether we can make the allocation of CFileReaders more 
lazy, and make it possible to deinit and deallocate readers that have not been 
recently/frequently used.
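
A hypothetical sketch of the lazier direction (types and names illustrative;
the LRU policy deciding which readers to deinit, and the ref-counting a real
version would need to deinit safely, are left out):

{code}
#include <memory>
#include <mutex>

struct ReaderState { /* parsed footer, indexes, etc. */ };

// Keep only a tiny stub per cfile; materialize or drop the heavy state on
// demand so long-running tservers don't accumulate initted readers forever.
class LazyCFileReaderSketch {
 public:
  ReaderState* GetOrInit() {
    std::lock_guard<std::mutex> l(lock_);
    if (!state_) state_ = std::make_unique<ReaderState>();  // init on demand
    return state_.get();
  }
  void Deinit() {
    std::lock_guard<std::mutex> l(lock_);
    state_.reset();  // frees the heavy state; the stub stays cheap
  }
 private:
  std::mutex lock_;
  std::unique_ptr<ReaderState> state_;
};
{code}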



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2855) Lazy-create DeltaMemStores on first update

2019-06-14 Thread Todd Lipcon (JIRA)
Todd Lipcon created KUDU-2855:
-

 Summary: Lazy-create DeltaMemStores on first update
 Key: KUDU-2855
 URL: https://issues.apache.org/jira/browse/KUDU-2855
 Project: Kudu
  Issue Type: Improvement
  Components: perf, tserver
Reporter: Todd Lipcon


Currently DeltaTracker::DoOpen creates a DeltaMemStore for each DRS. If we 
assume that most DRS don't have any deltas, this ends up wasting quite a bit of 
memory. Looking at one TS in a production cluster, about 1GB of the ~14G heap 
is being used by DMS. Of that, 464MB is data and the remainder is overhead.

This would likely speed up other code paths too, since a missing DMS lets them 
fast-path out any DMS-related work.
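
A hedged sketch of what lazy creation could look like (names illustrative):

{code}
#include <atomic>
#include <memory>
#include <mutex>

struct DeltaMemStore { /* elided */ };

// Create the DMS on first mutation rather than in DeltaTracker::DoOpen, so
// update-free rowsets pay no DMS memory and readers fast-path on nullptr.
class DeltaTrackerSketch {
 public:
  DeltaMemStore* GetOrCreateDms() {
    DeltaMemStore* dms = dms_.load(std::memory_order_acquire);
    if (dms != nullptr) return dms;            // fast path after first update
    std::lock_guard<std::mutex> l(create_lock_);
    if (dms_.load(std::memory_order_relaxed) == nullptr) {
      owned_ = std::make_unique<DeltaMemStore>();
      dms_.store(owned_.get(), std::memory_order_release);
    }
    return dms_.load(std::memory_order_relaxed);
  }
  bool MayHaveDeltas() const {                 // readers skip DMS work if false
    return dms_.load(std::memory_order_acquire) != nullptr;
  }
 private:
  std::mutex create_lock_;
  std::unique_ptr<DeltaMemStore> owned_;
  std::atomic<DeltaMemStore*> dms_{nullptr};
};
{code}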



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2854) Short circuit predicates on dictionary-coded columns

2019-06-14 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16864309#comment-16864309
 ] 

Todd Lipcon commented on KUDU-2854:
---

I think we should also consider fast-pathing equality checks for dictionary 
predicates. Currently, we evaluate the predicate against the dictionary and 
come up with a bitmap of matching values. Then, for each codeword, we test the 
corresponding bit in the bitmap. That bitmap testing likely requires a few 
cycles and a branch, and can't readily be done with SIMD outside of AVX512 
gather instructions.

In the case that we see that exactly one dictionary value matches the 
predicate, we can transform it into an equality predicate on the codewords, and 
then use the SIMD-optimized equality code path.

I don't have perf numbers on hand, but I know I'm often querying datasets using 
equality predicates on dictionary-coded columns.
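
A sketch of that rewrite (names illustrative; assumes the per-dictionary match
bitmap has already been computed):

{code}
#include <cstdint>
#include <vector>

struct CodewordPredicate {
  bool is_equality;  // true => evaluate as codeword == 'codeword'
  int32_t codeword;
};

// If exactly one dictionary word matches, evaluate the predicate as an int32
// equality on the codewords, which the SIMD-friendly path can handle.
bool TryRewriteAsCodewordEquality(const std::vector<bool>& dict_match_bitmap,
                                  CodewordPredicate* out) {
  int32_t match = -1;
  for (int32_t i = 0; i < static_cast<int32_t>(dict_match_bitmap.size()); i++) {
    if (!dict_match_bitmap[i]) continue;
    if (match != -1) return false;  // >1 match: keep the bitmap test
    match = i;
  }
  if (match == -1) return false;    // 0 matches: skip the whole block instead
  *out = {true, match};
  return true;
}
{code}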

> Short circuit predicates on dictionary-coded columns
> 
>
> Key: KUDU-2854
> URL: https://issues.apache.org/jira/browse/KUDU-2854
> Project: Kudu
>  Issue Type: Improvement
>  Components: cfile, perf, tserver
>Reporter: Todd Lipcon
>Priority: Major
>
> In the common case that a column has no updates in a given DRS, if we see 
> that no entries in the dictionary match the predicate, we can short circuit 
> at a few layers:
> - we can store a flag in the cfile footer that indicates that all blocks are 
> dict-coded (ie there are no fallbacks). In that case, we can skip the whole 
> rowset
> - if a cfile is partially dict-encoded, we can skip any dict-coded blocks 
> without decoding the dictionary words



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2854) Short circuit predicates on dictionary-coded columns

2019-06-14 Thread Todd Lipcon (JIRA)
Todd Lipcon created KUDU-2854:
-

 Summary: Short circuit predicates on dictionary-coded columns
 Key: KUDU-2854
 URL: https://issues.apache.org/jira/browse/KUDU-2854
 Project: Kudu
  Issue Type: Improvement
  Components: cfile, perf, tserver
Reporter: Todd Lipcon


In the common case that a column has no updates in a given DRS, if we see that 
no entries in the dictionary match the predicate, we can short circuit at a few 
layers:
- we can store a flag in the cfile footer that indicates that all blocks are 
dict-coded (ie there are no fallbacks). In that case, we can skip the whole 
rowset
- if a cfile is partially dict-encoded, we can skip any dict-coded blocks 
without decoding the dictionary words



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2853) Optimize IN predicates with small lists of primitives

2019-06-13 Thread Todd Lipcon (JIRA)
Todd Lipcon created KUDU-2853:
-

 Summary: Optimize IN predicates with small lists of primitives
 Key: KUDU-2853
 URL: https://issues.apache.org/jira/browse/KUDU-2853
 Project: Kudu
  Issue Type: Improvement
  Components: perf, tserver
Reporter: Todd Lipcon


We currently use std::binary_search for evaluating IN predicates. For IN lists 
over primitives, it's more efficient to avoid branches and instead do a linear 
search. Linear search can also be unrolled and implemented with SIMD 
comparisons. I would guess we can get a 4x or more perf improvement for 
relatively common cases like IN lists containing <10 elements.
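
A sketch of the branch-free linear search (int32 only, illustrative). The loop
has no data-dependent branches, so compilers can typically unroll or
auto-vectorize it; for the <10 element case it should handily beat
std::binary_search:

{code}
#include <cstddef>
#include <cstdint>

inline bool InListLinear(int32_t v, const int32_t* list, size_t n) {
  uint32_t found = 0;
  for (size_t i = 0; i < n; i++) {
    found |= static_cast<uint32_t>(v == list[i]);  // accumulate, never branch
  }
  return found != 0;
}
{code}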



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2381) Optimize DeltaMemStore for case of no matching deltas

2019-06-12 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-2381:
--
Target Version/s:   (was: 1.8.0)

> Optimize DeltaMemStore for case of no matching deltas
> -
>
> Key: KUDU-2381
> URL: https://issues.apache.org/jira/browse/KUDU-2381
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf, tablet
>Reporter: Todd Lipcon
>Priority: Major
>
> Currently in a scan workload which scans 280 columns I see DeltaMemStore 
> iteration taking up a significant amount of CPU in the scan, despite the fact 
> that the dataset has no updates. Of 1.6sec in 
> MaterializingIterator::NextBlock, we spent 0.61s in DMSIterator::PrepareBatch 
> and 0.14s in DMSIterator::MayHaveDeltas. So, about 46% of our time here is on 
> wasted work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KUDU-2381) Optimize DeltaMemStore for case of no matching deltas

2019-06-12 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned KUDU-2381:
-

Assignee: (was: Todd Lipcon)

> Optimize DeltaMemStore for case of no matching deltas
> -
>
> Key: KUDU-2381
> URL: https://issues.apache.org/jira/browse/KUDU-2381
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf, tablet
>Reporter: Todd Lipcon
>Priority: Major
>
> Currently in a scan workload which scans 280 columns I see DeltaMemStore 
> iteration taking up a significant amount of CPU in the scan, despite the fact 
> that the dataset has no updates. Of 1.6sec in 
> MaterializingIterator::NextBlock, we spent 0.61s in DMSIterator::PrepareBatch 
> and 0.14s in DMSIterator::MayHaveDeltas. So, about 46% of our time here is on 
> wasted work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-483) Non-ordered MRS scanning for better performance

2019-06-11 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-483.
--
   Resolution: Won't Fix
Fix Version/s: n/a

Doesn't seem like this would really be worth it.

> Non-ordered MRS scanning for better performance
> ---
>
> Key: KUDU-483
> URL: https://issues.apache.org/jira/browse/KUDU-483
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf, tablet
>Affects Versions: Backlog
>Reporter: Todd Lipcon
>Priority: Minor
> Fix For: n/a
>
>
> Here's a performance improvement I was thinking about recently:
> Currently, scanning the MRS is fairly expensive due to cache misses and the 
> per-leaf iteration code. Each leaf node has a pointer to its adjacent leaf, 
> but that may be somewhere fairly random in memory. Despite our best effort to 
> prefetch, we see a lot of cache misses in this code.
> For short (non fault-tolerant) scans, we don't need to yield rows in key 
> order. So, we could reorganize the memory layout of the MRS such that all 
> leaf nodes were allocated from a single arena-like structure. Then, iterating 
> could proceed in memory order rather than key order, and likely be a lot 
> faster. For fault tolerant (ordered) scans, we'd still have to use the btree 
> traversal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2846) Special case predicate evaluation for SIMD support

2019-06-11 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-2846:
--
Component/s: tserver
 perf

> Special case predicate evaluation for SIMD support
> --
>
> Key: KUDU-2846
> URL: https://issues.apache.org/jira/browse/KUDU-2846
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf, tserver
>Reporter: Todd Lipcon
>Priority: Major
>
> In the common case of predicate evaluation on primitive types, we can likely 
> improve performance as follows:
> - doing the comparisons while ignoring nullability and selectedness (any null 
> or unselected cells may have junk data, which causes a junk comparison result)
> - take the resulting bitmask of comparison results and use bitwise ops to 
> account for null/unselected cells to ensure that those result in a 'false' 
> comparison
> For some types of comparisons this can result in SIMD operations. For others, 
> at least, this will remove most branches from the path. This should speed up 
> queries like TPCH Q6 which spends 25% of its time in predicate evaluation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2847) Optimize iteration over selection vector in SerializeRowBlock

2019-06-11 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861508#comment-16861508
 ] 

Todd Lipcon commented on KUDU-2847:
---

Couple ideas:
- convert the bitmap to a list of set indexes once per block, and then iterate 
over the list of indexes once per column (see the sketch after this list)
- implement the above with some smarter code vs iterating over and testing each 
bit (eg 
https://lemire.me/blog/2018/03/08/iterating-over-set-bits-quickly-simd-edition/)
-- maybe worth special-casing some code for runs of 'all-set' and 'all-unset' 
since a lot of SQL predicates tend to be either very dense or very sparse
- codegen for SerializeRowBlock to operate by row instead of by column would 
probably yield large benefits as well
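
A sketch of the first idea (assumes the bitmap buffer is padded to 8 bytes,
LSB-first bit order, little-endian host):

{code}
#include <cstdint>
#include <cstring>
#include <vector>

// Decode the selection bitmap into row indexes once per block using
// count-trailing-zeros; cheap for both very sparse and very dense bitmaps.
std::vector<uint16_t> SelectedRows(const uint8_t* bitmap, int nrows) {
  std::vector<uint16_t> out;
  for (int base = 0; base < nrows; base += 64) {
    uint64_t word;
    std::memcpy(&word, bitmap + base / 8, sizeof(word));
    if (nrows - base < 64) {
      word &= (uint64_t{1} << (nrows - base)) - 1;  // mask bits past nrows
    }
    while (word != 0) {
      int bit = __builtin_ctzll(word);              // lowest set bit
      out.push_back(static_cast<uint16_t>(base + bit));
      word &= word - 1;                             // clear it
    }
  }
  return out;
}
{code}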

> Optimize iteration over selection vector in SerializeRowBlock
> -
>
> Key: KUDU-2847
> URL: https://issues.apache.org/jira/browse/KUDU-2847
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf, tserver
>Reporter: Todd Lipcon
>Priority: Major
>
> Currently, SerializeRowBlock operates column-by-column, which means we have 
> to iterate over the selection bitmap once for each column. This code isn't 
> particularly well optimized -- in TPCH Q6, about 10% of CPU is spent in 
> BitmapFindFirst. We should look at alternate implementations here that better 
> amortize the bitmap iteration cost across all of the columns and generally 
> micro-optimize it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2847) Optimize iteration over selection vector in SerializeRowBlock

2019-06-11 Thread Todd Lipcon (JIRA)
Todd Lipcon created KUDU-2847:
-

 Summary: Optimize iteration over selection vector in 
SerializeRowBlock
 Key: KUDU-2847
 URL: https://issues.apache.org/jira/browse/KUDU-2847
 Project: Kudu
  Issue Type: Improvement
  Components: perf, tserver
Reporter: Todd Lipcon


Currently, SerializeRowBlock operates column-by-column, which means we have to 
iterate over the selection bitmap once for each column. This code isn't 
particularly well optimized -- in TPCH Q6, about 10% of CPU is spent in 
BitmapFindFirst. We should look at alternate implementations here that better 
amortize the bitmap iteration cost across all of the columns and generally 
micro-optimize it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2836) Maybe wrong memory size used to detect pressure

2019-06-11 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861493#comment-16861493
 ] 

Todd Lipcon commented on KUDU-2836:
---

bq. I found many mem_tracker_->Release in the code, but is there a situation 
that it will never be called with a positive parameter?

It should be called whenever any tracked memory is released. In particular, 
this should happen whenever any memory store (DMS/MRS) flushes. There are some 
other sources of memory like scanners, compactions, and the WAL that don't 
currently use memtrackers, but it's surprising that you can go tens of minutes 
without calling Release.

That said, I think read-only workloads will exhibit the issue you're seeing, 
especially since the block cache churning may cause a lot of memory alloc/free 
pairs without actually ever calling Release. I think adding a background thread 
to periodically call GC is an easy fix that will serve as an extra defense.
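
A sketch of such a thread, assuming tcmalloc via gperftools (the 10s interval
is arbitrary; a real version would likely key off memory pressure):

{code}
#include <atomic>
#include <chrono>
#include <thread>

#include <gperftools/malloc_extension.h>

std::atomic<bool> g_stop_gc{false};

// Periodically return free pages to the OS even when no tracked memory store
// flushes (e.g. read-only workloads churning the block cache).
void MemoryGcThread() {
  while (!g_stop_gc.load(std::memory_order_relaxed)) {
    MallocExtension::instance()->ReleaseFreeMemory();
    std::this_thread::sleep_for(std::chrono::seconds(10));
  }
}
{code}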

> Maybe wrong memory size used to detect pressure
> ---
>
> Key: KUDU-2836
> URL: https://issues.apache.org/jira/browse/KUDU-2836
> Project: Kudu
>  Issue Type: Improvement
>  Components: tserver
>Reporter: Yingchun Lai
>Assignee: Yingchun Lai
>Priority: Critical
> Attachments: 选区_313.jpg
>
>
> One of my tserver, totally 128G memory, gflags: 
> {code:java}
> -memory_limit_hard_bytes=107374182475 (100G)  
> -memory_limit_soft_percentage=85 -memory_pressure_percentage=80{code}
> Memory used about 95%, "top" result like:
> {code:java}
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 8359 work 20 0 0.326t 0.116t 81780 S 727.9 94.6 230228:10 kudu_tablet_ser
> {code}
> That is kudu_tablet_server process used about 116G memory.
> On mem-trackers page, I find the "Total consumption" value is about 65G, much 
> lower than 116G.
> Then I logged in to the server and read the code to check whether the 
> memory-freeing operations work correctly. Unfortunately, the memory pressure 
> detection function (process_memory::UnderMemoryPressure) doesn't report that 
> it's under pressure, because the tcmalloc function GetNumericProperty(const 
> char* property, size_t* value) with the parameter 
> "generic.current_allocated_bytes" doesn't return the memory use reported by 
> the OS.
> [https://gperftools.github.io/gperftools/tcmalloc.html]
> {quote}
> |{{generic.current_allocated_bytes}}|Number of bytes used by the application. 
> This will not typically match the memory use reported by the OS, because it 
> does not include TCMalloc overhead or memory fragmentation.|
> {quote}
> This situation may mean that operations that would free memory are not 
> scheduled promptly; the OS may then run out of memory and kill the tserver 
> with an OOM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KUDU-2846) Special case predicate evaluation for SIMD support

2019-06-11 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861471#comment-16861471
 ] 

Todd Lipcon edited comment on KUDU-2846 at 6/11/19 8:41 PM:


example code that does SIMD comparisons for int equality 8 at a time:
{code}
#include <immintrin.h>

void TestFastCode(const ColumnBlock* cb, uint8_t* __restrict__ selvec, int32_t ref) {
  __m256i ref_vec = _mm256_set1_epi32(ref);
  const __m256i* data = (const __m256i*)cb->data_;
  for (int i = 0; i < cb->nrows_; i += 8) {
    __m256i m = _mm256_loadu_si256(data++);
    __m256i c = _mm256_cmpeq_epi32(m, ref_vec);
    uint8_t mask = _mm256_movemask_ps((__m256)c);
    *selvec++ &= mask;  // TODO: do we need to reverse the bits? also need to & with nulls
  }
  // TODO: handle case of rowblock length not a multiple of 8
}
{code}

I couldn't convince the auto-vectorizer to generate the same assembly as doing 
it by hand, but it may be worth implementing these for the most common 
predicates. Likely something like 10x improvement possible here vs our current 
branchy mess.


was (Author: tlipcon):
example code that does SIMD comparisons for int equality 8 at a time:
{code}
void TestFastCode(const ColumnBlock* cb, uint8_t* selvec, int32_t ref) {
  __m256i ref_vec = _mm256_set1_epi32(ref);
  for (int i = 0; i < cb->nrows_; i += 8) {
    __m256i m = _mm256_loadu_si256((const __m256i*)&cb->data_[i * sizeof(int32_t)]);
    __m256i c = _mm256_cmpeq_epi32(m, ref_vec);
    int mask = _mm256_movemask_ps((__m256)c);
    selvec[i/8] &= mask;  // TODO: do we need to reverse the bits? not sure.
  }
  // TODO: handle case of rowblock length not a multiple of 8, or can we enforce that?
}
{code}

I couldn't convince the auto-vectorizer to generate the same assembly as doing 
it by hand, but it may be worth implementing these for the most common 
predicates. Likely something like 10x improvement possible here vs our current 
branchy mess.

> Special case predicate evaluation for SIMD support
> --
>
> Key: KUDU-2846
> URL: https://issues.apache.org/jira/browse/KUDU-2846
> Project: Kudu
>  Issue Type: Improvement
>Reporter: Todd Lipcon
>Priority: Major
>
> In the common case of predicate evaluation on primitive types, we can likely 
> improve performance as follows:
> - doing the comparisons while ignoring nullability and selectedness (any null 
> or unselected cells may have junk data, which causes a junk comparison result)
> - take the resulting bitmask of comparison results and use bitwise ops to 
> account for null/unselected cells to ensure that those result in a 'false' 
> comparison
> For some types of comparisons this can result in SIMD operations. For others, 
> at least, this will remove most branches from the path. This should speed up 
> queries like TPCH Q6 which spends 25% of its time in predicate evaluation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2846) Special case predicate evaluation for SIMD support

2019-06-11 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861471#comment-16861471
 ] 

Todd Lipcon commented on KUDU-2846:
---

example code that does SIMD comparisons for int equality 8 at a time:
{code}
void TestFastCode(const ColumnBlock* cb, uint8_t* selvec, int32_t ref) {
  __m256i ref_vec = _mm256_set1_epi32(ref);
  for (int i = 0; i < cb->nrows_; i += 8) {
    __m256i m = _mm256_loadu_si256((const __m256i*)&cb->data_[i * sizeof(int32_t)]);
    __m256i c = _mm256_cmpeq_epi32(m, ref_vec);
    int mask = _mm256_movemask_ps((__m256)c);
    selvec[i/8] &= mask;  // TODO: do we need to reverse the bits? not sure.
  }
  // TODO: handle case of rowblock length not a multiple of 8, or can we enforce that?
}
{code}

I couldn't convince the auto-vectorizer to generate the same assembly as doing 
it by hand, but it may be worth implementing these for the most common 
predicates. Likely something like 10x improvement possible here vs our current 
branchy mess.

> Special case predicate evaluation for SIMD support
> --
>
> Key: KUDU-2846
> URL: https://issues.apache.org/jira/browse/KUDU-2846
> Project: Kudu
>  Issue Type: Improvement
>Reporter: Todd Lipcon
>Priority: Major
>
> In the common case of predicate evaluation on primitive types, we can likely 
> improve performance as follows:
> - doing the comparisons while ignoring nullability and selectedness (any null 
> or unselected cells may have junk data, which causes a junk comparison result)
> - take the resulting bitmask of comparison results and use bitwise ops to 
> account for null/unselected cells to ensure that those result in a 'false' 
> comparison
> For some types of comparisons this can result in SIMD operations. For others, 
> at least, this will remove most branches from the path. This should speed up 
> queries like TPCH Q6 which spends 25% of its time in predicate evaluation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2846) Special case predicate evaluation for SIMD support

2019-06-11 Thread Todd Lipcon (JIRA)
Todd Lipcon created KUDU-2846:
-

 Summary: Special case predicate evaluation for SIMD support
 Key: KUDU-2846
 URL: https://issues.apache.org/jira/browse/KUDU-2846
 Project: Kudu
  Issue Type: Improvement
Reporter: Todd Lipcon


In the common case of predicate evaluation on primitive types, we can likely 
improve performance as follows:
- doing the comparisons while ignoring nullability and selectedness (any null 
or unselected cells may have junk data, which causes a junk comparison result)
- take the resulting bitmask of comparison results and use bitwise ops to 
account for null/unselected cells to ensure that those result in a 'false' 
comparison

For some types of comparisons this can result in SIMD operations. For others, 
at least, this will remove most branches from the path. This should speed up 
queries like TPCH Q6 which spends 25% of its time in predicate evaluation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-2380) Selective predicates when selecting high number of columns burns CPU in SerializeRowBlock

2019-06-11 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2380.
---
   Resolution: Fixed
Fix Version/s: 1.8.0

> Selective predicates when selecting high number of columns burns CPU in 
> SerializeRowBlock
> -
>
> Key: KUDU-2380
> URL: https://issues.apache.org/jira/browse/KUDU-2380
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf, tserver
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Major
> Fix For: 1.8.0
>
> Attachments: pprof.kudu-tserver.samples.cpu.001.pb.gz
>
>
> Testing a table with 280 columns, I found the following performance 
> characteristic:
> - scanning all 280 columns with a selective non-key predicate which matches 0 
> rows took 8.28s
> - scanning no columns (count query) with the same predicate which matches 0 
> rows took 314ms.
> This suggests that we are burning 96% of our CPU doing useless work for this 
> query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-1907) Lock contention in SASL Kerberos negotiation

2019-06-11 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-1907:
--
Priority: Minor  (was: Major)

> Lock contention in SASL Kerberos negotiation
> 
>
> Key: KUDU-1907
> URL: https://issues.apache.org/jira/browse/KUDU-1907
> Project: Kudu
>  Issue Type: Bug
>  Components: perf, rpc
>Affects Versions: 1.3.0
>Reporter: Todd Lipcon
>Priority: Minor
>
> Dan wrote a negotiation benchmark and found that we can only do ~600 kerberos 
> negotiations/second regardless of the number of concurrent clients. Looking 
> at stack traces reveals that the SASL GSSAPI plugin adds locks around all 
> GSSAPI calls, though it seems like the underlying GSSAPI library is actually 
> thread-safe. (the locks are a relic from bygone days).
> Given that we are only using a small sliver of Cyrus-SASL functionality, we 
> should consider using libgssapi directly instead of via SASL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2381) Optimize DeltaMemStore for case of no matching deltas

2019-06-11 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861340#comment-16861340
 ] 

Todd Lipcon commented on KUDU-2381:
---

Looking again at a similar workload and still see behavior like this. An easy 
fix is likely to have a new bool like 'DeltaPreparer::_may_have_deltas' which 
can be set to true in all of the spots where deleted_, reinserted_, and 
updates_by_col_ are modified. We can use that to implement MayHaveDeltas easily 
and to short-circuit the clearing of updates_by_col_ for the common case of no 
deltas.
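
Illustrative sketch of the flag (member names approximate, not the actual
DeltaPreparer):

{code}
#include <map>
#include <vector>

class DeltaPreparerSketch {
 public:
  // Every path that records a delta sets the flag.
  void TrackDelete(int row) {
    deleted_.push_back(row);
    may_have_deltas_ = true;
  }
  void TrackUpdate(int col, int row) {
    updates_by_col_[col].push_back(row);
    may_have_deltas_ = true;
  }
  bool MayHaveDeltas() const { return may_have_deltas_; }

  void Reset() {
    if (!may_have_deltas_) return;  // common no-delta case: skip all clearing
    deleted_.clear();
    updates_by_col_.clear();
    may_have_deltas_ = false;
  }

 private:
  bool may_have_deltas_ = false;
  std::vector<int> deleted_;
  std::map<int, std::vector<int>> updates_by_col_;
};
{code}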

> Optimize DeltaMemStore for case of no matching deltas
> -
>
> Key: KUDU-2381
> URL: https://issues.apache.org/jira/browse/KUDU-2381
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf, tablet
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Major
>
> Currently in a scan workload which scans 280 columns I see DeltaMemStore 
> iteration taking up a significant amount of CPU in the scan, despite the fact 
> that the dataset has no updates. Of 1.6sec in 
> MaterializingIterator::NextBlock, we spent 0.61s in DMSIterator::PrepareBatch 
> and 0.14s in DMSIterator::MayHaveDeltas. So, about 46% of our time here is on 
> wasted work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2844) Avoid copying strings from dictionary or plain-encoded blocks

2019-06-11 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861328#comment-16861328
 ] 

Todd Lipcon commented on KUDU-2844:
---

Attached a flame graph of an RPC worker thread during TPCH Q1. We spend 24% of 
our time in BinaryDictBlock::CopyNextValues and 27% in SerializeRowBlock.

> Avoid copying strings from dictionary or plain-encoded blocks
> -
>
> Key: KUDU-2844
> URL: https://issues.apache.org/jira/browse/KUDU-2844
> Project: Kudu
>  Issue Type: Improvement
>  Components: cfile, perf
>Reporter: Todd Lipcon
>Priority: Major
> Attachments: fg.svg
>
>
> When scanning a plain or dictionary-encoded binary column, we currently loop 
> over each entry and copy the string into the destination RowBlock's arena. In 
> TPCH Q1, the scanner threads use a significant percentage of CPU doing this 
> copying, and it also increases CPU cache footprint which likely decreases 
> performance in downstream operations like predicate evaluation, merging, 
> result serialization, etc.
> Instead of doing this, we could "attach" the dictionary block (with 
> ref-counting) to the RowBlock and refer directly to the dictionary entry from 
> the RowBlock. When the RowBlock eventually is reset, we can drop the 
> reference. This should be safe because we never mutate indirect data in-place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2844) Avoid copying strings from dictionary or plain-encoded blocks

2019-06-11 Thread Todd Lipcon (JIRA)


 [ 
https://issues.apache.org/jira/browse/KUDU-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-2844:
--
Attachment: fg.svg

> Avoid copying strings from dictionary or plain-encoded blocks
> -
>
> Key: KUDU-2844
> URL: https://issues.apache.org/jira/browse/KUDU-2844
> Project: Kudu
>  Issue Type: Improvement
>  Components: cfile, perf
>Reporter: Todd Lipcon
>Priority: Major
> Attachments: fg.svg
>
>
> When scanning a plain or dictionary-encoded binary column, we currently loop 
> over each entry and copy the string into the destination RowBlock's arena. In 
> TPCH Q1, the scanner threads use a significant percentage of CPU doing this 
> copying, and it also increases CPU cache footprint which likely decreases 
> performance in downstream operations like predicate evaluation, merging, 
> result serialization, etc.
> Instead of doing this, we could "attach" the dictionary block (with 
> ref-counting) to the RowBlock and refer directly to the dictionary entry from 
> the RowBlock. When the RowBlock eventually is reset, we can drop the 
> reference. This should be safe because we never mutate indirect data in-place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

