[jira] [Commented] (KUDU-721) Support for DECIMAL type

2017-08-25 Thread Boris Tyukin (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16142561#comment-16142561
 ] 

Boris Tyukin commented on KUDU-721:
---

Are there any workarounds currently for storing precise numbers? I'm not sure 
strings will work, as Impala does not accept strings in aggregation functions.

> Support for DECIMAL type
> 
>
> Key: KUDU-721
> URL: https://issues.apache.org/jira/browse/KUDU-721
> Project: Kudu
>  Issue Type: New Feature
>  Components: client, tablet
>Affects Versions: Private Beta
>Reporter: Todd Lipcon
>Assignee: Will Berkeley
>Priority: Critical
>  Labels: kudu-roadmap
>
> [~mgrund] identified that without DECIMAL type, we're going to have issues 
> with a lot of the test tables that Impala uses. Also, since we're targeting 
> some financial applications, it seems pretty crucial. This JIRA is to track 
> the work necessary to support it. I'll write up a short design doc and 
> post it here.
> *Design doc*: 
> https://docs.google.com/a/cloudera.com/document/d/1wIebU0HxviRLHFF1fSS-E30KPWOrTI0xVYAMPE6kliU/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-1976) MiniKdc races between add_principal and kinit

2017-08-25 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-1976:
--
Target Version/s: 1.5.0
 Code Review: https://gerrit.cloudera.org/#/c/7850/

> MiniKdc races between add_principal and kinit
> -
>
> Key: KUDU-1976
> URL: https://issues.apache.org/jira/browse/KUDU-1976
> Project: Kudu
>  Issue Type: Bug
>  Components: java, test
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>Assignee: Todd Lipcon
>
> Noticed this in a recent precommit run, in ASAN mode, but I recall seeing it 
> before too. I think it happens from time to time.
> {noformat}
> 02:36:16.240 [DEBUG - main] (MiniKdc.java:339) executing 
> '/usr/sbin/kadmin.local -q add_principal -pw testuser testuser', env: '...'
> 02:36:16.268 [DEBUG - main] (MiniKdc.java:363) WARNING: no policy specified 
> for testu...@krbtest.com; defaulting to no policy
> 02:36:16.269 [DEBUG - main] (MiniKdc.java:363) Authenticating as principal 
> jenkins-slave/ad...@krbtest.com with password.
> 02:36:16.269 [DEBUG - main] (MiniKdc.java:363) Principal 
> "testu...@krbtest.com" created.
> 02:36:16.274 [DEBUG - main] (MiniKdc.java:339) executing '/usr/bin/kinit 
> testuser', env: '...'
> 02:36:16.277 [DEBUG - main] (MiniKdc.java:363) kinit: Client 
> 'testu...@krbtest.com' not found in Kerberos database while getting initial 
> credentials
> {noformat}
> I wonder why the kinit doesn't take immediately. I went looking for a "sync" 
> option for the KDC but couldn't find one; perhaps it's a bug in the version 
> of the KDC used in the test environment? (Ubuntu 14.04 IIRC).
> If there's no such thing, maybe we should retry kinit with some backoff until 
> it works (or fails for good).
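The retry suggestion in the last paragraph above could look like the following minimal sketch. All names and the backoff policy are illustrative assumptions, not Kudu's actual MiniKdc code:

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical sketch: retry an operation (e.g. kinit) with exponential
// backoff until it succeeds or a deadline expires.
bool RetryWithBackoff(const std::function<bool()>& op,
                      std::chrono::milliseconds deadline) {
  const auto start = std::chrono::steady_clock::now();
  auto delay = std::chrono::milliseconds(10);
  while (true) {
    if (op()) {
      return true;  // kinit succeeded
    }
    if (std::chrono::steady_clock::now() + delay - start > deadline) {
      return false;  // give up: it "fails for good"
    }
    std::this_thread::sleep_for(delay);
    delay *= 2;  // exponential backoff between attempts
  }
}
```

The caller would wrap the kinit invocation in the lambda and treat a false return as a hard failure.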





[jira] [Updated] (KUDU-1976) MiniKdc races between add_principal and kinit

2017-08-25 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-1976:
--
Status: In Review  (was: Open)

> MiniKdc races between add_principal and kinit
> -
>
> Key: KUDU-1976
> URL: https://issues.apache.org/jira/browse/KUDU-1976
> Project: Kudu
>  Issue Type: Bug
>  Components: java, test
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>Assignee: Todd Lipcon





[jira] [Assigned] (KUDU-1976) MiniKdc races between add_principal and kinit

2017-08-25 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned KUDU-1976:
-

Assignee: Todd Lipcon

> MiniKdc races between add_principal and kinit
> -
>
> Key: KUDU-1976
> URL: https://issues.apache.org/jira/browse/KUDU-1976
> Project: Kudu
>  Issue Type: Bug
>  Components: java, test
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>Assignee: Todd Lipcon





[jira] [Updated] (KUDU-2114) Master asks tservers to re-delete tombstoned tablets when reported

2017-08-25 Thread Mike Percy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2114:
-
Code Review: http://gerrit.cloudera.org:8080/7842

> Master asks tservers to re-delete tombstoned tablets when reported
> --
>
> Key: KUDU-2114
> URL: https://issues.apache.org/jira/browse/KUDU-2114
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus, master, tserver
>Affects Versions: 1.5.0
>Reporter: Adar Dembo
>Assignee: Mike Percy
>Priority: Blocker
>
> Commit 5bca7d8 changed the behavior of tombstoned replicas such that they now 
> retain RaftConsensus instances despite being in the TOMBSTONED state. This 
> means that some additional consensus-related state is included in their 
> tablet report entries when a full tablet report is sent to the master. The 
> master evaluates this consensus-related state when considering whether an 
> evicted replica should be deleted, but it does not consider the TOMBSTONED 
> state. As a result, the master notices that these tombstoned replicas have 
> been evicted, and asks the hosting tserver to delete them again. This will 
> happen any time there is a tablet report.
> This needs to be fixed, whether by excluding tombstone consensus state from 
> tablet reports, or by changing the master to consider the tablet's overall 
> state when deciding whether to delete it.
> When observed on a live cluster, it was further observed that the tablet 
> deletion requests were rather expensive. It appears that a DeleteTablet RPC 
> on a tombstone is not a no-op; it always flushes the superblock twice, which 
> generates two fsyncs. This should also be addressed.
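The second fix suggested above (having the master consider the tablet's overall state) might look roughly like this. The enum and function names are hypothetical, not Kudu's actual master code:

```cpp
// Sketch: before asking a tserver to delete a replica, the master considers
// the replica's overall data state, not just whether it was evicted from the
// Raft config.
enum class TabletDataState { RUNNING, TOMBSTONED, DELETED };

bool ShouldSendDeleteRequest(bool evicted_from_config, TabletDataState state) {
  // A replica that is already TOMBSTONED (or fully DELETED) needs no further
  // DeleteTablet RPC; re-sending one costs two superblock flushes per report.
  return evicted_from_config && state == TabletDataState::RUNNING;
}
```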





[jira] [Resolved] (KUDU-2102) PosixRWFile::Sync doesn't guarantee durability when used with multiple threads

2017-08-25 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo resolved KUDU-2102.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Fixed in commit 9a07b2fedcc0beacc76cb1c656be84c0c99c9da9.

> PosixRWFile::Sync doesn't guarantee durability when used with multiple threads
> --
>
> Key: KUDU-2102
> URL: https://issues.apache.org/jira/browse/KUDU-2102
> Project: Kudu
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.5.0
>Reporter: Adar Dembo
>Assignee: Hao Hao
> Fix For: 1.5.0
>
>
> PosixRWFile uses an AtomicBool to "optimize" calls to Sync(). It works as 
> follows:
> # The bool starts as false.
> # It is set to true in WriteV().
> # In Sync(), we CAS it from true to false. If the CAS succeeds, we actually 
> do the fsync().
> The idea is that if two threads call Sync() at the same time, only one will 
> actually do the fsync(). However, there's a problem here: the "losing" thread 
> returns from Sync() early and operates under the assumption that the file's 
> data has been made durable, even though it is still in the process of being 
> synchronized to disk.
> We have two options:
> # Preserve the optimization but fix it so that the losing thread(s) wait for 
> the "winning" thread to finish the fsync. This can be done with some more 
> synchronization primitives (off the top of my head: a lock, a condition 
> variable, and another boolean).
> # Remove the optimization and let the losing thread(s) perform additional 
> fsyncs.
> To measure the effect of the optimization, I wrote a test program that opens 
> a file and fsyncs it 1000 times. I ran it on an el6.6 box, on spinning disks 
> mounted as xfs and ext4, and on files that were empty and 10G in size 
> (dropping caches first). I measured the cost of an fsync to be around 200 
> microseconds, suggesting that no I/O is being performed and that the overhead 
> is purely syscall-related.
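Option 1 above could be sketched roughly as follows. The class and its names are illustrative, not the real PosixRWFile implementation: losing threads block until the winning thread's fsync() completes, and re-check in case new writes arrived meanwhile.

```cpp
#include <condition_variable>
#include <mutex>

// Sketch of the "lock, condition variable, and another boolean" approach.
class SyncCoordinator {
 public:
  // Called from WriteV(): marks the file as having unsynced data.
  void MarkDirty() {
    std::lock_guard<std::mutex> l(mu_);
    dirty_ = true;
  }

  // Called from Sync(); 'do_fsync' stands in for the real fsync() syscall.
  template <typename FsyncFn>
  void Sync(FsyncFn do_fsync) {
    std::unique_lock<std::mutex> l(mu_);
    while (true) {
      if (sync_in_progress_) {
        // Losing thread: wait for the winner to finish, then re-check in
        // case the file was dirtied again while we waited.
        cv_.wait(l, [this] { return !sync_in_progress_; });
        continue;
      }
      if (!dirty_) {
        return;  // nothing to sync
      }
      sync_in_progress_ = true;
      dirty_ = false;
      l.unlock();
      do_fsync();  // the expensive syscall, done outside the lock
      l.lock();
      sync_in_progress_ = false;
      cv_.notify_all();
      return;
    }
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool dirty_ = false;
  bool sync_in_progress_ = false;
};
```

This preserves the at-most-one-concurrent-fsync property while ensuring no caller returns before its data has been handed to fsync().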





[jira] [Updated] (KUDU-2114) Master asks tservers to re-delete tombstoned tablets when reported

2017-08-25 Thread Mike Percy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2114:
-
Description: 
Commit 5bca7d8 changed the behavior of tombstoned replicas such that they now 
retain RaftConsensus instances despite being in the TOMBSTONED state. This 
means that some additional consensus-related state is included in their tablet 
report entries when a full tablet report is sent to the master. The master 
evaluates this consensus-related state when considering whether an evicted 
replica should be deleted, but it does not consider the TOMBSTONED state. As a 
result, the master notices that these tombstoned replicas have been evicted, 
and asks the hosting tserver to delete them again. This will happen any time 
there is a tablet report.

This needs to be fixed, whether by excluding tombstone consensus state from 
tablet reports, or by changing the master to consider the tablet's overall 
state when deciding whether to delete it.

When observed on a live cluster, it was further observed that the tablet 
deletion requests were rather expensive. It appears that a DeleteTablet RPC on 
a tombstone is not a no-op; it always flushes the superblock twice, which 
generates two fsyncs. This should also be addressed.

  was:
Commit 5bca7d8 changed the behavior of tombstoned replicas such that they now 
retain RaftConsensus instances despite being in the TOMBSTONED state. This 
means that some additional consensus-related state is included in their tablet 
report entries when a full tablet report is sent to the master. The master 
evaluates this consensus-related state when considering whether an evicted 
replica should be deleted, but it does not consider the TOMBSTONED state. As a 
result, the master notices that these tombstoned replicas have been evicted, 
and asks the hosting tserver to delete them. Over, and over, and over.

This needs to be fixed, whether by excluding tombstone consensus state from 
tablet reports, or by changing the master to consider the tablet's overall 
state when deciding whether to delete it.

When observed on a live cluster, it was further observed that the tablet 
deletion requests were rather expensive. It appears that a DeleteTablet RPC on 
a tombstone is not a no-op; it always flushes the superblock twice, which 
generates two fsyncs. This should also be addressed.

Summary: Master asks tservers to re-delete tombstoned tablets when 
reported  (was: Master asks tservers to delete tombstoned tablets forever)

> Master asks tservers to re-delete tombstoned tablets when reported
> --
>
> Key: KUDU-2114
> URL: https://issues.apache.org/jira/browse/KUDU-2114
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus, master, tserver
>Affects Versions: 1.5.0
>Reporter: Adar Dembo
>Assignee: Mike Percy
>Priority: Blocker





[jira] [Commented] (KUDU-1521) Flakiness in TestAsyncKuduSession

2017-08-25 Thread Adar Dembo (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16142382#comment-16142382
 ] 

Adar Dembo commented on KUDU-1521:
--

bq. Adar Dembo wanna try looping dist test and see if this is still an issue?

The issue I originally wrote about (in test()) still exists. I wasn't able to 
figure out the other issues when I first reported this bug.

> Flakiness in TestAsyncKuduSession
> -
>
> Key: KUDU-1521
> URL: https://issues.apache.org/jira/browse/KUDU-1521
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.9.1
>Reporter: Adar Dembo
>
> I've been trying to parse the various failures in 
> http://104.196.14.100/job/kudu-gerrit/2270/BUILD_TYPE=RELEASE. Here's what I 
> see in the test:
> The way test() tests AUTO_FLUSH_BACKGROUND is inherently flaky; a delay while 
> running test code will give the background flush task a chance to fire when 
> the test code doesn't expect it. I've seen this lead to no 
> PleaseThrottleException, but I suspect the first block of test code dealing 
> with background flushes is flaky too (since it's testing elapsed time).
> There are also some test failures that I can't figure out. I've pasted them 
> below for posterity:
> {noformat}
> 03:52:14 
> testGetTableLocationsErrorCauseSessionStuck(org.kududb.client.TestAsyncKuduSession)
>   Time elapsed: 100.009 sec  <<< ERROR!
> 03:52:14 java.lang.Exception: test timed out after 10 milliseconds
> 03:52:14  at java.lang.Object.wait(Native Method)
> 03:52:14  at java.lang.Object.wait(Object.java:503)
> 03:52:14  at com.stumbleupon.async.Deferred.doJoin(Deferred.java:1136)
> 03:52:14  at com.stumbleupon.async.Deferred.join(Deferred.java:1019)
> 03:52:14  at 
> org.kududb.client.TestAsyncKuduSession.testGetTableLocationsErrorCauseSessionStuck(TestAsyncKuduSession.java:133)
> 03:52:14 
> 03:52:14 
> testBatchErrorCauseSessionStuck(org.kududb.client.TestAsyncKuduSession)  Time 
> elapsed: 0.199 sec  <<< ERROR!
> 03:52:14 org.kududb.client.MasterErrorException: Server[Kudu Master - 
> 127.13.215.1:64030] NOT_FOUND[code 1]: The table was deleted: Table deleted 
> at 2016-07-09 03:50:24 UTC
> 03:52:14  at 
> org.kududb.client.TabletClient.dispatchMasterErrorOrReturnException(TabletClient.java:533)
> 03:52:14  at org.kududb.client.TabletClient.decode(TabletClient.java:463)
> 03:52:14  at org.kududb.client.TabletClient.decode(TabletClient.java:83)
> 03:52:14  at 
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:500)
> 03:52:14  at 
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
> 03:52:14  at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
> 03:52:14  at 
> org.kududb.client.TabletClient.handleUpstream(TabletClient.java:638)
> 03:52:14  at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
> 03:52:14  at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
> 03:52:14  at 
> org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184)
> 03:52:14  at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
> 03:52:14  at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
> 03:52:14  at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
> 03:52:14  at 
> org.kududb.client.AsyncKuduClient$TabletClientPipeline.sendUpstream(AsyncKuduClient.java:1877)
> 03:52:14  at 
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
> 03:52:14  at 
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
> 03:52:14  at 
> org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
> 03:52:14  at 
> org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
> 03:52:14  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 03:52:14  at 
> 

[jira] [Updated] (KUDU-2055) Coalesce hole punching when deleting groups of blocks

2017-08-25 Thread Hao Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hao Hao updated KUDU-2055:
--
Issue Type: Improvement  (was: Bug)

> Coalesce hole punching when deleting groups of blocks
> -
>
> Key: KUDU-2055
> URL: https://issues.apache.org/jira/browse/KUDU-2055
> Project: Kudu
>  Issue Type: Improvement
>  Components: fs
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>Assignee: Hao Hao
>
> We should implement hole punch coalescing in the log block manager. I'm not 
> sure whether it'll be faster than per-block hole punching, but it should 
> definitely be no worse than what we do today (and it will definitely reduce 
> the number of syscalls).
> To do this we need awareness within the LBM of when we're deleting groups of 
> blocks, which should come through KUDU-1943.
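The coalescing itself might look like this minimal sketch (illustrative, not the LBM's actual code): merge adjacent or overlapping dead-block byte ranges so a contiguous run of deleted blocks costs one hole-punch syscall instead of one per block.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using Range = std::pair<int64_t, int64_t>;  // {offset, length}

std::vector<Range> CoalesceRanges(std::vector<Range> ranges) {
  std::sort(ranges.begin(), ranges.end());
  std::vector<Range> out;
  for (const Range& r : ranges) {
    if (!out.empty() && out.back().first + out.back().second >= r.first) {
      // Contiguous or overlapping with the previous range: extend it.
      int64_t end = std::max(out.back().first + out.back().second,
                             r.first + r.second);
      out.back().second = end - out.back().first;
    } else {
      out.push_back(r);
    }
  }
  return out;
}
```

Each resulting range would then be punched with a single fallocate(FALLOC_FL_PUNCH_HOLE) call on the container data file.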





[jira] [Commented] (KUDU-1520) Possible race between alter schema lock release and tablet shutdown

2017-08-25 Thread Adar Dembo (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16142351#comment-16142351
 ] 

Adar Dembo commented on KUDU-1520:
--

Yes. I looked through the AlterSchema/AlterSchemaState/TransactionDriver code 
and I think it's still vulnerable to this race.

> Possible race between alter schema lock release and tablet shutdown
> ---
>
> Key: KUDU-1520
> URL: https://issues.apache.org/jira/browse/KUDU-1520
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 0.9.1
>Reporter: Adar Dembo
>
> I've been running a new stress test that hammers a cluster with concurrent alter 
> and delete table requests, and one of my test runs failed with the following:
> {noformat}
> F0707 18:59:34.311122   373 rw_semaphore.h:145] Check failed: 
> base::subtle::NoBarrier_Load(_) == kWriteFlag (0 vs. 2147483648) 
> *** Check failure stack trace: ***
> @ 0x7f86cd37df5d  google::LogMessage::Fail() at ??:0
> @ 0x7f86cd37fe5d  google::LogMessage::SendToLog() at ??:0
> @ 0x7f86cd37da99  google::LogMessage::Flush() at ??:0
> @ 0x7f86cd3808ff  google::LogMessageFatal::~LogMessageFatal() at ??:0
> @ 0x7f86d4f77c78  kudu::rw_semaphore::unlock() at ??:0
> @ 0x7f86d3728de0  std::unique_lock<>::unlock() at ??:0
> @ 0x7f86d3727192  std::unique_lock<>::~unique_lock() at ??:0
> @ 0x7f86d3725582  
> kudu::tablet::AlterSchemaTransactionState::~AlterSchemaTransactionState() at 
> ??:0
> @ 0x7f86d37255be  
> kudu::tablet::AlterSchemaTransactionState::~AlterSchemaTransactionState() at 
> ??:0
> @ 0x7f86d4f68dce  std::default_delete<>::operator()() at ??:0
> @ 0x7f86d4f670b9  std::unique_ptr<>::~unique_ptr() at ??:0
> @ 0x7f86d374510e  
> kudu::tablet::AlterSchemaTransaction::~AlterSchemaTransaction() at ??:0
> @ 0x7f86d374514a  
> kudu::tablet::AlterSchemaTransaction::~AlterSchemaTransaction() at ??:0
> @ 0x7f86d373f532  kudu::DefaultDeleter<>::operator()() at ??:0
> @ 0x7f86d373df4a  
> kudu::internal::gscoped_ptr_impl<>::~gscoped_ptr_impl() at ??:0
> @ 0x7f86d373d552  gscoped_ptr<>::~gscoped_ptr() at ??:0
> @ 0x7f86d373d580  
> kudu::tablet::TransactionDriver::~TransactionDriver() at ??:0
> @ 0x7f86d3740ab4  kudu::RefCountedThreadSafe<>::DeleteInternal() at 
> ??:0
> @ 0x7f86d3740405  
> kudu::DefaultRefCountedThreadSafeTraits<>::Destruct() at ??:0
> @ 0x7f86d373f928  kudu::RefCountedThreadSafe<>::Release() at ??:0
> @ 0x7f86d373e769  scoped_refptr<>::~scoped_refptr() at ??:0
> @ 0x7f86d37397cd  kudu::tablet::TabletPeer::SubmitAlterSchema() at 
> ??:0
> @ 0x7f86d4f4e070  
> kudu::tserver::TabletServiceAdminImpl::AlterSchema() at ??:0
> @ 0x7f86d27a4e92  
> _ZZN4kudu7tserver26TabletServerAdminServiceIfC4ERK13scoped_refptrINS_12MetricEntityEEENKUlPKN6google8protobuf7MessageEPS9_PNS_3rpc10RpcContextEE1_clESB_SC_SF_
>  at ??:0
> @ 0x7f86d27a5d96  
> _ZNSt17_Function_handlerIFvPKN6google8protobuf7MessageEPS2_PN4kudu3rpc10RpcContextEEZNS6_7tserver26TabletServerAdminServiceIfC4ERK13scoped_refptrINS6_12MetricEntityEEEUlS4_S5_S9_E1_E9_M_invokeERKSt9_Any_dataS4_S5_S9_
>  at ??:0
> @ 0x7f86d22ce6e4  std::function<>::operator()() at ??:0
> @ 0x7f86d22ce19b  kudu::rpc::GeneratedServiceIf::Handle() at ??:0
> @ 0x7f86d22d0a97  kudu::rpc::ServicePool::RunThread() at ??:0
> @ 0x7f86d22d1d45  boost::_mfi::mf0<>::operator()() at ??:0
> @ 0x7f86d22d1b6c  boost::_bi::list1<>::operator()<>() at ??:0
> @ 0x7f86d22d1a61  boost::_bi::bind_t<>::operator()() at ??:0
> @ 0x7f86d22d1998  
> boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
> {noformat}
> After looking through the code a bit, I suspect this happened because, in the 
> event of failure, the AlterSchema transaction releases the tablet's schema 
> lock implicitly (i.e. when AlterSchemaTransactionState is destroyed) _after_ 
> the transaction itself is removed from the driver's TransactionTracker. Thus, 
> the WaitForAllToFinish() performed during the tablet shutdown process thinks 
> all the transactions are done and proceeds to free tablet state. Later, the 
> last reference to the transaction is released (in 
> TabletPeer::SubmitAlterSchema), the transaction is destroyed, and we try to 
> unlock a lock whose memory has already been freed.
> If this analysis is correct, the broken invariant is: once the transaction 
> has been released from the tracker, it may no longer access any tablet state.
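The proposed invariant can be illustrated with a toy sketch (all types here are hypothetical stand-ins, not Kudu's actual transaction code): on the failure path, release every tablet-owned resource before deregistering from the tracker, so WaitForAllToFinish() cannot return while a transaction still holds tablet state.

```cpp
// Minimal stand-ins for the real classes involved.
struct Tablet {
  bool schema_locked = false;
};

struct AlterSchemaTransaction {
  Tablet* tablet;
  void ReleaseSchemaLock() {
    tablet->schema_locked = false;
    tablet = nullptr;  // no tablet access is legal past this point
  }
};

class TransactionTracker {
 public:
  void Add() { ++active_; }
  void Release() { --active_; }
  bool AllFinished() const { return active_ == 0; }
 private:
  int active_ = 0;
};

// Safe ordering on the failure path: step 1 is the last touch of tablet
// state; only after it may step 2 let tablet shutdown proceed.
void AbortTransaction(AlterSchemaTransaction* txn,
                      TransactionTracker* tracker) {
  txn->ReleaseSchemaLock();  // 1. done with tablet-owned resources
  tracker->Release();        // 2. shutdown may now free the tablet
}
```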





[jira] [Commented] (KUDU-2014) Explore additional approaches to improve LBM startup time

2017-08-25 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16142344#comment-16142344
 ] 

Todd Lipcon commented on KUDU-2014:
---

Another thing I noticed on a cluster yesterday is that the tserver actually got 
CPU-bound in inserting block records into the block map. This is effectively 
single-threaded since we hold a lock around the block map. Sharding the data 
structure could probably be a nice boost in the case that it's not IO-bound.
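The sharding idea could be sketched as follows (illustrative names, not Kudu's actual block map): hash each key to one of N shards, each with its own lock, so concurrent inserts contend on 1/N of the structure instead of one global mutex.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

template <typename K, typename V, size_t kNumShards = 16>
class ShardedMap {
 public:
  void Insert(const K& key, V value) {
    Shard& s = ShardFor(key);
    std::lock_guard<std::mutex> l(s.mu);
    s.map.emplace(key, std::move(value));
  }

  bool Contains(const K& key) {
    Shard& s = ShardFor(key);
    std::lock_guard<std::mutex> l(s.mu);
    return s.map.count(key) > 0;
  }

 private:
  struct Shard {
    std::mutex mu;
    std::unordered_map<K, V> map;
  };

  // Pick the shard by hashing the key; kNumShards a power of two keeps
  // the modulo cheap.
  Shard& ShardFor(const K& key) {
    return shards_[std::hash<K>{}(key) % kNumShards];
  }

  Shard shards_[kNumShards];
};
```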

> Explore additional approaches to improve LBM startup time
> -
>
> Key: KUDU-2014
> URL: https://issues.apache.org/jira/browse/KUDU-2014
> Project: Kudu
>  Issue Type: Improvement
>  Components: fs
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>  Labels: data-scalability
>
> The fix for KUDU-1549 added support for deleting full log block manager 
> containers with no live blocks, and for compacting container metadata to omit 
> CREATE/DELETE record pairs. Both of these will help reduce the amount of 
> metadata that must be read at startup. However, there's more we can do to 
> help; this JIRA captures some additional ideas worth exploring (if/when LBM 
> startup once again becomes intolerable):
> In [this 
> gerrit|https://gerrit.cloudera.org/#/c/6826/2/src/kudu/fs/log_block_manager.cc@90],
>  Todd made the case that container metadata processing is seek-dominant:
> {quote}
> looking at a data/ dir on a cluster that has been around for quite some time, 
> most of the metadata files seem to be around 400KB. Assuming 100MB/sec 
> sequential throughput and 10ms seek, it definitely seems like the startup 
> time would be seek-dominated (10 or 20ms seek depending whether various 
> internal metadata pages are hot in cache, plus only 4ms of sequential read 
> time). 
> {quote}
> We theorized several ways to reduce seeking, all focused on reducing the 
> number of discrete container metadata files read at startup:
> # Raise the container max data file size. This won't help on older versions 
> of el6 with ext4, but will help everywhere else. It makes sense for the max 
> data file size to be a function of the disk size anyway. And it's a pretty 
> cheap way to extract more scalability.
> # Reuse container data file holes, explicitly to avoid creating so many 
> containers. Perhaps with a round of "defragmentation" to simplify reuse, or 
> perhaps not. As a side effect, metadata file compaction now becomes more 
> important (and costly).
> # Eschew one metadata file per data file altogether and maintain just one 
> metadata file. Deleting "dead" containers would no longer be an improvement 
> for metadata startup cost. Metadata compaction would be a lot more expensive. 
> Block records themselves would be larger, because each record now needs to 
> point to a particular data file, though this can be mitigated in various 
> ways. A variant of this would be to do away with the 1-1 relationship between 
> metadata and data files and make it more like m-n.
> # Reduce the number of extents in container metadata files via judicious 
> preallocation.
> See the gerrit linked above for more details.





[jira] [Commented] (KUDU-1318) Java integration test failures in kudu-client-tools

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16142297#comment-16142297
 ] 

Jean-Daniel Cryans commented on KUDU-1318:
--

[~d...@danburkert.com] is this still an issue after your recent changes?

> Java integration test failures in kudu-client-tools
> ---
>
> Key: KUDU-1318
> URL: https://issues.apache.org/jira/browse/KUDU-1318
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.7.0
>Reporter: Adar Dembo
>
> Filing this so we can collectively figure out what's going on.
> I see the following errors consistently when I run "mvn clean install".
> {noformat}
> ---
>  T E S T S
> ---
> Running org.kududb.mapreduce.tools.ITRowCounter
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 4.954 sec <<< 
> FAILURE! - in org.kududb.mapreduce.tools.ITRowCounter
> test(org.kududb.mapreduce.tools.ITRowCounter)  Time elapsed: 2.345 sec  <<< 
> ERROR!
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.yarn.proto.YarnProtos$LocalResourceProto.hashLong(J)I
>   at 
> org.apache.hadoop.yarn.proto.YarnProtos$LocalResourceProto.hashCode(YarnProtos.java:11655)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.LocalResourcePBImpl.hashCode(LocalResourcePBImpl.java:62)
>   at java.util.HashMap.hash(HashMap.java:362)
>   at java.util.HashMap.put(HashMap.java:492)
>   at 
> org.apache.hadoop.mapred.LocalDistributedCacheManager.setup(LocalDistributedCacheManager.java:133)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:163)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:731)
>   at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:536)
>   at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1306)
>   at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1303)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>   at org.apache.hadoop.mapreduce.Job.submit(Job.java:1303)
>   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1324)
>   at org.kududb.mapreduce.tools.ITRowCounter.test(ITRowCounter.java:64)
> Running org.kududb.mapreduce.tools.ITImportCsv
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 3.696 sec <<< 
> FAILURE! - in org.kududb.mapreduce.tools.ITImportCsv
> test(org.kududb.mapreduce.tools.ITImportCsv)  Time elapsed: 1.053 sec  <<< 
> ERROR!
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.yarn.proto.YarnProtos$LocalResourceProto.hashLong(J)I
>   at 
> org.apache.hadoop.yarn.proto.YarnProtos$LocalResourceProto.hashCode(YarnProtos.java:11655)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.LocalResourcePBImpl.hashCode(LocalResourcePBImpl.java:62)
>   at java.util.HashMap.hash(HashMap.java:362)
>   at java.util.HashMap.put(HashMap.java:492)
>   at 
> org.apache.hadoop.mapred.LocalDistributedCacheManager.setup(LocalDistributedCacheManager.java:133)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:163)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:731)
>   at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:536)
>   at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1306)
>   at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1303)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>   at org.apache.hadoop.mapreduce.Job.submit(Job.java:1303)
>   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1324)
>   at org.kududb.mapreduce.tools.ITImportCsv.test(ITImportCsv.java:103)
> Results :
> Tests in error: 
>   ITImportCsv.test:103 » NoSuchMethod 
> org.apache.hadoop.yarn.proto.YarnProtos$Lo...
>   ITRowCounter.test:64 » NoSuchMethod 
> org.apache.hadoop.yarn.proto.YarnProtos$Lo...
> Tests run: 2, Failures: 0, Errors: 2, Skipped: 0
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-1528) heap-use-after-free in Peer::ProcessResponse while deleting a tablet

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1528:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> heap-use-after-free in Peer::ProcessResponse while deleting a tablet 
> -
>
> Key: KUDU-1528
> URL: https://issues.apache.org/jira/browse/KUDU-1528
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 0.9.1
>Reporter: Adar Dembo
>
> Observed this in a [master-stress-test 
> failure|http://104.196.14.100/job/kudu-gerrit/2314/BUILD_TYPE=ASAN]. It 
> appears that we deleted a tablet while we were processing the response to 
> ConsensusService::UpdateConsensus().
> {noformat}
> ==721==ERROR: AddressSanitizer: heap-use-after-free on address 0x6140001b8fb0 
> at pc 0x7f54968ac684 bp 0x7f5487e485b0 sp 0x7f5487e485a8
> READ of size 8 at 0x6140001b8fb0 thread T6 (rpc reactor-728)
> #0 0x7f54968ac683 in scoped_refptr::operator 
> kudu::Histogram* scoped_refptr::*() const 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/gutil/ref_counted.h:269:38
> #1 0x7f5491297e5b in 
> kudu::ThreadPool::Submit(std::shared_ptr const&) 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/util/threadpool.cc:241:7
> #2 0x7f54912978cd in kudu::ThreadPool::SubmitFunc(boost::function ()> const&) 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/util/threadpool.cc:182:10
> #3 0x7f54912976ef in kudu::ThreadPool::SubmitClosure(kudu::Callback ()> const&) 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/util/threadpool.cc:178:10
> #4 0x7f5497f3929c in kudu::consensus::Peer::ProcessResponse() 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/consensus/consensus_peers.cc:263:14
> #5 0x7f5497f440ad in boost::_bi::bind_t kudu::consensus::Peer>, 
> boost::_bi::list1 > >::operator()() 
> /home/jenkins-slave/workspace/kudu-3/thirdparty/installed/include/boost/bind/bind.hpp:1222:16
> #6 0x7f549688135e in boost::function0::operator()() const 
> /home/jenkins-slave/workspace/kudu-3/thirdparty/installed/include/boost/function/function_template.hpp:770:14
> #7 0x7f549687ca1e in kudu::rpc::OutboundCall::CallCallback() 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/rpc/outbound_call.cc:189:5
> #8 0x7f549687ce01 in 
> kudu::rpc::OutboundCall::SetResponse(gscoped_ptr kudu::DefaultDeleter >) 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/rpc/outbound_call.cc:221:5
> #9 0x7f549688ea1f in 
> kudu::rpc::Connection::HandleCallResponse(gscoped_ptr  kudu::DefaultDeleter >) 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/rpc/connection.cc:534:3
> #10 0x7f549688ded2 in kudu::rpc::Connection::ReadHandler(ev::io&, int) 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/rpc/connection.cc:470:7
> #11 0x7f5496184605 in ev_invoke_pending 
> /home/jenkins-slave/workspace/kudu-2/thirdparty/libev-4.20/ev.c:3155
> #12 0x7f54961854f7 in ev_run 
> /home/jenkins-slave/workspace/kudu-2/thirdparty/libev-4.20/ev.c:3555
> #13 0x7f54968c6cca in kudu::rpc::ReactorThread::RunThread() 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/rpc/reactor.cc:306:3
> #14 0x7f54968d69dd in boost::_bi::bind_t kudu::rpc::ReactorThread>, 
> boost::_bi::list1 > 
> >::operator()() 
> /home/jenkins-slave/workspace/kudu-3/thirdparty/installed/include/boost/bind/bind.hpp:1222:16
> #15 0x7f549688135e in boost::function0::operator()() const 
> /home/jenkins-slave/workspace/kudu-3/thirdparty/installed/include/boost/function/function_template.hpp:770:14
> #16 0x7f5491285dd6 in kudu::Thread::SuperviseThread(void*) 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/util/thread.cc:586:3
> #17 0x7f5493bcb181 in start_thread 
> /build/eglibc-3GlaMS/eglibc-2.19/nptl/pthread_create.c:312
> #18 0x7f548e6fe47c in clone 
> /build/eglibc-3GlaMS/eglibc-2.19/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> 0x6140001b8fb0 is located 368 bytes inside of 416-byte region 
> [0x6140001b8e40,0x6140001b8fe0)
> freed by thread T31 (rpc worker-753) here:
> #0 0x4f54d0 in operator delete(void*) 
> /home/jenkins-slave/workspace/kudu-3/thirdparty/llvm-3.8.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cc:94
> #1 0x7f5497f9e40d in kudu::consensus::RaftConsensus::~RaftConsensus() 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/consensus/raft_consensus.cc:244:1
> #2 0x7f5497f9e551 in kudu::consensus::RaftConsensus::~RaftConsensus() 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/consensus/raft_consensus.cc:242:33
> #3 0x7f5498adf6e8 in 
> scoped_refptr::operator=(kudu::consensus::Consensus*)
>  
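The freed-by trace above shows the RaftConsensus object being destroyed on an RPC worker thread while the reactor thread was still inside Peer::ProcessResponse(). A minimal sketch of that failure mode and one common fix — all names here are illustrative stand-ins, not Kudu's real classes: a response callback that captures a raw pointer can fire after the Peer was deleted along with its tablet, whereas capturing a std::weak_ptr lets a late callback detect that the object is gone and drop the response.

```cpp
#include <functional>
#include <memory>

// Hypothetical stand-in for a consensus peer whose RPC response callback
// may run after the owning tablet (and this object) have been deleted.
struct Peer {
  bool ProcessResponse() { return true; }
};

// Returns a callback that is safe to invoke even after the Peer is
// destroyed: the weak_ptr fails to lock once the object is gone.
std::function<bool()> MakeSafeResponseCallback(const std::shared_ptr<Peer>& peer) {
  std::weak_ptr<Peer> weak = peer;
  return [weak]() {
    if (auto live = weak.lock()) {  // Peer still alive: process normally
      return live->ProcessResponse();
    }
    return false;                   // Peer already deleted: drop the response
  };
}
```

Capturing a raw `Peer*` in the lambda instead would reproduce the heap-use-after-free whenever the deletion races ahead of the reactor thread.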

[jira] [Updated] (KUDU-1876) Poor error messages and behavior when webserver TLS is misconfigured

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1876:
-
Target Version/s: 1.6.0

> Poor error messages and behavior when webserver TLS is misconfigured
> 
>
> Key: KUDU-1876
> URL: https://issues.apache.org/jira/browse/KUDU-1876
> Project: Kudu
>  Issue Type: Bug
>  Components: security, supportability
>Affects Versions: 1.3.0
>Reporter: Adar Dembo
>
> I was playing around with Cloudera Manager's upcoming webserver TLS support 
> and found a couple cases where misconfigurations led to confusing error 
> messages and other weird behavior. I focused on *webserver_private_key_file*, 
> *webserver_certificate_file*, and *webserver_private_key_password_cmd*.
> *webserver_private_key_file* is unset, but *webserver_certificate_file* and 
> *webserver_private_key_password_cmd* are set: the server crashes (good) but 
> with a fairly inscrutable error message:
> {noformat}
> I0213 18:49:50.606950  2265 webserver.cc:144] Webserver: Enabling HTTPS 
> support
> I0213 18:49:50.607322  2265 webserver.cc:293] Webserver: set_ssl_option: 
> cannot open /etc/adar_kudu_tls/cert.pem: error:0906D06C:PEM 
> routines:PEM_read_bio:no start line
> W0213 18:49:50.607375  2265 net_util.cc:293] Failed to bind to 0.0.0.0:8051. 
> Trying to use lsof to find any processes listening on the same port:
> I0213 18:49:50.607393  2265 net_util.cc:296] $ export PATH=$PATH:/usr/sbin ; 
> lsof -n -i 'TCP:8051' -sTCP:LISTEN ; for pid in $(lsof -F p -n -i 'TCP:8051' 
> -sTCP:LISTEN | grep p | cut -f 2 -dp) ; do  while [ $pid -gt 1 ] ; dops h 
> -fp $pid ;stat=($( W0213 18:49:50.632638  2265 net_util.cc:303] 
> F0213 18:49:50.632704  2265 master_main.cc:71] Check failed: _s.ok() Bad 
> status: Network error: Webserver: Could not start on address 0.0.0.0:8051
> {noformat}
> *webserver_private_key_file*, *webserver_certificate_file*, and 
> *webserver_private_key_password_cmd* are all set, but the password command 
> script yields the wrong password: the server crashes (good) but the error 
> message is inscrutable: 
> {noformat}
> I0213 18:35:34.581714 32633 webserver.cc:293] Webserver: set_ssl_option: 
> cannot open /etc/adar_kudu_tls/cert.pem: error:06065064:digital envelope 
> routines:EVP_DecryptFinal_ex:bad decrypt
> W0213 18:35:34.581794 32633 net_util.cc:293] Failed to bind to 0.0.0.0:8051. 
> Trying to use lsof to find any processes listening on the same port:
> I0213 18:35:34.581811 32633 net_util.cc:296] $ export PATH=$PATH:/usr/sbin ; 
> lsof -n -i 'TCP:8051' -sTCP:LISTEN ; for pid in $(lsof -F p -n -i 'TCP:8051' 
> -sTCP:LISTEN | grep p | cut -f 2 -dp) ; do  while [ $pid -gt 1 ] ; dops h 
> -fp $pid ;stat=($( W0213 18:35:34.605216 32633 net_util.cc:303] 
> F0213 18:35:34.605254 32633 master_main.cc:71] Check failed: _s.ok() Bad 
> status: Network error: Webserver: Could not start on address 0.0.0.0:8051
> {noformat}
> *webserver_private_key_file* and *webserver_private_key_password_cmd* are 
> set, but *webserver_certificate_file* is not: the server starts up (probably 
> bad?) and any attempt to access the webui on the https port yields a "This 
> site can’t provide a secure connection" message in the browser with 
> ERR_SSL_PROTOCOL_ERROR as the error code. I only tested with Chromium.





[jira] [Resolved] (KUDU-2045) Data race on process_memory::g_hard_limit

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved KUDU-2045.
--
   Resolution: Fixed
 Assignee: Todd Lipcon
Fix Version/s: 1.5.0

Todd fixed it in 230ed20d54a3bf2a9e01481e68d06da3d67e42d5

> Data race on process_memory::g_hard_limit
> -
>
> Key: KUDU-2045
> URL: https://issues.apache.org/jira/browse/KUDU-2045
> Project: Kudu
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>Assignee: Todd Lipcon
>  Labels: newbie
> Fix For: 1.5.0
>
>
> Saw this in a linked_list-test TSAN run. I don't think it's related to the 
> changes I currently have in my tree:
> {noformat}
> ==
> WARNING: ThreadSanitizer: data race (pid=19052)
>   Write of size 8 at 0x7f04796dc478 by thread T123 (mutexes: write M1221):
> #0 kudu::process_memory::(anonymous namespace)::DoInitLimits() 
> /home/adar/Source/kudu/src/kudu/util/process_memory.cc:166:16 
> (libkudu_util.so+0x1a5719)
> #1 GoogleOnceInternalInit(int*, void (*)(), void (*)(void*), void*) 
> /home/adar/Source/kudu/src/kudu/gutil/once.cc:38:7 (libgutil.so+0x35a87)
> #2 GoogleOnceInit(GoogleOnceType*, void (*)()) 
> /home/adar/Source/kudu/src/kudu/gutil/once.h:55:5 (libtserver.so+0xc69c3)
> #3 kudu::process_memory::(anonymous namespace)::InitLimits() 
> /home/adar/Source/kudu/src/kudu/util/process_memory.cc:184:3 
> (libkudu_util.so+0x1a5511)
> #4 kudu::process_memory::UnderMemoryPressure(double*) 
> /home/adar/Source/kudu/src/kudu/util/process_memory.cc:221:3 
> (libkudu_util.so+0x1a544a)
> #5 
> _ZNSt3__18__invokeIRPFbPdEJS1_EEEDTclclsr3std3__1E7forwardIT_Efp_Espclsr3std3__1E7forwardIT0_Efp0_EEEOS5_DpOS6_
>  
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/c++/v1/type_traits:4301:1
>  (libkudu_util.so+0x16aa6d)
> #6 bool std::__1::__invoke_void_return_wrapper::__call (*&)(double*), double*>(bool (*&)(double*), double*&&) 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/c++/v1/__functional_base:328
>  (libkudu_util.so+0x16aa6d)
> #7 std::__1::__function::__func std::__1::allocator, bool 
> (double*)>::operator()(double*&&) 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/c++/v1/functional:1552:12
>  (libkudu_util.so+0x16a974)
> #8 std::__1::function::operator()(double*) const 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/c++/v1/functional:1914:12
>  (libkudu_util.so+0x168b0d)
> #9 kudu::MaintenanceManager::FindBestOp() 
> /home/adar/Source/kudu/src/kudu/util/maintenance_manager.cc:383:7 
> (libkudu_util.so+0x165e66)
> #10 kudu::MaintenanceManager::RunSchedulerThread() 
> /home/adar/Source/kudu/src/kudu/util/maintenance_manager.cc:245:25 
> (libkudu_util.so+0x164650)
> #11 boost::_mfi::mf0 kudu::MaintenanceManager>::operator()(kudu::MaintenanceManager*) const 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/boost/bind/mem_fn_template.hpp:49:29
>  (libkudu_util.so+0x16b876)
> #12 void boost::_bi::list1 
> >::operator(), 
> boost::_bi::list0>(boost::_bi::type, boost::_mfi::mf0 kudu::MaintenanceManager>&, boost::_bi::list0&, int) 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/boost/bind/bind.hpp:259:9
>  (libkudu_util.so+0x16b7ca)
> #13 boost::_bi::bind_t kudu::MaintenanceManager>, 
> boost::_bi::list1 > 
> >::operator()() 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/boost/bind/bind.hpp:1222:16
>  (libkudu_util.so+0x16b753)
> #14 
> boost::detail::function::void_function_obj_invoker0 boost::_mfi::mf0, 
> boost::_bi::list1 > >, 
> void>::invoke(boost::detail::function::function_buffer&) 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/boost/function/function_template.hpp:159:11
>  (libkudu_util.so+0x16b559)
> #15 boost::function0::operator()() const 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/boost/function/function_template.hpp:770:14
>  (libkrpc.so+0xb0c11)
> #16 kudu::Thread::SuperviseThread(void*) 
> /home/adar/Source/kudu/src/kudu/util/thread.cc:591:3 
> (libkudu_util.so+0x1bce7e)
>   Previous read of size 8 at 0x7f04796dc478 by thread T121:
> #0 kudu::process_memory::HardLimit() 
> /home/adar/Source/kudu/src/kudu/util/process_memory.cc:217:10 
> (libkudu_util.so+0x1a540a)
> #1 kudu::MemTrackersHandler(kudu::WebCallbackRegistry::WebRequest const&, 
> std::__1::basic_ostringstream std::__1::allocator >*) 
> 
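The race reported above is a reader (HardLimit) loading the global without going through the once-initialization that writes it. A sketch of the well-defined version of this pattern — the names mirror the trace but the code is illustrative, not Kudu's actual implementation: funnel every reader through std::call_once and make the global atomic, so the load can never race with the initializing store.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

namespace process_memory {

std::once_flag g_limits_once;
std::atomic<int64_t> g_hard_limit{-1};

void DoInitLimits() {
  // Kudu derives the limit from flags and total system memory; a fixed
  // 4 GiB value keeps this sketch self-contained.
  g_hard_limit.store(4LL * 1024 * 1024 * 1024, std::memory_order_release);
}

int64_t HardLimit() {
  // Initialize-on-first-use from every reader: the call_once establishes a
  // happens-before edge, so the subsequent load is race-free.
  std::call_once(g_limits_once, DoInitLimits);
  return g_hard_limit.load(std::memory_order_acquire);
}

}  // namespace process_memory
```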

[jira] [Commented] (KUDU-1489) Use WAL directory for tablet metadata files

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142283#comment-16142283
 ] 

Jean-Daniel Cryans commented on KUDU-1489:
--

[~anjuwong] does this jira still make sense given the things you've been 
working on recently? Should it fold into some other jira?

> Use WAL directory for tablet metadata files
> ---
>
> Key: KUDU-1489
> URL: https://issues.apache.org/jira/browse/KUDU-1489
> Project: Kudu
>  Issue Type: Improvement
>  Components: consensus, fs, tserver
>Affects Versions: 0.9.0
>Reporter: Adar Dembo
>
> Today a tserver will place tablet metadata files (i.e. superblock and cmeta 
> files) in the first configured data directory. I don't remember why we 
> decided to do this (commit 691f97d introduced it), but upon reconsideration 
> the WAL directory seems like a much better choice, because if the machine has 
> different kinds of I/O devices, the WAL directory's device is typically the 
> fastest.
> Mostafa has been testing Impala and Kudu on a cluster with many thousands of 
> tablets. His cluster contains storage-dense machines, each configured with 14 
> spinning disks and one flash device. Naturally, the WAL directory sits on 
> that flash device and the data directories are on the spinning disks. With 
> thousands of tablet metadata files on the first spinning disk, nearly every 
> tablet in the tserver is bottlenecked on that device due to the sheer amount 
> of I/O needed to maintain the running state of the tablet, specifically 
> rewriting cmeta files on various Raft events (votes, term advancement, etc.).
> Many thousands of tablets is not really a good scale for Kudu right now, but 
> moving the tablet metadata files to a faster device should at least help with 
> the above.





[jira] [Updated] (KUDU-1397) Allow building safely with custom toolchains

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1397:
-
Target Version/s: Backlog  (was: 1.5.0)

> Allow building safely with custom toolchains
> 
>
> Key: KUDU-1397
> URL: https://issues.apache.org/jira/browse/KUDU-1397
> Project: Kudu
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.8.0
>Reporter: Adar Dembo
>
> Casey uncovered several issues when building Kudu with the Impala toolchain; 
> this report attempts to capture them.
> The first and most important issue was a random SIGSEGV during a flush:
> {noformat}
> (gdb) bt
> #0 0x00e82540 in kudu::CopyCellData kudu::ColumnBlockCell, kudu::Arena> (src=..., dst=0x7ff9c637d5e0, 
> dst_arena=0x0)
> at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/common/row.h:79
> #1 0x00e80e33 in kudu::CopyCell kudu::ColumnBlockCell, kudu::Arena> (src=..., dst=0x7ff9c637d5e0, 
> dst_arena=0x0)
> at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/common/row.h:103
> #2 0x00e7f647 in kudu::CopyRow kudu::Arena> (src_row=..., dst_row=0x7ff9c637d870, dst_arena=0x0)
> at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/common/row.h:119
> #3  0x00e76773 in kudu::tablet::FlushCompactionInput 
> (input=0x3894f00, snap=..., out=0x7ff9c637dbf0)
> at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/tablet/compaction.cc:768
> #4  0x00e23f5a in kudu::tablet::Tablet::DoCompactionOrFlush 
> (this=0x395a840, input=..., mrs_being_flushed=0)
> at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/tablet/tablet.cc:1221
> #5  0x00e202b2 in kudu::tablet::Tablet::FlushInternal 
> (this=0x395a840, input=..., old_ms=...) at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/tablet/tablet.cc:744
> #6  0x00e1f8f6 in kudu::tablet::Tablet::FlushUnlocked 
> (this=0x395a840) at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/tablet/tablet.cc:678
> #7  0x00f1b3a3 in kudu::tablet::FlushMRSOp::Perform (this=0x38b9340) 
> at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/tablet/tablet_peer_mm_ops.cc:127
> #8  0x00ea19d7 in kudu::MaintenanceManager::LaunchOp (this=0x3904360, 
> op=0x38b9340) at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/tablet/maintenance_manager.cc:360
> #9  0x00ea6502 in boost::_mfi::mf1 kudu::MaintenanceOp*>::operator() (this=0x3d492a0, p=0x3904360, a1=0x38b9340)
> at 
> /home/casey/Code/native-toolchain/build/boost-1.57.0/include/boost/bind/mem_fn_template.hpp:165
> #10 0x00ea6163 in 
> boost::_bi::list2, 
> boost::_bi::value >::operator() kudu::MaintenanceManager, kudu::MaintenanceOp*>, boost::_bi::list0> 
> (this=0x3d492b0, f=..., a=...) at 
> /home/casey/Code/native-toolchain/build/boost-1.57.0/include/boost/bind/bind.hpp:313
> #11 0x00ea5bed in boost::_bi::bind_t kudu::MaintenanceManager, kudu::MaintenanceOp*>, 
> boost::_bi::list2, 
> boost::_bi::value > >::operator() (this=0x3d492a0) at 
> /home/casey/Code/native-toolchain/build/boost-1.57.0/include/boost/bind/bind_template.hpp:20
> #12 0x00ea57ec in 
> boost::detail::function::void_function_obj_invoker0 boost::_mfi::mf1, 
> boost::_bi::list2, 
> boost::_bi::value > >, void>::invoke 
> (function_obj_ptr=...) at 
> /home/casey/Code/native-toolchain/build/boost-1.57.0/include/boost/function/function_template.hpp:153
> #13 0x01c4205e in boost::function0::operator() (this=0x3c01838) 
> at 
> /home/casey/Code/native-toolchain/build/boost-1.57.0/include/boost/function/function_template.hpp:767
> #14 0x01d73aa4 in kudu::FunctionRunnable::Run (this=0x3c01830) at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/util/threadpool.cc:47
> #15 0x01d73062 in kudu::ThreadPool::DispatchThread (this=0x38c8340, 
> permanent=true) at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/util/threadpool.cc:321
> #16 0x01d76740 in boost::_mfi::mf1 bool>::operator() 

[jira] [Updated] (KUDU-1489) Use WAL directory for tablet metadata files

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1489:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> Use WAL directory for tablet metadata files
> ---
>
> Key: KUDU-1489
> URL: https://issues.apache.org/jira/browse/KUDU-1489
> Project: Kudu
>  Issue Type: Improvement
>  Components: consensus, fs, tserver
>Affects Versions: 0.9.0
>Reporter: Adar Dembo
>
> (Description identical to the KUDU-1489 comment email above.)





[jira] [Updated] (KUDU-997) Expose client-side metrics

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-997:

Target Version/s: 1.6.0  (was: 1.5.0)

> Expose client-side metrics
> --
>
> Key: KUDU-997
> URL: https://issues.apache.org/jira/browse/KUDU-997
> Project: Kudu
>  Issue Type: New Feature
>  Components: client
>Affects Versions: Feature Complete
>Reporter: Adar Dembo
> Attachments: patch
>
>
> I think client-side metrics have been a desirable feature for quite some 
> time, but I especially wanted them while debugging KUDU-993.
> There are some challenges in collecting metric data in a cohesive way across 
> the client (at least in C++, where there isn't a completely uniform way to 
> send/receive RPCs). But I think the main challenge is figuring out how to 
> expose it to users. I'm not sure we want to expose metrics.h directly, 
> because it's deeply intertwined with gutil and other Kudu util code.
> I'm attaching a patch I wrote yesterday to help with KUDU-993. It doesn't 
> tackle the API problem at all, but shows how to build a histogram tracking 
> all writes.





[jira] [Commented] (KUDU-683) Clean up multi-master tech debt

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142276#comment-16142276
 ] 

Jean-Daniel Cryans commented on KUDU-683:
-

[~adar] do you think we still have a lot of tech debt here?

> Clean up multi-master tech debt
> ---
>
> Key: KUDU-683
> URL: https://issues.apache.org/jira/browse/KUDU-683
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: M5
>Reporter: Adar Dembo
>
> Multi-master support in the C++ client has introduced a fair amount of 
> RPC-related tech debt. There's a lot of duplication in the handling of 
> timeouts, retries, and error conditions. The various callbacks are also 
> tricky to follow and error prone. Now that the code has settled and we 
> understand what's painful about it, we're in a better position to fix it.
> Here's a high-level design idea: there should only be one RPC class that's 
> responsible for RPC delivery end-to-end, including retries, leader master 
> discovery, etc. Within that class there should be a single callback that's 
> reused for every asynchronous function, and there should be a separate state 
> machine that tracks the ongoing status of the RPC. Per-RPC specialization 
> should be as minimal as possible, via templates on the PBs, callbacks, or, 
> worst case, subclassing.
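The design idea in the description — one RPC driver owning the retry loop, with a single state machine instead of per-call callback chains — can be sketched as follows. All names here are hypothetical, not Kudu's real API; per-RPC specialization is reduced to the send functor:

```cpp
#include <functional>

// States every retriable RPC moves through, regardless of its payload.
enum class RpcState { kInit, kSent, kRetrying, kSucceeded, kFailed };

class RetriableRpc {
 public:
  RetriableRpc(int max_attempts, std::function<bool()> send)
      : max_attempts_(max_attempts), send_(std::move(send)) {}

  // Drives the RPC to completion. Every attempt takes the same state
  // transitions, so timeout/retry/leader-discovery handling lives in one
  // place instead of being duplicated per call site.
  RpcState Run() {
    for (int attempt = 0; attempt < max_attempts_; ++attempt) {
      state_ = RpcState::kSent;
      if (send_()) {
        return state_ = RpcState::kSucceeded;
      }
      state_ = RpcState::kRetrying;  // e.g. re-discover the leader master here
    }
    return state_ = RpcState::kFailed;
  }

 private:
  int max_attempts_;
  std::function<bool()> send_;
  RpcState state_ = RpcState::kInit;
};
```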





[jira] [Updated] (KUDU-2014) Explore additional approaches to improve LBM startup time

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-2014:
-
Issue Type: Improvement  (was: Bug)

> Explore additional approaches to improve LBM startup time
> -
>
> Key: KUDU-2014
> URL: https://issues.apache.org/jira/browse/KUDU-2014
> Project: Kudu
>  Issue Type: Improvement
>  Components: fs
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>  Labels: data-scalability
>
> The fix for KUDU-1549 added support for deleting full log block manager 
> containers with no live blocks, and for compacting container metadata to omit 
> CREATE/DELETE record pairs. Both of these will help reduce the amount of 
> metadata that must be read at startup. However, there's more we can do to 
> help; this JIRA captures some additional ideas worth exploring (if/when LBM 
> startup once again becomes intolerable):
> In [this 
> gerrit|https://gerrit.cloudera.org/#/c/6826/2/src/kudu/fs/log_block_manager.cc@90],
>  Todd made the case that container metadata processing is seek-dominant:
> {quote}
> looking at a data/ dir on a cluster that has been around for quite some time, 
> most of the metadata files seem to be around 400KB. Assuming 100MB/sec 
> sequential throughput and 10ms seek, it definitely seems like the startup 
> time would be seek-dominated (10 or 20ms seek depending whether various 
> internal metadata pages are hot in cache, plus only 4ms of sequential read 
> time). 
> {quote}
> We theorized several ways to reduce seeking, all focused on reducing the 
> number of discrete container metadata files read at startup:
> # Raise the container max data file size. This won't help on older versions 
> of el6 with ext4, but will help everywhere else. It makes sense for the max 
> data file size to be a function of the disk size anyway. And it's a pretty 
> cheap way to extract more scalability.
> # Reuse container data file holes, explicitly to avoid creating so many 
> containers. Perhaps with a round of "defragmentation" to simplify reuse, or 
> perhaps not. As a side effect, metadata file compaction now becomes more 
> important (and costly).
> # Eschew one metadata file per data file altogether and maintain just one 
> metadata file. Deleting "dead" containers would no longer be an improvement 
> for metadata startup cost. Metadata compaction would be a lot more expensive. 
> Block records themselves would be larger, because each record now needs to 
> point to a particular data file, though this can be mitigated in various 
> ways. A variant of this would be to do away with the 1-1 relationship between 
> metadata and data files and make it more like m-n.
> # Reduce the number of extents in container metadata files via judicious 
> preallocation.
> See the gerrit linked above for more details.
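The quoted numbers check out: a 400 KB metadata file at 100 MB/s of sequential throughput costs roughly 4 ms of read time, so a 10-20 ms seek dominates each container's share of startup. A quick sketch of that arithmetic:

```cpp
// Milliseconds of sequential read time for a file of the given size at the
// given throughput; used only to verify the back-of-the-envelope quote.
constexpr double SequentialReadMs(double file_bytes, double bytes_per_sec) {
  return file_bytes / bytes_per_sec * 1000.0;
}
```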





[jira] [Updated] (KUDU-477) Implement block storage microbenchmark

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-477:

Target Version/s: Backlog  (was: 1.5.0)

> Implement block storage microbenchmark
> --
>
> Key: KUDU-477
> URL: https://issues.apache.org/jira/browse/KUDU-477
> Project: Kudu
>  Issue Type: Sub-task
>  Components: fs
>Affects Versions: M4.5
>Reporter: Adar Dembo
>Assignee: Adar Dembo
>
> With two block storage allocation strategies implemented, we ought to develop 
> a synthetic microbenchmark to evaluate the two (and future allocation 
> strategies).





[jira] [Updated] (KUDU-1358) Following a master leader election, create table may fail

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1358:
-
Target Version/s: Backlog  (was: 1.5.0)

> Following a master leader election, create table may fail
> -
>
> Key: KUDU-1358
> URL: https://issues.apache.org/jira/browse/KUDU-1358
> Project: Kudu
>  Issue Type: Sub-task
>  Components: master
>Affects Versions: 0.7.0
>Reporter: Adar Dembo
>Assignee: Adar Dembo
>
> In the current multi-master design and implementation, tservers only 
> heartbeat to the leader master. After a master leader election, there's a 
> short window of time in which the new leader master may not be aware of the 
> existence of some (or even all) of the tservers. Attempts to create a table 
> during this window may fail, as the tservers known to the new leader master 
> may be too few to satisfy the new table's replication factor. Whether the 
> window exists in the first place depends on whether the new leader master had 
> been leader before, and whether any of the tservers had sent heartbeats to it 
> during that time.
> Some possible solutions include:
> # Modifying the heartbeat protocol so that tservers heartbeat to _all_ 
> masters, leaders and followers alike. Doing this will ensure that the "soft 
> state" belonging to any master is always up-to-date at the cost of network 
> bandwidth lost to heartbeating. Additionally, changes may need to be made to 
> ensure that a follower master can't cause a tserver to take any actions.
> # Never actually failing a create table request due to too few tservers, 
> instead allowing it to linger until such a time when more tservers exist. For 
> this to actually be practical we'd need to allow clients to "cancel" a 
> previously issued create table request.
> Both approaches probably include additional ramifications; this problem needs 
> to be thought through carefully.





[jira] [Commented] (KUDU-1520) Possible race between alter schema lock release and tablet shutdown

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142272#comment-16142272
 ] 

Jean-Daniel Cryans commented on KUDU-1520:
--

[~adar] is this still an issue?

> Possible race between alter schema lock release and tablet shutdown
> ---
>
> Key: KUDU-1520
> URL: https://issues.apache.org/jira/browse/KUDU-1520
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 0.9.1
>Reporter: Adar Dembo
>
> I've been running a new stress that hammers a cluster with concurrent alter 
> and delete table requests, and one of my test runs failed with the following:
> {noformat}
> F0707 18:59:34.311122   373 rw_semaphore.h:145] Check failed: 
> base::subtle::NoBarrier_Load(_) == kWriteFlag (0 vs. 2147483648) 
> *** Check failure stack trace: ***
> @ 0x7f86cd37df5d  google::LogMessage::Fail() at ??:0
> @ 0x7f86cd37fe5d  google::LogMessage::SendToLog() at ??:0
> @ 0x7f86cd37da99  google::LogMessage::Flush() at ??:0
> @ 0x7f86cd3808ff  google::LogMessageFatal::~LogMessageFatal() at ??:0
> @ 0x7f86d4f77c78  kudu::rw_semaphore::unlock() at ??:0
> @ 0x7f86d3728de0  std::unique_lock<>::unlock() at ??:0
> @ 0x7f86d3727192  std::unique_lock<>::~unique_lock() at ??:0
> @ 0x7f86d3725582  
> kudu::tablet::AlterSchemaTransactionState::~AlterSchemaTransactionState() at 
> ??:0
> @ 0x7f86d37255be  
> kudu::tablet::AlterSchemaTransactionState::~AlterSchemaTransactionState() at 
> ??:0
> @ 0x7f86d4f68dce  std::default_delete<>::operator()() at ??:0
> @ 0x7f86d4f670b9  std::unique_ptr<>::~unique_ptr() at ??:0
> @ 0x7f86d374510e  
> kudu::tablet::AlterSchemaTransaction::~AlterSchemaTransaction() at ??:0
> @ 0x7f86d374514a  
> kudu::tablet::AlterSchemaTransaction::~AlterSchemaTransaction() at ??:0
> @ 0x7f86d373f532  kudu::DefaultDeleter<>::operator()() at ??:0
> @ 0x7f86d373df4a  
> kudu::internal::gscoped_ptr_impl<>::~gscoped_ptr_impl() at ??:0
> @ 0x7f86d373d552  gscoped_ptr<>::~gscoped_ptr() at ??:0
> @ 0x7f86d373d580  
> kudu::tablet::TransactionDriver::~TransactionDriver() at ??:0
> @ 0x7f86d3740ab4  kudu::RefCountedThreadSafe<>::DeleteInternal() at 
> ??:0
> @ 0x7f86d3740405  
> kudu::DefaultRefCountedThreadSafeTraits<>::Destruct() at ??:0
> @ 0x7f86d373f928  kudu::RefCountedThreadSafe<>::Release() at ??:0
> @ 0x7f86d373e769  scoped_refptr<>::~scoped_refptr() at ??:0
> @ 0x7f86d37397cd  kudu::tablet::TabletPeer::SubmitAlterSchema() at 
> ??:0
> @ 0x7f86d4f4e070  
> kudu::tserver::TabletServiceAdminImpl::AlterSchema() at ??:0
> @ 0x7f86d27a4e92  
> _ZZN4kudu7tserver26TabletServerAdminServiceIfC4ERK13scoped_refptrINS_12MetricEntityEEENKUlPKN6google8protobuf7MessageEPS9_PNS_3rpc10RpcContextEE1_clESB_SC_SF_
>  at ??:0
> @ 0x7f86d27a5d96  
> _ZNSt17_Function_handlerIFvPKN6google8protobuf7MessageEPS2_PN4kudu3rpc10RpcContextEEZNS6_7tserver26TabletServerAdminServiceIfC4ERK13scoped_refptrINS6_12MetricEntityEEEUlS4_S5_S9_E1_E9_M_invokeERKSt9_Any_dataS4_S5_S9_
>  at ??:0
> @ 0x7f86d22ce6e4  std::function<>::operator()() at ??:0
> @ 0x7f86d22ce19b  kudu::rpc::GeneratedServiceIf::Handle() at ??:0
> @ 0x7f86d22d0a97  kudu::rpc::ServicePool::RunThread() at ??:0
> @ 0x7f86d22d1d45  boost::_mfi::mf0<>::operator()() at ??:0
> @ 0x7f86d22d1b6c  boost::_bi::list1<>::operator()<>() at ??:0
> @ 0x7f86d22d1a61  boost::_bi::bind_t<>::operator()() at ??:0
> @ 0x7f86d22d1998  
> boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
> {noformat}
> After looking through the code a bit, I suspect this happened because, in the 
> event of failure, the AlterSchema transaction releases the tablet's schema 
> lock implicitly (i.e. when AlterSchemaTransactionState is destroyed) _after_ 
> the transaction itself is removed from the driver's TransactionTracker. Thus, 
> the WaitForAllToFinish() performed during the tablet shutdown process thinks 
> all the transactions are done and proceeds to free tablet state. Later, the 
> last reference to the transaction is released (in 
> TabletPeer::SubmitAlterSchema), the transaction is destroyed, and we try to 
> unlock a lock whose memory has already been freed.
> If this analysis is correct, the broken invariant is: once the transaction 
> has been released from the tracker, it may no longer access any tablet state.
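The broken ordering is easier to see in miniature. Below is a language-neutral Python sketch (hypothetical names, not Kudu code) of the invariant stated above: the transaction must release tablet-held resources before it leaves the TransactionTracker, otherwise WaitForAllToFinish() lets shutdown free state that the transaction will still touch.

```python
class Tablet:
    def __init__(self):
        self.freed = False

    def unlock_schema(self):
        # Touching tablet state after it was freed is the use-after-free
        # from the stack trace above.
        if self.freed:
            raise RuntimeError("unlocking a lock whose memory was freed")

class Tracker:
    def __init__(self):
        self.txns = {"alter-txn"}

    def wait_for_all_to_finish(self):
        assert not self.txns  # shutdown proceeds once the tracker is empty

def shutdown(tablet, tracker, release_lock_first):
    if release_lock_first:
        tablet.unlock_schema()        # fixed order: release tablet state first
    tracker.txns.clear()              # transaction leaves the tracker
    tracker.wait_for_all_to_finish()  # tablet shutdown sees no pending txns...
    tablet.freed = True               # ...and frees tablet state
    if not release_lock_first:
        try:
            tablet.unlock_schema()    # last txn reference dropped later
        except RuntimeError:
            return "use-after-free"
    return "ok"

assert shutdown(Tablet(), Tracker(), release_lock_first=False) == "use-after-free"
assert shutdown(Tablet(), Tracker(), release_lock_first=True) == "ok"
```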






[jira] [Updated] (KUDU-993) Investigate making transaction tracker rejections more fair

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-993:

Issue Type: Improvement  (was: Bug)

> Investigate making transaction tracker rejections more fair
> ---
>
> Key: KUDU-993
> URL: https://issues.apache.org/jira/browse/KUDU-993
> Project: Kudu
>  Issue Type: Improvement
>  Components: tablet
>Affects Versions: Public beta
>Reporter: Adar Dembo
>
> When the transaction tracker hits its memory limit, it'll reject new 
> transactions until pending transactions finish. Clients respond by retrying 
> the failed transactions until the transaction tracker accepts them, or until 
> they timeout.
> The rejection mechanism doesn't take into account how many times a 
> transaction has been retried, and as a result, it's possible for some 
> transactions to be rejected many times over even as other transactions are 
> allowed through. Here's a contrived example: two clients submitting 
> transactions simultaneously, with room for only one pending transaction. 
> Given a long retry backoff delay and a short delay between transactions, it's 
> possible for one client to "hog" the available space while the other 
> continuously retries (each time it retries, the first client has managed to 
> stuff another transaction in).
> We should investigate making this rejection system more fair so that no one 
> transaction is starved.
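One fairness scheme consistent with the suggestion above is to admit the most-retried waiter when capacity frees up, instead of whichever request happens to arrive first. The Python sketch below is hypothetical, not Kudu's admission code.

```python
import heapq

class FairAdmitter:
    """Admit the waiter with the most retries when there is room (sketch)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = 0
        self.waiters = []  # (-retries, client): max-heap on retry count

    def request(self, client, retries):
        # Register or refresh this client's retry count.
        self.waiters = [w for w in self.waiters if w[1] != client]
        self.waiters.append((-retries, client))
        heapq.heapify(self.waiters)
        if self.pending < self.capacity:
            self.pending += 1
            return heapq.heappop(self.waiters)[1]  # most-retried waiter wins
        return None  # rejected; the client backs off and retries

    def finish(self):
        self.pending -= 1

adm = FairAdmitter(capacity=1)
assert adm.request("hog", retries=0) == "hog"      # first txn admitted
assert adm.request("starved", retries=0) is None   # no room: rejected
assert adm.request("starved", retries=1) is None   # still no room
adm.finish()
# When room frees up, the starved client beats the hog's fresh request.
assert adm.request("hog", retries=0) == "starved"
```

The key point is that rejection considers accumulated retries, so no one client can indefinitely "hog" the freed slot.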





[jira] [Updated] (KUDU-790) Not all MM ops acquire locks in Prepare()

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-790:

Target Version/s: Backlog  (was: 1.5.0)

> Not all MM ops acquire locks in Prepare()
> -
>
> Key: KUDU-790
> URL: https://issues.apache.org/jira/browse/KUDU-790
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: M5
>Reporter: Adar Dembo
>
> The maintenance manager invokes Prepare() on its scheduler thread and thunks 
> Perform() to a separate thread pool. If an op doesn't lock the rowsets it's 
> going to modify in Prepare(), there's a chance that the MM scheduler thread 
> may run again before Perform() is invoked (i.e. before those locks are 
> acquired). If this happens, the scheduler thread may compute the same stats 
> and schedule the same op.
> All of the ops are safe to call concurrently (well, those that aren't use 
> external synchronization to ensure that this doesn't happen), but what 
> happens next depends on the specific op and the timing. The second op may 
> no-op, or it may perform less useful work and waste time.
> Ideally Prepare() should acquire any and all locks needed by Perform(), so 
> that if the scheduler thread runs again, it'll compute different stats (since 
> those usually depend on acquiring rowset locks) and either not schedule the 
> op, or schedule a different body of work.
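The proposed invariant can be illustrated with a small Python sketch (hypothetical names, not the maintenance manager's real API): once Prepare() holds the rowset locks, a second scheduler pass computes different stats and will not schedule the duplicate op.

```python
import threading

class Rowset:
    def __init__(self, bytes_to_compact):
        self.lock = threading.Lock()
        self.bytes_to_compact = bytes_to_compact

def compute_stats(rowsets):
    # Locked rowsets are already claimed by a scheduled op, so they no
    # longer contribute to the perf-improvement score.
    return sum(r.bytes_to_compact for r in rowsets if not r.lock.locked())

def prepare(op_rowsets):
    for r in op_rowsets:
        r.lock.acquire()  # locks taken on the scheduler thread, in Prepare()

rowsets = [Rowset(100), Rowset(50)]
assert compute_stats(rowsets) == 150   # first pass: op looks worthwhile
prepare([rowsets[0]])                  # op scheduled; Prepare() locks rowset 0
assert compute_stats(rowsets) == 50    # second pass computes different stats
```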





[jira] [Updated] (KUDU-761) TS seg fault after short log append

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-761:

Issue Type: Bug  (was: Task)

> TS seg fault after short log append
> ---
>
> Key: KUDU-761
> URL: https://issues.apache.org/jira/browse/KUDU-761
> Project: Kudu
>  Issue Type: Bug
>  Components: log
>Affects Versions: M5
>Reporter: Adar Dembo
>
> I was running tpch_real_world with SF=6000 on a2414. Towards the end of the 
> run, flush MM ops consistently won out over log gc MM due to memory pressure. 
> Eventually we ran out of disk space on the disk hosting the logs:
> {noformat}
> E0511 05:45:47.913038 8174 log.cc:130] Error appending to the log: IO error: 
> pwritev error: expected to write 254370 bytes, wrote 145525 bytes instead
> {noformat}
> What followed was a SIGSEGV:
> {noformat}
> PC: @ 0x0 
> *** SIGSEGV (@0x0) received by PID 8004 (TID 0x7fae308fd700) from PID 0; 
> stack trace: ***
> @ 0x3d4ae0f500 
> @ 0x0 
> @ 0x8a2791 kudu::log::LogEntryPB::~LogEntryPB()
> @ 0x8a2272 kudu::log::LogEntryBatchPB::~LogEntryBatchPB()
> @ 0x8a22e1 kudu::log::LogEntryBatchPB::~LogEntryBatchPB()
> @ 0x892de4 kudu::log::Log::AppendThread::RunThread()
> {noformat}
> There's not much we can do when we run out of disk space, but better to crash 
> with a CHECK or something than to SIGSEGV.
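A minimal sketch of the requested behavior, in Python with a hypothetical write function standing in for pwritev: detect the short write and fail with an explicit error instead of letting later code walk into a half-written entry.

```python
def append_entry(write_fn, payload):
    """Append via write_fn (a pwritev-like call returning bytes written)."""
    written = write_fn(payload)
    if written != len(payload):
        # Equivalent of failing a CHECK / returning a bad Status here,
        # instead of letting the append thread touch a half-written entry.
        raise IOError(f"expected to write {len(payload)} bytes, "
                      f"wrote {written} bytes instead")
    return written

assert append_entry(lambda p: len(p), b"entry") == 5    # full write: ok
try:
    append_entry(lambda p: len(p) // 2, b"0123456789")  # injected short write
    outcome = "kept going"                              # would SIGSEGV later
except IOError:
    outcome = "failed cleanly"
assert outcome == "failed cleanly"
```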





[jira] [Commented] (KUDU-1544) Race in Java client's AsyncKuduSession.apply()

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142264#comment-16142264
 ] 

Jean-Daniel Cryans commented on KUDU-1544:
--

[~danburkert] [~adar] is this still a thing?

> Race in Java client's AsyncKuduSession.apply()
> --
>
> Key: KUDU-1544
> URL: https://issues.apache.org/jira/browse/KUDU-1544
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.10.0
>Reporter: Adar Dembo
>
> The race is between calls to flushNotification.get() and 
> inactiveBufferAvailable(). Suppose T1 calls inactiveBufferAvailable(), gets 
> back false, but is descheduled before constructing a PleaseThrottleException. 
> Now T2 is scheduled, finishes an outstanding flush, calls queueBuffer(), and 
> resets flushNotification to an empty Deferred. When T1 is rescheduled, it 
> throws a PTE with that empty Deferred.
> What is the effect? If the user waits on the Deferred from the PTE, the user 
> is effectively waiting on "the next flush", which, depending on the stream of 
> operations, may take place soon, may not take place for some time, or may not 
> take place at all.
> To fix this, we should probably reorder the calls to flushNotification.get() 
> in apply() to happen before calls to inactiveBufferAvailable(). That way, a 
> race will yield a stale Deferred rather than an empty one, and waiting on the 
> stale Deferred should be a no-op.
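The effect of the proposed reordering can be shown deterministically. In this Python sketch (a hypothetical shape, not the Java client), snapshotting the notification before the buffer check yields a stale-but-already-completed notification instead of an empty one.

```python
def apply_op(state, other_thread_flushes, snapshot_first):
    """One apply() attempt; other_thread_flushes models the racing flush."""
    if snapshot_first:
        notification = state["notification"]  # flushNotification.get() first
    if not state["buffer_free"]:
        other_thread_flushes(state)           # T2 queues buffer, resets Deferred
        if not snapshot_first:
            notification = state["notification"]
        return ("PleaseThrottle", notification)
    return ("applied", None)

def racing_flush(state):
    state["notification"] = "empty-deferred"  # fresh Deferred, not completed

buggy = apply_op({"notification": "stale-deferred", "buffer_free": False},
                 racing_flush, snapshot_first=False)
fixed = apply_op({"notification": "stale-deferred", "buffer_free": False},
                 racing_flush, snapshot_first=True)
assert buggy == ("PleaseThrottle", "empty-deferred")  # may block indefinitely
assert fixed == ("PleaseThrottle", "stale-deferred")  # waiting is a no-op
```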





[jira] [Updated] (KUDU-1961) devtoolset-3 defeats ccache

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1961:
-
Issue Type: Improvement  (was: Bug)

> devtoolset-3 defeats ccache
> ---
>
> Key: KUDU-1961
> URL: https://issues.apache.org/jira/browse/KUDU-1961
> Project: Kudu
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>
> When devtoolset-3 is used (via enable_devtoolset.sh on el6), it's quite 
> likely that ccache will go unused for the build. Certainly for 
> build-thirdparty.sh, and likely for the main Kudu build too (unless you go 
> out of your way to set CC/CXX=ccache when invoking cmake).
> We should be able to fix this in enable_devtoolset.sh, at least in the common 
> case where symlinks to ccache named after the compiler are on the PATH. We 
> could ensure that, following the call to 'scl enable devtoolset-3 ', 
> ccache symlinks are placed at the head of the PATH, before 
> /opt/rh/devtoolset-3/, and only then is  actually invoked. This should 
> cause ccache to be used, and it'll chain to the devtoolset-3 compiler because 
> /opt/rh/devtoolset-3/ is ahead of /usr/bin on the PATH. We may need an 
> intermediate script to do this.





[jira] [Commented] (KUDU-1521) Flakiness in TestAsyncKuduSession

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142261#comment-16142261
 ] 

Jean-Daniel Cryans commented on KUDU-1521:
--

[~adar] wanna try looping dist test and see if this is still an issue?

> Flakiness in TestAsyncKuduSession
> -
>
> Key: KUDU-1521
> URL: https://issues.apache.org/jira/browse/KUDU-1521
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.9.1
>Reporter: Adar Dembo
>
> I've been trying to parse the various failures in 
> http://104.196.14.100/job/kudu-gerrit/2270/BUILD_TYPE=RELEASE. Here's what I 
> see in the test:
> The way test() tests AUTO_FLUSH_BACKGROUND is inherently flaky; a delay while 
> running test code will give the background flush task a chance to fire when 
> the test code doesn't expect it. I've seen this lead to no 
> PleaseThrottleException, but I suspect the first block of test code dealing 
> with background flushes is flaky too (since it's testing elapsed time).
> There's also some test failures that I can't figure out. I've pasted them 
> below for posterity:
> {noformat}
> 03:52:14 
> testGetTableLocationsErrorCauseSessionStuck(org.kududb.client.TestAsyncKuduSession)
>   Time elapsed: 100.009 sec  <<< ERROR!
> 03:52:14 java.lang.Exception: test timed out after 10 milliseconds
> 03:52:14  at java.lang.Object.wait(Native Method)
> 03:52:14  at java.lang.Object.wait(Object.java:503)
> 03:52:14  at com.stumbleupon.async.Deferred.doJoin(Deferred.java:1136)
> 03:52:14  at com.stumbleupon.async.Deferred.join(Deferred.java:1019)
> 03:52:14  at 
> org.kududb.client.TestAsyncKuduSession.testGetTableLocationsErrorCauseSessionStuck(TestAsyncKuduSession.java:133)
> 03:52:14 
> 03:52:14 
> testBatchErrorCauseSessionStuck(org.kududb.client.TestAsyncKuduSession)  Time 
> elapsed: 0.199 sec  <<< ERROR!
> 03:52:14 org.kududb.client.MasterErrorException: Server[Kudu Master - 
> 127.13.215.1:64030] NOT_FOUND[code 1]: The table was deleted: Table deleted 
> at 2016-07-09 03:50:24 UTC
> 03:52:14  at 
> org.kududb.client.TabletClient.dispatchMasterErrorOrReturnException(TabletClient.java:533)
> 03:52:14  at org.kududb.client.TabletClient.decode(TabletClient.java:463)
> 03:52:14  at org.kududb.client.TabletClient.decode(TabletClient.java:83)
> 03:52:14  at 
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:500)
> 03:52:14  at 
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
> 03:52:14  at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
> 03:52:14  at 
> org.kududb.client.TabletClient.handleUpstream(TabletClient.java:638)
> 03:52:14  at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
> 03:52:14  at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
> 03:52:14  at 
> org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184)
> 03:52:14  at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
> 03:52:14  at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
> 03:52:14  at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
> 03:52:14  at 
> org.kududb.client.AsyncKuduClient$TabletClientPipeline.sendUpstream(AsyncKuduClient.java:1877)
> 03:52:14  at 
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
> 03:52:14  at 
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
> 03:52:14  at 
> org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
> 03:52:14  at 
> org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
> 03:52:14  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 03:52:14  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 03:52:14  at java.lang.Thread.run(Thread.java:745)
> 03:52:14 
> 03:52:14 

[jira] [Updated] (KUDU-1537) Exactly-once semantics for DDL operations

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1537:
-
Target Version/s: Backlog  (was: 1.5.0)

> Exactly-once semantics for DDL operations
> -
>
> Key: KUDU-1537
> URL: https://issues.apache.org/jira/browse/KUDU-1537
> Project: Kudu
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 0.9.1
>Reporter: Adar Dembo
>
> Now that Kudu has a replay cache, we should use it for master DDL operations 
> like CreateTable(), AlterTable(), and DeleteTable(). To do this we'll need to 
> add some client-specific RPC state into the write that the leader master 
> replicates to its followers, and use that state when an RPC is retried.
> Some tests (e.g. master-stress-test) have workarounds that should be removed 
> when this bug is fixed.
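A replay cache for exactly-once DDL can be sketched as a map from (client id, sequence number) to the recorded response, so a retried RPC replays the original result instead of re-executing. This Python sketch is a hypothetical shape, not Kudu's RPC layer.

```python
class ReplayCache:
    def __init__(self):
        self.responses = {}

    def handle(self, client_id, seq_no, op):
        key = (client_id, seq_no)
        if key in self.responses:
            return self.responses[key]  # retry: replay the recorded response
        result = op()                   # first attempt: execute the DDL
        self.responses[key] = result
        return result

calls = []
def create_table():
    calls.append(1)
    return "created"

cache = ReplayCache()
assert cache.handle("c1", 7, create_table) == "created"
assert cache.handle("c1", 7, create_table) == "created"  # retried RPC
assert len(calls) == 1  # CreateTable executed exactly once
```

In the real system the cached state would have to be replicated from the leader master to its followers along with the write, as the description notes.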





[jira] [Updated] (KUDU-1537) Exactly-once semantics for DDL operations

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1537:
-
Issue Type: Improvement  (was: Bug)

> Exactly-once semantics for DDL operations
> -
>
> Key: KUDU-1537
> URL: https://issues.apache.org/jira/browse/KUDU-1537
> Project: Kudu
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 0.9.1
>Reporter: Adar Dembo
>
> Now that Kudu has a replay cache, we should use it for master DDL operations 
> like CreateTable(), AlterTable(), and DeleteTable(). To do this we'll need to 
> add some client-specific RPC state into the write that the leader master 
> replicates to its followers, and use that state when an RPC is retried.
> Some tests (e.g. master-stress-test) have workarounds that should be removed 
> when this bug is fixed.





[jira] [Resolved] (KUDU-1449) tablet unavailable caused by follower can not upgrade to leader.

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved KUDU-1449.
--
   Resolution: Duplicate
Fix Version/s: n/a

Likely fixed by KUDU-1097 and friends.

> tablet unavailable caused by  follower can not upgrade to leader.
> -
>
> Key: KUDU-1449
> URL: https://issues.apache.org/jira/browse/KUDU-1449
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 0.8.0
> Environment: jd.com production env
>Reporter: zhangsong
>Priority: Critical
> Fix For: n/a
>
>
> 1 Background: 5 nodes crashed today due to system OOM. According to the Raft 
> protocol, Kudu should have elected a follower, promoted it to leader, and 
> resumed service, but it did not.
> Found such error when issuing query via impala: "Unable to open scanner: 
> Timed out: GetTableLocations(flow_first_buy_user_0504, bucket=453, string 
> memberid=, int32 cate3_id=-2147483648, int32 cate2_id=-2147483648, int32 
> cate1_id=-2147483648, int32 chan_type=-2147483648, int32 
> county_id=-2147483648, int32 city_id=-2147483648, int32 
> province_id=-2147483648, 1) failed: timed out after deadline expired: timed 
> out after deadline expired
> "  
> 2 Analysis:
> According to the bucket number, the target tablet only has two replicas, 
> which is odd. Meanwhile, the tablet server hosting the leader replica has 
> crashed.
> The follower cannot be promoted to leader in this situation: with only one 
> leader and one follower, once the leader dies the follower cannot get a 
> majority of votes (only itself votes for itself).
> This leaves the tablet unavailable even though a follower still hosts a 
> replica.
> After restarting kudu-server on the node that hosted the previous leader 
> replica, the leader replica became a follower, the previous follower became 
> the leader, another follower replica was created, and a 3-replica Raft 
> configuration was restored.
> 3 Modification:
> The follower should notice the abnormal situation where there are only two 
> replicas in the Raft configuration (one leader and one follower) and contact 
> the master to correct it.
> 4 To do:
> What caused the two-replica Raft configuration is still unknown.





[jira] [Updated] (KUDU-1736) kudu crash in debug build: unordered undo delta

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1736:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> kudu crash in debug build: unordered undo delta
> ---
>
> Key: KUDU-1736
> URL: https://issues.apache.org/jira/browse/KUDU-1736
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Reporter: zhangsong
>Priority: Critical
> Attachments: mt-tablet-test.txt.gz
>
>
> In the JD cluster we hit a kudu-tserver crash with the following fatal 
> message:
> Check failed: last_key_.CompareTo(key) <= 0 must insert undo deltas in 
> sorted order (ascending key, then descending ts): got key (row 
> 1422@tx6052042821982183424) after (row 1422@tx6052042821953155072)
> This is a DCHECK that should not fail.





[jira] [Resolved] (KUDU-2015) invalidate data format will causing kudu-tserver to crash. and kudu-table will be un available

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved KUDU-2015.
--
   Resolution: Cannot Reproduce
Fix Version/s: n/a

Closing since not getting help reproducing.

> invalidate data format will causing kudu-tserver to crash. and kudu-table 
> will be un available
> --
>
> Key: KUDU-2015
> URL: https://issues.apache.org/jira/browse/KUDU-2015
> Project: Kudu
>  Issue Type: Bug
>  Components: client, impala, tserver
>Affects Versions: 1.1.0
>Reporter: zhangsong
>Priority: Critical
> Fix For: n/a
>
>
> When issuing an INSERT INTO statement through Impala, a malformed insert 
> statement rendered the Kudu table unreadable and crashed the kudu-tservers.
> The test table's schema:
> CREATE EXTERNAL TABLE `cst` (
> `pin` STRING,
> `age` INT
> )
> TBLPROPERTIES(...)
> The insert statement issued was: insert into cst values 
> (("test1",2),("test2",3),("test3",3))
> After the insertion, impala-shell reported success, but a subsequent SELECT 
> on the table failed.
> The kudu-tservers (one leader and two followers) holding the same tablet of 
> the table also crashed.
> FATAL msg on them is : "F0516 20:03:18.752769 39540 
> tablet_peer_mm_ops.cc:128] Check failed: _s.ok() FlushMRS failed on 
> 8ea48349d89d405c94334f832b1bae18: Invalid argument: Failed to finish DRS 
> writer: index too large"
> Fortunately, it was only a test table, though it still took down 3 
> kudu-tservers.





[jira] [Updated] (KUDU-1921) Add ability for clients to require authentication/encryption

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1921:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> Add ability for clients to require authentication/encryption
> 
>
> Key: KUDU-1921
> URL: https://issues.apache.org/jira/browse/KUDU-1921
> Project: Kudu
>  Issue Type: Improvement
>  Components: client, security
>Affects Versions: 1.3.0
>Reporter: Todd Lipcon
>Priority: Critical
>
> Currently, the clients always operate in "optional" mode for authentication 
> and encryption. This means that they are vulnerable to downgrade attacks by a 
> MITM. We should provide APIs so that clients can be configured to prohibit 
> downgrade when connecting to clusters they know to be secure.





[jira] [Updated] (KUDU-1466) C++ client errors misreported as GetTableLocations timeouts

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1466:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> C++ client errors misreported as GetTableLocations timeouts
> ---
>
> Key: KUDU-1466
> URL: https://issues.apache.org/jira/browse/KUDU-1466
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.8.0
>Reporter: Todd Lipcon
>Assignee: Alexey Serbin
>Priority: Critical
>
> client-test is currently very flaky due to this issue:
> - we are injecting some kind of failure on the tablet server (eg DNS 
> resolution failure)
> - when we fail to connect to the TS, we correctly re-trigger a lookup against 
> the master
> - depending how the backoffs and retries line up, we sometimes end up 
> triggering the lookup retry when the remaining operation budget is very short 
> (eg <10ms)
> -- this GetTabletLocations RPC times out since the master is unable to 
> respond within the ridiculously short timeout
> During the course of retrying some operation, we should probably not replace 
> the 'last_error' with a master error, so long as we have had at least one 
> successful master lookup (thus indicating that the master is not the problem)
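A sketch of the suggested retry bookkeeping (hypothetical, in Python): once a master lookup has succeeded at least once, later master timeouts do not overwrite the earlier, more meaningful tserver error.

```python
def run_with_retries(attempts):
    """attempts: list of ("lookup" | "scan", error_or_None), in order."""
    last_error = None
    had_successful_lookup = False
    for kind, err in attempts:
        if kind == "lookup":
            if err is None:
                had_successful_lookup = True
                continue
            # Only let a master error become last_error if the master has
            # never answered us, or we have nothing better to report.
            if not had_successful_lookup or last_error is None:
                last_error = err
        else:  # scan attempt against a tablet server
            if err is None:
                return "ok"
            last_error = err
    return f"failed: {last_error}"

attempts = [
    ("lookup", None),                           # master lookup succeeded
    ("scan", "DNS resolution failure"),         # tserver unreachable
    ("lookup", "GetTableLocations timed out"),  # retry with tiny budget
]
# The reported error is the injected tserver failure, not the master timeout.
assert run_with_retries(attempts) == "failed: DNS resolution failure"
```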





[jira] [Updated] (KUDU-1813) Scan at a specific timestamp doesn't include that timestamp as committed

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1813:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> Scan at a specific timestamp doesn't include that timestamp as committed
> 
>
> Key: KUDU-1813
> URL: https://issues.apache.org/jira/browse/KUDU-1813
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.2.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
>
> Currently, if the user performs the following sequence:
> - Insert a row
> - ts = client_->GetLastObservedTimestamp()
> - create a new scanner with READ_AT_SNAPSHOT set to 'ts'
> they will not observe their own write. This seems to be due to incorrect 
> usage of MvccSnapshot(ts) constructor which says that it considers all writes 
> _before_ 'ts' to be committed, rather than _before or equal to_.
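The off-by-one in miniature (Python sketch, hypothetical names): a snapshot built from the last observed timestamp must treat writes at exactly that timestamp as committed ("before or equal to"), not just writes strictly before it.

```python
def is_committed(write_ts, snapshot_ts, inclusive):
    """Visibility of a write under a snapshot at snapshot_ts."""
    return write_ts <= snapshot_ts if inclusive else write_ts < snapshot_ts

ts = 100  # client_->GetLastObservedTimestamp() after the insert
assert is_committed(100, ts, inclusive=False) is False  # bug: own write missed
assert is_committed(100, ts, inclusive=True) is True    # fix: write visible
```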





[jira] [Updated] (KUDU-801) Delta flush doesn't wait for transactions to commit

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-801:

Target Version/s: 1.6.0  (was: 1.5.0)

> Delta flush doesn't wait for transactions to commit
> ---
>
> Key: KUDU-801
> URL: https://issues.apache.org/jira/browse/KUDU-801
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: Private Beta
>Reporter: Todd Lipcon
>Priority: Critical
>
> I saw a case of mt-tablet-test failing with what I think is the following 
> scenario:
> - transaction applies an update to DMS
> - delta flush happens
> - major delta compaction runs (the update is now part of base data and we 
> have an UNDO)
> - the RS is selected for compaction
> - CHECK failure because the UNDO delta contains something that is not yet 
> committed.
> We probably need to ensure that we don't Flush data which isn't yet committed 
> from an MVCC standpoint.
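The proposed fix can be sketched as a flush that partitions the DMS by MVCC commit status, flushing only committed updates and keeping in-flight ones in memory for a later flush (hypothetical structure, in Python).

```python
def flush_dms(dms, committed_txns):
    """Split DMS updates: flush committed ones, retain in-flight ones."""
    flushed = [u for u in dms if u["txn"] in committed_txns]
    remaining = [u for u in dms if u["txn"] not in committed_txns]
    return flushed, remaining

dms = [{"txn": 1, "row": 7}, {"txn": 2, "row": 9}]  # txn 2 not yet committed
flushed, remaining = flush_dms(dms, committed_txns={1})
assert [u["txn"] for u in flushed] == [1]    # only committed updates flushed
assert [u["txn"] for u in remaining] == [2]  # in-flight update stays in DMS
```

With this split, an uncommitted update can never reach the base data or an UNDO file, so the compaction CHECK above cannot trip over it.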





[jira] [Updated] (KUDU-2044) Tombstoned tablets show up in /metrics

2017-08-25 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo updated KUDU-2044:
-
Code Review: https://gerrit.cloudera.org/7618

> Tombstoned tablets show up in /metrics
> --
>
> Key: KUDU-2044
> URL: https://issues.apache.org/jira/browse/KUDU-2044
> Project: Kudu
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>Assignee: Adar Dembo
>Priority: Critical
>  Labels: newbie
>
> They probably shouldn't be there.
> Furthermore, tablets tombstoned by the current process (i.e. by the current 
> run of the tserver/master) are present, but tablets tombstoned by a past run 
> aren't.





[jira] [Assigned] (KUDU-1807) GetTableSchema() is O(n) in the number of tablets

2017-08-25 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo reassigned KUDU-1807:


Assignee: Adar Dembo

> GetTableSchema() is O(n) in the number of tablets
> -
>
> Key: KUDU-1807
> URL: https://issues.apache.org/jira/browse/KUDU-1807
> Project: Kudu
>  Issue Type: Sub-task
>  Components: master, perf
>Affects Versions: 1.2.0
>Reporter: Todd Lipcon
>Assignee: Adar Dembo
>Priority: Critical
>  Labels: data-scalability
>
> GetTableSchema calls TableInfo::IsCreateTableDone. This method checks each 
> tablet for whether it is in the correct state, which requires acquiring the 
> RWC lock for every tablet. This is somewhat slow for large tables with 
> thousands of tablets, and this is actually a relatively hot path because 
> every task in an Impala query ends up calling GetTableSchema() when it opens 
> its scanner.
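One way to take IsCreateTableDone() off the hot path, sketched in Python with hypothetical names: latch the answer once creation completes, so only the first GetTableSchema() call pays the O(n) walk over tablet states.

```python
class TableInfo:
    def __init__(self, tablet_states):
        self.tablet_states = tablet_states
        self._create_done = False  # latched once true; creation never un-happens
        self.checks = 0            # instrumentation for the sketch

    def is_create_table_done(self):
        if self._create_done:
            return True
        for state in self.tablet_states:  # O(n) only until creation finishes
            self.checks += 1
            if state != "RUNNING":
                return False
        self._create_done = True
        return True

t = TableInfo(["RUNNING"] * 1000)
assert t.is_create_table_done() and t.checks == 1000  # first call pays O(n)
assert t.is_create_table_done() and t.checks == 1000  # later calls are O(1)
```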





[jira] [Assigned] (KUDU-2044) Tombstoned tablets show up in /metrics

2017-08-25 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo reassigned KUDU-2044:


Assignee: Adar Dembo  (was: Will Berkeley)

> Tombstoned tablets show up in /metrics
> --
>
> Key: KUDU-2044
> URL: https://issues.apache.org/jira/browse/KUDU-2044
> Project: Kudu
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>Assignee: Adar Dembo
>Priority: Critical
>  Labels: newbie
>
> They probably shouldn't be there.
> Furthermore, tablets tombstoned by the current process (i.e. by the current 
> run of the tserver/master) are present, but tablets tombstoned by a past run 
> aren't.





[jira] [Updated] (KUDU-2044) Tombstoned tablets show up in /metrics

2017-08-25 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo updated KUDU-2044:
-
Status: In Review  (was: In Progress)

> Tombstoned tablets show up in /metrics
> --
>
> Key: KUDU-2044
> URL: https://issues.apache.org/jira/browse/KUDU-2044
> Project: Kudu
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>Assignee: Adar Dembo
>Priority: Critical
>  Labels: newbie
>
> They probably shouldn't be there.
> Furthermore, tablets tombstoned by the current process (i.e. by the current 
> run of the tserver/master) are present, but tablets tombstoned by a past run 
> aren't.





[jira] [Resolved] (KUDU-2037) ts_recovery-itest flaky since KUDU-1034 fixed

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved KUDU-2037.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Was fixed in 681f05b431a6fe62370feb439dd0756d9eefe07d.

> ts_recovery-itest flaky since KUDU-1034 fixed
> -
>
> Key: KUDU-2037
> URL: https://issues.apache.org/jira/browse/KUDU-2037
> Project: Kudu
>  Issue Type: Bug
>  Components: client, test
>Affects Versions: 1.4.0
>Reporter: Todd Lipcon
>Assignee: Alexey Serbin
>Priority: Critical
> Fix For: 1.5.0
>
>
> ts_recovery-itest is quite flaky lately (~50% in TSAN builds). I was able to 
> reproduce the flakiness reliably doing:
> {code}
> taskset -c 0-1 ./build-support/run-test.sh 
> ./build/latest/bin/ts_recovery-itest --gtest_filter=\*Orphan\* 
> -stress-cpu-threads 4
> {code}
> I tracked the flakiness down to being introduced by KUDU-1034 
> (4263b037844fca595a35f99479fbb5765ba7a443). The issue seems to be that the 
> test sets a low timeout such that a large number of requests time out, and 
> with the new behavior introduced by that commit, we end up hammering the 
> master and unable to make progress.
> Unclear if this is a feature (and we need to update the test) or a bug.





[jira] [Updated] (KUDU-1454) Spark and MR jobs running without scan locality

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1454:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> Spark and MR jobs running without scan locality
> ---
>
> Key: KUDU-1454
> URL: https://issues.apache.org/jira/browse/KUDU-1454
> Project: Kudu
>  Issue Type: Bug
>  Components: client, perf, spark
>Affects Versions: 0.8.0
>Reporter: Todd Lipcon
>Priority: Critical
>
> Spark (and according to [~danburkert] MR also now) add all of the locations 
> of a tablet as split locations. This makes sense except that the Java client 
> currently always scans the leader replica. So in many cases we schedule a 
> task which is "local" to a follower, and then it ends up having to do a 
> remote scan.
> This makes Spark queries take about twice as long on tables with replicas 
> compared to unreplicated tables, and I think is a regression on the MR side.
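The description above boils down to a replica-selection problem: split locations advertise every replica, but the scan always goes to the leader. A minimal sketch of the intended fix, under the assumption that the client can see which replica is co-located with the task (function and data shapes here are illustrative, not the actual Kudu client API):

```python
import socket

def pick_replica(replicas, local_host=None):
    """Prefer a replica co-located with this task; fall back to the leader.

    `replicas` is a list of (host, is_leader) pairs, a simplified stand-in
    for the tablet locations a client would get from the master.
    """
    local_host = local_host or socket.gethostname()
    for host, _ in replicas:
        if host == local_host:
            return host  # local scan: no network hop
    # No local replica: scan the leader (the Java client's behavior at the time).
    return next(host for host, is_leader in replicas if is_leader)
```

With this policy, a task scheduled "local" to a follower actually scans that follower instead of doing a remote read from the leader.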





[jira] [Updated] (KUDU-1587) Memory-based backpressure is insufficient on seek-bound workloads

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1587:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> Memory-based backpressure is insufficient on seek-bound workloads
> -
>
> Key: KUDU-1587
> URL: https://issues.apache.org/jira/browse/KUDU-1587
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 0.10.0
>Reporter: Todd Lipcon
>Priority: Critical
> Attachments: graph.png, queue-time.png
>
>
> I pushed a uniform random insert workload from a bunch of clients to the 
> point that the vast majority of bloom filters no longer fit in buffer cache, 
> and the compaction had fallen way behind. Thus, every inserted row turns into 
> 40+ seeks (due to non-compact data) and takes 400-500ms. In this kind of 
> workload, the current backpressure (based on memory usage) is insufficient to 
> prevent ridiculously long queues.
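The report argues that pressure based on memory alone misses the seek-bound case, where queues grow long even though memory stays modest. One way to express the combined check the issue is asking for, as a sketch with illustrative thresholds (not Kudu's actual flags or defaults):

```python
def should_reject(mem_used, mem_limit, queue_wait_ms, max_queue_wait_ms=500):
    """Combined backpressure check: push back when memory is hot OR when
    ops have been waiting too long in the service queue (seek-bound case).

    The 80% memory threshold and 500 ms queue-wait limit are placeholders.
    """
    if mem_used >= 0.8 * mem_limit:          # memory-based pressure
        return True
    if queue_wait_ms >= max_queue_wait_ms:   # time-based pressure
        return True
    return False
```

Under a seek-bound workload the second branch fires long before the first, bounding queue length even when memory usage looks healthy.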





[jira] [Updated] (KUDU-428) Support for service/table/column authorization

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-428:

Target Version/s: 1.6.0  (was: 1.5.0)

> Support for service/table/column authorization
> --
>
> Key: KUDU-428
> URL: https://issues.apache.org/jira/browse/KUDU-428
> Project: Kudu
>  Issue Type: New Feature
>  Components: master, security, tserver
>Affects Versions: 1.2.0
>Reporter: Todd Lipcon
>Priority: Critical
>  Labels: kudu-roadmap
>
> We need to support basic SQL-like access control:
> - grant/revoke on tables, columns
> - service-level grant/revoke
> - probably need some group/role mapping infrastructure as well





[jira] [Updated] (KUDU-869) Support PRE_VOTER config membership type

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-869:

Target Version/s: 1.6.0  (was: 1.5.0)

> Support PRE_VOTER config membership type
> 
>
> Key: KUDU-869
> URL: https://issues.apache.org/jira/browse/KUDU-869
> Project: Kudu
>  Issue Type: Sub-task
>  Components: consensus
>Affects Versions: Feature Complete
>Reporter: Mike Percy
>Assignee: Mike Percy
>Priority: Critical
>
> A PRE_VOTER membership type will reduce unavailability when bootstrapping new 
> nodes. See the remote bootstrap spec @ 
> https://docs.google.com/document/d/1zSibYnwPv9cFRnWn0ORyu2uCGB9Neb-EsF0M6AiMSEE
>  for details





[jira] [Updated] (KUDU-1788) Raft UpdateConsensus retry behavior on timeout is counter-productive

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1788:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> Raft UpdateConsensus retry behavior on timeout is counter-productive
> 
>
> Key: KUDU-1788
> URL: https://issues.apache.org/jira/browse/KUDU-1788
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.1.0
>Reporter: Todd Lipcon
>Priority: Critical
>
> In a stress test, I've seen the following counter-productive behavior:
> - a leader is trying to send operations to a replica (eg a 10MB batch)
> - the network is constrained due to other activity, so sending 10MB may take 
> >1sec
> - the request times out on the client side, likely while it was still in the 
> process of sending the batch
> - when the server receives it, it is likely to have timed out while waiting 
> in the queue. Or, it will receive it and upon processing find that it is all 
> duplicate ops from the previous attempt
> - the client has no idea whether the server received it or not, and thus 
> keeps retrying the same batch (triggering the same timeout)
> This tends to be a "sticky"/cascading sort of state: after one such timeout, 
> the follower will be lagging behind more, and the next batch will be larger 
> (up to the configured max batch size). The client neither backs off nor 
> increases its timeout, so it will basically just keep the network pipe full 
> of useless redundant updates
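The last point suggests the remedy: back off between attempts and grow the timeout so a slow follower can actually finish receiving a batch. A sketch of such a retry policy, with illustrative parameter names and constants (not Kudu's actual consensus code):

```python
import random

def next_retry(timeout_ms, backoff_ms, attempt, max_timeout_ms=60000):
    """Return (new_timeout_ms, delay_ms) for the next UpdateConsensus attempt.

    Doubling the timeout gives a constrained network time to deliver the
    whole batch; exponential backoff with jitter stops the leader from
    keeping the pipe full of redundant retransmissions.
    """
    new_timeout = min(timeout_ms * 2, max_timeout_ms)
    delay = backoff_ms * (2 ** attempt)
    delay += random.uniform(0, delay * 0.1)  # jitter to de-synchronize peers
    return new_timeout, delay
```

With a fixed timeout and no backoff, every retry of a 10 MB batch hits the same >1 s transfer wall; with this policy the timeout quickly exceeds the transfer time and the retries thin out.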





[jira] [Commented] (KUDU-1869) Scans do not work with hybrid time disabled and snapshot reads enabled

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142226#comment-16142226
 ] 

Jean-Daniel Cryans commented on KUDU-1869:
--

[~dralves] Are you working on this?

> Scans do not work with hybrid time disabled and snapshot reads enabled
> --
>
> Key: KUDU-1869
> URL: https://issues.apache.org/jira/browse/KUDU-1869
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.2.0
> Environment: Centos 6.6
> kudu 1.2.0-cdh5.10.0
> revision 01748528baa06b78e04ce9a799cc60090a821162
> build type RELEASE
> 6 nodes, 5 tservers/impalads
>Reporter: Matthew Jacobs
>Assignee: David Alves
>Priority: Critical
>  Labels: impala
>
> With {{-use_hybrid_clock=false}} and scanning with SNAPSHOT_READ, all scans 
> appear to be timing out with the following error message:
> {code}
> [vc0736.halxg.cloudera.com:21000] > SELECT COUNT(*) FROM 
> functional_kudu.alltypes;
> Query: select COUNT(*) FROM functional_kudu.alltypes
> Query submitted at: 2017-02-10 09:50:02 (Coordinator: 
> http://vc0736.halxg.cloudera.com:25000)
> Query progress can be monitored at: 
> http://vc0736.halxg.cloudera.com:25000/query_plan?query_id=ff48eb0af82f057e:f2c1e2f8
> WARNINGS: 
> Unable to open scanner: Timed out: unable to retry before timeout: Remote 
> error: Service unavailable: Timed out: could not wait for desired snapshot 
> timestamp to be consistent: Timed out waiting for ts: L: 3632307 to be safe 
> (mode: NON-LEADER). Current safe time: L: 3598991 Physical time difference: 
> None (Logical clock): Remote error: Service unavailable: Timed out: could not 
> wait for desired snapshot timestamp to be consistent: Timed out waiting for 
> ts: L: 3625836 to be safe (mode: NON-LEADER). Current safe time: L: 3593993 
> Physical time difference: None (Logical clock)
> {code}
> This is a severe issue for Impala which aims to keep SNAPSHOT_READ enabled.





[jira] [Updated] (KUDU-1125) Reduce impact of enabling fsync on the master

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1125:
-
Priority: Major  (was: Critical)

> Reduce impact of enabling fsync on the master
> -
>
> Key: KUDU-1125
> URL: https://issues.apache.org/jira/browse/KUDU-1125
> Project: Kudu
>  Issue Type: Sub-task
>  Components: master
>Affects Versions: Feature Complete
>Reporter: Jean-Daniel Cryans
>  Labels: data-scalability
>
> First time running ITBLL since we enabled fsync in the master and I'm now 
> seeing RPCs timing out because the master is always ERROR_SERVER_TOO_BUSY. In 
> the log I can see a lot of elections going on and the queue is always full.





[jira] [Commented] (KUDU-577) Specification for expected semantics and client modes

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142225#comment-16142225
 ] 

Jean-Daniel Cryans commented on KUDU-577:
-

[~dr-alves] same question as Todd above.

> Specification for expected semantics and client modes
> -
>
> Key: KUDU-577
> URL: https://issues.apache.org/jira/browse/KUDU-577
> Project: Kudu
>  Issue Type: Sub-task
>  Components: api, client
>Affects Versions: M4.5
>Reporter: Jean-Daniel Cryans
>Assignee: David Alves
>Priority: Critical
>
> We need a detailed description of what the different client modes are and 
> what it means the clients should do. This is to ensure that both terminology 
> and behavior match between languages.





[jira] [Updated] (KUDU-1762) suspected tablet memory leak

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1762:
-
Priority: Major  (was: Critical)

> suspected tablet memory leak
> 
>
> Key: KUDU-1762
> URL: https://issues.apache.org/jira/browse/KUDU-1762
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.0.1
> Environment: CentOS 6.5
> Kudu 1.0.1 (rev e60b610253f4303b24d41575f7bafbc5d69edddb)
>Reporter: Fu Lili
> Attachments: 0B2CE7BB-EF26-4EA1-B824-3584D7D79256.png, 
> kudu_heap_prof_20161206.tar.gz, mem_rss_graph_2016_12_19.png, 
> server02_30day_rss_before_and_after_mrs_flag_2.png, 
> server02_30day_rss_before_and_after_mrs_flag.png, tserver_smaps1
>
>
> here is the memory total info:
> {quote}
> 
> MALLOC: 1691715680 ( 1613.3 MiB) Bytes in use by application
> MALLOC: +178733056 (  170.5 MiB) Bytes in page heap freelist
> MALLOC: + 37483104 (   35.7 MiB) Bytes in central cache freelist
> MALLOC: +  4071488 (3.9 MiB) Bytes in transfer cache freelist
> MALLOC: + 13739264 (   13.1 MiB) Bytes in thread cache freelists
> MALLOC: + 12202144 (   11.6 MiB) Bytes in malloc metadata
> MALLOC:   
> MALLOC: =   1937944736 ( 1848.2 MiB) Actual memory used (physical + swap)
> MALLOC: +   311296 (0.3 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   
> MALLOC: =   1938256032 ( 1848.5 MiB) Virtual address space used
> MALLOC:
> MALLOC: 174694  Spans in use
> MALLOC:201  Thread heaps in use
> MALLOC:   8192  Tcmalloc page size
> 
> Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
> Bytes released to the OS take up virtual address space but no physical memory.
> {quote}
> but in the memory detail, the sum of all the sub-trackers' Current Consumption 
> is far less than the root Current Consumption.
> ||Id||Parent||Limit||Current Consumption||Peak consumption||
> |root|none|4.00G|1.58G|1.74G|
> |log_cache|root|1.00G|480.8K|5.32M|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:70c8d889b0314b04a240fcb02c24a012|log_cache|128.00M|160B|160B|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:16d3c8193579445f8f766da6c7abc237|log_cache|128.00M|160B|160B|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:2c69c5cb9eb04eb48323a9268afc36a7|log_cache|128.00M|160B|160B|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:2b11d9220dab4a5f952c5b1c10a68ccd|log_cache|128.00M|69.2K|139.5K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:cec045be60af4f759497234d8815238b|log_cache|128.00M|68.6K|138.7K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:cea7a54cebd242e4997da641f5b32e3a|log_cache|128.00M|68.5K|139.3K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:9625dfde17774690a888b55024ac797a|log_cache|128.00M|68.5K|140.0K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:6046b33901ca43d0975f59cf7e491186|log_cache|128.00M|0B|133.0K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:1a18ab0915f0407b922fa7ecbe7a2f46|log_cache|128.00M|0B|132.6K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:ac54d1c1813a4e39943971cb56f248ef|log_cache|128.00M|0B|130.5K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:4438580df6cc4d469393b9d6adee68d8|log_cache|128.00M|0B|131.2K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:2f1cef7d2a494575b941baa22b8a3dc9|log_cache|128.00M|0B|131.6K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:d2ad22d202c04b2d98f1c5800df1c3b5|log_cache|128.00M|0B|132.5K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:b19b21d6b4c84f9895aad9e81559d019|log_cache|128.00M|0B|131.0K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:27e9531cd5814b1c9637493f05860b19|log_cache|128.00M|0B|131.1K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:425a19940239447faa0eaab4e380d644|log_cache|128.00M|68.5K|146.9K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:178bd7bc39a941a887f393b0a7848066|log_cache|128.00M|68.5K|139.9K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:91524acd28a440318918f11292ac8fdc|log_cache|128.00M|0B|132.0K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:be6f093aabf9460b97fc35dd026820b6|log_cache|128.00M|0B|130.4K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:dd8dd794f0f44426a3c46ce8f4b54652|log_cache|128.00M|0B|131.2K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:ed128ca7b19c4e3eaa48e9e3eb341492|log_cache|128.00M|68.5K|141.5K|
> |block_cache-sharded_lru_cache|root|none|257.05M|257.05M|
> |code_cache-sharded_lru_cache|root|none|112B|113B|
> |server|root|none|2.06M|121.97M|
> |tablet-70c8d889b0314b04a240fcb02c24a012|server|none|265B|265B|
> |txn_tracker|tablet-70c8d889b0314b04a240fcb02c24a012|64.00M|0B|0B|
> |MemRowSet-0|tablet-70c8d889b0314b04a240fcb02c24a012|none|265B|265B|
> 

[jira] [Commented] (KUDU-582) Send TS specific errors back to the client when the client is supposed to take specific actions, such as trying another replica

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142224#comment-16142224
 ] 

Jean-Daniel Cryans commented on KUDU-582:
-

Hey [~dr-alves], what should we do with this jira? It feels ancient and not 
that well defined. Should we close?

> Send TS specific errors back to the client when the client is supposed to 
> take specific actions, such as trying another replica
> ---
>
> Key: KUDU-582
> URL: https://issues.apache.org/jira/browse/KUDU-582
> Project: Kudu
>  Issue Type: Bug
>  Components: client, consensus, tserver
>Affects Versions: M4.5
>Reporter: David Alves
>Priority: Critical
>
> Right now we're sending umbrella statuses that the client is supposed to 
> interpret as a command that it should failover to another replica. This is 
> misusing statuses but it's also a problem in that we're likely (or will 
> likely) sending the same statuses (illegal state and abort) in places where 
> we don't mean the client to failover.
> This should be treated holistically in both clients and in the server 
> components.





[jira] [Updated] (KUDU-1839) DNS failure during tablet creation lead to undeletable tablet

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1839:
-
Priority: Major  (was: Critical)

> DNS failure during tablet creation lead to undeletable tablet
> -
>
> Key: KUDU-1839
> URL: https://issues.apache.org/jira/browse/KUDU-1839
> Project: Kudu
>  Issue Type: Bug
>  Components: master, tablet
>Affects Versions: 1.2.0
>Reporter: Adar Dembo
>
> During a YCSB workload, two tservers died due to DNS resolution timeouts. For 
> example: 
> {noformat}
> F0117 09:21:14.952937  8392 raft_consensus.cc:1985] Check failed: _s.ok() Bad 
> status: Network error: Could not obtain a remote proxy to the peer.: Unable 
> to resolve address 've0130.halxg.cloudera.com': Name or service not known
> {noformat}
> It's not clear why this happened; perhaps table creation places an inordinate 
> strain on DNS due to concurrent resolution load from all the bootstrapping 
> peers.
> In any case, when these tservers were restarted, two tablets failed to 
> bootstrap, both for the same reason. I'll focus on just one tablet from here 
> on out to simplify troubleshooting:
> {noformat}
> E0117 15:35:45.567312 85124 ts_tablet_manager.cc:749] T 
> 8c167c441a7d44b8add737d13797e694 P 7425c65d80f54f2da0a85494a5eb3e68: Tablet 
> failed to bootstrap: Not found: Unable to load Consensus metadata: 
> /data/2/kudu/consensus-meta/8c167c441a7d44b8add737d13797e694: No such file or 
> directory (error 2)
> {noformat}
> Eventually, the master decided to delete this tablet:
> {noformat}
> I0117 15:42:32.119601 85166 tablet_service.cc:672] Processing DeleteTablet 
> for tablet 8c167c441a7d44b8add737d13797e694 with delete_type 
> TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new 
> config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
> I0117 15:42:32.139128 85166 tablet_service.cc:672] Processing DeleteTablet 
> for tablet 8c167c441a7d44b8add737d13797e694 with delete_type 
> TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new 
> config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
> I0117 15:42:32.181843 85166 tablet_service.cc:672] Processing DeleteTablet 
> for tablet 8c167c441a7d44b8add737d13797e694 with delete_type 
> TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new 
> config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
> I0117 15:42:32.276289 85166 tablet_service.cc:672] Processing DeleteTablet 
> for tablet 8c167c441a7d44b8add737d13797e694 with delete_type 
> TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new 
> config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
> {noformat}
> As can be seen by the presence of multiple deletion requests, each one 
> failed. It's annoying that the tserver didn't log why. But the master did:
> {noformat}
> I0117 15:42:32.117022 33903 catalog_manager.cc:2758] Sending 
> DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 
> 8c167c441a7d44b8add737d13797e694 on 7425c65d80f54f2da0a85494a5eb3e68 
> (ve0122.halxg.cloudera.com:7050) (TS 7425c65d80f54f2da0a85494a5eb3e68 not 
> found in new config with opid_index 29)
> W0117 15:42:32.117463 33890 catalog_manager.cc:2725] TS 
> 7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.cloudera.com:7050): delete 
> failed for tablet 8c167c441a7d44b8add737d13797e694 with error code 
> TABLET_NOT_RUNNING: Illegal state: Consensus not available. Tablet shutting 
> down
> I0117 15:42:32.117491 33890 catalog_manager.cc:2522] Scheduling retry of 
> 8c167c441a7d44b8add737d13797e694 Delete Tablet RPC for 
> TS=7425c65d80f54f2da0a85494a5eb3e68 with a delay of 19ms (attempt = 1)...
> {noformat}
> This isn't a fatal error as far as the master is concerned, so it retries the 
> deletion forever.
> Meanwhile, the broken replica of this tablet still appears to be part of the 
> replication group. At least, that's true as far as both the master web UI and 
> the tserver web UI are concerned. The leader tserver is logging this error 
> repeatedly:
> {noformat}
> W0117 16:38:04.797828 81809 consensus_peers.cc:329] T 
> 8c167c441a7d44b8add737d13797e694 P 335d132897de4bdb9b87443f2c487a42 -> Peer 
> 7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.cloudera.com:7050): Couldn't 
> send request to peer 7425c65d80f54f2da0a85494a5eb3e68 for tablet 
> 8c167c441a7d44b8add737d13797e694. Error code: TABLET_NOT_RUNNING (12). 
> Status: Illegal state: Tablet not RUNNING: FAILED: Not found: Unable to load 
> Consensus metadata: 
> /data/2/kudu/consensus-meta/8c167c441a7d44b8add737d13797e694: No such file or 
> directory (error 2). Retrying in the next heartbeat period. Already tried 
>  times.
> {noformat}
> It's not clear to me exactly what state the 

[jira] [Updated] (KUDU-2044) Tombstoned tablets show up in /metrics

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-2044:
-
Target Version/s: 1.6.0

> Tombstoned tablets show up in /metrics
> --
>
> Key: KUDU-2044
> URL: https://issues.apache.org/jira/browse/KUDU-2044
> Project: Kudu
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>Assignee: Will Berkeley
>Priority: Critical
>  Labels: newbie
>
> They probably shouldn't be there.
> Furthermore, tablets tombstoned by the current process (i.e. by the current 
> run of the tserver/master) are present, but tablets tombstoned by a past run 
> aren't.





[jira] [Updated] (KUDU-2050) Avoid peer eviction during block manager startup

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-2050:
-
Target Version/s: 1.6.0

> Avoid peer eviction during block manager startup
> 
>
> Key: KUDU-2050
> URL: https://issues.apache.org/jira/browse/KUDU-2050
> Project: Kudu
>  Issue Type: Bug
>  Components: fs, tserver
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>Priority: Critical
>
> In larger deployments we've observed that opening the block manager can take 
> a really long time, like tens of minutes or sometimes even hours. This is 
> especially true as of 1.4 where the log block manager tries to optimize 
> on-disk data structures during startup.
> The default time to Raft peer eviction is 5 minutes. If one node is restarted 
> and LBM startup takes over 5 minutes, or if all nodes are restarted and 
> there's over 5 minutes of LBM startup time variance across them, the "slow" 
> node could have all of its replicas evicted. Besides generating a lot of 
> unnecessary work in rereplication, this effectively "defeats" the LBM 
> optimizations in that it would have been equally slow (but more efficient) to 
> reformat the node instead.
> So, let's reorder startup such that LBM startup counts towards replica 
> bootstrapping. One idea: adjust FsManager startup so that tablet-meta/cmeta 
> files can be accessed early to construct bootstrapping replicas, but to defer 
> opening of the block manager until after that time.





[jira] [Updated] (KUDU-2110) RPC footer may be appended more than once

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-2110:
-
Target Version/s:   (was: 1.5.0)

> RPC footer may be appended more than once
> -
>
> Key: KUDU-2110
> URL: https://issues.apache.org/jira/browse/KUDU-2110
> Project: Kudu
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 1.5.0
>Reporter: Michael Ho
>Assignee: Michael Ho
>Priority: Blocker
>  Labels: 1.6.0
>
> The fix for KUDU-2065 included a footer to RPC messages. The footer is 
> appended during the beginning of the transmission of the outbound transfer 
> for an outbound call. However, the check for the beginning of transmission 
> for an outbound call isn't quite correct as it's possible for an outbound 
> transfer to not send anything in Transfer::SendBuffer().
> {noformat}
> // Transfer for outbound call must call StartCallTransfer() before 
> transmission can
> // begin to append footer to the payload if the remote supports it.
> if (!transfer->TransferStarted() &&
> transfer->is_for_outbound_call() &&
> !StartCallTransfer(transfer)) {
>   OutboundQueuePopFront();
>   continue;
> }
> {noformat}
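The core problem described above is that "has transmission started" is inferred from state that can be true even when zero bytes were actually sent, so the footer can be appended again. A toy model of the safer approach, tracking footer state explicitly rather than inferring it (class and method names are illustrative, not Kudu's actual C++ types):

```python
class OutboundTransfer:
    """Toy model of the KUDU-2110 race: append the footer at most once by
    recording that fact explicitly, instead of deducing it from whether
    any bytes have been transferred yet."""

    def __init__(self, payload):
        self.payload = payload
        self.footer_appended = False

    def maybe_append_footer(self, footer):
        # Idempotent: a retried or re-entered send path cannot double-append.
        if not self.footer_appended:
            self.payload += footer
            self.footer_appended = True

t = OutboundTransfer(b"msg")
t.maybe_append_footer(b"|F")
t.maybe_append_footer(b"|F")  # second call is a no-op
```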





[jira] [Updated] (KUDU-1956) Crash with "rowset selected for compaction but not available anymore"

2017-08-25 Thread Will Berkeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Berkeley updated KUDU-1956:

Fix Version/s: (was: 1.5.0)
   1.4.0

> Crash with "rowset selected for compaction but not available anymore"
> -
>
> Key: KUDU-1956
> URL: https://issues.apache.org/jira/browse/KUDU-1956
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.3.0
>Reporter: Todd Lipcon
>Priority: Critical
> Fix For: 1.4.0
>
>
> I loaded 1T of data into a server with 8 MM threads configured, and a patch 
> to make the MM thread wake up and do scheduling as soon as any prior op 
> finished. After a day or two of runtime the TS crashed with:
> E0324 14:28:19.733708  5801 tablet.cc:1207] T 
> cf12905ff7d84fa0b4148aab292f0c40 P 40b48ecb131449c58df26f62ccc35538: Rowset 
> selected for compaction but not available anymore: RowSet(22755)
> E0324 14:28:19.733762  5801 tablet.cc:1207] T 
> cf12905ff7d84fa0b4148aab292f0c40 P 40b48ecb131449c58df26f62ccc35538: Rowset 
> selected for compaction but not available anymore: RowSet(24031)
> F0324 14:28:19.733777  5801 tablet.cc:1210] T 
> cf12905ff7d84fa0b4148aab292f0c40 P 40b48ecb131449c58df26f62ccc35538: Was 
> unable to find all rowsets selected for compaction





[jira] [Comment Edited] (KUDU-1956) Crash with "rowset selected for compaction but not available anymore"

2017-08-25 Thread Will Berkeley (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142087#comment-16142087
 ] 

Will Berkeley edited comment on KUDU-1956 at 8/25/17 7:44 PM:
--

Since the server no longer crashes when the race occurs, we'll consider this 
resolved, and track fixing the underlying race in KUDU-2115.


was (Author: wdberkeley):
Since the server no longer crashes when the race occurs, we'll consider this 
resolved, and tracked fixing the underlying race in KUDU-2115.

> Crash with "rowset selected for compaction but not available anymore"
> -
>
> Key: KUDU-1956
> URL: https://issues.apache.org/jira/browse/KUDU-1956
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.3.0
>Reporter: Todd Lipcon
>Priority: Critical
> Fix For: 1.4.0
>
>
> I loaded 1T of data into a server with 8 MM threads configured, and a patch 
> to make the MM thread wake up and do scheduling as soon as any prior op 
> finished. After a day or two of runtime the TS crashed with:
> E0324 14:28:19.733708  5801 tablet.cc:1207] T 
> cf12905ff7d84fa0b4148aab292f0c40 P 40b48ecb131449c58df26f62ccc35538: Rowset 
> selected for compaction but not available anymore: RowSet(22755)
> E0324 14:28:19.733762  5801 tablet.cc:1207] T 
> cf12905ff7d84fa0b4148aab292f0c40 P 40b48ecb131449c58df26f62ccc35538: Rowset 
> selected for compaction but not available anymore: RowSet(24031)
> F0324 14:28:19.733777  5801 tablet.cc:1210] T 
> cf12905ff7d84fa0b4148aab292f0c40 P 40b48ecb131449c58df26f62ccc35538: Was 
> unable to find all rowsets selected for compaction





[jira] [Resolved] (KUDU-1956) Crash with "rowset selected for compaction but not available anymore"

2017-08-25 Thread Will Berkeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Berkeley resolved KUDU-1956.
-
  Resolution: Fixed
   Fix Version/s: 1.5.0
Target Version/s:   (was: 1.5.0)

Since the server no longer crashes when the race occurs, we'll consider this 
resolved, and track fixing the underlying race in KUDU-2115.

> Crash with "rowset selected for compaction but not available anymore"
> -
>
> Key: KUDU-1956
> URL: https://issues.apache.org/jira/browse/KUDU-1956
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.3.0
>Reporter: Todd Lipcon
>Priority: Critical
> Fix For: 1.5.0
>
>
> I loaded 1T of data into a server with 8 MM threads configured, and a patch 
> to make the MM thread wake up and do scheduling as soon as any prior op 
> finished. After a day or two of runtime the TS crashed with:
> E0324 14:28:19.733708  5801 tablet.cc:1207] T 
> cf12905ff7d84fa0b4148aab292f0c40 P 40b48ecb131449c58df26f62ccc35538: Rowset 
> selected for compaction but not available anymore: RowSet(22755)
> E0324 14:28:19.733762  5801 tablet.cc:1207] T 
> cf12905ff7d84fa0b4148aab292f0c40 P 40b48ecb131449c58df26f62ccc35538: Rowset 
> selected for compaction but not available anymore: RowSet(24031)
> F0324 14:28:19.733777  5801 tablet.cc:1210] T 
> cf12905ff7d84fa0b4148aab292f0c40 P 40b48ecb131449c58df26f62ccc35538: Was 
> unable to find all rowsets selected for compaction





[jira] [Created] (KUDU-2115) Fix rowset compaction selection race

2017-08-25 Thread Will Berkeley (JIRA)
Will Berkeley created KUDU-2115:
---

 Summary: Fix rowset compaction selection race
 Key: KUDU-2115
 URL: https://issues.apache.org/jira/browse/KUDU-2115
 Project: Kudu
  Issue Type: Bug
  Components: tablet
Affects Versions: 1.3.0
Reporter: Will Berkeley


KUDU-1956 identified a race in selecting rowsets for compaction. Todd applied a 
workaround that prevents the server from crashing when this happens, but the 
race itself remains.





[jira] [Commented] (KUDU-2096) Document necessary configuration for Kerberos with master CNAMEs

2017-08-25 Thread Attila Bukor (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142055#comment-16142055
 ] 

Attila Bukor commented on KUDU-2096:


[~tlipcon] what's the plan with this JIRA now that KUDU-2103 is fixed and 
merged in?

> Document necessary configuration for Kerberos with master CNAMEs
> 
>
> Key: KUDU-2096
> URL: https://issues.apache.org/jira/browse/KUDU-2096
> Project: Kudu
>  Issue Type: Task
>  Components: documentation, security
>Reporter: Todd Lipcon
>
> Currently our docs recommend using CNAMEs for master addresses to simplify 
> moving them around. However, if clients connect to a master with its 
> non-canonical name, there are some complications with Kerberos principals, 
> etc. We should test and document the necessary steps for such a configuration.





[jira] [Updated] (KUDU-2103) Java client doesn't work on a Kerberized cluster with DNS aliases for masters

2017-08-25 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-2103:
--
Fix Version/s: 1.4.1
   1.3.2

> Java client doesn't work on a Kerberized cluster with DNS aliases for masters
> -
>
> Key: KUDU-2103
> URL: https://issues.apache.org/jira/browse/KUDU-2103
> Project: Kudu
>  Issue Type: Bug
>  Components: java, security
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Attila Bukor
>Assignee: Attila Bukor
> Fix For: 1.3.2, 1.5.0, 1.4.1
>
>
> The Java client doesn't canonicalize the master_addresses when requesting 
> service tickets. This means that when the Java client uses DNS aliases for 
> the master hosts in a Kerberized environment, it fails to connect to the 
> masters, because the KDC can't find a matching service principal.
> As discussed in KUDU-2096, the client should use the lower-case version of 
> the canonical hostname for this to work properly.
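The fix described above can be sketched in client-side Java. This is an illustrative helper, not the actual Kudu client code; the class and method names are invented for this sketch:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;

// Hypothetical sketch of the KUDU-2103 fix: resolve a master alias to its
// canonical host name and lower-case it before building the Kerberos
// service principal, so the KDC can match kudu/<canonical-host>@REALM.
public class KerberosHostCanonicalizer {

    // Resolve an alias (e.g. a CNAME) to the canonical name, then
    // normalize to lower case: KDCs conventionally register service
    // principals with lower-case host names.
    static String canonicalize(String host) throws UnknownHostException {
        String canonical = InetAddress.getByName(host).getCanonicalHostName();
        return canonical.toLowerCase(Locale.ROOT);
    }

    static String servicePrincipal(String canonicalHost, String realm) {
        return "kudu/" + canonicalHost + "@" + realm;
    }

    public static void main(String[] args) {
        // With a real CNAME, canonicalize("kudu-master.example.com") would
        // follow the alias; here we only demonstrate principal construction.
        System.out.println(servicePrincipal("master-1.example.com", "EXAMPLE.COM"));
    }
}
```

The essential point is the `getCanonicalHostName()` call followed by the lower-casing step; without both, the client asks the KDC for a principal under the alias (or a mixed-case name) that was never registered.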





[jira] [Updated] (KUDU-2103) Java client doesn't work on a Kerberized cluster with DNS aliases for masters

2017-08-25 Thread Attila Bukor (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Bukor updated KUDU-2103:
---
Component/s: security
 java

> Java client doesn't work on a Kerberized cluster with DNS aliases for masters
> -
>
> Key: KUDU-2103
> URL: https://issues.apache.org/jira/browse/KUDU-2103
> Project: Kudu
>  Issue Type: Bug
>  Components: java, security
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Attila Bukor
>Assignee: Attila Bukor
> Fix For: 1.5.0
>
>
> The Java client doesn't canonicalize the master_addresses when requesting 
> service tickets. This means that when the Java client uses DNS aliases for 
> the master hosts in a Kerberized environment, it fails to connect to the 
> masters, because the KDC can't find a matching service principal.
> As discussed in KUDU-2096, the client should use the lower-case version of 
> the canonical hostname for this to work properly.





[jira] [Updated] (KUDU-2103) Java client doesn't work on a Kerberized cluster with DNS aliases for masters

2017-08-25 Thread Attila Bukor (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Bukor updated KUDU-2103:
---
Affects Version/s: 1.3.0
   1.4.0

> Java client doesn't work on a Kerberized cluster with DNS aliases for masters
> -
>
> Key: KUDU-2103
> URL: https://issues.apache.org/jira/browse/KUDU-2103
> Project: Kudu
>  Issue Type: Bug
>  Components: java, security
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Attila Bukor
>Assignee: Attila Bukor
> Fix For: 1.5.0
>
>
> The Java client doesn't canonicalize the master_addresses when requesting 
> service tickets. This means that when the Java client uses DNS aliases for 
> the master hosts in a Kerberized environment, it fails to connect to the 
> masters, because the KDC can't find a matching service principal.
> As discussed in KUDU-2096, the client should use the lower-case version of 
> the canonical hostname for this to work properly.


