[jira] [Closed] (KUDU-2241) NoSuchMethodError happened when run flume agent using kudu flume sink

2018-01-18 Thread Changyao Ye (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Changyao Ye closed KUDU-2241.
-
Resolution: Fixed

> NoSuchMethodError happened when run flume agent using kudu flume sink
> -
>
> Key: KUDU-2241
> URL: https://issues.apache.org/jira/browse/KUDU-2241
> Project: Kudu
>  Issue Type: Bug
>  Components: flume-sink
>Affects Versions: 1.5.0
>Reporter: Changyao Ye
>Priority: Major
>
> I installed the Kudu and Flume components from Cloudera Manager, and when I 
> start the Flume agent the following error occurs.
> {panel:title=Error}
> 17/12/11 10:46:29 ERROR node.PollingPropertiesFileConfigurationProvider: 
> Unhandled error
> java.lang.NoSuchMethodError: 
> org.apache.flume.Context.getSubProperties(Ljava/lang/String;)Lorg/apache/kudu/shaded/com/google/common/collect/ImmutableMap;
>   at org.apache.kudu.flume.sink.KuduSink.configure(KuduSink.java:206)
>   at org.apache.flume.conf.Configurables.configure(Configurables.java:41)
>   at 
> org.apache.flume.node.AbstractConfigurationProvider.loadSinks(AbstractConfigurationProvider.java:411)
>   at 
> org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:102)
>   at 
> org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:141)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {panel}
> Version information:
> - flume: 1.6.0-cdh5.13.0
> - kudu: 1.5.0-cdh5.13.0
> I checked the classpath of the Flume agent and it looks like there is a Guava 
> version conflict between Flume and the Kudu Flume sink 
> (kudu-flume-sink-1.5.0-cdh5.13.0.jar), which bundles a shaded Guava jar.
> This problem may be related to the changes made 
> [here|https://github.com/apache/kudu/commit/5a258508f8d560f630512c237711a65cd137c6b3].
> To make the Flume agent run properly, I excluded the related classes 
> (ImmutableMap etc.) from the shade settings in pom.xml, and that worked.
> {panel:title=kudu 1.5.0 top-level pom.xml}
> <relocation>
>   <pattern>com.google.common</pattern>
>   <shadedPattern>org.apache.kudu.shaded.com.google.common</shadedPattern>
>   <excludes>
>     <exclude>com.google.common.collect.ImmutableMap*</exclude>
>     <exclude>com.google.common.collect.ImmutableEnumMap*</exclude>
>   </excludes>
> </relocation>
> {panel}
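> A minimal sketch of why the relocation breaks at runtime (illustration only, not the actual KuduSink source; it assumes the Flume 1.6 {{Context.getSubProperties()}} signature that returns Guava's ImmutableMap):
> {code:java}
> import org.apache.flume.Context;
> import com.google.common.collect.ImmutableMap;  // relocated by the shade plugin
>
> public class KuduSinkConfigureSketch {
>   public void configure(Context context) {
>     // The compiled call embeds Flume's declared return type in the method
>     // descriptor. After relocation that descriptor becomes
>     //   getSubProperties(Ljava/lang/String;)Lorg/apache/kudu/shaded/.../ImmutableMap;
>     // which Flume's Context never declares, hence the NoSuchMethodError.
>     // Excluding ImmutableMap* from relocation (as in the panel above) keeps the
>     // descriptor pointing at the real Guava class that Flume actually returns.
>     ImmutableMap<String, String> producerProps = context.getSubProperties("producer.");
>   }
> }
> {code}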
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2241) NoSuchMethodError happened when run flume agent using kudu flume sink

2018-01-18 Thread Aki Ariga (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331768#comment-16331768
 ] 

Aki Ariga commented on KUDU-2241:
-

It seems [this 
commit|https://github.com/apache/kudu/commit/68fa8010dddad81dd702c6f05fda7d561d9beef9#diff-93f725a07423fe1c889f448b33d21f46]
 resolves the problem. Is there any reason not to close this?

> NoSuchMethodError happened when run flume agent using kudu flume sink
> -
>
> Key: KUDU-2241
> URL: https://issues.apache.org/jira/browse/KUDU-2241
> Project: Kudu
>  Issue Type: Bug
>  Components: flume-sink
>Affects Versions: 1.5.0
>Reporter: Changyao Ye
>Priority: Major
>
> I installed the Kudu and Flume components from Cloudera Manager, and when I 
> start the Flume agent the following error occurs.
> {panel:title=Error}
> 17/12/11 10:46:29 ERROR node.PollingPropertiesFileConfigurationProvider: 
> Unhandled error
> java.lang.NoSuchMethodError: 
> org.apache.flume.Context.getSubProperties(Ljava/lang/String;)Lorg/apache/kudu/shaded/com/google/common/collect/ImmutableMap;
>   at org.apache.kudu.flume.sink.KuduSink.configure(KuduSink.java:206)
>   at org.apache.flume.conf.Configurables.configure(Configurables.java:41)
>   at 
> org.apache.flume.node.AbstractConfigurationProvider.loadSinks(AbstractConfigurationProvider.java:411)
>   at 
> org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:102)
>   at 
> org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:141)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {panel}
> Version information:
> - flume: 1.6.0-cdh5.13.0
> - kudu: 1.5.0-cdh5.13.0
> I checked the classpath of the Flume agent and it looks like there is a Guava 
> version conflict between Flume and the Kudu Flume sink 
> (kudu-flume-sink-1.5.0-cdh5.13.0.jar), which bundles a shaded Guava jar.
> This problem may be related to the changes made 
> [here|https://github.com/apache/kudu/commit/5a258508f8d560f630512c237711a65cd137c6b3].
> To make the Flume agent run properly, I excluded the related classes 
> (ImmutableMap etc.) from the shade settings in pom.xml, and that worked.
> {panel:title=kudu 1.5.0 top-level pom.xml}
> <relocation>
>   <pattern>com.google.common</pattern>
>   <shadedPattern>org.apache.kudu.shaded.com.google.common</shadedPattern>
>   <excludes>
>     <exclude>com.google.common.collect.ImmutableMap*</exclude>
>     <exclude>com.google.common.collect.ImmutableEnumMap*</exclude>
>   </excludes>
> </relocation>
> {panel}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-760) Rare ThreadRestrictions shutdown CHECK failure

2018-01-18 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-760:
-
Priority: Trivial  (was: Major)

> Rare ThreadRestrictions shutdown CHECK failure
> --
>
> Key: KUDU-760
> URL: https://issues.apache.org/jira/browse/KUDU-760
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: M5
>Reporter: Mike Percy
>Priority: Trivial
> Attachments: client-test.txt
>
>
> Seems like a really bizarre interaction between reactor thread restrictions 
> and consensus at shutdown time. I almost don't believe it, but here it is, so 
> I'm noting it down to investigate later. The log is a little weird and 
> garbled, but this seems to have happened in ClientTest.TestScanFaultTolerance:
> {noformat}
> F0513 04:22:40.899718 12760 thread_restrictions.cc:57] Check failed: 
> LoadTLS()->wait_allowed Waiting is not allowed to be used on this thread to 
> prevent server-wide latency aberrations and deadlocks.
> *** Check failure stack trace: ***
> @ 0x7f7d6fe0b38d  google::LogMessage::Fail() at ??:0
> @ 0x7f7d6fe0d22d  google::LogMessage::SendToLog() at ??:0
> @ 0x7f7d6fe0af7c  google::LogMessage::Flush() at ??:0
> @ 0x7f7d6fe0db4e  google::LogMessageFatal::~LogMessageFatal() at ??:0
> @ 0x7f7d67d69d42  kudu::ThreadRestrictions::AssertWaitAllowed() at 
> ??:0
> @ 0x7f7d6c26f2f7  kudu::consensus::ReplicaState::LockForShutdown() at 
> ??:0
> @ 0x7f7d6c25f574  kudu::consensus::RaftConsensus::Shutdown() at ??:0
> @ 0x7f7d6c24f289  kudu::consensus::RaftConsensus::~RaftConsensus() at 
> ??:0
> @ 0x7f7d6c24f443  kudu::consensus::RaftConsensus::~RaftConsensus() at 
> ??:0
> @ 0x7f7d6c269225  kudu::internal::BindState<>::~BindState() at ??:0
> @ 0x7f7d6c2692d3  kudu::internal::BindState<>::~BindState() at ??:0
> @ 0x7f7d6c216f44  kudu::consensus::LeaderElection::~LeaderElection() 
> at ??:0
> @ 0x7f7d6c21f2e3  kudu::RefCountedThreadSafe<>::DeleteInternal() at 
> ??:0
> @ 0x7f7d6c21f115  kudu::internal::BindState<>::~BindState() at ??:0
> @ 0x7f7d6c21f1d3  kudu::internal::BindState<>::~BindState() at ??:0
> @ 0x7f7d6c21e3de  
> boost::detail::function::functor_manager<>::manager() at ??:0
> @ 0x7f7d6f924e8e  boost::function0<>::clear() at ??:0
> @ 0x7f7d68f6f092  boost::function<>::operator=() at ??:0
> @ 0x7f7d68f6ae8d  kudu::rpc::OutboundCall::CallCallback() at ??:0
> @ 0x7f7d68f6b88d  kudu::rpc::OutboundCall::SetFailed() at ??:0
> @ 0x7f7d68f77125  kudu::rpc::Connection::Shutdown() at ??:0
> @ 0x7f7d68fb3604  kudu::rpc::ReactorThread::DestroyConnection() at 
> ??:0
> @ 0x7f7d68fb6170  
> kudu::rpc::ReactorThread::CompleteConnectionNegotiation() at ??:0
> @ 0x7f7d68f8a913  kudu::rpc::NegotiationCompletedTask::Run() at ??:0
> @ 0x7f7d68fb2bf5  kudu::rpc::ReactorThread::AsyncHandler() at ??:0
> @ 0x7f7d68c56525  ev_invoke_pending at ??:0
> @ 0x7f7d68c59765  ev_run at ??:0
> @ 0x7f7d68fb0211  kudu::rpc::ReactorThread::RunThread() at ??:0
> @ 0x7f7d68fc8acc  boost::_bi::bind_t<>::operator()() at ??:0
> @ 0x7f7d68f6efbb  boost::function0<>::operator()() at ??:0
> @ 0x7f7d67d4e078  kudu::Thread::SuperviseThread() at ??:0
> @ 0x7f7d6a69e182  start_thread at ??:0
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2244) spinlock contention in raft_consensus

2018-01-18 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-2244:
--
Issue Type: Improvement  (was: Bug)

> spinlock contention in raft_consensus
> -
>
> Key: KUDU-2244
> URL: https://issues.apache.org/jira/browse/KUDU-2244
> Project: Kudu
>  Issue Type: Improvement
>  Components: consensus
>Reporter: Andrew Wong
>Priority: Major
>
> I was going through the logs of a cluster that was seeing a bunch of 
> kernel_stack_watchdog traces, and the slowness seemed to be caused by a lot 
> of activity in consensus requests. E.g.
> W1214 18:57:29.514219 36138 kernel_stack_watchdog.cc:145] Thread 36317 stuck 
> at 
> /data/jenkins/workspace/generic-package-centos64-7-0-impala/topdir/BUILD/kudu-1.3.0-cdh5.11.0/src/kudu/rpc/outbound_call.cc:192
>  for 123ms:
> Kernel stack:
> [] sys_sched_yield+0x65/0xd0
> [] system_call_fastpath+0x16/0x1b
> [] 0x
> User stack:
> @ 0x7f72fab92057  __GI___sched_yield
> @  0x19498bf  kudu::Thread::StartThread()
> @  0x1952e7d  kudu::ThreadPool::CreateThreadUnlocked()
> @  0x19534d3  kudu::ThreadPool::Submit()
> @  0x1953a27  kudu::ThreadPool::SubmitFunc()
> @  0x1953ecb  kudu::ThreadPool::SubmitClosure()
> @   0x9c94ec  kudu::consensus::RaftConsensus::ElectionCallback()
> @   0x9e6032  kudu::consensus::LeaderElection::CheckForDecision()
> @   0x9e78c3  
> kudu::consensus::LeaderElection::VoteResponseRpcCallback()
> @   0xa8b137  kudu::rpc::OutboundCall::CallCallback()
> @   0xa8c2bc  kudu::rpc::OutboundCall::SetResponse()
> @   0xa822c0  kudu::rpc::Connection::HandleCallResponse()
> @   0xa83ffc  ev::base<>::method_thunk<>()
> @  0x198a07f  ev_invoke_pending
> @  0x198af71  ev_run
> @   0xa5e049  kudu::rpc::ReactorThread::RunThread()
> So it seemed to be caused by some slowness in getting threads. Upon perusing 
> the logs a bit more, there were a sizable number of spinlock profiling traces:
> W1214 18:54:27.897955 36379 rpcz_store.cc:238] Trace:
> 1214 18:54:26.766922 (+ 0us) service_pool.cc:143] Inserting onto call 
> queue
> 1214 18:54:26.771135 (+  4213us) service_pool.cc:202] Handling call
> 1214 18:54:26.771138 (+ 3us) raft_consensus.cc:1126] Updating replica for 
> 0 ops
> 1214 18:54:27.897699 (+1126561us) raft_consensus.cc:1165] Early marking 
> committed up to index 0
> 1214 18:54:27.897700 (+ 1us) raft_consensus.cc:1170] Triggering prepare 
> for 0 ops
> 1214 18:54:27.897701 (+ 1us) raft_consensus.cc:1282] Marking committed up 
> to 1766
> 1214 18:54:27.897702 (+ 1us) raft_consensus.cc:1332] Filling consensus 
> response to leader.
> 1214 18:54:27.897736 (+34us) spinlock_profiling.cc:255] Waited 991 ms on 
> lock 0x120b3540. stack: 019406c5 009c60d7 009c75f7 
> 007dc628 00a7adfc 00a7b9cd 0194d059 
> 7f72fbcc2dc4 7f72fabad1cc 
> 1214 18:54:27.897737 (+ 1us) raft_consensus.cc:1327] UpdateReplicas() 
> finished
> 1214 18:54:27.897741 (+ 4us) inbound_call.cc:130] Queueing success 
> response
> Metrics: {"spinlock_wait_cycles":2478395136}
> Each of the traces noted on the order of 500-1000ms of waiting on spinlocks. 
> Upon looking at raft_consensus.cc, it seems we're holding a spinlock 
> (update_lock_) while we call RaftConsensus::UpdateReplica(), which according 
> to its header, "won't return until all operations have been stored in the log 
> and all Prepares() have been completed". While locking may be necessary, it's 
> worth considering using a different kind of lock here.
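> A minimal illustration of the locking pattern (a Java sketch only; Kudu's actual code is C++ and the names below are hypothetical): while one thread holds a test-and-set spinlock across the slow UpdateReplica() work, every other caller burns CPU spinning, which is what the spinlock_wait_cycles metric above counts, whereas a blocking lock parks its waiters.
> {code:java}
> import java.util.concurrent.atomic.AtomicBoolean;
> import java.util.concurrent.locks.ReentrantLock;
>
> public class SpinVsBlockingLockSketch {
>   // Minimal test-and-set spinlock: waiters busy-wait and burn CPU cycles.
>   static final AtomicBoolean spinLock = new AtomicBoolean(false);
>   // Blocking lock: waiters are parked by the scheduler instead of spinning.
>   static final ReentrantLock blockingLock = new ReentrantLock();
>
>   static void updateReplicaUnderSpinlock() throws InterruptedException {
>     while (!spinLock.compareAndSet(false, true)) {
>       Thread.onSpinWait();  // concurrent callers spin here for the holder's full ~1s
>     }
>     try {
>       Thread.sleep(1000);   // stand-in for "log writes + Prepare()s", ~1s as in the trace
>     } finally {
>       spinLock.set(false);
>     }
>   }
>
>   static void updateReplicaUnderBlockingLock() throws InterruptedException {
>     blockingLock.lock();    // waiters sleep; no wasted cycles while the holder is slow
>     try {
>       Thread.sleep(1000);
>     } finally {
>       blockingLock.unlock();
>     }
>   }
> }
> {code}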



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2214) voting while tablet copying says "voting while tombstoned"

2018-01-18 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-2214:
--
Labels: newbie  (was: )

> voting while tablet copying says "voting while tombstoned"
> --
>
> Key: KUDU-2214
> URL: https://issues.apache.org/jira/browse/KUDU-2214
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.5.0
>Reporter: Mike Percy
>Assignee: Will Berkeley
>Priority: Minor
>  Labels: newbie
>
> Voting while tablet copying currently says "voting while tombstoned", which 
> is confusing and not really correct. While tombstoned voting and voting while 
> tablet copying use essentially the same code path, they should be 
> differentiated in the log messages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-441) Support tablet merges

2018-01-18 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-441:
-
Issue Type: New Feature  (was: Bug)

> Support tablet merges
> -
>
> Key: KUDU-441
> URL: https://issues.apache.org/jira/browse/KUDU-441
> Project: Kudu
>  Issue Type: New Feature
>  Components: consensus, tablet
>Affects Versions: Public beta
>Reporter: Todd Lipcon
>Priority: Minor
>  Labels: kudu-roadmap
>
> Merging is less important than splitting, but we should consider it for GA. 
> It's useful when large amounts of data are deleted from a table and you want 
> to shrink back down to a smaller number of tablets.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-1720) Expose ability to appoint a replica as leader via RPC

2018-01-18 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331611#comment-16331611
 ] 

Todd Lipcon commented on KUDU-1720:
---

[~mpercy] [~wdberkeley] do you guys think this is still necessary? I seem to 
recall Will had written instructions for recovering from the kind of scenario 
mentioned here using the existing unsafe_change_config tool.

> Expose ability to appoint a replica as leader via RPC
> -
>
> Key: KUDU-1720
> URL: https://issues.apache.org/jira/browse/KUDU-1720
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Reporter: Mike Percy
>Priority: Major
>
> Today we have an internal API call in RaftConsensus called EmulateElection() 
> that causes a node to increment its term and begin acting as Raft leader 
> without initiating an election. Exposing this would allow us to use this in 
> emergency scenarios, where all replicas except one are offline and we need to 
> force a node to become leader in order to then force an unsafe configuration 
> change to be serialized to the WAL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-1391) 2 of 3 replica alive but failed to elect leader

2018-01-18 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331612#comment-16331612
 ] 

Todd Lipcon commented on KUDU-1391:
---

[~mpercy] I believe this should be considered fixed back when we did tombstoned 
voting, etc. Do you agree?

> 2 of 3 replica alive but failed to elect leader
> ---
>
> Key: KUDU-1391
> URL: https://issues.apache.org/jira/browse/KUDU-1391
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Reporter: Binglin Chang
>Priority: Major
> Attachments: 6a32cfa0353e4175809c2aa67e16ac9e.log.st172, 
> 6a32cfa0353e4175809c2aa67e16ac9e.log.st212, 
> 6a32cfa0353e4175809c2aa67e16ac9e.log.st212.before, 
> 6a32cfa0353e4175809c2aa67e16ac9e.log.st216, remote-bootstrap-tool.patch
>
>
> Last weekend many TS hit a lot of "too many open files" errors (we hadn't 
> upgraded to ...). When using our internal deploy tool to restart the cluster 
> (stop all TS, then start all TS), the control machine had some issue that 
> seemed to block writes to the ssh terminal (maybe a USB driver issue, not 
> related to this bug), so only about half (about 30) of the TS were shut down. 
> Then after maybe 10 minutes, I switched to another control host and performed 
> the whole restart.
> Then I saw that writes were blocked, because 1 tablet was in a no-leader 
> state. From the web UI, 2 of the 3 replicas were in follower state and 1 was 
> TABLET_DATA_TOMBSTONED, but all elections failed. I will attach the logs of 
> the 2 followers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-1330) Add tool to unsafely recover from loss of majority (or all) tablet replicas

2018-01-18 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-1330.
---
  Resolution: Fixed
   Fix Version/s: 1.4.0
Target Version/s:   (was: 1.5.0)

> Add tool to unsafely recover from loss of majority (or all) tablet replicas
> ---
>
> Key: KUDU-1330
> URL: https://issues.apache.org/jira/browse/KUDU-1330
> Project: Kudu
>  Issue Type: New Feature
>  Components: ops-tooling
>Affects Versions: 0.7.0
>Reporter: Todd Lipcon
>Assignee: Dinesh Bhat
>Priority: Major
> Fix For: 1.4.0
>
>
> [~bruceSz] ran into this issue: if you have a table with replication set to 
> 1, and you permanently lose a node, the table is stuck in a bad state. It 
> would be nice to allow the operator to accept the data loss and replace the 
> lost tablet with a new (empty) one.
> Similarly, I've had a few people ask about the scenario where you lose 2 of 3 
> replicas and you are willing to accept data loss, and force recovery from the 
> remaining one replica.
> We should add a tool (or tools) for these recovery scenarios.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-1721) Expose ability to force an unsafe Raft configuration change

2018-01-18 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-1721.
---
   Resolution: Fixed
 Assignee: Dinesh Bhat  (was: zhangsong)
Fix Version/s: 1.4.0

> Expose ability to force an unsafe Raft configuration change
> ---
>
> Key: KUDU-1721
> URL: https://issues.apache.org/jira/browse/KUDU-1721
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Reporter: Mike Percy
>Assignee: Dinesh Bhat
>Priority: Major
> Fix For: 1.4.0
>
>
> Add the ability to force an “unsafe” configuration change via RPC to the 
> leader. Instead of specifying AddServer() or RemoveServer(), which enforce 
> one-by-one configuration changes due to their inherent contract, we could 
> specify an unsafe configuration change API for administrative / emergency 
> cases only that the leader would process without going through the normal 
> "one by one" safety checks required by the Raft dissertation config change 
> protocol.
> The leader should still reject configuration changes in which it is not a 
> member of the requested new configuration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-1397) Allow building safely with custom toolchains

2018-01-18 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-1397:
--
Priority: Minor  (was: Major)

> Allow building safely with custom toolchains
> 
>
> Key: KUDU-1397
> URL: https://issues.apache.org/jira/browse/KUDU-1397
> Project: Kudu
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.8.0
>Reporter: Adar Dembo
>Priority: Minor
>
> Casey uncovered several issues when building Kudu with the Impala toolchain; 
> this report attempts to capture them.
> The first and most important issue was a random SIGSEGV during a flush:
> {noformat}
> (gdb) bt
> #0 0x00e82540 in kudu::CopyCellData kudu::ColumnBlockCell, kudu::Arena> (src=..., dst=0x7ff9c637d5e0, 
> dst_arena=0x0)
> at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/common/row.h:79
> #1 0x00e80e33 in kudu::CopyCell kudu::ColumnBlockCell, kudu::Arena> (src=..., dst=0x7ff9c637d5e0, 
> dst_arena=0x0)
> at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/common/row.h:103
> #2 0x00e7f647 in kudu::CopyRow kudu::Arena> (src_row=..., dst_row=0x7ff9c637d870, dst_arena=0x0)
> at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/common/row.h:119
> #3  0x00e76773 in kudu::tablet::FlushCompactionInput 
> (input=0x3894f00, snap=..., out=0x7ff9c637dbf0)
> at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/tablet/compaction.cc:768
> #4  0x00e23f5a in kudu::tablet::Tablet::DoCompactionOrFlush 
> (this=0x395a840, input=..., mrs_being_flushed=0)
> at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/tablet/tablet.cc:1221
> #5  0x00e202b2 in kudu::tablet::Tablet::FlushInternal 
> (this=0x395a840, input=..., old_ms=...) at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/tablet/tablet.cc:744
> #6  0x00e1f8f6 in kudu::tablet::Tablet::FlushUnlocked 
> (this=0x395a840) at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/tablet/tablet.cc:678
> #7  0x00f1b3a3 in kudu::tablet::FlushMRSOp::Perform (this=0x38b9340) 
> at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/tablet/tablet_peer_mm_ops.cc:127
> #8  0x00ea19d7 in kudu::MaintenanceManager::LaunchOp (this=0x3904360, 
> op=0x38b9340) at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/tablet/maintenance_manager.cc:360
> #9  0x00ea6502 in boost::_mfi::mf1 kudu::MaintenanceOp*>::operator() (this=0x3d492a0, p=0x3904360, a1=0x38b9340)
> at 
> /home/casey/Code/native-toolchain/build/boost-1.57.0/include/boost/bind/mem_fn_template.hpp:165
> #10 0x00ea6163 in 
> boost::_bi::list2, 
> boost::_bi::value >::operator() kudu::MaintenanceManager, kudu::MaintenanceOp*>, boost::_bi::list0> 
> (this=0x3d492b0, f=..., a=...) at 
> /home/casey/Code/native-toolchain/build/boost-1.57.0/include/boost/bind/bind.hpp:313
> #11 0x00ea5bed in boost::_bi::bind_t kudu::MaintenanceManager, kudu::MaintenanceOp*>, 
> boost::_bi::list2, 
> boost::_bi::value > >::operator() (this=0x3d492a0) at 
> /home/casey/Code/native-toolchain/build/boost-1.57.0/include/boost/bind/bind_template.hpp:20
> #12 0x00ea57ec in 
> boost::detail::function::void_function_obj_invoker0 boost::_mfi::mf1, 
> boost::_bi::list2, 
> boost::_bi::value > >, void>::invoke 
> (function_obj_ptr=...) at 
> /home/casey/Code/native-toolchain/build/boost-1.57.0/include/boost/function/function_template.hpp:153
> #13 0x01c4205e in boost::function0::operator() (this=0x3c01838) 
> at 
> /home/casey/Code/native-toolchain/build/boost-1.57.0/include/boost/function/function_template.hpp:767
> #14 0x01d73aa4 in kudu::FunctionRunnable::Run (this=0x3c01830) at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/util/threadpool.cc:47
> #15 0x01d73062 in kudu::ThreadPool::DispatchThread (this=0x38c8340, 
> permanent=true) at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/util/threadpool.cc:321
> #16 0x01d76740 in boost::_mfi::mf1 bool>::operator() 

[jira] [Resolved] (KUDU-2187) Don't hold threadpool lock when creating additional workers

2018-01-18 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2187.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Fixed this in f5e1203a344c031e357075c900740da33f4d736e

> Don't hold threadpool lock when creating additional workers
> ---
>
> Key: KUDU-2187
> URL: https://issues.apache.org/jira/browse/KUDU-2187
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf, util
>Affects Versions: 1.6.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Major
> Fix For: 1.6.0
>
>
> I've noticed in a lot of clusters that creating new threads often takes a 
> relatively long time, for whatever reason. Currently we create new threads 
> while holding the threadpool's lock, which means that, even if other existing 
> worker threads finish their current work items and are ready to handle newly 
> enqueued work, everything is frozen while the thread is being created. This 
> creates potentially long stalls resulting in increased queue time at the high 
> percentiles.
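> A simplified sketch of the change described above (a Java illustration with hypothetical names, not Kudu's actual C++ ThreadPool): make the sizing decision under the pool lock, but start the new worker thread only after releasing it, so existing workers can keep dequeuing while the slow thread creation is in flight.
> {code:java}
> import java.util.ArrayDeque;
> import java.util.Queue;
>
> public class ThreadPoolLockSketch {
>   private final Object lock = new Object();
>   private final Queue<Runnable> queue = new ArrayDeque<>();
>
>   // Problematic shape: the (potentially slow) thread start happens while the
>   // pool lock is held, so idle workers cannot dequeue newly submitted work.
>   void submitHoldingLock(Runnable task) {
>     synchronized (lock) {
>       queue.add(task);
>       new Thread(this::workerLoop).start();
>     }
>   }
>
>   // Fixed shape: decide whether a worker is needed under the lock, then
>   // create it after the lock is released.
>   void submitOutsideLock(Runnable task) {
>     boolean needWorker;
>     synchronized (lock) {
>       queue.add(task);
>       needWorker = true;  // simplified sizing decision
>     }
>     if (needWorker) {
>       new Thread(this::workerLoop).start();
>     }
>   }
>
>   private void workerLoop() {
>     while (true) {
>       Runnable task;
>       synchronized (lock) {
>         task = queue.poll();
>       }
>       if (task == null) {
>         return;
>       }
>       task.run();
>     }
>   }
> }
> {code}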



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2048) consensus: Only evict a replica is a majority is up to date

2018-01-18 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-2048:
--
   Resolution: Fixed
Fix Version/s: 1.6.0
   Status: Resolved  (was: In Review)

Was addressed in 1e4db3148a1cb4e340aa96edaea85c733cfdbf5a

> consensus: Only evict a replica is a majority is up to date
> ---
>
> Key: KUDU-2048
> URL: https://issues.apache.org/jira/browse/KUDU-2048
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus, recovery
>Affects Versions: 1.4.0
>Reporter: Mike Percy
>Assignee: Todd Lipcon
>Priority: Major
> Fix For: 1.6.0
>
>
> In the context of replica eviction and 3-2-3 recovery, we currently have a 
> "hacky" rule that states that evicting down to less than 2 replicas in a 
> config is prohibited. However we don't currently check to see, when evicting, 
> whether that would leave the config with less than a majority of caught-up 
> voters.
> That means, for example, that if we have a config of 3 replicas { A, B, C } 
> and B falls behind, so is currently undergoing a tablet copy, and C goes 
> offline then the algorithm will evict C. However, since A is the only 
> up-to-date replica, this leaves the config in a state where nothing can 
> commit until B is done copying. Even worse, if B is killed or has an error, 
> then we are left in a state that requires manual recovery.
> Consider adding an additional rule that states that to evict a node, we must 
> have a majority of up-to-date replicas that are recently active. This will 
> help prevent certain problem scenarios like the above from occurring.
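> A small sketch of the proposed rule (Java, purely illustrative; the {{Replica}} fields and method names are hypothetical, not Kudu's consensus code): besides the existing "never go below 2 replicas" rule, eviction is allowed only if a strict majority of the remaining replicas is caught up and recently active, which prohibits evicting C in the { A, B, C } example above.
> {code:java}
> import java.util.List;
>
> public class EvictionRuleSketch {
>   // Hypothetical view of a replica, for illustration only.
>   record Replica(String uuid, boolean caughtUp, boolean recentlyActive) {}
>
>   static boolean mayEvict(List<Replica> config, Replica candidate) {
>     List<Replica> remaining = config.stream()
>         .filter(r -> !r.uuid().equals(candidate.uuid()))
>         .toList();
>     if (remaining.size() < 2) {
>       return false;  // existing rule: never evict down to fewer than 2 replicas
>     }
>     long healthy = remaining.stream()
>         .filter(r -> r.caughtUp() && r.recentlyActive())
>         .count();
>     // Proposed additional rule: a strict majority of the remaining replicas
>     // must be up to date and recently active.
>     return healthy > remaining.size() / 2;
>   }
> }
> {code}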



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2011) Request-side sidecars cannot be safely destroyed on timeout

2018-01-18 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331520#comment-16331520
 ] 

Todd Lipcon commented on KUDU-2011:
---

This was reopened because of a test failure on a stress cluster. See 
55c40fc72241a0569d311bdb2f0268389512a807

> Request-side sidecars cannot be safely destroyed on timeout
> ---
>
> Key: KUDU-2011
> URL: https://issues.apache.org/jira/browse/KUDU-2011
> Project: Kudu
>  Issue Type: Bug
>  Components: rpc
>Reporter: Henry Robinson
>Assignee: Michael Ho
>Priority: Major
>
> If a timeout occurs while sending a request-side sidecar (see KUDU-1866), the 
> RPC callback may be invoked before the outbound transfer has been completely 
> written. 
> This is the last notification from the RPC layer that the caller will get, so 
> you might expect them to delete the sidecar payload at that point, but it's 
> not safe to do so. In fact, with a slow sender there is no way for the caller 
> to know when it's safe to delete the payload. There's no problem for the 
> protobuf message data, as it's serialized during the blocking part of an 
> async call, and that memory is tied to the lifetime of the outbound call, 
> which is managed by the RPC layer.
> Ownership of the sidecar payloads should be shared between caller and the RPC 
> layer, so really it's the new {{RpcSidecar::FromSlice}} API that causes the 
> problems because ownership is not shared with the {{RpcSidecar}} which does 
> have the correct lifetime. I propose removing {{FromSlice}} and having a 
> {{FromFaststring(shared_ptr)}} variant.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-1762) suspected tablet memory leak

2018-01-18 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331517#comment-16331517
 ] 

Todd Lipcon commented on KUDU-1762:
---

[~cfreely][~dawn110110][~zhqu...@gmail.com] - anything to report here? If not, 
I'll assume we probably fixed this at some point and will close it as cannot 
reproduce.

> suspected tablet memory leak
> 
>
> Key: KUDU-1762
> URL: https://issues.apache.org/jira/browse/KUDU-1762
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.0.1
> Environment: CentOS 6.5
> Kudu 1.0.1 (rev e60b610253f4303b24d41575f7bafbc5d69edddb)
>Reporter: Fu Lili
>Priority: Major
> Attachments: 0B2CE7BB-EF26-4EA1-B824-3584D7D79256.png, 
> kudu_heap_prof_20161206.tar.gz, mem_rss_graph_2016_12_19.png, 
> server02_30day_rss_before_and_after_mrs_flag.png, 
> server02_30day_rss_before_and_after_mrs_flag_2.png, tserver_smaps1
>
>
> here is the memory total info:
> {quote}
> 
> MALLOC: 1691715680 ( 1613.3 MiB) Bytes in use by application
> MALLOC: +178733056 (  170.5 MiB) Bytes in page heap freelist
> MALLOC: + 37483104 (   35.7 MiB) Bytes in central cache freelist
> MALLOC: +  4071488 (3.9 MiB) Bytes in transfer cache freelist
> MALLOC: + 13739264 (   13.1 MiB) Bytes in thread cache freelists
> MALLOC: + 12202144 (   11.6 MiB) Bytes in malloc metadata
> MALLOC:   
> MALLOC: =   1937944736 ( 1848.2 MiB) Actual memory used (physical + swap)
> MALLOC: +   311296 (0.3 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   
> MALLOC: =   1938256032 ( 1848.5 MiB) Virtual address space used
> MALLOC:
> MALLOC: 174694  Spans in use
> MALLOC:201  Thread heaps in use
> MALLOC:   8192  Tcmalloc page size
> 
> Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
> Bytes released to the OS take up virtual address space but no physical memory.
> {quote}
> but in the memory detail, the sum of all the sub-trackers' Current 
> Consumption is far less than the root Current Consumption.
> ||Id||Parent||Limit||Current Consumption||Peak consumption||
> |root|none|4.00G|1.58G|1.74G|
> |log_cache|root|1.00G|480.8K|5.32M|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:70c8d889b0314b04a240fcb02c24a012|log_cache|128.00M|160B|160B|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:16d3c8193579445f8f766da6c7abc237|log_cache|128.00M|160B|160B|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:2c69c5cb9eb04eb48323a9268afc36a7|log_cache|128.00M|160B|160B|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:2b11d9220dab4a5f952c5b1c10a68ccd|log_cache|128.00M|69.2K|139.5K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:cec045be60af4f759497234d8815238b|log_cache|128.00M|68.6K|138.7K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:cea7a54cebd242e4997da641f5b32e3a|log_cache|128.00M|68.5K|139.3K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:9625dfde17774690a888b55024ac797a|log_cache|128.00M|68.5K|140.0K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:6046b33901ca43d0975f59cf7e491186|log_cache|128.00M|0B|133.0K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:1a18ab0915f0407b922fa7ecbe7a2f46|log_cache|128.00M|0B|132.6K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:ac54d1c1813a4e39943971cb56f248ef|log_cache|128.00M|0B|130.5K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:4438580df6cc4d469393b9d6adee68d8|log_cache|128.00M|0B|131.2K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:2f1cef7d2a494575b941baa22b8a3dc9|log_cache|128.00M|0B|131.6K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:d2ad22d202c04b2d98f1c5800df1c3b5|log_cache|128.00M|0B|132.5K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:b19b21d6b4c84f9895aad9e81559d019|log_cache|128.00M|0B|131.0K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:27e9531cd5814b1c9637493f05860b19|log_cache|128.00M|0B|131.1K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:425a19940239447faa0eaab4e380d644|log_cache|128.00M|68.5K|146.9K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:178bd7bc39a941a887f393b0a7848066|log_cache|128.00M|68.5K|139.9K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:91524acd28a440318918f11292ac8fdc|log_cache|128.00M|0B|132.0K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:be6f093aabf9460b97fc35dd026820b6|log_cache|128.00M|0B|130.4K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:dd8dd794f0f44426a3c46ce8f4b54652|log_cache|128.00M|0B|131.2K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:ed128ca7b19c4e3eaa48e9e3eb341492|log_cache|128.00M|68.5K|141.5K|
> |block_cache-sharded_lru_cache|root|none|257.05M|257.05M|
> |code_cache-sharded_lru_cache|root|none|112B|113B|
> |server|root|none|2.06M|121.97M|
> 

[jira] [Updated] (KUDU-1578) kudu-tserver should refuse service or "freeze" instead of crash when NTP loses sync

2018-01-18 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-1578:
--
Priority: Minor  (was: Major)

> kudu-tserver should refuse service or "freeze" instead of crash when NTP 
> loses sync
> ---
>
> Key: KUDU-1578
> URL: https://issues.apache.org/jira/browse/KUDU-1578
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Reporter: zhangsong
>Assignee: Todd Lipcon
>Priority: Minor
>
> Currently, kudu-tserver will crash when NTP is unsynchronized.
> However, this behavior may not be right in a large cluster, where a crash can 
> trigger re-replication that is useless or even harmful to cluster availability.
> Instead, kudu-tserver should suspend itself, e.g. refuse to serve writes, and 
> let the administrator decide what to do.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-2228) Make Messenger options configurable

2018-01-18 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2228.
---
   Resolution: Fixed
Fix Version/s: 1.7.0

Sailesh fixed in b357fa9b729ce52627a862e08ebac822ae470bc9

> Make Messenger options configurable
> ---
>
> Key: KUDU-2228
> URL: https://issues.apache.org/jira/browse/KUDU-2228
> Project: Kudu
>  Issue Type: Task
>  Components: rpc
>Reporter: Sailesh Mukil
>Assignee: Sailesh Mukil
>Priority: Major
>  Labels: refactor, rpc
> Fix For: 1.7.0
>
>
> Currently, the RPC layer accesses many gflags directly to take certain 
> decisions, eg. whether to turn on encryption, authentication, etc.
> Since the RPC layer is to be used more like a library, these should be 
> configurable options that are passed to the Messenger (which is the API 
> endpoint for the application using the RPC layer), instead of the RPC layer 
> itself directly accessing these flags.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-2225) Cannot submit Spark 2.2 job on yarn client mode to a secure Kudu cluster

2018-01-18 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2225.
---
   Resolution: Not A Problem
Fix Version/s: n/a

> Cannot submit Spark 2.2 job on yarn client mode to a secure Kudu cluster
> 
>
> Key: KUDU-2225
> URL: https://issues.apache.org/jira/browse/KUDU-2225
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: Zakaria AFKIR
>Priority: Major
> Fix For: n/a
>
>
> Hello,
> I use Kudu 1.4 with a kerberized CDH 5.12.1.
> I configured Kudu with Kerberos and checked that the service is accessible 
> using the Kudu client.
> When submitting a Spark 2.2 job in yarn-client mode against the secure Kudu 
> cluster, I get the Kerberos errors below:
> {code:java}
> [ERROR]: org.apache.kudu.client.TabletClient - [Peer 
> master-talend-cdh5121.weave.local:7051] Unexpected exception from downstream 
> on [id: 0xae2efc99, /192.168.99.1:64165 => 
> talend-cdh5121.weave.local/10.2.1.5:7051]
> java.lang.RuntimeException: java.security.PrivilegedActionException: 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>   at 
> org.apache.kudu.client.shaded.com.google.common.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.kudu.client.Negotiator.evaluateChallenge(Negotiator.java:678)
>   at 
> org.apache.kudu.client.Negotiator.sendSaslInitiate(Negotiator.java:560)
>   at 
> org.apache.kudu.client.Negotiator.startAuthentication(Negotiator.java:524)
>   at 
> org.apache.kudu.client.Negotiator.handleTlsMessage(Negotiator.java:478)
>   at org.apache.kudu.client.Negotiator.handleResponse(Negotiator.java:250)
>   at 
> org.apache.kudu.client.Negotiator.messageReceived(Negotiator.java:229)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:70)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
>   at 
> 

[jira] [Commented] (KUDU-2226) Tablets with too many DRSs will cause a huge DMS memory overhead

2018-01-18 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331513#comment-16331513
 ] 

Todd Lipcon commented on KUDU-2226:
---

[~zhqu...@gmail.com] can you try upgrading to 1.4 or later? There were a bunch 
of patches in 1.4 which substantially reduce the memory consumption of DRS/DMS. 
eg 286de539213602f502f0a25fddde27f97ab9665b , 
9adbd90d170a0a525e8aef05939ba211ca6c3a41 , 
c3953ad01e91bfd5b10d2851c7db5cdf424eec8e, 
02cbf8835d72157c348d8b4c1e34611b0bc05f44 and some others.

> Tablets with too many DRSs will cause a huge DMS memory overhead
> 
>
> Key: KUDU-2226
> URL: https://issues.apache.org/jira/browse/KUDU-2226
> Project: Kudu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
> Environment: CentOS6.5 Linux 2.6.32-431
> Kudu1.3.0 
> GitCommit 00813f96b9cb
>Reporter: ZhangZhen
>Priority: Major
>
> I have a table with 10M rows in total that has been hash partitioned into 16 
> buckets. Each tablet has about 100MB of on-disk size according to the /tablets 
> web UI. Every day 50K new rows are inserted into this table, and about 5M rows 
> of this table are updated; that's about half of the rows in total, and each 
> row is updated only once.
> Then I found something strange: from the /mem-trackers UI of the TS, I found 
> every tablet of this table occupied about 900MB of memory, mainly in the 
> DeltaMemStore, and the peak memory consumption is about 1.8G.
> I don't understand why the DeltaMemStore costs so much memory; 900MB of DMS 
> vs 100MB of on-disk size seems strange to me. What's more, I found these 
> DMSs are flushed very slowly, so this memory stays occupied for a long time, 
> which causes "Soft memory limit exceeded" on the TS and in turn causes 
> "Rejecting consensus request".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2234) kudu-tool-test fails when environment has GLOG_colorlogtostderr=1

2018-01-18 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-2234:
--
Labels: newbie  (was: )

> kudu-tool-test fails when environment has GLOG_colorlogtostderr=1
> -
>
> Key: KUDU-2234
> URL: https://issues.apache.org/jira/browse/KUDU-2234
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.6.0
>Reporter: Mike Percy
>Priority: Major
>  Labels: newbie
>
> kudu-tool-test is sensitive to the value of GLOG_colorlogtostderr in the 
> environment. We should make the test tolerate this setting since this was 
> confusing during 1.6.0 release testing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-2238) Big DMS not flush under memory pressure

2018-01-18 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2238.
---
   Resolution: Fixed
 Assignee: ZhangZhen
Fix Version/s: 1.7.0

Zhen put in a fix for this in f8d18693546cd518c8873b7a5f4c08579f85199a

> Big DMS not flush under memory pressure
> ---
>
> Key: KUDU-2238
> URL: https://issues.apache.org/jira/browse/KUDU-2238
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.3.0
> Environment: CentOS6.5 Linux 2.6.32-431 
> Kudu1.3.0 
> GitCommit 00813f96b9cb
>Reporter: ZhangZhen
>Assignee: ZhangZhen
>Priority: Major
> Fix For: 1.7.0
>
> Attachments: memory_anchored.png, memory_consumed.png
>
>
> I have a table with many updates; its DMS consumes a lot of memory and causes 
> "Soft memory limit exceeded". I checked /mem-trackers on the tablet server: 
> one of its DMSs consumes about 3G of memory, but /maintenance-manager shows 
> its FlushDeltaMemStoresOp can only free 763B of anchored memory and its 
> perf_improvement is 0. Is this normal? I know Kudu is not optimized for 
> updates, but I'm still confused why the DMS won't be flushed under memory 
> pressure.
> Info from /mem-trackers:
> !memory_consumed.png!
> ||Id||Parent||Limit||Current Consumption||Peak consumption||
> |tablet-5941a8bb934e4686abd1bfff9e35c860|server|none|3.00G|3.00G|
> |txn_tracker|tablet-5941a8bb934e4686abd1bfff9e35c860|64.00M|0B|1.67M|
> |MemRowSet-339|tablet-5941a8bb934e4686abd1bfff9e35c860|none|265B|265B|
> |DeltaMemStores|tablet-5941a8bb934e4686abd1bfff9e35c860|none|3.00G|3.00G|
> Info from /maintenance-manager:
> !memory_anchored.png!
> |FlushDeltaMemStoresOp(5941a8bb934e4686abd1bfff9e35c860)|true|763B|511.15M|0|
> The tablet 5941a8bb934e4686abd1bfff9e35c860 has 16 RowSets in total.
> Some configs of the MM:
> --enable_maintenance_manager=true
> --log_target_replay_size_mb=1024
> --maintenance_manager_history_size=8
> --maintenance_manager_num_threads=6
> --maintenance_manager_polling_interval_ms=50



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-2249) fix client shut down prematurely in KuduTableInputFormat

2018-01-18 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2249.
---
   Resolution: Fixed
 Assignee: Clemens Valiente
Fix Version/s: 1.7.0

Thanks Clemens!

> fix client shut down prematurely in KuduTableInputFormat
> 
>
> Key: KUDU-2249
> URL: https://issues.apache.org/jira/browse/KUDU-2249
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Clemens Valiente
>Assignee: Clemens Valiente
>Priority: Major
> Fix For: 1.7.0
>
>
> In FetchInputFormatSplit, Hive uses the same InputFormat for fetching the 
> splits and for getting the record reader (in our case, 
> KuduTableInputFormat.TableRecordReader).
> If Hive then tries to initialize that record reader, it runs into an error 
> here:
> https://github.com/cloudera/kudu/blob/master/java/kudu-mapreduce/src/main/java/org/apache/kudu/mapreduce/KuduTableInputFormat.java#L397
> since the TableRecordReader uses the same client as the KuduTableInputFormat, 
> and that client was already shut down by getSplits().
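> A simplified sketch of the bug shape (not the actual KuduTableInputFormat source; the master address and table name below are placeholders): the record reader reuses the client that getSplits() has already shut down.
> {code:java}
> import org.apache.kudu.client.KuduClient;
>
> public class SharedClientSketch {
>   private final KuduClient client =
>       new KuduClient.KuduClientBuilder("master-host:7051").build();
>
>   public void getSplits() throws Exception {
>     // ... compute splits using 'client' ...
>     client.shutdown();             // closes the client the record reader still needs
>   }
>
>   public void initializeRecordReader() throws Exception {
>     // Fails when the same InputFormat instance already ran getSplits(),
>     // because 'client' was shut down above.
>     client.openTable("some_table");
>   }
> }
> {code}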



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2262) Java client does not retry if no master is a leader

2018-01-18 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331493#comment-16331493
 ] 

Todd Lipcon commented on KUDU-2262:
---

Similar issue to KUDU-2267 but this one happens even without authentication 
being required

> Java client does not retry if no master is a leader
> ---
>
> Key: KUDU-2262
> URL: https://issues.apache.org/jira/browse/KUDU-2262
> Project: Kudu
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.7.0
>Reporter: Todd Lipcon
>Priority: Major
>
> In a test case I tried to restart the masters and then start a new client to 
> connect to the cluster. This caused the client to fail because the masters 
> were in the process of a leader election.
> It probably would make more sense for the client to retry a certain number of 
> times.
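> A hedged sketch of the kind of retry this suggests, written here as an application-side workaround using the public Java client API (the retry count and backoff are arbitrary examples, not a proposed patch to the client itself):
> {code:java}
> import org.apache.kudu.client.KuduClient;
> import org.apache.kudu.client.KuduException;
>
> public class ConnectWithRetrySketch {
>   static KuduClient connect(String masterAddresses) throws Exception {
>     KuduClient client = new KuduClient.KuduClientBuilder(masterAddresses).build();
>     for (int attempt = 1; ; attempt++) {
>       try {
>         client.getTablesList();        // forces a round trip to the leader master
>         return client;
>       } catch (KuduException e) {
>         if (attempt >= 5) {            // arbitrary cap for the sketch
>           client.close();
>           throw e;
>         }
>         Thread.sleep(1000L * attempt); // simple linear backoff while masters elect a leader
>       }
>     }
>   }
> }
> {code}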



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput

2018-01-18 Thread Joe McDonnell (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe McDonnell updated KUDU-2086:

Priority: Critical  (was: Blocker)

> Uneven assignment of connections to Reactor threads creates skew and limits 
> transfer throughput
> ---
>
> Key: KUDU-2086
> URL: https://issues.apache.org/jira/browse/KUDU-2086
> Project: Kudu
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 1.4.0
>Reporter: Mostafa Mokhtar
>Assignee: Joe McDonnell
>Priority: Critical
> Attachments: krpc_hash_test.c
>
>
> Uneven assignment of connections to reactor threads causes a couple of 
> reactor threads to run at 100%, which limits overall system throughput.
> Increasing the number of reactor threads alleviates the problem, but some 
> threads are still running much hotter than others.
> The snapshot below is from a 20-node cluster:
> {code}
> ps -T -p 69387 | grep rpc |  grep -v "00:00"  | awk '{print $4,$0}' | sort
> 00:03:17  69387  69596 ?00:03:17 rpc reactor-695
> 00:03:20  69387  69632 ?00:03:20 rpc reactor-696
> 00:03:21  69387  69607 ?00:03:21 rpc reactor-696
> 00:03:25  69387  69629 ?00:03:25 rpc reactor-696
> 00:03:26  69387  69594 ?00:03:26 rpc reactor-695
> 00:03:34  69387  69595 ?00:03:34 rpc reactor-695
> 00:03:35  69387  69625 ?00:03:35 rpc reactor-696
> 00:03:38  69387  69570 ?00:03:38 rpc reactor-695
> 00:03:38  69387  69620 ?00:03:38 rpc reactor-696
> 00:03:47  69387  69639 ?00:03:47 rpc reactor-696
> 00:03:48  69387  69593 ?00:03:48 rpc reactor-695
> 00:03:49  69387  69591 ?00:03:49 rpc reactor-695
> 00:04:04  69387  69600 ?00:04:04 rpc reactor-696
> 00:07:16  69387  69640 ?00:07:16 rpc reactor-696
> 00:07:39  69387  69616 ?00:07:39 rpc reactor-696
> 00:07:54  69387  69572 ?00:07:54 rpc reactor-695
> 00:09:10  69387  69613 ?00:09:10 rpc reactor-696
> 00:09:28  69387  69567 ?00:09:28 rpc reactor-695
> 00:09:39  69387  69603 ?00:09:39 rpc reactor-696
> 00:09:42  69387  69641 ?00:09:42 rpc reactor-696
> 00:09:59  69387  69604 ?00:09:59 rpc reactor-696
> 00:10:06  69387  69623 ?00:10:06 rpc reactor-696
> 00:10:43  69387  69636 ?00:10:43 rpc reactor-696
> 00:10:59  69387  69642 ?00:10:59 rpc reactor-696
> 00:11:28  69387  69585 ?00:11:28 rpc reactor-695
> 00:12:43  69387  69598 ?00:12:43 rpc reactor-695
> 00:15:42  69387  69578 ?00:15:42 rpc reactor-695
> 00:16:10  69387  69614 ?00:16:10 rpc reactor-696
> 00:17:43  69387  69575 ?00:17:43 rpc reactor-695
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput

2018-01-18 Thread Joe McDonnell (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331492#comment-16331492
 ] 

Joe McDonnell commented on KUDU-2086:
-

[~tlipcon] Good point, this is not a blocker. I will lower the priority.

> Uneven assignment of connections to Reactor threads creates skew and limits 
> transfer throughput
> ---
>
> Key: KUDU-2086
> URL: https://issues.apache.org/jira/browse/KUDU-2086
> Project: Kudu
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 1.4.0
>Reporter: Mostafa Mokhtar
>Assignee: Joe McDonnell
>Priority: Blocker
> Attachments: krpc_hash_test.c
>
>
> Uneven assignment of connections to reactor threads causes a couple of 
> reactor threads to run at 100%, which limits overall system throughput.
> Increasing the number of reactor threads alleviates the problem, but some 
> threads are still running much hotter than others.
> The snapshot below is from a 20-node cluster:
> {code}
> ps -T -p 69387 | grep rpc |  grep -v "00:00"  | awk '{print $4,$0}' | sort
> 00:03:17  69387  69596 ?00:03:17 rpc reactor-695
> 00:03:20  69387  69632 ?00:03:20 rpc reactor-696
> 00:03:21  69387  69607 ?00:03:21 rpc reactor-696
> 00:03:25  69387  69629 ?00:03:25 rpc reactor-696
> 00:03:26  69387  69594 ?00:03:26 rpc reactor-695
> 00:03:34  69387  69595 ?00:03:34 rpc reactor-695
> 00:03:35  69387  69625 ?00:03:35 rpc reactor-696
> 00:03:38  69387  69570 ?00:03:38 rpc reactor-695
> 00:03:38  69387  69620 ?00:03:38 rpc reactor-696
> 00:03:47  69387  69639 ?00:03:47 rpc reactor-696
> 00:03:48  69387  69593 ?00:03:48 rpc reactor-695
> 00:03:49  69387  69591 ?00:03:49 rpc reactor-695
> 00:04:04  69387  69600 ?00:04:04 rpc reactor-696
> 00:07:16  69387  69640 ?00:07:16 rpc reactor-696
> 00:07:39  69387  69616 ?00:07:39 rpc reactor-696
> 00:07:54  69387  69572 ?00:07:54 rpc reactor-695
> 00:09:10  69387  69613 ?00:09:10 rpc reactor-696
> 00:09:28  69387  69567 ?00:09:28 rpc reactor-695
> 00:09:39  69387  69603 ?00:09:39 rpc reactor-696
> 00:09:42  69387  69641 ?00:09:42 rpc reactor-696
> 00:09:59  69387  69604 ?00:09:59 rpc reactor-696
> 00:10:06  69387  69623 ?00:10:06 rpc reactor-696
> 00:10:43  69387  69636 ?00:10:43 rpc reactor-696
> 00:10:59  69387  69642 ?00:10:59 rpc reactor-696
> 00:11:28  69387  69585 ?00:11:28 rpc reactor-695
> 00:12:43  69387  69598 ?00:12:43 rpc reactor-695
> 00:15:42  69387  69578 ?00:15:42 rpc reactor-695
> 00:16:10  69387  69614 ?00:16:10 rpc reactor-696
> 00:17:43  69387  69575 ?00:17:43 rpc reactor-695
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-2261) flush responses' order should match the order we call apply

2018-01-18 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2261.
---
   Resolution: Fixed
Fix Version/s: 1.7.0

> flush responses' order should match the order we call apply
> ---
>
> Key: KUDU-2261
> URL: https://issues.apache.org/jira/browse/KUDU-2261
> Project: Kudu
>  Issue Type: Improvement
>  Components: client, java
>Reporter: ZhangZhen
>Assignee: ZhangZhen
>Priority: Major
> Fix For: 1.7.0
>
>
> The response list of flush() should be in the same order in which we applied 
> the operations, so it's easier to know which operation failed and which 
> succeeded.
> For example, if we apply three operations in the following order:
> apply OpA
> apply OpB
> apply OpC
> flush
> The expected response list is [ ResponseA, ResponseB, ResponseC ], but 
> currently the list may be out of order, e.g. [ ResponseC, ResponseA, ResponseB ]
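> A small sketch of why callers want this ordering (Java client API; it assumes a table whose schema has an INT32 "key" column, which is just an example): with ordered responses, the response at index i can be attributed to the i-th applied operation.
> {code:java}
> import java.util.List;
> import org.apache.kudu.client.Insert;
> import org.apache.kudu.client.KuduClient;
> import org.apache.kudu.client.KuduSession;
> import org.apache.kudu.client.KuduTable;
> import org.apache.kudu.client.OperationResponse;
> import org.apache.kudu.client.SessionConfiguration;
>
> public class FlushOrderSketch {
>   static void applyAndMatch(KuduClient client, KuduTable table) throws Exception {
>     KuduSession session = client.newSession();
>     session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);
>
>     Insert[] ops = new Insert[3];
>     for (int i = 0; i < 3; i++) {
>       ops[i] = table.newInsert();
>       ops[i].getRow().addInt("key", i);   // assumes an INT32 "key" column
>       session.apply(ops[i]);
>     }
>
>     // With ordered responses, responses.get(i) corresponds to ops[i], so a row
>     // error can be attributed to the exact operation that caused it.
>     List<OperationResponse> responses = session.flush();
>     for (int i = 0; i < responses.size(); i++) {
>       if (responses.get(i).hasRowError()) {
>         System.out.println("op " + i + " failed: " + responses.get(i).getRowError());
>       }
>     }
>   }
> }
> {code}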



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput

2018-01-18 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331487#comment-16331487
 ] 

Todd Lipcon commented on KUDU-2086:
---

[~joemcdonnell] is this still a blocker? From talking with [~mmokhtar] offline 
recently it sounds like some changes went into Impala that drastically reduced 
the load on the reactor threads to the point that it isn't a big problem 
anymore.

Might still be worth doing this eventually but we try to reserve blocker 
priority for serious issues like data loss.

> Uneven assignment of connections to Reactor threads creates skew and limits 
> transfer throughput
> ---
>
> Key: KUDU-2086
> URL: https://issues.apache.org/jira/browse/KUDU-2086
> Project: Kudu
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 1.4.0
>Reporter: Mostafa Mokhtar
>Assignee: Joe McDonnell
>Priority: Blocker
> Attachments: krpc_hash_test.c
>
>
> Uneven assignment of connections to reactor threads causes a couple of 
> reactor threads to run at 100%, which limits overall system throughput.
> Increasing the number of reactor threads alleviates the problem, but some 
> threads are still running much hotter than others.
> The snapshot below is from a 20-node cluster:
> {code}
> ps -T -p 69387 | grep rpc |  grep -v "00:00"  | awk '{print $4,$0}' | sort
> 00:03:17  69387  69596 ?00:03:17 rpc reactor-695
> 00:03:20  69387  69632 ?00:03:20 rpc reactor-696
> 00:03:21  69387  69607 ?00:03:21 rpc reactor-696
> 00:03:25  69387  69629 ?00:03:25 rpc reactor-696
> 00:03:26  69387  69594 ?00:03:26 rpc reactor-695
> 00:03:34  69387  69595 ?00:03:34 rpc reactor-695
> 00:03:35  69387  69625 ?00:03:35 rpc reactor-696
> 00:03:38  69387  69570 ?00:03:38 rpc reactor-695
> 00:03:38  69387  69620 ?00:03:38 rpc reactor-696
> 00:03:47  69387  69639 ?00:03:47 rpc reactor-696
> 00:03:48  69387  69593 ?00:03:48 rpc reactor-695
> 00:03:49  69387  69591 ?00:03:49 rpc reactor-695
> 00:04:04  69387  69600 ?00:04:04 rpc reactor-696
> 00:07:16  69387  69640 ?00:07:16 rpc reactor-696
> 00:07:39  69387  69616 ?00:07:39 rpc reactor-696
> 00:07:54  69387  69572 ?00:07:54 rpc reactor-695
> 00:09:10  69387  69613 ?00:09:10 rpc reactor-696
> 00:09:28  69387  69567 ?00:09:28 rpc reactor-695
> 00:09:39  69387  69603 ?00:09:39 rpc reactor-696
> 00:09:42  69387  69641 ?00:09:42 rpc reactor-696
> 00:09:59  69387  69604 ?00:09:59 rpc reactor-696
> 00:10:06  69387  69623 ?00:10:06 rpc reactor-696
> 00:10:43  69387  69636 ?00:10:43 rpc reactor-696
> 00:10:59  69387  69642 ?00:10:59 rpc reactor-696
> 00:11:28  69387  69585 ?00:11:28 rpc reactor-695
> 00:12:43  69387  69598 ?00:12:43 rpc reactor-695
> 00:15:42  69387  69578 ?00:15:42 rpc reactor-695
> 00:16:10  69387  69614 ?00:16:10 rpc reactor-696
> 00:17:43  69387  69575 ?00:17:43 rpc reactor-695
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2268) Sync client exceptions don't show appropriate call stack

2018-01-18 Thread Todd Lipcon (JIRA)
Todd Lipcon created KUDU-2268:
-

 Summary: Sync client exceptions don't show appropriate call stack
 Key: KUDU-2268
 URL: https://issues.apache.org/jira/browse/KUDU-2268
 Project: Kudu
  Issue Type: Improvement
  Components: client, java
Reporter: Todd Lipcon


Currently, if you use the sync client and try to connect to the cluster but get 
some error like an authentication failure, the call stack in the resulting 
exception is entirely within Kudu/Netty code. It doesn't show you where your 
actual error is. We should fix up the call stacks before throwing errors in the 
sync client.
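
One common way to address this (a sketch only; the names SyncCallStacks and joinAndWrap are illustrative, not the actual Kudu client API) is to re-throw from the synchronous wrapper with an exception created on the calling thread, so its stack trace starts at the user's call site while the original failure is kept as the cause:

{code}
import com.stumbleupon.async.Deferred;

// Sketch only: illustrates the "fix up the call stack" idea, not the actual
// implementation in the Kudu sync client.
final class SyncCallStacks {
  static <T> T joinAndWrap(Deferred<T> deferred, long timeoutMs) throws Exception {
    try {
      return deferred.join(timeoutMs);
    } catch (Exception e) {
      // This new exception is created on the caller's thread, so its stack
      // trace points at the user's code; the Netty/Kudu-internal failure is
      // still available as the cause.
      throw new Exception("Kudu operation failed: " + e.getMessage(), e);
    }
  }
}
{code}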



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KUDU-2265) A non-leader master uses self-signed server TLS cert if it hasn't ever run as a leader

2018-01-18 Thread Alexey Serbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-2265:
---

Assignee: Alexey Serbin

> A non-leader master uses self-signed server TLS cert if it hasn't ever run as 
> a leader
> --
>
> Key: KUDU-2265
> URL: https://issues.apache.org/jira/browse/KUDU-2265
> Project: Kudu
>  Issue Type: Improvement
>  Components: master, security
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.6.0, 1.5.0
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>
> As currently implemented, the master process replaces its auto-generated 
> self-signed server TLS certificate with a CA-signed one only when it becomes 
> the leader (see the {{CatalogManager::InitCertAuthorityWith()}} method; it is 
> invoked only from {{CatalogManager::InitCertAuthority()}}, which in turn is 
> invoked only from {{CatalogManager::PrepareForLeadershipTask()}}).
>  
> In a multi-master configuration with just one Raft election from the start 
> (which is pretty common, BTW), the non-leader masters run without a CA-signed 
> server certificate for a long time.  That causes clients to not use their 
> authn tokens for authentication when connecting to those non-leader masters.  
> For Spark applications whose executors do not have Kerberos credentials (the 
> common case), the application logs are polluted with messages like below:
> {noformat}
> org.apache.kudu.client.NonRecoverableException: Server requires Kerberos, but 
> this client is not authenticated (kinit)
>   at org.apache.kudu.client.Negotiator.evaluateChallenge(Negotiator.java:705)
>   at org.apache.kudu.client.Negotiator.sendSaslInitiate(Negotiator.java:581)
> ...
> Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2267) Client may see negotiation failure when talks to master followers with only self signed cert

2018-01-18 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331458#comment-16331458
 ] 

Todd Lipcon commented on KUDU-2267:
---

I believe this problem is actually worse than just some log spew, in two 
different cases:

(1) all masters have just restarted and have not yet elected a leader

In this case, if a client tries to connect by token, it will get an 
authentication failure from all masters. These are considered non-retriable, so 
the client will fail to connect to the cluster.

(2) the old leader has restarted and existing token-authenticated clients try 
to contact it

If a client is connected to the cluster using token auth, it will have cached 
its concept of which master is the leader master. The master then restarts and 
comes back up as a follower (quite likely). The next time the client tries to 
issue any master RPC, it will attempt to reconnect to its cached leader. In 
this case, that cached leader will not accept the token authentication method 
(because it has no CA cert) and thus the client will fail to authenticate. This 
failure is considered non-retryable, so it doesn't call ReconnectToCluster and 
attempt to find the new leader.
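
Until the client itself treats this failure as retriable, an application-level workaround is to rebuild the client (which forces leader re-discovery) and retry. The following is a hedged sketch, assuming a bounded retry with back-off is acceptable for the application; the retry count and sleep values are illustrative, not project guidance:

{code}
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.NonRecoverableException;

// Sketch of an application-side workaround for the cached-leader case above.
final class ReconnectOnAuthFailure {
  static KuduClient connectWithRetry(String masterAddresses) throws Exception {
    for (int attempt = 0; ; attempt++) {
      KuduClient client = new KuduClient.KuduClientBuilder(masterAddresses).build();
      try {
        // Any master RPC forces leader discovery; listing tables is a cheap probe.
        client.getTablesList();
        return client;
      } catch (NonRecoverableException e) {
        client.shutdown();
        if (attempt >= 2) {
          throw e;
        }
        Thread.sleep(1000L * (attempt + 1));  // back off, then re-resolve the leader
      }
    }
  }
}
{code}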

> Client may see negotiation failure when talks to master followers with only 
> self signed cert 
> -
>
> Key: KUDU-2267
> URL: https://issues.apache.org/jira/browse/KUDU-2267
> Project: Kudu
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 1.6.0
>Reporter: Hao Hao
>Priority: Major
>
> Currently, if a master has never been the leader since the very start of the 
> cluster, it has only a self-signed cert. And if a client does not have a valid 
> Kerberos credential but only an authentication token, the client may see an 
> {{org.apache.kudu.client.NonRecoverableException: Server requires Kerberos, 
> but this client is not authenticated}} error when trying to connect to master 
> followers, since in that case the SASL authentication type is chosen instead 
> of token authentication.
> It is safe to ignore this error as long as the client is able to connect to 
> the master leader. However, as a longer-term fix, masters should probably 
> attempt to get a signed cert from the leader.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2126) DeleteTablet RPC on an already deleted tablet should be a no-op

2018-01-18 Thread Jeffrey F. Lukman (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey F. Lukman updated KUDU-2126:

Code Review: http://gerrit.cloudera.org:8080/9069

> DeleteTablet RPC on an already deleted tablet should be a no-op
> ---
>
> Key: KUDU-2126
> URL: https://issues.apache.org/jira/browse/KUDU-2126
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.5.0
>Reporter: Adar Dembo
>Assignee: Jeffrey F. Lukman
>Priority: Major
>  Labels: newbie
>
> This is the remaining (but lower priority) work from KUDU-2114:
> bq. When this was observed on a live cluster, it was further noted that the 
> tablet deletion requests were rather expensive. It appears that a DeleteTablet 
> RPC on a tombstone is not a no-op; it always flushes the superblock twice, 
> which generates two fsyncs. This should also be addressed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-869) Support PRE_VOTER config membership type

2018-01-18 Thread Alexey Serbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-869.

   Resolution: Fixed
Fix Version/s: 1.7.0

This particular feature, with the corresponding operations, was implemented in 
79a255bbf4bf52f5d62feba213b21c9ad2fd9ffe.

> Support PRE_VOTER config membership type
> 
>
> Key: KUDU-869
> URL: https://issues.apache.org/jira/browse/KUDU-869
> Project: Kudu
>  Issue Type: Sub-task
>  Components: consensus
>Affects Versions: Feature Complete
>Reporter: Mike Percy
>Assignee: Mike Percy
>Priority: Critical
> Fix For: 1.7.0
>
>
> A PRE_VOTER membership type will reduce unavailability when bootstrapping new 
> nodes. See the remote bootstrap spec @ 
> https://docs.google.com/document/d/1zSibYnwPv9cFRnWn0ORyu2uCGB9Neb-EsF0M6AiMSEE
>  for details



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-869) Support PRE_VOTER config membership type

2018-01-18 Thread Alexey Serbin (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331428#comment-16331428
 ] 

Alexey Serbin commented on KUDU-869:


[~tlipcon], yes, I think we can mark this as resolved: the functionality for 
non-voters has been implemented a long time ago.

> Support PRE_VOTER config membership type
> 
>
> Key: KUDU-869
> URL: https://issues.apache.org/jira/browse/KUDU-869
> Project: Kudu
>  Issue Type: Sub-task
>  Components: consensus
>Affects Versions: Feature Complete
>Reporter: Mike Percy
>Assignee: Mike Percy
>Priority: Critical
>
> A PRE_VOTER membership type will reduce unavailability when bootstrapping new 
> nodes. See the remote bootstrap spec @ 
> https://docs.google.com/document/d/1zSibYnwPv9cFRnWn0ORyu2uCGB9Neb-EsF0M6AiMSEE
>  for details



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2267) Client may see negotiation failure when talks to master followers with only self signed cert

2018-01-18 Thread Hao Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hao Hao updated KUDU-2267:
--
Description: 
Currently, if a master has never been the leader since the very start of the 
cluster, it has only a self-signed cert. And if a client does not have a valid 
Kerberos credential but only an authentication token, the client may see an 
{{org.apache.kudu.client.NonRecoverableException: Server requires Kerberos, but 
this client is not authenticated}} error when trying to connect to master 
followers, since in that case the SASL authentication type is chosen instead of 
token authentication.

It is safe to ignore this error as long as the client is able to connect to the 
master leader. However, as a longer-term fix, masters should probably attempt to 
get a signed cert from the leader.

  was:Currently, if a master has never been the leader since the very start of 
the cluster, it has only a self-signed cert. And if a client does not have a 
valid Kerberos credential but only an authentication token, the client may see 
an {{org.apache.kudu.client.NonRecoverableException: Server requires Kerberos, 
but this client is not authenticated}} error when trying to connect to master 
followers, since in that case the SASL authentication type is chosen instead of 
token authentication.


> Client may see negotiation failure when talks to master followers with only 
> self signed cert 
> -
>
> Key: KUDU-2267
> URL: https://issues.apache.org/jira/browse/KUDU-2267
> Project: Kudu
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 1.6.0
>Reporter: Hao Hao
>Priority: Major
>
> Currently, if a master has never been the leader since the very start of the 
> cluster, it has only a self-signed cert. And if a client does not have a valid 
> Kerberos credential but only an authentication token, the client may see an 
> {{org.apache.kudu.client.NonRecoverableException: Server requires Kerberos, 
> but this client is not authenticated}} error when trying to connect to master 
> followers, since in that case the SASL authentication type is chosen instead 
> of token authentication.
> It is safe to ignore this error as long as the client is able to connect to 
> the master leader. However, as a longer-term fix, masters should probably 
> attempt to get a signed cert from the leader.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-869) Support PRE_VOTER config membership type

2018-01-18 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331420#comment-16331420
 ] 

Todd Lipcon commented on KUDU-869:
--

[~mpercy] [~aserbin] this can be resolved, right?

> Support PRE_VOTER config membership type
> 
>
> Key: KUDU-869
> URL: https://issues.apache.org/jira/browse/KUDU-869
> Project: Kudu
>  Issue Type: Sub-task
>  Components: consensus
>Affects Versions: Feature Complete
>Reporter: Mike Percy
>Assignee: Mike Percy
>Priority: Critical
>
> A PRE_VOTER membership type will reduce unavailability when bootstrapping new 
> nodes. See the remote bootstrap spec @ 
> https://docs.google.com/document/d/1zSibYnwPv9cFRnWn0ORyu2uCGB9Neb-EsF0M6AiMSEE
>  for details



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2265) A non-leader master uses self-signed server TLS cert if it hasn't ever run as a leader

2018-01-18 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331421#comment-16331421
 ] 

Todd Lipcon commented on KUDU-2265:
---

[~aserbin] OK, I'm actually working the error message improvements into another 
patch, so if you haven't started on that part yet, just hang on :)

> A non-leader master uses self-signed server TLS cert if it hasn't ever run as 
> a leader
> --
>
> Key: KUDU-2265
> URL: https://issues.apache.org/jira/browse/KUDU-2265
> Project: Kudu
>  Issue Type: Improvement
>  Components: master, security
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.6.0, 1.5.0
>Reporter: Alexey Serbin
>Priority: Major
>
> As currently implemented, the master process replaces its auto-generated 
> self-signed server TLS certificate with a CA-signed one only when it becomes 
> the leader (see the {{CatalogManager::InitCertAuthorityWith()}} method; it is 
> invoked only from {{CatalogManager::InitCertAuthority()}}, which in turn is 
> invoked only from {{CatalogManager::PrepareForLeadershipTask()}}).
>  
> In a multi-master configuration with just one Raft election from the start 
> (which is pretty common, BTW), the non-leader masters run without a CA-signed 
> server certificate for a long time.  That causes clients to not use their 
> authn tokens for authentication when connecting to those non-leader masters.  
> For Spark applications whose executors do not have Kerberos credentials (the 
> common case), the application logs are polluted with messages like below:
> {noformat}
> org.apache.kudu.client.NonRecoverableException: Server requires Kerberos, but 
> this client is not authenticated (kinit)
>   at org.apache.kudu.client.Negotiator.evaluateChallenge(Negotiator.java:705)
>   at org.apache.kudu.client.Negotiator.sendSaslInitiate(Negotiator.java:581)
> ...
> Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-2193) Severe lock contention on TSTabletManager lock

2018-01-18 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2193.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

I fixed this in 1.6

> Severe lock contention on TSTabletManager lock
> --
>
> Key: KUDU-2193
> URL: https://issues.apache.org/jira/browse/KUDU-2193
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.6.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 1.6.0
>
>
> I'm doing some stress/failure testing on a cluster with lots of tablets and 
> ran into the following mess:
> - TSTabletManager::GenerateIncrementalTabletReport is holding the 
> TSTabletManager lock in 'read' mode
> -- it's calling CreateReportedTabletPB on a bunch of tablets which are in the 
> process of an election storm
> -- each such call blocks in RaftConsensus::ConsensusState since it's in the 
> process of fsyncing metadata to disk
> -- thus it's holding the read lock on the TSTabletManager lock for a long time 
> (many seconds, if not tens of seconds)
> - meanwhile, some other thread is trying to take TSTabletManager::lock for 
> write, and is blocked behind the above reader
> - rw_spinlock is writer-starvation-free, which means that no more readers can 
> acquire the lock
> What's worse is that rw_spinlock is a true spin lock, so now there are tens 
> of threads in a 'while (true) sched_yield()' loop, generating over 1.5M 
> context switches per second.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2267) Client may see negotiation failure when talks to master followers with only self signed cert

2018-01-18 Thread Hao Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hao Hao updated KUDU-2267:
--
Issue Type: Improvement  (was: Bug)

> Client may see negotiation failure when talks to master followers with only 
> self signed cert 
> -
>
> Key: KUDU-2267
> URL: https://issues.apache.org/jira/browse/KUDU-2267
> Project: Kudu
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 1.6.0
>Reporter: Hao Hao
>Priority: Major
>
> Currently, if a master has never been the leader since the very start of the 
> cluster, it has only a self-signed cert. And if a client does not have a valid 
> Kerberos credential but only an authentication token, the client may see an 
> {{org.apache.kudu.client.NonRecoverableException: Server requires Kerberos, 
> but this client is not authenticated}} error when trying to connect to master 
> followers, since in that case the SASL authentication type is chosen instead 
> of token authentication.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2267) Client may see negotiation failure when talks to master followers with only self signed cert

2018-01-18 Thread Hao Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hao Hao updated KUDU-2267:
--
Summary: Client may see negotiation failure when talks to master followers 
with only self signed cert   (was: Client may see negotiation failure exception 
when talks to master followers with only self signed cert )

> Client may see negotiation failure when talks to master followers with only 
> self signed cert 
> -
>
> Key: KUDU-2267
> URL: https://issues.apache.org/jira/browse/KUDU-2267
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: 1.6.0
>Reporter: Hao Hao
>Priority: Major
>
> Currently, if a master has never been the leader since the very start of the 
> cluster, it has only a self-signed cert. And if a client does not have a valid 
> Kerberos credential but only an authentication token, the client may see an 
> {{org.apache.kudu.client.NonRecoverableException: Server requires Kerberos, 
> but this client is not authenticated}} error when trying to connect to master 
> followers, since in that case the SASL authentication type is chosen instead 
> of token authentication.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2267) Client may see negotiation failure exception when talks to master followers with only self signed cert

2018-01-18 Thread Hao Hao (JIRA)
Hao Hao created KUDU-2267:
-

 Summary: Client may see negotiation failure exception when talks 
to master followers with only self signed cert 
 Key: KUDU-2267
 URL: https://issues.apache.org/jira/browse/KUDU-2267
 Project: Kudu
  Issue Type: Bug
  Components: client
Affects Versions: 1.6.0
Reporter: Hao Hao


Currently, if a master has never been the leader since the very start of the 
cluster, it has only a self-signed cert. And if a client does not have a valid 
Kerberos credential but only an authentication token, the client may see an 
{{org.apache.kudu.client.NonRecoverableException: Server requires Kerberos, but 
this client is not authenticated}} error when trying to connect to master 
followers, since in that case the SASL authentication type is chosen instead of 
token authentication.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2266) Add OpId to pre-election / election log messages

2018-01-18 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2266:


 Summary: Add OpId to pre-election / election log messages
 Key: KUDU-2266
 URL: https://issues.apache.org/jira/browse/KUDU-2266
 Project: Kudu
  Issue Type: Improvement
  Components: consensus
Reporter: Mike Percy
Assignee: Mike Percy


Having the last logged OpId would help with detecting progress when reviewing 
logs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-1489) Use WAL directory for tablet metadata files

2018-01-18 Thread Andrew Wong (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wong updated KUDU-1489:
--
   Resolution: Fixed
Fix Version/s: 1.7.0
   Status: Resolved  (was: In Review)

Resolved as 64926335fe263b43f1493c03a91ea999759dfc71

> Use WAL directory for tablet metadata files
> ---
>
> Key: KUDU-1489
> URL: https://issues.apache.org/jira/browse/KUDU-1489
> Project: Kudu
>  Issue Type: Improvement
>  Components: consensus, fs, tserver
>Affects Versions: 0.9.0
>Reporter: Adar Dembo
>Assignee: Andrew Wong
>Priority: Major
> Fix For: 1.7.0
>
>
> Today a tserver will place tablet metadata files (i.e. superblock and cmeta 
> files) in the first configured data directory. I don't remember why we 
> decided to do this (commit 691f97d introduced it), but upon reconsideration 
> the WAL directory seems like a much better choice, because if the machine has 
> different kinds of I/O devices, the WAL directory's device is typically the 
> fastest.
> Mostafa has been testing Impala and Kudu on a cluster with many thousands of 
> tablets. His cluster contains storage-dense machines, each configured with 14 
> spinning disks and one flash device. Naturally, the WAL directory sits on 
> that flash device and the data directories are on the spinning disks. With 
> thousands of tablet metadata files on the first spinning disk, nearly every 
> tablet in the tserver is bottlenecked on that device due to the sheer amount 
> of I/O needed to maintain the running state of the tablet, specifically 
> rewriting cmeta files on various Raft events (votes, term advancement, etc.).
> Many thousands of tablets is not really a good scale for Kudu right now, but 
> moving the tablet metadata files to a faster device should at least help with 
> the above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2265) A non-leader master uses self-signed server TLS cert if it hasn't ever run as a leader

2018-01-18 Thread Alexey Serbin (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331197#comment-16331197
 ] 

Alexey Serbin commented on KUDU-2265:
-

[~tlipcon], yes, I think that would be easier to troubleshoot.

We probably need to add more logging to both the C++ and the Java client code 
to describe which mechanisms are available, which mechanism was chosen, and 
why.  That should be printed out when negotiation fails (e.g., like it's done 
now as a trace in the C++ client).

> A non-leader master uses self-signed server TLS cert if it hasn't ever run as 
> a leader
> --
>
> Key: KUDU-2265
> URL: https://issues.apache.org/jira/browse/KUDU-2265
> Project: Kudu
>  Issue Type: Improvement
>  Components: master, security
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.6.0, 1.5.0
>Reporter: Alexey Serbin
>Priority: Major
>
> As currently implemented, the master process replaces its auto-generated 
> self-signed server TLS certificate with a CA-signed one only when it becomes 
> the leader (see the {{CatalogManager::InitCertAuthorityWith()}} method; it is 
> invoked only from {{CatalogManager::InitCertAuthority()}}, which in turn is 
> invoked only from {{CatalogManager::PrepareForLeadershipTask()}}).
>  
> In a multi-master configuration with just one Raft election from the start 
> (which is pretty common, BTW), the non-leader masters run without a CA-signed 
> server certificate for a long time.  That causes clients to not use their 
> authn tokens for authentication when connecting to those non-leader masters.  
> For Spark applications whose executors do not have Kerberos credentials (the 
> common case), the application logs are polluted with messages like below:
> {noformat}
> org.apache.kudu.client.NonRecoverableException: Server requires Kerberos, but 
> this client is not authenticated (kinit)
>   at org.apache.kudu.client.Negotiator.evaluateChallenge(Negotiator.java:705)
>   at org.apache.kudu.client.Negotiator.sendSaslInitiate(Negotiator.java:581)
> ...
> Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2265) A non-leader master uses self-signed server TLS cert if it hasn't ever run as a leader

2018-01-18 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331127#comment-16331127
 ] 

Todd Lipcon commented on KUDU-2265:
---

It would be great if we could augment the message 
"org.apache.kudu.client.NonRecoverableException: Server requires Kerberos, but 
this client is not authenticated (kinit)" with something that says "Did not use 
the available authentication token because the server side does not have a 
trusted certificate." That would have made this easier to debug, right?

> A non-leader master uses self-signed server TLS cert if it hasn't ever run as 
> a leader
> --
>
> Key: KUDU-2265
> URL: https://issues.apache.org/jira/browse/KUDU-2265
> Project: Kudu
>  Issue Type: Improvement
>  Components: master, security
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.6.0, 1.5.0
>Reporter: Alexey Serbin
>Priority: Major
>
> As currently implemented, the master process replaces its auto-generated 
> self-signed server TLS certificate with a CA-signed one only when it becomes 
> the leader (see the {{CatalogManager::InitCertAuthorityWith()}} method; it is 
> invoked only from {{CatalogManager::InitCertAuthority()}}, which in turn is 
> invoked only from {{CatalogManager::PrepareForLeadershipTask()}}).
>  
> In a multi-master configuration with just one Raft election from the start 
> (which is pretty common, BTW), the non-leader masters run without a CA-signed 
> server certificate for a long time.  That causes clients to not use their 
> authn tokens for authentication when connecting to those non-leader masters.  
> For Spark applications whose executors do not have Kerberos credentials (the 
> common case), the application logs are polluted with messages like below:
> {noformat}
> org.apache.kudu.client.NonRecoverableException: Server requires Kerberos, but 
> this client is not authenticated (kinit)
>   at org.apache.kudu.client.Negotiator.evaluateChallenge(Negotiator.java:705)
>   at org.apache.kudu.client.Negotiator.sendSaslInitiate(Negotiator.java:581)
> ...
> Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2265) A non-leader master uses self-signed server TLS cert if it hasn't ever run as a leader

2018-01-18 Thread Alexey Serbin (JIRA)
Alexey Serbin created KUDU-2265:
---

 Summary: A non-leader master uses self-signed server TLS cert if 
it hasn't ever run as a leader
 Key: KUDU-2265
 URL: https://issues.apache.org/jira/browse/KUDU-2265
 Project: Kudu
  Issue Type: Improvement
  Components: master, security
Affects Versions: 1.5.0, 1.6.0, 1.4.0, 1.3.1, 1.3.0
Reporter: Alexey Serbin


As currently implemented, the master process replaces its auto-generated 
self-signed server TLS certificate with a CA-signed one only when it becomes 
the leader (see the {{CatalogManager::InitCertAuthorityWith()}} method; it is 
invoked only from {{CatalogManager::InitCertAuthority()}}, which in turn is 
invoked only from {{CatalogManager::PrepareForLeadershipTask()}}).

In a multi-master configuration with just one Raft election from the start 
(which is pretty common, BTW), the non-leader masters run without a CA-signed 
server certificate for a long time.  That causes clients to not use their authn 
tokens for authentication when connecting to those non-leader masters.  For 
Spark applications whose executors do not have Kerberos credentials (the common 
case), the application logs are polluted with messages like below:
{noformat}
org.apache.kudu.client.NonRecoverableException: Server requires Kerberos, but 
this client is not authenticated (kinit)
  at org.apache.kudu.client.Negotiator.evaluateChallenge(Negotiator.java:705)
  at org.apache.kudu.client.Negotiator.sendSaslInitiate(Negotiator.java:581)

...

Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by 
GSSException: No valid credentials provided (Mechanism level: Failed to find 
any Kerberos tgt)]
  at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KUDU-2126) DeleteTablet RPC on an already deleted tablet should be a no-op

2018-01-18 Thread Jeffrey F. Lukman (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey F. Lukman reassigned KUDU-2126:
---

Assignee: Jeffrey F. Lukman

> DeleteTablet RPC on an already deleted tablet should be a no-op
> ---
>
> Key: KUDU-2126
> URL: https://issues.apache.org/jira/browse/KUDU-2126
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.5.0
>Reporter: Adar Dembo
>Assignee: Jeffrey F. Lukman
>Priority: Major
>  Labels: newbie
>
> This is the remaining (but lower priority) work from KUDU-2114:
> bq. When this was observed on a live cluster, it was further noted that the 
> tablet deletion requests were rather expensive. It appears that a DeleteTablet 
> RPC on a tombstone is not a no-op; it always flushes the superblock twice, 
> which generates two fsyncs. This should also be addressed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-721) Support for DECIMAL type

2018-01-18 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-721:
-
Target Version/s: 1.7.0  (was: 1.6.0)

> Support for DECIMAL type
> 
>
> Key: KUDU-721
> URL: https://issues.apache.org/jira/browse/KUDU-721
> Project: Kudu
>  Issue Type: New Feature
>  Components: client, tablet
>Reporter: Todd Lipcon
>Assignee: Grant Henke
>Priority: Critical
>  Labels: kudu-roadmap
>
> [~mgrund] identified that without a DECIMAL type, we're going to have issues 
> with a lot of the test tables that Impala uses. Also, since we're targeting 
> some financial applications, it seems pretty crucial. This JIRA is to track 
> the work necessary to support it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2240) Expose partitioning information in a straightforward way in the Java API

2018-01-18 Thread Attila Bukor (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Bukor updated KUDU-2240:
---
Target Version/s: 1.7.0  (was: 1.6.0)

> Expose partitioning information in a straightforward way in the Java API
> 
>
> Key: KUDU-2240
> URL: https://issues.apache.org/jira/browse/KUDU-2240
> Project: Kudu
>  Issue Type: Improvement
>  Components: api, client, java
>Reporter: Attila Bukor
>Assignee: Attila Bukor
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)