[jira] [Updated] (KUDU-2319) Follower masters cannot accept authn tokens for verification

2018-02-22 Thread Alexey Serbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2319:

Status: In Review  (was: In Progress)

> Follower masters cannot accept authn tokens for verification
> 
>
> Key: KUDU-2319
> URL: https://issues.apache.org/jira/browse/KUDU-2319
> Project: Kudu
>  Issue Type: Bug
>  Components: master, security
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.4.1, 1.6.0, 1.7.0
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>
> In a multi-master setup, the follower masters that have never been leaders 
> cannot accept authn tokens for verification because they don't have the 
> public parts of the TSKs in their TokenVerifier.
> A small integration test posted as a WIP patch illustrates this:
>   http://gerrit.cloudera.org:8080/9373



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KUDU-2293) tserver crashes with 'Found tablet in TABLET_DATA_COPYING state during StartTabletCopy()' message

2018-02-22 Thread Andrew Wong (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wong reassigned KUDU-2293:
-

Assignee: Andrew Wong

> tserver crashes with 'Found tablet in TABLET_DATA_COPYING state during 
> StartTabletCopy()' message
> -
>
> Key: KUDU-2293
> URL: https://issues.apache.org/jira/browse/KUDU-2293
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.5.0, 1.6.0, 1.7.0
>Reporter: Alexey Serbin
>Assignee: Andrew Wong
>Priority: Major
> Attachments: crash-at-tablet-copy-session-start.log
>
>
> When running out of disk space, a tablet server can crash while trying to start 
> a tablet copy over an already tombstoned replica.
> In essence, if {{DataDirManager::CreateDataDirGroup()}} returns an error due 
> to the out-of-disk-space condition while running 
> {{TabletCopyClient::Start()}}, the tablet server crashes with an error message 
> like the one below.  The relevant part of the log is attached.
> {noformat}
> F0208 05:35:22.152496  2721 ts_tablet_manager.cc:563] T 
> 5384471d823e46929029f9ff6ce212a3 P c713ac498df040caa897d3229214baa3: Tablet 
> Copy: Found tablet in TABLET_DATA_COPYING state during 
> StartTabletCopy(){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2324) Add gflags to disable individual maintenance ops

2018-02-22 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-2324:
--
Component/s: supportability

> Add gflags to disable individual maintenance ops
> 
>
> Key: KUDU-2324
> URL: https://issues.apache.org/jira/browse/KUDU-2324
> Project: Kudu
>  Issue Type: Improvement
>  Components: supportability, tablet
>Reporter: Mike Percy
>Priority: Major
>
> It would be helpful in emergency situations to be able to disable individual 
> types of maintenance operations, such as major delta compaction or merging 
> compaction.
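
A minimal sketch of what such per-op flags could look like, using stock gflags; the flag names and the scheduling check are hypothetical, not the actual Kudu implementation:

{code}
#include <cstdio>
#include <gflags/gflags.h>

// Hypothetical flags; the real names would be decided in the patch.
DEFINE_bool(enable_major_delta_compaction, true,
            "Whether the maintenance manager may schedule major delta "
            "compaction ops.");
DEFINE_bool(enable_rowset_compaction, true,
            "Whether the maintenance manager may schedule merging (rowset) "
            "compaction ops.");

int main(int argc, char** argv) {
  gflags::ParseCommandLineFlags(&argc, &argv, true);
  // A maintenance op's scheduling check could then consult the flag and
  // simply refuse to run when the operator has disabled it.
  std::printf("rowset compaction enabled: %d\n", FLAGS_enable_rowset_compaction);
  return 0;
}
{code}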



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-1266) Figure out per-version docs publishing

2018-02-22 Thread Andrew Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373533#comment-16373533
 ] 

Andrew Wong commented on KUDU-1266:
---

It'd also be useful to have a /dev version in addition to the release versions, 
or at least publish to the SNAPSHOT site prior to releases with a warning 
indicating its status. Today, it's kind of painful writing docs and having to 
wait 2-3 months for them to see the light of day. Automating site publication 
would make this sort of thing more feasible.

> Figure out per-version docs publishing
> --
>
> Key: KUDU-1266
> URL: https://issues.apache.org/jira/browse/KUDU-1266
> Project: Kudu
>  Issue Type: Task
>  Components: documentation
>Reporter: Jean-Daniel Cryans
>Assignee: Mike Percy
>Priority: Minor
>
> Right now we just push the documentation in master to the website, but 
> ideally we'd want to have documentation available for each version. What's 
> the best way to do this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-2275) SIGSEGV due to bug in libunwind

2018-02-22 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2275.
---
   Resolution: Fixed
Fix Version/s: 1.7.0

Upgraded to libunwind 1.3-rc1 to fix this

> SIGSEGV due to bug in libunwind
> ---
>
> Key: KUDU-2275
> URL: https://issues.apache.org/jira/browse/KUDU-2275
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Will Berkeley
>Assignee: Todd Lipcon
>Priority: Major
> Fix For: 1.7.0
>
>
> Rarely, the kernel stack watchdog can cause a segfault due to a bug in 
> libunwind.
> {noformat}
> *** Aborted at 1516180006 (unix time) try "date -d @1516180006" if you are 
> using GNU date ***
> PC: @ 0x8c94b4 (unknown)
> *** SIGSEGV (@0x7f27173e) received by PID 22279 (TID 0x7f270f87f700) from 
> PID 389939200; stack trace: ***{noformat}
> From a core file (produced from the minidump), the backtrace is
> {noformat}
> #0  access_mem (as=<optimized out>, addr=139805870391296, val=0x7f270f87bcc0, 
> write=<optimized out>, arg=<optimized out>)
>    at 
> /usr/src/debug/kudu-1.5.0-cdh5.13.1/thirdparty/src/libunwind-1.1a/src/x86_64/Ginit.c:173
> #1  0x008c8e02 in is_plt_entry (c=0x7f270f87c0e0) at 
> /usr/src/debug/kudu-1.5.0-cdh5.13.1/thirdparty/src/libunwind-1.1a/src/x86_64/Gstep.c:43
> #2  _ULx86_64_step (cursor=0x7f270f87c0e0) at 
> /usr/src/debug/kudu-1.5.0-cdh5.13.1/thirdparty/src/libunwind-1.1a/src/x86_64/Gstep.c:125
> #3  0x008c412d in google::GetStackTrace 
> (result=result@entry=0x292c0c8, max_depth=max_depth@entry=16, skip_count=0, 
> skip_count@entry=2)
>    at 
> /usr/src/debug/kudu-1.5.0-cdh5.13.1/thirdparty/src/glog-0.3.5/src/stacktrace_libunwind-inl.h:78
> #4  0x01a9be8c in Collect (skip_frames=2, this=0x292c0c0) at 
> /usr/src/debug/kudu-1.5.0-cdh5.13.1/src/kudu/util/debug-util.cc:350
> #5  kudu::(anonymous namespace)::HandleStackTraceSignal (signum=<optimized out>) at /usr/src/debug/kudu-1.5.0-cdh5.13.1/src/kudu/util/debug-util.cc:176
> #6  0x7f2716854670 in _quicksort () from ./lib64/libc.so.6
> #7  0x in ?? (){noformat}
> Note that addr = 139805870391296 = 0x7f27173e.
> The segfault happens because libunwind is accessing invalid memory it's 
> supposed to have validated:
> {code:java}
> /* validate address */
> const struct cursor *c = (const struct cursor *)arg;
> if (likely (c != NULL) && unlikely (c->validate)
> && unlikely (validate_mem (addr)))
> return -1;
> *val = *(unw_word_t *) addr;{code}
> [Others|https://lists.nongnu.org/archive/html/libunwind-devel/2016-09/msg1.html]
>  have seen this same problem before.
> There's also a fix for this issue in commit 
> 836c91c43d7a996028aa7e8d1f53630a6b8e7cbe. It's not in any release of 
> libunwind yet, so we could do one of the following
>  # upgrade libunwind to 1.2 (most recent release) and patch in the fix
>  # upgrade to a snapshot containing the fix
> To work around this, one can set --hung_task_check_interval_ms to a large value 
> like 2^30, so the stack watchdog runs very rarely (although the flag is a 
> 32-bit signed integer, so not too big). The tradeoff is the effective loss of 
> the stack watchdog, which can make debugging certain performance problems 
> more difficult.
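
For example, one possible invocation of the tablet server with the workaround applied (2^30 = 1073741824; other flags elided):

{noformat}
kudu-tserver --hung_task_check_interval_ms=1073741824 ...
{noformat}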



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-2291) Implement a /stacks web page

2018-02-22 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2291.
---
   Resolution: Fixed
Fix Version/s: 1.7.0

> Implement a /stacks web page
> 
>
> Key: KUDU-2291
> URL: https://issues.apache.org/jira/browse/KUDU-2291
> Project: Kudu
>  Issue Type: Improvement
>  Components: supportability
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Major
> Fix For: 1.7.0
>
>
> Other hadoop ecosystem projects (mainly Java) offer a /stacks web page which 
> is equivalent to pstacking the process. We should offer the same in Kudu - it 
> can be useful for remotely understanding what's going on on a server which is 
> performing strangely, when root access to the machine may not be available 
> and tools like 'pstack' may not be installed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-2305) Local variables can overflow when serializing a 2GB message

2018-02-22 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke resolved KUDU-2305.
---
   Resolution: Fixed
Fix Version/s: 1.7.0

Resolved via commit 
[2b0c1c0|https://github.com/apache/kudu/commit/2b0c1c019921e485f06c4be280fedba3d5279672].

> Local variables can overflow when serializing a 2GB message
> ---
>
> Key: KUDU-2305
> URL: https://issues.apache.org/jira/browse/KUDU-2305
> Project: Kudu
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 1.6.0
>Reporter: Joe McDonnell
>Assignee: Joe McDonnell
>Priority: Major
> Fix For: 1.7.0
>
>
> When rpc_max_message_size is set to its maximum of INT_MAX (2147483647), 
> certain local variables in SerializeMessage can overflow as messages approach 
> this size. Specifically, recorded_size, size_with_delim, and total_size are 4 
> byte signed integers and could overflow when additional_size becomes large.
> Since INT_MAX is the largest allowable value for rpc_max_message_size (a 4 
> byte signed integer), these variables will not overflow if changed to 4 byte 
> unsigned integers. This would eliminate the potential problem for 
> serialization.
> A similar problem exists in InboundTransfer::ReceiveBuffer() and similar 
> codepaths. Changing those variables to unsigned integers should resolve the 
> issue.
> This does not impact existing systems, because the default value of 
> rpc_max_message_size is 50MB.
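
A standalone illustration of the arithmetic (not the actual SerializeMessage() code; the variable names only mirror the ones mentioned above): a size just under INT_MAX plus a small header/delimiter overhead no longer fits a 4-byte signed integer, but still fits a 4-byte unsigned one.

{code}
#include <climits>
#include <cstdint>
#include <cstdio>

int main() {
  // A payload near the INT_MAX cap of rpc_max_message_size, plus overhead.
  const int64_t additional_size = static_cast<int64_t>(INT_MAX) - 10;
  const int64_t overhead = 100;  // header + delimiter bytes, for illustration
  const int64_t total_size = additional_size + overhead;

  std::printf("total_size = %lld\n", static_cast<long long>(total_size));
  std::printf("fits int32_t:  %s\n", total_size <= INT32_MAX ? "yes" : "no");
  std::printf("fits uint32_t: %s\n", total_size <= UINT32_MAX ? "yes" : "no");
  return 0;
}
{code}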



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2236) org.apache.kudu.client.TestKuduClient flaky

2018-02-22 Thread Andrew Wong (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wong updated KUDU-2236:
--
Target Version/s: 1.8.0

> org.apache.kudu.client.TestKuduClient flaky
> ---
>
> Key: KUDU-2236
> URL: https://issues.apache.org/jira/browse/KUDU-2236
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.6.0
>Reporter: Edward Fancher
>Assignee: Andrew Wong
>Priority: Major
>
> Last seen in org.apache.kudu.client.TestKuduClient.testCloseShortlyAfterOpen
> DEBUG - Could not login via JAAS. Using no credentials: Unable to obtain 
> Principal Name for authentication 
> DEBUG - SASL mechanism PLAIN chosen for peer 127.63.177.1
> DEBUG - SASL mechanism PLAIN chosen for peer 127.63.177.1
> DEBUG - SASL mechanism PLAIN chosen for peer 127.63.177.1
> DEBUG - Learned about tablet Kudu Master for table 'Kudu Master' with 
> partition [, )
> DEBUG - Releasing all remaining resources
> DEBUG - [peer master-127.63.177.1:64030] cleaning up while in state READY due 
> to: connection disconnected
> INFO - W1206 07:14:39.727399 16334 connection.cc:511] server connection from 
> 127.63.177.1:43497 recv error: Network error: recv error: Connection reset by 
> peer (error 104)
> DEBUG - [peer master-127.63.177.1:64034] cleaning up while in state 
> NEGOTIATING due to: connection disconnected
> WARN - Error receiving response from 127.63.177.1:64030
> org.apache.kudu.client.RecoverableException: connection disconnected
>  at org.apache.kudu.client.Connection.channelDisconnected(Connection.java:244)
>  at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:102)
>  at org.apache.kudu.client.Connection.handleUpstream(Connection.java:236)
>  at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>  at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>  at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.channelDisconnected(SimpleChannelUpstreamHandler.java:208)
>  at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:102)
>  at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>  at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>  at 
> org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:60)
>  at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>  at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>  at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.cleanup(FrameDecoder.java:493)
>  at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.channelDisconnected(FrameDecoder.java:365)
>  at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:102)
>  at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>  at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
>  at 
> org.jboss.netty.channel.Channels.fireChannelDisconnected(Channels.java:396)
>  at org.jboss.netty.channel.Channels$4.run(Channels.java:386)
>  at 
> org.jboss.netty.channel.socket.ChannelRunnableWrapper.run(ChannelRunnableWrapper.java:40)
>  at 
> org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:391)
>  at 
> org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:315)
>  at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
>  at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>  at 
> org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>  at 
> org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> INFO - W1206 07:14:39.741600 17527 negotiation.cc:311] Failed RPC 
> negotiation. Trace:
> INFO - 1206 07:14:39.695892 (+ 0us) reactor.cc:499] Submitting 
> negotiation task for server connection from 127.63.177.1:48551
> INFO - 1206 07:14:39.722215 (+ 26323us) server_negotiation.cc:173] Beginning 
> negotiation
> INFO - 1206 07:14:39.722236 (+21us) server_negotiation.cc:361] Waiting 
> for connection header
> DEBUG - [peer master-127.63.177.1:64032] 

[jira] [Updated] (KUDU-2244) spinlock contention in raft_consensus

2018-02-22 Thread Andrew Wong (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wong updated KUDU-2244:
--
Target Version/s: 1.8.0

> spinlock contention in raft_consensus
> -
>
> Key: KUDU-2244
> URL: https://issues.apache.org/jira/browse/KUDU-2244
> Project: Kudu
>  Issue Type: Improvement
>  Components: consensus
>Reporter: Andrew Wong
>Priority: Major
>
> I was going through the logs of a cluster that was seeing a bunch of 
> kernel_stack_watchdog traces, and the slowness seemed to be caused by a lot 
> of activity in consensus requests. E.g.
> W1214 18:57:29.514219 36138 kernel_stack_watchdog.cc:145] Thread 36317 stuck 
> at 
> /data/jenkins/workspace/generic-package-centos64-7-0-impala/topdir/BUILD/kudu-1.3.0-cdh5.11.0/src/kudu/rpc/outbound_call.cc:192
>  for 123ms:
> Kernel stack:
> [] sys_sched_yield+0x65/0xd0
> [] system_call_fastpath+0x16/0x1b
> [] 0x
> User stack:
> @ 0x7f72fab92057  __GI___sched_yield
> @  0x19498bf  kudu::Thread::StartThread()
> @  0x1952e7d  kudu::ThreadPool::CreateThreadUnlocked()
> @  0x19534d3  kudu::ThreadPool::Submit()
> @  0x1953a27  kudu::ThreadPool::SubmitFunc()
> @  0x1953ecb  kudu::ThreadPool::SubmitClosure()
> @   0x9c94ec  kudu::consensus::RaftConsensus::ElectionCallback()
> @   0x9e6032  kudu::consensus::LeaderElection::CheckForDecision()
> @   0x9e78c3  
> kudu::consensus::LeaderElection::VoteResponseRpcCallback()
> @   0xa8b137  kudu::rpc::OutboundCall::CallCallback()
> @   0xa8c2bc  kudu::rpc::OutboundCall::SetResponse()
> @   0xa822c0  kudu::rpc::Connection::HandleCallResponse()
> @   0xa83ffc  ev::base<>::method_thunk<>()
> @  0x198a07f  ev_invoke_pending
> @  0x198af71  ev_run
> @   0xa5e049  kudu::rpc::ReactorThread::RunThread()
> So it seemed to be caused by some slowness in getting threads. Upon perusing 
> the logs a bit more, there were a sizable number of spinlock profiling traces:
> W1214 18:54:27.897955 36379 rpcz_store.cc:238] Trace:
> 1214 18:54:26.766922 (+ 0us) service_pool.cc:143] Inserting onto call 
> queue
> 1214 18:54:26.771135 (+  4213us) service_pool.cc:202] Handling call
> 1214 18:54:26.771138 (+ 3us) raft_consensus.cc:1126] Updating replica for 
> 0 ops
> 1214 18:54:27.897699 (+1126561us) raft_consensus.cc:1165] Early marking 
> committed up to index 0
> 1214 18:54:27.897700 (+ 1us) raft_consensus.cc:1170] Triggering prepare 
> for 0 ops
> 1214 18:54:27.897701 (+ 1us) raft_consensus.cc:1282] Marking committed up 
> to 1766
> 1214 18:54:27.897702 (+ 1us) raft_consensus.cc:1332] Filling consensus 
> response to leader.
> 1214 18:54:27.897736 (+34us) spinlock_profiling.cc:255] Waited 991 ms on 
> lock 0x120b3540. stack: 019406c5 009c60d7 009c75f7 
> 007dc628 00a7adfc 00a7b9cd 0194d059 
> 7f72fbcc2dc4 7f72fabad1cc 
> 1214 18:54:27.897737 (+ 1us) raft_consensus.cc:1327] UpdateReplicas() 
> finished
> 1214 18:54:27.897741 (+ 4us) inbound_call.cc:130] Queueing success 
> response
> Metrics: {"spinlock_wait_cycles":2478395136}
> Each of the traces noted on the order of 500-1000ms of waiting on spinlocks. 
> Upon looking at raft_consensus.cc, it seems we're holding a spinlock 
> (update_lock_) while we call RaftConsensus::UpdateReplica(), which according 
> to its header, "won't return until all operations have been stored in the log 
> and all Prepares() have been completed". While locking may be necessary, it's 
> worth considering using a different kind of lock here.
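
A standalone illustration of the trade-off (not the RaftConsensus code): waiters on a spinlock burn CPU for the full duration of a long critical section, whereas a blocking mutex lets them be descheduled until the holder is done.

{code}
#include <chrono>
#include <mutex>
#include <thread>

std::mutex update_lock;  // blocking lock: waiting threads sleep instead of spinning

// Stand-in for UpdateReplica(): a long critical section (log writes, Prepare()s).
void UpdateReplica() {
  std::lock_guard<std::mutex> l(update_lock);
  std::this_thread::sleep_for(std::chrono::milliseconds(500));
}

int main() {
  // Two contending "consensus requests": with a spinlock the second thread
  // would spin for ~500ms; with a mutex it simply blocks.
  std::thread t1(UpdateReplica);
  std::thread t2(UpdateReplica);
  t1.join();
  t2.join();
  return 0;
}
{code}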



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2286) Redistribute tablet data when removing directories

2018-02-22 Thread Andrew Wong (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wong updated KUDU-2286:
--
Target Version/s: 1.8.0

> Redistribute tablet data when removing directories
> --
>
> Key: KUDU-2286
> URL: https://issues.apache.org/jira/browse/KUDU-2286
> Project: Kudu
>  Issue Type: Improvement
>  Components: fs, tablet
>Reporter: Andrew Wong
>Priority: Major
>
> Today, the `update_dirs` tool will allow the removal of a directory, even if 
> tablet data exists on it. For tablet servers, this means upon the next 
> startup, those tablets will be re-replicated to other servers. For masters, 
> this means the node will crash.
> While a well-provisioned Kudu cluster should be able to handle both of these 
> cases, it would be nice to redistribute the tablet's data locally to avoid 
> either outcome. This entails moving the data blocks from the removed 
> directory, and rewriting the tablet metadata (which keeps a record of the 
> data directories across which the tablet's data is stored) to exclude the 
> removed directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2146) Tool to determine the leader master

2018-02-22 Thread Andrew Wong (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wong updated KUDU-2146:
--
Target Version/s: 1.8.0

> Tool to determine the leader master
> ---
>
> Key: KUDU-2146
> URL: https://issues.apache.org/jira/browse/KUDU-2146
> Project: Kudu
>  Issue Type: Improvement
>  Components: master
>Reporter: Andrew Wong
>Priority: Major
>
> Going through some docs regarding multi-master migration, it seems that some 
> procedures are warned against in order to prevent data loss.
> As an example, adding masters to an existing multi-master deployment may mess 
> up the deployment and lose ops if the new masters are added using a stale 
> follower as their "reference" master (i.e. the existing master from which data 
> is copied to the new masters). As such, the docs warn against doing this 
> migration at all, when it _should_ be safe to add the new masters using the 
> most up-to-date master as the reference master.
> It would thus be helpful to be able to determine which master is the leader, 
> or at least which one has the highest op-id (finding the leader may be harder 
> if the masters are down).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2179) Have ksck not use a single snapshot for all tablets

2018-02-22 Thread Andrew Wong (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wong updated KUDU-2179:
--
Target Version/s: 1.8.0

> Have ksck not use a single snapshot for all tablets
> ---
>
> Key: KUDU-2179
> URL: https://issues.apache.org/jira/browse/KUDU-2179
> Project: Kudu
>  Issue Type: Improvement
>  Components: ksck
>Reporter: Andrew Wong
>Priority: Major
>
> When ksck runs, it selects a single timestamp and does a snapshot scan at 
> this time across all tablets. If the scans run for a long time (e.g. due to 
> heavy traffic to the tservers), some scans may be attempted on data that has 
> already been GC'ed, surfacing the errors:
> {{Error: Invalid argument: Snapshot timestamp is earlier than the ancient 
> history mark. Consider increasing the value of the configuration parameter 
> --tablet_history_max_age_sec. Snapshot timestamp: P: 1507232752670708 usec, 
> L: 0 Ancient History Mark: P: 1507232752970869 usec, L: 0 Physical time 
> difference: -0.300s}}
> This could be remediated by batching these scans and selecting a new 
> timestamp for each batch.
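
A hypothetical sketch of the batching idea (not the real ksck code; the helper functions are stubs standing in for the timestamp source and the per-tablet checksum scan):

{code}
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Stubs for illustration only.
static uint64_t counter = 0;
uint64_t GetCurrentSnapshotTimestamp() { return ++counter; }
void ChecksumTabletAtSnapshot(const std::string& tablet_id, uint64_t ts) {
  std::printf("checksum %s @ snapshot %llu\n", tablet_id.c_str(),
              static_cast<unsigned long long>(ts));
}

// Instead of one snapshot timestamp for every tablet, take a fresh timestamp
// per batch so a long run doesn't fall behind the ancient history mark.
void ChecksumInBatches(const std::vector<std::string>& tablet_ids,
                       size_t batch_size) {
  for (size_t start = 0; start < tablet_ids.size(); start += batch_size) {
    const uint64_t snapshot_ts = GetCurrentSnapshotTimestamp();
    const size_t end = std::min(start + batch_size, tablet_ids.size());
    for (size_t i = start; i < end; ++i) {
      ChecksumTabletAtSnapshot(tablet_ids[i], snapshot_ts);
    }
  }
}

int main() {
  ChecksumInBatches({"t1", "t2", "t3", "t4", "t5"}, 2);
  return 0;
}
{code}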



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2111) Add directory-level information to the filesystem report

2018-02-22 Thread Andrew Wong (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wong updated KUDU-2111:
--
Target Version/s: 1.8.0

> Add directory-level information to the filesystem report
> 
>
> Key: KUDU-2111
> URL: https://issues.apache.org/jira/browse/KUDU-2111
> Project: Kudu
>  Issue Type: Bug
>  Components: fs
>Reporter: Andrew Wong
>Assignee: Andrew Wong
>Priority: Major
>
> The FsReport is currently used to report fine-grained details about the 
> filesystem, e.g. the state of on-disk containers, and is primarily used 
> by the log block manager.
> It would be nice to report on coarser-grained details of the entire 
> filesystem, like the current state of each directory. Once we begin striping 
> metadata and WALs, this can also be used to report things like the number of 
> metadata files or WALs in each directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-1521) Flakiness in TestAsyncKuduSession

2018-02-22 Thread Andrew Wong (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wong updated KUDU-1521:
--
Target Version/s: 1.8.0

> Flakiness in TestAsyncKuduSession
> -
>
> Key: KUDU-1521
> URL: https://issues.apache.org/jira/browse/KUDU-1521
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.9.1
>Reporter: Adar Dembo
>Assignee: Andrew Wong
>Priority: Major
> Attachments: 
> org.apache.kudu.client.TestAsyncKuduSession-TableIsDeleted-output.txt, 
> org.apache.kudu.client.TestAsyncKuduSession-output.txt
>
>
> I've been trying to parse the various failures in 
> http://104.196.14.100/job/kudu-gerrit/2270/BUILD_TYPE=RELEASE. Here's what I 
> see in the test:
> The way test() tests AUTO_FLUSH_BACKGROUND is inherently flaky; a delay while 
> running test code will give the background flush task a chance to fire when 
> the test code doesn't expect it. I've seen this cause lead to no 
> PleaseThrottleException, but I suspect the first block of test code dealing 
> with background flushes is flaky too (since it's testing elapsed time).
> There's also some test failures that I can't figure out. I've pasted them 
> below for posterity:
> {noformat}
> 03:52:14 
> testGetTableLocationsErrorCauseSessionStuck(org.kududb.client.TestAsyncKuduSession)
>   Time elapsed: 100.009 sec  <<< ERROR!
> 03:52:14 java.lang.Exception: test timed out after 10 milliseconds
> 03:52:14  at java.lang.Object.wait(Native Method)
> 03:52:14  at java.lang.Object.wait(Object.java:503)
> 03:52:14  at com.stumbleupon.async.Deferred.doJoin(Deferred.java:1136)
> 03:52:14  at com.stumbleupon.async.Deferred.join(Deferred.java:1019)
> 03:52:14  at 
> org.kududb.client.TestAsyncKuduSession.testGetTableLocationsErrorCauseSessionStuck(TestAsyncKuduSession.java:133)
> 03:52:14 
> 03:52:14 
> testBatchErrorCauseSessionStuck(org.kududb.client.TestAsyncKuduSession)  Time 
> elapsed: 0.199 sec  <<< ERROR!
> 03:52:14 org.kududb.client.MasterErrorException: Server[Kudu Master - 
> 127.13.215.1:64030] NOT_FOUND[code 1]: The table was deleted: Table deleted 
> at 2016-07-09 03:50:24 UTC
> 03:52:14  at 
> org.kududb.client.TabletClient.dispatchMasterErrorOrReturnException(TabletClient.java:533)
> 03:52:14  at org.kududb.client.TabletClient.decode(TabletClient.java:463)
> 03:52:14  at org.kududb.client.TabletClient.decode(TabletClient.java:83)
> 03:52:14  at 
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:500)
> 03:52:14  at 
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
> 03:52:14  at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
> 03:52:14  at 
> org.kududb.client.TabletClient.handleUpstream(TabletClient.java:638)
> 03:52:14  at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
> 03:52:14  at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
> 03:52:14  at 
> org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184)
> 03:52:14  at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
> 03:52:14  at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
> 03:52:14  at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
> 03:52:14  at 
> org.kududb.client.AsyncKuduClient$TabletClientPipeline.sendUpstream(AsyncKuduClient.java:1877)
> 03:52:14  at 
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
> 03:52:14  at 
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
> 03:52:14  at 
> org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
> 03:52:14  at 
> org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
> 03:52:14  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 03:52:14  at 
> 

[jira] [Updated] (KUDU-1466) C++ client errors misreported as GetTableLocations timeouts

2018-02-22 Thread Alexey Serbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-1466:

Target Version/s: 1.8.0  (was: 1.7.0)

> C++ client errors misreported as GetTableLocations timeouts
> ---
>
> Key: KUDU-1466
> URL: https://issues.apache.org/jira/browse/KUDU-1466
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.8.0
>Reporter: Todd Lipcon
>Assignee: Alexey Serbin
>Priority: Critical
>
> client-test is currently very flaky due to this issue:
> - we are injecting some kind of failure on the tablet server (eg DNS 
> resolution failure)
> - when we fail to connect to the TS, we correctly re-trigger a lookup against 
> the master
> - depending how the backoffs and retries line up, we sometimes end up 
> triggering the lookup retry when the remaining operation budget is very short 
> (eg <10ms)
> -- this GetTabletLocations RPC times out since the master is unable to 
> respond within the ridiculously short timeout
> During the course of retrying some operation, we should probably not replace 
> the 'last_error' with a master error, so long as we have had at least one 
> successful master lookup (thus indicating that the master is not the problem).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-801) Delta flush doesn't wait for transactions to commit

2018-02-22 Thread Hao Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hao Hao updated KUDU-801:
-
Target Version/s: 1.8.0  (was: 1.7.0)

> Delta flush doesn't wait for transactions to commit
> ---
>
> Key: KUDU-801
> URL: https://issues.apache.org/jira/browse/KUDU-801
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: Private Beta
>Reporter: Todd Lipcon
>Assignee: Hao Hao
>Priority: Critical
>
> I saw a case of mt-tablet-test failing with what I think is the following 
> scenario:
> - transaction applies an update to DMS
> - delta flush happens
> - major delta compaction runs (the update is now part of base data and we 
> have an UNDO)
> - the RS is selected for compaction
> - CHECK failure because the UNDO delta contains something that is not yet 
> committed.
> We probably need to ensure that we don't Flush data which isn't yet committed 
> from an MVCC standpoint.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2117) Spread metadata across multiple data directories

2018-02-22 Thread Andrew Wong (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wong updated KUDU-2117:
--
Issue Type: Improvement  (was: Bug)

> Spread metadata across multiple data directories
> 
>
> Key: KUDU-2117
> URL: https://issues.apache.org/jira/browse/KUDU-2117
> Project: Kudu
>  Issue Type: Improvement
>  Components: fs, tablet, tserver
>Affects Versions: 1.6.0
>Reporter: Andrew Wong
>Assignee: Andrew Wong
>Priority: Major
>
> Tablet metadata and consensus metadata are placed in the first configured 
> data directory. This is an issue, as every write to these metadata files 
> incurs an fsync, which stresses the first disk considerably more than the 
> others.
> One way around this is to spread metadata across multiple data directories. A 
> natural choice would be to place them in a directory within the tablet's disk 
> group. In this way, the data/metadata can be completely localized to the disk 
> group, which has the added benefit of making the risk of disk failures easier 
> to assess per tablet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2259) kudu-spark imports authentication token into client multiple times

2018-02-22 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2259:
--
Target Version/s: 1.7.0

> kudu-spark imports authentication token into client multiple times
> --
>
> Key: KUDU-2259
> URL: https://issues.apache.org/jira/browse/KUDU-2259
> Project: Kudu
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 1.6.0
>Reporter: Will Berkeley
>Priority: Major
>
> kudu-spark should have one KuduContext per task, which is sent serialized 
> from the driver with an authentication token. The KuduContext either 
> retrieves a Kudu client from a JVM-scoped cache, or creates one and puts it 
> in the cache, and finally imports its authentication token into the client.
> Under default configuration in an un-Kerberized cluster, the client uses the 
> authentication token to connect to the cluster. However, if 
> -rpc_encryption=disabled, then the client will not use the authentication 
> token. This causes the master to issue an authentication token to the client, 
> and the new token replaces the old token in the client.
> While there's one KuduContext per task, multiple tasks may run on the same 
> executor. If this occurs, each KuduContext tries to import its authentication 
> token into the client. If the client has already received a token from the 
> master because encryption is disabled, then it's possible that the 
> KuduContext's token and the master-issued token are for different users, 
> since the KuduContext's token was issued on the driver to the driver's Unix 
> user and the master-issued token is issued to the executor's user.
> An example of the exception that occurred when running spark2-shell as root:
> {noformat}
> 18/01/11 12:14:01 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 
> (TID 1, kudu-tserver-01, executor 1): java.lang.IllegalArgumentException: 
> cannot import authentication data from a different user: old='yarn', 
> new='root'
>   at 
> org.apache.kudu.client.SecurityContext.checkUserMatches(SecurityContext.java:128)
>   at 
> org.apache.kudu.client.SecurityContext.importAuthenticationCredentials(SecurityContext.java:138)
>   at 
> org.apache.kudu.client.AsyncKuduClient.importAuthenticationCredentials(AsyncKuduClient.java:677)
>   at 
> org.apache.kudu.spark.kudu.KuduContext.asyncClient$lzycompute(KuduContext.scala:103)
>   at 
> org.apache.kudu.spark.kudu.KuduContext.asyncClient(KuduContext.scala:100)
>   at 
> org.apache.kudu.spark.kudu.KuduContext.syncClient$lzycompute(KuduContext.scala:98)
>   at 
> org.apache.kudu.spark.kudu.KuduContext.syncClient(KuduContext.scala:98)
>   at org.apache.kudu.spark.kudu.KuduRDD.compute(KuduRDD.scala:71)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2117) Spread metadata across multiple data directories

2018-02-22 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2117:
--
Fix Version/s: (was: 1.7.0)

> Spread metadata across multiple data directories
> 
>
> Key: KUDU-2117
> URL: https://issues.apache.org/jira/browse/KUDU-2117
> Project: Kudu
>  Issue Type: Bug
>  Components: fs, tablet, tserver
>Affects Versions: 1.6.0
>Reporter: Andrew Wong
>Assignee: Andrew Wong
>Priority: Major
>
> Tablet metadata and consensus metadata are placed in the first configured 
> data directory. This is an issue, as every write to these metadata files 
> incurs an fsync, which stresses the first disk considerably more than the 
> others.
> One way around this is to spread metadata across multiple data directories. A 
> natural choice would be to place them in a directory within the tablet's disk 
> group. In this way, the data/metadata can be completely localized to the disk 
> group, which has the added benefit of making the risk of disk failures easier 
> to assess per tablet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2259) kudu-spark imports authentication token into client multiple times

2018-02-22 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2259:
--
Fix Version/s: (was: 1.7.0)

> kudu-spark imports authentication token into client multiple times
> --
>
> Key: KUDU-2259
> URL: https://issues.apache.org/jira/browse/KUDU-2259
> Project: Kudu
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 1.6.0
>Reporter: Will Berkeley
>Priority: Major
>
> kudu-spark should have one KuduContext per task, which is sent serialized 
> from the driver with an authentication token. The KuduContext either 
> retrieves a Kudu client from a JVM-scoped cache, or creates one and puts it 
> in the cache, and finally imports its authentication token into the client.
> Under default configuration in an un-Kerberized cluster, the client uses the 
> authentication token to connect to the cluster. However, if 
> -rpc_encryption=disabled, then the client will not use the authentication 
> token. This causes the master to issue an authentication token to the client, 
> and the new token replaces the old token in the client.
> While there's one KuduContext per task, multiple tasks may run on the same 
> executor. If this occurs, each KuduContext tries to import its authentication 
> token into the client. If the client has already received a token from the 
> master because encryption is disabled, then it's possible that the 
> KuduContext's token and the master-issued token are for different users, 
> since the KuduContext's token was issued on the driver to the driver's Unix 
> user and the master-issued token is issued to the executor's user.
> An example of the exception that occurred when running spark2-shell as root:
> {noformat}
> 18/01/11 12:14:01 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 
> (TID 1, kudu-tserver-01, executor 1): java.lang.IllegalArgumentException: 
> cannot import authentication data from a different user: old='yarn', 
> new='root'
>   at 
> org.apache.kudu.client.SecurityContext.checkUserMatches(SecurityContext.java:128)
>   at 
> org.apache.kudu.client.SecurityContext.importAuthenticationCredentials(SecurityContext.java:138)
>   at 
> org.apache.kudu.client.AsyncKuduClient.importAuthenticationCredentials(AsyncKuduClient.java:677)
>   at 
> org.apache.kudu.spark.kudu.KuduContext.asyncClient$lzycompute(KuduContext.scala:103)
>   at 
> org.apache.kudu.spark.kudu.KuduContext.asyncClient(KuduContext.scala:100)
>   at 
> org.apache.kudu.spark.kudu.KuduContext.syncClient$lzycompute(KuduContext.scala:98)
>   at 
> org.apache.kudu.spark.kudu.KuduContext.syncClient(KuduContext.scala:98)
>   at org.apache.kudu.spark.kudu.KuduRDD.compute(KuduRDD.scala:71)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2240) Expose partitioning information in a straightforward way in the Java API

2018-02-22 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2240:
--
Target Version/s: 1.8.0  (was: 1.7.0)

> Expose partitioning information in a straightforward way in the Java API
> 
>
> Key: KUDU-2240
> URL: https://issues.apache.org/jira/browse/KUDU-2240
> Project: Kudu
>  Issue Type: Improvement
>  Components: api, client, java
>Reporter: Attila Bukor
>Assignee: Attila Bukor
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2075) Crash when using tracing in SetupThreadLocalBuffer

2018-02-22 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2075:
--
Target Version/s: 1.8.0  (was: 1.7.0)

> Crash when using tracing in SetupThreadLocalBuffer
> --
>
> Key: KUDU-2075
> URL: https://issues.apache.org/jira/browse/KUDU-2075
> Project: Kudu
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.4.0
>Reporter: Jean-Daniel Cryans
>Assignee: Todd Lipcon
>Priority: Critical
>
> Got this crash while tracing:
> {noformat}
> F0721 13:14:12.038748  2708 map-util.h:414] Check failed: 
> InsertIfNotPresent(collection, key, data) duplicate key: 139914842822400
> {noformat}
> Backtrace:
> {noformat}
> #0  0x00348aa32625 in raise () from /lib64/libc.so.6
> #1  0x00348aa33e05 in abort () from /lib64/libc.so.6
> #2  0x01b53d29 in kudu::AbortFailureFunction () at 
> /usr/src/debug/kudu-1.4.0-cdh5.12.0/src/kudu/util/minidump.cc:186
> #3  0x008b9e1d in google::LogMessage::Fail () at 
> /usr/src/debug/kudu-1.4.0-cdh5.12.0/thirdparty/src/glog-0.3.5/src/logging.cc:1488
> #4  0x008bbcdd in google::LogMessage::SendToLog (this=Unhandled dwarf 
> expression opcode 0xf3
> ) at 
> /usr/src/debug/kudu-1.4.0-cdh5.12.0/thirdparty/src/glog-0.3.5/src/logging.cc:1442
> #5  0x008b9959 in google::LogMessage::Flush (this=0x7f40768144f0) at 
> /usr/src/debug/kudu-1.4.0-cdh5.12.0/thirdparty/src/glog-0.3.5/src/logging.cc:1311
> #6  0x008bc77f in google::LogMessageFatal::~LogMessageFatal 
> (this=0x7f40768144f0, __in_chrg=<optimized out>)
> at 
> /usr/src/debug/kudu-1.4.0-cdh5.12.0/thirdparty/src/glog-0.3.5/src/logging.cc:2023
> #7  0x01b0915f in InsertOrDie kudu::debug::TraceLog::PerThreadInfo*> > (collection=0x36265a8, key=Unhandled 
> dwarf expression opcode 0xf3
> )
> at /usr/src/debug/kudu-1.4.0-cdh5.12.0/src/kudu/gutil/map-util.h:414
> #8  0x01b00c18 in kudu::debug::TraceLog::SetupThreadLocalBuffer 
> (this=0x3626300) at 
> /usr/src/debug/kudu-1.4.0-cdh5.12.0/src/kudu/util/debug/trace_event_impl.cc:1715
> #9  0x01b052d8 in 
> kudu::debug::TraceLog::AddTraceEventWithThreadIdAndTimestamp (this=0x3626300, 
> phase=Unhandled dwarf expression opcode 0xf3
> )
> at 
> /usr/src/debug/kudu-1.4.0-cdh5.12.0/src/kudu/util/debug/trace_event_impl.cc:1773
> #10 0x00ab616d in AddTraceEventWithThreadIdAndTimestamp long> (this=0x59dd5e40, entry_batches=std::vector of length 1, capacity 1 = 
> {...})
> at 
> /usr/src/debug/kudu-1.4.0-cdh5.12.0/src/kudu/util/debug/trace_event.h:1315
> #11 AddTraceEvent (this=0x59dd5e40, entry_batches=std::vector 
> of length 1, capacity 1 = {...})
> at 
> /usr/src/debug/kudu-1.4.0-cdh5.12.0/src/kudu/util/debug/trace_event.h:1331
> #12 kudu::log::Log::AppendThread::HandleGroup (this=0x59dd5e40, 
> entry_batches=std::vector of length 1, capacity 1 = {...})
> at /usr/src/debug/kudu-1.4.0-cdh5.12.0/src/kudu/consensus/log.cc:335
> #13 0x00ab6707 in kudu::log::Log::AppendThread::DoWork 
> (this=0x59dd5e40) at 
> /usr/src/debug/kudu-1.4.0-cdh5.12.0/src/kudu/consensus/log.cc:326
> #14 0x01b8d7d6 in operator() (this=0x21c2a180, permanent=false)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2050) Avoid peer eviction during block manager startup

2018-02-22 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2050:
--
Target Version/s: 1.8.0

> Avoid peer eviction during block manager startup
> 
>
> Key: KUDU-2050
> URL: https://issues.apache.org/jira/browse/KUDU-2050
> Project: Kudu
>  Issue Type: Bug
>  Components: fs, tserver
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>Priority: Critical
>
> In larger deployments we've observed that opening the block manager can take 
> a really long time, like tens of minutes or sometimes even hours. This is 
> especially true as of 1.4 where the log block manager tries to optimize 
> on-disk data structures during startup.
> The default time to Raft peer eviction is 5 minutes. If one node is restarted 
> and LBM startup takes over 5 minutes, or if all nodes are restarted and 
> there's over 5 minutes of LBM startup time variance across them, the "slow" 
> node could have all of its replicas evicted. Besides generating a lot of 
> unnecessary work in rereplication, this effectively "defeats" the LBM 
> optimizations in that it would have been equally slow (but more efficient) to 
> reformat the node instead.
> So, let's reorder startup such that LBM startup counts towards replica 
> bootstrapping. One idea: adjust FsManager startup so that tablet-meta/cmeta 
> files can be accessed early to construct bootstrapping replicas, but to defer 
> opening of the block manager until after that time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-1921) Add ability for clients to require authentication/encryption

2018-02-22 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-1921:
--
Target Version/s: 1.8.0  (was: 1.7.0)

> Add ability for clients to require authentication/encryption
> 
>
> Key: KUDU-1921
> URL: https://issues.apache.org/jira/browse/KUDU-1921
> Project: Kudu
>  Issue Type: Improvement
>  Components: client, security
>Affects Versions: 1.3.0
>Reporter: Todd Lipcon
>Priority: Critical
>
> Currently, the clients always operate in "optional" mode for authentication 
> and encryption. This means that they are vulnerable to downgrade attacks by a 
> MITM. We should provide APIs so that clients can be configured to prohibit 
> downgrade when connecting to clusters they know to be secure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-1843) Client UUIDs should be cryptographically random

2018-02-22 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-1843:
--
Target Version/s: 1.8.0  (was: 1.7.0)

> Client UUIDs should be cryptographically random
> ---
>
> Key: KUDU-1843
> URL: https://issues.apache.org/jira/browse/KUDU-1843
> Project: Kudu
>  Issue Type: Improvement
>  Components: security
>Affects Versions: 1.3.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
>
> Currently we use boost::uuid's default random generator, which is not 
> cryptographically random. This may increase the ease with which an attacker 
> could guess another client's client ID, which would potentially allow them to 
> perform DoS or try to steal the results of RPCs from the result cache.
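
A standalone sketch of one way to derive a version-4 UUID from a cryptographically secure source (the kernel CSPRNG via /dev/urandom) instead of a non-cryptographic PRNG; this is an illustration, not the proposed patch:

{code}
#include <cstdio>
#include <fstream>

int main() {
  unsigned char b[16];
  std::ifstream urandom("/dev/urandom", std::ios::binary);
  if (!urandom.read(reinterpret_cast<char*>(b), sizeof(b))) {
    std::fprintf(stderr, "could not read /dev/urandom\n");
    return 1;
  }
  b[6] = (b[6] & 0x0F) | 0x40;  // set version 4
  b[8] = (b[8] & 0x3F) | 0x80;  // set RFC 4122 variant
  std::printf("%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-"
              "%02x%02x%02x%02x%02x%02x\n",
              b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7],
              b[8], b[9], b[10], b[11], b[12], b[13], b[14], b[15]);
  return 0;
}
{code}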



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-1736) kudu crash in debug build: unordered undo delta

2018-02-22 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-1736:
--
Target Version/s: 1.8.0  (was: 1.7.0)

> kudu crash in debug build: unordered undo delta
> ---
>
> Key: KUDU-1736
> URL: https://issues.apache.org/jira/browse/KUDU-1736
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Reporter: zhangsong
>Priority: Critical
> Attachments: mt-tablet-test-20171123.txt.xz, mt-tablet-test.txt, 
> mt-tablet-test.txt.gz
>
>
> In the jd cluster we hit a kudu-tserver crash with a fatal message as 
> follows:
> Check failed: last_key_.CompareTo(key) <= 0 must insert undo deltas in 
> sorted order (ascending key, then descending ts): got key (row 
> 1422@tx6052042821982183424) after (row 1422@tx6052042821953155072)
> This is a DCHECK which should not fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-1592) Documentation that mentions file block manager should sound more ominous

2018-02-22 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-1592:
--
Target Version/s: 1.8.0  (was: 1.7.0)

> Documentation that mentions file block manager should sound more ominous
> 
>
> Key: KUDU-1592
> URL: https://issues.apache.org/jira/browse/KUDU-1592
> Project: Kudu
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Todd Lipcon
>Priority: Major
>
> In troubleshooting.adoc, as well as in the error message when we fail to hole 
> punch, we suggest using the file block manager as a workaround. It says 
> something vague about "at the cost of some scalability and efficiency" but 
> should be something a lot more ominous -- users quickly run out of 
> file descriptors if they try the FBM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-582) Send TS specific errors back to the client when the client is supposed to take specific actions, such as trying another replica

2018-02-22 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-582:
-
Target Version/s: 1.8.0  (was: 1.7.0)

> Send TS specific errors back to the client when the client is supposed to 
> take specific actions, such as trying another replica
> ---
>
> Key: KUDU-582
> URL: https://issues.apache.org/jira/browse/KUDU-582
> Project: Kudu
>  Issue Type: Bug
>  Components: client, consensus, tserver
>Affects Versions: M4.5
>Reporter: David Alves
>Priority: Critical
>
> Right now we're sending umbrella statuses that the client is supposed to 
> interpret as a command that it should fail over to another replica. This is 
> misusing statuses, but it's also a problem in that we're likely (or will 
> likely be) sending the same statuses (illegal state and abort) in places where 
> we don't mean for the client to fail over.
> This should be treated holistically in both clients and in the server 
> components.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2263) Consider removing PB descriptors from PBC header

2018-02-22 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2263:
--
Target Version/s: 1.8.0  (was: 1.7.0)

> Consider removing PB descriptors from PBC header
> 
>
> Key: KUDU-2263
> URL: https://issues.apache.org/jira/browse/KUDU-2263
> Project: Kudu
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.7.0
>Reporter: Todd Lipcon
>Priority: Major
>
> Looking at a cmeta file on disk, it seems the vast majority of the bytes are 
> in the supplemental header. We currently serialize the entire descriptor set 
> of the referenced file and its dependencies. This means that in each cmeta 
> file, we end up serializing even things like the definition of SchemaPB – 
> which is unnecessary for serializing the type at hand and quite large.
>  
> At a minimum we can prune the descriptors serialized to only include those 
> that are transitively referenced by the PB type in the file. I think we 
> should also consider doing away with this information entirely and instead 
> allow 'kudu pbc dump' to take a descriptor set as external input – it's easy 
> enough to generate a descriptor set from any kudu version source tree using 
> the protoc command line.
> One potential major improvement if we can get these files down to <4kb is 
> that we could atomically rewrite them in a single disk IO using O_DIRECT 
> rather than doing a rewrite-rename-fsync dance.
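
For reference, a descriptor set suitable for such external input can be produced with protoc's standard flags; the paths below are illustrative, not a prescribed layout:

{noformat}
protoc --proto_path=src --include_imports \
  --descriptor_set_out=consensus_metadata.desc \
  src/kudu/consensus/metadata.proto
{noformat}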



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KUDU-2291) Implement a /stacks web page

2018-02-22 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke reassigned KUDU-2291:
-

Assignee: Todd Lipcon

> Implement a /stacks web page
> 
>
> Key: KUDU-2291
> URL: https://issues.apache.org/jira/browse/KUDU-2291
> Project: Kudu
>  Issue Type: Improvement
>  Components: supportability
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Major
>
> Other hadoop ecosystem projects (mainly Java) offer a /stacks web page which 
> is equivalent to pstacking the process. We should offer the same in Kudu - it 
> can be useful for remotely understanding what's going on on a server which is 
> performing strangely, when root access to the machine may not be available 
> and tools like 'pstack' may not be installed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2282) Support coercion of Decimal values

2018-02-22 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2282:
--
Target Version/s: 1.8.0  (was: 1.7.0)

> Support coercion of Decimal values 
> ---
>
> Key: KUDU-2282
> URL: https://issues.apache.org/jira/browse/KUDU-2282
> Project: Kudu
>  Issue Type: Improvement
>Affects Versions: 1.7.0
>Reporter: Grant Henke
>Assignee: Grant Henke
>Priority: Major
>
> Currently when decimal values are used in KuduValue.cc or PartialRow.cc we 
> enforce that the scale matches the expected scale. Instead we should support 
> basic coercion where no value rounding or truncating is required.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2321) Allow Kudu to start up with different data directories

2018-02-22 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2321:
--
Target Version/s: 1.8.0  (was: 1.7.0)

> Allow Kudu to start up with different data directories
> --
>
> Key: KUDU-2321
> URL: https://issues.apache.org/jira/browse/KUDU-2321
> Project: Kudu
>  Issue Type: Improvement
>  Components: fs, supportability
>Reporter: Andrew Wong
>Priority: Major
>
> Today, Kudu will refuse to start up when its FS layout isn't as expected. 
> Before 1.6.0, users could not add data directories; before 1.7.0, users could 
> not remove data directories; today, to do either of the above, users must use 
> the `update_dirs` tool before starting up.
> Prior to this, the preferred way to start up a Kudu node with a different FS 
> layout was to rm -rf the entirety of the node's FS layout and start a new node 
> with a new UUID at the same location. While Kudu is designed to be resilient 
> to such removal of data (automatically re-replicating the removed tablets to 
> other nodes), this has led to problems: e.g., wiping multiple nodes without 
> waiting ample time in between wipes could lead to data loss.
> While the `update_dirs` tool removes the need for rm -rf altogether, that may 
> not stop unwitting users from wiping their nodes clean upon failure to start 
> up (e.g. if a user adds a data dir through some cluster management software 
> like Cloudera Manager, which doesn't run `update_dirs`, and then receives an 
> error on startup). Now that we have a solution for the safe removal and 
> addition of data directories, with known constraints on its usage, it's not 
> unthinkable that we extend Kudu itself to start up with a different FS layout 
> subject to the same constraints.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2233) Check failure during compactions: pv_delete_redo != nullptr

2018-02-22 Thread Will Berkeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Berkeley updated KUDU-2233:

Description: 
There have been a couple of reports of a check failure during compactions at 
least from 1.4, pasted below:

{noformat}
F1201 14:55:37.052140 10508 compaction.cc:756] Check failed: pv_delete_redo != 
nullptr
 * 
 ** 
 *** Check failure stack trace: ***
 Wrote minidump to 
/var/log/kudu/minidumps/kudu-tserver/215cde39-7795-0885-0b51038d-771d875e.dmp
 *** Aborted at 1512161737 (unix time) try "date -d @1512161737" if you are 
using GNU date ***
 PC: @ 0x3ec3632625 (unknown)
 *** SIGABRT (@0x3b98eec028e3) received by PID 10467 (TID 0x7f8b02c58700) 
from PID 10467; stack trace: ***
 @ 0x3ec3a0f7e0 (unknown)
 @ 0x3ec3632625 (unknown)
 @ 0x3ec3633e05 (unknown)
 @ 0x1b53f59 (unknown)
 @ 0x8b9f6d google::LogMessage::Fail()
 @ 0x8bbe2d google::LogMessage::SendToLog()
 @ 0x8b9aa9 google::LogMessage::Flush()
 @ 0x8bc8cf google::LogMessageFatal::~LogMessageFatal()
 @ 0x9db0fe kudu::tablet::FlushCompactionInput()
 @ 0x9a056a kudu::tablet::Tablet::DoMergeCompactionOrFlush()
 @ 0x9a372d kudu::tablet::Tablet::Compact()
 @ 0x9bd8d1 kudu::tablet::CompactRowSetsOp::Perform()
 @ 0x1b4145f kudu::MaintenanceManager::LaunchOp()
 @ 0x1b8da06 kudu::ThreadPool::DispatchThread()
 @ 0x1b888ea kudu::Thread::SuperviseThread()
 @ 0x3ec3a07aa1 (unknown)
 @ 0x3ec36e893d (unknown)
 @ 0x0 (unknown)
{noformat}

  was:
There have been a couple of reports of a check failure during compactions at 
least from 1.4, pasted below:

{{F1201 14:55:37.052140 10508 compaction.cc:756] Check failed: pv_delete_redo 
!= nullptr 
*** Check failure stack trace: ***
Wrote minidump to 
/var/log/kudu/minidumps/kudu-tserver/215cde39-7795-0885-0b51038d-771d875e.dmp
*** Aborted at 1512161737 (unix time) try "date -d @1512161737" if you are 
using GNU date ***
PC: @ 0x3ec3632625 (unknown)
*** SIGABRT (@0x3b98eec028e3) received by PID 10467 (TID 0x7f8b02c58700) 
from PID 10467; stack trace: ***
@ 0x3ec3a0f7e0 (unknown)
@ 0x3ec3632625 (unknown)
@ 0x3ec3633e05 (unknown)
@ 0x1b53f59 (unknown)
@ 0x8b9f6d google::LogMessage::Fail()
@ 0x8bbe2d google::LogMessage::SendToLog()
@ 0x8b9aa9 google::LogMessage::Flush()
@ 0x8bc8cf google::LogMessageFatal::~LogMessageFatal()
@ 0x9db0fe kudu::tablet::FlushCompactionInput()
@ 0x9a056a kudu::tablet::Tablet::DoMergeCompactionOrFlush()
@ 0x9a372d kudu::tablet::Tablet::Compact()
@ 0x9bd8d1 kudu::tablet::CompactRowSetsOp::Perform()
@ 0x1b4145f kudu::MaintenanceManager::LaunchOp()
@ 0x1b8da06 kudu::ThreadPool::DispatchThread()
@ 0x1b888ea kudu::Thread::SuperviseThread()
@ 0x3ec3a07aa1 (unknown)
@ 0x3ec36e893d (unknown)
@ 0x0 (unknown)}}


> Check failure during compactions: pv_delete_redo != nullptr
> ---
>
> Key: KUDU-2233
> URL: https://issues.apache.org/jira/browse/KUDU-2233
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet, tserver
>Affects Versions: 1.4.0
>Reporter: Andrew Wong
>Assignee: David Alves
>Priority: Major
>
> There have been a couple of reports of a check failure during compactions at 
> least from 1.4, pasted below:
> {noformat}
> F1201 14:55:37.052140 10508 compaction.cc:756] Check failed: pv_delete_redo 
> != nullptr
>  * 
>  ** 
>  *** Check failure stack trace: ***
>  Wrote minidump to 
> /var/log/kudu/minidumps/kudu-tserver/215cde39-7795-0885-0b51038d-771d875e.dmp
>  *** Aborted at 1512161737 (unix time) try "date -d @1512161737" if you are 
> using GNU date ***
>  PC: @ 0x3ec3632625 (unknown)
>  *** SIGABRT (@0x3b98eec028e3) received by PID 10467 (TID 0x7f8b02c58700) 
> from PID 10467; stack trace: ***
>  @ 0x3ec3a0f7e0 (unknown)
>  @ 0x3ec3632625 (unknown)
>  @ 0x3ec3633e05 (unknown)
>  @ 0x1b53f59 (unknown)
>  @ 0x8b9f6d google::LogMessage::Fail()
>  @ 0x8bbe2d google::LogMessage::SendToLog()
>  @ 0x8b9aa9 google::LogMessage::Flush()
>  @ 0x8bc8cf google::LogMessageFatal::~LogMessageFatal()
>  @ 0x9db0fe kudu::tablet::FlushCompactionInput()
>  @ 0x9a056a kudu::tablet::Tablet::DoMergeCompactionOrFlush()
>  @ 0x9a372d kudu::tablet::Tablet::Compact()
>  @ 0x9bd8d1 kudu::tablet::CompactRowSetsOp::Perform()
>  @ 0x1b4145f kudu::MaintenanceManager::LaunchOp()
>  @ 0x1b8da06 kudu::ThreadPool::DispatchThread()
>  @ 0x1b888ea kudu::Thread::SuperviseThread()
>  @ 0x3ec3a07aa1 (unknown)
>  @ 0x3ec36e893d (unknown)
>  @ 0x0 (unknown)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2322) Leader spews logs when follower falls behind log GC

2018-02-22 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373031#comment-16373031
 ] 

Jean-Daniel Cryans commented on KUDU-2322:
--

Or [~aserbin].

> Leader spews logs when follower falls behind log GC
> ---
>
> Key: KUDU-2322
> URL: https://issues.apache.org/jira/browse/KUDU-2322
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.7.0
>Reporter: Todd Lipcon
>Priority: Critical
>
> I'm running a YCSB-based write stress test and found that one of the 
> followers fell behind enough that its logs got GCed by the leader. At this 
> point, the leader started logging about 100 messages per second indicating 
> that it could not obtain a request for this peer.
> I believe this is a regression since 1.6, since before 3-4-3 replication we 
> would have evicted the replica as soon as it fell behind GC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2323) NON_VOTER replica flapping (repeatedly added and evicted)

2018-02-22 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373030#comment-16373030
 ] 

Jean-Daniel Cryans commented on KUDU-2323:
--

Or [~aserbin].

> NON_VOTER replica flapping (repeatedly added and evicted)
> -
>
> Key: KUDU-2323
> URL: https://issues.apache.org/jira/browse/KUDU-2323
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.7.0
>Reporter: Todd Lipcon
>Priority: Major
>
> In running a YCSB stress workload I see a tablet got into some state where 
> the master flapped back and forth adding and then removing a replica as a 
> NON_VOTER:
> {code}
> I0221 21:54:35.341892 28047 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:35.360297 28045 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:35.612417 28048 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:35.713057 28045 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:35.725723 28045 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:35.752959 28052 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:35.767974 28047 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:35.772202 28045 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:36.291569 28046 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:36.296468 28046 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:36.328945 28045 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:36.339675 28045 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:36.387465 28045 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:36.394716 28047 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:36.398644 28047 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:36.405082 28047 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:36.409888 28048 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:36.414216 28046 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:36.417915 28048 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:36.423548 28048 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:36.453407 28045 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:36.552772 28048 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:58:01.300199 28053 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:58:01.426921 28046 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 22:01:37.779790 28051 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2280) Altering the column default isn't "type safe"

2018-02-22 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2280:
--
Target Version/s: 1.8.0

> Altering the column default isn't "type safe"
> -
>
> Key: KUDU-2280
> URL: https://issues.apache.org/jira/browse/KUDU-2280
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: 1.4.0
>Reporter: Grant Henke
>Priority: Critical
>  Labels: usability
>
> When creating a table the schema data is used to check that a default value 
> is of the right size and type. This is possible because the column schema is 
> available and checked via {{KuduValue::Data->CheckTypeAndGetPointer}}  in 
> {{KuduColumnSpec::ToColumnSchema}}.
> When altering a table {{KuduValue::Data->GetSlice()}} is used instead because 
> we don't have the column schema information available. The Slice is then 
> added to the alter table request. 
> When this request is received server side, we can only check the size (if we 
> know the expected size) to validate the correct information was sent and cast 
> it to the correct value via {{ColumnSchema::ApplyDelta}}.
> For example, I can set a DOUBLE type default on an INT64 column, or a FLOAT 
> type default on an INT32 column. With the current size check logic ( 
> {{col_delta.default_value->size() < type_info()->size()}} ) you can 
> technically set a default of any type whose size is >= the target type's size. 
> An additional issue is that {{KuduValue::FromInt}} treats all integers as 
> int64_t values when calling {{KuduValue::Data->GetSlice()}}. This means that 
> if we made the size check stricter, we wouldn't be able to alter the default 
> value of any integer column smaller than INT64, because the data size would be 
> too large once received by the server. This "size" problem affects the ability 
> to support decimal defaults too. 
> example (where column "default" is an INT32 column): 
> {noformat}
> table_alterer->AlterColumn("default")->Default(KuduValue::FromInt(12345));{noformat}
> To solve this we could require the expected column DataType to be passed 
> along with the request so that the server can validate the expected column 
> type and size, and coerce the value to a smaller type if needed/possible. 
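> As a sketch of the server-side check that passing the client's DataType would 
> enable (the helper below is illustrative, not an existing Kudu function):
> {code}
> // Illustrative only: given a default that the client declared as INT64,
> // narrow it to an INT32 column iff the value fits, otherwise reject the alter.
> #include <cstddef>
> #include <cstdint>
> #include <cstring>
> 
> bool CoerceInt64DefaultToInt32(const void* data, std::size_t size, int32_t* out) {
>   if (size != sizeof(int64_t)) return false;  // wire size must match the declared type
>   int64_t v;
>   std::memcpy(&v, data, sizeof(v));
>   if (v < INT32_MIN || v > INT32_MAX) return false;  // value would not fit
>   *out = static_cast<int32_t>(v);
>   return true;
> }
> {code}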



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2324) Add gflags to disable individual maintenance ops

2018-02-22 Thread Mike Percy (JIRA)
Mike Percy created KUDU-2324:


 Summary: Add gflags to disable individual maintenance ops
 Key: KUDU-2324
 URL: https://issues.apache.org/jira/browse/KUDU-2324
 Project: Kudu
  Issue Type: Improvement
  Components: tablet
Reporter: Mike Percy


It would be helpful in emergency situations to be able to disable individual 
types of maintenance operations, such as major delta compaction or merging 
compaction.
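A minimal sketch of what such flags could look like (the flag names here are 
illustrative, not existing Kudu flags):
{code}
// Illustrative gflags only; the maintenance manager would consult these
// before scheduling the corresponding op types.
#include <gflags/gflags.h>

DEFINE_bool(enable_major_delta_compaction, true,
            "Whether major delta compaction ops may be scheduled.");
DEFINE_bool(enable_rowset_compaction, true,
            "Whether merging (rowset) compaction ops may be scheduled.");
DEFINE_bool(enable_flush_memrowset, true,
            "Whether MemRowSet flush ops may be scheduled.");
{code}
If Kudu's flag tagging supports it, marking these as runtime-mutable would let 
an operator flip them in an emergency without restarting the server.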



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)