[jira] [Commented] (KUDU-1535) Add rack awareness

2018-08-08 Thread Jean-Daniel Cryans (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573600#comment-16573600
 ] 

Jean-Daniel Cryans commented on KUDU-1535:
--

[~wdberkeley] wrote this design doc for location awareness 
https://docs.google.com/document/d/1w3RbHsBZM74uVI28FK1-s2g_BZCEQsbbHGLDI55KPSg/edit#

> Add rack awareness
> --
>
> Key: KUDU-1535
> URL: https://issues.apache.org/jira/browse/KUDU-1535
> Project: Kudu
>  Issue Type: New Feature
>  Components: master
>Reporter: Jean-Daniel Cryans
>Assignee: Will Berkeley
>Priority: Major
>
> Kudu currently doesn't have the concept of rack awareness, so any kind of 
> rack failure can result in data loss (assuming that Kudu has tablet servers 
> in multiple racks).
> This changes how the master picks hosts to send new tablets to, and we also 
> need to implement a way for the master to map hostnames to racks (could 
> be similar to Hadoop).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2477) Master web UI table detail page should clearly indicate primary key

2018-06-14 Thread Jean-Daniel Cryans (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513042#comment-16513042
 ] 

Jean-Daniel Cryans commented on KUDU-2477:
--

Yeah, the current underlining isn't very obvious.

> Master web UI table detail page should clearly indicate primary key
> ---
>
> Key: KUDU-2477
> URL: https://issues.apache.org/jira/browse/KUDU-2477
> Project: Kudu
>  Issue Type: Improvement
>  Components: ui
>Affects Versions: 1.7.0
>Reporter: Mike Percy
>Priority: Major
>  Labels: newbie
>
> The master web UI table detail page currently does not show the primary key 
> for the table, at least not in an obvious way. This should be improved.
> e.g. http://tserver.example.com:8051/table?id=9af2ac1a89e94f5daf0afb2f9819674b



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2459) Impala 'Create Table' Statement under Web UI doesn't account for non-Impala conforming tablenames

2018-05-30 Thread Jean-Daniel Cryans (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16495715#comment-16495715
 ] 

Jean-Daniel Cryans commented on KUDU-2459:
--

[~twmarshall], any thoughts on Shriya's solution above?

> Impala 'Create Table' Statement under Web UI doesn't account for non-Impala 
> conforming tablenames
> -
>
> Key: KUDU-2459
> URL: https://issues.apache.org/jira/browse/KUDU-2459
> Project: Kudu
>  Issue Type: Bug
>  Components: ui
>Affects Versions: 1.7.0
>Reporter: Shriya Gupta
>Assignee: Shriya Gupta
>Priority: Major
> Attachments: Screen Shot 2018-05-30 at 4.45.12 PM.png
>
>
> Under the Tables section of the Kudu web UI, for a selected table, the table 
> metrics display a CREATE TABLE statement that can be run to make Impala 
> cognizant of that table. However, when generating this statement, the table 
> name is taken verbatim from the original Kudu table name, which may not 
> always be acceptable as a table name for Impala. For example, Kudu accepts 
> dots in table names; Impala does not.
> A statement like the following throws an invalid table name error in Impala:
> !Screen Shot 2018-05-30 at 4.45.12 PM.png|width=606,height=321!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-1867) Improve the "Could not lock .../block_manager_instance" error message

2018-05-21 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483424#comment-16483424
 ] 

Jean-Daniel Cryans commented on KUDU-1867:
--

[~fwang29] did you mean to resolve this as Fix Version 1.8.0?

> Improve the "Could not lock .../block_manager_instance" error message
> -
>
> Key: KUDU-1867
> URL: https://issues.apache.org/jira/browse/KUDU-1867
> Project: Kudu
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 1.2.0
>Reporter: Jean-Daniel Cryans
>Assignee: Fengling Wang
>Priority: Major
>  Labels: newbie
> Fix For: 1.2.0
>
>
> It's possible for users to encounter a rather cryptic error when trying to 
> run Kudu while it's already running or with a different user than what was 
> previously used:
> {code}
> Check failed: _s.ok() Bad status: IO error: Failed to load FS layout: Could 
> not lock /path/to/data/block_manager_instance: Could not lock 
> /path/to/data/block_manager_instance: lock 
> /path/to/data/block_manager_instance: Resource temporarily unavailable (error 
> 11)
> {code}
> This is the log line that we FATAL with, so unless you already know what it 
> means you're left to your own guessing and log digging. Instead, the error 
> message could be more prescriptive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2424) Track cluster startup with ksck

2018-04-30 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458955#comment-16458955
 ] 

Jean-Daniel Cryans commented on KUDU-2424:
--

Same as KUDU-1959?

> Track cluster startup with ksck
> ---
>
> Key: KUDU-2424
> URL: https://issues.apache.org/jira/browse/KUDU-2424
> Project: Kudu
>  Issue Type: Improvement
>  Components: ksck, ops-tooling
>Affects Versions: 1.7.0
>Reporter: Will Berkeley
>Assignee: Will Berkeley
>Priority: Major
>
> Right now, ksck is probably the best way to track a cluster's progress 
> towards being fully started and ready to go. However, it doesn't track 
> bootstrap explicitly. It might be nice if it could. This would need 
> server-side changes to expose information about bootstrapping, I think.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2342) Non-voter replicas can be promoted and get stuck

2018-03-13 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-2342:
-
Summary: Non-voter replicas can be promoted and get stuck  (was: Insert 
into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed 
to write batch ")

> Non-voter replicas can be promoted and get stuck
> 
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Assignee: Alexey Serbin
>Priority: Blocker
>  Labels: scalability
> Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on a 129-node cluster via Impala, a write operation 
> failed with:
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2264) Java client should re-login from ticket cache when ticket is expiring

2018-03-07 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-2264:
-
Code Review: https://gerrit.cloudera.org/#/c/9050/

> Java client should re-login from ticket cache when ticket is expiring
> -
>
> Key: KUDU-2264
> URL: https://issues.apache.org/jira/browse/KUDU-2264
> Project: Kudu
>  Issue Type: Improvement
>  Components: client, java, security
>Affects Versions: 1.3.1, 1.4.0, 1.5.0, 1.6.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
>
> Currently, if the Kudu client is used from a thread that has no JAAS Subject 
> with Kerberos credentials, it will log in from the user's ticket cache (in a 
> configurable location).
> However, if that original ticket expires, then the client will never re-read 
> the ticket cache. Instead, it will start to get authentication failures, even 
> if the underlying ticket cache on disk has been updated with new credentials.
> This causes big issues in Impala -- Impala starts a thread which reacquires 
> tickets from its keytab and writes them into its ticket cache, but with 
> existing versions of Kudu, the client won't pick up these new tickets. Impala 
> also currently caches Kudu clients "forever". So, after 30 days (or whatever 
> the ticket lifetime is), Impala will become unable to query Kudu.
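
For context, a minimal sketch of the workaround the description implies for 
clients that predate the fix: bound the client's lifetime below the Kerberos 
ticket lifetime and rebuild it, so that construction re-reads the renewed 
ticket cache. All names here are illustrative assumptions, not Impala's actual 
implementation.

{code:java}
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;

public class RefreshingKuduClientHolder {
  private final String masterAddresses;
  private final long maxAgeMillis;
  private KuduClient client;
  private long createdAtMillis;

  public RefreshingKuduClientHolder(String masterAddresses, long maxAgeMillis) {
    this.masterAddresses = masterAddresses;
    this.maxAgeMillis = maxAgeMillis;
  }

  // Returns a client no older than maxAgeMillis, rebuilding it when stale so
  // the new instance logs in again from the (renewed) ticket cache.
  public synchronized KuduClient get() throws KuduException {
    long now = System.currentTimeMillis();
    if (client == null || now - createdAtMillis > maxAgeMillis) {
      if (client != null) {
        client.close();
      }
      client = new KuduClient.KuduClientBuilder(masterAddresses).build();
      createdAtMillis = now;
    }
    return client;
  }
}
{code}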



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2322) Leader spews logs when follower falls behind log GC

2018-02-22 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16373031#comment-16373031
 ] 

Jean-Daniel Cryans commented on KUDU-2322:
--

Or [~aserbin].

> Leader spews logs when follower falls behind log GC
> ---
>
> Key: KUDU-2322
> URL: https://issues.apache.org/jira/browse/KUDU-2322
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.7.0
>Reporter: Todd Lipcon
>Priority: Critical
>
> I'm running a YCSB-based write stress test and found that one of the 
> followers fell behind enough that its logs got GCed by the leader. At this 
> point, the leader started logging about 100 messages per second indicating 
> that it could not obtain a request for this peer.
> I believe this is a regression since 1.6, since before 3-4-3 replication we 
> would have evicted the replica as soon as it fell behind GC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2323) NON_VOTER replica flapping (repeatedly added and evicted)

2018-02-22 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16373030#comment-16373030
 ] 

Jean-Daniel Cryans commented on KUDU-2323:
--

Or [~aserbin].

> NON_VOTER replica flapping (repeatedly added and evicted)
> -
>
> Key: KUDU-2323
> URL: https://issues.apache.org/jira/browse/KUDU-2323
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.7.0
>Reporter: Todd Lipcon
>Priority: Major
>
> In running a YCSB stress workload I see a tablet got into some state where 
> the master flapped back and forth adding and then removing a replica as a 
> NON_VOTER:
> {code}
> I0221 21:54:35.341892 28047 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:35.360297 28045 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:35.612417 28048 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:35.713057 28045 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:35.725723 28045 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:35.752959 28052 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:35.767974 28047 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:35.772202 28045 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:36.291569 28046 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:36.296468 28046 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:36.328945 28045 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:36.339675 28045 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:36.387465 28045 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:36.394716 28047 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:36.398644 28047 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:36.405082 28047 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:36.409888 28048 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:36.414216 28046 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:36.417915 28048 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:36.423548 28048 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:36.453407 28045 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:36.552772 28048 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:58:01.300199 28053 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:58:01.426921 28046 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 22:01:37.779790 28051 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-428) Support for service/table/column authorization

2018-01-26 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-428:

Target Version/s: Backlog  (was: 1.6.0)

> Support for service/table/column authorization
> --
>
> Key: KUDU-428
> URL: https://issues.apache.org/jira/browse/KUDU-428
> Project: Kudu
>  Issue Type: New Feature
>  Components: master, security, tserver
>Affects Versions: 1.2.0
>Reporter: Todd Lipcon
>Priority: Critical
>  Labels: kudu-roadmap
>
> We need to support basic SQL-like access control:
> - grant/revoke on tables, columns
> - service-level grant/revoke
> - probably need some group/role mapping infrastructure as well



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-1867) Improve the "Could not lock .../block_manager_instance" error message

2018-01-10 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320704#comment-16320704
 ] 

Jean-Daniel Cryans commented on KUDU-1867:
--

[~evinhas] this is not a bug; this jira is about improving an error message. 
You're still going to get an error due to the underlying issue, which might be 
that the file has the wrong permissions or something like that.

> Improve the "Could not lock .../block_manager_instance" error message
> -
>
> Key: KUDU-1867
> URL: https://issues.apache.org/jira/browse/KUDU-1867
> Project: Kudu
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 1.2.0
>Reporter: Jean-Daniel Cryans
>  Labels: newbie
>
> It's possible for users to encounter a rather cryptic error when trying to 
> run Kudu while it's already running or with a different user than what was 
> previously used:
> {code}
> Check failed: _s.ok() Bad status: IO error: Failed to load FS layout: Could 
> not lock /path/to/data/block_manager_instance: Could not lock 
> /path/to/data/block_manager_instance: lock 
> /path/to/data/block_manager_instance: Resource temporarily unavailable (error 
> 11)
> {code}
> This is the log line that we FATAL with, so unless you already know what it 
> means you're left to your own guessing and log digging. Instead, the error 
> message could be more prescriptive.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-1867) Improve the "Could not lock .../block_manager_instance" error message

2018-01-10 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1867:
-
Target Version/s: Backlog  (was: 1.5.0)

> Improve the "Could not lock .../block_manager_instance" error message
> -
>
> Key: KUDU-1867
> URL: https://issues.apache.org/jira/browse/KUDU-1867
> Project: Kudu
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 1.2.0
>Reporter: Jean-Daniel Cryans
>  Labels: newbie
>
> It's possible for users to encounter a rather cryptic error when trying to 
> run Kudu while it's already running or with a different user than what was 
> previously used:
> {code}
> Check failed: _s.ok() Bad status: IO error: Failed to load FS layout: Could 
> not lock /path/to/data/block_manager_instance: Could not lock 
> /path/to/data/block_manager_instance: lock 
> /path/to/data/block_manager_instance: Resource temporarily unavailable (error 
> 11)
> {code}
> This is the log line that we FATAL with, so unless you already know what it 
> means you're left to your own guessing and log digging. Instead, the error 
> message could be more prescriptive.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KUDU-2250) Document odd interaction between upserts and Spark Datasets

2017-12-28 Thread Jean-Daniel Cryans (JIRA)
Jean-Daniel Cryans created KUDU-2250:


 Summary: Document odd interaction between upserts and Spark 
Datasets
 Key: KUDU-2250
 URL: https://issues.apache.org/jira/browse/KUDU-2250
 Project: Kudu
  Issue Type: Task
  Components: spark
Affects Versions: 1.6.0
Reporter: Jean-Daniel Cryans


We need to document a specific behavior of Spark Datasets that runs contrary to 
how Kudu works.

Say you have 3 columns "k, x, y" where k is the primary key.

You run a first insert on a row "k=1, x=2, y=3".

Now you upsert "k=1, y=4".

Using any Kudu API, the full row would now be "k=1, x=2, y=4", but with 
Datasets you get "k=1, x=*NULL*, y=4". This means that Datasets write a null 
value for any columns that aren't specified in the upsert.
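
To make the contrast concrete, here is a minimal sketch of the non-Dataset 
behavior using the Kudu Java client directly. The master address and table 
name are hypothetical; the table is assumed to have the int columns k, x, y 
from the example above, with k as the primary key.

{code:java}
import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.Upsert;

public class UpsertSemanticsExample {
  public static void main(String[] args) throws KuduException {
    KuduClient client =
        new KuduClient.KuduClientBuilder("master.example.com:7051").build();
    KuduTable table = client.openTable("upsert_example");
    KuduSession session = client.newSession();

    // Insert the full row: k=1, x=2, y=3.
    Insert insert = table.newInsert();
    insert.getRow().addInt("k", 1);
    insert.getRow().addInt("x", 2);
    insert.getRow().addInt("y", 3);
    session.apply(insert);

    // Upsert that sets only k and y. Through this API, x keeps its old
    // value (2); unset columns are left untouched rather than nulled out,
    // unlike the Spark Dataset path described above.
    Upsert upsert = table.newUpsert();
    upsert.getRow().addInt("k", 1);
    upsert.getRow().addInt("y", 4);
    session.apply(upsert);

    session.close();
    client.close();
  }
}
{code}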



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-2241) NoSuchMethodError happened when run flume agent using kudu flume sink

2017-12-18 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295179#comment-16295179
 ] 

Jean-Daniel Cryans commented on KUDU-2241:
--

[~danburkert] I think you already fixed this?

> NoSuchMethodError happened when run flume agent using kudu flume sink
> -
>
> Key: KUDU-2241
> URL: https://issues.apache.org/jira/browse/KUDU-2241
> Project: Kudu
>  Issue Type: Bug
>  Components: flume-sink
>Affects Versions: 1.5.0
>Reporter: Changyao Ye
>
> I've installed the kudu and flume components from cloudera manager, and when 
> I start the flume agent the following error happens.
> {panel:title=Error}
> 17/12/11 10:46:29 ERROR node.PollingPropertiesFileConfigurationProvider: 
> Unhandled error
> java.lang.NoSuchMethodError: 
> org.apache.flume.Context.getSubProperties(Ljava/lang/String;)Lorg/apache/kudu/shaded/com/google/common/collect/ImmutableMap;
>   at org.apache.kudu.flume.sink.KuduSink.configure(KuduSink.java:206)
>   at org.apache.flume.conf.Configurables.configure(Configurables.java:41)
>   at 
> org.apache.flume.node.AbstractConfigurationProvider.loadSinks(AbstractConfigurationProvider.java:411)
>   at 
> org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:102)
>   at 
> org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:141)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {panel}
> Version information:
> - flume: 1.6.0-cdh5.13.0
> - kudu: 1.5.0-cdh5.13.0
> I checked the classpath of the flume agent and found that there may be a 
> version conflict of the guava jar between flume and the kudu flume sink 
> (kudu-flume-sink-1.5.0-cdh5.13.0.jar), which includes a shaded guava jar.
> Maybe this problem is related to the changes made 
> [here|https://github.com/apache/kudu/commit/5a258508f8d560f630512c237711a65cd137c6b3].
> To make the flume agent run properly, I excluded the related classes 
> (ImmutableMap etc.) from the shade settings in pom.xml and found that worked.
> {panel:title=kudu 1.5.0 top-level pom.xml}
> <relocation>
>   <pattern>com.google.common</pattern>
>   <shadedPattern>org.apache.kudu.shaded.com.google.common</shadedPattern>
>   <excludes>
>     <exclude>com.google.common.collect.ImmutableMap*</exclude>
>     <exclude>com.google.common.collect.ImmutableEnumMap*</exclude>
>   </excludes>
> </relocation>
> {panel}
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-2242) Wait for NTP synchronization on startup

2017-12-15 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292944#comment-16292944
 ] 

Jean-Daniel Cryans commented on KUDU-2242:
--

[~dr-alves] / [~tlipcon] thoughts on this?

> Wait for NTP synchronization on startup
> ---
>
> Key: KUDU-2242
> URL: https://issues.apache.org/jira/browse/KUDU-2242
> Project: Kudu
>  Issue Type: Improvement
>  Components: server
>Reporter: Jean-Daniel Cryans
>Priority: Critical
>
> A common issue when restarting servers is that the clock won't be 
> synchronized right away and the Kudu services will refuse to start. We should 
> be more tolerant of this; maybe a few simple retries?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-2238) Big DMS not flush under memory pressure

2017-12-14 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16291579#comment-16291579
 ] 

Jean-Daniel Cryans commented on KUDU-2238:
--

In 1.3.0 we start force flushing at the same time we start rejecting writes. In 
1.4.0 we start force flushing at 60% usage and rejections are sent at 80%. I 
really recommend upgrading.
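
For reference, a hedged sketch of the tablet server knobs behind those 
thresholds, with the post-1.4.0 defaults described above (flag names per the 
Kudu configuration reference; shown as documentation, not as a tuning 
recommendation):

{noformat}
# Maintenance ops that free memory become urgent at this percentage of the
# process memory limit, i.e. flushing is forced from here on.
--memory_pressure_percentage=60

# Incoming writes start being rejected at this percentage of the limit.
--memory_limit_soft_percentage=80
{noformat}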

> Big DMS not flush under memory pressure
> ---
>
> Key: KUDU-2238
> URL: https://issues.apache.org/jira/browse/KUDU-2238
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.3.0
> Environment: CentOS6.5 Linux 2.6.32-431 
> Kudu1.3.0 
> GitCommit 00813f96b9cb
>Reporter: ZhangZhen
> Attachments: memory_anchored.png, memory_consumed.png
>
>
> I have a table with many updates; its DMS consumes a lot of memory and causes 
> “Soft Memory Limit Exceeded”. I checked /mem-trackers on the tablet server: 
> one of its DMSes consumes about 3G of memory, but according to 
> /maintenance-manager, its FlushDeltaMemStoresOp can only free 763B of 
> anchored memory and its perf_improvement is 0. Is this normal? I know Kudu is 
> not optimized for updates, but I'm still confused why the DMS won’t be 
> flushed under memory pressure.
> Infos from /mem-trackers:
> !memory_consumed.png!
> ||Id||Parent||Limit||Current Consumption||Peak consumption||
> |tablet-5941a8bb934e4686abd1bfff9e35c860|server|none|3.00G|3.00G|
> |txn_tracker|tablet-5941a8bb934e4686abd1bfff9e35c860|64.00M|0B|1.67M|
> |MemRowSet-339|tablet-5941a8bb934e4686abd1bfff9e35c860|none|265B|265B|
> |DeltaMemStores|tablet-5941a8bb934e4686abd1bfff9e35c860|none|3.00G|3.00G|
> Infos from /maintenance-manager:
> !memory_anchored.png!
> |FlushDeltaMemStoresOp(5941a8bb934e4686abd1bfff9e35c860)|true|763B|511.15M|0|
> The tablet 5941a8bb934e4686abd1bfff9e35c860 has 16 RowSets in total
> Some configs of MM:
> --enable_maintenance_manager=true
> --log_target_replay_size_mb=1024
> --maintenance_manager_history_size=8
> --maintenance_manager_num_threads=6
> --maintenance_manager_polling_interval_ms=50



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-2209) HybridClock doesn't handle changes to the STA_NANO status flag

2017-12-04 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-2209:
-
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0
   1.4.1
   1.2.1
   1.3.2
   Status: Resolved  (was: In Review)

Thanks Attila, I pushed your cherry-pick and it's now everywhere.

> HybridClock doesn't handle changes to the STA_NANO status flag
> ---
>
> Key: KUDU-2209
> URL: https://issues.apache.org/jira/browse/KUDU-2209
> Project: Kudu
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.6.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 1.3.2, 1.2.1, 1.4.1, 1.6.0, 1.5.1
>
>
> Users have occasionally reported spurious crashes due to Kudu thinking that 
> another node has a timestamp from the future. After some debugging I 
> realized that the issue is that we currently capture the flag 'STA_NANO' from 
> the kernel only at startup. This flag indicates whether the kernel's 
> sub-second timestamp is in nanoseconds or microseconds. We initially assumed 
> this was a static property of the kernel. However it turns out that this flag 
> can get toggled at runtime by ntp in certain circumstances. Given this, it 
> was possible for us to interpret a number of nanoseconds as if it were 
> microseconds, resulting in a timestamp up to 1000 seconds in the future.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-2226) Frequently updated table does not flush DeltaMemStore in time and will occupy a lot of memory

2017-11-30 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273716#comment-16273716
 ] 

Jean-Daniel Cryans commented on KUDU-2226:
--

There's probably a big backlog; it will be interesting to see if it gets to 
DMS flushes or compactions after some time.

> Frequently updated table does not flush DeltaMemStore in time and will occupy 
> a lot of memory
> -
>
> Key: KUDU-2226
> URL: https://issues.apache.org/jira/browse/KUDU-2226
> Project: Kudu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
> Environment: CentOS6.5 Linux 2.6.32-431
> Kudu1.3.0 
> GitCommit 00813f96b9cb
>Reporter: ZhangZhen
>
> I have a table with 10M rows in total that has been hash partitioned into 16 
> buckets. Each tablet has about 100MB of on-disk size according to the 
> /tablets web UI. Every day 50K new rows are inserted into this table, and 
> about 5M rows of it are updated (about half of the rows in total), with each 
> row updated only once.
> Then I found something strange: in the /mem-trackers UI of the TS, every 
> tablet of this table occupied about 900MB of memory, mainly in the 
> DeltaMemStore; the peak memory consumption is about 1.8G.
> I don't understand why the DeltaMemStore costs so much memory. 900MB of DMS 
> vs 100MB of on-disk size seems strange to me. What's more, these DMSes are 
> flushed very slowly, so this memory stays occupied for a long time, which 
> causes "Soft memory limit exceeded" on the TS and in turn causes "Rejecting 
> consensus request".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KUDU-2221) Improve server startup error message when glog files have the wrong ACLs

2017-11-20 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved KUDU-2221.
--
   Resolution: Duplicate
Fix Version/s: n/a

Oh man, you are right :)

> Improve server startup error message when glog files have the wrong ACLs
> 
>
> Key: KUDU-2221
> URL: https://issues.apache.org/jira/browse/KUDU-2221
> Project: Kudu
>  Issue Type: Improvement
>  Components: server
>Affects Versions: 1.4.0
>Reporter: Jean-Daniel Cryans
> Fix For: n/a
>
>
> On a server where the glog files were played with as "root", the master 
> refused to start due to:
> {noformat}
> master_main.cc:68] Check failed: _s.ok() Bad status: IO error: Unable to 
> delete excess log files: glob failure: 2
> {noformat}
> The existing log files belonged to root instead of kudu; chown'ing fixed the 
> issue, but this message could be made easier to parse.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KUDU-2221) Improve server startup error message when glog files have the wrong ACLs

2017-11-20 Thread Jean-Daniel Cryans (JIRA)
Jean-Daniel Cryans created KUDU-2221:


 Summary: Improve server startup error message when glog files have 
the wrong ACLs
 Key: KUDU-2221
 URL: https://issues.apache.org/jira/browse/KUDU-2221
 Project: Kudu
  Issue Type: Improvement
  Components: server
Affects Versions: 1.4.0
Reporter: Jean-Daniel Cryans


On a server where the glog files were played with as "root", the master refused 
to start due to:

{noformat}
master_main.cc:68] Check failed: _s.ok() Bad status: IO error: Unable to delete 
excess log files: glob failure: 2
{noformat}

The existing log files belonged to root instead of kudu; chown'ing fixed the 
issue, but this message could be made easier to parse.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-1762) suspected tablet memory leak

2017-10-30 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225100#comment-16225100
 ] 

Jean-Daniel Cryans commented on KUDU-1762:
--

[~rains-tung] Do you have any evidence of what was described above? Otherwise, 
this message can just mean that you don't have enough memory allocated to 
Kudu, or you don't have enough maintenance manager threads, or you're simply 
trying to push too much data at the same time for the hardware you have, so 
Kudu has to push back.

> suspected tablet memory leak
> 
>
> Key: KUDU-1762
> URL: https://issues.apache.org/jira/browse/KUDU-1762
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.0.1
> Environment: CentOS 6.5
> Kudu 1.0.1 (rev e60b610253f4303b24d41575f7bafbc5d69edddb)
>Reporter: Fu Lili
> Attachments: 0B2CE7BB-EF26-4EA1-B824-3584D7D79256.png, 
> kudu_heap_prof_20161206.tar.gz, mem_rss_graph_2016_12_19.png, 
> server02_30day_rss_before_and_after_mrs_flag.png, 
> server02_30day_rss_before_and_after_mrs_flag_2.png, tserver_smaps1
>
>
> here is the memory total info:
> {quote}
> 
> MALLOC: 1691715680 ( 1613.3 MiB) Bytes in use by application
> MALLOC: +178733056 (  170.5 MiB) Bytes in page heap freelist
> MALLOC: + 37483104 (   35.7 MiB) Bytes in central cache freelist
> MALLOC: +  4071488 (3.9 MiB) Bytes in transfer cache freelist
> MALLOC: + 13739264 (   13.1 MiB) Bytes in thread cache freelists
> MALLOC: + 12202144 (   11.6 MiB) Bytes in malloc metadata
> MALLOC:   
> MALLOC: =   1937944736 ( 1848.2 MiB) Actual memory used (physical + swap)
> MALLOC: +   311296 (0.3 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   
> MALLOC: =   1938256032 ( 1848.5 MiB) Virtual address space used
> MALLOC:
> MALLOC: 174694  Spans in use
> MALLOC:201  Thread heaps in use
> MALLOC:   8192  Tcmalloc page size
> 
> Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
> Bytes released to the OS take up virtual address space but no physical memory.
> {quote}
> but in the memory detail, the sum of all the sub Current Consumption values 
> is far less than the root Current Consumption.
> ||Id||Parent||Limit||Current Consumption||Peak consumption||
> |root|none|4.00G|1.58G|1.74G|
> |log_cache|root|1.00G|480.8K|5.32M|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:70c8d889b0314b04a240fcb02c24a012|log_cache|128.00M|160B|160B|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:16d3c8193579445f8f766da6c7abc237|log_cache|128.00M|160B|160B|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:2c69c5cb9eb04eb48323a9268afc36a7|log_cache|128.00M|160B|160B|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:2b11d9220dab4a5f952c5b1c10a68ccd|log_cache|128.00M|69.2K|139.5K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:cec045be60af4f759497234d8815238b|log_cache|128.00M|68.6K|138.7K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:cea7a54cebd242e4997da641f5b32e3a|log_cache|128.00M|68.5K|139.3K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:9625dfde17774690a888b55024ac797a|log_cache|128.00M|68.5K|140.0K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:6046b33901ca43d0975f59cf7e491186|log_cache|128.00M|0B|133.0K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:1a18ab0915f0407b922fa7ecbe7a2f46|log_cache|128.00M|0B|132.6K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:ac54d1c1813a4e39943971cb56f248ef|log_cache|128.00M|0B|130.5K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:4438580df6cc4d469393b9d6adee68d8|log_cache|128.00M|0B|131.2K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:2f1cef7d2a494575b941baa22b8a3dc9|log_cache|128.00M|0B|131.6K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:d2ad22d202c04b2d98f1c5800df1c3b5|log_cache|128.00M|0B|132.5K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:b19b21d6b4c84f9895aad9e81559d019|log_cache|128.00M|0B|131.0K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:27e9531cd5814b1c9637493f05860b19|log_cache|128.00M|0B|131.1K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:425a19940239447faa0eaab4e380d644|log_cache|128.00M|68.5K|146.9K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:178bd7bc39a941a887f393b0a7848066|log_cache|128.00M|68.5K|139.9K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:91524acd28a440318918f11292ac8fdc|log_cache|128.00M|0B|132.0K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:be6f093aabf9460b97fc35dd026820b6|log_cache|128.00M|0B|130.4K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:dd8dd794f0f44426a3c46ce8f4b54652|log_cache|128.00M|0B|131.2K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:ed128ca7b19c4e3eaa48e9e3eb341492|log_cache|128.00M|68.5K|141.5K|
> 

[jira] [Commented] (KUDU-2203) java.io.FileNotFoundException when trying to initialize MiniKuduCluster

2017-10-27 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16222449#comment-16222449
 ] 

Jean-Daniel Cryans commented on KUDU-2203:
--

Is the following javadoc helpful? 
https://github.com/apache/kudu/blob/branch-1.5.x/java/kudu-client/src/test/java/org/apache/kudu/client/MiniKuduCluster.java#L46

>  java.io.FileNotFoundException when trying to initialize MiniKuduCluster
> 
>
> Key: KUDU-2203
> URL: https://issues.apache.org/jira/browse/KUDU-2203
> Project: Kudu
>  Issue Type: Bug
>  Components: java, test
>Affects Versions: 1.5.0
>Reporter: Nacho García Fernández
>
> I'm getting the following error when I try to create a new instance of 
> MiniKuduCluster:
> {code:java}
> java.io.FileNotFoundException: Cannot find binary kudu-master in binary 
> directory null
>   at org.apache.kudu.client.TestUtils.findBinary(TestUtils.java:159)
>   at 
> org.apache.kudu.client.MiniKuduCluster.startMasters(MiniKuduCluster.java:210)
>   at 
> org.apache.kudu.client.MiniKuduCluster.startCluster(MiniKuduCluster.java:153)
>   at 
> org.apache.kudu.client.MiniKuduCluster.start(MiniKuduCluster.java:117)
>   at 
> org.apache.kudu.client.MiniKuduCluster.access$300(MiniKuduCluster.java:50)
>   at 
> org.apache.kudu.client.MiniKuduCluster$MiniKuduClusterBuilder.build(MiniKuduCluster.java:661)
>   at org.apache.kudu.client.BaseKuduTest.doSetup(BaseKuduTest.java:113)
>   at 
> org.apache.kudu.client.BaseKuduTest.setUpBeforeClass(BaseKuduTest.java:76)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>   at 
> com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {code}
> This is happening when a test class extends BaseKuduTest (which 
> internally instantiates a MiniKuduCluster).
> My simple test class:
> {code:java}
> import org.apache.kudu.client.BaseKuduTest;
> import org.junit.Before;
> import org.junit.Test;
>
> public class KuduInputFormatTest extends BaseKuduTest {
>     @Before
>     public void initialize() throws Exception {
>     }
>
>     @Test
>     public void test() throws Exception {
>         System.out.println("The error occurs before this message is printed");
>     }
> }
> {code}
> Current POM dependencies:
> {noformat}
> <dependency>
>     <groupId>org.apache.kudu</groupId>
>     <artifactId>kudu-client</artifactId>
>     <version>${kudu.version}</version>
> </dependency>
> <dependency>
>     <groupId>org.apache.kudu</groupId>
>     <artifactId>kudu-client</artifactId>
>     <version>${kudu.version}</version>
>     <type>test-jar</type>
>     <scope>test</scope>
> </dependency>
> {noformat}
> where kudu.version is 1.5.0
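
A hedged sketch of one way to address the "binary directory null" failure: 
point the test harness at a local Kudu build before the MiniKuduCluster 
starts. The "binDir" system property name follows the MiniKuduCluster javadoc 
linked in the comment above, and the path is hypothetical; treat both as 
assumptions to verify against your Kudu version.

{code:java}
import org.apache.kudu.client.BaseKuduTest;

public class KuduInputFormatTest extends BaseKuduTest {
  static {
    // Assumption: the harness resolves the kudu-master/kudu-tserver binaries
    // via the "binDir" system property. Equivalently, pass -DbinDir=... on
    // the JVM command line (e.g. through surefire) before class init.
    System.setProperty("binDir", "/path/to/kudu/build/latest/bin");
  }
}
{code}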



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-2173) Partitions are pruned incorrectly when range-partitioned on a PK prefix

2017-10-06 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-2173:
-
Code Review: https://gerrit.cloudera.org/#/c/8222/

> Partitions are pruned incorrectly when range-partitioned on a PK prefix
> ---
>
> Key: KUDU-2173
> URL: https://issues.apache.org/jira/browse/KUDU-2173
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: 1.2.0, 1.3.1, 1.4.0, 1.5.0
>Reporter: Todd Lipcon
>Assignee: Dan Burkert
>Priority: Blocker
>
> Given a schema:
> {code}
> Schema [
>   0:a[int8 NOT NULL],
>   1:b[int8 NOT NULL]
> ]
> PRIMARY KEY (a,b)
> {code}
> and a partition:
> {code}
> RANGE (a) PARTITION VALUES >= 10
> {code}
> ... the partition pruner incorrectly handles the following scan spec:
> {code}
>  `a` < 11 AND `b` < 11
> {code}
> ... and prunes the partition despite the possibility of it having a row like 
> (10, 1)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (KUDU-2130) ITClientStress.testManyShortClientsGeneratingScanTokens is flaky

2017-09-06 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans reassigned KUDU-2130:


Assignee: Dan Burkert

> ITClientStress.testManyShortClientsGeneratingScanTokens is flaky
> 
>
> Key: KUDU-2130
> URL: https://issues.apache.org/jira/browse/KUDU-2130
> Project: Kudu
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.5.0
>Reporter: Adar Dembo
>Assignee: Dan Burkert
> Fix For: 1.6.0
>
> Attachments: org.apache.kudu.client.ITClientStress.html
>
>
> This test appears to fail every time I run "./gradlew build" on my laptop.
> {noformat}
> org.apache.kudu.client.ITClientStress > 
> testManyShortClientsGeneratingScanTokens FAILED
> java.lang.AssertionError: log contained NPE
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertFalse(Assert.java:64)
> at 
> org.apache.kudu.client.ITClientStress.runTasks(ITClientStress.java:79)
> at 
> org.apache.kudu.client.ITClientStress.testManyShortClientsGeneratingScanTokens(ITClientStress.java:97)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (KUDU-2083) MaintenanceManager running_op_ count not decremented if MaintenanceOp::Prepare() fails

2017-08-30 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans reassigned KUDU-2083:


   Assignee: David Alves
Code Review: https://gerrit.cloudera.org/#/c/7610/

David pushed a fix to master, but we ought to backport it, so I'm leaving this 
open until that's done.

Also, for future google searches, here's the smoking gun:

{noformat}
I0825 00:57:57.157274 26537 maintenance_manager.cc:265] P 
2ff1960fe2ef43a09f7458a1883bdca1: Prepare failed for 
FlushMRSOp(5ed8492e345f45cbbf4a7c2f706605a5).  Re-running scheduler.
{noformat}

When you hit this as many times as you have maintenance manager threads 
configured (default: 1), you stop flushing.

> MaintenanceManager running_op_ count not decremented if 
> MaintenanceOp::Prepare() fails
> --
>
> Key: KUDU-2083
> URL: https://issues.apache.org/jira/browse/KUDU-2083
> Project: Kudu
>  Issue Type: Bug
>Reporter: Samuel Okrent
>Assignee: David Alves
>Priority: Critical
>
> In MaintenanceManager::RunSchedulerThread(), an op gets selected, 
> running_ops_ is incremented, and Prepare() is called on the op. If Prepare() 
> returns false, the op isn't run, so running_ops_ never gets decremented. If 
> Prepare() ever fails, then this could be a problem, as the maintenance 
> manager compares running_ops_ to the number of operation threads to determine 
> whether or not it can run another operation. Prepare generally doesn't fail, 
> but if Tablet::AlterSchema() is called in between FlushMRSOp::UpdateStats() 
> and  FlushMRSOp::Prepare(), that is one instance where Prepare() could 
> potentially fail.
> To fix, decrement running_ops_ in the codepath that follows from Prepare() 
> failing.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-2119) Failure in TestBinaryPlainBlockBuilderRoundTrip

2017-08-29 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146564#comment-16146564
 ] 

Jean-Daniel Cryans commented on KUDU-2119:
--

[~wdberkeley] that seems bad; would it be a regression in 1.5.0? I'm guessing 
it's not.

> Failure in TestBinaryPlainBlockBuilderRoundTrip
> ---
>
> Key: KUDU-2119
> URL: https://issues.apache.org/jira/browse/KUDU-2119
> Project: Kudu
>  Issue Type: Bug
>  Components: cfile
>Affects Versions: 1.5.0
>Reporter: Will Berkeley
>
> Reproducible with the seed shown:
> $ build/latest/bin/encoding-test -test_random_seed=-129518567 
> --gtest_filter=TestEncoding.TestBinaryPlainBlockBuilderRoundTrip
> Note: Google Test filter = TestEncoding.TestBinaryPlainBlockBuilderRoundTrip
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from TestEncoding
> [ RUN  ] TestEncoding.TestBinaryPlainBlockBuilderRoundTrip
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0829 09:16:22.198427 2011275264 test_util.cc:195] Using random seed: 
> -129518567
> I0829 09:16:22.199496 2011275264 encoding-test.cc:288] Block: 00: 3930 
>  8802  0c00  000c 0c0c 90..
> 10: 0c00 0c0c 0c0c 000c 0c0c 0c00 0c0c 0c0c 
> 20: 000c 0c0c 0c00 0c0c 0c0c 000c 0c0c 0c00 
> 30: 0c0c 0c0c 000c 0c0c 0c00 0c0c 0c0c 000c 
> 40: 0c0c 0c00 0c0c 0c0c 000c 0c0c 0c00 0c0c 
> 50: 0c0c 000c 0c0c 0c00 0c0c 0c0c 000c 0c0c 
> 60: 0c00 0c0c 0c0c 000c 0c0c 0c00 0c0c 0c0c 
> 70: 000c 0c0c 0c00 0c0c 0c0c 000c 0c0c 0c00 
> 80: 0c0c 0c0c 000c 0c0c 0c00 0c0c 0c0c 000c 
> 90: 0c0c 0c00 0c0c 0c0c 000c 0c0c 0c00 0c0c 
> a0: 0c0c 000c 0c0c 0c00 0c0c 0c0c 000c 0c0c 
> b0: 0c00 0c0c 0c0c 000c 0c0c 0c00 0c0c 0c0c 
> c0: 000c 0c0c 0c00 0c0c 0c0c 000c 0c0c 0c00 
> d0: 0c0c 0c0c 000c 0c0c 0c00 0c0c 0c0c 000c 
> e0: 0c0c 0c00 0c0c 0c0c 000c 0c0c 0c00 0c0c 
> f0: 0c0c 000c 0c0c 0c00 0c0c 0c0c 000c 0c0c 
> 000100: 0c00 0c0c 0c0c 000c 0c0c 0c00 0c0c 0c0c 
> 000110: 000c 0c0c 0c00 0c0c 0c0c 000c 0c0c 0c00 
> 000120: 0c0c 0c0c 000c 0c0c 0c00 0c0c 0c0c 000c 
> 000130: 0c0c 0c00 0c0c 0c0c 000c 0c0c 0c00 0c0c 
> 000140: 0c0c 000c 0c0c 0c00 0c0c 0c0c 000c 0c0c 
> 000150: 0c00 0c0c 0c0c 000c 0c0c 0c00 0c0c 0c0c 
> 000160: 000c 0c0c 0c00 0c0c 0c0c 000c 0c0c 0c00 
> 000170: 0c0c 0c0c 000c 0c0c 0c00 0c0c 0c0c 000c 
> 000180: 0c0c 0c00 0c0c 0c0c 000c 0c0c 0c00 0c0c 
> 000190: 0c0c 000c 0c0c 0c00 0c0c 0c0c 000c 0c0c 
> 0001a0: 0c00 0c0c 0c0c 000c 0c0c 0c00 0c0c 0c0c 
> 0001b0: 000c 0c0c 0c00 0c0c 0c0c 000c 0c0c 0c00 
> 0001c0: 0c0c 0c0c 000c 0c0c 0c00 0c0c 0c0c 000c 
> 0001d0: 0c0c 0c00 0c0c 0c0c 000c 0c0c 0c00 0c0c 
> 0001e0: 0c0c 000c 0c0c 0c00 0c0c 0c0c 000c 0c0c 
> 0001f0: 0c00 0c0c 0c0c 000c 0c0c 0c00 0c0c 0c0c 
> 000200: 000c 0c0c 0c00 0c0c 0c0c 000c 0c0c 0c00 
> 000210: 0c0c 0c0c 000c 0c0c 0c00 0c0c 0c0c 000c 
> 000220: 0c0c 0c00 0c0c 0c0c 000c 0c0c 0c00 0c0c 
> 000230: 0c0c 000c 0c0c 0c00 0c0c 0c0c 000c 0c0c 
> 000240: 0c00 0c0c 0c0c 000c 0c0c 0c00 0c0c 0c0c 
> 000250: 000c 0c0c 0c00 0c0c 0c0c 000c 0c0c 0c00 
> 000260: 0c0c 0c0c 000c 0c0c 0c00 0c0c 0c0c 000c 
> 000270: 0c0c 0c00 0c0c 0c0c 000c 0c0c 0c00 0c0c 
> 000280: 0c0c 000c 0c0c 0c00 0c0c 0c0c 000c 0c0c 
> 000290: 0c00 0c0c 0c0c 000c 0c0c 0c00 0c0c 0c0c 
> 0002a0: 000c 0c0c 0c00 0c0c 0c0c 000c 0c0c 0c00 
> 0002b0: 0c0c 0c0c 000c 0c0c 0c00 0c0c 0c0c 000c 
> 0002c0: 0c0c 0c00 0c0c 0c0c 000c 0c0c 0c00 0c0c 
> 0002d0: 0c0c 000c 0c0c 0c00 0c0c 0c0c 000c 0c0c 
> 0002e0: 0c00 0c0c 0c0c 000c 0c0c 0c00 0c0c 0c0c 
> 0002f0: 000c 0c0c 0c00 0c0c 0c0c 000c 0c0c 0c00 
> 000300: 0c0c 0c0c 000c 0c0c 0c00 0c0c 0c0c 000c 
> 000310: 0c0c 0c00 0c0c 0c0c 000c 0c0c 0c00 0c0c 
> 000320: 0c0c 000c 0c0c 0c00 0c0c 0c0c 000c 0c0c 
> 000330: 0c00 0c0c 0c0c  ..
> /Users/wdberkeley/kudu/src/kudu/cfile/encoding-test.cc:291: Failure
> Expected: (s.size()) > (kCount * 2u), actual: 822 vs 1296
> I0829 09:16:22.200042 2011275264 

[jira] [Updated] (KUDU-1728) Parallelize tablet copy operations

2017-08-28 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1728:
-
Target Version/s: Backlog

From recent experience, I think this would be great. If we could define a 
tserver-wide IO budget and decide on the fly whether a tablet copy can do more 
(or less) in parallel, then this could resolve some re-replication situations 
faster.

Another similar idea: maybe instead of copying many tablets in parallel we 
should focus on re-replicating single tablets as fast as we can? It might 
become more difficult to get good parallelism across a cluster that way, 
though, since there would be fewer copy operations going on.

> Parallelize tablet copy operations
> --
>
> Key: KUDU-1728
> URL: https://issues.apache.org/jira/browse/KUDU-1728
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus, tablet
>Reporter: Mike Percy
>
> Parallelize tablet copy operations. Right now all data is copied serially. We 
> may want to consider throttling on either side if we want to budget IO.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-1318) Java integration test failures in kudu-client-tools

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16142297#comment-16142297
 ] 

Jean-Daniel Cryans commented on KUDU-1318:
--

[~d...@danburkert.com] is this still an issue after your recent changes?

> Java integration test failures in kudu-client-tools
> ---
>
> Key: KUDU-1318
> URL: https://issues.apache.org/jira/browse/KUDU-1318
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.7.0
>Reporter: Adar Dembo
>
> Filing this so we can collectively figure out what's going on.
> I see the following errors consistently when I run "mvn clean install".
> {noformat}
> ---
>  T E S T S
> ---
> Running org.kududb.mapreduce.tools.ITRowCounter
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 4.954 sec <<< 
> FAILURE! - in org.kududb.mapreduce.tools.ITRowCounter
> test(org.kududb.mapreduce.tools.ITRowCounter)  Time elapsed: 2.345 sec  <<< 
> ERROR!
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.yarn.proto.YarnProtos$LocalResourceProto.hashLong(J)I
>   at 
> org.apache.hadoop.yarn.proto.YarnProtos$LocalResourceProto.hashCode(YarnProtos.java:11655)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.LocalResourcePBImpl.hashCode(LocalResourcePBImpl.java:62)
>   at java.util.HashMap.hash(HashMap.java:362)
>   at java.util.HashMap.put(HashMap.java:492)
>   at 
> org.apache.hadoop.mapred.LocalDistributedCacheManager.setup(LocalDistributedCacheManager.java:133)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.(LocalJobRunner.java:163)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:731)
>   at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:536)
>   at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1306)
>   at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1303)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>   at org.apache.hadoop.mapreduce.Job.submit(Job.java:1303)
>   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1324)
>   at org.kududb.mapreduce.tools.ITRowCounter.test(ITRowCounter.java:64)
> Running org.kududb.mapreduce.tools.ITImportCsv
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 3.696 sec <<< 
> FAILURE! - in org.kududb.mapreduce.tools.ITImportCsv
> test(org.kududb.mapreduce.tools.ITImportCsv)  Time elapsed: 1.053 sec  <<< 
> ERROR!
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.yarn.proto.YarnProtos$LocalResourceProto.hashLong(J)I
>   at 
> org.apache.hadoop.yarn.proto.YarnProtos$LocalResourceProto.hashCode(YarnProtos.java:11655)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.LocalResourcePBImpl.hashCode(LocalResourcePBImpl.java:62)
>   at java.util.HashMap.hash(HashMap.java:362)
>   at java.util.HashMap.put(HashMap.java:492)
>   at 
> org.apache.hadoop.mapred.LocalDistributedCacheManager.setup(LocalDistributedCacheManager.java:133)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.(LocalJobRunner.java:163)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:731)
>   at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:536)
>   at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1306)
>   at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1303)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>   at org.apache.hadoop.mapreduce.Job.submit(Job.java:1303)
>   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1324)
>   at org.kududb.mapreduce.tools.ITImportCsv.test(ITImportCsv.java:103)
> Results :
> Tests in error: 
>   ITImportCsv.test:103 » NoSuchMethod 
> org.apache.hadoop.yarn.proto.YarnProtos$Lo...
>   ITRowCounter.test:64 » NoSuchMethod 
> org.apache.hadoop.yarn.proto.YarnProtos$Lo...
> Tests run: 2, Failures: 0, Errors: 2, Skipped: 0
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-1528) heap-use-after-free in Peer::ProcessResponse while deleting a tablet

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1528:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> heap-use-after-free in Peer::ProcessResponse while deleting a tablet 
> -
>
> Key: KUDU-1528
> URL: https://issues.apache.org/jira/browse/KUDU-1528
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 0.9.1
>Reporter: Adar Dembo
>
> Observed this in a [master-stress-test 
> failure|http://104.196.14.100/job/kudu-gerrit/2314/BUILD_TYPE=ASAN]. It 
> appears that we deleted a tablet while we were processing the response to 
> ConsensusService::UpdateConsensus().
> {noformat}
> ==721==ERROR: AddressSanitizer: heap-use-after-free on address 0x6140001b8fb0 
> at pc 0x7f54968ac684 bp 0x7f5487e485b0 sp 0x7f5487e485a8
> READ of size 8 at 0x6140001b8fb0 thread T6 (rpc reactor-728)
> #0 0x7f54968ac683 in scoped_refptr::operator 
> kudu::Histogram* scoped_refptr::*() const 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/gutil/ref_counted.h:269:38
> #1 0x7f5491297e5b in 
> kudu::ThreadPool::Submit(std::shared_ptr const&) 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/util/threadpool.cc:241:7
> #2 0x7f54912978cd in kudu::ThreadPool::SubmitFunc(boost::function ()> const&) 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/util/threadpool.cc:182:10
> #3 0x7f54912976ef in kudu::ThreadPool::SubmitClosure(kudu::Callback ()> const&) 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/util/threadpool.cc:178:10
> #4 0x7f5497f3929c in kudu::consensus::Peer::ProcessResponse() 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/consensus/consensus_peers.cc:263:14
> #5 0x7f5497f440ad in boost::_bi::bind_t kudu::consensus::Peer>, 
> boost::_bi::list1 > >::operator()() 
> /home/jenkins-slave/workspace/kudu-3/thirdparty/installed/include/boost/bind/bind.hpp:1222:16
> #6 0x7f549688135e in boost::function0::operator()() const 
> /home/jenkins-slave/workspace/kudu-3/thirdparty/installed/include/boost/function/function_template.hpp:770:14
> #7 0x7f549687ca1e in kudu::rpc::OutboundCall::CallCallback() 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/rpc/outbound_call.cc:189:5
> #8 0x7f549687ce01 in 
> kudu::rpc::OutboundCall::SetResponse(gscoped_ptr kudu::DefaultDeleter >) 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/rpc/outbound_call.cc:221:5
> #9 0x7f549688ea1f in 
> kudu::rpc::Connection::HandleCallResponse(gscoped_ptr  kudu::DefaultDeleter >) 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/rpc/connection.cc:534:3
> #10 0x7f549688ded2 in kudu::rpc::Connection::ReadHandler(ev::io&, int) 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/rpc/connection.cc:470:7
> #11 0x7f5496184605 in ev_invoke_pending 
> /home/jenkins-slave/workspace/kudu-2/thirdparty/libev-4.20/ev.c:3155
> #12 0x7f54961854f7 in ev_run 
> /home/jenkins-slave/workspace/kudu-2/thirdparty/libev-4.20/ev.c:3555
> #13 0x7f54968c6cca in kudu::rpc::ReactorThread::RunThread() 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/rpc/reactor.cc:306:3
> #14 0x7f54968d69dd in boost::_bi::bind_t kudu::rpc::ReactorThread>, 
> boost::_bi::list1 > 
> >::operator()() 
> /home/jenkins-slave/workspace/kudu-3/thirdparty/installed/include/boost/bind/bind.hpp:1222:16
> #15 0x7f549688135e in boost::function0::operator()() const 
> /home/jenkins-slave/workspace/kudu-3/thirdparty/installed/include/boost/function/function_template.hpp:770:14
> #16 0x7f5491285dd6 in kudu::Thread::SuperviseThread(void*) 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/util/thread.cc:586:3
> #17 0x7f5493bcb181 in start_thread 
> /build/eglibc-3GlaMS/eglibc-2.19/nptl/pthread_create.c:312
> #18 0x7f548e6fe47c in clone 
> /build/eglibc-3GlaMS/eglibc-2.19/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> 0x6140001b8fb0 is located 368 bytes inside of 416-byte region 
> [0x6140001b8e40,0x6140001b8fe0)
> freed by thread T31 (rpc worker-753) here:
> #0 0x4f54d0 in operator delete(void*) 
> /home/jenkins-slave/workspace/kudu-3/thirdparty/llvm-3.8.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cc:94
> #1 0x7f5497f9e40d in kudu::consensus::RaftConsensus::~RaftConsensus() 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/consensus/raft_consensus.cc:244:1
> #2 0x7f5497f9e551 in kudu::consensus::RaftConsensus::~RaftConsensus() 
> /home/jenkins-slave/workspace/kudu-3/src/kudu/consensus/raft_consensus.cc:242:33
> #3 0x7f5498adf6e8 in 
> scoped_refptr::operator=(kudu::consensus::Consensus*)
>  
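The failure pattern is generic: a raw `this` captured by an asynchronous response callback outlives the refcounted object it points to once the tablet (and with it the RaftConsensus instance and its peers) is torn down. Below is a minimal sketch of the pinning idiom that avoids this, with std::shared_ptr standing in for scoped_refptr; all names are illustrative, not Kudu's actual fix.

{code}
#include <functional>
#include <iostream>
#include <memory>

// Illustrative stand-in for a refcounted consensus Peer; std::shared_ptr
// plays the role of scoped_refptr here.
class Peer : public std::enable_shared_from_this<Peer> {
 public:
  void SendRequest(const std::function<void(std::function<void()>)>& rpc) {
    // Bug pattern: capturing raw 'this' lets tablet deletion destroy the
    // Peer before the response callback runs (heap-use-after-free):
    //   rpc([this] { ProcessResponse(); });
    // Fix pattern: pin the object for the duration of the callback.
    auto self = shared_from_this();
    rpc([self] { self->ProcessResponse(); });
  }
  void ProcessResponse() { std::cout << "response processed safely\n"; }
};

int main() {
  std::function<void()> deferred;
  {
    auto peer = std::make_shared<Peer>();
    peer->SendRequest([&](std::function<void()> cb) { deferred = std::move(cb); });
  }  // Last external reference dropped, as if the tablet were deleted.
  deferred();  // Still safe: the callback holds its own reference.
  return 0;
}
{code}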

[jira] [Updated] (KUDU-1876) Poor error messages and behavior when webserver TLS is misconfigured

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1876:
-
Target Version/s: 1.6.0

> Poor error messages and behavior when webserver TLS is misconfigured
> 
>
> Key: KUDU-1876
> URL: https://issues.apache.org/jira/browse/KUDU-1876
> Project: Kudu
>  Issue Type: Bug
>  Components: security, supportability
>Affects Versions: 1.3.0
>Reporter: Adar Dembo
>
> I was playing around with Cloudera Manager's upcoming webserver TLS support 
> and found a couple cases where misconfigurations led to confusing error 
> messages and other weird behavior. I focused on *webserver_private_key_file*, 
> *webserver_certificate_file*, and *webserver_private_key_password_cmd*.
> *webserver_private_key_file* is unset, but *webserver_certificate_file* and 
> *webserver_private_key_password_cmd* are set: the server crashes (good) but 
> with a fairly inscrutable error message:
> {noformat}
> I0213 18:49:50.606950  2265 webserver.cc:144] Webserver: Enabling HTTPS 
> support
> I0213 18:49:50.607322  2265 webserver.cc:293] Webserver: set_ssl_option: 
> cannot open /etc/adar_kudu_tls/cert.pem: error:0906D06C:PEM 
> routines:PEM_read_bio:no start line
> W0213 18:49:50.607375  2265 net_util.cc:293] Failed to bind to 0.0.0.0:8051. 
> Trying to use lsof to find any processes listening on the same port:
> I0213 18:49:50.607393  2265 net_util.cc:296] $ export PATH=$PATH:/usr/sbin ; 
> lsof -n -i 'TCP:8051' -sTCP:LISTEN ; for pid in $(lsof -F p -n -i 'TCP:8051' 
> -sTCP:LISTEN | grep p | cut -f 2 -dp) ; do  while [ $pid -gt 1 ] ; dops h 
> -fp $pid ;stat=($( W0213 18:49:50.632638  2265 net_util.cc:303] 
> F0213 18:49:50.632704  2265 master_main.cc:71] Check failed: _s.ok() Bad 
> status: Network error: Webserver: Could not start on address 0.0.0.0:8051
> {noformat}
> *webserver_private_key_file*, *webserver_certificate_file*, and 
> *webserver_private_key_password_cmd* are all set, but the password command 
> script yields the wrong password: the server crashes (good) but the error 
> message is inscrutable: 
> {noformat}
> I0213 18:35:34.581714 32633 webserver.cc:293] Webserver: set_ssl_option: 
> cannot open /etc/adar_kudu_tls/cert.pem: error:06065064:digital envelope 
> routines:EVP_DecryptFinal_ex:bad decrypt
> W0213 18:35:34.581794 32633 net_util.cc:293] Failed to bind to 0.0.0.0:8051. 
> Trying to use lsof to find any processes listening on the same port:
> I0213 18:35:34.581811 32633 net_util.cc:296] $ export PATH=$PATH:/usr/sbin ; 
> lsof -n -i 'TCP:8051' -sTCP:LISTEN ; for pid in $(lsof -F p -n -i 'TCP:8051' 
> -sTCP:LISTEN | grep p | cut -f 2 -dp) ; do  while [ $pid -gt 1 ] ; dops h 
> -fp $pid ;stat=($( W0213 18:35:34.605216 32633 net_util.cc:303] 
> F0213 18:35:34.605254 32633 master_main.cc:71] Check failed: _s.ok() Bad 
> status: Network error: Webserver: Could not start on address 0.0.0.0:8051
> {noformat}
> *webserver_private_key_file* and *webserver_private_key_password_cmd* are 
> set, but *webserver_certificate_file* is not: the server starts up (probably 
> bad?) and any attempt to access the webui on the https port yields a "This 
> site can’t provide a secure connection" message in the browser with 
> ERR_SSL_PROTOCOL_ERROR as the error code. I only tested with Chromium.
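One generic way to make these failures scrutable is to validate the flag combination up front, before the embedded webserver ever touches the key or certificate files, so the process dies with a message naming the flags instead of an OpenSSL decode error at bind time. A minimal sketch under that assumption; the struct and function below are hypothetical, not Kudu's actual code.

{code}
#include <stdexcept>
#include <string>

// Hypothetical mirror of the webserver TLS flags discussed above.
struct WebserverTlsFlags {
  std::string private_key_file;
  std::string certificate_file;
  std::string private_key_password_cmd;
};

// Fail fast with a message naming the misconfigured flags, instead of
// surfacing "PEM_read_bio:no start line" or "bad decrypt" at bind time.
void ValidateTlsFlags(const WebserverTlsFlags& f) {
  if (f.certificate_file.empty() != f.private_key_file.empty()) {
    throw std::invalid_argument(
        "--webserver_certificate_file and --webserver_private_key_file "
        "must be set together");
  }
  if (!f.private_key_password_cmd.empty() && f.private_key_file.empty()) {
    throw std::invalid_argument(
        "--webserver_private_key_password_cmd requires "
        "--webserver_private_key_file");
  }
}
{code}

Test-decrypting the key once at startup would likewise catch the wrong-password case with an equally direct message.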



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KUDU-2045) Data race on process_memory::g_hard_limit

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved KUDU-2045.
--
   Resolution: Fixed
 Assignee: Todd Lipcon
Fix Version/s: 1.5.0

Todd fixed it in 230ed20d54a3bf2a9e01481e68d06da3d67e42d5

> Data race on process_memory::g_hard_limit
> -
>
> Key: KUDU-2045
> URL: https://issues.apache.org/jira/browse/KUDU-2045
> Project: Kudu
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>Assignee: Todd Lipcon
>  Labels: newbie
> Fix For: 1.5.0
>
>
> Saw this in a linked_list-test TSAN run. I don't think it's related to the 
> changes I currently have in my tree:
> {noformat}
> ==
> WARNING: ThreadSanitizer: data race (pid=19052)
>   Write of size 8 at 0x7f04796dc478 by thread T123 (mutexes: write M1221):
> #0 kudu::process_memory::(anonymous namespace)::DoInitLimits() 
> /home/adar/Source/kudu/src/kudu/util/process_memory.cc:166:16 
> (libkudu_util.so+0x1a5719)
> #1 GoogleOnceInternalInit(int*, void (*)(), void (*)(void*), void*) 
> /home/adar/Source/kudu/src/kudu/gutil/once.cc:38:7 (libgutil.so+0x35a87)
> #2 GoogleOnceInit(GoogleOnceType*, void (*)()) 
> /home/adar/Source/kudu/src/kudu/gutil/once.h:55:5 (libtserver.so+0xc69c3)
> #3 kudu::process_memory::(anonymous namespace)::InitLimits() 
> /home/adar/Source/kudu/src/kudu/util/process_memory.cc:184:3 
> (libkudu_util.so+0x1a5511)
> #4 kudu::process_memory::UnderMemoryPressure(double*) 
> /home/adar/Source/kudu/src/kudu/util/process_memory.cc:221:3 
> (libkudu_util.so+0x1a544a)
> #5 
> _ZNSt3__18__invokeIRPFbPdEJS1_EEEDTclclsr3std3__1E7forwardIT_Efp_Espclsr3std3__1E7forwardIT0_Efp0_EEEOS5_DpOS6_
>  
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/c++/v1/type_traits:4301:1
>  (libkudu_util.so+0x16aa6d)
> #6 bool std::__1::__invoke_void_return_wrapper::__call (*&)(double*), double*>(bool (*&)(double*), double*&&) 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/c++/v1/__functional_base:328
>  (libkudu_util.so+0x16aa6d)
> #7 std::__1::__function::__func std::__1::allocator, bool 
> (double*)>::operator()(double*&&) 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/c++/v1/functional:1552:12
>  (libkudu_util.so+0x16a974)
> #8 std::__1::function::operator()(double*) const 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/c++/v1/functional:1914:12
>  (libkudu_util.so+0x168b0d)
> #9 kudu::MaintenanceManager::FindBestOp() 
> /home/adar/Source/kudu/src/kudu/util/maintenance_manager.cc:383:7 
> (libkudu_util.so+0x165e66)
> #10 kudu::MaintenanceManager::RunSchedulerThread() 
> /home/adar/Source/kudu/src/kudu/util/maintenance_manager.cc:245:25 
> (libkudu_util.so+0x164650)
> #11 boost::_mfi::mf0 kudu::MaintenanceManager>::operator()(kudu::MaintenanceManager*) const 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/boost/bind/mem_fn_template.hpp:49:29
>  (libkudu_util.so+0x16b876)
> #12 void boost::_bi::list1 
> >::operator(), 
> boost::_bi::list0>(boost::_bi::type, boost::_mfi::mf0 kudu::MaintenanceManager>&, boost::_bi::list0&, int) 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/boost/bind/bind.hpp:259:9
>  (libkudu_util.so+0x16b7ca)
> #13 boost::_bi::bind_t kudu::MaintenanceManager>, 
> boost::_bi::list1 > 
> >::operator()() 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/boost/bind/bind.hpp:1222:16
>  (libkudu_util.so+0x16b753)
> #14 
> boost::detail::function::void_function_obj_invoker0 boost::_mfi::mf0, 
> boost::_bi::list1 > >, 
> void>::invoke(boost::detail::function::function_buffer&) 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/boost/function/function_template.hpp:159:11
>  (libkudu_util.so+0x16b559)
> #15 boost::function0::operator()() const 
> /home/adar/Source/kudu/thirdparty/installed/tsan/include/boost/function/function_template.hpp:770:14
>  (libkrpc.so+0xb0c11)
> #16 kudu::Thread::SuperviseThread(void*) 
> /home/adar/Source/kudu/src/kudu/util/thread.cc:591:3 
> (libkudu_util.so+0x1bce7e)
>   Previous read of size 8 at 0x7f04796dc478 by thread T121:
> #0 kudu::process_memory::HardLimit() 
> /home/adar/Source/kudu/src/kudu/util/process_memory.cc:217:10 
> (libkudu_util.so+0x1a540a)
> #1 kudu::MemTrackersHandler(kudu::WebCallbackRegistry::WebRequest const&, 
> std::__1::basic_ostringstream std::__1::allocator >*) 
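The shape of the race: one thread writes the lazily initialized global inside the once-initializer while another thread reads it without synchronization, so there is no happens-before edge between the write and the read. A minimal sketch of the standard remedy, routing every read through the once-initialization; std::call_once stands in for GoogleOnce, and this is illustrative rather than necessarily what 230ed20 does.

{code}
#include <cstdint>
#include <mutex>

namespace process_memory {
namespace {
int64_t g_hard_limit;  // Written exactly once, under the once-flag.
std::once_flag g_limits_once;

void DoInitLimits() {
  g_hard_limit = 4LL * 1024 * 1024 * 1024;  // e.g. derived from flags
}
}  // anonymous namespace

// Every reader funnels through call_once, which runs the initializer at
// most once and establishes a happens-before edge for the read below,
// eliminating the race TSAN reported.
int64_t HardLimit() {
  std::call_once(g_limits_once, DoInitLimits);
  return g_hard_limit;
}
}  // namespace process_memory
{code}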
> 

[jira] [Commented] (KUDU-1489) Use WAL directory for tablet metadata files

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142283#comment-16142283
 ] 

Jean-Daniel Cryans commented on KUDU-1489:
--

[~anjuwong] does this jira still make sense given the things you've been 
working on recently? Should it fold into some other jira?

> Use WAL directory for tablet metadata files
> ---
>
> Key: KUDU-1489
> URL: https://issues.apache.org/jira/browse/KUDU-1489
> Project: Kudu
>  Issue Type: Improvement
>  Components: consensus, fs, tserver
>Affects Versions: 0.9.0
>Reporter: Adar Dembo
>
> Today a tserver will place tablet metadata files (i.e. superblock and cmeta 
> files) in the first configured data directory. I don't remember why we 
> decided to do this (commit 691f97d introduced it), but upon reconsideration 
> the WAL directory seems like a much better choice, because if the machine has 
> different kinds of I/O devices, the WAL directory's device is typically the 
> fastest.
> Mostafa has been testing Impala and Kudu on a cluster with many thousands of 
> tablets. His cluster contains storage-dense machines, each configured with 14 
> spinning disks and one flash device. Naturally, the WAL directory sits on 
> that flash device and the data directories are on the spinning disks. With 
> thousands of tablet metadata files on the first spinning disk, nearly every 
> tablet in the tserver is bottlenecked on that device due to the sheer amount 
> of I/O needed to maintain the running state of the tablet, specifically 
> rewriting cmeta files on various Raft events (votes, term advancement, etc.).
> Many thousands of tablets is not really a good scale for Kudu right now, but 
> moving the tablet metadata files to a faster device should at least help with 
> the above.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-1397) Allow building safely with custom toolchains

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1397:
-
Target Version/s: Backlog  (was: 1.5.0)

> Allow building safely with custom toolchains
> 
>
> Key: KUDU-1397
> URL: https://issues.apache.org/jira/browse/KUDU-1397
> Project: Kudu
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.8.0
>Reporter: Adar Dembo
>
> Casey uncovered several issues when building Kudu with the Impala toolchain; 
> this report attempts to capture them.
> The first and most important issue was a random SIGSEGV during a flush:
> {noformat}
> (gdb) bt
> #0 0x00e82540 in kudu::CopyCellData kudu::ColumnBlockCell, kudu::Arena> (src=..., dst=0x7ff9c637d5e0, 
> dst_arena=0x0)
> at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/common/row.h:79
> #1 0x00e80e33 in kudu::CopyCell kudu::ColumnBlockCell, kudu::Arena> (src=..., dst=0x7ff9c637d5e0, 
> dst_arena=0x0)
> at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/common/row.h:103
> #2 0x00e7f647 in kudu::CopyRow kudu::Arena> (src_row=..., dst_row=0x7ff9c637d870, dst_arena=0x0)
> at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/common/row.h:119
> #3  0x00e76773 in kudu::tablet::FlushCompactionInput 
> (input=0x3894f00, snap=..., out=0x7ff9c637dbf0)
> at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/tablet/compaction.cc:768
> #4  0x00e23f5a in kudu::tablet::Tablet::DoCompactionOrFlush 
> (this=0x395a840, input=..., mrs_being_flushed=0)
> at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/tablet/tablet.cc:1221
> #5  0x00e202b2 in kudu::tablet::Tablet::FlushInternal 
> (this=0x395a840, input=..., old_ms=...) at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/tablet/tablet.cc:744
> #6  0x00e1f8f6 in kudu::tablet::Tablet::FlushUnlocked 
> (this=0x395a840) at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/tablet/tablet.cc:678
> #7  0x00f1b3a3 in kudu::tablet::FlushMRSOp::Perform (this=0x38b9340) 
> at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/tablet/tablet_peer_mm_ops.cc:127
> #8  0x00ea19d7 in kudu::MaintenanceManager::LaunchOp (this=0x3904360, 
> op=0x38b9340) at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/tablet/maintenance_manager.cc:360
> #9  0x00ea6502 in boost::_mfi::mf1 kudu::MaintenanceOp*>::operator() (this=0x3d492a0, p=0x3904360, a1=0x38b9340)
> at 
> /home/casey/Code/native-toolchain/build/boost-1.57.0/include/boost/bind/mem_fn_template.hpp:165
> #10 0x00ea6163 in 
> boost::_bi::list2, 
> boost::_bi::value >::operator() kudu::MaintenanceManager, kudu::MaintenanceOp*>, boost::_bi::list0> 
> (this=0x3d492b0, f=..., a=...) at 
> /home/casey/Code/native-toolchain/build/boost-1.57.0/include/boost/bind/bind.hpp:313
> #11 0x00ea5bed in boost::_bi::bind_t kudu::MaintenanceManager, kudu::MaintenanceOp*>, 
> boost::_bi::list2, 
> boost::_bi::value > >::operator() (this=0x3d492a0) at 
> /home/casey/Code/native-toolchain/build/boost-1.57.0/include/boost/bind/bind_template.hpp:20
> #12 0x00ea57ec in 
> boost::detail::function::void_function_obj_invoker0 boost::_mfi::mf1, 
> boost::_bi::list2, 
> boost::_bi::value > >, void>::invoke 
> (function_obj_ptr=...) at 
> /home/casey/Code/native-toolchain/build/boost-1.57.0/include/boost/function/function_template.hpp:153
> #13 0x01c4205e in boost::function0::operator() (this=0x3c01838) 
> at 
> /home/casey/Code/native-toolchain/build/boost-1.57.0/include/boost/function/function_template.hpp:767
> #14 0x01d73aa4 in kudu::FunctionRunnable::Run (this=0x3c01830) at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/util/threadpool.cc:47
> #15 0x01d73062 in kudu::ThreadPool::DispatchThread (this=0x38c8340, 
> permanent=true) at 
> /home/casey/Code/native-toolchain/source/kudu/incubator-kudu-0.8.0-RC1/src/kudu/util/threadpool.cc:321
> #16 0x01d76740 in boost::_mfi::mf1 bool>::operator() 

[jira] [Updated] (KUDU-1489) Use WAL directory for tablet metadata files

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1489:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> Use WAL directory for tablet metadata files
> ---
>
> Key: KUDU-1489
> URL: https://issues.apache.org/jira/browse/KUDU-1489
> Project: Kudu
>  Issue Type: Improvement
>  Components: consensus, fs, tserver
>Affects Versions: 0.9.0
>Reporter: Adar Dembo
>
> Today a tserver will place tablet metadata files (i.e. superblock and cmeta 
> files) in the first configured data directory. I don't remember why we 
> decided to do this (commit 691f97d introduced it), but upon reconsideration 
> the WAL directory seems like a much better choice, because if the machine has 
> different kinds of I/O devices, the WAL directory's device is typically the 
> fastest.
> Mostafa has been testing Impala and Kudu on a cluster with many thousands of 
> tablets. His cluster contains storage-dense machines, each configured with 14 
> spinning disks and one flash device. Naturally, the WAL directory sits on 
> that flash device and the data directories are on the spinning disks. With 
> thousands of tablet metadata files on the first spinning disk, nearly every 
> tablet in the tserver is bottlenecked on that device due to the sheer amount 
> of I/O needed to maintain the running state of the tablet, specifically 
> rewriting cmeta files on various Raft events (votes, term advancement, etc.).
> Many thousands of tablets is not really a good scale for Kudu right now, but 
> moving the tablet metadata files to a faster device should at least help with 
> the above.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-997) Expose client-side metrics

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-997:

Target Version/s: 1.6.0  (was: 1.5.0)

> Expose client-side metrics
> --
>
> Key: KUDU-997
> URL: https://issues.apache.org/jira/browse/KUDU-997
> Project: Kudu
>  Issue Type: New Feature
>  Components: client
>Affects Versions: Feature Complete
>Reporter: Adar Dembo
> Attachments: patch
>
>
> I think client-side metrics have been a desirable feature for quite some 
> time, but I especially wanted them while debugging KUDU-993.
> There are some challenges in collecting metric data in a cohesive way across 
> the client (at least in C++, where there isn't a completely uniform way to 
> send/receive RPCs). But I think the main challenge is figuring out how to 
> expose it to users. I'm not sure we want to expose metrics.h directly, 
> because it's deeply intertwined with gutil and other Kudu util code.
> I'm attaching a patch I wrote yesterday to help with KUDU-993. It doesn't 
> tackle the API problem at all, but shows how to build a histogram tracking 
> all writes.
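For a sense of scale, here is a tiny self-contained write-latency histogram of the kind a client could keep per session without exposing metrics.h; this is a hedged sketch with power-of-two buckets, not Kudu's HdrHistogram-based Histogram class.

{code}
#include <array>
#include <atomic>
#include <cstdint>

// Minimal lock-free latency histogram: bucket b counts samples in
// [2^b, 2^(b+1)) microseconds, so 32 buckets cover up to ~36 minutes.
class WriteLatencyHistogram {
 public:
  static constexpr int kBuckets = 32;

  void Record(uint64_t micros) {
    int b = 0;
    while (micros > 1 && b < kBuckets - 1) { micros >>= 1; ++b; }
    buckets_[b].fetch_add(1, std::memory_order_relaxed);
  }

  uint64_t CountInBucket(int b) const {
    return buckets_[b].load(std::memory_order_relaxed);
  }

 private:
  std::array<std::atomic<uint64_t>, kBuckets> buckets_{};
};
{code}

The hard part the description calls out remains: not the data structure, but wiring recording into every RPC path and exposing the result through a stable public API.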



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-683) Clean up multi-master tech debt

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142276#comment-16142276
 ] 

Jean-Daniel Cryans commented on KUDU-683:
-

[~adar] do you think we still have a lot of tech debt here?

> Clean up multi-master tech debt
> ---
>
> Key: KUDU-683
> URL: https://issues.apache.org/jira/browse/KUDU-683
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: M5
>Reporter: Adar Dembo
>
> Multi-master support in the C++ client has introduced a fair amount of 
> RPC-related tech debt. There's a lot of duplication in the handling of 
> timeouts, retries, and error conditions. The various callbacks are also 
> tricky to follow and error prone. Now that the code has settled and we 
> understand what's painful about it, we're in a better position to fix it.
> Here's a high-level design idea: there should only be one RPC class that's 
> responsible for RPC delivery end-to-end, including retries, leader master 
> discovery, etc. Within that class there should be a single callback that's 
> reused for every asynchronous function, and there should be a separate state 
> machine that tracks the ongoing status of the RPC. Per-RPC specialization 
> should be as minimal as possible, via templates on the PBs, callbacks, or, 
> worst case, subclassing.
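A hedged sketch of the shape that design could take, with hypothetical names: one templated class owns the retry state machine and a single reused response callback, so per-RPC code supplies only the request/response types and a send function.

{code}
#include <functional>
#include <utility>

// Hypothetical skeleton of the "one RPC class" idea described above.
template <typename ReqPB, typename RespPB>
class RetriableMasterRpc {
 public:
  enum class State { kSending, kDone };
  using SendFn = std::function<void(const ReqPB&, RespPB*,
                                    std::function<void(bool ok)>)>;

  RetriableMasterRpc(ReqPB req, SendFn send, int max_attempts)
      : req_(std::move(req)), send_(std::move(send)),
        attempts_left_(max_attempts) {}

  void Run(std::function<void(bool ok, const RespPB&)> done) {
    done_ = std::move(done);
    state_ = State::kSending;
    send_(req_, &resp_, [this](bool ok) { HandleResponse(ok); });
  }

 private:
  // The single callback reused for every attempt: all timeout, retry, and
  // leader-rediscovery policy lives here instead of at each call site.
  void HandleResponse(bool ok) {
    if (ok || --attempts_left_ <= 0) {
      state_ = State::kDone;
      done_(ok, resp_);
      return;
    }
    // A real implementation would back off and re-find the leader here.
    send_(req_, &resp_, [this](bool ok2) { HandleResponse(ok2); });
  }

  ReqPB req_;
  RespPB resp_;
  SendFn send_;
  int attempts_left_;
  State state_ = State::kSending;
  std::function<void(bool, const RespPB&)> done_;
};
{code}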



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-2014) Explore additional approaches to improve LBM startup time

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-2014:
-
Issue Type: Improvement  (was: Bug)

> Explore additional approaches to improve LBM startup time
> -
>
> Key: KUDU-2014
> URL: https://issues.apache.org/jira/browse/KUDU-2014
> Project: Kudu
>  Issue Type: Improvement
>  Components: fs
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>  Labels: data-scalability
>
> The fix for KUDU-1549 added support for deleting full log block manager 
> containers with no live blocks, and for compacting container metadata to omit 
> CREATE/DELETE record pairs. Both of these will help reduce the amount of 
> metadata that must be read at startup. However, there's more we can do to 
> help; this JIRA captures some additional ideas worth exploring (if/when LBM 
> startup once again becomes intolerable):
> In [this 
> gerrit|https://gerrit.cloudera.org/#/c/6826/2/src/kudu/fs/log_block_manager.cc@90],
>  Todd made the case that container metadata processing is seek-dominant:
> {quote}
> looking at a data/ dir on a cluster that has been around for quite some time, 
> most of the metadata files seem to be around 400KB. Assuming 100MB/sec 
> sequential throughput and 10ms seek, it definitely seems like the startup 
> time would be seek-dominated (10 or 20ms seek depending whether various 
> internal metadata pages are hot in cache, plus only 4ms of sequential read 
> time). 
> {quote}
> We theorized several ways to reduce seeking, all focused on reducing the 
> number of discrete container metadata files read at startup:
> # Raise the container max data file size. This won't help on older versions 
> of el6 with ext4, but will help everywhere else. It makes sense for the max 
> data file size to be a function of the disk size anyway. And it's a pretty 
> cheap way to extract more scalability.
> # Reuse container data file holes, explicitly to avoid creating so many 
> containers. Perhaps with a round of "defragmentation" to simplify reuse, or 
> perhaps not. As a side effect, metadata file compaction now becomes more 
> important (and costly).
> # Eschew one metadata file per data file altogether and maintain just one 
> metadata file. Deleting "dead" containers would no longer be an improvement 
> for metadata startup cost. Metadata compaction would be a lot more expensive. 
> Block records themselves would be larger, because each record now needs to 
> point to a particular data file, though this can be mitigated in various 
> ways. A variant of this would be to do away with the 1-1 relationship between 
> metadata and data files and make it more like m-n.
> # Reduce the number of extents in container metadata files via judicious 
> preallocation.
> See the gerrit linked above for more details.
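To make the seek-dominance argument concrete, a rough worked estimate using the numbers quoted above (assuming one 10ms seek per metadata file, which if anything understates real filesystems):

{noformat}
per container:     10 ms seek + 400 KB / (100 MB/s) = 10 ms + 4 ms = 14 ms
10,000 containers: 10,000 * 14 ms = 140 s, of which only ~40 s is data transfer
{noformat}

Ideas 1-3 above all attack the 10 ms term by cutting the number of discrete files opened at startup.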



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-477) Implement block storage microbenchmark

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-477:

Target Version/s: Backlog  (was: 1.5.0)

> Implement block storage microbenchmark
> --
>
> Key: KUDU-477
> URL: https://issues.apache.org/jira/browse/KUDU-477
> Project: Kudu
>  Issue Type: Sub-task
>  Components: fs
>Affects Versions: M4.5
>Reporter: Adar Dembo
>Assignee: Adar Dembo
>
> With two block storage allocation strategies implemented, we ought to develop 
> a synthetic microbenchmark to evaluate the two (and future allocation 
> strategies).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-1358) Following a master leader election, create table may fail

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1358:
-
Target Version/s: Backlog  (was: 1.5.0)

> Following a master leader election, create table may fail
> -
>
> Key: KUDU-1358
> URL: https://issues.apache.org/jira/browse/KUDU-1358
> Project: Kudu
>  Issue Type: Sub-task
>  Components: master
>Affects Versions: 0.7.0
>Reporter: Adar Dembo
>Assignee: Adar Dembo
>
> In the current multi-master design and implementation, tservers only 
> heartbeat to the leader master. After a master leader election, there's a 
> short window of time in which the new leader master may not be aware of the 
> existence of some (or even all) of the tservers. Attempts to create a table 
> during this window may fail, as the tservers known to the new leader master 
> may be too few to satisfy the new table's replication factor. Whether the 
> window exists in the first place depends on whether the new leader master had 
> been leader before, and whether any of the tservers had sent heartbeats to it 
> during that time.
> Some possible solutions include:
> # Modifying the heartbeat protocol so that tservers heartbeat to _all_ 
> masters, leaders and followers alike. Doing this will ensure that the "soft 
> state" belonging to any master is always up-to-date at the cost of network 
> bandwidth lost to heartbeating. Additionally, changes may need to be made to 
> ensure that a follower master can't cause a tserver to take any actions.
> # Never actually failing a create table request due to too few tservers, 
> instead allowing it to linger until such a time when more tservers exist. For 
> this to actually be practical we'd need to allow clients to "cancel" a 
> previously issued create table request.
> Both approaches probably include additional ramifications; this problem needs 
> to be thought through carefully.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-1520) Possible race between alter schema lock release and tablet shutdown

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142272#comment-16142272
 ] 

Jean-Daniel Cryans commented on KUDU-1520:
--

[~adar] is this still an issue?

> Possible race between alter schema lock release and tablet shutdown
> ---
>
> Key: KUDU-1520
> URL: https://issues.apache.org/jira/browse/KUDU-1520
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 0.9.1
>Reporter: Adar Dembo
>
> I've been running a new stress that hammers a cluster with concurrent alter 
> and delete table requests, and one of my test runs failed with the following:
> {noformat}
> F0707 18:59:34.311122   373 rw_semaphore.h:145] Check failed: 
> base::subtle::NoBarrier_Load(_) == kWriteFlag (0 vs. 2147483648) 
> *** Check failure stack trace: ***
> @ 0x7f86cd37df5d  google::LogMessage::Fail() at ??:0
> @ 0x7f86cd37fe5d  google::LogMessage::SendToLog() at ??:0
> @ 0x7f86cd37da99  google::LogMessage::Flush() at ??:0
> @ 0x7f86cd3808ff  google::LogMessageFatal::~LogMessageFatal() at ??:0
> @ 0x7f86d4f77c78  kudu::rw_semaphore::unlock() at ??:0
> @ 0x7f86d3728de0  std::unique_lock<>::unlock() at ??:0
> @ 0x7f86d3727192  std::unique_lock<>::~unique_lock() at ??:0
> @ 0x7f86d3725582  
> kudu::tablet::AlterSchemaTransactionState::~AlterSchemaTransactionState() at 
> ??:0
> @ 0x7f86d37255be  
> kudu::tablet::AlterSchemaTransactionState::~AlterSchemaTransactionState() at 
> ??:0
> @ 0x7f86d4f68dce  std::default_delete<>::operator()() at ??:0
> @ 0x7f86d4f670b9  std::unique_ptr<>::~unique_ptr() at ??:0
> @ 0x7f86d374510e  
> kudu::tablet::AlterSchemaTransaction::~AlterSchemaTransaction() at ??:0
> @ 0x7f86d374514a  
> kudu::tablet::AlterSchemaTransaction::~AlterSchemaTransaction() at ??:0
> @ 0x7f86d373f532  kudu::DefaultDeleter<>::operator()() at ??:0
> @ 0x7f86d373df4a  
> kudu::internal::gscoped_ptr_impl<>::~gscoped_ptr_impl() at ??:0
> @ 0x7f86d373d552  gscoped_ptr<>::~gscoped_ptr() at ??:0
> @ 0x7f86d373d580  
> kudu::tablet::TransactionDriver::~TransactionDriver() at ??:0
> @ 0x7f86d3740ab4  kudu::RefCountedThreadSafe<>::DeleteInternal() at 
> ??:0
> @ 0x7f86d3740405  
> kudu::DefaultRefCountedThreadSafeTraits<>::Destruct() at ??:0
> @ 0x7f86d373f928  kudu::RefCountedThreadSafe<>::Release() at ??:0
> @ 0x7f86d373e769  scoped_refptr<>::~scoped_refptr() at ??:0
> @ 0x7f86d37397cd  kudu::tablet::TabletPeer::SubmitAlterSchema() at 
> ??:0
> @ 0x7f86d4f4e070  
> kudu::tserver::TabletServiceAdminImpl::AlterSchema() at ??:0
> @ 0x7f86d27a4e92  
> _ZZN4kudu7tserver26TabletServerAdminServiceIfC4ERK13scoped_refptrINS_12MetricEntityEEENKUlPKN6google8protobuf7MessageEPS9_PNS_3rpc10RpcContextEE1_clESB_SC_SF_
>  at ??:0
> @ 0x7f86d27a5d96  
> _ZNSt17_Function_handlerIFvPKN6google8protobuf7MessageEPS2_PN4kudu3rpc10RpcContextEEZNS6_7tserver26TabletServerAdminServiceIfC4ERK13scoped_refptrINS6_12MetricEntityEEEUlS4_S5_S9_E1_E9_M_invokeERKSt9_Any_dataS4_S5_S9_
>  at ??:0
> @ 0x7f86d22ce6e4  std::function<>::operator()() at ??:0
> @ 0x7f86d22ce19b  kudu::rpc::GeneratedServiceIf::Handle() at ??:0
> @ 0x7f86d22d0a97  kudu::rpc::ServicePool::RunThread() at ??:0
> @ 0x7f86d22d1d45  boost::_mfi::mf0<>::operator()() at ??:0
> @ 0x7f86d22d1b6c  boost::_bi::list1<>::operator()<>() at ??:0
> @ 0x7f86d22d1a61  boost::_bi::bind_t<>::operator()() at ??:0
> @ 0x7f86d22d1998  
> boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
> {noformat}
> After looking through the code a bit, I suspect this happened because, in the 
> event of failure, the AlterSchema transaction releases the tablet's schema 
> lock implicitly (i.e. when AlterSchemaTransactionState is destroyed) _after_ 
> the transaction itself is removed from the driver's TransactionTracker. Thus, 
> the WaitForAllToFinish() performed during the tablet shutdown process thinks 
> all the transactions are done and proceeds to free tablet state. Later, the 
> last reference to the transaction is released (in 
> TabletPeer::SubmitAlterSchema), the transaction is destroyed, and we try to 
> unlock a lock whose memory has already been freed.
> If this analysis is correct, the broken invariant is: once the transaction 
> has been released from the tracker, it may no longer access any tablet state.
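If that analysis holds, the mechanical fix is to make the lock release happen-before the untracking. A minimal sketch of the required ordering, with illustrative stand-in types rather than Kudu's actual transaction classes:

{code}
#include <mutex>

// Illustrative stand-ins: the transaction state holds a lock on tablet
// memory; the tracker is what tablet shutdown's WaitForAllToFinish() waits
// on before freeing tablet state.
struct TxnState {
  std::unique_lock<std::mutex> schema_lock;
};

struct TxnTracker {
  void Release() { /* WaitForAllToFinish() may now return */ }
};

// Buggy order: untrack first, release the lock later in ~TxnState. Shutdown
// can free the tablet (and the lock's memory) in between, so the eventual
// unlock touches freed memory.
//
// Safe order: drop every reference to tablet state, then become "finished".
void FinishTransaction(TxnState* state, TxnTracker* tracker) {
  if (state->schema_lock.owns_lock()) {
    state->schema_lock.unlock();  // 1. release all tablet resources
  }
  tracker->Release();             // 2. only now may shutdown proceed
}
{code}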



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-993) Investigate making transaction tracker rejections more fair

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-993:

Issue Type: Improvement  (was: Bug)

> Investigate making transaction tracker rejections more fair
> ---
>
> Key: KUDU-993
> URL: https://issues.apache.org/jira/browse/KUDU-993
> Project: Kudu
>  Issue Type: Improvement
>  Components: tablet
>Affects Versions: Public beta
>Reporter: Adar Dembo
>
> When the transaction tracker hits its memory limit, it'll reject new 
> transactions until pending transactions finish. Clients respond by retrying 
> the failed transactions until the transaction tracker accepts them, or until 
> they timeout.
> The rejection mechanism doesn't take into account how many times a 
> transaction has been retried, and as a result, it's possible for some 
> transactions to be rejected many times over even as other transactions are 
> allowed through. Here's a contrived example: two clients submitting 
> transactions simultaneously, with room for only one pending transaction. 
> Given a long retry backoff delay and a short delay between transactions, it's 
> possible for one client to "hog" the available space while the other 
> continuously retries (each time it retries, the first client has managed to 
> stuff another transaction in).
> We should investigate making this rejection system more fair so that no one 
> transaction is starved.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-790) Not all MM ops acquire locks in Prepare()

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-790:

Target Version/s: Backlog  (was: 1.5.0)

> Not all MM ops acquire locks in Prepare()
> -
>
> Key: KUDU-790
> URL: https://issues.apache.org/jira/browse/KUDU-790
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: M5
>Reporter: Adar Dembo
>
> The maintenance manager invokes Prepare() on its scheduler thread and thunks 
> Perform() to a separate thread pool. If an op doesn't lock the rowsets it's 
> going to modify in Prepare(), there's a chance that the MM scheduler thread 
> may run again before Perform() is invoked (i.e. before those locks are 
> acquired. If this happens, the scheduler thread may compute the same stats 
> and schedules the same op.
> All of the ops are safe to call concurrently (well, those that aren't use 
> external synchronization to ensure that this doesn't happen), but what 
> happens next depends on the specific op and the timing. The second op may 
> no-op, or it may perform less useful work and waste time.
> Ideally Prepare() should acquire any and all locks needed by Perform(), so 
> that if the scheduler thread runs again, it'll compute different stats (since 
> those usually depend on acquiring rowset locks) and either not schedule the 
> op, or schedule a different body of work.
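A minimal sketch of that Prepare()-time locking, with illustrative types; try-locks mean a busy rowset simply makes Prepare() fail, so the scheduler picks other work instead of queueing a duplicate.

{code}
#include <mutex>
#include <utility>
#include <vector>

struct RowSet { std::mutex compact_lock; };

// Illustrative maintenance op: Prepare() runs on the MM scheduler thread,
// Perform() on a worker. Holding the rowset locks from Prepare() onward
// means a second scheduler pass computes different stats and cannot
// schedule the same op twice.
class CompactOp {
 public:
  explicit CompactOp(std::vector<RowSet*> rowsets)
      : rowsets_(std::move(rowsets)) {}

  bool Prepare() {
    for (RowSet* rs : rowsets_) {
      std::unique_lock<std::mutex> l(rs->compact_lock, std::try_to_lock);
      if (!l.owns_lock()) { locks_.clear(); return false; }  // already busy
      locks_.push_back(std::move(l));
    }
    return true;  // locks stay held until Perform() completes
  }

  void Perform() {
    // ... do the compaction under locks_ ...
    locks_.clear();  // release on completion
  }

 private:
  std::vector<RowSet*> rowsets_;
  std::vector<std::unique_lock<std::mutex>> locks_;
};
{code}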



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-761) TS seg fault after short log append

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-761:

Issue Type: Bug  (was: Task)

> TS seg fault after short log append
> ---
>
> Key: KUDU-761
> URL: https://issues.apache.org/jira/browse/KUDU-761
> Project: Kudu
>  Issue Type: Bug
>  Components: log
>Affects Versions: M5
>Reporter: Adar Dembo
>
> I was running tpch_real_world with SF=6000 on a2414. Towards the end of the 
> run, flush MM ops consistently won out over log gc MM due to memory pressure. 
> Eventually we ran out of disk space on the disk hosting the logs:
> {noformat}
> E0511 05:45:47.913038 8174 log.cc:130] Error appending to the log: IO error: 
> pwritev error: expected to write 254370 bytes, wrote 145525 bytes instead
> {noformat}
> What followed was a SIGSEGV:
> {noformat}
> PC: @ 0x0 
> *** SIGSEGV (@0x0) received by PID 8004 (TID 0x7fae308fd700) from PID 0; 
> stack trace: ***
> @ 0x3d4ae0f500 
> @ 0x0 
> @ 0x8a2791 kudu::log::LogEntryPB::~LogEntryPB()
> @ 0x8a2272 kudu::log::LogEntryBatchPB::~LogEntryBatchPB()
> @ 0x8a22e1 kudu::log::LogEntryBatchPB::~LogEntryBatchPB()
> @ 0x892de4 kudu::log::Log::AppendThread::RunThread()
> {noformat}
> There's not much we can do when we run out of disk space, but better to crash 
> with a CHECK or something than to SIGSEGV.
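A hedged sketch of that hardening, with fprintf/abort standing in for glog's CHECK; the function and its offset handling are illustrative only.

{code}
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sys/uio.h>

// Append a batch of buffers to the log fd, dying with a clear message on a
// short or failed write instead of letting later code chase a half-written
// entry into a SIGSEGV.
void AppendOrDie(int fd, const struct iovec* iov, int iovcnt,
                 ssize_t expected_bytes, off_t offset) {
  ssize_t written = pwritev(fd, iov, iovcnt, offset);
  if (written != expected_bytes) {
    std::fprintf(stderr, "FATAL: log append wrote %zd of %zd bytes (%s)\n",
                 written, expected_bytes,
                 written < 0 ? std::strerror(errno) : "short write, disk full?");
    std::abort();  // fail fast with context rather than SIGSEGV later
  }
}
{code}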



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-1544) Race in Java client's AsyncKuduSession.apply()

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142264#comment-16142264
 ] 

Jean-Daniel Cryans commented on KUDU-1544:
--

[~danburkert] [~adar] is this still a thing?

> Race in Java client's AsyncKuduSession.apply()
> --
>
> Key: KUDU-1544
> URL: https://issues.apache.org/jira/browse/KUDU-1544
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.10.0
>Reporter: Adar Dembo
>
> The race is between calls to flushNotification.get() and 
> inactiveBufferAvailable(). Suppose T1 calls inactiveBufferAvailable(), gets 
> back false, but is descheduled before constructing a PleaseThrottleException. 
> Now T2 is scheduled, finishes an outstanding flush, calls queueBuffer(), and 
> resets flushNotification to an empty Deferred. When T1 is rescheduled, it 
> throws a PTE with that empty Deferred.
> What is the effect? If the user waits on the Deferred from the PTE, the user 
> is effectively waiting on "the next flush", which, depending on the stream of 
> operations, may take place soon, may not take place for some time, or may not 
> take place at all.
> To fix this, we should probably reorder the calls to flushNotification.get() 
> in apply() to happen before calls to inactiveBufferAvailable(). That way, a 
> race will yield a stale Deferred rather than an empty one, and waiting on the 
> stale Deferred should be a no-op.
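The reorder generalizes to any snapshot-then-check pattern: capture the notification first, so losing the race yields a stale, already-completed notification (waiting on it is a no-op) rather than an empty one. A C++ analog of the Java logic described above, with std::shared_future standing in for Deferred; every name here is illustrative.

{code}
#include <future>

struct PleaseThrottle {
  std::shared_future<void> wait_on;  // completes with the in-flight flush
};

struct Session {
  std::shared_future<void> flush_notification;  // reset by each queueBuffer()

  bool InactiveBufferAvailable() const {
    return false;  // placeholder: the real check races with flush completion
  }

  void Apply() {
    // Snapshot FIRST: if another thread finishes a flush and resets
    // flush_notification after this line, the caller still receives a
    // stale, already-completed future, and waiting on it is harmless.
    std::shared_future<void> notify = flush_notification;
    if (!InactiveBufferAvailable()) {
      throw PleaseThrottle{notify};
    }
    // ... buffer the operation ...
  }
};
{code}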



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-1961) devtoolset-3 defeats ccache

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1961:
-
Issue Type: Improvement  (was: Bug)

> devtoolset-3 defeats ccache
> ---
>
> Key: KUDU-1961
> URL: https://issues.apache.org/jira/browse/KUDU-1961
> Project: Kudu
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>
> When devtoolset-3 is used (via enable_devtoolset.sh on el6), it's quite 
> likely that ccache will go unused for the build. Certainly for 
> build-thirdparty.sh, and likely for the main Kudu build too (unless you go 
> out of your way to set CC/CXX=ccache when invoking cmake).
> We should be able to fix this in enable_devtoolset.sh, at least in the common 
> case where symlinks to ccache named after the compiler are on the PATH. We 
> could ensure that, following the call to 'scl enable devtoolset-3 <cmd>', 
> ccache symlinks are placed at the head of the PATH, before 
> /opt/rh/devtoolset-3/, and only then is <cmd> actually invoked. This should 
> cause ccache to be used, and it'll chain to the devtoolset-3 compiler because 
> /opt/rh/devtoolset-3/ is ahead of /usr/bin on the PATH. We may need an 
> intermediate script to do this.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-1521) Flakiness in TestAsyncKuduSession

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142261#comment-16142261
 ] 

Jean-Daniel Cryans commented on KUDU-1521:
--

[~adar] wanna try looping dist test and see if this is still an issue?

> Flakiness in TestAsyncKuduSession
> -
>
> Key: KUDU-1521
> URL: https://issues.apache.org/jira/browse/KUDU-1521
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.9.1
>Reporter: Adar Dembo
>
> I've been trying to parse the various failures in 
> http://104.196.14.100/job/kudu-gerrit/2270/BUILD_TYPE=RELEASE. Here's what I 
> see in the test:
> The way test() tests AUTO_FLUSH_BACKGROUND is inherently flaky; a delay while 
> running test code will give the background flush task a chance to fire when 
> the test code doesn't expect it. I've seen this cause lead to no 
> PleaseThrottleException, but I suspect the first block of test code dealing 
> with background flushes is flaky too (since it's testing elapsed time).
> There's also some test failures that I can't figure out. I've pasted them 
> below for posterity:
> {noformat}
> 03:52:14 
> testGetTableLocationsErrorCauseSessionStuck(org.kududb.client.TestAsyncKuduSession)
>   Time elapsed: 100.009 sec  <<< ERROR!
> 03:52:14 java.lang.Exception: test timed out after 10 milliseconds
> 03:52:14  at java.lang.Object.wait(Native Method)
> 03:52:14  at java.lang.Object.wait(Object.java:503)
> 03:52:14  at com.stumbleupon.async.Deferred.doJoin(Deferred.java:1136)
> 03:52:14  at com.stumbleupon.async.Deferred.join(Deferred.java:1019)
> 03:52:14  at 
> org.kududb.client.TestAsyncKuduSession.testGetTableLocationsErrorCauseSessionStuck(TestAsyncKuduSession.java:133)
> 03:52:14 
> 03:52:14 
> testBatchErrorCauseSessionStuck(org.kududb.client.TestAsyncKuduSession)  Time 
> elapsed: 0.199 sec  <<< ERROR!
> 03:52:14 org.kududb.client.MasterErrorException: Server[Kudu Master - 
> 127.13.215.1:64030] NOT_FOUND[code 1]: The table was deleted: Table deleted 
> at 2016-07-09 03:50:24 UTC
> 03:52:14  at 
> org.kududb.client.TabletClient.dispatchMasterErrorOrReturnException(TabletClient.java:533)
> 03:52:14  at org.kududb.client.TabletClient.decode(TabletClient.java:463)
> 03:52:14  at org.kududb.client.TabletClient.decode(TabletClient.java:83)
> 03:52:14  at 
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:500)
> 03:52:14  at 
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
> 03:52:14  at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
> 03:52:14  at 
> org.kududb.client.TabletClient.handleUpstream(TabletClient.java:638)
> 03:52:14  at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
> 03:52:14  at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
> 03:52:14  at 
> org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184)
> 03:52:14  at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
> 03:52:14  at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
> 03:52:14  at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
> 03:52:14  at 
> org.kududb.client.AsyncKuduClient$TabletClientPipeline.sendUpstream(AsyncKuduClient.java:1877)
> 03:52:14  at 
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
> 03:52:14  at 
> org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
> 03:52:14  at 
> org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
> 03:52:14  at 
> org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
> 03:52:14  at 
> org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
> 03:52:14  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 03:52:14  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 03:52:14  at java.lang.Thread.run(Thread.java:745)
> 03:52:14 
> 03:52:14 

[jira] [Updated] (KUDU-1537) Exactly-once semantics for DDL operations

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1537:
-
Target Version/s: Backlog  (was: 1.5.0)

> Exactly-once semantics for DDL operations
> -
>
> Key: KUDU-1537
> URL: https://issues.apache.org/jira/browse/KUDU-1537
> Project: Kudu
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 0.9.1
>Reporter: Adar Dembo
>
> Now that Kudu has a replay cache, we should use it for master DDL operations 
> like CreateTable(), AlterTable(), and DeleteTable(). To do this we'll need to 
> add some client-specific RPC state into the write that the leader master 
> replicates to its followers, and use that state when an RPC is retried.
> Some tests (e.g. master-stress-test) have workarounds that should be removed 
> when this bug is fixed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-1537) Exactly-once semantics for DDL operations

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1537:
-
Issue Type: Improvement  (was: Bug)

> Exactly-once semantics for DDL operations
> -
>
> Key: KUDU-1537
> URL: https://issues.apache.org/jira/browse/KUDU-1537
> Project: Kudu
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 0.9.1
>Reporter: Adar Dembo
>
> Now that Kudu has a replay cache, we should use it for master DDL operations 
> like CreateTable(), AlterTable(), and DeleteTable(). To do this we'll need to 
> add some client-specific RPC state into the write that the leader master 
> replicates to its followers, and use that state when an RPC is retried.
> Some tests (e.g. master-stress-test) have workarounds that should be removed 
> when this bug is fixed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KUDU-1449) tablet unavailable caused by follower can not upgrade to leader.

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved KUDU-1449.
--
   Resolution: Duplicate
Fix Version/s: n/a

Likely fixed by KUDU-1097 and friends.

> tablet unavailable caused by  follower can not upgrade to leader.
> -
>
> Key: KUDU-1449
> URL: https://issues.apache.org/jira/browse/KUDU-1449
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 0.8.0
> Environment: jd.com production env
>Reporter: zhangsong
>Priority: Critical
> Fix For: n/a
>
>
> 1 background: 5 nodes crashed today due to system OOM. According to the Raft 
> protocol, Kudu should elect a follower, promote it to leader, and provide 
> service again, but it did not.
> I found this error when issuing a query via Impala: "Unable to open scanner: 
> Timed out: GetTableLocations(flow_first_buy_user_0504, bucket=453, string 
> memberid=, int32 cate3_id=-2147483648, int32 cate2_id=-2147483648, int32 
> cate1_id=-2147483648, int32 chan_type=-2147483648, int32 
> county_id=-2147483648, int32 city_id=-2147483648, int32 
> province_id=-2147483648, 1) failed: timed out after deadline expired: timed 
> out after deadline expired
> "  
> 2 analysis:
> According to the bucket number, I found the target tablet only has two 
> replicas, which is odd. Meanwhile, the tablet server hosting the leader 
> replica had crashed. 
> The follower cannot be promoted to leader in that situation: with only one 
> leader and one follower and the leader dead, the follower cannot get a 
> majority of votes to become leader (only itself votes for itself).
> This results in the tablet being unavailable even though a follower hosting 
> a replica is left.
> After restarting kudu-server on the node hosting the previous leader 
> replica, I observed that the previous leader replica became a follower, the 
> previous follower replica became leader, and another follower replica was 
> created, restoring a 3-replica Raft configuration.
> 3 modifications:
> A follower should notice the abnormal situation where there are only two 
> replicas in the Raft configuration (one leader and one follower) and contact 
> the master to correct it.
> 4 to do:
> What caused the two-replica Raft configuration is still unknown.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-1736) kudu crash in debug build: unordered undo delta

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1736:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> kudu crash in debug build: unordered undo delta
> ---
>
> Key: KUDU-1736
> URL: https://issues.apache.org/jira/browse/KUDU-1736
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Reporter: zhangsong
>Priority: Critical
> Attachments: mt-tablet-test.txt.gz
>
>
> In the JD cluster we hit a kudu-tserver crash with the fatal message 
> described as follows:
> Check failed: last_key_.CompareTo(key) <= 0 must insert undo deltas in 
> sorted order (ascending key, then descending ts): got key (row 
> 1422@tx6052042821982183424) after (row 1422@tx6052042821953155072)
> This is a DCHECK that should never fail.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KUDU-2015) invalidate data format will causing kudu-tserver to crash. and kudu-table will be un available

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved KUDU-2015.
--
   Resolution: Cannot Reproduce
Fix Version/s: n/a

Closing since not getting help reproducing.

> invalidate data format will causing kudu-tserver to crash. and kudu-table 
> will be un available
> --
>
> Key: KUDU-2015
> URL: https://issues.apache.org/jira/browse/KUDU-2015
> Project: Kudu
>  Issue Type: Bug
>  Components: client, impala, tserver
>Affects Versions: 1.1.0
>Reporter: zhangsong
>Priority: Critical
> Fix For: n/a
>
>
> When issuing an INSERT INTO statement via Impala, I issued a malformed 
> insert clause, which in turn made the Kudu table unreadable and crashed the 
> kudu-tservers.
> The test table's schema: 
> CREATE EXTERNAL TABLE `cst` (
> `pin` STRING,
> `age` INT
> )
> TBLPROPERTIES(...)
> The insert statement issued was "insert into cst values 
> (("test1",2),("test2",3),("test3",3))". 
> After the insertion, impala-shell reported success, 
> but a subsequent select on this table failed. 
> The kudu-tservers (one leader and two followers) hosting the same tablet of 
> the table also crashed.
> The FATAL msg on them is: "F0516 20:03:18.752769 39540 
> tablet_peer_mm_ops.cc:128] Check failed: _s.ok() FlushMRS failed on 
> 8ea48349d89d405c94334f832b1bae18: Invalid argument: Failed to finish DRS 
> writer: index too large"
> Fortunately, it was a test table, so it only caused 3 kudu-tservers to die.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-1921) Add ability for clients to require authentication/encryption

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1921:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> Add ability for clients to require authentication/encryption
> 
>
> Key: KUDU-1921
> URL: https://issues.apache.org/jira/browse/KUDU-1921
> Project: Kudu
>  Issue Type: Improvement
>  Components: client, security
>Affects Versions: 1.3.0
>Reporter: Todd Lipcon
>Priority: Critical
>
> Currently, the clients always operate in "optional" mode for authentication 
> and encryption. This means that they are vulnerable to downgrade attacks by a 
> MITM. We should provide APIs so that clients can be configured to prohibit 
> downgrade when connecting to clusters they know to be secure.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-1466) C++ client errors misreported as GetTableLocations timeouts

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1466:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> C++ client errors misreported as GetTableLocations timeouts
> ---
>
> Key: KUDU-1466
> URL: https://issues.apache.org/jira/browse/KUDU-1466
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: 0.8.0
>Reporter: Todd Lipcon
>Assignee: Alexey Serbin
>Priority: Critical
>
> client-test is currently very flaky due to this issue:
> - we are injecting some kind of failure on the tablet server (eg DNS 
> resolution failure)
> - when we fail to connect to the TS, we correctly re-trigger a lookup against 
> the master
> - depending on how the backoffs and retries line up, we sometimes end up 
> triggering the lookup retry when the remaining operation budget is very short 
> (eg <10ms)
> -- this GetTabletLocations RPC times out since the master is unable to 
> respond within the ridiculously short timeout
> During the course of retrying some operation, we should probably not replace 
> the 'last_error' with a master error, so long as we have had at least one 
> successful master lookup (thus indicating that the master is not the problem)
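A minimal sketch of that error-retention policy, with illustrative names: a master-lookup failure only overwrites the retained error while the master is still a plausible culprit, i.e. before any lookup has succeeded.

{code}
#include <string>

// Illustrative retry-state fragment: keep the error that best explains the
// overall failure.
struct RetryState {
  std::string last_error;
  bool master_lookup_succeeded = false;

  void RecordTabletServerError(const std::string& msg) { last_error = msg; }

  void RecordMasterLookupResult(bool ok, const std::string& msg) {
    if (ok) {
      master_lookup_succeeded = true;
    } else if (!master_lookup_succeeded) {
      last_error = msg;  // the master may genuinely be the problem
    }
    // else: drop the lookup timeout; keep the real tserver-side error.
  }
};
{code}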



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-1813) Scan at a specific timestamp doesn't include that timestamp as committed

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1813:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> Scan at a specific timestamp doesn't include that timestamp as committed
> 
>
> Key: KUDU-1813
> URL: https://issues.apache.org/jira/browse/KUDU-1813
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.2.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
>
> Currently, if the user performs the following sequence:
> - Insert a row
> - ts = client_->GetLastObservedTimestamp()
> - create a new scanner with READ_AT_SNAPSHOT set to 'ts'
> they will not observe their own write. This seems to be due to incorrect 
> usage of the MvccSnapshot(ts) constructor, whose contract is that all writes 
> _before_ 'ts' are considered committed, rather than _before or equal to_ 'ts'.
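A hedged sketch of the intended inclusive semantics (illustrative types, not Kudu's MvccSnapshot code):

{code}
#include <cstdint>

using Timestamp = uint64_t;

// READ_AT_SNAPSHOT at 'ts' must treat writes AT 'ts' as committed, so the
// snapshot bound is inclusive; an exclusive bound (<) is exactly what hides
// the caller's own last write.
struct Snapshot {
  Timestamp all_committed_at_or_before;
  bool IsCommitted(Timestamp write_ts) const {
    return write_ts <= all_committed_at_or_before;  // '<=', not '<'
  }
};
{code}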



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-801) Delta flush doesn't wait for transactions to commit

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-801:

Target Version/s: 1.6.0  (was: 1.5.0)

> Delta flush doesn't wait for transactions to commit
> ---
>
> Key: KUDU-801
> URL: https://issues.apache.org/jira/browse/KUDU-801
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: Private Beta
>Reporter: Todd Lipcon
>Priority: Critical
>
> I saw a case of mt-tablet-test failing with what I think is the following 
> scenario:
> - transaction applies an update to DMS
> - delta flush happens
> - major delta compaction runs (the update is now part of base data and we 
> have an UNDO)
> - the RS is selected for compaction
> - CHECK failure because the UNDO delta contains something that is not yet 
> committed.
> We probably need to ensure that we don't Flush data which isn't yet committed 
> from an MVCC standpoint.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KUDU-2037) ts_recovery-itest flaky since KUDU-1034 fixed

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved KUDU-2037.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Was fixed in 681f05b431a6fe62370feb439dd0756d9eefe07d.

> ts_recovery-itest flaky since KUDU-1034 fixed
> -
>
> Key: KUDU-2037
> URL: https://issues.apache.org/jira/browse/KUDU-2037
> Project: Kudu
>  Issue Type: Bug
>  Components: client, test
>Affects Versions: 1.4.0
>Reporter: Todd Lipcon
>Assignee: Alexey Serbin
>Priority: Critical
> Fix For: 1.5.0
>
>
> ts_recovery-itest is quite flaky lately (~50% in TSAN builds). I was able to 
> reproduce the flakiness reliably doing:
> {code}
> taskset -c 0-1 ./build-support/run-test.sh 
> ./build/latest/bin/ts_recovery-itest --gtest_filter=\*Orphan\* 
> -stress-cpu-threads 4
> {code}
> I tracked the flakiness down to being introduced by KUDU-1034 
> (4263b037844fca595a35f99479fbb5765ba7a443). The issue seems to be that the 
> test sets a low timeout such that a large number of requests time out, and 
> with the new behavior introduced by that commit, we end up hammering the 
> master and unable to make progress.
> Unclear if this is a feature (and we need to update the test) or a bug



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-1454) Spark and MR jobs running without scan locality

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1454:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> Spark and MR jobs running without scan locality
> ---
>
> Key: KUDU-1454
> URL: https://issues.apache.org/jira/browse/KUDU-1454
> Project: Kudu
>  Issue Type: Bug
>  Components: client, perf, spark
>Affects Versions: 0.8.0
>Reporter: Todd Lipcon
>Priority: Critical
>
> Spark (and, according to [~danburkert], now MR as well) adds all of the locations 
> of a tablet as split locations. This makes sense except that the Java client 
> currently always scans the leader replica. So in many cases we schedule a 
> task which is "local" to a follower, and then it ends up having to do a 
> remote scan.
> This makes Spark queries take about twice as long on tables with replicas 
> compared to unreplicated tables, and I think is a regression on the MR side.
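One possible client-side direction, sketched against the Java scanner builder 
(the replicaSelection() option and ReplicaSelection enum are assumed here, not 
part of the released API):

{code}
// Let the task scan whichever replica is closest to it, so that split
// locality actually pays off, instead of always routing to the leader.
KuduScanner scanner = client.newScannerBuilder(table)
    .replicaSelection(ReplicaSelection.CLOSEST_REPLICA)  // vs. LEADER_ONLY
    .build();
{code}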



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-1587) Memory-based backpressure is insufficient on seek-bound workloads

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1587:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> Memory-based backpressure is insufficient on seek-bound workloads
> -
>
> Key: KUDU-1587
> URL: https://issues.apache.org/jira/browse/KUDU-1587
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 0.10.0
>Reporter: Todd Lipcon
>Priority: Critical
> Attachments: graph.png, queue-time.png
>
>
> I pushed a uniform random insert workload from a bunch of clients to the 
> point that the vast majority of bloom filters no longer fit in buffer cache, 
> and the compaction had fallen way behind. Thus, every inserted row turns into 
> 40+ seeks (due to non-compact data) and takes 400-500ms. In this kind of 
> workload, the current backpressure (based on memory usage) is insufficient to 
> prevent ridiculously long queues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-428) Support for service/table/column authorization

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-428:

Target Version/s: 1.6.0  (was: 1.5.0)

> Support for service/table/column authorization
> --
>
> Key: KUDU-428
> URL: https://issues.apache.org/jira/browse/KUDU-428
> Project: Kudu
>  Issue Type: New Feature
>  Components: master, security, tserver
>Affects Versions: 1.2.0
>Reporter: Todd Lipcon
>Priority: Critical
>  Labels: kudu-roadmap
>
> We need to support basic SQL-like access control:
> - grant/revoke on tables, columns
> - service-level grant/revoke
> - probably need some group/role mapping infrastructure as well



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-869) Support PRE_VOTER config membership type

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-869:

Target Version/s: 1.6.0  (was: 1.5.0)

> Support PRE_VOTER config membership type
> 
>
> Key: KUDU-869
> URL: https://issues.apache.org/jira/browse/KUDU-869
> Project: Kudu
>  Issue Type: Sub-task
>  Components: consensus
>Affects Versions: Feature Complete
>Reporter: Mike Percy
>Assignee: Mike Percy
>Priority: Critical
>
> A PRE_VOTER membership type will reduce unavailability when bootstrapping new 
> nodes. See the remote bootstrap spec @ 
> https://docs.google.com/document/d/1zSibYnwPv9cFRnWn0ORyu2uCGB9Neb-EsF0M6AiMSEE
>  for details



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-1788) Raft UpdateConsensus retry behavior on timeout is counter-productive

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1788:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> Raft UpdateConsensus retry behavior on timeout is counter-productive
> 
>
> Key: KUDU-1788
> URL: https://issues.apache.org/jira/browse/KUDU-1788
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.1.0
>Reporter: Todd Lipcon
>Priority: Critical
>
> In a stress test, I've seen the following counter-productive behavior:
> - a leader is trying to send operations to a replica (eg a 10MB batch)
> - the network is constrained due to other activity, so sending 10MB may take 
> >1sec
> - the request times out on the client side, likely while it was still in the 
> process of sending the batch
> - when the server receives it, it is likely to have timed out while waiting 
> in the queue. Or ,it will receive it and upon processing will all be 
> duplicate ops from the previous attempt
> - the client has no idea whether the server received it or not, and thus 
> keeps retrying the same batch (triggering the same timeout)
> This tends to be a "sticky"/cascading sort of state: after one such timeout, 
> the follower will be lagging behind more, and the next batch will be larger 
> (up to the configured max batch size). The client neither backs off nor 
> increases its timeout, so it will basically just keep the network pipe full 
> of useless redundant updates
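A minimal sketch of the missing backoff behavior (names, types, and constants 
here are illustrative stand-ins, not Kudu's actual consensus peer code):

{code}
// Hypothetical retry loop: back off exponentially and stretch the RPC
// timeout after each failed attempt, rather than immediately resending
// the same (ever-growing) batch on a fixed 1s timeout.
void replicateWithBackoff(Peer peer, Batch batch) throws InterruptedException {
  long timeoutMs = 1000;
  long backoffMs = 50;
  while (!sendBatch(peer, batch, timeoutMs)) {  // sendBatch() is hypothetical
    Thread.sleep(backoffMs);
    backoffMs = Math.min(backoffMs * 2, 5000);    // cap the sleep
    timeoutMs = Math.min(timeoutMs * 2, 30000);   // let big batches finish
  }
}
{code}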



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-1869) Scans do not work with hybrid time disabled and snapshot reads enabled

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142226#comment-16142226
 ] 

Jean-Daniel Cryans commented on KUDU-1869:
--

[~dralves] Are you working on this?

> Scans do not work with hybrid time disabled and snapshot reads enabled
> --
>
> Key: KUDU-1869
> URL: https://issues.apache.org/jira/browse/KUDU-1869
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.2.0
> Environment: Centos 6.6
> kudu 1.2.0-cdh5.10.0
> revision 01748528baa06b78e04ce9a799cc60090a821162
> build type RELEASE
> 6 nodes, 5 tservers/impalads
>Reporter: Matthew Jacobs
>Assignee: David Alves
>Priority: Critical
>  Labels: impala
>
> With {{-use_hybrid_clock=false}} and scanning with SNAPSHOT_READ, all scans 
> appear to be timing out with the following error message:
> {code}
> [vc0736.halxg.cloudera.com:21000] > SELECT COUNT(*) FROM 
> functional_kudu.alltypes;
> Query: select COUNT(*) FROM functional_kudu.alltypes
> Query submitted at: 2017-02-10 09:50:02 (Coordinator: 
> http://vc0736.halxg.cloudera.com:25000)
> Query progress can be monitored at: 
> http://vc0736.halxg.cloudera.com:25000/query_plan?query_id=ff48eb0af82f057e:f2c1e2f8
> WARNINGS: 
> Unable to open scanner: Timed out: unable to retry before timeout: Remote 
> error: Service unavailable: Timed out: could not wait for desired snapshot 
> timestamp to be consistent: Timed out waiting for ts: L: 3632307 to be safe 
> (mode: NON-LEADER). Current safe time: L: 3598991 Physical time difference: 
> None (Logical clock): Remote error: Service unavailable: Timed out: could not 
> wait for desired snapshot timestamp to be consistent: Timed out waiting for 
> ts: L: 3625836 to be safe (mode: NON-LEADER). Current safe time: L: 3593993 
> Physical time difference: None (Logical clock)
> {code}
> This is a severe issue for Impala which aims to keep SNAPSHOT_READ enabled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-1125) Reduce impact of enabling fsync on the master

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1125:
-
Priority: Major  (was: Critical)

> Reduce impact of enabling fsync on the master
> -
>
> Key: KUDU-1125
> URL: https://issues.apache.org/jira/browse/KUDU-1125
> Project: Kudu
>  Issue Type: Sub-task
>  Components: master
>Affects Versions: Feature Complete
>Reporter: Jean-Daniel Cryans
>  Labels: data-scalability
>
> First time running ITBLL since we enabled fsync in the master and I'm now 
> seeing RPCs timing out because the master is always ERROR_SERVER_TOO_BUSY. In 
> the log I can see a lot of elections going on and the queue is always full.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-577) Specification for expected semantics and client modes

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142225#comment-16142225
 ] 

Jean-Daniel Cryans commented on KUDU-577:
-

[~dr-alves] same question as Todd above.

> Specification for expected semantics and client modes
> -
>
> Key: KUDU-577
> URL: https://issues.apache.org/jira/browse/KUDU-577
> Project: Kudu
>  Issue Type: Sub-task
>  Components: api, client
>Affects Versions: M4.5
>Reporter: Jean-Daniel Cryans
>Assignee: David Alves
>Priority: Critical
>
> We need a detailed description of what the different client modes are and 
> what the clients are expected to do in each. This is to ensure that both 
> terminology and behavior match between languages.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-1762) suspected tablet memory leak

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1762:
-
Priority: Major  (was: Critical)

> suspected tablet memory leak
> 
>
> Key: KUDU-1762
> URL: https://issues.apache.org/jira/browse/KUDU-1762
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.0.1
> Environment: CentOS 6.5
> Kudu 1.0.1 (rev e60b610253f4303b24d41575f7bafbc5d69edddb)
>Reporter: Fu Lili
> Attachments: 0B2CE7BB-EF26-4EA1-B824-3584D7D79256.png, 
> kudu_heap_prof_20161206.tar.gz, mem_rss_graph_2016_12_19.png, 
> server02_30day_rss_before_and_after_mrs_flag_2.png, 
> server02_30day_rss_before_and_after_mrs_flag.png, tserver_smaps1
>
>
> here is the memory total info:
> {quote}
> 
> MALLOC: 1691715680 ( 1613.3 MiB) Bytes in use by application
> MALLOC: +178733056 (  170.5 MiB) Bytes in page heap freelist
> MALLOC: + 37483104 (   35.7 MiB) Bytes in central cache freelist
> MALLOC: +  4071488 (3.9 MiB) Bytes in transfer cache freelist
> MALLOC: + 13739264 (   13.1 MiB) Bytes in thread cache freelists
> MALLOC: + 12202144 (   11.6 MiB) Bytes in malloc metadata
> MALLOC:   
> MALLOC: =   1937944736 ( 1848.2 MiB) Actual memory used (physical + swap)
> MALLOC: +   311296 (0.3 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   
> MALLOC: =   1938256032 ( 1848.5 MiB) Virtual address space used
> MALLOC:
> MALLOC: 174694  Spans in use
> MALLOC:201  Thread heaps in use
> MALLOC:   8192  Tcmalloc page size
> 
> Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
> Bytes released to the OS take up virtual address space but no physical memory.
> {quote}
> but in the memory details, the sum of all the sub Current Consumption values 
> is far less than the root Current Consumption.
> ||Id||Parent||Limit||Current Consumption||Peak consumption||
> |root|none|4.00G|1.58G|1.74G|
> |log_cache|root|1.00G|480.8K|5.32M|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:70c8d889b0314b04a240fcb02c24a012|log_cache|128.00M|160B|160B|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:16d3c8193579445f8f766da6c7abc237|log_cache|128.00M|160B|160B|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:2c69c5cb9eb04eb48323a9268afc36a7|log_cache|128.00M|160B|160B|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:2b11d9220dab4a5f952c5b1c10a68ccd|log_cache|128.00M|69.2K|139.5K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:cec045be60af4f759497234d8815238b|log_cache|128.00M|68.6K|138.7K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:cea7a54cebd242e4997da641f5b32e3a|log_cache|128.00M|68.5K|139.3K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:9625dfde17774690a888b55024ac797a|log_cache|128.00M|68.5K|140.0K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:6046b33901ca43d0975f59cf7e491186|log_cache|128.00M|0B|133.0K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:1a18ab0915f0407b922fa7ecbe7a2f46|log_cache|128.00M|0B|132.6K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:ac54d1c1813a4e39943971cb56f248ef|log_cache|128.00M|0B|130.5K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:4438580df6cc4d469393b9d6adee68d8|log_cache|128.00M|0B|131.2K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:2f1cef7d2a494575b941baa22b8a3dc9|log_cache|128.00M|0B|131.6K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:d2ad22d202c04b2d98f1c5800df1c3b5|log_cache|128.00M|0B|132.5K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:b19b21d6b4c84f9895aad9e81559d019|log_cache|128.00M|0B|131.0K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:27e9531cd5814b1c9637493f05860b19|log_cache|128.00M|0B|131.1K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:425a19940239447faa0eaab4e380d644|log_cache|128.00M|68.5K|146.9K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:178bd7bc39a941a887f393b0a7848066|log_cache|128.00M|68.5K|139.9K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:91524acd28a440318918f11292ac8fdc|log_cache|128.00M|0B|132.0K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:be6f093aabf9460b97fc35dd026820b6|log_cache|128.00M|0B|130.4K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:dd8dd794f0f44426a3c46ce8f4b54652|log_cache|128.00M|0B|131.2K|
> |log_cache:0c79993cd5504785a68f07c52463a4dc:ed128ca7b19c4e3eaa48e9e3eb341492|log_cache|128.00M|68.5K|141.5K|
> |block_cache-sharded_lru_cache|root|none|257.05M|257.05M|
> |code_cache-sharded_lru_cache|root|none|112B|113B|
> |server|root|none|2.06M|121.97M|
> |tablet-70c8d889b0314b04a240fcb02c24a012|server|none|265B|265B|
> |txn_tracker|tablet-70c8d889b0314b04a240fcb02c24a012|64.00M|0B|0B|
> |MemRowSet-0|tablet-70c8d889b0314b04a240fcb02c24a012|none|265B|265B|
> 

[jira] [Commented] (KUDU-582) Send TS specific errors back to the client when the client is supposed to take specific actions, such as trying another replica

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142224#comment-16142224
 ] 

Jean-Daniel Cryans commented on KUDU-582:
-

Hey [~dr-alves], what should we do with this jira? It feels ancient and not 
that well defined. Should we close?

> Send TS specific errors back to the client when the client is supposed to 
> take specific actions, such as trying another replica
> ---
>
> Key: KUDU-582
> URL: https://issues.apache.org/jira/browse/KUDU-582
> Project: Kudu
>  Issue Type: Bug
>  Components: client, consensus, tserver
>Affects Versions: M4.5
>Reporter: David Alves
>Priority: Critical
>
> Right now we're sending umbrella statuses that the client is supposed to 
> interpret as a command to fail over to another replica. This is 
> misusing statuses, but it's also a problem in that we're likely (or will 
> likely be) sending the same statuses (illegal state and abort) in places where 
> we don't mean for the client to fail over.
> This should be treated holistically in both clients and in the server 
> components.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-1839) DNS failure during tablet creation lead to undeletable tablet

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1839:
-
Priority: Major  (was: Critical)

> DNS failure during tablet creation lead to undeletable tablet
> -
>
> Key: KUDU-1839
> URL: https://issues.apache.org/jira/browse/KUDU-1839
> Project: Kudu
>  Issue Type: Bug
>  Components: master, tablet
>Affects Versions: 1.2.0
>Reporter: Adar Dembo
>
> During a YCSB workload, two tservers died due to DNS resolution timeouts. For 
> example: 
> {noformat}
> F0117 09:21:14.952937  8392 raft_consensus.cc:1985] Check failed: _s.ok() Bad 
> status: Network error: Could not obtain a remote proxy to the peer.: Unable 
> to resolve address 've0130.halxg.cloudera.com': Name or service not known
> {noformat}
> It's not clear why this happened; perhaps table creation places an inordinate 
> strain on DNS due to concurrent resolution load from all the bootstrapping 
> peers.
> In any case, when these tservers were restarted, two tablets failed to 
> bootstrap, both for the same reason. I'll focus on just one tablet from here 
> on out to simplify troubleshooting:
> {noformat}
> E0117 15:35:45.567312 85124 ts_tablet_manager.cc:749] T 
> 8c167c441a7d44b8add737d13797e694 P 7425c65d80f54f2da0a85494a5eb3e68: Tablet 
> failed to bootstrap: Not found: Unable to load Consensus metadata: 
> /data/2/kudu/consensus-meta/8c167c441a7d44b8add737d13797e694: No such file or 
> directory (error 2)
> {noformat}
> Eventually, the master decided to delete this tablet:
> {noformat}
> I0117 15:42:32.119601 85166 tablet_service.cc:672] Processing DeleteTablet 
> for tablet 8c167c441a7d44b8add737d13797e694 with delete_type 
> TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new 
> config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
> I0117 15:42:32.139128 85166 tablet_service.cc:672] Processing DeleteTablet 
> for tablet 8c167c441a7d44b8add737d13797e694 with delete_type 
> TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new 
> config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
> I0117 15:42:32.181843 85166 tablet_service.cc:672] Processing DeleteTablet 
> for tablet 8c167c441a7d44b8add737d13797e694 with delete_type 
> TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new 
> config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
> I0117 15:42:32.276289 85166 tablet_service.cc:672] Processing DeleteTablet 
> for tablet 8c167c441a7d44b8add737d13797e694 with delete_type 
> TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new 
> config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
> {noformat}
> As can be seen by the presence of multiple deletion requests, each one 
> failed. It's annoying that the tserver didn't log why. But the master did:
> {noformat}
> I0117 15:42:32.117022 33903 catalog_manager.cc:2758] Sending 
> DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 
> 8c167c441a7d44b8add737d13797e694 on 7425c65d80f54f2da0a85494a5eb3e68 
> (ve0122.halxg.cloudera.com:7050) (TS 7425c65d80f54f2da0a85494a5eb3e68 not 
> found in new config with opid_index 29)
> W0117 15:42:32.117463 33890 catalog_manager.cc:2725] TS 
> 7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.cloudera.com:7050): delete 
> failed for tablet 8c167c441a7d44b8add737d13797e694 with error code 
> TABLET_NOT_RUNNING: Illegal state: Consensus not available. Tablet shutting 
> down
> I0117 15:42:32.117491 33890 catalog_manager.cc:2522] Scheduling retry of 
> 8c167c441a7d44b8add737d13797e694 Delete Tablet RPC for 
> TS=7425c65d80f54f2da0a85494a5eb3e68 with a delay of 19ms (attempt = 1)...
> {noformat}
> This isn't a fatal error as far as the master is concerned, so it retries the 
> deletion forever.
> Meanwhile, the broken replica of this tablet still appears to be part of the 
> replication group. At least, that's true as far as both the master web UI and 
> the tserver web UI are concerned. The leader tserver is logging this error 
> repeatedly:
> {noformat}
> W0117 16:38:04.797828 81809 consensus_peers.cc:329] T 
> 8c167c441a7d44b8add737d13797e694 P 335d132897de4bdb9b87443f2c487a42 -> Peer 
> 7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.cloudera.com:7050): Couldn't 
> send request to peer 7425c65d80f54f2da0a85494a5eb3e68 for tablet 
> 8c167c441a7d44b8add737d13797e694. Error code: TABLET_NOT_RUNNING (12). 
> Status: Illegal state: Tablet not RUNNING: FAILED: Not found: Unable to load 
> Consensus metadata: 
> /data/2/kudu/consensus-meta/8c167c441a7d44b8add737d13797e694: No such file or 
> directory (error 2). Retrying in the next heartbeat period. Already tried 
>  times.
> {noformat}
> It's not clear to me exactly what state the 

[jira] [Updated] (KUDU-2044) Tombstoned tablets show up in /metrics

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-2044:
-
Target Version/s: 1.6.0

> Tombstoned tablets show up in /metrics
> --
>
> Key: KUDU-2044
> URL: https://issues.apache.org/jira/browse/KUDU-2044
> Project: Kudu
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>Assignee: Will Berkeley
>Priority: Critical
>  Labels: newbie
>
> They probably shouldn't be there.
> Furthermore, tablets tombstoned by the current process (i.e. by the current 
> run of the tserver/master) are present, but tablets tombstoned by a past run 
> aren't.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-2050) Avoid peer eviction during block manager startup

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-2050:
-
Target Version/s: 1.6.0

> Avoid peer eviction during block manager startup
> 
>
> Key: KUDU-2050
> URL: https://issues.apache.org/jira/browse/KUDU-2050
> Project: Kudu
>  Issue Type: Bug
>  Components: fs, tserver
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>Priority: Critical
>
> In larger deployments we've observed that opening the block manager can take 
> a really long time, like tens of minutes or sometimes even hours. This is 
> especially true as of 1.4 where the log block manager tries to optimize 
> on-disk data structures during startup.
> The default time to Raft peer eviction is 5 minutes. If one node is restarted 
> and LBM startup takes over 5 minutes, or if all nodes are restarted and 
> there's over 5 minutes of LBM startup time variance across them, the "slow" 
> node could have all of its replicas evicted. Besides generating a lot of 
> unnecessary work in rereplication, this effectively "defeats" the LBM 
> optimizations in that it would have been equally slow (but more efficient) to 
> reformat the node instead.
> So, let's reorder startup such that LBM startup counts towards replica 
> bootstrapping. One idea: adjust FsManager startup so that tablet-meta/cmeta 
> files can be accessed early to construct bootstrapping replicas, but to defer 
> opening of the block manager until after that time.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-2110) RPC footer may be appended more than once

2017-08-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-2110:
-
Target Version/s:   (was: 1.5.0)

> RPC footer may be appended more than once
> -
>
> Key: KUDU-2110
> URL: https://issues.apache.org/jira/browse/KUDU-2110
> Project: Kudu
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 1.5.0
>Reporter: Michael Ho
>Assignee: Michael Ho
>Priority: Blocker
>  Labels: 1.6.0
>
> The fix for KUDU-2065 included a footer to RPC messages. The footer is 
> appended during the beginning of the transmission of the outbound transfer 
> for an outbound call. However, the check for the beginning of transmission 
> for an outbound call isn't quite correct as it's possible for an outbound 
> transfer to not send anything in Transfer::SendBuffer().
> {noformat}
> // Transfer for outbound call must call StartCallTransfer() before 
> transmission can
> // begin to append footer to the payload if the remote supports it.
> if (!transfer->TransferStarted() &&
> transfer->is_for_outbound_call() &&
> !StartCallTransfer(transfer)) {
>   OutboundQueuePopFront();
>   continue;
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-2095) scanner keepAlive method is necessary in java client

2017-08-22 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16136970#comment-16136970
 ] 

Jean-Daniel Cryans commented on KUDU-2095:
--

Hi [~Zhang Guangqiang], you're right that this is not implemented in the Java 
client.

It appears this is your lucky day because someone wrote a patch: 
https://gerrit.cloudera.org/#/c/7749/
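Assuming the patch mirrors the C++ API, usage would presumably look something 
like this (sketch only; keepAlive() is the method that patch proposes, and 
processSlowly() is a hypothetical slow consumer):

{code}
KuduScanner scanner = client.newScannerBuilder(table).build();
while (scanner.hasMoreRows()) {
  RowResultIterator rows = scanner.nextRows();
  scanner.keepAlive();  // tell the server not to expire this scanner
  for (RowResult row : rows) {
    processSlowly(row);
  }
}
{code}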

> scanner keepAlive method is necessary in java client
> 
>
> Key: KUDU-2095
> URL: https://issues.apache.org/jira/browse/KUDU-2095
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.3.1
>Reporter: zhangguangqiang
>
> When I use the Kudu Java client, I need to keep the scanner alive in my usage 
> case, but I cannot find this method in the Java client. On the other hand, I 
> found kudu::client::KuduScanner::KeepAlive in the C++ client. This is very 
> necessary for my usage; will you implement it in the Java client? Thank you!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KUDU-1942) Kerberos fails to log in on hostnames with capital characters

2017-08-22 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved KUDU-1942.
--
Resolution: Fixed

> Kerberos fails to log in on hostnames with capital characters
> -
>
> Key: KUDU-1942
> URL: https://issues.apache.org/jira/browse/KUDU-1942
> Project: Kudu
>  Issue Type: Bug
>  Components: security
>Affects Versions: 1.3.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 1.3.2, 1.5.0, 1.4.1
>
>
> I tried setting up Kerberos on my laptop which has the hostname 
> 'todd-Thinkpad-T540p' (including capital letters). It seems like we are 
> inconsistent about our treatment of caps -- if I make the keytab with the 
> principal kudu/todd-thinkpad-t540p it fails complaining it can't find 
> kudu/todd-Thinkpad-T540p, and if I make the keytab as the latter, it fails 
> complaining it can't find the former.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KUDU-2101) [ksck] Include a summary at the bottom of the report

2017-08-17 Thread Jean-Daniel Cryans (JIRA)
Jean-Daniel Cryans created KUDU-2101:


 Summary: [ksck] Include a summary at the bottom of the report
 Key: KUDU-2101
 URL: https://issues.apache.org/jira/browse/KUDU-2101
 Project: Kudu
  Issue Type: Improvement
  Components: ops-tooling
Affects Versions: 1.4.0
Reporter: Jean-Daniel Cryans


Right now the end of ksck's report looks like this:

{noformat}
Table some_table_name has 1 under-replicated tablet(s)

WARNING: 6 out of 15 table(s) are not in a healthy state
{noformat}

So we have one summary line for each table as they are printed out, and then at 
the end we say how many tables are bad. We could rework the last line into a 
table with information like this:

||Table name||Status||
|some_other_table_name|HEALTHY|
|some_table_name|1 under-replicated tablet|

Maybe we could also give the detail of how many tablets are good, how many are 
missing some replicas, and how many have lost a majority of their replicas.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KUDU-2090) Insert operation request timed out, UpdateConsensus RPC timed out.

2017-08-07 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved KUDU-2090.
--
   Resolution: Invalid
Fix Version/s: n/a

> Insert operation request timed out, UpdateConsensus RPC timed out.
> --
>
> Key: KUDU-2090
> URL: https://issues.apache.org/jira/browse/KUDU-2090
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.3.0
> Environment: Kudu 1.3.0-1.cdh5.11.0.p0.12, CentOS Linux release 
> 7.3.1611 (Core)
>Reporter: LUOYAJUN
> Fix For: n/a
>
> Attachments: kudu-tserver.WARNING
>
>
> Insert operations time out, and the logs show 'UpdateConsensus RPC' errors. 
> The Kudu cluster consists of 3 masters and 8 tablet servers, with 143 tables 
> and 1115 tablets.
> Some messages related to this issue from the tablet server log:
> W0807 03:19:45.116417 20083 consensus_peers.cc:357] T 
> 5c0a1dbeeef04cc796d65746b5cda4dc P a8a23a2a3bb0446db77dcc85fc85530a -> Peer 
> 622a4488ce774290b2dcd3104a06ae3c (hadoop-04:7050): Couldn't send request to 
> peer 622a4488ce774290b2dcd3104a06ae3c for tablet 
> 5c0a1dbeeef04cc796d65746b5cda4dc. Status: Timed out: UpdateConsensus RPC to 
> 10.20.110.4:7050 timed out after 1.000s (SENT). Retrying in the next 
> heartbeat period. Already tried 6 times.
> W0807 03:19:45.163341 20085 consensus_peers.cc:357] T 
> bd89a18ccc0142d784942ecc130ff3b6 P a8a23a2a3bb0446db77dcc85fc85530a -> Peer 
> cf66bf6093764bffa6387c241f9994c6 (hadoop-02:7050): Couldn't send request to 
> peer cf66bf6093764bffa6387c241f9994c6 for tablet 
> bd89a18ccc0142d784942ecc130ff3b6. Error code: TABLET_NOT_FOUND (6). Status: 
> Timed out: UpdateConsensus RPC to 10.20.110.2:7050 timed out after 1.000s 
> (SENT). Retrying in the next heartbeat period. Already tried 1 times.
> W0807 03:19:45.320494 20083 consensus_peers.cc:357] T 
> 0b821119e2b849c38f981269da488fdc P a8a23a2a3bb0446db77dcc85fc85530a -> Peer 
> 622a4488ce774290b2dcd3104a06ae3c (hadoop-04:7050): Couldn't send request to 
> peer 622a4488ce774290b2dcd3104a06ae3c for tablet 
> 0b821119e2b849c38f981269da488fdc. Error code: TABLET_NOT_FOUND (6). Status: 
> Timed out: UpdateConsensus RPC to 10.20.110.4:7050 timed out after 1.000s 
> (SENT). Retrying in the next heartbeat period. Already tried 7 times.
> W0807 03:19:45.320538 20083 consensus_peers.cc:357] T 
> 8471841aa0114924868cfdf596e9bf95 P a8a23a2a3bb0446db77dcc85fc85530a -> Peer 
> c0556f4e50a34b04b9f4b1ffc63f3ffb (hadoop-03:7050): Couldn't send request to 
> peer c0556f4e50a34b04b9f4b1ffc63f3ffb for tablet 
> 8471841aa0114924868cfdf596e9bf95. Status: Timed out: UpdateConsensus RPC to 
> 10.20.110.3:7050 timed out after 1.000s (SENT). Retrying in the next 
> heartbeat period. Already tried 7 times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-2090) Insert operation request timed out, UpdateConsensus RPC timed out.

2017-08-07 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16116740#comment-16116740
 ] 

Jean-Daniel Cryans commented on KUDU-2090:
--

Hi [~LUOYAJUN], we use jira to track bugs, improvements, new features, not to 
handle problems encountered when using Kudu. Please write to the user@ mailing 
list or see if someone can help you on the Slack channel: 
http://kudu.apache.org/community.html

From a quick look at the log you attached, I wouldn't be able to pinpoint a 
problem. It definitely takes a long time to replicate the writes, but no 
evidence as to why since we only have logs from one machine. Plus there's a 
bunch of "Tablet not found" which needs to be looked into. 

bq. But we see that Kudu recommends to limit the number of tablets per server 
to 100 or fewer

What it says is "Recommended maximum number of tablet servers is 100", meaning 
the number of servers not tablets.

> Insert operation request timed out, UpdateConsensus RPC timed out.
> --
>
> Key: KUDU-2090
> URL: https://issues.apache.org/jira/browse/KUDU-2090
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.3.0
> Environment: Kudu 1.3.0-1.cdh5.11.0.p0.12, CentOS Linux release 
> 7.3.1611 (Core)
>Reporter: LUOYAJUN
> Attachments: kudu-tserver.WARNING
>
>
> Insert operations time out, and the logs show 'UpdateConsensus RPC' errors. 
> The Kudu cluster consists of 3 masters and 8 tablet servers, with 143 tables 
> and 1115 tablets.
> Some messages related to this issue from the tablet server log:
> W0807 03:19:45.116417 20083 consensus_peers.cc:357] T 
> 5c0a1dbeeef04cc796d65746b5cda4dc P a8a23a2a3bb0446db77dcc85fc85530a -> Peer 
> 622a4488ce774290b2dcd3104a06ae3c (hadoop-04:7050): Couldn't send request to 
> peer 622a4488ce774290b2dcd3104a06ae3c for tablet 
> 5c0a1dbeeef04cc796d65746b5cda4dc. Status: Timed out: UpdateConsensus RPC to 
> 10.20.110.4:7050 timed out after 1.000s (SENT). Retrying in the next 
> heartbeat period. Already tried 6 times.
> W0807 03:19:45.163341 20085 consensus_peers.cc:357] T 
> bd89a18ccc0142d784942ecc130ff3b6 P a8a23a2a3bb0446db77dcc85fc85530a -> Peer 
> cf66bf6093764bffa6387c241f9994c6 (hadoop-02:7050): Couldn't send request to 
> peer cf66bf6093764bffa6387c241f9994c6 for tablet 
> bd89a18ccc0142d784942ecc130ff3b6. Error code: TABLET_NOT_FOUND (6). Status: 
> Timed out: UpdateConsensus RPC to 10.20.110.2:7050 timed out after 1.000s 
> (SENT). Retrying in the next heartbeat period. Already tried 1 times.
> W0807 03:19:45.320494 20083 consensus_peers.cc:357] T 
> 0b821119e2b849c38f981269da488fdc P a8a23a2a3bb0446db77dcc85fc85530a -> Peer 
> 622a4488ce774290b2dcd3104a06ae3c (hadoop-04:7050): Couldn't send request to 
> peer 622a4488ce774290b2dcd3104a06ae3c for tablet 
> 0b821119e2b849c38f981269da488fdc. Error code: TABLET_NOT_FOUND (6). Status: 
> Timed out: UpdateConsensus RPC to 10.20.110.4:7050 timed out after 1.000s 
> (SENT). Retrying in the next heartbeat period. Already tried 7 times.
> W0807 03:19:45.320538 20083 consensus_peers.cc:357] T 
> 8471841aa0114924868cfdf596e9bf95 P a8a23a2a3bb0446db77dcc85fc85530a -> Peer 
> c0556f4e50a34b04b9f4b1ffc63f3ffb (hadoop-03:7050): Couldn't send request to 
> peer c0556f4e50a34b04b9f4b1ffc63f3ffb for tablet 
> 8471841aa0114924868cfdf596e9bf95. Status: Timed out: UpdateConsensus RPC to 
> 10.20.110.3:7050 timed out after 1.000s (SENT). Retrying in the next 
> heartbeat period. Already tried 7 times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-2084) Improve to increase/decrease the number of maps for MR/YARN job on Kudu

2017-07-31 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107513#comment-16107513
 ] 

Jean-Daniel Cryans commented on KUDU-2084:
--

bq. Should we make an improvement to let customers at least be able to decrease 
the number of mappers?

Not just a single vendor's customers but every Kudu user :)

I think this could tie into our idea of having Scan Tokens that can be split 
into more chunks, something other frameworks can leverage too. [~danburkert], 
was there a jira for this?
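For context, scan tokens today come out one per tablet, which is why the map 
count can't be dialed up or down. A rough sketch of the current flow (the 
column names are made up for the example):

{code}
// One KuduScanToken per tablet; a splittable-token design would let a
// framework turn each of these into several finer-grained tasks.
List<KuduScanToken> tokens = client.newScanTokenBuilder(table)
    .setProjectedColumnNames(Arrays.asList("key", "value"))
    .build();
for (KuduScanToken token : tokens) {
  byte[] serialized = token.serialize();  // shipped to a worker/mapper,
  // which rebuilds a scanner via KuduScanToken.deserializeIntoScanner()
}
{code}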

> Improve to increase/decrease the number of maps for MR/YARN job on Kudu
> ---
>
> Key: KUDU-2084
> URL: https://issues.apache.org/jira/browse/KUDU-2084
> Project: Kudu
>  Issue Type: Improvement
>Reporter: David Wang
>
> Right now, we can't increase/decrease the number of maps for an MR/YARN job 
> because of the logic below:
> https://github.com/cloudera/kudu/blob/branch-1.4.x/java/kudu-mapreduce/src/main/java/org/apache/kudu/mapreduce/KuduTableInputFormat.java#L137-L166
> Should we make an improvement to let customers at least be able to decrease 
> the number of mappers?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-2053) Request reported as stale by the server in the spark client

2017-07-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-2053:
-
Resolution: Fixed
Status: Resolved  (was: In Review)

Fixed in be8e3c22b9a3a71b2c365e2b9ed306ea23d60058.

> Request reported as stale by the server in the spark client
> ---
>
> Key: KUDU-2053
> URL: https://issues.apache.org/jira/browse/KUDU-2053
> Project: Kudu
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.3.0
>Reporter: David Alves
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 1.5.0
>
> Attachments: 
> org.apache.kudu.spark.tools.ITBigLinkedListTest-output.txt
>
>
> We had multiple reports of requests being reported as stale by the server 
> when using the spark client. Writes fail with an error like:
> [Peer xxx] Tablet server sent error Incomplete: Request with id { client_id: 
> "11ef08bf559a49568e607fe3a616cc51" seq_no: 1 first_incomplete_seq_no: 1 
> attempt_no: 1 } is stale.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KUDU-2058) Java LocatedTablet implementation has sketchy string comparison

2017-07-25 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved KUDU-2058.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Fixed in 0fe3a288d63a1b49cc615ec1346fcaef7a17b7c7, thanks Sri Sai Kumar!

> Java LocatedTablet implementation has sketchy string comparison
> ---
>
> Key: KUDU-2058
> URL: https://issues.apache.org/jira/browse/KUDU-2058
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java
>Affects Versions: 1.4.0
>Reporter: Todd Lipcon
>Assignee: Sri Sai Kumar Ravipati
>  Labels: newbie
> Fix For: 1.5.0
>
>
> Findbugs spotted this issue where two strings are compared by identity 
> instead of content:
> {code}
>   /**
>* Return the first occurrence for the given role, or null if there is none.
>*/
>   private Replica getOneOfRoleOrNull(Role role) {
> for (Replica r : replicas) {
>   if (r.getRole() == role.toString()) {
> return r;
>   }
> }
> return null;
>   }
> {code}
> it's not clear why strings are being used for comparison at all rather than 
> checking enum value equality, which would be both faster and more likely to 
> be correct. It may be that this code works fine despite the sketchiness, 
> though, because the two string objects are likely to be derived from the same 
> enumValue.toString() which returns a constant string object.
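For reference, a sketch of the content-equality fix (given that getRole() 
returns a String in this API, this avoids changing the method signature):

{code}
  /**
   * Return the first occurrence for the given role, or null if there is none.
   */
  private Replica getOneOfRoleOrNull(Role role) {
    for (Replica r : replicas) {
      if (role.toString().equals(r.getRole())) {  // compare content, not identity
        return r;
      }
    }
    return null;
  }
{code}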



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-2076) Deleting/updating is slow on single numeric row key tables

2017-07-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16100263#comment-16100263
 ] 

Jean-Daniel Cryans commented on KUDU-2076:
--

The former.

> Deleting/updating is slow on single numeric row key tables
> --
>
> Key: KUDU-2076
> URL: https://issues.apache.org/jira/browse/KUDU-2076
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.4.0
>Reporter: Jean-Daniel Cryans
>
> A user reported that deleting 50M rows on the simplest of tables, (id INT, 
> text STRING, PRIMARY KEY (id)), doesn't complete.
> It reproduces locally and I was able to see that we're deleting 1.4 rows / ms 
> which is awful considering that everything is cached.
> Todd found that we're spending most of our time decoding big blocks of 
> bit-shuffled keys. Intuitively he thought that having a composite row key 
> would perform better and indeed adding a column set to 0 in front makes it 
> 10x faster.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-2076) Deleting/updating is slow on single numeric row key tables

2017-07-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16100210#comment-16100210
 ] 

Jean-Daniel Cryans commented on KUDU-2076:
--

[~AcharkiMed] is that table schema really the one you want to use with Kudu? Or 
will you have more row key columns? Because if you do use composite row keys then 
this shouldn't be as bad a problem.

> Deleting/updating is slow on single numeric row key tables
> --
>
> Key: KUDU-2076
> URL: https://issues.apache.org/jira/browse/KUDU-2076
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.4.0
>Reporter: Jean-Daniel Cryans
>
> A user reported that deleting 50M rows on the simplest of tables, (id INT, 
> text STRING, PRIMARY KEY (id)), doesn't complete.
> It reproduces locally and I was able to see that we're deleting 1.4 rows / ms 
> which is awful considering that everything is cached.
> Todd found that we're spending most of our time decoding big blocks of 
> bit-shuffled keys. Intuitively he thought that having a composite row key 
> would perform better and indeed adding a column set to 0 in front makes it 
> 10x faster.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KUDU-2076) Deleting/updating is slow on single numeric row key tables

2017-07-21 Thread Jean-Daniel Cryans (JIRA)
Jean-Daniel Cryans created KUDU-2076:


 Summary: Deleting/updating is slow on single numeric row key tables
 Key: KUDU-2076
 URL: https://issues.apache.org/jira/browse/KUDU-2076
 Project: Kudu
  Issue Type: Bug
  Components: tablet
Affects Versions: 1.4.0
Reporter: Jean-Daniel Cryans


A user reported that deleting 50M rows on the simplest of tables, (id INT, text 
STRING, PRIMARY KEY (id)), doesn't complete.

It reproduces locally and I was able to see that we're deleting 1.4 rows / ms 
which is awful considering that everything is cached.

Todd found that we're spending most of our time decoding big blocks of 
bit-shuffled keys. Intuitively he thought that having a composite row key would 
perform better and indeed adding a column set to 0 in front makes it 10x faster.
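For anyone hitting this, the workaround Todd measured amounts to a schema along 
these lines (a sketch with the standard Java client; the "prefix" column name 
is made up, and it would always be written as 0):

{code}
List<ColumnSchema> cols = Arrays.asList(
    new ColumnSchema.ColumnSchemaBuilder("prefix", Type.INT8).key(true).build(),
    new ColumnSchema.ColumnSchemaBuilder("id", Type.INT32).key(true).build(),
    new ColumnSchema.ColumnSchemaBuilder("text", Type.STRING).build());
Schema schema = new Schema(cols);  // composite (prefix, id) primary key
{code}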



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KUDU-2074) how can I get data by primary key faster with the C++ API of Kudu

2017-07-21 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved KUDU-2074.
--
   Resolution: Invalid
Fix Version/s: n/a

Hi, for user questions please use the user mailing list: 
http://kudu.apache.org/community.html

We use Jira for reporting bugs, tracking new improvements and features.

> how can I get data by primary key faster with the C++ API of Kudu
> --
>
> Key: KUDU-2074
> URL: https://issues.apache.org/jira/browse/KUDU-2074
> Project: Kudu
>  Issue Type: Wish
>  Components: api
>Affects Versions: 1.4.0
>Reporter: wei.zeng
> Fix For: n/a
>
>
> My code is as below:
> {code}
> KuduPredicate* p = table->NewComparisonPredicate(
>     "c1", KuduPredicate::EQUAL, KuduValue::FromInt(i));
> scanner.AddConjunctPredicate(p);
> {code}
> How can I get data by primary key faster with the C++ API of Kudu? 
> Is there a better way?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-2071) disk size is much larger than actual data size

2017-07-17 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-2071:
-
   Resolution: Duplicate
Fix Version/s: n/a
   Status: Resolved  (was: In Review)

I moved this jira from being "In Review" which is a status that we only use 
when patches are actually up for review, to "Duplicate" since this is really 
KUDU-1943.

> disk size is much larger than actual data size
> ---
>
> Key: KUDU-2071
> URL: https://issues.apache.org/jira/browse/KUDU-2071
> Project: Kudu
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.3.0
> Environment: system version
> 4.9.20-11.31.amzn1.x86_64 #1 SMP Thu Apr 13 01:53:57 UTC 2017 x86_64 x86_64 
> x86_64 GNU/Linux
> software version:
> kudu 1.3.0-cdh5.11.0
> revision 4dcf4a9d516865d249f4cb9b07f93c67e84614ae
> build type RELEASE
> built by jenkins at 12 Apr 2017 14:02:51 PST on 
> kudu-centos66-046c.vpc.cloudera.com
> build id 2017-04-12_13-25-42
>Reporter: KingLee
>  Labels: patch
> Fix For: n/a
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I ran rm -rf on all the data dirs before reinstalling the cluster, and inserted 
> 100 records into the cluster using YCSB. The data's size is about 5GB, but it 
> uses 260GB of disk. One node's disks are as follows:
> before writing data:
> [root@ip-10-1-42-124 ~]# du -sh /data1/server/kudu/tserver_wal/wals/ 
> /data2/server/kudu/tserver_data/ /data3/server/kudu/tserver_data/data/ 
> /data4/server/kudu/tserver_data/data/
> 4.0K    /data1/server/kudu/tserver_wal/wals/
> 24K     /data2/server/kudu/tserver_data/
> 8.0K    /data3/server/kudu/tserver_data/data/
> 8.0K    /data4/server/kudu/tserver_data/data/
> after writing data:
> [root@ip-10-1-42-124 ~]# du -sh /data1/server/kudu/tserver_wal/wals/ 
> /data2/server/kudu/tserver_data/ /data3/server/kudu/tserver_data/data/ 
> /data4/server/kudu/tserver_data/data/
> 2.7G    /data1/server/kudu/tserver_wal/wals/
> 29G     /data2/server/kudu/tserver_data/
> 29G     /data3/server/kudu/tserver_data/data/
> 27G     /data4/server/kudu/tserver_data/data/
> actual data size:
> 9b137115cfaa427a9106c87086f41957 5041*3 MBytes
> kudu tserver configuration:
> --fs_wal_dir=/var/lib/kudu/tserver
> --fs_data_dirs=/var/lib/kudu/tserver
> --default_num_replicas=3
> --tserver_master_addrs=192.168.1.22:7051,192.168.1.23:7051,192.168.1.24:7051,192.168.1.25:7051,192.168.1.26:7051
> --maintenance_manager_num_threads=4
> --block_cache_capacity_mb=10240
> --memory_limit_hard_bytes=600
> --fs_wal_dir=/data1/server/kudu/tserver_wal
> --fs_data_dirs=/data2/server/kudu/tserver_data,/data3/server/kudu/tserver_data,/data4/server/kudu/tserver_data
> --fs_data_dirs_reserved_bytes=100
> --log_segment_size_mb=8
> And our production environment's data is 25TB, but it uses 45TB. Where does 
> this disk space go?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-2071) disk size is much larger than actual data size

2017-07-14 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16087949#comment-16087949
 ] 

Jean-Daniel Cryans commented on KUDU-2071:
--

Hey [~King Lee], this looks like a classical case of KUDU-1943.

> disk size is much larger than actual data size
> ---
>
> Key: KUDU-2071
> URL: https://issues.apache.org/jira/browse/KUDU-2071
> Project: Kudu
>  Issue Type: Improvement
>  Components: tserver
>Affects Versions: 1.3.0
> Environment: system version
> 4.9.20-11.31.amzn1.x86_64 #1 SMP Thu Apr 13 01:53:57 UTC 2017 x86_64 x86_64 
> x86_64 GNU/Linux
> software version:
> kudu 1.3.0-cdh5.11.0
> revision 4dcf4a9d516865d249f4cb9b07f93c67e84614ae
> build type RELEASE
> built by jenkins at 12 Apr 2017 14:02:51 PST on 
> kudu-centos66-046c.vpc.cloudera.com
> build id 2017-04-12_13-25-42
>Reporter: KingLee
>  Labels: patch
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I ran rm -rf on all the data dirs before reinstalling the cluster, and inserted 
> 100 records into the cluster using YCSB. The data's size is about 5GB, but it 
> uses 260GB of disk. One node's disks are as follows:
> before writing data:
> [root@ip-10-1-42-124 ~]# du -sh /data1/server/kudu/tserver_wal/wals/ 
> /data2/server/kudu/tserver_data/ /data3/server/kudu/tserver_data/data/ 
> /data4/server/kudu/tserver_data/data/
> 4.0K    /data1/server/kudu/tserver_wal/wals/
> 24K     /data2/server/kudu/tserver_data/
> 8.0K    /data3/server/kudu/tserver_data/data/
> 8.0K    /data4/server/kudu/tserver_data/data/
> after writing data:
> [root@ip-10-1-42-124 ~]# du -sh /data1/server/kudu/tserver_wal/wals/ 
> /data2/server/kudu/tserver_data/ /data3/server/kudu/tserver_data/data/ 
> /data4/server/kudu/tserver_data/data/
> 2.7G    /data1/server/kudu/tserver_wal/wals/
> 29G     /data2/server/kudu/tserver_data/
> 29G     /data3/server/kudu/tserver_data/data/
> 27G     /data4/server/kudu/tserver_data/data/
> actual data size:
> 9b137115cfaa427a9106c87086f41957 5041 MBytes
> kudu tserver configuration:
> --fs_wal_dir=/var/lib/kudu/tserver
> --fs_data_dirs=/var/lib/kudu/tserver
> --default_num_replicas=3
> --tserver_master_addrs=192.168.1.22:7051,192.168.1.23:7051,192.168.1.24:7051,192.168.1.25:7051,192.168.1.26:7051
> --maintenance_manager_num_threads=4
> --block_cache_capacity_mb=10240
> --memory_limit_hard_bytes=600
> --fs_wal_dir=/data1/server/kudu/tserver_wal
> --fs_data_dirs=/data2/server/kudu/tserver_data,/data3/server/kudu/tserver_data,/data4/server/kudu/tserver_data
> --fs_data_dirs_reserved_bytes=100
> --log_segment_size_mb=8
> And our production environment's data is 25TB, but it uses 45TB. Where does 
> this disk space go?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KUDU-2069) Add a maintenance mode

2017-07-13 Thread Jean-Daniel Cryans (JIRA)
Jean-Daniel Cryans created KUDU-2069:


 Summary: Add a maintenance mode
 Key: KUDU-2069
 URL: https://issues.apache.org/jira/browse/KUDU-2069
 Project: Kudu
  Issue Type: Bug
Reporter: Jean-Daniel Cryans


Adding a maintenance mode in Kudu would mainly prevent replicas from getting 
evicted while the cluster remains in that mode. We might also consider 
preventing table create/alter/delete.

Maintenance mode is a good thing for two use cases:
- When starting a cluster, some nodes might take longer than others to come 
back. We shouldn't kick their replicas out.
- When nodes need to be taken out for a known short period of time, say to 
change a disk or perform an OS upgrade, we shouldn't necessarily evict all its 
replicas. Right now, taking a node out for more than 5 minutes is equivalent to 
wiping it out since it will delete all its replicas when coming back. In fact, 
wiping it out might even be preferable since it cleans up the data dirs.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KUDU-2064) Overall log cache usage doesn't respect the limit

2017-07-10 Thread Jean-Daniel Cryans (JIRA)
Jean-Daniel Cryans created KUDU-2064:


 Summary: Overall log cache usage doesn't respect the limit
 Key: KUDU-2064
 URL: https://issues.apache.org/jira/browse/KUDU-2064
 Project: Kudu
  Issue Type: Bug
  Components: log
Affects Versions: 1.4.0
Reporter: Jean-Daniel Cryans


Looking at a fairly loaded machine (10TB of data in LBM, close to 10k tablets), 
I can see in the mem-trackers page that the log cache is using 1.83GB, that it 
peaked at 2.82GB, with a 1GB limit. It's consistent on other similarly loaded 
tservers. It's unexpected.

Looking at the per-tablet breakdown, they all have between 0 and a handful of 
MBs.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-2061) Java Client Not Honoring setIgnoreAllDuplicateRows When Inserting Duplicate Values

2017-07-08 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079197#comment-16079197
 ] 

Jean-Daniel Cryans commented on KUDU-2061:
--

Hi [~sblack522], sorry for the tardy insert.

One thing that the sample code doesn't check is whether the returned 
OperationResponse has an error. Duplicate rows won't throw an exception; 
instead, the RowError's Status#isAlreadyPresent() will return true. Maybe the 
API could be clearer or better documented.
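Concretely, the check the sample is missing looks something like this (a sketch 
against the synchronous Java client, assuming AUTO_FLUSH_SYNC so apply() 
returns a response directly):

{code}
OperationResponse resp = session.apply(insert);
if (resp.hasRowError()) {
  RowError err = resp.getRowError();
  if (err.getErrorStatus().isAlreadyPresent()) {
    // The row was a duplicate; surface it instead of dropping it silently.
    System.err.println("Duplicate key: " + err);
  }
}
{code}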

> Java Client Not Honoring setIgnoreAllDuplicateRows When Inserting Duplicate 
> Values
> --
>
> Key: KUDU-2061
> URL: https://issues.apache.org/jira/browse/KUDU-2061
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: 1.3.1, 1.5.0
>Reporter: Scott Black
>
> Duplicate values on insert are not causing warning/error to be returned when 
> setIgnoreAllDuplicateRows is set to false. This is silently causing data loss.
> Test case. Use the example code from 
> [https://github.com/cloudera/kudu-examples/blob/master/java/java-sample/src/main/java/org/kududb/examples/sample/Sample.java].
>  Change line 43 to insert a constant. 3 inserts will execute, but only a 
> single row results and no error is thrown. See KUDU-1563, as it seems all 
> inserts are now treated as ignore-inserts.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KUDU-406) [java client] TestKuduSession is flaky when testing for PleaseThrottleException

2017-07-06 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved KUDU-406.
-
   Resolution: Duplicate
Fix Version/s: n/a

Resolving as a duplicate of the newer KUDU-1521.

> [java client] TestKuduSession is flaky when testing for 
> PleaseThrottleException
> ---
>
> Key: KUDU-406
> URL: https://issues.apache.org/jira/browse/KUDU-406
> Project: Kudu
>  Issue Type: Bug
>  Components: client
>Affects Versions: M4
>Reporter: Jean-Daniel Cryans
> Fix For: n/a
>
>
> From Adar:
> {noformat}
> This code:
> // Test sending edits too fast
> session.setFlushMode(KuduSession.FlushMode.AUTO_FLUSH_BACKGROUND);
> session.setMutationBufferSpace(10);
> // The buffer has a capacity of 10, we insert 21 rows, meaning we fill 
> the first one,
> // force flush, fill a second one before the first one could come back,
> // and the 21st row will be sent back.
> boolean gotException = false;
> for (int i = 50; i < 71; i++) {
>   try {
> session.apply(createInsert(i));
>   } catch (PleaseThrottleException ex) {
> gotException = true;
> assertEquals(70, i);
> // Wait for the buffer to clear
> ex.getDeferred().join(DEFAULT_SLEEP);
> session.apply(ex.getFailedRpc());
> session.flush().join(DEFAULT_SLEEP);
>   }
> }
> assertTrue(gotException); <--
> assertEquals(21, countInRange(50, 71));
> If the client is running particularly slowly, isn't it possible for the 
> background flush to finish before the 11th (or possibly 21st) row is 
> inserted? Then we won't throw a PleaseThrottleException.
> {noformat}
> It's indeed possible and it happened once. We need a way to block the 
> background flushing from happening, maybe through mocking.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KUDU-105) Support for timeouts on client admin ops

2017-07-06 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved KUDU-105.
-
   Resolution: Fixed
Fix Version/s: n/a

This was done over time in both clients.

> Support for timeouts on client admin ops
> 
>
> Key: KUDU-105
> URL: https://issues.apache.org/jira/browse/KUDU-105
> Project: Kudu
>  Issue Type: Improvement
>  Components: api, client
>Affects Versions: M4
>Reporter: Todd Lipcon
>Priority: Minor
> Fix For: n/a
>
>
> Would be nice to be able to specify timeouts and other options on admin ops. 
> Probably in general all such functions should take a struct so it's easy to 
> add new options in ABI-compatible ways.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KUDU-1086) compaction/flush logs should include tablet ID

2017-07-06 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved KUDU-1086.
--
   Resolution: Invalid
Fix Version/s: n/a

Looking at recent logs, it seems we always include a tablet id.

> compaction/flush logs should include tablet ID
> --
>
> Key: KUDU-1086
> URL: https://issues.apache.org/jira/browse/KUDU-1086
> Project: Kudu
>  Issue Type: Improvement
>  Components: ops-tooling
>Affects Versions: Private Beta
>Reporter: Todd Lipcon
> Fix For: n/a
>
>
> Things like delta flushes and delta compactions don't currently list the 
> tablet ID. That makes them a little less than useful.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KUDU-435) Java client cleanup for release

2017-07-06 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved KUDU-435.
-
   Resolution: Fixed
Fix Version/s: n/a

Close to 3 years _after_ Kudu was unveiled, I'm calling this done :)

> Java client cleanup for release
> ---
>
> Key: KUDU-435
> URL: https://issues.apache.org/jira/browse/KUDU-435
> Project: Kudu
>  Issue Type: New Feature
>  Components: client
>Affects Versions: Public beta
>Reporter: Todd Lipcon
>Assignee: Jean-Daniel Cryans
>  Labels: kudu-roadmap
> Fix For: n/a
>
>
> Various items are missing in the Java client:
> - finish synchronized Scanner API
> - double check that APIs are nice and exportable (eg using builders rather 
> than lots of arguments, so we can add new ones compatibly)
> - figure out our shading or classloaders so we don't get bit by dependency 
> disasters



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KUDU-731) [java client] Refactor RPC to avoid the "master hack"

2017-07-06 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved KUDU-731.
-
   Resolution: Fixed
 Assignee: Alexey Serbin
Fix Version/s: 1.5.0

Alexey fixed most of the "hack"; we still have a handle on a master table, but 
it's mostly cleaned up. See 58248841f213a64683ee217f025f0a38a8450f74.

> [java client] Refactor RPC to avoid the "master hack"
> -
>
> Key: KUDU-731
> URL: https://issues.apache.org/jira/browse/KUDU-731
> Project: Kudu
>  Issue Type: Sub-task
>  Components: client
>Affects Versions: M4.5
>Reporter: Andrew Wang
>Assignee: Alexey Serbin
> Fix For: 1.5.0
>
>
> Currently we hack master RPCs by looking for a special tablet named 
> "~~master~~". Let's fix this hack.
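To illustrate the shape of the cleanup (hypothetical types, not the actual
client internals): instead of smuggling master RPCs through a sentinel tablet
id, the client can dispatch on what an RPC actually targets.

{code}
// Hypothetical dispatch sketch, for illustration only.
interface RpcTarget {}

final class TabletTarget implements RpcTarget {
    final String tabletId;
    TabletTarget(String tabletId) { this.tabletId = tabletId; }
}

final class MasterTarget implements RpcTarget {}  // no "~~master~~" sentinel

final class Dispatcher {
    void send(RpcTarget target) {
        if (target instanceof MasterTarget) {
            // Route to the leader master connection directly.
        } else if (target instanceof TabletTarget) {
            // Look up the tablet's location and route to its leader replica.
        }
    }
}
{code}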



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KUDU-724) [java client] Separate interfaces from implementations

2017-07-06 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved KUDU-724.
-
  Resolution: Won't Fix
   Fix Version/s: n/a
Target Version/s:   (was: 1.4.0)

Resolving as Won't Fix, as Todd suggested almost a year ago.

> [java client] Separate interfaces from implementations
> --
>
> Key: KUDU-724
> URL: https://issues.apache.org/jira/browse/KUDU-724
> Project: Kudu
>  Issue Type: Sub-task
>  Components: client
>Affects Versions: M5
>Reporter: Jean-Daniel Cryans
> Fix For: n/a
>
>
> From [~adar]:
> {quote}
> As you clean up the Java client, I'd be a fan of strictly separating 
> interfaces from implementations, and moving the former into a different Java 
> package. In general this is painful and annoying when there's only one 
> implementation, but I think it's valuable when you're trying to showcase a 
> public API because users won't be distracted by any implementation details at 
> all.
> {quote}
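A minimal sketch of the kind of separation Adar describes (hypothetical names,
for illustration only): the public surface is an interface, the implementation
is a separate class handed out by a factory, so users never depend on
implementation details.

{code}
// Hypothetical names, for illustration only.
interface SessionApi {                    // would live in a public api package
    void apply(String operation);
    void flush();
}

final class DefaultSession implements SessionApi {  // internal package
    @Override public void apply(String operation) { /* buffer the op */ }
    @Override public void flush() { /* send buffered ops */ }
}

final class Sessions {
    static SessionApi create() { return new DefaultSession(); }
}
{code}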



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KUDU-2060) Show primary keys in the master's table web UI page

2017-06-30 Thread Jean-Daniel Cryans (JIRA)
Jean-Daniel Cryans created KUDU-2060:


 Summary: Show primary keys in the master's table web UI page
 Key: KUDU-2060
 URL: https://issues.apache.org/jira/browse/KUDU-2060
 Project: Kudu
  Issue Type: Improvement
  Components: master
Reporter: Jean-Daniel Cryans


We used to be able to know what the primary key for a table was by just going 
to the master's web UI and looking at the table's page, because the whole 
CREATE TABLE statement was there. Now we have an abridged version, so we don't 
even show what the primary key is anymore.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KUDU-1905) Allow reinserts on pk-only tables

2017-06-30 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-1905:
-
Summary: Allow reinserts on pk-only tables  (was: TS can not start)

> Allow reinserts on pk-only tables
> -
>
> Key: KUDU-1905
> URL: https://issues.apache.org/jira/browse/KUDU-1905
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.2.0
> Environment: kudu 1.2, CDH 5.10
>Reporter: YulongZ
>Assignee: David Alves
>Priority: Blocker
> Fix For: 1.3.0
>
>
> TabletServer cannot start; here is the log:
> {code}
> Mar 1, 8:40:03.872 PM ERROR   compaction.cc:966   
> Status: Corruption: empty changelist - expected column updates Unable to 
> decode changelist.
> Source Row: RowIdxInBlock: 0; Base: (string party_id=, string 
> cust_id=, string source_system=); Undo Mutations: 
> [@6096334378197721088(DELETE)]; Redo Mutations: 
> [@6096334378560618496(DELETE), @6096334430649839616([invalid: Corruption: 
> empty changelist - expected column updates])];
> Dest Row: RowIdxInBlock: 0; Base: (string party_id=, string 
> cust_id=, string source_system=); Undo Mutations: 
> [@6096334378560618496(DELETE)]; Redo Mutations: 
> [@6096334378197721088(DELETE)];
> Mar 1, 8:40:03.872 PM FATAL   tablet_peer_mm_ops.cc:128   
> Check failed: _s.ok() FlushMRS failed on 325bb987ab604d8d9629f8ba4153f7d6: 
> Corruption: Flush to disk failed: empty changelist - expected column updates
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KUDU-2057) Dynamic budget for DRS compactions

2017-06-29 Thread Jean-Daniel Cryans (JIRA)
Jean-Daniel Cryans created KUDU-2057:


 Summary: Dynamic budget for DRS compactions
 Key: KUDU-2057
 URL: https://issues.apache.org/jira/browse/KUDU-2057
 Project: Kudu
  Issue Type: Improvement
  Components: tablet
Reporter: Jean-Daniel Cryans


Clusters have busier and quieter periods, so by default Kudu leverages the 
latter to schedule compactions because during the former it's mostly flushing.

A further improvement would be to recognize that a tserver is mostly scheduling 
DRS compactions and start giving those compactions bigger and bigger budgets. 
Compacting more DRSes at a time lowers the overall write amplification, but it 
runs the risk of compacting for too long and not being able to schedule 
important flushes. We could lower that risk by re-adding an emergency flush 
thread, and/or by making it possible to cancel tasks.
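A toy sketch of the heuristic (all names and numbers invented, not Kudu's
actual maintenance manager): grow the budget while the scheduler keeps picking
DRS compactions, and snap back the moment a flush is needed.

{code}
// Hypothetical adaptive budget, for illustration only.
final class CompactionBudget {
    private static final int MIN_MB = 128;
    private static final int MAX_MB = 1024;
    private int budgetMb = MIN_MB;

    void onCompactionScheduled() {
        // Scheduling mostly compactions implies a quiet period: afford more.
        budgetMb = Math.min(MAX_MB, budgetMb * 2);
    }

    void onFlushNeeded() {
        // Memory pressure is back: shrink so flushes are never starved.
        budgetMb = MIN_MB;
    }

    int currentBudgetMb() { return budgetMb; }
}
{code}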



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-2052) Use XFS_IOC_UNRESVSP64 ioctl to punch holes on xfs filesystems

2017-06-22 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060134#comment-16060134
 ] 

Jean-Daniel Cryans commented on KUDU-2052:
--

What should we recommend to folks on xfs and el6 who are upgrading to 1.4?

> Use XFS_IOC_UNRESVSP64 ioctl to punch holes on xfs filesystems
> --
>
> Key: KUDU-2052
> URL: https://issues.apache.org/jira/browse/KUDU-2052
> Project: Kudu
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.4.0
>Reporter: Adar Dembo
>Assignee: Adar Dembo
>Priority: Critical
>
> One of the changes in Kudu 1.4 is a more comprehensive repair functionality 
> in log block manager startup. Amongst other things this includes a heuristic 
> to detect whether an LBM container consumes more disk space than it should, 
> based on the live blocks in the container. If the heuristic fires, the LBM 
> reclaims the extra disk space by truncating the end of the container and 
> repunching out all of the dead blocks in the container.
> We brought up Kudu 1.4 on a large production cluster running xfs and observed 
> pathologically slow startup times. On one node, there was a three hour gap 
> between the last bit of data directory processing and the end of LBM startup 
> in general. This time can only be attributed to hole repunching, which is 
> executed by the same set of thread pools that open the data directories.
> Further research revealed that on xfs in el6, a hole punch via fallocate() 
> _always_ includes an fsync() (in the kernel), even if the underlying data was 
> already punched out. This isn't the case with ext4, nor does it appear to be 
> the case with xfs in more modern kernels (though this hasn't been confirmed).
> xfs provides the [XFS_IOC_UNRESVSP64 
> ioctl|https://linux.die.net/man/3/xfsctl], which can be used to deallocate 
> space from a file. That sounds an awful lot like hole punching, and some 
> quick performance tests show that it doesn't incur the cost of an fsync(). We 
> should switch over to it when punching holes on xfs. Certainly on older (i.e. 
> el6) kernels, and potentially everywhere for simplicity's sake.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-1956) Crash with "rowset selected for compaction but not available anymore"

2017-06-12 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046792#comment-16046792
 ] 

Jean-Daniel Cryans commented on KUDU-1956:
--

Todd applied a temporary workaround in https://gerrit.cloudera.org/#/c/7120/2

> Crash with "rowset selected for compaction but not available anymore"
> -
>
> Key: KUDU-1956
> URL: https://issues.apache.org/jira/browse/KUDU-1956
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.3.0
>Reporter: Todd Lipcon
>Priority: Critical
>
> I loaded 1T of data into a server with 8 MM threads configured, and a patch 
> to make the MM thread wake up and do scheduling as soon as any prior op 
> finished. After a day or two of runtime the TS crashed with:
> E0324 14:28:19.733708  5801 tablet.cc:1207] T 
> cf12905ff7d84fa0b4148aab292f0c40 P 40b48ecb131449c58df26f62ccc35538: Rowset 
> selected for compaction but not available anymore: RowSet(22755)
> E0324 14:28:19.733762  5801 tablet.cc:1207] T 
> cf12905ff7d84fa0b4148aab292f0c40 P 40b48ecb131449c58df26f62ccc35538: Rowset 
> selected for compaction but not available anymore: RowSet(24031)
> F0324 14:28:19.733777  5801 tablet.cc:1210] T 
> cf12905ff7d84fa0b4148aab292f0c40 P 40b48ecb131449c58df26f62ccc35538: Was 
> unable to find all rowsets selected for compaction



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-2034) For kudu-jepsen scenario, add parameters to induce more frequent fail-over events on the server side

2017-06-08 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043212#comment-16043212
 ] 

Jean-Daniel Cryans commented on KUDU-2034:
--

bq. The results of those runs are not published elsewhere. Thank you for 
pointing at that – I'll remove the first statement from the description.

No need to edit; I was merely expressing a desire to see those results 
somewhere public.

bq. If there is a need to start running the kudu-jepsen on ASF infra, I think 
it's worth opening a separate task for that.

I think we should have a jira about setting this up, yeah. Having this in the 
open would inspire more confidence in Kudu's testing.

> For kudu-jepsen scenario, add parameters to induce more frequent fail-over 
> events on the server side
> 
>
> Key: KUDU-2034
> URL: https://issues.apache.org/jira/browse/KUDU-2034
> Project: Kudu
>  Issue Type: Task
>Affects Versions: 1.4.0
>Reporter: Alexey Serbin
>  Labels: consistency, kudu-jepsen, newbie
>
> The kudu-jepsen setup has been running with no errors for a long time 
> already.  Currently, the set of parameters for the back-end components is 
> standard -- everything is set to defaults, but the experimental flags are 
> enabled.  Yes, it made sense to run with the default parameters: originally 
> we wanted to make sure the test did not fail with production-like settings.  
> Now we can start exercising 'artificially congested' scenarios.
> Let's set parameters to induce frequent re-election/fail-over events at the 
> server side.  The idea is to bring in a set of parameters used by certain tests 
> in {{src/kudu/integration-tests}}.  E.g., {{catalog_manager_tsk-itest.cc}} 
> contains an example of parameters to enable frequent re-elections among 
> masters; the raft-related part can be used as a starting point to update 
> tserver's parameters for kudu-jepsen.
> An additional step might be injecting latency into tservers' operations.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KUDU-2034) For kudu-jepsen scenario, add parameters to induce more frequent fail-over events on the server side

2017-06-07 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16041172#comment-16041172
 ] 

Jean-Daniel Cryans commented on KUDU-2034:
--

bq. The kudu-jepsen setup has been running with no errors for a long time 
already.

This isn't something that's running on ASF infra and the results aren't 
published either, right? It'd be nice if we could make the above claim and also 
back it.

> For kudu-jepsen scenario, add parameters to induce more frequent fail-over 
> events on the server side
> 
>
> Key: KUDU-2034
> URL: https://issues.apache.org/jira/browse/KUDU-2034
> Project: Kudu
>  Issue Type: Task
>Affects Versions: 1.4.0
>Reporter: Alexey Serbin
>  Labels: consistency, kudu-jepsen, newbie
>
> The kudu-jepsen setup has been running with no errors for a long time 
> already.  Currently, the set of parameters for the back-end components is 
> standard -- everything is set to defaults, but the experimental flags are 
> enabled.  Yes, it made sense to run with the default parameters: originally 
> we wanted to make sure the test did not fail with production-like settings.  
> Now we can start exercising 'artificially congested' scenarios.
> Let's set parameters to induce frequent re-election/fail-over events at the 
> server side.  The idea is to bring in a set of parameters used by certain tests 
> in {{src/kudu/integration-tests}}.  E.g., {{catalog_manager_tsk-itest.cc}} 
> contains an example of parameters to enable frequent re-elections among 
> masters; the raft-related part can be used as a starting point to update 
> tserver's parameters for kudu-jepsen.
> An additional step might be injecting latency into tservers' operations.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

