[jira] [Updated] (KUDU-2343) Java client doesn't properly reconnect to leader master when old leader is online

2018-03-13 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2343:
--
Fix Version/s: 1.7.0

> Java client doesn't properly reconnect to leader master when old leader is 
> online
> -
>
> Key: KUDU-2343
> URL: https://issues.apache.org/jira/browse/KUDU-2343
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 1.7.0
>
>
> In the following sequence of events, the Java client doesn't properly fail 
> over to locate a new master, and in fact gets "stuck" until the client is 
> restarted:
> - client connects to the cluster and caches the master locations
> - client opens a table and caches tablet locations
> - the master fails over to a new leader
> - the tablet either goes down or fails over, causing the client to need to 
> update its tablet locations
> In this case, it gets stuck in a retry loop where it will never be able to 
> connect to the new leader master.





[jira] [Updated] (KUDU-2343) Java client doesn't properly reconnect to leader master when old leader is online

2018-03-13 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2343:
--
Fix Version/s: 1.6.1

> Java client doesn't properly reconnect to leader master when old leader is 
> online
> -
>
> Key: KUDU-2343
> URL: https://issues.apache.org/jira/browse/KUDU-2343
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 1.7.0, 1.6.1, 1.8.0
>
>
> In the following sequence of events, the Java client doesn't properly fail 
> over to locate a new master, and in fact gets "stuck" until the client is 
> restarted:
> - client connects to the cluster and caches the master locations
> - client opens a table and caches tablet locations
> - the master fails over to a new leader
> - the tablet either goes down or fails over, causing the client to need to 
> update its tablet locations
> In this case, it gets stuck in a retry loop where it will never be able to 
> connect to the new leader master.





[jira] [Updated] (KUDU-2153) Servers delete tmp files before obtaining directory lock

2018-03-13 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2153:
--
Fix Version/s: 1.7.x

> Servers delete tmp files before obtaining directory lock
> 
>
> Key: KUDU-2153
> URL: https://issues.apache.org/jira/browse/KUDU-2153
> Project: Kudu
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 1.2.0, 1.3.1, 1.4.0, 1.5.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 1.8.0, 1.7.x
>
>
> In FsManager::Open() we currently call DeleteTmpFiles very early, before 
> starting the block manager. This means that, if you accidentally start a 
> tserver while another is running, it's possible for it to delete temporary 
> files that are in-use by the running tserver, causing it to exhibit strange 
> behavior, crash, etc (as in KUDU-2152).
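For illustration only, a minimal sketch of the safer ordering (all names 
invented, not FsManager's actual API): obtain the directory lock first, and 
only then delete tmp files, so a second server fails fast instead of deleting 
files that are still in use.

{code:java}
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class DirOpenSketch {
  static FileLock openDataDir(Path dataDir) throws IOException {
    FileChannel ch = FileChannel.open(dataDir.resolve("dir.lock"),
        StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    FileLock lock = ch.tryLock();
    if (lock == null) {
      throw new IOException("data dir is locked by another running server");
    }
    // Only now is it safe to clean up: we own the directory.
    try (DirectoryStream<Path> tmp = Files.newDirectoryStream(dataDir, "*.tmp")) {
      for (Path p : tmp) {
        Files.delete(p);
      }
    }
    return lock;
  }
}
{code}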





[jira] [Updated] (KUDU-2343) Java client doesn't properly reconnect to leader master when old leader is online

2018-03-13 Thread Alexey Serbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2343:

Affects Version/s: 1.3.0

> Java client doesn't properly reconnect to leader master when old leader is 
> online
> -
>
> Key: KUDU-2343
> URL: https://issues.apache.org/jira/browse/KUDU-2343
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
>
> In the following sequence of events, the Java client doesn't properly fail 
> over to locate a new master, and in fact gets "stuck" until the client is 
> restarted:
> - client connects to the cluster and caches the master locations
> - client opens a table and caches tablet locations
> - the master fails over to a new leader
> - the tablet either goes down or fails over, causing the client to need to 
> update its tablet locations
> In this case, it gets stuck in a retry loop where it will never be able to 
> connect to the new leader master.





[jira] [Commented] (KUDU-2343) Java client doesn't properly reconnect to leader master when old leader is online

2018-03-13 Thread Alexey Serbin (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397861#comment-16397861
 ] 

Alexey Serbin commented on KUDU-2343:
-

I think 1.3.0 is also affected, right?

> Java client doesn't properly reconnect to leader master when old leader is 
> online
> -
>
> Key: KUDU-2343
> URL: https://issues.apache.org/jira/browse/KUDU-2343
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java
>Affects Versions: 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
>
> In the following sequence of events, the Java client doesn't properly fail 
> over to locate a new master, and in fact gets "stuck" until the client is 
> restarted:
> - client connects to the cluster and caches the master locations
> - client opens a table and caches tablet locations
> - the master fails over to a new leader
> - the tablet either goes down or fails over, causing the client to need to 
> update its tablet locations
> In this case, it gets stuck in a retry loop where it will never be able to 
> connect to the new leader master.





[jira] [Updated] (KUDU-2153) Servers delete tmp files before obtaining directory lock

2018-03-13 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-2153:
--
   Resolution: Fixed
Fix Version/s: 1.8.0
   Status: Resolved  (was: In Review)

[~granthenke] think we should cherry-pick this?

> Servers delete tmp files before obtaining directory lock
> 
>
> Key: KUDU-2153
> URL: https://issues.apache.org/jira/browse/KUDU-2153
> Project: Kudu
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 1.2.0, 1.3.1, 1.4.0, 1.5.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 1.8.0
>
>
> In FsManager::Open() we currently call DeleteTmpFiles very early, before 
> starting the block manager. This means that, if you accidentally start a 
> tserver while another is running, it's possible for it to delete temporary 
> files that are in-use by the running tserver, causing it to exhibit strange 
> behavior, crash, etc (as in KUDU-2152).





[jira] [Assigned] (KUDU-2303) Add KuduSchema::ToString implementation

2018-03-13 Thread Fengling Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengling Wang reassigned KUDU-2303:
---

Assignee: Fengling Wang

> Add KuduSchema::ToString implementation
> ---
>
> Key: KUDU-2303
> URL: https://issues.apache.org/jira/browse/KUDU-2303
> Project: Kudu
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 1.6.0
>Reporter: Grant Henke
>Assignee: Fengling Wang
>Priority: Minor
>  Labels: beginner, newbie, starter
>
> Adding a ToString method to KuduSchema and likely KuduColumnSchema would be 
> useful for users to print schema information while debugging or logging. 





[jira] [Created] (KUDU-2346) Document or fix mac pkg-config/PKG_CONFIG_PATH issue

2018-03-13 Thread Grant Henke (JIRA)
Grant Henke created KUDU-2346:
-

 Summary: Document or fix mac pkg-config/PKG_CONFIG_PATH issue
 Key: KUDU-2346
 URL: https://issues.apache.org/jira/browse/KUDU-2346
 Project: Kudu
  Issue Type: Improvement
  Components: build
Reporter: Grant Henke


When running a fresh build on macOS, thirdparty builds can fail with:
{code:java}
++ pkg-config --cflags openssl
Package openssl was not found in the pkg-config search path.
Perhaps you should add the directory containing `openssl.pc'
to the PKG_CONFIG_PATH environment variable
No package 'openssl' found
+ OPENSSL_CFLAGS={code}
A known workaround is to set the following:
{code:java}
export PKG_CONFIG_PATH=/usr/local/opt/openssl/lib/pkgconfig
{code}
We should document or automate this workaround in our builds. 





[jira] [Created] (KUDU-2345) Add developer docs for the python client

2018-03-13 Thread Grant Henke (JIRA)
Grant Henke created KUDU-2345:
-

 Summary: Add developer docs for the python client
 Key: KUDU-2345
 URL: https://issues.apache.org/jira/browse/KUDU-2345
 Project: Kudu
  Issue Type: Improvement
  Components: python
Reporter: Grant Henke


I am far from a Python expert, especially with Cython in the mix, so it took me 
a while just to get started working on the Kudu Python client.

We should document the basic steps for developing and testing the Kudu Python 
client, including environment setup, building, and running the tests (and how to 
run a single test).

For now, I essentially boiled my work down to this:
{code:java}
cd /path/to/kudu
cd build/debug 
make -j4
make install
cd /path/to/kudu/python
git clean -fdx
export KUDU_HOME=/path/to/kudu
pip install -r requirements.txt
python setup.py build_ext
python setup.py test
{code}
 





[jira] [Updated] (KUDU-2342) Non-voter replicas can be promoted and get stuck

2018-03-13 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans updated KUDU-2342:
-
Summary: Non-voter replicas can be promoted and get stuck  (was: Insert 
into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed 
to write batch ")

> Non-voter replicas can be promoted and get stuck
> 
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Assignee: Alexey Serbin
>Priority: Blocker
>  Labels: scalability
> Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on 129 node cluster via Impala, write operation 
> failed with :
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)





[jira] [Updated] (KUDU-2330) Exceptions thrown by Java client have inappropriate stack traces

2018-03-13 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-2330:
--
Fix Version/s: (was: 1.8.0)

> Exceptions thrown by Java client have inappropriate stack traces
> 
>
> Key: KUDU-2330
> URL: https://issues.apache.org/jira/browse/KUDU-2330
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java
>Affects Versions: 1.7.0
>Reporter: Todd Lipcon
>Priority: Major
> Fix For: 1.7.0
>
>
> Currently, the exceptions thrown by the Java client tend to have stack traces 
> showing the point at which some error callback is called. The stack usually 
> leads back to Netty reading a response from the wire, and not from the actual 
> user code which invoked the call.
> For the async client this is somewhat unavoidable, and I think people have 
> gotten used to stack traces in async clients being rather useless. But, in 
> the synchronous wrapper, we should rewrite the stack traces so that the 
> user's actual call stack is preserved.
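As a hedged illustration of the proposed fix (invented names, not the client's 
actual internals): a synchronous wrapper can re-point the exception's stack 
trace at the caller while keeping the original async failure as the cause.

{code:java}
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

final class SyncStackRewrite {
  static <T> T join(Future<T> rpc) throws Exception {
    try {
      return rpc.get();
    } catch (ExecutionException e) {
      Throwable cause = e.getCause();
      // Preserve the failure as the cause, but point the stack trace at the
      // user's calling frame instead of the Netty/callback frames.
      Exception rewritten = new Exception(String.valueOf(cause), cause);
      rewritten.setStackTrace(Thread.currentThread().getStackTrace());
      throw rewritten;
    }
  }
}
{code}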





[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397618#comment-16397618
 ] 

Todd Lipcon commented on KUDU-2342:
---

{code}
if (s.ok() &&
peer_pb &&
peer_pb->member_type() == RaftPeerPB::NON_VOTER &&
peer_pb->attrs().promote()) {
  // This peer is ready to promote.
  //
  // TODO(mpercy): Should we introduce a function SafeToPromote() that
  // does the same calculation as SafeToEvict() but for adding a VOTER?
  NotifyObserversOfPeerToPromote(peer->uuid());
{code}

I think Mike's TODO here is relevant. Basically we ended up proposing an 
uncommittable config change here.
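For illustration, a hedged sketch of what a SafeToPromote()-style guard might 
check before issuing the promotion; the names and inputs are invented, not 
Kudu's actual consensus code.

{code:java}
// Only promote a NON_VOTER whose next needed op is still covered by retained
// logs and whose last exchange with the leader succeeded.
final class PromotionGuard {
  static boolean safeToPromote(long peerLastReceivedIndex,
                               long earliestRetainedIndex,
                               boolean lastCommunicationOk) {
    // If the op the peer needs next has already been GCed, it can never catch
    // up, and promoting it would propose an uncommittable config change.
    boolean catchUpPossible = peerLastReceivedIndex + 1 >= earliestRetainedIndex;
    return catchUpPossible && lastCommunicationOk;
  }
}
{code}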

> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with 
> "Failed to write batch "
> -
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Assignee: Alexey Serbin
>Priority: Blocker
>  Labels: scalability
> Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on 129 node cluster via Impala, write operation 
> failed with :
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)





[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397617#comment-16397617
 ] 

Todd Lipcon commented on KUDU-2342:
---

I think being more conservative might be good in general -- e.g. after any tablet 
copy completes, include the newly-copied node in the log retention calculation for 
some number of seconds/minutes.

More directly, though, I think it's bad to promote a node whose last 
communication was not successful.

> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with 
> "Failed to write batch "
> -
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Assignee: Alexey Serbin
>Priority: Blocker
>  Labels: scalability
> Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on 129 node cluster via Impala, write operation 
> failed with :
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)





[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread David Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397597#comment-16397597
 ] 

David Alves commented on KUDU-2342:
---

Seems like we should be more conservative with the first rule (for voters only) 
and also add the non-voter which we intend to promote.

thoughts?

> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with 
> "Failed to write batch "
> -
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Assignee: Alexey Serbin
>Priority: Blocker
>  Labels: scalability
> Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on 129 node cluster via Impala, write operation 
> failed with :
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)





[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread David Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397594#comment-16397594
 ] 

David Alves commented on KUDU-2342:
---

From what I read of the code, there are two main GC mechanisms:
 * one only for voters, which makes sure never to GC past the committed 
index
 * one for all peers, which is more conservative in that it only GCs ops 
everyone already has, but which has an upper bound of 80 segments

In this case we GC'd logs after the tablet copy as if the peer were just a 
non-voter (the second rule), meaning the non-voter can't catch up, but we then 
still promoted it to voter, pushing a config change that can never be committed.
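A hedged sketch of the two watermarks as described above (illustrative names 
and shapes, not Kudu's actual log-retention code).

{code:java}
// Everything below the returned index is eligible for GC.
final class LogGcWatermarks {
  static final int MAX_SEGMENTS_TO_RETAIN = 80; // the upper bound mentioned above

  // Rule 1 (voters only): never GC at or beyond the committed index.
  static long voterGcWatermark(long committedIndex) {
    return committedIndex;
  }

  // Rule 2 (all peers): normally GC only ops every peer already has, but cap
  // total retention at MAX_SEGMENTS_TO_RETAIN segments, so a peer lagging more
  // than the cap gets left behind -- which is what happened here.
  static long allPeersGcWatermark(long minIndexReplicatedByAllPeers,
                                  long firstIndexWithinSegmentCap) {
    return Math.max(minIndexReplicatedByAllPeers, firstIndexWithinSegmentCap);
  }
}
{code}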

 

> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with 
> "Failed to write batch "
> -
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Assignee: Alexey Serbin
>Priority: Blocker
>  Labels: scalability
> Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on 129 node cluster via Impala, write operation 
> failed with :
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)





[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397537#comment-16397537
 ] 

Todd Lipcon commented on KUDU-2342:
---

For reference, here's the ksck report on this tablet:
{code}
Tablet b8431200388d486995a4426c88bc06a2 of table 
'impala::tpch_3_kudu.lineitem' is under-replicated: 1 replica(s) not RUNNING
  14b2404c50b540ae8957adff9a6c7548 (vd1336.halxg.cloudera.com:7050): RUNNING
  a260dca5a9c846e99cb621881a7b86b8 (vc1515.halxg.cloudera.com:7050): RUNNING 
[LEADER]
  e3fdd8da21a643aba21b7acdd6b17499 (va1038.halxg.cloudera.com:7050): TS 
unavailable
  f7376c96c6b64e7fa6a7bfc84fd0cd64 (vc1534.halxg.cloudera.com:7050): RUNNING 
[NONVOTER]

2 replicas' active configs differ from the master's.
  All the peers reported by the master and tablet servers are:
  A = 14b2404c50b540ae8957adff9a6c7548
  B = a260dca5a9c846e99cb621881a7b86b8
  C = e3fdd8da21a643aba21b7acdd6b17499
  D = f7376c96c6b64e7fa6a7bfc84fd0cd64

The consensus matrix is:
 Config source |        Replicas        | Current term | Config index | Committed?
---------------+------------------------+--------------+--------------+-----------
 master        | A   B*  C   D~         |              |              | Yes
 A             | A   B*  C   D          | 1            | 1233         | No
 B             | A   B*  C   D          | 1            | 1233         | No
 C             | [config not available] |              |              |
 D             | A   B*  C   D~         | 1            | 1141         | Yes
Table impala::tpch_3_kudu.lineitem has 1 under-replicated tablet(s)
{code}

It would be nice if ksck could report some info on opid indexes too, but that's 
a separate improvement.

> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with 
> "Failed to write batch "
> -
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Assignee: Alexey Serbin
>Priority: Blocker
>  Labels: scalability
> Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on 129 node cluster via Impala, write operation 
> failed with :
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)





[jira] [Updated] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2342:
--
Priority: Blocker  (was: Critical)

> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with 
> "Failed to write batch "
> -
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Assignee: Alexey Serbin
>Priority: Blocker
>  Labels: scalability
> Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on 129 node cluster via Impala, write operation 
> failed with :
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)





[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397522#comment-16397522
 ] 

Todd Lipcon commented on KUDU-2342:
---

Reconstructing the timeline a bit:

- 07:20:54.751998: peer e3fdd8 fell behind the retention and "can never be 
caught up"
- 07:20:54.766460: peer f7376c added as a NON_VOTER
- 07:20:55.268965: tablet copy starts to f7376c
- 07:21:34.559736: tablet copy ends
- 07:21:34.779841: logs held by the tablet copy session are GCed
- 07:21:34.790443: the new NON_VOTER peer is already unable to be caught up 
because the logs just got GCed (*hmm, interesting*)
- 07:21:34.790797: nevertheless, the leader issues a config change to promote 
f7376c to VOTER

Now we have 2/4 VOTER replicas which can never be caught up -- the original bad 
one, and the one we just promoted. Hence we can't make progress.

It seems there are two serious issues at play here:
- why did we not retain the logs between the tablet copy session finishing and 
catching up the peer? perhaps because the non-voter isn't included in the log 
retention calculations and was more than 80 segments behind?
- why did we promote a non-voter that wasn't relatively up to date or in a 
"good" state?





> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with 
> "Failed to write batch "
> -
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Assignee: Alexey Serbin
>Priority: Critical
>  Labels: scalability
> Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on 129 node cluster via Impala, write operation 
> failed with :
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)





[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397509#comment-16397509
 ] 

Todd Lipcon commented on KUDU-2342:
---

The change config which is pending is:
{code}
1.1233@6229814865004195840  REPLICATE CHANGE_CONFIG_OP
id { term: 1 index: 1233 } timestamp: 6229814865004195840 op_type: 
CHANGE_CONFIG_OP change_config_record { tablet_id: 
"b8431200388d486995a4426c88bc06a2" old_config { opid_index: 1141 
OBSOLETE_local: false peers { permanent_uuid: 
"a260dca5a9c846e99cb621881a7b86b8" member_type: VOTER last_known_addr { host: 
"vc1515.halxg.cloudera.com" port: 7050 } } peers { permanent_uuid: 
"e3fdd8da21a643aba21b7acdd6b17499" member_type: VOTER last_known_addr { host: 
"va1038.halxg.cloudera.com" port: 7050 } } peers { permanent_uuid: 
"14b2404c50b540ae8957adff9a6c7548" member_type: VOTER last_known_addr { host: 
"vd1336.halxg.cloudera.com" port: 7050 } } peers { permanent_uuid: 
"f7376c96c6b64e7fa6a7bfc84fd0cd64" member_type: NON_VOTER last_known_addr { 
host: "vc1534.halxg.cloudera.com" port: 7050 } attrs { promote: true } } } 
new_config { opid_index: 1233 OBSOLETE_local: false peers { permanent_uuid: 
"a260dca5a9c846e99cb621881a7b86b8" member_type: VOTER last_known_addr { host: 
"vc1515.halxg.cloudera.com" port: 7050 } } peers { permanent_uuid: 
"e3fdd8da21a643aba21b7acdd6b17499" member_type: VOTER last_known_addr { host: 
"va1038.halxg.cloudera.com" port: 7050 } } peers { permanent_uuid: 
"14b2404c50b540ae8957adff9a6c7548" member_type: VOTER last_known_addr { host: 
"vd1336.halxg.cloudera.com" port: 7050 } } peers { permanent_uuid: 
"f7376c96c6b64e7fa6a7bfc84fd0cd64" member_type: VOTER last_known_addr { host: 
"vc1534.halxg.cloudera.com" port: 7050 } attrs { promote: false } } } }
{code}

That is to say, it has a pending promotion of peer 
f7376c96c6b64e7fa6a7bfc84fd0cd64 (vc1534) from NON_VOTER to VOTER.

> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with 
> "Failed to write batch "
> -
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Assignee: Alexey Serbin
>Priority: Critical
>  Labels: scalability
> Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on 129 node cluster via Impala, write operation 
> failed with :
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)





[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397505#comment-16397505
 ] 

Todd Lipcon commented on KUDU-2342:
---

It appears what happened is that the leader actually got 80 segments ahead of 
the two followers, and since our default log_max_segments_to_retain=80, it GCed 
the logs anyway. Then it couldn't replicate to either follower and the tablet 
got stuck. I checked the earliest WAL on that server (wal-01141) and its 
earliest op is 1.1154.

What's a bit odd here is that the leader watermark thinks that 1232 is both the 
committed index and the majority-replicated index, but it wants to send ops 1143 
and 1055 to the two peers. Also interesting is that it appears this tablet is 
currently in a configuration with four VOTER replicas.

> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with 
> "Failed to write batch "
> -
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Assignee: Alexey Serbin
>Priority: Critical
>  Labels: scalability
> Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on 129 node cluster via Impala, write operation 
> failed with :
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)





[jira] [Updated] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-2342:
--
Attachment: tablet-info.html

> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with 
> "Failed to write batch "
> -
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Assignee: Alexey Serbin
>Priority: Critical
>  Labels: scalability
> Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on 129 node cluster via Impala, write operation 
> failed with :
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)





[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397479#comment-16397479
 ] 

Todd Lipcon commented on KUDU-2342:
---

The server vc1515 has the following spewing in its logs:

{code}
I0313 11:56:27.615651 43703 consensus_peers.cc:230] T 
b8431200388d486995a4426c88bc06a2 P a260dca5a9c846e99cb621881a7b86b8 -> Peer 
f7376c96c6b64e7fa6a7bfc84fd0cd64 (vc1534.halxg.cloudera.com:7050): Could not 
obtain request from queue for peer: f7376c96c6b64e7fa6a7bfc84fd0cd64. Status: 
Not found: Failed to read ops 1143..1221: Segment 1130 which contained index 
1143 has been GCed
I0313 11:56:27.973654 43703 consensus_peers.cc:230] T 
b8431200388d486995a4426c88bc06a2 P a260dca5a9c846e99cb621881a7b86b8 -> Peer 
e3fdd8da21a643aba21b7acdd6b17499 (va1038.halxg.cloudera.com:7050): Could not 
obtain request from queue for peer: e3fdd8da21a643aba21b7acdd6b17499. Status: 
Not found: Failed to read ops 1055..1221: Segment 1043 which contained index 
1055 has been GCed
{code}

In other words, it appears to have evicted the log segments necessary to catch 
up both of its followers. Thus it's unable to replicate and commit any writes, 
so the write here timed out. Instead of letting it time out we should of course 
respond more rapidly saying that the tablet is unavailable, but that's a 
separate issue.

I guess in this case we can't recover because it won't evict a follower either, 
since it knows that it wouldn't be able to commit the config change. So, how 
did it get into the state where it had GCed logs behind the majority_replicated 
watermark? [~aserbin] said he can take a look.

> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with 
> "Failed to write batch "
> -
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Priority: Major
>  Labels: scalability
> Attachments: Impala query profile.txt
>
>
> While loading TPCH 30TB on 129 node cluster via Impala, write operation 
> failed with :
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)





[jira] [Assigned] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reassigned KUDU-2342:
-

Assignee: Alexey Serbin
Target Version/s: 1.7.0
Priority: Critical  (was: Major)

> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with 
> "Failed to write batch "
> -
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Assignee: Alexey Serbin
>Priority: Critical
>  Labels: scalability
> Attachments: Impala query profile.txt
>
>
> While loading TPCH 30TB on 129 node cluster via Impala, write operation 
> failed with :
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)





[jira] [Updated] (KUDU-2330) Exceptions thrown by Java client have inappropriate stack traces

2018-03-13 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2330:
--
Fix Version/s: 1.7.0

> Exceptions thrown by Java client have inappropriate stack traces
> 
>
> Key: KUDU-2330
> URL: https://issues.apache.org/jira/browse/KUDU-2330
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java
>Affects Versions: 1.7.0
>Reporter: Todd Lipcon
>Priority: Major
> Fix For: 1.7.0, 1.8.0
>
>
> Currently, the exceptions thrown by the Java client tend to have stack traces 
> showing the point at which some error callback is called. The stack usually 
> leads back to Netty reading a response from the wire, and not from the actual 
> user code which invoked the call.
> For the async client this is somewhat unavoidable, and I think people have 
> gotten used to stack traces in async clients being rather useless. But, in 
> the synchronous wrapper, we should rewrite the stack traces so that the 
> user's actual call stack is preserved.





[jira] [Resolved] (KUDU-2338) Java decimal predicates are not coerced to match the column scale

2018-03-13 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke resolved KUDU-2338.
---
   Resolution: Fixed
Fix Version/s: 1.7.0

Resolved via 
[9749d4c|https://github.com/apache/kudu/commit/9749d4cde4c475783b0e936d641749b3394c5aad].

> Java decimal predicates are not coerced to match the column scale
> -
>
> Key: KUDU-2338
> URL: https://issues.apache.org/jira/browse/KUDU-2338
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.7.0
>Reporter: Grant Henke
>Assignee: Grant Henke
>Priority: Critical
> Fix For: 1.7.0
>
>
> In the Java client we need to coerce the BigDecimal values to the expected 
> scale to ensure they can be correctly decoded server side. Though this was 
> being done in the deprecated ColumnRangePredicate implementation it was not 
> in the KuduPredicate implementation. 
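A minimal illustration of the coercion using plain java.math (not the client's 
actual code); the column scale of 4 is an assumption for the example.

{code:java}
import java.math.BigDecimal;
import java.math.RoundingMode;

// Rescale a predicate value to the column's declared scale before encoding,
// so the server can decode it against the column correctly.
public class DecimalCoercion {
  public static void main(String[] args) {
    int columnScale = 4; // assumed column scale, for illustration
    BigDecimal predicateValue = new BigDecimal("1.5");
    // UNNECESSARY throws ArithmeticException if the rescale would lose
    // precision, i.e. the value can't be represented at the column's scale.
    BigDecimal coerced = predicateValue.setScale(columnScale, RoundingMode.UNNECESSARY);
    System.out.println(coerced); // prints 1.5000
  }
}
{code}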





[jira] [Created] (KUDU-2344) RPCs should fail with 'not authorized' when plain user is not in ACL

2018-03-13 Thread Dan Burkert (JIRA)
Dan Burkert created KUDU-2344:
-

 Summary: RPCs should fail with 'not authorized' when plain user is 
not in ACL
 Key: KUDU-2344
 URL: https://issues.apache.org/jira/browse/KUDU-2344
 Project: Kudu
  Issue Type: Bug
  Components: client
Affects Versions: 1.7.0
Reporter: Dan Burkert


See the TODOs in TestSecurityContextRealUser.java and client-test.cc.





[jira] [Updated] (KUDU-2322) Leader spews logs when follower falls behind log GC

2018-03-13 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2322:
--
Fix Version/s: 1.7.0

> Leader spews logs when follower falls behind log GC
> ---
>
> Key: KUDU-2322
> URL: https://issues.apache.org/jira/browse/KUDU-2322
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.7.0
>Reporter: Todd Lipcon
>Assignee: Alexey Serbin
>Priority: Critical
> Fix For: 1.7.0, 1.8.0
>
>
> I'm running a YCSB-based write stress test and found that one of the 
> followers fell behind enough that its logs got GCed by the leader. At this 
> point, the leader started logging about 100 messages per second indicating 
> that it could not obtain a request for this peer.
> I believe this is a regression since 1.6, since before 3-4-3 replication we 
> would have evicted the replica as soon as it fell behind GC.





[jira] [Resolved] (KUDU-2331) Use tablet_id filter for 'move_replica' while running ksck-based pre-flight consistency check

2018-03-13 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke resolved KUDU-2331.
---
   Resolution: Fixed
Fix Version/s: 1.8.0
   1.7.0

> Use tablet_id filter for 'move_replica' while running ksck-based pre-flight 
> consistency check
> -
>
> Key: KUDU-2331
> URL: https://issues.apache.org/jira/browse/KUDU-2331
> Project: Kudu
>  Issue Type: Improvement
>  Components: ksck, supportability
>Reporter: Alexey Serbin
>Assignee: Will Berkeley
>Priority: Major
> Fix For: 1.7.0, 1.8.0
>
>
> {{kudu tablet change_config move_replica}} could use the {{tablet_id}} filter to 
> perform its pre-flight consistency check.  Right now, it always fails with a 
> {{Network error: ksck pre-move health check failed: Not all Tablet Servers 
> are reachable}} error whenever any tserver in the cluster is dead.  Even if the 
> dead server is not involved in serving a replica of the tablet, the command 
> doesn't allow users to run it.
> The tool should allow moving a tablet replica when the destination server and 
> all tablet replicas are alive.





[jira] [Commented] (KUDU-2343) Java client doesn't properly reconnect to leader master when old leader is online

2018-03-13 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397450#comment-16397450
 ] 

Todd Lipcon commented on KUDU-2343:
---

The issue here appears to be that the ConnectionCache is using the server UUID 
as the key. In the case of the masters, the client does not actually use a UUID 
to identify a server, so even though it learns that the leader master has 
changed, when it attempts to send an RPC to it, it mistakenly pulls the 
_wrong connection_ out of the cache. Thus it thinks it's sending an RPC to the 
new leader but still sends it to the old one, which faithfully responds that it 
is not the leader. This goes on until the client is restarted (or the old 
leader master happens to regain leadership).

I checked this back a bunch of versions and it appears it was introduced 
between 1.2 and 1.3, when we did some pretty serious refactoring on the Java 
client.
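A hedged sketch of the aliasing described above (invented names, not the real 
ConnectionCache): with a UUID key the first cached connection can keep winning 
after leadership changes, while an address key cannot alias distinct masters.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: shows why keying master connections by UUID can hand
// back a stale connection, and why keying by address avoids the aliasing.
final class ConnectionCacheSketch {
  static final class Connection {
    final String target;
    Connection(String target) { this.target = target; }
  }

  private final Map<String, Connection> cache = new ConcurrentHashMap<>();

  // Buggy pattern: if the UUID key doesn't uniquely identify the server the
  // client means to reach, the entry cached for the old leader is returned
  // even after the client learns about the new leader.
  Connection byUuid(String serverUuid, String intendedHostPort) {
    return cache.computeIfAbsent(serverUuid, k -> new Connection(intendedHostPort));
  }

  // Safer pattern for masters: the address is what the client actually knows,
  // so distinct masters can never share a cache entry.
  Connection byAddress(String hostPort) {
    return cache.computeIfAbsent(hostPort, k -> new Connection(hostPort));
  }
}
{code}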

> Java client doesn't properly reconnect to leader master when old leader is 
> online
> -
>
> Key: KUDU-2343
> URL: https://issues.apache.org/jira/browse/KUDU-2343
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java
>Affects Versions: 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
>
> In the following sequence of events, the Java client doesn't properly fail 
> over to locate a new master, and in fact gets "stuck" until the client is 
> restarted:
> - client connects to the cluster and caches the master locations
> - client opens a table and caches tablet locations
> - the master fails over to a new leader
> - the tablet either goes down or fails over, causing the client to need to 
> update its tablet locations
> In this case, it gets stuck in a retry loop where it will never be able to 
> connect to the new leader master.





[jira] [Created] (KUDU-2343) Java client doesn't properly reconnect to leader master when old leader is online

2018-03-13 Thread Todd Lipcon (JIRA)
Todd Lipcon created KUDU-2343:
-

 Summary: Java client doesn't properly reconnect to leader master 
when old leader is online
 Key: KUDU-2343
 URL: https://issues.apache.org/jira/browse/KUDU-2343
 Project: Kudu
  Issue Type: Bug
  Components: client, java
Affects Versions: 1.6.0, 1.5.0, 1.4.0, 1.3.1, 1.7.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon


In the following sequence of events, the Java client doesn't properly fail over 
to locate a new master, and in fact gets "stuck" until the client is restarted:
- client connects to the cluster and caches the master locations
- client opens a table and caches tablet locations
- the master fails over to a new leader
- the tablet either goes down or fails over, causing the client to need to 
update its tablet locations

In this case, it gets stuck in a retry loop where it will never be able to 
connect to the new leader master.





[jira] [Updated] (KUDU-2322) Leader spews logs when follower falls behind log GC

2018-03-13 Thread Alexey Serbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2322:

   Resolution: Fixed
Fix Version/s: 1.8.0
   Status: Resolved  (was: In Review)

Fixed with d856107e0b58067f0bbebbbda52f4f67c674897a

> Leader spews logs when follower falls behind log GC
> ---
>
> Key: KUDU-2322
> URL: https://issues.apache.org/jira/browse/KUDU-2322
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.7.0
>Reporter: Todd Lipcon
>Assignee: Alexey Serbin
>Priority: Critical
> Fix For: 1.8.0
>
>
> I'm running a YCSB-based write stress test and found that one of the 
> followers fell behind enough that its logs got GCed by the leader. At this 
> point, the leader started logging about 100 messages per second indicating 
> that it could not obtain a request for this peer.
> I believe this is a regression since 1.6, since before 3-4-3 replication we 
> would have evicted the replica as soon as it fell behind GC.





[jira] [Updated] (KUDU-2330) Exceptions thrown by Java client have inappropriate stack traces

2018-03-13 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-2330:
--
   Resolution: Fixed
Fix Version/s: 1.8.0
   Status: Resolved  (was: In Review)

> Exceptions thrown by Java client have inappropriate stack traces
> 
>
> Key: KUDU-2330
> URL: https://issues.apache.org/jira/browse/KUDU-2330
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java
>Affects Versions: 1.7.0
>Reporter: Todd Lipcon
>Priority: Major
> Fix For: 1.8.0
>
>
> Currently, the exceptions thrown by the Java client tend to have stack traces 
> showing the point at which some error callback is called. The stack usually 
> leads back to Netty reading a response from the wire, and not from the actual 
> user code which invoked the call.
> For the async client this is somewhat unavoidable, and I think people have 
> gotten used to stack traces in async clients being rather useless. But, in 
> the synchronous wrapper, we should rewrite the stack traces so that the 
> user's actual call stack is preserved.





[jira] [Commented] (KUDU-2330) Exceptions thrown by Java client have inappropriate stack traces

2018-03-13 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397438#comment-16397438
 ] 

Todd Lipcon commented on KUDU-2330:
---

Oops, just as I hit "commit" I realized I accidentally listed the wrong JIRA in 
the commit message of ce0db915787b58a79109e6faecc6f1daef9f2850

> Exceptions thrown by Java client have inappropriate stack traces
> 
>
> Key: KUDU-2330
> URL: https://issues.apache.org/jira/browse/KUDU-2330
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java
>Affects Versions: 1.7.0
>Reporter: Todd Lipcon
>Priority: Major
>
> Currently, the exceptions thrown by the Java client tend to have stack traces 
> showing the point at which some error callback is called. The stack usually 
> leads back to Netty reading a response from the wire, and not from the actual 
> user code which invoked the call.
> For the async client this is somewhat unavoidable, and I think people have 
> gotten used to stack traces in async clients being rather useless. But, in 
> the synchronous wrapper, we should rewrite the stack traces so that the 
> user's actual call stack is preserved.





[jira] [Updated] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread Mostafa Mokhtar (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mostafa Mokhtar updated KUDU-2342:
--
Attachment: Impala query profile.txt

> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with 
> "Failed to write batch "
> -
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Priority: Major
>  Labels: scalability
> Attachments: Impala query profile.txt
>
>
> While loading TPCH 30TB on 129 node cluster via Impala, write operation 
> failed with :
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)





[jira] [Created] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread Mostafa Mokhtar (JIRA)
Mostafa Mokhtar created KUDU-2342:
-

 Summary: Insert into Lineitem table with 1340 tablets on 129 node 
cluster failed with "Failed to write batch "
 Key: KUDU-2342
 URL: https://issues.apache.org/jira/browse/KUDU-2342
 Project: Kudu
  Issue Type: Bug
  Components: tablet
Affects Versions: 1.7.0
Reporter: Mostafa Mokhtar


While loading TPCH 30TB on 129 node cluster via Impala, write operation failed 
with :
Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
(vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
180.000s (SENT)







[jira] [Commented] (KUDU-2262) Java client does not retry if no master is a leader

2018-03-13 Thread Hao Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397428#comment-16397428
 ] 

Hao Hao commented on KUDU-2262:
---

Oops, pasted the wrong one. You are right. It is commit f62e4cd0e.

> Java client does not retry if no master is a leader
> ---
>
> Key: KUDU-2262
> URL: https://issues.apache.org/jira/browse/KUDU-2262
> Project: Kudu
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.3.0, 1.4.0, 1.5.0, 1.6.0
>Reporter: Todd Lipcon
>Assignee: Hao Hao
>Priority: Major
> Fix For: 1.7.0
>
>
> In a test case I tried to restart the masters and then start a new client to 
> connect to the cluster. This caused the client to fail because the masters 
> were in the process of a leader election.
> It probably would make more sense for the client to retry a certain number of 
> times.





[jira] [Resolved] (KUDU-2341) [DOCS] Kudu/Sentry docs are out of date

2018-03-13 Thread Thomas Tauber-Marshall (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Tauber-Marshall resolved KUDU-2341.
--
   Resolution: Invalid
Fix Version/s: n/a

Meant to file this as an Impala issue, sorry.

> [DOCS] Kudu/Sentry docs are out of date
> ---
>
> Key: KUDU-2341
> URL: https://issues.apache.org/jira/browse/KUDU-2341
> Project: Kudu
>  Issue Type: Bug
>  Components: documentation
>Reporter: Thomas Tauber-Marshall
>Priority: Critical
>  Labels: docs
> Fix For: n/a
>
>
> The documentation of Impala's support for Sentry authorization on Kudu 
> tables, available here:
> http://impala.apache.org/docs/build/html/topics/impala_kudu.html
> is out of date. It should be updated to include the changes made in 
> IMPALA-5489. In particular:
> - Access is no longer "all or nothing" - we support column-level permissions
> - Permissions do not apply "to all  SQL operations" - we support SELECT- and 
> INSERT-specific permissions. DELETE/UPDATE/UPSERT still require ALL
> We should also document that "all on server" is required to specify 
> "kudu.master_addresses" in a CREATE, even for managed tables, in addition to 
> being required to CREATE any external table.





[jira] [Commented] (KUDU-2340) Unicode support for Kudu table names

2018-03-13 Thread Jim Halfpenny (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397159#comment-16397159
 ] 

Jim Halfpenny commented on KUDU-2340:
-

The Kudu table is created successfully, so I do suspect this is an Impala issue. 
Here's the list I get with two unicode table names:

{{[jhalfpenny@jh-kafka-1 ~]$ kudu table list `hostname`}}
{{impala::kudutest.}}
{{∂}}
{{impala::kudutest.test}}

> Unicode support for Kudu table names
> 
>
> Key: KUDU-2340
> URL: https://issues.apache.org/jira/browse/KUDU-2340
> Project: Kudu
>  Issue Type: Bug
>Reporter: Jim Halfpenny
>Priority: Major
>
> It is possible to create a Kudu table containing unicode characters in its 
> name in Impala by specifying the kudu.table_name attribute. When trying to 
> select from this table, you receive an error that the underlying table does not exist.
> The example below shows a table being created successfully, but failing on a 
> select * statement.
> {{[jh-kafka-2:21000] > create table test2( a int primary key) stored as kudu 
> TBLPROPERTIES('kudu.table_name' = 'impala::kudutest.');}}
> {{Query: create table test2( a int primary key) stored as kudu 
> TBLPROPERTIES('kudu.table_name' = 'impala::kudutest.')}}
> {{WARNINGS: Unpartitioned Kudu tables are inefficient for large data 
> sizes.}}{{Fetched 0 row(s) in 0.64s}}
> {{[jh-kafka-2:21000] > select * from test2;}}
> {{Query: select * from test2}}
> {{Query submitted at: 2018-03-13 08:23:29 (Coordinator: 
> https://jh-kafka-2:25000)}}
> {{ERROR: AnalysisException: Failed to load metadata for table: 'test2'}}
> {{CAUSED BY: TableLoadingException: Error loading metadata for Kudu table 
> impala::kudutest.}}
> {{CAUSED BY: ImpalaRuntimeException: Error opening Kudu table 
> 'impala::kudutest.', Kudu error: The table does not exist: table_name: 
> "impala::kudutest."}}





[jira] [Created] (KUDU-2340) Unicode support for Kudu table names

2018-03-13 Thread Jim Halfpenny (JIRA)
Jim Halfpenny created KUDU-2340:
---

 Summary: Unicode support for Kudu table names
 Key: KUDU-2340
 URL: https://issues.apache.org/jira/browse/KUDU-2340
 Project: Kudu
  Issue Type: Bug
Reporter: Jim Halfpenny


It is possible to create a Kudu table containing unicode characters in its name 
in Impala by specifying the kudu.table_name attribute. When trying to select from 
this table, you receive an error that the underlying table does not exist.

The example below shows a table being created successfully, but failing on a 
select * statement.

{{[jh-kafka-2:21000] > create table test2( a int primary key) stored as kudu 
TBLPROPERTIES('kudu.table_name' = 'impala::kudutest.');}}
{{Query: create table test2( a int primary key) stored as kudu 
TBLPROPERTIES('kudu.table_name' = 'impala::kudutest.')}}
{{WARNINGS: Unpartitioned Kudu tables are inefficient for large data 
sizes.}}{{Fetched 0 row(s) in 0.64s}}
{{[jh-kafka-2:21000] > select * from test2;}}
{{Query: select * from test2}}
{{Query submitted at: 2018-03-13 08:23:29 (Coordinator: 
https://jh-kafka-2:25000)}}
{{ERROR: AnalysisException: Failed to load metadata for table: 'test2'}}
{{CAUSED BY: TableLoadingException: Error loading metadata for Kudu table 
impala::kudutest.}}
{{CAUSED BY: ImpalaRuntimeException: Error opening Kudu table 
'impala::kudutest.', Kudu error: The table does not exist: table_name: 
"impala::kudutest."}}





[jira] [Updated] (KUDU-1693) Flush write operations on per-TS basis and add corresponding limit on the buffer space

2018-03-13 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-1693:
--
Component/s: perf

> Flush write operations on per-TS basis and add corresponding limit on the 
> buffer space
> --
>
> Key: KUDU-1693
> URL: https://issues.apache.org/jira/browse/KUDU-1693
> Project: Kudu
>  Issue Type: Improvement
>  Components: client, perf
>Affects Versions: 1.0.0
>Reporter: Alexey Serbin
>Priority: Major
>
> Currently, the Kudu C++ client buffers incoming operations regardless of 
> their destination tablet server.  Accordingly, it's possible to set a limit on 
> the _total_ buffer space, but not per tablet server.  This approach works but 
> there is room for improvement: there are real-world scenarios where per-TS 
> buffering would be more robust.  Besides, tablet servers impose a limit on the 
> RPC operation size.
> Grouping write operations on per-tablet-server basis would be beneficial for 
> 'one-out-of-many lagging tablet server' scenario.  There, all tablet servers 
> for a table perform well except for one which runs slow due to excessive IO, 
> network issues, failing disk, etc.  The problem is that the lagging server 
> hinders the overall performance.  This is due to the current approach to the 
> buffer turnaround: a buffer is considered 'flushed' and its space is 
> reclaimed at once when _all_ operations in the buffer are completed.  So, if 
> 1000 operations have already been sent but there is 1 operation still in 
> progress, the whole buffer space is 'locked' and cannot be used.
> Accordingly, introducing per-tablet-server buffer limit would help to address 
> scenarios with concurrent writes into tables with extremely diverse partition 
> factors (like 2 and 100).   E.g., consider a case when incoming write 
> operations for tables with diverse partition factors are intermixed in the 
> context of one session.  The problem is that setting the total buffer space 
> limit high is beneficial for the writes into the table with many partitions 
> (assuming those writes are evenly distributed across participating tablets), 
> but it may be over the server-side's limit for max transaction size if those 
> writes are targeted for the table with a few partitions.
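
A minimal sketch of the knob that exists today, assuming the public C++ client 
API ({{KuduSession::SetMutationBufferSpace()}} sets the single session-wide 
limit described above; the per-tablet-server limit proposed here has no API yet):

{noformat}
// Minimal sketch, assuming the public Kudu C++ client API: today's single
// *total* mutation-buffer limit is set per session and shared by all
// tablet servers; the per-tablet-server limit proposed above does not
// exist yet.
#include "kudu/client/client.h"

using kudu::client::KuduClient;
using kudu::client::KuduSession;
using kudu::Status;

Status ConfigureSession(const kudu::client::sp::shared_ptr<KuduClient>& client) {
  kudu::client::sp::shared_ptr<KuduSession> session = client->NewSession();
  // One lagging tablet server can pin this entire 7 MiB budget, since the
  // buffer is reclaimed only when *all* operations in it have completed.
  Status s = session->SetMutationBufferSpace(7 * 1024 * 1024);
  if (!s.ok()) return s;
  return session->SetFlushMode(KuduSession::AUTO_FLUSH_BACKGROUND);
}
{noformat}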



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2295) nullptr dereference while scanning on already shutdown tablet replica

2018-03-13 Thread Alexey Serbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2295:

   Resolution: Fixed
Fix Version/s: 1.7.0
   Status: Resolved  (was: In Review)

Fixed with fa2b495487e520f911a0ddbd8b2bf52bfe01e28e

> nullptr dereference while scanning on already shutdown tablet replica
> -
>
> Key: KUDU-2295
> URL: https://issues.apache.org/jira/browse/KUDU-2295
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.7.0
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
> Fix For: 1.7.0
>
>
> While running {{raft_consensus_stress-itest}}, I found one of the tablet 
> servers crashed with the following stack trace:
> {noformat}
> *** Aborted at 1518480865 (unix time) try "date -d @1518480865" if you are 
> using GNU date ***
> PC: @ 0x7f1e02025790 scoped_refptr<>::operator->()
> *** SIGSEGV (@0x160) received by PID 8782 (TID 0x7f1de3c7e700) from PID 352; 
> stack trace: ***
>     @ 0x7f1dfdcfc330 (unknown) at ??:0
>     @ 0x7f1e02025790 scoped_refptr<>::operator->() at ??:0
>     @ 0x7f1e00ae62e7 kudu::tablet::Tablet::GetTabletAncientHistoryMark() at ??:0
>     @ 0x7f1e00ae627d kudu::tablet::Tablet::GetHistoryGcOpts() at ??:0
>     @ 0x7f1e02012c53 kudu::tserver::(anonymous namespace)::VerifyNotAncientHistory() at ??:0
>     @ 0x7f1e0201223b kudu::tserver::TabletServiceImpl::HandleScanAtSnapshot() at ??:0
>     @ 0x7f1e0200c6dd kudu::tserver::TabletServiceImpl::HandleNewScanRequest() at ??:0
>     @ 0x7f1e02009d33 kudu::tserver::TabletServiceImpl::Scan() at ??:0
>     @ 0x7f1dfc90de4d kudu::tserver::TabletServerServiceIf::TabletServerServiceIf()::$_5::operator()() at ??:0
>     @ 0x7f1dfc90dc92 std::_Function_handler<>::_M_invoke() at ??:0
>     @ 0x7f1dfba728ab std::function<>::operator()() at ??:0
>     @ 0x7f1dfba7216d kudu::rpc::GeneratedServiceIf::Handle() at ??:0
>     @ 0x7f1dfba74526 kudu::rpc::ServicePool::RunThread() at ??:0
>     @ 0x7f1dfba76ad9 boost::_mfi::mf0<>::operator()() at ??:0
>     @ 0x7f1dfba76a40 boost::_bi::list1<>::operator()<>() at ??:0
>     @ 0x7f1dfba769ea boost::_bi::bind_t<>::operator()() at ??:0
>     @ 0x7f1dfba767cd boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
>     @ 0x7f1dfba190f8 boost::function0<>::operator()() at ??:0
>     @ 0x7f1df9d1788d kudu::Thread::SuperviseThread() at ??:0
>     @ 0x7f1dfdcf4184 start_thread at ??:0
>     @ 0x7f1df6023ffd clone at ??:0
>     @ 0x0 (unknown)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KUDU-1119) Consider supporting Impala tablespaces for Kudu tables

2018-03-13 Thread Alexey Serbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin reassigned KUDU-1119:
---

Assignee: (was: Alexey Serbin)

> Consider supporting Impala tablespaces for Kudu tables
> --
>
> Key: KUDU-1119
> URL: https://issues.apache.org/jira/browse/KUDU-1119
> Project: Kudu
>  Issue Type: Improvement
>  Components: impala
>Affects Versions: Public beta
>Reporter: Misty Stanley-Jones
>Priority: Major
>
> Right now, if you create a table using Impala in a given Impala database 
> (my_database:my_table), Kudu strips out the database part and just calls the 
> table my_table. This creates a requirement for Kudu table names to be unique 
> across all Impala databases, and may be surprising behavior to seasoned 
> Impala users. I'm filing this ticket at [~tlipcon]'s request and may be 
> getting some of the details/limitations wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2322) Leader spews logs when follower falls behind log GC

2018-03-13 Thread Alexey Serbin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-2322:

Target Version/s: 1.7.0  (was: 1.8.0)

> Leader spews logs when follower falls behind log GC
> ---
>
> Key: KUDU-2322
> URL: https://issues.apache.org/jira/browse/KUDU-2322
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.7.0
>Reporter: Todd Lipcon
>Assignee: Alexey Serbin
>Priority: Critical
>
> I'm running a YCSB-based write stress test and found that one of the 
> followers fell behind enough that its logs got GCed by the leader. At this 
> point, the leader started logging about 100 messages per second indicating 
> that it could not obtain a request for this peer.
> I believe this is a regression since 1.6, since before 3-4-3 replication we 
> would have evicted the replica as soon as it fell behind GC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KUDU-2322) Leader spews logs when follower falls behind log GC

2018-03-13 Thread Alexey Serbin (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394025#comment-16394025
 ] 

Alexey Serbin edited comment on KUDU-2322 at 3/13/18 3:29 PM:
--

I posted a patch to throttle the log messages.  I think it will take longer to 
adjust the 3-4-3 scheme to evict followers that have fallen behind log GC, and 
I'm not quite sure yet whether such a shortcut would be advantageous.
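
As a rough illustration of the throttling approach (a sketch using glog's 
{{LOG_EVERY_N}}; the actual patch may rely on Kudu's own throttling helpers, 
and the function below is hypothetical):

{noformat}
// A rough sketch of message throttling (using glog's LOG_EVERY_N; the
// actual patch may use Kudu's own throttling helpers instead). The
// function name is hypothetical.
#include <string>
#include <glog/logging.h>

void OnRequestUnavailableForPeer(const std::string& peer_uuid) {
  // Logs at most 1 out of every 100 occurrences, so ~100 failures/sec
  // produce ~1 log line/sec instead of 100.
  LOG_EVERY_N(WARNING, 100)
      << "Couldn't obtain request for peer " << peer_uuid
      << " (follower fell behind the leader's log GC)";
}
{noformat}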


was (Author: aserbin):
I posted a patch to throttle the log messages.  I think it will take longer to 
adjust the 3-4-3 scheme to evict followers that have fallen behind log GC, and 
I'm not quite sure yet whether such a shortcut is advantageous.

> Leader spews logs when follower falls behind log GC
> ---
>
> Key: KUDU-2322
> URL: https://issues.apache.org/jira/browse/KUDU-2322
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.7.0
>Reporter: Todd Lipcon
>Assignee: Alexey Serbin
>Priority: Critical
>
> I'm running a YCSB-based write stress test and found that one of the 
> followers fell behind enough that its logs got GCed by the leader. At this 
> point, the leader started logging about 100 messages per second indicating 
> that it could not obtain a request for this peer.
> I believe this is a regression since 1.6, since before 3-4-3 replication we 
> would have evicted the replica as soon as it fell behind GC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KUDU-2334) OutboundTransfer::TransferStarted() isn't stateful enough with TLS socket

2018-03-13 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2334.
---
   Resolution: Fixed
Fix Version/s: 1.7.0

> OutboundTransfer::TransferStarted() isn't stateful enough with TLS socket
> -
>
> Key: KUDU-2334
> URL: https://issues.apache.org/jira/browse/KUDU-2334
> Project: Kudu
>  Issue Type: Bug
>  Components: rpc
>Reporter: Michael Ho
>Assignee: Michael Ho
>Priority: Major
> Fix For: 1.7.0
>
>
> Currently, {{OutboundTransfer::TransferStarted()}} returns true if we made 
> non-zero progress in {{OutboundTransfer::SendBuffer()}}. However, this may 
> not work well with a TLS socket, as {{SSL_Write()}} is stateful. So, if we 
> called {{SSL_Write()}} with the buffer of the first slice in an 
> {{OutboundTransfer}} and it failed with 0 bytes written and errno {{EAGAIN}} 
> due to the send buffer being full, we need to try again with the exact same 
> buffer from the first slice. However, it's possible that that particular 
> {{OutboundTransfer}} is cancelled or times out before the next call to 
> {{OutboundTransfer::SendBuffer()}} in {{Connection::WriteHandler()}}. In 
> that case, we will skip to the next {{OutboundTransfer}} in the queue and 
> call {{SSL_Write()}} with the buffers of that next {{OutboundTransfer}}, 
> leading to the error message:
> {noformat}
> failed to write to TLS socket: error:1409F07F:SSL 
> routines:SSL3_WRITE_PENDING:bad write retry:s3_pkt.c
> {noformat}
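
For reference, a minimal sketch of the underlying OpenSSL contract (raw 
OpenSSL calls, not Kudu's wrapper code): once a write fails with 
{{SSL_ERROR_WANT_WRITE}}, the retry must present the very same buffer, which 
is why skipping to a different {{OutboundTransfer}} trips the 'bad write 
retry' error.

{noformat}
// Minimal sketch using raw OpenSSL (not Kudu's wrapper): returns bytes
// written, 0 if the caller must retry later with the SAME (buf, len),
// or -1 on a fatal error.
#include <openssl/ssl.h>

int TryTlsWrite(SSL* ssl, const char* buf, int len) {
  int n = SSL_write(ssl, buf, len);
  if (n > 0) return n;
  switch (SSL_get_error(ssl, n)) {
    case SSL_ERROR_WANT_WRITE:
    case SSL_ERROR_WANT_READ:
      // Kernel send buffer full (or renegotiation in progress): SSL_write()
      // has latched this buffer. Retrying with a *different* buffer fails
      // with SSL3_WRITE_PENDING: "bad write retry" -- the error above.
      return 0;
    default:
      return -1;
  }
}
{noformat}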



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2339) TSAN detected race in DnsResolver

2018-03-13 Thread Will Berkeley (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396616#comment-16396616
 ] 

Will Berkeley commented on KUDU-2339:
-

This might be KUDU-2059, or related to it.

> TSAN detected race in DnsResolver
> -
>
> Key: KUDU-2339
> URL: https://issues.apache.org/jira/browse/KUDU-2339
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.7.0
>Reporter: Will Berkeley
>Priority: Major
> Attachments: raft_consensus_election-itest.0.txt
>
>
> Seen in a precommit http://jenkins.kudu.apache.org/job/kudu-gerrit/12388/ 
> during raft_consensus-itest. Log attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KUDU-2339) TSAN detected race in DnsResolver

2018-03-13 Thread Will Berkeley (JIRA)
Will Berkeley created KUDU-2339:
---

 Summary: TSAN detected race in DnsResolver
 Key: KUDU-2339
 URL: https://issues.apache.org/jira/browse/KUDU-2339
 Project: Kudu
  Issue Type: Bug
Affects Versions: 1.7.0
Reporter: Will Berkeley
 Attachments: raft_consensus_election-itest.0.txt

Seen in a precommit http://jenkins.kudu.apache.org/job/kudu-gerrit/12388/ 
during raft_consensus-itest. Log attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)