[jira] [Comment Edited] (KUDU-3099) KuduBackup/KuduRestore System.exit(0) results in Spark on YARN failure with exitCode: 16

2020-04-03 Thread Waleed Fateem (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074122#comment-17074122
 ] 

Waleed Fateem edited comment on KUDU-3099 at 4/4/20, 4:41 AM:
--

Patch submitted:

[https://gerrit.cloudera.org/#/c/15638/]


was (Author: waleedfateem):
New patch submitted:

[https://gerrit.cloudera.org/#/c/15638/]

> KuduBackup/KuduRestore System.exit(0) results in Spark on YARN failure with 
> exitCode: 16
> 
>
> Key: KUDU-3099
> URL: https://issues.apache.org/jira/browse/KUDU-3099
> Project: Kudu
>  Issue Type: Bug
>  Components: backup, spark
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Waleed Fateem
>Assignee: Waleed Fateem
>Priority: Major
>
> When running KuduBackup/KuduRestore the underlying Spark application can fail 
> when running on YARN even when the backup/restore tasks complete 
> successfully. The following was from the Spark driver log:
> {code:java}
> INFO spark.SparkContext: Submitted application: Kudu Table Backup
> ..
> INFO spark.SparkContext: Starting job: save at KuduBackup.scala:90
> INFO scheduler.DAGScheduler: Got job 0 (save at KuduBackup.scala:90) with 200 
> output partitions
> scheduler.DAGScheduler: Final stage: ResultStage 0 (save at 
> KuduBackup.scala:90)
> ..
> INFO scheduler.DAGScheduler: Submitting 200 missing tasks from ResultStage 0 
> (MapPartitionsRDD[2] at save at KuduBackup.scala:90) (first 15 tasks are for 
> partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
> INFO cluster.YarnClusterScheduler: Adding task set 0.0 with 200 tasks
> ..
> INFO cluster.YarnClusterScheduler: Removed TaskSet 0.0, whose tasks have all 
> completed, from pool 
> INFO scheduler.DAGScheduler: Job 0 finished: save at KuduBackup.scala:90, 
> took 20.007488 s
> ..
> INFO spark.SparkContext: Invoking stop() from shutdown hook
> ..
> INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
> ..
> INFO spark.SparkContext: Successfully stopped SparkContext
> INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 16, (reason: 
> Shutdown hook called before final status was reported.)
> INFO util.ShutdownHookManager: Shutdown hook called{code}
> Spark explicitly added this shutdown hook to catch System.exit() calls. If 
> such a call occurs before the SparkContext stops, the application status is 
> considered a failure:
> [https://github.com/apache/spark/blob/branch-2.3/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L299]
> The System.exit() call added as part of KUDU-2787 can cause this race 
> condition, and that change was merged into the 1.10.x and 1.11.x branches. 
>  
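
The ordering problem described above can be illustrated with a small toy model. This is only a sketch in Python, not Spark or Kudu code: the class `AppMasterModel`, its methods, and the function `run` are hypothetical names, and the model shows just the ordering in which the exit call wins the race. The status strings and exit code 16 are taken from the driver log quoted above.

```python
class AppMasterModel:
    """Toy stand-in for the status bookkeeping described above; not Spark code."""

    def __init__(self):
        self.final_status = None
        self.exit_code = None

    def report(self, status, exit_code):
        # Only the first report counts, mirroring a "final" status.
        if self.final_status is None:
            self.final_status = status
            self.exit_code = exit_code

    def shutdown_hook(self):
        # If nothing was reported before shutdown, treat the run as failed.
        self.report("FAILED", 16)


def run(user_main_calls_exit):
    """Return (final_status, exit_code) for the two orderings."""
    app = AppMasterModel()
    if user_main_calls_exit:
        # Exiting from user code runs the hook before any status is reported.
        app.shutdown_hook()
    else:
        # User code returns normally, success is reported, then shutdown runs.
        app.report("SUCCEEDED", 0)
        app.shutdown_hook()
    return app.final_status, app.exit_code
```

In this model, `run(True)` ends as FAILED with exit code 16 even though the job itself succeeded, while `run(False)` ends as SUCCEEDED. Returning normally from the main method on success, rather than exiting explicitly, is one possible way to avoid the race; whether the submitted patch takes that approach is not stated here.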



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3105) kudu_client based application reports 'Locking callback not initialized' error

2020-04-03 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074719#comment-17074719
 ] 

Todd Lipcon commented on KUDU-3105:
---

I ran into this last night while using conda on el7 to get a rather new version 
of python. The conda environment has openssl 1.1 in it, but my client was built 
outside of conda and has openssl 1.0.x from el7.

My initial attempt to fix this was to change Kudu to use dlsym to look for the 
OPENSSL_version_number() function and use that to determine runtime behavior. 
However, I hit a different, related issue:
- The Python client code uses 'import _ssl' to force Python to do its own 
OpenSSL initialization. In the conda environment, this linked against libssl.so 
from conda's lib directory (OpenSSL 1.1). So, the Python side initializes 
OpenSSL 1.1.
- The Python C++ .so file links to libssl.so.10, with the explicit version 
suffix in the file name, rather than just to 'libssl.so'. So, it still links to 
libssl _outside_ the environment, gets an uninitialized SSL library, and can't 
make an SSL context.
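
A symbol probe of the kind described above can be sketched from Python with ctypes, which does a dlsym-style lookup on attribute access. This is an illustration only, not the change that was attempted in Kudu. It assumes the probe targets `OpenSSL_version_num` (exported by OpenSSL 1.1.0 and later) and `SSLeay` (exported by 1.0.x); reading the comment's OPENSSL_version_number() as the former is an assumption. The function names are hypothetical.

```python
import ctypes
import ctypes.util


def openssl_generation(lib):
    """Classify a library handle by which version symbol it exports."""
    # Attribute access on a ctypes handle performs a dlsym-style lookup and
    # raises AttributeError when the symbol is missing, so hasattr() works.
    if hasattr(lib, "OpenSSL_version_num"):
        return "1.1+"
    if hasattr(lib, "SSLeay"):
        return "1.0.x"
    return "unknown"


def probe_system_libcrypto():
    """Probe whichever libcrypto the system loader finds, if any."""
    path = ctypes.util.find_library("crypto")
    if path is None:
        return None
    return openssl_generation(ctypes.CDLL(path))
```

Note that such a probe only answers "which library did this handle resolve to". It does not help when two different copies of the library are loaded into the same process, which is the situation described in the list above.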

Not sure of the right fix here. It seems like we could either get the 
kuduclient.so to link against libssl.so instead of libssl.so.10, or we could be 
a little more "fast and loose" about trying to auto-detect whether SSL is 
initialized.

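To confirm that a process really has two OpenSSL copies mapped, one option on Linux is to scan /proc/self/maps for libssl and libcrypto entries. The sketch below is a diagnostic aid under that assumption, not part of Kudu; the function name `loaded_ssl_libraries` is hypothetical.

```python
import os
import re


def loaded_ssl_libraries(maps_text):
    """Return the sorted, de-duplicated libssl/libcrypto paths in a maps dump."""
    paths = set()
    for line in maps_text.splitlines():
        # Format: address perms offset dev inode pathname (pathname optional).
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue
        path = fields[5].strip()
        if re.match(r"lib(ssl|crypto)\.so", os.path.basename(path)):
            paths.add(path)
    return sorted(paths)


def loaded_ssl_libraries_in_this_process():
    """Linux only: inspect the current process after the imports of interest."""
    with open("/proc/self/maps") as maps_file:
        return loaded_ssl_libraries(maps_file.read())
```

Calling `loaded_ssl_libraries_in_this_process()` after importing both `_ssl` and the Kudu client module would list every libssl and libcrypto file mapped at that point. More than one distinct libssl path in the result points to the mixed-library situation described above.
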
> kudu_client based application reports 'Locking callback not initialized' error
> --
>
> Key: KUDU-3105
> URL: https://issues.apache.org/jira/browse/KUDU-3105
> Project: Kudu
>  Issue Type: Bug
>  Components: client, python, security
>Affects Versions: 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1
>Reporter: Alexey Serbin
>Priority: Major
>
> When using the kudu_client library compiled against OpenSSL 1.0.x with an 
> OpenSSL 1.1.x run-time, Kudu client applications might report a 'Runtime 
> error: Locking callback not initialized' error.
> For example, {{kudu-python}} based applications on RHEL/CentOS 7.7 that use 
> {{kudu-client}} versions 1.9, 1.10, or 1.11 in a Python environment with 
> OpenSSL 1.1.1d might report an error like the one below:
> {noformat}
> Traceback (most recent call last):
>   File "kudu-python-app.py", line 22, in <module>
> client = kudu.connect(host=args.masters, port=args.ports)
>   File "/opt/lib/python3.6/site-packages/kudu/__init__.py", line 96, in 
> connect
> rpc_timeout_ms=rpc_timeout_ms)
>   File "kudu/client.pyx", line 297, in kudu.client.Client.__cinit__
>   File "kudu/errors.pyx", line 62, in kudu.errors.check_status
> kudu.errors.KuduBadStatus: b'Runtime error: Locking callback not initialized'
> {noformat}
> The issue is that {{libkudu_client}} compiled against OpenSSL 1.0.x uses an 
> initialization code path specific to OpenSSL 1.0.x, and its post-condition 
> requires thread-safety locking callbacks to be installed once the 
> initialization is done. However, those functions do not install the expected 
> locking callbacks in OpenSSL 1.1.x: since version 1.1.0, OpenSSL no longer 
> requires the callbacks because its multi-threading model was revamped.





[jira] [Commented] (KUDU-3101) ASAN and dynamic linking are incompatible on aarch64

2020-04-03 Thread RuiChen (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074318#comment-17074318
 ] 

RuiChen commented on KUDU-3101:
---

No problem. I enabled TSAN for Kudu on an ARM64 server and hit the same issue 
as with ASAN in the comment above. It looks like both ASAN and TSAN with 
dynamic linking on ARM64 cause the following error, while static linking works. 
I have also reported this to the Google sanitizers team.

> ASAN and dynamic linking are incompatible on aarch64
> 
>
> Key: KUDU-3101
> URL: https://issues.apache.org/jira/browse/KUDU-3101
> Project: Kudu
>  Issue Type: Sub-task
>Affects Versions: 1.12.0
>Reporter: RuiChen
>Assignee: RuiChen
>Priority: Minor
>
> The Kudu README mentions "NOTE: Dynamic linking is incompatible with ASAN and 
> static linking is incompatible with TSAN."[1], but CMakeLists.txt does not 
> check for it. So if a developer uses the following cmake command, cmake and 
> make succeed, but test cases may fail at run time.
> {code:java}
>     CC=../../thirdparty/clang-toolchain/bin/clang \
>     CXX=../../thirdparty/clang-toolchain/bin/clang++ \
>     cmake -DCMAKE_BUILD_TYPE=debug -DKUDU_USE_ASAN=1 ../..{code}
> I built Kudu on an ARM64 server and hit the issue shown below; with 
> KUDU_LINK=static, the tests pass. Oddly, on x86, ASAN with dynamic linking 
> works in Jenkins CI, so this issue seems to affect only ARM64:
> {code:java}
> ubuntu@ubuntu:~/workspace/github.com/apache/kudu/build/asan$./bin/example-test
>  
> AddressSanitizer:DEADLYSIGNAL 
> = 
> ==20451==ERROR: AddressSanitizer: SEGV on unknown address 0x (pc 
> 0x bp 0xd0853a20 sp 0xd0853a20 T0) 
> ==20451==Hint: pc points to the zero page. 
> ==20451==The signal is caused by a READ memory access. 
> ==20451==Hint: address points to the zero page.
> AddressSanitizer can not provide additional info. 
> SUMMARY: AddressSanitizer: SEGV () 
> ==20451==ABORTING
> {code}
>  
> [1]: https://github.com/apache/kudu#building-kudu-with-dynamic-linking
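
The rule in the README note quoted above can be stated as a small validation function. This Python sketch only illustrates the rule; a real check would belong in CMakeLists.txt, and the function name `check_link_and_sanitizer` and its parameters are hypothetical. It encodes the README note as written and does not account for the platform differences reported in this issue.

```python
def check_link_and_sanitizer(kudu_link, use_asan=False, use_tsan=False):
    """Raise ValueError for the combinations the README note rules out."""
    if kudu_link not in ("static", "dynamic"):
        raise ValueError(f"unsupported link type: {kudu_link!r}")
    if use_asan and kudu_link == "dynamic":
        raise ValueError("ASAN is incompatible with dynamic linking")
    if use_tsan and kudu_link == "static":
        raise ValueError("TSAN is incompatible with static linking")
    return kudu_link
```

With a check like this, the ASAN build with dynamic linking described above would be rejected at configure time instead of failing later in the tests. Since the same combination is reported to work on x86 in Jenkins CI, a stricter or architecture-specific rule may be more appropriate; that is a design choice for the project.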


