Currently we try to execute pyspark from the user CLI, but in the context of the project user, and get this error (the cluster is kerberized):
[<user>@edgenode1 ~]$ pyspark --master yarn --num-executors 5 --proxy-user <project-user>
Python 2.7.5 (default, Jun 24 2015, 00:41:19)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
15/10/06 09:44:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/10/06 09:44:25 INFO SparkContext: Running Spark version 1.3.1
15/10/06 09:44:25 INFO SecurityManager: Changing view acls to: <user>,<project-user>
15/10/06 09:44:25 INFO SecurityManager: Changing modify acls to: <user>,<project-user>
15/10/06 09:44:25 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(<user>, <project-user>); users with modify permissions: Set(<user>, <project-user>)
15/10/06 09:44:25 INFO Slf4jLogger: Slf4jLogger started
15/10/06 09:44:25 INFO Remoting: Starting remoting
15/10/06 09:44:26 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@<server>:40607]
15/10/06 09:44:26 INFO Utils: Successfully started service 'sparkDriver' on port 40607.
15/10/06 09:44:26 INFO SparkEnv: Registering MapOutputTracker
15/10/06 09:44:26 INFO SparkEnv: Registering BlockManagerMaster
15/10/06 09:44:26 INFO DiskBlockManager: Created local directory at /tmp/spark-10b70025-ca98-4940-91b8-6dbd0b7148aa/blockmgr-33e9fb6d-d5b2-4fa5-876f-0b91501be632
15/10/06 09:44:26 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/10/06 09:44:26 INFO HttpFileServer: HTTP File server directory is /tmp/spark-1a4b86f0-3e57-4f44-bded-6157f4f1933f/httpd-2cafcce9-71ec-44fb-8500-2c70756ea3b9
15/10/06 09:44:26 INFO HttpServer: Starting HTTP Server
15/10/06 09:44:26 INFO Server: jetty-8.y.z-SNAPSHOT
15/10/06 09:44:26 INFO AbstractConnector: Started SocketConnector@0.0.0.0:34903
15/10/06 09:44:26 INFO Utils: Successfully started service 'HTTP file server' on port 34903.
15/10/06 09:44:26 INFO SparkEnv: Registering OutputCommitCoordinator
15/10/06 09:44:26 INFO Server: jetty-8.y.z-SNAPSHOT
15/10/06 09:44:26 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/10/06 09:44:26 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/10/06 09:44:26 INFO SparkUI: Started SparkUI at http://<server>:4040
spark.yarn.driver.memoryOverhead is set but does not apply in client mode.
15/10/06 09:44:27 INFO TimelineClientImpl: Timeline service address: http://<master-node>:8188/ws/v1/timeline/
15/10/06 09:44:27 INFO RMProxy: Connecting to ResourceManager at <master-node>/10.49.20.5:8050
Traceback (most recent call last):
  File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/shell.py", line 50, in <module>
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/context.py", line 110, in __init__
    conf, jsc, profiler_cls)
  File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/context.py", line 158, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/context.py", line 211, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/home/<user>/.local/lib/python2.7/site-packages/py4j-0.9-py2.7.egg/py4j/java_gateway.py", line 1064, in __call__
    answer, self._gateway_client, None, self._fqn)
  File "/home/<user>/.local/lib/python2.7/site-packages/py4j-0.9-py2.7.egg/py4j/protocol.py", line 308, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.

@guha, yes, you can separate workloads via the YARN capacity scheduler.

From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Wednesday, 7 October 2015 12:06
To: Steve Loughran <ste...@hortonworks.com>
Cc: user <user@spark.apache.org>; Dominik Fries <dominik.fr...@woodmark.de>
Subject: Re: spark multi tenancy

Can queues also be used to separate workloads?
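One common cause of a Py4JJavaError at SparkContext startup when submitting with --proxy-user on a kerberized cluster is that Hadoop has not been told the authenticated user may impersonate the project user. A minimal sketch of the relevant core-site.xml entries, assuming the authenticated principal is <user> and submissions come only from edgenode1 (hypothetical values; restrict hosts and groups as tightly as your environment allows):

```xml
<!-- core-site.xml: allow <user> to impersonate members of <project-group>
     when connecting from edgenode1. Values are illustrative. -->
<property>
  <name>hadoop.proxyuser.<user>.hosts</name>
  <value>edgenode1</value>
</property>
<property>
  <name>hadoop.proxyuser.<user>.groups</name>
  <value><project-group></value>
</property>
```

After editing, the settings can be picked up without a restart via `hdfs dfsadmin -refreshSuperUserGroupsConfiguration` and `yarn rmadmin -refreshSuperUserGroupsConfiguration`. Whether this is the actual failure here cannot be confirmed from the truncated Py4JJavaError; the full nested Java exception would show an impersonation error explicitly if so.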
On 7 Oct 2015 20:34, "Steve Loughran" <ste...@hortonworks.com> wrote:

> On 7 Oct 2015, at 09:26, Dominik Fries <dominik.fr...@woodmark.de> wrote:
>
> Hello Folks,
>
> We want to deploy several Spark projects and want to use a unique project
> user for each of them. Only the project user should start the Spark
> application and have the corresponding packages installed.
>
> Furthermore, a personal user, which belongs to a specific project, should
> start a Spark application via the corresponding Spark project user as proxy.
> (Development)
>
> The application is currently running with ipython / pyspark. (HDP 2.3 -
> Spark 1.3.1)
>
> Is this possible, or what is the best practice for a Spark multi-tenancy
> environment?

Deploy on a kerberized YARN cluster, and each application instance will run as a different Unix user in the cluster, with the appropriate access to HDFS: isolated. The issue then becomes "do workloads clash with each other?". If you want to isolate dev and production, using node labels to keep dev work off the production nodes is the standard technique.
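The queue-based separation discussed above can be sketched as a capacity-scheduler.xml fragment; the queue names dev and prod and the capacity percentages are hypothetical and would need to match your cluster's policy:

```xml
<!-- capacity-scheduler.xml: split cluster capacity between two queues.
     Queue names and percentages are illustrative. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>dev,prod</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>30</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value>
</property>
```

Each project then submits into its own queue, e.g. `pyspark --master yarn --queue dev`, and per-queue submit ACLs (`yarn.scheduler.capacity.root.dev.acl_submit_applications`) keep users out of queues that are not theirs.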