On 13 Oct 2016, at 14:40, Mendelson, Assaf <assaf.mendel...@rsa.com> wrote:

Hi,
We have a Spark cluster and we wanted to add some security to it. I was
looking at the documentation (at
http://spark.apache.org/docs/latest/security.html) and had some questions.
1. Do all executors listen on the same blockManager port? For example, on
YARN there are multiple executors per node; do they all listen on the same port?

On YARN the executors will come up on their own ports.

2. Are the ports defined in earlier versions (e.g.
http://spark.apache.org/docs/1.6.1/security.html) but removed from the latest
docs (such as spark.executor.port and spark.fileserver.port) gone, so that they
can be blocked?
3. If I define multiple workers per node in Spark standalone mode, how do I
set different ports for each worker? There is only one spark.worker.ui.port /
SPARK_WORKER_WEBUI_PORT setting; do I have to start each worker separately to
configure its port? The same applies to the worker port (SPARK_WORKER_PORT).
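The standalone launch scripts handle this when several workers are started from
one invocation. A sketch of a spark-env.sh fragment (the instance count and port
numbers are illustrative assumptions; verify against the sbin/start-slave.sh
shipped with your version, which offsets the base ports per instance):

```shell
# spark-env.sh sketch: two workers per node. With SPARK_WORKER_INSTANCES
# set, sbin/start-slave.sh launches one worker per instance and offsets
# the base ports below by the instance number, so you do not have to
# configure each worker separately. Values here are made-up assumptions.
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_PORT=40001        # workers get 40001, 40002, ...
export SPARK_WORKER_WEBUI_PORT=8081   # web UIs get 8081, 8082, ...
```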
4. Is it possible to encrypt the logs, instead of just restricting permissions
on the log directory?

If you are writing to HDFS on a Hadoop 2.7+ cluster, you can use HDFS
encryption at rest to encrypt the data on the disks. If you are talking to S3
with the Hadoop 2.8+ libraries (not officially shipping yet), you can use S3
server-side encryption with AWS-managed keys too.
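For HDFS encryption at rest, the usual sequence (a sketch; the key name and
directory below are made up, and this assumes a cluster with a KMS configured)
is to create a key and then mark an empty directory as an encryption zone:

```shell
# Sketch, assuming a Hadoop 2.7+ cluster with a KMS configured.
# The key name and path are illustrative assumptions.
hadoop key create sparkLogsKey

# Make the Spark event-log directory an encryption zone (must be empty).
hdfs dfs -mkdir -p /spark-logs
hdfs crypto -createZone -keyName sparkLogsKey -path /spark-logs

# Verify the zone exists.
hdfs crypto -listZones
```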

5. Is the communication between the servers encrypted (e.g. using SSH)?

You can enable this:

https://spark.apache.org/docs/latest/security.html
https://spark.apache.org/docs/latest/configuration.html#security

spark.network.sasl.serverAlwaysEncrypt true
spark.authenticate.enableSaslEncryption true
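Those two properties usually sit alongside spark.authenticate in
spark-defaults.conf. A minimal fragment might look like this (the secret value
is a placeholder; on YARN the shared secret is generated for you, so
spark.authenticate.secret is not set there):

```
spark.authenticate                      true
spark.authenticate.secret               changeme-shared-secret
spark.network.sasl.serverAlwaysEncrypt  true
spark.authenticate.enableSaslEncryption true
```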

I *believe* that encrypted shuffle comes with 2.1   
https://issues.apache.org/jira/browse/SPARK-5682

As usual, look in the source to really understand.

There are various ways to interact with Spark, and within it; you need to make
sure they are all secured against malicious users:

- Web UI: on YARN, you can use SPNEGO to Kerberos-auth the YARN RM proxy; the
Spark UI will 302 all direct requests to its web UI back to that proxy.
Communications behind the scenes between the RM and the Spark UI will not,
AFAIK, be encrypted/authed.

- Spark driver / executor comms
- Bulk data exchange between executors
- Shuffle service in the executors, or hosted inside the YARN node managers
- Spark-to-filesystem communications
- Spark to other data sources (Kafka, etc.)
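On the SPNEGO-protected web UI point above: a quick way to check it from a
kerberized client (the principal, hostname, and application ID below are
made-up assumptions) is curl's negotiate support:

```shell
# Sketch: reach a SPNEGO-protected YARN RM proxy using an existing
# Kerberos ticket. Principal, host, port, and app ID are illustrative.
kinit alice@EXAMPLE.COM

# -u : tells curl to take credentials from the Kerberos ticket cache;
# -L follows the proxy's redirects.
curl --negotiate -u : -L \
  "http://rm-host:8088/proxy/application_1476300000000_0001/"
```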

You're going to have to go through them all and do the checklist.

As is usual in an open source project, documentation improvements are always
welcome. There is a good security doc in the Spark source, but I'm sure extra
contributions will be welcome.




6. Are there any additional best practices beyond what is written in the
documentation?
Thanks,

In a YARN cluster, Kerberos is mandatory if you want any form of security. 
Sorry.
