This is a duplicate of my Stack Overflow question here:

https://stackoverflow.com/questions/57881044/verifying-in-transit-encryption-for-spark-shuffle

I'm running Spark over YARN on AWS EMR 5.20.

I followed this guide to set up in-transit encryption for Spark
shuffle:

https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/configuring-spark/content/configuring_spark_for_wire_encryption.html

First off, this doc *only* refers to self-signed certs, and we're using a
CA-signed cert. No big deal, I put the CA cert in the truststore.

Unfortunately, I'm not in a position to use the built-in Amazon in-transit
encryption, nor can I use Spark Defaults, as we send along our Spark
assembly with our jobs to allow multiple versions to be used.

The piece I'm tacking on to our spark jobs looks like this:

spark.shuffle.encryption.enabled=true
spark.ssl.enabled=true
spark.ssl.keyPassword=*****
spark.ssl.keyStore="/opt/my-cluster/keystore.jks"
spark.ssl.keyStorePassword=*****
spark.ssl.protocol=TLS
spark.ssl.trustStore="/opt/my-cluster/truststore.jks"
spark.ssl.trustStorePassword=*****
spark.authenticate=true
spark.network.crypto.enabled=true
spark.enableSaslEncryption=true
spark.ui.https.enabled=true
spark.io.encryption.enabled=true
spark.network.sasl.serverAlwaysEncrypt=true
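
One thing that may be worth ruling out with the settings above: if these
lines are ever read as a Java properties file (rather than passed through a
shell, which strips quotes), the quote characters around the keystore and
truststore paths become part of the value. How the settings actually reach
Spark in this setup is not stated, so this is only a sketch under that
assumption. QuotedPropertyCheck and its helpers are hypothetical names, not
part of Spark:

```scala
import java.io.StringReader
import java.util.Properties
import scala.collection.JavaConverters._

// Hypothetical helper, not part of Spark: shows what java.util.Properties
// makes of quoted values and flags values that carry a stray quote.
object QuotedPropertyCheck {

  // Parse key=value text exactly as java.util.Properties does.
  def load(text: String): Properties = {
    val props = new Properties()
    props.load(new StringReader(text))
    props
  }

  // True when a value begins or ends with a straight or curly quote.
  def hasStrayQuote(value: String): Boolean = {
    val quotes = Set('"', '\'', '\u201C', '\u201D')
    value.nonEmpty && (quotes.contains(value.head) || quotes.contains(value.last))
  }

  // Keys whose values carry a leading or trailing quote character.
  def suspiciousKeys(props: Properties): Seq[String] =
    props.stringPropertyNames().asScala.toSeq.sorted
      .filter(k => hasStrayQuote(props.getProperty(k)))

  def main(args: Array[String]): Unit = {
    val sample =
      "spark.ssl.keyStore=\"/opt/my-cluster/keystore.jks\"\n" +
      "spark.ssl.protocol=TLS\n"
    val props = load(sample)
    suspiciousKeys(props).foreach { k =>
      println(s"$k -> ${props.getProperty(k)}")
    }
  }
}
```

If a path shows up here with the quotes still attached, the file Spark is
asked to open is not the one that exists on disk.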

Jobs are running fine. I'm running a simple job that I assume will force a
shuffle. Here's the code:

import org.apache.spark.sql.SparkSession

import scala.util.Random

object SparkShuffleTest {

  def main(args: Array[String]): Unit = {
    // 100,000 random printable characters as input data
    val randomText = for (_ <- 0 until 100000) yield Random.nextPrintableChar()
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    val logData = spark.sparkContext.parallelize(randomText)
    val pairs = logData.map(c => (c, 1))
    // Runs on the executors, so on YARN this output lands in the executor logs
    pairs.foreach(println(_))
    // reduceByKey repartitions by key, which is what triggers the shuffle
    val outputs = pairs.reduceByKey(_ + _).collect()
    outputs.foreach { case (a, b) => println(s"$a:$b") }
    println("Outputs collected...")
    // Print the contents rather than the array reference
    println(outputs.mkString(", "))
    spark.stop()
  }
}

*So, here's the tough part:*

If I screw around with the location of the keystore and change it to a
bogus name, my jobs fail, as they should, because they can't find a valid
keystore. However, if I do this to the *truststore*, there's no failure.
It's like it's not even reading the truststore. How can I actually get this
to encrypt, or what am I configuring wrong? Obviously, if I'm giving it a
bogus truststore, it ought to fail at encrypting shuffle. Does that just
not throw an error at all?
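
To separate "the truststore file is bad" from "Spark never opens it", the
file can be loaded directly with the JDK on a cluster node. This is a minimal
sketch, assuming a JKS-format store. TrustStoreCheck, countEntries, and the
TRUSTSTORE_PASSWORD environment variable are hypothetical names used for
illustration. It says nothing about whether the shuffle path consults the
truststore at all; it only confirms that the file and password are usable:

```scala
import java.io.FileInputStream
import java.security.KeyStore
import scala.util.{Failure, Success, Try}

// Hypothetical helper, not part of Spark: opens a JKS file with the plain
// JDK KeyStore API and reports whether it could be read.
object TrustStoreCheck {

  // Try to load the store and return the number of entries it holds.
  // A missing file or a wrong password surfaces as a Failure.
  def countEntries(path: String, password: String): Try[Int] = Try {
    val in = new FileInputStream(path)
    try {
      val ks = KeyStore.getInstance("JKS") // assumes JKS format
      ks.load(in, password.toCharArray)
      ks.size()
    } finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val path = args.headOption.getOrElse("/opt/my-cluster/truststore.jks")
    // TRUSTSTORE_PASSWORD is an assumed variable name for this sketch.
    val password = sys.env.getOrElse("TRUSTSTORE_PASSWORD", "")
    countEntries(path, password) match {
      case Success(n) => println(s"$path loaded, $n entries")
      case Failure(e) => println(s"$path could not be loaded: $e")
    }
  }
}
```

Running this against the bogus truststore name should report a failure; if
the Spark job still succeeds with that same name, the job most likely never
tried to open the file.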

Thanks!
