I think I'm getting close to finding the reason. When I initialize the SparkContext, the following code is executed (pyspark/context.py):

    def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
                 conf, jsc, profiler_cls):
        self.environment = environment or {}
        # java gateway must have been launched at this point.
        if conf is not None and conf._jconf is not None:
            # conf has been initialized in JVM properly, so use conf directly. This represent the
            # scenario that JVM has been launched before SparkConf is created (e.g. SparkContext is
            # created and then stopped, and we create a new SparkConf and new SparkContext again)
            self._conf = conf
        else:
            self._conf = SparkConf(_jvm=SparkContext._jvm)
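As a quick sanity check (just a sketch on my side; the attribute names come straight from the snippet above, so this leans on PySpark internals and may differ between versions), I print the relevant fields before creating the context:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf()
    # SparkContext._jvm is a class attribute that is only set once the Java
    # gateway has been launched.
    print(SparkContext._jvm)   # None in my 2.1.0 run
    # Without a running JVM, SparkConf can't build a JVM-backed conf either,
    # so _do_init above falls through to the `else` branch and drops my conf.
    print(conf._jconf)         # also None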
So I can see that the only way my SparkConf will actually be used is if it already has a JVM-backed _jconf, i.e. if SparkContext._jvm was set when the SparkConf was created. I submitted my job with spark-submit and printed the _jvm object, but it is null, which explains why my SparkConf object is ignored. I've tried running exactly the same code on Spark 2.0.1 and it worked: my SparkConf object had a valid _jvm object.

Does anybody know what changed, or whether I got something wrong?

Thanks :)

Sidney Feiner / SW Developer
M: +972.528197720 / Skype: sidney.feiner.startapp
StartApp <http://www.startapp.com/>


From: Sidney Feiner
Sent: Thursday, January 26, 2017 9:26 AM
To: user@spark.apache.org
Subject: [PySpark 2.1.0] - SparkContext not properly initialized by SparkConf

Hey,
I'm pasting a question I asked on Stack Overflow without getting any answers :( I hope somebody here knows the answer, thanks in advance!
Link to post: https://stackoverflow.com/questions/41847113/pyspark-2-1-0-sparkcontext-not-properly-initialized-by-sparkconf

I'm migrating from Spark 1.6 to 2.1.0 and I've run into a problem migrating my PySpark application. I dynamically set up my SparkConf object based on configurations in a file, and on Spark 1.6 the app would run with the correct configs. But now, when I open the Spark UI, I can see that NONE of those configs are loaded into the SparkContext. Here's my code:

    spark_conf = SparkConf().setAll(
        filter(lambda x: x[0].startswith('spark.'), conf_dict.items())
    )
    sc = SparkContext(conf=spark_conf)

I've also added a print before initializing the SparkContext to make sure the SparkConf has all the relevant configs:

    [print("{0}: {1}".format(key, value)) for (key, value) in spark_conf.getAll()]

And this outputs all the configs I need:

* spark.app.name: MyApp
* spark.akka.threads: 4
* spark.driver.memory: 2G
* spark.streaming.receiver.maxRate: 25
* spark.streaming.backpressure.enabled: true
* spark.executor.logs.rolling.maxRetainedFiles: 7
* spark.executor.memory: 3G
* spark.cores.max: 24
* spark.executor.cores: 4
* spark.streaming.blockInterval: 350ms
* spark.memory.storageFraction: 0.2
* spark.memory.useLegacyMode: false
* spark.memory.fraction: 0.8
* spark.executor.logs.rolling.time.interval: daily

I submit my job with the following:

    /usr/local/spark/bin/spark-submit --conf spark.driver.host=i-${HOSTNAME} --master spark://i-${HOSTNAME}:7077 /path/to/main/file.py /path/to/config/file

Does anybody know why my SparkContext doesn't get initialized with my SparkConf?

Thanks :)

Sidney Feiner / SW Developer
M: +972.528197720 / Skype: sidney.feiner.startapp
StartApp <http://www.startapp.com/>
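P.S. Here is a stripped-down script that reproduces what I'm describing (a sketch only: the config values are just examples, and it's meant to be launched with the same spark-submit command as above):

    from pyspark import SparkConf, SparkContext

    # Example settings only; in the real app these come from the config file.
    spark_conf = SparkConf().setAll([
        ("spark.executor.memory", "3G"),
        ("spark.cores.max", "24"),
        ("spark.streaming.blockInterval", "350ms"),
    ])

    sc = SparkContext(conf=spark_conf)

    # On Spark 2.0.1 the values set above show up here; on 2.1.0 they don't,
    # which matches what I see in the Spark UI.
    for key, value in sc.getConf().getAll():
        print("{0}: {1}".format(key, value))

    sc.stop()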