I have been interested in finding out why I am getting strange behavior when running a certain spark job. The job will error out if I place an action (A .show(1) method) either right after caching the DataFrame or right before writing the dataframe back to hdfs. There is a very similar post to Stackoverflow post here... Spark SQL SaveMode.Overwrite, getting java.io.FileNotFoundException and requiring 'REFRESH TABLE tableName' <https://stackoverflow.com/questions/42920748/spark-sql-savemode-overwrite-getting-java-io-filenotfoundexception-and-requirin> .
Basically the other post explains, that when you read from the same hdfs directory that you are writing to, and your SaveMode is "overwrite", then you will get a java.io.FileNotFoundException. But here I am finding that just moving where in the program the action is can give very different results - either completing the program or giving this exception. I was wondering if anyone can explain why Spark is not being consistent here? val myDF = spark.read.format("csv") .option("header", "false") .option("delimiter", "\t") .schema(schema) .load(myPath) // If I cache it here or persist it then do an action after the cache, it will occasionally // not throw the error. This is when completely restarting the SparkSession so there is no // risk of another user interfering on the same JVM. myDF.cache() myDF.show(1) // Below is just meant to be showing that we're are doing other "spark dataframe transformations", // but different transformations have both led to the weird behavior so, I'm not being specific about // what exactly the dataframe transformations are val secondDF = mergeOtherDFsWithmyDF(myDF, otherDF, thirdDF) val fourthDF = mergeTwoDFs(thirdDF, StringToCheck, fifthDF) // Below is the same .show(1) action call as was previously done, only this below // action ALWAYS results in a successful completion and the above .show(1) sometimes results // in FileNotFoundException and sometimes results in successful completion. The only // thing that changes among test runs is only one is executed. Either // **fourthDF.show(1) or myDF.show(1) is left commented out** fourthDF.show(1) fourthDF.write .mode(writeMode) .option("header", "false") .option("delimiter", "\t") .csv(myPath)