Caching dataframes and overwrite

Michael Artz Tue, 21 Nov 2017 17:40:19 -0800

I have been trying to find out why I am getting strange behavior
when running a certain Spark job. The job either errors out or completes
depending on whether I place an action (a .show(1) call) right after
caching the DataFrame or right before writing the DataFrame back to HDFS.
There is a very similar post on Stack Overflow: Spark SQL
SaveMode.Overwrite, getting java.io.FileNotFoundException and requiring
'REFRESH TABLE tableName'
<https://stackoverflow.com/questions/42920748/spark-sql-savemode-overwrite-getting-java-io-filenotfoundexception-and-requirin>


Basically, the other post explains that when you read from the same HDFS
directory that you are writing to, and your SaveMode is "overwrite", you
will get a java.io.FileNotFoundException. But here I am finding that
just moving the action to a different place in the program can give very
different results: the program either completes or throws this exception.
Can anyone explain why Spark is not being consistent here?

val myDF = spark.read.format("csv")
    .option("header", "false")
    .option("delimiter", "\t")
    .schema(schema)
    .load(myPath)

// If I cache or persist it here and then run an action right after the
// cache, it will occasionally not throw the error. This is with a
// completely restarted SparkSession, so there is no risk of another user
// interfering on the same JVM.
myDF.cache()
myDF.show(1)
// The calls below just stand in for "other Spark DataFrame
// transformations". Different transformations have all led to the same
// weird behavior, so I'm not being specific about what exactly they are.

val secondDF = mergeOtherDFsWithmyDF(myDF, otherDF, thirdDF)

val fourthDF = mergeTwoDFs(thirdDF, StringToCheck, fifthDF)

// Below is the same .show(1) action as above, only this one ALWAYS
// results in a successful completion, whereas the .show(1) above sometimes
// results in a FileNotFoundException and sometimes completes successfully.
// The only thing that changes between test runs is which one is executed:
// either fourthDF.show(1) or myDF.show(1) is left commented out.
fourthDF.show(1)
fourthDF.write
    .mode(writeMode)
    .option("header", "false")
    .option("delimiter", "\t")
    .csv(myPath)
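
For reference, here is a minimal sketch of one way to avoid reading from and overwriting the same directory in a single job. It is not taken from the job above. Because DataFrames are evaluated lazily, the overwrite may delete files under myPath that the write itself still needs to read, so the sketch writes to a separate staging directory first and only then swaps it into place. The object name StagedOverwrite, the helper overwriteViaStaging and the stagingPath parameter are hypothetical names introduced for illustration; the Hadoop FileSystem calls (exists, delete, rename) are the standard ones.

```scala
import java.io.IOException

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Hypothetical helper, for illustration only.
object StagedOverwrite {

  // Writes df as tab-delimited CSV to stagingPath, then replaces targetPath
  // with the staged output. stagingPath is an assumed scratch location that
  // must differ from targetPath and from every path df reads from.
  def overwriteViaStaging(
      spark: SparkSession,
      df: DataFrame,
      targetPath: String,
      stagingPath: String): Unit = {

    // 1. Materialize the full result while the input files still exist.
    df.write
      .mode(SaveMode.Overwrite)
      .option("header", "false")
      .option("delimiter", "\t")
      .csv(stagingPath)

    // 2. Swap the staged directory into place with the Hadoop FileSystem API.
    val target = new Path(targetPath)
    val staging = new Path(stagingPath)
    val fs = target.getFileSystem(spark.sparkContext.hadoopConfiguration)

    if (fs.exists(target)) {
      fs.delete(target, true) // recursive delete of the old data
    }
    if (!fs.rename(staging, target)) {
      throw new IOException(s"Could not rename $stagingPath to $targetPath")
    }
  }
}
```

With the names from the snippet above, the call would look like StagedOverwrite.overwriteViaStaging(spark, fourthDF, myPath, myPath + "_staging"), where the "_staging" suffix is again just an assumption. Note that the delete followed by the rename is not atomic, so a failure between the two steps leaves the target directory missing.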
