Imagine a simple Spark job that stores each line of an RDD in a
separate file:


import java.io.{File, FileOutputStream}
import java.net.URI

val lines = sc.parallelize(1 to 100).map(n => s"this is line $n")
lines.foreach(line => writeToFile(line))

def writeToFile(line: String) = {
    // each record gets its own target file (path elided here)
    val filePath = "file://..."
    val file = new File(new URI(filePath).getPath)
    // the using helper simply closes the output stream when the block finishes
    using(new FileOutputStream(file)) { output =>
      output.write(line.getBytes)
    }
}
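
(The using helper is not shown above; our real one may differ slightly,
but it is roughly the usual loan-pattern sketch below:)

def using[A <: java.io.Closeable, B](resource: A)(f: A => B): B =
  try f(resource) finally resource.close()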


Now, the example above works 99.9% of the time: a file is generated for
each line, and each file contains that particular line.

However, when dealing with a large amount of data, we encounter
situations where some of the files are empty! The files are created, but
there is no content inside them (0 bytes).

Now the question is: can a Spark job have side effects? Is it even legal
to write such code?
If not, what other options do we have for saving data from our RDD?
If so, do you see what could be the reason for this job acting in this
strange manner 0.1% of the time?


Disclaimer: we are fully aware of the .saveAsTextFile method in the API;
however, the example above is a simplification of our code - normally we
produce PDF files.
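
(For reference, the built-in call we mean is the one below; as far as we
understand it writes one part file per partition rather than one file per
record, and it does not cover our binary PDF output - the path is just a
placeholder:)

lines.saveAsTextFile("file://...")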


Best regards,
Paweł Szulc
