Yes, this is perfectly "legal". This is what RDD.foreach() is for! You may
be encountering an IO exception while writing, and perhaps your using()
helper is suppressing it. I'd try writing the files with
java.nio.file.Files.write() -- I'd expect there is less that can go wrong
with that simple call.
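For example, here is a minimal sketch of that idea (assuming each executor
writes to a local /tmp/lines directory and names files by hash code -- the
directory and naming scheme are only illustrative, not taken from your code):

  import java.nio.file.{Files, Paths}

  val lines = sc.parallelize(1 to 100).map(n => s"this is line $n")
  lines.foreach { line =>
    // Make sure the target directory exists on the worker node.
    Files.createDirectories(Paths.get("/tmp/lines"))
    // Files.write creates (or truncates) the file and throws an IOException
    // on failure, instead of silently leaving an empty file behind.
    Files.write(Paths.get("/tmp/lines", s"${line.hashCode}.txt"),
      line.getBytes("UTF-8"))
  }

If that version still produces empty files, the problem is more likely in how
the tasks run than in the stream handling.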

On Thu, Dec 11, 2014 at 12:50 PM, Paweł Szulc <paul.sz...@gmail.com> wrote:

> Imagine simple Spark job, that will store each line of the RDD to a
> separate file
>
>
> val lines = sc.parallelize(1 to 100).map(n => s"this is line $n")
> lines.foreach(line => writeToFile(line))
>
> def writeToFile(line: String) = {
>     val path = "file://..."
>     val file = new File(new URI(path).getPath)
>     // the using function simply closes the output stream
>     using(new FileOutputStream(file)) { output =>
>       output.write(line.getBytes)
>     }
> }
>
>
> Now, the example above works 99.9% of the time. Files are generated for each
> line, and each file contains that particular line.
>
> However, when dealing with a large amount of data, we encounter situations
> where some of the files are empty! The files are generated, but there is no
> content inside them (0 bytes).
>
> Now the question is: can a Spark job have side effects? Is it even legal to
> write such code?
> If not, then what other choice do we have when we want to save data from
> our RDD?
> If yes, then do you guys see what could be the reason for this job acting
> in this strange manner 0.1% of the time?
>
>
> disclaimer: we are fully aware of the .saveAsTextFile method in the API;
> however, the example above is a simplification of our code - normally we
> produce PDF files.
>
>
> Best regards,
> Paweł Szulc
>
