I would not recommend using the direct output committer with HDFS. It's intended only as an optimization for S3.
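For reference, a rough sketch of what I mean (Spark 1.6-era API; "df" and "outputPath" stand in for your dataframe and path, and the committer class shown is the Parquet default as of 1.6 -- adjust for your version):

    import org.apache.spark.sql.SaveMode

    // Revert to the default ParquetOutputCommitter, which writes task output
    // to a temporary location and renames it on commit, so a retried task
    // does not collide with a partially written file on HDFS.
    sqlContext.setConf("spark.sql.parquet.output.committer.class",
                       "org.apache.parquet.hadoop.ParquetOutputCommitter")

    df.write
      .format("parquet")
      .mode(SaveMode.Overwrite)
      .save(outputPath)

With the default committer, a transient failure just means the task attempt is retried and its temporary output is discarded, rather than failing the whole job.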
On Fri, Mar 25, 2016 at 4:03 AM, Vinoth Chandar <vin...@uber.com> wrote:
> Hi,
>
> We are saving a dataframe in parquet (using DirectParquetOutputCommitter)
> as follows:
>
> dfWriter.format("parquet")
>   .mode(SaveMode.Overwrite)
>   .save(outputPath)
>
> The problem is that even if an executor fails once while writing a file
> (say, some transient HDFS issue), when it is re-spawned it fails again
> because the file already exists, eventually failing the entire job.
>
> Is this a known issue? Any workarounds?
>
> Thanks
> Vinoth