Re: Spark 1.4 DataFrame Parquet file writing - missing random rows/partitions

Cheng Lian Wed, 17 Jun 2015 01:25:58 -0700

Hi Nathan,

Thanks a lot for the detailed report, especially the information about nonconsecutive part numbers. It's confirmed to be a race condition bug, and I just filed https://issues.apache.org/jira/browse/SPARK-8406 to track this. Will deliver a fix ASAP and this will be included in 1.4.1.


Best,
Cheng

On 6/16/15 12:30 AM, Nathan McCarthy wrote:

Hi all,
Looks like data frame parquet writing is very broken in Spark 1.4.0. We had no problems with Spark 1.3.
The problem shows up when trying to save a data frame with *569610608* rows:

dfc.write.format("parquet").save("/data/map_parquet_file")
We get random results between runs. Caching the data frame in memory makes no difference. It looks like the write-out misses some of the RDD partitions. We have an RDD with *6750* partitions. When we write it out we get fewer files than the number of partitions. When reading the data back in and running a count, we get a smaller number of rows.
I've tried counting the rows in several different ways. All return the same result, *560214031* rows, missing about 9.4 million rows (about 1.6%).
qc.read.parquet("/data/map_parquet_file").count
qc.read.parquet("/data/map_parquet_file").rdd.count
qc.read.parquet("/data/map_parquet_file").mapPartitions { itr => var c = 0; itr.foreach(_ => c = c + 1); Seq(c).toIterator }.reduce(_ + _)
Looking at the files on HDFS, there are /6643/ .parquet files, so 107 partitions are missing (about 1.6%).
Writing the same cached DF out again to a new file gives *6717* files on HDFS (33 files missing, or about 0.5%):
dfc.write.parquet("/data/map_parquet_file_2")

And we get *566670107* rows back (about 3 million missing, ~0.5%):

qc.read.parquet("/data/map_parquet_file_2").count
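The missing fractions for the two parquet runs can be double-checked with a few lines of plain Scala. This is only a minimal sketch using the row and file counts quoted in this report; the object and helper names (`MissingFraction`, `missingPct`) are made up for illustration and are not part of Spark.

```scala
// Sanity check of the figures quoted in this report.
// MissingFraction and missingPct are hypothetical names used only for illustration.
object MissingFraction {
  // Percentage of `expected` that is absent from `actual`.
  def missingPct(expected: Long, actual: Long): Double =
    100.0 * (expected - actual) / expected

  def main(args: Array[String]): Unit = {
    val expectedRows  = 569610608L // rows in the source data frame
    val expectedFiles = 6750L      // one part file per RDD partition

    // (output directory, rows read back, .parquet files found)
    val runs = Seq(
      ("map_parquet_file",   560214031L, 6643L),
      ("map_parquet_file_2", 566670107L, 6717L)
    )

    for ((name, rows, files) <- runs) {
      println(
        f"$name: ${expectedRows - rows} rows missing (${missingPct(expectedRows, rows)}%.2f%%), " +
        f"${expectedFiles - files} files missing (${missingPct(expectedFiles, files)}%.2f%%)"
      )
    }
  }
}
```

For each run the share of missing rows should come out close to the share of missing files, which is what you would expect if whole partitions are being dropped rather than individual rows.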
Writing the same DF out to JSON writes the expected number (*6750*) of part files and returns the right number of rows, /569610608/.
dfc.write.format("json").save("/data/map_parquet_file_3")
qc.read.format("json").load("/data/map_parquet_file_3").count
One thing to note is that the parquet part files on HDFS do not have the normal sequential part numbers like the JSON output and the parquet output in Spark 1.3:
part-r-06151.gz.parquet part-r-118401.gz.parquet part-r-146249.gz.parquet part-r-196755.gz.parquet
part-r-35811.gz.parquet part-r-55628.gz.parquet part-r-73497.gz.parquet part-r-97237.gz.parquet
part-r-06161.gz.parquet part-r-118406.gz.parquet part-r-146254.gz.parquet part-r-196763.gz.parquet
part-r-35826.gz.parquet part-r-55647.gz.parquet part-r-73500.gz.parquet _SUCCESS
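A rough way to spot this from a directory listing is to pull the part numbers out of the file names and compare them with the partition count. The sketch below is plain Scala and does not touch Spark or HDFS; `PartFileCheck` and its helpers are hypothetical names, the name pattern is taken from the file names in this listing, and it assumes that sequential numbering would start at 0.

```scala
// Hypothetical helper for inspecting part-file names from a directory listing.
object PartFileCheck {
  // Matches names such as "part-r-06151.gz.parquet" and captures the number.
  private val PartName = """part-r-(\d+)\.gz\.parquet""".r

  // Part numbers found in the listing; other entries (e.g. _SUCCESS) are skipped.
  def partNumbers(names: Seq[String]): Seq[Int] =
    names.collect { case PartName(n) => n.toInt }

  // Number of part files that appear to be missing for a given partition count.
  def missingFiles(names: Seq[String], expectedPartitions: Int): Int =
    expectedPartitions - partNumbers(names).size

  def main(args: Array[String]): Unit = {
    // A few names copied from the listing in this report.
    val sample = Seq(
      "part-r-06151.gz.parquet",
      "part-r-06161.gz.parquet",
      "part-r-118401.gz.parquet",
      "_SUCCESS"
    )
    val nums = partNumbers(sample)
    println(s"part files: ${nums.size}, highest part number: ${nums.max}")
    // Assumption: with 6750 partitions numbered from 0, no number should exceed 6749.
    println(s"numbers beyond 6749: ${nums.count(_ > 6749)}")
  }
}
```

Feeding the full listing of an output directory into `missingFiles` together with the partition count (6750 here) gives the number of absent part files without having to read the data back.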
We are using MapR 4.0.2 for hdfs.

Any ideas?

Cheers,
Nathan
