Corrupt parquet file

2018-02-05 Thread Dong Jiang
Hi, we are running Spark 2.2.1 and generating parquet files with something like the following pseudocode: df.write.parquet(...) We have recently noticed parquet file corruption when reading the files in Spark or Presto, with errors like the following: Caused by: org.apache.parquet.io.ParquetDecodingException: Can not

Re: Corrupt parquet file

2018-02-05 Thread Dong Jiang
before, what do you do to prevent a recurrence? Thanks, Dong From: Ryan Blue <rb...@netflix.com> Reply-To: "rb...@netflix.com" <rb...@netflix.com> Date: Monday, February 5, 2018 at 12:46 PM To: Dong Jiang <dji...@dataxu.com> Cc: Spark Dev List <dev@spark.apache.or

Re: Corrupt parquet file

2018-02-12 Thread Dong Jiang
back the entire data set, and then copy from HDFS to S3. Any other thoughts? From: Steve Loughran <ste...@hortonworks.com> Date: Monday, February 12, 2018 at 2:27 PM To: "rb...@netflix.com" <rb...@netflix.com> Cc: Dong Jiang <dji...@dataxu.com>, Apache Spark Dev <de

Re: Corrupt parquet file

2018-02-05 Thread Dong Jiang
o: "rb...@netflix.com" <rb...@netflix.com> Date: Monday, February 5, 2018 at 1:34 PM To: Dong Jiang <dji...@dataxu.com> Cc: Spark Dev List <dev@spark.apache.org> Subject: Re: Corrupt parquet file We ensure the bad node is removed from our cluster and reprocess to replac

Re: Corrupt parquet file

2018-02-05 Thread Dong Jiang
a recurrence? Can you share your experience? Thanks, Dong From: Ryan Blue <rb...@netflix.com> Reply-To: "rb...@netflix.com" <rb...@netflix.com> Date: Monday, February 5, 2018 at 12:38 PM To: Dong Jiang <dji...@dataxu.com> Cc: Spark Dev List <dev@spark.apache.or

Spark SQL unexpected behavior when comparing timestamp to date

2018-03-02 Thread Dong Jiang
Hi, I opened a JIRA ticket, https://issues.apache.org/jira/browse/SPARK-23549. Could anyone take a look? Spark SQL unexpected behavior when comparing timestamp to date: scala> spark.version res1: String = 2.2.1 scala> spark.sql("select cast('2017-03-01 00:00:00' as timestamp)