[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-03-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20579


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-03-21 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r176232893
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -719,4 +720,27 @@ object DataSource extends Logging {
 }
 globPath
   }
+
+  /**
+   * Called before writing into a FileFormat based data source to make 
sure the
+   * supplied schema is not empty.
+   * @param schema
+   */
+  private def hasEmptySchema(schema: StructType): Unit = {
+def hasEmptySchemaInternal(schema: StructType): Boolean = {
--- End diff --

@cloud-fan I have gone ahead and changed the top level function name to 
validateSchema. I have kept the internal function name to be hasEmptySchema. 
Hopefully it makes sense now.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-03-21 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r176187422
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -719,4 +720,27 @@ object DataSource extends Logging {
 }
 globPath
   }
+
+  /**
+   * Called before writing into a FileFormat based data source to make 
sure the
+   * supplied schema is not empty.
+   * @param schema
+   */
+  private def hasEmptySchema(schema: StructType): Unit = {
+def hasEmptySchemaInternal(schema: StructType): Boolean = {
--- End diff --

You are right @cloud-fan. Given we are raising the error from the function 
itself, should i rename it to validateSchema ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-03-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r176163585
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -719,4 +720,27 @@ object DataSource extends Logging {
 }
 globPath
   }
+
+  /**
+   * Called before writing into a FileFormat based data source to make 
sure the
+   * supplied schema is not empty.
+   * @param schema
+   */
+  private def hasEmptySchema(schema: StructType): Unit = {
+def hasEmptySchemaInternal(schema: StructType): Boolean = {
--- End diff --

they should be `verifySchema` and `hasEmptySchema`, depending on their 
return type.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-03-20 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r175952404
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -77,7 +77,6 @@ class ParquetFileFormat
   job: Job,
   options: Map[String, String],
   dataSchema: StructType): OutputWriterFactory = {
-
--- End diff --

@cloud-fan will remove.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-03-20 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r175952408
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -719,4 +720,27 @@ object DataSource extends Logging {
 }
 globPath
   }
+
+  /**
+   * Called before writing into a FileFormat based data source to make 
sure the
+   * supplied schema is not empty.
+   * @param schema
+   */
+  private def verifySchema(schema: StructType): Unit = {
+def verifyInternal(schema: StructType): Boolean = {
--- End diff --

@cloud-fan will make the change.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-03-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r175950101
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -77,7 +77,6 @@ class ParquetFileFormat
   job: Job,
   options: Map[String, String],
   dataSchema: StructType): OutputWriterFactory = {
-
--- End diff --

unnecessary change


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-03-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r175950061
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -719,4 +720,27 @@ object DataSource extends Logging {
 }
 globPath
   }
+
+  /**
+   * Called before writing into a FileFormat based data source to make 
sure the
+   * supplied schema is not empty.
+   * @param schema
+   */
+  private def verifySchema(schema: StructType): Unit = {
+def verifyInternal(schema: StructType): Boolean = {
--- End diff --

better to call it `hasEmptySchema`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-03-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r175858198
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -546,6 +546,10 @@ case class DataSource(
   case dataSource: CreatableRelationProvider =>
 SaveIntoDataSourceCommand(data, dataSource, 
caseInsensitiveOptions, mode)
   case format: FileFormat =>
+if (DataSource.isBuiltInFileBasedDataSource(format) && 
data.schema.size == 0) {
--- End diff --

actually we just need [this 
check](https://github.com/apache/spark/pull/20579/files#diff-ee26d4c4be21e92e92a02e9f16dbc285R71)
 here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-03-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r175854733
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -546,6 +546,10 @@ case class DataSource(
   case dataSource: CreatableRelationProvider =>
 SaveIntoDataSourceCommand(data, dataSource, 
caseInsensitiveOptions, mode)
   case format: FileFormat =>
+if (DataSource.isBuiltInFileBasedDataSource(format) && 
data.schema.size == 0) {
--- End diff --

We don't need this check. `FileFormat` is internal, we don't need to 
distinguish between built-in file format or external ones.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-03-16 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r175154988
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -542,6 +542,11 @@ case class DataSource(
   throw new AnalysisException("Cannot save interval data type into 
external storage.")
 }
 
+if (data.schema.size == 0) {
--- End diff --

@gatorsmile @cloud-fan OK.. sounds reasonable to me. I will rollback the 
latest change in this PR and we can discuss if we want to introduce the 
behaviour change in a future jira/pr. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-03-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r175118143
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -542,6 +542,11 @@ case class DataSource(
   throw new AnalysisException("Cannot save interval data type into 
external storage.")
 }
 
+if (data.schema.size == 0) {
--- End diff --

Currently, we are not blocking this. I do not think we should introduce 
this behavior change. This is risky to block all the cases. 

Previously, I tried to block CREATE TABLE with an empty schema. Later, I 
hit a regression because some data sources are using options/table properties 
to specify the schema...

A general guide here is to avoid behavior changes if possible. When we have 
to introduce a behavior change, we should make it configurable. At least, users 
can convert it back by using a flag.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-03-16 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r175019613
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -542,6 +542,11 @@ case class DataSource(
   throw new AnalysisException("Cannot save interval data type into 
external storage.")
 }
 
+if (data.schema.size == 0) {
--- End diff --

@gatorsmile May i request you to please quickly go through Wenchen's and 
Ryan's comments  above ? My understanding is that , we want to consistently 
rejecting writing empty schema for all the data sources ? Please let me know.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-03-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r175013773
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -542,6 +542,11 @@ case class DataSource(
   throw new AnalysisException("Cannot save interval data type into 
external storage.")
 }
 
+if (data.schema.size == 0) {
--- End diff --

Is it required? This is a behavior change. Can we exclude it from this PR? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-03-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r173625828
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -72,6 +72,29 @@ class FileBasedDataSourceSuite extends QueryTest with 
SharedSQLContext {
 }
   }
 
+  // Text and Parquet format does not allow wrting data frame with empty 
schema.
+  Seq("parquet", "text").foreach { format =>
+test(s"SPARK-23372 writing empty dataframe should produce 
AnalysisException - $format") {
+  withTempPath { outputPath =>
+intercept[AnalysisException] {
+  
spark.emptyDataFrame.write.format(format).save(outputPath.toString)
+}
+  }
+}
+  }
+
+  // Formats excluding text and parquet allow writing empty data frames to 
files.
+  allFileBasedDataSources.filterNot(p => p == "text" || p == 
"parquet").foreach { format =>
+test(s"SPARK-23372 writing empty dataframe and reading from it - 
$format") {
+  withTempPath { outputPath =>
+  
spark.emptyDataFrame.write.format(format).save(outputPath.toString)
+  intercept[AnalysisException] {
+val df = spark.read.format(format).load(outputPath.toString)
--- End diff --

Sorry if I misunderstood. The link is 
https://github.com/apache/spark/pull/20579#issuecomment-364994881. Is that the 
right link?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-02-12 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r167664375
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -72,6 +72,29 @@ class FileBasedDataSourceSuite extends QueryTest with 
SharedSQLContext {
 }
   }
 
+  // Text and Parquet format does not allow wrting data frame with empty 
schema.
+  Seq("parquet", "text").foreach { format =>
+test(s"SPARK-23372 writing empty dataframe should produce 
AnalysisException - $format") {
+  withTempPath { outputPath =>
+intercept[AnalysisException] {
+  
spark.emptyDataFrame.write.format(format).save(outputPath.toString)
+}
--- End diff --

Sure @dongjoon-hyun 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-02-12 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r167660095
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -68,6 +68,16 @@ class ParquetFileFormat
 
   override def toString: String = "Parquet"
 
+  private def verifySchema(schema: StructType): Unit = {
+if (schema.size == 0) {
+  throw new AnalysisException(
+s"""
+   |Parquet data source does not support writing empty groups.
--- End diff --

@hvanhovell Thank you. I will change it to use "schema". I will check 
nested schema as well.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-02-12 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r167661193
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -72,6 +72,29 @@ class FileBasedDataSourceSuite extends QueryTest with 
SharedSQLContext {
 }
   }
 
+  // Text and Parquet format does not allow wrting data frame with empty 
schema.
+  Seq("parquet", "text").foreach { format =>
+test(s"SPARK-23372 writing empty dataframe should produce 
AnalysisException - $format") {
+  withTempPath { outputPath =>
+intercept[AnalysisException] {
+  
spark.emptyDataFrame.write.format(format).save(outputPath.toString)
+}
--- End diff --

Can we check the error message to make it sure?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-02-12 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r167659522
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -72,6 +72,29 @@ class FileBasedDataSourceSuite extends QueryTest with 
SharedSQLContext {
 }
   }
 
+  // Text and Parquet format does not allow wrting data frame with empty 
schema.
+  Seq("parquet", "text").foreach { format =>
+test(s"SPARK-23372 writing empty dataframe should produce 
AnalysisException - $format") {
+  withTempPath { outputPath =>
+intercept[AnalysisException] {
+  
spark.emptyDataFrame.write.format(format).save(outputPath.toString)
+}
+  }
+}
+  }
+
+  // Formats excluding text and parquet allow writing empty data frames to 
files.
+  allFileBasedDataSources.filterNot(p => p == "text" || p == 
"parquet").foreach { format =>
+test(s"SPARK-23372 writing empty dataframe and reading from it - 
$format") {
+  withTempPath { outputPath =>
+  
spark.emptyDataFrame.write.format(format).save(outputPath.toString)
+  intercept[AnalysisException] {
+val df = spark.read.format(format).load(outputPath.toString)
--- End diff --

@hvanhovell Actually thats my question as well. Please see my comment 
[comment/question](https://github.com/apache/spark/pull/20579#issuecomment-364994881)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-02-12 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r167625448
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -72,6 +72,29 @@ class FileBasedDataSourceSuite extends QueryTest with 
SharedSQLContext {
 }
   }
 
+  // Text and Parquet format does not allow wrting data frame with empty 
schema.
+  Seq("parquet", "text").foreach { format =>
+test(s"SPARK-23372 writing empty dataframe should produce 
AnalysisException - $format") {
+  withTempPath { outputPath =>
+intercept[AnalysisException] {
+  
spark.emptyDataFrame.write.format(format).save(outputPath.toString)
+}
+  }
+}
+  }
+
+  // Formats excluding text and parquet allow writing empty data frames to 
files.
+  allFileBasedDataSources.filterNot(p => p == "text" || p == 
"parquet").foreach { format =>
+test(s"SPARK-23372 writing empty dataframe and reading from it - 
$format") {
+  withTempPath { outputPath =>
+  
spark.emptyDataFrame.write.format(format).save(outputPath.toString)
+  intercept[AnalysisException] {
+val df = spark.read.format(format).load(outputPath.toString)
--- End diff --

This should pass right?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-02-12 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r167623896
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -68,6 +68,16 @@ class ParquetFileFormat
 
   override def toString: String = "Parquet"
 
+  private def verifySchema(schema: StructType): Unit = {
+if (schema.size == 0) {
+  throw new AnalysisException(
+s"""
+   |Parquet data source does not support writing empty groups.
--- End diff --

`group` is a parquet term. Let's use `schema` instead?

We should also check for nested empty schema's.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-02-11 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r167465371
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -72,6 +72,29 @@ class FileBasedDataSourceSuite extends QueryTest with 
SharedSQLContext {
 }
   }
 
+  // Text and Parquet format does not allow wrting data frame with empty 
schema.
+  Seq("parquet", "text").foreach { format =>
+test(s"SPARK-23372 writing empty dataframe should produce 
AnalysisException - $format") {
+  withTempPath { outputPath =>
+intercept[AnalysisException] {
+  
spark.emptyDataFrame.write.format(format).save(outputPath.toString)
+}
+  }
+}
+  }
+
+  // Formats excluding text and parquet allow writing empty data frames to 
files.
--- End diff --

I have added this test to show the current behaviour of reading the empty 
data frame.  For formats like orc, json, csv , we succeed to write an empty 
data frame but get the error while reading while inferring the schema ? Is this 
the right behaviour ? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-02-11 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r167463979
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -68,6 +68,16 @@ class ParquetFileFormat
 
   override def toString: String = "Parquet"
 
+  private def verifySchema(schema: StructType): Unit = {
+if (schema.size < 1) {
--- End diff --

no particular reason :-). I can change that to "== 0".


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-02-11 Thread mannharleen
Github user mannharleen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20579#discussion_r167463659
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -68,6 +68,16 @@ class ParquetFileFormat
 
   override def toString: String = "Parquet"
 
+  private def verifySchema(schema: StructType): Unit = {
+if (schema.size < 1) {
--- End diff --

any particular reason for using "< 1" and not "== 0". just curious


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20579: [SPARK-23372][SQL] Writing empty struct in parque...

2018-02-11 Thread dilipbiswal
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/20579

[SPARK-23372][SQL] Writing empty struct in parquet fails during execution. 
It should fail earlier in the processing.

## What changes were proposed in this pull request?
Running
spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path)
Results in
``` SQL
org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema 
with an empty group: message spark_schema {
 }

at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27)
 at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37)
 at org.apache.parquet.schema.MessageType.accept(MessageType.java:58)
 at 
org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23)
 at 
org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:225)
 at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342)
 at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302)
 at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37)
 at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:276)
 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:281)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:206)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:205)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
 at org.apache.spark.scheduler.Task.run(Task.scala:109)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.
```

This PR addresses a couple of things.
1) The above case now fails earlier during processing during the prep write 
phase.
2) Writing an empty data frame in ORC succeeds but fails during read while 
inferring the schema.
This issue is also addressed in this PR.

## How was this patch tested?

Unit tests added in FileBasedDatasourceSuite.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark spark-23372

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20579.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20579


commit 9f7a1705960250cf6a828787f0f12a9f28b608c5
Author: Dilip Biswal 
Date:   2018-02-11T17:09:07Z

[SPARK-23372] Writing empty struct in parquet fails during execution. It 
should fail earlier in the processing




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org