[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...

2018-12-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22759


---




[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...

2018-12-06 Thread dima-asana
Github user dima-asana commented on a diff in the pull request:

https://github.com/apache/spark/pull/22759#discussion_r239577117
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala ---
@@ -542,6 +551,35 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be
     }
   }
 
+  test("parquet - column nullability -- write only") {
+    val schema = StructType(
+      StructField("cl1", IntegerType, nullable = false) ::
--- End diff --

fixed


---




[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...

2018-12-06 Thread dima-asana
Github user dima-asana commented on a diff in the pull request:

https://github.com/apache/spark/pull/22759#discussion_r239577230
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala ---
@@ -542,6 +551,35 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be
     }
   }
 
+  test("parquet - column nullability -- write only") {
+    val schema = StructType(
+      StructField("cl1", IntegerType, nullable = false) ::
+        StructField("cl2", IntegerType, nullable = true) :: Nil)
+    val row = Row(3, 4)
+    val df = spark.createDataFrame(sparkContext.parallelize(row :: Nil), schema)
+
+    withTempPath { dir =>
+      val path = dir.getAbsolutePath
+      df.write.mode("overwrite").parquet(path)
+      val file = SpecificParquetRecordReaderBase.listDirectory(dir).get(0)
+
+      val hadoopInputFile = HadoopInputFile.fromPath(new Path(file), new Configuration())
+      val f = ParquetFileReader.open(hadoopInputFile)
+      val parquetSchema = f.getFileMetaData.getSchema.getColumns.asScala
+        .map(_.getPrimitiveType)
+      f.close
+
+      // the write keeps nullable info from the schema
+      val expectedParquetSchema: Seq[PrimitiveType] = Seq(
+        new PrimitiveType(Repetition.REQUIRED, PrimitiveTypeName.INT32, "cl1"),
+        new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.INT32, "cl2")
+      )
+
+      assert (expectedParquetSchema == parquetSchema)
--- End diff --

fixed


---




[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...

2018-12-06 Thread dima-asana
Github user dima-asana commented on a diff in the pull request:

https://github.com/apache/spark/pull/22759#discussion_r239577173
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala ---
@@ -542,6 +551,35 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be
     }
   }
 
+  test("parquet - column nullability -- write only") {
+    val schema = StructType(
+      StructField("cl1", IntegerType, nullable = false) ::
+        StructField("cl2", IntegerType, nullable = true) :: Nil)
+    val row = Row(3, 4)
+    val df = spark.createDataFrame(sparkContext.parallelize(row :: Nil), schema)
+
+    withTempPath { dir =>
+      val path = dir.getAbsolutePath
+      df.write.mode("overwrite").parquet(path)
+      val file = SpecificParquetRecordReaderBase.listDirectory(dir).get(0)
+
+      val hadoopInputFile = HadoopInputFile.fromPath(new Path(file), new Configuration())
+      val f = ParquetFileReader.open(hadoopInputFile)
+      val parquetSchema = f.getFileMetaData.getSchema.getColumns.asScala
+        .map(_.getPrimitiveType)
+      f.close
+
+      // the write keeps nullable info from the schema
+      val expectedParquetSchema: Seq[PrimitiveType] = Seq(
--- End diff --

fixed


---




[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...

2018-12-06 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22759#discussion_r239474962
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala ---
@@ -542,6 +551,35 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be
     }
   }
 
+  test("parquet - column nullability -- write only") {
+    val schema = StructType(
+      StructField("cl1", IntegerType, nullable = false) ::
--- End diff --

Nit: could we indent these at the same level?
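Something like this, presumably (a sketch of the suggested alignment):

    val schema = StructType(
      StructField("cl1", IntegerType, nullable = false) ::
      StructField("cl2", IntegerType, nullable = true) :: Nil)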


---




[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...

2018-12-06 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22759#discussion_r239475332
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala ---
@@ -542,6 +551,35 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be
     }
   }
 
+  test("parquet - column nullability -- write only") {
+    val schema = StructType(
+      StructField("cl1", IntegerType, nullable = false) ::
+        StructField("cl2", IntegerType, nullable = true) :: Nil)
+    val row = Row(3, 4)
+    val df = spark.createDataFrame(sparkContext.parallelize(row :: Nil), schema)
+
+    withTempPath { dir =>
+      val path = dir.getAbsolutePath
+      df.write.mode("overwrite").parquet(path)
+      val file = SpecificParquetRecordReaderBase.listDirectory(dir).get(0)
+
+      val hadoopInputFile = HadoopInputFile.fromPath(new Path(file), new Configuration())
+      val f = ParquetFileReader.open(hadoopInputFile)
+      val parquetSchema = f.getFileMetaData.getSchema.getColumns.asScala
+        .map(_.getPrimitiveType)
+      f.close
+
+      // the write keeps nullable info from the schema
+      val expectedParquetSchema: Seq[PrimitiveType] = Seq(
--- End diff --

Also, it really doesn't matter, but you can simplify the code by omitting explicit types like this one, etc.
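E.g., a sketch with the annotation dropped (Scala infers `Seq[PrimitiveType]` from the elements):

    // No explicit type annotation needed; inferred as Seq[PrimitiveType].
    val expectedParquetSchema = Seq(
      new PrimitiveType(Repetition.REQUIRED, PrimitiveTypeName.INT32, "cl1"),
      new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.INT32, "cl2"))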


---




[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...

2018-12-06 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/22759#discussion_r239475203
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala ---
@@ -542,6 +551,35 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be
     }
   }
 
+  test("parquet - column nullability -- write only") {
+    val schema = StructType(
+      StructField("cl1", IntegerType, nullable = false) ::
+        StructField("cl2", IntegerType, nullable = true) :: Nil)
+    val row = Row(3, 4)
+    val df = spark.createDataFrame(sparkContext.parallelize(row :: Nil), schema)
+
+    withTempPath { dir =>
+      val path = dir.getAbsolutePath
+      df.write.mode("overwrite").parquet(path)
+      val file = SpecificParquetRecordReaderBase.listDirectory(dir).get(0)
+
+      val hadoopInputFile = HadoopInputFile.fromPath(new Path(file), new Configuration())
+      val f = ParquetFileReader.open(hadoopInputFile)
+      val parquetSchema = f.getFileMetaData.getSchema.getColumns.asScala
+        .map(_.getPrimitiveType)
+      f.close
+
+      // the write keeps nullable info from the schema
+      val expectedParquetSchema: Seq[PrimitiveType] = Seq(
+        new PrimitiveType(Repetition.REQUIRED, PrimitiveTypeName.INT32, "cl1"),
+        new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.INT32, "cl2")
+      )
+
+      assert (expectedParquetSchema == parquetSchema)
--- End diff --

Nit: I think ideally you'd use the `===` test operator, so that failures generate a better message.
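E.g., a sketch of the suggested assertion:

    // ScalaTest's === reports both sides of the comparison on failure,
    // whereas plain == only yields "assertion failed".
    assert(parquetSchema === expectedParquetSchema)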


---




[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...

2018-11-09 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22759#discussion_r232441937
  
--- Diff: docs/sql-programming-guide.md ---
@@ -706,7 +706,7 @@ data across a fixed number of buckets and can be used when a number of unique va
 
 [Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems.
 Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema
-of the original data. When writing Parquet files, all columns are automatically converted to be nullable for
+of the original data. When reading Parquet files, all columns are automatically converted to be nullable for
--- End diff --

This file has been re-organized. Could you merge the latest master?
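For context, a minimal sketch of the behavior the corrected sentence documents (the path is a placeholder; this mirrors the reproduction in the PR description below):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // Write a column that the schema marks as non-nullable...
    val schema = StructType(Seq(StructField("id", IntegerType, nullable = false)))
    val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(Row(1))), schema)
    df.write.mode("overwrite").parquet("/tmp/nullability_check")

    // ...the file stores the column as REQUIRED, but on read Spark
    // converts every column to nullable for compatibility reasons:
    spark.read.parquet("/tmp/nullability_check").printSchema()
    // root
    //  |-- id: integer (nullable = true)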


---




[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...

2018-10-17 Thread dima-asana
GitHub user dima-asana opened a pull request:

https://github.com/apache/spark/pull/22759

[MINOR][SQL][DOC] Correct parquet nullability documentation

## What changes were proposed in this pull request?

Parquet files appear to keep nullability info when being written, but not when being read.

## How was this patch tested?

Some test code (running Spark 2.3, but the relevant code in DataSource looks identical on master):

case class NullTest(bo: Boolean, opbol: Option[Boolean])
val testDf = spark.createDataFrame(Seq(NullTest(true, Some(false))))

defined class NullTest
testDf: org.apache.spark.sql.DataFrame = [bo: boolean, opbol: boolean]

testDf.write.parquet("s3://asana-stats/tmp_dima/parquet_check_schema")


spark.read.parquet("s3://asana-stats/tmp_dima/parquet_check_schema/part-0-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet4").printSchema()
root
 |-- bo: boolean (nullable = true)
 |-- opbol: boolean (nullable = true)


Meanwhile, the parquet file that was written does have nullability info:

[]batch@prod-report000:/tmp/dimakamalov-batch$ aws s3 ls s3://asana-stats/tmp_dima/parquet_check_schema/
2018-10-17 21:03:52          0 _SUCCESS
2018-10-17 21:03:50        504 part-0-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet
[]batch@prod-report000:/tmp/dimakamalov-batch$ aws s3 cp s3://asana-stats/tmp_dima/parquet_check_schema/part-0-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet .
download: s3://asana-stats/tmp_dima/parquet_check_schema/part-0-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet to ./part-0-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet
[]batch@prod-report000:/tmp/dimakamalov-batch$ java -jar parquet-tools-1.8.2.jar schema part-0-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet
message spark_schema {
  required boolean bo;
  optional boolean opbol;
}


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dima-asana/spark dima-asana-nullable-parquet-doc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22759.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22759


commit 0c6ae5621e0fee76f836bdc6ec504685b7983a75
Author: dima-asana <42555784+dima-asana@...>
Date:   2018-10-17T21:08:50Z

Correct parquet nullability documentation

Correct parquet nullability documentation: Parquet files appear to keep nullability info when being written, but not when being read. (The commit message repeats the test code and output from the PR description above.)




---
