[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22759 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...
Github user dima-asana commented on a diff in the pull request: https://github.com/apache/spark/pull/22759#discussion_r239577117 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala --- @@ -542,6 +551,35 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be } } + test("parquet - column nullability -- write only") { +val schema = StructType( + StructField("cl1", IntegerType, nullable = false) :: --- End diff -- fixed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...
Github user dima-asana commented on a diff in the pull request: https://github.com/apache/spark/pull/22759#discussion_r239577230 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala --- @@ -542,6 +551,35 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be } } + test("parquet - column nullability -- write only") { +val schema = StructType( + StructField("cl1", IntegerType, nullable = false) :: +StructField("cl2", IntegerType, nullable = true) :: Nil) +val row = Row(3, 4) +val df = spark.createDataFrame(sparkContext.parallelize(row :: Nil), schema) + +withTempPath { dir => + val path = dir.getAbsolutePath + df.write.mode("overwrite").parquet(path) + val file = SpecificParquetRecordReaderBase.listDirectory(dir).get(0) + + val hadoopInputFile = HadoopInputFile.fromPath(new Path(file), new Configuration()) + val f = ParquetFileReader.open(hadoopInputFile) + val parquetSchema = f.getFileMetaData.getSchema.getColumns.asScala + .map(_.getPrimitiveType) + f.close + + // the write keeps nullable info from the schema + val expectedParquetSchema: Seq[PrimitiveType] = Seq( +new PrimitiveType(Repetition.REQUIRED, PrimitiveTypeName.INT32, "cl1"), +new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.INT32, "cl2") + ) + + assert (expectedParquetSchema == parquetSchema) --- End diff -- fixed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...
Github user dima-asana commented on a diff in the pull request: https://github.com/apache/spark/pull/22759#discussion_r239577173 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala --- @@ -542,6 +551,35 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be } } + test("parquet - column nullability -- write only") { +val schema = StructType( + StructField("cl1", IntegerType, nullable = false) :: +StructField("cl2", IntegerType, nullable = true) :: Nil) +val row = Row(3, 4) +val df = spark.createDataFrame(sparkContext.parallelize(row :: Nil), schema) + +withTempPath { dir => + val path = dir.getAbsolutePath + df.write.mode("overwrite").parquet(path) + val file = SpecificParquetRecordReaderBase.listDirectory(dir).get(0) + + val hadoopInputFile = HadoopInputFile.fromPath(new Path(file), new Configuration()) + val f = ParquetFileReader.open(hadoopInputFile) + val parquetSchema = f.getFileMetaData.getSchema.getColumns.asScala + .map(_.getPrimitiveType) + f.close + + // the write keeps nullable info from the schema + val expectedParquetSchema: Seq[PrimitiveType] = Seq( --- End diff -- fixed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22759#discussion_r239474962 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala --- @@ -542,6 +551,35 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be } } + test("parquet - column nullability -- write only") { +val schema = StructType( + StructField("cl1", IntegerType, nullable = false) :: --- End diff -- Nit: could we indent these at the same level? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22759#discussion_r239475332 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala --- @@ -542,6 +551,35 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be } } + test("parquet - column nullability -- write only") { +val schema = StructType( + StructField("cl1", IntegerType, nullable = false) :: +StructField("cl2", IntegerType, nullable = true) :: Nil) +val row = Row(3, 4) +val df = spark.createDataFrame(sparkContext.parallelize(row :: Nil), schema) + +withTempPath { dir => + val path = dir.getAbsolutePath + df.write.mode("overwrite").parquet(path) + val file = SpecificParquetRecordReaderBase.listDirectory(dir).get(0) + + val hadoopInputFile = HadoopInputFile.fromPath(new Path(file), new Configuration()) + val f = ParquetFileReader.open(hadoopInputFile) + val parquetSchema = f.getFileMetaData.getSchema.getColumns.asScala + .map(_.getPrimitiveType) + f.close + + // the write keeps nullable info from the schema + val expectedParquetSchema: Seq[PrimitiveType] = Seq( --- End diff -- Also really doesn't matter, but you can simplify the code by omitting types like this, etc. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/22759#discussion_r239475203 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala --- @@ -542,6 +551,35 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be } } + test("parquet - column nullability -- write only") { +val schema = StructType( + StructField("cl1", IntegerType, nullable = false) :: +StructField("cl2", IntegerType, nullable = true) :: Nil) +val row = Row(3, 4) +val df = spark.createDataFrame(sparkContext.parallelize(row :: Nil), schema) + +withTempPath { dir => + val path = dir.getAbsolutePath + df.write.mode("overwrite").parquet(path) + val file = SpecificParquetRecordReaderBase.listDirectory(dir).get(0) + + val hadoopInputFile = HadoopInputFile.fromPath(new Path(file), new Configuration()) + val f = ParquetFileReader.open(hadoopInputFile) + val parquetSchema = f.getFileMetaData.getSchema.getColumns.asScala + .map(_.getPrimitiveType) + f.close + + // the write keeps nullable info from the schema + val expectedParquetSchema: Seq[PrimitiveType] = Seq( +new PrimitiveType(Repetition.REQUIRED, PrimitiveTypeName.INT32, "cl1"), +new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.INT32, "cl2") + ) + + assert (expectedParquetSchema == parquetSchema) --- End diff -- Nit: I think ideally you use the `===` test operator, so that failures generated a better message --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22759#discussion_r232441937 --- Diff: docs/sql-programming-guide.md --- @@ -706,7 +706,7 @@ data across a fixed number of buckets and can be used when a number of unique va [Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema -of the original data. When writing Parquet files, all columns are automatically converted to be nullable for +of the original data. When reading Parquet files, all columns are automatically converted to be nullable for --- End diff -- This file has been re-org . Could you merge the latest master? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22759: [MINOR][SQL][DOC] Correct parquet nullability doc...
GitHub user dima-asana opened a pull request: https://github.com/apache/spark/pull/22759 [MINOR][SQL][DOC] Correct parquet nullability documentation ## What changes were proposed in this pull request? Parquet files appear to have nullability info when being written, not being read. ## How was this patch tested? Some test code: (running spark 2.3, but the relevant code in DataSource looks identical on master) case class NullTest(bo: Boolean, opbol: Option[Boolean]) val testDf = spark.createDataFrame(Seq(NullTest(true, Some(false defined class NullTest testDf: org.apache.spark.sql.DataFrame = [bo: boolean, opbol: boolean] testDf.write.parquet("s3://asana-stats/tmp_dima/parquet_check_schema") spark.read.parquet("s3://asana-stats/tmp_dima/parquet_check_schema/part-0-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet4").printSchema() root |-- bo: boolean (nullable = true) |-- opbol: boolean (nullable = true) Meanwhile, the parquet file formed does have nullable info: []batch@prod-report000:/tmp/dimakamalov-batch$ aws s3 ls s3://asana-stats/tmp_dima/parquet_check_schema/ 2018-10-17 21:03:52 0 _SUCCESS 2018-10-17 21:03:50504 part-0-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet []batch@prod-report000:/tmp/dimakamalov-batch$ aws s3 cp s3://asana-stats/tmp_dima/parquet_check_schema/part-0-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet . download: s3://asana-stats/tmp_dima/parquet_check_schema/part-0-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet to ./part-0-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet []batch@prod-report000:/tmp/dimakamalov-batch$ java -jar parquet-tools-1.8.2.jar schema part-0-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet message spark_schema { required boolean bo; optional boolean opbol; } You can merge this pull request into a Git repository by running: $ git pull https://github.com/dima-asana/spark dima-asana-nullable-parquet-doc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22759.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22759 commit 0c6ae5621e0fee76f836bdc6ec504685b7983a75 Author: dima-asana <42555784+dima-asana@...> Date: 2018-10-17T21:08:50Z Correct parquet nullability documentation Parquet files appear to have nullability info when being written, not being read. Some test code: case class NullTest(bo: Boolean, opbol: Option[Boolean]) val testDf = spark.createDataFrame(Seq(NullTest(true, Some(false defined class NullTest testDf: org.apache.spark.sql.DataFrame = [bo: boolean, opbol: boolean] testDf.write.parquet("s3://asana-stats/tmp_dima/parquet_check_schema") spark.read.parquet("s3://asana-stats/tmp_dima/parquet_check_schema/part-0-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet4").printSchema() root |-- bo: boolean (nullable = true) |-- opbol: boolean (nullable = true) Meanwhile, the parquet file formed does have nullable info: []batch@prod-report000:/tmp/dimakamalov-batch$ aws s3 ls s3://asana-stats/tmp_dima/parquet_check_schema/ 2018-10-17 21:03:52 0 _SUCCESS 2018-10-17 21:03:50504 part-0-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet []batch@prod-report000:/tmp/dimakamalov-batch$ aws s3 cp s3://asana-stats/tmp_dima/parquet_check_schema/part-0-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet . download: s3://asana-stats/tmp_dima/parquet_check_schema/part-0-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet to ./part-0-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet []batch@prod-report000:/tmp/dimakamalov-batch$ java -jar parquet-tools-1.8.2.jar schema part-0-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet message spark_schema { required boolean bo; optional boolean opbol; } --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org