[ https://issues.apache.org/jira/browse/SPARK-24264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gerard Maas updated SPARK-24264:
--------------------------------
    Issue Type: Bug  (was: Improvement)

> [Structured Streaming] Remove 'mergeSchema' option from Parquet source configuration
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-24264
>                 URL: https://issues.apache.org/jira/browse/SPARK-24264
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.3.0
>            Reporter: Gerard Maas
>            Priority: Major
>              Labels: features, usability
>
> Looking into the Parquet format support for the File source in Structured
> Streaming, the docs mention the option 'mergeSchema', which merges the
> schemas of the part-files found. [1]
>
> There seems to be no practical use for that option in a streaming context.
>
> In its batch counterpart, `mergeSchema` infers the schema superset of the
> part-files found.
>
> When using the File source + parquet format in streaming mode, we must
> provide a schema to the readStream.schema(...) builder, and that schema is
> fixed for the duration of the stream.
>
> My current understanding is that:
>
> - Files containing a subset of the fields declared in the schema will render
>   null values for the non-existing fields.
> - For files containing a superset of the fields, the additional data fields
>   will be lost.
> - Files not matching the schema set on the streaming source will render all
>   fields null for each record in the file.
>
> It looks like 'mergeSchema' has no practical effect, although enabling it
> might lead to additional processing to actually merge the Parquet schemas of
> the input files.
>
> I inquired on the dev and user mailing lists about any other behavior, but
> got no responses.
>
> From the user's perspective, this option may appear to help a job cope with
> schema evolution at runtime, but that is not the case either.
>
> Looks like removing this option and leaving the value always set to false is
> the reasonable thing to do.
>
> [1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala#L376
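
The three cases listed in the issue description can be illustrated with a small toy model. This is plain Python, not Spark code, and it only mirrors the reporter's stated understanding of how records are projected onto a fixed schema. The field names, the `FIXED_SCHEMA` list and the `project_onto_schema` helper are hypothetical and exist only for this sketch. Matching is by field name only; type mismatches are not modeled.

```python
from typing import Any, Dict, List, Optional

# Hypothetical fixed schema, standing in for the one that would be passed
# to readStream.schema(...). Only field names are modeled here.
FIXED_SCHEMA: List[str] = ["id", "name", "score"]


def project_onto_schema(
    record: Dict[str, Any], schema: List[str]
) -> Dict[str, Optional[Any]]:
    """Project one record onto a fixed schema (toy model, not Spark).

    Fields missing from the record become None; fields that are not part
    of the schema are dropped.
    """
    return {field: record.get(field) for field in schema}


# Record from a file holding a subset of the declared fields.
subset_row = project_onto_schema({"id": 1, "name": "a"}, FIXED_SCHEMA)

# Record from a file holding a superset of the declared fields.
superset_row = project_onto_schema(
    {"id": 2, "name": "b", "score": 0.5, "extra": "x"}, FIXED_SCHEMA
)

# Record from a file sharing no field names with the declared schema.
unrelated_row = project_onto_schema({"foo": 1, "bar": 2}, FIXED_SCHEMA)

print(subset_row)     # "score" is filled with None
print(superset_row)   # "extra" is dropped
print(unrelated_row)  # every declared field is None
```

In this model the output shape never changes with the input, which is the point the issue makes: once the schema is fixed up front, merging the schemas of the incoming files has nothing left to influence.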