[ 
https://issues.apache.org/jira/browse/SPARK-24264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerard Maas updated SPARK-24264:
--------------------------------
    Issue Type: Bug  (was: Improvement)

> [Structured Streaming] Remove 'mergeSchema' option from Parquet source 
> configuration
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-24264
>                 URL: https://issues.apache.org/jira/browse/SPARK-24264
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.3.0
>            Reporter: Gerard Maas
>            Priority: Major
>              Labels: features, usability
>
> While looking into the Parquet format support for the File source in 
> Structured Streaming, I noticed that the docs mention the option 
> 'mergeSchema' to merge the schemas of the part files found. [1]
>  
> There seems to be no practical use of that configuration in a streaming 
> context.
>  
> In its batch counterpart, `mergeSchema` would infer the schema superset of 
> the part-files found. 
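
As a rough illustration of what "schema superset" means here, the following plain-Python sketch models each part-file schema as a mapping from field name to type name and merges them by ordered union. It does not call Spark. `merge_schemas` is a hypothetical helper, and treating a type conflict as an error is an assumption of this model rather than a statement about Spark's behavior.

```python
def merge_schemas(schemas):
    """Return the ordered union of the fields of several part-file schemas."""
    merged = {}
    for schema in schemas:
        for name, type_name in schema.items():
            # Assumption of this model: a field must keep the same type
            # in every part file, otherwise the merge is rejected.
            if merged.setdefault(name, type_name) != type_name:
                raise ValueError(f"conflicting types for field {name!r}")
    return merged


# Two hypothetical part-file schemas that share only the "id" field.
part_a = {"id": "long", "name": "string"}
part_b = {"id": "long", "score": "double"}
superset = merge_schemas([part_a, part_b])
```

In this model `superset` contains `id`, `name`, and `score`, which is the kind of result a batch read can expose because its schema is derived from the files rather than fixed up front.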
>  
> When using the File source + parquet format in streaming mode, we must 
> provide a schema to the readStream.schema(...) builder and that schema is 
> fixed for the duration of the stream.
>  
> My current understanding is that:
>  
> - Files containing a subset of the fields declared in the schema will render 
> null values for the non-existing fields.
> - For files containing a superset of the fields, the additional data fields 
> will be lost. 
> - Files not matching the schema set on the streaming source will render all 
> fields null for each record in the file.
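
The three cases listed above can be modeled as a projection of each record onto a fixed list of field names. This is a minimal plain-Python sketch of the behavior as the reporter understands it, not Spark code. `project`, the field names, and the sample records are all hypothetical.

```python
def project(record, schema_fields):
    """Project one record onto a fixed list of field names."""
    # Fields missing from the record become None (null);
    # fields that are not part of the schema are dropped.
    return {name: record.get(name) for name in schema_fields}


# The schema is fixed when the stream is defined.
schema_fields = ["id", "name"]

# Case 1: the file has a subset of the declared fields.
subset_row = project({"id": 1}, schema_fields)
# Case 2: the file has a superset of the declared fields.
superset_row = project({"id": 2, "name": "b", "score": 0.5}, schema_fields)
# Case 3: the file shares no fields with the declared schema.
unrelated_row = project({"x": 9, "y": 8}, schema_fields)
```

In none of the three cases does the output gain a field that the fixed schema lacks, which is why merging the schemas of the input files would not change what the stream produces under this model.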
>  
> It looks like 'mergeSchema' has no practical effect, although enabling it 
> might lead to additional processing to actually merge the Parquet schema of 
> the input files. 
>  
> I inquired on the dev+user mailing lists about any other behavior but I got 
> no responses.
>  
> Users may expect this option to help their job cope with schema evolution 
> at runtime, but that is not the case. 
>  
> Removing this option and always leaving the value set to false seems like 
> the reasonable thing to do.  
>  
> [1] 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala#L376]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

