Hyukjin Kwon created SPARK-11500:
------------------------------------

             Summary: Non-deterministic column order when merging schemas
                 Key: SPARK-11500
                 URL: https://issues.apache.org/jira/browse/SPARK-11500
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Hyukjin Kwon


When executing 

{{sqlContext.read.option("mergeSchema", "true").parquet(pathOne, pathTwo).printSchema()}}

the resulting column order is not deterministic; the columns can show up in a different order across runs.

This is caused by {{FileStatusCache}} in {{HadoopFsRelation}} (which {{ParquetRelation}} extends). {{FileStatusCache.listLeafFiles()}} returns a {{Set[FileStatus]}}, which discards the original ordering of the {{Array[FileStatus]}}.

So, after retrieving the list of leaf files (including {{_metadata}} and {{_common_metadata}}), {{ParquetRelation.mergeSchemasInParallel()}} merges the {{Set}}s of {{_metadata}}, {{_common_metadata}}, and part-files (separately, and only if necessary). Because the sets have no stable iteration order, whichever file happens to come first contributes its columns at the front of the merged schema, including columns the other files do not have, so the final column order differs between runs.

I think this can be resolved by using {{LinkedHashSet}}, which deduplicates while preserving insertion order.
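A minimal JVM sketch of the underlying behaviour (in Java; the file names below are made up for illustration and are not taken from the issue): a plain hash-based set iterates in hash order, whereas {{LinkedHashSet}} iterates in insertion order, which is why swapping it in makes the merged column order deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SetOrderDemo {
    public static void main(String[] args) {
        // Hypothetical leaf-file listing, in the order the filesystem returned it.
        List<String> listed = Arrays.asList(
                "part-r-00002.gz.parquet", "_metadata",
                "_common_metadata", "part-r-00001.gz.parquet");

        // HashSet iterates in hash order, not insertion order, so code that
        // converts it back to an Array sees an arbitrary-looking sequence.
        Set<String> hashed = new HashSet<>(listed);

        // LinkedHashSet deduplicates the same way but remembers insertion order.
        Set<String> linked = new LinkedHashSet<>(listed);

        System.out.println("HashSet:       " + new ArrayList<>(hashed));
        System.out.println("LinkedHashSet: " + new ArrayList<>(linked));
    }
}
```

Iterating the {{LinkedHashSet}} always reproduces the listing order, so the first-seen file (and hence the leading columns of the merged schema) is the same on every run.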



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
