Hyukjin Kwon created SPARK-11500:
------------------------------------

             Summary: Non-deterministic order of columns when merging schemas
                 Key: SPARK-11500
                 URL: https://issues.apache.org/jira/browse/SPARK-11500
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Hyukjin Kwon
When executing {{sqlContext.read.option("mergeSchema", "true").parquet(pathOne, pathTwo).printSchema()}}, the column order is not deterministic: the columns sometimes show up in a different order.

This is caused by {{FileStatusCache}} in {{HadoopFsRelation}} (which {{ParquetRelation}} extends). {{FileStatusCache.listLeafFiles()}} returns a {{Set[FileStatus]}}, which loses the order of the underlying {{Array[FileStatus]}}. After the list of leaf files, including {{_metadata}} and {{_common_metadata}}, is retrieved, {{ParquetRelation.mergeSchemasInParallel()}} merges (separately, and only if necessary) the {{Set}}s of {{_metadata}}, {{_common_metadata}} and part-files. Since the merge order depends on which file happens to come first, the merged schema leads with the columns of that first file which the other files do not have, and that first file varies between runs.

I think this can be resolved by using {{LinkedHashSet}}.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
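A minimal sketch of the underlying problem and the proposed fix, using plain strings as hypothetical stand-ins for the {{FileStatus}} entries (the real code works with {{Array[FileStatus]}}):

```scala
import scala.collection.mutable

// Hypothetical file names standing in for the FileStatus entries that
// FileStatusCache.listLeafFiles() collects.
val files = Seq("part-00003", "part-00001", "_metadata", "part-00002", "_common_metadata")

// A plain HashSet makes no iteration-order guarantee, so converting it back
// to an array can reorder the files from run to run.
val unordered = mutable.HashSet(files: _*)

// A LinkedHashSet de-duplicates while preserving insertion order, so the
// leaf-file listing (and hence the merged schema's column order) stays stable.
val ordered = mutable.LinkedHashSet(files: _*)

assert(ordered.toSeq == files) // insertion order preserved
```

Both collections hold the same elements; only {{LinkedHashSet}} guarantees that iterating yields them in the order they were inserted, which would make the schema-merge order deterministic.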