[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-11 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-155707374
  
Also backported to branch-1.6.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/9517





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-11 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-155706699
  
LGTM, merged to master. Thanks!





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-155309503
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/45489/





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-155309501
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-155309294
  
**[Test build #45489 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45489/consoleFull)**
 for PR 9517 at commit 
[`32dfb87`](https://github.com/apache/spark/commit/32dfb87ce36a093c54d4a3dfd39ccbc00c417af9).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-155281726
  
**[Test build #45489 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45489/consoleFull)**
 for PR 9517 at commit 
[`32dfb87`](https://github.com/apache/spark/commit/32dfb87ce36a093c54d4a3dfd39ccbc00c417af9).





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-155281161
  
Merged build started.





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-155281155
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-155280757
  
I used `sortBy` instead of `sortWith`
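
For readers less familiar with the two methods, here is a minimal sketch of the difference. The `FileEntry` class and the paths are hypothetical stand-ins for Hadoop's `FileStatus`, and sorting by the path string is an assumption about the sort key, not a quote from the patch.

```scala
// Hypothetical stand-in for Hadoop's FileStatus; only the path matters here.
final case class FileEntry(path: String, length: Long)

object SortByVsSortWith {
  val entries: Seq[FileEntry] = Seq(
    FileEntry("base/p=2/part-00000.parquet", 10L),
    FileEntry("base/p=1/part-00000.parquet", 20L),
    FileEntry("base/p=3/part-00000.parquet", 30L)
  )

  // sortWith takes an explicit "less than" predicate over two elements.
  val viaSortWith: Seq[FileEntry] = entries.sortWith((a, b) => a.path < b.path)

  // sortBy extracts a key and relies on the implicit Ordering[String].
  val viaSortBy: Seq[FileEntry] = entries.sortBy(_.path)

  def main(args: Array[String]): Unit =
    viaSortBy.foreach(e => println(e.path))
}
```

Both calls give the same result here; `sortBy` is simply shorter when the comparison is a plain key extraction.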





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-155053133
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-155053016
  
**[Test build #45359 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45359/consoleFull)**
 for PR 9517 at commit 
[`4f47063`](https://github.com/apache/spark/commit/4f4706352c84469503ae3c3388098458b570f62f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-155021593
  
**[Test build #45359 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45359/consoleFull)**
 for PR 9517 at commit 
[`4f47063`](https://github.com/apache/spark/commit/4f4706352c84469503ae3c3388098458b570f62f).





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-155020044
  
In this commit, I added partitioned tables for the test and sorted the 
`FileStatus`es.

There are a couple of things to mention here.

Firstly, we no longer need to change `Set` to `LinkedHashSet` and `Map` to 
`LinkedHashMap` for this issue, since the `FileStatus`es are now sorted 
explicitly. However, I left those changes in because I thought it would be 
better to keep the files in the order in which they are retrieved anyway. If 
that looks odd, I can revert them.

Secondly, in any case, the columns of the lexicographically first file 
show first, which might matter for files whose names start with or contain 
numeric values. However, I left this as it is because I thought it is at 
least deterministic.
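
To illustrate both points, here is a small sketch using plain strings in place of `FileStatus` paths. The file names are made up, and the numeric-aware sort is shown only as a contrast; it is not something this patch does.

```scala
import scala.collection.mutable

object FileOrdering {
  // Hypothetical part-file names, in the order they happened to be retrieved.
  val retrieved: Seq[String] = Seq("part-2", "part-10", "part-1")

  // Lexicographic order compares character by character,
  // so "part-10" sorts before "part-2".
  val lexicographic: Seq[String] = retrieved.sorted

  // A numeric-aware alternative (an assumption for contrast only):
  // sort by the trailing number instead of the raw string.
  val numeric: Seq[String] = retrieved.sortBy(_.stripPrefix("part-").toInt)

  // LinkedHashSet iterates in insertion order, i.e. the order of retrieval.
  val insertionOrdered: Seq[String] =
    (mutable.LinkedHashSet.empty[String] ++= retrieved).toSeq
}
```

The lexicographic result is deterministic but may not match what a user expects from numbered files, which is the caveat mentioned above.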







[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-155019607
  
Merged build started.





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-155019577
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/9517#discussion_r44247117
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala ---
@@ -461,13 +461,29 @@ private[sql] class ParquetRelation(
           // You should enable this configuration only if you are very sure that for the parquet
           // part-files to read there are corresponding summary files containing correct schema.
 
+          // As filed in SPARK-11500, the order of files to touch is a matter, which might affect
+          // the ordering of the output columns. There are several things to mention here.
+          //
+          //  1. If mergeRespectSummaries config is false, then it merges schemas by reducing from
+          //     the first part-file so that the columns of the first file show first.
+          //
+          //  2. If mergeRespectSummaries config is true, then there should be, at least,
+          //     "_metadata"s for all given files. So, we can ensure the columns of the first file
+          //     show first.
+          //
+          //  3. If shouldMergeSchemas is false, but when multiple files are given, there is
+          //     no guarantee of the output order, since there might not be a summary file for the
+          //     first file, which ends up putting ahead the columns of the other files. However,
+          //     this should be okay since not enabling shouldMergeSchemas means (assumes) all the
+          //     files have the same schemas.
+
           val needMerged: Seq[FileStatus] =
             if (mergeRespectSummaries) {
               Seq()
             } else {
               dataStatuses
             }
-          (metadataStatuses ++ commonMetadataStatuses ++ needMerged).toSeq
+          needMerged ++ metadataStatuses ++ commonMetadataStatuses
--- End diff --

Yes, I think I should sort them.
It looks like relying on the returned order as it is is not recommended, even 
though the results appear to be sorted, judging from [this 
link](http://lucene.472066.n3.nabble.com/FileSystem-contract-of-listStatus-td3475540.html).
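
As a rough local analogue of that idea, the sketch below sorts a directory listing explicitly rather than trusting the order in which it is returned. It uses `java.io.File.listFiles`, which likewise documents no ordering guarantee; with Hadoop's `FileSystem` the corresponding step would be sorting the array returned by `listStatus`. The object and method names are hypothetical.

```scala
import java.io.File
import java.nio.file.Files

object SortedListing {
  // Returns the entries of `dir` sorted by name. listFiles may return the entries
  // in any order (or null if `dir` is not a directory), so the explicit sort is
  // what makes the result deterministic.
  def sortedEntries(dir: File): Seq[File] =
    Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty).sortBy(_.getName)

  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDirectory("listing-demo").toFile
    Seq("c.parquet", "a.parquet", "b.parquet")
      .foreach(name => new File(dir, name).createNewFile())
    sortedEntries(dir).foreach(f => println(f.getName))
  }
}
```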
 





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/9517#discussion_r44247087
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/ParquetHadoopFsRelationSuite.scala ---
@@ -155,4 +155,22 @@ class ParquetHadoopFsRelationSuite extends HadoopFsRelationTest {
       assert(physicalPlan.collect { case p: execution.Filter => p }.length === 1)
     }
   }
+
+  test("SPARK-11500: Not deterministic order of columns when using merging schemas.") {
+    import testImplicits._
+    withSQLConf(SQLConf.PARQUET_SCHEMA_MERGING_ENABLED.key -> "true") {
+      withTempPath { dir =>
+        val pathOne = s"${dir.getCanonicalPath}/table1"
+        Seq(1, 1).zipWithIndex.toDF("a", "b").write.parquet(pathOne)
+        val pathTwo = s"${dir.getCanonicalPath}/table2"
+        Seq(1, 1).zipWithIndex.toDF("c", "b").write.parquet(pathTwo)
+        val pathThree = s"${dir.getCanonicalPath}/table3"
+        Seq(1, 1).zipWithIndex.toDF("d", "b").write.parquet(pathThree)
--- End diff --

Thanks for the comments!





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-08 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/9517#discussion_r44245271
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala ---
@@ -461,13 +461,29 @@ private[sql] class ParquetRelation(
           // You should enable this configuration only if you are very sure that for the parquet
           // part-files to read there are corresponding summary files containing correct schema.
 
+          // As filed in SPARK-11500, the order of files to touch is a matter, which might affect
+          // the ordering of the output columns. There are several things to mention here.
+          //
+          //  1. If mergeRespectSummaries config is false, then it merges schemas by reducing from
+          //     the first part-file so that the columns of the first file show first.
+          //
+          //  2. If mergeRespectSummaries config is true, then there should be, at least,
+          //     "_metadata"s for all given files. So, we can ensure the columns of the first file
+          //     show first.
+          //
+          //  3. If shouldMergeSchemas is false, but when multiple files are given, there is
+          //     no guarantee of the output order, since there might not be a summary file for the
+          //     first file, which ends up putting ahead the columns of the other files. However,
+          //     this should be okay since not enabling shouldMergeSchemas means (assumes) all the
+          //     files have the same schemas.
+
           val needMerged: Seq[FileStatus] =
             if (mergeRespectSummaries) {
               Seq()
             } else {
               dataStatuses
             }
-          (metadataStatuses ++ commonMetadataStatuses ++ needMerged).toSeq
+          needMerged ++ metadataStatuses ++ commonMetadataStatuses
--- End diff --

Does HDFS guarantee that the result of `listStatus()` is always sorted? If 
not, we probably need to sort these `FileStatus`es.
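
To make concrete why the listing order matters, here is a simplified model of a left-to-right schema merge. It is only a sketch: a "schema" is reduced to an ordered list of column names, and `merge`, `mergeAll`, and `mergeAllSorted` are hypothetical helpers, not Spark APIs.

```scala
object SchemaMergeOrder {
  // Hypothetical model: a "schema" is just an ordered list of column names.
  type Schema = Seq[String]

  // Appends the columns of `right` that `left` lacks, keeping first-seen order.
  def merge(left: Schema, right: Schema): Schema =
    left ++ right.filterNot(left.contains)

  // Folds the schemas left to right, so the first file's columns lead.
  def mergeAll(files: Seq[(String, Schema)]): Schema =
    files.map(_._2).foldLeft(Seq.empty[String])(merge)

  // Sorting by path first makes the result independent of the listing order.
  def mergeAllSorted(files: Seq[(String, Schema)]): Schema =
    mergeAll(files.sortBy(_._1))
}
```

With files `base/p=1` (columns `a, b`), `base/p=2` (`c, b`), and `base/p=3` (`d, b`), `mergeAll` yields `a, b, c, d` when they arrive in that order but `d, b, a, c` when `base/p=3` arrives first, while `mergeAllSorted` yields `a, b, c, d` either way.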





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-08 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/9517#discussion_r44244350
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/ParquetHadoopFsRelationSuite.scala ---
@@ -155,4 +155,22 @@ class ParquetHadoopFsRelationSuite extends HadoopFsRelationTest {
       assert(physicalPlan.collect { case p: execution.Filter => p }.length === 1)
     }
   }
+
+  test("SPARK-11500: Not deterministic order of columns when using merging schemas.") {
+    import testImplicits._
+    withSQLConf(SQLConf.PARQUET_SCHEMA_MERGING_ENABLED.key -> "true") {
+      withTempPath { dir =>
+        val pathOne = s"${dir.getCanonicalPath}/table1"
+        Seq(1, 1).zipWithIndex.toDF("a", "b").write.parquet(pathOne)
+        val pathTwo = s"${dir.getCanonicalPath}/table2"
+        Seq(1, 1).zipWithIndex.toDF("c", "b").write.parquet(pathTwo)
+        val pathThree = s"${dir.getCanonicalPath}/table3"
+        Seq(1, 1).zipWithIndex.toDF("d", "b").write.parquet(pathThree)
--- End diff --

We should probably use a partitioned table here. Directories like 
`base/table1`, `base/table2`, and `base/table3` are not valid partition 
directory names, and loading `base` as a Parquet file should throw an 
exception. It's not expected that this test case can pass.
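
For context, partition directories are expected to be named `column=value`. The sketch below is a deliberately simplified check of that naming convention; `parseSegment` is a hypothetical helper, and the real partition discovery also handles escaping and value type inference, which are omitted here.

```scala
object PartitionDirs {
  // Parses one directory name of the form "column=value".
  // Anything else (e.g. "table1") is not treated as a partition directory.
  def parseSegment(name: String): Option[(String, String)] =
    name.split("=", 2) match {
      case Array(col, value) if col.nonEmpty && value.nonEmpty => Some(col -> value)
      case _ => None
    }
}
```

Under this check, `table1` is rejected while something like `p=1` is accepted, which is why a layout such as `base/p=1`, `base/p=2`, `base/p=3` fits the test better.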





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-154906491
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-154906427
  
**[Test build #45324 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45324/consoleFull)**
 for PR 9517 at commit 
[`bcf72d3`](https://github.com/apache/spark/commit/bcf72d3ca308f9a69993803d9c8939696c915b07).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-154891629
  
**[Test build #45324 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45324/consoleFull)**
 for PR 9517 at commit 
[`bcf72d3`](https://github.com/apache/spark/commit/bcf72d3ca308f9a69993803d9c8939696c915b07).





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-154891106
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-154891112
  
Merged build started.





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-08 Thread HyukjinKwon
Github user HyukjinKwon commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-154890873
  
retest this please





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-154531328
  
Build finished. No test results found.






[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-154531343
  

Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45235/






[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-154503748
  
Build finished. No test results found.






[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-154503753
  

Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45233/






[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-154502425
  
**[Test build #45235 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45235/consoleFull)**
 for PR 9517 at commit 
[`bcf72d3`](https://github.com/apache/spark/commit/bcf72d3ca308f9a69993803d9c8939696c915b07).





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-154502199
  
Build started sha1 is merged.






[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-154502163
  
Build triggered. sha1 is merged.





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-06 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-154501691
  
add to whitelist 





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-154499906
  
Build triggered. sha1 is merged.





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-154499940
  
Build started sha1 is merged.






[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-06 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-154498818
  
ok to test





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-05 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/9517

[SPARK-11500][SQL] Not deterministic order of columns when using merging 
schemas.

https://issues.apache.org/jira/browse/SPARK-11500

As filed in SPARK-11500, when merging schemas is enabled, the order in which 
files are touched can affect the ordering of the output columns. 

This was mostly caused by the use of `Set` and `Map`, so I replaced them with 
`LinkedHashSet` and `LinkedHashMap` to keep the insertion order.
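
To illustrate why the collection type matters here, the following is a minimal sketch and not the code from the patch. The object name, the helper `orderedDistinct`, and the column names are made up for the example. It only shows that `mutable.LinkedHashSet` iterates in first-insertion order, while a plain hash-based `Set` makes no such promise.

```scala
import scala.collection.mutable

object InsertionOrderSketch {
  // Made-up column names, in the order they might be seen while reading footers.
  val encountered: Seq[String] = Seq("a", "b", "c", "d", "e", "f", "b", "a")

  // A LinkedHashSet drops duplicates but iterates in first-insertion order.
  def orderedDistinct(names: Seq[String]): Seq[String] = {
    val seen = mutable.LinkedHashSet.empty[String]
    names.foreach(seen += _)
    seen.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Prints the names in the order they were first encountered.
    println(orderedDistinct(encountered).mkString(", "))
    // An immutable Set with more than four elements is hash-based, so its
    // iteration order is not tied to insertion order.
    println(encountered.toSet.mkString(", "))
  }
}
```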

Also, the schemas are now reduced from the left, and I changed the order of 
`filesToTouch` from `metadataStatuses ++ commonMetadataStatuses ++ needMerged` 
to `needMerged ++ metadataStatuses ++ commonMetadataStatuses` so that the 
part-files are touched first. The part-files always have the schema in their 
footers, whereas the other files might not exist.
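
As a rough sketch of the idea, and not the actual Spark implementation, the snippet below merges toy schemas strictly from the left with a `LinkedHashMap`, so the columns of the first file come first. The `Schema` alias, the `merge` and `mergeAll` helpers, and the `partFileSchemas`, `metadataSchemas` and `commonMetadataSchemas` values are hypothetical stand-ins. In the real code `filesToTouch` holds file statuses rather than schemas.

```scala
import scala.collection.mutable

object MergeOrderSketch {
  // Hypothetical stand-in for a file schema: ordered (column name, type name) pairs.
  type Schema = Seq[(String, String)]

  // Merge `right` into `left`: existing columns keep their position and
  // unseen columns are appended in the order they appear.
  def merge(left: Schema, right: Schema): Schema = {
    val merged = mutable.LinkedHashMap.empty[String, String]
    (left ++ right).foreach { case (name, tpe) =>
      if (!merged.contains(name)) merged(name) = tpe
    }
    merged.toSeq
  }

  // Reduce strictly from the left so that the first schema's columns lead.
  def mergeAll(schemas: Seq[Schema]): Option[Schema] =
    schemas.reduceLeftOption((l, r) => merge(l, r))

  def main(args: Array[String]): Unit = {
    // Toy data: part-file schemas go first, summary-file schemas after them.
    val partFileSchemas: Seq[Schema] =
      Seq(Seq("a" -> "int", "b" -> "string"), Seq("b" -> "string", "c" -> "long"))
    val metadataSchemas: Seq[Schema] = Seq(Seq("c" -> "long", "a" -> "int"))
    val commonMetadataSchemas: Seq[Schema] = Seq.empty

    val toTouch = partFileSchemas ++ metadataSchemas ++ commonMetadataSchemas
    println(mergeAll(toTouch).map(_.map(_._1).mkString(", ")))
  }
}
```

With this ordering the result depends only on the order of the input sequence, which is the property the change is after.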

One nit: if merging schemas is enabled but multiple files are given, there 
is no guarantee of the output order, since there might not be a summary file 
for the first file, which ends up putting the columns of the other files 
ahead. 

However, I thought this should be okay since disabling merging schemas 
means (assumes) all the files have the same schemas.

In addition, in the test code for this, I only checked the names of fields.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-11500

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9517.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9517


commit b0e6ce2729f584a9f95996707f60eb650c2a58b9
Author: hyukjinkwon 
Date:   2015-11-06T07:38:26Z

[SPARK-11500][SQL] Not deterministic order of columns when using merging 
schemas.

commit 08fc91ca8d21902677e78f0adb3b36769f2cba51
Author: hyukjinkwon 
Date:   2015-11-06T07:38:55Z

[SPARK-11500][SQL] Add a test to check the deterministic order.

commit bcf72d3ca308f9a69993803d9c8939696c915b07
Author: hyukjinkwon 
Date:   2015-11-06T07:40:17Z

[SPARK-11500][SQL] Remove trailing newline.







[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-154338582
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

2015-11-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the pull request:

https://github.com/apache/spark/pull/9517#issuecomment-154338571
  
cc @liancheng 

