[GitHub] spark issue #19270: [SPARK-21809] : Change Stage Page to use datatables to s...

2017-10-06 Thread pgandhi999
Github user pgandhi999 commented on the issue:

https://github.com/apache/spark/pull/19270
  
OK, yes, I just noticed that comment. Will get back to you on this one!


---




[GitHub] spark pull request #19429: [SPARK-20055] [Docs] Added documentation for load...

2017-10-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19429#discussion_r143287807
  
--- Diff: examples/src/main/python/sql/datasource.py ---
@@ -53,6 +53,11 @@ def basic_datasource_example(spark):
 df.select("name", "age").write.save("namesAndAges.parquet", 
format="parquet")
 # $example off:manual_load_options$
 
+# $example on:manual_load_options_csv$
+df = spark.read.load("examples/src/main/resources/people.csv",
+ format="csv", sep=":", inferSchema="true", 
header="true")
+# $example off:manual_load_options_csv
--- End diff --

This needs to be corrected to:
> # $example off:manual_load_options_csv$
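
That is, with the trailing `$` added, the block would read (values taken from the diff above):

```python
# $example on:manual_load_options_csv$
df = spark.read.load("examples/src/main/resources/people.csv",
                     format="csv", sep=":", inferSchema="true", header="true")
# $example off:manual_load_options_csv$
```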


---




[GitHub] spark pull request #19429: [SPARK-20055] [Docs] Added documentation for load...

2017-10-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19429#discussion_r143287505
  
--- Diff: examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java ---
@@ -115,7 +115,20 @@ private static void runBasicDataSourceExample(SparkSession spark) {
 Dataset<Row> peopleDF =
   spark.read().format("json").load("examples/src/main/resources/people.json");
 peopleDF.select("name", "age").write().format("parquet").save("namesAndAges.parquet");
-// $example off:manual_load_options$
+// $example on:manual_load_options_csv$
--- End diff --

You still need to keep 
> // $example off:manual_load_options$
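
That is, the two tags should sit back to back (surrounding lines abbreviated from the diff above):

```java
peopleDF.select("name", "age").write().format("parquet").save("namesAndAges.parquet");
// $example off:manual_load_options$
// $example on:manual_load_options_csv$
```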


---




[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...

2017-10-06 Thread pgandhi999
Github user pgandhi999 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19270#discussion_r143287411
  
--- Diff: core/src/main/resources/org/apache/spark/ui/static/utils.js ---
@@ -46,3 +46,64 @@ function formatBytes(bytes, type) {
 var i = Math.floor(Math.log(bytes) / Math.log(k));
 return parseFloat((bytes / Math.pow(k, i)).toFixed(dm)) + ' ' + sizes[i];
 }
+
+function formatLogsCells(execLogs, type) {
+if (type !== 'display') return Object.keys(execLogs);
+if (!execLogs) return;
+var result = '';
+$.each(execLogs, function (logName, logUrl) {
+result += '<div><a href=' + logUrl + '>' + logName + '</a></div>'
+});
+return result;
+}
+
+function getStandAloneAppId(cb) {
--- End diff --

So, to summarize what you said: the function should simply return the appId, which can then be passed as a param to those individual functions. I think that makes sense.
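
A minimal sketch of that shape, assuming the proxy/history URL layouts the current callback version handles (the standalone case still needs the async REST lookup, so only the synchronous paths are shown):

```javascript
function getStandAloneAppId() {
    var words = document.baseURI.split('/');
    var ind = words.indexOf("proxy");
    if (ind > 0) {
        // .../proxy/<appId>/... : the appId is the segment after "proxy"
        return words[ind + 1];
    }
    ind = words.indexOf("history");
    if (ind > 0) {
        // .../history/<appId>/... : same idea for the history server
        return words[ind + 1];
    }
    // Standalone mode: the REST endpoint lookup is asynchronous, so a plain
    // return value is not available without restructuring the callers.
    return null;
}
```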


---




[GitHub] spark issue #19270: [SPARK-21809] : Change Stage Page to use datatables to s...

2017-10-06 Thread ajbozarth
Github user ajbozarth commented on the issue:

https://github.com/apache/spark/pull/19270
  
Thanks, I must've missed that in the description. You've taken care of all but my last comment: I'm not seeing the accumulators table. Have you checked that it shows up when there are accumulators?


---




[GitHub] spark issue #19394: [SPARK-22170][SQL] Reduce memory consumption in broadcas...

2017-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19394
  
**[Test build #82525 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82525/testReport)** for PR 19394 at commit [`5b92fe2`](https://github.com/apache/spark/commit/5b92fe252530f4c8dab4d41772011e86ac308f80).


---




[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...

2017-10-06 Thread ajbozarth
Github user ajbozarth commented on a diff in the pull request:

https://github.com/apache/spark/pull/19270#discussion_r143286538
  
--- Diff: core/src/main/resources/org/apache/spark/ui/static/utils.js ---
@@ -46,3 +46,64 @@ function formatBytes(bytes, type) {
 var i = Math.floor(Math.log(bytes) / Math.log(k));
 return parseFloat((bytes / Math.pow(k, i)).toFixed(dm)) + ' ' + sizes[i];
 }
+
+function formatLogsCells(execLogs, type) {
+if (type !== 'display') return Object.keys(execLogs);
+if (!execLogs) return;
+var result = '';
+$.each(execLogs, function (logName, logUrl) {
+result += '<div><a href=' + logUrl + '>' + logName + '</a></div>'
+});
+return result;
+}
+
+function getStandAloneAppId(cb) {
--- End diff --

I was thinking it should return the appId rather than take in a function as 
a param


---




[GitHub] spark pull request #19394: [SPARK-22170][SQL] Reduce memory consumption in b...

2017-10-06 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/19394#discussion_r143286130
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala ---
@@ -73,25 +73,37 @@ case class BroadcastExchangeExec(
 try {
   val beforeCollect = System.nanoTime()
   // Note that we use .executeCollect() because we don't want to convert data to Scala types
-  val input: Array[InternalRow] = child.executeCollect()
-  if (input.length >= 512000000) {
+  val (numRows, input) = child.executeCollectIterator()
+  if (numRows >= 512000000) {
 throw new SparkException(
-  s"Cannot broadcast the table with more than 512 millions rows: ${input.length} rows")
+  s"Cannot broadcast the table with more than 512 millions rows: $numRows rows")
   }
+
   val beforeBuild = System.nanoTime()
   longMetric("collectTime") += (beforeBuild - beforeCollect) / 1000000
-  val dataSize = input.map(_.asInstanceOf[UnsafeRow].getSizeInBytes.toLong).sum
+
+  // Construct the relation.
+  val relation = mode.transform(input, Some(numRows))
+
+  val dataSize = relation match {
+case map: HashedRelation =>
+  map.estimatedSize
+case arr: Array[InternalRow] =>
+  arr.map(_.asInstanceOf[UnsafeRow].getSizeInBytes.toLong).sum
+case _ =>
+  numRows * 512 // guess: each row is about 512 bytes
--- End diff --

Fixed.


---




[GitHub] spark pull request #19394: [SPARK-22170][SQL] Reduce memory consumption in b...

2017-10-06 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/19394#discussion_r143286104
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala ---
@@ -73,25 +73,37 @@ case class BroadcastExchangeExec(
 try {
   val beforeCollect = System.nanoTime()
   // Note that we use .executeCollect() because we don't want to convert data to Scala types
--- End diff --

Updated.


---




[GitHub] spark pull request #19394: [SPARK-22170][SQL] Reduce memory consumption in b...

2017-10-06 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/19394#discussion_r143285737
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/ConfigBehaviorSuite.scala ---
@@ -58,7 +58,7 @@ class ConfigBehaviorSuite extends QueryTest with SharedSQLContext {
   withSQLConf(SQLConf.RANGE_EXCHANGE_SAMPLE_SIZE_PER_PARTITION.key -> "1") {
 // If we only sample one point, the range boundaries will be pretty bad and the
 // chi-sq value would be very high.
-assert(computeChiSquareTest() > 1000)
+assert(computeChiSquareTest() > 300)
--- End diff --

@rxin, the difference that is causing this to fail is that `rdd.id` is 15 
instead of 14 because of the change to `getByteArrayRdd`. That id is used to 
seed the random generator and the chi-sq value isn't high enough. The previous 
value must have been unusually high. With so few data points, this can vary 
quite a bit so I think changing the bound to 300 is a good fix.


---




[GitHub] spark issue #19394: [SPARK-22170][SQL] Reduce memory consumption in broadcas...

2017-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19394
  
**[Test build #82524 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82524/testReport)** for PR 19394 at commit [`f6613c2`](https://github.com/apache/spark/commit/f6613c2651691ca149390ea2407856ba390590cb).


---




[GitHub] spark pull request #18460: [SPARK-21247][SQL] Type comparison should respect...

2017-10-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18460#discussion_r143285500
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -100,6 +101,17 @@ object TypeCoercion {
 case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) =>
   Some(TimestampType)
 
+case (t1 @ StructType(fields1), t2 @ StructType(fields2)) if t1.sameType(t2) =>
+  Some(StructType(fields1.zip(fields2).map { case (f1, f2) =>
+    // Since `t1.sameType(t2)` is true, two StructTypes have the same DataType
+    // except `name` (in case of `spark.sql.caseSensitive=false`) and `nullable`.
+    // - Different names: use a lower case name because findTightestCommonType is commutative.
+    // - Different nullabilities: `nullable` is true iff one of them is nullable.
+    val name = if (f1.name == f2.name) f1.name else f1.name.toLowerCase(Locale.ROOT)
--- End diff --

Do both `tab.Col1.a` and `tab.Col1.A` work well when case sensitivity is off?


---




[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule

2017-10-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19451#discussion_r143284824
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1242,6 +1243,53 @@ object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] {
 }
 
 /**
+ * If one or both of the datasets in the logical [[Except]] operator are purely transformed using
+ * [[Filter]], this rule will replace logical [[Except]] operator with a [[Filter]] operator by
+ * flipping the filter condition of the right child.
+ * {{{
+ *   SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5
+ *   ==>  SELECT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5
--- End diff --

This is wrong. See the correct description of `EXCEPT`:

> EXCEPT returns distinct rows from the left input query that aren’t output by the right input query.
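
A sketch of the behaviour the rewrite has to reproduce (illustration only, not the rule's final form):

```sql
SELECT a1, a2 FROM Tab1 WHERE a2 = 12
EXCEPT
SELECT a1, a2 FROM Tab1 WHERE a1 = 5;

-- behaves like the deduplicated, null-safe negation:
SELECT DISTINCT a1, a2 FROM Tab1
WHERE a2 = 12 AND (a1 <> 5 OR a1 IS NULL);
```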



---




[GitHub] spark pull request #19336: [SPARK-21947][SS] Check and report error when mon...

2017-10-06 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19336


---




[GitHub] spark issue #19336: [SPARK-21947][SS] Check and report error when monotonica...

2017-10-06 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/19336
  
LGTM. Merging to master.


---




[GitHub] spark issue #18460: [SPARK-21247][SQL] Type comparison should respect case-s...

2017-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18460
  
**[Test build #82523 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82523/testReport)** for PR 18460 at commit [`c72aa18`](https://github.com/apache/spark/commit/c72aa18ccbf23053a06faa35deda77ff1135b818).


---




[GitHub] spark issue #18460: [SPARK-21247][SQL] Type comparison should respect case-s...

2017-10-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/18460
  
The test cases are added. Thank you, @gatorsmile!


---




[GitHub] spark pull request #19444: [SPARK-22214][SQL] Refactor the list hive partiti...

2017-10-06 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19444


---




[GitHub] spark issue #19444: [SPARK-22214][SQL] Refactor the list hive partitions cod...

2017-10-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19444
  
Thanks! Merged to master.


---




[GitHub] spark issue #19270: [SPARK-21809] : Change Stage Page to use datatables to s...

2017-10-06 Thread pgandhi999
Github user pgandhi999 commented on the issue:

https://github.com/apache/spark/pull/19270
  
@ajbozarth Regarding your earlier comment on the missing functionality of 
show additional metrics, I have mentioned that in the description of the PR. I 
restate it below:

Because of the above change, certain functionalities in the page had to be 
modified to support the addition of datatables. For example, the toggle 
checkbox 'Select All' previously would add the checked fields as columns in the 
Task table and as rows in the Summary Metrics table, but after the change, only 
columns are added in the Task Table as it got tricky to add rows dynamically in 
the datatables.

I have fixed the second bit of missing functionality, about creating links on completed tasks.


---




[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...

2017-10-06 Thread pgandhi999
Github user pgandhi999 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19270#discussion_r143280105
  
--- Diff: core/src/main/resources/org/apache/spark/ui/static/utils.js ---
@@ -46,3 +46,64 @@ function formatBytes(bytes, type) {
 var i = Math.floor(Math.log(bytes) / Math.log(k));
 return parseFloat((bytes / Math.pow(k, i)).toFixed(dm)) + ' ' + sizes[i];
 }
+
+function formatLogsCells(execLogs, type) {
+if (type !== 'display') return Object.keys(execLogs);
+if (!execLogs) return;
+var result = '';
+$.each(execLogs, function (logName, logUrl) {
+result += '<div><a href=' + logUrl + '>' + logName + '</a></div>'
+});
+return result;
+}
+
+function getStandAloneAppId(cb) {
--- End diff --

I did not get what you mean. Do you want me to return the appId, or do you want me to rename the function to a more suitable name?


---




[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...

2017-10-06 Thread pgandhi999
Github user pgandhi999 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19270#discussion_r143279779
  
--- Diff: core/src/main/resources/org/apache/spark/ui/static/taskspages.js ---
@@ -0,0 +1,474 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+$(document).ajaxStop($.unblockUI);
+$(document).ajaxStart(function () {
+$.blockUI({message: 'Loading Tasks Page...'});
+});
+
+$.extend( $.fn.dataTable.ext.type.order, {
+"file-size-pre": ConvertDurationString,
+
+"file-size-asc": function ( a, b ) {
+a = ConvertDurationString( a );
+b = ConvertDurationString( b );
+return ((a < b) ? -1 : ((a > b) ? 1 : 0));
+},
+
+"file-size-desc": function ( a, b ) {
+a = ConvertDurationString( a );
+b = ConvertDurationString( b );
+return ((a < b) ? 1 : ((a > b) ? -1 : 0));
+}
+} );
+
+function createTemplateURI(appId) {
+var words = document.baseURI.split('/');
+var ind = words.indexOf("proxy");
+if (ind > 0) {
+var baseURI = words.slice(0, ind + 1).join('/') + '/' + appId + 
'/static/stagespage-template.html';
+return baseURI;
+}
+ind = words.indexOf("history");
+if(ind > 0) {
+var baseURI = words.slice(0, ind).join('/') + 
'/static/stagespage-template.html';
+return baseURI;
+}
+return location.origin + "/static/stagespage-template.html";
+}
+
+// This function will only parse the URL under certain formate
+// e.g. https://axonitered-jt1.red.ygrid.yahoo.com:50509/history/application_1502220952225_59143/stages/stage/?id=0&attempt=0
+function StageEndPoint(appId) {
+var words = document.baseURI.split('/');
+var words2 = document.baseURI.split('?');
+var ind = words.indexOf("proxy");
+if (ind > 0) {
+var appId = words[ind + 1];
+var stageIdLen = words2[1].indexOf('&');
+var stageId = words2[1].substr(3, stageIdLen - 3);
+var newBaseURI = words.slice(0, ind + 2).join('/');
+return newBaseURI + "/api/v1/applications/" + appId + "/stages/" + 
stageId;
+}
+ind = words.indexOf("history");
+if (ind > 0) {
+var appId = words[ind + 1];
+var attemptId = words[ind + 2];
+var stageIdLen = words2[1].indexOf('&');
+var stageId = words2[1].substr(3, stageIdLen - 3);
+var newBaseURI = words.slice(0, ind).join('/');
+if (isNaN(attemptId) || attemptId == "0") {
+return newBaseURI + "/api/v1/applications/" + appId + 
"/stages/" + stageId;
+} else {
+return newBaseURI + "/api/v1/applications/" + appId + "/" + 
attemptId + "/stages/" + stageId;
+}
+}
+var stageIdLen = words2[1].indexOf('&');
+var stageId = words2[1].substr(3, stageIdLen - 3);
+return location.origin + "/api/v1/applications/" + appId + "/stages/" 
+ stageId;
+}
+
+function sortNumber(a,b) {
+return a - b;
+}
+
+function quantile(array, percentile) {
+index = percentile/100. * (array.length-1);
+if (Math.floor(index) == index) {
+   result = array[index];
+} else {
+var i = Math.floor(index);
+fraction = index - i;
+result = array[i];
+}
+return result;
+}
+
+$(document).ready(function () {
+$.extend($.fn.dataTable.defaults, {
+stateSave: true,
+lengthMenu: [[20, 40, 60, 100, -1], [20, 40, 60, 100, "All"]],
+pageLength: 20
+});
+
+$("#showAdditionalMetrics").append(
--- End diff --

The reason I wrote that piece of code in JavaScript is that, if the data-column indices ever need to be rearranged dynamically, it would make more sense to do it in JavaScript. But if you feel otherwise, I shall move it to HTML.


---


[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...

2017-10-06 Thread pgandhi999
Github user pgandhi999 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19270#discussion_r143278211
  
--- Diff: core/src/main/resources/org/apache/spark/ui/static/taskspages.js ---
@@ -0,0 +1,474 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+$(document).ajaxStop($.unblockUI);
+$(document).ajaxStart(function () {
+$.blockUI({message: 'Loading Tasks Page...'});
+});
+
+$.extend( $.fn.dataTable.ext.type.order, {
+"file-size-pre": ConvertDurationString,
+
+"file-size-asc": function ( a, b ) {
+a = ConvertDurationString( a );
+b = ConvertDurationString( b );
+return ((a < b) ? -1 : ((a > b) ? 1 : 0));
+},
+
+"file-size-desc": function ( a, b ) {
+a = ConvertDurationString( a );
+b = ConvertDurationString( b );
+return ((a < b) ? 1 : ((a > b) ? -1 : 0));
+}
+} );
+
+function createTemplateURI(appId) {
+var words = document.baseURI.split('/');
+var ind = words.indexOf("proxy");
+if (ind > 0) {
+var baseURI = words.slice(0, ind + 1).join('/') + '/' + appId + 
'/static/stagespage-template.html';
+return baseURI;
+}
+ind = words.indexOf("history");
+if(ind > 0) {
+var baseURI = words.slice(0, ind).join('/') + 
'/static/stagespage-template.html';
+return baseURI;
+}
+return location.origin + "/static/stagespage-template.html";
+}
+
+// This function will only parse the URL under certain formate
+// e.g. https://axonitered-jt1.red.ygrid.yahoo.com:50509/history/application_1502220952225_59143/stages/stage/?id=0&attempt=0
+function StageEndPoint(appId) {
+var words = document.baseURI.split('/');
+var words2 = document.baseURI.split('?');
+var ind = words.indexOf("proxy");
+if (ind > 0) {
+var appId = words[ind + 1];
+var stageIdLen = words2[1].indexOf('&');
+var stageId = words2[1].substr(3, stageIdLen - 3);
+var newBaseURI = words.slice(0, ind + 2).join('/');
+return newBaseURI + "/api/v1/applications/" + appId + "/stages/" + 
stageId;
+}
+ind = words.indexOf("history");
+if (ind > 0) {
+var appId = words[ind + 1];
+var attemptId = words[ind + 2];
+var stageIdLen = words2[1].indexOf('&');
+var stageId = words2[1].substr(3, stageIdLen - 3);
+var newBaseURI = words.slice(0, ind).join('/');
+if (isNaN(attemptId) || attemptId == "0") {
+return newBaseURI + "/api/v1/applications/" + appId + 
"/stages/" + stageId;
+} else {
+return newBaseURI + "/api/v1/applications/" + appId + "/" + 
attemptId + "/stages/" + stageId;
+}
+}
+var stageIdLen = words2[1].indexOf('&');
+var stageId = words2[1].substr(3, stageIdLen - 3);
--- End diff --

So the JavaScript substr function takes the first param as the starting index and the second param as the number of characters. And the split happens well before that, so it is independent of any anchors. I have tested it for many cases and it seems to work well, but if you are still unsure, I can look into better ways of parsing the URL.
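
A quick illustration of that parsing, under the assumed `?id=0&attempt=0` query-string shape:

```javascript
var query = "id=0&attempt=0";                   // words2[1] in the code above
var stageIdLen = query.indexOf('&');            // 4
var stageId = query.substr(3, stageIdLen - 3);  // skips "id=", yields "0"
```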


---




[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...

2017-10-06 Thread pgandhi999
Github user pgandhi999 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19270#discussion_r143277618
  
--- Diff: core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala ---
@@ -346,7 +346,7 @@ class UISeleniumSuite extends SparkFunSuite with WebBrowser with Matchers with B
 
   for {
 stageId <- 0 to 1
-attemptId <- 0 to 1
+attemptId <- 1 to 0
--- End diff --

So this test was initially failing for the following reason: for stage id 1 and attempt id 0, the stage is designed to fail, so ideally, when the test tries to connect to the backend to get the JSON file in line 352, it would exit. But as there is no code written to handle the exception, the test would quit and fail because the last case never ran. So, by changing the order, I ensured that all the stage-success cases run, the last one fails, and the test passes. I went as far as I could to debug and this was what I could find. If you think there is something more to this, let me know and we can discuss further.


---




[GitHub] spark issue #19444: [SPARK-22214][SQL] Refactor the list hive partitions cod...

2017-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19444
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19444: [SPARK-22214][SQL] Refactor the list hive partitions cod...

2017-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19444
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82519/
Test PASSed.


---




[GitHub] spark issue #19444: [SPARK-22214][SQL] Refactor the list hive partitions cod...

2017-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19444
  
**[Test build #82519 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82519/testReport)** for PR 19444 at commit [`1e119ea`](https://github.com/apache/spark/commit/1e119ea7ae414e27ad0832cee347a20d24f1c0cb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...

2017-10-06 Thread pgandhi999
Github user pgandhi999 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19270#discussion_r143274906
  
--- Diff: core/src/main/scala/org/apache/spark/ui/exec/ExecutorsTab.scala ---
@@ -170,6 +170,17 @@ class ExecutorsListener(storageStatusListener: StorageStatusListener, conf: Spar
 execTaskSummary.isBlacklisted = isBlacklisted
   }
 
+  def getExecutorHost(eid: String): String = {
--- End diff --

As the active StorageStatusList stores the info of all the active executors at any point, I thought it could be used in that manner. I tested it and it seems to work fine. If you think there could be a potential issue with this approach, do let me know and we can discuss further.
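
A minimal sketch of the lookup being described, assuming the listener's `storageStatusList` and an illustrative fallback string (simplified, not the exact PR code):

```scala
def getExecutorHost(eid: String): String =
  storageStatusList
    .find(_.blockManagerId.executorId == eid)  // only active executors are listed
    .map(_.blockManagerId.hostPort)
    .getOrElse("Executor " + eid + " not found")
```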


---




[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule

2017-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19451
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...

2017-10-06 Thread pgandhi999
Github user pgandhi999 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19270#discussion_r143274132
  
--- Diff: core/src/main/resources/org/apache/spark/ui/static/taskspages.js ---
@@ -0,0 +1,474 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+$(document).ajaxStop($.unblockUI);
+$(document).ajaxStart(function () {
+$.blockUI({message: 'Loading Tasks Page...'});
+});
+
+$.extend( $.fn.dataTable.ext.type.order, {
+"file-size-pre": ConvertDurationString,
+
+"file-size-asc": function ( a, b ) {
+a = ConvertDurationString( a );
+b = ConvertDurationString( b );
+return ((a < b) ? -1 : ((a > b) ? 1 : 0));
+},
+
+"file-size-desc": function ( a, b ) {
+a = ConvertDurationString( a );
+b = ConvertDurationString( b );
+return ((a < b) ? 1 : ((a > b) ? -1 : 0));
+}
+} );
+
+function createTemplateURI(appId) {
+var words = document.baseURI.split('/');
+var ind = words.indexOf("proxy");
+if (ind > 0) {
+var baseURI = words.slice(0, ind + 1).join('/') + '/' + appId + 
'/static/stagespage-template.html';
+return baseURI;
+}
+ind = words.indexOf("history");
+if(ind > 0) {
+var baseURI = words.slice(0, ind).join('/') + 
'/static/stagespage-template.html';
+return baseURI;
+}
+return location.origin + "/static/stagespage-template.html";
+}
+
+// This function will only parse the URL under certain formate
+// e.g. https://axonitered-jt1.red.ygrid.yahoo.com:50509/history/application_1502220952225_59143/stages/stage/?id=0&attempt=0
+function StageEndPoint(appId) {
+var words = document.baseURI.split('/');
+var words2 = document.baseURI.split('?');
+var ind = words.indexOf("proxy");
+if (ind > 0) {
+var appId = words[ind + 1];
+var stageIdLen = words2[1].indexOf('&');
+var stageId = words2[1].substr(3, stageIdLen - 3);
+var newBaseURI = words.slice(0, ind + 2).join('/');
+return newBaseURI + "/api/v1/applications/" + appId + "/stages/" + 
stageId;
+}
+ind = words.indexOf("history");
+if (ind > 0) {
+var appId = words[ind + 1];
+var attemptId = words[ind + 2];
+var stageIdLen = words2[1].indexOf('&');
+var stageId = words2[1].substr(3, stageIdLen - 3);
+var newBaseURI = words.slice(0, ind).join('/');
+if (isNaN(attemptId) || attemptId == "0") {
+return newBaseURI + "/api/v1/applications/" + appId + 
"/stages/" + stageId;
+} else {
+return newBaseURI + "/api/v1/applications/" + appId + "/" + 
attemptId + "/stages/" + stageId;
+}
+}
+var stageIdLen = words2[1].indexOf('&');
+var stageId = words2[1].substr(3, stageIdLen - 3);
+return location.origin + "/api/v1/applications/" + appId + "/stages/" 
+ stageId;
+}
+
+function sortNumber(a,b) {
+return a - b;
+}
+
+function quantile(array, percentile) {
+index = percentile/100. * (array.length-1);
+if (Math.floor(index) == index) {
+   result = array[index];
+} else {
+var i = Math.floor(index);
+fraction = index - i;
+result = array[i];
+}
+return result;
+}
+
+$(document).ready(function () {
+$.extend($.fn.dataTable.defaults, {
+stateSave: true,
+lengthMenu: [[20, 40, 60, 100, -1], [20, 40, 60, 100, "All"]],
+pageLength: 20
+});
+
+$("#showAdditionalMetrics").append(
+"" +
+"" +
+" Show Additional Metrics" +
+"" +
+"" +
+" Select All" +
+" Scheduler Delay" +
+" Task Deserialization Time" +
+" Shuffle Read Blocked Time" +
+" Shuffle Remote Reads" +
+   

[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...

2017-10-06 Thread pgandhi999
Github user pgandhi999 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19270#discussion_r143274000
  
--- Diff: core/src/main/resources/org/apache/spark/ui/static/taskspages.js ---
@@ -0,0 +1,474 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+$(document).ajaxStop($.unblockUI);
+$(document).ajaxStart(function () {
+$.blockUI({message: 'Loading Tasks Page...'});
+});
+
+$.extend( $.fn.dataTable.ext.type.order, {
+"file-size-pre": ConvertDurationString,
+
+"file-size-asc": function ( a, b ) {
+a = ConvertDurationString( a );
+b = ConvertDurationString( b );
+return ((a < b) ? -1 : ((a > b) ? 1 : 0));
+},
+
+"file-size-desc": function ( a, b ) {
+a = ConvertDurationString( a );
+b = ConvertDurationString( b );
+return ((a < b) ? 1 : ((a > b) ? -1 : 0));
+}
+} );
+
+function createTemplateURI(appId) {
--- End diff --

Moved them all to utils.js


---




[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...

2017-10-06 Thread pgandhi999
Github user pgandhi999 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19270#discussion_r143274034
  
--- Diff: core/src/main/resources/org/apache/spark/ui/static/taskspages.js ---
@@ -0,0 +1,474 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+$(document).ajaxStop($.unblockUI);
+$(document).ajaxStart(function () {
+$.blockUI({message: 'Loading Tasks Page...'});
+});
+
+$.extend( $.fn.dataTable.ext.type.order, {
+"file-size-pre": ConvertDurationString,
+
+"file-size-asc": function ( a, b ) {
+a = ConvertDurationString( a );
+b = ConvertDurationString( b );
+return ((a < b) ? -1 : ((a > b) ? 1 : 0));
+},
+
+"file-size-desc": function ( a, b ) {
+a = ConvertDurationString( a );
+b = ConvertDurationString( b );
+return ((a < b) ? 1 : ((a > b) ? -1 : 0));
+}
+} );
+
+function createTemplateURI(appId) {
+var words = document.baseURI.split('/');
+var ind = words.indexOf("proxy");
+if (ind > 0) {
+var baseURI = words.slice(0, ind + 1).join('/') + '/' + appId + 
'/static/stagespage-template.html';
+return baseURI;
+}
+ind = words.indexOf("history");
+if(ind > 0) {
+var baseURI = words.slice(0, ind).join('/') + 
'/static/stagespage-template.html';
+return baseURI;
+}
+return location.origin + "/static/stagespage-template.html";
+}
+
+// This function will only parse the URL under certain formate
+// e.g. https://axonitered-jt1.red.ygrid.yahoo.com:50509/history/application_1502220952225_59143/stages/stage/?id=0&attempt=0
+function StageEndPoint(appId) {
--- End diff --

Fixed


---




[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...

2017-10-06 Thread pgandhi999
Github user pgandhi999 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19270#discussion_r143274083
  
--- Diff: core/src/main/resources/org/apache/spark/ui/static/taskspages.js ---
@@ -0,0 +1,474 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+$(document).ajaxStop($.unblockUI);
+$(document).ajaxStart(function () {
+$.blockUI({message: 'Loading Tasks Page...'});
+});
+
+$.extend( $.fn.dataTable.ext.type.order, {
+"file-size-pre": ConvertDurationString,
+
+"file-size-asc": function ( a, b ) {
+a = ConvertDurationString( a );
+b = ConvertDurationString( b );
+return ((a < b) ? -1 : ((a > b) ? 1 : 0));
+},
+
+"file-size-desc": function ( a, b ) {
+a = ConvertDurationString( a );
+b = ConvertDurationString( b );
+return ((a < b) ? 1 : ((a > b) ? -1 : 0));
+}
+} );
+
+function createTemplateURI(appId) {
+var words = document.baseURI.split('/');
+var ind = words.indexOf("proxy");
+if (ind > 0) {
+var baseURI = words.slice(0, ind + 1).join('/') + '/' + appId + 
'/static/stagespage-template.html';
+return baseURI;
+}
+ind = words.indexOf("history");
+if(ind > 0) {
+var baseURI = words.slice(0, ind).join('/') + 
'/static/stagespage-template.html';
+return baseURI;
+}
+return location.origin + "/static/stagespage-template.html";
+}
+
+// This function will only parse the URL under certain formate
+// e.g. https://axonitered-jt1.red.ygrid.yahoo.com:50509/history/application_1502220952225_59143/stages/stage/?id=0&attempt=0
+function StageEndPoint(appId) {
+var words = document.baseURI.split('/');
+var words2 = document.baseURI.split('?');
+var ind = words.indexOf("proxy");
+if (ind > 0) {
+var appId = words[ind + 1];
+var stageIdLen = words2[1].indexOf('&');
+var stageId = words2[1].substr(3, stageIdLen - 3);
+var newBaseURI = words.slice(0, ind + 2).join('/');
+return newBaseURI + "/api/v1/applications/" + appId + "/stages/" + 
stageId;
+}
+ind = words.indexOf("history");
+if (ind > 0) {
+var appId = words[ind + 1];
+var attemptId = words[ind + 2];
+var stageIdLen = words2[1].indexOf('&');
+var stageId = words2[1].substr(3, stageIdLen - 3);
+var newBaseURI = words.slice(0, ind).join('/');
+if (isNaN(attemptId) || attemptId == "0") {
+return newBaseURI + "/api/v1/applications/" + appId + 
"/stages/" + stageId;
+} else {
+return newBaseURI + "/api/v1/applications/" + appId + "/" + 
attemptId + "/stages/" + stageId;
+}
+}
+var stageIdLen = words2[1].indexOf('&');
+var stageId = words2[1].substr(3, stageIdLen - 3);
+return location.origin + "/api/v1/applications/" + appId + "/stages/" 
+ stageId;
+}
+
+function sortNumber(a,b) {
+return a - b;
+}
+
+function quantile(array, percentile) {
+index = percentile/100. * (array.length-1);
+if (Math.floor(index) == index) {
+   result = array[index];
+} else {
+var i = Math.floor(index);
+fraction = index - i;
+result = array[i];
+}
+return result;
+}
+
+$(document).ready(function () {
+$.extend($.fn.dataTable.defaults, {
--- End diff --

Done


---




[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...

2017-10-06 Thread pgandhi999
Github user pgandhi999 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19270#discussion_r143273927
  
--- Diff: core/src/main/resources/org/apache/spark/ui/static/taskspages.js ---
@@ -0,0 +1,474 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+$(document).ajaxStop($.unblockUI);
+$(document).ajaxStart(function () {
+$.blockUI({message: 'Loading Tasks Page...'});
--- End diff --

Fixed


---




[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...

2017-10-06 Thread pgandhi999
Github user pgandhi999 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19270#discussion_r143273867
  
--- Diff: core/src/main/scala/org/apache/spark/status/api/v1/AllStagesResource.scala ---
@@ -138,21 +155,61 @@ private[v1] object AllStagesResource {
 }
   }
 
-  def convertTaskData(uiData: TaskUIData, lastUpdateTime: Option[Long]): TaskData = {
+  private def getGettingResultTime(info: TaskInfo, currentTime: Long): Long = {
--- End diff --

Fixed it


---




[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule

2017-10-06 Thread sathiyapk
GitHub user sathiyapk opened a pull request:

https://github.com/apache/spark/pull/19451

SPARK-22181 Adds ReplaceExceptWithNotFilter rule

## What changes were proposed in this pull request?

Adds a new optimisation rule, 'ReplaceExceptWithNotFilter', that replaces a logical Except with a Filter operator, and schedules it before the 'ReplaceExceptWithAntiJoin' rule. This way we can avoid an expensive join operation if one or both of the datasets of the Except operation are fully derived out of Filters from the same parent.
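
As a rough sketch (simplified: the real rule must also preserve EXCEPT's distinct semantics and handle nulls in the flipped condition; names and pattern are illustrative, not the merged implementation), the rule's shape is roughly:

```scala
import org.apache.spark.sql.catalyst.expressions.{Coalesce, Literal, Not}
import org.apache.spark.sql.catalyst.plans.logical.{Distinct, Except, Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object ReplaceExceptWithNotFilter extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Both sides are filters over the same parent: flip the right-hand
    // condition instead of planning an anti join. Coalesce(cond, false)
    // keeps rows where the right condition evaluates to NULL.
    case Except(left @ Filter(_, leftChild), Filter(rightCond, rightChild))
        if leftChild.sameResult(rightChild) =>
      Distinct(Filter(Not(Coalesce(Seq(rightCond, Literal(false)))), left))
  }
}
```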

## How was this patch tested?

The patch was tested locally using spark-shell and a unit test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sathiyapk/spark SPARK-22181-optimize-exceptWithFilter

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19451.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19451


commit 1baecfdce9552b1e7853ace425970ea49c8f2304
Author: Sathiya 
Date:   2017-10-06T18:57:52Z

SPARK-22181 Adds ReplaceExceptWithNotFilter rule




---




[GitHub] spark issue #19270: [SPARK-21809] : Change Stage Page to use datatables to s...

2017-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19270
  
**[Test build #82522 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82522/testReport)** for PR 19270 at commit [`25b5215`](https://github.com/apache/spark/commit/25b5215cd6dbbc11b5e5d8d56bff6743578de2b5).


---




[GitHub] spark pull request #18460: [SPARK-21247][SQL] Type comparison should respect...

2017-10-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/18460#discussion_r143273532
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -100,6 +101,17 @@ object TypeCoercion {
 case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) =>
   Some(TimestampType)
 
+case (t1 @ StructType(fields1), t2 @ StructType(fields2)) if t1.sameType(t2) =>
+  Some(StructType(fields1.zip(fields2).map { case (f1, f2) =>
+    // Since `t1.sameType(t2)` is true, two StructTypes have the same DataType
+    // except `name` (in case of `spark.sql.caseSensitive=false`) and `nullable`.
+    // - Different names: use a lower case name because findTightestCommonType is commutative.
+    // - Different nullabilities: `nullable` is true iff one of them is nullable.
+    val name = if (f1.name == f2.name) f1.name else f1.name.toLowerCase(Locale.ROOT)
--- End diff --

For a test case, does `nested column in the references` mean `WHERE` clause?


---




[GitHub] spark pull request #18460: [SPARK-21247][SQL] Type comparison should respect...

2017-10-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/18460#discussion_r143273000
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -100,6 +101,17 @@ object TypeCoercion {
 case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) =>
   Some(TimestampType)
 
+case (t1 @ StructType(fields1), t2 @ StructType(fields2)) if t1.sameType(t2) =>
+  Some(StructType(fields1.zip(fields2).map { case (f1, f2) =>
+    // Since `t1.sameType(t2)` is true, two StructTypes have the same DataType
+    // except `name` (in case of `spark.sql.caseSensitive=false`) and `nullable`.
+    // - Different names: use a lower case name because findTightestCommonType is commutative.
+    // - Different nullabilities: `nullable` is true iff one of them is nullable.
+    val name = if (f1.name == f2.name) f1.name else f1.name.toLowerCase(Locale.ROOT)
--- End diff --

Yes. Hive does the following:
```sql
hive> CREATE TABLE S AS SELECT named_struct('A',1);
hive> DESCRIBE S;
OK
_c0	struct<a:int>
```


---




[GitHub] spark issue #18966: [SPARK-21751][SQL] CodeGeneraor.splitExpressions counts ...

2017-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18966
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #18966: [SPARK-21751][SQL] CodeGeneraor.splitExpressions counts ...

2017-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18966
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82518/
Test PASSed.


---




[GitHub] spark issue #18966: [SPARK-21751][SQL] CodeGeneraor.splitExpressions counts ...

2017-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18966
  
**[Test build #82518 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82518/testReport)** for PR 18966 at commit [`b04c09c`](https://github.com/apache/spark/commit/b04c09c3683f104909713344c90e46b4129f5401).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143272791
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala ---
@@ -1213,6 +1213,71 @@ case class ToUTCTimestamp(left: Expression, right: Expression)
 }
 
 /**
+ * This modifies a timestamp to show how the display time changes going from one timezone to
+ * another, for the same instant in time.
+ *
+ * We intentionally do not provide an ExpressionDescription as this is not meant to be exposed to
+ * users, its only used for internal conversions.
+ */
+private[spark] case class TimestampTimezoneCorrection(
--- End diff --

> Additionally, timezone rules have changed over time and keep changing. 

Ah, yes, that makes sense... it's also why my initial tests were failing at 
the DST boundaries. :-/
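
For illustration, the boundary effect with plain java.time (nothing Spark-specific): the zone's UTC offset differs on either side of the 2017-03-12 US transition, so any fixed-offset correction is wrong near it.

```scala
import java.time.{Instant, ZoneId}

val la = ZoneId.of("America/Los_Angeles")
val beforeDst = Instant.parse("2017-03-12T09:00:00Z")  // 01:00 local (PST)
val afterDst  = Instant.parse("2017-03-12T11:00:00Z")  // 04:00 local (PDT)
println(la.getRules.getOffset(beforeDst))  // -08:00
println(la.getRules.getOffset(afterDst))   // -07:00
```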


---




[GitHub] spark issue #19450: [SPARK-22218] spark shuffle services fails to update sec...

2017-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19450
  
**[Test build #82521 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82521/testReport)** for PR 19450 at commit [`5a9ef13`](https://github.com/apache/spark/commit/5a9ef1396a28b535042e336a2f43d80104ce95e5).


---




[GitHub] spark issue #19450: [SPARK-22218] spark shuffle services fails to update sec...

2017-10-06 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/19450
  
ping @vanzin 


---




[GitHub] spark pull request #19450: [SPARK-22218] spark shuffle services fails to upd...

2017-10-06 Thread tgravescs
GitHub user tgravescs opened a pull request:

https://github.com/apache/spark/pull/19450

[SPARK-22218] spark shuffle services fails to update secret on app 
re-attempts

This patch fixes application re-attempts when running Spark on YARN using the external shuffle service with security on. Currently executors will fail to launch on any application re-attempt when launched on a nodemanager that had an executor from the first attempt. The reason is that we aren't updating the secret key after the first application attempt. The fix here is to just remove the containsKey check; this way, we always add the secret and make sure it's the most recent one. Similarly, remove the containsKey check on the remove path, since it's just an extra check that isn't really needed.

Note this worked before Spark 2.2 because the check used to be contains (which was looking for the value) rather than containsKey, so it never matched and the new secret was always added.

The patch was tested on a 10-node cluster, and a unit test was added. The test run was a wordcount where the output directory already existed. With the bug present, the application re-attempt failed with the max number of executor failures, all of which were SaslExceptions. With the fix present, the application re-attempts fail with "directory already exists", or succeed when the directory is removed between attempts.
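
In code terms the fix boils down to an unconditional put/remove; a simplified sketch (class, field, and method names mirror the shuffle service's secret manager but are abbreviated here):

```java
import java.util.concurrent.ConcurrentHashMap;

public class ShuffleSecretManagerSketch {
    private final ConcurrentHashMap<String, String> shuffleSecretMap =
        new ConcurrentHashMap<>();

    public void registerApp(String appId, String shuffleSecret) {
        // No containsKey guard: always overwrite, so a re-attempt replaces
        // the stale first-attempt secret with the latest one.
        shuffleSecretMap.put(appId, shuffleSecret);
    }

    public void unregisterApp(String appId) {
        // Likewise remove unconditionally; the containsKey check added nothing.
        shuffleSecretMap.remove(appId);
    }
}
```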


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tgravescs/spark SPARK-22218

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19450.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19450


commit aa9a2c2705cd59f9e1caa7e78ad7eafad3ff2789
Author: Thomas Graves 
Date:   2017-10-06T18:45:13Z

[SPARK-22218] spark shuffle services fails to update secret on application 
re-attempts

commit 5a9ef1396a28b535042e336a2f43d80104ce95e5
Author: Thomas Graves 
Date:   2017-10-06T18:55:43Z

minor updates




---




[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...

2017-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19448
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82517/
Test PASSed.


---




[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...

2017-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19448
  
**[Test build #82517 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82517/testReport)** for PR 19448 at commit [`e6fdbdc`](https://github.com/apache/spark/commit/e6fdbdcf4118283abd22f7b14586ed742d225657).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...

2017-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19448
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...

2017-10-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18664
  
I am okay with proceeding separately for dealing with timezone, and with matching the Arrow behaviour here to the existing non-Arrow behaviour with respect to timezone.

I am less sure whether we need a design doc for dealing with timezone, but will follow if other committers feel so.


---




[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-06 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/18732#discussion_r143263694
  
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,69 @@ def pivot(self, pivot_col, values=None):
 jgd = self._jgd.pivot(pivot_col)
 else:
 jgd = self._jgd.pivot(pivot_col, values)
-return GroupedData(jgd, self.sql_ctx)
+return GroupedData(jgd, self._df)
+
+@since(2.3)
+def apply(self, udf):
--- End diff --

@rxin just to recap our discussion regarding naming:

You asked:
> What's the difference between this one and the transform function you 
also proposed? I'm trying to see if all the naming
makes sense when considered together.

Answer is:
`transform` takes a function pd.Series -> pd.Series and applies it to each 
column (or a subset of columns). The input and output Series are of the same 
length.

`apply` takes a function pd.DataFrame -> pd.DataFrame and applies it to the 
group, similar to `flatMapGroups`.

Does this make sense to you?
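
For concreteness, a minimal pandas-only sketch of that difference (plain 
pandas here, not the proposed PySpark API; the data and column names are 
made up):

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1.0, 2.0, 3.0]})

# transform: pd.Series -> pd.Series, applied per column; the output has the
# same length as the input, so results align with the original rows.
demeaned = df.groupby("key")["value"].transform(lambda s: s - s.mean())
print(demeaned.tolist())  # [-0.5, 0.5, 0.0]

# apply: pd.DataFrame -> pd.DataFrame, applied once per group; the output
# shape is whatever the function returns (hence the flatMapGroups analogy).
summary = df.groupby("key").apply(lambda g: g.assign(mean=g["value"].mean()))
print(summary)
```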


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...

2017-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19372
  
**[Test build #3943 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3943/testReport)**
 for PR 19372 at commit 
[`2ea3f18`](https://github.com/apache/spark/commit/2ea3f18bb002c38f387e01c31c6fb3ce2cb37231).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...

2017-10-06 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/18664
  
If we all agree on the necessity of a design doc first, I can create a Jira 
and we can make progress there.

What do you all think? @BryanCutler  @gatorsmile @HyukjinKwon 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18229: [SPARK-20691][CORE] Difference between Storage Memory as...

2017-10-06 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/18229
  
gentle ping @mkesselaers 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...

2017-10-06 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/18664
  
I agree. I think a high-level document describing these differences would help 
so we can discuss them. I think we should be more careful about the Arrow-version 
behavior before releasing support for timestamp and nested types.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19077: [SPARK-21860][core]Improve memory reuse for heap memory ...

2017-10-06 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19077
  
gentle ping @jerryshao for review


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...

2017-10-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18664
  
Yup, I admit there could be some exceptions (there have been, actually), but 
that should still be the baseline we basically pursue. Probably we could treat 
this Arrow optimisation as an evolving configuration option for now due to 
those differences, but the final goal should be the matched behaviour.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19449: [SPARK-22219][SQL] Refactor code to get a value for "spa...

2017-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19449
  
**[Test build #82520 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82520/testReport)**
 for PR 19449 at commit 
[`be4220d`](https://github.com/apache/spark/commit/be4220dcb218366f81f4015af56bee9fade90066).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19449: [SPARK-22219][SQL] Refactor code to get a value f...

2017-10-06 Thread kiszk
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/19449

[SPARK-22219][SQL] Refactor code to get a value for 
"spark.sql.codegen.comments"

## What changes were proposed in this pull request?

This PR refactors code to get a value for "spark.sql.codegen.comments" by 
avoiding `SparkEnv.get.conf`. This PR uses `SQLConf.get.codegenComments` since 
`SQLConf.get` always returns an instance of `SQLConf`.

## How was this patch tested?

Added test case to `DebuggingSuite`


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kiszk/spark SPARK-22219

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19449.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19449


commit be4220dcb218366f81f4015af56bee9fade90066
Author: Kazuaki Ishizaki 
Date:   2017-10-06T17:54:03Z

Initial commit




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread zivanfi
Github user zivanfi commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143257840
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 ---
@@ -1213,6 +1213,71 @@ case class ToUTCTimestamp(left: Expression, right: 
Expression)
 }
 
 /**
+ * This modifies a timestamp to show how the display time changes going 
from one timezone to
+ * another, for the same instant in time.
+ *
+ * We intentionally do not provide an ExpressionDescription as this is not 
meant to be exposed to
+ * users, its only used for internal conversions.
+ */
+private[spark] case class TimestampTimezoneCorrection(
--- End diff --

Unfortunately the offset depends on the actual date, so a timezone 
conversion cannot be simplified to a single delta.

Daylight saving time starts and ends on different days in different 
timezones, while some timezones don't have DST changes at all. Additionally, 
timezone rules have changed over time and keep changing: both the basic 
timezone offset and the DST rules themselves could change (and have changed) 
over time.
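
A small stand-alone sketch of that date-dependence (plain Python with the 
stdlib `zoneinfo` module, Python 3.9+, unrelated to Spark's own 
`DateTimeUtils`; the zones chosen are just examples):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

la = ZoneInfo("America/Los_Angeles")
kolkata = ZoneInfo("Asia/Kolkata")

# The UTC offset of the same zone differs by date because of DST:
print(datetime(2017, 1, 15, tzinfo=la).utcoffset())  # -1 day, 16:00:00 (UTC-8)
print(datetime(2017, 7, 15, tzinfo=la).utcoffset())  # -1 day, 17:00:00 (UTC-7)

# Other zones have no DST at all, so their offset is constant:
print(datetime(2017, 1, 15, tzinfo=kolkata).utcoffset())  # 5:30:00
print(datetime(2017, 7, 15, tzinfo=kolkata).utcoffset())  # 5:30:00
```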


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...

2017-10-06 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/18664
  
> The baseline should be (as said above): internal optimisation should not 
introduce any behaviour change, and we are discouraged from changing the 
previous behaviour unless it has bugs, in general.

I am not sure if I totally agree with this. 

Take the struct case, for instance: in the non-Arrow version, a struct is turned 
into a `pyspark.sql.Row` object in `toPandas()`. I wouldn't call this a bug 
because it's a design choice. However, it is maybe not the best design choice, 
because if the user pickles the `pandas.DataFrame` to a file and sends it to 
someone, the receiver won't be able to deserialize the `pandas.DataFrame` 
without having the pyspark library dependency.

Now I am **not** trying to argue we should or should not turn struct 
columns into `pyspark.sql.Row`; my point is that there might be some design 
choices in the non-Arrow version that are not ideal, and maybe it is not a 
hard requirement to make behavior 100% the same between the non-Arrow and Arrow 
versions.
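
A minimal sketch of the pickling concern in plain pandas, with a hypothetical 
stand-in class instead of the real `pyspark.sql.Row`:

```python
import pickle
import pandas as pd

class Row:  # hypothetical stand-in for pyspark.sql.Row
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

df = pd.DataFrame({"person": [Row(name="a", age=1)]})
blob = pickle.dumps(df)

# pickle stores class instances by reference to their defining module
# ("__main__" here; the pyspark module for the real Row class), so loading
# the blob on a machine without that module raises ModuleNotFoundError.
restored = pickle.loads(blob)  # works here only because Row is importable
```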



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18732
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82515/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18732
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18732
  
**[Test build #82515 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82515/testReport)**
 for PR 18732 at commit 
[`284ba00`](https://github.com/apache/spark/commit/284ba00be0cbc357ac42900f8c4af57901d147c5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...

2017-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19424
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82514/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...

2017-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19424
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...

2017-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19424
  
**[Test build #82514 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82514/testReport)**
 for PR 19424 at commit 
[`24f1a75`](https://github.com/apache/spark/commit/24f1a757e4df3b44e5242188e67bfffd6a3f0426).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `trait DataSourceReaderHolder `


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19294: [SPARK-21549][CORE] Respect OutputFormats with no...

2017-10-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19294#discussion_r143252186
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
 ---
@@ -57,6 +60,15 @@ class HadoopMapReduceCommitProtocol(jobId: String, path: 
String)
*/
   private def absPathStagingDir: Path = new Path(path, "_temporary-" + 
jobId)
 
+  /**
+   * Checks whether there are files to be committed to an absolute output 
location.
+   *
+   * As the committing and aborting the job occurs on driver where 
`addedAbsPathFiles` is always
+   * null, it is necessary to check whether the output path is specified, 
that may not be the case
+   * for committers not writing to distributed file systems.
--- End diff --

This is much better.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19294: [SPARK-21549][CORE] Respect OutputFormats with no...

2017-10-06 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/19294#discussion_r143250260
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
 ---
@@ -57,6 +60,15 @@ class HadoopMapReduceCommitProtocol(jobId: String, path: 
String)
*/
   private def absPathStagingDir: Path = new Path(path, "_temporary-" + 
jobId)
 
+  /**
+   * Checks whether there are files to be committed to an absolute output 
location.
+   *
+   * As the committing and aborting the job occurs on driver where 
`addedAbsPathFiles` is always
+   * null, it is necessary to check whether the output path is specified, 
that may not be the case
+   * for committers not writing to distributed file systems.
--- End diff --

How about :
"As committing and aborting a job occurs on driver, where 
`addedAbsPathFiles` is always null, it is necessary to check whether the output 
path is specified. Output path need not be relevant for committers not writing 
to distributed file systems"


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143246203
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/TimestampTableTimeZone.scala
 ---
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.{AnalysisException, SparkSession}
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.analysis.UnresolvedException
+import org.apache.spark.sql.catalyst.catalog.CatalogTable
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types.{StringType, TimestampType}
+
+/**
+ * Apply a correction to data loaded from, or saved to, Parquet, so that 
it timestamps can be read
+ * like TIMESTAMP WITHOUT TIMEZONE.  This gives correct behavior if you 
process data with
+ * machines in different timezones, or if you access the data from 
multiple SQL engines.
+ */
+private[sql] case class TimestampTableTimeZone(sparkSession: SparkSession)
--- End diff --

I'm also trying to see if this can be simplified; I guess the main thing is 
Imran's comment about not being able to use `transformUp`. I need to take a 
look at whether that's really the case.

This rule also doesn't seem to handle `InsertIntoHiveTable`, which is in 
the hive module so can't be handled here. Probably will need a new rule based 
on this one in the hive module.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2017-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18931
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143245229
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -230,6 +230,13 @@ case class AlterTableSetPropertiesCommand(
 isView: Boolean)
   extends RunnableCommand {
 
+  if (isView) {
+properties.get(TimestampTableTimeZone.TIMEZONE_PROPERTY).foreach { _ =>
--- End diff --


[HiveQL](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView)
 explicitly allows properties in views; I've never used them, though.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2017-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18931
  
**[Test build #82516 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82516/testReport)**
 for PR 18931 at commit 
[`edb73d6`](https://github.com/apache/spark/commit/edb73d6e9705e1ce5a836b9a394fec706d68aec4).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...

2017-10-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19447
  
We still need to add a test case. We also should capture the exception and 
issue a better one; otherwise, users will not know what they should do when 
they hit such a confusing error.

cc @kiszk @rednaxelafx Could you take a look at this PR? Do we still hit 
this using the current hard-coded value? If so, what is the easy way to write a 
test case?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143243769
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 ---
@@ -1213,6 +1213,71 @@ case class ToUTCTimestamp(left: Expression, right: 
Expression)
 }
 
 /**
+ * This modifies a timestamp to show how the display time changes going 
from one timezone to
+ * another, for the same instant in time.
+ *
+ * We intentionally do not provide an ExpressionDescription as this is not 
meant to be exposed to
+ * users, its only used for internal conversions.
+ */
+private[spark] case class TimestampTimezoneCorrection(
--- End diff --

Hmm, let me see if I can figure that out.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2017-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18931
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82516/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...

2017-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19424
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143244400
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -266,6 +267,10 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
* @since 1.4.0
*/
   def insertInto(tableName: String): Unit = {
+extraOptions.get(TimestampTableTimeZone.TIMEZONE_PROPERTY).foreach { 
tz =>
--- End diff --

The 
[spec](https://docs.google.com/document/d/1XmyVjr3eOJiNFjVeSnmjIU60Hq-XiZB03pgi3r1razM/edit#heading=h.jnyh4t7bhnlx)
 requests errors when using invalid time zones. I guess this would still fail 
with a different error in that case, so I can remove this if you're really 
against adding it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143243338
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 ---
@@ -1213,6 +1213,71 @@ case class ToUTCTimestamp(left: Expression, right: 
Expression)
 }
 
 /**
+ * This modifies a timestamp to show how the display time changes going 
from one timezone to
+ * another, for the same instant in time.
+ *
+ * We intentionally do not provide an ExpressionDescription as this is not 
meant to be exposed to
+ * users, its only used for internal conversions.
+ */
+private[spark] case class TimestampTimezoneCorrection(
--- End diff --

What I'm saying is the analysis rule can just determine the delta, and then 
just do a simple add/delete.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143243228
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -1015,6 +1020,10 @@ object DateTimeUtils {
 guess
   }
 
+  def convertTz(ts: SQLTimestamp, fromZone: String, toZone: String): 
SQLTimestamp = {
+convertTz(ts, getTimeZone(fromZone), getTimeZone(toZone))
--- End diff --

I guess caching the value as done by the `FromUTCTimestamp` expression is 
the right way to go?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...

2017-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19424
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82513/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...

2017-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19424
  
**[Test build #82513 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82513/testReport)**
 for PR 19424 at commit 
[`e880328`](https://github.com/apache/spark/commit/e88032896512c2d6945b1cfd017a462961726907).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `trait DataSourceReaderHolder `


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...

2017-10-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18664
  
The baseline should be (as said above): internal optimisation should not 
introduce any behaviour change, and we are discouraged from changing the 
previous behaviour unless it has bugs, in general.

So, the issue is that the previous behaviour without Arrow does not respect 
`SESSION_LOCAL_TIMEZONE`, but we should respect it; however, this is a behaviour 
change, so we should discuss further and possibly introduce a configuration to 
control it to prevent the behaviour change. Also, to be clear, I assume we 
agree that not respecting `SESSION_LOCAL_TIMEZONE` is not a bug but a behaviour 
change to be fixed as an improvement, right?

How about matching the behaviour with Arrow to the previous behaviour 
without Arrow (not respecting `SESSION_LOCAL_TIMEZONE`) first, and then 
separately making a PR to introduce a configuration to respect 
`SESSION_LOCAL_TIMEZONE` for both?



I actually have a little bit of doubt about respecting 
`SESSION_LOCAL_TIMEZONE` when we `collect()` or `toPandas()`, and want to 
separate this discussion into another PR, JIRA, or the mailing list if possible.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143242256
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 ---
@@ -1213,6 +1213,71 @@ case class ToUTCTimestamp(left: Expression, right: 
Expression)
 }
 
 /**
+ * This modifies a timestamp to show how the display time changes going 
from one timezone to
+ * another, for the same instant in time.
+ *
+ * We intentionally do not provide an ExpressionDescription as this is not 
meant to be exposed to
+ * users, its only used for internal conversions.
+ */
+private[spark] case class TimestampTimezoneCorrection(
--- End diff --

I guess you could use ToUTCTimestamp / FromUTCTimestamp for this, but that 
would be more expensive since you'd be doing the conversion twice for each 
value.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...

2017-10-06 Thread jsnowacki
Github user jsnowacki commented on the issue:

https://github.com/apache/spark/pull/19443
  
The only other reason for Python I can think of, if the above are not 
compelling enough, is that when a function lacks the call-by-column-name 
option, we will get the error only at run time, due to how the names and types 
are mapped onto JVM code. This is partially related to the dynamic nature of 
typing in Python. So while this is spotted at compile time in Java/Scala, 
which may be annoying and considered inconsistent, in Python a task may fail 
after hours due to this issue. The same is true for R (and SQL to some extent).
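
A toy illustration of that dynamic-typing point (plain Python, nothing 
Spark-specific; names are made up):

```python
def add_one(x):
    return x + 1  # nothing checks at definition time that x supports "+ 1"

job = lambda: add_one("oops")  # building the "pipeline" succeeds

# ...much later, when the work actually runs:
job()  # TypeError: can only concatenate str (not "int") to str
```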


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19444: [SPARK-22214][SQL] Refactor the list hive partitions cod...

2017-10-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19444
  
LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18460: [SPARK-21247][SQL] Type comparision should respec...

2017-10-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18460#discussion_r143238804
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -100,6 +101,17 @@ object TypeCoercion {
 case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) 
=>
   Some(TimestampType)
 
+case (t1 @ StructType(fields1), t2 @ StructType(fields2)) if 
t1.sameType(t2) =>
+  Some(StructType(fields1.zip(fields2).map { case (f1, f2) =>
+// Since `t1.sameType(t2)` is true, two StructTypes have the same 
DataType
+// except `name` (in case of `spark.sql.caseSensitive=false`) and 
`nullable`.
+// - Different names: use a lower case name because 
findTightestCommonType is commutative.
+// - Different nullabilities: `nullable` is true iff one of them 
is nullable.
+val name = if (f1.name == f2.name) f1.name else 
f1.name.toLowerCase(Locale.ROOT)
--- End diff --

Why is the output nested column name lower case? Is Hive behaving like this?

In addition, could you add one more test to check whether we also respect the 
case sensitivity conf when we resolve queries that contain the nested 
column in the references?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18460: [SPARK-21247][SQL] Type comparision should respect case-...

2017-10-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18460
  
I see. There is a typo in the PR title: `comparision`.

Will review the fix soon.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19438: [SPARK-22208] [SQL] Improve percentile_approx by not rou...

2017-10-06 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/19438
  
Maybe we can run some of the major test suites locally and update all the 
results.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19444: [SPARK-22214][SQL] Refactor the list hive partitions cod...

2017-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19444
  
**[Test build #82519 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82519/testReport)**
 for PR 19444 at commit 
[`1e119ea`](https://github.com/apache/spark/commit/1e119ea7ae414e27ad0832cee347a20d24f1c0cb).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19438: [SPARK-22208] [SQL] Improve percentile_approx by not rou...

2017-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19438
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19438: [SPARK-22208] [SQL] Improve percentile_approx by not rou...

2017-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19438
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82512/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19438: [SPARK-22208] [SQL] Improve percentile_approx by not rou...

2017-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19438
  
**[Test build #82512 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82512/testReport)**
 for PR 19438 at commit 
[`aae2769`](https://github.com/apache/spark/commit/aae2769b753db483e0ec2652fa0c4f09fcd14c61).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19444: [SPARK-22214][SQL] Refactor the list hive partitions cod...

2017-10-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19444
  
LGTM except a minor comment.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19444: [SPARK-22214][SQL] Refactor the list hive partiti...

2017-10-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19444#discussion_r143235124
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala 
---
@@ -638,12 +638,14 @@ private[hive] class HiveClientImpl(
   table: CatalogTable,
   spec: Option[TablePartitionSpec]): Seq[CatalogTablePartition] = 
withHiveState {
 val hiveTable = toHiveTable(table, Some(userName))
-val parts = spec match {
-  case None => shim.getAllPartitions(client, 
hiveTable).map(fromHivePartition)
+val partialPartSpec = spec match {
--- End diff --

-> `partSpec`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18966: [SPARK-21751][SQL] CodeGeneraor.splitExpressions counts ...

2017-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18966
  
**[Test build #82518 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82518/testReport)**
 for PR 18966 at commit 
[`b04c09c`](https://github.com/apache/spark/commit/b04c09c3683f104909713344c90e46b4129f5401).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...

2017-10-06 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/19447
  
Here are the answers to your questions, @gatorsmile; please tell me if I need 
to elaborate more deeply.
This conf controls how many inner classes are generated. A big value means 
that we will have few inner classes, and that the outer class and the inner 
classes will have a lot of methods. A small one means a lot of classes with few 
methods. So:
1 - I don't think there is a big perf impact. Maybe, if this value is very 
small, we might have a lot of nested class instances and therefore a waste of 
memory. But since these classes don't have any variables, this is quite 
negligible to me.
2 - If the user encounters an Exception like
```
Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
class 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass
 has grown past JVM limit of 0xFFFF
```
then he/she must reduce this value.
3 - I don't think so, because I have not been able to reproduce the 
problem other than in a very complex use case. So probably that use case is a 
corner case which can benefit from this change.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19394: [SPARK-22170][SQL] Reduce memory consumption in b...

2017-10-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19394#discussion_r143232991
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala
 ---
@@ -73,25 +73,37 @@ case class BroadcastExchangeExec(
 try {
   val beforeCollect = System.nanoTime()
   // Note that we use .executeCollect() because we don't want to 
convert data to Scala types
-  val input: Array[InternalRow] = child.executeCollect()
-  if (input.length >= 51200) {
+  val (numRows, input) = child.executeCollectIterator()
+  if (numRows >= 51200) {
 throw new SparkException(
-  s"Cannot broadcast the table with more than 512 millions 
rows: ${input.length} rows")
+  s"Cannot broadcast the table with more than 512 millions 
rows: $numRows rows")
   }
+
   val beforeBuild = System.nanoTime()
   longMetric("collectTime") += (beforeBuild - beforeCollect) / 
100
-  val dataSize = 
input.map(_.asInstanceOf[UnsafeRow].getSizeInBytes.toLong).sum
+
+  // Construct the relation.
+  val relation = mode.transform(input, Some(numRows))
+
+  val dataSize = relation match {
+case map: HashedRelation =>
+  map.estimatedSize
+case arr: Array[InternalRow] =>
+  arr.map(_.asInstanceOf[UnsafeRow].getSizeInBytes.toLong).sum
+case _ =>
+  numRows * 512 // guess: each row is about 512 bytes
--- End diff --

Yeah, it makes more sense to issue an exception. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19294: [SPARK-21549][CORE] Respect OutputFormats with no...

2017-10-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19294#discussion_r143232506
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
 ---
@@ -57,6 +60,15 @@ class HadoopMapReduceCommitProtocol(jobId: String, path: 
String)
*/
   private def absPathStagingDir: Path = new Path(path, "_temporary-" + 
jobId)
 
+  /**
+   * Checks whether there are files to be committed to an absolute output 
location.
+   *
+   * As the committing and aborting the job occurs on driver where 
`addedAbsPathFiles` is always
+   * null, it is necessary to check whether the output path is specified, 
that may not be the case
+   * for committers not writing to distributed file systems.
--- End diff --

This also has a grammar issue, and it is not clear either.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...

2017-10-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19448
  
**[Test build #82517 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82517/testReport)**
 for PR 19448 at commit 
[`e6fdbdc`](https://github.com/apache/spark/commit/e6fdbdcf4118283abd22f7b14586ed742d225657).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


