[GitHub] spark issue #19270: [SPARK-21809] : Change Stage Page to use datatables to s...
Github user pgandhi999 commented on the issue: https://github.com/apache/spark/pull/19270 OK, yes, I just noticed that comment. Will get back to you on this one! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19429: [SPARK-20055] [Docs] Added documentation for load...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19429#discussion_r143287807 --- Diff: examples/src/main/python/sql/datasource.py --- @@ -53,6 +53,11 @@ def basic_datasource_example(spark): df.select("name", "age").write.save("namesAndAges.parquet", format="parquet") # $example off:manual_load_options$ +# $example on:manual_load_options_csv$ +df = spark.read.load("examples/src/main/resources/people.csv", + format="csv", sep=":", inferSchema="true", header="true") +# $example off:manual_load_options_csv --- End diff -- This needs to be corrected to > # $example off:manual_load_options_csv$ ---
[GitHub] spark pull request #19429: [SPARK-20055] [Docs] Added documentation for load...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19429#discussion_r143287505 --- Diff: examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java --- @@ -115,7 +115,20 @@ private static void runBasicDataSourceExample(SparkSession spark) { Dataset peopleDF = spark.read().format("json").load("examples/src/main/resources/people.json"); peopleDF.select("name", "age").write().format("parquet").save("namesAndAges.parquet"); -// $example off:manual_load_options$ +// $example on:manual_load_options_csv$ --- End diff -- You still need to keep > // $example off:manual_load_options$ ---
[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...
Github user pgandhi999 commented on a diff in the pull request: https://github.com/apache/spark/pull/19270#discussion_r143287411 --- Diff: core/src/main/resources/org/apache/spark/ui/static/utils.js --- @@ -46,3 +46,64 @@ function formatBytes(bytes, type) { var i = Math.floor(Math.log(bytes) / Math.log(k)); return parseFloat((bytes / Math.pow(k, i)).toFixed(dm)) + ' ' + sizes[i]; } + +function formatLogsCells(execLogs, type) { +if (type !== 'display') return Object.keys(execLogs); +if (!execLogs) return; +var result = ''; +$.each(execLogs, function (logName, logUrl) { +result += '' + logName + '' +}); +return result; +} + +function getStandAloneAppId(cb) { --- End diff -- So, to summarize what you said, basically the function should simply return the appId which can be taken as a param in those individual functions. I think that makes sense. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19270: [SPARK-21809] : Change Stage Page to use datatables to s...
Github user ajbozarth commented on the issue: https://github.com/apache/spark/pull/19270 Thanks, I must've missed that in the description. You've taken care of all but my last comment: I'm not seeing the accumulators table. Have you checked that it shows up when there are accumulators? ---
[GitHub] spark issue #19394: [SPARK-22170][SQL] Reduce memory consumption in broadcas...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19394 **[Test build #82525 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82525/testReport)** for PR 19394 at commit [`5b92fe2`](https://github.com/apache/spark/commit/5b92fe252530f4c8dab4d41772011e86ac308f80). ---
[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...
Github user ajbozarth commented on a diff in the pull request: https://github.com/apache/spark/pull/19270#discussion_r143286538 --- Diff: core/src/main/resources/org/apache/spark/ui/static/utils.js --- @@ -46,3 +46,64 @@ function formatBytes(bytes, type) { var i = Math.floor(Math.log(bytes) / Math.log(k)); return parseFloat((bytes / Math.pow(k, i)).toFixed(dm)) + ' ' + sizes[i]; } + +function formatLogsCells(execLogs, type) { +if (type !== 'display') return Object.keys(execLogs); +if (!execLogs) return; +var result = ''; +$.each(execLogs, function (logName, logUrl) { +result += '' + logName + '' +}); +return result; +} + +function getStandAloneAppId(cb) { --- End diff -- I was thinking it should return the appId rather than take in a function as a param --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
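The suggestion above (have the helper produce the appId rather than take a callback) could be sketched roughly as follows. This is a hypothetical refactor, not the PR's actual code: `fetchJson` is a stand-in for the AJAX call the real `getStandAloneAppId` makes, and `renderStagePage` is an invented example of a caller that now receives the appId as a parameter.

```javascript
// Hypothetical refactor: instead of invoking a callback, resolve the
// appId from the standalone master's applications endpoint and let each
// page function take it as a parameter.
function getStandAloneAppId(fetchJson) {
  return fetchJson("/api/v1/applications").then(function (apps) {
    // Assumption: the first listed application is the one being viewed.
    return apps[0].id;
  });
}

// Example caller (hypothetical): builds an API URL from the appId it is given.
function renderStagePage(appId) {
  return "/api/v1/applications/" + appId + "/stages";
}
```

Callers would then chain on the returned promise, e.g. `getStandAloneAppId(fetchJson).then(renderStagePage)`, instead of threading a callback through.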
[GitHub] spark pull request #19394: [SPARK-22170][SQL] Reduce memory consumption in b...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19394#discussion_r143286130 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala --- @@ -73,25 +73,37 @@ case class BroadcastExchangeExec( try { val beforeCollect = System.nanoTime() // Note that we use .executeCollect() because we don't want to convert data to Scala types - val input: Array[InternalRow] = child.executeCollect() - if (input.length >= 51200) { + val (numRows, input) = child.executeCollectIterator() + if (numRows >= 51200) { throw new SparkException( - s"Cannot broadcast the table with more than 512 millions rows: ${input.length} rows") + s"Cannot broadcast the table with more than 512 millions rows: $numRows rows") } + val beforeBuild = System.nanoTime() longMetric("collectTime") += (beforeBuild - beforeCollect) / 100 - val dataSize = input.map(_.asInstanceOf[UnsafeRow].getSizeInBytes.toLong).sum + + // Construct the relation. + val relation = mode.transform(input, Some(numRows)) + + val dataSize = relation match { +case map: HashedRelation => + map.estimatedSize +case arr: Array[InternalRow] => + arr.map(_.asInstanceOf[UnsafeRow].getSizeInBytes.toLong).sum +case _ => + numRows * 512 // guess: each row is about 512 bytes --- End diff -- Fixed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19394: [SPARK-22170][SQL] Reduce memory consumption in b...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19394#discussion_r143286104 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala --- @@ -73,25 +73,37 @@ case class BroadcastExchangeExec( try { val beforeCollect = System.nanoTime() // Note that we use .executeCollect() because we don't want to convert data to Scala types --- End diff -- Updated. ---
[GitHub] spark pull request #19394: [SPARK-22170][SQL] Reduce memory consumption in b...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/19394#discussion_r143285737 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/ConfigBehaviorSuite.scala --- @@ -58,7 +58,7 @@ class ConfigBehaviorSuite extends QueryTest with SharedSQLContext { withSQLConf(SQLConf.RANGE_EXCHANGE_SAMPLE_SIZE_PER_PARTITION.key -> "1") { // If we only sample one point, the range boundaries will be pretty bad and the // chi-sq value would be very high. -assert(computeChiSquareTest() > 1000) +assert(computeChiSquareTest() > 300) --- End diff -- @rxin, the difference that is causing this to fail is that `rdd.id` is 15 instead of 14 because of the change to `getByteArrayRdd`. That id is used to seed the random generator and the chi-sq value isn't high enough. The previous value must have been unusually high. With so few data points, this can vary quite a bit so I think changing the bound to 300 is a good fix. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
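For context on why the bound can swing with the random seed: the test measures how evenly rows spread across partitions with Pearson's chi-squared statistic against a uniform expectation. A minimal JavaScript sketch of that statistic (the function name and the uniform-expectation setup are mine, not Spark's Scala test code):

```javascript
// Pearson chi-squared statistic for observed partition sizes versus a
// uniform expectation. With few sample points the value varies widely
// between seeds, which is why a loose bound such as 300 is defensible.
function chiSquareUniform(observed) {
  var total = observed.reduce(function (a, b) { return a + b; }, 0);
  var expected = total / observed.length;
  return observed.reduce(function (sum, o) {
    var d = o - expected;
    return sum + (d * d) / expected;
  }, 0);
}
```

Perfectly even counts give 0; the more skewed the counts, the larger the statistic.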
[GitHub] spark issue #19394: [SPARK-22170][SQL] Reduce memory consumption in broadcas...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19394 **[Test build #82524 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82524/testReport)** for PR 19394 at commit [`f6613c2`](https://github.com/apache/spark/commit/f6613c2651691ca149390ea2407856ba390590cb). ---
[GitHub] spark pull request #18460: [SPARK-21247][SQL] Type comparison should respect...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18460#discussion_r143285500 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala --- @@ -100,6 +101,17 @@ object TypeCoercion { case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) => Some(TimestampType) +case (t1 @ StructType(fields1), t2 @ StructType(fields2)) if t1.sameType(t2) => + Some(StructType(fields1.zip(fields2).map { case (f1, f2) => +// Since `t1.sameType(t2)` is true, two StructTypes have the same DataType +// except `name` (in case of `spark.sql.caseSensitive=false`) and `nullable`. +// - Different names: use a lower case name because findTightestCommonType is commutative. +// - Different nullabilities: `nullable` is true iff one of them is nullable. +val name = if (f1.name == f2.name) f1.name else f1.name.toLowerCase(Locale.ROOT) --- End diff -- Do both `tab.Col1.a` and `tab.Col1.A` work well when case sensitivity is off? ---
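The quoted rule can be illustrated with a small sketch. This is a JavaScript analogue of the single-field case, not Spark's actual Scala implementation: for two same-typed struct fields, the merged field takes a lower-case name when the names differ only in case, and is nullable iff either side is.

```javascript
// Hypothetical analogue of the quoted TypeCoercion logic for one pair of
// same-typed struct fields (names may differ only in case).
function mergeFields(f1, f2) {
  return {
    // Different names: pick the lower-case form so the result is the same
    // regardless of argument order (the rule must stay commutative).
    name: f1.name === f2.name ? f1.name : f1.name.toLowerCase(),
    // `nullable` is true iff one of the two sides is nullable.
    nullable: f1.nullable || f2.nullable
  };
}
```

Commutativity holds because, under case-insensitive `sameType`, lower-casing either side's name yields the same string.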
[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19451#discussion_r143284824 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1242,6 +1243,53 @@ object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] { } /** + * If one or both of the datasets in the logical [[Except]] operator are purely transformed using + * [[Filter]], this rule will replace logical [[Except]] operator with a [[Filter]] operator by + * flipping the filter condition of the right child. + * {{{ + * SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5 + * ==> SELECT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5 --- End diff -- This is wrong. See the correct description of `EXCEPT` > EXCEPT returns distinct rows from the left input query that aren't output by the right input query. ---
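The semantics being pointed out (distinct left rows not produced by the right side) can be sketched over plain arrays of scalar values. This is an illustrative JavaScript analogue, not Spark's implementation; note the dedup step, which a plain negated filter would miss:

```javascript
// EXCEPT returns *distinct* rows from the left input that are not output
// by the right input. Deduplicating the left side is what a simple
// "left WHERE NOT right-condition" rewrite would get wrong.
function except(left, right) {
  var rightSet = new Set(right);
  return Array.from(new Set(left)).filter(function (row) {
    return !rightSet.has(row);
  });
}
```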
[GitHub] spark pull request #19336: [SPARK-21947][SS] Check and report error when mon...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19336 ---
[GitHub] spark issue #19336: [SPARK-21947][SS] Check and report error when monotonica...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/19336 LGTM. Merging to master. ---
[GitHub] spark issue #18460: [SPARK-21247][SQL] Type comparison should respect case-s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18460 **[Test build #82523 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82523/testReport)** for PR 18460 at commit [`c72aa18`](https://github.com/apache/spark/commit/c72aa18ccbf23053a06faa35deda77ff1135b818). ---
[GitHub] spark issue #18460: [SPARK-21247][SQL] Type comparison should respect case-s...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/18460 The test cases are added. Thank you, @gatorsmile! ---
[GitHub] spark pull request #19444: [SPARK-22214][SQL] Refactor the list hive partiti...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19444 ---
[GitHub] spark issue #19444: [SPARK-22214][SQL] Refactor the list hive partitions cod...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19444 Thanks! Merged to master. ---
[GitHub] spark issue #19270: [SPARK-21809] : Change Stage Page to use datatables to s...
Github user pgandhi999 commented on the issue: https://github.com/apache/spark/pull/19270 @ajbozarth Regarding your earlier comment on the missing "show additional metrics" functionality, I mentioned that in the description of the PR. I restate it below: because of the above change, certain functionalities on the page had to be modified to support the addition of datatables. For example, the "Select All" toggle checkbox previously added the checked fields as columns in the Task table and as rows in the Summary Metrics table; after the change, only columns are added to the Task table, as it got tricky to add rows dynamically with datatables. I have fixed the second bit of missing functionality, creating links on completed tasks. ---
[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...
Github user pgandhi999 commented on a diff in the pull request: https://github.com/apache/spark/pull/19270#discussion_r143280105 --- Diff: core/src/main/resources/org/apache/spark/ui/static/utils.js --- @@ -46,3 +46,64 @@ function formatBytes(bytes, type) { var i = Math.floor(Math.log(bytes) / Math.log(k)); return parseFloat((bytes / Math.pow(k, i)).toFixed(dm)) + ' ' + sizes[i]; } + +function formatLogsCells(execLogs, type) { +if (type !== 'display') return Object.keys(execLogs); +if (!execLogs) return; +var result = ''; +$.each(execLogs, function (logName, logUrl) { +result += '' + logName + '' +}); +return result; +} + +function getStandAloneAppId(cb) { --- End diff -- I did not quite get what you mean. Do you want me to return the appId, or to rename the function to a more suitable name? ---
[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...
Github user pgandhi999 commented on a diff in the pull request: https://github.com/apache/spark/pull/19270#discussion_r143279779 --- Diff: core/src/main/resources/org/apache/spark/ui/static/taskspages.js --- @@ -0,0 +1,474 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +$(document).ajaxStop($.unblockUI); +$(document).ajaxStart(function () { +$.blockUI({message: 'Loading Tasks Page...'}); +}); + +$.extend( $.fn.dataTable.ext.type.order, { +"file-size-pre": ConvertDurationString, + +"file-size-asc": function ( a, b ) { +a = ConvertDurationString( a ); +b = ConvertDurationString( b ); +return ((a < b) ? -1 : ((a > b) ? 1 : 0)); +}, + +"file-size-desc": function ( a, b ) { +a = ConvertDurationString( a ); +b = ConvertDurationString( b ); +return ((a < b) ? 1 : ((a > b) ? -1 : 0)); +} +} ); + +function createTemplateURI(appId) { +var words = document.baseURI.split('/'); +var ind = words.indexOf("proxy"); +if (ind > 0) { +var baseURI = words.slice(0, ind + 1).join('/') + '/' + appId + '/static/stagespage-template.html'; +return baseURI; +} +ind = words.indexOf("history"); +if(ind > 0) { +var baseURI = words.slice(0, ind).join('/') + '/static/stagespage-template.html'; +return baseURI; +} +return location.origin + "/static/stagespage-template.html"; +} + +// This function will only parse the URL under certain formate +// e.g. https://axonitered-jt1.red.ygrid.yahoo.com:50509/history/application_1502220952225_59143/stages/stage/?id=0=0 +function StageEndPoint(appId) { +var words = document.baseURI.split('/'); +var words2 = document.baseURI.split('?'); +var ind = words.indexOf("proxy"); +if (ind > 0) { +var appId = words[ind + 1]; +var stageIdLen = words2[1].indexOf('&'); +var stageId = words2[1].substr(3, stageIdLen - 3); +var newBaseURI = words.slice(0, ind + 2).join('/'); +return newBaseURI + "/api/v1/applications/" + appId + "/stages/" + stageId; +} +ind = words.indexOf("history"); +if (ind > 0) { +var appId = words[ind + 1]; +var attemptId = words[ind + 2]; +var stageIdLen = words2[1].indexOf('&'); +var stageId = words2[1].substr(3, stageIdLen - 3); +var newBaseURI = words.slice(0, ind).join('/'); +if (isNaN(attemptId) || attemptId == "0") { +return newBaseURI + "/api/v1/applications/" + appId + "/stages/" + stageId; +} else { +return newBaseURI + "/api/v1/applications/" + appId + "/" + attemptId + "/stages/" + stageId; +} +} +var stageIdLen = words2[1].indexOf('&'); +var stageId = words2[1].substr(3, stageIdLen - 3); +return location.origin + "/api/v1/applications/" + appId + "/stages/" + stageId; +} + +function sortNumber(a,b) { +return a - b; +} + +function quantile(array, percentile) { +index = percentile/100. * (array.length-1); +if (Math.floor(index) == index) { + result = array[index]; +} else { +var i = Math.floor(index); +fraction = index - i; +result = array[i]; +} +return result; +} + +$(document).ready(function () { +$.extend($.fn.dataTable.defaults, { +stateSave: true, +lengthMenu: [[20, 40, 60, 100, -1], [20, 40, 60, 100, "All"]], +pageLength: 20 +}); + +$("#showAdditionalMetrics").append( --- End diff -- The reason I wrote that piece of code in JavaScript is that I thought, if in future the data-column indices needed to be rearranged dynamically, it would make more sense to do it in JavaScript. But if you feel otherwise, I shall move it to HTML. ---
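As an aside on the diff above: the quoted `quantile` helper computes `fraction` but never uses it, so it effectively floors to the lower rank. A linearly interpolating version might look like this (the function name is mine; this is a sketch, not the PR's code):

```javascript
// Quantile with linear interpolation between the two nearest ranks.
// `array` must already be sorted ascending (e.g. via sortNumber).
function quantileInterpolated(array, percentile) {
  var index = (percentile / 100) * (array.length - 1);
  var i = Math.floor(index);
  if (i === index) {
    // The percentile lands exactly on an element.
    return array[index];
  }
  // Interpolate between the neighbours by the fractional part.
  var fraction = index - i;
  return array[i] + fraction * (array[i + 1] - array[i]);
}
```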
[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...
Github user pgandhi999 commented on a diff in the pull request: https://github.com/apache/spark/pull/19270#discussion_r143278211 --- Diff: core/src/main/resources/org/apache/spark/ui/static/taskspages.js --- @@ -0,0 +1,474 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +$(document).ajaxStop($.unblockUI); +$(document).ajaxStart(function () { +$.blockUI({message: 'Loading Tasks Page...'}); +}); + +$.extend( $.fn.dataTable.ext.type.order, { +"file-size-pre": ConvertDurationString, + +"file-size-asc": function ( a, b ) { +a = ConvertDurationString( a ); +b = ConvertDurationString( b ); +return ((a < b) ? -1 : ((a > b) ? 1 : 0)); +}, + +"file-size-desc": function ( a, b ) { +a = ConvertDurationString( a ); +b = ConvertDurationString( b ); +return ((a < b) ? 1 : ((a > b) ? -1 : 0)); +} +} ); + +function createTemplateURI(appId) { +var words = document.baseURI.split('/'); +var ind = words.indexOf("proxy"); +if (ind > 0) { +var baseURI = words.slice(0, ind + 1).join('/') + '/' + appId + '/static/stagespage-template.html'; +return baseURI; +} +ind = words.indexOf("history"); +if(ind > 0) { +var baseURI = words.slice(0, ind).join('/') + '/static/stagespage-template.html'; +return baseURI; +} +return location.origin + "/static/stagespage-template.html"; +} + +// This function will only parse the URL under certain formate +// e.g. https://axonitered-jt1.red.ygrid.yahoo.com:50509/history/application_1502220952225_59143/stages/stage/?id=0=0 +function StageEndPoint(appId) { +var words = document.baseURI.split('/'); +var words2 = document.baseURI.split('?'); +var ind = words.indexOf("proxy"); +if (ind > 0) { +var appId = words[ind + 1]; +var stageIdLen = words2[1].indexOf('&'); +var stageId = words2[1].substr(3, stageIdLen - 3); +var newBaseURI = words.slice(0, ind + 2).join('/'); +return newBaseURI + "/api/v1/applications/" + appId + "/stages/" + stageId; +} +ind = words.indexOf("history"); +if (ind > 0) { +var appId = words[ind + 1]; +var attemptId = words[ind + 2]; +var stageIdLen = words2[1].indexOf('&'); +var stageId = words2[1].substr(3, stageIdLen - 3); +var newBaseURI = words.slice(0, ind).join('/'); +if (isNaN(attemptId) || attemptId == "0") { +return newBaseURI + "/api/v1/applications/" + appId + "/stages/" + stageId; +} else { +return newBaseURI + "/api/v1/applications/" + appId + "/" + attemptId + "/stages/" + stageId; +} +} +var stageIdLen = words2[1].indexOf('&'); +var stageId = words2[1].substr(3, stageIdLen - 3); --- End diff -- So the JavaScript substr function takes the first param as the starting index and the second param as the length in characters. And the split happens way before, so it is independent of any anchors. I have tested it for many cases and it seems to work well, but if you are still unsure, I can look into better ways of parsing the URL. ---
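The `substr(start, length)` behavior described above can be pinned down with a small sketch (the helper name is mine; the `?id=` parsing mirrors the quoted `StageEndPoint` logic, skipping the three characters of `id=` and taking everything up to the `&`):

```javascript
// String.prototype.substr takes a start index and a *length* (unlike
// substring, whose second argument is an end index). Given a query
// string such as "id=12&attempt=0", extract the stage id.
function parseStageId(queryString) {
  var stageIdLen = queryString.indexOf('&');
  // Skip "id=" (3 chars) and take (stageIdLen - 3) characters.
  return queryString.substr(3, stageIdLen - 3);
}
```

Note that `substr` is legacy (kept in browsers for web compatibility); `slice(3, stageIdLen)` would do the same job with end-index semantics.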
[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...
Github user pgandhi999 commented on a diff in the pull request: https://github.com/apache/spark/pull/19270#discussion_r143277618 --- Diff: core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala --- @@ -346,7 +346,7 @@ class UISeleniumSuite extends SparkFunSuite with WebBrowser with Matchers with B for { stageId <- 0 to 1 -attemptId <- 0 to 1 +attemptId <- 1 to 0 --- End diff -- So this test was initially failing for the following reason: for stage id 1 and attempt id 0, the stage is designed to fail. Ideally, for this case, when the test tries to connect to the backend to get the JSON file in line 352, it would exit. But as there is no code written to handle the exception, the test would quit and fail, as the last case never ran. So, by changing the order, I ensured that all the stage success cases would run first, only the last one would fail, and the test would pass. I went as far as I could to debug and this was what I could find. If you think there is something more to this, let me know and we can discuss further. ---
[GitHub] spark issue #19444: [SPARK-22214][SQL] Refactor the list hive partitions cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19444 Merged build finished. Test PASSed. ---
[GitHub] spark issue #19444: [SPARK-22214][SQL] Refactor the list hive partitions cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19444 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82519/ Test PASSed. ---
[GitHub] spark issue #19444: [SPARK-22214][SQL] Refactor the list hive partitions cod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19444 **[Test build #82519 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82519/testReport)** for PR 19444 at commit [`1e119ea`](https://github.com/apache/spark/commit/1e119ea7ae414e27ad0832cee347a20d24f1c0cb). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. ---
[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...
Github user pgandhi999 commented on a diff in the pull request: https://github.com/apache/spark/pull/19270#discussion_r143274906 --- Diff: core/src/main/scala/org/apache/spark/ui/exec/ExecutorsTab.scala --- @@ -170,6 +170,17 @@ class ExecutorsListener(storageStatusListener: StorageStatusListener, conf: Spar execTaskSummary.isBlacklisted = isBlacklisted } + def getExecutorHost(eid: String): String = { --- End diff -- As the active StorageStatusList stores the info of all the active executors at any point, I thought it could be used in that manner. I tested it and it seems to work fine. If you think there could be a potential issue with this approach, do let me know and we can discuss further. ---
[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19451 Can one of the admins verify this patch? ---
[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...
Github user pgandhi999 commented on a diff in the pull request: https://github.com/apache/spark/pull/19270#discussion_r143274132 --- Diff: core/src/main/resources/org/apache/spark/ui/static/taskspages.js --- @@ -0,0 +1,474 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +$(document).ajaxStop($.unblockUI); +$(document).ajaxStart(function () { +$.blockUI({message: 'Loading Tasks Page...'}); +}); + +$.extend( $.fn.dataTable.ext.type.order, { +"file-size-pre": ConvertDurationString, + +"file-size-asc": function ( a, b ) { +a = ConvertDurationString( a ); +b = ConvertDurationString( b ); +return ((a < b) ? -1 : ((a > b) ? 1 : 0)); +}, + +"file-size-desc": function ( a, b ) { +a = ConvertDurationString( a ); +b = ConvertDurationString( b ); +return ((a < b) ? 1 : ((a > b) ? -1 : 0)); +} +} ); + +function createTemplateURI(appId) { +var words = document.baseURI.split('/'); +var ind = words.indexOf("proxy"); +if (ind > 0) { +var baseURI = words.slice(0, ind + 1).join('/') + '/' + appId + '/static/stagespage-template.html'; +return baseURI; +} +ind = words.indexOf("history"); +if(ind > 0) { +var baseURI = words.slice(0, ind).join('/') + '/static/stagespage-template.html'; +return baseURI; +} +return location.origin + "/static/stagespage-template.html"; +} + +// This function will only parse the URL under certain formate
// e.g. https://axonitered-jt1.red.ygrid.yahoo.com:50509/history/application_1502220952225_59143/stages/stage/?id=0=0 +function StageEndPoint(appId) { +var words = document.baseURI.split('/'); +var words2 = document.baseURI.split('?'); +var ind = words.indexOf("proxy"); +if (ind > 0) { +var appId = words[ind + 1]; +var stageIdLen = words2[1].indexOf('&'); +var stageId = words2[1].substr(3, stageIdLen - 3); +var newBaseURI = words.slice(0, ind + 2).join('/'); +return newBaseURI + "/api/v1/applications/" + appId + "/stages/" + stageId; +} +ind = words.indexOf("history"); +if (ind > 0) { +var appId = words[ind + 1]; +var attemptId = words[ind + 2]; +var stageIdLen = words2[1].indexOf('&'); +var stageId = words2[1].substr(3, stageIdLen - 3); +var newBaseURI = words.slice(0, ind).join('/'); +if (isNaN(attemptId) || attemptId == "0") { +return newBaseURI + "/api/v1/applications/" + appId + "/stages/" + stageId; +} else { +return newBaseURI + "/api/v1/applications/" + appId + "/" + attemptId + "/stages/" + stageId; +} +} +var stageIdLen = words2[1].indexOf('&'); +var stageId = words2[1].substr(3, stageIdLen - 3); +return location.origin + "/api/v1/applications/" + appId + "/stages/" + stageId; +} + +function sortNumber(a,b) { +return a - b; +} + +function quantile(array, percentile) { +index = percentile/100. * (array.length-1); +if (Math.floor(index) == index) { + result = array[index]; +} else { +var i = Math.floor(index); +fraction = index - i; +result = array[i]; +} +return result; +} + +$(document).ready(function () { +$.extend($.fn.dataTable.defaults, { +stateSave: true, +lengthMenu: [[20, 40, 60, 100, -1], [20, 40, 60, 100, "All"]], +pageLength: 20 +}); + +$("#showAdditionalMetrics").append( +"" + +"" + +" Show Additional Metrics" + +"" + +"" + +" Select All" + +" Scheduler Delay" + +" Task Deserialization Time" + +" Shuffle Read Blocked Time" + +" Shuffle Remote Reads" +
[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...
Github user pgandhi999 commented on a diff in the pull request: https://github.com/apache/spark/pull/19270#discussion_r143274000 --- Diff: core/src/main/resources/org/apache/spark/ui/static/taskspages.js --- @@ -0,0 +1,474 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +$(document).ajaxStop($.unblockUI); +$(document).ajaxStart(function () { +$.blockUI({message: 'Loading Tasks Page...'}); +}); + +$.extend( $.fn.dataTable.ext.type.order, { +"file-size-pre": ConvertDurationString, + +"file-size-asc": function ( a, b ) { +a = ConvertDurationString( a ); +b = ConvertDurationString( b ); +return ((a < b) ? -1 : ((a > b) ? 1 : 0)); +}, + +"file-size-desc": function ( a, b ) { +a = ConvertDurationString( a ); +b = ConvertDurationString( b ); +return ((a < b) ? 1 : ((a > b) ? -1 : 0)); +} +} ); + +function createTemplateURI(appId) { --- End diff -- Moved them all to utils.js --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...
Github user pgandhi999 commented on a diff in the pull request: https://github.com/apache/spark/pull/19270#discussion_r143274034 --- Diff: core/src/main/resources/org/apache/spark/ui/static/taskspages.js --- @@ -0,0 +1,474 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +$(document).ajaxStop($.unblockUI); +$(document).ajaxStart(function () { +$.blockUI({message: 'Loading Tasks Page...'}); +}); + +$.extend( $.fn.dataTable.ext.type.order, { +"file-size-pre": ConvertDurationString, + +"file-size-asc": function ( a, b ) { +a = ConvertDurationString( a ); +b = ConvertDurationString( b ); +return ((a < b) ? -1 : ((a > b) ? 1 : 0)); +}, + +"file-size-desc": function ( a, b ) { +a = ConvertDurationString( a ); +b = ConvertDurationString( b ); +return ((a < b) ? 1 : ((a > b) ? 
-1 : 0)); +} +} ); + +function createTemplateURI(appId) { +var words = document.baseURI.split('/'); +var ind = words.indexOf("proxy"); +if (ind > 0) { +var baseURI = words.slice(0, ind + 1).join('/') + '/' + appId + '/static/stagespage-template.html'; +return baseURI; +} +ind = words.indexOf("history"); +if(ind > 0) { +var baseURI = words.slice(0, ind).join('/') + '/static/stagespage-template.html'; +return baseURI; +} +return location.origin + "/static/stagespage-template.html"; +} + +// This function will only parse the URL under certain formate +// e.g. https://axonitered-jt1.red.ygrid.yahoo.com:50509/history/application_1502220952225_59143/stages/stage/?id=0=0 +function StageEndPoint(appId) { --- End diff -- Fixed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
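The quoted `StageEndPoint` extracts the stage id with fixed `substr(3, ...)` arithmetic, which only works for one query-string shape (note also that the example URL has lost its `&` separator in the archive). As a rough illustration of a more robust parse, here is a Python sketch using the standard `urllib.parse` module — the function name and URL layouts are assumptions mirrored from the quoted JavaScript, not code from the PR:

```python
from urllib.parse import urlparse, parse_qs

def stage_endpoint(page_uri, app_id):
    """Rebuild the stages REST endpoint from the page URI, mirroring the
    proxy / history / standalone cases in the quoted JavaScript.
    All names here are illustrative, not taken from the PR."""
    parsed = urlparse(page_uri)
    base = f"{parsed.scheme}://{parsed.netloc}"
    # parse_qs is robust to parameter order, unlike substr(3, ...) slicing
    stage_id = parse_qs(parsed.query)["id"][0]
    words = parsed.path.split("/")
    if "proxy" in words:
        ind = words.index("proxy")
        prefix = "/".join(words[:ind + 2])
        return f"{base}{prefix}/api/v1/applications/{words[ind + 1]}/stages/{stage_id}"
    if "history" in words:
        ind = words.index("history")
        prefix = "/".join(words[:ind])
        hist_app, attempt = words[ind + 1], words[ind + 2]
        if not attempt.isdigit() or attempt == "0":
            return f"{base}{prefix}/api/v1/applications/{hist_app}/stages/{stage_id}"
        return f"{base}{prefix}/api/v1/applications/{hist_app}/{attempt}/stages/{stage_id}"
    return f"{base}/api/v1/applications/{app_id}/stages/{stage_id}"
```

The same idea applies in the JavaScript original: parsing the query string into key/value pairs instead of slicing at hard-coded offsets keeps the endpoint builder independent of parameter order.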
[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...
Github user pgandhi999 commented on a diff in the pull request: https://github.com/apache/spark/pull/19270#discussion_r143274083 --- Diff: core/src/main/resources/org/apache/spark/ui/static/taskspages.js --- @@ -0,0 +1,474 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +$(document).ajaxStop($.unblockUI); +$(document).ajaxStart(function () { +$.blockUI({message: 'Loading Tasks Page...'}); +}); + +$.extend( $.fn.dataTable.ext.type.order, { +"file-size-pre": ConvertDurationString, + +"file-size-asc": function ( a, b ) { +a = ConvertDurationString( a ); +b = ConvertDurationString( b ); +return ((a < b) ? -1 : ((a > b) ? 1 : 0)); +}, + +"file-size-desc": function ( a, b ) { +a = ConvertDurationString( a ); +b = ConvertDurationString( b ); +return ((a < b) ? 1 : ((a > b) ? 
-1 : 0)); +} +} ); + +function createTemplateURI(appId) { +var words = document.baseURI.split('/'); +var ind = words.indexOf("proxy"); +if (ind > 0) { +var baseURI = words.slice(0, ind + 1).join('/') + '/' + appId + '/static/stagespage-template.html'; +return baseURI; +} +ind = words.indexOf("history"); +if(ind > 0) { +var baseURI = words.slice(0, ind).join('/') + '/static/stagespage-template.html'; +return baseURI; +} +return location.origin + "/static/stagespage-template.html"; +} + +// This function will only parse the URL under certain formate +// e.g. https://axonitered-jt1.red.ygrid.yahoo.com:50509/history/application_1502220952225_59143/stages/stage/?id=0=0 +function StageEndPoint(appId) { +var words = document.baseURI.split('/'); +var words2 = document.baseURI.split('?'); +var ind = words.indexOf("proxy"); +if (ind > 0) { +var appId = words[ind + 1]; +var stageIdLen = words2[1].indexOf('&'); +var stageId = words2[1].substr(3, stageIdLen - 3); +var newBaseURI = words.slice(0, ind + 2).join('/'); +return newBaseURI + "/api/v1/applications/" + appId + "/stages/" + stageId; +} +ind = words.indexOf("history"); +if (ind > 0) { +var appId = words[ind + 1]; +var attemptId = words[ind + 2]; +var stageIdLen = words2[1].indexOf('&'); +var stageId = words2[1].substr(3, stageIdLen - 3); +var newBaseURI = words.slice(0, ind).join('/'); +if (isNaN(attemptId) || attemptId == "0") { +return newBaseURI + "/api/v1/applications/" + appId + "/stages/" + stageId; +} else { +return newBaseURI + "/api/v1/applications/" + appId + "/" + attemptId + "/stages/" + stageId; +} +} +var stageIdLen = words2[1].indexOf('&'); +var stageId = words2[1].substr(3, stageIdLen - 3); +return location.origin + "/api/v1/applications/" + appId + "/stages/" + stageId; +} + +function sortNumber(a,b) { +return a - b; +} + +function quantile(array, percentile) { +index = percentile/100. 
* (array.length-1); +if (Math.floor(index) == index) { + result = array[index]; +} else { +var i = Math.floor(index); +fraction = index - i; +result = array[i]; +} +return result; +} + +$(document).ready(function () { +$.extend($.fn.dataTable.defaults, { --- End diff -- Done --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
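As written, the quoted `quantile` computes `fraction` but never uses it, so fractional indices silently truncate instead of interpolating, and `index`, `fraction`, and `result` leak as globals. A corrected sketch (in Python for brevity; it assumes the input array is already sorted, as the JS caller presumably guarantees):

```python
import math

def quantile(array, percentile):
    """Linear-interpolation quantile over a sorted sequence.
    Fixes the quoted JS, which computed `fraction` but never applied it."""
    index = percentile / 100.0 * (len(array) - 1)
    i = math.floor(index)
    if i == index:
        return array[int(index)]
    fraction = index - i
    # interpolate between the two neighbouring values
    return array[i] + (array[i + 1] - array[i]) * fraction
```

For example, the 50th percentile of `[1, 2, 3, 4]` interpolates to 2.5 rather than truncating to 2.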
[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...
Github user pgandhi999 commented on a diff in the pull request: https://github.com/apache/spark/pull/19270#discussion_r143273927 --- Diff: core/src/main/resources/org/apache/spark/ui/static/taskspages.js --- @@ -0,0 +1,474 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +$(document).ajaxStop($.unblockUI); +$(document).ajaxStart(function () { +$.blockUI({message: 'Loading Tasks Page...'}); --- End diff -- Fixed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19270: [SPARK-21809] : Change Stage Page to use datatabl...
Github user pgandhi999 commented on a diff in the pull request: https://github.com/apache/spark/pull/19270#discussion_r143273867 --- Diff: core/src/main/scala/org/apache/spark/status/api/v1/AllStagesResource.scala --- @@ -138,21 +155,61 @@ private[v1] object AllStagesResource { } } - def convertTaskData(uiData: TaskUIData, lastUpdateTime: Option[Long]): TaskData = { + private def getGettingResultTime(info: TaskInfo, currentTime: Long): Long = { --- End diff -- Fixed it --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule
GitHub user sathiyapk opened a pull request: https://github.com/apache/spark/pull/19451 SPARK-22181 Adds ReplaceExceptWithNotFilter rule ## What changes were proposed in this pull request? Adds a new optimisation rule 'ReplaceExceptWithNotFilter' that replaces the Except logical operator with a Filter operator, and schedules it before the 'ReplaceExceptWithAntiJoin' rule. This way we can avoid an expensive join operation if one or both of the datasets of the Except operation are fully derived out of Filters from the same parent. ## How was this patch tested? The patch is tested locally using spark-shell + unit tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sathiyapk/spark SPARK-22181-optimize-exceptWithFilter Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19451.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19451 commit 1baecfdce9552b1e7853ace425970ea49c8f2304 Author: Sathiya Date: 2017-10-06T18:57:52Z SPARK-22181 Adds ReplaceExceptWithNotFilter rule --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
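The rewrite this PR proposes — folding the right-hand filter, negated, into the left-hand filter when both sides of the Except scan the same relation — can be sketched with a toy plan rewriter. The node classes below are illustrative stand-ins, not the Catalyst implementation:

```python
from dataclasses import dataclass

# Toy logical-plan nodes; stand-ins for Catalyst operators.
@dataclass(frozen=True)
class Relation:
    name: str

@dataclass(frozen=True)
class Filter:
    condition: str
    child: object

@dataclass(frozen=True)
class Except:
    left: object
    right: object

@dataclass(frozen=True)
class Distinct:
    child: object

def replace_except_with_not_filter(plan):
    """When both Except children filter the same relation, fold the right
    condition, negated, into the left filter (plus a Distinct), avoiding
    the left-anti join that ReplaceExceptWithAntiJoin would otherwise emit."""
    if (isinstance(plan, Except)
            and isinstance(plan.left, Filter)
            and isinstance(plan.right, Filter)
            and plan.left.child == plan.right.child):
        cond = f"({plan.left.condition}) AND NOT ({plan.right.condition})"
        return Distinct(Filter(cond, plan.left.child))
    return plan

t = Relation("t")
rewritten = replace_except_with_not_filter(
    Except(Filter("a > 1", t), Filter("b < 2", t)))
```

The Distinct on top preserves Except's set semantics; when the children do not share a parent, the plan is left alone for the anti-join rule to handle.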
[GitHub] spark issue #19270: [SPARK-21809] : Change Stage Page to use datatables to s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19270 **[Test build #82522 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82522/testReport)** for PR 19270 at commit [`25b5215`](https://github.com/apache/spark/commit/25b5215cd6dbbc11b5e5d8d56bff6743578de2b5). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18460: [SPARK-21247][SQL] Type comparison should respect...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/18460#discussion_r143273532 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala --- @@ -100,6 +101,17 @@ object TypeCoercion { case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) => Some(TimestampType) +case (t1 @ StructType(fields1), t2 @ StructType(fields2)) if t1.sameType(t2) => + Some(StructType(fields1.zip(fields2).map { case (f1, f2) => +// Since `t1.sameType(t2)` is true, two StructTypes have the same DataType +// except `name` (in case of `spark.sql.caseSensitive=false`) and `nullable`. +// - Different names: use a lower case name because findTightestCommonType is commutative. +// - Different nullabilities: `nullable` is true iff one of them is nullable. +val name = if (f1.name == f2.name) f1.name else f1.name.toLowerCase(Locale.ROOT) --- End diff -- For a test case, does `nested column in the references` mean `WHERE` clause? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18460: [SPARK-21247][SQL] Type comparison should respect...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/18460#discussion_r143273000 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala --- @@ -100,6 +101,17 @@ object TypeCoercion { case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) => Some(TimestampType) +case (t1 @ StructType(fields1), t2 @ StructType(fields2)) if t1.sameType(t2) => + Some(StructType(fields1.zip(fields2).map { case (f1, f2) => +// Since `t1.sameType(t2)` is true, two StructTypes have the same DataType +// except `name` (in case of `spark.sql.caseSensitive=false`) and `nullable`. +// - Different names: use a lower case name because findTightestCommonType is commutative. +// - Different nullabilities: `nullable` is true iff one of them is nullable. +val name = if (f1.name == f2.name) f1.name else f1.name.toLowerCase(Locale.ROOT) --- End diff -- Yes. Hive does like the following. ```sql hive> CREATE TABLE S AS SELECT named_struct('A',1); hive> DESCRIBE S; OK _c0 struct ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
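The merge policy described in the quoted comment — keep the field name when both sides agree, otherwise lowercase it so the result is commutative, and mark the field nullable iff either side is — can be shown in isolation. This is a standalone sketch of the policy, not the Spark code; field types are simplified to plain strings:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StructField:
    name: str
    data_type: str
    nullable: bool

def merge_fields(f1, f2):
    """Mirror of the quoted rule: lowercase on name mismatch so that
    findTightestCommonType stays commutative; nullable iff either side is."""
    assert f1.data_type == f2.data_type  # t1.sameType(t2) guarantees this
    name = f1.name if f1.name == f2.name else f1.name.lower()
    return StructField(name, f1.data_type, f1.nullable or f2.nullable)
```

With case-insensitive resolution, merging `A` (non-nullable) with `a` (nullable) yields a lowercase, nullable `a` regardless of argument order.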
[GitHub] spark issue #18966: [SPARK-21751][SQL] CodeGeneraor.splitExpressions counts ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18966 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18966: [SPARK-21751][SQL] CodeGeneraor.splitExpressions counts ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18966 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82518/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18966: [SPARK-21751][SQL] CodeGeneraor.splitExpressions counts ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18966 **[Test build #82518 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82518/testReport)** for PR 18966 at commit [`b04c09c`](https://github.com/apache/spark/commit/b04c09c3683f104909713344c90e46b4129f5401). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19250#discussion_r143272791 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -1213,6 +1213,71 @@ case class ToUTCTimestamp(left: Expression, right: Expression) } /** + * This modifies a timestamp to show how the display time changes going from one timezone to + * another, for the same instant in time. + * + * We intentionally do not provide an ExpressionDescription as this is not meant to be exposed to + * users, its only used for internal conversions. + */ +private[spark] case class TimestampTimezoneCorrection( --- End diff -- > Additionally, timezone rules have changed over time and keep changing. Ah, yes, that makes sense... it's also why my initial tests were failing at the DST boundaries. :-/ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19450: [SPARK-22218] spark shuffle services fails to update sec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19450 **[Test build #82521 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82521/testReport)** for PR 19450 at commit [`5a9ef13`](https://github.com/apache/spark/commit/5a9ef1396a28b535042e336a2f43d80104ce95e5). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19450: [SPARK-22218] spark shuffle services fails to update sec...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/19450 ping @vanzin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19450: [SPARK-22218] spark shuffle services fails to upd...
GitHub user tgravescs opened a pull request: https://github.com/apache/spark/pull/19450 [SPARK-22218] spark shuffle services fails to update secret on app re-attempts This patch fixes application re-attempts when running spark on yarn using the external shuffle service with security on. Currently executors will fail to launch on any application re-attempt when launched on a nodemanager that had an executor from the first attempt. The reason is that we aren't updating the secret key after the first application attempt. The fix here is to just remove the containsKey check to see if it already exists. In this way, we always add it and make sure it's the most recent secret. Similarly, remove the containsKey check on the remove path, since it's just an extra check that isn't really needed. Note this worked before Spark 2.2 because the check used to be contains (which was looking for the value) rather than containsKey, so that never matched and it was just always adding the new secret. The patch was tested on a 10-node cluster, and a unit test was added. The test run was a wordcount where the output directory already existed. With the bug present, the application attempt failed with the max number of executor failures, which were all SaslExceptions. With the fix present, the application re-attempts fail with directory already exists, and when you remove the directory between attempts the re-attempts succeed. 
You can merge this pull request into a Git repository by running: $ git pull https://github.com/tgravescs/spark SPARK-22218 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19450.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19450 commit aa9a2c2705cd59f9e1caa7e78ad7eafad3ff2789 Author: Thomas Graves Date: 2017-10-06T18:45:13Z [SPARK-22218] spark shuffle services fails to update secret on application re-attempts commit 5a9ef1396a28b535042e336a2f43d80104ce95e5 Author: Thomas Graves Date: 2017-10-06T18:55:43Z minor updates --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
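The behavioural difference the PR describes — a guarded insert keeps the stale attempt-1 secret, while an unconditional put stores the most recent one — can be illustrated with a minimal registry. The class and method names are made up for illustration; the real logic lives in the shuffle service's secret manager:

```python
class SecretRegistry:
    """Per-app secrets in a long-lived shuffle service. A re-attempt reuses
    the same application id, so its secret must replace the stale one."""
    def __init__(self):
        self._secrets = {}

    def register_app_buggy(self, app_id, secret):
        # pre-fix behaviour: a re-attempt keeps the attempt-1 secret,
        # so SASL auth from the new attempt's executors fails
        if app_id not in self._secrets:
            self._secrets[app_id] = secret

    def register_app_fixed(self, app_id, secret):
        # the fix: unconditionally store the most recent secret
        self._secrets[app_id] = secret

    def get(self, app_id):
        return self._secrets.get(app_id)

registry = SecretRegistry()
registry.register_app_buggy("app_1", "secret-attempt-1")
registry.register_app_buggy("app_1", "secret-attempt-2")  # ignored: stale secret kept
stale = registry.get("app_1")
registry.register_app_fixed("app_1", "secret-attempt-2")  # always overwrites
fresh = registry.get("app_1")
```

The guarded variant is exactly the containsKey check the patch removes: it looks harmless but silently pins the first attempt's secret for the lifetime of the node manager.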
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19448 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82517/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19448 **[Test build #82517 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82517/testReport)** for PR 19448 at commit [`e6fdbdc`](https://github.com/apache/spark/commit/e6fdbdcf4118283abd22f7b14586ed742d225657). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19448 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18664 I am okay with proceeding separately for dealing with timezone, and matching the behaviour with Arrow to the existing behaviour without Arrow here with respect to timezone. Less sure yet if we need a design doc for dealing with timezone but will follow if other committers feel so. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143263694 --- Diff: python/pyspark/sql/group.py --- @@ -192,7 +193,69 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col) else: jgd = self._jgd.pivot(pivot_col, values) -return GroupedData(jgd, self.sql_ctx) +return GroupedData(jgd, self._df) + +@since(2.3) +def apply(self, udf): --- End diff -- @rxin just to recap our discussion regarding naming: You asked: > What's the difference between this one and the transform function you also proposed? I'm trying to see if all the naming makes sense when considered together. The answer is: `transform` takes a function pd.Series -> pd.Series and applies it to each column (or a subset of columns). The input and output Series are of the same length. `apply` takes a function pd.DataFrame -> pd.DataFrame and applies it to each group, similar to `flatMapGroups`. Does this make sense to you? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
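The distinction drawn above maps directly onto plain pandas, outside Spark entirely; a small illustration (column names and data values are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"g": ["x", "x", "y"], "v": [1.0, 2.0, 4.0]})

# transform: pd.Series -> pd.Series of the SAME length, applied per column;
# here each value is centered on its group mean
centered = df.groupby("g")["v"].transform(lambda s: s - s.mean())

# apply: pd.DataFrame -> pd.DataFrame, applied per group; the output length
# may differ from the input (akin to flatMapGroups)
tops = df.groupby("g", group_keys=False).apply(lambda grp: grp.nlargest(1, "v"))
```

`transform` keeps one output row per input row, while `apply` here shrinks each group to a single row — the length-preserving vs. length-changing contrast is the whole naming question.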
[GitHub] spark issue #19372: [SPARK-22156][MLLIB] Fix update equation of learning rat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19372 **[Test build #3943 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3943/testReport)** for PR 19372 at commit [`2ea3f18`](https://github.com/apache/spark/commit/2ea3f18bb002c38f387e01c31c6fb3ce2cb37231). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/18664 If we all agree on the necessity of a design doc first, I can create a Jira and we can make progress there. What do you all think? @BryanCutler @gatorsmile @HyukjinKwon --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18229: [SPARK-20691][CORE] Difference between Storage Memory as...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/18229 gentle ping @mkesselaers --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/18664 I agree. I think we need some high-level document describing these differences so we can discuss them. I think we should be more careful about Arrow-version behavior before releasing support for timestamp and nested types. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19077: [SPARK-21860][core]Improve memory reuse for heap memory ...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19077 gentle ping @jerryshao for review --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18664 Yup, I admit there could be some exceptions (there have been actually) but that should still be the baseline we should basically pursue. Probably, we could treat this Arrow optimisation as an evolving configuration option for now due to those differences but the final goal should be the matched behaviour. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19449: [SPARK-22219][SQL] Refactor code to get a value for "spa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19449 **[Test build #82520 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82520/testReport)** for PR 19449 at commit [`be4220d`](https://github.com/apache/spark/commit/be4220dcb218366f81f4015af56bee9fade90066). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19449: [SPARK-22219][SQL] Refactor code to get a value f...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/19449 [SPARK-22219][SQL] Refactor code to get a value for "spark.sql.codegen.comments" ## What changes were proposed in this pull request? This PR refactors code to get a value for "spark.sql.codegen.comments" by avoiding `SparkEnv.get.conf`. This PR uses `SQLConf.get.codegenComments` since `SQLConf.get` always returns an instance of `SQLConf`. ## How was this patch tested? Added test case to `DebuggingSuite` You can merge this pull request into a Git repository by running: $ git pull https://github.com/kiszk/spark SPARK-22219 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19449.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19449 commit be4220dcb218366f81f4015af56bee9fade90066 Author: Kazuaki IshizakiDate: 2017-10-06T17:54:03Z Initial commit --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...
Github user zivanfi commented on a diff in the pull request: https://github.com/apache/spark/pull/19250#discussion_r143257840 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -1213,6 +1213,71 @@ case class ToUTCTimestamp(left: Expression, right: Expression) } /** + * This modifies a timestamp to show how the display time changes going from one timezone to + * another, for the same instant in time. + * + * We intentionally do not provide an ExpressionDescription as this is not meant to be exposed to + * users, its only used for internal conversions. + */ +private[spark] case class TimestampTimezoneCorrection( --- End diff -- Unfortunately the offset depends on the actual date, so a timezone conversion can not be simplified to a simple delta. Daylight saving time starts and ends on different days in different timezones, while some timezones don't have DST changes at all. Additionally, timezone rules have changed over time and keep changing. Both the basic timezone offset and the DST rules themselves could change (and have changed) over time. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
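The point that a zone-to-zone conversion cannot be a fixed delta is easy to demonstrate with the standard timezone database. In this sketch (the zone choices are arbitrary), the Los Angeles-to-Sydney correction differs by two hours between January and July, because the two hemispheres observe DST in opposite halves of the year:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

LA = ZoneInfo("America/Los_Angeles")
SYDNEY = ZoneInfo("Australia/Sydney")

def correction_hours(instant_utc):
    """Hours to add to a Los Angeles wall-clock reading to obtain the
    Sydney wall-clock reading for the same instant in time."""
    off_sydney = instant_utc.astimezone(SYDNEY).utcoffset()
    off_la = instant_utc.astimezone(LA).utcoffset()
    return (off_sydney - off_la).total_seconds() / 3600

jan = correction_hours(datetime(2017, 1, 15, 12, tzinfo=timezone.utc))  # AEDT vs PST
jul = correction_hours(datetime(2017, 7, 15, 12, tzinfo=timezone.utc))  # AEST vs PDT
```

In January the gap is 19 hours (UTC+11 vs UTC-8) and in July it is 17 (UTC+10 vs UTC-7) — which is why the correction expression must consult full timezone rules per timestamp rather than apply a constant offset.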
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/18664 > The baseline should be (as said above): Internal optimisation should not introduce any behaviour change, and we are discouraged to change the previous behaviour unless it has bugs in general. I am not sure I totally agree with this. Take structs, for instance: in the non-Arrow version, a struct is turned into a `pyspark.sql.Row` object in `toPandas()`. I wouldn't call this a bug, because it's a design choice. However, it is maybe not the best design choice, because if the user pickles the `pandas.DataFrame` to a file and sends it to someone, the receiver won't be able to deserialize that `pandas.DataFrame` without having the pyspark library dependency. Now I am **not** trying to argue we should or should not turn a struct column into `pyspark.sql.Row`; my point is that there might be some design choices in the non-Arrow version that are not ideal, and maybe it's not a hard requirement to make behavior 100% the same between the non-Arrow and Arrow versions. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18732 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82515/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18732 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18732 **[Test build #82515 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82515/testReport)** for PR 18732 at commit [`284ba00`](https://github.com/apache/spark/commit/284ba00be0cbc357ac42900f8c4af57901d147c5). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19424 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82514/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19424 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19424 **[Test build #82514 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82514/testReport)** for PR 19424 at commit [`24f1a75`](https://github.com/apache/spark/commit/24f1a757e4df3b44e5242188e67bfffd6a3f0426). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait DataSourceReaderHolder ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19294: [SPARK-21549][CORE] Respect OutputFormats with no...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19294#discussion_r143252186 --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala --- @@ -57,6 +60,15 @@ class HadoopMapReduceCommitProtocol(jobId: String, path: String) */ private def absPathStagingDir: Path = new Path(path, "_temporary-" + jobId) + /** + * Checks whether there are files to be committed to an absolute output location. + * + * As the committing and aborting the job occurs on driver where `addedAbsPathFiles` is always + * null, it is necessary to check whether the output path is specified, that may not be the case + * for committers not writing to distributed file systems. --- End diff -- This is much better.
[GitHub] spark pull request #19294: [SPARK-21549][CORE] Respect OutputFormats with no...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/19294#discussion_r143250260 --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala --- @@ -57,6 +60,15 @@ class HadoopMapReduceCommitProtocol(jobId: String, path: String) */ private def absPathStagingDir: Path = new Path(path, "_temporary-" + jobId) + /** + * Checks whether there are files to be committed to an absolute output location. + * + * As the committing and aborting the job occurs on driver where `addedAbsPathFiles` is always + * null, it is necessary to check whether the output path is specified, that may not be the case + * for committers not writing to distributed file systems. --- End diff -- How about : "As committing and aborting a job occurs on driver, where `addedAbsPathFiles` is always null, it is necessary to check whether the output path is specified. Output path need not be relevant for committers not writing to distributed file systems"
[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19250#discussion_r143246203 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/TimestampTableTimeZone.scala --- @@ -0,0 +1,213 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.{AnalysisException, SparkSession} +import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.catalyst.analysis.UnresolvedException +import org.apache.spark.sql.catalyst.catalog.CatalogTable +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.types.{StringType, TimestampType} + +/** + * Apply a correction to data loaded from, or saved to, Parquet, so that it timestamps can be read + * like TIMESTAMP WITHOUT TIMEZONE. This gives correct behavior if you process data with + * machines in different timezones, or if you access the data from multiple SQL engines. 
+ */ +private[sql] case class TimestampTableTimeZone(sparkSession: SparkSession) --- End diff -- I'm also trying to see if this can be simplified; I guess the main thing is Imran's comment about not being able to use `transformUp`. I need to take a look at whether that's really the case. This rule also doesn't seem to handle `InsertIntoHiveTable`, which is in the hive module so can't be handled here. Probably will need a new rule based on this one in the hive module.
[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18931 Merged build finished. Test FAILed.
[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19250#discussion_r143245229 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala --- @@ -230,6 +230,13 @@ case class AlterTableSetPropertiesCommand( isView: Boolean) extends RunnableCommand { + if (isView) { +properties.get(TimestampTableTimeZone.TIMEZONE_PROPERTY).foreach { _ => --- End diff -- [HiveQL](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView) explicitly allows properties in view; I've never used them, though.
[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18931 **[Test build #82516 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82516/testReport)** for PR 18931 at commit [`edb73d6`](https://github.com/apache/spark/commit/edb73d6e9705e1ce5a836b9a394fec706d68aec4). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19447 We still need to add a test case. We should also capture the exception and issue a better one; otherwise, users will not know what they should do when they hit such a confusing error. cc @kiszk @rednaxelafx Could you take a look at this PR? Do we still hit this with the current hard-coded value? If so, what is an easy way to write a test case?
[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19250#discussion_r143243769 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -1213,6 +1213,71 @@ case class ToUTCTimestamp(left: Expression, right: Expression) } /** + * This modifies a timestamp to show how the display time changes going from one timezone to + * another, for the same instant in time. + * + * We intentionally do not provide an ExpressionDescription as this is not meant to be exposed to + * users, its only used for internal conversions. + */ +private[spark] case class TimestampTimezoneCorrection( --- End diff -- Hmm, let me see if I can figure that out.
[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18931 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82516/ Test FAILed.
[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19424 Merged build finished. Test PASSed.
[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19250#discussion_r143244400 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala --- @@ -266,6 +267,10 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { * @since 1.4.0 */ def insertInto(tableName: String): Unit = { +extraOptions.get(TimestampTableTimeZone.TIMEZONE_PROPERTY).foreach { tz => --- End diff -- The [spec](https://docs.google.com/document/d/1XmyVjr3eOJiNFjVeSnmjIU60Hq-XiZB03pgi3r1razM/edit#heading=h.jnyh4t7bhnlx) requests errors when using invalid time zones. I guess this would still fail with a different error in that case, so I can remove this if you're really against adding it.
[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/19250#discussion_r143243338 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -1213,6 +1213,71 @@ case class ToUTCTimestamp(left: Expression, right: Expression) } /** + * This modifies a timestamp to show how the display time changes going from one timezone to + * another, for the same instant in time. + * + * We intentionally do not provide an ExpressionDescription as this is not meant to be exposed to + * users, its only used for internal conversions. + */ +private[spark] case class TimestampTimezoneCorrection( --- End diff -- What I'm saying is the analysis rule can just determine the delta, and then just do a simple add/delete.
[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19250#discussion_r143243228 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -1015,6 +1020,10 @@ object DateTimeUtils { guess } + def convertTz(ts: SQLTimestamp, fromZone: String, toZone: String): SQLTimestamp = { +convertTz(ts, getTimeZone(fromZone), getTimeZone(toZone)) --- End diff -- I guess caching the value as done by the `FromUTCTimestamp` expression is the right way to go?
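The caching idea referenced above can be sketched as a simple memoized lookup: `TimeZone.getTimeZone` parses its string argument on every call, so resolving each zone ID once and reusing the instance avoids per-row work. This is an illustrative sketch only (the class and method names are hypothetical, not Spark's actual implementation):

```java
import java.util.TimeZone;
import java.util.concurrent.ConcurrentHashMap;

public class TimeZoneCache {
    // Per-ID cache: each zone string is parsed at most once, and all
    // subsequent lookups return the same shared TimeZone instance.
    private static final ConcurrentHashMap<String, TimeZone> CACHE = new ConcurrentHashMap<>();

    static TimeZone cachedTimeZone(String id) {
        return CACHE.computeIfAbsent(id, TimeZone::getTimeZone);
    }

    public static void main(String[] args) {
        TimeZone first = cachedTimeZone("America/Los_Angeles");
        TimeZone again = cachedTimeZone("America/Los_Angeles");
        System.out.println(first == again); // true: same cached instance
    }
}
```

`ConcurrentHashMap.computeIfAbsent` keeps the cache thread-safe, which matters if the lookup happens inside tasks running on multiple threads.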
[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19424 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82513/ Test PASSed.
[GitHub] spark issue #19424: [SPARK-22197][SQL] push down operators to data source be...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19424 **[Test build #82513 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82513/testReport)** for PR 19424 at commit [`e880328`](https://github.com/apache/spark/commit/e88032896512c2d6945b1cfd017a462961726907). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait DataSourceReaderHolder `
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18664 The baseline should be (as said above): internal optimisation should not introduce any behaviour change, and we are discouraged from changing the previous behaviour unless it has bugs, in general. So, the issue is that the previous behaviour without Arrow does not respect `SESSION_LOCAL_TIMEZONE` but should; however, this is a behaviour change, so we should discuss further and possibly introduce a configuration to control this to prevent the behaviour change. Also, to be clear, I assume we agree that not respecting `SESSION_LOCAL_TIMEZONE` is not a bug but a behaviour change to be made as an improvement, right? How about matching the behaviour with Arrow to the previous behaviour without Arrow (not respecting `SESSION_LOCAL_TIMEZONE`) first, and then separately making a PR to introduce a configuration to respect `SESSION_LOCAL_TIMEZONE` for both? Actually, I somewhat doubt whether we should respect `SESSION_LOCAL_TIMEZONE` when we `collect()` or `toPandas()`, and want to separate this discussion into another PR, JIRA or mailing list if possible.
[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19250#discussion_r143242256 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -1213,6 +1213,71 @@ case class ToUTCTimestamp(left: Expression, right: Expression) } /** + * This modifies a timestamp to show how the display time changes going from one timezone to + * another, for the same instant in time. + * + * We intentionally do not provide an ExpressionDescription as this is not meant to be exposed to + * users, its only used for internal conversions. + */ +private[spark] case class TimestampTimezoneCorrection( --- End diff -- I guess you could use ToUTCTimestamp / FromUTCTimestamp for this, but that would be more expensive since you'd be doing the conversion twice for each value.
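The "conversion twice" cost mentioned above can be sketched as follows: composing the two existing expressions means first mapping the wall-clock value through UTC (`ToUTCTimestamp`) and then back out into the target zone (`FromUTCTimestamp`), i.e. two offset lookups per value. This is an illustrative sketch with hypothetical names; Spark's actual `DateTimeUtils.convertTz` works on microseconds and handles more edge cases:

```java
import java.util.TimeZone;

public class TzShiftSketch {
    // Shift epoch millis so the wall-clock rendering moves from `from` to
    // `to` for the same instant; each step is one offset lookup.
    static long shiftWallClock(long millis, TimeZone from, TimeZone to) {
        long utc = millis - from.getOffset(millis); // step 1: ToUTCTimestamp
        return utc + to.getOffset(utc);             // step 2: FromUTCTimestamp
    }

    public static void main(String[] args) {
        TimeZone utc = TimeZone.getTimeZone("UTC");
        TimeZone la = TimeZone.getTimeZone("America/Los_Angeles");
        // The epoch (1970-01-01 00:00 UTC) read as an LA wall clock moves
        // back 8 hours (PST, no DST in January).
        System.out.println(shiftWallClock(0L, utc, la)); // -28800000
    }
}
```

A dedicated correction expression could fold the two offset lookups into one precomputed delta when both zones are literals, which is the cheaper alternative being discussed.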
[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...
Github user jsnowacki commented on the issue: https://github.com/apache/spark/pull/19443 The only other Python-specific reason I can think of, if the above are not compelling enough, is that when a function lacks the call-by-column-name option, we get the error only at run time, due to how the names and types are mapped onto JVM code. This is partially related to the dynamic nature of typing in Python. So while this is spotted at compile time in Java/Scala, which may be annoying and considered inconsistent, in Python a task may fail after hours due to this issue. The same is true for R (and SQL to some extent).
[GitHub] spark issue #19444: [SPARK-22214][SQL] Refactor the list hive partitions cod...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19444 LGTM
[GitHub] spark pull request #18460: [SPARK-21247][SQL] Type comparision should respec...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18460#discussion_r143238804 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala --- @@ -100,6 +101,17 @@ object TypeCoercion { case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) => Some(TimestampType) +case (t1 @ StructType(fields1), t2 @ StructType(fields2)) if t1.sameType(t2) => + Some(StructType(fields1.zip(fields2).map { case (f1, f2) => +// Since `t1.sameType(t2)` is true, two StructTypes have the same DataType +// except `name` (in case of `spark.sql.caseSensitive=false`) and `nullable`. +// - Different names: use a lower case name because findTightestCommonType is commutative. +// - Different nullabilities: `nullable` is true iff one of them is nullable. +val name = if (f1.name == f2.name) f1.name else f1.name.toLowerCase(Locale.ROOT) --- End diff -- Why is the output nested column name lower case? Does Hive behave like this? In addition, could you add one more test to check whether we also respect the case sensitivity conf when we resolve queries that contain the nested column in the references?
[GitHub] spark issue #18460: [SPARK-21247][SQL] Type comparision should respect case-...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18460 I see. There is a typo in the PR title: `comparision`. Will review the fix soon.
[GitHub] spark issue #19438: [SPARK-22208] [SQL] Improve percentile_approx by not rou...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/19438 Maybe we can run some of the major test suites locally and update all the results.
[GitHub] spark issue #19444: [SPARK-22214][SQL] Refactor the list hive partitions cod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19444 **[Test build #82519 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82519/testReport)** for PR 19444 at commit [`1e119ea`](https://github.com/apache/spark/commit/1e119ea7ae414e27ad0832cee347a20d24f1c0cb).
[GitHub] spark issue #19438: [SPARK-22208] [SQL] Improve percentile_approx by not rou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19438 Merged build finished. Test FAILed.
[GitHub] spark issue #19438: [SPARK-22208] [SQL] Improve percentile_approx by not rou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19438 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82512/ Test FAILed.
[GitHub] spark issue #19438: [SPARK-22208] [SQL] Improve percentile_approx by not rou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19438 **[Test build #82512 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82512/testReport)** for PR 19438 at commit [`aae2769`](https://github.com/apache/spark/commit/aae2769b753db483e0ec2652fa0c4f09fcd14c61). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19444: [SPARK-22214][SQL] Refactor the list hive partitions cod...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19444 LGTM except a minor comment.
[GitHub] spark pull request #19444: [SPARK-22214][SQL] Refactor the list hive partiti...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19444#discussion_r143235124 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -638,12 +638,14 @@ private[hive] class HiveClientImpl( table: CatalogTable, spec: Option[TablePartitionSpec]): Seq[CatalogTablePartition] = withHiveState { val hiveTable = toHiveTable(table, Some(userName)) -val parts = spec match { - case None => shim.getAllPartitions(client, hiveTable).map(fromHivePartition) +val partialPartSpec = spec match { --- End diff -- -> `partSpec`
[GitHub] spark issue #18966: [SPARK-21751][SQL] CodeGeneraor.splitExpressions counts ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18966 **[Test build #82518 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82518/testReport)** for PR 18966 at commit [`b04c09c`](https://github.com/apache/spark/commit/b04c09c3683f104909713344c90e46b4129f5401).
[GitHub] spark issue #19447: [SPARK-22215][SQL] Add configuration to set the threshol...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19447 Here are the answers to your questions, @gatorsmile; please tell me if I need to elaborate more deeply. This conf controls how many inner classes are generated. A big value means that we will have few inner classes and that the outer class and the inner classes will have a lot of methods. A small one means a lot of classes, each with few methods. So, 1 - I don't think there is a big perf impact. Maybe, if this value is very small, we might have a lot of nested class instances and therefore a waste of memory. But since these classes don't have any variable, this is quite negligible to me. 2 - If the user encounters an Exception like ``` Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass has grown past JVM limit of 0x ``` then he/she must reduce this value 3 - I don't think so, because I have not been able to reproduce the problem other than in a very complex use case. So probably that use case is a corner case which can benefit from this change.
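The tradeoff described above (big threshold → few large classes, small threshold → many small ones) reduces to ceiling division: with a fixed number of split methods, the per-class budget determines how many nested classes are emitted, and each class file gets its own constant pool. A minimal illustrative sketch; the names and numbers are hypothetical, not Spark's actual config:

```java
public class SplitBudgetSketch {
    // If `total` generated methods are distributed into nested classes that
    // each hold at most `budget` methods, the generator emits
    // ceil(total / budget) classes. A large budget yields few classes whose
    // constant pools grow large; a small budget yields many small classes.
    static int nestedClassCount(int total, int budget) {
        return (total + budget - 1) / budget; // integer ceiling division
    }

    public static void main(String[] args) {
        System.out.println(nestedClassCount(10000, 1600)); // 7
        System.out.println(nestedClassCount(10000, 100));  // 100
    }
}
```

Since each class file's constant pool is capped at 65535 entries by the JVM class-file format, spreading methods over more classes spreads the pool entries too, which is why lowering the threshold works around the "grown past JVM limit" error.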
[GitHub] spark pull request #19394: [SPARK-22170][SQL] Reduce memory consumption in b...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19394#discussion_r143232991 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala --- @@ -73,25 +73,37 @@ case class BroadcastExchangeExec( try { val beforeCollect = System.nanoTime() // Note that we use .executeCollect() because we don't want to convert data to Scala types - val input: Array[InternalRow] = child.executeCollect() - if (input.length >= 51200) { + val (numRows, input) = child.executeCollectIterator() + if (numRows >= 51200) { throw new SparkException( - s"Cannot broadcast the table with more than 512 millions rows: ${input.length} rows") + s"Cannot broadcast the table with more than 512 millions rows: $numRows rows") } + val beforeBuild = System.nanoTime() longMetric("collectTime") += (beforeBuild - beforeCollect) / 100 - val dataSize = input.map(_.asInstanceOf[UnsafeRow].getSizeInBytes.toLong).sum + + // Construct the relation. + val relation = mode.transform(input, Some(numRows)) + + val dataSize = relation match { +case map: HashedRelation => + map.estimatedSize +case arr: Array[InternalRow] => + arr.map(_.asInstanceOf[UnsafeRow].getSizeInBytes.toLong).sum +case _ => + numRows * 512 // guess: each row is about 512 bytes --- End diff -- Yeah, it makes more sense to issue an exception.
[GitHub] spark pull request #19294: [SPARK-21549][CORE] Respect OutputFormats with no...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19294#discussion_r143232506 --- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala --- @@ -57,6 +60,15 @@ class HadoopMapReduceCommitProtocol(jobId: String, path: String) */ private def absPathStagingDir: Path = new Path(path, "_temporary-" + jobId) + /** + * Checks whether there are files to be committed to an absolute output location. + * + * As the committing and aborting the job occurs on driver where `addedAbsPathFiles` is always + * null, it is necessary to check whether the output path is specified, that may not be the case + * for committers not writing to distributed file systems. --- End diff -- This also has a grammar issue, and it is not clear either.
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19448 **[Test build #82517 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82517/testReport)** for PR 19448 at commit [`e6fdbdc`](https://github.com/apache/spark/commit/e6fdbdcf4118283abd22f7b14586ed742d225657).