[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19769 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19769 LGTM except a few minor comments
[GitHub] spark issue #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN command
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19774 **[Test build #83966 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83966/testReport)** for PR 19774 at commit [`24bfcb1`](https://github.com/apache/spark/commit/24bfcb1132d35ffa8ba2341a7ea9057b14b5ab8a).
[GitHub] spark issue #19763: [SPARK-22537][core] Aggregation of map output statistics...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19763 cc @zsxwing
[GitHub] spark pull request #19767: [WIP][SPARK-22543][SQL] fix java 64kb compile err...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19767#discussion_r151637511

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
@@ -105,6 +105,41 @@ abstract class Expression extends TreeNode[Expression] {
     val isNull = ctx.freshName("isNull")
     val value = ctx.freshName("value")
     val ve = doGenCode(ctx, ExprCode("", isNull, value))
+
+    // TODO: support whole stage codegen too
+    if (ve.code.trim.length > 1024 && ctx.INPUT_ROW != null && ctx.currentVars == null) {
+      val setIsNull = if (ve.isNull != "false" && ve.isNull != "true") {
+        val globalIsNull = ctx.freshName("globalIsNull")
+        ctx.addMutableState("boolean", globalIsNull, s"$globalIsNull = false;")
+        val localIsNull = ve.isNull
+        ve.isNull = globalIsNull
+        s"$globalIsNull = $localIsNull;"
+      } else {
+        ""
+      }
+
+      val setValue = {
+        val globalValue = ctx.freshName("globalValue")
+        ctx.addMutableState(
+          ctx.javaType(dataType), globalValue, s"$globalValue = ${ctx.defaultValue(dataType)};")
+        val localValue = ve.value
+        ve.value = globalValue
+        s"$globalValue = $localValue;"
+      }
+
+      val funcName = ctx.freshName(nodeName)
+      val funcFullName = ctx.addNewFunction(funcName,
+        s"""
+           |private void $funcName(InternalRow ${ctx.INPUT_ROW}) {
+           |  ${ve.code.trim}
+           |  $setValue
+           |  $setIsNull
+           |}
+         """.stripMargin)
+
+      ve.code = s"$funcFullName(${ctx.INPUT_ROW});"
+    }
+
     if (ve.code.nonEmpty) {
       // Add `this` in the comment.
       ve.copy(code = s"${ctx.registerComment(this.toString)}\n" + ve.code.trim)
--- End diff --

I don't have a strong preference; it's ok to have the comment at the function caller side.
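The technique discussed in this diff — extracting an oversized expression body into its own method, with instance fields standing in for the local `isNull`/`value` variables so the caller stays under the JVM's 64KB-per-method bytecode limit — can be sketched in plain Java. This is a hypothetical illustration (the class and method names are invented), not the actual generated code:

```java
public class SplitExprExample {
    // Instance fields stand in for the "global" mutable state the generated
    // code uses to carry results out of the extracted method.
    private boolean globalIsNull = false;
    private int globalValue = 0;

    // Hypothetical extracted method: the oversized expression body moves here.
    private void evalExpr(int input) {
        int localValue = input * 2;       // stand-in for ve.code's computation
        boolean localIsNull = input < 0;  // stand-in for the null check
        globalValue = localValue;         // corresponds to setValue
        globalIsNull = localIsNull;       // corresponds to setIsNull
    }

    public int eval(int input) {
        evalExpr(input);                  // ve.code shrinks to a single call
        return globalIsNull ? 0 : globalValue;
    }
}
```

The caller's body shrinks from the full expression code to one method invocation, which is exactly what makes the surrounding method compile when many large expressions are combined.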
[GitHub] spark issue #19773: Supporting for changing column dataType
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19773 **[Test build #83964 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83964/testReport)** for PR 19773 at commit [`1bcd74f`](https://github.com/apache/spark/commit/1bcd74fae9cb6595e04eab6ecaf621739644102f).
[GitHub] spark pull request #19390: [SPARK-18935][MESOS] Fix dynamic reservations on ...
Github user skonto commented on a diff in the pull request: https://github.com/apache/spark/pull/19390#discussion_r151672855

--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala ---
@@ -427,10 +444,10 @@ trait MesosSchedulerUtils extends Logging {
     // partition port offers
     val (resourcesWithoutPorts, portResources) = filterPortResources(offeredResources)

-    val portsAndRoles = requestedPorts.
-      map(x => (x, findPortAndGetAssignedRangeRole(x, portResources)))
+    val portsAndResourceInfo = requestedPorts.
+      map(x => (x, findPortAndGetAssignedResourceInfo(x, portResources)))
--- End diff --

Ok, will fix, np.
[GitHub] spark pull request #19767: [WIP][SPARK-22543][SQL] fix java 64kb compile err...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19767#discussion_r151624776

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
@@ -105,6 +105,41 @@ abstract class Expression extends TreeNode[Expression] {
     val isNull = ctx.freshName("isNull")
     val value = ctx.freshName("value")
     val ve = doGenCode(ctx, ExprCode("", isNull, value))
+
+    // TODO: support whole stage codegen too
+    if (ve.code.trim.length > 1024 && ctx.INPUT_ROW != null && ctx.currentVars == null) {
+      val setIsNull = if (ve.isNull != "false" && ve.isNull != "true") {
+        val globalIsNull = ctx.freshName("globalIsNull")
+        ctx.addMutableState("boolean", globalIsNull, s"$globalIsNull = false;")
+        val localIsNull = ve.isNull
+        ve.isNull = globalIsNull
+        s"$globalIsNull = $localIsNull;"
+      } else {
+        ""
+      }
+
+      val setValue = {
+        val globalValue = ctx.freshName("globalValue")
+        ctx.addMutableState(
+          ctx.javaType(dataType), globalValue, s"$globalValue = ${ctx.defaultValue(dataType)};")
+        val localValue = ve.value
+        ve.value = globalValue
+        s"$globalValue = $localValue;"
+      }
+
+      val funcName = ctx.freshName(nodeName)
+      val funcFullName = ctx.addNewFunction(funcName,
+        s"""
+           |private void $funcName(InternalRow ${ctx.INPUT_ROW}) {
+           |  ${ve.code.trim}
+           |  $setValue
+           |  $setIsNull
--- End diff --

yea, it's already done when we define `setIsNull`
[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19769#discussion_r151633919

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala ---
@@ -87,4 +95,107 @@ class ParquetInteroperabilitySuite extends ParquetCompatibilityTest with SharedS
         Row(Seq(2, 3
     }
   }
+
+  test("parquet timestamp conversion") {
+    // Make a table with one parquet file written by impala, and one parquet file written by spark.
+    // We should only adjust the timestamps in the impala file, and only if the conf is set
+    val impalaFile = "test-data/impala_timestamp.parq"
+
+    // here are the timestamps in the impala file, as they were saved by impala
+    val impalaFileData =
+      Seq(
+        "2001-01-01 01:01:01",
+        "2002-02-02 02:02:02",
+        "2003-03-03 03:03:03"
+      ).map { s => java.sql.Timestamp.valueOf(s) }
--- End diff --

nit: `.map(java.sql.Timestamp.valueOf)`
[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19630 **[Test build #83959 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83959/testReport)** for PR 19630 at commit [`cf1d1ca`](https://github.com/apache/spark/commit/cf1d1caa4f41c6bcf565cfc5b9e9901d94f56af3). * This patch **fails from timeout after a configured wait of `250m`**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19630 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83959/ Test FAILed.
[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19630 Merged build finished. Test FAILed.
[GitHub] spark pull request #19390: [SPARK-18935][MESOS] Fix dynamic reservations on ...
Github user skonto commented on a diff in the pull request: https://github.com/apache/spark/pull/19390#discussion_r151674490

--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala ---
@@ -451,15 +468,22 @@ trait MesosSchedulerUtils extends Logging {
   }

   /** Creates a mesos resource for a specific port number. */
-  private def createResourcesFromPorts(portsAndRoles: List[(Long, String)]) : List[Resource] = {
-    portsAndRoles.flatMap{ case (port, role) =>
-      createMesosPortResource(List((port, port)), Some(role))}
+  private def createResourcesFromPorts(
+      portsAndResourcesInfo: List[(Long, (String, AllocationInfo, Option[ReservationInfo]))])
+    : List[Resource] = {
--- End diff --

ok
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/19774

[SPARK-22475][SQL] show histogram in DESC COLUMN command

## What changes were proposed in this pull request?

Added the histogram representation to the output of the `DESCRIBE EXTENDED table_name column_name` command.

## How was this patch tested?

Modified SQL UT and checked output

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-22475

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19774.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19774

commit 24bfcb1132d35ffa8ba2341a7ea9057b14b5ab8a
Author: Marco Gaido
Date: 2017-11-17T12:42:16Z

    [SPARK-22475][SQL] show histogram in DESC COLUMN command
[GitHub] spark pull request #19630: [SPARK-22409] Introduce function type argument in...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/19630#discussion_r151676061

--- Diff: python/pyspark/sql/functions.py ---
@@ -2049,132 +2050,12 @@ def map_values(col):

 # User Defined Function

-def _wrap_function(sc, func, returnType):
-    command = (func, returnType)
-    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
-    return sc._jvm.PythonFunction(bytearray(pickled_command), env, includes, sc.pythonExec,
-                                  sc.pythonVer, broadcast_vars, sc._javaAccumulator)
-
-
-class PythonUdfType(object):
-    # row-at-a-time UDFs
-    NORMAL_UDF = 0
-    # scalar vectorized UDFs
-    PANDAS_UDF = 1
-    # grouped vectorized UDFs
-    PANDAS_GROUPED_UDF = 2
-
-
-class UserDefinedFunction(object):
--- End diff --

So moving this will probably break some people's code.
[GitHub] spark issue #19767: [WIP][SPARK-22543][SQL] fix java 64kb compile error for ...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19767 This looks like a good direction if we do not see performance degradation.
[GitHub] spark issue #9428: [SPARK-8582][Core]Optimize checkpointing to avoid computi...
Github user ferdonline commented on the issue: https://github.com/apache/spark/pull/9428 That's the reason why I want to checkpoint when they are first calculated. Further transformations use these results several times. Of course it's not a problem per se to calculate twice for the checkpoint, but doing so for 1+TB of data is nonsense and I can't cache.
[GitHub] spark issue #19773: [SPARK-22546][SQL] Supporting for changing column dataTy...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19773 Merged build finished. Test FAILed.
[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19630 retest this please
[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19769#discussion_r151633640

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala ---
@@ -151,6 +154,8 @@ private[parquet] class ParquetRowConverter(
            |${catalystType.prettyJson}
          """.stripMargin)

+  val UTC = DateTimeUtils.TimeZoneUTC
--- End diff --

nit: `private val`?
[GitHub] spark pull request #19767: [WIP][SPARK-22543][SQL] fix java 64kb compile err...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19767#discussion_r151644789

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
@@ -105,6 +105,41 @@ abstract class Expression extends TreeNode[Expression] {
     val isNull = ctx.freshName("isNull")
     val value = ctx.freshName("value")
     val ve = doGenCode(ctx, ExprCode("", isNull, value))
+
+    // TODO: support whole stage codegen too
+    if (ve.code.trim.length > 1024 && ctx.INPUT_ROW != null && ctx.currentVars == null) {
+      val setIsNull = if (ve.isNull != "false" && ve.isNull != "true") {
+        val globalIsNull = ctx.freshName("globalIsNull")
+        ctx.addMutableState("boolean", globalIsNull, s"$globalIsNull = false;")
+        val localIsNull = ve.isNull
+        ve.isNull = globalIsNull
+        s"$globalIsNull = $localIsNull;"
+      } else {
+        ""
+      }
+
+      val setValue = {
+        val globalValue = ctx.freshName("globalValue")
+        ctx.addMutableState(
+          ctx.javaType(dataType), globalValue, s"$globalValue = ${ctx.defaultValue(dataType)};")
+        val localValue = ve.value
+        ve.value = globalValue
+        s"$globalValue = $localValue;"
+      }
+
+      val funcName = ctx.freshName(nodeName)
+      val funcFullName = ctx.addNewFunction(funcName,
+        s"""
+           |private void $funcName(InternalRow ${ctx.INPUT_ROW}) {
+           |  ${ve.code.trim}
+           |  $setValue
--- End diff --

Thanks. IMHO, I am curious whether we will see any performance degradation from using one array to pack many boolean variables. I am waiting for the updated result in [this discussion](https://github.com/apache/spark/pull/19518#issuecomment-337965330). This is because the current code seems to measure the performance of the interpreter due to lack of warmup.
[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19769 Merged build finished. Test PASSed.
[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19630 **[Test build #83962 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83962/testReport)** for PR 19630 at commit [`cf1d1ca`](https://github.com/apache/spark/commit/cf1d1caa4f41c6bcf565cfc5b9e9901d94f56af3).
[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19769 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83960/ Test PASSed.
[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19769 **[Test build #83960 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83960/testReport)** for PR 19769 at commit [`953b4e8`](https://github.com/apache/spark/commit/953b4e84b717962316218aec0d635f344b44134c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19773: [SPARK-22546][SQL] Supporting for changing column dataTy...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19773 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83964/ Test FAILed.
[GitHub] spark issue #19773: [SPARK-22546][SQL] Supporting for changing column dataTy...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19773 **[Test build #83964 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83964/testReport)** for PR 19773 at commit [`1bcd74f`](https://github.com/apache/spark/commit/1bcd74fae9cb6595e04eab6ecaf621739644102f). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19769 **[Test build #83960 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83960/testReport)** for PR 19769 at commit [`953b4e8`](https://github.com/apache/spark/commit/953b4e84b717962316218aec0d635f344b44134c).
[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19769#discussion_r151634925

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala ---
@@ -87,4 +95,107 @@ class ParquetInteroperabilitySuite extends ParquetCompatibilityTest with SharedS
         Row(Seq(2, 3
     }
   }
+
+  test("parquet timestamp conversion") {
+    // Make a table with one parquet file written by impala, and one parquet file written by spark.
+    // We should only adjust the timestamps in the impala file, and only if the conf is set
+    val impalaFile = "test-data/impala_timestamp.parq"
+
+    // here are the timestamps in the impala file, as they were saved by impala
+    val impalaFileData =
+      Seq(
+        "2001-01-01 01:01:01",
+        "2002-02-02 02:02:02",
+        "2003-03-03 03:03:03"
+      ).map { s => java.sql.Timestamp.valueOf(s) }
+    val impalaPath = Thread.currentThread().getContextClassLoader.getResource(impalaFile)
+      .toURI.getPath
+    withTempPath { tableDir =>
+      val ts = Seq(
+        "2004-04-04 04:04:04",
+        "2005-05-05 05:05:05",
+        "2006-06-06 06:06:06"
+      ).map { s => java.sql.Timestamp.valueOf(s) }
+      import testImplicits._
+      // match the column names of the file from impala
+      val df = spark.createDataset(ts).toDF().repartition(1).withColumnRenamed("value", "ts")
+      df.write.parquet(tableDir.getAbsolutePath)
+      FileUtils.copyFile(new File(impalaPath), new File(tableDir, "part-1.parq"))
+
+      Seq(false, true).foreach { int96TimestampConversion =>
+        Seq(false, true).foreach { vectorized =>
+          withSQLConf(
+            (SQLConf.PARQUET_INT96_TIMESTAMP_CONVERSION.key, int96TimestampConversion.toString()),
+            (SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, vectorized.toString())
--- End diff --

to be future proof, let's explicitly set `PARQUET_OUTPUT_TIMESTAMP_TYPE=INT96`
[GitHub] spark pull request #19767: [WIP][SPARK-22543][SQL] fix java 64kb compile err...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19767#discussion_r151636953

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
@@ -105,6 +105,41 @@ abstract class Expression extends TreeNode[Expression] {
     val isNull = ctx.freshName("isNull")
     val value = ctx.freshName("value")
     val ve = doGenCode(ctx, ExprCode("", isNull, value))
+
+    // TODO: support whole stage codegen too
+    if (ve.code.trim.length > 1024 && ctx.INPUT_ROW != null && ctx.currentVars == null) {
+      val setIsNull = if (ve.isNull != "false" && ve.isNull != "true") {
+        val globalIsNull = ctx.freshName("globalIsNull")
+        ctx.addMutableState("boolean", globalIsNull, s"$globalIsNull = false;")
+        val localIsNull = ve.isNull
+        ve.isNull = globalIsNull
+        s"$globalIsNull = $localIsNull;"
+      } else {
+        ""
+      }
+
+      val setValue = {
+        val globalValue = ctx.freshName("globalValue")
+        ctx.addMutableState(
+          ctx.javaType(dataType), globalValue, s"$globalValue = ${ctx.defaultValue(dataType)};")
+        val localValue = ve.value
+        ve.value = globalValue
+        s"$globalValue = $localValue;"
+      }
+
+      val funcName = ctx.freshName(nodeName)
+      val funcFullName = ctx.addNewFunction(funcName,
+        s"""
+           |private void $funcName(InternalRow ${ctx.INPUT_ROW}) {
+           |  ${ve.code.trim}
+           |  $setValue
--- End diff --

good suggestion! Actually, this is a general strategy which can be applied in more places. If there are only boolean global variables, it's very easy to fold them into one array.
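Folding several boolean "global" variables into a single array, as suggested here, could look like the following hypothetical sketch (in real codegen the slot indices would be assigned at code-generation time; this is not Spark's actual implementation):

```java
public class PackedFlags {
    // One shared array replaces N separate boolean fields; each extracted
    // method owns a fixed slot index.
    private final boolean[] isNullFlags = new boolean[2];

    private void evalExprA(int input) {
        isNullFlags[0] = input < 0;   // slot 0 belongs to expression A
    }

    private void evalExprB(int input) {
        isNullFlags[1] = input == 0;  // slot 1 belongs to expression B
    }

    public boolean anyNull(int input) {
        evalExprA(input);
        evalExprB(input);
        return isNullFlags[0] || isNullFlags[1];
    }
}
```

The trade-off under discussion in the thread is whether the extra array indexing costs anything measurable compared with plain boolean fields, which the JIT usually optimizes aggressively.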
[GitHub] spark pull request #19390: [SPARK-18935][MESOS] Fix dynamic reservations on ...
Github user skonto commented on a diff in the pull request: https://github.com/apache/spark/pull/19390#discussion_r151672564

--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala ---
@@ -175,17 +176,39 @@ trait MesosSchedulerUtils extends Logging {
     registerLatch.countDown()
   }

-  def createResource(name: String, amount: Double, role: Option[String] = None): Resource = {
+  private def setAllocationAndReservationInfo(
+      allocationInfo: Option[AllocationInfo],
+      reservationInfo: Option[ReservationInfo],
+      role: Option[String],
+      builder: Resource.Builder): Unit = {
+    if (role.forall(r => !r.equals(ANY_ROLE))) {
--- End diff --

Even better: `!role.contains(ANY_ROLE)`
[GitHub] spark pull request #19773: Supporting for changing column dataType
GitHub user xuanyuanking opened a pull request: https://github.com/apache/spark/pull/19773

Supporting for changing column dataType

## What changes were proposed in this pull request?

Support changing a column's dataType in Hive tables and datasource tables; this PR also aims to start a further discussion of other DDL requirements.

## How was this patch tested?

Add test cases in `DDLSuite.scala` and `SQLQueryTestSuite.scala`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuanyuanking/spark SPARK-22546

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19773.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19773

commit 1bcd74fae9cb6595e04eab6ecaf621739644102f
Author: Yuanjian Li
Date: 2017-11-17T12:11:33Z

    Support change column dataType
[GitHub] spark issue #19257: [SPARK-22042] [SQL] ReorderJoinPredicates can break when...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19257 @felixcheung Don't worry, the bug only exists in the master branch, so it won't block the 2.2.1 release. I have corrected the JIRA ticket's affected version to 2.3. Also, I'm looking into this issue.
[GitHub] spark pull request #19767: [WIP][SPARK-22543][SQL] fix java 64kb compile err...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19767#discussion_r151631456

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
@@ -105,6 +105,41 @@ abstract class Expression extends TreeNode[Expression] {
     val isNull = ctx.freshName("isNull")
     val value = ctx.freshName("value")
     val ve = doGenCode(ctx, ExprCode("", isNull, value))
+
+    // TODO: support whole stage codegen too
+    if (ve.code.trim.length > 1024 && ctx.INPUT_ROW != null && ctx.currentVars == null) {
+      val setIsNull = if (ve.isNull != "false" && ve.isNull != "true") {
+        val globalIsNull = ctx.freshName("globalIsNull")
+        ctx.addMutableState("boolean", globalIsNull, s"$globalIsNull = false;")
+        val localIsNull = ve.isNull
+        ve.isNull = globalIsNull
+        s"$globalIsNull = $localIsNull;"
+      } else {
+        ""
+      }
+
+      val setValue = {
+        val globalValue = ctx.freshName("globalValue")
+        ctx.addMutableState(
+          ctx.javaType(dataType), globalValue, s"$globalValue = ${ctx.defaultValue(dataType)};")
+        val localValue = ve.value
+        ve.value = globalValue
+        s"$globalValue = $localValue;"
+      }
+
+      val funcName = ctx.freshName(nodeName)
+      val funcFullName = ctx.addNewFunction(funcName,
+        s"""
+           |private void $funcName(InternalRow ${ctx.INPUT_ROW}) {
+           |  ${ve.code.trim}
+           |  $setValue
--- End diff --

Can we always pass `value` back as a return value? It can reduce the number of global variables.
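The suggestion here — returning the value from the extracted method instead of writing it to a field — would leave only the null flag as mutable state. A hypothetical sketch (invented names; not the actual generated code):

```java
public class ReturnValueSplit {
    // Only the null flag remains a field; the value travels through the
    // extracted method's return, halving the "global" variables needed.
    private boolean exprIsNull = false;

    private int evalExpr(int input) {
        exprIsNull = input < 0;   // setIsNull still needs a field write
        return input * 2;         // the value is simply returned
    }

    public int eval(int input) {
        int v = evalExpr(input);
        return exprIsNull ? 0 : v;
    }
}
```

Since a Java method has a single return slot, only one of the two outputs (value or null flag) can be returned; the other still needs a field, which is why the diff uses fields for both in the general case.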
[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19769#discussion_r151634730

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala ---
@@ -87,4 +95,107 @@ class ParquetInteroperabilitySuite extends ParquetCompatibilityTest with SharedS
         Row(Seq(2, 3
     }
   }
+
+  test("parquet timestamp conversion") {
+    // Make a table with one parquet file written by impala, and one parquet file written by spark.
+    // We should only adjust the timestamps in the impala file, and only if the conf is set
+    val impalaFile = "test-data/impala_timestamp.parq"
+
+    // here are the timestamps in the impala file, as they were saved by impala
+    val impalaFileData =
+      Seq(
+        "2001-01-01 01:01:01",
+        "2002-02-02 02:02:02",
+        "2003-03-03 03:03:03"
+      ).map { s => java.sql.Timestamp.valueOf(s) }
+    val impalaPath = Thread.currentThread().getContextClassLoader.getResource(impalaFile)
+      .toURI.getPath
+    withTempPath { tableDir =>
+      val ts = Seq(
+        "2004-04-04 04:04:04",
+        "2005-05-05 05:05:05",
+        "2006-06-06 06:06:06"
+      ).map { s => java.sql.Timestamp.valueOf(s) }
+      import testImplicits._
+      // match the column names of the file from impala
+      val df = spark.createDataset(ts).toDF().repartition(1).withColumnRenamed("value", "ts")
+      df.write.parquet(tableDir.getAbsolutePath)
+      FileUtils.copyFile(new File(impalaPath), new File(tableDir, "part-1.parq"))
+
+      Seq(false, true).foreach { int96TimestampConversion =>
+        Seq(false, true).foreach { vectorized =>
+          withSQLConf(
+            (SQLConf.PARQUET_INT96_TIMESTAMP_CONVERSION.key, int96TimestampConversion.toString()),
+            (SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, vectorized.toString())
+          ) {
+            val readBack = spark.read.parquet(tableDir.getAbsolutePath).collect()
+            assert(readBack.size === 6)
+            // if we apply the conversion, we'll get the "right" values, as saved by impala in the
+            // original file. Otherwise, they're off by the local timezone offset, set to
+            // America/Los_Angeles in tests
+            val impalaExpectations = if (int96TimestampConversion) {
+              impalaFileData
+            } else {
+              impalaFileData.map { ts =>
+                DateTimeUtils.toJavaTimestamp(DateTimeUtils.convertTz(
+                  DateTimeUtils.fromJavaTimestamp(ts),
+                  DateTimeUtils.TimeZoneUTC,
+                  DateTimeUtils.getTimeZone(conf.sessionLocalTimeZone)))
+              }
+            }
+            val fullExpectations = (ts ++ impalaExpectations).map(_.toString).sorted.toArray
+            val actual = readBack.map(_.getTimestamp(0).toString).sorted
+            withClue(s"applyConversion = $int96TimestampConversion; vectorized = $vectorized") {
+              assert(fullExpectations === actual)
+
+              // Now test that the behavior is still correct even with a filter which could get
+              // pushed down into parquet. We don't need extra handling for pushed down
+              // predicates because (a) in ParquetFilters, we ignore TimestampType and (b) parquet
+              // does not read statistics from int96 fields, as they are unsigned. See
+              // scalastyle:off line.size.limit
+              // https://github.com/apache/parquet-mr/blob/2fd62ee4d524c270764e9b91dca72e5cf1a005b7/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L419
+              // https://github.com/apache/parquet-mr/blob/2fd62ee4d524c270764e9b91dca72e5cf1a005b7/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L348
+              // scalastyle:on line.size.limit
+              //
+              // Just to be defensive in case anything ever changes in parquet, this test checks
+              // the assumption on column stats, and also the end-to-end behavior.
+
+              val hadoopConf = sparkContext.hadoopConfiguration
+              val fs = FileSystem.get(hadoopConf)
+              val parts = fs.listStatus(new Path(tableDir.getAbsolutePath), new PathFilter {
+                override def accept(path: Path): Boolean = !path.getName.startsWith("_")
+              })
+              // grab the meta data from the parquet file. The next section of asserts just make
+              // sure the test is configured correctly.
+              assert(parts.size == 2)
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19518 ping @bdrillard --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19630 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19390: [SPARK-18935][MESOS] Fix dynamic reservations on ...
Github user skonto commented on a diff in the pull request: https://github.com/apache/spark/pull/19390#discussion_r151674372 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala --- @@ -228,24 +254,15 @@ trait MesosSchedulerUtils extends Logging { (attr.getName, attr.getText.getValue.split(',').toSet) } - - /** Build a Mesos resource protobuf object */ - protected def createResource(resourceName: String, quantity: Double): Protos.Resource = { -Resource.newBuilder() - .setName(resourceName) - .setType(Value.Type.SCALAR) - .setScalar(Value.Scalar.newBuilder().setValue(quantity).build()) - .build() - } - /** * Converts the attributes from the resource offer into a Map of name to Attribute Value * The attribute values are the mesos attribute types and they are * * @param offerAttributes the attributes offered * @return */ - protected def toAttributeMap(offerAttributes: JList[Attribute]): Map[String, GeneratedMessage] = { + protected def toAttributeMap(offerAttributes: JList[Attribute]) +: Map[String, GeneratedMessageV3] = { --- End diff -- 2 space indent is not correct? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19630 **[Test build #83965 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83965/testReport)** for PR 19630 at commit [`cf1d1ca`](https://github.com/apache/spark/commit/cf1d1caa4f41c6bcf565cfc5b9e9901d94f56af3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19390: [SPARK-18935][MESOS] Fix dynamic reservations on ...
Github user skonto commented on a diff in the pull request: https://github.com/apache/spark/pull/19390#discussion_r151679164 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala --- @@ -451,15 +468,22 @@ trait MesosSchedulerUtils extends Logging { } /** Creates a mesos resource for a specific port number. */ - private def createResourcesFromPorts(portsAndRoles: List[(Long, String)]) : List[Resource] = { -portsAndRoles.flatMap{ case (port, role) => - createMesosPortResource(List((port, port)), Some(role))} + private def createResourcesFromPorts( + portsAndResourcesInfo: List[(Long, (String, AllocationInfo, Option[ReservationInfo]))]) --- End diff -- yeah makes sense. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19630 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83961/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19630 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19767: [WIP][SPARK-22543][SQL] fix java 64kb compile error for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19767 **[Test build #83963 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83963/testReport)** for PR 19767 at commit [`3dab5bd`](https://github.com/apache/spark/commit/3dab5bdbc4d2bb1818c46905afb92422bac04d9e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19390: [SPARK-18935][MESOS] Fix dynamic reservations on ...
Github user skonto commented on a diff in the pull request: https://github.com/apache/spark/pull/19390#discussion_r151673115 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala --- @@ -349,13 +349,22 @@ private[spark] class MesosCoarseGrainedSchedulerBackend( val offerMem = getResource(offer.getResourcesList, "mem") val offerCpus = getResource(offer.getResourcesList, "cpus") val offerPorts = getRangeResource(offer.getResourcesList, "ports") + val offerAllocationInfo = offer.getAllocationInfo + val offerReservationInfo = offer +.getResourcesList +.asScala +.find(resource => Option(resource.getReservation).isDefined) --- End diff -- ok --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19390: [SPARK-18935][MESOS] Fix dynamic reservations on ...
Github user skonto commented on a diff in the pull request: https://github.com/apache/spark/pull/19390#discussion_r151673131 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala --- @@ -451,15 +468,22 @@ trait MesosSchedulerUtils extends Logging { } /** Creates a mesos resource for a specific port number. */ - private def createResourcesFromPorts(portsAndRoles: List[(Long, String)]) : List[Resource] = { -portsAndRoles.flatMap{ case (port, role) => - createMesosPortResource(List((port, port)), Some(role))} + private def createResourcesFromPorts( + portsAndResourcesInfo: List[(Long, (String, AllocationInfo, Option[ReservationInfo]))]) +: List[Resource] = { +portsAndResourcesInfo.flatMap{ case (port, rInfo) => --- End diff -- ok --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19767: [WIP][SPARK-22543][SQL] fix java 64kb compile error for ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19767 @maropu it partially covers #18641. One problem is that if each child of an expression generates less than 1024 characters of code but the expression has many children, we still have an issue. `CaseWhen` is a little different because it can have at most 20 children (depending on `spark.sql.codegen.maxCaseBranches`). So we can still prevent compile failures, but may not be able to JIT. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
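The splitting pattern discussed in this thread can be illustrated with a minimal sketch. This is plain Java, not Spark's actual generated code, and all names (`globalValue`, `evalExpr`, etc.) are illustrative stand-ins for the fresh names the codegen context would produce: an oversized expression body is hoisted into its own private method, and the former local result/null variables become instance fields so the caller can read them after the call.

```java
// Sketch of the split-into-methods strategy: each generated method stays
// under the JVM's 64KB per-method bytecode limit (and ideally under the
// ~8KB size beyond which HotSpot may refuse to JIT-compile a method).
public class SplitExprSketch {
    // "global" mutable state replacing the former local variables
    private long globalValue = 0L;
    private boolean globalIsNull = false;

    // the oversized expression body, moved into its own method; the body
    // here is a tiny stand-in for what would be >1024 characters of code
    private void evalExpr(long input) {
        long localValue = input * 2 + 1;
        globalValue = localValue;
        globalIsNull = false;
    }

    // the caller shrinks to a single method call plus field reads
    public long consume(long input) {
        evalExpr(input);
        return globalIsNull ? -1L : globalValue;
    }
}
```

The trade-off cloud-fan notes above follows directly from this shape: splitting bounds the size of any one method, but if a single expression has very many children whose bodies are each just under the threshold, the caller that stitches them together can itself grow large.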
[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19769#discussion_r151635947 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetInteroperabilitySuite.scala --- @@ -87,4 +96,113 @@ class ParquetInteroperabilitySuite extends ParquetCompatibilityTest with SharedS Row(Seq(2, 3 } } + + val ImpalaFile = "test-data/impala_timestamp.parq" + test("parquet timestamp conversion") { +// Make a table with one parquet file written by impala, and one parquet file written by spark. +// We should only adjust the timestamps in the impala file, and only if the conf is set + +// here's the timestamps in the impala file, as they were saved by impala +val impalaFileData = + Seq( +"2001-01-01 01:01:01", +"2002-02-02 02:02:02", +"2003-03-03 03:03:03" + ).map { s => java.sql.Timestamp.valueOf(s) } +val impalaFile = Thread.currentThread().getContextClassLoader.getResource(ImpalaFile) + .toURI.getPath +withTempPath { tableDir => + val ts = Seq( +"2004-04-04 04:04:04", +"2005-05-05 05:05:05", +"2006-06-06 06:06:06" + ).map { s => java.sql.Timestamp.valueOf(s) } + val s = spark + import s.implicits._ + // match the column names of the file from impala + val df = spark.createDataset(ts).toDF().repartition(1).withColumnRenamed("value", "ts") + val schema = df.schema + df.write.parquet(tableDir.getAbsolutePath) + FileUtils.copyFile(new File(impalaFile), new File(tableDir, "part-1.parq")) + + Seq(false, true).foreach { applyConversion => +Seq(false, true).foreach { vectorized => + withSQLConf( + (SQLConf.PARQUET_INT96_TIMESTAMP_CONVERSION.key, applyConversion.toString()), + (SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, vectorized.toString()) + ) { +val read = spark.read.parquet(tableDir.getAbsolutePath).collect() +assert(read.size === 6) +// if we apply the conversion, we'll get the "right" values, as saved by impala in the +// original file. 
Otherwise, they're off by the local timezone offset, set to +// America/Los_Angeles in tests +val impalaExpectations = if (applyConversion) { + impalaFileData +} else { + impalaFileData.map { ts => +DateTimeUtils.toJavaTimestamp(DateTimeUtils.convertTz( + DateTimeUtils.fromJavaTimestamp(ts), + TimeZone.getTimeZone("UTC"), + TimeZone.getDefault())) + } +} +val fullExpectations = (ts ++ impalaExpectations).map { + _.toString() +}.sorted.toArray +val actual = read.map { + _.getTimestamp(0).toString() +}.sorted +withClue(s"applyConversion = $applyConversion; vectorized = $vectorized") { + assert(fullExpectations === actual) + + // Now test that the behavior is still correct even with a filter which could get + // pushed down into parquet. We don't need extra handling for pushed down + // predicates because (a) in ParquetFilters, we ignore TimestampType and (b) parquet + // does not read statistics from int96 fields, as they are unsigned. See + // scalastyle:off line.size.limit + // https://github.com/apache/parquet-mr/blob/2fd62ee4d524c270764e9b91dca72e5cf1a005b7/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L419 + // https://github.com/apache/parquet-mr/blob/2fd62ee4d524c270764e9b91dca72e5cf1a005b7/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L348 + // scalastyle:on line.size.limit + // + // Just to be defensive in case anything ever changes in parquet, this test checks + // the assumption on column stats, and also the end-to-end behavior. + + val hadoopConf = sparkContext.hadoopConfiguration + val fs = FileSystem.get(hadoopConf) + val parts = fs.listStatus(new Path(tableDir.getAbsolutePath), new PathFilter { +override def accept(path: Path): Boolean = !path.getName.startsWith("_") + }) + // grab the meta data from the parquet file. The next section of asserts just make + // sure the test is configured
[GitHub] spark pull request #19630: [SPARK-22409] Introduce function type argument in...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19630#discussion_r151677913 --- Diff: python/pyspark/sql/functions.py --- @@ -2049,132 +2050,12 @@ def map_values(col): # User Defined Function -- -def _wrap_function(sc, func, returnType): -command = (func, returnType) -pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command) -return sc._jvm.PythonFunction(bytearray(pickled_command), env, includes, sc.pythonExec, - sc.pythonVer, broadcast_vars, sc._javaAccumulator) - - -class PythonUdfType(object): -# row-at-a-time UDFs -NORMAL_UDF = 0 -# scalar vectorized UDFs -PANDAS_UDF = 1 -# grouped vectorized UDFs -PANDAS_GROUPED_UDF = 2 - - -class UserDefinedFunction(object): --- End diff -- Yup, I noticed it first too when I reviewed, but then noticed he imported this intentionally: https://github.com/icexelloss/spark/blob/cf1d1caa4f41c6bcf565cfc5b9e9901d94f56af3/python/pyspark/sql/functions.py#L35 So, I guess it could be fine. I just double-checked manually: ```python >>> from pyspark.sql import functions >>> functions.UserDefinedFunction >>> from pyspark import sql >>> sql.functions.UserDefinedFunction >>> from pyspark.sql.functions import UserDefinedFunction >>> from pyspark.sql.udf import UserDefinedFunction ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19739: [SPARK-22513][BUILD] Provide build profile for hadoop 2....
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19739 In any event, you can always produce your own build without any POM changes that does exactly this with `-Dhadoop.version=2.8.2` if you wanted to. You can close this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...
Github user skambha commented on the issue: https://github.com/apache/spark/pull/19747 I have taken care of adding the check in the new HiveClientImpl.alterTableDataSchema as well and have added some new tests. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19760: [SPARK-22533][core] Handle deprecated names in ConfigEnt...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19760 I'm OK with moving the deprecated config keys of `MAX_REMOTE_BLOCK_SIZE_FETCH_TO_MEM` and `LISTENER_BUS_EVENT_QUEUE_CAPACITY` to `SparkConf` if the deprecation message really matters. But I'd like to keep `withAlternatives`. Generally it's a better interface, and my future plan is to move config-related stuff to a new maven module, so it can be used in modules that don't depend on the core module (e.g. the network module). It would be annoying if, every time we want to deprecate a conf, we need to change the config module. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
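The `withAlternatives` idea being defended here can be sketched as follows. This is an illustrative model only, not Spark's real `ConfigEntry`/`ConfigBuilder` code, and the `spark.new.key`/`spark.old.key` names are hypothetical: the entry itself carries its deprecated alternative keys and falls back to them on read, so deprecating a conf does not require touching a central list in `SparkConf`.

```java
import java.util.List;
import java.util.Map;

// Model of an entry that knows its own deprecated alternative names.
public class ConfigEntrySketch {
    private final String key;                 // current, preferred key
    private final List<String> alternatives;  // deprecated names, newest first
    private final String defaultValue;

    public ConfigEntrySketch(String key, List<String> alternatives, String defaultValue) {
        this.key = key;
        this.alternatives = alternatives;
        this.defaultValue = defaultValue;
    }

    public String readFrom(Map<String, String> conf) {
        // the current key always wins if both are set
        if (conf.containsKey(key)) {
            return conf.get(key);
        }
        // otherwise, deprecated names are still honored
        for (String alt : alternatives) {
            if (conf.containsKey(alt)) {
                return conf.get(alt);
            }
        }
        return defaultValue;
    }
}
```

With this shape, a module that only depends on the config code can resolve deprecated names itself; a warning message, if desired, would be the only piece that might still live near `SparkConf`.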
[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19630 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83962/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN command
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19774 **[Test build #83966 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83966/testReport)** for PR 19774 at commit [`24bfcb1`](https://github.com/apache/spark/commit/24bfcb1132d35ffa8ba2341a7ea9057b14b5ab8a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN command
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19774 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19775: Add support for publishing Spark metrics into Pro...
GitHub user matyix opened a pull request: https://github.com/apache/spark/pull/19775 Add support for publishing Spark metrics into Prometheus ## What changes were proposed in this pull request? _Originally this PR was submitted to the Spark on K8S fork [here](https://github.com/apache-spark-on-k8s/spark/pull/531), but I was advised by @erikerlandson and @foxish to resend it upstream. K8S-specific items were removed and the PR was reworked for the Apache version._ Publishing Spark metrics into Prometheus, as highlighted in the [JIRA](https://issues.apache.org/jira/browse/SPARK-22343). Implemented a metrics sink that publishes Spark metrics into Prometheus via the [Prometheus Pushgateway](https://prometheus.io/docs/instrumenting/pushing/). Metrics data published by Spark is based on [Dropwizard](http://metrics.dropwizard.io/). The format of Spark metrics is not supported natively by Prometheus, so they are converted using [DropwizardExports](https://prometheus.io/client_java/io/prometheus/client/dropwizard/DropwizardExports.html) prior to pushing them to the pushgateway. Also, the default Prometheus pushgateway client API implementation does not support metric timestamps, so the client API has been enhanced to enrich metrics data with timestamps. ## How was this patch tested? This PR does not affect the existing code base or alter its functionality. Nevertheless, I have executed all `unit and integration` tests. This setup has also been deployed and monitored via Prometheus (Prometheus 1.7.1 + Pushgateway 0.3.1). `Manual` testing by deploying a Spark cluster, Prometheus server, and Pushgateway, then running SparkPi.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/banzaicloud/spark apache_master_prometheus_support Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19775.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19775 commit 579cca96af187cf50fbedf5927cdea4e0bbdff26 Author: Janos Matyas Date: 2017-10-17T18:51:50Z Add support for prometheus --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
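For orientation, a sink like this would be wired up through `conf/metrics.properties`, following the convention of Spark's existing sinks (e.g. the Graphite sink). The class name and property keys below are assumptions for illustration only; they are not confirmed by this PR:

```properties
# Hypothetical wiring; class name and property keys are illustrative only.
*.sink.prometheus.class=org.apache.spark.metrics.sink.PrometheusSink
# address of the Prometheus Pushgateway the sink pushes to
*.sink.prometheus.pushgateway-address=pushgateway-host:9091
# how often metrics are pushed
*.sink.prometheus.period=10
*.sink.prometheus.unit=seconds
```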
[GitHub] spark issue #19775: Add support for publishing Spark metrics into Prometheus
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19775 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19765: [SPARK-22540][SQL] Ensure HighlyCompressedMapStat...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19765 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19257: [SPARK-22042] [SQL] ReorderJoinPredicates can bre...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19257#discussion_r151714611 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala --- @@ -602,6 +602,28 @@ abstract class BucketedReadSuite extends QueryTest with SQLTestUtils { ) } + test("SPARK-22042 ReorderJoinPredicates can break when child's partitioning is not decided") { +withTable("bucketed_table", "table1", "table2") { + df.write.format("parquet").saveAsTable("table1") + df.write.format("parquet").saveAsTable("table2") + df.write.format("parquet").bucketBy(8, "j", "k").saveAsTable("bucketed_table") + + withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "0") { +sql(""" + |SELECT * + |FROM ( + | SELECT a.i, a.j, a.k + | FROM bucketed_table a + | JOIN table1 b + | ON a.i = b.i + |) c + |JOIN table2 + |ON c.i = table2.i + |""".stripMargin).explain() --- End diff -- use checkAnswer instead of explain in the test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19767: [SPARK-22543][SQL] fix java 64kb compile error for deepl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19767 **[Test build #83963 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83963/testReport)** for PR 19767 at commit [`3dab5bd`](https://github.com/apache/spark/commit/3dab5bdbc4d2bb1818c46905afb92422bac04d9e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19769 **[Test build #83971 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83971/testReport)** for PR 19769 at commit [`9bb4cf0`](https://github.com/apache/spark/commit/9bb4cf0514dddc005b90ddb17a22d3b05be929e5). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19767: [SPARK-22543][SQL] fix java 64kb compile error for deepl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19767 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19630 **[Test build #83962 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83962/testReport)** for PR 19630 at commit [`cf1d1ca`](https://github.com/apache/spark/commit/cf1d1caa4f41c6bcf565cfc5b9e9901d94f56af3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/19630 Everyone, I don't have any more changes to the PR. I think all comments have been addressed at this point. Please let me know if I missed anything or there are more comments. Thank you! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN command
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19774 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83966/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19630 thanks, merging to master, cheers! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...
Github user squito commented on the issue: https://github.com/apache/spark/pull/19769 cc @henryr @zivanfi --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19730: [SPARK-22500][SQL] Fix 64KB JVM bytecode limit problem w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19730 **[Test build #83970 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83970/testReport)** for PR 19730 at commit [`83fef40`](https://github.com/apache/spark/commit/83fef403b92a96a13421901d161a0df5e6a6d7b3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN command
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19774 **[Test build #83972 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83972/testReport)** for PR 19774 at commit [`9bfa80c`](https://github.com/apache/spark/commit/9bfa80cca04a3b00e0fc2b02beb45c56f2058a34). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/19630 @HyukjinKwon Thanks for the reply on coverage. It'd be great to have an easy way to run coverage :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19257: [SPARK-22042] [SQL] ReorderJoinPredicates can break when...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19257 After some more thought, I think the best choice is to do planning bottom-up. That requires a lot of refactoring, so I'm fine with merging this workaround first. LGTM except one minor comment on the test. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r151689883 --- Diff: sql/core/src/test/resources/sql-tests/inputs/describe-table-column.sql --- @@ -24,6 +24,18 @@ DESC EXTENDED desc_col_table key; DESC FORMATTED desc_col_table key; +SET spark.sql.statistics.histogram.enabled=true; +SET spark.sql.statistics.histogram.numBins=2; + +INSERT INTO desc_col_table values(1); +INSERT INTO desc_col_table values(2); +INSERT INTO desc_col_table values(3); +INSERT INTO desc_col_table values(4); --- End diff -- INSERT INTO desc_col_table values 1, 2, 3, 4 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19774: [SPARK-22475][SQL] show histogram in DESC COLUMN ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19774#discussion_r151693478 --- Diff: sql/core/src/test/resources/sql-tests/inputs/describe-table-column.sql --- @@ -24,6 +24,18 @@ DESC EXTENDED desc_col_table key; DESC FORMATTED desc_col_table key; +SET spark.sql.statistics.histogram.enabled=true; +SET spark.sql.statistics.histogram.numBins=2; + +INSERT INTO desc_col_table values(1); +INSERT INTO desc_col_table values(2); +INSERT INTO desc_col_table values(3); +INSERT INTO desc_col_table values(4); + +ANALYZE TABLE desc_col_table COMPUTE STATISTICS FOR COLUMNS key; --- End diff -- please set the sql conf back to default value. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19765: [SPARK-22540][SQL] Ensure HighlyCompressedMapStatus calc...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19765 Merged to master/2.2 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19747: [Spark-22431][SQL] Ensure that the datatype in th...
Github user skambha commented on a diff in the pull request: https://github.com/apache/spark/pull/19747#discussion_r151689272 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -40,6 +40,22 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { setupTestData() + test("SPARK-22431: table with nested type col with special char") { --- End diff -- Thanks @gatorsmile for your comments. I have addressed them in the latest commit. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19747 **[Test build #83968 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83968/testReport)** for PR 19747 at commit [`e5c2cf3`](https://github.com/apache/spark/commit/e5c2cf369912583b273ed573e3be4fdc5b9fb78d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19747 **[Test build #83969 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83969/testReport)** for PR 19747 at commit [`3be7b47`](https://github.com/apache/spark/commit/3be7b4736c93c6171677f6488c5a623c2eb38ad9). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19767: [SPARK-22543][SQL] fix java 64kb compile error for deepl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19767 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83963/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19630 **[Test build #83965 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83965/testReport)** for PR 19630 at commit [`cf1d1ca`](https://github.com/apache/spark/commit/cf1d1caa4f41c6bcf565cfc5b9e9901d94f56af3). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19630 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83965/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19630: [SPARK-22409] Introduce function type argument in pandas...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19630 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19630: [SPARK-22409] Introduce function type argument in...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19630
[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17436#discussion_r151765263

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadBenchmark.scala ---
@@ -260,6 +261,7 @@ object ParquetReadBenchmark {
   def stringWithNullsScanBenchmark(values: Int, fractionOfNulls: Double): Unit = {
     withTempPath { dir =>
       withTempTable("t1", "tempTable") {
+        val enableOffHeapColumnVector = spark.sqlContext.conf.offHeapColumnVectorEnabled

--- End diff --

nit: spark.sessionState.conf.offHeapColumnVectorEnabled
[GitHub] spark issue #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "spark.m...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17436 LGTM except a few minor comments; please update the PR title and description, thanks!
[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17436#discussion_r151764870

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
@@ -62,7 +69,11 @@ case class InMemoryTableScanExec(
   private def createAndDecompressColumn(cachedColumnarBatch: CachedBatch): ColumnarBatch = {
     val rowCount = cachedColumnarBatch.numRows
-    val columnVectors = OnHeapColumnVector.allocateColumns(rowCount, columnarBatchSchema)
+    val columnVectors = if (!conf.offHeapColumnVectorEnabled) {

--- End diff --

only enable it when `TaskContext.get != null`?
[GitHub] spark issue #19390: [SPARK-18935][MESOS] Fix dynamic reservations on mesos
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19390 Merged build finished. Test PASSed.
[GitHub] spark issue #19390: [SPARK-18935][MESOS] Fix dynamic reservations on mesos
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19390 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83967/ Test PASSed.
[GitHub] spark pull request #19772: [SPARK-22538][ML] SQLTransformer should not unper...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19772#discussion_r151732579

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala ---
@@ -70,7 +70,8 @@ class SQLTransformer @Since("1.6.0") (@Since("1.6.0") override val uid: String)
     dataset.createOrReplaceTempView(tableName)
     val realStatement = $(statement).replace(tableIdentifier, tableName)
     val result = dataset.sparkSession.sql(realStatement)
-    dataset.sparkSession.catalog.dropTempView(tableName)

--- End diff --

It seems like a bug: when you cache a dataframe, create a view from the dataframe, and drop the view, Spark should not uncache the original dataframe. We can discuss more later.
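The failure mode described above can be sketched with a toy model. This is a hypothetical illustration, not Spark's actual `CacheManager` or the fix that was merged: when the cache is keyed by the logical plan and a view shares its plan with the dataframe it was created from, dropping the view by name and evicting its plan also uncaches the original dataframe. All class, field, and method names below are invented for the sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of a plan-keyed cache plus a name -> plan view registry.
public class ViewCacheSketch {
  final Map<String, Integer> viewToPlan = new HashMap<>(); // view name -> plan id
  final Set<Integer> cachedPlans = new HashSet<>();        // cached plan ids

  void cachePlan(int planId) { cachedPlans.add(planId); }

  void createView(String name, int planId) { viewToPlan.put(name, planId); }

  // Buggy drop: evicting the shared plan also uncaches the original dataframe.
  void dropViewBuggy(String name) {
    Integer plan = viewToPlan.remove(name);
    if (plan != null) cachedPlans.remove(plan);
  }

  // Safer drop: only remove the name binding; leave the cache entry alone.
  void dropViewSafe(String name) { viewToPlan.remove(name); }

  public static void main(String[] args) {
    ViewCacheSketch s = new ViewCacheSketch();
    int dfPlan = 42;
    s.cachePlan(dfPlan);       // user caches the dataframe
    s.createView("t", dfPlan); // view shares the same plan
    s.dropViewBuggy("t");
    // the dataframe's cache entry is gone even though only the view was dropped
    System.out.println(s.cachedPlans.contains(dfPlan)); // false
  }
}
```

The sketch only shows why eviction by shared plan is surprising; the real resolution in Spark involved how `SQLTransformer` manages its temporary view, not a `dropViewSafe`-style API.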
[GitHub] spark issue #19775: [SPARK-22343][core] Add support for publishing Spark met...
Github user erikerlandson commented on the issue: https://github.com/apache/spark/pull/19775 @matyix thanks for re-submitting!
[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17436#discussion_r151764498

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java ---
@@ -101,9 +101,13 @@
   private boolean returnColumnarBatch;

   /**
-   * The default config on whether columnarBatch should be offheap.
+   * The config on whether columnarBatch should be offheap.

--- End diff --

nit: the memory mode of the columnarBatch
[GitHub] spark pull request #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17436#discussion_r151764317

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -140,6 +140,13 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)

+  val COLUMN_VECTOR_OFFHEAP_ENABLED =
+    buildConf("spark.sql.columnVector.offheap.enable")
+      .internal()
+      .doc("When true, use OffHeapColumnVector in ColumnarBatch.")
+      .booleanConf
+      .createWithDefault(true)

--- End diff --

hey let's not change the existing behavior.
[GitHub] spark issue #19728: [SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem w...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19728 can you split it into 3 PRs? The approaches for these 3 expressions are quite different.
[GitHub] spark issue #19388: [SPARK-22162] Executors and the driver should use consis...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19388 **[Test build #83976 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83976/testReport)** for PR 19388 at commit [`500c73c`](https://github.com/apache/spark/commit/500c73cc96290efe0194e371ab84e0cda863347d).
[GitHub] spark pull request #19730: [SPARK-22500][SQL] Fix 64KB JVM bytecode limit pr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19730#discussion_r151723565

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala ---
@@ -827,4 +827,34 @@ class CastSuite extends SparkFunSuite with ExpressionEvalHelper {
     checkEvaluation(cast(Literal.create(input, from), to), input)
   }
+
+  test("SPARK-22500: cast for struct should not generate codes beyond 64KB") {
+    val N = 1000
+
+    val from1 = new StructType(
+      (1 to N).map(i => StructField(s"s$i", StringType)).toArray)
+    val to1 = new StructType(
+      (1 to N).map(i => StructField(s"i$i", IntegerType)).toArray)
+    val input1 = Row.fromSeq((1 to N).map(i => i.toString))
+    val output1 = Row.fromSeq((1 to N))
+    checkEvaluation(cast(Literal.create(input1, from1), to1), output1)
+
+    val from2 = new StructType(
+      (1 to N).map(i => StructField(s"a$i", ArrayType(StringType, containsNull = false))).toArray)

--- End diff --

or just test this case.
[GitHub] spark pull request #19730: [SPARK-22500][SQL] Fix 64KB JVM bytecode limit pr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19730#discussion_r151725673

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala ---
@@ -1039,13 +1039,19 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String]
       }
     }
    """
-    }.mkString("\n")
+    }
+    val fieldsEvalCodes = if (ctx.INPUT_ROW != null && ctx.currentVars == null) {
+      ctx.splitExpressions(fieldsEvalCode, "castStruct",
+        ("InternalRow", ctx.INPUT_ROW) :: (rowClass, result) :: ("InternalRow", tmpRow) :: Nil)

--- End diff --

I mean, we don't need to pass in `ctx.INPUT_ROW` to the split functions.
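The `splitExpressions` call discussed above exists because the JVM rejects any method whose bytecode exceeds 64KB, so Spark's codegen batches long runs of generated statements into small helper methods and then calls them in sequence. The following is a standalone sketch of that idea, not Spark's actual `CodegenContext` API; the names `splitIntoFunctions`, `apply_N`, and `chunkSize` are all invented for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: group generated statements into fixed-size helper methods so that
// no single generated method body grows without bound.
public class CodegenSplitSketch {
  // Wraps every `chunkSize` statements in their own private method and
  // returns the generated method bodies followed by the chain of calls.
  static String splitIntoFunctions(List<String> statements, int chunkSize) {
    StringBuilder funcs = new StringBuilder();
    StringBuilder calls = new StringBuilder();
    for (int i = 0, f = 0; i < statements.size(); i += chunkSize, f++) {
      funcs.append("private void apply_").append(f).append("(InternalRow row) {\n");
      for (int j = i; j < Math.min(i + chunkSize, statements.size()); j++) {
        funcs.append("  ").append(statements.get(j)).append("\n");
      }
      funcs.append("}\n");
      calls.append("apply_").append(f).append("(row);\n");
    }
    return funcs + "// caller:\n" + calls;
  }

  public static void main(String[] args) {
    List<String> stmts = Arrays.asList(
        "value_1 = compute_1(row);",
        "value_2 = compute_2(row);",
        "value_3 = compute_3(row);");
    System.out.println(splitIntoFunctions(stmts, 2));
  }
}
```

The review comment then becomes a question of which variables each helper needs as parameters: in this sketch every helper takes the input row, but if a helper only touches state that is already reachable (e.g. fields), passing the row in is unnecessary.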
[GitHub] spark issue #19257: [SPARK-22042] [SQL] ReorderJoinPredicates can break when...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19257 Thank you for the decision, @cloud-fan . It's great to see the progress on this!
[GitHub] spark pull request #19730: [SPARK-22500][SQL] Fix 64KB JVM bytecode limit pr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19730#discussion_r151723430

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala ---
@@ -827,4 +827,34 @@ class CastSuite extends SparkFunSuite with ExpressionEvalHelper {
     checkEvaluation(cast(Literal.create(input, from), to), input)
   }
+
+  test("SPARK-22500: cast for struct should not generate codes beyond 64KB") {
+    val N = 1000
+
+    val from1 = new StructType(
+      (1 to N).map(i => StructField(s"s$i", StringType)).toArray)
+    val to1 = new StructType(
+      (1 to N).map(i => StructField(s"i$i", IntegerType)).toArray)
+    val input1 = Row.fromSeq((1 to N).map(i => i.toString))
+    val output1 = Row.fromSeq((1 to N))
+    checkEvaluation(cast(Literal.create(input1, from1), to1), output1)
+
+    val from2 = new StructType(
+      (1 to N).map(i => StructField(s"a$i", ArrayType(StringType, containsNull = false))).toArray)

--- End diff --

I'd expect something like

```
val from2 = new StructType(
  (1 to N).map(i => StructField(s"s$i", from1)).toArray)
```
[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19747 **[Test build #83968 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83968/testReport)** for PR 19747 at commit [`e5c2cf3`](https://github.com/apache/spark/commit/e5c2cf369912583b273ed573e3be4fdc5b9fb78d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19772: [SPARK-22538][ML] SQLTransformer should not unper...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19772
[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19747 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83969/ Test PASSed.
[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19747 Merged build finished. Test PASSed.