[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/10289#discussion_r47920442

--- Diff: core/src/test/scala/org/apache/spark/util/collection/ExternalSorterSuite.scala ---
@@ -235,7 +235,7 @@ class ExternalSorterSuite extends SparkFunSuite with LocalSparkContext {
        it.next()
      }
    }
-
+*/
--- End diff --

Sorry, this is my mistake. This file should not be modified.
[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/10289#discussion_r47922285

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/MLlibTestSparkContext.scala ---
@@ -38,12 +38,15 @@ trait MLlibTestSparkContext extends BeforeAndAfterAll { self: Suite =>
   }

   override def afterAll() {
-    sqlContext = null
-    SQLContext.clearActive()
-    if (sc != null) {
-      sc.stop()
-    }
+    try {
+      sqlContext = null
+      SQLContext.clearActive()
+      if (sc != null) {
+sc.stop()
+      }
       sc = null
-    super.afterAll()
+    } finally {
--- End diff --

Yes, I will insert two spaces for an indent.
[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/10289#discussion_r47920869

--- Diff: yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala ---
@@ -116,7 +120,7 @@ abstract class BaseYarnClusterSuite
   override def afterAll() {
     yarnCluster.stop()
-    System.clearProperty("SPARK_YARN_MODE")
+    System.setProperties(oldProperties)
--- End diff --

I am afraid that another property may be changed somewhere. I will keep the original code since it works well now.
[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/10289

[SPARK-12311][CORE] Restore previous value of "os.arch" property in test suites after forcing to set specific value to "os.arch" property

Restore the original value of the os.arch property after each test. Since some tests force a specific value for the os.arch property, we need to restore the original value afterwards.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kiszk/spark SPARK-12311

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10289.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #10289

commit df0848c94e91333d398b6f793d82c2f2f80d07a0
Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
Date: 2015-12-14T07:32:52Z

    restore the original value of os.arch property after each test

    Since some of tests forced to set the specific value to os.arch property, we need to set the original value.
[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...
Github user kiszk commented on the pull request: https://github.com/apache/spark/pull/10289#issuecomment-164779119

@srowen, OK, I will correct this pattern everywhere.
[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...
Github user kiszk commented on the pull request: https://github.com/apache/spark/pull/10289#issuecomment-164802842

There are two potential bugs:

1. A method of the super class is not called in `beforeEach()`, `afterEach()`, `beforeAll()`, and `afterAll()`.
2. Although `System.setProperty()` is called, `ResetSystemProperties` is not mixed in.

I will correct them.
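For illustration, a minimal sketch of the corrected pattern, assuming a ScalaTest suite; the trait name `RestoresProperties` is hypothetical (Spark's real helper is the `ResetSystemProperties` mix-in):

```scala
import org.scalatest.{BeforeAndAfterAll, Suite}

// Hypothetical trait illustrating the two fixes: always delegate to the
// super class, and snapshot/restore system properties around the suite.
trait RestoresProperties extends BeforeAndAfterAll { self: Suite =>
  private var saved: java.util.Properties = _

  override def beforeAll(): Unit = {
    super.beforeAll() // fix 1: never drop the super call
    saved = new java.util.Properties()
    saved.putAll(System.getProperties) // snapshot before tests mutate them
  }

  override def afterAll(): Unit = {
    try {
      System.setProperties(saved) // fix 2: restore what setProperty changed
    } finally {
      super.afterAll() // runs even if the restore above throws
    }
  }
}
```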
[GitHub] spark pull request: [SPARK-12580][SQL] Remove string concatenation...
Github user kiszk commented on the pull request: https://github.com/apache/spark/pull/10524#issuecomment-168571357

I see. Sounds good. I will reformat them.
[GitHub] spark pull request: [SPARK-12580][SQL] Remove string concatenation...
Github user kiszk commented on the pull request: https://github.com/apache/spark/pull/10524#issuecomment-168872227

Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-12580][SQL] Remove string concatenation...
Github user kiszk commented on the pull request: https://github.com/apache/spark/pull/10524#issuecomment-168127380

@yhuai, I think that it will work. How many characters per line do you think would be preferable?
[GitHub] spark pull request: [SPARK-12530][Build] Fix build break at Spark-...
Github user kiszk commented on the pull request: https://github.com/apache/spark/pull/10488#issuecomment-167848836

@yhuai, do you mean that I should update all of the string concatenations in @ExpressionDescription to use multi-line string literals, rather than only the original one? If so, I will make this update.
[GitHub] spark pull request: [SPARK-12530][Build] Fix build break at Spark-...
Github user kiszk commented on the pull request: https://github.com/apache/spark/pull/10488#issuecomment-167853136

I see. I will create another JIRA entry to update other usages.
[GitHub] spark pull request: [SPARK-12640][SQL] Add simple benchmarking uti...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/10589#discussion_r48960725

--- Diff: core/src/test/scala/org/apache/spark/Benchmark.scala ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.collection.mutable
+
+import org.apache.commons.lang3.SystemUtils
+import org.apache.spark.util.Utils
+
+/**
+ * Utility class to benchmark components. An example of how to use this is:
+ *   val benchmark = new Benchmark("My Benchmark", valuesPerIteration)
+ *   benchmark.addCase("V1", <function>)
+ *   benchmark.addCase("V2", <function>)
+ *   benchmark.run
+ * This will output the average time to run each function and the rate of each function.
+ *
+ * The benchmark function takes one argument that is the iteration that's being run.
+ */
+class Benchmark(name: String, valuesPerIteration: Long, iters: Int = 5) {
+  val benchmarks = mutable.ArrayBuffer.empty[Benchmark.Case]
+
+  def addCase(name: String, f: Int => Unit): Unit = {
+    benchmarks += Benchmark.Case(name, f)
+  }
+
+  /**
+   * Runs the benchmark and outputs the results to stdout. This should be copied and added as
+   * a comment with the benchmark. Although the results vary from machine to machine, it should
+   * provide some baseline.
+   */
+  def run(): Unit = {
+    require(benchmarks.nonEmpty)
+    val results = benchmarks.map { c =>
+      Benchmark.measure(valuesPerIteration, c.fn, iters)
+    }
+    val firstRate = results.head.avgRate
+    // scalastyle:off
+    // The results are going to be processor specific so it is useful to include that.
+    println(Benchmark.getProcessorName())
+    printf("%-24s %16s %16s %14s\n", name + ":", "Avg Time(ms)", "Avg Rate(M/s)", "Relative Rate")
+    println("--------------------------------------------------------------------------")
+    results.zip(benchmarks).foreach { r =>
+      printf("%-24s %16s %16s %14s\n", r._2.name, r._1.avgMs.toString, "%10.2f" format r._1.avgRate,
+        "%6.2f X" format (r._1.avgRate / firstRate))
+    }
+    println
+    // scalastyle:on
+  }
+}
+
+object Benchmark {
+  case class Case(name: String, fn: Int => Unit)
+  case class Result(avgMs: Double, avgRate: Double)
+
+  /**
+   * This should return a user helpful processor information. Getting at this depends on the OS.
+   * This should return something like "Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz"
+   */
+  def getProcessorName(): String = {
+    if (SystemUtils.IS_OS_MAC_OSX) {
+      Utils.executeAndGetOutput(Seq("/usr/sbin/sysctl", "-n", "machdep.cpu.brand_string"))
+    } else if (SystemUtils.IS_OS_LINUX) {
+      Utils.executeAndGetOutput(Seq("/usr/bin/grep", "-m", "1", "\"model name\"", "/proc/cpuinfo"))
+    } else {
+      System.getenv("PROCESSOR_IDENTIFIER")
+    }
+  }

+  /**
+   * Runs a single function `f` for iters, returning the average time the function took and
+   * the rate of the function.
+   */
+  def measure(num: Long, f: Int => Unit, iters: Int): Result = {
+    var totalTime = 0L
--- End diff --

Printing detailed timing for each run sounds good. It also allows us to check the variation in elapsed time across runs.
[GitHub] spark pull request: [SPARK-12640][SQL] Add simple benchmarking uti...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/10589#discussion_r48922143

--- Diff: core/src/test/scala/org/apache/spark/Benchmark.scala ---
[... same Benchmark.scala diff as quoted in the previous comment; the hunk under discussion ends with: ...]
+  def measure(num: Long, f: Int => Unit, iters: Int): Result = {
+    var totalTime = 0L
--- End diff --

If you would like to measure peak performance, a warm-up run is necessary before the measurement. The current driver may include execution time spent in the bytecode interpreter. This is because the just-in-time compiler in a Java runtime usually starts translating Java bytecode into native CPU code only after the number of method invocations/branches exceeds a preset threshold (e.g., 10,000 for OracleJDK). Until the native code is ready, the Java bytecode is executed by the interpreter.
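For illustration, a minimal sketch of the warm-up idea in isolation (not the Benchmark class itself; the helper name is hypothetical): run the function a few times untimed so the JIT can compile the hot path, then time only the later iterations.

```scala
def measureWithWarmup(f: Int => Unit, warmups: Int, iters: Int): Double = {
  for (i <- 0 until warmups) f(i) // untimed: lets the JIT compile the hot path
  var totalNs = 0L
  for (i <- 0 until iters) {
    val start = System.nanoTime()
    f(i)
    totalNs += System.nanoTime() - start // timed: runs mostly compiled code
  }
  totalNs.toDouble / iters / 1e6 // average milliseconds per iteration
}
```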
[GitHub] spark pull request: [SPARK-12635][SQL] Add ColumnarBatch, an in me...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/10628#discussion_r49050418

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java ---
@@ -0,0 +1,162 @@
+package org.apache.spark.sql.execution.vectorized;
+
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.DoubleType;
+import org.apache.spark.sql.types.IntegerType;
+import org.apache.spark.unsafe.Platform;
+
+import java.nio.ByteBuffer;
+import java.nio.DoubleBuffer;
+
+/**
+ * Column data backed using offheap memory.
+ */
+public final class OffHeapColumnVector extends ColumnVector {
+  // The data stored in these two allocations need to maintain binary compatible. We can
+  // directly pass this buffer to external components.
+  private long nulls;
+  private long data;
+
+  protected OffHeapColumnVector(int capacity, DataType type) {
+    super(capacity, type);
+    this.nulls = Platform.allocateMemory(capacity);
--- End diff --

Would it be better to allocate one combined memory region, taking the area for this.data first and the area for this.nulls next? Then this.data would be expected to be aligned with a cache-line boundary.
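A hedged sketch of the suggested layout, assuming Spark's `Platform.allocateMemory` API: one combined allocation with the `data` region first, so it starts at the allocator-aligned base address, and the `nulls` region after it. Note that malloc-style allocators typically guarantee only 8- or 16-byte alignment, so strict 64-byte cache-line alignment would still require explicit padding.

```scala
import org.apache.spark.unsafe.Platform

// One combined allocation: `data` first (benefits from the allocator's base
// alignment), the one-byte-per-row `nulls` region immediately after it.
def allocateCombined(capacity: Int, elementSize: Int): (Long, Long) = {
  val dataBytes = capacity.toLong * elementSize
  val base = Platform.allocateMemory(dataBytes + capacity)
  val data = base              // element data: starts at the aligned base
  val nulls = base + dataBytes // null flags: placed after the data region
  (data, nulls)
}
```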
[GitHub] spark pull request: [SPARK-12580][SQL] Remove string concatenation...
Github user kiszk commented on the pull request: https://github.com/apache/spark/pull/10524#issuecomment-168038054

Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/10593#discussion_r48920875

--- Diff: core/src/test/scala/org/apache/spark/Benchmark.scala ---
[... essentially the same Benchmark.scala diff as quoted above; the hunk under discussion ends with: ...]
+  def measure(num: Long, f: Int => Unit, iters: Int): Result = {
+    var totalTime = 0L
+    for (i <- 0 until iters + 1) {
+      val start = System.currentTimeMillis()
--- End diff --

How about calling System.nanoTime() for short-running benchmarks instead of System.currentTimeMillis()?
[GitHub] spark pull request: [SPARK-12640][SQL] Add simple benchmarking uti...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/10589#discussion_r48921242

--- Diff: core/src/test/scala/org/apache/spark/Benchmark.scala ---
[... same Benchmark.scala diff as quoted above; the hunk under discussion ends with: ...]
+  def measure(num: Long, f: Int => Unit, iters: Int): Result = {
+    var totalTime = 0L
+    for (i <- 0 until iters + 1) {
+      val start = System.currentTimeMillis()
--- End diff --

How about calling System.nanoTime() for short-running benchmarks instead of calling System.currentTimeMillis()?
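A small self-contained illustration of why the timer resolution matters (the workload and numbers are illustrative, not from the PR): a sub-millisecond function often reads as 0 ms when timed with currentTimeMillis, while nanoTime resolves it.

```scala
def cheapWork(): Long = (1 to 100000).foldLeft(0L)(_ + _)

val t0 = System.currentTimeMillis()
cheapWork()
println(s"millis-based: ${System.currentTimeMillis() - t0} ms") // often prints 0

val n0 = System.nanoTime()
cheapWork()
println(s"nano-based: ${(System.nanoTime() - n0) / 1e6} ms") // e.g. ~0.2 ms
```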
[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/10289#discussion_r48209450

--- Diff: yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala ---
@@ -43,12 +45,21 @@ import org.apache.spark.util.Utils

 class ClientSuite extends SparkFunSuite with Matchers with BeforeAndAfterAll {

+  var oldProperties: Properties = null
--- End diff --

This is because System.setProperty() is not used in this file.
[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/10289#discussion_r48209783

--- Diff: yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala ---
@@ -116,7 +120,7 @@ abstract class BaseYarnClusterSuite
   override def afterAll() {
     yarnCluster.stop()
-    System.clearProperty("SPARK_YARN_MODE")
+    System.setProperties(oldProperties)
     super.afterAll()
--- End diff --

I agree. Since the original code did not use `try finally`, I did not use it either. Should I use `try finally`?
[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...
Github user kiszk commented on the pull request: https://github.com/apache/spark/pull/10289#issuecomment-166765714

Can I retest this? A timeout may have occurred in pyspark.
[GitHub] spark pull request: [SPARK-12530][Build] Fix build break at Spark-...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/10488

[SPARK-12530][Build] Fix build break at Spark-Master-Maven-Snapshots from #1293

The compilation error is caused by string concatenations that are not a constant. Use a raw string literal to avoid the string concatenations.

https://amplab.cs.berkeley.edu/jenkins/view/Spark-Packaging/job/Spark-Master-Maven-Snapshots/1293/

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kiszk/spark SPARK-12530

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10488.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #10488

commit 9658c3b6b66034a515a3e5d3cfd3d6cc9c4a3305
Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
Date: 2015-12-27T08:24:38Z

    fix compilation error

    use raw string literal to avoid string concatenations
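For illustration, a hedged sketch of the kind of change involved. It assumes Spark's `ExpressionDescription` annotation (a Java annotation with `usage` and `extended` String elements); `Sha2Sketch` is a hypothetical stand-in for the real expression class, and the doc text is abbreviated.

```scala
import org.apache.spark.sql.catalyst.expressions.ExpressionDescription

// Before (breaks under Scala 2.11): more than one "+" in the annotation
// argument, which the compiler no longer folds into a single constant.
// usage = "_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family " +
//   "as a hex string of the input. SHA-224, SHA-256, SHA-384, and " +
//   "SHA-512 are supported."

// After: one raw multi-line literal, no concatenation at all.
@ExpressionDescription(
  usage = """_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family as a hex
    string of the input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported.""",
  extended = "> SELECT _FUNC_('Spark', 0);")
case class Sha2Sketch(input: String, bitLength: Int)
```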
[GitHub] spark pull request: [SPARK-12530][Build] Fix build break at Spark-...
Github user kiszk commented on the pull request: https://github.com/apache/spark/pull/10488#issuecomment-167418794

Thanks for letting me know about them. I will check them.
[GitHub] spark pull request: [SPARK-12530][Build] Fix build break at Spark-...
Github user kiszk commented on the pull request: https://github.com/apache/spark/pull/10488#issuecomment-167424111

@hvanhovell, the following two conditions together cause this compilation failure:

1. More than one string concatenation is used.
2. The Scala 2.11 compiler is used.

(An open question is why [this failure](https://amplab.cs.berkeley.edu/jenkins/view/Spark-Packaging/job/Spark-Master-Maven-Snapshots/1293/consoleFull) occurs with both the Scala 2.10 and 2.11 compilers on Jenkins.)

I can reproduce this compilation error by using the following commands:

    $ java -version
    openjdk version "1.8.0_65"
    OpenJDK Runtime Environment (build 1.8.0_65-b17)
    OpenJDK 64-Bit Server VM (build 25.65-b01, mixed mode)
    $ build/mvn -Dscala-2.11 -Pyarn -Phadoop-2.4 -DskipTests clean package install -pl sql/catalyst
[GitHub] spark pull request: [SPARK-12530][Build] Fix build break at Spark-...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/10488#discussion_r48472393

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
@@ -57,9 +57,10 @@ case class Md5(child: Expression) extends UnaryExpression with ImplicitCastInputTypes {
  * the hash length is not one of the permitted values, the return value is NULL.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family as a hex string of the " +
-    "input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 0 is equivalent " +
-    "to 256",
+  usage =
+    """_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family as a hex string of the input.
+      SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 0 is equivalent to 256."""
+  ,
--- End diff --

Yes, this keeps each line within 100 characters.
[GitHub] spark pull request: [SPARK-12530][Build] Fix build break at Spark-...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/10488#discussion_r48498744

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
@@ -57,9 +57,10 @@ case class Md5(child: Expression) extends UnaryExpression with ImplicitCastInputTypes {
  * the hash length is not one of the permitted values, the return value is NULL.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family as a hex string of the " +
-    "input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 0 is equivalent " +
-    "to 256",
+  usage =
+    """_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family as a hex string of the input.
+      SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 0 is equivalent to 256."""
+  ,
--- End diff --

I confirmed, using ``// scalastyle:off`` and ``// scalastyle:on``, that it compiles if we use at most one string concatenation for ``usage =``.
[GitHub] spark pull request: [SPARK-12530][Build] Fix build break at Spark-...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/10488#discussion_r48500799

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
@@ -57,9 +57,10 @@ case class Md5(child: Expression) extends UnaryExpression with ImplicitCastInputTypes {
  * the hash length is not one of the permitted values, the return value is NULL.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family as a hex string of the " +
-    "input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 0 is equivalent " +
-    "to 256",
+  usage =
+    """_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family as a hex string of the input.
+      SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 0 is equivalent to 256."""
+  ,
--- End diff --

@yhuai, the above line did not cause the same issue. This is because its length is less than 100 and it has only one concatenation. I checked by inserting ``// scalastyle:off`` before the annotation and ``// scalastyle:on`` after the annotation, as follows:

```
// scalastyle:off
@ExpressionDescription(
  usage = "_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family as a hex string of the " +
    "input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 0 is equivalent to 256",
  extended = "> SELECT _FUNC_('Spark', 0);\n " +
    "'529bc3b07127ecb7e53a4dcf1991d9152c24537d919178022b2c42657f79a26b'")
// scalastyle:on
```
[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/10289#discussion_r48238449

--- Diff: yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala ---
@@ -116,7 +120,7 @@ abstract class BaseYarnClusterSuite
   override def afterAll() {
     yarnCluster.stop()
-    System.clearProperty("SPARK_YARN_MODE")
+    System.setProperties(oldProperties)
     super.afterAll()
--- End diff --

I see, I will use `try finally` here.
[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/10289#discussion_r48238546

--- Diff: yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala ---
@@ -43,12 +45,21 @@ import org.apache.spark.util.Utils

 class ClientSuite extends SparkFunSuite with Matchers with BeforeAndAfterAll {

+  var oldProperties: Properties = null
--- End diff --

Sorry, I made a mistake when I checked all of the changes in this file. I will use `ResetSystemProperties` here.
[GitHub] spark pull request: [SPARK-12502][Build][Python] Script /dev/run-t...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/10463

[SPARK-12502][Build][Python] Script /dev/run-tests fails when IBM Java is used

Fix an exception with the IBM JDK by removing the update field from the JavaVersion tuple. This is because the IBM JDK does not provide update information ('_xx').

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kiszk/spark SPARK-12502

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10463.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #10463

commit 8157ee06c54d598377f4b79467012913ef4cbfad
Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
Date: 2015-12-24T03:06:13Z

    fix an exception with IBM JDK

    remove update field from a JavaVersion tuple. This is because IBM JDK does not have information on update '_xx'
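An illustration of the underlying parsing problem (the real fix is in the Python script dev/run-tests; the helper below is hypothetical): Oracle/OpenJDK report versions like `1.8.0_65`, while the IBM JDK reports `1.8.0` with no `_update` suffix, so a parser that requires the update field fails on IBM Java.

```scala
// Hypothetical parser that tolerates a missing "_xx" update suffix.
def parseJavaVersion(v: String): (Int, Int, Int) = {
  val core = v.split('_').head // drop the update field if present
  val Array(major, minor, patch) = core.split('.').map(_.toInt)
  (major, minor, patch)
}

parseJavaVersion("1.8.0_65") // (1,8,0) -- Oracle/OpenJDK style
parseJavaVersion("1.8.0")    // (1,8,0) -- IBM JDK style, no exception
```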
[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...
Github user kiszk commented on the pull request: https://github.com/apache/spark/pull/10289#issuecomment-165960572

I will fix this merge issue this weekend.
[GitHub] spark pull request: [SPARK-12580][SQL] Remove string concatenation...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/10524

[SPARK-12580][SQL] Remove string concatenations from usage and extended in @ExpressionDescription

Use multi-line string literals for @ExpressionDescription with ``// scalastyle:off line.size.limit`` and ``// scalastyle:on line.size.limit``.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kiszk/spark SPARK-12580

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10524.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #10524

commit acb422cb3a7815418bf38cc883839cd8d4b81501
Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
Date: 2015-12-30T08:06:36Z

    use multi-line string literals for @ExpressionDescription
[GitHub] spark pull request: [SPARK-12580][SQL] Remove string concatenation...
Github user kiszk commented on the pull request: https://github.com/apache/spark/pull/10524#issuecomment-167957361

@yhuai, are these changes fine with the policy that you proposed? I would appreciate it if you would check them.
[GitHub] spark pull request: [SPARK-12580][SQL] Remove string concatenation...
Github user kiszk commented on the pull request: https://github.com/apache/spark/pull/10524#issuecomment-167996508

On Jenkins with Scala 2.11, we can wrap a string only once in a source file (i.e. use only one string concatenation). Using more than one concatenation causes a compilation error. Is it OK to use only one string concatenation?
[GitHub] spark pull request: [SPARK-12530][Build] Fix build break at Spark-...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/10488#discussion_r48534614

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
@@ -57,9 +57,10 @@ case class Md5(child: Expression) extends UnaryExpression with ImplicitCastInputTypes {
  * the hash length is not one of the permitted values, the return value is NULL.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family as a hex string of the " +
-    "input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 0 is equivalent " +
-    "to 256",
+  usage =
+    """_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family as a hex string of the input.
+      SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 0 is equivalent to 256."""
+  ,
--- End diff --

@yhuai, it sounds good to use multi-line strings for ``usage`` and ``extended``. How about this policy?

1. First, we use raw string literals to remove the concatenations.
2. If we still want to use a line with more than 100 characters, ``// scalastyle:off`` and ``// scalastyle:on`` can be used.
[GitHub] spark pull request: [SPARK-12530][Build] Fix build break at Spark-...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/10488#discussion_r48534869

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
@@ -57,9 +57,10 @@ case class Md5(child: Expression) extends UnaryExpression with ImplicitCastInputTypes {
  * the hash length is not one of the permitted values, the return value is NULL.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family as a hex string of the " +
-    "input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 0 is equivalent " +
-    "to 256",
+  usage =
+    """_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family as a hex string of the input.
+      SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 0 is equivalent to 256."""
+  ,
--- End diff --

In another attempt, I am trying to reproduce this compilation error in a standalone test case. However, I cannot reproduce the failure (i.e. the compilations finish successfully).

```
$ cat ExpressionDescription.java
import java.lang.annotation.*;
@Retention(RetentionPolicy.RUNTIME)
public @interface ExpressionDescription {
  String usage() default "_FUNC_ is undocumented";
  String extended() default "No example for _FUNC_.";
}
$ cat test.scala
@ExpressionDescription(
  usage = "_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family as a hex string of the " +
    "input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 0 is equivalent " +
    "to 256",
  extended = "> SELECT _FUNC_('Spark', 0);\n " +
    "'529bc3b07127ecb7e53a4dcf1991d9152c24537d919178022b2c42657f79a26b'")
class test {
  def func() : Unit = {
    val str = "abc" + "123" + "XYZ"
  }
}
$ javac ExpressionDescription.java
$ ~/scala-2.11.7/bin/scalac test.scala
$
```
[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13505#discussion_r65811458

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala ---
@@ -86,11 +86,31 @@ package object expressions {

   /**
    * Helper functions for working with `Seq[Attribute]`.
    */
-  implicit class AttributeSeq(attrs: Seq[Attribute]) {
+  implicit class AttributeSeq(val attrs: Seq[Attribute]) {
     /** Creates a StructType with a schema matching this `Seq[Attribute]`. */
     def toStructType: StructType = {
       StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable)))
     }
+
+    private lazy val inputArr = attrs.toArray
+
+    private lazy val inputToOrdinal = {
+      val map = new java.util.HashMap[ExprId, Int](inputArr.length * 2)
+      var index = 0
+      attrs.foreach { attr =>
+        if (!map.containsKey(attr.exprId)) {
+          map.put(attr.exprId, index)
+        }
+        index += 1
--- End diff --

Got it. I did not have a preference. Good to hear the reason for this decision.
[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13439

Oh, you are right. IMHO, it is too complex to introduce new implementation classes only for a column vector that has the same value in all of the rows. Introducing compression schemes, as implemented in ```CachedBatch```, may be a more generic solution if we do introduce new implementation classes.
[GitHub] spark issue #13539: [SPARK-15795] [SQL] Enable more optimizations in whole s...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13539

test this please
[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13439

Can we implement this feature simply by:

- using ```ColumnarBatch.allocate(StructType schema, MemoryMode memMode, int maxRows)``` with ```maxRows = 1```
- not introducing ```OffHeapConstantColumnVector``` and ```OnHeapConstantColumnVector```

IMHO, if this PR focuses on memory usage reduction, it could be possible.
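A hedged sketch of the simpler approach suggested above: represent a constant column as an ordinary single-row batch instead of adding new constant-vector classes. The `allocate` signature follows the one quoted in the comment; the column name, type, and value are illustrative.

```scala
import org.apache.spark.memory.MemoryMode
import org.apache.spark.sql.execution.vectorized.ColumnarBatch
import org.apache.spark.sql.types._

// A one-row batch holding the constant; readers would always fetch row 0
// regardless of the logical row index, so no new vector classes are needed.
val schema = new StructType().add("part", IntegerType)
val batch = ColumnarBatch.allocate(schema, MemoryMode.ON_HEAP, /* maxRows */ 1)
batch.column(0).putInt(0, 42) // the single value shared by every row
```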
[GitHub] spark pull request #11301: [SPARK-13432][SQL] add the source file name and l...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/11301#discussion_r66726703

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ---
@@ -37,6 +39,16 @@ private[sql] object Column {
   def apply(expr: Expression): Column = new Column(expr)

   def unapply(col: Column): Option[Expression] = Some(col.expr)
+
+  @scala.annotation.varargs
+  def updateExpressionsOrigin(cols: Column*): Unit = {
+    // Update Expression.origin using the callSite of an operation
+    val callSite = org.apache.spark.util.Utils.getCallSite().shortForm
+    cols.map(col => col.expr.foreach(e => e.origin.callSite = Some(callSite)))
+    // Update CurrentOrigin for setting origin for LogicalPlan node
+    CurrentOrigin.set(
+      Origin(Some(callSite), CurrentOrigin.get.line, CurrentOrigin.get.startPosition))
--- End diff --

Since I was too busy, I did not have time to work on this PR. Sorry. My current idea is to set ```CurrentOrigin``` at [```SQLExecution.withNewExecutionId```](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala#L57), and then have [```TreeNode.toOriginString```](https://github.com/apache/spark/pull/11301/files#diff-eac5b02bb450a235fef5e902a2671254R446) dump ```CurrentOrigin.get.callSite```. My next step seems to be passing the ```callSite``` from an operation such as ```filter``` to ```DataSet```, and then to the ```LogicalPlan``` that can be accessed at ```SQLExecution.withNewExecutionId```.
[GitHub] spark pull request #13589: [SPARK-15822][SPARK-15825][SQL] Fix SMJ Segfault/...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13589#discussion_r66568805

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -490,6 +490,7 @@ class CodegenContext {
       addNewFunction(compareFunc, funcCode)
       s"this.$compareFunc($c1, $c2)"
     case schema: StructType =>
+      INPUT_ROW = "i"
       val comparisons = GenerateOrdering.genComparisons(this, schema)
--- End diff --

Would it be better to use ```InternalRow $INPUT_ROW = null;``` in the assignment for ```funcCode```, to clearly show the intention of this change?
[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13505#discussion_r65804276

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala ---
@@ -86,11 +86,31 @@ package object expressions {
   /**
    * Helper functions for working with `Seq[Attribute]`.
    */
-  implicit class AttributeSeq(attrs: Seq[Attribute]) {
+  implicit class AttributeSeq(val attrs: Seq[Attribute]) {
     /** Creates a StructType with a schema matching this `Seq[Attribute]`. */
     def toStructType: StructType = {
       StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable)))
     }
+
+    private lazy val inputArr = attrs.toArray
+
+    private lazy val inputToOrdinal = {
+      val map = new java.util.HashMap[ExprId, Int](inputArr.length * 2)
--- End diff --

Why is ```* 2``` necessary? I think that the number of the map's entries is at most ```attrs.size```, since the maximum number of calls to ```map.put()``` equals ```attrs.size```. Is ```attrs.size``` equal to ```inputArr.length```?
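For what it's worth, the usual reason for sizing a ```java.util.HashMap``` at ```n * 2``` is its default load factor of 0.75: the map rehashes once its size exceeds capacity x 0.75, so a capacity of ```2n``` lets ```n``` entries go in without a resize. A small self-contained illustration (not code from the PR):

```scala
// With capacity 2n and load factor 0.75, the resize threshold is 1.5n,
// so inserting n entries never triggers a rehash.
val n = 1000
val map = new java.util.HashMap[String, Int](n * 2)
(0 until n).foreach(i => map.put(s"attr$i", i))
assert(map.size == n)
```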
[GitHub] spark pull request #13663: [SPARK-15950][SQL] Eliminate unreachable code...
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/13663

[SPARK-15950][SQL] Eliminate unreachable code at projection for complex types

## What changes were proposed in this pull request?

This PR eliminates unreachable code at projection for complex types (e.g. array, map, and struct) to reduce the total size of the Java bytecode in a method generated by whole-stage codegen. This PR eliminates the following code blocks when they can never occur:

1. the case where the type of ```project_value``` is ```UnsafeArrayData```
2. the case where ```project_value``` is ```null```
3. the ```null``` check for each element of ```project_values```

Example programs for complex types:

```scala
val df = sparkContext.parallelize(Seq(1, 2), 1).toDF("v")
df.selectExpr("Array(v + 2, v + 3)").collect
df.selectExpr("struct(v + 3, v + 4)").collect
df.selectExpr("map(v + 3, v + 4)").collect
```

Before applying this PR into ```Array```:

```java
/* 028 */   protected void processNext() throws java.io.IOException {
/* 029 */     while (inputadapter_input.hasNext()) {
/* 030 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
/* 031 */       final boolean project_isNull = false;
/* 032 */       this.project_values = new Object[2];
/* 033 */       double inputadapter_value = inputadapter_row.getDouble(0);
/* 034 */
/* 035 */       double project_value1 = -1.0;
/* 036 */       project_value1 = inputadapter_value + 2.2D;
/* 037 */       if (false) {
/* 038 */         project_values[0] = null;
/* 039 */       } else {
/* 040 */         project_values[0] = project_value1;
/* 041 */       }
/* 042 */
/* 043 */       double inputadapter_value1 = inputadapter_row.getDouble(1);
/* 044 */
/* 045 */       double project_value4 = -1.0;
/* 046 */       project_value4 = inputadapter_value1 + 3.3D;
/* 047 */       if (false) {
/* 048 */         project_values[1] = null;
/* 049 */       } else {
/* 050 */         project_values[1] = project_value4;
/* 051 */       }
/* 052 */
/* 053 */       final ArrayData project_value = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_values);
/* 054 */       this.project_values = null;
/* 055 */       project_holder.reset();
/* 056 */
/* 057 */       project_rowWriter.zeroOutNullBytes();
/* 058 */
/* 059 */       if (project_isNull) {
/* 060 */         project_rowWriter.setNullAt(0);
/* 061 */       } else {
/* 062 */         // Remember the current cursor so that we can calculate how many bytes are
/* 063 */         // written later.
/* 064 */         final int project_tmpCursor = project_holder.cursor;
/* 065 */
/* 066 */         if (project_value instanceof UnsafeArrayData) {
/* 067 */           final int project_sizeInBytes = ((UnsafeArrayData) project_value).getSizeInBytes();
/* 068 */           // grow the global buffer before writing data.
/* 069 */           project_holder.grow(project_sizeInBytes);
/* 070 */           ((UnsafeArrayData) project_value).writeToMemory(project_holder.buffer, project_holder.cursor);
/* 071 */           project_holder.cursor += project_sizeInBytes;
/* 072 */
/* 073 */         } else {
/* 074 */           final int project_numElements = project_value.numElements();
/* 075 */           project_arrayWriter.initialize(project_holder, project_numElements, 8);
/* 076 */
/* 077 */           for (int project_index = 0; project_index < project_numElements; project_index++) {
/* 078 */             if (project_value.isNullAt(project_index)) {
/* 079 */               project_arrayWriter.setNullAt(project_index);
/* 080 */             } else {
/* 081 */               final double project_element = project_value.getDouble(project_index);
/* 082 */               project_arrayWriter.write(project_index, project_element);
/* 083 */             }
/* 084 */           }
/* 085 */         }
/* 086 */
/* 087 */         project_rowWriter.setOffsetAndSize(0, project_tmpCursor, project_holder.cursor - project_tmpCursor);
/* 088 */         project_rowWriter.alignToWords(project_holder.cursor - project_tmpCursor);
/* 089 */       }
/* 090 */       project_result.setTotalSize(project_holder.totalSize());
/* 091 */       append(project_result);
/* 092 */       if (shouldStop()) return;
/* 093 */     }
/* 094 */   }
```

After applying this PR into ```Array```:

```java
/* 028 */   protected void processNext() throws java.io.IOException {
/* 029 */     while (inputadapter_input.hasNext()) {
/* 030 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
/* 031 */       final
```
[GitHub] spark issue #13663: [SPARK-15950][SQL] Eliminate unreachable code at project...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13663

@cloud-fan and @davies, thank you for your comments. A global variable is used for case 1. I will address case 1 in another PR, using a different approach that does not need a global variable. This PR will focus on cases 2 and 3, which can be addressed without using a global variable.
[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce additional implementa...
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/13680

[SPARK-15962][SQL] Introduce additional implementation with a dense format for UnsafeArrayData

## What changes were proposed in this pull request?

This PR introduces two implementations of ```UnsafeArrayData```. One is a sparse array, ```UnsafeArrayDataSparse```, which is the original format of ```UnsafeArrayData```. The other is a dense array, ```UnsafeArrayDataDense```, which is a new format of ```UnsafeArrayData```.

```UnsafeArrayDataSparse``` accepts a ```null``` value in each entry of an array. However, it increases the memory footprint (e.g. 2x for an integer array) due to the existence of the [```offsets``` area](https://github.com/apache/spark/blob/43b149fb885a27f9467aab28e5195f6f03aadcf0/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java#L35). In some cases (e.g. when an unsafe array is created from a primitive array by the method ```fromPrimitiveArray```), we know there is no ```null``` value in the array. ```UnsafeArrayDataDense``` can reduce the memory footprint since it does not have the ```offsets``` area.

## How was this patch tested?

Tested by existing unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kiszk/spark SPARK-15962

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13680.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13680

commit 2b15523ddf0c8b8387302611d564fcc2de239af5
Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
Date: 2016-06-14T19:46:29Z
add two implementations (sparse and dense) for UnsafeArrayData

commit 639b32d91cee1572e1146f325c40e158a8158ce5
Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
Date: 2016-06-15T06:19:18Z
fix failures of testsuite
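The "2x for an integer array" figure can be checked with back-of-envelope arithmetic. The sketch below assumes the sparse layout stores a 4-byte offset per element next to the 4-byte values, as the description above suggests; it is an illustration, not the actual ```UnsafeArrayData``` accounting.

```scala
// Approximate per-array payload in bytes for n Int elements.
def sparseBytes(n: Int): Int = 4 * n /* offsets */ + 4 * n /* values */
def denseBytes(n: Int): Int = 4 * n /* values only */

println(sparseBytes(1024)) // 8192
println(denseBytes(1024))  // 4096 -> roughly the 2x gap mentioned above
```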
[GitHub] spark pull request #13663: [SPARK-15950][SQL] Eliminate unreachable code at ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13663#discussion_r67209280

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala ---
@@ -60,18 +60,24 @@ case class CreateArray(children: Seq[Expression]) extends Expression {
       ctx.INPUT_ROW,
       children.zipWithIndex.map { case (e, i) =>
         val eval = e.genCode(ctx)
-        eval.code + s"""
-          if (${eval.isNull}) {
--- End diff --

The javac or just-in-time compiler can optimize this. But the code generator can also easily do it, without a global variable. I think that this is a good tradeoff. What do you think?
[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13505#discussion_r65804344

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala ---
@@ -86,11 +86,31 @@ package object expressions {
   /**
    * Helper functions for working with `Seq[Attribute]`.
    */
-  implicit class AttributeSeq(attrs: Seq[Attribute]) {
+  implicit class AttributeSeq(val attrs: Seq[Attribute]) {
     /** Creates a StructType with a schema matching this `Seq[Attribute]`. */
     def toStructType: StructType = {
       StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable)))
     }
+
+    private lazy val inputArr = attrs.toArray
+
+    private lazy val inputToOrdinal = {
+      val map = new java.util.HashMap[ExprId, Int](inputArr.length * 2)
+      var index = 0
+      attrs.foreach { attr =>
+        if (!map.containsKey(attr.exprId)) {
+          map.put(attr.exprId, index)
+        }
+        index += 1
--- End diff --

Which style is better: this one, or one using ```zipWithIndex```?
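For comparison, the ```zipWithIndex``` variant of the quoted loop would look roughly like this; it is a drop-in sketch for the ```inputToOrdinal``` body shown in the diff above, not code from the PR.

```scala
private lazy val inputToOrdinal = {
  val map = new java.util.HashMap[ExprId, Int](inputArr.length * 2)
  attrs.zipWithIndex.foreach { case (attr, index) =>
    if (!map.containsKey(attr.exprId)) {
      map.put(attr.exprId, index) // keep the first occurrence, as before
    }
  }
  map
}
```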
[GitHub] spark pull request #13472: [SPARK-15735] Allow specifying min time to run in...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13472#discussion_r65804523

--- Diff: core/src/main/scala/org/apache/spark/util/Benchmark.scala ---
@@ -33,18 +38,37 @@ import org.apache.commons.lang3.SystemUtils
  *
  * The benchmark function takes one argument that is the iteration that's being run.
  *
- * If outputPerIteration is true, the timing for each run will be printed to stdout.
+ * @param name name of this benchmark.
+ * @param valuesPerIteration number of values used in the test case, used to compute rows/s.
+ * @param minNumIters the min number of iterations that will be run per case, not counting warm-up.
+ * @param warmupTime amount of time to spend running dummy case iterations for JIT warm-up.
+ * @param minTime further iterations will be run for each case until this time is used up.
+ * @param outputPerIteration if true, the timing for each run will be printed to stdout.
+ * @param output optional output stream to write benchmark results to
  */
 private[spark] class Benchmark(
     name: String,
     valuesPerIteration: Long,
-    defaultNumIters: Int = 5,
-    outputPerIteration: Boolean = false) {
+    minNumIters: Int = 2,
+    warmupTime: FiniteDuration = 2.seconds,
+    minTime: FiniteDuration = 2.seconds,
--- End diff --

ditto.
[GitHub] spark pull request #13472: [SPARK-15735] Allow specifying min time to run in...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13472#discussion_r65804521

--- Diff: core/src/main/scala/org/apache/spark/util/Benchmark.scala ---
@@ -33,18 +38,37 @@ import org.apache.commons.lang3.SystemUtils
  *
  * The benchmark function takes one argument that is the iteration that's being run.
  *
- * If outputPerIteration is true, the timing for each run will be printed to stdout.
+ * @param name name of this benchmark.
+ * @param valuesPerIteration number of values used in the test case, used to compute rows/s.
+ * @param minNumIters the min number of iterations that will be run per case, not counting warm-up.
+ * @param warmupTime amount of time to spend running dummy case iterations for JIT warm-up.
+ * @param minTime further iterations will be run for each case until this time is used up.
+ * @param outputPerIteration if true, the timing for each run will be printed to stdout.
+ * @param output optional output stream to write benchmark results to
  */
 private[spark] class Benchmark(
     name: String,
     valuesPerIteration: Long,
-    defaultNumIters: Int = 5,
-    outputPerIteration: Boolean = false) {
+    minNumIters: Int = 2,
+    warmupTime: FiniteDuration = 2.seconds,
--- End diff --

I like adding a warmup period, but is 2 seconds the best default? It depends on the machine, the program (we can specify a value explicitly), and other factors. Are there any other thoughts?
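Given the constructor shown above, a caller who finds 2 seconds too short can always override the defaults explicitly, for example (illustrative values, not from the PR):

```scala
import scala.concurrent.duration._

import org.apache.spark.util.Benchmark

// Longer JIT warm-up and measurement window for a slow or noisy machine.
val benchmark = new Benchmark(
  "array reads",
  valuesPerIteration = 1024 * 1024,
  minNumIters = 5,
  warmupTime = 5.seconds,
  minTime = 10.seconds)
```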
[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13439

Can we have a benchmark program to show the performance improvement?
[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13459

Jenkins, test this please
[GitHub] spark pull request #13663: [SPARK-15950][SQL] Eliminate unreachable code at ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13663#discussion_r67849394

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameComplexTypeSuite.scala ---
@@ -26,6 +27,55 @@ import org.apache.spark.sql.test.SharedSQLContext
 class DataFrameComplexTypeSuite extends QueryTest with SharedSQLContext {
   import testImplicits._

+  test("create an array") {
+    val df = sparkContext.parallelize(Seq(1, 2), 1).toDF("v")
+    df.selectExpr("Array(v + 3, v + 4)").collect
--- End diff --

This is to make sure that, when the code is eliminated, a simple array creation still works without any exception. Do you think this is unnecessary?
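If the test were to assert more than "no exception", it could also check the values, e.g. as below. This is a sketch in the suite's QueryTest style; the expected rows are simply what the expressions should produce for v = 1 and v = 2.

```scala
test("create an array") {
  val df = sparkContext.parallelize(Seq(1, 2), 1).toDF("v")
  checkAnswer(
    df.selectExpr("Array(v + 3, v + 4)"),
    Seq(Row(Seq(4, 5)), Row(Seq(5, 6))))
}
```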
[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13758#discussion_r67849962

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala ---
@@ -142,3 +164,415 @@ class GenericArrayData(val array: Array[Any]) extends ArrayData {
     result
   }
 }
+
+final class GenericIntArrayData(private val primitiveArray: Array[Int])
--- End diff --

Thank you for your suggestion. It would be good to make ```GenericArrayData``` abstract. As you said, although ```ArrayData``` will be megamorphic in the class hierarchy, actual use should be monomorphic. Making each class ```final``` alleviates the runtime overhead.
[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13758#discussion_r67850981

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala ---
@@ -142,3 +164,415 @@ class GenericArrayData(val array: Array[Any]) extends ArrayData {
     result
   }
 }
+
+final class GenericIntArrayData(private val primitiveArray: Array[Int])
+  extends GenericArrayData(Array.empty) {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new GenericIntArrayData(primitiveArray)
+
+  override def numElements(): Int = primitiveArray.length
+
+  override def isNullAt(ordinal: Int): Boolean = false
+  override def getInt(ordinal: Int): Int = primitiveArray(ordinal)
+  override def toIntArray(): Array[Int] = {
+    val array = new Array[Int](numElements)
+    System.arraycopy(primitiveArray, 0, array, 0, numElements)
+    array
+  }
+  override def toString(): String = primitiveArray.mkString("[", ",", "]")
+
+  override def equals(o: Any): Boolean = {
--- End diff --

Good question about the definition of equality. Interestingly, there are no ```hashCode()``` and ```equals()``` methods in [UnsafeArrayData](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java). During my work on code generation, I could not find any usage of the ```hashCode()``` and ```equals()``` methods. Would it be better to drop these two methods if all of the tests pass without them?
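If the methods are kept, one compact way to get type-specialized equality is to delegate to ```java.util.Arrays```. A self-contained sketch (a toy class, not the PR's code):

```scala
import java.util.Arrays

final class IntArrayData(val primitiveArray: Array[Int]) {
  // Arrays.equals/hashCode give element-wise semantics on int[] directly,
  // with no boxing and no hand-rolled loop.
  override def equals(o: Any): Boolean = o match {
    case other: IntArrayData => Arrays.equals(primitiveArray, other.primitiveArray)
    case _ => false
  }
  override def hashCode: Int = Arrays.hashCode(primitiveArray)
}
```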
[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13758#discussion_r67993179

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala ---
@@ -23,7 +23,60 @@ import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.types.{DataType, Decimal}
 import org.apache.spark.unsafe.types.{CalendarInterval, UTF8String}

-class GenericArrayData(val array: Array[Any]) extends ArrayData {
+object GenericArrayData {
+  def allocate(seq: Seq[Any]): GenericArrayData = new GenericRefArrayData(seq)
--- End diff --

Definitely, you are right.
[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13758

@hvanhovell, I added [a file](https://github.com/kiszk/spark/blob/133d4c0085b5ca2f20870c05d077e25d8715e07a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayDataBenchmark.scala) with a Benchmark (not run yet). I would appreciate it if you had time to look at it. It is very strange to me that I can run a Benchmark program under ```sql/core``` (e.g. ```MiscBenchmark```) by using ```build/sbt "sql/test-only *MiscBenchmark*"```, while the same approach does not work under ```sql/catalyst```.
[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13758

@hvanhovell, yes, it is a good idea. Actually, I wrote a benchmark program, ```org.apache.spark.sql.catalyst.util.GenericArrayBenchmark``` (not committed yet). An issue in my environment is that I cannot run a benchmark program under sql/catalyst. The following command does not execute my benchmark program...

```
build/sbt "catalyst/test-only *GenericArrayBenchmark*"
```
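My guess (an assumption, not verified here) is that ```test-only``` only discovers ScalaTest suites, so a plain ```object``` with a ```main``` method is invisible to it; ```MiscBenchmark``` under ```sql/core``` runs because it is written as a suite. A hypothetical wrapper of the same shape:

```scala
// Expose the benchmark as an (ignored) ScalaTest case so that
// `build/sbt "catalyst/test-only *GenericArrayDataBenchmarkSuite*"` finds it.
class GenericArrayDataBenchmarkSuite extends SparkFunSuite {
  ignore("GenericArrayData benchmark") {
    // invoke the benchmark body here
  }
}
```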
[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13758#discussion_r67996014

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala ---
@@ -112,8 +112,8 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {
     val plan = GenerateMutableProjection.generate(expressions)
     val actual = plan(new GenericMutableRow(length)).toSeq(expressions.map(_.dataType))
     val expected = Seq(new ArrayBasedMapData(
-      new GenericArrayData(0 until length),
-      new GenericArrayData(Seq.fill(length)(true
+      GenericArrayData.allocate(0 until length),
+      GenericArrayData.allocate(Seq.fill(length)(true
--- End diff --

Ok, I will do both.
[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13758#discussion_r67995936

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala ---
@@ -142,3 +196,414 @@ class GenericArrayData(val array: Array[Any]) extends ArrayData {
     result
   }
 }
+
+final class GenericIntArrayData(val primitiveArray: Array[Int]) extends GenericArrayData {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new GenericIntArrayData(primitiveArray)
--- End diff --

Hmm, I should clone the backing array, and I should return the actual type.
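Concretely, the two fixes mentioned would look like this on a stripped-down version of the class (illustrative only, not the committed code):

```scala
final class IntArrayData(val primitiveArray: Array[Int]) {
  // Clone the backing array so the copy cannot observe later mutation,
  // and narrow the return type from ArrayData to the concrete class.
  def copy(): IntArrayData = new IntArrayData(primitiveArray.clone())
}
```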
[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13758#discussion_r67995993

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala ---
@@ -142,3 +196,414 @@ class GenericArrayData(val array: Array[Any]) extends ArrayData {
     result
   }
 }
+
+final class GenericIntArrayData(val primitiveArray: Array[Int]) extends GenericArrayData {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new GenericIntArrayData(primitiveArray)
+
+  override def numElements(): Int = primitiveArray.length
+
+  override def isNullAt(ordinal: Int): Boolean = false
+  override def getInt(ordinal: Int): Int = primitiveArray(ordinal)
+  override def toIntArray(): Array[Int] = {
+    val array = new Array[Int](numElements)
+    System.arraycopy(primitiveArray, 0, array, 0, numElements)
+    array
+  }
+  override def toString(): String = primitiveArray.mkString("[", ",", "]")
--- End diff --

Yes, I will move it up.
[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13758#discussion_r67997326

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala ---
@@ -142,3 +164,415 @@ class GenericArrayData(val array: Array[Any]) extends ArrayData {
     result
   }
 }
+
+final class GenericIntArrayData(private val primitiveArray: Array[Int])
+  extends GenericArrayData(Array.empty) {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new GenericIntArrayData(primitiveArray)
+
+  override def numElements(): Int = primitiveArray.length
+
+  override def isNullAt(ordinal: Int): Boolean = false
+  override def getInt(ordinal: Int): Int = primitiveArray(ordinal)
+  override def toIntArray(): Array[Int] = {
+    val array = new Array[Int](numElements)
+    System.arraycopy(primitiveArray, 0, array, 0, numElements)
+    array
+  }
+  override def toString(): String = primitiveArray.mkString("[", ",", "]")
+
+  override def equals(o: Any): Boolean = {
--- End diff --

I should implement ```equals()``` and ```hashCode()``` for the classes related to ```GenericArrayData```. I already implemented type-specialized ```equals()``` and ```hashCode()``` in ```GenericArrayData```. An issue with ```equals()``` and ```hashCode()``` of ```GenericRefArrayData``` is that the type of each element may be different, since they are held in an ```Array[Any]```. If I misunderstood your suggestion, could you please let me know?
[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13758#discussion_r68004874

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayDataBenchmark.scala ---
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.util.Benchmark
+
+/**
+ * Benchmark [[GenericArrayData]] for Dense and Sparse with primitive type
+ */
+object GenericArrayDataBenchmark {
+/*
+  def allocateGenericIntArray(iters: Int): Unit = {
+    val count = 1024 * 1024 * 10
+    var array: GenericArrayData = null
+
+    val primitiveIntArray = new Array[Int](count)
+    val denseIntArray = { i: Int =>
+      for (n <- 0L until iters) {
--- End diff --

I will do this.
[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13758#discussion_r68004837

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala ---
@@ -159,17 +159,17 @@ object CatalystTypeConverters {
     override def toCatalystImpl(scalaValue: Any): ArrayData = {
       scalaValue match {
         case a: Array[_] =>
-          new GenericArrayData(a.map(elementConverter.toCatalyst))
+          GenericArrayData.allocate(a.map(elementConverter.toCatalyst))
--- End diff --

Yes, I feel so for now, since the type of `toCatalyst` seems to be `Any`. Is it better to prepare specialized code to make the type concrete here?
[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13758#discussion_r68007482

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala ---
@@ -142,3 +164,415 @@ class GenericArrayData(val array: Array[Any]) extends ArrayData {
     result
   }
 }
+
+final class GenericIntArrayData(private val primitiveArray: Array[Int])
+  extends GenericArrayData(Array.empty) {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new GenericIntArrayData(primitiveArray)
+
+  override def numElements(): Int = primitiveArray.length
+
+  override def isNullAt(ordinal: Int): Boolean = false
+  override def getInt(ordinal: Int): Int = primitiveArray(ordinal)
+  override def toIntArray(): Array[Int] = {
+    val array = new Array[Int](numElements)
+    System.arraycopy(primitiveArray, 0, array, 0, numElements)
+    array
+  }
+  override def toString(): String = primitiveArray.mkString("[", ",", "]")
+
+  override def equals(o: Any): Boolean = {
--- End diff --

Thank you for your explanation. That is a great idea for ease of reading. I will do it.
[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13758#discussion_r67994789

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala ---
@@ -142,3 +196,414 @@ class GenericArrayData(val array: Array[Any]) extends ArrayData {
     result
   }
 }
+
+final class GenericIntArrayData(val primitiveArray: Array[Int]) extends GenericArrayData {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new GenericIntArrayData(primitiveArray)
+
+  override def numElements(): Int = primitiveArray.length
+
+  override def isNullAt(ordinal: Int): Boolean = false
+  override def getInt(ordinal: Int): Int = primitiveArray(ordinal)
+  override def toIntArray(): Array[Int] = {
+    val array = new Array[Int](numElements)
+    System.arraycopy(primitiveArray, 0, array, 0, numElements)
+    array
+  }
+  override def toString(): String = primitiveArray.mkString("[", ",", "]")
+
+  override def equals(o: Any): Boolean = {
+    if (!o.isInstanceOf[GenericIntArrayData]) {
+      return false
+    }
+
+    val other = o.asInstanceOf[GenericIntArrayData]
+    if (other eq null) {
+      return false
+    }
+
+    val len = numElements()
+    if (len != other.numElements()) {
+      return false
+    }
+
+    var i = 0
+    while (i < len) {
+      val o1 = primitiveArray(i)
+      val o2 = other.primitiveArray(i)
+      if (o1 != o2) {
+        return false
+      }
+      i += 1
+    }
+    true
+  }
+
+  override def hashCode: Int = {
+    var result: Int = 37
+    var i = 0
+    val len = numElements()
+    while (i < len) {
+      val update: Int = primitiveArray(i)
+      result = 37 * result + update
+      i += 1
+    }
+    result
+  }
+}
+
+final class GenericLongArrayData(val primitiveArray: Array[Long])
+  extends GenericArrayData {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new GenericLongArrayData(primitiveArray)
+
+  override def numElements(): Int = primitiveArray.length
+
+  override def isNullAt(ordinal: Int): Boolean = false
+  override def getLong(ordinal: Int): Long = primitiveArray(ordinal)
+  override def toLongArray(): Array[Long] = {
+    val array = new Array[Long](numElements)
+    System.arraycopy(primitiveArray, 0, array, 0, numElements)
+    array
+  }
+  override def toString(): String = primitiveArray.mkString("[", ",", "]")
+
+  override def equals(o: Any): Boolean = {
+    if (!o.isInstanceOf[GenericLongArrayData]) {
+      return false
+    }
+
+    val other = o.asInstanceOf[GenericLongArrayData]
+    if (other eq null) {
+      return false
+    }
+
+    val len = numElements()
+    if (len != other.numElements()) {
+      return false
+    }
+
+    var i = 0
+    while (i < len) {
+      val o1 = primitiveArray(i)
+      val o2 = other.primitiveArray(i)
+      if (o1 != o2) {
+        return false
+      }
+      i += 1
+    }
+    true
+  }
+
+  override def hashCode: Int = {
+    var result: Int = 37
+    var i = 0
+    val len = numElements()
+    while (i < len) {
+      val l = primitiveArray(i)
+      val update: Int = (l ^ (l >>> 32)).toInt
+      result = 37 * result + update
+      i += 1
+    }
+    result
+  }
+}
+
+final class GenericFloatArrayData(val primitiveArray: Array[Float])
+  extends GenericArrayData {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new GenericFloatArrayData(primitiveArray)
+
+  override def numElements(): Int = primitiveArray.length
+
+  override def isNullAt(ordinal: Int): Boolean = false
+  override def getFloat(ordinal: Int): Float = primitiveArray(ordinal)
+  override def toFloatArray(): Array[Float] = {
+    val array = new Array[Float](numElements)
+    System.arraycopy(primitiveArray, 0, array, 0, numElements)
+    array
+  }
+  override def toString(): String = primitiveArray.mkString("[", ",", "]")
+
+  override def equals(o: Any): Boolean = {
+    if (!o.isInstanceOf[GenericFloatArrayData]) {
+      return false
+    }
+
+    val other = o.asInstanceOf[GenericFloatArrayData]
+    if (other eq null) {
+      return false
+    }
+
+    val len = numElements()
+    if (len != other.numElements()) {
+      r
[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13758#discussion_r68015184

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayDataBenchmark.scala ---
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.util.Benchmark
+
+/**
+ * Benchmark [[GenericArrayData]] for Dense and Sparse with primitive type
+ */
+object GenericArrayDataBenchmark {
+/*
+  def allocateGenericIntArray(iters: Int): Unit = {
--- End diff --

Thanks for your suggestions. Now this benchmark works well on my cloud instance. I saw pretty good results (~2x performance improvement) for all three types of operations (allocation, primitive array, and getting elements). I will commit the latest version later.
[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13758#discussion_r68039256

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala ---
@@ -142,3 +196,414 @@ class GenericArrayData(val array: Array[Any]) extends ArrayData {
     result
   }
 }
+
+final class GenericIntArrayData(val primitiveArray: Array[Int]) extends GenericArrayData {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new GenericIntArrayData(primitiveArray)
+
+  override def numElements(): Int = primitiveArray.length
+
+  override def isNullAt(ordinal: Int): Boolean = false
+  override def getInt(ordinal: Int): Int = primitiveArray(ordinal)
+  override def toIntArray(): Array[Int] = {
+    val array = new Array[Int](numElements)
+    System.arraycopy(primitiveArray, 0, array, 0, numElements)
+    array
+  }
+  override def toString(): String = primitiveArray.mkString("[", ",", "]")
+
+  override def equals(o: Any): Boolean = {
+    if (!o.isInstanceOf[GenericIntArrayData]) {
+      return false
+    }
+
+    val other = o.asInstanceOf[GenericIntArrayData]
+    if (other eq null) {
+      return false
+    }
+
+    val len = numElements()
+    if (len != other.numElements()) {
+      return false
+    }
+
+    var i = 0
+    while (i < len) {
+      val o1 = primitiveArray(i)
+      val o2 = other.primitiveArray(i)
+      if (o1 != o2) {
+        return false
+      }
+      i += 1
+    }
+    true
+  }
+
+  override def hashCode: Int = {
+    var result: Int = 37
+    var i = 0
+    val len = numElements()
+    while (i < len) {
+      val update: Int = primitiveArray(i)
+      result = 37 * result + update
+      i += 1
+    }
+    result
+  }
+}
+
+final class GenericLongArrayData(val primitiveArray: Array[Long])
+  extends GenericArrayData {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new GenericLongArrayData(primitiveArray)
+
+  override def numElements(): Int = primitiveArray.length
+
+  override def isNullAt(ordinal: Int): Boolean = false
+  override def getLong(ordinal: Int): Long = primitiveArray(ordinal)
+  override def toLongArray(): Array[Long] = {
+    val array = new Array[Long](numElements)
+    System.arraycopy(primitiveArray, 0, array, 0, numElements)
+    array
+  }
+  override def toString(): String = primitiveArray.mkString("[", ",", "]")
+
+  override def equals(o: Any): Boolean = {
+    if (!o.isInstanceOf[GenericLongArrayData]) {
+      return false
+    }
+
+    val other = o.asInstanceOf[GenericLongArrayData]
+    if (other eq null) {
+      return false
+    }
+
+    val len = numElements()
+    if (len != other.numElements()) {
+      return false
+    }
+
+    var i = 0
+    while (i < len) {
+      val o1 = primitiveArray(i)
+      val o2 = other.primitiveArray(i)
+      if (o1 != o2) {
+        return false
+      }
+      i += 1
+    }
+    true
+  }
+
+  override def hashCode: Int = {
+    var result: Int = 37
+    var i = 0
+    val len = numElements()
+    while (i < len) {
+      val l = primitiveArray(i)
+      val update: Int = (l ^ (l >>> 32)).toInt
+      result = 37 * result + update
+      i += 1
+    }
+    result
+  }
+}
+
+final class GenericFloatArrayData(val primitiveArray: Array[Float])
+  extends GenericArrayData {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new GenericFloatArrayData(primitiveArray)
+
+  override def numElements(): Int = primitiveArray.length
+
+  override def isNullAt(ordinal: Int): Boolean = false
+  override def getFloat(ordinal: Int): Float = primitiveArray(ordinal)
+  override def toFloatArray(): Array[Float] = {
+    val array = new Array[Float](numElements)
+    System.arraycopy(primitiveArray, 0, array, 0, numElements)
+    array
+  }
+  override def toString(): String = primitiveArray.mkString("[", ",", "]")
+
+  override def equals(o: Any): Boolean = {
+    if (!o.isInstanceOf[GenericFloatArrayData]) {
+      return false
+    }
+
+    val other = o.asInstanceOf[GenericFloatArrayData]
+    if (other eq null) {
+      return false
+    }
+
+    val len = numElements()
+    if (len != other.numElements()) {
+      r
[GitHub] spark pull request #13663: [SPARK-15950][SQL] Eliminate unreachable code at ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13663#discussion_r67315501

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameComplexTypeSuite.scala ---
@@ -26,6 +26,38 @@ import org.apache.spark.sql.test.SharedSQLContext
 class DataFrameComplexTypeSuite extends QueryTest with SharedSQLContext {
   import testImplicits._

+  test("primitive type on array") {
+    val df = sparkContext.parallelize(Seq(1, 2), 1).toDF("v")
--- End diff --

I added unit tests to check whether ```zeroOutNullBytes``` is eliminated. I eliminated it in three cases, such as ```array on array```. I think that we need unit tests that generate an array, struct, or map, in order to check that it works well even when we eliminate ```zeroOutNullBytes```.
[GitHub] spark pull request #13704: [SPARK-15985][SQL] Reduce runtime overhead of a p...
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/13704

[SPARK-15985][SQL] Reduce runtime overhead of a program that reads a primitive array in Dataset

## What changes were proposed in this pull request?

This PR reduces the runtime overhead of a program that reads a primitive array in a Dataset. The generated code copies array elements from the Dataset to a temporary array. If we know that the types of the source and destination are primitive arrays, we apply one of the following optimizations:

1. Eliminate the allocation of ```Object[]``` and call the ```ArrayData.toArray()``` method if the types of source and destination are the same
2. Eliminate the pair of ```isNullAt()``` and a ```null``` assignment, and allocate a primitive array instead of ```Object[]```, if the types of source and destination are different

An example program:

```scala
val ds = Seq(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0)).toDS()
ds.map(p => {
  var s = 0.0
  for (i <- 0 to 2) { s += p(i) }
  s
}).show
```

Generated code before applying this PR:

```java
/* 036 */   protected void processNext() throws java.io.IOException {
/* 037 */     while (inputadapter_input.hasNext()) {
/* 038 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
/* 039 */       boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
/* 040 */       ArrayData inputadapter_value = inputadapter_isNull ? null : (inputadapter_row.getArray(0));
/* 041 */
/* 042 */       boolean deserializetoobject_isNull1 = inputadapter_isNull;
/* 043 */       ArrayData deserializetoobject_value1 = null;
/* 044 */       if (!inputadapter_isNull) {
/* 045 */         final int deserializetoobject_n = inputadapter_value.numElements();
/* 046 */         final Object[] deserializetoobject_values = new Object[deserializetoobject_n];
/* 047 */         for (int deserializetoobject_j = 0; deserializetoobject_j < deserializetoobject_n; deserializetoobject_j ++) {
/* 048 */           if (inputadapter_value.isNullAt(deserializetoobject_j)) {
/* 049 */             deserializetoobject_values[deserializetoobject_j] = null;
/* 050 */           } else {
/* 051 */             boolean deserializetoobject_feNull = false;
/* 052 */             double deserializetoobject_fePrim =
/* 053 */               inputadapter_value.getDouble(deserializetoobject_j);
/* 054 */
/* 055 */             boolean deserializetoobject_teNull = deserializetoobject_feNull;
/* 056 */             double deserializetoobject_tePrim = -1.0;
/* 057 */             if (!deserializetoobject_feNull) {
/* 058 */               deserializetoobject_tePrim = deserializetoobject_fePrim;
/* 059 */             }
/* 060 */
/* 061 */             if (deserializetoobject_teNull) {
/* 062 */               deserializetoobject_values[deserializetoobject_j] = null;
/* 063 */             } else {
/* 064 */               deserializetoobject_values[deserializetoobject_j] = deserializetoobject_tePrim;
/* 065 */             }
/* 066 */           }
/* 067 */         }
/* 068 */         deserializetoobject_value1 = new org.apache.spark.sql.catalyst.util.GenericArrayData(deserializetoobject_values);
/* 069 */
/* 070 */       }
/* 071 */
/* 072 */       boolean deserializetoobject_isNull = deserializetoobject_isNull1;
/* 073 */       final double[] deserializetoobject_value = deserializetoobject_isNull ? null : (double[]) deserializetoobject_value1.toDoubleArray();
/* 074 */       deserializetoobject_isNull = deserializetoobject_value == null;
/* 075 */
/* 076 */       Object mapelements_obj = ((Expression) references[0]).eval(null);
/* 077 */       scala.Function1 mapelements_value1 = (scala.Function1) mapelements_obj;
/* 078 */
/* 079 */       boolean mapelements_isNull = false || deserializetoobject_isNull;
/* 080 */       final double mapelements_value = mapelements_isNull ? -1.0 : (Double) mapelements_value1.apply(deserializetoobject_value);
/* 081 */
/* 082 */       serializefromobject_rowWriter.zeroOutNullBytes();
/* 083 */
/* 084 */       if (mapelements_isNull) {
/* 085 */         serializefromobject_rowWriter.setNullAt(0);
/* 086 */       } else {
/* 087 */         serializefromobject_rowWriter.write(0, mapelements_value);
/* 088 */       }
/* 089 */       append(serializefromobject_result);
/* 090 */       if (shouldStop()) return;
/* 091 */     }
/* 092 */   }
```

Generated code after applying this PR:

```java
/* 036 */   protected void processNext() throws java.io.IOException {
/* 037 */     while (inputadapter_input.hasNext()) {
/* 038 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
/* 039 */
```
[GitHub] spark pull request #13663: [SPARK-15950][SQL] Eliminate unreachable code at ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13663#discussion_r67315144

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala ---
@@ -71,7 +71,8 @@ case class CreateArray(children: Seq[Expression]) extends Expression {
       s"""
         final ArrayData ${ev.value} = new $arrayClass($values);
         this.$values = null;
-      """)
+      """,
+      isNull = "false")
--- End diff --

Got it. Removed this assignment.
[GitHub] spark pull request #13663: [SPARK-15950][SQL] Eliminate unreachable code at ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13663#discussion_r67213607

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala ---
@@ -60,18 +60,24 @@ case class CreateArray(children: Seq[Expression]) extends Expression {
       ctx.INPUT_ROW,
       children.zipWithIndex.map { case (e, i) =>
         val eval = e.genCode(ctx)
-        eval.code + s"""
-          if (${eval.isNull}) {
--- End diff --

Let me change my opinion. Another PR will handle this elimination and related features. This PR will focus on "each element of project_values is null if it never occurs".
[GitHub] spark pull request #13663: [SPARK-15950][SQL] Eliminate unreachable code at ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13663#discussion_r67863392

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameComplexTypeSuite.scala ---
@@ -26,6 +27,55 @@ import org.apache.spark.sql.test.SharedSQLContext
 class DataFrameComplexTypeSuite extends QueryTest with SharedSQLContext {
   import testImplicits._

+  test("create an array") {
+    val df = sparkContext.parallelize(Seq(1, 2), 1).toDF("v")
+    df.selectExpr("Array(v + 3, v + 4)").collect
--- End diff --

I see. I will remove these three tests.
[GitHub] spark issue #13663: [SPARK-15950][SQL] Eliminate unreachable code at project...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13663

@cloud-fan, @davies, I would appreciate it if you could look at this again. I think that I have addressed all of your comments.
[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/13758

[SPARK-16043][SQL] Prepare GenericArrayData implementation specialized for a primitive array

## What changes were proposed in this pull request?

This PR addresses a ToDo of the ```GenericArrayData``` class. The current implementation of ```GenericArrayData``` leads to boxing/unboxing if the type of the array elements is primitive. It would be good to eliminate boxing/unboxing from the viewpoint of runtime memory footprint and performance. This PR eliminates boxing/unboxing by preparing subclasses of ```GenericArrayData``` that specialize these operations. This PR prepares a new method ```GenericArrayData.allocate(...)``` that can return either a generic ```GenericArrayData``` instance or a ```GenericArrayData``` instance specialized for a primitive-type array.

Here are the major improvements:

1. Hold an array in a primitive array (previously, ```Object[]``` was used and boxing happened in a constructor)
2. The type-specific getter (e.g. ```getInt()```) reads a value from a primitive array (previously, unboxing happened)
3. The type-specific converter (e.g. ```toIntArray()```) performs the data copy using ```System.arraycopy``` (previously, unboxing happened)

## How was this patch tested?

Added unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kiszk/spark SPARK-16043

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13758.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13758

commit 17bdfcf7b9c155fe05eb75f73d661c2863cdf2d9
Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
Date: 2016-06-18T06:47:11Z
Implementation of GenericArrayData specialized for primitive type array; add unit tests
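The factory described above would roughly take this shape. It is a sketch based on the ```allocate(seq: Seq[Any])``` overload quoted earlier in this thread; the overloads for the other primitive types would follow the same pattern.

```scala
object GenericArrayData {
  // Reference types keep the original Object[]-backed representation...
  def allocate(seq: Seq[Any]): GenericArrayData = new GenericRefArrayData(seq)

  // ...while primitive arrays get a specialized, boxing-free subclass.
  def allocate(primitiveArray: Array[Int]): GenericArrayData =
    new GenericIntArrayData(primitiveArray)
}
```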
[GitHub] spark pull request #13757: [SPARK-16042][SQL] Eliminate nullcheck code at pr...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/13757 [SPARK-16042][SQL] Eliminate nullcheck code at projection for an array type

## What changes were proposed in this pull request?

This PR eliminates nullcheck code at projection for an array type. Currently, a nullcheck around the call that writes each element of an array (lines 076-078 in "Before applying this PR") is generated. If we know at compilation time that none of the elements can be ```null```, we can eliminate the nullcheck code. This PR checks whether ```ArrayType.containsNull``` is ```false``` to know that none of the array elements can be ```null```.

An example program
```
val df = sparkContext.parallelize(Seq(1.0, 2.0), 1).toDF("v")
df.selectExpr("Array(v + 2.2, v + 3.3)").collect
```

Before applying this PR
```
/* 028 */   protected void processNext() throws java.io.IOException {
/* 029 */     while (inputadapter_input.hasNext()) {
/* 030 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
/* 031 */       double inputadapter_value = inputadapter_row.getDouble(0);
/* 032 */
/* 033 */       final boolean project_isNull = false;
/* 034 */       this.project_values = new Object[2];
/* 035 */       double project_value1 = -1.0;
/* 036 */       project_value1 = inputadapter_value + 2.2D;
/* 037 */       if (false) {
/* 038 */         project_values[0] = null;
/* 039 */       } else {
/* 040 */         project_values[0] = project_value1;
/* 041 */       }
/* 042 */
/* 043 */       double project_value4 = -1.0;
/* 044 */       project_value4 = inputadapter_value + 3.3D;
/* 045 */       if (false) {
/* 046 */         project_values[1] = null;
/* 047 */       } else {
/* 048 */         project_values[1] = project_value4;
/* 049 */       }
/* 050 */
/* 051 */       final ArrayData project_value = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_values);
/* 052 */       this.project_values = null;
/* 053 */       project_holder.reset();
/* 054 */
/* 055 */       project_rowWriter.zeroOutNullBytes();
/* 056 */
/* 057 */       if (project_isNull) {
/* 058 */         project_rowWriter.setNullAt(0);
/* 059 */       } else {
/* 060 */         // Remember the current cursor so that we can calculate how many bytes are
/* 061 */         // written later.
/* 062 */         final int project_tmpCursor = project_holder.cursor;
/* 063 */
/* 064 */         if (project_value instanceof UnsafeArrayData) {
/* 065 */           final int project_sizeInBytes = ((UnsafeArrayData) project_value).getSizeInBytes();
/* 066 */           // grow the global buffer before writing data.
/* 067 */           project_holder.grow(project_sizeInBytes);
/* 068 */           ((UnsafeArrayData) project_value).writeToMemory(project_holder.buffer, project_holder.cursor);
/* 069 */           project_holder.cursor += project_sizeInBytes;
/* 070 */
/* 071 */         } else {
/* 072 */           final int project_numElements = project_value.numElements();
/* 073 */           project_arrayWriter.initialize(project_holder, project_numElements, 8);
/* 074 */
/* 075 */           for (int project_index = 0; project_index < project_numElements; project_index++) {
/* 076 */             if (project_value.isNullAt(project_index)) {
/* 077 */               project_arrayWriter.setNullAt(project_index);
/* 078 */             } else {
/* 079 */               final double project_element = project_value.getDouble(project_index);
/* 080 */               project_arrayWriter.write(project_index, project_element);
/* 081 */             }
/* 082 */
/* 083 */           }
/* 084 */         }
/* 085 */
/* 086 */         project_rowWriter.setOffsetAndSize(0, project_tmpCursor, project_holder.cursor - project_tmpCursor);
/* 087 */         project_rowWriter.alignToWords(project_holder.cursor - project_tmpCursor);
/* 088 */       }
/* 089 */       project_result.setTotalSize(project_holder.totalSize());
/* 090 */       append(project_result);
/* 091 */       if (shouldStop()) return;
/* 092 */     }
```

After applying this PR
```
/* 028 */   protected void processNext() throws java.io.IOException {
/* 029 */     while (inputadapter_input.hasNext()) {
/* 030 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
/* 031 */       double inputadapter_value = inputadapter_row.getDouble(0);
/* 032 */
/* 033 */       final boolean project_isNull = false;
/* 034 */       this.project_values = new Object[2];
/* 035 */       double project_value1 = -1.0;
/* 036 */       project_value1 = inpu
```
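[Editorial note: a simplified sketch of the gating logic described above, using a hypothetical helper rather than the PR's actual codegen; ```ArrayType.containsNull``` is the real Catalyst flag.]
```scala
import org.apache.spark.sql.types.ArrayType

// Emit the per-element null check only when the array type may hold nulls
// (sketch; variable names mirror the generated code shown above).
def elementWriteCode(at: ArrayType, idx: String): String = {
  val writeValue = s"project_arrayWriter.write($idx, project_value.getDouble($idx));"
  if (at.containsNull) {
    s"""if (project_value.isNullAt($idx)) {
       |  project_arrayWriter.setNullAt($idx);
       |} else {
       |  $writeValue
       |}""".stripMargin
  } else {
    // containsNull == false: the null branch is unreachable, so drop it.
    writeValue
  }
}
```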
[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce additonal implementation wi...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13680 It is OK to always keep ```[null bits]```. One question: is this format meant to keep fixed space for ```[values]```? I mean, even when a ```[null bit]``` is set, does the corresponding element in ```[values]``` still occupy space? That would allow us to always access any element at ```4 + (null bits size) + (element size) * offset```. If the answer to the above question is yes, I agree with having no ```[all zero in null bits?]```. Then, I will build a prototype for this. cc @hvanhovell
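[Editorial note: a sketch of the fixed-stride addressing the question assumes; the 4-byte element-count header and word-aligned null-bit region are assumptions for illustration, not the finalized layout.]
```scala
// With fixed space per element, any ordinal maps to a computable offset
// even when some null bits are set: 4 + (null bits size) + size * ordinal.
def elementOffset(numElements: Int, elementSize: Int, ordinal: Int): Long = {
  val header = 4L                                // [numElements]
  val nullBits = ((numElements + 63) / 64) * 8L  // one bit per element, word-aligned
  header + nullBits + elementSize.toLong * ordinal
}
```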
[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce additonal implementation wi...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13680 Good to hear. I will make an implementation with a single format. If I meet any issues, I will raise them here.
[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce additonal implementation wi...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13680 @cloud-fan thank you for your good comment. I also read the [previous proposal](https://github.com/apache/spark/pull/12640#discussion_r61539393). I would love to have only a single format (and implementation). I introduced a new dense format because I thought that there were some reasons to keep the old format. IMHO, a new unified format should have three properties:

1. Remove the indirect offsets (for performance and footprint)
2. Support the presence of null bits (for generality)
3. Quickly tell whether an array contains any null value (for performance, in particular for primitive arrays)

Based on them, how about this single format?
```
[numElements] [all zero in null bits?] [null bits] [values] [variable length portion]
```
If we want to reduce the memory footprint of a primitive array, we can drop the ```[null bits]``` part when ```[all zero in null bits?]``` has a special value.
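[Editorial note: a toy sketch of the read path that the proposed layout enables; the header representation and marker encoding here are assumptions for illustration only.]
```scala
// Decoded header of the proposed format:
//   [numElements] [all zero in null bits?] [null bits] [values] [var-len]
final case class ArrayHeader(numElements: Int, allNullBitsZero: Boolean)

// Property 3 in action: fully non-null arrays skip the null-bit lookup.
def isNullAt(header: ArrayHeader, nullBits: Array[Long], ordinal: Int): Boolean =
  if (header.allNullBitsZero) {
    false // fast path: no per-element null-bit test needed
  } else {
    (nullBits(ordinal >> 6) & (1L << (ordinal & 63))) != 0
  }
```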
[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13758 @cloud-fan , thank you for your comment. About the conversion between a primitive array and an unsafe array, I think we are on the same page. The motivation of my recent PRs is to reduce the overhead of changing the array format in code generated from a Dataset program that uses a primitive array. Based on this PR and [another PR](https://github.com/apache/spark/pull/13680), I am preparing another PR to directly transfer a primitive array to an unsafe (dense) array by ```Platform.copy()```. Currently, while an array of any type is abstracted by ```GenericArrayData```, this PR just wraps a primitive array to alleviate the overhead of data format conversion (e.g. between Java primitive and unsafe dense formats). Of course, it would be good to prepare APIs in ```GenericArrayData``` to convert the array format between a primitive array and an unsafe array. Thank you for pointing out the proposal for a new unsafe format. Let us discuss it in [another PR](https://github.com/apache/spark/pull/13680).
[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce additonal implementation wi...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13680 @cloud-fan , I have one question about null fields. Should we put zero into the field at the position where ```setNullAt()``` is called, as ```UnsafeRow``` [does](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L194)? If we do not put zero, two properties are affected:

1. [row equality](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L197)
2. an undetermined value may be included in the array returned by ```UnsafeArrayData.toArray()```

In my current implementation, the width of each element depends on the element type (4: Int, 8: Double, etc.). Thus, it is hard to take the same approach as ```UnsafeRow``` [does](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L194), since ```UnsafeRow``` always uses 8 bytes per field. Because we want to make data conversion between a primitive array and an unsafe array fast, we have to keep the type-based element width (e.g. 4: Int, 8: Double, etc.). What do you think? Do we have to keep the above two properties by clearing the field?
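[Editorial note: for concreteness, the type-based element widths under discussion might look as follows; this is a sketch, not Spark's API, in contrast to ```UnsafeRow```'s uniform 8 bytes per field.]
```scala
import org.apache.spark.sql.types._

// Element width in bytes, keyed by the Catalyst element type.
def elementWidth(dt: DataType): Int = dt match {
  case BooleanType | ByteType  => 1
  case ShortType               => 2
  case IntegerType | FloatType => 4
  case LongType | DoubleType   => 8
  case other => throw new UnsupportedOperationException(other.toString)
}
```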
[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce additonal implementation wi...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13680 One potential performance issue is that we always have to clear all of the null bits in ```UnsafeArrayWriter.initialize()```. This is because ```holder.buffer``` is reused for each row. If one row has more than one variable-length array, ```holder.cursor``` may point to a different offset for the same array in different rows. Thus, we have to clear all of the null bits. This overhead would be larger in machine learning use cases that use large arrays. If we had a flag that shows whether the ```[null bits]``` area is present, or if we had two implementations, one with ```[null bits]``` and one without, we could alleviate this performance overhead. What do you think? cc: @hvanhovell , @mengxr
[GitHub] spark pull request #13663: [SPARK-15950][SQL] Eliminate unreachable code at ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13663#discussion_r67386476
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala ---
@@ -60,18 +60,24 @@ case class CreateArray(children: Seq[Expression]) extends Expression {
       ctx.INPUT_ROW,
       children.zipWithIndex.map { case (e, i) =>
         val eval = e.genCode(ctx)
-        eval.code + s"""
-        if (${eval.isNull}) {
--- End diff --
Just a memo: I found that ```.map(_.isNull).mkString(" || ")``` is used in some places. This generates a code sequence such as ```isNull = false || false;```
[GitHub] spark pull request #13663: [SPARK-15950][SQL] Eliminate unreachable code at ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13663#discussion_r67272226
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala ---
@@ -60,18 +60,24 @@ case class CreateArray(children: Seq[Expression]) extends Expression {
       ctx.INPUT_ROW,
       children.zipWithIndex.map { case (e, i) =>
         val eval = e.genCode(ctx)
-        eval.code + s"""
-        if (${eval.isNull}) {
--- End diff --
@cloud-fan , the latest commit only avoids ```zeroOutNullBytes```, in a simple and clear way. Would it be possible to review this?
[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce additonal implementation wi...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13680 I see. I assumed that the virtual call would be devirtualized because the method is declared ```final``` and the JIT compiler optimistically propagates type information. Would it be better to add a flag like ```containsNull```?
[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13758 @hvanhovell , @cloud-fan , could you review this?
[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce additonal implementation wi...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13680 @cloud-fan , for the first issue, we are on the same page. Your proposal is what I was thinking about as a possible solution. I will do that. For the second issue, it seems to be a design choice between:

1. introducing one conditional branch in ```isNullAt()```, within a single implementation
2. having two implementations, each without a conditional branch in ```isNullAt()```
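[Editorial note: a toy illustration of the two choices, not Spark's actual classes. Option 1 keeps one class and branches inside ```isNullAt()```; option 2 splits the no-null case into its own class so the hot path has no branch, relying on JIT devirtualization of the final methods.]
```scala
trait NullBitsAccess { def isNullAt(ordinal: Int): Boolean }

// Option 2a: arrays that may contain nulls consult the bit set.
final class WithNullBits(bits: Array[Long]) extends NullBitsAccess {
  def isNullAt(ordinal: Int): Boolean =
    (bits(ordinal >> 6) & (1L << (ordinal & 63))) != 0
}

// Option 2b: fully non-null arrays answer without any lookup or branch.
final class NoNullBits extends NullBitsAccess {
  def isNullAt(ordinal: Int): Boolean = false
}
```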
[GitHub] spark issue #13704: [SPARK-15985][SQL] Reduce runtime overhead of a program ...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13704 @cloud-fan I think ```value#63``` is ```UnsafeArrayData```. When I ran a DataFrame program, I got the following trees. Since DataFrame operations access data in ```UnsafeArrayData```, I think that ```LocalRelation``` and ```LocalTableScan``` keep an array as ```UnsafeArrayData```.
```
val df = Seq(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0)).toDF()
val df2 = df.selectExpr("value[0] + value[1] + value[2]")
df2.show
df2.explain(true)

== Analyzed Logical Plan ==
((value[0] + value[1]) + value[2]): double
Project [((value#63[0] + value#63[1]) + value#63[2]) AS ((value[0] + value[1]) + value[2])#67]
+- LocalRelation [value#63]

== Optimized Logical Plan ==
LocalRelation [((value[0] + value[1]) + value[2])#67]

== Physical Plan ==
LocalTableScan [((value[0] + value[1]) + value[2])#67]
```
[GitHub] spark issue #13899: [SPARK-16196][SQL] Codegen in-memory scan with ColumnarB...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13899 @andrewor14 Looks interesting. I created two PRs that generate code similar to [your code](https://gist.github.com/andrewor14/7ce4c37a3c6bcd5cc2b6b16c861859e9). My PRs use the current ```ByteBuffer``` and support compression for primitive types. Do these PRs help you? https://github.com/apache/spark/pull/11956 https://github.com/apache/spark/pull/12894 I am waiting for review.
[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13758 Jenkins, retest this please
[GitHub] spark pull request #13909: [SPARK-16213][SQL] Reduce runtime overhead of a p...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/13909 [SPARK-16213][SQL] Reduce runtime overhead of a program that creates a primitive array in DataFrame

## What changes were proposed in this pull request?

This PR reduces the runtime overhead of a program that creates a primitive array in DataFrame. The generated code performs a boxing operation in the assignment from an ```InternalRow``` value to an ```Object[]``` temporary array (at lines 040 and 048 in the generated code before applying this PR). If we know that the type of the array elements is primitive, we apply the following optimizations:

1. Eliminate each pair of ```isNullAt()``` check and null assignment
2. Allocate a primitive array instead of ```Object[]``` (eliminating the boxing operations)
3. Call ```GenericArrayData.allocate(project_values)``` to avoid [boxing](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L31) in the constructor of ```GenericArrayData```, once https://github.com/apache/spark/pull/13758 is merged

An example program
```
val df = sparkContext.parallelize(Seq(0.0d, 1.0d), 1).toDF
df.selectExpr("Array(value + 1.1d, value + 2.2d)").show
```

Generated code before applying this PR
```java
/* 018 */   public void init(int index, scala.collection.Iterator inputs[]) {
/* 019 */     partitionIndex = index;
/* 020 */     inputadapter_input = inputs[0];
/* 021 */     this.project_values = null;
/* 022 */     project_result = new UnsafeRow(1);
/* 023 */     this.project_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result, 32);
/* 024 */     this.project_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder, 1);
/* 025 */     this.project_arrayWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
/* 026 */   }
/* 027 */
/* 028 */   protected void processNext() throws java.io.IOException {
/* 029 */     while (inputadapter_input.hasNext()) {
/* 030 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
/* 031 */       double inputadapter_value = inputadapter_row.getDouble(0);
/* 032 */
/* 033 */       final boolean project_isNull = false;
/* 034 */       this.project_values = new Object[2];
/* 035 */       double project_value7 = -1.0;
/* 036 */       project_value7 = inputadapter_value + 1.1D;
/* 037 */       if (false) {
/* 038 */         project_values[0] = null;
/* 039 */       } else {
/* 040 */         project_values[0] = project_value7;
/* 041 */       }
/* 042 */
/* 043 */       double project_value10 = -1.0;
/* 044 */       project_value10 = inputadapter_value + 2.2D;
/* 045 */       if (false) {
/* 046 */         project_values[1] = null;
/* 047 */       } else {
/* 048 */         project_values[1] = project_value10;
/* 049 */       }
/* 050 */
/* 051 */       /* final ArrayData project_value = org.apache.spark.sql.catalyst.util.GenericArrayData.allocate(project_values); */
/* 052 */       final ArrayData project_value = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_values);
/* 053 */       this.project_values = null;
/* 054 */       project_holder.reset();
/* 055 */
/* 056 */       project_rowWriter.zeroOutNullBytes();
/* 057 */
/* 058 */       if (project_isNull) {
/* 059 */         project_rowWriter.setNullAt(0);
/* 060 */       } else {
/* 061 */         // Remember the current cursor so that we can calculate how many bytes are
/* 062 */         // written later.
/* 063 */         final int project_tmpCursor = project_holder.cursor;
/* 064 */
/* 065 */         if (project_value instanceof UnsafeArrayData) {
/* 066 */           final int project_sizeInBytes = ((UnsafeArrayData) project_value).getSizeInBytes();
/* 067 */           // grow the global buffer before writing data.
/* 068 */           project_holder.grow(project_sizeInBytes);
/* 069 */           ((UnsafeArrayData) project_value).writeToMemory(project_holder.buffer, project_holder.cursor);
/* 070 */           project_holder.cursor += project_sizeInBytes;
/* 071 */
/* 072 */         } else {
/* 073 */           final int project_numElements = project_value.numElements();
/* 074 */           project_arrayWriter.initialize(project_holder, project_numElements, 8);
/* 075 */
/* 076 */           for (int project_index = 0; project_index < project_numElements; project_index++) {
/* 077 */             if (project_value.isNullAt(project_index)) {
/* 078 */               project_arrayWriter.setNullAt(project_index);
/* 079 */             } else {
/* 080 */               final double pr
```
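[Editorial note: a simplified sketch of optimization 2, using a hypothetical helper; ```ctx.isPrimitiveType``` and ```ctx.javaType``` are the ```CodegenContext``` utilities referenced elsewhere in this thread.]
```scala
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext
import org.apache.spark.sql.types.DataType

// Declare the temporary array with the element's Java primitive type
// instead of Object[], so the generated assignments never box.
def tempArrayDecl(ctx: CodegenContext, et: DataType, name: String, n: Int): String =
  if (ctx.isPrimitiveType(et)) {
    val jt = ctx.javaType(et)                  // e.g. "double"
    s"final $jt[] $name = new $jt[$n];"        // primitive backing store
  } else {
    s"final Object[] $name = new Object[$n];"  // fallback: boxed elements
  }
```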
[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce implementation with a dense...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13680 @rxin thank you for your comment. As you said, a holistic view is important. This PR is not only for machine learning. It also has another use case: improving the projection of an array in any program, via https://github.com/apache/spark/pull/13911. This PR can also reduce the memory footprint of the array representation and of ```DataFrame.cache()/Dataset.cache()``` in any program. What do you think?
[GitHub] spark pull request #13911: [SPARK-16215][SQL] Reduce runtime overhead of a p...
GitHub user kiszk opened a pull request: https://github.com/apache/spark/pull/13911 [SPARK-16215][SQL] Reduce runtime overhead of a program that writes a primitive array in DataFrame/Dataset

## What changes were proposed in this pull request?

This PR optimizes the generated projection code for a primitive type array. While we know that a primitive type array requires no null checks and has a contiguous data region, the current generated code performs a null check and a copy for each element (at lines 075-082 in the generated code before applying this PR). This PR applies the following optimizations:

1. Eliminate the null checks for each array element
2. Perform a bulk data copy by using ```Platform.copy```
3. Eliminate the primitive array allocation in ```GenericArrayData```, once https://github.com/apache/spark/pull/13758 is merged
4. Eliminate setting the sparse index for ```UnsafeArrayData```, once https://github.com/apache/spark/pull/13680 is merged

They are done in the helper method ```UnsafeArrayWriter.writePrimitiveArray()``` (at line 075 in the generated code after applying this PR). For now, 3 and 4 are not enabled, but the code is ready.

An example program
```
val df = sparkContext.parallelize(Seq(0.0d, 1.0d), 1).toDF
df.selectExpr("Array(value + 1.1d, value + 2.2d)").collect
```

Generated code before applying this PR
```java
/* 028 */   protected void processNext() throws java.io.IOException {
/* 029 */     while (inputadapter_input.hasNext()) {
/* 030 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
/* 031 */       double inputadapter_value = inputadapter_row.getDouble(0);
/* 032 */
/* 033 */       final boolean project_isNull = false;
/* 034 */       this.project_values = new Object[2];
/* 035 */       double project_value1 = -1.0;
/* 036 */       project_value1 = inputadapter_value + 1.1D;
/* 037 */       if (false) {
/* 038 */         project_values[0] = null;
/* 039 */       } else {
/* 040 */         project_values[0] = project_value1;
/* 041 */       }
/* 042 */
/* 043 */       double project_value4 = -1.0;
/* 044 */       project_value4 = inputadapter_value + 2.2D;
/* 045 */       if (false) {
/* 046 */         project_values[1] = null;
/* 047 */       } else {
/* 048 */         project_values[1] = project_value4;
/* 049 */       }
/* 050 */
/* 051 */       final ArrayData project_value = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_values);
/* 052 */       this.project_values = null;
/* 053 */       project_holder.reset();
/* 054 */
/* 055 */       project_rowWriter.zeroOutNullBytes();
/* 056 */
/* 057 */       if (project_isNull) {
/* 058 */         project_rowWriter.setNullAt(0);
/* 059 */       } else {
/* 060 */         // Remember the current cursor so that we can calculate how many bytes are
/* 061 */         // written later.
/* 062 */         final int project_tmpCursor = project_holder.cursor;
/* 063 */
/* 064 */         if (project_value instanceof UnsafeArrayData) {
/* 065 */           final int project_sizeInBytes = ((UnsafeArrayData) project_value).getSizeInBytes();
/* 066 */           // grow the global buffer before writing data.
/* 067 */           project_holder.grow(project_sizeInBytes);
/* 068 */           ((UnsafeArrayData) project_value).writeToMemory(project_holder.buffer, project_holder.cursor);
/* 069 */           project_holder.cursor += project_sizeInBytes;
/* 070 */
/* 071 */         } else {
/* 072 */           final int project_numElements = project_value.numElements();
/* 073 */           project_arrayWriter.initialize(project_holder, project_numElements, 8);
/* 074 */
/* 075 */           for (int project_index = 0; project_index < project_numElements; project_index++) {
/* 076 */             if (project_value.isNullAt(project_index)) {
/* 077 */               project_arrayWriter.setNullAt(project_index);
/* 078 */             } else {
/* 079 */               final double project_element = project_value.getDouble(project_index);
/* 080 */               project_arrayWriter.write(project_index, project_element);
/* 081 */             }
/* 082 */           }
/* 083 */
/* 084 */         }
/* 085 */
/* 086 */         project_rowWriter.setOffsetAndSize(0, project_tmpCursor, project_holder.cursor - project_tmpCursor);
/* 087 */         project_rowWriter.alignToWords(project_holder.cursor - project_tmpCursor);
/* 088 */       }
/* 089 */       project_result.setTotalSize(project_holder.totalSize());
/* 090 */       append(project_result);
/* 091 */       if (shouldStop()) return;
/* 092 */     }
/* 093 */   }
/* 094 */ }
```

Generated code after applying this PR ```ja
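[Editorial note: a sketch of the bulk-copy idea from optimization 2; the helper name is hypothetical, and the author's ```Platform.copy``` is sketched here with ```Platform.copyMemory``` from ```org.apache.spark.unsafe.Platform```. The cursor is assumed to already include the byte-array base offset, as ```BufferHolder.cursor``` does.]
```scala
import org.apache.spark.unsafe.Platform

// One copyMemory call moves the whole contiguous primitive array into
// the row buffer, replacing the per-element loop and its null checks.
def writeDoubleArray(values: Array[Double], buffer: Array[Byte], cursor: Long): Unit = {
  Platform.copyMemory(
    values, Platform.DOUBLE_ARRAY_OFFSET, // source: primitive array data
    buffer, cursor,                       // destination: the row's byte buffer
    values.length * 8L)                   // 8 bytes per double element
}
```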
[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce implementation with a dense...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13680 @cloud-fan and @hvanhovell thank you for your comments. Based on your comments, I implemented ```UnsafeArrayData``` as a single implementation that explicitly clears the ```null bits``` with ```Arrays.fill()```. I would appreciate it if you could review this PR.
[GitHub] spark issue #13663: [SPARK-15950][SQL] Eliminate unreachable code at project...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13663 @cloud-fan , @davies , I would appreciate it if you would look at this again.
[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/13758 @hvanhovell and @cloud-fan This ```GenericArrayData``` can be used in the code generated for a program that uses a primitive array, written in DataFrame or Dataset. I newly added a DataFrame benchmark program and its results. The performance improvements are up to 4.4x. Here is a use case of this ```GenericArrayData``` in the generated code of this benchmark program.
```java
// source program
sc.parallelize(Seq(Array.fill[Double](1)(1)), 1).toDF.selectExpr("value[0]").count

// part of generated code
double[] value1 = ...
final ArrayData value = isNull ? null : GenericArrayData.allocate(value1);
```
Another use case is PR https://github.com/apache/spark/pull/13704, which can generate the following code. At line 046, this PR can eliminate the data conversion between a primitive array and ```Array[Any]```. At line 051, this PR can also eliminate the data conversion.
```java
/* 044 */     if (!inputadapter_isNull) {
/* 045 */       final double[] deserializetoobject_values = inputadapter_value.toDoubleArray();
/* 046 */       deserializetoobject_value1 = new GenericArrayData(deserializetoobject_values);
/* 047 */
/* 048 */     }
/* 049 */
/* 050 */     boolean deserializetoobject_isNull = deserializetoobject_isNull1;
/* 051 */     final double[] deserializetoobject_value = deserializetoobject_isNull ? null : (double[]) deserializetoobject_value1.toDoubleArray();
```
If https://github.com/apache/spark/pull/13680 is also merged, we can improve the performance of the format conversion between ```UnsafeArrayData``` and ```GenericArrayData``` by using ```Platform.copy```.
[GitHub] spark pull request #13909: [SPARK-16213][SQL] Reduce runtime overhead of a p...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13909#discussion_r68519480
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameComplexTypeSuite.scala ---
@@ -26,6 +26,20 @@ import org.apache.spark.sql.test.SharedSQLContext
 class DataFrameComplexTypeSuite extends QueryTest with SharedSQLContext {
   import testImplicits._
 
+  test("primitive type on array") {
+    val df = sparkContext.parallelize(Seq(1, 2), 1).toDF("v")
--- End diff --
Yes, this is what I want to do. As other test cases do, I use ```sparkContext.parallelize()```, which can explicitly specify the number of partitions.
[GitHub] spark pull request #13909: [SPARK-16213][SQL] Reduce runtime overhead of a p...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13909#discussion_r68519507
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameComplexTypeSuite.scala ---
@@ -26,6 +26,20 @@ import org.apache.spark.sql.test.SharedSQLContext
 class DataFrameComplexTypeSuite extends QueryTest with SharedSQLContext {
   import testImplicits._
 
+  test("primitive type on array") {
+    val df = sparkContext.parallelize(Seq(1, 2), 1).toDF("v")
+    val resDF = df.selectExpr("Array(v + 2, v + 3)")
+    checkAnswer(resDF,
+      Seq(Row(Array(3, 4)), Row(Array(4, 5))))
+  }
+
+  test("primitive type and null on array") {
+    val df = sparkContext.parallelize(Seq(1, 2), 1).toDF("v")
+    val resDF = df.selectExpr("Array(v + 2, null, v + 3)")
+    checkAnswer(resDF,
+      Seq(Row(Array(3, null, 4)), Row(Array(4, null, 5))))
+  }
+
   test("UDF on struct") {
     val f = udf((a: String) => a)
--- End diff --
I think so. This part was not written by another developer in another PR.
[GitHub] spark pull request #13909: [SPARK-16213][SQL] Reduce runtime overhead of a p...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13909#discussion_r68519430
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala ---
@@ -51,27 +51,52 @@ case class CreateArray(children: Seq[Expression]) extends Expression {
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
     val arrayClass = classOf[GenericArrayData].getName
     val values = ctx.freshName("values")
-    ctx.addMutableState("Object[]", values, s"this.$values = null;")
-
-    ev.copy(code = s"""
-      final boolean ${ev.isNull} = false;
-      this.$values = new Object[${children.size}];""" +
-      ctx.splitExpressions(
-        ctx.INPUT_ROW,
-        children.zipWithIndex.map { case (e, i) =>
-          val eval = e.genCode(ctx)
-          eval.code + s"""
+    val dt = dataType match {
+      case a @ ArrayType(et, _) => et
+    }
+    val isPrimitive = ctx.isPrimitiveType(dt)
+    val evals = children.map(e => e.genCode(ctx))
+    val allNonNull = evals.find(_.isNull != "false").isEmpty
--- End diff --
Yes, I did.
[GitHub] spark pull request #13909: [SPARK-16213][SQL] Reduce runtime overhead of a p...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/13909#discussion_r68519450
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala ---
@@ -51,27 +51,52 @@ case class CreateArray(children: Seq[Expression]) extends Expression {
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
     val arrayClass = classOf[GenericArrayData].getName
     val values = ctx.freshName("values")
-    ctx.addMutableState("Object[]", values, s"this.$values = null;")
-
-    ev.copy(code = s"""
-      final boolean ${ev.isNull} = false;
-      this.$values = new Object[${children.size}];""" +
-      ctx.splitExpressions(
-        ctx.INPUT_ROW,
-        children.zipWithIndex.map { case (e, i) =>
-          val eval = e.genCode(ctx)
-          eval.code + s"""
+    val dt = dataType match {
--- End diff --
Thank you for the good catch.
[GitHub] spark pull request: [SPARK-13255][SQL] Integrate vectorized parque...
Github user kiszk commented on the pull request: https://github.com/apache/spark/pull/11146#issuecomment-182347108 @nongli Is there some kind of design doc for ```ColumnarBatch```? I am planning to make PRs for columnar storage and its computations with DataFrame/Dataset. We are curious whether ```ColumnarBatch``` and ```ColumnVector``` are designed only for Parquet or for general use.