[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...

2015-12-17 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10289#discussion_r47920442
  
--- Diff: 
core/src/test/scala/org/apache/spark/util/collection/ExternalSorterSuite.scala 
---
@@ -235,7 +235,7 @@ class ExternalSorterSuite extends SparkFunSuite with 
LocalSparkContext {
   it.next()
 }
   }
-
+*/
--- End diff --

Sorry, this is my mistake. This file should not be modified.





[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...

2015-12-17 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10289#discussion_r47922285
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/util/MLlibTestSparkContext.scala ---
@@ -38,12 +38,15 @@ trait MLlibTestSparkContext extends BeforeAndAfterAll { 
self: Suite =>
   }
 
   override def afterAll() {
-sqlContext = null
-SQLContext.clearActive()
-if (sc != null) {
-  sc.stop()
-}
+try {
+  sqlContext = null
+  SQLContext.clearActive()
+  if (sc != null) {
+sc.stop()
+  }
 sc = null
-super.afterAll()
+} finally {
--- End diff --

Yes, I will insert two spaces for an indent





[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...

2015-12-17 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10289#discussion_r47920869
  
--- Diff: 
yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala ---
@@ -116,7 +120,7 @@ abstract class BaseYarnClusterSuite
 
   override def afterAll() {
 yarnCluster.stop()
-System.clearProperty("SPARK_YARN_MODE")
+System.setProperties(oldProperties)
--- End diff --

I am afraid that another property may be changed somewhere else. I will keep the
original code since it works well now.





[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...

2015-12-13 Thread kiszk
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/10289

[SPARK-12311][CORE] Restore previous value of "os.arch" property in test 
suites after forcing to set specific value to "os.arch" property

Restore the original value of os.arch property after each test

Since some tests force a specific value for the os.arch property, we need to
restore the original value afterwards.
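
A minimal sketch of the save/restore idea (illustrative only; the forced value and
the test body are placeholders, not the PR's actual code):

```scala
// Save the current "os.arch", force a value for the test, then restore the original.
val originalArch = System.getProperty("os.arch")   // may be null
try {
  System.setProperty("os.arch", "amd64")           // value forced by the test (example)
  // ... test body that depends on os.arch ...
} finally {
  if (originalArch != null) System.setProperty("os.arch", originalArch)
  else System.clearProperty("os.arch")
}
```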

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kiszk/spark SPARK-12311

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10289.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10289


commit df0848c94e91333d398b6f793d82c2f2f80d07a0
Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
Date:   2015-12-14T07:32:52Z

restore the original value of os.arch property after each test

Since some of tests forced to set the specific value to os.arch property, 
we need to set the original value.







[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...

2015-12-15 Thread kiszk
Github user kiszk commented on the pull request:

https://github.com/apache/spark/pull/10289#issuecomment-164779119
  
@srowen, OK, I will correct this pattern everywhere. 





[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...

2015-12-15 Thread kiszk
Github user kiszk commented on the pull request:

https://github.com/apache/spark/pull/10289#issuecomment-164802842
  
There are two potential bugs:
1. The superclass method is not called in `beforeEach()`, `afterEach()`, `beforeAll()`, and `afterAll()`.
2. Although `System.setProperty()` is called, `ResetSystemProperties` is not mixed in.
I will correct both; a sketch of the corrected pattern is below.
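
A rough sketch of the corrected pattern (a hedged illustration assuming `SparkFunSuite` and the `ResetSystemProperties` test trait from Spark's test utilities; the suite name and test body are placeholders):

```scala
class MySuite extends SparkFunSuite with ResetSystemProperties {  // (2) mix in the reset trait
  override def beforeAll(): Unit = {
    super.beforeAll()                       // (1) always call the superclass hook
    // suite-specific setup goes here
  }

  override def afterAll(): Unit = {
    try {
      // suite-specific cleanup goes here
    } finally {
      super.afterAll()                      // (1) same rule on the teardown side
    }
  }

  test("example") {
    System.setProperty("SPARK_YARN_MODE", "true")  // undone by ResetSystemProperties after the test
  }
}
```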





[GitHub] spark pull request: [SPARK-12580][SQL] Remove string concatenation...

2016-01-03 Thread kiszk
Github user kiszk commented on the pull request:

https://github.com/apache/spark/pull/10524#issuecomment-168571357
  
I see. Sounds good. I will reformat them.





[GitHub] spark pull request: [SPARK-12580][SQL] Remove string concatenation...

2016-01-04 Thread kiszk
Github user kiszk commented on the pull request:

https://github.com/apache/spark/pull/10524#issuecomment-168872227
  
Jenkins, retest this please





[GitHub] spark pull request: [SPARK-12580][SQL] Remove string concatenation...

2015-12-30 Thread kiszk
Github user kiszk commented on the pull request:

https://github.com/apache/spark/pull/10524#issuecomment-168127380
  
@yhuai, I think that will work. How many characters per line do you think would
be preferable?





[GitHub] spark pull request: [SPARK-12530][Build] Fix build break at Spark-...

2015-12-29 Thread kiszk
Github user kiszk commented on the pull request:

https://github.com/apache/spark/pull/10488#issuecomment-167848836
  
@yhuai, do you mean that I should update all of the string concatenations in
@ExpressionDescription to use multi-line string literals, rather than only the
original one?
If so, I will make this update.





[GitHub] spark pull request: [SPARK-12530][Build] Fix build break at Spark-...

2015-12-29 Thread kiszk
Github user kiszk commented on the pull request:

https://github.com/apache/spark/pull/10488#issuecomment-167853136
  
I see. I will create another JIRA entry to update other usages.





[GitHub] spark pull request: [SPARK-12640][SQL] Add simple benchmarking uti...

2016-01-06 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10589#discussion_r48960725
  
--- Diff: core/src/test/scala/org/apache/spark/Benchmark.scala ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.collection.mutable
+
+import org.apache.commons.lang3.SystemUtils
+import org.apache.spark.util.Utils
+
+/**
+ * Utility class to benchmark components. An example of how to use this is:
+ *  val benchmark = new Benchmark("My Benchmark", valuesPerIteration)
+ *   benchmark.addCase("V1", ")
+ *   benchmark.addCase("V2", ")
+ *   benchmark.run
+ * This will output the average time to run each function and the rate of 
each function.
+ *
+ * The benchmark function takes one argument that is the iteration that's 
being run
+ */
+class Benchmark(name: String, valuesPerIteration: Long, iters: Int = 5) {
+  val benchmarks = mutable.ArrayBuffer.empty[Benchmark.Case]
+
+  def addCase(name: String, f: Int => Unit): Unit = {
+benchmarks += Benchmark.Case(name, f)
+  }
+
+  /**
+   * Runs the benchmark and outputs the results to stdout. This should be 
copied and added as
+   * a comment with the benchmark. Although the results vary from machine 
to machine, it should
+   * provide some baseline.
+   */
+  def run(): Unit = {
+require(benchmarks.nonEmpty)
+val results = benchmarks.map { c =>
+  Benchmark.measure(valuesPerIteration, c.fn, iters)
+}
+val firstRate = results.head.avgRate
+// scalastyle:off
+// The results are going to be processor specific so it is useful to 
include that.
+println(Benchmark.getProcessorName())
+printf("%-24s %16s %16s %14s\n", name + ":", "Avg Time(ms)", "Avg 
Rate(M/s)", "Relative Rate")
+
println("-")
+results.zip(benchmarks).foreach { r =>
+  printf("%-24s %16s %16s %14s\n", r._2.name, r._1.avgMs.toString, 
"%10.2f" format r._1.avgRate,
+"%6.2f X" format (r._1.avgRate / firstRate))
+}
+println
+// scalastyle:on
+  }
+}
+
+object Benchmark {
+  case class Case(name: String, fn: Int => Unit)
+  case class Result(avgMs: Double, avgRate: Double)
+
+  /**
+   * This should return a user helpful processor information. Getting at 
this depends on the OS.
+   * This should return something like "Intel(R) Core(TM) i7-4870HQ CPU @ 
2.50GHz"
+   */
+  def getProcessorName(): String = {
+if (SystemUtils.IS_OS_MAC_OSX) {
+  Utils.executeAndGetOutput(Seq("/usr/sbin/sysctl", "-n", 
"machdep.cpu.brand_string"))
+} else if (SystemUtils.IS_OS_LINUX) {
+  Utils.executeAndGetOutput(Seq("/usr/bin/grep", "-m", "1", "\"model 
name\"", "/proc/cpuinfo"))
+} else {
+  System.getenv("PROCESSOR_IDENTIFIER")
+}
+  }
+
+  /**
+   * Runs a single function `f` for iters, returning the average time the 
function took and
+   * the rate of the function.
+   */
+  def measure(num: Long, f: Int => Unit, iters: Int): Result = {
+var totalTime = 0L
--- End diff --

Printing detailed timing for each run sounds good. It also lets us check the
variation in elapsed time across runs.





[GitHub] spark pull request: [SPARK-12640][SQL] Add simple benchmarking uti...

2016-01-05 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10589#discussion_r48922143
  
--- Diff: core/src/test/scala/org/apache/spark/Benchmark.scala ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.collection.mutable
+
+import org.apache.commons.lang3.SystemUtils
+import org.apache.spark.util.Utils
+
+/**
+ * Utility class to benchmark components. An example of how to use this is:
+ *  val benchmark = new Benchmark("My Benchmark", valuesPerIteration)
+ *   benchmark.addCase("V1", ")
+ *   benchmark.addCase("V2", ")
+ *   benchmark.run
+ * This will output the average time to run each function and the rate of 
each function.
+ *
+ * The benchmark function takes one argument that is the iteration that's 
being run
+ */
+class Benchmark(name: String, valuesPerIteration: Long, iters: Int = 5) {
+  val benchmarks = mutable.ArrayBuffer.empty[Benchmark.Case]
+
+  def addCase(name: String, f: Int => Unit): Unit = {
+benchmarks += Benchmark.Case(name, f)
+  }
+
+  /**
+   * Runs the benchmark and outputs the results to stdout. This should be 
copied and added as
+   * a comment with the benchmark. Although the results vary from machine 
to machine, it should
+   * provide some baseline.
+   */
+  def run(): Unit = {
+require(benchmarks.nonEmpty)
+val results = benchmarks.map { c =>
+  Benchmark.measure(valuesPerIteration, c.fn, iters)
+}
+val firstRate = results.head.avgRate
+// scalastyle:off
+// The results are going to be processor specific so it is useful to 
include that.
+println(Benchmark.getProcessorName())
+printf("%-24s %16s %16s %14s\n", name + ":", "Avg Time(ms)", "Avg 
Rate(M/s)", "Relative Rate")
+
println("-")
+results.zip(benchmarks).foreach { r =>
+  printf("%-24s %16s %16s %14s\n", r._2.name, r._1.avgMs.toString, 
"%10.2f" format r._1.avgRate,
+"%6.2f X" format (r._1.avgRate / firstRate))
+}
+println
+// scalastyle:on
+  }
+}
+
+object Benchmark {
+  case class Case(name: String, fn: Int => Unit)
+  case class Result(avgMs: Double, avgRate: Double)
+
+  /**
+   * This should return a user helpful processor information. Getting at 
this depends on the OS.
+   * This should return something like "Intel(R) Core(TM) i7-4870HQ CPU @ 
2.50GHz"
+   */
+  def getProcessorName(): String = {
+if (SystemUtils.IS_OS_MAC_OSX) {
+  Utils.executeAndGetOutput(Seq("/usr/sbin/sysctl", "-n", 
"machdep.cpu.brand_string"))
+} else if (SystemUtils.IS_OS_LINUX) {
+  Utils.executeAndGetOutput(Seq("/usr/bin/grep", "-m", "1", "\"model 
name\"", "/proc/cpuinfo"))
+} else {
+  System.getenv("PROCESSOR_IDENTIFIER")
+}
+  }
+
+  /**
+   * Runs a single function `f` for iters, returning the average time the 
function took and
+   * the rate of the function.
+   */
+  def measure(num: Long, f: Int => Unit, iters: Int): Result = {
+var totalTime = 0L
--- End diff --

If you would like to measure peak performance, a warm-up run is necessary before
the measurement. The current benchmark driver may include execution time spent in
the bytecode interpreter. This is because the just-in-time compiler in the Java
runtime usually starts translating Java bytecode into CPU-native code only after
the number of method invocations/branches exceeds a preset threshold (e.g., on the
order of 10,000 invocations for OracleJDK's server compiler). Before the native
code is available, the Java bytecode runs on the interpreter.
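
A minimal sketch of the warm-up idea (a simplified variant of the quoted `measure`; the iteration counts are arbitrary assumptions):

```scala
// Run the workload a few times before timing so the JIT compiler can compile the
// hot code; only the post-warm-up iterations are measured.
def measureWithWarmup(f: Int => Unit, iters: Int, warmupIters: Int = 2): Double = {
  (0 until warmupIters).foreach(f)              // warm-up runs, results discarded
  var totalTimeNs = 0L
  for (i <- 0 until iters) {
    val start = System.nanoTime()
    f(i)
    totalTimeNs += System.nanoTime() - start
  }
  totalTimeNs.toDouble / iters / 1e6            // average milliseconds per iteration
}
```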

[GitHub] spark pull request: [SPARK-12635][SQL] Add ColumnarBatch, an in me...

2016-01-07 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10628#discussion_r49050418
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java
 ---
@@ -0,0 +1,162 @@
+package org.apache.spark.sql.execution.vectorized;
+
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.DoubleType;
+import org.apache.spark.sql.types.IntegerType;
+import org.apache.spark.unsafe.Platform;
+
+import java.nio.ByteBuffer;
+import java.nio.DoubleBuffer;
+
+/**
+ * Column data backed using offheap memory.
+ */
+public final class OffHeapColumnVector extends ColumnVector {
+  // The data stored in these two allocations need to maintain binary 
compatible. We can
+  // directly pass this buffer to external components.
+  private long nulls;
+  private long data;
+
+  protected OffHeapColumnVector(int capacity, DataType type) {
+super(capacity, type);
+this.nulls = Platform.allocateMemory(capacity);
--- End diff --

Is it better to make one combined allocation and carve out the area for this.data
first and the area for this.nulls next? The expectation is that this.data then
starts at the allocation boundary, which should help it align with a cache line.
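
A rough sketch of that layout (illustrative only; `capacity` and the per-value size are example assumptions, and alignment beyond what the allocator provides is not guaranteed):

```scala
import org.apache.spark.unsafe.Platform

val capacity  = 1024                                        // number of rows (example)
val dataBytes = capacity * 8L                               // value region, e.g. 8 bytes per double
val base  = Platform.allocateMemory(dataBytes + capacity)   // one combined allocation
val data  = base                                            // values first, at the allocation start
val nulls = base + dataBytes                                // one null byte per row, after the values
```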





[GitHub] spark pull request: [SPARK-12580][SQL] Remove string concatenation...

2015-12-30 Thread kiszk
Github user kiszk commented on the pull request:

https://github.com/apache/spark/pull/10524#issuecomment-168038054
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-12644][SQL] Update parquet reader to be...

2016-01-05 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10593#discussion_r48920875
  
--- Diff: core/src/test/scala/org/apache/spark/Benchmark.scala ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.collection.mutable
+
+import org.apache.commons.lang3.SystemUtils
+import org.apache.spark.util.Utils
+
+/**
+ * Utility class to benchmark components. An example of how to use this is:
+ *  val benchmark = new Benchmark("My Benchmark", valuesPerIteration)
+ *   benchmark.addCase("V1", ")
+ *   benchmark.addCase("V2", ")
+ *   benchmark.run
+ * This will output the average time to run each function and the rate of 
each function.
+ *
+ * The benchmark function takes one argument that is the iteration that's 
being run
+ */
+class Benchmark(name: String, valuesPerIteration: Long, iters: Int = 5) {
+  val benchmarks = mutable.ArrayBuffer.empty[Benchmark.Case]
+
+  def addCase(name: String, f: Int => Unit): Unit = {
+benchmarks += Benchmark.Case(name, f)
+  }
+
+  /**
+   * Runs the benchmark and outputs the results to stdout. This should be 
copied and added as
+   * a comment with the benchmark. Although the results vary from machine 
to machine, it should
+   * provide some baseline.
+   */
+  def run(): Unit = {
+require(benchmarks.nonEmpty)
+val results = benchmarks.map { c =>
+  Benchmark.measure(valuesPerIteration, c.fn, iters)
+}
+val firstRate = results.head.avgRate
+// scalastyle:off
+// The results are going to be processor specific so it is useful to 
include that.
+println(Benchmark.getProcessorName())
+printf("%-30s %16s %16s %14s\n", name + ":", "Avg Time(ms)", "Avg 
Rate(M/s)", "Relative Rate")
+
println("---")
+results.zip(benchmarks).foreach { r =>
+  printf("%-30s %16s %16s %14s\n", r._2.name, r._1.avgMs.toString, 
"%10.2f" format r._1.avgRate,
+"%6.2f X" format (r._1.avgRate / firstRate))
+}
+println
+// scalastyle:on
+  }
+}
+
+object Benchmark {
+  case class Case(name: String, fn: Int => Unit)
+  case class Result(avgMs: Double, avgRate: Double)
+
+  /**
+   * This should return a user helpful processor information. Getting at 
this depends on the OS.
+   * This should return something like "Intel(R) Core(TM) i7-4870HQ CPU @ 
2.50GHz"
+   */
+  def getProcessorName(): String = {
+if (SystemUtils.IS_OS_MAC_OSX) {
+  Utils.executeAndGetOutput(Seq("/usr/sbin/sysctl", "-n", 
"machdep.cpu.brand_string"))
+} else if (SystemUtils.IS_OS_LINUX) {
+  Utils.executeAndGetOutput(Seq("/usr/bin/grep", "-m", "1", "\"model 
name\"", "/proc/cpuinfo"))
+} else {
+  System.getenv("PROCESSOR_IDENTIFIER")
+}
+  }
+
+  /**
+   * Runs a single function `f` for iters, returning the average time the 
function took and
+   * the rate of the function.
+   */
+  def measure(num: Long, f: Int => Unit, iters: Int): Result = {
+var totalTime = 0L
+for (i <- 0 until iters + 1) {
+  val start = System.currentTimeMillis()
--- End diff --

How about calling System.nanoTime() for short-running benchmarks instead of 
System.currentTimeMillis()?
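
For example, a minimal nanoTime-based timing helper (a hedged sketch; the workload shown is arbitrary):

```scala
// Time one call with nanosecond resolution; currentTimeMillis() can round a
// short run down to 0 ms, while nanoTime() keeps sub-millisecond detail.
def timeMs(body: => Unit): Double = {
  val start = System.nanoTime()
  body
  (System.nanoTime() - start) / 1e6     // elapsed milliseconds
}

val elapsedMs = timeMs {
  var s = 0L
  (0 until 1000000).foreach(s += _)     // example workload
}
```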





[GitHub] spark pull request: [SPARK-12640][SQL] Add simple benchmarking uti...

2016-01-05 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10589#discussion_r48921242
  
--- Diff: core/src/test/scala/org/apache/spark/Benchmark.scala ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.collection.mutable
+
+import org.apache.commons.lang3.SystemUtils
+import org.apache.spark.util.Utils
+
+/**
+ * Utility class to benchmark components. An example of how to use this is:
+ *  val benchmark = new Benchmark("My Benchmark", valuesPerIteration)
+ *   benchmark.addCase("V1", ")
+ *   benchmark.addCase("V2", ")
+ *   benchmark.run
+ * This will output the average time to run each function and the rate of 
each function.
+ *
+ * The benchmark function takes one argument that is the iteration that's 
being run
+ */
+class Benchmark(name: String, valuesPerIteration: Long, iters: Int = 5) {
+  val benchmarks = mutable.ArrayBuffer.empty[Benchmark.Case]
+
+  def addCase(name: String, f: Int => Unit): Unit = {
+benchmarks += Benchmark.Case(name, f)
+  }
+
+  /**
+   * Runs the benchmark and outputs the results to stdout. This should be 
copied and added as
+   * a comment with the benchmark. Although the results vary from machine 
to machine, it should
+   * provide some baseline.
+   */
+  def run(): Unit = {
+require(benchmarks.nonEmpty)
+val results = benchmarks.map { c =>
+  Benchmark.measure(valuesPerIteration, c.fn, iters)
+}
+val firstRate = results.head.avgRate
+// scalastyle:off
+// The results are going to be processor specific so it is useful to 
include that.
+println(Benchmark.getProcessorName())
+printf("%-24s %16s %16s %14s\n", name + ":", "Avg Time(ms)", "Avg 
Rate(M/s)", "Relative Rate")
+
println("-")
+results.zip(benchmarks).foreach { r =>
+  printf("%-24s %16s %16s %14s\n", r._2.name, r._1.avgMs.toString, 
"%10.2f" format r._1.avgRate,
+"%6.2f X" format (r._1.avgRate / firstRate))
+}
+println
+// scalastyle:on
+  }
+}
+
+object Benchmark {
+  case class Case(name: String, fn: Int => Unit)
+  case class Result(avgMs: Double, avgRate: Double)
+
+  /**
+   * This should return a user helpful processor information. Getting at 
this depends on the OS.
+   * This should return something like "Intel(R) Core(TM) i7-4870HQ CPU @ 
2.50GHz"
+   */
+  def getProcessorName(): String = {
+if (SystemUtils.IS_OS_MAC_OSX) {
+  Utils.executeAndGetOutput(Seq("/usr/sbin/sysctl", "-n", 
"machdep.cpu.brand_string"))
+} else if (SystemUtils.IS_OS_LINUX) {
+  Utils.executeAndGetOutput(Seq("/usr/bin/grep", "-m", "1", "\"model 
name\"", "/proc/cpuinfo"))
+} else {
+  System.getenv("PROCESSOR_IDENTIFIER")
+}
+  }
+
+  /**
+   * Runs a single function `f` for iters, returning the average time the 
function took and
+   * the rate of the function.
+   */
+  def measure(num: Long, f: Int => Unit, iters: Int): Result = {
+var totalTime = 0L
+for (i <- 0 until iters + 1) {
+  val start = System.currentTimeMillis()
--- End diff --

How about calling System.nanoTime() instead of System.currentTimeMillis() for
short-running benchmarks?





[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...

2015-12-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10289#discussion_r48209450
  
--- Diff: 
yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala ---
@@ -43,12 +45,21 @@ import org.apache.spark.util.Utils
 
 class ClientSuite extends SparkFunSuite with Matchers with 
BeforeAndAfterAll {
 
+  var oldProperties: Properties = null
--- End diff --

This is because System.setProperty() is not used in this file.





[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...

2015-12-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10289#discussion_r48209783
  
--- Diff: 
yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala ---
@@ -116,7 +120,7 @@ abstract class BaseYarnClusterSuite
 
   override def afterAll() {
 yarnCluster.stop()
-System.clearProperty("SPARK_YARN_MODE")
+System.setProperties(oldProperties)
 super.afterAll()
--- End diff --

I agree. Since the original code did not use `try finally`, I did not use it
either. Should I use `try finally`?





[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...

2015-12-22 Thread kiszk
Github user kiszk commented on the pull request:

https://github.com/apache/spark/pull/10289#issuecomment-166765714
  
Can I retest this? A timeout may have occurred in pyspark.





[GitHub] spark pull request: [SPARK-12530][Build] Fix build break at Spark-...

2015-12-27 Thread kiszk
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/10488

[SPARK-12530][Build] Fix build break at Spark-Master-Maven-Snapshots from 
#1293

The compilation error is caused by string concatenations that are not compile-time
constants.
Use a raw string literal to avoid the string concatenations.


https://amplab.cs.berkeley.edu/jenkins/view/Spark-Packaging/job/Spark-Master-Maven-Snapshots/1293/

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kiszk/spark SPARK-12530

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10488.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10488


commit 9658c3b6b66034a515a3e5d3cfd3d6cc9c4a3305
Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
Date:   2015-12-27T08:24:38Z

fix compilation error

use raw string literal to avoid string concatenations







[GitHub] spark pull request: [SPARK-12530][Build] Fix build break at Spark-...

2015-12-27 Thread kiszk
Github user kiszk commented on the pull request:

https://github.com/apache/spark/pull/10488#issuecomment-167418794
  
Thanks for letting me know. I will check them.





[GitHub] spark pull request: [SPARK-12530][Build] Fix build break at Spark-...

2015-12-27 Thread kiszk
Github user kiszk commented on the pull request:

https://github.com/apache/spark/pull/10488#issuecomment-167424111
  
@hvanhovell, the following two conditions together cause this compilation failure:
1. More than one string concatenation is used
2. The Scala 2.11 compiler is used (an open question is why [this failure](https://amplab.cs.berkeley.edu/jenkins/view/Spark-Packaging/job/Spark-Master-Maven-Snapshots/1293/consoleFull) occurred on Jenkins, which builds with both the Scala 2.10 and 2.11 compilers)

I can reproduce this compilation error by using the following command:

$ java -version
openjdk version "1.8.0_65"
OpenJDK Runtime Environment (build 1.8.0_65-b17)
OpenJDK 64-Bit Server VM (build 25.65-b01, mixed mode)
$ build/mvn -Dscala-2.11 -Pyarn -Phadoop-2.4 -DskipTests clean package 
install -pl sql/catalyst






[GitHub] spark pull request: [SPARK-12530][Build] Fix build break at Spark-...

2015-12-28 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10488#discussion_r48472393
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 ---
@@ -57,9 +57,10 @@ case class Md5(child: Expression) extends 
UnaryExpression with ImplicitCastInput
  * the hash length is not one of the permitted values, the return value is 
NULL.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family 
as a hex string of the " +
-"input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit 
length of 0 is equivalent " +
-"to 256",
+  usage =
+"""_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family as a 
hex string of the input.
+  SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 
0 is equivalent to 256."""
+  ,
--- End diff --

Yes, this keeps each line within 100 characters.





[GitHub] spark pull request: [SPARK-12530][Build] Fix build break at Spark-...

2015-12-28 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10488#discussion_r48498744
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 ---
@@ -57,9 +57,10 @@ case class Md5(child: Expression) extends 
UnaryExpression with ImplicitCastInput
  * the hash length is not one of the permitted values, the return value is 
NULL.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family 
as a hex string of the " +
-"input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit 
length of 0 is equivalent " +
-"to 256",
+  usage =
+"""_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family as a 
hex string of the input.
+  SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 
0 is equivalent to 256."""
+  ,
--- End diff --

I confirmed the behavior with ``// scalastyle:off`` and ``// scalastyle:on`` when
we use at most one string concatenation for `` usage = ``.





[GitHub] spark pull request: [SPARK-12530][Build] Fix build break at Spark-...

2015-12-28 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10488#discussion_r48500799
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 ---
@@ -57,9 +57,10 @@ case class Md5(child: Expression) extends 
UnaryExpression with ImplicitCastInput
  * the hash length is not one of the permitted values, the return value is 
NULL.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family 
as a hex string of the " +
-"input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit 
length of 0 is equivalent " +
-"to 256",
+  usage =
+"""_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family as a 
hex string of the input.
+  SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 
0 is equivalent to 256."""
+  ,
--- End diff --

@yhuai , the above line did not cause the same issue. This is because its 
length is less than 100 and it has only one concatenation.

I checked by inserting ``// scalastyle:off`` before the annotation and ``// 
scalastyle:on`` after the annotation as follows:

```
// scalastyle:off
@ExpressionDescription(
  usage = "_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family as 
a hex string of the " +
"input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit 
length of 0 is equivalent to 256",
  extended = "> SELECT _FUNC_('Spark', 0);\n " +
"'529bc3b07127ecb7e53a4dcf1991d9152c24537d919178022b2c42657f79a26b'")
// scalastyle:on
```





[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...

2015-12-22 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10289#discussion_r48238449
  
--- Diff: 
yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala ---
@@ -116,7 +120,7 @@ abstract class BaseYarnClusterSuite
 
   override def afterAll() {
 yarnCluster.stop()
-System.clearProperty("SPARK_YARN_MODE")
+System.setProperties(oldProperties)
 super.afterAll()
--- End diff --

I see, I will use `try finally` here
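
Something like the following (a hedged sketch based on the quoted diff; `yarnCluster` and `oldProperties` come from the suite):

```scala
override def afterAll(): Unit = {
  try {
    yarnCluster.stop()                      // may throw; the restore below must still run
  } finally {
    System.setProperties(oldProperties)     // put back whatever the suite changed
    super.afterAll()
  }
}
```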





[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...

2015-12-22 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10289#discussion_r48238546
  
--- Diff: 
yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala ---
@@ -43,12 +45,21 @@ import org.apache.spark.util.Utils
 
 class ClientSuite extends SparkFunSuite with Matchers with 
BeforeAndAfterAll {
 
+  var oldProperties: Properties = null
--- End diff --

Sorry, I made a mistake when I checked all of the changes in this file. I will
use `ResetSystemProperties` here.





[GitHub] spark pull request: [SPARK-12502][Build][Python] Script /dev/run-t...

2015-12-23 Thread kiszk
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/10463

[SPARK-12502][Build][Python] Script /dev/run-tests fails when IBM Java is 
used

Fix an exception with IBM JDK by removing the update field from the JavaVersion
tuple, because the IBM JDK version string does not include the update information ('_xx').

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kiszk/spark SPARK-12502

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10463.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10463


commit 8157ee06c54d598377f4b79467012913ef4cbfad
Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
Date:   2015-12-24T03:06:13Z

fix an exception with IBM JDK

remove update field from a JavaVersion tuple. This is because IBM JDK does 
not have information on update '_xx'







[GitHub] spark pull request: [SPARK-12311][CORE] Restore previous value of ...

2015-12-19 Thread kiszk
Github user kiszk commented on the pull request:

https://github.com/apache/spark/pull/10289#issuecomment-165960572
  
I will fix this merge issue this weekend.





[GitHub] spark pull request: [SPARK-12580][SQL] Remove string concatenation...

2015-12-30 Thread kiszk
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/10524

[SPARK-12580][SQL] Remove string concatenations from usage and extended in 
@ExpressionDescription

Use multi-line string literals for @ExpressionDescription with ``// 
scalastyle:off line.size.limit`` and ``// scalastyle:on line.size.limit``

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kiszk/spark SPARK-12580

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10524.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10524


commit acb422cb3a7815418bf38cc883839cd8d4b81501
Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
Date:   2015-12-30T08:06:36Z

use multi-line string literals for @ExpressionDescription







[GitHub] spark pull request: [SPARK-12580][SQL] Remove string concatenation...

2015-12-30 Thread kiszk
Github user kiszk commented on the pull request:

https://github.com/apache/spark/pull/10524#issuecomment-167957361
  
@yhuai, are these changes fine with the policy that you proposed? I would 
appreciate it if you would check them.





[GitHub] spark pull request: [SPARK-12580][SQL] Remove string concatenation...

2015-12-30 Thread kiszk
Github user kiszk commented on the pull request:

https://github.com/apache/spark/pull/10524#issuecomment-167996508
  
On Jenkins with Scala 2.11, we can wrap only once in a source file (i.e. only one
string concatenation); using more than one concatenation causes a compilation error.
Is it OK to use only one string concatenation?





[GitHub] spark pull request: [SPARK-12530][Build] Fix build break at Spark-...

2015-12-29 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10488#discussion_r48534614
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 ---
@@ -57,9 +57,10 @@ case class Md5(child: Expression) extends 
UnaryExpression with ImplicitCastInput
  * the hash length is not one of the permitted values, the return value is 
NULL.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family 
as a hex string of the " +
-"input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit 
length of 0 is equivalent " +
-"to 256",
+  usage =
+"""_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family as a 
hex string of the input.
+  SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 
0 is equivalent to 256."""
+  ,
--- End diff --

@yhuai, it sounds good to use multi-line strings for ``usage`` and ``extended``.
How about this policy?
1. First, use a raw string literal to remove the concatenations.
2. If we still want a line with more than 100 characters, ``// scalastyle:off`` and ``// scalastyle:on`` can be used.






[GitHub] spark pull request: [SPARK-12530][Build] Fix build break at Spark-...

2015-12-29 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10488#discussion_r48534869
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 ---
@@ -57,9 +57,10 @@ case class Md5(child: Expression) extends 
UnaryExpression with ImplicitCastInput
  * the hash length is not one of the permitted values, the return value is 
NULL.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family 
as a hex string of the " +
-"input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit 
length of 0 is equivalent " +
-"to 256",
+  usage =
+"""_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family as a 
hex string of the input.
+  SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 
0 is equivalent to 256."""
+  ,
--- End diff --

Separately, I am trying to reproduce this compilation error in a standalone test
case. However, I cannot reproduce the failure (i.e. the compilation finishes
successfully).

```
$ cat ExpressionDescription.java 
import java.lang.annotation.*;

@Retention(RetentionPolicy.RUNTIME)
public @interface ExpressionDescription {
String usage() default "_FUNC_ is undocumented";
String extended() default "No example for _FUNC_.";
}
$ cat test.scala
@ExpressionDescription(
  usage = "_FUNC_(input, bitLength) - Returns a checksum of SHA-2 family as 
a hex string of the " +
"input. SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit 
length of 0 is equivalent " +
"to 256",
  extended = "> SELECT _FUNC_('Spark', 0);\n " +
"'529bc3b07127ecb7e53a4dcf1991d9152c24537d919178022b2c42657f79a26b'")
class test {
  def func() : Unit = {
val str = "abc" + "123" + "XYZ"
  }
}
$ javac ExpressionDescription.java 
$ ~/scala-2.11.7/bin/scalac test.scala 
$
```





[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...

2016-06-04 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13505#discussion_r65811458
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
 ---
@@ -86,11 +86,31 @@ package object expressions  {
   /**
* Helper functions for working with `Seq[Attribute]`.
*/
-  implicit class AttributeSeq(attrs: Seq[Attribute]) {
+  implicit class AttributeSeq(val attrs: Seq[Attribute]) {
 /** Creates a StructType with a schema matching this `Seq[Attribute]`. 
*/
 def toStructType: StructType = {
   StructType(attrs.map(a => StructField(a.name, a.dataType, 
a.nullable)))
 }
+
+private lazy val inputArr = attrs.toArray
+
+private lazy val inputToOrdinal = {
+  val map = new java.util.HashMap[ExprId, Int](inputArr.length * 2)
+  var index = 0
+  attrs.foreach { attr =>
+if (!map.containsKey(attr.exprId)) {
+  map.put(attr.exprId, index)
+}
+index += 1
--- End diff --

Got it. I did not have a strong preference either way. Good to hear the reasoning
behind this decision.





[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

2016-06-06 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13439
  
Oh, you are right. IMHO, it is too complex to introduce new implementation classes
only for a column vector that holds the same value in all of the rows.
Introducing compression schemes, as implemented in ```CachedBatch```, may be a
more generic solution if we do introduce new implementation classes.





[GitHub] spark issue #13539: [SPARK-15795] [SQL] Enable more optimizations in whole s...

2016-06-07 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13539
  
test this please





[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

2016-06-06 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13439
  
Can we implement this feature simply by 

- using ```ColumnarBatch.allocate(StructType schema, MemoryMode memMode, 
int maxRows)``` with ```maxRows=1```
- not introducing ```OffHeapConstantColumnVector``` and 
```OnHeapConstantColumnVector```

IMHO, if this PR focuses on reducing memory usage, this simpler approach could be sufficient.
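
A minimal sketch of what I have in mind (the ```allocate``` signature is the one quoted above; the package paths and the writer calls are my assumptions for the Spark 2.0-era vectorized classes, not code from this PR):
```scala
// Sketch only: allocate a single-row batch and treat its one row as the constant value.
import org.apache.spark.memory.MemoryMode
import org.apache.spark.sql.execution.vectorized.ColumnarBatch
import org.apache.spark.sql.types.{IntegerType, StructType}

val schema = new StructType().add("c", IntegerType)
val batch = ColumnarBatch.allocate(schema, MemoryMode.ON_HEAP, /* maxRows = */ 1)
batch.column(0).putInt(0, 42) // the single constant value shared by all logical rows
batch.setNumRows(1)
```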


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11301: [SPARK-13432][SQL] add the source file name and l...

2016-06-12 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/11301#discussion_r66726703
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ---
@@ -37,6 +39,16 @@ private[sql] object Column {
   def apply(expr: Expression): Column = new Column(expr)
 
   def unapply(col: Column): Option[Expression] = Some(col.expr)
+
+  @scala.annotation.varargs
+  def updateExpressionsOrigin(cols: Column*): Unit = {
+// Update Expression.origin using the callSite of an operation
+val callSite = org.apache.spark.util.Utils.getCallSite().shortForm
+cols.map(col => col.expr.foreach(e => e.origin.callSite = 
Some(callSite)))
+// Update CurrentOrigin for setting origin for LogicalPlan node
+CurrentOrigin.set(
+  Origin(Some(callSite), CurrentOrigin.get.line, 
CurrentOrigin.get.startPosition))
--- End diff --

Since I was too busy, I did not have time to work on this PR. Sorry.

My current idea is to set ```CurrentOrigin``` at 
[```SQLExecution.withNewExecutionId```](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala#L57),
 and then 
[```TreeNode.toOriginString```](https://github.com/apache/spark/pull/11301/files#diff-eac5b02bb450a235fef5e902a2671254R446)
 dumps ```CurrentOrigin.get.callSite```.

My next step seems to be to pass ```callSite``` from an operation such as 
```filter``` through ```Dataset``` to the ```LogicalPlan```, so that it can be 
accessed at ```SQLExecution.withNewExecutionId```.
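
A rough sketch of that idea (this assumes the ```callSite``` field that this PR adds to ```Origin```; it is not code from the PR):
```scala
// Sketch only: capture the user-facing call site once and expose it through CurrentOrigin
// while the body runs, restoring the previous origin afterwards.
import org.apache.spark.sql.catalyst.trees.{CurrentOrigin, Origin}
import org.apache.spark.util.Utils

def withCallSiteOrigin[T](body: => T): T = {
  val callSite = Utils.getCallSite().shortForm
  val previous = CurrentOrigin.get
  CurrentOrigin.set(Origin(Some(callSite), previous.line, previous.startPosition))
  try body finally CurrentOrigin.set(previous)
}
```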



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13589: [SPARK-15822][SPARK-15825][SQL] Fix SMJ Segfault/...

2016-06-10 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13589#discussion_r66568805
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -490,6 +490,7 @@ class CodegenContext {
   addNewFunction(compareFunc, funcCode)
   s"this.$compareFunc($c1, $c2)"
 case schema: StructType =>
+  INPUT_ROW = "i"
   val comparisons = GenerateOrdering.genComparisons(this, schema)
--- End diff --

Is it better to use ```InternalRow $INPUT_ROW = null;``` in the assignment 
of ```funcCode```, to clearly show the intention of this change?
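
For illustration, the shape I have in mind is something like the following (a standalone sketch with placeholder names, not the actual ```CodeGenerator``` code):
```scala
// Sketch only: declare the row variable at the top of the generated comparator so the
// rebinding of INPUT_ROW to "i" is explicit in the emitted Java source.
object ComparatorCodeSketch {
  val INPUT_ROW = "i"

  def funcCode(compareFunc: String, comparisons: String): String =
    s"""
       |public int $compareFunc(InternalRow a, InternalRow b) {
       |  InternalRow $INPUT_ROW = null;  // the comparisons below bind this variable themselves
       |  $comparisons
       |  return 0;
       |}
     """.stripMargin
}
```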


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...

2016-06-04 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13505#discussion_r65804276
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
 ---
@@ -86,11 +86,31 @@ package object expressions  {
   /**
* Helper functions for working with `Seq[Attribute]`.
*/
-  implicit class AttributeSeq(attrs: Seq[Attribute]) {
+  implicit class AttributeSeq(val attrs: Seq[Attribute]) {
 /** Creates a StructType with a schema matching this `Seq[Attribute]`. 
*/
 def toStructType: StructType = {
   StructType(attrs.map(a => StructField(a.name, a.dataType, 
a.nullable)))
 }
+
+private lazy val inputArr = attrs.toArray
+
+private lazy val inputToOrdinal = {
+  val map = new java.util.HashMap[ExprId, Int](inputArr.length * 2)
--- End diff --

Why is ```*2``` necessary?
I think that the number of map entries is at most ```attrs.size```, since the 
maximum number of ```map.put()``` calls is equal to ```attrs.size```. Is 
```attrs.size``` equal to ```inputArr.length```?
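
If the intent of ```*2``` is the load factor, the following standalone illustration is what I have in mind (not code from this PR): ```java.util.HashMap``` rehashes once its size exceeds capacity * loadFactor (0.75 by default), so a capacity of roughly twice the expected entry count avoids a resize.
```scala
// Illustration only: with n entries and initial capacity n, the default load factor (0.75)
// can trigger a rehash; an initial capacity of n * 2 keeps n insertions under the threshold.
val n = 100
val map = new java.util.HashMap[Int, Int](n * 2)
(0 until n).foreach(i => map.put(i, i))
println(s"entries = ${map.size()}")
```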


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13663: [SPARK-15950][SQL] Eliminate unreachable code at ...

2016-06-14 Thread kiszk
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/13663

[SPARK-15950][SQL] Eliminate unreachable code at projection for complex types

## What changes were proposed in this pull request?

This PR eliminates unreachable code at projection for complex types (e.g. 
array, map, and struct) to reduce the total size of Java bytecode in a method 
generated by whole-stage codegen.
This PR eliminates the following code blocks in the following cases:

1. A type of ```project_value``` is ```UnsafeArrayData```
2. ```project_value``` is ```null```
3. Each element of ```project_values``` is ```null```, if that case can never occur

Examples for complex types:
```scala
val df = sparkContext.parallelize(Seq(1, 2), 1).toDF("v")
df.selectExpr("Array(v + 2, v + 3)").collect
df.selectExpr("struct(v + 3, v + 4)").collect
df.selectExpr("map(v + 3, v + 4)").collect
```

Before applying this PR into ```Array```
```java
/* 028 */   protected void processNext() throws java.io.IOException {
/* 029 */ while (inputadapter_input.hasNext()) {
/* 030 */   InternalRow inputadapter_row = (InternalRow) 
inputadapter_input.next();
/* 031 */   final boolean project_isNull = false;
/* 032 */   this.project_values = new Object[2];
/* 033 */   double inputadapter_value = inputadapter_row.getDouble(0);
/* 034 */
/* 035 */   double project_value1 = -1.0;
/* 036 */   project_value1 = inputadapter_value + 2.2D;
/* 037 */   if (false) {
/* 038 */ project_values[0] = null;
/* 039 */   } else {
/* 040 */ project_values[0] = project_value1;
/* 041 */   }
/* 042 */
/* 043 */   double inputadapter_value1 = inputadapter_row.getDouble(1);
/* 044 */
/* 045 */   double project_value4 = -1.0;
/* 046 */   project_value4 = inputadapter_value1 + 3.3D;
/* 047 */   if (false) {
/* 048 */ project_values[1] = null;
/* 049 */   } else {
/* 050 */ project_values[1] = project_value4;
/* 051 */   }
/* 052 */
/* 053 */   final ArrayData project_value = new 
org.apache.spark.sql.catalyst.util.GenericArrayData(project_values);
/* 054 */   this.project_values = null;
/* 055 */   project_holder.reset();
/* 056 */
/* 057 */   project_rowWriter.zeroOutNullBytes();
/* 058 */
/* 059 */   if (project_isNull) {
/* 060 */ project_rowWriter.setNullAt(0);
/* 061 */   } else {
/* 062 */ // Remember the current cursor so that we can calculate 
how many bytes are
/* 063 */ // written later.
/* 064 */ final int project_tmpCursor = project_holder.cursor;
/* 065 */
/* 066 */ if (project_value instanceof UnsafeArrayData) {
/* 067 */   final int project_sizeInBytes = ((UnsafeArrayData) 
project_value).getSizeInBytes();
/* 068 */   // grow the global buffer before writing data.
/* 069 */   project_holder.grow(project_sizeInBytes);
/* 070 */   ((UnsafeArrayData) 
project_value).writeToMemory(project_holder.buffer, project_holder.cursor);
/* 071 */   project_holder.cursor += project_sizeInBytes;
/* 072 */
/* 073 */ } else {
/* 074 */   final int project_numElements = 
project_value.numElements();
/* 075 */   project_arrayWriter.initialize(project_holder, 
project_numElements, 8);
/* 076 */
/* 077 */   for (int project_index = 0; project_index < 
project_numElements; project_index++) {
/* 078 */ if (project_value.isNullAt(project_index)) {
/* 079 */   project_arrayWriter.setNullAt(project_index);
/* 080 */ } else {
/* 081 */   final double project_element = 
project_value.getDouble(project_index);
/* 082 */   project_arrayWriter.write(project_index, 
project_element);
/* 083 */ }
/* 084 */   }
/* 085 */ }
/* 086 */
/* 087 */ project_rowWriter.setOffsetAndSize(0, project_tmpCursor, 
project_holder.cursor - project_tmpCursor);
/* 088 */ project_rowWriter.alignToWords(project_holder.cursor - 
project_tmpCursor);
/* 089 */   }
/* 090 */   project_result.setTotalSize(project_holder.totalSize());
/* 091 */   append(project_result);
/* 092 */   if (shouldStop()) return;
/* 093 */ }
/* 094 */   }
```

After applying this PR into ```Array```
```java
/* 028 */   protected void processNext() throws java.io.IOException {
/* 029 */ while (inputadapter_input.hasNext()) {
/* 030 */   InternalRow inputadapter_row = (InternalRow) 
inputadapter_input.next();
/* 031 */   final
```

[GitHub] spark issue #13663: [SPARK-15950][SQL] Eliminate unreachable code at project...

2016-06-14 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13663
  
@cloud-fan and @davies , thank you for your comments.
A global variable is used for item 1. I will address item 1 in another PR, 
using a different approach that does not require a global variable. This PR will 
focus on items 2 and 3, which can be addressed without a global variable.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce additional implementa...

2016-06-15 Thread kiszk
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/13680

[SPARK-15962][SQL] Introduce additional implementation with a dense format 
for UnsafeArrayData

## What changes were proposed in this pull request?

This PR introduces two implementations of ```UnsafeArrayData```. One is a 
sparse array, ```UnsafeArrayDataSparse```, which is the original format of 
```UnsafeArrayData```. The other is a dense array, ```UnsafeArrayDataDense```, 
which is a new format of ```UnsafeArrayData```.

```UnsafeArrayDataSparse``` accepts a ```null``` value in each entry of an 
array. However, it increases the memory footprint (e.g. 2x for an integer array) 
due to the existence of the [```offsets``` 
area](https://github.com/apache/spark/blob/43b149fb885a27f9467aab28e5195f6f03aadcf0/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java#L35).
 In some cases (e.g. when an unsafe array is created from a primitive array by the 
method ```fromPrimitiveArray```), we know there is no ```null``` value in the 
array. ```UnsafeArrayDataDense``` can reduce the memory footprint since it does not 
have the ```offsets``` area.
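
As a rough back-of-the-envelope illustration of the footprint difference for an int array (sizes simplified for this sketch; the actual layout is in the ```UnsafeArrayData``` source linked above):
```scala
// Illustration only, with simplified per-element costs: the sparse format pays an extra
// 4 bytes per element for the offsets area on top of the 4-byte int values; the dense
// format stores only the values.
val n = 1024
val sparseBytes = 4 * n /* offsets */ + 4 * n /* values */
val denseBytes  = 4 * n /* values */
println(s"sparse ~ $sparseBytes bytes, dense ~ $denseBytes bytes (~2x)")
```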

## How was this patch tested?

Tested by existing unit tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kiszk/spark SPARK-15962

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13680.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13680


commit 2b15523ddf0c8b8387302611d564fcc2de239af5
Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
Date:   2016-06-14T19:46:29Z

add two implementations (sparse and dense) for UnsafeArrayData

commit 639b32d91cee1572e1146f325c40e158a8158ce5
Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
Date:   2016-06-15T06:19:18Z

fix failures of testsuite




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13663: [SPARK-15950][SQL] Eliminate unreachable code at ...

2016-06-15 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13663#discussion_r67209280
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -60,18 +60,24 @@ case class CreateArray(children: Seq[Expression]) 
extends Expression {
 ctx.INPUT_ROW,
 children.zipWithIndex.map { case (e, i) =>
   val eval = e.genCode(ctx)
-  eval.code + s"""
-if (${eval.isNull}) {
--- End diff --

Javac or the just-in-time compiler can optimize this. But the code generator can 
easily do it without a global variable, too. I think that this is a good 
tradeoff.
What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13505: [SPARK-15764][SQL] Replace N^2 loop in BindRefere...

2016-06-04 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13505#discussion_r65804344
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
 ---
@@ -86,11 +86,31 @@ package object expressions  {
   /**
* Helper functions for working with `Seq[Attribute]`.
*/
-  implicit class AttributeSeq(attrs: Seq[Attribute]) {
+  implicit class AttributeSeq(val attrs: Seq[Attribute]) {
 /** Creates a StructType with a schema matching this `Seq[Attribute]`. 
*/
 def toStructType: StructType = {
   StructType(attrs.map(a => StructField(a.name, a.dataType, 
a.nullable)))
 }
+
+private lazy val inputArr = attrs.toArray
+
+private lazy val inputToOrdinal = {
+  val map = new java.util.HashMap[ExprId, Int](inputArr.length * 2)
+  var index = 0
+  attrs.foreach { attr =>
+if (!map.containsKey(attr.exprId)) {
+  map.put(attr.exprId, index)
+}
+index += 1
--- End diff --

Which style is better, this one or a style that uses ```zipWithIndex```?
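
For reference, the ```zipWithIndex``` variant would look roughly like this (simplified types, standalone illustration rather than the PR's code):
```scala
// Illustration only: build the first-occurrence index map with zipWithIndex instead of a
// manually incremented counter.
val attrs = Seq("a", "b", "a", "c")
val map = new java.util.HashMap[String, Int](attrs.length * 2)
attrs.zipWithIndex.foreach { case (attr, index) =>
  if (!map.containsKey(attr)) {
    map.put(attr, index) // keep only the index of the first occurrence
  }
}
```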


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13472: [SPARK-15735] Allow specifying min time to run in...

2016-06-04 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13472#discussion_r65804523
  
--- Diff: core/src/main/scala/org/apache/spark/util/Benchmark.scala ---
@@ -33,18 +38,37 @@ import org.apache.commons.lang3.SystemUtils
  *
  * The benchmark function takes one argument that is the iteration that's 
being run.
  *
- * If outputPerIteration is true, the timing for each run will be printed 
to stdout.
+ * @param name name of this benchmark.
+ * @param valuesPerIteration number of values used in the test case, used 
to compute rows/s.
+ * @param minNumIters the min number of iterations that will be run per 
case, not counting warm-up.
+ * @param warmupTime amount of time to spend running dummy case iterations 
for JIT warm-up.
+ * @param minTime further iterations will be run for each case until this 
time is used up.
+ * @param outputPerIteration if true, the timing for each run will be 
printed to stdout.
+ * @param output optional output stream to write benchmark results to
  */
 private[spark] class Benchmark(
 name: String,
 valuesPerIteration: Long,
-defaultNumIters: Int = 5,
-outputPerIteration: Boolean = false) {
+minNumIters: Int = 2,
+warmupTime: FiniteDuration = 2.seconds,
+minTime: FiniteDuration = 2.seconds,
--- End diff --

ditto.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13472: [SPARK-15735] Allow specifying min time to run in...

2016-06-04 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13472#discussion_r65804521
  
--- Diff: core/src/main/scala/org/apache/spark/util/Benchmark.scala ---
@@ -33,18 +38,37 @@ import org.apache.commons.lang3.SystemUtils
  *
  * The benchmark function takes one argument that is the iteration that's 
being run.
  *
- * If outputPerIteration is true, the timing for each run will be printed 
to stdout.
+ * @param name name of this benchmark.
+ * @param valuesPerIteration number of values used in the test case, used 
to compute rows/s.
+ * @param minNumIters the min number of iterations that will be run per 
case, not counting warm-up.
+ * @param warmupTime amount of time to spend running dummy case iterations 
for JIT warm-up.
+ * @param minTime further iterations will be run for each case until this 
time is used up.
+ * @param outputPerIteration if true, the timing for each run will be 
printed to stdout.
+ * @param output optional output stream to write benchmark results to
  */
 private[spark] class Benchmark(
 name: String,
 valuesPerIteration: Long,
-defaultNumIters: Int = 5,
-outputPerIteration: Boolean = false) {
+minNumIters: Int = 2,
+warmupTime: FiniteDuration = 2.seconds,
--- End diff --

I like adding a warm-up period, but is 2 seconds the right default? It 
depends on the machine, the program (we can specify a value explicitly), and other factors.
Are there any other thoughts?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13439: [SPARK-15701][SQL] Constant ColumnVector only needs to p...

2016-06-01 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13439
  
Can we have a benchmark program to show performance improvement?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13459: [SPARK-15726] [SQL] Make DatasetBenchmark fairer among D...

2016-06-01 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13459
  
Jenkins, test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13663: [SPARK-15950][SQL] Eliminate unreachable code at ...

2016-06-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13663#discussion_r67849394
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameComplexTypeSuite.scala ---
@@ -26,6 +27,55 @@ import org.apache.spark.sql.test.SharedSQLContext
 class DataFrameComplexTypeSuite extends QueryTest with SharedSQLContext {
   import testImplicits._
 
+  test("create an array") {
+val df = sparkContext.parallelize(Seq(1, 2), 1).toDF("v")
+df.selectExpr("Array(v + 3, v + 4)").collect
--- End diff --

This is to make sure that, when the code is eliminated, simple array creation 
still works without any exception.
Do you think this is unnecessary?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...

2016-06-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13758#discussion_r67849962
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala
 ---
@@ -142,3 +164,415 @@ class GenericArrayData(val array: Array[Any]) extends 
ArrayData {
 result
   }
 }
+
+final class GenericIntArrayData(private val primitiveArray: Array[Int])
--- End diff --

Thank you for your suggestion. It would be good to make 
```GenericArrayData``` abstract.
As you said, while ```ArrayData``` will be megamorphic in the class hierarchy, 
actual use should be monomorphic. Making each class ```final``` alleviates the 
runtime overhead.
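
A condensed sketch of the shape being discussed (names and members reduced for illustration; not the PR's actual classes):
```scala
// Sketch only: an abstract base with final, type-specialized subclasses, so a call site that
// only ever sees one subclass stays monomorphic for the JIT.
abstract class ArrayDataSketch {
  def numElements(): Int
  def getInt(ordinal: Int): Int
}

final class IntArrayDataSketch(values: Array[Int]) extends ArrayDataSketch {
  override def numElements(): Int = values.length
  override def getInt(ordinal: Int): Int = values(ordinal)
}
```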


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...

2016-06-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13758#discussion_r67850981
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala
 ---
@@ -142,3 +164,415 @@ class GenericArrayData(val array: Array[Any]) extends 
ArrayData {
 result
   }
 }
+
+final class GenericIntArrayData(private val primitiveArray: Array[Int])
+  extends GenericArrayData(Array.empty) {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new GenericIntArrayData(primitiveArray)
+
+  override def numElements(): Int = primitiveArray.length
+
+  override def isNullAt(ordinal: Int): Boolean = false
+  override def getInt(ordinal: Int): Int = primitiveArray(ordinal)
+  override def toIntArray(): Array[Int] = {
+val array = new Array[Int](numElements)
+System.arraycopy(primitiveArray, 0, array, 0, numElements)
+array
+  }
+  override def toString(): String = primitiveArray.mkString("[", ",", "]")
+
+  override def equals(o: Any): Boolean = {
--- End diff --

Good question about the definition of equality. Interestingly, there are no 
```hashCode()``` and ```equals()``` methods in 
[UnsafeArrayData](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java).

During my work on code generation, I could not find any usage of the 
```hashCode()``` and ```equals()``` methods. Is it better to drop these two 
methods if all of the tests pass without them?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...

2016-06-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13758#discussion_r67993179
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala
 ---
@@ -23,7 +23,60 @@ import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.types.{DataType, Decimal}
 import org.apache.spark.unsafe.types.{CalendarInterval, UTF8String}
 
-class GenericArrayData(val array: Array[Any]) extends ArrayData {
+object GenericArrayData {
+  def allocate(seq: Seq[Any]): GenericArrayData = new 
GenericRefArrayData(seq)
--- End diff --

Definitely, you are right


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...

2016-06-21 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13758
  
@hvanhovell , I added [a 
file](https://github.com/kiszk/spark/blob/133d4c0085b5ca2f20870c05d077e25d8715e07a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayDataBenchmark.scala)
 for the benchmark (not run yet). I would appreciate it if you have time to look 
at this.

It is very strange to me that I can run a benchmark program under 
```sql/core``` (e.g. ```MiscBenchmark```) by using ```build/sbt "sql/test-only 
*MiscBenchmark*"```.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...

2016-06-21 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13758
  
@hvanhovell, yes, it is a good idea. Actually, I wrote a benchmark program, 
```org.apache.spark.sql.catalyst.util.GenericArrayBenchmark``` (not committed 
yet). An issue in my environment is that I cannot run a benchmark program under 
sql/catalyst.

The following command does not execute my benchmark program...
```
build/sbt "catalyst/test-only *GenericArrayBenchmark*"
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...

2016-06-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13758#discussion_r67996014
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala
 ---
@@ -112,8 +112,8 @@ class CodeGenerationSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 val plan = GenerateMutableProjection.generate(expressions)
 val actual = plan(new 
GenericMutableRow(length)).toSeq(expressions.map(_.dataType))
 val expected = Seq(new ArrayBasedMapData(
-  new GenericArrayData(0 until length),
-  new GenericArrayData(Seq.fill(length)(true
+  GenericArrayData.allocate(0 until length),
+  GenericArrayData.allocate(Seq.fill(length)(true
--- End diff --

Ok, I will do both.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...

2016-06-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13758#discussion_r67995936
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala
 ---
@@ -142,3 +196,414 @@ class GenericArrayData(val array: Array[Any]) extends 
ArrayData {
 result
   }
 }
+
+final class GenericIntArrayData(val primitiveArray: Array[Int]) extends 
GenericArrayData {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new GenericIntArrayData(primitiveArray)
--- End diff --

Hmm, I should clone the backing array. I should return the actual type.
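
Something like the following, as a sketch of that fix (simplified, not the actual class):
```scala
// Sketch only: copy() clones the backing primitive array and returns the concrete type, so
// the copy does not share mutable state with the original instance.
final class IntArrayCopySketch(val primitiveArray: Array[Int]) {
  def copy(): IntArrayCopySketch = new IntArrayCopySketch(primitiveArray.clone())
}
```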


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...

2016-06-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13758#discussion_r67995993
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala
 ---
@@ -142,3 +196,414 @@ class GenericArrayData(val array: Array[Any]) extends 
ArrayData {
 result
   }
 }
+
+final class GenericIntArrayData(val primitiveArray: Array[Int]) extends 
GenericArrayData {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new GenericIntArrayData(primitiveArray)
+
+  override def numElements(): Int = primitiveArray.length
+
+  override def isNullAt(ordinal: Int): Boolean = false
+  override def getInt(ordinal: Int): Int = primitiveArray(ordinal)
+  override def toIntArray(): Array[Int] = {
+val array = new Array[Int](numElements)
+System.arraycopy(primitiveArray, 0, array, 0, numElements)
+array
+  }
+  override def toString(): String = primitiveArray.mkString("[", ",", "]")
--- End diff --

Yes, I will move it up.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...

2016-06-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13758#discussion_r67997326
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala
 ---
@@ -142,3 +164,415 @@ class GenericArrayData(val array: Array[Any]) extends 
ArrayData {
 result
   }
 }
+
+final class GenericIntArrayData(private val primitiveArray: Array[Int])
+  extends GenericArrayData(Array.empty) {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new GenericIntArrayData(primitiveArray)
+
+  override def numElements(): Int = primitiveArray.length
+
+  override def isNullAt(ordinal: Int): Boolean = false
+  override def getInt(ordinal: Int): Int = primitiveArray(ordinal)
+  override def toIntArray(): Array[Int] = {
+val array = new Array[Int](numElements)
+System.arraycopy(primitiveArray, 0, array, 0, numElements)
+array
+  }
+  override def toString(): String = primitiveArray.mkString("[", ",", "]")
+
+  override def equals(o: Any): Boolean = {
--- End diff --

I should implement ```equals()``` and ```hashCode()``` for the classes related 
to ```GenericArrayData```.

I already implemented type-specialized ```equals()``` and ```hashCode()``` 
in ```GenericArrayData```. An issue with ```equals()``` and 
```hashCode()``` of ```GenericRefArrayData``` is that the type of each element may 
be different, since the elements are held in ```Array[Any]```.
 
If I misunderstood your suggestion, could you please let me know?
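
For what it is worth, the element-wise fallback I have in mind for the ```Array[Any]``` case looks roughly like this (illustration only, not the PR's code):
```scala
// Illustration only: with Array[Any] the element types can differ, so equality and hashing
// fall back to Object-level equals/hashCode per element.
def refArrayEquals(a: Array[Any], b: Array[Any]): Boolean =
  a.length == b.length && a.indices.forall(i => a(i) == b(i))

def refArrayHashCode(a: Array[Any]): Int =
  a.foldLeft(37)((acc, e) => 37 * acc + (if (e == null) 0 else e.hashCode()))
```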


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...

2016-06-22 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13758#discussion_r68004874
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayDataBenchmark.scala
 ---
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.util.Benchmark
+
+/**
+ * Benchmark [[GenericArrayData]] for Dense and Sparse with primitive type
+ */
+object GenericArrayDataBenchmark {
+/*
+  def allocateGenericIntArray(iters: Int): Unit = {
+val count = 1024 * 1024 * 10
+var array: GenericArrayData = null
+
+val primitiveIntArray = new Array[Int](count)
+val denseIntArray = { i: Int =>
+  for (n <- 0L until iters) {
--- End diff --

I will do this


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...

2016-06-22 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13758#discussion_r68004837
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala
 ---
@@ -159,17 +159,17 @@ object CatalystTypeConverters {
 override def toCatalystImpl(scalaValue: Any): ArrayData = {
   scalaValue match {
 case a: Array[_] =>
-  new GenericArrayData(a.map(elementConverter.toCatalyst))
+  GenericArrayData.allocate(a.map(elementConverter.toCatalyst))
--- End diff --

Yes, I feel so for now, since the type of `toCatalyst` seems to be `Any`. Is it 
better to prepare specialized code to make the type concrete here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...

2016-06-22 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13758#discussion_r68007482
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala
 ---
@@ -142,3 +164,415 @@ class GenericArrayData(val array: Array[Any]) extends 
ArrayData {
 result
   }
 }
+
+final class GenericIntArrayData(private val primitiveArray: Array[Int])
+  extends GenericArrayData(Array.empty) {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new GenericIntArrayData(primitiveArray)
+
+  override def numElements(): Int = primitiveArray.length
+
+  override def isNullAt(ordinal: Int): Boolean = false
+  override def getInt(ordinal: Int): Int = primitiveArray(ordinal)
+  override def toIntArray(): Array[Int] = {
+val array = new Array[Int](numElements)
+System.arraycopy(primitiveArray, 0, array, 0, numElements)
+array
+  }
+  override def toString(): String = primitiveArray.mkString("[", ",", "]")
+
+  override def equals(o: Any): Boolean = {
--- End diff --

Thank you for your explanation. Great idea for ease of reading. I will do it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...

2016-06-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13758#discussion_r67994789
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala
 ---
@@ -142,3 +196,414 @@ class GenericArrayData(val array: Array[Any]) extends 
ArrayData {
 result
   }
 }
+
+final class GenericIntArrayData(val primitiveArray: Array[Int]) extends 
GenericArrayData {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new GenericIntArrayData(primitiveArray)
+
+  override def numElements(): Int = primitiveArray.length
+
+  override def isNullAt(ordinal: Int): Boolean = false
+  override def getInt(ordinal: Int): Int = primitiveArray(ordinal)
+  override def toIntArray(): Array[Int] = {
+val array = new Array[Int](numElements)
+System.arraycopy(primitiveArray, 0, array, 0, numElements)
+array
+  }
+  override def toString(): String = primitiveArray.mkString("[", ",", "]")
+
+  override def equals(o: Any): Boolean = {
+if (!o.isInstanceOf[GenericIntArrayData]) {
+  return false
+}
+
+val other = o.asInstanceOf[GenericIntArrayData]
+if (other eq null) {
+  return false
+}
+
+val len = numElements()
+if (len != other.numElements()) {
+  return false
+}
+
+var i = 0
+while (i < len) {
+  val o1 = primitiveArray(i)
+  val o2 = other.primitiveArray(i)
+  if (o1 != o2) {
+return false
+  }
+  i += 1
+}
+true
+  }
+
+  override def hashCode: Int = {
+var result: Int = 37
+var i = 0
+val len = numElements()
+while (i < len) {
+  val update: Int = primitiveArray(i)
+  result = 37 * result + update
+  i += 1
+}
+result
+  }
+}
+
+final class GenericLongArrayData(val primitiveArray: Array[Long])
+  extends GenericArrayData {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new GenericLongArrayData(primitiveArray)
+
+  override def numElements(): Int = primitiveArray.length
+
+  override def isNullAt(ordinal: Int): Boolean = false
+  override def getLong(ordinal: Int): Long = primitiveArray(ordinal)
+  override def toLongArray(): Array[Long] = {
+val array = new Array[Long](numElements)
+System.arraycopy(primitiveArray, 0, array, 0, numElements)
+array
+  }
+  override def toString(): String = primitiveArray.mkString("[", ",", "]")
+
+  override def equals(o: Any): Boolean = {
+if (!o.isInstanceOf[GenericLongArrayData]) {
+  return false
+}
+
+val other = o.asInstanceOf[GenericLongArrayData]
+if (other eq null) {
+  return false
+}
+
+val len = numElements()
+if (len != other.numElements()) {
+  return false
+}
+
+var i = 0
+while (i < len) {
+  val o1 = primitiveArray(i)
+  val o2 = other.primitiveArray(i)
+  if (o1 != o2) {
+return false
+  }
+  i += 1
+}
+true
+  }
+
+  override def hashCode: Int = {
+var result: Int = 37
+var i = 0
+val len = numElements()
+while (i < len) {
+  val l = primitiveArray(i)
+  val update: Int = (l ^ (l >>> 32)).toInt
+  result = 37 * result + update
+  i += 1
+}
+result
+  }
+}
+
+final class GenericFloatArrayData(val primitiveArray: Array[Float])
+  extends GenericArrayData {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new 
GenericFloatArrayData(primitiveArray)
+
+  override def numElements(): Int = primitiveArray.length
+
+  override def isNullAt(ordinal: Int): Boolean = false
+  override def getFloat(ordinal: Int): Float = primitiveArray(ordinal)
+  override def toFloatArray(): Array[Float] = {
+val array = new Array[Float](numElements)
+System.arraycopy(primitiveArray, 0, array, 0, numElements)
+array
+  }
+  override def toString(): String = primitiveArray.mkString("[", ",", "]")
+
+  override def equals(o: Any): Boolean = {
+if (!o.isInstanceOf[GenericFloatArrayData]) {
+  return false
+}
+
+val other = o.asInstanceOf[GenericFloatArrayData]
+if (other eq null) {
+  return false
+}
+
+val len = numElements()
+if (len != other.numElements()) {
+  r

[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...

2016-06-22 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13758#discussion_r68015184
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayDataBenchmark.scala
 ---
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import org.apache.spark.util.Benchmark
+
+/**
+ * Benchmark [[GenericArrayData]] for Dense and Sparse with primitive type
+ */
+object GenericArrayDataBenchmark {
+/*
+  def allocateGenericIntArray(iters: Int): Unit = {
--- End diff --

Thank you for your suggestions. Now this benchmark works well on my cloud 
instance. I saw pretty good results (~2x performance improvement) for all 
three types (allocation, primitive array, and get elements). I will 
commit the latest version later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...

2016-06-22 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13758#discussion_r68039256
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala
 ---
@@ -142,3 +196,414 @@ class GenericArrayData(val array: Array[Any]) extends 
ArrayData {
 result
   }
 }
+
+final class GenericIntArrayData(val primitiveArray: Array[Int]) extends 
GenericArrayData {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new GenericIntArrayData(primitiveArray)
+
+  override def numElements(): Int = primitiveArray.length
+
+  override def isNullAt(ordinal: Int): Boolean = false
+  override def getInt(ordinal: Int): Int = primitiveArray(ordinal)
+  override def toIntArray(): Array[Int] = {
+val array = new Array[Int](numElements)
+System.arraycopy(primitiveArray, 0, array, 0, numElements)
+array
+  }
+  override def toString(): String = primitiveArray.mkString("[", ",", "]")
+
+  override def equals(o: Any): Boolean = {
+if (!o.isInstanceOf[GenericIntArrayData]) {
+  return false
+}
+
+val other = o.asInstanceOf[GenericIntArrayData]
+if (other eq null) {
+  return false
+}
+
+val len = numElements()
+if (len != other.numElements()) {
+  return false
+}
+
+var i = 0
+while (i < len) {
+  val o1 = primitiveArray(i)
+  val o2 = other.primitiveArray(i)
+  if (o1 != o2) {
+return false
+  }
+  i += 1
+}
+true
+  }
+
+  override def hashCode: Int = {
+var result: Int = 37
+var i = 0
+val len = numElements()
+while (i < len) {
+  val update: Int = primitiveArray(i)
+  result = 37 * result + update
+  i += 1
+}
+result
+  }
+}
+
+final class GenericLongArrayData(val primitiveArray: Array[Long])
+  extends GenericArrayData {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new GenericLongArrayData(primitiveArray)
+
+  override def numElements(): Int = primitiveArray.length
+
+  override def isNullAt(ordinal: Int): Boolean = false
+  override def getLong(ordinal: Int): Long = primitiveArray(ordinal)
+  override def toLongArray(): Array[Long] = {
+val array = new Array[Long](numElements)
+System.arraycopy(primitiveArray, 0, array, 0, numElements)
+array
+  }
+  override def toString(): String = primitiveArray.mkString("[", ",", "]")
+
+  override def equals(o: Any): Boolean = {
+if (!o.isInstanceOf[GenericLongArrayData]) {
+  return false
+}
+
+val other = o.asInstanceOf[GenericLongArrayData]
+if (other eq null) {
+  return false
+}
+
+val len = numElements()
+if (len != other.numElements()) {
+  return false
+}
+
+var i = 0
+while (i < len) {
+  val o1 = primitiveArray(i)
+  val o2 = other.primitiveArray(i)
+  if (o1 != o2) {
+return false
+  }
+  i += 1
+}
+true
+  }
+
+  override def hashCode: Int = {
+var result: Int = 37
+var i = 0
+val len = numElements()
+while (i < len) {
+  val l = primitiveArray(i)
+  val update: Int = (l ^ (l >>> 32)).toInt
+  result = 37 * result + update
+  i += 1
+}
+result
+  }
+}
+
+final class GenericFloatArrayData(val primitiveArray: Array[Float])
+  extends GenericArrayData {
+  override def array(): Array[Any] = primitiveArray.toArray
+
+  override def copy(): ArrayData = new 
GenericFloatArrayData(primitiveArray)
+
+  override def numElements(): Int = primitiveArray.length
+
+  override def isNullAt(ordinal: Int): Boolean = false
+  override def getFloat(ordinal: Int): Float = primitiveArray(ordinal)
+  override def toFloatArray(): Array[Float] = {
+val array = new Array[Float](numElements)
+System.arraycopy(primitiveArray, 0, array, 0, numElements)
+array
+  }
+  override def toString(): String = primitiveArray.mkString("[", ",", "]")
+
+  override def equals(o: Any): Boolean = {
+if (!o.isInstanceOf[GenericFloatArrayData]) {
+  return false
+}
+
+val other = o.asInstanceOf[GenericFloatArrayData]
+if (other eq null) {
+  return false
+}
+
+val len = numElements()
+if (len != other.numElements()) {
+  r

[GitHub] spark pull request #13663: [SPARK-15950][SQL] Eliminate unreachable code at ...

2016-06-16 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13663#discussion_r67315501
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameComplexTypeSuite.scala ---
@@ -26,6 +26,38 @@ import org.apache.spark.sql.test.SharedSQLContext
 class DataFrameComplexTypeSuite extends QueryTest with SharedSQLContext {
   import testImplicits._
 
+  test("primitive type on array") {
+val df = sparkContext.parallelize(Seq(1, 2), 1).toDF("v")
--- End diff --

I added unit tests to check whether ```zeroOutNullBytes``` is 
eliminated.
I eliminated three cases such as ```array on array```. I think that we need 
unit tests that generate an array, struct, or map, in order to check that it works 
well even when we eliminate ```zeroOutNullBytes```.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13704: [SPARK-15985][SQL] Reduce runtime overhead of a p...

2016-06-16 Thread kiszk
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/13704

[SPARK-15985][SQL] Reduce runtime overhead of a program that reads a 
primitive array in Dataset

## What changes were proposed in this pull request?

This PR reduces the runtime overhead of a program that reads a primitive array 
in a Dataset. The generated code copies array elements from the Dataset to a temporary 
array. If we know that the types of the source and destination are primitive arrays, we 
apply one of the following optimizations:

1. Eliminate the allocation of ```Object[]``` and call the 
```ArrayData.toArray()``` method if the types of the source and destination are 
the same
2. Eliminate the pair of ```isNullAt()``` and a ```null``` assignment, and 
allocate a primitive array instead of ```Object[]```, if the types of the source and 
destination are different


An example program
```
val ds = Seq(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0)).toDS()
ds.map(p => {
 var s = 0.0
 for (i <- 0 to 2) { s += p(i) }
 s
   }).show
```

Generated code before applying this PR
```
/* 036 */   protected void processNext() throws java.io.IOException {
/* 037 */ while (inputadapter_input.hasNext()) {
/* 038 */   InternalRow inputadapter_row = (InternalRow) 
inputadapter_input.next();
/* 039 */   boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
/* 040 */   ArrayData inputadapter_value = inputadapter_isNull ? null : 
(inputadapter_row.getArray(0));
/* 041 */
/* 042 */   boolean deserializetoobject_isNull1 = inputadapter_isNull;
/* 043 */   ArrayData deserializetoobject_value1 = null;
/* 044 */   if (!inputadapter_isNull) {
/* 045 */ final int deserializetoobject_n = 
inputadapter_value.numElements();
/* 046 */ final Object[] deserializetoobject_values = new 
Object[deserializetoobject_n];
/* 047 */ for (int deserializetoobject_j = 0; deserializetoobject_j 
< deserializetoobject_n; deserializetoobject_j ++) {
/* 048 */   if (inputadapter_value.isNullAt(deserializetoobject_j)) 
{
/* 049 */ deserializetoobject_values[deserializetoobject_j] = 
null;
/* 050 */   } else {
/* 051 */ boolean deserializetoobject_feNull = false;
/* 052 */ double deserializetoobject_fePrim =
/* 053 */ inputadapter_value.getDouble(deserializetoobject_j);
/* 054 */
/* 055 */ boolean deserializetoobject_teNull = 
deserializetoobject_feNull;
/* 056 */ double deserializetoobject_tePrim = -1.0;
/* 057 */ if (!deserializetoobject_feNull) {
/* 058 */   deserializetoobject_tePrim = 
deserializetoobject_fePrim;
/* 059 */ }
/* 060 */
/* 061 */ if (deserializetoobject_teNull) {
/* 062 */   deserializetoobject_values[deserializetoobject_j] = 
null;
/* 063 */ } else {
/* 064 */   deserializetoobject_values[deserializetoobject_j] = 
deserializetoobject_tePrim;
/* 065 */ }
/* 066 */   }
/* 067 */ }
/* 068 */ deserializetoobject_value1 = new 
org.apache.spark.sql.catalyst.util.GenericArrayData(deserializetoobject_values);
/* 069 */
/* 070 */   }
/* 071 */
/* 072 */   boolean deserializetoobject_isNull = 
deserializetoobject_isNull1;
/* 073 */   final double[] deserializetoobject_value = 
deserializetoobject_isNull ? null : (double[]) 
deserializetoobject_value1.toDoubleArray();
/* 074 */   deserializetoobject_isNull = deserializetoobject_value == 
null;
/* 075 */
/* 076 */   Object mapelements_obj = ((Expression) 
references[0]).eval(null);
/* 077 */   scala.Function1 mapelements_value1 = (scala.Function1) 
mapelements_obj;
/* 078 */
/* 079 */   boolean mapelements_isNull = false || 
deserializetoobject_isNull;
/* 080 */   final double mapelements_value = mapelements_isNull ? -1.0 
: (Double) mapelements_value1.apply(deserializetoobject_value);
/* 081 */
/* 082 */   serializefromobject_rowWriter.zeroOutNullBytes();
/* 083 */
/* 084 */   if (mapelements_isNull) {
/* 085 */ serializefromobject_rowWriter.setNullAt(0);
/* 086 */   } else {
/* 087 */ serializefromobject_rowWriter.write(0, mapelements_value);
/* 088 */   }
/* 089 */   append(serializefromobject_result);
/* 090 */   if (shouldStop()) return;
/* 091 */ }
/* 092 */   }
```

Generated code after applying this PR
```
/* 036 */   protected void processNext() throws java.io.IOException {
/* 037 */ while (inputadapter_input.hasNext()) {
/* 038 */   InternalRow inputadapter_row = (InternalRow) 
inputadapter_input.next();
/* 039 */ 
```

[GitHub] spark pull request #13663: [SPARK-15950][SQL] Eliminate unreachable code at ...

2016-06-16 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13663#discussion_r67315144
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -71,7 +71,8 @@ case class CreateArray(children: Seq[Expression]) extends 
Expression {
   s"""
 final ArrayData ${ev.value} = new $arrayClass($values);
 this.$values = null;
-  """)
+  """,
+  isNull = "false")
--- End diff --

Got it. Removed this assignment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13663: [SPARK-15950][SQL] Eliminate unreachable code at ...

2016-06-15 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13663#discussion_r67213607
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -60,18 +60,24 @@ case class CreateArray(children: Seq[Expression]) 
extends Expression {
 ctx.INPUT_ROW,
 children.zipWithIndex.map { case (e, i) =>
   val eval = e.genCode(ctx)
-  eval.code + s"""
-if (${eval.isNull}) {
--- End diff --

Let me change my opinion. Another PR will handle this elimination and related features.
This PR will focus on not assigning ```null``` to each element of ```project_values``` when that can never occur.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13663: [SPARK-15950][SQL] Eliminate unreachable code at ...

2016-06-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13663#discussion_r67863392
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameComplexTypeSuite.scala ---
@@ -26,6 +27,55 @@ import org.apache.spark.sql.test.SharedSQLContext
 class DataFrameComplexTypeSuite extends QueryTest with SharedSQLContext {
   import testImplicits._
 
+  test("create an array") {
+val df = sparkContext.parallelize(Seq(1, 2), 1).toDF("v")
+df.selectExpr("Array(v + 3, v + 4)").collect
--- End diff --

I see. I will remove these three tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13663: [SPARK-15950][SQL] Eliminate unreachable code at project...

2016-06-18 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13663
  
@cloud-fan , @davies , I would appreciate it if you would look at this again. I think that I have addressed all of your comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13758: [SPARK-16043][SQL] Prepare GenericArrayData imple...

2016-06-18 Thread kiszk
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/13758

[SPARK-16043][SQL] Prepare GenericArrayData implementation specialized for 
a primitive array

## What changes were proposed in this pull request?

This PR addresses a ToDo of the ```GenericArrayData``` class. The current implementation of ```GenericArrayData``` leads to boxing/unboxing if the type of the array elements is primitive. It would be good to eliminate boxing/unboxing from the viewpoint of runtime memory footprint and performance.

This PR eliminates boxing/unboxing by preparing subclasses of ```GenericArrayData``` that specialize these operations. This PR prepares a new method ```GenericArrayData.allocate(...)``` that can return a generic ```GenericArrayData``` instance or a ```GenericArrayData``` instance specialized for a primitive type array.

Here are the major improvements (a minimal sketch follows the list):
1. Hold the data in a primitive array (previously ```Object[]``` was used and boxing happened in the constructor)
2. A getter such as ```getDouble()``` reads a value directly from the primitive array (previously unboxing happened)
3. A method such as ```toDoubleArray()``` performs the data copy using ```System.arraycopy``` (previously unboxing happened)
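
To make the intent of these improvements concrete, here is a minimal, self-contained sketch; the class name and shape are illustrative only and are not the actual subclass added by this PR:

```scala
// Minimal sketch (illustrative, not the actual Spark class): hold a double[] directly
// so that reads and copies never box.
class DoubleArrayDataSketch(values: Array[Double]) {
  def numElements(): Int = values.length
  def isNullAt(ordinal: Int): Boolean = false            // a primitive array holds no nulls
  def getDouble(ordinal: Int): Double = values(ordinal)  // reads straight from double[], no unboxing
  def toDoubleArray(): Array[Double] = {
    val copy = new Array[Double](values.length)
    System.arraycopy(values, 0, copy, 0, values.length)  // bulk copy instead of per-element unboxing
    copy
  }
}
```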

## How was this patch tested?

add unit tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kiszk/spark SPARK-16043

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13758.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13758


commit 17bdfcf7b9c155fe05eb75f73d661c2863cdf2d9
Author: Kazuaki Ishizaki <ishiz...@jp.ibm.com>
Date:   2016-06-18T06:47:11Z

Implementation of GenericArrayData specialized for primitive type array

add unit tests




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13757: [SPARK-16042][SQL] Eliminate nullcheck code at pr...

2016-06-18 Thread kiszk
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/13757

[SPARK-16042][SQL] Eliminate nullcheck code at projection for an array type

## What changes were proposed in this pull request?

This PR eliminates null-check code at projection for an array type. Currently, a null check is generated for the call that writes each element of an array (lines 076-078 in "Before applying this PR"). If we know at compilation time that none of the elements can be ```null```, we can eliminate the null-check code.

This PR checks whether ```ArrayType.containsNull``` is ```false``` to know that none of the array elements can be ```null```.



An example program

```
val df = sparkContext.parallelize(Seq(1.0, 2.0), 1).toDF("v")
df.selectExpr("Array(v + 2.2, v + 3.3)").collect
```

Before applying this PR

```
/* 028 */   protected void processNext() throws java.io.IOException {
/* 029 */ while (inputadapter_input.hasNext()) {
/* 030 */   InternalRow inputadapter_row = (InternalRow) 
inputadapter_input.next();
/* 031 */   double inputadapter_value = inputadapter_row.getDouble(0);
/* 032 */
/* 033 */   final boolean project_isNull = false;
/* 034 */   this.project_values = new Object[2];
/* 035 */   double project_value1 = -1.0;
/* 036 */   project_value1 = inputadapter_value + 2.2D;
/* 037 */   if (false) {
/* 038 */ project_values[0] = null;
/* 039 */   } else {
/* 040 */ project_values[0] = project_value1;
/* 041 */   }
/* 042 */
/* 043 */   double project_value4 = -1.0;
/* 044 */   project_value4 = inputadapter_value + 3.3D;
/* 045 */   if (false) {
/* 046 */ project_values[1] = null;
/* 047 */   } else {
/* 048 */ project_values[1] = project_value4;
/* 049 */   }
/* 050 */
/* 051 */   final ArrayData project_value = new 
org.apache.spark.sql.catalyst.util.GenericArrayData(project_values);
/* 052 */   this.project_values = null;
/* 053 */   project_holder.reset();
/* 054 */
/* 055 */   project_rowWriter.zeroOutNullBytes();
/* 056 */
/* 057 */   if (project_isNull) {
/* 058 */ project_rowWriter.setNullAt(0);
/* 059 */   } else {
/* 060 */ // Remember the current cursor so that we can calculate 
how many bytes are
/* 061 */ // written later.
/* 062 */ final int project_tmpCursor = project_holder.cursor;
/* 063 */
/* 064 */ if (project_value instanceof UnsafeArrayData) {
/* 065 */   final int project_sizeInBytes = ((UnsafeArrayData) 
project_value).getSizeInBytes();
/* 066 */   // grow the global buffer before writing data.
/* 067 */   project_holder.grow(project_sizeInBytes);
/* 068 */   ((UnsafeArrayData) 
project_value).writeToMemory(project_holder.buffer, project_holder.cursor);
/* 069 */   project_holder.cursor += project_sizeInBytes;
/* 070 */
/* 071 */ } else {
/* 072 */   final int project_numElements = 
project_value.numElements();
/* 073 */   project_arrayWriter.initialize(project_holder, 
project_numElements, 8);
/* 074 */
/* 075 */   for (int project_index = 0; project_index < 
project_numElements; project_index++) {
/* 076 */ if (project_value.isNullAt(project_index)) {
/* 077 */   project_arrayWriter.setNullAt(project_index);
/* 078 */ } else {
/* 079 */   final double project_element = 
project_value.getDouble(project_index);
/* 080 */   project_arrayWriter.write(project_index, 
project_element);
/* 081 */ }
/* 082 */
/* 083 */   }
/* 084 */ }
/* 085 */
/* 086 */ project_rowWriter.setOffsetAndSize(0, project_tmpCursor, 
project_holder.cursor - project_tmpCursor);
/* 087 */ project_rowWriter.alignToWords(project_holder.cursor - 
project_tmpCursor);
/* 088 */   }
/* 089 */   project_result.setTotalSize(project_holder.totalSize());
/* 090 */   append(project_result);
/* 091 */   if (shouldStop()) return;
/* 092 */ }
```

After applying this PR

```
/* 028 */   protected void processNext() throws java.io.IOException {
/* 029 */ while (inputadapter_input.hasNext()) {
/* 030 */   InternalRow inputadapter_row = (InternalRow) 
inputadapter_input.next();
/* 031 */   double inputadapter_value = inputadapter_row.getDouble(0);
/* 032 */
/* 033 */   final boolean project_isNull = false;
/* 034 */   this.project_values = new Object[2];
/* 035 */   double project_value1 = -1.0;
/* 036 */   project_value1 = inpu

[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce additonal implementation wi...

2016-06-22 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13680
  
It is OK to always keep ```[null bits]```.

One question: is this format meant to keep fixed space in ```[values]```? I mean, if a ```[null bit]``` is true, does the corresponding element in ```[values]``` still occupy space? That would allow us to always access any element at ```4 + (null bits size) + (element size) * offset```.
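
A small worked example of that addressing (assuming ```numElements``` takes 4 bytes and the null bits are padded to 8-byte words; these sizes are only an assumption for illustration):

```scala
// Fixed-layout addressing for an Int array with 5 elements (illustrative sizes).
val numElementsHeader = 4                    // [numElements]
val nullBitsSize      = ((5 + 63) / 64) * 8  // one 8-byte word of null bits
val elementSize       = 4                    // Int
def elementOffset(i: Int): Int = numElementsHeader + nullBitsSize + elementSize * i
// elementOffset(0) == 12, elementOffset(3) == 24: no per-element offset table is needed.
```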

If the answer to the above question is yes, I agree with having no ```[all zero in null bits?]```. Then I will prototype this.

cc @hvanhovell 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce additonal implementation wi...

2016-06-23 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13680
  
Good to hear. I will make an implementation with a single format. If I meet some issues, I will raise them here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce additonal implementation wi...

2016-06-22 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13680
  
@cloud-fan thank you for your good comment. I also read the [previous proposal](https://github.com/apache/spark/pull/12640#discussion_r61539393).
I would love to have only a single format (or implementation). Since I thought there were some reasons to keep the old format, I introduced a new dense format.

IMHO, a new unified format should have three properties:
1. No indirect offsets (for performance and footprint)
2. Support for the presence of null bits (for generality)
3. A quick way to know whether an array contains any null value (for performance, in particular for primitive arrays)

Based on them, how about this single format?
```
[numElements] [all zero in null bits?] [null bits] [values] [variable 
length portion]
``` 
If we want to reduce the memory footprint in the case of a primitive array, we can drop the ```[null bits]``` part when ```[all zero in null bits?]``` has a special value. A rough sketch of how a reader could use that flag follows.
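
Here is a rough, illustrative sketch of ```isNullAt()``` under this proposed header; the field widths (4 bytes each for ```numElements``` and the flag) are assumptions for the example only:

```scala
// Illustrative layout: [numElements: 4B][all-zero flag: 4B][null bits (omitted if flag set)][values]...
def readInt(buf: Array[Byte], off: Int): Int =
  (buf(off) & 0xff) | ((buf(off + 1) & 0xff) << 8) |
  ((buf(off + 2) & 0xff) << 16) | ((buf(off + 3) & 0xff) << 24)

def isNullAt(buf: Array[Byte], base: Int, ordinal: Int): Boolean = {
  val allZero = readInt(buf, base + 4) != 0   // special value => no nulls, [null bits] dropped
  if (allZero) {
    false                                     // fast path for primitive arrays
  } else {
    val byteIndex = base + 8 + (ordinal >> 3) // locate the bit inside [null bits]
    (buf(byteIndex) & (1 << (ordinal & 7))) != 0
  }
}
```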


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...

2016-06-22 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13758
  
@cloud-fan , thank you for your comment. About conversion between a primitive array and an unsafe array, I think we are on the same page. The motivation of my recent PRs is to reduce the overhead of changing the array format in code generated from a Dataset program that uses a primitive array.

Based on this PR and [another PR](https://github.com/apache/spark/pull/13680), I am preparing another PR to directly transfer a primitive array to an unsafe (dense) array by ```Platform.copy()```. Currently, while an array of any type is abstracted by ```GenericArrayData```, this PR just wraps a primitive array to alleviate the overhead of data format conversion (e.g. between Java primitive and unsafe dense formats).

Of course, it would be good to prepare APIs in ```GenericArrayData``` to convert the array format between a primitive array and an unsafe array.

Thank you for pointing out the proposal for a new unsafe format. Let us discuss it in [another PR](https://github.com/apache/spark/pull/13680).





---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce additonal implementation wi...

2016-06-23 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13680
  
@cloud-fan , I have one question about null fields. Should we put zero into the field corresponding to the position where ```setNullAt()``` is called, as ```UnsafeRow``` [does](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L194)?

If we avoid putting zero there, this affects two properties:
1. [row equality](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L197)
2. an undetermined value may be included in the array returned by ```UnsafeArrayData.toArray()```

In my current implementation, the width of each element depends on the element type (4: Int, 8: Double, etc.). Thus, it is hard to take the same approach as ```UnsafeRow``` [does](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L194), since ```UnsafeRow``` always uses 8 bytes per field. Because we want to make data conversion fast between a primitive array and an unsafe array, we have to keep the type-based element width (e.g. 4: Int, 8: Double).

What do you think? Should we keep the above two properties by clearing the field? A sketch of what that clearing could look like is below.
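
As a purely hypothetical sketch of the field clearing under a type-based element width (the names ```valuesOffset``` and ```elementSize``` are illustrative, not actual ```UnsafeArrayData``` fields):

```scala
import org.apache.spark.unsafe.Platform

// Zero the value slot whose width depends on the element type, so that row equality and
// toArray() results stay deterministic even for null entries. Illustrative only.
def zeroValueSlot(base: AnyRef, valuesOffset: Long, elementSize: Int, ordinal: Int): Unit = {
  val addr = valuesOffset + elementSize.toLong * ordinal
  elementSize match {
    case 1 => Platform.putByte(base, addr, 0.toByte)
    case 2 => Platform.putShort(base, addr, 0.toShort)
    case 4 => Platform.putInt(base, addr, 0)
    case 8 => Platform.putLong(base, addr, 0L)
  }
}
```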


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce additonal implementation wi...

2016-06-23 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13680
  
One potential performance issue is that we always have to clear all of the null bits in ```UnsafeArrayWriter.initialize()```. This is because ```holder.buffer``` is reused for each row. If one row has more than one variable-length array, ```holder.cursor``` may have a different offset for the same array among different rows. Thus, we have to clear all of the null bits. This overhead would be larger in machine learning use cases that use large arrays. (A minimal sketch of this clearing is shown below.)
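
A minimal sketch of that clearing, assuming the null bits are stored as 8-byte words; the helper name is illustrative:

```scala
import java.util.Arrays

// holder.buffer (a byte[]) is reused across rows, so stale null bits from a previous row may
// sit at the new cursor position; initialize() therefore zeroes the whole [null bits] region.
def clearNullBits(buffer: Array[Byte], cursor: Int, numElements: Int): Unit = {
  val nullBitsBytes = ((numElements + 63) / 64) * 8  // assumed 8-byte-word bitset
  Arrays.fill(buffer, cursor, cursor + nullBitsBytes, 0.toByte)
}
```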

If we have a flag that shows whether the ```[null bits]``` area is present, or if we have two implementations, one with ```[null bits]``` and one without, we may alleviate this performance overhead.

What do you think?
cc: @hvanhovell , @mengxr 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13663: [SPARK-15950][SQL] Eliminate unreachable code at ...

2016-06-16 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13663#discussion_r67386476
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -60,18 +60,24 @@ case class CreateArray(children: Seq[Expression]) 
extends Expression {
 ctx.INPUT_ROW,
 children.zipWithIndex.map { case (e, i) =>
   val eval = e.genCode(ctx)
-  eval.code + s"""
-if (${eval.isNull}) {
--- End diff --

Just a memo: I found that ```.map(_.isNull).mkString(" || ")``` is used in some places. This generates a code sequence like ```isNull = false || false;```.
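
A tiny illustration of how that string is built (the values here are stand-ins for ```eval.isNull``` of non-nullable children):

```scala
// Joining the children's isNull expressions produces a constant-foldable but noisy condition.
val childIsNulls = Seq("false", "false")          // eval.isNull of two non-nullable children
val isNullCode   = childIsNulls.mkString(" || ")  // "false || false"
```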


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13663: [SPARK-15950][SQL] Eliminate unreachable code at ...

2016-06-15 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13663#discussion_r67272226
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -60,18 +60,24 @@ case class CreateArray(children: Seq[Expression]) 
extends Expression {
 ctx.INPUT_ROW,
 children.zipWithIndex.map { case (e, i) =>
   val eval = e.genCode(ctx)
-  eval.code + s"""
-if (${eval.isNull}) {
--- End diff --

@cloud-fan , the latest commit only avoids ```zeroOutNullBytes```, in a simple and clear way.
Would it be possible to review this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce additonal implementation wi...

2016-06-24 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13680
  
I see. I assumed that the virtual call would be devirtualized by declaring the method ```final``` and by the JIT compiler optimistically propagating type information. Would it be better to add a flag like ```containNull```?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...

2016-06-24 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13758
  
@hvanhovell , @cloud-fan , could you review this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce additonal implementation wi...

2016-06-23 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13680
  
@cloud-fan , for the first issue, we are on the same page. Your proposal is what I was thinking of as a possible solution. I will do that.

For the second issue, it seems to be a design choice between:
1. introducing one conditional branch in ```isNullAt()``` in a single implementation
2. having two implementations without a conditional branch in ```isNullAt()```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13704: [SPARK-15985][SQL] Reduce runtime overhead of a program ...

2016-06-27 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13704
  
@cloud-fan I think ```value#63``` is ```UnsafeArrayData```.
When I ran a DataFrame program, I got the following trees. Since DataFrame operations access data as UnsafeArrayData, I think that ```LocalRelation``` and ```LocalTableScan``` keep the array as ```UnsafeArrayData```.

```
val df = Seq(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0)).toDF()
val df2 = df.selectExpr("value[0] + value[1] + value[2]")
df2.show
df2.explain(true)

== Analyzed Logical Plan ==
((value[0] + value[1]) + value[2]): double
Project [((value#63[0] + value#63[1]) + value#63[2]) AS ((value[0] + 
value[1]) + value[2])#67]
+- LocalRelation [value#63]

== Optimized Logical Plan ==
LocalRelation [((value[0] + value[1]) + value[2])#67]

== Physical Plan ==
LocalTableScan [((value[0] + value[1]) + value[2])#67]
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13899: [SPARK-16196][SQL] Codegen in-memory scan with ColumnarB...

2016-06-25 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13899
  
@andrewor14 Looks interesting.

I created two PRs that generate code similar to [your code](https://gist.github.com/andrewor14/7ce4c37a3c6bcd5cc2b6b16c861859e9). My PRs use the current ```ByteBuffer``` and support compression for primitive types. Do these PRs help you?
https://github.com/apache/spark/pull/11956
https://github.com/apache/spark/pull/12894

I am waiting for a review.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...

2016-06-25 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13758
  
Jenkins, retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13909: [SPARK-16213][SQL] Reduce runtime overhead of a p...

2016-06-25 Thread kiszk
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/13909

[SPARK-16213][SQL] Reduce runtime overhead of a program that creates a primitive array in DataFrame

## What changes were proposed in this pull request?

This PR reduces the runtime overhead of a program that creates a primitive array in DataFrame. The generated code performs a boxing operation in the assignment from InternalRow to an ```Object[]``` temporary array (at lines 040 and 048 in the generated code before applying this PR). If we know that the type of the array elements is primitive, we apply the following optimizations:

1. Eliminate the pair of ```isNullAt()``` and a null assignment
2. Allocate a primitive array instead of ```Object[]``` (eliminate boxing operations)
3. Call ```GenericArrayData.allocate(project_values)``` to avoid [boxing](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L31) in the constructor of ```GenericArrayData``` once https://github.com/apache/spark/pull/13758 is merged

An example program
```
val df = sparkContext.parallelize(Seq(0.0d, 1.0d), 1).toDF
df.selectExpr("Array(value + 1.1d, value + 2.2d)").show
```


Generated code before applying this PR
```java
/* 018 */   public void init(int index, scala.collection.Iterator inputs[]) 
{
/* 019 */ partitionIndex = index;
/* 020 */ inputadapter_input = inputs[0];
/* 021 */ this.project_values = null;
/* 022 */ project_result = new UnsafeRow(1);
/* 023 */ this.project_holder = new 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result, 
32);
/* 024 */ this.project_rowWriter = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder,
 1);
/* 025 */ this.project_arrayWriter = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
/* 026 */   }
/* 027 */
/* 028 */   protected void processNext() throws java.io.IOException {
/* 029 */ while (inputadapter_input.hasNext()) {
/* 030 */   InternalRow inputadapter_row = (InternalRow) 
inputadapter_input.next();
/* 031 */   double inputadapter_value = inputadapter_row.getDouble(0);
/* 032 */
/* 033 */   final boolean project_isNull = false;
/* 034 */   this.project_values = new Object[2];
/* 035 */   double project_value7 = -1.0;
/* 036 */   project_value7 = inputadapter_value + 1.1D;
/* 037 */   if (false) {
/* 038 */ project_values[0] = null;
/* 039 */   } else {
/* 040 */ project_values[0] = project_value7;
/* 041 */   }
/* 042 */
/* 043 */   double project_value10 = -1.0;
/* 044 */   project_value10 = inputadapter_value + 2.2D;
/* 045 */   if (false) {
/* 046 */ project_values[1] = null;
/* 047 */   } else {
/* 048 */ project_values[1] = project_value10;
/* 049 */   }
/* 050 */
/* 051 */   /* final ArrayData project_value = 
org.apache.spark.sql.catalyst.util.GenericArrayData.allocate(project_values); */
/* 052 */   final ArrayData project_value = new 
org.apache.spark.sql.catalyst.util.GenericArrayData(project_values);
/* 053 */   this.project_values = null;
/* 054 */   project_holder.reset();
/* 055 */
/* 056 */   project_rowWriter.zeroOutNullBytes();
/* 057 */
/* 058 */   if (project_isNull) {
/* 059 */ project_rowWriter.setNullAt(0);
/* 060 */   } else {
/* 061 */ // Remember the current cursor so that we can calculate 
how many bytes are
/* 062 */ // written later.
/* 063 */ final int project_tmpCursor = project_holder.cursor;
/* 064 */
/* 065 */ if (project_value instanceof UnsafeArrayData) {
/* 066 */   final int project_sizeInBytes = ((UnsafeArrayData) 
project_value).getSizeInBytes();
/* 067 */   // grow the global buffer before writing data.
/* 068 */   project_holder.grow(project_sizeInBytes);
/* 069 */   ((UnsafeArrayData) 
project_value).writeToMemory(project_holder.buffer, project_holder.cursor);
/* 070 */   project_holder.cursor += project_sizeInBytes;
/* 071 */
/* 072 */ } else {
/* 073 */   final int project_numElements = 
project_value.numElements();
/* 074 */   project_arrayWriter.initialize(project_holder, 
project_numElements, 8);
/* 075 */
/* 076 */   for (int project_index = 0; project_index < 
project_numElements; project_index++) {
/* 077 */ if (project_value.isNullAt(project_index)) {
/* 078 */   project_arrayWriter.setNullAt(project_index);
/* 079 */ } else {
/* 080 */   final double pr

[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce implementation with a dense...

2016-06-26 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13680
  
@rxin thank you for your comment. As you said, a holistic view is important. This PR is not only for machine learning.

This PR has another use case: improving projection of an array in any program via https://github.com/apache/spark/pull/13911. It can also reduce the memory footprint of the array representation and of ```DataFrame.cache()/Dataset.cache()``` in any program.

What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13911: [SPARK-16215][SQL] Reduce runtime overhead of a p...

2016-06-25 Thread kiszk
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/13911

[SPARK-16215][SQL] Reduce runtime overhead of a program that writes a primitive array in DataFrame/Dataset

## What changes were proposed in this pull request?

This PR optimizes the generated code of projection for a primitive type array. Although we know that a primitive type array does not require null checks and has a contiguous data region, the current generated code performs a null check and a copy for each element (lines 075-082 in the generated code before applying this PR).

1. Eliminate the null check for each array element
2. Perform a bulk data copy by using ```Platform.copy```
3. Eliminate the primitive array allocation in ```GenericArrayData``` once https://github.com/apache/spark/pull/13758 is merged
4. Eliminate setting the sparse index for ```UnsafeArrayData``` once https://github.com/apache/spark/pull/13680 is merged

These are done in the helper method ```UnsafeArrayWrite.writePrimitiveArray()``` (at line 075 in the generated code after applying this PR).

For now, 3 and 4 are not enabled, but the code is ready.


An example program
```
val df = sparkContext.parallelize(Seq(0.0d, 1.0d), 1).toDF
df.selectExpr("Array(value + 1.1d, value + 2.2d)").collect
```

Generated code before applying this PR
```java
/* 028 */   protected void processNext() throws java.io.IOException {
/* 029 */ while (inputadapter_input.hasNext()) {
/* 030 */   InternalRow inputadapter_row = (InternalRow) 
inputadapter_input.next();
/* 031 */   double inputadapter_value = inputadapter_row.getDouble(0);
/* 032 */
/* 033 */   final boolean project_isNull = false;
/* 034 */   this.project_values = new Object[2];
/* 035 */   double project_value1 = -1.0;
/* 036 */   project_value1 = inputadapter_value + 1.1D;
/* 037 */   if (false) {
/* 038 */ project_values[0] = null;
/* 039 */   } else {
/* 040 */ project_values[0] = project_value1;
/* 041 */   }
/* 042 */
/* 043 */   double project_value4 = -1.0;
/* 044 */   project_value4 = inputadapter_value + 2.2D;
/* 045 */   if (false) {
/* 046 */ project_values[1] = null;
/* 047 */   } else {
/* 048 */ project_values[1] = project_value4;
/* 049 */   }
/* 050 */
/* 051 */   final ArrayData project_value = new 
org.apache.spark.sql.catalyst.util.GenericArrayData(project_values);
/* 052 */   this.project_values = null;
/* 053 */   project_holder.reset();
/* 054 */
/* 055 */   project_rowWriter.zeroOutNullBytes();
/* 056 */
/* 057 */   if (project_isNull) {
/* 058 */ project_rowWriter.setNullAt(0);
/* 059 */   } else {
/* 060 */ // Remember the current cursor so that we can calculate 
how many bytes are
/* 061 */ // written later.
/* 062 */ final int project_tmpCursor = project_holder.cursor;
/* 063 */
/* 064 */ if (project_value instanceof UnsafeArrayData) {
/* 065 */   final int project_sizeInBytes = ((UnsafeArrayData) 
project_value).getSizeInBytes();
/* 066 */   // grow the global buffer before writing data.
/* 067 */   project_holder.grow(project_sizeInBytes);
/* 068 */   ((UnsafeArrayData) 
project_value).writeToMemory(project_holder.buffer, project_holder.cursor);
/* 069 */   project_holder.cursor += project_sizeInBytes;
/* 070 */
/* 071 */ } else {
/* 072 */   final int project_numElements = 
project_value.numElements();
/* 073 */   project_arrayWriter.initialize(project_holder, 
project_numElements, 8);
/* 074 */
/* 075 */   for (int project_index = 0; project_index < 
project_numElements; project_index++) {
/* 076 */ if (project_value.isNullAt(project_index)) {
/* 077 */   project_arrayWriter.setNullAt(project_index);
/* 078 */ } else {
/* 079 */   final double project_element = 
project_value.getDouble(project_index);
/* 080 */   project_arrayWriter.write(project_index, 
project_element);
/* 081 */ }
/* 082 */   }
/* 083 */
/* 084 */ }
/* 085 */
/* 086 */ project_rowWriter.setOffsetAndSize(0, project_tmpCursor, 
project_holder.cursor - project_tmpCursor);
/* 087 */ project_rowWriter.alignToWords(project_holder.cursor - 
project_tmpCursor);
/* 088 */   }
/* 089 */   project_result.setTotalSize(project_holder.totalSize());
/* 090 */   append(project_result);
/* 091 */   if (shouldStop()) return;
/* 092 */ }
/* 093 */   }
/* 094 */ }
```

Generated code after applying this PR
```ja

[GitHub] spark issue #13680: [SPARK-15962][SQL] Introduce implementation with a dense...

2016-06-26 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13680
  
@cloud-fan and @hvanhovell thank you for your comments. Based on them, I implemented ```UnsafeArrayData``` as a single implementation that explicitly clears the ```null bits``` with ```Arrays.fill()```.
I would appreciate it if you could review this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13663: [SPARK-15950][SQL] Eliminate unreachable code at project...

2016-06-24 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13663
  
@cloud-fan , @davies , I would appreciate it if you would look at this 
again.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...

2016-06-25 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/13758
  
@hvanhovell and @cloud-fan This ```GenericArrayData``` can be used in generated code for a program with a primitive array written in DataFrame or Dataset. I added a new DataFrame benchmark program and its results. The performance improvements are up to 4.4x.

Here is a use case of this ```GenericArrayData``` in the generated code of the benchmark program.
```java
// source program
sc.parallelize(Seq(Array.fill[Double](1)(1)), 1).toDF.selectExpr("value[0]").count

// part of generated code
double[] value1 = ...
final ArrayData value = isNull ? null : GenericArrayData.allocate(value1);
```

Another use case is PR https://github.com/apache/spark/pull/13704, which can generate the following code. At line 046, this PR can eliminate the data conversion from a primitive array to ```Array[Any]```. At line 051, this PR can also eliminate a data conversion.
```java
/* 044 */   if (!inputadapter_isNull) {
/* 045 */ final double[] deserializetoobject_values = 
inputadapter_value.toDoubleArray();
/* 046 */ deserializetoobject_value1 = new 
GenericArrayData(deserializetoobject_values);
/* 047 */
/* 048 */   }
/* 049 */
/* 050 */   boolean deserializetoobject_isNull = 
deserializetoobject_isNull1;
/* 051 */   final double[] deserializetoobject_value = 
deserializetoobject_isNull ? null : (double[]) 
deserializetoobject_value1.toDoubleArray();
```

If https://github.com/apache/spark/pull/13680 is also merged, we can improve the performance of format conversion between ```UnsafeArrayData``` and ```GenericArrayData``` by using ```Platform.copy```; a rough sketch of such a bulk copy is shown below.
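
A rough sketch of that bulk copy (assuming the usual ```Platform.copyMemory``` signature and array-offset constants; the integration with ```UnsafeArrayData``` itself is not shown):

```scala
import org.apache.spark.unsafe.Platform

// Copy a double[] into a byte[] region in one call instead of per-element writes.
def copyDoubles(src: Array[Double], dst: Array[Byte], dstCursor: Int): Unit = {
  val bytes = src.length * 8L
  Platform.copyMemory(
    src, Platform.DOUBLE_ARRAY_OFFSET,           // source: start of the primitive array data
    dst, Platform.BYTE_ARRAY_OFFSET + dstCursor, // destination: current cursor in the buffer
    bytes)
}
```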


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13909: [SPARK-16213][SQL] Reduce runtime overhead of a p...

2016-06-26 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13909#discussion_r68519480
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameComplexTypeSuite.scala ---
@@ -26,6 +26,20 @@ import org.apache.spark.sql.test.SharedSQLContext
 class DataFrameComplexTypeSuite extends QueryTest with SharedSQLContext {
   import testImplicits._
 
+  test("primitive type on array") {
+val df = sparkContext.parallelize(Seq(1, 2), 1).toDF("v")
--- End diff --

Yes, this is what I want to do. As other test cases do, I use ```sparkContext.parallelize()```, which can explicitly specify the number of partitions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13909: [SPARK-16213][SQL] Reduce runtime overhead of a p...

2016-06-26 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13909#discussion_r68519507
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameComplexTypeSuite.scala ---
@@ -26,6 +26,20 @@ import org.apache.spark.sql.test.SharedSQLContext
 class DataFrameComplexTypeSuite extends QueryTest with SharedSQLContext {
   import testImplicits._
 
+  test("primitive type on array") {
+val df = sparkContext.parallelize(Seq(1, 2), 1).toDF("v")
+val resDF = df.selectExpr("Array(v + 2, v + 3)")
+checkAnswer(resDF,
+  Seq(Row(Array(3, 4)), Row(Array(4, 5))))
+  }
+
+  test("primitive type and null on array") {
+val df = sparkContext.parallelize(Seq(1, 2), 1).toDF("v")
+val resDF = df.selectExpr("Array(v + 2, null, v + 3)")
+checkAnswer(resDF,
+  Seq(Row(Array(3, null, 4)), Row(Array(4, null, 5))))
+  }
+
   test("UDF on struct") {
 val f = udf((a: String) => a)
--- End diff --

I think so. This part was not written by another developer in another PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13909: [SPARK-16213][SQL] Reduce runtime overhead of a p...

2016-06-26 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13909#discussion_r68519430
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -51,27 +51,52 @@ case class CreateArray(children: Seq[Expression]) 
extends Expression {
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
 val arrayClass = classOf[GenericArrayData].getName
 val values = ctx.freshName("values")
-ctx.addMutableState("Object[]", values, s"this.$values = null;")
-
-ev.copy(code = s"""
-  final boolean ${ev.isNull} = false;
-  this.$values = new Object[${children.size}];""" +
-  ctx.splitExpressions(
-ctx.INPUT_ROW,
-children.zipWithIndex.map { case (e, i) =>
-  val eval = e.genCode(ctx)
-  eval.code + s"""
+val dt = dataType match {
+  case a @ ArrayType(et, _) => et
+}
+val isPrimitive = ctx.isPrimitiveType(dt)
+val evals = children.map(e => e.genCode(ctx))
+val allNonNull = evals.find(_.isNull != "false").isEmpty
--- End diff --

Yes, I did.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13909: [SPARK-16213][SQL] Reduce runtime overhead of a p...

2016-06-26 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13909#discussion_r68519450
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -51,27 +51,52 @@ case class CreateArray(children: Seq[Expression]) 
extends Expression {
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
 val arrayClass = classOf[GenericArrayData].getName
 val values = ctx.freshName("values")
-ctx.addMutableState("Object[]", values, s"this.$values = null;")
-
-ev.copy(code = s"""
-  final boolean ${ev.isNull} = false;
-  this.$values = new Object[${children.size}];""" +
-  ctx.splitExpressions(
-ctx.INPUT_ROW,
-children.zipWithIndex.map { case (e, i) =>
-  val eval = e.genCode(ctx)
-  eval.code + s"""
+val dt = dataType match {
--- End diff --

Thank you for the good catch.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13255][SQL] Integrate vectorized parque...

2016-02-10 Thread kiszk
Github user kiszk commented on the pull request:

https://github.com/apache/spark/pull/11146#issuecomment-182347108
  
@nongli Is there some kind of design doc for ColumnarBatch? I am planning to make PRs for columnar storage and its computations with DataFrame/Dataset.
We are curious whether ColumnarBatch and ColumnVector are designed only for Parquet or for general use.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


