[GitHub] [spark] cloud-fan commented on a change in pull request #26622: [WIP][SPARK-28023][Test] Cheap UTF8String Trim

2019-11-21 Thread GitBox
cloud-fan commented on a change in pull request #26622: 
[WIP][SPARK-28023][Test] Cheap UTF8String Trim
URL: https://github.com/apache/spark/pull/26622#discussion_r349097413
 
 

 ##
 File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CastBenchmark.scala
 ##
 @@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+
+/**
+ * Benchmark trim the string when casting string type to Boolean/Numeric types.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *  bin/spark-submit --class  --jars  

+ *   2. build/sbt "sql/test:runMain "
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt 
"sql/test:runMain "
+ *  Results will be written to "benchmarks/CastBenchmark-results.txt".
+ * }}}
+ */
+object CastBenchmark extends SqlBasedBenchmark {
 
 Review comment:
   This is too specific. I think it's good enough to put it and its result in 
PR description. We don't have to commit it into the code base.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #26622: [WIP][SPARK-28023][Test] Cheap UTF8String Trim

2019-11-21 Thread GitBox
cloud-fan commented on a change in pull request #26622: 
[WIP][SPARK-28023][Test] Cheap UTF8String Trim
URL: https://github.com/apache/spark/pull/26622#discussion_r349015970
 
 

 ##
 File path: sql/core/src/test/resources/sql-tests/inputs/comparator.sql
 ##
 @@ -1,3 +1,12 @@
 -- binary type
 select x'00' < x'0f';
 select x'00' < x'ff';
+
+-- trim string to numeric
+select '1 ' = 1Y;
+select '1 ' = 1S;
+select '1 ' = 1;
 
 Review comment:
   For this simple case, how much slower we are compared to the master branch?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #26622: [WIP][SPARK-28023][Test] Cheap UTF8String Trim

2019-11-21 Thread GitBox
cloud-fan commented on a change in pull request #26622: 
[WIP][SPARK-28023][Test] Cheap UTF8String Trim
URL: https://github.com/apache/spark/pull/26622#discussion_r349015186
 
 

 ##
 File path: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
 ##
 @@ -1077,13 +1077,21 @@ public UTF8String translate(Map 
dict) {
* @return true if the parsing was successful else false
*/
   public boolean toLong(LongWrapper toLongResult) {
+int numBytes = this.numBytes;
+int offset = 0;
+while (offset < numBytes && getByte(offset) == 0x20) {
+  offset++;
+}
+while (numBytes - 1 > offset && getByte(numBytes - 1) == 0x20) {
+  numBytes--;
 
 Review comment:
   this is not numBytes, but the endIndex. numBytes should be endIndex - 
startIndex.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #26622: [WIP][SPARK-28023][Test] Cheap UTF8String Trim

2019-11-21 Thread GitBox
cloud-fan commented on a change in pull request #26622: 
[WIP][SPARK-28023][Test] Cheap UTF8String Trim
URL: https://github.com/apache/spark/pull/26622#discussion_r349013868
 
 

 ##
 File path: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
 ##
 @@ -1077,13 +1077,21 @@ public UTF8String translate(Map 
dict) {
* @return true if the parsing was successful else false
*/
   public boolean toLong(LongWrapper toLongResult) {
+int numBytes = this.numBytes;
+int offset = 0;
+while (offset < numBytes && getByte(offset) == 0x20) {
 
 Review comment:
   does `getByte(offset) == ' '` work? using ascii code is not readable.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #26622: [WIP][SPARK-28023][Test] Cheap UTF8String Trim

2019-11-21 Thread GitBox
cloud-fan commented on a change in pull request #26622: 
[WIP][SPARK-28023][Test] Cheap UTF8String Trim
URL: https://github.com/apache/spark/pull/26622#discussion_r348961154
 
 

 ##
 File path: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
 ##
 @@ -1077,6 +1090,7 @@ public UTF8String translate(Map 
dict) {
* @return true if the parsing was successful else false
*/
   public boolean toLong(LongWrapper toLongResult) {
 
 Review comment:
   let's not make this method has side effect. We should only channe the local 
variable `offset` inside this method.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #26622: [WIP][SPARK-28023][Test] Cheap UTF8String Trim

2019-11-21 Thread GitBox
cloud-fan commented on a change in pull request #26622: 
[WIP][SPARK-28023][Test] Cheap UTF8String Trim
URL: https://github.com/apache/spark/pull/26622#discussion_r348961154
 
 

 ##
 File path: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
 ##
 @@ -1077,6 +1090,7 @@ public UTF8String translate(Map 
dict) {
* @return true if the parsing was successful else false
*/
   public boolean toLong(LongWrapper toLongResult) {
 
 Review comment:
   let's not make this method has side effect. We should only change the local 
variable `offset` inside this method.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #26622: [WIP][SPARK-28023][Test] Cheap UTF8String Trim

2019-11-21 Thread GitBox
cloud-fan commented on a change in pull request #26622: 
[WIP][SPARK-28023][Test] Cheap UTF8String Trim
URL: https://github.com/apache/spark/pull/26622#discussion_r348944332
 
 

 ##
 File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CastBenchmark.scala
 ##
 @@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+
+/**
+ * Benchmark trim the string when casting string type to Boolean/Numeric types.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *  bin/spark-submit --class  --jars  

+ *   2. build/sbt "sql/test:runMain "
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt 
"sql/test:runMain "
+ *  Results will be written to "benchmarks/CastBenchmark-results.txt".
+ * }}}
+ */
+object CastBenchmark extends SqlBasedBenchmark {
+
+  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
+
+val title = "Cast String to Numeric"
+runBenchmark(title) {
+  withTempPath { dir =>
+val N = 500L << 14
+val df = spark.range(N)
+val types = Seq("int", "long")
+df.selectExpr(s"concat('${" " * 5}', id, '${" " * 5}') as str")
 
 Review comment:
   can we only benchmark spaces at right? The parsing logic will return 
immediately if the first char is a space, so not very useful to benchmark it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #26622: [WIP][SPARK-28023][Test] Cheap UTF8String Trim

2019-11-20 Thread GitBox
cloud-fan commented on a change in pull request #26622: 
[WIP][SPARK-28023][Test] Cheap UTF8String Trim
URL: https://github.com/apache/spark/pull/26622#discussion_r348922146
 
 

 ##
 File path: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
 ##
 @@ -553,6 +553,29 @@ public UTF8String trim() {
 return copyUTF8String(s, e);
   }
 
+  /**
+   * Trims space characters (ASCII 32) from both ends of this string.
+   *
+   * @return this string with no spaces at the start or end
+   */
+  public UTF8String nonCopyTrim() {
 
 Review comment:
   another way is to embed the trim logic into `toInt`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #26622: [WIP][SPARK-28023][Test] Cheap UTF8String Trim

2019-11-20 Thread GitBox
cloud-fan commented on a change in pull request #26622: 
[WIP][SPARK-28023][Test] Cheap UTF8String Trim
URL: https://github.com/apache/spark/pull/26622#discussion_r348921668
 
 

 ##
 File path: sql/core/benchmarks/CastBenchmark-results.txt
 ##
 @@ -0,0 +1,12 @@
+
+Benchmark trim the string
+
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
+Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz
+Benchmark trim the string:Best Time(ms)   Avg Time(ms)   
Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
+
+cast(str as int) as c_int  2208   4341
1848  1.9 539.0   1.0X
 
 Review comment:
   what's the result of master branch?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org