[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-08-09 Thread ssonker
Github user ssonker closed the pull request at:

https://github.com/apache/spark/pull/21505


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-07-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r206741169
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/StringToTimestampBenchmark.scala
 ---
@@ -0,0 +1,53 @@
+/*
--- End diff --

Yea, that's what I was thinking.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-07-31 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r206699663
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -111,6 +113,22 @@ object DateTimeUtils {
 computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
   }
 
+  private val threadLocalComputedCalendarsMap =
+new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
+  override def initialValue(): mutable.Map[TimeZone, Calendar] = {
+mutable.Map[TimeZone, Calendar]()
+  }
+}
+
+  def getCalendar(timeZone: TimeZone): Calendar = {
--- End diff --

Can this be private?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-07-15 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r202548671
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/StringToTimestampBenchmark.scala
 ---
@@ -0,0 +1,53 @@
+/*
--- End diff --

Does this really need to go in the code base?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-12 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r194848108
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -111,6 +113,22 @@ object DateTimeUtils {
 computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
   }
 
+  private val threadLocalComputedCalendarsMap =
+new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
+  override def initialValue(): mutable.Map[TimeZone, Calendar] = {
+mutable.Map[TimeZone, Calendar]()
+  }
+}
+
+  def getCalendar(timeZone: TimeZone): Calendar = {
+val c = threadLocalComputedCalendarsMap.get()
+  .getOrElseUpdate(timeZone, {
--- End diff --

nit: `getOrElseUpdate(timeZone, Calendar.getInstance(timeZone))`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-12 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r194847824
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/StringToTimestampBenchmark.scala
 ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.util.Calendar
+
+import org.apache.spark.sql.catalyst.util.{DateTimeTestUtils, 
DateTimeUtils}
+import org.apache.spark.util.Benchmark
+
+object StringToTimestampBenchmark {
--- End diff --

This benchmark tests `Calendar` creation, not string to timestamp. how 
abuot `DateTimeUtilsBenchmark`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-11 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r194365155
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
 computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
   }
 
+  private val threadLocalComputedCalendarsMap =
+new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
--- End diff --

+1 for map approach


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-11 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r194325153
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
 computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
   }
 
+  private val threadLocalComputedCalendarsMap =
+new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
--- End diff --

I'm fine with the map approach.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-11 Thread ssonker
Github user ssonker commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r194319491
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
 computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
   }
 
+  private val threadLocalComputedCalendarsMap =
+new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
--- End diff --

I ran a benchmark and following is the output:
```string to timestamp calendar caching:Best/Avg Time(ms)Rate(M/s)  
 Per Row(ns)   Relative


with map 8 /9 12.9  
77.7   1.0X
without map 11 /   12  9.4  
   106.3   0.7X```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-11 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r194309823
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
 computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
   }
 
+  private val threadLocalComputedCalendarsMap =
+new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
--- End diff --

How much performance down without a map here?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-11 Thread ssonker
Github user ssonker commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r194305723
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
 computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
   }
 
+  private val threadLocalComputedCalendarsMap =
+new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
--- End diff --

@kiszk @viirya I've tried running benchmarks for with/without 
```mutable.Map``` implementation. Looks like setting timezone in a calendar 
instance is a costly operation and it drags the performance down. As the number 
of timezones cannot be large, maintaining a map will not be a huge memory 
overhead. So, I suggest going with the ```mutable.Map``` approach. Comments?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-11 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r194299485
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
 computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
   }
 
+  private val threadLocalComputedCalendarsMap =
+new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
--- End diff --

Yes, it should work functionally if we check a given time zone every time. 
Do you know the typical access pattern of time zone? If there is temporal 
locality regarding time zone, we do not have to use `mutale.Map`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-10 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r194294961
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -125,7 +125,6 @@ object DateTimeUtils {
   .getOrElseUpdate(timeZone, {
 Calendar.getInstance(timeZone)
   })
-c.clear()
--- End diff --

Seems `setTimeInMillis` can result in all fields set. If so, `clear` is 
redundant.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-10 Thread ssonker
Github user ssonker commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r194294182
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -125,7 +125,6 @@ object DateTimeUtils {
   .getOrElseUpdate(timeZone, {
 Calendar.getInstance(timeZone)
   })
-c.clear()
--- End diff --

@viirya @kiszk Do you agree with this commit?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-10 Thread ssonker
Github user ssonker commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r194292883
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
 computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
   }
 
+  private val threadLocalComputedCalendarsMap =
+new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
--- End diff --

@kiszk I think @viirya meant having just one thread-local calendar 
instance. That should also work, isn't it?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-10 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r194268734
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
 computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
   }
 
+  private val threadLocalComputedCalendarsMap =
+new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
--- End diff --

Usually, only the default time zone is used. To execute `Cast` regarding 
date is called with a timezone may use another timezone. For the correctness, I 
think that it is necessary to support multiple timezones.

To enable caching for default time zone and to create an instance for other 
time zones would also work correctly.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r193696451
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
 computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
   }
 
+  private val threadLocalComputedCalendarsMap =
+new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
--- End diff --

Do we need to keep Calendar for many timezone? Since `getCalendar` takes a 
time zone input, we can just keep one Calendar instance, and set it with given 
timezone in `getCalendar`. WDYT? Regarding performance, is there big difference?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-07 Thread ssonker
Github user ssonker commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r193694372
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
 computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
   }
 
+  private val threadLocalComputedCalendarsMap =
+new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
+  override def initialValue(): mutable.Map[TimeZone, Calendar] = {
+mutable.Map[TimeZone, Calendar]()
+  }
+}
+
+  def getCalendar(timeZone: TimeZone): Calendar = {
+val c = threadLocalComputedCalendarsMap.get()
+  .getOrElseUpdate(timeZone, {
+Calendar.getInstance(timeZone)
+  })
+c.clear()
--- End diff --

Nope. It clears all the ```fields``` and zone is not a field.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r193688565
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -111,6 +113,23 @@ object DateTimeUtils {
 computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
   }
 
+  private val threadLocalComputedCalendarsMap =
+new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
+  override def initialValue(): mutable.Map[TimeZone, Calendar] = {
+mutable.Map[TimeZone, Calendar]()
+  }
+}
+
+  def getCalendar(timeZone: TimeZone): Calendar = {
+val c = threadLocalComputedCalendarsMap.get()
+  .getOrElseUpdate(timeZone, {
+Calendar.getInstance(timeZone)
+  })
+c.clear()
--- End diff --

Doesn't `clear` reset the timezone of that `Calendar` instance?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r193687670
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -114,20 +114,19 @@ object DateTimeUtils {
   }
 
   private val threadLocalComputedCalendarsMap =
-new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
-  override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] 
= {
-mutable.Map[TimeZone, (Calendar, Long)]()
+new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
+  override def initialValue(): mutable.Map[TimeZone, Calendar] = {
+mutable.Map[TimeZone, Calendar]()
   }
 }
 
   def getCalendar(timeZone: TimeZone): Calendar = {
-val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
+val c = threadLocalComputedCalendarsMap.get()
   .getOrElseUpdate(timeZone, {
-val c = Calendar.getInstance(timeZone)
-(c, c.getTimeInMillis)
+Calendar.getInstance(timeZone)
   })
 c.clear()
-c.setTimeInMillis(timeInMillis)
+c.setTimeInMillis(System.currentTimeMillis())
--- End diff --

oh, Calendar.getTimeInMillis and setTimeInMillis are also UTC-based.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r193687346
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -114,20 +114,19 @@ object DateTimeUtils {
   }
 
   private val threadLocalComputedCalendarsMap =
-new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
-  override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] 
= {
-mutable.Map[TimeZone, (Calendar, Long)]()
+new ThreadLocal[mutable.Map[TimeZone, Calendar]] {
+  override def initialValue(): mutable.Map[TimeZone, Calendar] = {
+mutable.Map[TimeZone, Calendar]()
   }
 }
 
   def getCalendar(timeZone: TimeZone): Calendar = {
-val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
+val c = threadLocalComputedCalendarsMap.get()
   .getOrElseUpdate(timeZone, {
-val c = Calendar.getInstance(timeZone)
-(c, c.getTimeInMillis)
+Calendar.getInstance(timeZone)
   })
 c.clear()
-c.setTimeInMillis(timeInMillis)
+c.setTimeInMillis(System.currentTimeMillis())
--- End diff --

hmm, I think `System.currentTimeMillis()`  is UTC-based?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-07 Thread ssonker
Github user ssonker commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r193686772
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -111,6 +113,24 @@ object DateTimeUtils {
 computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
   }
 
+  private val threadLocalComputedCalendarsMap =
+new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
+  override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] 
= {
+mutable.Map[TimeZone, (Calendar, Long)]()
+  }
+}
+
+  def getCalendar(timeZone: TimeZone): Calendar = {
+val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
+  .getOrElseUpdate(timeZone, {
+val c = Calendar.getInstance(timeZone)
+(c, c.getTimeInMillis)
+  })
+c.clear()
+c.setTimeInMillis(timeInMillis)
--- End diff --

@kiszk @viirya I've updated the code. Please review.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-07 Thread ssonker
Github user ssonker commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r193679978
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -111,6 +113,24 @@ object DateTimeUtils {
 computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
   }
 
+  private val threadLocalComputedCalendarsMap =
+new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
+  override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] 
= {
+mutable.Map[TimeZone, (Calendar, Long)]()
+  }
+}
+
+  def getCalendar(timeZone: TimeZone): Calendar = {
+val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
+  .getOrElseUpdate(timeZone, {
+val c = Calendar.getInstance(timeZone)
+(c, c.getTimeInMillis)
+  })
+c.clear()
+c.setTimeInMillis(timeInMillis)
--- End diff --

@kiszk Thanks, I'm updating that. BTW, can you please help me understand a 
scenario where that is absolutely needed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-07 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r193678440
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -111,6 +113,24 @@ object DateTimeUtils {
 computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
   }
 
+  private val threadLocalComputedCalendarsMap =
+new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
+  override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] 
= {
+mutable.Map[TimeZone, (Calendar, Long)]()
+  }
+}
+
+  def getCalendar(timeZone: TimeZone): Calendar = {
+val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
+  .getOrElseUpdate(timeZone, {
+val c = Calendar.getInstance(timeZone)
+(c, c.getTimeInMillis)
+  })
+c.clear()
+c.setTimeInMillis(timeInMillis)
--- End diff --

I agree with @viirya 's comment. Do we need to set the value of 
`System.currentTimeMillis()`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-07 Thread ssonker
Github user ssonker commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r193676953
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -111,6 +113,24 @@ object DateTimeUtils {
 computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
   }
 
+  private val threadLocalComputedCalendarsMap =
+new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
+  override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] 
= {
+mutable.Map[TimeZone, (Calendar, Long)]()
+  }
+}
+
+  def getCalendar(timeZone: TimeZone): Calendar = {
+val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
+  .getOrElseUpdate(timeZone, {
+val c = Calendar.getInstance(timeZone)
+(c, c.getTimeInMillis)
--- End diff --

Yes, correct.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r193676413
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -111,6 +113,24 @@ object DateTimeUtils {
 computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
   }
 
+  private val threadLocalComputedCalendarsMap =
+new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
+  override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] 
= {
+mutable.Map[TimeZone, (Calendar, Long)]()
+  }
+}
+
+  def getCalendar(timeZone: TimeZone): Calendar = {
+val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
+  .getOrElseUpdate(timeZone, {
+val c = Calendar.getInstance(timeZone)
+(c, c.getTimeInMillis)
--- End diff --

Isn't `timeInMillis` also stored when you first update this map entity? So 
next time you access this calendar, you just set it with the old 
`timeInMillis`. Isn't it?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-07 Thread ssonker
Github user ssonker commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r193675674
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -111,6 +113,24 @@ object DateTimeUtils {
 computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
   }
 
+  private val threadLocalComputedCalendarsMap =
+new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
+  override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] 
= {
+mutable.Map[TimeZone, (Calendar, Long)]()
+  }
+}
+
+  def getCalendar(timeZone: TimeZone): Calendar = {
+val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
+  .getOrElseUpdate(timeZone, {
+val c = Calendar.getInstance(timeZone)
+(c, c.getTimeInMillis)
--- End diff --

@viirya ^ Does that answer you question, or you mean something else?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-07 Thread ssonker
Github user ssonker commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r193675439
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -111,6 +113,24 @@ object DateTimeUtils {
 computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
   }
 
+  private val threadLocalComputedCalendarsMap =
+new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
+  override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] 
= {
+mutable.Map[TimeZone, (Calendar, Long)]()
+  }
+}
+
+  def getCalendar(timeZone: TimeZone): Calendar = {
+val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
+  .getOrElseUpdate(timeZone, {
+val c = Calendar.getInstance(timeZone)
+(c, c.getTimeInMillis)
--- End diff --

Please refer line 130 for this. Before returning the calendar instance, it 
is reinitialized to the time it was originally created.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21505#discussion_r193674578
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -111,6 +113,24 @@ object DateTimeUtils {
 computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone)
   }
 
+  private val threadLocalComputedCalendarsMap =
+new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] {
+  override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] 
= {
+mutable.Map[TimeZone, (Calendar, Long)]()
+  }
+}
+
+  def getCalendar(timeZone: TimeZone): Calendar = {
+val (c, timeInMillis) = threadLocalComputedCalendarsMap.get()
+  .getOrElseUpdate(timeZone, {
+val c = Calendar.getInstance(timeZone)
+(c, c.getTimeInMillis)
--- End diff --

When you get the calendar instance next time, isn't its time out of date?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...

2018-06-07 Thread ssonker
GitHub user ssonker opened a pull request:

https://github.com/apache/spark/pull/21505

[SPARK-24457][SQL] Improving performance of stringToTimestamp by cach…

…ing Calendar instances for input timezones instead of creating new 
everytime

## What changes were proposed in this pull request?

As of now, stringToTimestamp function in DateTimeUtils creates a calendar 
instance on each call. This change maintains a thread-local timezone to 
calendar map, and creates just one calendar for each timezone. Whenever a 
calendar instance is queried given a timezone, it is looked-up inside the map, 
reinitialized and returned.

## How was this patch tested?

Using existing test cases.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ssonker/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21505.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21505


commit 84d5a911408411f327b620bb958b996e55264781
Author: Sharad Sonker 
Date:   2018-06-07T04:56:37Z

[SPARK-24457][SQL] Improving performance of stringToTimestamp by caching 
Calendar instances for input timezones instead of creating new everytime




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org