[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user ssonker closed the pull request at: https://github.com/apache/spark/pull/21505 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r206741169 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/StringToTimestampBenchmark.scala --- @@ -0,0 +1,53 @@ +/* --- End diff -- Yea, that's what I was thinking. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r206699663 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -111,6 +113,22 @@ object DateTimeUtils { computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone) } + private val threadLocalComputedCalendarsMap = +new ThreadLocal[mutable.Map[TimeZone, Calendar]] { + override def initialValue(): mutable.Map[TimeZone, Calendar] = { +mutable.Map[TimeZone, Calendar]() + } +} + + def getCalendar(timeZone: TimeZone): Calendar = { --- End diff -- Can this be private? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r202548671 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/StringToTimestampBenchmark.scala --- @@ -0,0 +1,53 @@ +/* --- End diff -- Does this really need to go in the code base? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r194848108 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -111,6 +113,22 @@ object DateTimeUtils { computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone) } + private val threadLocalComputedCalendarsMap = +new ThreadLocal[mutable.Map[TimeZone, Calendar]] { + override def initialValue(): mutable.Map[TimeZone, Calendar] = { +mutable.Map[TimeZone, Calendar]() + } +} + + def getCalendar(timeZone: TimeZone): Calendar = { +val c = threadLocalComputedCalendarsMap.get() + .getOrElseUpdate(timeZone, { --- End diff -- nit: `getOrElseUpdate(timeZone, Calendar.getInstance(timeZone))` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r194847824 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/StringToTimestampBenchmark.scala --- @@ -0,0 +1,53 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import java.util.Calendar + +import org.apache.spark.sql.catalyst.util.{DateTimeTestUtils, DateTimeUtils} +import org.apache.spark.util.Benchmark + +object StringToTimestampBenchmark { --- End diff -- This benchmark tests `Calendar` creation, not string to timestamp. how abuot `DateTimeUtilsBenchmark`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r194365155 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -111,6 +113,23 @@ object DateTimeUtils { computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone) } + private val threadLocalComputedCalendarsMap = +new ThreadLocal[mutable.Map[TimeZone, Calendar]] { --- End diff -- +1 for map approach --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r194325153 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -111,6 +113,23 @@ object DateTimeUtils { computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone) } + private val threadLocalComputedCalendarsMap = +new ThreadLocal[mutable.Map[TimeZone, Calendar]] { --- End diff -- I'm fine with the map approach. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user ssonker commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r194319491 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -111,6 +113,23 @@ object DateTimeUtils { computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone) } + private val threadLocalComputedCalendarsMap = +new ThreadLocal[mutable.Map[TimeZone, Calendar]] { --- End diff -- I ran a benchmark and following is the output: ```string to timestamp calendar caching:Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative with map 8 /9 12.9 77.7 1.0X without map 11 / 12 9.4 106.3 0.7X``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r194309823 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -111,6 +113,23 @@ object DateTimeUtils { computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone) } + private val threadLocalComputedCalendarsMap = +new ThreadLocal[mutable.Map[TimeZone, Calendar]] { --- End diff -- How much performance down without a map here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user ssonker commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r194305723 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -111,6 +113,23 @@ object DateTimeUtils { computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone) } + private val threadLocalComputedCalendarsMap = +new ThreadLocal[mutable.Map[TimeZone, Calendar]] { --- End diff -- @kiszk @viirya I've tried running benchmarks for with/without ```mutable.Map``` implementation. Looks like setting timezone in a calendar instance is a costly operation and it drags the performance down. As the number of timezones cannot be large, maintaining a map will not be a huge memory overhead. So, I suggest going with the ```mutable.Map``` approach. Comments? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r194299485 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -111,6 +113,23 @@ object DateTimeUtils { computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone) } + private val threadLocalComputedCalendarsMap = +new ThreadLocal[mutable.Map[TimeZone, Calendar]] { --- End diff -- Yes, it should work functionally if we check a given time zone every time. Do you know the typical access pattern of time zone? If there is temporal locality regarding time zone, we do not have to use `mutale.Map`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r194294961 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -125,7 +125,6 @@ object DateTimeUtils { .getOrElseUpdate(timeZone, { Calendar.getInstance(timeZone) }) -c.clear() --- End diff -- Seems `setTimeInMillis` can result in all fields set. If so, `clear` is redundant. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user ssonker commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r194294182 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -125,7 +125,6 @@ object DateTimeUtils { .getOrElseUpdate(timeZone, { Calendar.getInstance(timeZone) }) -c.clear() --- End diff -- @viirya @kiszk Do you agree with this commit? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user ssonker commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r194292883 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -111,6 +113,23 @@ object DateTimeUtils { computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone) } + private val threadLocalComputedCalendarsMap = +new ThreadLocal[mutable.Map[TimeZone, Calendar]] { --- End diff -- @kiszk I think @viirya meant having just one thread-local calendar instance. That should also work, isn't it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r194268734 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -111,6 +113,23 @@ object DateTimeUtils { computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone) } + private val threadLocalComputedCalendarsMap = +new ThreadLocal[mutable.Map[TimeZone, Calendar]] { --- End diff -- Usually, only the default time zone is used. To execute `Cast` regarding date is called with a timezone may use another timezone. For the correctness, I think that it is necessary to support multiple timezones. To enable caching for default time zone and to create an instance for other time zones would also work correctly. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r193696451 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -111,6 +113,23 @@ object DateTimeUtils { computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone) } + private val threadLocalComputedCalendarsMap = +new ThreadLocal[mutable.Map[TimeZone, Calendar]] { --- End diff -- Do we need to keep Calendar for many timezone? Since `getCalendar` takes a time zone input, we can just keep one Calendar instance, and set it with given timezone in `getCalendar`. WDYT? Regarding performance, is there big difference? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user ssonker commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r193694372 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -111,6 +113,23 @@ object DateTimeUtils { computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone) } + private val threadLocalComputedCalendarsMap = +new ThreadLocal[mutable.Map[TimeZone, Calendar]] { + override def initialValue(): mutable.Map[TimeZone, Calendar] = { +mutable.Map[TimeZone, Calendar]() + } +} + + def getCalendar(timeZone: TimeZone): Calendar = { +val c = threadLocalComputedCalendarsMap.get() + .getOrElseUpdate(timeZone, { +Calendar.getInstance(timeZone) + }) +c.clear() --- End diff -- Nope. It clears all the ```fields``` and zone is not a field. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r193688565 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -111,6 +113,23 @@ object DateTimeUtils { computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone) } + private val threadLocalComputedCalendarsMap = +new ThreadLocal[mutable.Map[TimeZone, Calendar]] { + override def initialValue(): mutable.Map[TimeZone, Calendar] = { +mutable.Map[TimeZone, Calendar]() + } +} + + def getCalendar(timeZone: TimeZone): Calendar = { +val c = threadLocalComputedCalendarsMap.get() + .getOrElseUpdate(timeZone, { +Calendar.getInstance(timeZone) + }) +c.clear() --- End diff -- Doesn't `clear` reset the timezone of that `Calendar` instance? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r193687670 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -114,20 +114,19 @@ object DateTimeUtils { } private val threadLocalComputedCalendarsMap = -new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] { - override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = { -mutable.Map[TimeZone, (Calendar, Long)]() +new ThreadLocal[mutable.Map[TimeZone, Calendar]] { + override def initialValue(): mutable.Map[TimeZone, Calendar] = { +mutable.Map[TimeZone, Calendar]() } } def getCalendar(timeZone: TimeZone): Calendar = { -val (c, timeInMillis) = threadLocalComputedCalendarsMap.get() +val c = threadLocalComputedCalendarsMap.get() .getOrElseUpdate(timeZone, { -val c = Calendar.getInstance(timeZone) -(c, c.getTimeInMillis) +Calendar.getInstance(timeZone) }) c.clear() -c.setTimeInMillis(timeInMillis) +c.setTimeInMillis(System.currentTimeMillis()) --- End diff -- oh, Calendar.getTimeInMillis and setTimeInMillis are also UTC-based. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r193687346 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -114,20 +114,19 @@ object DateTimeUtils { } private val threadLocalComputedCalendarsMap = -new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] { - override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = { -mutable.Map[TimeZone, (Calendar, Long)]() +new ThreadLocal[mutable.Map[TimeZone, Calendar]] { + override def initialValue(): mutable.Map[TimeZone, Calendar] = { +mutable.Map[TimeZone, Calendar]() } } def getCalendar(timeZone: TimeZone): Calendar = { -val (c, timeInMillis) = threadLocalComputedCalendarsMap.get() +val c = threadLocalComputedCalendarsMap.get() .getOrElseUpdate(timeZone, { -val c = Calendar.getInstance(timeZone) -(c, c.getTimeInMillis) +Calendar.getInstance(timeZone) }) c.clear() -c.setTimeInMillis(timeInMillis) +c.setTimeInMillis(System.currentTimeMillis()) --- End diff -- hmm, I think `System.currentTimeMillis()` is UTC-based? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user ssonker commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r193686772 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -111,6 +113,24 @@ object DateTimeUtils { computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone) } + private val threadLocalComputedCalendarsMap = +new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] { + override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = { +mutable.Map[TimeZone, (Calendar, Long)]() + } +} + + def getCalendar(timeZone: TimeZone): Calendar = { +val (c, timeInMillis) = threadLocalComputedCalendarsMap.get() + .getOrElseUpdate(timeZone, { +val c = Calendar.getInstance(timeZone) +(c, c.getTimeInMillis) + }) +c.clear() +c.setTimeInMillis(timeInMillis) --- End diff -- @kiszk @viirya I've updated the code. Please review. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user ssonker commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r193679978 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -111,6 +113,24 @@ object DateTimeUtils { computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone) } + private val threadLocalComputedCalendarsMap = +new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] { + override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = { +mutable.Map[TimeZone, (Calendar, Long)]() + } +} + + def getCalendar(timeZone: TimeZone): Calendar = { +val (c, timeInMillis) = threadLocalComputedCalendarsMap.get() + .getOrElseUpdate(timeZone, { +val c = Calendar.getInstance(timeZone) +(c, c.getTimeInMillis) + }) +c.clear() +c.setTimeInMillis(timeInMillis) --- End diff -- @kiszk Thanks, I'm updating that. BTW, can you please help me understand a scenario where that is absolutely needed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r193678440 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -111,6 +113,24 @@ object DateTimeUtils { computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone) } + private val threadLocalComputedCalendarsMap = +new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] { + override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = { +mutable.Map[TimeZone, (Calendar, Long)]() + } +} + + def getCalendar(timeZone: TimeZone): Calendar = { +val (c, timeInMillis) = threadLocalComputedCalendarsMap.get() + .getOrElseUpdate(timeZone, { +val c = Calendar.getInstance(timeZone) +(c, c.getTimeInMillis) + }) +c.clear() +c.setTimeInMillis(timeInMillis) --- End diff -- I agree with @viirya 's comment. Do we need to set the value of `System.currentTimeMillis()`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user ssonker commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r193676953 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -111,6 +113,24 @@ object DateTimeUtils { computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone) } + private val threadLocalComputedCalendarsMap = +new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] { + override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = { +mutable.Map[TimeZone, (Calendar, Long)]() + } +} + + def getCalendar(timeZone: TimeZone): Calendar = { +val (c, timeInMillis) = threadLocalComputedCalendarsMap.get() + .getOrElseUpdate(timeZone, { +val c = Calendar.getInstance(timeZone) +(c, c.getTimeInMillis) --- End diff -- Yes, correct. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r193676413 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -111,6 +113,24 @@ object DateTimeUtils { computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone) } + private val threadLocalComputedCalendarsMap = +new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] { + override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = { +mutable.Map[TimeZone, (Calendar, Long)]() + } +} + + def getCalendar(timeZone: TimeZone): Calendar = { +val (c, timeInMillis) = threadLocalComputedCalendarsMap.get() + .getOrElseUpdate(timeZone, { +val c = Calendar.getInstance(timeZone) +(c, c.getTimeInMillis) --- End diff -- Isn't `timeInMillis` also stored when you first update this map entity? So next time you access this calendar, you just set it with the old `timeInMillis`. Isn't it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user ssonker commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r193675674 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -111,6 +113,24 @@ object DateTimeUtils { computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone) } + private val threadLocalComputedCalendarsMap = +new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] { + override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = { +mutable.Map[TimeZone, (Calendar, Long)]() + } +} + + def getCalendar(timeZone: TimeZone): Calendar = { +val (c, timeInMillis) = threadLocalComputedCalendarsMap.get() + .getOrElseUpdate(timeZone, { +val c = Calendar.getInstance(timeZone) +(c, c.getTimeInMillis) --- End diff -- @viirya ^ Does that answer you question, or you mean something else? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user ssonker commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r193675439 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -111,6 +113,24 @@ object DateTimeUtils { computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone) } + private val threadLocalComputedCalendarsMap = +new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] { + override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = { +mutable.Map[TimeZone, (Calendar, Long)]() + } +} + + def getCalendar(timeZone: TimeZone): Calendar = { +val (c, timeInMillis) = threadLocalComputedCalendarsMap.get() + .getOrElseUpdate(timeZone, { +val c = Calendar.getInstance(timeZone) +(c, c.getTimeInMillis) --- End diff -- Please refer line 130 for this. Before returning the calendar instance, it is reinitialized to the time it was originally created. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21505#discussion_r193674578 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -111,6 +113,24 @@ object DateTimeUtils { computedTimeZones.computeIfAbsent(timeZoneId, computeTimeZone) } + private val threadLocalComputedCalendarsMap = +new ThreadLocal[mutable.Map[TimeZone, (Calendar, Long)]] { + override def initialValue(): mutable.Map[TimeZone, (Calendar, Long)] = { +mutable.Map[TimeZone, (Calendar, Long)]() + } +} + + def getCalendar(timeZone: TimeZone): Calendar = { +val (c, timeInMillis) = threadLocalComputedCalendarsMap.get() + .getOrElseUpdate(timeZone, { +val c = Calendar.getInstance(timeZone) +(c, c.getTimeInMillis) --- End diff -- When you get the calendar instance next time, isn't its time out of date? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21505: [SPARK-24457][SQL] Improving performance of strin...
GitHub user ssonker opened a pull request: https://github.com/apache/spark/pull/21505 [SPARK-24457][SQL] Improving performance of stringToTimestamp by cach⦠â¦ing Calendar instances for input timezones instead of creating new everytime ## What changes were proposed in this pull request? As of now, stringToTimestamp function in DateTimeUtils creates a calendar instance on each call. This change maintains a thread-local timezone to calendar map, and creates just one calendar for each timezone. Whenever a calendar instance is queried given a timezone, it is looked-up inside the map, reinitialized and returned. ## How was this patch tested? Using existing test cases. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ssonker/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21505.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21505 commit 84d5a911408411f327b620bb958b996e55264781 Author: Sharad Sonker Date: 2018-06-07T04:56:37Z [SPARK-24457][SQL] Improving performance of stringToTimestamp by caching Calendar instances for input timezones instead of creating new everytime --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org