[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-02-20 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r102132903
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/ResolvedDataSourceSuite.scala
 ---
@@ -19,11 +19,15 @@ package org.apache.spark.sql.sources
 
 import org.apache.spark.SparkFunSuite
 import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
 import org.apache.spark.sql.execution.datasources.DataSource
 
 class ResolvedDataSourceSuite extends SparkFunSuite {
   private def getProvidingClass(name: String): Class[_] =
-DataSource(sparkSession = null, className = name).providingClass
+DataSource(
+  sparkSession = null,
+  className = name,
+  options = Map("timeZone" -> 
DateTimeUtils.defaultTimeZone().getID)).providingClass
--- End diff --

Unfortunately, we can't use the default session timezone because 
`sparkSession` is null here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-02-20 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r102132850
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -859,6 +859,48 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils {
 }
   }
 
+  test("Write timestamps correctly with timestampFormat option and 
timeZone option") {
+withTempDir { dir =>
+  // With dateFormat option and timeZone option.
+  val timestampsWithFormatPath = 
s"${dir.getCanonicalPath}/timestampsWithFormat.csv"
+  val timestampsWithFormat = spark.read
+.format("csv")
+.option("header", "true")
+.option("inferSchema", "true")
+.option("timestampFormat", "dd/MM/yyyy HH:mm")
+.load(testFile(datesFile))
+  timestampsWithFormat.write
+.format("csv")
+.option("header", "true")
+.option("timestampFormat", "yyyy/MM/dd HH:mm")
+.option("timeZone", "GMT")
+.save(timestampsWithFormatPath)
+
+  // This will load back the timestamps as string.
+  val stringTimestampsWithFormat = spark.read
+.format("csv")
+.option("header", "true")
+.option("inferSchema", "false")
--- End diff --

I see, I'll specify the schema in the next PR.





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-02-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16750





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-02-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r101386663
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/ResolvedDataSourceSuite.scala
 ---
@@ -19,11 +19,15 @@ package org.apache.spark.sql.sources
 
 import org.apache.spark.SparkFunSuite
 import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
 import org.apache.spark.sql.execution.datasources.DataSource
 
 class ResolvedDataSourceSuite extends SparkFunSuite {
   private def getProvidingClass(name: String): Class[_] =
-DataSource(sparkSession = null, className = name).providingClass
+DataSource(
+  sparkSession = null,
+  className = name,
+  options = Map("timeZone" -> 
DateTimeUtils.defaultTimeZone().getID)).providingClass
--- End diff --

Why this change? I thought we would have a default session timezone?





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-02-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r101365748
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -859,6 +859,48 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils {
 }
   }
 
+  test("Write timestamps correctly with timestampFormat option and 
timeZone option") {
+withTempDir { dir =>
+  // With dateFormat option and timeZone option.
+  val timestampsWithFormatPath = 
s"${dir.getCanonicalPath}/timestampsWithFormat.csv"
+  val timestampsWithFormat = spark.read
+.format("csv")
+.option("header", "true")
+.option("inferSchema", "true")
+.option("timestampFormat", "dd/MM/yyyy HH:mm")
+.load(testFile(datesFile))
+  timestampsWithFormat.write
+.format("csv")
+.option("header", "true")
+.option("timestampFormat", "yyyy/MM/dd HH:mm")
+.option("timeZone", "GMT")
+.save(timestampsWithFormatPath)
+
+  // This will load back the timestamps as string.
+  val stringTimestampsWithFormat = spark.read
+.format("csv")
+.option("header", "true")
+.option("inferSchema", "false")
--- End diff --

Actually, we should make it more explicit by specifying a schema, like 
https://github.com/apache/spark/pull/16750/files#diff-fde14032b0e6ef8086461edf79a27c5dR1771





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-02-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r101357512
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -859,6 +859,48 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils {
 }
   }
 
+  test("Write timestamps correctly with timestampFormat option and 
timeZone option") {
+withTempDir { dir =>
+  // With dateFormat option and timeZone option.
+  val timestampsWithFormatPath = 
s"${dir.getCanonicalPath}/timestampsWithFormat.csv"
+  val timestampsWithFormat = spark.read
+.format("csv")
+.option("header", "true")
+.option("inferSchema", "true")
+.option("timestampFormat", "dd/MM/yyyy HH:mm")
+.load(testFile(datesFile))
+  timestampsWithFormat.write
+.format("csv")
+.option("header", "true")
+.option("timestampFormat", "yyyy/MM/dd HH:mm")
+.option("timeZone", "GMT")
+.save(timestampsWithFormatPath)
+
+  // This will load back the timestamps as string.
+  val stringTimestampsWithFormat = spark.read
+.format("csv")
+.option("header", "true")
+.option("inferSchema", "false")
--- End diff --

It would be good to add a comment explaining this.





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-02-12 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r100729875
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -859,6 +859,48 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils {
 }
   }
 
+  test("Write timestamps correctly with timestampFormat option and 
timeZone option") {
+withTempDir { dir =>
+  // With dateFormat option and timeZone option.
+  val timestampsWithFormatPath = 
s"${dir.getCanonicalPath}/timestampsWithFormat.csv"
+  val timestampsWithFormat = spark.read
+.format("csv")
+.option("header", "true")
+.option("inferSchema", "true")
+.option("timestampFormat", "dd/MM/yyyy HH:mm")
+.load(testFile(datesFile))
+  timestampsWithFormat.write
+.format("csv")
+.option("header", "true")
+.option("timestampFormat", "yyyy/MM/dd HH:mm")
+.option("timeZone", "GMT")
+.save(timestampsWithFormatPath)
+
+  // This will load back the timestamps as string.
+  val stringTimestampsWithFormat = spark.read
+.format("csv")
+.option("header", "true")
+.option("inferSchema", "false")
--- End diff --

The schema will be `StringType` for all columns. 
([CSVInferSchema.scala#L68](https://github.com/ueshin/apache-spark/blob/ffc4912e17cc900fc9d7ceefd0f66461109728e9/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L68))





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-02-12 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r100729866
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 ---
@@ -58,13 +59,15 @@ private[sql] class JSONOptions(
   private val parseMode = parameters.getOrElse("mode", "PERMISSIVE")
   val columnNameOfCorruptRecord = 
parameters.get("columnNameOfCorruptRecord")
 
+  val timeZone: TimeZone = 
TimeZone.getTimeZone(parameters.getOrElse("timeZone", defaultTimeZoneId))
+
  // Uses `FastDateFormat`, which is a thread-safe drop-in replacement for `SimpleDateFormat`.
  val dateFormat: FastDateFormat =
FastDateFormat.getInstance(parameters.getOrElse("dateFormat", "yyyy-MM-dd"), Locale.US)
--- End diff --

That is a combination of the `dateFormat` and 
`DateTimeUtils.millisToDays()` (see 
[JacksonParser.scala#L251](https://github.com/ueshin/apache-spark/blob/ffc4912e17cc900fc9d7ceefd0f66461109728e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L251)
 or 
[UnivocityParser.scala#L137](https://github.com/ueshin/apache-spark/blob/ffc4912e17cc900fc9d7ceefd0f66461109728e9/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L137)).

If the timezone used by the `dateFormat` and the one used by 
`DateTimeUtils.millisToDays()` are the same, the days will be calculated 
correctly. Here the `dateFormat` parses with the default timezone, and 
`DateTimeUtils.millisToDays()` also uses the default timezone to calculate 
the days.
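
The interplay described above can be sketched with plain JDK APIs. This is a simplified stand-in for Spark's `DateTimeUtils.millisToDays` (the names `toDays` and `DateDaysDemo` are illustrative, not from the PR): a date string is first parsed to epoch millis with some timezone, then converted to days since epoch using a timezone offset. The round trip lands on the intended day only when both steps agree on the timezone.

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

object DateDaysDemo {
  val MillisPerDay: Long = 24L * 60 * 60 * 1000

  // Simplified analogue of DateTimeUtils.millisToDays: shift epoch millis by
  // the timezone's offset, then floor-divide into whole days since 1970-01-01.
  def toDays(millis: Long, tz: TimeZone): Int =
    Math.floorDiv(millis + tz.getOffset(millis), MillisPerDay).toInt

  def main(args: Array[String]): Unit = {
    val tz = TimeZone.getTimeZone("PST")
    // Step 1: parse "yyyy-MM-dd" using the same timezone...
    val fmt = new SimpleDateFormat("yyyy-MM-dd")
    fmt.setTimeZone(tz)
    val millis = fmt.parse("2016-01-01").getTime
    // Step 2: ...and convert millis to days with that same timezone.
    // 2016-01-01 is 16800 days after 1970-01-01.
    println(toDays(millis, tz)) // 16800
  }
}
```

If step 2 used a different timezone than step 1, the day count could be off by one near midnight, which is exactly why both sides must share the timezone.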





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-02-10 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r100624733
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -859,6 +859,48 @@ class CSVSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils {
 }
   }
 
+  test("Write timestamps correctly with timestampFormat option and 
timeZone option") {
+withTempDir { dir =>
+  // With dateFormat option and timeZone option.
+  val timestampsWithFormatPath = 
s"${dir.getCanonicalPath}/timestampsWithFormat.csv"
+  val timestampsWithFormat = spark.read
+.format("csv")
+.option("header", "true")
+.option("inferSchema", "true")
+.option("timestampFormat", "dd/MM/yyyy HH:mm")
+.load(testFile(datesFile))
+  timestampsWithFormat.write
+.format("csv")
+.option("header", "true")
+.option("timestampFormat", "yyyy/MM/dd HH:mm")
+.option("timeZone", "GMT")
+.save(timestampsWithFormatPath)
+
+  // This will load back the timestamps as string.
+  val stringTimestampsWithFormat = spark.read
+.format("csv")
+.option("header", "true")
+.option("inferSchema", "false")
--- End diff --

You turn off schema inference and don't supply a schema; what will the schema 
be then?





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-02-10 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r100623330
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 ---
@@ -58,13 +59,15 @@ private[sql] class JSONOptions(
   private val parseMode = parameters.getOrElse("mode", "PERMISSIVE")
   val columnNameOfCorruptRecord = 
parameters.get("columnNameOfCorruptRecord")
 
+  val timeZone: TimeZone = 
TimeZone.getTimeZone(parameters.getOrElse("timeZone", defaultTimeZoneId))
+
  // Uses `FastDateFormat`, which is a thread-safe drop-in replacement for `SimpleDateFormat`.
  val dateFormat: FastDateFormat =
FastDateFormat.getInstance(parameters.getOrElse("dateFormat", "yyyy-MM-dd"), Locale.US)
--- End diff --

Why don't we need a timezone here?





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-02-08 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r100225911
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
 ---
@@ -357,30 +361,70 @@ class JsonExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 val jsonData = """{"a" 1}"""
 val schema = StructType(StructField("a", IntegerType) :: Nil)
 checkEvaluation(
-  JsonToStruct(schema, Map.empty, Literal(jsonData)),
+  JsonToStruct(schema, Map.empty, Literal(jsonData), gmtId),
   null
 )
 
 // Other modes should still return `null`.
 checkEvaluation(
-  JsonToStruct(schema, Map("mode" -> ParseModes.PERMISSIVE_MODE), 
Literal(jsonData)),
+  JsonToStruct(schema, Map("mode" -> ParseModes.PERMISSIVE_MODE), 
Literal(jsonData), gmtId),
   null
 )
   }
 
   test("from_json null input column") {
 val schema = StructType(StructField("a", IntegerType) :: Nil)
 checkEvaluation(
-  JsonToStruct(schema, Map.empty, Literal.create(null, StringType)),
+  JsonToStruct(schema, Map.empty, Literal.create(null, StringType), 
gmtId),
   null
 )
   }
 
+  test("from_json with timestamp") {
+val schema = StructType(StructField("t", TimestampType) :: Nil)
+
+val jsonData1 = """{"t": "2016-01-01T00:00:00.123Z"}"""
+var c = Calendar.getInstance(DateTimeUtils.TimeZoneGMT)
+c.set(2016, 0, 1, 0, 0, 0)
+c.set(Calendar.MILLISECOND, 123)
+checkEvaluation(
+  JsonToStruct(schema, Map.empty, Literal(jsonData1), gmtId),
+  InternalRow.fromSeq(c.getTimeInMillis * 1000L :: Nil)
+)
+checkEvaluation(
+  JsonToStruct(schema, Map.empty, Literal(jsonData1), Option("PST")),
+  InternalRow.fromSeq(c.getTimeInMillis * 1000L :: Nil)
--- End diff --

FYI, it's because the input JSON string includes the timezone designator 
`"Z"`, which means GMT.
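
A minimal sketch of this point (hypothetical demo code, not from the PR): a timestamp string carrying an explicit `Z` offset parses to the same instant regardless of the parser's default timezone, because the embedded offset overrides the default.

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

object ZoneSuffixDemo {
  // Parse an ISO-8601 timestamp whose "Z" suffix pins it to UTC; the
  // formatter's default timezone only applies when no offset is present.
  def parseWith(zoneId: String): Long = {
    val fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")
    fmt.setTimeZone(TimeZone.getTimeZone(zoneId))
    fmt.parse("2016-01-01T00:00:00.123Z").getTime
  }

  def main(args: Array[String]): Unit = {
    val gmt = parseWith("GMT")
    val pst = parseWith("PST")
    println(gmt == pst) // true: the explicit "Z" fixes the instant
  }
}
```

This is why the test above expects the same `InternalRow` for both `gmtId` and `Option("PST")`.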





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-02-08 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r100225669
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
 ---
@@ -357,30 +361,70 @@ class JsonExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 val jsonData = """{"a" 1}"""
 val schema = StructType(StructField("a", IntegerType) :: Nil)
 checkEvaluation(
-  JsonToStruct(schema, Map.empty, Literal(jsonData)),
+  JsonToStruct(schema, Map.empty, Literal(jsonData), gmtId),
   null
 )
 
 // Other modes should still return `null`.
 checkEvaluation(
-  JsonToStruct(schema, Map("mode" -> ParseModes.PERMISSIVE_MODE), 
Literal(jsonData)),
+  JsonToStruct(schema, Map("mode" -> ParseModes.PERMISSIVE_MODE), 
Literal(jsonData), gmtId),
   null
 )
   }
 
   test("from_json null input column") {
 val schema = StructType(StructField("a", IntegerType) :: Nil)
 checkEvaluation(
-  JsonToStruct(schema, Map.empty, Literal.create(null, StringType)),
+  JsonToStruct(schema, Map.empty, Literal.create(null, StringType), 
gmtId),
   null
 )
   }
 
+  test("from_json with timestamp") {
+val schema = StructType(StructField("t", TimestampType) :: Nil)
+
+val jsonData1 = """{"t": "2016-01-01T00:00:00.123Z"}"""
+var c = Calendar.getInstance(DateTimeUtils.TimeZoneGMT)
+c.set(2016, 0, 1, 0, 0, 0)
+c.set(Calendar.MILLISECOND, 123)
+checkEvaluation(
+  JsonToStruct(schema, Map.empty, Literal(jsonData1), gmtId),
+  InternalRow.fromSeq(c.getTimeInMillis * 1000L :: Nil)
+)
+checkEvaluation(
+  JsonToStruct(schema, Map.empty, Literal(jsonData1), Option("PST")),
+  InternalRow.fromSeq(c.getTimeInMillis * 1000L :: Nil)
--- End diff --

I'm sorry, I should have added a comment. I'll add one soon.





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-02-08 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r100225659
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
 ---
@@ -357,30 +361,70 @@ class JsonExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 val jsonData = """{"a" 1}"""
 val schema = StructType(StructField("a", IntegerType) :: Nil)
 checkEvaluation(
-  JsonToStruct(schema, Map.empty, Literal(jsonData)),
+  JsonToStruct(schema, Map.empty, Literal(jsonData), gmtId),
   null
 )
 
 // Other modes should still return `null`.
 checkEvaluation(
-  JsonToStruct(schema, Map("mode" -> ParseModes.PERMISSIVE_MODE), 
Literal(jsonData)),
+  JsonToStruct(schema, Map("mode" -> ParseModes.PERMISSIVE_MODE), 
Literal(jsonData), gmtId),
   null
 )
   }
 
   test("from_json null input column") {
 val schema = StructType(StructField("a", IntegerType) :: Nil)
 checkEvaluation(
-  JsonToStruct(schema, Map.empty, Literal.create(null, StringType)),
+  JsonToStruct(schema, Map.empty, Literal.create(null, StringType), 
gmtId),
   null
 )
   }
 
+  test("from_json with timestamp") {
+val schema = StructType(StructField("t", TimestampType) :: Nil)
+
+val jsonData1 = """{"t": "2016-01-01T00:00:00.123Z"}"""
+var c = Calendar.getInstance(DateTimeUtils.TimeZoneGMT)
+c.set(2016, 0, 1, 0, 0, 0)
+c.set(Calendar.MILLISECOND, 123)
+checkEvaluation(
+  JsonToStruct(schema, Map.empty, Literal(jsonData1), gmtId),
+  InternalRow.fromSeq(c.getTimeInMillis * 1000L :: Nil)
--- End diff --

Thanks, I'll use it.





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-02-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r99989708
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 ---
@@ -31,10 +31,11 @@ import 
org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, CompressionCodecs
  * Most of these map directly to Jackson's internal options, specified in 
[[JsonParser.Feature]].
  */
 private[sql] class JSONOptions(
-@transient private val parameters: CaseInsensitiveMap)
+@transient private val parameters: CaseInsensitiveMap, 
defaultTimeZoneId: String)
--- End diff --

To cut this short, I think we can follow `JSONOptions.columnNameOfCorruptRecord` 
or `ParquetOptions.compressionCodecClassName` to handle the varying default 
value.

It seems it now resembles the latter.





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-02-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r99989335
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 ---
@@ -31,10 +31,11 @@ import 
org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, CompressionCodecs
  * Most of these map directly to Jackson's internal options, specified in 
[[JsonParser.Feature]].
  */
 private[sql] class JSONOptions(
-@transient private val parameters: CaseInsensitiveMap)
+@transient private val parameters: CaseInsensitiveMap, 
defaultTimeZoneId: String)
--- End diff --

Ah, yes, it needed logic like the following before creating 
`JSONOptions`/`CSVOptions` (note the result of the check has to be bound to a 
value so it can actually be passed along):
```scala
val options = extraOptions.toMap
val caseInsensitiveOptions = new CaseInsensitiveMap(options)
val optionsWithTimeZone = if (caseInsensitiveOptions.contains("timeZone")) {
  caseInsensitiveOptions
} else {
  // Fall back to the session-local timezone when the option is absent.
  new CaseInsensitiveMap(
    options + ("timeZone" ->
      sparkSession.sessionState.conf.sessionLocalTimeZone))
}

val parsedOptions: JSONOptions = new JSONOptions(optionsWithTimeZone)
```

So I suggested this way because the default value of `timeZone` can vary; it 
seems `ParquetOptions` also takes another argument for the same reason.

Another way I suggested is to make this an `Option[TimeZone]` to decouple the 
varying default value (like `JSONOptions.columnNameOfCorruptRecord`), but 
`timestampFormat` in both options depends on `timeZone`, so we would have to 
make that an `Option` too, which introduces more complexity. So the way above 
seems better.

I am fine if we find a cleaner way.





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-02-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r99830559
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
 ---
@@ -357,30 +361,70 @@ class JsonExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 val jsonData = """{"a" 1}"""
 val schema = StructType(StructField("a", IntegerType) :: Nil)
 checkEvaluation(
-  JsonToStruct(schema, Map.empty, Literal(jsonData)),
+  JsonToStruct(schema, Map.empty, Literal(jsonData), gmtId),
   null
 )
 
 // Other modes should still return `null`.
 checkEvaluation(
-  JsonToStruct(schema, Map("mode" -> ParseModes.PERMISSIVE_MODE), 
Literal(jsonData)),
+  JsonToStruct(schema, Map("mode" -> ParseModes.PERMISSIVE_MODE), 
Literal(jsonData), gmtId),
   null
 )
   }
 
   test("from_json null input column") {
 val schema = StructType(StructField("a", IntegerType) :: Nil)
 checkEvaluation(
-  JsonToStruct(schema, Map.empty, Literal.create(null, StringType)),
+  JsonToStruct(schema, Map.empty, Literal.create(null, StringType), 
gmtId),
   null
 )
   }
 
+  test("from_json with timestamp") {
+val schema = StructType(StructField("t", TimestampType) :: Nil)
+
+val jsonData1 = """{"t": "2016-01-01T00:00:00.123Z"}"""
+var c = Calendar.getInstance(DateTimeUtils.TimeZoneGMT)
+c.set(2016, 0, 1, 0, 0, 0)
+c.set(Calendar.MILLISECOND, 123)
+checkEvaluation(
+  JsonToStruct(schema, Map.empty, Literal(jsonData1), gmtId),
+  InternalRow.fromSeq(c.getTimeInMillis * 1000L :: Nil)
+)
+checkEvaluation(
+  JsonToStruct(schema, Map.empty, Literal(jsonData1), Option("PST")),
+  InternalRow.fromSeq(c.getTimeInMillis * 1000L :: Nil)
--- End diff --

Why doesn't the result change? Sorry, it's always hard for me to reason 
about time-related tests...





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-02-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r99829778
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
 ---
@@ -357,30 +361,70 @@ class JsonExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 val jsonData = """{"a" 1}"""
 val schema = StructType(StructField("a", IntegerType) :: Nil)
 checkEvaluation(
-  JsonToStruct(schema, Map.empty, Literal(jsonData)),
+  JsonToStruct(schema, Map.empty, Literal(jsonData), gmtId),
   null
 )
 
 // Other modes should still return `null`.
 checkEvaluation(
-  JsonToStruct(schema, Map("mode" -> ParseModes.PERMISSIVE_MODE), 
Literal(jsonData)),
+  JsonToStruct(schema, Map("mode" -> ParseModes.PERMISSIVE_MODE), 
Literal(jsonData), gmtId),
   null
 )
   }
 
   test("from_json null input column") {
 val schema = StructType(StructField("a", IntegerType) :: Nil)
 checkEvaluation(
-  JsonToStruct(schema, Map.empty, Literal.create(null, StringType)),
+  JsonToStruct(schema, Map.empty, Literal.create(null, StringType), 
gmtId),
   null
 )
   }
 
+  test("from_json with timestamp") {
+val schema = StructType(StructField("t", TimestampType) :: Nil)
+
+val jsonData1 = """{"t": "2016-01-01T00:00:00.123Z"}"""
+var c = Calendar.getInstance(DateTimeUtils.TimeZoneGMT)
+c.set(2016, 0, 1, 0, 0, 0)
+c.set(Calendar.MILLISECOND, 123)
+checkEvaluation(
+  JsonToStruct(schema, Map.empty, Literal(jsonData1), gmtId),
+  InternalRow.fromSeq(c.getTimeInMillis * 1000L :: Nil)
--- End diff --

nit: `InternalRow(c.getTimeInMillis * 1000L)`





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-02-06 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r99760937
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -298,6 +299,8 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
* `timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSZZ`): sets 
the string that
* indicates a timestamp format. Custom date formats follow the formats 
at
* `java.text.SimpleDateFormat`. This applies to timestamp type.
+   * `timeZone` (default session local timezone): sets the string that 
indicates a timezone
--- End diff --

I'd like to use `timeZone` for the option key as the same as 
`spark.sql.session.timeZone` for config key for the session local timezone.
What do you think?





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-02-06 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r99760946
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 ---
@@ -31,10 +31,11 @@ import 
org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, CompressionCodecs
  * Most of these map directly to Jackson's internal options, specified in 
[[JsonParser.Feature]].
  */
 private[sql] class JSONOptions(
-@transient private val parameters: CaseInsensitiveMap)
+@transient private val parameters: CaseInsensitiveMap, defaultTimeZoneId: String)
--- End diff --

I initially put the `timeZone` option in every place that creates `JSONOptions` (or 
`CSVOptions`), but that duplicated the same contains-key check logic many times, as 
@HyukjinKwon mentioned.
So I modified the code to pass the default timezone id to `JSONOptions` and 
`CSVOptions` instead.





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-02-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r99531422
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -298,6 +299,8 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
* `timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSZZ`): sets 
the string that
* indicates a timestamp format. Custom date formats follow the formats 
at
* `java.text.SimpleDateFormat`. This applies to timestamp type.
+   * `timeZone` (default session local timezone): sets the string that 
indicates a timezone
--- End diff --

`timeZoneId`?





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-02-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r99531337
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
 ---
@@ -31,10 +31,11 @@ import 
org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, CompressionCodecs
  * Most of these map directly to Jackson's internal options, specified in 
[[JsonParser.Feature]].
  */
 private[sql] class JSONOptions(
-@transient private val parameters: CaseInsensitiveMap)
+@transient private val parameters: CaseInsensitiveMap, defaultTimeZoneId: String)
--- End diff --

shouldn't `timeZoneId` just be an option in `parameters` with key 
`timeZoneId`?





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-01-31 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r98834088
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 ---
@@ -161,12 +163,3 @@ private[csv] class CSVOptions(@transient private val 
parameters: CaseInsensitive
 settings
   }
 }
-
-object CSVOptions {
--- End diff --

`CSVOptions` (and also `JSONOptions`) will always have to take the `timeZone` option.
I don't want callers to be able to forget to specify it by using these convenience methods.
Or should I add the default timezone id to these methods?
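The shape being discussed — a required default that callers cannot forget, with the `timeZone` key still overridable per call — might look like this hypothetical Python sketch (class and field names are illustrative, not Spark's API):

```python
# Hypothetical sketch: the options class takes the session-default timezone id
# as an explicit constructor argument and falls back to it only when no
# "timeZone" option was passed. The lowercased lookup mirrors the behavior of
# CaseInsensitiveMap.
class JsonOptions:
    def __init__(self, parameters, default_time_zone_id):
        params = {k.lower(): v for k, v in parameters.items()}
        self.time_zone_id = params.get("timezone", default_time_zone_id)

print(JsonOptions({}, "GMT").time_zone_id)                   # GMT
print(JsonOptions({"timeZone": "PST"}, "GMT").time_zone_id)  # PST
```

Because the default is a constructor argument rather than an optional key, forgetting it is a compile-time (or construction-time) error instead of a silent misconfiguration.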





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-01-31 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r98834068
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -329,7 +332,17 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
* @since 1.4.0
*/
   def json(jsonRDD: RDD[String]): DataFrame = {
-val parsedOptions: JSONOptions = new JSONOptions(extraOptions.toMap)
+val optionsWithTimeZone = {
--- End diff --

The `timeZone` option is used inside `JSONOptions`/`CSVOptions`, so we 
can't handle it the same way as `columnNameOfCorruptRecord`.
I'll modify the code to pass the default timezone id to `JSONOptions` and 
`CSVOptions`.





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-01-31 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r98834049
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -297,7 +300,7 @@ def text(self, paths):
 def csv(self, path, schema=None, sep=None, encoding=None, quote=None, 
escape=None,
 comment=None, header=None, inferSchema=None, 
ignoreLeadingWhiteSpace=None,
 ignoreTrailingWhiteSpace=None, nullValue=None, nanValue=None, 
positiveInf=None,
-negativeInf=None, dateFormat=None, timestampFormat=None, maxColumns=None,
+negativeInf=None, dateFormat=None, timestampFormat=None, timeZone=None, maxColumns=None,
--- End diff --

Ah, I see, I'll move them to the end.





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-01-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r98624418
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -329,7 +332,17 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
* @since 1.4.0
*/
   def json(jsonRDD: RDD[String]): DataFrame = {
-val parsedOptions: JSONOptions = new JSONOptions(extraOptions.toMap)
+val optionsWithTimeZone = {
--- End diff --

Could we just pass the timezone into `JSONOptions` as a default, or handle it 
like `columnNameOfCorruptRecord` in `JSONOptions` below?

It seems the same logic is duplicated here several times, and the logic it 
introduces for setting default values in tests might be unnecessary or 
removable.





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-01-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r98629735
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -329,7 +332,17 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
* @since 1.4.0
*/
   def json(jsonRDD: RDD[String]): DataFrame = {
-val parsedOptions: JSONOptions = new JSONOptions(extraOptions.toMap)
+val optionsWithTimeZone = {
--- End diff --

It seems the same comment also applies to `CSVOptions`.





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-01-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r98625766
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 ---
@@ -161,12 +163,3 @@ private[csv] class CSVOptions(@transient private val 
parameters: CaseInsensitive
 settings
   }
 }
-
-object CSVOptions {
--- End diff --

Do you mind if I ask the reason for removing this? It apparently causes many 
CSV tests to need fixing.





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-01-31 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16750#discussion_r98623217
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -297,7 +300,7 @@ def text(self, paths):
 def csv(self, path, schema=None, sep=None, encoding=None, quote=None, 
escape=None,
 comment=None, header=None, inferSchema=None, 
ignoreLeadingWhiteSpace=None,
 ignoreTrailingWhiteSpace=None, nullValue=None, nanValue=None, 
positiveInf=None,
-negativeInf=None, dateFormat=None, timestampFormat=None, maxColumns=None,
+negativeInf=None, dateFormat=None, timestampFormat=None, timeZone=None, maxColumns=None,
--- End diff --

(Hi @ueshin, to my knowledge this should be added at the end, to avoid 
breaking existing code that passes these options as positional arguments)





[GitHub] spark pull request #16750: [SPARK-18937][SQL] Timezone support in CSV/JSON p...

2017-01-31 Thread ueshin
GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/16750

[SPARK-18937][SQL] Timezone support in CSV/JSON parsing

## What changes were proposed in this pull request?

This is a follow-up pr of #16308.

This pr enables timezone support in CSV/JSON parsing.

It introduces a `timeZone` option for the CSV/JSON datasources (the default 
value of the option is the session local timezone).

The datasources use the `timeZone` option to format/parse timestamp values 
when writing/reading.
Notice that while reading, if the `timestampFormat` includes timezone info, 
the `timeZone` option is not used, because we should respect the timezone 
embedded in the values.

For example, given the timestamp `"2016-01-01 00:00:00"` in `GMT`, the 
values written with the default timezone option (`"GMT"`, because the 
session local timezone is `"GMT"` here) are:

```scala
scala> spark.conf.set("spark.sql.session.timeZone", "GMT")

scala> val df = Seq(new java.sql.Timestamp(1451606400000L)).toDF("ts")
df: org.apache.spark.sql.DataFrame = [ts: timestamp]

scala> df.show()
+-------------------+
|                 ts|
+-------------------+
|2016-01-01 00:00:00|
+-------------------+


scala> df.write.json("/path/to/gmtjson")
```

```sh
$ cat /path/to/gmtjson/part-*
{"ts":"2016-01-01T00:00:00.000Z"}
```

whereas setting the option to `"PST"`, they are:

```scala
scala> df.write.option("timeZone", "PST").json("/path/to/pstjson")
```

```sh
$ cat /path/to/pstjson/part-*
{"ts":"2015-12-31T16:00:00.000-08:00"}
```
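The GMT-vs-PST rendering above can be reproduced with plain Python (a sketch using a fixed `-08:00` offset to stand in for `"PST"`; this is not Spark code):

```python
from datetime import datetime, timezone, timedelta

ts = datetime.fromtimestamp(1451606400, tz=timezone.utc)  # 2016-01-01 00:00:00 GMT

# Writing with the default (GMT) timezone option:
gmt_str = ts.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
print(gmt_str)  # 2016-01-01T00:00:00.000Z

# Writing with a PST-like fixed offset: same instant, different rendering.
pst = timezone(timedelta(hours=-8))
pst_str = ts.astimezone(pst).isoformat(timespec="milliseconds")
print(pst_str)  # 2015-12-31T16:00:00.000-08:00
```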

We can properly read these files even if the timezone option is wrong 
because the timestamp values have timezone info:

```scala
scala> val schema = new StructType().add("ts", TimestampType)
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(ts,TimestampType,true))

scala> spark.read.schema(schema).json("/path/to/gmtjson").show()
+-------------------+
|                 ts|
+-------------------+
|2016-01-01 00:00:00|
+-------------------+

scala> spark.read.schema(schema).option("timeZone", "PST").json("/path/to/gmtjson").show()
+-------------------+
|                 ts|
+-------------------+
|2016-01-01 00:00:00|
+-------------------+
```
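Why the wrong `timeZone` option is harmless here: the written string carries its own offset, so it pins down a single instant no matter how the reader is configured. A stdlib-Python illustration (not Spark code):

```python
from datetime import datetime, timezone

# An offset-qualified timestamp string denotes one instant; a reader's
# timezone setting only changes how that instant is displayed, not what it is.
parsed = datetime.fromisoformat("2016-01-01T00:00:00.000+00:00")
print(parsed.astimezone(timezone.utc).isoformat())  # 2016-01-01T00:00:00+00:00
```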

And even if the `timestampFormat` doesn't contain timezone info, we can properly 
read the values by setting the correct `timeZone` option:

```scala
scala> df.write.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson")
```

```sh
$ cat /path/to/jstjson/part-*
{"ts":"2016-01-01T09:00:00"}
```

```scala
// wrong result
scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").json("/path/to/jstjson").show()
+-------------------+
|                 ts|
+-------------------+
|2016-01-01 09:00:00|
+-------------------+

// correct result
scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson").show()
+-------------------+
|                 ts|
+-------------------+
|2016-01-01 00:00:00|
+-------------------+
```
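The wrong/correct pair above comes down to which timezone the offset-less string is interpreted in; a Python sketch with a fixed `+09:00` offset standing in for `"JST"` (not Spark code):

```python
from datetime import datetime, timezone, timedelta

raw = "2016-01-01T09:00:00"  # written with timeZone=JST, no offset in the string
naive = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S")

# Wrong: interpreting the naive string in GMT shifts the instant by 9 hours.
wrong = naive.replace(tzinfo=timezone.utc)
print(wrong.isoformat())  # 2016-01-01T09:00:00+00:00

# Correct: interpret it in the writer's timezone, then view it in GMT.
jst = timezone(timedelta(hours=9))
right = naive.replace(tzinfo=jst).astimezone(timezone.utc)
print(right.isoformat())  # 2016-01-01T00:00:00+00:00
```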

This pr also makes `JsonToStruct` and `StructToJson` 
`TimeZoneAwareExpression` to be able to evaluate values with timezone option.

## How was this patch tested?

Existing tests and added some tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark issues/SPARK-18937

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16750.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16750


commit aa052f4d11929192b749752f4b73772664d0460c
Author: Takuya UESHIN 
Date:   2017-01-05T09:29:42Z

Add timeZone option to JSONOptions.

commit 890879e24b3f63509a000585e18b288961a4e5cf
Author: Takuya UESHIN 
Date:   2017-01-06T05:11:41Z

Apply timeZone option to JSON datasources.

commit f08b78c16ac444550e7ea0857d0275b9a91b7561
Author: Takuya UESHIN 
Date:   2017-01-06T06:03:34Z

Apply timeZone option to CSV datasources.

commit 551cff99785927be3ef68c4393dca4dabb3c2ba0
Author: Takuya UESHIN 
Date:   2017-01-06T08:39:26Z

Modify python files.



