[jira] [Resolved] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect

2017-11-08 Thread Saniya Tech (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saniya Tech resolved SPARK-22460.
-
Resolution: Not A Problem

> Spark De-serialization of Timestamp field is Incorrect
> --
>
> Key: SPARK-22460
> URL: https://issues.apache.org/jira/browse/SPARK-22460
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.1
>Reporter: Saniya Tech
>
> We are trying to serialize Timestamp fields to Avro using the spark-avro 
> connector. I can see the Timestamp fields are correctly serialized as longs 
> (milliseconds since the Epoch), and I verified that the data is read back 
> correctly from the Avro files. It is when we encode the Dataset as a case 
> class that the timestamp field goes wrong: the long value is interpreted as 
> seconds since the Epoch. As can be seen below, this shifts the timestamp 
> many years into the future.
> Code used to reproduce the issue:
> {code:java}
> import java.sql.Timestamp
> import com.databricks.spark.avro._
> import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession}
> case class TestRecord(name: String, modified: Timestamp)
> import spark.implicits._
> val data = Seq(
>   TestRecord("One", new Timestamp(System.currentTimeMillis()))
> )
> // Serialize:
> val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> "com.example.domain")
> val path = "s3a://some-bucket/output/"
> val ds = spark.createDataset(data)
> ds.write
>   .options(parameters)
>   .mode(SaveMode.Overwrite)
>   .avro(path)
> // De-serialize
> val output = spark.read.avro(path).as[TestRecord]
> {code}
> Output from the test:
> {code:java}
> scala> data.head
> res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419)
> scala> output.collect().head
> res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0)
> {code}



[jira] [Commented] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect

2017-11-08 Thread Saniya Tech (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245163#comment-16245163 ]

Saniya Tech commented on SPARK-22460:
-

Based on the feedback, I am going to close this ticket and try to resolve the 
issue in the spark-avro codebase. Thanks!
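
In the meantime, the field can be read as a plain long and the Timestamp 
constructed in Scala, sidestepping the SQL cast entirely. A minimal sketch, 
assuming the session, imports, and names (`spark`, `path`, `TestRecord`) from 
the repro below; `RawRecord` is a hypothetical case class matching the 
on-disk Avro schema:

{code:java}
import java.sql.Timestamp
import com.databricks.spark.avro._
import spark.implicits._

// RawRecord mirrors the Avro file: the timestamp is still an
// epoch-milliseconds long at this point.
case class RawRecord(name: String, modified: Long)

// Build the Timestamp directly from milliseconds, so Spark's
// long-to-timestamp cast (which assumes seconds) is never invoked.
val output = spark.read.avro(path)
  .as[RawRecord]
  .map(r => TestRecord(r.name, new Timestamp(r.modified)))
{code}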




[jira] [Reopened] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect

2017-11-07 Thread Saniya Tech (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saniya Tech reopened SPARK-22460:
-

See my last comment.




[jira] [Commented] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect

2017-11-07 Thread Saniya Tech (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16242109#comment-16242109 ]

Saniya Tech commented on SPARK-22460:
-

No, it is not a spark-avro issue. spark-avro serializes and deserializes the 
long value (the time in milliseconds since the Epoch) correctly.

It is Spark's encoder, which converts the Dataset to the case class, that 
incorrectly interprets the long value as the time in seconds since the Epoch.

I have broken the code into steps to show the result of each one:

{code:java}
// De-serialize
// rawOutput is deserialized by the spark-avro connector
val rawOutput = spark.read.avro(path)
// output is encoded to the case class via Spark's as[TestRecord]
val output = rawOutput.as[TestRecord]
{code}

Print-out of results for each step:
{code:java}
scala> data.head
res3: TestRecord = TestRecord(One,2017-11-07 14:19:42.427)

scala> data.head.modified.getTime
res4: Long = 1510064382427

scala> rawOutput.collect().head
res5: org.apache.spark.sql.Row = [One,1510064382427]

scala> output.collect().head
res6: TestRecord = TestRecord(One,49822-01-14 00:27:07.0)
{code}
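
Interpreting that long as seconds instead of milliseconds reproduces the 
runaway date. A quick sanity check with plain Java time, outside Spark, 
using the value printed above:

{code:java}
import java.time.Instant

val millis = 1510064382427L      // data.head.modified.getTime from above

Instant.ofEpochMilli(millis)     // 2017-11-07T14:19:42.427Z -- the expected instant (UTC)
Instant.ofEpochSecond(millis)    // roughly +49822-01-14T00:27:07Z -- the runaway date above
{code}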

This is the relevant code in Spark, which assumes the long value is in 
seconds:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L285
{code:java}
  // converting seconds to us
  private[this] def longToTimestamp(t: Long): Long = t * 1000000L
{code}

The Java API specifies that Timestamp.getTime() returns the time as a long in 
milliseconds:
https://docs.oracle.com/javase/8/docs/api/java/sql/Timestamp.html#getTime--
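
Until that is fixed, one workaround is to divide the raw long by 1000 before 
casting, so the cast's seconds assumption holds. A sketch against the 
`rawOutput` DataFrame defined above, assuming the repro's imports and 
implicits are in scope:

{code:java}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.TimestampType

// millis / 1000 gives (fractional) seconds; the cast to TimestampType
// then interprets them correctly, preserving millisecond precision.
val fixedOutput = rawOutput
  .withColumn("modified", (col("modified") / 1000).cast(TimestampType))
  .as[TestRecord]
{code}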






[jira] [Updated] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect

2017-11-06 Thread Saniya Tech (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saniya Tech updated SPARK-22460:

Description updated: the phrase "converted to as long value" was corrected to 
"converted to a long value"; the rest of the text is unchanged (full 
description quoted at the top of the thread).



[jira] [Created] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect

2017-11-06 Thread Saniya Tech (JIRA)
Saniya Tech created SPARK-22460:
---

 Summary: Spark De-serialization of Timestamp field is Incorrect
 Key: SPARK-22460
 URL: https://issues.apache.org/jira/browse/SPARK-22460
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.1.1
Reporter: Saniya Tech


(Description: the same report quoted at the top of the thread, including the 
typo "converted to as long value" fixed in the later update.)