[GitHub] [spark] wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table

2019-12-16 Thread GitBox
wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix 
ClassCastException when querying partitioned JSON table
URL: https://github.com/apache/spark/pull/26895#discussion_r358628516
 
 

 ##
 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
 ##
 @@ -252,10 +254,14 @@ class HadoopTableReader(
 partProps.asScala.foreach {
   case (key, value) => props.setProperty(key, value)
 }
-deserializer.initialize(hconf, props)
+DeserializerLock.synchronized {
+  deserializer.initialize(hconf, props)
+}
 // get the table deserializer
 val tableSerDe = localTableDesc.getDeserializerClass.getConstructor().newInstance()
-tableSerDe.initialize(hconf, localTableDesc.getProperties)
+DeserializerLock.synchronized {
+  tableSerDe.initialize(hconf, tableProperties)
 
 Review comment:
   Yes, I did find that to be the case in my repro: the two were the same class (`JsonSerDe`). However, the `initialize` calls on `deserializer` and on `tableSerDe` use potentially different properties (`props` and `tableProperties` can differ), so I think `tableSerDe` should be initialized even when its class is the same as `deserializer`'s.
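   To make the distinction concrete, here is a rough standalone illustration (not the `TableReader` code; the property keys and values are made up): `props` takes the table-level properties as defaults and is then overlaid with the partition-level entries shown in the diff, so it can diverge from `tableProperties`.

```scala
import java.util.Properties

object PropertiesSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical entries purely for illustration; in TableReader the real
    // ones come from the Hive TableDesc and PartitionDesc.
    val tableProperties = new Properties()
    tableProperties.setProperty("columns", "id,name")

    // Table-level properties act as defaults; partition-level entries override.
    val props = new Properties(tableProperties)
    props.setProperty("columns", "id,name,extra")

    println(props.getProperty("columns"))            // id,name,extra
    println(tableProperties.getProperty("columns"))  // id,name

    // Hence the two calls in the diff are intentionally not interchangeable:
    //   deserializer.initialize(hconf, props)
    //   tableSerDe.initialize(hconf, tableProperties)
  }
}
```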





[GitHub] [spark] wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table

2019-12-16 Thread GitBox
wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix 
ClassCastException when querying partitioned JSON table
URL: https://github.com/apache/spark/pull/26895#discussion_r358510355
 
 

 ##
 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
 ##
 @@ -252,10 +254,14 @@ class HadoopTableReader(
 partProps.asScala.foreach {
   case (key, value) => props.setProperty(key, value)
 }
-deserializer.initialize(hconf, props)
+DeserializerLock.synchronized {
+  deserializer.initialize(hconf, props)
+}
 // get the table deserializer
 val tableSerDe = localTableDesc.getDeserializerClass.getConstructor().newInstance()
-tableSerDe.initialize(hconf, localTableDesc.getProperties)
 
 Review comment:
   I was mistaken. The SerDe has to be instantiated and initialized within the 
transformation, as it is not Serializable and cannot be sent in the closure.
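   A minimal standalone sketch of that pattern (using a hypothetical `NonSerializableParser` in place of a Hive SerDe and a local `SparkSession`; not the `TableReader` code): the non-serializable helper is created inside `mapPartitions`, once per partition, so nothing non-serializable has to be shipped in the closure.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for a SerDe: it is not Serializable, so an instance
// cannot be captured in a task closure.
class NonSerializableParser {
  def parse(line: String): Int = line.length
}

object ClosureSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("closure-sketch").getOrCreate()
    val rdd = spark.sparkContext.parallelize(Seq("a", "bb", "ccc"), numSlices = 2)

    // The parser is created inside the task, once per partition, so nothing
    // non-serializable crosses the driver/executor boundary.
    val lengths = rdd.mapPartitions { iter =>
      val parser = new NonSerializableParser()
      iter.map(parser.parse)
    }

    println(lengths.collect().toSeq)  // Seq(1, 2, 3)
    spark.stop()
  }
}
```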





[GitHub] [spark] wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table

2019-12-15 Thread GitBox
wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix 
ClassCastException when querying partitioned JSON table
URL: https://github.com/apache/spark/pull/26895#discussion_r358004041
 
 

 ##
 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
 ##
 @@ -132,7 +132,9 @@ class HadoopTableReader(
 val deserializedHadoopRDD = hadoopRDD.mapPartitions { iter =>
   val hconf = broadcastedHadoopConf.value.value
   val deserializer = deserializerClass.getConstructor().newInstance()
-  deserializer.initialize(hconf, localTableDesc.getProperties)
+  DeserializerLock.synchronized {
 
 Review comment:
   On the question of other callers needing to use this global lock:
   AFAIK, the only reported problem in Spark with the 
`HCatRecordObjectInspectorFactory` bug is in reading partitioned `JsonSerDe` 
tables. In `HadoopTableReader#makeRDDForTable`, we can even do without using 
the lock; we only really need to use the lock in 
`HadoopTableReader#makeRDDForPartitionedTable`.
   From my search, there are two other places in Spark where we call 
`Deserializer#initialize`:
   1. `HiveTableScanExec`, when initializing a `HadoopTableReader` instance, before `HadoopTableReader#makeRDDForTable` or `HadoopTableReader#makeRDDForPartitionedTable` even gets called in `doExecute`.
   2. `HiveScriptIOSchema`.
   
   I don't think we need to use the lock for 1.
   I don't know about 2, but if no problem has been reported there due to the race, we can leave it alone as well.
   In other words, I'm not proposing we guard against the race in the Hive bug 
everywhere, just in this known case.
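   For reference, a minimal sketch of the call-site pattern being discussed. The hunk does not show how `DeserializerLock` is defined, so this assumes it is simply a JVM-wide singleton used as a monitor; `SerDeInitSketch.initializeGuarded` is a hypothetical helper, not Spark code.

```scala
import java.util.Properties
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.serde2.Deserializer

// Assumed shape: a plain singleton object whose monitor serializes SerDe
// initialization within one executor JVM.
object DeserializerLock

object SerDeInitSketch {
  // Hypothetical helper showing the guarded call: at most one task per
  // executor JVM runs initialize at a time, which avoids the unsynchronized
  // cache race in HCatRecordObjectInspectorFactory during JsonSerDe#initialize.
  def initializeGuarded(deserializer: Deserializer, hconf: Configuration, props: Properties): Unit = {
    DeserializerLock.synchronized {
      deserializer.initialize(hconf, props)
    }
  }
}
```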





[GitHub] [spark] wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table

2019-12-15 Thread GitBox
wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix 
ClassCastException when querying partitioned JSON table
URL: https://github.com/apache/spark/pull/26895#discussion_r358002334
 
 

 ##
 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
 ##
 @@ -132,7 +132,9 @@ class HadoopTableReader(
 val deserializedHadoopRDD = hadoopRDD.mapPartitions { iter =>
   val hconf = broadcastedHadoopConf.value.value
   val deserializer = deserializerClass.getConstructor().newInstance()
-  deserializer.initialize(hconf, localTableDesc.getProperties)
+  DeserializerLock.synchronized {
 
 Review comment:
   HIVE-15773 and HIVE-21752 offer two different fixes for the `HCatRecordObjectInspectorFactory` bug. HIVE-21752 has been merged into Hive 4.0.0, which has not been released yet. Even when it is, not many users will be on Hive 4.0.0 for some time, so I think we need to deal with the bug here.





[GitHub] [spark] wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table

2019-12-15 Thread GitBox
wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix 
ClassCastException when querying partitioned JSON table
URL: https://github.com/apache/spark/pull/26895#discussion_r358001919
 
 

 ##
 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
 ##
 @@ -252,10 +254,14 @@ class HadoopTableReader(
 partProps.asScala.foreach {
   case (key, value) => props.setProperty(key, value)
 }
-deserializer.initialize(hconf, props)
+DeserializerLock.synchronized {
+  deserializer.initialize(hconf, props)
+}
 // get the table deserializer
 val tableSerDe = localTableDesc.getDeserializerClass.getConstructor().newInstance()
-tableSerDe.initialize(hconf, localTableDesc.getProperties)
 
 Review comment:
   Actually, the `tableSerDe` can be instantiated and initialized before the 
transformation here. Then it will be done just once, instead of repeatedly for 
each Hive partition.
   (The same applies to the `Deserializer` for the table in `makeRDDForTable`.)
   I'll make a change.
   





[GitHub] [spark] wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table

2019-12-15 Thread GitBox
wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix 
ClassCastException when querying partitioned JSON table
URL: https://github.com/apache/spark/pull/26895#discussion_r358000739
 
 

 ##
 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
 ##
 @@ -132,7 +132,9 @@ class HadoopTableReader(
 val deserializedHadoopRDD = hadoopRDD.mapPartitions { iter =>
   val hconf = broadcastedHadoopConf.value.value
   val deserializer = deserializerClass.getConstructor().newInstance()
-  deserializer.initialize(hconf, localTableDesc.getProperties)
+  DeserializerLock.synchronized {
 
 Review comment:
   I don't think there will be a lot of contention; we call `deserializer.initialize` once per partition.
   The race is actually in `HCatRecordObjectInspectorFactory.getHCatRecordObjectInspector`, which is called by `JsonSerDe#initialize` (the factory maintains its cache in a plain `HashMap` instead of, e.g., a `ConcurrentHashMap`). We want the same `ObjectInspector` to be returned by the factory cache and set in each `JsonSerDe` instance as its `ObjectInspector`.
   Our objective is to ensure that no more than one task per executor JVM calls `JsonSerDe#initialize` at a time, as illustrated in the sketch below.
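   To make the race concrete, here is a small standalone sketch (not the Hive factory code, just the shape of the bug): an unsynchronized check-then-put cache can hand different instances to concurrent callers, which is the analogue of two `JsonSerDe` instances ending up with different `ObjectInspector`s.

```scala
import java.util.concurrent.{ConcurrentHashMap, CountDownLatch}

object CacheRaceSketch {
  // Deliberately unsafe: mirrors the shape of a get-or-create cache backed by
  // a plain HashMap, where "check" and "put" are not atomic.
  private val unsafeCache = new java.util.HashMap[String, Object]()

  def getOrCreateUnsafe(key: String): Object = {
    var v = unsafeCache.get(key)   // check ...
    if (v == null) {
      v = new Object()
      unsafeCache.put(key, v)      // ... then put
    }
    v
  }

  def main(args: Array[String]): Unit = {
    val start = new CountDownLatch(1)
    val seen = ConcurrentHashMap.newKeySet[Object]()
    val threads = (1 to 8).map { _ =>
      new Thread(() => {
        start.await()
        seen.add(getOrCreateUnsafe("inspector"))
      })
    }
    threads.foreach(_.start())
    start.countDown()
    threads.foreach(_.join())
    // With unlucky timing this prints a number greater than 1; guarding the
    // call with a shared lock (as the PR does for initialize) keeps it at 1.
    println(s"distinct instances handed out: ${seen.size()}")
  }
}
```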


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table

2019-12-14 Thread GitBox
wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix 
ClassCastException when querying partitioned JSON table
URL: https://github.com/apache/spark/pull/26895#discussion_r357948743
 
 

 ##
 File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
 ##
 @@ -252,10 +254,14 @@ class HadoopTableReader(
 partProps.asScala.foreach {
   case (key, value) => props.setProperty(key, value)
 }
-deserializer.initialize(hconf, props)
+DeserializerLock.synchronized {
+  deserializer.initialize(hconf, props)
+}
 // get the table deserializer
 val tableSerDe = localTableDesc.getDeserializerClass.getConstructor().newInstance()
-tableSerDe.initialize(hconf, localTableDesc.getProperties)
 
 Review comment:
   `localTableDesc` = `tableDesc`, and `tableProperties` = `tableDesc.getProperties`, which is used a little above in this same block and might as well be used here.

