[GitHub] [spark] wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table
wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table
URL: https://github.com/apache/spark/pull/26895#discussion_r358628516

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala

## @@ -252,10 +254,14 @@ class HadoopTableReader(
     partProps.asScala.foreach { case (key, value) => props.setProperty(key, value) }
-    deserializer.initialize(hconf, props)
+    DeserializerLock.synchronized {
+      deserializer.initialize(hconf, props)
+    }
     // get the table deserializer
     val tableSerDe = localTableDesc.getDeserializerClass.getConstructor().newInstance()
-    tableSerDe.initialize(hconf, localTableDesc.getProperties)
+    DeserializerLock.synchronized {
+      tableSerDe.initialize(hconf, tableProperties)
+    }

Review comment:
Yes, I did find that to be the case in my repro: the two were the same class (`JsonSerDe`). However, the `initialize` calls on `deserializer` and on `tableSerDe` are made with potentially different properties (`props` and `tableProperties` could differ), so I think I should initialize `tableSerDe` even if it has the same class as `deserializer`.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
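The point above — that the two SerDes may see different properties — can be made concrete with a small, self-contained sketch. This is illustrative only (the property keys and the exact construction are made up, not copied from `TableReader.scala`): the partition SerDe's properties start from the table's and are overlaid with the partition's own key/value pairs, so a partition-level setting can shadow a table-level one.

```scala
import java.util.Properties

// Hypothetical table-level properties (keys chosen for illustration).
val tableProperties = new Properties()
tableProperties.setProperty("columns", "id,name")
tableProperties.setProperty("serialization.format", "1")

// Hypothetical partition-level properties, playing the role of partProps.
val partProps = Map("serialization.format" -> "2")

// Mirror the overlay pattern: table properties as defaults, partition
// properties written on top.
val props = new Properties(tableProperties)
partProps.foreach { case (key, value) => props.setProperty(key, value) }

// The partition SerDe would now see "2" where the table SerDe sees "1",
// while un-overridden keys like "columns" fall through to the table values.
val partitionFormat = props.getProperty("serialization.format")
val tableFormat = tableProperties.getProperty("serialization.format")
```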
[GitHub] [spark] wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table
wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table
URL: https://github.com/apache/spark/pull/26895#discussion_r358510355

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala

## @@ -252,10 +254,14 @@ class HadoopTableReader(
     partProps.asScala.foreach { case (key, value) => props.setProperty(key, value) }
-    deserializer.initialize(hconf, props)
+    DeserializerLock.synchronized {
+      deserializer.initialize(hconf, props)
+    }
     // get the table deserializer
     val tableSerDe = localTableDesc.getDeserializerClass.getConstructor().newInstance()
-    tableSerDe.initialize(hconf, localTableDesc.getProperties)

Review comment:
I was mistaken. The SerDe has to be instantiated and initialized within the transformation, as it is not `Serializable` and cannot be sent in the closure.
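The constraint in the comment above can be demonstrated without Spark or Hive at all. In this sketch, `FakeSerDe` is a hypothetical stand-in for a Hive SerDe such as `JsonSerDe`: the instance itself is not `java.io.Serializable`, so it cannot be captured in a task closure, but its `Class` object can be shipped, which is why each task constructs and initializes its own instance inside the transformation.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for a Hive SerDe; deliberately NOT Serializable, like the real ones.
class FakeSerDe {
  def initialize(props: java.util.Properties): Unit = ()
}

// Try Java serialization and report whether it succeeds.
def isJavaSerializable(obj: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
    true
  } catch {
    case _: NotSerializableException => false
  }

// The instance cannot go into a closure that is shipped to executors...
val canShipInstance = isJavaSerializable(new FakeSerDe) // false
// ...but the Class object can (java.lang.Class is Serializable), so each
// task can instantiate its own SerDe from the shipped class.
val canShipClass = isJavaSerializable(classOf[FakeSerDe]) // true
```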
[GitHub] [spark] wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table
wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table
URL: https://github.com/apache/spark/pull/26895#discussion_r358004041

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala

## @@ -132,7 +132,9 @@ class HadoopTableReader(
     val deserializedHadoopRDD = hadoopRDD.mapPartitions { iter =>
       val hconf = broadcastedHadoopConf.value.value
       val deserializer = deserializerClass.getConstructor().newInstance()
-      deserializer.initialize(hconf, localTableDesc.getProperties)
+      DeserializerLock.synchronized {

Review comment:
On the question of other callers needing to use this global lock: AFAIK, the only reported problem in Spark with the `HCatRecordObjectInspectorFactory` bug is in reading partitioned `JsonSerDe` tables. In `HadoopTableReader#makeRDDForTable`, we can even do without the lock; we only really need it in `HadoopTableReader#makeRDDForPartitionedTable`. From my search, there are two other places in Spark where we call `Deserializer#initialize`:

1. `HiveTableScanExec`, when initializing a `HadoopTableReader` instance, before `HadoopTableReader#makeRDDForTable` or `HadoopTableReader#makeRDDForPartitionedTable` even gets called in `doExec`.
2. `HiveScriptIOSchema`.

I don't think we need the lock for 1. I don't know about 2., but if no problem has been reported due to the race, we can leave it alone too. In other words, I'm not proposing we guard against the race in the Hive bug everywhere, just in this known case.
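The global-lock pattern discussed above is just a plain Scala singleton object whose monitor serializes the critical section across all tasks in one executor JVM. The following is a minimal, self-contained sketch of that pattern; `DeserializerLock` mirrors the lock object added in `TableReader.scala`, while the counter is only a stand-in for the non-thread-safe state touched by `Deserializer#initialize`.

```scala
// Singleton object used purely as a JVM-wide monitor.
object DeserializerLock

// Stand-in for shared, non-thread-safe state (e.g. the inspector cache).
var initCount = 0

def guardedInitialize(): Unit = DeserializerLock.synchronized {
  val seen = initCount
  Thread.sleep(1) // widen the window in which a lost update could occur
  initCount = seen + 1
}

// Eight concurrent "tasks" all initialize; the lock serializes them, so no
// update is lost and initCount ends at exactly 8.
val tasks = (1 to 8).map(_ => new Thread(() => guardedInitialize()))
tasks.foreach(_.start())
tasks.foreach(_.join())
```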
[GitHub] [spark] wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table
wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table
URL: https://github.com/apache/spark/pull/26895#discussion_r358002334

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala

## @@ -132,7 +132,9 @@ class HadoopTableReader(
     val deserializedHadoopRDD = hadoopRDD.mapPartitions { iter =>
       val hconf = broadcastedHadoopConf.value.value
       val deserializer = deserializerClass.getConstructor().newInstance()
-      deserializer.initialize(hconf, localTableDesc.getProperties)
+      DeserializerLock.synchronized {

Review comment:
HIVE-15773 and HIVE-21752 offer two different fixes for the `HCatRecordObjectInspectorFactory` bug. HIVE-21752 has been merged in Hive 4.0.0, which is not released yet. Even when it is, not many users will be on Hive 4.0.0 for some time, so I think we need to deal with it here.
[GitHub] [spark] wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table
wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table
URL: https://github.com/apache/spark/pull/26895#discussion_r358001919

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala

## @@ -252,10 +254,14 @@ class HadoopTableReader(
     partProps.asScala.foreach { case (key, value) => props.setProperty(key, value) }
-    deserializer.initialize(hconf, props)
+    DeserializerLock.synchronized {
+      deserializer.initialize(hconf, props)
+    }
     // get the table deserializer
     val tableSerDe = localTableDesc.getDeserializerClass.getConstructor().newInstance()
-    tableSerDe.initialize(hconf, localTableDesc.getProperties)

Review comment:
Actually, the `tableSerDe` can be instantiated and initialized before the transformation here. Then it will be done just once, instead of repeatedly for each Hive partition. (The same applies to the `Deserializer` for the table in `makeRDDForTable`.) I'll make a change.
[GitHub] [spark] wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table
wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table
URL: https://github.com/apache/spark/pull/26895#discussion_r358000739

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala

## @@ -132,7 +132,9 @@ class HadoopTableReader(
     val deserializedHadoopRDD = hadoopRDD.mapPartitions { iter =>
       val hconf = broadcastedHadoopConf.value.value
       val deserializer = deserializerClass.getConstructor().newInstance()
-      deserializer.initialize(hconf, localTableDesc.getProperties)
+      DeserializerLock.synchronized {

Review comment:
I don't think there will be a lot of contention. We call `deserializer.initialize` once for each partition. The race is actually in `HCatRecordObjectInspectorFactory.getHCatRecordObjectInspector`, which is called by `JsonSerDe#initialize` (the factory uses a `HashMap` instead of, e.g., a `ConcurrentHashMap` to maintain a cache). We want the same `ObjectInspector` to be returned by the factory cache and set in each `JsonSerDe` instance as its `ObjectInspector`. Our objective is to ensure that no more than one task per executor is calling `JsonSerDe#initialize` at the same time.
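The cache structure described above can be modeled in a few lines. This is a simplified, hypothetical sketch, not the real Hive API: `InspectorFactory` plays the role of `HCatRecordObjectInspectorFactory` with its plain `HashMap` cache, and `DeserializerLock` is the per-JVM lock that serializes callers so that every "SerDe" observes the same cached inspector instance for a given type.

```scala
import scala.collection.mutable

// Stand-in for Hive's ObjectInspector.
final class ObjectInspector(val typeString: String)

object InspectorFactory {
  // Plain mutable.HashMap, like the factory's cache: NOT thread-safe on its
  // own, so concurrent lookups could corrupt the map or hand out different
  // inspector instances for the same type.
  private val cache = mutable.HashMap.empty[String, ObjectInspector]

  def getInspector(typeString: String): ObjectInspector =
    cache.getOrElseUpdate(typeString, new ObjectInspector(typeString))
}

// Per-JVM lock object serializing access to the factory.
object DeserializerLock

// Models JsonSerDe#initialize reaching into the factory under the lock.
def initializeSerDe(typeString: String): ObjectInspector =
  DeserializerLock.synchronized {
    InspectorFactory.getInspector(typeString)
  }

// With callers serialized, every initialization of the same type receives
// the one shared, cached inspector instance.
val a = initializeSerDe("struct<id:int,name:string>")
val b = initializeSerDe("struct<id:int,name:string>")
```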
[GitHub] [spark] wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table
wypoon commented on a change in pull request #26895: [SPARK-17398][SQL] Fix ClassCastException when querying partitioned JSON table
URL: https://github.com/apache/spark/pull/26895#discussion_r357948743

## File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala

## @@ -252,10 +254,14 @@ class HadoopTableReader(
     partProps.asScala.foreach { case (key, value) => props.setProperty(key, value) }
-    deserializer.initialize(hconf, props)
+    DeserializerLock.synchronized {
+      deserializer.initialize(hconf, props)
+    }
     // get the table deserializer
     val tableSerDe = localTableDesc.getDeserializerClass.getConstructor().newInstance()
-    tableSerDe.initialize(hconf, localTableDesc.getProperties)

Review comment:
`localTableDesc` is the same as `tableDesc`, and `tableProperties` is `tableDesc.getProperties`, which is used a little above in this same block and might as well be used here.